Safe Haskell | Safe |
---|---|
Language | Haskell2010 |
This module is for working with HTML/XML. It deals with both well-formed XML and malformed HTML from the web. It features:
- A lazy parser, based on the HTML 5 specification - see
parseTags
. - A renderer that can write out HTML/XML - see
renderTags
. - Utilities for extracting information from a document - see
~==
,sections
andpartitions
.
The standard practice is to parse a String
to [
Tag
String
]
using parseTags
,
then operate upon it to extract the necessary information.
- data Tag str
- = TagOpen str [Attribute str]
- | TagClose str
- | TagText str
- | TagComment str
- | TagWarning str
- | TagPosition !Row !Column
- type Row = Int
- type Column = Int
- type Attribute str = (str, str)
- parseTags :: StringLike str => str -> [Tag str]
- parseTagsOptions :: StringLike str => ParseOptions str -> str -> [Tag str]
- data ParseOptions str = ParseOptions {
- optTagPosition :: Bool
- optTagWarning :: Bool
- optEntityData :: (str, Bool) -> [Tag str]
- optEntityAttrib :: (str, Bool) -> (str, [Tag str])
- optTagTextMerge :: Bool
- parseOptions :: StringLike str => ParseOptions str
- parseOptionsFast :: StringLike str => ParseOptions str
- parseOptionsEntities :: StringLike str => (str -> Maybe str) -> ParseOptions str
- renderTags :: StringLike str => [Tag str] -> str
- renderTagsOptions :: StringLike str => RenderOptions str -> [Tag str] -> str
- escapeHTML :: StringLike str => str -> str
- data RenderOptions str = RenderOptions {
- optEscape :: str -> str
- optMinimize :: str -> Bool
- optRawTag :: str -> Bool
- renderOptions :: StringLike str => RenderOptions str
- canonicalizeTags :: StringLike str => [Tag str] -> [Tag str]
- isTagOpen :: Tag str -> Bool
- isTagClose :: Tag str -> Bool
- isTagText :: Tag str -> Bool
- isTagWarning :: Tag str -> Bool
- isTagPosition :: Tag str -> Bool
- isTagOpenName :: Eq str => str -> Tag str -> Bool
- isTagCloseName :: Eq str => str -> Tag str -> Bool
- isTagComment :: Tag str -> Bool
- fromTagText :: Show str => Tag str -> str
- fromAttrib :: (Show str, Eq str, StringLike str) => str -> Tag str -> str
- maybeTagText :: Tag str -> Maybe str
- maybeTagWarning :: Tag str -> Maybe str
- innerText :: StringLike str => [Tag str] -> str
- sections :: (a -> Bool) -> [a] -> [[a]]
- partitions :: (a -> Bool) -> [a] -> [[a]]
- class TagRep a where
- (~==) :: (StringLike str, TagRep t) => Tag str -> t -> Bool
- (~/=) :: (StringLike str, TagRep t) => Tag str -> t -> Bool
Data structures and parsing
A single HTML element. A whole document is represented by a list of Tag
.
There is no requirement for TagOpen
and TagClose
to match.
TagOpen str [Attribute str] | An open tag with |
TagClose str | A closing tag |
TagText str | A text node, guaranteed not to be the empty string |
TagComment str | A comment |
TagWarning str | Meta: A syntax error in the input file |
TagPosition !Row !Column | Meta: The position of a parsed element |
parseTags :: StringLike str => str -> [Tag str] Source #
Parse a string to a list of tags, using an HTML 5 compliant parser.
parseTags "<hello>my&</world>" == [TagOpen "hello" [],TagText "my&",TagClose "world"]
parseTagsOptions :: StringLike str => ParseOptions str -> str -> [Tag str] Source #
Parse a string to a list of tags, using settings supplied by the ParseOptions
parameter,
eg. to output position information:
parseTagsOptions parseOptions{optTagPosition = True} "<hello>my&</world>" == [TagPosition 1 1,TagOpen "hello" [],TagPosition 1 8,TagText "my&",TagPosition 1 15,TagClose "world"]
data ParseOptions str Source #
These options control how parseTags
works. The ParseOptions
type is usually generated by one of
parseOptions
, parseOptionsFast
or parseOptionsEntities
, then selected fields may be overriden.
The options optTagPosition
and optTagWarning
specify whether to generate
TagPosition
or TagWarning
elements respectively. Usually these options should be set to False
to simplify future stages, unless you rely on position information or want to give malformed HTML
messages to the end user.
The options optEntityData
and optEntityAttrib
control how entities, for example
are handled.
Both take a string, and a boolean, where True
indicates that the entity ended with a semi-colon ;
.
Inside normal text optEntityData
will be called, and the results will be inserted in the tag stream.
Inside a tag attribute optEntityAttrib
will be called, and the first component of the result will be used
in the attribute, and the second component will be appended after the TagOpen
value (usually the second
component is []
). As an example, to not decode any entities, pass:
parseOptions {optEntityData=\(str,b) -> [TagText $ "&" ++ str ++ [';' | b]] ,optEntityAttrib\(str,b) -> ("&" ++ str ++ [';' | b], [])
ParseOptions | |
|
parseOptions :: StringLike str => ParseOptions str Source #
The default parse options value, described in ParseOptions
. Equivalent to
.parseOptionsEntities
lookupEntity
parseOptionsFast :: StringLike str => ParseOptions str Source #
A ParseOptions
structure optimised for speed, following the fast options.
parseOptionsEntities :: StringLike str => (str -> Maybe str) -> ParseOptions str Source #
A ParseOptions
structure using a custom function to lookup attributes. Any attribute
that is not found will be left intact, and a TagWarning
given (if optTagWarning
is set).
If you do not want to resolve any entities, simpliy pass const Nothing
for the lookup function.
renderTags :: StringLike str => [Tag str] -> str Source #
Show a list of tags, as they might have been parsed, using the default settings given in
RenderOptions
.
renderTags [TagOpen "hello" [],TagText "my&",TagClose "world"] == "<hello>my&</world>"
renderTagsOptions :: StringLike str => RenderOptions str -> [Tag str] -> str Source #
Show a list of tags using settings supplied by the RenderOptions
parameter,
eg. to avoid escaping any characters one could do:
renderTagsOptions renderOptions{optEscape = id} [TagText "my&"] == "my&"
escapeHTML :: StringLike str => str -> str Source #
Replace the four characters &"<>
with their HTML entities (escapeXML
lifted to StringLike
).
data RenderOptions str Source #
These options control how renderTags
works.
The strange quirk of only minimizing <br>
tags is due to Internet Explorer treating
<br></br>
as <br><br>
.
RenderOptions | |
|
renderOptions :: StringLike str => RenderOptions str Source #
The default render options value, described in RenderOptions
.
canonicalizeTags :: StringLike str => [Tag str] -> [Tag str] Source #
Turns all tag names and attributes to lower case and converts DOCTYPE to upper case.
Tag identification
isTagWarning :: Tag str -> Bool Source #
Test if a Tag
is a TagWarning
isTagPosition :: Tag str -> Bool Source #
Test if a Tag
is a TagPosition
isTagComment :: Tag str -> Bool Source #
Test if a Tag
is a TagComment
Extraction
fromTagText :: Show str => Tag str -> str Source #
fromAttrib :: (Show str, Eq str, StringLike str) => str -> Tag str -> str Source #
Extract an attribute, crashes if not a TagOpen
.
Returns ""
if no attribute present.
Warning: does not distinquish between missing attribute
and present attribute with value ""
.
maybeTagText :: Tag str -> Maybe str Source #
maybeTagWarning :: Tag str -> Maybe str Source #
Extract the string from within TagWarning
, otherwise Nothing
innerText :: StringLike str => [Tag str] -> str Source #
Extract all text content from tags (similar to Verbatim found in HaXml)
Utility
sections :: (a -> Bool) -> [a] -> [[a]] Source #
This function takes a list, and returns all suffixes whose first item matches the predicate.
partitions :: (a -> Bool) -> [a] -> [[a]] Source #
This function is similar to sections
, but splits the list
so no element appears in any two partitions.
Combinators
Define a class to allow String's or Tag str's to be used as matches
(~==) :: (StringLike str, TagRep t) => Tag str -> t -> Bool Source #
Performs an inexact match, the first item should be the thing to match. If the second item is a blank string, that is considered to match anything. For example:
(TagText "test" ~== TagText "" ) == True (TagText "test" ~== TagText "test") == True (TagText "test" ~== TagText "soup") == False
For TagOpen
missing attributes on the right are allowed.