Safe Haskell | Safe |
---|---|
Language | Haskell2010 |
This is a performance-oriented HTML tokenizer aimed at web-crawling applications. It follows the HTML5 parsing specification quite closely, so it behaves reasonably well on ill-formed documents from the open Web.
Synopsis
- parseTokens :: Text -> [Token]
- parseTokensLazy :: Text -> [Token]
- token :: Parser Token
- data Token
- type TagName = Text
- type AttrName = Text
- type AttrValue = Text
- data Attr = Attr !AttrName !AttrValue
- renderTokens :: [Token] -> Text
- renderToken :: Token -> Text
- renderAttrs :: [Attr] -> Text
- renderAttr :: Attr -> Text
- canonicalizeTokens :: [Token] -> [Token]
Parsing
Types
An HTML token
TagOpen !TagName [Attr] | An opening tag. Attribute ordering is arbitrary. |
TagSelfClose !TagName [Attr] | A self-closing tag. |
TagClose !TagName | A closing tag. |
ContentText !Text | The content between tags. |
ContentChar !Char | A single character of content. |
Comment !Builder | Contents of a comment. |
Doctype !Text | A doctype declaration. |
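As a rough illustration of how these constructors might be consumed in a crawler, the sketch below collects link targets from a token stream by pattern matching. It assumes the module is importable as `Text.HTML.Parser` from the `html-parse` package (this page does not name the module) and that `parseTokens` takes strict `Text`. The helper `links` is hypothetical and not part of the library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Text        (Text)
import qualified Data.Text.IO     as TIO
-- Assumption: module name and export list as in the html-parse package.
import           Text.HTML.Parser (Attr (..), Token (..), parseTokens)

-- | Hypothetical helper: the href values of all opening <a> tags.
-- Tokens and attributes that do not match the patterns are skipped.
links :: [Token] -> [Text]
links toks = [ v | TagOpen "a" attrs <- toks, Attr "href" v <- attrs ]

main :: IO ()
main = mapM_ TIO.putStrLn (links (parseTokens html))
  where
    html = "<p><a href=\"/one\">1</a> <a href='/two'>2</a></p>"
```

Since attribute ordering is arbitrary, the sketch searches the whole attribute list rather than relying on a position.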
Instances
Rendering, text canonicalization
renderTokens :: [Token] -> Text
See renderToken.
renderAttrs :: [Attr] -> Text
See renderAttr.
renderAttr :: Attr -> Text
Does not escape quotation marks in attribute values!
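Because quotation marks are not escaped, callers that render untrusted attribute values may want to escape them first. The following is a minimal sketch of one way to do that, not the library's own code. It uses a local stand-in for `Attr`, assumes a `name="value"` output form with double quotes, and introduces the hypothetical helpers `escapeAttrValue` and `renderAttrEscaped`.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Text    (Text)
import qualified Data.Text    as T
import qualified Data.Text.IO as TIO

-- Local stand-in mirroring the Attr type documented above (assumption:
-- real code would import it from the library instead).
data Attr = Attr !Text !Text

-- | Hypothetical helper: escape '&' first, then '"', so that the
-- ampersand introduced by "&quot;" is not escaped a second time.
escapeAttrValue :: Text -> Text
escapeAttrValue = T.replace "\"" "&quot;" . T.replace "&" "&amp;"

-- | Hypothetical escaping renderer producing name="value".
renderAttrEscaped :: Attr -> Text
renderAttrEscaped (Attr name value) =
  T.concat [name, "=\"", escapeAttrValue value, "\""]

main :: IO ()
main = TIO.putStrLn (renderAttrEscaped (Attr "title" "say \"hi\" & bye"))
-- prints: title="say &quot;hi&quot; &amp; bye"
```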
canonicalizeTokens :: [Token] -> [Token]
Melds neighboring ContentChar and ContentText constructors together and drops empty text elements.
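The behavior described here can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's source: it works on a reduced local copy of `Token` (only four constructors), and the name `canonicalize` is hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Text (Text)
import qualified Data.Text as T

-- Reduced local stand-ins for the documented types (assumption: the
-- real Token has more constructors; these suffice for the sketch).
data Attr = Attr !Text !Text deriving (Eq, Show)

data Token
  = TagOpen !Text [Attr]
  | TagClose !Text
  | ContentText !Text
  | ContentChar !Char
  deriving (Eq, Show)

-- | Merge runs of adjacent text tokens, then drop empty text tokens.
canonicalize :: [Token] -> [Token]
canonicalize = filter (not . emptyText) . meld
  where
    -- A leading ContentChar is first widened to a ContentText.
    meld (ContentChar c : rest)                 = meld (ContentText (T.singleton c) : rest)
    -- Adjacent text tokens are concatenated into one.
    meld (ContentText a : ContentChar c : rest) = meld (ContentText (T.snoc a c) : rest)
    meld (ContentText a : ContentText b : rest) = meld (ContentText (a <> b) : rest)
    -- Anything else is emitted unchanged.
    meld (t : rest)                             = t : meld rest
    meld []                                     = []

    emptyText (ContentText t) = T.null t
    emptyText _               = False

main :: IO ()
main = print (canonicalize
  [ TagOpen "p" [], ContentChar 'H', ContentChar 'i'
  , ContentText "", ContentText "!", TagClose "p" ])
-- prints: [TagOpen "p" [],ContentText "Hi!",TagClose "p"]
```

Repeated appends make this sketch quadratic on long runs of single characters. Collecting each run and concatenating it once would be a natural refinement.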