mangrove-0.1.0.0: A parser for web documents according to the HTML5 specification.
Copyright    (c) 2020 Sam May
License      MPL-2.0
Maintainer   ag.eitilt@gmail.com
Stability    provisional
Portability  portable
Safe Haskell Safe-Inferred
Language     Haskell98

Web.Mangrove.Parse.Tokenize

Description

This module and the internal branch it heads implement the Tokenization section of the HTML document parsing specification, processing a stream of text to annotate it with, or group it by, semantic category. This allows the following stage to base its logic on higher-level concepts such as "markup tag" or "comment" without worrying about the (sometimes complex) escaping behaviour required to parse them.

Types

Final

data Token Source #

The smallest segment of data which carries semantic meaning.

Constructors

Doctype DoctypeParams

HTML: DOCTYPE token

DocumentType, describing the language used in the document.

StartTag TagParams

HTML: start tag token

Element, marking the start of a markup section, or a point of markup which (per the specification) doesn't contain any content.

EndTag TagParams

HTML: end tag token

Element with a / character before the name, marking the end of a section opened by StartTag.

Comment Text

HTML: comment token

Comment, marking author's notes or other text about the document itself, rather than being part of the content.

Character Char

HTML: character token

Character, usually containing (a small portion of) text which should be rendered for the user or included in the header metadata, but occasionally subject to further processing (e.g. the content of <script> or <style> sections).

EndOfStream

HTML: end-of-file token

Serves both as an explicit mark of the end of the stream, for when a simple [] doesn't suffice, and as a carrier for ParseErrors when no other token is emitted at the same time.

Note that the former role carries no guarantees: a stream may end without an EndOfStream termination, and EndOfStream tokens may occur in places other than the end of the file.
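
For illustration only, a fragment such as "<b>Hi!</b>" would be expected to tokenize to something along these lines (eliding parse errors):

    [ StartTag emptyTagParams { tagName = "b" }
    , Character 'H'
    , Character 'i'
    , Character '!'
    , EndTag emptyTagParams { tagName = "b" }
    , EndOfStream
    ]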

Instances

Eq Token Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  (==) :: Token -> Token -> Bool #
  (/=) :: Token -> Token -> Bool #

Read Token Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Show Token Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  showsPrec :: Int -> Token -> ShowS #
  show :: Token -> String #
  showList :: [Token] -> ShowS #

type BasicAttribute = (AttributeName, AttributeValue) #

A simple key-value representation of an attribute on an HTML tag, before any namespace processing.

data TagParams Source #

HTML: the data associated with a start tag or an end tag token

All data comprising a markup tag which may be obtained directly from the raw document stream. Values may be easily instantiated as updates to emptyTagParams.

Constructors

TagParams 

Fields

  • tagName :: ElementName

    The primary identifier of the markup tag, defining its behaviour during rendering, and providing a means of matching opening tags with closing ones.

  • tagIsSelfClosing :: Bool

Whether the tag was closed at the same point it was opened, using the XML-style "/>" syntax. HTML void elements are handled in the tree construction stage instead.

  • tagAttributes :: HashMap Text Text

    Finer-grained metadata attached to the markup tag.

emptyTagParams :: TagParams Source #

A sane default collection for easy record initialization.
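
As a sketch of that record-update style, a start tag for a hypothetical <a href="https://example.com"> link could be built as below (assuming OverloadedStrings, and that ElementName and the attribute types are Text synonyms):

    import qualified Data.HashMap.Strict as HashMap

    linkTag :: TagParams
    linkTag = emptyTagParams
        { tagName = "a"
        , tagAttributes = HashMap.fromList [("href", "https://example.com")]
        }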

data DoctypeParams Source #

HTML: the data associated with a doctype token

All data comprising a document type declaration which may be obtained directly from the raw document stream. Values may be easily instantiated as updates to emptyDoctypeParams.

Constructors

DoctypeParams 

Fields

  • doctypeName :: Maybe Text

    The root element of the document, which may also identify the primary language used.

  • doctypePublicId :: Maybe Text

    A globally-unique reference to the definition of the language.

  • doctypeSystemId :: Maybe Text

A system-dependent (but perhaps easier to access) reference to the definition of the language.

  • doctypeQuirks :: Bool

Whether the document should be read and rendered in a backwards-compatible manner, even if the other data in the token would match that expected by the specification. Note that a False value is still subject to those expectations; this field just provides an override in the case of, for example, a malformed declaration.

emptyDoctypeParams :: DoctypeParams Source #

A sane default collection for easy record initialization; namely, Nothings and False.
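
For example, the modern <!DOCTYPE html> declaration would correspond to a single field changed from the defaults (assuming OverloadedStrings):

    html5Doctype :: DoctypeParams
    html5Doctype = emptyDoctypeParams
        { doctypeName = Just "html"
        }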

Intermediate

data TokenizerState Source #

The collection of data required to extract a list of semantic atoms from a binary document stream. Values may be easily instantiated as updates to defaultTokenizerState.

data CurrentTokenizerState Source #

The various fixed points in the tokenization algorithm, where the parser may break and re-enter seamlessly.

Constructors

DataState

HTML: data state

The core rules, providing the most common tokenization behaviour.

RCDataState

HTML: RCDATA state

Character-focused production which, unlike RawTextState, resolves character reference values.

RawTextState

HTML: RAWTEXT state

Character-focused production which, unlike RCDataState, passes character reference sequences unchanged.

PlainTextState

HTML: PLAINTEXT state

Blind conversion of the entire document stream into Character tokens.

ScriptDataState

HTML: script data state

Character-focused production according to the (occasionally complex) rules governing the handling of <script> contents.

ScriptDataEscapedState

HTML: script data escaped state

Character-focused production for data within a <!-- / --> section within ScriptDataState.

ScriptDataDoubleEscapedState

HTML: script data double escaped state

Character-focused production for data within a <script> section within ScriptDataEscapedState.

CDataState

HTML: CDATA section state

Character-focused production for data within a foreign <![CDATA[ / ]]> escape section.

Instances

Bounded CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Enum CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Eq CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Ord CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Read CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Show CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

data Encoding #

Encoding: encoding

All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (and for all encoding purposes), but the others are retained for compatibility with existing pages.

Note that all of these are, to one degree or another, partial functions, and that no guarantee is made that the mapping round-trips.

Constructors

Utf8

The UTF-8 encoding for Unicode.

Utf16be

The UTF-16 encoding for Unicode, in big endian order.

No encoder is provided for this scheme.

Utf16le

The UTF-16 encoding for Unicode, in little endian order.

No encoder is provided for this scheme.

Big5

Big5, primarily covering traditional Chinese characters.

EucJp

EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212.

EucKr

EUC-KR, primarily covering Hangul.

Gb18030

The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters.

Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization.

Gbk

GBK, primarily covering simplified Chinese characters.

In practice, this is just Gb18030 with a restricted set of encodable characters; the decoder is identical.

Ibm866

DOS and OS/2 code page for Cyrillic characters.

Iso2022Jp

A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana.

Iso8859_2

Latin-2 (Central European).

Iso8859_3

Latin-3 (South European and Esperanto).

Iso8859_4

Latin-4 (North European).

Iso8859_5

Latin/Cyrillic.

Iso8859_6

Latin/Arabic.

Iso8859_7

Latin/Greek (modern monotonic).

Iso8859_8

Latin/Hebrew (visual order).

Iso8859_8i

Latin/Hebrew (logical order).

Iso8859_10

Latin-6 (Nordic).

Iso8859_13

Latin-7 (Baltic Rim).

Iso8859_14

Latin-8 (Celtic).

Iso8859_15

Latin-9 (revision of ISO 8859-1 Latin-1, Western European).

Iso8859_16

Latin-10 (South-Eastern European).

Koi8R

KOI-8 specialized for Russian Cyrillic.

Koi8U

KOI-8 specialized for Ukrainian Cyrillic.

Macintosh

Mac OS Roman.

MacintoshCyrillic

Mac OS Cyrillic (as of Mac OS 9.0).

ShiftJis

The Windows variant (code page 932) of Shift JIS.

Windows874

ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots.

Note that this encoding is always used instead of pure Latin/Thai.

Windows1250

The Windows extension and rearrangement of ISO 8859-2 Latin-2.

Windows1251

Windows Cyrillic.

Windows1252

The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs.

Note that this encoding is always used instead of pure Latin-1.

Windows1253

Windows Greek (modern monotonic).

Windows1254

The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs.

Note that this encoding is always used instead of pure Latin-5.

Windows1255

The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew.

Windows1256

Windows Arabic.

Windows1257

Windows Baltic.

Windows1258

Windows Vietnamese.

Replacement

The input is reduced to a single \xFFFD replacement character.

No encoder is provided for this scheme.

UserDefined

Non-ASCII bytes (\x80 through \xFF) are mapped to a portion of the Unicode Private Use Area (\xF780 through \xF7FF).

Instances

Bounded Encoding
  Defined in Web.Willow.Common.Encoding.Common

Enum Encoding
  Defined in Web.Willow.Common.Encoding.Common

Eq Encoding
  Defined in Web.Willow.Common.Encoding.Common

Ord Encoding
  Defined in Web.Willow.Common.Encoding.Common

Read Encoding
  Defined in Web.Willow.Common.Encoding.Common

Show Encoding
  Defined in Web.Willow.Common.Encoding.Common

Hashable Encoding
  Defined in Web.Willow.Common.Encoding.Common
  hashWithSalt :: Int -> Encoding -> Int #
  hash :: Encoding -> Int #

Initialization

defaultTokenizerState :: TokenizerState Source #

A sane default collection for easy record initialization; namely, interpret the binary stream as Utf8 in the primary DataState.

tokenizerMode :: CurrentTokenizerState -> TokenizerState -> TokenizerState Source #

Specify which section of the finite state machine describing the tokenization algorithm should be active.
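
As a sketch, resuming tokenization as if inside a <style> element, whose content is parsed as raw text:

    styleState :: TokenizerState
    styleState = tokenizerMode RawTextState defaultTokenizerState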

tokenizerStartTag :: Maybe Namespace -> ElementName -> TokenizerState -> TokenizerState Source #

Specify the data to use as the previous tag emitted by the tokenizer. This only needs to be called when required by external algorithms or constructions; the parser updates it automatically for generated StartTag tokens.
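
A sketch, pretending a <textarea> start tag had just been seen before switching into RCDataState (assuming Namespace and ElementName are Text synonyms; the HTML namespace URI here is illustrative):

    textareaState :: TokenizerState
    textareaState
        = tokenizerStartTag (Just "http://www.w3.org/1999/xhtml") "textarea"
        $ tokenizerMode RCDataState defaultTokenizerState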

tokenizerEncoding :: Either SnifferEnvironment (Maybe Encoding) -> TokenizerState -> TokenizerState Source #

Specify the encoding scheme used by a given parse environment to read from the binary input stream. Note that this will always use the initial state for the respective decoder; intermediate states as returned by decodeStep are not supported.
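
A sketch, skipping encoding sniffing entirely and forcing the stream to be decoded as Windows-1252:

    latin1State :: TokenizerState
    latin1State = tokenizerEncoding (Right $ Just Windows1252) defaultTokenizerState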

Transformations

tokenize :: TokenizerState -> ByteString -> ([([ParseError], Token)], TokenizerState) Source #

HTML: tokenization

Given a starting environment, transform a binary document stream into a stream of semantic atoms. If the parse fails, returns all tokens before the one which caused the error, but any trailing bytes are silently dropped.
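
A minimal sketch of a one-shot parse over a complete, in-memory document (assuming OverloadedStrings for the ByteString literal):

    main :: IO ()
    main = do
        let (tokens, _state) = tokenize defaultTokenizerState "<p>Hello!</p>"
        -- Each token is paired with the errors encountered while parsing it.
        mapM_ (print . snd) tokens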

tokenizeStep :: TokenizerState -> ByteString -> ([([ParseError], Token)], TokenizerState, ByteString) Source #

Parse a minimal number of bytes from an input stream into a sequence of semantic tokens. Returns all data required to seamlessly resume parsing.
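
A sketch of resuming across two chunks: bytes the first step couldn't yet consume are prepended to the next chunk (assuming OverloadedStrings; the split point is arbitrary):

    twoChunks :: [([ParseError], Token)]
    twoChunks = tokens <> tokens'
      where
        (tokens,  state, rest) = tokenizeStep defaultTokenizerState "<p>Hel"
        (tokens', _,     _)    = tokenizeStep state (rest <> "lo</p>")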

finalizeTokenizer :: TokenizerState -> [([ParseError], Token)] Source #

Explicitly indicate that the input stream will not contain any further bytes, and perform any finalization processing based on that.
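
A sketch of terminating an incremental parse: once the last chunk has been fed through tokenizeStep, flush anything still buffered in the state (assuming OverloadedStrings):

    allTokens :: [([ParseError], Token)]
    allTokens = tokens <> finalizeTokenizer state
      where
        -- In a real driver, any _rest bytes would be fed back through
        -- tokenizeStep before finalizing.
        (tokens, state, _rest) = tokenizeStep defaultTokenizerState "<p>Hello!</p>"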