mangrove-0.1.0.0: A parser for web documents according to the HTML5 specification.
Copyright    (c) 2020 Sam May
License      MPL-2.0
Maintainer   ag.eitilt@gmail.com
Stability    provisional
Portability  portable
Safe Haskell Safe-Inferred
Language     Haskell98

Web.Mangrove.Parse.Tokenize

Description

This module and the internal branch it heads implement the Tokenization section of the HTML document parsing specification, processing a stream of text to annotate it with, or group it by, semantic category. This allows the following stage to base its logic on higher-level concepts such as "markup tag" or "comment" without worrying about the (sometimes complex) escaping behaviour required to parse them.

Types

Final

data Token Source #

The smallest segment of data which carries semantic meaning.

Constructors

Doctype DoctypeParams

HTML: DOCTYPE token

DocumentType, describing the language used in the document.

StartTag TagParams

HTML: start tag token

Element, marking the start of a markup section, or a point of markup which (per the specification) doesn't contain any content.

EndTag TagParams

HTML: end tag token

Element with a / character before the name, marking the end of a section opened by StartTag.

Comment Text

HTML: comment token

Comment, marking author's notes or other text about the document itself, rather than being part of the content.

Character Char

HTML: character token

Character, usually containing (a small portion of) text which should be rendered for the user or included in the header metadata, but occasionally subject to further processing (e.g. the content of <script> or <style> sections).

EndOfStream

HTML: end-of-file token

Serves both as an explicit mark of the end of the stream, for when a simple [] doesn't suffice, and as a carrier for ParseErrors when no other token is emitted at the same time.

Note that the former role carries no guarantees: a stream may end without an EndOfStream termination, and EndOfStream tokens may occur in places other than the end of the file.
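
For illustration only, a fragment such as "<b>Hi!</b>" would be expected to tokenize to something along these lines (eliding parse errors):

    [ StartTag emptyTagParams { tagName = "b" }
    , Character 'H'
    , Character 'i'
    , Character '!'
    , EndTag emptyTagParams { tagName = "b" }
    , EndOfStream
    ]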

Instances

Eq Token Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  (==) :: Token -> Token -> Bool #
  (/=) :: Token -> Token -> Bool #

Read Token Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Show Token Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  showsPrec :: Int -> Token -> ShowS #
  show :: Token -> String #
  showList :: [Token] -> ShowS #

type BasicAttribute = (AttributeName, AttributeValue) #

A simple key-value representation of an attribute on an HTML tag, before any namespace processing.

data TagParams Source #

HTML: the data associated with a start tag or an end tag token

All data comprising a markup tag which may be obtained directly from the raw document stream. Values may be easily instantiated as updates to emptyTagParams.

Constructors

TagParams 

Fields

  • tagName :: ElementName

    The primary identifier of the markup tag, defining its behaviour during rendering, and providing a means of matching opening tags with closing ones.

  • tagIsSelfClosing :: Bool

Whether the tag was closed at the same point it was opened, using the XML-style "/>" syntax. HTML void elements are handled in the tree construction stage instead.

  • tagAttributes :: HashMap Text Text

    Finer-grained metadata attached to the markup tag.

emptyTagParams :: TagParams Source #

A sane default collection for easy record initialization.
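
As a sketch of that record-update style, a start tag for a hypothetical <a href="https://example.com"> link could be built as below (assuming OverloadedStrings, and that ElementName and the attribute types are Text synonyms):

    import qualified Data.HashMap.Strict as HashMap

    linkTag :: TagParams
    linkTag = emptyTagParams
        { tagName = "a"
        , tagAttributes = HashMap.fromList [("href", "https://example.com")]
        }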

data DoctypeParams Source #

HTML: the data associated with a doctype token

All data comprising a document type declaration which may be obtained directly from the raw document stream. Values may be easily instantiated as updates to emptyDoctypeParams.

Constructors

DoctypeParams 

Fields

  • doctypeName :: Maybe Text

    The root element of the document, which may also identify the primary language used.

  • doctypePublicId :: Maybe Text

    A globally-unique reference to the definition of the language.

  • doctypeSystemId :: Maybe Text

A system-dependent (but perhaps easier to access) reference to the definition of the language.

  • doctypeQuirks :: Bool

Whether the document should be read and rendered in a backwards-compatible manner, even if the other data in the token would match that expected by the specification. Note that a False value is still subject to those expectations; this field just provides an override in the case of, for example, a malformed declaration.

emptyDoctypeParams :: DoctypeParams Source #

A sane default collection for easy record initialization; namely, Nothings and False.
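
For example, the modern <!DOCTYPE html> declaration would correspond to a single field changed from the defaults (assuming OverloadedStrings):

    html5Doctype :: DoctypeParams
    html5Doctype = emptyDoctypeParams
        { doctypeName = Just "html"
        }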

Intermediate

data TokenizerState Source #

The collection of data required to extract a list of semantic atoms from a binary document stream. Values may be easily instantiated as updates to defaultTokenizerState.

data CurrentTokenizerState Source #

The various fixed points in the tokenization algorithm, where the parser may break and re-enter seamlessly.

Constructors

DataState

HTML: data state

The core rules, providing the most common tokenization behaviour.

RCDataState

HTML: RCDATA state

Character-focused production which, unlike RawTextState, resolves character reference values.

RawTextState

HTML: RAWTEXT state

Character-focused production which, unlike RCDataState, passes character reference sequences unchanged.

PlainTextState

HTML: PLAINTEXT state

Blind conversion of the entire document stream into Character tokens.

ScriptDataState

HTML: script data state

Character-focused production according to the (occasionally complex) rules governing the handling of <script> contents.

ScriptDataEscapedState

HTML: script data escaped state

Character-focused production for data within a <!-- / --> section within ScriptDataState.

ScriptDataDoubleEscapedState

HTML: script data double escaped state

Character-focused production for data within a <script> section within ScriptDataEscapedState.

CDataState

HTML: CDATA section state

Character-focused production for data within a foreign <![CDATA[ / ]]> escape section.

Instances

Bounded CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Enum CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Eq CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Ord CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Read CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

Show CurrentTokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common

data Encoding #

Encoding: encoding

All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (and for all encoding purposes), but the others are retained for compatibility with existing pages.

Note that all of these are, to one degree or another, partial functions, and that no guarantee is made that the mapping round-trips.

Constructors

Utf8

The UTF-8 encoding for Unicode.

Utf16be

The UTF-16 encoding for Unicode, in big endian order.

No encoder is provided for this scheme.

Utf16le

The UTF-16 encoding for Unicode, in little endian order.

No encoder is provided for this scheme.

Big5

Big5, primarily covering traditional Chinese characters.

EucJp

EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212.

EucKr

EUC-KR, primarily covering Hangul.

Gb18030

The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters.

Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization.

Gbk

GBK, primarily covering simplified Chinese characters.

In practice, this is just Gb18030 with a restricted set of encodable characters; the decoder is identical.

Ibm866

DOS and OS/2 code page for Cyrillic characters.

Iso2022Jp

A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana.

Iso8859_2

Latin-2 (Central European).

Iso8859_3

Latin-3 (South European and Esperanto).

Iso8859_4

Latin-4 (North European).

Iso8859_5

Latin/Cyrillic.

Iso8859_6

Latin/Arabic.

Iso8859_7

Latin/Greek (modern monotonic).

Iso8859_8

Latin/Hebrew (visual order).

Iso8859_8i

Latin/Hebrew (logical order).

Iso8859_10

Latin-6 (Nordic).

Iso8859_13

Latin-7 (Baltic Rim).

Iso8859_14

Latin-8 (Celtic).

Iso8859_15

Latin-9 (revision of ISO 8859-1 Latin-1, Western European).

Iso8859_16

Latin-10 (South-Eastern European).

Koi8R

KOI-8 specialized for Russian Cyrillic.

Koi8U

KOI-8 specialized for Ukrainian Cyrillic.

Macintosh

Mac OS Roman.

MacintoshCyrillic

Mac OS Cyrillic (as of Mac OS 9.0).

ShiftJis

The Windows variant (code page 932) of Shift JIS.

Windows874

ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots.

Note that this encoding is always used instead of pure Latin/Thai.

Windows1250

The Windows extension and rearrangement of ISO 8859-2 Latin-2.

Windows1251

Windows Cyrillic.

Windows1252

The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs.

Note that this encoding is always used instead of pure Latin-1.

Windows1253

Windows Greek (modern monotonic).

Windows1254

The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs.

Note that this encoding is always used instead of pure Latin-5.

Windows1255

The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew.

Windows1256

Windows Arabic.

Windows1257

Windows Baltic.

Windows1258

Windows Vietnamese.

Replacement

The input is reduced to a single \xFFFD replacement character.

No encoder is provided for this scheme.

UserDefined

Non-ASCII bytes (\x80 through \xFF) are mapped to a portion of the Unicode Private Use Area (\xF780 through \xF7FF).

Instances

Bounded Encoding
  Defined in Web.Willow.Common.Encoding.Common

Enum Encoding
  Defined in Web.Willow.Common.Encoding.Common

Eq Encoding
  Defined in Web.Willow.Common.Encoding.Common

Ord Encoding
  Defined in Web.Willow.Common.Encoding.Common

Read Encoding
  Defined in Web.Willow.Common.Encoding.Common

Show Encoding
  Defined in Web.Willow.Common.Encoding.Common

Hashable Encoding
  Defined in Web.Willow.Common.Encoding.Common
  hashWithSalt :: Int -> Encoding -> Int #
  hash :: Encoding -> Int #

Initialization

defaultTokenizerState :: TokenizerState Source #

A sane default collection for easy record initialization; namely, interpret the binary stream as Utf8 in the primary DataState.

tokenizerMode :: CurrentTokenizerState -> TokenizerState -> TokenizerState Source #

Specify which section of the finite state machine describing the tokenization algorithm should be active.
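
As a sketch, resuming tokenization as if inside a <style> element, whose content is parsed as raw text:

    styleState :: TokenizerState
    styleState = tokenizerMode RawTextState defaultTokenizerState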

tokenizerStartTag :: Maybe Namespace -> ElementName -> TokenizerState -> TokenizerState Source #

Specify the data to use as the previous tag emitted by the tokenizer. This only needs to be called when required by external algorithms or constructions; the parser updates it automatically for generated StartTag tokens.
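
A sketch, pretending a <textarea> start tag had just been seen before switching into RCDataState (assuming Namespace and ElementName are Text synonyms; the HTML namespace URI here is illustrative):

    textareaState :: TokenizerState
    textareaState
        = tokenizerStartTag (Just "http://www.w3.org/1999/xhtml") "textarea"
        $ tokenizerMode RCDataState defaultTokenizerState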

tokenizerEncoding :: Either SnifferEnvironment (Maybe Encoding) -> TokenizerState -> TokenizerState Source #

Specify the encoding scheme used by a given parse environment to read from the binary input stream. Note that this will always use the initial state for the respective decoder; intermediate states as returned by decodeStep are not supported.
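
A sketch, skipping encoding sniffing entirely and forcing the stream to be decoded as Windows-1252:

    latin1State :: TokenizerState
    latin1State = tokenizerEncoding (Right $ Just Windows1252) defaultTokenizerState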

Transformations

tokenize :: TokenizerState -> ByteString -> ([([ParseError], Token)], TokenizerState) Source #

HTML: tokenization

Given a starting environment, transform a binary document stream into a stream of semantic atoms. If the parse fails, returns all tokens before the one which caused the error, but any trailing bytes are silently dropped.
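
A minimal sketch of a one-shot parse over a complete, in-memory document (assuming OverloadedStrings for the ByteString literal):

    main :: IO ()
    main = do
        let (tokens, _state) = tokenize defaultTokenizerState "<p>Hello!</p>"
        -- Each token is paired with the errors encountered while parsing it.
        mapM_ (print . snd) tokens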

tokenizeStep :: TokenizerState -> ByteString -> ([([ParseError], Token)], TokenizerState, ByteString) Source #

Parse a minimal number of bytes from an input stream into a sequence of semantic tokens. Returns all data required to seamlessly resume parsing.
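
A sketch of resuming across two chunks: bytes the first step couldn't yet consume are prepended to the next chunk (assuming OverloadedStrings; the split point is arbitrary):

    twoChunks :: [([ParseError], Token)]
    twoChunks = tokens <> tokens'
      where
        (tokens,  state, rest) = tokenizeStep defaultTokenizerState "<p>Hel"
        (tokens', _,     _)    = tokenizeStep state (rest <> "lo</p>")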

finalizeTokenizer :: TokenizerState -> [([ParseError], Token)] Source #

Explicitly indicate that the input stream will not contain any further bytes, and perform any finalization processing based on that.
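
A sketch of terminating an incremental parse: once the last chunk has been fed through tokenizeStep, flush anything still buffered in the state (assuming OverloadedStrings):

    allTokens :: [([ParseError], Token)]
    allTokens = tokens <> finalizeTokenizer state
      where
        -- In a real driver, any _rest bytes would be fed back through
        -- tokenizeStep before finalizing.
        (tokens, state, _rest) = tokenizeStep defaultTokenizerState "<p>Hello!</p>"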