| Copyright | (c) 2020 Sam May |
|---|---|
| License | MPL-2.0 |
| Maintainer | ag.eitilt@gmail.com |
| Stability | provisional |
| Portability | portable |
| Safe Haskell | Safe-Inferred |
| Language | Haskell98 |
This module and the internal branch it heads implement the Tokenization section of the HTML document parsing specification, processing a stream of text to add information on, or group it by, semantic category. This then allows the following stage to base its logic on such higher-level concepts as "markup tag" or "comment" without worrying about the (sometimes complex) escaping behaviour required to parse them.
Synopsis
- data Token
- type BasicAttribute = (AttributeName, AttributeValue)
- data TagParams = TagParams {}
- emptyTagParams :: TagParams
- data DoctypeParams = DoctypeParams {}
- emptyDoctypeParams :: DoctypeParams
- data TokenizerState
- data CurrentTokenizerState
- data Encoding
- = Utf8
- | Utf16be
- | Utf16le
- | Big5
- | EucJp
- | EucKr
- | Gb18030
- | Gbk
- | Ibm866
- | Iso2022Jp
- | Iso8859_2
- | Iso8859_3
- | Iso8859_4
- | Iso8859_5
- | Iso8859_6
- | Iso8859_7
- | Iso8859_8
- | Iso8859_8i
- | Iso8859_10
- | Iso8859_13
- | Iso8859_14
- | Iso8859_15
- | Iso8859_16
- | Koi8R
- | Koi8U
- | Macintosh
- | MacintoshCyrillic
- | ShiftJis
- | Windows874
- | Windows1250
- | Windows1251
- | Windows1252
- | Windows1253
- | Windows1254
- | Windows1255
- | Windows1256
- | Windows1257
- | Windows1258
- | Replacement
- | UserDefined
- defaultTokenizerState :: TokenizerState
- tokenizerMode :: CurrentTokenizerState -> TokenizerState -> TokenizerState
- tokenizerStartTag :: Maybe Namespace -> ElementName -> TokenizerState -> TokenizerState
- tokenizerEncoding :: Either SnifferEnvironment (Maybe Encoding) -> TokenizerState -> TokenizerState
- tokenize :: TokenizerState -> ByteString -> ([([ParseError], Token)], TokenizerState)
- tokenizeStep :: TokenizerState -> ByteString -> ([([ParseError], Token)], TokenizerState, ByteString)
- finalizeTokenizer :: TokenizerState -> [([ParseError], Token)]
Types
Final
data Token Source #
The smallest segment of data which carries semantic meaning.
Constructors
Doctype DoctypeParams | HTML: DOCTYPE token |
StartTag TagParams | HTML: start tag token |
EndTag TagParams | HTML: end tag token |
Comment Text | HTML: comment token |
Character Char | HTML: character token |
EndOfStream | HTML: end-of-file token. Represents an explicit mark of the end of the stream. Note that this role carries no guarantees; a stream can end without an EndOfStream token having been emitted. |
type BasicAttribute = (AttributeName, AttributeValue) #
A simple key-value representation of an attribute on an HTML tag, before any namespace processing.
data TagParams Source #
HTML: the data associated with a start tag or an end tag token
All data comprising a markup tag which may be obtained directly from the raw document stream. Values may be easily instantiated as updates to emptyTagParams.
Constructors
TagParams
emptyTagParams :: TagParams Source #
A sane default collection for easy record initialization.
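For instance, a start-tag token could be built as a record update of this default. The field names below are illustrative assumptions only, since the actual accessors are not shown on this page; substitute the real TagParams fields:

```haskell
-- A sketch only: 'tagName' and 'tagAttributes' are assumed accessor
-- names, not confirmed by this documentation.
scriptTag :: TagParams
scriptTag = emptyTagParams
    { tagName = "script"
    , tagAttributes = [("src", "main.js")]
    }
```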
data DoctypeParams Source #
HTML: the data associated with a doctype token
All data comprising a document type declaration which may be obtained directly from the raw document stream. Values may be easily instantiated as updates to emptyDoctypeParams.
Constructors
DoctypeParams
Instances
Eq DoctypeParams Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  (==) :: DoctypeParams -> DoctypeParams -> Bool
  (/=) :: DoctypeParams -> DoctypeParams -> Bool
Read DoctypeParams Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  readsPrec :: Int -> ReadS DoctypeParams
  readList :: ReadS [DoctypeParams]
Show DoctypeParams Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  showsPrec :: Int -> DoctypeParams -> ShowS
  show :: DoctypeParams -> String
  showList :: [DoctypeParams] -> ShowS
Intermediate
data TokenizerState Source #
The collection of data required to extract a list of semantic atoms from a binary document stream. Values may be easily instantiated as updates to defaultTokenizerState.
Instances
Eq TokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  (==) :: TokenizerState -> TokenizerState -> Bool
  (/=) :: TokenizerState -> TokenizerState -> Bool
Read TokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  readsPrec :: Int -> ReadS TokenizerState
  readList :: ReadS [TokenizerState]
Show TokenizerState Source #
  Defined in Web.Mangrove.Parse.Tokenize.Common
  showsPrec :: Int -> TokenizerState -> ShowS
  show :: TokenizerState -> String
  showList :: [TokenizerState] -> ShowS
data CurrentTokenizerState Source #
The various fixed points in the tokenization algorithm, where the parser may break and re-enter seamlessly.
Constructors
DataState | HTML: data state. The core rules, providing the most common tokenization behaviour. |
RCDataState | HTML: RCDATA state |
RawTextState | HTML: RAWTEXT state |
PlainTextState | HTML: PLAINTEXT state. Blind conversion of the entire remaining document stream into Character tokens. |
ScriptDataState | HTML: script data state |
ScriptDataEscapedState | HTML: script data escaped state |
ScriptDataDoubleEscapedState | HTML: script data double escaped state |
CDataState | HTML: CDATA section state |
Instances
Encoding
data Encoding Source #
HTML: encoding
All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (including all encoding purposes), but the others are retained for compatibility with existing pages.
Note that none of these mappings is a complete function in either direction; each fails on some inputs, and no guarantee is made that the mapping round-trips.
Utf8 | The UTF-8 encoding for Unicode. |
Utf16be | The UTF-16 encoding for Unicode, in big endian order. No encoder is provided for this scheme. |
Utf16le | The UTF-16 encoding for Unicode, in little endian order. No encoder is provided for this scheme. |
Big5 | Big5, primarily covering traditional Chinese characters. |
EucJp | EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212. |
EucKr | EUC-KR, primarily covering Hangul. |
Gb18030 | The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters. Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization. |
Gbk | GBK, primarily covering simplified Chinese characters. In practice, its decoder is shared with Gb18030. |
Ibm866 | DOS and OS/2 code page for Cyrillic characters. |
Iso2022Jp | A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana. |
Iso8859_2 | Latin-2 (Central European). |
Iso8859_3 | Latin-3 (South European and Esperanto). |
Iso8859_4 | Latin-4 (North European). |
Iso8859_5 | Latin/Cyrillic. |
Iso8859_6 | Latin/Arabic. |
Iso8859_7 | Latin/Greek (modern monotonic). |
Iso8859_8 | Latin/Hebrew (visual order). |
Iso8859_8i | Latin/Hebrew (logical order). |
Iso8859_10 | Latin-6 (Nordic). |
Iso8859_13 | Latin-7 (Baltic Rim). |
Iso8859_14 | Latin-8 (Celtic). |
Iso8859_15 | Latin-9 (revision of ISO 8859-1 Latin-1, Western European). |
Iso8859_16 | Latin-10 (South-Eastern European). |
Koi8R | KOI-8 specialized for Russian Cyrillic. |
Koi8U | KOI-8 specialized for Ukrainian Cyrillic. |
Macintosh | Mac OS Roman. |
MacintoshCyrillic | Mac OS Cyrillic (as of Mac OS 9.0). |
ShiftJis | The Windows variant (code page 932) of Shift JIS. |
Windows874 | ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots. Note that this encoding is always used instead of pure Latin/Thai. |
Windows1250 | The Windows extension and rearrangement of ISO 8859-2 Latin-2. |
Windows1251 | Windows Cyrillic. |
Windows1252 | The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-1. |
Windows1253 | Windows Greek (modern monotonic). |
Windows1254 | The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-5. |
Windows1255 | The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew. |
Windows1256 | Windows Arabic. |
Windows1257 | Windows Baltic. |
Windows1258 | Windows Vietnamese. |
Replacement | The entire input is reduced to a single replacement character (U+FFFD). No encoder is provided for this scheme. |
UserDefined | Non-ASCII bytes (0x80 through 0xFF) are mapped one-to-one into the Private Use Area (U+F780 through U+F7FF). |
Instances
Bounded Encoding
Enum Encoding
  Defined in Web.Willow.Common.Encoding.Common
Eq Encoding
Ord Encoding
  Defined in Web.Willow.Common.Encoding.Common
Read Encoding
Show Encoding
Hashable Encoding
  Defined in Web.Willow.Common.Encoding.Common
Initialization
tokenizerMode :: CurrentTokenizerState -> TokenizerState -> TokenizerState Source #
Specify which section of the finite state machine describing the tokenization algorithm should be active.
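For example, a tree-construction stage which has just processed a &lt;style&gt; start tag would switch the tokenizer into the raw-text rules before feeding it the element's contents. A minimal sketch, using only functions documented on this page:

```haskell
-- Switch the state machine to the RAWTEXT rules, as the HTML parsing
-- specification requires after a <style> start tag.
styleState :: TokenizerState
styleState = tokenizerMode RawTextState defaultTokenizerState
```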
tokenizerStartTag :: Maybe Namespace -> ElementName -> TokenizerState -> TokenizerState Source #
Specify the data to use as the previous tag which had been emitted by the tokenizer. This only has to be called when required for external algorithms or constructions; the parser automatically updates as required for generated StartTag tokens.
tokenizerEncoding :: Either SnifferEnvironment (Maybe Encoding) -> TokenizerState -> TokenizerState Source #
Specify the encoding scheme used by a given parse environment to read from
the binary input stream. Note that this will always use the initial state
for the respective decoder; intermediate states as returned by decodeStep
are not supported.
Transformations
tokenize :: TokenizerState -> ByteString -> ([([ParseError], Token)], TokenizerState) Source #
HTML: tokenization
Given a starting environment, transform a binary document stream into a stream of semantic atoms. If the parse fails, returns all tokens before the one which caused the error, but any trailing bytes are silently dropped.
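As a usage sketch (the exposed module name Web.Mangrove.Parse.Tokenize is an assumption inferred from the "Defined in" annotations above, and Token is assumed to have its derived Show instance):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Web.Mangrove.Parse.Tokenize

main :: IO ()
main = do
    -- Force a known encoding rather than relying on sniffing.
    let st = tokenizerEncoding (Right (Just Utf8)) defaultTokenizerState
        (tokens, _st') = tokenize st "<p class=\"intro\">Hi</p>"
    -- Each token is paired with the parse errors preceding it.
    mapM_ (print . snd) tokens
```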
tokenizeStep :: TokenizerState -> ByteString -> ([([ParseError], Token)], TokenizerState, ByteString) Source #
Parse a minimal number of bytes from an input stream, into a sequence of semantic tokens. Returns all data required to seamlessly resume parsing.
finalizeTokenizer :: TokenizerState -> [([ParseError], Token)] Source #
Explicitly indicate that the input stream will not contain any further bytes, and perform any finalization processing based on that.
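The three transformation functions compose into a simple incremental driver. The sketch below assumes the module name from the "Defined in" annotations, and that the unconsumed bytes returned by tokenizeStep should be prepended to the next input chunk:

```haskell
import qualified Data.ByteString as BS
import Web.Mangrove.Parse.Tokenize

-- Feed successive input chunks through the tokenizer, then finalize.
tokenizeChunks :: TokenizerState -> [BS.ByteString] -> [([ParseError], Token)]
tokenizeChunks st0 chunks = go st0 BS.empty chunks
  where
    go st buf cs = case tokenizeStep st buf of
        -- Progress was made: emit the tokens and keep draining the buffer.
        (toks@(_ : _), st', leftover) -> toks ++ go st' leftover cs
        -- No token could be produced: more input is needed.
        ([], st', leftover) -> case cs of
            (c : cs') -> go st' (BS.append leftover c) cs'
            -- Input exhausted; flush any pending state.
            []        -> finalizeTokenizer st'
```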