Copyright	(c) 2020 Sam May
License	MPL-2.0
Maintainer	ag.eitilt@gmail.com
Stability	experimental
Portability	portable
Safe Haskell	Safe-Inferred
Language	Haskell98

Web.Mangrove.Parse.Encoding.Preprocess

Contents

Initialization

Description

To simplify the tokenization parsers, the many representations of line breaks are unified into a single, Unix-style \n. While we're iterating over the input, and before some of the special characters are replaced, it's also a good time to trigger the warnings for unexpected characters (ControlCharacterInInputStream, SurrogateInInputStream, and NoncharacterInInputStream).
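The newline unification described above can be sketched on a plain String. This is only an illustration, not the module's implementation: it assumes the representations being unified are "\r\n" pairs and lone '\r' characters, and the name normalizeNewlines is hypothetical.

```haskell
-- Sketch only: collapse "\r\n" and a lone '\r' into a single '\n'.
-- Assumes those are the line-break forms being unified; the library works
-- on decoded input internally and may handle further cases.
normalizeNewlines :: String -> String
normalizeNewlines ('\r' : '\n' : rest) = '\n' : normalizeNewlines rest
normalizeNewlines ('\r' : rest)        = '\n' : normalizeNewlines rest
normalizeNewlines (c : rest)           = c : normalizeNewlines rest
normalizeNewlines []                   = []
```

The "\r\n" case has to come before the lone '\r' case, otherwise a Windows-style line break would turn into two newlines.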

Synopsis

preprocess :: DecoderState -> ByteString -> ([([ParseError], Char)], DecoderState)
preprocessStep :: DecoderState -> ByteString -> ([([ParseError], Char)], DecoderState, ByteString)
data Encoding
    = Utf8
    | Utf16be
    | Utf16le
    | Big5
    | EucJp
    | EucKr
    | Gb18030
    | Gbk
    | Ibm866
    | Iso2022Jp
    | Iso8859_2
    | Iso8859_3
    | Iso8859_4
    | Iso8859_5
    | Iso8859_6
    | Iso8859_7
    | Iso8859_8
    | Iso8859_8i
    | Iso8859_10
    | Iso8859_13
    | Iso8859_14
    | Iso8859_15
    | Iso8859_16
    | Koi8R
    | Koi8U
    | Macintosh
    | MacintoshCyrillic
    | ShiftJis
    | Windows874
    | Windows1250
    | Windows1251
    | Windows1252
    | Windows1253
    | Windows1254
    | Windows1255
    | Windows1256
    | Windows1257
    | Windows1258
    | Replacement
    | UserDefined
data DecoderState
initialDecoderState :: Encoding -> DecoderState

Documentation

preprocess :: DecoderState -> ByteString -> ([([ParseError], Char)], DecoderState)

Encoding: preprocessing the input stream

Given a character encoding scheme, transform an encoding-dependent ByteString into portable Chars. Any byte sequences which are meaningless or illegal are replaced with the Unicode replacement character \xFFFD. All newlines are normalized to a single \n Char, and Unicode control characters, surrogate characters, and non-characters are marked with the proper errors.

See preprocessStep to operate over only a minimal section.
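As a rough illustration of which characters attract the three warnings, the following sketch classifies a single Char. It follows my reading of the HTML standard's definitions (controls other than ASCII whitespace and NULL, surrogates, and noncharacters) rather than this library's source, and the Flag type and classify function are hypothetical names; the library reports ParseError values instead.

```haskell
import Data.Bits ((.&.))
import Data.Char (ord)

-- Hypothetical label type standing in for the library's ParseError values.
data Flag = Control | Surrogate | Noncharacter
    deriving (Eq, Show)

-- Sketch only: decide which warning, if any, a character would attract.
classify :: Char -> Maybe Flag
classify c
    | n >= 0xD800 && n <= 0xDFFF = Just Surrogate
    | n >= 0xFDD0 && n <= 0xFDEF = Just Noncharacter
    -- The last two code points of every plane (..FFFE and ..FFFF).
    | n .&. 0xFFFE == 0xFFFE     = Just Noncharacter
    | inControlRange && not exempt = Just Control
    | otherwise                  = Nothing
  where
    n = ord c
    -- C0 controls, DELETE, and the C1 controls.
    inControlRange = n <= 0x1F || (n >= 0x7F && n <= 0x9F)
    -- Assumption: NULL, TAB, LF, FF, and CR are not reported as controls.
    exempt = n `elem` [0x00, 0x09, 0x0A, 0x0C, 0x0D]
```

The surrogate and noncharacter checks run first so that the control check only has to deal with the low ranges.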

preprocessStep :: DecoderState -> ByteString -> ([([ParseError], Char)], DecoderState, ByteString)

Encoding: preprocessing the input stream

Read the smallest number of bytes from the head of the ByteString which would leave the decoder in a re-enterable state. Any byte sequences which are meaningless or illegal are replaced with the Unicode replacement character \xFFFD. All newlines are normalized to a single \n Char, and Unicode control characters, surrogate characters, and non-characters are marked with the proper errors.

See preprocess to operate over the entire string at once.
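The relationship between the stepwise and whole-string forms can be sketched as a driver loop over any function with the same shape as preprocessStep. This is an assumption about how the two relate, not the library's code: runSteps and its isEmpty parameter are hypothetical, and the sketch is kept generic so that it does not depend on a particular ByteString flavour.

```haskell
-- Sketch only: repeatedly apply a step function (state, input) ->
-- (output, new state, remaining input) until the input is exhausted,
-- concatenating the outputs and returning the final state.
runSteps
    :: (input -> Bool)                            -- test for exhausted input
    -> (state -> input -> ([out], state, input))  -- shaped like preprocessStep
    -> state
    -> input
    -> ([out], state)
runSteps isEmpty step = go
  where
    go st bytes
        | isEmpty bytes = ([], st)
        | otherwise =
            let (out, st', rest) = step st bytes
                (outs, final)    = go st' rest
            in  (out ++ outs, final)
```

The loop only terminates if every step consumes at least one element of non-empty input, so a caller using a real decoder step would want to confirm that property first.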

Initialization

data Encoding

Encoding: encoding

All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (including all encoding purposes), but the others are retained for compatibility with existing pages.

Note that none of these mappings is a total function, to one degree or another, and that no guarantee is made that a mapping round-trips.

Constructors

Utf8	The UTF-8 encoding for Unicode.
Utf16be	The UTF-16 encoding for Unicode, in big endian order. No encoder is provided for this scheme.
Utf16le	The UTF-16 encoding for Unicode, in little endian order. No encoder is provided for this scheme.
Big5	Big5, primarily covering traditional Chinese characters.
EucJp	EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212.
EucKr	EUC-KR, primarily covering Hangul.
Gb18030	The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters. Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization.
Gbk	GBK, primarily covering simplified Chinese characters. In practice, this is just `Gb18030` with a restricted set of encodable characters; the decoder is identical.
Ibm866	DOS and OS/2 code page for Cyrillic characters.
Iso2022Jp	A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana.
Iso8859_2	Latin-2 (Central European).
Iso8859_3	Latin-3 (South European and Esperanto).
Iso8859_4	Latin-4 (North European).
Iso8859_5	Latin/Cyrillic.
Iso8859_6	Latin/Arabic.
Iso8859_7	Latin/Greek (modern monotonic).
Iso8859_8	Latin/Hebrew (visual order).
Iso8859_8i	Latin/Hebrew (logical order).
Iso8859_10	Latin-6 (Nordic).
Iso8859_13	Latin-7 (Baltic Rim).
Iso8859_14	Latin-8 (Celtic).
Iso8859_15	Latin-9 (revision of ISO 8859-1 Latin-1, Western European).
Iso8859_16	Latin-10 (South-Eastern European).
Koi8R	KOI-8 specialized for Russian Cyrillic.
Koi8U	KOI-8 specialized for Ukrainian Cyrillic.
Macintosh	Mac OS Roman.
MacintoshCyrillic	Mac OS Cyrillic (as of Mac OS 9.0).
ShiftJis	The Windows variant (code page 932) of Shift JIS.
Windows874	ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots. Note that this encoding is always used instead of pure Latin/Thai.
Windows1250	The Windows extension and rearrangement of ISO 8859-2 Latin-2.
Windows1251	Windows Cyrillic.
Windows1252	The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-1.
Windows1253	Windows Greek (modern monotonic).
Windows1254	The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-5.
Windows1255	The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew.
Windows1256	Windows Arabic.
Windows1257	Windows Baltic.
Windows1258	Windows Vietnamese.
Replacement	The input is reduced to a single `\xFFFD` replacement character. No encoder is provided for this scheme.
UserDefined	Non-ASCII bytes (`\x80` through `\xFF`) are mapped to a portion of the Unicode Private Use Area (`\xF780` through `\xF7FF`).
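
The UserDefined mapping in the last row is simple enough to sketch directly. The function name userDefinedToChar is hypothetical, and the pass-through of ASCII bytes is an assumption taken from the Encoding standard's x-user-defined decoder rather than from this table.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Sketch only: decode one byte under the UserDefined scheme.
-- ASCII bytes are assumed to map to themselves; bytes \x80 through \xFF
-- are shifted into the Private Use Area range \xF780 through \xF7FF.
userDefinedToChar :: Word8 -> Char
userDefinedToChar b
    | b < 0x80  = chr (fromIntegral b)
    | otherwise = chr (0xF780 + fromIntegral b - 0x80)
```

Because the mapping is a fixed offset, this particular scheme needs no decoder state between bytes.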

Instances

Bounded Encoding
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        minBound :: Encoding
        maxBound :: Encoding
Enum Encoding
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        succ :: Encoding -> Encoding
        pred :: Encoding -> Encoding
        toEnum :: Int -> Encoding
        fromEnum :: Encoding -> Int
        enumFrom :: Encoding -> [Encoding]
        enumFromThen :: Encoding -> Encoding -> [Encoding]
        enumFromTo :: Encoding -> Encoding -> [Encoding]
        enumFromThenTo :: Encoding -> Encoding -> Encoding -> [Encoding]
Eq Encoding
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        (==) :: Encoding -> Encoding -> Bool
        (/=) :: Encoding -> Encoding -> Bool
Ord Encoding
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        compare :: Encoding -> Encoding -> Ordering
        (<) :: Encoding -> Encoding -> Bool
        (<=) :: Encoding -> Encoding -> Bool
        (>) :: Encoding -> Encoding -> Bool
        (>=) :: Encoding -> Encoding -> Bool
        max :: Encoding -> Encoding -> Encoding
        min :: Encoding -> Encoding -> Encoding
Read Encoding
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        readsPrec :: Int -> ReadS Encoding
        readList :: ReadS [Encoding]
        readPrec :: ReadPrec Encoding
        readListPrec :: ReadPrec [Encoding]
Show Encoding
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        showsPrec :: Int -> Encoding -> ShowS
        show :: Encoding -> String
        showList :: [Encoding] -> ShowS
Hashable Encoding
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        hashWithSalt :: Int -> Encoding -> Int
        hash :: Encoding -> Int

data DecoderState

All the data which needs to be tracked for correct behaviour in decoding a binary stream into readable text.

Instances

Eq DecoderState
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        (==) :: DecoderState -> DecoderState -> Bool
        (/=) :: DecoderState -> DecoderState -> Bool
Read DecoderState
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        readsPrec :: Int -> ReadS DecoderState
        readList :: ReadS [DecoderState]
        readPrec :: ReadPrec DecoderState
        readListPrec :: ReadPrec [DecoderState]
Show DecoderState
    Defined in Web.Willow.Common.Encoding.Common
    Methods
        showsPrec :: Int -> DecoderState -> ShowS
        show :: DecoderState -> String
        showList :: [DecoderState] -> ShowS

initialDecoderState :: Encoding -> DecoderState

The collection of data which, for any given encoding scheme, results in behaviour according to the vanilla decoder before any bytes have been read.