Copyright | (c) 2020 Sam May |
---|---|
License | MPL-2.0 |
Maintainer | ag.eitilt@gmail.com |
Stability | experimental |
Portability | portable |
Safe Haskell | Safe-Inferred |
Language | Haskell98 |
To simplify the tokenization parsers, the many representations of line breaks
are unified into a single, Unix-style \n. While we're iterating over the
input, and before some of the special characters are replaced, it's also a good
time to trigger the warnings for unexpected characters
(ControlCharacterInInputStream, SurrogateInInputStream, and
NoncharacterInInputStream).
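The newline unification described above can be illustrated with a small sketch over plain String values. This is not the module's implementation: the name normalizeNewlines is hypothetical, and the sketch assumes the representations being unified are \r\n pairs and lone \r characters, as in the HTML standard.

```haskell
-- Illustrative sketch only; normalizeNewlines is a hypothetical name and
-- not part of this module's API.
-- Assumption: the line breaks to unify are "\r\n" and a lone "\r".
normalizeNewlines :: String -> String
normalizeNewlines ('\r' : '\n' : rest) = '\n' : normalizeNewlines rest  -- CRLF becomes LF
normalizeNewlines ('\r' : rest)        = '\n' : normalizeNewlines rest  -- lone CR becomes LF
normalizeNewlines (c : rest)           = c : normalizeNewlines rest     -- everything else is kept
normalizeNewlines []                   = []
```

Matching the two-character pattern first ensures that a \r\n pair yields one \n rather than two.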
Synopsis
- preprocess :: DecoderState -> ByteString -> ([([ParseError], Char)], DecoderState)
- preprocessStep :: DecoderState -> ByteString -> ([([ParseError], Char)], DecoderState, ByteString)
- data Encoding
- = Utf8
- | Utf16be
- | Utf16le
- | Big5
- | EucJp
- | EucKr
- | Gb18030
- | Gbk
- | Ibm866
- | Iso2022Jp
- | Iso8859_2
- | Iso8859_3
- | Iso8859_4
- | Iso8859_5
- | Iso8859_6
- | Iso8859_7
- | Iso8859_8
- | Iso8859_8i
- | Iso8859_10
- | Iso8859_13
- | Iso8859_14
- | Iso8859_15
- | Iso8859_16
- | Koi8R
- | Koi8U
- | Macintosh
- | MacintoshCyrillic
- | ShiftJis
- | Windows874
- | Windows1250
- | Windows1251
- | Windows1252
- | Windows1253
- | Windows1254
- | Windows1255
- | Windows1256
- | Windows1257
- | Windows1258
- | Replacement
- | UserDefined
- data DecoderState
- initialDecoderState :: Encoding -> DecoderState
Documentation
preprocess :: DecoderState -> ByteString -> ([([ParseError], Char)], DecoderState)
Encoding: preprocessing the input stream
Given a character encoding scheme, transform an encoding-dependent ByteString
into portable Chars. If any byte sequences are meaningless or illegal,
they are replaced with the Unicode replacement character \xFFFD. All
newlines are normalized to a single \n Char, and Unicode control
characters, surrogate characters, and non-characters are marked with the
proper errors.
See preprocessStep to operate over only a minimal section.
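As a rough illustration of how each Char can be paired with its warnings, the sketch below classifies code points using the definitions from the HTML standard. Everything in it is an assumption made for illustration: the Warning type and the classify and annotate functions are hypothetical stand-ins, not this module's ParseError or its actual logic.

```haskell
import Data.Bits ((.&.))
import Data.Char (ord)

-- Hypothetical stand-in for the three ParseError warnings named in this
-- module's description.
data Warning = ControlCharacter | Surrogate | Noncharacter
    deriving (Eq, Show)

-- Classify one code point. Assumption: the ranges follow the HTML standard,
-- where NULL and ASCII whitespace are exempt from the control warning.
classify :: Char -> [Warning]
classify c
    | n >= 0xD800 && n <= 0xDFFF = [Surrogate]
    | n >= 0xFDD0 && n <= 0xFDEF = [Noncharacter]
    | n .&. 0xFFFE == 0xFFFE     = [Noncharacter]  -- U+xFFFE and U+xFFFF in every plane
    | isCtl && not exempt        = [ControlCharacter]
    | otherwise                  = []
  where
    n      = ord c
    isCtl  = n <= 0x1F || (n >= 0x7F && n <= 0x9F)
    exempt = c `elem` "\0\t\n\f\r"

-- Pair every character with its warnings, mirroring the shape of the
-- [([ParseError], Char)] result.
annotate :: String -> [([Warning], Char)]
annotate = map (\c -> (classify c, c))
```

Keeping the warnings next to the character they describe lets a consumer report an error at the right position without a second pass over the input.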
preprocessStep :: DecoderState -> ByteString -> ([([ParseError], Char)], DecoderState, ByteString)
Encoding: preprocessing the input stream
Read the smallest number of bytes from the head of the ByteString
which would leave the decoder in a re-enterable state. Any byte
sequences which are meaningless or illegal are replaced with the Unicode
replacement character \xFFFD. All newlines are normalized to a single
\n Char, and Unicode control characters, surrogate characters, and
non-characters are marked with the proper errors.
See preprocess to operate over the entire string at once.
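A step function of this shape (state and input in; output, new state, and unconsumed input out) is typically driven by a loop that feeds the leftover input back in. The sketch below shows that pattern with type variables and lists, so that it needs no library. The names runSteps and toyStep are hypothetical, and the sketch makes no claim about how preprocess itself is implemented.

```haskell
import Data.Char (toUpper)

-- Generic driver: apply a step function of the same shape as preprocessStep
-- until the input is exhausted, collecting the output along the way.
-- Assumption: the step always consumes at least one element of a non-empty
-- input; otherwise this loop would not terminate.
runSteps :: (s -> [b] -> ([o], s, [b])) -> s -> [b] -> ([o], s)
runSteps step = go
  where
    go st []    = ([], st)
    go st input =
        let (out, st', rest) = step st input
            (outs, final)    = go st' rest
        in  (out ++ outs, final)

-- Toy step function for demonstration: consumes one character, emits its
-- upper-case form, and counts the characters seen so far in the state.
toyStep :: Int -> String -> (String, Int, String)
toyStep n (c : rest) = ([toUpper c], n + 1, rest)
toyStep n []         = ([], n, [])
```

With the toy step, runSteps toyStep 0 "abc" evaluates to ("ABC", 3). Threading the state explicitly is what lets a caller stop after any step and resume later with the returned state and remaining input.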
Initialization
data Encoding
Encoding: encoding
All character encoding schemes supported by the HTML standard, defined as a
bidirectional map between characters and binary sequences. Utf8 is
strongly encouraged for new content (including all encoding purposes), but
the others are retained for compatibility with existing pages.
Note that, to one degree or another, none of these mappings is a complete function, and no guarantee is made that a mapping round-trips.
Utf8 | The UTF-8 encoding for Unicode. |
Utf16be | The UTF-16 encoding for Unicode, in big endian order. No encoder is provided for this scheme. |
Utf16le | The UTF-16 encoding for Unicode, in little endian order. No encoder is provided for this scheme. |
Big5 | Big5, primarily covering traditional Chinese characters. |
EucJp | EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212. |
EucKr | EUC-KR, primarily covering Hangul. |
Gb18030 | The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters. Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization. |
Gbk | GBK, primarily covering simplified Chinese characters. In practice, this is just |
Ibm866 | DOS and OS/2 code page for Cyrillic characters. |
Iso2022Jp | A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana. |
Iso8859_2 | Latin-2 (Central European). |
Iso8859_3 | Latin-3 (South European and Esperanto). |
Iso8859_4 | Latin-4 (North European). |
Iso8859_5 | |
Iso8859_6 | |
Iso8859_7 | Latin/Greek (modern monotonic). |
Iso8859_8 | Latin/Hebrew (visual order). |
Iso8859_8i | Latin/Hebrew (logical order). |
Iso8859_10 | Latin-6 (Nordic). |
Iso8859_13 | Latin-7 (Baltic Rim). |
Iso8859_14 | Latin-8 (Celtic). |
Iso8859_15 | Latin-9 (revision of ISO 8859-1 Latin-1, Western European). |
Iso8859_16 | Latin-10 (South-Eastern European). |
Koi8R | KOI-8 specialized for Russian Cyrillic. |
Koi8U | KOI-8 specialized for Ukrainian Cyrillic. |
Macintosh | |
MacintoshCyrillic | Mac OS Cyrillic (as of Mac OS 9.0). |
ShiftJis | The Windows variant (code page 932) of Shift JIS. |
Windows874 | ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots. Note that this encoding is always used instead of pure Latin/Thai. |
Windows1250 | The Windows extension and rearrangement of ISO 8859-2 Latin-2. |
Windows1251 | |
Windows1252 | The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-1. |
Windows1253 | Windows Greek (modern monotonic). |
Windows1254 | The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-5. |
Windows1255 | The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew. |
Windows1256 | |
Windows1257 | |
Windows1258 | |
Replacement | The input is reduced to a single No encoder is provided for this scheme. |
UserDefined | Non-ASCII bytes ( |
Instances
Bounded Encoding | |
Enum Encoding | |
Defined in Web.Willow.Common.Encoding.Common | |
Eq Encoding | |
Ord Encoding | |
Defined in Web.Willow.Common.Encoding.Common | |
Read Encoding | |
Show Encoding | |
Hashable Encoding | |
Defined in Web.Willow.Common.Encoding.Common |
data DecoderState
All the data which needs to be tracked for correct behaviour in decoding a binary stream into readable text.
Instances
Eq DecoderState | |
Defined in Web.Willow.Common.Encoding.Common (==) :: DecoderState -> DecoderState -> Bool; (/=) :: DecoderState -> DecoderState -> Bool | |
Read DecoderState | |
Defined in Web.Willow.Common.Encoding.Common readsPrec :: Int -> ReadS DecoderState; readList :: ReadS [DecoderState] | |
Show DecoderState | |
Defined in Web.Willow.Common.Encoding.Common showsPrec :: Int -> DecoderState -> ShowS; show :: DecoderState -> String; showList :: [DecoderState] -> ShowS |
initialDecoderState :: Encoding -> DecoderState
The collection of data which, for any given encoding scheme, results in behaviour according to the vanilla decoder before any bytes have been read.