Copyright	(c) 2020 Sam May
License	MPL-2.0
Maintainer	ag.eitilt@gmail.com
Stability	experimental
Portability	portable
Safe Haskell	Safe-Inferred
Language	Haskell98

Web.Willow.Common.Encoding.Sniffer

Contents

Types
The Algorithm
- Auxiliary

Description

In an ideal internet, every server would declare the binary encoding with which it is transmitting a file (actually, the true ideal would be for it to always be Utf8, but there are still a lot of legacy documents out there). However, that's not always the case.

A good fallback would be for every document to declare for itself the encoding in which it was saved. However, not every document does, and those that do may still get it wrong (take, for instance, a server which transcodes everything it sends to Utf8, leaving the declaration inside the document stale).

And so, the HTML standard describes an algorithm for guessing the proper bytes-to-text translation to use when decoding. While this does therefore assume some HTML syntax and specific tags, none of the semantics should cause an issue for other filetypes.

Synopsis

data Encoding
- = Utf8
- | Utf16be
- | Utf16le
- | Big5
- | EucJp
- | EucKr
- | Gb18030
- | Gbk
- | Ibm866
- | Iso2022Jp
- | Iso8859_2
- | Iso8859_3
- | Iso8859_4
- | Iso8859_5
- | Iso8859_6
- | Iso8859_7
- | Iso8859_8
- | Iso8859_8i
- | Iso8859_10
- | Iso8859_13
- | Iso8859_14
- | Iso8859_15
- | Iso8859_16
- | Koi8R
- | Koi8U
- | Macintosh
- | MacintoshCyrillic
- | ShiftJis
- | Windows874
- | Windows1250
- | Windows1251
- | Windows1252
- | Windows1253
- | Windows1254
- | Windows1255
- | Windows1256
- | Windows1257
- | Windows1258
- | Replacement
- | UserDefined
data Confidence
- = Tentative Encoding ReparseData
- | Certain Encoding
data ReparseData = ReparseData {
- parsedChars :: HashMap ShortByteString Char
- streamStart :: ByteString
}
emptyReparseData :: ReparseData
sniff :: SnifferEnvironment -> ByteString -> Confidence
data SnifferEnvironment = SnifferEnvironment {
- userOverride :: Maybe Encoding
- transportHeader :: Maybe Encoding
- prescanDepth :: Word
- parentEncoding :: Maybe Encoding
- cachedInfo :: Maybe Encoding
- userDefault :: Maybe Encoding
- localeEncoding :: Maybe Encoding
}
emptySnifferEnvironment :: SnifferEnvironment
sniffDecoderState :: SnifferEnvironment -> ByteString -> DecoderState
decoderConfidence :: DecoderState -> Confidence
confidenceEncoding :: Confidence -> Encoding
extractEncoding :: ByteString -> Maybe Encoding

Types

data Encoding

Encoding: encoding

All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (including all encoding purposes), but the others are retained for compatibility with existing pages.

Note that, to one degree or another, none of these mappings are total functions, and that no guarantee is made that a mapping round-trips.

Constructors

Utf8	The UTF-8 encoding for Unicode.
Utf16be	The UTF-16 encoding for Unicode, in big endian order. No encoder is provided for this scheme.
Utf16le	The UTF-16 encoding for Unicode, in little endian order. No encoder is provided for this scheme.
Big5	Big5, primarily covering traditional Chinese characters.
EucJp	EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212.
EucKr	EUC-KR, primarily covering Hangul.
Gb18030	The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters. Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization.
Gbk	GBK, primarily covering simplified Chinese characters. In practice, this is just `Gb18030` with a restricted set of encodable characters; the decoder is identical.
Ibm866	DOS and OS/2 code page for Cyrillic characters.
Iso2022Jp	A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana.
Iso8859_2	Latin-2 (Central European).
Iso8859_3	Latin-3 (South European and Esperanto).
Iso8859_4	Latin-4 (North European).
Iso8859_5	Latin/Cyrillic.
Iso8859_6	Latin/Arabic.
Iso8859_7	Latin/Greek (modern monotonic).
Iso8859_8	Latin/Hebrew (visual order).
Iso8859_8i	Latin/Hebrew (logical order).
Iso8859_10	Latin-6 (Nordic).
Iso8859_13	Latin-7 (Baltic Rim).
Iso8859_14	Latin-8 (Celtic).
Iso8859_15	Latin-9 (revision of ISO 8859-1 Latin-1, Western European).
Iso8859_16	Latin-10 (South-Eastern European).
Koi8R	KOI-8 specialized for Russian Cyrillic.
Koi8U	KOI-8 specialized for Ukrainian Cyrillic.
Macintosh	Mac OS Roman.
MacintoshCyrillic	Mac OS Cyrillic (as of Mac OS 9.0).
ShiftJis	The Windows variant (code page 932) of Shift JIS.
Windows874	ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots. Note that this encoding is always used instead of pure Latin/Thai.
Windows1250	The Windows extension and rearrangement of ISO 8859-2 Latin-2.
Windows1251	Windows Cyrillic.
Windows1252	The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-1.
Windows1253	Windows Greek (modern monotonic).
Windows1254	The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-5.
Windows1255	The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew.
Windows1256	Windows Arabic.
Windows1257	Windows Baltic.
Windows1258	Windows Vietnamese.
Replacement	The input is reduced to a single `\xFFFD` replacement character. No encoder is provided for this scheme.
UserDefined	Non-ASCII bytes (`\x80` through `\xFF`) are mapped to a portion of the Unicode Private Use Area (`\xF780` through `\xF7FF`).
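
The `UserDefined` mapping is simple enough to sketch directly. The following is a minimal illustration, not this library's decoder: the name `decodeUserDefinedByte` is hypothetical, and the pass-through of ASCII bytes is taken from the Encoding standard's description of `x-user-defined` rather than from the text above.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | Hypothetical helper: decode one byte under the user-defined scheme.
-- ASCII bytes map to the code point of the same value; bytes from
-- 0x80 through 0xFF are shifted into the Private Use Area block
-- running from 0xF780 through 0xF7FF.
decodeUserDefinedByte :: Word8 -> Char
decodeUserDefinedByte b
    | b < 0x80 = chr (fromIntegral b)
    | otherwise = chr (0xF780 + fromIntegral b - 0x80)
```

For example, `decodeUserDefinedByte 0x80` evaluates to `'\xF780'` and `decodeUserDefinedByte 0xFF` to `'\xF7FF'`, matching the endpoints of the range given for `UserDefined`.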

Instances

Bounded Encoding
Defined in Web.Willow.Common.Encoding.Common. Methods: minBound :: Encoding; maxBound :: Encoding
Enum Encoding
Defined in Web.Willow.Common.Encoding.Common. Methods: succ :: Encoding -> Encoding; pred :: Encoding -> Encoding; toEnum :: Int -> Encoding; fromEnum :: Encoding -> Int; enumFrom :: Encoding -> [Encoding]; enumFromThen :: Encoding -> Encoding -> [Encoding]; enumFromTo :: Encoding -> Encoding -> [Encoding]; enumFromThenTo :: Encoding -> Encoding -> Encoding -> [Encoding]
Eq Encoding
Defined in Web.Willow.Common.Encoding.Common. Methods: (==) :: Encoding -> Encoding -> Bool; (/=) :: Encoding -> Encoding -> Bool
Ord Encoding
Defined in Web.Willow.Common.Encoding.Common. Methods: compare :: Encoding -> Encoding -> Ordering; (<) :: Encoding -> Encoding -> Bool; (<=) :: Encoding -> Encoding -> Bool; (>) :: Encoding -> Encoding -> Bool; (>=) :: Encoding -> Encoding -> Bool; max :: Encoding -> Encoding -> Encoding; min :: Encoding -> Encoding -> Encoding
Read Encoding
Defined in Web.Willow.Common.Encoding.Common. Methods: readsPrec :: Int -> ReadS Encoding; readList :: ReadS [Encoding]; readPrec :: ReadPrec Encoding; readListPrec :: ReadPrec [Encoding]
Show Encoding
Defined in Web.Willow.Common.Encoding.Common. Methods: showsPrec :: Int -> Encoding -> ShowS; show :: Encoding -> String; showList :: [Encoding] -> ShowS
Hashable Encoding
Defined in Web.Willow.Common.Encoding.Common. Methods: hashWithSalt :: Int -> Encoding -> Int; hash :: Encoding -> Int

data Confidence

HTML: confidence

How likely the specified encoding is to be the actual stream encoding.

The spec names a third confidence level, irrelevant, to be used when the stream doesn't depend on any particular encoding scheme (i.e. it is composed directly of Chars rather than parsed from a binary stream). This has not been included in the sum type, as it makes little sense to have that as a parameter of the decoding stage. Use Maybe DecoderState to represent it instead.

Constructors

Tentative Encoding ReparseData	The binary stream is likely the named encoding, but more data may prove it to be something else. In the latter case, the `ReparseData` (if available) may be used to transition to the proper encoding, or restart the stream if necessary.
Certain Encoding	The binary stream is confirmed to be of the given encoding.

Instances

Eq Confidence
Defined in Web.Willow.Common.Encoding.Common. Methods: (==) :: Confidence -> Confidence -> Bool; (/=) :: Confidence -> Confidence -> Bool
Read Confidence
Defined in Web.Willow.Common.Encoding.Common. Methods: readsPrec :: Int -> ReadS Confidence; readList :: ReadS [Confidence]; readPrec :: ReadPrec Confidence; readListPrec :: ReadPrec [Confidence]
Show Confidence
Defined in Web.Willow.Common.Encoding.Common. Methods: showsPrec :: Int -> Confidence -> ShowS; show :: Confidence -> String; showList :: [Confidence] -> ShowS

data ReparseData

HTML: change the encoding

The data required to determine whether a new encoding would produce output identical to what the current one has already produced, and to restart the parsing with the new one if the two are incompatible. Values may be easily initialized via emptyReparseData.

Constructors

ReparseData

Fields

parsedChars :: HashMap ShortByteString Char
The input binary sequences and the resulting characters which are already emitted to the output.
streamStart :: ByteString
The complete binary sequence parsed thus far, in case it needs to be re-processed under a new, incompatible encoding.
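
The compatibility check that `parsedChars` enables can be pictured as re-decoding every recorded byte sequence under the candidate encoding and comparing the results. The sketch below is an illustration under stated assumptions, not the library's code: it uses a plain association list in place of the `HashMap`, and the `Decoder` type and both toy decoders are hypothetical stand-ins.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | Hypothetical stand-in for decoding one character's worth of bytes.
type Decoder = [Word8] -> Maybe Char

-- | Toy decoder: accepts single ASCII bytes only.
decodeAscii :: Decoder
decodeAscii [b] | b < 0x80 = Just (chr (fromIntegral b))
decodeAscii _ = Nothing

-- | Toy decoder: maps any single byte to the code point of the same value.
decodeIdentity :: Decoder
decodeIdentity [b] = Just (chr (fromIntegral b))
decodeIdentity _ = Nothing

-- | True if the new decoder reproduces every character already emitted,
-- in which case the stream would not need to be restarted.
compatible :: Decoder -> [([Word8], Char)] -> Bool
compatible decode = all (\(bytes, c) -> decode bytes == Just c)
```

With the recorded pairs `[([0x68], 'h'), ([0xE9], '\xE9')]`, `compatible decodeIdentity` holds while `compatible decodeAscii` does not; in the second case the full `streamStart` would have to be re-processed under the new encoding.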

Instances

Eq ReparseData
Defined in Web.Willow.Common.Encoding.Common. Methods: (==) :: ReparseData -> ReparseData -> Bool; (/=) :: ReparseData -> ReparseData -> Bool
Read ReparseData
Defined in Web.Willow.Common.Encoding.Common. Methods: readsPrec :: Int -> ReadS ReparseData; readList :: ReadS [ReparseData]; readPrec :: ReadPrec ReparseData; readListPrec :: ReadPrec [ReparseData]
Show ReparseData
Defined in Web.Willow.Common.Encoding.Common. Methods: showsPrec :: Int -> ReparseData -> ShowS; show :: ReparseData -> String; showList :: [ReparseData] -> ShowS

emptyReparseData :: ReparseData

The collection of data which would indicate nothing has yet been parsed.

The Algorithm

sniff :: SnifferEnvironment -> ByteString -> Confidence

HTML: encoding sniffing algorithm

Given a stream and related metadata, try to determine what encoding may have been used to write it.

If they have not yet been produced, this will force (or wait for) the number of bytes requested by prescanDepth to become available in the stream, or the end of the stream if that comes sooner.

data SnifferEnvironment

Various data points which may indicate a document's binary encoding, to be fed into the sniff algorithm. Values are most easily built as record updates to emptySnifferEnvironment.

Constructors

SnifferEnvironment

Fields

userOverride :: Maybe Encoding
The encoding the end user has specified should be used. Note that even this can still be overridden by the presence of a byte-order mark at the head of the stream.
transportHeader :: Maybe Encoding
The encoding given by the transport layer (e.g. through an HTTP Content-Type header).
prescanDepth :: Word
The number of bytes which should be skimmed for meta attributes specifying an encoding.
parentEncoding :: Maybe Encoding
The encoding used for the enclosing document (e.g., if this document is loaded via an <iframe>).
cachedInfo :: Maybe Encoding
The encoding from the last time this page was loaded, other pages on the site, or other cached data.
userDefault :: Maybe Encoding
The encoding the end user has specified as being their preferred default, if no better encoding can be determined.
localeEncoding :: Maybe Encoding
Warning: The type of this argument will be changed in a future release
The encoding recommended as a reasonable guess based on the current language of the user's system.
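
How these fields interact can be sketched as a precedence chain. The model below follows the ordering of the HTML standard's encoding sniffing algorithm (and omits its optional frequency-analysis step); whether this library applies exactly the same order is an assumption. All types here (`Enc`, `Guess`, `Env`) are simplified local stand-ins, the `prescan` argument stands in for the result of the meta-attribute scan, and the final `Windows1252` fallback is a guess at a typical default.

```haskell
import Control.Applicative ((<|>))
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- Simplified stand-ins for the library's types (assumptions of this sketch).
data Enc = Utf8 | Utf16be | Utf16le | Windows1252 | Koi8R
    deriving (Eq, Show)

data Guess = Tentative Enc | Certain Enc
    deriving (Eq, Show)

data Env = Env
    { userOverride :: Maybe Enc
    , transportHeader :: Maybe Enc
    , parentEncoding :: Maybe Enc
    , cachedInfo :: Maybe Enc
    , userDefault :: Maybe Enc
    , localeEncoding :: Maybe Enc
    }

emptyEnv :: Env
emptyEnv = Env Nothing Nothing Nothing Nothing Nothing Nothing

-- | Recognise a byte-order mark at the head of the stream.
bom :: [Word8] -> Maybe Enc
bom (0xEF : 0xBB : 0xBF : _) = Just Utf8
bom (0xFE : 0xFF : _) = Just Utf16be
bom (0xFF : 0xFE : _) = Just Utf16le
bom _ = Nothing

-- | The first available source wins; only the first three are 'Certain'.
guess :: Env -> Maybe Enc -> [Word8] -> Guess
guess env prescan bytes =
    fromMaybe (Tentative Windows1252) $
        Certain <$> (bom bytes <|> userOverride env <|> transportHeader env)
            <|> Tentative <$> (prescan <|> parentEncoding env <|> cachedInfo env
                    <|> userDefault env <|> localeEncoding env)
```

Under this model, `guess emptyEnv { userOverride = Just Koi8R } Nothing [0xEF, 0xBB, 0xBF]` yields `Certain Utf8`, illustrating the note on userOverride that a byte-order mark takes priority even over an explicit user choice.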

Instances

Eq SnifferEnvironment
Defined in Web.Willow.Common.Encoding.Sniffer. Methods: (==) :: SnifferEnvironment -> SnifferEnvironment -> Bool; (/=) :: SnifferEnvironment -> SnifferEnvironment -> Bool
Read SnifferEnvironment
Defined in Web.Willow.Common.Encoding.Sniffer. Methods: readsPrec :: Int -> ReadS SnifferEnvironment; readList :: ReadS [SnifferEnvironment]; readPrec :: ReadPrec SnifferEnvironment; readListPrec :: ReadPrec [SnifferEnvironment]
Show SnifferEnvironment
Defined in Web.Willow.Common.Encoding.Sniffer. Methods: showsPrec :: Int -> SnifferEnvironment -> ShowS; show :: SnifferEnvironment -> String; showList :: [SnifferEnvironment] -> ShowS

emptySnifferEnvironment :: SnifferEnvironment

A neutral set of parameters to pass to the sniff algorithm: no accessory data, and a prescanDepth limit of 1024 bytes.

sniffDecoderState :: SnifferEnvironment -> ByteString -> DecoderState

Guess what encoding may be in use by the binary stream, and build from that guess the initial decoder state, such that decoding from the start of the stream behaves as described by the decoding algorithm.

Auxiliary

decoderConfidence :: DecoderState -> Confidence

The encoding scheme currently in use by the parser, along with how likely that scheme is to actually represent the binary stream.

confidenceEncoding :: Confidence -> Encoding

Extract the underlying encoding scheme from the wrapping data.

extractEncoding :: ByteString -> Maybe Encoding

HTML: algorithm for extracting a character encoding from a meta element

Find the first occurrence of the ASCII-encoded string `charset` in the stream, and try to parse its attribute-style value into an Encoding.

Returns Nothing if the stream does not contain `charset` followed by `=`, or if the value cannot be successfully parsed as an encoding label.
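
The extraction step can be sketched on plain Strings. This is an illustration following the HTML standard's description of the algorithm, not this library's implementation: the helper names are hypothetical, the result is the raw label rather than an Encoding, and retrying after a `charset` that is not followed by `=` is taken from the standard.

```haskell
import Data.Char (isAsciiUpper, toLower)
import Data.List (isPrefixOf, tails)

-- | ASCII whitespace as defined by the HTML standard.
isAsciiSpace :: Char -> Bool
isAsciiSpace c = c `elem` "\t\n\f\r "

-- | Lowercase ASCII letters only, leaving every other character alone.
asciiLower :: Char -> Char
asciiLower c = if isAsciiUpper c then toLower c else c

-- | Hypothetical helper: pull the raw label out of a string such as
-- "text/html; charset=utf-8", without mapping it to an Encoding.
extractLabel :: String -> Maybe String
extractLabel input =
    case dropWhile (not . startsWithCharset) (tails input) of
        [] -> Nothing
        (match : _) ->
            case dropWhile isAsciiSpace (drop 7 match) of
                ('=' : rest) -> value (dropWhile isAsciiSpace rest)
                -- "charset" without "=": keep searching further along.
                afterName -> extractLabel afterName
  where
    startsWithCharset s = "charset" `isPrefixOf` map asciiLower s
    value ('"' : rest) = quoted '"' rest
    value ('\'' : rest) = quoted '\'' rest
    value [] = Nothing
    -- Unquoted values run up to the next whitespace or semicolon.
    value rest = Just (takeWhile (\c -> not (isAsciiSpace c) && c /= ';') rest)
    quoted q rest = case break (== q) rest of
        (label, _ : _) -> Just label
        _ -> Nothing -- unmatched quote
```

For example, `extractLabel "text/html; charset=utf-8"` evaluates to `Just "utf-8"`, while `extractLabel "charset='utf-8"` evaluates to `Nothing` because the quote is never closed. Turning the label into an Encoding would be a separate lookup step.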