willow-0.1.0.0: An implementation of the web Document Object Model, and its rendering.
Copyright: (c) 2020 Sam May
License: MPL-2.0
Maintainer: ag.eitilt@gmail.com
Stability: experimental
Portability: portable
Safe Haskell: Safe-Inferred
Language: Haskell98

Web.Willow.Common.Encoding.Sniffer

Description

In an ideal internet, every server would declare the binary encoding with which it is transmitting a file (actually, the true ideal would be for it to always be Utf8, but there are still a lot of legacy documents out there). However, that's not always the case.

A good fallback would be for every document to itself declare the encoding in which it was saved. However, not every document does, and those that do may still get it wrong (take, for instance, a server that transcodes everything it sends to Utf8, leaving the document's own declaration stale).

And so, the HTML standard describes an algorithm for guessing the proper bytes-to-text translation to use when decoding. While the algorithm therefore assumes some HTML syntax and specific tags, none of the semantics should cause an issue for other filetypes.

Synopsis

Types

data Encoding

Encoding: encoding

All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (including all encoding purposes), but the others are retained for compatibility with existing pages.

Note that none of these mappings is a total function (each is partial to one degree or another), and that no guarantee is made that a mapping round-trips.

Constructors

Utf8

The UTF-8 encoding for Unicode.

Utf16be

The UTF-16 encoding for Unicode, in big endian order.

No encoder is provided for this scheme.

Utf16le

The UTF-16 encoding for Unicode, in little endian order.

No encoder is provided for this scheme.

Big5

Big5, primarily covering traditional Chinese characters.

EucJp

EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212.

EucKr

EUC-KR, primarily covering Hangul.

Gb18030

The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters.

Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization.

Gbk

GBK, primarily covering simplified Chinese characters.

In practice, this is just Gb18030 with a restricted set of encodable characters; the decoder is identical.

Ibm866

DOS and OS/2 code page for Cyrillic characters.

Iso2022Jp

A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana.

Iso8859_2

Latin-2 (Central European).

Iso8859_3

Latin-3 (South European and Esperanto).

Iso8859_4

Latin-4 (North European).

Iso8859_5

Latin/Cyrillic.

Iso8859_6

Latin/Arabic.

Iso8859_7

Latin/Greek (modern monotonic).

Iso8859_8

Latin/Hebrew (visual order).

Iso8859_8i

Latin/Hebrew (logical order).

Iso8859_10

Latin-6 (Nordic).

Iso8859_13

Latin-7 (Baltic Rim).

Iso8859_14

Latin-8 (Celtic).

Iso8859_15

Latin-9 (revision of ISO 8859-1 Latin-1, Western European).

Iso8859_16

Latin-10 (South-Eastern European).

Koi8R

KOI-8 specialized for Russian Cyrillic.

Koi8U

KOI-8 specialized for Ukrainian Cyrillic.

Macintosh

Mac OS Roman.

MacintoshCyrillic

Mac OS Cyrillic (as of Mac OS 9.0).

ShiftJis

The Windows variant (code page 932) of Shift JIS.

Windows874

ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots.

Note that this encoding is always used instead of pure Latin/Thai.

Windows1250

The Windows extension and rearrangement of ISO 8859-2 Latin-2.

Windows1251

Windows Cyrillic.

Windows1252

The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs.

Note that this encoding is always used instead of pure Latin-1.

Windows1253

Windows Greek (modern monotonic).

Windows1254

The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs.

Note that this encoding is always used instead of pure Latin-5.

Windows1255

The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew.

Windows1256

Windows Arabic.

Windows1257

Windows Baltic.

Windows1258

Windows Vietnamese.

Replacement

The input is reduced to a single \xFFFD replacement character.

No encoder is provided for this scheme.

UserDefined

Non-ASCII bytes (\x80 through \xFF) are mapped to a portion of the Unicode Private Use Area (\xF780 through \xF7FF).
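The UserDefined mapping is simple enough to sketch in a few lines. The helper below is hypothetical (it is not part of this module's API) and assumes, as the Encoding standard specifies, that ASCII bytes decode to themselves.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode a single byte as the UserDefined scheme describes it:
-- ASCII bytes map to themselves, while \x80 through \xFF are shifted
-- into the Private Use Area at \xF780 through \xF7FF.
-- (Hypothetical helper for illustration only.)
decodeUserDefinedByte :: Word8 -> Char
decodeUserDefinedByte b
    | b < 0x80  = chr (fromIntegral b)
    | otherwise = chr (0xF780 + fromIntegral b - 0x80)
```

Because every byte has exactly one image, this direction of the mapping is total; the reverse direction is defined only for ASCII and the 128 Private Use code points.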

Instances

Bounded Encoding

Defined in Web.Willow.Common.Encoding.Common

Enum Encoding

Defined in Web.Willow.Common.Encoding.Common

Eq Encoding

Defined in Web.Willow.Common.Encoding.Common

Ord Encoding

Defined in Web.Willow.Common.Encoding.Common

Read Encoding

Defined in Web.Willow.Common.Encoding.Common

Show Encoding

Defined in Web.Willow.Common.Encoding.Common

Hashable Encoding

Defined in Web.Willow.Common.Encoding.Common

Methods

hashWithSalt :: Int -> Encoding -> Int

hash :: Encoding -> Int

data Confidence

HTML: confidence

How likely the specified encoding is to be the actual stream encoding.

The spec names a third confidence level irrelevant, to be used when the stream doesn't depend on any particular encoding scheme (i.e. it is composed directly of Chars rather than parsed from a binary stream). This has not been included in the sum type, as it makes little sense to have that as a parameter of the decoding stage. Use Maybe DecoderState to represent it instead.

Constructors

Tentative Encoding ReparseData

The binary stream is likely the named encoding, but more data may prove it to be something else. In the latter case, the ReparseData (if available) may be used to transition to the proper encoding, or restart the stream if necessary.

Certain Encoding

The binary stream is confirmed to be of the given encoding.
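As a rough illustration of how the two constructors differ in practice, the sketch below decides what to do when a newly declared encoding is discovered partway through a stream. The types are local stand-ins rather than the library's own (the contents of ReparseData are not shown here, so it is left opaque), and the Action type and reconsider function are hypothetical names. The real "change the encoding" steps in the HTML standard carry more cases than this.

```haskell
-- Stand-ins for the documented types; only the shapes are mirrored.
data Enc = Utf8 | Windows1252
    deriving (Eq, Show)

data Reparse = Reparse
    deriving (Eq, Show)

data Conf
    = Tentative Enc Reparse
    | Certain Enc
    deriving (Eq, Show)

-- | What a caller might do on learning a newly declared encoding.
-- (Hypothetical type for this sketch.)
data Action
    = KeepGoing       -- already certain; ignore the new declaration
    | ConfirmCurrent  -- the guess was right; upgrade it to certain
    | SwitchTo Enc    -- the guess was wrong; transition or restart
    deriving (Eq, Show)

-- | Simplified decision: a Certain encoding is never revisited, while a
-- Tentative one is either confirmed or replaced.
reconsider :: Conf -> Enc -> Action
reconsider (Certain _) _ = KeepGoing
reconsider (Tentative current _) declared
    | current == declared = ConfirmCurrent
    | otherwise           = SwitchTo declared
```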

data ReparseData

HTML: change the encoding

The data required to determine whether a new encoding would produce output identical to what the current one has already produced, and to restart parsing with the new encoding if the two are incompatible. Values may be easily initialized via emptyReparseData.

Constructors

ReparseData 

Fields

emptyReparseData :: ReparseData

The collection of data which would indicate nothing has yet been parsed.

The Algorithm

sniff :: SnifferEnvironment -> ByteString -> Confidence

HTML: encoding sniffing algorithm

Given a stream and related metadata, try to determine what encoding may have been used to write it.

If the bytes have not yet been produced, this will force (or wait for) the number of bytes requested by prescanDepth to become available in the stream, or the end of the stream if that comes sooner.

data SnifferEnvironment

Various datapoints which may indicate a document's binary encoding, to be fed into the sniff algorithm. Values may be easily instantiated as updates to emptySnifferEnvironment.

Constructors

SnifferEnvironment 

Fields

  • userOverride :: Maybe Encoding

    The encoding the end user has specified should be used. Note that even this can still be overridden by the presence of a byte-order mark at the head of the stream.

  • transportHeader :: Maybe Encoding

    The encoding given by the transport layer (e.g. through an HTTP Content-Type header).

  • prescanDepth :: Word

    The number of bytes which should be skimmed for meta attributes specifying an encoding.

  • parentEncoding :: Maybe Encoding

    The encoding used for the enclosing document (e.g., if this document is loaded via an <iframe>).

  • cachedInfo :: Maybe Encoding

    The encoding from the last time this page was loaded, other pages on the site, or other cached data.

  • userDefault :: Maybe Encoding

    The encoding the end user has specified as being their preferred default, if no better encoding can be determined.

  • localeEncoding :: Maybe Encoding

    Warning: The type of this argument will be changed in a future release

    The encoding recommended as a reasonable guess based on the current language of the user's system.

emptySnifferEnvironment :: SnifferEnvironment

A neutral set of parameters to pass to the sniff algorithm: no accessory data, and a prescanDepth limit of 1024 bytes.
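To show how these hints might combine, here is a self-contained sketch of a fallback chain. It uses local stand-in types (Enc, Env, emptyEnv) instead of the library's own, skips the meta prescan entirely, and assumes the fields are consulted in the order they are documented, with a byte-order mark taking priority over everything. The final Windows1252 fallback is also an assumption. It should be read as an approximation of the idea, not as what sniff actually does.

```haskell
import Control.Applicative ((<|>))
import qualified Data.ByteString as BS
import Data.Maybe (fromMaybe)

-- Stand-in for the library's Encoding type; only what this sketch needs.
data Enc = Utf8 | Utf16be | Utf16le | Windows1252
    deriving (Eq, Show)

-- Stand-in mirroring the documented fields (prescanDepth is omitted,
-- since no prescan is performed here).
data Env = Env
    { userOverride    :: Maybe Enc
    , transportHeader :: Maybe Enc
    , parentEncoding  :: Maybe Enc
    , cachedInfo      :: Maybe Enc
    , userDefault     :: Maybe Enc
    , localeEncoding  :: Maybe Enc
    }

-- | No hints at all, analogous to emptySnifferEnvironment.
emptyEnv :: Env
emptyEnv = Env Nothing Nothing Nothing Nothing Nothing Nothing

-- | Recognise a byte-order mark at the head of the stream.
bomEncoding :: BS.ByteString -> Maybe Enc
bomEncoding bs
    | BS.pack [0xEF, 0xBB, 0xBF] `BS.isPrefixOf` bs = Just Utf8
    | BS.pack [0xFE, 0xFF] `BS.isPrefixOf` bs       = Just Utf16be
    | BS.pack [0xFF, 0xFE] `BS.isPrefixOf` bs       = Just Utf16le
    | otherwise                                     = Nothing

-- | The first available hint wins (assumed precedence; see the lead-in).
guess :: Env -> BS.ByteString -> Enc
guess env bs = fromMaybe Windows1252 $
    bomEncoding bs
        <|> userOverride env
        <|> transportHeader env
        <|> parentEncoding env
        <|> cachedInfo env
        <|> userDefault env
        <|> localeEncoding env
```

As with the real type, a value is most conveniently built with record-update syntax on the empty environment, for example `emptyEnv { transportHeader = Just Utf8 }`.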

sniffDecoderState :: SnifferEnvironment -> ByteString -> DecoderState

Guess what encoding may be in use by the binary stream, and from that guess build the initial state that makes decoding behave, from the start of the stream, as the decoding algorithm describes.

Auxiliary

decoderConfidence :: DecoderState -> Confidence

The encoding scheme currently in use by the parser, along with how likely it is that the scheme actually represents the binary stream.

confidenceEncoding :: Confidence -> Encoding

Extract the underlying encoding scheme from the wrapping data.

extractEncoding :: ByteString -> Maybe Encoding

HTML: algorithm for extracting a character encoding from a meta element

Find the first occurrence of an ASCII-encoded string charset in the stream, and try to parse its attribute-style value into an Encoding.

Returns Nothing if the stream does not contain charset followed by =, or if the value cannot be successfully parsed as an encoding label.
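The scanning half of that behaviour can be sketched as follows. This is an independent, simplified reading of the HTML standard's steps, not the library's implementation: extractLabel and isHtmlSpace are hypothetical names, and the sketch stops at returning the raw label instead of resolving it to an Encoding.

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.Char (toLower)

-- | Whitespace as the HTML standard counts it: tab, LF, FF, CR, space.
isHtmlSpace :: Char -> Bool
isHtmlSpace c = c `elem` "\t\n\f\r "

-- | Return the raw label following the first usable @charset=@, if any.
-- The search is case-insensitive; the label keeps its original case.
extractLabel :: BC.ByteString -> Maybe BC.ByteString
extractLabel input = go (BC.map toLower input) input
  where
    key = BC.pack "charset"

    -- 'lower' and 'orig' are always the same length and kept in step.
    go lower orig =
        case BC.breakSubstring key lower of
            (_, rest) | BC.null rest -> Nothing
            (before, _) ->
                let skip      = BC.length before + BC.length key
                    lowerRest = BC.drop skip lower
                    origRest  = BC.drop skip orig
                in case BC.uncons (BC.dropWhile isHtmlSpace origRest) of
                    Just ('=', afterEq) ->
                        value (BC.dropWhile isHtmlSpace afterEq)
                    -- Not followed by '=': look for a later "charset".
                    _ -> go lowerRest origRest

    value v = case BC.uncons v of
        Nothing -> Nothing
        Just (q, rest)
            | q == '"' || q == '\'' ->
                -- Quoted value: require the matching closing quote.
                let (label, end) = BC.break (== q) rest
                in if BC.null end then Nothing else Just label
            | otherwise ->
                -- Unquoted value: runs to whitespace, ';', or the end.
                Just (BC.takeWhile (\c -> not (isHtmlSpace c) && c /= ';') v)
```

For instance, `extractLabel (BC.pack "text/html; charset=utf-8")` evaluates to `Just "utf-8"`, while an unterminated quote such as `charset='koi8-r` yields `Nothing`.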