Portability | portable |
---|---|
Stability | experimental |
Maintainer | Uwe Schmidt (uwe@fh-wedel.de) |
Unicode and UTF-8 Conversion Functions
- type Unicode = Char
- type UString = [Unicode]
- type UTF8Char = Char
- type UTF8String = String
- type UStringWithErrors = [Either String Char]
- type DecodingFct = String -> (UString, [String])
- type DecodingFctEmbedErrors = String -> UStringWithErrors
- isXmlChar :: Unicode -> Bool
- isXmlLatin1Char :: Unicode -> Bool
- isXmlSpaceChar :: Unicode -> Bool
- isXml11SpaceChar :: Unicode -> Bool
- isXmlNameChar :: Unicode -> Bool
- isXmlNameStartChar :: Unicode -> Bool
- isXmlNCNameChar :: Unicode -> Bool
- isXmlNCNameStartChar :: Unicode -> Bool
- isXmlPubidChar :: Unicode -> Bool
- isXmlLetter :: Unicode -> Bool
- isXmlBaseChar :: Unicode -> Bool
- isXmlIdeographicChar :: Unicode -> Bool
- isXmlCombiningChar :: Unicode -> Bool
- isXmlDigit :: Unicode -> Bool
- isXmlExtender :: Unicode -> Bool
- isXmlControlOrPermanentlyUndefined :: Unicode -> Bool
- utf8ToUnicode :: DecodingFct
- utf8ToUnicodeEmbedErrors :: DecodingFctEmbedErrors
- latin1ToUnicode :: String -> UString
- ucs2ToUnicode :: String -> UString
- ucs2BigEndianToUnicode :: String -> UString
- ucs2LittleEndianToUnicode :: String -> UString
- utf16beToUnicode :: String -> UString
- utf16leToUnicode :: String -> UString
- unicodeCharToUtf8 :: Unicode -> UTF8String
- unicodeToUtf8 :: UString -> UTF8String
- unicodeToXmlEntity :: UString -> String
- unicodeToLatin1 :: UString -> String
- unicodeRemoveNoneAscii :: UString -> String
- unicodeRemoveNoneLatin1 :: UString -> String
- intToCharRef :: Int -> String
- intToCharRefHex :: Int -> String
- getDecodingFct :: String -> Maybe DecodingFct
- getDecodingFctEmbedErrors :: String -> Maybe DecodingFctEmbedErrors
- getOutputEncodingFct :: String -> Maybe (String -> UString)
- normalizeNL :: String -> String
- guessEncoding :: String -> String
Unicode Type declarations
Unicode is represented as the Char type Precondition for this is the support of Unicode character range in the compiler (e.g. ghc but not hugs)
type UTF8String = StringSource
UTF-8 strings are implemented as Haskell strings
type UStringWithErrors = [Either String Char]Source
type DecodingFct = String -> (UString, [String])Source
Decoding function with a pair containing the result string and a list of decoding errors as result
type DecodingFctEmbedErrors = String -> UStringWithErrorsSource
Decoding function where decoding errors are interleaved with decoded characters
XML char predicates
isXmlLatin1Char :: Unicode -> BoolSource
test for a legal latin1 XML char
isXmlSpaceChar :: Unicode -> BoolSource
checking for XML space character: \n, \r, \t and " "
isXml11SpaceChar :: Unicode -> BoolSource
checking for XML1.1 space character: additional space 0x85 and 0x2028
see also : isXmlSpaceChar
isXmlNameChar :: Unicode -> BoolSource
checking for XML name character
isXmlNameStartChar :: Unicode -> BoolSource
checking for XML name start character
see also : isXmlNameChar
isXmlNCNameChar :: Unicode -> BoolSource
checking for XML NCName character: no ":" allowed
see also : isXmlNameChar
isXmlNCNameStartChar :: Unicode -> BoolSource
checking for XML NCName start character: no ":" allowed
see also : isXmlNameChar
, isXmlNCNameChar
isXmlPubidChar :: Unicode -> BoolSource
checking for XML public id character
isXmlLetter :: Unicode -> BoolSource
checking for XML letter
isXmlBaseChar :: Unicode -> BoolSource
checking for XML base charater
isXmlIdeographicChar :: Unicode -> BoolSource
checking for XML ideographic charater
isXmlCombiningChar :: Unicode -> BoolSource
checking for XML combining charater
isXmlDigit :: Unicode -> BoolSource
checking for XML digit
isXmlExtender :: Unicode -> BoolSource
checking for XML extender
isXmlControlOrPermanentlyUndefined :: Unicode -> BoolSource
checking for XML control or permanently discouraged char
see Errata to XML1.0 (http://www.w3.org/XML/xml-V10-2e-errata) No 46
Document authors are encouraged to avoid compatibility characters, as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]). The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
UTF-8 and Unicode conversion functions
utf8ToUnicode :: DecodingFctSource
UTF-8 to Unicode conversion with deletion of leading byte order mark, as described in XML standard F.1
latin1ToUnicode :: String -> UStringSource
code conversion from latin1 to Unicode
ucs2ToUnicode :: String -> UStringSource
UCS-2 to UTF-8 conversion with byte order mark analysis
ucs2BigEndianToUnicode :: String -> UStringSource
UCS-2 big endian to Unicode conversion
ucs2LittleEndianToUnicode :: String -> UStringSource
UCS-2 little endian to Unicode conversion
utf16beToUnicode :: String -> UStringSource
UTF-16 big endian to UTF-8 conversion with removal of byte order mark
utf16leToUnicode :: String -> UStringSource
UTF-16 little endian to UTF-8 conversion with removal of byte order mark
unicodeCharToUtf8 :: Unicode -> UTF8StringSource
conversion from Unicode (Char) to a UTF8 encoded string.
unicodeToUtf8 :: UString -> UTF8StringSource
conversion from Unicode strings (UString) to UTF8 encoded strings.
unicodeToXmlEntity :: UString -> StringSource
substitute all Unicode characters, that are not legal 1-byte UTF-8 XML characters by a character reference.
This function can be used to translate all text nodes and attribute values into pure ascii.
see also : unicodeToLatin1
unicodeToLatin1 :: UString -> StringSource
substitute all Unicode characters, that are not legal latin1 UTF-8 XML characters by a character reference.
This function can be used to translate all text nodes and attribute values into ISO latin1.
see also : unicodeToXmlEntity
unicodeRemoveNoneAscii :: UString -> StringSource
removes all non ascii chars, may be used to transform a document into a pure ascii representation by removing all non ascii chars from tag and attibute names
see also : unicodeRemoveNoneLatin1
, unicodeToXmlEntity
unicodeRemoveNoneLatin1 :: UString -> StringSource
removes all non latin1 chars, may be used to transform a document into a pure ascii representation by removing all non ascii chars from tag and attibute names
see also : unicodeRemoveNoneAscii
, unicodeToLatin1
intToCharRef :: Int -> StringSource
convert an Unicode into a XML character reference.
see also : intToCharRefHex
intToCharRefHex :: Int -> StringSource
convert an Unicode into a XML hexadecimal character reference.
see also: intToCharRef
getDecodingFct :: String -> Maybe DecodingFctSource
the lookup function for selecting the decoding function
getDecodingFctEmbedErrors :: String -> Maybe DecodingFctEmbedErrorsSource
the lookup function for selecting the decoding function
getOutputEncodingFct :: String -> Maybe (String -> UString)Source
the lookup function for selecting the encoding function
normalizeNL :: String -> StringSource
White Space (XML Standard 2.3) and end of line handling (2.11)
#x0D and #x0D#x0A are mapped to #x0A
guessEncoding :: String -> StringSource