Copyright (c) 2002, members of the Haskell Internationalisation Working Group
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the Haskell Internationalisation Working Group nor
the names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
This module provides lazy stream encoding/decoding facilities for UTF8,
the Unicode Transformation Format with 8-bit words.
20020902 Sven Moritz Hallberg <pesco@gmx.de>
>
> module UTF8
> ( encode ) where
#ifdef HAVE_UTF8STRING
> import qualified Codec.Binary.UTF8.String (encode)
> import Data.Word (Word8)
#else
> import Data.Char (ord)
> import Data.Word (Word8, Word16, Word32)
> import Data.Bits (Bits, shiftR, (.&.), (.|.))
#endif
///- UTF8 in General -///
Adapted from the Unicode standard, version 3.2,
Table 3.1 "UTF-8 Bit Distribution" (excluded are UTF16 encodings):
  Scalar                    1st Byte  2nd Byte  3rd Byte  4th Byte
  000000000xxxxxxx          0xxxxxxx
  00000yyyyyxxxxxx          110yyyyy  10xxxxxx
  zzzzyyyyyyxxxxxx          1110zzzz  10yyyyyy  10xxxxxx
  000uuuzzzzzzyyyyyyxxxxxx  11110uuu  10zzzzzz  10yyyyyy  10xxxxxx
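Read column by column, the table also says how to recognise the role of a
byte from its high bits alone. The following is a minimal sketch of that
reading, not part of this module; the names ByteKind and byteKind are
hypothetical and chosen only for illustration.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Hypothetical type, not part of the UTF8 module.
data ByteKind
  = Single        -- 0xxxxxxx: one-byte sequence
  | Lead2         -- 110yyyyy: first byte of a two-byte sequence
  | Lead3         -- 1110zzzz: first byte of a three-byte sequence
  | Lead4         -- 11110uuu: first byte of a four-byte sequence
  | Continuation  -- 10xxxxxx: second, third or fourth byte
  | Invalid       -- 11111xxx: matches no row of the table
  deriving (Eq, Show)

-- Classify a byte by its high bits only.
byteKind :: Word8 -> ByteKind
byteKind b
  | b .&. 0x80 == 0x00 = Single
  | b .&. 0xC0 == 0x80 = Continuation
  | b .&. 0xE0 == 0xC0 = Lead2
  | b .&. 0xF0 == 0xE0 = Lead3
  | b .&. 0xF8 == 0xF0 = Lead4
  | otherwise          = Invalid
```

Note that this looks at bit patterns only. It reports bytes such as C0, C1
and F5 as lead bytes even though no legal sequence starts with them, so it
is a classifier and not a validator.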
Also from the Unicode standard, version 3.2,
Table 3.1B "Legal UTF-8 Byte Sequences":
  Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
  U+0000..U+007F       00..7F
  U+0080..U+07FF       C2..DF    80..BF
  U+0800..U+0FFF       E0        A0..BF    80..BF
  U+1000..U+CFFF       E1..EC    80..BF    80..BF
  U+D000..U+D7FF       ED        80..9F    80..BF
  U+D800..U+DFFF       ill-formed
  U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
  U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
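The code point column of this table can be condensed into a function that
gives the length of the encoding. This is only a sketch for orientation;
utf8Length is a hypothetical helper and does not exist in this module.

```haskell
-- Hypothetical helper, not part of the UTF8 module: the number of bytes
-- that Table 3.1B assigns to a code point, or Nothing where the table
-- lists no legal sequence.
utf8Length :: Int -> Maybe Int
utf8Length n
  | n <  0        = Nothing
  | n <= 0x7F     = Just 1
  | n <= 0x7FF    = Just 2
  | n <  0xD800   = Just 3
  | n <= 0xDFFF   = Nothing   -- surrogates are ill-formed
  | n <= 0xFFFF   = Just 3
  | n <= 0x10FFFF = Just 4    -- the upper bound is inclusive
  | otherwise     = Nothing
```

For example, utf8Length 0x20AC is Just 3 and utf8Length 0xD800 is Nothing.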
///- Encoding Functions -///
Must the encoder ensure that no illegal byte sequences are output, or
can we trust the Haskell system to supply only legal values?
For now I include error cases for the surrogate values U+D800..U+DFFF and
out-of-range scalars.
The function is pretty much a transcript of Table 3.1B with error checks.
It dispatches the actual encoding to functions specific to the number of
required bytes.
#ifndef HAVE_UTF8STRING
> encodeOne :: Char -> [Word8]
> encodeOne c
>-- The report guarantees in (6.1.2) that this won't happen:
>--   | n < 0         = error "encodeUTF8: ord returned a negative value"
>   | n < 0x0080    = encodeOne_onebyte n8
>   | n < 0x0800    = encodeOne_twobyte n16
>   | n < 0xD800    = encodeOne_threebyte n16
>   | n < 0xE000    = error "encodeUTF8: ord returned a surrogate value"
>   | n < 0x10000   = encodeOne_threebyte n16
>-- Haskell 98 only talks about 16 bit characters, but ghc handles 20.1.
>-- U+10FFFF is itself a legal scalar (Table 3.1B), so the bound is inclusive.
>   | n <= 0x10FFFF = encodeOne_fourbyte n32
>   | otherwise     = error "encodeUTF8: ord returned a value above 0x10FFFF"
>   where
>     n   = ord c          :: Int
>     n8  = fromIntegral n :: Word8
>     n16 = fromIntegral n :: Word16
>     n32 = fromIntegral n :: Word32
#endif
With the above, a stream encoder is trivial:
> encode :: [Char] -> [Word8]
#ifdef HAVE_UTF8STRING
> encode = Codec.Binary.UTF8.String.encode
#else
> encode = concatMap encodeOne
#endif
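To make the behaviour concrete, here is a compact standalone sketch of the
same idea, written so that it can be loaded without this module. It is not
the code used here: encodeSketch and its local helpers are hypothetical
names, and the per-length encoders are folded into one function.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Standalone sketch; encodeSketch is a hypothetical name and is not the
-- encode function exported by the UTF8 module.
encodeSketch :: String -> [Word8]
encodeSketch = concatMap one
  where
    one c
      | n < 0x80                  = [byte n]
      | n < 0x800                 = [0xC0 .|. top 6, cont 0]
      | n >= 0xD800 && n < 0xE000 = error "encodeSketch: surrogate code point"
      | n < 0x10000               = [0xE0 .|. top 12, cont 6, cont 0]
      | otherwise                 = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
      where
        n      = ord c
        byte   = fromIntegral :: Int -> Word8
        -- payload bits of the lead byte, still without the length marker
        top s  = byte (n `shiftR` s)
        -- 10xxxxxx continuation byte carrying six payload bits
        cont s = 0x80 .|. byte ((n `shiftR` s) .&. 0x3F)
```

Applied to the four characters U+0061, U+00E9, U+20AC and U+1F600 this
yields the bytes 61, C3 A9, E2 82 AC and F0 9F 98 80. Because concatMap is
lazy, take 3 (encodeSketch (cycle "a")) returns three bytes even though the
input is infinite.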
Now follow the individual encoders for certain numbers of bytes...
It's ASCII and it's here to stay!
#ifndef HAVE_UTF8STRING
> encodeOne_onebyte :: Word8 -> [Word8]
> encodeOne_onebyte cp = [cp]
#endif
00000yyyyyxxxxxx -> 110yyyyy 10xxxxxx
#ifndef HAVE_UTF8STRING
> encodeOne_twobyte :: Word16 -> [Word8]
> encodeOne_twobyte cp = [(0xC0.|.ys), (0x80.|.xs)]
>   where
>     xs, ys :: Word8
>     ys = fromIntegral (shiftR cp 6)
>     xs = (fromIntegral cp) .&. 0x3F
#endif
zzzzyyyyyyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
#ifndef HAVE_UTF8STRING
> encodeOne_threebyte :: Word16 -> [Word8]
> encodeOne_threebyte cp = [(0xE0.|.zs), (0x80.|.ys), (0x80.|.xs)]
>   where
>     xs, ys, zs :: Word8
>     xs = (fromIntegral cp) .&. 0x3F
>     ys = (fromIntegral (shiftR cp 6)) .&. 0x3F
>     zs = fromIntegral (shiftR cp 12)
#endif
000uuuzzzzzzyyyyyyxxxxxx -> 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
#ifndef HAVE_UTF8STRING
> encodeOne_fourbyte :: Word32 -> [Word8]
> encodeOne_fourbyte cp = [0xF0.|.us, 0x80.|.zs, 0x80.|.ys, 0x80.|.xs]
>   where
>     xs, ys, zs, us :: Word8
>     xs = (fromIntegral cp) .&. 0x3F
>     ys = (fromIntegral (shiftR cp 6)) .&. 0x3F
>     zs = (fromIntegral (shiftR cp 12)) .&. 0x3F
>     us = fromIntegral (shiftR cp 18)
#endif
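As a sanity check of the four-byte layout against Table 3.1B, the two ends
of the four-byte range can be packed by hand. The sketch below repeats the
packing under the hypothetical names fourByte and boundaries so that it
stands alone; it is an illustration and not part of the module.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8, Word32)

-- Standalone copy of the four-byte packing, under a hypothetical name.
fourByte :: Word32 -> [Word8]
fourByte cp =
  [ 0xF0 .|. fromIntegral (cp `shiftR` 18)             -- 11110uuu
  , 0x80 .|. (fromIntegral (cp `shiftR` 12) .&. 0x3F)  -- 10zzzzzz
  , 0x80 .|. (fromIntegral (cp `shiftR` 6) .&. 0x3F)   -- 10yyyyyy
  , 0x80 .|. (fromIntegral cp .&. 0x3F)                -- 10xxxxxx
  ]

-- The first and the last scalar of the four-byte range.
boundaries :: [[Word8]]
boundaries = map fourByte [0x10000, 0x10FFFF]
```

Here boundaries evaluates to F0 90 80 80 for U+10000 and F4 8F BF BF for
U+10FFFF, which agrees with the first-byte range F0..F4 and the second-byte
limits 90 and 8F given in the table.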