Copyright | (c) 2018 Composewell Technologies |
---|---|
License | BSD3 |
Maintainer | streamly@composewell.com |
Stability | experimental |
Portability | GHC |
Safe Haskell | None |
Language | Haskell2010 |
Processing Unicode Strings
A Char stream is the canonical representation for processing Unicode strings. It can be processed efficiently using regular stream processing operations. A byte stream of Unicode text read from an IO device or from an Array in memory can be decoded into a Char stream using the decoding routines in this module. A String ([Char]) can be converted into a Char stream using fromList. An Array Char can be unfolded into a stream using the array read unfold.
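The three routes into a Char stream described above can be sketched as follows. This is a minimal illustration, not a definitive usage guide: it assumes streamly 0.7.x, and the module names Streamly.Prelude, Streamly.Data.Unicode.Stream and Streamly.Memory.Array are assumptions about where fromList, decodeUtf8 and the array read unfold live in that release. The names utf8Bytes, decodeBytes, countVowels and fromArray are hypothetical.

```haskell
import Data.Word (Word8)
import qualified Streamly.Prelude as S
import qualified Streamly.Data.Unicode.Stream as U  -- assumed module name
import qualified Streamly.Memory.Array as A         -- assumed module name

-- UTF-8 bytes for "h\233llo" (the second character, U+00E9, is 0xC3 0xA9).
utf8Bytes :: [Word8]
utf8Bytes = [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]

-- Decode a byte stream into a Char stream, then collect it as a String.
decodeBytes :: IO String
decodeBytes = S.toList $ U.decodeUtf8 $ S.fromList utf8Bytes

-- Convert a String into a Char stream and use ordinary stream operations.
countVowels :: IO Int
countVowels =
    S.length $ S.filter (`elem` "aeiou") $ S.fromList "streamly unicode"

-- Unfold an Array Char into a stream using the array read unfold.
fromArray :: IO String
fromArray = S.toList $ S.unfold A.read (A.fromList "array")

main :: IO ()
main = do
    decodeBytes >>= print
    countVowels >>= print
    fromArray >>= putStrLn
```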
Storing Unicode Strings
A stream of Char can be encoded into a byte stream using the encoding routines in this module and then written to IO devices or to arrays in memory.
If you have to store a Char stream in memory you can convert it into a String using toList or using the toList fold. The String type can be more efficient than pinned arrays for short and short-lived strings.
For longer or long-lived streams you can fold the Char stream as Array Char using the array write fold. The Array type provides a more compact representation and pinned memory, reducing GC overhead. If space efficiency is a concern you can use encodeUtf8 on the Char stream before writing it to an Array, providing an even more compact representation.
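The storage options above might look like the following sketch. It assumes streamly 0.7.x with the toList fold in Streamly.Data.Fold and the write fold in Streamly.Memory.Array; those module names, and the helper names toString, toCharArray and toUtf8Array, are assumptions made for illustration.

```haskell
import Data.Word (Word8)
import qualified Streamly.Prelude as S
import qualified Streamly.Data.Fold as FL           -- assumed module name
import qualified Streamly.Data.Unicode.Stream as U  -- assumed module name
import qualified Streamly.Memory.Array as A         -- assumed module name
import Streamly.Memory.Array (Array)

-- Short, short-lived text: collect the stream into a String.
toString :: IO String
toString = S.fold FL.toList (S.fromList "short")

-- Longer or long-lived text: fold the stream into an Array Char.
toCharArray :: IO (Array Char)
toCharArray = S.fold A.write (S.fromList "a longer, long-lived string")

-- More compact still: encode to UTF-8 first and store an Array Word8.
toUtf8Array :: IO (Array Word8)
toUtf8Array =
    S.fold A.write $ U.encodeUtf8 $ S.fromList "a longer, long-lived string"

main :: IO ()
main = do
    toString >>= putStrLn
    charArr <- toCharArray
    byteArr <- toUtf8Array
    -- Element counts: one element per Char vs. one element per encoded byte.
    print (A.length charArr)
    print (A.length byteArr)
```

For ASCII-only text both arrays hold the same number of elements, but each element of the Array Word8 occupies a single byte, which is where the space saving comes from.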
String Literals
SerialT Identity Char and Array Char are instances of IsString and IsList; therefore, the OverloadedStrings and OverloadedLists extensions can be used for convenience when specifying Unicode string literals using these types.
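As a small sketch of the IsString instances mentioned above, string literals can be given either type once OverloadedStrings is enabled. The import locations (SerialT from the Streamly module, Array and toList from Streamly.Memory.Array) are assumptions for streamly 0.7.x, and greetingStream and greetingArray are hypothetical names.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Functor.Identity (Identity, runIdentity)
import Streamly (SerialT)                     -- assumed module name
import qualified Streamly.Prelude as S
import qualified Streamly.Memory.Array as A   -- assumed module name
import Streamly.Memory.Array (Array)

-- The literal is interpreted as a pure Char stream.
greetingStream :: SerialT Identity Char
greetingStream = "hello"

-- The same literal interpreted as an array of Char.
greetingArray :: Array Char
greetingArray = "hello"

main :: IO ()
main = do
    putStrLn (runIdentity (S.toList greetingStream))
    putStrLn (A.toList greetingArray)
```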
Pitfalls
- Case conversion: Some Unicode characters translate to more than one code point on case conversion. The toUpper and toLower functions in the base package do not handle such characters. Therefore, operations like map toUpper on a character stream or character array may not always perform correct conversion.
- String comparison: In some cases, visually identical strings may have different Unicode representations; therefore, a character stream or character array cannot be directly compared. A normalized comparison may be needed to check string equivalence correctly.
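Both pitfalls can be reproduced with plain String values and the base package alone, since the same per-character functions would be mapped over a stream or array. The example strings and the names strasse, upperNaive, precomposed and decomposed are chosen here for illustration; non-ASCII characters are written as decimal escapes to keep the source file ASCII-only.

```haskell
import Data.Char (toUpper)

-- The German word "strasse" spelled with a sharp s (U+00DF, decimal 223).
strasse :: String
strasse = "stra\223e"

-- Per-character conversion, as performed by map toUpper.
upperNaive :: String -> String
upperNaive = map toUpper

-- Two spellings of the same visible word: a precomposed e-acute (U+00E9)
-- versus a plain 'e' followed by a combining acute accent (U+0301).
precomposed, decomposed :: String
precomposed = "caf\233"
decomposed = "cafe\769"

main :: IO ()
main = do
    -- The sharp s has no single-character uppercase form, so toUpper leaves
    -- it unchanged; a full Unicode case mapping would give "STRASSE".
    print (upperNaive strasse == "STRA\223E")   -- True
    -- The two spellings render identically but differ code point by code point.
    print (precomposed == decomposed)           -- False
    print (length precomposed, length decomposed)  -- (4,5)
```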
Experimental APIs
Some experimental APIs to conveniently process text using the Array Char representation directly can be found in Streamly.Internal.Memory.Unicode.Array.
Synopsis
- decodeLatin1 :: (IsStream t, Monad m) => t m Word8 -> t m Char
- decodeUtf8 :: (Monad m, IsStream t) => t m Word8 -> t m Char
- decodeUtf8Lax :: (Monad m, IsStream t) => t m Word8 -> t m Char
- encodeLatin1 :: (IsStream t, Monad m) => t m Char -> t m Word8
- encodeLatin1Lax :: (IsStream t, Monad m) => t m Char -> t m Word8
- encodeUtf8 :: (Monad m, IsStream t) => t m Char -> t m Word8
Construction (Decoding)
decodeLatin1 :: (IsStream t, Monad m) => t m Word8 -> t m Char
Decode a stream of bytes to Unicode characters by mapping each byte to the corresponding Unicode Char in the 0-255 range.
Since: 0.7.0
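A minimal usage sketch for decodeLatin1, assuming streamly 0.7.x and that this module is imported as Streamly.Data.Unicode.Stream (an assumption). The names latin1Bytes and decoded are hypothetical.

```haskell
import Data.Word (Word8)
import qualified Streamly.Prelude as S
import qualified Streamly.Data.Unicode.Stream as U  -- assumed module name

-- Latin-1 bytes: 'c', 'a', 'f', and 0xE9 (e with acute accent).
latin1Bytes :: [Word8]
latin1Bytes = [0x63, 0x61, 0x66, 0xE9]

-- Each byte maps directly to the Char with the same code point.
decoded :: IO String
decoded = S.toList $ U.decodeLatin1 $ S.fromList latin1Bytes

main :: IO ()
main = decoded >>= print
```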
decodeUtf8 :: (Monad m, IsStream t) => t m Word8 -> t m Char
Decode a UTF-8 encoded bytestream to a stream of Unicode characters. The incoming stream is truncated if an invalid codepoint is encountered.
Since: 0.7.0
decodeUtf8Lax :: (Monad m, IsStream t) => t m Word8 -> t m Char
Decode a UTF-8 encoded bytestream to a stream of Unicode characters. Any invalid codepoint encountered is replaced with the Unicode replacement character (U+FFFD).
Since: 0.7.0
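The two UTF-8 decoders can be compared with a short sketch. It assumes streamly 0.7.x and the module name Streamly.Data.Unicode.Stream; validBytes, invalidBytes, decodeStrict and decodeLax are hypothetical names. Only decodeUtf8Lax is run on the malformed input here, and the exact number of replacement characters it emits for a given malformed sequence is not stated by the documentation above, so the sketch only checks that one is present.

```haskell
import Data.Word (Word8)
import qualified Streamly.Prelude as S
import qualified Streamly.Data.Unicode.Stream as U  -- assumed module name

-- Well-formed UTF-8 for the euro sign (U+20AC).
validBytes :: [Word8]
validBytes = [0xE2, 0x82, 0xAC]

-- 0xFF can never appear in well-formed UTF-8.
invalidBytes :: [Word8]
invalidBytes = [0x61, 0xFF, 0x62]

decodeStrict :: [Word8] -> IO String
decodeStrict = S.toList . U.decodeUtf8 . S.fromList

decodeLax :: [Word8] -> IO String
decodeLax = S.toList . U.decodeUtf8Lax . S.fromList

main :: IO ()
main = do
    -- On valid input both decoders are expected to agree.
    decodeStrict validBytes >>= print
    decodeLax validBytes >>= print
    -- On invalid input the lax decoder substitutes U+FFFD (decimal 65533).
    lax <- decodeLax invalidBytes
    print ('\65533' `elem` lax)
```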
Elimination (Encoding)
encodeLatin1 :: (IsStream t, Monad m) => t m Char -> t m Word8
Encode a stream of Unicode characters to bytes by mapping each character to a byte in 0-255 range. Throws an error if the input stream contains characters beyond 255.
Since: 0.7.0
encodeLatin1Lax :: (IsStream t, Monad m) => t m Char -> t m Word8
Like encodeLatin1 but silently truncates and maps input characters beyond 255 to (incorrect) chars in 0-255 range. No error or exception is thrown when such truncation occurs.
Since: 0.7.0
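The difference between the two Latin-1 encoders can be sketched as below, assuming streamly 0.7.x and the module name Streamly.Data.Unicode.Stream. The names encodeStrict and encodeLax are hypothetical. The strict encoder is only given characters within 0-255; the lax encoder is given one character beyond 255, and since its result for that character is documented only as incorrect, the sketch does not rely on a specific byte value.

```haskell
import Data.Word (Word8)
import qualified Streamly.Prelude as S
import qualified Streamly.Data.Unicode.Stream as U  -- assumed module name

-- Every character is within 0-255, so the strict encoder is safe here.
encodeStrict :: IO [Word8]
encodeStrict = S.toList $ U.encodeLatin1 $ S.fromList "caf\233"

-- '\955' (Greek small lambda, U+03BB) is beyond 255. The lax encoder does
-- not fail; it emits some byte in the 0-255 range in its place.
encodeLax :: IO [Word8]
encodeLax = S.toList $ U.encodeLatin1Lax $ S.fromList "a\955"

main :: IO ()
main = do
    encodeStrict >>= print   -- [99,97,102,233]
    encodeLax >>= print
```

When the input may contain characters outside Latin-1 and silent corruption is unacceptable, encodeUtf8 is the safer choice because it can represent every Char.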