Copyright | (c) 2009 2010 2011 Bryan O'Sullivan (c) 2009 Duncan Coutts (c) 2008 2009 Tom Harper (c) 2021 Andrew Lelechenko |
---|---|
License | BSD-style |
Maintainer | bos@serpentine.com |
Portability | portable |
Safe Haskell | Trustworthy |
Language | Haskell2010 |
Functions for converting Text
values to and from ByteString
,
using several standard encodings.
To gain access to a much larger family of encodings, use the text-icu package.
Synopsis
- decodeLatin1 :: ByteString -> Text
- decodeASCIIPrefix :: ByteString -> (Text, ByteString)
- decodeUtf8Lenient :: ByteString -> Text
- decodeUtf8' :: ByteString -> Either UnicodeException Text
- decodeASCII' :: ByteString -> Maybe Text
- decodeUtf8With :: OnDecodeError -> ByteString -> Text
- decodeUtf16LEWith :: OnDecodeError -> ByteString -> Text
- decodeUtf16BEWith :: OnDecodeError -> ByteString -> Text
- decodeUtf32LEWith :: OnDecodeError -> ByteString -> Text
- decodeUtf32BEWith :: OnDecodeError -> ByteString -> Text
- streamDecodeUtf8With :: OnDecodeError -> ByteString -> Decoding
- data Decoding = Some !Text !ByteString (ByteString -> Decoding)
- decodeUtf8Chunk :: ByteString -> (StrictTextBuilder, ByteString, Maybe Utf8State)
- decodeUtf8More :: Utf8State -> ByteString -> (StrictTextBuilder, ByteString, Maybe Utf8State)
- data Utf8State
- startUtf8State :: Utf8State
- type StrictBuilder = StrictTextBuilder
- data StrictTextBuilder
- strictBuilderToText :: StrictTextBuilder -> Text
- textToStrictBuilder :: Text -> StrictTextBuilder
- decodeASCII :: ByteString -> Text
- decodeUtf8 :: ByteString -> Text
- decodeUtf16LE :: ByteString -> Text
- decodeUtf16BE :: ByteString -> Text
- decodeUtf32LE :: ByteString -> Text
- decodeUtf32BE :: ByteString -> Text
- streamDecodeUtf8 :: ByteString -> Decoding
- encodeUtf8 :: Text -> ByteString
- encodeUtf16LE :: Text -> ByteString
- encodeUtf16BE :: Text -> ByteString
- encodeUtf32LE :: Text -> ByteString
- encodeUtf32BE :: Text -> ByteString
- encodeUtf8Builder :: Text -> Builder
- encodeUtf8BuilderEscaped :: BoundedPrim Word8 -> Text -> Builder
- validateUtf8Chunk :: ByteString -> (Int, Maybe Utf8State)
- validateUtf8More :: Utf8State -> ByteString -> (Int, Maybe Utf8State)
Decoding ByteStrings to Text
All of the single-parameter functions for decoding bytestrings encoded in one of the Unicode Transformation Formats (UTF) operate in a strict mode: each will throw an exception if given invalid input.
Each function has a variant, whose name is suffixed with -With
,
that gives greater control over the handling of decoding errors.
For instance, decodeUtf8
will throw an exception, but
decodeUtf8With
allows the programmer to determine what to do on a
decoding error.
Total Functions
These functions facilitate total decoding and should be preferred over their partial counterparts.
decodeLatin1 :: ByteString -> Text Source #
Decode a ByteString
containing Latin-1 (aka ISO-8859-1) encoded text.
decodeLatin1
is semantically equivalent to
Data.Text.pack . Data.ByteString.Char8.unpack
This is a total function. However, bear in mind that decoding Latin-1 (non-ASCII) characters to UTf-8 requires actual work and is not just buffer copying.
decodeASCIIPrefix :: ByteString -> (Text, ByteString) Source #
Decode a ByteString
containing ASCII text.
This is a total function which returns a pair of the longest ASCII prefix
as Text
, and the remaining suffix as ByteString
.
Important note: the pair is lazy. This lets you check for errors by testing
whether the second component is empty, without forcing the first component
(which does a copy).
To drop references to the input bytestring, force the prefix
(using seq
or BangPatterns
) and drop references to the suffix.
Properties
- If
(prefix, suffix) = decodeAsciiPrefix s
, then
.encodeUtf8
prefix <> suffix = s - Either
suffix
is empty, or
.head
suffix > 127
Since: 2.0.2
decodeUtf8Lenient :: ByteString -> Text Source #
Decode a ByteString
containing UTF-8 encoded text.
Any invalid input bytes will be replaced with the Unicode replacement character U+FFFD.
Since: 2.0
decodeUtf8' :: ByteString -> Either UnicodeException Text Source #
Decode a ByteString
containing UTF-8 encoded text.
If the input contains any invalid UTF-8 data, the relevant exception will be returned, otherwise the decoded text.
decodeASCII' :: ByteString -> Maybe Text Source #
Decode a ByteString
containing 7-bit ASCII encoded text.
This is a total function which returns either the ByteString
converted to a
Text
containing ASCII text, or Nothing
.
Use decodeASCIIPrefix
to retain the longest ASCII prefix for an invalid
input instead of discarding it.
Since: 2.0.2
Controllable error handling
decodeUtf8With :: OnDecodeError -> ByteString -> Text Source #
Decode a ByteString
containing UTF-8 encoded text.
Surrogate code points in replacement character returned by OnDecodeError
will be automatically remapped to the replacement char U+FFFD
.
decodeUtf16LEWith :: OnDecodeError -> ByteString -> Text Source #
Decode text from little endian UTF-16 encoding.
decodeUtf16BEWith :: OnDecodeError -> ByteString -> Text Source #
Decode text from big endian UTF-16 encoding.
decodeUtf32LEWith :: OnDecodeError -> ByteString -> Text Source #
Decode text from little endian UTF-32 encoding.
decodeUtf32BEWith :: OnDecodeError -> ByteString -> Text Source #
Decode text from big endian UTF-32 encoding.
Stream oriented decoding
The streamDecodeUtf8
and streamDecodeUtf8With
functions accept
a ByteString
that represents a possibly incomplete input (e.g. a
packet from a network stream) that may not end on a UTF-8 boundary.
- The maximal prefix of
Text
that could be decoded from the given input. - The suffix of the
ByteString
that could not be decoded due to insufficient input. - A function that accepts another
ByteString
. That string will be assumed to directly follow the string that was passed as input to the original function, and it will in turn be decoded.
To help understand the use of these functions, consider the Unicode
string "hi ☃"
. If encoded as UTF-8, this becomes "hi
\xe2\x98\x83"
; the final '☃'
is encoded as 3 bytes.
Now suppose that we receive this encoded string as 3 packets that
are split up on untidy boundaries: ["hi \xe2", "\x98",
"\x83"]
. We cannot decode the entire Unicode string until we
have received all three packets, but we would like to make progress
as we receive each one.
ghci> let s0@(Some
_ _ f0) =streamDecodeUtf8
"hi \xe2" ghci> s0Some
"hi " "\xe2" _
We use the continuation f0
to decode our second packet.
ghci> let s1@(Some
_ _ f1) = f0 "\x98" ghci> s1Some
"" "\xe2\x98"
We could not give f0
enough input to decode anything, so it
returned an empty string. Once we feed our second continuation f1
the last byte of input, it will make progress.
ghci> let s2@(Some
_ _ f2) = f1 "\x83" ghci> s2Some
"\x2603" "" _
If given invalid input, an exception will be thrown by the function or continuation where it is encountered.
streamDecodeUtf8With :: OnDecodeError -> ByteString -> Decoding Source #
Decode, in a stream oriented way, a lazy ByteString
containing UTF-8
encoded text.
Since: 1.0.0.0
A stream oriented decoding result.
Since: 1.0.0.0
Some !Text !ByteString (ByteString -> Decoding) |
Incremental UTF-8 decoding
The functions decodeUtf8Chunk
and decodeUtf8More
provide more
control for error-handling and streaming.
- Those functions return an UTF-8 prefix of the given
ByteString
up to the next error. For example this lets you insert or delete arbitrary text, or do some stateful operations before resuming, such as keeping track of error locations. In contrast, the older stream-oriented interface only lets you substitute a single fixedChar
for each invalid byte inOnDecodeError
. - That prefix is encoded as a
StrictBuilder
, so you can accumulate chunks before doing the copying work to construct aText
, or you can output decoded fragments immediately as a lazyText
.
For even lower-level primitives, see validateUtf8Chunk
and validateUtf8More
.
decodeUtf8Chunk :: ByteString -> (StrictTextBuilder, ByteString, Maybe Utf8State) Source #
Decode a chunk of UTF-8 text. To be continued with decodeUtf8More
.
See decodeUtf8More
for details on the result.
Properties
decodeUtf8Chunk
=decodeUtf8More
startUtf8State
Given:
decodeUtf8Chunk
chunk = (builder, rest, ms)
builder
is a prefix and rest
is a suffix of chunk
.
encodeUtf8
(strictBuilderToText
builder)<>
rest = chunk
Since: 2.0.2
decodeUtf8More :: Utf8State -> ByteString -> (StrictTextBuilder, ByteString, Maybe Utf8State) Source #
Decode another chunk in an ongoing UTF-8 stream.
Returns a triple:
- A
StrictBuilder
for the decoded chunk of text. You can accumulate chunks with(
or output them with<>
)toText
. - The undecoded remainder of the given chunk, for diagnosing errors and resuming (presumably after skipping some bytes).
Just
the new state, orNothing
if an invalid byte was encountered (it will be within the first 4 bytes of the undecoded remainder).
Properties
Given:
(pre, suf, ms) = decodeUtf8More
s chunk
If the output
pre
is nonempty (alternatively, iflength chunk > length suf
)s2b pre `
append
` suf = p2b s `append
` chunkwhere
s2b =
encodeUtf8
.toText
p2b =partUtf8ToByteString
If the output
pre
is empty (alternatively, iflength chunk = length suf
)suf = chunk
Decoding chunks separately is equivalent to decoding their concatenation.
Given:
(pre1, suf1, Just s1) =
decodeUtf8More
s chunk1 (pre2, suf2, ms2) =decodeUtf8More
s1 chunk2 (pre3, suf3, ms3) =decodeUtf8More
s (chunk1 `B.append` chunk2)we have:
s2b (pre1
<>
pre2) = s2b pre3 ms2 = ms3
State of decoding a ByteString
in UTF-8.
Enables incremental decoding (validateUtf8Chunk
, validateUtf8More
,
decodeUtf8Chunk
, decodeUtf8More
).
Since: 2.0.2
startUtf8State :: Utf8State Source #
Initial Utf8State
.
Since: 2.0.2
type StrictBuilder = StrictTextBuilder Source #
data StrictTextBuilder Source #
A delayed representation of strict Text
.
Since: 2.1.2
Instances
Monoid StrictTextBuilder Source # | |
Defined in Data.Text.Internal.StrictBuilder | |
Semigroup StrictTextBuilder Source # | Concatenation of |
Defined in Data.Text.Internal.StrictBuilder (<>) :: StrictTextBuilder -> StrictTextBuilder -> StrictTextBuilder # sconcat :: NonEmpty StrictTextBuilder -> StrictTextBuilder # stimes :: Integral b => b -> StrictTextBuilder -> StrictTextBuilder # |
strictBuilderToText :: StrictTextBuilder -> Text Source #
Use StrictBuilder
to build Text
.
Since: 2.0.2
textToStrictBuilder :: Text -> StrictTextBuilder Source #
Copy Text
in a StrictBuilder
Since: 2.0.2
Partial Functions
These functions are partial and should only be used with great caution (preferably not at all). See Data.Text.Encoding for better solutions.
decodeASCII :: ByteString -> Text Source #
Decode a ByteString
containing 7-bit ASCII encoded text.
This is a partial function: it checks that input does not contain anything except ASCII and copies buffer or throws an error otherwise.
decodeUtf8 :: ByteString -> Text Source #
Decode a ByteString
containing UTF-8 encoded text that is known
to be valid.
If the input contains any invalid UTF-8 data, an exception will be
thrown that cannot be caught in pure code. For more control over
the handling of invalid data, use decodeUtf8'
or
decodeUtf8With
.
This is a partial function: it checks that input is a well-formed UTF-8 sequence and copies buffer or throws an error otherwise.
decodeUtf16LE :: ByteString -> Text Source #
Decode text from little endian UTF-16 encoding.
If the input contains any invalid little endian UTF-16 data, an
exception will be thrown. For more control over the handling of
invalid data, use decodeUtf16LEWith
.
decodeUtf16BE :: ByteString -> Text Source #
Decode text from big endian UTF-16 encoding.
If the input contains any invalid big endian UTF-16 data, an
exception will be thrown. For more control over the handling of
invalid data, use decodeUtf16BEWith
.
decodeUtf32LE :: ByteString -> Text Source #
Decode text from little endian UTF-32 encoding.
If the input contains any invalid little endian UTF-32 data, an
exception will be thrown. For more control over the handling of
invalid data, use decodeUtf32LEWith
.
decodeUtf32BE :: ByteString -> Text Source #
Decode text from big endian UTF-32 encoding.
If the input contains any invalid big endian UTF-32 data, an
exception will be thrown. For more control over the handling of
invalid data, use decodeUtf32BEWith
.
Stream oriented decoding
streamDecodeUtf8 :: ByteString -> Decoding Source #
Decode, in a stream oriented way, a ByteString
containing UTF-8
encoded text that is known to be valid.
If the input contains any invalid UTF-8 data, an exception will be
thrown (either by this function or a continuation) that cannot be
caught in pure code. For more control over the handling of invalid
data, use streamDecodeUtf8With
.
Since: 1.0.0.0
Encoding Text to ByteStrings
encodeUtf8 :: Text -> ByteString Source #
Encode text using UTF-8 encoding.
encodeUtf16LE :: Text -> ByteString Source #
Encode text using little endian UTF-16 encoding.
encodeUtf16BE :: Text -> ByteString Source #
Encode text using big endian UTF-16 encoding.
encodeUtf32LE :: Text -> ByteString Source #
Encode text using little endian UTF-32 encoding.
encodeUtf32BE :: Text -> ByteString Source #
Encode text using big endian UTF-32 encoding.
Encoding Text using ByteString Builders
encodeUtf8Builder :: Text -> Builder Source #
Encode text to a ByteString Builder
using UTF-8 encoding.
Since: 1.1.0.0
encodeUtf8BuilderEscaped :: BoundedPrim Word8 -> Text -> Builder Source #
Encode text using UTF-8 encoding and escape the ASCII characters using
a BoundedPrim
.
Use this function is to implement efficient encoders for text-based formats like JSON or HTML.
Since: 1.1.0.0
ByteString validation
These functions are for validating ByteString
s as encoded text.
validateUtf8Chunk :: ByteString -> (Int, Maybe Utf8State) Source #
Validate a ByteString
as UTF-8-encoded text. To be continued using validateUtf8More
.
See also validateUtf8More
for details on the result of this function.
validateUtf8Chunk
=validateUtf8More
startUtf8State
Properties
Given:
validateUtf8Chunk
chunk = (n, ms)
The prefix is valid UTF-8. In particular, it should be accepted by this validation:
validateUtf8Chunk
(take
n chunk) = (n, JuststartUtf8State
)
validateUtf8More :: Utf8State -> ByteString -> (Int, Maybe Utf8State) Source #
Validate another ByteString
chunk in an ongoing stream of UTF-8-encoded text.
Returns a pair:
- The first component
n
is the end position, relative to the current chunk, of the longest prefix of the accumulated bytestring which is valid UTF-8.n
may be negative: that happens when an incomplete code point started in a previous chunk and is not completed by the current chunk (either that code point is still incomplete, or it is broken by an invalid byte). The second component
ms
indicates the following:- if
ms = Nothing
, the remainder of the chunk contains an invalid byte, within four bytes from positionn
; - if
ms = Just s'
, you can carry on validating another chunk by callingvalidateUtf8More
with the new states'
.
- if
Properties
Given:
validateUtf8More
s chunk = (n, ms)
If the chunk is invalid, it cannot be extended to be valid.
ms = Nothing ==>
validateUtf8More
s (chunk<>
more) = (n, Nothing)Validating two chunks sequentially is the same as validating them together at once:
ms = Just s' ==>
validateUtf8More
s (chunk<>
more) =first
(length
chunk+
) (validateUtf8More
s' more)