License | BSD-style (see LICENSE) |
---|---|
Stability | experimental |
Safe Haskell | Trustworthy |
Language | Haskell2010 |
Warning: this is an internal module, and does not have a stable API or name. Functions in this module may not check or enforce preconditions expected by public modules. Use at your own risk!
Internals of Data.Text.Encoding.
Since: 2.0.2
Synopsis
- validateUtf8Chunk :: ByteString -> (Int, Maybe Utf8State)
- validateUtf8More :: Utf8State -> ByteString -> (Int, Maybe Utf8State)
- decodeUtf8Chunk :: ByteString -> (StrictTextBuilder, ByteString, Maybe Utf8State)
- decodeUtf8More :: Utf8State -> ByteString -> (StrictTextBuilder, ByteString, Maybe Utf8State)
- decodeUtf8With1 :: OnDecodeError -> String -> ByteString -> Text
- decodeUtf8With2 :: OnDecodeError -> String -> Utf8State -> ByteString -> (StrictTextBuilder, ByteString, Utf8State)
- data Utf8State
- startUtf8State :: Utf8State
- data StrictTextBuilder
- type StrictBuilder = StrictTextBuilder
- strictBuilderToText :: StrictTextBuilder -> Text
- textToStrictBuilder :: Text -> StrictTextBuilder
- skipIncomplete :: OnDecodeError -> String -> Utf8State -> StrictTextBuilder
- getCompleteLen :: Utf8State -> Int
- getPartialUtf8 :: Utf8State -> ByteString
Documentation
validateUtf8Chunk :: ByteString -> (Int, Maybe Utf8State) Source #
Validate a ByteString
as UTF-8-encoded text. To be continued using validateUtf8More
.
See also validateUtf8More
for details on the result of this function.
validateUtf8Chunk
=validateUtf8More
startUtf8State
Properties
Given:
validateUtf8Chunk
chunk = (n, ms)
The prefix is valid UTF-8. In particular, it should be accepted by this validation:
validateUtf8Chunk
(take
n chunk) = (n, JuststartUtf8State
)
validateUtf8More :: Utf8State -> ByteString -> (Int, Maybe Utf8State) Source #
Validate another ByteString
chunk in an ongoing stream of UTF-8-encoded text.
Returns a pair:
- The first component
n
is the end position, relative to the current chunk, of the longest prefix of the accumulated bytestring which is valid UTF-8.n
may be negative: that happens when an incomplete code point started in a previous chunk and is not completed by the current chunk (either that code point is still incomplete, or it is broken by an invalid byte). The second component
ms
indicates the following:- if
ms = Nothing
, the remainder of the chunk contains an invalid byte, within four bytes from positionn
; - if
ms = Just s'
, you can carry on validating another chunk by callingvalidateUtf8More
with the new states'
.
- if
Properties
Given:
validateUtf8More
s chunk = (n, ms)
If the chunk is invalid, it cannot be extended to be valid.
ms = Nothing ==>
validateUtf8More
s (chunk<>
more) = (n, Nothing)Validating two chunks sequentially is the same as validating them together at once:
ms = Just s' ==>
validateUtf8More
s (chunk<>
more) =first
(length
chunk+
) (validateUtf8More
s' more)
decodeUtf8Chunk :: ByteString -> (StrictTextBuilder, ByteString, Maybe Utf8State) Source #
Decode a chunk of UTF-8 text. To be continued with decodeUtf8More
.
See decodeUtf8More
for details on the result.
Properties
decodeUtf8Chunk
=decodeUtf8More
startUtf8State
Given:
decodeUtf8Chunk
chunk = (builder, rest, ms)
builder
is a prefix and rest
is a suffix of chunk
.
encodeUtf8
(strictBuilderToText
builder)<>
rest = chunk
Since: 2.0.2
decodeUtf8More :: Utf8State -> ByteString -> (StrictTextBuilder, ByteString, Maybe Utf8State) Source #
Decode another chunk in an ongoing UTF-8 stream.
Returns a triple:
- A
StrictBuilder
for the decoded chunk of text. You can accumulate chunks with(
or output them with<>
)toText
. - The undecoded remainder of the given chunk, for diagnosing errors and resuming (presumably after skipping some bytes).
Just
the new state, orNothing
if an invalid byte was encountered (it will be within the first 4 bytes of the undecoded remainder).
Properties
Given:
(pre, suf, ms) = decodeUtf8More
s chunk
If the output
pre
is nonempty (alternatively, iflength chunk > length suf
)s2b pre `
append
` suf = p2b s `append
` chunkwhere
s2b =
encodeUtf8
.toText
p2b =partUtf8ToByteString
If the output
pre
is empty (alternatively, iflength chunk = length suf
)suf = chunk
Decoding chunks separately is equivalent to decoding their concatenation.
Given:
(pre1, suf1, Just s1) =
decodeUtf8More
s chunk1 (pre2, suf2, ms2) =decodeUtf8More
s1 chunk2 (pre3, suf3, ms3) =decodeUtf8More
s (chunk1 `B.append` chunk2)we have:
s2b (pre1
<>
pre2) = s2b pre3 ms2 = ms3
decodeUtf8With1 :: OnDecodeError -> String -> ByteString -> Text Source #
Helper for decodeUtf8With
.
Since: 2.0.2
decodeUtf8With2 :: OnDecodeError -> String -> Utf8State -> ByteString -> (StrictTextBuilder, ByteString, Utf8State) Source #
Helper for decodeUtf8With
,
streamDecodeUtf8With
, and lazy
decodeUtf8With
,
which use an OnDecodeError
to process bad bytes.
See decodeUtf8Chunk
for a more flexible alternative.
Since: 2.0.2
State of decoding a ByteString
in UTF-8.
Enables incremental decoding (validateUtf8Chunk
, validateUtf8More
,
decodeUtf8Chunk
, decodeUtf8More
).
Since: 2.0.2
startUtf8State :: Utf8State Source #
Initial Utf8State
.
Since: 2.0.2
data StrictTextBuilder Source #
A delayed representation of strict Text
.
Since: 2.1.2
Instances
Monoid StrictTextBuilder Source # | |
Defined in Data.Text.Internal.StrictBuilder | |
Semigroup StrictTextBuilder Source # | Concatenation of |
Defined in Data.Text.Internal.StrictBuilder (<>) :: StrictTextBuilder -> StrictTextBuilder -> StrictTextBuilder # sconcat :: NonEmpty StrictTextBuilder -> StrictTextBuilder # stimes :: Integral b => b -> StrictTextBuilder -> StrictTextBuilder # |
type StrictBuilder = StrictTextBuilder Source #
strictBuilderToText :: StrictTextBuilder -> Text Source #
Use StrictBuilder
to build Text
.
Since: 2.0.2
textToStrictBuilder :: Text -> StrictTextBuilder Source #
Copy Text
in a StrictBuilder
Since: 2.0.2
Internal
skipIncomplete :: OnDecodeError -> String -> Utf8State -> StrictTextBuilder Source #
Call the error handler on each byte of the partial code point stored in
Utf8State
and append the results.
Exported for use in lazy decodeUtf8With
.
Since: 2.0.2
getCompleteLen :: Utf8State -> Int Source #
Exported for testing.
getPartialUtf8 :: Utf8State -> ByteString Source #
Exported for testing.