Safe Haskell | None |
---|---|
Language | Haskell2010 |
This module provides functions that allow treating Text values as sequences of UTF-16 code units instead of characters.
Synopsis
- type CodeUnit = Word16
- newtype CodeUnitIndex = CodeUnitIndex { codeUnitIndex :: Int }
- indexTextArray :: Array -> Int -> CodeUnit
- isCaseInvariant :: Text -> Bool
- lengthUtf16 :: Text -> CodeUnitIndex
- lowerCodeUnit :: CodeUnit -> CodeUnit
- lowerUtf16 :: Text -> Text
- unpackUtf16 :: Text -> [CodeUnit]
- unsafeCutUtf16 :: CodeUnitIndex -> CodeUnitIndex -> Text -> (Text, Text)
- unsafeIndexUtf16 :: Text -> CodeUnitIndex -> CodeUnit
- unsafeSliceUtf16 :: CodeUnitIndex -> CodeUnitIndex -> Text -> Text
- upperCodeUnit :: CodeUnit -> CodeUnit
- upperUtf16 :: Text -> Text
Documentation
type CodeUnit = Word16
A code unit is a 16-bit integer from which UTF-16 encoded text is built up.
The Text type is represented as a UTF-16 string.
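To make the notion of a code unit concrete, here is a minimal sketch of how a single character maps to one or two UTF-16 code units. The helper name `encodeUtf16` is hypothetical and not part of this module; the sketch uses only `base` and follows the standard UTF-16 surrogate pair arithmetic.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

type CodeUnit = Word16

-- | Hypothetical helper: encode one character as UTF-16 code units.
-- Code points below U+10000 fit in a single unit; everything above
-- is split into a high and a low surrogate.
encodeUtf16 :: Char -> [CodeUnit]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [high, low]
  where
    n    = ord c
    m    = n - 0x10000
    -- Top 10 bits go into the high surrogate, bottom 10 into the low one.
    high = fromIntegral (0xD800 + (m `shiftR` 10))
    low  = fromIntegral (0xDC00 + (m .&. 0x3FF))
```

With this sketch, `encodeUtf16 'a'` yields one code unit, while `encodeUtf16 '\x118A2'` yields the surrogate pair `[0xD806, 0xDCA2]`.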
newtype CodeUnitIndex
An index into the raw UTF-16 data of a Text. This is not the code point index as conventionally accepted by Text, so we wrap it to avoid confusing the two. Incorrect index manipulation can lead to surrogate pairs being sliced, so manipulate indices with care. This type is also used for lengths.
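The difference between the two kinds of index can be illustrated with a small sketch that counts code units in a `String`. The function `codeUnitLength` is a hypothetical stand-in working on `String` rather than on `Text`; only the newtype mirrors the declaration above.

```haskell
import Data.Char (ord)

newtype CodeUnitIndex = CodeUnitIndex { codeUnitIndex :: Int }
  deriving (Eq, Ord, Show)

-- | Hypothetical helper: length of a string in UTF-16 code units.
-- Characters outside the BMP need a surrogate pair, so they count twice.
codeUnitLength :: String -> CodeUnitIndex
codeUnitLength = CodeUnitIndex . sum . map unitsFor
  where
    unitsFor c = if ord c < 0x10000 then 1 else 2
```

For the two-character string `['a', '\x118A2']`, `length` reports 2 while `codeUnitLength` reports `CodeUnitIndex 3`, which is why the wrapper type helps keep the two from being mixed up.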
indexTextArray :: Array -> Int -> CodeUnit
Retrieve a code unit from Text's internal representation.
isCaseInvariant :: Text -> Bool
Return whether the text is the same in lowercase as in uppercase. This function never returns True for a text that Aho–Corasick would treat differently when doing case-insensitive matching.
lengthUtf16 :: Text -> CodeUnitIndex
Return the length of the text, in number of code units.
lowerCodeUnit :: CodeUnit -> CodeUnit
Convert CodeUnits that represent a character on their own (i.e. that are not part of a surrogate pair) to their lower case representation.
This function has a special code path for ASCII characters, because Char.toLower is **incredibly** slow. You can see the implementation for yourself at https://github.com/ghc/ghc/blob/ghc-8.6.3-release/libraries/base/cbits/WCsubst.c#L4732 (it does a binary search on 1276 casing rules).
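A minimal sketch of what such an ASCII fast path could look like follows. This is not the module's actual implementation: the name `lowerCodeUnitSketch` is hypothetical, and the fallback through `Data.Char.toLower` is an assumption made for illustration.

```haskell
import Data.Char (chr, ord, toLower)
import Data.Word (Word16)

type CodeUnit = Word16

-- | Hypothetical sketch of per-code-unit lowercasing.
lowerCodeUnitSketch :: CodeUnit -> CodeUnit
lowerCodeUnitSketch cu
  -- Fast path: ASCII 'A'..'Z' differ from 'a'..'z' by 0x20.
  | cu >= 0x41 && cu <= 0x5A = cu + 0x20
  -- Fast path: all other ASCII code units have no lowercase form.
  | cu < 0x80 = cu
  -- Surrogates are halves of a pair, not characters: leave them alone.
  | cu >= 0xD800 && cu <= 0xDFFF = cu
  -- Slow path: fall back to the general (slow) casing table.
  | otherwise =
      let lowered = ord (toLower (chr (fromIntegral cu)))
      -- Defensive: keep the unit unchanged if the result would not fit.
      in if lowered < 0x10000 then fromIntegral lowered else cu
```

The two ASCII guards handle the most common input without touching the casing table, so the binary search is only paid for non-ASCII code units.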
lowerUtf16 :: Text -> Text
Lowercase each individual code unit of a text without changing their index. This is not a proper case folding, but it does ensure that indices into the lowercased string correspond to indices into the original string.
Differences from toLower include code points in the BMP that lowercase to multiple code points, and code points outside of the BMP.
For example, "İ" (U+0130), which toLower converts to "i̇" (U+0069, U+0307), is converted into U+0069 only by lowerUtf16.
Another example is "𑢢" (U+118A2), a code point from the Warang Citi writing system in the Supplementary Multilingual Plane, introduced in 2014 in Unicode 7. It would be lowercased to U+118C2 by toLower, but it is left untouched by lowerUtf16.
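The two cases described here can be reproduced with a small list-based sketch. The names `lowerUnits` and `isSurrogate` are hypothetical, the input is a plain list of code units rather than a Text, and the per-unit mapping through `Data.Char.toLower` is an assumption, not the module's code.

```haskell
import Data.Char (chr, ord, toLower)
import Data.Word (Word16)

type CodeUnit = Word16

isSurrogate :: CodeUnit -> Bool
isSurrogate cu = cu >= 0xD800 && cu <= 0xDFFF

-- | Hypothetical sketch: lowercase every code unit on its own.
-- Mapping one unit to exactly one unit keeps all indices stable.
lowerUnits :: [CodeUnit] -> [CodeUnit]
lowerUnits = map lowerOne
  where
    lowerOne cu
      -- Halves of a surrogate pair are never touched.
      | isSurrogate cu = cu
      -- Single-character mapping; assumed to stay within the BMP.
      | otherwise = fromIntegral (ord (toLower (chr (fromIntegral cu))))
```

Under these assumptions, `lowerUnits [0x0130]` gives `[0x0069]` (a single unit, with no U+0307 appended), and the surrogate pair `[0xD806, 0xDCA2]` for U+118A2 passes through unchanged.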
unsafeCutUtf16 :: CodeUnitIndex -> CodeUnitIndex -> Text -> (Text, Text)
The complement of unsafeSliceUtf16: removes the slice, and returns the part before and after. See unsafeSliceUtf16 for details.
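The relationship between cutting and slicing can be sketched on a plain list of code units. The name `cutUnits` is hypothetical, and a list is only a stand-in for the underlying UTF-16 array.

```haskell
import Data.Word (Word16)

type CodeUnit = Word16

-- | Hypothetical list-based sketch: drop the slice that starts at
-- offset 'begin' and spans 'len' code units, and return what is left
-- before and after it. No bounds or surrogate checks are performed.
cutUnits :: Int -> Int -> [CodeUnit] -> ([CodeUnit], [CodeUnit])
cutUnits begin len units = (take begin units, drop (begin + len) units)
```

For the code units `[0x0061, 0xD806, 0xDCA2, 0x0062]`, `cutUnits 1 2` removes the surrogate pair in the middle and returns `([0x0061], [0x0062])`.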
unsafeIndexUtf16 :: Text -> CodeUnitIndex -> CodeUnit
Return the code unit (not character) with the given index. Note: The bounds are not checked.
unsafeSliceUtf16 :: CodeUnitIndex -> CodeUnitIndex -> Text -> Text
Extract a substring from a text, at a code unit offset and length. This is similar to `Text.take length . Text.drop begin`, except that the begin and length are in code *units*, not code points, so we can slice the UTF-16 array, and we don't have to walk the entire text to take surrogate pairs into account. It is the responsibility of the user to not slice surrogate pairs, and to ensure that the length is within bounds, hence this function is unsafe.
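The hazard of slicing through a surrogate pair can be shown with a list-based sketch. All names here (`sliceUnits`, `splitsNoPair`, `sample`) are hypothetical. Unlike the real function, which slices the array directly, this list version walks the input, so it only illustrates the semantics and not the performance.

```haskell
import Data.Word (Word16)

type CodeUnit = Word16

-- | Hypothetical stand-in for slicing by code unit offset and length.
sliceUnits :: Int -> Int -> [CodeUnit] -> [CodeUnit]
sliceUnits begin len = take len . drop begin

isHighSurrogate, isLowSurrogate :: CodeUnit -> Bool
isHighSurrogate cu = cu >= 0xD800 && cu <= 0xDBFF
isLowSurrogate  cu = cu >= 0xDC00 && cu <= 0xDFFF

-- | True when a slice of well-formed UTF-16 neither starts on a low
-- surrogate nor ends on a high surrogate, i.e. no pair was cut in half.
splitsNoPair :: [CodeUnit] -> Bool
splitsNoPair [] = True
splitsNoPair us@(u : _) = not (isLowSurrogate u) && not (isHighSurrogate (last us))

-- | The code units of "a", U+118A2, "b".
sample :: [CodeUnit]
sample = [0x0061, 0xD806, 0xDCA2, 0x0062]
```

Here `sliceUnits 1 2 sample` extracts the complete pair, whereas `sliceUnits 0 2 sample` ends on the high surrogate `0xD806` and `splitsNoPair` reports the problem. A caller of the unsafe function has to rule out such offsets in advance.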
upperCodeUnit :: CodeUnit -> CodeUnit
Analogous to lowerCodeUnit.
upperUtf16 :: Text -> Text
Uppercase each individual code unit of a text without changing their index.
See also lowerUtf16 and lowerCodeUnit.