Copyright | (c) 2010 Bryan O'Sullivan |
---|---|
License | BSD-style |
Maintainer | bos@serpentine.com |
Stability | experimental |
Portability | GHC |
Safe Haskell | None |
Language | Haskell98 |
Commonly used functions for Unicode, implemented as bindings to the International Components for Unicode (ICU) libraries.
This module contains only the most commonly used types and functions. Other modules in this package expose richer interfaces.
Synopsis
- data LocaleName
- data Breaker a
- data Break a
- brkPrefix :: Break a -> Text
- brkBreak :: Break a -> Text
- brkSuffix :: Break a -> Text
- brkStatus :: Break a -> a
- data Line
- data Word
- = Uncategorized
- | Number
- | Letter
- | Kana
- | Ideograph
- breakCharacter :: LocaleName -> Breaker ()
- breakLine :: LocaleName -> Breaker Line
- breakSentence :: LocaleName -> Breaker ()
- breakWord :: LocaleName -> Breaker Word
- breaks :: Breaker a -> Text -> [Break a]
- breaksRight :: Breaker a -> Text -> [Break a]
- toCaseFold :: Bool -> Text -> Text
- toLower :: LocaleName -> Text -> Text
- toUpper :: LocaleName -> Text -> Text
- data CharIterator
- fromString :: String -> CharIterator
- fromText :: Text -> CharIterator
- fromUtf8 :: ByteString -> CharIterator
- data NormalizationMode
- normalize :: NormalizationMode -> Text -> Text
- quickCheck :: NormalizationMode -> Text -> Maybe Bool
- isNormalized :: NormalizationMode -> Text -> Bool
- data CompareOption
- compare :: [CompareOption] -> Text -> Text -> Ordering
- data Collator
- collator :: LocaleName -> Collator
- collatorWith :: LocaleName -> [Attribute] -> Collator
- collate :: Collator -> Text -> Text -> Ordering
- collateIter :: Collator -> CharIterator -> CharIterator -> Ordering
- sortKey :: Collator -> Text -> ByteString
- uca :: Collator
- data MatchOption
- data ParseError
- data Match
- data Regex
- class Regular r
- regex :: [MatchOption] -> Text -> Regex
- regex' :: [MatchOption] -> Text -> Either ParseError Regex
- pattern :: Regular r => r -> Text
- find :: Regex -> Text -> Maybe Match
- findAll :: Regex -> Text -> [Match]
- groupCount :: Regular r => r -> Int
- unfold :: (Int -> Match -> Maybe Text) -> Match -> [Text]
- span :: Match -> Text
- group :: Int -> Match -> Maybe Text
- prefix :: Int -> Match -> Maybe Text
- suffix :: Int -> Match -> Maybe Text
- data Spoof
- data SpoofParams = SpoofParams {
- spoofChecks :: Maybe [SpoofCheck]
- level :: Maybe RestrictionLevel
- locales :: Maybe [String]
- data SpoofCheck
- data RestrictionLevel
- data SpoofCheckResult
- spoof :: Spoof
- spoofWithParams :: SpoofParams -> Spoof
- spoofFromSource :: (ByteString, ByteString) -> SpoofParams -> Spoof
- spoofFromSerialized :: ByteString -> SpoofParams -> Spoof
- areConfusable :: Spoof -> Text -> Text -> SpoofCheckResult
- spoofCheck :: Spoof -> Text -> SpoofCheckResult
- getSkeleton :: Spoof -> Maybe SkeletonTypeOverride -> Text -> Text
- getChecks :: Spoof -> [SpoofCheck]
- getAllowedLocales :: Spoof -> [String]
- getRestrictionLevel :: Spoof -> Maybe RestrictionLevel
- serialize :: Spoof -> ByteString
Data representation
The Haskell Text type is implemented as an array in the Haskell heap. This means that its location is not pinned; it may be copied during a garbage collection pass. ICU, on the other hand, works with strings that are allocated in the normal system heap and have a fixed address.
To accommodate this need, these bindings use the functions from Data.Text.Foreign to copy data between the Haskell heap and the system heap. The copied strings are still managed automatically, but the need to duplicate data does add some performance and memory overhead.
Types
data LocaleName Source #
The name of a locale.
Root | The root locale. For a description of resource bundles and the root resource, see http://userguide.icu-project.org/locale/resources. |
Locale String | A specific locale. |
Current | The program's current locale. |
Instances
Boundary analysis
Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text. Examples of this process include:
- Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.
- Counting characters, words, sentences, or paragraphs.
- Making a list of the unique words in a document.
- Figuring out if a given range of text contains only whole words.
- Capitalizing the first letter of each word.
- Locating a particular unit of the text (for example, finding the third word in the document).
The Breaker type was designed to support these kinds of tasks.
For the impure boundary analysis API (which is richer, but less easy to use than the pure API), see the Data.Text.ICU.Break module. The impure API supports some uses that may be less efficient via the pure API, including:
- Locating the beginning of a word that the user has selected.
- Determining how far to move the text cursor when the user hits an arrow key (some characters require more than one position in the text store, and some characters in the text store do not display at all).
data Break a
A break in a string.
data Line
Line break status.
data Word
Word break status.
Uncategorized | A "word" that does not fit into another category. Includes spaces and most punctuation. |
Number | A word that appears to be a number. |
Letter | A word containing letters, excluding hiragana, katakana or ideographic characters. |
Kana | A word containing kana characters. |
Ideograph | A word containing ideographic characters. |
breakCharacter :: LocaleName -> Breaker () Source #
Break a string on character boundaries.
Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations. Please see Unicode Standard Annex #29, Unicode Text Segmentation, http://www.unicode.org/reports/tr29/ for additional information on grapheme clusters and guidelines on their use.
breakLine :: LocaleName -> Breaker Line Source #
Break a string on line boundaries.
Line boundary analysis determines where a text string can be broken when line wrapping. The mechanism correctly handles punctuation and hyphenated words.
breakSentence :: LocaleName -> Breaker () Source #
Break a string on sentence boundaries.
Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.
breakWord :: LocaleName -> Breaker Word Source #
Break a string on word boundaries.
Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word breaks on both sides.
breaks :: Breaker a -> Text -> [Break a] Source #
Return a list of all breaks in a string, from left to right.
breaksRight :: Breaker a -> Text -> [Break a] Source #
Return a list of all breaks in a string, from right to left.
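As a rough illustration of the pure boundary analysis API, the sketch below lists the words in a string and counts its grapheme clusters. The helper names extractWords and graphemeCount are hypothetical, and the choice of the "en" locale is an assumption made only to keep the behaviour predictable.

```haskell
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | Keep only the segments that the word breaker classifies as words,
-- dropping whitespace and punctuation (status 'ICU.Uncategorized').
-- Hypothetical helper, not part of the library.
extractWords :: ICU.LocaleName -> T.Text -> [T.Text]
extractWords loc txt =
  [ ICU.brkBreak b
  | b <- ICU.breaks (ICU.breakWord loc) txt
  , isWord (ICU.brkStatus b)
  ]
  where
    isWord ICU.Uncategorized = False
    isWord _                 = True

-- | Count user-perceived characters (extended grapheme clusters).
-- Hypothetical helper, not part of the library.
graphemeCount :: ICU.LocaleName -> T.Text -> Int
graphemeCount loc = length . ICU.breaks (ICU.breakCharacter loc)

main :: IO ()
main = do
  let en = ICU.Locale "en"
  print (extractWords en (T.pack "Hello, world! 42"))
  -- 'e' followed by a combining acute accent: two code points, one cluster.
  print (graphemeCount en (T.pack "e\x0301"))
```

Matching on the Word status rather than inspecting the text itself leaves the decision about what counts as a word to ICU.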
Case mapping
toCaseFold
:: Bool | Whether to include or exclude mappings for dotted and dotless I and i that are marked with |
-> Text | |
-> Text |
Case-fold the characters in a string.
Case folding is locale independent and not context sensitive, but there is an option for treating the letter I specially for Turkic languages. The result may be longer or shorter than the original.
toLower :: LocaleName -> Text -> Text Source #
Lowercase the characters in a string.
Casing is locale dependent and context sensitive. The result may be longer or shorter than the original.
toUpper :: LocaleName -> Text -> Text Source #
Uppercase the characters in a string.
Casing is locale dependent and context sensitive. The result may be longer or shorter than the original.
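The following sketch shows why these functions take a locale and why results can change length. The helper caselessEq is hypothetical, and passing False to toCaseFold is assumed here to select ICU's default (non-Turkic) folding.

```haskell
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | Caseless equality by comparing case-folded forms.
-- Hypothetical helper; 'False' is assumed to mean "default folding".
caselessEq :: T.Text -> T.Text -> Bool
caselessEq a b = ICU.toCaseFold False a == ICU.toCaseFold False b

main :: IO ()
main = do
  -- German sharp s uppercases to two letters, so the result is longer.
  print (ICU.toUpper (ICU.Locale "de") (T.pack "straße"))
  -- Turkish maps 'i' to dotted capital I (U+0130); English maps it to 'I'.
  print (ICU.toUpper (ICU.Locale "tr") (T.pack "i"))
  print (ICU.toUpper (ICU.Locale "en") (T.pack "i"))
  print (caselessEq (T.pack "Straße") (T.pack "STRASSE"))
```

For caseless matching, folding both sides is generally preferable to lowercasing them, since folding does not depend on the locale.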
Iteration
data CharIterator Source #
A type that supports efficient iteration over Unicode characters.
As an example of where this may be useful, a function using this type may be able to iterate over a UTF-8 ByteString directly, rather than first copying and converting it to an intermediate form. This type also allows e.g. comparison between Text and ByteString, with minimal overhead.
Instances
Eq CharIterator Source # | |
Defined in Data.Text.ICU.Iterator (==) :: CharIterator -> CharIterator -> Bool # (/=) :: CharIterator -> CharIterator -> Bool # | |
Ord CharIterator Source # | |
Defined in Data.Text.ICU.Iterator compare :: CharIterator -> CharIterator -> Ordering # (<) :: CharIterator -> CharIterator -> Bool # (<=) :: CharIterator -> CharIterator -> Bool # (>) :: CharIterator -> CharIterator -> Bool # (>=) :: CharIterator -> CharIterator -> Bool # max :: CharIterator -> CharIterator -> CharIterator # min :: CharIterator -> CharIterator -> CharIterator # | |
Show CharIterator Source # | |
Defined in Data.Text.ICU.Internal showsPrec :: Int -> CharIterator -> ShowS # show :: CharIterator -> String # showList :: [CharIterator] -> ShowS # |
fromString :: String -> CharIterator Source #
Construct a CharIterator from a Unicode string.
fromText :: Text -> CharIterator Source #
Construct a CharIterator from a Unicode string.
fromUtf8 :: ByteString -> CharIterator Source #
Construct a CharIterator from a Unicode string encoded as a UTF-8 ByteString. The validity of the encoded string is *not* checked.
Normalization
data NormalizationMode Source #
Normalization modes.
None | No decomposition/composition. |
NFD | Canonical decomposition. |
NFKD | Compatibility decomposition. |
NFC | Canonical decomposition followed by canonical composition. |
NFKC | Compatibility decomposition followed by canonical composition. |
FCD | "Fast C or D" form. |
Instances
Enum NormalizationMode Source # | |
Defined in Data.Text.ICU.Normalize succ :: NormalizationMode -> NormalizationMode # pred :: NormalizationMode -> NormalizationMode # toEnum :: Int -> NormalizationMode # fromEnum :: NormalizationMode -> Int # enumFrom :: NormalizationMode -> [NormalizationMode] # enumFromThen :: NormalizationMode -> NormalizationMode -> [NormalizationMode] # enumFromTo :: NormalizationMode -> NormalizationMode -> [NormalizationMode] # enumFromThenTo :: NormalizationMode -> NormalizationMode -> NormalizationMode -> [NormalizationMode] # | |
Eq NormalizationMode Source # | |
Defined in Data.Text.ICU.Normalize (==) :: NormalizationMode -> NormalizationMode -> Bool # (/=) :: NormalizationMode -> NormalizationMode -> Bool # | |
Show NormalizationMode Source # | |
Defined in Data.Text.ICU.Normalize showsPrec :: Int -> NormalizationMode -> ShowS # show :: NormalizationMode -> String # showList :: [NormalizationMode] -> ShowS # |
normalize :: NormalizationMode -> Text -> Text Source #
Normalize a string according to the specified normalization mode.
quickCheck :: NormalizationMode -> Text -> Maybe Bool Source #
Perform an efficient check on a string, to quickly determine if the string is in a particular normalization form.
A Nothing result indicates that a definite answer could not be determined quickly, and a more thorough check is required, e.g. with isNormalized. The user may have to convert the string to its normalized form and compare the results.
A result of Just True or Just False indicates that the string definitely is, or is not, in the given normalization form.
isNormalized :: NormalizationMode -> Text -> Bool Source #
Indicate whether a string is in a given normalization form.
Unlike quickCheck, this function returns a definitive result. For NFD, NFKD, and FCD normalization forms, both functions work in exactly the same way. For NFC and NFKC forms, where quickCheck may return Nothing, this function will perform further tests to arrive at a definitive result.
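The relationship between quickCheck and isNormalized can be sketched as a two-step test: try the cheap check, and fall back to the definitive one only when it cannot decide. The helper name inForm is hypothetical. Since isNormalized already returns a definitive answer by itself, the wrapper only makes the fallback explicit.

```haskell
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | Use the quick check when it is conclusive, otherwise fall back to
-- the thorough check. Hypothetical helper, not part of the library.
inForm :: ICU.NormalizationMode -> T.Text -> Bool
inForm mode txt =
  case ICU.quickCheck mode txt of
    Just answer -> answer
    Nothing     -> ICU.isNormalized mode txt

main :: IO ()
main = do
  let decomposed = T.pack "e\x0301"  -- 'e' + COMBINING ACUTE ACCENT
      composed   = ICU.normalize ICU.NFC decomposed
  -- Lengths in code points before and after composition.
  print (T.length decomposed, T.length composed)
  print (inForm ICU.NFC decomposed, inForm ICU.NFC composed)
```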
String comparison
Normalization-sensitive string comparison
data CompareOption Source #
Options to compare.
InputIsFCD | The caller knows that both strings fulfill the FCD conditions. |
CompareIgnoreCase | Compare strings case-insensitively using case folding, instead of case-sensitively. If set, then the following case folding options are used. |
FoldCaseExcludeSpecialI | When case folding, exclude the special I character. For use with Turkic (Turkish/Azerbaijani) text data. |
Instances
Enum CompareOption Source # | |
Defined in Data.Text.ICU.Normalize succ :: CompareOption -> CompareOption # pred :: CompareOption -> CompareOption # toEnum :: Int -> CompareOption # fromEnum :: CompareOption -> Int # enumFrom :: CompareOption -> [CompareOption] # enumFromThen :: CompareOption -> CompareOption -> [CompareOption] # enumFromTo :: CompareOption -> CompareOption -> [CompareOption] # enumFromThenTo :: CompareOption -> CompareOption -> CompareOption -> [CompareOption] # | |
Eq CompareOption Source # | |
Defined in Data.Text.ICU.Normalize (==) :: CompareOption -> CompareOption -> Bool # (/=) :: CompareOption -> CompareOption -> Bool # | |
Show CompareOption Source # | |
Defined in Data.Text.ICU.Normalize showsPrec :: Int -> CompareOption -> ShowS # show :: CompareOption -> String # showList :: [CompareOption] -> ShowS # |
compare :: [CompareOption] -> Text -> Text -> Ordering Source #
Compare two strings for canonical equivalence. Further options include case-insensitive comparison and code point order (as opposed to code unit order).
Canonical equivalence between two strings is defined as their normalized forms (NFD or NFC) being identical. This function compares strings incrementally instead of normalizing (and optionally case-folding) both strings entirely, improving performance significantly.
Bulk normalization is only necessary if the strings do not fulfill the FCD conditions. Only in this case, and only if the strings are relatively long, is memory allocated temporarily. For FCD strings and short non-FCD strings there is no memory allocation.
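To make canonical equivalence concrete, the sketch below compares a precomposed and a decomposed spelling of the same word. The two differ as code point sequences but are expected to compare as EQ here. The helper equivalentIgnoringCase is hypothetical.

```haskell
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | Canonical equivalence combined with case-insensitive comparison.
-- Hypothetical helper, not part of the library.
equivalentIgnoringCase :: T.Text -> T.Text -> Bool
equivalentIgnoringCase a b =
  ICU.compare [ICU.CompareIgnoreCase] a b == EQ

main :: IO ()
main = do
  let precomposed = T.pack "caf\x00e9"   -- ends in U+00E9
      combining   = T.pack "cafe\x0301"  -- ends in 'e' + U+0301
  print (precomposed == combining)                    -- code point equality
  print (ICU.compare [] precomposed combining == EQ)  -- canonical equivalence
  print (equivalentIgnoringCase (T.pack "CAF\x00c9") combining)
```

The import is qualified because compare would otherwise clash with the Prelude function of the same name.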
Locale-sensitive string collation
For the impure collation API (which is richer, but less easy to use than the pure API), see the Data.Text.ICU.Collate module.
collator :: LocaleName -> Collator Source #
collatorWith :: LocaleName -> [Attribute] -> Collator Source #
Create an immutable Collator with the given Attributes.
collateIter :: Collator -> CharIterator -> CharIterator -> Ordering Source #
Compare two CharIterators.
If either iterator was constructed from a ByteString, it does not need to be copied or converted beforehand, so this function can be quite cheap.
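A minimal sketch of the pure collation API follows, assuming an English collator. It sorts a list with collate, sorts the same list by precomputed sort keys, and compares a Text against a UTF-8 ByteString through iterators. All helper names are hypothetical.

```haskell
import qualified Data.ByteString.Char8 as B8
import           Data.List (sortBy)
import           Data.Ord (comparing)
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- Assumed locale for this example only.
english :: ICU.Collator
english = ICU.collator (ICU.Locale "en")

-- | Sort using pairwise collation.
sortCollated :: [T.Text] -> [T.Text]
sortCollated = sortBy (ICU.collate english)

-- | Sort by comparing sort keys, computing each key only once.
sortByKey :: [T.Text] -> [T.Text]
sortByKey ts =
  map snd (sortBy (comparing fst) [ (ICU.sortKey english t, t) | t <- ts ])

-- | Compare a Text with UTF-8 bytes without converting the bytes first.
textVsUtf8 :: T.Text -> B8.ByteString -> Ordering
textVsUtf8 t bs = ICU.collateIter english (ICU.fromText t) (ICU.fromUtf8 bs)

main :: IO ()
main = do
  let names = map T.pack ["banana", "apple", "Cherry"]
  print (sortCollated names)
  print (sortByKey names)
  print (textVsUtf8 (T.pack "apple") (B8.pack "apple"))
```

Sort keys tend to pay off when the same strings are compared many times, since each key is an ordinary ByteString that can be compared bytewise.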
Regular expressions
data MatchOption Source #
Options for controlling matching behaviour.
CaseInsensitive | Enable case insensitive matching. |
Comments | Allow comments and white space within patterns. |
DotAll | If set, |
Literal | If set, treat the entire pattern as a literal string. Metacharacters or escape sequences in the input sequence will be given no special meaning. The option |
Multiline | Control behaviour of |
HaskellLines | Haskell-only line endings. When this mode is enabled, only |
UnicodeWord | Unicode word boundaries. If set, Warning: Unicode word boundaries are quite different from traditional regular expression word boundaries. See http://unicode.org/reports/tr29/#Word_Boundaries. |
ErrorOnUnknownEscapes | Throw an error on unrecognized backslash escapes. If set, fail with an error on patterns that contain backslash-escaped ASCII letters without a known special meaning. If this flag is not set, these escaped letters represent themselves. |
WorkLimit Int | Set a processing limit for match operations. Some patterns, when matching certain strings, can run in exponential time. For practical purposes, the match operation may appear to be in an infinite loop. When a limit is set a match operation will fail with an error if the limit is exceeded. The units of the limit are steps of the match engine. Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but will typically be on the order of milliseconds. By default, the matching time is not limited. |
StackLimit Int | Set the amount of heap storage available for use by the match backtracking stack. ICU uses a backtracking regular expression engine, with the backtrack stack maintained on the heap. This option sets the limit on the amount of memory that can be used for this purpose. A backtracking stack overflow will result in an error from the match operation that caused it. A limit is desirable because a malicious or poorly designed pattern can use excessive memory, potentially crashing the process. A limit is enabled by default. |
Instances
Eq MatchOption Source # | |
Defined in Data.Text.ICU.Regex.Internal (==) :: MatchOption -> MatchOption -> Bool # (/=) :: MatchOption -> MatchOption -> Bool # | |
Show MatchOption Source # | |
Defined in Data.Text.ICU.Regex.Internal showsPrec :: Int -> MatchOption -> ShowS # show :: MatchOption -> String # showList :: [MatchOption] -> ShowS # |
data ParseError Source #
Detailed information about parsing errors. Used by ICU parsing engines that parse long rules, patterns, or programs, where the text being parsed is long enough that more information than an ICUError is needed to localize the error.
Instances
Show ParseError Source # | |
Defined in Data.Text.ICU.Error.Internal showsPrec :: Int -> ParseError -> ShowS # show :: ParseError -> String # showList :: [ParseError] -> ShowS # | |
Exception ParseError Source # | |
Defined in Data.Text.ICU.Error.Internal toException :: ParseError -> SomeException # fromException :: SomeException -> Maybe ParseError # displayException :: ParseError -> String # | |
NFData ParseError Source # | |
Defined in Data.Text.ICU.Error.Internal rnf :: ParseError -> () # |
data Match
A match for a regular expression.
data Regex
A compiled regular expression.
Regex values are usually constructed using the regex or regex' functions. This type is also an instance of IsString, so if you have the OverloadedStrings language extension enabled, you can construct a Regex by simply writing the pattern in quotes (though this does not allow you to specify any MatchOptions).
Construction
regex :: [MatchOption] -> Text -> Regex Source #
Compile a regular expression with the given options. This function throws a ParseError if the pattern is invalid, so it is best for use when the pattern is statically known.
regex' :: [MatchOption] -> Text -> Either ParseError Regex Source #
Compile a regular expression with the given options. This is safest to use when the pattern is constructed at run time.
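For patterns that arrive at run time, a sketch along these lines keeps compilation failures as ordinary values instead of exceptions. The helper compileUserPattern is hypothetical, and the CaseInsensitive option is chosen only for illustration.

```haskell
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | Compile a user-supplied pattern, turning a parse failure into a
-- printable message. Hypothetical helper, not part of the library.
compileUserPattern :: T.Text -> Either String ICU.Regex
compileUserPattern src =
  case ICU.regex' [ICU.CaseInsensitive] src of
    Left err -> Left (show err)
    Right re -> Right re

-- | Describe the outcome without needing to show the compiled regex.
describe :: Either String ICU.Regex -> String
describe = either (const "rejected") (const "compiled")

main :: IO ()
main = do
  putStrLn (describe (compileUserPattern (T.pack "(unclosed")))
  putStrLn (describe (compileUserPattern (T.pack "h[ae]llo")))
```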
Inspection
pattern :: Regular r => r -> Text Source #
Return the source form of the pattern used to construct this regular expression or match.
Searching
find :: Regex -> Text -> Maybe Match Source #
Find the first match for the regular expression in the given text.
findAll :: Regex -> Text -> [Match] Source #
Lazily find all matches for the regular expression in the given text.
Match groups
Capturing groups are numbered starting from zero. Group zero is always the entire matching text. Groups greater than zero contain the text matching each capturing group in a regular expression.
groupCount :: Regular r => r -> Int Source #
Return the number of capturing groups in this regular expression or match's pattern.
unfold :: (Int -> Match -> Maybe Text) -> Match -> [Text] Source #
A combinator for returning a list of all capturing groups on a Match.
span :: Match -> Text Source #
Return the span of text between the end of the previous match and the beginning of the current match.
group :: Int -> Match -> Maybe Text Source #
Return the nth capturing group in a match, or Nothing if n is out of bounds.
prefix :: Int -> Match -> Maybe Text Source #
Return the prefix of the nth capturing group in a match (the text from the start of the string to the start of the match), or Nothing if n is out of bounds.
suffix :: Int -> Match -> Maybe Text Source #
Return the suffix of the nth capturing group in a match (the text from the end of the match to the end of the string), or Nothing if n is out of bounds.
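The group accessors can be combined as in the sketch below, which collects every match in a string, splits a line around a separator with prefix and suffix, and lists all groups of the first match with unfold. The helper names are hypothetical.

```haskell
import           Data.Maybe (mapMaybe)
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | The text of every match (group zero of each).
allMatches :: ICU.Regex -> T.Text -> [T.Text]
allMatches re = mapMaybe (ICU.group 0) . ICU.findAll re

-- | Split "key=value" at the first '=' using the text before and
-- after the whole match.
keyValue :: T.Text -> Maybe (T.Text, T.Text)
keyValue line = do
  m <- ICU.find (ICU.regex [] (T.pack "=")) line
  k <- ICU.prefix 0 m
  v <- ICU.suffix 0 m
  pure (k, v)

-- | All groups of the first match: the whole match, then each capture.
firstGroups :: ICU.Regex -> T.Text -> [T.Text]
firstGroups re = maybe [] (ICU.unfold ICU.group) . ICU.find re

main :: IO ()
main = do
  print (allMatches (ICU.regex [] (T.pack "\\d+")) (T.pack "a1 b22 c333"))
  print (keyValue (T.pack "key=value"))
  print (firstGroups (ICU.regex [] (T.pack "(\\w+)@(\\w+)"))
                     (T.pack "mail bob@example now"))
```

The qualified import matters here too, since span, group, prefix, and suffix share names with common list functions.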
Spoof checking
The Spoof type performs security checks on visually confusable (spoof) strings. For the impure spoof checking API (which is richer, but less easy to use than the pure API), see the Data.Text.ICU.Spoof module.
See UTR #36 and UTS #39 for detailed information about the underlying algorithms and databases used by these functions.
data SpoofParams Source #
SpoofParams | Used to configure a |
|
Instances
Eq SpoofParams Source # | |
Defined in Data.Text.ICU.Spoof.Pure (==) :: SpoofParams -> SpoofParams -> Bool # (/=) :: SpoofParams -> SpoofParams -> Bool # | |
Show SpoofParams Source # | |
Defined in Data.Text.ICU.Spoof.Pure showsPrec :: Int -> SpoofParams -> ShowS # show :: SpoofParams -> String # showList :: [SpoofParams] -> ShowS # |
data SpoofCheck Source #
SingleScriptConfusable | Makes |
MixedScriptConfusable | Makes Makes |
WholeScriptConfusable | Makes |
AnyCase | By default, spoof checks assume the strings have been processed through |
RestrictionLevel | Checks that identifiers are no looser than the specified level passed to |
Invisible | Checks the identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. |
CharLimit | Checks whether the identifier contains only characters from a specified set (for example, via |
MixedNumbers | Checks that the identifier contains numbers from only a single script. |
AllChecks | Enables all checks. |
AuxInfo | Enables returning a |
Instances
Bounded SpoofCheck Source # | |
Defined in Data.Text.ICU.Spoof minBound :: SpoofCheck # maxBound :: SpoofCheck # | |
Enum SpoofCheck Source # | |
Defined in Data.Text.ICU.Spoof succ :: SpoofCheck -> SpoofCheck # pred :: SpoofCheck -> SpoofCheck # toEnum :: Int -> SpoofCheck # fromEnum :: SpoofCheck -> Int # enumFrom :: SpoofCheck -> [SpoofCheck] # enumFromThen :: SpoofCheck -> SpoofCheck -> [SpoofCheck] # enumFromTo :: SpoofCheck -> SpoofCheck -> [SpoofCheck] # enumFromThenTo :: SpoofCheck -> SpoofCheck -> SpoofCheck -> [SpoofCheck] # | |
Eq SpoofCheck Source # | |
Defined in Data.Text.ICU.Spoof (==) :: SpoofCheck -> SpoofCheck -> Bool # (/=) :: SpoofCheck -> SpoofCheck -> Bool # | |
Show SpoofCheck Source # | |
Defined in Data.Text.ICU.Spoof showsPrec :: Int -> SpoofCheck -> ShowS # show :: SpoofCheck -> String # showList :: [SpoofCheck] -> ShowS # |
data RestrictionLevel Source #
ASCII | Checks that the string contains only Unicode values in the ASCII range (U+0000 to U+007F inclusive). |
SingleScriptRestrictive | Checks that the string contains only characters from a single script. |
HighlyRestrictive | Checks that the string contains only characters from a single script, or from the combinations (Latin + Han + Hiragana + Katakana), (Latin + Han + Bopomofo), or (Latin + Han + Hangul). |
ModeratelyRestrictive | Checks that the string contains only characters from the combinations (Latin + Cyrillic + Greek + Cherokee), (Latin + Han + Hiragana + Katakana), (Latin + Han + Bopomofo), or (Latin + Han + Hangul). |
MinimallyRestrictive | Allows arbitrary mixtures of scripts. |
Unrestrictive | Allows any valid identifiers, including characters outside of the Identifier Profile. |
Instances
Bounded RestrictionLevel Source # | |
Defined in Data.Text.ICU.Spoof | |
Enum RestrictionLevel Source # | |
Defined in Data.Text.ICU.Spoof succ :: RestrictionLevel -> RestrictionLevel # pred :: RestrictionLevel -> RestrictionLevel # toEnum :: Int -> RestrictionLevel # fromEnum :: RestrictionLevel -> Int # enumFrom :: RestrictionLevel -> [RestrictionLevel] # enumFromThen :: RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # enumFromTo :: RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # enumFromThenTo :: RestrictionLevel -> RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # | |
Eq RestrictionLevel Source # | |
Defined in Data.Text.ICU.Spoof (==) :: RestrictionLevel -> RestrictionLevel -> Bool # (/=) :: RestrictionLevel -> RestrictionLevel -> Bool # | |
Show RestrictionLevel Source # | |
Defined in Data.Text.ICU.Spoof showsPrec :: Int -> RestrictionLevel -> ShowS # show :: RestrictionLevel -> String # showList :: [RestrictionLevel] -> ShowS # |
data SpoofCheckResult Source #
CheckOK | The string passed all configured spoof checks. |
CheckFailed [SpoofCheck] | The string failed one or more spoof checks. |
CheckFailedWithRestrictionLevel | The string failed one or more spoof checks, and failed to pass the configured restriction level. |
Instances
Eq SpoofCheckResult Source # | |
Defined in Data.Text.ICU.Spoof (==) :: SpoofCheckResult -> SpoofCheckResult -> Bool # (/=) :: SpoofCheckResult -> SpoofCheckResult -> Bool # | |
Show SpoofCheckResult Source # | |
Defined in Data.Text.ICU.Spoof showsPrec :: Int -> SpoofCheckResult -> ShowS # show :: SpoofCheckResult -> String # showList :: [SpoofCheckResult] -> ShowS # |
Construction
spoof :: Spoof
Open an immutable Spoof checker with default options (all SpoofChecks except CharLimit).
spoofWithParams :: SpoofParams -> Spoof Source #
Open an immutable Spoof checker with specific SpoofParams to control its behavior.
spoofFromSource :: (ByteString, ByteString) -> SpoofParams -> Spoof Source #
Open an immutable Spoof checker with specific SpoofParams to control its behavior and custom rules given the UTF-8 encoded contents of the confusables.txt and confusablesWholeScript.txt files as described in Unicode UTS #39.
spoofFromSerialized :: ByteString -> SpoofParams -> Spoof Source #
Create an immutable spoof checker with specific SpoofParams to control its behavior and custom rules previously returned by serialize.
String checking
areConfusable :: Spoof -> Text -> Text -> SpoofCheckResult Source #
Check two strings for confusability.
spoofCheck :: Spoof -> Text -> SpoofCheckResult Source #
Check a string for spoofing issues.
getSkeleton :: Spoof -> Maybe SkeletonTypeOverride -> Text -> Text Source #
Generates re-usable "skeleton" strings which can be used (via Unicode equality) to check if an identifier is confusable with some large set of existing identifiers.
If you cache the returned strings in storage, you must invalidate your cache any time the underlying confusables database changes (i.e., on ICU upgrade).
By default, assumes all input strings have been passed through toCaseFold and are lower-case. To change this, pass SkeletonAnyCase.
By default, builds skeletons which catch visually confusable characters across multiple scripts. Pass SkeletonSingleScript to override that behavior and build skeletons which catch visually confusable characters within a single script.
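As an illustration of skeleton-based checking, the sketch below compares a Latin string with a lookalike that substitutes a Cyrillic letter. It assumes the default checker returned by spoof and lower-case input. The helper sameSkeleton is hypothetical.

```haskell
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- Default checker, as returned by 'ICU.spoof'.
checker :: ICU.Spoof
checker = ICU.spoof

-- | Two identifiers are treated as confusable when their skeletons are
-- equal. Hypothetical helper, not part of the library.
sameSkeleton :: T.Text -> T.Text -> Bool
sameSkeleton a b =
  ICU.getSkeleton checker Nothing a == ICU.getSkeleton checker Nothing b

main :: IO ()
main = do
  let genuine   = T.pack "paypal"
      lookalike = T.pack "p\x0430ypal"  -- U+0430 is CYRILLIC SMALL LETTER A
  print (genuine == lookalike)           -- plain equality sees two strings
  print (sameSkeleton genuine lookalike)
  print (ICU.areConfusable checker genuine lookalike)
```

Storing the skeleton of each registered identifier allows a new identifier to be checked against the whole set with a single lookup, subject to the cache invalidation caveat above.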
Configuration
getChecks :: Spoof -> [SpoofCheck] Source #
Gets the checks currently configured in the spoof checker.
getAllowedLocales :: Spoof -> [String] Source #
Gets the locales whose scripts are currently allowed by the spoof checker. (We don't use LocaleName since the root and default locales have no meaning here.)
getRestrictionLevel :: Spoof -> Maybe RestrictionLevel Source #
Gets the restriction level currently configured in the spoof checker, if present.
Persistence
serialize :: Spoof -> ByteString Source #
Serializes the rules in this spoof checker to a byte array, suitable for re-use by spoofFromSerialized.
Only includes any data provided to openFromSource. Does not include any other state or configuration.