unicode-transforms: Unicode normalization

[ bsd3, data, library, text, unicode ]

Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).



Flags

Manual Flags

Name        Description                                        Default
----------  -------------------------------------------------  --------
dev         Developer build                                    Disabled
bench-show  Use bench-show to compare benchmarks               Disabled
has-icu     Use text-icu for benchmark and test comparisons    Disabled
has-llvm    Use llvm backend (faster) for compilation          Disabled
use-gauge   Use gauge instead of tasty-bench for benchmarking  Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Downloads

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Versions: 0.1.0.1, 0.2.0, 0.2.1, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.7.1, 0.3.8, 0.4.0, 0.4.0.1
Change log: Changelog.md
Dependencies: base (>=4.8 && <4.18), bytestring (>=0.9 && <0.12), ghc-prim (>=0.2 && <0.10), text (>=1.1.1 && <=1.2.5.0 || >=2.0 && <2.1), unicode-data (>=0.2 && <0.5)
License: BSD-3-Clause
Copyright: 2016-2017 Harendra Kumar, 2014–2015 Antonio Nikishaev
Author: Harendra Kumar
Maintainer: harendra.kumar@gmail.com
Revised: Revision 2 made by wismill at 2022-10-20T15:41:04Z
Category: Data, Text, Unicode
Home page: http://github.com/composewell/unicode-transforms
Bug tracker: https://github.com/composewell/unicode-transforms/issues
Source repo: head: git clone https://github.com/composewell/unicode-transforms
Uploaded: by adithyaov at 2022-03-21T11:41:13Z
Distributions: Arch:0.4.0.1, Debian:0.3.6, Fedora:0.3.7.1, LTSHaskell:0.4.0.1, NixOS:0.4.0.1, Stackage:0.4.0.1, openSUSE:0.4.0
Executables: chart
Downloads: 33857 total (221 in the last 30 days)
Status: Docs available
Last success reported on 2022-03-21

Readme for unicode-transforms-0.4.0.1


Unicode Transforms


Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

What is normalization?

Unicode characters with adornments such as diacritics (e.g. Á) can be represented in two different forms: as a single composed character (U+00C1 = Á) or as a sequence of decomposed characters (U+0041 (A) followed by U+0301 ( ́ , the combining acute accent) = Á). The two forms are encoded as different byte sequences, but to a human reader they look exactly the same.
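To make the two representations concrete, the following minimal sketch lists the code points of each spelling using only the base library. The names composed, decomposed and codePoints are hypothetical helpers chosen for illustration; they are not part of unicode-transforms.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- The two spellings of Á described above.
composed, decomposed :: String
composed   = "\x00C1"        -- single precomposed character
decomposed = "\x0041\x0301"  -- 'A' followed by the combining acute accent

-- | Render each character as a "U+XXXX" code point label.
-- Hypothetical helper, for illustration only.
codePoints :: String -> [String]
codePoints = map label
  where
    label c = "U+" ++ pad (map toUpper (showHex (ord c) ""))
    pad s   = replicate (4 - length s) '0' ++ s
```

In GHCi, codePoints composed evaluates to ["U+00C1"] and codePoints decomposed to ["U+0041","U+0301"], while composed == decomposed is False because the comparison is made code point by code point.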

A plain byte comparison may report two strings as different even though they are equivalent. To compare strings for equivalence, both must first be converted to the same normalized form using the Unicode Character Database. For example:

>> :set -XOverloadedStrings
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
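The same comparison can be wrapped in a small function for use in a compiled module. This is a minimal sketch, assuming the text and unicode-transforms packages are available; equivalentUnder is a hypothetical helper name, not something the library exports. Only normalize and the NormalizationMode constructors come from Data.Text.Normalize.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Text.Normalize (NormalizationMode (..), normalize)

-- | Compare two texts after normalizing both to the given form.
-- Hypothetical helper, not part of unicode-transforms.
equivalentUnder :: NormalizationMode -> Text -> Text -> Bool
equivalentUnder mode a b = normalize mode a == normalize mode b
```

With this helper, equivalentUnder NFC "\193" "\65\769" is True, matching the session above. The choice of form matters: the ligature "\xFB01" (ﬁ) and the two letters "fi" differ under NFC but compare equal under NFKC, because only the compatibility forms fold ligatures.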

Performance

The table below compares the normalization performance of this package (v0.3.7) with the text-icu package using the ICU C++ library version ICU4C 65.1 on macOS. The benchmarks measure the time taken, in milliseconds, to normalize files in different languages and normalization forms with each package. A positive % Diff means unicode-transforms was faster. It outperforms ICU in 19 of the 28 benchmarks; ICU is faster in the remaining 9, most notably for NFC on Japanese and AllChars input.

Benchmark       unicode-transforms(ms) ICU(ms)    % Diff
--------------- ---------------------- -------   --------
NFKD/Korean                       7.78   37.10    +376.87
NFD/Korean                        7.86   37.06    +371.50
NFKD/Vietnamese                   6.85   12.48     +82.20
NFKD/Deutsch                      2.17    3.55     +63.30
NFKD/English                      1.71    2.78     +62.30
NFKC/Korean                       4.77    7.65     +60.28
NFD/Deutsch                       2.24    3.53     +57.41
NFD/English                       1.76    2.77     +57.32
NFC/Vietnamese                   10.66   16.63     +56.00
NFKC/Vietnamese                  10.95   16.58     +51.43
NFD/Devanagari                    6.48    8.68     +34.10
NFC/Devanagari                    6.77    8.49     +25.48
NFD/AllChars                      6.18    7.41     +19.91
NFD/Japanese                      7.80    9.20     +17.99
NFKC/Devanagari                   7.33    8.48     +15.74
NFKD/Japanese                     8.71   10.05     +15.39
NFD/Vietnamese                    5.94    6.83     +14.99
NFKD/Devanagari                   7.59    8.68     +14.27
NFKD/AllChars                     9.80   10.66      +8.82
NFKC/Deutsch                      3.21    3.18      -0.72
NFC/Korean                        4.62    4.38      -5.35
NFKC/English                      2.21    2.06      -6.88
NFC/English                       2.19    2.04      -7.21
NFKC/AllChars                    14.67    9.75     -50.51
NFC/Deutsch                       3.02    1.95     -54.39
NFKC/Japanese                    12.46    5.42    -129.93
NFC/AllChars                      9.72    3.58    -171.63
NFC/Japanese                     11.90    3.04    -292.04
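The README does not define the % Diff column. The figures appear consistent with expressing the gap relative to the faster of the two timings, with a positive sign when unicode-transforms is faster. The sketch below encodes that reading; it is an inference from the table, not a documented formula, and percentDiff is a hypothetical name.

```haskell
-- | Percentage difference between two timings, relative to the faster one.
-- Positive when the first timing (unicode-transforms) is the faster one.
-- Assumption: this formula is inferred from the table, not from the source.
percentDiff :: Double -> Double -> Double
percentDiff ours icu
  | ours <= icu = (icu - ours) / ours * 100
  | otherwise   = negate ((ours - icu) / icu * 100)
```

Applied to the rounded timings in the table, percentDiff 7.78 37.10 and percentDiff 11.90 3.04 land within about one percentage point of the listed +376.87 and -292.04. The small residue is expected if the published percentages were computed from unrounded measurements.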

Talks

* Talks: Functional Conf 2018 Video | Functional Conf 2018 Slides

Contributing

Please raise issues or send pull requests at https://github.com/harendra-kumar/unicode-transforms.