unicode-transforms: Unicode normalization

Warnings:

'ghc-options: -O0' is not needed. Use the --disable-optimization configure flag.
'ghc-options: -O2' is rarely needed. Check that it is giving a real benefit and not just imposing longer compile times on your users.

Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

Properties

Versions	0.1.0.1, 0.2.0, 0.2.1, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.7.1, 0.3.8, 0.4.0, 0.4.0.1
Change log	Changelog.md
Dependencies	base (>=4.8 && <4.17), bytestring (>=0.9 && <0.12), ghc-prim (>=0.2 && <0.9), text (>=1.1.1 && <=1.2.5.0 || >=2.0 && <2.1), unicode-data (>=0.2 && <0.4)
License	BSD-3-Clause
Copyright	2016-2017 Harendra Kumar, 2014–2015 Antonio Nikishaev
Author	Harendra Kumar
Maintainer	harendra.kumar@gmail.com
Category	Data, Text, Unicode
Home page	http://github.com/composewell/unicode-transforms
Bug tracker	https://github.com/composewell/unicode-transforms/issues
Source repo	head: git clone https://github.com/composewell/unicode-transforms
Uploaded	by adithyaov at 2022-03-17T12:20:19Z

Modules

Data
- ByteString
  - UTF8
    - Data.ByteString.UTF8.Normalize
- Text
  - Data.Text.Normalize
- Unicode
  - Data.Unicode.Types

Flags

Manual Flags

Name	Description	Default
dev	Developer build	Disabled
bench-show	Use bench-show to compare benchmarks	Disabled
has-icu	Use text-icu for benchmark and test comparisons	Disabled
has-llvm	Use llvm backend (faster) for compilation	Disabled
use-gauge	Use gauge instead of tasty-bench for benchmarking	Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Downloads

unicode-transforms-0.4.0.1.tar.gz (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

Bodigrim, harendra, adithyaov, wismill

Readme for unicode-transforms-0.4.0.1

Unicode Transforms

Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

What is normalization?

Unicode characters with adornments (e.g. Á) can be represented in two different forms: as a single composed character (U+00C1 = Á) or as a sequence of decomposed characters (U+0041 (A) followed by U+0301 ( ́ ) = Á). The two forms are encoded as different byte sequences, but to a human reader they look exactly the same.
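
To make the difference concrete, here is a minimal sketch that uses only `base`. The names `composed`, `decomposed` and `codePoints` are illustrative and not part of this package. It lists the code points of the two spellings of Á, which a plain equality check treats as different strings:

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- Illustrative names, not part of unicode-transforms.
composed, decomposed :: String
composed   = "\193"     -- U+00C1, the single composed character
decomposed = "\65\769"  -- U+0041 'A' followed by U+0301, the combining accent

-- Render each character of a string as a hexadecimal code point.
codePoints :: String -> [String]
codePoints = map (\c -> showHex (ord c) "")
```

Loaded in GHCi, `codePoints composed` gives `["c1"]` and `codePoints decomposed` gives `["41","301"]`, while `composed == decomposed` is `False`.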

A plain byte comparison may report that two strings differ even though they are equivalent. Both strings must be converted to the same normalized form, using the Unicode Character Database, before they can be compared for equivalence. For example:

>> :set -XOverloadedStrings
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
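
The same idea can be wrapped in small helpers. The sketch below assumes the `normalize` function and `NormalizationMode` constructors exported by `Data.Text.Normalize`; the helper names `canonicallyEqual` and `compatiblyEqual` are hypothetical and not provided by the package. It also illustrates how the compatibility forms differ from the canonical ones, using the "fi" ligature (U+FB01), which only NFKC and NFKD fold into the two letters "fi":

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Text.Normalize (NormalizationMode (..), normalize)

-- Hypothetical helper (not part of the package): two texts are treated as
-- canonically equivalent when their NFC forms are identical.
canonicallyEqual :: Text -> Text -> Bool
canonicallyEqual a b = normalize NFC a == normalize NFC b

-- Hypothetical helper: compatibility equivalence compares NFKC forms, which
-- additionally replace compatibility characters such as ligatures.
compatiblyEqual :: Text -> Text -> Bool
compatiblyEqual a b = normalize NFKC a == normalize NFKC b
```

With these definitions, `canonicallyEqual "\193" "\65\769"` holds, whereas `"\64257"` (the ligature) matches `"fi"` only under `compatiblyEqual`. Which of the two comparisons is appropriate depends on whether compatibility characters should count as the same text in the application at hand.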

Performance

The table below compares the normalization performance of this package (v0.3.7) with that of the text-icu package, which uses the ICU C++ library (ICU4C 65.1), on macOS. The benchmarks measure the time taken, in milliseconds, to normalize files in different languages and normalization forms with each package. unicode-transforms is faster than ICU in 19 of the 28 benchmarks; a positive % Diff means unicode-transforms was faster, and a negative one means ICU was faster.

Benchmark       unicode-transforms(ms) ICU(ms)    % Diff
--------------- ---------------------- -------   --------
NFKD/Korean                       7.78   37.10    +376.87
NFD/Korean                        7.86   37.06    +371.50
NFKD/Vietnamese                   6.85   12.48     +82.20
NFKD/Deutsch                      2.17    3.55     +63.30
NFKD/English                      1.71    2.78     +62.30
NFKC/Korean                       4.77    7.65     +60.28
NFD/Deutsch                       2.24    3.53     +57.41
NFD/English                       1.76    2.77     +57.32
NFC/Vietnamese                   10.66   16.63     +56.00
NFKC/Vietnamese                  10.95   16.58     +51.43
NFD/Devanagari                    6.48    8.68     +34.10
NFC/Devanagari                    6.77    8.49     +25.48
NFD/AllChars                      6.18    7.41     +19.91
NFD/Japanese                      7.80    9.20     +17.99
NFKC/Devanagari                   7.33    8.48     +15.74
NFKD/Japanese                     8.71   10.05     +15.39
NFD/Vietnamese                    5.94    6.83     +14.99
NFKD/Devanagari                   7.59    8.68     +14.27
NFKD/AllChars                     9.80   10.66      +8.82
NFKC/Deutsch                      3.21    3.18      -0.72
NFC/Korean                        4.62    4.38      -5.35
NFKC/English                      2.21    2.06      -6.88
NFC/English                       2.19    2.04      -7.21
NFKC/AllChars                    14.67    9.75     -50.51
NFC/Deutsch                       3.02    1.95     -54.39
NFKC/Japanese                    12.46    5.42    -129.93
NFC/AllChars                      9.72    3.58    -171.63
NFC/Japanese                     11.90    3.04    -292.04
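
The README does not say how the % Diff column is computed. The figures appear consistent with a percentage difference taken relative to the faster of the two timings, with the sign indicating which package was faster. The sketch below is only that assumed reconstruction, and `percentDiff` is a hypothetical name, not something taken from the package's benchmark tooling:

```haskell
-- Assumed reconstruction of the "% Diff" column; not taken from the
-- package's benchmark tooling.
percentDiff
  :: Double  -- ^ unicode-transforms time (ms)
  -> Double  -- ^ ICU time (ms)
  -> Double
percentDiff ut icu
  | icu >= ut = (icu - ut) / ut * 100           -- ICU slower: positive
  | otherwise = negate ((ut - icu) / icu * 100) -- ICU faster: negative
```

For the NFKD/Korean row, `percentDiff 7.78 37.10` comes out at roughly 377, in line with the table. Small deviations on other rows would be expected, since the timings shown are rounded to two decimal places.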

Talks

Functional Conf 2018 Video | Functional Conf 2018 Slides

Contributing

Please use https://github.com/harendra-kumar/unicode-transforms to raise issues or send pull requests.