text-metrics: Calculate various string metrics efficiently

[ algorithms, bsd3, library, text ] [ Propose Tags ]

Calculate various string metrics efficiently.


[Skip to Readme]

Modules

[Index]

Flags

Manual Flags

NameDescriptionDefault
dev

Turn on development settings.

Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag. More info

Downloads

Maintainer's Corner

Package maintainers

For package maintainers and hackage trustees

Candidates

  • No Candidates
Versions [RSS] 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.3.2
Change log CHANGELOG.md
Dependencies base (>=4.7 && <5.0), nats (>=1 && <2), text (>=0.2 && <1.3) [details]
License BSD-3-Clause
Author Mark Karpov <markkarpov@openmailbox.org>
Maintainer Mark Karpov <markkarpov@openmailbox.org>
Category Text, Algorithms
Home page https://github.com/mrkkrp/text-metrics
Bug tracker https://github.com/mrkkrp/text-metrics/issues
Source repo head: git clone https://github.com/mrkkrp/text-metrics.git
Uploaded by mrkkrp at 2016-09-04T10:14:29Z
Distributions Arch:0.3.2, Debian:0.3.0, Fedora:0.3.2, LTSHaskell:0.3.2, NixOS:0.3.2, Stackage:0.3.2
Reverse Dependencies 12 direct, 16 indirect [details]
Downloads 17105 total (125 in the last 30 days)
Rating 2.0 (votes: 1) [estimated by Bayesian average]
Your Rating
  • λ
  • λ
  • λ
Status Docs available [build log]
Last success reported on 2016-11-20 [all 1 reports]

Readme for text-metrics-0.1.0

[back to package description]

Text Metrics

License BSD3 Hackage Stackage Nightly Stackage LTS Build Status Coverage Status

The library provides efficient implementations of various strings metrics. It works with strict Text values and returns either Natural numbers (because the metrics cannot be negative), or Ratio Natural values because returned values are rational non-negative numbers by definition.

The current version of the package implements:

TODO list:

Comparison with the edit-distance package

There is edit-distance package whose scope overlaps with the scope of this package. The differences are:

  • edit-distance allows to specify costs for every operation when calculating Levenshtein distance (insertion, deletion, substitution, and transposition). This is rarely needed though in real-world applications, IMO.

  • edit-distance only provides single Levenshtein distance, text-metrics aims to provide implementations of most string metrics algorithms.

  • edit-distance works on Strings, while text-metrics works on strict Text values.

  • As README.md of edit-distance states, “[the algorithms] have been fairly heavily optimized”, which is apparently true, yet the text-metrics is faster for short strings (under 64 characters) and even faster for longer strings (scales better). How much faster? For short strings more than ×2.5, and about ×4 for longer strings.

Implementation

All “meat” of the algorithms is written is C in a rather straightforward way. Levenshtein variants are based on the “iterative algorithm with two matrix rows” from Wikipedia with additional improvement that we do not copy current row of distances into previous row, but just swap the pointers (which is OK, since the arrays have equal length and current row will be overwritten in the next iteration anyway).

Normalized versions are defined as thin (inlined) Haskell wrappers.

License

Copyright © 2016 Mark Karpov

Distributed under BSD 3 clause license.