alfred-margaret: Fast Aho-Corasick string searching

This is a package candidate release! Here you can preview how this package release will appear once published to the main package index (which can be accomplished via the 'maintain' link below). Please note that once a package has been published to the main package index it cannot be undone! Please consult the package uploading documentation for more information.

[maintain] [Publish]

Warnings:

An efficient implementation of the Aho-Corasick string searching algorithm.


[Skip to Readme]

Properties

Versions 1.0.0.0, 1.1.1.0, 1.1.1.0, 1.1.2.0, 2.0.0.0, 2.1.0.0, 2.1.0.2
Change log None available
Dependencies aeson (>=1.4.2 && <1.6), base (>=4.7 && <5), containers (>=0.6 && <0.7), deepseq (>=1.4 && <1.5), hashable (>=1.2.7 && <1.4), primitive (>=0.6.4 && <0.8), text (>=1.2.3 && <1.3), unordered-containers (>=0.2.9 && <0.3), vector (>=0.12 && <0.13) [details]
License BSD-3-Clause
Copyright 2020 Channable
Author The Alfred-Margaret authors
Maintainer Ruud van Asseldonk <ruud@channable.com>, Fabian Thorand <fabian@channable.com>
Category Data, Text
Home page https://github.com/channable/alfred-margaret
Source repo head: git clone https://github.com/channable/alfred-margaret
Uploaded by rkrzr at 2020-10-19T08:21:10Z

Modules

[Index] [Quick Jump]

Flags

Manual Flags

NameDescriptionDefault
llvm

Compile with LLVM

Disabled
Automatic Flags
NameDescriptionDefault
aeson

Enable aeson support

Enabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag. More info

Downloads

Maintainer's Corner

Package maintainers

For package maintainers and hackage trustees


Readme for alfred-margaret-1.1.1.0

[back to package description]

Alfred–Margaret

Alfred–Margaret is a fast implementation of the Aho–Corasick string searching algorithm in Haskell. It powers many string-related operations in Channable.

The library is designed to work with the text package. It matches directly on the internal UTF-16 representation of Text for efficiency. See the announcement blog post for a deeper dive into Aho–Corasick, and the optimizations that make this library fast.

Alfred–Margaret is named after Alfred Aho and Margaret Corasick.

Performance

Running time to count all matches, in a real-world data set, comparing a Java implementation and a Rust implementation against Alfred–Margaret, and against memcopy to establish a lower bound:

For the full details of this benchmark, see our announcement blog post, which includes more details about the data set, the benchmark setup, and a few things to keep in mind when interpreting this graph.

Example

Check if a string contains one of the needles:

import qualified Data.Text.AhoCorasick.Searcher as Searcher

searcher = Searcher.build ["tshirt", "shirts", "shorts"]

Searcher.containsAny searcher "short tshirts"
> True

Searcher.containsAny searcher "long shirt"
> False

Searcher.containsAny searcher "Short TSHIRTS"
> False

Searcher.containsAnyIgnoreCase searcher "Short TSHIRTS"
> True

Sequentially replace many needles:

import Data.Text.AhoCorasick.Automaton (CaseSensitivity (..))
import qualified Data.Text.AhoCorasick.Replacer as Replacer

replacer = Replacer.build CaseSensitive [("tshirt", "banana"), ("shirt", "pear")]

Replacer.run replacer "tshirts for sale"
> "bananas for sale"

Replacer.run replacer "tshirts and shirts for sale"
> "bananas and pears for sale"

Replacer.run replacer "sweatshirts and shirtshirts"
> "sweabananas and shirbananas"

Replacer.run replacer "sweatshirts and shirttshirts"
> "sweabananas and pearbananas"

Get all matches, possibly overlapping:

import qualified Data.Text.AhoCorasick.Automaton as Aho

pairNeedleWithSelf text = (Aho.unpackUtf16 text, text)
automaton = Aho.build $ fmap pairNeedleWithSelf ["tshirt", "shirts", "shorts"]
allMatches = Aho.runText [] (\matches match -> Aho.Step (match : matches))

allMatches automaton "short tshirts"
> [ Match {matchPos = CodeUnitIndex 13, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 12, matchValue = "tshirt"}
> ]

allMatches automaton "sweatshirts and shirtshirts"
> [ Match {matchPos = CodeUnitIndex 27, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 26, matchValue = "tshirt"}
> , Match {matchPos = CodeUnitIndex 22, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 11, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 10, matchValue = "tshirt"}
> ]

License

Alfred–Margaret is licensed under the 3-clause BSD license.