alfred-margaret: Fast Aho-Corasick string searching

This is a package candidate release! Here you can preview how this package release will appear once published to the main package index (which can be accomplished via the 'maintain' link below). Please note that once a package has been published to the main package index it cannot be undone! Please consult the package uploading documentation for more information.

[maintain] [Publish]

Warnings:

'ghc-options: -O2' is rarely needed. Check that it is giving a real benefit and not just imposing longer compile times on your users.

An efficient implementation of the Aho-Corasick string searching algorithm.

[Skip to Readme]

Properties

Versions	1.0.0.0, 1.1.1.0, 1.1.1.0, 1.1.2.0, 2.0.0.0, 2.1.0.0, 2.1.0.2
Change log	None available
Dependencies	aeson (>=1.4.2 && <1.6), base (>=4.7 && <5), containers (>=0.6 && <0.7), deepseq (>=1.4 && <1.5), hashable (>=1.2.7 && <1.4), primitive (>=0.6.4 && <0.8), text (>=1.2.3 && <1.3), unordered-containers (>=0.2.9 && <0.3), vector (>=0.12 && <0.13) [details]
License	BSD-3-Clause
Copyright	2020 Channable
Author	The Alfred-Margaret authors
Maintainer	Ruud van Asseldonk <ruud@channable.com>, Fabian Thorand <fabian@channable.com>
Category	Data, Text
Home page	https://github.com/channable/alfred-margaret
Source repo	head: git clone https://github.com/channable/alfred-margaret
Uploaded	by rkrzr at 2020-10-19T08:21:10Z

Modules

[Index] [Quick Jump]

Data
- Text

Flags

Manual Flags

Name	Description	Default
llvm	Compile with LLVM	Disabled

Automatic Flags

Name	Description	Default
aeson	Enable aeson support	Enabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag. More info

Downloads

alfred-margaret-1.1.1.0.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

fatho, rkrzr

For package maintainers and hackage trustees

edit package information

Readme for alfred-margaret-1.1.1.0

[back to package description]

Alfred–Margaret

Alfred–Margaret is a fast implementation of the Aho–Corasick string searching algorithm in Haskell. It powers many string-related operations in Channable.

The library is designed to work with the text package. It matches directly on the internal UTF-16 representation of Text for efficiency. See the announcement blog post for a deeper dive into Aho–Corasick, and the optimizations that make this library fast.

Alfred–Margaret is named after Alfred Aho and Margaret Corasick.

Performance

Running time to count all matches, in a real-world data set, comparing a Java implementation and a Rust implementation against Alfred–Margaret, and against memcopy to establish a lower bound:

Graph that shows that Alfred–Margaret is fast.

For the full details of this benchmark, see our announcement blog post, which includes more details about the data set, the benchmark setup, and a few things to keep in mind when interpreting this graph.

Example

Check if a string contains one of the needles:

import qualified Data.Text.AhoCorasick.Searcher as Searcher

searcher = Searcher.build ["tshirt", "shirts", "shorts"]

Searcher.containsAny searcher "short tshirts"
> True

Searcher.containsAny searcher "long shirt"
> False

Searcher.containsAny searcher "Short TSHIRTS"
> False

Searcher.containsAnyIgnoreCase searcher "Short TSHIRTS"
> True

Sequentially replace many needles:

import Data.Text.AhoCorasick.Automaton (CaseSensitivity (..))
import qualified Data.Text.AhoCorasick.Replacer as Replacer

replacer = Replacer.build CaseSensitive [("tshirt", "banana"), ("shirt", "pear")]

Replacer.run replacer "tshirts for sale"
> "bananas for sale"

Replacer.run replacer "tshirts and shirts for sale"
> "bananas and pears for sale"

Replacer.run replacer "sweatshirts and shirtshirts"
> "sweabananas and shirbananas"

Replacer.run replacer "sweatshirts and shirttshirts"
> "sweabananas and pearbananas"

Get all matches, possibly overlapping:

import qualified Data.Text.AhoCorasick.Automaton as Aho

pairNeedleWithSelf text = (Aho.unpackUtf16 text, text)
automaton = Aho.build $ fmap pairNeedleWithSelf ["tshirt", "shirts", "shorts"]
allMatches = Aho.runText [] (\matches match -> Aho.Step (match : matches))

allMatches automaton "short tshirts"
> [ Match {matchPos = CodeUnitIndex 13, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 12, matchValue = "tshirt"}
> ]

allMatches automaton "sweatshirts and shirtshirts"
> [ Match {matchPos = CodeUnitIndex 27, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 26, matchValue = "tshirt"}
> , Match {matchPos = CodeUnitIndex 22, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 11, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 10, matchValue = "tshirt"}
> ]

License

Alfred–Margaret is licensed under the 3-clause BSD license.