Safe Haskell | None |
---|---|
Language | Haskell2010 |
An efficient implementation of the Boyer-Moore string search algorithm. http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140 https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
This module contains a almost 1:1 translation from the C example code in the wikipedia article.
The algorithm here can be potentially improved by including the Galil rule (https:/en.wikipedia.orgwiki/Boyer%E2%80%93Moore_string-search_algorithm#The_Galil_rule)
Synopsis
- data Automaton
- data CaseSensitivity
- buildAutomaton :: Text -> Automaton
- runText :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a
- runLower :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a
- patternLength :: Automaton -> CodeUnitIndex
- patternText :: Automaton -> Text
- newtype CodeUnitIndex = CodeUnitIndex {
- codeUnitIndex :: Int
- data Next a
Documentation
A Boyer-Moore automaton is based on lookup-tables that allow skipping through the haystack. This allows for sub-linear matching in some cases, as we do not have to look at every input character.
NOTE: Unlike the AcMachine, a Boyer-Moore automaton only returns non-overlapping matches. This means that a Boyer-Moore automaton is not a 100% drop-in replacement for Aho-Corasick.
Returning overlapping matches would degrade the performance to O(nm) in pathological cases like
finding aaaa
in aaaaa....aaaaaa
as for each match it would scan back the whole m characters
of the pattern.
Instances
Eq Automaton Source # | |
Show Automaton Source # | |
Generic Automaton Source # | |
Hashable Automaton Source # | |
Defined in Data.Text.BoyerMoore.Automaton | |
ToJSON Automaton Source # | |
Defined in Data.Text.BoyerMoore.Automaton | |
FromJSON Automaton Source # | |
NFData Automaton Source # | |
Defined in Data.Text.BoyerMoore.Automaton | |
type Rep Automaton Source # | |
Defined in Data.Text.BoyerMoore.Automaton |
data CaseSensitivity Source #
Instances
buildAutomaton :: Text -> Automaton Source #
runText :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a Source #
Finds all matches in the text, calling the match callback with the *first* matched character of each match of the pattern.
NOTE: This is unlike Aho-Corasick, which reports the index of the character right after a match.
NOTE: To get full advantage of inlining this function, you probably want to compile the compiling module with -fllvm and the same optimization flags as this module.
runLower :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a Source #
Finds all matches in the lowercased text. This function lowercases the text on the fly to avoid allocating a second lowercased text array. Lowercasing is applied to individual code units, so the indexes into the lowercased text can be used to index into the original text. It is still the responsibility of the caller to lowercase the needles. Needles that contain uppercase code points will not match.
NOTE: To get full advantage of inlining this function, you probably want to compile the compiling module with -fllvm and the same optimization flags as this module.
patternLength :: Automaton -> CodeUnitIndex Source #
Length of the matched pattern measured in Utf16 code units.
patternText :: Automaton -> Text Source #
Return the pattern that was used to construct the automaton.
newtype CodeUnitIndex Source #
An index into the raw UTF-16 data of a Text
. This is not the code point
index as conventionally accepted by Text
, so we wrap it to avoid confusing
the two. Incorrect index manipulation can lead to surrogate pairs being
sliced, so manipulate indices with care. This type is also used for lengths.