Safe Haskell	None
Language	Haskell2010

Data.Text.BoyerMoore.Automaton

Description

An efficient implementation of the Boyer-Moore string search algorithm. See http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140 and https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm

This module contains an almost 1:1 translation of the C example code in the Wikipedia article.

The algorithm here could potentially be improved by including the Galil rule (https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm#The_Galil_rule).

Synopsis

data Automaton
data CaseSensitivity
  = CaseSensitive
  | IgnoreCase
buildAutomaton :: Text -> Automaton
runText :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a
runLower :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a
patternLength :: Automaton -> CodeUnitIndex
patternText :: Automaton -> Text
newtype CodeUnitIndex = CodeUnitIndex {
  codeUnitIndex :: Int
}
data Next a
  = Done !a
  | Step !a

Documentation

data Automaton

A Boyer-Moore automaton is based on lookup-tables that allow skipping through the haystack. This allows for sub-linear matching in some cases, as we do not have to look at every input character.

NOTE: Unlike the AcMachine, a Boyer-Moore automaton only returns non-overlapping matches. This means that a Boyer-Moore automaton is not a 100% drop-in replacement for Aho-Corasick.

Returning overlapping matches would degrade performance to O(nm) in pathological cases, such as finding aaaa in aaaaa....aaaaaa, because after each match the automaton would scan back over all m characters of the pattern.
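
The skipping idea can be illustrated with a minimal, self-contained sketch. It is not this module's implementation: it works on String rather than Text, uses only a bad-character table (the Horspool simplification of Boyer-Moore), and the names badCharTable and searchNonOverlapping are hypothetical. After a match it jumps ahead by the full pattern length, which is what makes the reported matches non-overlapping.

```haskell
module Main (main) where

import Data.Array (Array, listArray, (!))
import qualified Data.Map.Strict as Map

-- | Bad-character table (hypothetical helper): for every character of the
-- needle except the last, the distance from its last occurrence to the end
-- of the needle. Later occurrences overwrite earlier ones in 'Map.fromList'.
badCharTable :: String -> Map.Map Char Int
badCharTable needle =
  Map.fromList [(c, m - 1 - i) | (i, c) <- zip [0 ..] (init needle)]
  where
    m = length needle

-- | Start offsets of all non-overlapping matches, found left to right.
searchNonOverlapping :: String -> String -> [Int]
searchNonOverlapping needle haystack
  | null needle = []
  | otherwise = go 0
  where
    m = length needle
    n = length haystack
    pat = listArray (0, m - 1) needle :: Array Int Char
    hay = listArray (0, n - 1) haystack :: Array Int Char
    table = badCharTable needle
    -- Compare the window at pos with the needle, right to left.
    matchesAt pos = all (\j -> hay ! (pos + j) == pat ! j) [m - 1, m - 2 .. 0]
    go pos
      | pos + m > n = []
      -- On a match, skip the whole pattern: no overlapping matches.
      | matchesAt pos = pos : go (pos + m)
      -- On a mismatch, shift by the table entry for the last window
      -- character, or by the full pattern length if it is not in the needle.
      | otherwise = go (pos + Map.findWithDefault m (hay ! (pos + m - 1)) table)

main :: IO ()
main = print (searchNonOverlapping "aaaa" "aaaaaaaaa") -- prints [0,4]
```

In the pathological input above, the second match starts at offset 4 rather than 1, so each haystack character is compared at most once.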

Instances

Eq Automaton
Defined in Data.Text.BoyerMoore.Automaton. Methods: (==) :: Automaton -> Automaton -> Bool; (/=) :: Automaton -> Automaton -> Bool
Show Automaton
Defined in Data.Text.BoyerMoore.Automaton. Methods: showsPrec :: Int -> Automaton -> ShowS; show :: Automaton -> String; showList :: [Automaton] -> ShowS
Generic Automaton
Defined in Data.Text.BoyerMoore.Automaton. Associated types: type Rep Automaton :: Type -> Type. Methods: from :: Automaton -> Rep Automaton x; to :: Rep Automaton x -> Automaton
Hashable Automaton
Defined in Data.Text.BoyerMoore.Automaton. Methods: hashWithSalt :: Int -> Automaton -> Int; hash :: Automaton -> Int
ToJSON Automaton
Defined in Data.Text.BoyerMoore.Automaton. Methods: toJSON :: Automaton -> Value; toEncoding :: Automaton -> Encoding; toJSONList :: [Automaton] -> Value; toEncodingList :: [Automaton] -> Encoding
FromJSON Automaton
Defined in Data.Text.BoyerMoore.Automaton. Methods: parseJSON :: Value -> Parser Automaton; parseJSONList :: Value -> Parser [Automaton]
NFData Automaton
Defined in Data.Text.BoyerMoore.Automaton. Methods: rnf :: Automaton -> ()
type Rep Automaton
Defined in Data.Text.BoyerMoore.Automaton

data CaseSensitivity

Constructors

CaseSensitive
IgnoreCase

Instances

Eq CaseSensitivity
Defined in Data.Text.AhoCorasick.Automaton. Methods: (==) :: CaseSensitivity -> CaseSensitivity -> Bool; (/=) :: CaseSensitivity -> CaseSensitivity -> Bool
Show CaseSensitivity
Defined in Data.Text.AhoCorasick.Automaton. Methods: showsPrec :: Int -> CaseSensitivity -> ShowS; show :: CaseSensitivity -> String; showList :: [CaseSensitivity] -> ShowS
Generic CaseSensitivity
Defined in Data.Text.AhoCorasick.Automaton. Associated types: type Rep CaseSensitivity :: Type -> Type. Methods: from :: CaseSensitivity -> Rep CaseSensitivity x; to :: Rep CaseSensitivity x -> CaseSensitivity
Hashable CaseSensitivity
Defined in Data.Text.AhoCorasick.Automaton. Methods: hashWithSalt :: Int -> CaseSensitivity -> Int; hash :: CaseSensitivity -> Int
ToJSON CaseSensitivity
Defined in Data.Text.AhoCorasick.Automaton. Methods: toJSON :: CaseSensitivity -> Value; toEncoding :: CaseSensitivity -> Encoding; toJSONList :: [CaseSensitivity] -> Value; toEncodingList :: [CaseSensitivity] -> Encoding
FromJSON CaseSensitivity
Defined in Data.Text.AhoCorasick.Automaton. Methods: parseJSON :: Value -> Parser CaseSensitivity; parseJSONList :: Value -> Parser [CaseSensitivity]
NFData CaseSensitivity
Defined in Data.Text.AhoCorasick.Automaton. Methods: rnf :: CaseSensitivity -> ()
type Rep CaseSensitivity
Defined in Data.Text.AhoCorasick.Automaton. type Rep CaseSensitivity = D1 (MetaData "CaseSensitivity" "Data.Text.AhoCorasick.Automaton" "alfred-margaret-1.1.1.0-C7p4DoDIXY7azqNtjX433" False) (C1 (MetaCons "CaseSensitive" PrefixI False) (U1 :: Type -> Type) :+: C1 (MetaCons "IgnoreCase" PrefixI False) (U1 :: Type -> Type))

buildAutomaton :: Text -> Automaton

runText :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a

Finds all matches in the text, calling the match callback with the index of the *first* matched character of each match of the pattern.

NOTE: This is unlike Aho-Corasick, which reports the index of the character right after a match.

NOTE: To take full advantage of inlining this function, you probably want to compile the calling module with -fllvm and the same optimization flags as this module.
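
As a usage sketch based only on the signatures listed in this module (it assumes the alfred-margaret package is installed; countMatches is a hypothetical helper name), the accumulator can be a simple counter that the callback increments on every match:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.Text (Text)
import Data.Text.BoyerMoore.Automaton (Next (..), buildAutomaton, runText)

-- | Count the non-overlapping occurrences of the needle in the haystack.
-- The match position is ignored; 'Step' continues the search with the
-- updated counter.
countMatches :: Text -> Text -> Int
countMatches needle haystack =
  runText 0 (\count _matchStart -> Step (count + 1)) (buildAutomaton needle) haystack

main :: IO ()
main = print (countMatches "needle" "a needle in a haystack of needles")
```

If the same needle is searched for in many texts, building the automaton once and reusing it avoids recomputing the lookup tables.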

runLower :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a

Finds all matches in the lowercased text. This function lowercases the text on the fly to avoid allocating a second lowercased text array. Lowercasing is applied to individual code units, so the indexes into the lowercased text can be used to index into the original text. It is still the responsibility of the caller to lowercase the needles. Needles that contain uppercase code points will not match.

NOTE: To take full advantage of inlining this function, you probably want to compile the calling module with -fllvm and the same optimization flags as this module.
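
A possible case-insensitive search, sketched under the assumption that the alfred-margaret package is installed (lowerMatchStarts is a hypothetical helper name). The needle is passed in already lowercased, as required above, and the reported indices refer to the original, non-lowercased haystack:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.Text (Text)
import Data.Text.BoyerMoore.Automaton
  (CodeUnitIndex (..), Next (..), buildAutomaton, runLower)

-- | Start indices of all matches of an already-lowercased needle,
-- ignoring the case of the haystack.
lowerMatchStarts :: Text -> Text -> [CodeUnitIndex]
lowerMatchStarts lowercasedNeedle haystack =
  -- Matches are consed onto the accumulator, so reverse at the end.
  reverse (runLower [] (\acc start -> Step (start : acc)) automaton haystack)
  where
    automaton = buildAutomaton lowercasedNeedle

main :: IO ()
main = print (map codeUnitIndex (lowerMatchStarts "boyer" "Boyer and BOYER"))
```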

patternLength :: Automaton -> CodeUnitIndex

Length of the matched pattern measured in UTF-16 code units.
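
Because runText reports only where a match starts, patternLength can be used to recover where it ends. The following is a sketch, assuming the alfred-margaret package is installed; matchRanges is a hypothetical helper name, and the addition relies on the Num instance of CodeUnitIndex listed in this module:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.Text (Text)
import Data.Text.BoyerMoore.Automaton
  (Automaton, CodeUnitIndex (..), Next (..), buildAutomaton, patternLength, runText)

-- | Half-open (start, end) code unit ranges of all matches.
matchRanges :: Automaton -> Text -> [(CodeUnitIndex, CodeUnitIndex)]
matchRanges automaton haystack = reverse (runText [] onMatch automaton haystack)
  where
    len = patternLength automaton
    -- The end index is exclusive: it points just past the match.
    onMatch acc start = Step ((start, start + len) : acc)

main :: IO ()
main = print (matchRanges (buildAutomaton "ab") "xxabxxab")
```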

patternText :: Automaton -> Text

Return the pattern that was used to construct the automaton.

newtype CodeUnitIndex

An index into the raw UTF-16 data of a Text. This is not the code point index as conventionally accepted by Text, so we wrap it to avoid confusing the two. Incorrect index manipulation can lead to surrogate pairs being sliced, so manipulate indices with care. This type is also used for lengths.
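
The difference between the two kinds of index can be shown with a small base-only sketch. It does not use this library; utf16Width and codeUnitOffset are hypothetical names, and String stands in for Text. A code point above U+FFFF is encoded as a surrogate pair, so every code point after it sits one code unit further along than its code point index suggests:

```haskell
module Main (main) where

import Data.Char (ord)

-- | Number of UTF-16 code units needed to encode one code point.
utf16Width :: Char -> Int
utf16Width c = if ord c > 0xFFFF then 2 else 1

-- | Code unit offset at which the code point with the given
-- (zero-based) code point index starts.
codeUnitOffset :: Int -> String -> Int
codeUnitOffset codePointIndex = sum . map utf16Width . take codePointIndex

main :: IO ()
main = do
  -- 'a', then U+1F600 (outside the Basic Multilingual Plane), then 'z'.
  let s = "a\x1F600z"
  -- 'z' has code point index 2 but starts at code unit 3.
  print (codeUnitOffset 2 s) -- prints 3
```

Code unit offset 2 in this string falls between the two halves of the surrogate pair, which is the kind of slicing error the wrapper type is meant to make harder.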

Constructors

CodeUnitIndex
Fields codeUnitIndex :: Int

Instances

Bounded CodeUnitIndex
Defined in Data.Text.Utf16. Methods: minBound :: CodeUnitIndex; maxBound :: CodeUnitIndex
Eq CodeUnitIndex
Defined in Data.Text.Utf16. Methods: (==) :: CodeUnitIndex -> CodeUnitIndex -> Bool; (/=) :: CodeUnitIndex -> CodeUnitIndex -> Bool
Num CodeUnitIndex
Defined in Data.Text.Utf16. Methods: (+) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex; (-) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex; (*) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex; negate :: CodeUnitIndex -> CodeUnitIndex; abs :: CodeUnitIndex -> CodeUnitIndex; signum :: CodeUnitIndex -> CodeUnitIndex; fromInteger :: Integer -> CodeUnitIndex
Ord CodeUnitIndex
Defined in Data.Text.Utf16. Methods: compare :: CodeUnitIndex -> CodeUnitIndex -> Ordering; (<) :: CodeUnitIndex -> CodeUnitIndex -> Bool; (<=) :: CodeUnitIndex -> CodeUnitIndex -> Bool; (>) :: CodeUnitIndex -> CodeUnitIndex -> Bool; (>=) :: CodeUnitIndex -> CodeUnitIndex -> Bool; max :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex; min :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex
Show CodeUnitIndex
Defined in Data.Text.Utf16. Methods: showsPrec :: Int -> CodeUnitIndex -> ShowS; show :: CodeUnitIndex -> String; showList :: [CodeUnitIndex] -> ShowS
Generic CodeUnitIndex
Defined in Data.Text.Utf16. Associated types: type Rep CodeUnitIndex :: Type -> Type. Methods: from :: CodeUnitIndex -> Rep CodeUnitIndex x; to :: Rep CodeUnitIndex x -> CodeUnitIndex
Hashable CodeUnitIndex
Defined in Data.Text.Utf16. Methods: hashWithSalt :: Int -> CodeUnitIndex -> Int; hash :: CodeUnitIndex -> Int
ToJSON CodeUnitIndex
Defined in Data.Text.Utf16. Methods: toJSON :: CodeUnitIndex -> Value; toEncoding :: CodeUnitIndex -> Encoding; toJSONList :: [CodeUnitIndex] -> Value; toEncodingList :: [CodeUnitIndex] -> Encoding
FromJSON CodeUnitIndex
Defined in Data.Text.Utf16. Methods: parseJSON :: Value -> Parser CodeUnitIndex; parseJSONList :: Value -> Parser [CodeUnitIndex]
NFData CodeUnitIndex
Defined in Data.Text.Utf16. Methods: rnf :: CodeUnitIndex -> ()
type Rep CodeUnitIndex
Defined in Data.Text.Utf16. type Rep CodeUnitIndex = D1 (MetaData "CodeUnitIndex" "Data.Text.Utf16" "alfred-margaret-1.1.1.0-C7p4DoDIXY7azqNtjX433" True) (C1 (MetaCons "CodeUnitIndex" PrefixI True) (S1 (MetaSel (Just "codeUnitIndex") NoSourceUnpackedness NoSourceStrictness DecidedLazy) (Rec0 Int)))

data Next a

Result of handling a match: the callback can stop the automaton early by returning Done, or continue the search with a new accumulator by returning Step.

Constructors

Done !a
Step !a
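
Early exit with Done can be sketched as follows, assuming the alfred-margaret package is installed (firstMatch is a hypothetical helper name). The callback returns Done on the first match, so the rest of the haystack is not searched:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.Text (Text)
import Data.Text.BoyerMoore.Automaton
  (Automaton, CodeUnitIndex (..), Next (..), buildAutomaton, runText)

-- | Start index of the first match, or Nothing if the pattern
-- does not occur in the haystack.
firstMatch :: Automaton -> Text -> Maybe CodeUnitIndex
firstMatch = runText Nothing (\_ start -> Done (Just start))

main :: IO ()
main = do
  let automaton = buildAutomaton "ab"
  print (fmap codeUnitIndex (firstMatch automaton "xxabxxab"))
  print (fmap codeUnitIndex (firstMatch automaton "no match here"))
```

The initial accumulator Nothing is returned unchanged when the callback is never invoked.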