alfred-margaret-2.1.0.0: Fast Aho-Corasick string searching
Safe HaskellSafe-Inferred
LanguageHaskell2010

Data.Text.BoyerMoore.Automaton

Description

An efficient implementation of the Boyer-Moore string search algorithm. http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140 https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm

This module contains a almost 1:1 translation from the C example code in the wikipedia article.

The algorithm here can be potentially improved by including the Galil rule (https:/en.wikipedia.orgwiki/Boyer%E2%80%93Moore_string-search_algorithm#The_Galil_rule)

Synopsis

Documentation

data Automaton Source #

A Boyer-Moore automaton is based on lookup-tables that allow skipping through the haystack. This allows for sub-linear matching in some cases, as we do not have to look at every input character.

NOTE: Unlike the AcMachine, a Boyer-Moore automaton only returns non-overlapping matches. This means that a Boyer-Moore automaton is not a 100% drop-in replacement for Aho-Corasick.

Returning overlapping matches would degrade the performance to O(nm) in pathological cases like finding aaaa in aaaaa....aaaaaa as for each match it would scan back the whole m characters of the pattern.

Instances

Instances details
FromJSON Automaton Source # 
Instance details

Defined in Data.Text.BoyerMoore.Automaton

ToJSON Automaton Source # 
Instance details

Defined in Data.Text.BoyerMoore.Automaton

Generic Automaton Source # 
Instance details

Defined in Data.Text.BoyerMoore.Automaton

Associated Types

type Rep Automaton :: Type -> Type #

Show Automaton Source # 
Instance details

Defined in Data.Text.BoyerMoore.Automaton

NFData Automaton Source # 
Instance details

Defined in Data.Text.BoyerMoore.Automaton

Methods

rnf :: Automaton -> () #

Eq Automaton Source # 
Instance details

Defined in Data.Text.BoyerMoore.Automaton

Hashable Automaton Source # 
Instance details

Defined in Data.Text.BoyerMoore.Automaton

type Rep Automaton Source # 
Instance details

Defined in Data.Text.BoyerMoore.Automaton

data CaseSensitivity Source #

Constructors

CaseSensitive 
IgnoreCase 

Instances

Instances details
FromJSON CaseSensitivity Source # 
Instance details

Defined in Data.Text.CaseSensitivity

ToJSON CaseSensitivity Source # 
Instance details

Defined in Data.Text.CaseSensitivity

Generic CaseSensitivity Source # 
Instance details

Defined in Data.Text.CaseSensitivity

Associated Types

type Rep CaseSensitivity :: Type -> Type #

Show CaseSensitivity Source # 
Instance details

Defined in Data.Text.CaseSensitivity

NFData CaseSensitivity Source # 
Instance details

Defined in Data.Text.CaseSensitivity

Methods

rnf :: CaseSensitivity -> () #

Eq CaseSensitivity Source # 
Instance details

Defined in Data.Text.CaseSensitivity

Hashable CaseSensitivity Source # 
Instance details

Defined in Data.Text.CaseSensitivity

type Rep CaseSensitivity Source # 
Instance details

Defined in Data.Text.CaseSensitivity

type Rep CaseSensitivity = D1 ('MetaData "CaseSensitivity" "Data.Text.CaseSensitivity" "alfred-margaret-2.1.0.0-GaLGdvCW2mGJuL9TH52qO1" 'False) (C1 ('MetaCons "CaseSensitive" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "IgnoreCase" 'PrefixI 'False) (U1 :: Type -> Type))

newtype CodeUnitIndex Source #

An index into the raw UTF-8 data of a Text. This is not the code point index as conventionally accepted by Text, so we wrap it to avoid confusing the two. Incorrect index manipulation can lead to surrogate pairs being sliced, so manipulate indices with care. This type is also used for lengths.

Constructors

CodeUnitIndex 

Fields

Instances

Instances details
FromJSON CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

ToJSON CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Bounded CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Generic CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Associated Types

type Rep CodeUnitIndex :: Type -> Type #

Num CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Show CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

NFData CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Methods

rnf :: CodeUnitIndex -> () #

Eq CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Ord CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Hashable CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Prim CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

type Rep CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

type Rep CodeUnitIndex = D1 ('MetaData "CodeUnitIndex" "Data.Text.Utf8" "alfred-margaret-2.1.0.0-GaLGdvCW2mGJuL9TH52qO1" 'True) (C1 ('MetaCons "CodeUnitIndex" 'PrefixI 'True) (S1 ('MetaSel ('Just "codeUnitIndex") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int)))

data Next a Source #

Constructors

Done !a 
Step !a 

patternLength :: Automaton -> CodeUnitIndex Source #

Length of the matched pattern measured in UTF-8 code units (bytes).

patternText :: Automaton -> Text Source #

Return the pattern that was used to construct the automaton.

runText :: forall a. a -> (a -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a Source #

Finds all matches in the text, calling the match callback with the *first* matched character of each match of the pattern.

NOTE: This is unlike Aho-Corasick, which reports the index of the character right after a match.

NOTE: In the UTF-16 version of this module, there is a function runLower which does lower-case matching. This function does not exist for the UTF-8 version since it is very tricky to skip code points going backwards without preprocessing the whole input first.

NOTE: To get full advantage of inlining this function, you probably want to compile the compiling module with -fllvm and the same optimization flags as this module.