Safe Haskell	Safe-Inferred
Language	Haskell2010

Data.Text.AhoCorasick.Automaton

Description

An efficient implementation of the Aho-Corasick string matching algorithm. See http://web.stanford.edu/class/archive/cs/cs166/cs166.1166/lectures/02/Small02.pdf for a good explanation of the algorithm.

The memory layout of the automaton, and the function that steps it, were optimized to the point where string matching compiles roughly to a loop over the code units in the input text, that keeps track of the current state. Lookup of the next state is either just an array index (for the root state), or a linear scan through a small array (for non-root states). The pointer chases that are common for traversing Haskell data structures have been eliminated.

The construction of the automaton has not been optimized that much, because construction time is usually negligible in comparison to matching time. Therefore construction is a two-step process, where first we build the automaton as int maps, which are convenient for incremental construction. Afterwards we pack the automaton into unboxed vectors.

This module is a rewrite of the previous version which used an older version of the text package which in turn used UTF-16 internally.

Synopsis

data AcMachine v = AcMachine {
- machineValues :: !(Vector [v])
- machineTransitions :: !(TypedByteArray Transition)
- machineOffsets :: !(TypedByteArray Offset)
- machineRootAsciiTransitions :: !(TypedByteArray Transition)
}
data CaseSensitivity
- = CaseSensitive
- | IgnoreCase
newtype CodeUnitIndex = CodeUnitIndex {
- codeUnitIndex :: Int
}
data Match v = Match {
- matchPos :: !CodeUnitIndex
- matchValue :: v
}
data Next a
- = Done !a
- | Step !a
build :: [(Text, v)] -> AcMachine v
debugBuildDot :: [Text] -> String
runLower :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a
runText :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a
runWithCase :: forall a v. CaseSensitivity -> a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a
needleCasings :: Text -> [Text]

Documentation

data AcMachine v Source #

An Aho-Corasick automaton.

Constructors

AcMachine

Fields

machineValues :: !(Vector [v])
For every state, the values associated with its needles. If the state is not a match state, the list is empty.
machineTransitions :: !(TypedByteArray Transition)
A packed vector of transitions. For every state, there is a slice of this vector that starts at the offset given by machineOffsets, and ends at the first wildcard transition.
machineOffsets :: !(TypedByteArray Offset)
For every state, the index into machineTransitions where the transition list for that state starts.
machineRootAsciiTransitions :: !(TypedByteArray Transition)
A lookup table for transitions from the root state, an optimization to avoid having to walk all transitions, at the cost of using a bit of additional memory.

Instances

Instances details

Functor AcMachine Source #
Instance details Defined in Data.Text.AhoCorasick.Automaton Methods fmap :: (a -> b) -> AcMachine a -> AcMachine b # (<$) :: a -> AcMachine b -> AcMachine a #
Generic (AcMachine v) Source #
Instance details Defined in Data.Text.AhoCorasick.Automaton Associated Types type Rep (AcMachine v) :: Type -> Type # Methods from :: AcMachine v -> Rep (AcMachine v) x # to :: Rep (AcMachine v) x -> AcMachine v #
NFData v => NFData (AcMachine v) Source #
Instance details Defined in Data.Text.AhoCorasick.Automaton Methods rnf :: AcMachine v -> () #
type Rep (AcMachine v) Source #
Instance details Defined in Data.Text.AhoCorasick.Automaton type Rep (AcMachine v)

data CaseSensitivity Source #

Constructors

CaseSensitive
IgnoreCase

Instances

Instances details

FromJSON CaseSensitivity Source #
Instance details Defined in Data.Text.CaseSensitivity Methods parseJSON :: Value -> Parser CaseSensitivity # parseJSONList :: Value -> Parser [CaseSensitivity] #
ToJSON CaseSensitivity Source #
Instance details Defined in Data.Text.CaseSensitivity Methods toJSON :: CaseSensitivity -> Value # toEncoding :: CaseSensitivity -> Encoding # toJSONList :: [CaseSensitivity] -> Value # toEncodingList :: [CaseSensitivity] -> Encoding #
Generic CaseSensitivity Source #
Instance details Defined in Data.Text.CaseSensitivity Associated Types type Rep CaseSensitivity :: Type -> Type # Methods from :: CaseSensitivity -> Rep CaseSensitivity x # to :: Rep CaseSensitivity x -> CaseSensitivity #
Show CaseSensitivity Source #
Instance details Defined in Data.Text.CaseSensitivity Methods showsPrec :: Int -> CaseSensitivity -> ShowS # show :: CaseSensitivity -> String # showList :: [CaseSensitivity] -> ShowS #
NFData CaseSensitivity Source #
Instance details Defined in Data.Text.CaseSensitivity Methods rnf :: CaseSensitivity -> () #
Eq CaseSensitivity Source #
Instance details Defined in Data.Text.CaseSensitivity Methods (==) :: CaseSensitivity -> CaseSensitivity -> Bool # (/=) :: CaseSensitivity -> CaseSensitivity -> Bool #
Hashable CaseSensitivity Source #
Instance details Defined in Data.Text.CaseSensitivity Methods hashWithSalt :: Int -> CaseSensitivity -> Int # hash :: CaseSensitivity -> Int #
type Rep CaseSensitivity Source #
Instance details Defined in Data.Text.CaseSensitivity type Rep CaseSensitivity = D1 ('MetaData "CaseSensitivity" "Data.Text.CaseSensitivity" "alfred-margaret-2.1.0.0-GaLGdvCW2mGJuL9TH52qO1" 'False) (C1 ('MetaCons "CaseSensitive" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "IgnoreCase" 'PrefixI 'False) (U1 :: Type -> Type))

newtype CodeUnitIndex Source #

An index into the raw UTF-8 data of a Text. This is not the code point index as conventionally accepted by Text, so we wrap it to avoid confusing the two. Incorrect index manipulation can lead to surrogate pairs being sliced, so manipulate indices with care. This type is also used for lengths.

Constructors

CodeUnitIndex
Fields codeUnitIndex :: Int

Instances

Instances details

FromJSON CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods parseJSON :: Value -> Parser CodeUnitIndex # parseJSONList :: Value -> Parser [CodeUnitIndex] #
ToJSON CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods toJSON :: CodeUnitIndex -> Value # toEncoding :: CodeUnitIndex -> Encoding # toJSONList :: [CodeUnitIndex] -> Value # toEncodingList :: [CodeUnitIndex] -> Encoding #
Bounded CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods minBound :: CodeUnitIndex # maxBound :: CodeUnitIndex #
Generic CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Associated Types type Rep CodeUnitIndex :: Type -> Type # Methods from :: CodeUnitIndex -> Rep CodeUnitIndex x # to :: Rep CodeUnitIndex x -> CodeUnitIndex #
Num CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods (+) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex # (-) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex # (*) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex # negate :: CodeUnitIndex -> CodeUnitIndex # abs :: CodeUnitIndex -> CodeUnitIndex # signum :: CodeUnitIndex -> CodeUnitIndex # fromInteger :: Integer -> CodeUnitIndex #
Show CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods showsPrec :: Int -> CodeUnitIndex -> ShowS # show :: CodeUnitIndex -> String # showList :: [CodeUnitIndex] -> ShowS #
NFData CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods rnf :: CodeUnitIndex -> () #
Eq CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods (==) :: CodeUnitIndex -> CodeUnitIndex -> Bool # (/=) :: CodeUnitIndex -> CodeUnitIndex -> Bool #
Ord CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods compare :: CodeUnitIndex -> CodeUnitIndex -> Ordering # (<) :: CodeUnitIndex -> CodeUnitIndex -> Bool # (<=) :: CodeUnitIndex -> CodeUnitIndex -> Bool # (>) :: CodeUnitIndex -> CodeUnitIndex -> Bool # (>=) :: CodeUnitIndex -> CodeUnitIndex -> Bool # max :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex # min :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex #
Hashable CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods hashWithSalt :: Int -> CodeUnitIndex -> Int # hash :: CodeUnitIndex -> Int #
Prim CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 Methods sizeOf# :: CodeUnitIndex -> Int# # alignment# :: CodeUnitIndex -> Int# # indexByteArray# :: ByteArray# -> Int# -> CodeUnitIndex # readByteArray# :: MutableByteArray# s -> Int# -> State# s -> (# State# s, CodeUnitIndex #) # writeByteArray# :: MutableByteArray# s -> Int# -> CodeUnitIndex -> State# s -> State# s # setByteArray# :: MutableByteArray# s -> Int# -> Int# -> CodeUnitIndex -> State# s -> State# s # indexOffAddr# :: Addr# -> Int# -> CodeUnitIndex # readOffAddr# :: Addr# -> Int# -> State# s -> (# State# s, CodeUnitIndex #) # writeOffAddr# :: Addr# -> Int# -> CodeUnitIndex -> State# s -> State# s # setOffAddr# :: Addr# -> Int# -> Int# -> CodeUnitIndex -> State# s -> State# s #
type Rep CodeUnitIndex Source #
Instance details Defined in Data.Text.Utf8 type Rep CodeUnitIndex = D1 ('MetaData "CodeUnitIndex" "Data.Text.Utf8" "alfred-margaret-2.1.0.0-GaLGdvCW2mGJuL9TH52qO1" 'True) (C1 ('MetaCons "CodeUnitIndex" 'PrefixI 'True) (S1 ('MetaSel ('Just "codeUnitIndex") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int)))

data Match v Source #

Constructors

Match
Fields matchPos :: !CodeUnitIndex The code unit index past the last code unit of the match. Note that this is not a code point (Haskell `Char`) index; a code point might be encoded as up to four code units. matchValue :: v The payload associated with the matched needle.

data Next a Source #

Result of handling a match: stepping the automaton can exit early by returning a Done, or it can continue with a new accumulator with Step.

Constructors

Done !a
Step !a

build :: [(Text, v)] -> AcMachine v Source #

Construct an Aho-Corasick automaton for the given needles. The automaton uses Unicode code points to match the input.

debugBuildDot :: [Text] -> String Source #

Build the automaton, and format it as Graphviz Dot, for visual debugging.

runLower :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a Source #

runText :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a Source #

runWithCase :: forall a v. CaseSensitivity -> a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a Source #

Run the automaton, possibly lowercasing the input text on the fly if case insensitivity is desired. See also runLower.

The code of this function itself is organized as a state machine as well. Each state in the diagram below corresponds to a function defined in runWithCase. These functions are written in a way such that GHC identifies them as join points. This means that they can be compiled to jumps instead of function calls, which helps performance a lot.

  ┌─────────────────────────────┐
  │                             │
┌─▼──────────┐   ┌──────────────┴─┐   ┌──────────────┐
│consumeInput├───►lookupTransition├───►collectMatches│
└─▲──────────┘   └─▲────────────┬─┘   └────────────┬─┘
  │                │            │                  │
  │                └────────────┘                  │
  │                                                │
  └────────────────────────────────────────────────┘

consumeInput decodes a code point of up to four code units and possibly lowercases it. It passes this code point to followCodePoint, which in turn calls lookupTransition.
lookupTransition checks whether the given code point matches any transitions at the given state. If so, it follows the transition and calls collectMatches. Otherwise, it follows the fallback transition and calls followCodePoint or consumeInput.
collectMatches checks whether the current state is accepting and updates the accumulator accordingly. Afterwards it loops back to consumeInput.

NOTE: followCodePoint is actually inlined into consumeInput by GHC. It is included in the diagram for illustrative reasons only.

All of these functions have the arguments offset, state and acc which encode the current input position and the accumulator, which contains the matches. If you change any of the functions above, make sure to check the Core dumps afterwards that offset and state were turned into unboxed Int# by GHC. If any of them aren't, the program will constantly allocate and deallocate heap space for them. You can nudge GHC in the right direction by using bang patterns on these arguments.

WARNING: Run benchmarks when modifying this function; its performance is fragile. It took many days to discover the current formulation which compiles to fast code; removing the wrong bang pattern could cause a 10% performance regression.

needleCasings :: Text -> [Text] Source #

Given a lower case text, this gives all the texts that would lowercase to this one

needleCasings "abc" == ["abc","abC","aBc","aBC",Abc,AbC,ABc,ABC] needleCasings ABC == [] needleCasings "ω1" == [Ω1,"ω1",Ω1]