Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
An efficient implementation of the Aho-Corasick string matching algorithm. See http://web.stanford.edu/class/archive/cs/cs166/cs166.1166/lectures/02/Small02.pdf for a good explanation of the algorithm.
The memory layout of the automaton, and the function that steps it, were optimized to the point where string matching compiles roughly to a loop over the code units in the input text, that keeps track of the current state. Lookup of the next state is either just an array index (for the root state), or a linear scan through a small array (for non-root states). The pointer chases that are common for traversing Haskell data structures have been eliminated.
The construction of the automaton has not been optimized that much, because construction time is usually negligible in comparison to matching time. Therefore construction is a two-step process, where first we build the automaton as int maps, which are convenient for incremental construction. Afterwards we pack the automaton into unboxed vectors.
This module is a rewrite of the previous version which used an older version of
the text
package which in turn used UTF-16 internally.
Synopsis
- data AcMachine v = AcMachine {
- machineValues :: !(Vector [v])
- machineTransitions :: !(TypedByteArray Transition)
- machineOffsets :: !(TypedByteArray Offset)
- machineRootAsciiTransitions :: !(TypedByteArray Transition)
- data CaseSensitivity
- newtype CodeUnitIndex = CodeUnitIndex {
- codeUnitIndex :: Int
- data Match v = Match {
- matchPos :: !CodeUnitIndex
- matchValue :: v
- data Next a
- build :: [(Text, v)] -> AcMachine v
- debugBuildDot :: [Text] -> String
- runLower :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a
- runText :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a
- runWithCase :: forall a v. CaseSensitivity -> a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a
- needleCasings :: Text -> [Text]
Documentation
An Aho-Corasick automaton.
AcMachine | |
|
data CaseSensitivity Source #
Instances
newtype CodeUnitIndex Source #
An index into the raw UTF-8 data of a Text
. This is not the code point
index as conventionally accepted by Text
, so we wrap it to avoid confusing
the two. Incorrect index manipulation can lead to surrogate pairs being
sliced, so manipulate indices with care. This type is also used for lengths.
Instances
Match | |
|
build :: [(Text, v)] -> AcMachine v Source #
Construct an Aho-Corasick automaton for the given needles. The automaton uses Unicode code points to match the input.
debugBuildDot :: [Text] -> String Source #
Build the automaton, and format it as Graphviz Dot, for visual debugging.
runWithCase :: forall a v. CaseSensitivity -> a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a Source #
Run the automaton, possibly lowercasing the input text on the fly if case
insensitivity is desired. See also runLower
.
The code of this function itself is organized as a state machine as well.
Each state in the diagram below corresponds to a function defined in
runWithCase
. These functions are written in a way such that GHC identifies them
as join points.
This means that they can be compiled to jumps instead of function calls, which helps performance a lot.
┌─────────────────────────────┐ │ │ ┌─▼──────────┐ ┌──────────────┴─┐ ┌──────────────┐ │consumeInput├───►lookupTransition├───►collectMatches│ └─▲──────────┘ └─▲────────────┬─┘ └────────────┬─┘ │ │ │ │ │ └────────────┘ │ │ │ └────────────────────────────────────────────────┘
consumeInput
decodes a code point of up to four code units and possibly lowercases it. It passes this code point tofollowCodePoint
, which in turn callslookupTransition
.lookupTransition
checks whether the given code point matches any transitions at the given state. If so, it follows the transition and callscollectMatches
. Otherwise, it follows the fallback transition and callsfollowCodePoint
orconsumeInput
.collectMatches
checks whether the current state is accepting and updates the accumulator accordingly. Afterwards it loops back toconsumeInput
.
NOTE: followCodePoint
is actually inlined into consumeInput
by GHC.
It is included in the diagram for illustrative reasons only.
All of these functions have the arguments offset
, state
and acc
which encode the current input
position and the accumulator, which contains the matches. If you change any of the functions above,
make sure to check the Core dumps afterwards that offset
and state
were turned
into unboxed Int#
by GHC. If any of them aren't, the program will constantly allocate and deallocate heap space for them.
You can nudge GHC in the right direction by using bang patterns on these arguments.
WARNING: Run benchmarks when modifying this function; its performance is fragile. It took many days to discover the current formulation which compiles to fast code; removing the wrong bang pattern could cause a 10% performance regression.