Safe Haskell | None |
---|---|
Language | Haskell2010 |
This modules provides the function searchSome
for searching the candidates provided by Vector
Text
. The information about the location of matches
is stored in a length-tagged unboxed vector Vector
. Such vectors have an Unbox
instances which allows us to store the collection of such mathces in an
unboxed Vector
. This significantly reduces the memory usage and pressure on garbage collector. As a result the matchers used by this function are tagged
with the number n
of needles need to be matched and are provided by MatcherSized
. An unsized interface is provided by Matcher
which is existenially
quantified over the number of needles. Functions for constructing matching and matchers have both a sized and unsized version.
Synopsis
- data MatcherSized (n :: Nat) a = MatcherSized {
- caseSensitivity :: CaseSensitivity
- machina :: !(AcMachine a)
- sizes :: !(Either Int (Vector n Int))
- data Matcher a = forall n.KnownNat n => Matcher (MatcherSized n a)
- data MatchState (n :: Nat) a = MatchState {
- endLocation :: !Int
- partialMatch :: !(Vector n Int)
- aux :: !a
- data MatchPart = MatchPart {
- matchBegin :: !Int
- matchEnd :: !Int
- data MatchFull (n :: Nat) = MatchFull {}
- data SearchSettings a (n :: Nat) = SearchSettings {}
- type Indices (n :: Nat) = (Int, Vector n Int)
- makeMatcher :: forall a. CaseSensitivity -> (Text -> Int) -> (forall n. KnownNat n => Proxy n -> CaseSensitivity -> Text -> MatcherSized n a) -> Text -> Maybe (Matcher a)
- fuzzyMatcherSized :: KnownNat n => p n -> CaseSensitivity -> Text -> MatcherSized n MatchPart
- fuzzyMatcher :: CaseSensitivity -> Text -> Maybe (Matcher MatchPart)
- fuzzyMatchSized :: KnownNat n => MatcherSized n MatchPart -> Text -> Maybe (MatchFull n)
- fuzzyMatch :: Matcher MatchPart -> Text -> Maybe [Text]
- orderlessMatcherSized :: KnownNat n => p n -> CaseSensitivity -> Text -> MatcherSized n Int
- orderlessMatcher :: CaseSensitivity -> Text -> Maybe (Matcher Int)
- orderlessMatchSized :: KnownNat n => MatcherSized n Int -> Text -> Maybe (MatchFull n)
- orderlessMatch :: Matcher Int -> Text -> Maybe [Text]
- fuzzySettings :: KnownNat n => Int -> SearchSettings (MatcherSized n MatchPart) n
- orderlessSettings :: KnownNat n => Int -> SearchSettings (MatcherSized n Int) n
- searchSome :: forall a n. KnownNat n => SearchSettings a n -> a -> Vector Text -> Vector Int -> (Vector Int, Vector (Indices n))
- parts :: Either Int (Vector Int) -> Text -> Vector Int -> [Text]
- partsOrderless :: Either Int (Vector Int) -> Text -> Vector Int -> [Text]
- minify :: Either Int (Vector Int) -> Vector Int -> [CodeUnitIndex]
Types
data MatcherSized (n :: Nat) a Source #
The MatcherSized type consists of a state machine for matching a fixed number of needles. The number of matches needed is encoded in the Nat parameterzing
the type. Here the purpose is to improve the memory consumption by utlizing the Unbox
instance for sized tagged unboxed vectors from
(vector-sized)[https:/hackage.haskell.orgpackage/vector-sized] package. This significantly reduces the memory consumption. At least in the present
implementation there is no benefit for correctness and dealing with the length tag is occasionally annoying.
MatcherSized | |
|
The existential version of MatcherSized
forall n.KnownNat n => Matcher (MatcherSized n a) |
data MatchState (n :: Nat) a Source #
The matching process essentially takes the form of a fold with possible early termination over the matches produced. See the runLower from the alfred-margaret. Here MatchState is the return type of this fold and essentially it records the positions of the matches. Here like in alfred-margaret position is the code unit index of the first code unit beyond the match. We can't use the CodeUnitIndex here because it doesn't have an unbox instance.
MatchState | |
|
Instances
Show a => Show (MatchState n a) Source # | |
Defined in Talash.Core showsPrec :: Int -> MatchState n a -> ShowS # show :: MatchState n a -> String # showList :: [MatchState n a] -> ShowS # |
MatchPart | |
|
data MatchFull (n :: Nat) Source #
The full match consisting of a score for the match and vector consisting of the positions of the match. The score is intended as for bucketing and as a
result shouldn't be two large and must be non-negative . For the fuzzy style in this module n
contiguous matches contribute n-1
to the score. The
scores thus range from 0
to n-1
where n
is the length of the string to be matched. For orderless style this score is always 0
.
data SearchSettings a (n :: Nat) Source #
The configuration for a search style with n needles and matcher of type a
SearchSettings | |
|
type Indices (n :: Nat) = (Int, Vector n Int) Source #
Type synonym for the index of a candidate in the backing vector along with the positions of the matches for it.
Matchers and matching
:: forall a. CaseSensitivity | |
-> (Text -> Int) | The function to determine the number of needles from the query string. The proxy argument is instantiated at the resulting value. |
-> (forall n. KnownNat n => Proxy n -> CaseSensitivity -> Text -> MatcherSized n a) | The functions for constructing the matcher |
-> Text | The query string |
-> Maybe (Matcher a) | Nothing if the string is empty or if the number of needles turns out to be non-positive |
A general function to construct a Matcher. Returns Nothing if the string is empty or if the number of needles turns out to be non-positive
Fuzzy style
fuzzyMatcherSized :: KnownNat n => p n -> CaseSensitivity -> Text -> MatcherSized n MatchPart Source #
Constructs the matcher for fuzzy matching. The needles are all possible contigous subtrings of the string being matched. The Nat n
must be instantiated at the
length n
of the query string. They are n choose 2 such substrings, so to the complexity of matching is \(O(m + n^2)\) where m
is the length of candidate string.
This is a rough (and probably wrong) estimate as the updating the matchstate for each found match is not a constant time operation. Not sure if Aho Corasick is
the optimal way for this kind of matching but in practice it seems fast enough.
fuzzyMatcher :: CaseSensitivity -> Text -> Maybe (Matcher MatchPart) Source #
Unsized version of fuzzyMatcherSized
fuzzyMatchSized :: KnownNat n => MatcherSized n MatchPart -> Text -> Maybe (MatchFull n) Source #
Orderless Style
orderlessMatcherSized :: KnownNat n => p n -> CaseSensitivity -> Text -> MatcherSized n Int Source #
Constructs the matcher for orderless matching, the needles are the words from the query string and the proxy argument should be instantiated at the number of words.
orderlessMatcher :: CaseSensitivity -> Text -> Maybe (Matcher Int) Source #
Unsized version of orderlessMatcherSized
orderlessMatchSized :: KnownNat n => MatcherSized n Int -> Text -> Maybe (MatchFull n) Source #
Search
fuzzySettings :: KnownNat n => Int -> SearchSettings (MatcherSized n MatchPart) n Source #
Search functions suitable for fuzzy matching. The candidate c
will match query s
if c
contains all the characters in s
in order. In general there
can be several ways of matching. This tries to find a match with minimum number of parts of. It does not find the minimum number of parts, if that requires
reducing the extent of the partial match during search. E.g. matching "as"
against "talash"
the split will be ["tal","as","h"]
and not
["t","a","la","s","h"]
. While matching "talash best match testing hat"
against "tea"
will not result in ["talash best match ","te","sting h","a","t"]
since
"te"
occurs only after we have match all three letters and we can't know if we will find the "a"
without going through the string.
orderlessSettings :: KnownNat n => Int -> SearchSettings (MatcherSized n Int) n Source #
Search functions that match the words in i.e. space separated substring in any order. "talash best"
will match "be as"
with the split
["tal","as","h","be","st"]
but "talash best"
will not match "bet"
.
:: forall a n. KnownNat n | |
=> SearchSettings a n | The configuration for finding matches |
-> a | The matcher |
-> Vector Text | The vector of candidates |
-> Vector Int | The subset of indices of candidates to search from |
-> (Vector Int, Vector (Indices n)) | The new set of filtered indices in the vector and the vector containing the indices for each match found. |
Given a matcher, search for matches in a vector of text. This function only searches for matches among the strings at indices which are in 3rd argument.
:: Either Int (Vector Int) | The information about the lengths of different needles. |
-> Text | The candidate string that has been matched |
-> Vector Int | The vector recording the positions of the needle in the matched string. |
-> [Text] | The candidate string split up according to the match |
The parts of a string resulting from a match using the fuzzy matcher.