talash-0.1.0.0: Line oriented fast enough text search
Safe HaskellNone
LanguageHaskell2010

Talash.Core

Description

This modules provides the function searchSome for searching the candidates provided by Vector Text. The information about the location of matches is stored in a length-tagged unboxed vector Vector. Such vectors have an Unbox instances which allows us to store the collection of such mathces in an unboxed Vector. This significantly reduces the memory usage and pressure on garbage collector. As a result the matchers used by this function are tagged with the number n of needles need to be matched and are provided by MatcherSized. An unsized interface is provided by Matcher which is existenially quantified over the number of needles. Functions for constructing matching and matchers have both a sized and unsized version.

Synopsis

Types

data MatcherSized (n :: Nat) a Source #

The MatcherSized type consists of a state machine for matching a fixed number of needles. The number of matches needed is encoded in the Nat parameterzing the type. Here the purpose is to improve the memory consumption by utlizing the Unbox instance for sized tagged unboxed vectors from (vector-sized)[https:/hackage.haskell.orgpackage/vector-sized] package. This significantly reduces the memory consumption. At least in the present implementation there is no benefit for correctness and dealing with the length tag is occasionally annoying.

Constructors

MatcherSized 

Fields

  • caseSensitivity :: CaseSensitivity
     
  • machina :: !(AcMachine a)

    An AhoCorasick state machine from the alfred-margaret package which does the actual string matching

  • sizes :: !(Either Int (Vector n Int))

    The sizes of the basic needles in code unit indices. The Left Int case is for when the length of all the needles is 1 with Int the number of needles.

data Matcher a Source #

The existential version of MatcherSized

Constructors

forall n.KnownNat n => Matcher (MatcherSized n a) 

data MatchState (n :: Nat) a Source #

The matching process essentially takes the form of a fold with possible early termination over the matches produced. See the runLower from the alfred-margaret. Here MatchState is the return type of this fold and essentially it records the positions of the matches. Here like in alfred-margaret position is the code unit index of the first code unit beyond the match. We can't use the CodeUnitIndex here because it doesn't have an unbox instance.

Constructors

MatchState 

Fields

  • endLocation :: !Int

    This is used to record the present extent of the match. What extent means is different to different matching styles.

  • partialMatch :: !(Vector n Int)

    The vector recording the position of the matches.

  • aux :: !a

    Any auxiliary information needed to describe the state of the match.

Instances

Instances details
Show a => Show (MatchState n a) Source # 
Instance details

Defined in Talash.Core

Methods

showsPrec :: Int -> MatchState n a -> ShowS #

show :: MatchState n a -> String #

showList :: [MatchState n a] -> ShowS #

data MatchPart Source #

Constructors

MatchPart 

Fields

Instances

Instances details
Show MatchPart Source # 
Instance details

Defined in Talash.Core

data MatchFull (n :: Nat) Source #

The full match consisting of a score for the match and vector consisting of the positions of the match. The score is intended as for bucketing and as a result shouldn't be two large and must be non-negative . For the fuzzy style in this module n contiguous matches contribute n-1 to the score. The scores thus range from 0 to n-1 where n is the length of the string to be matched. For orderless style this score is always 0.

Constructors

MatchFull 

Fields

Instances

Instances details
Show (MatchFull n) Source # 
Instance details

Defined in Talash.Core

data SearchSettings a (n :: Nat) Source #

The configuration for a search style with n needles and matcher of type a

Constructors

SearchSettings 

Fields

  • match :: a -> Text -> Maybe (MatchFull n)

    Given the matcher and the candidate text, find a match or return Nothing if there is none.

  • fullscore :: a -> Int

    The maximum score for a given matcher. It determines the number of buckets.

  • maxFullMatches :: Int

    Maximum number of matches with full score to produce.

  • orderAs :: Text -> Vector n Int -> Text -> Vector n Int -> Ordering

    The ordering to sort the matches within a given bucket. It is run with two candidates and their corresponding matches.

type Indices (n :: Nat) = (Int, Vector n Int) Source #

Type synonym for the index of a candidate in the backing vector along with the positions of the matches for it.

Matchers and matching

makeMatcher Source #

Arguments

:: forall a. CaseSensitivity 
-> (Text -> Int)

The function to determine the number of needles from the query string. The proxy argument is instantiated at the resulting value.

-> (forall n. KnownNat n => Proxy n -> CaseSensitivity -> Text -> MatcherSized n a)

The functions for constructing the matcher

-> Text

The query string

-> Maybe (Matcher a)

Nothing if the string is empty or if the number of needles turns out to be non-positive

A general function to construct a Matcher. Returns Nothing if the string is empty or if the number of needles turns out to be non-positive

Fuzzy style

fuzzyMatcherSized :: KnownNat n => p n -> CaseSensitivity -> Text -> MatcherSized n MatchPart Source #

Constructs the matcher for fuzzy matching. The needles are all possible contigous subtrings of the string being matched. The Nat n must be instantiated at the length n of the query string. They are n choose 2 such substrings, so to the complexity of matching is \(O(m + n^2)\) where m is the length of candidate string. This is a rough (and probably wrong) estimate as the updating the matchstate for each found match is not a constant time operation. Not sure if Aho Corasick is the optimal way for this kind of matching but in practice it seems fast enough.

fuzzyMatcher :: CaseSensitivity -> Text -> Maybe (Matcher MatchPart) Source #

Unsized version of fuzzyMatcherSized

Orderless Style

orderlessMatcherSized :: KnownNat n => p n -> CaseSensitivity -> Text -> MatcherSized n Int Source #

Constructs the matcher for orderless matching, the needles are the words from the query string and the proxy argument should be instantiated at the number of words.

orderlessMatcher :: CaseSensitivity -> Text -> Maybe (Matcher Int) Source #

Unsized version of orderlessMatcherSized

Search

fuzzySettings :: KnownNat n => Int -> SearchSettings (MatcherSized n MatchPart) n Source #

Search functions suitable for fuzzy matching. The candidate c will match query s if c contains all the characters in s in order. In general there can be several ways of matching. This tries to find a match with minimum number of parts of. It does not find the minimum number of parts, if that requires reducing the extent of the partial match during search. E.g. matching "as" against "talash" the split will be ["tal","as","h"] and not ["t","a","la","s","h"]. While matching "talash best match testing hat" against "tea" will not result in ["talash best match ","te","sting h","a","t"] since "te" occurs only after we have match all three letters and we can't know if we will find the "a" without going through the string.

orderlessSettings :: KnownNat n => Int -> SearchSettings (MatcherSized n Int) n Source #

Search functions that match the words in i.e. space separated substring in any order. "talash best" will match "be as" with the split ["tal","as","h","be","st"] but "talash best" will not match "bet".

searchSome Source #

Arguments

:: forall a n. KnownNat n 
=> SearchSettings a n

The configuration for finding matches

-> a

The matcher

-> Vector Text

The vector of candidates

-> Vector Int

The subset of indices of candidates to search from

-> (Vector Int, Vector (Indices n))

The new set of filtered indices in the vector and the vector containing the indices for each match found.

Given a matcher, search for matches in a vector of text. This function only searches for matches among the strings at indices which are in 3rd argument.

parts Source #

Arguments

:: Either Int (Vector Int)

The information about the lengths of different needles.

-> Text

The candidate string that has been matched

-> Vector Int

The vector recording the positions of the needle in the matched string.

-> [Text]

The candidate string split up according to the match

The parts of a string resulting from a match using the fuzzy matcher.

partsOrderless :: Either Int (Vector Int) -> Text -> Vector Int -> [Text] Source #

The parts of a string resulting from a match using the orderless matcher. See parts for an explanation of arguments.

minify :: Either Int (Vector Int) -> Vector Int -> [CodeUnitIndex] Source #

Shorten a match by collapsing the contiguous sub-matches together.