Safe Haskell | None |
---|---|
Language | Haskell2010 |
Documentation
(|>) :: a -> (a -> b) -> b infixl 1
Alternative syntax for the reverse function application operator (&), also known as the pipe operator.
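As a small illustration of how the pipe operator reads, the sketch below re-defines (|>) locally with the type and fixity documented above and compares it with (&) from Data.Function. The names sumOfEvenSquares and sumOfEvenSquares' are hypothetical and exist only for this example.

```haskell
import Data.Function ((&))

-- Local re-definition for illustration only; same type and fixity as documented.
infixl 1 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

-- Hypothetical example: read the pipeline left to right.
-- Start with the list, keep the even numbers, square them, then sum.
sumOfEvenSquares :: [Int] -> Int
sumOfEvenSquares xs =
  xs
    |> filter even
    |> map (\x -> x * x)
    |> sum

-- The same pipeline written with (&) from Data.Function.
sumOfEvenSquares' :: [Int] -> Int
sumOfEvenSquares' xs = xs & filter even & map (\x -> x * x) & sum
```

Because the operator is left-associative with low precedence, each stage receives the result of everything to its left, so no parentheses are needed between stages.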
:: FuzzySet | The string set |
-> HashMap Text Int | A sparse vector representation of the search string (generated by |
-> HashMap Int Int | A mapping from item index to the dot product between the corresponding entry of the set and the search string |
Dot products used to compute the cosine similarity, which is the similarity score assigned to entries that match the search string in the fuzzy set.
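To make the relationship between dot products and cosine similarity concrete, here is a minimal sketch over two sparse vectors. It is not the module's implementation: the names SparseVec, dot, norm and cosine are hypothetical, and Data.Map with String keys stands in for HashMap Text Int so that the example depends only on libraries shipped with GHC.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical alias: a sparse vector of gram counts.
-- Data.Map and String are assumptions standing in for HashMap Text Int.
type SparseVec = Map.Map String Int

-- Dot product: only grams present in both vectors contribute.
dot :: SparseVec -> SparseVec -> Int
dot u v = sum (Map.elems (Map.intersectionWith (*) u v))

-- Euclidean norm of a sparse vector.
norm :: SparseVec -> Double
norm v = sqrt (fromIntegral (dot v v))

-- Cosine similarity: the dot product divided by the product of the norms.
-- Defined as 0 here when either vector has zero norm (an assumption).
cosine :: SparseVec -> SparseVec -> Double
cosine u v
  | nu == 0 || nv == 0 = 0
  | otherwise = fromIntegral (dot u v) / (nu * nv)
  where
    nu = norm u
    nv = norm v
```

Two vectors that share no grams get a dot product of 0 and therefore a score of 0, while a vector compared with itself scores 1.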
:: FuzzySet | The string set |
-> Text | A string to search for |
-> Double | Minimum score |
-> Int | The gram size n, which must be at least 2 |
-> [(Double, Text)] | A list of results (score and matched value) |
This function performs the actual task of querying a set for matches, supported by the other functions in this module. See Implementation for an explanation.
:: Text | An input string |
-> Int | The gram size n, which must be at least 2 |
-> HashMap Text Int | A sparse vector with the number of times a substring occurs in the normalized input string |
Generate a list of n-grams (character substrings) from the normalized input, then translate it into a dictionary whose keys are the n-grams and whose values are the number of occurrences of each substring in the list.
>>> gramVector "xxxx" 2
fromList [("-x",1), ("xx",3), ("x-",1)]
The substring "xx" appears three times in the normalized string:
>>> grams "xxxx" 2
["-x","xx","xx","xx","x-"]
>>> Data.HashMap.Strict.lookup "nts" (gramVector "intrent'srestaurantsomeoftrent'saunt'santswantsamtorentsomepants" 3)
Just 8
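The counting step can be sketched independently of how the grams are produced. The function below is a hypothetical helper (countGrams is not a name from this module) that takes an already generated list of grams and tallies them. Data.Map and String are used in place of HashMap and Text, purely as an assumption to keep the example free of extra dependencies.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical helper: tally how often each gram occurs in a list.
-- Every gram is paired with a count of 1, and counts for equal keys are summed.
countGrams :: [String] -> Map.Map String Int
countGrams gs = Map.fromListWith (+) [(g, 1) | g <- gs]
```

Applied to the five bigrams of "xxxx" listed in the example above, this maps "xx" to 3 and each of "-x" and "x-" to 1.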
Break apart the input string into a list of n-grams. The string is first normalized and enclosed in hyphens. We then take all substrings of length n, letting the offset range from \(0 \text{ to } s + 2 - n\), where s is the length of the normalized input.
Example:
The string "Destroido Corp." is first normalized to "destroido corp", and then enclosed in hyphens, so that it becomes "-destroido corp-". The trigrams generated from this normalized string are:
[ "-de" , "des" , "est" , "str" , "tro" , "roi" , "oid" , "ido" , "do " , "o c" , " co" , "cor" , "orp" , "rp-" ]
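The procedure described above can be sketched as follows. This is an approximation under stated assumptions, not the module's code: normalizeSketch and gramsSketch are hypothetical names, String replaces Text, and the normalization rule (lowercase, keep only alphanumeric characters and spaces) is a guess that happens to reproduce the "Destroido Corp." example.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)

-- Assumed normalization: lowercase the input and keep only
-- alphanumeric characters and whitespace. The real rule may differ.
normalizeSketch :: String -> String
normalizeSketch = filter keep . map toLower
  where
    keep c = isAlphaNum c || isSpace c

-- All substrings of length n of the hyphen-enclosed, normalized input.
-- The padded string has length s + 2, so offsets run from 0 to s + 2 - n.
gramsSketch :: String -> Int -> [String]
gramsSketch input n
  | n < 2 = error "gramsSketch: gram size must be at least 2"
  | otherwise = [take n (drop i padded) | i <- [0 .. length padded - n]]
  where
    padded = "-" ++ normalizeSketch input ++ "-"
```

With these assumptions, gramsSketch "Destroido Corp." 3 yields the fourteen trigrams listed above, which agrees with the offset range: s = 14 and n = 3 give offsets 0 through 13.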