Copyright	(c) Dong Han 2017-2018
License	BSD
Maintainer	winterland1989@gmail.com
Stability	experimental
Portability	non-portable
Safe Haskell	None
Language	Haskell2010

Z.Data.Vector.Search

Contents

Element-wise search
Sub-vector search
Bytes specialized combinators
Helpers

Description

This module provides:

Element-wise searching within vectors
Fast sub-vector searching algorithm based on KMP string searching.
A hybrid sub-vector searching algorithm for Bytes.
Rewrite rules to use Bytes specialized version if possible.

Synopsis

findIndices :: Vec v a => (a -> Bool) -> v a -> [Int]
elemIndices :: (Vec v a, Eq a) => a -> v a -> [Int]
find :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a)
findR :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a)
findIndex :: Vec v a => (a -> Bool) -> v a -> Int
findIndexR :: Vec v a => (a -> Bool) -> v a -> Int
filter :: forall v a. Vec v a => (a -> Bool) -> v a -> v a
partition :: forall v a. Vec v a => (a -> Bool) -> v a -> (v a, v a)
indicesOverlapping :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int]
indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int]
elemIndicesBytes :: Word8 -> Bytes -> [Int]
findByte :: Word8 -> Bytes -> (Int, Maybe Word8)
findByteR :: Word8 -> Bytes -> (Int, Maybe Word8)
indicesOverlappingBytes :: Bytes -> Bytes -> Bool -> [Int]
indicesBytes :: Bytes -> Bytes -> Bool -> [Int]
kmpNextTable :: (Vec v a, Eq a) => v a -> PrimArray Int
sundayBloom :: Bytes -> Word64
elemSundayBloom :: Word64 -> Word8 -> Bool

Element-wise search

findIndices :: Vec v a => (a -> Bool) -> v a -> [Int] Source #

The findIndex function takes a predicate and a vector and returns the index of the first element in the vector satisfying the predicate.

elemIndices :: (Vec v a, Eq a) => a -> v a -> [Int] Source #

O(n) The elemIndices function extends elemIndex, by returning the indices of all elements equal to the query element, in ascending order.

find :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #

O(n) find the first index and element matching the predicate in a vector from left to right, if there isn't one, return (length of the vector, Nothing).

findR :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #

O(n) Find the first index and element matching the predicate in a vector from right to left, if there isn't one, return '(-1, Nothing)'.

findIndex :: Vec v a => (a -> Bool) -> v a -> Int Source #

findIndex f v = fst (find f v)

findIndexR :: Vec v a => (a -> Bool) -> v a -> Int Source #

findIndexR f v = fst (findR f v)

filter :: forall v a. Vec v a => (a -> Bool) -> v a -> v a Source #

O(n) filter, applied to a predicate and a vector, returns a vector containing those elements that satisfy the predicate.

partition :: forall v a. Vec v a => (a -> Bool) -> v a -> (v a, v a) Source #

O(n) The partition function takes a predicate, a vector, returns a pair of vector with elements which do and do not satisfy the predicate, respectively; i.e.,

partition p vs == (filter p vs, filter (not . p) vs)

Sub-vector search

indicesOverlapping Source #

Arguments

:: (Vec v a, Eq a)
=> v a	vector to search for (`needle`)
-> v a	vector to search in (`haystack`)
-> Bool	report partial match at the end of haystack
-> [Int]

O(n+m) Find the offsets of all indices (possibly overlapping) of needle within haystack using KMP algorithm.

The KMP algorithm need pre-calculate a shift table in O(m) time and space, the worst case time complexity is O(n+m). Partial apply this function to reuse pre-calculated table between same needles.

Chunked input are support via partial match argument, if set we will return an extra negative index in case of partial match at the end of input chunk, e.g.

indicesOverlapping [ascii|ada|]  [ascii|adadad|] True == [0,2,-2]

Where -2 is the length of the partial match part ad 's negation.

If an empty pattern is supplied, we will return every possible index of haystack, e.g.

indicesOverlapping "" "abc" = [0,1,2]

References:

Knuth, Donald; Morris, James H.; Pratt, Vaughan: "Fast pattern matching in strings" (1977)
http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080

indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int] Source #

O(n+m) Find the offsets of all non-overlapping indices of needle within haystack using KMP algorithm.

If an empty pattern is supplied, we will return every possible index of haystack, e.g.

indicesOverlapping "" "abc" = [0,1,2]

`Bytes` specialized combinators

elemIndicesBytes :: Word8 -> Bytes -> [Int] Source #

O(n) Special elemIndices for Bytes using memchr(3)

findByte :: Word8 -> Bytes -> (Int, Maybe Word8) Source #

O(n) Special findByte for Word8 using memchr(3)

findByteR :: Word8 -> Bytes -> (Int, Maybe Word8) Source #

O(n) Special findR for Bytes with memrchr.

indicesOverlappingBytes Source #

Arguments

:: Bytes	bytes to search for (`needle`)
-> Bytes	bytes to search in (`haystack`)
-> Bool	report partial match at the end of haystack
-> [Int]

O(n/m) Find the offsets of all indices (possibly overlapping) of needle within haystack using KMP algorithm, combined with simplified sunday's rule to obtain O(n/m) complexity in average use case.

The hybrid algorithm need pre-calculate a shift table in O(m) time and space, and a bad character bloom filter in O(m) time and O(1) space, the worst case time complexity is O(n+m).

References:

Frantisek FranekChristopher G. JenningsWilliam F. Smyth A Simple Fast Hybrid Pattern-Matching Algorithm (2005)
D. M. Sunday: A Very Fast Substring Search Algorithm. Communications of the ACM, 33, 8, 132-142 (1990)
F. Lundh: The Fast Search Algorithm. http://effbot.org/zone/stringlib.htm (2006)

indicesBytes Source #

Arguments

:: Bytes	bytes to search for (`needle`)
-> Bytes	bytes to search in (`haystack`)
-> Bool	report partial match at the end of haystack
-> [Int]

O(n/m) Find the offsets of all non-overlapping indices of needle within haystack using KMP algorithm, combined with simplified sunday's rule to obtain O(m/n) complexity in average use case.

Helpers

kmpNextTable :: (Vec v a, Eq a) => v a -> PrimArray Int Source #

O(m) Calculate the KMP next shift table.

The shifting rules is: when a mismatch between needle[j] and haystack[i] is found, check if next[j] == -1, if so next search continue with needle[0] and haystack[i+1], otherwise continue with needle[next[j]] and haystack[i].

sundayBloom :: Bytes -> Word64 Source #

O(m) Calculate a simple bloom filter for simplified sunday's rule.

The shifting rules is: when a mismatch between needle[j] and haystack[i] is found, check if elemSundayBloom bloom haystack[i+n-j], where n is the length of needle, if not then next search can be safely continued with haystack[i+n-j+1] and needle[0], otherwise next searh should continue with haystack[i] and needle[0], or fallback to other shifting rules such as KMP.

The algorithm is very simple: for a given Word8 w, we set the bloom's bit at unsafeShiftL 0x01 (w .&. 0x3f), so there're three false positives per bit. This's particularly suitable for search UTF-8 bytes since the significant bits of a beginning byte is usually the same.

elemSundayBloom :: Word64 -> Word8 -> Bool Source #

O(1) Test if a bloom filter contain a certain Word8.

Element-wise search

Sub-vector search

Bytes specialized combinators

Helpers

`Bytes` specialized combinators