Z-Data-0.1.0.0: array, vector and text

Z.Data.Vector.Search

Description

This module provides:

• Element-wise searching within vectors
• Fast sub-vector searching algorithm based on KMP string searching.
• A hybrid sub-vector searching algorithm for Bytes.
• Rewrite rules to use Bytes specialized version if possible.
Synopsis

# Element-wise search

findIndices :: Vec v a => (a -> Bool) -> v a -> [Int] Source #

The findIndex function takes a predicate and a vector and returns the index of the first element in the vector satisfying the predicate.

elemIndices :: (Vec v a, Eq a) => a -> v a -> [Int] Source #

O(n) The elemIndices function extends elemIndex, by returning the indices of all elements equal to the query element, in ascending order.

find :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #

O(n) find the first index and element matching the predicate in a vector from left to right, if there isn't one, return (length of the vector, Nothing).

findR :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #

O(n) find the first index and element matching the predicate in a vector from right to left, if there isn't one, return '(-1, Nothing)'.

findIndex :: Vec v a => (a -> Bool) -> v a -> Int Source #

findIndex f v = fst (find f v)

findIndexR :: Vec v a => (a -> Bool) -> v a -> Int Source #

findIndexR f v = fst (findR f v)

filter :: forall v a. Vec v a => (a -> Bool) -> v a -> v a Source #

O(n) filter, applied to a predicate and a vector, returns a vector containing those elements that satisfy the predicate.

partition :: forall v a. Vec v a => (a -> Bool) -> v a -> (v a, v a) Source #

O(n) The partition function takes a predicate, a vector, returns a pair of vector with elements which do and do not satisfy the predicate, respectively; i.e.,

partition p vs == (filter p vs, filter (not . p) vs)

# Sub-vector search

Arguments

 :: (Vec v a, Eq a) => v a vector to search for (needle) -> v a vector to search in (haystack) -> Bool report partial match at the end of haystack -> [Int]

O(n+m) Find the offsets of all indices (possibly overlapping) of needle within haystack using KMP algorithm.

The KMP algorithm need pre-calculate a shift table in O(m) time and space, the worst case time complexity is O(n+m). Partial apply this function to reuse pre-calculated table between same needles.

Chunked input are support via partial match argument, if set we will return an extra negative index in case of partial match at the end of input chunk, e.g.

indicesOverlapping [ascii|ada|]  [ascii|adadad|] True == [0,2,-2]

Where -2 is the length of the partial match part ad 's negation.

If an empty pattern is supplied, we will return every possible index of haystack, e.g.

indicesOverlapping "" "abc" = [0,1,2]

References:

indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int] Source #

O(n+m) Find the offsets of all non-overlapping indices of needle within haystack using KMP algorithm.

If an empty pattern is supplied, we will return every possible index of haystack, e.g.

indicesOverlapping "" "abc" = [0,1,2]

# Bytes specialized combinators

elemIndicesBytes :: Word8 -> Bytes -> [Int] Source #

O(n) Special elemIndices for Bytes using memchr(3)

findByte :: Word8 -> Bytes -> (Int, Maybe Word8) Source #

O(n) Special findByte for Word8 using memchr(3)

findByteR :: Word8 -> Bytes -> (Int, Maybe Word8) Source #

O(n) Special findR for Bytes with handle roll bit twiddling.

Arguments

 :: Bytes bytes to search for (needle) -> Bytes bytes to search in (haystack) -> Bool report partial match at the end of haystack -> [Int]

O(n/m) Find the offsets of all indices (possibly overlapping) of needle within haystack using KMP algorithm, combined with simplified sunday's rule to obtain O(n/m) complexity in average use case.

The hybrid algorithm need pre-calculate a shift table in O(m) time and space, and a bad character bloom filter in O(m) time and O(1) space, the worst case time complexity is O(n+m).

References:

• Frantisek FranekChristopher G. JenningsWilliam F. Smyth A Simple Fast Hybrid Pattern-Matching Algorithm (2005)
• D. M. Sunday: A Very Fast Substring Search Algorithm. Communications of the ACM, 33, 8, 132-142 (1990)
• F. Lundh: The Fast Search Algorithm. http://effbot.org/zone/stringlib.htm (2006)

Arguments

 :: Bytes bytes to search for (needle) -> Bytes bytes to search in (haystack) -> Bool report partial match at the end of haystack -> [Int]

O(n/m) Find the offsets of all non-overlapping indices of needle within haystack using KMP algorithm, combined with simplified sunday's rule to obtain O(m/n) complexity in average use case.

# Helpers

kmpNextTable :: (Vec v a, Eq a) => v a -> PrimArray Int Source #

O(m) Calculate the KMP next shift table.

The shifting rules is: when a mismatch between needle[j] and haystack[i] is found, check if next[j] == -1, if so next search continue with needle and haystack[i+1], otherwise continue with needle[next[j]] and haystack[i].

O(m) Calculate a simple bloom filter for simplified sunday's rule.

The shifting rules is: when a mismatch between needle[j] and haystack[i] is found, check if elemSundayBloom bloom haystack[i+n-j], where n is the length of needle, if not then next search can be safely continued with haystack[i+n-j+1] and needle, otherwise next searh should continue with haystack[i] and needle, or fallback to other shifting rules such as KMP.

The algorithm is very simple: for a given Word8 w, we set the bloom's bit at unsafeShiftL 0x01 (w .&. 0x3f), so there're three false positives per bit. This's particularly suitable for search UTF-8 bytes since the significant bits of a beginning byte is usually the same.

O(1) Test if a bloom filter contain a certain Word8.