Z-Data-0.5.0.0: Array, vector and text
Copyright(c) Dong Han 2017-2018
LicenseBSD
Maintainerwinterland1989@gmail.com
Stabilityexperimental
Portabilitynon-portable
Safe HaskellNone
LanguageHaskell2010

Z.Data.Vector.Search

Description

This module provides:

  • Element-wise searching within vectors
  • Fast sub-vector searching algorithm based on KMP string searching.
  • A hybrid sub-vector searching algorithm for Bytes.
  • Rewrite rules to use Bytes specialized version if possible.
Synopsis

Element-wise search

findIndices :: Vec v a => (a -> Bool) -> v a -> [Int] Source #

The findIndex function takes a predicate and a vector and returns the index of the first element in the vector satisfying the predicate.

elemIndices :: (Vec v a, Eq a) => a -> v a -> [Int] Source #

O(n) The elemIndices function extends elemIndex, by returning the indices of all elements equal to the query element, in ascending order.

find :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #

O(n) find the first index and element matching the predicate in a vector from left to right, if there isn't one, return (length of the vector, Nothing).

findR :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #

O(n) Find the first index and element matching the predicate in a vector from right to left, if there isn't one, return '(-1, Nothing)'.

findIndex :: Vec v a => (a -> Bool) -> v a -> Int Source #

findIndex f v = fst (find f v)

findIndexR :: Vec v a => (a -> Bool) -> v a -> Int Source #

findIndexR f v = fst (findR f v)

filter :: forall v a. Vec v a => (a -> Bool) -> v a -> v a Source #

O(n) filter, applied to a predicate and a vector, returns a vector containing those elements that satisfy the predicate.

partition :: forall v a. Vec v a => (a -> Bool) -> v a -> (v a, v a) Source #

O(n) The partition function takes a predicate, a vector, returns a pair of vector with elements which do and do not satisfy the predicate, respectively; i.e.,

partition p vs == (filter p vs, filter (not . p) vs)

Sub-vector search

indicesOverlapping Source #

Arguments

:: (Vec v a, Eq a) 
=> v a

vector to search for (needle)

-> v a

vector to search in (haystack)

-> Bool

report partial match at the end of haystack

-> [Int] 

O(n+m) Find the offsets of all indices (possibly overlapping) of needle within haystack using KMP algorithm.

The KMP algorithm need pre-calculate a shift table in O(m) time and space, the worst case time complexity is O(n+m). Partial apply this function to reuse pre-calculated table between same needles.

Chunked input are support via partial match argument, if set we will return an extra negative index in case of partial match at the end of input chunk, e.g.

indicesOverlapping [ascii|ada|]  [ascii|adadad|] True == [0,2,-2]

Where -2 is the length of the partial match part ad 's negation.

If an empty pattern is supplied, we will return every possible index of haystack, e.g.

indicesOverlapping "" "abc" = [0,1,2]

References:

indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int] Source #

O(n+m) Find the offsets of all non-overlapping indices of needle within haystack using KMP algorithm.

If an empty pattern is supplied, we will return every possible index of haystack, e.g.

indicesOverlapping "" "abc" = [0,1,2]

Bytes specialized combinators

elemIndicesBytes :: Word8 -> Bytes -> [Int] Source #

O(n) Special elemIndices for Bytes using memchr(3)

findByte :: Word8 -> Bytes -> (Int, Maybe Word8) Source #

O(n) Special findByte for Word8 using memchr(3)

findByteR :: Word8 -> Bytes -> (Int, Maybe Word8) Source #

O(n) Special findR for Bytes with memrchr.

indicesOverlappingBytes Source #

Arguments

:: Bytes

bytes to search for (needle)

-> Bytes

bytes to search in (haystack)

-> Bool

report partial match at the end of haystack

-> [Int] 

O(n/m) Find the offsets of all indices (possibly overlapping) of needle within haystack using KMP algorithm, combined with simplified sunday's rule to obtain O(n/m) complexity in average use case.

The hybrid algorithm need pre-calculate a shift table in O(m) time and space, and a bad character bloom filter in O(m) time and O(1) space, the worst case time complexity is O(n+m).

References:

  • Frantisek FranekChristopher G. JenningsWilliam F. Smyth A Simple Fast Hybrid Pattern-Matching Algorithm (2005)
  • D. M. Sunday: A Very Fast Substring Search Algorithm. Communications of the ACM, 33, 8, 132-142 (1990)
  • F. Lundh: The Fast Search Algorithm. http://effbot.org/zone/stringlib.htm (2006)

indicesBytes Source #

Arguments

:: Bytes

bytes to search for (needle)

-> Bytes

bytes to search in (haystack)

-> Bool

report partial match at the end of haystack

-> [Int] 

O(n/m) Find the offsets of all non-overlapping indices of needle within haystack using KMP algorithm, combined with simplified sunday's rule to obtain O(m/n) complexity in average use case.

Helpers

kmpNextTable :: (Vec v a, Eq a) => v a -> PrimArray Int Source #

O(m) Calculate the KMP next shift table.

The shifting rules is: when a mismatch between needle[j] and haystack[i] is found, check if next[j] == -1, if so next search continue with needle[0] and haystack[i+1], otherwise continue with needle[next[j]] and haystack[i].

sundayBloom :: Bytes -> Word64 Source #

O(m) Calculate a simple bloom filter for simplified sunday's rule.

The shifting rules is: when a mismatch between needle[j] and haystack[i] is found, check if elemSundayBloom bloom haystack[i+n-j], where n is the length of needle, if not then next search can be safely continued with haystack[i+n-j+1] and needle[0], otherwise next searh should continue with haystack[i] and needle[0], or fallback to other shifting rules such as KMP.

The algorithm is very simple: for a given Word8 w, we set the bloom's bit at unsafeShiftL 0x01 (w .&. 0x3f), so there're three false positives per bit. This's particularly suitable for search UTF-8 bytes since the significant bits of a beginning byte is usually the same.

elemSundayBloom :: Word64 -> Word8 -> Bool Source #

O(1) Test if a bloom filter contain a certain Word8.