cuckoo-0.2.1: Haskell Implementation of Cuckoo Filters

CopyrightCopyright © 2019 Lars Kuhtz <lakuhtz@gmail.com>
LicenseBSD3
MaintainerLars Kuhtz <lakuhtz@gmail.com>
Stabilityexperimental
Safe HaskellNone
LanguageHaskell2010

Data.Cuckoo

Contents

Description

Haskell implementation of Cuckoo filters as described in

B. Fan, D.G. Anderson, M. Kaminsky, M.D. Mitzenmacher. Cuckoo Filter: Practically Better Than Bloom. In Proc. CoNEXT, 2014.

Cuckoo filters are a data structure for probabilistic set membership. They support insertion, deletion, and membership queries for set elements.

Membership queries may return false positive results. But queries don't return false negative results.

Unlike Bloom filters, Cuckoo filters maintain an upper bound on the false positive rate that is independent of the load of the filter. However, insertion of new elements in the filter can fail. For typical configurations this probability is very small for load factors smaller than 90 percent.

Example

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeFamilies #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}

import Control.Monad (filterM)
import Data.Cuckoo
import Data.List ((\\))

-- Define CuckooFilterHash instance (this uses the default implementation)
instance CuckooFilterHash Int

main :: IO ()
main = do
    -- Create Filter for a minimum of 500000 entries
    f <- newCuckooFilter @4 @8 @Int 0 500000

    -- Insert 450000 items
    failed <- filterM (fmap not . insert f) [0..500000-1]

    -- Query inserted items
    missing <- filterM (fmap not . member f) [0..500000-1]

    -- Test for false positives
    false <- filterM (member f) [500000..1000000 - 1]

    -- Report results
    putStrLn $ "failed inserts: " <> show (length failed)
    putStrLn $ "false positives: " <> show (length false)
    putStrLn $ "false positive rate (%): " <> show @Double (fromIntegral (length false) * 100 / 500000)
    putStrLn $ "missing (must be 0): " <> show (length $ missing \\ failed)

    -- Filter properties
    putStrLn $ "capacity: " <> show (capacityInItems f)
    putStrLn $ "size in allocated bytes: " <> show (sizeInAllocatedBytes f)

    -- computing the following is a bit slow
    c <- itemCount f
    putStrLn $ "item count: " <> show c
    lf <- loadFactor f
    putStrLn $ "load factor (%): " <> show lf
    putStrLn $ "bits per item: " <> show @Double (fromIntegral (sizeInAllocatedBytes f) * 8 / fromIntegral c)
Synopsis

Hash Functions

newtype Salt Source #

Salt for hash computations.

Constructors

Salt Int 
Instances
Enum Salt Source # 
Instance details

Defined in Data.Cuckoo

Methods

succ :: Salt -> Salt #

pred :: Salt -> Salt #

toEnum :: Int -> Salt #

fromEnum :: Salt -> Int #

enumFrom :: Salt -> [Salt] #

enumFromThen :: Salt -> Salt -> [Salt] #

enumFromTo :: Salt -> Salt -> [Salt] #

enumFromThenTo :: Salt -> Salt -> Salt -> [Salt] #

Eq Salt Source # 
Instance details

Defined in Data.Cuckoo

Methods

(==) :: Salt -> Salt -> Bool #

(/=) :: Salt -> Salt -> Bool #

Integral Salt Source # 
Instance details

Defined in Data.Cuckoo

Methods

quot :: Salt -> Salt -> Salt #

rem :: Salt -> Salt -> Salt #

div :: Salt -> Salt -> Salt #

mod :: Salt -> Salt -> Salt #

quotRem :: Salt -> Salt -> (Salt, Salt) #

divMod :: Salt -> Salt -> (Salt, Salt) #

toInteger :: Salt -> Integer #

Num Salt Source # 
Instance details

Defined in Data.Cuckoo

Methods

(+) :: Salt -> Salt -> Salt #

(-) :: Salt -> Salt -> Salt #

(*) :: Salt -> Salt -> Salt #

negate :: Salt -> Salt #

abs :: Salt -> Salt #

signum :: Salt -> Salt #

fromInteger :: Integer -> Salt #

Ord Salt Source # 
Instance details

Defined in Data.Cuckoo

Methods

compare :: Salt -> Salt -> Ordering #

(<) :: Salt -> Salt -> Bool #

(<=) :: Salt -> Salt -> Bool #

(>) :: Salt -> Salt -> Bool #

(>=) :: Salt -> Salt -> Bool #

max :: Salt -> Salt -> Salt #

min :: Salt -> Salt -> Salt #

Real Salt Source # 
Instance details

Defined in Data.Cuckoo

Methods

toRational :: Salt -> Rational #

Show Salt Source # 
Instance details

Defined in Data.Cuckoo

Methods

showsPrec :: Int -> Salt -> ShowS #

show :: Salt -> String #

showList :: [Salt] -> ShowS #

class CuckooFilterHash a where Source #

Choosing good hash functions is imperative for a good performance of a cuckoo filter. The hash functions must be

  • independent and
  • provide good uniformity on the lower bits of the output.

The default implementations use sip hash for cuckooHash and fnv1a (64 bit) for cuckooFingerprint and require an instance of Storable.

>>> instance CuckooFilterHash Int

The following example uses the hash functions that are provided in this module to define an instance for ByteString:

>>> import qualified Data.ByteString as B
>>> :{
instance CuckooFilterHash B.ByteString where
    cuckooHash (Salt s) a = fnv1a_bytes s a
    cuckooFingerprint (Salt s) a = sip_bytes s a
    {-# INLINE cuckooHash #-}
    {-# INLINE cuckooFingerprint #-}
:}

Minimal complete definition

Nothing

Methods

cuckooHash :: Salt -> a -> Word64 Source #

This function must provide good entropy on the lower \(2^b - 1\) bits of the result, where \(b\) is the number of buckets.

cuckooFingerprint :: Salt -> a -> Word64 Source #

This function must provide good entropy on the lower bits of the size of a fingerprint.

cuckooHash :: Storable a => Salt -> a -> Word64 Source #

This function must provide good entropy on the lower \(2^b - 1\) bits of the result, where \(b\) is the number of buckets.

cuckooFingerprint :: Storable a => Salt -> a -> Word64 Source #

This function must provide good entropy on the lower bits of the size of a fingerprint.

Hash functions

sip Source #

Arguments

:: Storable a 
=> Int

Salt

-> a

Value that is hashes

-> Word64 

Computes a Sip hash for a value that has an Storable instance.

The first argument is a salt value that is used to derive the key for the hash computation.

sip_bytes Source #

Arguments

:: ByteArrayAccess a 
=> Int

Salt

-> a

Value that is hashes

-> Word64 

Computes a Sip hash for a value that is an instance of ByteArrayAccess.

The first argument is a salt value that is used to derive the key for the hash computation.

fnv1a Source #

Arguments

:: Storable a 
=> Int

Salt

-> a

Value that is hashes

-> Word64 

Computes a 64 bit Fnv1a hash for a value that has an Storable instance.

The first argument is use as a salt.

fnv1a_bytes Source #

Arguments

:: ByteArrayAccess a 
=> Int

Salt

-> a

Value that is hashes

-> Word64 

Computes a 64 bit Fnv1a hash for a value that is an instance of ByteArrayAccess.

The first argument is use as a salt.

Cuckoo Filter

data CuckooFilter s (b :: Nat) (f :: Nat) (a :: Type) Source #

Cuckoo Filter with

  • State token s :: Type,
  • bucket size b :: Nat,
  • fingerprint size f :: Nat, and
  • content type a :: Type.

The following constraints apply

  • \(0 < f \leq 32\)
  • \(0 < b\)

The implementation is not thread safe. For concurrent use the filter must be wrapped in a read-write lock.

type CuckooFilterIO b f a = CuckooFilter RealWorld b f a Source #

Cuckoo filter that can be used in the IO monad.

newCuckooFilter Source #

Arguments

:: KnownNat b 
=> KnownNat f 
=> PrimMonad m 
=> Salt

Salt for the hash functions.

-> Natural

Size. Must be at least 64.

-> m (CuckooFilter (PrimState m) b f a) 

Create a new Cuckoo filter that has at least the given capacity.

Enabling the TypeApplications language extension provides a convenient way for passing the type parameters to the function.

>>> :set -XTypeApplications -XDataKinds -XTypeFamilies
>>> newCuckooFilter @4 @10 @Int 0 1000

The type parameters are

  • bucket size b :: Nat,
  • fingerprint size f :: Nat,
  • content type a :: Type, and
  • Monad m :: Type -> Type,

The following constraints apply:

  • \(0 < f \leq 32\),
  • \(0 < b\), and
  • \(64 \leq n\), where \(n\) is the requested size.

The false positive rate depends mostly on the value of f. It is bounded from above by \(\frac{2b}{2^f}\). In most cases 4 is a good choice for b.

Actual performance depends on the choice of good hash functions that provide high uniformity on the lower bits.

The actual capacity may be much larger than what is requested, because the actual bucket count is a power of two.

>>> f <- newCuckooFilter @4 @10 @Int 0 600
>>> capacityInItems f
1024
>>> sizeInAllocatedBytes f
1284

Cuckoo Filter Operations

insert :: forall b f a m. KnownNat f => KnownNat b => PrimMonad m => CuckooFilterHash a => CuckooFilter (PrimState m) b f a -> a -> m Bool Source #

Insert an item into the filter and return whether the operation was successful. If insertion fails, the filter is unchanged. An item can be inserted more than once. The return value indicates whether insertion was successful. The operation can fail when the filter doesn't have enough space for the item.

This function is not thread safe. No concurrent writes or reads should occur while this function is executed. If this is needed a lock must be used.

This function is not exception safe. The filter must not be used any more after an asynchronous exception has been thrown during the computation of this function. If this function is used in the presence of asynchronous exceptions it should be apprioriately masked.

>>> f <- newCuckooFilter @4 @10 @Int 0 1000
>>> insert f 0
True
>>> insert f 0
True
>>> itemCount f
2

member :: CuckooFilterHash a => PrimMonad m => KnownNat f => KnownNat b => CuckooFilter (PrimState m) b f a -> a -> m Bool Source #

Test whether an item is in the set that is represented by the Cuckoo filter.

A negative result means that the item is definitively not in the set. A positive result means that the item is most likely in the set. The rate of false positives is bounded from above by \(\frac{2b}{2^f}\) where b is the number of items per bucket and f is the size of a fingerprint in bits.

>>> f <- newCuckooFilter @4 @10 @Int 0 1000
>>> insert f 0
True
>>> member f 0
True
>>> member f 1
False

delete :: CuckooFilterHash a => PrimMonad m => KnownNat f => KnownNat b => CuckooFilter (PrimState m) b f a -> a -> m Bool Source #

Delete an items from the filter. An item that was inserted more than once can also be deleted more than once.

IMPORTANT An item must only be deleted if it was successfully added to the filter before (and hasn't been deleted since then).

Deleting an item that isn't in the filter can result in the filter returning false negative results.

This function is not thread safe. No concurrent writes must occur while this function is executed. If this is needed a lock must be used. Concurrent reads are fine.

>>> f <- newCuckooFilter @4 @10 @Int 0 1000
>>> insert f 0
True
>>> insert f 0
True
>>> itemCount f
2
>>> delete f 0
True
>>> itemCount f
1
>>> member f 0
True
>>> delete f 0
True
>>> itemCount f
0
>>> member f 0
False

Utils

sizeInAllocatedBytes :: forall b f a s. KnownNat f => KnownNat b => CuckooFilter s b f a -> Int Source #

The total number of bytes allocated for storing items in the filter.

capacityInItems :: forall b f a s. KnownNat b => CuckooFilter s b f a -> Int Source #

Total number of items that the filter can hold. In practice a load factor of ~95% of this number can be reached.

itemCount :: forall b f a m. PrimMonad m => KnownNat b => KnownNat f => CuckooFilter (PrimState m) b f a -> m Int Source #

Number of items currently stored in the filter.

Note that computing this number is expensive \(\mathcal{O}(n)\).

loadFactor :: forall b f a m. PrimMonad m => KnownNat b => KnownNat f => CuckooFilter (PrimState m) b f a -> m Double Source #

The current load factor of the filter in percent.

loadFactor f = 100 * itemCount f / capacityInItems

Note that computing this number is expensive \(\mathcal{O}(n)\).

Debugging Utils

showFilter :: forall b f a. KnownNat f => KnownNat b => CuckooFilter RealWorld b f a -> IO [[String]] Source #

Show the contents of the filter as a list of buckets with values show in hex. Used for debugging purposes.

itemHashes :: forall b f a s. KnownNat f => CuckooFilterHash a => CuckooFilter s b f a -> a -> (Int, Int, Word64) Source #

Returns the different hashes that are associated with an item in the filter. Used for debugging purposes.