structures-0.2: "Advanced" Data Structures

Portabilitynon-portable
Stabilityexperimental
MaintainerEdward Kmett <ekmett@gmail.com>
Safe HaskellNone

Data.Vector.Map.Ephemeral

Description

This module provides a Vector-based Map that is loosely based on the Cache Oblivious Lookahead Array (COLA) by Bender et al. from "Cache-Oblivious Streaming B-Trees".

Currently this Map is implemented in an insert-only fashion. Deletions are left to future work or to another derived structure in case they prove expensive.

Unlike the COLA, this version merely provides amortized complexity bounds as this permits us to provide a fully functional API. However, even those asymptotics are only guaranteed if you do not modify the "old" versions of the Map. If you do, then while correctness is preserved, the asymptotic analysis becomes inaccurate.

Reading from "old" versions of the Map does not affect the asymptotic analysis and is fine.

Fractional cascading was originally replaced with the use of a hierarchical bloom filter per level containing the elements for that level, with the false positive rate tuned to balance the lookup cost against the costs of the cache misses for a false positive at that depth. This avoids the need to collect forwarding pointers from the next level, reducing pressure on the cache dramatically, while providing the same asymptotic complexity.

With either of these two techniques when used ephemerally, this Map had asymptotic performance equal to that of a B-Tree tuned to the parameters of your caches with requiring such parameter tuning.

However, the constants were still bad enough that the naive O(log^2 n) version of the COLA actually wins at lookups in benchmarks at the scale this data structure is interesting, say around a few million entries, by a factor of 10x! Consequently, we're currently not even Bloom filtering.

Compared to the venerable Data.Map, this data structure currently consumes more memory, but it provides a more limited palette of operations with different asymptotics (~10x faster inserts at a million entries) and enables us to utilize contiguous storage.

NB: when used with boxed data this structure may hold onto references to old versions of things for many updates to come until sufficient operations have happened to merge them out of the COLA.

TODO: track actual percentage of occupancy for each vector compared to the source vector it was based on. This would permit split and other operations that trim a Map to properly reason about space usage by borrowing the 1/3rd occupancy rule from a Stratified Doubling Array.

Synopsis

Documentation

data Map k v Source

This Map is implemented as an insert-only Cache Oblivious Lookahead Array (COLA) with amortized complexity bounds that are equal to those of a B-Tree when it is used ephemerally, using Bloom filters to replace the fractional cascade.

Constructors

Nil 
One !k v !(Map k v) 
Map !Int !(Array k) !(Array v) !(Map k v) 

Instances

(Read (Arr v v), Read (Arr k k), Read k, Read v) => Read (Map k v) 
(Show (Arr v v), Show (Arr k k), Show k, Show v) => Show (Map k v) 

empty :: Map k vSource

O(1) The empty Map.

null :: Map k v -> BoolSource

O(1). Identify if a Map is the empty Map.

singleton :: Arrayed v => k -> v -> Map k vSource

O(1) Construct a Map from a single key/value pair.

lookup :: (Ord k, Arrayed k, Arrayed v) => k -> Map k v -> Maybe vSource

O(log^2 N) persistently amortized, O(N) worst case. Lookup an element.

insert :: (Ord k, Arrayed k, Arrayed v) => k -> v -> Map k v -> Map k vSource

O((log N)/B) ephemerally amortized loads for each cache, O(N/B) worst case. Insert an element.

fromList :: (Ord k, Arrayed k, Arrayed v) => [(k, v)] -> Map k vSource

shape :: Map k v -> [Int]Source