Synopsis
This package provides efficient integer interval sets.
Description
Persistent... is it trees?
Yes, Radix trees. Trees are balanced by prefix bits, so we have fast
merge operations, such as union, intersection and difference. Chris
Okasaki and Andrew Gill shows that Patricia tree based
integer maps might be order of magnitude faster than Red-Black tree
counterparts on this operations. The same apply to integer sets, we
just have keys, but don't have values.
That does mean the "dense"?
That means we keep suffixes in bitmaps and we might pack, say 10,
integers which lies close together in one bitmap. This optimization
doesn't affect execution times for sparse sets, but makes dense sets
much more memory efficient — near 10-50 times less space usage
depending on machine word size and the actual density of the
set. Basically, this let us be 3-4 times less memory efficient
comparing with arrays of tightly packed bits, but see...
How suffix compaction is performed?
There are exist a pretty simple algorithm used in memory allocators
called "buddy memory allocator". In a nutshell, we have
a big block which is splitted by half when we remove from one of the
half, and merge then back when we insert. It's somewhat inverse to the
ordinary tree approach — in buddy tree we hold more information about
elements that it doesn't contain, while in prefix tree we hold more
information about elements that it does contain. It's easy to guess
that we should do with it — take the two structures then fuse them
into one to produce a new structure which perform better.
Indeed, the key idea in the design is right here — we switch forth and
back between representations per subtree basis. We intersperse
different representations in different tree branches. It's like
chameleon:
-
If the some subset is sparse, we just keep a radix tree with
bitmaps at leafs.
-
If the some subset becomes full we turn it into block. If some
buddy block appears, we join the buddy blocks into one. And so forth.
That is, we just get a structure that dynamically choose the optimal
representation depending on density of set. Moreover in best case
this lead to huge space savings:
> ppStats (fromList [0..123456])
gives:
Bin count: 6
Tip count: 1
Fin count: 6
Size in bytes: 408
Saved space over dense set: 123072
Saved space over bytestring: 11879
Percent saved over dense set: 99.6695821185617%
Percent saved over bytestring: 96.67941727028567%
The ppStats
is not an exposed function but you can play with it
using cabal-dev ghci
.
I don't know if it is an old idea, but this works just fine.
So when this data structure is good choice?
In many situation. It might be used as persistent and compact
replacement for bool arrays or Data.IntSet with the following advantages:
- Purity is extremely useful in multithreaded settings —
we could keep a set in a mutable transactional variable or an IORef
and atomically update/modify the set. So it could be used as
replacement for TArray Int Bool as well.
- By merging intervals together we achieve compactness. In best case
some of main operations will take O(1)time and space, so if you need
interval set it's here.
- Fast serizalization: if you are need conversion to/from bytestrings.
Because of bitmaps it's possible to do this conversion extremely fast.
How this implementation relate to containers version?
Heavely based. Essentially we just add the buddy interval compaction,
but it turns out that some operations becomes more complicated and
requires much more effort to implement — in order to maintain the all
tree invariants we need to take into account more cases. This is the
reason why some operations are not implemented yet (e.g. lack of
views), but I hope I'll fix it with the time.
Documentation
For documentation see haddock generated documentation.
Build Status
Maintainer
This library is written and maintained by Sam T. pxqr.sta@gmail.com
Feel free to report bugs and suggestions via
github issue tracker or the mail.