simple-atom-0.2: Atom (or symbol) datatype for fast comparision and sorting.

Portabilityportable
Stabilityexperimental
Maintainernominolo@gmail.com

Data.Atom.UF

Contents

Description

Symbols without a central symbol table.

Symbols provide the following efficient operations:

  • O(1) equality comparison (in practise)
  • O(1) ordering comparison (in practise)
  • O(n) creation where N is the size of the symbol descriptor.

Many implementations often have the additional property that each symbol descriptor only exists once in memory. This implementation slightly relaxes this property:

  • A symbol descriptor is guaranteed to exists only once in memory if it has been created using the same symbol table. Furthermore, if two symbols created from different symbol tables are compared and their descriptors turn out to be equal, the symbols will share the descriptor after the comparison.

This allows the following additional properties not present in conventional implementations:

  • No space leak. The symbol table can be discarded at any time.
  • Symbols created using different symbol tables can be compared reliably.
  • No global lock. (TODO: Well we might need one in the case of hash-collisions, but a lock-free implementation might be possible.)

Inspired by Richard OKeefes message to Erlang's eeps mailing list http://www.erlang.org/cgi-bin/ezmlm-cgi/5/057, which in turn was inspired by the Logix implementation of Flat Concurrent Prolog.

Synopsis

Symbols

data Symbol a Source

A symbol.

Note that the ordering on a is not preserved on Symbol a. Symbols are ordered by their hashes, and only if the hashes are equal will the ordering on a be used. We have:

  x == y ==> intern x == intern y

let sx = intern x
      sy = intern y
  in
    (sx < sy) == ((symbolHash sy < symbolHash sx) ||
                  symbolHash sy == symbolHash sx && x < y)

Instances

Ord a => Eq (Symbol a) 
Ord a => Ord (Symbol a) 
Show a => Show (Symbol a) 

intern :: (a -> Word64) -> a -> Symbol aSource

Create a new local symbol. For best performance use internInto together with a symbol table / map.

internInto :: SymTab s => (a -> Word64) -> s a -> a -> (s a, Symbol a)Source

Insert a symbol into an existing table.

class SymTab s whereSource

Methods

lookupSymbol :: s a -> a -> Maybe (Symbol a)Source

insertSymbol :: a -> Symbol a -> s a -> s aSource

symbolHash :: Symbol a -> Word64Source

Returns the hash of the symbol.

Implementation

Each symbol is represented a mutable pointer to the symbol info and a hash. The symbol info might itself be a pointer to another (equal) symbol info.

When creating a new symbol (without looking it up in a symbol table), we compute its hash and create a new symbol info.

     +----+---+     +-------+
 A:  | 42 | *-----> | "foo" |
     +----+---+     +-------+

We now know the following:

  1. If two symbols have the same reference, they are equal. (The Eq instance on IORefs implements observable sharing.)
  2. If two symbols have a different hash, they are different.

If neither of the above is true we either have a hash collision or the two objects are equal but were created using different symbol tables.

Let's consider the latter case:

      +----+---+     +-------+
  A:  | 42 | *-----> | "foo" |
      +----+---+     +-------+

      +----+---+     +-------+
  B:  | 42 | *-----> | "foo" |
      +----+---+     +-------+

We follow the symbol pointers and realise that the symbol descriptors are equal. We thus decide for one of them to be the canonical symbol descriptor and update the pointers:

      +----+---+     +-------+
  A:  | 42 | *-----> | "foo" |
      +----+---+ .-> +-------+
                 |       ^
      +----+---+ |   +---|---+
  B:  | 42 | *---'   |   *   |
      +----+---+     +-------+

We change the other symbol descriptor to be a pointer to the canonical descriptor, because there may be other pointers to this symbol descriptor. Otherwise, the old symbol descriptor becomes garbage. We now have only one "foo" object left.

We can add a third rule for equality:

  • If, following all pointers, two symbol descriptors are the same, then the two symbols are equal.

If this is not the case (e.g., in the case of a hash collision) we call the compare function of the symbol descriptor. A good hash function is therefore important since in the case of a hash collision we will always have to call the compare function of the symbol descriptor.

  • * Hash Function

Assuming a good hash function (i.e., the hash is indistinguishable from a randomly generated number) we can use the birthday paradox to calculate the probability of a hash collision:

collision_prob :: Integer -> Integer -> Double
collision_prob key_bits items =
    1 - exp (fromIntegral (-items * (items - 1)) / fromIntegral (2 * key_space))
  where
    key_space = 2 ^ key_bits :: Integer

E.g., collision_prob 32 50000 == 0.2525... means that with 32 bit hashes and 50000 symbols, there is a 25 percent chance of a hash collision.

  • * Path Shortening

If symbols from several symbol tables are joined repeatedly, its symbol infos may develop into long chains. For this reason we update all pointers while following them.

That is, given we have the following state:

 X+-sym+---+     A+-nfo-+    B+-------+
  | 42 | *------->|  *------> | "foo" |
  +----+---'      +-----'     '-------'

 Y+-sym+---+     C+-nfo-+    D+-------+
  | 42 | *------->|  *------> | "foo" |
  +----+---'      +-----'     '-------'

after x `compare` y we have.

 X+-sym+---+      B+-------+    A+-nfo-+    
  | 42 | *-------> | "foo" |<-------*  |
  `----+---'  +--> '-------'     `-----'
              |          ^ ^------+
 Y+-sym+---+  |  C+-nfo-+|   D+---|---+
  | 42 | *----'   |  *---'    |   *   |
  `----+---'      `-----'     '-------'

These references can be updated concurrently and without a lock since their information content does not change. That is, the state

 X.-sym+---.     A.-nfo-.    B.-------.
  | 42 | *------->|  *------> | "foo" |
  `----+---'      `-----'     '-------'

Is semantically equivalent to the state:

 X.-sym+---.     A.-nfo-.    B.-------.
  | 42 | *----.   |  *------> | "foo" |
  +----+---+  |   +-----+ .-> +-------+
              '-----------'
  • TODO
  • verify thread-safety
  • make sure the pointer update code is correct and has no bad cases
  • implement IntMap variant/wrapper that respects that two different objects may have the same hash (however unlikely).