diskhash: Disk-based hash table

This is a package candidate release! Here you can preview how this package release will appear once published to the main package index (which can be accomplished via the 'maintain' link below). Please note that once a package has been published to the main package index it cannot be undone! Please consult the package uploading documentation for more information.

[maintain] [Publish]

Disk-based hash table


[Skip to Readme]

Properties

Versions 0.0.1.0, 0.0.1.1, 0.0.1.1, 0.0.1.2, 0.0.2.1, 0.0.2.3, 0.0.3.0, 0.0.3.2, 0.0.4.0, 0.0.4.2
Change log ChangeLog
Dependencies base (>4 && <5), bytestring [details]
License MIT
Author Luis Pedro Coelho
Maintainer Luis Pedro Coelho
Category Data
Bug tracker https://github.com/luispedro/diskhash/issues
Source repo head: git clone https://github.com/luispedro/diskhash
Uploaded by luispedro at 2017-06-26T19:47:17Z

Modules

Downloads

Maintainer's Corner

Package maintainers

For package maintainers and hackage trustees


Readme for diskhash-0.0.1.1

[back to package description]

Disk-based hashtable

Travis License: MIT

A simple disk-based hash table.

The code is in C, wrappers are provided for Python, Haskell, and C++. The wrappers follow similar APIs with variations to accomodate the language specificity. They all use the same underlying code, so you can open a hashtable created in C from Haskell, modify it within your Haskell code, and open the result in Python (although Python's version currently only deals with integers, stored as longs).

Cross-language functionality will work best for very simple types so that you can control their binary representation (64-bit integers, for example).

Reading does not touch the disk representation at all and, thus, can be done on top of read-only files or using multiple threads. Writing or modifying values is, however, not thread-safe.

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    long i = 9;
    dht_insert(ht, "key", &i);
    
    long* val = (long*) dht_lookup(ht, "key");
    printf("Looked up value: %l\n", *val);

    dht_free(ht);
    return 0;
}

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python's interface is more limited and only integers are supported as values in the hash table (they are stored as 64-bit integers).

import diskhash
tb = diskhash.Str2int("testing.dht", 15)
tb.insert("key", 9)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it's not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <iostream>
#include <string>

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cine, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

Statibility

This is beta software. It is good enough that I am using it, but the API can change in the future with little warning. The binary format is versioned (the magic string encodes its version, so changes can be detected).

Automated unit testing ensures that basic mistakes will not go uncaught.

Limitations

License: MIT