disk-bytes
This package provides a data type DiskBytes
which represents a sequence of bytes that is stored on disk — but in a referentially transparent manner.
The key invariant is that a value of type DiskBytes
that has been evaluated to weak-head normal form (WHNF) occupies just a few words of RAM, but many bytes of on-disk storage. (We can't guarantee anything about expressions that are not in WHNF.)
The main use case for DiskBytes
is when you have a pure Haskell program which is storing too much data, and you want to offload some of this data in a controlled, yet transparent way to disk — without IO
doing violence to your beautiful, pure Haskell code.
The interface for DiskBytes
consists of two pure functions which convert to/from an in-memory (RAM) ByteString
:
toDiskBytes :: Disk -> ByteString -> DiskBytes
fromDiskBytes :: DiskBytes -> ByteString
Here, Disk
represents the on-disk storage, typically obtained by opening a file on the file system. One can interpret Disk
as virtual memory.
Implementation Details
Disk storage
Currently, the Disk
data type is implemented as an open sqlite database file. In other words, sqlite is used to manage on-disk memory in a file. I decided to use an existing library for on-disk storage, because managing the on-disk storage (B+ trees, trade-offs between read and write speed, …) is an interesting problem, but it's not a problem that I want to solve here.
However, sqlite is a bit overkill, because all that we need is a key-value store. In the future, one might consider on-disk storage libraries such as lmdb or RocksDB — I picked sqlite simply because it has Haskell bindings that I have used before. Pull requests (with benchmarks) are welcome.
Sqlite
TODO: Implement batching? We may want to batch insertions and deletions until the total number of bytes to process reaches a certain threshold, e.g. 10kB? Rumor has it that sequences of database operations such as INSERT INTO
become faster if they are batched into a single transaction, rather than run as separate queries with just a few bytes each.
Referential transparency
Internally, the DiskBytes
type uses unsafePerformIO
. However, this use is referentially transparent as long as the library has exclusive access to the on-disk storage. In other words, we assume that the on-disk memory is as exclusive to the Haskell run-time as we assume that RAM is exclusive to the Haskell run-time.
TODO: Make an honest attempt to ensure that no other process can read or write to the file, e.g. by setting file permissions.
Testing
Currently, the benchmark memory
serves as a basic test that everything is working as intended. You can run the benchmark and look at its heap profile by executing the commands
$ cabal bench
$ hp2pretty memory.hp
This tests the following properties:
-
DiskBytes
that are alive do not use much RAM. (Currently, ~100
bytes per WHNF of DiskBytes
.)
-
DiskBytes
that are not alive are garbage collected and disk memory is freed. (This works as the value returned by getDiskSize
stops growing.)
-
The bytes of 'DiskBytes' that are alive can be loaded back into RAM. (fromDiskBytes
does not throw an error.)