disk-bytes: On-disk storage, but referentially transparent

[ bsd3, library, memory ] [ Propose Tags ]

This package provides a data type DiskBytes which represents a sequence of bytes that is stored on disk — but in a referentially transparent manner.


[Skip to Readme]

Modules

[Index] [Quick Jump]

Downloads

Maintainer's Corner

Package maintainers

For package maintainers and hackage trustees

Candidates

Versions [RSS] 0.1.0.0
Change log CHANGELOG.md
Dependencies base (>=4.14.3.0 && <4.18), bytestring (>=0.10.12.0 && <0.12), containers (>=0.6.5.1 && <0.7), direct-sqlite (>=2.3.27 && <2.4), directory (>=1.3.6.0 && <1.4), stm (>=2.5.0.1 && <2.6), text (>=1.2.4.1 && <2.1) [details]
License BSD-3-Clause
Copyright Copyright (c) 2022 Heinrich Apfelmus
Author Heinrich Apfelmus
Maintainer apfelmus@quantentunnel.de
Category Memory
Home page https://github.com/HeinrichApfelmus/disk-bytes
Bug tracker https://github.com/HeinrichApfelmus/disk-bytes/issues
Uploaded by HeinrichApfelmus at 2022-09-25T20:32:20Z
Distributions
Downloads 66 total (4 in the last 30 days)
Rating (no votes yet) [estimated by Bayesian average]
Your Rating
  • λ
  • λ
  • λ
Status Docs available [build log]
Last success reported on 2022-09-25 [all 1 reports]

Readme for disk-bytes-0.1.0.0

[back to package description]

disk-bytes

This package provides a data type DiskBytes which represents a sequence of bytes that is stored on disk — but in a referentially transparent manner.

The key invariant is that a value of type DiskBytes that has been evaluated to weak-head normal form (WHNF) occupies just a few words of RAM, but many bytes of on-disk storage. (We can't guarantee anything about expressions that are not in WHNF.)

The main use case for DiskBytes is when you have a pure Haskell program which is storing too much data, and you want to offload some of this data in a controlled, yet transparent way to disk — without IO doing violence to your beautiful, pure Haskell code.

The interface for DiskBytes consists of two pure functions which convert to/from an in-memory (RAM) ByteString:

toDiskBytes :: Disk -> ByteString -> DiskBytes
fromDiskBytes :: DiskBytes -> ByteString

Here, Disk represents the on-disk storage, typically obtained by opening a file on the file system. One can interpret Disk as virtual memory.

Implementation Details

Disk storage

Currently, the Disk data type is implemented as an open sqlite database file. In other words, sqlite is used to manage on-disk memory in a file. I decided to use an existing library for on-disk storage, because managing the on-disk storage (B+ trees, trade-offs between read and write speed, …) is an interesting problem, but it's not a problem that I want to solve here.

However, sqlite is a bit overkill, because all that we need is a key-value store. In the future, one might consider on-disk storage libraries such as lmdb or RocksDB — I picked sqlite simply because it has Haskell bindings that I have used before. Pull requests (with benchmarks) are welcome.

Sqlite

TODO: Implement batching? We may want to batch insertions and deletions until the total number of bytes to process reaches a certain threshold, e.g. 10kB? Rumor has it that sequences of database operations such as INSERT INTO become faster if they are batched into a single transaction, rather than run as separate queries with just a few bytes each.

Referential transparency

Internally, the DiskBytes type uses unsafePerformIO. However, this use is referentially transparent as long as the library has exclusive access to the on-disk storage. In other words, we assume that the on-disk memory is as exclusive to the Haskell run-time as we assume that RAM is exclusive to the Haskell run-time.

TODO: Make an honest attempt to ensure that no other process can read or write to the file, e.g. by setting file permissions.

Testing

Currently, the benchmark memory serves as a basic test that everything is working as intended. You can run the benchmark and look at its heap profile by executing the commands

$ cabal bench
$ hp2pretty memory.hp

This tests the following properties:

  • DiskBytes that are alive do not use much RAM. (Currently, ~100 bytes per WHNF of DiskBytes.)

  • DiskBytes that are not alive are garbage collected and disk memory is freed. (This works as the value returned by getDiskSize stops growing.)

  • The bytes of 'DiskBytes' that are alive can be loaded back into RAM. (fromDiskBytes does not throw an error.)