# disk-bytes

This package provides a data type `DiskBytes` which represents a sequence of bytes that is stored on disk, but in a referentially transparent manner.

The key invariant is that a value of type `DiskBytes` that has been evaluated *to weak-head normal form* (WHNF) occupies just a few words of RAM, but many bytes of on-disk storage. (We can't guarantee anything about expressions that are not in WHNF.)

The main use case for `DiskBytes` is when you have a pure Haskell program which is storing too much data, and you want to offload some of this data in a controlled, yet transparent way to disk, without `IO` doing violence to your beautiful, pure Haskell code.

The interface for `DiskBytes` consists of two *pure* functions which convert to/from an in-memory (RAM) `ByteString`:

```hs
toDiskBytes :: Disk -> ByteString -> DiskBytes
fromDiskBytes :: DiskBytes -> ByteString
```

Here, `Disk` represents the on-disk storage, typically obtained by opening a file on the file system. One can interpret `Disk` as virtual memory.

## Implementation Details

### Disk storage

Currently, the `Disk` data type is implemented as an open sqlite database file. In other words, sqlite is used to manage on-disk memory in a file. I decided to use an existing library for on-disk storage, because managing the on-disk storage (B+ trees, trade-offs between read and write speed, …) is an interesting problem, but it's not a problem that I want to solve *here*.

However, sqlite is a bit overkill, because all that we need is a key-value store. In the future, one might consider on-disk storage libraries such as [lmdb][] or [RocksDB][]. I picked sqlite simply because it has Haskell bindings that I have used before. Pull requests (with benchmarks) are welcome.

### Sqlite

TODO: Implement batching? We may want to batch *insertions* and *deletions* until the total number of bytes to process reaches a certain threshold, e.g. 10kB?
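
The batching idea could look roughly like the following. This is only a minimal sketch: the names `Op`, `Batch`, `emptyBatch`, `opSize` and `enqueue` are hypothetical and not part of this package, the threshold of 10000 bytes is just the example figure from above, and the step that would run the collected operations inside a single sqlite transaction is left out.

```hs
module Main where

import qualified Data.ByteString as B

-- | A pending database operation (hypothetical, for illustration only).
data Op
    = Insert Int B.ByteString  -- ^ store bytes under a key
    | Delete Int               -- ^ remove the bytes stored under a key
    deriving (Eq, Show)

-- | Operations collected so far, newest first, with their total payload size.
data Batch = Batch
    { pendingOps   :: [Op]
    , pendingBytes :: !Int
    } deriving (Eq, Show)

emptyBatch :: Batch
emptyBatch = Batch [] 0

-- | Bytes that an operation adds to the batch.
-- Simplification: deletions are counted as zero bytes.
opSize :: Op -> Int
opSize (Insert _ bs) = B.length bs
opSize (Delete _)    = 0

-- | Add an operation to the batch. Once the accumulated size reaches the
-- threshold, return all collected operations (oldest first) so that the
-- caller can run them in one transaction, and start over with an empty batch.
enqueue :: Int -> Op -> Batch -> (Maybe [Op], Batch)
enqueue threshold op (Batch ops n)
    | n' >= threshold = (Just (reverse ops'), emptyBatch)
    | otherwise       = (Nothing, Batch ops' n')
  where
    ops' = op : ops
    n'   = n + opSize op

main :: IO ()
main = do
    let ops = [ Insert k (B.replicate 4000 0) | k <- [1 .. 5] ] ++ [Delete 1]
        -- Collect every flushed batch instead of talking to a database.
        step (flushed, batch) op =
            case enqueue 10000 op batch of
                (Just tx, batch') -> (flushed ++ [tx], batch')
                (Nothing, batch') -> (flushed, batch')
        (transactions, leftover) = foldl step ([], emptyBatch) ops
    mapM_
        (\tx -> putStrLn ("transaction with " ++ show (length tx) ++ " operations"))
        transactions
    putStrLn (show (length (pendingOps leftover)) ++ " operations still pending")
```

In this run, the third 4000-byte insertion crosses the threshold, so the first three operations would go into one transaction and the remaining three stay pending. A real implementation would also need an explicit flush, for example before the database is closed.
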
Rumor has it that sequences of database operations such as `INSERT INTO` become faster if they are batched into a single transaction, rather than run as separate queries with just a few bytes each.

### Referential transparency

Internally, the `DiskBytes` type uses `unsafePerformIO`. However, this use is referentially transparent as long as the library has exclusive access to the on-disk storage. In other words, we assume that the on-disk memory is as exclusive to the Haskell run-time as we assume that RAM is exclusive to the Haskell run-time.

TODO: Make an honest attempt to ensure that no other process can read or write to the file, e.g. by setting file permissions.

### Testing

Currently, the benchmark `memory` serves as a basic test that everything is working as intended. You can run the benchmark and look at its heap profile by executing the commands

```shell
$ cabal bench
$ hp2pretty memory.hp
```

This tests the following properties:

* `DiskBytes` that are alive do not use much RAM. (Currently, ~100 bytes per WHNF of `DiskBytes`.)
* `DiskBytes` that are not alive are garbage collected and disk memory is freed. (We check this by observing that the value returned by `getDiskSize` stops growing.)
* The bytes of `DiskBytes` that are alive can be loaded back into RAM. (`fromDiskBytes` does not throw an error.)

[lmdb]: https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database
[rocksdb]: https://en.wikipedia.org/wiki/RocksDB