muesli
A simple document-oriented database engine for Haskell.
Use cases
- backing store for p2p / cloud nodes, mobile apps, etc.
- higher capacity replacement for acid-state (only indexes are held in memory).
- no dependency substitute for SQLite.
- ACIDic replacement for CouchDB and the like.
Features
- ACID transactions implemented on the MVCC model.
- automatic index management based on tags prepended to fields' types
(see example below).
- minimal boilerplate: instead of TemplateHaskell we use
GHC.Generics
and deriving
.
- monadic queries, with standard primitive operations like:
lookup
, insert
, update
, delete
, filter
, range
, and filterRange
.
- range queries afford efficient cursor-like navigation (paging) through
large datasets. For example this is the equivalent SQL for
filterRange
:
SELECT TOP page * FROM table
WHERE (filterFld = filterVal) AND
(sortVal = NULL OR sortFld < sortVal) AND
(sortKey = NULL OR ID < sortKey)
ORDER BY sortFld, ID DESC
- easy to reason about performance: all primitive queries run in O(p*log(n)).
- type safety: impossible to attempt deserializing a record at a wrong type
(or address), and risk getting bogus data with no error thrown.
References are tagged with a phantom type and created only by the database.
There are also
Num
/Integral
instances to support more generic apps,
but normally those are not needed.
- multiple backends supported: currently file, and soon (™️)
in-memory, remote.
- portability: it should work on all platforms, including mobile.
- p2p replication: soon (™️)
Note: Some of these features become misfeatures for certain scenarios which
would make either a pure in-memory cache, or a real database more appropriate.
In particular the query language is very basic at the current stage. Sure, you
can use the customary Functor
/ Applicative
/ Monad
interface, but you will
have to write all kinds of wrapper queries to make things manageable.
The design principle is to only upgrade the query language in tandem with the indexes.
Right now the indexes are not very smart, so the query language will not lie
about it with some nice but poorly implemented abstraction.
Example use
First, mark up your types. You must use the record syntax to name the
accessors so they'll be queryable. You can filter on Reference
fields, sort and
range on Sortable
s, and reverse lookup Unique
s. The database will extract
these keys using the Indexable
and Document
instances with the help of
GHC.Generics
, including from deep inside any Foldable
.
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}
import Database.Muesli.Types
data Person = Person
{ personName :: Unique (Sortable String)
, personEmail :: String
} deriving (Show, Generic, Serialize)
instance Document Person
data Content = Text String | HTML String | XHTML String
deriving (Show, Generic, Serialize, Indexable)
data BlogPost = BlogPost
{ postURI :: Unique String
, postTitle :: Sortable String
, postAuthor :: Maybe (Reference Person)
, postContributors :: [Reference Person]
, postTags :: [Sortable String]
, postContent :: Content
, publishedDate :: Sortable DateTime
} deriving (Show, Generic, Serialize)
instance Document BlogPost
Then, write some queries (updateUnique
searches by unique key, and either
inserts or updates depending on result):
{-# LANGUAGE OverloadedStrings #-}
import Database.Muesli.Query
updatePerson :: String -> String -> Transaction l m (Reference Person, Person)
updatePerson name email = do
let name' = Sortable name
let p = Person name' email
pid <- updateUnique "personName" (Unique name') p
return (pid, p)
postsByContrib :: Reference Person -> Transaction l m [(Reference BlogPost, BlogPost)]
postsByContrib pid = filter "postContributors" (Just pid) "postTitle"
flagContributor :: Reference Person -> Transaction l m ()
flagContributor pid = do
is <- postsByContributor pid
forM_ is $ \(bpid, bp) ->
update bpid bp { postTags = postTags bp ++ Sortable "stolen" }
Then you can run these transactions with runQuery
inside some MonadIO
context.
Note that Transaction
itself is an instance of MonadIO
, so you can do
arbitrary IO inside.
The l
parameter specifies which storage backend you use.
Currently only a portable binary file backend is implemented, used with
Handle FileLogState
.
import Database.Muesli.Query
import Database.Muesli.Handle
flagIt :: (MonadIO m, LogState l) => Handle l -> String -> String ->
m (Either TransactionAbort ())
flagIt h name email = runQuery h $ do
(pid, _) <- updatePerson name email
flagContributor pid
main :: IO ()
main = bracket
(putStrLn "opening DB..." >>
open (Just "blog.log") (Just "blog.dat") Nothing Nothing)
(\(h :: Handle FileLogState) -> putStrLn "closing DB..." >> close h)
(\h -> flagIt h "Bender Bending Rodríguez" "bender@ilovebender.com")
TODO
Implementation
- 2 files, one for transactions/indexes, and another for serialized data
- same file format for transactions and indexes, loading indexes is
the same as replaying transactions
- transaction file only contains int keys extracted from tagged fields
- processing a record (updating indexes) while loading the log is O(log n)
- previous 2 points make the initial loading much faster and using significantly
less memory then acid-state, which serializes entire records, including
potentially very large string fields, typical in "document-oriented" scenarios.
It was suggested that in such cases you should store this data in external files.
But then, if you want to regain the ACID property, and already have some indexes
laying around, you are well on your way of creating muesli.
- data file only contains serialized records and gaps, no metadata
- LRU cache holds deserialized objects wrapped in
Data.Dynamic
.
On SSDs deserialization is far more costly than file IO,
so having our own cache is a better solution than just memory mapping the file.
- ♻️ GC creates asynchronously new copies of both files, doing cleanup and
compaction, and only locks the world at the end
- 🔒 all locks are held for at most O(log n) time
Reference
, Unique
and Sortable
are newtype
s that have a set of general
instances for Indexable
and Document
which are used by a generic function
- transactions defer updates by collecting IDs and serialized data,
which are checked (under lock) for consistency at the end
Change log
Available here.
License
Copyright © 2015 Călin Ardelean
MIT license. See the license file for details.