Motivation
Haskell is a great language for data processing.
You load some data in the IO monad, parse it,
funnel the data through various functions and
write the result back to disk or display it
via a web server.
The programmer has the let
and where
patterns at hand
which can be used to sub-structure a single function, e.g.
workflow x y = let
a = f x
b = g a y
in h a b
To the environment program, however,
the values of the intermediate steps a
and b
are invisible and the reader does not know you used
the auxiliary functions f
, g
and h
,
although they might be important
when an outsider tries to check the correctness of
the result of the workflow
function.
This is where the Provenience monad comes in.
How it works
The Provenience monad is an ordinary state monad transformer.
The state is a data flow
graph,
which we call the variable store. Nodes are
Pandoc renderings
of so-called variables. A variable is simply a pair of an ordinary
Haskell value together with its node in the graph.
A computation in the Provenience monad performs any number
of the following five actions.
- Register a new variable in the variable store
- Provide a description of a registered variable
(in form of a Pandoc Block)
- Provide a short name for a registered variable (used in hyperlinks)
- Render the value of a registered variable into
its node in the variable store (as a Pandoc
Block
).
There is a class for default rendering methods akin to the Show
class.
- Apply a variable holding a function to a variable holding a value,
similar to the
<*>
operator of Applicative
functors.
In the Provenience monad, we write <%>
instead.
The fifth action is the only action that adds edges to the
data dependency graph. Suppose we have registered a variable f
holding a value of type a -> b
and a variable x
holding a
value of type a
. The description of f
should explain to the reader
what the function that is the value of f
does.
The monadic action
y <- pure f <%> x
does not register y
as a new variable; instead y
points to the same
node in the variable store as f
. However, the value of y
is the
application of the value of f
to the value of x
and there is now
an edge from x
to y
in the data flow graph labelled with the
description of f
. If y
is not itself a function
but the desired result, you should overwrite the node's description
(which is still the description of f
) with a new description of
the value of y
.
Why this design choice? Because otherwise partial
application is impossible. If <%> always registered new variables,
then
f <%> a <%> b
would register both f(a)
and f(a)(b)
as variables, which might not be
what the user intended. But overwriting f
also means that we can not
re-use the same function variable in several applications. When that is
desired, use a Provenience action producing a variable instead of the
variable itself. Consider the following.
let f = var succ
x <- input 4
y <- f <%> x
z <- f <%> y
Since the Haskell identifier f
is bound to a Provenience action
that registers a new variable holding the succ
function, all
three of x
, y
and z
are distinct variables.
The take-home message is that
f <- var succ
x <- input 4
y <- pure f <%> x
is a dangerous style because the value of f
is not what the corresponding
node in the graph is being used for anymore.
alternative Representation
The variable store also permits to save an alternative representation
of each variable in addition to the Pandoc rendering,
since you might want to provide a machine-readable data flow graph
in addition to a Pandoc document.
Similarly to the IHaskellDisplay
class,
each type used in a variable must have a type class instance
that allows automatic conversion into the alternative representation.
If you don't need this feature, simply choose () as the alternative
representation type.
The graph of alternative representations can be extracted from
the variable store. We provide code to assemble the store into a
spreadsheet (of static cells). Foldable structures
of basic values become columns while doubly-nested structures
become tables.
Example
Continuing the example above, in the Provenience monad you would
write something like the following. Of course it is up to the programmer
to decide how fine-grained the decomposition into Provenience actions
should be.
workflow x' y' = do
---------- register and render the input variables ------------------
x <- input x' -- register and render x'
y <- input y'
x `named` "x" -- links to x show "x" as text
y `named` "y"
x <? renderDefault "first item of input data" -- describe x
y <? renderDefault "second item of input data"
linkx <- linkto x -- create a hyperlink, used below
let what_f_does = Para [Str "auxiliary function f applied to ",linkx]
---------------------------------------------------------------------
------ the actual computation is three lines as in the pure code ----
a <- func f what_f_does <%> x
b <- func g (renderDefault "auxiliary function g") <%> a <%> y
c <- func h (renderDefault "auxiliary function h") <%> a <%> b
------ only book-keeping below --------------------------------------
---------------------------------------------------------------------
a `named` "a" >> b `named` "b" >> c `named` "result"
a <? renderDefault "first intermediate result"
b <? renderDefault "second intermediate result"
c <? renderDefault "the workflow result"
render a >> render b >> render c
return c
Above, the action func
registers a new variable and immediately
supplies a description, which is then used as edge label by the
<%>
operator on the same line.
You see that instead of one line of pure Haskell you are burdened
with writing four kinds of Provenience actions:
register, describe, alias and render. But of the four actions,
three are only concerned with providing descriptions that the pure code
did not contain.
This package was inspired by the
Javelin
Software. Thanks to John R Levine, one of the authors of Javelin,
for explaining the concepts underlying Javelin.
By using Pandoc
the user has a number of output format choices.
With a little CSS, the above example may be rendered like follows.
Unfortunately, hackage does not allow raw html in markdown, so
you have to convert the markdown yourself.
(For the sake of example,
we used f = abs
, g = replicate
and h = fmap concat . replicate
).
second item of input data
auxiliary function f applied to x
first intermediate result
second intermediate result
Alternatives and shortcomings
Spreadsheets excel at displaying intermediate values,
although comprehending the meaning and intention of
spreadsheet formulas requires experience. Needless to say,
Haskell is way more expressive than the formula language
of contemporary spreadsheets.
Limiting yourself to one of Pandoc's output formats,
you might write a
haskintex
document instead of a Provenience expression.
Haskintex may be regarded as a form of literate programming,
and you can use any of the numerous string interpolation packages
or a library for type-safe construction of the target document format
(e.g.
HaTeX,
blaze or
Pandoc)
to achieve the same outcome.
However, you will not get the complete data flow graph for free.
Some programming languages provide interactive notebooks, e.g. the
IPython interactive notebooks
or the IHaskell
variant.
Hyper-Haskell
is another approach to Haskell workbooks and shares some ideas with this package.
Typed Spreadsheets
sort of do what this package does in an interactive way, although it does not show
intermediate results.
Working with large quantities of input data might become unwieldy, however.
Rendering a notebook or interactive spreadsheet, though, requires the underlying language
to be installed on the system.
In contrast the philosophy of Provenience
is that computation and display take place on different machines,
and that a Provenience computation is only a small part
of an actual application.
Perhaps the biggest shortcoming at present is
the behaviour of mapM
.
When you mapM
a Provenience function over a
traversable structure like a list,
each element registers the same number of variables,
whereas your intention probably was to register the
entire traversable structure as input and obtain
a single output variable.
At present you have to do this transformation yourself.
TODOs
Add support for an actual graph layout of the variable store
by means of a library such as graphViz.
Perhaps it is wise to make usage of variables which meanwhile have been overwritten illegal to use.
How could one retain the functions in the data flow graph,
making the data flow interactive?
We'd have to map a subset of Haskell onto some serializable type,
e.g. spreadsheet formulas or JavaScript.