Copyright | Copyright (c) 2006 Robert Dockins |
---|---|
License | MIT; see COPYRIGHT file for terms and conditions |
Maintainer | robdockins AT fastmail DOT fm |
Stability | stable |
Portability | GHC, Hugs (MPTC and FD) |
Safe Haskell | Safe |
Language | Haskell98 |
Edison is a library of purely functional data structures written by Chris Okasaki. It is named after Thomas Alva Edison and for the mnemonic value EDiSon (Efficent Data Structures).
Edison provides several families of abstractions, each with multiple implementations. The main abstractions provided by Edison are:
- Sequences such as stacks, queues, and dequeues,
- Collections such as sets, bags and heaps, and
- Associative Collections such as finite maps and priority queues where the priority and element are distinct.
Conventions:
Each data structure is implemented as a separate module. These modules
should always be imported qualified
to prevent a flood of name clashes,
and it is recommended to rename the module using the as
keyword to reduce
the overhead of qualified names and to make substituting one implementation
for another as painless as possible.
Names have been chosen to match standard usage as much as possible. This
means that operations for abstractions frequently share the same name
(for example, empty
, null
, size
, etc). It also means that in many
cases names have been reused from the Prelude. However, the use of
qualified
imports will prevent name reuse from becoming name clashes. If
for some reason you chose to import an Edison data structure unqualified,
you will likely need to import the Prelude hiding
the relevant names.
Edison modules also frequently share type names. For example, most sequence
type constructors are named Seq
. This additionally aids substituting
implementations by simply importing a different module.
Argument orders are selected with the following points in mind:
- Partial application: arguments more likely to be static usually appear before other arguments in order to facilitate partial application.
- Collection appears last: in all cases where an operation queries a single collection or modifies an existing collection, the collection argument will appear last. This is something of a de facto standard for Haskell datastructure libraries and lends a degree of consistency to the API.
- Most usual order: where an operation represents a well-known mathematical function on more than one datastructure, the arguments are chosen to match the most usual argument order for the function.
Type classes:
Each family of abstractions is defined as a set of classes: a main class that every implementation of that abstraction should support and several auxiliary subclasses that an implementation may or may not support. However, not all applications require the power of type classes, so each method is also directly accessible from the implementation module. Thus you can choose to use overloading or not, as appropriate for your particular application.
Documentation about the behavior of data structure operations is defined in the modules Data.Edison.Seq, Data.Edison.Coll and Data.Edison.Assoc. Implementations are required to respect the descriptions and axioms found in these modules. In some cases time complexity is also given. Implementations may differ from these time complexities; if so, the differences will be given in the documentation for the individual implementation module.
Notes on Eq and Ord instances:
Many Edison data structures require Eq
or Ord
contexts to define equivalence
and total ordering on elements or keys. Edison makes the following assumptions
about all such required instances:
- An
Eq
instance correctly defines an equivalence relation (but not necessarily structural equality); that is, we assume(==)
(considered as a relation) is reflexive, symmetric and transitive, but allow that equivalent items may be distinguishable by other means. - An
Ord
instance correctly defines a total order which is consistent with theEq
instance for that type.
These assumptions correspond to the usual meanings assigned to these classes. If
an Edison data structure is used with an Eq
or Ord
instance which violates these
assumptions, then the behavior of that data structure is undefined.
Notes on Read and Show instances:
The usual Haskell convention for Read
and Show
instances (as championed by the
Haskell "deriving" mechanism), is that show
generates a string which is a
valid Haskell expression built up
using the data type's data constructors such that, if interpreted as Haskell code, the
string would generate an identical data item. Furthermore, the derived Read
instances are able to parse such strings, such that (read . show) === id
.
So, derived instances of Read
and Show
exhibit
the following useful properties:
read
andshow
are complementary; that is,read
is a useful inverse forshow
show
generates a string which is legal Haskell code representing the data item
For concrete data types, the deriving mechanism is usually quite sufficient.
However, for abstract types the derived Read
instance may allow users to create data
which violates invariants. Furthermore, the strings resulting from show
reference hidden
data constructors which violates good software engineering principles and also
cannot be compiled because the constructors are not available outside the defining module.
Edison avoids most of these problems and still maintains the above useful properties by
doing conversions to and from lists and inserting explicit calls to the list conversion
operations. The corresponding Read
instance strips the list conversion call before
parsing the list. In this way, private data constructors are not revealed and show
strings
are still legal, compilable Haskell code. Furthermore, the showed strings gain a degree of
independence from the underlying datastructure implementation.
For example, calling show
on an empty Banker's queue will result in the following string:
Data.Edison.Seq.BankersQueue.fromList []
Datatypes which are not native Edison data structures (such as StandardSet and StandardMap)
may or may not provide Read
or Show
instances and, if they exist, they may or may
not also provide the properties that Edison native Read
and Show
instances do.
Notes on time complexities:
Some Edison data structures (only the sequences currently) have detailed time complexity information. Unless otherwise stated, these are amortized time complexities, assuming persistent usage of the datastructure. Much of this data comes from:
Martin Holters. Efficent Data Structures in a Lazy Functional Language. Master's Thesis. Chalmers University of Technology, Sweden. 2003.
Notes on unsafe functions:
There are a number of different notions of what constitutes an unsafe function.
In Haskell, a function is generally called "unsafe" if it can subvert
type safety or referential integrity, such as unsafePerformIO
or unsafeCoerce#
.
In Edison, however, we downgrade the meaning of "unsafe" somewhat. An
"unsafe" Edison function is one which, if misused, can violate the structural
invariants of a data structure. Misusing an Edison "unsafe" function should
never cause your runtime to crash or break referential integrity, but it may cause
later uses of a data structure to behave in undefined ways. Almost all unsafe functions
in Edison are labeled with the unsafe
prefix. An exception to this rule is the
With
functions in the Set
class, which are also unsafe but do not have
the prefix. Unsafe functions will have explicit preconditions listed in their
documentation.
Notes on ambiguous functions:
Edison also contains some functions which are labeled "ambiguous". These
functions cannot violate the structural invariants of a data structure, but, under
some conditions, the result of applying an ambiguous function is not well defined.
For ambiguous functions, the result of applying the function may depend on otherwise
unobservable internal state of the data structure, such as the actual shape of a
balanced tree. For example, the AssocX
class contains the fold
function, which
folds over the elements in the collection in an arbitrary order. If the combining
function passed to fold
is not fold-commutative (see below), then the result of
the fold will depend on the actual order that elements are presented to the
combining function, which is not defined.
To aid programmers, each API function is labeled ambiguous or unambiguous in its documentation. If a function is unambiguous only under some circumstances, that will also be explicitly stated.
An "unambiguous" operation is one where all correct implementations of the operation will return "indistinguishable" results. For concrete data types, "indistinguishable" means structural equality. An instance of an abstract data type is considered indistinguishable from another if all possible applications of unambiguous operations to both yield indistinguishable results. (Note: this definition is impredicative and rather imprecise. Should it become an issue, I will attempt to develop a better definition. I hope the intent is sufficiently clear).
A higher-order unambiguous operation may be rendered ambiguous if passed a "function" which
does not respect referential integrity (one containing unsafePerformIO
for example).
Only do something like this if you are 110% sure you know what you are doing, and even then
think it over two or three times.
How to choose a fold:
Folds are an important class of operations on data structures in a functional
language; they perform essentially the same role that iterators perform in
imperative languages. Edison provides a dizzying array of folds which (hopefully)
correspond to all the various ways a programmer might want to fold over a data
structure. However, it can be difficult to know which fold to choose for a
particular application. In general, you should choose a fold which provides
the fewest guarantees necessary for correctness. The folds which have fewer
guarantees give data structure implementers more leeway to provide efficient
implementations. For example, if you which to fold a commutative, associative
function, you should chose fold
(which does not guarantee an order) over foldl
or foldr
, which specify particular orders.
Also, if your function is strict in
the accumulating argument, you should prefer the strict folds (eg, fold'
); they will
often provide better space behavior. Be aware, however, that the "strict" folds
are not necessarily more strict than the "non-strict" folds; they merely give
implementers the option to provide additional strictness if it improves performance.
For associative collections, only use with WithKey
folds if you actually need the
value of the key.
Painfully detailed information about ambiguous folds:
All of the folds that are listed ambiguous are ambiguous because they do not or cannot guarantee a stable order with which the folding function will be applied. However, some functions are order insensitive, and the result will be unambiguous regardless of the fold order chosen. Here we formalize this property, which we call "fold commutativity".
We say f :: a -> b -> b
is fold-commutative iff f
is unambiguous and
forall w, z :: b; m, n :: a w = z ==> f m (f n w) = f n (f m z)
where =
means indistinguishability.
This property is sufficient (but not necessary) to ensure that, for any collection of elements to fold over, folds over all permutations of those elements will generate indistinguishable results. In other words, an ambiguous fold applied to a fold-commutative combining function becomes unambiguous.
Some fold combining functions take their arguments in the reverse order. We
straightforwardly extend the notion of fold commutativity to such functions
by reversing the arguments. More formally, we say g :: b -> a -> b
is fold
commutative iff flip g :: a -> b -> b
is fold commutative.
For folds which take both a key and an element value, we extend the notion of fold
commutativity by considering the key and element to be a single, uncurried argument.
More formally, we say g :: k -> a -> b -> b
is fold commutative iff
\(k,x) z -> g k x z :: (k,a) -> b -> b
is fold commutative according to the above definition.
Note that for g :: a -> a -> a
, if g
is unambiguous,
commutative, and associative, then g
is fold-commutative.
Proof:
let w = z, then g m (g n w) = g m (g n z) g is unambiguous = g (g n z) m commutative property of g = g n (g z m) associative property of g = g n (g m z) commutative property of g
Qed.
Thus, many common numeric combining functions, including (+)
and (*)
at
integral types, are fold commutative and can be safely used with ambiguous
folds.
Be aware however, that (+)
and (*)
at floating point types are only
approximately commutative and associative due to rounding errors; using
ambiguous folds with these operations may result in subtle differences in
the results. As always, be aware of the limitations and numeric
properties of floating point representations.
About this module:
This module re-exports the various data structure abstraction classes, but
not their methods. This allows you to write type signatures which have
contexts that mention Edison type classes without having to import the
appropriate modules qualified
. The class methods are not exported to
avoid name clashes. Obviously, to use the methods of these classes, you
will have to import the appropriate modules. This module additionally
re-exports the entire Data.Edison.Prelude module.
Miscellaneous points:
Some implementations export a few extra functions beyond those included in the relevant classes. These are typically operations that are particularly efficient for that implementation, but are not general enough to warrant inclusion in a class.
Since qualified infix symbols are fairly ugly, they have been largely avoided. However, the Data.Edison.Sym module defines a number of infix operators which alias the prefix operators; this module is intended to be imported unqualified.
Most of the operations on most of the data structures are strict. This is inevitable for data structures with non-trivial invariants. Even given that, however, many of the operations are stricter than necessary. In fact, operations are never deliberately made lazy unless the laziness is required by the algorithm, as can happen with amortized data structures.
Note, however, that the various sequence implementations are always lazy in their elements. Similarly, associative collections are always lazy in their elements (but usually strict in their keys). Non-associative collections are usually strict in their elements.
- class (Functor s, MonadPlus s) => Sequence s
- class (Eq a, Monoid c) => CollX c a | c -> a
- class (CollX c a, Ord a) => OrdCollX c a | c -> a
- class CollX c a => SetX c a | c -> a
- class (OrdCollX c a, SetX c a) => OrdSetX c a | c -> a
- class CollX c a => Coll c a | c -> a
- class (Coll c a, OrdCollX c a) => OrdColl c a | c -> a
- class (Coll c a, SetX c a) => Set c a | c -> a
- class (OrdColl c a, Set c a) => OrdSet c a | c -> a
- class (Eq k, Functor m) => AssocX m k | m -> k
- class (AssocX m k, Ord k) => OrdAssocX m k | m -> k
- class AssocX m k => FiniteMapX m k | m -> k
- class (OrdAssocX m k, FiniteMapX m k) => OrdFiniteMapX m k | m -> k
- class AssocX m k => Assoc m k | m -> k
- class (Assoc m k, OrdAssocX m k) => OrdAssoc m k | m -> k
- class (Assoc m k, FiniteMapX m k) => FiniteMap m k | m -> k
- class (OrdAssoc m k, FiniteMap m k) => OrdFiniteMap m k | m -> k
- module Data.Edison.Prelude
Sequence class
class (Functor s, MonadPlus s) => Sequence s Source
The Sequence
class defines an interface for datatypes which
implement sequences. A description for each function is
given below.
lcons, rcons, fromList, copy, lview, lhead, lheadM, ltail, ltailM, rview, rhead, rheadM, rtail, rtailM, null, size, toList, concat, reverse, reverseOnto, fold, fold', fold1, fold1', foldr, foldr', foldl, foldl', foldr1, foldr1', foldl1, foldl1', reducer, reducer', reducel, reducel', reduce1, reduce1', take, drop, splitAt, subseq, filter, partition, takeWhile, dropWhile, splitWhile, inBounds, lookup, lookupM, lookupWithDefault, update, adjust, mapWithIndex, foldrWithIndex, foldrWithIndex', foldlWithIndex, foldlWithIndex', zip, zip3, zipWith, zipWith3, unzip, unzip3, unzipWith, unzipWith3, strict, strictWith, structuralInvariant, instanceName
Collection classes
Non-observable collections
class (Eq a, Monoid c) => CollX c a | c -> a Source
This is the root class of the collection hierarchy. However, it is perfectly adequate for many applications that use sets or bags.
class (CollX c a, Ord a) => OrdCollX c a | c -> a Source
Collections for which the elements have an ordering relation.
class CollX c a => SetX c a | c -> a Source
A collection where the set property is maintained; that is, a set
contains at most one element of the equivalence class formed by the
Eq
instance on the elements.
class (OrdCollX c a, SetX c a) => OrdSetX c a | c -> a Source
Sets where the elements also have an ordering relation.
This class contains no methods; it is only an abbreviation for
the context (OrdCollX c a,SetX c a)
.
Observable collections
class CollX c a => Coll c a | c -> a Source
Collections with observable elements. See the module documentation for comments on observability.
class (Coll c a, OrdCollX c a) => OrdColl c a | c -> a Source
Collections with observable elements where the elements additionally have an ordering relation. See the module documentation for comments on observability.
class (Coll c a, SetX c a) => Set c a | c -> a Source
Collections with observable elements where the set property is maintained;
that is, a set contains at most one element of the equivalence class
formed by the Eq
instance on the elements.
WARNING: Each of the following "With" functions is unsafe. The passed in combining functions are used to choose which element is kept in the case of duplicates. They are required to satisfy the precondition that, given two equal elements, they return a third element equal to the other two. Usually, the combining function just returns its first or second argument, but it can combine elements in non-trivial ways.
The combining function should usually be associative. Where the function involves a sequence of elements, the elements will be combined from left-to-right, but with an unspecified associativity.
For example, if x == y == z
,
then fromSeqWith (+) [x,y,z]
equals either
single (x + (y + z))
or
single ((x + y) + z)
class (OrdColl c a, Set c a) => OrdSet c a | c -> a Source
Collections with observable elements where the set property is maintained and where additionally, there is an ordering relation on the elements. This class introduces no new methods, and is simply an abbreviation for the context:
(OrdColl c a,Set c a)
Associative collection classes
Non-observable associative collections
class (Eq k, Functor m) => AssocX m k | m -> k Source
The root class of the associative collection hierarchy.
empty, singleton, fromSeq, insert, insertSeq, union, unionSeq, delete, deleteAll, deleteSeq, null, size, member, count, lookup, lookupM, lookupAll, lookupAndDelete, lookupAndDeleteM, lookupAndDeleteAll, lookupWithDefault, adjust, adjustAll, adjustOrInsert, adjustAllOrInsert, adjustOrDelete, adjustOrDeleteAll, fold, fold', fold1, fold1', filter, partition, elements, strict, strictWith, structuralInvariant, instanceName
class (AssocX m k, Ord k) => OrdAssocX m k | m -> k Source
An associative collection where the keys additionally have an ordering relation.
class AssocX m k => FiniteMapX m k | m -> k Source
An associative collection where the keys form a set; that is, each key appears in the associative collection at most once.
class (OrdAssocX m k, FiniteMapX m k) => OrdFiniteMapX m k | m -> k Source
Finite maps where the keys additionally have an ordering relation. This class introduces no new methods.
Observable associative collections
class (Assoc m k, OrdAssocX m k) => OrdAssoc m k | m -> k Source
An associative collection with observable keys where the keys additionally have an ordering relation.
class (Assoc m k, FiniteMapX m k) => FiniteMap m k | m -> k Source
Finite maps with observable keys.
class (OrdAssoc m k, FiniteMap m k) => OrdFiniteMap m k | m -> k Source
Finite maps with observable keys where the keys additionally have an ordering relation. This class introduces no new methods.
module Data.Edison.Prelude