EdisonAPI-1.3: A library of efficent, purely-functional data structures (API)

CopyrightCopyright (c) 2006 Robert Dockins
LicenseMIT; see COPYRIGHT file for terms and conditions
Maintainerrobdockins AT fastmail DOT fm
Stabilitystable
PortabilityGHC, Hugs (MPTC and FD)
Safe HaskellSafe
LanguageHaskell2010

Data.Edison

Contents

Description

Edison is a library of purely functional data structures written by Chris Okasaki. It is named after Thomas Alva Edison and for the mnemonic value EDiSon (Efficent Data Structures).

Edison provides several families of abstractions, each with multiple implementations. The main abstractions provided by Edison are:

  • Sequences such as stacks, queues, and dequeues,
  • Collections such as sets, bags and heaps, and
  • Associative Collections such as finite maps and priority queues where the priority and element are distinct.

Conventions:

Each data structure is implemented as a separate module. These modules should always be imported qualified to prevent a flood of name clashes, and it is recommended to rename the module using the as keyword to reduce the overhead of qualified names and to make substituting one implementation for another as painless as possible.

Names have been chosen to match standard usage as much as possible. This means that operations for abstractions frequently share the same name (for example, empty, null, size, etc). It also means that in many cases names have been reused from the Prelude. However, the use of qualified imports will prevent name reuse from becoming name clashes. If for some reason you chose to import an Edison data structure unqualified, you will likely need to import the Prelude hiding the relevant names.

Edison modules also frequently share type names. For example, most sequence type constructors are named Seq. This additionally aids substituting implementations by simply importing a different module.

Argument orders are selected with the following points in mind:

  • Partial application: arguments more likely to be static usually appear before other arguments in order to facilitate partial application.
  • Collection appears last: in all cases where an operation queries a single collection or modifies an existing collection, the collection argument will appear last. This is something of a de facto standard for Haskell datastructure libraries and lends a degree of consistency to the API.
  • Most usual order: where an operation represents a well-known mathematical function on more than one datastructure, the arguments are chosen to match the most usual argument order for the function.

Type classes:

Each family of abstractions is defined as a set of classes: a main class that every implementation of that abstraction should support and several auxiliary subclasses that an implementation may or may not support. However, not all applications require the power of type classes, so each method is also directly accessible from the implementation module. Thus you can choose to use overloading or not, as appropriate for your particular application.

Documentation about the behavior of data structure operations is defined in the modules Data.Edison.Seq, Data.Edison.Coll and Data.Edison.Assoc. Implementations are required to respect the descriptions and axioms found in these modules. In some cases time complexity is also given. Implementations may differ from these time complexities; if so, the differences will be given in the documentation for the individual implementation module.

Notes on Eq and Ord instances:

Many Edison data structures require Eq or Ord contexts to define equivalence and total ordering on elements or keys. Edison makes the following assumptions about all such required instances:

  • An Eq instance correctly defines an equivalence relation (but not necessarily structural equality); that is, we assume (==) (considered as a relation) is reflexive, symmetric and transitive, but allow that equivalent items may be distinguishable by other means.
  • An Ord instance correctly defines a total order which is consistent with the Eq instance for that type.

These assumptions correspond to the usual meanings assigned to these classes. If an Edison data structure is used with an Eq or Ord instance which violates these assumptions, then the behavior of that data structure is undefined.

Notes on Read and Show instances:

The usual Haskell convention for Read and Show instances (as championed by the Haskell "deriving" mechanism), is that show generates a string which is a valid Haskell expression built up using the data type's data constructors such that, if interpreted as Haskell code, the string would generate an identical data item. Furthermore, the derived Read instances are able to parse such strings, such that (read . show) === id. So, derived instances of Read and Show exhibit the following useful properties:

  • read and show are complementary; that is, read is a useful inverse for show
  • show generates a string which is legal Haskell code representing the data item

For concrete data types, the deriving mechanism is usually quite sufficient. However, for abstract types the derived Read instance may allow users to create data which violates invariants. Furthermore, the strings resulting from show reference hidden data constructors which violates good software engineering principles and also cannot be compiled because the constructors are not available outside the defining module.

Edison avoids most of these problems and still maintains the above useful properties by doing conversions to and from lists and inserting explicit calls to the list conversion operations. The corresponding Read instance strips the list conversion call before parsing the list. In this way, private data constructors are not revealed and show strings are still legal, compilable Haskell code. Furthermore, the showed strings gain a degree of independence from the underlying datastructure implementation.

For example, calling show on an empty Banker's queue will result in the following string:

Data.Edison.Seq.BankersQueue.fromList []

Datatypes which are not native Edison data structures (such as StandardSet and StandardMap) may or may not provide Read or Show instances and, if they exist, they may or may not also provide the properties that Edison native Read and Show instances do.

Notes on time complexities:

Some Edison data structures (only the sequences currently) have detailed time complexity information. Unless otherwise stated, these are amortized time complexities, assuming persistent usage of the datastructure. Much of this data comes from:

Martin Holters. Efficent Data Structures in a Lazy Functional Language. Master's Thesis. Chalmers University of Technology, Sweden. 2003.

Notes on unsafe functions:

There are a number of different notions of what constitutes an unsafe function. In Haskell, a function is generally called "unsafe" if it can subvert type safety or referential integrity, such as unsafePerformIO or unsafeCoerce#. In Edison, however, we downgrade the meaning of "unsafe" somewhat. An "unsafe" Edison function is one which, if misused, can violate the structural invariants of a data structure. Misusing an Edison "unsafe" function should never cause your runtime to crash or break referential integrity, but it may cause later uses of a data structure to behave in undefined ways. Almost all unsafe functions in Edison are labeled with the unsafe prefix. An exception to this rule is the With functions in the Set class, which are also unsafe but do not have the prefix. Unsafe functions will have explicit preconditions listed in their documentation.

Notes on ambiguous functions:

Edison also contains some functions which are labeled "ambiguous". These functions cannot violate the structural invariants of a data structure, but, under some conditions, the result of applying an ambiguous function is not well defined. For ambiguous functions, the result of applying the function may depend on otherwise unobservable internal state of the data structure, such as the actual shape of a balanced tree. For example, the AssocX class contains the fold function, which folds over the elements in the collection in an arbitrary order. If the combining function passed to fold is not fold-commutative (see below), then the result of the fold will depend on the actual order that elements are presented to the combining function, which is not defined.

To aid programmers, each API function is labeled ambiguous or unambiguous in its documentation. If a function is unambiguous only under some circumstances, that will also be explicitly stated.

An "unambiguous" operation is one where all correct implementations of the operation will return "indistinguishable" results. For concrete data types, "indistinguishable" means structural equality. An instance of an abstract data type is considered indistinguishable from another if all possible applications of unambiguous operations to both yield indistinguishable results. (Note: this definition is impredicative and rather imprecise. Should it become an issue, I will attempt to develop a better definition. I hope the intent is sufficiently clear).

A higher-order unambiguous operation may be rendered ambiguous if passed a "function" which does not respect referential integrity (one containing unsafePerformIO for example). Only do something like this if you are 110% sure you know what you are doing, and even then think it over two or three times.

How to choose a fold:

Folds are an important class of operations on data structures in a functional language; they perform essentially the same role that iterators perform in imperative languages. Edison provides a dizzying array of folds which (hopefully) correspond to all the various ways a programmer might want to fold over a data structure. However, it can be difficult to know which fold to choose for a particular application. In general, you should choose a fold which provides the fewest guarantees necessary for correctness. The folds which have fewer guarantees give data structure implementers more leeway to provide efficient implementations. For example, if you which to fold a commutative, associative function, you should chose fold (which does not guarantee an order) over foldl or foldr, which specify particular orders.

Also, if your function is strict in the accumulating argument, you should prefer the strict folds (eg, fold'); they will often provide better space behavior. Be aware, however, that the "strict" folds are not necessarily more strict than the "non-strict" folds; they merely give implementers the option to provide additional strictness if it improves performance.

For associative collections, only use with WithKey folds if you actually need the value of the key.

Painfully detailed information about ambiguous folds:

All of the folds that are listed ambiguous are ambiguous because they do not or cannot guarantee a stable order with which the folding function will be applied. However, some functions are order insensitive, and the result will be unambiguous regardless of the fold order chosen. Here we formalize this property, which we call "fold commutativity".

We say f :: a -> b -> b is fold-commutative iff f is unambiguous and

   forall w, z :: b; m, n :: a

      w = z ==> f m (f n w) = f n (f m z)

where = means indistinguishability.

This property is sufficient (but not necessary) to ensure that, for any collection of elements to fold over, folds over all permutations of those elements will generate indistinguishable results. In other words, an ambiguous fold applied to a fold-commutative combining function becomes unambiguous.

Some fold combining functions take their arguments in the reverse order. We straightforwardly extend the notion of fold commutativity to such functions by reversing the arguments. More formally, we say g :: b -> a -> b is fold commutative iff flip g :: a -> b -> b is fold commutative.

For folds which take both a key and an element value, we extend the notion of fold commutativity by considering the key and element to be a single, uncurried argument. More formally, we say g :: k -> a -> b -> b is fold commutative iff

   \(k,x) z -> g k x z :: (k,a) -> b -> b

is fold commutative according to the above definition.

Note that for g :: a -> a -> a, if g is unambiguous, commutative, and associative, then g is fold-commutative.

Proof:

   let w = z, then
   g m (g n w) = g m (g n z)     g is unambiguous
               = g (g n z) m     commutative property of g
               = g n (g z m)     associative property of g
               = g n (g m z)     commutative property of g

Qed.

Thus, many common numeric combining functions, including (+) and (*) at integral types, are fold commutative and can be safely used with ambiguous folds.

Be aware however, that (+) and (*) at floating point types are only approximately commutative and associative due to rounding errors; using ambiguous folds with these operations may result in subtle differences in the results. As always, be aware of the limitations and numeric properties of floating point representations.

About this module:

This module re-exports the various data structure abstraction classes, but not their methods. This allows you to write type signatures which have contexts that mention Edison type classes without having to import the appropriate modules qualified. The class methods are not exported to avoid name clashes. Obviously, to use the methods of these classes, you will have to import the appropriate modules. This module additionally re-exports the entire Data.Edison.Prelude module.

Miscellaneous points:

Some implementations export a few extra functions beyond those included in the relevant classes. These are typically operations that are particularly efficient for that implementation, but are not general enough to warrant inclusion in a class.

Since qualified infix symbols are fairly ugly, they have been largely avoided. However, the Data.Edison.Sym module defines a number of infix operators which alias the prefix operators; this module is intended to be imported unqualified.

Most of the operations on most of the data structures are strict. This is inevitable for data structures with non-trivial invariants. Even given that, however, many of the operations are stricter than necessary. In fact, operations are never deliberately made lazy unless the laziness is required by the algorithm, as can happen with amortized data structures.

Note, however, that the various sequence implementations are always lazy in their elements. Similarly, associative collections are always lazy in their elements (but usually strict in their keys). Non-associative collections are usually strict in their elements.

Synopsis

Sequence class

Collection classes

Non-observable collections

class (Eq a, Monoid c) => CollX c a | c -> a Source

This is the root class of the collection hierarchy. However, it is perfectly adequate for many applications that use sets or bags.

class (CollX c a, Ord a) => OrdCollX c a | c -> a Source

Collections for which the elements have an ordering relation.

class CollX c a => SetX c a | c -> a Source

A collection where the set property is maintained; that is, a set contains at most one element of the equivalence class formed by the Eq instance on the elements.

class (OrdCollX c a, SetX c a) => OrdSetX c a | c -> a Source

Sets where the elements also have an ordering relation. This class contains no methods; it is only an abbreviation for the context (OrdCollX c a,SetX c a).

Observable collections

class CollX c a => Coll c a | c -> a Source

Collections with observable elements. See the module documentation for comments on observability.

class (Coll c a, OrdCollX c a) => OrdColl c a | c -> a Source

Collections with observable elements where the elements additionally have an ordering relation. See the module documentation for comments on observability.

class (Coll c a, SetX c a) => Set c a | c -> a Source

Collections with observable elements where the set property is maintained; that is, a set contains at most one element of the equivalence class formed by the Eq instance on the elements.

WARNING: Each of the following "With" functions is unsafe. The passed in combining functions are used to choose which element is kept in the case of duplicates. They are required to satisfy the precondition that, given two equal elements, they return a third element equal to the other two. Usually, the combining function just returns its first or second argument, but it can combine elements in non-trivial ways.

The combining function should usually be associative. Where the function involves a sequence of elements, the elements will be combined from left-to-right, but with an unspecified associativity.

For example, if x == y == z, then fromSeqWith (+) [x,y,z] equals either single (x + (y + z)) or single ((x + y) + z)

class (OrdColl c a, Set c a) => OrdSet c a | c -> a Source

Collections with observable elements where the set property is maintained and where additionally, there is an ordering relation on the elements. This class introduces no new methods, and is simply an abbreviation for the context:

(OrdColl c a,Set c a)

Associative collection classes

Non-observable associative collections

class (AssocX m k, Ord k) => OrdAssocX m k | m -> k Source

An associative collection where the keys additionally have an ordering relation.

class AssocX m k => FiniteMapX m k | m -> k Source

An associative collection where the keys form a set; that is, each key appears in the associative collection at most once.

class (OrdAssocX m k, FiniteMapX m k) => OrdFiniteMapX m k | m -> k Source

Finite maps where the keys additionally have an ordering relation. This class introduces no new methods.

Observable associative collections

class AssocX m k => Assoc m k | m -> k Source

Associative collections where the keys are observable.

class (Assoc m k, OrdAssocX m k) => OrdAssoc m k | m -> k Source

An associative collection with observable keys where the keys additionally have an ordering relation.

class (Assoc m k, FiniteMapX m k) => FiniteMap m k | m -> k Source

Finite maps with observable keys.

Minimal complete definition

unionWithKey, unionSeqWithKey, intersectionWithKey

class (OrdAssoc m k, FiniteMap m k) => OrdFiniteMap m k | m -> k Source

Finite maps with observable keys where the keys additionally have an ordering relation. This class introduces no new methods.