pattern-matcher-0.1.0.1: A library for compiling pattern-matching to decision trees

Safe HaskellSafe
LanguageHaskell2010

Language.Pattern.Compiler

Contents

Description

Overview and basic concepts

This library implements compilation and analysis facilities for language compilers supporting ML- or Haskell-style pattern matching. It provides a compiler from a matching to a decision tree, an efficient representation mapping easily into low level languages. It supports most features one would expect, such as variable bindings, or- and as-patterns, etc. and is also able to detect anomalies in the maching, such as non-exhaustivity or redundantness. It is based on Compiling pattern-matching to good decision trees and Warnings for pattern matching by Luc Maranget.

Pattern matching

Patterns are assumed to be linear and matched “from top to bottom”. This library adopts a simplified view of patterns, or pattern Skeletons, that should be rich enough to accomodate most compilers need. It is either a catch-all pattern, eventually binding an identifier or a constructor pattern made of a tag and subpatterns, filtering only those expression sharing the same tag and whose subexpressions are also filtered by the subpatterns.

As-patterns and or-patterns are also supported, while as-patterns have there own Skeleton, or-patterns must first be decomposed into distinct lists of patterns.

In this documentation, a “row” is a list of patterns associated with an output, that will be selected if an expression matches all those patterns, while a “column” is a list of patterns that are tested against an expression from top to bottom.

Decision trees

Decision trees can be thought of as cascading switches. Each Switch checks the constructor of an expression to decide what path to take, until it reaches a Leaf (success) or encounters a dead-end 'Fail. Consider this Haskell example:

case e of
 ([], 0)    -> 0
 (_ : _, 1) -> 1

A possible decision tree corresponding to this expression could be:

Switch e +--- (,) ---> Switch e(,).0 +---  [] ----> Switch e(,).1 +---- 0 ----> Leaf 0
                                     |                            |
                                     |                            \---- _ ----> Fail [([], 1), ([], 2), ...]
                                     |
                                     \--- _:_ ----> Switch e(,).1 +---- 1 ----> Leaf 1
                                                                  |
                                                                  \---- _ ----> Fail [(_:_, 0), (_:_, 2), (_:_, 3), ...]

First, the expression e is checked for the tag (,). Since there is no other constructor for (,), this always succeeds. Matching on a tuple yields to subexpression that we name e(,).0 and e(,).1 (the Select type handles subexpression selection), that must be matched to two “columns” of patterns: e(,).0 against [] or _:_ and e(,).1 against 0 or 1. Note that we have a choice which test to perform first. Here we decide to check e(,).0 against [] and _:_. Since this is the set of all possible constructors for lists, there is no possibility for the match to fail here. We are then left with e(,).1 to match against 0,in the branch where e(,).0 is [] and 1 when e(,).0 is _:_. In either case, the matching can fail since 0 and 1 do not cover the full range of integers.

Characteristics of decision trees

A decision tree is only one possible target to compile pattern-matching. An alternative is to compile to backtracking automata (see, for instance Compiling pattern matching). Unlike decision trees, backtracking automata guarantee linear code size, however, as the name suggests, they may backtrack, thus testing more than once the same expression, which decision trees are guaranteed never to do.

Heuristics

In the example above, we choose to test e(,).0 before e(,).1, but we could have made the opposite choice. Also, in the _:_ branch we entirely ommited to test e(,).0(_:_).0, e(,).0(_:_).1 (i.e. the head and the tail of the list introducing by matching on _:_) against the two wildcards of _:_. This would of course have been useless, since matching against a wildcard always succeeds. The algorithm can make similar choices as the one we did through the use of Heuristics. The role of Heuristics is to attribute a score to a given list of patterns, so that the algorithm will first match against the list of patterns with the best score. In this case, we attributed a bigger score to the pattern 1 than to the two wildcards. A detailed list of how heuristics work, as well as all the heuristics studied by Maranget are presented later.

How to use?

The library is centered around the match and anomalies functions. match compiles a matching to a decision tree while anomalies simply gathers the anomalies in a matching. Note that the anomalies can only be retrieved from the structure of the decision tree.

The documentation makes heavy use of polymorphism to accomodate the internal representation of most languages. The convention for the names of the parameters is:

  • ident is the type of identifiers bound in patterns,
  • tag is the type of tags of constructors,
  • pat is the type of patterns in the user's language,
  • expr is the type of expressions in the user's language,
  • out is the type of the outputs of a matching.

To work, these functions need three things from the user (apart from the actual matching):

  • a way to decompose the user's language patterns into the simplified representation. This is a function of type pat -> [Skel ident tag], returning a list allows to account for or-patterns. The list of skeletons returned is tested from left-to-right.
  • for the tag type to be a member of the IsTag typeclass. This requires to be able to compute some informations from a tag, such as the range of tags it belongs to. Further information is given with the IsTag class.

Complete example

Consider the following typed language with its own Pattern representation:

data Typ     = TInt
             | TList Typ

data Pattern = VarPat Typ  String
             | IntPat      Int
             | NilPat  Typ                  -- NilPat typ has type TList typ
             | ConsPat Typ Pattern Pattern  -- ConsPat typ _ _ has type TList typ
             | OrPat       Pattern Pattern
             | AsPat       Pattern String

This language supports variables, integers and lists. It can have or- and as-patterns.

This custom representation must first be converted into a Skel-based representation. This implies defining the tag of constructors:

data Tag = NilTag | ConsTag Typ | IntTag Int

and doing the conversion:

toSkel :: Pattern -> [Skel String Tag]
toSkel (VarPat typ var)   = [WildSkel (rangeOfTyp typ) (Just var)]
toSkel (IntPat i)         = [ConsSkel (cons (IntTag i) [])]
toSkel (NilPat _)         = [ConsSkel (cons NilTag [])]
toSkel (ConsPat typ p ps) = [ ConsSkel (cons (ConsTag typ) [subp, subps])
                            | subp  <- toSkel p
                            , subps <- toSkel ps
                            ]
toSkel (OrPat p1 p2)      = toSkel p1 ++ toSkel p2
toSkel (AsPat p i)        = [ AsSkel s i
                            | s <- toSkel p
                            ]

where rangeOfTyp defines the range of Tags patterns of a certain type can have:

rangeOfTyp :: Typ -> [Tag]
rangeOfTyp TInt        = fmap IntTag [minBound .. maxBound]
rangeOfTyp (TList typ) = [NilTag, ConsTag typ]

Finally, Tag must be made an instance of IsTag. IsTag has two functions: tagRange :: tag -> [tag] that outputs the signature a given tag belongs to and subTags :: tag -> [[tag]]. subTags t defines the range of tags that can be found in the subpatterns of a constructor with tag t. For instance, a constructor tagged with ConsTag TInt will have two subfields: the first one (the head of the list), can contain any integers, the second one (the tail of the list), can be either the NilTag or another ConsTag. This gives us the following instance:

instance IsTag Tag where
 tagRange NilTag     = [NilTag, ConsTag]
 tagRange ConsTag    = [NilTag, ConsTag]
 tagRange (IntTag j) = fmap IntTag [minBound..maxBound]

 subTags NilTag        = []
 subTags (ConsTag typ) = [rangeOf typ, rangeOf (TList typ)]
 subTags (IntTag _)    = []

and this all one needs to do (apart from choosing a Heuristic) to use the compiler.

Preserving sharing

The presence of or-patterns, like in this example, can cause duplication of outputs in leaves of the decision tree. Consider this example in OCaml syntax:

match e with
| 0 | 1 -> e1
| _ -> e2

The resulting decision tree, would be:

Switch e +--- 0 ---> e1
         |
         \--- 1 ---> e1
         |
         \--- _ ---> e2

with e1 being duplicated, which is undesirable when compiling this decision tree further to machine code as it would lead to increased code size. As a result, it might be worth to consider using labels for outputs and a table linking these labels to expressions. This would make the decision tree suitable for compilation using jumps, avoiding duplication.

Synopsis

Documentation

match Source #

Arguments

:: IsTag tag 
=> Heuristic ident tag expr out

The heuristic to use to resolve ambiguous choices

-> (pat -> [Skel ident tag])

A way to decompose the language's patterns into Skeletons. Producing a list allows to account for or-patterns. They are tested from left to right.

-> expr

The expression being scrutanized

-> [(pat, out)]

The list of patterns to match on with the output associated. Patterns are tried from left to right.

-> DecTree ident tag pat expr out 

Compiles a matching to a decision tree, using the given heuristic.

data Anomalies ident tag pat Source #

Gathers all the anomalies present in a matching. Nothing indicating the absence of an anomaly.

Constructors

Anomalies 

Fields

anomalies :: IsTag tag => (pat -> [Skel ident tag]) -> [pat] -> Anomalies ident tag pat Source #

Simplified version of match, that simply gathers the anomalies of the decision tree.

Generic pattern representation

data Cons ident tag where Source #

A generic description of a constructor pattern, made of a tag and subpatterns.

Bundled Patterns

pattern Cons :: tag -> [Skel ident tag] -> Cons ident tag 

cons :: IsTag tag => tag -> [Skel ident tag] -> Cons ident tag Source #

Smart constructor for Cons. asserts that the number of subpatterns matches the tag's arity.

data Skel ident tag Source #

Constructors

WildSkel [tag] (Maybe ident)

Carries the range of tags that could have been used in this pattern and, potentially, an identifier to bound.

ConsSkel (Cons ident tag) 
AsSkel (Skel ident tag) ident

AsSkel p i matches if p matches and binds i to the result of the expression it matches against

class Ord tag => IsTag tag where Source #

Methods

tagRange :: tag -> [tag] Source #

The range of tags a given tag could have had. t should always be an element of tagRange t. In other words:

elem t (tagRange t) == True

The range of a tag is used to generate missing patterns in non-exhaustive matches. It might be interesting to consider the order the range is given in, to improve the quality of error messages. For instance, if one allows pattern-matching on integers, instead of simply giving the range [minBound..maxBound], it could be better to give the range [0..maxBound] ++ [-1,-2..minBound] (or a range alternating positive and negative integers, starting at 0) could help generate simpler messages.

subTags :: tag -> [[tag]] Source #

The range of the tags that can appear in all the subfields of a constructor with the given tag.

Example

Consider the _:_ tag for patterns of type [Bool] in Haskell. This pattern has two subpatterns, the head can be either True or False, while the tail can be either [] or _:_. Thus subTags would simply be, in pseudo-Haskell:

[[trueTag, falseTag], [nilTag, consTag]]

Expression selection

data Select expr tag Source #

Encodes the selection of a subexpression given a tag.

Constructors

NoSel expr

An untouched expression

Sel (Select expr tag) tag Int

Sel e t n selects the n+1-th subexpression in e assuming e is caracterized by tag t. Such an expression will only be generated after it has been checked that e has indeed tag t.

For example, Sel (e :: e') _::_ 1, would select the second field e :: e', in this case e'.

Decision trees

data DecTree ident tag pat expr out Source #

A decision tree can be thought of as a cascade of switches, matching on the tag of expressions and subexpressions until reaching a result. They map fairly naturally to constructs in low level languages, such as C.

Constructors

Fail [Skel ident tag]

Pattern-matching failure, with a list of all the patterns that aren't matched. The list is lazily generated and may be infinite for tags with infinite ranges.

Leaf

Pattern-matching success

Fields

  • leafBindings :: [Binding ident (Select expr tag)]

    The identifiers bound when reaching this leaf. The list of bindings is in the order of matching, as given by the heuristics.

  • leafOut :: out

    The result obtained when arriving at this leaf

  • leafRedundant :: Maybe [pat]
     
Switch

Branching on an tag expression

Fields

  • switchOn :: Select expr tag

    The expression whose tag is being scrutinised

  • switchBranches :: Map tag (DecTree ident tag pat expr out)

    Branches to follow based on specific tags. Any expression not caracterized by any tag will fall back to the default branch.

  • switchCatchAll :: Maybe (DecTree ident tag pat expr out)

    Branch to follow if the expression's tag is not present in the set of branches above. This branch may be absent if all tags are present in the switchBranches

data Binding ident expr Source #

Binding of an identifier to an expression. Bindings of wildcards are conserved.

Constructors

(Maybe ident) := expr 
Instances
(Show ident, Show expr) => Show (Binding ident expr) Source # 
Instance details

Defined in Language.Pattern.Compiler

Methods

showsPrec :: Int -> Binding ident expr -> ShowS #

show :: Binding ident expr -> String #

showList :: [Binding ident expr] -> ShowS #

Heuristics

Most of the time, there are multiple ways to construct a decision tree, since we are often faced with a choice as to which column of pattern to match first. Doing the wrong choice can lead to larger decision trees or to more tests on average. Heuristics allows us to choose between those different choices.

In there simplest form, heuristics attribute a score to a column, given it's position in the list of columns to match, the expression to match it against and the column of patterns. Some more complicated heuristics exist that require having access to the entire list of columns.

Combining heuristics

A single heuristic may give the same score to several columns, leading to ambiguity on the one to choose. Combining heuristic allows to use a second heuristic to break such a tie.

Note that if there is a tie after applying the heuristic supplied by the user, the algorithm will choose the left-most pattern with the highest score.

Influence on the semantic

Heuristics might have an influence on the semantic of the language if the resulting decision tree is used to guide evaluation, as it can be the case in a lazy language.

“But how do I choose?”

Detailed benchmarks are given in section 9 of Maranget's paper, in terms of code size and average path length in a prototype compiler, both for single and combined heuristics (up to 3 combinations). A conclusion to his findings is given in section 9.2 and is reproduced here:

  1. Good primary heuristics are firstRow, neededPrefix and constructorPrefix. This demonstrates the importance of considering clause order in heuristics.
  2. If we limit choice to combinations of at most two heuristics, fewerChildRule is a good complement to all primary heuristics. smallBranchingFactor looks sufficient to break the ties left by neededPrefix and constructorPrefix.
  3. If we limit choice to heuristics that are simple to compute, that is if we eliminate neededColumns, neededPrefix, fewerChildRule and leafEdge , then good choices are:

type Index = Int Source #

The index of the column of patterns

type Score = Int Source #

data Heuristic ident tag expr out Source #

Constructors

Score ([[Skel ident tag]] -> Index -> Select expr tag -> [Skel ident tag] -> Score)

Computes the Score for a given column. It may use the entire pattern matrix but it is also given the index of the column, the expression being matched and the column being matched.

Combine (Heuristic ident tag expr out) (Heuristic ident tag expr out)

Combine two heuristics: compute an initial score with the first heuristic and, if several columns have obtained the same score, use the second heuristic to choose among them.

seqHeuristics :: [Heuristic ident tag expr out] -> Heuristic ident tag expr out Source #

Combine a list of heuristics from left-to-right, defaulting to using no heuristic. Defined as foldr Combine noHeuristic.

Simple heuristics

A set of simple and cheap to compute heuristics.

firstRow :: Heuristic ident tag expr out Source #

This heuristic favours columns whose top pattern is a generalized constructor pattern. If the first pattern is a wildcard, the heuristic gives \(0\) and \(1\) otherwise.

smallDefault :: Heuristic ident tag expr out Source #

This heuristic favours columns with the least number of wildcard patterns. If \(v(i)\) is the number of wildcards in column \(i\), then \(-v(i)\) is the score of column \(i\).

smallBranchingFactor :: IsTag tag => Heuristic ident tag expr out Source #

Favours columns resulting in smaller switches. The score of a column is the number of branches of the switch resulting of the compilation (including an eventually default branch), negated.

arity :: Heuristic ident tag expr out Source #

The sum of the arity of the constructors of this column, negated.

Expensive heuristics

The following heuristics are deemed expensive as they require manipulation on the matrix of patterns to compute a score.

leafEdge :: IsTag tag => Heuristic ident tag expr out Source #

The score is the number of children of the emitted switch node that are leaves.

fewerChildRule :: IsTag tag => Heuristic ident tag expr out Source #

This heuristic favours columns that lead to fewer rows to test.

Necessity based heuristics

A column \(i\) is deemed necessary for row \(j\) when all paths to \(j\), in all possible decision trees, tests column \(i\). A column \(i\) is necessary if it is necessary for all outputs.

It seems sensible to favour useful columns over non-useful ones as, by definition a useful column will be tested in all paths, whether we choose it or not. Choosing it early might however result in shorter paths.

Necessity is computed according to the algorithm in section 3 of Warnings for pattern matching.

neededColumns :: IsTag tag => Heuristic ident tag expr out Source #

The score is the number of output needing this column.

neededPrefix :: IsTag tag => Heuristic ident tag expr out Source #

The score is the number of consecutive outputs needing the column.

constructorPrefix :: IsTag tag => Heuristic ident tag expr out Source #

A cheaper version of neededPrefix, where a pattern counts in the score if it is a constructor pattern.

Pseudo heuristics

The following heuristics are called pseudo-heuristics as they do not compute a score based on the patterns but rather on the expressions being matched, such as shorterOccurence or simply on the position of the column in the matrix, such as noHeuristic or reverseNoHeuristic. They make for good default heuristic, either alone or as the last heuristic of a combination.

noHeuristic :: Heuristic ident tag expr out Source #

Leaves the column in the same order by giving the score \(-i\) to column \(i\).

reverseNoHeuristic :: Heuristic ident tag expr out Source #

Reverse the order of the columns by giving the score \(i\) to column \(i\). It can be useful in combination with another heuristic to reverse the left-to-right bias of this implementation.

shorterOccurence :: (Select expr tag -> Score) -> Heuristic ident tag expr out Source #

This heuristic is called a pseudo-heuristic as it doesn't operate on the patterns but on the expression. It is most useful as a last resort heuristic in combination with others.