Earley: Parsing all context-free grammars using Earley's algorithm.

See https://www.github.com/ollef/Earley for more information and https://github.com/ollef/Earley/tree/master/examples for examples.


Properties

Versions 0.6.0, 0.7.0, 0.7.1, 0.8.0, 0.8.1, 0.8.2, 0.8.3, 0.9.0, 0.10.0, 0.10.0.1, 0.10.1.0, 0.11.0.0, 0.11.0.1, 0.12.0.0, 0.12.0.1, 0.12.1.0, 0.13.0.0, 0.13.0.1
Change log CHANGELOG.md
Dependencies base (>=4.6 && <5), Earley, ListLike (>=4.1), semigroups (>=0.18), unordered-containers (>=0.2) [details]
License BSD-3-Clause
Copyright (c) 2014-2019 Olle Fredriksson
Author Olle Fredriksson
Maintainer fredriksson.olle@gmail.com
Category Parsing
Source repo head: git clone https://github.com/ollef/Earley.git
Uploaded by OlleFredriksson at 2019-02-24T15:59:58Z

Flags

Manual Flags

Name      Description     Default
examples  Build examples  Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Readme for Earley-0.13.0.1

Earley

Go to the API documentation on Hackage.

This (Text.Earley) is a library consisting of a few main parts:

Text.Earley.Grammar

An embedded context-free grammar (CFG) domain-specific language (DSL) with semantic action specification in applicative style.

An example of a typical expression grammar working on an input tokenised into strings is the following:

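   {-# LANGUAGE RecursiveDo #-}  -- enables the mdo notation used below
   import Control.Applicative ((<|>))
   import Data.Char (isAlpha)
   import Text.Earley

   -- Abstract syntax built by the semantic actions
   -- (constructors inferred from their use in the grammar).
   data Expr
     = Add Expr Expr
     | Mul Expr Expr
     | Var String
     deriving (Eq, Show)

   expr :: Grammar r (Prod r String String Expr)
   expr = mdo
     x1 <- rule $ Add <$> x1 <* namedToken "+" <*> x2
               <|> x2
               <?> "sum"
     x2 <- rule $ Mul <$> x2 <* namedToken "*" <*> x3
               <|> x3
               <?> "product"
     x3 <- rule $ Var <$> (satisfy ident <?> "identifier")
               <|> namedToken "(" *> x1 <* namedToken ")"
     return x1
     where
       -- A token counts as an identifier if it starts with a letter.
       ident (x:_) = isAlpha x
       ident _     = False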

Text.Earley.Parser

An implementation of (a modification of) the Earley parsing algorithm.

To invoke the parser on the above grammar, run e.g. the following (here using words as a naive tokeniser):

   fullParses (parser expr) $ words "a + b * ( c + d )"
   = ( [Add (Var "a") (Mul (Var "b") (Add (Var "c") (Var "d")))]
     , Report {...}
     )

Note that we get a list of all the possible parses (though in this case there is only one).
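To see more than one parse come back, the grammar has to be ambiguous. The following is a minimal sketch, not taken from the library's examples: the Sum type and the names ambiguous and parses are made up for illustration, and the input is tokenised per character rather than with words.

```haskell
{-# LANGUAGE RecursiveDo #-}
import Control.Applicative ((<|>))
import Data.Char (isDigit)
import Text.Earley

-- Hypothetical AST for this sketch (not part of the library).
data Sum
  = Lit Char
  | Plus Sum Sum
  deriving (Eq, Show)

-- Deliberately ambiguous: both operands of '+' are the same rule,
-- so "1+2+3" can be grouped to the left or to the right.
ambiguous :: Grammar r (Prod r String Char Sum)
ambiguous = mdo
  s <- rule $ Plus <$> s <* token '+' <*> s
           <|> Lit <$> satisfy isDigit
  return s

-- All complete parses of the input; the Report is discarded here.
parses :: String -> [Sum]
parses input = fst (fullParses (parser ambiguous) input)

main :: IO ()
main = mapM_ print (parses "1+2+3")
```

Running main should print one line per grouping of the sum. The order of the results is not something this sketch relies on.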

Another invocation, which shows the error reporting capabilities (giving the last position that the parser reached and what it expected at that point), is the following:

   fullParses (parser expr) $ words "a +"
   = ( []
     , Report { position   = 2
              , expected   = ["(","identifier","product"]
              , unconsumed = []
              }
     )
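The Report fields shown above can be turned into a user-facing error message. This is a sketch under assumptions: parseExpr is a hypothetical helper (not a library function), the message wording is arbitrary, and the grammar is repeated only so that the block is self-contained.

```haskell
{-# LANGUAGE RecursiveDo #-}
import Control.Applicative ((<|>))
import Data.Char (isAlpha)
import Text.Earley

data Expr = Add Expr Expr | Mul Expr Expr | Var String
  deriving (Eq, Show)

-- The expression grammar from the README, repeated for self-containment.
expr :: Grammar r (Prod r String String Expr)
expr = mdo
  x1 <- rule $ Add <$> x1 <* namedToken "+" <*> x2
            <|> x2
            <?> "sum"
  x2 <- rule $ Mul <$> x2 <* namedToken "*" <*> x3
            <|> x3
            <?> "product"
  x3 <- rule $ Var <$> (satisfy ident <?> "identifier")
            <|> namedToken "(" *> x1 <* namedToken ")"
  return x1
  where
    ident (x:_) = isAlpha x
    ident _     = False

-- Hypothetical helper: exactly one parse is a success, anything else
-- becomes an error message built from the Report.
parseExpr :: [String] -> Either String Expr
parseExpr toks =
  case fullParses (parser expr) toks of
    ([e], _) -> Right e
    ([], Report { position = pos, expected = want, unconsumed = rest }) ->
      Left $ "parse error at token " ++ show pos
          ++ ": expected one of " ++ show want
          ++ ", unconsumed input: " ++ show rest
    (es, _) -> Left $ "ambiguous input: " ++ show (length es) ++ " parses"

main :: IO ()
main = mapM_ (print . parseExpr . words) ["a + b * c", "a +"]
```

Treating several parses as an error is a design choice of this sketch. A caller that wants every parse can use the list returned by fullParses directly.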

Text.Earley.Generator

Functionality to generate the members of the language that a grammar generates.

To get the language of a grammar given a list of allowed tokens, run e.g.:

   language (generator romanNumeralsGrammar "VIX")
   = [(0,""),(1,"I"),(5,"V"),(10,"X"),(20,"XX"),(11,"XI"),(15,"XV"),(6,"VI"),(9,"IX"),(4,"IV"),(2,"II"),(3,"III"),(19,"XIX"),(16,"XVI"),(14,"XIV"),(12,"XII"),(7,"VII"),(21,"XXI"),(25,"XXV"),(30,"XXX"),(31,"XXXI"),(35,"XXXV"),(8,"VIII"),(13,"XIII"),(17,"XVII"),(26,"XXVI"),(29,"XXIX"),(24,"XXIV"),(22,"XXII"),(18,"XVIII"),(36,"XXXVI"),(39,"XXXIX"),(34,"XXXIV"),(32,"XXXII"),(23,"XXIII"),(27,"XXVII"),(33,"XXXIII"),(28,"XXVIII"),(37,"XXXVII"),(38,"XXXVIII")]

The above example shows the language generated by a Roman numerals grammar limited to the tokens 'V', 'I', and 'X'.
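The romanNumeralsGrammar used above is not defined in this README. As a smaller self-contained sketch, the hypothetical grammar below (anbn, invented for illustration) accepts n 'a's followed by n 'b's and returns n. Because its language is infinite, the sketch uses upTo to bound the length of the generated members instead of calling language.

```haskell
{-# LANGUAGE RecursiveDo #-}
import Control.Applicative ((<|>))
import Text.Earley

-- Hypothetical grammar: n 'a's followed by n 'b's, yielding n.
anbn :: Grammar r (Prod r String Char Int)
anbn = mdo
  s <- rule $ pure 0
           <|> (+ 1) <$> (token 'a' *> s <* token 'b')
  return s

-- Members of the language that are at most four tokens long, each
-- paired with the semantic result computed by the grammar.
shortMembers :: [(Int, String)]
shortMembers = upTo 4 (generator anbn "ab")

main :: IO ()
main = mapM_ print shortMembers
```

Each printed pair consists of the semantic value and the token list that produced it, in the same shape as the Roman numerals output above.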

Text.Earley.Mixfix

Helper functionality for creating parsers for expressions with mixfix identifiers in the style of Agda.

How do I write grammars?

As hinted at above, the grammars are written inside Grammar, which is a Monad and MonadFix. For the library to be able to tame the recursion in the grammars, we have to use the rule function whenever a production is recursive.

Whenever you would write e.g.

...
p = foo <|> bar <*> p
...

in a conventional combinator parser library, you instead write the following:

grammar = mdo
  ...
  p <- rule $ foo <|> bar <*> p
  ...

Apart from making recursion possible (even left-recursion), rules have an additional benefit: they control where work is shared, because each rule is expanded at most once per position in the input string. If a rule is encountered more than once at a position, the work is shared.
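A small, complete instance of this pattern is sketched below with a left-recursive rule. The names difference and evalDifference are made up for illustration, and tokens are single characters.

```haskell
{-# LANGUAGE RecursiveDo #-}
import Control.Applicative ((<|>))
import Data.Char (digitToInt, isDigit)
import Text.Earley

-- Left-recursive rule: d refers to itself in the leftmost position,
-- so "9-3-2" is grouped as (9-3)-2.
difference :: Grammar r (Prod r String Char Int)
difference = mdo
  d <- rule $ (-) <$> d <* token '-' <*> digit
           <|> digit
  return d
  where
    -- Not recursive, so it does not need to be wrapped in rule.
    digit = digitToInt <$> satisfy isDigit

-- Evaluate every complete parse of the input.
evalDifference :: String -> [Int]
evalDifference input = fst (fullParses (parser difference) input)

main :: IO ()
main = print (evalDifference "9-3-2")
```

The grammar is unambiguous, so a successful parse yields a single result, and the left-recursive shape is what gives subtraction its usual left associativity.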

Compared to parser generators and combinator libraries

This library differs from the main methods that are used to write parsers in the Haskell ecosystem:

The parsing algorithm

The parsing algorithm that this library uses is based on Earley's parsing algorithm. The algorithm has been modified to produce online parse results, to give good error messages, and to allow garbage collection of the item sets. Essentially, instead of storing a sequence of sets of items as in the original algorithm, the modified algorithm just stores pointers back to sets of reachable items.
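One place where position-by-position results appear to surface in the API is allParses, which pairs each result with the input position at which it was produced and so also reports parses of prefixes of the input. The sketch below is only an illustration of that function, not a description of the internals. The names countAs and prefixParses are hypothetical.

```haskell
import Control.Applicative (many)
import Text.Earley

-- Hypothetical grammar: any number of 'a' tokens, yielding their count.
-- It is not recursive, so no rule is needed.
countAs :: Grammar r (Prod r String Char Int)
countAs = return (length <$> many (token 'a'))

-- Every result together with the input position where it was produced,
-- including parses that cover only a prefix of the input.
prefixParses :: String -> [(Int, Int)]
prefixParses input = fst (allParses (parser countAs) input)

main :: IO ()
main = do
  print (prefixParses "aab")                       -- prefix parses
  print (fst (fullParses (parser countAs) "aab"))  -- complete parses only
```

In contrast, fullParses only returns parses that consume the whole input, so it yields no result for "aab" here because the trailing 'b' is left over.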

The worst-case run time performance of the Earley parsing algorithm is cubic in the length of the input, but for large classes of grammars it is linear. It should however be noted that this library will likely be slower than most parser generators and parser combinator libraries.

The parser implements an optimisation similar to the one presented in Joop M.I.M. Leo's paper "A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead". The optimisation removes indirections in sequences of non-ambiguous backpointers between item sets.

For more in-depth information about the internals of the library, there are implementation notes currently being written.

Contact

Olle Fredriksson - https://github.com/ollef