Safe Haskell	None
Language	Haskell2010

Language.Lexer.Applicative

Contents

Building a Lexer
- Building Recognizers
Running a Lexer
- Working with a token stream

Description

For some background, see https://ro-che.info/articles/2015-01-02-lexical-analysis

Synopsis

Building a Lexer

data Lexer tok Source

A Lexer specification consists of two recognizers: one for meaningful tokens and one for whitespace and comments.

Although you can construct Lexers directly, it is more convenient to build them with token, whitespace, and the Monoid instance like this:

 myLexer :: Lexer MyToken
 myLexer = mconcat
   [ token      (longest myToken)
   , whitespace (longest myWhiteSpace)
   , whitespace (longestShortest myCommentPrefix myCommentSuffix)
   ]

Constructors

Lexer
Fields lexerTokenRE :: Recognizer tok lexerWhitespaceRE :: Recognizer ()

Instances

Functor Lexer
Monoid (Lexer tok)

token :: Recognizer tok -> Lexer tok Source

Build a lexer with the given token recognizer and no (i.e. mempty) whitespace recognizer.

token is a monoid homomorphism:

token a <> token b = token (a <> b)

whitespace :: Recognizer a -> Lexer tok Source

Build a lexer with the given whitespace recognizer and no (i.e. mempty) token recognizer.

whitespace is a monoid homomorphism:

whitespace a <> whitespace b = whitespace (a <> b)

Building Recognizers

data Recognizer tok Source

A token recognizer

Recognizer values are constructed by functions like longest and longestShortest, combined with mappend, and used by token and whitespace.

When a recognizer returns without consuming any characters, a lexical error is signaled.

Instances

Functor Recognizer
Monoid (Recognizer tok)

longest :: RE Char tok -> Recognizer tok Source

When scanning a next token, the regular expression will compete with the other Recognizers of its Lexer. If it wins, its result will become the next token.

longest has the following properties:

longest (r1 <|> r2) = longest r1 <> longest r2

```
longest r = longestShortest r pure
```

longestShortest Source

Arguments

:: RE Char pref	regex for the longest prefix
-> (pref -> RE Char tok)	regex for the shortest suffix
-> Recognizer tok

This is a more sophisticated recognizer than longest.

It recognizes a token consisting of a prefix and a suffix, where prefix is chosen longest, and suffix is chosen shortest.

An example would be a C block comment

/* comment text */

The naive

longest (string "/*" *> many anySym *> string "*/")

doesn't work because it consumes too much: in

/* xxx */ yyy /* zzz */

it will treat the whole line as a comment.

This is where longestShortest comes in handy:

longestShortest
   (string "/*")
   (\_ -> many anySym *> string "*/")

Operationally, the prefix regex first competes with other Recognizers for the longest match. If it wins, then the shortest match for the suffix regex is found, and the two results are combined with the given function to produce a token.

The two regular expressions combined must consume some input, or else LexicalError is thrown. However, any one of them may return without consuming input.

* * *

Once the prefix regex wins, the choice is committed; the suffix regex must match or else a LexicalError is thrown. Therefore,

longestShortest pref suff1
         <>
longestShortest pref suff2
         =
longestShortest pref suff1

and is not the same as

longestShortest pref (suff1 <|> suff2)

The following holds, however:

longestShortest pref1 suff
         <>
longestShortest pref2 suff
         =
longestShortest (pref1 <|> pref2) suff

Running a Lexer

runLexer Source

Arguments

:: Lexer tok	lexer specification
-> String	source file name (used in locations)
-> String	source text
-> TokenStream (L tok)

Run a lexer on a string and produce a lazy stream of tokens

Working with a token stream

data TokenStream tok Source

A stream of tokens

Constructors

TsToken tok (TokenStream tok)
TsEof
TsError LexicalError

Instances

Functor TokenStream
Eq tok => Eq (TokenStream tok)
Show tok => Show (TokenStream tok)

streamToList :: TokenStream tok -> [tok] Source

Convert a TokenStream to a list of tokens. Turn TsError into a runtime LexicalError exception.

streamToEitherList :: TokenStream tok -> Either LexicalError [tok] Source

Convert a TokenStream into either a token list or a LexicalError. This function may be occasionally useful, but in general its use is discouraged because it needs to force the whole stream before returning a result.

data LexicalError Source

The lexical error exception

Constructors

LexicalError !Pos

Instances

Eq LexicalError
Show LexicalError
Exception LexicalError
Typeable * LexicalError