--  Compiler Toolkit: Self-optimizing lexers
--
--  Author : Manuel M. T. Chakravarty
--  Created: 2 March 99
--
--  Copyright (c) 1999 Manuel M. T. Chakravarty
--
--  This library is free software; you can redistribute it and/or
--  modify it under the terms of the GNU Library General Public
--  License as published by the Free Software Foundation; either
--  version 2 of the License, or (at your option) any later version.
--
--  This library is distributed in the hope that it will be useful,
--  but WITHOUT ANY WARRANTY; without even the implied warranty of
--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
--  Library General Public License for more details.
--
--- DESCRIPTION ---------------------------------------------------------------
--
--  Self-optimizing lexer combinators.
--
--  For detailed information, see ``Lazy Lexing is Fast'', Manuel
--  M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
--  Fourth Fuji International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.  (See my Web page for details.)
--
--  Thanks to Simon L. Peyton Jones and Roman Lechtchinsky for their helpful
--  suggestions that improved the design of this library.
--
--- DOCU ----------------------------------------------------------------------
--
--  language: Haskell 98
--
--  The idea is to combine the benefits of off-line generators with
--  combinators like in `Parsers.hs' (which builds on Swierstra/Duponcheel's
--  technique for self-optimizing parser combinators).  In essence, a state
--  transition graph representing a lexer table is computed on the fly, to
--  make lexing deterministic and based on cheap table lookups.
--
--  Regular expressions map to Haskell expressions as follows.  If `x' and
--  `y' are regular expressions,
--
--    (empty) -> epsilon
--    xy      -> x +> y
--    x*y     -> x `star` y
--    x+y     -> x `plus` y
--    x?y     -> x `quest` y
--
--  Given such a Haskellized regular expression `hre', we can use
--
--    (1) hre `lexaction` \lexeme -> Nothing
--    (2) hre `lexaction` \lexeme -> Just token
--    (3) hre `lexmeta`   \lexeme pos s -> (res, pos', s', Nothing)
--    (4) hre `lexmeta`   \lexeme pos s -> (res, pos', s', Just l)
--
--  where `epsilon' is required at the end of `hre' if it otherwise ends on
--  `star', `plus', or `quest', and then, we have
--
--    (1) discards the `lexeme' accepted by `hre',
--    (2) turns the `lexeme' accepted by `hre' into a token,
--    (3) while discarding the lexeme accepted by `hre', transforms the
--        position and/or user state, and
--    (4) while discarding the lexeme accepted by `hre', transforms the
--        position and/or user state and returns a lexer to be used for the
--        next lexeme.
--
--  The component `res' of a meta action's result can be `Nothing', `Just
--  (Left err)', or `Just (Right token)' to return nothing, an error, or a
--  token from the meta action, respectively.
--
--  * By adding `ctrlLexer', `Positions' are properly handled in the presence
--    of layout control characters.
--
--  * This module makes essential use of graphical data structures (for
--    representing the state transition graph) and laziness (for maintaining
--    the last action in `execLexer').
--
--  NOTES:
--
--  * In this implementation, the combinators `quest', `star', and `plus' are
--    *right* associative - this was different in the ``Lazy Lexing is Fast''
--    paper.  This change was made on a suggestion by Martin Norbäck.
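--
--  As a small usage sketch (added for illustration; `digit' and `numeral'
--  are names invented here), a lexer for decimal numerals reads
--
--    digit :: Regexp s Int
--    digit  = alt ['0'..'9']
--
--    numeral :: Lexer s Int
--    numeral  = digit `plus` epsilon
--               `lexaction` \lexeme _pos -> Just (read lexeme)
--
--  Note the trailing `epsilon', which is required because the regular
--  expression otherwise ends on `plus'.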
--
--- TODO ----------------------------------------------------------------------
--
--  * error correction is missing
--
--  * in (>||<) in the last case, `(addBoundsNum bn bn')' is too simple, as
--    the number of outgoing edges is not the sum of the numbers of the
--    individual states when there are conflicting edges, ie, ones labelled
--    with the same character; however, the number is only used to decide a
--    heuristic, so it is questionable whether it is worth spending the
--    additional effort of computing the accurate number
--
--  * Unicode poses a problem, as the character domain becomes too big for
--    using arrays to represent transition tables, and even sparse structures
--    will impose a significant overhead when character ranges are naively
--    represented.  So, it might be time for finite maps again.
--
--    Regarding the character ranges, there seem to be at least two
--    possibilities.  Doaitse explicitly uses ranges and avoids expanding
--    them.  The problem with this approach is that we may only have
--    predicates such as `isAlphaNum' to determine whether a given character
--    belongs to some character class.  From this representation it is
--    difficult to efficiently compute a range.  The second approach, as
--    proposed by Tom Pledger (on the Haskell list), would be to use the
--    predicates directly and make the whole business efficient by caching
--    predicate queries.  In other words, for any given character, after we
--    have determined once (in a given state) what the following state on
--    accepting that character is, we need not consult the predicates again
--    if we memoise the successor state the first time around.  (A sketch of
--    this idea follows at the end of this TODO list.)
--
--  * Ken Shan writes ``Section 4.3 of your paper computes the definition
--
--      re1 `star` re2 = \l' -> let self = re1 self >||< re2 l' in self
--
--    If we let re2 = epsilon, we get
--
--      many :: Regexp s t -> Regexp s t
--      many re = \l' -> let self = re self >||< l' in self
--
--    since epsilon = id.''  This should actually be as good as the current
--    definition, and it might be worthwhile to offer it as a variant.
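--
--  The caching idea above might be prototyped as follows (a sketch under
--  invented names; `Cache' and `transition' are not part of this module,
--  and `Data.Map' merely stands in for whatever finite map is chosen):
--
--    import qualified Data.Map as Map
--
--    type Cache s t = Map.Map Char (Maybe (Lexer s t))
--
--    -- look up the successor state for `c'; on a cache miss, consult the
--    -- predicate once and memoise the answer
--    transition :: (Char -> Bool)    -- character-class predicate
--               -> Lexer s t         -- successor state if the predicate holds
--               -> Cache s t -> Char -> (Maybe (Lexer s t), Cache s t)
--    transition p next cache c =
--      case Map.lookup c cache of
--        Just res -> (res, cache)    -- predicate already consulted for `c'
--        Nothing  -> let res = if p c then Just next else Nothing
--                    in (res, Map.insert c res cache)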
--
module Text.Lexers (Regexp, Lexer, Action, epsilon, char, (+>), lexaction,
                    lexactionErr, lexmeta, (>|<), (>||<), ctrlChars, ctrlLexer,
                    star, plus, quest, alt, string, LexerState, execLexer)
where

import Data.Maybe (fromMaybe)
import Data.Array (Array, (!), assocs, accumArray)

import Language.C.Data.Position

import qualified Data.DList as DL
import Data.Errors (interr, ErrorLevel(..), Error, makeError)

infixr 4 `quest`, `star`, `plus`
infixl 3 +>, `lexaction`, `lexmeta`
infixl 2 >|<, >||<


-- constants
-- ---------

-- | we use the dense representation if a table has at least the given number
--   of (non-error) elements
--
denseMin :: Int
denseMin  = 20


-- data structures
-- ---------------

-- | represents the number of (non-error) elements and the bounds of a table
--
type BoundsNum = (Int, Char, Char)

-- | combine two bounds
--
addBoundsNum :: BoundsNum -> BoundsNum -> BoundsNum
addBoundsNum (n, lc, hc) (n', lc', hc')  = (n + n', min lc lc', max hc hc')

-- | check whether a character is within the bounds
--
inBounds :: Char -> BoundsNum -> Bool
inBounds c (_, lc, hc) = c >= lc && c <= hc

-- | Lexical actions take a lexeme with its position and may return a token;
--   in a variant, an error can be returned
--
--   * if no token is returned, the current lexeme is discarded and lexing
--     continues looking for a token
--
type Action    t = String -> Position -> Maybe t
type ActionErr t = String -> Position -> Either Error t

-- | Meta actions transform the lexeme, position, and a user-defined state;
--   they may return a lexer, which is then used for accepting the next token
--   (this is important to implement non-regular behaviour like nested
--   comments)
--
type Meta s t = String -> Position -> s -> (Maybe (Either Error t), -- err/tok?
                                            Position,               -- new pos
                                            s,                      -- state
                                            Maybe (Lexer s t))      -- lexer?

-- | tree structure used to represent the lexer table
--
--   * each node in the tree corresponds to a state of the lexer; the
--     associated actions are those that apply when the corresponding state
--     is reached
--
data Lexer s t = Lexer (LexAction s t) (Cont s t)

-- | represent the continuation of a lexer
--
data Cont s t = -- on top of the tree, where entries are dense, we use arrays
                --
                Dense BoundsNum (Array Char (Lexer s t))
                --
                -- further down, where the valid entries are sparse, we use
                -- association lists, to save memory (the first argument is
                -- the length of the list)
                --
              | Sparse BoundsNum [(Char, Lexer s t)]
                --
                -- end of an automaton
                --
              | Done
--              deriving Show

-- | lexical action
--
data LexAction s t = Action (Meta s t) | NoAction
--                   deriving Show

-- | a regular expression
--
type Regexp s t = Lexer s t -> Lexer s t


-- basic combinators
-- -----------------

-- | Empty lexeme
--
epsilon :: Regexp s t
epsilon  = id

-- | One-character regexp
--
char :: Char -> Regexp s t
char c  = \l -> Lexer NoAction (Sparse (1, c, c) [(c, l)])

-- | Concatenation of regexps
--
(+>) :: Regexp s t -> Regexp s t -> Regexp s t
(+>)  = (.)
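
-- As an illustrative aside (not part of the original module): since a
-- `Regexp' is a function from lexers to lexers, concatenation is plain
-- function composition.  The name `exampleIf' is invented here.
--
exampleIf :: Regexp s t
exampleIf  = char 'i' +> char 'f'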
-- | Close a regular expression with an action that converts the lexeme into
--   a token
--
--   * Note: After the application of the action, the position is advanced
--           according to the length of the lexeme.  This implies that normal
--           actions should not be used in the case where a lexeme might
--           contain control characters that imply non-standard changes of
--           the position, such as newlines or tabs.
--
lexaction :: Regexp s t -> Action t -> Lexer s t
lexaction re a  = re `lexmeta` a'
  where
    a' lexeme pos s = let pos' = incPos pos (length lexeme)
                      in pos' `seq` case a lexeme pos of
                        Nothing -> (Nothing, pos', s, Nothing)
                        Just t  -> (Just (Right t), pos', s, Nothing)

-- | Variant for actions that may return an error
--
lexactionErr :: Regexp s t -> ActionErr t -> Lexer s t
lexactionErr re a  = re `lexmeta` a'
  where
    a' lexeme pos s = let pos' = incPos pos (length lexeme)
                      in pos' `seq` (Just (a lexeme pos), pos', s, Nothing)

-- | Close a regular expression with a meta action
--
--   * Note: Meta actions have to advance the position in dependence of the
--           lexeme by themselves.
--
lexmeta :: Regexp s t -> Meta s t -> Lexer s t
lexmeta re a  = re (Lexer (Action a) Done)

-- | disjunctive combination of two regexps
--
(>|<) :: Regexp s t -> Regexp s t -> Regexp s t
re >|< re'  = \l -> re l >||< re' l

-- | disjunctive combination of two lexers
--
(>||<) :: Lexer s t -> Lexer s t -> Lexer s t
(Lexer a c) >||< (Lexer a' c')  = Lexer (joinActions a a') (joinConts c c')

-- | combine two disjunctive continuations
--
joinConts :: Cont s t -> Cont s t -> Cont s t
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      in
                      -- note: `addBoundsNum' can, at this point, only
                      -- approximate the number of *non-overlapping* cases;
                      -- however, the bounds are correct
                      --
                      aggregate (addBoundsNum bn bn') (cls ++ cls')
  where
    listify (Dense  n arr) = (n, assocs arr)
    listify (Sparse n cls) = (n, cls)
    listify _              = interr "Lexers.listify: Impossible argument!"

-- | combine two actions
--
joinActions :: LexAction s t -> LexAction s t -> LexAction s t
joinActions NoAction a'       = a'
joinActions a        NoAction = a
joinActions _        _        = interr "Lexers.>||<: Overlapping actions!"
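
-- A small sketch (for illustration only; `exampleKeyword' is an invented
-- name): two alternative keywords combined with `>|<' and closed into a
-- lexer that returns the lexeme itself as the token.  Note the
-- parentheses: `lexaction' binds tighter than `>|<'.
--
exampleKeyword :: Lexer s String
exampleKeyword  = (string "if" >|< string "else")
                  `lexaction` \lexeme _pos -> Just lexeme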
-- | Note: `n' is only an upper bound of the number of non-overlapping cases
--
aggregate :: BoundsNum -> [(Char, Lexer s t)] -> Cont s t
aggregate bn@(n, lc, hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) noLexer (lc, hc) cls)
  | otherwise     = Sparse bn (accum (>||<) cls)
  where
    noLexer = Lexer NoAction Done

-- | combine the elements in the association list that have the same key
--
accum :: Eq a => (b -> b -> b) -> [(a, b)] -> [(a, b)]
accum _ []            = []
accum f ((k, e):kes)  =
  let (ke, kes') = gather k e kes
  in
  ke : accum f kes'
  where
    gather k' e' []                    = ((k', e'), [])
    gather k' e' (ke'@(k'2, e'2):kes')
      | k' == k'2 = gather k' (f e' e'2) kes'
      | otherwise = let (ke'2, kes'2) = gather k' e' kes'
                    in
                    (ke'2, ke':kes'2)


-- handling of control characters
-- ------------------------------

-- | control characters recognized by `ctrlLexer'
--
ctrlChars :: [Char]
ctrlChars  = ['\n', '\r', '\f', '\t']

-- | control lexer
--
--   * implements proper `Position' management in the presence of the
--     standard layout control characters
--
ctrlLexer :: Lexer s t
ctrlLexer  =
       char '\n' `lexmeta` newline
  >||< char '\r' `lexmeta` newline
  >||< char '\v' `lexmeta` newline
  >||< char '\f' `lexmeta` formfeed
  >||< char '\t' `lexmeta` tab
  where
    newline  _ pos s = (Nothing, retPos pos  , s, Nothing)
    formfeed _ pos s = (Nothing, incPos pos 1, s, Nothing)
    tab      _ pos s = (Nothing, incPos pos 8, s, Nothing)


-- non-basic combinators
-- ---------------------

-- | x `star` y corresponds to the regular expression x*y
--
star :: Regexp s t -> Regexp s t -> Regexp s t
--
-- The definition used below can be obtained by equational reasoning from this
-- one (which is much easier to understand):
--
--   star re1 re2 = let self = (re1 +> self >|< epsilon) in self +> re2
--
-- However, in the above, `self' is of type `Regexp s t' (ie, a functional),
-- whereas below it is of type `Lexer s t'.  Thus, below we have a graphical
-- body (a finite representation of an infinite structure), which doesn't
-- grow with the size of the accepted lexeme - in contrast to the definition
-- using functional recursion.
--
star re1 re2  = \l -> let self = re1 self >||< re2 l
                      in
                      self

-- | x `plus` y corresponds to the regular expression x+y
--
plus :: Regexp s t -> Regexp s t -> Regexp s t
plus re1 re2  = re1 +> (re1 `star` re2)

-- | x `quest` y corresponds to the regular expression x?y
--
quest :: Regexp s t -> Regexp s t -> Regexp s t
quest re1 re2  = (re1 +> re2) >|< re2

-- | accepts a non-empty set of alternative characters
--
alt :: [Char] -> Regexp s t
--
-- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
--
alt []  = interr "Lexers.alt: Empty character set!"
alt cs  = \l -> let bnds = (length cs, minimum cs, maximum cs)
                in
                Lexer NoAction (aggregate bnds [(c, l) | c <- cs])

-- | accepts a character sequence
--
string :: String -> Regexp s t
string []  = interr "Lexers.string: Empty character sequence!"
string cs  = (foldr1 (+>) . map char) cs
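
-- For illustration (a sketch; `exampleIdent' is an invented name): an
-- identifier regexp of the usual letter-then-alphanumerics shape, closed
-- with `epsilon' as required for regexps ending on `star':
--
exampleIdent :: Regexp s t
exampleIdent  = alt ['a'..'z'] +> alt (['a'..'z'] ++ ['0'..'9']) `star` epsilon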

-- execution of a lexer
-- --------------------

-- | threaded top-down during lexing (current input, current position, meta
--   state)
--
type LexerState s = (String, Position, s)

-- | apply a lexer, yielding a token sequence and a list of errors
--
--   * Currently, all errors are fatal; thus, the result is undefined in the
--     case of an error (this changes when error correction is added).
--
--   * The final lexer state is returned.
--
--   * The order of the error messages is undefined.
--
execLexer :: Lexer s t -> LexerState s -> ([t], LexerState s, [Error])
--
-- * the following is moderately tuned
--
execLexer _ state@([], _, _) = ([], state, [])
execLexer l state            =
  case lexOne l state of
    (Nothing , _ , state') -> execLexer l state'
    (Just res, l', state') -> let (ts, final, allErrs) = execLexer l' state'
                              in case res of
                                Left  err -> (ts  , final, err:allErrs)
                                Right t   -> (t:ts, final, allErrs)
  where
    -- accept a single lexeme
    --
    -- lexOne :: Lexer s t -> LexerState s
    --        -> (Maybe (Either Error t), Lexer s t, LexerState s)
    lexOne l0 state' = oneLexeme l0 state' DL.empty lexErr
      where
        -- the result triple of `lexOne' that signals a lexical error; the
        -- result state is advanced by one character for error correction
        --
        lexErr = let
                   (cs, pos, s) = state'
                   err = makeError LevelError pos
                           ["Lexical error!",
                            "The character " ++ show (head cs) ++
                            " does not fit here; skipping it."]
                 in
                 (Just (Left err), l, (tail cs, incPos pos 1, s))

        -- we take an open list of characters down, in which we accumulate
        -- the lexeme; this function returns maybe a token or an error, the
        -- next lexer to use (it can be altered by a meta action), and the
        -- new lexer state
        --
        -- we implement the "principle of the longest match" by taking a
        -- potential result triple down (in the last argument); the potential
        -- result triple is updated whenever we pass by an action (different
        -- from `NoAction'); initially it is the error result
        --
        -- oneLexeme :: Lexer s t
        --           -> LexerState s
        --           -> DL.DList Char
        --           -> (Maybe (Either Error t), Lexer s t, LexerState s)
        --           -> (Maybe (Either Error t), Lexer s t, LexerState s)
        oneLexeme (Lexer a cont') state''@(cs, pos, s) csDL last' =
          let last'' = action a csDL state'' last'
          in case cs of
               []      -> last''
               (c:cs') -> oneChar cont' c (cs', pos, s) csDL last''

        oneChar Done            _ _       _    last' = last'
        oneChar (Dense bn arr)  c state'' csDL last'
          | c `inBounds` bn = cont (arr!c) c state'' csDL last'
          | otherwise       = last'
        oneChar (Sparse bn cls) c state'' csDL last'
          | c `inBounds` bn = case lookup c cls of
                                Nothing -> last'
                                Just l' -> cont l' c state'' csDL last'
          | otherwise       = last'

        -- continue within the current lexeme
        --
        cont l' c state'' csDL last' =
          oneLexeme l' state'' (csDL `DL.snoc` c) last'

        -- execute the action if present and finalise the current lexeme
        --
        action (Action f) csDL (cs, pos, s) _last =
          case f (DL.toList csDL) pos s of
            (Nothing, pos', s', l') | not . null $ cs ->
              lexOne (fromMaybe l0 l') (cs, pos', s')
            (res    , pos', s', l') ->
              (res, fromMaybe l0 l', (cs, pos', s'))
        action NoAction _csDL _state last' = last'    -- no change
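
-- Finally, a usage sketch for `execLexer' (for illustration only;
-- `exampleNumerals' is an invented name, and `initPos' is assumed to come
-- from Language.C.Data.Position): tokenise runs of digits, discarding
-- single blanks via an action that returns `Nothing'.
--
exampleNumerals :: String -> ([String], LexerState (), [Error])
exampleNumerals inp = execLexer lexer (inp, initPos "<example>", ())
  where
    digits = alt ['0'..'9'] `plus` epsilon
    lexer  =      (digits   `lexaction` \lexeme _pos -> Just lexeme)
             >||< (char ' ' `lexaction` \_      _pos -> Nothing)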