Portability | unportable |
---|---|
Stability | unstable |
Maintainer | Michael Sloan (mgsloan@gmail.com) |
This module provides a template Haskell quasiquoter for regular expressions, which provides the following features:
1) Compile-time checking that the regular expression is valid.
2) Arity of resulting tuple based on the number of selected capture patterns in the regular expression.
3) Allows for the inline interpolation of mapping functions :: String -> a.
4) Precompiles the regular expression at compile time, by calling into the
PCRE library and storing a ByteString
literal representation of its state.
NOTE: for some unknown reason this feature is currently broken, and so off by default.
5) Compile-time configurable to use different PCRE options, turn off
precompilation, use ByteString
s, or set a default mapping expression.
Since this is a quasiquoter library that generates code using view patterns, the following extensions are required:
{-# LANGUAGE TemplateHaskell, QuasiQuotes, ViewPatterns #-}
Here's a silly example which parses peano numbers of the form Z, S Z, S S Z, etc. The s+ means that it is not sensitive to the quantity or type of seperating whitespace. (these examples can also be found in Test.hs)
peano :: String -> Maybe Int peano = [rex|^(?{ length . filter (=='S') } \s* (?:S\s+)*Z)\s*$|]
*Main> peano "Z" Just 0 *Main> peano "S Z" Just 1 *Main> peano "S S Z" Just 2 *Main> peano "S S S Z" Just 3 *Main> peano "invalid" Nothing
The token "(?{" introduces a capture group which has a mapping applied to
the -- result - in this case "length . filter (==S
)". If the ?{ ... }
are omitted, then the capture group is not taken as part of the results of
the match. If the contents of the ?{ ... } is omitted, then id
is assumed:
parsePair :: String -> Maybe (String, String) parsePair = [rex|^<\s* (?{ }[^\s,>]+) \s*,\s* (?{ }[^\s,>]+) \s*>$|]
The following example is derived from http:www.regular-expressions.info/dates.html
parseDate :: String -> Maybe (Int, Int, Int) parseDate [rex|^(?{ read -> y }(?:19|20)\d\d)[- /.] (?{ read -> m }0[1-9]|1[012])[- /.] (?{ read -> d }0[1-9]|[12][0-9]|3[01])$|] | (d > 30 && (m `elem` [4, 6, 9, 11])) || (m == 2 && (d == 29 && not (mod y 4 == 0 && (mod y 100 /= 0 || mod y 400 == 0))) || (d > 29)) = Nothing | otherwise = Just (y, m, d) parseDate _ = Nothing
The above example makes use of the regex quasi-quoter as a pattern matcher. The interpolated Haskell patterns are used to construct an implicit view pattern out of the inlined ones. The above pattern is expanded to the equivalent:
parseDate ([rex|^(?{ read }(?:19|20)\d\d)[- /.] (?{ read }0[1-9]|1[012])[- /.] (?{ read }0[1-9]|[12][0-9]|3[01])$|] -> Just (y, m, d))
Caveat: Since haskell-src-exts does not support parsing view-patterns, the above is implemented as a relatively naive split on "->". It presumes that the last "->" in the interpolated pattern seperates the pattern from an expression on the left. This allows for lambdas to be present in the expression, but prevents nesting view patterns.
There are also a few other inelegances:
1) PCRE captures, unlike .NET regular expressions, yield the last capture
made by a particular pattern. So, for example, (...)*, will only yield one
match for ...
. Ideally these would be detected and yield an implicit [a].
2) Patterns with disjunction between captures ((?{f}a) | (?{g}b)) will provide the empty string to one of f / g. In the case of pattern expressions, it would be convenient to be able to map multiple captures into a single variable / pattern, preferring the first non-empty option. The general logic for this is a bit complicated, and postponed for a later release.
3) The following error currently sometimes happens when using precompiled
regular expressions. This feature
is now off by default until this is
fixed.
<interactive>: out of memory (requested 17584491593728 bytes)
Since pcre-light is a wrapper over a C API, the most efficient interface is
ByteStrings, as it does not natively speak Haskell lists. The [rex| ... ]
quasiquoter implicitely packs the input into a bystestring, and unpacks the
results to strings before providing them to your mappers. The brex
QuasiQuoter
is provided for this purpose. You can also define your own
QuasiQuoter
- the definitions of the default configurations are as follows:
rex = rexConf False True "id" rexPCREOptions [] brex = rexConf True True "id" rexPCREOptions []
As mentioned, the other Bool determines whether precompilation is used. The string following is the default mapping expression, used when omitted. Due to GHC staging restrictions, your configuration will need to be in a different module than its usage.
Inspired by Matt Morrow's regexqq package: http://hackage.haskell.org/packages/archive/regexqq/latest/doc/html/src/Text-Regex-PCRE-QQ.html
And code from Erik Charlebois's interpolatedstring-qq package: http://hackage.haskell.org/packages/archive/interpolatedstring-qq/latest/doc/html/Text-InterpolatedString-QQ.html
- rex, brex :: QuasiQuoter
- rexConf :: Bool -> Bool -> String -> [PCREOption] -> [PCREExecOption] -> QuasiQuoter
- rexPCREOptions :: [PCREOption]
- maybeRead :: Read a => String -> Maybe a
- padRight :: a -> Int -> [a] -> [a]
- makeQuasiMultiline :: QuasiQuoter -> QuasiQuoter
Documentation
rex, brex :: QuasiQuoterSource
Default regular expression quasiquoter for String
s and ByteString
s,
respectively.
rexConf :: Bool -> Bool -> String -> [PCREOption] -> [PCREExecOption] -> QuasiQuoterSource
A configureable regular-expression QuasiQuoter. Takes the options to pass
to the PCRE engine, along with Bool
s to flag ByteString
usage and
non-compilation respecively. The provided String
indicates which mapping
function to use, when one is omitted - "(?{} ...)".
rexPCREOptions :: [PCREOption]Source
Default compilation time PCRE options. The default is extended
,
which causes whitespace to be nonsemantic, and ignores # comments.
padRight :: a -> Int -> [a] -> [a]Source
Given a desired list-length, if the passed list is too short, it is padded with the given element. Otherwise, it trims.
makeQuasiMultiline :: QuasiQuoter -> QuasiQuoterSource
This is a QuasiQuoter
transformer, which allows for a whitespace-sensitive
quasi-quoter to be broken over multiple lines. The default rex
and
brex
functions do not need this as they are already whitespace insensitive.
However, if you create your own configuration, which omits the extended
parameter, then this could be useful. The leading space of each line is
ignored, and all newlines removed.