parser-regex
Regex based parsers
Features
- Parsers based on regular expressions,
capable of parsing regular languages.
Note that there are no extra features to make parsing non-regular languages
possible.
- Regexes are composed using combinators.
- Resumable parsing of sequences of any type containing values of any type.
- Special support for
Text
and String
in the form of convenient combinators
and operations like find and replace.
- Parsing runtime is linear in the length of the sequence being parsed. No
exponential backtracking.
Examples
Versus regex patterns
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Can you guess what this matches?
This is a non-validating regex to extract parts of a URI, from
RFC 3986. It can
be translated as follows.
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (optional)
import Data.Text (Text)
import Regex.Text (REText)
import qualified Regex.Text as R
import qualified Data.CharSet as CS
data URI = URI
{ scheme :: Maybe Text
, authority :: Maybe Text
, path :: Text
, query :: Maybe Text
, fragment :: Maybe Text
} deriving Show
uriRE :: REText URI
uriRE = URI
<$> optional (R.someTextOf (CS.not ":/?#") <* R.char ':')
<*> optional (R.text "//" *> R.manyTextOf (CS.not "/?#"))
<*> R.manyTextOf (CS.not "?#")
<*> optional (R.char '?' *> R.manyTextOf (CS.not "#"))
<*> optional (R.char '#' *> R.manyText)
>>> R.reParse uriRE "https://github.com/meooow25/parser-regex?tab=readme-ov-file#parser-regex"
Just (URI { scheme = Just "https"
, authority = Just "github.com"
, path = "/meooow25/parser-regex"
, query = Just "tab=readme-ov-file"
, fragment = Just "parser-regex" })
More parsing
Parsing is straightforward, even for tasks which may be impractical with
submatch extraction typically offered by regex libraries.
import Control.Applicative ((<|>))
import Data.Text (Text)
import Regex.Text (REText)
import qualified Regex.Text as R
import qualified Data.CharSet as CS
data Expr
= Var Text
| Expr :+ Expr
| Expr :- Expr
| Expr :* Expr
deriving Show
exprRE :: REText Expr
exprRE = var `R.chainl1` mul `R.chainl1` (add <|> sub)
where
var = Var <$> R.someTextOf CS.asciiLower
add = (:+) <$ R.char '+'
sub = (:-) <$ R.char '-'
mul = (:*) <$ R.char '*'
>>> import qualified Regex.Text as R
>>> R.reParse exprRE "a+b-c*d*e+f"
Just (((Var "a" :+ Var "b") :- ((Var "c" :* Var "d") :* Var "e")) :+ Var "f")
Find and replace
Find and replace using regexes are supported for Text
and lists.
>>> import Control.Applicative ((<|>))
>>> import qualified Data.Text as T
>>> import qualified Regex.Text as R
>>>
>>> data Color = Blue | Orange deriving Show
>>> let re = Blue <$ R.text "blue" <|> Orange <$ R.text "orange"
>>> R.find re "color: orange"
Just Orange
>>>
>>> let re = T.toUpper <$> (R.text "cat" <|> R.text "dog" <|> R.text "fish")
>>> R.replaceAll re "locate selfish hotdog"
"loCATe selFISH hotDOG"
Parse any sequence
Regexes are not restricted to parsing text. For example, one may parse vectors
from the vector library, because
why not.
import Regex.Base (Parser)
import qualified Regex.Base as R
import qualified Data.Vector.Generic as VG
parseVector :: VG.Vector v c => Parser c a -> v c -> Maybe a
parseVector = R.parseFoldr VG.foldr
>>> import Control.Applicative (many)
>>> import qualified Data.Vector as V
>>> import qualified Regex.Base as R
>>>
>>> let p = R.compile $ many ((,) <$> R.satisfy even <*> R.satisfy odd)
>>> let v = V.fromList [0..5] :: V.Vector Int
>>> parseVector p v
Just [(0,1),(2,3),(4,5)]
Documentation
Documentation is available on Hackage:
parser-regex
Already familiar with regex patterns? See the
Regex pattern cheat sheet.
Alternatives
regex-applicative
regex-applicative
is
the primary inspiration for this library, and is similar in many ways.
parser-regex
attempts to be a more efficient and featureful library built on
the ideas of regex-applicative
, though it does not aim to provide a superset
of regex-applicative
's API.
Traditional regex libraries
These libraries use regex patterns.
Consider using these if
- The terseness of regex patterns is well-suited for your use case.
- You need something very fast for typical use cases.
regex-pcre
,
regex-pcre-builtin
, pcre-light
, pcre-heavy
are faster than
parser-regex
for typical use cases, but there are trade-offs—such as losing
Unicode support and a risk of ReDoS.
Use parser-regex
instead if
- You prefer parser combinators over regex patterns
- You need more powerful parsing capabilities than just submatch extraction
- You need to parse a sequence that is not supported by the above libraries
For a detailed comparison of regex libraries,
see here.
Other options
If you are not restricted to regexes, there are many other parsing libraries you
may use, too many to list here. See the
"Parsing" category on Hackage
for a start.
Contributing
Questions, bug reports, documentation improvements, code contributions welcome!
Please open an issue as the
first step.