replace-attoparsec: Stream edit, find-and-replace with Attoparsec parsers

Stream editing and find-and-replace with Attoparsec monadic parsers.


Properties

Versions 1.0.0.0, 1.0.1.0, 1.0.2.0, 1.0.3.0, 1.2.0.0, 1.2.1.0, 1.2.2.0, 1.4.0.0, 1.4.1.0, 1.4.2.0, 1.4.4.0, 1.4.5.0, 1.5.0.0
Change log CHANGELOG.md
Dependencies attoparsec, base (>=4.0 && <5.0), bytestring, text [details]
License BSD-2-Clause
Author James Brock
Maintainer jamesbrock@gmail.com
Category Parsing
Home page https://github.com/jamesdbrock/replace-attoparsec
Bug tracker https://github.com/jamesdbrock/replace-attoparsec/issues
Source repo head: git clone https://github.com/jamesdbrock/replace-attoparsec.git
Uploaded by JamesBrock at 2019-09-16T13:36:32Z


Readme for replace-attoparsec-1.0.2.0

replace-attoparsec

replace-attoparsec is for finding text patterns, and also editing and replacing the found patterns. This activity is traditionally done with regular expressions, but replace-attoparsec uses attoparsec parsers instead for the pattern matching.

replace-attoparsec can be used in the same sort of “pattern capture” or “find all” situations in which one would use Python re.findall or Perl m//, or Unix grep.

replace-attoparsec can be used in the same sort of “stream editing” or “search-and-replace” situations in which one would use Python re.sub, or Perl s///, or Unix sed, or awk.

See replace-megaparsec for the megaparsec version.

Why would we want to do pattern matching and substitution with parsers instead of regular expressions?

Usage Examples

Try the examples in ghci by running cabal v2-repl in the replace-attoparsec/ root directory.

The examples depend on these imports and LANGUAGE OverloadedStrings.

:set -XOverloadedStrings
import Replace.Attoparsec.Text
import Data.Attoparsec.Text as AT
import qualified Data.Text as T
import Control.Applicative ((<|>), some)
import Control.Monad (void)
import Data.Either
import Data.Char

Parsing with sepCap family of parser combinators

The following examples show how to match a pattern to a string of text and deconstruct the string of text by separating it into sections which match the pattern, and sections which don't match.

Pattern match, capture only the parsed result with sepCap

Separate the input string into sections which can be parsed as a hexadecimal number with a prefix "0x", and sections which can't.

let hexparser = string "0x" >> hexadecimal :: Parser Integer
fromRight [] $ parseOnly (sepCap hexparser) "0xA 000 0xFFFF"
[Right 10,Left " 000 ",Right 65535]

Pattern match, capture only the matched text with findAll

Just get the string sections which match the hexadecimal parser, and throw away the parsed number.

let hexparser = string "0x" >> hexadecimal :: Parser Integer
fromRight [] $ parseOnly (findAll hexparser) "0xA 000 0xFFFF"
[Right "0xA",Left " 000 ",Right "0xFFFF"]

Pattern match, capture the matched text and the parsed result with findAllCap

Capture the parsed hexadecimal number, as well as the string section which parses as a hexadecimal number.

let hexparser = string "0x" >> hexadecimal :: Parser Integer
fromRight [] $ parseOnly (findAllCap hexparser) "0xA 000 0xFFFF"
[Right ("0xA",10),Left " 000 ",Right ("0xFFFF",65535)]
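
The three combinators above are closely related. As a rough sketch, and not a statement about the library's source, findAllCap and findAll behave like sepCap applied to a parser wrapped in attoparsec's match combinator, which returns the consumed text alongside the parsed value. The names findAllCapSketch, findAllSketch, hexNumber and demo are invented for this illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, hexadecimal, match, parseOnly, string)
import qualified Data.Text as T
import Replace.Attoparsec.Text (sepCap)

-- Hypothetical equivalent of findAllCap: 'match' pairs the text that the
-- pattern consumed with the value that the pattern parsed.
findAllCapSketch :: Parser a -> Parser [Either T.Text (T.Text, a)]
findAllCapSketch pat = sepCap (match pat)

-- Hypothetical equivalent of findAll: keep the matched text and drop the
-- parsed value. 'fmap fst' only touches the Right sections.
findAllSketch :: Parser a -> Parser [Either T.Text T.Text]
findAllSketch pat = map (fmap fst) <$> findAllCapSketch pat

-- The same hexadecimal pattern as in the examples above.
hexNumber :: Parser Integer
hexNumber = string "0x" >> hexadecimal

-- Runs the sketch on the same input as the findAll example.
demo :: Either String [Either T.Text T.Text]
demo = parseOnly (findAllSketch hexNumber) "0xA 000 0xFFFF"
```

If the sketch is faithful, demo should agree with the findAll example above.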

Pattern match, capture only the locations of the matched patterns

Find all of the sections of the stream which match a string of whitespace. Print a list of the offsets of the beginning of every pattern match.

import Data.Either
let spaceoffset = getOffset <* many1 space :: Parser Int
fromRight [] $ parseOnly (return . rights =<< sepCap spaceoffset) " a  b  "
[0,2,5]

Pattern match balanced parentheses

Find the outer parentheses of all balanced nested parentheses. Here's an example of matching a pattern that can't be expressed by a regular expression. We can express the pattern with a recursive parser.

let parens :: Parser ()
    parens = do
        char '('
        manyTill
            (void (satisfy $ notInClass "()") <|> void parens)
            (char ')')
        return ()

fromRight [] $ parseOnly (findAll parens) "(()) (()())"
[Right "(())",Left " ",Right "(()())"]

Edit text strings by running parsers with streamEdit

The following examples show how to search for a pattern in a string of text and then edit the string of text to substitute in some replacement text for the matched patterns.
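
One way to picture what streamEdit does is: split the input with sepCap, pass every matched section through the editor function, leave the unmatched sections alone, and concatenate the result. The sketch below follows that picture. It is an illustration under that assumption, not the library's implementation, and the names rebuild and streamEditSketch are invented here.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, parseOnly)
import qualified Data.Text as T
import Replace.Attoparsec.Text (sepCap)

-- Reassemble sepCap-style sections. Unmatched text (Left) is kept verbatim,
-- and each match (Right) is replaced by whatever the editor returns for it.
rebuild :: (a -> T.Text) -> [Either T.Text a] -> T.Text
rebuild editor = T.concat . map (either id editor)

-- Hypothetical streamEdit look-alike built from sepCap and rebuild.
-- If the sepCap parser fails, this sketch returns the input unchanged.
streamEditSketch :: Parser a -> (a -> T.Text) -> T.Text -> T.Text
streamEditSketch pat editor input =
  case parseOnly (sepCap pat) input of
    Left _ -> input
    Right sections -> rebuild editor sections
```

For example, applying rebuild (T.pack . show) to the sepCap result shown earlier, [Right 10,Left " 000 ",Right 65535], gives "10 000 65535".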

Pattern match and replace with a constant

Replace all carriage-return-newline instances with newline.

streamEdit (string "\r\n") (const "\n") "1\r\n2\r\n"
"1\n2\n"

Pattern match and edit the matches

Replace alphabetic characters with the next character in the alphabet.

streamEdit (AT.takeWhile1 isLetter) (T.map succ) "HAL 9000"
"IBM 9000"

Pattern match and maybe edit the matches, or maybe leave them alone

Find all of the string sections s which can be parsed as a hexadecimal number r, and if r≤16, then replace s with a decimal number. Uses the match combinator.

let hexparser = string "0x" >> hexadecimal :: Parser Integer
streamEdit (match hexparser) (\(s,r) -> if r <= 16 then T.pack (show r) else s) "0xA 000 0xFFFF"
"10 000 0xFFFF"

Pattern match and edit the matches with IO

Find an environment variable in curly braces and replace it with its value from the environment.

import System.Environment
streamEditT (char '{' *> manyTill anyChar (char '}')) (fmap T.pack . getEnv) "- {HOME} -"
"- /home/jbrock -"
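
Note that getEnv throws an exception when the variable is not set. A possible variation, sketched below, combines the match combinator from the previous section with lookupEnv so that unknown variables are left in the text untouched. The names expandWith and expandEnv are invented for this illustration, and the sketch assumes that streamEditT is used with IO as in the example above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (anyChar, char, manyTill, match)
import qualified Data.Text as T
import Replace.Attoparsec.Text (streamEditT)
import System.Environment (lookupEnv)

-- Replace each {NAME} with the value that the lookup action returns for NAME.
-- When the lookup returns Nothing, the original "{NAME}" text is kept.
expandWith :: (String -> IO (Maybe String)) -> T.Text -> IO T.Text
expandWith lookupVar = streamEditT (match braced) substitute
  where
    -- Parses "{NAME}" and yields NAME without the braces.
    braced = char '{' *> manyTill anyChar (char '}')
    -- 'original' is the matched text including the braces.
    substitute (original, name) = maybe original T.pack <$> lookupVar name

-- Look the names up in the process environment.
expandEnv :: T.Text -> IO T.Text
expandEnv = expandWith lookupEnv
```

Taking the lookup action as a parameter keeps the sketch easy to try in ghci with a fixed table, for example expandWith (\name -> pure (lookup name [("GREETING", "hello")])).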

In the Shell

If we're going to have a viable sed replacement then we want to be able to use it easily from the command line. This script uses the Stack script interpreter to find decimal numbers in a stream and replace each one with its double.

#!/usr/bin/env stack
{- stack
  script
  --resolver nightly-2019-09-13
  --package attoparsec
  --package text
  --package text-show
  --package replace-attoparsec
-}
-- https://docs.haskellstack.org/en/stable/GUIDE/#script-interpreter

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.IO as T
import TextShow
import Data.Attoparsec.Text
import Replace.Attoparsec.Text

main = T.interact $ streamEdit decimal (showt . (* (2::Integer)))

If you have The Haskell Tool Stack installed then you can just copy-paste this into a file named script.hs and run it. (On the first run Stack may need to download the dependencies.)

$ chmod u+x script.hs
$ echo "1 6 21 107" | ./script.hs
2 12 42 214

Alternatives

Some libraries that one might consider instead of this one.

http://hackage.haskell.org/package/regex-applicative

http://hackage.haskell.org/package/regex

http://hackage.haskell.org/package/pipes-parse

http://hackage.haskell.org/package/stringsearch

http://hackage.haskell.org/package/substring-parser

http://hackage.haskell.org/package/pcre-utils

http://hackage.haskell.org/package/template

https://github.com/RaminHAL9001/parser-sed-thing

http://hackage.haskell.org/package/attosplit

Hypothetically Asked Questions

  1. Is it fast?

    lol not really. sepCap is fundamentally about consuming the stream one token at a time while we try and fail to run a parser and then backtrack each time. That's a slow activity.

  2. Could we write this library for parsec?

    No, because the match combinator doesn't exist for parsec. (I can't find it anywhere. Can it be written?)

  3. Is this a good idea?

    You may have heard it suggested that monadic parsers are better when the input stream is mostly signal, and regular expressions are better when the input stream is mostly noise.

    The premise of this library is: that sentiment is outdated; monadic parsers are great for finding small patterns in a stream of otherwise uninteresting text; and the reluctance to forgo the speedup opportunities afforded by restricting ourselves to regular grammars is an old superstition about opportunities which remain mostly unexploited anyway. The performance compromise of allowing stack memory allocation (a.k.a. pushdown automata, a.k.a. context-free grammar) was once considered controversial for general-purpose programming languages. I think we can now resolve that controversy the same way for pattern matching languages.
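
To make the cost described under question 1 concrete, here is a minimal sketch of a sepCap-like combinator written against plain attoparsec. It is not the library's implementation, only an illustration of the "try the parser, backtrack, advance one character" loop. The names sepCapSketch, hexNumber and demo are invented for this illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Control.Monad (guard)
import Data.Attoparsec.Text
  (Parser, anyChar, atEnd, hexadecimal, match, parseOnly, string)
import qualified Data.Text as T

-- Hypothetical, simplified stand-in for sepCap (not the library's code).
sepCapSketch :: Parser a -> Parser [Either T.Text a]
sepCapSketch pat = go []
  where
    -- 'unmatched' holds the characters skipped since the last match,
    -- newest first.
    go unmatched = do
      done <- atEnd
      if done
        then pure (flush unmatched [])
        else matchHere unmatched <|> skipOne unmatched

    -- Try the pattern at the current position. Zero-width matches are
    -- rejected so that the loop always makes progress.
    matchHere unmatched = do
      (consumed, x) <- match pat
      guard (not (T.null consumed))
      rest <- go []
      pure (flush unmatched (Right x : rest))

    -- The pattern failed here and attoparsec has backtracked to where the
    -- attempt started. Consume one character and retry one step later.
    skipOne unmatched = do
      c <- anyChar
      go (c : unmatched)

    -- Emit the accumulated unmatched characters, if any, as a Left section.
    flush [] rest = rest
    flush skipped rest = Left (T.pack (reverse skipped)) : rest

-- The hexadecimal pattern from the usage examples.
hexNumber :: Parser Integer
hexNumber = string "0x" >> hexadecimal

demo :: Either String [Either T.Text Integer]
demo = parseOnly (sepCapSketch hexNumber) "0xA 000 0xFFFF"
-- Right [Right 10,Left " 000 ",Right 65535]
```

Every position that does not start a match costs one failed run of the pattern parser plus one backtrack, which is why the work grows with the amount of uninteresting text between matches.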