tokenizer-streaming-0.1.0.1: A variant of tokenizer-monad that supports streaming.

Safe Haskell: None
Language: Haskell2010

Control.Monad.Tokenizer.Streaming.Decode

Description

Functions for running TokenizerT on Unicode bytestring streams.

For details on working with TokenizerT, see the module Control.Monad.Tokenizer.Streaming. For details on writing tokenizers, see the module Control.Monad.Tokenizer from the package tokenizer-monad.

Example of a simple tokenizer that splits words on whitespace and discards stop symbols:

tokenizeWords :: Monad m => Q.ByteString m () -> Stream (Of T.Text) m ()
tokenizeWords = runUtf8TokenizerT $ untilEOT $ do
  c <- pop
  if isStopSym c
    then discard
    else if c `elem` ("  \t\r\n" :: [Char])
         then discard
         else do
           walkWhile (\c' -> (c'=='_') || not (isSpace c' || isPunctuation' c'))
           emit
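The example above calls isStopSym and isPunctuation', which are not defined on this page. The following is a minimal sketch of what such predicates could look like, using only Data.Char from base. Both definitions, including the chosen character sets, are hypothetical and may differ from what the package author intended.

```haskell
import Data.Char (isPunctuation)

-- Hypothetical helper: symbols that are dropped as tokens of their own.
-- The exact set is an assumption made for illustration.
isStopSym :: Char -> Bool
isStopSym c = c `elem` (".,;:!?" :: String)

-- Hypothetical helper: Unicode punctuation, except the apostrophe and
-- the hyphen, so that words like "don't" or "well-known" stay in one token.
isPunctuation' :: Char -> Bool
isPunctuation' c = isPunctuation c && c `notElem` ("'-" :: String)
```

Note that Data.Char.isPunctuation also classifies the underscore as punctuation, which is presumably why the tokenizer above checks for '_' explicitly.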

UTF-8

UTF-16

UTF-32

Helpers

decodeStream :: Monad m => (ByteString -> DecodeResult) -> ByteString m () -> Stream (Of Text) m ()

Decode a Unicode bytestring stream into a stream of Text chunks.
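
Decoding chunk by chunk means that a multi-byte character may be split across two input chunks, so the decoder has to carry undecoded bytes over to the next chunk. The sketch below illustrates this idea on a plain list of strict chunks. It does not use decodeStream itself: it substitutes streamDecodeUtf8 from the text package as a stand-in decoder, and decodeChunks is a hypothetical name introduced for this illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (Decoding (..), streamDecodeUtf8)

-- Hypothetical helper: decode a list of UTF-8 chunks into Text chunks,
-- threading the decoder continuation from one chunk to the next.
-- Bytes left undecoded after the final chunk are silently dropped here.
decodeChunks :: [B.ByteString] -> [T.Text]
decodeChunks = go streamDecodeUtf8
  where
    go _ [] = []
    go step (chunk : rest) =
      case step chunk of
        -- txt: the text decoded so far; next: continuation that already
        -- holds any trailing bytes of an incomplete character.
        Some txt _leftover next -> txt : go next rest

main :: IO ()
main = do
  -- "é" is encoded as 0xC3 0xA9; the two bytes are split across chunks.
  let chunks = [B.pack [0x61, 0xC3], B.pack [0xA9, 0x62]]
  print (decodeChunks chunks)
```

The incomplete byte at the end of the first chunk is held back and should be emitted as a whole character together with the second chunk. Note that streamDecodeUtf8 throws an exception on invalid input, so a production decoder would need explicit error handling.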