# tokenizer-monad

**Motivation**: Before working with tokenizer-monad, I often implemented tokenizers by recursively destructuring `Char` lists. The resulting code was purely functional but hardly readable, even more so when destructuring `Text` instead of `Char` lists. I usually picture tokenization algorithms as flow charts, so I wanted to code them in a similar manner.

**Main idea**: You `walk` through the input string like a turtle, and every time you find a token boundary, you call `emit`. If some specific kinds of tokens should be suppressed, you can `discard` them instead (or filter afterwards).

This package supports Strings, strict and lazy Text, as well as strict and lazy ASCII ByteStrings.

**Examples**:

This tokenizer is equivalent to `words` from Prelude:

    words' :: String -> [String]
    words' = runTokenizerCS $ untilEOT $ do
      c <- pop
      if c `elem` " \t\n\r"
        then discard
        else do
          walkWhile (not . isSpace)
          emit

    ...> words' "Dieses Haus ist blau."
    ["Dieses","Haus","ist","blau."]

This tokenizer is similar to `lines` from Prelude, but discards empty lines:

    lines' :: String -> [String]
    lines' = runTokenizerCS $ untilEOT $ do
      c <- pop
      if c `elem` "\n\r"
        then discard
        else do
          walkWhile (\c -> not (c `elem` "\r\n"))
          emit

    ...> lines' "Dieses Haus ist\n\nblau.\n"
    ["Dieses Haus ist","blau."]

A more advanced tokenizer that can handle punctuation and HTTP URIs in text:

    t1Tokenize' :: Tokenizer Text ()
    t1Tokenize' = do
      http <- lookAhead "http://"
      https <- lookAhead "https://"
      if (http || https)
        then (walkWhile (not . isSpace) >> discard)
        else do
          c <- peek
          walk
          if isStopSym c
            then emit
            else if c `elem` (" \t\r\n" :: [Char])
              then discard
              else do
                walkWhile (\c -> (c=='_') || not (isSpace c || isPunctuation c))
                emit
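
To make the walk/emit/discard flow from the main idea more concrete, here is a toy model of it in plain Haskell, using only `base`. It is a sketch for illustration, not how tokenizer-monad is implemented: the `Cursor` type and the functions `walkC`, `walkWhileC`, `emitC`, `discardC`, `untilEndC` and `toyWords` are hypothetical names invented here, and the real package works in a monad rather than by passing a state record around.

```haskell
-- A toy model of the walk/emit/discard idea.
-- NOT tokenizer-monad's implementation; all names here are hypothetical.
import Data.Char (isSpace)

-- Cursor state: the characters walked over since the last token boundary
-- (stored in reverse), the unread rest of the input, and the tokens
-- emitted so far (also in reverse).
data Cursor = Cursor
  { pending :: String
  , rest    :: String
  , emitted :: [String]
  }

-- Move one character forward, adding it to the pending token.
walkC :: Cursor -> Cursor
walkC cur = case rest cur of
  []     -> cur
  (c:cs) -> cur { pending = c : pending cur, rest = cs }

-- Keep walking while the next character satisfies the predicate.
walkWhileC :: (Char -> Bool) -> Cursor -> Cursor
walkWhileC p cur = case rest cur of
  (c:_) | p c -> walkWhileC p (walkC cur)
  _           -> cur

-- Token boundary: record the pending characters as a token.
emitC :: Cursor -> Cursor
emitC cur = cur { pending = [], emitted = reverse (pending cur) : emitted cur }

-- Token boundary: throw the pending characters away.
discardC :: Cursor -> Cursor
discardC cur = cur { pending = [] }

-- Apply one tokenization step repeatedly until the input is exhausted.
-- The step must consume at least one character, otherwise this loops.
untilEndC :: (Cursor -> Cursor) -> String -> [String]
untilEndC step input = go (Cursor [] input [])
  where
    go cur
      | null (rest cur) = reverse (emitted cur)
      | otherwise       = go (step cur)

-- Same flow chart as words': a whitespace character is walked over and
-- discarded, anything else is walked to the next whitespace and emitted.
toyWords :: String -> [String]
toyWords = untilEndC step
  where
    step cur = case rest cur of
      (c:_) | isSpace c -> discardC (walkC cur)
      _                 -> emitC (walkWhileC (not . isSpace) cur)

main :: IO ()
main = print (toyWords "Dieses Haus ist blau.")
-- prints ["Dieses","Haus","ist","blau."]
```

The explicit `Cursor` record is only there to show what state such a tokenizer has to track. Hiding that state behind monadic actions is what lets the examples above read like a flow chart.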