fuzzy-parse-0.1.2.0: Tools for processing unstructured text data

Copyright: Dmitry Zuikov 2020
License: MIT
Maintainer: dzuikov@gmail.com
Stability: experimental
Portability: unknown
Safe Haskell: None
Language: Haskell2010

Data.Text.Fuzzy.Tokenize

Description

A lightweight, multi-functional text tokenizer that supports different styles of tokenization depending on its settings.

It may be used in various situations: for DSLs, text markup, or even for parsing simple grammars, often more easily (and sometimes faster) than with mainstream parser combinators or parser generators.

The primary goal of this package is to parse unstructured text data; however, it also handles data formats such as CSV with ease.

Currently it supports the following types of entities: atoms, string literals (with a minimal set of escaped characters for now), punctuation characters, and delimiters.

Examples

Simple CSV-like tokenization

>>> tokenize (delims ":") "aaa : bebeb : qqq ::::" :: [Text]
["aaa "," bebeb "," qqq "]
>>> tokenize (delims ":"<>sq<>emptyFields ) "aaa : bebeb : qqq ::::" :: [Text]
["aaa "," bebeb "," qqq ","","","",""]
>>> tokenize (delims ":"<>sq<>emptyFields ) "aaa : bebeb : qqq ::::" :: [Maybe Text]
[Just "aaa ",Just " bebeb ",Just " qqq ",Nothing,Nothing,Nothing,Nothing]
>>> tokenize (delims ":"<>sq<>emptyFields ) "aaa : 'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
[Just "aaa ",Just " ",Just "bebeb:colon inside",Just " ",Just " qqq ",Nothing,Nothing,Nothing,Nothing]
>>> let spec = sl<>delims ":"<>sq<>emptyFields<>noslits
>>> tokenize spec "   aaa :   'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
[Just "aaa ",Just "bebeb:colon inside ",Just "qqq ",Nothing,Nothing,Nothing,Nothing]
>>> let spec = delims ":"<>sq<>emptyFields<>uw<>noslits
>>> tokenize spec "  a  b  c  : 'bebeb:colon inside' : qqq ::::"  :: [Maybe Text]
[Just "a b c",Just "bebeb:colon inside",Just "qqq",Nothing,Nothing,Nothing,Nothing]

Notes

About delimiter tokens

Tokens of this type appear while processing "delimited" formats and are removed from the results. Currently you will never see them unless normalization is turned off with the nn option.

Delimiters make sense when processing CSV-like formats, but there you usually want only the values in the results.

This behavior may change later, but right now delimiters seem pointless in results. If you process a grammar where the delimiter character matters, use punctuation instead, e.g.:

>>> let spec = delims " \t"<>punct ",;()" <>emptyFields<>sq
>>> tokenize spec "( delimiters , are , important, 'spaces are not');" :: [Text]
["(","delimiters",",","are",",","important",",","spaces are not",")",";"]

Other

For CSV-like formats it makes sense to split the text into lines first, otherwise newline characters may cause weird results.
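A minimal sketch of such line-by-line processing, using the spec from the examples above (`parseCsvLines` is a hypothetical helper name, not part of the package):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Fuzzy.Tokenize

-- Split the input into lines first, then tokenize each line
-- separately, so newline characters never reach the tokenizer.
parseCsvLines :: Text -> [[Maybe Text]]
parseCsvLines = map (tokenize (delims ":" <> sq <> emptyFields)) . T.lines
```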

Synopsis

Documentation

class IsToken a where Source #

Typeclass for token values. Note that some tokens appear in the results only when the nn option is set: during normalization, sequences of character tokens are collapsed into text tokens or string literals, and delimiter tokens are removed from the results.

Minimal complete definition

mkChar, mkSChar, mkPunct, mkText, mkStrLit, mkKeyword, mkEmpty

Methods

mkChar :: Char -> a Source #

Create a character token

mkSChar :: Char -> a Source #

Create a string literal character token

mkPunct :: Char -> a Source #

Create a punctuation token

mkText :: Text -> a Source #

Create a text chunk token

mkStrLit :: Text -> a Source #

Create a string literal token

mkKeyword :: Text -> a Source #

Create a keyword token

mkEmpty :: a Source #

Create an empty field token

mkDelim :: a Source #

Create a delimiter token

mkIndent :: Int -> a Source #

Create an indent token

mkEol :: a Source #

Create an EOL token
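As a sketch, a custom token type covering the minimal complete definition might look like this (`Tok` and its constructors are hypothetical names, not part of the package):

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Fuzzy.Tokenize

-- A hypothetical token type implementing the minimal
-- complete definition of IsToken.
data Tok = TChar Char
         | TSChar Char
         | TPunct Char
         | TText Text
         | TStrLit Text
         | TKeyword Text
         | TEmpty
         deriving (Eq, Show)

instance IsToken Tok where
  mkChar    = TChar
  mkSChar   = TSChar
  mkPunct   = TPunct
  mkText    = TText
  mkStrLit  = TStrLit
  mkKeyword = TKeyword
  mkEmpty   = TEmpty
```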

tokenize :: IsToken a => TokenizeSpec -> Text -> [a] Source #

Tokenize a text
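For example, with only a delimiter spec (consistent with the CSV-like examples above):

```haskell
>>> tokenize (delims ":") "a:b:c" :: [Text]
["a","b","c"]
```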

esc :: TokenizeSpec Source #

Turn on character escaping inside string literals. Currently the following escape characters are supported: " ' t n r a b f v

addEmptyFields :: TokenizeSpec Source #

Generate empty field tokens (see the mkEmpty method) when no tokens are found before a delimiter. Useful for processing CSV-like data in order to distinguish empty columns.

emptyFields :: TokenizeSpec Source #

Same as addEmptyFields.

nn :: TokenizeSpec Source #

Turns off token normalization, making the tokenizer emit a raw character stream. Useful for debugging.

sq :: TokenizeSpec Source #

Turns on single-quoted string literals. The character stream after a '\'' character is processed as a single-quoted string, treating all delimiter, comment, and other special characters as part of the string literal until the next unescaped single quote character.

sqq :: TokenizeSpec Source #

Enables double-quoted string literals, analogous to sq for single-quoted strings.
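For example (a sketch mirroring the single-quote examples above):

```haskell
>>> tokenize (delims ":" <> sqq) "a:\"b:c\":d" :: [Text]
["a","b:c","d"]
```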

noslits :: TokenizeSpec Source #

Disable treating string literals as separate tokens.

Useful when processing delimited data (CSV-like formats). Normally, sequential text chunks are concatenated, but adjacent text and a string literal produce two different tokens, which may cause weird results for CSV-like data, e.g.:

>>> tokenize (delims ":"<>emptyFields<>sq ) "aaa:bebe:'qq' aaa:next::" :: [Maybe Text]
[Just "aaa",Just "bebe",Just "qq",Just " aaa",Just "next",Nothing,Nothing]

Here "qq" and " aaa" become two separate tokens, which makes the CSV result look wrong, as if it had an extra column. This can be avoided with this option if you don't need to distinguish text chunks from string literals:

>>> tokenize (delims ":"<>emptyFields<>sq<>noslits) "aaa:bebe:'qq:foo' aaa:next::" :: [Maybe Text]
[Just "aaa",Just "bebe",Just "qq:foo aaa",Just "next",Nothing,Nothing]

sl :: TokenizeSpec Source #

Strip spaces on the left side of a token. Does not affect string literals, which are processed normally. Useful mostly for processing CSV-like formats; otherwise delims may be used to skip unwanted spaces.

sr :: TokenizeSpec Source #

Strip spaces on the right side of a token. Does not affect string literals, which are processed normally. Useful mostly for processing CSV-like formats; otherwise delims may be used to skip unwanted spaces.

uw :: TokenizeSpec Source #

Strips spaces on both sides and collapses multiple spaces into one. The name comes from unwords . words.

Does not affect string literals, which are processed normally. Useful mostly for processing CSV-like formats; otherwise delims may be used to skip unwanted spaces.

delims :: String -> TokenizeSpec Source #

Specify the list of delimiter characters used to split the character stream into fields. Useful for CSV-like separated formats. Support for empty fields in the token stream may be enabled with the addEmptyFields function.

comment :: Text -> TokenizeSpec Source #

Specify the line comment prefix. All text after the line comment prefix is ignored until a newline character appears. Multiple line comments are supported.
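A sketch (the exact tokens depend on the rest of the spec; the point is that the comment text "baz" is dropped):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Fuzzy.Tokenize

-- "-- baz" is ignored up to the newline, so "baz" should
-- not appear among the resulting tokens.
toks :: [Text]
toks = tokenize (delims " \n" <> comment "--") "foo bar -- baz\nquux"
```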

punct :: Text -> TokenizeSpec Source #

Specify the punctuation characters. Each punctuation character is handled as a separate token, and any token is broken at a punctuation character.

Useful for handling punctuation, e.g.:

> function(a,b)

or

> (apply function 1 2 3)
>>> let spec = delims " " <> punct "()"
>>> tokenize spec "(apply function 1 2 3)" :: [Text]
["(","apply","function","1","2","3",")"]

indent :: TokenizeSpec Source #

Enable indentation support

itabstops :: Int -> TokenizeSpec Source #

Set the tab expansion multiplier, i.e. each tab expands to n spaces before processing. This also turns on indentation support. Only tabs at the beginning of a line are expanded, i.e. those before the first non-space character.

keywords :: [Text] -> TokenizeSpec Source #

Specify the keyword list. Each keyword is treated as a separate token.

eol :: TokenizeSpec Source #

Turns on EOL token generation