tokenizer: Check uniqueness and tokenize safely

Versions [RSS]	0.1.0.0
Change log	CHANGELOG.md
Dependencies	base (>=4.14.3.0 && <4.15), containers, transformers [details]
License	MIT
Copyright	Lev Dvorkin (c) 2022
Author	Lev135
Maintainer	lev_135@mail.ru
Category	text
Home page	https://github.com/Lev135/tokenizer
Source repo	head: git clone https://github.com/Lev135/tokenizer
Uploaded	by lev_135 at 2022-09-20T17:01:42Z
Distributions
Executables	examples
Downloads	98 total (0 in the last 30 days)
Rating	(no votes yet) [estimated by Bayesian average]
Your Rating	λ λ λ
Status	Docs uploaded by user Build status unknown [no reports yet]

Readme for tokenizer-0.1.0.0

[back to package description]

Tokenizer

WARNING this package is not tested enough for the moment. Bugs are very likely here.

This package provides solution for two problems:

split input string on tokens of specificated sort;
check tokenizing of all possible strings is unique.

Some examples

If you have problems with understanding the syntax we use read two sections bellow.

parse make Token from a string
checkUniqueTokenizing check if every string can no more than one way to be split into parts in such a way, that each of them would be matched by one of the given tokens
tokenize tries to split given string on parts with the same condition.

Here everything is ok

> checkUniqueTokenizing $ parse <$> ["ab", "bc", "abc"]
Right ()

and we can split string on these token with deterministic result:

> tokenize (makeTokenizeMap $ parse <$> ["ab", "bc", "abc"]) "abbcabc"
Right [("ab","ab"),("bc","bc"),("abc","abc")]
> tokenize (makeTokenizeMap $ parse <$> ["ab", "bc", "abc"]) "abbcabca"
Left (NoWayTokenize 7 [("ab","ab"),("bc","bc"),("abc","abc")])

We can parse "ab" as "a" and "b" or "ab"

> checkUniqueTokenizing $ parse <$> ["ab", "a", "b"]
Left Conflicts: [("a",a),("b",b)] [("ab",ab)]

we can tokenize using this set of tokens, but sometimes it gives us TwoWaysTokenize error:

> tokenize (makeTokenizeMap $ parse <$> ["a", "b", "ab"]) "bba"
Right [("b","b"),("b","b"),("a","a")]
> tokenize (makeTokenizeMap $ parse <$> ["a", "b", "ab"]) "aab"
Left (TwoWaysTokenize 1 [("a","a"),("ab","ab")] [("a","a"),("a","a"),("b","b")])

to solve the problem we can specify that that there should be no b character after separate a token

> checkUniqueTokenizing $ parse <$> ["ab", "a?!b", "b"]
Right ()
> tokenize (makeTokenizeMap $ parse <$> ["a?!b", "b", "ab"]) "aab"
Right [("a?!b","a"),("ab","ab")]

More complex example. Problem is for string "ababab":

> checkUniqueTokenizing $ parse <$> ["ab", "aba", "bab"]
Left Conflicts: [("ab",ab),("ab",ab),("ab",ab)] [("aba",aba),("bab",bab)]

Here even "aab" can be split as aa and then b or a*b. However, current algorithm gives another conflict "aaa*b" can be spit as aa and a*b or simply a*b:

> checkUniqueTokenizing $ parse <$> ["a*b", "aa", "b"]
Left Conflicts: [("aa",aa),("a*b",ab)] [("a*b",aaab)]

Try it yourself by executing cabal repl examples -f examples

What is a token?

A token is a parts' of string template. It consists of three parts. Each part provides some restrictions on characters of a string that can be matched. The main part of token is it's body. It describes characters of a string part, matchable by token. Two others parts behind and ahead restrict symbols, that can be situated before/after matched part respectively. Note that they are assumed to be satisfied if begin/end of line is achieved.

Each part of token is a list, describing subsequent symbols from left to right. In behind and ahead part we can specify for each position what symbols can be used or cannot be used via BlackWhiteSet. In token's body we can restrict not only one position, but some of subsequent positions. More precisely, we can mark a BlackWhiteSet to be Repeatable one or some (it's one or more) times.

Syntax, used in examples

To make examples more readable we provide simple language for describing tokens. We'll use alpha characters as symbols and some punctuation for describing token's structure:

{ and } for grouping set of characters, containing more then one char;
! means "all, except those" (BlackSet in this package' terminology);
* behind the charset means "some characters, containing in this set";
? at the beginning of behind/ahead parts;
< and > for grouping complex behind/ahead parts.

The grammar in EBNF:

BlackWhiteSet     := ['!'], (letter | '{', {letter}, '}');
Repeatable        := charset, ['*'];

ahead_or_behind   := '?', (symbol | '<', {symbol}, '>');
body              := symbol, {symbol}
token             := [ahead_or_behind], body, [ahead_or_behind];

Parser for tokens, described in this manner is available in examples/Main.hs

Technical details

Uniqueness checking is provided by a modification of Sardinas-Patterson's algorithm. Tokenizing process is written in the most simple way with non-exponential asymptotic of the input string's length.

Usage

It's very likely, that all you need is exported from Text.Tokenizer.

Bug reports and feature requests

Feel free open issues at the GitHub repo

Contribution

I would be vary glad for any contribution. The are many ways to improve this lib:

improve documentation and examples;
add more tests to check, that everything works nice;
improve performance (I think there are many opportunities here in both algorithms);
add benchmarks (connected with the previous).
(this issue is mostly for me :) improve code readability (I've tried not to make it absolutely terrible, but it's definitely not perfect);

I know, that some of those problems (especially code readability) should be closed by myself, but unfortunately I have no time to deal with them now.

Maybe, I'm the package is too raw to publish it, but there are some reasons for me to do so:

I don't no when it will be improved enough;
it is needed for my main project (FineTeX).