Tokenizer
WARNING this package is not tested enough for the moment.
Bugs are very likely here.
This package provides solution for two problems:
- split input string on tokens of specificated sort;
- check tokenizing of all possible strings is unique.
Some examples
If you have problems with understanding the syntax we use read two
sections bellow.
- parse make
Token
from a string
- checkUniqueTokenizing check if every string can no more than one way to be
split into parts in such a way, that each of them would be matched
by one of the given tokens
- tokenize tries to split given string on parts with the same
condition.
Here everything is ok
> checkUniqueTokenizing $ parse <$> ["ab", "bc", "abc"]
Right ()
and we can split string on these token with deterministic result:
> tokenize (makeTokenizeMap $ parse <$> ["ab", "bc", "abc"]) "abbcabc"
Right [("ab","ab"),("bc","bc"),("abc","abc")]
> tokenize (makeTokenizeMap $ parse <$> ["ab", "bc", "abc"]) "abbcabca"
Left (NoWayTokenize 7 [("ab","ab"),("bc","bc"),("abc","abc")])
We can parse "ab"
as "a"
and "b"
or "ab"
> checkUniqueTokenizing $ parse <$> ["ab", "a", "b"]
Left Conflicts: [("a",a),("b",b)] [("ab",ab)]
we can tokenize using this set of tokens, but sometimes it gives us
TwoWaysTokenize
error:
> tokenize (makeTokenizeMap $ parse <$> ["a", "b", "ab"]) "bba"
Right [("b","b"),("b","b"),("a","a")]
> tokenize (makeTokenizeMap $ parse <$> ["a", "b", "ab"]) "aab"
Left (TwoWaysTokenize 1 [("a","a"),("ab","ab")] [("a","a"),("a","a"),("b","b")])
to solve the problem we can specify that that there should be no b
character after separate a
token
> checkUniqueTokenizing $ parse <$> ["ab", "a?!b", "b"]
Right ()
> tokenize (makeTokenizeMap $ parse <$> ["a?!b", "b", "ab"]) "aab"
Right [("a?!b","a"),("ab","ab")]
More complex example. Problem is for string "ababab"
:
> checkUniqueTokenizing $ parse <$> ["ab", "aba", "bab"]
Left Conflicts: [("ab",ab),("ab",ab),("ab",ab)] [("aba",aba),("bab",bab)]
Here even "aab"
can be split as aa
and then b
or a*b
. However, current
algorithm gives another conflict "aaa*b"
can be spit as aa
and a*b
or
simply a*b
:
> checkUniqueTokenizing $ parse <$> ["a*b", "aa", "b"]
Left Conflicts: [("aa",aa),("a*b",ab)] [("a*b",aaab)]
Try it yourself by executing cabal repl examples -f examples
What is a token?
A token is a parts' of string template. It consists of three parts. Each part
provides some restrictions on characters of a string that can be matched.
The main part of token is it's body
. It describes characters of a string part,
matchable by token. Two others parts behind
and ahead
restrict symbols, that
can be situated before/after matched part respectively. Note that they are
assumed to be satisfied if begin/end of line is achieved.
Each part of token is a list, describing subsequent symbols from left to right.
In behind
and ahead
part we can specify for each position what symbols can
be used or cannot be used via BlackWhiteSet
. In token's body we can restrict
not only one position, but some of subsequent positions. More precisely, we can
mark a BlackWhiteSet
to be Repeatable
one or some (it's one or more) times.
Syntax, used in examples
To make examples more readable we provide simple language for describing tokens.
We'll use alpha characters as symbols and some punctuation for describing
token's structure:
{
and }
for grouping set of characters, containing more then one char;
!
means "all, except those" (BlackSet
in this package' terminology);
*
behind the charset means "some characters, containing in this set";
?
at the beginning of behind
/ahead
parts;
<
and >
for grouping complex behind
/ahead
parts.
The grammar in EBNF:
BlackWhiteSet := ['!'], (letter | '{', {letter}, '}');
Repeatable := charset, ['*'];
ahead_or_behind := '?', (symbol | '<', {symbol}, '>');
body := symbol, {symbol}
token := [ahead_or_behind], body, [ahead_or_behind];
Parser for tokens, described in this manner is available in examples/Main.hs
Technical details
Uniqueness checking is provided by a modification of Sardinas-Patterson's
algorithm. Tokenizing process is written in the most simple way with
non-exponential asymptotic of the input string's length.
Usage
It's very likely, that all you need is exported from Text.Tokenizer
.
Bug reports and feature requests
Feel free open issues at
the GitHub repo
Contribution
I would be vary glad for any contribution. The are many ways to improve this lib:
- improve documentation and examples;
- add more tests to check, that everything works nice;
- improve performance (I think there are many opportunities here in both
algorithms);
- add benchmarks (connected with the previous).
- (this issue is mostly for me :)
improve code readability (I've tried not to make it absolutely terrible, but
it's definitely not perfect);
I know, that some of those problems (especially code readability) should be
closed by myself, but unfortunately I have no time to deal with them now.
Maybe, I'm the package is too raw to publish it, but there are some reasons for
me to do so:
- I don't no when it will be improved enough;
- it is needed for my main project (FineTeX).