dawgdic
Overview
dawgdic is a library for building and accessing
dictionaries implemented with directed acyclic word
graphs (DAWG).
This is a ported version of C++ dawgdic library.
Features
This library offers DAWG
, Dictionary
, Guide
and Completer
data types as well as their builders.
DAWG
represents word graph. It is provided as intermediate data structure.
Dictionary
represents compact layout for DAWG
. It offers API to check the presence of the word and associated value in the graph. Also it could be stored into a file and loaded from file.
Guide
is being used as a tree-like index to get the faster navigation through the dictionary. Also could be stored into a file and loaded from file.
Input characters must be in range 0-255.
This port does not contain RankedGuide
and RankedCompleter
(yet).
Comparison with other DAWG libraries
- dictionary file size, in bytes:
package/lexicon size (words) |
2.6K |
26K |
370K |
dawg: Data.DAWG.Dynamic |
370868 |
2823660 |
29933964 |
dawg: Data.DAWG.Static |
133138 |
1030578 |
11128690 |
dawg-ord: Data.DAWG.Int |
N/A |
N/A |
N/A |
dawg-ord: Data.DAWG.Ord |
N/A |
N/A |
N/A |
packed-dawg: Data.DAWG.Packed |
15619 |
128491 |
1481671 |
dawgdic: Data.DAWG.Dictionary |
3088 |
141328 |
1603600 |
dawgdic: Guide (w/ Dictionary) |
4640 |
212000 |
2405408 |
feature |
dawg |
dawg-ord |
packed-dawg |
dawgdic |
build from list |
+ |
+ |
+ |
+ |
persistence |
+ |
- |
+ |
+ |
insert |
Dynamic |
+ |
- |
Builder |
delete key |
Dynamic |
+ |
- |
- |
follow character |
Static |
~edges |
lookupPrefix |
+ |
lookup value |
+ |
+ |
- |
+ |
member |
~lookup |
~lookup |
+ |
+ |
keys |
+ |
+ |
+ |
+ |
values |
+ |
+ |
- |
+ |
toList (assocs) |
+ |
+ |
- |
+ |
complete word |
~submap |
- |
~toList |
+ |
fuzzy search |
- |
- |
- |
- |
max size |
N/A |
N/A |
2^22 nodes |
N/A |
Installation
Add dawgdic
dependency to your project and run cabal build
.
Building and Querying
Building DAWG from lexicon of words and ignoring insertion failures is as simple as this:
>>> import Data.DAWG.DAWG
>>> dawg <- fromAscList . lines =<< readFile "/path/to/lexicon"
Otherwise, consider mutable builder. It could also be useful if words are associated with values.
buildOrError content = do
dawgBuilder <- new
forM_ content \(word, value) -> do
result <- insert word (Just value) dawgBuilder
unless result $ error "Insert failed"
freeze dawgBuilder
Building dictionary and guide:
>>> import qualified Data.DAWG.Dictionary as D
>>> import qualified Data.DAWG.Guide as G
>>> dict <- D.build' dawg
>>> guide <- G.build' dawg dict
Saving dictionary and guide:
>>> D.write "dict.dawg" dict
>>> G.write "guide.dawg" guide
Loading dictionary and guide:
>>> dict <- D.load "dict.dawg"
>>> guide <- G.load "guide.dawg"
Consider using Completer
for auto-complete-like queries or if you need to obtain lexicon back from Dictionary
and Guide
.
Benchmarks
Utilities
Benchmark default(μs)
------------------------------------- ------------
Utilities/10/Dawg.fromAscList 5549.76
Utilities/10/Dict.build' 26981.64
Utilities/10/Dict.contains 22.46
Utilities/10/Dict.lookup 22.68
Utilities/10/Dict.follow 22.42
Utilities/10/Guide.build' 76.81
Utilities/10/Completer.completeKeys 385.35
------------------------------------- ------------
Utilities/100/Dawg.fromAscList 66486.21
Utilities/100/Dict.build' 194881.38
Utilities/100/Dict.contains 326.16
Utilities/100/Dict.lookup 323.45
Utilities/100/Dict.follow 319.86
Utilities/100/Guide.build' 782.24
Utilities/100/Completer.completeKeys 7016.36
------------------------------------- ------------
Utilities/1000/Dawg.fromAscList 888061.61
Utilities/1000/Dict.build' 1659798.44
Utilities/1000/Dict.contains 3627.54
Utilities/1000/Dict.lookup 3638.10
Utilities/1000/Dict.follow 3564.64
Utilities/1000/Guide.build' 7992.73
Utilities/1000/Completer.completeKeys 82343.20
------------------------------------- ------------
How to reproduce:
cabal install bench-show --overwrite-policy=always
- Run and wait (it might take around
3m
to complete):
time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS --benchmark-options="--output bench.html --csv results.csv"
bench-show --presentation=Solo report results.csv
Comparison
Benchmark default(μs)
------------------------------ -----------
10/dawgdic.follow 21.04
10/dawg.follow 85.41
10/dawg-ord.follow 110.42
10/packed-dawg.follow 147.30
100/dawgdic.follow 311.05
100/dawg.follow 1367.81
100/dawg-ord.follow 1774.78
100/packed-dawg.follow 2098.12
1000/dawgdic.follow 3273.02
1000/dawg.follow 18842.47
1000/dawg-ord.follow 21992.03
1000/packed-dawg.follow 28938.01
------------------------------ -----------
10/dawgdic.lookup value 21.74
10/dawg.lookup value 61.89
10/dawg-ord.lookup value 209.93
100/dawgdic.lookup value 290.13
100/dawg.lookup value 847.27
100/dawg-ord.lookup value 2973.35
1000/dawgdic.lookup value 3551.44
1000/dawg.lookup value 14217.29
1000/dawg-ord.lookup value 40111.28
10/dawgdic.member 21.02
------------------------------ -----------
10/dawg.member 56.43
10/dawg-ord.member 191.74
10/packed-dawg.member 173.97
100/dawgdic.member 343.32
100/dawg.member 873.70
100/dawg-ord.member 2863.96
100/packed-dawg.member 2200.33
1000/dawgdic.member 9288.22
1000/dawg.member 13878.72
1000/dawg-ord.member 31428.47
1000/packed-dawg.member 29089.83
------------------------------ -----------
10/dawgdic.keys 108.33
10/dawg.keys 100.23
10/dawg-ord.keys 160.51
10/packed-dawg.keys 50.50
100/dawgdic.keys 1392.78
100/dawg.keys 1413.00
100/dawg-ord.keys 2324.99
100/packed-dawg.keys 692.23
1000/dawgdic.keys 15628.83
1000/dawg.keys 16541.79
1000/dawg-ord.keys 27612.77
1000/packed-dawg.keys 7415.48
------------------------------ -----------
10/dawgdic.values 52.26
10/dawg.values 72.29
10/dawg-ord.values 134.79
100/dawgdic.values 616.02
100/dawg.values 1060.87
100/dawg-ord.values 1794.82
1000/dawgdic.values 6457.32
1000/dawg.values 11734.25
1000/dawg-ord.values 21532.54
------------------------------ -----------
10/dawgdic.toList 107.26
10/dawg.toList 116.89
10/dawg-ord.toList 202.35
100/dawgdic.toList 1367.87
100/dawg.toList 1608.26
100/dawg-ord.toList 2898.26
1000/dawgdic.toList 16334.34
1000/dawg.toList 18578.59
1000/dawg-ord.toList 36360.44
------------------------------ -----------
10/dawgdic.complete word 367.97
10/dawg.complete word 248.14
10/packed-dawg.complete word 303.69
100/dawgdic.complete word 6151.51
100/dawg.complete word 4494.85
100/packed-dawg.complete word 4913.83
1000/dawgdic.complete word 71152.49
1000/dawg.complete word 58222.76
1000/packed-dawg.complete word 60828.53
------------------------------ -----------
How to reproduce:
cabal install bench-show --overwrite-policy=always
- Run and wait (it might take around
7m
to complete):
time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS --benchmark-options="--output bench.html --csv results.csv"
bench-show --presentation=Solo report results.csv
Contributing
In cases of issues please attach callstack, provide minimal dictionary lexicon and provide logs with enabled tracing.
cabal build -ftrace
Acknowledgments