sequitur: Grammar-based compression algorithms SEQUITUR

[ bsd3, compression, formal-languages, language, library, natural-language-processing, nlp, text ] [ Propose Tags ] [ Report a vulnerability ]

Modules

[Index] [Quick Jump]

Downloads

Maintainer's Corner

Package maintainers

For package maintainers and hackage trustees

Candidates

  • No Candidates
Versions [RSS] 0.1.0.0, 0.2.0.0
Change log CHANGELOG.md
Dependencies base (>=4.7 && <5), containers (>=0.6.4.1 && <0.7), hashable (>=1.3.0.0 && <1.5), hashtables (>=1.2.4.2 && <1.4), primitive (>=0.7.3.0 && <0.10) [details]
License BSD-3-Clause
Copyright Copyright (c) 2024 Masahiro Sakai
Author Masahiro Sakai
Maintainer masahiro.sakai@gmail.com
Category Formal Languages, Language, Natural Language Processing, NLP, Text, Compression
Home page https://github.com/msakai/haskell-sequitur#readme
Bug tracker https://github.com/msakai/haskell-sequitur/issues
Source repo head: git clone https://github.com/msakai/haskell-sequitur
Uploaded by MasahiroSakai at 2024-07-13T14:47:21Z
Distributions NixOS:0.2.0.0
Downloads 59 total (3 in the last 30 days)
Rating (no votes yet) [estimated by Bayesian average]
Your Rating
  • λ
  • λ
  • λ
Status Docs available [build log]
Last success reported on 2024-07-13 [all 1 reports]

Readme for sequitur-0.1.0.0

[back to package description]

Haskell implementation of SEQUITUR algorithm

build

About SEQUITUR

SEQUITUR is a linear-time, online algorithm for producing a context-free grammar from an input sequence. The resulting grammar is a compact representation of original sequence and can be used for data compression.

Example:

  • Input string: abcabcabcabcabc

  • Resulting grammar

    • SAAB
    • ABB
    • Babc

SEQUITUR consumes input symbols one-by-one and append each symbol at the end of the grammar's start production (S in the above example), then substitutes repeating patterns in the given sequence with new rules. SEQUITUR maintains two invariants:

  • Digram Uniqueness: SEQUITUR ensures that no digram (a.k.a. bigram) occurs more than once in the grammar. If a digram (e.g. ab) occurs twice, SEQUITUR introduce a fresh non-terminal symbol (e.g. M) and a rule (e.g. Mab) and replace original occurences with the newly introduced non-terminals. One exception is the cases where two occurrence overlap.

  • Rule Utility: If a non-terminal symbol occurs only once, SEQUITUR removes the associated rule and substitute the occurence with the right-hand side of the rule.

Usage

ghci> import Language.Grammar.Sequitur
ghci> encode "baaabacaa"
fromList [(0,[NonTerminal 1,NonTerminal 2,NonTerminal 1,Terminal 'c',NonTerminal 2]),(1,[Terminal 'b',Terminal 'a']),(2,[Terminal 'a',Terminal 'a'])]

The output represents the following grammar:

0 → 1 2 1 c 2
1 → b a
2 → a a

References