maxent-learner-hw: Hayes and Wilson's maxent learning algorithm for phonotactic grammars.

Provides an implementation of Hayes and Wilson's machine learning algorithm for maxent phonotactic grammars, as both a command-line tool and a function library. The learner takes a lexicon and produces a list of weighted constraints penalizing certain sound sequences, aiming for a probability distribution over words that maximizes the probability of the lexicon. Once such a set of constraints has been generated, it can be tested by using it to generate random pronounceable text.

This package is an implementation of the algorithm described in Hayes and Wilson's paper A Maximum Entropy Model of Phonotactics and Phonotactic Learning (available at http://www.linguistics.ucla.edu/people/hayes/Phonotactics/Index.htm).
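
As a rough illustration of what "weighted constraints" means here: in a maxent grammar, a word's penalty is the weighted sum of its constraint violations, and its probability is proportional to the exponential of the negated penalty. The sketch below is not this package's API. The names Constraint, bigram, penalty and maxentValue are hypothetical, and constraints are simplified to literal two-segment sequences rather than natural classes.

```haskell
import Data.List (isPrefixOf, tails)

-- Hypothetical representation: a constraint has a weight and a function
-- counting how often a word (a list of segments) violates it.
data Constraint = Constraint
  { weight     :: Double
  , violations :: [String] -> Int
  }

-- Illustrative constraint penalizing one literal two-segment sequence.
bigram :: Double -> String -> String -> Constraint
bigram w a b = Constraint w count
  where
    count segs = length [() | t <- tails segs, [a, b] `isPrefixOf` t]

-- Penalty of a word: weighted sum of its violation counts.
penalty :: [Constraint] -> [String] -> Double
penalty cs word = sum [weight c * fromIntegral (violations c word) | c <- cs]

-- Unnormalized maxent value: exp of the negated penalty. Dividing by the
-- sum of this value over all possible words would give a probability.
maxentValue :: [Constraint] -> [String] -> Double
maxentValue cs word = exp (negate (penalty cs word))
```

A word with no violations gets the value 1, and each additional violation of a constraint with weight w multiplies the value by exp(-w).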


Properties

Versions 0.1.0, 0.1.1, 0.1.2, 0.2.0, 0.2.1
Change log None available
Dependencies array (>=0.3 && <0.6), base (>=4.7 && <5), containers (>=0.5 && <0.6), csv (>=0.1 && <0.2), deepseq (>=1.4 && <1.5), file-embed, maxent-learner-hw, mtl (>=2.1 && <2.3), optparse-applicative, parallel (>=3.2 && <3.3), random (==1.1), text (>=1.2 && <1.3), vector (>=0.10) [details]
License LicenseRef-GPL
Copyright 2016 George Steel and Peter Jurgec
Author George Steel
Maintainer george.steel@gmail.com
Category Linguistics
Home page https://github.com/george-steel/maxent-learner
Source repo head: git clone https://github.com/githubuser/maxent-learner-hw
Uploaded by gtsteel at 2017-02-16T03:54:02Z


Readme for maxent-learner-hw-0.1.0

Maxent Phonotactic Learner

A tool for automatically inferring phonotactic grammars from a lexicon and using those grammars to generate random text, based on Hayes and Wilson's A Maximum Entropy Model of Phonotactics and Phonotactic Learning. This package provides functionality both as a Haskell library and as a command line tool.

To compile this package, run stack build in the root of this repository. Run stack haddock to build the library documentation. The library may be useful if you wish to use a custom set of candidate constraints beyond the generators offered by the command line tool.

Command line usage

The command line tool (phono-learner-hw) has two commands: learn, which infers grammars, and gensalad, which generates random text using those grammars. The learn command takes the name of a lexicon file as an argument and outputs a grammar (note that this is quite slow). By default the candidate constraints consist of single classes and bigrams; several more constraint types can be added with options. The gensalad command takes a grammar generated by learn and uses it to generate random text. Both commands also accept global options to write their final results to a file, to use a custom feature table for the generation of natural classes, and to control how text is divided into segments.

The command line works as follows:

phono-learner-hw COMMAND [-t|--featuretable CSVFILE] ([-c|--charsegs] | [-w|--wordsegs] | [--fierosegs]) [-n|--samples ARG] [-o|--output OUTFILE]
Option                      Description
-t, --featuretable CSVFILE  Use the features and segment list from a feature table in CSV format (a table for IPA is used by default).
-c, --charsegs              Use characters as segments (default).
-w, --wordsegs              Separate segments by spaces.
--fierosegs                 Parse segments by repeatedly taking the longest possible match and use ' to break up unintended digraphs (used for Fiero orthography).
-n, --samples N             Number of samples to use for salad generation.
-o, --output OUTFILE        Record final output to OUTFILE as well as stdout.
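
The three segmentation modes described above can be sketched as follows. This is an illustration of the described behaviour, not the package's own code. The function names are hypothetical, the segment inventory is passed in explicitly, and the handling of characters that match no known segment (kept as single-character segments) is an assumption.

```haskell
import Data.List (isPrefixOf, sortOn)
import Data.Ord (Down (..))

-- Like -c: every character is its own segment.
charSegs :: String -> [String]
charSegs = map (: [])

-- Like -w: segments are separated by whitespace.
wordSegs :: String -> [String]
wordSegs = words

-- Like --fierosegs: repeatedly take the longest known segment that is a
-- prefix of the remaining text. An apostrophe only separates segments and
-- is dropped. Assumption: a character matching no known segment is kept
-- as a single-character segment.
longestMatchSegs :: [String] -> String -> [String]
longestMatchSegs inventory = go
  where
    byLength = sortOn (Down . length) (filter (not . null) inventory)
    go [] = []
    go ('\'' : rest) = go rest
    go s@(c : rest) =
      case filter (`isPrefixOf` s) byLength of
        (seg : _) -> seg : go (drop (length seg) s)
        []        -> [c] : go rest
```

With an inventory containing s, h, sh and a, the text sha is read as the digraph sh followed by a, while s'ha is read as three separate segments.
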
phono-learner-hw learn LEXICON [--thresholds THRESHOLDS] [-f|--freqs] [-e|--edges] [-3|--trigrams COREFEATURES] [-l|--longdistance SKIPFEATURES] [GLOBALOPTIONS]
Option                           Description
--thresholds THRESHOLDS          Thresholds to use for candidate selection (default is [0.01, 0.1, 0.2, 0.3]).
-f, --freqs                      Lexicon file contains word frequencies.
-e, --edges                      Allow constraints involving word boundaries.
-3, --trigrams COREFEATURES      Allow trigram constraints where at least one class uses a single one of the following features (space separated, in quotes).
-l, --longdistance SKIPFEATURES  Allow constraints with two classes separated by a run of characters, possibly restricted to all having one of the following features.
phono-learner-hw gensalad GRAMMAR [GLOBALOPTIONS]

Example usage

The following two commands calculate a grammar from Hayes and Wilson's Shona test data, using their selection of trigram restrictions, and then generate random text from it.

phono-learner-hw learn ShonaLearningData.txt -f -e -3 "syllabic consonantal sonorant" -t ShonaFeatures.csv -w -o ShonaGrammar.txt
phono-learner-hw gensalad ShonaGrammar.txt -t ShonaFeatures.csv -w -o ShonaSalad.txt

Feature Table Format

To use a feature table other than the default IPA one, you may define it in CSV format (RFC 4180). The first row defines the segment names, which may be any strings as long as they are all distinct. The first column defines the feature names, which are not hard-coded. Data cells should contain +, -, or 0 for binary features, and + or 0 for privative features (where we do not want a minus set that could form classes).

As a simple example, consider the following CSV file, defining three segments (a, n, and t), and two features (vowel and nasal).

     ,a,n,t
vowel,+,-,-
nasal,0,+,-
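
To show how such a table gives rise to natural classes, the sketch below hard-codes the example table and looks up the segments carrying a given value for a feature. It is only an illustration: the names exampleSegments, exampleTable and naturalClass are hypothetical, and the treatment of 0 as belonging to neither the plus nor the minus class is an assumption about how unspecified values behave.

```haskell
-- Hypothetical in-memory form of the example table: each feature name is
-- paired with one value per segment, in the order a, n, t.
exampleSegments :: [String]
exampleSegments = ["a", "n", "t"]

exampleTable :: [(String, String)]
exampleTable = [("vowel", "+--"), ("nasal", "0+-")]

-- Segments carrying the given value ('+' or '-') for the given feature.
-- Assumption: a segment marked 0 belongs to neither class.
naturalClass :: Char -> String -> [String]
naturalClass value feature =
  case lookup feature exampleTable of
    Nothing   -> []
    Just vals -> [seg | (seg, v) <- zip exampleSegments vals, v == value]
```

Under these assumptions, the minus class of vowel contains n and t, while a falls outside both nasal classes because its nasal cell is 0.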

If a row contains a different number of cells (separated by commas) than the header line, it is rejected as invalid and does not define a feature (and will not be displayed in the formatted feature table). If the CSV entered has duplicate segment names, no segments, or no valid features, the entire table is rejected (indicated by a red border around the text area; a green border is normal), and the last valid table is used and displayed.
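
The validation rules above can be sketched as a small parser. This is a simplified illustration, not the package's implementation. FeatureTable, splitCommas and parseFeatureTable are hypothetical names, cells are split on plain commas without RFC 4180 quoting support, cell values are not checked, and falling back to the last valid table is left to the caller.

```haskell
import Data.Char (isSpace)
import Data.List (nub)

-- Split one line on commas and trim each cell.
-- Simplification: quoted fields (RFC 4180) are not supported.
splitCommas :: String -> [String]
splitCommas s = case break (== ',') s of
  (cell, [])       -> [trim cell]
  (cell, _ : rest) -> trim cell : splitCommas rest
  where
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Hypothetical result type: segment names, plus one row of cell values
-- per valid feature.
data FeatureTable = FeatureTable
  { segments :: [String]
  , features :: [(String, [String])]
  } deriving (Eq, Show)

-- Nothing means the whole table is rejected.
parseFeatureTable :: String -> Maybe FeatureTable
parseFeatureTable txt =
  case map splitCommas (filter (any (not . isSpace)) (lines txt)) of
    [] -> Nothing
    (header : rows)
      | null segs || nub segs /= segs || null feats -> Nothing
      | otherwise -> Just (FeatureTable segs feats)
      where
        -- The first header cell sits above the feature-name column.
        segs = drop 1 header
        -- Rows whose cell count differs from the header are skipped.
        feats = [ (name, cells)
                | (name : cells) <- rows
                , length cells == length segs ]
```

Applied to the three-line example table, this yields the segments a, n and t with the two features vowel and nasal. A table whose header repeats a segment name yields Nothing.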


Copyright © 2016-2017 George Steel and Peter Jurgec.

This project is supported by the University of Toronto Advancing Teaching and Learning in Arts and Science (ATLAS) grant to Peter Jurgec.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.