Copyright | © 2016-2017 George Steel and Peter Jurgec |
---|---|
License | GPL-2+ |
Maintainer | george.steel@gmail.com |
Safe Haskell | None |
Language | Haskell2010 |
Library for using DFAs to represent maxent grammars. A mexent grammar consists of a set of constraints, each of which is given a weight, which define a probability diatribution over the set of strings of each given length.
The relative probability (maxent score) of each string is equal to the negative exponential of the the total weight of the violated constraints. In this module, such a grammar is reperesented by a DFST
which can count violations and a Vec
of weights.
This module is mainly concerned with calculating probabilities of samples of text and finding the optimal weights to maximize that probability. There are also functions to randomly generate text using the distribution implied by a mexent grammar.
- type Length = Int
- data Lexicon sigma = Lex {
- totalWords :: Int
- lengthFreqs :: Array Length Int
- wordFreqs :: [([sigma], Int)]
- sortLexicon :: Ord sigma => [([sigma], Int)] -> Lexicon sigma
- lengthCdf :: Lexicon sigma -> Cdf Length
- lengthPdf :: Lexicon sigma -> [(Length, Double)]
- maxentProb :: Vec -> Multicount -> Double
- lexLogProbTotalDeriv :: Ix sigma => MulticountDFST sigma -> Array Length Int -> Vec -> Vec -> (Double, Vec)
- lexLogProbPartialDeriv :: Ix sigma => MulticountDFST sigma -> Array Length Int -> Vec -> Vec -> Vec -> Double
- llpOptimizeWeights :: Ix sigma => Array Length Int -> PackedText sigma -> MulticountDFST sigma -> Vec -> Vec
- sampleWord :: forall g sigma m. (RandomGen g, Ix sigma, MonadState g m) => DFST Int sigma Double -> Length -> Length -> m [sigma]
- sampleWordSalad :: (RandomGen g, Ix sigma, MonadState g m) => DFST Int sigma Double -> Cdf Length -> Int -> m [[sigma]]
Documentation
Returns the probability (as a logarithm) of a lexicon with aand associated length distribution.
Lex | |
|
sortLexicon :: Ord sigma => [([sigma], Int)] -> Lexicon sigma Source #
Convert jumbled list of words and frequencies to sorted lexicon.
lengthCdf :: Lexicon sigma -> Cdf Length Source #
Retrieve length distribution as a Cdf
for sampling.
lengthPdf :: Lexicon sigma -> [(Length, Double)] Source #
Retrieve length distribution as a normalized probability mass function. Probabilities add up to 1.
maxentProb :: Vec -> Multicount -> Double Source #
Apply weights to violation counts to get a relative probability.
:: Ix sigma | |
=> MulticountDFST sigma | DFST counting constraint violations |
-> Array Length Int | Length distribution of lexicon |
-> Vec | Observed violations in lexicon |
-> Vec | Weights to give constraints |
-> (Double, Vec) | Probability and its derivative w.r.t. the weights |
For a given set of consteraints (reperesented by a DFST counting violations), lexicon (reprersented as length distribution and total violation count, which should be precomputed), and weight vector, returns the absolute probability (as a negative logarithm) and its derivative with respect to the weight vector.
Minimize this to find the optimal weights. To prevent overfitting, this function includes an exponential (L₁) prior equivalent to each constraint being violated once for existing. This intentionally differs from Hayes and Wilson since their gaussian (L₂²) prior had a strong preference for as many simillar constraints as possible as opposed to a single constraint. The exponential prior was chosen since it is independent of splitting constraints into duplicates with the weight distributed between them.
lexLogProbPartialDeriv :: Ix sigma => MulticountDFST sigma -> Array Length Int -> Vec -> Vec -> Vec -> Double Source #
Compute partial derivative of lexicon probability. Much faster equivalent of
lexLogProbPartialDeriv ctr lengths oviols weights dir = dir `innerProd` snd (lexLogProbTotalDeriv ctr lengths oviols weights)
llpOptimizeWeights :: Ix sigma => Array Length Int -> PackedText sigma -> MulticountDFST sigma -> Vec -> Vec Source #
Calculate weights to maximize probability of lexicon.
Takes starting position of search which MUST have the correct dimensionality (do not use zero
)
:: (RandomGen g, Ix sigma, MonadState g m) | |
=> DFST Int sigma Double | Probability DFST |
-> Length | Maximum length to greate generator fot |
-> Length -> m [sigma] | Random generator taking length and returning action. |
Returns a monadic action to sample random words from a probability transducer,
which may be generated from a violation counter with (
).
For efficiency, evaluate this once then sequence the action repeatedly as intermediate values will be memoized.fmap
(maxentProb
weights) ctr)