nlp-scores-0.7.0: Scoring functions commonly used for evaluation in NLP and IR

Safe HaskellNone
LanguageHaskell98

NLP.Scores

Contents

Description

Scoring functions commonly used for evaluation of NLP systems. Most functions in this module work on sequences which are instances of Foldable, but some take a precomputed table of Counts. This will give a speedup if you want to compute multiple scores on the same data. For example to compute the Mutual Information, Variation of Information and the Adjusted Rand Index on the same pair of clusterings:

>>> let cs = counts "abcabc" "abaaba"
>>> mapM_ (print . ($ cs)) [mi, ari, vi]
>>> 0.9182958340544894
>>> 0.4444444444444445
>>> 0.6666666666666663

Synopsis

Scores for classification and ranking

errorRate :: (Eq a, Fractional c, Traversable t, Foldable s) => t a -> s a -> c Source

Error rate: the proportion of elements in the first sequence NOT equal to elements at corresponding positions in second sequence. Sequences should be of equal lengths.

accuracy :: (Eq a, Fractional c, Traversable t, Foldable s) => t a -> s a -> c Source

Accuracy: the proportion of elements in the first sequence equal to elements at corresponding positions in second sequence. Sequences should be of equal lengths.

recipRank :: (Eq a, Fractional b, Foldable t) => a -> t a -> b Source

Reciprocal rank: the reciprocal of the rank at which the first arguments occurs in the sequence given as the second argument.

Scores for clustering

ari :: (Ord a, Ord b) => Counts a b -> Double Source

mi :: (Ord a, Ord b) => Counts a b -> Double Source

Mutual information: MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). Also known as information gain.

vi :: (Ord a, Ord b) => Counts a b -> Double Source

Variation of information: VI(X,Y) = H(X) + H(Y) - 2 MI(X,Y)

Strength of association

logLikelihoodRatio :: (Ord a, Ord b) => Counts a b -> a -> b -> Double Source

Log-likelihood ratio for two binomial distributions. H_0: P(x|y) = p = P(x|~y) H_1: P(x|y) = p1 =/= p2 = P(x|~y)

Comparing probability distributions

kullbackLeibler :: (Eq a, Floating a, Foldable f, Traversable t) => t a -> f a -> a Source

Kullback-Leibler divergence: KL(X,Y) = SUM_i P(X=i) log_2(P(X=i)/P(Y=i)). The distributions can be unnormalized.

jensenShannon :: (Eq a, Floating a, Traversable t, Traversable u) => t a -> u a -> a Source

Jensen-Shannon divergence: JS(X,Y) = 1/2 KL(X,(X+Y)/2) + 1/2 KL(Y,(X+Y)/2). The distributions can be unnormalized.

Auxiliary types and functions

type Count = Double Source

A count

data Counts a b Source

Count table

Instances

(Ord a, Ord b) => Monoid (Counts a b) 

counts :: (Ord a, Ord b, Traversable t, Foldable s) => t a -> s b -> Counts a b Source

Creates count table Counts

sum :: (Foldable t, Num a) => t a -> a Source

The sum of a sequence of numbers

mean :: (Foldable t, Fractional n, Real a) => t a -> n Source

The mean of a sequence of numbers.

jaccard :: (Fractional n, Ord a) => Set a -> Set a -> n Source

Jaccard coefficient J(A,B) = |AB| / |A union B|

entropy :: (Floating c, Foldable t) => t c -> c Source

Entropy: H(X) = -SUM_i P(X=i) log_2(P(X=i)). entropy xs is the entropy of the random variable represented by the sequence xs, where each element of xs is the count of the one particular value the random variable can take. If you need to compute the entropy from a sequence of outcomes, the following will work:

entropy . elems . histogram

histogram :: (Num a, Ord k, Foldable t) => t k -> Map k a Source

histogram xs is returns the map of the frequency counts of the elements in sequence xs

Extracting joint and marginal counts from Counts

countJoint :: (Ord a, Ord b) => a -> b -> Counts a b -> Count Source

Joint count

countFst :: Ord k => k -> Counts k b -> Count Source

Count of first element

countSnd :: Ord k => k -> Counts a k -> Count Source

Count of second element

countTotal :: Counts a k -> Count Source

Total element count

Extracting lists of values from Counts

fstElems :: Counts k b -> [k] Source

List of values of first element

sndElems :: Counts a k -> [k] Source

List of values of second element