src/AI/Markov/HMM.lhs

> {-# LANGUAGE RecordWildCards, TemplateHaskell #-}
> 
> module AI.Markov.HMM (HMM(..), observe, evaluate, inspect, sequenceP) where
>
> import Control.Applicative ((<$>), pure)
> import Control.Monad (forM)
> import Data.Bifunctor (Bifunctor(first))
> import Data.Distribution (Distribution(..), Probability, (<?), (?>), (<~~))
> import Data.Function (on)
> import Data.Function.Memoize (Memoizable(..), deriveMemoize, deriveMemoizable, memoize4)
> import Data.List (maximumBy)
> import Data.List.Extras (pairs, argmax)
> import Data.Ratio (Ratio)
> import System.Random (RandomGen(..))
> import System.Random.Extras (split3)

_Hidden Markov models_ (HMMs) are used to model generative sequences characterisable by a doubly-embedded stochastic process in which the underlying process is hidden, and can only be observed through an upper-level process, which produces output. Structurally: You know what a [Mealy machine][mealy] is, right? A discrete hidden Markov model is structurally similar to a Mealy machine, except that its transitions and output values are governed by probability distributions, rather than tokens from an input alphabet. Rather than having a fixed initial state, this is chosen by an experiment on the prior distribution for the state set; a set of conditional probability distributions determine the likelihood of transition to any state from a given state, and another set the conditional probability of emitting each symbol in the output alphabet. [mealy]: https://en.wikipedia.org/wiki/Mealy_machine More formally, a HMM is a five-tuple consisting of:

> data HMM state symbol = HMM

1. The distinct states of the process, S. (|S| = N)

>   { states :: [state]

2. A finite dictionary of possible observations, E.

>   , symbols :: [symbol]

3. An initial state distribution, the likelihood of the process starting in each state s ∈ S.

>   , start :: Distribution state

4. A transition distribution; given some current state i ∈ S, this provides the likelihood of the process next transiting to any j ∈ S. By the Markov assumption, this probability is dependent only on i.

>   , transition :: state -> Distribution state

5. The observation symbol distributions; given some s ∈ S, this provides the likelihood of every observation o ∈ E being observed at that i. By the independence assumption, the output observation at any time is dependent only on the current state.

>   , emission :: state -> Distribution symbol
>   }

Having characterised some sequence generator as a HMM (and so given a well-parametrised configuration of such a model), we can simulate its output without collecting any further data; inferring future behaviour from demonstrated statistical properties. This is particularly useful in cases where gathering raw data is expensive.

> observe :: RandomGen seed => seed -> HMM state symbol -> [symbol]
> observe seed hmm@HMM{..} = observe' ns hmm (start <~~ ts)
>   where (ts, ns) = split seed
>
> observe' :: RandomGen seed => seed -> HMM state symbol -> state -> [symbol]
> observe' seed hmm@HMM{..} state = obs : observe' s2 hmm nxt
>   where (s0,s1,s2) = split3 seed
>         (obs, nxt) = (emission state <~~ s0, transition state <~~ s1)

Rabiner [^rabiner1989] outlined three fundamental inference problems for HMMs: 1. Evaluation: given a model and a sequence $(O_n)_{n=1}^T$ of observations, compute the likelihood those observations were produced by that HMM. This can also be interpreted as a scoring problem -- given a sequence produced by the real signal source, we can compare the accuracy of models. We first consider a straightforward (albeit intractable) approach; computing the likelihood our observations were produced by each possible state sequence $(I_n)_{n=1}^T$ (of appropriate length), and summing the result probabilities. I.e: $$P(O|HMM)=\sum_{j=0}^{n}P(O|I_n,HMM)P(I_n|HMM)$$ First, some precursors; given a set of states, `sequencesOfN` finds all $n$-length state sequences:

> sequencesOfN :: Int -> HMM state symbol -> [[state]]
> sequencesOfN n = sequence . replicate n . states

`sequenceP` determines the likelihood of a state sequence, given a model; $P(I|HMM)=P(I_1)P(I_2|I_1)\ldots P(I_r|I_{r-1})$:

> sequenceP :: Eq state => HMM state symbol -> [state] -> Probability
> sequenceP HMM{..} sequence = product 
>                            $ head sequence <? start 
>                            : map (uncurry (?>) . first transition) (pairs sequence)

`sequenceObservationsP` computes the likelihood of an observation sequence given a state sequence; $P(O|I, HMM)=P(O_1|I_1)P(O_2|I_2) \ldots P(O_n|I_n)$:

> sequenceObservationsP :: Eq symbol => HMM state symbol -> [(state, symbol)] -> Probability
> sequenceObservationsP HMM{..} = product . map (uncurry (?>) . first emission)

Using the above primitives, we can now express the procedure:

> inefficientEvaluate :: (Eq state, Eq symbol) => HMM state symbol -> [symbol] -> Probability
> inefficientEvaluate hmm observations = sum $ zipWith (*) statesP statesObsP
>   where states     = sequencesOfN (length observations) hmm
>         statesP    = sequenceP hmm <$> states
>         statesObsP = sequenceObservationsP hmm <$> map (`zip` observations) states

But this has abysmal runtime-performance! Let $N$ be the number of states, and $T$ equal to the number of observations (and thus the length of each state sequence); there are thus $N^T$ possible state sequences, and for each sequence we require about $2T$ calculations (Rabiner's [^rabiner1989] figure, I haven't validated this); meaning $2TN^T$ calculations in total: $N$ $T$ $2TN^T$ ---- ---- ------------------ 5 100 $\approx 10^{72}$ 10 100 $\approx 10^{102}$ 15 100 $\approx 10^{120}$ Given this profile, this function is only included here for didactic purposes, and is not exported by this module. A more efficient solution exists in the _forward algorithm_, which arranges the computation so that redundant calculations may be cached: Given a partial observation sequence $\bold{O} = {O_1,O_2,\ldots,O_n}$ and a terminal state $i$, the _forward variable_ provides the likelihood of having observed $\bold{O}$ and being in state $i$ after time $n$ -- $\alpha_n(i) = P(O_1,O_2,\ldots,O_n,I_n=i|HMM)$:

> forwardVariable' :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => Int -> HMM state symbol -> [symbol] -> state -> Probability
> forwardVariable' 0 HMM{..} observations state = (state <? start) * (head observations <? emission state)
> forwardVariable' t hmm@HMM{..} observations state = (*) (observations !! t <? emission state) . sum $ do
>   predecessor <- states
>   let a = forwardVariable (t-1) hmm observations predecessor
>       b = state <? transition predecessor
>   return $ a * b

Below, we compute the terminal forward variable for each state in the HMM, using a constant set of observations. As the computation of `a` is independent of the state under consideration, it's time-saving to memoise this value:

> forwardVariable :: (Enum state, Bounded state, Eq state, Eq symbol, Memoizable state, Memoizable symbol) => Int -> HMM state symbol -> [symbol] -> state -> Probability
> forwardVariable = memoize4 forwardVariable'
>
> instance (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => Memoizable (HMM state symbol) where
>   memoize = $(deriveMemoize ''HMM)
>
> deriveMemoizable ''Ratio

As the final observation must be emitted in some (unknown) state, taking the sum of terminal forward variables for each state yields our desired probability -- $P(O_1,O_2,\ldots,O_n|HMM) = \sum_{n=1}^{|I|} \alpha_{|\bold{O}|}(I_n)$:

> forward :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => HMM state symbol -> [symbol] -> Probability
> forward hmm@HMM{..} observations = sum [forwardVariable t hmm observations state | state <- states]
>   where t = pred $ length observations
>
> evaluate :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => HMM state symbol -> [symbol] -> Probability
> evaluate = forward

This evaluation algorithm requires about $TN^2$ calculations, making it many orders of magnitudes more efficient than the naïve approach: $N$ $T$ $TN^2$ ---- ---- ------- 5 100 2,500 10 100 10,000 15 100 22,500 2. Inspection: uncovering the hidden part of the model; from an observation sequence and a model, uncover the state sequence best explaining those observations, as provided by some optimality criterion. There are several possible ways of solving this problem: there are several possible optimality criteria, as the specification is ambiguous as to the definition of an optimal state sequence. One possible criterion involves maximising the expected number of correct individual states: computing at each point in time, the most likely state given the observation sequence and model. To implement this solution, we need implement a variant of the forward variable (namely, the backward variable), that describes the likelihood of observing a sequence of succeeding observations from some known state -- $\beta_n(i) = P(O_{n+1},O_{n+2},\ldots,O_{T},I_n=i,HMM)$:

> backwardVariable' :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => Int -> HMM state symbol -> [symbol] -> state -> Probability
> backwardVariable' n HMM{..} observations state = if succ n == length observations then 1 else sum $ do
>   successor <- states
>   let a = transition state ?> successor
>       b = emission successor ?> (observations !! n)
>       c = backwardVariable (n+1) HMM{..} observations successor
>   return $ a * b * c
>
> backwardVariable :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => Int -> HMM state symbol -> [symbol] -> state -> Probability
> backwardVariable = memoize4 backwardVariable'

To recap: the forward variable determines the likelihood of reaching some state $i$ being reached at time $n$, and a sequence $O_1,O_2,\ldots,O_n$ being observed preceding it. The backward variable accounts for the likelihood of the same state $i$ being reached at time $n$, and the succeeding observation sequence being $O_{n+1},O_{n+2},\ldots,O_{T}$. Between them we can compute, provided a model and observation sequence, the likelihood of some state $i$ given the observation sequence--the smoothed probability value: $\gamma_n(i)=\frac{\alpha_n(i)\beta_n(i)} {\sum^{N}_{s=1}\alpha_n{s}\beta_n{s}} = \frac{P(I_n=i,O_1,O_2,\ldots,O_T|HMM)} {\sum^{N}_{s=1} P(I_n=s,O_1,O_2,\ldots,O_T|HMM)} = \frac{P(I_n=i,O_1,O_2,\ldots,O_T|HMM)} {P(O_1,O_2,\ldots,O_T|HMM)} = P(I_n=i|O_1,O_2,\ldots,O_T,HMM)$

> forwardBackwardVariable :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => Int -> HMM state symbol -> [symbol] -> state -> Probability
> forwardBackwardVariable n hmm observations state = forwardVariable n hmm observations state 
>                                                  * backwardVariable n hmm observations state
> 
> smooth :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => Int -> HMM state symbol -> [symbol] -> state -> Probability
> smooth n hmm@HMM{..} observations state = numerator / denominator
>   where numerator         = forwardBackward n state
>         denominator       = sum $ zipWith forwardBackward [0..] states
>         forwardBackward n = forwardBackwardVariable n hmm observations

By maximising the smoothing value at each position in the sequence, we can find the most likely state at each position:

> forwardBackward :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => HMM state symbol -> [symbol] -> [state]
> forwardBackward hmm@HMM{..} observations = do
>   position <- [0..pred $ length observations]
>   return $ argmax (smooth position hmm observations) states

This is a bad criterion, though; in considering only individual states, we neglect information about the probability that state sequences will occur. Consider the case in which a HMM has state transitions with zero probability: the optimal state sequence may not even be valid! For this reason, as before, this function is not exported by this module. A more reasonable optimality criterion is to find the most probable contiguous sequence of states; i.e. determining the state sequence that maximises $P(O,I|HMM)$. This can be found by application of _the Viterbi algorithm_, which involves maximising over likelihood estimates for each possible state sequence. Given an observation sequence $\bold{O} = {O_1,O_2,\ldots,O_T}$ a state $i$, the _Viterbi step_ finds the state sequence most likely to account for the first $n$ observations and terminating at state $i$; that is: $$\delta_n(i) = \argmax_{I_1,I_2,\ldots,I_{n-1}} P(I_1,I_2,\ldots,I_n=1, \bold{O}|HMM)$$:

> viterbiStep' :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => Int -> HMM state symbol -> [symbol] -> state -> ([state], Probability)
> viterbiStep' 0 hmm@HMM{..} observations state = (pure state, start ?> state * emission state ?> head observations)
> viterbiStep' n hmm@HMM{..} observations state = argmax snd $ do
>   predecessor <- states
>   let (path, prob) = viterbiStep (n-1) hmm observations predecessor
>       likelihood   = prob * transition predecessor ?> state * emission state ?> (observations !! n)
>   return $ (state:path, likelihood)
>
> viterbiStep :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => Int -> HMM state symbol -> [symbol] -> state -> ([state], Probability)
> viterbiStep = memoize4 viterbiStep'

As the final observation must be emitted in some state, maximising the Viterbi step over $S$ yields the desired state sequence:

> viterbi :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => HMM state symbol -> [symbol] -> [state]
> viterbi hmm@HMM{..} observations = reverse . fst . argmax snd $ do
>   let t = length observations - 1
>   terminal <- states
>   return $ viterbiStep t hmm observations terminal
>
> inspect :: (Memoizable state, Memoizable symbol, Eq state, Eq symbol, Enum state, Bounded state) => HMM state symbol -> [symbol] -> [state]
> inspect = viterbi

3. Training: given some observation sequence, determine the parameters of some HMM that best model the data. If we so adapt model parameters to observed training data, we can accurately simulate real signal sources. [^rabiner1989]: Lawrence R. Rabiner, _A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition_, Proceedings of the IEEE 77 (1989): 257-286