
Hidden Markov Models

The hidden Markov model is a statistical model in which the system modelled is a Markov chain with unknown parameters, and the challenge is to determine the hidden parameters from the observable data. Hidden Markov models were introduced for speech recognition in the 1960s and are now widely used in temporal pattern recognition.

We first introduce the fully observed Markov model in which the states are visible to the observer.

Then we proceed to the hidden Markov model, where the states are not observable, but each state has a probability distribution over the generated output data. This information can be used to determine the most likely state sequence that generated the output data.

6.1 Fully Observed Markov Model

We introduce a variant of the Markov chain model that will serve as a preliminary model for the hidden Markov model.

For this, we take an alphabet $\Sigma$ with $l$ symbols, an alphabet $\Sigma'$ with $l'$ symbols, and a positive integer $n$. We consider words $\sigma = \sigma_1 \ldots \sigma_n$ and $\tau = \tau_1 \ldots \tau_n$ of length $n$ over $\Sigma$ and $\Sigma'$, respectively.

These words are used to label the entries of an integral block matrix

$$A^{(l,l'),n} = (A_{\sigma,\tau})_{\sigma\in\Sigma^n,\,\tau\in\Sigma'^n}, \qquad (6.1)$$

whose entries $A_{\sigma,\tau}$ are pairs of matrices $(w, w')$ such that $w = (w_{rs})$ is an $l\times l$ matrix and $w' = (w'_{st})$ is an $l\times l'$ matrix. The entry $w_{rs} = w_{rs}(\sigma)$ counts the number of occurrences in $\sigma$ of the length-2 word $rs$, and the entry $w'_{st} = w'_{st}(\sigma,\tau)$ counts the number of indices $i$, $1\le i\le n$, such that $\sigma_i = s$ and $\tau_i = t$.
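As a small illustration (our own sketch, not from the text), the following Maple procedure computes the pair $(w, w')$ for a given pair of words, with the symbols of $\Sigma$ and $\Sigma'$ encoded as integers $1,\ldots,l$ and $1,\ldots,l'$; the procedure name countpair is ours.

# Sketch: compute the count matrices (w, w') for words sigma and tau,
# given as lists of integers; w[r,s] counts consecutive pairs rs in sigma,
# wp[s,t] counts positions i with sigma[i] = s and tau[i] = t.
countpair := proc(sigma, tau, l, lp)
  local w, wp, n, i, r, s;
  n := nops(sigma);
  w := array(1..l, 1..l): wp := array(1..l, 1..lp):
  for r from 1 to l do
    for s from 1 to l do w[r,s] := 0 od:
    for s from 1 to lp do wp[r,s] := 0 od:
  od:
  for i from 1 to n-1 do
    w[sigma[i], sigma[i+1]] := w[sigma[i], sigma[i+1]] + 1
  od:
  for i from 1 to n do
    wp[sigma[i], tau[i]] := wp[sigma[i], tau[i]] + 1
  od:
  [eval(w), eval(wp)]
end:
countpair([1,1,2,2], [1,2,1,2], 2, 2);   # sigma = FFLL, tau = htht

The entries of the returned pair sum to $(n-1) + n = 2n-1$, in line with the column-sum property stated below.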

We may view the pairs $(w, w')$ as the columns of the matrix $A^{(l,l'),n}$. Then the matrix $A^{(l,l'),n}$ has $d = l\cdot l + l\cdot l' = l^2 + l\cdot l'$ rows, labelled first by the length-2 words $rs$ over $\Sigma$ and then by the pairs $st$ in $\Sigma\times\Sigma'$. Moreover, the matrix has $m = l^n\cdot l'^n$ columns, labelled by the pairs of length-$n$ words $\sigma$ and $\tau$ over $\Sigma$ and $\Sigma'$, respectively. The matrix $A^{(l,l'),n}$ has the property that each of its columns sums to $(n-1) + n = 2n-1$, since each word of length $n$ has $n-1$ consecutive length-2 subwords and two words of length $n$ pair in $n$ positions. Thus the matrix $A^{(l,l'),n}$ defines a toric model $f = f^{(l,l'),n} : \mathbb{R}^d \to \mathbb{R}^m$ given as
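For instance, with $l = l' = 2$ and $n = 4$ (the setting of Example 6.1 below), the matrix $A^{(2,2),4}$ has $d = 2^2 + 2\cdot 2 = 8$ rows and $m = 2^4\cdot 2^4 = 256$ columns, and each column sums to $2\cdot 4 - 1 = 7$.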

$$(\theta, \theta') \mapsto (p_{\sigma,\tau})_{\sigma\in\Sigma^n,\,\tau\in\Sigma'^n}, \qquad (6.2)$$

where

$$p_{\sigma,\tau} = \frac{1}{l}\,\theta'_{\sigma_1\tau_1}\theta_{\sigma_1\sigma_2}\theta'_{\sigma_2\tau_2}\theta_{\sigma_2\sigma_3}\cdots\theta_{\sigma_{n-1}\sigma_n}\theta'_{\sigma_n\tau_n}. \qquad (6.3)$$

Here we assume a uniform initial distribution on the states in the alphabet $\Sigma$ as described by (7.1). Apart from the initial factor $1/l$, each term $p_{\sigma,\tau}$ has $n + (n-1) = 2n-1$ factors. The parameter space $\Theta$ of the model is the Cartesian product of the set of positive $l\times l$ matrices $\theta$ and the set of positive $l\times l'$ matrices $\theta'$; that is, $\Theta = \mathbb{R}^{l\times l}_{>0} \times \mathbb{R}^{l\times l'}_{>0}$. The matrix $\theta$ encodes a toric Markov chain, while the matrix $\theta'$ encodes the interplay between the two alphabets. The state space of the model is $\Sigma^n\times\Sigma'^n$. This model is called a fully observed toric Markov model.

Example 6.1. Consider a dishonest dealer in a casino tossing coins. We know that she may use a fair or a loaded coin, the latter of which is supposed to have probability 0.75 of showing heads. We also know that she tends not to change coins; a switch happens with probability 0.1 (Fig. 6.1). Given a sequence of coin tosses, we wish to determine when she used the loaded coin and when the fair one.

Fig. 6.1. Transition graph of the casino model.

This model can be described by a toric Markov model with the alphabets $\Sigma = \{F, L\}$, where $F$ stands for fair and $L$ stands for loaded, and $\Sigma' = \{h, t\}$, where $h$ stands for heads and $t$ stands for tails. The corresponding $8\times 256$ matrix $A^{(2,2),4}$ for sequences of length $n = 4$ consists of the columns $A_{\sigma,\tau}$, where $\sigma$ and $\tau$ range over all words of length 4 over $\Sigma$ and $\Sigma'$, respectively; e.g., for the pairs $(FFLL, htht)$ and $(FFFL, hhhh)$. The column for $(FFLL, htht)$ is the pair of count matrices

$$w = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad w' = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},$$

recording one occurrence each of $FF$, $FL$, and $LL$ in $\sigma$, and one occurrence each of $Fh$, $Ft$, $Lh$, and $Lt$ in the paired positions. The model has $d = 8$ parameters given by the matrices

$$\theta = \begin{pmatrix} \theta_{FF} & \theta_{FL} \\ \theta_{LF} & \theta_{LL} \end{pmatrix} \quad\text{and}\quad \theta' = \begin{pmatrix} \theta'_{Fh} & \theta'_{Ft} \\ \theta'_{Lh} & \theta'_{Lt} \end{pmatrix}.$$


Suppose the dealer tosses four coins in a row, so that we consider sequences of length $n = 4$. Then the model has $m = (2\cdot 2)^4 = 256$ states. The fully observed toric Markov model is defined by the mapping

$$f : \mathbb{R}^8 \to \mathbb{R}^{256} : (\theta, \theta') \mapsto (p_{\sigma,\tau})_{\sigma\in\Sigma^4,\,\tau\in\Sigma'^4},$$

where

$$p_{\sigma_1\sigma_2\sigma_3\sigma_4,\tau_1\tau_2\tau_3\tau_4} = \frac{1}{2}\,\theta'_{\sigma_1\tau_1}\theta_{\sigma_1\sigma_2}\theta'_{\sigma_2\tau_2}\theta_{\sigma_2\sigma_3}\theta'_{\sigma_3\tau_3}\theta_{\sigma_3\sigma_4}\theta'_{\sigma_4\tau_4}.$$

For instance, in view of the above pairs of sequences, we obtain

$$p_{FFLL,htht} = \frac{1}{2}\,\theta'_{Fh}\theta_{FF}\theta'_{Ft}\theta_{FL}\theta'_{Lh}\theta_{LL}\theta'_{Lt} = \frac{1}{2}\,\theta_{FF}\theta_{FL}\theta_{LL}\,\theta'_{Fh}\theta'_{Ft}\theta'_{Lh}\theta'_{Lt}$$

and

$$p_{FFFL,hhhh} = \frac{1}{2}\,\theta'_{Fh}\theta_{FF}\theta'_{Fh}\theta_{FF}\theta'_{Fh}\theta_{FL}\theta'_{Lh} = \frac{1}{2}\,\theta_{FF}^2\theta_{FL}\,\theta'^{3}_{Fh}\theta'_{Lh}. \qquad\diamond$$

Second, we introduce the fully observed Markov model as a submodel of the toric Markov model.

For this, the parameter space of the fully observed toric Markov model is restricted to the set of pairs of matrices $(\theta, \theta')$ whose row sums are equal to 1. The parameter space of the fully observed Markov model is thus a subset $\Theta_1$ of $\mathbb{R}^{l\times(l-1)}_{>0} \times \mathbb{R}^{l\times(l'-1)}_{>0}$, and the number of parameters is $d = l\cdot(l-1) + l\cdot(l'-1) = l\cdot(l + l' - 2)$. A pair of matrices $(\theta, \theta')$ in $\Theta_1$ provides an $l\times l$ matrix $\theta$ describing transition probabilities and an $l\times l'$ matrix $\theta'$ providing emission probabilities. The value $\theta_{ij}$ represents the probability to pass from state $i\in\Sigma$ to state $j\in\Sigma$ in one step, and the value $\theta'_{ij}$ is the probability to emit the symbol $j\in\Sigma'$ in state $i\in\Sigma$. The fully observed Markov model is given by the mapping $f^{(l,l'),n} : \mathbb{R}^d \to \mathbb{R}^m$ restricted to the parameter space $\Theta_1$. Each point $p$ in the image $f^{(l,l'),n}(\Theta_1)$ is called a marginal probability. We usually assume that the initial distribution on the first state in $\Sigma$ is uniform.
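To make the meaning of transition and emission probabilities concrete, here is a small Maple sketch (our own addition, not from the text) that draws one pair $(\sigma,\tau)$ from the model under the uniform initial distribution; the procedure names draw and samplepair are hypothetical.

# Sketch: sample (sigma, tau) from the fully observed Markov model.
# T: l x l transition matrix, E: l x lp emission matrix, both row stochastic.
with(linalg):
draw := proc(p)                # draw an index according to probability vector p
  local r, k, acc;
  r := evalf(rand()/10^12);    # uniform random number in [0, 1)
  acc := 0: k := 1:
  while k < vectdim(p) and acc + p[k] <= r do
    acc := acc + p[k]: k := k + 1:
  od:
  k
end:
samplepair := proc(T, E, l, n)
  local sigma, tau, i;
  sigma := vector(n): tau := vector(n):
  sigma[1] := draw(vector([seq(1/l, i=1..l)])):   # uniform initial state
  tau[1] := draw(row(E, sigma[1])):
  for i from 2 to n do
    sigma[i] := draw(row(T, sigma[i-1])):         # transition step
    tau[i] := draw(row(E, sigma[i])):             # emission step
  od:
  [eval(sigma), eval(tau)]
end:

For the casino model one would call, e.g., samplepair(matrix([[0.9, 0.1], [0.1, 0.9]]), matrix([[0.5, 0.5], [0.75, 0.25]]), 2, 4);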

Let $u = (u_{\sigma,\tau}) \in \mathbb{N}_0^{l^n\times l'^n}$ be a frequency vector representing $N$ observed sequence pairs in $\Sigma^n\times\Sigma'^n$. That is, $u_{\sigma,\tau}$ counts the number of times the pair $(\sigma,\tau)$ is observed, so that $\sum_{\sigma,\tau} u_{\sigma,\tau} = N$. The sufficient statistic $v = A^{(l,l'),n}\cdot u$ can be regarded as a pair of matrices $(v, v')$, where $v = (v_{rs})$ is an $l\times l$ matrix whose entries $v_{rs}$ count the occurrences of $rs\in\Sigma^2$ as a consecutive pair in the sequences $\sigma$ of the observed pairs $(\sigma,\tau)$, and $v' = (v'_{st})$ is an $l\times l'$ matrix whose entries $v'_{st}$ count the occurrences of $st\in\Sigma\times\Sigma'$ at the same position in the observed sequence pairs $(\sigma,\tau)$.
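In practice one need not form the large matrix $A^{(l,l'),n}$ to obtain $v$; the following Maple sketch (ours, reusing the countpair procedure from above) accumulates $(v, v')$ directly from the observed pairs.

# Sketch: accumulate the sufficient statistic (v, v') from a list of
# observed pairs, each given as [sigma, tau] with integer-coded words.
with(linalg):                  # for matrix and matadd
suffstat := proc(data, l, lp)
  local v, vp, c, p;
  v := matrix(l, l, 0):        # pairwise transition counts over all sigma
  vp := matrix(l, lp, 0):      # positionwise emission counts over all pairs
  for p in data do
    c := countpair(p[1], p[2], l, lp):
    v := matadd(v, c[1]): vp := matadd(vp, c[2]):
  od:
  [eval(v), eval(vp)]
end:
# Example: suffstat([ [[1,1,2,2],[1,2,1,2]], [[1,1,1,2],[1,1,1,1]] ], 2, 2);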

The likelihood function of the fully observed Markov model is given as

$$L(\theta,\theta') = (\theta,\theta')^{A^{(l,l'),n}\cdot u} = \theta^{v}\cdot(\theta')^{v'}, \qquad (\theta,\theta')\in\Theta_1. \qquad (6.4)$$

Proposition 6.2. In the fully observed Markov chain model $f^{(l,l'),n}$, the maximum likelihood estimate for the frequency data $u\in\mathbb{N}_0^{l^n\times l'^n}$ with sufficient statistic $v = A^{(l,l'),n}\cdot u$ is the matrix pair $(\hat\theta, \hat\theta')$ in $\Theta_1$ such that

$$\hat\theta_{\sigma_1\sigma_2} = \frac{v_{\sigma_1\sigma_2}}{\sum_{\sigma\in\Sigma} v_{\sigma_1\sigma}} \quad\text{and}\quad \hat\theta'_{\sigma_1\tau_1} = \frac{v'_{\sigma_1\tau_1}}{\sum_{\tau\in\Sigma'} v'_{\sigma_1\tau}}, \qquad \sigma_1,\sigma_2\in\Sigma,\ \tau_1\in\Sigma'.$$
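A Maple sketch of the row normalization in Prop. 6.2 (our own addition; it assumes every state actually occurs, so that no row of counts is zero):

# Sketch: maximum likelihood estimates per Prop. 6.2, obtained by
# normalizing each row of a count matrix; zero rows are not handled.
rownormalize := proc(v, rows, cols)
  local th, r, s, total;
  th := array(1..rows, 1..cols):
  for r from 1 to rows do
    total := 0:
    for s from 1 to cols do total := total + v[r,s] od:
    for s from 1 to cols do th[r,s] := v[r,s]/total od:
  od:
  eval(th)
end:
# thetahat := rownormalize(v, l, l):  thetahatp := rownormalize(vp, l, lp):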

The proof is analogous to that of Prop. 4.11, since the log-likelihood function similarly decouples into independent parts.
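In brief (our expansion of the argument): taking logarithms in (6.4) gives

$$\log L(\theta,\theta') = \sum_{r,s\in\Sigma} v_{rs}\log\theta_{rs} + \sum_{s\in\Sigma,\,t\in\Sigma'} v'_{st}\log\theta'_{st},$$

a sum of separate concave terms for each row of $\theta$ and of $\theta'$; maximizing each row subject to its entries summing to 1 yields the normalized counts of Prop. 6.2.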

Example 6.3 (Maple). We reconsider the dealer's example. The parameter space $\Theta_1$ of the fully observed Markov model can be viewed as the set of all pairs of probability matrices

$$\theta = \begin{pmatrix} \theta_{FF} & 1-\theta_{FF} \\ 1-\theta_{LL} & \theta_{LL} \end{pmatrix} \quad\text{and}\quad \theta' = \begin{pmatrix} \theta'_{Fh} & 1-\theta'_{Fh} \\ 1-\theta'_{Lt} & \theta'_{Lt} \end{pmatrix},$$

where $\theta_{FF} = \theta_{LL} = 0.9$ is the probability to stay with the fair or loaded coin, $\theta'_{Fh} = 0.5$ is the probability to observe heads with the fair coin, and $\theta'_{Lh} = 1 - \theta'_{Lt} = 0.75$ is the probability to observe heads with the loaded coin.

This model has onlyd= 4 parameters.

Suppose a game involves tossing the coin $n = 4$ times. Then the model has $m = (2\cdot 2)^4 = 256$ states.

This Markov model is given by the mapping $f^{(2,2),4} : \mathbb{R}^4 \to \mathbb{R}^{256}$ with marginal probabilities

$$p_{\sigma_1\sigma_2\sigma_3\sigma_4,\tau_1\tau_2\tau_3\tau_4} = \frac{1}{2}\,\theta'_{\sigma_1\tau_1}\theta_{\sigma_1\sigma_2}\theta'_{\sigma_2\tau_2}\theta_{\sigma_2\sigma_3}\theta'_{\sigma_3\tau_3}\theta_{\sigma_3\sigma_4}\theta'_{\sigma_4\tau_4}.$$

For instance, suppose that in a game the fair coin was used for the first two tosses and the loaded coin for the last two, and heads was observed each time. The corresponding probability is given by

$$p_{FFLL,hhhh} = \frac{1}{2}\,\theta'_{Fh}\theta_{FF}\theta'_{Fh}(1-\theta_{FF})(1-\theta'_{Lt})\theta_{LL}(1-\theta'_{Lt}).$$
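With the given numerical values $\theta_{FF} = \theta_{LL} = 0.9$, $\theta'_{Fh} = 0.5$, and $\theta'_{Lt} = 0.25$, this evaluates (our arithmetic) to

$$p_{FFLL,hhhh} = \frac{1}{2}\cdot 0.5\cdot 0.9\cdot 0.5\cdot 0.1\cdot 0.75\cdot 0.9\cdot 0.75 \approx 0.0057.$$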

We compute the likelihood function symbolically. For this, we load the packages

with(combinat): with(linalg):

and initialize

n := 4: l := 2: lp := 2:     # lp stands for l', since ' is not valid in a name
m := (l * lp)^n:
T := array([ [tFF, 1 - tFF], [1 - tLL, tLL] ]):   # transition matrix theta
E := array([ [tFh, 1 - tFh], [1 - tLt, tLt] ]):   # emission matrix theta'
P := array(1..l^n, 1..lp^n):                      # marginal probabilities

The marginal values are computed by the following code, where the subset R[i] marks the positions of sigma equal to L and the subset S[j] marks the positions of tau equal to t:

R := powerset([1,2,3,4]):
S := powerset([1,2,3,4]):
for i from 1 to nops(R) do
  x := vector([1,1,1,1]):      # state word: 1 = F, 2 = L
  for u from 1 to nops(R[i]) do
    x[R[i,u]] := 2
  od:
  for j from 1 to nops(S) do
    y := vector([1,1,1,1]):    # output word: 1 = h, 2 = t
    for v from 1 to nops(S[j]) do
      y[S[j,v]] := 2
    od:
    P[i,j] := 1/2 * E[x[1],y[1]] * T[x[1],x[2]] * E[x[2],y[2]]
                  * T[x[2],x[3]] * E[x[3],y[3]]
                  * T[x[3],x[4]] * E[x[4],y[4]]
  od:
od:
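As a quick sanity check (our addition), substituting the numerical parameter values should make the 256 marginal probabilities sum to 1:

check := add(add(P[a,b], b=1..lp^n), a=1..l^n):
evalf(subs({tFF=0.9, tLL=0.9, tFh=0.5, tLt=0.25}, check));   # expect 1.0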
