Machine Learning II Markov Chains
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 25.04.2014
Application: Speech Recognition
Markov Chains – the probabilistic model
Let I = {1, 2, ..., n} be the set of "time points".
There is a random variable yi ∈ K for each i ∈ I, where K is a discrete domain (state set, label set, phoneme set etc.).
y = (y1, y2, ..., yn) with yi ∈ K is a state sequence. The set of all such sequences is Y = K^n – the sample space (the set of elementary events); p(y) = p(y1, y2, ..., yn) are their probabilities.
Markov Chains – the probabilistic model
In general: p(y1 ... yn) = p(y1 ... yn−1) · p(yn | y1 ... yn−1).
The Markovian property (simplified):
p(yn | y1 ... yn−1) = p(yn | yn−1)
It follows:
p(y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1)
The model parameters are:
– the starting probability distribution p(y1)
– the transition probability distributions p(yi | yi−1)
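This factorization is easy to evaluate by brute force for tiny chains. A minimal Python sketch (all parameter values are made up for illustration; K = {0, 1}, n = 3) that computes p(y) from the starting and transition distributions and checks that the probabilities of all K^n sequences sum to 1:

```python
from itertools import product

p1 = {0: 0.6, 1: 0.4}                      # starting distribution p(y1)
ptr = {(0, 0): 0.7, (0, 1): 0.3,           # transition p(yi | yi-1),
       (1, 0): 0.2, (1, 1): 0.8}           # keyed as (yi-1, yi)

def chain_prob(y):
    """p(y) = p(y1) * prod_{i>=2} p(yi | yi-1)."""
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= ptr[(a, b)]
    return p

# Summing over all K^n = 8 sequences must give 1.
total = sum(chain_prob(y) for y in product([0, 1], repeat=3))
```

Enumerating all K^n sequences is of course only feasible for toy sizes; the SumProd algorithm below avoids it.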
Other parameterizations
An example (applying Bayes' rule p(y2|y1) = p(y1|y2) p(y2) / p(y1)):
p(y) = p(y1, y2, y3) = p(y1) p(y2|y1) p(y3|y2) =
     = p(y1) · [p(y1|y2) p(y2) / p(y1)] · p(y3|y2) = p(y1|y2) p(y2) p(y3|y2)
A bit more general:
p(y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1) = ∏_{i=2}^{i*} p(yi−1 | yi) · p(y_{i*}) · ∏_{i=i*+1}^{n} p(yi | yi−1)
⇒the “start” can be anywhere
Other parameterizations
Another way:
p(y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1) =
     = p(y1) · ∏_{i=2}^{n} [ p(yi, yi−1) / p(yi−1) ] = ∏_{i=2}^{n} p(yi, yi−1) · ∏_{i=2}^{n−1} p(yi)^{−1}
⇒There is no “start” or “finish” at all – the parameterization is symmetric.
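That the symmetric parameterization reproduces the chain-rule form can be verified numerically. A sketch (same made-up toy parameters as above, n = 3): the pairwise marginals p(y1, y2), p(y2, y3) and the unary marginal p(y2) are computed from the chain, then both forms are compared on every sequence:

```python
from itertools import product

K = [0, 1]
p1 = {0: 0.6, 1: 0.4}
ptr = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}

# Unary marginal of the only "inner" node for n = 3: p(y2)
p2 = {b: sum(p1[a] * ptr[(a, b)] for a in K) for b in K}
# Pairwise marginals p(y1, y2) and p(y2, y3)
p12 = {(a, b): p1[a] * ptr[(a, b)] for a in K for b in K}
p23 = {(b, c): p2[b] * ptr[(b, c)] for b in K for c in K}

# chain-rule form vs. symmetric form, for every sequence (a, b, c)
ok = all(
    abs(p1[a] * ptr[(a, b)] * ptr[(b, c)]
        - p12[(a, b)] * p23[(b, c)] / p2[b]) < 1e-12
    for a, b, c in product(K, repeat=3)
)
```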
Other parameterizations
And even more:
p(y) = (1/Z) · ∏_{i=1}^{n} ψi(yi) · ∏_{i=2}^{n} ψi−1,i(yi−1, yi)
with the Partition Function:
Z = Σ_{y∈Y} [ ∏_{i=1}^{n} ψi(yi) · ∏_{i=2}^{n} ψi−1,i(yi−1, yi) ]
The parameters ψi : K → R+ and ψi−1,i : K × K → R+ are arbitrary (but non-negative) functions that do not (explicitly) represent any probabilities.
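For small chains the Partition Function can be computed by brute force, which makes the definition concrete. A Python sketch with arbitrary made-up positive factors (n = 3, K = {0, 1}; the pairwise factor is shared across positions for brevity):

```python
from itertools import product

K = [0, 1]
n = 3
psi = [{0: 1.0, 1: 2.0}, {0: 0.5, 1: 1.5}, {0: 2.0, 1: 1.0}]   # unary psi_i(k)
psi2 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}    # pairwise psi(k', k)

def weight(y):
    """Unnormalized weight: prod_i psi_i(yi) * prod_i psi(yi-1, yi)."""
    w = 1.0
    for i, k in enumerate(y):
        w *= psi[i][k]
    for a, b in zip(y, y[1:]):
        w *= psi2[(a, b)]
    return w

Z = sum(weight(y) for y in product(K, repeat=n))   # brute-force partition function
p = {y: weight(y) / Z for y in product(K, repeat=n)}
```

Dividing by Z turns the arbitrary positive weights into a proper distribution; note the factors themselves need not sum to anything in particular.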
Hidden Markov Chains (HMM)
There are two "kinds" of variables (both are sequences):
y = (y1, y2, ..., yn), yi ∈ K (hidden variables), and x = (x1, x2, ..., xn), xi ∈ F (observed variables).
A pair (x, y) is an elementary event.
Two representations for p(x, y):
"fence" and "comb" (Mealy and Moore automata).
Hidden Markov Chains, fence
p(x, y) = p(y1) · ∏_{i=2}^{n} p(xi, yi | yi−1)
Observations xi depend explicitly on the hidden variables at the i-th and (i−1)-th time points (x1 does not exist).
p(x, y) = p(y1) · ∏_{i=2}^{n} p(xi, yi | yi−1) =
        = p(y1) · ∏_{i=2}^{n} [ p(yi | yi−1) · p(xi | yi, yi−1) ] = p(y) · p(x|y)
where
p(y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1),   p(x|y) = ∏_{i=2}^{n} p(xi | yi, yi−1)
Hidden Markov Chains, comb
p(x, y) = p(y) · p(x|y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1) · ∏_{i=1}^{n} p(xi | yi)
Observations xi depend explicitly only on the corresponding hidden variable at the i-th time point.
First, it is a special case of the "fence": replace p(xi | yi, yi−1) by p(xi | yi). On the other hand ...
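The comb factorization can also be checked numerically. A minimal Python sketch (toy numbers, all made up: K = F = {0, 1}, n = 3) that evaluates p(x, y) = p(y1) ∏ p(yi|yi−1) ∏ p(xi|yi) and verifies that it is a proper joint distribution over pairs (x, y):

```python
from itertools import product

K, F, n = [0, 1], [0, 1], 3
p1 = {0: 0.6, 1: 0.4}
ptr = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
pem = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # p(xi | yi)

def joint(x, y):
    """Comb model: p(x, y) = p(y1) * prod p(yi|yi-1) * prod p(xi|yi)."""
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= ptr[(a, b)]
    for yi, xi in zip(y, x):
        p *= pem[(yi, xi)]
    return p

# Summing over all pairs (x, y) must give 1.
total = sum(joint(x, y)
            for x in product(F, repeat=n)
            for y in product(K, repeat=n))
```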
Hidden Markov Chains, fence vs. comb
It is always possible to reduce a "fence" representation to the "comb" one.
(Figure: example for 3 variables and 2 states.)
In general: introduce new variables ỹ1, ỹ2, ..., ỹn−1 that correspond to the pairs (y1, y2), (y2, y3), ..., (yn−1, yn) – the new label set K̃ = K × K represents state pairs from the old model.
⇒ Each observation x̃i depends explicitly only on the corresponding state ỹi
⇒ both representations (fence and comb) are equivalent.
Conditional independence
The conditional probability distribution p(x|y) is conditionally independent (in both the "fence" and the "comb" model), because it can be written as a product of probabilities.
Do not confuse independence and conditional independence. For instance, some xi and xj are conditionally independent given an arbitrary y, but they are of course dependent through the hidden part, i.e. p(xi, xj) ≠ p(xi) · p(xj).
Note: summation over x gives 1 for any y:
Σ_x p(x|y) = Σ_x ∏_i p(xi|yi) = ∏_i Σ_{k∈F} p(k|yi) = 1
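This normalization is easy to confirm numerically for the comb model. A short sketch (made-up emission table, F = {0, 1}, n = 3) checking that Σ_x p(x|y) = 1 for every fixed y:

```python
from itertools import product

F = [0, 1]
pem = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # p(xi | yi)

def cond(x, y):
    """p(x|y) = prod_i p(xi | yi) (comb model)."""
    p = 1.0
    for yi, xi in zip(y, x):
        p *= pem[(yi, xi)]
    return p

# For every fixed hidden sequence y, the sum over all observations x is 1.
norms = {y: sum(cond(x, y) for x in product(F, repeat=3))
         for y in product([0, 1], repeat=3)}
```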
The probability of observation
p(x) = Σ_y p(x, y) = Σ_y [ p(y1) · ∏_{i=2}^{n} p(yi|yi−1) · ∏_{i=1}^{n} p(xi|yi) ]
– the so-called SumProd problem.
For instance, it can be used to recognize the model: let two Markov chains be given (specified by their parameters p(y1), p(yi|yi−1) and p′(y1), p′(yi|yi−1) respectively). For a particular observation x, decide which model has generated it.
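Model recognition can be sketched with brute-force p(x) (feasible for toy sizes only; all parameters below are made up). Two chains share the same emission table; the one assigning the larger p(x) to the observation wins:

```python
from itertools import product

K, F, n = [0, 1], [0, 1], 3
pem = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # shared p(xi | yi)

def p_obs(x, p1, ptr):
    """p(x) = sum_y p(y1) * prod p(yi|yi-1) * prod p(xi|yi), by brute force."""
    total = 0.0
    for y in product(K, repeat=n):
        p = p1[y[0]]
        for a, b in zip(y, y[1:]):
            p *= ptr[(a, b)]
        for yi, xi in zip(y, x):
            p *= pem[(yi, xi)]
        total += p
    return total

# Two competing chains: model 1 tends to stay in state 0, model 2 alternates.
m1 = ({0: 0.9, 1: 0.1}, {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9})
m2 = ({0: 0.1, 1: 0.9}, {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.1})

x = (0, 0, 0)  # an all-zero observation is better explained by model 1
winner = 1 if p_obs(x, *m1) > p_obs(x, *m2) else 2
```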
Prior marginal probabilities
Suppose we are interested in the (prior) probability that at a given position i* the model generates a particular label k*. Note: this is a composite event whose probability is a sum over all corresponding elementary ones:
p(y_{i*} = k*) = Σ_{y : y_{i*} = k*} p(y) = Σ_{y : y_{i*} = k*} [ p(y1) · ∏_{i=2}^{n} p(yi|yi−1) ]
– a SumProd problem again.
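For a toy chain the prior marginal can be computed directly from the definition: sum p(y) over all sequences with the given label at the given position. A sketch with the same made-up parameters used throughout (n = 3; here p(y2 = 0) = 0.6·0.7 + 0.4·0.2 = 0.5):

```python
from itertools import product

K, n = [0, 1], 3
p1 = {0: 0.6, 1: 0.4}
ptr = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}

def chain_prob(y):
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= ptr[(a, b)]
    return p

i_star, k_star = 1, 0   # the event y2 = 0 (0-based position index)
# Sum over all elementary events consistent with the composite event.
marg = sum(chain_prob(y)
           for y in product(K, repeat=n) if y[i_star] == k_star)
```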
Posterior marginal probabilities
For a given observation x, calculate the probability that at a given position i* the model generates a particular label k*:
p(y_{i*} = k* | x) ∝ p(x, y_{i*} = k*) = Σ_{y : y_{i*} = k*} p(x, y) = Σ_{y : y_{i*} = k*} [ p(y1) · ∏_{i=2}^{n} p(yi|yi−1) · ∏_{i=1}^{n} p(xi|yi) ]
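The proportionality resolves by normalizing over k: dividing p(x, y_{i*} = k) by their sum, which equals p(x). A brute-force sketch (toy parameters as before, all made up):

```python
from itertools import product

K, n = [0, 1], 3
p1 = {0: 0.6, 1: 0.4}
ptr = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
pem = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # p(xi | yi)

def joint(x, y):
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= ptr[(a, b)]
    for yi, xi in zip(y, x):
        p *= pem[(yi, xi)]
    return p

x = (0, 1, 0)
i_star = 1
# Unnormalized p(x, y_{i*} = k) for each label k ...
scores = {k: sum(joint(x, y) for y in product(K, repeat=n) if y[i_star] == k)
          for k in K}
# ... normalized by their sum, which is exactly p(x).
post = {k: s / sum(scores.values()) for k, s in scores.items()}
```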
In all cases (as well as for the Partition Function) the SumProd problem can be expressed as
Z = Σ_y [ ∏_{i=1}^{n} ψi(yi) · ∏_{i=2}^{n} ψi−1,i(yi−1, yi) ]
SumProd algorithm
Example: f(a, b, c) = f1(a) · f2(a, b) · f3(b) · f4(b, c) · f5(c)
Z = Σ_{a,b,c} f(a, b, c) = Σ_a Σ_b Σ_c [ f1(a) f2(a, b) f3(b) f4(b, c) f5(c) ] =
  = Σ_a f1(a) · [ Σ_b f2(a, b) f3(b) · [ Σ_c f4(b, c) f5(c) ] ] =
  = Σ_a f1(a) · [ Σ_b f2(a, b) f3(b) F(b) ] = Σ_a G(a)
with F(b) = Σ_c f4(b, c) f5(c) and G(a) = Σ_b f2(a, b) f3(b) F(b).
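The elimination order above translates directly into code. A sketch with made-up factor tables over binary variables, comparing the naive triple sum against the factored computation:

```python
# Variables a, b, c each take values in {0, 1}; factor tables are made up.
f1 = {0: 1.0, 1: 2.0}
f2 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0, (1, 1): 1.0}
f3 = {0: 0.5, 1: 1.5}
f4 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
f5 = {0: 1.0, 1: 1.0}
V = [0, 1]

# Naive triple sum over all 2^3 assignments:
Z_naive = sum(f1[a] * f2[(a, b)] * f3[b] * f4[(b, c)] * f5[c]
              for a in V for b in V for c in V)

# Eliminate c, then b, then a (distributive law):
F = {b: sum(f4[(b, c)] * f5[c] for c in V) for b in V}          # F(b)
G = {a: sum(f2[(a, b)] * f3[b] * F[b] for b in V) for a in V}   # G(a)
Z_fast = sum(f1[a] * G[a] for a in V)
```

The factored version does three small sums instead of one sum over all joint assignments; this is exactly what makes the SumProd algorithm linear in the chain length.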
SumProd algorithm
The idea: propagate Bellman functions Fi (aka messages) that represent partial solutions (sums). With unary factors qi(k) = ψi(k) and pairwise factors gi(k′, k) = ψi−1,i(k′, k):

for (k = 1 ... K)
    F1(k) = q1(k)
for (i = 2 ... n)
    for (k = 1 ... K)
        Fi(k) = 0
        for (k′ = 1 ... K)
            Fi(k) = Fi(k) + Fi−1(k′) · gi(k′, k)
        Fi(k) = Fi(k) · qi(k)
Z = Σ_k Fn(k)

Time complexity – O(nK²)
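The forward pass above can be sketched in a few lines of Python and validated against the brute-force sum (factor values are made up; the pairwise factor is shared across positions for brevity):

```python
from itertools import product

K, n = 2, 4
q = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0], [1.0, 3.0]]  # unary factors q_i(k)
g = [[0.7, 0.3], [0.2, 0.8]]                          # pairwise factor g(k', k)

def sumprod_Z():
    """Forward pass: F_1(k) = q_1(k); F_i(k) = q_i(k) * sum_k' F_{i-1}(k') g(k', k)."""
    F = [q[0][k] for k in range(K)]
    for i in range(1, n):
        F = [q[i][k] * sum(F[kp] * g[kp][k] for kp in range(K))
             for k in range(K)]
    return sum(F)           # Z = sum_k F_n(k)

def weight(y):
    """Unnormalized weight of one sequence: product of all factors."""
    w = 1.0
    for i, k in enumerate(y):
        w *= q[i][k]
    for a, b in zip(y, y[1:]):
        w *= g[a][b]
    return w

Z_fast = sumprod_Z()                                           # O(n K^2)
Z_brute = sum(weight(y) for y in product(range(K), repeat=n))  # O(K^n)
```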
Conclusion
Today:
– (Hidden) Markov Chains – the probabilistic model, parameterizations
– Some "useful" probabilities, SumProd algorithm
Next classes:
– Decision making (Bayesian Decision Theory), Maximum A-posteriori decision, MinSum problems (Energy Minimization), Dynamic Programming, additive losses, MaxMarginal decisions
– Statistical Learning, Maximum-Likelihood Principle, Expectation-Maximization algorithm for unsupervised learning
– Discriminative learning, structural SVM