Machine Learning II Markov Chains
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 25.04.2014
Application: Speech Recognition
Markov Chains – the probabilistic model
Let I = {1, 2, ..., n} be the set of "time points".
There is a random variable yi ∈ K for each i ∈ I, where K is a discrete domain (state set, label set, phoneme set etc.).
y = (y1, y2, ..., yn) with yi ∈ K is a state sequence. The set of all such sequences is Y = K^n – the sample space (the set of elementary events); p(y) = p(y1, y2, ..., yn) are their probabilities.
Markov Chains – the probabilistic model
In general: p(y1 ... yn) = p(y1 ... yn−1) · p(yn | y1 ... yn−1).
The Markovian property (simplified):
p(yn | y1 ... yn−1) = p(yn | yn−1)
It follows:
p(y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1)
The model parameters are:
– the starting probability distribution p(y1)
– the transition probability distributions p(yi | yi−1)
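This factorization is easy to evaluate by brute force for tiny chains. A minimal Python sketch (all parameter values are made up for illustration; K = {0, 1}, n = 3) that computes p(y) from the starting and transition distributions and checks that the probabilities of all K^n sequences sum to 1:

```python
from itertools import product

p1 = {0: 0.6, 1: 0.4}                      # starting distribution p(y1)
ptr = {(0, 0): 0.7, (0, 1): 0.3,           # transition p(yi | yi-1),
       (1, 0): 0.2, (1, 1): 0.8}           # keyed as (yi-1, yi)

def chain_prob(y):
    """p(y) = p(y1) * prod_{i>=2} p(yi | yi-1)."""
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= ptr[(a, b)]
    return p

# Summing over all K^n = 8 sequences must give 1.
total = sum(chain_prob(y) for y in product([0, 1], repeat=3))
```

Enumerating all K^n sequences is of course only feasible for toy sizes; the SumProd algorithm below avoids it.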
Other parameterizations
An example (applying Bayes' rule p(y2|y1) = p(y1|y2) p(y2) / p(y1)):
p(y) = p(y1, y2, y3) = p(y1) p(y2|y1) p(y3|y2) =
     = p(y1) · [p(y1|y2) p(y2) / p(y1)] · p(y3|y2) = p(y1|y2) p(y2) p(y3|y2)
A bit more general:
p(y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1) = ∏_{i=2}^{i*} p(yi−1 | yi) · p(y_{i*}) · ∏_{i=i*+1}^{n} p(yi | yi−1)
⇒the “start” can be anywhere
Other parameterizations
Another way:
p(y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1) =
     = p(y1) · ∏_{i=2}^{n} [ p(yi, yi−1) / p(yi−1) ] = ∏_{i=2}^{n} p(yi, yi−1) · ∏_{i=2}^{n−1} p(yi)^{−1}
⇒There is no “start” or “finish” at all – the parameterization is symmetric.
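That the symmetric parameterization reproduces the chain-rule form can be verified numerically. A sketch (same made-up toy parameters as above, n = 3): the pairwise marginals p(y1, y2), p(y2, y3) and the unary marginal p(y2) are computed from the chain, then both forms are compared on every sequence:

```python
from itertools import product

K = [0, 1]
p1 = {0: 0.6, 1: 0.4}
ptr = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}

# Unary marginal of the only "inner" node for n = 3: p(y2)
p2 = {b: sum(p1[a] * ptr[(a, b)] for a in K) for b in K}
# Pairwise marginals p(y1, y2) and p(y2, y3)
p12 = {(a, b): p1[a] * ptr[(a, b)] for a in K for b in K}
p23 = {(b, c): p2[b] * ptr[(b, c)] for b in K for c in K}

# chain-rule form vs. symmetric form, for every sequence (a, b, c)
ok = all(
    abs(p1[a] * ptr[(a, b)] * ptr[(b, c)]
        - p12[(a, b)] * p23[(b, c)] / p2[b]) < 1e-12
    for a, b, c in product(K, repeat=3)
)
```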
Other parameterizations
And even more:
p(y) = (1/Z) · ∏_{i=1}^{n} ψi(yi) · ∏_{i=2}^{n} ψi−1,i(yi−1, yi)
with the Partition Function:
Z = Σ_{y∈Y} [ ∏_{i=1}^{n} ψi(yi) · ∏_{i=2}^{n} ψi−1,i(yi−1, yi) ]
The parameters ψi : K → R+ and ψi−1,i : K × K → R+ are arbitrary (but non-negative) functions that do not (explicitly) represent any probabilities.
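For small chains the Partition Function can be computed by brute force, which makes the definition concrete. A Python sketch with arbitrary made-up positive factors (n = 3, K = {0, 1}; the pairwise factor is shared across positions for brevity):

```python
from itertools import product

K = [0, 1]
n = 3
psi = [{0: 1.0, 1: 2.0}, {0: 0.5, 1: 1.5}, {0: 2.0, 1: 1.0}]   # unary psi_i(k)
psi2 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}    # pairwise psi(k', k)

def weight(y):
    """Unnormalized weight: prod_i psi_i(yi) * prod_i psi(yi-1, yi)."""
    w = 1.0
    for i, k in enumerate(y):
        w *= psi[i][k]
    for a, b in zip(y, y[1:]):
        w *= psi2[(a, b)]
    return w

Z = sum(weight(y) for y in product(K, repeat=n))   # brute-force partition function
p = {y: weight(y) / Z for y in product(K, repeat=n)}
```

Dividing by Z turns the arbitrary positive weights into a proper distribution; note the factors themselves need not sum to anything in particular.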
Hidden Markov Chains (HMM)
There are two "kinds" of variables (both are sequences):
y = (y1, y2, ..., yn), yi ∈ K (hidden variables), and x = (x1, x2, ..., xn), xi ∈ F (observed variables).
A pair (x, y) is an elementary event.
Two representations for p(x, y):
"fence" and "comb" (Mealy and Moore automata).
Hidden Markov Chains, fence
p(x, y) = p(y1) · ∏_{i=2}^{n} p(xi, yi | yi−1)
Observations xi depend explicitly on the hidden variables at the i-th and (i−1)-th time points (x1 does not exist).
p(x, y) = p(y1) · ∏_{i=2}^{n} p(xi, yi | yi−1) =
        = p(y1) · ∏_{i=2}^{n} [ p(yi | yi−1) · p(xi | yi, yi−1) ] = p(y) · p(x|y)
where
p(y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1),   p(x|y) = ∏_{i=2}^{n} p(xi | yi, yi−1)
Hidden Markov Chains, comb
p(x, y) = p(y) · p(x|y) = p(y1) · ∏_{i=2}^{n} p(yi | yi−1) · ∏_{i=1}^{n} p(xi | yi)
Observations xi depend explicitly only on the corresponding hidden variable at the i-th time point.
First, it is a special case of the "fence": replace p(xi | yi, yi−1) by p(xi | yi). On the other hand ...
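The comb factorization can also be checked numerically. A minimal Python sketch (toy numbers, all made up: K = F = {0, 1}, n = 3) that evaluates p(x, y) = p(y1) ∏ p(yi|yi−1) ∏ p(xi|yi) and verifies that it is a proper joint distribution over pairs (x, y):

```python
from itertools import product

K, F, n = [0, 1], [0, 1], 3
p1 = {0: 0.6, 1: 0.4}
ptr = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
pem = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # p(xi | yi)

def joint(x, y):
    """Comb model: p(x, y) = p(y1) * prod p(yi|yi-1) * prod p(xi|yi)."""
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= ptr[(a, b)]
    for yi, xi in zip(y, x):
        p *= pem[(yi, xi)]
    return p

# Summing over all pairs (x, y) must give 1.
total = sum(joint(x, y)
            for x in product(F, repeat=n)
            for y in product(K, repeat=n))
```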
Hidden Markov Chains, fence vs. comb
It is always possible to reduce a "fence" representation to the "comb" one.
(Figure: example for 3 variables and 2 states.)
In general: introduce new variables ỹ1, ỹ2, ..., ỹn−1 that correspond to the pairs (y1, y2), (y2, y3), ..., (yn−1, yn) – the new label set K̃ = K × K represents state pairs from the old model.
⇒ Each observation x̃i depends explicitly only on the corresponding state ỹi
⇒ both representations (fence and comb) are equivalent.
Conditional independence
The conditional probability distribution p(x|y) is conditionally independent (in both the "fence" and the "comb" model), because it can be written as a product of probabilities.
Do not confuse independence and conditional independence. For instance, some xi and xj are conditionally independent given an arbitrary y, but they are of course dependent through the hidden part, i.e. p(xi, xj) ≠ p(xi) · p(xj).
Note: summation over x gives 1 for any y:
Σ_x p(x|y) = Σ_x ∏_i p(xi|yi) = ∏_i Σ_{k∈F} p(k|yi) = 1
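This normalization is easy to confirm numerically for the comb model. A short sketch (made-up emission table, F = {0, 1}, n = 3) checking that Σ_x p(x|y) = 1 for every fixed y:

```python
from itertools import product

F = [0, 1]
pem = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # p(xi | yi)

def cond(x, y):
    """p(x|y) = prod_i p(xi | yi) (comb model)."""
    p = 1.0
    for yi, xi in zip(y, x):
        p *= pem[(yi, xi)]
    return p

# For every fixed hidden sequence y, the sum over all observations x is 1.
norms = {y: sum(cond(x, y) for x in product(F, repeat=3))
         for y in product([0, 1], repeat=3)}
```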
The probability of observation
p(x) = Σ_y p(x, y) = Σ_y [ p(y1) · ∏_{i=2}^{n} p(yi|yi−1) · ∏_{i=1}^{n} p(xi|yi) ]
– the so-called SumProd problem.
For instance, it can be used to recognize the model: let two Markov chains be given (specified by their parameters p(y1), p(yi|yi−1) and p′(y1), p′(yi|yi−1) respectively). For a particular observation x, decide which model has generated it.
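Model recognition can be sketched with brute-force p(x) (feasible for toy sizes only; all parameters below are made up). Two chains share the same emission table; the one assigning the larger p(x) to the observation wins:

```python
from itertools import product

K, F, n = [0, 1], [0, 1], 3
pem = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # shared p(xi | yi)

def p_obs(x, p1, ptr):
    """p(x) = sum_y p(y1) * prod p(yi|yi-1) * prod p(xi|yi), by brute force."""
    total = 0.0
    for y in product(K, repeat=n):
        p = p1[y[0]]
        for a, b in zip(y, y[1:]):
            p *= ptr[(a, b)]
        for yi, xi in zip(y, x):
            p *= pem[(yi, xi)]
        total += p
    return total

# Two competing chains: model 1 tends to stay in state 0, model 2 alternates.
m1 = ({0: 0.9, 1: 0.1}, {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9})
m2 = ({0: 0.1, 1: 0.9}, {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.1})

x = (0, 0, 0)  # an all-zero observation is better explained by model 1
winner = 1 if p_obs(x, *m1) > p_obs(x, *m2) else 2
```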
Prior marginal probabilities
Suppose we are interested in the (prior) probability that at a given position i* the model generates a particular label k*. Note: this is a composite event whose probability is a sum over all corresponding elementary ones:
p(y_{i*} = k*) = Σ_{y : y_{i*} = k*} p(y) = Σ_{y : y_{i*} = k*} [ p(y1) · ∏_{i=2}^{n} p(yi|yi−1) ]
– a SumProd problem again.
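For a toy chain the prior marginal can be computed directly from the definition: sum p(y) over all sequences with the given label at the given position. A sketch with the same made-up parameters used throughout (n = 3; here p(y2 = 0) = 0.6·0.7 + 0.4·0.2 = 0.5):

```python
from itertools import product

K, n = [0, 1], 3
p1 = {0: 0.6, 1: 0.4}
ptr = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}

def chain_prob(y):
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= ptr[(a, b)]
    return p

i_star, k_star = 1, 0   # the event y2 = 0 (0-based position index)
# Sum over all elementary events consistent with the composite event.
marg = sum(chain_prob(y)
           for y in product(K, repeat=n) if y[i_star] == k_star)
```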
Posterior marginal probabilities
For a given observation x, calculate the probability that at a given position i* the model generates a particular label k*:
p(y_{i*} = k* | x) ∝ p(x, y_{i*} = k*) = Σ_{y : y_{i*} = k*} p(x, y) = Σ_{y : y_{i*} = k*} [ p(y1) · ∏_{i=2}^{n} p(yi|yi−1) · ∏_{i=1}^{n} p(xi|yi) ]
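The proportionality resolves by normalizing over k: dividing p(x, y_{i*} = k) by their sum, which equals p(x). A brute-force sketch (toy parameters as before, all made up):

```python
from itertools import product

K, n = [0, 1], 3
p1 = {0: 0.6, 1: 0.4}
ptr = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
pem = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # p(xi | yi)

def joint(x, y):
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= ptr[(a, b)]
    for yi, xi in zip(y, x):
        p *= pem[(yi, xi)]
    return p

x = (0, 1, 0)
i_star = 1
# Unnormalized p(x, y_{i*} = k) for each label k ...
scores = {k: sum(joint(x, y) for y in product(K, repeat=n) if y[i_star] == k)
          for k in K}
# ... normalized by their sum, which is exactly p(x).
post = {k: s / sum(scores.values()) for k, s in scores.items()}
```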
In all cases (as well as for the Partition Function) the SumProd problem can be expressed as
Z = Σ_y [ ∏_{i=1}^{n} ψi(yi) · ∏_{i=2}^{n} ψi−1,i(yi−1, yi) ]
SumProd algorithm
Example: f(a, b, c) = f1(a) · f2(a, b) · f3(b) · f4(b, c) · f5(c)
Z = Σ_{a,b,c} f(a, b, c) = Σ_a Σ_b Σ_c [ f1(a) f2(a, b) f3(b) f4(b, c) f5(c) ] =
  = Σ_a f1(a) · [ Σ_b f2(a, b) f3(b) · [ Σ_c f4(b, c) f5(c) ] ] =
  = Σ_a f1(a) · [ Σ_b f2(a, b) f3(b) F(b) ] = Σ_a G(a)
with F(b) = Σ_c f4(b, c) f5(c) and G(a) = Σ_b f2(a, b) f3(b) F(b).
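The elimination order above translates directly into code. A sketch with made-up factor tables over binary variables, comparing the naive triple sum against the factored computation:

```python
# Variables a, b, c each take values in {0, 1}; factor tables are made up.
f1 = {0: 1.0, 1: 2.0}
f2 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0, (1, 1): 1.0}
f3 = {0: 0.5, 1: 1.5}
f4 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
f5 = {0: 1.0, 1: 1.0}
V = [0, 1]

# Naive triple sum over all 2^3 assignments:
Z_naive = sum(f1[a] * f2[(a, b)] * f3[b] * f4[(b, c)] * f5[c]
              for a in V for b in V for c in V)

# Eliminate c, then b, then a (distributive law):
F = {b: sum(f4[(b, c)] * f5[c] for c in V) for b in V}          # F(b)
G = {a: sum(f2[(a, b)] * f3[b] * F[b] for b in V) for a in V}   # G(a)
Z_fast = sum(f1[a] * G[a] for a in V)
```

The factored version does three small sums instead of one sum over all joint assignments; this is exactly what makes the SumProd algorithm linear in the chain length.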
SumProd algorithm
The idea: propagate Bellman functions Fi (aka messages) that represent partial solutions (sums). With unary factors qi(k) = ψi(k) and pairwise factors gi(k′, k) = ψi−1,i(k′, k):

for (k = 1 ... K)
    F1(k) = q1(k)
for (i = 2 ... n)
    for (k = 1 ... K)
        Fi(k) = 0
        for (k′ = 1 ... K)
            Fi(k) = Fi(k) + Fi−1(k′) · gi(k′, k)
        Fi(k) = Fi(k) · qi(k)
Z = Σ_k Fn(k)

Time complexity – O(nK²)
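The forward pass above can be sketched in a few lines of Python and validated against the brute-force sum (factor values are made up; the pairwise factor is shared across positions for brevity):

```python
from itertools import product

K, n = 2, 4
q = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0], [1.0, 3.0]]  # unary factors q_i(k)
g = [[0.7, 0.3], [0.2, 0.8]]                          # pairwise factor g(k', k)

def sumprod_Z():
    """Forward pass: F_1(k) = q_1(k); F_i(k) = q_i(k) * sum_k' F_{i-1}(k') g(k', k)."""
    F = [q[0][k] for k in range(K)]
    for i in range(1, n):
        F = [q[i][k] * sum(F[kp] * g[kp][k] for kp in range(K))
             for k in range(K)]
    return sum(F)           # Z = sum_k F_n(k)

def weight(y):
    """Unnormalized weight of one sequence: product of all factors."""
    w = 1.0
    for i, k in enumerate(y):
        w *= q[i][k]
    for a, b in zip(y, y[1:]):
        w *= g[a][b]
    return w

Z_fast = sumprod_Z()                                           # O(n K^2)
Z_brute = sum(weight(y) for y in product(range(K), repeat=n))  # O(K^n)
```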
Conclusion
Today:
– (Hidden) Markov Chains – the probabilistic model, parameterizations
– Some "useful" probabilities, SumProd algorithm
Next classes:
– Decision making (Bayesian Decision Theory), Maximum A-posteriori decision, MinSum problems (Energy Minimization), Dynamic Programming, additive losses, MaxMarginal decisions
– Statistical Learning, Maximum-Likelihood Principle, Expectation-Maximization algorithm for unsupervised learning
– Discriminative learning, structural SVM