Machine Learning II
Inference in Markov Chains
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 02.05.2014
Most probable state sequence
p(x, y) = p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i-1}) · ∏_{i=1}^{n} p(x_i | y_i)

Given an x, estimate y. A "reasonable" approach – the most (a-posteriori) probable state sequence:
arg max_y p(y|x) = arg max_y p(x, y) / p(x) = arg max_y p(x, y) =

= arg max_y [ p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i-1}) · ∏_{i=1}^{n} p(x_i | y_i) ]

= / take the negative logarithm /

= arg min_y [ ∑_{i=1}^{n} ψ_i(y_i) + ∑_{i=2}^{n} ψ_{i-1,i}(y_{i-1}, y_i) ]

with

ψ_1(y_1) = −ln p(x_1|y_1) − ln p(y_1), ψ_i(y_i) = −ln p(x_i|y_i) for i = 2 … n, ψ_{i-1,i}(y_{i-1}, y_i) = −ln p(y_i|y_{i-1})
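To make the ψ notation concrete, here is a minimal sketch that evaluates this MinSum energy for a toy chain; all numbers are made up for illustration, and the function name `energy` is an assumption, not from the slides:

```python
import numpy as np

# Hypothetical toy chain: 2 states, 3 observation symbols (all numbers made up).
log_prior = np.log(np.array([0.6, 0.4]))        # ln p(y_1)
log_trans = np.log(np.array([[0.7, 0.3],        # ln p(y_i | y_{i-1})
                             [0.2, 0.8]]))
log_emit = np.log(np.array([[0.5, 0.4, 0.1],    # ln p(x_i | y_i)
                            [0.1, 0.3, 0.6]]))

def energy(x, y):
    """Sum of psi_i(y_i) + psi_{i-1,i}(y_{i-1}, y_i) = -ln p(x, y);
    minimizing it over y maximizes the joint probability."""
    e = -log_prior[y[0]] - log_emit[y[0], x[0]]  # psi_1 absorbs the prior
    for i in range(1, len(x)):
        e += -log_trans[y[i - 1], y[i]] - log_emit[y[i], x[i]]
    return e

x = [0, 2, 1]
print(energy(x, [0, 1, 1]), energy(x, [1, 0, 0]))  # the first sequence has lower energy
```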
Most probable state sequence
In contrast to SumProd we have to minimize over all state sequences rather than to sum them up.
The "quality" of a sequence is not a product but a sum. The same algorithm as SumProd, but in another semiring:

SumProd ⇔ MinSum

Dynamic Programming (Viterbi, Dijkstra ...)

The idea – propagate Bellman functions F_i by

F_i(k) = ψ_i(k) + min_{k'} [ F_{i-1}(k') + ψ_{i-1,i}(k', k) ]

The functions F_i represent the quality of the best extension into the already processed part.
Dynamic Programming (derivation)
min_{y_1} min_{y_2} … min_{y_n} [ ∑_{i=1}^{n} ψ_i(y_i) + ∑_{i=2}^{n} ψ_{i-1,i}(y_{i-1}, y_i) ] =

= min_{y_2} … min_{y_n} [ ∑_{i=2}^{n} ψ_i(y_i) + ∑_{i=3}^{n} ψ_{i-1,i}(y_{i-1}, y_i) + min_{y_1} [ ψ_1(y_1) + ψ_{1,2}(y_1, y_2) ] ] =

= min_{y_2} … min_{y_n} [ ∑_{i=2}^{n} ψ_i(y_i) + ∑_{i=3}^{n} ψ_{i-1,i}(y_{i-1}, y_i) + F(y_2) ] =

= min_{y_2} … min_{y_n} [ ∑_{i=2}^{n} ψ̃_i(y_i) + ∑_{i=3}^{n} ψ_{i-1,i}(y_{i-1}, y_i) ]

with ψ̃_2(k) = ψ_2(k) + F(k), ψ̃_i(k) = ψ_i(k) for i = 3 … n.
Dynamic Programming (the algorithm)
// Forward pass
for i = 2 to n
    for k = 1 to K
        best = ∞
        for k' = 1 to K
            if ψ_{i-1}(k') + ψ_{i-1,i}(k', k) < best
                best = ψ_{i-1}(k') + ψ_{i-1,i}(k', k), pointer_i(k) = k'
        ψ_i(k) = ψ_i(k) + best

// Backward pass
best = ∞
for k = 1 to K
    if ψ_n(k) < best
        best = ψ_n(k), y_n = k
for i = n−1 down to 1
    y_i = pointer_{i+1}(y_{i+1})

pointer_i(k) is the best predecessor for k.
Time complexity: O(nK^2) (due to the forward pass).
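The forward/backward passes above can be sketched in vectorized form; a minimal, illustrative implementation (the function name and the toy costs are assumptions, not from the slides):

```python
import numpy as np

def viterbi_minsum(psi, pairwise):
    """MinSum dynamic programming on a chain.
    psi:      (n, K) unary costs psi_i(k), with the prior folded into psi[0]
    pairwise: (K, K) pairwise costs psi_{i-1,i}(k', k)
    Returns the minimizing state sequence; O(n K^2) time."""
    n, K = psi.shape
    F = psi[0].copy()                       # Bellman function F_1(k)
    pointer = np.zeros((n, K), dtype=int)   # pointer_i(k): best predecessor of k
    for i in range(1, n):                   # forward pass
        cand = F[:, None] + pairwise        # cand[k', k] = F_{i-1}(k') + psi_{i-1,i}(k', k)
        pointer[i] = np.argmin(cand, axis=0)
        F = psi[i] + cand.min(axis=0)       # F_i(k) = psi_i(k) + min_{k'} [...]
    y = np.zeros(n, dtype=int)              # backward pass: follow the pointers
    y[-1] = int(np.argmin(F))
    for i in range(n - 2, -1, -1):
        y[i] = pointer[i + 1, y[i + 1]]
    return y

# Toy costs (made up): the unaries prefer the sequence 0,1,0; a weak switching
# penalty keeps it, a strong one smooths the result to 0,0,0.
psi = np.array([[0., 5.], [5., 0.], [0., 5.]])
print(viterbi_minsum(psi, np.array([[0., 1.], [1., 0.]])))    # [0 1 0]
print(viterbi_minsum(psi, np.array([[0., 10.], [10., 0.]])))  # [0 0 0]
```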
Inference (Recognition)
The model:
Let two random variables be given:
– The first one is typically discrete (k ∈ K) – "class"
– The second one is arbitrary (x ∈ X) – "observation"
Let the joint probability distribution p(x, k) be "known".
The recognition task: given x, estimate k. Usual problems (questions):
– How to estimate k from x?
→ Bayesian Decision Theory
– The joint probability is not always explicitly specified
– The set K is sometimes huge, e.g. the set of all state sequences in Markov Chains
Idea – a game
Somebody samples a pair (x, k) according to a p.d. p(x, k). He keeps k hidden and presents x to you.
You decide for some k* according to a chosen decision strategy.
Somebody penalizes your decision according to a loss-function, i.e. he compares your decision to the true hidden k.
You know both p(x, k) and the loss-function (how he compares).
Your goal is to design the decision strategy so as to pay as little as possible on average.
Bayesian Risk
Notations:
The decision set D. Note: it need not coincide with K !!!
Examples: decisions like "I don't know", "not this class" ...
A decision strategy is a mapping e : X → D. The loss-function is C : D × K → R.
The Bayesian Risk of a strategy e is the expected loss:

R(e) = ∑_x ∑_k p(x, k) · C(e(x), k) → min_e

It should be minimized with respect to the decision strategy. Another "writing style":

d*(x) = arg min_d ∑_k p(k|x) · C(d, k)
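A minimal sketch of this decision rule, with a made-up posterior and a loss matrix that includes an extra "I don't know" row to illustrate that D and K may differ:

```python
import numpy as np

def bayes_decision(posterior, loss):
    """d* = argmin_d sum_k p(k|x) * C(d, k).
    posterior: (K,) posterior p(k|x); loss: (D, K) loss matrix C(d, k).
    The rows (decisions) need not correspond to the classes."""
    risks = loss @ posterior              # expected loss of every decision d
    return int(np.argmin(risks))

# Illustrative numbers: 3 classes, 4 decisions; the last row is a
# hypothetical "I don't know" option with a constant moderate loss.
loss = np.array([[0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0],
                 [0.5, 0.5, 0.5]])
print(bayes_decision(np.array([0.4, 0.35, 0.25]), loss))  # 3: rejecting is cheapest
print(bayes_decision(np.array([0.9, 0.05, 0.05]), loss))  # 0: confident enough to commit
```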
Maximum A-posteriori Decision (MAP)
The loss is the simplest one:

C(k, k') = 1 if k ≠ k', 0 otherwise = δ(k ≠ k')

i.e. we pay 1 if the answer is not the true class, no matter what error we make.
From that follows:

R(k) = ∑_{k'} p(k'|x) · δ(k ≠ k') = ∑_{k'} p(k'|x) − p(k|x) = 1 − p(k|x) → min_k

⇔ p(k|x) → max_k
For Markov Chains → Dynamic Programming
Additive loss-functions – an example
Consider a "questionnaire": m persons answer n questions.

        Q1   Q2   ...   Qn
  P1     1    0   ...    1
  P2     0    1   ...    0
  ...   ...  ...  ...   ...
  Pm     0    1   ...    0
  "P"    ?    ?   ...    ?

Furthermore, let us assume that persons are rated – a "reliability" measure is assigned to each one.
The goal is to find the "right" answers for all questions.

Strategy 1:
Choose the best person and take all his/her answers.

Strategy 2:
– Consider a particular question
– Look what all the people say concerning it, do (weighted) voting
Additive loss-functions – example interpretation
People are classes k, the reliability measure is the posterior p(k|x). Specialty: classes consist of "parts" (questions) – classes are structured.
The set of classes is K^m; a class k = (k_1, k_2 … k_m) ∈ K^m can be seen as a vector of m components, each one being a simple answer (0 or 1 in the above example).
"Strategy 1" is MAP.
How to derive (consider, understand) the other decision strategy from the viewpoint of Bayesian Decision Theory?
Additive loss-functions
Consider the simple loss C(k, k') = δ(k ≠ k') for the case where classes are structured – it does not reflect how strongly the class and the decision disagree.
A better (?) choice – an additive loss-function:

C(k, k') = ∑_i c_i(k_i, k'_i)

i.e. the disagreements of all components are summed up.
Substitute it into the formula for the Bayesian Risk, derive, and see what happens ...
Additive loss-functions – derivation
R(k) = ∑_{k'} p(k'|x) · ∑_i c_i(k_i, k'_i) = / swap summations /

= ∑_i ∑_{k'} c_i(k_i, k'_i) · p(k'|x) = / split the summation /

= ∑_i ∑_{l ∈ K} ∑_{k': k'_i = l} c_i(k_i, l) · p(k'|x) = / factor out /

= ∑_i ∑_{l ∈ K} c_i(k_i, l) · ∑_{k': k'_i = l} p(k'|x) = / these are marginals /

= ∑_i ∑_{l ∈ K} c_i(k_i, l) · p(k'_i = l | x) → min_k

/ independent problems /

⇒ ∑_{l ∈ K} c_i(k_i, l) · p(k'_i = l | x) → min_{k_i}  ∀i
Additive loss-functions – the strategy
1. Compute the marginal probability distributions

p(k'_i = l | x) = ∑_{k': k'_i = l} p(k'|x)

for each variable i and each value l.

2. Decide for each variable "independently" according to its marginal p.d. and the local loss c_i:

∑_{l ∈ K} c_i(k_i, l) · p(k'_i = l | x) → min_{k_i}
This is again a Bayesian Decision Problem – minimize the average loss
Additive loss-functions – a special case
For each variable we pay 1 if we are wrong:

c_i(k_i, k'_i) = δ(k_i ≠ k'_i)

The overall loss is the number of misclassified variables (wrongly answered questions)

C(k, k') = ∑_i δ(k_i ≠ k'_i)

and is called the Hamming distance.
The decision strategy is the Maximum Marginal Decision:

k_i* = arg max_l p(k'_i = l | x)  ∀i
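A tiny made-up joint posterior illustrates that this Hamming-optimal decision can differ from MAP:

```python
import numpy as np

# Made-up joint posterior p(k|x) over k = (k1, k2) with binary components.
joint = np.array([[0.4, 0.0],    # p(k1=0,k2=0), p(k1=0,k2=1)
                  [0.3, 0.3]])   # p(k1=1,k2=0), p(k1=1,k2=1)

# MAP: the single most probable configuration.
map_dec = np.unravel_index(np.argmax(joint), joint.shape)

# Maximum Marginal Decision: argmax of each marginal independently.
marg1 = joint.sum(axis=1)        # p(k1|x) = [0.4, 0.6]
marg2 = joint.sum(axis=0)        # p(k2|x) = [0.7, 0.3]
mm_dec = (int(np.argmax(marg1)), int(np.argmax(marg2)))

print(tuple(map(int, map_dec)), mm_dec)  # (0, 0) vs (1, 0)
```

Here (1, 0) has joint probability 0.3 < 0.4, yet its expected Hamming loss is 0.7 versus 0.9 for the MAP answer (0, 0).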
Minimum Marginal Square Error (MMSE)
Assume the values l for k_i are numbers (vectors). Examples:
– in tracking it is the set of all possible positions of the object to be tracked
– in stereo it is the set of all disparity/depth values etc.
→ a more reasonable (additive) loss should account for the metric difference between the decision and the true position, e.g.

C(k, k') = ∑_i c_i(k_i, k'_i) = ∑_i ‖k_i − k'_i‖²

The task to be solved for each position i is

∑_{l ∈ K} ‖k_i − l‖² · p(k'_i = l | x) → min_{k_i}
Minimum Marginal Square Error (MMSE)
∑_{l ∈ K} ‖k_i − l‖² · p(k'_i = l | x) → min_{k_i}

Setting the derivative to zero:

∂/∂k_i = ∑_{l ∈ K} 2 · (k_i − l) · p(k'_i = l | x) = 0

∑_{l ∈ K} k_i · p(k'_i = l | x) = ∑_{l ∈ K} l · p(k'_i = l | x)

k_i = ∑_{l ∈ K} l · p(k'_i = l | x)

The optimal decision for the i-th variable is the expectation (average) under the corresponding marginal probability distribution.
Note: the decision is not necessarily an element of K, e.g. it may be real-valued → the sets D and K are different.
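A one-variable sketch with made-up numbers, showing the MMSE decision is the posterior mean and need not lie in K:

```python
import numpy as np

# One variable with made-up candidate values (e.g. disparities) and marginal.
values = np.array([0.0, 1.0, 2.0, 3.0])      # possible labels l in K
marginal = np.array([0.1, 0.2, 0.5, 0.2])    # p(k_i' = l | x)

k_mmse = float(np.sum(values * marginal))        # posterior mean: 1.8, not in K
k_maxmarg = float(values[np.argmax(marginal)])   # max-marginal decision: 2.0
print(k_mmse, k_maxmarg)
```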
Fast computation of marginal probabilities
Back to Markov Chains ...
Usually we are interested in all marginal probabilities at once, i.e. for all i* and all k* (nK values in total).
Naive variant: for each i* and k*
– define the corresponding task by introducing an "additional" ψ_{i*}(k) = δ(k = k*)
– use the SumProd algorithm to solve it
→ time complexity O(nK² · nK) :-(
Fast computation of marginal probabilities
The key to the faster computation is the Markovian property (conditional independence)
Let z represent partial sequences (y_1 … y_{i*−1}), and let

ψ(z) = ∏_{i=1}^{i*−1} ψ_i(y_i) · ∏_{i=2}^{i*−1} ψ_{i-1,i}(y_{i-1}, y_i),  ψ(z, k*) = ψ_{i*−1,i*}(y_{i*−1}, k*)

Analogously, let z' represent partial sequences (y_{i*+1} … y_n), with ψ(k*, z') and ψ(z') defined analogously as well ...
Fast computation of marginal probabilities
The SumProd task to be solved for (i*, k*) can be rewritten as

∑_{z, z'} ψ(z) · ψ(z, k*) · ψ_{i*}(k*) · ψ(k*, z') · ψ(z') =

= [ ∑_z ψ(z) · ψ(z, k*) ] · ψ_{i*}(k*) · [ ∑_{z'} ψ(k*, z') · ψ(z') ] =

= F→_{i*}(k*) · ψ_{i*}(k*) · F←_{i*}(k*)

However, the F→_{i*}(k*) are just the Bellman functions used in SumProd, i.e. they can all be computed in O(nK²) time. The same holds for F←_{i*}(k*), but from right to left.
Fast computation of marginal probabilities
Algorithm summary:
(1) Compute F→_i(k) for all (i, k) by propagating

F→_i(k) = ∑_{k'} F→_{i-1}(k') · ψ_{i-1}(k') · ψ_{i-1,i}(k', k)

(2) Compute F←_i(k) for all (i, k) by propagating

F←_i(k) = ∑_{k'} F←_{i+1}(k') · ψ_{i+1}(k') · ψ_{i,i+1}(k, k')

(3) Compute the marginal probabilities for all (i, k) by

p(y_i = k | x) ∝ F→_i(k) · ψ_i(k) · F←_i(k)

The overall time complexity is O(nK²).
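The three propagation steps can be sketched as follows; a minimal implementation assuming the factors are given as arrays (the function name is an assumption), with a final normalization per position in place of dividing by p(x):

```python
import numpy as np

def chain_marginals(psi, pairwise):
    """SumProd forward-backward on a chain.
    psi:      (n, K) unary factors psi_i(k), with the prior folded into psi[0]
    pairwise: (K, K) pairwise factors psi_{i-1,i}(k', k)
    Returns (n, K) marginals p(y_i = k | x); O(n K^2) time."""
    n, K = psi.shape
    Ffwd = np.ones((n, K))
    Fbwd = np.ones((n, K))
    for i in range(1, n):            # left-to-right Bellman functions F->
        Ffwd[i] = (Ffwd[i - 1] * psi[i - 1]) @ pairwise
    for i in range(n - 2, -1, -1):   # right-to-left Bellman functions F<-
        Fbwd[i] = pairwise @ (Fbwd[i + 1] * psi[i + 1])
    marg = Ffwd * psi * Fbwd         # F->_i(k) * psi_i(k) * F<-_i(k)
    return marg / marg.sum(axis=1, keepdims=True)
```

For small chains the result can be checked against brute-force enumeration of all K^n state sequences, which is exactly what the O(nK²) recursion avoids.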
Conclusion
Today – inference:
– Most probable state sequence – Dynamic Programming
– Bayesian Decision Theory – MAP, additive losses, Maximum Marginal Decision, Minimum Marginal Square Error
– Fast computation of marginal probabilities

Next time – statistical learning:
– A bit of the general case, the Maximum Likelihood principle etc.
– Maximum Likelihood principle for Markov Chains
– Unsupervised learning, the Expectation-Maximization algorithm
– Expectation-Maximization algorithm for Markov Chains