Machine Learning II
Inference in Markov Chains
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 02.05.2014
Most probable state sequence
p(x, y) = p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i-1}) · ∏_{i=1}^{n} p(x_i | y_i)

Given an x, estimate y. A "reasonable" approach – the most (a-posteriori) probable state sequence:
arg max_y p(y|x) = arg max_y p(x, y) / p(x) = arg max_y p(x, y) =

= arg max_y [ p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i-1}) · ∏_{i=1}^{n} p(x_i | y_i) ]

= / take the negative logarithm /

= arg min_y [ ∑_{i=1}^{n} ψ_i(y_i) + ∑_{i=2}^{n} ψ_{i-1,i}(y_{i-1}, y_i) ]

with

ψ_1(y_1) = −ln p(x_1|y_1) − ln p(y_1), ψ_i(y_i) = −ln p(x_i|y_i) for i = 2 … n, ψ_{i-1,i}(y_{i-1}, y_i) = −ln p(y_i|y_{i-1})
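To make the ψ notation concrete, here is a minimal sketch that evaluates this MinSum energy for a toy chain; all numbers are made up for illustration, and the function name `energy` is an assumption, not from the slides:

```python
import numpy as np

# Hypothetical toy chain: 2 states, 3 observation symbols (all numbers made up).
log_prior = np.log(np.array([0.6, 0.4]))        # ln p(y_1)
log_trans = np.log(np.array([[0.7, 0.3],        # ln p(y_i | y_{i-1})
                             [0.2, 0.8]]))
log_emit = np.log(np.array([[0.5, 0.4, 0.1],    # ln p(x_i | y_i)
                            [0.1, 0.3, 0.6]]))

def energy(x, y):
    """Sum of psi_i(y_i) + psi_{i-1,i}(y_{i-1}, y_i) = -ln p(x, y);
    minimizing it over y maximizes the joint probability."""
    e = -log_prior[y[0]] - log_emit[y[0], x[0]]  # psi_1 absorbs the prior
    for i in range(1, len(x)):
        e += -log_trans[y[i - 1], y[i]] - log_emit[y[i], x[i]]
    return e

x = [0, 2, 1]
print(energy(x, [0, 1, 1]), energy(x, [1, 0, 0]))  # the first sequence has lower energy
```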
Most probable state sequence
In contrast to SumProd we have to minimize over all state sequences rather than to sum them up.
The "quality" of a sequence is not a product but a sum. The same algorithm as SumProd, but in another semiring:

SumProd ⇔ MinSum

Dynamic Programming (Viterbi, Dijkstra ...)

The idea – propagate Bellman functions F_i by

F_i(k) = ψ_i(k) + min_{k'} [ F_{i-1}(k') + ψ_{i-1,i}(k', k) ]

The functions F_i represent the quality of the best extension into the already processed part.
Dynamic Programming (derivation)
min_{y_1} min_{y_2} … min_{y_n} [ ∑_{i=1}^{n} ψ_i(y_i) + ∑_{i=2}^{n} ψ_{i-1,i}(y_{i-1}, y_i) ] =

= min_{y_2} … min_{y_n} [ ∑_{i=2}^{n} ψ_i(y_i) + ∑_{i=3}^{n} ψ_{i-1,i}(y_{i-1}, y_i) + min_{y_1} [ ψ_1(y_1) + ψ_{1,2}(y_1, y_2) ] ] =

= min_{y_2} … min_{y_n} [ ∑_{i=2}^{n} ψ_i(y_i) + ∑_{i=3}^{n} ψ_{i-1,i}(y_{i-1}, y_i) + F(y_2) ] =

= min_{y_2} … min_{y_n} [ ∑_{i=2}^{n} ψ̃_i(y_i) + ∑_{i=3}^{n} ψ_{i-1,i}(y_{i-1}, y_i) ]

with ψ̃_2(k) = ψ_2(k) + F(k), ψ̃_i(k) = ψ_i(k) for i = 3 … n.
Dynamic Programming (the algorithm)
// Forward pass
for i = 2 to n
    for k = 1 to K
        best = ∞
        for k' = 1 to K
            if ψ_{i-1}(k') + ψ_{i-1,i}(k', k) < best
                best = ψ_{i-1}(k') + ψ_{i-1,i}(k', k), pointer_i(k) = k'
        ψ_i(k) = ψ_i(k) + best

// Backward pass
best = ∞
for k = 1 to K
    if ψ_n(k) < best
        best = ψ_n(k), y_n = k
for i = n−1 down to 1
    y_i = pointer_{i+1}(y_{i+1})

pointer_i(k) is the best predecessor for k.
Time complexity: O(nK^2) (due to the forward pass).
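The forward/backward passes above can be sketched in vectorized form; a minimal, illustrative implementation (the function name and the toy costs are assumptions, not from the slides):

```python
import numpy as np

def viterbi_minsum(psi, pairwise):
    """MinSum dynamic programming on a chain.
    psi:      (n, K) unary costs psi_i(k), with the prior folded into psi[0]
    pairwise: (K, K) pairwise costs psi_{i-1,i}(k', k)
    Returns the minimizing state sequence; O(n K^2) time."""
    n, K = psi.shape
    F = psi[0].copy()                       # Bellman function F_1(k)
    pointer = np.zeros((n, K), dtype=int)   # pointer_i(k): best predecessor of k
    for i in range(1, n):                   # forward pass
        cand = F[:, None] + pairwise        # cand[k', k] = F_{i-1}(k') + psi_{i-1,i}(k', k)
        pointer[i] = np.argmin(cand, axis=0)
        F = psi[i] + cand.min(axis=0)       # F_i(k) = psi_i(k) + min_{k'} [...]
    y = np.zeros(n, dtype=int)              # backward pass: follow the pointers
    y[-1] = int(np.argmin(F))
    for i in range(n - 2, -1, -1):
        y[i] = pointer[i + 1, y[i + 1]]
    return y

# Toy costs (made up): the unaries prefer the sequence 0,1,0; a weak switching
# penalty keeps it, a strong one smooths the result to 0,0,0.
psi = np.array([[0., 5.], [5., 0.], [0., 5.]])
print(viterbi_minsum(psi, np.array([[0., 1.], [1., 0.]])))    # [0 1 0]
print(viterbi_minsum(psi, np.array([[0., 10.], [10., 0.]])))  # [0 0 0]
```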
Inference (Recognition)
The model:
Let two random variables be given:
– The first one is typically discrete (k ∈ K) – "class"
– The second one is arbitrary (x ∈ X) – "observation"
Let the joint probability distribution p(x, k) be "known".
The recognition task: given x, estimate k. Usual problems (questions):
– How to estimate k from x?
→ Bayesian Decision Theory
– The joint probability is not always explicitly specified
– The set K is sometimes huge, e.g. the set of all state sequences in Markov Chains
Idea – a game
Somebody samples a pair (x, k) according to a p.d. p(x, k). He keeps k hidden and presents x to you.
You decide for some k* according to a chosen decision strategy.
Somebody penalizes your decision according to a loss-function, i.e. he compares your decision to the true hidden k.
You know both p(x, k) and the loss-function (how he compares).
Your goal is to design the decision strategy so as to pay as little as possible on average.
Bayesian Risk
Notations:
The decision set D. Note: it need not coincide with K !!!
Examples: decisions like "I don't know", "not this class" ...
A decision strategy is a mapping e : X → D. The loss-function is C : D × K → R.
The Bayesian Risk of a strategy e is the expected loss:

R(e) = ∑_x ∑_k p(x, k) · C(e(x), k) → min_e

It should be minimized with respect to the decision strategy. Another "writing style":

d*(x) = arg min_d ∑_k p(k|x) · C(d, k)
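A minimal sketch of this decision rule, with a made-up posterior and a loss matrix that includes an extra "I don't know" row to illustrate that D and K may differ:

```python
import numpy as np

def bayes_decision(posterior, loss):
    """d* = argmin_d sum_k p(k|x) * C(d, k).
    posterior: (K,) posterior p(k|x); loss: (D, K) loss matrix C(d, k).
    The rows (decisions) need not correspond to the classes."""
    risks = loss @ posterior              # expected loss of every decision d
    return int(np.argmin(risks))

# Illustrative numbers: 3 classes, 4 decisions; the last row is a
# hypothetical "I don't know" option with a constant moderate loss.
loss = np.array([[0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0],
                 [0.5, 0.5, 0.5]])
print(bayes_decision(np.array([0.4, 0.35, 0.25]), loss))  # 3: rejecting is cheapest
print(bayes_decision(np.array([0.9, 0.05, 0.05]), loss))  # 0: confident enough to commit
```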
Maximum A-posteriori Decision (MAP)
The loss is the simplest one:

C(k, k') = 1 if k ≠ k', 0 otherwise = δ(k ≠ k')

i.e. we pay 1 if the answer is not the true class, no matter what error we make.
From that follows:

R(k) = ∑_{k'} p(k'|x) · δ(k ≠ k') = ∑_{k'} p(k'|x) − p(k|x) = 1 − p(k|x) → min_k

⇔ p(k|x) → max_k
For Markov Chains → Dynamic Programming
Additive loss-functions – an example
Consider a "questionnaire": m persons answer n questions.

        Q1   Q2   ...   Qn
  P1     1    0   ...    1
  P2     0    1   ...    0
  ...   ...  ...  ...   ...
  Pm     0    1   ...    0
  "P"    ?    ?   ...    ?

Furthermore, let us assume that persons are rated – a "reliability" measure is assigned to each one.
The goal is to find the "right" answers for all questions.

Strategy 1:
Choose the best person and take all his/her answers.

Strategy 2:
– Consider a particular question
– Look what all the people say concerning it, do (weighted) voting
Additive loss-functions – example interpretation
People are classes k, the reliability measure is the posterior p(k|x). Specialty: classes consist of "parts" (questions) – classes are structured.
The set of classes is K^m; a class k = (k_1, k_2 … k_m) ∈ K^m can be seen as a vector of m components, each one being a simple answer (0 or 1 in the above example).
"Strategy 1" is MAP.
How to derive (consider, understand) the other decision strategy from the viewpoint of Bayesian Decision Theory?
Additive loss-functions
Consider the simple loss C(k, k') = δ(k ≠ k') for the case where classes are structured – it does not reflect how strongly the class and the decision disagree.
A better (?) choice – an additive loss-function:

C(k, k') = ∑_i c_i(k_i, k'_i)

i.e. the disagreements of all components are summed up.
Substitute it into the formula for the Bayesian Risk, derive, and see what happens ...
Additive loss-functions – derivation
R(k) = ∑_{k'} p(k'|x) · ∑_i c_i(k_i, k'_i) = / swap summations /

= ∑_i ∑_{k'} c_i(k_i, k'_i) · p(k'|x) = / split the summation /

= ∑_i ∑_{l ∈ K} ∑_{k': k'_i = l} c_i(k_i, l) · p(k'|x) = / factor out /

= ∑_i ∑_{l ∈ K} c_i(k_i, l) · ∑_{k': k'_i = l} p(k'|x) = / these are marginals /

= ∑_i ∑_{l ∈ K} c_i(k_i, l) · p(k'_i = l | x) → min_k

/ independent problems /

⇒ ∑_{l ∈ K} c_i(k_i, l) · p(k'_i = l | x) → min_{k_i}  ∀i
Additive loss-functions – the strategy
1. Compute the marginal probability distributions

p(k'_i = l | x) = ∑_{k': k'_i = l} p(k'|x)

for each variable i and each value l.

2. Decide for each variable "independently" according to its marginal p.d. and the local loss c_i:

∑_{l ∈ K} c_i(k_i, l) · p(k'_i = l | x) → min_{k_i}
This is again a Bayesian Decision Problem – minimize the average loss
Additive loss-functions – a special case
For each variable we pay 1 if we are wrong:

c_i(k_i, k'_i) = δ(k_i ≠ k'_i)

The overall loss is the number of misclassified variables (wrongly answered questions)

C(k, k') = ∑_i δ(k_i ≠ k'_i)

and is called the Hamming distance.
The decision strategy is the Maximum Marginal Decision:

k_i* = arg max_l p(k'_i = l | x)  ∀i
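A tiny made-up joint posterior illustrates that this Hamming-optimal decision can differ from MAP:

```python
import numpy as np

# Made-up joint posterior p(k|x) over k = (k1, k2) with binary components.
joint = np.array([[0.4, 0.0],    # p(k1=0,k2=0), p(k1=0,k2=1)
                  [0.3, 0.3]])   # p(k1=1,k2=0), p(k1=1,k2=1)

# MAP: the single most probable configuration.
map_dec = np.unravel_index(np.argmax(joint), joint.shape)

# Maximum Marginal Decision: argmax of each marginal independently.
marg1 = joint.sum(axis=1)        # p(k1|x) = [0.4, 0.6]
marg2 = joint.sum(axis=0)        # p(k2|x) = [0.7, 0.3]
mm_dec = (int(np.argmax(marg1)), int(np.argmax(marg2)))

print(tuple(map(int, map_dec)), mm_dec)  # (0, 0) vs (1, 0)
```

Here (1, 0) has joint probability 0.3 < 0.4, yet its expected Hamming loss is 0.7 versus 0.9 for the MAP answer (0, 0).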
Minimum Marginal Square Error (MMSE)
Assume the values l for k_i are numbers (vectors). Examples:
– in tracking it is the set of all possible positions of the object to be tracked
– in stereo it is the set of all disparity/depth values etc.
→ a more reasonable (additive) loss should account for the metric difference between the decision and the true position, e.g.

C(k, k') = ∑_i c_i(k_i, k'_i) = ∑_i ‖k_i − k'_i‖²

The task to be solved for each position i is

∑_{l ∈ K} ‖k_i − l‖² · p(k'_i = l | x) → min_{k_i}
Minimum Marginal Square Error (MMSE)
∑_{l ∈ K} ‖k_i − l‖² · p(k'_i = l | x) → min_{k_i}

Setting the derivative to zero:

∂/∂k_i = ∑_{l ∈ K} 2 · (k_i − l) · p(k'_i = l | x) = 0

∑_{l ∈ K} k_i · p(k'_i = l | x) = ∑_{l ∈ K} l · p(k'_i = l | x)

k_i = ∑_{l ∈ K} l · p(k'_i = l | x)

The optimal decision for the i-th variable is the expectation (average) under the corresponding marginal probability distribution.
Note: the decision is not necessarily an element of K, e.g. it may be real-valued → the sets D and K are different.
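A one-variable sketch with made-up numbers, showing the MMSE decision is the posterior mean and need not lie in K:

```python
import numpy as np

# One variable with made-up candidate values (e.g. disparities) and marginal.
values = np.array([0.0, 1.0, 2.0, 3.0])      # possible labels l in K
marginal = np.array([0.1, 0.2, 0.5, 0.2])    # p(k_i' = l | x)

k_mmse = float(np.sum(values * marginal))        # posterior mean: 1.8, not in K
k_maxmarg = float(values[np.argmax(marginal)])   # max-marginal decision: 2.0
print(k_mmse, k_maxmarg)
```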
Fast computation of marginal probabilities
Back to Markov Chains ...
Usually we are interested in all marginal probabilities at once, i.e. for all i* and all k* (nK values in total).
Naive variant: for each i* and k*
– define the corresponding task by introducing an "additional" ψ_{i*}(k) = δ(k = k*)
– use the SumProd algorithm to solve it
→ time complexity O(nK² · nK) :-(
Fast computation of marginal probabilities
The key to the faster computation is the Markovian property (conditional independence)
Let z represent partial sequences (y_1 … y_{i*−1}), and let

ψ(z) = ∏_{i=1}^{i*−1} ψ_i(y_i) · ∏_{i=2}^{i*−1} ψ_{i-1,i}(y_{i-1}, y_i),  ψ(z, k*) = ψ_{i*−1,i*}(y_{i*−1}, k*)

Analogously, let z' represent partial sequences (y_{i*+1} … y_n), with ψ(k*, z') and ψ(z') defined analogously as well ...
Fast computation of marginal probabilities
The SumProd task to be solved for (i*, k*) can be rewritten as

∑_{z, z'} ψ(z) · ψ(z, k*) · ψ_{i*}(k*) · ψ(k*, z') · ψ(z') =

= [ ∑_z ψ(z) · ψ(z, k*) ] · ψ_{i*}(k*) · [ ∑_{z'} ψ(k*, z') · ψ(z') ] =

= F→_{i*}(k*) · ψ_{i*}(k*) · F←_{i*}(k*)

However, the F→_{i*}(k*) are just the Bellman functions used in SumProd, i.e. they can all be computed in O(nK²) time. The same holds for F←_{i*}(k*), but from right to left.
Fast computation of marginal probabilities
Algorithm summary:
(1) Compute F→_i(k) for all (i, k) by propagating

F→_i(k) = ∑_{k'} F→_{i-1}(k') · ψ_{i-1}(k') · ψ_{i-1,i}(k', k)

(2) Compute F←_i(k) for all (i, k) by propagating

F←_i(k) = ∑_{k'} F←_{i+1}(k') · ψ_{i+1}(k') · ψ_{i,i+1}(k, k')

(3) Compute the marginal probabilities for all (i, k) by

p(y_i = k | x) ∝ F→_i(k) · ψ_i(k) · F←_i(k)

The overall time complexity is O(nK²).
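The three propagation steps can be sketched as follows; a minimal implementation assuming the factors are given as arrays (the function name is an assumption), with a final normalization per position in place of dividing by p(x):

```python
import numpy as np

def chain_marginals(psi, pairwise):
    """SumProd forward-backward on a chain.
    psi:      (n, K) unary factors psi_i(k), with the prior folded into psi[0]
    pairwise: (K, K) pairwise factors psi_{i-1,i}(k', k)
    Returns (n, K) marginals p(y_i = k | x); O(n K^2) time."""
    n, K = psi.shape
    Ffwd = np.ones((n, K))
    Fbwd = np.ones((n, K))
    for i in range(1, n):            # left-to-right Bellman functions F->
        Ffwd[i] = (Ffwd[i - 1] * psi[i - 1]) @ pairwise
    for i in range(n - 2, -1, -1):   # right-to-left Bellman functions F<-
        Fbwd[i] = pairwise @ (Fbwd[i + 1] * psi[i + 1])
    marg = Ffwd * psi * Fbwd         # F->_i(k) * psi_i(k) * F<-_i(k)
    return marg / marg.sum(axis=1, keepdims=True)
```

For small chains the result can be checked against brute-force enumeration of all K^n state sequences, which is exactly what the O(nK²) recursion avoids.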
Conclusion
Today – inference:
– Most probable state sequence – Dynamic Programming
– Bayesian Decision Theory – MAP, additive losses, Maximum Marginal Decision, Minimum Marginal Square Error
– Fast computation of marginal probabilities

Next time – statistical learning:
– A bit of the general case, the Maximum Likelihood principle etc.
– Maximum Likelihood principle for Markov Chains
– Unsupervised learning, the Expectation-Maximization algorithm
– Expectation-Maximization algorithm for Markov Chains