(1)

Machine Learning II

Inference in Markov Chains

Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug

SS2014, 02.05.2014

(2)

Most probable state sequence

p(x, y) = p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i−1}) · ∏_{i=1}^{n} p(x_i | y_i)

Given x, estimate y. A “reasonable” approach – the most (a-posteriori) probable state sequence:

arg max_y p(y|x) = arg max_y p(x, y) / p(x) = arg max_y p(x, y) =

= arg max_y [ p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i−1}) · ∏_{i=1}^{n} p(x_i | y_i) ] =

= / take logarithm / = arg min_y [ ∑_{i=1}^{n} ψ_i(y_i) + ∑_{i=2}^{n} ψ_{i−1,i}(y_{i−1}, y_i) ]

with

ψ_1(y_1) = −ln p(x_1|y_1) − ln p(y_1), ψ_i(y_i) = −ln p(x_i|y_i) for i > 1, ψ_{i−1,i}(y_{i−1}, y_i) = −ln p(y_i|y_{i−1})

(3)

Most probable state sequence

In contrast to SumProd we have to minimize over all state sequences rather than sum them up.

The “quality” of a sequence is not a product but a sum. The task is the same as SumProd but in another semiring:

SumProd (+, ·) ⇔ MinSum (min, +)

(4)

Dynamic Programming (Viterbi, Dijkstra, ...)

The idea – propagate Bellman functions F_i by

F_i(k) = ψ_i(k) + min_{k′} [ F_{i−1}(k′) + ψ_{i−1,i}(k′, k) ]

The functions F_i represent the quality of the best extension of state k into the already processed part.

(5)

Dynamic Programming (derivation)

min_{y_1} min_{y_2} ... min_{y_n} [ ∑_{i=1}^{n} ψ_i(y_i) + ∑_{i=2}^{n} ψ_{i−1,i}(y_{i−1}, y_i) ] =

= min_{y_2} ... min_{y_n} [ ∑_{i=2}^{n} ψ_i(y_i) + ∑_{i=3}^{n} ψ_{i−1,i}(y_{i−1}, y_i) + min_{y_1} ( ψ_1(y_1) + ψ_{1,2}(y_1, y_2) ) ] =

= min_{y_2} ... min_{y_n} [ ∑_{i=2}^{n} ψ_i(y_i) + ∑_{i=3}^{n} ψ_{i−1,i}(y_{i−1}, y_i) + F(y_2) ] =

= min_{y_2} ... min_{y_n} [ ∑_{i=2}^{n} ψ̃_i(y_i) + ∑_{i=3}^{n} ψ_{i−1,i}(y_{i−1}, y_i) ]

with ψ̃_2(k) = ψ_2(k) + F(k) and ψ̃_i(k) = ψ_i(k) for i = 3 ... n.

(6)

Dynamic Programming (the algorithm)

// Forward pass
for i = 2 to n
    for k = 1 to K
        best = ∞
        for k′ = 1 to K
            if ψ_{i−1}(k′) + ψ_{i−1,i}(k′, k) < best
                best = ψ_{i−1}(k′) + ψ_{i−1,i}(k′, k); pointer_i(k) = k′
        ψ_i(k) = ψ_i(k) + best

// Backward pass
best = ∞
for k = 1 to K
    if ψ_n(k) < best
        best = ψ_n(k); y_n = k
for i = n−1 downto 1
    y_i = pointer_{i+1}(y_{i+1})

pointer_i(k) is the best predecessor of state k at position i.

Time complexity: O(nK²) (due to the forward pass).
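The forward and backward passes above can be sketched in Python. This is a minimal illustration, not the lecture's code: states are 0-based, the ψ tables are passed in as plain nested lists of costs (negative log-probabilities), and the Bellman values are stored separately instead of overwriting ψ in place.

```python
def min_sum_viterbi(psi_unary, psi_pair):
    """Most probable state sequence by MinSum dynamic programming (Viterbi).

    psi_unary: n lists of length K, psi_unary[i][k] = psi_i(k)
    psi_pair:  n-1 K-by-K tables, psi_pair[i][k0][k] = psi_{i,i+1}(k0, k)
    Returns the cost-minimizing state sequence as a list of 0-based states.
    """
    n, K = len(psi_unary), len(psi_unary[0])
    F = [list(psi_unary[0])]          # Bellman functions F_i(k)
    pointer = []                      # pointer[i-1][k]: best predecessor of k
    for i in range(1, n):
        Fi, ptr = [], []
        for k in range(K):
            # find the best predecessor k0 for state k at position i
            best_k0 = min(range(K),
                          key=lambda k0: F[-1][k0] + psi_pair[i - 1][k0][k])
            Fi.append(psi_unary[i][k]
                      + F[-1][best_k0] + psi_pair[i - 1][best_k0][k])
            ptr.append(best_k0)
        F.append(Fi)
        pointer.append(ptr)
    # backward pass: start from the best final state, follow the pointers
    y = [min(range(K), key=lambda k: F[-1][k])]
    for i in range(n - 2, -1, -1):
        y.append(pointer[i][y[-1]])
    return y[::-1]
```

For example, with unary costs favoring the states 0, 1, 0 and pairwise costs that mildly penalize state changes, `min_sum_viterbi([[0, 5], [5, 0], [0, 5]], [[[0, 1], [1, 0]], [[0, 1], [1, 0]]])` returns `[0, 1, 0]` (total cost 2).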

(7)

Inference (Recognition)

The model:

Let two random variables be given:

– The first one is typically discrete (k ∈ K) – “class”

– The second one is arbitrary (x ∈ X) – “observation”

Let the joint probability distribution p(x, k) be “known”.

The recognition task: given x, estimate k. Usual problems (questions):

– How to estimate k from x? → Bayesian Decision Theory

– The joint probability is not always explicitly specified

– The set K is sometimes huge, e.g. the set of all state sequences in Markov Chains

(8)

Idea – a game

Somebody samples a pair (x, k) according to a p.d. p(x, k). He keeps k hidden and presents x to you.

You decide for some k according to a chosen decision strategy.

Somebody penalizes your decision according to a loss function, i.e. he compares your decision to the true hidden k.

You know both p(x, k) and the loss function (how he compares).

Your goal is to design the decision strategy so as to pay as little as possible on average.

(9)

Bayesian Risk

Notations:

The decision set D. Note: it need not coincide with K!

Examples: decisions like “I don’t know”, “not this class”, ...

A decision strategy is a mapping e: X → D. A loss function is a mapping C: D × K → ℝ.

The Bayesian risk of a strategy e is the expected loss:

R(e) = ∑_x ∑_k p(x, k) · C(e(x), k) → min_e

It should be minimized with respect to the decision strategy. Another “writing style”:

d(x) = arg min_d ∑_k p(k|x) · C(d, k)
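The per-observation rule d(x) = arg min_d ∑_k p(k|x) · C(d, k) is a one-liner. Below is a sketch with a made-up posterior and loss; the cheap “reject” option (cost 0.3, a hypothetical number) illustrates the slide's point that the decision set D need not coincide with K.

```python
def bayes_decision(posterior, loss, decisions):
    """Pick d(x) = argmin_d sum_k p(k|x) * C(d, k)."""
    return min(decisions,
               key=lambda d: sum(p * loss(d, k) for k, p in posterior.items()))

# Hypothetical posterior p(k|x) over three classes, and a 0/1 loss extended
# by a "reject" decision that always costs 0.3 (a decision not in K):
posterior = {"a": 0.4, "b": 0.35, "c": 0.25}
loss = lambda d, k: 0.0 if d == k else (0.3 if d == "reject" else 1.0)
decision = bayes_decision(posterior, loss, ["a", "b", "c", "reject"])
# risk("a") = 0.6 while risk("reject") = 0.3, so "reject" is chosen here
```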

(10)

Maximum A-posteriori Decision (MAP)

The loss is the simplest one:

C(k, k′) = 1 if k ≠ k′, 0 otherwise, i.e. C(k, k′) = δ(k ≠ k′)

i.e. we pay 1 if the answer is not the true class, no matter what error we make.

From that it follows:

R(k) = ∑_{k′} p(k′|x) · δ(k ≠ k′) = ∑_{k′} p(k′|x) − p(k|x) = 1 − p(k|x) → min_k

⇔ p(k|x) → max_k

For Markov Chains → Dynamic Programming

(11)

Additive loss-functions – an example

     Q1  Q2  ...  Qn
P1    1   0  ...   1
P2    0   1  ...   0
...  ...  ... ... ...
Pm    0   1  ...   0

P*    ?   ?  ...   ?

Consider a “questionnaire”:

m persons answer n questions.

Furthermore, let us assume that persons are rated – a “reliability”

measure is assigned to each one.

The goal is to find the “right”

answers for all questions.

Strategy 1:

Choose the best person and take all his/her answers.

Strategy 2:

– Consider a particular question

– Look at what all the people say concerning it and do a (weighted) voting

(12)

Additive loss-functions – example interpretation

People are classes k, the reliability measure is the posterior p(k|x). Specialty:

classes consist of “parts” (questions) – classes are structured. The set of classes is k = (k_1, k_2, ..., k_m) ∈ K^m; a class can be seen as a vector of m components, each one being a simple answer (0 or 1 in the above example).

“Strategy 1” is MAP.

How can the other decision strategy be derived (considered, understood) from the viewpoint of Bayesian Decision Theory?

(13)

Additive loss-functions

Consider the simple loss C(k, k′) = δ(k ≠ k′) for the case that classes are structured – it does not reflect how strongly the class and the decision disagree.

A better (?) choice is an additive loss function

C(k, k′) = ∑_i c_i(k_i, k′_i)

i.e. the disagreements of all components are summed up.

Substitute it into the formula for the Bayesian risk, derive, and see what happens ...

Substitute it in the formula for Bayesian Risk, derive and look what happens ...

(14)

Additive loss-functions – derivation

R(k) = ∑_{k′} p(k′|x) · ∑_i c_i(k_i, k′_i) = / swap summations /

= ∑_i ∑_{k′} c_i(k_i, k′_i) · p(k′|x) = / split the summation /

= ∑_i ∑_{l∈K} ∑_{k′: k′_i = l} c_i(k_i, l) · p(k′|x) = / factor out /

= ∑_i ∑_{l∈K} c_i(k_i, l) · ∑_{k′: k′_i = l} p(k′|x) = / these are marginals /

= ∑_i ∑_{l∈K} c_i(k_i, l) · p(k′_i = l | x) → min_k

/ independent problems /

∑_{l∈K} c_i(k_i, l) · p(k′_i = l | x) → min_{k_i}   ∀i

(15)

Additive loss-functions – the strategy

1. Compute the marginal probability distributions

p(k′_i = l | x) = ∑_{k′: k′_i = l} p(k′|x)

for each variable i and each value l.

2. Decide for each variable “independently” according to its marginal p.d. and the local loss c_i:

∑_{l∈K} c_i(k_i, l) · p(k′_i = l | x) → min_{k_i}

This is again a Bayesian decision problem – minimize the average loss.

(16)

Additive loss-functions – a special case

For each variable we pay 1 if we are wrong:

c_i(k_i, k′_i) = δ(k_i ≠ k′_i)

The overall loss is the number of misclassified variables (wrongly answered questions)

C(k, k′) = ∑_i δ(k_i ≠ k′_i)

and is called the Hamming distance.

The decision strategy is the Maximum Marginal Decision:

k_i = arg max_l p(k′_i = l | x)   ∀i
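A small sketch contrasting MAP with the Maximum Marginal Decision. The joint distribution here is a made-up toy example over k = (k_1, k_2) ∈ {0,1}²; it is chosen so the two strategies disagree, which shows they really optimize different losses.

```python
def map_decision(joint):
    """MAP: the single most probable configuration."""
    return max(joint, key=joint.get)

def max_marginal_decision(joint, n, K):
    """Minimize expected Hamming loss: per variable, pick argmax_l p(k_i = l | x)."""
    best = []
    for i in range(n):
        # marginal p(k_i = l | x): sum over all configurations with k_i = l
        marg = [sum(p for k, p in joint.items() if k[i] == l) for l in range(K)]
        best.append(max(range(K), key=lambda l: marg[l]))
    return tuple(best)

# Toy posterior p(k|x); configurations not listed have probability 0.
joint = {(0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.3}
# map_decision(joint)                 -> (0, 1)
# max_marginal_decision(joint, 2, 2)  -> (1, 1)
```

Here p(k_1 = 1 | x) = 0.6 and p(k_2 = 1 | x) = 0.7, so the component-wise maxima give (1, 1) even though the single most probable configuration is (0, 1).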

(17)

Minimum Marginal Square Error (MMSE)

Assume the values l for k_i are numbers (or vectors). Examples:

– in tracking it is the set of all possible positions of the object to be tracked

– in stereo it is the set of all disparity/depth values, etc.

→ a more reasonable (additive) loss should account for the metric difference between the decision and the true position, e.g.

C(k, k′) = ∑_i c_i(k_i, k′_i) = ∑_i ‖k_i − k′_i‖²

The task to be solved for each position i is

∑_{l∈K} ‖k_i − l‖² · p(k′_i = l | x) → min_{k_i}

(18)

Minimum Marginal Square Error (MMSE)

∑_{l∈K} ‖k_i − l‖² · p(k′_i = l | x) → min_{k_i}

Setting the derivative with respect to k_i to zero:

∂/∂k_i : ∑_{l∈K} 2 · (k_i − l) · p(k′_i = l | x) = 0

∑_{l∈K} k_i · p(k′_i = l | x) = ∑_{l∈K} l · p(k′_i = l | x)

k_i = ∑_{l∈K} l · p(k′_i = l | x)

The optimal decision for the i-th variable is the expectation (average) with respect to the corresponding marginal probability distribution.

Note: the decision is not necessarily an element of K, e.g. it may be real-valued → the sets D and K are different.
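The MMSE decision is just the mean of the marginal. A two-line sketch (the marginal distribution below is hypothetical):

```python
def mmse_decision(marginal):
    """Squared-loss optimum: the expectation sum_l l * p(k_i = l | x)."""
    return sum(l * p for l, p in marginal.items())

# Hypothetical marginal with mass 1/2 each on the values 0 and 3:
# the optimal decision 1.5 is not an element of K = {0, 3}.
d = mmse_decision({0: 0.5, 3: 0.5})
```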

(19)

Fast computation of marginal probabilities

Back to Markov Chains ...

Usually we are interested in all marginal probabilities at once, i.e. for all i and all k (nK values in total).

Naive variant: for each pair (i, k)

– define the corresponding task by introducing an “additional” unary function ψ̄_i(k′) = δ(k′ = k)

– use the SumProd algorithm to solve it

→ time complexity O(nK² · nK) :-(

(20)

Fast computation of marginal probabilities

The key to the faster computation is the Markovian property (conditional independence).

Let z represent partial sequences (y_1 ... y_{i−1}) and let

ψ(z) = ∏_{j=1}^{i−1} ψ_j(y_j) · ∏_{j=2}^{i−1} ψ_{j−1,j}(y_{j−1}, y_j),   ψ(z, k) = ψ_{i−1,i}(y_{i−1}, k)

Analogously, let z′ represent partial sequences (y_{i+1} ... y_n), with ψ(k, z′) and ψ(z′) defined analogously.

(21)

Fast computation of marginal probabilities

The SumProd task to be solved for (i, k) can be rewritten as

∑_{z,z′} ψ(z) · ψ(z, k) · ψ_i(k) · ψ(k, z′) · ψ(z′) =

= [ ∑_z ψ(z) · ψ(z, k) ] · ψ_i(k) · [ ∑_{z′} ψ(k, z′) · ψ(z′) ] =

= F_i^→(k) · ψ_i(k) · F_i^←(k)

However, the F_i^→(k) are just the Bellman functions used in SumProd, i.e. they can all be computed in O(nK²) time. The same holds for the F_i^←(k), but computed from right to left.

(22)

Fast computation of marginal probabilities

Algorithm summary:

(1) Compute F_i^→(k) for all (i, k) by propagating

F_i^→(k) = ∑_{k′} F_{i−1}^→(k′) · ψ_{i−1}(k′) · ψ_{i−1,i}(k′, k)

(2) Compute F_i^←(k) for all (i, k) by propagating

F_i^←(k) = ∑_{k′} F_{i+1}^←(k′) · ψ_{i+1}(k′) · ψ_{i,i+1}(k, k′)

(3) Compute the marginal probabilities for all (i, k) by

p(y_i = k | ...) ∝ F_i^→(k) · ψ_i(k) · F_i^←(k)

The overall time complexity is O(nK²).
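Steps (1)–(3) can be sketched in Python. This is a minimal illustration with unnormalized ψ tables passed in as nested lists; the proportionality in step (3) is resolved by normalizing each position's weights.

```python
def sum_prod_marginals(psi_unary, psi_pair):
    """All nK marginals via forward (F) and backward (B) messages, O(n K^2).

    psi_unary: n lists of length K with nonnegative factors psi_i(k)
    psi_pair:  n-1 K-by-K tables psi_pair[i][k0][k] = psi_{i,i+1}(k0, k)
    Returns a list of n normalized marginal distributions over K states.
    """
    n, K = len(psi_unary), len(psi_unary[0])
    F = [[1.0] * K for _ in range(n)]   # F[i][k]: sum over left partial seqs
    B = [[1.0] * K for _ in range(n)]   # B[i][k]: sum over right partial seqs
    for i in range(1, n):               # step (1): left-to-right
        for k in range(K):
            F[i][k] = sum(F[i-1][k0] * psi_unary[i-1][k0] * psi_pair[i-1][k0][k]
                          for k0 in range(K))
    for i in range(n - 2, -1, -1):      # step (2): right-to-left
        for k in range(K):
            B[i][k] = sum(B[i+1][k0] * psi_unary[i+1][k0] * psi_pair[i][k][k0]
                          for k0 in range(K))
    marginals = []                      # step (3): combine and normalize
    for i in range(n):
        w = [F[i][k] * psi_unary[i][k] * B[i][k] for k in range(K)]
        Z = sum(w)
        marginals.append([v / Z for v in w])
    return marginals
```

As a sanity check: for a 2-position chain with unary factors [2, 1] at the first position, uniform factors elsewhere, the marginal at position 1 is [2/3, 1/3], matching brute-force enumeration of the four configurations.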

(23)

Conclusion

Today – inference:

– Most probable state sequence – Dynamic Programming

– Bayesian Decision Theory – MAP, additive losses, Maximum Marginal Decision, Minimum Marginal Square Error

– Fast computation of marginal probabilities

Next time – statistical learning:

– A bit of the general case, the Maximum Likelihood principle, etc.

– The Maximum Likelihood principle for Markov Chains

– Unsupervised learning, the Expectation-Maximization algorithm

– The Expectation-Maximization algorithm for Markov Chains
