
(1)

Machine Learning II
Markov Chains

Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug

SS2014, 25.04.2014

(2)

Application: Speech Recognition

(3)

Markov Chains – the probabilistic model

Let $I = \{1, 2, \dots, n\}$ be the set of "time points".

There is a random variable $y_i \in K$ for each $i \in I$, where $K$ is a discrete domain (state set, label set, phoneme set etc.).

$y = (y_1, y_2, \dots, y_n)$ with $y_i \in K$ is a state sequence. The set of all such sequences is $Y = K^n$ – the sample space (the set of elementary events). $p(y) = p(y_1, y_2, \dots, y_n)$ are their probabilities.

(4)

Markov Chains – the probabilistic model

In general: $p(y_1 \dots y_n) = p(y_1 \dots y_{n-1}) \cdot p(y_n \mid y_1 \dots y_{n-1})$

The Markovian property (simplified): $p(y_n \mid y_1 \dots y_{n-1}) = p(y_n \mid y_{n-1})$

It follows:
$$p(y) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1})$$

The model parameters are:

– the starting probability distribution $p(y_1)$

– the transition probability distributions $p(y_i \mid y_{i-1})$
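
As a quick illustration (not part of the original slides), a minimal Python sketch that evaluates this factorization for a toy chain; the numbers in `p1` and `trans` are made up.

```python
import numpy as np

# Toy parameters (illustrative values only):
# p1[k] = p(y1 = k), trans[j, k] = p(y_i = k | y_{i-1} = j).
p1 = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])

def chain_prob(y):
    """p(y) = p(y1) * prod_{i=2..n} p(y_i | y_{i-1})."""
    p = p1[y[0]]
    for prev, cur in zip(y, y[1:]):
        p *= trans[prev, cur]
    return p

print(chain_prob([0, 0, 1]))  # 0.6 * 0.7 * 0.3 = 0.126
```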

(5)

Other parameterizations

An example:
$$p(y) = p(y_1, y_2, y_3) = p(y_1)\, p(y_2 \mid y_1)\, p(y_3 \mid y_2) = p(y_1)\, \frac{p(y_1 \mid y_2)\, p(y_2)}{p(y_1)}\, p(y_3 \mid y_2) = p(y_1 \mid y_2)\, p(y_2)\, p(y_3 \mid y_2)$$

A bit more general, with an arbitrary pivot position $i^*$:
$$p(y) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1}) = \prod_{i=2}^{i^*} p(y_{i-1} \mid y_i) \cdot p(y_{i^*}) \cdot \prod_{i=i^*+1}^{n} p(y_i \mid y_{i-1})$$

⇒ the "start" can be anywhere

(6)

Other parameterizations

Another way:
$$p(y) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1}) = p(y_1) \prod_{i=2}^{n} \frac{p(y_i, y_{i-1})}{p(y_{i-1})} = \prod_{i=2}^{n} p(y_i, y_{i-1}) \cdot \prod_{i=2}^{n-1} p(y_i)^{-1}$$

⇒ There is no "start" or "finish" at all – the parameterization is symmetric.
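
A small numeric check of this symmetric parameterization (an illustrative sketch with assumed toy numbers, not from the slides): compute the unary and pairwise marginals of a directed toy chain by brute force and verify that the product formula reproduces $p(y)$.

```python
import itertools
import numpy as np

# Toy chain (n = 3, K = 2) in the directed parameterization.
p1 = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])

def p_chain(y):
    p = p1[y[0]]
    for a, b in zip(y, y[1:]):
        p *= trans[a, b]
    return p

# Unary marginals p(y_i) and pairwise marginals p(y_{i-1}, y_i) by brute force.
n, K = 3, 2
unary = np.zeros((n, K))
pair = np.zeros((n - 1, K, K))
for y in itertools.product(range(K), repeat=n):
    p = p_chain(y)
    for i in range(n):
        unary[i, y[i]] += p
    for i in range(1, n):
        pair[i - 1, y[i - 1], y[i]] += p

# Symmetric form: product of pairwise marginals divided by the
# marginals of the inner positions i = 2 .. n-1 (0-based: 1 .. n-2).
y = (0, 1, 1)
num = np.prod([pair[i - 1, y[i - 1], y[i]] for i in range(1, n)])
den = np.prod([unary[i, y[i]] for i in range(1, n - 1)])
print(p_chain(y), num / den)  # identical values
```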

(7)

Other parameterizations

And even more:
$$p(y) = \frac{1}{Z} \prod_{i=1}^{n} \psi_i(y_i) \prod_{i=2}^{n} \psi_{i-1,i}(y_{i-1}, y_i)$$

with the Partition Function:
$$Z = \sum_{y \in Y} \left[ \prod_{i=1}^{n} \psi_i(y_i) \prod_{i=2}^{n} \psi_{i-1,i}(y_{i-1}, y_i) \right]$$

The parameters $\psi_i : K \to \mathbb{R}_+$ and $\psi_{i-1,i} : K \times K \to \mathbb{R}_+$ are arbitrary (but non-negative) functions that do not (explicitly) represent any probabilities.
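
To make the role of $Z$ concrete, a sketch that evaluates the partition function by brute force for a tiny chain; the arrays `psi` and `psi_pair` are random stand-ins for $\psi_i$ and $\psi_{i-1,i}$ (assumed values, not from the slides).

```python
import itertools
import numpy as np

# Arbitrary non-negative potentials for n = 3, K = 2:
# psi[i, k] = psi_i(k), psi_pair[i-1, j, k] = psi_{i-1,i}(j, k).
n, K = 3, 2
rng = np.random.default_rng(0)
psi = rng.uniform(0.1, 1.0, size=(n, K))
psi_pair = rng.uniform(0.1, 1.0, size=(n - 1, K, K))

# Partition function by brute force: sum over all K^n sequences.
Z = 0.0
for y in itertools.product(range(K), repeat=n):
    w = np.prod([psi[i, y[i]] for i in range(n)])
    w *= np.prod([psi_pair[i - 1, y[i - 1], y[i]] for i in range(1, n)])
    Z += w
print(Z)
```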

(8)

Hidden Markov Chains (HMM)

There are two "kinds" of variables (both are sequences):

$y = (y_1, y_2, \dots, y_n)$, $y_i \in K$ (hidden variables) and $x = (x_1, x_2, \dots, x_n)$, $x_i \in F$ (observed variables).

A pair $(x, y)$ is an elementary event.

Two representations for $p(x, y)$: "fence" and "comb" (Mealy and Moore automata).

(9)

Hidden Markov Chains, fence

$$p(x, y) = p(y_1) \prod_{i=2}^{n} p(x_i, y_i \mid y_{i-1})$$

Observations $x_i$ depend explicitly on the hidden variables at the $i$-th and $(i{-}1)$-th time points ($x_1$ does not exist).

$$p(x, y) = p(y_1) \prod_{i=2}^{n} p(x_i, y_i \mid y_{i-1}) = p(y_1) \prod_{i=2}^{n} \Big[ p(y_i \mid y_{i-1})\, p(x_i \mid y_i, y_{i-1}) \Big] = p(y) \cdot p(x \mid y)$$

where
$$p(y) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1}), \qquad p(x \mid y) = \prod_{i=2}^{n} p(x_i \mid y_i, y_{i-1})$$

(10)

Hidden Markov Chains, comb

$$p(x, y) = p(y) \cdot p(x \mid y) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1}) \prod_{i=1}^{n} p(x_i \mid y_i)$$

Observations $x_i$ depend explicitly only on the corresponding hidden variable at the $i$-th time point.

First, it is a special case of the "fence" if $p(x_i \mid y_i, y_{i-1}) \Rightarrow p(x_i \mid y_i)$. On the other side ...
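
A minimal sketch of evaluating $p(x, y)$ in the comb model, assuming toy tables for $p(y_1)$, $p(y_i \mid y_{i-1})$ and $p(x_i \mid y_i)$ (the values are illustrative only):

```python
import numpy as np

# Comb model: p(x, y) = p(y1) * prod p(y_i|y_{i-1}) * prod p(x_i|y_i).
# Toy parameters: K = 2 hidden states, |F| = 2 observation symbols.
p1 = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],   # emit[k, f] = p(x_i = f | y_i = k)
                 [0.3, 0.7]])

def joint_prob(x, y):
    p = p1[y[0]] * emit[y[0], x[0]]
    for i in range(1, len(y)):
        p *= trans[y[i - 1], y[i]] * emit[y[i], x[i]]
    return p

print(joint_prob(x=[0, 1, 1], y=[0, 0, 1]))
```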

(11)

Hidden Markov Chains, fence vs. comb

It is always possible to reduce a "fence" representation to the "comb" one.

(Figure: an example for 3 variables and 2 states.)

In general: introduce new variables $\tilde y_1, \tilde y_2, \dots, \tilde y_{n-1}$ that correspond to the pairs $(y_1, y_2), (y_2, y_3), \dots, (y_{n-1}, y_n)$ – the new label set $\tilde K = K \times K$ represents state pairs from the old model.

⇒ Each observation $\tilde x_i$ depends explicitly only on the corresponding state $\tilde y_i$

⇒ Both representations (fence and comb) are equivalent.

(12)

Conditional independence

The conditional probability distribution $p(x \mid y)$ is conditionally independent (in both the "fence" and the "comb" model), because it can be written as a product of probabilities.

Do not confuse independence with conditional independence.

For instance, some $x_i$ and $x_j$ are conditionally independent given an arbitrary $y$, but they are of course dependent through the hidden part, i.e. $p(x_i, x_j) \neq p(x_i) \cdot p(x_j)$.

Note: summation over $x$ gives 1 for any $y$:
$$\sum_{x} p(x \mid y) = \sum_{x} \prod_i p(x_i \mid y_i) = \prod_i \sum_{k} p(k \mid y_i) = 1$$
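
This identity is easy to verify numerically; the sketch below sums $p(x \mid y)$ over all observation sequences for a fixed $y$, using an assumed toy emission table.

```python
import itertools
import numpy as np

# Check: sum over all observation sequences x of p(x|y) equals 1 for any y.
emit = np.array([[0.9, 0.1],
                 [0.3, 0.7]])  # toy p(x_i = f | y_i = k)
y = [0, 1, 1]                  # an arbitrary fixed hidden sequence
total = sum(
    np.prod([emit[yi, xi] for yi, xi in zip(y, x)])
    for x in itertools.product(range(2), repeat=len(y))
)
print(total)  # 1.0 (up to floating point)
```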

(13)

The probability of observation

$$p(x) = \sum_{y} p(x, y) = \sum_{y} \left[ p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1}) \prod_{i=1}^{n} p(x_i \mid y_i) \right]$$

– the so-called SumProd problem.

For instance, it can be used to recognize the model:

let two Markov chains be given (specified by their parameters $p(y_1), p(y_i \mid y_{i-1})$ and $p'(y_1), p'(y_i \mid y_{i-1})$ respectively).

For a particular observation $x$, decide which model has generated it.
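
A brute-force sketch of this recognition task under assumed toy parameters; that the two chains share one emission table is an assumption made for the illustration.

```python
import itertools
import numpy as np

def observation_prob(x, p1, trans, emit):
    """Brute-force p(x) = sum_y p(x, y) for a comb HMM.
    Exponential in n - for illustration only; the SumProd
    algorithm below computes the same value in O(n K^2)."""
    K, n = len(p1), len(x)
    total = 0.0
    for y in itertools.product(range(K), repeat=n):
        p = p1[y[0]] * emit[y[0], x[0]]
        for i in range(1, n):
            p *= trans[y[i - 1], y[i]] * emit[y[i], x[i]]
        total += p
    return total

# Two toy chains p and p' sharing one emission table; equal model priors.
emit = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
p1_a, trans_a = np.array([0.6, 0.4]), np.array([[0.7, 0.3], [0.2, 0.8]])
p1_b, trans_b = np.array([0.5, 0.5]), np.array([[0.5, 0.5], [0.5, 0.5]])

x = [0, 0, 1, 0]
pa = observation_prob(x, p1_a, trans_a, emit)
pb = observation_prob(x, p1_b, trans_b, emit)
print("chain p" if pa > pb else "chain p'", pa, pb)
```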

(14)

The probability of observation

$$p(x) = \sum_{y} p(x, y) = \sum_{y} \left[ p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1}) \prod_{i=1}^{n} p(x_i \mid y_i) \right]$$

(15)

Prior marginal probabilities

Suppose we are interested in the (prior) probability that at a given position $i$ the model generates a particular label $k$. Note: this is a composite event whose probability is a sum over all corresponding elementary ones:

$$p(y_i = k) = \sum_{y:\, y_i = k} p(y) = \sum_{y:\, y_i = k} \left[ p(y_1) \prod_{i'=2}^{n} p(y_{i'} \mid y_{i'-1}) \right]$$

– a SumProd problem again.
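
The same brute-force idea, restricted to the sequences with $y_i = k$ (toy parameters as before, for illustration only):

```python
import itertools
import numpy as np

# Brute-force prior marginal p(y_i = k) for a toy chain (n = 4, K = 2).
p1 = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
n, i, k = 4, 2, 1  # probability that position i (0-based) carries label k
total = 0.0
for y in itertools.product(range(2), repeat=n):
    if y[i] != k:
        continue  # sum only over sequences consistent with y_i = k
    p = p1[y[0]]
    for prev, cur in zip(y, y[1:]):
        p *= trans[prev, cur]
    total += p
print(total)
```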

(16)

Posterior marginal probabilities

For a given observation $x$, calculate the probability that at a given position $i$ the model generates a particular label $k$:

$$p(y_i = k \mid x) \propto p(x, y_i = k) = \sum_{y:\, y_i = k} p(x, y) = \sum_{y:\, y_i = k} \left[ p(y_1) \prod_{i'=2}^{n} p(y_{i'} \mid y_{i'-1}) \prod_{i'=1}^{n} p(x_{i'} \mid y_{i'}) \right]$$

In all cases (+ the Partition Function) the SumProd problem can be expressed as

$$Z = \sum_{y} \left[ \prod_{i=1}^{n} \psi_i(y_i) \prod_{i=2}^{n} \psi_{i-1,i}(y_{i-1}, y_i) \right]$$
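
A sketch computing the posterior marginals by brute force and normalizing over $k$, again with assumed toy parameters:

```python
import itertools
import numpy as np

# Posterior marginals p(y_i = k | x) by brute force, normalizing over k.
p1 = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit = np.array([[0.9, 0.1], [0.3, 0.7]])
x, i = [0, 1, 1, 0], 1  # observation and query position (0-based)

def joint(x, y):
    p = p1[y[0]] * emit[y[0], x[0]]
    for t in range(1, len(y)):
        p *= trans[y[t - 1], y[t]] * emit[y[t], x[t]]
    return p

scores = np.zeros(2)
for y in itertools.product(range(2), repeat=len(x)):
    scores[y[i]] += joint(x, y)   # accumulates p(x, y_i = k)
print(scores / scores.sum())      # p(y_i = k | x), normalized over k
```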

(17)

SumProd algorithm

Example: $f(a, b, c) = f_1(a) \cdot f_2(a, b) \cdot f_3(b) \cdot f_4(b, c) \cdot f_5(c)$

$$Z = \sum_{a,b,c} f(a, b, c) = \sum_a \sum_b \sum_c \Big[ f_1(a) f_2(a, b) f_3(b) f_4(b, c) f_5(c) \Big] =$$
$$= \sum_a f_1(a) \cdot \Bigg[ \sum_b f_2(a, b) f_3(b) \cdot \Big[ \sum_c f_4(b, c) f_5(c) \Big] \Bigg] = \sum_a f_1(a) \cdot \Bigg[ \sum_b f_2(a, b) F(b) \Bigg] = \sum_a G(a)$$
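
The same rearrangement as code (toy factor tables, assumed values): eliminating $c$ and then $b$ via matrix-vector products gives the same $Z$ as the naive triple sum.

```python
import numpy as np

# The f(a, b, c) example with toy tables over 2 values per variable.
f1 = np.array([0.5, 1.5])
f2 = np.array([[1.0, 2.0], [0.5, 1.0]])   # f2[a, b]
f3 = np.array([2.0, 0.5])
f4 = np.array([[1.0, 0.5], [2.0, 1.0]])   # f4[b, c]
f5 = np.array([0.2, 0.8])

# Naive triple sum.
Z_naive = sum(f1[a] * f2[a, b] * f3[b] * f4[b, c] * f5[c]
              for a in range(2) for b in range(2) for c in range(2))

# Eliminating variables from the inside out: F(b), then G(a).
F = f4 @ f5                 # F(b) = sum_c f4(b, c) f5(c)
G = f1 * (f2 @ (f3 * F))    # G(a) = f1(a) sum_b f2(a, b) f3(b) F(b)
print(Z_naive, G.sum())     # identical values
```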

(18)

SumProd algorithm

The idea: propagate Bellman functions $F_i$ (aka messages) that represent partial solutions (sums):

    for k = 1 ... K:
        F_1(k) = q_1(k)
    for i = 2 ... n:
        for k = 1 ... K:
            F_i(k) = 0
            for k' = 1 ... K:
                F_i(k) = F_i(k) + F_{i-1}(k') * g_i(k', k)
            F_i(k) = F_i(k) * q_i(k)
    Z = sum_k F_n(k)

where $q_i$ are the unary and $g_i$ the pairwise factors of the SumProd problem.

Time complexity – $O(nK^2)$
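
A runnable version of this pseudocode, a sketch under the assumption that the factors $q_i$ and $g_i$ are given as arrays; the result is checked against the brute-force sum.

```python
import itertools
import numpy as np

def sumprod_Z(q, g):
    """Partition function via message passing.
    q: (n, K) array, q[i, k] = q_i(k); g: (n-1, K, K), g[i-1, k', k].
    Runs in O(n K^2), exactly as in the pseudocode above."""
    F = q[0].copy()                 # F_1(k) = q_1(k)
    for i in range(1, len(q)):
        F = (F @ g[i - 1]) * q[i]   # F_i(k) = q_i(k) sum_{k'} F_{i-1}(k') g_i(k', k)
    return F.sum()                  # Z = sum_k F_n(k)

# Toy check against brute force for n = 4, K = 3.
rng = np.random.default_rng(1)
n, K = 4, 3
q = rng.uniform(0.1, 1.0, (n, K))
g = rng.uniform(0.1, 1.0, (n - 1, K, K))
brute = sum(np.prod([q[i, y[i]] for i in range(n)]) *
            np.prod([g[i - 1, y[i - 1], y[i]] for i in range(1, n)])
            for y in itertools.product(range(K), repeat=n))
print(sumprod_Z(q, g), brute)  # should agree
```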

(19)

Conclusion

Today:

– (Hidden) Markov Chains – the probabilistic model, parameterizations

– Some "useful" probabilities, the SumProd algorithm

Next classes:

– Decision making (Bayesian Decision Theory), Maximum A-posteriori decisions, MinSum problems (Energy Minimization), Dynamic Programming, additive losses, MaxMarginal decisions

– Statistical learning, the Maximum-Likelihood principle, the Expectation-Maximization algorithm for unsupervised learning

– Discriminative learning, structural SVM
