
(1)

Machine Learning II Summary

Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug

SS2014, 18.07.2014

(2)

Markov Chains – the probabilistic model

$$
\begin{aligned}
p(y) &= p(y_1)\prod_{i=2}^{n} p(y_i \mid y_{i-1}) \\
     &= \prod_{j=2}^{i} p(y_{j-1} \mid y_j)\; p(y_i) \prod_{j=i+1}^{n} p(y_j \mid y_{j-1}) \\
     &= \prod_{i=2}^{n} p(y_i, y_{i-1}) \prod_{i=2}^{n-1} p(y_i)^{-1} \\
     &= \frac{1}{Z} \prod_{i=1}^{n} \psi_i(y_i) \prod_{i=2}^{n} \psi_{i-1,i}(y_{i-1}, y_i)
\end{aligned}
$$

(The second line re-directs the chain at an arbitrary position i.)
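The equivalence between the conditional factorization of p(y) and the pairwise-marginal form can be checked numerically; a minimal sketch on a toy chain (all numbers are hypothetical values chosen here):

```python
import itertools

K, n = 3, 4
p1 = [0.5, 0.3, 0.2]                              # p(y1)
T = [[0.7, 0.2, 0.1],                             # T[a][b] = p(y_i = b | y_{i-1} = a)
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]

# Unary marginals p(y_i), obtained by propagating p(y1) through T
marg = [p1]
for _ in range(n - 1):
    prev = marg[-1]
    marg.append([sum(prev[a] * T[a][b] for a in range(K)) for b in range(K)])

def p_chain(y):
    """p(y) = p(y1) * prod_i p(y_i | y_{i-1})"""
    prob = p1[y[0]]
    for a, b in zip(y, y[1:]):
        prob *= T[a][b]
    return prob

def p_pairwise(y):
    """p(y) = prod_i p(y_{i-1}, y_i) divided by the inner marginals p(y_i)"""
    prob = 1.0
    for i in range(1, n):
        prob *= marg[i - 1][y[i - 1]] * T[y[i - 1]][y[i]]   # p(y_{i-1}, y_i)
    for i in range(1, n - 1):
        prob /= marg[i][y[i]]                               # p(y_i), i = 2..n-1
    return prob

for y in itertools.product(range(K), repeat=n):
    assert abs(p_chain(y) - p_pairwise(y)) < 1e-12
```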

(3)

HMM – the probability of observation

$$
p(x) = \sum_y p(x, y)
     = \sum_y \left[ p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1}) \prod_{i=1}^{n} p(x_i \mid y_i) \right]
$$

+ other "useful" marginal probabilities
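Brute-force enumeration makes the sum over y concrete; a minimal sketch on a toy HMM (the transition and emission tables below are hypothetical):

```python
import itertools

p1 = [0.6, 0.4]                       # p(y1)
T = [[0.7, 0.3], [0.4, 0.6]]          # T[a][b] = p(y_i = b | y_{i-1} = a)
E = [[0.9, 0.1], [0.2, 0.8]]          # E[k][x] = p(x_i = x | y_i = k)

def p_obs(x):
    """p(x) = sum over all hidden sequences y of p(y1) prod p(yi|y_{i-1}) prod p(xi|yi)."""
    total = 0.0
    for y in itertools.product(range(2), repeat=len(x)):
        w = p1[y[0]]
        for a, b in zip(y, y[1:]):
            w *= T[a][b]
        for yi, xi in zip(y, x):
            w *= E[yi][xi]
        total += w
    return total

# Sanity check: probabilities of all observation sequences sum to one.
assert abs(sum(p_obs(x) for x in itertools.product(range(2), repeat=3)) - 1.0) < 1e-12
```

This costs K^n work; the SumProd algorithm on the next slide computes the same quantity in O(nK²).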

(4)

SumProd algorithm

for (k = 1 … K) F_1(k) = ψ_1(k)
for (i = 2 … n)
    for (k = 1 … K)
        F_i(k) = 0
        for (k' = 1 … K)
            F_i(k) = F_i(k) + F_{i-1}(k') · ψ_{i-1,i}(k', k)
        F_i(k) = F_i(k) · ψ_i(k)
Z = Σ_k F_n(k)
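A direct Python transcription of the loop above, checked against brute-force enumeration of Z (the potential values are toy numbers chosen for illustration):

```python
import itertools

K, n = 2, 3
psi = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0]]            # psi_i(k)
psi_pair = [[[1.0, 0.2], [0.3, 1.0]],                 # psi_{i-1,i}(k', k)
            [[0.8, 0.5], [0.1, 1.2]]]

def sum_prod(psi, psi_pair):
    """Forward pass; returns the messages F_i(k) and the partition sum Z."""
    K, n = len(psi[0]), len(psi)
    F = [[0.0] * K for _ in range(n)]
    F[0] = list(psi[0])                               # F_1(k) = psi_1(k)
    for i in range(1, n):
        for k in range(K):
            for kp in range(K):                       # accumulate over k'
                F[i][k] += F[i - 1][kp] * psi_pair[i - 1][kp][k]
            F[i][k] *= psi[i][k]
    return F, sum(F[-1])

def brute_Z(psi, psi_pair):
    """Brute-force Z = sum over all label sequences, for checking."""
    Z = 0.0
    for y in itertools.product(range(len(psi[0])), repeat=len(psi)):
        w = 1.0
        for i, k in enumerate(y):
            w *= psi[i][k]
        for i in range(1, len(y)):
            w *= psi_pair[i - 1][y[i - 1]][y[i]]
        Z += w
    return Z

F, Z = sum_prod(psi, psi_pair)
assert abs(Z - brute_Z(psi, psi_pair)) < 1e-12
```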

(5)

Most probable state sequence

Dynamic Programming:

In contrast to SumProd we have to minimize over all state sequences rather than to sum them up.

The “quality” of a sequence is not a product but a sum. The same as SumProd, but in another semiring:

SumProd ⇔ MinSum

Note: one can think of a “generalized Dynamic Programming”, i.e. in an arbitrary semiring, e.g. for OrAnd.

Interesting: the usual Dynamic Programming for MinSum can be expressed by means of equivalent transformations → in fact it maximizes the seeming quality (see board).
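The semiring substitution can be made literal by parameterizing the recursion over (⊕, ⊗); a minimal sketch with toy costs chosen here:

```python
import itertools

def dp(psi, psi_pair, oplus, otimes, zero):
    """SumProd-style recursion with the semiring operations (oplus, otimes) swapped in."""
    K, n = len(psi[0]), len(psi)
    F = list(psi[0])                         # F_1(k) = psi_1(k)
    for i in range(1, n):
        G = []
        for k in range(K):
            acc = zero
            for kp in range(K):              # oplus over k' of F_{i-1}(k') ⊗ psi_{i-1,i}(k', k)
                acc = oplus(acc, otimes(F[kp], psi_pair[i - 1][kp][k]))
            G.append(otimes(acc, psi[i][k]))
        F = G
    acc = zero
    for k in range(K):
        acc = oplus(acc, F[k])
    return acc

# MinSum instance: unary and pairwise costs on a 3-node binary chain (toy numbers)
theta = [[0.0, 1.0], [2.0, 0.5], [1.0, 0.0]]
theta_pair = [[[0.0, 1.5], [1.5, 0.0]], [[0.5, 0.0], [0.0, 2.0]]]

best = dp(theta, theta_pair, min, lambda a, b: a + b, float("inf"))

# Brute-force minimum energy for comparison
energies = [
    sum(theta[i][y[i]] for i in range(3))
    + sum(theta_pair[i - 1][y[i - 1]][y[i]] for i in range(1, 3))
    for y in itertools.product(range(2), repeat=3)
]
assert abs(best - min(energies)) < 1e-12
```

Passing (min, +) gives MinSum, (+, ·) gives SumProd, and (or, and) gives the OrAnd/CSP check — the same code for all three semirings.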

(6)

Bayesian Decision Theory

The Bayesian Risk of a strategy e is the expected loss:

$$
R(e) = \sum_x \sum_k p(x, k) \cdot C(e(x), k) \to \min_e
$$

It should be minimized with respect to the decision strategy.

Special cases:
– C(k, k') = δ(k ≠ k') → Maximum A-posteriori decision
– Additive loss C(k, k') = Σ_i c_i(k_i, k'_i) → the strategy is based on marginal probability distributions:
  – Hamming loss C(k, k') = Σ_i δ(k_i ≠ k'_i) → Maximum Marginal decision
  – “Metric” loss C(k, k') = Σ_i (k_i − k'_i)² → Minimum Marginal Square Error
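The MAP and Maximum Marginal decisions genuinely differ; a minimal sketch on a posterior over two binary labels (the numbers are hypothetical, picked so the two strategies disagree):

```python
# Posterior p(k | x) over label pairs k = (k1, k2)
post = {(0, 0): 0.40, (1, 1): 0.35, (1, 0): 0.25}

# Delta loss -> pick the single most probable joint labeling
map_decision = max(post, key=post.get)

# Hamming loss -> decide each component from its own marginal
m1 = {v: sum(p for (a, b), p in post.items() if a == v) for v in (0, 1)}
m2 = {v: sum(p for (a, b), p in post.items() if b == v) for v in (0, 1)}
mm_decision = (max(m1, key=m1.get), max(m2, key=m2.get))

assert map_decision == (0, 0)     # most probable sequence
assert mm_decision == (1, 0)      # componentwise decision differs
```

Here p(k1 = 1) = 0.60 and p(k2 = 0) = 0.65, so the max-marginal decision (1, 0) has joint probability only 0.25, yet it minimizes the expected Hamming loss.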

(7)

Probabilistic Learning

Supervised case:

$$
\psi_{i-1,i}(k, k') = \frac{n_{i-1,i}(k, k')}{\sum_{k''} n_{i-1,i}(k, k'')}
$$

"Set the parameters to the statistics (relative frequencies) computed from the training set"

Note:

– Markov Chains are also members of the exponential family (since they are MRFs).
– If the parameters of a chain are some transition probability matrices, the obtained chain will have these matrices as the marginal probabilities.
– A Markov Chain can be parameterized by its own probabilities → the gradient (the difference of the expectations of the sufficient statistics) is zero.
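The supervised estimate is plain counting; a minimal sketch with toy training sequences invented here (counts are pooled over all positions, i.e. a homogeneous chain, whereas the slide keeps them per position i):

```python
from collections import Counter

# Toy training data: three labeled state sequences over states {0, 1}
data = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]

pair_counts = Counter()
for seq in data:
    for a, b in zip(seq, seq[1:]):
        pair_counts[a, b] += 1          # n(k, k')

# psi(k, k') = n(k, k') / sum_{k''} n(k, k'')  -- relative transition frequencies
K = 2
psi = [[pair_counts[k, kp] / sum(pair_counts[k, kpp] for kpp in range(K))
        for kp in range(K)] for k in range(K)]

for row in psi:                         # each row is a proper distribution
    assert abs(sum(row) - 1.0) < 1e-12
```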

(8)

Unsupervised learning

Expectation Maximization Algorithm:

Iterate

1. Expectation – “Recognition” (complete the data):

(x1, x2, …), θ ⇒ “classes”

2. Maximization – Supervised learning:

“classes”, (x1, x2, …) ⇒ θ

(9)

Expectation Maximization for Markov Chains

Start with an arbitrary ψ(0), repeat:

1. Expectation: compute

$$
n_1(k) = \sum_l p(y_1 = k \mid x^l; \psi^{(t)}), \qquad
n_{i-1,i}(k, k') = \sum_l p(y_{i-1} = k, y_i = k' \mid x^l; \psi^{(t)})
$$

using the SumProd algorithm.

2. Maximization: set (properly normalized)

$$
\psi_1^{(t+1)}(k) \propto n_1(k), \qquad
\psi_{i-1,i}^{(t+1)}(k, k') \propto n_{i-1,i}(k, k')
$$

It is not necessary to compute α^l(y) for each y individually; for the maximization step only the marginals are necessary!

(10)

Markov Trees

+ structure learning

Dynamic Programming for partial k-trees

(11)

Labeling problems

CSP (OrAnd)
Energy Minimization (MinSum)
Partition Function (SumProd)

$$
\bigoplus_y \left[ \bigotimes_i \psi_i(y_i) \otimes \bigotimes_{ij} \psi_{ij}(y_i, y_j) \right]
$$

Note:

– It is necessary to solve an OrAnd problem in order to check triviality of a MinSum one

– The Relaxation Labeling algorithm for OrAnd looks very similar to the Diffusion algorithm for MinSum

– Dynamic programming is the same for all cases up to operation substitution

– Zero-temperature limit (SumProd) is used by the Simulated Annealing (MinSum)

(12)

Equivalent transformations (re-parameterization)

$$
\Phi = \{\varphi_i(k)\ \forall i, k;\ \varphi_{ij}(k)\ \forall ij, k\}, \qquad
\varphi_i(k) + \sum_{j:\, ij \in E} \varphi_{ij}(k) = 0 \quad \forall i, k
$$

Binary MinSum Problems – canonical forms

(13)

Binary MinSum Problems ↔ MinCut → MaxFlow

(Parametric) MaxFlow

(14)

Submodularity

$$
\psi(k_1, k_1') + \psi(k_2, k_2') \le \psi(k_1, k_2') + \psi(k_2, k_1')
$$

MinSum Problem → Binary MinSum Problem → MinCut (→ MaxFlow)
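The submodularity inequality can be checked directly on a pairwise cost table; a minimal sketch with toy Potts-style costs (labels assumed to be ordered 0 … K−1):

```python
def is_submodular(psi):
    """Check psi(k1,k1') + psi(k2,k2') <= psi(k1,k2') + psi(k2,k1')
    for all ordered pairs k1 <= k2, k1' <= k2'."""
    K = len(psi)
    for k1 in range(K):
        for k2 in range(k1, K):
            for k1p in range(K):
                for k2p in range(k1p, K):
                    if psi[k1][k1p] + psi[k2][k2p] > psi[k1][k2p] + psi[k2][k1p] + 1e-12:
                        return False
    return True

potts = [[0.0, 1.0], [1.0, 0.0]]      # Potts cost: submodular for two labels
anti = [[1.0, 0.0], [0.0, 1.0]]       # rewards disagreement: violates the inequality

assert is_submodular(potts)
assert not is_submodular(anti)
```

For binary labels this is exactly the condition under which the MinSum problem maps to MinCut and can be solved by MaxFlow.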

(15)

Search techniques

General idea:

Iterated Conditional Modes (+ DP), α-expansion, αβ-swap

Note: the "simple" ICM is very similar to the Gibbs Sampling

(16)

LP-relaxation

The seeming quality is a lower bound for the optimal energy:

$$
SQ(A) = \sum_i \min_k \psi_i(k) + \sum_{ij} \min_{kk'} \psi_{ij}(k, k')
$$

→ maximize it with respect to equivalent transformations.
The Diffusion algorithm aims to enforce local consistencies.
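The bound itself is a one-liner: summing the independent minima of every unary and pairwise term can never exceed the energy of any single labeling. A minimal sketch on a toy 3-node chain (costs invented here):

```python
import itertools

theta = [[0.0, 1.0], [2.0, 0.5], [1.0, 0.0]]
theta_pair = [[[0.0, 1.5], [1.5, 0.0]], [[0.5, 0.0], [0.0, 2.0]]]

# Seeming quality: independent minima of all terms
sq = (sum(min(t) for t in theta)
      + sum(min(min(row) for row in tp) for tp in theta_pair))

# True minimum energy by brute force
min_E = min(
    sum(theta[i][y[i]] for i in range(3))
    + sum(theta_pair[i - 1][y[i - 1]][y[i]] for i in range(1, 3))
    for y in itertools.product(range(2), repeat=3)
)

assert sq <= min_E + 1e-12      # SQ is a lower bound on the optimal energy
```

Equivalent transformations change the individual terms (and thus SQ) without changing any labeling's energy, which is why maximizing SQ over them tightens the bound.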

(17)

LP-relaxation

$$
\sum_i \sum_k w_i(k) \cdot \psi_i(k) + \sum_{ij} \sum_{kk'} w_{ij}(k, k') \cdot \psi_{ij}(k, k') \to \min_w
$$

$$
\text{s.t.} \quad \sum_k w_i(k) = 1 \ \ \forall i, \qquad
\sum_{k'} w_{ij}(k, k') = w_i(k) \ \ \forall i, k, j, \qquad
w_i(k) \in [0, 1], \ w_{ij}(k, k') \in [0, 1]
$$

Note: the weights w are introduced in order to express the energy as a scalar product (a linear function), i.e. E(y; ψ) = ⟨w(y), ψ⟩.

Exactly the same is done:
– to represent an MRF as a member of the Exponential family in order to formalize the statistical learning
– to state that the Energy minimization is in fact a linear classifier (discriminative learning)

(18)

Statistical learning for MRF-s

Exponential family

$$
p(x, y; w) = \frac{1}{Z(w)} \exp\langle \phi(x, y), w \rangle, \qquad
Z(w) = \sum_{x, y} \exp\langle \phi(x, y), w \rangle
$$

$$
\frac{\partial \ln L}{\partial w} = \mathbb{E}_{\text{data}}[\phi] - \mathbb{E}_{\text{model}}[\phi]
$$

MRF-s are members of the exponential family
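The gradient identity can be verified numerically on a one-parameter toy exponential family (the feature φ(y) = y and the data below are invented for illustration): the analytic gradient E_data[φ] − E_model[φ] matches a finite-difference derivative of the average log-likelihood.

```python
import math

phi = [0.0, 1.0, 2.0]          # feature phi(y) for y in {0, 1, 2}
data = [0, 2, 2, 1]            # toy observations

def logZ(w):
    return math.log(sum(math.exp(w * f) for f in phi))

def avg_loglik(w):
    """Average log-likelihood (1/N) sum_l [w * phi(y_l) - log Z(w)]."""
    return sum(w * phi[y] - logZ(w) for y in data) / len(data)

def grad(w):
    """E_data[phi] - E_model[phi]."""
    Z = sum(math.exp(w * f) for f in phi)
    e_model = sum(f * math.exp(w * f) for f in phi) / Z
    e_data = sum(phi[y] for y in data) / len(data)
    return e_data - e_model

w, h = 0.3, 1e-6
num = (avg_loglik(w + h) - avg_loglik(w - h)) / (2 * h)
assert abs(num - grad(w)) < 1e-6
```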

(19)

Gibbs Sampling

Sample from conditional probability distributions

$$
p(y_i = k \mid y_{N(i)}) \propto \exp\Bigl[ -\psi_i(k) - \sum_{j \in N(i)} \psi_{ij}(k, y_j) \Bigr]
$$
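A single Gibbs sweep over a toy binary chain MRF, sampling each variable from the conditional above (the potentials are hypothetical costs invented here):

```python
import math
import random

random.seed(0)

psi_unary = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.0]]          # psi_i(k)
psi_pair = {(0, 1): [[0.0, 1.0], [1.0, 0.0]],             # psi_ij(k, k')
            (1, 2): [[0.0, 1.0], [1.0, 0.0]]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def edge_cost(i, j, ki, kj):
    """Look up psi_ij regardless of edge orientation."""
    return psi_pair[(i, j)][ki][kj] if (i, j) in psi_pair else psi_pair[(j, i)][kj][ki]

def gibbs_sweep(y):
    """Resample each y_i from p(y_i = k | y_N(i)) ∝ exp[-psi_i(k) - sum_j psi_ij(k, y_j)]."""
    for i in range(len(y)):
        w = [math.exp(-psi_unary[i][k]
                      - sum(edge_cost(i, j, k, y[j]) for j in neighbors[i]))
             for k in (0, 1)]
        y[i] = 0 if random.random() * (w[0] + w[1]) < w[0] else 1
    return y

y = [0, 0, 0]
for _ in range(100):
    y = gibbs_sweep(y)
assert all(v in (0, 1) for v in y)
```

Averaging the visited states after a burn-in period approximates the marginals p(y_i = k).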

(20)

Discriminative Learning

A “hierarchy of abstraction”:

Generative models → Discriminative models → Classifiers
Linear classifiers, Perceptron Algorithm, Multi-class Perceptron


Feature spaces – mappings φ(x)

(21)

Discriminative Learning

Energy Minimization is a linear classifier

Multi-class perceptron + Energy Minimization

(22)

What was not considered so far ...

Problems of higher order
Modeling issues, applications ...
Non-Bayesian strategies ...
Energy Minimization – partial optimality, preprocessing
Partition function (as well as marginals) – mean-field approximation, other sampling techniques
Structure learning
Other (statistical) learning subjects – pseudo-likelihood, composite likelihood

Discriminative learning – large margin learning, SSVM, loss-based learning, learning with latent variables (semi-supervised) ...
