Machine Learning II Summary
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 18.07.2014
Markov Chains – the probabilistic model
p(y) = p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i-1})
     = ∏_{i=2}^{i*} p(y_{i-1} | y_i) · p(y_{i*}) · ∏_{i=i*+1}^{n} p(y_i | y_{i-1})
     = ∏_{i=2}^{n} p(y_i, y_{i-1}) · ∏_{i=2}^{n-1} p(y_i)^{-1}
     = (1/Z) · ∏_{i=1}^{n} ψ_i(y_i) · ∏_{i=2}^{n} ψ_{i-1,i}(y_{i-1}, y_i)
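The pairwise/unary rewriting above can be checked numerically. A minimal sketch, assuming a made-up 3-state chain of length 4 (the names `p1`, `T`, `p_chain`, `p_factorized` are illustration choices, not from the lecture):

```python
import itertools

# Toy 3-state chain of length n=4: initial distribution p1 and a
# transition matrix T, chosen arbitrarily for illustration.
p1 = [0.5, 0.3, 0.2]
T = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.2, 0.2, 0.6]]
n, K = 4, 3

def p_chain(y):
    """p(y) = p(y_1) * prod_i p(y_i | y_{i-1})."""
    p = p1[y[0]]
    for i in range(1, n):
        p *= T[y[i-1]][y[i]]
    return p

# Node marginals p(y_i) and pairwise marginals p(y_{i-1}, y_i),
# obtained here by brute-force summation over all sequences.
node = [[0.0]*K for _ in range(n)]
pair = [[[0.0]*K for _ in range(K)] for _ in range(n-1)]
for y in itertools.product(range(K), repeat=n):
    p = p_chain(y)
    for i in range(n):
        node[i][y[i]] += p
    for i in range(1, n):
        pair[i-1][y[i-1]][y[i]] += p

def p_factorized(y):
    """p(y) = prod_{i=2}^n p(y_{i-1}, y_i) / prod_{i=2}^{n-1} p(y_i)."""
    num = 1.0
    for i in range(1, n):
        num *= pair[i-1][y[i-1]][y[i]]
    den = 1.0
    for i in range(1, n-1):
        den *= node[i][y[i]]
    return num / den

# Both factorizations agree on every sequence.
assert all(abs(p_chain(y) - p_factorized(y)) < 1e-12
           for y in itertools.product(range(K), repeat=n))
```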
HMM: the probability of the observation
p(x) = Σ_y p(x, y) = Σ_y [ p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i-1}) · ∏_{i=1}^{n} p(x_i | y_i) ]

+ other "useful" marginal probabilities
SumProd algorithm
for (k = 1 … K) F_1(k) = ψ_1(k)
for (i = 2 … n)
    for (k = 1 … K)
        F_i(k) = 0
        for (k' = 1 … K)
            F_i(k) = F_i(k) + F_{i-1}(k') · ψ_{i-1,i}(k', k)
        F_i(k) = F_i(k) · ψ_i(k)
Z = Σ_k F_n(k)
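The recursion above translates directly into code. A minimal sketch, assuming arbitrary illustration factors `psi_node[i][k]` and `psi_edge[i][k0][k]` (the edge between nodes i−1 and i) on a 2-state chain of length 3:

```python
# SumProd forward pass on a small chain; all factor values are
# arbitrary non-negative numbers chosen for illustration.
K, n = 2, 3
psi_node = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0]]
psi_edge = [None,                       # no edge into node 0
            [[1.0, 0.5], [0.5, 1.0]],   # edge (0,1)
            [[2.0, 1.0], [1.0, 2.0]]]   # edge (1,2)

F = [[0.0]*K for _ in range(n)]
for k in range(K):
    F[0][k] = psi_node[0][k]
for i in range(1, n):
    for k in range(K):
        for k0 in range(K):
            F[i][k] += F[i-1][k0] * psi_edge[i][k0][k]
        F[i][k] *= psi_node[i][k]

Z = sum(F[n-1])  # partition function
```

For these numbers the brute-force sum over all 8 sequences gives the same Z as the forward pass.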
Most probable state sequence
Dynamic Programming:
In contrast to SumProd we have to minimize over all state sequences rather than to sum them up.
The "quality" of a sequence is not a product but a sum.
The same as SumProd, but in another semiring: SumProd ⇔ MinSum
Note: one can think of a "generalized Dynamic Programming", i.e. Dynamic Programming in an arbitrary semiring, e.g. for OrAnd.
Interesting: the usual Dynamic Programming for MinSum can be expressed by means of equivalent transformations → in fact it maximizes the seeming quality (see board).
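The semiring substitution (sum → min, product → +) turns the SumProd recursion into the MinSum one. A minimal sketch with made-up cost arrays `theta_node` / `theta_edge`, including backtracking of the minimizing sequence:

```python
# MinSum Dynamic Programming (Viterbi-style) on a small chain;
# all costs are arbitrary illustration values.
K, n = 2, 3
theta_node = [[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
theta_edge = [None,
              [[0.0, 2.0], [2.0, 0.0]],   # edge (0,1)
              [[1.0, 0.0], [0.0, 1.0]]]   # edge (1,2)

INF = float("inf")
F = [[INF]*K for _ in range(n)]
back = [[0]*K for _ in range(n)]
for k in range(K):
    F[0][k] = theta_node[0][k]
for i in range(1, n):
    for k in range(K):
        for k0 in range(K):
            c = F[i-1][k0] + theta_edge[i][k0][k]
            if c < F[i][k]:
                F[i][k], back[i][k] = c, k0
        F[i][k] += theta_node[i][k]

# Backtrack the minimizing state sequence.
best = min(range(K), key=lambda k: F[n-1][k])
y = [best]
for i in range(n-1, 0, -1):
    y.append(back[i][y[-1]])
y.reverse()
energy = F[n-1][best]
```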
Bayesian Decision Theory
The Bayesian Risk of a strategy e is the expected loss:

R(e) = Σ_x Σ_k p(x, k) · C(e(x), k) → min_e

It should be minimized with respect to the decision strategy. Special cases:
– C(k, k') = δ(k ≠ k') → Maximum A-posteriori decision
– Additive loss C(k, k') = Σ_i c_i(k_i, k'_i) → the strategy is based on marginal probability distributions
– Hamming loss C(k, k') = Σ_i δ(k_i ≠ k'_i) → Maximum Marginal decision
– "Metric" loss C(k, k') = Σ_i (k_i − k'_i)² → Minimum Marginal Square Error
Probabilistic Learning
Supervised case:
ψ_{i-1,i}(k, k') = n_{i-1,i}(k, k') / Σ_{k''} n_{i-1,i}(k, k'')
"Set the parameters to the statistics (relative frequencies) computed from the training set"
Note:
– Markov Chains are also members of the exponential family (since they are MRFs).
– If the parameters of a chain are some transition probability matrices, the obtained chain will have these matrices as its marginal probabilities.
– A Markov Chain can be parameterized by its own probabilities → the gradient (the difference of the expectations of the sufficient statistics) is zero.
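The counting rule above ("set the parameters to the relative frequencies") can be sketched for a homogeneous chain; the training sequences and label set below are made up for illustration:

```python
from collections import Counter

# Supervised learning: transition parameters are set to relative
# frequencies counted from labeled training sequences.
train = [[0, 0, 1, 1, 2],
         [0, 1, 1, 2, 2],
         [0, 0, 0, 1, 2]]
K = 3

counts = Counter()
for seq in train:
    for a, b in zip(seq, seq[1:]):
        counts[a, b] += 1

# psi[k][k'] = n(k, k') / sum_{k''} n(k, k'')
psi = [[0.0]*K for _ in range(K)]
for k in range(K):
    row = sum(counts[k, k2] for k2 in range(K))
    for k2 in range(K):
        psi[k][k2] = counts[k, k2] / row if row else 0.0
```

Each row of `psi` is a normalized transition distribution estimated from the counts.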
Unsupervised learning
Expectation Maximization Algorithm:
Iterate
1. Expectation – "Recognition" (complete the data): (x_1, x_2, ...), θ ⇒ "classes"
2. Maximization – supervised learning: "classes", (x_1, x_2, ...) ⇒ θ
Expectation Maximization for Markov Chains
Start with arbitrary ψ^(0), repeat:
1. Expectation: compute

   n_1(k) = Σ_l p(y_1 = k | x^l; ψ^(t))
   n_{i-1,i}(k, k') = Σ_l p(y_{i-1} = k, y_i = k' | x^l; ψ^(t))

   using the SumProd algorithm.
2. Maximization: set (properly normalized)

   ψ_1^(t+1)(k) ∝ n_1(k)
   ψ_{i-1,i}^(t+1)(k, k') ∝ n_{i-1,i}(k, k')

It is not necessary to compute α_l(y) for each y individually; for the maximization step only the marginals are necessary!
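The marginals needed in the E-step come from forward and backward SumProd messages. A minimal sketch, assuming illustration factors `psi_node` / `psi_edge` on a 2-state chain of length 3 (same toy numbers as in the SumProd example):

```python
# Forward-backward computation of pairwise marginals on a chain.
K, n = 2, 3
psi_node = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0]]
psi_edge = [None,
            [[1.0, 0.5], [0.5, 1.0]],
            [[2.0, 1.0], [1.0, 2.0]]]

# Forward messages F_i(k) include the node factors up to i.
F = [[0.0]*K for _ in range(n)]
F[0] = list(psi_node[0])
for i in range(1, n):
    for k in range(K):
        F[i][k] = psi_node[i][k] * sum(
            F[i-1][k0] * psi_edge[i][k0][k] for k0 in range(K))

# Backward messages B_i(k) cover everything strictly after node i.
B = [[1.0]*K for _ in range(n)]
for i in range(n-2, -1, -1):
    for k in range(K):
        B[i][k] = sum(psi_edge[i+1][k][k1] * psi_node[i+1][k1] * B[i+1][k1]
                      for k1 in range(K))

Z = sum(F[n-1])

def pair_marginal(i, k, k1):
    """p(y_{i-1} = k, y_i = k1): each factor appears exactly once."""
    return F[i-1][k] * psi_edge[i][k][k1] * psi_node[i][k1] * B[i][k1] / Z
```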
Markov Trees
+ structure learning
Dynamic Programming for partial k-trees
Labeling problems
CSP (OrAnd)
Energy Minimization (MinSum)
Partition Function (SumProd)

⊕_y [ ⊗_i ψ_i(y_i) ⊗ ⊗_{ij} ψ_{ij}(y_i, y_j) ]
Note:
– It is necessary to solve an OrAnd problem in order to check triviality of a MinSum one
– The Relaxation Labeling algorithm for OrAnd looks very similar to the Diffusion algorithm for MinSum
– Dynamic programming is the same for all cases up to operation substitution
– The zero-temperature limit of SumProd is used by Simulated Annealing (MinSum)
Equivalent transformations (re-parameterization)
Φ = { ϕ_i(k) ∀ i, k;  ϕ_{ij}(k) ∀ ij, k }  with
ϕ_i(k) + Σ_{j: ij∈E} ϕ_{ij}(k) = 0  ∀ i, k
Binary MinSum Problems – canonical forms
Binary MinSum Problems ↔ MinCut → MaxFlow
(Parametric) MaxFlow
Submodularity
ψ(k_1, k_1') + ψ(k_2, k_2') ≤ ψ(k_1, k_2') + ψ(k_2, k_1')
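The inequality can be checked by brute force over all ordered label pairs. A minimal sketch (`is_submodular` and the example potentials are illustration choices; the condition is stated here for an ordered label set, which for two labels reduces to ψ(0,0)+ψ(1,1) ≤ ψ(0,1)+ψ(1,0)):

```python
# Brute-force submodularity check for a pairwise MinSum term psi,
# given as a K x K cost matrix over ordered labels.
def is_submodular(psi):
    K = len(psi)
    L = len(psi[0])
    return all(psi[k1][l1] + psi[k2][l2] <= psi[k1][l2] + psi[k2][l1]
               for k1 in range(K) for k2 in range(k1, K)
               for l1 in range(L) for l2 in range(l1, L))

potts = [[0, 1], [1, 0]]   # Potts term: submodular for two labels
anti  = [[1, 0], [0, 1]]   # rewarding disagreement: not submodular
```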
MinSum Problem → Binary MinSum Problem → MinCut (→MaxFlow)
Search techniques
General idea:
Iterated Conditional Modes (+DP),α-expansion,αβ-swap
Note: the "simple" ICM is very similar to Gibbs Sampling
LP-relaxation
The seeming quality is a lower bound for the optimal energy:

SQ(Φ) = Σ_i min_k ψ_i(k) + Σ_ij min_{k,k'} ψ_{ij}(k, k')

→ maximize it with respect to equivalent transformations.
The Diffusion algorithm aims to enforce local consistencies.
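Computing the bound itself is a one-liner per term. A tiny sketch with arbitrary potentials on a two-node model (`psi_node`, `psi_edge` are illustration names):

```python
# Seeming quality: sum of nodewise and edgewise minima of the
# potentials; example values are arbitrary.
psi_node = [[1.0, 3.0], [2.0, 0.5]]
psi_edge = {(0, 1): [[4.0, 1.0], [2.0, 3.0]]}

SQ = sum(min(p) for p in psi_node) \
   + sum(min(min(row) for row in e) for e in psi_edge.values())
```

Since each minimum is taken independently, SQ never exceeds the energy of any single labeling, which is what makes it a lower bound worth maximizing by re-parameterization.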
LP-relaxation
Σ_i Σ_k w_i(k) · ψ_i(k) + Σ_ij Σ_{k,k'} w_{ij}(k, k') · ψ_{ij}(k, k') → min_w

s.t.  Σ_k w_i(k) = 1  ∀ i,
      Σ_{k'} w_{ij}(k, k') = w_i(k)  ∀ i, k, j,
      w_i(k) ∈ [0, 1],  w_{ij}(k, k') ∈ [0, 1]

Note: the weights w are introduced in order to express the energy as a scalar product (a linear function), i.e. E(y; ψ) = ⟨w(y), ψ⟩!
Exactly the same is done:
– to represent an MRF as a member of the exponential family in order to formalize the statistical learning
– to state that Energy Minimization is in fact a linear classifier (discriminative learning)
Statistical learning for MRF-s
Exponential family
p(x, y; w) = (1/Z(w)) · exp⟨φ(x, y), w⟩,   Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩
⇓
∂ln L / ∂w = E_data[φ] − E_model[φ]

MRFs are members of the exponential family.
Gibbs Sampling
Sample from the conditional probability distributions:

p(y_i = k | y_{N(i)}) ∝ exp[ −ψ_i(k) − Σ_{j∈N(i)} ψ_{ij}(k, y_j) ]
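One sweep of this sampler resamples each node from its local conditional. A minimal sketch on a made-up 4-cycle MRF with Potts-style potentials (`gibbs_sweep` and the graph are illustration assumptions):

```python
import math
import random

# One Gibbs sweep on a small 4-cycle MRF with K = 2 labels.
K = 2
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
psi_node = {i: [0.0, 0.5] for i in nodes}               # unary costs
psi_edge = {e: [[0.0, 1.0], [1.0, 0.0]] for e in edges}  # Potts costs

neighbors = {i: [] for i in nodes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

def edge_cost(i, j, k, kj):
    """Pairwise cost between label k at i and label kj at neighbor j."""
    return psi_edge[(i, j)][k][kj] if (i, j) in psi_edge \
        else psi_edge[(j, i)][kj][k]

def gibbs_sweep(y, rng):
    """Resample every node from p(y_i = k | y_{N(i)}) ∝ exp(-cost)."""
    for i in nodes:
        w = [math.exp(-psi_node[i][k]
                      - sum(edge_cost(i, j, k, y[j]) for j in neighbors[i]))
             for k in range(K)]
        z = sum(w)
        r, acc = rng.random() * z, 0.0
        for k in range(K):
            acc += w[k]
            if r <= acc:
                y[i] = k
                break
    return y

y = gibbs_sweep([0] * 4, random.Random(0))
```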
Discriminative Learning
A "hierarchy of abstraction":
Generative models → Discriminative models → Classifiers
Linear classifiers, Perceptron Algorithm, Multi-class Perceptron
Feature spaces – mappings φ(x)
Discriminative Learning
Energy Minimization is a linear classifier
Multi-class perceptron + Energy Minimization:
What was not considered so far ...
– Problems of higher order
– Modeling issues, applications ...
– Non-Bayesian strategies ...
– Energy Minimization – partial optimality, preprocessing
– Partition function (as well as marginals) – mean-field approximation, other sampling techniques
– Structure learning
– Other (statistical) learning subjects – pseudo-likelihood, composite likelihood
– Discriminative learning – large margin learning, SSVM, loss-based learning, learning with latent variables (semi-supervised) ...