Data analysis:
Statistical principles and computational methods
Summary
Dmitrij Schlesinger, Carsten Rother
SS2014, 16.07.2014
I. Markov Chains – the probabilistic model
Random variables y_i ∈ K for each i ∈ I
State sequence y = (y_1, y_2, ..., y_n) with y_i ∈ K:

p(y) = p(y_1, y_2, ..., y_n) = p(y_1) · ∏_{i=2..n} p(y_i | y_{i−1})

HMM: p(x, y) = p(y) · p(x | y) ...
I. Markov Chains – the probability of observation
p(x) = Σ_y p(x, y) = Σ_y [ p(y_1) · ∏_{i=2..n} p(y_i | y_{i−1}) · ∏_{i=1..n} p(x_i | y_i) ]

+ other marginal probabilities ...
I. Markov Chains – SumProd algorithm
The idea: propagate Bellman functions F_i (aka messages) that represent partial solutions (sums):

for k = 1...K: F_1(k) = q_1(k)
for i = 2...n:
    for k = 1...K:
        F_i(k) = 0
        for k' = 1...K:
            F_i(k) = F_i(k) + F_{i−1}(k') · g_i(k', k)
        F_i(k) = F_i(k) · q_i(k)
Z = Σ_k F_n(k)
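The forward pass above fits in a few lines of NumPy. A minimal sketch, assuming the unary factors q_i(k) (with the prior p(y_1) already folded into q_1) are stored as a (n, K) array and the transition factors g_i(k', k) as a (n−1, K, K) array:

```python
import numpy as np

def sum_product_chain(q, g):
    """Forward SumProd pass on a chain.

    q: (n, K) array of unary factors q_i(k), with p(y_1) folded into q[0]
    g: (n-1, K, K) array of pairwise factors g_i(k', k)
    Returns Z = sum over all state sequences of the product of factors.
    """
    n, K = q.shape
    F = q[0].copy()                  # F_1(k) = q_1(k)
    for i in range(1, n):
        # F_i(k) = q_i(k) * sum_{k'} F_{i-1}(k') * g_i(k', k)
        F = (F @ g[i - 1]) * q[i]
    return F.sum()                   # Z = sum_k F_n(k)
```

With the HMM factors plugged in, Z equals the observation probability p(x); the intermediate F_i are the messages from which marginals can be obtained.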
I. Most probable state sequence
p(x, y) = p(y_1) · ∏_{i=2..n} p(y_i | y_{i−1}) · ∏_{i=1..n} p(x_i | y_i)

⇓

arg min_y [ Σ_{i=1..n} ψ_i(y_i) + Σ_{i=2..n} ψ_{i−1,i}(y_{i−1}, y_i) ]

Dynamic Programming (Viterbi, Dijkstra ...) – propagate Bellman functions F_i by

F_i(k) = ψ_i(k) + min_{k'} [ F_{i−1}(k') + ψ_{i−1,i}(k', k) ]

The functions F_i represent the quality of the best extension into the already processed part.
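The recursion above can be sketched as follows (a minimal sketch; the (n, K) unary and (n−1, K, K) pairwise cost arrays are an assumed storage layout, not prescribed by the slides):

```python
import numpy as np

def viterbi_chain(psi_u, psi_p):
    """Dynamic programming on a chain (Viterbi).

    psi_u: (n, K) unary costs psi_i(k)
    psi_p: (n-1, K, K) pairwise costs psi_{i-1,i}(k', k)
    Returns the minimizing label sequence and its energy.
    """
    n, K = psi_u.shape
    F = psi_u[0].copy()                        # F_1(k) = psi_1(k)
    back = np.zeros((n, K), dtype=int)         # argmin pointers
    for i in range(1, n):
        total = F[:, None] + psi_p[i - 1]      # F_{i-1}(k') + psi_{i-1,i}(k', k)
        back[i] = total.argmin(axis=0)         # best predecessor for each k
        F = psi_u[i] + total.min(axis=0)       # F_i(k)
    # backtrack from the best final label
    y = [int(F.argmin())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    y.reverse()
    return y, float(F.min())
```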
II. Energy Minimization
y* = arg min_y [ Σ_i ψ_i(y_i) + Σ_{ij} ψ_ij(y_i, y_j) ]
II. Iterated Conditional Modes
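ICM greedily re-optimizes one variable at a time, holding all others fixed, so the energy never increases. A minimal sketch (the cost-array layout and the edge/dictionary representation are assumptions for illustration):

```python
import numpy as np

def icm(psi_u, edges, psi_p, y, sweeps=10):
    """Iterated Conditional Modes: greedy coordinate descent on the energy.

    psi_u: (n, K) unary costs; edges: list of (i, j) pairs;
    psi_p: dict (i, j) -> (K, K) pairwise cost matrix (rows indexed by y_i);
    y: initial labeling (a list, updated in place).
    """
    n, K = psi_u.shape
    nbrs = {i: [] for i in range(n)}
    for (i, j) in edges:
        nbrs[i].append((j, psi_p[(i, j)]))     # rows indexed by label of i
        nbrs[j].append((i, psi_p[(i, j)].T))   # transposed for the other end
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            cost = psi_u[i].copy()
            for (j, M) in nbrs[i]:
                cost += M[:, y[j]]             # pairwise cost given neighbor label
            k = int(cost.argmin())
            if k != y[i]:
                y[i], changed = k, True
        if not changed:                        # local minimum reached
            break
    return y
```

The result depends on the starting labeling, which is exactly why the exam task says "starting from a given labeling".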
II. Equivalent transformations
Binary MinSum Problems – canonical forms:

E(y) = (...) + Σ_{ij} β_ij · δ(y_i ≠ y_j)
II. Binary MinSum Problems ↔ MinCut
C* = arg min_C Σ_{ij∈C} c_ij

– The relation MinSum ↔ MinCut always works
– MinCut is NP-complete in general
– MinCut is polynomially solvable if all edge costs are non-negative, i.e. a + d ≥ b + c holds for all edges
– Such problems are called submodular
II. Search techniques
General idea:
α-expansion, αβ-swap:
III. Bayesian Decision Theory
The Bayesian risk of a strategy e is the expected loss:

R(e) = Σ_x Σ_k p(x, k) · C(e(x), k) → min_e

It should be minimized with respect to the decision strategy. Special cases:
– C(k, k') = δ(k ≠ k') → Maximum A-posteriori decision
– Additive loss C(k, k') = Σ_i c_i(k_i, k'_i) → the strategy is based on marginal probability distributions
  - Hamming loss C(k, k') = Σ_i δ(k_i ≠ k'_i) → Maximum Marginal decision
  - "Metric" loss C(k, k') = Σ_i (k_i − k'_i)² → Minimum Marginal Square Error
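The distinction matters in practice: the jointly most probable labeling (MAP) and the labeling that maximizes each marginal (Hamming loss) can disagree. A tiny illustration with an invented posterior over two binary labels (the numbers are an assumption chosen to exhibit the effect):

```python
import numpy as np

# Hypothetical posterior over two binary labels (rows: y1, cols: y2).
p = np.array([[0.40, 0.00],
              [0.25, 0.35]])

# MAP decision (delta loss): jointly most probable labeling.
map_dec = tuple(int(v) for v in np.unravel_index(p.argmax(), p.shape))

# Max-marginal decision (Hamming loss): maximize each marginal separately.
mm_dec = (int(p.sum(axis=1).argmax()), int(p.sum(axis=0).argmax()))
```

Here MAP picks (0, 0) with probability 0.40, while the marginals p(y1=1) = 0.60 and p(y2=0) = 0.65 yield (1, 0) — a labeling whose joint probability is only 0.25.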
IV. Gibbs Sampling
How to sample in general:
Sampling in MRF-s:
Markovian property: p(y_i | y_{V∖i}) = p(y_i | y_{N(i)})

It should be sampled from

p(y_i = k | y_{N(i)}) ∝ exp[ −ψ_i(k) − Σ_{j∈N(i)} ψ_ij(k, y_j) ]
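One sweep of the Gibbs sampler — resampling every variable from the conditional above — can be sketched as follows (the neighbor lists and the pairwise-cost dictionary are an assumed data layout):

```python
import numpy as np

def gibbs_sweep(psi_u, nbrs, psi_p, y, rng):
    """One Gibbs sweep over an MRF: resample each y_i from
    p(y_i = k | y_N(i)) ∝ exp(-psi_i(k) - sum_j psi_ij(k, y_j)).

    psi_u: (n, K) unary costs; nbrs[i]: list of neighbors of i;
    psi_p: dict (i, j) -> (K, K) pairwise costs; y: current labeling (list).
    """
    n, K = psi_u.shape
    for i in range(n):
        energy = psi_u[i].copy()
        for j in nbrs[i]:
            M = psi_p[(i, j)] if (i, j) in psi_p else psi_p[(j, i)].T
            energy += M[:, y[j]]
        prob = np.exp(-(energy - energy.min()))   # shift for numerical stability
        prob /= prob.sum()
        y[i] = int(rng.choice(K, p=prob))
    return y
```

Repeated sweeps produce (after burn-in) samples from the Gibbs distribution, from which marginals can be estimated by counting.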
IV. Maximum Likelihood for MRF-s (supervised)
From the Maximum Likelihood formulation

F(θ) = ln p(L; θ) = Σ_l [ −E(y^l; θ) − ln Z(θ) ] = −Σ_l E(y^l; θ) − |L| · ln Z(θ) → max_θ

... to the gradient

∂F(θ)/∂θ = −E_data[ ∂E(y; θ)/∂θ ] + E_model[ ∂E(y; θ)/∂θ ]
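The gradient formula can be checked numerically on a toy model small enough to enumerate. A sketch, assuming (purely for illustration) a two-variable binary model with a single parameter θ and energy E(y; θ) = θ · δ(y_1 ≠ y_2), so that ∂E/∂θ is the feature δ(y_1 ≠ y_2):

```python
import numpy as np

# All states of two binary variables; E(y; theta) = theta * [y1 != y2].
ys = [(a, b) for a in (0, 1) for b in (0, 1)]

def f(y):                       # dE/dtheta for this toy model
    return float(y[0] != y[1])

def logZ(theta):
    return np.log(sum(np.exp(-theta * f(y)) for y in ys))

def grad_loglik(theta, data):
    """dF/dtheta per training sample: -E_data[f] + E_model[f]."""
    e_data = np.mean([f(y) for y in data])
    w = np.array([np.exp(-theta * f(y)) for y in ys])      # unnormalized p(y)
    e_model = float((w * np.array([f(y) for y in ys])).sum() / w.sum())
    return -e_data + e_model
```

For larger models Z is intractable and E_model is estimated instead, e.g. by Gibbs sampling as in section IV.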
V. Discriminative Learning
A "hierarchy of abstraction":
Generative models → Discriminative models → Classifiers
Linear classifiers, Perceptron Algorithm, Multi-class Perceptron
Feature spaces – mappings φ(x)
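The multi-class perceptron can be sketched in a few lines: keep one weight vector per class, predict by argmax of the scores, and on a mistake pull the true class's weights toward the example and push the predicted class's weights away (the data layout is an assumption; φ(x) is taken to be x itself here):

```python
import numpy as np

def multiclass_perceptron(X, labels, K, epochs=50):
    """Multi-class perceptron: w_k per class, predict argmax_k <w_k, x>.

    X: (n, d) feature vectors; labels: (n,) class indices in 0..K-1.
    Returns the (K, d) weight matrix after training.
    """
    n, d = X.shape
    W = np.zeros((K, d))
    for _ in range(epochs):
        mistakes = 0
        for x, k in zip(X, labels):
            pred = int((W @ x).argmax())
            if pred != k:
                W[k] += x          # pull true class toward x
                W[pred] -= x       # push wrong winner away
                mistakes += 1
        if mistakes == 0:          # converged: all training points correct
            break
    return W
```

For linearly separable data the loop terminates with zero mistakes; the same mistake-driven update underlies the "multi-class perceptron + Energy Minimization" combination below, with the energy playing the role of the (negative) score.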
V. Discriminative Learning
Energy Minimization is a linear classifier
Multi-class perceptron + Energy Minimization:
Exam
On 5.08 at 10:00 (the room is not yet known – see www later)

Questions, examples:
– Which loss leads to MAP (explain, derive)?
– Write down the probability of a state sequence in a Markov Chain (explain the notation)
– How to obtain an auxiliary binary Energy Minimization problem for α-expansion?
– A simple Energy Minimization problem is given. Perform ICM (or α-expansion or αβ-swap), starting from a given labeling.
– A simple Energy Minimization problem for a chain is given. Find the optimal labeling by Dynamic Programming; compute all necessary Bellman functions.
– Transform a given pairwise term ψ_ij(y_i, y_j) into the canonical form.

All (i.e. our part) should be manageable in 15-20 minutes.