Machine Learning II Summary
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 18.07.2014
Markov Chains – the probabilistic model
p(y) = p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i-1})
     = ∏_{i=2}^{i*} p(y_{i-1} | y_i) · p(y_{i*}) · ∏_{i=i*+1}^{n} p(y_i | y_{i-1})
     = ∏_{i=2}^{n} p(y_i, y_{i-1}) · ∏_{i=2}^{n-1} p(y_i)^{-1}
     = (1/Z) · ∏_{i=1}^{n} ψ_i(y_i) · ∏_{i=2}^{n} ψ_{i-1,i}(y_{i-1}, y_i)
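The pairwise/unary rewriting above can be checked numerically. A minimal sketch, assuming a made-up 3-state chain of length 4 (the names `p1`, `T`, `p_chain`, `p_factorized` are illustration choices, not from the lecture):

```python
import itertools

# Toy 3-state chain of length n=4: initial distribution p1 and a
# transition matrix T, chosen arbitrarily for illustration.
p1 = [0.5, 0.3, 0.2]
T = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.2, 0.2, 0.6]]
n, K = 4, 3

def p_chain(y):
    """p(y) = p(y_1) * prod_i p(y_i | y_{i-1})."""
    p = p1[y[0]]
    for i in range(1, n):
        p *= T[y[i-1]][y[i]]
    return p

# Node marginals p(y_i) and pairwise marginals p(y_{i-1}, y_i),
# obtained here by brute-force summation over all sequences.
node = [[0.0]*K for _ in range(n)]
pair = [[[0.0]*K for _ in range(K)] for _ in range(n-1)]
for y in itertools.product(range(K), repeat=n):
    p = p_chain(y)
    for i in range(n):
        node[i][y[i]] += p
    for i in range(1, n):
        pair[i-1][y[i-1]][y[i]] += p

def p_factorized(y):
    """p(y) = prod_{i=2}^n p(y_{i-1}, y_i) / prod_{i=2}^{n-1} p(y_i)."""
    num = 1.0
    for i in range(1, n):
        num *= pair[i-1][y[i-1]][y[i]]
    den = 1.0
    for i in range(1, n-1):
        den *= node[i][y[i]]
    return num / den

# Both factorizations agree on every sequence.
assert all(abs(p_chain(y) - p_factorized(y)) < 1e-12
           for y in itertools.product(range(K), repeat=n))
```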
HMM: the probability of the observation
p(x) = Σ_y p(x, y) = Σ_y [ p(y_1) · ∏_{i=2}^{n} p(y_i | y_{i-1}) · ∏_{i=1}^{n} p(x_i | y_i) ]

+ other "useful" marginal probabilities
SumProd algorithm
for (k = 1 … K) F_1(k) = ψ_1(k)
for (i = 2 … n)
    for (k = 1 … K)
        F_i(k) = 0
        for (k' = 1 … K)
            F_i(k) = F_i(k) + F_{i-1}(k') · ψ_{i-1,i}(k', k)
        F_i(k) = F_i(k) · ψ_i(k)
Z = Σ_k F_n(k)
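The recursion above translates directly into code. A minimal sketch, assuming arbitrary illustration factors `psi_node[i][k]` and `psi_edge[i][k0][k]` (the edge between nodes i−1 and i) on a 2-state chain of length 3:

```python
# SumProd forward pass on a small chain; all factor values are
# arbitrary non-negative numbers chosen for illustration.
K, n = 2, 3
psi_node = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0]]
psi_edge = [None,                       # no edge into node 0
            [[1.0, 0.5], [0.5, 1.0]],   # edge (0,1)
            [[2.0, 1.0], [1.0, 2.0]]]   # edge (1,2)

F = [[0.0]*K for _ in range(n)]
for k in range(K):
    F[0][k] = psi_node[0][k]
for i in range(1, n):
    for k in range(K):
        for k0 in range(K):
            F[i][k] += F[i-1][k0] * psi_edge[i][k0][k]
        F[i][k] *= psi_node[i][k]

Z = sum(F[n-1])  # partition function
```

For these numbers the brute-force sum over all 8 sequences gives the same Z as the forward pass.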
Most probable state sequence
Dynamic Programming:
In contrast to SumProd we have to minimize over all state sequences rather than to sum them up.
The "quality" of a sequence is not a product but a sum.
The same as SumProd, but in another semiring: SumProd ⇔ MinSum
Note: one can think of a "generalized Dynamic Programming", i.e. Dynamic Programming in an arbitrary semiring, e.g. for OrAnd.
Interesting: the usual Dynamic Programming for MinSum can be expressed by means of equivalent transformations → in fact it maximizes the seeming quality (see board).
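The semiring substitution (sum → min, product → +) turns the SumProd recursion into the MinSum one. A minimal sketch with made-up cost arrays `theta_node` / `theta_edge`, including backtracking of the minimizing sequence:

```python
# MinSum Dynamic Programming (Viterbi-style) on a small chain;
# all costs are arbitrary illustration values.
K, n = 2, 3
theta_node = [[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
theta_edge = [None,
              [[0.0, 2.0], [2.0, 0.0]],   # edge (0,1)
              [[1.0, 0.0], [0.0, 1.0]]]   # edge (1,2)

INF = float("inf")
F = [[INF]*K for _ in range(n)]
back = [[0]*K for _ in range(n)]
for k in range(K):
    F[0][k] = theta_node[0][k]
for i in range(1, n):
    for k in range(K):
        for k0 in range(K):
            c = F[i-1][k0] + theta_edge[i][k0][k]
            if c < F[i][k]:
                F[i][k], back[i][k] = c, k0
        F[i][k] += theta_node[i][k]

# Backtrack the minimizing state sequence.
best = min(range(K), key=lambda k: F[n-1][k])
y = [best]
for i in range(n-1, 0, -1):
    y.append(back[i][y[-1]])
y.reverse()
energy = F[n-1][best]
```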
Bayesian Decision Theory
The Bayesian Risk of a strategy e is the expected loss:

R(e) = Σ_x Σ_k p(x, k) · C(e(x), k) → min_e

It should be minimized with respect to the decision strategy. Special cases:
– C(k, k') = δ(k ≠ k') → Maximum A-posteriori decision
– Additive loss C(k, k') = Σ_i c_i(k_i, k'_i) → the strategy is based on marginal probability distributions
– Hamming loss C(k, k') = Σ_i δ(k_i ≠ k'_i) → Maximum Marginal decision
– "Metric" loss C(k, k') = Σ_i (k_i − k'_i)² → Minimum Marginal Square Error
Probabilistic Learning
Supervised case:
ψ_{i-1,i}(k, k') = n_{i-1,i}(k, k') / Σ_{k''} n_{i-1,i}(k, k'')
"Set the parameters to the statistics (relative frequencies) computed from the training set"
Note:
– Markov Chains are also members of the exponential family (since they are MRFs).
– If the parameters of a chain are some transition probability matrices, the obtained chain will have these matrices as its marginal probabilities.
– A Markov Chain can be parameterized by its own probabilities → the gradient (the difference of the expectations of the sufficient statistics) is zero.
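The counting rule above ("set the parameters to the relative frequencies") can be sketched for a homogeneous chain; the training sequences and label set below are made up for illustration:

```python
from collections import Counter

# Supervised learning: transition parameters are set to relative
# frequencies counted from labeled training sequences.
train = [[0, 0, 1, 1, 2],
         [0, 1, 1, 2, 2],
         [0, 0, 0, 1, 2]]
K = 3

counts = Counter()
for seq in train:
    for a, b in zip(seq, seq[1:]):
        counts[a, b] += 1

# psi[k][k'] = n(k, k') / sum_{k''} n(k, k'')
psi = [[0.0]*K for _ in range(K)]
for k in range(K):
    row = sum(counts[k, k2] for k2 in range(K))
    for k2 in range(K):
        psi[k][k2] = counts[k, k2] / row if row else 0.0
```

Each row of `psi` is a normalized transition distribution estimated from the counts.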
Unsupervised learning
Expectation Maximization Algorithm:
Iterate
1. Expectation – "Recognition" (complete the data): (x_1, x_2, ...), θ ⇒ "classes"
2. Maximization – supervised learning: "classes", (x_1, x_2, ...) ⇒ θ
Expectation Maximization for Markov Chains
Start with arbitrary ψ^(0), repeat:
1. Expectation: compute

   n_1(k) = Σ_l p(y_1 = k | x^l; ψ^(t))
   n_{i-1,i}(k, k') = Σ_l p(y_{i-1} = k, y_i = k' | x^l; ψ^(t))

   using the SumProd algorithm.
2. Maximization: set (properly normalized)

   ψ_1^(t+1)(k) ∝ n_1(k)
   ψ_{i-1,i}^(t+1)(k, k') ∝ n_{i-1,i}(k, k')

It is not necessary to compute α_l(y) for each y individually; for the maximization step only the marginals are necessary!
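The marginals needed in the E-step come from forward and backward SumProd messages. A minimal sketch, assuming illustration factors `psi_node` / `psi_edge` on a 2-state chain of length 3 (same toy numbers as in the SumProd example):

```python
# Forward-backward computation of pairwise marginals on a chain.
K, n = 2, 3
psi_node = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0]]
psi_edge = [None,
            [[1.0, 0.5], [0.5, 1.0]],
            [[2.0, 1.0], [1.0, 2.0]]]

# Forward messages F_i(k) include the node factors up to i.
F = [[0.0]*K for _ in range(n)]
F[0] = list(psi_node[0])
for i in range(1, n):
    for k in range(K):
        F[i][k] = psi_node[i][k] * sum(
            F[i-1][k0] * psi_edge[i][k0][k] for k0 in range(K))

# Backward messages B_i(k) cover everything strictly after node i.
B = [[1.0]*K for _ in range(n)]
for i in range(n-2, -1, -1):
    for k in range(K):
        B[i][k] = sum(psi_edge[i+1][k][k1] * psi_node[i+1][k1] * B[i+1][k1]
                      for k1 in range(K))

Z = sum(F[n-1])

def pair_marginal(i, k, k1):
    """p(y_{i-1} = k, y_i = k1): each factor appears exactly once."""
    return F[i-1][k] * psi_edge[i][k][k1] * psi_node[i][k1] * B[i][k1] / Z
```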
Markov Trees
+ structure learning
Dynamic Programming for partial k-trees
Labeling problems
CSP (OrAnd)
Energy Minimization (MinSum)
Partition Function (SumProd)

⊕_y [ ⊗_i ψ_i(y_i) ⊗ ⊗_{ij} ψ_{ij}(y_i, y_j) ]
Note:
– It is necessary to solve an OrAnd problem in order to check triviality of a MinSum one
– The Relaxation Labeling algorithm for OrAnd looks very similar to the Diffusion algorithm for MinSum
– Dynamic programming is the same for all cases up to operation substitution
– The zero-temperature limit of SumProd is used by Simulated Annealing (MinSum)
Equivalent transformations (re-parameterization)
Φ = { ϕ_i(k) ∀ i, k;  ϕ_{ij}(k) ∀ ij, k }  with
ϕ_i(k) + Σ_{j: ij∈E} ϕ_{ij}(k) = 0  ∀ i, k
Binary MinSum Problems – canonical forms
Binary MinSum Problems ↔ MinCut → MaxFlow
(Parametric) MaxFlow
Submodularity
ψ(k_1, k_1') + ψ(k_2, k_2') ≤ ψ(k_1, k_2') + ψ(k_2, k_1')
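The inequality can be checked by brute force over all ordered label pairs. A minimal sketch (`is_submodular` and the example potentials are illustration choices; the condition is stated here for an ordered label set, which for two labels reduces to ψ(0,0)+ψ(1,1) ≤ ψ(0,1)+ψ(1,0)):

```python
# Brute-force submodularity check for a pairwise MinSum term psi,
# given as a K x K cost matrix over ordered labels.
def is_submodular(psi):
    K = len(psi)
    L = len(psi[0])
    return all(psi[k1][l1] + psi[k2][l2] <= psi[k1][l2] + psi[k2][l1]
               for k1 in range(K) for k2 in range(k1, K)
               for l1 in range(L) for l2 in range(l1, L))

potts = [[0, 1], [1, 0]]   # Potts term: submodular for two labels
anti  = [[1, 0], [0, 1]]   # rewarding disagreement: not submodular
```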
MinSum Problem → Binary MinSum Problem → MinCut (→MaxFlow)
Search techniques
General idea:
Iterated Conditional Modes (+DP),α-expansion,αβ-swap
Note: the "simple" ICM is very similar to Gibbs Sampling
LP-relaxation
The seeming quality is a lower bound for the optimal energy:

SQ(Φ) = Σ_i min_k ψ_i(k) + Σ_ij min_{k,k'} ψ_{ij}(k, k')

→ maximize it with respect to equivalent transformations.
The Diffusion algorithm aims to enforce local consistencies.
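Computing the bound itself is a one-liner per term. A tiny sketch with arbitrary potentials on a two-node model (`psi_node`, `psi_edge` are illustration names):

```python
# Seeming quality: sum of nodewise and edgewise minima of the
# potentials; example values are arbitrary.
psi_node = [[1.0, 3.0], [2.0, 0.5]]
psi_edge = {(0, 1): [[4.0, 1.0], [2.0, 3.0]]}

SQ = sum(min(p) for p in psi_node) \
   + sum(min(min(row) for row in e) for e in psi_edge.values())
```

Since each minimum is taken independently, SQ never exceeds the energy of any single labeling, which is what makes it a lower bound worth maximizing by re-parameterization.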
LP-relaxation
Σ_i Σ_k w_i(k) · ψ_i(k) + Σ_ij Σ_{k,k'} w_{ij}(k, k') · ψ_{ij}(k, k') → min_w

s.t.  Σ_k w_i(k) = 1  ∀ i,
      Σ_{k'} w_{ij}(k, k') = w_i(k)  ∀ i, k, j,
      w_i(k) ∈ [0, 1],  w_{ij}(k, k') ∈ [0, 1]

Note: the weights w are introduced in order to express the energy as a scalar product (a linear function), i.e. E(y; ψ) = ⟨w(y), ψ⟩!
Exactly the same is done:
– to represent an MRF as a member of the exponential family in order to formalize the statistical learning
– to state that Energy Minimization is in fact a linear classifier (discriminative learning)
Statistical learning for MRF-s
Exponential family
p(x, y; w) = (1/Z(w)) · exp⟨φ(x, y), w⟩,   Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩
⇓
∂ln L / ∂w = E_data[φ] − E_model[φ]

MRFs are members of the exponential family.
Gibbs Sampling
Sample from the conditional probability distributions:

p(y_i = k | y_{N(i)}) ∝ exp[ −ψ_i(k) − Σ_{j∈N(i)} ψ_{ij}(k, y_j) ]
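One sweep of this sampler resamples each node from its local conditional. A minimal sketch on a made-up 4-cycle MRF with Potts-style potentials (`gibbs_sweep` and the graph are illustration assumptions):

```python
import math
import random

# One Gibbs sweep on a small 4-cycle MRF with K = 2 labels.
K = 2
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
psi_node = {i: [0.0, 0.5] for i in nodes}               # unary costs
psi_edge = {e: [[0.0, 1.0], [1.0, 0.0]] for e in edges}  # Potts costs

neighbors = {i: [] for i in nodes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

def edge_cost(i, j, k, kj):
    """Pairwise cost between label k at i and label kj at neighbor j."""
    return psi_edge[(i, j)][k][kj] if (i, j) in psi_edge \
        else psi_edge[(j, i)][kj][k]

def gibbs_sweep(y, rng):
    """Resample every node from p(y_i = k | y_{N(i)}) ∝ exp(-cost)."""
    for i in nodes:
        w = [math.exp(-psi_node[i][k]
                      - sum(edge_cost(i, j, k, y[j]) for j in neighbors[i]))
             for k in range(K)]
        z = sum(w)
        r, acc = rng.random() * z, 0.0
        for k in range(K):
            acc += w[k]
            if r <= acc:
                y[i] = k
                break
    return y

y = gibbs_sweep([0] * 4, random.Random(0))
```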
Discriminative Learning
A "hierarchy of abstraction":
Generative models → Discriminative models → Classifiers
Linear classifiers, Perceptron Algorithm, Multi-class Perceptron
Feature spaces – mappings φ(x)
Discriminative Learning
Energy Minimization is a linear classifier
Multi-class perceptron + Energy Minimization:
What was not considered so far ...
– Problems of higher order
– Modeling issues, applications ...
– Non-Bayesian strategies ...
– Energy Minimization – partial optimality, preprocessing
– Partition function (as well as marginals) – mean-field approximation, other sampling techniques
– Structure learning
– Other (statistical) learning subjects – pseudo-likelihood, composite likelihood
– Discriminative learning – large margin learning, SSVM, loss-based learning, learning with latent variables (semi-supervised) ...