Machine Learning II Statistical Learning
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 04.07.2014
Outline
– Exponential Family: sufficient statistics, learning (supervised and unsupervised)
– MRF is a member of the Exponential Family: what are the necessary sufficient statistics, what has to be computed for the gradient?
– Recap: inference with additive loss functions
– Gibbs Sampling: how to compute it, putting it all together
Exponential family
General form:

p(x; θ) = h(x)·exp[⟨η(θ), T(x)⟩ − A(θ)]

with
– x is a random variable
– θ is a parameter
– η(θ) is the natural parameter, a vector (often η(θ) = θ)
– T(x) is a sufficient statistic
– A(θ) is the log-partition function
Almost all probability distributions you can imagine are members of the exponential family
Exponential family, example – Gaussians
Variable: x ∈ R^n, parameters: μ ∈ R^n and σ ∈ R

p(x; μ, σ) = 1/(√(2π)·σ)^n · exp[−‖x − μ‖² / (2σ²)] =
= exp[−(1/(2σ²))·‖x‖² + ⟨x, μ⟩/σ² − ‖μ‖²/(2σ²) − n·ln(√(2π)·σ)] =
= exp[⟨T(x), η(μ, σ)⟩ − A(μ, σ)]

with (component-wise)

T(x) = (‖x‖², x_1, x_2, …, x_n)
η(μ, σ) = (−1/(2σ²), μ_1/σ², μ_2/σ², …, μ_n/σ²)
A(μ, σ) = ‖μ‖²/(2σ²) + n·ln(√(2π)·σ)
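A minimal numerical check of this decomposition (a sketch, assuming NumPy; the function names T, eta and A simply mirror the symbols above, here for n = 2):

import numpy as np

def T(x):                           # sufficient statistic (||x||^2, x_1, ..., x_n)
    return np.concatenate(([np.dot(x, x)], x))

def eta(mu, sigma):                 # natural parameter (-1/(2 sigma^2), mu_1/sigma^2, ...)
    return np.concatenate(([-1.0 / (2 * sigma**2)], mu / sigma**2))

def A(mu, sigma):                   # log-partition function
    n = len(mu)
    return np.dot(mu, mu) / (2 * sigma**2) + n * np.log(np.sqrt(2 * np.pi) * sigma)

x, mu, sigma = np.array([1.0, -0.5]), np.array([0.3, 0.7]), 1.5
lhs = np.exp(-np.dot(x - mu, x - mu) / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)**2
rhs = np.exp(np.dot(T(x), eta(mu, sigma)) - A(mu, sigma))
assert np.isclose(lhs, rhs)         # both evaluate the same density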
Our cases
The joint probability distribution is in the exponential family:

p(x, y; w) = (1/Z(w))·exp⟨φ(x, y), w⟩,  Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

– supervised → Maximum Likelihood → gradient
– unsupervised → Maximum Likelihood → Expectation Maximization → gradient for the M-step
Supervised learning
Model:

p(x, y; w) = (1/Z(w))·exp⟨φ(x, y), w⟩,  Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Training set: L = ((x_l, y_l) …)

Maximum Likelihood reads

F(w) = Σ_l [⟨φ(x_l, y_l), w⟩ − ln Z(w)] = Σ_l ⟨φ(x_l, y_l), w⟩ − |L|·ln Z(w) → max_w

Gradient (normalized by |L|):

(1/|L|)·∂F(w)/∂w = (1/|L|)·Σ_l φ(x_l, y_l) − ∂ln Z(w)/∂w
Supervised learning
Partition function:

Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Gradient of the log-partition function (apply the chain rule):

∂ln Z(w)/∂w = (1/Z(w))·Σ_{x,y} exp⟨φ(x, y), w⟩·φ(x, y) =
= Σ_{x,y} (1/Z(w))·exp⟨φ(x, y), w⟩·φ(x, y) =
= Σ_{x,y} p(x, y; w)·φ(x, y) = E_{p(x,y;w)}[φ]
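For tiny discrete models the model statistics E_{p(x,y;w)}[φ] can be computed by brute-force enumeration of all pairs (x, y). A sketch (the toy feature function phi below is an assumption, not from the slides):

import itertools
import numpy as np

def phi(x, y):                      # hypothetical toy sufficient statistic
    return np.array([float(x == y), float(x), float(y)])

def model_statistics(w, values=range(3)):
    pairs = list(itertools.product(values, values))
    scores = np.array([np.dot(phi(x, y), w) for x, y in pairs])
    p = np.exp(scores - scores.max())       # unnormalized, numerically stabilized
    p /= p.sum()                            # p(x, y; w) = exp<phi(x,y), w> / Z(w)
    return sum(q * phi(x, y) for q, (x, y) in zip(p, pairs))   # E_p[phi]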
Supervised learning
The gradient is the difference of expectations of the sufficient statistics:

(1/|L|)·∂F(w)/∂w = (1/|L|)·Σ_l φ(x_l, y_l) − E_{p(x,y;w)}[φ] = E_L[φ] − E_{p(x,y;w)}[φ]

The first term is often called the data statistics, the second one the model statistics.

Gradient ascent:
1. compute the data statistics
2. repeat until convergence:
   a) compute the model statistics
   b) compute the gradient as their difference and apply it
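A minimal sketch of the whole loop, reusing the hypothetical phi and model_statistics from above (the step size lr and the fixed number of iterations are arbitrary choices):

def learn(L, dim=3, lr=0.1, steps=100):
    w = np.zeros(dim)
    data_stats = np.mean([phi(x, y) for x, y in L], axis=0)    # step 1, computed once
    for _ in range(steps):                                     # step 2
        grad = data_stats - model_statistics(w)                # difference of expectations
        w += lr * grad                                         # ascent step
    return w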
Unsupervised learning
Model (the same):

p(x, y; w) = (1/Z(w))·exp⟨φ(x, y), w⟩,  Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Training set (incomplete): L = (x_l …)

Expectation step:
α_l(y) = p(y | x_l; w)  ∀ l, y

Maximization step:
Σ_l Σ_y α_l(y)·ln p(x_l, y; w) → max_w
Unsupervised learning
Maximization:

Σ_l Σ_y α_l(y)·ln p(x_l, y; w) =
= Σ_l Σ_y α_l(y)·[⟨φ(x_l, y), w⟩ − ln Z(w)] =
= Σ_l Σ_y α_l(y)·⟨φ(x_l, y), w⟩ − Σ_l Σ_y α_l(y)·ln Z(w) =
= Σ_l Σ_y α_l(y)·⟨φ(x_l, y), w⟩ − |L|·ln Z(w)

(the last step uses Σ_y α_l(y) = 1)

The gradient is again a difference of expectations:

(1/|L|)·∂/∂w = (1/|L|)·Σ_l Σ_y α_l(y)·φ(x_l, y) − E_{p(x,y;w)}[φ] =
= (1/|L|)·Σ_l E_{p(y|x_l;w)}[φ(x_l, y)] − E_{p(x,y;w)}[φ]
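In the unsupervised case only the data statistics change: they are weighted by the posteriors α_l(y). A sketch of the E-step and of the new data statistics, again with the hypothetical toy model from above:

def posterior(x, w, values=range(3)):
    scores = np.array([np.dot(phi(x, y), w) for y in values])
    a = np.exp(scores - scores.max())
    return a / a.sum()                          # alpha(y) = p(y | x; w); Z(w) cancels

def unsupervised_data_statistics(X, w, values=range(3)):
    # (1/|L|) * sum_l sum_y alpha_l(y) * phi(x_l, y)
    stats = np.zeros(3)
    for x in X:
        a = posterior(x, w)
        stats += sum(a[y] * phi(x, y) for y in values)
    return stats / len(X)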
Conclusion (exponential family)
In both variants the gradient of the log-likelihood is a difference between expectations of the sufficient statistics:

∂ln L/∂w = E_data[φ] − E_model[φ]

→ the likelihood is at an optimum when the two coincide.

In the supervised case the "data" expectation is a simple average over the training set → E_data does not depend on w
→ the problem is concave → global optimum.
MRFs are members of the exponential family
In the parameter vector w ∈ R^d there is a component for each ψ-value of the task, i.e. for each tuple (i, k) or (i, j, k, k′).

φ(y) is composed of "indicator" values that are 1 if the corresponding ψ-value "is contained" in the energy E(y):

φ_{ijkk′}(y) = 1 if y_i = k and y_j = k′, 0 otherwise

→ the energy of a labeling can be written as a scalar product: E(y; w) = ⟨φ(y), w⟩
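A small sketch of this construction for a 3-node chain with pairwise terms only (the indexing of w by tuples is an assumption; any consistent enumeration of the tuples works):

import numpy as np

edges = [(0, 1), (1, 2)]            # hypothetical chain y_0 - y_1 - y_2
K = 2                               # two labels
w = {(e, k, kp): np.random.randn()  # one weight per tuple (edge, k, k')
     for e in range(len(edges)) for k in range(K) for kp in range(K)}

def phi(y):                         # indicators: 1 iff edge e realizes the pair (k, k')
    return {(e, k, kp): float(y[i] == k and y[j] == kp)
            for e, (i, j) in enumerate(edges) for k in range(K) for kp in range(K)}

def energy(y):                      # E(y; w) = <phi(y), w>
    f = phi(y)
    return sum(f[t] * w[t] for t in w)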
φ and their expectations
Consider the expectation of just one component of φ:

E_{p(y;ψ)}[φ_{ijkk′}] = Σ_y p(y; ψ)·δ(y_i = k, y_j = k′) =
= Σ_{y: y_i=k, y_j=k′} p(y; ψ) = p(y_i = k, y_j = k′; ψ)

→ the expectation is the corresponding marginal probability.

Putting it all together – gradient ascent:
1. count the marginal probabilities in the training set
2. repeat until convergence:
   a) compute the marginal probabilities for the current model
   b) compute the gradient as their difference and apply it
Remember the inference with an additive loss
1. Compute the marginal probability distributions

   p(y_i = k | x) = Σ_{y: y_i=k} p(y | x)

   for each variable i and each value k.

2. Decide for each variable "independently" according to its marginal probability distribution and the local loss c_i:

   Σ_{k∈K} c_i(y_i, k)·p(y_i = k | x) → min_{y_i}

This is again a Bayesian decision problem – minimize the average loss.
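A sketch of step 2 for a single variable, given its marginals and a local loss matrix (the Hamming loss below is just one possible choice of c_i):

import numpy as np

def decide(marginals, c):
    # expected loss of deciding d: sum_k c[d, k] * p(y_i = k | x)
    risk = c @ marginals
    return int(np.argmin(risk))

p_i = np.array([0.1, 0.6, 0.3])     # marginal p(y_i = k | x) for K = 3 labels
hamming = 1.0 - np.eye(3)           # c(d, k) = [d != k]
print(decide(p_i, hamming))         # -> 1, i.e. the marginal maximizer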
Remember the "question"
How to compute the marginal probability distributions

p(y_i = k | x) = Σ_{y: y_i=k} p(y | x) ?

It is not necessary to eat up the whole kettle in order to taste a soup. It is often enough to stir it carefully and take just a spoonful.

The idea: instead of summing over all labelings, sample a couple of them according to the target probability distribution and average → the probabilities are substituted by relative frequencies.
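A minimal sketch of this substitution (draw_sample is a hypothetical function that returns one full labeling drawn from p(y | x), e.g. by the Gibbs sampling described below):

import numpy as np

def estimate_marginals(draw_sample, m, K, n_samples=1000):
    counts = np.zeros((m, K))               # counts[i, k]: how often y_i = k
    for _ in range(n_samples):
        y = draw_sample()                   # one labeling from p(y | x)
        for i, k in enumerate(y):
            counts[i, k] += 1
    return counts / n_samples               # relative frequencies ≈ p(y_i = k | x)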
Sampling
Example: the values of a discrete variable x ∈ {1, 2, 3, 4, 5, 6} have to be drawn from p(x) = (0.1, 0.2, 0.4, 0.05, 0.15, 0.1).

The algorithm (input: p, output: a sample from p), here in Python:

import random

def sample(p):
    a = [p[0]]                      # cumulative sums: a[i] = p[0] + ... + p[i]
    for i in range(1, len(p)):
        a.append(a[i - 1] + p[i])
    r = random.random()             # r drawn uniformly from [0, 1)
    for i in range(len(p)):
        if a[i] > r:
            return i + 1            # values are numbered 1..n
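A quick check for the distribution above (a usage sketch; the relative frequencies should approach p):

p = [0.1, 0.2, 0.4, 0.05, 0.15, 0.1]
draws = [sample(p) for _ in range(10000)]
print(draws.count(3) / len(draws))  # should be close to p[2] = 0.4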
Gibbs Sampling
Task: draw an x = (x_1, x_2, …, x_m) (a vector) from p(x). Problem: p(x) is not given explicitly.

The way out:
– start with an arbitrary x^0
– sample the new x^{t+1} "component-wise" from the conditional probability distributions p(x_i | x_1^t … x_{i−1}^t, x_{i+1}^t … x_m^t)
– repeat this for all components i many times

After such a sampling procedure (under some mild conditions):
– x^n does not depend on x^0
– x^n follows the target probability distribution p(x)
Gibbs Sampling
In MRFs the conditional probability distributions can be computed easily!

The Markov property

p(y_i | y_{V∖i}) = p(y_i | y_{N(i)})

(i.e. under the condition that the labels in the neighbouring nodes are fixed; N(i) is the neighbourhood structure) leads to

p(y_i = k | y_{N(i)}) ∝ exp[−ψ_i(k) − Σ_{j∈N(i)} ψ_{ij}(k, y_j)]
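A sketch of one Gibbs sweep over all nodes (the layout psi_u[i, k] for the unary terms and one shared pairwise matrix psi_p[k, k'] are assumptions made for compactness):

import numpy as np

def gibbs_sweep(y, psi_u, psi_p, neighbours, rng):
    # y: current labeling; neighbours[i]: the nodes j in N(i)
    K = psi_u.shape[1]
    for i in range(len(y)):
        # conditional energies E_i(k) = psi_i(k) + sum_{j in N(i)} psi_ij(k, y_j)
        E = psi_u[i].copy()
        for j in neighbours[i]:
            E += psi_p[:, y[j]]
        p = np.exp(-(E - E.min()))  # p(y_i = k | y_N(i)) ∝ exp(-E_i(k))
        p /= p.sum()
        y[i] = rng.choice(K, p=p)   # draw the new label for node i
    return y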
Gibbs Sampling
A relation to Iterated Conditional Modes:
– ICM considers the "conditional energies"
Ei(k) = ψi(k) + X
j∈N(i)
ψij(k, yj) and decides for the bestlabel – Gibbs Sampling draws new labels
according to the conditional probabilities
p(yi=k|yN(i))∝exph−Ei(k)i
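In code the two differ only in the label update. A small sketch, given a vector of conditional energies E_i(k) at some node (values chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
E = np.array([1.0, 0.2, 0.7])               # conditional energies E_i(k)
y_icm = int(np.argmin(E))                   # ICM: deterministically take the best label
p = np.exp(-(E - E.min())); p /= p.sum()
y_gibbs = int(rng.choice(len(E), p=p))      # Gibbs: draw a label at random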
An example
Binary segmentation with an Ising model (fixed, i.e. not learned), unknown p(x_i | y_i) (non-parametric, histogram), unsupervised learning on the fly.

The marginal label probabilities are necessary for both learning and inference.