Machine Learning II Statistical Learning
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 04.07.2014
Outline
– Exponential Family: sufficient statistics, learning (supervised and unsupervised)
– MRF is a member of the Exponential Family: what are the necessary sufficient statistics, what has to be computed for the gradient?
– Recap: inference with additive loss functions
– Gibbs Sampling: how to compute it, putting it all together
Exponential family
General form:

p(x; θ) = h(x)·exp[⟨η(θ), T(x)⟩ − A(θ)]

with
– x is a random variable
– θ is a parameter
– η(θ) is the natural parameter, a vector (often η(θ) = θ)
– T(x) is a sufficient statistic
– A(θ) is the log-partition function
Almost all probability distributions you can imagine are members of the exponential family
Exponential family, example – Gaussians
Variable: x ∈ R^n, parameters: μ ∈ R^n and σ ∈ R

p(x; μ, σ) = 1/(√(2π)·σ)^n · exp[−‖x − μ‖² / (2σ²)] =
= exp[−(1/(2σ²))·‖x‖² + ⟨x, μ⟩/σ² − ‖μ‖²/(2σ²) − n·ln(√(2π)·σ)] =
= exp[⟨T(x), η(μ, σ)⟩ − A(μ, σ)]

with (component-wise)

T(x) = (‖x‖², x_1, x_2, …, x_n)
η(μ, σ) = (−1/(2σ²), μ_1/σ², μ_2/σ², …, μ_n/σ²)
A(μ, σ) = ‖μ‖²/(2σ²) + n·ln(√(2π)·σ)
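A minimal numerical check of this decomposition (a sketch, assuming NumPy; the function names T, eta and A simply mirror the symbols above, here for n = 2):

import numpy as np

def T(x):                           # sufficient statistic (||x||^2, x_1, ..., x_n)
    return np.concatenate(([np.dot(x, x)], x))

def eta(mu, sigma):                 # natural parameter (-1/(2 sigma^2), mu_1/sigma^2, ...)
    return np.concatenate(([-1.0 / (2 * sigma**2)], mu / sigma**2))

def A(mu, sigma):                   # log-partition function
    n = len(mu)
    return np.dot(mu, mu) / (2 * sigma**2) + n * np.log(np.sqrt(2 * np.pi) * sigma)

x, mu, sigma = np.array([1.0, -0.5]), np.array([0.3, 0.7]), 1.5
lhs = np.exp(-np.dot(x - mu, x - mu) / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)**2
rhs = np.exp(np.dot(T(x), eta(mu, sigma)) - A(mu, sigma))
assert np.isclose(lhs, rhs)         # both evaluate the same density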
Our cases
The joint probability distribution is in the exponential family:

p(x, y; w) = (1/Z(w))·exp⟨φ(x, y), w⟩,  Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

– supervised → Maximum Likelihood → gradient
– unsupervised → Maximum Likelihood → Expectation Maximization → gradient for the M-step
Supervised learning
Model:

p(x, y; w) = (1/Z(w))·exp⟨φ(x, y), w⟩,  Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Training set: L = ((x_l, y_l) …)

Maximum Likelihood reads

F(w) = Σ_l [⟨φ(x_l, y_l), w⟩ − ln Z(w)] = Σ_l ⟨φ(x_l, y_l), w⟩ − |L|·ln Z(w) → max_w

Gradient (normalized by |L|):

(1/|L|)·∂F(w)/∂w = (1/|L|)·Σ_l φ(x_l, y_l) − ∂ln Z(w)/∂w
Supervised learning
Partition function:

Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Gradient of the log-partition function (apply the chain rule):

∂ln Z(w)/∂w = (1/Z(w))·Σ_{x,y} exp⟨φ(x, y), w⟩·φ(x, y) =
= Σ_{x,y} (1/Z(w))·exp⟨φ(x, y), w⟩·φ(x, y) =
= Σ_{x,y} p(x, y; w)·φ(x, y) = E_{p(x,y;w)}[φ]
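For tiny discrete models the model statistics E_{p(x,y;w)}[φ] can be computed by brute-force enumeration of all pairs (x, y). A sketch (the toy feature function phi below is an assumption, not from the slides):

import itertools
import numpy as np

def phi(x, y):                      # hypothetical toy sufficient statistic
    return np.array([float(x == y), float(x), float(y)])

def model_statistics(w, values=range(3)):
    pairs = list(itertools.product(values, values))
    scores = np.array([np.dot(phi(x, y), w) for x, y in pairs])
    p = np.exp(scores - scores.max())       # unnormalized, numerically stabilized
    p /= p.sum()                            # p(x, y; w) = exp<phi(x,y), w> / Z(w)
    return sum(q * phi(x, y) for q, (x, y) in zip(p, pairs))   # E_p[phi]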
Supervised learning
The gradient is the difference of expectations of the sufficient statistics:

(1/|L|)·∂F(w)/∂w = (1/|L|)·Σ_l φ(x_l, y_l) − E_{p(x,y;w)}[φ] = E_L[φ] − E_{p(x,y;w)}[φ]

The first term is often called the data statistics, the second one the model statistics.

Gradient ascent:
1. compute the data statistics
2. repeat until convergence:
   a) compute the model statistics
   b) compute the gradient as their difference and apply it
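A minimal sketch of the whole loop, reusing the hypothetical phi and model_statistics from above (the step size lr and the fixed number of iterations are arbitrary choices):

def learn(L, dim=3, lr=0.1, steps=100):
    w = np.zeros(dim)
    data_stats = np.mean([phi(x, y) for x, y in L], axis=0)    # step 1, computed once
    for _ in range(steps):                                     # step 2
        grad = data_stats - model_statistics(w)                # difference of expectations
        w += lr * grad                                         # ascent step
    return w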
Unsupervised learning
Model (the same):

p(x, y; w) = (1/Z(w))·exp⟨φ(x, y), w⟩,  Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Training set (incomplete): L = (x_l …)

Expectation step:
α_l(y) = p(y | x_l; w)  ∀ l, y

Maximization step:
Σ_l Σ_y α_l(y)·ln p(x_l, y; w) → max_w
Unsupervised learning
Maximization:

Σ_l Σ_y α_l(y)·ln p(x_l, y; w) =
= Σ_l Σ_y α_l(y)·[⟨φ(x_l, y), w⟩ − ln Z(w)] =
= Σ_l Σ_y α_l(y)·⟨φ(x_l, y), w⟩ − Σ_l Σ_y α_l(y)·ln Z(w) =
= Σ_l Σ_y α_l(y)·⟨φ(x_l, y), w⟩ − |L|·ln Z(w)

(the last step uses Σ_y α_l(y) = 1)

The gradient is again a difference of expectations:

(1/|L|)·∂/∂w = (1/|L|)·Σ_l Σ_y α_l(y)·φ(x_l, y) − E_{p(x,y;w)}[φ] =
= (1/|L|)·Σ_l E_{p(y|x_l;w)}[φ(x_l, y)] − E_{p(x,y;w)}[φ]
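In the unsupervised case only the data statistics change: they are weighted by the posteriors α_l(y). A sketch of the E-step and of the new data statistics, again with the hypothetical toy model from above:

def posterior(x, w, values=range(3)):
    scores = np.array([np.dot(phi(x, y), w) for y in values])
    a = np.exp(scores - scores.max())
    return a / a.sum()                          # alpha(y) = p(y | x; w); Z(w) cancels

def unsupervised_data_statistics(X, w, values=range(3)):
    # (1/|L|) * sum_l sum_y alpha_l(y) * phi(x_l, y)
    stats = np.zeros(3)
    for x in X:
        a = posterior(x, w)
        stats += sum(a[y] * phi(x, y) for y in values)
    return stats / len(X)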
Conclusion (exponential family)
In both variants the gradient of the log-likelihood is a difference between expectations of the sufficient statistics:

∂ln L/∂w = E_data[φ] − E_model[φ]

→ the likelihood is at an optimum when the two coincide.

In the supervised case the "data" expectation is a simple average over the training set → E_data does not depend on w
→ the problem is concave → global optimum.
MRFs are members of the exponential family
In the parameter vector w ∈ R^d there is a component for each ψ-value of the task, i.e. for each tuple (i, k) or (i, j, k, k′).

φ(y) is composed of "indicator" values that are 1 if the corresponding ψ-value "is contained" in the energy E(y):

φ_{ijkk′}(y) = 1 if y_i = k and y_j = k′, 0 otherwise

→ the energy of a labeling can be written as a scalar product: E(y; w) = ⟨φ(y), w⟩
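A small sketch of this construction for a 3-node chain with pairwise terms only (the indexing of w by tuples is an assumption; any consistent enumeration of the tuples works):

import numpy as np

edges = [(0, 1), (1, 2)]            # hypothetical chain y_0 - y_1 - y_2
K = 2                               # two labels
w = {(e, k, kp): np.random.randn()  # one weight per tuple (edge, k, k')
     for e in range(len(edges)) for k in range(K) for kp in range(K)}

def phi(y):                         # indicators: 1 iff edge e realizes the pair (k, k')
    return {(e, k, kp): float(y[i] == k and y[j] == kp)
            for e, (i, j) in enumerate(edges) for k in range(K) for kp in range(K)}

def energy(y):                      # E(y; w) = <phi(y), w>
    f = phi(y)
    return sum(f[t] * w[t] for t in w)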
φ and their expectations
Consider the expectation of just one component of φ:

E_{p(y;ψ)}[φ_{ijkk′}] = Σ_y p(y; ψ)·δ(y_i = k, y_j = k′) =
= Σ_{y: y_i=k, y_j=k′} p(y; ψ) = p(y_i = k, y_j = k′; ψ)

→ the expectation is the corresponding marginal probability.

Putting it all together – gradient ascent:
1. count the marginal probabilities in the training set
2. repeat until convergence:
   a) compute the marginal probabilities for the current model
   b) compute the gradient as their difference and apply it
Remember the inference with an additive loss
1. Compute the marginal probability distributions

   p(y_i = k | x) = Σ_{y: y_i=k} p(y | x)

   for each variable i and each value k.

2. Decide for each variable "independently" according to its marginal probability distribution and the local loss c_i:

   Σ_{k∈K} c_i(y_i, k)·p(y_i = k | x) → min_{y_i}

This is again a Bayesian decision problem – minimize the average loss.
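A sketch of step 2 for a single variable, given its marginals and a local loss matrix (the Hamming loss below is just one possible choice of c_i):

import numpy as np

def decide(marginals, c):
    # expected loss of deciding d: sum_k c[d, k] * p(y_i = k | x)
    risk = c @ marginals
    return int(np.argmin(risk))

p_i = np.array([0.1, 0.6, 0.3])     # marginal p(y_i = k | x) for K = 3 labels
hamming = 1.0 - np.eye(3)           # c(d, k) = [d != k]
print(decide(p_i, hamming))         # -> 1, i.e. the marginal maximizer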
Remember the "question"
How to compute the marginal probability distributions

p(y_i = k | x) = Σ_{y: y_i=k} p(y | x) ?

It is not necessary to eat up the whole kettle in order to taste a soup. It is often enough to stir it carefully and take just a spoonful.

The idea: instead of summing over all labelings, sample a couple of them according to the target probability distribution and average → the probabilities are substituted by relative frequencies.
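A minimal sketch of this substitution (draw_sample is a hypothetical function that returns one full labeling drawn from p(y | x), e.g. by the Gibbs sampling described below):

import numpy as np

def estimate_marginals(draw_sample, m, K, n_samples=1000):
    counts = np.zeros((m, K))               # counts[i, k]: how often y_i = k
    for _ in range(n_samples):
        y = draw_sample()                   # one labeling from p(y | x)
        for i, k in enumerate(y):
            counts[i, k] += 1
    return counts / n_samples               # relative frequencies ≈ p(y_i = k | x)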
Sampling
Example: the values of a discrete variable x ∈ {1, 2, 3, 4, 5, 6} have to be drawn from p(x) = (0.1, 0.2, 0.4, 0.05, 0.15, 0.1).

The algorithm (input: p, output: a sample from p), here in Python:

import random

def sample(p):
    a = [p[0]]                      # cumulative sums: a[i] = p[0] + ... + p[i]
    for i in range(1, len(p)):
        a.append(a[i - 1] + p[i])
    r = random.random()             # r drawn uniformly from [0, 1)
    for i in range(len(p)):
        if a[i] > r:
            return i + 1            # values are numbered 1..n
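A quick check for the distribution above (a usage sketch; the relative frequencies should approach p):

p = [0.1, 0.2, 0.4, 0.05, 0.15, 0.1]
draws = [sample(p) for _ in range(10000)]
print(draws.count(3) / len(draws))  # should be close to p[2] = 0.4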
Gibbs Sampling
Task: draw an x = (x_1, x_2, …, x_m) (a vector) from p(x). Problem: p(x) is not given explicitly.

The way out:
– start with an arbitrary x^0
– sample the new x^{t+1} "component-wise" from the conditional probability distributions p(x_i | x_1^t … x_{i−1}^t, x_{i+1}^t … x_m^t)
– repeat this for all components i many times

After such a sampling procedure (under some mild conditions):
– x^n does not depend on x^0
– x^n follows the target probability distribution p(x)
Gibbs Sampling
In MRFs the conditional probability distributions can be computed easily!

The Markov property

p(y_i | y_{V∖i}) = p(y_i | y_{N(i)})

(i.e. under the condition that the labels in the neighbouring nodes are fixed; N(i) is the neighbourhood structure) leads to

p(y_i = k | y_{N(i)}) ∝ exp[−ψ_i(k) − Σ_{j∈N(i)} ψ_{ij}(k, y_j)]
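A sketch of one Gibbs sweep over all nodes (the layout psi_u[i, k] for the unary terms and one shared pairwise matrix psi_p[k, k'] are assumptions made for compactness):

import numpy as np

def gibbs_sweep(y, psi_u, psi_p, neighbours, rng):
    # y: current labeling; neighbours[i]: the nodes j in N(i)
    K = psi_u.shape[1]
    for i in range(len(y)):
        # conditional energies E_i(k) = psi_i(k) + sum_{j in N(i)} psi_ij(k, y_j)
        E = psi_u[i].copy()
        for j in neighbours[i]:
            E += psi_p[:, y[j]]
        p = np.exp(-(E - E.min()))  # p(y_i = k | y_N(i)) ∝ exp(-E_i(k))
        p /= p.sum()
        y[i] = rng.choice(K, p=p)   # draw the new label for node i
    return y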
Gibbs Sampling
A relation to Iterated Conditional Modes:
– ICM considers the "conditional energies"
Ei(k) = ψi(k) + X
j∈N(i)
ψij(k, yj) and decides for the bestlabel – Gibbs Sampling draws new labels
according to the conditional probabilities
p(yi=k|yN(i))∝exph−Ei(k)i
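In code the two differ only in the label update. A small sketch, given a vector of conditional energies E_i(k) at some node (values chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
E = np.array([1.0, 0.2, 0.7])               # conditional energies E_i(k)
y_icm = int(np.argmin(E))                   # ICM: deterministically take the best label
p = np.exp(-(E - E.min())); p /= p.sum()
y_gibbs = int(rng.choice(len(E), p=p))      # Gibbs: draw a label at random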
An example
Binary segmentation with an Ising model (fixed, i.e. not learned), unknown p(x_i | y_i) (non-parametric, histogram), unsupervised learning on the fly.

The marginal label probabilities are necessary for both learning and inference.