Machine Learning Exponential Family
Dmitrij Schlesinger
WS2014/2015, 24.11.2013
General form
p(x; θ) = h(x) · exp[⟨η(θ), T(x)⟩ − A(θ)]   with
– x is a random variable
– θ is a parameter
– η(θ) is the natural parameter, a vector (often η(θ) = θ)
– T(x) is a sufficient statistic
– A(θ) is the log-partition function
Almost all probability distributions you can imagine are members of the exponential family
Example – Gaussian
Generalized linear models:

p(x; w) = (1/Z(w)) · exp⟨φ(x), w⟩

i.e. φ(x) ≡ T(x), η(θ) ≡ θ ≡ w, Z(w) ≡ exp A(θ), h(x) ≡ 1

Gaussian (1D for simplicity):

p(x; µ, σ) = 1/(√(2π) σ) · exp[−(x − µ)²/(2σ²)]
∝ exp[−x²/(2σ²) + µx/σ² − µ²/(2σ²)]
∝ exp⟨φ(x), w⟩   with
φ(x) = (x², x),   w = (−1/(2σ²), µ/σ²)
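As a quick numerical check of this rewriting (a sketch with made-up numbers; the constant factor hidden behind "∝" is restored explicitly), one can compare the exponential-family form with the standard Gaussian density:

import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8
w = np.array([-1.0 / (2 * sigma**2), mu / sigma**2])   # natural parameters

def phi(x):
    # sufficient statistics of the 1D Gaussian
    return np.array([x**2, x])

def p_expfam(x):
    # exp<phi(x), w> times the constant hidden behind the proportionality
    log_const = -mu**2 / (2 * sigma**2) - np.log(np.sqrt(2 * np.pi) * sigma)
    return np.exp(phi(x) @ w + log_const)

x = 0.3
print(p_expfam(x), norm.pdf(x, loc=mu, scale=sigma))   # the two values agree (~0.16)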
Our models
Let x be an observed variable and y a hidden one.

1. The joint probability distribution is in the exponential family (a generative model):

p(x, y; w) = (1/Z(w)) · exp⟨φ(x, y), w⟩,   Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

2. The conditional probability distribution is in the exponential family (a discriminative model):

p(x, y; w) = p(x) · p(y|x; w)
p(y|x; w) = (1/Z(w, x)) · exp⟨φ(x, y), w⟩,   Z(w, x) = Σ_y exp⟨φ(x, y), w⟩   ∀x
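For discrete x and y the two variants differ only in what the partition function sums over. A minimal sketch (array shapes and names are my own illustration, not part of the lecture):

import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3, 5))      # phi(x, y) for 4 values of x, 3 values of y, 5 features
w = rng.normal(size=5)

scores = phi @ w                      # <phi(x, y), w> for every pair (x, y)

# generative model: one global partition function Z(w)
p_joint = np.exp(scores) / np.exp(scores).sum()

# discriminative model: one partition function Z(w, x) per observation x
p_cond = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(p_joint.sum())                  # 1.0
print(p_cond.sum(axis=1))             # [1. 1. 1. 1.]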
Inference
For both the generative and the discriminative model, the posterior probability distribution (which is what is needed for inference) looks the same:

p(y|x; w) ∝ exp⟨φ(x, y), w⟩

⇒ once the model is learned, it does not matter whether "it was" a generative or a discriminative one.

For example, the Maximum A-posteriori decision is

y* = arg max_y ⟨φ(x, y), w⟩

In fact, this is a "generalized" Fisher classifier (see the previous lecture), i.e. it is based on scalar products in a feature space.
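As a sketch, the MAP decision then amounts to picking the label with the largest scalar product (the numbers below are purely illustrative):

import numpy as np

def map_decision(phi_x, w):
    # phi_x[y] = phi(x, y); return the label with maximal <phi(x, y), w>
    scores = phi_x @ w
    return int(np.argmax(scores))

phi_x = np.array([[1.0, 0.0], [0.2, 0.9], [0.5, 0.5]])   # 3 candidate labels, 2 features
w = np.array([0.3, 1.0])
print(map_decision(phi_x, w))                            # -> 1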
Our learning schemes
– Generative model, supervised → Maximum Likelihood, Gradient
– Discriminative model, supervised → Maximum Conditional Likelihood, Gradient
– Generative model, unsupervised → Maximum Likelihood, Expectation Maximization, Gradient for the M-step
Generative model, supervised
Model:

p(x, y; w) = (1/Z(w)) · exp⟨φ(x, y), w⟩,   Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Training set: L = ((x^l, y^l) ...)

Maximum Likelihood:

Σ_l [⟨φ(x^l, y^l), w⟩ − ln Z(w)] → max_w

Gradient:

∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − ∂ ln Z(w) / ∂w
Generative model, supervised
Partition function:

Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Gradient of the log-partition function:

∂ ln Z(w) / ∂w = (1/Z(w)) · Σ_{x,y} exp⟨φ(x, y), w⟩ · φ(x, y)
               = Σ_{x,y} p(x, y; w) · φ(x, y) = E_{p(x,y;w)}[φ]

The gradient is the difference of two expectations:

∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − E_{p(x,y;w)}[φ] = E_L[φ] − E_{p(x,y;w)}[φ]
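A minimal gradient-ascent sketch for a small discrete generative model (sizes, data and step size are made up for illustration); the update is exactly E_L[φ] − E_{p(x,y;w)}[φ]:

import numpy as np

rng = np.random.default_rng(1)
phi = rng.normal(size=(4, 3, 5)).reshape(12, 5)   # phi(x, y) for 4*3 configurations, 5 features

# empirical distribution of (x, y) in the training set (made-up counts)
counts = rng.integers(1, 6, size=12).astype(float)
E_data = (counts / counts.sum()) @ phi            # E_L[phi]

w = np.zeros(5)
for _ in range(2000):
    scores = phi @ w
    p = np.exp(scores - scores.max())
    p /= p.sum()                                  # p(x, y; w)
    E_model = p @ phi                             # E_{p(x,y;w)}[phi]
    w += 0.1 * (E_data - E_model)                 # gradient ascent on the log-likelihood

print(np.round(E_data - E_model, 4))              # ~0: the two expectations coincide at the optimum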
Example – Gaussian
The gradient (no hidden variables):

∂/∂w = (1/|L|) Σ_l φ(x^l) − E_{p(x;w)}[φ]

Remember the sufficient statistics for the Gaussian, φ(x) = (x², x); substituting them into the gradient gives

∂/∂w₁ = (1/|L|) Σ_l (x^l)² − ∫_{−∞}^{+∞} p(x; w) · x² dx

∂/∂w₂ = (1/|L|) Σ_l x^l − ∫_{−∞}^{+∞} p(x; w) · x dx

For Gaussians the needed model expectations can be computed analytically ⇒ set ∂/∂w = 0 and solve.
Example – Gaussian
For w₂ (remember that p(x; w) is the Gaussian N(x; µ, σ)):

∫_{−∞}^{+∞} p(x; w) · x dx = µ

It follows that

∂/∂w₂ = (1/|L|) Σ_l x^l − ∫_{−∞}^{+∞} p(x; w) · x dx = (1/|L|) Σ_l x^l − µ = 0

Finally:

µ = (1/|L|) Σ_l x^l

For w₁ and σ it is similar. Try it yourself (remember the exercise about the EM algorithm).
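A quick numerical sanity check with simulated data (a sketch, not part of the lecture): matching the model expectations of φ(x) = (x², x) to the data averages recovers the sample mean and variance.

import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(loc=1.5, scale=0.8, size=10_000)   # training samples x^l

# data expectations of the sufficient statistics (x^2, x)
E_x2, E_x = np.mean(xs**2), np.mean(xs)

# setting the gradient to zero: E_model[x] = mu, E_model[x^2] = sigma^2 + mu^2
mu = E_x
sigma2 = E_x2 - E_x**2

print(mu, np.sqrt(sigma2))   # close to the true values 1.5 and 0.8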
Discriminative model (posterior), supervised
Model:

p(y|x; w) = (1/Z(w, x)) · exp⟨φ(x, y), w⟩,   Z(w, x) = Σ_y exp⟨φ(x, y), w⟩   ∀x

Training set: L = ((x^l, y^l) ...)

Maximum Conditional Likelihood:

Σ_l [⟨φ(x^l, y^l), w⟩ − ln Z(w, x^l)] → max_w

Gradient:

∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − (1/|L|) Σ_l ∂ ln Z(w, x^l) / ∂w
Discriminative model (posterior), supervised
Partition function:

Z(w, x) = Σ_y exp⟨φ(x, y), w⟩

Gradient of the log-partition function for a particular x^l:

∂ ln Z(w, x^l) / ∂w = (1/Z(w, x^l)) · Σ_y exp⟨φ(x^l, y), w⟩ · φ(x^l, y)
                    = Σ_y p(y|x^l; w) · φ(x^l, y) = E_{p(y|x^l;w)}[φ(x^l, y)]

The gradient is again a difference of expectations:

∂/∂w = E_L[φ] − (1/|L|) Σ_l E_{p(y|x^l;w)}[φ(x^l, y)]
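A minimal sketch of this conditional gradient (data and names are illustrative); if φ(x, y) is chosen as a per-class copy of the input features, this is multinomial logistic regression:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))                     # observations x^l, 4 features
Y = rng.integers(0, 3, size=50)                  # labels y^l in {0, 1, 2}
W = np.zeros((3, 4))                             # w: one weight vector per class

for _ in range(500):
    scores = X @ W.T                             # <phi(x^l, y), w> for every l, y
    P = np.exp(scores - scores.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)            # p(y | x^l; w)

    E_data = np.zeros_like(W)
    np.add.at(E_data, Y, X)                      # sum_l phi(x^l, y^l), accumulated per class
    E_model = P.T @ X                            # sum_l E_{p(y|x^l;w)}[phi(x^l, y)]

    W += 0.1 * (E_data - E_model) / len(X)       # gradient ascent step

print(np.round((E_data - E_model) / len(X), 3))  # ~0: data and conditional model expectations coincide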
Generative model, unsupervised
Model:

p(x, y; w) = (1/Z(w)) · exp⟨φ(x, y), w⟩,   Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Training set (incomplete): L = (x^l ...)

Expectation:

α_l(y) = p(y|x^l; w)   ∀ l, y

Maximization:

Σ_l Σ_y α_l(y) · ln p(x^l, y; w) → max_w
Generative model, unsupervised
Maximization:

Σ_l Σ_y α_l(y) · ln p(x^l, y; w) =
= Σ_l Σ_y α_l(y) · [⟨φ(x^l, y), w⟩ − ln Z(w)] =
= Σ_l Σ_y α_l(y) · ⟨φ(x^l, y), w⟩ − Σ_l Σ_y α_l(y) · ln Z(w) =
= Σ_l Σ_y α_l(y) · ⟨φ(x^l, y), w⟩ − |L| · ln Z(w)

(using Σ_y α_l(y) = 1 for every l)

The gradient is again a difference of expectations:

∂/∂w = (1/|L|) Σ_l Σ_y α_l(y) · φ(x^l, y) − E_{p(x,y;w)}[φ] =
     = (1/|L|) Σ_l E_{p(y|x^l;w)}[φ(x^l, y)] − E_{p(x,y;w)}[φ]
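Putting the E-step and the M-step together for a small discrete model (a sketch with made-up sizes and data; the M-step reuses the gradient ascent from the supervised generative case):

import numpy as np

rng = np.random.default_rng(4)
nx, ny, d = 4, 3, 5
phi = rng.normal(size=(nx, ny, d))               # phi(x, y)
xs = rng.integers(0, nx, size=30)                # incomplete training set: only x^l observed
w = np.zeros(d)

def joint(w):                                    # p(x, y; w) as an (nx, ny) table
    s = np.tensordot(phi, w, axes=([2], [0]))
    p = np.exp(s - s.max())
    return p / p.sum()

for _ in range(20):                              # EM iterations
    # E-step: alpha_l(y) = p(y | x^l; w)
    p = joint(w)
    cond = p / p.sum(axis=1, keepdims=True)
    alpha = cond[xs]                             # shape (|L|, ny)

    # M-step: gradient ascent on sum_l sum_y alpha_l(y) ln p(x^l, y; w)
    E_data = np.einsum('ly,lyd->d', alpha, phi[xs]) / len(xs)
    for _ in range(200):
        p = joint(w)
        E_model = np.einsum('xy,xyd->d', p, phi)
        w += 0.1 * (E_data - E_model)

print(np.round(E_data - E_model, 3))             # small once EM has converged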
Conclusion
In all variants the gradient of the log-likelihood is a difference between expectations of the sufficient statistics:

∂ ln L / ∂w = E_data[φ] − E_model[φ]

→ the likelihood is at an optimum when the two expectations coincide.

In the supervised cases the "data" expectation is a simple average over the training set
→ E_data does not depend on w
→ the log-likelihood is concave in w → any local optimum is the global one.
Summary
Before:
– Statistical models: Bayesian Decision Theory, Maximum Likelihood principle
– Generative → discriminative
– Linear classifiers

Today: Exponential family

Next lecture: Support Vector Machines, Kernels,
followed by other discriminative approaches: empirical risk minimization, boosting, decision trees, etc.