
(1)

Machine Learning Exponential Family

Dmitrij Schlesinger

WS2014/2015, 24.11.2013

(2)

General form

p(x;\theta) = h(x)\,\exp\big[\langle \eta(\theta),\, T(x)\rangle - A(\theta)\big]

with

– x is a random variable, \theta is a parameter
– \eta(\theta) is the natural parameter, a vector (often \eta(\theta) = \theta)
– T(x) is a sufficient statistic
– A(\theta) is the log-partition function

Almost all probability distributions you can imagine are members of the exponential family
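
As a quick worked example (not from the slides), the Bernoulli distribution with parameter \pi can be written in exactly this form:

p(x;\pi) = \pi^x (1-\pi)^{1-x}
         = \exp\Big[x\,\ln\frac{\pi}{1-\pi} + \ln(1-\pi)\Big], \qquad x \in \{0, 1\}

so that h(x) = 1, T(x) = x, \eta(\pi) = \ln\frac{\pi}{1-\pi} and A = -\ln(1-\pi) = \ln(1 + e^{\eta}).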

(3)

Example – Gaussian

Generalized linear models:

p(x;w) = \frac{1}{Z(w)}\,\exp\big[\langle \phi(x),\, w\rangle\big]

i.e. \phi(x) \equiv T(x), \quad \eta(\theta) \equiv \theta \equiv w, \quad Z(w) \equiv \exp A(\theta), \quad h(x) \equiv 1

Gaussian (1D for simplicity):

p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\Big[-\frac{(x-\mu)^2}{2\sigma^2}\Big]
\propto \exp\Big[-\frac{1}{2\sigma^2}\,x^2 + \frac{\mu}{\sigma^2}\,x\Big]
\propto \exp\big[\langle \phi(x),\, w\rangle\big]

with

\phi(x) = (x^2,\, x), \qquad w = \Big(-\frac{1}{2\sigma^2},\ \frac{\mu}{\sigma^2}\Big)
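
A minimal numeric check of this reparameterization (not part of the slides; all names are illustrative): for a fixed \mu, \sigma the unnormalized density \exp\langle\phi(x), w\rangle differs from the Gaussian only by an x-independent factor.

import numpy as np

mu, sigma = 1.5, 0.7
w = np.array([-1.0 / (2 * sigma**2), mu / sigma**2])     # natural parameters

def phi(x):
    """Sufficient statistic phi(x) = (x^2, x)."""
    return np.array([x**2, x])

def gauss(x):
    """Ordinary Gaussian density N(x; mu, sigma)."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# The ratio N(x) / exp<phi(x), w> should be the same constant for every x.
xs = np.linspace(-3.0, 3.0, 7)
ratios = [gauss(x) / np.exp(phi(x) @ w) for x in xs]
print(np.allclose(ratios, ratios[0]))                    # True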

(4)

Our models

Let x be an observed variable and y a hidden one.

1. The joint probability distribution is in the exponential family (a generative model):

p(x,y;w) = \frac{1}{Z(w)}\,\exp\big[\langle \phi(x,y),\, w\rangle\big],
\qquad Z(w) = \sum_{x,y} \exp\big[\langle \phi(x,y),\, w\rangle\big]

2. The conditional probability distribution is in the exponential family (a discriminative model):

p(x,y;w) = p(x)\cdot p(y|x;w),
\qquad p(y|x;w) = \frac{1}{Z(w,x)}\,\exp\big[\langle \phi(x,y),\, w\rangle\big],
\qquad Z(w,x) = \sum_{y} \exp\big[\langle \phi(x,y),\, w\rangle\big] \quad \forall x
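
For a small discrete model both partition functions can be computed by direct enumeration. A sketch under that assumption (toy spaces and a random feature table, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X, Y, D = range(4), range(3), 5
feat = rng.normal(size=(len(X), len(Y), D))      # phi(x, y) as a lookup table
w = rng.normal(size=D)

def score(x, y):
    return feat[x, y] @ w                        # <phi(x, y), w>

# Generative model: one global partition function over all pairs (x, y)
Z = sum(np.exp(score(x, y)) for x in X for y in Y)
p_joint = {(x, y): np.exp(score(x, y)) / Z for x in X for y in Y}

# Discriminative model: one partition function Z(w, x) per observation x
def p_cond(y, x):
    Zx = sum(np.exp(score(x, yy)) for yy in Y)
    return np.exp(score(x, y)) / Zx

print(sum(p_joint.values()))                     # ~1.0
print(sum(p_cond(y, 0) for y in Y))              # ~1.0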

(5)

Inference

For both generative and discriminative models the posterior probability distribution (what is needed for inference) looks like

p(y|x;w) \propto \exp\big[\langle \phi(x,y),\, w\rangle\big]

⇒ once the model is learned, it does not matter whether "it was" a generative or a discriminative one.

For example, the Maximum A-posteriori decision is

y^* = \arg\max_y\ \langle \phi(x,y),\, w\rangle

In fact, it is a "generalized" Fisher classifier (see the previous lecture), i.e. scalar products in a feature space.
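
A minimal sketch of this decision rule for a finite label set (illustrative names; the feature vectors stand for \phi(x,y) at one fixed observation x):

import numpy as np

rng = np.random.default_rng(1)
Y, D = range(3), 5
feat = {y: rng.normal(size=D) for y in Y}    # phi(x, y) for the fixed observation x
w = rng.normal(size=D)

# The posterior is proportional to exp<phi(x, y), w>, so maximizing it over y
# is the same as maximizing the scalar product itself.
y_map = max(Y, key=lambda y: feat[y] @ w)
print(y_map)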

(6)

Our learning schemes

– Generative model, supervised → Maximum Likelihood, Gradient

– Discriminative model, supervised → Maximum Conditional Likelihood, Gradient

– Generative model, unsupervised → Maximum Likelihood, Expectation Maximization, Gradient for the M-step

(7)

Generative model, supervised

Model:

p(x,y;w) = \frac{1}{Z(w)}\,\exp\big[\langle \phi(x,y),\, w\rangle\big],
\qquad Z(w) = \sum_{x,y} \exp\big[\langle \phi(x,y),\, w\rangle\big]

Training set: L = \{(x^l, y^l), \ldots\}

Maximum Likelihood:

\sum_l \Big[\langle \phi(x^l,y^l),\, w\rangle - \ln Z(w)\Big] \to \max_w

Gradient of the log-likelihood, normalized by |L| (the normalization does not change the maximizer):

\frac{\partial \ln L}{\partial w} = \frac{1}{|L|} \sum_l \phi(x^l,y^l) - \frac{\partial \ln Z(w)}{\partial w}

(8)

Generative model, supervised

Partition function:

Z(w) = \sum_{x,y} \exp\big[\langle \phi(x,y),\, w\rangle\big]

Gradient of the log-partition function:

\frac{\partial \ln Z(w)}{\partial w}
= \frac{1}{Z(w)} \sum_{x,y} \exp\big[\langle \phi(x,y),\, w\rangle\big] \cdot \phi(x,y)
= \sum_{x,y} p(x,y;w)\cdot\phi(x,y)
= \mathbb{E}_{p(x,y;w)}[\phi]

The gradient is the difference of expectations:

\frac{\partial \ln L}{\partial w} = \frac{1}{|L|} \sum_l \phi(x^l,y^l) - \mathbb{E}_{p(x,y;w)}[\phi] = \mathbb{E}_L[\phi] - \mathbb{E}_{p(x,y;w)}[\phi]
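
A hedged sketch of learning with this gradient on a toy discrete model (illustrative names, Z(w) computed by enumeration, plain gradient ascent with a hand-picked step size):

import numpy as np

rng = np.random.default_rng(0)
X, Y, D = range(4), range(3), 6
feat = rng.normal(size=(len(X), len(Y), D))            # phi(x, y) as a lookup table

def model_dist(w):
    """Joint distribution p(x, y; w) on the small discrete space."""
    scores = np.einsum('xyd,d->xy', feat, w)
    p = np.exp(scores - scores.max())
    return p / p.sum()

# "Training data": samples drawn from some reference parameter vector
w_ref = rng.normal(size=D)
p_ref = model_dist(w_ref)
idx = rng.choice(p_ref.size, size=2000, p=p_ref.ravel())
xs, ys = np.unravel_index(idx, p_ref.shape)
E_data = feat[xs, ys].mean(axis=0)                     # E_L[phi]

# Gradient ascent with grad = E_L[phi] - E_{p(x,y;w)}[phi]
w = np.zeros(D)
for _ in range(1000):
    E_model = np.einsum('xy,xyd->d', model_dist(w), feat)
    w += 0.3 * (E_data - E_model)

print(np.abs(E_data - np.einsum('xy,xyd->d', model_dist(w), feat)).max())  # close to 0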

(9)

Example – Gaussian

The gradient (no hidden variables):

\frac{\partial \ln L}{\partial w} = \frac{1}{|L|} \sum_l \phi(x^l) - \mathbb{E}_{p(x;w)}[\phi]

Remember the sufficient statistic for Gaussians, \phi(x) = (x^2, x); substituting it into the gradient gives

\frac{\partial \ln L}{\partial w_1} = \frac{1}{|L|} \sum_l (x^l)^2 - \int_{-\infty}^{\infty} p(x;w)\, x^2\, dx

\frac{\partial \ln L}{\partial w_2} = \frac{1}{|L|} \sum_l x^l - \int_{-\infty}^{\infty} p(x;w)\, x\, dx

For Gaussians the needed model expectations can be computed analytically ⇒ set the gradient to zero and solve.

(10)

Example – Gaussian

For w_2 (remember that p(x;w) is a Gaussian N(x;\mu,\sigma)):

\int_{-\infty}^{\infty} p(x;w)\, x\, dx = \mu

and therefore

\frac{\partial \ln L}{\partial w_2} = \frac{1}{|L|} \sum_l x^l - \int_{-\infty}^{\infty} p(x;w)\, x\, dx = \frac{1}{|L|} \sum_l x^l - \mu = 0

Finally:

\mu = \frac{1}{|L|} \sum_l x^l

For w_1 and \sigma it is similar. Try it yourself (remember the exercise about the EM-algorithm).
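
In code the supervised Gaussian case reduces to sample moments; a minimal sketch (the expression for \sigma is the analogous result for w_1, which the slide leaves as an exercise):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)   # training set L = {x^l}

# Setting the gradient w.r.t. w2 to zero gives mu = empirical mean.
mu_hat = data.mean()

# Doing the same for w1 yields the second moment, hence the ML variance
# (this is the "try it yourself" part of the slide).
sigma_hat = np.sqrt(np.mean(data**2) - mu_hat**2)

print(mu_hat, sigma_hat)                             # close to 2.0 and 1.5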

(11)

Discriminative model (posterior), supervised

Model:

p(y|x;w) = \frac{1}{Z(w,x)}\,\exp\big[\langle \phi(x,y),\, w\rangle\big],
\qquad Z(w,x) = \sum_y \exp\big[\langle \phi(x,y),\, w\rangle\big] \quad \forall x

Training set: L = \{(x^l, y^l), \ldots\}

Maximum Conditional Likelihood:

\sum_l \Big[\langle \phi(x^l,y^l),\, w\rangle - \ln Z(w, x^l)\Big] \to \max_w

Gradient:

\frac{\partial \ln L}{\partial w} = \frac{1}{|L|} \sum_l \phi(x^l,y^l) - \frac{1}{|L|} \sum_l \frac{\partial \ln Z(w, x^l)}{\partial w}

(12)

Discriminative model (posterior), supervised

Partition function:

Z(w,x) = \sum_y \exp\big[\langle \phi(x,y),\, w\rangle\big]

Gradient of the log-partition function for a particular x^l:

\frac{\partial \ln Z(w, x^l)}{\partial w}
= \frac{1}{Z(w, x^l)} \sum_y \exp\big[\langle \phi(x^l,y),\, w\rangle\big] \cdot \phi(x^l,y)
= \sum_y p(y|x^l;w)\cdot\phi(x^l,y)
= \mathbb{E}_{p(y|x^l;w)}\big[\phi(x^l,\cdot)\big]

The gradient is again the difference of expectations:

\frac{\partial \ln L}{\partial w} = \mathbb{E}_L[\phi] - \frac{1}{|L|} \sum_l \mathbb{E}_{p(y|x^l;w)}\big[\phi(x^l,\cdot)\big]
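
A hedged sketch of this gradient for a finite label set (toy data, illustrative names; with the block feature map chosen here the model coincides with multinomial logistic regression):

import numpy as np

rng = np.random.default_rng(0)
n, K, d = 200, 3, 4
X = rng.normal(size=(n, d))                        # observations x^l
Y = rng.integers(K, size=n)                        # labels y^l

def phi(x, y):
    """Joint feature map: x copied into the block that belongs to class y."""
    f = np.zeros(K * d)
    f[y * d:(y + 1) * d] = x
    return f

def grad(w):
    """E_L[phi] minus the averaged conditional model expectation of phi."""
    g = np.zeros_like(w)
    for x, y in zip(X, Y):
        scores = np.array([phi(x, k) @ w for k in range(K)])
        p = np.exp(scores - scores.max())
        p /= p.sum()                               # p(y | x^l; w)
        g += phi(x, y) - sum(p[k] * phi(x, k) for k in range(K))
    return g / n

w = np.zeros(K * d)
for _ in range(300):
    w += 0.3 * grad(w)                             # plain gradient ascent
print(np.abs(grad(w)).max())                       # small near the optimum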

(13)

Generative model, unsupervised

Model:

p(x,y;w) = \frac{1}{Z(w)}\,\exp\big[\langle \phi(x,y),\, w\rangle\big],
\qquad Z(w) = \sum_{x,y} \exp\big[\langle \phi(x,y),\, w\rangle\big]

Training set (incomplete): L = \{x^l, \ldots\}

Expectation:

\alpha_l(y) = p(y|x^l;w) \quad \forall l, y

Maximization:

\sum_l \sum_y \alpha_l(y)\, \ln p(x^l, y;w) \to \max_w

(14)

Generative model, unsupervised

Maximization:

\sum_l \sum_y \alpha_l(y)\, \ln p(x^l, y;w)
= \sum_l \sum_y \alpha_l(y)\Big[\langle \phi(x^l,y),\, w\rangle - \ln Z(w)\Big]
= \sum_l \sum_y \alpha_l(y)\,\langle \phi(x^l,y),\, w\rangle - \sum_l \sum_y \alpha_l(y)\,\ln Z(w)
= \sum_l \sum_y \alpha_l(y)\,\langle \phi(x^l,y),\, w\rangle - |L|\cdot\ln Z(w)

(the last step uses \sum_y \alpha_l(y) = 1 for every l)

The gradient is again a difference of expectations:

\frac{1}{|L|}\,\frac{\partial}{\partial w}\Big[\sum_l \sum_y \alpha_l(y)\,\langle \phi(x^l,y),\, w\rangle - |L|\cdot\ln Z(w)\Big]
= \frac{1}{|L|} \sum_l \sum_y \alpha_l(y)\,\phi(x^l,y) - \mathbb{E}_{p(x,y;w)}[\phi]
= \frac{1}{|L|} \sum_l \mathbb{E}_{p(y|x^l)}\big[\phi(x^l,\cdot)\big] - \mathbb{E}_{p(x,y;w)}[\phi]
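
A hedged sketch of the resulting EM loop on a toy discrete model (illustrative names; the M-step runs a few gradient steps with this difference of expectations rather than an exact maximization):

import numpy as np

rng = np.random.default_rng(0)
X, Y, D = range(4), range(3), 6
feat = rng.normal(size=(len(X), len(Y), D))       # phi(x, y) as a lookup table
xs = rng.integers(len(X), size=1000)              # incomplete data: only x^l observed

def joint(w):
    """Joint distribution p(x, y; w) on the small discrete space."""
    s = np.einsum('xyd,d->xy', feat, w)
    p = np.exp(s - s.max())
    return p / p.sum()

w = np.zeros(D)
for _ in range(50):                               # EM iterations
    # E-step: alpha_l(y) = p(y | x^l; w); it only depends on the value of x^l
    p = joint(w)
    cond = p / p.sum(axis=1, keepdims=True)       # p(y | x; w)
    alpha = cond[xs]                              # shape (|L|, |Y|)

    # "Data" expectation: (1/|L|) sum_l sum_y alpha_l(y) phi(x^l, y)
    E_data = np.einsum('ly,lyd->d', alpha, feat[xs]) / len(xs)

    # M-step: a few ascent steps with grad = E_data - E_{p(x,y;w)}[phi]
    for _ in range(20):
        E_model = np.einsum('xy,xyd->d', joint(w), feat)
        w += 0.5 * (E_data - E_model)

print(np.round(joint(w).sum(axis=1), 3))          # learned marginal p(x; w)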

(15)

Conclusion

In all variants the gradient of the log-likelihood is a difference between expectations of the sufficient statistics:

\frac{\partial \ln L}{\partial w} = \mathbb{E}_{\text{data}}[\phi] - \mathbb{E}_{\text{model}}[\phi]

→ the likelihood is at an optimum if the two expectations coincide.

In supervised cases the "data" expectation is the simple average over the training set → \mathbb{E}_{\text{data}} does not depend on w
→ the problem is concave → global optimum.

(16)

Summary

Before:

– Statistical models: Bayesian Decision Theory, the Maximum Likelihood principle
– Generative → discriminative
– Linear classifiers

Today: the exponential family

Next lecture: Support Vector Machines and kernels, followed by various other discriminative approaches: empirical risk minimization, boosting, decision trees etc.
