Exponential family

N/A
N/A
Protected

Academic year: 2022

Aktie "Exponential family"

Copied!
20
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Machine Learning II Statistical Learning

Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug

SS2014, 04.07.2014

(2)

Outline

– Exponential family: sufficient statistics, learning (supervised and unsupervised)

– MRFs are members of the exponential family: which sufficient statistics are needed, what has to be computed for the gradient?

– Recap: inference with additive loss functions

– Gibbs sampling: how to compute it, putting it all together

(3)

Exponential family

General form:

p(x; θ) = h(x) · exp[⟨η(θ), T(x)⟩ − A(θ)]

with

– x is a random variable
– θ is a parameter
– η(θ) is the natural parameter, a vector (often η(θ) = θ)
– T(x) is a sufficient statistic
– A(θ) is the log-partition function

Almost all probability distributions you can imagine are members of the exponential family.

(4)

Exponential family, example – Gaussians

Variable: x ∈ R^n, parameters: µ ∈ R^n and σ ∈ R

p(x; µ, σ) = (1 / (√(2π)·σ)^n) · exp[−||x − µ||² / (2σ²)]
           = exp[−||x||²/(2σ²) + ⟨x, µ⟩/σ² − ||µ||²/(2σ²) − n·ln(√(2π)·σ)]
           = exp[⟨T(x), η(µ, σ)⟩ − A(µ, σ)]

with (component-wise)

T(x) = (||x||², x_1, x_2, ..., x_n)
η(µ, σ) = (−1/(2σ²), µ_1/σ², µ_2/σ², ..., µ_n/σ²)
A(µ, σ) = ||µ||²/(2σ²) + n·ln(√(2π)·σ)
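
This decomposition can be checked numerically. A minimal numpy sketch (all names are illustrative, not part of the lecture) that evaluates the density both directly and via T, η and A:

    import numpy as np

    def gauss_direct(x, mu, sigma):
        n = x.size
        norm = (np.sqrt(2 * np.pi) * sigma) ** n
        return np.exp(-np.sum((x - mu) ** 2) / (2 * sigma ** 2)) / norm

    def gauss_expfam(x, mu, sigma):
        n = x.size
        T = np.concatenate(([np.sum(x ** 2)], x))                           # sufficient statistic T(x)
        eta = np.concatenate(([-1.0 / (2 * sigma ** 2)], mu / sigma ** 2))  # natural parameter eta(mu, sigma)
        A = np.sum(mu ** 2) / (2 * sigma ** 2) + n * np.log(np.sqrt(2 * np.pi) * sigma)  # log-partition A(mu, sigma)
        return np.exp(T @ eta - A)

    x, mu, sigma = np.array([0.3, -1.2]), np.array([0.0, 1.0]), 0.7
    print(gauss_direct(x, mu, sigma), gauss_expfam(x, mu, sigma))           # both print the same value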

(5)

Our cases

The joint probability distribution is in the exponential family:

p(x, y; w) = (1/Z(w)) · exp[⟨φ(x, y), w⟩],   Z(w) = Σ_{x,y} exp[⟨φ(x, y), w⟩]

– supervised → Maximum Likelihood → gradient
– unsupervised → Maximum Likelihood → Expectation Maximization → gradient for the M-step

(6)

Supervised learning

Model:

p(x, y; w) = (1/Z(w)) · exp[⟨φ(x, y), w⟩],   Z(w) = Σ_{x,y} exp[⟨φ(x, y), w⟩]

Training set: L = {(x^l, y^l), ...}

The Maximum Likelihood objective reads

F(w) = Σ_l [⟨φ(x^l, y^l), w⟩ − ln Z(w)] = Σ_l ⟨φ(x^l, y^l), w⟩ − |L| · ln Z(w) → max_w

Gradient (normalized by the training set size |L|):

(1/|L|) · ∂F(w)/∂w = (1/|L|) · Σ_l φ(x^l, y^l) − ∂ln Z(w)/∂w

(7)

Supervised learning

Partition function:

Z(w) = Σ_{x,y} exp[⟨φ(x, y), w⟩]

Gradient of the log-partition function (apply the chain rule):

∂ln Z(w)/∂w = (1/Z(w)) · Σ_{x,y} exp[⟨φ(x, y), w⟩] · φ(x, y)
            = Σ_{x,y} (1/Z(w)) · exp[⟨φ(x, y), w⟩] · φ(x, y)
            = Σ_{x,y} p(x, y; w) · φ(x, y) = E_{p(x,y;w)}[φ]
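
For a small discrete model this expectation can be computed by brute-force enumeration exactly as in the derivation above. A minimal sketch, assuming a toy feature function phi and tiny domains X, Y (all illustrative, not from the lecture):

    import numpy as np
    from itertools import product

    X, Y = [0, 1, 2], [0, 1]                          # toy domains for x and y
    def phi(x, y):                                    # toy sufficient statistics phi(x, y)
        return np.array([float(x == y), float(x), float(y)])

    def model_statistics(w):
        scores = {(x, y): np.exp(phi(x, y) @ w) for x, y in product(X, Y)}
        Z = sum(scores.values())                      # partition function Z(w)
        return sum(s / Z * phi(x, y) for (x, y), s in scores.items())   # sum_xy p(x,y;w) * phi(x,y)

    w = np.array([1.0, -0.5, 0.2])
    print(model_statistics(w))                        # E_{p(x,y;w)}[phi]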

(8)

Supervised learning

The gradient is the difference of two expectations of the sufficient statistics:

(1/|L|) · ∂F(w)/∂w = (1/|L|) · Σ_l φ(x^l, y^l) − E_{p(x,y;w)}[φ] = E_L[φ] − E_{p(x,y;w)}[φ]

The first term is often called the data statistics, the second one the model statistics.

Gradient ascent:

1. compute the data statistics
2. repeat until convergence:
   a) compute the model statistics
   b) compute the gradient as their difference and apply it.
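
A minimal sketch of this loop, continuing the toy model from the sketch above (step size, iteration count and the tiny training set are arbitrary illustrative choices):

    def fit(train, w, step=0.1, iters=200):
        data_stats = sum(phi(x, y) for x, y in train) / len(train)   # 1. data statistics E_L[phi], fixed
        for _ in range(iters):                                       # 2. repeat (here: a fixed number of steps)
            grad = data_stats - model_statistics(w)                  #    a) model statistics, b) gradient
            w = w + step * grad                                      #    apply the ascent step
        return w

    w = fit([(0, 0), (1, 1), (2, 1)], np.zeros(3))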

(9)

Unsupervised learning

Model (the same):

p(x, y; w) = (1/Z(w)) · exp[⟨φ(x, y), w⟩],   Z(w) = Σ_{x,y} exp[⟨φ(x, y), w⟩]

Training set (incomplete): L = {x^l, ...}

Expectation step:

α_l(y) = p(y | x^l; w)   ∀ l, y

Maximization step:

Σ_l Σ_y α_l(y) · ln p(x^l, y; w) → max_w
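
The E-step only requires a normalization over y for every fixed x^l, since Z(w) cancels in p(y | x^l; w). A minimal sketch in the same toy setting as above:

    def posteriors(x_l, w):
        scores = np.array([np.exp(phi(x_l, y) @ w) for y in Y])
        return scores / scores.sum()                  # alpha_l(y) = p(y | x_l; w), Z(w) cancels

    train_x = [0, 2, 1]                               # incomplete (unsupervised) toy training set
    alpha = [posteriors(x_l, w) for x_l in train_x]   # one distribution over y per example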

(10)

Unsupervised learning

Maximization step:

Σ_l Σ_y α_l(y) · ln p(x^l, y; w)
  = Σ_l Σ_y α_l(y) · [⟨φ(x^l, y), w⟩ − ln Z(w)]
  = Σ_l Σ_y α_l(y) · ⟨φ(x^l, y), w⟩ − Σ_l Σ_y α_l(y) · ln Z(w)
  = Σ_l Σ_y α_l(y) · ⟨φ(x^l, y), w⟩ − |L| · ln Z(w)

(the last step uses Σ_y α_l(y) = 1)

The gradient is again a difference of expectations:

(1/|L|) · ∂/∂w = (1/|L|) · Σ_l Σ_y α_l(y) · φ(x^l, y) − E_{p(x,y;w)}[φ]
              = (1/|L|) · Σ_l E_{p(y|x^l)}[φ(x^l, y)] − E_{p(x,y;w)}[φ]
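
A minimal sketch of the resulting M-step gradient, reusing posteriors and model_statistics from the sketches above:

    def m_step_gradient(train_x, w):
        data_stats = sum(sum(a * phi(x_l, y) for y, a in zip(Y, posteriors(x_l, w)))
                         for x_l in train_x) / len(train_x)      # posterior-weighted data statistics
        return data_stats - model_statistics(w)                  # minus the model statistics

    print(m_step_gradient(train_x, w))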

(11)

Conclusion (exponential family)

In both variants the gradient of the log-likelihood is a difference between expectations of the sufficient statistics:

∂ln L/∂w = E_data[φ] − E_model[φ]

→ the likelihood is at an optimum when the two coincide

In the supervised case the "data" expectation is a simple average over the training set → E_data does not depend on w → the problem is concave → global optimum.

(12)

MRFs are members of the exponential family

In the parameter vector w ∈ R^d there is one component for each ψ-value of the task, i.e. for each tuple (i, k) or (i, j, k, k').

φ(y) is composed of "indicator" values that are 1 if the corresponding ψ-value "is contained" in the energy E(y):

φ_{ijkk'}(y) = 1 if y_i = k and y_j = k', 0 otherwise

→ the energy of a labeling can be written as a scalar product: E(y; w) = ⟨φ(y), w⟩
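
A minimal sketch of this construction for a tiny chain MRF with three nodes and two labels (the flat indexing of the (i, k) and (i, j, k, k') components of w is an illustrative choice, not from the lecture):

    import numpy as np

    nodes, labels = 3, 2
    edges = [(0, 1), (1, 2)]                          # a small chain
    n_unary = nodes * labels
    d = n_unary + len(edges) * labels * labels        # one w-component per psi-value

    def phi_vec(y):
        f = np.zeros(d)
        for i in range(nodes):
            f[i * labels + y[i]] = 1.0                # unary indicator: y_i = k
        for e, (i, j) in enumerate(edges):
            f[n_unary + e * labels ** 2 + y[i] * labels + y[j]] = 1.0   # pairwise indicator: y_i = k, y_j = k'
        return f

    w = np.random.randn(d)                            # w collects all psi-values
    y = [0, 1, 1]
    print(phi_vec(y) @ w)                             # the energy E(y; w) = <phi(y), w>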

(13)

φ and their expectations

Consider the expectation of just one component of φ:

E_{p(y;ψ)}[φ_{ijkk'}] = Σ_y p(y; ψ) · δ(y_i = k, y_j = k')
                      = Σ_{y: y_i=k, y_j=k'} p(y; ψ) = p(y_i = k, y_j = k'; ψ)

→ the expectation is the corresponding marginal probability

Putting it all together, gradient ascent:

1. count the marginal probabilities in the training set
2. repeat until convergence:
   a) compute the marginal probabilities for the current model
   b) compute the gradient as their difference and apply it.
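
Step 1 amounts to counting relative frequencies of the indicator features over the training labelings, i.e. the empirical marginals. A minimal sketch in the setting of the previous chain example:

    train_labelings = [[0, 1, 1], [0, 0, 1], [1, 1, 1]]               # illustrative training set
    data_marginals = sum(phi_vec(y) for y in train_labelings) / len(train_labelings)
    # each entry is the empirical marginal of the corresponding (i, k) or (i, j, k, k') event

Step 2a needs the same marginals under the current model, which in general requires (approximate) inference, e.g. the sampling discussed below.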

(14)

Remember the inference with an additive loss

1. Compute the marginal probability distributions

   p(y_i = k | x) = Σ_{y: y_i = k} p(y | x)

   for each variable i and each value k.

2. Decide for each variable "independently" according to its marginal probability distribution and the local loss c_i:

   Σ_{k∈K} c_i(y_i, k) · p(y_i = k | x) → min_{y_i}

This is again a Bayesian decision problem: minimize the average loss.
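
A minimal sketch of step 2 for a single variable, given its marginal distribution and a local loss matrix (a Hamming loss is used purely as an illustration):

    import numpy as np

    def decide(marginal, c):
        # marginal[k] = p(y_i = k | x);  c[d, k] = loss c_i(d, k) of deciding d when the truth is k
        return int(np.argmin(c @ marginal))           # minimize the expected local loss over decisions d

    K = 3
    hamming = 1.0 - np.eye(K)                         # c_i(d, k) = [d != k]
    print(decide(np.array([0.2, 0.5, 0.3]), hamming)) # -> 1: under Hamming loss this is the marginal mode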

(15)

Remember the "question"

How to compute the marginal probability distributions

p(y_i = k | x) = Σ_{y: y_i = k} p(y | x) ?

It is not necessary to eat up the whole kettle in order to taste the soup; it is often enough to stir it carefully and take just a spoonful.

The idea: instead of summing over all labelings, sample a couple of them according to the target probability distribution and average → the probabilities are substituted by relative frequencies.
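
In code the substitution is just a counting step over the drawn labelings; a minimal sketch (how the samples are produced is left open here, see the sampling slides below):

    from collections import Counter

    def estimate_marginal(samples, i):
        counts = Counter(y[i] for y in samples)       # samples: labelings drawn from p(y | x)
        return {k: c / len(samples) for k, c in counts.items()}   # relative frequencies ~ p(y_i = k | x)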

(16)

Sampling

Example: the values of a discrete variable x ∈ {1, 2, 3, 4, 5, 6} have to be drawn from p(x) = (0.1, 0.2, 0.4, 0.05, 0.15, 0.1).

The algorithm (input: p(x), output: a sample from p(x)), here written as runnable Python:

    import random

    def sample(p):
        a = [p[0]]                      # a[i] = cumulative probability of the first i + 1 values
        for i in range(1, len(p)):
            a.append(a[i - 1] + p[i])
        r = random.random()             # r drawn uniformly from [0, 1)
        for i in range(len(p)):
            if a[i] > r:
                return i + 1            # the values are 1 ... n

    # e.g. sample([0.1, 0.2, 0.4, 0.05, 0.15, 0.1])

(17)

Gibbs Sampling

Task: draw a vector x = (x_1, x_2, ..., x_m) from p(x). Problem: p(x) is not given explicitly.

The way out:

– start with an arbitrary x^0
– sample the new x^{t+1} "component-wise" from the conditional probability distributions p(x_i | x_1^t, ..., x_{i-1}^t, x_{i+1}^t, ..., x_m^t)
– repeat this for all components i many times

After such a sampling procedure (under some mild conditions):

– x^n does not depend on x^0
– x^n follows the target probability distribution p(x)
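
A minimal generic sketch of this procedure; the model-specific part, drawing one component from its conditional distribution, is passed in as a function (all names are illustrative):

    def gibbs(x0, sample_conditional, sweeps=100):
        x = list(x0)                                  # start with an arbitrary x^0
        for _ in range(sweeps):                       # repeat many times
            for i in range(len(x)):                   # resample every component in turn
                x[i] = sample_conditional(x, i)       # draw x_i from p(x_i | all other components)
        return x                                      # approximately a sample from p(x)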

(18)

Gibbs Sampling

In MRFs the conditional probability distributions can be computed easily!

The Markovian property

p(y_i | y_{V∖i}) = p(y_i | y_{N(i)})

(i.e. under the condition that the labels in the neighbouring nodes are fixed; N(i) is the neighbourhood structure) leads to

p(y_i = k | y_{N(i)}) ∝ exp[−ψ_i(k) − Σ_{j∈N(i)} ψ_{ij}(k, y_j)]

(19)

Gibbs Sampling

A relation to Iterated Conditional Modes (ICM):

– ICM considers the "conditional energies"

  E_i(k) = ψ_i(k) + Σ_{j∈N(i)} ψ_{ij}(k, y_j)

  and decides for the best label

– Gibbs sampling draws new labels according to the conditional probabilities

  p(y_i = k | y_{N(i)}) ∝ exp[−E_i(k)]
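
A sketch contrasting the two local update rules at a single node, starting from the conditional energies E_i(k); a single shared pairwise table psi_pair(k, k') is assumed here for simplicity (as in an Ising-type model):

    import numpy as np

    def conditional_energies(i, y, psi_unary, psi_pair, neighbours):
        K = psi_unary.shape[1]                        # psi_unary[i, k], psi_pair[k, k'], neighbours[i] = list of j
        return np.array([psi_unary[i, k] + sum(psi_pair[k, y[j]] for j in neighbours[i])
                         for k in range(K)])

    def icm_update(E_i):
        return int(np.argmin(E_i))                    # ICM: decide for the best label

    def gibbs_update(E_i):
        p = np.exp(-(E_i - E_i.min()))                # p(y_i = k | y_N(i)) ~ exp(-E_i(k)), shifted for stability
        p /= p.sum()
        return int(np.random.choice(len(p), p=p))     # Gibbs: draw a label from the conditional distribution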

(20)

An example

Binary segmentation, Ising model (fixed, i.e. not learned), unknown p(x_i | y_i) (non-parametric, histogram), unsupervised learning on the fly.

The marginal label probabilities are necessary for both learning and inference.
