Machine Learning Exponential Family
Dmitrij Schlesinger
WS2014/2015, 24.11.2013
General form
p(x; θ) = h(x) · exp[⟨η(θ), T(x)⟩ − A(θ)]   with
– x is a random variable
– θ is a parameter
– η(θ) is the natural parameter, a vector (often η(θ) = θ)
– T(x) is a sufficient statistic
– A(θ) is the log-partition function
Almost all probability distributions you can imagine are members of the exponential family
Example – Gaussian
Generalized linear models:

p(x; w) = (1/Z(w)) · exp⟨φ(x), w⟩

i.e. φ(x) ≡ T(x), η(θ) ≡ θ ≡ w, Z(w) ≡ exp A(θ), h(x) ≡ 1

Gaussian (1D for simplicity):

p(x; µ, σ) = 1/(√(2π) σ) · exp[−(x − µ)²/(2σ²)]
∝ exp[−x²/(2σ²) + µx/σ² − µ²/(2σ²)]
∝ exp⟨φ(x), w⟩   with
φ(x) = (x², x),   w = (−1/(2σ²), µ/σ²)
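As a quick numerical check of this rewriting (a sketch with made-up numbers; the constant factor hidden behind "∝" is restored explicitly), one can compare the exponential-family form with the standard Gaussian density:

import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8
w = np.array([-1.0 / (2 * sigma**2), mu / sigma**2])   # natural parameters

def phi(x):
    # sufficient statistics of the 1D Gaussian
    return np.array([x**2, x])

def p_expfam(x):
    # exp<phi(x), w> times the constant hidden behind the proportionality
    log_const = -mu**2 / (2 * sigma**2) - np.log(np.sqrt(2 * np.pi) * sigma)
    return np.exp(phi(x) @ w + log_const)

x = 0.3
print(p_expfam(x), norm.pdf(x, loc=mu, scale=sigma))   # the two values agree (~0.16)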
Our models
Let x be an observed variable and y a hidden one.

1. The joint probability distribution is in the exponential family (a generative model):

p(x, y; w) = (1/Z(w)) · exp⟨φ(x, y), w⟩,   Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

2. The conditional probability distribution is in the exponential family (a discriminative model):

p(x, y; w) = p(x) · p(y|x; w)
p(y|x; w) = (1/Z(w, x)) · exp⟨φ(x, y), w⟩,   Z(w, x) = Σ_y exp⟨φ(x, y), w⟩   ∀x
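For discrete x and y the two variants differ only in what the partition function sums over. A minimal sketch (array shapes and names are my own illustration, not part of the lecture):

import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3, 5))      # phi(x, y) for 4 values of x, 3 values of y, 5 features
w = rng.normal(size=5)

scores = phi @ w                      # <phi(x, y), w> for every pair (x, y)

# generative model: one global partition function Z(w)
p_joint = np.exp(scores) / np.exp(scores).sum()

# discriminative model: one partition function Z(w, x) per observation x
p_cond = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(p_joint.sum())                  # 1.0
print(p_cond.sum(axis=1))             # [1. 1. 1. 1.]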
Inference
For both the generative and the discriminative model, the posterior probability distribution (which is what is needed for inference) looks the same:

p(y|x; w) ∝ exp⟨φ(x, y), w⟩

⇒ once the model is learned, it does not matter whether "it was" a generative or a discriminative one.

For example, the Maximum A-posteriori decision is

y* = arg max_y ⟨φ(x, y), w⟩

In fact, this is a "generalized" Fisher classifier (see the previous lecture), i.e. it is based on scalar products in a feature space.
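As a sketch, the MAP decision then amounts to picking the label with the largest scalar product (the numbers below are purely illustrative):

import numpy as np

def map_decision(phi_x, w):
    # phi_x[y] = phi(x, y); return the label with maximal <phi(x, y), w>
    scores = phi_x @ w
    return int(np.argmax(scores))

phi_x = np.array([[1.0, 0.0], [0.2, 0.9], [0.5, 0.5]])   # 3 candidate labels, 2 features
w = np.array([0.3, 1.0])
print(map_decision(phi_x, w))                            # -> 1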
Our learning schemes
– Generative model, supervised → Maximum Likelihood, Gradient
– Discriminative model, supervised → Maximum Conditional Likelihood, Gradient
– Generative model, unsupervised → Maximum Likelihood, Expectation Maximization, Gradient for the M-step
Generative model, supervised
Model:

p(x, y; w) = (1/Z(w)) · exp⟨φ(x, y), w⟩,   Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Training set: L = ((x^l, y^l) ...)

Maximum Likelihood:

Σ_l [⟨φ(x^l, y^l), w⟩ − ln Z(w)] → max_w

Gradient:

∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − ∂ ln Z(w) / ∂w
Generative model, supervised
Partition function:

Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Gradient of the log-partition function:

∂ ln Z(w) / ∂w = (1/Z(w)) · Σ_{x,y} exp⟨φ(x, y), w⟩ · φ(x, y)
               = Σ_{x,y} p(x, y; w) · φ(x, y) = E_{p(x,y;w)}[φ]

The gradient is the difference of two expectations:

∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − E_{p(x,y;w)}[φ] = E_L[φ] − E_{p(x,y;w)}[φ]
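A minimal gradient-ascent sketch for a small discrete generative model (sizes, data and step size are made up for illustration); the update is exactly E_L[φ] − E_{p(x,y;w)}[φ]:

import numpy as np

rng = np.random.default_rng(1)
phi = rng.normal(size=(4, 3, 5)).reshape(12, 5)   # phi(x, y) for 4*3 configurations, 5 features

# empirical distribution of (x, y) in the training set (made-up counts)
counts = rng.integers(1, 6, size=12).astype(float)
E_data = (counts / counts.sum()) @ phi            # E_L[phi]

w = np.zeros(5)
for _ in range(2000):
    scores = phi @ w
    p = np.exp(scores - scores.max())
    p /= p.sum()                                  # p(x, y; w)
    E_model = p @ phi                             # E_{p(x,y;w)}[phi]
    w += 0.1 * (E_data - E_model)                 # gradient ascent on the log-likelihood

print(np.round(E_data - E_model, 4))              # ~0: the two expectations coincide at the optimum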
Example – Gaussian
The gradient (no hidden variables):

∂/∂w = (1/|L|) Σ_l φ(x^l) − E_{p(x;w)}[φ]

Remember the sufficient statistics for the Gaussian, φ(x) = (x², x); substituting them into the gradient gives

∂/∂w₁ = (1/|L|) Σ_l (x^l)² − ∫_{−∞}^{+∞} p(x; w) · x² dx

∂/∂w₂ = (1/|L|) Σ_l x^l − ∫_{−∞}^{+∞} p(x; w) · x dx

For Gaussians the needed model expectations can be computed analytically ⇒ set ∂/∂w = 0 and solve.
Example – Gaussian
For w₂ (remember that p(x; w) is the Gaussian N(x; µ, σ)):

∫_{−∞}^{+∞} p(x; w) · x dx = µ

It follows that

∂/∂w₂ = (1/|L|) Σ_l x^l − ∫_{−∞}^{+∞} p(x; w) · x dx = (1/|L|) Σ_l x^l − µ = 0

Finally:

µ = (1/|L|) Σ_l x^l

For w₁ and σ it is similar. Try it yourself (remember the exercise about the EM algorithm).
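A quick numerical sanity check with simulated data (a sketch, not part of the lecture): matching the model expectations of φ(x) = (x², x) to the data averages recovers the sample mean and variance.

import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(loc=1.5, scale=0.8, size=10_000)   # training samples x^l

# data expectations of the sufficient statistics (x^2, x)
E_x2, E_x = np.mean(xs**2), np.mean(xs)

# setting the gradient to zero: E_model[x] = mu, E_model[x^2] = sigma^2 + mu^2
mu = E_x
sigma2 = E_x2 - E_x**2

print(mu, np.sqrt(sigma2))   # close to the true values 1.5 and 0.8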
Discriminative model (posterior), supervised
Model:

p(y|x; w) = (1/Z(w, x)) · exp⟨φ(x, y), w⟩,   Z(w, x) = Σ_y exp⟨φ(x, y), w⟩   ∀x

Training set: L = ((x^l, y^l) ...)

Maximum Conditional Likelihood:

Σ_l [⟨φ(x^l, y^l), w⟩ − ln Z(w, x^l)] → max_w

Gradient:

∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − (1/|L|) Σ_l ∂ ln Z(w, x^l) / ∂w
Discriminative model (posterior), supervised
Partition function:

Z(w, x) = Σ_y exp⟨φ(x, y), w⟩

Gradient of the log-partition function for a particular x^l:

∂ ln Z(w, x^l) / ∂w = (1/Z(w, x^l)) · Σ_y exp⟨φ(x^l, y), w⟩ · φ(x^l, y)
                    = Σ_y p(y|x^l; w) · φ(x^l, y) = E_{p(y|x^l;w)}[φ(x^l, y)]

The gradient is again a difference of expectations:

∂/∂w = E_L[φ] − (1/|L|) Σ_l E_{p(y|x^l;w)}[φ(x^l, y)]
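A minimal sketch of this conditional gradient (data and names are illustrative); if φ(x, y) is chosen as a per-class copy of the input features, this is multinomial logistic regression:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))                     # observations x^l, 4 features
Y = rng.integers(0, 3, size=50)                  # labels y^l in {0, 1, 2}
W = np.zeros((3, 4))                             # w: one weight vector per class

for _ in range(500):
    scores = X @ W.T                             # <phi(x^l, y), w> for every l, y
    P = np.exp(scores - scores.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)            # p(y | x^l; w)

    E_data = np.zeros_like(W)
    np.add.at(E_data, Y, X)                      # sum_l phi(x^l, y^l), accumulated per class
    E_model = P.T @ X                            # sum_l E_{p(y|x^l;w)}[phi(x^l, y)]

    W += 0.1 * (E_data - E_model) / len(X)       # gradient ascent step

print(np.round((E_data - E_model) / len(X), 3))  # ~0: data and conditional model expectations coincide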
Generative model, unsupervised
Model:

p(x, y; w) = (1/Z(w)) · exp⟨φ(x, y), w⟩,   Z(w) = Σ_{x,y} exp⟨φ(x, y), w⟩

Training set (incomplete): L = (x^l ...)

Expectation:

α_l(y) = p(y|x^l; w)   ∀ l, y

Maximization:

Σ_l Σ_y α_l(y) · ln p(x^l, y; w) → max_w
Generative model, unsupervised
Maximization:

Σ_l Σ_y α_l(y) · ln p(x^l, y; w) =
= Σ_l Σ_y α_l(y) · [⟨φ(x^l, y), w⟩ − ln Z(w)] =
= Σ_l Σ_y α_l(y) · ⟨φ(x^l, y), w⟩ − Σ_l Σ_y α_l(y) · ln Z(w) =
= Σ_l Σ_y α_l(y) · ⟨φ(x^l, y), w⟩ − |L| · ln Z(w)

(using Σ_y α_l(y) = 1 for every l)

The gradient is again a difference of expectations:

∂/∂w = (1/|L|) Σ_l Σ_y α_l(y) · φ(x^l, y) − E_{p(x,y;w)}[φ] =
     = (1/|L|) Σ_l E_{p(y|x^l;w)}[φ(x^l, y)] − E_{p(x,y;w)}[φ]
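Putting the E-step and the M-step together for a small discrete model (a sketch with made-up sizes and data; the M-step reuses the gradient ascent from the supervised generative case):

import numpy as np

rng = np.random.default_rng(4)
nx, ny, d = 4, 3, 5
phi = rng.normal(size=(nx, ny, d))               # phi(x, y)
xs = rng.integers(0, nx, size=30)                # incomplete training set: only x^l observed
w = np.zeros(d)

def joint(w):                                    # p(x, y; w) as an (nx, ny) table
    s = np.tensordot(phi, w, axes=([2], [0]))
    p = np.exp(s - s.max())
    return p / p.sum()

for _ in range(20):                              # EM iterations
    # E-step: alpha_l(y) = p(y | x^l; w)
    p = joint(w)
    cond = p / p.sum(axis=1, keepdims=True)
    alpha = cond[xs]                             # shape (|L|, ny)

    # M-step: gradient ascent on sum_l sum_y alpha_l(y) ln p(x^l, y; w)
    E_data = np.einsum('ly,lyd->d', alpha, phi[xs]) / len(xs)
    for _ in range(200):
        p = joint(w)
        E_model = np.einsum('xy,xyd->d', p, phi)
        w += 0.1 * (E_data - E_model)

print(np.round(E_data - E_model, 3))             # small once EM has converged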
Conclusion
In all variants the gradient of the log-likelihood is a difference between expectations of the sufficient statistics:

∂ ln L / ∂w = E_data[φ] − E_model[φ]

→ the likelihood is at an optimum when the two expectations coincide.

In the supervised cases the "data" expectation is a simple average over the training set
→ E_data does not depend on w
→ the log-likelihood is concave in w → any local optimum is the global one.
Summary
Before:
– Statistical models: Bayesian Decision Theory, Maximum Likelihood principle
– Generative → discriminative
– Linear classifiers

Today: Exponential family

Next lecture: Support Vector Machines, Kernels,
followed by other discriminative approaches: empirical risk minimization, boosting, decision trees, etc.