(1)

Machine Learning 2020

Volker Roth

Department of Mathematics & Computer Science University of Basel

4th May 2020

(2)

Section 8

Gaussian Processes: probabilistic kernel models

(3)

Overview

The use of the Gaussian distribution in ML

- Properties of the multivariate Gaussian distribution
- Random variables → random vectors → stochastic processes
- Gaussian processes for regression
- Model selection
- Gaussian processes for classification
- Relation to kernel models (e.g. SVMs)
- Relation to neural networks

(4)

Kernel Ridge Regression

Kernelized ridge regression: $\hat{w} = (X^t X + \lambda I)^{-1} X^t y$.

Matrix inversion lemma: $(I + UV)^{-1} U = U (I + VU)^{-1}$.

Define new variables $\alpha_i$:

$$\hat{w} = (X^t X + \lambda I)^{-1} X^t y = X^t \underbrace{(X X^t + \lambda I)^{-1} y}_{\hat{\alpha}} = \sum_{i=1}^n \hat{\alpha}_i x_i.$$

Predictions for new $x$:

$$\hat{f}(x) = \hat{w}^t x = \sum_{i=1}^n \hat{\alpha}_i x_i^t x = \sum_{i=1}^n \hat{\alpha}_i k(x_i, x).$$
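The dual solution above amounts to a few lines of NumPy. The following is a minimal sketch, not part of the slides: the RBF kernel, the value of $\lambda$, the length scale, and the $\sin(x)/x$ toy data (anticipating the example on the next slide) are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 l^2))
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * length_scale**2))

# toy training data: the sin(x)/x example used on the following slide
X = np.linspace(-10, 10, 30)[:, None]
y = np.sin(X[:, 0]) / X[:, 0]

lam = 0.1                                               # ridge parameter lambda (illustrative)
K = rbf_kernel(X, X, length_scale=1.0)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # dual coefficients alpha_hat

X_new = np.array([[2.5]])
f_hat = rbf_kernel(X_new, X, 1.0) @ alpha               # f_hat(x) = sum_i alpha_i k(x_i, x)
```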

(5)

Kernel Ridge Regression

[Figure: kernel ridge regression fit of $f(x) = \sin(x)/x$ on $x \in [-10, 10]$.]

Kernel function: $k(x_i, x_j) = \exp\!\left(-\tfrac{1}{2l^2}\|x_i - x_j\|^2\right)$

(6)

How can we make use of the Gaussian distribution?

[Figure: contour plot and 3D surface of a 2D Gaussian density over $(y_1, y_2)$.]

Is it possible to fit a nonlinear regression line with the “boring” Gaussian distribution?

Yes, but we need to introduce the concept of Gaussian Processes!

(7)

The 2D Gaussian distribution

2D Gaussian: $P(y; \mu = 0, \Sigma = K) = \frac{1}{2\pi\sqrt{|K|}} \exp\!\left(-\tfrac{1}{2} y^t K^{-1} y\right)$

Covariance (also written “co-variance”) is a measure of how much two random variables vary together:
+1: perfect linear coherence,
−1: perfect negative linear coherence,
0: no linear coherence.

[Figure: contour plots of 2D Gaussian densities over $(y_1, y_2)$ for four covariance matrices:]

$K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\quad K = \begin{pmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{pmatrix}$, $\quad K = \begin{pmatrix} 1.00 & 0.95 \\ 0.95 & 1.00 \end{pmatrix}$, $\quad K = \begin{pmatrix} 1.00 & -0.8 \\ -0.8 & 1.00 \end{pmatrix}$

(8)

Properties of the Multivariate Gaussian distribution

$y \sim \mathcal{N}(\mu, K)$. Let $y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$ and $K = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}$.

Then $y_1 \sim \mathcal{N}(\mu_1, K_{11})$ and $y_2 \sim \mathcal{N}(\mu_2, K_{22})$.

[Figure: contours of a 2D Gaussian with $K = \begin{pmatrix} 0.75 & -0.2 \\ -0.2 & 0.25 \end{pmatrix}$ and its marginal densities along $y_1$ and $y_2$.]

Marginals of Gaussians are again Gaussian!

(9)

Properties of the Multivariate Gaussian distribution (2)

$y \sim \mathcal{N}(\mu, K)$. Let $y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$ and $K = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}$.

Then $y_2 \mid y_1 \sim \mathcal{N}\!\left(\mu_2 + K_{21} K_{11}^{-1}(y_1 - \mu_1),\; K_{22} - K_{21} K_{11}^{-1} K_{12}\right)$.

[Figure: 3D surface of the conditional density of a 2D Gaussian.]

Conditionals of Gaussians are again Gaussian!
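As a quick numerical illustration of the conditioning formula, the covariance matrix and the observed value of $y_1$ below are made-up example values:

```python
import numpy as np

# Illustrative 2D example: condition y_2 on an observed y_1.
mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
y1_obs = 1.0

K11, K12, K21, K22 = K[0, 0], K[0, 1], K[1, 0], K[1, 1]
cond_mean = mu[1] + K21 / K11 * (y1_obs - mu[0])   # mu_2 + K21 K11^{-1} (y1 - mu_1)
cond_var  = K22 - K21 / K11 * K12                  # K22 - K21 K11^{-1} K12
print(cond_mean, cond_var)                         # 0.5, 0.75
```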

(10)

2D Gaussians: a new visualization

- top left: mean and ±std. dev. of $p(y_2 \mid y_1 = 1)$.
- bottom left: $p(y_2 \mid y_1 = 1)$ and samples drawn from it.
- top right: x-axis: indices (1, 2) of the dimensions, y-axis: density in each component. Shown are $y_1 = 1$ and the conditional mean of $p(y_2 \mid y_1 = 1)$ with its std. dev.
- bottom right: samples drawn from the above model.

[Figure: four panels visualizing the conditional $p(y_2 \mid y_1 = 1)$ of a 2D Gaussian.]

(11)

Visualizing high-dimensional Gaussians

- top left: 6 samples drawn from a 5-dimensional Gaussian with zero mean (indicated by the blue line) and σ = 1 (magenta line).
- bottom left: conditional mean and std. dev. of $p(y_4, y_5 \mid y_1 = -1, y_2 = 0, y_3 = 0.5)$.
- top right: contour lines of $p(y_4, y_5 \mid y_1 = -1, y_2 = 0, y_3 = 0.5)$.
- bottom right: samples drawn from the above model.

[Figure: four panels visualizing a 5-dimensional Gaussian and its conditional.]

(12)

From covariance matrices to Gaussian processes

- top left: 8 samples, 6 dimensions; x-axis: dimension indices.
- bottom left: the same 8 samples, viewed as values $y = f(x)$.
- Construction: choose 6 input points $x_i$ at random → build the covariance matrix $K$ with covariance function $k(x, x') = \exp\!\left(-\tfrac{1}{2l^2}\|x - x'\|^2\right)$ → draw $f \sim \mathcal{N}(0, K)$ → plot as a function of the inputs.
- top right: same for 12 inputs; bottom right: 100 inputs.

[Figure: samples from Gaussians of increasing dimension, plotted as functions of the input points.]

(13)

This looks similar to Kernel Regression...

[Figure: kernel ridge regression fit of $f(x) = \sin(x)/x$, repeated from above.]

(14)

Gaussian Processes

Gaussian Random Variable (RV): $f \sim \mathcal{N}(\mu, \sigma^2)$.

Gaussian Random Vector: collection of $n$ RVs, characterized by a mean vector and a covariance matrix: $f \sim \mathcal{N}(\mu, \Sigma)$.

Gaussian Process: an infinite Gaussian random vector, every finite subset of which is jointly Gaussian distributed.

Continuous index, e.g. time $t$ → function $f(t)$.

Fully specified by the mean function $m(t) = E[f(t)]$ and the covariance function $k(t, t') = E[(f(t) - m(t))(f(t') - m(t'))]$.

In ML, we focus on more general index sets $x \in \mathbb{R}^d$ with mean function $m(x)$ and covariance function $k(x, x')$:

$$f(x) \sim \mathcal{GP}(m(x), k(x, x')).$$

(15)

Visualizing Gaussian Processes: Sampling

Problem: working with infinite vectors and covariance matrices is not very intuitive...

Solution: evaluate the GP at a set of $n$ discrete times (or input vectors $x \in \mathbb{R}^d$):

- choose $n$ input points $x_i$ at random → matrix $X$
- build the covariance matrix $K(X, X)$ with covariance function $k(x_i, x_j)$
- sample realizations of the Gaussian random vector $f \sim \mathcal{N}(0, K(X, X))$
- plot $f$ as a function of the inputs (see the sketch below).
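A minimal NumPy sketch of this sampling recipe; the squared-exponential covariance, the jitter term, and the interval $[0, 7]$ are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 l^2))
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * length_scale**2))

rng = np.random.default_rng(0)
n = 100
X = np.sort(rng.uniform(0, 7, n))[:, None]               # n random input points
K = rbf_kernel(X, X) + 1e-9 * np.eye(n)                  # small jitter for numerical stability
f = rng.multivariate_normal(np.zeros(n), K, size=8)      # 8 realizations of f ~ N(0, K(X, X))
# each row of f can now be plotted against X[:, 0]
```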

(16)

This is exactly what we have done here...

[Figure: the GP prior samples from the previous construction, for increasing numbers of input points.]

(17)

From the Prior to the Posterior

A GP defines a distribution over functions: $f$ evaluated at the training points $X$ and $f_*$ evaluated at the test points $X_*$ are jointly Gaussian:

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(0,\; \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$

Posterior $p(f_* \mid X, X_*, f(X))$: the conditional of a Gaussian distribution.

Let $x \sim \mathcal{N}(\mu, K)$ with $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ and $K = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}$. Then $x_2 \mid x_1 \sim \mathcal{N}\!\left(\mu_2 + K_{21} K_{11}^{-1}(x_1 - \mu_1),\; K_{22} - K_{21} K_{11}^{-1} K_{12}\right)$.

$$f_* \mid X_*, X, f \sim \mathcal{N}\!\left(K(X_*, X) K(X, X)^{-1} f,\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)\right)$$

For only one test case:

$$f_* \mid x_*, X, f \sim \mathcal{N}\!\left(k_*^t K^{-1} f,\; k_{**} - k_*^t K^{-1} k_*\right)$$

(18)

A simple extension: noisy observations

Assume we have access only to noisy versions of the function values:

$$y = f(x) + \eta, \quad \eta \sim \mathcal{N}(0, \sigma^2)$$

(cf. the initial example of ridge regression). The noise $\eta$ does not depend on the data!

The covariance of the noisy observations $y$ is the sum of the covariance of $f$ and the noise variance: $\operatorname{cov}(y) = K(X, X) + \sigma^2 I$.

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(0,\; \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$

$$f_* \mid X_*, X, y \sim \mathcal{N}\!\left(K(X_*, X)(K(X, X) + \sigma^2 I)^{-1} y,\; K(X_*, X_*) - K(X_*, X)(K(X, X) + \sigma^2 I)^{-1} K(X, X_*)\right)$$

$$f_* \mid x_*, X, y \sim \mathcal{N}\!\left(k_*^t (K + \sigma^2 I)^{-1} y,\; k_{**} - k_*^t (K + \sigma^2 I)^{-1} k_*\right)$$

⇒ The posterior mean is the solution of kernel ridge regression!
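Read literally, the noisy-observation posterior is a handful of matrix operations. A minimal sketch; the kernel, noise level, and toy data are illustrative, and a Cholesky-based solve would be preferable to an explicit inverse in practice:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * length_scale**2))

rng = np.random.default_rng(1)
X = np.linspace(-10, 10, 12)[:, None]                        # training inputs
y = np.sin(X[:, 0]) / X[:, 0] + rng.normal(0, 0.1, len(X))   # noisy targets

sigma2 = 0.01                                                # noise variance (illustrative)
K = rbf_kernel(X, X)
X_star = np.linspace(-10, 10, 100)[:, None]                  # test inputs
K_star = rbf_kernel(X_star, X)                               # K(X_*, X)

C_inv = np.linalg.inv(K + sigma2 * np.eye(len(X)))           # (K + sigma^2 I)^{-1}
mean = K_star @ C_inv @ y                                    # posterior mean
cov = rbf_kernel(X_star, X_star) - K_star @ C_inv @ K_star.T # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))                # posterior std. dev.
```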

(19)

Noisy observations: examples

[Figure: left: noisy observations of $f(x) = 0.5x$ with noise $\eta \sim \mathcal{N}(0, \sigma^2)$; right: noisy observations of $f(x) = \sin(x)/x$.]

Noisy observations: $y = f(x) + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2)$.

Mean predictions: $\hat{f} = K(K + \sigma^2 I)^{-1} y$.

(20)

Gaussian processes for regression

[Figure: three panels — left: “now with some noise...”, data with the GP mean prediction and ±std. dev.; middle: posterior samples; right: prior samples.]

Left: 11 training points generated as $y = \sin(x)/x + \nu$, $\nu \sim \mathcal{N}(0, 0.01)$; covariance $k(x_p, x_q) = \exp\!\left(-\tfrac{1}{2l^2}\|x_p - x_q\|^2\right) + \sigma^2 \delta_{p,q}$. 100 test points uniformly chosen from $[-10, 10]$ → matrix $X_*$. Mean prediction $E[f_* \mid X, X_*, y]$ and ±std. dev.

Middle: samples drawn from the posterior $f_* \mid X, X_*, y$.

Right: samples drawn from the prior $f \sim \mathcal{N}(0, K(X_*, X_*))$.

(21)

Covariance Functions

A GP specifies a distribution over functions $f(x)$, characterized by a mean function $m(x)$ and a covariance function $k(x_i, x_j)$.

A finite subset evaluated at $n$ inputs follows a Gaussian distribution:

$$f(X) = (f(x_1), \ldots, f(x_n))^t \sim \mathcal{N}(\mu, K),$$

where $K$ is the covariance matrix with entries $K_{ij} = k(x_i, x_j)$.

Covariance matrices are symmetric positive semi-definite: $K_{ij} = K_{ji}$ and $x^t K x \ge 0 \;\; \forall x$.

We already know that Mercer kernels have this property → all Mercer kernels define proper covariance functions in GPs.

Kernels frequently have additional parameters.

The noise variance in the observation model $y = f(x) + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2)$, is another parameter.

How should we choose these parameters? → model selection.

(22)

Model Selection

- top left: sample function from the prior $f \sim \mathcal{N}(0, K(X, X))$ with covariance function $k(x, x') = \exp\!\left(-\tfrac{1}{2l^2}\|x - x'\|^2\right)$. Length scale $l = 10^{-0.5}$ → small → highly varying function.
- bottom left: same for $l = 10^{0}$ → smoother function.
- top right: same for $l = 10^{0.5}$ → even smoother...
- bottom right: almost linear function for $l = 10^{1}$.

[Figure: four surface plots of 2D prior samples for length scales $10^{-0.5}$, $10^{0}$, $10^{0.5}$, $10^{1}$.]

(23)

Model Selection (2)

How to select the parameters?

One possibility: maximize the marginal likelihood

$$p(y \mid X) = \int p(y \mid f, X)\, p(f \mid X)\, df.$$

We do not need to integrate: we know that $f \mid X \sim \mathcal{N}(0, K)$ and $y = f + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2)$.

Since $\eta$ does not depend on $X$, the variances simply add:

$$y \mid X \sim \mathcal{N}(0, K + \sigma^2 I).$$

Possible strategy: evaluate the parameters on a grid and choose the maximum. Or: compute derivatives of the marginal likelihood and use gradient descent.

(24)

Model Selection (3)

Example problem: $y = \sin(x)/x + \eta$, $\eta \sim \mathcal{N}(0, 0.01)$.

Log marginal likelihood:

$$\log \mathcal{N}(0, K + \sigma^2 I) = \underbrace{-\tfrac{1}{2}\, y^t (K + \sigma^2 I)^{-1} y}_{\text{data fit}} \;\underbrace{-\; \tfrac{1}{2} \log|K + \sigma^2 I|}_{\text{complexity penalty}} \;\underbrace{-\; \tfrac{n}{2} \log(2\pi)}_{\text{norm. constant}}.$$

2D example with the Gaussian RBF kernel:

$$(K + \sigma^2 I) = \begin{pmatrix} 1 + \sigma^2 & a \\ a & 1 + \sigma^2 \end{pmatrix} \;\Rightarrow\; |K + \sigma^2 I| = (1 + \sigma^2)^2 - a^2 > 0$$

Note that $a \to 0$ if the length scale $l \to 0$ → the complexity penalty has high values for small length scales.

The matrix inverse includes a dominating factor $|K + \sigma^2 I|^{-1}$ → the data fit term is also high for small $l$.
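The log marginal likelihood above translates directly into code. The sketch below evaluates it on a grid of length scales for the $\sin(x)/x$ example; the grid values and the Cholesky-based evaluation are implementation choices, not prescribed by the slides:

```python
import numpy as np

def rbf_kernel(A, B, length_scale):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * length_scale**2))

def log_marginal_likelihood(X, y, length_scale, sigma2):
    # log N(y | 0, K + sigma^2 I) = data fit + complexity penalty + norm. constant
    n = len(y)
    C = rbf_kernel(X, X, length_scale) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))               # -1/2 log|K + sigma^2 I|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(2)
X = np.linspace(-10, 10, 30)[:, None]
y = np.sin(X[:, 0]) / X[:, 0] + rng.normal(0, 0.1, len(X))

# grid search over the length scale, with the noise variance fixed at sigma^2 = 0.01
for l in 10.0 ** np.linspace(-1, 1, 9):
    print(l, log_marginal_likelihood(X, y, l, 0.01))
```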

(25)

Model Selection (4)

Fixing $\sigma^2 = 0.01$ and varying the length scale $l$:

[Figure: data fit, negative complexity penalty, and marginal likelihood as functions of the log length scale, with the noise variance held fixed.]

(26)

Model Selection (5)

Fixing the length scale $l = 0.5$ and varying the noise level $\sigma^2$:

[Figure: data fit, negative complexity penalty, and marginal likelihood as functions of the log noise variance, with the length scale held fixed.]

(27)

Model Selection (6)

Varying both $\sigma^2$ and $l$:

[Figure: contour plot of the marginal likelihood over log(length scale) and log(noise variance).]

(28)

Classification: Basketball Example

[Figure: hit (1) or miss (0) as a function of distance, with a linear activation and the logistic transfer function overlaid.]

Adapted from Fig. 7.5.1 in (B. Flury)

(29)

Classical Logistic Regression

Targets $y \in \{0, 1\}$ → Bernoulli RV with “success probability” $\pi(x) = P(1 \mid x)$.

Likelihood: $P(y \mid X, f) = \prod_{i=1}^n \pi_f(x_i)^{y_i} \left(1 - \pi_f(x_i)\right)^{1 - y_i}$.

Linear logistic regression: unbounded $f(x) = w^t x$ (“activation”). Bounded estimates: pass $f(x)$ through the logistic transfer function $\sigma(f(x)) = \frac{e^{f(x)}}{1 + e^{f(x)}} = \frac{1}{1 + e^{-f(x)}}$ and set $\pi_f(x) = \sigma(f(x))$.

Newton method for maximizing the log posterior $J(w) := \log p(y \mid X, w) + \log p(w)$:

$$w^{(r+1)} = w^{(r)} - \{E[H]\}^{-1} \frac{\partial}{\partial w} J(w)$$

Kernel trick: expand $w = X^t \alpha$ and substitute dot products by a kernel function $k(x, x')$ → kernel logistic regression.
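A minimal sketch of the Newton updates for linear logistic regression with a Gaussian prior on $w$; the prior variance, the basketball-style toy data, and the fixed iteration count are illustrative assumptions (for this likelihood the expected and observed Hessian coincide, so $E[H]$ is simply $H$ here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, tau2=10.0, n_iter=20):
    """Newton updates for logistic regression with a Gaussian prior w ~ N(0, tau2 I)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        pi = sigmoid(X @ w)                       # pi_f(x_i) = sigma(w^t x_i)
        grad = X.T @ (y - pi) - w / tau2          # gradient of the log posterior J(w)
        W = pi * (1 - pi)                         # Bernoulli variances
        H = -(X.T * W) @ X - np.eye(d) / tau2     # Hessian of J(w)
        w = w - np.linalg.solve(H, grad)          # Newton step
    return w

# toy 1D example in the spirit of the basketball data: hit probability decays with distance
rng = np.random.default_rng(3)
dist = rng.uniform(0, 40, 200)
y = rng.binomial(1, sigmoid(3.0 - 0.15 * dist))
X = np.column_stack([np.ones_like(dist), dist])   # intercept + distance
print(newton_logistic(X, y))                      # roughly recovers (3.0, -0.15)
```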

(30)

GP Classification

Place a GP prior over the “latent” function: $f(x) \sim \mathcal{GP}(0, k(x, x'))$.

“Squash” it through the logistic function → prior on $\pi(x) = \sigma(f(x))$.

(Rasmussen & Williams, 2006)

Problem: with the Bernoulli likelihood, the predictive distribution $p(y_* = 1 \mid X, y, x_*)$ cannot be calculated analytically.

Possible solution: use the Laplace approximation.

Observation: the MAP classification boundary is identical to the boundary obtained from kernel logistic regression.

(31)

GP Classification using Laplace’s approximation

Prior: $f \mid X \sim \mathcal{N}(0, K)$. Bernoulli likelihood:

$$p(y \mid X, f) = \prod_{i=1}^n \sigma(f(x_i))^{y_i} \left(1 - \sigma(f(x_i))\right)^{1 - y_i}.$$

Gaussian approximation of the posterior:

$$p(f \mid X, y) \approx \mathcal{N}(\hat{f}, H^{-1}).$$

Predictions: compute

$$p(y_* = 1 \mid y, x_*, X) = \int \sigma(f_*)\, \underbrace{p(f_* \mid y, x_*, X)}_{\text{latent function at } x_*} \, df_* = E_{p(f_* \mid y, x_*, X)}(\sigma).$$

[Figure: graphical model with inputs $x_1, \ldots, x_n, x_*$, latent function values $f_1, \ldots, f_n, f_*$, and observations $y_1, \ldots, y_n, y_*$.]

(32)

GP Classification using Laplace’s approximation

First predict the latent function at the test case $x_*$:

$$p(f_* \mid y, x_*, X) = \int \underbrace{p(f_* \mid f, x_*, X)}_{\text{Gaussian}}\; \underbrace{p(f \mid X, y)}_{\text{approx. Gaussian } \mathcal{N}(\hat{f}, H^{-1})}\, df \;\approx\; \mathcal{N}(\mu_*, \sigma_*), \quad \text{with } \mu_* = k_*^t K^{-1} \hat{f}, \quad \sigma_* = k_{**} - k_*^t \tilde{K}^{-1} k_*$$

Then use a Monte Carlo approximation:

$$p(y_* \mid y, x_*, X) = E_{p(f_* \mid y, x_*, X)}(\sigma) \approx \frac{1}{S} \sum_{s=1}^S \sigma\!\left(f_*^{(s)}(x_*)\right),$$

where the $f_*^{(s)}$ are samples from the (approximated) distribution over latent function values.
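Given the approximate predictive moments $\mu_*$ and $\sigma_*$, the Monte Carlo averaging step could look like the following sketch; the numeric moment values and the sample count are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class_prob(mu_star, var_star, n_samples=10000, seed=0):
    """Monte Carlo estimate of E[sigma(f_*)] under f_* ~ N(mu_star, var_star)."""
    rng = np.random.default_rng(seed)
    f_samples = rng.normal(mu_star, np.sqrt(var_star), n_samples)
    return np.mean(sigmoid(f_samples))

# placeholder values for the approximate predictive moments of the latent f_*
print(predict_class_prob(mu_star=0.8, var_star=0.5))
```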

(33)

GPs and Neural networks

Consider a neural network for regression (square loss) with one hidden layer:

$$p(y \mid x, \theta) = \mathcal{N}(f(x; \theta), \sigma^2), \qquad f(x) = b + \sum_{j=1}^{n_H} v_j\, g(x; u_j).$$

[Figure: network diagram with inputs $x_1, x_2$, hidden units $g(u_j^t x)$, output weights $v_j$, and bias $b$.]

Bayesian treatment: i.i.d. prior assumptions over the weights: independent zero-mean Gaussian priors for $b$ and $v_j$, with variances $\sigma_b^2$ and $\sigma_v^2$, and independent (arbitrary) priors for the components of the weight vector $u_j$ at the $j$-th hidden unit.

(34)

GPs and Neural networks

Mean and covariance:

$$m(x) = E_\theta[f(x)] = \underbrace{E[b]}_{=0} + \sum_{j=1}^{n_H} E[v_j\, g(x; u_j)] \overset{(v \text{ indep. of } u)}{=} \sum_{j=1}^{n_H} \underbrace{E[v_j]}_{=0} E[g(x; u_j)] = 0.$$

$$k(x, x') = E_\theta[f(x) f(x')] = \sigma_b^2 + \sum_{j=1}^{n_H} \sigma_v^2\, E_u[g(x; u_j)\, g(x'; u_j)].$$

All hidden units are identically distributed → the sum is over $n_H$ i.i.d. RVs. Assume the activation $g$ is bounded → all moments of the distribution will be bounded → the central limit theorem is applicable.

(35)

GPs and Neural networks

Central limit theorem: suppose $\{X_1, \ldots, X_n\}$ is a sequence of i.i.d. RVs with $E[X_i] = \mu$ and $\mathrm{Var}[X_i] = \sigma^2 < \infty$. Then $\sqrt{n}(S_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ as $n \to \infty$, where $S_n$ is the sample mean.

The covariance between any pair of function values $(f(x), f(x'))$ converges to the covariance of two Gaussian RVs → the joint distribution of $n$ function values is multivariate Gaussian → we get a GP as $n_H \to \infty$.

For specific activations, the neural network covariance function can be computed analytically (Williams 1998).

A three-layer network with an infinitely wide hidden layer can be interpreted as a GP.
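As a small empirical check of this limit, one can sample many random one-hidden-layer networks and inspect the joint distribution of $(f(x), f(x'))$. The tanh activation, the standard normal priors on $u_j$, and the $1/n_H$ scaling of $\sigma_v^2$ below are illustrative choices, not fixed by the slides:

```python
import numpy as np

rng = np.random.default_rng(4)
n_H, n_nets = 1000, 2000
x1, x2 = np.array([0.5, -1.0]), np.array([1.0, 0.3])   # two fixed 2D inputs

U = rng.normal(0, 1, (n_nets, n_H, 2))                 # input weights u_j
V = rng.normal(0, np.sqrt(1.0 / n_H), (n_nets, n_H))   # output weights v_j, sigma_v^2 = 1/n_H
b = rng.normal(0, 1, n_nets)                           # biases, sigma_b^2 = 1

f1 = b + np.sum(V * np.tanh(U @ x1), axis=1)           # f(x1) for each random network
f2 = b + np.sum(V * np.tanh(U @ x2), axis=1)           # f(x2)

# empirical covariance of (f(x1), f(x2)) across random networks:
# approximately sigma_b^2 + E_u[g(x1; u) g(x2; u)] under this scaling
print(np.cov(f1, f2))
```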

(36)

Summary

GPs: fully probabilistic models → posterior $p(f_* \mid X, y, x_*)$.

Uniquely defined by specifying the covariance function.

Mathematically simple: we only need to calculate conditionals of Gaussians!

Connections:
- regression: MAP(GPr) = kernel ridge regression
- classification: MAP(GPc) = kernel logistic regression
- GPc ≈ probabilistic version of the SVM.

A three-layer network with an infinitely wide hidden layer can be interpreted as a GP with the neural network covariance function.
