Machine Learning
Volker Roth
Department of Mathematics & Computer Science University of Basel
Section 7
Support Vector Machines and Kernels
Structure on canonical hyperplanes
Theorem (Vapnik, 1982)
Let $R$ be the radius of the smallest ball containing the points $x_1, \ldots, x_n$: $B_R(a) = \{x \in \mathbb{R}^d : \|x - a\| < R\}$, $a \in \mathbb{R}^d$. The set of canonical hyperplane decision functions $f(w, w_0) = \operatorname{sign}\{w^t x + w_0\}$ satisfying $\|w\| \le A$ has VC dimension $h$ bounded by
$$h \le R^2 A^2 + 1.$$
Intuitive interpretation: margin $= 1/\|w\|$, so minimizing capacity$(\mathcal{H})$ corresponds to maximizing the margin:
$$R[f_n] \le R_{\mathrm{emp}}[f_n] + \sqrt{\frac{a}{n}\left(\operatorname{capacity}(\mathcal{H}) + \ln\frac{b}{\delta}\right)}.$$
Large margin classifiers.
SVMs
When the training examples are linearly separable we can maximize the margin by minimizing the regularization term
$$\|w\|^2/2 = \sum_{i=1}^{d} w_i^2/2$$
subject to the classification constraints $y_i[x_i^t w] - 1 \ge 0$, $i = 1, \ldots, n$.
[Figure: maximum-margin hyperplane with normal vector w separating the classes x and o.]
The solution is defined only on the basis of a subset of examples or support vectors.
SVMs: nonseparable case
Modify optimization problem slightly by adding a penalty for violating the classification constraints:
minimize $\|w\|^2/2 + C\sum_{i=1}^{n}\xi_i$
subject to the relaxed constraints $y_i[x_i^t w] - 1 + \xi_i \ge 0$, $i = 1, \ldots, n$.
[Figure: soft-margin hyperplane; some points lie on the wrong side of their margin.]
The $\xi_i \ge 0$ are called slack variables.
SVMs: nonseparable case
We can also write the SVM optimization problem more compactly as
$$\underbrace{C\sum_{i=1}^{n}(1 - y_i[x_i^t w])_+}_{C\sum_{i=1}^{n}\xi_i} + \|w\|^2/2,$$
where $(z)_+ = z$ if $z \ge 0$ and zero otherwise.
This is equivalent to regularized empirical loss minimization
$$\underbrace{\frac{1}{n}\sum_{i=1}^{n}(1 - y_i[x_i^t w])_+}_{R_{\mathrm{emp}}} + \lambda\|w\|^2,$$
where $\lambda = 1/(2nC)$ is the regularization parameter.
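The regularized-hinge-loss view suggests a direct training procedure. Below is a minimal sketch (not the solver used in practice) that minimizes $R_{\mathrm{emp}} + \lambda\|w\|^2$ by subgradient descent; the synthetic data, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy two-class data with labels in {-1, +1} (assumed for illustration)
X = np.vstack([rng.normal(+1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

lam, n = 0.01, len(y)
w = np.zeros(2)
for t in range(1, 1001):
    margins = y * (X @ w)
    active = margins < 1                                 # points violating the margin
    grad = -(y[active, None] * X[active]).sum(0) / n + 2 * lam * w
    w -= (1.0 / t) * grad                                # decreasing step size
print("w =", w, " training error =", np.mean(np.sign(X @ w) != y))
```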
SVMs and LOGREG
When viewed from the point of view of regularized empirical loss minimization, SVM and logistic regression appear quite similar:
SVM: $\displaystyle \frac{1}{n}\sum_{i=1}^{n}(1 - y_i[x_i^t w])_+ + \lambda\|w\|^2$
LOGREG: $\displaystyle \frac{1}{n}\sum_{i=1}^{n} -\log\underbrace{\sigma(y_i[x_i^t w])}_{P(y_i|x_i,w)} + \lambda\|w\|^2$, where $\sigma(z) = (1 + e^{-z})^{-1}$ is the logistic function.
SVMs and LOGREG
The difference comes from how we penalize errors:
Both: $\displaystyle \frac{1}{n}\sum_{i=1}^{n}\mathrm{Loss}(\underbrace{y_i[x_i^t w]}_{z}) + \lambda\|w\|^2$
SVM: $\mathrm{Loss}(z) = (1 - z)_+$
LOGREG: $\mathrm{Loss}(z) = \log(1 + \exp(-z))$
[Figure: hinge loss and logistic loss plotted as functions of z on [-3, 3].]
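For a quick numerical comparison of the two penalties, the following sketch evaluates both losses on a grid of margin values $z$ (the grid is arbitrary):

```python
import numpy as np

z = np.linspace(-3, 3, 7)
hinge = np.maximum(0.0, 1.0 - z)      # SVM: (1 - z)_+
logistic = np.logaddexp(0.0, -z)      # LOGREG: log(1 + exp(-z)), numerically stable
for zi, h, l in zip(z, hinge, logistic):
    print(f"z = {zi:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
```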
SVMs: solution, Lagrange multipliers
Back to the separable case: how do we solve
minimize$_w$ $\|w\|^2/2$ s.t. $y_i[x_i^t w] - 1 \ge 0$, $i = 1, \ldots, n$?
Represent the constraints as individual loss terms:
$$\sup_{\alpha_i \ge 0} \alpha_i(1 - y_i[x_i^t w]) = \begin{cases} 0, & \text{if } y_i[x_i^t w] - 1 \ge 0,\\ \infty, & \text{otherwise.}\end{cases}$$
Rewrite the minimization problem:
$$\text{minimize}_w\; \|w\|^2/2 + \sum_{i=1}^{n}\sup_{\alpha_i \ge 0}\alpha_i(1 - y_i[x_i^t w]) \;=\; \text{minimize}_w\,\sup_{\alpha_i \ge 0}\left(\|w\|^2/2 + \sum_{i=1}^{n}\alpha_i(1 - y_i[x_i^t w])\right)$$
SVMs: solution, Lagrange multipliers
Swap maximization and minimization (technically this requires that the problem is convex and feasible: Slater's condition):
$$\text{minimize}_w\left[\sup_{\alpha_i \ge 0}\left(\|w\|^2/2 + \sum_{i=1}^{n}\alpha_i(1 - y_i[x_i^t w])\right)\right] = \text{maximize}_{\alpha_i \ge 0}\,\min_w \underbrace{\left(\|w\|^2/2 + \sum_{i=1}^{n}\alpha_i(1 - y_i[x_i^t w])\right)}_{J(w;\,\alpha)}$$
We have to minimize $J(w;\alpha)$ over the parameters $w$ for fixed Lagrange multipliers $\alpha_i \ge 0$.
Simple, because $J(w;\alpha)$ is convex in $w$: set the derivative to zero; there is only one stationary point, and it is the global minimum.
SVMs: solution, Lagrange multipliers
Find the optimal $w$ by setting the derivatives to zero:
$$\frac{\partial}{\partial w}J(w;\alpha) = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; \hat w = \sum_i \alpha_i y_i x_i.$$
Substitute the solution back into the objective and get (after some re-arrangements of terms):
$$\max_{\alpha_i \ge 0}\min_w\left(\|w\|^2/2 + \sum_{i=1}^{n}\alpha_i(1 - y_i[x_i^t w])\right)
= \max_{\alpha_i \ge 0}\left(\|\hat w\|^2/2 + \sum_{i=1}^{n}\alpha_i(1 - y_i[x_i^t \hat w])\right)
= \max_{\alpha_i \ge 0}\left(\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j\, x_i^t x_j\right)$$
SVMs: summary
Find the optimal Lagrange multipliers $\hat\alpha_i$ by maximizing
$$\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j\, x_i^t x_j \quad \text{subject to } \alpha_i \ge 0.$$
Only the $\hat\alpha_i$'s corresponding to support vectors will be non-zero.
Make predictions on any new example x according to:
$$\operatorname{sign}(x^t\hat w) = \operatorname{sign}\Bigl(x^t \sum_{i=1}^{n}\hat\alpha_i y_i x_i\Bigr) = \operatorname{sign}\Bigl(\sum_{i\in SV}\hat\alpha_i y_i\, x^t x_i\Bigr).$$
Observation: dependency on input vectors only via dot products.
Later we will introduce the kernel trick for efficiently computing these dot products in implicitly defined feature spaces.
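As an illustration of the dual problem, the sketch below maximizes the dual objective by projected gradient ascent, clipping each $\alpha_i$ at zero after every step. This toy version (synthetic data, fixed step size, no bias term $w_0$, no dedicated QP solver such as SMO) only mirrors the simplified formulation used above.

```python
import numpy as np

rng = np.random.default_rng(1)
# well-separated toy data (assumed), labels in {-1, +1}
X = np.vstack([rng.normal(+2.0, 0.7, (30, 2)), rng.normal(-2.0, 0.7, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i^t x_j
alpha = np.zeros(len(y))
eta = 1e-3
for _ in range(5000):
    grad = 1.0 - Q @ alpha                       # gradient of sum(alpha) - 0.5 alpha^t Q alpha
    alpha = np.maximum(0.0, alpha + eta * grad)  # project back onto alpha_i >= 0

w_hat = (alpha * y) @ X                          # w-hat = sum_i alpha_i y_i x_i
print("support vectors:", np.flatnonzero(alpha > 1e-6), " w =", w_hat)
```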
SVMs: formal derivation
Convex optimization problem: an optimization problem
minimize $f(x)$ (1)
subject to $g_i(x) \le 0$, $i = 1, \ldots, m$ (2)
is convex if the functions $f, g_1, \ldots, g_m : \mathbb{R}^n \to \mathbb{R}$ are convex.
The Lagrangian function for the problem is
$$L(x, \lambda_0, \ldots, \lambda_m) = \lambda_0 f(x) + \lambda_1 g_1(x) + \ldots + \lambda_m g_m(x).$$
Karush-Kuhn-Tucker (KKT) conditions: For each point $\hat x$ that minimizes $f$, there exist real numbers $\lambda_0, \ldots, \lambda_m$, called Lagrange multipliers, that simultaneously satisfy:
1. $\hat x$ minimizes $L(x, \lambda_0, \lambda_1, \ldots, \lambda_m)$,
2. $\lambda_0 \ge 0, \lambda_1 \ge 0, \ldots, \lambda_m \ge 0$, with at least one $\lambda_k > 0$,
3. Complementary slackness: $g_i(\hat x) < 0 \Rightarrow \lambda_i = 0$, $1 \le i \le m$.
SVMs: formal derivation
Slater's condition: If there exists a strictly feasible point $z$ satisfying $g_1(z) < 0, \ldots, g_m(z) < 0$, then one can set $\lambda_0 = 1$.
Assume that Slater's condition holds. Minimizing the supremum $L^*(x) = \sup_{\lambda \ge 0} L(x, \lambda)$ is the primal problem P:
$$\hat x = \operatorname{argmin}_x L^*(x).$$
Note that
$$L^*(x) = \sup_{\lambda \ge 0}\left(f(x) + \sum_{i=1}^{m}\lambda_i g_i(x)\right) = \begin{cases} f(x), & \text{if } g_i(x) \le 0\ \forall i,\\ \infty, & \text{else.}\end{cases}$$
Minimizing $L^*(x)$ is equivalent to minimizing $f(x)$.
The maximizer of the dual problem D is $\hat\lambda = \operatorname{argmax}_\lambda L_*(\lambda)$, where $L_*(\lambda) = \inf_x L(x, \lambda)$.
SVMs: formal derivation
The non-negative number min(P) – max(D) is the duality gap.
Convexity and Slater’s condition imply strong duality:
1. The optimal solution $(\hat x, \hat\lambda)$ is a saddle point of $L(x, \lambda)$.
2. The duality gap is zero.
Discussion: For any real function $f(a,b)$, $\min_a[\max_b f(a,b)] \ge \max_b[\min_a f(a,b)]$. Equality holds iff a saddle value exists.
By Nicoguaro - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=20570051
Kernel functions
A kernel function is a real-valued function of two arguments, $k(x, x') \in \mathbb{R}$, for $x, x' \in \mathcal{X}$.
Typically the function is symmetric, and sometimes non-negative.
In the latter case, it might be interpreted as a measure of similarity.
Example: isotropic Gaussian kernel:
$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right).$$
Here, $\sigma^2$ is the bandwidth. This is an example of a radial basis function (RBF) kernel (only a function of $\|x - x'\|^2$).
Mercer kernels
A symmetric kernel is a Mercer kernel iff the Gram matrix
$$K = \begin{pmatrix} k(x_1,x_1) & \cdots & k(x_1,x_n)\\ \vdots & & \vdots\\ k(x_n,x_1) & \cdots & k(x_n,x_n)\end{pmatrix}$$
is positive semidefinite for any set of inputs $\{x_1, \ldots, x_n\}$.
Mercer's theorem: eigenvector decomposition $K = V\Lambda V^t = (V\Lambda^{1/2})(V\Lambda^{1/2})^t =: \Phi\Phi^t$.
Eigenvectors: columns of $V$. Eigenvalues: entries of the diagonal matrix $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$. Note that $\lambda_i \in \mathbb{R}$ and $\lambda_i \ge 0$.
Define $\phi(x_i)^t = i$-th row of $\Phi = V_{[i\bullet]}\Lambda^{1/2}$ $\Rightarrow$ $k(x_i, x_{i'}) = \phi(x_i)^t\phi(x_{i'})$.
Entries of $K$: inner products of feature vectors, implicitly defined by the eigenvectors $V$.
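The construction $K = \Phi\Phi^t$ can be checked numerically; a small sketch with an assumed Gaussian kernel on random inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))                           # 5 random inputs in R^3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
K = np.exp(-0.5 * sq)                                 # Gram matrix of a Gaussian kernel

lam, V = np.linalg.eigh(K)                            # eigenvalues >= 0 (up to rounding)
Phi = V * np.sqrt(np.clip(lam, 0.0, None))            # row i of Phi is phi(x_i)^t
print(np.allclose(Phi @ Phi.T, K))                    # K = Phi Phi^t
```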
Mercer kernels
If the kernel is Mercer, then there exists a map $\phi: \mathcal{X} \to \mathbb{R}^d$ such that $k(x,x') = \phi(x)^t\phi(x')$, where $\phi$ depends on the eigenfunctions of $k$ ($d$ might be infinite).
Example: polynomial kernel $k(x,x') = (1 + x^t x')^m$.
The corresponding feature vector contains all terms up to degree $m$.
Example: $m = 2$, $x \in \mathbb{R}^2$:
$$(1 + x^t x')^2 = 1 + 2x_1 x_1' + 2x_2 x_2' + (x_1 x_1')^2 + (x_2 x_2')^2 + 2 x_1 x_1' x_2 x_2'.$$
Thus, $\phi(x) = [1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2]^t$. Equivalent to working in a 6-dimensional feature space.
Gaussian kernel: feature map lives in an infinite dimensional space.
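The 6-dimensional feature map for the degree-2 polynomial kernel can be verified directly; a small numerical check with arbitrary test vectors:

```python
import numpy as np

def phi(x):
    # explicit feature map for (1 + x^t x')^2 with x in R^2
    return np.array([1.0,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     x[0] ** 2, x[1] ** 2,
                     np.sqrt(2) * x[0] * x[1]])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print((1 + x @ xp) ** 2, phi(x) @ phi(xp))   # both evaluate to the same number
```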
Kernels for documents
In document classification or retrieval, we want to compare two documents, $x_i$ and $x_{i'}$.
Bag-of-words representation: $x_{ij}$ is the number of times word $j$ occurs in document $i$.
One possible choice: cosine similarity:
$$k(x_i, x_{i'}) = \frac{x_i^t x_{i'}}{\|x_i\|\,\|x_{i'}\|} =: \phi(x_i)^t\phi(x_{i'}).$$
Problems:
- Popular words (like "the" or "and") are not discriminative $\Rightarrow$ remove these stop words.
- Bias: once a word is used in a document, it is very likely to be used again.
Solution: replace word counts with a "normalized" representation.
Kernels for documents
TF-IDF ("term frequency - inverse document frequency"):
Term frequency is a log-transform of the count: $\mathrm{tf}(x_{ij}) = \log(1 + x_{ij})$.
Inverse document frequency:
$$\mathrm{idf}(j) = \log\frac{\#(\text{documents})}{\#(\text{documents containing term } j)} = \log\frac{1}{\hat p_j},$$
the Shannon information content: idf is a measure of how much information a word provides.
Combine with the tf counts, weighted by information content:
$$\text{tf-idf}(x_i) = [\mathrm{tf}(x_{ij}) \cdot \mathrm{idf}(j)]_{j=1}^{V}, \quad \text{where } V = \text{size of vocabulary}.$$
We then use this inside the cosine similarity measure. With $\phi(x) = \text{tf-idf}(x)$:
$$k(x_i, x_{i'}) = \frac{\phi(x_i)^t\phi(x_{i'})}{\|\phi(x_i)\|\,\|\phi(x_{i'})\|}.$$
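A minimal sketch of the tf-idf cosine kernel on a toy term-count matrix (the counts and vocabulary size are made up for illustration):

```python
import numpy as np

# rows: documents, columns: vocabulary terms (raw counts x_ij, made up)
X = np.array([[3, 0, 1, 0],
              [0, 2, 0, 1],
              [1, 1, 0, 0]], dtype=float)

tf = np.log1p(X)                             # tf(x_ij) = log(1 + x_ij)
df = (X > 0).sum(axis=0)                     # number of documents containing term j
idf = np.log(X.shape[0] / df)                # idf(j)
Phi = tf * idf                               # tf-idf representation

Phi_n = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
K = Phi_n @ Phi_n.T                          # cosine-similarity kernel matrix
print(np.round(K, 3))
```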
String kernels
Real power of kernels arises for structured input objects.
Consider two strings $x$ and $x'$ of lengths $d$, $d'$, over an alphabet $\mathcal{A}$.
Idea: define similarity as the number of common substrings.
If $s$ is a substring of $x$, let $\phi_s(x)$ = number of times $s$ appears in $x$.
String kernel:
$$k(x, x') = \sum_{s\in\mathcal{A}^*} w_s\,\phi_s(x)\,\phi_s(x'),$$
where $w_s \ge 0$ and $\mathcal{A}^*$ = set of all strings (of any length) from $\mathcal{A}$.
One can show: this is a Mercer kernel and can be computed in $O(|x| + |x'|)$ time using suffix trees (Shawe-Taylor and Cristianini, 2004).
Special case: $w_s = 0$ for $|s| > 1$: the bag-of-characters kernel:
$\phi(x)$ counts the number of times each character in $\mathcal{A}$ occurs in $x$.
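A small sketch of the bag-of-characters special case ($w_s = 0$ for $|s| > 1$); the example strings are arbitrary and this brute-force version does not use suffix trees:

```python
from collections import Counter

def bag_of_chars_kernel(x, xp):
    """k(x, x') = sum over single characters s of phi_s(x) * phi_s(x')."""
    cx, cxp = Counter(x), Counter(xp)
    return sum(cx[s] * cxp[s] for s in set(cx) & set(cxp))

print(bag_of_chars_kernel("GSAQVKGHGKK", "GKKVADALTNAVAHV"))
```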
The kernel trick
Idea: modify the algorithm so that it replaces all inner products $x^t x'$ with a call to the kernel function $k(x, x')$.
Kernelized ridge regression: $\hat w = (X^t X + \lambda I)^{-1}X^t y$.
Matrix inversion lemma: $(I + UV)^{-1}U = U(I + VU)^{-1}$.
Define new variables $\alpha_i$:
$$\hat w = (X^t X + \lambda I)^{-1}X^t y = X^t\underbrace{(XX^t + \lambda I)^{-1}y}_{\hat\alpha} = \sum_{i=1}^{n}\hat\alpha_i x_i.$$
$\Rightarrow$ the solution is a linear combination of the $n$ training vectors.
The kernel trick
Use this and the kernel trick to make predictions for $x$:
$$\hat f(x) = \hat w^t x = \sum_{i=1}^{n}\hat\alpha_i\, x_i^t x = \sum_{i=1}^{n}\hat\alpha_i\, k(x_i, x).$$
Same for SVMs:
$$\hat w^t x = \sum_{i\in SV}\hat\alpha_i y_i\, x_i^t x = \sum_{i\in SV}\hat\alpha'_i\, k(x_i, x)$$
...and for most other classical algorithms in ML!
Some applications in bioinformatics
Bioinformatics: often non-vectorial data types:
- interaction graphs
- phylogenetic trees
- strings, e.g. GSAQVKGHGKKVADALTNAVAHV
Data fusion: convert data of each type into a kernel matrix
$\Rightarrow$ fuse kernel matrices
$\Rightarrow$ "common language" for heterogeneous data.
RBF kernels from expression data
Measurements (for each gene): a vector of expression values under different experimental conditions.
"Classical" RBF kernel: $k(x_1, x_2) = \exp(-\sigma\|x_1 - x_2\|^2)$.
Diffusion kernels from interaction-graphs
$A$: adjacency matrix, $D$: node degrees, $L = D - A$.
$$K := \tfrac{1}{Z(\beta)}\exp(-\beta L), \quad \text{with transition probability } \beta.$$
Physical interpretation (random walk): randomly choose the next node among the neighbors.
A self-transition occurs with probability $1 - d_i\beta$.
$K_{ij}$: probability for a walk from $i$ to $j$.
(Kondor and Lafferty, 2002)
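A minimal sketch of a diffusion kernel on a small interaction graph; the adjacency matrix is made up, and the normalization $Z(\beta)$ is omitted here (note that $\exp(-\beta L)$ already has rows summing to one, since $L\mathbf{1} = 0$):

```python
import numpy as np
from scipy.linalg import expm

# toy interaction graph on 4 nodes (assumed adjacency matrix)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))          # node degrees on the diagonal
L = D - A                           # graph Laplacian

beta = 0.3
K = expm(-beta * L)                 # diffusion kernel; Z(beta) normalization omitted
print(np.round(K, 3))               # K_ij: weight of diffusion from node i to node j
```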
Alignment kernels from sequences
Alignment with pair HMMs $\Rightarrow$ Mercer kernel (Watkins, 2000).
Image source: Durbin, Eddy, Krogh, Mitchison. Biological Sequence Analysis. Cambridge.
Combination of heterogeneous data
Adding kernels $\Rightarrow$ new kernel:
$$k_1(x,y) = \phi_1(x)\cdot\phi_1(y), \quad k_2(x,y) = \phi_2(x)\cdot\phi_2(y) \;\Rightarrow\; k' = k_1 + k_2 = \begin{pmatrix}\phi_1(x)\\ \phi_2(x)\end{pmatrix}\cdot\begin{pmatrix}\phi_1(y)\\ \phi_2(y)\end{pmatrix}.$$
Fusion & relevance determination: kernel combinations
$$K = c_1 K_1 + c_2 K_2 + c_3 K_3 + c_4 K_4.$$
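A minimal sketch of fusing two data types by a weighted sum of Gram matrices; the two toy data matrices and the weights $c_k$ are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
X_expr = rng.normal(size=(6, 10))           # e.g. expression profiles of 6 genes
X_counts = rng.poisson(2.0, size=(6, 4))    # e.g. substring counts for the same genes

def rbf_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * d2)

K1 = rbf_gram(X_expr)                       # kernel from expression data
K2 = X_counts @ X_counts.T                  # linear kernel from count features
c1, c2 = 0.7, 0.3                           # relevance weights (assumed)
K = c1 * K1 + c2 * K2                       # weighted sum is still a Mercer kernel
print(np.all(np.linalg.eigvalsh(K) > -1e-9))   # check positive semidefiniteness
```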
Section 8
Gaussian Processes: probabilistic kernel models
Overview
The use of the Gaussian distribution in ML:
- Properties of the multivariate Gaussian distribution
- Random variables $\to$ random vectors $\to$ stochastic processes
- Gaussian processes for regression
- Model selection
- Gaussian processes for classification
Relation to kernel models (e.g. SVMs). Relation to neural networks.
Kernel Ridge Regression
Kernelized ridge regression: $\hat w = (X^t X + \lambda I)^{-1}X^t y$.
Matrix inversion lemma: $(I + UV)^{-1}U = U(I + VU)^{-1}$.
Define new variables $\alpha_i$:
$$\hat w = (X^t X + \lambda I)^{-1}X^t y = X^t\underbrace{(XX^t + \lambda I)^{-1}y}_{\hat\alpha} = \sum_{i=1}^{n}\hat\alpha_i x_i.$$
Predictions for a new $x_*$:
$$\hat f(x_*) = \hat w^t x_* = \sum_{i=1}^{n}\hat\alpha_i\, x_i^t x_* = \sum_{i=1}^{n}\hat\alpha_i\, k(x_i, x_*).$$
Kernel Ridge Regression
[Figure: kernel ridge regression fit to $f(x) = \sin(x)/x$.]
Kernel function: $k(x_i, x_j) = \exp(-\frac{1}{2l^2}\|x_i - x_j\|^2)$.
How can we make use of the Gaussian distribution?
[Figure: samples from a 2D Gaussian and the corresponding density surface.]
Is it possible to fit a nonlinear regression line with the "boring" Gaussian distribution?
Yes, but we need to introduce the concept of Gaussian Processes!
The 2D Gaussian distribution
2D Gaussian: $P(y; \mu = 0, \Sigma = K) = \frac{1}{2\pi\sqrt{|K|}}\exp(-\frac{1}{2}y^t K^{-1}y)$.
Covariance (also written "co-variance") is a measure of how much two random variables vary together:
+1: perfect linear coherence,
-1: perfect negative linear coherence,
0: no linear coherence.
[Figure: samples $(y_1, y_2)$ from 2D Gaussians with $K = \begin{pmatrix}1&0\\0&1\end{pmatrix}$, $\begin{pmatrix}1&0.5\\0.5&1\end{pmatrix}$, $\begin{pmatrix}1&0.95\\0.95&1\end{pmatrix}$, $\begin{pmatrix}1&-0.8\\-0.8&1\end{pmatrix}$.]
Properties of the Multivariate Gaussian distribution
$y \sim \mathcal{N}(\mu, K)$. Let $y = \begin{pmatrix}y_1\\y_2\end{pmatrix}$ and $K = \begin{pmatrix}K_{11}&K_{12}\\K_{21}&K_{22}\end{pmatrix}$. Then $y_1 \sim \mathcal{N}(\mu_1, K_{11})$ and $y_2 \sim \mathcal{N}(\mu_2, K_{22})$.
[Figure: samples $(y_1, y_2)$ with $K = \begin{pmatrix}0.75&-0.2\\-0.2&0.25\end{pmatrix}$ and the marginal densities.]
Marginals of Gaussians are again Gaussian!
Properties of the Multivariate Gaussian distribution (2)
$y \sim \mathcal{N}(\mu, K)$. Let $y = \begin{pmatrix}y_1\\y_2\end{pmatrix}$ and $K = \begin{pmatrix}K_{11}&K_{12}\\K_{21}&K_{22}\end{pmatrix}$. Then
$$y_2\,|\,y_1 \sim \mathcal{N}\bigl(\mu_2 + K_{21}K_{11}^{-1}(y_1 - \mu_1),\; K_{22} - K_{21}K_{11}^{-1}K_{12}\bigr).$$
[Figure: conditional density of a 2D Gaussian.]
Conditionals of Gaussians are again Gaussian!
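The conditioning formula can be checked with a tiny numerical sketch (the covariance matrix and the conditioning value $y_1 = 1$ are arbitrary):

```python
import numpy as np

mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.95],
              [0.95, 1.0]])
y1 = 1.0                                           # observed value of the first component

cond_mean = mu[1] + K[1, 0] / K[0, 0] * (y1 - mu[0])
cond_var = K[1, 1] - K[1, 0] / K[0, 0] * K[0, 1]
print(cond_mean, cond_var)                         # 0.95 and 0.0975
```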
2D Gaussians: a new visualization
top left: mean and $\pm$ std.dev. of $p(y_2|y_1 = 1)$.
bottom left: $p(y_2|y_1 = 1)$ and samples drawn from it.
top right: x-axis: indices (1, 2) of the dimensions, y-axis: value in each component. Shown are $y_1 = 1$, the conditional mean of $p(y_2|y_1 = 1)$, and the std.dev.
bottom right: samples drawn from the above model.
Visualizing high-dimensional Gaussians
top left: 6 samples drawn from a 5-dimensional Gaussian with zero mean (indicated by the blue line), $\sigma = 1$ (magenta line).
bottom left: conditional mean and std.dev. of $p(y_4, y_5\,|\,y_1 = -1, y_2 = 0, y_3 = 0.5)$.
top right: contour lines of $p(y_4, y_5\,|\,y_1 = -1, y_2 = 0, y_3 = 0.5)$.
bottom right: samples drawn from the above model.
From covariance matrices to Gaussian processes
top left: 8 samples, 6 dimensions; x-axis: dimension indices.
bottom left: the same 8 samples, viewed as values $y = f(x)$.
Construction: choose 6 input points $x_i$ at random, build the covariance matrix $K$ with the covariance function $k(x,x') = \exp(-\frac{1}{2l^2}\|x - x'\|^2)$, draw $f \sim \mathcal{N}(0, K)$, plot as a function of the inputs.
top right: same for 12 inputs. bottom right: 100 inputs.
This looks similar to Kernel Regression...
[Figure: kernel ridge regression fit to $f(x) = \sin(x)/x$, for comparison.]
Gaussian Processes
Gaussian random variable (RV): $f \sim \mathcal{N}(\mu, \sigma^2)$.
Gaussian random vector: collection of $n$ RVs, characterized by a mean vector and a covariance matrix: $f \sim \mathcal{N}(\mu, \Sigma)$.
Gaussian process: infinite Gaussian random vector, every finite subset of which is jointly Gaussian distributed.
Continuous index, e.g. time $t$ $\Rightarrow$ a function $f(t)$.
Fully specified by the mean function $m(t) = E[f(t)]$ and the covariance function $k(t,t') = E[(f(t) - m(t))(f(t') - m(t'))]$.
In ML, we will focus on more general index sets $x \in \mathbb{R}^d$ with mean function $m(x)$ and covariance function $k(x,x')$:
$$f(x) \sim \mathcal{GP}(m(x), k(x,x')).$$
Visualizing Gaussian Processes: Sampling
Problem: working with infinite vectors and covariance matrices is not very intuitive...
Solution: evaluate the GP at a set of $n$ discrete times (or input vectors $x \in \mathbb{R}^d$):
- Choose $n$ input points $x_i$ at random $\Rightarrow$ matrix $X$
- Build the covariance matrix $K(X,X)$ with the covariance function $k(x_i, x_j)$
- Sample realizations of the Gaussian random vector $f \sim \mathcal{N}(0, K(X,X))$
- Plot $f$ as a function of the inputs.
This is exactly what we have done in the figures above; a minimal code sketch follows below.
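In code, the sampling recipe might look like the following sketch; the length scale, the number of inputs, and the jitter term are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, l = 100, 1.0
x = np.sort(rng.uniform(0, 7, n))                                  # n input points
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * l ** 2))         # covariance matrix
f = rng.multivariate_normal(np.zeros(n), K + 1e-9 * np.eye(n))     # one sample path
print(np.round(f[:5], 3))                                          # plot f against x in practice
```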
From the Prior to the Posterior
A GP defines a distribution over functions: $f$ evaluated at the training points $X$ and $f_*$ evaluated at the test points $X_*$ are jointly Gaussian:
$$\begin{bmatrix} f\\ f_*\end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X,X) & K(X,X_*)\\ K(X_*,X) & K(X_*,X_*)\end{bmatrix}\right)$$
Posterior $p(f_*|X_*, X, f(X))$: conditional of a Gaussian distribution.
Let $x \sim \mathcal{N}(\mu, K)$ with $x = \begin{pmatrix}x_1\\x_2\end{pmatrix}$ and $K = \begin{pmatrix}K_{11}&K_{12}\\K_{21}&K_{22}\end{pmatrix}$. Then $x_2|x_1 \sim \mathcal{N}\bigl(\mu_2 + K_{21}K_{11}^{-1}(x_1 - \mu_1),\; K_{22} - K_{21}K_{11}^{-1}K_{12}\bigr)$.
$$f_*\,|\,X_*, X, f \sim \mathcal{N}\bigl(K(X_*,X)K(X,X)^{-1}f,\; K(X_*,X_*) - K(X_*,X)K(X,X)^{-1}K(X,X_*)\bigr)$$
For only one test case:
$$f_*\,|\,x_*, X, f \sim \mathcal{N}(k_*^t K^{-1}f,\; k_{**} - k_*^t K^{-1}k_*)$$
A simple extension: noisy observations
Assume we have access only to noisy versions of the function values:
$$y = f(x) + \eta, \quad \eta \sim \mathcal{N}(0, \sigma^2)$$
(cf. the initial example of ridge regression). The noise $\eta$ does not depend on the data!
The covariance of the noisy observations $y$ is the sum of the covariance of $f$ and the noise variance: $\operatorname{cov}(y) = K(X,X) + \sigma^2 I$.
$$\begin{bmatrix} y\\ f_*\end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X,X) + \sigma^2 I & K(X,X_*)\\ K(X_*,X) & K(X_*,X_*)\end{bmatrix}\right)$$
$$f_*\,|\,X_*, X, y \sim \mathcal{N}\bigl(K(X_*,X)(K(X,X) + \sigma^2 I)^{-1}y,\; K(X_*,X_*) - K(X_*,X)(K(X,X) + \sigma^2 I)^{-1}K(X,X_*)\bigr)$$
$$f_*\,|\,x_*, X, y \sim \mathcal{N}\bigl(k_*^t(K + \sigma^2 I)^{-1}y,\; k_{**} - k_*^t(K + \sigma^2 I)^{-1}k_*\bigr)$$
$\Rightarrow$ The posterior mean is the solution of kernel ridge regression!
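A minimal sketch of GP regression with noisy observations, implementing the posterior mean and covariance formulas above; the $\sin(x)/x$ data, length scale, and noise level follow the running example but are assumptions:

```python
import numpy as np

def k(A, B, l=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * l ** 2))

rng = np.random.default_rng(6)
X = np.linspace(-10, 10, 11)                            # 11 training inputs
y = np.sinc(X / np.pi) + rng.normal(0, 0.1, X.shape)    # sin(x)/x plus noise
Xs = np.linspace(-10, 10, 100)                          # test inputs

sigma2 = 0.01
Ky = k(X, X) + sigma2 * np.eye(len(X))
mean = k(Xs, X) @ np.linalg.solve(Ky, y)                # posterior mean
cov = k(Xs, Xs) - k(Xs, X) @ np.linalg.solve(Ky, k(X, Xs))
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))         # predictive std.dev. of f*
print(mean[:3], std[:3])
```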
Noisy observations: examples
[Figure: noisy observations of $f(x) = 0.5x$ and of $f(x) = \sin(x)/x$, with noise $\eta \sim \mathcal{N}(0, \sigma^2)$.]
Noisy observations: $y = f(x) + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2)$. Mean predictions: $\hat f_* = K_*(K + \sigma^2 I)^{-1}y$.
Gaussian processes for regression
[Figure, three panels: "now with some noise..." (data with mean prediction and error band), "Posterior sample", "Prior samples".]
Left: 11 training points generated as $y = \sin(x)/x + \nu$, $\nu \sim \mathcal{N}(0, 0.01)$; covariance $k(x_p, x_q) = \exp(-\frac{1}{2l^2}\|x_p - x_q\|^2) + \sigma^2\delta_{p,q}$; 100 test points uniformly chosen from $[-10, 10]$ $\Rightarrow$ matrix $X_*$; mean prediction $E[f_*|X_*,X,y]$ and $\pm$ std.dev.
Middle: samples drawn from the posterior $f_*|X_*,X,y$.
Right: samples drawn from the prior $f \sim \mathcal{N}(0, K(X,X))$.
Covariance Functions
A GP specifies a distribution over functions $f(x)$, characterized by a mean function $m(x)$ and a covariance function $k(x_i, x_j)$.
A finite subset evaluated at $n$ inputs follows a Gaussian distribution:
$$f(X) = (f(x_1), \ldots, f(x_n))^t \sim \mathcal{N}(\mu, K),$$
where $K$ is the covariance matrix with entries $K_{ij} = k(x_i, x_j)$.
Covariance matrices are symmetric positive semi-definite:
$$K_{ij} = K_{ji} \quad \text{and} \quad x^t K x \ge 0 \;\;\forall x.$$
We already know that Mercer kernels have this property $\Rightarrow$ all Mercer kernels define proper covariance functions in GPs.
Kernels frequently have additional parameters.
The noise variance in the observation model $y = f(x) + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2)$, is another parameter.
How should we choose these parameters? $\Rightarrow$ model selection.
Model Selection
top left: sample function from the prior $f \sim \mathcal{N}(0, K(X,X))$ with covariance function $k(x,x') = \exp(-\frac{1}{2l^2}\|x - x'\|^2)$. Length scale $l = 10^{-0.5}$: small $\Rightarrow$ highly varying function.
bottom left: same for $l = 10^{0}$ $\Rightarrow$ smoother function.
top right: same for $l = 10^{0.5}$ $\Rightarrow$ even smoother...
bottom right: almost linear function for $l = 10^{1}$.
Model Selection (2)
How to select the parameters?
One possibility: maximize the marginal likelihood:
$$p(y|X) = \int p(y|f,X)\,p(f|X)\,df.$$
We do not need to integrate: we know that $f|X \sim \mathcal{N}(0, K)$ and $y = f + \eta$, $\eta \sim \mathcal{N}(0, \sigma^2 I)$.
Since $\eta$ does not depend on $X$, the variances simply add:
$$y|X \sim \mathcal{N}(0, K + \sigma^2 I).$$
Possible strategy: evaluate the marginal likelihood for parameters on a grid and choose the maximum; a sketch follows below.
Or: compute derivatives of the (log) marginal likelihood and use gradient-based optimization.
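A sketch of the grid-search strategy: evaluate the log marginal likelihood $\log \mathcal{N}(y\,|\,0, K + \sigma^2 I)$ for several length scales and keep the best; the data, grid, and noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.linspace(-5, 5, 25)
y = np.sinc(X / np.pi) + rng.normal(0, 0.1, X.shape)   # toy targets
sigma2 = 0.01

def log_marginal_likelihood(l):
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * l ** 2))
    C = K + sigma2 * np.eye(len(X))                    # y | X ~ N(0, K + sigma^2 I)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + len(X) * np.log(2 * np.pi))

grid = 10.0 ** np.linspace(-0.5, 1.0, 7)               # candidate length scales
best = max(grid, key=log_marginal_likelihood)
print("best length scale:", round(best, 3))
```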