Wissenschaftliches Rechnen II/Scientific Computing II
Summer semester 2016
Prof. Dr. Jochen Garcke, Dipl.-Math. Sebastian Mayer
Exercise sheet 6
To be handed in on Thursday, 02.06.2016
1 Some very basic probability theory
Let $(X, Y)$ be a tuple of random variables, each taking values in $\mathbb{R}$, with joint probability density $p(x, y)$, that is,
$$P[(X, Y) \le (x_0, y_0)] = \int_{-\infty}^{x_0} \int_{-\infty}^{y_0} p(x, y)\, dy\, dx.$$
The marginal density of $X$ is given by $p_X(x) = \int_{\mathbb{R}} p(x, y)\, dy$. The expectation of $X$ is given by $E[X] = \int_{\mathbb{R}} x\, p_X(x)\, dx$. The covariance of $X, Y$ is defined as $\mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$.
The conditional density of $X$ given that we have observed $Y = y_0$ (which can happen if $p_Y(y_0) > 0$) is defined by
$$p(x \mid y_0) = \frac{p(x, y_0)}{p_Y(y_0)}.$$
The random variables $X, Y$ are said to be independent if $p(x, y) = p_X(x)\, p_Y(y)$. Bayes' rule states
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}.$$
A multivariate Gaussian random vector $X$ with mean $\mu \in \mathbb{R}^d$ and symmetric, positive definite covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ has the probability density
$$p(x) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$
We write $X \sim \mathcal{N}(\mu, \Sigma)$.
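The density formula can be sanity-checked numerically. The following sketch (not part of the sheet) evaluates it with NumPy and compares against `scipy.stats.multivariate_normal`; the concrete values of $\mu$ and $\Sigma$ are arbitrary illustration choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Evaluate the N(mu, Sigma) density at x via the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([1.0, -2.0])                    # illustration values
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # symmetric, positive definite
x = np.array([0.3, -1.5])

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))  # should agree
```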
2 Group exercises
G 1. (Bayesian analysis of linear regression) Consider the standard linear regression model
$$y_i = x_i^T w + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$ is the matrix of given input vectors, $w \in \mathbb{R}^d$ the unknown weight vector, and the $\varepsilon_i$ are i.i.d. with $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$.
a) Determine the probability density of $y = (Y_1, \ldots, Y_n)$.
b) The Bayesian approach is to specify a prior distribution over $w$, which expresses the belief about the value of $w$ before observing the data. Assume $w \sim \mathcal{N}(0, \Sigma_p)$ with covariance matrix $\Sigma_p \in \mathbb{R}^{d \times d}$. Derive via Bayes' rule the posterior density of $W$, which expresses our beliefs about the value of $w$ after observing the concrete data $y = (y_1, \ldots, y_n)$. Determine also the posterior density $p(y_* \mid y)$ of the predicted value $y_* = x_*^T w$ given a new data point $x_*$.
c) Show that $E[y_*] = x_*^T \Sigma_p X (K + \sigma_n^2 I_n)^{-1} y$, where $K = X^T \Sigma_p X$. Make a connection between the Bayesian approach and regularization.
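Not part of the exercise, but the identity in part c) can be checked numerically. The sketch below assumes the prior $\Sigma_p = \tau^2 I_d$ (an illustration choice) and computes the predictive mean once via the formula from part c) and once as a ridge-regression prediction with $\lambda = \sigma_n^2 / \tau^2$, which is one way to see the connection to regularization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.normal(size=(d, n))          # columns are the inputs x_i
w_true = rng.normal(size=d)
sigma_n = 0.1                        # noise std (sigma_n on the sheet)
y = X.T @ w_true + sigma_n * rng.normal(size=n)

tau2 = 1.0                           # assumed prior: Sigma_p = tau2 * I_d
Sigma_p = tau2 * np.eye(d)
x_star = rng.normal(size=d)

# Part c): E[y_*] = x_*^T Sigma_p X (K + sigma_n^2 I)^{-1} y, K = X^T Sigma_p X
K = X.T @ Sigma_p @ X
mean_bayes = x_star @ Sigma_p @ X @ np.linalg.solve(K + sigma_n**2 * np.eye(n), y)

# Ridge regression with lambda = sigma_n^2 / tau2 gives the same prediction.
lam = sigma_n**2 / tau2
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
print(mean_bayes, x_star @ w_ridge)  # the two numbers should agree
```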
G 2. You are given a random vector $U \sim \mathcal{N}(0, I_d)$, that is, $U$ is standard normally distributed and takes values in $\mathbb{R}^d$. For given mean $m \in \mathbb{R}^d$ and covariance $K \in \mathbb{R}^{d \times d}$, find a transformation $\varphi : \mathbb{R}^d \to \mathbb{R}^d$ such that $\varphi(U) \sim \mathcal{N}(m, K)$.
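One standard construction for such a $\varphi$ uses a Cholesky factor of $K$; the sketch below (the values of $m$ and $K$ are arbitrary illustration choices) checks it by sampling. Any matrix $L$ with $LL^T = K$ would work in place of the Cholesky factor.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
m = np.array([1.0, -1.0])                # target mean (illustration values)
K = np.array([[2.0, 0.8], [0.8, 1.0]])   # target covariance, s.p.d.

L = np.linalg.cholesky(K)                # K = L L^T
U = rng.normal(size=(d, 100_000))        # samples of U ~ N(0, I_d)
samples = m[:, None] + L @ U             # phi(u) = m + L u

print(samples.mean(axis=1))              # ~ m
print(np.cov(samples))                   # ~ K
```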
G 3. (Lemma 46 revisited)
Assume to be given data $(x_1, y_1), \ldots, (x_n, y_n)$ and a Hilbert space $H$ with kernel $k$. Let $f_{(x_n, y_n)}$ be the solution of
$$\min_{f \in H} \sum_{i=1}^{n-1} (f(x_i) - y_i)^2 + \lambda \|f\|_k^2.$$
Let $\tilde{y}_n = f_{(x_n, y_n)}(x_n)$. Give an alternative proof of Lemma 46 based on the representer theorem. To this end, consider the system of linear equations $(K + \lambda I_n)\tilde{\alpha} = \tilde{y}$, where $\tilde{y}_i = y_i$ for $i < n$, and show that $\tilde{\alpha}_n = 0$.
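The claim can be tested numerically before proving it. The sketch below uses kernel ridge regression with a Gaussian kernel and random data; the kernel, bandwidth, and data are arbitrary choices, since the sheet does not fix them.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 10, 0.1
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

def gram(a, b):
    """Gaussian kernel k(s, t) = exp(-(s - t)^2 / 0.1), an arbitrary choice."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / 0.1)

# Fit on the first n-1 points only.
K1 = gram(x[:-1], x[:-1])
alpha1 = np.linalg.solve(K1 + lam * np.eye(n - 1), y[:-1])
y_tilde_n = gram(x[-1:], x[:-1]) @ alpha1   # prediction y~_n at x_n

# Full n-point system with the modified label y~_n in the last slot.
y_tilde = np.append(y[:-1], y_tilde_n)
K = gram(x, x)
alpha_tilde = np.linalg.solve(K + lam * np.eye(n), y_tilde)
print(alpha_tilde[-1])                      # ~ 0, as the exercise claims
```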
3 Homework
H 1. (Smoothing spline)
For given data $(x_1, y_1), \ldots, (x_n, y_n)$ with $x_0 = 0 < x_1 < x_2 < \cdots < x_n < 1$ and $x_i \in [0, 1]$, and regularization parameter $\lambda > 0$, consider the problem
$$\min_{f \in W_2([0,1])} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int_0^1 (f''(x))^2\, dx.$$
a) Give an explicit formula for the kernel $R_1(x, y) = \int_0^1 G_2(x, z)\, G_2(y, z)\, dz$, where $G_2$ is the Green's function computed in Exercise G3 on Sheet 5. (A numerical sanity check is sketched after this exercise.)
b) Show that the optimal solution $\hat{f}_\lambda$ has a representation $\hat{f}_\lambda(x) = \beta_0 \phi_0(x) + \beta_1 \phi_1(x) + \sum_{i=1}^{n} \alpha_i R_1(x_i, x)$. Specify $\phi_0, \phi_1$ and show that $\beta_0, \beta_1$ are unique.
c) Show that $\hat{f}_\lambda$ is a polynomial of degree 3 on every interval $[x_i, x_{i+1}]$ for $i = 0, \ldots, n-1$ and a polynomial of degree 1 on $[x_n, 1]$.
d) To what does the solution $\hat{f}_\lambda$ reduce in the limits $\lambda \to 0$ and $\lambda \to \infty$? You do not have to provide a proof; just give some plausible arguments.

(6 points)
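As a complement to part a), $R_1$ can be evaluated by numerical quadrature and used to check any explicit formula pointwise. The sketch below assumes $G_2(x, z) = (x - z)_+$, the Green's function typically obtained for $u'' = f$ with $u(0) = u'(0) = 0$; since Sheet 5 is not reproduced here, treat this choice as an assumption.

```python
import numpy as np
from scipy.integrate import quad

def G2(x, z):
    """Assumed Green's function from Sheet 5: G2(x, z) = max(x - z, 0)."""
    return np.maximum(x - z, 0.0)

def R1(x, y):
    """R_1(x, y) = integral over [0, 1] of G2(x, z) G2(y, z) dz, by quadrature."""
    val, _ = quad(lambda z: G2(x, z) * G2(y, z), 0.0, 1.0)
    return val

# Generic sanity checks that any kernel must pass: R_1 is symmetric
# and its Gram matrix is positive semidefinite.
xs = np.linspace(0.1, 0.9, 5)
gram = np.array([[R1(a, b) for b in xs] for a in xs])
print(np.allclose(gram, gram.T))                  # symmetry
print(np.linalg.eigvalsh(gram).min() >= -1e-12)   # positive semidefinite
```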
H 2. (Cross-validation)
Provide a proof for Theorem 47 presented in the lecture. Hint 1: According to Lemma 46, we know that we obtain $f_{D_v}$ by learning on the modified data vector $\tilde{y}_{D_v} \in \mathbb{R}^N$ given by
$$\tilde{y}_{D_v} = y - I_{D_v} y + I_{D_v} f_{D_v},$$
where $f_{D_v}$ denotes the vector of predictions $(f_{D_v}(x_1), \ldots, f_{D_v}(x_N))^T$.
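Not part of the proof, but the hint can be verified numerically for kernel ridge regression; the Gaussian kernel, $\lambda$, and the validation index set $D_v$ below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, lam = 12, 0.1
x = rng.uniform(size=N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)

def gram(a, b):
    """Gaussian kernel, an arbitrary choice for illustration."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / 0.1)

Dv = np.array([2, 5, 9])                  # validation indices (arbitrary)
train = np.setdiff1d(np.arange(N), Dv)

# f_{D_v}: learn on the training part only, predict at all N points.
a_tr = np.linalg.solve(gram(x[train], x[train]) + lam * np.eye(len(train)), y[train])
f_Dv = gram(x, x[train]) @ a_tr

# Modified labels: keep y outside D_v, replace it by f_{D_v} on D_v.
y_tilde = y.copy()
y_tilde[Dv] = f_Dv[Dv]

# Learning on all N points with the modified labels reproduces f_{D_v}.
a_full = np.linalg.solve(gram(x, x) + lam * np.eye(N), y_tilde)
print(np.allclose(gram(x, x) @ a_full, f_Dv))   # True
```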