
Wissenschaftliches Rechnen II/Scientific Computing II

Summer semester 2016
Prof. Dr. Jochen Garcke, Dipl.-Math. Sebastian Mayer

Exercise sheet 7
To be handed in on Thursday, 07.06.2016

1 Model Selection

In previous programming exercises you dealt with the regularization parameter, with $k$-fold cross-validation to choose a good regularization parameter, and with the computation of error measures. This use of cross-validation and error computation was somewhat ad hoc. We will use this exercise sheet and its programming exercises to put the estimation of the regularization parameter on more solid ground from the viewpoint of statistical learning theory. To this end, imagine the following situation. You are given some data $D_{\mathrm{train}} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $x_i \in \Omega \subseteq \mathbb{R}$, $y_i \in \mathbb{R}$, and you conjecture that there is some function $f : \Omega \to \mathbb{R}$ which explains the data, i.e., the sample value $y_i$ is the function value $f(x_i)$ perturbed by noise. Now you want to use some regularized kernel regression procedure to learn $f$. There are two different problems you have to address:

• Model selection: estimating the performance of different regularization parameters in order to choose the best one.

• Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

In order to do model selection and assessment, we use the following statistical model for how the data have been generated. We assume that the data $(x_1, y_1), \ldots, (x_n, y_n)$ are realizations of $n$ i.i.d. copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of $(X, Y)$, where $X$ is a random variable taking values in $\Omega$ and $Y$ is a random variable taking values in $\mathbb{R}$, which is given by
\[ Y = f(X) + \varepsilon, \]
where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma > 0$ fixed, is independent of $X$. The regularized kernel regression procedure can now be considered as a learning method
\[ L : \bigcup_{n \in \mathbb{N}} (\Omega \times \mathbb{R})^n \to \{f : \Omega \to \mathbb{R}\}, \]
which maps given training data $D_{\mathrm{train}}$ to a regression fit $L(D_{\mathrm{train}}) = \hat{f} : \Omega \to \mathbb{R}$.
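For concreteness, a minimal sketch of this setup in Python (assuming NumPy and scikit-learn; the true function $f$, the noise level $\sigma$, and the kernel parameters are illustrative choices, not part of the exercise):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Statistical model: Y = f(X) + eps, eps ~ N(0, sigma^2), independent of X.
f = lambda x: np.sin(2 * np.pi * x)      # hypothetical "true" function f
sigma = 0.3                               # hypothetical noise level
n = 50

X = rng.uniform(0.0, 1.0, size=n)         # X takes values in Omega = [0, 1]
Y = f(X) + sigma * rng.normal(size=n)     # perturbed function values

# The learning method L: training data D_train -> regression fit f_hat.
def L(X_train, y_train, lam=1e-2):
    model = KernelRidge(alpha=lam, kernel="rbf", gamma=10.0)
    model.fit(X_train.reshape(-1, 1), y_train)
    return lambda x: model.predict(np.asarray(x).reshape(-1, 1))

f_hat = L(X, Y)
print("f(0.5) =", f(0.5), " f_hat(0.5) =", f_hat([0.5])[0])
```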

G 1. (Bias-variance decomposition)

Consider the squared loss $\ell_2(y, t) = (y - t)^2$. Let $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. The expected test error of a learning method $L$ at a fixed new input $X = x$ is given by
\[ \operatorname{err}(L, x) := \mathbb{E}[\ell_2(Y, \hat{f}(X)) \mid X = x] = \mathbb{E}[\ell_2(Y, \hat{f}(x))], \]
where $\hat{f}(X) = L(D)(X)$. Show that $\operatorname{err}(L, x)$ can be decomposed as follows
\[ \operatorname{err}(L, x) = \sigma^2 + (\operatorname{bias}(L, x))^2 + \operatorname{var}(L, x) \]
with irreducible error $\sigma^2$, bias term $\operatorname{bias}(L, x) = f(x) - \mathbb{E}[\hat{f}(x)]$, and variance term $\operatorname{var}(L, x) = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]$. Try to explain what kind of error each of the three terms describes.
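A possible way to see the three terms numerically is a small Monte Carlo experiment (a sketch only, reusing the illustrative $f$, $\sigma$, and kernel ridge learner from above; none of these choices are prescribed by the exercise):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # hypothetical true function
sigma, n, x0, reps = 0.3, 50, 0.25, 2000

f_hats, test_errs = [], []
for _ in range(reps):
    # Fresh training set D and a fresh test response Y at X = x0.
    X = rng.uniform(0.0, 1.0, size=n)
    Y = f(X) + sigma * rng.normal(size=n)
    model = KernelRidge(alpha=1e-2, kernel="rbf", gamma=10.0)
    model.fit(X.reshape(-1, 1), Y)
    fx0 = model.predict([[x0]])[0]
    y_new = f(x0) + sigma * rng.normal()
    f_hats.append(fx0)
    test_errs.append((y_new - fx0) ** 2)

f_hats = np.array(f_hats)
bias = f(x0) - f_hats.mean()
var = f_hats.var()
print("err(L, x0)            ", np.mean(test_errs))
print("sigma^2 + bias^2 + var", sigma**2 + bias**2 + var)
```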

(2)

G 2. (Extra- vs. in-sample error)

Let $\tilde{Y}_1, \ldots, \tilde{Y}_n$ be an independent copy of $Y_1, \ldots, Y_n$ and $L$ some learning method. Let
\[ R_{\ell_2,\mathrm{in}}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[\ell_2(\tilde{Y}_i, f(x_i))] \]
be the in-sample risk. The expected in-sample error for given sampling points $x_1, \ldots, x_n$ is defined as
\[ \operatorname{err}_{\mathrm{in}}(L, x_1, \ldots, x_n) = \mathbb{E}[R_{\ell_2,\mathrm{in}}(L(D)) \mid X_1 = x_1, \ldots, X_n = x_n], \]
where $D$ is defined as in G1.

a) Let $P = P_{Y|X} \cdot P_X$ and $\hat{f} = L(D_{\mathrm{train}})$. What is the difference between the risk $R_{\ell_2,P}(\hat{f})$, the empirical risk $R_{\ell_2,\mathrm{emp}}(\hat{f})$, the in-sample risk $R_{\ell_2,\mathrm{in}}(\hat{f})$, the expected in-sample error $\operatorname{err}_{\mathrm{in}}(L, x_1, \ldots, x_n)$, and the expected test error $\operatorname{err}(L, x)$?

b) Show that $\mathbb{E}[R_{\ell_2,P}(L(D))] = \mathbb{E}[\operatorname{err}(L, X)]$.

c) Let $\hat{f} = L(D)$. Show that
\[ \operatorname{err}_{\mathrm{in}}(L, x_1, \ldots, x_n) = \mathbb{E}[R_{\ell_2,\mathrm{emp}}(\hat{f}) \mid X_1 = x_1, \ldots, X_n = x_n] + \frac{2}{n} \sum_{i=1}^{n} \operatorname{cov}(Y_i, \hat{f}(x_i)). \]
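The covariance term is the "optimism" of the empirical risk. A rough Monte Carlo check of this identity for a fixed design and the kernel ridge learner used above (again only a sketch under the same illustrative assumptions):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)   # hypothetical true function
sigma, n, reps = 0.3, 30, 4000
x = np.linspace(0.0, 1.0, n)           # fixed sampling points x_1, ..., x_n

Ys, Fs, emp, ins = [], [], [], []
for _ in range(reps):
    Y = f(x) + sigma * rng.normal(size=n)          # training responses
    Y_tilde = f(x) + sigma * rng.normal(size=n)    # independent copy
    model = KernelRidge(alpha=1e-2, kernel="rbf", gamma=10.0)
    model.fit(x.reshape(-1, 1), Y)
    F = model.predict(x.reshape(-1, 1))            # f_hat(x_i)
    Ys.append(Y)
    Fs.append(F)
    emp.append(np.mean((Y - F) ** 2))              # empirical risk
    ins.append(np.mean((Y_tilde - F) ** 2))        # in-sample risk

Ys, Fs = np.array(Ys), np.array(Fs)
cov_sum = np.sum(np.mean((Ys - Ys.mean(0)) * (Fs - Fs.mean(0)), axis=0))
print("err_in           ", np.mean(ins))
print("E[emp] + 2/n*cov ", np.mean(emp) + 2.0 / n * cov_sum)
```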

2 Support Vector Machines

G 3. (Geometrical interpretation of slack variables)

Consider the soft-margin SVM
\[ \min_{w \in \mathbb{R}^2,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \; \frac{1}{2} w^T w + C \sum_{j=1}^{n} \xi_j \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n. \]
Give a geometric interpretation of the slack variables $\xi_1, \ldots, \xi_n$. To this end, fix some feasible vector $w \in \mathbb{R}^2$, assuming w.l.o.g. $w_1 < 0$ and $w_2 > 0$. Furthermore, consider w.l.o.g. the first data point $(x, y) = (x_1, y_1)$ with $x = (x_1, x_2)$. Now derive the geometrical interpretation by considering when $\xi > 0$ is required in the linear constraint $y(w^T x + b) \ge 1 - \xi$.
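As a hint towards the interpretation: for fixed $w$ and $b$, the smallest feasible slack is $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$, and for a margin violator $\xi_i / \|w\|$ is its distance to its own margin hyperplane. A small sketch with made-up $w$, $b$, and data, purely for illustration:

```python
import numpy as np

# Hypothetical fixed separator (w_1 < 0, w_2 > 0 as in the exercise) and data.
w = np.array([-1.0, 2.0])
b = 0.5
X = np.array([[1.0, 1.5], [0.5, -0.2], [-1.0, 0.3]])
y = np.array([+1, -1, +1])

margins = y * (X @ w + b)                 # y_i (w^T x_i + b)
xi = np.maximum(0.0, 1.0 - margins)       # smallest slack satisfying the constraint
dist_to_margin = xi / np.linalg.norm(w)   # geometric distance for violators

for m, s, d in zip(margins, xi, dist_to_margin):
    status = "outside margin" if s == 0 else "violates its margin"
    print(f"y(w^T x + b) = {m:+.2f}  xi = {s:.2f}  dist = {d:.2f}  ({status})")
```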

3 Homework

H 1. Let $\mathcal{H}$ be a Hilbert space of functions over $\Omega \subseteq \mathbb{R}$ and $k : \Omega \times \Omega \to \mathbb{R}$ the reproducing kernel of $\mathcal{H}$. Consider once again the regularized kernel regression problem
\[ L(D_{\mathrm{train}}) = \operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_k^2 \]
for given data $D_{\mathrm{train}} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.

a) Give explicit formulas for the bias and the variance term from G1.

b) Let $K = (k(x_i, x_j))_{i,j=1,\ldots,n}$ and $G = (K + \lambda I_n)^{-1}$. Consider the smoothing matrix $S_\lambda = KG$. Show that for the covariance term $\sum_{i=1}^{n} \operatorname{cov}(Y_i, \hat{f}(x_i))$, which appears in G2 c), we have
\[ \sum_{i=1}^{n} \operatorname{cov}(Y_i, \hat{f}(x_i)) = \operatorname{trace}(S_\lambda)\, \sigma^2. \]

(5 points)
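A short numerical sanity check of this formula, treating $\hat{f}(x_i) = (S_\lambda y)_i$ as in the definition above (the Gaussian kernel, $\lambda$, $\sigma$, and the true function are illustrative assumptions; $\operatorname{trace}(S_\lambda)$ also serves as the effective degrees of freedom of the smoother):

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam, sigma = 30, 1e-2, 0.3
x = np.linspace(0.0, 1.0, n)
f = lambda t: np.sin(2 * np.pi * t)       # hypothetical true function

# Gaussian kernel matrix K, G = (K + lam*I)^(-1), smoothing matrix S = K G.
K = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)
S = K @ np.linalg.inv(K + lam * np.eye(n))

# Closed form: sum_i cov(Y_i, f_hat(x_i)) = trace(S) * sigma^2.
print("trace(S) * sigma^2:", np.trace(S) * sigma**2)

# Monte Carlo estimate of the same covariance sum (f_hat = S Y for this smoother).
Ys = f(x) + sigma * rng.normal(size=(5000, n))
Fs = Ys @ S.T
cov_sum = np.sum(np.mean((Ys - Ys.mean(0)) * (Fs - Fs.mean(0)), axis=0))
print("Monte Carlo       :", cov_sum)
```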


H 2. (ν-support vector classifier)

For ν > 0, consider the primal problem for the ν-SV classifier

\[ \min_{w \in \mathcal{H},\, \xi \in \mathbb{R}^n,\, \rho \in \mathbb{R}} \; \frac{1}{2} \|w\|_k^2 - \nu\rho + \frac{1}{n} \sum_{j=1}^{n} \xi_j \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0. \]

Note that $x_i$ denotes the feature representation of the $i$-th sample point. The empirical margin error for a feasible $w \in \mathcal{H}$ is given by
\[ R_\rho(w) := \frac{1}{n} \left| \{ i \in \{1, \ldots, n\} : y_i \langle w, x_i \rangle < \rho \} \right|. \]

a) Suppose the above minimization problem has a solution with ρ > 0. Show that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.

b) Determine the dual quadratic optimization problem. Hint: Use that the solution has a representation $w = \sum_{i=1}^{n} \lambda_i k(\cdot, x_i)$.

(5 points)
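To get a feeling for part a), one can check the bound on the fraction of support vectors empirically, e.g. with scikit-learn's NuSVC (a sketch only; the data and the values of ν are arbitrary, and the library solves an equivalent ν-SVC formulation that includes a bias term):

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(4)

# Two overlapping Gaussian blobs (purely illustrative data).
n = 200
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(n // 2, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(n // 2, 2))])
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0)
    clf.fit(X, y)
    frac_sv = clf.support_.size / n
    print(f"nu = {nu:.1f}: fraction of support vectors = {frac_sv:.2f} (expected >= nu)")
```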

H 3. (Programming task - regularization parameter and cross-validation revisited)

See the accompanying notebook.

(5 points)
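The notebook itself is not reproduced here; a minimal sketch of what such a study could look like, assuming scikit-learn's KernelRidge and KFold (the λ grid, kernel, and data are illustrative): select λ by $k$-fold cross-validation on the training data, then assess the chosen model on held-out test data.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * np.pi * x)                 # hypothetical true function
sigma = 0.3

def sample(n):
    X = rng.uniform(0.0, 1.0, size=(n, 1))
    return X, f(X[:, 0]) + sigma * rng.normal(size=n)

X_train, y_train = sample(100)
X_test, y_test = sample(1000)

lambdas = np.logspace(-6, 1, 15)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Model selection: mean cross-validation error for each candidate lambda.
cv_err = []
for lam in lambdas:
    fold_err = []
    for tr, va in cv.split(X_train):
        model = KernelRidge(alpha=lam, kernel="rbf", gamma=10.0)
        model.fit(X_train[tr], y_train[tr])
        fold_err.append(np.mean((model.predict(X_train[va]) - y_train[va]) ** 2))
    cv_err.append(np.mean(fold_err))

best_lam = lambdas[int(np.argmin(cv_err))]

# Model assessment: refit on all training data, estimate the test error on new data.
final = KernelRidge(alpha=best_lam, kernel="rbf", gamma=10.0)
final.fit(X_train, y_train)
test_err = np.mean((final.predict(X_test) - y_test) ** 2)
print(f"chosen lambda = {best_lam:.2e}, estimated test error = {test_err:.4f}")
```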

H 4. (Programming task - bone mineral density estimation)

See the accompanying notebook.

(5 points)
