
Wissenschaftliches Rechnen II/Scientific Computing II

Summer semester 2016
Prof. Dr. Jochen Garcke, Dipl.-Math. Sebastian Mayer

Exercise sheet 7
To be handed in on Thursday, 07.06.2016

1 Model Selection

In previous programming exercises you dealt with the regularization parameter, with $k$-fold cross-validation to choose a good regularization parameter, and with the computation of error measures. This use of cross-validation and error computation was somewhat ad hoc. We will use this exercise sheet and its programming exercises to put the estimation of the regularization parameter on more solid ground from the viewpoint of statistical learning theory. To this end, imagine the following situation. You are given some data $D_{\mathrm{train}} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $x_i \in \Omega \subseteq \mathbb{R}$, $y_i \in \mathbb{R}$, and you conjecture that there is some function $f : \Omega \to \mathbb{R}$ which explains the data, i.e., the sample value $y_i$ is the function value $f(x_i)$ perturbed by noise. Now you want to use some regularized kernel regression procedure to learn $f$. There are two different problems you have to address:

• Model selection: estimating the performance of different regularization parameters in order to choose the best one.

• Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

In order to do model selection and assessment, we use the following statistical model for how the data have been generated. We assume that the data $(x_1, y_1), \ldots, (x_n, y_n)$ are realizations of $n$ i.i.d. copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of $(X, Y)$, where $X$ is a random variable taking values in $\Omega$ and $Y$ is a random variable taking values in $\mathbb{R}$, which is given by
\[ Y = f(X) + \varepsilon, \]
where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma > 0$ fixed, is independent of $X$. The regularized kernel regression procedure can now be considered as a learning method
\[ L : \bigcup_{n \in \mathbb{N}} (\Omega \times \mathbb{R})^n \to \{f : \Omega \to \mathbb{R}\}, \]
which maps given training data $D_{\mathrm{train}}$ to a regression fit $L(D_{\mathrm{train}}) = \hat{f} : \Omega \to \mathbb{R}$.
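For concreteness, a minimal sketch of this setup in Python (assuming NumPy and scikit-learn; the true function $f$, the noise level $\sigma$, and the kernel parameters are illustrative choices, not part of the exercise):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Statistical model: Y = f(X) + eps, eps ~ N(0, sigma^2), independent of X.
f = lambda x: np.sin(2 * np.pi * x)      # hypothetical "true" function f
sigma = 0.3                               # hypothetical noise level
n = 50

X = rng.uniform(0.0, 1.0, size=n)         # X takes values in Omega = [0, 1]
Y = f(X) + sigma * rng.normal(size=n)     # perturbed function values

# The learning method L: training data D_train -> regression fit f_hat.
def L(X_train, y_train, lam=1e-2):
    model = KernelRidge(alpha=lam, kernel="rbf", gamma=10.0)
    model.fit(X_train.reshape(-1, 1), y_train)
    return lambda x: model.predict(np.asarray(x).reshape(-1, 1))

f_hat = L(X, Y)
print("f(0.5) =", f(0.5), " f_hat(0.5) =", f_hat([0.5])[0])
```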

G 1. (Bias-variance decomposition)

Consider the squared loss $\ell_2(y, t) = (y - t)^2$. Let $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. The expected test error of a learning method $L$ at a fixed new input $X = x$ is given by
\[ \operatorname{err}(L, x) := \mathbb{E}[\ell_2(Y, \hat{f}(X)) \mid X = x] = \mathbb{E}[\ell_2(Y, \hat{f}(x))], \]
where $\hat{f}(X) = L(D)(X)$. Show that $\operatorname{err}(L, x)$ can be decomposed as follows
\[ \operatorname{err}(L, x) = \sigma^2 + (\operatorname{bias}(L, x))^2 + \operatorname{var}(L, x) \]
with irreducible error $\sigma^2$, bias term $\operatorname{bias}(L, x) = f(x) - \mathbb{E}[\hat{f}(x)]$, and variance term $\operatorname{var}(L, x) = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]$. Try to explain what kind of error each of the three terms describes.
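A possible way to see the three terms numerically is a small Monte Carlo experiment (a sketch only, reusing the illustrative $f$, $\sigma$, and kernel ridge learner from above; none of these choices are prescribed by the exercise):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # hypothetical true function
sigma, n, x0, reps = 0.3, 50, 0.25, 2000

f_hats, test_errs = [], []
for _ in range(reps):
    # Fresh training set D and a fresh test response Y at X = x0.
    X = rng.uniform(0.0, 1.0, size=n)
    Y = f(X) + sigma * rng.normal(size=n)
    model = KernelRidge(alpha=1e-2, kernel="rbf", gamma=10.0)
    model.fit(X.reshape(-1, 1), Y)
    fx0 = model.predict([[x0]])[0]
    y_new = f(x0) + sigma * rng.normal()
    f_hats.append(fx0)
    test_errs.append((y_new - fx0) ** 2)

f_hats = np.array(f_hats)
bias = f(x0) - f_hats.mean()
var = f_hats.var()
print("err(L, x0)            ", np.mean(test_errs))
print("sigma^2 + bias^2 + var", sigma**2 + bias**2 + var)
```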

(2)

G 2. (Extra- vs. in-sample error)

Let $\tilde{Y}_1, \ldots, \tilde{Y}_n$ be an independent copy of $Y_1, \ldots, Y_n$ and $L$ some learning method. Let
\[ R_{\ell_2,\mathrm{in}}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[\ell_2(\tilde{Y}_i, f(x_i))] \]
be the in-sample risk. The expected in-sample error for given sampling points $x_1, \ldots, x_n$ is defined as
\[ \operatorname{err}_{\mathrm{in}}(L, x_1, \ldots, x_n) = \mathbb{E}[R_{\ell_2,\mathrm{in}}(L(D)) \mid X_1 = x_1, \ldots, X_n = x_n], \]
where $D$ is defined as in G1.

a) Let $P = P_{Y|X} \cdot P_X$ and $\hat{f} = L(D_{\mathrm{train}})$. What is the difference between the risk $R_{\ell_2,P}(\hat{f})$, the empirical risk $R_{\ell_2,\mathrm{emp}}(\hat{f})$, the in-sample risk $R_{\ell_2,\mathrm{in}}(\hat{f})$, the expected in-sample error $\operatorname{err}_{\mathrm{in}}(L, x_1, \ldots, x_n)$, and the expected test error $\operatorname{err}(L, x)$?

b) Show that $\mathbb{E}[R_{\ell_2,P}(L(D))] = \mathbb{E}[\operatorname{err}(L, X)]$.

c) Let $\hat{f} = L(D)$. Show that
\[ \operatorname{err}_{\mathrm{in}}(L, x_1, \ldots, x_n) = \mathbb{E}[R_{\ell_2,\mathrm{emp}}(\hat{f}) \mid X_1 = x_1, \ldots, X_n = x_n] + \frac{2}{n} \sum_{i=1}^{n} \operatorname{cov}(Y_i, \hat{f}(x_i)). \]
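The covariance term is the "optimism" of the empirical risk. A rough Monte Carlo check of this identity for a fixed design and the kernel ridge learner used above (again only a sketch under the same illustrative assumptions):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)   # hypothetical true function
sigma, n, reps = 0.3, 30, 4000
x = np.linspace(0.0, 1.0, n)           # fixed sampling points x_1, ..., x_n

Ys, Fs, emp, ins = [], [], [], []
for _ in range(reps):
    Y = f(x) + sigma * rng.normal(size=n)          # training responses
    Y_tilde = f(x) + sigma * rng.normal(size=n)    # independent copy
    model = KernelRidge(alpha=1e-2, kernel="rbf", gamma=10.0)
    model.fit(x.reshape(-1, 1), Y)
    F = model.predict(x.reshape(-1, 1))            # f_hat(x_i)
    Ys.append(Y)
    Fs.append(F)
    emp.append(np.mean((Y - F) ** 2))              # empirical risk
    ins.append(np.mean((Y_tilde - F) ** 2))        # in-sample risk

Ys, Fs = np.array(Ys), np.array(Fs)
cov_sum = np.sum(np.mean((Ys - Ys.mean(0)) * (Fs - Fs.mean(0)), axis=0))
print("err_in           ", np.mean(ins))
print("E[emp] + 2/n*cov ", np.mean(emp) + 2.0 / n * cov_sum)
```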

2 Support Vector Machines

G 3. (Geometrical interpretation of slack variables)

Consider the soft-margin SVM
\[ \min_{w \in \mathbb{R}^2,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \; \frac{1}{2} w^T w + C \sum_{j=1}^{n} \xi_j \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n. \]
Give a geometric interpretation of the slack variables $\xi_1, \ldots, \xi_n$. To this end, fix some feasible vector $w \in \mathbb{R}^2$, assuming w.l.o.g. $w_1 < 0$ and $w_2 > 0$. Furthermore, consider w.l.o.g. the first data point $(x, y) = (x_1, y_1)$ with $x = (x_1, x_2)$. Now derive the geometrical interpretation by considering when $\xi > 0$ is required in the linear constraint $y(w^T x + b) \ge 1 - \xi$.
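As a hint towards the interpretation: for fixed $w$ and $b$, the smallest feasible slack is $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$, and for a margin violator $\xi_i / \|w\|$ is its distance to its own margin hyperplane. A small sketch with made-up $w$, $b$, and data, purely for illustration:

```python
import numpy as np

# Hypothetical fixed separator (w_1 < 0, w_2 > 0 as in the exercise) and data.
w = np.array([-1.0, 2.0])
b = 0.5
X = np.array([[1.0, 1.5], [0.5, -0.2], [-1.0, 0.3]])
y = np.array([+1, -1, +1])

margins = y * (X @ w + b)                 # y_i (w^T x_i + b)
xi = np.maximum(0.0, 1.0 - margins)       # smallest slack satisfying the constraint
dist_to_margin = xi / np.linalg.norm(w)   # geometric distance for violators

for m, s, d in zip(margins, xi, dist_to_margin):
    status = "outside margin" if s == 0 else "violates its margin"
    print(f"y(w^T x + b) = {m:+.2f}  xi = {s:.2f}  dist = {d:.2f}  ({status})")
```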

3 Homework

H 1. Let $\mathcal{H}$ be a Hilbert space of functions over $\Omega \subseteq \mathbb{R}$ and $k : \Omega \times \Omega \to \mathbb{R}$ the reproducing kernel of $\mathcal{H}$. Consider once again the regularized kernel regression problem
\[ L(D_{\mathrm{train}}) = \operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_k^2 \]
for given data $D_{\mathrm{train}} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.

a) Give explicit formulas for the bias and the variance term from G1.

b) Let $K = (k(x_i, x_j))_{i,j=1,\ldots,n}$ and $G = (K + \lambda I_n)^{-1}$. Consider the smoothing matrix $S_\lambda = KG$. Show that for the covariance term $\sum_{i=1}^{n} \operatorname{cov}(Y_i, \hat{f}(x_i))$, which appears in G2 c), we have
\[ \sum_{i=1}^{n} \operatorname{cov}(Y_i, \hat{f}(x_i)) = \operatorname{trace}(S_\lambda)\, \sigma^2. \]

(5 points)
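A short numerical sanity check of this formula, treating $\hat{f}(x_i) = (S_\lambda y)_i$ as in the definition above (the Gaussian kernel, $\lambda$, $\sigma$, and the true function are illustrative assumptions; $\operatorname{trace}(S_\lambda)$ also serves as the effective degrees of freedom of the smoother):

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam, sigma = 30, 1e-2, 0.3
x = np.linspace(0.0, 1.0, n)
f = lambda t: np.sin(2 * np.pi * t)       # hypothetical true function

# Gaussian kernel matrix K, G = (K + lam*I)^(-1), smoothing matrix S = K G.
K = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)
S = K @ np.linalg.inv(K + lam * np.eye(n))

# Closed form: sum_i cov(Y_i, f_hat(x_i)) = trace(S) * sigma^2.
print("trace(S) * sigma^2:", np.trace(S) * sigma**2)

# Monte Carlo estimate of the same covariance sum (f_hat = S Y for this smoother).
Ys = f(x) + sigma * rng.normal(size=(5000, n))
Fs = Ys @ S.T
cov_sum = np.sum(np.mean((Ys - Ys.mean(0)) * (Fs - Fs.mean(0)), axis=0))
print("Monte Carlo       :", cov_sum)
```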


H 2. (ν-support vector classifier)

For ν > 0, consider the primal problem for the ν-SV classifier

\[ \min_{w \in \mathcal{H},\, \xi \in \mathbb{R}^n,\, \rho \in \mathbb{R}} \; \frac{1}{2} \|w\|_k^2 - \nu\rho + \frac{1}{n} \sum_{j=1}^{n} \xi_j \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0. \]

Note that $x_i$ denotes the feature representation of the $i$-th sample point. The empirical margin error for a feasible $w \in \mathcal{H}$ is given by
\[ R_\rho(w) := \frac{1}{n} \left| \{ i \in \{1, \ldots, n\} : y_i \langle w, x_i \rangle < \rho \} \right|. \]

a) Suppose the above minimization problem has a solution with ρ > 0. Show that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.

b) Determine the dual quadratic optimization problem. Hint: Use that the solution has a representation $w = \sum_{i=1}^{n} \lambda_i k(\cdot, x_i)$.

(5 points)
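To get a feeling for part a), one can check the bound on the fraction of support vectors empirically, e.g. with scikit-learn's NuSVC (a sketch only; the data and the values of ν are arbitrary, and the library solves an equivalent ν-SVC formulation that includes a bias term):

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(4)

# Two overlapping Gaussian blobs (purely illustrative data).
n = 200
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(n // 2, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(n // 2, 2))])
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0)
    clf.fit(X, y)
    frac_sv = clf.support_.size / n
    print(f"nu = {nu:.1f}: fraction of support vectors = {frac_sv:.2f} (expected >= nu)")
```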

H 3. (Programming task - regularization parameter and cross-validation revisited)

See the accompanying notebook.

(5 points)
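The notebook itself is not reproduced here; a minimal sketch of what such a study could look like, assuming scikit-learn's KernelRidge and KFold (the λ grid, kernel, and data are illustrative): select λ by $k$-fold cross-validation on the training data, then assess the chosen model on held-out test data.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * np.pi * x)                 # hypothetical true function
sigma = 0.3

def sample(n):
    X = rng.uniform(0.0, 1.0, size=(n, 1))
    return X, f(X[:, 0]) + sigma * rng.normal(size=n)

X_train, y_train = sample(100)
X_test, y_test = sample(1000)

lambdas = np.logspace(-6, 1, 15)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Model selection: mean cross-validation error for each candidate lambda.
cv_err = []
for lam in lambdas:
    fold_err = []
    for tr, va in cv.split(X_train):
        model = KernelRidge(alpha=lam, kernel="rbf", gamma=10.0)
        model.fit(X_train[tr], y_train[tr])
        fold_err.append(np.mean((model.predict(X_train[va]) - y_train[va]) ** 2))
    cv_err.append(np.mean(fold_err))

best_lam = lambdas[int(np.argmin(cv_err))]

# Model assessment: refit on all training data, estimate the test error on new data.
final = KernelRidge(alpha=best_lam, kernel="rbf", gamma=10.0)
final.fit(X_train, y_train)
test_err = np.mean((final.predict(X_test) - y_test) ** 2)
print(f"chosen lambda = {best_lam:.2e}, estimated test error = {test_err:.4f}")
```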

H 4. (Programming task - bone mineral density estimation)

See the accompanying notebook.

(5 points)
