by computing iteratively
\[
\tilde{w}_i = w_i - \frac{\tilde{w}_{i-1}^t X^t X w_i}{\tilde{w}_{i-1}^t X^t X \tilde{w}_{i-1}}\, \tilde{w}_{i-1}\,, \qquad
\hat{\beta}^{(i)} = \hat{\beta}^{(i-1)} + \frac{\tilde{w}_i^t X^t y}{\tilde{w}_i^t X^t X \tilde{w}_i}\, \tilde{w}_i\,.
\]
Proof. We prove the statements via induction. For $i = 1$, we have $\tilde{w}_1 = w_1$ as $X_1 = X$ and
\[
\hat{y} = P_{t_1} y = X w_1 \left( w_1^t X^t X w_1 \right)^{-1} w_1^t X^t y
= X\, \frac{\tilde{w}_1^t X^t y}{\tilde{w}_1^t X^t X \tilde{w}_1}\, \tilde{w}_1\,.
\]
For a general $i$, we have
\[
t_{i+1} = X_{i+1} w_{i+1} = \left( X - P_{t_1, \ldots, t_i} X \right) w_{i+1} = X w_{i+1} - P_{t_i} X w_{i+1}\,.
\]
The last equality holds as $R = T^t X W$ is bidiagonal. Using formula (A.9) for the projection operator and the induction hypothesis (4.16), it follows that
\[
t_{i+1} = X w_{i+1} - X \tilde{w}_i \left( \tilde{w}_i^t X^t X \tilde{w}_i \right)^{-1} \tilde{w}_i^t X^t X w_{i+1}\,.
\]
We conclude that
\[
\tilde{w}_{i+1} = w_{i+1} - \frac{\tilde{w}_i^t X^t X w_{i+1}}{\tilde{w}_i^t X^t X \tilde{w}_i}\, \tilde{w}_i\,.
\]
The regression estimate after $i$ steps is
\[
X \hat{\beta}^{(i)} = P_{t_1, \ldots, t_i} y = P_{t_1, \ldots, t_{i-1}} y + P_{t_i} y
= X \hat{\beta}^{(i-1)} + P_{t_i} y
= X \hat{\beta}^{(i-1)} + X\, \frac{\tilde{w}_i^t X^t y}{\tilde{w}_i^t X^t X \tilde{w}_i}\, \tilde{w}_i\,.
\]
This concludes the proof.
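To make the recursion concrete, here is a minimal NumPy sketch of the update rules of Proposition 4.9. The function name, and the explicit deflation step used to obtain the weight vectors $w_i = X_i^t y$, are our choices, not prescribed by the text.

```python
import numpy as np

def pls_beta(X, y, m):
    """Sketch of Proposition 4.9: iterate the modified weight vectors
    w_tilde_i and accumulate the regression estimate beta^(i)."""
    n, p = X.shape
    XtX = X.T @ X
    Xty = X.T @ y
    beta = np.zeros(p)
    Xi = X.copy()                  # deflated matrix X_i, with X_1 = X
    w_tilde_prev = None
    for i in range(m):
        w = Xi.T @ y               # weight vector w_i = X_i^t y
        if w_tilde_prev is None:
            w_tilde = w            # base case: w_tilde_1 = w_1
        else:
            num = w_tilde_prev @ XtX @ w
            den = w_tilde_prev @ XtX @ w_tilde_prev
            w_tilde = w - (num / den) * w_tilde_prev
        # beta^(i) = beta^(i-1) + (w_tilde^t X^t y / w_tilde^t X^t X w_tilde) w_tilde
        beta += (w_tilde @ Xty) / (w_tilde @ XtX @ w_tilde) * w_tilde
        t = X @ w_tilde            # latent component t_i = X w_tilde_i
        Xi = Xi - np.outer(t, t @ X) / (t @ t)   # X_{i+1} = X_i - P_{t_i} X
        w_tilde_prev = w_tilde
    return beta
```

Note that each step orthogonalizes only against the previous modified weight vector, which is exactly what the bidiagonality of $R$ buys.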
One strategy for model selection is to choose the number of components that minimizes the cross-validated error. In Chapter 2, we discussed a different strategy for model selection that involves the computation of the degrees of freedom. Note that the PLSR estimate (4.8) is not a linear function of $y$. Hence, we can only estimate the degrees of freedom using equation (2.9). As a consequence, we have to compute the first derivative of the PLSR estimator. This has been done before: Phatak et al. (2002) compute the first derivative of $\hat{\beta}^{(m)}_{PLS}$ in order to obtain asymptotic results on the variance of the estimator. In that work, some general rules of matrix differential calculus are applied to the formula in Proposition 4.7. It turns out that this calculation of the first derivative of PLSR is not only numerically unstable, it is also time-consuming. The reason is that the formula for the first derivative of $\hat{\beta}$ involves matrices of order $(mn) \times n$. We therefore choose a different approach. Serneels et al. (2004) use a recursive formula for $\hat{\beta}^{(m)}_{PLS}$ that is equivalent to the algorithm in Proposition 4.9 and derive a fast algorithm for the first derivative of PLSR. We now show how to compute the derivative of the PLSR estimate $\hat{y}$ in a recursive way. We start with the remark that, by definition,
\[
X_i = X_{i-1} - P_{t_{i-1}} X\,.
\]
Using the definition of the weight vectors of PLSR, we conclude that
\[
w_i = X_i^t y = w_{i-1} - X^t P_{t_{i-1}} y\,.
\]
Let us define the modified latent components $\tilde{t}_i = X w_i$. We conclude that
\[
\tilde{t}_i = \tilde{t}_{i-1} - X X^t P_{t_{i-1}} y\,.
\]
Furthermore, we conclude from Proposition 4.9 that
\[
t_i = X \tilde{w}_i
= X \left( w_i - \frac{\tilde{w}_{i-1}^t X^t X w_i}{\tilde{w}_{i-1}^t X^t X \tilde{w}_{i-1}}\, \tilde{w}_{i-1} \right)
= \tilde{t}_i - \frac{t_{i-1}^t \tilde{t}_i}{t_{i-1}^t t_{i-1}}\, t_{i-1}
= \tilde{t}_i - P_{t_{i-1}} \tilde{t}_i\,.
\]
Finally,
\[
\hat{y}^{(i)} = \hat{y}^{(i-1)} + P_{t_i} y\,.
\]
This leads to the following recursive algorithm for the computation of $\hat{y}$.

Algorithm 4.10. We define $K = X X^t$. After setting $\tilde{t}_0 = t_0 = \hat{y}^{(0)} = 0$, we iteratively compute the PLSR estimates $\hat{y}$ via
\[
\tilde{t}_m = \begin{cases} K y\,, & m = 1 \\ \tilde{t}_{m-1} - K P_{t_{m-1}} y\,, & m > 1 \end{cases}
\qquad \text{(modified latent components)}
\]
\[
t_m = \tilde{t}_m - P_{t_{m-1}} \tilde{t}_m \qquad \text{(latent components)}
\]
\[
\hat{y}^{(m)} = \hat{y}^{(m-1)} + P_{t_m} y \qquad \text{(prediction)}
\]
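Algorithm 4.10 translates almost line by line into code. The following NumPy sketch uses a small helper for the projection $P_t$; the names are ours.

```python
import numpy as np

def proj(t, v):
    """Projection P_t v = t (t^t t)^{-1} t^t v onto a single component t."""
    return t * (t @ v) / (t @ t)

def pls_yhat(X, y, m):
    """Sketch of Algorithm 4.10: recursive computation of the PLSR fit."""
    K = X @ X.T
    yhat = np.zeros_like(y, dtype=float)
    t_tilde = t = None
    for step in range(1, m + 1):
        if step == 1:
            t_tilde = K @ y                     # modified latent component
            t = t_tilde                         # t_0 = 0, so no correction
        else:
            t_tilde = t_tilde - K @ proj(t, y)  # t here is still t_{m-1}
            t = t_tilde - proj(t, t_tilde)      # latent component t_m
        yhat = yhat + proj(t, y)                # prediction update
    return yhat
```

Since only matrix-vector products with $K$ appear, the cost per component is of order $n^2$.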
In order to compute the first derivative of $\hat{y}$, we have to calculate the first derivative of the projection operator. The formula can be found in Proposition A.11 in the appendix.
Algorithm 4.11 (First derivative of PLSR). After setting $\tilde{t}_0 = t_0 = \hat{y}^{(0)} = 0$ and $d\tilde{t}_1 = dt_1 = K$, the first derivative of the PLSR estimator can be obtained by iteratively computing
\[
\tilde{t}_m = \begin{cases} K y\,, & m = 1 \\ \tilde{t}_{m-1} - K P_{t_{m-1}} y\,, & m > 1 \end{cases}
\qquad \text{(modified latent components)}
\]
\[
\frac{\partial \tilde{t}_m}{\partial y} = \frac{\partial \tilde{t}_{m-1}}{\partial y} - K\, \frac{\partial P_{t_{m-1}} y}{\partial y}
\qquad \text{(derivative of } \tilde{t}_m \text{)}
\]
\[
t_m = \tilde{t}_m - P_{t_{m-1}} \tilde{t}_m \qquad \text{(latent components)}
\]
\[
\frac{\partial t_m}{\partial y} = \frac{\partial \tilde{t}_m}{\partial y} - \frac{\partial P_{t_{m-1}} \tilde{t}_m}{\partial y}
\qquad \text{(derivative of } t_m \text{)}
\]
\[
\hat{y}^{(m)} = \hat{y}^{(m-1)} + P_{t_m} y \qquad \text{(prediction)}
\]
\[
\frac{\partial \hat{y}^{(m)}}{\partial y} = \frac{\partial \hat{y}^{(m-1)}}{\partial y} + \frac{\partial P_{t_m} y}{\partial y}
\qquad \text{(derivative of } \hat{y} \text{)}
\]
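A direct transcription of Algorithm 4.11 might look as follows. The helper d_proj is our rendering of the derivative of the projection operator via the product and quotient rules; the text's exact formula is Proposition A.11, which we do not reproduce here. In line with the motivation above, only $n \times n$ matrices are kept in memory.

```python
import numpy as np

def proj(t, v):
    return t * (t @ v) / (t @ t)

def d_proj(t, Dt, v, Dv):
    """Jacobian of P_t v = ((t^t v) / (t^t t)) t with respect to y, given
    the Jacobians Dt = dt/dy and Dv = dv/dy (product/quotient rule)."""
    tt, tv = t @ t, t @ v
    return (np.outer(t, v) @ Dt + np.outer(t, t) @ Dv + tv * Dt) / tt \
        - (2.0 * tv / tt ** 2) * (np.outer(t, t) @ Dt)

def pls_dof(X, y, m):
    """Sketch of Algorithm 4.11: trace(d yhat^(i) / dy) for i = 1, ..., m."""
    n = len(y)
    K = X @ X.T
    I = np.eye(n)
    Dyhat = np.zeros((n, n))
    t_tilde = t = Dt_tilde = Dt = None
    dof = []
    for step in range(1, m + 1):
        if step == 1:
            t_tilde, Dt_tilde = K @ y, K.copy()   # d t_tilde_1 / dy = K
            t, Dt = t_tilde, Dt_tilde
        else:
            t_tilde_new = t_tilde - K @ proj(t, y)
            Dt_tilde_new = Dt_tilde - K @ d_proj(t, Dt, y, I)
            t_new = t_tilde_new - proj(t, t_tilde_new)
            Dt_new = Dt_tilde_new - d_proj(t, Dt, t_tilde_new, Dt_tilde_new)
            t_tilde, Dt_tilde, t, Dt = t_tilde_new, Dt_tilde_new, t_new, Dt_new
        Dyhat = Dyhat + d_proj(t, Dt, y, I)       # derivative of yhat^(m)
        dof.append(np.trace(Dyhat))               # estimated df, defined below
    return dof
```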
We can now define the estimated degrees of freedom of PLSR with $m$ components via
\[
\widehat{df}\,(\mathrm{PLSR}, m) = \operatorname{trace} \frac{\partial \hat{y}^{(m)}}{\partial y}\,.
\]
This is by definition an unbiased estimate of the degrees of freedom of PLSR in the case of normally distributed error terms. Although its computation is considerably faster than the one proposed in Phatak et al. (2002), it still suffers from numerical instability. This leads to peculiar and sometimes apparently wrong results, which we illustrate with the following example. We consider the linear regression model
\[
y = X\beta + \varepsilon
\]
with $p = 20$ predictor variables and $n = 500$ examples. First, we choose the predictor matrix $X$ from a multivariate normal distribution with zero mean and covariance matrix $S$ equal to
\[
s_{ij} = \begin{cases} 1\,, & i = j \\ 0.7\,, & i \neq j\,. \end{cases}
\]
This leads to highly collinear data $X$. The regression vector $\beta$ is a randomly chosen vector $\beta \in \{0, 1\}^{20}$. We assume that the error terms are normally distributed with $\sigma = 8$. We compute the degrees of freedom for all $20$ components.

Figure 4.1: The estimated degrees of freedom of PLSR as a function of the number of components.

Figure 4.1 shows the peculiar behavior of the estimated degrees of freedom. We expect the degrees of freedom to be upper-bounded by $p = 20$, the number of predictor variables. This is indicated by the dashed line. Note however that after a few components ($m \geq 9$), the estimated degrees of freedom exceed this value. The function displayed in Figure 4.1 keeps increasing and has the form of a jagged line. This phenomenon persistently occurs for other data sets. We conjecture that Algorithm 4.11 runs into serious numerical problems. Therefore, we recommend being cautious about implementing the algorithm in its current form.
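For readers who want to reproduce the effect, the following sketch sets up the simulation; the random seed and the reuse of pls_dof from the previous sketch are our choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 500, 20, 8.0

# covariance with s_ij = 1 for i = j and 0.7 otherwise: highly collinear X
S = np.full((p, p), 0.7) + 0.3 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), S, size=n)
beta = rng.integers(0, 2, size=p).astype(float)   # random vector in {0, 1}^20
y = X @ beta + sigma * rng.standard_normal(n)

dof = pls_dof(X, y, m=20)   # pls_dof from the sketch above
# plotting dof against the component number should exhibit the jagged,
# eventually out-of-range behavior described in the text
```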
Penalized Partial Least Squares
Nonlinear regression effects may be modeled via additive regression models of the form
\[
Y = \beta_0 + f_1(X_1) + \cdots + f_p(X_p) + \varepsilon\,, \tag{5.1}
\]
where the functions $f_1, \ldots, f_p$ have unspecified functional form. An approach which allows a flexible representation of the functions $f_1, \ldots, f_p$ is the expansion in B-splines basis functions (Hastie & Tibshirani 1990). To prevent overfitting, there are two general approaches. In the first approach, each function $f_j$ is the sum of only a small set of basis functions,
\[
f_j(x) = \sum_{k=1}^{K_j} \beta_{kj} B_{kj}(x)\,. \tag{5.2}
\]
The basis functions $B_{kj}$ are chosen adaptively by a selection procedure. The second approach circumvents the problem of basis function selection. Instead, we allow a generous number $K_j \gg 1$ of basis functions in the expansion (5.2). As this usually leads to high-dimensional and highly correlated data, we penalize the coefficients $\beta_{jk}$ in the estimation process (Eilers & Marx 1996). However, if the number $p$ of predictors is large compared to the number $n$ of observations in the available sample, these methods are impracticable.

Quite generally, a different approach to deal with high dimensionality is to use
dimension reduction techniques such as Partial Least Squares Regression (PLSR), which is presented in Chapter 4. In this chapter, we suggest an adaptation of the principle of penalization to PLSR. More precisely, we present a penalized version of the optimization problem (4.3) attached to PLSR. Although the motivation stems from its use for B-splines transformed data, the proposed approach is very general and can be adapted to other penalty terms or to other dimension reduction techniques such as Principal Components Analysis. It turns out that the new method shares a lot of properties with PLSR and that its computation requires virtually no extra costs. Furthermore, we show that this new penalization technique is closely related to the kernel trick that is illustrated in Section 1.4. We show that penalized PLSR is equivalent to ordinary PLSR using a generalized inner product that is defined by the penalty term. In the case of high-dimensional data, the new method is shown to be an attractive competitor to other techniques for estimating generalized additive models. In Chapter 6, we highlight the close connection between penalized PLSR and preconditioned linear systems.

This chapter is joint work with Anne-Laure Boulesteix and Gerhard Tutz.
5.1 Penalized Regression Splines
The fitting of generalized additive models by use of penalized regression splines (Eilers & Marx 1996) has become a widely used tool in statistics. The basic idea is to expand the additive component of each variable $X_j$ in basis functions as in (5.2) and to estimate the coefficients by penalization techniques. As suggested in Eilers & Marx (1996), B-splines are used as basis functions. Splines are one-dimensional piecewise polynomial functions. The points at which the pieces are connected are called knots or breakpoints. We say that a spline is of order $d$ if all polynomials are of degree $\leq d$ and if the spline is $(d-1)$ times continuously differentiable at the breakpoints. A particularly efficient set of basis functions are B-splines (de Boor 1978). An example of B-splines is given in Figure 5.1.
The number of basis functions depends on the order of the splines and the number of breakpoints. For a given variable $X_j$, we consider a set of corresponding B-splines basis functions $B_{1j}, \ldots, B_{Kj}$. These basis functions define a nonlinear map
\[
\Phi_j(x) = \left( B_{1j}(x), \ldots, B_{Kj}(x) \right)^t\,.
\]

Figure 5.1: Illustration of a basis of B-splines of order 3.

By performing such a transformation on each of the variables $X_1, \ldots, X_p$, the observation vector $x_i$ turns into a vector
\[
z_i = \left( B_{11}(x_{i1}), \ldots, B_{K1}(x_{i1}), \ldots, B_{1p}(x_{ip}), \ldots, B_{Kp}(x_{ip}) \right)^t = \Phi(x_i) \tag{5.3}
\]
of length $pK$. Here
\[
\Phi : \mathbb{R}^p \to \mathbb{R}^{pK}\,, \qquad \Phi(x) = \left( \Phi_1(x_1), \ldots, \Phi_p(x_p) \right)\,,
\]
is the function defined by the B-splines. The resulting data matrix obtained by the transformation of $X$ has dimensions $n \times pK$ and will be denoted by $Z$ in the rest of the chapter. In the examples in Sections 5.5 and ??, we consider the most widely used cubic B-splines, i.e. we choose $d = 3$.
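As an illustration of the transformation (5.3), the following sketch builds the matrix $Z$ with SciPy's B-spline design matrix (available from SciPy 1.8 onwards); the equidistant placement of the breakpoints over the observed range is our choice and is not prescribed by the text.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, K, d=3):
    """Evaluate K B-spline basis functions of degree d on the range of x."""
    lo, hi = x.min(), x.max()
    inner = np.linspace(lo, hi, K - d + 1)      # equidistant breakpoints
    knots = np.r_[[lo] * d, inner, [hi] * d]    # clamped knot vector
    return BSpline.design_matrix(x, knots, d).toarray()   # shape (len(x), K)

def transform(X, K=20, d=3):
    """Map the n x p data matrix X to the n x pK matrix Z of (5.3)."""
    return np.hstack([bspline_basis(X[:, j], K, d) for j in range(X.shape[1])])
```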
The estimation of (5.1) is transformed into the estimation of the $pK$-dimensional vector $\beta$ that consists of the coefficients $\beta_{jk}$:
\[
\beta^t = \left( \beta_{11}, \ldots, \beta_{K1}, \ldots, \beta_{1p}, \ldots, \beta_{Kp} \right) = \left( \beta_{(1)}^t, \ldots, \beta_{(p)}^t \right)\,.
\]
As explained above, the vector $\beta$ determines a nonlinear, additive function
\[
f(x) = \beta_0 + \sum_{j=1}^{p} f_j(x_j) = \beta_0 + \sum_{j=1}^{p} \sum_{k=1}^{K} \beta_{kj} B_{kj}(x_j) = \beta_0 + \Phi(x)^t \beta\,.
\]
As $Z$ is usually high-dimensional, the estimation of $\beta$ by minimizing the squared empirical risk
\[
\widehat{R}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 = \frac{1}{n} \left\| y - \beta_0 \mathbf{1} - Z\beta \right\|^2
\]
usually leads to overfitting. Following Eilers & Marx (1996), we use many basis functions for each variable, say $K \approx 20$, and estimate by penalization. The idea is to penalize the second derivative of the function $f$,
\[
\int \left( f''(x) \right)^2\, dx\,.
\]
Eilers & Marx (1996) show that the following difference penalty term is a good approximation of the penalty on the second derivative of the functions $f_j$:
\[
P(\beta) = \sum_{j=1}^{p} \sum_{k=3}^{K} \lambda_j \left( \Delta^2 \beta_{kj} \right)^2\,.
\]
Here $\lambda_j \geq 0$ are smoothing parameters that control the amount of penalization, and the terms $\Delta^2 \beta_{kj}$ are called the second-order differences of adjacent parameters. The difference operator $\Delta^2$ has the form
\[
\Delta^2 \beta_{kj} = \left( \beta_{kj} - \beta_{k-1,j} \right) - \left( \beta_{k-1,j} - \beta_{k-2,j} \right) = \beta_{kj} - 2\beta_{k-1,j} + \beta_{k-2,j}\,.
\]
This penalty term can be expressed in terms of a penalty matrix $P$. We denote by $D_K$ the $(K-1) \times K$ matrix
\[
D_K = \begin{pmatrix}
1 & -1 & & & \\
& 1 & -1 & & \\
& & \ddots & \ddots & \\
& & & 1 & -1
\end{pmatrix}
\]
that defines the first-order difference operator. Setting
\[
K_2 = \left( D_{K-1} D_K \right)^t D_{K-1} D_K\,,
\]
we conclude that the penalty term equals
\[
P(\beta) = \sum_{j=1}^{p} \lambda_j\, \beta_{(j)}^t K_2\, \beta_{(j)} = \beta^t \left( \Delta_\lambda \otimes K_2 \right) \beta\,.
\]
Here $\Delta_\lambda$ is the $p \times p$ diagonal matrix containing $\lambda_1, \ldots, \lambda_p$ on its diagonal and $\otimes$ is the Kronecker product. The generalization of this method to higher-order differences of the coefficients of adjacent B-splines is straightforward. We simply replace $K_2$ by
\[
K_q = \left( D_{K-q+1} \cdots D_K \right)^t \left( D_{K-q+1} \cdots D_K \right)\,.
\]
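In code, the penalty matrix is a few lines of NumPy; the function names are ours.

```python
import numpy as np

def first_diff(K):
    """First-order difference operator D_K of size (K-1) x K."""
    D = np.zeros((K - 1, K))
    idx = np.arange(K - 1)
    D[idx, idx], D[idx, idx + 1] = 1.0, -1.0
    return D

def penalty_matrix(lambdas, K, q=2):
    """P = Delta_lambda (Kronecker product) K_q, with
    K_q = (D_{K-q+1} ... D_K)^t (D_{K-q+1} ... D_K)."""
    Dq = np.eye(K)
    for i in range(q):                  # compose D_K, D_{K-1}, ..., D_{K-q+1}
        Dq = first_diff(K - i) @ Dq
    Kq = Dq.T @ Dq                      # K x K, symmetric positive semidefinite
    return np.kron(np.diag(np.asarray(lambdas, dtype=float)), Kq)
```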
To summarize, the penalized least squares criterion has the form
\[
\widehat{R}_P(\beta) = \frac{1}{n} \left\| y - \beta_0 \mathbf{1} - Z\beta \right\|^2 + \beta^t P \beta \tag{5.4}
\]
with $Z$ the transformed data defined in (5.3) and the penalty matrix $P$ defined as
\[
P = \Delta_\lambda \otimes K_q\,. \tag{5.5}
\]
This is a symmetric matrix that is positive semidefinite.
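A minimal sketch of minimizing (5.4) follows, assuming the columns of $Z$ are centered so that the intercept separates as the mean of $y$; the factor $n$ in front of $P$ comes from clearing the $1/n$ in the criterion.

```python
import numpy as np

def penalized_ls(Z, y, P):
    """Minimize (1/n) ||y - beta0 - Z beta||^2 + beta^t P beta.

    Assumes column-centered Z, in which case beta0 = mean(y) and beta
    solves the regularized normal equations (Z^t Z + n P) beta = Z^t (y - beta0).
    """
    n = Z.shape[0]
    beta0 = y.mean()
    beta = np.linalg.solve(Z.T @ Z + n * P, Z.T @ (y - beta0))
    return beta0, beta
```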