

by computing iteratively
\[
\widetilde w_i = w_i - \frac{\widetilde w_{i-1}^t X^t X w_i}{\widetilde w_{i-1}^t X^t X \widetilde w_{i-1}}\, \widetilde w_{i-1} \,, \qquad
\widehat\beta^{(i)} = \widehat\beta^{(i-1)} + \frac{\widetilde w_i^t X^t y}{\widetilde w_i^t X^t X \widetilde w_i}\, \widetilde w_i \,.
\]

Proof. We prove the statements via induction. For $i = 1$, we have $\widetilde w_1 = w_1$ as $X_1 = X$ and
\[
\widehat y = P_{t_1} y = \frac{X w_1 w_1^t X^t}{w_1^t X^t X w_1}\, y = X\, \frac{\widetilde w_1^t X^t y}{\widetilde w_1^t X^t X \widetilde w_1}\, \widetilde w_1 \,.
\]

For a general $i$, we have
\[
t_{i+1} = X_{i+1} w_{i+1} = \left(X - P_{t_1,\dots,t_i} X\right) w_{i+1} = X w_{i+1} - P_{t_i} X w_{i+1} \,.
\]

The last equality holds as $R = T^t X W$ is bidiagonal. Using formula (A.9) for the projection operator and the induction hypothesis (4.16), it follows that
\[
t_{i+1} = X w_{i+1} - X \widetilde w_i \left(\widetilde w_i^t X^t X \widetilde w_i\right)^{-1} \widetilde w_i^t X^t X w_{i+1} \,.
\]

We conclude that
\[
\widetilde w_{i+1} = w_{i+1} - \frac{\widetilde w_i^t X^t X w_{i+1}}{\widetilde w_i^t X^t X \widetilde w_i}\, \widetilde w_i \,.
\]

The regression estimate after $i$ steps is
\[
X \widehat\beta^{(i)} = P_{t_1,\dots,t_i} y = P_{t_1,\dots,t_{i-1}} y + P_{t_i} y
= X \widehat\beta^{(i-1)} + P_{t_i} y
= X \widehat\beta^{(i-1)} + X\, \frac{\widetilde w_i^t X^t y}{\widetilde w_i^t X^t X \widetilde w_i}\, \widetilde w_i \,.
\]
This concludes the proof.
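The recursion for $\widetilde w_i$ and $\widehat\beta^{(i)}$ translates almost literally into code. The following Python sketch is illustrative and not part of the original text; the deflation step used to obtain the weight vectors $w_i = X_i^t y$ and all function names are assumptions of this illustration.

```python
import numpy as np

def pls_coefficients(X, y, m):
    """PLSR coefficient estimates beta^(1), ..., beta^(m) via the recursion above.

    The weight vectors w_i = X_i^t y come from the usual deflation of X;
    the update of beta uses the modified weight vectors w~_i (hypothetical sketch)."""
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.zeros(X.shape[1])
    X_i = X.copy()                       # deflated matrix X_i, starting with X_1 = X
    w_tilde_prev = None
    betas = []
    for _ in range(m):
        w = X_i.T @ y                    # weight vector w_i = X_i^t y
        if w_tilde_prev is None:
            w_tilde = w                  # w~_1 = w_1
        else:
            # w~_i = w_i - (w~_{i-1}^t X^t X w_i) / (w~_{i-1}^t X^t X w~_{i-1}) * w~_{i-1}
            w_tilde = w - (w_tilde_prev @ XtX @ w) / (w_tilde_prev @ XtX @ w_tilde_prev) * w_tilde_prev
        # beta^(i) = beta^(i-1) + (w~_i^t X^t y) / (w~_i^t X^t X w~_i) * w~_i
        beta = beta + (w_tilde @ Xty) / (w_tilde @ XtX @ w_tilde) * w_tilde
        betas.append(beta.copy())
        # deflation: X_{i+1} = X_i - P_{t_i} X_i with t_i = X_i w_i
        t = X_i @ w
        X_i = X_i - np.outer(t, t @ X_i) / (t @ t)
        w_tilde_prev = w_tilde
    return betas
```

For centered data and $m$ not exceeding the rank of $X$, this should agree (up to numerical error) with the coefficients returned by standard PLSR implementations.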

cross-validated error. In Chapter 2, we discussed a different strategy for model selection that involves the computation of the degrees of freedom. Note that the PLSR estimate (4.8) is not a linear function of $y$. Hence, we can only estimate the degrees of freedom using equation (2.9). As a consequence, we have to compute the first derivative of the PLSR estimator. This has been done before: Phatak et al. (2002) compute the first derivative of $\widehat\beta^{(m)}_{PLS}$ in order to obtain asymptotic results on the variance of the estimator. In that work, some general rules of matrix differential calculus are applied to the formula in Proposition 4.7. It turns out that the calculation of the first derivative of PLSR is not only numerically unstable, it is also time-consuming. The reason is that the formula for the first derivative of $\widehat\beta$ involves matrices of order $(mn) \times n$. We therefore choose a different approach. Serneels et al. (2004) use a recursive formula for $\widehat\beta^{(m)}_{PLS}$ that is equivalent to the algorithm of Proposition 4.9 and derive a fast algorithm for the first derivative of PLSR. We now show how to compute the derivative of the PLSR estimate $\widehat y$ in a recursive way. We start with the remark that, by definition,
\[
X_i = X_{i-1} - P_{t_{i-1}} X \,.
\]

Using the definition of the weight vectors of PLSR, we conclude that
\[
w_i = X_i^t y = w_{i-1} - X^t P_{t_{i-1}} y \,.
\]
Let us define the modified latent components
\[
\check t_i = X w_i \,.
\]
We conclude that
\[
\check t_i = \check t_{i-1} - X X^t P_{t_{i-1}} y \,.
\]
Furthermore, we conclude from Proposition 4.9 that
\[
t_i = X \widetilde w_i = X \left( w_i - \frac{\widetilde w_{i-1}^t X^t X w_i}{\widetilde w_{i-1}^t X^t X \widetilde w_{i-1}}\, \widetilde w_{i-1} \right)
= \check t_i - \frac{t_{i-1}^t \check t_i}{t_{i-1}^t t_{i-1}}\, t_{i-1} = \check t_i - P_{t_{i-1}} \check t_i \,.
\]
Finally,
\[
\widehat y^{(i)} = \widehat y^{(i-1)} + P_{t_i} y \,.
\]

This leads to the following recursive algorithm for the computation of $\widehat y$.

Algorithm 4.10. We define $K = X X^t$. After setting $\check t_0 = t_0 = \widehat y^{(0)} = 0$, we iteratively compute the PLSR estimates $\widehat y$ via
\[
\check t_m = \begin{cases} K y \,, & m = 1 \\ \check t_{m-1} - K P_{t_{m-1}} y \,, & m > 1 \end{cases} \qquad \text{(modified latent components)}
\]
\[
t_m = \check t_m - P_{t_{m-1}} \check t_m \qquad \text{(latent components)}
\]
\[
\widehat y^{(m)} = \widehat y^{(m-1)} + P_{t_m} y \qquad \text{(prediction)}
\]
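The algorithm can be prototyped directly in terms of the kernel matrix $K$. The Python sketch below is an illustration, not code from the text; the helper `proj` implements $P_t v = t\, t^t v / (t^t t)$ and treats $t_0 = 0$ as the zero projection.

```python
import numpy as np

def proj(t, v):
    """Projection P_t v = t (t^t v) / (t^t t); returns 0 for t = 0 (covers t_0 = 0)."""
    denom = t @ t
    return np.zeros_like(v) if denom == 0.0 else t * (t @ v) / denom

def pls_fitted_values(X, y, m):
    """PLSR fits y_hat^(1), ..., y_hat^(m) via the kernel recursion of Algorithm 4.10."""
    K = X @ X.T                          # kernel matrix K = X X^t
    t_prev = np.zeros_like(y)            # latent component t_0 = 0
    t_check = np.zeros_like(y)           # modified latent component, starts at 0
    y_hat = np.zeros_like(y)             # y_hat^(0) = 0
    fits = []
    for i in range(1, m + 1):
        t_check = K @ y if i == 1 else t_check - K @ proj(t_prev, y)  # modified latent component
        t = t_check - proj(t_prev, t_check)                           # latent component
        y_hat = y_hat + proj(t, y)                                    # prediction
        fits.append(y_hat.copy())
        t_prev = t
    return fits
```

In exact arithmetic, `fits[m-1]` coincides with the fitted values $X\widehat\beta^{(m)}$ of PLSR with $m$ components.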

In order to compute the first derivative of $\widehat y$, we have to calculate the first derivative of the projection operator. The formula can be found in Proposition A.11 in the appendix.

Algorithm 4.11 (First derivative of PLSR). After setting $\check t_0 = t_0 = \widehat y^{(0)} = 0$ and
\[
\frac{\partial \check t_1}{\partial y} = \frac{\partial t_1}{\partial y} = K \,,
\]
the first derivative of the PLSR estimator can be obtained by iteratively computing
\[
\check t_m = \begin{cases} K y \,, & m = 1 \\ \check t_{m-1} - K P_{t_{m-1}} y \,, & m > 1 \end{cases} \qquad \text{(modified latent components)}
\]
\[
\frac{\partial \check t_m}{\partial y} = \frac{\partial \check t_{m-1}}{\partial y} - K\, \frac{\partial\, P_{t_{m-1}} y}{\partial y} \qquad \text{(derivative of $\check t_m$)}
\]
\[
t_m = \check t_m - P_{t_{m-1}} \check t_m \qquad \text{(latent components)}
\]
\[
\frac{\partial t_m}{\partial y} = \frac{\partial \check t_m}{\partial y} - \frac{\partial\, P_{t_{m-1}} \check t_m}{\partial y} \qquad \text{(derivative of $t_m$)}
\]
\[
\widehat y^{(m)} = \widehat y^{(m-1)} + P_{t_m} y \qquad \text{(prediction)}
\]
\[
\frac{\partial \widehat y^{(m)}}{\partial y} = \frac{\partial \widehat y^{(m-1)}}{\partial y} + \frac{\partial\, P_{t_m} y}{\partial y} \qquad \text{(derivative of $\widehat y$)}
\]

We can now define the estimated degrees of freedom of PLSR with $m$ components via
\[
\widehat{df}\,(\mathrm{PLSR}, m) = \operatorname{trace}\left( \frac{\partial \widehat y^{(m)}}{\partial y} \right) \,.
\]
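Since the analytic recursion of Algorithm 4.11 requires the derivative of the projection operator from Proposition A.11, a simpler numerical cross-check of $\widehat{df}\,(\mathrm{PLSR}, m)$ is to approximate the trace of the Jacobian by finite differences. The sketch below is such a substitute, not the algorithm of the text; it reuses `pls_fitted_values` from the previous sketch, and the step size `eps` is an arbitrary choice.

```python
import numpy as np

def estimated_dof(X, y, m, eps=1e-6):
    """Finite-difference estimate of df(PLSR, m) = trace(d y_hat^(m) / d y).

    Each diagonal entry of the Jacobian is approximated by perturbing one
    coordinate of y and refitting, so the cost is n additional PLSR fits."""
    n = len(y)
    base = pls_fitted_values(X, y, m)[-1]
    trace = 0.0
    for j in range(n):
        y_pert = y.copy()
        y_pert[j] += eps
        trace += (pls_fitted_values(X, y_pert, m)[-1][j] - base[j]) / eps
    return trace
```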

This is by definition an unbiased estimate of the degrees of freedom of PLSR in the case of normally distributed error terms. Although its computation is considerably faster than the one proposed in Phatak et al. (2002), it still suffers from numerical instability. This leads to peculiar and sometimes apparently wrong results.

Figure 4.1: The estimated degrees of freedom of PLSR as a function of the number of components.

This is illustrated with the following example. We consider the linear regression model
\[
y = X\beta + \varepsilon
\]
with $p = 20$ predictor variables and $n = 500$ examples. First, we choose the predictor matrix $X$ from a multivariate normal distribution with zero mean and covariance matrix $S$ equal to
\[
s_{ij} = \begin{cases} 1 \,, & i = j \\ 0.7 \,, & i \neq j \,. \end{cases}
\]

This leads to highly collinear data $X$. The regression vector $\beta$ is a randomly chosen vector $\beta \in \{0, 1\}^{20}$. We assume that the error terms are normally distributed with $\sigma = 8$. We compute the degrees of freedom for all $20$ components. Figure 4.1 shows the peculiar behavior of the estimated degrees of freedom. We expect the degrees of freedom to be upper-bounded by $p = 20$, the number of predictor variables. This is indicated by the dashed line. Note however that after a few components ($m \geq 9$), the estimated degrees of freedom exceed this value. The function displayed in Figure 4.1 still increases and has the form of a jagged line. This phenomenon occurs persistently for other data sets. We conjecture that Algorithm 4.11 runs into serious numerical problems. Therefore, we recommend caution when implementing the algorithm in its current form.
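For reference, the simulation setup of this example (compound-symmetry covariance with off-diagonal entries $0.7$, $\beta \in \{0,1\}^{20}$, $\sigma = 8$) can be generated with a few lines; the random seed and the use of `estimated_dof` from the earlier sketch are assumptions of this illustration, not part of the original study.

```python
import numpy as np

rng = np.random.default_rng(0)                   # seed chosen arbitrarily for reproducibility
n, p, sigma = 500, 20, 8.0
S = np.full((p, p), 0.7) + 0.3 * np.eye(p)       # s_ij = 1 if i = j, 0.7 otherwise
X = rng.multivariate_normal(np.zeros(p), S, size=n)
beta = rng.integers(0, 2, size=p).astype(float)  # random beta in {0, 1}^20
y = X @ beta + rng.normal(0.0, sigma, size=n)
dof = [estimated_dof(X, y, m) for m in range(1, p + 1)]   # df estimates for all 20 components
```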

Penalized Partial Least Squares

Nonlinear regression effects may be modeled via additive regression models of the form
\[
Y = \beta_0 + f_1(X_1) + \dots + f_p(X_p) + \varepsilon \,, \tag{5.1}
\]

where the functions $f_1, \dots, f_p$ have unspecified functional form. An approach which allows a flexible representation of the functions $f_1, \dots, f_p$ is the expansion in B-spline basis functions (Hastie & Tibshirani 1990). To prevent overfitting, there are two general approaches. In the first approach, each function $f_j$ is the sum of only a small set of basis functions,
\[
f_j(x) = \sum_{k=1}^{K_j} \beta_{kj} B_{kj}(x) \,. \tag{5.2}
\]

The basis functions $B_{kj}$ are chosen adaptively by a selection procedure. The second approach circumvents the problem of basis function selection. Instead, we allow a generous amount $K_j \gg 1$ of basis functions in the expansion (5.2). As this usually leads to high-dimensional and highly correlated data, we penalize the coefficients $\beta_{kj}$ in the estimation process (Eilers & Marx 1996). However, if the number $p$ of predictors is large compared to the number $n$ of observations in the available sample, these methods are impracticable.

Quite generally, a different approach to dealing with high dimensionality is to use dimension reduction techniques such as Partial Least Squares Regression (PLSR), which is presented in Chapter 4. In this chapter, we suggest an adaptation of the principle of penalization to PLSR. More precisely, we present a penalized version of the optimization problem (4.3) attached to PLSR. Although the motivation stems from its use for B-spline transformed data, the proposed approach is very general and can be adapted to other penalty terms or to other dimension reduction techniques such as Principal Components Analysis. It turns out that the new method shares a lot of properties with PLSR and that its computation requires virtually no extra costs. Furthermore, we show that this new penalization technique is closely related to the kernel trick that is illustrated in Section 1.4. We show that penalized PLSR is equivalent to ordinary PLSR using a generalized inner product that is defined by the penalty term. In the case of high-dimensional data, the new method is shown to be an attractive competitor to other techniques for estimating generalized additive models. In Chapter 6, we highlight the close connection between penalized PLSR and preconditioned linear systems.

This chapter is joint work with Anne-Laure Boulesteix and Gerhard Tutz.

5.1 Penalized Regression Splines

The fitting of generalized additive models by use of penalized regression splines (Eilers & Marx 1996) has become a widely used tool in statistics. The basic idea is to expand the additive component of each variable $X_j$ in basis functions as in (5.2) and to estimate the coefficients by penalization techniques. As suggested in Eilers & Marx (1996), B-splines are used as basis functions. Splines are one-dimensional piecewise polynomial functions. The points at which the pieces are connected are called knots or breakpoints. We say that a spline is of order $d$ if all polynomials are of degree $\leq d$ and if the spline is $(d-1)$ times continuously differentiable at the breakpoints. A particularly efficient set of basis functions are B-splines (de Boor 1978). An example of B-splines is given in Figure 5.1.

The number of basis functions depends on the order of the splines and the number of breakpoints. For a given variable $X_j$, we consider a set of corresponding B-spline basis functions $B_{1j}, \dots, B_{Kj}$. These basis functions define a nonlinear map
\[
\Phi_j(x) = \left(B_{1j}(x), \dots, B_{Kj}(x)\right)^t \,.
\]

Figure 5.1: Illustration of a basis of B-splines of order 3.

By performing such a transformation on each of the variables $X_1, \dots, X_p$, the observation vector $x_i$ turns into a vector
\[
z_i = \left(B_{11}(x_{i1}), \dots, B_{K1}(x_{i1}), \dots, B_{1p}(x_{ip}), \dots, B_{Kp}(x_{ip})\right)^t = \Phi(x_i) \tag{5.3}
\]

of length $pK$. Here
\[
\Phi : \mathbb{R}^p \to \mathbb{R}^{pK} \,, \qquad \Phi(x) = \left(\Phi_1(x_1), \dots, \Phi_p(x_p)\right) \,,
\]
is the function defined by the B-splines. The resulting data matrix obtained by the transformation of $X$ has dimensions $n \times pK$ and will be denoted by $Z$ in the rest of the chapter. In the examples in Sections 5.5 and ??, we consider the most widely used cubic B-splines, i.e. we choose $d = 3$.
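The transformation $x_i \mapsto z_i = \Phi(x_i)$ can be sketched with SciPy's B-spline evaluation. The number of basis functions per variable and the uniform knot placement below are illustrative choices; only the cubic degree $d = 3$ is taken from the text.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_basis=20, degree=3):
    """Evaluate n_basis B-spline basis functions of the given degree on x.

    Knots are placed uniformly over the range of x (an illustrative choice)."""
    inner = np.linspace(x.min(), x.max(), n_basis - degree + 1)
    # repeat the boundary knots so that the basis is complete at the edges
    knots = np.concatenate(([inner[0]] * degree, inner, [inner[-1]] * degree))
    basis = np.empty((len(x), n_basis))
    for k in range(n_basis):
        coef = np.zeros(n_basis)
        coef[k] = 1.0               # the k-th unit coefficient vector selects B_k
        basis[:, k] = BSpline(knots, coef, degree, extrapolate=False)(x)
    return np.nan_to_num(basis)     # guard against NaNs from evaluation at the boundary

def transform(X, n_basis=20, degree=3):
    """Map each column of X through Phi_j and stack the blocks: Z has shape (n, p*K)."""
    return np.hstack([bspline_basis(X[:, j], n_basis, degree) for j in range(X.shape[1])])
```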

The estimation of (5.1) is transformed into the estimation of the $pK$-dimensional vector $\beta$ that consists of the coefficients $\beta_{kj}$:
\[
\beta^t = \left(\beta_{11}, \dots, \beta_{K1}, \dots, \beta_{1p}, \dots, \beta_{Kp}\right) = \left(\beta_{(1)}^t, \dots, \beta_{(p)}^t\right) \,.
\]

As explained above, the vector $\beta$ determines a nonlinear, additive function
\[
f(x) = \beta_0 + \sum_{j=1}^{p} f_j(x_j) = \beta_0 + \sum_{j=1}^{p} \sum_{k=1}^{K} \beta_{kj} B_{kj}(x_j) = \beta_0 + \Phi(x)^t \beta \,.
\]
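In code, evaluating the additive function at the sample points reduces to a single matrix-vector product with the transformed data; the small sketch below reuses `transform` from the previous sketch, and the function name is hypothetical.

```python
import numpy as np

def additive_fit(Z, beta0, beta):
    """Fitted values f(x_i) = beta_0 + Phi(x_i)^t beta for all rows of Z = transform(X)."""
    return beta0 + Z @ beta
```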

As $Z$ is usually high-dimensional, the estimation of $\beta$ by minimizing the squared empirical risk
\[
\widehat R(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 = \frac{1}{n} \left\| y - \beta_0 \mathbf{1} - Z\beta \right\|^2
\]

usually leads to overfitting. Following Eilers & Marx (1996), we use many basis functions for each variable, say $K \approx 20$, and estimate by penalization. The idea is to penalize the second derivative of the function $f$,
\[
\int \left(f''(x)\right)^2 \, dx \,.
\]

Eilers & Marx (1996) show that the following difference penalty term is a good approximation of the penalty on the second derivative of the functions $f_j$:
\[
P(\beta) = \sum_{j=1}^{p} \sum_{k=3}^{K} \lambda_j \left(\Delta^2 \beta_{kj}\right)^2 \,.
\]

Here $\lambda_j \geq 0$ are smoothing parameters that control the amount of penalization. The terms $\Delta^2 \beta_{kj}$ are the second-order differences of adjacent parameters. The difference operator $\Delta^2$ has the form
\[
\Delta^2 \beta_{kj} = \left(\beta_{kj} - \beta_{k-1,j}\right) - \left(\beta_{k-1,j} - \beta_{k-2,j}\right) = \beta_{kj} - 2\beta_{k-1,j} + \beta_{k-2,j} \,.
\]

This penalty term can be expressed in terms of a penalty matrix $P$. We denote by $D_K$ the $(K-1) \times K$ matrix
\[
D_K =
\begin{pmatrix}
1 & -1 & & & \\
 & 1 & -1 & & \\
 & & \ddots & \ddots & \\
 & & & 1 & -1
\end{pmatrix}
\]
that defines the first-order difference operator. Setting
\[
K_2 = \left(D_{K-1} D_K\right)^t D_{K-1} D_K \,,
\]

we conclude that the penalty term equals
\[
P(\beta) = \sum_{j=1}^{p} \lambda_j\, \beta_{(j)}^t K_2\, \beta_{(j)} = \beta^t \left(\Delta_\lambda \otimes K_2\right) \beta \,.
\]
Here $\Delta_\lambda$ is the $p \times p$ diagonal matrix containing $\lambda_1, \dots, \lambda_p$ on its diagonal and $\otimes$ is the Kronecker product. The generalization of this method to higher-order differences of the coefficients of adjacent B-splines is straightforward. We simply replace $K_2$ by
\[
K_q = \left(D_{K-q+1} \dots D_K\right)^t \left(D_{K-q+1} \dots D_K\right) \,.
\]

To summarize, the penalized least squares criterion has the form
\[
\widehat R_P(\beta) = \frac{1}{n} \left\| y - \beta_0 \mathbf{1} - Z\beta \right\|^2 + \beta^t P \beta \tag{5.4}
\]
with $Z$ the transformed data defined in (5.3) and the penalty matrix $P$ defined as
\[
P = \Delta_\lambda \otimes K_q \,. \tag{5.5}
\]
This is a symmetric matrix that is positive semidefinite.
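To make (5.5) concrete, a short sketch constructing the difference matrices and the penalty matrix $P = \Delta_\lambda \otimes K_q$ might look as follows; the function names are illustrative and not from the text.

```python
import numpy as np

def difference_operator(K, q=2):
    """q-th order difference operator D_{K-q+1} ... D_K of shape (K-q, K)."""
    D = np.eye(K)
    for _ in range(q):
        D = np.diff(D, axis=0)      # each step applies a first-order difference matrix
    return D

def penalty_matrix(K, lambdas, q=2):
    """Penalty matrix P = Delta_lambda kron K_q from equation (5.5).

    K: number of B-spline basis functions per variable,
    lambdas: smoothing parameters (lambda_1, ..., lambda_p), one per predictor."""
    D = difference_operator(K, q)
    K_q = D.T @ D                   # K_q = (D_{K-q+1} ... D_K)^t (D_{K-q+1} ... D_K)
    return np.kron(np.diag(lambdas), K_q)

# example: K = 20 cubic B-splines per variable, p = 3 variables
P = penalty_matrix(20, lambdas=[1.0, 10.0, 0.1], q=2)   # shape (60, 60), symmetric PSD
```

With `P` in hand and the intercept ignored for simplicity, minimizing (5.4) amounts to solving the linear system $(Z^t Z / n + P)\,\beta = Z^t y / n$.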