by computing iteratively
\[
\tilde{w}_i = w_i - \frac{\tilde{w}_{i-1}^t X^t X w_i}{\tilde{w}_{i-1}^t X^t X \tilde{w}_{i-1}}\, \tilde{w}_{i-1}\,, \qquad
\hat{\beta}^{(i)} = \hat{\beta}^{(i-1)} + \frac{\tilde{w}_i^t X^t y}{\tilde{w}_i^t X^t X \tilde{w}_i}\, \tilde{w}_i\,.
\]
Proof. We prove the statements via induction. For $i = 1$, we have $\tilde{w}_1 = w_1$ as $X_1 = X$ and
\[
\hat{y} = P_{t_1} y = X w_1 \left( w_1^t X^t X w_1 \right)^{-1} w_1^t X^t y
= X\, \frac{\tilde{w}_1^t X^t y}{\tilde{w}_1^t X^t X \tilde{w}_1}\, \tilde{w}_1\,.
\]
For a general $i$, we have
\[
t_{i+1} = X_{i+1} w_{i+1} = \left( X - P_{t_1, \ldots, t_i} X \right) w_{i+1} = X w_{i+1} - P_{t_i} X w_{i+1}\,.
\]
The last equality holds as $R = T^t X W$ is bidiagonal. Using formula (A.9) for the projection operator and the induction hypothesis (4.16), it follows that
\[
t_{i+1} = X w_{i+1} - X \tilde{w}_i \left( \tilde{w}_i^t X^t X \tilde{w}_i \right)^{-1} \tilde{w}_i^t X^t X w_{i+1}\,.
\]
We conclude that
\[
\tilde{w}_{i+1} = w_{i+1} - \frac{\tilde{w}_i^t X^t X w_{i+1}}{\tilde{w}_i^t X^t X \tilde{w}_i}\, \tilde{w}_i\,.
\]
The regression estimate after $i$ steps is
\[
X \hat{\beta}^{(i)} = P_{t_1, \ldots, t_i} y = P_{t_1, \ldots, t_{i-1}} y + P_{t_i} y
= X \hat{\beta}^{(i-1)} + P_{t_i} y
= X \hat{\beta}^{(i-1)} + X\, \frac{\tilde{w}_i^t X^t y}{\tilde{w}_i^t X^t X \tilde{w}_i}\, \tilde{w}_i\,.
\]
This concludes the proof.
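To make the recursion concrete, here is a minimal NumPy sketch of the update rules of Proposition 4.9. The function name, and the explicit deflation step used to obtain the weight vectors $w_i = X_i^t y$, are our choices, not prescribed by the text.

```python
import numpy as np

def pls_beta(X, y, m):
    """Sketch of Proposition 4.9: iterate the modified weight vectors
    w_tilde_i and accumulate the regression estimate beta^(i)."""
    n, p = X.shape
    XtX = X.T @ X
    Xty = X.T @ y
    beta = np.zeros(p)
    Xi = X.copy()                  # deflated matrix X_i, with X_1 = X
    w_tilde_prev = None
    for i in range(m):
        w = Xi.T @ y               # weight vector w_i = X_i^t y
        if w_tilde_prev is None:
            w_tilde = w            # base case: w_tilde_1 = w_1
        else:
            num = w_tilde_prev @ XtX @ w
            den = w_tilde_prev @ XtX @ w_tilde_prev
            w_tilde = w - (num / den) * w_tilde_prev
        # beta^(i) = beta^(i-1) + (w_tilde^t X^t y / w_tilde^t X^t X w_tilde) w_tilde
        beta += (w_tilde @ Xty) / (w_tilde @ XtX @ w_tilde) * w_tilde
        t = X @ w_tilde            # latent component t_i = X w_tilde_i
        Xi = Xi - np.outer(t, t @ X) / (t @ t)   # X_{i+1} = X_i - P_{t_i} X
        w_tilde_prev = w_tilde
    return beta
```

Note that each step orthogonalizes only against the previous modified weight vector, which is exactly what the bidiagonality of $R$ buys.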
One strategy for model selection is to choose the number of components that minimizes the cross-validated error. In Chapter 2, we discussed a different strategy for model selection that involves the computation of the degrees of freedom. Note that the PLSR estimate (4.8) is not a linear function of $y$. Hence, we can only estimate the degrees of freedom using equation (2.9). As a consequence, we have to compute the first derivative of the PLSR estimator. This has been done before: Phatak et al. (2002) compute the first derivative of $\hat{\beta}^{(m)}_{PLS}$ in order to obtain asymptotic results on the variance of the estimator. In that work, some general rules of matrix differential calculus are applied to the formula in Proposition 4.7. It turns out that this calculation of the first derivative of PLSR is not only numerically unstable, it is also time-consuming. The reason is that the formula for the first derivative of $\hat{\beta}$ involves matrices of order $(mn) \times n$. We therefore choose a different approach. Serneels et al. (2004) use a recursive formula for $\hat{\beta}^{(m)}_{PLS}$ that is equivalent to the algorithm in Proposition 4.9 and derive a fast algorithm for the first derivative of PLSR. We now show how to compute the derivative of the PLSR estimate $\hat{y}$ in a recursive way. We start with the remark that, by definition,
\[
X_i = X_{i-1} - P_{t_{i-1}} X\,.
\]
Using the definition of the weight vectors of PLSR, we conclude that
\[
w_i = X_i^t y = w_{i-1} - X^t P_{t_{i-1}} y\,.
\]
Let us define the modified latent components $\tilde{t}_i = X w_i$. We conclude that
\[
\tilde{t}_i = \tilde{t}_{i-1} - X X^t P_{t_{i-1}} y\,.
\]
Furthermore, we conclude from Proposition 4.9 that
\[
t_i = X \tilde{w}_i
= X \left( w_i - \frac{\tilde{w}_{i-1}^t X^t X w_i}{\tilde{w}_{i-1}^t X^t X \tilde{w}_{i-1}}\, \tilde{w}_{i-1} \right)
= \tilde{t}_i - \frac{t_{i-1}^t \tilde{t}_i}{t_{i-1}^t t_{i-1}}\, t_{i-1}
= \tilde{t}_i - P_{t_{i-1}} \tilde{t}_i\,.
\]
Finally,
\[
\hat{y}^{(i)} = \hat{y}^{(i-1)} + P_{t_i} y\,.
\]
This leads to the following recursive algorithm for the computation of $\hat{y}$.

Algorithm 4.10. We define $K = X X^t$. After setting $\tilde{t}_0 = t_0 = \hat{y}^{(0)} = 0$, we iteratively compute the PLSR estimates $\hat{y}$ via
\[
\tilde{t}_m = \begin{cases} K y\,, & m = 1 \\ \tilde{t}_{m-1} - K P_{t_{m-1}} y\,, & m > 1 \end{cases}
\qquad \text{(modified latent components)}
\]
\[
t_m = \tilde{t}_m - P_{t_{m-1}} \tilde{t}_m \qquad \text{(latent components)}
\]
\[
\hat{y}^{(m)} = \hat{y}^{(m-1)} + P_{t_m} y \qquad \text{(prediction)}
\]
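Algorithm 4.10 translates almost line by line into code. The following NumPy sketch uses a small helper for the projection $P_t$; the names are ours.

```python
import numpy as np

def proj(t, v):
    """Projection P_t v = t (t^t t)^{-1} t^t v onto a single component t."""
    return t * (t @ v) / (t @ t)

def pls_yhat(X, y, m):
    """Sketch of Algorithm 4.10: recursive computation of the PLSR fit."""
    K = X @ X.T
    yhat = np.zeros_like(y, dtype=float)
    t_tilde = t = None
    for step in range(1, m + 1):
        if step == 1:
            t_tilde = K @ y                     # modified latent component
            t = t_tilde                         # t_0 = 0, so no correction
        else:
            t_tilde = t_tilde - K @ proj(t, y)  # t here is still t_{m-1}
            t = t_tilde - proj(t, t_tilde)      # latent component t_m
        yhat = yhat + proj(t, y)                # prediction update
    return yhat
```

Since only matrix-vector products with $K$ appear, the cost per component is of order $n^2$.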
In order to compute the first derivative of $\hat{y}$, we have to calculate the first derivative of the projection operator. The formula can be found in Proposition A.11 in the appendix.
Algorithm 4.11 (First derivative of PLSR). After setting $\tilde{t}_0 = t_0 = \hat{y}^{(0)} = 0$ and $d\tilde{t}_1 = dt_1 = K$, the first derivative of the PLSR estimator can be obtained by iteratively computing
\[
\tilde{t}_m = \begin{cases} K y\,, & m = 1 \\ \tilde{t}_{m-1} - K P_{t_{m-1}} y\,, & m > 1 \end{cases}
\qquad \text{(modified latent components)}
\]
\[
\frac{\partial \tilde{t}_m}{\partial y} = \frac{\partial \tilde{t}_{m-1}}{\partial y} - K\, \frac{\partial P_{t_{m-1}} y}{\partial y}
\qquad \text{(derivative of } \tilde{t}_m \text{)}
\]
\[
t_m = \tilde{t}_m - P_{t_{m-1}} \tilde{t}_m \qquad \text{(latent components)}
\]
\[
\frac{\partial t_m}{\partial y} = \frac{\partial \tilde{t}_m}{\partial y} - \frac{\partial P_{t_{m-1}} \tilde{t}_m}{\partial y}
\qquad \text{(derivative of } t_m \text{)}
\]
\[
\hat{y}^{(m)} = \hat{y}^{(m-1)} + P_{t_m} y \qquad \text{(prediction)}
\]
\[
\frac{\partial \hat{y}^{(m)}}{\partial y} = \frac{\partial \hat{y}^{(m-1)}}{\partial y} + \frac{\partial P_{t_m} y}{\partial y}
\qquad \text{(derivative of } \hat{y} \text{)}
\]
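A direct transcription of Algorithm 4.11 might look as follows. The helper d_proj is our rendering of the derivative of the projection operator via the product and quotient rules; the text's exact formula is Proposition A.11, which we do not reproduce here. In line with the motivation above, only $n \times n$ matrices are kept in memory.

```python
import numpy as np

def proj(t, v):
    return t * (t @ v) / (t @ t)

def d_proj(t, Dt, v, Dv):
    """Jacobian of P_t v = ((t^t v) / (t^t t)) t with respect to y, given
    the Jacobians Dt = dt/dy and Dv = dv/dy (product/quotient rule)."""
    tt, tv = t @ t, t @ v
    return (np.outer(t, v) @ Dt + np.outer(t, t) @ Dv + tv * Dt) / tt \
        - (2.0 * tv / tt ** 2) * (np.outer(t, t) @ Dt)

def pls_dof(X, y, m):
    """Sketch of Algorithm 4.11: trace(d yhat^(i) / dy) for i = 1, ..., m."""
    n = len(y)
    K = X @ X.T
    I = np.eye(n)
    Dyhat = np.zeros((n, n))
    t_tilde = t = Dt_tilde = Dt = None
    dof = []
    for step in range(1, m + 1):
        if step == 1:
            t_tilde, Dt_tilde = K @ y, K.copy()   # d t_tilde_1 / dy = K
            t, Dt = t_tilde, Dt_tilde
        else:
            t_tilde_new = t_tilde - K @ proj(t, y)
            Dt_tilde_new = Dt_tilde - K @ d_proj(t, Dt, y, I)
            t_new = t_tilde_new - proj(t, t_tilde_new)
            Dt_new = Dt_tilde_new - d_proj(t, Dt, t_tilde_new, Dt_tilde_new)
            t_tilde, Dt_tilde, t, Dt = t_tilde_new, Dt_tilde_new, t_new, Dt_new
        Dyhat = Dyhat + d_proj(t, Dt, y, I)       # derivative of yhat^(m)
        dof.append(np.trace(Dyhat))               # estimated df, defined below
    return dof
```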
We can now define the estimated degrees of freedom of PLSR with $m$ components via
\[
\widehat{df}\,(\mathrm{PLSR}, m) = \operatorname{trace} \frac{\partial \hat{y}^{(m)}}{\partial y}\,.
\]
This is by definition an unbiased estimate of the degrees of freedom of PLSR in the case of normally distributed error terms. Although its computation is considerably faster than the one proposed in Phatak et al. (2002), it still suffers from numerical instability. This leads to peculiar and sometimes apparently wrong results, which we illustrate with the following example. We consider the linear regression model
\[
y = X\beta + \varepsilon
\]
with $p = 20$ predictor variables and $n = 500$ examples. First, we choose the predictor matrix $X$ from a multivariate normal distribution with zero mean and covariance matrix $S$ equal to
\[
s_{ij} = \begin{cases} 1\,, & i = j \\ 0.7\,, & i \neq j\,. \end{cases}
\]
This leads to highly collinear data $X$. The regression vector $\beta$ is a randomly chosen vector $\beta \in \{0, 1\}^{20}$. We assume that the error terms are normally distributed with $\sigma = 8$. We compute the degrees of freedom for all $20$ components.

Figure 4.1: The estimated degrees of freedom of PLSR as a function of the number of components.

Figure 4.1 shows the peculiar behavior of the estimated degrees of freedom. We expect the degrees of freedom to be upper-bounded by $p = 20$, the number of predictor variables. This is indicated by the dashed line. Note however that after a few components ($m \geq 9$), the estimated degrees of freedom exceed this value. The function displayed in Figure 4.1 keeps increasing and has the form of a jagged line. This phenomenon persistently occurs for other data sets. We conjecture that Algorithm 4.11 runs into serious numerical problems. Therefore, we recommend being cautious about implementing the algorithm in its current form.
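For readers who want to reproduce the effect, the following sketch sets up the simulation; the random seed and the reuse of pls_dof from the previous sketch are our choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 500, 20, 8.0

# covariance with s_ij = 1 for i = j and 0.7 otherwise: highly collinear X
S = np.full((p, p), 0.7) + 0.3 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), S, size=n)
beta = rng.integers(0, 2, size=p).astype(float)   # random vector in {0, 1}^20
y = X @ beta + sigma * rng.standard_normal(n)

dof = pls_dof(X, y, m=20)   # pls_dof from the sketch above
# plotting dof against the component number should exhibit the jagged,
# eventually out-of-range behavior described in the text
```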
Penalized Partial Least Squares
Nonlinear regression effects may be modeled via additive regression models of the form
\[
Y = \beta_0 + f_1(X_1) + \cdots + f_p(X_p) + \varepsilon\,, \tag{5.1}
\]
where the functions $f_1, \ldots, f_p$ have unspecified functional form. An approach which allows a flexible representation of the functions $f_1, \ldots, f_p$ is the expansion in B-splines basis functions (Hastie & Tibshirani 1990). To prevent overfitting, there are two general approaches. In the first approach, each function $f_j$ is the sum of only a small set of basis functions,
\[
f_j(x) = \sum_{k=1}^{K_j} \beta_{kj} B_{kj}(x)\,. \tag{5.2}
\]
The basis functions $B_{kj}$ are chosen adaptively by a selection procedure. The second approach circumvents the problem of basis function selection. Instead, we allow a generous number $K_j \gg 1$ of basis functions in the expansion (5.2). As this usually leads to high-dimensional and highly correlated data, we penalize the coefficients $\beta_{jk}$ in the estimation process (Eilers & Marx 1996). However, if the number $p$ of predictors is large compared to the number $n$ of observations in the available sample, these methods are impracticable.

Quite generally, a different approach to deal with high dimensionality is to use
dimension reduction techniques such as Partial Least Squares Regression (PLSR), which is presented in Chapter 4. In this chapter, we suggest an adaptation of the principle of penalization to PLSR. More precisely, we present a penalized version of the optimization problem (4.3) attached to PLSR. Although the motivation stems from its use for B-splines transformed data, the proposed approach is very general and can be adapted to other penalty terms or to other dimension reduction techniques such as Principal Components Analysis. It turns out that the new method shares a lot of properties with PLSR and that its computation requires virtually no extra costs. Furthermore, we show that this new penalization technique is closely related to the kernel trick that is illustrated in Section 1.4. We show that penalized PLSR is equivalent to ordinary PLSR using a generalized inner product that is defined by the penalty term. In the case of high-dimensional data, the new method is shown to be an attractive competitor to other techniques for estimating generalized additive models. In Chapter 6, we highlight the close connection between penalized PLSR and preconditioned linear systems.

This chapter is joint work with Anne-Laure Boulesteix and Gerhard Tutz.
5.1 Penalized Regression Splines
The fitting of generalized additive models by use of penalized regression splines (Eilers & Marx 1996) has become a widely used tool in statistics. The basic idea is to expand the additive component of each variable $X_j$ in basis functions as in (5.2) and to estimate the coefficients by penalization techniques. As suggested in Eilers & Marx (1996), B-splines are used as basis functions. Splines are one-dimensional piecewise polynomial functions. The points at which the pieces are connected are called knots or breakpoints. We say that a spline is of order $d$ if all polynomials are of degree $\leq d$ and if the spline is $(d-1)$ times continuously differentiable at the breakpoints. A particularly efficient set of basis functions are B-splines (de Boor 1978). An example of B-splines is given in Figure 5.1.
The number of basis functions depends on the order of the splines and the number of breakpoints. For a given variable $X_j$, we consider a set of corresponding B-splines basis functions $B_{1j}, \ldots, B_{Kj}$. These basis functions define a nonlinear map
\[
\Phi_j(x) = \left( B_{1j}(x), \ldots, B_{Kj}(x) \right)^t\,.
\]

Figure 5.1: Illustration of a basis of B-splines of order 3.

By performing such a transformation on each of the variables $X_1, \ldots, X_p$, the observation vector $x_i$ turns into a vector
\[
z_i = \left( B_{11}(x_{i1}), \ldots, B_{K1}(x_{i1}), \ldots, B_{1p}(x_{ip}), \ldots, B_{Kp}(x_{ip}) \right)^t = \Phi(x_i) \tag{5.3}
\]
of length $pK$. Here
\[
\Phi : \mathbb{R}^p \to \mathbb{R}^{pK}\,, \qquad \Phi(x) = \left( \Phi_1(x_1), \ldots, \Phi_p(x_p) \right)\,,
\]
is the function defined by the B-splines. The resulting data matrix obtained by the transformation of $X$ has dimensions $n \times pK$ and will be denoted by $Z$ in the rest of the chapter. In the examples in Sections 5.5 and ??, we consider the most widely used cubic B-splines, i.e. we choose $d = 3$.
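As an illustration of the transformation (5.3), the following sketch builds the matrix $Z$ with SciPy's B-spline design matrix (available from SciPy 1.8 onwards); the equidistant placement of the breakpoints over the observed range is our choice and is not prescribed by the text.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, K, d=3):
    """Evaluate K B-spline basis functions of degree d on the range of x."""
    lo, hi = x.min(), x.max()
    inner = np.linspace(lo, hi, K - d + 1)      # equidistant breakpoints
    knots = np.r_[[lo] * d, inner, [hi] * d]    # clamped knot vector
    return BSpline.design_matrix(x, knots, d).toarray()   # shape (len(x), K)

def transform(X, K=20, d=3):
    """Map the n x p data matrix X to the n x pK matrix Z of (5.3)."""
    return np.hstack([bspline_basis(X[:, j], K, d) for j in range(X.shape[1])])
```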
The estimation of (5.1) is transformed into the estimation of the $pK$-dimensional vector $\beta$ that consists of the coefficients $\beta_{jk}$:
\[
\beta^t = \left( \beta_{11}, \ldots, \beta_{K1}, \ldots, \beta_{1p}, \ldots, \beta_{Kp} \right) = \left( \beta_{(1)}^t, \ldots, \beta_{(p)}^t \right)\,.
\]
As explained above, the vector $\beta$ determines a nonlinear, additive function
\[
f(x) = \beta_0 + \sum_{j=1}^{p} f_j(x_j) = \beta_0 + \sum_{j=1}^{p} \sum_{k=1}^{K} \beta_{kj} B_{kj}(x_j) = \beta_0 + \Phi(x)^t \beta\,.
\]
As $Z$ is usually high-dimensional, the estimation of $\beta$ by minimizing the squared empirical risk
\[
\widehat{R}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 = \frac{1}{n} \left\| y - \beta_0 \mathbf{1} - Z\beta \right\|^2
\]
usually leads to overfitting. Following Eilers & Marx (1996), we use many basis functions for each variable, say $K \approx 20$, and estimate by penalization. The idea is to penalize the second derivative of the function $f$,
\[
\int \left( f''(x) \right)^2\, dx\,.
\]
Eilers & Marx (1996) show that the following difference penalty term is a good approximation of the penalty on the second derivative of the functions $f_j$:
\[
P(\beta) = \sum_{j=1}^{p} \sum_{k=3}^{K} \lambda_j \left( \Delta^2 \beta_{kj} \right)^2\,.
\]
Here $\lambda_j \geq 0$ are smoothing parameters that control the amount of penalization, and the terms $\Delta^2 \beta_{kj}$ are called the second-order differences of adjacent parameters. The difference operator $\Delta^2$ has the form
\[
\Delta^2 \beta_{kj} = \left( \beta_{kj} - \beta_{k-1,j} \right) - \left( \beta_{k-1,j} - \beta_{k-2,j} \right) = \beta_{kj} - 2\beta_{k-1,j} + \beta_{k-2,j}\,.
\]
This penalty term can be expressed in terms of a penalty matrix $P$. We denote by $D_K$ the $(K-1) \times K$ matrix
\[
D_K = \begin{pmatrix}
1 & -1 & & & \\
& 1 & -1 & & \\
& & \ddots & \ddots & \\
& & & 1 & -1
\end{pmatrix}
\]
that defines the first-order difference operator. Setting
\[
K_2 = \left( D_{K-1} D_K \right)^t D_{K-1} D_K\,,
\]
we conclude that the penalty term equals
\[
P(\beta) = \sum_{j=1}^{p} \lambda_j\, \beta_{(j)}^t K_2\, \beta_{(j)} = \beta^t \left( \Delta_\lambda \otimes K_2 \right) \beta\,.
\]
Here $\Delta_\lambda$ is the $p \times p$ diagonal matrix containing $\lambda_1, \ldots, \lambda_p$ on its diagonal and $\otimes$ is the Kronecker product. The generalization of this method to higher-order differences of the coefficients of adjacent B-splines is straightforward. We simply replace $K_2$ by
\[
K_q = \left( D_{K-q+1} \cdots D_K \right)^t \left( D_{K-q+1} \cdots D_K \right)\,.
\]
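In code, the penalty matrix is a few lines of NumPy; the function names are ours.

```python
import numpy as np

def first_diff(K):
    """First-order difference operator D_K of size (K-1) x K."""
    D = np.zeros((K - 1, K))
    idx = np.arange(K - 1)
    D[idx, idx], D[idx, idx + 1] = 1.0, -1.0
    return D

def penalty_matrix(lambdas, K, q=2):
    """P = Delta_lambda (Kronecker product) K_q, with
    K_q = (D_{K-q+1} ... D_K)^t (D_{K-q+1} ... D_K)."""
    Dq = np.eye(K)
    for i in range(q):                  # compose D_K, D_{K-1}, ..., D_{K-q+1}
        Dq = first_diff(K - i) @ Dq
    Kq = Dq.T @ Dq                      # K x K, symmetric positive semidefinite
    return np.kron(np.diag(np.asarray(lambdas, dtype=float)), Kq)
```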
To summarize, the penalized least squares criterion has the form
\[
\widehat{R}_P(\beta) = \frac{1}{n} \left\| y - \beta_0 \mathbf{1} - Z\beta \right\|^2 + \beta^t P \beta \tag{5.4}
\]
with $Z$ the transformed data defined in (5.3) and the penalty matrix $P$ defined as
\[
P = \Delta_\lambda \otimes K_q\,. \tag{5.5}
\]
This is a symmetric matrix that is positive semidefinite.
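A minimal sketch of minimizing (5.4) follows, assuming the columns of $Z$ are centered so that the intercept separates as the mean of $y$; the factor $n$ in front of $P$ comes from clearing the $1/n$ in the criterion.

```python
import numpy as np

def penalized_ls(Z, y, P):
    """Minimize (1/n) ||y - beta0 - Z beta||^2 + beta^t P beta.

    Assumes column-centered Z, in which case beta0 = mean(y) and beta
    solves the regularized normal equations (Z^t Z + n P) beta = Z^t (y - beta0).
    """
    n = Z.shape[0]
    beta0 = y.mean()
    beta = np.linalg.solve(Z.T @ Z + n * P, Z.T @ (y - beta0))
    return beta0, beta
```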