Analysis of High-Dimensional Data
with Partial Least Squares and Boosting
submitted by
Diplom-Mathematikerin Nicole Krämer
from Berlin

Dissertation approved by Faculty IV (Electrical Engineering and Computer Science)
of the Technische Universität Berlin
in fulfilment of the requirements for the degree of
Doktor der Naturwissenschaften (Dr. rer. nat.)

Doctoral committee:
Chair: Prof. Dr. Manfred Opper
Reviewer: Prof. Dr. Ulrich Kockelkorn
Reviewer: Prof. Dr. Gerhard Tutz
Date of the scientific defense: 15 December 2006

Berlin 2007
First of all, I want to express my deep gratitude to Prof. Dr. Ulrich Kockelkorn for his thorough supervision. I would also like to thank Prof. Dr. Gerhard Tutz, who agreed to be the second examiner of this thesis and who supported my work with his advice and encouragement. I am obliged to Dr. Jörg Betzin for introducing me to a lot of exciting open questions regarding Partial Least Squares and for sharing his thoughts with me. Finally, I benefited a lot from the exchange of ideas with Dr. Roman Rosipal, Athanassios Kondylis and Dr. Anne-Laure Boulesteix.
In the statistical analysis of high-dimensional data, the task is to model relationships between a large number $p$ of variables with the help of a limited number $n$ of observations. All methods of analysis share the following two goals. On the one hand, it is important to detect (latent) structures in the data in order to obtain a manageable, low-dimensional representation. On the other hand, it is often of great importance to develop comprehensible and easily interpretable models. The high dimensionality of the data often causes severe problems, since for $p \gg n$ the traditional statistical techniques fail. Moreover, the structure of the data is often more complex. The observed quantities are not, as is usual in classical statistics, vectors in a finite-dimensional vector space, but for example functions. Examples of this type of data structure are time series or measurements from near-infrared spectroscopy. In this thesis, the analysis of high-dimensional and complex data is investigated with the help of two methods: Partial Least Squares and Boosting in function spaces.

Partial Least Squares (PLS) models the relationship between different blocks of variables with the help of so-called latent variables. In the case of more than two blocks, the PLS techniques are also called path models and can be regarded as an extension of Canonical Correlation Analysis. The mathematical properties of PLS path models are to a large extent still unexplored. For example, it is neither clear whether the algorithms for computing the latent variables in the path model converge numerically, nor whether, in case they converge, they yield solutions of sensible optimization problems. In this thesis, a sound mathematical framework for the description of the path models is established. It is shown that for a large part of the PLS algorithms (those with at least one block in mode A) no twice-differentiable optimization criterion exists. In addition, it is shown on simulated data that the algorithms with all blocks in mode B need not converge to the optimum of the associated optimization problem.

PLS can also be used in regression problems by regarding the explanatory and the dependent variables as one block each. In this case, PLS additionally provides a dimension reduction of the data, which in turn hopefully leads to better predictions. In this thesis, an extension of PLS by a penalty term is presented and applied to the estimation of generalized additive models (GAMs). It turns out that, in particular for high-dimensional data, this approach is a good alternative to classical GAM techniques. Starting from the known connection between PLS and the conjugate gradient method (from numerical linear algebra), it is shown that penalized PLS is equivalent to a preconditioned conjugate gradient method, where the preconditioner is determined by the penalty term. Subsequently, the connections between linear algebra and PLS are exploited to investigate empirically the so-called shrinkage properties of PLS. In addition, an unbiased estimator of the degrees of freedom of PLS is derived.

Boosting is a method from the field of machine learning. The basic idea is to combine several simple prediction models in such a way that the combination leads to much better predictions. In this thesis, Boosting methods are developed for complex data structures. We are mainly interested in the case in which the observed explanatory quantities are functions or discrete measurements of functions. The established Boosting methods implicitly rely on the assumption that the explanatory variables take values in a finite-dimensional vector space. It is shown that the extension to infinite-dimensional function spaces is straightforward. Moreover, it is illustrated how Boosting methods can be used to detect important characteristics of the functions and how this yields easily interpretable and visualizable models. This is done by means of a transformation of the data with wavelet or Fourier bases.
The crucial task in the statistical analysis of high-dimensional data is to model relationships between a large number $p$ of variables based on a small number $n$ of observations. Quite generally, we pursue two goals. On the one hand, it is important to detect (latent) structures in the data in order to obtain a feasible, low-dimensional representation. On the other hand, we often need simple and comprehensible models that can be interpreted. The high dimensionality of the data often forms an obstacle, as for $p \gg n$, the traditional statistical techniques fail to produce satisfactory results. Furthermore, the structure of the data can be complex. The observed variables are not, as usually assumed in classical statistics, elements of a finite-dimensional vector space, but, for instance, functions. Examples for this type of data are time series or experiments from the field of near-infrared spectroscopy. In this work, we investigate high-dimensional and complex data with the help of two methods: Partial Least Squares and Boosting for functional data.

Partial Least Squares (PLS) models the relationship between different blocks of variables in terms of so-called latent variables. In the case of more than two blocks, the PLS techniques are also called path models and can be seen as a generalization of Canonical Correlation Analysis. The mathematical properties of PLS are for the most part not yet established. For example, it is neither known whether the PLS algorithms converge numerically, nor, in the case that they converge, if they produce solutions of a sensible optimization criterion. In this work, we establish a sound mathematical framework for the description of PLS path models. We show that for a large part of the PLS algorithms (those with at least one block in mode A), there is indeed no twice-differentiable optimization problem. Furthermore, we show on simulated data that the PLS algorithms in mode B do not always converge to the optimum of the associated optimization problem.

PLS can also be used for regression problems by regarding the explanatory and the dependent variables as one block each. In this case, PLS additionally provides a dimension reduction of the data, which hopefully leads to better prediction rules. In this work, we present an extension of PLS using penalization techniques. This method is then used to estimate generalized additive models (GAMs). This approach turns out to be a good alternative to traditional GAM methods in the case of high-dimensional data. Based on the well-known relationship between PLS and the conjugate gradient technique (a method from the field of numerical linear algebra), we prove that penalized PLS is equal to a preconditioned conjugate gradient technique. Here, the preconditioner is determined by the penalty term. Subsequently, we exploit the connections between PLS and linear algebra to investigate empirically the so-called shrinkage properties of PLS. In addition, we derive an unbiased estimate of the degrees of freedom of PLS.

Boosting has its roots in the machine learning community. The basic idea is to combine several simple models in such a way that their combination leads to better prediction rules. In this work, we develop Boosting algorithms for complex data structures. Our focus is on data that are (discrete) measurements of curves. The established Boosting methods implicitly assume that the observed variables lie in a finite-dimensional vector space. We show that an extension of Boosting to infinite-dimensional function spaces is straightforward. Furthermore, we illustrate how to detect relevant features of the investigated functions and how to produce simple and interpretable models. This is done by applying wavelet or Fourier transformations to the data.
1 Preliminaries
1.1 Notation and Guideline
1.2 Learning from Data
1.3 The Regression Model
1.4 Duality and Kernel Methods

2 Model Selection
2.1 Cross Validation
2.2 Degrees of Freedom
2.3 Information Criteria

3 Partial Least Squares Path Models
3.1 The Partial Least Squares Framework
3.2 Optimization Strategies
3.3 Multivariate Power Methods
3.4 The Partial Least Squares Path Algorithms
3.5 No Smooth Optimality Criterion for Mode A
3.6 No Convergence to the Optimum for Mode B
3.7 Conclusion

4 Partial Least Squares for Regression
4.1 NIPALS and SIMPLS
4.2 Basic Properties of Partial Least Squares Regression
4.3 The Degrees of Freedom of Partial Least Squares

5 Penalized Partial Least Squares
5.1 Penalized Regression Splines
5.2 Dimension Reduction for B-Splines Transformations
5.6 Conclusion

6 From Linear Regression to Linear Algebra
6.1 Preconditioned Conjugate Gradient Methods
6.2 Why Krylov Subspaces?
6.3 A Polynomial Representation

7 Shrinkage Properties of Partial Least Squares
7.1 What is Shrinkage?
7.2 The Shrinkage Factors of Partial Least Squares
7.3 Simulations
7.4 Example: Tecator Data Set
7.5 Conclusion

8 Functional Data Analysis
8.1 Example: Biscuit Dough Data Set
8.2 Example: Speech Recognition Data Set
8.3 From Observations to Functions
8.4 Learning from Functional Data

9 Boosting for Functional Data
9.1 A Short Introduction to Boosting
9.2 The Degrees of Freedom of Boosting
9.3 Boosting and Partial Least Squares
9.4 Functional Boosting
9.5 Example: Speech Recognition
9.6 Example: Biscuit Dough Data Set
9.7 Conclusion

10 Summary and Outlook

Bibliography

A Mathematical Background
A.1 Matrix Differential Calculus
A.2 Optimization under Constraints
A.5 The Moore-Penrose Inverse
Preliminaries

This chapter serves as a reference for some basic concepts that are essential for the rest of this work.
1.1 Notation and Guideline
Matrices are denoted by bold upper case letters $X, A, U, \ldots$ and vectors are denoted by bold lower case letters $v, y, \ldots$. This rule is also valid if we use Greek letters. The transpose of a matrix or a vector is indicated by a superscript $t$ as in $A^t$ and $v^t$. The Moore-Penrose inverse of a matrix $A$ is denoted by $A^-$. Spaces are either in calligraphic style ($\mathcal{X}, \mathcal{Y}, \mathcal{F}, \ldots$) or in blackboard bold style ($\mathbb{R}, \mathbb{N}, \mathbb{Z}, \mathbb{F}, \ldots$). Functions are usually denoted by lower case letters as $f, g, h, \ldots$. The covariance and variance of random variables are denoted by Cov and Var. Their respective empirical counterparts are denoted by cov and var.

The table of contents hopefully reveals the rough structure of this work. Let us remark that Chapters 1, 2 and 8 serve as summaries of well-known concepts and do not contain any fundamentally new ideas. Before starting to read further, it might be beneficial to have a look at Chapter A in the appendix. There, we collect some basic mathematical principles that constantly emerge in this work.
All experiments and simulations are performed in R (R Development Core Team).
1.2 Learning from Data
Let us introduce the general learning problem. We consider two random variables $X$ and $Y$ which define a random variable $Z = X \times Y$ on a product space $\mathcal{X} \times \mathcal{Y}$. We assume that there is a relationship between $X$ and $Y$ in the sense that given $X$, we can predict the outcome of $Y$ with a high accuracy. We do not know the distribution of $Z$, but we observe a finite set
$$S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathcal{X} \times \mathcal{Y}$$
of observations. This set is called the sample. We assume that the observations are drawn independently from $Z = X \times Y$. The classical statistical approach is to identify the process $Z$ that generates the sample. Assuming a class of models that is appropriate to describe $Z$, the model parameters of $Z$ are estimated with the help of $S$. From this, we can infer the conditional distribution of $Y$ given $X$. In order to find an estimate $\hat f$ of a function
$$F: \mathcal{X} \to \mathcal{Y} \qquad (1.1)$$
that predicts an outcome $y$ for a fixed value $x \in \mathcal{X}$ via $\hat f(x)$, we have to evaluate the decision $\hat f(x)$ if the correct outcome is $y$. This is done via a loss function
$$L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}, \qquad (1.2)$$
and the risk at a certain point $z = (x, y)$ is defined as $L(y, \hat f(x))$. The optimal function is the one that minimizes the expected risk
$$R(f) = E_{X \times Y}\left[L(Y, f(X))\right] \qquad (1.3)$$
over all functions $f$. If the distribution of $Z$ is known, the optimal function can often be determined explicitly. For example, if we consider regression problems (that is, $\mathcal{Y} = \mathbb{R}$) and $L$ is the quadratic loss function
$$L(y, y') = (y - y')^2, \qquad (1.4)$$
the optimal function is the conditional expectation $f(x) = E[Y\,|\,X = x]$. The primary task is to find a function that minimizes (1.3) and not to identify the whole process $Z$. As the distribution of $Z$ is not known, it is not possible to minimize (1.3). As a consequence, we have to estimate the optimal function based on the available sample $S$. A popular approach is to fix a class of functions $\mathcal{F}$ and to minimize the empirical risk
$$\hat R(f) = \frac{1}{n}\sum_{i=1}^n L\left(y_i, f(x_i)\right) \qquad (1.5)$$
over all elements $f \in \mathcal{F}$. The quantity $\hat R(f)$ is also called the training error. Sometimes, a regularization term $r(f)$ is added to (1.5). This strategy is called (regularized) empirical risk minimization. A combination of a loss function, a class of functions and a regularization term defines a strategy to estimate (1.1). Depending on the scientific community, a strategy is called a model, an algorithm, a fitting method or a learner. In this work, we use these terms more or less synonymously. A more precise specification of a learning strategy is given in Chapter 2. Some remarks on the term model are however necessary. In the literature, it is used to describe two different aspects of learning from data. On the one hand, a model is a description of how the data is generated. E.g. in the learning task (1.1), we can determine the structure of the function $F$ (linear, polynomial) or assume that $Z$ belongs to a certain class of distributions. On the other hand, a model is a strategy how to estimate the generation of the data. This refers to the description of a learning strategy described above. For every sample $S$ and every learning strategy, we obtain an estimate of the function $F$, which we denote by $\hat f$.
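As a small illustration of empirical risk minimization (added here; the data and the choice of a linear function class are made up for the example), the following R sketch minimizes the empirical risk (1.5) under the quadratic loss (1.4) over the class of linear functions and reports the resulting training error.

```r
## Hypothetical example: empirical risk minimization with quadratic loss.
## The class F is the set of linear functions; lm() minimizes (1.5) over F.
set.seed(1)
n <- 50
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.3)              # sample S = {(x_1, y_1), ..., (x_n, y_n)}

quad_loss <- function(y, yhat) (y - yhat)^2  # quadratic loss function (1.4)

fit   <- lm(y ~ x)                           # empirical risk minimizer over F
f_hat <- fitted(fit)
R_emp <- mean(quad_loss(y, f_hat))           # training error (1.5)
R_emp
```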
1.3 The Regression Model

In this work, we are mainly concerned with multivariate regression problems that have a one-dimensional response. That is, we assume that $\mathcal{Y} = \mathbb{R}$ and $\mathcal{X} = \mathbb{R}^p$. In statistics, regression problems are usually modeled in the following way:
$$Y_i = F(X_i) + \varepsilon_i, \quad i = 1, \ldots, n.$$
The predictor $X$ is a multivariate, $p$-dimensional random variable. From now on, the predictors are assumed to be deterministic and only the response is assumed to be random; the errors have zero mean and equal variance. In compact form, the regression model can be written as
$$Y_i = F(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1.6)$$
with
$$E[\varepsilon_i] = 0 \quad \text{and} \quad \operatorname{Cov}(Y_1, \ldots, Y_n) = \sigma^2 I_n. \qquad (1.7)$$
Here, $I_n$ is the identity matrix of dimension $n$. It follows immediately that
$$E[Y_i] = F(x_i). \qquad (1.8)$$
Let us now consider multivariate linear regression problems. If the function $F$ in (1.6) is assumed to be linear, the regression model can be represented by the multivariate linear regression model
$$Y_i = x_i^t \beta + \varepsilon_i. \qquad (1.9)$$
Given data $S$, the estimation of (1.6) is transformed into the estimation $\hat\beta$ of the regression vector $\beta$. Recall that the number of variables is $p$ and that the number of examples is $n$. We set
$$X = \begin{pmatrix} x_1^t \\ \vdots \\ x_n^t \end{pmatrix} \in \mathbb{R}^{n \times p}, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n.$$
An intercept $\beta_0$ can be included into (1.9) by attaching an additional column that consists of $1$'s. Another possibility is to estimate (1.9) based on centered data. For this reason, we require that both $X$ and $y$ are centered.

The Ordinary Least Squares (OLS) estimator $\hat\beta_{OLS}$ is the solution of the optimization problem
$$\arg\min_{\beta} \sum_{i=1}^n \left(y_i - x_i^t\beta\right)^2 = \arg\min_{\beta} \left\|y - X\beta\right\|^2.$$
Note that this equals the minimization of the empirical risk $\hat R(\beta)$ defined in (1.5) for the quadratic loss function (1.4) and $\mathcal{F}$ equal to the space of all linear functions. If we differentiate the empirical risk with respect to $\beta$, we realize that the solution $\hat\beta_{OLS}$ must fulfill
$$X^t X \hat\beta_{OLS} = X^t y.$$
There is always a solution, as $X^t y$ lies in the space spanned by the columns of $X^t X$. However, the solution is not unique if $X^t X$ does not have full rank. This is for example the case if there are more variables than observations. If there is no unique solution, we define the OLS estimator as the solution with minimal Euclidean norm. It follows from Proposition A.14 that
$$\hat\beta_{OLS} = \left(X^t X\right)^{-} X^t y.$$
We now use the singular value decomposition
$$X = V \Sigma U^t$$
of $X$ that is defined in (A.8). Furthermore, $\Lambda = \Sigma^t\Sigma$ is the matrix of eigenvalues of $X^t X$. Set
$$s = \Sigma V^t y. \qquad (1.10)$$
In this work, we use one of the following representations of the OLS estimator:
$$\hat\beta_{OLS} = \left(X^t X\right)^{-} X^t y = U\Lambda^{-} s = \sum_{i=1}^{\operatorname{rk}(X)} \frac{v_i^t y}{\sqrt{\lambda_i}}\, u_i = \sum_{i=1}^{\operatorname{rk}(X)} z_i, \qquad (1.11)$$
with $z_i = \left(v_i^t y / \sqrt{\lambda_i}\right) u_i$. The OLS estimator can have poor statistical properties if the number of examples is small compared to the number of variables or if $X$ is highly collinear. Both phenomena lead to a covariance matrix $(1/n)X^t X$ that is (almost) singular, which affects the statistical properties of the estimator. This is discussed in great detail in Section 7.1.
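The representation (1.11) can be translated directly into code. The following R sketch (an illustration added here, using simulated data with more variables than observations) computes the minimum-norm OLS estimator from the singular value decomposition of the centered data matrix.

```r
## Minimum-norm OLS estimator via the singular value decomposition, cf. (1.11).
set.seed(1)
n <- 20; p <- 50                              # more variables than observations
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)

X <- scale(X, center = TRUE, scale = FALSE)   # centered predictors
y <- y - mean(y)                              # centered response

sv  <- svd(X)                                 # in the thesis' notation: X = V Sigma U^t
pos <- sv$d > 1e-10                           # keep the nonzero singular values only
## beta_OLS = U Lambda^- s with s = Sigma V^t y; here sv$v plays the role of U
beta_ols <- sv$v[, pos] %*% ((t(sv$u[, pos]) %*% y) / sv$d[pos])
```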
1.4 Duality and Kernel Methods
In this section, we briefly recapitulate the concept of dual representations and the kernel trick. Let us consider the following example. In Section 1.3, we introduced the linear regression model (1.9). For any estimate $\hat\beta$ of $\beta$, we can predict the value of $y$ for a new observation $x_{new}$ via the linear function
$$\hat y_{new} = x_{new}^t \hat\beta = \left\langle x_{new}, \hat\beta \right\rangle. \qquad (1.12)$$
Now suppose that we want to transform the original data $X$ before applying OLS. One reason to do so is to model nonlinear relationships between predictor variables and response. Suppose, e.g., we have $p = 2$ predictor variables and we want to estimate a function (1.1) with $F$ assumed to be a polynomial of degree $\leq 2$. How can we use a method that is designed for linear regression problems (e.g. OLS) to solve this problem? We simply transform the data via a map
$$\Phi(x, x') = \left(1, \sqrt{2}\,x, \sqrt{2}\,x', \sqrt{2}\,xx', x^2, x'^2\right) \qquad (1.13)$$
and apply the linear algorithm to $y$ and $\Phi(X)$. A transformation is also necessary if the observed data are not yet embedded in a Euclidean space. If, for instance, the variables are on a nominal scale, we have to transform the variables into dummy variables and then plug the transformed data into any learning method designed for estimating linear relationships.

The transformation map
$$\Phi: \mathcal{X} \to \mathcal{F} \qquad (1.14)$$
is called the feature map. The spaces $\mathcal{X}$ and $\mathcal{F}$ are called input space and feature space respectively. In order to apply a linear algorithm in $\mathcal{F}$, it is often necessary to assume that $\mathcal{F}$ is a Hilbert space. An important observation is the following. In a lot of cases, the vector $\hat\beta$ that defines the linear function in (1.12) is a linear combination of the data points,
$$\hat\beta = \sum_{i=1}^n \alpha_i\, \Phi(x_i). \qquad (1.15)$$
The coefficients $\alpha_i$ are called dual variables. If we plug (1.15) into $\langle \Phi(x_{new}), \hat\beta\rangle$, we realize that the linear function only depends on inner products between transformed data points,
$$f(x) = \langle \Phi(x), \hat\beta \rangle = \sum_{i=1}^n \alpha_i\, \langle \Phi(x), \Phi(x_i) \rangle.$$
In this case, the estimation of the dual variables can be done by using the $n \times n$ Gram matrix
$$K = \left(\langle \Phi(x_i), \Phi(x_j) \rangle\right)_{i,j=1,\ldots,n}.$$
Note that condition (1.15) holds for the OLS estimate. This follows e.g. from (1.11), as the vectors $u_i$ are a basis of the row space of $X$. (Recall the singular value decomposition (A.8) of $X$.) de Bie et al. (2005) describe various multivariate methods in terms of their primal and dual representation.

As we only need inner products in the dual representation, we do not have to map the data points explicitly to a feature space; it suffices to compute the function
$$k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} \qquad (1.16)$$
with
$$k(x, z) = \langle \Phi(x), \Phi(z) \rangle. \qquad (1.17)$$
The estimated function is
$$f(x) = \sum_{i=1}^n \alpha_i\, k(x, x_i).$$
The function $k$ is called a kernel. Note that in example (1.13), the corresponding kernel is $k(x, z) = \left(1 + \langle x, z\rangle\right)^2$. The replacement of the usual inner product by the inner product in some feature space is called the kernel trick. Note that we do not even require the input space to be an inner product space at all. Literature on the kernel trick and its applications is abundant. A detailed treatise of the subject can be found in Schölkopf & Smola (2002).
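To make the kernel trick concrete, the following R sketch (added as an illustration) checks numerically that the inner product of two points mapped by the feature map (1.13) coincides with the kernel $k(x, z) = (1 + \langle x, z\rangle)^2$, so the Gram matrix $K$ can be computed without ever forming $\Phi(x)$ explicitly.

```r
## Feature map (1.13) for p = 2 predictor variables.
phi <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                     sqrt(2) * x[1] * x[2], x[1]^2, x[2]^2)

## Kernel function (1.16)/(1.17): the inner product in the feature space.
k <- function(x, z) (1 + sum(x * z))^2

x <- c(0.3, -1.2)
z <- c(2.0,  0.5)
sum(phi(x) * phi(z))   # explicit inner product of the mapped points
k(x, z)                # the same value, computed without the feature map
```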
So instead of defining a feature map, we define an admissible kernel function, that is, a function (1.16) which can be defined via a map (1.14) such that (1.17) holds. The choice of the optimal kernel is part of the model selection procedure that is illustrated in Chapter 2. What are the merits of this dual representation? We already mentioned the extension to nonlinear models. Furthermore, from a technical point of view, if $p \gg n$, the computation in the dual representation is usually faster than the computation in the primal representation. Finally, we can extend the whole multivariate machinery to spaces $\mathcal{X}$ of infinite dimension or with a complex structure by defining an appropriate inner product. An important example is the analysis of functional data, that is, $\mathcal{X}$ is a space of functions over some domain. This subject will be treated in more detail in Chapter 8.

Model Selection
We now recapitulate the main tools to evaluate the performance of a learning method. The contents of this chapter are a summary of the corresponding chapter in Hastie et al. (2001). As described in Chapter 1, we estimate the relationship (1.1) by applying an appropriate fitting method. Normally, we do not fit a single model but a group of models and have to choose the best model. This is usually called model selection. We therefore need a strategy how to select the best model out of a pool of models. After the best model is chosen, we have to evaluate its quality. This is called model validation. Recall that we evaluate a model in terms of its expected risk (1.3). As this quantity is usually unknown, we need a good estimate. In what follows, we focus on model selection and remark that the risk of the selected model should be estimated on a test set that was neither involved in the fitting process nor in the selection process.

In the rest of the chapter, we consider the general regression model (1.6). Given data, we fit a model and call the fitted function $\hat f$. In order to evaluate the quality of the fitting method, we start by computing the expected risk of $\hat f$ at a point $x_i$,
$$R\!\left(\hat f(x_i)\right) = E_{Y_{new}}\left[L\!\left(Y_{new}, \hat f(x_i)\right)\right]. \qquad (2.1)$$
Here, $Y_{new}$ is a new observation at point $x_i$. Note that the quantity $R(\hat f(x_i))$ depends on the sample that was used to fit $\hat f$. For the quadratic loss function (1.4), the expected risk of $\hat f$ at $x_i$ equals
$$\begin{aligned}
R\!\left(\hat f(x_i)\right) &= E_{Y_{new}}\left[\left(Y_{new} - \hat f(x_i)\right)^2\right] \\
&= E_{Y_{new}}\left[\left(Y_{new} - E[Y_{new}] + E[Y_{new}] - \hat f(x_i)\right)^2\right] \\
&\stackrel{(1.8)}{=} E_{Y_{new}}\left[\left(Y_{new} - E[Y_{new}]\right)^2\right] + \left(F(x_i) - \hat f(x_i)\right)^2 \\
&= \sigma^2 + \left(F(x_i) - \hat f(x_i)\right)^2. \end{aligned} \qquad (2.2)$$
The term $\sigma^2$ is called the irreducible error. The second term depends on the data that we used to fit the model. If we are interested in the quality of our learning strategy that is used to obtain the estimate $\hat f$, we compute the expected value of $R(\hat f(x_i))$ with respect to the data $Y^{(n)} = (Y_1, \ldots, Y_n)$. For the quadratic loss, we obtain
$$E_{Y^{(n)}} E_{Y_{new}}\left[\left(Y_{new} - \hat f(x_i)\right)^2\right] = \sigma^2 + \operatorname{bias}^2\!\left(\hat f(x_i)\right) + \operatorname{Var}\!\left(\hat f(x_i)\right).$$
We expect the bias to decrease for more complex models and the variance to increase. If we choose a very simple model with a low variance, we might fail to capture the relevant structure of the data. If we fit a very complex model that is almost unbiased, we have a good explanation of the available sample but will probably fail to predict on new observations. The latter phenomenon is called overfitting. In Section 7.1, we study this bias-variance trade-off in more detail for linear shrinkage estimators.
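The bias-variance trade-off can be made visible with a small simulation. The following R sketch (added here; the target function, noise level and polynomial learner are arbitrary choices for illustration) repeatedly draws new responses from the regression model (1.6), refits the learner and estimates squared bias and variance of $\hat f$ at the design points.

```r
## Monte Carlo illustration of the bias-variance decomposition.
set.seed(1)
n <- 30; sigma <- 0.5
x <- seq(0, 1, length.out = n)
F_true <- sin(2 * pi * x)                  # the (in practice unknown) function F

fit_once <- function() {
  y <- F_true + rnorm(n, sd = sigma)       # a new response vector Y^(n)
  fitted(lm(y ~ poly(x, 3)))               # a rather simple polynomial learner
}
fits <- replicate(2000, fit_once())        # rows: design points, columns: replications

bias2    <- (rowMeans(fits) - F_true)^2    # squared bias at each design point
variance <- apply(fits, 1, var)            # variance at each design point
c(mean(bias2), mean(variance))             # averaged over the design points
```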
Let us return to the essential question of this chapter. How do we select a model? Let us assume that we have a lot of data at hand. In this case, we can proceed in the following way. We split the data set into two parts: a training set and a validation set. We fit the models on the training set. We then compare their performance on the validation set. Note however that in most situations, the amount of available data is limited and we cannot afford to exclude a fraction of the data from model estimation. We therefore need different strategies to estimate the risk of a model. Roughly, we distinguish two different approaches. In the first approach, we repeat a random splitting into training and test set several times. This is called cross-validation and is discussed in Section 2.1. In the second approach, the risk of a model is estimated in terms of its empirical risk and its complexity. More precisely, the complexity can be expressed in terms of degrees of freedom.

2.1 Cross Validation

The cross-validation technique (Stone 1974, Stone 1977) produces estimates of the risk $R(\hat f)$ of a model $\hat f$. Recall that in order to select the best model, it is suggested to split the data into a training set and a validation set. The model is fitted on the training set and its performance is estimated on the validation set. As in a lot of cases we do not have enough data at hand, a more refined strategy is pursued. We randomly split the data into $K$ parts of roughly the same size. For $k = 1, \ldots, K$, we remove the $k$th block from the data and fit the model to the remaining $K-1$ parts. The $k$th block is used as a test set. That is, for each block $k$, we obtain an estimate of the risk of the model that was fitted on the other $K-1$ blocks. Finally, we average these $K$ estimates. More formally, the function
$$\kappa: \{1, \ldots, n\} \to \{1, \ldots, K\}$$
assigns to the $i$th example its block membership. We denote by $\hat f^{-k}$ the function that was fitted on all but the $k$th block.

Definition 2.1. The $K$-fold cross-validation error is
$$CV\!\left(\hat f\right) = \frac{1}{n}\sum_{i=1}^n L\!\left(y_i, \hat f^{-\kappa(i)}(x_i)\right). \qquad (2.3)$$
For $K = n$, this is called the leave-one-out error.

Let us note that the computational costs of $K$-fold cross-validation can become very high if $K$ is large. In this case, we have to fit the models several times, which can be very time-consuming.
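Definition 2.1 translates into a few lines of code. The following R sketch (an added illustration; data, learner and $K$ are made-up choices) computes the $K$-fold cross-validation error (2.3) for a polynomial regression model under the quadratic loss.

```r
## K-fold cross-validation error (2.3), quadratic loss.
set.seed(1)
n <- 100; K <- 5
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

kappa   <- sample(rep(1:K, length.out = n))  # block membership kappa(i) of each example
cv_loss <- numeric(n)
for (k in 1:K) {
  test  <- which(kappa == k)
  fit_k <- lm(y ~ poly(x, 3), subset = -test)            # fit on the other K-1 blocks
  pred  <- predict(fit_k, newdata = data.frame(x = x[test]))
  cv_loss[test] <- (y[test] - pred)^2                    # loss of f^{-kappa(i)} at x_i
}
mean(cv_loss)                                # K-fold cross-validation error CV(f)
```

2.2 Degrees of Freedom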
As already mentioned above, the quality of a model or a function $\hat f$ is measured in terms of its expected risk (1.3). As this risk cannot be computed, we need a good estimate of (1.3). We expect the empirical risk to be lower than the true risk, as we use the same data set to fit the model $\hat f$ and to assess its performance. If we use the training data for model assessment, this leads to over-optimistic estimates of the risk. The gap between empirical and test error is usually particularly large for very complex models. In order to get a good estimate of the expected risk, we have to measure the gap between empirical error and the expected risk.

Recall the general regression model (1.6). Note that we defined the expected risk of $\hat f$ at a data point $x_i$ in (2.1). The estimated function $\hat f$ depends on the sample $S$, that is, it depends on $Y^{(n)} = (Y_1, \ldots, Y_n)$. If we average over all points $x_1, \ldots, x_n$ and compute the expectation with respect to $Y^{(n)}$, we obtain the expected in-sample risk of our strategy,
$$R_{in} = R_{in}(x_1, \ldots, x_n) = E_{Y^{(n)}}\left[\frac{1}{n}\sum_{i=1}^n R\!\left(\hat f(x_i)\right)\right].$$
The difference between $R_{in}$ and the expected empirical risk is called the optimism:
$$\operatorname{op} = R_{in} - E_{Y^{(n)}}\left[\hat R(\hat f)\right] = E_{Y^{(n)}}\left[\frac{1}{n}\sum_{i=1}^n \left\{R\!\left(\hat f(x_i)\right) - \hat R\!\left(\hat f(x_i)\right)\right\}\right].$$
The key point is to find a good estimate $\widehat{\operatorname{op}}$ of $\operatorname{op}$. We can then estimate the in-sample risk of a model in the following way:
$$\hat R_{in} = \hat R + \widehat{\operatorname{op}}. \qquad (2.4)$$

Proposition 2.2. For the quadratic loss function (1.4), the optimism of a fitting method is
$$\operatorname{op} = \frac{2}{n}\sum_{i=1}^n \operatorname{Cov}\!\left(\hat f(x_i), Y_i\right). \qquad (2.5)$$

Proof. It follows from (2.2) that
$$R_{in} = \sigma^2 + \frac{1}{n}\sum_{i=1}^n E_{Y^{(n)}}\left[\left(F(x_i) - \hat f(x_i)\right)^2\right].$$
Next, we have
$$\begin{aligned}
\hat R(\hat f) &= \frac{1}{n}\sum_{i=1}^n \left(Y_i - \hat f(x_i)\right)^2 = \frac{1}{n}\sum_{i=1}^n \left(Y_i - F(x_i) + F(x_i) - \hat f(x_i)\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^n \left(Y_i - F(x_i)\right)^2 + \frac{2}{n}\sum_{i=1}^n \left(Y_i - F(x_i)\right)\left(F(x_i) - \hat f(x_i)\right) + \frac{1}{n}\sum_{i=1}^n \left(F(x_i) - \hat f(x_i)\right)^2.
\end{aligned}$$
It follows that
$$E_{Y^{(n)}}\left[\hat R(\hat f)\right] = \sigma^2 - \frac{2}{n}\sum_{i=1}^n \operatorname{Cov}\!\left(Y_i, \hat f(x_i)\right) + \frac{1}{n}\sum_{i=1}^n E_{Y^{(n)}}\left[\left(F(x_i) - \hat f(x_i)\right)^2\right].$$
This concludes the proof.

Before proceeding, it is beneficial to introduce a compact representation of a fitting method. If we denote by $\hat f$ the fitted function that is obtained by using the sample $S$, we define the following map:
$$H: \mathbb{R}^n \to \mathbb{R}^n, \qquad H(y) = \left(\hat f(x_1), \ldots, \hat f(x_n)\right)^t = \hat y. \qquad (2.6)$$
Note that the function $H$ depends on $x_1, \ldots, x_n$. If this function is linear in $y$, we speak of a linear fitting method or a linear learner. In this case, $H$ can be represented by an $n \times n$ matrix $H$ that is called the hat matrix.

Definition 2.3 (Degrees of Freedom). The degrees of freedom of a fitting method that is represented by $H$ are defined as
$$df(H) = \frac{1}{\sigma^2}\sum_{i=1}^n \operatorname{Cov}\!\left(\hat f(x_i), Y_i\right).$$
In particular,
$$\operatorname{op} = \frac{2}{n}\,\sigma^2\, df(H).$$

For the rest of this section, we assume that the error variables $\varepsilon_i$ in (1.6) are normally distributed. The next useful lemma is due to Stein (1981).

Lemma 2.4 (Stein's Lemma). Assume that $X \sim N(\mu, \sigma^2)$ is a univariate random variable with density function $\phi$ and that $g: \mathbb{R} \to \mathbb{R}$ is a differentiable function such that
$$\lim_{x \to \pm\infty} g(x)\phi(x) = 0. \qquad (2.7)$$
We have
$$\operatorname{Cov}\!\left(g(X), X\right) = \sigma^2 E\left[g'(X)\right].$$

We can easily extend Stein's lemma to multivariate random variables.

Lemma 2.5 (Multivariate Stein's Lemma). Assume that $X = (X_1, \ldots, X_n) \sim N\!\left(\mu, \sigma^2 I_n\right)$ is a multivariate random variable with density function $\phi(x) = \prod_{i=1}^n \phi_i(x_i)$. Let $g = (g_1, \ldots, g_n): \mathbb{R}^n \to \mathbb{R}^n$ be a differentiable function which fulfills
$$\lim_{x \to \pm\infty} g_i(x)\phi_i(x) = 0. \qquad (2.8)$$
We have
$$\sum_{i=1}^n \operatorname{Cov}\!\left(g_i(X), X_i\right) = \sigma^2 E\left[\operatorname{trace}\!\left(\frac{\partial}{\partial X} g(X)\right)\right].$$

Proof. We fix $i \in \{1, \ldots, n\}$ and set $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. We have
$$\begin{aligned}
\operatorname{Cov}\!\left(g_i(X), X_i\right) &= E_X\!\left[g_i(X)\left(X_i - E[X_i]\right)\right] \\
&= E_{X_{-i}}\!\left[E_{X_i | X_{-i}}\!\left[g_i(X)\left(X_i - E[X_i]\right)\right]\right] \\
&= E_{X_{-i}}\!\left[E_{X_i}\!\left[g_i(X)\left(X_i - E[X_i]\right)\right]\right].
\end{aligned}$$
The last equality is valid as we assume that the variables $X_i$ are independent. In the expression $E_{X_i}\!\left[g_i(X)\left(X_i - E[X_i]\right)\right]$ in the last line, $g_i(X)$ only varies in $X_i$ and the other components are considered to be constant. We can now apply Stein's lemma 2.4 to $g_i(X_i)$ and $X_i$ and obtain
$$\operatorname{Cov}\!\left(g_i(X), X_i\right) = \sigma^2 E_{X_{-i}}\!\left[E_{X_i}\!\left[\frac{\partial}{\partial X_i} g_i(X)\right]\right] = \sigma^2 E_X\!\left[\frac{\partial}{\partial X_i} g_i(X)\right].$$
This proves the lemma.

Corollary 2.6. Assume that the function $H$ defined in (2.6) is differentiable. If $Y_i$ is normally distributed, $Y_i \sim N\!\left(F(x_i), \sigma^2\right)$, and $H$ fulfills assumption (2.8), we have
$$\sum_{i=1}^n \operatorname{Cov}\!\left(\hat Y_i, Y_i\right) = \sigma^2 E\left[\operatorname{trace}\!\left(\frac{\partial}{\partial Y^{(n)}} H\!\left(Y^{(n)}\right)\right)\right].$$
In particular,
$$df(H) = E\left[\operatorname{trace}\!\left(\frac{\partial H\!\left(Y^{(n)}\right)}{\partial Y^{(n)}}\right)\right].$$
In this case,
$$\widehat{df}(H) = \operatorname{trace}\!\left(\frac{\partial H\!\left(Y^{(n)}\right)}{\partial Y^{(n)}}\right) \qquad (2.9)$$
is an unbiased estimate for the degrees of freedom of $H$. If the learner is linear in $y$, i.e. $\hat y = H y$ with $H \in \mathbb{R}^{n \times n}$, we obtain
$$df(H) = \operatorname{trace}(H).$$
As an example, consider the OLS estimator of Section 1.3. The function $H$ corresponding to this estimator is
$$\hat y = X\left(X^t X\right)^{-} X^t y = P_X\, y.$$
The trace of the projection operator equals the dimension of the space spanned by the columns of $X$. We obtain the well-known result
$$df(\mathrm{OLS}) = \operatorname{rank}(X).$$
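For linear learners, the degrees of freedom are simply the trace of the hat matrix. The following R sketch (added here with simulated data; the ridge-type estimator is used only as a second example of a linear learner and is not part of the original text) computes $df(H) = \operatorname{trace}(H)$ for OLS and for a shrunken variant.

```r
## Degrees of freedom of linear fitting methods: df(H) = trace(H).
set.seed(1)
n <- 40; p <- 5
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)

H_ols <- X %*% solve(t(X) %*% X, t(X))     # projection P_X onto the columns of X
sum(diag(H_ols))                           # = rank(X) = 5 degrees of freedom

lambda  <- 10                              # a ridge-type linear shrinkage estimator
H_ridge <- X %*% solve(t(X) %*% X + lambda * diag(p), t(X))
sum(diag(H_ridge))                         # shrinkage: fewer than rank(X) degrees of freedom
```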
2.3 Information Criteria
We now return to the estimation of the risk of the model. Information criteria are based on the idea that the quality of a model depends on its training error and on its complexity. Different criteria differ in how strongly they penalize the complexity of the model. We already remarked in (2.4) that we can estimate the in-sample risk in terms of the empirical risk and the model's complexity. The Akaike Information Criterion (AIC) (Akaike 1973) is a generalization of (2.4). It is based on a general, asymptotic relationship between the Kullback-Leibler Information Criterion and maximum likelihood theory. We do not want to go too much into detail and refer e.g. to Burnham & Anderson (2004). In the case of normally distributed error terms $\varepsilon_i$, the AIC criterion is equivalent to (2.4),
$$AIC(\hat f) = \hat R(\hat f) + \frac{2}{n}\, df(H)\,\sigma^2.$$
The quantity $\sigma^2$ can be estimated via
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat f(x_i)\right)^2.$$
We choose the model that minimizes the AIC criterion,
$$\hat f_{AIC} = \arg\min_{\hat f} AIC(\hat f).$$
As the general AIC criterion only holds asymptotically for large values of $n$, there is a corrected version of the AIC criterion for small sample sizes (Hurvich & Tsai 1989). A further criterion that is motivated by the principle of minimum description length (Hansen & Yu 2001) and that is used in Chapter 9 is
$$\mathrm{gMDL}\!\left(\hat f\right) = \log\!\left(\frac{n}{n - df(H)}\,\hat R(\hat f)\right) + \frac{df(H)}{n}\left[\log\!\left(\sum_{i=1}^n y_i^2 - n\,\hat R(\hat f)\right) - \log\!\left(df(H)\,\frac{n}{n - df(H)}\,\hat R(\hat f)\right)\right].$$
The last two criteria penalize the complexity of a model more strongly than the AIC criterion.
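As a small illustration of model selection with an information criterion (added here; the data, the polynomial candidate models and the plug-in estimate of $\sigma^2$ are choices made for the example), the following R sketch evaluates the AIC criterion of this section for polynomial fits of increasing degree.

```r
## AIC-based model selection for a linear learner, cf. Section 2.3.
set.seed(1)
n <- 60
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

aic_poly <- function(d) {
  fit    <- lm(y ~ poly(x, d))
  R_emp  <- mean(residuals(fit)^2)         # empirical risk (training error)
  df     <- d + 1                          # trace of the hat matrix of this fit
  sigma2 <- R_emp                          # plug-in estimate of sigma^2 as in the text
  R_emp + (2 / n) * df * sigma2            # AIC(f) = R(f) + (2/n) df(H) sigma^2
}
sapply(1:8, aic_poly)                      # choose the degree with minimal AIC
```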
Partial Least Squares Path Models
Partial Least Squares (PLS) path models (Wold 1982, Wold 1985) are a framework for modeling linear relationships between several blocks of variables. In this sense, they can be seen as a generalization and an extension of Canonical Correlation Analysis (Hotelling 1936) to more than two blocks of variables. The relationship between different blocks is modeled in the following way. For each block of variables, we look for a latent variable, that is, a linear combination of these variables, such that latent variables that are assumed to be linked are highly correlated. These latent variables are estimated with algorithms that have a power method flavor.

The statistical and mathematical theory of PLS path models has not been fully established. In fact, PLS is defined via algorithms as in Wold (1982) and Lohmöller (1989) and not via a statistical model or an empirical optimization problem. Some fundamental questions have not been answered. For example, it is not guaranteed that the PLS algorithms converge numerically (although convergence is always observed in practice). More severely, for a wide class of algorithms (those with at least one block in mode A), it is not known if the latent variables computed by PLS are at least a stationary point of a sensible optimization problem. We show that this is not the case if we require that the objective function of the optimization problem is at least twice differentiable. For a different class of algorithms (those with all blocks in mode B), it is known (Mathes 1993) that the solution of the PLS algorithms is a stationary point of a sensible optimization problem. It is however not known if we always obtain the optimal solution. We provide a counterexample based on simulated data.
3.1 The Partial Least Squares Framework
In the PLS path model framework, we model linear relationships between different blocks of variables. One variable is a vector of length $n$, as we observe $n$ examples. We have $K$ blocks of variables, and each block consists of $p_k$ variables, which are subsumed in a matrix
$$X_k \in \mathbb{R}^{n \times p_k}, \quad k = 1, \ldots, K.$$
In total we have
$$p = \sum_{k=1}^K p_k$$
variables. The blocks of variables are called manifest variables in the PLS literature. The relationship between blocks is represented by a so-called inner model. Arrows between different blocks of variables $X_k$ indicate in which way they are linked (see Figure 3.1). For each inner model, we can define an undirected link matrix $C \in \{0,1\}^{K \times K}$ via
$$c_{kl} = \begin{cases} 1, & X_l \to X_k \ \text{or}\ X_k \to X_l \\ 0, & \text{otherwise} \end{cases}$$
(and $c_{kk} = 0$).

Figure 3.1: Illustration of a PLS path model with $K = 4$ blocks.

Furthermore, we assume that for each block $X_k$, there is a single latent (or hidden) variable $z_k \in \mathbb{R}^n$ that represents this block. This is called the outer model and is illustrated in Figure 3.2.

Figure 3.2: Illustration of the outer PLS model. Each block of manifest variables is replaced by one latent variable.

We distinguish two types of relationships between latent and manifest variables. The first one is the formative model, the second one is the reflective model (see Figure 3.3). In the formative model, we assume that the block $X_k$ of manifest variables forms the latent variable $z_k$. In terms of a regression model, this can be expressed as
$$z_k = X_k \beta + \varepsilon. \qquad (3.1)$$
In the reflective model, we assume that the manifest variables are a reflection of the latent variable. The underlying regression model is
$$X_k = z_k \beta^t + E. \qquad (3.2)$$

Figure 3.3: The difference between the two outer models. Left: the formative model; the manifest variables $X_1$ form the latent variable $z_1$. Right: the reflective model; the manifest variables $X_2$ are a reflection of the latent variable $z_2$.

Given data, we want to estimate (1) the latent variables, (2) the relationship in the inner model, and (3) the relations in the outer model. In order to estimate $z_k$, we need to define sensible optimality criteria. Ideally, these criteria have three features: Firstly, we want to find estimates $z_k$ such that $z_k$ and $z_l$ are close if their corresponding blocks are linked. Secondly, we want to take into account the directions of the arrows in the inner model. Thirdly, we want to take into account the directions of the arrows in the outer model.

The tough part in the process is the estimation of the latent variables. Once they are estimated, the relationships that are indicated by an arrow can be derived by using a least-squares estimate.

In the literature on PLS, there is often a huge gap between the abstract model (in terms of the inner and the outer model) and what is actually computed by the PLS path algorithms. Normally, the PLS algorithms are presented directly in connection with the PLS framework, insinuating that the algorithms produce optimal solutions of an obvious estimation problem attached to PLS. This estimation problem is however never defined. Furthermore, the directions of the arrows in the outer model are hoped to be taken care of by employing different modes: mode A for reflective blocks and mode B for formative blocks. While the terms "reflective blocks" and "formative blocks" refer to the description of the outer model as illustrated in Figure 3.3, mode A and B correspond to algorithms. It is a priori not clear how the abstract models and the PLS path algorithms are connected. Before deriving algorithms, it is therefore indispensable to set up a general optimization strategy that the algorithms try to solve. For this reason, in Section 3.2, we start the investigation by presenting different optimization criteria in order to define the latent variables $z_k$. Afterwards, we present two algorithms in Section 3.3 that try to compute the optimal solution. Only in Section 3.4 do we introduce the two PLS algorithms, Lohmöller in mode B and Wold in mode B, and show that they are equal to the algorithms in Section 3.3. We repeat that we always have to keep in mind the difference between what PLS wants to model and what it effectively models.

We now try to give a general overview on different types of PLS path algorithms. All terms that are now given will be defined in subsequent sections. In the PLS literature, there are two generic algorithms, the Lohmöller procedure and the Wold procedure. Roughly, there are the following measures of closeness between latent variables: Horst, factorial and centroid. These measures are usually called schemes. They have in common that they do not (!) consider the directions of the arrows in the inner model. A variant of PLS that does fulfill this condition is the path weighting scheme (which is not considered in this work). We recall that the directions in the outer model are hoped to be taken care of by employing different modes: mode A for reflective blocks and mode B for formative blocks.

Let us conclude this section with some additional definitions. In order to simplify notation, all variables are assumed to have zero mean. The empirical covariance matrix between blocks of variables is denoted by
$$S_{kl} = \frac{1}{n} X_k^t X_l \in \mathbb{R}^{p_k \times p_l}. \qquad (3.3)$$
We frequently work with vectors and matrices that have a block structure that is induced by $(p_1, \ldots, p_K)$. The covariance matrix $S \in \mathbb{R}^{p \times p}$ of $X$ can be partitioned into blocks of the form
$$S = \begin{pmatrix} S_{11} & S_{12} & \ldots & S_{1K} \\ S_{21} & S_{22} & \ldots & S_{2K} \\ \vdots & & \ddots & \vdots \\ S_{K1} & \ldots & & S_{KK} \end{pmatrix} \in \mathbb{R}^{p \times p}$$
with $S_{kl}$ defined in (3.3). We denote this block-wise structure by $[\,\cdot\,]$:
$$S = [S_{kl}] \in \mathbb{R}^{p \times p}.$$
Furthermore, we need matrices that only have entries in their diagonal blocks. We denote them by
$$S_D = \operatorname{diag}\left[S_{11}, \ldots, S_{KK}\right] = \begin{pmatrix} S_{11} & 0 & \ldots & 0 \\ 0 & S_{22} & \ldots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & S_{KK} \end{pmatrix} \in \mathbb{R}^{p \times p}.$$
The subscript $D$ indicates that we only take the diagonal blocks of the matrix $S$. Any vector $w \in \mathbb{R}^p$ can be partitioned into
$$w^t = \left(w_1^t, \ldots, w_K^t\right), \quad w_k \in \mathbb{R}^{p_k}. \qquad (3.4)$$
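The block-wise quantities of this section map directly onto code. The following R sketch (an added illustration with three randomly generated blocks) builds the empirical covariance matrices $S_{kl}$ of (3.3), the partitioned matrix $S = [S_{kl}]$ and its block-diagonal part $S_D$.

```r
## Block covariance matrices S_kl, S = [S_kl] and S_D, cf. (3.3).
set.seed(1)
n  <- 50
pk <- c(3, 2, 4)                                  # block sizes p_1, ..., p_K
Xk <- lapply(pk, function(p) scale(matrix(rnorm(n * p), n, p), scale = FALSE))

X <- do.call(cbind, Xk)                           # all manifest variables, centered
S <- crossprod(X) / n                             # S = [S_kl] = (1/n) X^t X

S_D <- matrix(0, sum(pk), sum(pk))                # block-diagonal part of S
idx <- split(seq_len(sum(pk)), rep(seq_along(pk), pk))
for (b in idx) S_D[b, b] <- S[b, b]
```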
3.2 Optimization Strategies

In this section, we abandon the PLS framework. Instead, we present different optimization criteria in order to define latent variables $z_k$. The general idea is to extend and generalize Canonical Correlation Analysis (CCA) (Hotelling 1936) to more than two blocks of variables. We try to find latent vectors $z_k = X_k w_k$ such that $z_k$ and $z_l$ are maximally correlated if the corresponding blocks are linked. In this sense, it is an extension of CCA. For two blocks of variables $X_1$ and $X_2$, CCA computes
$$\arg\max_{w_1, w_2} \operatorname{cor}\!\left(X_1 w_1, X_2 w_2\right).$$
We can scale the weights $w_1, w_2$ without changing the optimization problem, and we obtain the equivalent optimization problem
$$\arg\max_{w_1, w_2} \operatorname{cov}\!\left(X_1 w_1, X_2 w_2\right), \quad \text{subject to } \frac{1}{n}\left\|X_i w_i\right\|^2 = 1, \ i = 1, 2.$$

Optimization Problem 3.1. For $K$ blocks of variables, we define the following, general optimization problem:
$$\arg\max_{w} \sum_{k,l:\, c_{kl} \neq 0} g\!\left(\operatorname{cov}(X_k w_k, X_l w_l)\right), \quad \text{subject to } \frac{1}{n}\left\|X_k w_k\right\|^2 = 1.$$
Here, $g$ is one of three functions
$$g(x) = \begin{cases} x, & \text{Horst} \\ x^2, & \text{factorial} \\ |x|, & \text{centroid.} \end{cases}$$

The terms Horst, factorial and centroid are called schemes in the PLS literature. We call the first scheme the Horst scheme as it is equivalent to a generalization of CCA to more than two blocks that is described in Horst (1961) and Horst (1965). The terms factorial and centroid stem from the PLS literature. We remark that the Horst scheme is not used in the PLS community, although it has been suggested as an alternative to the two other schemes by Hanafi & Qannari (2005).

Let us define the real-valued function
$$f_g(w) = \sum_{k,l:\, c_{kl} \neq 0} g\!\left(\operatorname{cov}(X_k w_k, X_l w_l)\right) = \sum_{k,l=1}^K c_{kl}\, g\!\left(w_k^t S_{kl} w_l\right),$$
with $w$ defined in (3.4). The Lagrangian function associated to problem 3.1 is
$$L(w, \lambda) = f_g(w) - \frac{1}{2}\sum_{k=1}^K \lambda_k\left(w_k^t S_{kk} w_k - 1\right) \qquad (3.5)$$
with $\lambda = (\lambda_1, \ldots, \lambda_K) \in \mathbb{R}^K$ the Lagrangian multipliers. The factor $-1/2$ is added in order to avoid a rescaling of the multipliers $\lambda_i$. We set
$$\theta_{kl} = \theta_{kl}(w) = \operatorname{cov}\!\left(X_k w_k, X_l w_l\right) = w_k^t S_{kl} w_l.$$
Differentiating the Lagrangian function (3.5), we obtain
$$\frac{\partial L}{\partial w_k} = \sum_{l=1}^K c_{kl}\, g'(\theta_{kl})\, S_{kl} w_l - \lambda_k S_{kk} w_k, \qquad (3.7)$$
$$\frac{\partial L}{\partial \lambda_k} = w_k^t S_{kk} w_k - 1.$$
We set
$$S_g(w) = \left[c_{kl}\, g'(\theta_{kl})\, S_{kl}\right], \qquad (3.8)$$
which implies that
$$\frac{\partial f_g}{\partial w}(w) = S_g(w)\, w.$$
These equations can be represented in a compact form.

Proposition 3.2. The Lagrangian equations (A.6) and (A.7) of the optimization problem 3.1 are
$$S_g(w)\, w = \Lambda S_D w, \qquad w_k^t S_{kk} w_k = 1.$$
The matrix $\Lambda$ is a diagonal matrix that is of the form
$$\Lambda = \operatorname{diag}\left[\lambda_1 I_{p_1}, \ldots, \lambda_K I_{p_K}\right] \in \mathbb{R}^{p \times p}.$$
We might think of $\lambda = (\lambda_1, \ldots, \lambda_K)$ as a vector of multivariate eigenvalues. Note that in the Horst scheme, the matrix $S_g(w)$ does not depend on $w$ and in this case, the problem is called a multivariate eigenvalue problem (Chu & Watterson 1993).

Remark 3.3. Of course, in the centroid scheme, the function $f_g(w)$ is not differentiable on $\mathbb{R}^p$. We therefore have to restrict $f_g$ onto the open subset
$$M = \left\{w \in \mathbb{R}^p \,\middle|\, w_k^t S_{kl} w_l \neq 0\right\}.$$
Note that we can decompose $M$ into finitely many disjoint open subsets
$$M_I = \left\{w \in \mathbb{R}^p \,\middle|\, \operatorname{sign}\!\left(w_k^t S_{kl} w_l\right) = I_{kl}\right\}.$$
Here $I \in \{\pm 1\}^{K \times K}$ is a symmetric matrix with diagonal elements equal to $1$. Whenever we speak of a derivative of $f_g$, we implicitly assume that $f_g$ is restricted to one of these subsets. Note that on any of the subsets $M_I$, the matrix $S_g(w)$ that is defined in (3.8) does not depend on $w$.

Any solution of the equations in Proposition 3.2 is by definition a stationary point of the optimization problem 3.1. In general, there might be more than one stationary point.

Lemma 3.4. The stationary point $w$ that is a solution of 3.1 is the one such that the sum of the corresponding multivariate eigenvalues is maximal.

Proof. We first note that
$$g'(x)\,x = \tilde c\, g(x), \qquad \tilde c = \begin{cases} 1, & \text{Horst, centroid} \\ 2, & \text{factorial.} \end{cases}$$
It follows that for all $w$ that are stationary points,
$$\sum_{k,l} c_{kl}\, g\!\left(w_k^t S_{kl} w_l\right) = \tilde c^{\,-1} \sum_{k,l} c_{kl}\, g'\!\left(w_k^t S_{kl} w_l\right) w_k^t S_{kl} w_l = \tilde c^{\,-1}\, w^t S_g(w)\, w = \tilde c^{\,-1} \sum_k \lambda_k\, w_k^t S_{kk} w_k = \tilde c^{\,-1} \sum_k \lambda_k.$$

If we want to maximize the covariance between latent variables instead of the correlation, we have to change the constraints in 3.1. We obtain
$$\arg\max_{w} \sum_{k,l:\, c_{kl} \neq 0} g\!\left(\operatorname{cov}(X_k w_k, X_l w_l)\right), \quad \text{subject to } \frac{1}{n}\left\|w_k\right\|^2 = 1, \qquad (3.10)$$
and the corresponding Lagrangian equations are
$$S_g(w)\, w = \Lambda w, \qquad \frac{1}{n}\, w_k^t w_k = 1.$$
We can always transform optimization problem 3.1 into (3.10): Denote by $\sqrt{S_{kk}}$ the root of the positive-semidefinite matrix $S_{kk}$. We set
$$\tilde w_k = \sqrt{S_{kk}}\, w_k, \qquad (3.11)$$
$$\sqrt{S_D} = \operatorname{diag}\left[\sqrt{S_{11}}, \ldots, \sqrt{S_{KK}}\right], \qquad (3.12)$$
$$\tilde S = \sqrt{S_D}^{\,-}\, S\, \sqrt{S_D}^{\,-}. \qquad (3.13)$$
It follows readily from the singular value decomposition of $X_k$ and $X_l$ that
$$S_{kl} = \sqrt{S_{kk}}\, \sqrt{S_{kk}}^{\,-}\, S_{kl}\, \sqrt{S_{ll}}^{\,-}\, \sqrt{S_{ll}}.$$
We conclude that optimization problem 3.1 is equivalent to
$$\arg\max_{\tilde w} \sum_{k,l:\, c_{kl} \neq 0} g\!\left(\tilde w_k^t\, \tilde S_{kl}\, \tilde w_l\right), \quad \text{subject to } \frac{1}{n}\left\|\tilde w_k\right\|^2 = 1.$$
3.3 Multivariate Power Methods
If we want to find the optimal solution of the optimization problem 3.1, we can proceed stepwise. First, we compute a solution of the associated Lagrangian equations. As this is a stationary point, we then have to check if this is the optimal one. One possibility to solve eigenproblems as in Proposition 3.2 is to apply a multivariate version of the power method defined in Algorithm A.9.

Algorithm 3.5 (Multivariate Power Method). After the initialization of weight vectors $w = (w_1, \ldots, w_K)$ such that $w_k^t S_{kk} w_k = 1$, we iteratively compute
$$\tilde w^{(i+1)} = S_D^{\,-}\, S_g\!\left(w^{(i)}\right) w^{(i)} \qquad \text{(iteration)}$$
$$w_k^{(i+1)} = \tilde w_k^{(i+1)} \Big/ \sqrt{\left(\tilde w_k^{(i+1)}\right)^{t} S_{kk}\, \tilde w_k^{(i+1)}}, \quad k = 1, \ldots, K \qquad \text{(normalization)}$$

If the multivariate power method converges to a vector $w$, this is obviously a solution of the Lagrangian equations 3.2. Note that in Algorithm 3.5, all weight vectors $w_k$ are updated simultaneously. There is a variation of the power method where in each round, only one weight vector is updated. We call this algorithm a multivariate Gauss-Seidel algorithm. In order to have a compact representation, we define
$$\theta_{kl}(w, v) = w_k^t S_{kl} v_l.$$
We set
$$S_g(w, v) = \left[c_{kl}\, g'\!\left(\theta_{kl}(w, v)\right) S_{kl}\right].$$
Note that $\theta_{kl}(w) = \theta_{kl}(w, w)$ and $S_g(w) = S_g(w, w)$. Now, let us decompose
$$S_g(w, v) = U_g^t(w, v) + U_g(w, v)$$
with $U_g$ the strictly upper triangular part of $S_g$. (Recall that the block diagonal of $S_g(w, v)$ is zero, as $c_{kk} = 0$.)

Algorithm 3.6 (Multivariate Gauss-Seidel Algorithm). After the initialization of weight vectors $w = (w_1, \ldots, w_K)$ such that $w_k^t S_{kk} w_k = 1$, we iteratively compute for $k = 1, \ldots, K$
$$\tilde w^{(i+1)} = S_D^{\,-}\left(U_g^t\!\left(w^{(i)}, w^{(i+1)}\right) w^{(i+1)} + U_g\!\left(w^{(i)}, w^{(i)}\right) w^{(i)}\right)$$
$$w_k^{(i+1)} = \tilde w_k^{(i+1)} \Big/ \sqrt{\left(\tilde w_k^{(i+1)}\right)^{t} S_{kk}\, \tilde w_k^{(i+1)}}.$$
For the Horst scheme, the Gauss-Seidel algorithm is already defined in Chu & Watterson (1993).

Proposition 3.7. If the multivariate Gauss-Seidel algorithm converges to a vector $w$, this is a stationary point of 3.1.

Proof. If $w$ is the solution of the multivariate Gauss-Seidel algorithm, we conclude that $\tilde w = \Lambda w$ with $\tilde w$ defined in Algorithm 3.6. We plug this into the formula for $\tilde w$ and obtain
$$S_D \Lambda w = U_g^t(w, w)\, w + U_g(w, w)\, w = S_g(w)\, w,$$
which is the first Lagrangian equation in Proposition 3.2.
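A minimal R sketch of the multivariate power method (Algorithm 3.5) is given below for the Horst scheme, where $g'(x) = 1$ and $S_g(w) = [c_{kl} S_{kl}]$ does not depend on $w$. The blocks, the link matrix and the fixed number of iterations are made-up choices, and the sketch assumes that each $S_{kk}$ is invertible (otherwise a pseudoinverse would be needed).

```r
## Multivariate power method (Algorithm 3.5), Horst scheme.
set.seed(1)
n  <- 50; pk <- c(3, 2, 4); K <- length(pk)
Xk <- lapply(pk, function(p) scale(matrix(rnorm(n * p), n, p), scale = FALSE))
C  <- matrix(1, K, K) - diag(K)                   # link matrix: all blocks linked, c_kk = 0

S <- function(k, l) crossprod(Xk[[k]], Xk[[l]]) / n        # empirical S_kl
## initial weights with w_k^t S_kk w_k = 1
wk <- lapply(1:K, function(k) {
  w <- rep(1, pk[k]); w / sqrt(drop(t(w) %*% S(k, k) %*% w))
})

for (iter in 1:100) {
  wk_new <- vector("list", K)
  for (k in 1:K) {
    ## iteration step: tilde w_k = S_kk^- sum_l c_kl g'(theta_kl) S_kl w_l  (g' = 1 here)
    rhs <- Reduce(`+`, lapply(1:K, function(l) C[k, l] * S(k, l) %*% wk[[l]]))
    w   <- solve(S(k, k), rhs)
    wk_new[[k]] <- w / sqrt(drop(t(w) %*% S(k, k) %*% w))  # normalization step
  }
  wk <- wk_new
}
zk <- lapply(1:K, function(k) Xk[[k]] %*% wk[[k]])         # estimated latent variables
```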
3.4 The Partial Least Squares Path Algorithms
Now, we return to the PLS framework introduced in Section 3.1. There are two important algorithms that try to compute the latent variables in the path model, Lohmöller (Lohmöller 1989) and Wold (Wold 1985). The algorithms presented here are in mode B. Mode B refers to a path model where all blocks are supposed to be formative. PLS in mode A (which corresponds to reflective blocks) is discussed in Section 3.5. The common point of view is that these algorithms are alternating algorithms in the sense that we iteratively estimate first the inner model and then the outer model. We show that Lohmöller corresponds to the multivariate power method 3.5 and that Wold corresponds to the multivariate Gauss-Seidel algorithm 3.6. As a consequence, we can conclude that, if all blocks are formative, the PLS solutions are indeed stationary points of the optimization criterion 3.1.

Let us start with Lohmöller's algorithm.

Algorithm 3.8 (Lohmöller's algorithm). After the initialization of a weight vector $w^{(0)}$ such that $z_k^{(0)} = X_k w_k^{(0)}$ has length $\sqrt{n}$, we iteratively compute for all $k$ simultaneously
$$\tilde z_k^{(i+1)} = \sum_{l=1}^K c_{kl}\, g'\!\left(\theta_{kl}\!\left(w^{(i)}\right)\right) z_l^{(i)} \qquad \text{(inner model: environmental variable)}$$
$$\tilde w_k^{(i+1)} = \left(X_k^t X_k\right)^{-} X_k^t\, \tilde z_k^{(i+1)} \qquad \text{(outer model in mode B)}$$
$$w_k^{(i+1)} = \sqrt{n}\; \tilde w_k^{(i+1)} \Big/ \left\|X_k \tilde w_k^{(i+1)}\right\| \qquad \text{(normalization)}$$
$$z_k^{(i+1)} = X_k w_k^{(i+1)} \qquad \text{(update)}$$
We remark that the term "environmental variable" is part of the PLS nomenclature.

Proposition 3.9. The Lohmöller algorithm is equal to the multivariate power method.

Proof. The proof is straightforward. Note that the formula for the environmental variable equals