Analysis of High-Dimensional Data
with Partial Least Squares and Boosting
submitted by
Diplom-Mathematikerin Nicole Krämer
from Berlin

Dissertation approved by Faculty IV (Electrical Engineering and Computer Science)
of the Technische Universität Berlin
in fulfilment of the requirements for the degree of
Doktor der Naturwissenschaften (Dr. rer. nat.)

Doctoral committee:
Chair: Prof. Dr. Manfred Opper
Reviewer: Prof. Dr. Ulrich Kockelkorn
Reviewer: Prof. Dr. Gerhard Tutz
Date of the scientific defense: 15 December 2006

Berlin 2007
First of all, I want to express my deep gratitude to Prof. Dr. Ulrich Kockelkorn for his thorough supervision. I would also like to thank Prof. Dr. Gerhard Tutz, who agreed to be the second examiner of this thesis and who supported my work with his advice and encouragement. I am obliged to Dr. Jörg Betzin for introducing me to a lot of exciting open questions regarding Partial Least Squares and for sharing his thoughts with me. Finally, I benefited a lot from the exchange of ideas with Dr. Roman Rosipal, Athanassios Kondylis and Dr. Anne-Laure Boulesteix.
In the statistical analysis of high-dimensional data, the task is to model relationships between a large number $p$ of variables with the help of a limited number $n$ of observations. All methods of analysis share the following two goals. On the one hand, it is important to detect (latent) structures in the data in order to obtain a manageable, low-dimensional representation. On the other hand, it is often of great importance to develop comprehensible and easily interpretable models. The high dimensionality of the data often causes severe problems, since for $p \gg n$ the traditional statistical techniques fail. Moreover, the structure of the data is often more complex. The observed quantities are not, as is usual in classical statistics, vectors in a finite-dimensional vector space, but for example functions. Examples of this type of data structure are time series or measurements from near-infrared spectroscopy. In this thesis, the analysis of high-dimensional and complex data is investigated with the help of two methods: Partial Least Squares and Boosting in function spaces.

Partial Least Squares (PLS) models the relationship between different blocks of variables with the help of so-called latent variables. In the case of more than two blocks, the PLS techniques are also called path models and can be regarded as an extension of Canonical Correlation Analysis. The mathematical properties of PLS path models are to a large extent still unexplored. For example, it is neither clear whether the algorithms for computing the latent variables in the path model converge numerically, nor whether, in case they converge, they yield solutions of sensible optimization problems. In this thesis, a sound mathematical framework for the description of the path models is established. It is shown that for a large part of the PLS algorithms (those with at least one block in mode A) no twice-differentiable optimization criterion exists. In addition, it is shown on simulated data that the algorithms with all blocks in mode B need not converge to the optimum of the associated optimization problem.

PLS can also be used in regression problems by regarding the explanatory and the dependent variables as one block each. In this case, PLS additionally provides a dimension reduction of the data, which in turn hopefully leads to better predictions. In this thesis, an extension of PLS by a penalty term is presented and applied to the estimation of generalized additive models (GAMs). It turns out that, in particular for high-dimensional data, this approach is a good alternative to classical GAM techniques. Starting from the known connection between PLS and the conjugate gradient method (from numerical linear algebra), it is shown that penalized PLS is equivalent to a preconditioned conjugate gradient method, where the preconditioner is determined by the penalty term. Subsequently, the connections between linear algebra and PLS are exploited to investigate empirically the so-called shrinkage properties of PLS. In addition, an unbiased estimator of the degrees of freedom of PLS is derived.

Boosting is a method from the field of machine learning. The basic idea is to combine several simple prediction models in such a way that the combination leads to much better predictions. In this thesis, Boosting methods are developed for complex data structures. We are mainly interested in the case in which the observed explanatory quantities are functions or discrete measurements of functions. The established Boosting methods implicitly rely on the assumption that the explanatory variables take values in a finite-dimensional vector space. It is shown that the extension to infinite-dimensional function spaces is straightforward. Moreover, it is illustrated how Boosting methods can be used to detect important characteristics of the functions and how this yields easily interpretable and visualizable models. This is done by means of a transformation of the data with wavelet or Fourier bases.
The crucial task in the statistical analysis of high-dimensional data is to model relationships between a large number $p$ of variables based on a small number $n$ of observations. Quite generally, we pursue two goals. On the one hand, it is important to detect (latent) structures in the data in order to obtain a feasible, low-dimensional representation. On the other hand, we often need simple and comprehensible models that can be interpreted. The high dimensionality of the data often forms an obstacle, as for $p \gg n$, the traditional statistical techniques fail to produce satisfactory results. Furthermore, the structure of the data can be complex. The observed variables are not, as usually assumed in classical statistics, elements of a finite-dimensional vector space, but, for instance, functions. Examples for this type of data are time series or experiments from the field of near-infrared spectroscopy. In this work, we investigate high-dimensional and complex data with the help of two methods: Partial Least Squares and Boosting for functional data.

Partial Least Squares (PLS) models the relationship between different blocks of variables in terms of so-called latent variables. In the case of more than two blocks, the PLS techniques are also called path models and can be seen as a generalization of Canonical Correlation Analysis. The mathematical properties of PLS are for the most part not yet established. For example, it is neither known whether the PLS algorithms converge numerically, nor, in the case that they converge, if they produce solutions of a sensible optimization criterion. In this work, we establish a sound mathematical framework for the description of PLS path models. We show that for a large part of the PLS algorithms (those with at least one block in mode A), there is indeed no twice-differentiable optimization problem. Furthermore, we show on simulated data that the PLS algorithms in mode B do not always converge to the optimum of the associated optimization problem.

PLS can also be used for regression problems by regarding the explanatory and the dependent variables as one block each. In this case, PLS additionally provides a dimension reduction of the data, which hopefully leads to better prediction rules. In this work, we present an extension of PLS using penalization techniques. This method is then used to estimate generalized additive models (GAMs). This approach turns out to be a good alternative to traditional GAM methods in the case of high-dimensional data. Based on the well-known relationship between PLS and the conjugate gradient technique (a method from the field of numerical linear algebra), we prove that penalized PLS is equal to a preconditioned conjugate gradient technique. Here, the preconditioner is determined by the penalty term. Subsequently, we exploit the connections between PLS and linear algebra to investigate empirically the so-called shrinkage properties of PLS. In addition, we derive an unbiased estimate of the degrees of freedom of PLS.

Boosting has its roots in the machine learning community. The basic idea is to combine several simple models in such a way that their combination leads to better prediction rules. In this work, we develop Boosting algorithms for complex data structures. Our focus is on data that are (discrete) measurements of curves. The established Boosting methods implicitly assume that the observed variables lie in a finite-dimensional vector space. We show that an extension of Boosting to infinite-dimensional function spaces is straightforward. Furthermore, we illustrate how to detect relevant features of the investigated functions and how to produce simple and interpretable models. This is done by applying wavelet or Fourier transformations to the data.
1 Preliminaries
1.1 Notation and Guideline
1.2 Learning from Data
1.3 The Regression Model
1.4 Duality and Kernel Methods

2 Model Selection
2.1 Cross Validation
2.2 Degrees of Freedom
2.3 Information Criteria

3 Partial Least Squares Path Models
3.1 The Partial Least Squares Framework
3.2 Optimization Strategies
3.3 Multivariate Power Methods
3.4 The Partial Least Squares Path Algorithms
3.5 No Smooth Optimality Criterion for Mode A
3.6 No Convergence to the Optimum for Mode B
3.7 Conclusion

4 Partial Least Squares for Regression
4.1 NIPALS and SIMPLS
4.2 Basic Properties of Partial Least Squares Regression
4.3 The Degrees of Freedom of Partial Least Squares

5 Penalized Partial Least Squares
5.1 Penalized Regression Splines
5.2 Dimension Reduction for B-Splines Transformations
5.6 Conclusion

6 From Linear Regression to Linear Algebra
6.1 Preconditioned Conjugate Gradient Methods
6.2 Why Krylov Subspaces?
6.3 A Polynomial Representation

7 Shrinkage Properties of Partial Least Squares
7.1 What is Shrinkage?
7.2 The Shrinkage Factors of Partial Least Squares
7.3 Simulations
7.4 Example: Tecator Data Set
7.5 Conclusion

8 Functional Data Analysis
8.1 Example: Biscuit Dough Data Set
8.2 Example: Speech Recognition Data Set
8.3 From Observations to Functions
8.4 Learning from Functional Data

9 Boosting for Functional Data
9.1 A Short Introduction to Boosting
9.2 The Degrees of Freedom of Boosting
9.3 Boosting and Partial Least Squares
9.4 Functional Boosting
9.5 Example: Speech Recognition
9.6 Example: Biscuit Dough Data Set
9.7 Conclusion

10 Summary and Outlook

Bibliography

A Mathematical Background
A.1 Matrix Differential Calculus
A.2 Optimization under Constraints
A.5 The Moore-Penrose Inverse
Preliminaries

This chapter serves as a reference for some basic concepts that are essential for the rest of this work.
1.1 Notation and Guideline
Matrices are denoted by bold upper case letters $X, A, U, \ldots$ and vectors are denoted by bold lower case letters $v, y, \ldots$. This rule is also valid if we use Greek letters. The transpose of a matrix or a vector is indicated by a superscript $t$ as in $A^t$ and $v^t$. The Moore-Penrose inverse of a matrix $A$ is denoted by $A^-$. Spaces are either in calligraphic style ($\mathcal{X}, \mathcal{Y}, \mathcal{F}, \ldots$) or in blackboard bold style ($\mathbb{R}, \mathbb{N}, \mathbb{Z}, \mathbb{F}, \ldots$). Functions are usually denoted by lower case letters as $f, g, h, \ldots$. The covariance and variance of random variables are denoted by Cov and Var. Their respective empirical counterparts are denoted by cov and var.

The table of contents hopefully reveals the rough structure of this work. Let us remark that Chapters 1, 2 and 8 serve as summaries of well-known concepts and do not contain any fundamentally new ideas. Before starting to read further, it might be beneficial to have a look at Chapter A in the appendix. There, we collect some basic mathematical principles that constantly emerge in this work.
All experiments and simulations are performed in R (R Development Core Team).
1.2 Learning from Data
Let us introduce the general learning problem. We consider two random variables $X$ and $Y$ which define a random variable $Z = X \times Y$ on a product space $\mathcal{X} \times \mathcal{Y}$. We assume that there is a relationship between $X$ and $Y$ in the sense that given $X$, we can predict the outcome of $Y$ with a high accuracy. We do not know the distribution of $Z$, but we observe a finite set
$$S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathcal{X} \times \mathcal{Y}$$
of observations. This set is called the sample. We assume that the observations are drawn independently from $Z = X \times Y$. The classical statistical approach is to identify the process $Z$ that generates the sample. Assuming a class of models that is appropriate to describe $Z$, the model parameters of $Z$ are estimated with the help of $S$. From this, we can infer the conditional distribution of $Y$ given $X$. In order to find an estimate $\hat f$ of a function
$$F: \mathcal{X} \to \mathcal{Y} \qquad (1.1)$$
that predicts an outcome $y$ for a fixed value $x \in \mathcal{X}$ via $\hat f(x)$, we have to evaluate the decision $\hat f(x)$ if the correct outcome is $y$. This is done via a loss function
$$L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}, \qquad (1.2)$$
and the risk at a certain point $z = (x, y)$ is defined as $L(y, \hat f(x))$. The optimal function is the one that minimizes the expected risk
$$R(f) = E_{X \times Y}\left[L(Y, f(X))\right] \qquad (1.3)$$
over all functions $f$. If the distribution of $Z$ is known, the optimal function can often be determined explicitly. For example, if we consider regression problems (that is, $\mathcal{Y} = \mathbb{R}$) and $L$ is the quadratic loss function
$$L(y, y') = (y - y')^2, \qquad (1.4)$$
the optimal function is the conditional expectation $f(x) = E[Y\,|\,X = x]$. The primary task is to find a function that minimizes (1.3) and not to identify the whole process $Z$. As the distribution of $Z$ is not known, it is not possible to minimize (1.3). As a consequence, we have to estimate the optimal function based on the available sample $S$. A popular approach is to fix a class of functions $\mathcal{F}$ and to minimize the empirical risk
$$\hat R(f) = \frac{1}{n}\sum_{i=1}^n L\left(y_i, f(x_i)\right) \qquad (1.5)$$
over all elements $f \in \mathcal{F}$. The quantity $\hat R(f)$ is also called the training error. Sometimes, a regularization term $r(f)$ is added to (1.5). This strategy is called (regularized) empirical risk minimization. A combination of a loss function, a class of functions and a regularization term defines a strategy to estimate (1.1). Depending on the scientific community, a strategy is called a model, an algorithm, a fitting method or a learner. In this work, we use these terms more or less synonymously. A more precise specification of a learning strategy is given in Chapter 2. Some remarks on the term model are however necessary. In the literature, it is used to describe two different aspects of learning from data. On the one hand, a model is a description of how the data is generated. E.g. in the learning task (1.1), we can determine the structure of the function $F$ (linear, polynomial) or assume that $Z$ belongs to a certain class of distributions. On the other hand, a model is a strategy how to estimate the generation of the data. This refers to the description of a learning strategy described above. For every sample $S$ and every learning strategy, we obtain an estimate of the function $F$, which we denote by $\hat f$.
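As a small illustration of empirical risk minimization (added here; the data and the choice of a linear function class are made up for the example), the following R sketch minimizes the empirical risk (1.5) under the quadratic loss (1.4) over the class of linear functions and reports the resulting training error.

```r
## Hypothetical example: empirical risk minimization with quadratic loss.
## The class F is the set of linear functions; lm() minimizes (1.5) over F.
set.seed(1)
n <- 50
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.3)              # sample S = {(x_1, y_1), ..., (x_n, y_n)}

quad_loss <- function(y, yhat) (y - yhat)^2  # quadratic loss function (1.4)

fit   <- lm(y ~ x)                           # empirical risk minimizer over F
f_hat <- fitted(fit)
R_emp <- mean(quad_loss(y, f_hat))           # training error (1.5)
R_emp
```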
1.3 The Regression Model

In this work, we are mainly concerned with multivariate regression problems that have a one-dimensional response. That is, we assume that $\mathcal{Y} = \mathbb{R}$ and $\mathcal{X} = \mathbb{R}^p$. In statistics, regression problems are usually modeled in the following way:
$$Y_i = F(X_i) + \varepsilon_i, \quad i = 1, \ldots, n.$$
The predictor $X$ is a multivariate, $p$-dimensional random variable. From now on, the predictors are assumed to be deterministic and only the response is assumed to be random; the errors have zero mean and equal variance. In compact form, the regression model can be written as
$$Y_i = F(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1.6)$$
with
$$E[\varepsilon_i] = 0 \quad \text{and} \quad \operatorname{Cov}(Y_1, \ldots, Y_n) = \sigma^2 I_n. \qquad (1.7)$$
Here, $I_n$ is the identity matrix of dimension $n$. It follows immediately that
$$E[Y_i] = F(x_i). \qquad (1.8)$$
Let us now consider multivariate linear regression problems. If the function $F$ in (1.6) is assumed to be linear, the regression model can be represented by the multivariate linear regression model
$$Y_i = x_i^t \beta + \varepsilon_i. \qquad (1.9)$$
Given data $S$, the estimation of (1.6) is transformed into the estimation $\hat\beta$ of the regression vector $\beta$. Recall that the number of variables is $p$ and that the number of examples is $n$. We set
$$X = \begin{pmatrix} x_1^t \\ \vdots \\ x_n^t \end{pmatrix} \in \mathbb{R}^{n \times p}, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n.$$
An intercept $\beta_0$ can be included into (1.9) by attaching an additional column that consists of $1$'s. Another possibility is to estimate (1.9) based on centered data. For this reason, we require that both $X$ and $y$ are centered.

The Ordinary Least Squares (OLS) estimator $\hat\beta_{OLS}$ is the solution of the optimization problem
$$\arg\min_{\beta} \sum_{i=1}^n \left(y_i - x_i^t\beta\right)^2 = \arg\min_{\beta} \left\|y - X\beta\right\|^2.$$
Note that this equals the minimization of the empirical risk $\hat R(\beta)$ defined in (1.5) for the quadratic loss function (1.4) and $\mathcal{F}$ equal to the space of all linear functions. If we differentiate the empirical risk with respect to $\beta$, we realize that the solution $\hat\beta_{OLS}$ must fulfill
$$X^t X \hat\beta_{OLS} = X^t y.$$
There is always a solution, as $X^t y$ lies in the space spanned by the columns of $X^t X$. However, the solution is not unique if $X^t X$ does not have full rank. This is for example the case if there are more variables than observations. If there is no unique solution, we define the OLS estimator as the solution with minimal Euclidean norm. It follows from Proposition A.14 that
$$\hat\beta_{OLS} = \left(X^t X\right)^{-} X^t y.$$
We now use the singular value decomposition
$$X = V \Sigma U^t$$
of $X$ that is defined in (A.8). Furthermore, $\Lambda = \Sigma^t\Sigma$ is the matrix of eigenvalues of $X^t X$. Set
$$s = \Sigma V^t y. \qquad (1.10)$$
In this work, we use one of the following representations of the OLS estimator:
$$\hat\beta_{OLS} = \left(X^t X\right)^{-} X^t y = U\Lambda^{-} s = \sum_{i=1}^{\operatorname{rk}(X)} \frac{v_i^t y}{\sqrt{\lambda_i}}\, u_i = \sum_{i=1}^{\operatorname{rk}(X)} z_i, \qquad (1.11)$$
with $z_i = \left(v_i^t y / \sqrt{\lambda_i}\right) u_i$. The OLS estimator can have poor statistical properties if the number of examples is small compared to the number of variables or if $X$ is highly collinear. Both phenomena lead to a covariance matrix $(1/n)X^t X$ that is (almost) singular, which affects the statistical properties of the estimator. This is discussed in great detail in Section 7.1.
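The representation (1.11) can be translated directly into code. The following R sketch (an illustration added here, using simulated data with more variables than observations) computes the minimum-norm OLS estimator from the singular value decomposition of the centered data matrix.

```r
## Minimum-norm OLS estimator via the singular value decomposition, cf. (1.11).
set.seed(1)
n <- 20; p <- 50                              # more variables than observations
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)

X <- scale(X, center = TRUE, scale = FALSE)   # centered predictors
y <- y - mean(y)                              # centered response

sv  <- svd(X)                                 # in the thesis' notation: X = V Sigma U^t
pos <- sv$d > 1e-10                           # keep the nonzero singular values only
## beta_OLS = U Lambda^- s with s = Sigma V^t y; here sv$v plays the role of U
beta_ols <- sv$v[, pos] %*% ((t(sv$u[, pos]) %*% y) / sv$d[pos])
```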
1.4 Duality and Kernel Methods
In this section, we briefly recapitulate the concept of dual representations and the kernel trick. Let us consider the following example. In Section 1.3, we introduced the linear regression model (1.9). For any estimate $\hat\beta$ of $\beta$, we can predict the value of $y$ for a new observation $x_{new}$ via the linear function
$$\hat y_{new} = x_{new}^t \hat\beta = \left\langle x_{new}, \hat\beta \right\rangle. \qquad (1.12)$$
Now suppose that we want to transform the original data $X$ before applying OLS. One reason to do so is to model nonlinear relationships between predictor variables and response. Suppose, e.g., we have $p = 2$ predictor variables and we want to estimate a function (1.1) with $F$ assumed to be a polynomial of degree $\leq 2$. How can we use a method that is designed for linear regression problems (e.g. OLS) to solve this problem? We simply transform the data via a map
$$\Phi(x, x') = \left(1, \sqrt{2}\,x, \sqrt{2}\,x', \sqrt{2}\,xx', x^2, x'^2\right) \qquad (1.13)$$
and apply the linear algorithm to $y$ and $\Phi(X)$. A transformation is also necessary if the observed data are not yet embedded in a Euclidean space. If, for instance, the variables are on a nominal scale, we have to transform the variables into dummy variables and then plug the transformed data into any learning method designed for estimating linear relationships.

The transformation map
$$\Phi: \mathcal{X} \to \mathcal{F} \qquad (1.14)$$
is called the feature map. The spaces $\mathcal{X}$ and $\mathcal{F}$ are called input space and feature space respectively. In order to apply a linear algorithm in $\mathcal{F}$, it is often necessary to assume that $\mathcal{F}$ is a Hilbert space. An important observation is the following. In a lot of cases, the vector $\hat\beta$ that defines the linear function in (1.12) is a linear combination of the data points,
$$\hat\beta = \sum_{i=1}^n \alpha_i\, \Phi(x_i). \qquad (1.15)$$
The coefficients $\alpha_i$ are called dual variables. If we plug (1.15) into $\langle \Phi(x_{new}), \hat\beta\rangle$, we realize that the linear function only depends on inner products between transformed data points,
$$f(x) = \langle \Phi(x), \hat\beta \rangle = \sum_{i=1}^n \alpha_i\, \langle \Phi(x), \Phi(x_i) \rangle.$$
In this case, the estimation of the dual variables can be done by using the $n \times n$ Gram matrix
$$K = \left(\langle \Phi(x_i), \Phi(x_j) \rangle\right)_{i,j=1,\ldots,n}.$$
Note that condition (1.15) holds for the OLS estimate. This follows e.g. from (1.11), as the vectors $u_i$ are a basis of the row space of $X$. (Recall the singular value decomposition (A.8) of $X$.) de Bie et al. (2005) describe various multivariate methods in terms of their primal and dual representation.

As we only need inner products in the dual representation, we do not have to map the data points explicitly to a feature space; it suffices to compute the function
$$k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} \qquad (1.16)$$
with
$$k(x, z) = \langle \Phi(x), \Phi(z) \rangle. \qquad (1.17)$$
The estimated function is
$$f(x) = \sum_{i=1}^n \alpha_i\, k(x, x_i).$$
The function $k$ is called a kernel. Note that in example (1.13), the corresponding kernel is $k(x, z) = \left(1 + \langle x, z\rangle\right)^2$. The replacement of the usual inner product by the inner product in some feature space is called the kernel trick. Note that we do not even require the input space to be an inner product space at all. Literature on the kernel trick and its applications is abundant. A detailed treatise of the subject can be found in Schölkopf & Smola (2002).
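To make the kernel trick concrete, the following R sketch (added as an illustration) checks numerically that the inner product of two points mapped by the feature map (1.13) coincides with the kernel $k(x, z) = (1 + \langle x, z\rangle)^2$, so the Gram matrix $K$ can be computed without ever forming $\Phi(x)$ explicitly.

```r
## Feature map (1.13) for p = 2 predictor variables.
phi <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                     sqrt(2) * x[1] * x[2], x[1]^2, x[2]^2)

## Kernel function (1.16)/(1.17): the inner product in the feature space.
k <- function(x, z) (1 + sum(x * z))^2

x <- c(0.3, -1.2)
z <- c(2.0,  0.5)
sum(phi(x) * phi(z))   # explicit inner product of the mapped points
k(x, z)                # the same value, computed without the feature map
```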
So instead of defining a feature map, we define an admissible kernel function, that is, a function (1.16) which can be defined via a map (1.14) such that (1.17) holds. The choice of the optimal kernel is part of the model selection procedure that is illustrated in Chapter 2. What are the merits of this dual representation? We already mentioned the extension to nonlinear models. Furthermore, from a technical point of view, if $p \gg n$, the computation in the dual representation is usually faster than the computation in the primal representation. Finally, we can extend the whole multivariate machinery to spaces $\mathcal{X}$ of infinite dimension or with a complex structure by defining an appropriate inner product. An important example is the analysis of functional data, that is, $\mathcal{X}$ is a space of functions over some domain. This subject will be treated in more detail in Chapter 8.

Model Selection
We now recapitulate the main tools to evaluate the performance of a learning method. The contents of this chapter are a summary of the corresponding chapter in Hastie et al. (2001). As described in Chapter 1, we estimate the relationship (1.1) by applying an appropriate fitting method. Normally, we do not fit a single model but a group of models and have to choose the best model. This is usually called model selection. We therefore need a strategy how to select the best model out of a pool of models. After the best model is chosen, we have to evaluate its quality. This is called model validation. Recall that we evaluate a model in terms of its expected risk (1.3). As this quantity is usually unknown, we need a good estimate. In what follows, we focus on model selection and remark that the risk of the selected model should be estimated on a test set that was neither involved in the fitting process nor in the selection process.

In the rest of the chapter, we consider the general regression model (1.6). Given data, we fit a model and call the fitted function $\hat f$. In order to evaluate the quality of the fitting method, we start by computing the expected risk of $\hat f$ at a point $x_i$,
$$R\!\left(\hat f(x_i)\right) = E_{Y_{new}}\left[L\!\left(Y_{new}, \hat f(x_i)\right)\right]. \qquad (2.1)$$
Here, $Y_{new}$ is a new observation at point $x_i$. Note that the quantity $R(\hat f(x_i))$ depends on the sample that was used to fit $\hat f$. For the quadratic loss function (1.4), the expected risk of $\hat f$ at $x_i$ equals
$$\begin{aligned}
R\!\left(\hat f(x_i)\right) &= E_{Y_{new}}\left[\left(Y_{new} - \hat f(x_i)\right)^2\right] \\
&= E_{Y_{new}}\left[\left(Y_{new} - E[Y_{new}] + E[Y_{new}] - \hat f(x_i)\right)^2\right] \\
&\stackrel{(1.8)}{=} E_{Y_{new}}\left[\left(Y_{new} - E[Y_{new}]\right)^2\right] + \left(F(x_i) - \hat f(x_i)\right)^2 \\
&= \sigma^2 + \left(F(x_i) - \hat f(x_i)\right)^2. \end{aligned} \qquad (2.2)$$
The term $\sigma^2$ is called the irreducible error. The second term depends on the data that we used to fit the model. If we are interested in the quality of our learning strategy that is used to obtain the estimate $\hat f$, we compute the expected value of $R(\hat f(x_i))$ with respect to the data $Y^{(n)} = (Y_1, \ldots, Y_n)$. For the quadratic loss, we obtain
$$E_{Y^{(n)}} E_{Y_{new}}\left[\left(Y_{new} - \hat f(x_i)\right)^2\right] = \sigma^2 + \operatorname{bias}^2\!\left(\hat f(x_i)\right) + \operatorname{Var}\!\left(\hat f(x_i)\right).$$
We expect the bias to decrease for more complex models and the variance to increase. If we choose a very simple model with a low variance, we might fail to capture the relevant structure of the data. If we fit a very complex model that is almost unbiased, we have a good explanation of the available sample but will probably fail to predict on new observations. The latter phenomenon is called overfitting. In Section 7.1, we study this bias-variance trade-off in more detail for linear shrinkage estimators.
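The bias-variance trade-off can be made visible with a small simulation. The following R sketch (added here; the target function, noise level and polynomial learner are arbitrary choices for illustration) repeatedly draws new responses from the regression model (1.6), refits the learner and estimates squared bias and variance of $\hat f$ at the design points.

```r
## Monte Carlo illustration of the bias-variance decomposition.
set.seed(1)
n <- 30; sigma <- 0.5
x <- seq(0, 1, length.out = n)
F_true <- sin(2 * pi * x)                  # the (in practice unknown) function F

fit_once <- function() {
  y <- F_true + rnorm(n, sd = sigma)       # a new response vector Y^(n)
  fitted(lm(y ~ poly(x, 3)))               # a rather simple polynomial learner
}
fits <- replicate(2000, fit_once())        # rows: design points, columns: replications

bias2    <- (rowMeans(fits) - F_true)^2    # squared bias at each design point
variance <- apply(fits, 1, var)            # variance at each design point
c(mean(bias2), mean(variance))             # averaged over the design points
```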
Let us return to the essential question of this chapter. How do we select a model? Let us assume that we have a lot of data at hand. In this case, we can proceed in the following way. We split the data set into two parts: a training set and a validation set. We fit the models on the training set. We then compare their performance on the validation set. Note however that in most situations, the amount of available data is limited and we cannot afford to exclude a fraction of the data from model estimation. We therefore need different strategies to estimate the risk of a model. Roughly, we distinguish two different approaches. In the first approach, we repeat a random splitting into training and test set several times. This is called cross-validation and is discussed in Section 2.1. In the second approach, the risk of a model is estimated in terms of its empirical risk and its complexity. More precisely, the complexity can be expressed in terms of degrees of freedom.

2.1 Cross Validation

The cross-validation technique (Stone 1974, Stone 1977) produces estimates of the risk $R(\hat f)$ of a model $\hat f$. Recall that in order to select the best model, it is suggested to split the data into a training set and a validation set. The model is fitted on the training set and its performance is estimated on the validation set. As in a lot of cases we do not have enough data at hand, a more refined strategy is pursued. We randomly split the data into $K$ parts of roughly the same size. For $k = 1, \ldots, K$, we remove the $k$th block from the data and fit the model to the remaining $K-1$ parts. The $k$th block is used as a test set. That is, for each block $k$, we obtain an estimate of the risk of the model that was fitted on the other $K-1$ blocks. Finally, we average these $K$ estimates. More formally, the function
$$\kappa: \{1, \ldots, n\} \to \{1, \ldots, K\}$$
assigns to the $i$th example its block membership. We denote by $\hat f^{-k}$ the function that was fitted on all but the $k$th block.

Definition 2.1. The $K$-fold cross-validation error is
$$CV\!\left(\hat f\right) = \frac{1}{n}\sum_{i=1}^n L\!\left(y_i, \hat f^{-\kappa(i)}(x_i)\right). \qquad (2.3)$$
For $K = n$, this is called the leave-one-out error.

Let us note that the computational costs of $K$-fold cross-validation can become very high if $K$ is large. In this case, we have to fit the models several times, which can be very time-consuming.
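Definition 2.1 translates into a few lines of code. The following R sketch (an added illustration; data, learner and $K$ are made-up choices) computes the $K$-fold cross-validation error (2.3) for a polynomial regression model under the quadratic loss.

```r
## K-fold cross-validation error (2.3), quadratic loss.
set.seed(1)
n <- 100; K <- 5
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

kappa   <- sample(rep(1:K, length.out = n))  # block membership kappa(i) of each example
cv_loss <- numeric(n)
for (k in 1:K) {
  test  <- which(kappa == k)
  fit_k <- lm(y ~ poly(x, 3), subset = -test)            # fit on the other K-1 blocks
  pred  <- predict(fit_k, newdata = data.frame(x = x[test]))
  cv_loss[test] <- (y[test] - pred)^2                    # loss of f^{-kappa(i)} at x_i
}
mean(cv_loss)                                # K-fold cross-validation error CV(f)
```

2.2 Degrees of Freedom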
As already mentioned above, the quality of a model or a function $\hat f$ is measured in terms of its expected risk (1.3). As this risk cannot be computed, we need a good estimate of (1.3). We expect the empirical risk to be lower than the true risk, as we use the same data set to fit the model $\hat f$ and to assess its performance. If we use the training data for model assessment, this leads to over-optimistic estimates of the risk. The gap between empirical and test error is usually particularly large for very complex models. In order to get a good estimate of the expected risk, we have to measure the gap between empirical error and the expected risk.

Recall the general regression model (1.6). Note that we defined the expected risk of $\hat f$ at a data point $x_i$ in (2.1). The estimated function $\hat f$ depends on the sample $S$, that is, it depends on $Y^{(n)} = (Y_1, \ldots, Y_n)$. If we average over all points $x_1, \ldots, x_n$ and compute the expectation with respect to $Y^{(n)}$, we obtain the expected in-sample risk of our strategy,
$$R_{in} = R_{in}(x_1, \ldots, x_n) = E_{Y^{(n)}}\left[\frac{1}{n}\sum_{i=1}^n R\!\left(\hat f(x_i)\right)\right].$$
The difference between $R_{in}$ and the expected empirical risk is called the optimism:
$$\operatorname{op} = R_{in} - E_{Y^{(n)}}\left[\hat R(\hat f)\right] = E_{Y^{(n)}}\left[\frac{1}{n}\sum_{i=1}^n \left\{R\!\left(\hat f(x_i)\right) - \hat R\!\left(\hat f(x_i)\right)\right\}\right].$$
The key point is to find a good estimate $\widehat{\operatorname{op}}$ of $\operatorname{op}$. We can then estimate the in-sample risk of a model in the following way:
$$\hat R_{in} = \hat R + \widehat{\operatorname{op}}. \qquad (2.4)$$

Proposition 2.2. For the quadratic loss function (1.4), the optimism of a fitting method is
$$\operatorname{op} = \frac{2}{n}\sum_{i=1}^n \operatorname{Cov}\!\left(\hat f(x_i), Y_i\right). \qquad (2.5)$$

Proof. It follows from (2.2) that
$$R_{in} = \sigma^2 + \frac{1}{n}\sum_{i=1}^n E_{Y^{(n)}}\left[\left(F(x_i) - \hat f(x_i)\right)^2\right].$$
Next, we have
$$\begin{aligned}
\hat R(\hat f) &= \frac{1}{n}\sum_{i=1}^n \left(Y_i - \hat f(x_i)\right)^2 = \frac{1}{n}\sum_{i=1}^n \left(Y_i - F(x_i) + F(x_i) - \hat f(x_i)\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^n \left(Y_i - F(x_i)\right)^2 + \frac{2}{n}\sum_{i=1}^n \left(Y_i - F(x_i)\right)\left(F(x_i) - \hat f(x_i)\right) + \frac{1}{n}\sum_{i=1}^n \left(F(x_i) - \hat f(x_i)\right)^2.
\end{aligned}$$
It follows that
$$E_{Y^{(n)}}\left[\hat R(\hat f)\right] = \sigma^2 - \frac{2}{n}\sum_{i=1}^n \operatorname{Cov}\!\left(Y_i, \hat f(x_i)\right) + \frac{1}{n}\sum_{i=1}^n E_{Y^{(n)}}\left[\left(F(x_i) - \hat f(x_i)\right)^2\right].$$
This concludes the proof.

Before proceeding, it is beneficial to introduce a compact representation of a fitting method. If we denote by $\hat f$ the fitted function that is obtained by using the sample $S$, we define the following map:
$$H: \mathbb{R}^n \to \mathbb{R}^n, \qquad H(y) = \left(\hat f(x_1), \ldots, \hat f(x_n)\right)^t = \hat y. \qquad (2.6)$$
Note that the function $H$ depends on $x_1, \ldots, x_n$. If this function is linear in $y$, we speak of a linear fitting method or a linear learner. In this case, $H$ can be represented by an $n \times n$ matrix $H$ that is called the hat matrix.

Definition 2.3 (Degrees of Freedom). The degrees of freedom of a fitting method that is represented by $H$ are defined as
$$df(H) = \frac{1}{\sigma^2}\sum_{i=1}^n \operatorname{Cov}\!\left(\hat f(x_i), Y_i\right).$$
In particular,
$$\operatorname{op} = \frac{2}{n}\,\sigma^2\, df(H).$$

For the rest of this section, we assume that the error variables $\varepsilon_i$ in (1.6) are normally distributed. The next useful lemma is due to Stein (1981).

Lemma 2.4 (Stein's Lemma). Assume that $X \sim N(\mu, \sigma^2)$ is a univariate random variable with density function $\phi$ and that $g: \mathbb{R} \to \mathbb{R}$ is a differentiable function such that
$$\lim_{x \to \pm\infty} g(x)\phi(x) = 0. \qquad (2.7)$$
We have
$$\operatorname{Cov}\!\left(g(X), X\right) = \sigma^2 E\left[g'(X)\right].$$

We can easily extend Stein's lemma to multivariate random variables.

Lemma 2.5 (Multivariate Stein's Lemma). Assume that $X = (X_1, \ldots, X_n) \sim N\!\left(\mu, \sigma^2 I_n\right)$ is a multivariate random variable with density function $\phi(x) = \prod_{i=1}^n \phi_i(x_i)$. Let $g = (g_1, \ldots, g_n): \mathbb{R}^n \to \mathbb{R}^n$ be a differentiable function which fulfills
$$\lim_{x \to \pm\infty} g_i(x)\phi_i(x) = 0. \qquad (2.8)$$
We have
$$\sum_{i=1}^n \operatorname{Cov}\!\left(g_i(X), X_i\right) = \sigma^2 E\left[\operatorname{trace}\!\left(\frac{\partial}{\partial X} g(X)\right)\right].$$

Proof. We fix $i \in \{1, \ldots, n\}$ and set $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. We have
$$\begin{aligned}
\operatorname{Cov}\!\left(g_i(X), X_i\right) &= E_X\!\left[g_i(X)\left(X_i - E[X_i]\right)\right] \\
&= E_{X_{-i}}\!\left[E_{X_i | X_{-i}}\!\left[g_i(X)\left(X_i - E[X_i]\right)\right]\right] \\
&= E_{X_{-i}}\!\left[E_{X_i}\!\left[g_i(X)\left(X_i - E[X_i]\right)\right]\right].
\end{aligned}$$
The last equality is valid as we assume that the variables $X_i$ are independent. In the expression $E_{X_i}\!\left[g_i(X)\left(X_i - E[X_i]\right)\right]$ in the last line, $g_i(X)$ only varies in $X_i$ and the other components are considered to be constant. We can now apply Stein's lemma 2.4 to $g_i(X_i)$ and $X_i$ and obtain
$$\operatorname{Cov}\!\left(g_i(X), X_i\right) = \sigma^2 E_{X_{-i}}\!\left[E_{X_i}\!\left[\frac{\partial}{\partial X_i} g_i(X)\right]\right] = \sigma^2 E_X\!\left[\frac{\partial}{\partial X_i} g_i(X)\right].$$
This proves the lemma.

Corollary 2.6. Assume that the function $H$ defined in (2.6) is differentiable. If $Y_i$ is normally distributed, $Y_i \sim N\!\left(F(x_i), \sigma^2\right)$, and $H$ fulfills assumption (2.8), we have
$$\sum_{i=1}^n \operatorname{Cov}\!\left(\hat Y_i, Y_i\right) = \sigma^2 E\left[\operatorname{trace}\!\left(\frac{\partial}{\partial Y^{(n)}} H\!\left(Y^{(n)}\right)\right)\right].$$
In particular,
$$df(H) = E\left[\operatorname{trace}\!\left(\frac{\partial H\!\left(Y^{(n)}\right)}{\partial Y^{(n)}}\right)\right].$$
In this case,
$$\widehat{df}(H) = \operatorname{trace}\!\left(\frac{\partial H\!\left(Y^{(n)}\right)}{\partial Y^{(n)}}\right) \qquad (2.9)$$
is an unbiased estimate for the degrees of freedom of $H$. If the learner is linear in $y$, i.e. $\hat y = H y$ with $H \in \mathbb{R}^{n \times n}$, we obtain
$$df(H) = \operatorname{trace}(H).$$
As an example, consider the OLS estimator of Section 1.3. The function $H$ corresponding to this estimator is
$$\hat y = X\left(X^t X\right)^{-} X^t y = P_X\, y.$$
The trace of the projection operator equals the dimension of the space spanned by the columns of $X$. We obtain the well-known result
$$df(\mathrm{OLS}) = \operatorname{rank}(X).$$
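For linear learners, the degrees of freedom are simply the trace of the hat matrix. The following R sketch (added here with simulated data; the ridge-type estimator is used only as a second example of a linear learner and is not part of the original text) computes $df(H) = \operatorname{trace}(H)$ for OLS and for a shrunken variant.

```r
## Degrees of freedom of linear fitting methods: df(H) = trace(H).
set.seed(1)
n <- 40; p <- 5
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)

H_ols <- X %*% solve(t(X) %*% X, t(X))     # projection P_X onto the columns of X
sum(diag(H_ols))                           # = rank(X) = 5 degrees of freedom

lambda  <- 10                              # a ridge-type linear shrinkage estimator
H_ridge <- X %*% solve(t(X) %*% X + lambda * diag(p), t(X))
sum(diag(H_ridge))                         # shrinkage: fewer than rank(X) degrees of freedom
```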
2.3 Information Criteria
We now return to the estimation of the risk of the model. Information criteria are based on the idea that the quality of a model depends on its training error and on its complexity. Different criteria differ in how strongly they penalize the complexity of the model. We already remarked in (2.4) that we can estimate the in-sample risk in terms of the empirical risk and the model's complexity. The Akaike Information Criterion (AIC) (Akaike 1973) is a generalization of (2.4). It is based on a general, asymptotic relationship between the Kullback-Leibler Information Criterion and maximum likelihood theory. We do not want to go too much into detail and refer e.g. to Burnham & Anderson (2004). In the case of normally distributed error terms $\varepsilon_i$, the AIC criterion is equivalent to (2.4),
$$AIC(\hat f) = \hat R(\hat f) + \frac{2}{n}\, df(H)\,\sigma^2.$$
The quantity $\sigma^2$ can be estimated via
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat f(x_i)\right)^2.$$
We choose the model that minimizes the AIC criterion,
$$\hat f_{AIC} = \arg\min_{\hat f} AIC(\hat f).$$
As the general AIC criterion only holds asymptotically for large values of $n$, there is a corrected version of the AIC criterion for small sample sizes (Hurvich & Tsai 1989). A further criterion that is motivated by the principle of minimum description length (Hansen & Yu 2001) and that is used in Chapter 9 is
$$\mathrm{gMDL}\!\left(\hat f\right) = \log\!\left(\frac{n}{n - df(H)}\,\hat R(\hat f)\right) + \frac{df(H)}{n}\left[\log\!\left(\sum_{i=1}^n y_i^2 - n\,\hat R(\hat f)\right) - \log\!\left(df(H)\,\frac{n}{n - df(H)}\,\hat R(\hat f)\right)\right].$$
The last two criteria penalize the complexity of a model more strongly than the AIC criterion.
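As a small illustration of model selection with an information criterion (added here; the data, the polynomial candidate models and the plug-in estimate of $\sigma^2$ are choices made for the example), the following R sketch evaluates the AIC criterion of this section for polynomial fits of increasing degree.

```r
## AIC-based model selection for a linear learner, cf. Section 2.3.
set.seed(1)
n <- 60
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

aic_poly <- function(d) {
  fit    <- lm(y ~ poly(x, d))
  R_emp  <- mean(residuals(fit)^2)         # empirical risk (training error)
  df     <- d + 1                          # trace of the hat matrix of this fit
  sigma2 <- R_emp                          # plug-in estimate of sigma^2 as in the text
  R_emp + (2 / n) * df * sigma2            # AIC(f) = R(f) + (2/n) df(H) sigma^2
}
sapply(1:8, aic_poly)                      # choose the degree with minimal AIC
```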
Partial Least Squares Path Models
Partial Least Squares (PLS) path models (Wold 1982, Wold 1985) are a framework for modeling linear relationships between several blocks of variables. In this sense, they can be seen as a generalization and an extension of Canonical Correlation Analysis (Hotelling 1936) to more than two blocks of variables. The relationship between different blocks is modeled in the following way. For each block of variables, we look for a latent variable, that is, a linear combination of these variables, such that latent variables that are assumed to be linked are highly correlated. These latent variables are estimated with algorithms that have a power method flavor.

The statistical and mathematical theory of PLS path models has not been fully established. In fact, PLS is defined via algorithms as in Wold (1982) and Lohmöller (1989) and not via a statistical model or an empirical optimization problem. Some fundamental questions have not been answered. For example, it is not guaranteed that the PLS algorithms converge numerically (although convergence is always observed in practice). More severely, for a wide class of algorithms (those with at least one block in mode A), it is not known if the latent variables computed by PLS are at least a stationary point of a sensible optimization problem. We show that this is not the case if we require that the objective function of the optimization problem is at least twice differentiable. For a different class of algorithms (those with all blocks in mode B), it is known (Mathes 1993) that the solution of the PLS algorithms is a stationary point of a sensible optimization problem. It is however not known if we always obtain the optimal solution. We provide a counterexample based on simulated data.
3.1 The Partial Least Squares Framework
In the PLS path model framework, we model linear relationships between different blocks of variables. One variable is a vector of length $n$, as we observe $n$ examples. We have $K$ blocks of variables, and each block consists of $p_k$ variables, which are subsumed in a matrix
$$X_k \in \mathbb{R}^{n \times p_k}, \quad k = 1, \ldots, K.$$
In total we have
$$p = \sum_{k=1}^K p_k$$
variables. The blocks of variables are called manifest variables in the PLS literature. The relationship between blocks is represented by a so-called inner model. Arrows between different blocks of variables $X_k$ indicate in which way they are linked (see Figure 3.1). For each inner model, we can define an undirected link matrix $C \in \{0,1\}^{K \times K}$ via
$$c_{kl} = \begin{cases} 1, & X_l \to X_k \ \text{or}\ X_k \to X_l \\ 0, & \text{otherwise} \end{cases}$$
(and $c_{kk} = 0$).

Figure 3.1: Illustration of a PLS path model with $K = 4$ blocks.

Furthermore, we assume that for each block $X_k$, there is a single latent (or hidden) variable $z_k \in \mathbb{R}^n$ that represents this block. This is called the outer model and is illustrated in Figure 3.2.

Figure 3.2: Illustration of the outer PLS model. Each block of manifest variables is replaced by one latent variable.

We distinguish two types of relationships between latent and manifest variables. The first one is the formative model, the second one is the reflective model (see Figure 3.3). In the formative model, we assume that the block $X_k$ of manifest variables forms the latent variable $z_k$. In terms of a regression model, this can be expressed as
$$z_k = X_k \beta + \varepsilon. \qquad (3.1)$$
In the reflective model, we assume that the manifest variables are a reflection of the latent variable. The underlying regression model is
$$X_k = z_k \beta^t + E. \qquad (3.2)$$

Figure 3.3: The difference between the two outer models. Left: the formative model; the manifest variables $X_1$ form the latent variable $z_1$. Right: the reflective model; the manifest variables $X_2$ are a reflection of the latent variable $z_2$.

Given data, we want to estimate (1) the latent variables, (2) the relationship in the inner model, and (3) the relations in the outer model. In order to estimate $z_k$, we need to define sensible optimality criteria. Ideally, these criteria have three features: Firstly, we want to find estimates $z_k$ such that $z_k$ and $z_l$ are close if their corresponding blocks are linked. Secondly, we want to take into account the directions of the arrows in the inner model. Thirdly, we want to take into account the directions of the arrows in the outer model.

The tough part in the process is the estimation of the latent variables. Once they are estimated, the relationships that are indicated by an arrow can be derived by using a least-squares estimate.

In the literature on PLS, there is often a huge gap between the abstract model (in terms of the inner and the outer model) and what is actually computed by the PLS path algorithms. Normally, the PLS algorithms are presented directly in connection with the PLS framework, insinuating that the algorithms produce optimal solutions of an obvious estimation problem attached to PLS. This estimation problem is however never defined. Furthermore, the directions of the arrows in the outer model are hoped to be taken care of by employing different modes: mode A for reflective blocks and mode B for formative blocks. While the terms "reflective blocks" and "formative blocks" refer to the description of the outer model as illustrated in Figure 3.3, mode A and B correspond to algorithms. It is a priori not clear how the abstract models and the PLS path algorithms are connected. Before deriving algorithms, it is therefore indispensable to set up a general optimization strategy that the algorithms try to solve. For this reason, in Section 3.2, we start the investigation by presenting different optimization criteria in order to define the latent variables $z_k$. Afterwards, we present two algorithms in Section 3.3 that try to compute the optimal solution. Only in Section 3.4 do we introduce the two PLS algorithms, Lohmöller in mode B and Wold in mode B, and show that they are equal to the algorithms in Section 3.3. We repeat that we always have to keep in mind the difference between what PLS wants to model and what it effectively models.

We now try to give a general overview on different types of PLS path algorithms. All terms that are now given will be defined in subsequent sections. In the PLS literature, there are two generic algorithms, the Lohmöller procedure and the Wold procedure. Roughly, there are the following measures of closeness between latent variables: Horst, factorial and centroid. These measures are usually called schemes. They have in common that they do not (!) consider the directions of the arrows in the inner model. A variant of PLS that does fulfill this condition is the path weighting scheme (which is not considered in this work). We recall that the directions in the outer model are hoped to be taken care of by employing different modes: mode A for reflective blocks and mode B for formative blocks.

Let us conclude this section with some additional definitions. In order to simplify notation, all variables are assumed to have zero mean. The empirical covariance matrix between blocks of variables is denoted by
$$S_{kl} = \frac{1}{n} X_k^t X_l \in \mathbb{R}^{p_k \times p_l}. \qquad (3.3)$$
We frequently work with vectors and matrices that have a block structure that is induced by $(p_1, \ldots, p_K)$. The covariance matrix $S \in \mathbb{R}^{p \times p}$ of $X$ can be partitioned into blocks of the form
$$S = \begin{pmatrix} S_{11} & S_{12} & \ldots & S_{1K} \\ S_{21} & S_{22} & \ldots & S_{2K} \\ \vdots & & \ddots & \vdots \\ S_{K1} & \ldots & & S_{KK} \end{pmatrix} \in \mathbb{R}^{p \times p}$$
with $S_{kl}$ defined in (3.3). We denote this block-wise structure by $[\,\cdot\,]$:
$$S = [S_{kl}] \in \mathbb{R}^{p \times p}.$$
Furthermore, we need matrices that only have entries in their diagonal blocks. We denote them by
$$S_D = \operatorname{diag}\left[S_{11}, \ldots, S_{KK}\right] = \begin{pmatrix} S_{11} & 0 & \ldots & 0 \\ 0 & S_{22} & \ldots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & S_{KK} \end{pmatrix} \in \mathbb{R}^{p \times p}.$$
The subscript $D$ indicates that we only take the diagonal blocks of the matrix $S$. Any vector $w \in \mathbb{R}^p$ can be partitioned into
$$w^t = \left(w_1^t, \ldots, w_K^t\right), \quad w_k \in \mathbb{R}^{p_k}. \qquad (3.4)$$
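The block-wise quantities of this section map directly onto code. The following R sketch (an added illustration with three randomly generated blocks) builds the empirical covariance matrices $S_{kl}$ of (3.3), the partitioned matrix $S = [S_{kl}]$ and its block-diagonal part $S_D$.

```r
## Block covariance matrices S_kl, S = [S_kl] and S_D, cf. (3.3).
set.seed(1)
n  <- 50
pk <- c(3, 2, 4)                                  # block sizes p_1, ..., p_K
Xk <- lapply(pk, function(p) scale(matrix(rnorm(n * p), n, p), scale = FALSE))

X <- do.call(cbind, Xk)                           # all manifest variables, centered
S <- crossprod(X) / n                             # S = [S_kl] = (1/n) X^t X

S_D <- matrix(0, sum(pk), sum(pk))                # block-diagonal part of S
idx <- split(seq_len(sum(pk)), rep(seq_along(pk), pk))
for (b in idx) S_D[b, b] <- S[b, b]
```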
3.2 Optimization Strategies

In this section, we abandon the PLS framework. Instead, we present different optimization criteria in order to define latent variables $z_k$. The general idea is to extend and generalize Canonical Correlation Analysis (CCA) (Hotelling 1936) to more than two blocks of variables. We try to find latent vectors $z_k = X_k w_k$ such that $z_k$ and $z_l$ are maximally correlated if the corresponding blocks are linked. In this sense, it is an extension of CCA. For two blocks of variables $X_1$ and $X_2$, CCA computes
$$\arg\max_{w_1, w_2} \operatorname{cor}\!\left(X_1 w_1, X_2 w_2\right).$$
We can scale the weights $w_1, w_2$ without changing the optimization problem, and we obtain the equivalent optimization problem
$$\arg\max_{w_1, w_2} \operatorname{cov}\!\left(X_1 w_1, X_2 w_2\right), \quad \text{subject to } \frac{1}{n}\left\|X_i w_i\right\|^2 = 1, \ i = 1, 2.$$

Optimization Problem 3.1. For $K$ blocks of variables, we define the following, general optimization problem:
$$\arg\max_{w} \sum_{k,l:\, c_{kl} \neq 0} g\!\left(\operatorname{cov}(X_k w_k, X_l w_l)\right), \quad \text{subject to } \frac{1}{n}\left\|X_k w_k\right\|^2 = 1.$$
Here, $g$ is one of three functions
$$g(x) = \begin{cases} x, & \text{Horst} \\ x^2, & \text{factorial} \\ |x|, & \text{centroid.} \end{cases}$$

The terms Horst, factorial and centroid are called schemes in the PLS literature. We call the first scheme the Horst scheme as it is equivalent to a generalization of CCA to more than two blocks that is described in Horst (1961) and Horst (1965). The terms factorial and centroid stem from the PLS literature. We remark that the Horst scheme is not used in the PLS community, although it has been suggested as an alternative to the two other schemes by Hanafi & Qannari (2005).

Let us define the real-valued function
$$f_g(w) = \sum_{k,l:\, c_{kl} \neq 0} g\!\left(\operatorname{cov}(X_k w_k, X_l w_l)\right) = \sum_{k,l=1}^K c_{kl}\, g\!\left(w_k^t S_{kl} w_l\right),$$
with $w$ defined in (3.4). The Lagrangian function associated to problem 3.1 is
$$L(w, \lambda) = f_g(w) - \frac{1}{2}\sum_{k=1}^K \lambda_k\left(w_k^t S_{kk} w_k - 1\right) \qquad (3.5)$$
with $\lambda = (\lambda_1, \ldots, \lambda_K) \in \mathbb{R}^K$ the Lagrangian multipliers. The factor $-1/2$ is added in order to avoid a rescaling of the multipliers $\lambda_i$. We set
$$\theta_{kl} = \theta_{kl}(w) = \operatorname{cov}\!\left(X_k w_k, X_l w_l\right) = w_k^t S_{kl} w_l.$$
Differentiating the Lagrangian function (3.5), we obtain
$$\frac{\partial L}{\partial w_k} = \sum_{l=1}^K c_{kl}\, g'(\theta_{kl})\, S_{kl} w_l - \lambda_k S_{kk} w_k, \qquad (3.7)$$
$$\frac{\partial L}{\partial \lambda_k} = w_k^t S_{kk} w_k - 1.$$
We set
$$S_g(w) = \left[c_{kl}\, g'(\theta_{kl})\, S_{kl}\right], \qquad (3.8)$$
which implies that
$$\frac{\partial f_g}{\partial w}(w) = S_g(w)\, w.$$
These equations can be represented in a compact form.

Proposition 3.2. The Lagrangian equations (A.6) and (A.7) of the optimization problem 3.1 are
$$S_g(w)\, w = \Lambda S_D w, \qquad w_k^t S_{kk} w_k = 1.$$
The matrix $\Lambda$ is a diagonal matrix that is of the form
$$\Lambda = \operatorname{diag}\left[\lambda_1 I_{p_1}, \ldots, \lambda_K I_{p_K}\right] \in \mathbb{R}^{p \times p}.$$
We might think of $\lambda = (\lambda_1, \ldots, \lambda_K)$ as a vector of multivariate eigenvalues. Note that in the Horst scheme, the matrix $S_g(w)$ does not depend on $w$ and in this case, the problem is called a multivariate eigenvalue problem (Chu & Watterson 1993).

Remark 3.3. Of course, in the centroid scheme, the function $f_g(w)$ is not differentiable on $\mathbb{R}^p$. We therefore have to restrict $f_g$ onto the open subset
$$M = \left\{w \in \mathbb{R}^p \,\middle|\, w_k^t S_{kl} w_l \neq 0\right\}.$$
Note that we can decompose $M$ into finitely many disjoint open subsets
$$M_I = \left\{w \in \mathbb{R}^p \,\middle|\, \operatorname{sign}\!\left(w_k^t S_{kl} w_l\right) = I_{kl}\right\}.$$
Here $I \in \{\pm 1\}^{K \times K}$ is a symmetric matrix with diagonal elements equal to $1$. Whenever we speak of a derivative of $f_g$, we implicitly assume that $f_g$ is restricted to one of these subsets. Note that on any of the subsets $M_I$, the matrix $S_g(w)$ that is defined in (3.8) does not depend on $w$.

Any solution of the equations in Proposition 3.2 is by definition a stationary point of the optimization problem 3.1. In general, there might be more than one stationary point.

Lemma 3.4. The stationary point $w$ that is a solution of 3.1 is the one such that the sum of the corresponding multivariate eigenvalues is maximal.

Proof. We first note that
$$g'(x)\,x = \tilde c\, g(x), \qquad \tilde c = \begin{cases} 1, & \text{Horst, centroid} \\ 2, & \text{factorial.} \end{cases}$$
It follows that for all $w$ that are stationary points,
$$\sum_{k,l} c_{kl}\, g\!\left(w_k^t S_{kl} w_l\right) = \tilde c^{\,-1} \sum_{k,l} c_{kl}\, g'\!\left(w_k^t S_{kl} w_l\right) w_k^t S_{kl} w_l = \tilde c^{\,-1}\, w^t S_g(w)\, w = \tilde c^{\,-1} \sum_k \lambda_k\, w_k^t S_{kk} w_k = \tilde c^{\,-1} \sum_k \lambda_k.$$

If we want to maximize the covariance between latent variables instead of the correlation, we have to change the constraints in 3.1. We obtain
$$\arg\max_{w} \sum_{k,l:\, c_{kl} \neq 0} g\!\left(\operatorname{cov}(X_k w_k, X_l w_l)\right), \quad \text{subject to } \frac{1}{n}\left\|w_k\right\|^2 = 1, \qquad (3.10)$$
and the corresponding Lagrangian equations are
$$S_g(w)\, w = \Lambda w, \qquad \frac{1}{n}\, w_k^t w_k = 1.$$
We can always transform optimization problem 3.1 into (3.10): Denote by $\sqrt{S_{kk}}$ the root of the positive-semidefinite matrix $S_{kk}$. We set
$$\tilde w_k = \sqrt{S_{kk}}\, w_k, \qquad (3.11)$$
$$\sqrt{S_D} = \operatorname{diag}\left[\sqrt{S_{11}}, \ldots, \sqrt{S_{KK}}\right], \qquad (3.12)$$
$$\tilde S = \sqrt{S_D}^{\,-}\, S\, \sqrt{S_D}^{\,-}. \qquad (3.13)$$
It follows readily from the singular value decomposition of $X_k$ and $X_l$ that
$$S_{kl} = \sqrt{S_{kk}}\, \sqrt{S_{kk}}^{\,-}\, S_{kl}\, \sqrt{S_{ll}}^{\,-}\, \sqrt{S_{ll}}.$$
We conclude that optimization problem 3.1 is equivalent to
$$\arg\max_{\tilde w} \sum_{k,l:\, c_{kl} \neq 0} g\!\left(\tilde w_k^t\, \tilde S_{kl}\, \tilde w_l\right), \quad \text{subject to } \frac{1}{n}\left\|\tilde w_k\right\|^2 = 1.$$
3.3 Multivariate Power Methods
If we want to find the optimal solution of the optimization problem 3.1, we can proceed stepwise. First, we compute a solution of the associated Lagrangian equations. As this is a stationary point, we then have to check if this is the optimal one. One possibility to solve eigenproblems as in Proposition 3.2 is to apply a multivariate version of the power method defined in Algorithm A.9.

Algorithm 3.5 (Multivariate Power Method). After the initialization of weight vectors $w = (w_1, \ldots, w_K)$ such that $w_k^t S_{kk} w_k = 1$, we iteratively compute
$$\tilde w^{(i+1)} = S_D^{\,-}\, S_g\!\left(w^{(i)}\right) w^{(i)} \qquad \text{(iteration)}$$
$$w_k^{(i+1)} = \tilde w_k^{(i+1)} \Big/ \sqrt{\left(\tilde w_k^{(i+1)}\right)^{t} S_{kk}\, \tilde w_k^{(i+1)}}, \quad k = 1, \ldots, K \qquad \text{(normalization)}$$

If the multivariate power method converges to a vector $w$, this is obviously a solution of the Lagrangian equations 3.2. Note that in Algorithm 3.5, all weight vectors $w_k$ are updated simultaneously. There is a variation of the power method where in each round, only one weight vector is updated. We call this algorithm a multivariate Gauss-Seidel algorithm. In order to have a compact representation, we define
$$\theta_{kl}(w, v) = w_k^t S_{kl} v_l.$$
We set
$$S_g(w, v) = \left[c_{kl}\, g'\!\left(\theta_{kl}(w, v)\right) S_{kl}\right].$$
Note that $\theta_{kl}(w) = \theta_{kl}(w, w)$ and $S_g(w) = S_g(w, w)$. Now, let us decompose
$$S_g(w, v) = U_g^t(w, v) + U_g(w, v)$$
with $U_g$ the strictly upper triangular part of $S_g$. (Recall that the block diagonal of $S_g(w, v)$ is zero, as $c_{kk} = 0$.)

Algorithm 3.6 (Multivariate Gauss-Seidel Algorithm). After the initialization of weight vectors $w = (w_1, \ldots, w_K)$ such that $w_k^t S_{kk} w_k = 1$, we iteratively compute for $k = 1, \ldots, K$
$$\tilde w^{(i+1)} = S_D^{\,-}\left(U_g^t\!\left(w^{(i)}, w^{(i+1)}\right) w^{(i+1)} + U_g\!\left(w^{(i)}, w^{(i)}\right) w^{(i)}\right)$$
$$w_k^{(i+1)} = \tilde w_k^{(i+1)} \Big/ \sqrt{\left(\tilde w_k^{(i+1)}\right)^{t} S_{kk}\, \tilde w_k^{(i+1)}}.$$
For the Horst scheme, the Gauss-Seidel algorithm is already defined in Chu & Watterson (1993).

Proposition 3.7. If the multivariate Gauss-Seidel algorithm converges to a vector $w$, this is a stationary point of 3.1.

Proof. If $w$ is the solution of the multivariate Gauss-Seidel algorithm, we conclude that $\tilde w = \Lambda w$ with $\tilde w$ defined in Algorithm 3.6. We plug this into the formula for $\tilde w$ and obtain
$$S_D \Lambda w = U_g^t(w, w)\, w + U_g(w, w)\, w = S_g(w)\, w,$$
which is the first Lagrangian equation in Proposition 3.2.
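A minimal R sketch of the multivariate power method (Algorithm 3.5) is given below for the Horst scheme, where $g'(x) = 1$ and $S_g(w) = [c_{kl} S_{kl}]$ does not depend on $w$. The blocks, the link matrix and the fixed number of iterations are made-up choices, and the sketch assumes that each $S_{kk}$ is invertible (otherwise a pseudoinverse would be needed).

```r
## Multivariate power method (Algorithm 3.5), Horst scheme.
set.seed(1)
n  <- 50; pk <- c(3, 2, 4); K <- length(pk)
Xk <- lapply(pk, function(p) scale(matrix(rnorm(n * p), n, p), scale = FALSE))
C  <- matrix(1, K, K) - diag(K)                   # link matrix: all blocks linked, c_kk = 0

S <- function(k, l) crossprod(Xk[[k]], Xk[[l]]) / n        # empirical S_kl
## initial weights with w_k^t S_kk w_k = 1
wk <- lapply(1:K, function(k) {
  w <- rep(1, pk[k]); w / sqrt(drop(t(w) %*% S(k, k) %*% w))
})

for (iter in 1:100) {
  wk_new <- vector("list", K)
  for (k in 1:K) {
    ## iteration step: tilde w_k = S_kk^- sum_l c_kl g'(theta_kl) S_kl w_l  (g' = 1 here)
    rhs <- Reduce(`+`, lapply(1:K, function(l) C[k, l] * S(k, l) %*% wk[[l]]))
    w   <- solve(S(k, k), rhs)
    wk_new[[k]] <- w / sqrt(drop(t(w) %*% S(k, k) %*% w))  # normalization step
  }
  wk <- wk_new
}
zk <- lapply(1:K, function(k) Xk[[k]] %*% wk[[k]])         # estimated latent variables
```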
3.4 The Partial Least Squares Path Algorithms
Now, we return to the PLS framework introduced in Section 3.1. There are two important algorithms that try to compute the latent variables in the path model, Lohmöller (Lohmöller 1989) and Wold (Wold 1985). The algorithms presented here are in mode B. Mode B refers to a path model where all blocks are supposed to be formative. PLS in mode A (which corresponds to reflective blocks) is discussed in Section 3.5. The common point of view is that these algorithms are alternating algorithms in the sense that we iteratively estimate first the inner model and then the outer model. We show that Lohmöller corresponds to the multivariate power method 3.5 and that Wold corresponds to the multivariate Gauss-Seidel algorithm 3.6. As a consequence, we can conclude that, if all blocks are formative, the PLS solutions are indeed stationary points of the optimization criterion 3.1.

Let us start with Lohmöller's algorithm.

Algorithm 3.8 (Lohmöller's algorithm). After the initialization of a weight vector $w^{(0)}$ such that $z_k^{(0)} = X_k w_k^{(0)}$ has length $\sqrt{n}$, we iteratively compute for all $k$ simultaneously
$$\tilde z_k^{(i+1)} = \sum_{l=1}^K c_{kl}\, g'\!\left(\theta_{kl}\!\left(w^{(i)}\right)\right) z_l^{(i)} \qquad \text{(inner model: environmental variable)}$$
$$\tilde w_k^{(i+1)} = \left(X_k^t X_k\right)^{-} X_k^t\, \tilde z_k^{(i+1)} \qquad \text{(outer model in mode B)}$$
$$w_k^{(i+1)} = \sqrt{n}\; \tilde w_k^{(i+1)} \Big/ \left\|X_k \tilde w_k^{(i+1)}\right\| \qquad \text{(normalization)}$$
$$z_k^{(i+1)} = X_k w_k^{(i+1)} \qquad \text{(update)}$$
We remark that the term "environmental variable" is part of the PLS nomenclature.

Proposition 3.9. The Lohmöller algorithm is equal to the multivariate power method.

Proof. The proof is straightforward. Note that the formula for the environmental variable equals