
the number of eigenvalues for which the relative error is computed. As discussed in Chapter 3, there exists a trade-off between $C^r_n$ and $E^r_n$, such that there is no easy optimal choice. One usually considers $r$ fixed such that $\|E^r_n\|$ is small, around the precision of the underlying finite-precision arithmetic.

The terms $\|C^r_n\|$ and $\|E^r_n\|$ have been extensively studied for two cases (kernels with bounded eigenfunctions and bounded kernel functions) in Sections 3.9 and 3.10. Asymptotically, $\|C^r_n\| \to 0$, and $\|E^r_n\|$ is upper bounded by $\sum_{i=r+1}^{\infty} \lambda_i^2$ (see Theorem 3.131).

This theorem improves upon results from Shawe-Taylor and Williams (2003) and Zwald et al. (2004), which only treat sums of the leading eigenvalues and sums of all but the first few eigenvalues.

Moreover, those bounds are independent of the number of eigenvalues or the magnitude of the true eigenvalues under consideration, overestimating the error if the eigenvalues are very small. The present theorem also addresses the finite-sample case, in contrast to the central limit theorems of Dauxois et al. (1982) and Koltchinskii and Giné (2000).

The result for kernel PCA has interesting implications for the structure of a finite sample in feature space. A smooth kernel results in a feature map whose asymptotic principal values decay quickly. This means that there are only a few directions along which the variance of the data in feature space is large. In principle, it is conceivable that a finite sample effectively covers a much larger space than in the asymptotic case, because the principal values converge only slowly due to a large fourth moment of the data. Since a sample of size $n$ spans an $n$-dimensional subspace in feature space, in the worst case the effective dimension of the data scales with the sample size, rendering the learning problem hard. However, Theorem 5.3 implies that the estimated principal values approximate the true ones with high precision, such that they also decay quickly. Consequently, a finite sample is contained in a low-dimensional subspace. This can be made precise via approximation bounds on the projection error of kernel PCA.

Theorem 5.6 (Projection error for kernel PCA)

(Theorem 5.54 in the main text) The squared projection error $P^2_d$ for the first $d$ dimensions is bounded by

$$P^2_d \le (1 + \|C^r_n\|)\,\Pi^2_d + n\|E^r_n\|, \qquad (5.7)$$

where $C^r_n$ and $E^r_n$ are the same error matrices as in Theorem 5.3.

Here, $\Pi^2_d$ is the asymptotic reconstruction error, given by $\Pi^2_d = \sum_{i=d+1}^{\infty} \lambda_i$. For fixed $r$, the absolute term scales linearly in $n$, but note that $\|E^r_n\|$ will typically be very small. As $n \to \infty$, $r$ can be chosen accordingly, such that the bound converges to $\Pi^2_d$. Thus, this theorem amounts to a relative-absolute bound on the projection error, which also results in a much tighter bound for large $d$ compared to existing approaches.
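To make the quantity $P^2_d$ concrete, here is a minimal NumPy sketch (with a hypothetical Gaussian sample and RBF kernel, not taken from the main text) that estimates the empirical squared projection error of kernel PCA by eigendecomposing the kernel matrix and summing the eigenvalue mass beyond the first $d$ directions:

```python
import numpy as np

def rbf_kernel(X, width=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 * width^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * width**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # hypothetical sample of size n = 200 in R^3
K = rbf_kernel(X)

# Eigenvalues of the kernel matrix, sorted in descending order.
evals = np.linalg.eigvalsh(K)[::-1]

# Average squared residual of the sample after projecting onto the first d
# kernel PCA directions: the eigenvalue mass not captured by those directions.
for d in (1, 5, 20):
    proj_err = evals[d:].sum() / len(X)
    print(f"d = {d:2d}: squared projection error per sample = {proj_err:.4f}")
```

The rapid decay of the printed values mirrors the decay of the kernel eigenvalues for a smooth kernel.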

5.3 Principal Component Analysis

We briefly review principal component analysis. Let $X$ be a random variable in $\mathbb{R}^d$ with distribution $\mu$. The covariance matrix $\mathrm{Cov}(X)$ is the $d \times d$ matrix with entries

$$[\mathrm{Cov}(X)]_{ij} = \mathbb{E}\big(([X]_i - \mathbb{E}[X]_i)([X]_j - \mathbb{E}[X]_j)\big). \qquad (5.8)$$

The first principal component is given by the unit-length vector $v \in \mathbb{R}^d$ which maximizes

$$\mathrm{Var}(X^\top v) = v^\top \mathrm{Cov}(X)\, v. \qquad (5.9)$$

By the variational characterization of eigenvalues (see Theorem 3.35), it follows that the solution $v_1$ is given by an eigenvector to the largest eigenvalue of $\mathrm{Cov}(X)$. The second principal component is the vector which maximizes $v^\top \mathrm{Cov}(X)\, v$ under the constraint that $v \perp v_1$, because $v^\top X$ should be uncorrelated with $v_1^\top X$. The solution is given by an eigenvector to the second largest eigenvalue of $\mathrm{Cov}(X)$, and so on.
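As a small numerical illustration of this variational characterization (a sketch with hypothetical values, not part of the text), the variance $v^\top \mathrm{Cov}(X)\, v$ attained by the top eigenvector of a fixed covariance matrix can be compared with that of an arbitrary unit vector:

```python
import numpy as np

# A fixed covariance matrix with hypothetical example values.
cov = np.array([[4.0, 1.0, 0.0],
                [1.0, 2.0, 0.5],
                [0.0, 0.5, 1.0]])

# eigh returns eigenvalues in ascending order; reverse to descending.
lam, V = np.linalg.eigh(cov)
lam, V = lam[::-1], V[:, ::-1]

v1 = V[:, 0]                                        # first principal direction
print("largest eigenvalue:       ", lam[0])
print("variance along v1:        ", v1 @ cov @ v1)  # equals the largest eigenvalue

# Any other unit vector attains at most this variance, e.g. a random one:
rng = np.random.default_rng(1)
v = rng.normal(size=3)
v /= np.linalg.norm(v)
print("variance along a random v:", v @ cov @ v)
```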

Now, given only a finite sample $X_1, \dots, X_n \in \mathbb{R}^d$, the (population) principal components are estimated using the sample covariance matrix

$$[C_n]_{ij} = \frac{1}{n-1}\sum_{\ell=1}^{n} ([X_\ell]_i - [\bar X]_i)([X_\ell]_j - [\bar X]_j), \qquad (5.10)$$

with $\bar X = \frac{1}{n}\sum_{\ell=1}^{n} X_\ell$ (see, for example, Pestman (1998)). The sample covariance matrix can be thought of as an approximation of $\mathrm{Cov}(X)$ obtained by replacing the expectations by empirical averages. The estimated principal components $u_1, \dots, u_d$ are the eigenvectors of $C_n$, and the estimated principal values $l_1, \dots, l_d$ are the eigenvalues of $C_n$.

For simplicity, we will assume that $\mathbb{E}(X) = 0$. In this case, one can also consider the already centered sample covariance matrix

$$[C_n]_{ij} = \frac{1}{n}\sum_{\ell=1}^{n} [X_\ell]_i [X_\ell]_j. \qquad (5.11)$$
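The following sketch (NumPy, with a hypothetical zero-mean Gaussian distribution) forms both estimators (5.10) and (5.11) from a sample and reads off the estimated principal values as their eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
true_cov = np.diag([3.0, 1.0, 0.1])     # hypothetical Cov(X), mean zero
n = 500
X = rng.multivariate_normal(np.zeros(3), true_cov, size=n)   # rows are X_1, ..., X_n

# (5.10): centered sample covariance with the 1/(n-1) normalization.
Xbar = X.mean(axis=0)
C_centered = (X - Xbar).T @ (X - Xbar) / (n - 1)

# (5.11): since E(X) = 0, centering can be skipped and 1/n used instead.
C_uncentered = X.T @ X / n

l_centered = np.sort(np.linalg.eigvalsh(C_centered))[::-1]
l_uncentered = np.sort(np.linalg.eigvalsh(C_uncentered))[::-1]
print("estimated principal values (5.10):", l_centered)
print("estimated principal values (5.11):", l_uncentered)
print("true principal values:            ", np.diag(true_cov))
```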

Since we are in a finite-dimensional setting, it is rather easy to prove that the principal values converge: since $C_n$ has only finitely many entries and, by the strong law of large numbers, $[C_n]_{ij} \to [\mathrm{Cov}(X)]_{ij}$ entrywise, we have $\|C_n - \mathrm{Cov}(X)\| \to 0$, and consequently the eigenvalues of $C_n$ converge to those of $\mathrm{Cov}(X)$ by Weyl's theorem (Theorem 3.49),

$$|l_i - \lambda_i| \le \|C_n - \mathrm{Cov}(X)\|. \qquad (5.12)$$

This bound is again absolute in the sense that the same bound is applied to all principal values, leading to a rather pessimistic error estimate for the smaller principal values.
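A quick numerical check (again with hypothetical values, not from the text) shows why the absolute bound (5.12) is pessimistic for small principal values: the single number $\|C_n - \mathrm{Cov}(X)\|$ is compared with each individual error $|l_i - \lambda_i|$:

```python
import numpy as np

rng = np.random.default_rng(3)
true_cov = np.diag([3.0, 1.0, 0.01])    # smallest principal value is tiny
lam = np.diag(true_cov)
n = 500
X = rng.multivariate_normal(np.zeros(3), true_cov, size=n)
C_n = X.T @ X / n                       # estimator (5.11), E(X) = 0

l = np.sort(np.linalg.eigvalsh(C_n))[::-1]
bound = np.linalg.norm(C_n - true_cov, 2)   # operator norm ||C_n - Cov(X)||

for i in range(3):
    print(f"i = {i + 1}: |l_i - lambda_i| = {abs(l[i] - lam[i]):.4f}, "
          f"Weyl bound = {bound:.4f}")
# The same bound is reported for every i, grossly overestimating the error
# of the smallest principal value.
```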

We can obtain a relative perturbation bound using the same techniques as in Chapter 3. Here and in the following we will always assume that all principal values are non-zero. If this is not the case, one considers the subspace spanned by the principal directions corresponding to non-zero principal values, because samples lie in this subspace almost surely.

Theorem 5.13 Let $\mu$ be a probability distribution on $\mathbb{R}^d$ with mean zero, non-zero principal values $\lambda_1, \dots, \lambda_d$, and principal directions $\psi_1, \dots, \psi_d$. Let $l_1, \dots, l_d$ be the estimated principal values and $u_1, \dots, u_d$ the estimated principal directions based on an i.i.d. sample $X_1, \dots, X_n$ from $\mu$. Then,

$$|l_i - \lambda_i| \le \lambda_i \|R_n - I_d\|, \qquad (5.14)$$

where $R_n$ is the $d \times d$ matrix with entries

$$[R_n]_{ij} = \frac{1}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top C_n \psi_j. \qquad (5.15)$$

It holds that $\|R_n - I_d\| \to 0$ almost surely.
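A direct numerical check of this bound (a sketch under hypothetical choices of $\lambda_i$ and $\psi_i$) forms $R_n$ from (5.15) explicitly and compares $\lambda_i\|R_n - I_d\|$ with the actual errors:

```python
import numpy as np

rng = np.random.default_rng(4)
lam = np.array([3.0, 1.0, 0.01])                 # hypothetical principal values
Psi = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # orthonormal principal directions
true_cov = Psi @ np.diag(lam) @ Psi.T
n = 500
X = rng.multivariate_normal(np.zeros(3), true_cov, size=n)
C_n = X.T @ X / n

# (5.15): [R_n]_{ij} = psi_i^T C_n psi_j / sqrt(lambda_i * lambda_j)
R_n = np.diag(lam**-0.5) @ Psi.T @ C_n @ Psi @ np.diag(lam**-0.5)
rel = np.linalg.norm(R_n - np.eye(3), 2)

l = np.sort(np.linalg.eigvalsh(C_n))[::-1]
for i in range(3):
    print(f"i = {i + 1}: |l_i - lambda_i| = {abs(l[i] - lam[i]):.4f}, "
          f"lambda_i * ||R_n - I|| = {lam[i] * rel:.4f}")
# Unlike the absolute bound, the relative bound shrinks with lambda_i.
```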

Proof Let $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ and let $\Psi$ be the $d \times d$ matrix whose columns are $\psi_1, \dots, \psi_d$. Then $\Psi^\top \Psi = \Psi \Psi^\top = I_d$, because the $(\psi_i)$ are orthonormal (since they are the eigenvectors of the symmetric matrix $\mathrm{Cov}(X)$). Let $X$ be the $d \times n$ matrix whose columns are the samples $X_1, \dots, X_n$. Using $X$, the sample covariance from (5.11) can be written as

$$C_n = \frac{1}{n} X X^\top. \qquad (5.16)$$

Now $X X^\top$ has the same non-zero eigenvalues as $X^\top X$, and we can re-write $X^\top X / n$ as

$$\frac{1}{n} X^\top X = \frac{1}{n} X^\top \Psi \Psi^\top X = \left(\frac{1}{\sqrt n}\, X^\top \Psi \Lambda^{-\frac12}\right) \Lambda \left(\frac{1}{\sqrt n}\, \Lambda^{-\frac12} \Psi^\top X\right). \qquad (5.17)$$

Thus, $X^\top X / n$ can be considered a multiplicative perturbation of $\Lambda$ to $S \Lambda S^\top$ with

$$S = \frac{1}{\sqrt n}\, X^\top \Psi \Lambda^{-\frac12}. \qquad (5.18)$$

Since $\Lambda$ has eigenvalues $\lambda_1, \dots, \lambda_d$, it follows from Ostrowski's theorem (Theorem 3.52) that

$$|l_i - \lambda_i| \le \lambda_i \|S^\top S - I\|. \qquad (5.19)$$

Let us compute the entries of $S^\top S$. First note that

$$[S]_{ij} = \frac{1}{\sqrt{n \lambda_j}}\, X_i^\top \psi_j, \qquad (5.20)$$

and consequently

$$[S^\top S]_{ij} = \sum_{\ell=1}^{n} \frac{1}{n\sqrt{\lambda_i \lambda_j}}\, X_\ell^\top \psi_i\, X_\ell^\top \psi_j = \frac{1}{n\sqrt{\lambda_i \lambda_j}} \sum_{\ell=1}^{n} \psi_i^\top X_\ell X_\ell^\top \psi_j = \frac{1}{n\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top X X^\top \psi_j. \qquad (5.21)$$

Re-substituting $C_n = X X^\top / n$ yields (5.15).

With respect to convergence, note that

$$\frac{1}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top C_n \psi_j \to \frac{1}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top \mathrm{Cov}(X)\, \psi_j \quad \text{almost surely.} \qquad (5.22)$$

Since $\psi_i$ is an eigenvector of $\mathrm{Cov}(X)$,

$$\frac{1}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top \mathrm{Cov}(X)\, \psi_j = \frac{\lambda_i}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top \psi_j = \delta_{ij}, \qquad (5.23)$$

and therefore $R_n \to I_d$.

Let us take a look at the error matrix $R_n$. This matrix is really the sample covariance matrix of the random variable $X$ after it has been transformed into the basis of principal components and scaled along the principal components such that the variance becomes 1. To see this, note first that $\psi_i^\top X_\ell$, occurring in formula (5.21), is $X_\ell$ projected onto the $i$th principal direction. The variance along this direction is $\mathrm{Var}(\psi_i^\top X_\ell) = \lambda_i$. By dividing by $\sqrt{\lambda_i}$, the variance becomes 1.

Using these projected and scaled versions of $X$, one obtains a random variable which has the same principal directions as $X$ but principal values equal to 1 by

$$\sum_{i=1}^{d} \psi_i \frac{1}{\sqrt{\lambda_i}}\, \psi_i^\top X_\ell. \qquad (5.24)$$

On the other hand, one could also replace the first term $\psi_i$ in the sum by the standard basis of $\mathbb{R}^d$ and obtain the random variable

$$Z_\ell = \left(\frac{1}{\sqrt{\lambda_1}}\, \psi_1^\top X_\ell, \; \dots, \; \frac{1}{\sqrt{\lambda_d}}\, \psi_d^\top X_\ell\right), \qquad (5.25)$$

which is already rotated such that the principal directions are given by the coordinate axes. The sample covariance matrix of $Z_\ell$ has entries

$$\frac{1}{n\sqrt{\lambda_i \lambda_j}} \sum_{\ell=1}^{n} \psi_i^\top X_\ell\, \psi_j^\top X_\ell, \qquad (5.26)$$

which is just the same term as in (5.21). Therefore, the convergence depends on how fast the sample covariance matrix of the projected and scaled $X$ converges to $I_d$.
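This reinterpretation is easy to verify numerically (a sketch reusing the hypothetical setup from above): the sample covariance of the whitened variables $Z_\ell$ from (5.25) coincides with $R_n$ from (5.15):

```python
import numpy as np

rng = np.random.default_rng(5)
lam = np.array([3.0, 1.0, 0.01])
Psi = np.linalg.qr(rng.normal(size=(3, 3)))[0]
true_cov = Psi @ np.diag(lam) @ Psi.T
n = 500
X = rng.multivariate_normal(np.zeros(3), true_cov, size=n)   # rows are X_1, ..., X_n
C_n = X.T @ X / n

# (5.25): Z_l = (psi_1^T X_l / sqrt(lambda_1), ..., psi_d^T X_l / sqrt(lambda_d))
Z = (X @ Psi) / np.sqrt(lam)       # project onto principal directions, then rescale

# Sample covariance of the Z_l (already centered, since E(X) = 0) ...
C_Z = Z.T @ Z / n
# ... coincides with R_n from (5.15):
R_n = np.diag(lam**-0.5) @ Psi.T @ C_n @ Psi @ np.diag(lam**-0.5)
print("max |C_Z - R_n| =", np.abs(C_Z - R_n).max())   # numerically zero
```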

Contrast this result with the naive relative bound obtained by dividing by $\lambda_i$:

$$|l_i - \lambda_i| \le \lambda_i\, \frac{\|C_n - \mathrm{Cov}(X)\|}{\lambda_i}. \qquad (5.27)$$

Here, the whole error term is scaled by $\lambda_i$, whereas in Theorem 5.13, the principal directions are scaled individually, such that the resulting error term really measures the error in each direction at the given relative scale.
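The difference between the two bounds is visible in a small experiment (hypothetical values, same setup as before): the naive factor $\|C_n - \mathrm{Cov}(X)\|/\lambda_i$ explodes for small $\lambda_i$, while $\|R_n - I_d\|$ remains moderate for all directions:

```python
import numpy as np

rng = np.random.default_rng(6)
lam = np.array([3.0, 1.0, 0.01])
Psi = np.linalg.qr(rng.normal(size=(3, 3)))[0]
true_cov = Psi @ np.diag(lam) @ Psi.T
X = rng.multivariate_normal(np.zeros(3), true_cov, size=500)
C_n = X.T @ X / 500

R_n = np.diag(lam**-0.5) @ Psi.T @ C_n @ Psi @ np.diag(lam**-0.5)
abs_err = np.linalg.norm(C_n - true_cov, 2)     # error term of the naive bound (5.27)
rel_err = np.linalg.norm(R_n - np.eye(3), 2)    # error term of Theorem 5.13

for i, li in enumerate(lam):
    print(f"lambda_{i + 1} = {li:5.2f}: "
          f"||C_n - Cov(X)|| / lambda_i = {abs_err / li:8.3f}, "
          f"||R_n - I|| = {rel_err:.3f}")
```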