
the number of eigenvalues for which the relative error is computed. As discussed in Chapter 3, there exists a trade-off between $C^r_n$ and $E^r_n$, such that there is no easy optimal choice. One usually considers $r$ fixed such that $\|E^r_n\|$ is small, around the precision of the underlying finite-precision arithmetic.

The terms $\|C^r_n\|$ and $\|E^r_n\|$ have been extensively studied for two cases (kernels with bounded eigenfunctions and bounded kernel functions) in Sections 3.9 and 3.10. Asymptotically, $\|C^r_n\| \to 0$, and $\|E^r_n\|$ is upper bounded by $\sum_{i=r+1}^{\infty} \lambda_i^2$ (see Theorem 3.131).

This theorem improves upon results from Shawe-Taylor and Williams (2003) and Zwald et al. (2004), which only treat sums of the leading eigenvalues and sums of all but the first few eigenvalues.

Moreover, those bounds are independent of the number of eigenvalues or the magnitude of the true eigenvalues under consideration, overestimating the error if the eigenvalues are very small. The present theorem also addresses the finite-sample case, in contrast to the central limit theorems of Dauxois et al. (1982) and Koltchinskii and Giné (2000).

The result for kernel PCA has interesting implications for the structure of a finite sample in feature space. A smooth kernel results in a feature map whose asymptotic principal values decay quickly. This means that there are only a few directions along which the variance of the data in feature space is large. In principle, it is conceivable that a finite sample effectively covers a much larger space than in the asymptotic case, because the principal values converge only slowly due to a large fourth moment of the data. Since a sample of size $n$ spans an $n$-dimensional subspace in feature space, in the worst case the effective dimension of the data scales with the sample size, rendering the learning problem hard. However, Theorem 5.3 implies that the estimated principal values approximate the true ones with high precision, such that they also decay quickly. Consequently, a finite sample is contained in a low-dimensional subspace. This can be made precise via approximation bounds on the projection error of kernel PCA.

Theorem 5.6 (Projection error for kernel PCA)

(Theorem 5.54 in the main text) The squared projection error $P^2_d$ for the first $d$ dimensions is bounded by

$$P^2_d \le (1 + \|C^r_n\|)\,\Pi^2_d + n\|E^r_n\|, \qquad (5.7)$$

where $C^r_n$ and $E^r_n$ are the same error matrices as in Theorem 5.3.

Here, $\Pi^2_d$ is the asymptotic reconstruction error, given by $\Pi^2_d = \sum_{i=d+1}^{\infty} \lambda_i$. For fixed $r$, the absolute term scales linearly in $n$, but note that $\|E^r_n\|$ will typically be very small. As $n \to \infty$, $r$ can be chosen accordingly, such that the bound converges to $\Pi^2_d$. Thus, this theorem amounts to a relative-absolute bound on the projection error, which also results in a much tighter bound for large $d$ compared to existing approaches.
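To make the quantity $P^2_d$ concrete, here is a minimal NumPy sketch (with a hypothetical Gaussian sample and RBF kernel, not taken from the main text) that estimates the empirical squared projection error of kernel PCA by eigendecomposing the kernel matrix and summing the eigenvalue mass beyond the first $d$ directions:

```python
import numpy as np

def rbf_kernel(X, width=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 * width^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * width**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # hypothetical sample of size n = 200 in R^3
K = rbf_kernel(X)

# Eigenvalues of the kernel matrix, sorted in descending order.
evals = np.linalg.eigvalsh(K)[::-1]

# Average squared residual of the sample after projecting onto the first d
# kernel PCA directions: the eigenvalue mass not captured by those directions.
for d in (1, 5, 20):
    proj_err = evals[d:].sum() / len(X)
    print(f"d = {d:2d}: squared projection error per sample = {proj_err:.4f}")
```

The rapid decay of the printed values mirrors the decay of the kernel eigenvalues for a smooth kernel.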

5.3 Principal Component Analysis

We briefly review principal component analysis. Let $X$ be a random variable in $\mathbb{R}^d$ with distribution $\mu$. The covariance matrix $\mathrm{Cov}(X)$ is the $d \times d$ matrix with entries

$$[\mathrm{Cov}(X)]_{ij} = \mathbb{E}\big(([X]_i - \mathbb{E}[X]_i)([X]_j - \mathbb{E}[X]_j)\big). \qquad (5.8)$$

The first principal component is given by the unit-length vector $v \in \mathbb{R}^d$ which maximizes

$$\mathrm{Var}(X^\top v) = v^\top \mathrm{Cov}(X)\, v. \qquad (5.9)$$

By the variational characterization of eigenvalues (see Theorem 3.35), it follows that the solution $v_1$ is given by an eigenvector to the largest eigenvalue of $\mathrm{Cov}(X)$. The second principal component is the vector which maximizes $v^\top \mathrm{Cov}(X)\, v$ under the constraint that $v \perp v_1$, because $v^\top X$ should be uncorrelated with $v_1^\top X$. The solution is given by an eigenvector to the second largest eigenvalue of $\mathrm{Cov}(X)$, and so on.
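As a small numerical illustration of this variational characterization (a sketch with hypothetical values, not part of the text), the variance $v^\top \mathrm{Cov}(X)\, v$ attained by the top eigenvector of a fixed covariance matrix can be compared with that of an arbitrary unit vector:

```python
import numpy as np

# A fixed covariance matrix with hypothetical example values.
cov = np.array([[4.0, 1.0, 0.0],
                [1.0, 2.0, 0.5],
                [0.0, 0.5, 1.0]])

# eigh returns eigenvalues in ascending order; reverse to descending.
lam, V = np.linalg.eigh(cov)
lam, V = lam[::-1], V[:, ::-1]

v1 = V[:, 0]                                        # first principal direction
print("largest eigenvalue:       ", lam[0])
print("variance along v1:        ", v1 @ cov @ v1)  # equals the largest eigenvalue

# Any other unit vector attains at most this variance, e.g. a random one:
rng = np.random.default_rng(1)
v = rng.normal(size=3)
v /= np.linalg.norm(v)
print("variance along a random v:", v @ cov @ v)
```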

Now, given only a finite sample $X_1, \dots, X_n \in \mathbb{R}^d$, the (population) principal components are estimated using the sample covariance matrix

$$[C_n]_{ij} = \frac{1}{n-1}\sum_{\ell=1}^{n} ([X_\ell]_i - [\bar X]_i)([X_\ell]_j - [\bar X]_j), \qquad (5.10)$$

with $\bar X = \frac{1}{n}\sum_{\ell=1}^{n} X_\ell$ (see, for example, Pestman (1998)). The sample covariance matrix can be thought of as an approximation of $\mathrm{Cov}(X)$ obtained by replacing the expectations by empirical averages. The estimated principal components $u_1, \dots, u_d$ are the eigenvectors of $C_n$, and the estimated principal values $l_1, \dots, l_d$ are the eigenvalues of $C_n$.

For simplicity, we will assume that $\mathbb{E}(X) = 0$. In this case, one can also consider the already centered sample covariance matrix

$$[C_n]_{ij} = \frac{1}{n}\sum_{\ell=1}^{n} [X_\ell]_i [X_\ell]_j. \qquad (5.11)$$
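The following sketch (NumPy, with a hypothetical zero-mean Gaussian distribution) forms both estimators (5.10) and (5.11) from a sample and reads off the estimated principal values as their eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
true_cov = np.diag([3.0, 1.0, 0.1])     # hypothetical Cov(X), mean zero
n = 500
X = rng.multivariate_normal(np.zeros(3), true_cov, size=n)   # rows are X_1, ..., X_n

# (5.10): centered sample covariance with the 1/(n-1) normalization.
Xbar = X.mean(axis=0)
C_centered = (X - Xbar).T @ (X - Xbar) / (n - 1)

# (5.11): since E(X) = 0, centering can be skipped and 1/n used instead.
C_uncentered = X.T @ X / n

l_centered = np.sort(np.linalg.eigvalsh(C_centered))[::-1]
l_uncentered = np.sort(np.linalg.eigvalsh(C_uncentered))[::-1]
print("estimated principal values (5.10):", l_centered)
print("estimated principal values (5.11):", l_uncentered)
print("true principal values:            ", np.diag(true_cov))
```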

Since we are in a finite-dimensional setting, it is rather easy to prove that the principal values converge: since $C_n$ has only finitely many entries and, by the strong law of large numbers, $[C_n]_{ij} \to [\mathrm{Cov}(X)]_{ij}$ entrywise, we have $\|C_n - \mathrm{Cov}(X)\| \to 0$, and consequently the eigenvalues of $C_n$ converge to those of $\mathrm{Cov}(X)$ by Weyl's theorem (Theorem 3.49),

$$|l_i - \lambda_i| \le \|C_n - \mathrm{Cov}(X)\|. \qquad (5.12)$$

This bound is again absolute in the sense that the same bound is applied to all principal values, leading to a rather pessimistic error estimate for the smaller principal values.
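A quick numerical check (again with hypothetical values, not from the text) shows why the absolute bound (5.12) is pessimistic for small principal values: the single number $\|C_n - \mathrm{Cov}(X)\|$ is compared with each individual error $|l_i - \lambda_i|$:

```python
import numpy as np

rng = np.random.default_rng(3)
true_cov = np.diag([3.0, 1.0, 0.01])    # smallest principal value is tiny
lam = np.diag(true_cov)
n = 500
X = rng.multivariate_normal(np.zeros(3), true_cov, size=n)
C_n = X.T @ X / n                       # estimator (5.11), E(X) = 0

l = np.sort(np.linalg.eigvalsh(C_n))[::-1]
bound = np.linalg.norm(C_n - true_cov, 2)   # operator norm ||C_n - Cov(X)||

for i in range(3):
    print(f"i = {i + 1}: |l_i - lambda_i| = {abs(l[i] - lam[i]):.4f}, "
          f"Weyl bound = {bound:.4f}")
# The same bound is reported for every i, grossly overestimating the error
# of the smallest principal value.
```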

We can obtain a relative perturbation bound using the same techniques as in Chapter 3. Here and in the following we will always assume that all principal values are non-zero. If this is not the case, one considers the subspace spanned by the principal directions corresponding to non-zero principal values, because samples lie in this subspace almost surely.

Theorem 5.13 Let $\mu$ be a probability distribution on $\mathbb{R}^d$ with mean zero, non-zero principal values $\lambda_1, \dots, \lambda_d$, and principal directions $\psi_1, \dots, \psi_d$. Let $l_1, \dots, l_d$ be the estimated principal values and $u_1, \dots, u_d$ the estimated principal directions based on an i.i.d. sample $X_1, \dots, X_n$ from $\mu$. Then,

$$|l_i - \lambda_i| \le \lambda_i \|R_n - I_d\|, \qquad (5.14)$$

where $R_n$ is the $d \times d$ matrix with entries

$$[R_n]_{ij} = \frac{1}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top C_n \psi_j. \qquad (5.15)$$

It holds that $\|R_n - I_d\| \to 0$ almost surely.
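A direct numerical check of this bound (a sketch under hypothetical choices of $\lambda_i$ and $\psi_i$) forms $R_n$ from (5.15) explicitly and compares $\lambda_i\|R_n - I_d\|$ with the actual errors:

```python
import numpy as np

rng = np.random.default_rng(4)
lam = np.array([3.0, 1.0, 0.01])                 # hypothetical principal values
Psi = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # orthonormal principal directions
true_cov = Psi @ np.diag(lam) @ Psi.T
n = 500
X = rng.multivariate_normal(np.zeros(3), true_cov, size=n)
C_n = X.T @ X / n

# (5.15): [R_n]_{ij} = psi_i^T C_n psi_j / sqrt(lambda_i * lambda_j)
R_n = np.diag(lam**-0.5) @ Psi.T @ C_n @ Psi @ np.diag(lam**-0.5)
rel = np.linalg.norm(R_n - np.eye(3), 2)

l = np.sort(np.linalg.eigvalsh(C_n))[::-1]
for i in range(3):
    print(f"i = {i + 1}: |l_i - lambda_i| = {abs(l[i] - lam[i]):.4f}, "
          f"lambda_i * ||R_n - I|| = {lam[i] * rel:.4f}")
# Unlike the absolute bound, the relative bound shrinks with lambda_i.
```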

Proof Let $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ and let $\Psi$ be the $d \times d$ matrix whose columns are $\psi_1, \dots, \psi_d$. Then $\Psi^\top \Psi = \Psi \Psi^\top = I_d$, because the $(\psi_i)$ are orthonormal (since they are the eigenvectors of the symmetric matrix $\mathrm{Cov}(X)$). Let $X$ be the $d \times n$ matrix whose columns are the samples $X_1, \dots, X_n$. Using $X$, the sample covariance from (5.11) can be written as

$$C_n = \frac{1}{n} X X^\top. \qquad (5.16)$$

Now $X X^\top$ has the same non-zero eigenvalues as $X^\top X$, and we can re-write $X^\top X / n$ as

$$\frac{1}{n} X^\top X = \frac{1}{n} X^\top \Psi \Psi^\top X = \left(\frac{1}{\sqrt n}\, X^\top \Psi \Lambda^{-\frac12}\right) \Lambda \left(\frac{1}{\sqrt n}\, \Lambda^{-\frac12} \Psi^\top X\right). \qquad (5.17)$$

Thus, $X^\top X / n$ can be considered a multiplicative perturbation of $\Lambda$ to $S \Lambda S^\top$ with

$$S = \frac{1}{\sqrt n}\, X^\top \Psi \Lambda^{-\frac12}. \qquad (5.18)$$

Since $\Lambda$ has eigenvalues $\lambda_1, \dots, \lambda_d$, it follows from Ostrowski's theorem (Theorem 3.52) that

$$|l_i - \lambda_i| \le \lambda_i \|S^\top S - I\|. \qquad (5.19)$$

Let us compute the entries of $S^\top S$. First note that

$$[S]_{ij} = \frac{1}{\sqrt{n \lambda_j}}\, X_i^\top \psi_j, \qquad (5.20)$$

and consequently

$$[S^\top S]_{ij} = \sum_{\ell=1}^{n} \frac{1}{n\sqrt{\lambda_i \lambda_j}}\, X_\ell^\top \psi_i\, X_\ell^\top \psi_j = \frac{1}{n\sqrt{\lambda_i \lambda_j}} \sum_{\ell=1}^{n} \psi_i^\top X_\ell X_\ell^\top \psi_j = \frac{1}{n\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top X X^\top \psi_j. \qquad (5.21)$$

Re-substituting $C_n = X X^\top / n$ yields (5.15).

With respect to convergence, note that

$$\frac{1}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top C_n \psi_j \to \frac{1}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top \mathrm{Cov}(X)\, \psi_j \quad \text{almost surely.} \qquad (5.22)$$

Since $\psi_i$ is an eigenvector of $\mathrm{Cov}(X)$,

$$\frac{1}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top \mathrm{Cov}(X)\, \psi_j = \frac{\lambda_i}{\sqrt{\lambda_i \lambda_j}}\, \psi_i^\top \psi_j = \delta_{ij}, \qquad (5.23)$$

and therefore $R_n \to I_d$.

Let us take a look at the error matrix $R_n$. This matrix is really the sample covariance matrix of the random variable $X$ after it has been transformed into the basis of principal components and scaled along the principal components such that the variance becomes 1. To see this, note first that $\psi_i^\top X_\ell$, occurring in formula (5.21), is $X_\ell$ projected onto the $i$th principal direction. The variance along this direction is $\mathrm{Var}(\psi_i^\top X_\ell) = \lambda_i$. By dividing by $\sqrt{\lambda_i}$, the variance becomes 1.

Using these projected and scaled versions of $X$, one obtains a random variable which has the same principal directions as $X$ but principal values equal to 1 by

$$\sum_{i=1}^{d} \psi_i \frac{1}{\sqrt{\lambda_i}}\, \psi_i^\top X_\ell. \qquad (5.24)$$

On the other hand, one could also replace the first term $\psi_i$ in the sum by the standard basis of $\mathbb{R}^d$ and obtain the random variable

$$Z_\ell = \left(\frac{1}{\sqrt{\lambda_1}}\, \psi_1^\top X_\ell, \; \dots, \; \frac{1}{\sqrt{\lambda_d}}\, \psi_d^\top X_\ell\right), \qquad (5.25)$$

which is already rotated such that the principal directions are given by the coordinate axes. The sample covariance matrix of $Z_\ell$ has entries

$$\frac{1}{n\sqrt{\lambda_i \lambda_j}} \sum_{\ell=1}^{n} \psi_i^\top X_\ell\, \psi_j^\top X_\ell, \qquad (5.26)$$

which is just the same term as in (5.21). Therefore, the convergence depends on how fast the sample covariance matrix of the projected and scaled $X$ converges to $I_d$.
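This reinterpretation is easy to verify numerically (a sketch reusing the hypothetical setup from above): the sample covariance of the whitened variables $Z_\ell$ from (5.25) coincides with $R_n$ from (5.15):

```python
import numpy as np

rng = np.random.default_rng(5)
lam = np.array([3.0, 1.0, 0.01])
Psi = np.linalg.qr(rng.normal(size=(3, 3)))[0]
true_cov = Psi @ np.diag(lam) @ Psi.T
n = 500
X = rng.multivariate_normal(np.zeros(3), true_cov, size=n)   # rows are X_1, ..., X_n
C_n = X.T @ X / n

# (5.25): Z_l = (psi_1^T X_l / sqrt(lambda_1), ..., psi_d^T X_l / sqrt(lambda_d))
Z = (X @ Psi) / np.sqrt(lam)       # project onto principal directions, then rescale

# Sample covariance of the Z_l (already centered, since E(X) = 0) ...
C_Z = Z.T @ Z / n
# ... coincides with R_n from (5.15):
R_n = np.diag(lam**-0.5) @ Psi.T @ C_n @ Psi @ np.diag(lam**-0.5)
print("max |C_Z - R_n| =", np.abs(C_Z - R_n).max())   # numerically zero
```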

Contrast this result with the naive relative bound obtained by dividing by $\lambda_i$:

$$|l_i - \lambda_i| \le \lambda_i\, \frac{\|C_n - \mathrm{Cov}(X)\|}{\lambda_i}. \qquad (5.27)$$

Here, the whole error term is scaled by $\lambda_i$, whereas in Theorem 5.13, the principal directions are scaled individually, such that the resulting error term really measures the error in each direction at the given relative scale.
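The difference between the two bounds is visible in a small experiment (hypothetical values, same setup as before): the naive factor $\|C_n - \mathrm{Cov}(X)\|/\lambda_i$ explodes for small $\lambda_i$, while $\|R_n - I_d\|$ remains moderate for all directions:

```python
import numpy as np

rng = np.random.default_rng(6)
lam = np.array([3.0, 1.0, 0.01])
Psi = np.linalg.qr(rng.normal(size=(3, 3)))[0]
true_cov = Psi @ np.diag(lam) @ Psi.T
X = rng.multivariate_normal(np.zeros(3), true_cov, size=500)
C_n = X.T @ X / 500

R_n = np.diag(lam**-0.5) @ Psi.T @ C_n @ Psi @ np.diag(lam**-0.5)
abs_err = np.linalg.norm(C_n - true_cov, 2)     # error term of the naive bound (5.27)
rel_err = np.linalg.norm(R_n - np.eye(3), 2)    # error term of Theorem 5.13

for i, li in enumerate(lam):
    print(f"lambda_{i + 1} = {li:5.2f}: "
          f"||C_n - Cov(X)|| / lambda_i = {abs_err / li:8.3f}, "
          f"||R_n - I|| = {rel_err:.3f}")
```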