
5.5 Projection Error and Finite-Sample Structure of Data in Feature Space

We now turn to the question of the effective dimension of data in feature space. We will first review this question in the context of PCA and then argue that the situation is completely different for kernel PCA. In the latter case, one can only estimate the number of leading dimensions necessary to capture most of the variance of the data, formalized by the error of projecting to the space spanned by the first $d$ principal directions. We derive a relative-absolute bound for this projection error.

When treating vectorial data in a finite-dimensional learning problem, it often occurs that the data is not evenly spread over the whole space but is contained in a subspace of lower dimension.

The reason for this is that the coordinates often correspond to certain measured features which are not completely independent but correlated in some way. For example, one coordinate might correspond to the height of a person while another represents the weight, and height is correlated with weight in the sense that a taller person will typically weigh more.

Since PCA computes an orthonormal basis such that the variance over the space spanned by the leading basis vectors is maximized, the principal values usually decay rather quickly until they reach a plateau which roughly corresponds to the noise level.

To illustrate this phenomenon we consider the following model. Let $Z \in \mathbb{R}^s$ be a random variable with mean zero which models the (unobservable) features we wish to measure. The measurement process itself is modeled by a matrix $A \in M_{d,s}$ with $d > s$. We assume that the columns of $A$ are linearly independent, so that the rank of the matrix is $s$. The measured features $AZ$ thus lie in the subspace of dimension $s$ spanned by the columns of $A$. Finally, the measurement process is contaminated by independent additive noise $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$, with the variance $\sigma_\varepsilon^2$ being smaller than the variance of $AZ$. The resulting measurement vector $X$ is thus given as

$$X = AZ + \varepsilon. \qquad (5.44)$$
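To make the model concrete, here is a minimal simulation sketch with illustrative (hypothetical) choices of $s$, $d$, $n$ and the noise level; none of these values come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

s, d, n = 2, 10, 500           # latent dimension, observed dimension, sample size (illustrative)
sigma_eps = 0.3                # noise level, chosen small relative to the signal variance

A = rng.standard_normal((d, s))                 # measurement matrix, almost surely of rank s
Z = rng.standard_normal((n, s))                 # latent features Z_i with mean zero
eps = sigma_eps * rng.standard_normal((n, d))   # independent additive noise

X = Z @ A.T + eps              # rows are the measurements X_i = A Z_i + eps_i, cf. (5.44)
```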

Let us compute the asymptotic principal components. It holds that

$$E(XX^\top) = E(AZZ^\top A^\top) + E(AZ\varepsilon^\top) + E(\varepsilon Z^\top A^\top) + E(\varepsilon\varepsilon^\top). \qquad (5.45)$$

Since $\varepsilon$ and $Z$ are independent, $[E(AZ\varepsilon^\top)]_{ij} = E([AZ]_i \varepsilon_j) = E([AZ]_i)\,E(\varepsilon_j) = 0$, and hence $E(AZ\varepsilon^\top) = 0$.

Likewise, $E(\varepsilon Z^\top A^\top) = 0$. Thus,

$$E(XX^\top) = E(AZZ^\top A^\top) + E(\varepsilon\varepsilon^\top). \qquad (5.46)$$

Now note that

$$[E(\varepsilon\varepsilon^\top)]_{ij} = E(\varepsilon_i \varepsilon_j) = \sigma_\varepsilon^2 \delta_{ij}, \qquad (5.47)$$

such that

$$E(XX^\top) = E(AZZ^\top A^\top) + E(\varepsilon\varepsilon^\top) = E(AZZ^\top A^\top) + \sigma_\varepsilon^2 I_d. \qquad (5.48)$$

We see that the principal values of $X$ are the same as those of $AZ$ shifted up by $\sigma_\varepsilon^2$, which means that there are $d - s$ principal values of size $\sigma_\varepsilon^2$ and $s$ principal values which are all larger than $\sigma_\varepsilon^2$ and correspond to the shifted principal values of the actual signal $AZ$.
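Continuing the sketch above, (5.48) can be checked numerically (assuming, for simplicity, that $Z$ has identity covariance):

```python
# Population covariance under the model, assuming Cov(Z) = I for simplicity:
# E(X X^T) = A A^T + sigma_eps^2 * I_d, cf. (5.48)
cov_X = A @ A.T + sigma_eps**2 * np.eye(d)
eigvals = np.sort(np.linalg.eigvalsh(cov_X))[::-1]

print(np.round(eigvals, 3))    # s large values, then d - s values exactly equal to sigma_eps^2
```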

If we compute the PCA for a finite sample $X_1, \dots, X_n$, due to sample fluctuations, the smallest principal value of size $\sigma_\varepsilon^2$ with multiplicity $d - s$ will be perturbed slightly, giving rise to a slope of principal values for the finite-sample PCA. Figure 5.1 plots an example. This means that the data is approximately contained in an $s$-dimensional subspace not only in the asymptotic case, but also for finite samples. This subspace is given by the space spanned by the leading $s$ principal components.
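For comparison, the finite-sample spectrum of the same simulated data shows the perturbed plateau:

```python
# Spectrum of the sample covariance of the simulated data (whose population mean is zero)
sample_cov = X.T @ X / n
sample_eigvals = np.sort(np.linalg.eigvalsh(sample_cov))[::-1]

# The first s values stand out; the remaining d - s values no longer form an exact
# plateau at sigma_eps^2 but a gently sloping ramp (the effect Figure 5.1 shows on real data).
print(np.round(sample_eigvals, 3))
```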

Now if we start with a sample $X_1, \dots, X_n$ from some unknown source, we can calculate its PCA.

The sequence of principal values then tells us something about the effective dimension of the underlying probability measure. One is usually interested in estimating the effective number of dimensions. There exist a number of approaches to do this.

The simplest approach looks for a “knee” in the sequence of eigenvalues, that is, a transition into a ramp with small slope. A more sophisticated approach tests the hypothesis that the last $i$ principal values are finite-sample approximations of the eigenvalues of the covariance matrix of a spherical Gaussian distribution, whose distribution can be computed in closed form.
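As a rough illustration of the first idea (a hypothetical heuristic, not one proposed in the text), one can place the knee before the largest ratio of consecutive sample eigenvalues:

```python
def knee_index(eigvals):
    """Number of leading eigenvalues before the largest drop (a crude 'knee' heuristic)."""
    ratios = eigvals[:-1] / eigvals[1:]      # drop between consecutive principal values
    return int(np.argmax(ratios)) + 1

print(knee_index(sample_eigvals))            # reports s = 2 for the simulated data above
```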

Let us now consider the question of effective dimension when the data is mapped into a feature space. Recall that the mapping into feature space is given by (compare (5.29))

$$x \mapsto \Phi(x) = \left(\sqrt{\lambda_i}\, \psi_i(x)\right)_{i \in \mathbb{N}}. \qquad (5.49)$$

Since $(\sqrt{\lambda_i})$ is a null sequence, we expect $E([\Phi(X)]_i^2)$ to become rather small for large $i$. In fact, we can compute

$$E([\Phi(X)]_i^2) = E(\lambda_i \psi_i(X)^2) = \lambda_i, \qquad (5.50)$$

since $E(\psi_i(X)^2) = \langle \psi_i, \psi_i \rangle_\mu = 1$.

[Figure: plot of principal value versus principal component number for the waveform data set.]

Figure 5.1: Example of the effect described in the text. The data set is the waveform data set from Rätsch et al. (2001). As predicted by the model, the principal values decay quickly until they pass into a slope which is due to additive noise.

Furthermore, let us compute the correlations:

$$E([\Phi(X)]_i [\Phi(X)]_j) = E\!\left(\sqrt{\lambda_i \lambda_j}\, \psi_i(X)\psi_j(X)\right) = \sqrt{\lambda_i \lambda_j}\, \langle \psi_i, \psi_j \rangle_\mu = \sqrt{\lambda_i \lambda_j}\, \delta_{ij}. \qquad (5.51)$$

We see that $[\Phi(X)]_i$ and $[\Phi(X)]_j$ are uncorrelated for $i \neq j$. Thus the asymptotic principal directions are given by the standard unit vectors $e_i$ in $\ell_2$ ($e_i$ having zero entries everywhere except for $[e_i]_i = 1$). Asymptotically, the principal values decay rather quickly, such that there are only a few dimensions which contain any interesting data at all.
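Combining (5.50) and (5.51), the covariance of the feature-mapped data, written informally as an infinite matrix, is diagonal, which makes this explicit:

$$\bigl[\,E\bigl(\Phi(X)\Phi(X)^\top\bigr)\,\bigr]_{ij} = E\bigl([\Phi(X)]_i [\Phi(X)]_j\bigr) = \lambda_i\,\delta_{ij},$$

so the covariance operator is $\operatorname{diag}(\lambda_1, \lambda_2, \dots)$, its eigenvectors are the unit vectors $e_i$, and its eigenvalues are the principal values $\lambda_i$.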

Coming back to the question of effective dimension, we see that the situation for kernel PCA is completely different from the model discussed above: there is no knee in the sequence of principal values; instead, the principal values typically just decay at a certain rate.

In ordinary PCA, there exists something like a background noise distributed uniformly over all directions. In kernel PCA, this noise is also mapped into the feature manifold, such that there is no slowly decreasing ramp in the sequence of principal values.
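To see this numerically, the following sketch applies kernel PCA with a Gaussian RBF kernel to the simulated data $X$ from above; the kernel and the median-heuristic bandwidth are illustrative choices, not prescribed by the text:

```python
# Kernel matrix for a Gaussian RBF kernel on the simulated data X (illustrative choice);
# bandwidth set by the median of the squared pairwise distances (median heuristic).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * np.median(sq_dists)))

# Center the kernel matrix in feature space and compute sample principal values l_i
H = np.eye(n) - np.ones((n, n)) / n
l = np.sort(np.linalg.eigvalsh(H @ K @ H))[::-1] / n   # l_i taken as eig(K_c) / n here

print(np.round(l[:20], 4))     # the values decay smoothly; there is no clear knee
```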

A reasonable alternative to mapping to the subspace containing the signal is to project to a number of leading dimensions such that the variance in the remaining space is negligible. The variance contained in the subspace spanned by $\psi_{d+1}, \psi_{d+2}, \dots$ is

$$\Pi_d^2 = \sum_{i=d+1}^{\infty} \lambda_i. \qquad (5.52)$$

This number will be called the reconstruction error (of using the subspace spanned by the first $d$ principal directions).
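As a small illustration with a hypothetical, geometrically decaying spectrum, the smallest $d$ for which the remaining variance falls below a chosen tolerance can be read off from cumulative sums:

```python
import numpy as np

def smallest_d(lambdas, fraction=0.99):
    """Smallest d such that the first d principal values capture `fraction` of the total
    variance, i.e. the tail sum Pi_d^2 of (5.52) is at most (1 - fraction) of the total."""
    lambdas = np.asarray(lambdas, dtype=float)
    cumulative = np.cumsum(lambdas)
    return int(np.searchsorted(cumulative, fraction * cumulative[-1])) + 1

lambdas = 0.5 ** np.arange(1, 50)      # illustrative spectrum lambda_i = 2^{-i}, truncated
print(smallest_d(lambdas))             # 7: the tail beyond d = 7 holds less than 1% of the variance
```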

For the finite-sample case, we analogously define the projection error as

$$P_d^2 = \sum_{i=d+1}^{n} l_i. \qquad (5.53)$$

The sum is finite in this case, because the data points lie completely in the subspace spanned by the $n$ principal directions: from (5.36) one sees that the principal directions lie in the span of the samples $\Phi(X_i)$ in feature space. The space spanned by the $\Phi(X_i)$ has dimension at most $n$, and since the principal directions are orthogonal, the space spanned by those directions also has dimension at most $n$, such that the data points lie in the space spanned by the first $n$ principal directions.
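Continuing the kernel PCA sketch above (with $l_i$ taken as the eigenvalues of the centered kernel matrix divided by $n$, one common convention), the empirical projection error can be computed directly:

```python
def projection_error(l, d):
    """Empirical projection error P_d^2 = sum_{i = d+1}^{n} l_i, cf. (5.53)."""
    return float(np.sum(l[d:]))

for d_proj in (5, 10, 20):
    print(d_proj, round(projection_error(l, d_proj), 4))   # shrinks quickly as d grows
```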

By Theorem 5.39, the approximate principal values converge to the asymptotic principal values with a relative-absolute bound. Thus, the principal values will decay at roughly the same rate
