
called $K$ in the previous chapters. This change of notation is introduced to conform with the usual convention used in connection with kernel methods in supervised learning.

Let us re-write this formula using the spectral decomposition $K = U L U^\top$. As usual, the columns $u_i$ of $U$ are the eigenvectors of $K$, and the $l_i$ are the diagonal elements of $L$, which are the eigenvalues of $K$. Everything is sorted such that $l_1 \geq \ldots \geq l_n$. Plugging in the spectral decomposition leads to

\[ \hat{Y} = K\hat{\alpha} = U L U^\top \hat{\alpha}. \qquad (6.5) \]

For the individual entries this means that
\[ [\hat{Y}]_i = \sum_{j=1}^{n} [u_j]_i \, (l_j u_j^\top \hat{\alpha}) =: \sum_{j=1}^{n} [u_j]_i \, \hat{\beta}_j, \qquad (6.6) \]
that is, $\hat{Y} = \sum_{j=1}^{n} u_j \hat{\beta}_j$ with $\hat{\beta}_j := l_j u_j^\top \hat{\alpha}$.

This expresses the in-sample fit as a linear combination of eigenvectors $u_j$ of $K$. This way of decomposing the fit $\hat{Y}$ has several advantages compared with the plain in-sample fit formula (6.3), where the fit is constructed from the columns of the kernel matrix.

First of all, the eigenvectors form an orthonormal set of vectors. Therefore, the weights $\hat{\beta}_j$ control orthogonal components, which can be thought of as geometrically independent components.

Furthermore, the eigenvectors usually increase in complexity as their associated eigenvalue becomes smaller (Williams and Seeger, 2000). Therefore, (6.6) decomposes ˆY with respect to components with increasing complexity. These are only preliminary considerations. As will be shown in this chapter, the most important reason is that with respect to the basis of eigenvectors of the kernel matrix, smooth and noisy parts of the label information have significantly different structure. It will turn out that a smooth function will have a sparse representation in this basis, while i.i.d. noise will have evenly distributed coefficients. These distinctions allow us to estimate the number of relevant dimensions in feature space.
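To make (6.5) and (6.6) concrete, the following sketch verifies numerically that the in-sample fit equals the sum of the eigenvector components $u_j \hat{\beta}_j$. It is a minimal NumPy illustration: the rbf parametrization, the kernel width, and the use of kernel ridge regression to obtain $\hat{\alpha}$ are choices made here for the example only; the decomposition (6.6) holds for any coefficient vector $\hat{\alpha}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w = 100, 1.0
X = rng.uniform(-np.pi, np.pi, n)
Y = rng.standard_normal(n)                                 # any label vector will do here

K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * w**2))     # rbf kernel (assumed parametrization)
alpha = np.linalg.solve(K + 0.1 * np.eye(n), Y)            # e.g. kernel ridge regression coefficients
Y_hat = K @ alpha                                          # in-sample fit, cf. (6.3)

l, U = np.linalg.eigh(K)                                   # K = U L U^T
l, U = l[::-1], U[:, ::-1]                                 # sort such that l_1 >= ... >= l_n
beta = l * (U.T @ alpha)                                   # beta_j = l_j u_j^T alpha
print(np.allclose(Y_hat, U @ beta))                        # sum_j u_j beta_j reproduces the fit: True
```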

Throughout this chapter, we will make the following modelling assumption. We assume that $Y_i$ is given by
\[ Y_i = f(X_i) + \varepsilon_i, \qquad (\varepsilon_1, \ldots, \varepsilon_n)^\top \sim \mathcal{N}(0, \sigma_\varepsilon^2 I_n), \qquad (6.7) \]
which means that the $Y_i$ are the sampled values of $f$ plus additive zero mean independent Gaussian noise. We will call
\[ Y = (Y_1, \ldots, Y_n)^\top \in \mathbb{R}^n \qquad (6.8) \]
the label vector.

Moreover, we will make certain smoothness assumptions on $f$. We will assume that $f$ lies in the image of the operator $T_k$. Let $\lambda_i$ be the eigenvalues of $T_k$ and $\psi_i$ its eigenfunctions. Then, $f \in \operatorname{ran} T_k$ is equivalent to the existence of a sequence $(\alpha_i)_{i \in \mathbb{N}} \in \ell^2$ such that

\[ f = \sum_{i=1}^{\infty} \alpha_i \lambda_i \psi_i. \qquad (6.9) \]

For smooth kernels like the rbf-kernel, the operator $T_k$ acts as a smoother, such that this ensures a certain degree of regularity. One can show (compare Lemma 2.33) that the amount of regularity directly corresponds to the norm of the parameter sequence $(\alpha_i)$.
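For concreteness, here is the one-line calculation behind (6.9), stated as my own restatement under the assumption that $T_k$ is the integral operator with Mercer decomposition $T_k g = \sum_i \lambda_i \langle \psi_i, g\rangle_\mu \psi_i$ as recalled in the earlier chapters: if $f = T_k g$ with $g = \sum_{i=1}^{\infty} \alpha_i \psi_i \in L_2(\mu)$, then
\[ f = T_k g = \sum_{i=1}^{\infty} \alpha_i T_k \psi_i = \sum_{i=1}^{\infty} \alpha_i \lambda_i \psi_i, \]
and $(\alpha_i) \in \ell^2$ precisely because the $\psi_i$ are orthonormal in $L_2(\mu)$ and $g$ has finite norm, which recovers (6.9).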

6.4 The Spectrum of the Label Vector

We now introduce the notion of the spectrum of the label vector, which is the vector of coefficients of $Y$ with respect to the eigenbasis of $K$. Since the kernel matrix $K$ is symmetric, its eigenvectors $u_1, \ldots, u_n$ are orthogonal, and if they are normalized to unit length, the columns of $U = (u_1, \ldots, u_n)$ form an orthonormal basis of $\mathbb{R}^n$. Therefore, computing the coefficients of $Y$ with respect to the eigenbasis of $K$ is particularly simple, because one only has to compute the scalar products of $Y$ with the eigenvectors:

\[ [s]_i = u_i^\top Y. \qquad (6.10) \]

In matrix notation, the vector $s$ of all coefficients can conveniently be written as

\[ s = U^\top Y. \qquad (6.11) \]

In analogy to the term eigenbasis, we will call $s$ the eigencoefficients of $Y$ with respect to the kernel matrix $K$.

We will also call $s$ the spectrum of $Y$. This term is motivated by the observation that the eigenvectors of smooth kernels (for example, rbf-kernels (2.17)) typically look like sine waves whose complexity increases as the eigenvalue becomes smaller. For data in a vector space with dimension larger than one, similar observations can be made (see for example the paper by Schölkopf et al. (1999)). Unfortunately, since the interplay between the underlying distribution $\mu$ of the data and the kernel function is not yet completely understood, although first steps have been made (Williams and Seeger, 2000), this observation cannot be put in more rigorous terms. In summary, computing the eigencoefficients of a vector typically decomposes the vector into orthogonal components with increasing complexity, not unlike a Fourier transformation. Therefore, in analogy to the Fourier transformation, we will call the eigencoefficients of $Y$ the spectrum of $Y$. We will see below that this terminology is actually supported by observations concerning the structure of the signal and noise part of the label vector.
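Computationally, (6.10) and (6.11) amount to one symmetric eigendecomposition and one matrix-vector product. The following is a minimal sketch; the function name and the NumPy setting are mine, not from the text.

```python
import numpy as np

def spectrum(K, Y):
    """Eigencoefficients s = U^T Y of the label vector Y with respect to the
    kernel matrix K, with eigenvalues sorted in decreasing order, cf. (6.11)."""
    l, U = np.linalg.eigh(K)            # K symmetric: K = U diag(l) U^T
    order = np.argsort(l)[::-1]         # sort such that l_1 >= ... >= l_n
    return l[order], U[:, order].T @ Y  # eigenvalues and spectrum s
```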

By the modelling assumption (6.7), it holds that $Y_i = f(X_i) + \varepsilon_i$. Let us write $f(X) = (f(X_1), \ldots, f(X_n))^\top$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$. Then, the spectrum of $Y$ is
\[ U^\top Y = U^\top (f(X) + \varepsilon) = U^\top f(X) + U^\top \varepsilon. \qquad (6.12) \]
Thus, the spectrum of $Y$ is a superposition of the spectrum of the sample vector of $f$ and of the noise $\varepsilon$. As we will see, $U^\top f(X)$ and $U^\top \varepsilon$ have significantly different structures.

6.4.1 The Spectrum of a Smooth Function

Recall that we assumed that $f$ was smooth in the sense that $f = \sum_{\ell=1}^{\infty} \alpha_\ell \lambda_\ell \psi_\ell$ for some sequence $(\alpha_\ell) \in \ell^2$. By the general results cited in Section 4.4, we know that the scalar products
\[ \frac{1}{\sqrt{n}} \, |u_i^\top f(X)| \qquad (6.13) \]
converge to $|\langle \psi_i, f\rangle_\mu|$ (taking the necessary precautions for eigenvalues with multiplicity larger than 1). Since the $(\psi_i)$ form an orthonormal family of functions, it holds that
\[ \langle \psi_i, f\rangle_\mu = \alpha_i \lambda_i. \qquad (6.14) \]

The asymptotic (infinite) spectrum of $f$ given by the sequence $(\langle \psi_i, f\rangle_\mu)_{i \in \mathbb{N}}$ decays rather quickly, because $(\alpha_i) \in \ell^2$ and $(\lambda_i) \in \ell^1$. In particular, for every $\varepsilon > 0$, there exist only a finite number of entries which are larger than $\varepsilon$. In other words, by only considering a finite number of basis functions $\psi_1, \ldots, \psi_r$, we can already reconstruct $f$ up to a small error. More specifically, for any error $\varepsilon > 0$, a finite reconstruction can be found whose error does not exceed $\varepsilon$. In this sense, $f$ has finite complexity at any given scale.

We are interested in the question whether the sampled spectrum $s = U^\top f(X)$ has similar properties. In Chapter 4, we derived a relative-absolute envelope on $|u_i^\top f(X)|/\sqrt{n}$. This is an upper bound which does not converge to zero as the number of samples goes to infinity, but which is nevertheless quite small for certain indices $i$. By Theorem 4.92, we know that
\[ \frac{1}{\sqrt{n}} \, |u_i^\top f(X)| \leq l_i \, O\!\left( \sum_{\ell=1}^{r} |\alpha_\ell| \right) + \varepsilon(r, f), \qquad (6.15) \]
where $r$ can be chosen such that $\varepsilon(r, f)$ becomes very small. This means that the coefficients of $s$ also decay quickly, that the sample vector $f(X)$ will essentially be contained in the space spanned by the first few eigenvectors of $K_n$, and that this number is independent of the sample size.

In summary, a smooth function will also have a quickly decaying spectrum on a finite sample: only a small number of eigencoefficients will be large.

Note the similarity to the observations from the last chapter. In Section 5.5, we showed that a finite sample will be contained in a low-dimensional subspace in feature space. In this section, we have shown that the noise-free part of the label vector is essentially contained in the first few dimensions with respect to the eigenbasis of the kernel matrix.

6.4.2 The Spectrum of Noise

Recall that we assumed that $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$. Let us study some stochastic properties of $U^\top \varepsilon$. First of all, note that $\mathbf{E}(U^\top \varepsilon) = 0$, such that the noise has mean zero also with respect to the eigenbasis of $K$.

The eigenbasis $u_1, \ldots, u_n$ of $K$ depends only on the $X_i$, which are independent of $\varepsilon$. Therefore, $x \mapsto U^\top x$ computes a random rotation of $x$ which is independent of the realization of $\varepsilon$. Furthermore, $\varepsilon$ has a spherical distribution, such that $U^\top \varepsilon$ is just a rotated version of $\varepsilon$ which still has the same distribution:
\[ U^\top \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I). \qquad (6.16) \]

Therefore, the spectrum of $\varepsilon$ will be more or less flat, which means that it is evenly distributed over all components, and the components are typically neither very small nor very large. This will still be true, to a lesser extent, if $\varepsilon$ is not spherically distributed. In order for $\varepsilon$ to be concentrated along one of the eigenvector directions of $K$, the noise would have to have a shape similar to the eigenvectors, but in any case, the eigenvectors will be independent of $\varepsilon$. Since the eigenvectors corresponding to larger eigenvalues are more or less smooth, this is rather unlikely if the noise has zero mean and is independent while not being identically distributed. The eigenvectors for smaller eigenvalues are not very smooth, but also not very stable, such that again, the scalar products will be more or less random and the spectrum will be flat.

This characterization of the shape of the spectrum is reminiscent of the question of effective dimensions in the context of PCA from Section 5.5, although the objects involved and the underlying mechanics should not be confused: in PCA, the eigenvalues of the covariance matrix are considered, whereas here, we are looking at the scalar products between the vector of all labels of the training set and the eigenvectors of the kernel matrix. Furthermore, the present characterization depends crucially on the relative-absolute envelope for scalar products with eigenvectors of the kernel matrix, which forms an original contribution of this thesis. Even the asymptotic results from Koltchinskii (1998) date back only a few years, whereas the PCA setting is known at least since (Schmidt, 1986).

6.4.3 An Example

Let us take a look at an example. As usual, we take the noisy sinc function and the rbf-kernel with kernel width $w = 1$. Figure 6.1 plots the raw data, and the spectra of $Y$ and its components $f(X)$ and $\varepsilon$. As predicted, the spectrum of the sample vector $f(X)$ decays quickly (note the logarithmic scale!), while the spectrum of the noise is flat. We also see how the spectrum of the label vector sticks out of the noise. This is due to the fact that the whole variance of the label vector has to be contained in a few dimensions, while the variance of the noise can be spread out evenly. Thus, even if $\|f(X)\| = \|\varepsilon\|$, the spectrum of $f(X)$ will stick out of the noise spectrum.
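The qualitative behaviour shown in Figure 6.1 is easy to reproduce. The following sketch is a rough illustration only; the sinc convention, the noise level, and the rbf parametrization are assumptions made here, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, w = 100, 1.0
X = np.sort(rng.uniform(-np.pi, np.pi, n))
f_X = np.sinc(X)                                         # smooth signal (one sinc convention)
eps = 0.2 * rng.standard_normal(n)                       # i.i.d. Gaussian noise, cf. (6.7)

K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * w**2))   # rbf kernel with width w = 1
l, U = np.linalg.eigh(K)
U = U[:, np.argsort(l)[::-1]]                            # columns sorted by decreasing eigenvalue

spec_f, spec_eps = np.abs(U.T @ f_X), np.abs(U.T @ eps)
# The signal spectrum falls over many orders of magnitude; the noise spectrum does not.
print(spec_f[:5], spec_f[-5:])
print(spec_eps[:5], spec_eps[-5:])
```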

It is instructive to compare the spectrum of $f(X)$ from Figure 6.1 with that of a function which is not smooth in the above sense, that is, which does not lie in the range of $T_k$. For continuous kernels, an example of such a function is the sign function, which is discontinuous at 0. Figure 6.2 plots the spectrum of the sign function.

[Figure 6.1, panels: (a) A typical data set ($X$ vs. $Y$). (b) The spectrum of the label vector $Y$, $|u_i^\top Y|$. (c) The spectrum of the sample vector $f(X)$, $|u_i^\top f(X)|$ (in log-scale). (d) The spectrum of the noise $\varepsilon$, $|u_i^\top \varepsilon|$.]

Figure 6.1: Spectrum for the noisy sinc example. The spectrum of the sample vector $f(X)$ decays rapidly, whereas the spectrum of the noise vector $\varepsilon$ is flat.

[Figure 6.2: $|u_i^\top \operatorname{sign}(X)|$ plotted against $i$ (in log-scale).]

Figure 6.2: Spectrum of the sign function for values uniformly drawn in $[-\pi, \pi]$ with respect to the rbf-kernel. Since the sign function is discontinuous and therefore does not fulfill the regularity assumption, the spectrum decays much more slowly than in the noisy sinc function example.

We see that the spectrum decays, but much more slowly than that in Figure 6.1. Therefore, the smoothness condition is crucial to obtain rapid decay.

6.4.4 The Cut-off Dimension

Let us summarize the above observations. By the convergence results in Chapter 4, we know that the spectrum of the sample vector of a smooth function has only a finite number of large entries (which is, moreover, independent of $n$), whereas the spectrum of independent noise is evenly distributed over all coefficients. The eigenbasis of the kernel matrix thus yields a representation of the label vector in which the interesting part of the label vector and the noise have significantly different structures.

These considerations lead to the definition of the cut-off dimension for a given label vector $Y$:

Definition 6.17 Given a label vector $Y = f(X) + \varepsilon$ of length $n$, the cut-off dimension is the largest number $1 \leq d \leq n$ such that
\[ |[U^\top f(X)]_d| > |[U^\top \varepsilon]_d|. \qquad (6.18) \]

Thus, the cut-off dimension is the size of the part of the spectrum of $f(X)$ which sticks out of the noise. For the example from Figure 6.1(b), the cut-off dimension seems to be 9. From Figure 6.1(c), we see that from that point onwards the spectrum is smaller than $10^{-2}$, which is relatively small, but still a bit away from the true residual at around $d = 20$. In the presence of noise, this portion of $f(X)$ is not visible. The following lemma summarizes the intuition that the cut-off dimension captures $f(X)$ up to the noise level.
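On synthetic data, where $f(X)$ and $\varepsilon$ are available separately, Definition 6.17 can be evaluated directly. The following helper is a sketch with my own naming; in practice the cut-off dimension has to be estimated from $Y$ alone, which is discussed later.

```python
import numpy as np

def cutoff_dimension(U, f_X, eps):
    """Largest index d (1-based) with |[U^T f(X)]_d| > |[U^T eps]_d|, cf. (6.18).
    U must have its columns sorted by decreasing eigenvalue of K."""
    signal = np.abs(U.T @ f_X)
    noise = np.abs(U.T @ eps)
    above = np.nonzero(signal > noise)[0]
    return int(above[-1]) + 1 if above.size else 0
```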

Lemma 6.19 Let $Y = f(X) + \varepsilon$ and let $d$ be the cut-off dimension. Let $\pi_d x = \sum_{i=1}^{d} u_i u_i^\top x$ be the projection of $x \in \mathbb{R}^n$ onto the space spanned by the first $d$ eigenvectors of $K$. Then,
\[ \frac{1}{n} \|f(X) - \pi_d f(X)\|^2 \leq \frac{1}{n} \|\varepsilon\|^2 \xrightarrow{\ \text{a.s.}\ } \operatorname{Var}(\varepsilon_1), \qquad (6.20) \]
where the limit is taken by letting the number of samples $n$ (and $X$ accordingly) tend to infinity.

Proof It holds that $f(X) - \pi_d f(X) = \sum_{i=d+1}^{n} u_i u_i^\top f(X)$. Therefore,
\[
\begin{aligned}
\frac{1}{n} \|f(X) - \pi_d f(X)\|^2
&= \frac{1}{n} \sum_{i=d+1}^{n} \|u_i u_i^\top f(X)\|^2
 = \frac{1}{n} \sum_{i=d+1}^{n} \big(u_i^\top f(X)\big)^2 \\
&\overset{(1)}{\leq} \frac{1}{n} \sum_{i=d+1}^{n} \big(u_i^\top \varepsilon\big)^2
 \leq \frac{1}{n} \sum_{i=1}^{n} \big(u_i^\top \varepsilon\big)^2
 = \frac{1}{n} \|U^\top \varepsilon\|^2
 \overset{(2)}{=} \frac{1}{n} \|\varepsilon\|^2
 \overset{(3)}{\xrightarrow{\ \text{a.s.}\ }} \operatorname{Var}(\varepsilon_1) = \sigma_\varepsilon^2,
\end{aligned}
\qquad (6.21)
\]
where (1) follows from the definition of the cut-off dimension, (2) follows because $U$ is an orthogonal matrix, and (3) from the strong law of large numbers. Note that although $d$ is random, the upper bound holds nevertheless, because in step (3), where the limit is taken, $d$ has already been eliminated.

In words, if one considers only the reconstruction of $f(X)$ up to the cut-off dimension, the reconstruction error is asymptotically bounded by the noise variance.

Note that we have neglected the fact that the spectrum of $f(X)$ decays quickly; the actual error will therefore be much smaller than predicted by the lemma. On the positive side, the lemma also holds for non-smooth functions $f$; in fact, it holds for any vector $f(X)$.
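A quick numerical sanity check of Lemma 6.19 on a synthetic noisy sinc sample; the data generation and kernel parametrization below are my own illustrative choices, and the inequality holds for every finite sample by steps (1) and (2) of the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, w = 200, 1.0
X = rng.uniform(-np.pi, np.pi, n)
f_X, eps = np.sinc(X), 0.2 * rng.standard_normal(n)

K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * w**2))
l, U = np.linalg.eigh(K)
U = U[:, np.argsort(l)[::-1]]                           # eigenvectors, decreasing eigenvalues

signal, noise = np.abs(U.T @ f_X), np.abs(U.T @ eps)
d = np.nonzero(signal > noise)[0][-1] + 1               # cut-off dimension (assumes it exists)

pi_d_f = U[:, :d] @ (U[:, :d].T @ f_X)                  # projection onto first d eigenvectors
print(np.sum((f_X - pi_d_f)**2) / n <= np.sum(eps**2) / n)   # True, cf. (6.20)/(6.21)
```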

6.4.5 Connections to Wavelet Shrinkage

Wavelet shrinkage is a spatially adaptive technique for learning a function when the sample points $X_i$ are given as equidistant points. Such data sets typically occur in signal or image processing.

For such point sets, one can define a wavelet basis which leads to a multi-scale decomposition of the signal. Wavelet shrinkage then proceeds by selectively thresholding the wavelet coefficients of the signal, resulting in a reconstruction of the original signal which is able to recover both smooth areas and jump discontinuities. In this respect, wavelet methods often show superior performance, in particular compared to linear methods. It has even been shown by Donoho and Johnstone (1998) that, using certain thresholding schemes, the resulting method is nearly minimax optimal over any member of a wide range of so-called Triebel- and Besov-type smoothness classes, and also asymptotically minimax optimal over certain Besov bodies.

The connection to the discussion here is given by the fact that wavelet shrinkage is analyzed in terms of the so-called sequence space. This space is obtained by considering the wavelet coefficients, just as we have considered the coefficients with respect to the eigenvectors of the kernel matrix. In both cases, the coefficients represent the original label vector $Y$ with respect to an orthonormal basis. As has been discussed above, after the basis transformation, the noise stays normally distributed.

Now, interestingly, in the wavelet analysis a noiseless signal will typically have only a small number of large wavelet coefficients, while the remaining coefficients will be rather small. On the other hand, as explained above, the noise will contribute evenly to all wavelet coefficients. Based on the theoretical results from Chapter 4, we are now in a position to state that essentially the same condition holds in the case of the eigenvector basis of the kernel matrix. While the eigenvectors of the kernel matrix typically do not lead to a multi-scale decomposition of the label vector, kernel methods naturally extend to non-equidistant sample points, a setting where the application of wavelet techniques is not straightforward.

Below, when we discuss practical methods for estimating the cut-off dimension, we will return to the topic of wavelet shrinkage and discuss the applicability of methods like VisuShrink and SureShrink (introduced in (Donoho and Johnstone, 1995) and (Donoho et al., 1995)) for determining the threshold coefficients in the kernel setting.
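As a purely illustrative preview of what such a thresholding rule looks like when transplanted to the eigencoefficients (a sketch only, not the estimation procedure developed later in the text): the universal threshold $\sigma\sqrt{2\log n}$ is the VisuShrink choice of Donoho and Johnstone and assumes the noise level $\sigma$ is known.

```python
import numpy as np

def soft_threshold_spectrum(U, Y, sigma):
    """Soft-threshold the eigencoefficients s = U^T Y with the universal
    threshold sigma * sqrt(2 log n) and transform back (illustration only)."""
    n = len(Y)
    s = U.T @ Y                                          # spectrum of Y, cf. (6.11)
    t = sigma * np.sqrt(2.0 * np.log(n))                 # universal (VisuShrink) threshold
    s_shrunk = np.sign(s) * np.maximum(np.abs(s) - t, 0.0)
    return U @ s_shrunk                                  # denoised label vector in original coordinates
```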