
Dataset        SVM        GPML       KRRRB      KRRSM      LOOCV
banana         11.5±0.7   10.4±0.4   12.2±3.4   10.6±0.5   10.6±0.6
breast-cancer  26.0±4.7   27.2±5.1   33.2±8.5   26.4±4.7   26.6±4.7
diabetis       23.5±1.7   23.0±1.7   27.5±2.8   23.2±1.7   23.3±1.7
flare-solar    32.4±1.8   34.1±1.7   40.6±9.1   34.2±1.8   34.1±1.8
german         23.6±2.1   24.0±2.2   24.7±2.3   23.4±2.3   23.5±2.2
heart          16.0±3.3   18.4±3.4   17.9±4.0   16.4±3.3   16.6±3.6
image           3.0±0.6    2.8±0.5    2.7±0.5    2.7±0.5    2.8±0.5
ringnorm        1.7±0.1    6.0±0.9    6.1±1.1    4.9±0.7    4.9±0.6
splice         10.9±0.6   11.5±0.6   11.1±0.7   11.1±0.6   11.2±0.7
thyroid         4.8±2.2    4.3±2.7    4.6±2.3    4.4±2.2    4.4±2.2
titanic        22.4±1.0   22.7±1.3   25.3±11.3  22.4±0.9   22.4±0.9
twonorm         3.0±0.2    3.1±0.2    3.7±0.4    2.5±0.1    2.8±0.2
waveform        9.9±0.4   10.0±0.5   10.9±0.7   10.0±0.4    9.7±0.4

Figure 7.15: Test error rates (in %) for the classification benchmark data sets. The data sets are from Rätsch et al. (2001) and are available online from http://www.first.fraunhofer.de/~raetsch.

Dataset        LOOCV              GPML               KRRRB              KRRSM
banana         2.65±1.08×10^-1    3.34±0.17×10^-1    1.06±4.65×10^0     2.59±1.29×10^-1
breast-cancer  4.66±16.93×10^1    5.81±3.04×10^1     2.14±3.76×10^2     3.01±9.92×10^1
diabetis       1.52±1.16×10^1     2.61±0.19×10^1     4.16±15.74×10^1    4.80±14.46×10^1
flare-solar    1.79±3.67×10^2     5.78±1.04×10^0     8.99±19.15×10^1    2.58±4.14×10^2
german         2.83±3.19×10^1     1.01±0.23×10^2     5.29±18.78×10^1    2.72±0.84×10^1
heart          3.21±3.86×10^2     1.99±1.35×10^1     1.98±3.14×10^2     4.88±4.65×10^2
image          2.21±0.79×10^0     8.68±5.40×10^-1    2.44±1.16×10^0     2.29±1.48×10^0
ringnorm       2.54±0.35×10^1     1.27±0.00×10^1     1.59±0.76×10^1     2.34±0.57×10^1
splice         5.17±0.87×10^1     5.31±0.63×10^1     4.89±1.16×10^1     5.46±0.00×10^1
thyroid        2.92±0.50×10^0     4.44±1.65×10^-1    2.77±1.09×10^0     2.78±0.52×10^0
titanic        2.09±3.94×10^2     6.82±6.59×10^0     1.63±2.98×10^2     1.66±3.67×10^2
twonorm        1.33±0.27×10^1     2.92±0.85×10^1     1.34±0.31×10^1     1.34±0.30×10^1
waveform       1.58±0.58×10^1     2.48±0.66×10^1     1.47±3.64×10^1     1.55±1.10×10^1

Figure 7.16: Estimated kernel widths for the classification benchmark data sets.

the twonorm data set, which is d = 2. This means that the resulting best solution for the twonorm data set is computed by a very low complexity fit.

In summary, we have observed two behaviors. First, regression methods can be used very well for classification, although they were derived in a completely different setting; it is safe to assume that for GPML, none of the original assumptions are justified. This insight is not new and has already been stated in (Rifkin, 2002). Second, we have seen that KRRSM shows the same performance as LOOCV, in one case even being significantly better.

7.8 Conclusion

In this chapter, we have explored applications of the theoretical results obtained so far to kernel ridge regression. We have shown that kernel ridge regression works, in essence, by first transforming the data into a representation in which the noise can be removed by simply shrinking a number of coefficients to zero. We then proposed a method for adjusting the regularization parameter based on the cut-off dimension estimators. Experimentally, we observed that this method performs very competitively with state-of-the-art methods.

Dataset        LOOCV              GPML               KRRRB              KRRSM
banana         3.35±2.21×10^-1    2.98±0.00×10^-1    1.55±1.68×10^-1    2.90±1.40×10^-1
breast-cancer  1.96±1.15×10^0     7.77±0.78×10^-1    6.93±4.17×10^-2    1.14±0.31×10^0
diabetis       1.66±0.66×10^0     7.85±0.00×10^-1    1.15±1.18×10^-1    6.45±2.77×10^-1
flare-solar    5.14±4.79×10^-1    7.85±0.00×10^-1    4.49±10.45×10^-2   3.16±4.65×10^-1
german         1.73±0.61×10^0     7.85±0.00×10^-1    2.07±3.63×10^-1    1.05±0.14×10^0
heart          4.72±4.34×10^-1    3.16±1.56×10^-1    3.00±4.07×10^-1    4.19±5.36×10^-1
image          1.61±0.73×10^-2    2.09±2.09×10^-2    6.27±2.93×10^-2    4.22±2.98×10^-2
ringnorm       1.11±0.10×10^-1    4.07±0.72×10^-2    3.82±1.66×10^-2    7.71±2.65×10^-2
splice         1.00±0.61×10^-1    1.69±4.14×10^-2    6.75±1.57×10^-2    8.79±0.76×10^-2
thyroid        1.15±0.72×10^-1    2.03±4.71×10^-3    1.49±0.83×10^-1    9.83±4.20×10^-2
titanic        1.39±2.20×10^0     7.70±0.84×10^-1    5.53±10.54×10^-2   6.92±6.80×10^-1
twonorm        4.61±2.42×10^-1    1.11±0.12×10^-1    1.32±4.67×10^-1    2.61±0.38×10^0
waveform       7.60±2.84×10^-1    2.63±0.73×10^-1    9.42±8.70×10^-2    3.90±1.34×10^-1

Figure 7.17: Estimated regularization constants for the classification benchmark data sets.

Dataset        KRRRB    KRRSM    n
banana         61±58    27±5     400
breast-cancer  100±36   3±1      200
diabetis       197±76   8±1      468
flare-solar    72±50    9±2      666
german         106±65   12±1     700
heart          7±19     4±2      170
image          200±0    266±82   1300
ringnorm       179±42   44±14    400
splice         200±0    86±12    1000
thyroid        12±7     15±5     140
titanic        8±4      6±2      150
twonorm        171±56   2±0      400
waveform       140±56   15±6     400

Figure 7.18: Estimated cut-off dimensions and training sample sizes for the classification benchmark data sets.

In summary, the theoretical results from Chapters 3, 4 and 6 are useful to analyze kernel methods. The proposed methods also provided useful additional information about the data sets like the dimensionality of the problem and the amount of noise present in the data.

Chapter 8

Conclusion

This thesis presents a detailed analysis of the spectral structure of the kernel matrix, and applications of these results to machine learning algorithms. The kernel matrix is a central component in virtually all kernel methods, such that detailed knowledge of its structure is of great use for both theory and practice.

The theoretical analysis of the spectral properties of the kernel matrix was guided by the central concern that the resulting bounds should actually match the behavior of the approximation errors as observed in numerical simulations. In such simulations, one can observe that small eigenvalues fluctuate much less than larger eigenvalues. Existing results failed to reflect this behavior, since the existing bounds did not depend on the eigenvalues in the right manner and did not scale appropriately.

The convergence results for the eigenvalues combined classical results from the perturbation theory of Hermitian matrices with probabilistic finite sample size bounds on the norm of certain error matrices to obtain a relative-absolute bound which is considerably tighter than previously existing bounds. Compared to the approaches based on the Courant-Fischer variational characterization of the eigenvalues, the size of the true eigenvalue enters the bound quite naturally, leading to bounds which reflect the behavior of the observed approximation errors. Being able to support this observation with a theoretical result has proved to be very valuable.

The basic relative-absolute bound is stated very generally in terms of several error matrices, and it can easily be shown to imply convergence of the eigenvalues. It is slightly more intricate to obtain actual finite sample size bounds on these errors. We have undertaken this analysis for two cases: Mercer kernels with uniformly bounded eigenfunctions, and uniformly bounded kernel functions, covering a range of relevant kernels, including, for example, the ubiquitous radial basis function kernels. These estimates showed that if the eigenvalues decay quickly enough, the absolute error term will be very small, such that the bounds become essentially relative. We moreover argued that this absolute term is realistic for eigenvalues computed on real computers using finite precision floating point architectures. Thus, in a certain sense, we achieved the goal of describing the observed behavior of the eigenvalues in two different ways: we were able to prove that the approximation errors scale with the size of the true eigenvalues and that they stagnate at a certain (very small) level.
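
Schematically, and suppressing the precise constants derived in the earlier chapters, a relative-absolute bound of the kind described here has the following shape (this is a sketch of the general form only, not the exact statement proven in the thesis):

\[
  |l_i - \lambda_i| \;\le\; \lambda_i \, C(r, n) \;+\; E(r, n),
\]

where \(\lambda_i\) is the true eigenvalue, \(l_i\) its empirical counterpart (the corresponding eigenvalue of the kernel matrix, scaled by 1/n), \(C(r, n)\) is the relative error term, \(E(r, n)\) is the absolute error term, and \(r\) denotes a truncation parameter. If the spectrum decays quickly, \(E(r, n)\) becomes very small, so the bound is essentially relative.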

Concerning the spectral projections, there exists a similar numerically observable effect which lacked a matching counterpart in theory: scalar products with eigenvectors of small eigenvalues seemed to fluctuate much less than those with eigenvectors of large eigenvalues. Here, we did not provide an independent convergence proof of our own but rather complemented an existing result with a relative-absolute envelope which again scales with the eigenvalues. We proved that the scalar products also show a nice convergence behavior: scalar products with eigenvectors of small eigenvalues which are asymptotically very small are already very small at finite sample sizes. This result in itself seems rather abstract, but it has proven to have powerful consequences (see below).

Principal component analysis directly suggests itself as a field of application for these results, mostly due to the fact that principal component analysis consists of the computation of the eigenvalues of a symmetric matrix, and in the case of kernel PCA, even of the eigenvalues of the kernel matrix itself. We were able to readily derive three results covering almost all aspects of the convergence of the estimated principal values. It should be stressed that these results do not depend on strong assumptions on the underlying probability measure. The only requirement is that the true eigenvalues decay quickly. This constraint is usually fulfilled for smooth kernel functions. First of all, we derived a purely multiplicative bound for the principal values in a finite-dimensional setting. A strength of the result lies in the fact that the convergence speed is expressed with respect to the norm of certain error matrices. Very generally, the convergence can be shown to depend on the fourth moment of the underlying measure along the principal directions. However, if additional knowledge about the distribution is available, one might be able to provide a detailed analysis of the size of the error matrix, which can then yield much faster convergence rates. Put differently, the convergence speed does not simply depend on a single parameter of the probability distribution, but on a complex object which can be studied further for the cases one is interested in.

The second and third results treat kernel PCA, a non-linear extension of principal component analysis. Using the relative-absolute bounds for the eigenvalues, we showed that kernel PCA approximates the true principal components with high precision. For kernel PCA, an interesting question is that of the effective dimension. Since the principal values usually have no special structure besides decaying at some rate, one often projects to a number of leading dimensions such that the reconstruction error becomes small enough. This error is linked to the sum of all eigenvalues except for the first few. The reconstruction error has been a natural target for the approach which proves convergence of the eigenvalues via the Courant-Fischer characterization. Using the relative-absolute perturbation bound on the eigenvalues of the kernel matrix, we were able to prove a relative-absolute bound on the reconstruction error, which scales nicely when the eigenvalues decay rapidly. This result is a significant improvement over previous results, which did not scale with the size of the eigenvalues involved.
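
To make the quantity concrete: up to normalization conventions, the empirical reconstruction error of projecting onto the leading d kernel PCA directions is the sum of the remaining eigenvalues of the centered kernel matrix, scaled by 1/n. The following minimal sketch (an illustrative setup with an RBF kernel and helper names chosen here, not code from the thesis) computes it:

```python
import numpy as np

def rbf_kernel(X, width=1.0):
    """Gaussian RBF kernel matrix k(x, x') = exp(-||x - x'||^2 / (2 * width^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * width ** 2))

def kpca_reconstruction_error(X, d, width=1.0):
    """Empirical reconstruction error of projecting onto the leading d
    kernel PCA directions: the sum of the remaining eigenvalues of the
    centered kernel matrix, scaled by 1/n."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    eigvals = np.linalg.eigvalsh(H @ rbf_kernel(X, width) @ H) / n
    eigvals = eigvals[::-1]                        # sort in decreasing order
    return eigvals[d:].sum()

# For a rapidly decaying spectrum the error drops quickly with d.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
print([round(kpca_reconstruction_error(X, d), 4) for d in (1, 2, 5, 10)])
```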

An interesting consequence of the result on the reconstruction error is that a finite sample in feature space will always be essentially contained in a low-dimensional subspace of the (possibly infinite-dimensional) feature space. This dimension does not depend on the number of samples, but rather becomes even more stable as the number of samples grows. Therefore, the general intuition that learning in feature space is hard because the data fully covers an n-dimensional subspace spanned by the data points is wrong. Indeed, while it is true that the data spans an n-dimensional subspace, only a few directions have large variance. The rôle of regularization then becomes that of adjusting the scale at which the algorithm works, such that the algorithm only sees the finite-dimensional part of the data.

In the context of supervised learning, we first studied the relation between the label information and the kernel matrix in an algorithm-independent fashion. The assumption is that the target function can be represented in terms of the kernel matrix. The crucial point here was the transformation of the label vector to its representation with respect to the eigenbasis of the kernel matrix. It then follows from the results on spectral projections that the information content of the labels, in the case of regression given by the smooth target function, is contained in the first few coefficients (when the coefficients are ordered with respect to non-increasing eigenvalues), while the noise is evenly distributed over all of the coefficients.
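
This structure is easy to reproduce numerically. The following sketch (an assumed toy setup, not code from the thesis) projects a noisy label vector onto the eigenbasis of an RBF kernel matrix; the coefficients of the smooth target decay quickly, while the noisy labels level off at the noise floor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = np.sort(rng.uniform(-3, 3, size=(n, 1)), axis=0)
y_clean = np.sinc(X[:, 0])                      # smooth target function
y = y_clean + 0.2 * rng.standard_normal(n)      # labels with additive noise

K = np.exp(-(X - X.T) ** 2 / (2 * 0.5 ** 2))    # RBF kernel matrix, width 0.5

eigvals, U = np.linalg.eigh(K)                  # eigh returns ascending order
U = U[:, ::-1]                                  # columns sorted by decreasing eigenvalue

spec_clean = U.T @ y_clean                      # label spectrum of the clean target
spec_noisy = U.T @ y                            # label spectrum of the noisy labels

print(np.round(np.abs(spec_clean[:10]), 3))     # large leading coefficients
print(np.round(np.abs(spec_clean[-10:]), 3))    # essentially zero tail
print(np.round(np.abs(spec_noisy[-10:]), 3))    # tail dominated by the noise level
```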

The essential finite-dimensionality of the object samples and the label vector, taken together, can be seen as a more direct version of the well-known fact that, at a non-zero scale, the set of all hypotheses with bounded weight vector in an infinite-dimensional space has finite VC-dimension (Evgeniou and Pontil, 1999; Alon et al., 1997).

This picture has some resemblance with that of performing a Fourier analysis of a signal with additive noise. There, the signal is also contained in some frequency band, while the noise covers all of the spectrum. The strength of the kernel approach then lies in the fact that this decomposition can be carried out over arbitrary spaces on which smooth kernels can be defined, and for all geometries of sample points. Fourier analysis is usually confined to compact rectangular domains in low dimensions.

The structure of the label vector with respect to the eigenbasis of the kernel function suggests the definition of a cut-off dimension d, which we defined as the number such that the information content of the label vector is completely contained in the first d coefficients. We showed that these cut-off dimensions can be effectively estimated by proposing two different procedures and testing them extensively on different data sets. The approach based on performing a maximum likelihood fit with a two-component model proved to be the more robust and reliable variant.
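
As an illustration, a two-component maximum-likelihood fit could look as follows. This is a hedged sketch under simple Gaussian assumptions; the exact model used in the thesis may differ. The input is the label spectrum U.T @ y from the previous sketch:

```python
import numpy as np

def estimate_cutoff_dimension(label_spectrum):
    """Hypothetical two-component maximum-likelihood estimator (a sketch, not
    necessarily the thesis's exact model): the first d coefficients are treated
    as zero-mean Gaussian with a 'signal' variance, the remaining n - d ones
    with a 'noise' variance; d is chosen to maximize the joint likelihood."""
    s = np.asarray(label_spectrum, dtype=float) ** 2
    n = len(s)
    best_d, best_ll = 1, -np.inf
    for d in range(1, n - 1):
        var_signal = max(s[:d].mean(), 1e-12)   # ML variance of the leading block
        var_noise = max(s[d:].mean(), 1e-12)    # ML variance of the tail
        # Gaussian log-likelihood, up to additive constants.
        ll = -0.5 * (d * np.log(var_signal) + (n - d) * np.log(var_noise) + n)
        if ll > best_ll:
            best_d, best_ll = d, ll
    return best_d
```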

We also discussed using the cut-off dimension estimators to perform a structural analysis of a given data set, again in an algorithm-independent fashion. It turns out that by combining the cut-off dimension estimators with a family of kernels depending on a scale parameter, one can detect structure at different scales by estimating cut-off dimensions at varying kernel widths.
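
A minimal sketch of this scale sweep, with the RBF width as the scale parameter and an estimator such as the two-component sketch above passed in as an argument (illustrative names, not the thesis code):

```python
import numpy as np

def cutoff_vs_scale(X, y, widths, estimate_cutoff_dimension):
    """For each kernel width w, build the RBF kernel matrix, project the labels
    onto its eigenbasis, and estimate the cut-off dimension. Structure at a
    given scale shows up as a change in the estimated dimension across widths."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    dims = []
    for w in widths:
        eigvals, U = np.linalg.eigh(np.exp(-d2 / (2.0 * w ** 2)))
        U = U[:, ::-1]                               # decreasing eigenvalue order
        dims.append(estimate_cutoff_dimension(U.T @ y))
    return dims
```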

Finally, we have turned to kernel ridge regression as an example of a supervised kernel method.

The advantage of kernel ridge regression is that the training step amounts to applying a matrix which is closely related to the kernel matrix. Based on our knowledge of the spectral structure of the kernel matrix, the training step can be fully decomposed and analyzed. We have seen that kernel ridge regression basically amounts to low-pass filtering of the signal. Again, the advantage over employing a Fourier decomposition is that kernel ridge regression can be painlessly extended to kernels in arbitrary dimensions. Furthermore, the basis functions in kernel ridge regression adapt themselves to the underlying probability density.
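
Concretely, writing the eigendecomposition of the kernel matrix as K = U diag(l_1, ..., l_n) U^T, the fitted values of kernel ridge regression with regularization constant sigma are K (K + sigma I)^{-1} y = U diag(l_i / (l_i + sigma)) U^T y: each coefficient of the label vector in the eigenbasis is shrunk by the factor l_i / (l_i + sigma), which is close to 1 for large eigenvalues and close to 0 for small ones. A minimal sketch (names chosen for illustration):

```python
import numpy as np

def krr_fitted_values(K, y, sigma):
    """Kernel ridge regression viewed as a low-pass filter: the label spectrum
    U^T y is multiplied component-wise by l_i / (l_i + sigma) and mapped back."""
    eigvals, U = np.linalg.eigh(K)
    shrinkage = eigvals / (eigvals + sigma)
    return U @ (shrinkage * (U.T @ y))
```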

The free parameter of kernel ridge regression is the regularization constant. Based on the analysis, it seems that this regularization constant should be chosen according to the cut-off dimension. The resulting method was called the spectrum method. In extensive experiments, both for regression and classification, we were able to show that the spectrum method performs very competitively with existing state-of-the-art methods. While we have to admit that there is really no shortage of good model selection methods, these results show that the theoretical analysis and the insights into kernel ridge regression are relevant enough to yield a competitive model selection method which uses only the structural insights into the spectrum of the label vector.
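
One plausible instantiation of this idea, given purely for illustration and not necessarily the exact rule used in the experiments: choose sigma so that the low-pass filter of the previous sketch passes roughly the first d dimensions, where d is the estimated cut-off dimension, for instance by setting sigma to the d-th eigenvalue of the kernel matrix (so the shrinkage factor at dimension d is about 1/2).

```python
import numpy as np

def spectrum_method_sigma(K, d):
    """Hypothetical regularization choice driven by the cut-off dimension d:
    with sigma = l_d, the shrinkage factor l_i / (l_i + sigma) stays above 1/2
    for the first d components and falls off quickly afterwards."""
    eigvals = np.linalg.eigvalsh(K)[::-1]   # eigenvalues in decreasing order
    return eigvals[d - 1]
```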

Future Directions

We believe that some of the results have interesting theoretical implications which we have only briefly touched upon.

We have shown that both the object samples and the label vector have an essentially finite-dimensional structure at a given scale, with the dimension not depending on the sample size.

The question is whether this characterization can be used to explain in a more direct fashion, without involving VC-dimension arguments, why learning in feature spaces works well.

Closely linked to this question is whether the effective dimension in feature space and the cut-off dimension of the label vector can be used as some form of a priori complexity measure for data sources. An existing problem with data-dependent error estimates lies in the fact that the dependency on the data is often constructed in such a way that the complexity of the data set only becomes apparent after the learning has taken place, for example by realizing a certain margin.

Such an argument has the drawback that one cannot ensure a priori that an algorithm will perform well. On the other hand, the effective dimension of the data set and the cut-off dimension of the labels depend only on the chosen kernel, which is a large step towards an a priori complexity measure. In particular, because these dimensions are already proven to converge, it is even possible to estimate these quantities effectively.

Of course, this question ultimately has to lead to generalization error bounds for kernel ridge regression. The question thus is: given that the cut-off dimension of the data is known, can we bound the generalization error of kernel ridge regression? In principle, we can already estimate the size of the in-sample error between the fitted function and the target function. In order to bound the out-of-sample error, one has to consider how well the Nyström extrapolates of the eigenvectors predict. These could be handled using estimates of their Lipschitz constants. This way, one could derive an estimate of the generalization error which is directly linked to how the algorithm works, in contrast to using some abstract capacity argument based on VC-theory. This approach could have the added benefit of obtaining a better intuitive understanding of how the algorithm works based on the theoretical analysis, in contrast to capacity arguments which tend to consider the algorithm as a black box which simply selects some solution from a hypothesis set in a non-transparent fashion.

In my opinion, it proved possible and rewarding to perform detailed analyses of specific algorithms and objects. In the best case, this can be both interesting and relevant. I’d like to close this thesis with the following sentence which I borrowed from the end of Bauer (1990).

On ne finit pas un œuvre, on l’abandonne. (“One does not finish a work, one abandons it.”, in French)

(Gustave Flaubert)

Bibliography

754-1985, I. S. (1985). IEEE Standard for Binary Floating-Point Arithmetic. IEEE Computer Society.

Abramowitz, M. and Stegun, I. A., editors (1972). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing, chapter 8, “Legendre Functions”, and chapter 22, “Orthogonal Polynomials”, pages 331–339, 771–802. Dover, New York.

Ahrendt, T. (1999). Schnelle Berechnung der Exponentialfunktion auf hohe Genauigkeit. PhD thesis, Mathematisch-Naturwissenschaftliche Fakultät, Universität Bonn. (“Fast Computation of the Exponential Function to High Precision”, in German).

Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631.

Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34:122–148.

Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis. Wiley-Interscience, 3rd edition.

Anselone, P. M. (1971). Collectively Compact Operator Approximation Theory and Applications to Integral Equations. Prentice-Hall Series in Automatic Computation. Prentice-Hall, Englewood Cliffs, New Jersey.

Atkinson, K. E. (1997). The Numerical Solution of Integral Equations of the Second Kind. Cambridge University Press.

Baker, C. T. H. (1977). The numerical treatment of integral equations. Clarendon Press, Oxford.

Bauer, H. (1990). Wahrscheinlichkeitstheorie. de Gruyter Lehrbuch. de Gruyter, Berlin, New York, 4th edition. (“Probability theory”, in German).

Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., and Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16:2197–2219.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167.

Chatterjee, C., Roychowdhury, V. P., and Chong, E. K. P. (1998). On relative convergence properties of principal component analysis algorithms. IEEE Transactions on Neural Networks, 9(2).

Cover, T. M. (1969). Learning in pattern recognition. In Watanabe, S., editor, Methodologies of Pattern Recognition, pages 111–132, New York. Academic Press.

Cristianini, N. and Shawe-Taylor, J. (2000). Support Vector Machines and other kernel-based learning methods. Cambridge University Press.


Dauxois, J., Pousse, A., and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: Some applications to statistical inference. Journal of Multivariate Analysis, 12:136–154.

Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7:1–46.

Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer Verlag.

Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90:1200–1224.

Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26(3):879–921.

Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society, 57:301–369.

Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification. John Wiley & Sons, 2nd edition.

Eisenstat, S. C. and Ipsen, I. C. F. (1994). Relative perturbation bounds for eigenspaces and singular vector subspaces. In 5th SIAM Conference on Applied Linear Algebra, pages 62–65, Philadelphia. SIAM.

Engl, H. W. (1997). Integralgleichungen. Springer-Verlag. (“Integral equations”, in German).

Evgeniou, T. and Pontil, M. (1999). On the Vγ dimension for regression in reproducing kernel Hilbert spaces. In Proceedings of Algorithmic Learning Theory, Tokyo, Japan.

Girosi, F., Jones, M., and Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269.

Goldberg, P. W., Williams, C. K. I., and Bishop, C. M. (1998). Regression with input-dependent noise: A Gaussian process treatment. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems, volume 10. Lawrence Erlbaum.

Golub, G. H. and van Loan, C. F. (1996). Matrix Computations. Johns Hopkins University Press, 3rd edition.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.

Herbrich, R. (2002). Learning Kernel Classifiers. MIT Press.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30.

Hoffman, A. and Wielandt, H. (1953). The variation of the spectrum of a normal matrix. Duke Math. J., 29:37–38.

Horn, R. A. and Johnson, C. R. (1985). Matrix Analysis. Cambridge University Press.

Jolliffe, I. T. (2002). Principal Component Analysis. Springer-Verlag New York Inc., 2nd edition.

Kato, T. (1976). Perturbation Theory for Linear Operators. Springer-Verlag Berlin, 2nd edition.

Koltchinskii, V. and Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli, 6(1):113–167.

Koltchinskii, V. I. (1998). Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227.

Kotel’nikov, V. A. (1933). On carrying capacity of “ether” and wire in electro-communications. Material for the First All-Union Conference on Questions of Communications. Izd. Red. Upr. Svyazi RKKA (Moscow). (in Russian).

Lepskii, O. V. (1990). On one problem of adaptive estimation in white Gaussian noise. Theory of Probability and its Applications, 35:454–466.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148–188. Cambridge University Press.

Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.

Nyström, E. J. (1930). Über die praktische Auflösung von Integralgleichungen mit Anwendung auf Randwertaufgaben. Acta Mathematica, 54:185–204.

Pestman, W. R. (1998). Mathematical Statistics. Walter de Gruyter, Berlin.

Rätsch, G., Onoda, T., and Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3):287–320. Also NeuroCOLT Technical Report NC-TR-1998-021.

Rifkin, R. (2002). Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis, Massachusetts Institute of Technology.

Schmidt, R. O. (1986). Multiple emitter location and signal parameter estimation. IEEE Trans-actions on Antennas and Propagation, AP-34(3).

Schölkopf, B. (1997). Support Vector Learning. PhD thesis, Technische Universität Berlin.

Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K.-R., Rätsch, G., and Smola, A. J. (1999). Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319.

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press.

Shannon, C. (1949). Communication in the presence of noise. Proceedings of the Institute of Radio Engineers, pages 10–21.

Shawe-Taylor, J., Cristianini, N., and Kandola, J. (2002a). On the concentration of spectral properties. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, Cambridge, MA. MIT Press.

Shawe-Taylor, J., Williams, C., Cristianini, N., and Kandola, J. (2002b). On the eigenspectrum of the Gram matrix and its relationship to the operator eigenspectrum. In N. Cesa-Bianchi et al., editor, ALT 2002, volume 2533 of Lecture Notes in Artificial Intelligence, pages 23–40. Springer-Verlag Berlin Heidelberg.

Shawe-Taylor, J. and Williams, C. K. I. (2003). The stability of kernel principal components analysis and its relation to the process eigenspectrum. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems, volume 15.

Shawe-Taylor, J., Williams, C. K. I., Cristianini, N., and Kandola, J. (2004). On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA. Technical Report NC2-TR-2003-143, Department of Computer Science, Royal Holloway, University of London. Available from http://www.neurocolt.com/archive.html.