
We have shown that a frequently used variational relaxation to Bayesian inference in super-Gaussian generalised linear models is convex if and only if the posterior is log-concave: variational inference is convex whenever MAP estimation in the same model is. The technique covers a wide class of models, ranging from robust regression and classification to sparse linear modelling, and complements the large body of work on efficient point estimation in sparse linear models. Our theoretical insights settle a long-standing question in approximate variational inference for continuous variable models and sharpen the relationship between sparse estimation and sparse inference.

Further, we have developed a scalable double loop minimisation algorithm that runs orders of magnitude faster than previous coordinate descent methods, extending the scope of the Bayesian design methodology to large scales. This is achieved by decoupling the criterion and using ideas from concave-convex programming. The computational effort is reduced to fast algorithms known from estimation and numerical mathematics, exploiting fast MVMs with the structured matrices X and B. Our generic implementation can be run with any configuration of super-Gaussian, log-concave potentials using simple scalar minimisations, without any heuristics to be tuned.

7 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

[Figure 3.7: four panels plotting error percentage against # of data points for the designs Infogain, Uncertainty, Random and Full; panels: a9a Gaussian (n=123), a9a Laplacian (n=123), realsim Gaussian (n=20,958), rcv1 Gaussian (n=42,736).]

Figure 3.7: Classification errors for different design scores

Performance of information gain and classifier uncertainty versus random sampling (results on the full training set are also shown). We started the design phase after 100, 100, 500, 800 randomly drawn initial cases respectively; all remaining training cases were candidates. The prior variance was set to σ² = 1 in all cases, and τ_sig = 1, 1, 3, 3 respectively. k = 80, 80, 750, 750 Lanczos vectors were computed for outer loop updates/candidate scoring. For a9a, we used design blocks of size K = 3, and K = 20 for the others.

From a graphical model perspective, our method reduces approximate inference in non-Gaussian (continuous variable) Markov random fields (MRFs) to repeated computations in Gaussian MRFs. In this context, we especially emphasise the importance of Gaussian marginal variance computations by the Lanczos algorithm. The considerable literature on Gaussian MRF techniques [Malioutov et al., 2006a,b] can be put to new use with our relaxation.
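As a concrete illustration of the Lanczos variance computations emphasised above, the sketch below estimates diag(A⁻¹) for a symmetric positive definite matrix A that is only accessible through MVMs. This is a minimal sketch under stated assumptions: the function name, the full reorthogonalisation, the single random start vector and the plain NumPy formulation are illustrative choices, not the exact routine used in our implementation.

```python
import numpy as np

def lanczos_variances(mvm_A, n, k, rng=None):
    """Estimate the marginal variances diag(A^{-1}) of a Gaussian with SPD
    precision matrix A, accessed only via mvm_A(v) = A @ v, using k Lanczos
    steps with full reorthogonalisation. The estimate grows monotonically
    with k and becomes exact for k = n."""
    rng = np.random.default_rng() if rng is None else rng
    Q = np.zeros((n, k))                              # orthonormal Lanczos basis
    alpha, beta = np.zeros(k), np.zeros(k)            # tridiagonal coefficients
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    q_prev, beta_prev = np.zeros(n), 0.0
    for j in range(k):
        Q[:, j] = q
        r = mvm_A(q) - beta_prev * q_prev             # one MVM per iteration
        alpha[j] = q @ r
        r -= alpha[j] * q
        r -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ r)      # full reorthogonalisation
        beta[j] = np.linalg.norm(r)
        if beta[j] < 1e-12:                           # Krylov subspace exhausted
            Q, alpha, beta = Q[:, :j + 1], alpha[:j + 1], beta[:j + 1]
            break
        q_prev, beta_prev, q = q, beta[j], r / beta[j]
    # Tridiagonal T = Q^T A Q; approximate A^{-1} by Q T^{-1} Q^T
    m = len(alpha)
    T = np.diag(alpha) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    L = np.linalg.cholesky(T)                         # T = L L^T
    V = np.linalg.solve(L, Q.T)                       # Q T^{-1} Q^T = V^T V
    return np.einsum('ij,ij->j', V, V)                # its diagonal
```

In the generalised linear model setting, mvm_A would be assembled from fast MVMs with X and B, for instance for a precision matrix of the form A = σ⁻²X⊤X + B⊤Γ⁻¹B; since the Lanczos approximation increases monotonically towards the true variances, it systematically underestimates them for small k.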

An interesting direction for future work is to understand what makes the chosen variational relaxation particularly amenable to a scalable algorithm, and to develop scalable variants of other approximate inference techniques.

Chapter 4

Gaussian Process Classification

We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both the quality of the predictive distributions and the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters), and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: the expectation propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches.

Note that all derived inference algorithms are special cases of the generalised linear model framework of sections 2.3 and 2.4, obtained by setting σ = 1, B = I, γ = σ_n² and formally substituting X⊤y → y and X⊤X → K⁻¹; all analytical properties derived in chapter 3 therefore carry over.

The exposition is a revised and extended version of Nickisch and Rasmussen [2008]; details about the code are taken from Rasmussen and Nickisch [2010], http://mloss.org/software/view/263/ and http://gaussianprocess.org/gpml/code/.

We start the chapter by introducing Gaussian processes in section 4.1 and show how they can be used in probabilistic classification models in section 4.2. Next, each of the sections 4.3, 4.4, 4.5, 4.6 and 4.8 describes a particular deterministic approximate inference method; the relations between them are reviewed in section 4.9. A sampling approach to approximate inference serving as gold standard is presented in section 4.10. Numerical implementation issues are discussed in section 4.11. We then empirically compare the approximate inference algorithms with each other and with the gold standard in section 4.12, and draw an overall conclusion in section 4.13.

4.1 Introduction

Gaussian processes (GPs) can conveniently be used to specify prior distributions for Bayesian inference. In the case of regression with Gaussian noise, inference can be done in closed form, since the posterior is again a GP. For non-Gaussian likelihoods, such as in binary classification, exact inference is analytically intractable.
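For concreteness, the closed-form computation for GP regression with Gaussian noise can be sketched as follows; the function name gp_regression_posterior and the kernel callable are illustrative, and the Cholesky-based formulation is simply the numerically standard way to write these equations, not tied to any particular implementation discussed in this chapter.

```python
import numpy as np

def gp_regression_posterior(X, y, Xs, kernel, sigma_n):
    """Closed-form GP regression: predictive mean and variance at test inputs Xs.
    kernel(A, B) must return the matrix of covariances k(a_i, b_j)."""
    K = kernel(X, X) + sigma_n**2 * np.eye(len(X))    # noisy training covariance
    Ks = kernel(X, Xs)                                # train/test covariances
    Kss = kernel(Xs, Xs)                              # test covariance
    L = np.linalg.cholesky(K)                         # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha                               # k_*^T (K + s^2 I)^{-1} y
    V = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.einsum('ij,ij->j', V, V)  # k_** - k_*^T (K + s^2 I)^{-1} k_*
    return mean, var

# Example usage with a hypothetical RBF kernel (unit lengthscale):
# rbf = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :])**2).sum(-1))
# mu, s2 = gp_regression_posterior(Xtrain, ytrain, Xtest, rbf, sigma_n=0.1)
```

The returned quantities are the standard predictive mean k*⊤(K + σ_n²I)⁻¹y and variance k(x*, x*) − k*⊤(K + σ_n²I)⁻¹k* for each test input.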

One prolific line of attack is based on approximating the non-Gaussian posterior with a tractable Gaussian distribution. One might think that finding such an approximating GP is a well-defined problem with a largely unique solution. However, we find no fewer than three different types of solution in the recent literature: the Laplace approximation (LA) [Williams and Barber, 1998], expectation propagation (EP) [Minka, 2001a] and Kullback-Leibler divergence (KL) minimisation [Opper and Archambeau, 2009], comprising variational bounding (VB) [Gibbs and MacKay, 2000, Jaakkola and Jordan, 1996] as a special case. Another approach is based on a factorial approximation, rather than a Gaussian one [Csató et al., 2000].


Practical applications reflect the richness of approximate inference methods: LA has been used for sequence annotation [Altun et al., 2004] and prostate cancer prediction [Chu et al., 2005], EP for affect recognition [Kapoor and Picard, 2005], VB for weld cracking prognosis [Gibbs and MacKay, 2000], label regression (LR) serves for object categorisation [Kapoor et al., 2007], and MCMC sampling is applied to rheumatism diagnosis by Schwaighofer et al. [2003].

Brain computer interfaces [Zhong et al., 2008] even rely on several (LA, EP, VB) methods.

We compare these different approximations and provide insights into the strengths and weaknesses of each method, extending the work of Kuss and Rasmussen [2005] in several directions: we cover many more approximation methods (VB, KL, FV, LR), put all of them in a common framework, provide generic implementations dealing with both the logistic and the cumulative Gaussian likelihood functions, and clarify the aspects of the problem causing difficulties for each method. We derive Newton's method for KL and VB. We show how to accelerate MCMC simulations. We highlight numerical problems, comment on computational complexity and supply runtime measurements based on experiments under a wide range of conditions, including different likelihood and covariance functions. We provide deeper insights into the methods' behaviour by systematically linking them to each other. Finally, we review the tight connections to methods from the statistical physics literature, including the TAP approximation and TAPnaive.

The quantities of central importance are the quality of the probabilistic predictions and the suitability of the approximate marginal likelihood for selecting parameters of the covariance function (hyperparameters). The marginal likelihood for any Gaussian approximate posterior can be lower bounded using Jensen’s inequality, but the specific approximation schemes also come with their own marginal likelihood approximations.
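Concretely, writing q(f) for any Gaussian approximate posterior and Z for the marginal likelihood, the Jensen bound referred to here takes the standard form

```latex
\log Z \;=\; \log \int p(\mathbf{y}\mid \mathbf{f})\, p(\mathbf{f})\, \mathrm{d}\mathbf{f}
       \;=\; \log \mathbb{E}_{q}\!\left[ \frac{p(\mathbf{y}\mid \mathbf{f})\, p(\mathbf{f})}{q(\mathbf{f})} \right]
       \;\ge\; \mathbb{E}_{q}\!\left[ \log p(\mathbf{y}\mid \mathbf{f}) \right]
             \;-\; \mathrm{KL}\!\left[\, q(\mathbf{f}) \,\big\|\, p(\mathbf{f}) \,\right],
```

with equality if and only if q coincides with the exact posterior.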

We are able to draw clear conclusions. Whereas every method has good performance under some circumstances, only a single method gives consistently good results. We are able to theoretically corroborate our experimental findings; together this provides solid evidence and guidelines for choosing an approximation method in practice.