
As an extension to their book, Rasmussen and Williams [2006] made the GPML (Gaussian processes for machine learning) code publicly available⁴. The toolbox contained code for Gaussian regression and approximate classification using EP and LA. About a year later, the code was refactored and improved so that inference and model specification were kept apart⁵. In addition to EP and LA, named approxEP.m and approxLA.m in the code, implementations of all of the approximation methods mentioned in this chapter can be downloaded and used⁶ as an extension to the GPML code:

• approxKL.m – Kullback-Leibler, section 4.5,

• approxVB.m – individual variational bounds, section 4.6,

• approxFV.m – factorial variational, section 4.7, and

• approxLR.m – label regression, section 4.8.

Sparse and/or online approximation methods as introduced in section 4.2.2 include

• approxIVM.m – informative vector machine,

• approxOLEP.m – online EP, and

• approxSO.m – sparse online approximation.

Mainly for educational reasons, we also provide some (equivalent) variants of EP from section 4.4.1:

• approxEC.m – expectation consistent inference,

• approxTAP.m – ADATAP, and

• approxTAPnaive.m – naive ADATAP.

Release 3.1 of the GPML code, as described in section 4.11.1 [Rasmussen and Nickisch, 2010], is available as an mloss.org project⁷ or from the Gaussian process website⁸. The new implementation is completely generic, with simple interfaces for an extended set of covariance and likelihood functions. We also support arbitrary mean functions and provide full compatibility with GNU Octave. Much energy was spent to properly disentangle covariance, likelihood and mean hyperparameters. Again, special care has been taken to avoid numerical problems, e.g. safe likelihood evaluations for extreme inputs and stable matrix operations, as described in the following.

⁴ http://www.gaussianprocess.org/gpml/code/matlab/release/gpml-matlab-v1.3-2006-09-08.tar.gz

⁵ http://www.gaussianprocess.org/gpml/code/matlab/release/gpml-matlab-v2.0-2007-06-25.tar.gz

⁶ The extension is available at http://www.kyb.mpg.de/~hn/approxXX.tar.gz.

⁷ http://mloss.org/software/view/263/

⁸ The current version can be obtained from http://www.gaussianprocess.org/gpml/code/matlab/doc/.

Stable matrix operations

More concretely, to properly handle situations where $K$ is close to singular, we use the well-conditioned matrix $B$⁹ and its Cholesky decomposition to calculate $V = (K^{-1} + W)^{-1}$ and $\mathbf{k}_*^\top C \mathbf{k}_* = \mathbf{k}_*^\top (K + W^{-1})^{-1} \mathbf{k}_*$. The case of $W$¹⁰ having negative components can be handled by using the (slower) LU decomposition of the non-symmetric (but well-conditioned) matrix $A$ instead, as summarised in the following table.

well-conditioned matrix                                  $C = (K + W^{-1})^{-1}$                                                   $V = (K^{-1} + W)^{-1}$   $\ln|KW + I|$
$B = W^{\frac12} K W^{\frac12} + I = LL^\top$            $W^{\frac12} B^{-1} W^{\frac12} = W^{\frac12} L^{-\top} L^{-1} W^{\frac12}$   $K - KCK$                 $2\cdot\mathbf{1}^\top \ln(\mathrm{dg}(L))$
$A = KW + I = LU$                                        $W A^{-1} = W U^{-1} L^{-1}$                                              $K - KCK$                 $\mathbf{1}^\top \ln|\mathrm{dg}(U)|$

Table 4.3: Numerically stable matrix operations in GP classification

The posterior mean $\mathbf{m}$ is represented in terms of $\boldsymbol\alpha = K^{-1}\mathbf{m}$ to avoid multiplications with $K^{-1}$ and to facilitate predictions. As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters. The KL method uses Gauss-Hermite quadrature; we did not notice problems stemming therefrom. The FV and TAP methods work very reliably, although we had to add a small ($10^{-6}$) ridge for FV to regularise $K$.
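
To make the Cholesky branch of Table 4.3 concrete, the following minimal MATLAB/Octave sketch computes $B$, $\ln|KW+I|$ and a predictive variance $k_{**} - \mathbf{k}_*^\top C \mathbf{k}_*$. It assumes a covariance matrix K, a column vector W of non-negative likelihood curvatures and test covariances ks and kss; the variable names are illustrative and not part of the toolbox interface.

    n   = size(K,1);                 % number of training points
    sW  = sqrt(W);                   % diagonal of W^(1/2) as a column vector
    B   = eye(n) + (sW*sW').*K;      % B = W^(1/2)*K*W^(1/2) + I, all eigenvalues >= 1
    L   = chol(B)';                  % lower Cholesky factor, B = L*L'
    lndetKWI = 2*sum(log(diag(L)));  % ln|K*W + I| = 2*1'*ln(dg(L))
    v   = L\(sW.*ks);                % solve L*v = W^(1/2)*k_*
    fs2 = kss - v'*v;                % k_** - k_*'*C*k_* with C = (K + W^(-1))^(-1)

Since all eigenvalues of $B$ are at least 1, the Cholesky factorisation and the log determinant remain numerically safe even when $K$ itself is close to singular.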

Large scale computations

The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques to deal with large numbers of training examples $n$. Hence, all discussed inference algorithms hinge on $K$ being not too big, since matrix decompositions have complexity $\mathcal{O}(n^3)$. If the dataset size $n$ grows beyond $5\cdot 10^3$, exact matrix computations become prohibitive rather quickly. By means of an approximation to the covariance matrix

$K \approx \hat K := V R V^\top + D, \qquad V \in \mathbb{R}^{n\times r},\; R \in \mathbb{R}^{r\times r},\; D = \mathrm{dg}(\mathbf{d}),$

which has to be computed before the inference procedure, we can reduce the computational cost so that LA and VB become scalable. Examples include the Nyström approximation [Smola and Schölkopf, 2000, Williams and Seeger, 2001] and the incomplete Cholesky decomposition [Fine and Scheinberg, 2001]. Matrix-vector multiplications (MVMs) with $\hat K$ cost $\mathcal{O}(r\cdot n)$ instead of $\mathcal{O}(n^2)$, and MVMs with $\hat K^{-1}$ can be computed using the matrix inversion lemma

$K^{-1} \approx \hat K^{-1} = D^{-1} - D^{-1} V \left(R^{-1} + V^\top D^{-1} V\right)^{-1} V^\top D^{-1}$

at a cost of $\mathcal{O}(r\cdot n)$, as well.
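
As an illustration, here is a minimal MATLAB/Octave sketch of such MVMs; the factors V, R and the diagonal vector d are assumed to be given, e.g. by a Nyström or incomplete Cholesky routine, and the handle names are illustrative.

    r = size(V,2);                                    % rank of the approximation
    mvm_Khat  = @(x) V*(R*(V'*x)) + d.*x;             % Khat*x in O(r*n)
    S = R\eye(r) + V'*bsxfun(@times, 1./d, V);        % r x r inner matrix R^(-1) + V'*D^(-1)*V
    mvm_iKhat = @(x) x./d - (V*(S\(V'*(x./d))))./d;   % Khat^(-1)*x via the matrix inversion lemma

Setting up S costs $\mathcal{O}(n\cdot r^2)$ once; afterwards every MVM with $\hat K$ or $\hat K^{-1}$ costs $\mathcal{O}(r\cdot n)$.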

4.11.1 The gpml toolbox

We provide a stable and modular implementation, verified by test cases and unit tests, that comes with both a user and a technical documentation¹¹. The code is fully compatible with Matlab 7.x¹² and GNU Octave 3.2.x¹³. A Gaussian process model requires the specification of a Gaussian process prior through a mean and a covariance function, as well as a likelihood. Model fitting and prediction depend on an approximate inference algorithm computing $Q(\mathbf{f})$ and $Q(y_*)$, as summarised in the following table.

The GPML toolbox contains exactly these objects: model fitting using the marginal likelihood gradient $\partial L/\partial\theta$ and prediction work in a fully generic way once the model is specified. In the following, we list some of the implemented objects.

⁹ All eigenvalues $\lambda$ of $B$ satisfy $1 \le \lambda \le 1 + \frac{n}{4}\max_{ij} K_{ij}$, thus $B^{-1}$ and $|B|$ can be safely computed.

¹⁰ This happens for non-log-concave likelihoods like the Student's t likelihood. Formally, negative values in $W$ correspond to negative variances. Although negative variances do not have a probabilistic meaning, they still allow the approximation to locally imitate the non-Gaussian likelihoods so that the approximate posterior is most similar to the exact posterior.

1) Gaussian process  $\mathbf{f} \sim \mathcal{GP}(m_\phi, k_\psi)$, i.e. $P_{\phi,\psi}(\mathbf{f}) = \mathcal{N}(\mathbf{f}\,|\,\mathbf{m}_\phi, K_\psi)$
   a) mean $m_\phi(\mathbf{x})$
   b) covariance $k_\psi(\mathbf{x},\mathbf{x}')$
2) Likelihood  $\prod_{i=1}^n P_\rho(y_i|f_i)$
3) Approximate inference  $Q(\mathbf{f}) \approx P(\mathbf{f}|\mathbf{y}) \propto P_{\phi,\psi}(\mathbf{f})\,P_\rho(\mathbf{y}|\mathbf{f})$
4) Fitting  $\theta = (\phi,\psi,\rho)$, $\theta^\star = \arg\max_\theta L(\theta)$ with $L(\theta) \approx \int P_{\phi,\psi}(\mathbf{f})\,P_\rho(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f}$
5) Prediction  $Q(y_*)$

Table 4.4: GPML toolbox building blocks

1a) Mean functions

In the GPML toolbox, a mean function needs to implement evaluation $\mathbf{m} = m_\phi(X)$ and first derivatives $\mathbf{m}_i = \frac{\partial}{\partial\phi_i} m_\phi(X)$. We offer simple and composite mean functions.

• simple functions: zero $m(\mathbf{x}) = 0$, const $m(\mathbf{x}) = c$, linear $m(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$

• composite functions: sum $m(\mathbf{x}) = \sum_j m_j(\mathbf{x})$, prod $m(\mathbf{x}) = \prod_j m_j(\mathbf{x})$, pow $m(\mathbf{x}) = m_1(\mathbf{x})^d$

This modular specification allows us to work with affine mean functions $m(\mathbf{x}) = c + \mathbf{a}^\top\mathbf{x}$ or polynomials $m(\mathbf{x}) = (c + \mathbf{a}^\top\mathbf{x})^2$, as sketched below.
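
A minimal sketch of how such a composite mean is specified and evaluated, following the string-based convention of the code example at the end of this section (the inputs X and the values of a and c are illustrative):

    meanfunc = {'meanSum', {'meanLinear', 'meanConst'}};   % affine mean m(x) = a*x + c
    a = 2; c = 1;
    hyp.mean = [a; c];                                     % hyperparameters phi of the parts, stacked
    m = feval(meanfunc{:}, hyp.mean, X);                   % evaluate m_phi(X) at the inputs X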

1b) Covariance functions

Similarly to the mean functions, we provide a whole algebra of covariance functions. Again, the interface is simple since only evaluation of the full covariance matrix $K = k_\psi(X)$, its derivatives $K_i = \frac{\partial}{\partial\psi_i} k_\psi(X)$ and the cross terms $\mathbf{k}_* = k_\psi(X,\mathbf{x}_*)$ and $k_{**} = k_\psi(\mathbf{x}_*,\mathbf{x}_*)$ for prediction are required (a short sketch follows after the list below). Besides a long list of simple covariance functions, we also offer a variety of composite covariance functions.

• simple functions: linear, constant, ridge, Matérn, squared exponential, polynomial, periodic, MKL, neural network, finite support

• composite functions:

  sum, prod:  $k(\mathbf{x},\mathbf{x}') = \sum_j k_j(\mathbf{x},\mathbf{x}')$,  $k(\mathbf{x},\mathbf{x}') = \prod_j k_j(\mathbf{x},\mathbf{x}')$
  masked:     $k(\mathbf{x}_I,\mathbf{x}'_I)$, masking index $I \subseteq [1, 2, .., D]$, $\mathbf{x}\in\mathbb{R}^D$
  scaling:    $k(\mathbf{x},\mathbf{x}') = \sigma_f^2\, k_0(\mathbf{x},\mathbf{x}')$
  additive:   $k(\mathbf{x},\mathbf{x}') = \sum_{|I|=d\in\mathcal{D}} k(\mathbf{x}_I,\mathbf{x}'_I)$, index degree set $\mathcal{D}$

Both the mean and the covariance functions are easily extensible.
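
As a short sketch of the composition mechanism, the following lines build a squared exponential covariance plus a diagonal ridge term using the same feval-based convention as the code example at the end of this section (the name covNoise for the ridge term and the input matrix X are assumptions for illustration):

    cov = {'covSum', {'covSEiso', 'covNoise'}};   % squared exponential plus a ridge
    ell = 0.7; sf = 1; sr = 0.1;                  % length scale, signal scale, ridge scale
    hyp.cov = log([ell; sf; sr]);                 % hyperparameters psi of the parts, stacked
    K = feval(cov{:}, hyp.cov, X);                % evaluate the full covariance matrix k_psi(X)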

2) Likelihoods

The GPML toolbox approximate inference engine does not explicitly distinguish between classification and regression: for any choice of the likelihood $P_\rho(y_i|f_i)$, the toolbox uses the same code in the inference step. The following table enumerates all (currently) implemented likelihood functions and their respective parameter set $\rho$. See figure 2.2 for a graphical illustration and the expressions for $P_\rho(y_i|f_i)$.


¹¹ http://www.gaussianprocess.org/gpml/code/

¹² The MathWorks, http://www.mathworks.com/

¹³ The Free Software Foundation, http://www.gnu.org/software/octave/

$P_\rho(y_i|f_i)$   regression $y_i \in \mathbb{R}$                                                         classification $y_i \in \{\pm 1\}$
name                Gaussian          logistic          Laplacian         Student's t                       cum. Gaussian    cum. logistic
$\rho =$            $\{\ln\sigma\}$   $\{\ln\sigma\}$   $\{\ln\sigma\}$   $\{\ln(\nu-1), \ln\sigma\}$       $\emptyset$      $\emptyset$

Table 4.5: Likelihood functions implemented in the GPML toolbox
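
A minimal sketch of evaluating one of these likelihoods through the generic interface, mirroring the feval call in the code example at the end of this section (the vectors y and f and the value of sn are illustrative):

    lik = 'likLaplace'; sn = 0.2;        % Laplacian likelihood with scale sigma
    hyp.lik = log(sn);                   % rho = {ln sigma}, cf. Table 4.5
    lp = feval(lik, hyp.lik, y, f);      % evaluate the log likelihood ln P_rho(y_i|f_i)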

3) Approximate inference methods

In addition to exact inference (only possible for Gaussian likelihood), we have three major approximate inference methods implemented in the toolbox: expectation propagation (section 4.4 and chapter 2.5.10), Laplace approximation (section 4.3 and chapter 2.5.6) and variational Bayes (section 4.6 and chapter 2.5.9). The following table lists all possible combinations of likelihood and inference algorithm. Note that any choice of mean and covariance function is allowed.

likelihood \ inference        exact    EP    Laplace    variational Bayes
Gaussian                        X      X        X
logistic                               X        X              X
Laplacian                              X                       X
Student's t                                     X              X
cumulative Gaussian                    X        X
cumulative logistic                    X        X              X

Table 4.6: Likelihood ↔ inference compatibility in the GPML toolbox

Expectation propagation for Student's t likelihoods is inherently unstable due to the lack of log-concavity. The Laplace approximation for Laplace likelihoods is not sensible because, at the mode, the curvature and the gradient can be undefined due to the non-differentiable peak of the Laplace distribution. Special care has been taken for the non-convex optimisation problem imposed by the combination of Student's t likelihood and Laplace approximation. Finally, the (convex) lower bounding approach by Gaussian potentials of variable width is problematic for Gaussian and cumulative Gaussian likelihoods because they admit only certain widths.

Code example

Due to the modular structure of the code, specification of a full GP model and model fitting can be done in less than ten lines of code as illustrated by the following example.

[xtr, xte, ytr, yte] = read_data;                       % train and test data

% 1) SET UP THE GP
cov  = {'covSEiso'};  sf = 1;  ell = 0.7;               % squared exp. covariance
mean = {'meanSum', {'meanLinear', 'meanConst'}};  a = 2;  b = 1;  % a*x+b
lik  = 'likLaplace';  sn = 0.2;                         % sparse Laplace likelihood
hyp.mean = [a; b];  hyp.cov = log([ell; sf]);  hyp.lik = log(sn); % hyp
inf  = 'infEP';                                         % inference method is expectation propagation

% 2) LEARN, i.e. MAX. MARGINAL LIKELIHOOD w.r.t. hyp
Ncg = 50;                          % number of conjugate gradient steps for optimisation
hyp = minimize(hyp, 'gp', -Ncg, inf, mean, cov, lik, xtr, ytr);

% 3) PREDICT
[ymu, ys2] = gp(hyp, inf, mean, cov, lik, xtr, ytr, xte);

K  = feval(cov{:},  hyp.cov,  xtr);   % evaluate covariance matrix
m  = feval(mean{:}, hyp.mean, xtr);   % evaluate mean vector
lp = feval(lik, hyp.lik, ytr, ftr);   % evaluate log likelihood
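
Because the interface is fully generic, other combinations from Table 4.6 only require swapping the corresponding strings. For instance, the following sketch refits the same Laplace-likelihood model with variational Bayes instead of EP, reusing the variables from the listing above and assuming the variational Bayes routine is named 'infVB' in analogy to 'infEP':

    inf = 'infVB';                                                   % variational Bayes inference
    hyp = minimize(hyp, 'gp', -Ncg, inf, mean, cov, lik, xtr, ytr);  % refit the hyperparameters
    [ymu, ys2] = gp(hyp, inf, mean, cov, lik, xtr, ytr, xte);        % predictive mean and variance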