
As an extension to their book, Rasmussen and Williams [2006] made the GPML (Gaussian processes for machine learning) code publicly available⁴. The toolbox contained code for Gaussian regression and approximate classification using EP and LA. About a year later, the code was refactored and improved so that inference and model specification were kept apart⁵. In addition to EP and LA, named approxEP.m and approxLA.m in the code, implementations of all of the approximation methods mentioned in this chapter can be downloaded and used⁶ as an extension to the GPML code:

• approxKL.m – Kullback-Leibler, section 4.5,

• approxVB.m – individual variational bounds, section 4.6,

• approxFV.m – factorial variational, section 4.7, and

• approxLR.m – label regression, section 4.8.

Sparse and/or online approximation methods as introduced in section 4.2.2 include

• approxIVM.m – informative vector machine,

• approxOLEP.m – online EP, and

• approxSO.m – sparse online approximation.

Mainly for educational reasons, we also provide some (equivalent) variants of EP from section 4.4.1:

• approxEC.m – expectation consistent inference,

• approxTAP.m – ADATAP, and

• approxTAPnaive.m – naive ADATAP.

Release 3.1 of the GPML code, as described in section 4.11.1 [Rasmussen and Nickisch, 2010], is available as an mloss.org project⁷ or from the Gaussian process website⁸. The new implementation is completely generic, with simple interfaces for an extended set of covariance and likelihood functions. We also support arbitrary mean functions and provide full compatibility with GNU Octave. Much energy was spent to properly disentangle covariance, likelihood and mean hyperparameters. Again, special care has been taken to avoid numerical problems, e.g. safe likelihood evaluations for extreme inputs and stable matrix operations, as described in the following.

⁴ http://www.gaussianprocess.org/gpml/code/matlab/release/gpml-matlab-v1.3-2006-09-08.tar.gz

⁵ http://www.gaussianprocess.org/gpml/code/matlab/release/gpml-matlab-v2.0-2007-06-25.tar.gz

⁶ The extension is available at http://www.kyb.mpg.de/~hn/approxXX.tar.gz.

⁷ http://mloss.org/software/view/263/

⁸ The current version can be obtained from http://www.gaussianprocess.org/gpml/code/matlab/doc/.

Stable matrix operations

More concretely, to properly handle situations where $K$ is close to singular, we use the well-conditioned matrix $B$⁹ and its Cholesky decomposition to calculate $V = (K^{-1} + W)^{-1}$ and $\mathbf{k}_*^\top C \mathbf{k}_* = \mathbf{k}_*^\top (K + W^{-1})^{-1} \mathbf{k}_*$. The case of $W$¹⁰ having negative components can be handled by using the (slower) LU decomposition of the non-symmetric (but well-conditioned) matrix $A$ instead, as summarised in the following table.

well-conditioned matrix                                  $C = (K + W^{-1})^{-1}$                                                   $V = (K^{-1} + W)^{-1}$   $\ln|KW + I|$
$B = W^{\frac12} K W^{\frac12} + I = LL^\top$            $W^{\frac12} B^{-1} W^{\frac12} = W^{\frac12} L^{-\top} L^{-1} W^{\frac12}$   $K - KCK$                 $2\cdot\mathbf{1}^\top \ln(\mathrm{dg}(L))$
$A = KW + I = LU$                                        $W A^{-1} = W U^{-1} L^{-1}$                                              $K - KCK$                 $\mathbf{1}^\top \ln|\mathrm{dg}(U)|$

Table 4.3: Numerically stable matrix operations in GP classification

The posterior mean $\mathbf{m}$ is represented in terms of $\boldsymbol\alpha = K^{-1}\mathbf{m}$ to avoid multiplications with $K^{-1}$ and to facilitate predictions. As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters. The KL method uses Gauss-Hermite quadrature; we did not notice problems stemming therefrom. The FV and TAP methods work very reliably, although we had to add a small ($10^{-6}$) ridge for FV to regularise $K$.
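
To make the Cholesky branch of Table 4.3 concrete, the following minimal MATLAB/Octave sketch computes $B$, $\ln|KW+I|$ and a predictive variance $k_{**} - \mathbf{k}_*^\top C \mathbf{k}_*$. It assumes a covariance matrix K, a column vector W of non-negative likelihood curvatures and test covariances ks and kss; the variable names are illustrative and not part of the toolbox interface.

    n   = size(K,1);                 % number of training points
    sW  = sqrt(W);                   % diagonal of W^(1/2) as a column vector
    B   = eye(n) + (sW*sW').*K;      % B = W^(1/2)*K*W^(1/2) + I, all eigenvalues >= 1
    L   = chol(B)';                  % lower Cholesky factor, B = L*L'
    lndetKWI = 2*sum(log(diag(L)));  % ln|K*W + I| = 2*1'*ln(dg(L))
    v   = L\(sW.*ks);                % solve L*v = W^(1/2)*k_*
    fs2 = kss - v'*v;                % k_** - k_*'*C*k_* with C = (K + W^(-1))^(-1)

Since all eigenvalues of $B$ are at least 1, the Cholesky factorisation and the log determinant remain numerically safe even when $K$ itself is close to singular.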

Large scale computations

The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques to deal with large numbers of training examples $n$. Hence, all discussed inference algorithms hinge on $K$ being not too big, since matrix decompositions have complexity $\mathcal{O}(n^3)$. If the dataset size $n$ grows beyond $5\cdot 10^3$, exact matrix computations become prohibitive rather quickly. By means of an approximation to the covariance matrix

$K \approx \hat K := V R V^\top + D, \qquad V \in \mathbb{R}^{n\times r},\; R \in \mathbb{R}^{r\times r},\; D = \mathrm{dg}(\mathbf{d}),$

which has to be computed before the inference procedure, we can reduce the computational cost so that LA and VB become scalable. Examples include the Nyström approximation [Smola and Schölkopf, 2000, Williams and Seeger, 2001] and the incomplete Cholesky decomposition [Fine and Scheinberg, 2001]. Matrix-vector multiplications (MVMs) with $\hat K$ cost $\mathcal{O}(r\cdot n)$ instead of $\mathcal{O}(n^2)$, and MVMs with $\hat K^{-1}$ can be computed using the matrix inversion lemma

$K^{-1} \approx \hat K^{-1} = D^{-1} - D^{-1} V \left(R^{-1} + V^\top D^{-1} V\right)^{-1} V^\top D^{-1}$

at a cost of $\mathcal{O}(r\cdot n)$, as well.
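
As an illustration, here is a minimal MATLAB/Octave sketch of such MVMs; the factors V, R and the diagonal vector d are assumed to be given, e.g. by a Nyström or incomplete Cholesky routine, and the handle names are illustrative.

    r = size(V,2);                                    % rank of the approximation
    mvm_Khat  = @(x) V*(R*(V'*x)) + d.*x;             % Khat*x in O(r*n)
    S = R\eye(r) + V'*bsxfun(@times, 1./d, V);        % r x r inner matrix R^(-1) + V'*D^(-1)*V
    mvm_iKhat = @(x) x./d - (V*(S\(V'*(x./d))))./d;   % Khat^(-1)*x via the matrix inversion lemma

Setting up S costs $\mathcal{O}(n\cdot r^2)$ once; afterwards every MVM with $\hat K$ or $\hat K^{-1}$ costs $\mathcal{O}(r\cdot n)$.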

4.11.1 The gpml toolbox

We provide a stable and modular implementation, verified by test cases and unit tests, that comes with both a user and a technical documentation¹¹. The code is fully compatible with Matlab 7.x¹² and GNU Octave 3.2.x¹³. A Gaussian process model requires the specification of a Gaussian process prior through a mean and a covariance function, as well as a likelihood. Model fitting and prediction depend on an approximate inference algorithm computing $Q(\mathbf{f})$ and $Q(y_*)$, as summarised in the following table.

The GPML toolbox contains exactly these objects: model fitting using the marginal likelihood gradient $\partial L/\partial\theta$ and prediction work in a fully generic way once the model is specified. In the following, we list some of the implemented objects.

⁹ All eigenvalues $\lambda$ of $B$ satisfy $1 \le \lambda \le 1 + \frac{n}{4}\max_{ij} K_{ij}$, thus $B^{-1}$ and $|B|$ can be safely computed.

¹⁰ This happens for non-log-concave likelihoods like the Student's t likelihood. Formally, negative values in $W$ correspond to negative variances. Although negative variances do not have a probabilistic meaning, they still allow the approximation to locally imitate the non-Gaussian likelihoods so that the approximate posterior is most similar to the exact posterior.

1) Gaussian process  $\mathbf{f} \sim \mathcal{GP}(m_\phi, k_\psi)$, i.e. $P_{\phi,\psi}(\mathbf{f}) = \mathcal{N}(\mathbf{f}\,|\,\mathbf{m}_\phi, K_\psi)$
   a) mean $m_\phi(\mathbf{x})$
   b) covariance $k_\psi(\mathbf{x},\mathbf{x}')$
2) Likelihood  $\prod_{i=1}^n P_\rho(y_i|f_i)$
3) Approximate inference  $Q(\mathbf{f}) \approx P(\mathbf{f}|\mathbf{y}) \propto P_{\phi,\psi}(\mathbf{f})\,P_\rho(\mathbf{y}|\mathbf{f})$
4) Fitting  $\theta = (\phi,\psi,\rho)$, $\theta^\star = \arg\max_\theta L(\theta)$ with $L(\theta) \approx \int P_{\phi,\psi}(\mathbf{f})\,P_\rho(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f}$
5) Prediction  $Q(y_*)$

Table 4.4: GPML toolbox building blocks

1a) Mean functions

In the GPML toolbox, a mean function needs to implement evaluation $\mathbf{m} = m_\phi(X)$ and first derivatives $\mathbf{m}_i = \frac{\partial}{\partial\phi_i} m_\phi(X)$. We offer simple and composite mean functions.

• simple functions: zero $m(\mathbf{x}) = 0$, const $m(\mathbf{x}) = c$, linear $m(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$

• composite functions: sum $m(\mathbf{x}) = \sum_j m_j(\mathbf{x})$, prod $m(\mathbf{x}) = \prod_j m_j(\mathbf{x})$, pow $m(\mathbf{x}) = m_1(\mathbf{x})^d$

This modular specification allows us to work with affine mean functions $m(\mathbf{x}) = c + \mathbf{a}^\top\mathbf{x}$ or polynomials $m(\mathbf{x}) = (c + \mathbf{a}^\top\mathbf{x})^2$, as sketched below.
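
A minimal sketch of how such a composite mean is specified and evaluated, following the string-based convention of the code example at the end of this section (the inputs X and the values of a and c are illustrative):

    meanfunc = {'meanSum', {'meanLinear', 'meanConst'}};   % affine mean m(x) = a*x + c
    a = 2; c = 1;
    hyp.mean = [a; c];                                     % hyperparameters phi of the parts, stacked
    m = feval(meanfunc{:}, hyp.mean, X);                   % evaluate m_phi(X) at the inputs X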

1b) Covariance functions

Similarly to the mean functions, we provide a whole algebra of covariance functions. Again, the interface is simple since only evaluation of the full covariance matrix $K = k_\psi(X)$, its derivatives $K_i = \frac{\partial}{\partial\psi_i} k_\psi(X)$ and the cross terms $\mathbf{k}_* = k_\psi(X,\mathbf{x}_*)$ and $k_{**} = k_\psi(\mathbf{x}_*,\mathbf{x}_*)$ for prediction are required (a short sketch follows after the list below). Besides a long list of simple covariance functions, we also offer a variety of composite covariance functions.

• simple functions: linear, constant, ridge, Matérn, squared exponential, polynomial, periodic, MKL, neural network, finite support

• composite functions:

  sum, prod:  $k(\mathbf{x},\mathbf{x}') = \sum_j k_j(\mathbf{x},\mathbf{x}')$,  $k(\mathbf{x},\mathbf{x}') = \prod_j k_j(\mathbf{x},\mathbf{x}')$
  masked:     $k(\mathbf{x}_I,\mathbf{x}'_I)$, masking index $I \subseteq [1, 2, .., D]$, $\mathbf{x}\in\mathbb{R}^D$
  scaling:    $k(\mathbf{x},\mathbf{x}') = \sigma_f^2\, k_0(\mathbf{x},\mathbf{x}')$
  additive:   $k(\mathbf{x},\mathbf{x}') = \sum_{|I|=d\in\mathcal{D}} k(\mathbf{x}_I,\mathbf{x}'_I)$, index degree set $\mathcal{D}$

Both the mean and the covariance functions are easily extensible.
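
As a short sketch of the composition mechanism, the following lines build a squared exponential covariance plus a diagonal ridge term using the same feval-based convention as the code example at the end of this section (the name covNoise for the ridge term and the input matrix X are assumptions for illustration):

    cov = {'covSum', {'covSEiso', 'covNoise'}};   % squared exponential plus a ridge
    ell = 0.7; sf = 1; sr = 0.1;                  % length scale, signal scale, ridge scale
    hyp.cov = log([ell; sf; sr]);                 % hyperparameters psi of the parts, stacked
    K = feval(cov{:}, hyp.cov, X);                % evaluate the full covariance matrix k_psi(X)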

2) Likelihoods

The GPML toolbox approximate inference engine does not explicitly distinguish between classification and regression: for any choice of the likelihood $P_\rho(y_i|f_i)$, the toolbox uses the same code in the inference step. The following table enumerates all (currently) implemented likelihood functions and their respective parameter set $\rho$. See figure 2.2 for a graphical illustration and the expressions for $P_\rho(y_i|f_i)$.


¹¹ http://www.gaussianprocess.org/gpml/code/

¹² The MathWorks, http://www.mathworks.com/

¹³ The Free Software Foundation, http://www.gnu.org/software/octave/

$P_\rho(y_i|f_i)$   regression $y_i \in \mathbb{R}$                                                         classification $y_i \in \{\pm 1\}$
name                Gaussian          logistic          Laplacian         Student's t                       cum. Gaussian    cum. logistic
$\rho =$            $\{\ln\sigma\}$   $\{\ln\sigma\}$   $\{\ln\sigma\}$   $\{\ln(\nu-1), \ln\sigma\}$       $\emptyset$      $\emptyset$

Table 4.5: Likelihood functions implemented in the GPML toolbox
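
A minimal sketch of evaluating one of these likelihoods through the generic interface, mirroring the feval call in the code example at the end of this section (the vectors y and f and the value of sn are illustrative):

    lik = 'likLaplace'; sn = 0.2;        % Laplacian likelihood with scale sigma
    hyp.lik = log(sn);                   % rho = {ln sigma}, cf. Table 4.5
    lp = feval(lik, hyp.lik, y, f);      % evaluate the log likelihood ln P_rho(y_i|f_i)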

3) Approximate inference methods

In addition to exact inference (only possible for Gaussian likelihood), we have three major approximate inference methods implemented in the toolbox: expectation propagation (section 4.4 and chapter 2.5.10), Laplace approximation (section 4.3 and chapter 2.5.6) and variational Bayes (section 4.6 and chapter 2.5.9). The following table lists all possible combinations of likelihood and inference algorithm. Note that any choice of mean and covariance function is allowed.

likelihood \ inference        exact    EP    Laplace    variational Bayes
Gaussian                        X      X        X
logistic                               X        X              X
Laplacian                              X                       X
Student's t                                     X              X
cumulative Gaussian                    X        X
cumulative logistic                    X        X              X

Table 4.6: Likelihood ↔ inference compatibility in the GPML toolbox

Expectation propagation for Student's t likelihoods is inherently unstable due to the lack of log-concavity. The Laplace approximation for Laplace likelihoods is not sensible because, at the mode, the curvature and the gradient can be undefined due to the non-differentiable peak of the Laplace distribution. Special care has been taken for the non-convex optimisation problem imposed by the combination of Student's t likelihood and Laplace approximation. Finally, the (convex) lower bounding approach by Gaussian potentials of variable width is problematic for Gaussian and cumulative Gaussian likelihoods because they admit only certain widths.

Code example

Due to the modular structure of the code, specification of a full GP model and model fitting can be done in less than ten lines of code as illustrated by the following example.

[xtr, xte, ytr, yte] = read_data;                       % train and test data

% 1) SET UP THE GP
cov  = {'covSEiso'};  sf = 1;  ell = 0.7;               % squared exp. covariance
mean = {'meanSum', {'meanLinear', 'meanConst'}};  a = 2;  b = 1;  % a*x+b
lik  = 'likLaplace';  sn = 0.2;                         % sparse Laplace likelihood
hyp.mean = [a; b];  hyp.cov = log([ell; sf]);  hyp.lik = log(sn); % hyp
inf  = 'infEP';                                         % inference method is expectation propagation

% 2) LEARN, i.e. MAX. MARGINAL LIKELIHOOD w.r.t. hyp
Ncg = 50;                          % number of conjugate gradient steps for optimisation
hyp = minimize(hyp, 'gp', -Ncg, inf, mean, cov, lik, xtr, ytr);

% 3) PREDICT
[ymu, ys2] = gp(hyp, inf, mean, cov, lik, xtr, ytr, xte);

K  = feval(cov{:},  hyp.cov,  xtr);   % evaluate covariance matrix
m  = feval(mean{:}, hyp.mean, xtr);   % evaluate mean vector
lp = feval(lik, hyp.lik, ytr, ftr);   % evaluate log likelihood
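
Because the interface is fully generic, other combinations from Table 4.6 only require swapping the corresponding strings. For instance, the following sketch refits the same Laplace-likelihood model with variational Bayes instead of EP, reusing the variables from the listing above and assuming the variational Bayes routine is named 'infVB' in analogy to 'infEP':

    inf = 'infVB';                                                   % variational Bayes inference
    hyp = minimize(hyp, 'gp', -Ncg, inf, mean, cov, lik, xtr, ytr);  % refit the hyperparameters
    [ymu, ys2] = gp(hyp, inf, mean, cov, lik, xtr, ytr, xte);        % predictive mean and variance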