
Experimental design allows us to guide the measurement process itself in order to acquire only the most informative data points (xᵢ, yᵢ). Often, the data matrix X containing the covariates is simply called the design matrix.

The frequentist or classical experimental design methodology, as introduced by Fisher [1935], tries to decrease the variance of the estimator û for the unknown variables u. As a result, the design criteria are based on the eigenvalues of the estimator's covariance matrix or lower bounds thereof. Modern books on the subject include Atkinson and Donev [2002] and Pukelsheim [2006].

The Bayesian approach is different since the unknown u is treated as a random variable with a prior P(u). Here, the goal is to reduce the entropy in the posterior P(u|y). For a seminal review of Bayesian experimental design, see Chaloner and Verdinelli [1995].

As we will see, for the Gaussian linear model, Bayesian experimental design is equivalent to D-optimal frequentist design. However, for more complex models, the two approaches are very different. One distinction is that the Bayesian design score depends on the measurements y made so far, whereas only expectations w.r.t. the likelihood P(y|u) appear in the frequentist score.

2.6.1 Frequentist experimental design

The basic frequentist idea is to select new data (x∗, y∗) so that the variance V = V[û] of the estimator û = û(X, y) for the unknown u decreases as much as possible, where the particular choice of estimator determines the compromise between bias and variance. Most of the classical design criteria are p-norms of the vector λ

\[
\phi(\hat u) = \|\lambda\|_p = \left( \sum_{i=1}^{n} \lambda_i^p \right)^{1/p}, \qquad \lambda_i = \lambda_i(V),
\]

whose components are the eigenvalues of V – a way to express the "size" of the matrix V as a scalar. Table 2.4 summarises the most common cost functions used in experimental design.

name of the design criterion   p   cost function φ(û)          intuition
D-optimality                   0   ∏_{i=1}^n λ_i = |V|         generalised variance
A-optimality                   1   ∑_{i=1}^n λ_i = tr(V)       average variance
E-optimality                   ∞   max_i λ_i = ‖V‖             maximal variance

Table 2.4: Experimental design cost functions
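As a quick numerical illustration of Table 2.4 (not part of the original text; the matrix V below is a made-up example), all three criteria can be computed directly from the eigenvalues of the covariance matrix:

```python
import numpy as np

# Hypothetical estimator covariance matrix V (symmetric positive definite).
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])
lam = np.linalg.eigvalsh(V)  # eigenvalues lambda_i of V

d_opt = np.prod(lam)   # D-optimality: generalised variance, equals det(V)
a_opt = np.sum(lam)    # A-optimality: average variance, equals tr(V)
e_opt = np.max(lam)    # E-optimality: maximal variance, spectral norm of V

assert np.isclose(d_opt, np.linalg.det(V))
assert np.isclose(a_opt, np.trace(V))
```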

For the simple OLS estimator, we can analytically compute the variance, but for non-Gaussian likelihoods or more complicated estimators, it can be impossible to derive the variance explicitly. Using the likelihood P(y|u), a distribution over y for fixed u, the Cramér-Rao lower bound (CRB) [Cramér, 1946, Rao, 1945] on the variance of any estimator û has the form

\[
V = \mathbb{V}[\hat u] \succeq \frac{\partial \psi}{\partial u^\top} F^{-1} \frac{\partial \psi^\top}{\partial u}, \qquad
\psi = \int \hat u \, P(y|u) \, dy, \qquad
F = \int \frac{\partial \ln P(y|u)}{\partial u} \, \frac{\partial \ln P(y|u)}{\partial u^\top} \, P(y|u) \, dy, \tag{2.24}
\]

where F is the Fisher information matrix and ψ = E[û] is the expected value of the estimator under the likelihood. The bound is asymptotically tight for the maximum likelihood estimator.

Often, unbiased estimators are used, where E[û] = ψ = u and hence V[û] ⪰ F⁻¹. Since V does not have a closed form for many interesting models, one replaces V by its lower bound according to equation 2.24. For general likelihoods P(y|u), the expectation in the Fisher matrix is also likely to be analytically intractable. Besides the CRB, there exists a wide variety of lower bounds on V[û] [Bhattacharyya, 1946, Barankin, 1949, Abel, 1993] that are sometimes tighter but more tedious to compute. For non-linear Gaussian models, the estimator's expectation E[û] is hard to compute. Further, for the Gaussian likelihood P(y|u) = N(y|Xu, σ²I), the Fisher information matrix is given by F = σ⁻²XᵀX, which is rank deficient if m < n. This property renders the approach inapplicable in underdetermined settings. In PLS (section 2.2.1), for example, depending on γ, ψ ranges between E_{γ=0}[û] = 0 and E_{γ=∞}[û] = u, giving rise to different values of the bias E[û] − u. There is one critical issue concerning the design methodology: we minimise a lower bound on the variance; however, theoretical guarantees for the validity of this procedure apply only to the asymptotic regime of many observations. The small sample regime is less well understood.
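The rank deficiency of F in the underdetermined setting is easy to check numerically; a minimal sketch with made-up dimensions (m = 2 measurements, n = 4 unknowns):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 2, 4, 0.1          # underdetermined: m < n

X = rng.standard_normal((m, n))  # design matrix
F = X.T @ X / sigma**2           # Fisher information for P(y|u) = N(y|Xu, sigma^2 I)

# rank(F) = rank(X) <= m < n, so F is singular and F^{-1} does not exist.
assert np.linalg.matrix_rank(F) == m
```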

Note that the criteria φ_{D,A,E}(û) do not depend on the actual measurements y made so far; they are expectations w.r.t. y under the likelihood.

2.6.2 Bayesian experimental design

In Bayesian design philosophy, the unknown u is considered a random variable. A natural measure of the uncertainty contained in a random variable z is its (differential) entropy [Cover and Thomas, 2006]

\[
\mathrm{H}[P(z)] = -\int P(z) \ln P(z) \, dz.
\]
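For a one-dimensional Gaussian, the closed-form entropy ½ ln(2πeσ²) can be checked against numerical integration of this definition (a small sketch with an arbitrary standard deviation):

```python
import numpy as np

# Differential entropy of z ~ N(0, s^2): definition vs closed form.
s = 1.5
z = np.linspace(-12 * s, 12 * s, 200_001)
p = np.exp(-0.5 * (z / s) ** 2) / (s * np.sqrt(2 * np.pi))

f = p * np.log(p)                                         # integrand of -H
h_numeric = -np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z))  # trapezoidal rule
h_closed = 0.5 * np.log(2 * np.pi * np.e * s**2)          # 1/2 ln(2 pi e s^2)

assert np.isclose(h_numeric, h_closed, atol=1e-6)
```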

For fixed mean and variance, a Gaussian has maximal entropy (appendix C), leading to the upper bound

\[
\mathrm{H}[P(z)] \le \mathrm{H}\left[ \mathcal{N}\left( z \,\middle|\, \mathbb{E}_{P(z)}[z], \mathbb{V}_{P(z)}[z] \right) \right]
= \frac{1}{2} \ln \left| \mathbb{V}_{P(z)}[z] \right| + \frac{n}{2} (1 + \ln 2\pi), \qquad z \in \mathbb{R}^n. \tag{2.25}
\]

More accurate statements about the tightness of the bound are based on series approximations of P(z) as given in appendix D.4. Therefore, large variances are equivalent to high entropy, implying very little information about the location of z. At the core of the Bayesian design strategy is the idea to localise the posterior as much as possible. This is equivalent to decreasing the expected entropy of the posterior including the new data x∗ relative to the entropy of the previous posterior without x∗. Formally, we use the information gain

\[
\mathrm{IG}(x_*) = \mathrm{H}[P(u|y)] - \int \mathrm{H}[P(u|y, y_*)] \, P(y_*|y) \, dy_*, \tag{2.26}
\]

where we need to compute the expected entropy H[P(u|y, y∗)] of the augmented posterior including the measurement y∗ along x∗. The expectation is taken over P(y∗|y) = ∫ P(u|y) P(y∗|u) du.

Note that the information gain explicitly depends on the observations y. In the applications of this thesis (see chapters 5 & 6), the integrals in equation 2.26 cannot be done analytically. Therefore, we will first use approximate inference to replace P(u|y) by an approximation Q(u) that allows an analytic computation of the information gain score. However, it is necessary to keep in mind that we approximate at various stages to obtain the design score: first, variational methods (except for EP) typically underestimate the posterior covariance, and second, the Gaussian entropy is an upper bound on the actual posterior entropy. As in the case of frequentist design (section 2.6.1), theoretical results on the approximation quality are rare.

2.6.3 Information gain scores and approximate posteriors

For general posteriors P(u|y), the information gain score IG(X∗) is analytically intractable. However, for Gaussian likelihoods P(y∗|u) = N(y∗|X∗u, σ²I), we can use a Gaussian Q(u) to compute the information gain score IG(X∗) approximately. For non-Gaussian likelihoods, further approximations are necessary. With P(u|y, y∗) P(y∗|y) = P(y∗|y, u) P(u|y) and X∗ ∈ ℝ^{d×n}, y∗ ∈ ℝ^d, the score IG(X∗) can be expressed as the entropy of the new observations y∗ given the old observations y:

\[
\begin{aligned}
\mathrm{IG}(X_*) &= \mathrm{H}[P(u|y)] - \int \mathrm{H}[P(u|y, y_*)] \, P(y_*|y) \, dy_* \\
&= \mathrm{H}[P(u|y)] + \iint \ln \frac{P(y_*|y, u) \, P(u|y)}{P(y_*|y)} \, P(u|y, y_*) \, du \, P(y_*|y) \, dy_* \\
&= \mathrm{H}[P(u|y)] + \iint \ln P(y_*|y, u) \, P(u|y, y_*) \, P(y_*|y) \, dy_* \, du - \mathrm{H}[P(u|y)] + \mathrm{H}[P(y_*|y)] \\
&= \mathrm{H}[P(y_*|y)] - \int \mathrm{H}[P(y_*|u)] \, P(u|y) \, du
= \mathrm{H}[P(y_*|y)] - d \left( \tfrac{1}{2} \ln 2\pi e + \ln \sigma \right).
\end{aligned}
\]

Even though P(y∗|y) is a non-Gaussian distribution, its variance can be obtained by the law of total variance from the variance of the posterior P(u|y):

\[
\begin{aligned}
\mathbb{V}_{P(y_*|y)}[y_*|y] &= \mathbb{E}_{P(u|y)}\left[ \mathbb{V}_{P(y_*|y,u)}[y_*|y, u] \right] + \mathbb{V}_{P(u|y)}\left[ \mathbb{E}_{P(y_*|y,u)}[y_*|y, u] \right] \\
&= \mathbb{E}_{P(u|y)}\left[ \sigma^2 I \right] + \mathbb{V}_{P(u|y)}[X_* u] \\
&= \sigma^2 I + X_* \mathbb{V}_{P(u|y)}[u] \, X_*^\top.
\end{aligned}
\]

Using the Gaussian upper bound on the entropy (equation 2.25), we get a formula generalising the linear Gaussian case (equations 2.27 and 2.28) to

\[
\begin{aligned}
\mathrm{IG}(X_*) &\le \tfrac{1}{2} \ln \left| \mathbb{V}_{P(y_*|y)}[y_*] \right| + \tfrac{d}{2} \ln 2\pi e - d \left( \tfrac{1}{2} \ln 2\pi e + \ln \sigma \right) \\
&= \tfrac{1}{2} \ln \left| I + \sigma^{-2} X_* \mathbb{V}_{P(u|y)}[u] \, X_*^\top \right|.
\end{aligned}
\]

Since we seek X∗ with maximal information gain IG(X∗), the bound depends on the dominating eigenmodes of the posterior covariance matrix V_{P(u|y)}[u]. In applications where n is large and the approximate posterior covariance V = V_{Q(u)}[u] = σ²(XᵀX + BᵀΓ⁻¹B)⁻¹ cannot be stored as a dense matrix but is implicitly represented using MVMs with X, B and the vector γ, the evaluation of X∗VX∗ᵀ is computationally demanding. Every row of X∗ requires the solution of a linear system with the n×n matrix V, which can – of course – be done by conjugate gradients. To alleviate this computational burden, one can use the Lanczos method of section 2.5.4 to compute a low-rank approximation V ≈ σ²Q_k T_k⁻¹ Q_kᵀ. If the eigenmodes of V are well captured by the Lanczos approximation, we can expect the large score values to be rather accurate.

2.6.4 Constrained designs

Up to now, we have required new measurement directions to have unit length, dg(X∗X∗ᵀ) = 1; otherwise, rescaling would always lead to an increase in information gain or, equivalently, a decrease in the estimator's variance. Further constraints might be present in practice. Most commonly, the rows of X∗ can originate from a discrete set of candidates X_c. In the so-called transductive setting [Yu et al., 2006], one has to find a discrete subset of the possible candidates rather than a continuous matrix. In general, the selection problem is of combinatorial complexity; however, there exist convex reformulations for the linear Gaussian case [Yu et al., 2008]. Unfortunately, they are useless in the underdetermined regime where m < n.

2.6.5 Sequential and joint designs

In the applications of this thesis, experimental design is not only used once. For complex design decisions based on data (y, X), we alternate in a loop between the inference step and the design decision for the next single (y∗, x∗) or joint measurement (y∗, X∗) to include. Clearly, optimising a set of candidates X∗ jointly can lead to better designs but is also computationally more demanding. Often, a greedy strategy will act as the pragmatic choice, with only a single candidate x∗ being added each time. The individual candidate measurements x∗ can come from a discrete candidate set {xᵢ, i ∈ I} or from a continuous candidate space x∗ ∈ X. In the former case, we simply select the candidate with the highest score; in the latter case, we have to optimise the design score w.r.t. x∗, with gradient-based methods for example.
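A minimal sketch of such a greedy loop for the linear Gaussian model with a discrete candidate set (all dimensions and values are illustrative): at each step, every remaining candidate x∗ is scored by the Gaussian information gain ½ ln(1 + σ⁻² x∗ᵀV x∗) and the winner is added to the design.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, gamma = 4, 0.5, 2.0

A = np.eye(n) / gamma                       # initial precision A = Gamma^{-1} (no data yet)
candidates = [x / np.linalg.norm(x) for x in rng.standard_normal((10, n))]

design = []
for step in range(3):
    V = sigma**2 * np.linalg.inv(A)         # current (Gaussian) posterior covariance
    scores = [0.5 * np.log(1.0 + x @ V @ x / sigma**2) for x in candidates]
    best = int(np.argmax(scores))           # greedy choice: highest IG(x)
    x_new = candidates.pop(best)
    design.append(x_new)
    A += np.outer(x_new, x_new)             # posterior update after measuring along x_new
```

In a real application, the exact posterior update would be replaced by a round of approximate inference on the augmented data before the next candidate is scored.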

It is the inference step that marks the difference between the frequentist and the Bayesian approach. In frequentist design, we need to compute the inverse Fisher information matrix F_x⁻¹ for every candidate x and select the candidate with the smallest cost φ. In Bayesian design, we compute an approximate posterior (basically a Gaussian) Q(u) ≈ P(u|y, X), specifically tailored to facilitate the evaluation of the information gain score IG(x∗), and pick the candidate x∗ yielding the biggest score.

On a higher level, the actual observations y and y∗ do not enter the frequentist design loop as particular values; they are present through expectations only. In the Bayesian methodology, however, precisely these numbers form the basis for a proper assessment of the uncertainty left in the current state of knowledge about u. In the regime of abundant data, m ≫ n, frequentist design is the method of choice since it comes with many asymptotic guarantees. However, in the underdetermined case m < n, the Bayesian approach is more appropriate, as we will see in the following.

2.6.6 Bayesian versus frequentist design

D-optimal frequentist design and Bayesian experimental design based on a Gaussian approximation to the posterior distribution are similar in two ways: first, they both reduce uncertainty, i.e. either shrink the variance of the estimator or lower the posterior entropy, which is equivalent to decreasing the variance in a Gaussian approximation. Second, in the limit of many observations m → ∞, and hence omission of the prior, they are the same. However, there are also severe differences: in the underdetermined case m < n, the frequentist approach is not applicable.

To make this more concrete, we have a look at the linear Gaussian case as detailed in section 2.2.1. For p = 2, the PLS estimator (equation 2.7) is given by û_PLS = A⁻¹Xᵀy with A = XᵀX + γ⁻¹BᵀB. Using the bilinearity of the covariance and V[y] = σ²I, we obtain the variance of the PLS estimator û_PLS

\[
\hat V := \mathbb{V}[\hat u_{\mathrm{PLS}}] = A^{-1} X^\top \mathbb{V}[y] \, X A^{-1} = \sigma^2 A^{-1} X^\top X A^{-1}.
\]

Although the PLS estimator coincides with the posterior mean, the posterior variance

\[
V := \mathbb{V}_{P(u|\mathcal{D})}[u] = \sigma^2 A^{-1}
\]

is distinctively different from V̂. As will be shown in chapter 3, the diagonal ν = dg(V) is bounded from above by the prior variance, ν ⪯ σ²γ1, which does not hold for V̂. Also, the rank of V̂ only depends on the rank of XᵀX. For underdetermined measurements m < n, V̂ inevitably becomes singular; it cannot be interpreted as the uncertainty of the current knowledge about u, since it is impossible to achieve perfect certainty from a small number of noisy measurements.
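These properties are easy to verify numerically; a sketch with made-up dimensions (m = 2 < n = 5, and B = I so that A = XᵀX + γ⁻¹I):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, sigma, gamma = 2, 5, 0.2, 3.0   # underdetermined: m < n

X = rng.standard_normal((m, n))
A = X.T @ X + np.eye(n) / gamma       # A = X^T X + gamma^{-1} B^T B with B = I

V_hat = sigma**2 * np.linalg.inv(A) @ X.T @ X @ np.linalg.inv(A)  # PLS estimator variance
V_post = sigma**2 * np.linalg.inv(A)                              # Bayesian posterior variance

# V_hat is rank deficient (rank <= m), while V_post has full rank n.
assert np.linalg.matrix_rank(V_hat, tol=1e-10) == m
assert np.linalg.matrix_rank(V_post) == n
# The posterior variance is bounded by the prior variance sigma^2 * gamma.
assert np.all(np.diag(V_post) <= sigma**2 * gamma + 1e-12)
```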

Experimental design with D-optimality as criterion and invertible XᵀX selects the next measurements X∗ = [x∗,1, .., x∗,d]ᵀ to maximise the design score

\[
\begin{aligned}
\ln \phi_D(X_*, \hat u_{\mathrm{PLS}}) &= -\ln |\hat V| = -\ln \left| \sigma^2 (A + X_*^\top X_*)^{-2} (X^\top X + X_*^\top X_*) \right|, \qquad A = X^\top X + \Gamma^{-1} \\
&\stackrel{c}{=} 2 \ln |A + X_*^\top X_*| - \ln |X^\top X + X_*^\top X_*| \\
&\stackrel{c}{=} 2 \ln |I + X_* A^{-1} X_*^\top| - \ln |I + X_* (X^\top X)^{-1} X_*^\top|,
\end{aligned} \tag{2.27}
\]

where \(\stackrel{c}{=}\) denotes equality up to an additive constant independent of X∗. The score compromises between choosing X∗ along the biggest eigendirections of A⁻¹ (Bayesian posterior variance) and along the smallest eigendirections of (XᵀX)⁻¹ (OLS estimator variance).

The Bayesian information gain score

\[
\mathrm{IG}(X_*) = -\tfrac{1}{2} \ln |A| + \tfrac{1}{2} \ln \left| X^\top X + X_*^\top X_* + \Gamma^{-1} \right| = \tfrac{1}{2} \ln \left| I + X_* A^{-1} X_*^\top \right|, \qquad A = X^\top X + \Gamma^{-1} \tag{2.28}
\]

is equivalent to −ln φ_D(X∗, û_PLS) in the flat prior limit Γ → ∞·I only.

We use two toy examples with n = 2, q = m = 1 to illustrate the different behaviours: first, let the measurement X = [0, 1] and the penalty domain B = [1, 0] be orthogonal, BXᵀ = 0 ∈ ℝ^{q×m}, hence

\[
A = \begin{pmatrix} \gamma^{-1} & 0 \\ 0 & 1 \end{pmatrix}, \qquad
\hat V = \sigma^2 \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}
\quad \text{and} \quad
V = \sigma^2 \begin{pmatrix} \gamma & 0 \\ 0 & 1 \end{pmatrix}.
\]
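A quick numerical check of this first toy example (with illustrative values γ = 2, σ = 1), using the definitions V̂ = σ²A⁻¹XᵀXA⁻¹ and V = σ²A⁻¹ from above:

```python
import numpy as np

gamma, sigma = 2.0, 1.0
X = np.array([[0.0, 1.0]])   # measurement direction
B = np.array([[1.0, 0.0]])   # penalty domain, orthogonal to X: B X^T = 0

A = X.T @ X + B.T @ B / gamma
V_hat = sigma**2 * np.linalg.inv(A) @ X.T @ X @ np.linalg.inv(A)
V_post = sigma**2 * np.linalg.inv(A)

assert np.allclose(A, np.diag([1 / gamma, 1.0]))
# Singular: no frequentist uncertainty along the unmeasured direction [1, 0].
assert np.allclose(V_hat, sigma**2 * np.diag([0.0, 1.0]))
# The prior variance sigma^2 * gamma survives along [1, 0] in the posterior.
assert np.allclose(V_post, sigma**2 * np.diag([gamma, 1.0]))
```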

2.7 Discussion and links to other chapters