

4.3. Approximate inference

Anticipating one of the shortcomings of the numerical techniques presented in chapter 4.4, it has to be mentioned that some of these techniques, in particular those based on numerical integration and sampling, can pose almost insuperable computational barriers. To overcome these shortcomings, considerable effort has been made, most notably in the field of machine learning [see e.g. 25], to reduce the computational cost by developing methods for approximate inference.

Variational Bayes (VB) techniques (see chapter 4.3.2) constitute one such class of methods. By computing optimal approximations q(θ) of the posterior p(θ|Z), VB methods try to render the evaluation of integrals such as in (4.3) or in the evaluation of the evidence in (3.32) feasible. With the intention of identifying parameters for complex computational models, these methods have also been extended to be applicable in combination with models incorporating nonlinear 'input-output' relations (see chapter 4.3.3). These extensions can also be seen as generalizations of the well-known Laplace approximation described in the following.

4.3.1. Laplace approximation

In the context of statistical inference, the Laplace approximation usually refers to approximations of the posterior by a normal distribution [209]. This is equivalent to a quadratic approximation to the log-posterior at the MAP point [145]. To this end, the posterior is represented as

\[
p(\theta \mid Z) = \frac{1}{Z}\, q(\theta), \tag{4.12}
\]

and log q(θ) is approximated in terms of the second-order Taylor expansion at the MAP point [see e.g. 25, 145],

\[
\log q(\theta) \approx \log q(\hat{\theta}_{\mathrm{map}}) - \tfrac{1}{2}\,(\theta - \hat{\theta}_{\mathrm{map}})^\top H\, (\theta - \hat{\theta}_{\mathrm{map}}). \tag{4.13}
\]

Due to the definition of the MAP point as an extremal point of the posterior, the first-order term in (4.13) vanishes. The matrix

\[
H := -\left.\frac{\mathrm{d}^2 \log p(\theta \mid Z)}{\mathrm{d}\theta^2}\right|_{\hat{\theta}_{\mathrm{map}}} \tag{4.14}
\]

is usually called the Hessian matrix. Provided that H is positive definite, q(θ) can be represented by the Gaussian distribution

\[
q(\theta) = \mathcal{N}(\hat{\theta}_{\mathrm{map}},\, H^{-1}). \tag{4.15}
\]

This distribution is also shown in figure 4.1. The normalizing constant

\[
Z = \int p(Z \mid \theta)\, p(\theta)\, \mathrm{d}\theta = p(Z) \tag{4.16}
\]

can then be approximately computed by

\[
Z \approx q(\hat{\theta}_{\mathrm{map}})\, \frac{(2\pi)^{n_p/2}}{|H|^{1/2}}. \tag{4.17}
\]

This approximation of the evidence p(Z) makes Laplace's method a popular tool for model evaluation, e.g., by means of the Bayes factor [111]. The use of the Gaussian approximation at the MAP estimate can be justified from an asymptotic perspective where a large amount of data is available. Nevertheless, there is no guarantee on the quality of the approximation, and the Gaussian distribution might not represent the posterior well in a particular situation [25].
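To make the procedure concrete, the following minimal sketch illustrates (4.12)-(4.17) in Python: the MAP point is found by numerical optimization, the Hessian (4.14) is approximated by central finite differences, and the evidence is estimated via (4.17). The forward model, noise level and data are hypothetical placeholders and not taken from the thesis; an automatic-differentiation Hessian could equally be used.

```python
# Sketch of the Laplace approximation, assuming a callable log-joint
# log p(Z, theta) = log p(Z|theta) + log p(theta); toy model for illustration only.
import numpy as np
from scipy.optimize import minimize

def log_joint(theta, Z, sigma=0.5):
    # Hypothetical nonlinear forward model F(theta) and standard-normal prior
    F = np.array([theta[0] * np.exp(-theta[1]), theta[0] + theta[1] ** 2])
    log_lik = -0.5 * np.sum((Z - F) ** 2) / sigma ** 2
    log_prior = -0.5 * np.sum(theta ** 2)
    return log_lik + log_prior

def laplace_approximation(log_joint, theta_init, *args, eps=1e-5):
    # 1) MAP estimate: maximize the log-joint (minimize its negative)
    res = minimize(lambda th: -log_joint(th, *args), theta_init, method="BFGS")
    theta_map = res.x
    n = theta_map.size
    # 2) Hessian of the negative log-joint at the MAP point, cf. (4.14),
    #    approximated here by central finite differences
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = -(log_joint(theta_map + e_i + e_j, *args)
                        - log_joint(theta_map + e_i - e_j, *args)
                        - log_joint(theta_map - e_i + e_j, *args)
                        + log_joint(theta_map - e_i - e_j, *args)) / (4 * eps ** 2)
    # 3) Gaussian approximation q = N(theta_map, H^{-1}), cf. (4.15), and
    #    log-evidence estimate log Z ~ log q(theta_map) + n/2 log(2*pi) - 1/2 log|H|, cf. (4.17)
    cov = np.linalg.inv(H)
    log_evidence = (log_joint(theta_map, *args)
                    + 0.5 * n * np.log(2 * np.pi)
                    - 0.5 * np.linalg.slogdet(H)[1])
    return theta_map, cov, log_evidence

Z_obs = np.array([0.8, 1.4])   # synthetic data, purely illustrative
theta_map, cov, log_ev = laplace_approximation(log_joint, np.zeros(2), Z_obs)
```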

4.3.2. Variational Bayesian approach

VB methods also aim at statistical inference based on an approximation q(θ) of the posterior p(θ|Z). In contrast to the Laplace approximation – where the approximation q(θ) is fitted a posteriori in terms of a fixed Gaussian distribution – optimality of the approximations in the VB approach is obtained from a variational argument a priori. To this end, similarity between probability distributions p1 and p2 is measured by means of the Kullback-Leibler divergence [129]:

\[
D_{\mathrm{KL}}(p_1 \,\|\, p_2) = \int p_1(x)\, \log \frac{p_1(x)}{p_2(x)}\, \mathrm{d}x. \tag{4.18}
\]
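As a quick numerical illustration of (4.18), the following snippet evaluates the Kullback-Leibler divergence between two univariate Gaussians by simple quadrature; the two distributions are arbitrary examples chosen here, not quantities from the thesis. Swapping the arguments yields a different value, which anticipates the lack of symmetry discussed below.

```python
# Numerical evaluation of the integral in (4.18) for two example Gaussians
import numpy as np
from scipy.stats import norm

p1 = norm(loc=0.0, scale=1.0)   # p1 = N(0, 1)
p2 = norm(loc=1.0, scale=2.0)   # p2 = N(1, 4)

x = np.linspace(-10.0, 10.0, 20001)          # quadrature grid
dx = x[1] - x[0]

def kl(p, q):
    # D_KL(p || q) = int p(x) log(p(x)/q(x)) dx, approximated by a Riemann sum
    integrand = p.pdf(x) * (p.logpdf(x) - q.logpdf(x))
    return np.sum(integrand) * dx

print(kl(p1, p2), kl(p2, p1))                # the two values differ
```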

The evaluation of a minimum of such a functional with respect to an argument p1 or p2 lends itself towards variational calculus, motivating the name variational Bayes. Although (4.18) is not a metric, since D_KL(p1 ‖ p2) ≠ D_KL(p2 ‖ p1), its application reveals very useful properties. This can be seen by a rearrangement of the Kullback-Leibler divergence of some approximation q(θ) from the posterior p(θ|Z):

\[
\begin{aligned}
D_{\mathrm{KL}}(q(\theta) \,\|\, p(\theta \mid Z)) &= \int q(\theta)\, \log \frac{q(\theta)}{p(\theta \mid Z)}\, \mathrm{d}\theta && (4.19)\\
&= \int q(\theta)\, \log \frac{q(\theta)}{p(Z, \theta)}\, \mathrm{d}\theta + \log p(Z) && (4.20)\\
&=: -G(q) + \log p(Z), && (4.21)\\
\Rightarrow \quad \log p(Z) &= G(q) + D_{\mathrm{KL}}(q(\theta) \,\|\, p(\theta \mid Z)), && (4.22)
\end{aligned}
\]

where p(θ|Z) = p(Z,θ)/p(Z) has been used in the second step. Since the model evidence p(Z) is fixed with respect to q, the minimization of (4.19) is equivalent to a maximization of the functional G(q). Moreover, since the Kullback-Leibler divergence is non-negative, G(q) provides a lower bound for the log-evidence log p(Z) [see e.g. 25].

In order to arrive at tractable relations when performing the minimization of (4.19), the VB method usually assumes that the approximation of the posterior can be factorized over partitions θi of the parameters θ according to

\[
q(\theta) = \prod_{i=1}^{N} q_i(\theta_i). \tag{4.23}
\]

This approximation, known as mean field approximation, has its foundation in the mean field theory of statistical physics [173]. The minimization of (4.19) is then performed with respect to the single components qi according to the classical variational argument

\[
\operatorname*{argmin}_{q_i}\; D_{\mathrm{KL}}(q(\theta) \,\|\, p(\theta \mid Z)) \quad\Longleftrightarrow\quad \delta D_{\mathrm{KL}}(q \,\|\, p)[\delta q_i] \equiv 0. \tag{4.24}
\]

Thereby, no a priori assumptions on the single components qi are made, which is an often advocated advantage of the VB approach referred to as free-form optimization [9].

The solution of the minimization problem (4.24) results in

\[
\log q_i(\theta_i) = \mathbb{E}_{q_{j \neq i}}\!\left[\log p(Z, \theta)\right] + \mathrm{const} \tag{4.25}
\]

[see e.g. 25]. Thereby, E_{q_{j≠i}} refers to the expectation computed with respect to the probability density q_{j≠i} = ∏_{j≠i} q_j(θ_j). With a clever choice of the partitions θi and/or with the choice of conjugate priors for the likelihood p(Z|θ), the expectation E_{q_{j≠i}}[log p(Z,θ)] becomes analytically tractable, and the components qi(θi) can often be identified with some specific distribution [180].

Starting from the mean field approximation, the original identification problem is split into N conceptually simple optimization problems (4.25). The cyclic dependency between the sub-problems suggests an iterative update procedure: starting from an initial guess θ0, the partitions are updated sequentially, and the new parameters θi are used in the optimization for the next partition. This procedure is then repeated until convergence.

This approach is very similar to the expectation-maximization scheme known from the optimization of latent variable models [see e.g. 25]. In fact, it can be shown that the VB approach represents a generalization of expectation-maximization [17].
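As a concrete illustration of this coordinate-wise scheme, the following sketch applies the updates (4.25) to a standard toy problem: data drawn from a Gaussian with unknown mean μ and precision τ, conjugate Gaussian-Gamma priors, and the factorization q(μ,τ) = q(μ)q(τ). The data, hyperparameter values and variable names are illustrative assumptions; the closed-form updates follow the classical conjugate derivation [see e.g. 25], not a model from this thesis.

```python
# Mean-field VB (coordinate-wise updates of (4.25)) for Z_i ~ N(mu, 1/tau)
# with priors mu | tau ~ N(mu0, 1/(lambda0*tau)) and tau ~ Gamma(a0, b0).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(loc=2.0, scale=0.5, size=200)      # synthetic data
N, Zbar = Z.size, Z.mean()

# Hyperparameters of the (illustrative) priors
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0                                        # initial guess for E[tau]
for it in range(50):
    # Update q(mu) = N(mu_N, 1/lam_N), holding q(tau) fixed  -- (4.25) for theta_1 = mu
    lam_N = (lam0 + N) * E_tau
    mu_N = (lam0 * mu0 + N * Zbar) / (lam0 + N)
    E_mu, E_mu2 = mu_N, mu_N ** 2 + 1.0 / lam_N
    # Update q(tau) = Gamma(a_N, b_N), holding q(mu) fixed    -- (4.25) for theta_2 = tau
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * (np.sum(Z ** 2) - 2 * E_mu * np.sum(Z) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
    E_tau = a_N / b_N

print("approximate posterior mean of mu:", mu_N, " of tau:", E_tau)
```

Because each update only requires low-order moments of the other factor, the scheme converges after a handful of sweeps in this conjugate setting.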

The VB approach is a relatively recent development in the field of statistical inference, and it has proven particularly useful for parameter identification. Often, partitioning the parameters into model and noise parameters already results in feasible computations. Also for Bayesian graphical networks or hierarchical Bayesian models, the hierarchy provides a natural way of partitioning [106]. Furthermore, the good convergence properties inherited from the expectation-maximization scheme make the use of VB methods appealing. The general advantage of the VB approach over the conceptually easier Laplace approximation is twofold. On the one hand, VB provides a much richer class of approximations, whereas the Laplace approximation is restricted to optimal solutions represented as Gaussian distributions. On the other hand, VB gives a lower bound for the log-evidence, whereas the approximation quality of the Laplace approximation can generally not be assessed. For a detailed introduction and illustrations of the approach, the reader is referred to Bishop [25], MacKay [145] and Beal [17].

4.3.3. Extension to nonlinear forward models

The advantage of the VB approach is mainly connected to the availability of good partitions θi such that the optimal distributions qi can be identified from (4.25). Under certain circumstances, this identification might not be possible. A particular case arises for likelihoods defined in terms of complex nonlinear forward models. In order to also handle nonlinear models within the VB framework, it can be combined with the Laplace approximation [67]. One possible approach suggested by Chappell et al. [39] is to use the

computational model in a linearized version

\[
F(\theta) \approx F(\theta_m) + J\,(\theta - \theta_m), \tag{4.26}
\]

with the matrix [J]_{ij} = dF_i(θ)/dθ_j evaluated at θ_m. Considering solely the model parameters θ as a single partition, the minimization problem (4.25) reduces to

\[
\log q(\theta) = \log p(Z \mid \theta) + \log p(\theta) + \mathrm{const}. \tag{4.27}
\]

Using the generic definition of the likelihood (3.27) and the assumption of a Gaussian prior p(θ) = N(θ|θ0, Σ0) over the parameters, the log-posterior can be expanded to

\[
\log q(\theta) = -\frac{1}{2\sigma^2}\,(Z - F(\theta))\cdot(Z - F(\theta)) - \frac{1}{2}\,(\theta - \theta_0)\cdot\Sigma_0^{-1}\cdot(\theta - \theta_0) + \mathrm{const}. \tag{4.28}
\]

Inserting the linearization (4.26) and accumulating all terms that are constant with respect to θ in the constant remainder, one obtains

\[
\log q(\theta) = -\frac{1}{2}\,\theta\cdot\Big(\frac{1}{\sigma^2}\, J^\top J + \Sigma_0^{-1}\Big)\cdot\theta + \theta\cdot\Big(\frac{1}{\sigma^2}\, J^\top\cdot\big((Z - F(\theta_m)) + J\cdot\theta_m\big) + \Sigma_0^{-1}\cdot\theta_0\Big) + \mathrm{const}. \tag{4.29}
\]

In order to identify the right-hand side as the logarithm of some specific probability density function, the logarithm of a Gaussian density log N(x|m, Σ) is expanded to

\[
\log \mathcal{N}(x \mid m, \Sigma) = -\frac{1}{2}\, x\cdot\Sigma^{-1}\cdot x + x\cdot\Sigma^{-1}\cdot m + \mathrm{const}. \tag{4.30}
\]

By comparison to (4.29) it is possible to identify

\[
\Sigma^{-1} = \frac{1}{\sigma^2}\, J^\top J + \Sigma_0^{-1}, \tag{4.31}
\]

\[
\Sigma^{-1} m = \frac{1}{\sigma^2}\, J^\top\cdot\big((Z - F(\theta_m)) + J\cdot\theta_m\big) + \Sigma_0^{-1}\cdot\theta_0. \tag{4.32}
\]

This again suggests an iterative procedure: starting from an initial guess θ_m^0, m is iteratively obtained as the solution of the linear system (4.32), and the next linearization point is updated as θ_m^{n+1} ← m. This iteration can readily be identified as a Gauss-Newton algorithm for the solution of a regularized least-squares formulation [see e.g. 146]. Apart from the Gauss-Newton approximation, this iteration converges to the Laplace approximation (4.15), with the Hessian H approximated by

\[
H \approx \frac{1}{\sigma^2}\, J^\top J + \Sigma_0^{-1}. \tag{4.33}
\]
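The resulting iteration can be summarized in a few lines of code. The sketch below alternates between linearizing the forward model (4.26) and solving the linear system (4.31)-(4.32) for the new mean m, which then serves as the next linearization point. The forward model, data, noise variance and prior are hypothetical placeholders, and the Jacobian is approximated by finite differences for simplicity.

```python
# Linearized (Gauss-Newton-type) update scheme of (4.26), (4.31)-(4.32),
# assuming a known noise variance sigma^2 and a Gaussian prior N(theta0, Sigma0).
import numpy as np

def F(theta):
    # Toy nonlinear forward model mapping 2 parameters to 3 observations
    return np.array([np.exp(-theta[0]), theta[0] * theta[1], theta[1] ** 2])

def jacobian(theta, eps=1e-6):
    # Finite-difference Jacobian [J]_ij = dF_i/dtheta_j at the linearization point
    n_out, n_par = F(theta).size, theta.size
    J = np.zeros((n_out, n_par))
    for j in range(n_par):
        dt = np.zeros(n_par)
        dt[j] = eps
        J[:, j] = (F(theta + dt) - F(theta - dt)) / (2 * eps)
    return J

Z = np.array([0.6, 0.5, 1.1])        # synthetic data
sigma2 = 0.05 ** 2                   # fixed, a priori known noise variance
theta0 = np.zeros(2)                 # prior mean
Sigma0_inv = np.eye(2)               # prior precision (inverse covariance)

theta_m = theta0.copy()              # initial linearization point
for _ in range(20):
    J = jacobian(theta_m)
    # Posterior precision (4.31) and right-hand side of the linear system (4.32)
    Prec = J.T @ J / sigma2 + Sigma0_inv
    rhs = J.T @ ((Z - F(theta_m)) + J @ theta_m) / sigma2 + Sigma0_inv @ theta0
    m = np.linalg.solve(Prec, rhs)
    if np.linalg.norm(m - theta_m) < 1e-10:   # converged Gauss-Newton iteration
        break
    theta_m = m                               # next linearization point

Sigma = np.linalg.inv(Prec)          # approximate posterior covariance, cf. (4.33)
```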

This specific interpretation of the nonlinear VB approach arises as the limiting case of a fixed, a priori known noise variance σ2 and a Gaussian prior on θ. Whereas this limiting case is of course not representative of the nonlinear VB approach, it reveals its two main disadvantages. On the one hand, the lower bound on the log-evidence is weakened due to the linear model approximation. On the other hand, the nonlinear VB is no longer an