
2.2 Approximate inference

2.2.3 Empirical approximation - Monte Carlo methods

In the previous two subsections, we reviewed inference approximations which are relatively cheap to compute numerically but can be far off the true solution. Which alternatives exist if the Gaussian approximation (Laplace approximation) is inadequate and computationally more expensive methods can be afforded? Although for some applications the posterior itself is of interest, mostly integrals (e.g., estimates of the first or second moments, see Equation (2.1)) need to be evaluated. For example, the expectation of g(Ψ) with respect to the posterior p(Ψ|ŷ) is expressed by the integral I:

I = ∫ g(Ψ) p(Ψ|ŷ) dΨ. (2.8)

Monte Carlo methods are numerical integration methods which offer a general approach to approximating integrals. They are, for example, employed for inference, model validation or prediction, which is often of primary interest. To approximate the integral


in Equation (2.8) by a Monte Carlo method, a (pseudo-)random number generator is required. The Monte Carlo algorithm using M samples is:

• sample Ψ^(m), m = 1:M, from p(Ψ|ŷ)

• an unbiased estimate of the integral is given by

I ≈ I_M = (1/M) Σ_{m=1}^M g(Ψ^(m)). (2.9)

To run Monte Carlo, it suffices to be able to draw samples from p(Ψ|ŷ) and to evaluate g(Ψ). An increasing number of samples increases the accuracy and in the limiting case, M → ∞, one obtains the exact value of the integral. However, Monte Carlo methods can be computationally very expensive as g(Ψ) needs to be evaluated many times.
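The two steps above can be sketched as follows; the standard normal "posterior" and the choice g(Ψ) = Ψ² are toy assumptions for illustration, not the model from the text.

```python
import random

def mc_estimate(g, sample_posterior, M):
    # Plain Monte Carlo estimate: I_M = (1/M) * sum_{m=1}^{M} g(Psi^(m))
    return sum(g(sample_posterior()) for _ in range(M)) / M

# Toy stand-in for p(Psi|y): a standard normal, so that the exact value
# I = E[Psi^2] = 1 is known and the estimate can be checked.
random.seed(0)
I_M = mc_estimate(lambda psi: psi ** 2, lambda: random.gauss(0.0, 1.0), M=100_000)
```

With M = 100 000 samples, I_M typically lies within a few percent of the exact value 1, reflecting the usual O(1/√M) Monte Carlo error.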

Later on, we will use some sampling schemes for comparison/verification purposes.

For this reason, we briefly discuss importance sampling and the general Metropolis algorithm in more detail. Additionally, we give the (normalized) effective sample size (ESS) for both algorithms. The ESS is used to measure the efficiency of the algorithms and provides a measure of comparison with other inference strategies. The ESS determines how informative a given sample is and takes values between 1/M and 1 [128]. An ESS = 1 relates to an algorithm which is highly efficient compared to one with an ESS = 1/M (compare also the ESS_IS and ESS_MCMC below).

Importance Sampling

In situations when it is not straightforward to sample from a desired probability distribution π(Ψ), here π(Ψ) = p(Ψ|ŷ), but an evaluation of π(Ψ) is easy for a given Ψ, one can use importance sampling (IS). In IS, one samples from an auxiliary distribution q(Ψ) and then corrects the estimate by weights, e.g., to approximate the integral:

I = ∫ g(Ψ) π(Ψ) dΨ = ∫ g(Ψ) [π(Ψ)/q(Ψ)] q(Ψ) dΨ. (2.10)

The two steps of importance sampling are:

• sample Ψ^(m), m = 1:M, from q(Ψ)

• estimate the integral by I ≈ I_M = (1/M) Σ_{m=1}^M w^(m) g(Ψ^(m)), where w^(m) = π(Ψ^(m))/q(Ψ^(m)) are the corresponding importance weights.

Belonging to the Monte Carlo methods, IS converges by the Strong Law of Large Numbers for M → ∞ to: I_M → I.
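The two IS steps can be sketched as below; the target π = N(0, 1) (a stand-in for the posterior), the wider proposal q = N(0, 2²) and g(Ψ) = Ψ² are all assumptions chosen so the exact answer is known.

```python
import math
import random

def norm_pdf(x, mu, sigma):
    # Density of a univariate normal distribution.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def is_estimate(g, M):
    total = 0.0
    weights = []
    for _ in range(M):
        x = random.gauss(0.0, 2.0)                          # step 1: sample from q
        w = norm_pdf(x, 0.0, 1.0) / norm_pdf(x, 0.0, 2.0)   # w^(m) = pi / q
        weights.append(w)
        total += w * g(x)
    return total / M, weights                               # I_M = (1/M) sum w g

random.seed(1)
I_M, weights = is_estimate(lambda x: x * x, M=50_000)       # exact value: E[x^2] = 1
```

Because q covers the tails of π, the weights stay bounded here; a proposal narrower than the target would make the weight variance blow up.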

The (normalized) effective sample size for importance sampling can be expressed using the importance weights (details in [129]):

ESS_IS = (Σ_{m=1}^M w^(m))² / (M Σ_{m=1}^M (w^(m))²). (2.11)

It indicates the fraction of the samples that actually contribute to the estimate. The latter attains values between the following extremes: In one extreme (when ESS → 1/M), a single sample has a unit normalized weight, whereas the others have zero weights. That happens if q(Ψ) provides a poor approximation and the ESS is dominated by the largest weight w(Ψ^(m)). In the other extreme, when q(Ψ) coincides with the exact posterior, all samples have equal weights w(Ψ^(m)) and are equally informative (ESS → 1).

The performance of importance sampling can decay rapidly in high dimensions [130, 79]. Therefore, we discuss a very general and powerful algorithm in the next subsection, the Markov Chain Monte Carlo method.

Markov Chain Monte Carlo

Importance sampling can be very inefficient (if the proposal distribution is badly selected) and suffers from severe limitations in high-dimensional problems. Interesting and often used alternatives to importance sampling are Markov Chain Monte Carlo (MCMC) methods, which combine Markov chains with Monte Carlo techniques to focus on more important regions. Within MCMC, a chain of correlated (and therefore not independent) samples is generated, starting from any configuration Ψ^(1). Each sample is a non-deterministic function of its previous sample, Ψ^(m) → Ψ^(m+1) (only of the previous sample, following the conditional independence property of Markov chains). The construction of samples is based on the usage of a transition kernel T^(m)(Ψ^(m), Ψ^(m+1)) = p(Ψ^(m+1)|Ψ^(m)) in such a way that Ψ^(m+1) ∼ T^(m)(Ψ^(m), Ψ^(m+1)). To ensure convergence of a Markov chain towards the desired probability distribution in the limit of a large number of samples, this distribution needs to be invariant with respect to the Markov chain,

∫ p(Ψ^(m)) T^(m)(Ψ^(m), Ψ^(m+1)) dΨ^(m) = p(Ψ^(m+1)),

where p(Ψ) is the distribution which the samples have to follow. A more restrictive condition that ensures invariance is the fulfillment of detailed balance, p(Ψ^(m)) T^(m)(Ψ^(m), Ψ^(m+1)) = p(Ψ^(m+1)) T^(m)(Ψ^(m+1), Ψ^(m)). A Markov chain respecting detailed balance is also reversible. More information, also about π-irreducibility and aperiodicity, can be found in [54, 128, 119]. Traditionally, T^(m) is generated using a proposal probability distribution Ψ ∼ q(Ψ|Ψ^(m)) dependent on the previous sample. In cases of a non-symmetric proposal distribution


(q(Ψ^(m)|Ψ) ≠ q(Ψ|Ψ^(m))), one needs to incorporate the ratio q(Ψ^(m)|Ψ)/q(Ψ|Ψ^(m)) within the calculation of the acceptance ratio π(Ψ)/π(Ψ^(m)) to ensure reversibility.

One of the simplest MCMC methods is the general Metropolis algorithm, which we briefly describe below as we refer to it later. Similar to importance sampling, one samples from a proposal distribution q(Ψ|Ψ^(m)), now depending on the current sample Ψ^(m). In the general Metropolis algorithm, one iterates the following steps M times:

• sample Ψ ∼ q(Ψ|Ψ^(m))

• accept the proposal, Ψ^(m+1) = Ψ, with probability min(1, π(Ψ)/π(Ψ^(m))); otherwise keep the current sample, Ψ^(m+1) = Ψ^(m).

The integral in Equation (2.8) can then be approximated similarly to Equation (2.9) by I ≈ I_M = (1/M) Σ_{m=1}^M g(Ψ^(m)). (2.12)

Within the general Metropolis algorithm, the proposal distribution is also symmetric:

q(Ψ^(m)|Ψ^(m+1)) = q(Ψ^(m+1)|Ψ^(m)), ∀Ψ^(m). (2.13)

The convergence rate inversely scales with the square root of the number of parameters [128]. The constant in front of this rate highly depends on how well the proposal step fits the specific problem and on other algorithmic details. The autocorrelation ρ(k) at lag k can be used to evaluate the independence of the consecutive sample draws [131]; from it, the integrated autocorrelation time τ_int can be computed as [51, 132, 119]:

τ_int = 1 + 2 Σ_{k=1}^∞ ρ(k). (2.14)

Also, the (normalized) effective sample size derives from [119]:

ESS_MCMC = 1/τ_int = 1 / (1 + 2 Σ_{k=1}^∞ ρ(k)), (2.15)

indicating the loss in efficiency due to the usage of a Markov chain. In general, the more correlated the samples are, the less information they contain.
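Equations (2.14) and (2.15) can be sketched as below; truncating the infinite sum at the first non-positive ρ(k) is a common practical choice (an assumption, not from the text), and the two chains compared are synthetic examples.

```python
import random

def autocorr(chain, k):
    # Sample autocorrelation rho(k) of the chain at lag k.
    n = len(chain)
    mu = sum(chain) / n
    var = sum((x - mu) ** 2 for x in chain) / n
    cov = sum((chain[i] - mu) * (chain[i + k] - mu) for i in range(n - k)) / n
    return cov / var

def ess_mcmc(chain, max_lag=200):
    # Eq. (2.15), with the sum in tau_int (Eq. (2.14)) truncated at the
    # first non-positive rho(k).
    tau_int = 1.0
    for k in range(1, max_lag + 1):
        rho = autocorr(chain, k)
        if rho <= 0.0:
            break
        tau_int += 2.0 * rho
    return 1.0 / tau_int

random.seed(3)
iid = [random.gauss(0.0, 1.0) for _ in range(5_000)]   # independent draws
ar = [0.0]
for _ in range(4_999):                                 # strongly correlated AR(1) chain
    ar.append(0.9 * ar[-1] + random.gauss(0.0, 1.0))
ess_iid, ess_ar = ess_mcmc(iid), ess_mcmc(ar)          # ess_iid >> ess_ar
```

For the AR(1) chain, ρ(k) ≈ 0.9^k gives τ_int ≈ 19, so roughly only one in twenty samples is effectively independent, while the i.i.d. chain stays close to ESS = 1.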

For further introductory information about Monte Carlo methods, the reader can consult [117]. For particular information about properties of the transition kernel or about specific MCMC algorithms, such as MALA, Simulated Annealing or Gibbs sampling, we direct the reader to [54, 128, 119]. Whilst MCMC sampling methods are general algorithms that are guaranteed to yield exact estimates in the limit of a large number of samples, the number of required samples for accurate estimates can be too large to make even highly optimized procedures feasible.