
2.1.2 Bayesian model selection and averaging

$\vartheta^{(i)}$ with respect to its prior. The marginal likelihood therefore has the very intuitive interpretation of providing the probability of observing the data $D$ given model $M_i$. Note also that the marginal likelihood depends on the choice of prior distribution, since two models with an identical parametric structure but different prior distributions usually yield different marginal likelihoods. Accordingly, the model that places high prior probability on parameter regions with high likelihood values also attains the higher marginal likelihood. This becomes clear because the marginal likelihood can also be written as the expectation of the model-specific likelihood function with respect to the prior distribution, i.e.

\[
f^{(M)}(D \mid M_i) = \mathbb{E}_{\pi^{(i)}}\!\left[ f^{(i)}(D \mid \cdot\,) \right].
\]
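This prior-expectation form suggests a simple, if often inefficient, Monte Carlo estimator: draw parameters from the prior and average the resulting likelihood values. The following sketch is only an illustration under assumed distributions (a normal model for the data with a normal prior on its mean); the data, function names and prior choices are not taken from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative data and model M_i: D ~ Normal(theta, 1), prior theta ~ Normal(0, 2^2)
D = rng.normal(loc=0.5, scale=1.0, size=20)

def log_likelihood(theta, data):
    """Model-specific log-likelihood log f^(i)(D | theta)."""
    return stats.norm.logpdf(data, loc=theta, scale=1.0).sum()

def log_marginal_likelihood_prior_mc(data, n_draws=50_000):
    """Estimate log f^(M)(D | M_i) = log E_prior[ f^(i)(D | theta) ] by prior sampling."""
    theta_prior = rng.normal(loc=0.0, scale=2.0, size=n_draws)       # draws from pi^(i)
    log_liks = np.array([log_likelihood(t, data) for t in theta_prior])
    m = log_liks.max()                                               # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_liks - m)))

print(log_marginal_likelihood_prior_mc(D))
```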

The ratio of the marginal likelihoods of two models is called a Bayes factor, which provides a measure of the strength of evidence for one model against another, not accounting for the prior model probabilities $\pi^{(M)}(M_i)$, i.e.

\[
B_{M_i,M_j} = \frac{f^{(M)}(D \mid M_i)}{f^{(M)}(D \mid M_j)},
\]

where a Bayes factor $B_{M_i,M_j} > 1$ means that model $M_i$ is preferred over model $M_j$ according to the data. Kass and Raftery (1995) provided a scale for interpreting the Bayes factor in order to grade the evidence for or against a specific model (see Table 2.1). However, these interpretations should be considered as a guideline only, since fixed thresholds for grading the evidence from the Bayes factor are highly subjective.

Table 2.1: Overview of how to interpret various values of the Bayes factor according to Kass and Raftery (1995).

range of B_{M_1,M_2}            degree of evidence
B_{M_1,M_2} ∈ [1, 3]            barely worth mentioning
B_{M_1,M_2} ∈ [3, 20]           positive evidence for M_1
B_{M_1,M_2} ∈ [20, 150]         strong evidence for M_1
B_{M_1,M_2} > 150               very strong evidence for M_1
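As a small illustration, a Bayes factor computed from two (log) marginal likelihoods can be mapped onto the scale of Table 2.1. The helper below is only a sketch; the thresholds are those of the table, and the input values are made up.

```python
import numpy as np

def bayes_factor(log_ml_1, log_ml_2):
    """B_{M1,M2} = f^(M)(D | M1) / f^(M)(D | M2), computed on the log scale."""
    return np.exp(log_ml_1 - log_ml_2)

def interpret(bf):
    """Map a Bayes factor onto the Kass and Raftery (1995) scale of Table 2.1."""
    if bf < 1:
        return "evidence favours M2 (invert the factor to grade it)"
    if bf <= 3:
        return "barely worth mentioning"
    if bf <= 20:
        return "positive evidence for M1"
    if bf <= 150:
        return "strong evidence for M1"
    return "very strong evidence for M1"

bf = bayes_factor(-105.2, -108.9)   # illustrative log marginal likelihoods
print(bf, interpret(bf))
```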

If one is interested only in identifying a single best model instead of model-specific weights, another criterion which can be derived from the marginal likelihood is the Bayesian information criterion (BIC) (Schwarz, 1978). The BIC has the advantage that, due to its form similar to the AIC, it can be applied in both frequentist and Bayesian settings. It is motivated by the asymptotic behaviour of the marginal likelihood if the model-specific likelihood functions $f^{(i)}(D \mid \vartheta^{(i)})$ belong to the exponential family (Held and Bové, 2014), and it is defined for a specific model by

\[
\mathrm{BIC} = \log(n)\, d - 2 \log f\!\left(D \mid \hat{\vartheta}\right),
\]

where $n$ refers to the sample size of the data set $D = (D_i)_{i=1,\dots,n}$ and $d$ to the number of model parameters. As with the AIC, the model yielding the lowest BIC score should be preferred. The penalization of model complexity, i.e. of the number of free parameters, is more pronounced for the BIC, since its penalization factor $\log(n)$ grows with the size of the data set. For the AIC the penalization factor is fixed at 2, which becomes negligible for large data sets, as minor model extensions may then still yield large increases of the maximum likelihood value.
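In code, both criteria only require the maximized log-likelihood, the number of parameters and, for the BIC, the sample size. The sketch below assumes these quantities are already available; the numerical values are purely illustrative.

```python
import numpy as np

def aic(max_log_lik, d):
    """AIC = 2d - 2 log f(D | theta_hat)."""
    return 2 * d - 2 * max_log_lik

def bic(max_log_lik, d, n):
    """BIC = log(n) d - 2 log f(D | theta_hat)."""
    return np.log(n) * d - 2 * max_log_lik

# Illustrative values: two candidate models fitted to n = 200 observations
print(aic(-312.4, d=3), bic(-312.4, d=3, n=200))
print(aic(-310.9, d=5), bic(-310.9, d=5, n=200))
```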

Another strictly Bayesian measure of model validity, motivated by information-theoretic arguments, is the deviance information criterion (DIC) (Spiegelhalter et al., 2002), which is based on the effective number of parameters $p_D$ in the model, i.e.

\[
p_D = \mathbb{E}_{\pi(\vartheta \mid D)}\!\left[-2 \log f(D \mid \cdot\,)\right] + 2 \log f\!\left(D \mid \hat{\vartheta}\right),
\]

where $\hat{\vartheta}$ denotes the posterior mean. Here, the effective number of parameters may be lower than the actual size of the parameter vector due to implicit restrictions contained in the prior, e.g. if the prior imposes a strong correlation between some model parameters or if certain model parameters do not affect the likelihood. The DIC is then given by

\begin{align*}
\mathrm{DIC} &= -2 \log f\!\left(D \mid \hat{\vartheta}\right) + 2 p_D \\
             &= 2\, \mathbb{E}_{\pi(\vartheta \mid D)}\!\left[-2 \log f(D \mid \cdot\,)\right] + 2 \log f\!\left(D \mid \hat{\vartheta}\right).
\end{align*}

One can see that, if the restrictions imposed on the parameters through the prior are weak and the posterior mean is close to the MLE, the DIC becomes approximately equivalent to the AIC.

One main difference in practice is the computability of the three selection criteria AIC, BIC and DIC. While evaluation of the AIC and BIC requires the MLE, the DIC involves computation of the expected likelihood and the posterior mean. Thus, the DIC can be obtained easily if a posterior sample (and corresponding likelihood values) is available, which is the case when sampling procedures such as MCMC methods are used for posterior computation. The AIC and BIC, in contrast, always require maximization of the likelihood function, which, depending on the model, may require more or less additional effort.
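Given a posterior sample and the corresponding log-likelihood values, the DIC follows directly from the definitions above. The sketch below is only illustrative: the "posterior sample" is a stand-in for real MCMC output under an assumed normal-mean model, and the posterior mean is used as the plug-in estimate $\hat{\vartheta}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative data and a stand-in posterior sample (replace with real MCMC output)
data = rng.normal(loc=0.5, scale=1.0, size=50)
posterior_sample = rng.normal(loc=data.mean(), scale=1.0 / np.sqrt(len(data)), size=5_000)

def log_lik(theta):
    """Log-likelihood log f(D | theta) under the assumed normal model."""
    return stats.norm.logpdf(data, loc=theta, scale=1.0).sum()

log_liks = np.array([log_lik(t) for t in posterior_sample])
theta_hat = posterior_sample.mean()                       # posterior mean

p_d = np.mean(-2 * log_liks) + 2 * log_lik(theta_hat)     # effective number of parameters
dic = -2 * log_lik(theta_hat) + 2 * p_d                   # DIC
print(p_d, dic)
```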

However, all three criteria are only suitable for comparing distinct models against each other and for identifying one best model among them. If one is interested in assigning a probability to each model of a certain model set $\mathcal{M}$ under consideration, one has to utilize posterior model probabilities as defined in (2.2). Since one often deals with a finite set of models, i.e.

$|\mathcal{M}| < \infty$, the normalized probabilities can be easily calculated for all $M_i \in \mathcal{M}$ by
\[
\pi^{(M)}(M_i \mid D) = \frac{f^{(M)}(D \mid M_i)\, \pi^{(M)}(M_i)}{\displaystyle\sum_{M \in \mathcal{M}} f^{(M)}(D \mid M)\, \pi^{(M)}(M)}.
\]
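For a finite model set this normalization is a single pass over the (log) marginal likelihoods and prior model probabilities; working on the log scale avoids numerical underflow. The sketch below assumes the marginal likelihoods have already been computed or estimated, and the input values are made up.

```python
import numpy as np

def posterior_model_probabilities(log_marginal_liks, prior_model_probs):
    """pi^(M)(M_i | D) proportional to f^(M)(D | M_i) * pi^(M)(M_i), normalized over M."""
    log_weights = np.asarray(log_marginal_liks) + np.log(prior_model_probs)
    log_weights -= log_weights.max()          # stabilize before exponentiating
    weights = np.exp(log_weights)
    return weights / weights.sum()

# Illustrative values for three candidate models with a uniform model prior
print(posterior_model_probabilities([-105.2, -108.9, -104.7], [1/3, 1/3, 1/3]))
```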

Still, computing the posterior model probabilities requires the calculation of the marginal likelihoods $f^{(M)}(D \mid M_i)$. For some models, these can be computed analytically, as the marginal likelihood coincides with the normalizing constant of the model-specific posterior distribution $\pi^{(i)}(\cdot \mid D)$. Conversely, if the model's posterior distribution is difficult to obtain analytically, then so is the marginal likelihood. In such situations it has to be either approximated (e.g. by Laplace approximation or by the BIC, which is already an asymptotic approximation), computed by numerical integration, or estimated using Monte Carlo methods. Among the latter, Monte Carlo based estimation approaches utilizing posterior samples are covered within Chapter 3.

In the case of an infinite number of available models, evaluating the marginal likelihood for all models is less obvious. For that scenario, sampling procedures which sample simultaneously from the posterior model distribution $\pi^{(M)}(\cdot \mid D)$ and the corresponding parameter posteriors $\pi^{(i)}(\cdot \mid D)$ ($M_i \in \mathcal{M}$) are able to approximate the respective distributions (Toni et al., 2009; Green, 1995).

Having computed the posterior model probabilities, it is possible to calculate averaged values for quantities which are defined for each model in $\mathcal{M}$. For instance, when modelling infectious disease transmission, we are often interested in the predictive distribution of future case counts. Different models yield different predictive distributions, but we are interested in a joint prediction from all models. More formally, suppose one is interested in a random variable with distribution function $Q$, which is defined within each model $M_i \in \mathcal{M}$. The distribution function $Q$, however, is defined differently for each model in $\mathcal{M}$, as it may for instance depend on the parameter $\vartheta^{(i)} \in \Theta^{(i)}$ corresponding to $M_i$, where the parameter spaces $\Theta^{(i)}$ need not coincide for all considered models. Thus, let $Q^{(i)}$ denote the distribution function of $Q$ when defined within a specific model $M_i$, i.e. $Q|_{M = M_i} = Q^{(i)}$. Then, the averaged distribution according to the posterior model distribution $\pi^{(M)}(\cdot \mid D)$ is defined via

\[
\bar{Q} = \sum_{M_i \in \mathcal{M}} \pi^{(M)}(M_i \mid D) \cdot Q^{(i)}.
\]
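In practice, one convenient way to work with the averaged distribution is to treat it as a mixture: first draw a model index according to the posterior model probabilities, then draw from the corresponding model-specific distribution $Q^{(i)}$. The sketch below is purely illustrative; the per-model samplers are placeholders standing in for, e.g., model-specific simulations of future case counts, and the weights are made up.

```python
import numpy as np

rng = np.random.default_rng(3)

# Posterior model probabilities and placeholder samplers for Q^(i) under each model
model_probs = np.array([0.62, 0.30, 0.08])
samplers = [
    lambda: rng.poisson(lam=12),   # e.g. predicted case count under M_1
    lambda: rng.poisson(lam=15),   # under M_2
    lambda: rng.poisson(lam=9),    # under M_3
]

def draw_from_averaged_distribution(n_draws=10_000):
    """Sample from the averaged distribution via its mixture representation."""
    model_idx = rng.choice(len(model_probs), size=n_draws, p=model_probs)
    return np.array([samplers[i]() for i in model_idx])

draws = draw_from_averaged_distribution()
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
```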

A common example is given by averaging the predictive distribution of the data $D$ (here assuming it has an absolutely continuous distribution). For each model $M_i$ the (mostly multi-dimensional) probability density $f^{(i)}(\cdot)$ of $D$ is defined via the likelihood function $f^{(i)}(\cdot \mid \vartheta^{(i)})$ and the parameter posterior $\pi^{(i)}(\cdot \mid D)$ by

\[
f^{(i)}(D) = \int_{\Theta^{(i)}} f^{(i)}\!\left(D \mid \vartheta^{(i)}\right) \pi^{(i)}\!\left(\vartheta^{(i)} \mid D\right) \mathrm{d}\vartheta^{(i)},
\]

i.e. as the likelihood of observing $D$ in model $M_i$, averaged with respect to the posterior. The averaged predictive probability density is then given by

\[
f(D) = \sum_{M_i \in \mathcal{M}} \pi^{(M)}(M_i \mid D) \cdot f^{(i)}(D).
\]
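The sketch below illustrates this two-step construction under simplifying assumptions: each per-model predictive density $f^{(i)}(D)$ is estimated as the posterior-averaged likelihood from model-specific posterior draws, and the results are combined with the posterior model probabilities. The log-likelihood arrays are mocked up here and would come from actual MCMC output.

```python
import numpy as np

def log_mean_exp(log_vals):
    """Numerically stable log of the mean of exp(log_vals)."""
    m = np.max(log_vals)
    return m + np.log(np.mean(np.exp(log_vals - m)))

def averaged_predictive_density(log_lik_per_model, model_probs):
    """
    f(D) = sum_i pi^(M)(M_i | D) f^(i)(D), where each f^(i)(D) is estimated as the
    posterior-averaged likelihood from posterior draws of model M_i.

    log_lik_per_model: list of arrays; entry i holds log f^(i)(D | theta_s) evaluated
                       at posterior draws theta_s of model M_i.
    """
    log_pred = np.array([log_mean_exp(ll) for ll in log_lik_per_model])
    return np.sum(np.asarray(model_probs) * np.exp(log_pred))

# Illustrative inputs: mocked log-likelihood values at posterior draws of two models
rng = np.random.default_rng(4)
lls = [rng.normal(-104, 1.5, size=2_000), rng.normal(-106, 1.5, size=2_000)]
print(averaged_predictive_density(lls, model_probs=[0.8, 0.2]))
```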

Such averaging techniques can also be applied to, e.g., the distribution of a specific parameter component $\vartheta$ which is contained in every model in $\mathcal{M}$, by averaging over the respective model-specific marginal distributions $\pi^{(i)}(\vartheta \mid D)$. Moreover, if the posterior model distribution is heavily degenerate in favour of one single model, i.e. $\pi^{(M)}(M_i \mid D) > 0.99$ for one $M_i$, it might be worthwhile to use only this single model for any further analyses instead of including the marginal effect of all other improbable models through averaging.

Altogether, there exists a multitude of tools for measuring and comparing the validity of a set of models, which enables the identification of a best model to explain the data.

Each of the different criteria requires different objects for its evaluation, such as a posterior sample or the MLE. By computing posterior model probabilities, it is also possible to jointly process the whole ensemble of models with their respective model weights. Thus, one can compute averaged values for quantities which are common to all considered models.

These could be, e.g., the averaged predictive distribution of the data or averaged parameter distributions.

2.2 A new approach for addressing autocorrelated