
Mixed Models with Correlated Random Effects in the Bayesian Distributional Regression Framework / submitted by Tim Föckersperger


Submitted at
Institut für Angewandte Statistik

Supervisor
Assoz. Univ.-Prof.in Mag.a Dr.in Helga Wagner

February 2021

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69
4040 Linz, Österreich
www.jku.at
DVR 0093696

Mixed Models with Correlated Random Effects in the Bayesian Distributional Regression Framework

Master Thesis
to obtain the academic degree of
Master of Science
in the Master's Program
Statistics

STATUTORY DECLARATION

I hereby declare that the thesis submitted is my own unaided work, that I have not used other than the sources indicated, and that all direct and indirect sources are acknowledged as references.

This printed thesis is identical with the electronic version submitted.


Abstract

The implementation of mixed models in the Bayesian distributional regression framework is currently limited to independent random effects between the different distributional parameters. This Master thesis investigates the consequences of a violation of this independence assumption and outlines the necessary adaptation of MCMC inference in a model for normal data that allows correlation between the random intercepts in the models for the mean and the variance. In the second part of the Master thesis, arguments for the implementation of correlated random effects are provided: Firstly, the effects of a violation of the independence assumption are examined in a simulation study for zero-adjusted gamma (ZAGA) distributed data, where the results show that correlations between the random intercepts are shrunk to zero. Secondly, an analysis of data on mothers' yearly income over time after their return to the labor market following a maternity leave, using a ZAGA model with independent random intercepts, is presented. The estimation results suggest that a model with correlated random intercepts would be more appropriate.

Keywords: correlated random effects; mixed models; Bayesian distributional regression; MCMC


Contents

1 Introduction 6

2 The Bayesian Distributional Regression Framework 8

2.1 Bayes Modeling and Inference . . . 8

2.2 Bayesian Distributional Regression Model . . . 10

2.2.1 Model Definition . . . 10

2.2.2 Generic Model Representation . . . 11

2.2.3 Prior Distributions . . . 12

2.2.4 Posterior Inference . . . 13

3 Mixed Models with Correlated Random Effects in the Bayesian distributional regression framework 16

3.1 Correlated Random Effects . . . 17

3.2 Example: Normal Distribution with Correlated Random Intercepts . . . 18

3.2.1 Data . . . 18

3.2.2 Response Distribution . . . 18

3.2.3 Likelihood . . . 19

3.2.4 Prior Distributions for Linear and Non-linear Effects . . . 19

3.2.5 Prior Distribution for Random Intercepts . . . 20

3.2.6 Posterior Inference . . . 20

4 Simulation Study 27

4.1 Data Generating Process . . . 27

4.2 Simulation Set-up . . . 28

4.2.1 Generated Data Sets . . . 28

4.2.2 Fitted Models . . . 29

4.3 Technical Set-up . . . 30

4.3.1 Bayesian Distributional Regression in BayesX . . . 30

4.3.2 Technical Aspects of the Simulation Study . . . 31

4.4 Simulation Results . . . 31

4.4.1 Effect of the Absence of Random Intercepts on the Model Fitting . . . 31

4.4.2 Effects of the Misspecification of the Random Intercept's Correlation Structure . . . 33


5 Modeling Mothers’ Yearly Income 39

5.1 Data Set Description and Descriptive Statistics . . . 39

5.2 Model Specification . . . 40

5.3 Modeling Results . . . 42

5.3.1 Fixed Effects Estimation . . . 42

5.3.2 Random Intercepts Estimation . . . 43

6 Discussion 46

Appendix 47

A.1 Notation . . . 47

A.2 Inverse Gamma Distribution . . . 49

A.3 Seeds and Run Times of the Simulations . . . 50

A.4 Trace- and Autocorrelation Plots . . . 51

A.5 Non-linear Effects in the Simulation Study . . . 55

A.6 Descriptive Statistics . . . 58

A.7 BayesX Code . . . 60

A.7.1 Simulation Study . . . 60

A.7.2 Mothers’ Yearly Income . . . 60


Chapter 1

Introduction

Regression analysis is a powerful tool to infer the relationship between one or many explanatory variables of any type and a uni- or multivariate response variable. There are many regression methods, each serving different purposes. One very well-known model type is the class of mixed models. It extends linear models by random effects that primarily serve to account for a special covariance structure between subsets of the observed data. Therefore, mixed models are typically used for data consisting of repeated measurements (longitudinal data analysis) or clustered data (cluster analysis). Another model class that has gained popularity over the last few years is the Bayesian distributional regression framework. This model class was introduced by Klein, Kneib, Lang and Sohn in 2015 (Klein et al. (2015a)) and generalizes the generalized additive model class (GAM, Hastie and Tibshirani (1990)) in mainly two points: Firstly, any K-parametric distribution can be considered for the response variable and secondly, each of its distributional parameters can be modeled by covariates. These extensions provide great new flexibility as they (1) relax the distributional assumption and (2) allow not only the expected value to be modeled as a function of covariates. Furthermore, the possibility to model all distributional parameters implicitly extends the class of mixed models, as random effects can now be included in the linear predictors of any parameter. In this context, the central assumption of mixed models in the Bayesian distributional regression framework is that the random effects in the models for the different distributional parameters are independent. Hence, they are modeled separately from each other, ignoring the fact that there might be some hidden dependence structure between them.

The purpose of this Master thesis is to investigate the limitations of this independence assumption and to outline the changes in MCMC inference in a model for normally distributed data that allows for covariance or, respectively, correlation between the random intercepts in the models for the two parameters, the mean and the variance. Thereby, the joint distribution of the random intercepts is assumed to be a bivariate normal distribution with a positive definite variance-covariance matrix. This setting allows the covariance structure between the random intercepts to be modeled and accounted for; relaxing the independence assumption and investigating the dependence structure can therefore be accomplished.

To further motivate the implementation of correlated random effects, a simulation study is performed in which models assuming independent random effects in the models for the distributional parameters are fitted to data with different correlation structures between the random intercepts. The estimation results will be discussed with a particular focus on how the violated model assumption affects the estimation. Finally, a ZAGA mixed model is used to model the yearly income of mothers over time and to investigate the presence of correlated random intercepts in real-world data.


The remainder of this Master thesis is structured as follows:

Chapter 2 reviews the Bayesian distributional regression framework. Chapter 3 describes the framework for modeling the correlation between random effects in the models for different parameters and outlines the adaptation of the MCMC scheme for a normal model. The simulation study and its key results are provided in Chapter 4. Chapter 5 contains the real-world application. Finally, Chapter 6 summarizes the main findings.


Chapter 2

The Bayesian Distributional Regression Framework

In Section 2.1 of this chapter, the principles of Bayes modeling and some of its inference techniques are reviewed. In Section 2.2, the Bayesian distributional regression framework is briefly summarized in order to contextualize the extension of mixed models with correlated random effects in the subsequent chapters.

2.1 Bayes Modeling and Inference

In this section, the fundamentals of Bayesian inference are briefly reviewed.

Statistical inference is a powerful tool to infer certain characteristics from a small sample to a larger population (Casella and Berger (2002)). This is usually conducted by assuming a specific model and estimating the underlying model parameters θ. Inference can be drawn from mainly two perspectives, the frequentist and the Bayesian point of view.

One of the main differences between the two approaches is that in the frequentist setting the parameters θ are assumed to be fixed, while in the Bayesian framework, they are assumed to be random variables. Considering them to be stochastic allows to quantify the uncertainty of the parameters by a distribution p(θ), which is known as the prior distribution. This prior is a distribution that can be used to include prior beliefs on the parameters of interest by domain knowledge, results from previous studies or other possibly subjective assumptions (Hoff (2009)). Once data is gathered and quantified by the likelihood, Bayesian inference is then based on the posterior distribution of the parameters, which can be derived via Bayes’ theorem

p(θ|y) = p(y|θ) p(θ) / ∫_θ p(y|θ) p(θ) dθ ∝ p(y|θ) p(θ)    (2.1)

Posterior ∝ Likelihood × Prior

with ∫_θ p(y|θ) p(θ) dθ denoting the normalizing constant.

The interpretation of Bayes' theorem is conceptually very intuitive. It states how the prior uncertainty about the parameters changes by the information that is contained in the observed data. The updated beliefs are quantified by the posterior distribution. As point estimate for θ, any location parameter of the posterior, e.g., the expectation, the median or the mode, can be taken. Further, any interval I = [U_L, U_U] for which

∫_{U_L}^{U_U} p(θ|y) dθ = 1 − α

can be considered as a (1 − α) credibility interval (Hoff (2009)).
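As a simple worked illustration (not part of the thesis), consider a Beta(a, b) prior for a probability θ and y successes in n Bernoulli trials. The posterior is

p(θ|y) ∝ θ^y (1 − θ)^{n−y} · θ^{a−1} (1 − θ)^{b−1} = θ^{y+a−1} (1 − θ)^{n−y+b−1},

i.e., θ|y ∼ Beta(a + y, b + n − y). With a = b = 1, y = 7 and n = 10, the posterior is Beta(8, 4) with posterior mean 8/12 ≈ 0.67, and a credibility interval can be read off the quantiles of the Beta(8, 4) distribution.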

In many scenarios, the posterior distribution cannot be determined analytically and therefore, Markov-Chain-Monte-Carlo (MCMC) methods can be applied. The first conceptual ideas of MCMC were already proposed in the 1950s by Metropolis et al. (1953). Still, their usefulness for statistical inference was not realized until the 1990s (Gelfand and Smith (1990)) due to several reasons, such as missing computational power (Fahrmeir et al. (2007)). Nowadays, MCMC methods are widely used for Bayesian inference as they allow to draw samples from the posterior distribution efficiently. These samples can then be used to estimate practically any characteristics of the posterior distribution like, e.g., the expected value, the variance, quantiles or even the entire density function. One of the main advantages of MCMC techniques compared to other sampling methods is that they can generate samples even from extremely high-dimensional densities (Fahrmeir et al. (2007)). This is crucial, e.g., for Bayesian regression models as the number of parameters in such models can be quite enormous. Furthermore, the normalizing constant, which can be numerically very demanding or even unsolvable, does not need to be evaluated in order to draw samples from the posterior distribution.

The main objective of MCMC techniques is to choose a transition kernel so that the stationary distribution of the Markov chain is the posterior distribution. Convergence to the posterior distribution is guaranteed; nevertheless, the time until convergence depends on the starting values. Hence, the starting iterations until convergence is obtained, also called the burn-in period, are discarded to ensure that only draws from the posterior distribution are used for inference (Fahrmeir et al. (2007)). One commonly used MCMC technique is the Metropolis-Hastings algorithm (MH-algorithm). The MH-algorithm is a powerful and fast technique to generate samples when they cannot be directly drawn from the posterior. Its transition kernel is generated as follows:

1. Sample a candidate θ* from a proposal density q(θ*|θ_old)

2. Accept the draw with probability

α = min( 1, [p(θ*|y) q(θ_old|θ*)] / [p(θ_old|y) q(θ*|θ_old)] )

where θ_old denotes the previously generated sample. The proposal density is arbitrary, but a good choice yields a larger acceptance probability. In the case of two or more parameters, a multivariate proposal density can be used. However, it is recommended to draw the parameters in smaller blocks instead, as higher-dimensional proposal densities with large acceptance probability are rather hard to find.
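To make these two steps concrete, the following minimal Python sketch (an illustration added here, not part of the thesis, which uses BayesX) runs a random-walk MH sampler on a standard normal target; since the Gaussian random-walk proposal is symmetric, the proposal densities q cancel in the acceptance probability.

    import numpy as np

    rng = np.random.default_rng(1)

    def log_post(theta):
        # unnormalized log-posterior; a standard normal target for illustration
        return -0.5 * theta ** 2

    T, theta = 10000, 0.0
    draws = np.empty(T)
    for t in range(T):
        theta_star = theta + 0.5 * rng.standard_normal()    # 1. propose a candidate
        log_alpha = log_post(theta_star) - log_post(theta)  # symmetric q cancels
        if np.log(rng.uniform()) < log_alpha:               # 2. accept with prob. alpha
            theta = theta_star
        draws[t] = theta                                    # otherwise keep the old draw

    print(draws[1000:].mean(), draws[1000:].std())          # close to 0 and 1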

Another MCMC method that is often used is Gibbs sampling. This technique is extremely useful when the distribution of a parameter conditional on the data and all other parameters in the model, also called the full conditional distribution, can be derived in closed form. In this case, a sample for θ can be generated by directly drawing from its full conditional distribution. Gibbs sampling can be seen as a special case of an MH-step where each draw is accepted (Hoff (2009)).

In complex models with many parameters, it may occur that the full conditional can not be derived for all parameters. In such cases, it is common practice that a mixture of Gibbs sampling and MH-steps is applied. Gibbs sampling is then used for all parameters where the full conditional is available in closed form. For the parameters where this is not feasible, an MH-step is applied to generate a draw from the parameter’s full conditional. This procedure allows to generate samples from the posterior distribution for all model parameters which can then be used for inference (Hoff (2009)).
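As a hedged illustration of such a mixture (again Python, not the thesis's BayesX implementation), the sketch below fits a normal model with unknown mean μ and log-variance λ = log σ²: the full conditional of μ under a normal prior is available in closed form (Gibbs step), while λ under a normal prior is not and is updated by a random-walk MH-step.

    import numpy as np

    rng = np.random.default_rng(2)
    y = rng.normal(1.5, 2.0, size=200)        # simulated data
    n = y.size
    m0, v0 = 0.0, 100.0                       # N(m0, v0) prior for mu
    s_lam = 10.0                              # N(0, s_lam^2) prior for lam = log sigma^2

    def log_fc_lam(lam, mu):
        # log full conditional of lam (likelihood x prior, no closed form)
        return (-0.5 * n * lam - 0.5 * np.sum((y - mu) ** 2) / np.exp(lam)
                - 0.5 * lam ** 2 / s_lam ** 2)

    mu, lam, draws = 0.0, 0.0, []
    for t in range(5000):
        # Gibbs step: the full conditional of mu is a normal distribution
        sig2 = np.exp(lam)
        v_n = 1.0 / (n / sig2 + 1.0 / v0)
        mu = rng.normal(v_n * (y.sum() / sig2 + m0 / v0), np.sqrt(v_n))
        # MH-step for lam with a random-walk proposal
        lam_star = lam + 0.3 * rng.standard_normal()
        if np.log(rng.uniform()) < log_fc_lam(lam_star, mu) - log_fc_lam(lam, mu):
            lam = lam_star
        draws.append((mu, np.exp(lam)))

    mu_s, sig2_s = np.array(draws[500:]).T
    print(mu_s.mean(), np.sqrt(sig2_s.mean()))  # close to 1.5 and 2.0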

2.2 Bayesian Distributional Regression Model

In 2015, Klein, Kneib, Lang and Sohn (Klein et al. (2015a)) proposed the Bayesian distributional regression framework as the Bayesian counterpart to the generalized additive models for location, scale and shape (GAMLSS) by Rigby and Stasinopoulos (2005). The model class allows the response to be modeled by any K-parametric distribution and hence relaxes the exponential-family assumption of the GAMs. Additionally, it allows any of the distributional parameters to be modeled by covariates. The Bayesian distributional regression framework introduces many new opportunities in the modeling process such as, e.g., to

1. consider a wide range of distributions with different characteristics matching various types of data (e.g., mixtures of discrete-continuous distributions)

2. account for model misspecifications in overly simple models (e.g., heteroscedasticity, zero-inflation, over-dispersion)

3. investigate and model additional features of the response distribution (e.g., extreme observations (Umlauf and Kneib (2018)))

In the following, the model class is formally introduced based on the model definitions of Klein et al. (2015a), Umlauf and Kneib (2018), Umlauf et al. (2018), Kneib et al. (2019) and Belitz et al. (2015b).

2.2.1 Model Definition

Let (yi, xi) be the data observed for i = 1, . . . , n individuals. Conditionally on the covariates X = (x1, . . . , xn), the responses yi are assumed to be independent and identically distributed following a K-parametric distribution D

yi|xi ∼ D(θ1(xi), . . . , θK(xi)) (2.2)

with parameters θ_k, k = 1, . . . , K (Klein et al. (2015a), Umlauf and Kneib (2018)). For the choice of the response distribution D, any univariate distribution can be taken. This means that the response distribution is not limited to the exponential family but can be any continuous distribution on R or R+, any discrete distribution, any mixed discrete-continuous distribution, any distribution with compact support or potentially any other probability distribution (Klein et al. (2015b)). In order to connect the covariates to a parameter θ_k, a monotonic, twice differentiable link function h_k(.) is used. This results in

η_{k,i} = h_k(θ_k(x_i))    (2.3)

with η_{k,i} denoting the linear predictor of parameter θ_k for observation i. As in the generalized linear models (GLM, Nelder and Wedderburn (1972)) or the generalized additive models (GAM, Hastie and Tibshirani (1990)), the use of a link function guarantees that the domain restrictions of a parameter are preserved. For example, the variance of the normal distribution is positive and thus a log link function can be used to ensure that its values are positive. Further, as in the structured-additive models (Fahrmeir et al. (2004)), each linear predictor η_{k,i} is assumed to consist of a sum of unspecified functions f_{k,l}(.), l = 1, . . . , L_k, such that

η_{k,i} = h_k(θ_k(x_i)) = f_{k,1}(x_i; β_{k,1}) + . . . + f_{k,L_k}(x_i; β_{k,L_k})    (2.4)

where β_k = (β_{k,1}, . . . , β_{k,L_k}) are the regression parameters. Depending on the effect type that is used for a covariate or the combination of two or more covariates, the functions f_{k,l}(.) may be specified as (Umlauf and Kneib (2018)):

- Linear: x_{i,l} β_{k,l}
- Smooth non-linear: f_{k,l}(x_{i,l})
- Varying coefficient: x_{i,m} · f_{k,l}(x_{i,l})
- Interaction surface: f_{k,l}(x_{i,l}, x_{i,m})
- Spatial effect: f_{k,l}(i) with spatial index i
- Random effect: γ_{k,i} with group index i

2.2.2 Generic Model Representation

For each effect type, the function f_l(x_i) can be approximated by a combination of R_l basis functions

f_l(x_i) = Σ_{r=1}^{R_l} β_{l,r} B_{l,r}(x_i) = B_l(x_i)'β_l    (2.5)

with B_l(x_i) = (B_{l,1}(x_i), . . . , B_{l,R_l}(x_i))' representing the vector of basis functions and β_l = (β_{l,1}, . . . , β_{l,R_l})' the corresponding vector of basis coefficients. In matrix notation for all n observations, the generic model can be written as

f_l(X) = (f_l(x_1), . . . , f_l(x_n))' = Z_l β_l    (2.6)

with design matrix Z_l = (B_l(x_1), . . . , B_l(x_n))'.

This leads to a generic representation of the linear predictor in the form of

η = 1β_0 + Z_1β_1 + . . . + Z_Lβ_L    (2.7)

with η being the vector of the linear predictors of all observations for parameter k. Further, the design matrices Z_l are possibly different for each predictor (Klein et al. (2015a)).
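The following short Python sketch (illustrative only; the concrete basis functions and all names are chosen ad hoc, not taken from the thesis) evaluates Equations (2.5)-(2.6): R_l basis functions placed on a grid of knots are evaluated at the observed covariate values, stacked into a design matrix Z_l, and combined linearly with coefficients β_l.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, size=100)          # observed covariate values
    knots = np.linspace(0, 10, 8)             # R_l = 8 basis function centres

    # Gaussian bumps as a generic stand-in for the basis functions B_{l,r}
    Z = np.exp(-0.5 * ((x[:, None] - knots[None, :]) / 1.5) ** 2)

    beta = rng.normal(size=knots.size)        # basis coefficients beta_l
    f = Z @ beta                              # f_l(X) = Z_l beta_l, Equation (2.6)
    print(Z.shape, f.shape)                   # (100, 8) (100,)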

2.2.3 Prior Distributions

For the parameters β_l in Equation (2.7), prior distributions are specified. To allow certain properties such as shrinkage and/or smoothness, informative priors for each of the parameters may be used. The most common choice is a multivariate normal prior with precision matrix K_l and variance τ²_l:

p(β_l | τ²_l) ∝ (1/τ²_l)^{rk(K_l)/2} exp( −(1/(2τ²_l)) β'_l K_l β_l )    (2.8)

The variance τ²_l is a prior smoothing parameter. An inverse gamma distribution with hyperparameters (a_l, b_l) is used as hyperprior for τ²_l in order to guarantee data-driven smoothness of the effects (Klein and Kneib (2016), Kneib et al. (2019)). The density, the expected value and the variance of the inverse gamma distribution can be found in Appendix A.2.

The choice of the precision matrix K_l depends on the type of effect and K_l may not be of full rank, leading to a partially improper prior. In the following, the specification of K_l is shown for three examples: linear effects, non-linear effects (P-splines) and random intercepts. Prior distributions for other effect types are discussed in Belitz et al. (2015b).

Linear Effect. For a linear effect, a common choice for the prior distribution is a noninformative flat prior. This prior, which can be obtained by setting the precision matrix K_l = 0, can also be seen as a limiting case (Klein et al. (2015a)). Alternatively, a proper normal distribution can be used as a simple form of an informative prior.

Non-linear Effect (P-spline). A common approach to model the effect of a continuous variable as non-linear is the P-spline approach (Eilers and Marx (1996)). The function f_l(x) of Equation (2.5) is approximated by a linear combination of R_l polynomial splines of a certain degree, the so-called B-spline basis functions; a common choice is cubic splines. The B-spline basis functions are distributed over an equidistant grid of knots and weighted by the basis function coefficients β_{l,r}, r = 1, . . . , R_l:

f_l(x) = Σ_{r=1}^{R_l} β_{l,r} B_{l,r}(x)

with B_{l,r} indicating the rth B-spline basis function of function f_l.

In a Bayesian approach, the evolution of the coefficients can be modeled by a random walk dependence of order d (RWd). A first or second order random walk is defined as

β_{l,r} = β_{l,r−1} + ε_r for r = 2, . . . , R_l
β_{l,r} = 2β_{l,r−1} − β_{l,r−2} + ε_r for r = 3, . . . , R_l

with ε_r ∼ N(0, τ²_l) and noninformative priors for β_{l,1}, or β_{l,1} and β_{l,2}, respectively (Lang and Brezger (2004)). Further, Lang and Brezger (2004) showed that the joint prior of the basis function coefficients follows a (partially) improper prior distribution with mean 0 and K = D'_d D_d, where D_d is a difference matrix of order d (compare also Fahrmeir et al. (2007)).
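The penalty construction can be checked numerically; the Python sketch below (illustrative, not from the thesis) builds D_d and K = D'_d D_d for a second order random walk and confirms the rank deficiency rk(K) = R_l − d that makes the prior partially improper.

    import numpy as np

    R, d = 20, 2                              # basis dimension, random walk order
    D = np.diff(np.eye(R), n=d, axis=0)       # difference matrix D_d, shape (R-d, R)
    K = D.T @ D                               # prior precision matrix K = D_d' D_d
    print(np.linalg.matrix_rank(K))           # R - d = 18, so K is rank deficient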

Random Intercepts. A random intercept is a special effect type that is used to estimate an individual- or cluster-specific deviation from the global intercept when data with, e.g., repeated measurements or clustered data are modeled. For the random intercepts, the notation γ_i, i = 1, . . . , n, is used in this thesis. The difference to other effects is that the coefficients all share a normal prior with the same variance τ²_γ:

γ_i ∼ N(0, τ²_γ)

This can be obtained by setting K = I in Equation (2.8) (Belitz et al. (2015b)).

2.2.4 Posterior Inference

Let θ = (β, γ, τ², τ²_γ) be the collection of the model parameters, where β = (β_1, . . . , β_K) denotes a vector with the regression coefficients in the models of all distributional parameters and τ² = (τ²_{1,1}, . . . , τ²_{1,L_1}, . . . , τ²_{K,1}, . . . , τ²_{K,L_K}) denotes the corresponding (optional) prior variances. The vectors a = (a_{1,1}, . . . , a_{1,L_1}, . . . , a_{K,1}, . . . , a_{K,L_K}) and b = (b_{1,1}, . . . , b_{1,L_1}, . . . , b_{K,1}, . . . , b_{K,L_K}) are the optional hyperparameters of the hyperpriors on τ², in case a hyperprior for the prior smoothing variance is specified. γ = (γ_1, . . . , γ_k, . . . , γ_K) with γ_k = (γ_{k,1}, . . . , γ_{k,i}, . . . , γ_{k,n}) denotes the vector of random intercepts for all parameters. The vector τ²_γ = (τ²_{1,γ}, . . . , τ²_{K,γ}) contains the prior variances of the random intercepts. The hyperparameters of the hyperprior on τ²_γ are given by a_γ = (a_{1,γ}, . . . , a_{K,γ}) and b_γ = (b_{1,γ}, . . . , b_{K,γ}). To justify the use of random intercepts, j = 1, . . . , n_i repeated measurements from i = 1, . . . , n individuals/clusters are assumed.

According to Bayes' theorem (see Equation (2.1)), the posterior for θ can then be derived as

p(θ|y, X, a, b) ∝ Π_{i=1}^{n} Π_{j=1}^{n_i} p(y_{i,j} | θ_1(x_i, β_1, γ_{1,i}), . . . , θ_K(x_i, β_K, γ_{K,i}))
    × Π_{k=1}^{K} Π_{l=1}^{L_k} p(β_{k,l} | τ²_{k,l}) p(τ²_{k,l} | a_{k,l}, b_{k,l})
    × Π_{k=1}^{K} Π_{i=1}^{n} p(γ_{k,i} | τ²_{k,γ}) p(τ²_{k,γ} | a_{k,γ}, b_{k,γ})    (2.9)

where p(y_{i,j} | θ_1(x_i, β_1, γ_{1,i}), . . . , θ_K(x_i, β_K, γ_{K,i})) denotes the density of observation j of individual i from the K-parametric response distribution D.

The posterior distribution can only be determined analytically in very few cases. In most situations, the posterior is not of closed form, as either the likelihood itself or the combination of likelihood and prior distribution is of a complex form and cannot be specified as a known distribution function. Therefore, Klein et al. (2015a) proposed to use an MH-algorithm for posterior inference. As it is not possible to derive all full conditional distributions of the model parameters, they suggest to use an MH-step to generate a sample from a full conditional using iterative weighted least squares (IWLS) proposals. The idea is to construct an IWLS proposal that is based on a local quadratic approximation to the full conditional (Gamerman (1997), Brezger and Lang (2006)).

Let log(L) denote the log-likelihood. The derived proposal for regression coefficient β_{k,l} is a normal distribution N(μ_{k,l}, P⁻¹_{k,l}) with

μ_{k,l} = P⁻¹_{k,l} Z'_{k,l} W (z − η_{−(k,l)})    and    P_{k,l} = Z'_{k,l} W Z_{k,l} + (1/τ²_{k,l}) K_{k,l}    (2.10)

where W is a diagonal matrix of working weights w_i = E(−∂² log(L) / ∂η²_i), z = η + W⁻¹v is a working response that depends on the score vector v = ∂ log(L) / ∂η, and η_{−(k,l)} = η − Z_{k,l}β_{k,l} is the linear predictor without the lth term. The working weights and the score vector depend on the chosen response distribution. A list with their derivations for the most common distributions can be found in Supplement B of Klein et al. (2015b). The main advantages of this approach are that the approximation to the full conditional does not require any manual tuning and that it provides a very general approach for basically any distribution once the working weights and the score vector are known (Klein et al. (2015a)).

In contrast to the regression coefficients β, the full conditionals of the smoothing variances τ²_{k,l} can be derived in closed form and are inverse gamma distributions IG(a*_{k,l}, b*_{k,l}) with updated parameters

a*_{k,l} = a_{k,l} + rk(K_{k,l})/2    and    b*_{k,l} = b_{k,l} + (1/2) β'_{k,l} K_{k,l} β_{k,l}.    (2.11)

As already mentioned in Section 2.2.3, random intercepts can be considered as a special effect type and hence the proposal of Equation (2.10) can be adapted to generate a sample for the random intercepts from their full conditional distribution. Just like the distribution of the smoothing variances, the full conditional of the random intercept variance τ²_{k,γ} is also an inverse gamma distribution IG(a*_{k,γ}, b*_{k,γ}) with updated parameters

a*_{k,γ} = a_{k,γ} + n/2    and    b*_{k,γ} = b_{k,γ} + (1/2) γ'_k γ_k.    (2.12)
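As a hedged sketch of Equation (2.10) in Python (not the thesis's BayesX code), the following computes the IWLS proposal moments for one coefficient block in a Poisson model with log link, where the standard GLM results w_i = exp(η_i) and v_i = y_i − exp(η_i) apply.

    import numpy as np

    rng = np.random.default_rng(4)
    n, R = 200, 5
    Z = rng.normal(size=(n, R))               # design matrix Z_{k,l}
    K = np.eye(R)                             # prior precision matrix K_{k,l}
    tau2 = 1.0                                # current smoothing variance
    eta_rest = 0.5 * np.ones(n)               # eta_{-(k,l)}: predictor without the block
    beta = rng.normal(scale=0.1, size=R)      # current state of the block
    eta = eta_rest + Z @ beta
    y = rng.poisson(np.exp(eta))              # Poisson responses under the log link

    w = np.exp(eta)                           # working weights w_i
    v = y - np.exp(eta)                       # score vector v_i
    z = eta + v / w                           # working response z = eta + W^{-1} v

    P = Z.T @ (w[:, None] * Z) + K / tau2     # proposal precision P_{k,l}
    mu = np.linalg.solve(P, Z.T @ (w * (z - eta_rest)))         # proposal mean mu_{k,l}
    beta_star = rng.multivariate_normal(mu, np.linalg.inv(P))   # candidate draw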

The basic MCMC algorithm is outlined in Algorithm 1.

MCMC-algorithm:

1. Set the hyperparameters a, a_γ, b, b_γ and the number of iterations T

2. Set starting values for all regression coefficients β^(0), smoothing variances (τ²)^(0), random intercepts γ^(0) and the variances (τ²_γ)^(0)

3. Iterate for t = 1, . . . , T:

(a) For k = 1, . . . , K and l = 1, . . . , L_k, generate a draw for β_{k,l} from its full conditional p(β_{k,l} | β_{−(k,l)}^[t−1], γ^[t−1], τ_{k,l}^[t−1], y), where β_{−(k,l)} denotes the collection of regression coefficients β except β_{k,l}:

 i. Sample β*_{k,l} from the proposal q(β*_{k,l} | β_{k,l}^[t−1]) with parameters given in Equation (2.10)

 ii. Accept the new draw β*_{k,l} for β_{k,l} with acceptance probability

 α(β*_{k,l} | β_{k,l}^[t−1]) = min{ 1, [ p(β*_{k,l} | β_{−(k,l)}^[t−1], γ^[t−1], τ_{k,l}^[t−1], y) q(β_{k,l}^[t−1] | β*_{k,l}) ] / [ p(β_{k,l}^[t−1] | β_{−(k,l)}^[t−1], γ^[t−1], τ_{k,l}^[t−1], y) q(β*_{k,l} | β_{k,l}^[t−1]) ] },

 otherwise set β_{k,l}^[t] = β_{k,l}^[t−1]

(b) For k = 1, . . . , K and l = 1, . . . , L_k, generate a draw for τ²_{k,l} from its full conditional p(τ²_{k,l} | β_{k,l}^[t], a_{k,l}, b_{k,l}):

 Sample (τ²_{k,l})^[t] from the inverse gamma distribution IG(a_{k,l}^[t], b_{k,l}^[t]) with updated parameters given in Equation (2.11)

(c) For k = 1, . . . , K and i = 1, . . . , n, generate a draw for γ_{k,i} from its full conditional p(γ_{k,i} | β^[t], γ_{−(k,i)}^[t−1], (τ²_{k,γ})^[t−1], y_i), where γ_{−(k,i)} = (γ_{1,i}, . . . , γ_{k−1,i}, γ_{k+1,i}, . . . , γ_{K,i}) denotes the collection of random intercepts for individual i except γ_{k,i}:

 i. Sample γ*_{k,i} from the proposal q(γ*_{k,i} | γ_{k,i}^[t−1]) with adjusted parameters from Equation (2.10)

 ii. Accept the new draw γ*_{k,i} for γ_{k,i} with acceptance probability

 α(γ*_{k,i} | γ_{k,i}^[t−1]) = min{ 1, [ p(γ*_{k,i} | β^[t], γ_{−(k,i)}^[t−1], (τ²_{k,γ})^[t−1], y_i) q(γ_{k,i}^[t−1] | γ*_{k,i}) ] / [ p(γ_{k,i}^[t−1] | β^[t], γ_{−(k,i)}^[t−1], (τ²_{k,γ})^[t−1], y_i) q(γ*_{k,i} | γ_{k,i}^[t−1]) ] },

 otherwise set γ_{k,i}^[t] = γ_{k,i}^[t−1]

(d) For k = 1, . . . , K, generate a draw for τ²_{k,γ} from its full conditional p(τ²_{k,γ} | γ_k^[t], a_{k,γ}, b_{k,γ}):

 Sample (τ²_{k,γ})^[t] from the inverse gamma distribution IG(a_{k,γ}^[t], b_{k,γ}^[t]) with updated parameters given in Equation (2.12)

Algorithm 1: General MCMC algorithm for a Bayesian distributional regression model


Chapter 3

Mixed Models with Correlated Random Effects in the Bayesian Distributional Regression Framework

Compared to linear models, mixed models are extended by random effects in the linear predictor. Therefore, they are often also called random effects models (Fahrmeir et al. (2007)). Mixed models are frequently used to model data where the independence assumption between the responses might be violated and a specific correlation structure between subsets of the observations is assumed. Two examples are repeated measurements and clustered data. In both cases, it is likely that observations of the same individual or within the same cluster are more similar than observations from two different individuals or clusters. To account for correlation between observations, random effects were introduced to model the heterogeneity between the subjects and the dependence of observations within one subject (Jiang (2007)). Thereby, the random effects estimate individual- or cluster-specific deviations from the global effects (Fahrmeir et al. (2007)). The two random effect types that are most frequently used are random intercepts and random slopes.

In Bayesian distributional regression models, the inclusion of random effects is possible in the linear predictor of any regression parameter (see Chapter 2). This extension provides the opportunity to account for the dependence within an individual/cluster and for the variation across the individuals/clusters for all distributional parameters. From the modeling perspective, the random effects in the models for parameters θ_k and θ_{k'} are assumed to be independent, e.g. for random intercepts:

γ_{k,i} ∼ N(0, τ²_{k,γ}) ∀ k = 1, . . . , K    with γ_{k,i} ⊥⊥ γ_{k',i} if k ≠ k'    (3.1)

The independence assumption simplifies inference but also constitutes a restriction, because it implies that the random effects are uncorrelated. This limits the modeling frame: correlation between random effects cannot be handled and causes a violation of the assumption, which may substantially impact the estimation results. This issue can be overcome by extending the current implementation of independent random effects in Bayesian distributional regression by allowing non-zero covariance between the random effects in the models for the different distributional parameters.

The chapter is structured as follows: Firstly, in Section 3.1 the framework for modeling the correlation between random effects in the models for the different distributional parameters is described. Secondly, in Section 3.2 inference for correlated random intercepts in the mean and the variance of a normal distribution is outlined.

3.1 Correlated Random Effects

It is common practice in mixed models, e.g., in GAMMs (Lin and Zhang (1999)), to allow random intercepts and random slopes in the model for the conditional expectation to be correlated within an individual i by assuming a bivariate normal prior distribution with non-zero covariance (Fahrmeir et al. (2007)). Similarly, random effects in the models for different parameters need not be independent (as in the current implementation of Bayesian distributional regression) but can be correlated. This extension enables investigating the dependence structure between random effects that are used for modeling different parameters.

This section focuses only on modeling the correlation between random intercepts, but extending the approach to, e.g., random slopes is straightforward. An alternative to the assumption that the random intercepts are independently normal (see Equation (3.1)) is to assume that they follow a multivariate normal distribution with positive definite covariance matrix V,

(γ_{1,i}, γ_{2,i}, . . . , γ_{K,i})' ∼ N(0, V)    with

V = [ τ²_{1,γ}    τ_{1,2,γ}   . . .  τ_{1,K,γ} ;
      τ_{2,1,γ}   τ²_{2,γ}    . . .  τ_{2,K,γ} ;
      . . .                   . . .            ;
      τ_{K,1,γ}   τ_{K,2,γ}   . . .  τ²_{K,γ}  ]    (3.2)

where the variance τ²_{k,γ} denotes by how much the individual-specific effects vary around the global intercept for parameter k and τ_{k,k',γ}, with k ≠ k', expresses the covariance between the individual-specific effects of an individual i in the models of the different parameters θ_k and θ_{k'}. The correlation ρ_{k,k'} is calculated as ρ_{k,k'} = τ_{k,k',γ} / (τ_{k,γ} · τ_{k',γ}).

Due to the property that uncorrelatedness between any subsets of a random vector that follows a multivariate normal distribution implies independence, the approach of modeling the random intercepts independently can be seen as a special case of Equation (3.2), assuming that the covariances are equal to zero. Thus, modeling the random intercepts with a multivariate normal distribution can be seen as a generalization, which further allows investigating whether and how strongly different random intercepts correlate with each other. The interpretation of the correlation is especially interesting in cases where the distributional parameters are directly connected with the central moments. As an example, in a normal model where the expected value and the variance depend on covariates, a positive covariance between the random intercepts can be interpreted as follows: individuals that have higher individual-specific effects for the expected value also tend to have higher effects for the variance.
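For K = 2, the construction of Equation (3.2) can be sketched in a few lines of Python (illustrative values only, not from the thesis): V is assembled from the standard deviations and a correlation, random intercepts are drawn for many individuals, and the empirical correlation recovers ρ_{1,2}.

    import numpy as np

    rng = np.random.default_rng(5)
    tau1, tau2, rho = 0.5, 0.8, 0.6           # tau_{1,gamma}, tau_{2,gamma}, rho_{1,2}
    V = np.array([[tau1 ** 2,         rho * tau1 * tau2],
                  [rho * tau1 * tau2, tau2 ** 2        ]])

    gamma = rng.multivariate_normal(np.zeros(2), V, size=2000)  # one row per individual
    print(np.corrcoef(gamma.T)[0, 1])         # empirical correlation, close to 0.6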

For independent random effects, inverse gamma hyperpriors on the variances τ²_{k,γ}, k = 1, . . . , K, are specified (see Equation (3.1)). This is a convenient choice as the inverse gamma distribution is a conjugate prior for the variance of a univariate normal distribution. Due to the correlation structure, an inverse Wishart distribution is proposed as hyperprior for the covariance matrix V instead. Similar to the inverse gamma distribution, the inverse Wishart distribution is a conjugate prior for the covariance matrix of a multivariate normal distribution. This is especially useful in the context of Gibbs sampling as the full conditional of V can be derived in closed form.

3.2 Example: Normal Distribution with Correlated Random Intercepts

The previous section described how to model the correlation between random effects for different parameters. To illustrate how inference has to be adapted when random effects are correlated, MCMC inference for a normal model with correlated random intercepts in the mean and the variance is given below. In the following, the entire modeling framework is provided, including the concrete response distribution, the likelihood, the prior distribution, the posterior distribution, and a possible MCMC scheme.

3.2.1 Data

Let (y_{i,j}, x_{i,j}) be the data observed from i = 1, . . . , n individuals with j = 1, . . . , n_i observations per individual.

3.2.2 Response Distribution

The response variable y = (y_1, . . . , y_i, . . . , y_n) with y_i = (y_{i,1}, . . . , y_{i,n_i}) is assumed to follow a normal distribution with parameters k = 1, 2, where "1" indicates μ and "2" indicates σ². Both parameters are modeled as a function of the covariates X = (x_1, . . . , x_i, . . . , x_n) with x_i = (x_{i,1}, . . . , x_{i,n_i}). The density of one observation y_{i,j} is given by

p(y_{i,j} | μ(x_{i,j}, β_1, γ_{1,i}), σ²(x_{i,j}, β_2, γ_{2,i})) = 1/√(2π σ²(x_{i,j}, β_2, γ_{2,i})) · exp( −(y_{i,j} − μ(x_{i,j}, β_1, γ_{1,i}))² / (2σ²(x_{i,j}, β_2, γ_{2,i})) )

with regression coefficients β_k = (β_{k,1}, . . . , β_{k,L_k})' and random intercepts γ_i = (γ_{1,i}, γ_{2,i})' of individual i. To preserve the positive domain of the variance σ², a log link function is used. For the mean μ, the identity link function is applied as the domain of the parameter is not restricted. Let z_{1,i,j} and z_{2,i,j} denote the design vectors of the fixed effects of observation j of individual i for the two parameters. The link functions are then defined as

μ(x_{i,j}, β_1, γ_{1,i}) = z_{1,i,j}β_1 + γ_{1,i} = η_{1,i,j}    (3.3)

log(σ²(x_{i,j}, β_2, γ_{2,i})) = z_{2,i,j}β_2 + γ_{2,i} = η_{2,i,j}    (3.4)

with η_{1,i,j} and η_{2,i,j} being the linear predictors of observation j of individual i for the parameters μ and σ². For all observations of an individual i, the linear predictors can be written in matrix notation as

μ(x_i, β_1, γ_{1,i}) = Z_{1,i}β_1 + v_{1,i}γ_{1,i} = η_{1,i}    (3.5)

log(σ²(x_i, β_2, γ_{2,i})) = Z_{2,i}β_2 + v_{2,i}γ_{2,i} = η_{2,i}    (3.6)

with Z_{k,i} = (z_{k,i,1}, . . . , z_{k,i,j}, . . . , z_{k,i,n_i})', v_{k,i} = (v_{k,i,1}, . . . , v_{k,i,j}, . . . , v_{k,i,n_i})' and η_{k,i} = (η_{k,i,1}, . . . , η_{k,i,j}, . . . , η_{k,i,n_i})', where each element of the design matrix of the random intercepts v_{k,i} is a one, as only random intercepts are considered in this example. For the further sections, for parameter k, Z_k = (Z_{k,1}, . . . , Z_{k,i}, . . . , Z_{k,n}) denotes the design matrix, η_k = (η_{k,1}, . . . , η_{k,i}, . . . , η_{k,n}) the vector of the linear predictors for all observations and γ_k = (γ_{k,1}, . . . , γ_{k,i}, . . . , γ_{k,n}) the vector of random intercepts.

3.2.3 Likelihood

Let γ = (γ_1, γ_2)' be the vector of the random intercepts in the models for μ and σ². The likelihood is given by

p(y | β_1, β_2, γ) = Π_{i=1}^{n} Π_{j=1}^{n_i} p(y_{i,j} | β_1, β_2, γ_i)
 = Π_{i=1}^{n} Π_{j=1}^{n_i} p(y_{i,j} | μ(x_{i,j}, β_1, γ_{1,i}), σ²(x_{i,j}, β_2, γ_{2,i}))
 = Π_{i=1}^{n} Π_{j=1}^{n_i} 1/√(2π σ²(x_{i,j}, β_2, γ_{2,i})) · exp( −(y_{i,j} − μ(x_{i,j}, β_1, γ_{1,i}))² / (2σ²(x_{i,j}, β_2, γ_{2,i})) )    (3.7)

with p(y_{i,j} | β_1, β_2, γ_i) being the likelihood contribution of observation j of individual i.
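Since every MH-step below evaluates this likelihood, a small Python helper for the log of Equation (3.7) may be useful (an illustration under assumed names, not thesis code); eta1 and eta2 collect the linear predictors η_{1,i,j} and η_{2,i,j} of all observations.

    import numpy as np

    def loglik(y, eta1, eta2):
        # log of Equation (3.7); eta1 is the mean, exp(eta2) the variance
        sig2 = np.exp(eta2)
        return np.sum(-0.5 * np.log(2 * np.pi * sig2) - (y - eta1) ** 2 / (2 * sig2))

    # toy check: evaluate at the true predictors of simulated data
    rng = np.random.default_rng(6)
    eta1, eta2 = rng.normal(size=50), np.full(50, -1.0)
    y = rng.normal(eta1, np.sqrt(np.exp(eta2)))
    print(loglik(y, eta1, eta2))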

3.2.4 Prior Distributions for Linear and Non-linear Effects

The regression parameters β_1 and β_2 may include the overall intercept and various other effect types. The two most commonly used ones are linear and non-linear effects. For linear effects, a flat prior is typically used, while for non-linear effects an (improper multivariate) normal prior with precision matrix K = D'_d D_d, with D_d being a difference matrix of order d (see Section 2.2.3), is applied. The hyperprior of the prior variance for a non-linear effect is an inverse gamma distribution with two hyperparameters. In the case of a flat prior for linear effects, no additional parameter is present and thus no hyperprior needs to be specified. The regression parameter β_1 in the model for parameter μ is therefore a collection of regression coefficients β_1 = (β_{1,1}, β_{1,2}, . . . , β_{1,L_1}), where each β_{1,l} denotes a regression coefficient for one type of effect which may (like, e.g., for a non-linear effect) again be a vector. The same holds for the regression coefficient β_2.


3.2.5 Prior Distribution for Random Intercepts

Random intercepts are used in the linear predictors of both parameters to model the individual-specific effects in the models for the expectation μ and the variance σ². The random intercepts γ_{1,i} and γ_{2,i}, i = 1, . . . , n, are assumed to be independent w.r.t. i. Relaxing the independence assumption of random effects that is usually imposed in Bayesian distributional regression, a bivariate normal distribution with positive definite covariance matrix V is used for the random intercepts of one individual i:

γ_i = (γ_{1,i}, γ_{2,i})' ∼ N(0, V)    with    V = [ τ²_{1,γ}   τ_{1,2,γ} ;  τ_{2,1,γ}   τ²_{2,γ} ]    (3.8)

where the variances τ²_{1,γ} and τ²_{2,γ} measure the variation around the global intercept in the models for parameter μ and for σ². The covariance τ_{1,2,γ} (= τ_{2,1,γ}) specifies the covariance between the individual-specific effects. The correlation ρ_{1,2} is calculated as ρ_{1,2} = τ_{1,2,γ} / (τ_{1,γ} · τ_{2,γ}).

As a prior for the covariance matrix V, an inverse Wishart distribution with hyperparameters ν and S⁻¹ is defined:

V ∼ W⁻¹(ν, S⁻¹)    (3.9)

where ν is a scalar and S⁻¹ is a 2 × 2 matrix.

3.2.6 Posterior Inference

Let τ² = (τ²_1, τ²_2) with τ²_k = (τ²_{k,1}, . . . , τ²_{k,l}, . . . , τ²_{k,L_k}), k = 1, 2, be the (optional) prior variances of the regression parameters β_1 and β_2, and let a = (a_{1,1}, . . . , a_{1,L_1}, a_{2,1}, . . . , a_{2,L_2}) and b = (b_{1,1}, . . . , b_{1,L_1}, b_{2,1}, . . . , b_{2,L_2}) be the corresponding optional hyperparameters of the hyperpriors on τ².

According to Bayes' theorem, the posterior of all model parameters θ = (β_1, β_2, γ, τ², V) is proportional to the product of the likelihood and the priors:

p(θ | y, X, a, b) ∝ p(y | β_1, β_2, γ) · Π_{k=1}^{2} Π_{l=1}^{L_k} p(β_{k,l} | τ²_{k,l}) p(τ²_{k,l} | a_{k,l}, b_{k,l}) · Π_{i=1}^{n} p(γ_i | V) · p(V)    (3.10)

The posterior distribution cannot be determined in closed form. One reason is that the normalizing constant p(y), which is omitted in Equation (3.10), is a high dimensional integral that would need to be solved. However, Bayesian inference is still feasible by sampling from the posterior using MCMC methods as described in Section 2.2.4.

In the case that the full conditional can be determined, Gibbs sampling can be applied and a sample can be drawn directly from the full conditional. Otherwise, a draw is generated by an MH-step with IWLS proposal. Algorithm 2 outlines the MCMC scheme for this model; the steps that differ from Algorithm 1 are those involving the correlated random intercepts and their covariance matrix. Each step of the algorithm is explained in detail in the following sub-sections:


MCMC algorithm

1. For β_1 and β_2 with a proper prior, set the hyperparameters a and b of the hyperpriors of the smoothing variances τ². For the inverse Wishart hyperprior of V, set the hyperparameters ν and S⁻¹. Set the number of iterations T.

2. Set starting values for the regression coefficients β_1^(0) and β_2^(0), the smoothing variances (τ²)^(0), the random intercepts γ^(0) and the prior covariance matrix V^(0)

3. Iterate for t = 1, . . . , T:

(a) For l = 1, . . . , L_1, generate a draw for β_{1,l} from the full conditional p(β_{1,l} | β_{1,−l}^[t−1], β_2^[t−1], γ^[t−1], (τ²)^[t−1], y):

 Draw β_{1,l}^[t] from the (multivariate) normal distribution N(m_{1,l}^[t], S_{1,l}^[t])

(b) For l = 1, . . . , L_2, generate a draw for β_{2,l} from the full conditional p(β_{2,l} | β_1^[t], β_{2,−l}^[t−1], γ^[t−1], (τ²)^[t−1], y):

 As the full conditional is not of closed form, an MH-step with IWLS proposal is required.

(c) For l = 1, . . . , L_1, generate a draw for τ²_{1,l} from the full conditional p(τ²_{1,l} | β_{1,l}^[t], a_{1,l}, b_{1,l}):

 Draw (τ²_{1,l})^[t] from the inverse gamma distribution IG(a_{1,l}^[t], b_{1,l}^[t])

(d) For l = 1, . . . , L_2, generate a draw for τ²_{2,l} from the full conditional p(τ²_{2,l} | β_{2,l}^[t], a_{2,l}, b_{2,l}):

 Draw (τ²_{2,l})^[t] from the inverse gamma distribution IG(a_{2,l}^[t], b_{2,l}^[t])

(e) For i = 1, . . . , n, generate a draw for γ_i = (γ_{1,i}, γ_{2,i})' from the full conditional p(γ_i | β_1^[t], β_2^[t], V^[t−1], y_i):

 Option 1: As the full conditional is not of closed form, an MH-step with IWLS proposal is required.

 Option 2: Generate a draw for γ_{1,i} from the full conditional p(γ_{1,i} | β_1^[t], β_2^[t], γ_{2,i}^[t−1], V^[t−1], y_i) and for γ_{2,i} from the full conditional p(γ_{2,i} | β_1^[t], β_2^[t], γ_{1,i}^[t], V^[t−1], y_i):

 A. Draw γ_{1,i} from the normal distribution N(g_{1,i}^[t], G_{1,i}^[t])

 B. As the full conditional is not of closed form, an MH-step with IWLS proposal is required for the draw of γ_{2,i}.

 Option 3: Generate a draw for γ_{2,i} from the marginal p(γ_{2,i} | β_1^[t], β_2^[t], V^[t−1], y_i) and then for γ_{1,i} from the full conditional p(γ_{1,i} | β_1^[t], β_2^[t], γ_{2,i}^[t], V^[t−1], y_i):

 A. As the marginal is not of closed form, an MH-step with IWLS proposal is required for the draw of γ_{2,i}.

 B. Draw γ_{1,i} from the normal distribution N(g_{1,i}^[t], G_{1,i}^[t])

(f) Generate a sample for V from the full conditional p(V | γ^[t], ν, S⁻¹):

 Draw V^[t] from the inverse Wishart distribution W⁻¹(ν^[t], S^[t])

Algorithm 2: MCMC algorithm

Details on the sampling steps are given in the following, and the full conditionals are provided for steps (a), (c), (d) and (f). For step (b), it is explained why the full conditional distribution cannot be derived in closed form. Special emphasis is given to the MCMC step (e) of updating the random intercepts γ_i.

(a) Regression parameter β_{1,l} with l = 1, . . . , L_1:

The full conditional of the regression parameter β_{1,l} is proportional to the following terms of the posterior in Equation (3.10):

p(β_{1,l} | β_{1,−l}, β_2, τ², γ, y) ∝ p(y | β_1, β_2, γ) × p(β_{1,l} | τ²_{1,l})    (3.11)

with β_{1,−l} = (β_{1,1}, . . . , β_{1,l−1}, β_{1,l+1}, . . . , β_{1,L_1}) denoting the regression coefficients in the model for μ without the lth coefficient.

Let Σ = diag(σ²(x_{1,1}, β_2, γ_{2,1}), . . . , σ²(x_{1,n_1}, β_2, γ_{2,1}), . . . , σ²(x_{n,n_n}, β_2, γ_{2,n})) be the diagonal matrix of the variances of all observations. Further, let Z_{1,l} denote the design matrix of the lth parameter β_{1,l} and let η_{1,−l} = η_1 − Z_{1,l}β_{1,l} be the vector of the linear predictors of parameter μ without the lth component. The likelihood as a function of β_{1,l} is a multivariate normal distribution with mean (Z'_{1,l}Σ⁻¹Z_{1,l})⁻¹ Z'_{1,l}Σ⁻¹(y − η_{1,−l}) and variance (Z'_{1,l}Σ⁻¹Z_{1,l})⁻¹. The prior distribution as described in Section 3.2.4 is also proportional to the kernel of a (multivariate) normal distribution. Therefore, as a (multivariate) normal distribution is a conjugate prior to a multivariate normal likelihood, the full conditional distribution is again a (multivariate) normal N(m_{1,l}, S_{1,l})

with mean m_{1,l} = ( Z'_{1,l}Σ⁻¹Z_{1,l} + (1/τ²_{1,l}) K_{1,l} )⁻¹ Z'_{1,l}Σ⁻¹(y − η_{1,−l})

and covariance matrix S_{1,l} = ( Z'_{1,l}Σ⁻¹Z_{1,l} + (1/τ²_{1,l}) K_{1,l} )⁻¹

Instead of sampling the regression coefficients of β_1 = (β_{1,1}, . . . , β_{1,L_1}) separately, a sample for β_1 can also be drawn from the joint full conditional normal p(β_1 | β_2, τ², γ, y).
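A hedged Python sketch of this joint Gibbs update (flat prior, i.e. K = 0, and simulated inputs; not the thesis's implementation): since Σ is diagonal, the moments m and S reduce to weighted least squares quantities.

    import numpy as np

    rng = np.random.default_rng(7)
    n, L = 300, 3
    Z1 = np.column_stack([np.ones(n), rng.normal(size=(n, L - 1))])  # design matrix Z_1
    sig2 = np.exp(rng.normal(scale=0.3, size=n))  # current sigma^2 of each observation
    y = Z1 @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=np.sqrt(sig2))

    w = 1.0 / sig2                            # Sigma is diagonal: invert elementwise
    P = Z1.T @ (w[:, None] * Z1)              # posterior precision (flat prior, K = 0)
    S = np.linalg.inv(P)                      # covariance matrix S
    m = S @ (Z1.T @ (w * y))                  # mean m (eta_{1,-l} = 0 in this sketch)
    beta1 = rng.multivariate_normal(m, S)     # one Gibbs draw for beta_1
    print(beta1)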


(b) Regression parameter β_{2,l} with l = 1, . . . , L_2:

The full conditional of the regression parameter β_{2,l} is proportional to the likelihood and its prior:

p(β_{2,l} | β_1, β_{2,−l}, τ², γ, y) ∝ p(y | β_1, β_2, γ) × p(β_{2,l} | τ²_{2,l})    (3.12)

with β_{2,−l} = (β_{2,1}, . . . , β_{2,l−1}, β_{2,l+1}, . . . , β_{2,L_2}) denoting the regression coefficients in the model for σ² without the lth coefficient. The prior p(β_{2,l} | τ²_{2,l}) is proportional to a (multivariate) normal density, but the likelihood as a function of β_{2,l} cannot be determined as a known distribution. Therefore, the full conditional distribution cannot be derived in closed form and instead an MH-step with IWLS proposal is used to generate a sample for β_{2,l}.

(c) Variances τ²_{1,l} with l = 1, . . . , L_1:

The full conditional of the prior variance τ²_{1,l} is proportional to the following terms:

p(τ²_{1,l} | β_{1,l}, a_{1,l}, b_{1,l}) ∝ p(β_{1,l} | τ²_{1,l}) × p(τ²_{1,l} | a_{1,l}, b_{1,l})    (3.13)

The prior p(β_{1,l} | τ²_{1,l}) is proportional to a (multivariate) normal density with variance τ²_{1,l}, while the hyperprior of τ²_{1,l} is defined as an inverse gamma distribution. Hence, the full conditional is again an inverse gamma distribution IG(a*_{1,l}, b*_{1,l}), as an inverse gamma distribution is a conjugate prior for the variance of a normal distribution. The updated parameters are

a*_{1,l} = a_{1,l} + rk(K_{1,l})/2    and    b*_{1,l} = b_{1,l} + (1/2) β'_{1,l} K_{1,l} β_{1,l}

(d) Variances τ²_{2,l} with l = 1, . . . , L_2:

In the posterior, the prior variance τ²_{2,l} is contained only in the prior distribution of β_{2,l} and in its hyperprior distribution. Hence, the full conditional of τ²_{2,l} is proportional to:

p(τ²_{2,l} | β_{2,l}, a_{2,l}, b_{2,l}) ∝ p(β_{2,l} | τ²_{2,l}) × p(τ²_{2,l} | a_{2,l}, b_{2,l})    (3.14)

The full conditional distribution of τ²_{2,l} is again an inverse gamma distribution because the prior p(β_{2,l} | τ²_{2,l}) is proportional to a (multivariate) normal density and the hyperprior is an inverse gamma distribution. The updated parameters for the inverse gamma distribution are

a*_{2,l} = a_{2,l} + rk(K_{2,l})/2    and    b*_{2,l} = b_{2,l} + (1/2) β'_{2,l} K_{2,l} β_{2,l}

(e) Correlated random intercepts γ_i for all i = 1, . . . , n:

The full conditional of the random intercepts γ_i of individual i is proportional to the product of the likelihood contributions of subject i and its prior distribution:

p(γ_i | β_1, β_2, V, y_i) ∝ Π_{j=1}^{n_i} p(y_{i,j} | β_1, β_2, γ_i) × p(γ_i | V)    (3.15)

It is not possible to derive the two-dimensional full conditional distribution of γ_i in closed form because the bivariate normal prior p(γ_i | V) is not conjugate to the likelihood. Hence, samples for γ_i cannot be directly drawn from the full conditional, but different approaches can be used:

Option 1:

The first approach is to apply an MH-step with a bivariate IWLS proposal to generate samples for γ_i.

Option 2:

An alternative to the MH-step is to use the univariate full conditionals p(γ_{1,i} | β_1, β_2, γ_{2,i}, V, y_i) and p(γ_{2,i} | β_1, β_2, γ_{1,i}, V, y_i) to generate a sample for γ_i = (γ_{1,i}, γ_{2,i})'. This is advantageous in two ways: firstly, the full conditional distribution of γ_{1,i} can be derived in closed form, and secondly, although the full conditional distribution of γ_{2,i} is not of closed form and an MH-step must be applied, the acceptance probability of a univariate IWLS proposal is usually higher than that of a bivariate proposal. Details on the full conditional distributions of γ_{1,i} and γ_{2,i} are provided in the following:

• Random intercept γ_{1,i} for all i = 1, . . . , n:

The full conditional of γ_{1,i} is proportional to the following terms of the posterior distribution:

p(γ_{1,i} | β_1, β_2, γ_{2,i}, V, y_i) ∝ Π_{j=1}^{n_i} p(y_{i,j} | β_1, β_2, γ_i) × p(γ_{1,i} | V, γ_{2,i})    (3.16)

Let η_{1,−i} = (η_{1,−i,1}, . . . , η_{1,−i,j}, . . . , η_{1,−i,n_i})' = η_{1,i} − v_{1,i}γ_{1,i} be the vector of the linear predictors of all observations of individual i for parameter μ without the ith random intercept. The likelihood component p(y_{i,j} | β_1, β_2, γ_i) as a function only of γ_{1,i} is the density of a normal with mean y_{i,j} − η_{1,−i,j} and variance σ²(x_{i,j}, β_2, γ_{2,i}). As the product of normal densities is again a normal density, the product Π_{j=1}^{n_i} p(y_{i,j} | β_1, β_2, γ_i) is as well the density function of a normal distribution with mean (v'_{1,i}Σ⁻¹_i v_{1,i})⁻¹ (v'_{1,i}Σ⁻¹_i (y_i − η_{1,−i})) and variance (v'_{1,i}Σ⁻¹_i v_{1,i})⁻¹, where Σ_i = diag(σ²(x_{i,1}, β_2, γ_{2,i}), . . . , σ²(x_{i,n_i}, β_2, γ_{2,i})) denotes a diagonal matrix containing the variances of all observations from individual i. In this example, the design matrix of the random effects v_{1,i} is a vector of length n_i containing only ones, as only random intercepts are used. Hence, the formulas for the mean and the variance could be further simplified. However, for completeness, the general notation is kept.

The density p(γ_{1,i} | γ_{2,i}, V) of Equation (3.16) is the conditional density of the bivariate normal prior and therefore again normal:

γ_{1,i} | γ_{2,i}, V ∼ N( (τ_{1,2,γ}/τ²_{2,γ}) γ_{2,i}, τ²_{1,γ} − τ²_{1,2,γ}/τ²_{2,γ} )    (3.17)
              = N( ρ_{1,2} (τ_{1,γ}/τ_{2,γ}) γ_{2,i}, (1 − ρ²_{1,2}) τ²_{1,γ} )    (3.18)

with ρ_{1,2} = τ_{1,2,γ} / (τ_{1,γ} · τ_{2,γ}) denoting the correlation coefficient between the random intercepts.

In summary, both components of Equation (3.16) are normal and hence the full conditional p(γ_{1,i} | β_1, β_2, γ_{2,i}, V, y_i) is as well the density function of a normal distribution N(g_{1,i}, G_{1,i})

with mean g_{1,i} = ( v'_{1,i}Σ⁻¹_i v_{1,i} + 1/((1 − ρ²_{1,2})τ²_{1,γ}) )⁻¹ ( v'_{1,i}Σ⁻¹_i (y_i − η_{1,−i}) + ρ_{1,2}γ_{2,i} / (τ_{2,γ}(1 − ρ²_{1,2})τ_{1,γ}) )

and covariance matrix G_{1,i} = ( v'_{1,i}Σ⁻¹_i v_{1,i} + 1/((1 − ρ²_{1,2})τ²_{1,γ}) )⁻¹

(A numerical sketch of this update is given after the description of the three options.)

• Random intercept γ_{2,i} for all i = 1, . . . , n:

The full conditional of γ_{2,i} is proportional to the following terms of the posterior:

p(γ_{2,i} | β_1, β_2, γ_{1,i}, V, y_i) ∝ Π_{j=1}^{n_i} p(y_{i,j} | β_1, β_2, γ_i) × p(γ_{2,i} | V, γ_{1,i})    (3.19)

In this case, the full conditional cannot be derived in closed form because the likelihood contribution cannot be determined as a known distribution in γ_{2,i}. Thus, an MH-step with a univariate IWLS proposal is required.

Option 3:

The third approach that can be considered to generate a sample for γ_i = (γ_{1,i}, γ_{2,i})' is a slightly modified version of the second approach. Thereby, a draw of γ_{2,i} is generated from the marginal p(γ_{2,i} | β_1, β_2, V, y_i) instead of the full conditional. An MH-step still needs to be applied for this sampling step, but compared to the second approach the distribution does not depend on γ_{1,i}. After the draw for γ_{2,i} has been generated, a draw for γ_{1,i} can be generated from its full conditional:

• Random intercept γ_{2,i} for all i = 1, . . . , n:

In order to obtain the marginal distribution, the full conditional of γ_i from Equation (3.15) is re-written as

p(γ_i | β_1, β_2, V, y_i) ∝ Π_{j=1}^{n_i} p(y_{i,j} | β_1, β_2, γ_i) × p(γ_{1,i} | γ_{2,i}, V) × p(γ_{2,i} | τ²_{2,γ})    (3.20)

with the second term denoting the conditional prior distribution of γ_{1,i} and the third term the marginal prior of γ_{2,i}.

The marginal of γ_{2,i} can then be obtained by taking the integral w.r.t. γ_{1,i}:

p(γ_{2,i} | β_1, β_2, V, y_i) = ∫ p(γ_i | β_1, β_2, V, y_i) dγ_{1,i}
  ∝ ∫ Π_{j=1}^{n_i} p(y_{i,j} | β_1, β_2, γ_i) × p(γ_{1,i} | γ_{2,i}, V) × p(γ_{2,i} | τ²_{2,γ}) dγ_{1,i}
  ∝ p(γ_{2,i} | τ²_{2,γ}) ∫ p(y_i | β_1, β_2, γ_i) × p(γ_{1,i} | γ_{2,i}, V) dγ_{1,i}
  ∝ p(γ_{2,i} | τ²_{2,γ}) p(y_i | β_1, β_2, γ_{2,i})    (3.21)

The marginal prior p(γ_{2,i} | τ²_{2,γ}) is independent of γ_{1,i} and can therefore be moved outside the integral. By definition, the product of the likelihood contributions of all observations of individual i is the likelihood of y_i, which is a multivariate normal distribution with mean η_{1,i} = Z_{1,i}β_1 + v_{1,i}γ_{1,i} and covariance Σ_i. As the conditional prior distribution of γ_{1,i} is also normal (compare Equation (3.18)), the density p(y_i | β_1, β_2, γ_{2,i}) is the density function of a normal distribution N(m_i, M_i) with mean

m_i = Z_{1,i}β_1 + v_{1,i} (ρ_{1,2} (τ_{1,γ}/τ_{2,γ}) γ_{2,i})

and covariance matrix

M_i = Σ_i + v_{1,i} ((1 − ρ²_{1,2}) τ²_{1,γ}) v'_{1,i}

For the derivation of the marginal posterior distribution of γ_{2,i}, the marginal prior and the density p(y_i | β_1, β_2, γ_{2,i}) have to be considered as functions in γ_{2,i}. While the marginal prior distribution is a normal distribution with mean 0 and variance τ²_{2,γ}, p(y_i | β_1, β_2, γ_{2,i}) cannot be formulated as a known density in γ_{2,i}. Hence, the marginal posterior distribution of γ_{2,i} cannot be derived in closed form and an MH-step must be applied.

• Random intercept γ_{1,i} for all i = 1, . . . , n:

It was shown in Option 2 that the full conditional p(γ_{1,i} | β_1, β_2, γ_{2,i}, V, y_i) is of closed form. Hence, after generating a sample for γ_{2,i}, a draw for γ_{1,i} can directly be sampled from its full conditional normal distribution N(g_{1,i}, G_{1,i}).
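The closed-form update of γ_{1,i} used in Options 2 and 3 can be sketched numerically in Python (illustrative values, not thesis code, as announced above): since v_{1,i} is a vector of ones, g_{1,i} and G_{1,i} are scalars computed as a precision-weighted combination of the data and the conditional prior of Equation (3.18).

    import numpy as np

    rng = np.random.default_rng(8)
    ni = 8                                    # observations of individual i
    res = rng.normal(size=ni)                 # y_i - eta_{1,-i}
    sig2_i = np.full(ni, 0.8)                 # diagonal of Sigma_i
    tau1, tau2, rho = 0.5, 0.7, 0.4           # tau_{1,gamma}, tau_{2,gamma}, rho_{1,2}
    gamma2 = 0.3                              # current draw of gamma_{2,i}

    prior_var = (1 - rho ** 2) * tau1 ** 2    # conditional prior variance, Eq. (3.18)
    prior_mean = rho * (tau1 / tau2) * gamma2 # conditional prior mean, Eq. (3.18)

    G = 1.0 / (np.sum(1.0 / sig2_i) + 1.0 / prior_var)       # G_{1,i}
    g = G * (np.sum(res / sig2_i) + prior_mean / prior_var)  # g_{1,i}
    gamma1 = rng.normal(g, np.sqrt(G))        # Gibbs draw from N(g_{1,i}, G_{1,i})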

(f) Covariance matrix of the random intercepts V:

In the posterior, the prior covariance matrix V is only contained in the prior of γ_i and in its hyperprior. Hence, the full conditional is proportional to the following terms:

p(V | ·) ∝ Π_{i=1}^{n} p(γ_i | V) × p(V | ν, S⁻¹)    (3.22)

where the prior p(γ_i | V) is the density function of a bivariate normal and the hyperprior is defined as an inverse Wishart distribution. The inverse Wishart distribution is a conjugate prior for the covariance matrix of a multivariate, or in our case bivariate, normal distribution. Due to this, the full conditional is also an inverse Wishart distribution W⁻¹(ν*, S*) with updated parameters

ν* = ν + n    and    S* = S + S_γ    with S_γ = Σ_{i=1}^{n} γ_i γ'_i
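A hedged sketch of this Gibbs step using scipy (which parameterizes the inverse Wishart by degrees of freedom df and a scale matrix; all hyperparameter values here are illustrative):

    import numpy as np
    from scipy.stats import invwishart

    rng = np.random.default_rng(9)
    n = 100
    V_true = np.array([[0.4, 0.15], [0.15, 0.6]])
    gamma = rng.multivariate_normal(np.zeros(2), V_true, size=n)  # current draws gamma_i

    nu, S = 4.0, np.eye(2)                    # hyperparameters of the inverse Wishart
    S_gamma = gamma.T @ gamma                 # S_gamma = sum_i gamma_i gamma_i'
    V = invwishart.rvs(df=nu + n, scale=S + S_gamma, random_state=rng)
    print(V)                                  # one Gibbs draw for V, close to V_true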


Chapter 4

Simulation Study

In this chapter, the effects of a violation of the independence assumption between the random intercepts in the models for different distributional parameters are investigated. To examine this issue, a simulation study was conducted. The simulations were based on data sets with varying sample sizes and different correlation structures between the random intercepts. For the response, the zero-adjusted gamma (ZAGA) distribution was used. This distribution was chosen as it allows random intercepts to be included in the linear predictors of three different parameters and thus various covariance structures between them to be considered in the data generating process (DGP). Bayesian distributional regression models assuming independence between the random intercepts were fitted to the generated data sets to investigate the effects of the misspecification of the correlation structure w.r.t. the sample sizes.

The chapter is organised as follows: In Section 4.1, the data generating process is described. Section 4.2 outlines the simulation setup. In particular, the generated data sets and the different types of models that were fitted to the data are explained. The technical set-up of the simulation study follows in Section 4.3. Finally, the simulation results are described and interpreted in Section 4.4.

4.1 Data Generating Process

For the simulation study, the response variable is generated from the three-parametric ZAGA(μ, σ, ν) distribution. The pdf of the ZAGA is defined as

f(y | μ, σ, ν) = ν    if y = 0
f(y | μ, σ, ν) = (1 − ν) · 1/((μσ²)^{1/σ²} Γ(1/σ²)) · y^{1/σ²−1} exp(−y/(μσ²))    if y > 0    (4.1)

with μ > 0, σ > 0 and 0 ≤ ν ≤ 1.

The ZAGA distribution consists of two components: A point mass at zero and a weighted gamma distribution. Therefore, it is a distribution that can be used to model data that contain observations that are zero or positive continuous. An analysis based on real-world data using the ZAGA as response distribution will be given in Chapter 5. The linear predictor of each parameter of the response variable is assumed to include one continuous covariate x and a random intercept. Given the domain restrictions of the parameters, a logit link is applied for the parameter ν and a log link function is used for the parameters µ and σ. The

The data generating process for an observation j of individual i is given by

log(μ(x_{i,j}, β_1, γ_{1,i})) = 2 + 0.2 x_{i,j} + 0.25 sin(x_{i,j}) + γ_{1,i}
log(σ(x_{i,j}, β_2, γ_{2,i})) = −1 − 0.05 (x_{i,j} − 5)² + γ_{2,i}
log( ν(x_{i,j}, β_3, γ_{3,i}) / (1 − ν(x_{i,j}, β_3, γ_{3,i})) ) = −0.5 − 0.25 x_{i,j} + γ_{3,i}    (4.2)

where the first index of the random intercepts $\gamma_{k,i}$ indicates the distributional parameter, with $k = 1$ for $\mu$, $k = 2$ for $\sigma$ and $k = 3$ for $\nu$.

The random intercepts are generated from a three-dimensional normal distribution

$$\begin{pmatrix} \gamma_{1,i} \\ \gamma_{2,i} \\ \gamma_{3,i} \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \Sigma\right) \qquad (4.3)$$

where $\Sigma$ is a positive definite covariance matrix that varies across the different data sets. The matrix $P$ denotes the correlation matrix corresponding to $\Sigma$.

4.2

Simulation Set-up

4.2.1

Generated Data Sets

To address the research question in multiple scenarios, twelve data sets were generated, differing in their sample size set-up and in the correlation structure of the random intercepts. The total number of observations $N$ was kept the same for all data sets, but the number of individuals and the number of observations per individual were varied. The following settings were used:

• Sample size type 1: $n = 50$ individuals with $n_i = 40$ observations each ($N = 2000$)
• Sample size type 2: $n = 100$ individuals with $n_i = 20$ observations each ($N = 2000$)
• Sample size type 3: $n = 250$ individuals with $n_i = 8$ observations each ($N = 2000$)

As the main interest is to examine the effects of a violation of the independence assumption, four different correlation structures for the random intercepts were considered: independence ($P_1$), weak correlation ($P_2$), medium correlation ($P_3$) and strong correlation ($P_4$):

$$P_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad
P_2 = \begin{pmatrix} 1 & -0.1 & 0.1 \\ -0.1 & 1 & 0.1 \\ 0.1 & 0.1 & 1 \end{pmatrix} \quad
P_3 = \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} \quad
P_4 = \begin{pmatrix} 1 & 0.9 & -0.9 \\ 0.9 & 1 & -0.9 \\ -0.9 & -0.9 & 1 \end{pmatrix}$$

In all settings, the variances of the random intercepts were fixed to

$$\operatorname{Var}(\gamma_{1,i}) = 0.1 \qquad \operatorname{Var}(\gamma_{2,i}) = 0.5 \qquad \operatorname{Var}(\gamma_{3,i}) = 0.25 \qquad (4.4)$$
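The covariance matrices $\Sigma$ used in Equation (4.3) then follow from the correlation matrices and these fixed variances via $\Sigma = D\,P\,D$, where $D$ is the diagonal matrix of standard deviations. A short R illustration (variable names chosen for this example):

# Construct Sigma from a correlation matrix P and the variances in (4.4):
vars  <- c(0.1, 0.5, 0.25)            # Var(gamma_1), Var(gamma_2), Var(gamma_3)
D     <- diag(sqrt(vars))             # diagonal matrix of standard deviations
P3    <- matrix(c(1,   0.5, 0.5,
                  0.5, 1,   0.5,
                  0.5, 0.5, 1), nrow = 3)   # medium correlation structure
Sigma <- D %*% P3 %*% D               # covariance matrix for Equation (4.3)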

One data set was generated for each combination of the three sample size types and the four covariance matrices resulting in twelve different data sets. For each data set, the values


of the covariate $x$ were drawn from a uniform distribution $U[0, 10]$. The exact procedure for generating a data set for a given sample size type and covariance matrix $\Sigma$ is as follows (an R sketch is given after the list):

1. Set a seed to guarantee reproducibility.
2. Draw $N$ values for $x$ from $U[0, 10]$.
3. Draw $n$ values for the random intercepts from $N(0, \Sigma)$.
4. Determine the values of the model parameters $\mu$, $\sigma$ and $\nu$ for each observation by computing the linear predictors from Equation (4.2) and applying the inverse link functions.
5. Generate $N$ response values from the ZAGA$(\mu, \sigma, \nu)$ distribution.

The list of the seeds for all twelve data sets can be found in Appendix A.3.
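A compact R sketch of these five steps, shown here for sample size type 2 and the covariance matrix Sigma constructed above, could look as follows. MASS::mvrnorm is used for step 3, and the ZAGA draw in step 5 combines a Bernoulli zero indicator with a gamma draw (shape $1/\sigma^2$, scale $\mu\sigma^2$); the seed and variable names are illustrative, not those used for the thesis data sets.

library(MASS)                                  # for mvrnorm()

set.seed(1)                                    # step 1 (illustrative seed)
n <- 100; n_i <- 20; N <- n * n_i              # sample size type 2
id <- rep(seq_len(n), each = n_i)              # individual index per observation

x <- runif(N, 0, 10)                           # step 2

gamma_mat <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)   # step 3

# Step 4: linear predictors from Equation (4.2) and inverse link functions
mu    <- exp(2 + 0.2 * x + 0.25 * sin(x) + gamma_mat[id, 1])
sigma <- exp(-1 - 0.05 * (x - 5)^2 + gamma_mat[id, 2])
nu    <- plogis(-0.5 - 0.25 * x + gamma_mat[id, 3])      # inverse logit

# Step 5: zero with probability nu, gamma draw otherwise
is_zero <- rbinom(N, size = 1, prob = nu)
y <- ifelse(is_zero == 1, 0,
            rgamma(N, shape = 1 / sigma^2, scale = mu * sigma^2))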

To illustrate the variation due to the individual-specific random intercepts, Figure 4.1 shows the response functions with the random intercept set to the 0.01, 0.1, 0.25, 0.5, 0.75, 0.9 and 0.95 quantiles of the underlying normal distribution. The normal distributions of the random intercepts are centered at zero (cf. Equation (4.3)); hence, the 0.5 quantile is equal to zero, and the solid black line in each plot shows the pure effect of $x$ on the parameter. In the following, this line will be referred to as the "reference line". Due to the non-linearity of the link functions, the effects with random intercepts set to the $\alpha$ and $1 - \alpha$ quantiles of the underlying normal distribution are not symmetric with respect to the reference line. Note that the dependence structure of the random intercepts is not relevant in these plots, as only univariate response functions are shown.

Figure 4.1: Effect of $x$ on the parameters $\mu$ (left), $\sigma$ (middle) and $\nu$ (right), grouped by the random intercept set to the 0.01, 0.1, 0.25, 0.5, 0.75, 0.9 and 0.95 quantiles of the underlying normal distribution; the legend denotes the concrete values of the underlying random intercepts.
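The asymmetry mentioned above can be verified numerically: for $\mu$, the log link maps symmetric quantiles $\pm q$ of $\gamma_{1,i}$ to the multiplicative factors $e^{\pm q}$, which are not equidistant from 1. A small illustrative check in R, using $\operatorname{Var}(\gamma_{1,i}) = 0.1$:

# Symmetric quantiles of gamma_1 map to asymmetric multiplicative effects on mu
q <- qnorm(c(0.1, 0.9), mean = 0, sd = sqrt(0.1))  # approx. -0.405 and 0.405
exp(q)       # approx. 0.667 and 1.500 -- not symmetric around 1
exp(q) - 1   # approx. -0.333 vs. +0.500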

4.2.2

Fitted Models

As the goal of this simulation study is to investigate the impact of a violation of the inde-pendence assumption between the random intercepts on the model fit, the following aspects are discussed:


1. Effect of the absence of random intercepts on the model fitting (Section 4.4.1)
2. Effects of the misspecification of the random intercepts' correlation structure (Section 4.4.2)

• Effects on the estimation of the covariate effects in the different scenarios (Section 4.4.2.1)

• Effects on the estimation of the random intercepts in the different scenarios (Section 4.4.2.2)

To investigate how the absence of random intercepts affects the estimation results, three different models were fitted to each data set: A full model where a random intercept is included in each linear predictor, a partial model where a random intercept is only included for µ and a fixed effects model where no random intercept is included. In all models, an overall intercept and a p-spline for the effect of x are included in the linear predictors of each distributional parameter.

To examine the effect of the misspecification of the random intercepts' correlation structure, the estimates of the full models are evaluated and compared.

4.3

Technical Set-up

4.3.1

Bayesian Distributional Regression in BayesX

To fit the Bayesian distributional regression models to the data, the implementation in BayesX (Belitz et al. (2015a)) was used. BayesX was initially introduced as a stand-alone software tool for Bayesian inference in structured additive regression (STAR) models, covering many regression types such as generalized additive models (GAM, Hastie and Tibshirani (1990)), generalized additive mixed models (GAMM, Lin and Zhang (1999)), geoadditive models (GGAMM, Kammann and Wand (2003)) and many more (Belitz et al. (2015b)). The software has been extended to non-standard regression models such as models for categorical responses, quantile regression, etc. The Bayesian distributional regression model class was finally integrated with release 3.0. The computational kernel of BayesX is written in C++, and as long as a GNU C++ compiler is available on the system, the command line version can be used on any operating system. For Windows, a second option is available that includes a graphical user interface (GUI) implemented in Java. The current version of BayesX can be downloaded from www.bayesx.org, including three manuals that explain how to install and use BayesX (Belitz et al. (2015a), Belitz et al. (2015b), Belitz et al. (2015c), Belitz et al. (2015d)).

Different choices for the response distribution D are available in BayesX (for details, see Tables 10.6 to 10.9 in the reference manual of BayesX (Belitz et al. (2015b))). These cover a wide range of continuous distributions on R and R+, discrete distributions, mixed discrete-continuous distributions, distributions with compact support, and even multivariate distributions. Further, various effect types can be used for the covariates: linear effects, non-linear effects modeled by P-splines, spatial effects modeled by Markov random fields, and random intercepts. Additionally, interaction effects such as varying coefficient terms, random slopes, geographically weighted regression, and two-dimensional kriging terms can be modeled. Further options, such as defining the hyperparameters, the use of binning, or defining the degree of the P-splines, are available.


For the simulation study in this thesis, it is essential to note that linear effects are implemented with a flat prior as described in Section 2.2.3. P-splines are modeled with cubic B-spline basis functions, a random walk of order two ($d = 2$), twenty inner knots, and the default values $a_{k,j} = b_{k,j} = 0.001$ for the hyperparameters of the inverse gamma hyperprior. For fast convergence to the stationary distribution, good starting values are required; therefore, a backfitting algorithm is used to approximate the posterior modes of all model coefficients (Belitz et al. (2015b)). As a side note, BayesX sometimes crashed when the sample size or the number of observations per individual was only slightly increased. This behavior could not be explained, as no informative error message was produced, but it potentially occurred due to memory issues or similar.

4.3.2

Technical Aspects of the Simulation Study

The data sets of the simulation study were generated with RStudio (Version 1.2.1335, RStudio Team (2018)) using R version 3.6.0 (R Core Team (2019)) on a MacBook Pro (2.7 GHz Dual-Core Intel Core i5, early 2015). The data sets were saved to the local repository and transferred to an Acer Aspire E5-575G (i5-7200U, GTX 950M), on which the Bayesian distributional regression models were fitted with BayesX. For each model, 208,000 MCMC iterations were run. To obtain practically independent MCMC draws from the posterior, the burn-in phase was set to 8,000 iterations and a thinning factor of 400 was used, resulting in 500 MCMC draws for the estimation of each parameter. A list of the run times can be found in Appendix A.3. Depending on the model specification, between 49 and 80 result files were produced. Finally, the result files were transferred back to the MacBook Pro and further analyzed with RStudio. A template for the BayesX code used to define and fit the models can be found in Appendix A.7.1. A template for the R code written to create the data and analyze the estimation results can be found in Appendix A.8.
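The number of retained draws follows directly from these settings; a one-line R check of this arithmetic:

# Draws kept after burn-in with a thinning factor of 400
(208000 - 8000) / 400   # = 500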

4.4

Simulation Results

In this section, the results of the simulation study are discussed. For each simulation, trace and autocorrelation plots were generated and checked for dependence among the MCMC draws. With the thinning factor of 400, the draws for all parameters are practically independent. As an example, trace and autocorrelation plots for the full model for sample size type 2 with correlation structure P3 are given in Appendix A.4; they show no noticeable correlation between the draws for any parameter. As this also holds for all other models, the plots of the MCMC draws for those models are not included in this thesis.

4.4.1

Effect of the Absence of Random Intercepts on the Model Fitting

Firstly, a closer look is taken at whether and which model estimates are affected by the absence of random intercepts in the model specification. This section serves as a starting point of the analysis to evaluate the general necessity of random intercepts in the models. To investigate the effects of omitting random intercepts, the non-linear effects of $x$ on the parameters are compared across the three different model types. The estimation results for four data sets are displayed in Figure 4.2, where each row shows the effect for one distributional parameter.
