

Part II. Bayesian regularization in the CRR model

9. MCMC inference in the extended CRR model

9.1. Conditional posterior densities based on the full likelihood

Using Bayes' theorem, the joint posterior p(θ | D) is proportional to the product of the model likelihood L(θ | D) and the joint prior density of the model parameters p(θ). Based on the full log-likelihood (7.6) of the extended Cox model with the extended predictor given in (7.3) and the prior (8.14), the posterior has the general form

\[
p\bigl(\alpha, \beta, \gamma, \tau^2_{\alpha}, \tau^2_{\beta}, \rho \mid D\bigr)
\;\propto\; \exp\bigl\{ l(\alpha, \beta, \gamma \mid D) \bigr\}
\cdot \prod_{j=0}^{p_z} p\bigl(\alpha_j \mid \tau^2_{\alpha,j}\bigr)\, p\bigl(\tau^2_{\alpha,j}\bigr)
\cdot p\bigl(\beta \mid \tau^2_{\beta}\bigr)\, p\bigl(\tau^2_{\beta} \mid \rho\bigr)\, p(\rho)\, p(\gamma). \tag{9.1}
\]

9.1.1. Full conditionals of the predictor components

Unregularized linear regression coefficients γ

In the following we derive the general structure of the full conditionals and the proposal distributions for the predictor components ϑ = (α′, β′, γ′)′ in terms of the unpenalized regression coefficients γ, assuming for the moment the prior (8.1), γ | µ_γ, Σ_γ ~ N(µ_γ, Σ_γ), with µ_γ ≠ 0 and Σ_γ^{-1} ≠ 0 to preserve generality. Discarding the factors in the posterior (9.1) that do not depend on γ, the full conditional of the regression parameter γ is given by

\[
p(\gamma \mid \cdot) \;\propto\; \exp\Bigl\{ l(\vartheta \mid D) - \tfrac{1}{2}\,\gamma'\, \Sigma_\gamma^{-1}\, \gamma + \gamma'\, \Sigma_\gamma^{-1}\, \mu_\gamma \Bigr\}. \tag{9.2}
\]

To construct the Gaussian IWLS proposal, we apply a second-order Taylor expansion to the logarithm of the full conditional, f(γ) = log(p(γ | ·)), at the current state of the parameter vector γ^(c), compare Appendix C for details. Differentiating f(γ) with respect to γ gives the score vector

\[
s_\gamma(\gamma) = \frac{\partial\, l(\vartheta \mid D)}{\partial \gamma} - \Sigma_\gamma^{-1}\, \gamma + \Sigma_\gamma^{-1}\, \mu_\gamma \tag{9.3}
\]

and the Hessian matrix

\[
H_\gamma(\gamma) = \frac{\partial^2\, l(\vartheta \mid D)}{\partial \gamma\, \partial \gamma'} - \Sigma_\gamma^{-1}, \tag{9.4}
\]

with the following contributions from the differentiation of the log-likelihood

\[
\frac{\partial\, l(\vartheta \mid D)}{\partial \gamma}
= \sum_{i=1}^{n} d_i\, u_i - \sum_{i=1}^{n} u_i \int_0^{\tilde t_i} \exp(\eta_i(s))\, ds
= \sum_{i=1}^{n} u_i \bigl( d_i - \Lambda_i(\tilde t_i \mid \vartheta) \bigr),
\]

\[
\frac{\partial^2\, l(\vartheta \mid D)}{\partial \gamma\, \partial \gamma'}
= - \sum_{i=1}^{n} u_i\, u_i' \int_0^{\tilde t_i} \exp(\eta_i(s))\, ds
= - \sum_{i=1}^{n} u_i\, u_i'\, \Lambda_i(\tilde t_i \mid \vartheta).
\]

The second-order Taylor expansion of f(γ) around the current state of the parameter vector γ^(c) has the form

\[
f(\gamma) \;\approx\; f\bigl(\gamma^{(c)}\bigr) + \bigl(\gamma - \gamma^{(c)}\bigr)'\, s_\gamma\bigl(\gamma^{(c)}\bigr)
+ \tfrac{1}{2}\, \bigl(\gamma - \gamma^{(c)}\bigr)'\, H_\gamma\bigl(\gamma^{(c)}\bigr)\, \bigl(\gamma - \gamma^{(c)}\bigr), \tag{9.5}
\]

where s_γ(γ^(c)) and H_γ(γ^(c)) denote the score vector (9.3) and the Hessian matrix (9.4) evaluated at the current state γ^(c) and at the current states of the remaining involved parameters ϑ_{−γ} = (α^(c), β^(c)). Taking the exponential of the approximation (9.5) and neglecting the components that do not depend on γ provides the following structure of the proposal density

\[
\varphi\bigl(\gamma \mid \hat\mu_\gamma^{(c)}, \hat\Sigma_\gamma^{(c)}\bigr)
\;\propto\; \exp\Bigl\{ \tfrac{1}{2}\, \gamma'\, H_\gamma\bigl(\gamma^{(c)}\bigr)\, \gamma
+ \gamma' \bigl( s_\gamma\bigl(\gamma^{(c)}\bigr) - H_\gamma\bigl(\gamma^{(c)}\bigr)\, \gamma^{(c)} \bigr) \Bigr\},
\]

which represents the kernel of a multivariate Gaussian density φ(· | µ̂_γ^(c), Σ̂_γ^(c)) with mean vector and covariance matrix

\[
\hat\mu_\gamma^{(c)} = \hat\Sigma_\gamma^{(c)} \bigl( s_\gamma\bigl(\gamma^{(c)}\bigr) - H_\gamma\bigl(\gamma^{(c)}\bigr)\, \gamma^{(c)} \bigr),
\qquad
\hat\Sigma_\gamma^{(c)} = \bigl( - H_\gamma\bigl(\gamma^{(c)}\bigr) \bigr)^{-1}. \tag{9.6}
\]

As already mentioned in the context of the extended AFT model, the reformulation of the mean as µ̂_γ^(c) = γ^(c) − H_γ(γ^(c))^{-1} s_γ(γ^(c)) enables its interpretation as a one-step approximation to the mode of the full conditional, obtained by a single Fisher scoring step from the current state.

Using the notation Λ(t̃ | ϑ) := (Λ_1(t̃_1 | ϑ), ..., Λ_n(t̃_n | ϑ))′ for the vector of cumulative hazards evaluated at the observed survival times t̃_i and d := (d_1, ..., d_n)′ for the vector of censoring indicators, the score vector has the compact form

\[
s_\gamma\bigl(\gamma^{(c)}\bigr) = U' \bigl( d - \Lambda\bigl(\tilde t \mid \gamma^{(c)}, \vartheta_{-\gamma}\bigr) \bigr)
- \Sigma_\gamma^{-1}\, \gamma^{(c)} + \Sigma_\gamma^{-1}\, \mu_\gamma \tag{9.7}
\]

and the Hessian matrix reads

\[
H_\gamma\bigl(\gamma^{(c)}\bigr) = - U'\, \operatorname{diag}\bigl( \Lambda\bigl(\tilde t \mid \gamma^{(c)}, \vartheta_{-\gamma}\bigr) \bigr)\, U - \Sigma_\gamma^{-1}. \tag{9.8}
\]

Finally, a new proposed value γ^(p) of the Markov chain is obtained by drawing a random number from the Gaussian distribution with density φ(· | µ̂_γ^(c), Σ̂_γ^(c)). The new state is accepted with probability

\[
p_{\text{accept}}\bigl(\gamma^{(p)}, \gamma^{(c)}\bigr)
= \min\left\{ 1,\;
\frac{ p\bigl(\gamma^{(p)} \mid \cdot\bigr)\; \varphi\bigl(\gamma^{(c)} \mid \hat\mu_\gamma^{(p)}, \hat\Sigma_\gamma^{(p)}\bigr) }
     { p\bigl(\gamma^{(c)} \mid \cdot\bigr)\; \varphi\bigl(\gamma^{(p)} \mid \hat\mu_\gamma^{(c)}, \hat\Sigma_\gamma^{(c)}\bigr) } \right\},
\]

where p(γ^(p) | ·) and p(γ^(c) | ·) denote the evaluations of the full conditional (9.2) at the proposed state γ^(p) and the current state γ^(c), with respect to the current states of the remaining involved model parameters ϑ_{−γ}. The mean vector µ̂_γ^(p) and the covariance matrix Σ̂_γ^(p), appearing in the acceptance probability, are computed by evaluating the expressions in (9.6) at the proposed value γ^(p), keeping the remaining parameters ϑ_{−γ} fixed at their current states.

In particular, since we assume a flat Gaussian prior for the unregularized effects γ, which corresponds to the limiting case µ_γ = 0 and Σ_γ^{-1} = 0, the mean and covariance matrix of the Gaussian proposal for the unregularized effects γ are given as

\[
\begin{aligned}
\hat\mu_\gamma^{(c)} &= \hat\Sigma_\gamma^{(c)}\, U' \Bigl( d - \Lambda\bigl(\tilde t \mid \gamma^{(c)}, \vartheta_{-\gamma}\bigr)
+ \operatorname{diag}\bigl( \Lambda\bigl(\tilde t \mid \gamma^{(c)}, \vartheta_{-\gamma}\bigr) \bigr)\, U\, \gamma^{(c)} \Bigr), \\
\hat\Sigma_\gamma^{(c)} &= \Bigl( U'\, \operatorname{diag}\bigl( \Lambda\bigl(\tilde t \mid \gamma^{(c)}, \vartheta_{-\gamma}\bigr) \bigr)\, U \Bigr)^{-1}.
\end{aligned} \tag{9.9}
\]
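To make the construction concrete, the following sketch shows how one MH update of γ with the IWLS proposal (9.9) could be implemented. It is only an illustration in Python/NumPy and not part of the original algorithmic description; the helper functions cum_hazard (returning the vector of cumulative hazards Λ_i(t̃_i | γ, ϑ_{−γ}) for the current values of the remaining parameters) and log_full_cond (evaluating the log of the full conditional (9.2) up to a constant) are assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def iwls_moments(U, d, Lam, gamma_state):
    """Mean and covariance of the Gaussian IWLS proposal (9.9), flat-prior case."""
    prec = U.T @ (Lam[:, None] * U)                    # U' diag(Lambda) U
    cov = np.linalg.inv(prec)                          # Sigma_hat
    mean = cov @ (U.T @ (d - Lam + Lam * (U @ gamma_state)))
    return mean, cov

def mh_iwls_update_gamma(U, d, gamma_c, cum_hazard, log_full_cond, rng):
    """One MH step for gamma: draw from the IWLS Gaussian proposal, then accept or reject."""
    mean_c, cov_c = iwls_moments(U, d, cum_hazard(gamma_c), gamma_c)
    gamma_p = rng.multivariate_normal(mean_c, cov_c)                   # proposed state
    mean_p, cov_p = iwls_moments(U, d, cum_hazard(gamma_p), gamma_p)   # reverse-move moments
    log_acc = (log_full_cond(gamma_p) - log_full_cond(gamma_c)
               + multivariate_normal.logpdf(gamma_c, mean_p, cov_p)
               - multivariate_normal.logpdf(gamma_p, mean_c, cov_c))
    return gamma_p if np.log(rng.uniform()) < log_acc else gamma_c

As in the text, the reverse-move moments µ̂_γ^(p) and Σ̂_γ^(p) are recomputed at the proposed value while all remaining parameters stay fixed at their current states.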

Regularized linear regression coefficients β

For the remaining regression coefficients α and β, the IWLS proposal densities can be constructed conceptually in the same way as above for γ. We obtain the corresponding expressions for the mean and the covariance matrix of the Gaussian proposal by replacing the design matrix U, the precision matrix Σ_γ^{-1} and the mean µ_γ in (9.7) and (9.8) with the associated quantities from the priors of the regularized linear effects and the smooth effects. Proceeding as before, using the conditional Gaussian prior β | τ²_β ~ N(0, Σ_β) from (8.2), the full conditional of β is given as

\[
p(\beta \mid \cdot) \;\propto\; \exp\Bigl\{ l(\vartheta \mid D) - \tfrac{1}{2}\, \beta'\, \Sigma_\beta^{-1}\, \beta \Bigr\}. \tag{9.10}
\]

Given the current state β^(c) and the current states of the remaining regression coefficients ϑ_{−β} = (α^(c), γ^(c)), proposals are drawn from a Gaussian density with mean vector and covariance matrix

\[
\begin{aligned}
\hat\mu_\beta^{(c)} &= \hat\Sigma_\beta^{(c)}\, X' \Bigl( d - \Lambda\bigl(\tilde t \mid \beta^{(c)}, \vartheta_{-\beta}\bigr)
+ \operatorname{diag}\bigl( \Lambda\bigl(\tilde t \mid \beta^{(c)}, \vartheta_{-\beta}\bigr) \bigr)\, X\, \beta^{(c)} \Bigr), \\
\hat\Sigma_\beta^{(c)} &= \Bigl( X'\, \operatorname{diag}\bigl( \Lambda\bigl(\tilde t \mid \beta^{(c)}, \vartheta_{-\beta}\bigr) \bigr)\, X + \Sigma_\beta^{-1} \Bigr)^{-1}.
\end{aligned} \tag{9.11}
\]
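Conceptually, the only change compared to the flat-prior case is the additional prior precision (and, where present, prior mean) in the proposal moments. A small sketch of this generalization (illustrative Python only, reusing the conventions of the previous code block; it covers (9.11) with prior precision Σ_β^{-1} and prior mean zero, and equally the proposal (9.13) for the basis coefficients α_j further below with prior precision τ_{α,j}^{-2} K_j):

import numpy as np

def iwls_moments_penalized(X, d, Lam, theta_c, prior_prec, prior_mean=None):
    """IWLS proposal moments for a penalized coefficient block, cf. (9.11)."""
    prec = X.T @ (Lam[:, None] * X) + prior_prec      # X' diag(Lambda) X + prior precision
    cov = np.linalg.inv(prec)
    b = X.T @ (d - Lam + Lam * (X @ theta_c))         # likelihood contribution
    if prior_mean is not None:
        b = b + prior_prec @ prior_mean               # vanishes for a zero prior mean
    return cov @ b, cov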

Regularized nonlinear regression coefficients αj

Using the conditional Gaussian priors α_j | τ²_{α,j} ~ N(0, Σ_{α_j}) from Chapter 8, the full conditionals of the basis function weights α_j are

\[
p(\alpha_j \mid \cdot) \;\propto\; \exp\Bigl\{ l(\vartheta \mid D) - \frac{1}{2\tau^2_{\alpha,j}}\, \alpha_j'\, K_j\, \alpha_j \Bigr\}, \qquad j = 0, 1, \dots, p_z. \tag{9.12}
\]

Given the current state α_j^(c) and the current states of the remaining regression coefficients ϑ_{−α_j} = (α_0^(c), ..., α_{j−1}^(c), α_{j+1}^(c), ..., α_{p_z}^(c), γ^(c), β^(c)), proposals are drawn from a Gaussian distribution with mean vector and covariance matrix

\[
\begin{aligned}
\hat\mu_{\alpha_j}^{(c)} &= \hat\Sigma_{\alpha_j}^{(c)}\, Z_j' \Bigl( d - \Lambda\bigl(\tilde t \mid \alpha_j^{(c)}, \vartheta_{-\alpha_j}\bigr)
+ \operatorname{diag}\bigl( \Lambda\bigl(\tilde t \mid \alpha_j^{(c)}, \vartheta_{-\alpha_j}\bigr) \bigr)\, Z_j\, \alpha_j^{(c)} \Bigr), \\
\hat\Sigma_{\alpha_j}^{(c)} &= \Bigl( Z_j'\, \operatorname{diag}\bigl( \Lambda\bigl(\tilde t \mid \alpha_j^{(c)}, \vartheta_{-\alpha_j}\bigr) \bigr)\, Z_j + \tau_{\alpha,j}^{-2}\, K_j \Bigr)^{-1}.
\end{aligned} \tag{9.13}
\]

In particular the band structure of the precision matrices of the Gaussian proposal distributions for the basis coefficients αj enables an efficient computation of the Cholesky decomposition and a fast implementation of the algorithm of Rue (2001), compare Section 6.1.7, to draw a new proposal and compute the acceptance probability. To identify the model, the B-spline coefficients are centered as described in Section 6.1.1.
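The sampling step that the algorithm of Rue (2001) performs efficiently for banded precision matrices can be sketched as follows (dense NumPy linear algebra for illustration only; an actual implementation would use a banded Cholesky factorization to exploit the band structure of the proposal precision):

import numpy as np
from scipy.linalg import solve_triangular

def sample_canonical_gaussian(Q, b, rng):
    """Draw x ~ N(Q^{-1} b, Q^{-1}) via the Cholesky factor of the precision Q."""
    L = np.linalg.cholesky(Q)                             # Q = L L'
    w = solve_triangular(L, b, lower=True)                # forward solve  L w = b
    mean = solve_triangular(L.T, w, lower=False)          # back solve: mean = Q^{-1} b
    z = rng.standard_normal(b.shape[0])
    return mean + solve_triangular(L.T, z, lower=False)   # add N(0, Q^{-1}) noise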

The computational effort increases for the update of the baseline hazard coefficients α_0, because the time-dependent cumulative baseline hazard function Λ_0(t) = ∫_0^t exp(b_0(s)′α_0) ds and the associated time-dependent derivatives are involved, complicating the computation of the score function and the Hessian matrix of the likelihood. We obtain

\[
\frac{\partial\, l(\vartheta \mid D)}{\partial \alpha_0}
= \sum_{i=1}^{n} \left( d_i\, b_0(\tilde t_i) - \exp(\eta_i) \int_0^{\tilde t_i} b_0(s)\, \exp\bigl( b_0(s)'\, \alpha_0 \bigr)\, ds \right),
\]

\[
\frac{\partial^2\, l(\vartheta \mid D)}{\partial \alpha_0\, \partial \alpha_0'}
= - \sum_{i=1}^{n} \exp(\eta_i) \int_0^{\tilde t_i} b_0(s)\, b_0(s)'\, \exp\bigl( b_0(s)'\, \alpha_0 \bigr)\, ds,
\]

and we have to evaluate these time-dependent expressions for the log-baseline hazard P-spline model by numerical integration in every iteration of the sampler. As a computationally more efficient alternative, one may use MH steps with conditional prior proposals, as developed in Knorr-Held (1999) for state space models and applied to geoadditive hazard rate models in Hennerfeind et al. (2006), which require only the evaluation of the log-likelihood and not of its derivatives.
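For concreteness, a minimal sketch of how the integrals Λ_i(t̃_i) = ∫_0^{t̃_i} exp(η_i(s)) ds could be approximated with the trapezoidal rule on an equidistant grid; the helper eta_of_s, which evaluates the extended predictor of subject i (including the P-spline log-baseline b_0(s)′α_0) on a grid of time points, is a hypothetical name:

import numpy as np

def cumulative_hazards_trapezoid(t_obs, eta_of_s, n_grid=30):
    """Approximate Lambda_i(t_i) = int_0^{t_i} exp(eta_i(s)) ds by the trapezoidal rule."""
    Lam = np.empty(len(t_obs))
    for i, t in enumerate(t_obs):
        s = np.linspace(0.0, t, n_grid)                  # equidistant grid on [0, t_i]
        Lam[i] = np.trapz(np.exp(eta_of_s(i, s)), s)     # trapezoidal approximation
    return Lam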

Weibull baseline hazard model

In addition to the P-spline based approach, we consider a simple parametric Weibull model for the baseline hazard, with λ_0(t) := α_0 t^{α_0 − 1} and λ_i(t | θ) = α_0 t^{α_0 − 1} exp(η_i(t)), as a competitor. In the extended predictor (7.3) the nonlinear log-baseline component f_0(t) = log(λ_0(t)) then has the special form f_0(t) = log(α_0) + (α_0 − 1) log(t).

Typically a gamma prior is employed to model the prior knowledge about the shape parameter α0, compare e. g. Ibrahim et al. (2001), i. e.,

\[
\alpha_0 \sim \mathrm{Gamma}\bigl( h_{1,\alpha_0},\, h_{2,\alpha_0} \bigr), \qquad h_{1,\alpha_0} > 0,\; h_{2,\alpha_0} > 0, \tag{9.14}
\]

with E(α_0) = h_{1,α_0}/h_{2,α_0} and Var(α_0) = h_{1,α_0}/h²_{2,α_0}. If identical hyperparameters such as h_{1,α_0} = 0.01 and h_{2,α_0} = 0.01 are used, the prior mean of α_0 equals one, which corresponds to a constant hazard over time, with a large variance of 100. The update of the log-baseline hazard in this parametric case reduces to the update of the single parameter α_0. With the prior assumption (9.14) and the likelihood contributions l_i(ϑ | D) = d_i log(α_0) + d_i(α_0 − 1) log(t̃_i) + d_i η_i − Λ_i(t̃_i | ϑ), the full conditional of α_0 is given by

\[
p(\alpha_0 \mid \cdot) \;\propto\; \exp\!\left\{ \sum_{i=1}^{n} \Bigl( d_i \log(\alpha_0) + d_i (\alpha_0 - 1) \log(\tilde t_i) - \Lambda_i(\tilde t_i \mid \vartheta) \Bigr) + (h_{1,\alpha_0} - 1) \log(\alpha_0) - h_{2,\alpha_0}\, \alpha_0 \right\}, \tag{9.15}
\]

where, in our case of time-independent covariates in the predictor, the expression for the cumulative hazard simplifies to Λ_i(t̃_i | ϑ) = t̃_i^{α_0} exp(η_i). Since the full conditional does not have a closed form, we again use MH steps to update the shape parameter α_0. The proposal α_0^(p) is drawn from a gamma distribution

\[
q\bigl(\cdot \mid \alpha_0^{(c)}, d_{\alpha_0}\bigr) \sim \mathrm{Gamma}\bigl( d_{\alpha_0}\, \alpha_0^{(c)},\; d_{\alpha_0} \bigr) \tag{9.16}
\]

based on the current value α_0^(c), which leads to the acceptance probability

\[
p_{\text{accept}}\bigl(\alpha_0^{(p)}, \alpha_0^{(c)}\bigr)
= \min\left\{ 1,\;
\frac{ p\bigl(\alpha_0^{(p)} \mid \cdot\bigr)\; q\bigl(\alpha_0^{(c)} \mid \alpha_0^{(p)}, d_{\alpha_0}\bigr) }
     { p\bigl(\alpha_0^{(c)} \mid \cdot\bigr)\; q\bigl(\alpha_0^{(p)} \mid \alpha_0^{(c)}, d_{\alpha_0}\bigr) } \right\}.
\]

The value of d_{α_0} is determined during the burn-in to achieve reasonable acceptance rates. In addition, slice sampling is possible because the full conditional is log-concave, see Ibrahim et al. (2001), and therefore adaptive rejection sampling can also be applied to update the shape parameter of the Weibull model. The update of the remaining model parameters proceeds exactly as in the case where the baseline is modeled by a P-spline; in particular, we only have to replace the logarithm of the baseline hazard and the cumulative baseline hazard by the corresponding expressions of the Weibull model.
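A compact sketch of this update (illustrative Python only; the hyperparameter and variable names are assumptions, and the vectorized evaluation of (9.15) uses the simplified cumulative hazard Λ_i(t̃_i | ϑ) = t̃_i^{α_0} exp(η_i) from above):

import numpy as np
from scipy.stats import gamma as gamma_dist

def log_full_cond_alpha0(alpha0, d, t_obs, eta, h1, h2):
    """Log of the full conditional (9.15) of the Weibull shape, up to a constant."""
    Lam = t_obs ** alpha0 * np.exp(eta)
    loglik = np.sum(d * np.log(alpha0) + d * (alpha0 - 1.0) * np.log(t_obs) - Lam)
    return loglik + (h1 - 1.0) * np.log(alpha0) - h2 * alpha0

def mh_update_weibull_shape(alpha_c, log_full_cond, d_alpha, rng):
    """One MH step for alpha_0 with the gamma proposal (9.16); the proposal mean is alpha_c."""
    alpha_p = rng.gamma(d_alpha * alpha_c, 1.0 / d_alpha)      # numpy gamma uses the scale
    log_q_pc = gamma_dist.logpdf(alpha_p, a=d_alpha * alpha_c, scale=1.0 / d_alpha)
    log_q_cp = gamma_dist.logpdf(alpha_c, a=d_alpha * alpha_p, scale=1.0 / d_alpha)
    log_acc = log_full_cond(alpha_p) - log_full_cond(alpha_c) + log_q_cp - log_q_pc
    return alpha_p if np.log(rng.uniform()) < log_acc else alpha_c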

9.1.2. Full conditionals of the regularization parameters

Due to the staged hierarchical model structure, the regularization parameters τ²_α, τ²_β and ρ are not directly connected to the likelihood; as a consequence, their full conditionals have closed forms, so a new state of the MCMC chain can be drawn directly by Gibbs sampling. The same holds if the partial likelihood is used for inference. The full conditionals of the regularization parameters are derived as in Section 6.1.2, and we only summarize the results here briefly.

Bayesian ridge

Version (A): For the variance parameters we have the deterministic relation τ²_{β,j} = 1/(2λ_j), and the full conditionals of the shrinkage parameters are gamma distributions

\[
\lambda_j \mid \cdot \;\sim\; \mathrm{Gamma}\!\left( h_{1,\lambda} + \frac{1}{2},\; h_{2,\lambda} + \beta_j^2 \right), \qquad j = 1, \dots, p_x. \tag{9.17}
\]

Version (B): For the variance parameters we have the deterministic relation τ²_β = 1/(2λ), and the full conditional of the shrinkage parameter is a gamma distribution

\[
\lambda \mid \cdot \;\sim\; \mathrm{Gamma}\!\left( h_{1,\lambda} + \frac{p_x}{2},\; h_{2,\lambda} + \sum_{j=1}^{p_x} \beta_j^2 \right). \tag{9.18}
\]
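Both versions translate into a one-line Gibbs draw; a minimal sketch (Python/NumPy, illustrative argument names; numpy's gamma sampler is parametrized by shape and scale = 1/rate):

import numpy as np

def gibbs_ridge_lambda(beta, h1, h2, rng, common=False):
    """Gibbs draw of the ridge shrinkage parameter(s), cf. (9.17) and (9.18)."""
    if common:                                              # version (B): one common lambda
        return rng.gamma(h1 + len(beta) / 2.0, 1.0 / (h2 + np.sum(beta ** 2)))
    return rng.gamma(h1 + 0.5, 1.0 / (h2 + beta ** 2))      # version (A): vectorized over j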

Bayesian lasso

The full conditionals of the variance parameters τ²_{β,j} are inverse Gaussian distributions

\[
\frac{1}{\tau^2_{\beta,j}} \,\Big|\, \cdot \;\sim\; \mathrm{InvGauss}\!\left( \sqrt{\frac{\lambda^2}{\beta_j^2}},\; \lambda^2 \right), \qquad j = 1, \dots, p_x, \tag{9.19}
\]

and we have a gamma full conditional of the complexity parameter

\[
\lambda^2 \mid \cdot \;\sim\; \mathrm{Gamma}\!\left( h_{1,\lambda} + p_x,\; h_{2,\lambda} + \frac{1}{2} \sum_{j=1}^{p_x} \tau^2_{\beta,j} \right). \tag{9.20}
\]
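A sketch of the corresponding Gibbs draws (illustrative Python only; numpy's Wald generator is an inverse Gaussian sampler with the mean/scale parametrization used in (9.19)):

import numpy as np

def gibbs_lasso(beta, lam, h1, h2, rng):
    """Gibbs draws for the Bayesian lasso hyperparameters, cf. (9.19) and (9.20)."""
    inv_tau2 = rng.wald(lam / np.abs(beta), lam ** 2)       # 1/tau2_j | .  from (9.19)
    tau2 = 1.0 / inv_tau2
    lam2 = rng.gamma(h1 + len(beta), 1.0 / (h2 + 0.5 * np.sum(tau2)))   # lambda^2 | .  from (9.20)
    return tau2, np.sqrt(lam2)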

Bayesian NMIG

Under the Bayesian NMIG prior the variance parameter is the product of two variance components, τ²_{β,j} = I_j ψ²_j. The full conditionals of the covariate-specific binary indicator variables I_j are Bernoulli-type distributions on the two points v_0 and v_1,

\[
p(I_j \mid \cdot) = \left( \frac{1}{1 + A_j B_j} \right)^{\delta_{v_1}(I_j)} \left( 1 - \frac{1}{1 + A_j B_j} \right)^{\delta_{v_0}(I_j)}, \qquad j = 1, \dots, p_x, \tag{9.21}
\]

with

\[
A_j = \frac{1 - \omega}{\omega} \sqrt{\frac{v_1}{v_0}}, \qquad
B_j = \exp\!\left( - \frac{\beta_j^2}{2 \psi_j^2} \left( \frac{1}{v_0} - \frac{1}{v_1} \right) \right),
\]

so that 1/(1 + A_j B_j) is the conditional probability of the slab component I_j = v_1,

and the variance parameters have inverse gamma distributions

\[
\psi_j^2 \mid \cdot \;\sim\; \mathrm{InvGamma}\!\left( h_{1,\psi} + \frac{1}{2},\; h_{2,\psi} + \frac{\beta_j^2}{2 I_j} \right), \qquad j = 1, \dots, p_x. \tag{9.22}
\]

The full conditional for the mixing parameter is a beta density

\[
\omega \mid \cdot \;\sim\; \mathrm{Beta}\bigl( h_{1,\omega} + n_1,\; h_{2,\omega} + n_0 \bigr) \tag{9.23}
\]

with n_0 := #{j : I_j = v_0} and n_1 := #{j : I_j = v_1}.

Smoothing variances

The full conditionals of the variance parameters τ²_{α,j} are (proper) inverse gamma distributions

\[
\tau^2_{\alpha,j} \mid \cdot \;\sim\; \mathrm{InvGamma}\!\left( h_{1,\tau_j} + \frac{\operatorname{rank}(K_j)}{2},\; h_{2,\tau_j} + \frac{1}{2}\, \alpha_j'\, K_j\, \alpha_j \right), \qquad j = 0, 1, \dots, p_z. \tag{9.24}
\]
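A corresponding Gibbs draw in code form (sketch only; an InvGamma(a, b) variate is generated as the reciprocal of a Gamma(a, scale = 1/b) draw, and rank(K_j) would normally be precomputed once):

import numpy as np

def gibbs_smoothing_variance(alpha_j, K_j, h1, h2, rng):
    """Gibbs draw of tau2_{alpha,j} from the inverse gamma full conditional (9.24)."""
    a = h1 + np.linalg.matrix_rank(K_j) / 2.0
    b = h2 + 0.5 * alpha_j @ K_j @ alpha_j
    return 1.0 / rng.gamma(a, 1.0 / b)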