
2.3 Inference

2.3.3 Response of a univariate exponential family

Now, we consider models with a univariate response variable whose distribution belongs to an exponential family. Examples are count data or binary response variables. These models in combination with a linear predictor are called generalised linear models (see e.g. McCullagh & Nelder (1989)). Here again, as in Gaussian models, it is possible to replace the linear predictor with a structured additive predictor (2.4), leading to generalised STAR models.

Before we describe the estimation of regression coefficients in section 2.3.3.2, we introduce some facts about model specification.

2.3.3.1 Model specification

Here, we sketch the most important facts about model specification in generalised regression models. More details about model specification and estimation can be found in Fahrmeir & Tutz (2001), for instance. In generalised regression models, model specification is based on two different assumptions. As a result, several models are possible for the same data, even when using the same predictor. The two assumptions are:

1. Distributional assumption

Given the predictor values ηi, i = 1, . . . , n, the response values yi are conditionally independent and their distributions belong to an exponential family, i.e. the respective density can be written as

f(yi | θi, φ, wi) = exp{ [yiθi − b(θi)] / φ · wi + c(yi, φ, wi) },

where

θi is the natural parameter of the exponential family,
φ is a scale or dispersion parameter common to all observations,
wi is a weight and
b(·) and c(·) are specific functions depending on the particular exponential family.
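As a quick numerical illustration (our own sketch, not part of the original text), the Poisson distribution fits this form with θ = log(λ), b(θ) = exp(θ), φ = 1, wi = 1 and c(yi, φ, wi) = −log(yi!). The helper names below are chosen for the example only:

```python
import math

def expfam_density(y, theta, b, c, phi=1.0, w=1.0):
    """Generic exponential-family density/pmf:
    exp{(y*theta - b(theta))/phi * w + c(y)}."""
    return math.exp((y * theta - b(theta)) / phi * w + c(y))

def poisson_pmf_direct(y, lam):
    """Standard Poisson pmf for comparison."""
    return lam**y * math.exp(-lam) / math.factorial(y)

lam = 3.5
for y in range(6):
    via_expfam = expfam_density(
        y,
        theta=math.log(lam),              # natural parameter theta = log(lambda)
        b=math.exp,                       # b(theta) = exp(theta)
        c=lambda y: -math.lgamma(y + 1),  # c(y) = -log(y!)
    )
    # both representations give the same probability
    assert abs(via_expfam - poisson_pmf_direct(y, lam)) < 1e-12
```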

2. Structural assumption

The (conditional) expectation µi = E(yi | ηi) is related to the predictor ηi by

µi = h(ηi) or ηi = g(µi),

where

h is a known bijective, sufficiently smooth response function and g is the inverse of h, called the link function.

The natural parameter θ is a function of the mean µ and is, for every exponential family, uniquely determined by the relation

µ = b′(θ) = ∂b(θ)/∂θ.

For a single observation, we have the relation θi = θ(µi). The natural parameter provides a special kind of link function, the natural link function. Here, the natural parameter is directly linked to the predictor, i.e.

θ = θ(µ) = η.

The variance of the observations yi is of the form

Var(yi | ηi) = φ v(µi) / wi,

where the variance function is also, for every exponential family, uniquely determined by

v(µ) = b″(θ) = ∂²b(θ)/∂θ².

Distribution   Notation    θ(µ)            b(θ)              φ     b′(θ)                     b″(θ)
Normal         N(µ, σ²)    µ               θ²/2              σ²    µ = θ                     1
Bernoulli      B(1, π)     log(π/(1−π))    log(1 + exp(θ))   1     π = exp(θ)/(1 + exp(θ))   π(1 − π)
Poisson        Po(λ)       log(λ)          exp(θ)            1     λ = exp(θ)                λ
Gamma          G(µ, ν)     −1/µ            −log(−θ)          ν⁻¹   µ = −1/θ                  µ²

Table 2.1: Important quantities of some exponential families.

Important quantities of some exponential families, e.g. the natural parameter and the variance function, are shown in table 2.1.
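The columns b′(θ) and b″(θ) of table 2.1 can be verified numerically. The following sketch (our own illustration, assuming the Bernoulli entries of the table) differentiates b(θ) = log(1 + exp(θ)) by finite differences and compares the results with π and π(1 − π):

```python
import math

def b_bernoulli(theta):
    """b(theta) for the Bernoulli family."""
    return math.log(1 + math.exp(theta))

def num_deriv(f, x, h=1e-6):
    """Central first difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def num_deriv2(f, x, h=1e-4):
    """Central second difference."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

theta = 0.7
pi = math.exp(theta) / (1 + math.exp(theta))
assert abs(num_deriv(b_bernoulli, theta) - pi) < 1e-8            # b'(theta) = pi
assert abs(num_deriv2(b_bernoulli, theta) - pi * (1 - pi)) < 1e-5  # b''(theta) = pi(1-pi)
```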

As already mentioned above, the same distributional assumption together with different choices for the response function in the structural assumption leads to several possible models for the same data. The following passages describe frequently used response functions for different types of dependent variables.

Normal distribution

The normal distribution is also an exponential family. When using the natural link function θ(µ) = µ, the response function is simply the identity h(η) = η and we get back to the classical linear (or STAR) model as in section 2.3.2.

Bernoulli and binomial distribution

First, we consider the case of ungrouped data with a binary response coded by 0 and 1. Here, the expected value is the probability of observing the value 1, i.e.

E(yi | ηi) = P(yi = 1 | ηi) = πi.

The natural link function is

θ = log(π/(1 − π)) = η

with the logistic distribution function

π = h(η) = exp(η) / (1 + exp(η))

as the resulting response function. Applying a distribution function to the predictor η ensures that the probability π lies in the interval [0; 1]. The model using the natural link function is called the logit model.

Another possible choice for the response function is the standard normal distribution function, i.e.

π = h(η) = Φ(η).

This model is called the probit model. There exist further possibilities for choosing the response function; here, we have restricted ourselves to the ones mentioned above.
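A small sketch of the two response functions just described; the probit case uses the identity Φ(η) = (1 + erf(η/√2))/2 so that only the standard library is needed. The function names are our own:

```python
import math

def logit_response(eta):
    """pi = exp(eta) / (1 + exp(eta)), the logistic response function."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def probit_response(eta):
    """pi = Phi(eta), the standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# both response functions map any eta into a valid probability
for eta in (-2.0, 0.0, 2.0):
    for h in (logit_response, probit_response):
        assert 0.0 < h(eta) < 1.0

# and both map eta = 0 to pi = 0.5
assert logit_response(0.0) == 0.5 and probit_response(0.0) == 0.5
```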

If we can group the data, i.e. if there are several independent trials for every combination of covariates, we get yi ∼ B(mi, πi) with i = 1, . . . , n. In this case, the relative frequencies ȳi = yi/mi are used as the dependent variable, leading to a scaled binomial distribution with E(ȳi) = πi. By defining weights wi = mi for i = 1, . . . , n, both the logit and the probit model can also be used for grouped data.

Count data

Here, we assume a Poisson distributed response variable, i.e. y ∼ Po(λ). In this case, the most natural choice is the natural link and the respective response function, which are given by

g(λ) = log(λ) = η and h(η) = exp(η) = λ.

This ensures a positive value for the mean λ. The model in combination with a simple linear predictor is often called a loglinear model.

Gamma distribution

Here, we deal with a nonnegative continuous response variable that usually has an asymmetric distribution. One possible model for such data is the lognormal model, where the identity link of the normal model is replaced by a log link. The other possibility is to assume a distribution that by definition only has the support R+, e.g. the gamma distribution. Additionally, the gamma distribution has the property that it includes asymmetric distributions. The most common choice for the structural assumption, which we also use, is the log link

g(µ) = log(µ) = η

with the respective response function

h(η) = exp(η) = µ.

This choice ensures a nonnegative value for µ. This, however, is not ensured when using the natural response function

h(η) = −η⁻¹ = µ.

Note that, for the notation used here, the gamma distribution is parameterised as follows:

f(y | µ, ν) = 1/Γ(ν) · (ν/µ)^ν · y^(ν−1) · exp(−(ν/µ) y),

with E(y) = µ and Var(y) = µ²/ν.
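As a sanity check of this parameterisation (our own illustration), the following sketch evaluates the density on a fine grid and verifies numerically that it integrates to one, with mean µ and variance µ²/ν:

```python
import math

def gamma_density(y, mu, nu):
    """f(y|mu, nu) = 1/Gamma(nu) * (nu/mu)**nu * y**(nu-1) * exp(-nu*y/mu)."""
    return (nu / mu)**nu * y**(nu - 1) * math.exp(-nu * y / mu) / math.gamma(nu)

mu, nu = 2.0, 3.0
# crude rectangle-rule check that mass, mean and variance match 1, mu and mu^2/nu
n, upper = 200000, 60.0
h = upper / n
total = mean = second = 0.0
for i in range(1, n):
    y = i * h
    f = gamma_density(y, mu, nu)
    total += f * h
    mean += y * f * h
    second += y * y * f * h
assert abs(total - 1.0) < 1e-3                      # integrates to one
assert abs(mean - mu) < 1e-2                        # E(y) = mu
assert abs(second - mean**2 - mu**2 / nu) < 1e-2    # Var(y) = mu^2/nu
```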

2.3.3.2 Inference

As the system of estimation equations is nonlinear for generalised models, it is no longer possible to calculate maximum likelihood estimates for the coefficients in the linear predictor analytically. Nor is the analytical calculation of posterior mode or maximum penalised likelihood estimates in a structured additive predictor possible. Instead, we need iterative algorithms. First, we describe the IWLS algorithm for computing maximum likelihood estimates γ̂ in a linear predictor without penalisation. IWLS is short for iteratively weighted least squares. In every iteration, weighted least squares estimates are calculated where the weights and the dependent variable are adjusted with respect to the current estimates of γ.

IWLS Algorithm

1. Initialisation:

Set (e.g.) γ̂(0) = 0. Set r = 1.

2. Computation of weight matrix and dependent variable:

W(r−1) = diag(d1(r−1), . . . , dn(r−1))
ηi(r−1) = u′i γ̂(r−1)
µi(r−1) = h(u′i γ̂(r−1))
θi(r−1) = θ(h(u′i γ̂(r−1)))
di(r−1) = wi (∂h(ηi(r−1))/∂η)² (∂²b(θi(r−1))/∂θ²)⁻¹
ỹi(r−1) = ηi(r−1) + (∂h(ηi(r−1))/∂η)⁻¹ (yi − µi(r−1))

3. Computation of the weighted least squares estimate

γ̂(r) = (U′W(r−1)U)⁻¹ U′W(r−1) ỹ(r−1)

4. Computation of the stop criterion

||γ̂(r) − γ̂(r−1)|| / ||γ̂(r−1)||.

If the stop criterion is larger than a specified ε > 0, set r = r + 1 and go back to 2., otherwise terminate the process.
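To make the steps above concrete, here is a minimal pure-Python sketch of IWLS for a Poisson response with the natural (log) link. For this link, ∂h(η)/∂η = exp(η) = µ and ∂²b(θ)/∂θ² = µ, so the weights reduce to di = wi µi and the working observations to ỹi = ηi + (yi − µi)/µi. The function names and the small linear solver are our own illustration, not from the original text:

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small system A x = b."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def iwls_poisson(U, y, w=None, eps=1e-10, max_iter=100):
    """IWLS for a Poisson response with natural (log) link."""
    n, p = len(U), len(U[0])
    w = w or [1.0] * n
    gamma = [0.0] * p                       # step 1: initialise gamma_hat = 0
    for _ in range(max_iter):
        # step 2: weights and working observations
        eta = [sum(U[i][j] * gamma[j] for j in range(p)) for i in range(n)]
        mu = [math.exp(e) for e in eta]     # mu_i = h(eta_i) = exp(eta_i)
        d = [w[i] * mu[i] for i in range(n)]
        ytil = [eta[i] + (y[i] - mu[i]) / mu[i] for i in range(n)]
        # step 3: weighted least squares, gamma = (U'WU)^(-1) U'W ytilde
        A = [[sum(d[i] * U[i][j] * U[i][k] for i in range(n)) for k in range(p)]
             for j in range(p)]
        b = [sum(d[i] * U[i][j] * ytil[i] for i in range(n)) for j in range(p)]
        new = solve(A, b)
        # step 4: relative-change stop criterion
        crit = (math.sqrt(sum((new[j] - gamma[j])**2 for j in range(p)))
                / max(math.sqrt(sum(g**2 for g in gamma)), 1e-300))
        gamma = new
        if crit < eps:
            break
    return gamma

# Intercept-only model: the ML estimate satisfies exp(gamma_hat) = mean(y).
y = [1, 2, 3, 4, 5]
gamma_hat = iwls_poisson([[1.0]] * len(y), y)
assert abs(math.exp(gamma_hat[0]) - 3.0) < 1e-8
```

For the natural link, this iteration coincides with Newton–Raphson (and Fisher–Scoring), which is why the intercept-only fit converges to log of the sample mean in a handful of steps.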

This algorithm is equivalent to Fisher–Scoring, which is a modified Newton–Raphson method. Fisher–Scoring uses the expected Fisher information matrix instead of the matrix containing the second derivatives of the log-likelihood, the observed information matrix. When using the natural link function, expected and observed Fisher information are identical.

If we have a structured additive predictor, i.e. if we want to maximise a penalised log-likelihood, step 3. of the IWLS algorithm is replaced by the backfitting algorithm. This combined algorithm is called the Local Scoring Procedure by Hastie & Tibshirani (1990). In fact, it calculates the zero of the first derivative of the penalised log-likelihood,

∂lpen(y | γ, β1, . . . , βq) / ∂(γ, β1, . . . , βq).

If the scale parameter is unknown, as is the case for a Gamma or normally distributed response, it can be estimated by

φ̂ = (1/n) Σᵢ₌₁ⁿ wi (yi − µ̂i)² / v(µ̂i),     (2.43)

where µ̂i = h(η̂i) and v(µ̂i) are the respective mean and variance function of yi. For a normally distributed response, formula (2.43) results in the ML estimate σ̂² from formula (2.42). In contrast to the usually used estimator (see e.g. Fahrmeir & Tutz (2001)), we do not correct n for the number of model parameters, in order to obtain an estimator analogous to the ML estimate in the Gaussian case.
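A minimal sketch of estimator (2.43); for a Gaussian response with v(µ) = 1 and wi = 1 it reduces to the ML estimate (1/n) Σ(yi − µ̂i)². The function name is our own:

```python
def dispersion(y, mu_hat, v, w=None):
    """phi_hat = (1/n) * sum_i w_i * (y_i - mu_hat_i)**2 / v(mu_hat_i)."""
    n = len(y)
    w = w or [1.0] * n
    return sum(w[i] * (y[i] - mu_hat[i])**2 / v(mu_hat[i]) for i in range(n)) / n

# Gaussian case: v(mu) = 1, so phi_hat is the mean squared residual
y = [1.0, 2.0, 4.0, 5.0]
mu_hat = [3.0, 3.0, 3.0, 3.0]
phi = dispersion(y, mu_hat, v=lambda m: 1.0)
assert phi == (4 + 1 + 1 + 4) / 4   # = 2.5
```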

Selection of Variables and Smoothing Parameters

In chapter 2, we already mentioned the influence of the smoothing parameter λ (or equivalently the variance parameter τ²) on the estimated effect of a covariate (see figure 2.2 for the case of P-splines in section 2.2.3.2). We also described approaches for inference in structured additive models if all smoothing parameters are fixed. In this chapter, we deal with the problem of determining appropriate values for the smoothing parameters. Additionally, we want to deal with a second, but similar, problem: the selection of important variables. This question was not mentioned in the last section. But in many applications, a lot of potentially influential covariates are available, although only a few of them actually have an influence on the response. Altogether, the following questions arise:

Which terms (covariates) are to be included in the model?

Is the effect of a certain continuous variable linear or nonlinear, i.e. is it necessary to use a spline function or would a linear effect be sufficient?

Which value should be used for the smoothing parameter of a nonlinear function?

Does a nonlinear effect vary over the range of another variable or is the effect constant?

Is there a complex interaction between two continuous variables?

In this chapter, we want to deal with these questions simultaneously and introduce algorithms that can answer them.

This chapter is organised as follows: the first section 3.1 gives an overview of alternative and related methods for variable and/or smoothing parameter selection. All other sections explain details that are closely connected with our selection approach: section 3.2 describes several selection criteria, the concept of degrees of freedom in additive models is explained in section 3.3, and the selection algorithms are described in the last section 3.4.

3.1 Alternative Approaches

In the last two decades, considerable research has been carried out on the topic of variable selection and on determining values for smoothing parameters. Nevertheless, most of the existing approaches can either select subsets of variables in (generalised) linear models or can determine smoothing parameters for a fixed set of covariates. Altogether, none of the approaches introduced in this section can deal with simultaneous variable and smoothing parameter selection in such a broad class of models as our approach described in section 3.4.