An Introduction based on
Herwig Friedl, Institute of Statistics
Graz University of Technology, Austria
• This course provides an introduction to the concepts of generalized linear models (GLM's).
• It extends the class of linear models (LM's) to regression models for non-normal data.
• Special interest will be on binary data (logistic regression) and count data (log-linear models).
• All models will be handled using R functions such as lm, anova, or glm.
• Linear Models (LM’s): Recap of Results
• Box-Cox Transformation Family: Extending the LM
• Generalized Linear Models (GLM’s): An Introduction
• Linear Exponential Family (LEF): Properties and Members
• GLM’s: Parameter Estimates
• GLM’s: glm(.) Function
• Gamma Regression Models
• Logistic Regression (Binomial Responses)
• Log-linear Model (Poisson Responses)
• Multinomial Response Models
• Quasi-Likelihood Models (Overdispersion)
The goal of regression models is to determine how a response variable depends on covariates. A special class of regression models are linear models. The general setup is given by
• Data (y_i, x_{i1}, ..., x_{i,p−1}), i = 1, ..., n
• Response y = (y_1, ..., y_n)^T (random variable)
• Covariates x_i = (x_{i1}, ..., x_{i,p−1})^T (fixed, known)
Data Example: Life Expectancies
Source: The World Bank makes available a wealth of data from the World Development Indicators, accessible within R:
> install.packages('WDI'); library(WDI)
> WDIsearch('gdp')   # gives a list of available data about gdp
> d <- WDI(indicator='NY.GDP.PCAP.KD', country=c('AT', 'US'),
+          start=1960, end=2012)
> head(d)
  iso2c country NY.GDP.PCAP.KD year
1    AT Austria       41246.57 2013
2    AT Austria       41366.52 2012
3    AT Austria       41120.22 2011
4    AT Austria       39972.94 2010
5    AT Austria       39350.17 2009
> install.packages('gdata')
# also needs Perl (for MS Windows) to be installed
# also needs Java with the same structure as R (32/64 bit)
> library(gdata)
> temp.xls <- "http://databank.worldbank.org/data/download/catalog/
+ cckp_historical_data_0.xls"
> myperl <- "c:/Strawberry/perl/bin/perl.exe"
> sheetCount(temp.xls, perl=myperl)
Downloading...
trying URL 'http://databank.worldbank.org/data/.../*.xls'
Content type 'application/vnd.ms-excel' length 378368 bytes
Data Example: Life Expectancies
> temp <- read.xls(temp.xls, sheet="Country_temperatureCRU",
+                  perl=myperl)
> temp.data <- temp[ , c("ISO_3DIGIT", "Annual_temp")]
> colnames(temp.data) <- c("iso3c", "temp")
> head(temp.data)
  iso3c  temp
1   AFG 12.92
2   AGO 21.51
3   ALB 11.27
4   ARE 26.83
5   ARG 14.22
6   ARM  6.37
Data Example: Life Expectancies
Data we are interested in (from 2010):
• life.exp: life expectancy at birth, total (years)
• urban: urban population (percent)
• physicians: physicians (per 1,000 people)
• temp: annual mean temperature (Celsius)
Which is the response and which are covariates?
Gaussian Linear Model:
 y_i = β_0 + β_1 x_{i1} + ... + β_{p−1} x_{i,p−1} + ε_i,  ε_i iid ~ Normal(0, σ²),
with unknown regression parameters β_0, β_1, ..., β_{p−1} (intercept β_0, slopes β_j, j = 1, ..., p−1) and unknown (homogeneous) error variance σ².
This is equivalent to y_i ind~ Normal(E(y_i), var(y_i)), where
 E(y_i) = μ_i = β_0 + β_1 x_{i1} + ... + β_{p−1} x_{i,p−1}
is a linear function in the parameters, and var(y_i) = σ², i = 1, ..., n, describes a homoscedastic scenario.
Define
 β = (β_0, β_1, ..., β_{p−1})^T,  x_i = (1, x_{i1}, ..., x_{i,p−1})^T,  X = (x_1, ..., x_n)^T
and write the Gaussian regression model as y = Xβ + ε with
 E(y) = μ = Xβ  and  var(y) = σ² I_n.
Exploratory Data Analysis (EDA):
• Check out the ranges of the response and covariates. For discrete covariates (with sparse factor levels) we consider grouping the levels.
• Plot covariates against the response. The scatter plots should reflect linear relationships; otherwise we consider transformations.
• To check whether the constant-variance assumption is reasonable, the points of the scatter plot of covariates against the response should be contained in a horizontal band.
> summary(mydata[, c(5, 6, 8, 10)])
    life.exp         urban           physicians        temp
 Min.   :44.84   Min.   :0.09092   Min.   :0.008   Min.   :-7.14
 1st Qu.:62.16   1st Qu.:0.37396   1st Qu.:0.272   1st Qu.:10.15
 Median :71.83   Median :0.56297   Median :1.764   Median :21.58
 Mean   :69.11   Mean   :0.55892   Mean   :1.871   Mean   :18.06
 3rd Qu.:76.10   3rd Qu.:0.74453   3rd Qu.:3.000   3rd Qu.:25.05
 Max.   :82.84   Max.   :0.98655   Max.   :7.739   Max.   :28.30
                                   NA's   :1       NA's   :39
> plot(mydata[, c(5, 6, 8, 10)])
> plot(log(physicians), life.exp)
Idea of Least Squares: minimize the sum of squared errors, i.e.
 SSE(β) = Σ_{i=1}^n (y_i − x_i^T β)².
Equivalent with Maximum Likelihood: maximize the sample log-likelihood function
 ℓ(β|y) = Σ_{i=1}^n [ log(1/√(2πσ²)) − (1/(2σ²)) (y_i − x_i^T β)² ].
LSE/MLE solution: β̂ = (X^T X)^{−1} X^T y. For y_i ind~ Normal(x_i^T β, σ²) we have
 β̂ ~ Normal(β, σ²(X^T X)^{−1}).
The MLE of the error variance,
 σ̂² = (1/n) SSE(β̂) = (1/n) Σ_{i=1}^n (y_i − x_i^T β̂)²,  with E(σ̂²) = ((n−p)/n) σ²,
is biased. An unbiased variance estimator is the (df-corrected)
 S² = SSE(β̂)/(n−p).
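These closed-form results can be checked numerically; a minimal R sketch on simulated data (all object names are hypothetical):

```r
# Sketch: compare the closed-form LSE/MLE and S^2 with lm() on simulated data
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design with intercept
beta <- c(2, 1, -0.5)
y <- as.vector(X %*% beta + rnorm(n, sd = 2))

beta.hat <- solve(t(X) %*% X, t(X) %*% y)             # (X'X)^{-1} X'y
S2 <- sum((y - X %*% beta.hat)^2) / (n - p)           # unbiased variance estimator

fit <- lm(y ~ X - 1)                                  # same fit via lm()
all.equal(as.vector(beta.hat), unname(coef(fit)), tolerance = 1e-6)
all.equal(S2, summary(fit)$sigma^2, tolerance = 1e-6)
```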
For y_i ind~ Normal(x_i^T β, σ²) we get
 Σ_{i=1}^n (y_i − ȳ)²  =  Σ_{i=1}^n (μ̂_i − ȳ)²  +  Σ_{i=1}^n (y_i − μ̂_i)²
   [SST]                  [SSR(β̂)]                [SSE(β̂)]
Total SS equals (maximized) Regression SS plus (minimized) Error SS. Thus, the proportion of variability explained by the regression model is described by the coefficient of determination
 R² = SSR(β̂)/SST = 1 − SSE(β̂)/SST ∈ (0, 1).
To penalize for model complexity p we use its adjusted version
 R²_adj = 1 − (SSE(β̂)/(n−p)) / (SST/(n−1)),
which is no longer guaranteed to lie in (0, 1).
Thus, for each slope parameter β_j, j = 1, ..., p−1, we have
 β̂_j ~ Normal(β_j, σ²(X^T X)^{−1}_{j+1,j+1})
and therefore
 (β̂_j − β_j) / √(σ²(X^T X)^{−1}_{j+1,j+1}) ~ Normal(0, 1).
Since S² and β̂ are independent, replacing σ² by S² results in the
Hypothesis Tests: t-Test
 (β̂_j − β_j) / √(S²(X^T X)^{−1}_{j+1,j+1}) ~ t_{n−p}.
Therefore, we can test the relevance of a single predictor x_j by
 H_0: β_j = 0  vs  H_1: β_j ≠ 0
and use the well-known test statistic
 Estimate/Std. Error = β̂_j / √(S²(X^T X)^{−1}_{j+1,j+1}),
which under H_0 follows a t_{n−p} distribution.
 μ = β_0 + β_Af I(Africa) + β_Am I(America) + β_As I(Asia)
To check if the predictor continent is irrelevant we have to test k−1 parameters simultaneously,
 H_0: β_Af = β_Am = β_As = 0  vs  H_1: not H_0.
Fitting the model twice, under H_0 and under H_1, results in two SSE values that can be compared by an F test.
 y = Xβ + ε,  ε ~ Normal(0, σ²W),  W = diag(w_1, ..., w_n)
The MLE (weighted LSE) of β is given by β̂ = (X^T W^{−1} X)^{−1} X^T W^{−1} y with
 E(β̂) = β  and  var(β̂) = σ²(X^T W^{−1} X)^{−1}.
The MLE of σ² is
 σ̂² = (1/n) Σ_{i=1}^n (y_i − μ̂_i)²/w_i = (1/n) r^T W^{−1} r.
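A small sketch of the weighted estimator, assuming known weights w_i, checked against lm() (whose weights argument expects the precisions 1/w_i); all names are hypothetical:

```r
# Sketch: weighted LSE from the closed form vs. lm(..., weights = 1/w)
set.seed(2)
n <- 50
x <- runif(n); X <- cbind(1, x)
w <- runif(n, 0.5, 2)                         # var(eps_i) = sigma^2 * w_i
y <- as.vector(X %*% c(1, 3) + rnorm(n, sd = sqrt(w)))

Winv <- diag(1 / w)
beta.hat <- solve(t(X) %*% Winv %*% X, t(X) %*% Winv %*% y)

fit <- lm(y ~ x, weights = 1 / w)             # lm() weights are precisions
all.equal(as.vector(beta.hat), unname(coef(fit)), tolerance = 1e-6)
```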
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 61.49091    2.21534  27.757  < 2e-16 ***
urban       15.06895    2.94544   5.116  1.1e-06 ***
physicians   1.87133    0.46933   3.987 0.000111 ***
temp        -0.19079    0.07287  -2.618 0.009896 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.874 on 129 degrees of freedom
Data Example: Life Expectancies

All predictors are significant (at the 5% level); only temp has a negative effect.

Residual standard error: 5.874 on 129 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared: 0.5872,    Adjusted R-squared: 0.5776
F-statistic: 61.17 on 3 and 129 DF,  p-value: < 2.2e-16

Under the model, the estimated standard error of the response is 5.874 (years). We have n − p = 129 and p − 1 = 3 predictors. Almost 60% of the total variability is explained by this model; the adjusted version of R² is 0.5776.
Data Example: Life Expectancies (log(physicians))
> mod.log <- update(mod, . ~ . - physicians + log(physicians))
> summary(mod.log)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     67.37488    1.85916  36.239  < 2e-16 ***
urban            8.13713    2.59353   3.137  0.00211 **
temp            -0.05590    0.06242  -0.895  0.37219
log(physicians)  3.50002    0.40578   8.625 2.01e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictor temp is no longer significant!

Residual standard error: 4.958 on 129 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared: 0.7059,    Adjusted R-squared: 0.6991
F-statistic: 103.2 on 3 and 129 DF,  p-value: < 2.2e-16

The residual standard error is much smaller now than before (roughly ±5 years)! Even 70% of the total variability is now explained by this model.
Data Example: Life Expectancies (ANOVA)
> anova(mod.log)
Analysis of Variance Table
Response: life.exp
                 Df Sum Sq Mean Sq F value    Pr(>F)
urban             1 4813.0  4813.0 195.831 < 2.2e-16 ***
temp              1  970.1   970.1  39.470 4.709e-09 ***
log(physicians)   1 1828.5  1828.5  74.399 2.010e-14 ***
Residuals       129 3170.5    24.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Information about this is contained in the ANOVA table:

Source      df    Sum of Squares (SS)       MSS                       F
Regression  p−1   SSR(β̂) = SST − SSE(β̂)   MSR(β̂) = SSR(β̂)/(p−1)    MSR(β̂)/MSE(β̂)
Error       n−p   SSE(β̂)                   MSE(β̂) = SSE(β̂)/(n−p)
Total       n−1   SST
Data Example: Life Expectancies (ANOVA)
Null Model: assuming an iid random sample (E(y_i) = β_0 · 1) results in SSE(β̂_0) = Σ_i (y_i − β̂_0)², where β̂_0 = ȳ. Thus, SSE(β̂_0) = Σ_i (y_i − ȳ)² ≡ SST in this case.
Nested Model: we assume that
 y = Xβ + ε = X_1β_1 + X_2β_2 + ε,  and test H_0: β_2 = 0,
with dim(β_1) = p_1 (including the intercept) and dim(β_2) = p_2 (additional slopes). The corresponding SSR and SSE terms are
 SSR(β̂_1) = Σ_{i=1}^n (x_i^T β̂_1 − ȳ)²,  SSE(β̂_1) = Σ_{i=1}^n (y_i − x_i^T β̂_1)².

Source    df      Sum of Squares (SS)                 MSS                                 F
X_1       p_1−1   SSR(β̂_1)                           MSR(β̂_1) = SSR(β̂_1)/(p_1−1)        MSR(β̂_1)/MSE(β̂)
X_2|X_1   p_2     SSR(β̂_2|β̂_1) = SSR(β̂) − SSR(β̂_1)   MSR(β̂_2|β̂_1) = SSR(β̂_2|β̂_1)/p_2   MSR(β̂_2|β̂_1)/MSE(β̂)
Data Example: Life Expectancies (ANOVA)
We now assume that the model y = β_0 + X_1β_1 + X_2β_2 + ε holds.
Test 1: the test statistic
 F = MSR(β̂_1|β̂_0) / MSE(β̂)
tests the model improvement when adding the predictors in X_1 to the iid model with only β_0.
Test 2: the test statistic
 F = MSR(β̂_2|β̂_1, β̂_0) / MSE(β̂)
tests the model improvement when adding the predictors in X_2 to the model that already contains X_1.
Problems:
• y_i does not follow Normal(E(y_i), var(y_i))
• E(y_i) ≠ x_i^T β ∈ ℝ
• var(y_i) ≠ σ² equal (homoscedastic) for all i = 1, ..., n
Remedies:
• transform y_i such that g(y_i) ind~ Normal(x_i^T β, σ²)
• utilize a GLM where y_i ind~ LEF(g^{−1}(x_i^T β), φV(μ_i))
 y(λ) = (y^λ − 1)/λ, if λ ≠ 0;  log y, if λ = 0.
Since y(λ) → log y for λ → 0, y(λ) is continuous in λ.
Assumption: there is a value λ for which
 y_i(λ) ind~ Normal(μ_i(λ) = x_i^T β(λ), σ²(λ)).
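As a quick numerical check of this continuity, a minimal R sketch (the helper name bc is hypothetical):

```r
# Hypothetical helper implementing the Box-Cox family y(lambda)
bc <- function(y, lambda) {
  if (abs(lambda) < 1e-10) log(y) else (y^lambda - 1) / lambda
}
bc(2.5, 0)       # log(2.5) = 0.9162907
bc(2.5, 1e-5)    # nearly identical, illustrating continuity at lambda = 0
```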
Density Transformation Theorem: If g(Y) ~ F_{g(Y)}(y) holds for a continuous r.v. and g(·) is a monotone function, then the untransformed r.v. Y has cdf
 F_Y(y) = Pr(Y ≤ y) = Pr(g(Y) ≤ g(y)) = F_{g(Y)}(g(y)).
Thus, the density of Y is
 f_Y(y) = ∂F_{g(Y)}(g(y))/∂y = f_{g(Y)}(g(y)) · |∂g(y)/∂y|
with Jacobian |∂g(y)/∂y|.
 f(y|λ, μ(λ), σ²(λ)) =
  (1/√(2πσ²(λ))) exp{ −((y^λ − 1)/λ − μ(λ))² / (2σ²(λ)) } · y^{λ−1},  λ ≠ 0,
  (1/√(2πσ²(λ))) exp{ −(log y − μ(λ))² / (2σ²(λ)) } · y^{−1},  λ = 0.
• If λ ≠ 0 and μ(λ) = x^T β(λ), then
 f(y|λ, μ(λ), σ²(λ)) = (1/√(2πλ²σ²(λ))) exp{ −(y^λ − 1 − λx^T β(λ))² / (2λ²σ²(λ)) } · |λ| y^{λ−1}.
Using β_0 = 1 + λβ_0(λ), β_j = λβ_j(λ), j = 1, ..., p−1, and σ² = λ²σ²(λ), this becomes
 f(y|λ, β, σ²) = (1/√(2πσ²)) exp{ −(y^λ − x^T β)² / (2σ²) } · |λ| y^{λ−1}.
• If λ = 0, let β_j = β_j(λ), j = 0, ..., p−1, and σ² = σ²(λ):
 f(y|0, β, σ²) = (1/√(2πσ²)) exp{ −(log y − x^T β)² / (2σ²) } · y^{−1}.
If λ were known, then the MLE could easily be computed!
• λ ≠ 0:
 ℓ(λ, β, σ²|y) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (y_i^λ − x_i^T β)² + n log|λ| + (λ−1) Σ_{i=1}^n log y_i
• λ = 0:
 ℓ(0, β, σ²|y) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (log y_i − x_i^T β)² − Σ_{i=1}^n log y_i
If λ were known, then the MLEs would be
 β̂_λ = (X^T X)^{−1} X^T y^λ,  λ ≠ 0;  (X^T X)^{−1} X^T log y,  λ = 0,
 σ̂²_λ = (1/n) SSE_λ(β̂_λ) = (1/n) Σ_{i=1}^n (y_i^λ − x_i^T β̂_λ)²,  λ ≠ 0;
     (1/n) Σ_{i=1}^n (log y_i − x_i^T β̂_λ)²,  λ = 0.
The resulting profile log-likelihood (up to a constant) is
 pℓ(λ|y) = −(n/2) log SSE_λ(β̂_λ) + n log|λ| + (λ−1) Σ_{i=1}^n log y_i,  λ ≠ 0;
      −(n/2) log SSE_0(β̂_0) − Σ_{i=1}^n log y_i,  λ = 0.
This is the sample log-likelihood function that has already been maximized with respect to β and σ².
Likelihood Ratio Test (LRT): H_0: λ = λ_0 versus H_1: λ ≠ λ_0. For the LRT statistic it holds that
 −2 (pℓ(λ_0|y) − pℓ(λ̂|y)) →_D χ²_1.
If −2(pℓ(λ_0|y) − pℓ(λ̂|y)) ~ χ²_1, a (1−α) confidence interval contains all values λ_0 for which
 −(pℓ(λ_0|y) − pℓ(λ̂|y)) < (1/2) χ²_{1;1−α}
(notice that χ²_{1;0.95} = 3.841, χ²_{1;0.99} = 6.635).
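In R, this confidence interval can be read off the grid returned by MASS::boxcox(); a sketch for the trees example used below (variable names lambda.hat and ci are hypothetical):

```r
# Sketch: MLE of lambda and a 95% profile-likelihood CI from boxcox()
library(MASS)
bc <- boxcox(Volume ~ Height + Girth, data = trees,
             lambda = seq(-0.5, 1, length = 200), plotit = FALSE)
lambda.hat <- bc$x[which.max(bc$y)]
# all lambda0 with p-loglik above max - qchisq(0.95, 1)/2 are in the CI
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
lambda.hat; ci    # MLE near 0.3; the CI excludes lambda = 1
```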
Log-transformation (λ = 0): if log y_i ~ Normal(x_i^T β, σ²), then
 E(log y_i) = x_i^T β,  var(log y_i) = σ².
The untransformed response y_i follows a log-normal distribution with
 median(y_i) = exp(x_i^T β),
 E(y_i) = exp(x_i^T β + σ²/2) = exp(x_i^T β) exp(σ²/2),
 var(y_i) = (exp(σ²) − 1) exp(2x_i^T β + σ²).
Power transformation (λ ≠ 0): if y_i^λ ~ Normal(x_i^T β, σ²), then
 median(y_i^λ) = E(y_i^λ) = x_i^T β = μ_i,  var(y_i^λ) = σ².
The untransformed response y_i follows a distribution with
 median(y_i) = μ_i^{1/λ},
 E(y_i) ≈ μ_i^{1/λ} (1 + σ²(1−λ)/(2λ²μ_i²)),  var(y_i) ≈ μ_i^{2/λ} σ²/(λ²μ_i²).
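The log-normal median/mean formulas can be checked by simulation; a small sketch with arbitrary values μ = 1.2, σ = 0.8:

```r
# Sketch: log-normal median and mean vs. their closed forms
set.seed(3)
mu <- 1.2; sigma <- 0.8
y <- exp(rnorm(1e6, mu, sigma))       # log(y) ~ Normal(mu, sigma^2)
median(y); exp(mu)                    # both near exp(1.2)
mean(y); exp(mu + sigma^2 / 2)        # the mean exceeds the median
```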
> H <- trees$Height; D <- trees$Girth; V <- trees$Volume
> plot(D, V); lines(lowess(D, V))   # curvature (wrong scale?)
> plot(H, V)                        # increasing variance?
> (mod <- lm(V ~ H + D))            # still fit a linear model for volume
Coefficients:
(Intercept)            H            D
   -57.9877       0.3393       4.7082
> plot(D, residuals(mod), ylab="residuals"); abline(0, 0)
> lines(lowess(D, residuals(mod)))  # sink in the middle
> library(MASS)
> bc <- boxcox(V ~ H + D, lambda=seq(0.0, 0.6, length=100), plotit=FALSE)
> ml.index <- which(bc$y == max(bc$y))
> bc$x[ml.index]
[1] 0.3090909
> boxcox(V ~ H + D, lambda = seq(0.0, 0.6, len = 18))  # plot it now
Is volume cubic in height and diameter?
> plot(D, V^(1/3), ylab=expression(V^{1/3}))
> lines(lowess(D, V^(1/3)))   # curvature almost gone
> (mod1 <- lm(V^(1/3) ~ H + D))
Coefficients:
(Intercept)            H            D
   -0.08539      0.01447      0.15152
For fixed λ = 1/3 the estimated median is median̂(V) = μ̂_{1/3}³, where E(V^{1/3}) = μ_{1/3}, and Ê(V) = μ̂_{1/3}³ (1 + 3σ̂²_{1/3}/μ̂²_{1/3}). Compare responses with estimated medians:
> mu <- fitted(mod1)
> plot(mu^3, V)   # fitted median model
Alternative strategy:
Remove the curvature by a log-transform of all predictors (i.e., regress on log(D) and log(H)). Should we also consider log(V) as response?
> plot(log(D), log(V))          # shows nice linear relationship
> lm(log(V) ~ log(H) + log(D))  # response log(V) or still V?
Coefficients:
(Intercept)      log(H)      log(D)
     -6.632       1.117       1.983
> boxcox(V ~ log(H) + log(D), lambda=seq(-0.35, 0.25, length=100))
Which of the models is better? Comparison by LRT. Both models are members of the model family
 V* ~ Normal(β_0 + β_1 H* + β_2 D*, σ²),
 V* = (V^{λ_V} − 1)/λ_V,  H* = (H^{λ_H} − 1)/λ_H,  D* = (D^{λ_D} − 1)/λ_D.
Compare the profile-likelihood function at λ_V = 1/3, λ_H = λ_D = 1 (E(V^{1/3}) = β_0 + β_1 H + β_2 D) with that at λ_V = λ_H = λ_D = 0 (E(log(V)) = β_0 + β_1 log(H) + β_2 log(D)).
> bc1 <- boxcox(V ~ H + D, lambda = 1/3, plotit=FALSE)
> bc1$y
[1] 25.33313
> bc2 <- boxcox(V ~ log(H) + log(D), lambda = 0, plotit=FALSE)
> bc2$y
[1] 26.11592
LRT statistic: −2(25.333 − 26.116) = 1.566 (not significant).
Remark: the coefficient of log(H) is close to 1 (β̂_1 = 1.117) and the coefficient of log(D) is close to 2 (β̂_2 = 1.983). A tree can be represented by a cylinder or a cone. The volume is πhd²/4 (cylinder) or πhd²/12 (cone), i.e.
 E(log(V)) = c + 1·log(H) + 2·log(D),
with c = log(π/4) (cylinder) or c = log(π/12) (cone).
Attention: D has to be converted from inches to feet ⇒ use D/12 as predictor.
Coefficients:
(Intercept)      log(H)   log(D/12)
     -1.705       1.117       1.983
Conversion only influences the intercept!
Fix the slopes (β_1, β_2) at (1, 2) and estimate only the intercept β_0, i.e. consider the model
 E(log(V)) = β_0 + 1·log(H) + 2·log(D/12).
> (mod3 <- lm(log(V) ~ 1 + offset(log(H) + 2*log(D/12))))
Coefficients:
(Intercept)
     -1.199
> log(pi/4)
[1] -0.2415645
> log(pi/12)
[1] -1.340177
The volume is better described by a cone than by a cylinder. However, it is slightly larger than the volume of a cone.
• GLM's provide a unified framework for regression models for independent response variables with covariates.
• While a linear model combines additivity of the covariate effects with normality of the errors, including variance homogeneity, GLM's don't need to satisfy these requirements. GLM's also allow one to handle non-normal responses such as binomial, Poisson and Gamma.
• Regression parameters are estimated using maximum likelihood.
• The standard reference on GLM's is McCullagh & Nelder (1989).
Data: (y_i, x_i) with x_i = (1, x_{i1}, ..., x_{i,p−1})^T, i = 1, ..., n.
1 Random Component:
 y_i, i = 1, ..., n, independent with density from the linear exponential family (LEF), i.e.
 f(y|θ, φ) = exp{ (yθ − b(θ))/φ + c(y, φ) },
where φ > 0 is a dispersion parameter and b(·) and c(·,·) are known functions.
2 Systematic Component:
 η_i = η_i(β) = x_i^T β is called the linear predictor;
 β = (β_0, ..., β_{p−1})^T are unknown regression parameters.
3 Parametric Link Component:
 The link function g(μ_i) = η_i combines the linear predictor with the mean μ_i = E(y_i).
For the normal distribution,
 f(y|μ, σ²) = (1/√(2πσ²)) exp{ −(y − μ)²/(2σ²) }
      = exp{ (yμ − μ²/2)/σ² − (1/2)[log(2πσ²) + y²/σ²] }.
Defining θ = μ and φ = σ² results in
 b(θ) = θ²/2  and  c(y, φ) = −(1/2)[log(2πφ) + y²/φ].
It can be shown that for the LEF
 E(y) = b′(θ) = μ,
 var(y) = φ b″(θ) = φ V(μ),
where V(μ) = b″(θ) is called the variance function.
Thus, we generally consider the model g(μ) = g(b′(θ)) = η. The canonical link is defined as
 g = (b′)^{−1}  ⇒  g(μ) = θ = x^T β.
The log-likelihood of the sample y_1, ..., y_n is
 ℓ(μ|y) = Σ_{i=1}^n [ (y_iθ_i − b(θ_i))/φ + c(y_i, φ) ].
The maximum likelihood estimator μ̂ is obtained by solving the score function (chain rule)
 s(μ) = ∂ℓ(μ|y)/∂μ = (∂ℓ(μ|y)/∂θ)(∂θ/∂μ) = ( (y_1 − μ_1)/(φV(μ_1)), ..., (y_n − μ_n)/(φV(μ_n)) )^T.
Because of μ = μ(β), the score function for the parameter β is (chain rule again)
 s(β) = ∂ℓ(β|y)/∂β = (∂ℓ(μ|y)/∂θ)(∂θ/∂μ)(∂μ/∂η)(∂η/∂β) = Σ_{i=1}^n (y_i − μ_i)/(φV(μ_i)) · (1/g′(μ_i)) · x_i,
which again depends only on the mean/variance relationship.
For the sample y_1, ..., y_n we assumed that there is only one global dispersion parameter φ, i.e. E(y_i) = μ_i, var(y_i) = φV(μ_i). Hence β̂ solves the score equation
 Σ_{i=1}^n (y_i − μ̂_i)/V(μ̂_i) · (1/g′(μ̂_i)) · x_i = 0,
which doesn't depend on φ and where g(μ̂_i) = x_i^T β̂. Notice, if a canonical link (g(μ) = θ) is used, we have
 g′(μ) = ∂θ/∂μ = 1/(∂μ/∂θ) = 1/(∂b′(θ)/∂θ) = 1/b″(θ) = 1/V(μ),
and the above score equation simplifies to
 Σ_{i=1}^n (y_i − μ̂_i) x_i = 0.
A general method to solve the score equation is the iterative algorithm Fisher's Method of Scoring (derived from a Taylor expansion of s(β)).
In the t-th iteration, the new estimate β^{(t+1)} is obtained from the previous one β^{(t)} by
 β^{(t+1)} = β^{(t)} + [ −E( ∂s(β)/∂β^T ) |_{β=β^{(t)}} ]^{−1} s(β^{(t)}).
Therefore, the speciality is the usage of the expected instead of the observed Hessian matrix.
This iteration can be rewritten as
 β^{(t+1)} = (X^T W^{(t)} X)^{−1} X^T W^{(t)} z^{(t)}
with the vector of pseudo-observations z = (z_1, ..., z_n)^T and diagonal weight matrix W defined by
 z_i = g(μ_i) + g′(μ_i)(y_i − μ_i),
 w_i = 1/(V(μ_i) g′(μ_i)²).
Since
 β^{(t+1)} = (X^T W^{(t)} X)^{−1} X^T W^{(t)} z^{(t)},
the estimate β̂ is calculated using an Iteratively (Re-)Weighted Least Squares (IWLS) algorithm:
1 start with initial guesses μ_i^{(0)} (e.g. μ_i^{(0)} = y_i or μ_i^{(0)} = y_i + c)
2 calculate working responses z_i^{(t)} and weights w_i^{(t)}
3 calculate β^{(t+1)} by weighted least squares
4 repeat steps 2 and 3 until convergence.
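The four steps above can be sketched as a bare-bones IWLS loop; here for a Poisson model with log link (so V(μ) = μ and g′(μ) = 1/μ, hence w_i = μ_i), checked against glm() on simulated data (all names hypothetical):

```r
# Sketch: manual IWLS for a Poisson GLM with log link, compared to glm()
set.seed(4)
n <- 200
x <- runif(n); X <- cbind(1, x)
y <- rpois(n, exp(1 + 2 * x))

mu <- y + 0.1                              # step 1: initial guesses
for (t in 1:25) {                          # steps 2-4: iterate to convergence
  z <- log(mu) + (y - mu) / mu             # working response z_i
  w <- mu                                  # w_i = 1/(V(mu) g'(mu)^2) = mu
  beta <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
  mu <- exp(as.vector(X %*% beta))         # update fitted means
}
fit <- glm(y ~ x, family = poisson)
all.equal(as.vector(beta), unname(coef(fit)), tolerance = 1e-6)
```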
Asymptotically,
 β̂ ~ Normal(β, φ(X^T W X)^{−1}).
Thus, standard errors of the estimators β̂_j are the respective diagonal elements of the estimated variance/covariance matrix
 var̂(β̂) = φ(X^T Ŵ X)^{−1},
with Ŵ = W(μ̂). Note that (X^T Ŵ X)^{−1} is a by-product of the last IWLS iteration. If φ is unknown, an estimator is required.
A method-of-moments-like estimator is developed by considering the ratios
 φ = E(y_i − μ_i)²/V(μ_i),  for all i = 1, ..., n.
Averaging over all these ratios and assuming that the μ_i's are known results in the estimator
 (1/n) Σ_{i=1}^n (y_i − μ_i)²/V(μ_i).
However, since β is unknown we better use the bias-corrected version (also known as the mean generalized Pearson's chi-square statistic)
 φ̂ = (1/(n−p)) Σ_{i=1}^n (y_i − μ̂_i)²/V(μ̂_i) = (1/(n−p)) X².
Generalized linear models can be fitted in R using the glm function, which is similar to lm for fitting linear models. The arguments to a glm call are as follows:
glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = glm.control(...), model = TRUE,
    method = "glm.fit", x = FALSE, y = TRUE,
    contrasts = NULL, ...)
Formula argument:
The formula is specified for a glm as, e.g.,
 y ~ x1 + x2
where x1 and x2 are the names of
• numeric vectors (continuous predictors)
• factors (categorical predictors)
All the variables used in the formula must be in the workspace or in the data frame passed to the data argument.
Other symbols that can be used in the formula are:
• a:b for the interaction between a and b
• a*b which expands to 1 + a + b + a:b
• . for first-order terms of all variables in data
• - to exclude a term (or terms)
• 1 intercept (default)
• -1 without intercept
Family argument:
The family argument defines the response distribution (variance function) and the link function. The exponential family functions available in R are, e.g.,
• gaussian(link = "identity")
• binomial(link = "logit")
• poisson(link = "log")
• Gamma(link = "inverse")
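Each family function returns an object bundling the link and variance functions, which can be inspected directly; a quick sketch:

```r
# Sketch: a family object carries linkfun, linkinv and the variance function
fam <- Gamma(link = "log")
fam$family                    # "Gamma"
fam$linkfun(2)                # g(mu) = log(2)
fam$linkinv(fam$linkfun(2))   # g^{-1}(g(2)) = 2
fam$variance(3)               # V(mu) = mu^2, so 9
```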
The glm function returns an object of class c("glm", "lm").
There are several methods available to access or display components of aglmobject, e.g.
• residuals()
• fitted()
• predict()
• coef()
• deviance()
• summary()
The first part contains the same information as from lm():
> mod <- glm(life.exp ~ urban + log(physicians) + temp)
> summary(mod)
Call:
glm(formula = life.exp ~ urban + log(physicians) + temp)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-20.4925  -2.6337   0.7898   3.3447  10.4813

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     67.37488    1.85916  36.239  < 2e-16 ***
urban            8.13713    2.59353   3.137  0.00211 **
log(physicians)  3.50002    0.40578   8.625 2.01e-14 ***

Since the default family="gaussian", deviance residuals correspond to ordinary residuals as in a linear model. A five-number summary of those raw residuals is given.
Remember that for the MLE it asymptotically holds that
 β̂ ~ Normal(β, φ(X^T W X)^{−1}).
Thus, we can utilize this to construct a test statistic on the significance of a coefficient, say β_j for j = 1, ..., p−1. If we test
 H_0: β_j = 0  versus  H_1: β_j ≠ 0,
we can use the test statistic
 t = β̂_j / √( φ̂ (X^T Ŵ X)^{−1}_{j+1,j+1} ),
which under H_0 asymptotically follows a t distribution with n−p degrees of freedom.
The second part contains some new information on estimated dispersion and goodness-of-fit aspects which we will discuss later in detail.
First the dispersion estimate (if necessary) φ̂ is provided:
(Dispersion parameter for gaussian family taken to be 24.577)
This estimate is simply the squared residual standard error (which was 4.958 in the summary(lm()) output; 4.958² ≈ 24.58).
Next there is the deviance of two models and the number of missing observations:
    Null deviance: 10782.0 on 132 degrees of freedom
Residual deviance:  3170.5 on 129 degrees of freedom
  (39 observations deleted due to missingness)
The first refers to the null model which corresponds to a model with intercept only (the iid assumption, no explanatory variables). The associated degrees of freedom are n−1. The second refers to our fitted model with p−1 explanatory variables in the predictor and, thus, with associated degrees of freedom n−p.
The deviance of a model is defined as the distance of log-likelihoods, i.e.
 D(y, μ̂) = −2φ (ℓ(μ̂|y) − ℓ(y|y)).
Here, μ̂ are the fitted values under the considered model (maximizing the log-likelihood under the given parametrization), and y denotes the estimated means under a model without any restriction at all (thus μ̂ = y in such a saturated model).
For any member of the LEF the deviance equals
 D(y, μ̂) = −2φ Σ_{i=1}^n [ (y_iθ̂_i − y_iθ̃_i) − (b(θ̂_i) − b(θ̃_i)) ]/φ
      = −2 Σ_{i=1}^n { (y_iθ̂_i − y_iθ̃_i) − (b(θ̂_i) − b(θ̃_i)) },
where θ̃_i denotes the estimate of θ_i under the saturated model.
Under the saturated model, there are as many mean parameters µi allowed as observationsyi.
For normal responses (φ = σ²) we have
 ℓ(μ̂|y) = −(n/2) log(2πσ²) − (1/2) Σ_{i=1}^n (y_i − μ̂_i)²/σ²,
 ℓ(y|y) = −(n/2) log(2πσ²).
Therefore the deviance equals the sum of squared errors, i.e.
 D(y, μ̂) = −2φ (ℓ(μ̂|y) − ℓ(y|y)) = Σ_{i=1}^n (y_i − μ̂_i)² = SSE(β̂).
Finally we have
AIC: 809.22
Number of Fisher Scoring iterations: 2
The Akaike Information Criterion (AIC) also assesses the fit, penalizing for the total number of parameters p+1 (linear predictor and dispersion in this case), and is defined as
 AIC = −2ℓ(μ̂|y) + 2(p+1).
The smaller the AIC value the better the fit. Use AIC only to compare different models (not necessarily nested).
Sometimes, the term −2ℓ(μ̂|y) is called the disparity.
residuals(object, type = c("deviance", "pearson", "working",
                           "response", "partial"), ...)
• deviance: write the deviance as Σ_{i=1}^n d(y_i, μ̂_i)²; then r_i^D = sign(y_i − μ̂_i) d(y_i, μ̂_i)
• pearson: r_i^P = (y_i − μ̂_i)/√V(μ̂_i)
• working: r_i^W = ẑ_i − η̂_i = (y_i − μ̂_i) g′(μ̂_i) (remember that g′(μ̂_i) = 1/V(μ̂_i) for canonical link models)
• response: y_i − μ̂_i
• partial: r_i^P + β̂_j x_{ij} is the partial residual for the j-th covariate
Deviance residuals are the default used in R since they reflect the same criterion as used in the fitting.
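These definitions can be verified against residuals(); a short sketch using a Gamma fit (V(μ) = μ²) on the built-in trees data:

```r
# Sketch: Pearson and response residuals match their definitions
fit <- glm(Volume ~ log(Height) + log(Girth), data = trees,
           family = Gamma(link = "log"))
mu <- fitted(fit); y <- trees$Volume
# Pearson: (y - mu)/sqrt(V(mu)) with V(mu) = mu^2, i.e. (y - mu)/mu
all.equal(unname(residuals(fit, type = "pearson")), unname((y - mu) / mu))
# response: y - mu
all.equal(unname(residuals(fit, type = "response")), unname(y - mu))
```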
Plot deviance residuals against fitted values:
> plot(residuals(mod) ~ fitted(mod),
+      xlab = expression(hat(mu)[i]),
+      ylab = expression(r[i]))
> abline(0, 0, lty = 2)
The plot() function gives the following sequence of plots:
• deviance residuals vs. fitted values
• Normal Q-Q plot of deviance residuals standardized to unit variance
• scale-location plot of standardized deviance residuals
• standardized deviance residuals vs. leverage with Cook’s distance contours
¿ plot(mod)
So far we considered (Box-Cox transformation) models like
• V_i^{1/3} ind~ Normal(μ_i, σ²), with E(V^{1/3}) = μ and predictor H + D
• log(V_i) ind~ Normal(μ_i, σ²), with E(log(V)) = μ and predictor log(H) + log(D)
In what follows we will assume that a GLM holds with
• V_i ind~ Normal(μ_i, σ²) and g(E(V)) = η.
More specifically, we like to check out the models:
• μ^{1/3} = η with predictor H + D
• log(μ) = η with predictor log(H) + log(D).
These models on the observation scale can easily be fitted using glm().
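The slides do not show the actual glm() calls; presumably they look like the following sketch (the power(1/3) link specification and the object names pmodel, lmodel are assumptions):

```r
# Sketch (assumed calls): normal GLMs for volume on the observation scale
H <- trees$Height; D <- trees$Girth; V <- trees$Volume
# mu^{1/3} = eta, via the power(1/3) link
pmodel <- glm(V ~ H + D, family = gaussian(link = power(1/3)))
# log(mu) = eta, via the log link
lmodel <- glm(V ~ log(H) + log(D), family = gaussian(link = "log"))
coef(pmodel)
coef(lmodel)
```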
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.051322   0.224095  -0.229 0.820518
H            0.014287   0.003342   4.274 0.000201 ***
D            0.150331   0.005838  25.749  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 6.577063)
    Null deviance: 8106.08 on 30 degrees of freedom

> AIC(pmodel)
[1] 151.2102
> -2*logLik(pmodel) + 2*4
'log Lik.' 151.2102 (df=4)
> logLik(pmodel)
'log Lik.' -71.60508 (df=4)
> sum(log(dnorm(V, pmodel$fit, sqrt(summary(pmodel)$disp*28/31))))
[1] -71.60508
> sum(residuals(pmodel)^2)
[1] 184.1577
> deviance(pmodel)
[1] 184.1577
> sum((V-mean(V))^2)   # null deviance
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.53700    0.94352  -6.928 1.57e-07 ***
log(H)       1.08765    0.24216   4.491 0.000111 ***
log(D)       1.99692    0.08208  24.330  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 6.41642)
    Null deviance: 8106.08 on 30 degrees of freedom
Gamma responses: y ~ Gamma(a, λ) with density function
 f(y|a, λ) = exp(−λy) λ^a y^{a−1} / Γ(a),  a, λ, y > 0,
with E(y) = a/λ and var(y) = a/λ².
Mean parametrization needed! With μ = a/λ and ν = a,
 f(y|μ, ν) = exp(−νy/μ) (ν/μ)^ν y^{ν−1} / Γ(ν)
      = exp{ (y(−1/μ) − log μ)/(1/ν) + ν log ν + (ν−1) log y − log Γ(ν) }.
LEF member with:
 θ = −1/μ,  b(θ) = log μ = −log(−θ),  φ = 1/ν.
Thus,
 E(y) = b′(θ) = −1/θ = μ,
 var(y) = φ b″(θ) = φ/θ² = φμ²,
with dispersion φ = 1/ν and variance function V(μ) = μ².
Coefficient of variation:
 √var(y_i)/E(y_i) = √(φμ_i²)/μ_i = √φ = constant for all i = 1, ..., n.
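A quick simulation sketch of this constant coefficient of variation (arbitrary ν = 4, so √φ = 0.5):

```r
# Sketch: CV = sqrt(phi) regardless of the mean under the Gamma model
set.seed(5)
nu <- 4                                  # phi = 1/nu
mus <- c(1, 5, 20)
cv <- sapply(mus, function(m) {
  y <- rgamma(1e5, shape = nu, scale = m / nu)   # E(y)=m, var(y)=m^2/nu
  sd(y) / mean(y)
})
cv; sqrt(1 / nu)                         # all three CVs close to 0.5
```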
> y <- (1:400)/100
> shape <- 0.9
> scale <- 1.5
> plot(y, dgamma(y, shape=shape, scale=scale))
> mean(rgamma(10000, shape=shape, scale=scale)); shape*scale
[1] 1.374609
[1] 1.35
> var(rgamma(10000, shape=shape, scale=scale)); shape*(scale)^2
[1] 2.001009
[1] 2.025
Gamma distributions are generally skewed to the right (plots: shape < 1, here 0.9, on the left; shape > 1 on the right).
Special cases: ν = 1/φ = 1 (exponential) and ν → ∞ (normal).
What's an appropriate link function?
• Canonical link function: η = θ = −1/μ (inverse link). Since we need μ > 0, we need η < 0, giving a complicated restriction on β.
• Thus, the log link is often used, without restrictions on η, i.e.
 log μ = η.
Then
 ℓ(μ̂, φ|y) = Σ_{i=1}^n [ (y_i(−1/μ̂_i) − log μ̂_i)/φ + c(y_i, φ) ],
 ℓ(y, φ|y) = Σ_{i=1}^n [ (y_i(−1/y_i) − log y_i)/φ + c(y_i, φ) ],
and thus the scaled deviance equals
 (1/φ) D(y, μ̂) = −(2/φ) Σ_{i=1}^n [ (−y_i/μ̂_i − log μ̂_i) − (−1 − log y_i) ]
        = −(2/φ) Σ_{i=1}^n [ log(y_i/μ̂_i) − (y_i − μ̂_i)/μ̂_i ].
It remains to estimate the dispersion parameter. We have a sample y_1, ..., y_n with
 E(y_i) = μ_i  and  var(y_i) = φμ_i²,  i = 1, ..., n.
Consider z_i = y_i/μ_i with E(z_i) = 1 and var(z_i) = φ (the z_i are iid). Thus,
 φ̂ = (1/(n−p)) Σ_{i=1}^n (y_i/μ̂_i − 1)² = (1/(n−p)) Σ_{i=1}^n ((y_i − μ̂_i)/μ̂_i)²,
which is equivalent to the mean Pearson statistic.
We now assume that life expectancy follows a gamma model.
> gmod <- glm(life.exp ~ urban + log(physicians) + temp,
+             family=Gamma(link="log"))
> summary(gmod)
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.2149307  0.0278759 151.203  < 2e-16 ***
urban            0.0981231  0.0388869   2.523   0.0128 *
log(physicians)  0.0545432  0.0060841   8.965 3.04e-15 ***
temp            -0.0007257  0.0009359  -0.775   0.4395
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given urban and log(physicians) are in the model, temp seems to be irrelevant as an additional predictor.

(Dispersion parameter for Gamma family taken to be 0.005525322)
The dispersion estimate φ̂ is the mean Pearson statistic:
> # direct from summary(.)
> summary(gmod)$dispersion
[1] 0.005525322
> # or explicitly calculated as
> sum(residuals(gmod, type="pearson")^2)/gmod$df.resid
[1] 0.005525322
    Null deviance: 2.37804 on 132 degrees of freedom
Residual deviance: 0.75412 on 129 degrees of freedom
  (39 observations deleted due to missingness)
AIC: 829.14
Number of Fisher Scoring iterations: 4
For the scaled deviance we get
 (1/φ̂) D(y, μ̂) = 0.7541/0.0055 = 136.4838,
which is pretty close to its associated degrees of freedom, 129.
Reject model (∗) at level α if
 (1/φ) D(y, μ̂) > χ²_{1−α; n−p}.
Since the dispersion φ is unknown, we use its estimate φ̂ instead and reject model (∗) if
 (1/φ̂) D(y, μ̂) > χ²_{1−α; n−p}.
Partial Deviance Test:
Consider the model g(μ) = X_1β_1 + X_2β_2 with dim(β_1) = p_1, dim(β_2) = p_2 and p = p_1 + p_2. Now calculate
• μ̂_1 = g^{−1}(X_1β̂_1): the fitted means under the reduced model with design X_1 only (corresponds to H_0: β_2 = 0)
• μ̂_2 = g^{−1}(X_1β̂_1 + X_2β̂_2): the fitted means under the full model with designs X_1 and X_2
• φ̂ = X²/(n−p): dispersion estimate under the full model
Reject H_0 at level α if
 [ (D(y, μ̂_1) − D(y, μ̂_2)) / p_2 ] / φ̂ > F_{1−α; p_2, n−p}.
> (dev2 <- deviance(gmod))
[1] 0.7541171
> (hatphi <- sum(residuals(gmod, type="pearson")^2)/gmod$df.r)
[1] 0.005525322
> gmod1 <- glm(life.exp ~ urban + log(physicians),
+              family=Gamma(link="log"))
> (dev1 <- deviance(gmod1))
[1] 0.7574799
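With dev1, dev2 and hatphi available, the partial F test for dropping temp (p_2 = 1, n − p = 129) can be completed; a sketch using the deviance and dispersion values reported above:

```r
# Sketch: partial deviance F test for H0: beta_temp = 0
dev1 <- 0.7574799; dev2 <- 0.7541171       # deviances from the slides
hatphi <- 0.005525322                      # mean Pearson dispersion estimate
Fstat <- ((dev1 - dev2) / 1) / hatphi      # (D1 - D2)/p2, scaled by phi-hat
Fstat                                      # small F value
1 - pf(Fstat, 1, 129)                      # large p-value: temp not needed
```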