An Introduction based on
Herwig Friedl, Institute of Statistics
Graz University of Technology, Austria
• This course provides an introduction to the concepts of generalized linear models (GLM's).
• It extends the class of linear models (LM's) to regression models for non-normal data.
• Special interest will be on binary data (logistic regression) and count data (log-linear models).
• All models will be handled using R functions such as lm, anova, or glm.
• Linear Models (LM’s): Recap of Results
• Box-Cox Transformation Family: Extending the LM
• Generalized Linear Models (GLM’s): An Introduction
• Linear Exponential Family (LEF): Properties and Members
• GLM’s: Parameter Estimates
• GLM’s: glm(.) Function
• Gamma Regression Models
• Logistic Regression (Binomial Responses)
• Log-linear Model (Poisson Responses)
• Multinomial Response Models
• Quasi-Likelihood Models (Overdispersion)
The goal of regression models is to determine how a response variable depends on covariates. A special class of regression models are linear models. The general setup is given by
• Data (y_i, x_{i1}, ..., x_{i,p−1}), i = 1, ..., n
• Response y = (y_1, ..., y_n)^T (random variable)
• Covariates x_i = (x_{i1}, ..., x_{i,p−1})^T (fixed, known)
Data Example: Life Expectancies
Source: The World Bank makes available a wealth of data from the World Development Indicators, accessible within R:
> install.packages('WDI'); library(WDI)
> WDIsearch('gdp')   # gives a list of available data about gdp
> d <- WDI(indicator='NY.GDP.PCAP.KD', country=c('AT', 'US'),
+          start=1960, end=2012)
> head(d)
  iso2c country NY.GDP.PCAP.KD year
1    AT Austria       41246.57 2013
2    AT Austria       41366.52 2012
3    AT Austria       41120.22 2011
4    AT Austria       39972.94 2010
5    AT Austria       39350.17 2009
> install.packages('gdata')
# also needs Perl (for MS Windows) to be installed
# also needs Java with the same structure as R (32/64 bit)
> library(gdata)
> temp.xls <- "http://databank.worldbank.org/data/download/catalog/
+ cckp_historical_data_0.xls"
> myperl <- "c:/Strawberry/perl/bin/perl.exe"
> sheetCount(temp.xls, perl=myperl)
Downloading...
trying URL 'http://databank.worldbank.org/data/.../*.xls'
Content type 'application/vnd.ms-excel' length 378368 bytes
Data Example: Life Expectancies
> temp <- read.xls(temp.xls, sheet="Country_temperatureCRU",
+                  perl=myperl)
> temp.data <- temp[ , c("ISO_3DIGIT", "Annual_temp")]
> colnames(temp.data) <- c("iso3c", "temp")
> head(temp.data)
  iso3c  temp
1   AFG 12.92
2   AGO 21.51
3   ALB 11.27
4   ARE 26.83
5   ARG 14.22
6   ARM  6.37
Data Example: Life Expectancies
Data we are interested in (from 2010):
• life.exp: life expectancy at birth, total (years)
• urban: urban population (percent)
• physicians: physicians (per 1,000 people)
• temp: annual mean temperature (Celsius)
Which is the response and which are covariates?
Gaussian Linear Model:
 y_i = β_0 + β_1 x_{i1} + ... + β_{p−1} x_{i,p−1} + ε_i,  ε_i iid ~ Normal(0, σ²),
with unknown regression parameters β_0, β_1, ..., β_{p−1} (intercept β_0, slopes β_j, j = 1, ..., p−1) and unknown (homogeneous) error variance σ².
This is equivalent to y_i ind~ Normal(E(y_i), var(y_i)), where
 E(y_i) = μ_i = β_0 + β_1 x_{i1} + ... + β_{p−1} x_{i,p−1}
is a linear function in the parameters, and var(y_i) = σ², i = 1, ..., n, describes a homoscedastic scenario.
Define
 β = (β_0, β_1, ..., β_{p−1})^T,  x_i = (1, x_{i1}, ..., x_{i,p−1})^T,  X = (x_1, ..., x_n)^T
and write the Gaussian regression model as y = Xβ + ε with
 E(y) = μ = Xβ  and  var(y) = σ² I_n.
Exploratory Data Analysis (EDA):
• Check out the ranges of the response and covariates. For discrete covariates (with sparse factor levels) we consider grouping the levels.
• Plot covariates against the response. The scatter plots should reflect linear relationships; otherwise we consider transformations.
• To check whether the constant-variance assumption is reasonable, the points of the scatter plot of covariates against the response should be contained in a horizontal band.
> summary(mydata[, c(5, 6, 8, 10)])
    life.exp         urban           physicians        temp
 Min.   :44.84   Min.   :0.09092   Min.   :0.008   Min.   :-7.14
 1st Qu.:62.16   1st Qu.:0.37396   1st Qu.:0.272   1st Qu.:10.15
 Median :71.83   Median :0.56297   Median :1.764   Median :21.58
 Mean   :69.11   Mean   :0.55892   Mean   :1.871   Mean   :18.06
 3rd Qu.:76.10   3rd Qu.:0.74453   3rd Qu.:3.000   3rd Qu.:25.05
 Max.   :82.84   Max.   :0.98655   Max.   :7.739   Max.   :28.30
                                   NA's   :1       NA's   :39
> plot(mydata[, c(5, 6, 8, 10)])
> plot(log(physicians), life.exp)
Idea of Least Squares: minimize the sum of squared errors, i.e.
 SSE(β) = Σ_{i=1}^n (y_i − x_i^T β)².
Equivalent with Maximum Likelihood: maximize the sample log-likelihood function
 ℓ(β|y) = Σ_{i=1}^n [ log(1/√(2πσ²)) − (1/(2σ²)) (y_i − x_i^T β)² ].
LSE/MLE solution: β̂ = (X^T X)^{−1} X^T y. For y_i ind~ Normal(x_i^T β, σ²) we have
 β̂ ~ Normal(β, σ²(X^T X)^{−1}).
The MLE of the error variance,
 σ̂² = (1/n) SSE(β̂) = (1/n) Σ_{i=1}^n (y_i − x_i^T β̂)²,  with E(σ̂²) = ((n−p)/n) σ²,
is biased. An unbiased variance estimator is the (df-corrected)
 S² = SSE(β̂)/(n−p).
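These closed-form results can be checked numerically; a minimal R sketch on simulated data (all object names are hypothetical):

```r
# Sketch: compare the closed-form LSE/MLE and S^2 with lm() on simulated data
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design with intercept
beta <- c(2, 1, -0.5)
y <- as.vector(X %*% beta + rnorm(n, sd = 2))

beta.hat <- solve(t(X) %*% X, t(X) %*% y)             # (X'X)^{-1} X'y
S2 <- sum((y - X %*% beta.hat)^2) / (n - p)           # unbiased variance estimator

fit <- lm(y ~ X - 1)                                  # same fit via lm()
all.equal(as.vector(beta.hat), unname(coef(fit)), tolerance = 1e-6)
all.equal(S2, summary(fit)$sigma^2, tolerance = 1e-6)
```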
For y_i ind~ Normal(x_i^T β, σ²) we get
 Σ_{i=1}^n (y_i − ȳ)²  =  Σ_{i=1}^n (μ̂_i − ȳ)²  +  Σ_{i=1}^n (y_i − μ̂_i)²
   [SST]                  [SSR(β̂)]                [SSE(β̂)]
Total SS equals (maximized) Regression SS plus (minimized) Error SS. Thus, the proportion of variability explained by the regression model is described by the coefficient of determination
 R² = SSR(β̂)/SST = 1 − SSE(β̂)/SST ∈ (0, 1).
To penalize for model complexity p we use its adjusted version
 R²_adj = 1 − (SSE(β̂)/(n−p)) / (SST/(n−1)),
which is no longer guaranteed to lie in (0, 1).
Thus, for each slope parameter β_j, j = 1, ..., p−1, we have
 β̂_j ~ Normal(β_j, σ²(X^T X)^{−1}_{j+1,j+1})
and therefore
 (β̂_j − β_j) / √(σ²(X^T X)^{−1}_{j+1,j+1}) ~ Normal(0, 1).
Since S² and β̂ are independent, replacing σ² by S² results in the
Hypothesis Tests: t-Test
 (β̂_j − β_j) / √(S²(X^T X)^{−1}_{j+1,j+1}) ~ t_{n−p}.
Therefore, we can test the relevance of a single predictor x_j by
 H_0: β_j = 0  vs  H_1: β_j ≠ 0
and use the well-known test statistic
 Estimate/Std. Error = β̂_j / √(S²(X^T X)^{−1}_{j+1,j+1}),
which under H_0 follows a t_{n−p} distribution.
 μ = β_0 + β_Af I(Africa) + β_Am I(America) + β_As I(Asia)
To check if the predictor continent is irrelevant we have to test k−1 parameters simultaneously,
 H_0: β_Af = β_Am = β_As = 0  vs  H_1: not H_0.
Fitting the model twice, under H_0 and under H_1, results in two SSE values that can be compared by an F test.
 y = Xβ + ε,  ε ~ Normal(0, σ²W),  W = diag(w_1, ..., w_n)
The MLE (weighted LSE) of β is given by β̂ = (X^T W^{−1} X)^{−1} X^T W^{−1} y with
 E(β̂) = β  and  var(β̂) = σ²(X^T W^{−1} X)^{−1}.
The MLE of σ² is
 σ̂² = (1/n) Σ_{i=1}^n (y_i − μ̂_i)²/w_i = (1/n) r^T W^{−1} r.
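A small sketch of the weighted estimator, assuming known weights w_i, checked against lm() (whose weights argument expects the precisions 1/w_i); all names are hypothetical:

```r
# Sketch: weighted LSE from the closed form vs. lm(..., weights = 1/w)
set.seed(2)
n <- 50
x <- runif(n); X <- cbind(1, x)
w <- runif(n, 0.5, 2)                         # var(eps_i) = sigma^2 * w_i
y <- as.vector(X %*% c(1, 3) + rnorm(n, sd = sqrt(w)))

Winv <- diag(1 / w)
beta.hat <- solve(t(X) %*% Winv %*% X, t(X) %*% Winv %*% y)

fit <- lm(y ~ x, weights = 1 / w)             # lm() weights are precisions
all.equal(as.vector(beta.hat), unname(coef(fit)), tolerance = 1e-6)
```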
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 61.49091    2.21534  27.757  < 2e-16 ***
urban       15.06895    2.94544   5.116  1.1e-06 ***
physicians   1.87133    0.46933   3.987 0.000111 ***
temp        -0.19079    0.07287  -2.618 0.009896 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.874 on 129 degrees of freedom
Data Example: Life Expectancies

All predictors are significant (at the 5% level); only temp has a negative effect.

Residual standard error: 5.874 on 129 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared: 0.5872,    Adjusted R-squared: 0.5776
F-statistic: 61.17 on 3 and 129 DF,  p-value: < 2.2e-16

Under the model, the estimated standard error of the response is 5.874 (years). We have n − p = 129 and p − 1 = 3 predictors. Almost 60% of the total variability is explained by this model; the adjusted version of R² is 0.5776.
Data Example: Life Expectancies (log(physicians))
> mod.log <- update(mod, . ~ . - physicians + log(physicians))
> summary(mod.log)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     67.37488    1.85916  36.239  < 2e-16 ***
urban            8.13713    2.59353   3.137  0.00211 **
temp            -0.05590    0.06242  -0.895  0.37219
log(physicians)  3.50002    0.40578   8.625 2.01e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictor temp is no longer significant!

Residual standard error: 4.958 on 129 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared: 0.7059,    Adjusted R-squared: 0.6991
F-statistic: 103.2 on 3 and 129 DF,  p-value: < 2.2e-16

The residual standard error is much smaller now than before (roughly ±5 years)! Even 70% of the total variability is now explained by this model.
Data Example: Life Expectancies (ANOVA)
> anova(mod.log)
Analysis of Variance Table
Response: life.exp
                 Df Sum Sq Mean Sq F value    Pr(>F)
urban             1 4813.0  4813.0 195.831 < 2.2e-16 ***
temp              1  970.1   970.1  39.470 4.709e-09 ***
log(physicians)   1 1828.5  1828.5  74.399 2.010e-14 ***
Residuals       129 3170.5    24.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Information about this is contained in the ANOVA table:

Source      df    Sum of Squares (SS)       MSS                       F
Regression  p−1   SSR(β̂) = SST − SSE(β̂)   MSR(β̂) = SSR(β̂)/(p−1)    MSR(β̂)/MSE(β̂)
Error       n−p   SSE(β̂)                   MSE(β̂) = SSE(β̂)/(n−p)
Total       n−1   SST
Data Example: Life Expectancies (ANOVA)
Null Model: assuming an iid random sample (E(y_i) = β_0 · 1) results in SSE(β̂_0) = Σ_i (y_i − β̂_0)², where β̂_0 = ȳ. Thus, SSE(β̂_0) = Σ_i (y_i − ȳ)² ≡ SST in this case.
Nested Model: we assume that
 y = Xβ + ε = X_1β_1 + X_2β_2 + ε,  and test H_0: β_2 = 0,
with dim(β_1) = p_1 (including the intercept) and dim(β_2) = p_2 (additional slopes). The corresponding SSR and SSE terms are
 SSR(β̂_1) = Σ_{i=1}^n (x_i^T β̂_1 − ȳ)²,  SSE(β̂_1) = Σ_{i=1}^n (y_i − x_i^T β̂_1)².

Source    df      Sum of Squares (SS)                 MSS                                 F
X_1       p_1−1   SSR(β̂_1)                           MSR(β̂_1) = SSR(β̂_1)/(p_1−1)        MSR(β̂_1)/MSE(β̂)
X_2|X_1   p_2     SSR(β̂_2|β̂_1) = SSR(β̂) − SSR(β̂_1)   MSR(β̂_2|β̂_1) = SSR(β̂_2|β̂_1)/p_2   MSR(β̂_2|β̂_1)/MSE(β̂)
Data Example: Life Expectancies (ANOVA)
We now assume that the model y = β_0 + X_1β_1 + X_2β_2 + ε holds.
Test 1: the test statistic
 F = MSR(β̂_1|β̂_0) / MSE(β̂)
tests the model improvement when adding the predictors in X_1 to the iid model with only β_0.
Test 2: the test statistic
 F = MSR(β̂_2|β̂_1, β̂_0) / MSE(β̂)
tests the model improvement when adding the predictors in X_2 to the model that already contains X_1.
Problems:
• y_i does not follow Normal(E(y_i), var(y_i))
• E(y_i) ≠ x_i^T β ∈ ℝ
• var(y_i) ≠ σ² equal (homoscedastic) for all i = 1, ..., n
Remedies:
• transform y_i such that g(y_i) ind~ Normal(x_i^T β, σ²)
• utilize a GLM where y_i ind~ LEF(g^{−1}(x_i^T β), φV(μ_i))
 y(λ) = (y^λ − 1)/λ, if λ ≠ 0;  log y, if λ = 0.
Since y(λ) → log y for λ → 0, y(λ) is continuous in λ.
Assumption: there is a value λ for which
 y_i(λ) ind~ Normal(μ_i(λ) = x_i^T β(λ), σ²(λ)).
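As a quick numerical check of this continuity, a minimal R sketch (the helper name bc is hypothetical):

```r
# Hypothetical helper implementing the Box-Cox family y(lambda)
bc <- function(y, lambda) {
  if (abs(lambda) < 1e-10) log(y) else (y^lambda - 1) / lambda
}
bc(2.5, 0)       # log(2.5) = 0.9162907
bc(2.5, 1e-5)    # nearly identical, illustrating continuity at lambda = 0
```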
Density Transformation Theorem: If g(Y) ~ F_{g(Y)}(y) holds for a continuous r.v. and g(·) is a monotone function, then the untransformed r.v. Y has cdf
 F_Y(y) = Pr(Y ≤ y) = Pr(g(Y) ≤ g(y)) = F_{g(Y)}(g(y)).
Thus, the density of Y is
 f_Y(y) = ∂F_{g(Y)}(g(y))/∂y = f_{g(Y)}(g(y)) · |∂g(y)/∂y|
with Jacobian |∂g(y)/∂y|.
 f(y|λ, μ(λ), σ²(λ)) =
  (1/√(2πσ²(λ))) exp{ −((y^λ − 1)/λ − μ(λ))² / (2σ²(λ)) } · y^{λ−1},  λ ≠ 0,
  (1/√(2πσ²(λ))) exp{ −(log y − μ(λ))² / (2σ²(λ)) } · y^{−1},  λ = 0.
• If λ ≠ 0 and μ(λ) = x^T β(λ), then
 f(y|λ, μ(λ), σ²(λ)) = (1/√(2πλ²σ²(λ))) exp{ −(y^λ − 1 − λx^T β(λ))² / (2λ²σ²(λ)) } · |λ| y^{λ−1}.
Using β_0 = 1 + λβ_0(λ), β_j = λβ_j(λ), j = 1, ..., p−1, and σ² = λ²σ²(λ), this becomes
 f(y|λ, β, σ²) = (1/√(2πσ²)) exp{ −(y^λ − x^T β)² / (2σ²) } · |λ| y^{λ−1}.
• If λ = 0, let β_j = β_j(λ), j = 0, ..., p−1, and σ² = σ²(λ):
 f(y|0, β, σ²) = (1/√(2πσ²)) exp{ −(log y − x^T β)² / (2σ²) } · y^{−1}.
If λ were known, then the MLE could easily be computed!
• λ ≠ 0:
 ℓ(λ, β, σ²|y) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (y_i^λ − x_i^T β)² + n log|λ| + (λ−1) Σ_{i=1}^n log y_i
• λ = 0:
 ℓ(0, β, σ²|y) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (log y_i − x_i^T β)² − Σ_{i=1}^n log y_i
If λ were known, then the MLEs would be
 β̂_λ = (X^T X)^{−1} X^T y^λ,  λ ≠ 0;  (X^T X)^{−1} X^T log y,  λ = 0,
 σ̂²_λ = (1/n) SSE_λ(β̂_λ) = (1/n) Σ_{i=1}^n (y_i^λ − x_i^T β̂_λ)²,  λ ≠ 0;
     (1/n) Σ_{i=1}^n (log y_i − x_i^T β̂_λ)²,  λ = 0.
The resulting profile log-likelihood (up to a constant) is
 pℓ(λ|y) = −(n/2) log SSE_λ(β̂_λ) + n log|λ| + (λ−1) Σ_{i=1}^n log y_i,  λ ≠ 0;
      −(n/2) log SSE_0(β̂_0) − Σ_{i=1}^n log y_i,  λ = 0.
This is the sample log-likelihood function that has already been maximized with respect to β and σ².
Likelihood Ratio Test (LRT): H_0: λ = λ_0 versus H_1: λ ≠ λ_0. For the LRT statistic it holds that
 −2 (pℓ(λ_0|y) − pℓ(λ̂|y)) →_D χ²_1.
If −2(pℓ(λ_0|y) − pℓ(λ̂|y)) ~ χ²_1, a (1−α) confidence interval contains all values λ_0 for which
 −(pℓ(λ_0|y) − pℓ(λ̂|y)) < (1/2) χ²_{1;1−α}
(notice that χ²_{1;0.95} = 3.841, χ²_{1;0.99} = 6.635).
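In R, this confidence interval can be read off the grid returned by MASS::boxcox(); a sketch for the trees example used below (variable names lambda.hat and ci are hypothetical):

```r
# Sketch: MLE of lambda and a 95% profile-likelihood CI from boxcox()
library(MASS)
bc <- boxcox(Volume ~ Height + Girth, data = trees,
             lambda = seq(-0.5, 1, length = 200), plotit = FALSE)
lambda.hat <- bc$x[which.max(bc$y)]
# all lambda0 with p-loglik above max - qchisq(0.95, 1)/2 are in the CI
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
lambda.hat; ci    # MLE near 0.3; the CI excludes lambda = 1
```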
Log-transformation (λ = 0): if log y_i ~ Normal(x_i^T β, σ²), then
 E(log y_i) = x_i^T β,  var(log y_i) = σ².
The untransformed response y_i follows a log-normal distribution with
 median(y_i) = exp(x_i^T β),
 E(y_i) = exp(x_i^T β + σ²/2) = exp(x_i^T β) exp(σ²/2),
 var(y_i) = (exp(σ²) − 1) exp(2x_i^T β + σ²).
Power transformation (λ ≠ 0): if y_i^λ ~ Normal(x_i^T β, σ²), then
 median(y_i^λ) = E(y_i^λ) = x_i^T β = μ_i,  var(y_i^λ) = σ².
The untransformed response y_i follows a distribution with
 median(y_i) = μ_i^{1/λ},
 E(y_i) ≈ μ_i^{1/λ} (1 + σ²(1−λ)/(2λ²μ_i²)),  var(y_i) ≈ μ_i^{2/λ} σ²/(λ²μ_i²).
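The log-normal median/mean formulas can be checked by simulation; a small sketch with arbitrary values μ = 1.2, σ = 0.8:

```r
# Sketch: log-normal median and mean vs. their closed forms
set.seed(3)
mu <- 1.2; sigma <- 0.8
y <- exp(rnorm(1e6, mu, sigma))       # log(y) ~ Normal(mu, sigma^2)
median(y); exp(mu)                    # both near exp(1.2)
mean(y); exp(mu + sigma^2 / 2)        # the mean exceeds the median
```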
> H <- trees$Height; D <- trees$Girth; V <- trees$Volume
> plot(D, V); lines(lowess(D, V))   # curvature (wrong scale?)
> plot(H, V)                        # increasing variance?
> (mod <- lm(V ~ H + D))            # still fit a linear model for volume
Coefficients:
(Intercept)            H            D
   -57.9877       0.3393       4.7082
> plot(D, residuals(mod), ylab="residuals"); abline(0, 0)
> lines(lowess(D, residuals(mod)))  # sink in the middle
> library(MASS)
> bc <- boxcox(V ~ H + D, lambda=seq(0.0, 0.6, length=100), plotit=FALSE)
> ml.index <- which(bc$y == max(bc$y))
> bc$x[ml.index]
[1] 0.3090909
> boxcox(V ~ H + D, lambda = seq(0.0, 0.6, len = 18))  # plot it now
Is volume cubic in height and diameter?
> plot(D, V^(1/3), ylab=expression(V^{1/3}))
> lines(lowess(D, V^(1/3)))   # curvature almost gone
> (mod1 <- lm(V^(1/3) ~ H + D))
Coefficients:
(Intercept)            H            D
   -0.08539      0.01447      0.15152
For fixed λ = 1/3 the estimated median is median̂(V) = μ̂_{1/3}³, where E(V^{1/3}) = μ_{1/3}, and Ê(V) = μ̂_{1/3}³ (1 + 3σ̂²_{1/3}/μ̂²_{1/3}). Compare responses with estimated medians:
> mu <- fitted(mod1)
> plot(mu^3, V)   # fitted median model
Alternative strategy:
Remove the curvature by a log-transform of all predictors (i.e., regress on log(D) and log(H)). Should we also consider log(V) as response?
> plot(log(D), log(V))          # shows nice linear relationship
> lm(log(V) ~ log(H) + log(D))  # response log(V) or still V?
Coefficients:
(Intercept)      log(H)      log(D)
     -6.632       1.117       1.983
> boxcox(V ~ log(H) + log(D), lambda=seq(-0.35, 0.25, length=100))
Which of the models is better? Comparison by LRT. Both models are members of the model family
 V* ~ Normal(β_0 + β_1 H* + β_2 D*, σ²),
 V* = (V^{λ_V} − 1)/λ_V,  H* = (H^{λ_H} − 1)/λ_H,  D* = (D^{λ_D} − 1)/λ_D.
Compare the profile-likelihood function at λ_V = 1/3, λ_H = λ_D = 1 (E(V^{1/3}) = β_0 + β_1 H + β_2 D) with that at λ_V = λ_H = λ_D = 0 (E(log(V)) = β_0 + β_1 log(H) + β_2 log(D)).
> bc1 <- boxcox(V ~ H + D, lambda = 1/3, plotit=FALSE)
> bc1$y
[1] 25.33313
> bc2 <- boxcox(V ~ log(H) + log(D), lambda = 0, plotit=FALSE)
> bc2$y
[1] 26.11592
LRT statistic: −2(25.333 − 26.116) = 1.566 (not significant).
Remark: the coefficient of log(H) is close to 1 (β̂_1 = 1.117) and the coefficient of log(D) is close to 2 (β̂_2 = 1.983). A tree can be represented by a cylinder or a cone. The volume is πhd²/4 (cylinder) or πhd²/12 (cone), i.e.
 E(log(V)) = c + 1·log(H) + 2·log(D),
with c = log(π/4) (cylinder) or c = log(π/12) (cone).
Attention: D has to be converted from inches to feet ⇒ use D/12 as predictor.
Coefficients:
(Intercept)      log(H)   log(D/12)
     -1.705       1.117       1.983
Conversion only influences the intercept!
Fix the slopes (β_1, β_2) at (1, 2) and estimate only the intercept β_0, i.e. consider the model
 E(log(V)) = β_0 + 1·log(H) + 2·log(D/12).
> (mod3 <- lm(log(V) ~ 1 + offset(log(H) + 2*log(D/12))))
Coefficients:
(Intercept)
     -1.199
> log(pi/4)
[1] -0.2415645
> log(pi/12)
[1] -1.340177
The volume is better described by a cone than by a cylinder. However, it is slightly larger than the volume of a cone.
• GLM's provide a unified framework for regression models for independent response variables with covariates.
• While a linear model combines additivity of the covariate effects with normality of the errors, including variance homogeneity, GLM's don't need to satisfy these requirements. GLM's also allow one to handle non-normal responses such as binomial, Poisson and Gamma.
• Regression parameters are estimated using maximum likelihood.
• The standard reference on GLM's is McCullagh & Nelder (1989).
Data: (y_i, x_i) with x_i = (1, x_{i1}, ..., x_{i,p−1})^T, i = 1, ..., n.
1 Random Component:
 y_i, i = 1, ..., n, independent with density from the linear exponential family (LEF), i.e.
 f(y|θ, φ) = exp{ (yθ − b(θ))/φ + c(y, φ) },
where φ > 0 is a dispersion parameter and b(·) and c(·,·) are known functions.
2 Systematic Component:
 η_i = η_i(β) = x_i^T β is called the linear predictor;
 β = (β_0, ..., β_{p−1})^T are unknown regression parameters.
3 Parametric Link Component:
 The link function g(μ_i) = η_i combines the linear predictor with the mean μ_i = E(y_i).
For the normal distribution,
 f(y|μ, σ²) = (1/√(2πσ²)) exp{ −(y − μ)²/(2σ²) }
      = exp{ (yμ − μ²/2)/σ² − (1/2)[log(2πσ²) + y²/σ²] }.
Defining θ = μ and φ = σ² results in
 b(θ) = θ²/2  and  c(y, φ) = −(1/2)[log(2πφ) + y²/φ].
It can be shown that for the LEF
 E(y) = b′(θ) = μ,
 var(y) = φ b″(θ) = φ V(μ),
where V(μ) = b″(θ) is called the variance function.
Thus, we generally consider the model g(μ) = g(b′(θ)) = η. The canonical link is defined as
 g = (b′)^{−1}  ⇒  g(μ) = θ = x^T β.
The log-likelihood of the sample y_1, ..., y_n is
 ℓ(μ|y) = Σ_{i=1}^n [ (y_iθ_i − b(θ_i))/φ + c(y_i, φ) ].
The maximum likelihood estimator μ̂ is obtained by solving the score function (chain rule)
 s(μ) = ∂ℓ(μ|y)/∂μ = (∂ℓ(μ|y)/∂θ)(∂θ/∂μ) = ( (y_1 − μ_1)/(φV(μ_1)), ..., (y_n − μ_n)/(φV(μ_n)) )^T.
Because of μ = μ(β), the score function for the parameter β is (chain rule again)
 s(β) = ∂ℓ(β|y)/∂β = (∂ℓ(μ|y)/∂θ)(∂θ/∂μ)(∂μ/∂η)(∂η/∂β) = Σ_{i=1}^n (y_i − μ_i)/(φV(μ_i)) · (1/g′(μ_i)) · x_i,
which again depends only on the mean/variance relationship.
For the sample y_1, ..., y_n we assumed that there is only one global dispersion parameter φ, i.e. E(y_i) = μ_i, var(y_i) = φV(μ_i). Hence β̂ solves the score equation
 Σ_{i=1}^n (y_i − μ̂_i)/V(μ̂_i) · (1/g′(μ̂_i)) · x_i = 0,
which doesn't depend on φ and where g(μ̂_i) = x_i^T β̂. Notice, if a canonical link (g(μ) = θ) is used, we have
 g′(μ) = ∂θ/∂μ = 1/(∂μ/∂θ) = 1/(∂b′(θ)/∂θ) = 1/b″(θ) = 1/V(μ),
and the above score equation simplifies to
 Σ_{i=1}^n (y_i − μ̂_i) x_i = 0.
A general method to solve the score equation is the iterative algorithm Fisher's Method of Scoring (derived from a Taylor expansion of s(β)).
In the t-th iteration, the new estimate β^{(t+1)} is obtained from the previous one β^{(t)} by
 β^{(t+1)} = β^{(t)} + [ −E( ∂s(β)/∂β^T ) |_{β=β^{(t)}} ]^{−1} s(β^{(t)}).
Therefore, the speciality is the usage of the expected instead of the observed Hessian matrix.
This iteration can be rewritten as
 β^{(t+1)} = (X^T W^{(t)} X)^{−1} X^T W^{(t)} z^{(t)}
with the vector of pseudo-observations z = (z_1, ..., z_n)^T and diagonal weight matrix W defined by
 z_i = g(μ_i) + g′(μ_i)(y_i − μ_i),
 w_i = 1/(V(μ_i) g′(μ_i)²).
Since
 β^{(t+1)} = (X^T W^{(t)} X)^{−1} X^T W^{(t)} z^{(t)},
the estimate β̂ is calculated using an Iteratively (Re-)Weighted Least Squares (IWLS) algorithm:
1 start with initial guesses μ_i^{(0)} (e.g. μ_i^{(0)} = y_i or μ_i^{(0)} = y_i + c)
2 calculate working responses z_i^{(t)} and weights w_i^{(t)}
3 calculate β^{(t+1)} by weighted least squares
4 repeat steps 2 and 3 until convergence.
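The four steps above can be sketched as a bare-bones IWLS loop; here for a Poisson model with log link (so V(μ) = μ and g′(μ) = 1/μ, hence w_i = μ_i), checked against glm() on simulated data (all names hypothetical):

```r
# Sketch: manual IWLS for a Poisson GLM with log link, compared to glm()
set.seed(4)
n <- 200
x <- runif(n); X <- cbind(1, x)
y <- rpois(n, exp(1 + 2 * x))

mu <- y + 0.1                              # step 1: initial guesses
for (t in 1:25) {                          # steps 2-4: iterate to convergence
  z <- log(mu) + (y - mu) / mu             # working response z_i
  w <- mu                                  # w_i = 1/(V(mu) g'(mu)^2) = mu
  beta <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
  mu <- exp(as.vector(X %*% beta))         # update fitted means
}
fit <- glm(y ~ x, family = poisson)
all.equal(as.vector(beta), unname(coef(fit)), tolerance = 1e-6)
```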
Asymptotically,
 β̂ ~ Normal(β, φ(X^T W X)^{−1}).
Thus, standard errors of the estimators β̂_j are the respective diagonal elements of the estimated variance/covariance matrix
 var̂(β̂) = φ(X^T Ŵ X)^{−1},
with Ŵ = W(μ̂). Note that (X^T Ŵ X)^{−1} is a by-product of the last IWLS iteration. If φ is unknown, an estimator is required.
A method-of-moments-like estimator is developed by considering the ratios
 φ = E(y_i − μ_i)²/V(μ_i),  for all i = 1, ..., n.
Averaging over all these ratios and assuming that the μ_i's are known results in the estimator
 (1/n) Σ_{i=1}^n (y_i − μ_i)²/V(μ_i).
However, since β is unknown we better use the bias-corrected version (also known as the mean generalized Pearson's chi-square statistic)
 φ̂ = (1/(n−p)) Σ_{i=1}^n (y_i − μ̂_i)²/V(μ̂_i) = (1/(n−p)) X².
Generalized linear models can be fitted in R using the glm function, which is similar to lm for fitting linear models. The arguments to a glm call are as follows:
glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = glm.control(...), model = TRUE,
    method = "glm.fit", x = FALSE, y = TRUE,
    contrasts = NULL, ...)
Formula argument:
The formula is specified for a glm as, e.g.,
 y ~ x1 + x2
where x1 and x2 are the names of
• numeric vectors (continuous predictors)
• factors (categorical predictors)
All the variables used in the formula must be in the workspace or in the data frame passed to the data argument.
Other symbols that can be used in the formula are:
• a:b for the interaction between a and b
• a*b which expands to 1 + a + b + a:b
• . for first-order terms of all variables in data
• - to exclude a term (or terms)
• 1 intercept (default)
• -1 without intercept
Family argument:
The family argument defines the response distribution (variance function) and the link function. The exponential family functions available in R are, e.g.,
• gaussian(link = "identity")
• binomial(link = "logit")
• poisson(link = "log")
• Gamma(link = "inverse")
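Each family function returns an object bundling the link and variance functions, which can be inspected directly; a quick sketch:

```r
# Sketch: a family object carries linkfun, linkinv and the variance function
fam <- Gamma(link = "log")
fam$family                    # "Gamma"
fam$linkfun(2)                # g(mu) = log(2)
fam$linkinv(fam$linkfun(2))   # g^{-1}(g(2)) = 2
fam$variance(3)               # V(mu) = mu^2, so 9
```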
The glm function returns an object of class c("glm", "lm").
There are several methods available to access or display components of aglmobject, e.g.
• residuals()
• fitted()
• predict()
• coef()
• deviance()
• summary()
The first part contains the same information as from lm():
> mod <- glm(life.exp ~ urban + log(physicians) + temp)
> summary(mod)
Call:
glm(formula = life.exp ~ urban + log(physicians) + temp)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-20.4925  -2.6337   0.7898   3.3447  10.4813

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     67.37488    1.85916  36.239  < 2e-16 ***
urban            8.13713    2.59353   3.137  0.00211 **
log(physicians)  3.50002    0.40578   8.625 2.01e-14 ***

Since the default family="gaussian", deviance residuals correspond to ordinary residuals as in a linear model. A five-number summary of those raw residuals is given.
Remember that for the MLE it asymptotically holds that
 β̂ ~ Normal(β, φ(X^T W X)^{−1}).
Thus, we can utilize this to construct a test statistic on the significance of a coefficient, say β_j for j = 1, ..., p−1. If we test
 H_0: β_j = 0  versus  H_1: β_j ≠ 0,
we can use the test statistic
 t = β̂_j / √( φ̂ (X^T Ŵ X)^{−1}_{j+1,j+1} ),
which under H_0 asymptotically follows a t distribution with n−p degrees of freedom.
The second part contains some new information on estimated dispersion and goodness-of-fit aspects which we will discuss later in detail.
First the dispersion estimate (if necessary) φ̂ is provided:
(Dispersion parameter for gaussian family taken to be 24.577)
This estimate is simply the squared residual standard error (which was 4.958 in the summary(lm()) output; 4.958² ≈ 24.58).
Next there is the deviance of two models and the number of missing observations:
    Null deviance: 10782.0 on 132 degrees of freedom
Residual deviance:  3170.5 on 129 degrees of freedom
  (39 observations deleted due to missingness)
The first refers to the null model which corresponds to a model with intercept only (the iid assumption, no explanatory variables). The associated degrees of freedom are n−1. The second refers to our fitted model with p−1 explanatory variables in the predictor and, thus, with associated degrees of freedom n−p.
The deviance of a model is defined as the distance of log-likelihoods, i.e.
 D(y, μ̂) = −2φ (ℓ(μ̂|y) − ℓ(y|y)).
Here, μ̂ are the fitted values under the considered model (maximizing the log-likelihood under the given parametrization), and y denotes the estimated means under a model without any restriction at all (thus μ̂ = y in such a saturated model).
For any member of the LEF the deviance equals
 D(y, μ̂) = −2φ Σ_{i=1}^n [ (y_iθ̂_i − y_iθ̃_i) − (b(θ̂_i) − b(θ̃_i)) ]/φ
      = −2 Σ_{i=1}^n { (y_iθ̂_i − y_iθ̃_i) − (b(θ̂_i) − b(θ̃_i)) },
where θ̃_i denotes the estimate of θ_i under the saturated model.
Under the saturated model, there are as many mean parameters µi allowed as observationsyi.
For normal responses (φ = σ²) we have
 ℓ(μ̂|y) = −(n/2) log(2πσ²) − (1/2) Σ_{i=1}^n (y_i − μ̂_i)²/σ²,
 ℓ(y|y) = −(n/2) log(2πσ²).
Therefore the deviance equals the sum of squared errors, i.e.
 D(y, μ̂) = −2φ (ℓ(μ̂|y) − ℓ(y|y)) = Σ_{i=1}^n (y_i − μ̂_i)² = SSE(β̂).
Finally we have
AIC: 809.22
Number of Fisher Scoring iterations: 2
The Akaike Information Criterion (AIC) also assesses the fit, penalizing for the total number of parameters p+1 (linear predictor and dispersion in this case), and is defined as
 AIC = −2ℓ(μ̂|y) + 2(p+1).
The smaller the AIC value the better the fit. Use AIC only to compare different models (not necessarily nested).
Sometimes, the term −2ℓ(μ̂|y) is called the disparity.
residuals(object, type = c("deviance", "pearson", "working",
                           "response", "partial"), ...)
• deviance: write the deviance as Σ_{i=1}^n d(y_i, μ̂_i)²; then r_i^D = sign(y_i − μ̂_i) d(y_i, μ̂_i)
• pearson: r_i^P = (y_i − μ̂_i)/√V(μ̂_i)
• working: r_i^W = ẑ_i − η̂_i = (y_i − μ̂_i) g′(μ̂_i) (remember that g′(μ̂_i) = 1/V(μ̂_i) for canonical link models)
• response: y_i − μ̂_i
• partial: r_i^P + β̂_j x_{ij} is the partial residual for the j-th covariate
Deviance residuals are the default used in R since they reflect the same criterion as used in the fitting.
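These definitions can be verified against residuals(); a short sketch using a Gamma fit (V(μ) = μ²) on the built-in trees data:

```r
# Sketch: Pearson and response residuals match their definitions
fit <- glm(Volume ~ log(Height) + log(Girth), data = trees,
           family = Gamma(link = "log"))
mu <- fitted(fit); y <- trees$Volume
# Pearson: (y - mu)/sqrt(V(mu)) with V(mu) = mu^2, i.e. (y - mu)/mu
all.equal(unname(residuals(fit, type = "pearson")), unname((y - mu) / mu))
# response: y - mu
all.equal(unname(residuals(fit, type = "response")), unname(y - mu))
```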
Plot deviance residuals against fitted values:
> plot(residuals(mod) ~ fitted(mod),
+      xlab = expression(hat(mu)[i]),
+      ylab = expression(r[i]))
> abline(0, 0, lty = 2)
The plot() function gives the following sequence of plots:
• deviance residuals vs. fitted values
• Normal Q-Q plot of deviance residuals standardized to unit variance
• scale-location plot of standardized deviance residuals
• standardized deviance residuals vs. leverage with Cook’s distance contours
¿ plot(mod)
So far we considered (Box-Cox transformation) models like
• V_i^{1/3} ind~ Normal(μ_i, σ²), with E(V^{1/3}) = μ and predictor H + D
• log(V_i) ind~ Normal(μ_i, σ²), with E(log(V)) = μ and predictor log(H) + log(D)
In what follows we will assume that a GLM holds with
• V_i ind~ Normal(μ_i, σ²) and g(E(V)) = η.
More specifically, we like to check out the models:
• μ^{1/3} = η with predictor H + D
• log(μ) = η with predictor log(H) + log(D).
These models on the observation scale can easily be fitted using glm().
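The slides do not show the actual glm() calls; presumably they look like the following sketch (the power(1/3) link specification and the object names pmodel, lmodel are assumptions):

```r
# Sketch (assumed calls): normal GLMs for volume on the observation scale
H <- trees$Height; D <- trees$Girth; V <- trees$Volume
# mu^{1/3} = eta, via the power(1/3) link
pmodel <- glm(V ~ H + D, family = gaussian(link = power(1/3)))
# log(mu) = eta, via the log link
lmodel <- glm(V ~ log(H) + log(D), family = gaussian(link = "log"))
coef(pmodel)
coef(lmodel)
```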
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.051322   0.224095  -0.229 0.820518
H            0.014287   0.003342   4.274 0.000201 ***
D            0.150331   0.005838  25.749  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 6.577063)
    Null deviance: 8106.08 on 30 degrees of freedom

> AIC(pmodel)
[1] 151.2102
> -2*logLik(pmodel) + 2*4
'log Lik.' 151.2102 (df=4)
> logLik(pmodel)
'log Lik.' -71.60508 (df=4)
> sum(log(dnorm(V, pmodel$fit, sqrt(summary(pmodel)$disp*28/31))))
[1] -71.60508
> sum(residuals(pmodel)^2)
[1] 184.1577
> deviance(pmodel)
[1] 184.1577
> sum((V-mean(V))^2)   # null deviance
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.53700    0.94352  -6.928 1.57e-07 ***
log(H)       1.08765    0.24216   4.491 0.000111 ***
log(D)       1.99692    0.08208  24.330  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 6.41642)
    Null deviance: 8106.08 on 30 degrees of freedom
Gamma responses: y ~ Gamma(a, λ) with density function
 f(y|a, λ) = exp(−λy) λ^a y^{a−1} / Γ(a),  a, λ, y > 0,
with E(y) = a/λ and var(y) = a/λ².
Mean parametrization needed! With μ = a/λ and ν = a,
 f(y|μ, ν) = exp(−νy/μ) (ν/μ)^ν y^{ν−1} / Γ(ν)
      = exp{ (y(−1/μ) − log μ)/(1/ν) + ν log ν + (ν−1) log y − log Γ(ν) }.
LEF member with:
 θ = −1/μ,  b(θ) = log μ = −log(−θ),  φ = 1/ν.
Thus,
 E(y) = b′(θ) = −1/θ = μ,
 var(y) = φ b″(θ) = φ/θ² = φμ²,
with dispersion φ = 1/ν and variance function V(μ) = μ².
Coefficient of variation:
 √var(y_i)/E(y_i) = √(φμ_i²)/μ_i = √φ = constant for all i = 1, ..., n.
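A quick simulation sketch of this constant coefficient of variation (arbitrary ν = 4, so √φ = 0.5):

```r
# Sketch: CV = sqrt(phi) regardless of the mean under the Gamma model
set.seed(5)
nu <- 4                                  # phi = 1/nu
mus <- c(1, 5, 20)
cv <- sapply(mus, function(m) {
  y <- rgamma(1e5, shape = nu, scale = m / nu)   # E(y)=m, var(y)=m^2/nu
  sd(y) / mean(y)
})
cv; sqrt(1 / nu)                         # all three CVs close to 0.5
```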
> y <- (1:400)/100
> shape <- 0.9
> scale <- 1.5
> plot(y, dgamma(y, shape=shape, scale=scale))
> mean(rgamma(10000, shape=shape, scale=scale)); shape*scale
[1] 1.374609
[1] 1.35
> var(rgamma(10000, shape=shape, scale=scale)); shape*(scale)^2
[1] 2.001009
[1] 2.025
Gamma distributions are generally skewed to the right (plots: shape < 1, here 0.9, on the left; shape > 1 on the right).
Special cases: ν = 1/φ = 1 (exponential) and ν → ∞ (normal).
What's an appropriate link function?
• Canonical link function: η = θ = −1/μ (inverse link). Since we need μ > 0, we need η < 0, giving a complicated restriction on β.
• Thus, the log link is often used, without restrictions on η, i.e.
 log μ = η.
Then
 ℓ(μ̂, φ|y) = Σ_{i=1}^n [ (y_i(−1/μ̂_i) − log μ̂_i)/φ + c(y_i, φ) ],
 ℓ(y, φ|y) = Σ_{i=1}^n [ (y_i(−1/y_i) − log y_i)/φ + c(y_i, φ) ],
and thus the scaled deviance equals
 (1/φ) D(y, μ̂) = −(2/φ) Σ_{i=1}^n [ (−y_i/μ̂_i − log μ̂_i) − (−1 − log y_i) ]
        = −(2/φ) Σ_{i=1}^n [ log(y_i/μ̂_i) − (y_i − μ̂_i)/μ̂_i ].
It remains to estimate the dispersion parameter. We have a sample y_1, ..., y_n with
 E(y_i) = μ_i  and  var(y_i) = φμ_i²,  i = 1, ..., n.
Consider z_i = y_i/μ_i with E(z_i) = 1 and var(z_i) = φ (the z_i are iid). Thus,
 φ̂ = (1/(n−p)) Σ_{i=1}^n (y_i/μ̂_i − 1)² = (1/(n−p)) Σ_{i=1}^n ((y_i − μ̂_i)/μ̂_i)²,
which is equivalent to the mean Pearson statistic.
We now assume that life expectancy follows a gamma model.
> gmod <- glm(life.exp ~ urban + log(physicians) + temp,
+             family=Gamma(link="log"))
> summary(gmod)
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.2149307  0.0278759 151.203  < 2e-16 ***
urban            0.0981231  0.0388869   2.523   0.0128 *
log(physicians)  0.0545432  0.0060841   8.965 3.04e-15 ***
temp            -0.0007257  0.0009359  -0.775   0.4395
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given urban and log(physicians) are in the model, temp seems to be irrelevant as an additional predictor.

(Dispersion parameter for Gamma family taken to be 0.005525322)
The dispersion estimate φ̂ is the mean Pearson statistic:
> # direct from summary(.)
> summary(gmod)$dispersion
[1] 0.005525322
> # or explicitly calculated as
> sum(residuals(gmod, type="pearson")^2)/gmod$df.resid
[1] 0.005525322
    Null deviance: 2.37804 on 132 degrees of freedom
Residual deviance: 0.75412 on 129 degrees of freedom
  (39 observations deleted due to missingness)
AIC: 829.14
Number of Fisher Scoring iterations: 4
For the scaled deviance we get
 (1/φ̂) D(y, μ̂) = 0.7541/0.0055 = 136.4838,
which is pretty close to its associated degrees of freedom, 129.
Reject model (∗) at level α if
 (1/φ) D(y, μ̂) > χ²_{1−α; n−p}.
Since the dispersion φ is unknown, we use its estimate φ̂ instead and reject model (∗) if
 (1/φ̂) D(y, μ̂) > χ²_{1−α; n−p}.
Partial Deviance Test:
Consider the model g(μ) = X_1β_1 + X_2β_2 with dim(β_1) = p_1, dim(β_2) = p_2 and p = p_1 + p_2. Now calculate
• μ̂_1 = g^{−1}(X_1β̂_1): the fitted means under the reduced model with design X_1 only (corresponds to H_0: β_2 = 0)
• μ̂_2 = g^{−1}(X_1β̂_1 + X_2β̂_2): the fitted means under the full model with designs X_1 and X_2
• φ̂ = X²/(n−p): dispersion estimate under the full model
Reject H_0 at level α if
 [ (D(y, μ̂_1) − D(y, μ̂_2)) / p_2 ] / φ̂ > F_{1−α; p_2, n−p}.
> (dev2 <- deviance(gmod))
[1] 0.7541171
> (hatphi <- sum(residuals(gmod, type="pearson")^2)/gmod$df.r)
[1] 0.005525322
> gmod1 <- glm(life.exp ~ urban + log(physicians),
+              family=Gamma(link="log"))
> (dev1 <- deviance(gmod1))
[1] 0.7574799
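With dev1, dev2 and hatphi available, the partial F test for dropping temp (p_2 = 1, n − p = 129) can be completed; a sketch using the deviance and dispersion values reported above:

```r
# Sketch: partial deviance F test for H0: beta_temp = 0
dev1 <- 0.7574799; dev2 <- 0.7541171       # deviances from the slides
hatphi <- 0.005525322                      # mean Pearson dispersion estimate
Fstat <- ((dev1 - dev2) / 1) / hatphi      # (D1 - D2)/p2, scaled by phi-hat
Fstat                                      # small F value
1 - pf(Fstat, 1, 129)                      # large p-value: temp not needed
```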