Generalized Linear Models: An Introduction based on R


(1)

Generalized Linear Models: An Introduction based on R

Herwig Friedl
Institute of Statistics

Graz University of Technology, Austria

(2)

This course will provide an introduction into the concepts of generalized linear models (GLM’s).

This class extends the class of linear models (LM’s) to regression models for non-normal data.

Special interest will be on binary data (logistic regression) and count data (log-linear models).

All models will be handled in R using functions like lm, anova, or glm.

(3)

Linear Models (LM’s): Recap of Results

Box-Cox Transformation Family: Extending the LM

Generalized Linear Models (GLM’s): An Introduction

Linear Exponential Family (LEF): Properties and Members

GLM’s: Parameter Estimates

GLM’s: glm(.) Function

Gamma Regression Models

Logistic Regression (Binomial Responses)

Log-linear Model (Poisson Responses)

Multinomial Response Models

Quasi-Likelihood Models (Overdispersion)

(4)

The goal of regression models is to determine how a response variable depends on covariates. A special class of regression models are linear models. The general setup is given by

Data: $(y_i, x_{i1}, \dots, x_{i,p-1})$, $i = 1, \dots, n$

Response: $y = (y_1, \dots, y_n)^\top$ (random variable)

Covariates: $x_i = (x_{i1}, \dots, x_{i,p-1})^\top$ (fixed, known)

(5)

Data Example: Life Expectancies

Source: The World Bank makes available tons of great data from the World Development Indicators that are available within R:

> install.packages('WDI'); library(WDI)
> WDIsearch('gdp') # gives a list of available data about gdp
> d <- WDI(indicator='NY.GDP.PCAP.KD', country=c('AT', 'US'),
+          start=1960, end=2012)
> head(d)
  iso2c country NY.GDP.PCAP.KD year
1    AT Austria       41246.57 2013
2    AT Austria       41366.52 2012
3    AT Austria       41120.22 2011
4    AT Austria       39972.94 2010
5    AT Austria       39350.17 2009

(6)

> install.packages('gdata')
# also needs Perl (for MS Windows) to be installed
# also needs Java with the same structure as R (32/64 bit)
> library(gdata)
> temp.xls <- "http://databank.worldbank.org/data/download/catalog/cckp_historical_data_0.xls"
> myperl <- "c:/Strawberry/perl/bin/perl.exe"
> sheetCount(temp.xls, perl=myperl)
Downloading...
trying URL 'http://databank.worldbank.org/data/.../*.xls'
Content type 'application/vnd.ms-excel' length 378368 bytes

(7)

Data Example: Life Expectancies

> temp <- read.xls(temp.xls, sheet="Country_temperatureCRU",
+                  perl=myperl)
> temp.data <- temp[ , c("ISO_3DIGIT", "Annual_temp")]
> colnames(temp.data) <- c("iso3c", "temp")
> head(temp.data)
  iso3c  temp
1   AFG 12.92
2   AGO 21.51
3   ALB 11.27
4   ARE 26.83
5   ARG 14.22
6   ARM  6.37

(8)

Data Example: Life Expectancies

Data we are interested in (from 2010):

life.exp: life expectancy at birth, total (years)
urban: urban population (percent)
physicians: physicians (per 1,000 people)
temp: annual mean temperature (Celsius)

Which is the response and which are covariates?

(9)

Gaussian Linear Model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} \text{Normal}(0, \sigma^2),$$

with unknown regression parameters $\beta_0, \beta_1, \dots, \beta_{p-1}$ (intercept $\beta_0$, slopes $\beta_j$, $j = 1, \dots, p-1$) and unknown (homogeneous) error variance $\sigma^2$.

This is equivalent to $y_i \overset{ind}{\sim} \text{Normal}(\text{E}(y_i), \text{var}(y_i))$, where

$$\text{E}(y_i) = \mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1}$$

is a linear function in the parameters and

$$\text{var}(y_i) = \sigma^2, \quad i = 1, \dots, n,$$

describes a homoscedastic scenario.

(10)

Define

$$\beta = (\beta_0, \beta_1, \dots, \beta_{p-1})^\top, \qquad x_i = (1, x_{i1}, \dots, x_{i,p-1})^\top, \qquad X = (x_1, \dots, x_n)^\top,$$

and write the Gaussian regression model as $y = X\beta + \epsilon$ with

$$\text{E}(y) = \mu = X\beta \quad \text{and} \quad \text{var}(y) = \sigma^2 I_n.$$

(11)

Exploratory Data Analysis (EDA):

• Check out the ranges of the response and covariates. For discrete covariates (with sparse factor levels) we consider grouping the levels.

• Plot covariates against the response. The scatter plots should reflect linear relationships; otherwise we consider transformations.

• To check if the constant variance assumption is reasonable, the points of the scatter plot of covariates against the responses should be contained in a horizontal band.

(12)

> summary(mydata[, c(5, 6, 8, 10)])
    life.exp         urban            physicians         temp
 Min.   :44.84   Min.   :0.09092   Min.   :0.008   Min.   :-7.14
 1st Qu.:62.16   1st Qu.:0.37396   1st Qu.:0.272   1st Qu.:10.15
 Median :71.83   Median :0.56297   Median :1.764   Median :21.58
 Mean   :69.11   Mean   :0.55892   Mean   :1.871   Mean   :18.06
 3rd Qu.:76.10   3rd Qu.:0.74453   3rd Qu.:3.000   3rd Qu.:25.05
 Max.   :82.84   Max.   :0.98655   Max.   :7.739   Max.   :28.30
                                   NA's   :1       NA's   :39

(13)

> plot(mydata[, c(5, 6, 8, 10)])

(14)

> plot(log(physicians), life.exp)

(15)

Idea of Least Squares: minimize the sum of squared errors, i.e.

$$\text{SSE}(\beta) = \sum_{i=1}^n \left(y_i - x_i^\top\beta\right)^2.$$

Equivalent with Maximum Likelihood: maximize the sample log-likelihood function

$$\ell(\beta|y) = \sum_{i=1}^n \left( \log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\left(y_i - x_i^\top\beta\right)^2 \right).$$

LSE/MLE Solution: $\hat\beta = (X^\top X)^{-1} X^\top y$. For $y_i \overset{ind}{\sim} \text{Normal}(x_i^\top\beta, \sigma^2)$ we have

$$\hat\beta \sim \text{Normal}\left(\beta, \sigma^2 (X^\top X)^{-1}\right).$$
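The closed-form solution is easy to verify numerically. A minimal sketch on simulated data (all names hypothetical), checking the normal-equations solution against lm():

set.seed(1)
n <- 100
X <- cbind(1, runif(n), runif(n))           # design matrix with intercept
beta <- c(2, 1, -0.5)
y <- drop(X %*% beta + rnorm(n, sd = 0.3))
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) beta = X'y
drop(beta.hat)
coef(lm(y ~ X[, 2] + X[, 3]))               # same estimates via lm()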

(16)

The MLE of $\sigma^2$,

$$\hat\sigma^2 = \frac{1}{n}\text{SSE}(\hat\beta) = \frac{1}{n}\sum_{i=1}^n \left(y_i - x_i^\top\hat\beta\right)^2, \qquad \text{E}(\hat\sigma^2) = \left(1 - \frac{p}{n}\right)\sigma^2,$$

is biased. An unbiased variance estimator is (df corrected)

$$S^2 = \frac{1}{n-p}\text{SSE}(\hat\beta).$$

For $y_i \overset{ind}{\sim} \text{Normal}(x_i^\top\beta, \sigma^2)$ we get

$$\frac{(n-p)S^2}{\sigma^2} \sim \chi^2_{n-p}.$$

(17)

$$\underbrace{\sum_{i=1}^n (y_i - \bar y)^2}_{\text{SST}} = \underbrace{\sum_{i=1}^n (\hat\mu_i - \bar y)^2}_{\text{SSR}(\hat\beta)} + \underbrace{\sum_{i=1}^n (y_i - \hat\mu_i)^2}_{\text{SSE}(\hat\beta)}$$

Total SS equals (maximized) Regression SS plus (minimized) Error SS. Thus, the proportion of variability explained by the regression model is described by the coefficient of determination

$$R^2 = \frac{\text{SSR}(\hat\beta)}{\text{SST}} = 1 - \frac{\text{SSE}(\hat\beta)}{\text{SST}} \in (0, 1).$$

To penalize for model complexity $p$ we use its adjusted version

$$R^2_{\text{adj}} = 1 - \frac{\text{SSE}(\hat\beta)/(n-p)}{\text{SST}/(n-1)} \not\in (0, 1).$$
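A quick numerical check of the decomposition and of $R^2$; a minimal sketch on simulated data (names hypothetical):

set.seed(2)
x <- runif(50); y <- 1 + 2*x + rnorm(50)
fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
SSE <- sum(residuals(fit)^2)
SSR <- sum((fitted(fit) - mean(y))^2)
all.equal(SST, SSR + SSE)                           # TRUE: the SS decomposition
c(R2 = 1 - SSE/SST,
  R2.adj = 1 - (SSE/(50 - 2))/(SST/(50 - 1)))
summary(fit)$r.squared                              # matches 1 - SSE/SST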

(18)

Thus, for each slope parameter $\beta_j$, $j = 1, \dots, p-1$, we have

$$\hat\beta_j \sim \text{Normal}\left(\beta_j, \sigma^2 (X^\top X)^{-1}_{j+1,j+1}\right)$$

and therefore

$$\frac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2 (X^\top X)^{-1}_{j+1,j+1}}} \sim \text{Normal}(0, 1).$$

Since $S^2$ and $\hat\beta$ are independent, replacing $\sigma^2$ by $S^2$ results in

(19)

Hypothesis Tests: t-Test

$$\frac{\hat\beta_j - \beta_j}{\sqrt{S^2 (X^\top X)^{-1}_{j+1,j+1}}} \sim t_{n-p}$$

Therefore, we can test the relevance of a single predictor $x_j$ by

$$H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0$$

and use the well-known test statistic

$$\frac{\text{Estimate}}{\text{Std. Error}} = \frac{\hat\beta_j}{\sqrt{S^2 (X^\top X)^{-1}_{j+1,j+1}}} \overset{H_0}{\sim} t_{n-p}.$$

(20)

$$\mu = \beta_0 + \beta_{Af}\,I(\text{Africa}) + \beta_{Am}\,I(\text{America}) + \beta_{As}\,I(\text{Asia})$$

To check if the predictor continent is irrelevant we have to test $k-1$ parameters:

$$H_0: \beta_{Af} = \beta_{Am} = \beta_{As} = 0 \quad \text{vs} \quad H_1: \text{not } H_0$$

Fitting the model twice, under $H_0$ and under $H_1$, results in the corresponding partial F test.

(21)

$$y = X\beta + \epsilon, \qquad \epsilon \sim \text{Normal}(0, \sigma^2 W), \qquad W = \text{diag}(w_1, \dots, w_n)$$

The MLE (weighted LSE) of $\beta$ is given by $\hat\beta = (X^\top W^{-1} X)^{-1} X^\top W^{-1} y$ with

$$\text{E}(\hat\beta) = \beta \quad \text{and} \quad \text{var}(\hat\beta) = \sigma^2 (X^\top W^{-1} X)^{-1}.$$

The MLE of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{w_i} = \frac{1}{n}\, r^\top W^{-1} r.$$

(22)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  61.49091    2.21534  27.757  < 2e-16 ***
urban        15.06895    2.94544   5.116  1.1e-06 ***
physicians    1.87133    0.46933   3.987 0.000111 ***
temp         -0.19079    0.07287  -2.618 0.009896 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.874 on 129 degrees of freedom

(23)

Data Example: Life Expectancies

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  61.49091    2.21534  27.757  < 2e-16 ***
urban        15.06895    2.94544   5.116  1.1e-06 ***
physicians    1.87133    0.46933   3.987 0.000111 ***
temp         -0.19079    0.07287  -2.618 0.009896 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All predictors are significant (on a 5% level). Only temp has a negative effect.

(24)

Residual standard error: 5.874 on 129 degrees of freedom
(39 observations deleted due to missingness)
Multiple R-squared: 0.5872, Adjusted R-squared: 0.5776
F-statistic: 61.17 on 3 and 129 DF, p-value: < 2.2e-16

Under the model, the estimated standard error of the response is 5.874 (years). We have $n-p = 129$ and $p-1 = 3$ predictors.

Almost 60% of the total variability is explained by this model.

The adjusted version of $R^2$ is 0.5776.

(25)

Data Example: Life Expectancies (log(physicians))

> mod.log <- update(mod, . ~ . - physicians + log(physicians))
> summary(mod.log)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     67.37488    1.85916  36.239  < 2e-16 ***
urban            8.13713    2.59353   3.137  0.00211 **
temp            -0.05590    0.06242  -0.895  0.37219
log(physicians)  3.50002    0.40578   8.625 2.01e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictor temp is no longer significant!

(26)

Residual standard error: 4.958 on 129 degrees of freedom
(39 observations deleted due to missingness)
Multiple R-squared: 0.7059, Adjusted R-squared: 0.6991
F-statistic: 103.2 on 3 and 129 DF, p-value: < 2.2e-16

The residual standard error is now much smaller than before (about ±5 years)!

Now even 70% of the total variability is explained by this model.

(27)

Data Example: Life Expectancies (ANOVA)

> anova(mod.log)
Analysis of Variance Table

Response: life.exp
                 Df Sum Sq Mean Sq F value    Pr(>F)
urban             1 4813.0  4813.0 195.831 < 2.2e-16 ***
temp              1  970.1   970.1  39.470 4.709e-09 ***
log(physicians)   1 1828.5  1828.5  74.399 2.010e-14 ***
Residuals       129 3170.5    24.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(28)

Information about this is contained in the ANOVA Table (all SS evaluated at $\hat\beta$):

Source       df     SS                    MSS                 F
Regression   p-1    SSR = SST - SSE       MSR = SSR/(p-1)     MSR/MSE
Error        n-p    SSE                   MSE = SSE/(n-p)

(29)

Data Example: Life Expectancies (ANOVA)

Null Model: assuming an iid random sample ($\text{E}(y_i) = \beta_0 \cdot 1$) results in $\text{SSE}(\hat\beta_0) = \sum_i (y_i - \hat\beta_0)^2$ where $\hat\beta_0 = \bar y$. Thus, $\text{SSE}(\hat\beta_0) = \sum_i (y_i - \bar y)^2 \equiv \text{SST}$ in this case.

Nested Model: we assume that

$$y = X\beta + \epsilon = X_1\beta_1 + X_2\beta_2 + \epsilon,$$

and test on $H_0: \beta_2 = 0$, with $\dim(\beta_1) = p_1$ (including the intercept) and $\dim(\beta_2) = p_2$ (additional slopes). The corresponding SSR and SSE terms are

$$\text{SSR}(\hat\beta_1) = \sum_{i=1}^n \left(x_i^\top\hat\beta_1 - \bar y\right)^2, \qquad \text{SSE}(\hat\beta_1) = \sum_{i=1}^n \left(y_i - x_i^\top\hat\beta_1\right)^2.$$

(30)

Source    df      SS                                                MSS                                                      F
X1        p1-1    $\text{SSR}(\hat\beta_1)$                         $\text{MSR}(\hat\beta_1) = \text{SSR}(\hat\beta_1)/(p_1-1)$            $\text{MSR}(\hat\beta_1)/\text{MSE}(\hat\beta)$
X2|X1     p2      $\text{SSR}(\hat\beta_2|\hat\beta_1) = \text{SSR}(\hat\beta) - \text{SSR}(\hat\beta_1)$   $\text{MSR}(\hat\beta_2|\hat\beta_1) = \text{SSR}(\hat\beta_2|\hat\beta_1)/p_2$   $\text{MSR}(\hat\beta_2|\hat\beta_1)/\text{MSE}(\hat\beta)$

(31)

Data Example: Life Expectancies (ANOVA)

We now assume that the model $y = \beta_0 + X_1\beta_1 + X_2\beta_2 + \epsilon$ holds.

Test 1: the test statistic

$$F = \frac{\text{MSR}(\hat\beta_1|\hat\beta_0)}{\text{MSE}(\hat\beta)}$$

tests the model improvement when adding the predictors in $X_1$ to the iid model with only $\beta_0$.

Test 2: the test statistic

$$F = \frac{\text{MSR}(\hat\beta_2|\hat\beta_1, \hat\beta_0)}{\text{MSE}(\hat\beta)}$$

tests the model improvement when adding the predictors in $X_2$ to the model already containing $\beta_0$ and $X_1$.

(32)

Analysis of Variance Table

Response: life.exp
                 Df Sum Sq Mean Sq F value    Pr(>F)
urban             1 4813.0  4813.0 195.831 < 2.2e-16 ***
temp              1  970.1   970.1  39.470 4.709e-09 ***
log(physicians)   1 1828.5  1828.5  74.399 2.010e-14 ***
Residuals       129 3170.5    24.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(33)

Problems:

• $y_i \not\sim \text{Normal}(\text{E}(y_i), \text{var}(y_i))$

• $\text{E}(y_i) \neq x_i^\top\beta \in \mathbb{R}$

• $\text{var}(y_i) \neq \sigma^2$ equal (homoscedastic) for all $i = 1, \dots, n$

Remedies:

• transform $y_i$ such that $g(y_i) \overset{ind}{\sim} \text{Normal}(x_i^\top\beta, \sigma^2)$

• utilize a GLM where $y_i \overset{ind}{\sim} \text{LEF}(g^{-1}(x_i^\top\beta), \phi V(\mu_i))$

(34)

$$y(\lambda) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0, \\[1ex] \log y, & \lambda = 0, \end{cases}$$

with $y(\lambda) \to \log y$ for $\lambda \to 0$, such that $y(\lambda)$ is continuous in $\lambda$.

Assumption: there is a value $\lambda$ for which

$$y_i(\lambda) \overset{ind}{\sim} \text{Normal}\left(\mu_i(\lambda) = x_i^\top\beta(\lambda),\ \sigma^2(\lambda)\right).$$
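The transformation itself is one line of R; a minimal sketch (hypothetical helper name) illustrating the continuity in $\lambda$:

bc.trans <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1)/lambda   # Box-Cox family
}
y <- c(0.5, 1, 2, 4)
bc.trans(y, 0.5)
bc.trans(y, 1e-8)   # already very close to log(y) as lambda -> 0
log(y)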

(35)

Density Transformation Theorem: If $g(Y) \sim F_{g(Y)}(y)$ holds for a continuous r.v. and $g(\cdot)$ is a monotone function, then the untransformed r.v. $Y$ has cdf

$$F_Y(y) = \Pr(Y \le y) = \Pr(g(Y) \le g(y)) = F_{g(Y)}(g(y)).$$

Thus, the density of $Y$ is

$$f_Y(y) = \frac{\partial F_{g(Y)}(g(y))}{\partial y} = f_{g(Y)}(g(y)) \cdot \left|\frac{\partial g(y)}{\partial y}\right|$$

with Jacobian $\left|\dfrac{\partial g(y)}{\partial y}\right|$.

(36)

$$f(y|\lambda, \mu(\lambda), \sigma^2(\lambda)) = \begin{cases} \dfrac{1}{\sqrt{2\pi\sigma^2(\lambda)}}\exp\left(-\dfrac{\left(\frac{y^\lambda-1}{\lambda} - \mu(\lambda)\right)^2}{2\sigma^2(\lambda)}\right) y^{\lambda-1}, & \lambda \neq 0, \\[3ex] \dfrac{1}{\sqrt{2\pi\sigma^2(\lambda)}}\exp\left(-\dfrac{(\log y - \mu(\lambda))^2}{2\sigma^2(\lambda)}\right) y^{-1}, & \lambda = 0. \end{cases}$$

• If $\lambda \neq 0$ and $\mu(\lambda) = x^\top\beta(\lambda)$ then

$$f(y|\lambda, \mu(\lambda), \sigma^2(\lambda)) = \frac{1}{\sqrt{2\pi\lambda^2\sigma^2(\lambda)}}\exp\left(-\frac{\left(y^\lambda - 1 - \lambda x^\top\beta(\lambda)\right)^2}{2\lambda^2\sigma^2(\lambda)}\right)|\lambda|\, y^{\lambda-1}.$$

(37)

Using $\beta_0 = 1 + \lambda\beta_0(\lambda)$, $\beta_j = \lambda\beta_j(\lambda)$, $j = 1, \dots, p-1$, and $\sigma^2 = \lambda^2\sigma^2(\lambda)$, the density

$$f(y|\lambda, \mu(\lambda), \sigma^2(\lambda)) = \frac{1}{\sqrt{2\pi\lambda^2\sigma^2(\lambda)}}\exp\left(-\frac{\left(y^\lambda - 1 - \lambda x^\top\beta(\lambda)\right)^2}{2\lambda^2\sigma^2(\lambda)}\right)|\lambda|\, y^{\lambda-1}$$

becomes

$$f(y|\lambda, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y^\lambda - x^\top\beta)^2}{2\sigma^2}\right)|\lambda|\, y^{\lambda-1}.$$

• If $\lambda = 0$, let $\beta_j = \beta_j(\lambda)$, $j = 0, \dots, p-1$, and $\sigma^2 = \sigma^2(\lambda)$:

$$f(y|0, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(\log y - x^\top\beta)^2}{2\sigma^2}\right) y^{-1}.$$

If $\lambda$ were known, then the MLE could be easily computed!

(38)

• $\lambda \neq 0$:

$$\ell(\lambda, \beta, \sigma^2|y) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i^\lambda - x_i^\top\beta\right)^2 + n\log|\lambda| + (\lambda-1)\sum_{i=1}^n \log y_i$$

• $\lambda = 0$:

$$\ell(0, \beta, \sigma^2|y) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(\log y_i - x_i^\top\beta\right)^2 - \sum_{i=1}^n \log y_i$$

(39)

If $\lambda$ were known, then the MLEs would be

$$\hat\beta_\lambda = \begin{cases} (X^\top X)^{-1} X^\top y^\lambda, & \lambda \neq 0, \\ (X^\top X)^{-1} X^\top \log y, & \lambda = 0, \end{cases}$$

$$\hat\sigma^2_\lambda = \frac{1}{n}\text{SSE}_\lambda(\hat\beta_\lambda) = \begin{cases} \dfrac{1}{n}\displaystyle\sum_{i=1}^n \left(y_i^\lambda - x_i^\top\hat\beta_\lambda\right)^2, & \lambda \neq 0, \\[3ex] \dfrac{1}{n}\displaystyle\sum_{i=1}^n \left(\log y_i - x_i^\top\hat\beta_\lambda\right)^2, & \lambda = 0. \end{cases}$$

(40)

The profile log-likelihood for $\lambda$ is then

$$p\ell(\lambda|y) = \begin{cases} -\dfrac{n}{2}\log \text{SSE}_\lambda(\hat\beta_\lambda) + n\log|\lambda| + (\lambda-1)\displaystyle\sum_{i=1}^n \log y_i, & \lambda \neq 0, \\[3ex] -\dfrac{n}{2}\log \text{SSE}_0(\hat\beta_0) - \displaystyle\sum_{i=1}^n \log y_i, & \lambda = 0. \end{cases}$$

This is the sample log-likelihood function that has already been maximized with respect to $\beta$ and $\sigma^2$.

(41)

Likelihood Ratio Test (LRT): $H_0: \lambda = \lambda_0$ versus $H_1: \lambda \neq \lambda_0$. For the LRT statistic it holds that

$$-2\left(p\ell(\lambda_0|y) - p\ell(\hat\lambda|y)\right) \overset{D}{\to} \chi^2_1.$$

If $-2(p\ell(\lambda_0|y) - p\ell(\hat\lambda|y)) \sim \chi^2_1$, a $(1-\alpha)$ confidence interval contains all values $\lambda_0$ for which

$$-\left(p\ell(\lambda_0|y) - p\ell(\hat\lambda|y)\right) < \frac{1}{2}\chi^2_{1;1-\alpha}$$

(notice that $\chi^2_{1;0.95} = 3.841$, $\chi^2_{1;0.99} = 6.635$).
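This cutoff is exactly how a profile-likelihood confidence interval can be read off the boxcox() grid; a minimal sketch, using MASS and the trees data analyzed below:

library(MASS)
bc <- boxcox(Volume ~ Height + Girth, data = trees,
             lambda = seq(0, 0.6, length = 200), plotit = FALSE)
keep <- bc$y > max(bc$y) - qchisq(0.95, 1)/2   # within chi^2 cutoff of the max
range(bc$x[keep])                              # all lambda_0 not rejected at 5%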

(42)

Log-Transformation ($\lambda = 0$): if $\log y_i \sim \text{Normal}(x_i^\top\beta, \sigma^2)$, then

$$\text{E}(\log y_i) = x_i^\top\beta, \qquad \text{var}(\log y_i) = \sigma^2.$$

The untransformed response $y_i$ follows a log-normal distribution with

$$\text{median}(y_i) = \exp(x_i^\top\beta),$$
$$\text{E}(y_i) = \exp(x_i^\top\beta + \sigma^2/2) = \exp(x_i^\top\beta)\exp(\sigma^2/2),$$
$$\text{var}(y_i) = \left(\exp(\sigma^2) - 1\right)\exp(2x_i^\top\beta + \sigma^2).$$

(43)

Power-Transformation ($\lambda \neq 0$): if $y_i^\lambda \sim \text{Normal}(x_i^\top\beta, \sigma^2)$, then with $\mu_i = x_i^\top\beta$

$$\text{median}(y_i^\lambda) = \mu_i, \qquad \text{E}(y_i^\lambda) = \mu_i, \qquad \text{var}(y_i^\lambda) = \sigma^2.$$

The untransformed response $y_i$ follows a distribution with

$$\text{median}(y_i) = \mu_i^{1/\lambda},$$
$$\text{E}(y_i) \approx \mu_i^{1/\lambda}\left(1 + \sigma^2(1-\lambda)/(2\lambda^2\mu_i^2)\right), \qquad \text{var}(y_i) \approx \mu_i^{2/\lambda}\,\sigma^2/(\lambda^2\mu_i^2).$$

(44)

> H <- trees$Height; D <- trees$Girth; V <- trees$Volume
> plot(D, V); lines(lowess(D, V)) # curvature (wrong scale?)
> plot(H, V) # increasing variance?

(45)

> (mod <- lm(V ~ H + D)) # still fit a linear model for volume
Coefficients:
(Intercept)            H            D
   -57.9877       0.3393       4.7082

> plot(D, residuals(mod), ylab="residuals"); abline(0, 0)
> lines(lowess(D, residuals(mod))) # sink in the middle

(46)
(47)

> library(MASS)
> bc <- boxcox(V ~ H + D, lambda=seq(0.0, 0.6, length=100), plotit=FALSE)
> ml.index <- which(bc$y == max(bc$y))
> bc$x[ml.index]
[1] 0.3090909
> boxcox(V ~ H + D, lambda = seq(0.0, 0.6, len = 18)) # plot it now

(48)
(49)

Is volume cubic in height and diameter?

> plot(D, V^(1/3), ylab=expression(V^{1/3}))
> lines(lowess(D, V^(1/3))) # curvature almost gone
> (mod1 <- lm(V^(1/3) ~ H + D))
Coefficients:
(Intercept)            H            D
   -0.08539      0.01447      0.15152

For fixed $\lambda = 1/3$ we have $\widehat{\text{median}}(V) = \hat\mu_{1/3}^3$, where $\text{E}(V^{1/3}) = \mu_{1/3}$ and $\hat{\text{E}}(V) = \hat\mu_{1/3}^3\left(1 + 3\hat\sigma_{1/3}^2/\hat\mu_{1/3}^2\right)$.

Compare responses with estimated medians:

> mu <- fitted(mod1)
> plot(mu^3, V) # fitted median model

(50)
(51)

Alternative strategy:

Remove curvature by a log-transform of all predictors (i.e., regress on log(D) and log(H)).

Should we also consider log(V) as response?

> plot(log(D), log(V)) # shows nice linear relationship
> lm(log(V) ~ log(H) + log(D)) # response log(V) or still V?
Coefficients:
(Intercept)      log(H)      log(D)
     -6.632       1.117       1.983

> boxcox(V ~ log(H) + log(D), lambda=seq(-0.35, 0.25, length=100))

(52)
(53)

Which of the models is better? Comparison by LRT. Both models are members of the model family

$$\tilde V \sim \text{Normal}(\beta_0 + \beta_1\tilde H + \beta_2\tilde D, \sigma^2), \qquad \tilde V = \frac{V^{\lambda_V}-1}{\lambda_V}, \quad \tilde H = \frac{H^{\lambda_H}-1}{\lambda_H}, \quad \tilde D = \frac{D^{\lambda_D}-1}{\lambda_D}.$$

Compare the profile-likelihood function at $\lambda_V = 1/3$, $\lambda_H = \lambda_D = 1$ (i.e. $\text{E}(V^{1/3}) = \beta_0 + \beta_1 H + \beta_2 D$) with that at $\lambda_V = \lambda_H = \lambda_D = 0$ (i.e. $\text{E}(\log V) = \beta_0 + \beta_1\log H + \beta_2\log D$).

(54)

> bc1 <- boxcox(V ~ H + D, lambda = 1/3, plotit=FALSE)
> bc1$y
[1] 25.33313
> bc2 <- boxcox(V ~ log(H) + log(D), lambda = 0, plotit=FALSE)
> bc2$y
[1] 26.11592

LRT Statistic: $-2(25.333 - 26.116) = 1.566$ (not significant).

(55)

Remark: the coefficient of log(H) is close to 1 ($\hat\beta_1 = 1.117$) and the coefficient of log(D) is close to 2 ($\hat\beta_2 = 1.983$).

A tree can be represented by a cylinder or a cone. The volume is $\pi h d^2/4$ (cylinder) or $\pi h d^2/12$ (cone), i.e.

$$\text{E}(\log V) = c + 1\cdot\log H + 2\cdot\log D$$

with $c = \log(\pi/4)$ (cylinder) or $c = \log(\pi/12)$ (cone).

Attention: D has to be converted from inches to feet $\Rightarrow$ use D/12 as predictor.

(56)

Coefficients:
(Intercept)      log(H)   log(D/12)
     -1.705       1.117       1.983

The conversion only influences the intercept!

Fix the slopes $(\beta_1, \beta_2)$ to $(1, 2)$ and estimate only the intercept $\beta_0$, i.e. consider the model

$$\text{E}(\log V) = \beta_0 + 1\cdot\log H + 2\cdot\log(D/12).$$

(57)

> (mod3 <- lm(log(V) ~ 1 + offset(log(H) + 2*log(D/12))))
Coefficients:
(Intercept)
     -1.199

> log(pi/4)
[1] -0.2415645
> log(pi/12)
[1] -1.340177

Volume can be better described by a cone than by a cylinder. However, the trees' volume is slightly larger than that of a cone.

(58)

Generalized linear models describe the dependency of independent response variables on covariates.

While a linear model combines additivity of the covariate effects with normality of the errors, including variance homogeneity, GLM's do not need to satisfy these requirements. GLM's also allow us to handle non-normal responses such as binomial, Poisson and gamma.

Regression parameters are estimated using maximum likelihood.

The standard reference on GLM's is McCullagh & Nelder (1989).

(59)

Data: $(y_i, x_i)$ with $x_i = (1, x_{i1}, \dots, x_{i,p-1})^\top$.

1 Random Component: $y_i$, $i = 1, \dots, n$, independent with density from the linear exponential family (LEF), i.e.

$$f(y|\theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right),$$

where $\phi > 0$ is a dispersion parameter and $b(\cdot)$ and $c(\cdot, \cdot)$ are known functions.

2 Systematic Component: $\eta_i = \eta_i(\beta) = x_i^\top\beta$ is called the linear predictor; $\beta = (\beta_0, \dots, \beta_{p-1})^\top$ are unknown regression parameters.

3 Parametric Link Component: the link function $g(\mu_i) = \eta_i$ combines the linear predictor with the mean $\mu_i = \text{E}(y_i)$.

(60)

$$f(y|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y-\mu)^2\right) = \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{1}{2}\left(\log(2\pi\sigma^2) + \frac{y^2}{\sigma^2}\right)\right\}$$

Defining $\theta = \mu$ and $\phi = \sigma^2$ results in

$$b(\theta) = \frac{\theta^2}{2} \quad \text{and} \quad c(y, \phi) = -\frac{1}{2}\left(\log(2\pi\sigma^2) + \frac{y^2}{\sigma^2}\right).$$

(61)

It can be shown that for the LEF

$$\text{E}(y) = b'(\theta) = \mu, \qquad \text{var}(y) = \phi\, b''(\theta) = \phi V(\mu),$$

where $V(\mu) = b''(\theta)$ is called the variance function.

Thus, we generally consider the model $g(\mu) = g(b'(\theta)) = \eta = x^\top\beta$.

The canonical link is defined as $g = (b')^{-1}$, which gives $g(\mu) = \theta = x^\top\beta$.
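In R, these ingredients are exposed by the family objects; a small sketch for three common families:

poisson()$variance(2)      # V(mu) = mu:        2
binomial()$variance(0.3)   # V(mu) = mu(1-mu):  0.21
Gamma()$variance(2)        # V(mu) = mu^2:      4
poisson()$linkfun(2)       # canonical link log(mu): log(2)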

(62)

The log-likelihood of the sample $y_1, \dots, y_n$ is

$$\ell(\mu|y) = \sum_{i=1}^n \left(\frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi)\right).$$

The maximum likelihood estimator $\hat\mu$ is obtained by solving the score function (chain rule)

$$s(\mu) = \frac{\partial}{\partial\mu}\ell(\mu|y) = \frac{\partial}{\partial\theta}\ell(\mu|y)\frac{\partial\theta}{\partial\mu} = \left(\frac{y_1 - \mu_1}{\phi V(\mu_1)}, \dots, \frac{y_n - \mu_n}{\phi V(\mu_n)}\right)^\top.$$

(63)

Because of $\mu = \mu(\beta)$, the score function for the parameter $\beta$ is (chain rule again)

$$s(\beta) = \frac{\partial}{\partial\beta}\ell(\beta|y) = \frac{\partial}{\partial\theta}\ell(\mu|y)\frac{\partial\theta}{\partial\mu}\frac{\partial\mu}{\partial\eta}\frac{\partial\eta}{\partial\beta} = \sum_{i=1}^n \frac{y_i - \mu_i}{\phi V(\mu_i)}\frac{1}{g'(\mu_i)}x_i,$$

which again depends only on the mean/variance relationship.

For the sample $y_1, \dots, y_n$ we assumed that there is only one global dispersion parameter $\phi$, i.e. $\text{E}(y_i) = \mu_i$, $\text{var}(y_i) = \phi V(\mu_i)$.

(64)

Setting the score function to zero yields the MLE $\hat\beta$ as the solution of

$$\sum_{i=1}^n \frac{y_i - \hat\mu_i}{V(\hat\mu_i)}\frac{1}{g'(\hat\mu_i)}x_i = 0,$$

which doesn't depend on $\phi$ and where $g(\hat\mu_i) = x_i^\top\hat\beta$. Notice, if a canonical link ($g(\mu) = \theta$) is used, we have

$$g'(\mu) = \frac{\partial\theta}{\partial\mu} = \frac{1}{\partial\mu/\partial\theta} = \frac{1}{\partial b'(\theta)/\partial\theta} = \frac{1}{b''(\theta)} = \frac{1}{V(\mu)}$$

and the above score equation simplifies to

$$\sum_{i=1}^n (y_i - \hat\mu_i)\, x_i = 0.$$

(65)

A general method to solve the score equation is the iterative algorithm Fisher's Method of Scoring (derived from a Taylor expansion of $s(\beta)$).

In the $t$-th iteration, the new estimate $\beta^{(t+1)}$ is obtained from the previous one $\beta^{(t)}$ by

$$\beta^{(t+1)} = \beta^{(t)} + \left[\text{E}\left(-\frac{\partial s(\beta)}{\partial\beta}\right)\bigg|_{\beta=\beta^{(t)}}\right]^{-1} s(\beta^{(t)}).$$

The speciality here is the usage of the expected instead of the observed Hessian matrix.

(66)

$$\beta^{(t+1)} = \left(X^\top W^{(t)} X\right)^{-1} X^\top W^{(t)} z^{(t)}$$

with the vector of pseudo-observations $z = (z_1, \dots, z_n)^\top$ and diagonal weight matrix $W$ defined as

$$z_i = g(\mu_i) + g'(\mu_i)(y_i - \mu_i), \qquad w_i = \frac{1}{V(\mu_i)\left(g'(\mu_i)\right)^2}.$$

(67)

Since

$$\beta^{(t+1)} = \left(X^\top W^{(t)} X\right)^{-1} X^\top W^{(t)} z^{(t)},$$

the estimate $\hat\beta$ is calculated using an Iteratively (Re-)Weighted Least Squares (IWLS) algorithm (see the sketch after this list):

1 start with initial guesses $\mu_i^{(0)}$ (e.g. $\mu_i^{(0)} = y_i$ or $\mu_i^{(0)} = y_i + c$)

2 calculate working responses $z_i^{(t)}$ and weights $w_i^{(t)}$

3 calculate $\beta^{(t+1)}$ by weighted least squares

4 repeat steps 2 and 3 till convergence.
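A minimal IWLS sketch for a Poisson GLM with canonical log link (a toy implementation under these simplifying assumptions, not R's production glm.fit; all names hypothetical), checked against glm():

iwls <- function(X, y, tol = 1e-8) {
  mu <- y + 0.5                     # initial guess mu^(0) = y + c
  beta <- rep(0, ncol(X))
  repeat {
    z <- log(mu) + (y - mu)/mu      # working response: g(mu) + g'(mu)(y - mu)
    w <- mu                         # weights: 1/(V(mu) g'(mu)^2) = mu
    beta.new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    mu <- exp(drop(X %*% beta.new))
    if (max(abs(beta.new - beta)) < tol) break
    beta <- beta.new
  }
  beta.new
}
set.seed(3)
x <- runif(50)
X <- cbind(1, x)
y <- rpois(50, exp(1 + 0.5 * x))
cbind(IWLS = iwls(X, y), glm = coef(glm(y ~ x, family = poisson)))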

(68)

Asymptotically it holds that

$$\hat\beta \sim \text{Normal}\left(\beta, \phi\left(X^\top W X\right)^{-1}\right).$$

Thus, standard errors of the estimators $\hat\beta_j$ are the respective diagonal elements of the estimated variance/covariance matrix

$$\widehat{\text{var}}(\hat\beta) = \phi\left(X^\top\hat W X\right)^{-1}$$

with $\hat W = W(\hat\mu)$. Note that $(X^\top\hat W X)^{-1}$ is a by-product of the last IWLS iteration. If $\phi$ is unknown, an estimator is required.

(69)

A method-of-moments like estimator is developed by considering the ratios

$$\phi = \frac{\text{E}(y_i - \mu_i)^2}{V(\mu_i)}, \quad \text{for all } i = 1, \dots, n.$$

Averaging over all these ratios and assuming that the $\mu_i$'s are known results in the estimator

$$\frac{1}{n}\sum_{i=1}^n \frac{(y_i - \mu_i)^2}{V(\mu_i)}.$$

However, since $\beta$ is unknown we better use the bias-corrected version (also known as the mean generalized Pearson's chi-square statistic)

$$\hat\phi = \frac{1}{n-p}\sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)} = \frac{1}{n-p}X^2.$$

(70)

Generalized linear models can be fitted in R using the glm function, which is similar to lm for fitting linear models.

The arguments to a glm call are as follows:

glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = glm.control(...), model = TRUE,
    method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)

(71)

Formula argument:

The formula is specified for a glm as e.g.

y ~ x1 + x2

where x1 and x2 are the names of

numeric vectors (continuous predictors)

factors (categorical predictors)

All the variables used in the formula must be in the workspace or in the data frame passed to the data argument.

(72)

Other symbols that can be used in the formula are (see the sketch below):

a:b for the interaction between a and b

a*b which expands to 1 + a + b + a:b

. first-order terms of all variables in data

- to exclude a term (or terms)

1 intercept (default)

-1 without intercept
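A runnable sketch of these formula symbols on a small, hypothetical data frame:

df <- data.frame(y = rnorm(20), a = rnorm(20), b = gl(2, 10))
coef(glm(y ~ a * b, data = df))   # expands to 1 + a + b + a:b
coef(glm(y ~ . - a, data = df))   # all first-order terms except a
coef(glm(y ~ a - 1, data = df))   # no intercept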

(73)

Family argument:

The family argument defines the response distribution (variance function) and the link function. The exponential family functions available in R are e.g.

gaussian(link = "identity")

binomial(link = "logit")

poisson(link = "log")

Gamma(link = "inverse")

(74)

The glm function returns an object of class c("glm", "lm").

There are several methods available to access or display components of a glm object, e.g.

residuals()

fitted()

predict()

coef()

deviance()

summary()

(75)

The first part contains the same information as from lm():

> mod <- glm(life.exp ~ urban + log(physicians) + temp)
> summary(mod)

Call:
glm(formula = life.exp ~ urban + log(physicians) + temp)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-20.4925   -2.6337    0.7898    3.3447   10.4813

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     67.37488    1.85916  36.239  < 2e-16 ***
urban            8.13713    2.59353   3.137  0.00211 **
log(physicians)  3.50002    0.40578   8.625 2.01e-14 ***

(76)

Since the default is family="gaussian", deviance residuals correspond to ordinary residuals as in a linear model.

A five-number summary of those raw residuals is given.

(77)

Remember that for the MLE it asymptotically holds that

$$\hat\beta \sim \text{Normal}\left(\beta, \phi\left(X^\top W X\right)^{-1}\right).$$

Thus, we can utilize this to construct a test statistic on the significance of a coefficient, say $\beta_j$ for $j = 1, \dots, p-1$. If we test

$$H_0: \beta_j = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0,$$

we can use the test statistic

$$t = \frac{\hat\beta_j}{\sqrt{\hat\phi\left(X^\top\hat W X\right)^{-1}_{j+1,j+1}}},$$

which under $H_0$ asymptotically follows a t distribution with $n-p$ degrees of freedom.

(78)

The second part contains some new information on the estimated dispersion and goodness-of-fit aspects which we will discuss later in detail.

First, the dispersion estimate (if necessary) $\hat\phi$ is provided:

(Dispersion parameter for gaussian family taken to be 24.577)

This estimate is simply the squared residual standard error (that was 4.958 in the summary(lm()) output, and $4.958^2 \approx 24.58$).

(79)

Next there are the deviances of two models and the number of missing observations:

Null deviance: 10782.0 on 132 degrees of freedom
Residual deviance: 3170.5 on 129 degrees of freedom
(39 observations deleted due to missingness)

The first refers to the null model which corresponds to a model with intercept only (the iid assumption, no explanatory variables). The associated degrees of freedom are $n-1$.

The second refers to our fitted model with $p-1$ explanatory variables in the predictor and, thus, with associated degrees of freedom $n-p$.

(80)

The deviance of a model is defined as the distance of log-likelihoods, i.e.

$$D(y, \hat\mu) = -2\phi\left(\ell(\hat\mu|y) - \ell(y|y)\right).$$

Here, $\hat\mu$ are the fitted values under the considered model (maximizing the log-likelihood under the given parametrization), and $y$ denotes the estimated means under a model without any restriction at all (thus $\hat\mu = y$ in such a saturated model).

(81)

For any member of the LEF the deviance equals

$$D(y, \hat\mu) = -2\phi\sum_{i=1}^n \frac{(y_i\hat\theta_i - y_i\tilde\theta_i) - (b(\hat\theta_i) - b(\tilde\theta_i))}{\phi} = -2\sum_{i=1}^n \left\{(y_i\hat\theta_i - y_i\tilde\theta_i) - (b(\hat\theta_i) - b(\tilde\theta_i))\right\},$$

where $\tilde\theta_i$ denotes the estimate of $\theta_i$ under the saturated model.

Under the saturated model, there are as many mean parameters $\mu_i$ allowed as observations $y_i$.

Note that for LEF members the deviance $D(y, \hat\mu)$ does not depend on the dispersion parameter $\phi$.

(82)

For the Gaussian model we have

$$\ell(\hat\mu|y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{\sigma^2}, \qquad \ell(y|y) = -\frac{n}{2}\log(2\pi\sigma^2).$$

Therefore the deviance equals the sum of squared errors, i.e.

$$D(y, \hat\mu) = -2\phi\left(\ell(\hat\mu|y) - \ell(y|y)\right) = \sum_{i=1}^n (y_i - \hat\mu_i)^2 = \text{SSE}(\hat\beta).$$

(83)

Finally we have

AIC: 809.22
Number of Fisher Scoring iterations: 2

The Akaike Information Criterion (AIC) also assesses the fit, penalizing for the total number of parameters $p+1$ (linear predictor and dispersion in this case), and is defined as

$$\text{AIC} = -2\ell(\hat\mu|y) + 2(p+1).$$

The smaller the AIC value the better the fit. Use AIC only to compare different models (not necessarily nested).

Sometimes, the term $-2\ell(\hat\mu|y)$ is called disparity.

(84)

residuals(object, type = c("deviance", "pearson", "working",
                           "response", "partial"), ...)

deviance: $r_i^D$, writing the deviance as $\sum_{i=1}^n d(y_i, \hat\mu_i)^2$

pearson: $r_i^P = (y_i - \hat\mu_i)/\sqrt{V(\hat\mu_i)}$

working: $r_i^W = \hat z_i - \hat\eta_i = (y_i - \hat\mu_i)\,g'(\hat\mu_i)$ (remember that $g'(\hat\mu_i) = 1/V(\hat\mu_i)$ for canonical link models)

response: $y_i - \hat\mu_i$

partial: $r_i^P + \hat\beta_j x_{ij}$ is the partial residual for the $j$-th covariate

A comparison for the fitted Gaussian model is sketched below.
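A minimal sketch extracting the first four types for the Gaussian model mod from above (with the identity link and $V(\mu) = 1$, all four coincide):

r <- sapply(c("deviance", "pearson", "working", "response"),
            function(type) residuals(mod, type = type))
head(round(r, 4))   # identical columns for the Gaussian/identity case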

(85)

Deviance residuals are the default used in R since they reflect the same criterion as used in the fitting.

Plot deviance residuals against fitted values:

> plot(residuals(mod) ~ fitted(mod),
+      xlab = expression(hat(mu)[i]),
+      ylab = expression(r[i]))
> abline(0, 0, lty = 2)

(86)
(87)

The plot() function gives the following sequence of plots:

deviance residuals vs. fitted values

Normal Q-Q plot of deviance residuals standardized to unit variance

scale-location plot of standardized deviance residuals

standardized deviance residuals vs. leverage with Cook's distance contours

> plot(mod)

(88)
(89)

So far we considered (Box-Cox transformation) models like

• $V_i^{1/3} \overset{ind}{\sim} \text{Normal}(\mu_i, \sigma^2)$, $\text{E}(V^{1/3}) = \mu = H + D$

• $\log(V_i) \overset{ind}{\sim} \text{Normal}(\mu_i, \sigma^2)$, $\text{E}(\log(V)) = \mu = \log(H) + \log(D)$

In what follows we will assume that a GLM holds with

• $V_i \overset{ind}{\sim} \text{Normal}(\mu_i, \sigma^2)$ and $g(\text{E}(V)) = \eta$.

More specifically, we like to check out the models:

• $\mu^{1/3} = H + D$

• $\log(\mu) = \log(H) + \log(D)$.

These models on the observations scale can be easily fitted using glm().

(90)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.051322   0.224095  -0.229 0.820518
H            0.014287   0.003342   4.274 0.000201 ***
D            0.150331   0.005838  25.749  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 6.577063)

Null deviance: 8106.08 on 30 degrees of freedom

(91)

For the model $V_i \overset{ind}{\sim} \text{Normal}(\mu_i, \sigma^2)$ with $\mu^{1/3} = H + D$ (fitted as pmodel):

> AIC(pmodel)
[1] 151.2102
> -2*logLik(pmodel) + 2*4
'log Lik.' 151.2102 (df=4)
> logLik(pmodel)
'log Lik.' -71.60508 (df=4)
> sum(log(dnorm(V, pmodel$fit, sqrt(summary(pmodel)$disp*28/31))))
[1] -71.60508
> sum(residuals(pmodel)^2)
[1] 184.1577
> deviance(pmodel)
[1] 184.1577
> sum((V - mean(V))^2) # Null Deviance

(92)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.53700    0.94352  -6.928 1.57e-07 ***
log(H)       1.08765    0.24216   4.491 0.000111 ***
log(D)       1.99692    0.08208  24.330  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 6.41642)

Null deviance: 8106.08 on 30 degrees of freedom

(93)

Gamma responses: $y \sim \text{Gamma}(a, \lambda)$ with density function

$$f(y|a, \lambda) = \exp(-\lambda y)\,\lambda^a y^{a-1}\frac{1}{\Gamma(a)}, \qquad a, \lambda, y > 0,$$

with $\text{E}(y) = a/\lambda$ and $\text{var}(y) = a/\lambda^2$.

Mean parametrization needed!

(94)

Setting $\mu = a/\lambda$ (mean) and $\nu = a$ (shape), the density

$$f(y|a, \lambda) = \exp(-\lambda y)\,\lambda^a y^{a-1}\frac{1}{\Gamma(a)}$$

becomes

$$f(y|\mu, \nu) = \exp\left(-\frac{\nu}{\mu}y\right)\left(\frac{\nu}{\mu}\right)^\nu y^{\nu-1}\frac{1}{\Gamma(\nu)} = \exp\left(\frac{y\left(-\frac{1}{\mu}\right) - \log\mu}{1/\nu} + \nu\log\nu + (\nu-1)\log y - \log\Gamma(\nu)\right).$$

This is a LEF member with:

(95)

$$\theta = -1/\mu, \qquad b(\theta) = \log\mu = -\log(-\theta), \qquad \phi = 1/\nu.$$

Thus,

$$\text{E}(y) = b'(\theta) = -\frac{1}{\theta} = \mu, \qquad \text{var}(y) = \phi\, b''(\theta) = \phi\frac{1}{\theta^2} = \phi\mu^2,$$

with dispersion $\phi = 1/\nu$ and variance function $V(\mu) = \mu^2$.

Coefficient of variation:

$$\frac{\sqrt{\text{var}(y_i)}}{\text{E}(y_i)} = \frac{\sqrt{\phi\mu_i^2}}{\mu_i} = \sqrt{\phi} = \text{constant for all } i = 1, \dots, n.$$

(96)

> y <- (1:400)/100
> shape <- 0.9
> scale <- 1.5
> plot(y, dgamma(y, shape=shape, scale=scale))
> mean(rgamma(10000, shape=shape, scale=scale)); shape*scale
[1] 1.374609
[1] 1.35
> var(rgamma(10000, shape=shape, scale=scale)); shape*(scale)^2
[1] 2.001009

(97)

Gamma distributions are generally skewed to the right.

[Density plots: shape < 1 (0.9, left panel) and shape > 1 (1.5, right panel)]

Special cases: $\nu = 1/\phi = 1$ (exponential) and $\nu \to \infty$ (normal).

(98)

What's an appropriate link function?

Canonical link function: $\eta = \theta = -\frac{1}{\mu}$ (inverse link).

Since we need $\mu > 0$, we need $\eta < 0$, giving a complicated restriction on $\beta$.

Thus, the log link is often used, without restrictions on $\eta$, i.e.

$$\log\mu = \eta.$$

(99)

Then

$$\ell(\hat\mu, \phi|y) = \sum_{i=1}^n \left(\frac{-\frac{y_i}{\hat\mu_i} - \log\hat\mu_i}{\phi} + c(y_i, \phi)\right), \qquad \ell(y, \phi|y) = \sum_{i=1}^n \left(\frac{-\frac{y_i}{y_i} - \log y_i}{\phi} + c(y_i, \phi)\right),$$

and thus the scaled deviance equals

$$\frac{1}{\phi}D(y, \hat\mu) = -\frac{2}{\phi}\sum_{i=1}^n \left[\left(-\frac{y_i}{\hat\mu_i} - \log\hat\mu_i\right) - \left(-1 - \log y_i\right)\right] = -\frac{2}{\phi}\sum_{i=1}^n \left[\log\frac{y_i}{\hat\mu_i} - \frac{y_i - \hat\mu_i}{\hat\mu_i}\right].$$

(100)

In practice we usually also need to estimate the dispersion parameter. We have a sample $y_1, \dots, y_n$ with

$$\text{E}(y_i) = \mu_i \quad \text{and} \quad \text{var}(y_i) = \phi\mu_i^2, \quad i = 1, \dots, n.$$

Consider $z_i = y_i/\mu_i$ with $\text{E}(z_i) = 1$ and $\text{var}(z_i) = \phi$ (the $z_i$ are iid). Thus,

$$\hat\phi = \frac{1}{n-p}\sum_{i=1}^n \left(\frac{y_i}{\hat\mu_i} - 1\right)^2 = \frac{1}{n-p}\sum_{i=1}^n \left(\frac{y_i - \hat\mu_i}{\hat\mu_i}\right)^2,$$

which is equivalent to the mean Pearson statistic.

(101)

We now assume that life expectancy follows a gamma model.

> gmod <- glm(life.exp ~ urban + log(physicians) + temp,
+             family=Gamma(link="log"))
> summary(gmod)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.2149307  0.0278759 151.203  < 2e-16 ***
urban            0.0981231  0.0388869   2.523   0.0128 *
log(physicians)  0.0545432  0.0060841   8.965 3.04e-15 ***
temp            -0.0007257  0.0009359  -0.775   0.4395
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given urban and log(physicians) are in the model, temp seems to be irrelevant as an additional predictor.

(102)

(Dispersion parameter for Gamma family taken to be 0.005525322)

The dispersion estimate $\hat\phi$ is the mean Pearson statistic:

> # directly from summary(.)
> summary(gmod)$dispersion
[1] 0.005525322
> # or explicitly calculated as
> sum(residuals(gmod, type="pearson")^2)/gmod$df.resid
[1] 0.005525322

(103)

(Dispersion parameter for Gamma family taken to be 0.005525322)

Null deviance: 2.37804 on 132 degrees of freedom
Residual deviance: 0.75412 on 129 degrees of freedom
(39 observations deleted due to missingness)
AIC: 829.14

Number of Fisher Scoring iterations: 4

For the scaled deviance we get

$$\frac{1}{\hat\phi}D(y, \hat\mu) = \frac{0.7541}{0.0055} = 136.4838,$$

which is pretty close to its associated degrees of freedom 129.

(104)

Reject model (*) at level $\alpha$ if

$$\frac{1}{\phi}D(y, \hat\mu) > \chi^2_{1-\alpha, n-p}.$$

Since the dispersion $\phi$ is unknown, we use its estimate $\hat\phi$ instead and reject model (*) if

$$\frac{1}{\hat\phi}D(y, \hat\mu) > \chi^2_{1-\alpha, n-p}.$$
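A minimal sketch applying this check to the gamma model gmod fitted above:

D.scaled <- deviance(gmod)/summary(gmod)$dispersion   # 136.48 from above
qchisq(0.95, df = gmod$df.residual)                   # approx. 156.5
D.scaled > qchisq(0.95, df = gmod$df.residual)        # FALSE: model not rejected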

(105)

Partial Deviance Test:

Consider the model $g(\mu) = X_1\beta_1 + X_2\beta_2$ with $\dim(\beta_1) = p_1$, $\dim(\beta_2) = p_2$ and $p = p_1 + p_2$. Now calculate

$\hat\mu_1 = g^{-1}(X_1\hat\beta_1)$: the fitted means under the reduced model with design $X_1$ only (corresponds to $H_0: \beta_2 = 0$)

$\hat\mu_2 = g^{-1}(X_1\hat\beta_1 + X_2\hat\beta_2)$: the fitted means under the full model with designs $X_1$ and $X_2$

$\hat\phi = X^2/(n-p)$: the dispersion estimate under the full model

Reject $H_0$ at level $\alpha$ if

$$\frac{\left(D(y, \hat\mu_1) - D(y, \hat\mu_2)\right)/p_2}{\hat\phi} > F_{1-\alpha, p_2, n-p}.$$

(106)

> (dev2 <- deviance(gmod))
[1] 0.7541171
> (hatphi <- sum(residuals(gmod, type="pearson")^2)/gmod$df.r)
[1] 0.005525322
> gmod1 <- glm(life.exp ~ urban + log(physicians),
+              family=Gamma(link="log"))
> (dev1 <- deviance(gmod1))
[1] 0.7574799
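The slides break off here; a hedged sketch completing the partial deviance test for $H_0$: temp is irrelevant ($p_2 = 1$), using dev1, dev2 and hatphi from above:

(F.stat <- ((dev1 - dev2)/1)/hatphi)   # (0.7574799 - 0.7541171)/0.005525322 = 0.6086
qf(0.95, 1, gmod$df.residual)          # approx. 3.91, so H0 is not rejected
1 - pf(F.stat, 1, gmod$df.residual)    # p-value approx. 0.44

This agrees with the Wald t test for temp in the gamma summary above (p = 0.4395).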
