(1)

Marcel Dettling, Zurich University of Applied Sciences 1

Applied Statistical Regression

HS 2011 – Week 05

Marcel Dettling

Institute for Data Analysis and Process Design Zurich University of Applied Sciences

marcel.dettling@zhaw.ch http://stat.ethz.ch/~dettling

ETH Zürich, October 24, 2011

(2)


An Example

Researchers at General Motors collected data on 60 US Standard Metropolitan Statistical Areas (SMSAs) in a study of whether air pollution contributes to mortality.

http://lib.stat.cmu.edu/DASL/Stories/AirPollutionandMortality.html

City            Mortality JanTemp JulyTemp RelHum Rain Educ Dens NonWhite WhiteCollar     Pop House Income HC NOx SO2
Akron, OH          921.87      27       71     59   36 11.4 3243      8.8        42.6  660328  3.34  29560 21  15  59
Albany, NY         997.87      23       72     57   35 11.0 4281      3.5        50.7  835880  3.14  31458  8  10  39
Allentown, PA      962.35      29       74     54   44  9.8 4260      0.8        39.4  635481  3.21  31856  6   6  33
Atlanta, GA        982.29      45       79     56   47 11.1 3125     27.1        50.2 2138231  3.41  32452 18   8  24
Baltimore, MD     1071.29      35       77     55   43  9.6 6441     24.4        43.7 2199531  3.44  32368 43  38 206
Birmingham, AL    1030.38      45       80     54   53 10.2 3325     38.5        43.1  883946  3.45  27835 30  32  72

(3)


Multiple Linear Regression

The model is:

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + E_i$

We now have $p$ predictors, so visualization is no longer possible.

Our goal is to estimate the regression coefficients $\beta_0, \beta_1, \dots, \beta_p$ from the $n$ data points we have. We determine the residuals:

$r_i = y_i - (\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})$

and then estimate the coefficients such that the sum of squared residuals $\sum_{i=1}^n r_i^2$ is minimal.

(4)


Normal Equations and Their Solutions

The least squares approach leads to the normal equations, which are of the following form:

$(X^T X)\hat\beta = X^T y$

• Unique solution if and only if $X^T X$ has full rank
• Predictor variables need to be linearly independent
• If $X^T X$ does not have full rank, the model is "badly formulated"
• Design improvement mandatory!!!
• Necessary (but not sufficient) condition: $n > p$
• Do not over-parametrize your regression!
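The course works in R; purely to illustrate the algebra, here is a minimal NumPy sketch (all data values are invented for the example) that solves the normal equations directly and cross-checks the result against a generic least squares solver:

```python
import numpy as np

# Invented toy data: n = 6 observations, p = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix with an intercept column: n x (p+1).
X = np.column_stack([np.ones_like(x1), x1, x2])

# Normal equations: (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the built-in least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))  # both minimize the sum of squared residuals
```

If `X.T @ X` were singular (linearly dependent predictor columns), `np.linalg.solve` would fail — the numerical counterpart of the "badly formulated" model above.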

(5)


Properties of the Estimates

Gauss-Markov Theorem:

The regression coefficients are unbiased estimates, and they fulfill the optimality condition of minimal variance among all linear, unbiased estimators (BLUE).

- $E[\hat\beta] = \beta$
- $Cov(\hat\beta) = \sigma_E^2 (X^T X)^{-1}$
- $\hat\sigma_E^2 = \frac{1}{n-(p+1)} \sum_{i=1}^n r_i^2$ (note: degrees of freedom!)

(6)


Hat Matrix Notation

The fitted values are:

$\hat y = X\hat\beta = X(X^T X)^{-1} X^T Y = HY$

The matrix $H = X(X^T X)^{-1} X^T$ is called the hat matrix, because it "puts a hat on the Y's", i.e. transforms the observed values into fitted values. We can also use this matrix for computing the residuals:

$r = Y - \hat Y = (I - H)Y$

Moments of these estimates:

$E[\hat y] = X\beta$, $E[r] = 0$, $Var(\hat y) = \sigma_E^2 H$, $Var(r) = \sigma_E^2 (I - H)$
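As an illustration outside R (NumPy, simulated data), the hat matrix can be computed explicitly and its defining properties verified: it is symmetric, idempotent, and its trace equals the number of estimated coefficients, $p+1$:

```python
import numpy as np

# Simulated toy design: n observations, p predictors plus intercept.
rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T "puts the hat on y".
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_fit = H @ y                    # fitted values y_hat
r = (np.eye(n) - H) @ y          # residuals via (I - H)

# H is symmetric and idempotent; trace(H) = p + 1.
print(np.allclose(H, H.T), np.allclose(H @ H, H), round(np.trace(H), 6))
```

Idempotence ($H H = H$) is exactly why fitting the fitted values again changes nothing: projecting twice onto the column space of $X$ is the same as projecting once.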

(7)


If the Errors are Gaussian…

While all of the above statements hold for arbitrary error distributions, we obtain some more, very useful properties by assuming i.i.d. Gaussian errors:

- $\hat\beta \sim N\!\left(\beta,\ \sigma_E^2 (X^T X)^{-1}\right)$
- $\hat y \sim N\!\left(X\beta,\ \sigma_E^2 H\right)$
- $\hat\sigma_E^2 \sim \frac{\sigma_E^2}{n-(p+1)}\,\chi^2_{n-(p+1)}$

What to do if the errors are non-Gaussian?

(8)


Coefficient of Determination

The coefficient of determination, also called multiple R-squared, is aimed at describing the goodness-of-fit of the multiple linear regression model:

$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2} \in [0, 1]$

It shows the proportion of the total variance which has been explained by the predictors. The extreme cases 0 and 1 mean: …

(9)


Adjusted Coefficient of Determination

If we add more and more predictor variables to the model, R-squared will always increase and never decrease.

Is that a realistic goodness-of-fit measure?

NO, we had better adjust for the number of predictors!

$adjR^2 = 1 - \frac{n-1}{n-(p+1)} \cdot \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2} \in [0, 1]$
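Both goodness-of-fit measures are simple functions of the residual and total sums of squares. A NumPy sketch on simulated data (all values invented, mirroring the two formulas above):

```python
import numpy as np

# Simulated fit: observed y and least squares fitted values y_hat.
rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares

r2 = 1 - rss / tss
adj_r2 = 1 - (n - 1) / (n - (p + 1)) * rss / tss

print(0 <= r2 <= 1, adj_r2 <= r2)  # the adjustment always penalizes, so adjR^2 <= R^2
```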

(10)


Global F-Test

Question: is there any relation between predictors and response?

We test the null hypothesis

$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$

against the alternative

$H_A: \beta_j \neq 0$ for at least one $j$ in $1, \dots, p$

The test statistic is:

$F = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2 / p}{\sum_{i=1}^n (y_i - \hat y_i)^2 / (n-(p+1))} \sim F_{p,\ n-(p+1)}$
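A sketch of the global F-test on simulated data (NumPy/SciPy, invented values; the degrees of freedom follow the formula above):

```python
import numpy as np
from scipy.stats import f

# Simulated regression with p = 3 predictors.
rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.8, 0.0, -0.5]) + rng.normal(size=n)

y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

# F = [sum (y_hat_i - y_bar)^2 / p] / [sum (y_i - y_hat_i)^2 / (n - (p+1))]
msr = np.sum((y_hat - y.mean()) ** 2) / p
mse = np.sum((y - y_hat) ** 2) / (n - (p + 1))
F = msr / mse
p_value = f.sf(F, p, n - (p + 1))   # upper-tail probability of F_{p, n-(p+1)}

print(F > 0 and 0 <= p_value <= 1)
```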

(11)


Individual Parameter Tests

If we are interested whether the jth predictor variable is relevant, we can test the hypothesis

$H_0: \beta_j = 0$

against the alternative hypothesis

$H_A: \beta_j \neq 0$

We can derive the test statistic and its distribution:

$T_j = \frac{\hat\beta_j}{\sqrt{\hat\sigma_E^2\,[(X^T X)^{-1}]_{jj}}} \sim t_{n-(p+1)}$
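The same quantities in a short NumPy/SciPy sketch (simulated data; in practice R's `summary(lm(...))` reports exactly these t values and p-values):

```python
import numpy as np
from scipy.stats import t

# Simulated data; the third coefficient is truly zero.
rng = np.random.default_rng(3)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([2.0, 1.0, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta_hat
sigma2_hat = np.sum(r ** 2) / (n - (p + 1))      # error variance estimate

# Standard error of beta_hat_j: sqrt(sigma2_hat * [(X^T X)^{-1}]_{jj}).
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

T = beta_hat / se                                # one t statistic per coefficient
p_values = 2 * t.sf(np.abs(T), n - (p + 1))      # two-sided p-values

print(p_values.shape)  # one test per coefficient, intercept included
```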

(12)


Individual Parameter Tests

These tests quantify the effect of the predictor $x_j$ on the response $Y$ after having subtracted the linear effect of all other predictor variables on $Y$.

Be careful, because of:

a) The multiple testing problem: when doing many tests, the total type I error increases. By how much: see blackboard.

b) It can happen that none of the individual tests rejects the null hypothesis, although some predictors have a significant effect on the response. Reason: correlated predictors!

(13)


Partial F-Tests

Test the effects of $p-q$ predictors simultaneously!

We divide the model into 2 parts:

$Y = X_1\beta_1 + X_2\beta_2 + E$

so that we can test the hypothesis $H_0: \beta_2 = 0$ versus $H_A: \beta_2 \neq 0$.

We compute the residual sums of squares under both models:

$SSR_{H_0} = \sum_{i=1}^n (y_i - \tilde y_i)^2$ (fitted values $\tilde y_i$ from the small model)

and

$SSR_{H_A} = \sum_{i=1}^n (y_i - \hat y_i)^2$ (fitted values $\hat y_i$ from the full model)

(14)


Partial F-Tests

Test the effects of p-q predictors simultaneously!

The test statistic is

$F = \frac{(SSR_{H_0} - SSR_{H_A})/(p-q)}{SSR_{H_A}/(n-(p+1))} \sim F_{p-q,\ n-(p+1)}$

Where do we need this?

- meteorological variables in the mortality dataset
- later, when we work with factor/dummy variables
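A sketch of the partial F-test on simulated data (NumPy/SciPy, invented values; the small model keeps the first $q$ predictors plus the intercept, the full model all $p$):

```python
import numpy as np
from scipy.stats import f

# Simulated data: only the first predictor truly matters.
rng = np.random.default_rng(4)
n, q, p = 60, 1, 3
x = rng.normal(size=(n, p))
y = 1.0 + 0.9 * x[:, 0] + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x[:, :q]])   # H0 model: first q predictors
X_full  = np.column_stack([np.ones(n), x])          # HA model: all p predictors

def ssr(X, y):
    """Residual sum of squares of the least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

ssr_h0, ssr_ha = ssr(X_small, y), ssr(X_full, y)

# F = [(SSR_H0 - SSR_HA) / (p - q)] / [SSR_HA / (n - (p+1))]
F = ((ssr_h0 - ssr_ha) / (p - q)) / (ssr_ha / (n - (p + 1)))
p_value = f.sf(F, p - q, n - (p + 1))

print(ssr_h0 >= ssr_ha)  # the bigger model never fits worse
```

This is exactly what R's `anova(fit.small, fit.full)` computes for nested models.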

(15)


R-Output

> summary(lm(Mortality ~ log(SO2) + NonWhite + Rain, data = mo…))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 773.0197    22.1852  34.844  < 2e-16 ***
log(SO2)     17.5019     3.5255   4.964 7.03e-06 ***
NonWhite      3.6493     0.5910   6.175 8.38e-08 ***
Rain          1.7635     0.4628   3.811 0.000352 ***
---
Residual standard error: 38.4 on 55 degrees of freedom
Multiple R-squared: 0.641, Adjusted R-squared: 0.6214
F-statistic: 32.73 on 3 and 55 DF, p-value: 2.834e-12

(16)


Interpreting the Result

Does the SO2 concentration affect the mortality?

→ Might be, might not be
→ There are only 3 predictors
→ We could suffer from confounding effects
→ Causality is always difficult, but…

The next step would be to include all predictor variables that are present in the mortality dataset.

(17)


Versatility of Multiple Linear Regression

Many different predictor types are allowed in linear regression:

Continuous predictors
"Standard case", e.g. temperature, distance, pH-value, …

Transformed predictors
For example: $\log(x),\ \sqrt{x},\ \arcsin(x),\ \dots$

Powers
We can also use: $x^{-1},\ x^2,\ x^3,\ \dots$

Categorical predictors
Often used: sex, day of week, political party, …

(18)


Polynomial Regression

Polynomial Regression = Multiple Linear Regression !!!

The model is:

$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + E$

Goals:
- fit a curvilinear relation
- improve the fit between x and Y
- determine the polynomial order d

Example:
- Savings dataset: personal savings ~ income per capita
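Since polynomial regression is just multiple linear regression, the design matrix simply carries the powers of $x$ as columns. A NumPy sketch on invented data, cross-checked against `np.polyfit`:

```python
import numpy as np

# Invented data with a genuinely curvilinear relation.
rng = np.random.default_rng(5)
x = rng.uniform(0, 15, size=50)
y = 5 + 1.8 * x - 0.09 * x ** 2 + rng.normal(size=50)

d = 2
# Design matrix columns: 1, x, x^2 — an ordinary MLR design.
X = np.column_stack([x ** k for k in range(d + 1)])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# np.polyfit fits the same polynomial (coefficients in decreasing order).
beta_polyfit = np.polyfit(x, y, d)[::-1]

print(np.allclose(beta_hat, beta_polyfit, rtol=1e-4))
```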

(19)


Polynomial Regression Fit

[Figure: Savings Data — Polynomial Regression Fit; sr plotted against ddpi]

(20)


Polynomial Regression

Output from the model with the linear term only:

> summary(lm(sr ~ ddpi, data = savings))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.8830     1.0110   7.797 4.46e-10 ***
ddpi          0.4758     0.2146   2.217   0.0314 *
---
Residual standard error: 4.311 on 48 degrees of freedom
Multiple R-squared: 0.0929, Adjusted R-squared: 0.074
F-statistic: 4.916 on 1 and 48 DF, p-value: 0.03139

(21)


Diagnostic Plots

[Figure: diagnostic plots for the linear fit — left: Residuals vs Fitted (conspicuous: Japan, Chile, Zambia); right: Normal Q-Q plot of standardized residuals (conspicuous: Japan, Libya, Chile)]

(22)


Quadratic Regression

Add the quadratic term: $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + E$

> summary(lm(sr ~ ddpi + I(ddpi^2), data = savings))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.13038    1.43472   3.576 0.000821 ***
ddpi         1.75752    0.53772   3.268 0.002026 **
I(ddpi^2)   -0.09299    0.03612  -2.574 0.013262 *
---
Residual standard error: 4.079 on 47 degrees of freedom
Multiple R-squared: 0.205, Adjusted R-squared: 0.1711
F-statistic: 6.059 on 2 and 47 DF, p-value: 0.004559

(23)


Diagnostic Plots: Quadratic Regression

[Figure: diagnostic plots for the quadratic fit — left: Residuals vs Fitted (conspicuous: Chile, Korea, Japan); right: Normal Q-Q plot of standardized residuals (conspicuous: Chile, Korea, Japan)]

(24)


Cubic Regression

Add the cubic term: $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + E$

> summary(lm(sr ~ ddpi + I(ddpi^2) + I(ddpi^3), data = savings))
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.145e+00  2.199e+00   2.340   0.0237 *
ddpi         1.746e+00  1.380e+00   1.265   0.2123
I(ddpi^2)   -9.097e-02  2.256e-01  -0.403   0.6886
I(ddpi^3)   -8.497e-05  9.374e-03  -0.009   0.9928
---
Residual standard error: 4.123 on 46 degrees of freedom
Multiple R-squared: 0.205, Adjusted R-squared: 0.1531
F-statistic: 3.953 on 3 and 46 DF, p-value: 0.01369

(25)


Powers Are Strongly Correlated Predictors!

The smaller the x-range, the bigger the problem!

> cor(cbind(ddpi, ddpi2=ddpi^2, ddpi3=ddpi^3))
          ddpi     ddpi2     ddpi3
ddpi  1.0000000 0.9259671 0.8174527
ddpi2 0.9259671 1.0000000 0.9715650
ddpi3 0.8174527 0.9715650 1.0000000

Way out: use centered predictors!

$z_{i1} = x_i - \bar x,\qquad z_{i2} = (x_i - \bar x)^2,\qquad z_{i3} = (x_i - \bar x)^3$
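The effect of centering can be checked numerically. A NumPy sketch on an invented positive-valued predictor (mimicking the ddpi range):

```python
import numpy as np

# Invented positive-valued predictor: its powers are almost collinear.
x = np.array([1.0, 2.0, 3.5, 4.0, 5.5, 7.0, 8.0, 9.5, 11.0, 13.0, 15.0, 16.7])

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

raw = corr(x, x ** 2)        # correlation between x and x^2 (large)

z = x - x.mean()             # centered predictor
centered = corr(z, z ** 2)   # correlation between z and z^2 (much smaller)

print(abs(centered) < abs(raw))  # centering breaks much of the collinearity
```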

(26)


Powers Are Strongly Correlated Predictors!

> summary(lm(sr ~ z.ddpi + I(z.ddpi^2) + I(z.ddpi^3), data = z.savings))
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.042e+01  8.047e-01  12.946  < 2e-16 ***
z.ddpi       1.059e+00  3.075e-01   3.443  0.00124 **
I(z.ddpi^2) -9.193e-02  1.225e-01  -0.750  0.45691
I(z.ddpi^3) -8.497e-05  9.374e-03  -0.009  0.99281

→ Coefficients, standard errors and tests are different
→ Fitted values and global inference remain the same
→ Not overly beneficial on this dataset!

Be careful: extrapolation with polynomials is dangerous!

(27)


Dummy Variables

So far, we only considered continuous predictors:
- temperature
- distance
- pressure
- …

It is perfectly valid to have categorical predictors, too:
- sex (male or female)
- status variables (employed or unemployed)
- working shift (day, evening, night)
- …

Implementation in the regression with dummy variables

(28)


Example: Binary Categorical Variable

The lathe dataset:
- $Y$: lifetime of a cutting tool in a lathe
- $x_1$: speed of the machine in rpm
- $x_2$: tool type A or B

Dummy variable encoding:

$x_2 = 0$ for tool type A, $\quad x_2 = 1$ for tool type B
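A sketch of the dummy encoding in NumPy (toy numbers, not the real lathe data): the 0/1 column shifts the intercept for tool B while both groups share one slope:

```python
import numpy as np

# Invented lathe-like data: lifetime (hours) ~ rpm + tool type (A/B).
rpm   = np.array([500., 600., 700., 800., 900., 500., 600., 700., 800., 900.])
tool  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
hours = np.array([24., 21., 19., 16., 13., 39., 36., 34., 31., 28.])

# Dummy encoding: x2 = 0 for tool A (reference level), x2 = 1 for tool B.
x2 = (tool == "B").astype(float)

X = np.column_stack([np.ones(len(rpm)), rpm, x2])
beta_hat, *_ = np.linalg.lstsq(X, hours, rcond=None)

# beta_hat[2] is the vertical shift of the tool-B line relative to tool A:
# two parallel lines with common slope beta_hat[1].
print(beta_hat[2] > 0)  # tool B lasts longer at any given rpm in this toy data
```

This is exactly what R does internally when `tool` is a factor: the `toolB` coefficient in the output is this intercept shift.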

(29)


Interpretation of the Model

→ see blackboard…

> summary(lm(hours ~ rpm + tool, data = lathe))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  36.98560    3.51038  10.536 7.16e-09 ***
rpm          -0.02661    0.00452  -5.887 1.79e-05 ***
toolB        15.00425    1.35967  11.035 3.59e-09 ***
---
Residual standard error: 3.039 on 17 degrees of freedom
Multiple R-squared: 0.9003, Adjusted R-squared: 0.8886
F-statistic: 76.75 on 2 and 17 DF, p-value: 3.086e-09

(30)


The Dummy Variable Fit

[Figure: Durability of Lathe Cutting Tools — hours vs. rpm, points labeled by tool type A/B, with the dummy variable fit]

(31)


A Model with Interactions

Question: do the slopes need to be identical?

→ with the appropriate model, the answer is no!
→ see blackboard for model interpretation…

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + E$
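A sketch with invented, noise-free data showing how the interaction term $x_1 x_2$ gives the second group its own slope (NumPy):

```python
import numpy as np

# Invented data where the two groups have clearly different slopes.
x1 = np.tile(np.array([500., 600., 700., 800., 900.]), 2)
x2 = np.repeat([0.0, 1.0], 5)                       # dummy: 0 = type A, 1 = type B
y = np.where(x2 == 0, 40 - 0.02 * x1, 55 - 0.04 * x1)  # exact lines, no noise

# Interaction model: Y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + E
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

slope_A = b[1]         # slope of the reference group (x2 = 0)
slope_B = b[1] + b[3]  # the interaction coefficient b3 adjusts the slope for x2 = 1

print(round(slope_A, 4), round(slope_B, 4))  # -0.02 -0.04
```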

(32)


Different Slope for the Regression Lines

[Figure: Durability of Lathe Cutting Tools: with Interaction — hours vs. rpm, separate regression lines with different slopes for tool types A and B]

(33)


Summary Output

> summary(lm(hours ~ rpm * tool, data = lathe))
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  32.774760   4.633472   7.073 2.63e-06 ***
rpm          -0.020970   0.006074  -3.452  0.00328 **
toolB        23.970593   6.768973   3.541  0.00272 **
rpm:toolB    -0.011944   0.008842  -1.351  0.19553
---
Residual standard error: 2.968 on 16 degrees of freedom
Multiple R-squared: 0.9105, Adjusted R-squared: 0.8937
F-statistic: 54.25 on 3 and 16 DF, p-value: 1.319e-08

(34)


How Complex Does the Model Need to Be?

Question 1: do we need different slopes for the two lines?

$H_0: \beta_3 = 0$ against $H_A: \beta_3 \neq 0$

→ individual parameter test for the interaction term!

Question 2: is there any difference altogether?

$H_0: \beta_2 = \beta_3 = 0$ against $H_A: \beta_2 \neq 0$ and/or $\beta_3 \neq 0$

→ this is a partial F-test
→ we try to exclude interaction and dummy variable together

R offers convenient functionality for these tests!

(35)


Anova Output

Comparing the simple model against the interaction model:

> fit1 <- lm(hours ~ rpm, data = lathe)
> fit2 <- lm(hours ~ rpm * tool, data = lathe)
> anova(fit1, fit2)
Model 1: hours ~ rpm
Model 2: hours ~ rpm * tool
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     18 1282.08
2     16  140.98  2    1141.1 64.755 2.137e-08 ***

→ no different slopes, but a different intercept!

(36)


Categorical Input with More than 2 Levels

There are now 3 tool types A, B, C:

$(x_2, x_3) = (0, 0)$ for observations of type A
$(x_2, x_3) = (1, 0)$ for observations of type B
$(x_2, x_3) = (0, 1)$ for observations of type C

Main effect model:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + E$

With interactions:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + E$

(37)


Three Types of Cutting Tools

[Figure: Durability of Lathe Cutting Tools: 3 Types — hours vs. rpm, points labeled by tool type A/B/C]

(38)


Summary Output

> summary(lm(hours ~ rpm * tool, data = abc.lathe))
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  32.774760   4.496024   7.290 1.57e-07 ***
rpm          -0.020970   0.005894  -3.558  0.00160 **
toolB        23.970593   6.568177   3.650  0.00127 **
toolC         3.803941   7.334477   0.519  0.60876
rpm:toolB    -0.011944   0.008579  -1.392  0.17664
rpm:toolC     0.012751   0.008984   1.419  0.16869
---
Residual standard error: 2.88 on 24 degrees of freedom
Multiple R-squared: 0.8906, Adjusted R-squared: 0.8678
F-statistic: 39.08 on 5 and 24 DF, p-value: 9.064e-11

(39)


Inference with Categorical Predictors

Do not perform individual hypothesis tests on factors!

Question 1: do we have different slopes?

$H_0: \beta_4 = 0$ and $\beta_5 = 0$ against $H_A: \beta_4 \neq 0$ and/or $\beta_5 \neq 0$

Question 2: is there any difference altogether?

$H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ against $H_A$: any of $\beta_2, \beta_3, \beta_4, \beta_5 \neq 0$

→ Again, R provides convenient functionality

(40)


Anova Output

> anova(fit.abc)
Analysis of Variance Table
          Df  Sum Sq Mean Sq F value    Pr(>F)
rpm        1  139.08  139.08 16.7641  0.000415 ***
tool       2 1422.47  711.23 85.7321 1.174e-11 ***
rpm:tool   2   59.69   29.84  3.5974  0.043009 *
Residuals 24  199.10    8.30

→ strong evidence that we need to distinguish the tools!
→ weak evidence for the necessity of different slopes
