Regression Exercise
Christopher Nowzohour
09.04.2014
Regression: Line Fitting

y = Xβ + ε

  y:  (n×1)-vector of observations of the dependent variable
  X:  (n×p)-matrix of observations of the independent variables
      (one column per variable, first column constant)
  β:  (p×1)-vector of parameters
  ε:  (n×1)-vector of errors

Goals:
1. Prediction: accurately predict y for new X
2. Statistical inference: how confident are we about the parameter values β?
3. Causal inference: can we change y by changing X?
   - Careful: extra assumptions are needed to make causal statements
     (e.g. no hidden variables, known causal direction)
   - Otherwise: confounding, Simpson's paradox, ...
Christopher Nowzohour Regression Exercise 09.04.2014 2 / 9
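The model above can be made concrete with a small simulation (a minimal sketch assuming NumPy is available; all names and numbers are illustrative):

```python
import numpy as np

# Simulate data from the model y = X*beta + eps.
rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n),              # first column constant (intercept)
                     rng.uniform(0, 10, n)])  # one predictor column
beta = np.array([1.0, 2.0])                   # (p x 1)-vector of parameters
eps = rng.normal(0.0, 1.0, n)                 # (n x 1)-vector of errors
y = X @ beta + eps                            # (n x 1)-vector of observations

print(X.shape, y.shape)  # (100, 2) (100,)
```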
Fitting criteria: three examples

What are "good" parameter estimates β̂?

1. Small squared residuals (L2 regression / least squares):
   β̂_L2 = argmin_β ‖y − Xβ‖²₂ = argmin_β Σᵢ₌₁ⁿ (yᵢ − xᵢ·β)²
2. Small absolute residuals (L1 regression / robust regression):
   β̂_L1 = argmin_β ‖y − Xβ‖₁ = argmin_β Σᵢ₌₁ⁿ |yᵢ − xᵢ·β|
3. Maximum likelihood (with error density f):
   β̂_ML = argmax_β Σᵢ₌₁ⁿ log f(yᵢ − xᵢ·β)
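The three criteria can be written down directly as loss functions (a sketch assuming NumPy; the function names are my own, and f is taken to be the N(0, σ²) density for the ML criterion):

```python
import numpy as np

# Each criterion as a loss function of beta (function names are illustrative).
def l2_loss(beta, X, y):
    r = y - X @ beta
    return np.sum(r ** 2)            # sum of squared residuals

def l1_loss(beta, X, y):
    r = y - X @ beta
    return np.sum(np.abs(r))         # sum of absolute residuals

def gaussian_loglik(beta, X, y, sigma=1.0):
    # sum_i log f(y_i - x_i . beta) with f the N(0, sigma^2) density
    r = y - X @ beta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2) - r ** 2 / (2 * sigma ** 2))

X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])   # roughly y = 1 + 2x
good, bad = np.array([1.0, 2.0]), np.array([0.0, 0.0])
print(l2_loss(good, X, y) < l2_loss(bad, X, y))                  # True
print(gaussian_loglik(good, X, y) > gaussian_loglik(bad, X, y))  # True
```

Note how the L2 loss and the Gaussian log-likelihood rank candidate parameter vectors identically, which is exactly the equivalence stated on the next slide.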
Finding optimal parameters β̂

1. Small squared residuals (L2 regression / least squares):
   Setting the gradient to zero,
   ∇_β ‖y − Xβ‖²₂ = −2Xᵀ(y − Xβ) = 0,
   gives β̂_L2 = (XᵀX)⁻¹Xᵀy
2. Small absolute residuals (L1 regression / robust regression):
   - No analytic solution possible :-(
   - But numerical optimization works in practice (e.g. gradient descent)
3. Maximum likelihood:
   - If ε ∼ N_n(0, σ²I_{n×n}) for some σ > 0, then β̂_ML = β̂_L2!
   - In general: can be difficult (→ numerical optimization)
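A minimal sketch of both computations (assuming NumPy and SciPy are available; the data are simulated). Solving the normal equations as a linear system, rather than forming (XᵀX)⁻¹ explicitly, is the numerically preferable route:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

# L2: closed form from the normal equations X^T X beta = X^T y.
beta_l2 = np.linalg.solve(X.T @ X, X.T @ y)

# L1: no closed form -> numerical optimization, started from the L2 solution.
# Nelder-Mead is a reasonable choice here since the objective is non-smooth.
beta_l1 = minimize(lambda b: np.sum(np.abs(y - X @ b)), beta_l2,
                   method="Nelder-Mead").x

print(beta_l2, beta_l1)  # both should be close to the true (1, 2)
```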
Typical Assumptions

In descending order of importance:
1. Our sample (X, y) is representative of the population
2. X has full column rank (n ≥ p and no collinear predictors)
3. Unbiased errors: E[εᵢ] = 0 ∀i
4. Uncorrelated errors: E[εᵢεⱼ] = 0 ∀i, j with i ≠ j
5. Exactly measured (but possibly still random) covariates X
6. Constant error variance: E[εᵢ²] = σ² ∀i
7. Jointly Gaussian errors: ε ∼ N

Assumptions 3, 4, 6, 7 are often summarized as ε ∼ N_n(0, σ²I_{n×n})
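Assumption 2 is the easiest to check mechanically (a sketch assuming NumPy; the matrices are made up for illustration):

```python
import numpy as np

# A collinear predictor makes X^T X singular, so beta_L2 is not identifiable.
X_ok  = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])
X_bad = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0],
                         [2.0, 4.0, 6.0, 8.0]])   # third column = 2 * second

print(np.linalg.matrix_rank(X_ok))   # 2 == p: full column rank
print(np.linalg.matrix_rank(X_bad))  # 2  < p = 3: rank deficient
```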
Properties of β̂_L2

If ε ∼ N_n(0, σ²I_{n×n}), then the following hold:
1. Unbiasedness: E[β̂_L2] = β
2. Minimal variance among all unbiased estimators (the Gauss–Markov Theorem
   gives this among linear estimators; the Gaussian assumption extends it to
   all unbiased estimators)
3. β̂_L2 ∼ N_p(β, σ²(XᵀX)⁻¹), and β̂_L2 is independent of σ̂²
   - t-tests for components of β̂_L2 possible
   - F-test for β̂_L2 as a whole possible
   - Confidence interval for E[y₀|x₀] and prediction interval for y₀
     possible (where y₀ is a new observation at x₀)
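Property 3 is what makes the usual t-tests work; the quantities involved can be computed by hand (a sketch assuming NumPy/SciPy; data and names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.5])
y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)                        # unbiased sigma^2 estimate
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))  # std. errors of beta_hat

# t-statistics and two-sided p-values for H0: beta_j = 0
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)
print(t_stats, p_values)
```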
What happens if assumptions fail?

1. Non-representative sample: cannot make inferences about the population
2. XᵀX not invertible: cannot compute β̂_L2
3. Biased errors:
   - β̂_L2 will be biased
   - → Transformations? More predictors?
4. Correlated errors:
   - Wrong p-values & confidence intervals
   - Estimator less precise (higher variance)
   - → Generalized Least Squares
5. Noisy covariates: β̂_L2 will be biased
6. Non-constant error variance:
   - Estimator less precise (higher variance)
   - → Generalized Least Squares, transformations?
7. Non-normal errors:
   - Only the weak version of the Gauss–Markov Theorem holds
   - β̂_L2 is only approximately Gaussian (under weak assumptions on X),
     hence slightly wrong p-values & confidence intervals
   - → Transformations?
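Point 6 and the Generalized Least Squares remedy can be sketched as follows (assuming NumPy and, unrealistically, that the error standard deviations w_i are known; all names are illustrative). With error covariance Σ = diag(w_i²), GLS rescales each row by 1/w_i and then applies ordinary least squares, which amounts to β̂_GLS = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹y:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])
w = x                                    # error std. dev. grows with x (heteroskedastic)
y = X @ np.array([1.0, 2.0]) + w * rng.normal(size=n)

# Whiten rows by the known std. devs., then run ordinary least squares.
Xw, yw = X / w[:, None], y / w
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_gls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
print(beta_ols, beta_gls)  # both unbiased; GLS is the more precise estimator
```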
Confidence and Prediction intervals / bands

95% confidence band: the area that includes the true regression line
E[y|x] with 95% probability.

95% prediction band: the area that includes a new observation (x₀, y₀)
with 95% probability.
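At a single new point x₀ the two bands reduce to the standard intervals, which can be computed by hand (a sketch assuming NumPy/SciPy; the prediction interval differs from the confidence interval only by the extra "+1" for the new observation's own error):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
XtX_inv = np.linalg.inv(X.T @ X)
tq = stats.t.ppf(0.975, df=n - p)

x0 = np.array([1.0, 5.0])        # new point, intercept term included
yhat0 = x0 @ beta_hat
h = x0 @ XtX_inv @ x0            # the term x0^T (X^T X)^-1 x0

ci = yhat0 + np.array([-1, 1]) * tq * np.sqrt(sigma2_hat * h)        # for E[y0|x0]
pi = yhat0 + np.array([-1, 1]) * tq * np.sqrt(sigma2_hat * (1 + h))  # for y0 itself
print(ci, pi)  # the prediction interval is always the wider of the two
```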
Diagnostic Plots

Tukey–Anscombe plot: residuals against fitted values
- Check for bias in the errors
- Check for correlated errors
- Check for non-constant error variance

QQ plot: theoretical Gaussian quantiles against empirical quantiles of the residuals
- Check for non-Gaussian errors
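The quantities behind both plots can be computed without any plotting library (a sketch assuming NumPy/SciPy; one would normally pass `fitted`, `resid`, and the QQ data to a scatter plot):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta_hat
resid = y - fitted

# Tukey-Anscombe plot would show (fitted, resid); with an intercept in the
# model, the residuals average exactly zero, and a good fit shows no trend.
print(resid.mean())

# QQ plot data: theoretical Gaussian quantiles vs. sorted residuals
# (scipy.stats.probplot computes exactly this; a straight line, i.e. a
# correlation r near 1, indicates approximately Gaussian errors).
(theo_q, emp_q), (slope, intercept, r) = stats.probplot(resid)
print(r)
```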