
Regression with breakpoints


Thus far, all examples of nonlinear relations involved smooth, continuous functions that we modeled with polynomials or with splines. However, one may also encounter situations in which there is a discontinuity in an otherwise linear relation. An example is a study of the frequency with which years were referenced in the Frankfurter Allgemeine Zeitung [Pollman and Baayen, 2001]. The relevant data are available as the data set faz.
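The faz data set is distributed with languageR, the companion package to this book (assuming here that the package is installed), so it needs to be loaded first:

> library(languageR)   # provides the faz data set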

> head(faz, 3)
  Year Frequency
1 1993     12068
2 1992      6338
3 1991      3791
> tail(faz, 3)
    Year Frequency
798 1196         0
799 1195         1
800 1194         2

For each year in the time period 1993–1194, faz lists the frequency of that year as referenced in this newspaper in 1994. Most of the year references in the issues of 1994 were to the previous year, 1993, followed by 1992, then by 1991, etc. We add a column to faz specifying the distance from 1994,

> faz$Distance = 1:nrow(faz)

and plot log frequency of use as a function of log distance from 1994, as shown in the upper left panel of Figure 6.14.

> plot(log(faz$Distance), log(faz$Frequency + 1),
+   xlab = "log Distance", ylab = "log Frequency")


Figure 6.13: Partial effects of the predictors for the probability of the etymological age of Dutch verbs. (Three panels, each with the probability on the vertical axis: NcountStem, adjusted to Regularity = regular and Denominative = N; Regularity (irregular vs. regular), adjusted to NcountStem = 8 and Denominative = N; Denominative (Den vs. N), adjusted to NcountStem = 8 and Regularity = regular.)


What is of interest in this plot is that there seems to be a linear relation up until approximately a log distance of four. Around the location of the vertical solid line, the slope of the regression line changes fairly abruptly. This suggests that the collective consciousness of events in the past is substantially reduced for events occurring more than a lifetime (some 60 years) ago. The dashed vertical line marks 1945, the end of the Second World War. An alternative explanation of the observed change is therefore that the Second World War is the dividing line between recent and more distant history. In order to evaluate these hypotheses, we need to establish whether there is indeed a sudden change (a significant change in the slope) and, if so, where this discontinuity is located.

Figure 6.14: Breakpoint analysis of the frequency of use of references to years in the Frankfurter Allgemeine Zeitung in 1994 as a function of the distance of the year name from 1994. (Four panels: log Frequency against log Distance; log Frequency against log Shifted Distance; deviance against breakpoint; and log Frequency against log Distance with the fitted two-segment regression line.)


The simplest regression model for these data that takes the discontinuity into account is one with a single linear regression line that changes slope at a so-called BREAKPOINT. Let's assume that the breakpoint is at distance 59. For convenience, we log frequency and distance,

> faz$LogFrequency = log(faz$Frequency + 1)

> faz$LogDistance = log(faz$Distance)

> breakpoint = log(59)

and then shift all the data points leftwards along the horizontal axis, so that the breakpoint coincides with the vertical axis. This is shown in the upper right panel of Figure 6.14.

> faz$ShiftedLogDistance = faz$LogDistance - breakpoint

> plot(faz$ShiftedLogDistance, faz$LogFrequency,

+ xlab = "log Shifted Distance", ylab = "log Frequency")

We can now fit two regression models to the data, one for the data points to the left of the vertical axis, and one for the data points to its right. As can be seen in the upper right panel of Figure 6.14, the two lines cross the vertical axis at nearly the same place.

> faz.left = lm(LogFrequency ~ ShiftedLogDistance,
+   data = faz[faz$ShiftedLogDistance <= 0, ])

> faz.right = lm(LogFrequency ~ ShiftedLogDistance,
+   data = faz[faz$ShiftedLogDistance >= 0, ])

> abline(faz.left, lty = 1)

> abline(faz.right, lty = 2)
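As a quick numerical check of this visual impression, one can compare the intercepts of the two fits directly; because the data were shifted, each intercept is the predicted log frequency at the breakpoint (a sketch; the values themselves are not reported here):

> coef(faz.left)["(Intercept)"]    # where the left line crosses the vertical axis
> coef(faz.right)["(Intercept)"]   # where the right line crosses, nearly the same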

What we need to do is to integrate these two models into a single regression model. We do this by introducing an INDICATOR VARIABLE that specifies whether the shifted log distance is greater than zero,

> faz$PastBreakPoint = as.factor(faz$ShiftedLogDistance > 0)

and by constructing a model in which the only term in the formula is the interaction of ShiftedLogDistance with this indicator variable PastBreakPoint:

> faz.both = lm(LogFrequency ~ ShiftedLogDistance:PastBreakPoint,
+   data = faz)

Normally, one would not include an interaction without including the main effects, but in this special case we do not want these main effects to be present. To see why, consider the table of coefficients in the summary:

> summary(faz.both)
...
Residuals:
     Min       1Q   Median       3Q      Max
-1.76242 -0.31593 -0.02271  0.34838  1.87073


Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                         5.52596    0.05434  101.70   <2e-16
ShiftedLogDist:PastBreakPointFALSE -0.84124    0.06460  -13.02   <2e-16
ShiftedLogDist:PastBreakPointTRUE  -1.88383    0.02872  -65.60   <2e-16

Residual standard error: 0.5705 on 797 degrees of freedom
Multiple R-Squared: 0.8898,     Adjusted R-squared: 0.8895
F-statistic: 3218 on 2 and 797 DF,  p-value: < 2.2e-16

We have three coefficients, one for the intercept, one for the slope when we are to the left of the breakpoint, and one for when we are to the right of the breakpoint. Since the intercept represents the frequency when the shifted distance is zero, we have succeeded in building a model that combines the first half of the solid line in the upper right panel with the second half of the dashed line. An anova test comparing this model with a model with just a simple regression line shows that the extra parameter for modeling the breakpoint is justified.

> anova(faz.both, lm(LogFrequency ~ ShiftedLogDistance, data = faz))
Analysis of Variance Table

Model 1: LogFrequency ~ ShiftedLogDistance:PastBreakPoint
Model 2: LogFrequency ~ ShiftedLogDistance
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1    797 259.430
2    798 312.945 -1   -53.515 164.41 < 2.2e-16
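It is now also easier to see why the main effects were deliberately left out. Because both interaction terms share the single intercept, the two line segments are forced to meet where the shifted distance is zero, i.e. exactly at the breakpoint. Adding the main effect of PastBreakPoint would give the second segment its own intercept adjustment, so the fitted line could jump discontinuously at the breakpoint. A minimal sketch of that alternative model, shown only for contrast with the model actually used here:

> # separate intercepts and slopes per segment: permits a jump at the breakpoint
> faz.disc = lm(LogFrequency ~ ShiftedLogDistance * PastBreakPoint,
+   data = faz)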

Up till now, we have worked with one sensible breakpoint, but we still need to ascertain what the most likely breakpoint is. To do so, we fit a series of models, one for each possible breakpoint. For each model, we calculate the deviance, the sum of the squared differences between the observed and the fitted values:

> sum((fitted(faz.both) - faz$LogFrequency)^2)
[1] 259.4298

> deviance(faz.both)
[1] 259.4298

The following lines of code implement this idea. We begin by creating a vector in which we store the deviances for the models. We then loop over all sensible breakpoints, and carry out the same sequence of steps as above.

> deviances = rep(0, nrow(faz) - 1)
> for (pos in 1 : (nrow(faz) - 1)) {
+   breakpoint = log(pos)
+   faz$ShiftedLogDistance = faz$LogDistance - breakpoint
+   faz$PastBreakPoint = as.factor(faz$ShiftedLogDistance > 0)
+   faz.both = lm(LogFrequency ~ ShiftedLogDistance:PastBreakPoint,
+     data = faz)
+   deviances[pos] = deviance(faz.both)
+ }
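The same search can be written more compactly with sapply(), which collects the per-breakpoint deviances without an explicit result vector (a sketch that is equivalent to the loop above; the data frame is only modified inside the function):

> deviances = sapply(1 : (nrow(faz) - 1), function(pos) {
+   faz$ShiftedLogDistance = faz$LogDistance - log(pos)
+   faz$PastBreakPoint = as.factor(faz$ShiftedLogDistance > 0)
+   deviance(lm(LogFrequency ~ ShiftedLogDistance:PastBreakPoint,
+     data = faz))
+ })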

We select the breakpoint for which the deviance is smallest,

> best = which(deviances == min(deviances))

> best
[1] 58

> breakpoint = log(best)

and refit the model one last time for this breakpoint.

> faz$ShiftedLogDistance = faz$LogDistance - breakpoint

> faz$PastBreakPoint = as.factor(faz$ShiftedLogDistance > 0)

> faz.both = lm(LogFrequency ~ ShiftedLogDistance:PastBreakPoint,
+   data = faz)
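As an aside, base R's which.min() returns the index of the minimum directly, so the selection above can be abbreviated:

> best = which.min(deviances)   # as which(deviances == min(deviances)), first index in case of ties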

We now add the lower panels to Figure 6.14.

> plot(log(1:length(deviances)), deviances, type = "l",
+   xlab = "breakpoint", ylab = "deviance")

> plot(faz$LogDistance, faz$LogFrequency,
+   xlab = "log Distance", ylab = "log Frequency", col = "darkgrey")

> lines(faz$LogDistance, fitted(faz.both))

Note that the final plot has the unshifted distances on the horizontal axis, and the fitted values (obtained for the shifted values) on the vertical axis. (A moment's thought should reveal why this is legitimate: the shift is a constant horizontal translation, so each fitted value still belongs to the same observation.) The breakpoint is at distance 58 from 1994, in 1936, so this suggests that the change in historical consciousness is located well before the beginning of the Second World War.
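The year corresponding to the optimal breakpoint can be computed directly from the distance:

> 1994 - best
[1] 1936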

A second example illustrating the use of indicator variables addresses changes in the frequency with which constructions with periphrastic do were used in English from the end of the fourteenth to the end of the sixteenth century. Ellegård [1953] studied the use of periphrastic do in 107 texts. Counts of periphrastic do for four sentence types are available as the data set periphrasticDo.

> head(periphrasticDo)
  begin  end    type   do other
1  1390 1425 affdecl   17 49583
2  1425 1475 affdecl  121 45379
3  1475 1500 affdecl 1059 58541
4  1500 1525 affdecl  396 28204
5  1525 1535 affdecl  494 18306
6  1535 1550 affdecl 1564 17636

> table(periphrasticDo$type)
 affdecl affquest  negdecl negquest
      11       11       11       11


The columns begin and end list the beginning and end of the period for which Ellegård counted the occurrences of do and other constructions for affirmative declarative sentences (affdecl), affirmative questions (affquest), negative declarative sentences (negdecl), and negative questions (negquest). Figure 6.15 shows, for each sentence type, the observed proportion of sentences with periphrastic do for the midpoints of each time period. Except for affirmative declarative sentences, the use of periphrastic do increased over the years.
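These observed proportions are simply the do counts relative to all sentences of the given type in the given period; they can be added as a column for inspection (an illustrative helper column, not used by the models below):

> periphrasticDo$proportion = with(periphrasticDo,
+   do / (do + other))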

Figure 6.15: The relative frequency of periphrastic do in four sentence types (affdecl, affquest, negdecl, negquest) across three centuries, plotted as proportion against year. Circles represent observed relative frequencies, dotted and solid lines a regression model without and with an indicator variable adjusting for the fifteenth century.

The curve for affirmative questions has been analyzed with a logistic regression model by Kroch [1989]; see Vulanović and Baayen [2006] for further references to studies that propose models for subsets of the sentence types. The question considered by the latter study is whether a single model can be fitted to the data of all four sentence types. After all, each sentence type shows a pattern of linguistic change, including the affirmative declarative sentences, for which the change did not carry through.

Since we are dealing with binary data (counts of sentences with and without periphrastic do) in tabular format, we use glm() and allow points of inflection into the curves by using both quadratic and cubic polynomial terms, which we allow to interact with sentence type.

> periphrasticDo$year = periphrasticDo$begin +
+   (periphrasticDo$end - periphrasticDo$begin)/2   # midpoints
> periphrasticDo.glm = glm(cbind(do, other) ~
+   (year + I(year^2) + I(year^3)) * type,
+   data = periphrasticDo, family = "binomial")

> summary(periphrasticDo.glm)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-18.4741   -1.7182   -0.1357    1.7668   14.8644

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)            -4.901e+02  2.163e+02  -2.266   0.0235
year                    6.024e-01  4.167e-01   1.445   0.1483
I(year^2)              -1.759e-04  2.675e-04  -0.658   0.5107
I(year^3)              -6.345e-09  5.720e-08  -0.111   0.9117
typeaffquest           -6.073e+02  9.088e+02  -0.668   0.5040
typenegdecl            -4.009e+03  7.325e+02  -5.473 4.42e-08
typenegquest           -8.083e+02  1.229e+03  -0.658   0.5106
year:typeaffquest       1.328e+00  1.726e+00   0.769   0.4418
year:typenegdecl        7.816e+00  1.392e+00   5.613 1.99e-08
year:typenegquest       1.790e+00  2.365e+00   0.757   0.4492
I(year^2):typeaffquest -9.591e-04  1.092e-03  -0.878   0.3800
I(year^2):typenegdecl  -5.078e-03  8.816e-04  -5.760 8.43e-09
I(year^2):typenegquest -1.299e-03  1.517e-03  -0.856   0.3918
I(year^3):typeaffquest  2.298e-07  2.303e-07   0.998   0.3183
I(year^3):typenegdecl   1.100e-06  1.860e-07   5.915 3.32e-09
I(year^3):typenegquest  3.111e-07  3.241e-07   0.960   0.3370

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20431.1 on 43 degrees of freedom
Residual deviance:  1236.0 on 28 degrees of freedom
AIC: 1504.6

Since the residual deviance is much larger than the corresponding degrees of freedom, we have overdispersion, so we use the F-test to evaluate the significance of the interactions, following Crawley [2002].
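A quick way to quantify the overdispersion is the ratio of the residual deviance to its residual degrees of freedom, which should be near 1 for a well-behaved binomial model (a sketch using the values reported in the summary above):

> deviance(periphrasticDo.glm) / df.residual(periphrasticDo.glm)
[1] 44.14286   # 1236.0 / 28, far above 1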

> anova(periphrasticDo.glm, test = "F")
Analysis of Deviance Table


Model: binomial, link: logit

Terms added sequentially (first to last)

               Df Deviance Resid. Df Resid. Dev        F    Pr(>F)
NULL                              43    20431.1
year            1   6302.2        42    14128.9 6302.225 < 2.2e-16
I(year^2)       1   4085.6        41    10043.3 4085.613 < 2.2e-16
I(year^3)       1     31.3        40    10012.0   31.321 2.187e-08
type            3   7810.5        37     2201.4 2603.510 < 2.2e-16
year:type       3    750.9        34     1450.5  250.296 < 2.2e-16
I(year^2):type  3    173.3        31     1277.2   57.767 < 2.2e-16
I(year^3):type  3     41.3        28     1236.0   13.754 5.752e-09

The dotted lines in Figure 6.15 show that this model captures the main trends for all sentence types, but the fit is rather poor, especially for the negative questions. In order to improve the fit, we note that there is very little development during the fifteenth century.

We therefore create an indicator variable that is zero for the first three time periods, and one for the remaining periods.

> periphrasticDo$Indicator = rep(c(rep(0, 3), rep(1, 8)), 4)
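The assignment above presupposes that the rows are grouped by sentence type, with the eleven time periods in chronological order within each group, as the output of head() above suggests. A quick cross-tabulation can verify this (one would expect three 0s and eight 1s in every type column):

> table(periphrasticDo$Indicator, periphrasticDo$type)   # should show 3 and 8 per type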

> periphrasticDo.glmA = glm(cbind(do, other) ~
+   (year + I(year^2) + I(year^3)) * type +
+   Indicator * type + Indicator * year,
+   data = periphrasticDo, family = "binomial")

The anova summary shows that the indicator variable is significant,

> anova(periphrasticDo.glmA, test = "F")

               Df Deviance Resid. Df Resid. Dev        F    Pr(>F)
NULL                              43    20431.1
year            1   6302.2        42    14128.9 6302.225 < 2.2e-16
I(year^2)       1   4085.6        41    10043.3 4085.613 < 2.2e-16
I(year^3)       1     31.3        40    10012.0   31.321 2.187e-08
type            3   7810.5        37     2201.4 2603.510 < 2.2e-16
Indicator       1    174.7        36     2026.8  174.663 < 2.2e-16
year:type       3    717.0        33     1309.8  238.990 < 2.2e-16
I(year^2):type  3    199.9        30     1109.9   66.636 < 2.2e-16
I(year^3):type  3     46.1        27     1063.8   15.359 5.459e-10
type:Indicator  3     48.2        24     1015.6   16.081 1.891e-10
year:Indicator  1    485.8        23      529.8  485.820 < 2.2e-16

so it does indeed make sense to allow coefficients to change when going from the fifteenth century to the next two centuries. The solid lines in Figure 6.15 show that the new model is superior to the old model for all sentence types, with the exception of the affirmative declaratives, for which there is no improvement that is visible to the eye.

Compared to previous models proposed in the literature, the present model has the advantage of fitting all sentence types simultaneously. This brings out a similarity between the two types of declarative clauses. For both, an initial increase is followed by a decrease that perseveres in the case of affirmative sentences, but that is followed by a slight increase in the case of negative declaratives. For further discussion of the mathematics of the functional considerations motivating these patterns of language change, see Vulanović and Baayen [2006].

At this point, you might be asking yourself whether we are overfitting the data, with 21 coefficients for 4 sentence types with 11 time points each. The rule of thumb given by Harrell [2001:61] is that for logistic models, the number of coefficients should be smaller than the total number of observations with the minority outcome, divided by 20. For the present data,

> min(apply(periphrasticDo[, c("do", "other")], 2, sum))
[1] 9483

the 9483 observations for the less frequent outcome (do) far exceed the number of parameters (21) multiplied by 20, so we are doing fine.
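The check itself is a one-liner; 21 coefficients would require at least 21 × 20 = 420 minority-outcome observations:

> 9483 > 21 * 20
[1] TRUE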

Figure 6.15 was made by looping over the levels of sentence type in order to create the successive panels.

> periphrasticDo$predict = predict(periphrasticDo.glm, type = "response")
> periphrasticDo$predictA = predict(periphrasticDo.glmA, type = "response")
> par(mfrow = c(2, 2))
> for (i in 1:nlevels(periphrasticDo$type)) {
+   subset = periphrasticDo[periphrasticDo$type ==
+     levels(periphrasticDo$type)[i], ]
+   plot(subset$year, subset$do/(subset$do + subset$other),
+     type = "p", ylab = "proportion", xlab = "year",
+     ylim = c(0, 1), xlim = c(1400, 1700))
+   mtext(levels(periphrasticDo$type)[i], line = 2)
+   lines(subset$year, subset$predict, lty = 3)    # model without indicator
+   lines(subset$year, subset$predictA, lty = 1)   # model with indicator
+ }
