
6.2 Ordinary least squares regression

6.2.2 Collinearity

The ideal data set for multiple regression is one in which all the predictors are uncorrelated. Severe problems may arise if the predictors enter into strong correlations, a phenomenon known as COLLINEARITY [Belsley et al., 1980]. A metaphor for understanding the problem posed by collinearity builds on Figure 6.6. The ideal situation is shown to the left. The variance to be explained is represented by the square. The small circles represent the part of the variance captured by four predictors. In the situation shown on the left, each predictor captures its own unique portion of the variance. In this case, the predictors are said to be ORTHOGONAL: they are uncorrelated. The situation depicted to the right illustrates collinear predictors. There is little variance that is captured by just one predictor.

Instead, almost the same part of the variance is captured by all four predictors. Hence, it becomes difficult to tease the explanatory values of these predictors apart.

Collinearity is generally assessed by means of the condition number κ. The greater the collinearity, the closer the matrix of predictors is to becoming SINGULAR. When a matrix is singular, the problem that arises is similar to attempting to divide a number by zero: the operation is not defined.


[Figure 6.5 here: three panels plotting partial effects on RTlexdec against WrittenFrequency (adjusted to AgeSubject = old, LengthInLetters = 4), against LengthInLetters (adjusted to WrittenFrequency = 4.832, AgeSubject = old), and against AgeSubject (adjusted to WrittenFrequency = 4.832, LengthInLetters = 4).]

Figure 6.5: The partial effects according to model english.olsE. As the vertical axes are all on the same scale, the huge differences in the sizes of the effects are clearly visible.



Figure 6.6: Orthogonal (left) and collinear (right) predictors.

The condition number estimates the extent to which a matrix is singular, i.e., how close the task of estimating the parameters is to being unsolvable. R provides the function kappa() for estimating the condition number, but we calculate κ with collin.fnc(), following Belsley et al. [1980]. These authors argue that not only the predictors but also the intercept should be taken into account when evaluating the condition number. When the condition number is between 0 and 6, there is no collinearity to speak of. Medium collinearity is indicated by condition numbers around 15, and condition numbers of 30 or more indicate potentially harmful collinearity.

In order to assess the collinearity of our lexical predictors, we first remove word duplicates from the english data frame by selecting those rows that concern the young age group. We then apply collin.fnc() to the resulting data matrix of items, restricted to the columns of the 23 numerical variables in which we are interested (columns 7 through 29 of our data frame). From the list of objects returned by collin.fnc(), we select the condition number with the $ operator.

> collin.fnc(english[english$AgeSubject == "young", ], 7:29)$cnumber
[1] 132.0727

Note that the second argument to collin.fnc() specifies the columns to be selected from the data frame specified as its first argument. A condition number as high as 132 indicates that it makes no sense to consider these 23 predictors jointly in a multiple regression model. Too many variables tell the same story. The numerical algorithms used to estimate the coefficients may even run into problems with machine precision.
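To see what goes into this number, here is a minimal sketch of the Belsley-style calculation: scale the columns of the design matrix, including a column for the intercept, to unit length, and take the ratio of the largest to the smallest singular value. Assuming that collin.fnc() follows this convention, the result should agree with the value reported above up to numerical precision.

> X = cbind(Intercept = 1,
+     as.matrix(english[english$AgeSubject == "young", 7:29]))
> X = apply(X, 2, function(v) v / sqrt(sum(v^2)))  # scale columns to unit length
> d = svd(X)$d                                     # singular values
> max(d) / min(d)                                  # the condition number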

As a first step towards addressing this problem, we visualize the correlational structure of our predictors. In section 5.1.4 we studied this correlational structure with the help of hierarchical clustering. The Design package provides a convenient function for visualizing clusters of variables, varclus(), that obviates intermediate steps.

> plot(varclus(as.matrix(english[english$AgeSubject == "young", 7:29])))

The varclus() function carries out a hierarchical cluster analysis, using the square of Spearman's rank correlation as a similarity metric to obtain a more robust insight into the correlational structure of (possibly nonlinear) predictors. Figure 6.7 shows that there are several groups of tightly correlated predictors.



Figure 6.7: Hierarchical clustering of 23 predictors in the english data set, using the square of Spearman's rank correlation as similarity measure.

For instance, the leftmost cluster brings together six correlated measures for orthographic consistency, which subdivide by whether they are based on token counts (the left subcluster, with variable names ending in N) or on type counts (the right subcluster, with names ending in V).

There are several strategies that one can pursue to reduce collinearity. The simplest strategy is to select one variable from each cluster. The problem with this strategy is that we may be throwing out information that is actually useful. Belsley et al. [1980] give as an example an entrance test gauging skills in mathematics and physics. Normally, grades for these subjects will be correlated, and one could opt for looking only at the grades for physics. But some students might like only math, and basing a selection criterion on the grades for physics would exclude students with excellent grades for math but low grades for physics. In spite of this consideration, one may have theoretical reasons for selecting one variable from a cluster. For instance, FamilySize and DerivationalEntropy are measures that are mathematically related and that gauge the same phenomenon. As we are not interested in which of the two is superior in this study, we select one.
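To make the redundancy concrete, one can compute the squared Spearman correlation (the similarity measure used by varclus()) for a pair of variables from the same cluster, for instance FamilySize and DerivationalEntropy:

> young = english[english$AgeSubject == "young", ]
> cor(young$FamilySize, young$DerivationalEntropy, method = "spearman")^2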

In the case of our 10 measures for orthographic consistency, we can do more.


We can orthogonalize these predictors using principal components analysis, a technique that was introduced in Chapter 5. Columns 18 through 27 contain the orthographic consistency measures for our words, and just for these 10 variables by themselves the condition number is already quite large:

> collin.fnc(english[english$AgeSubject == "young", ], 18:27)$cnumber
[1] 49.05881

We reduce these 10 correlated predictors to 4 uncorrelated, orthogonal predictors as follows. With prcomp() we create a principal components object. Next, we inspect the proportions of variance explained by the successive principal components.

> items = english[english$AgeSubject == "young",]

> items.pca = prcomp(items[ , c(18:27)], center = T, scale = T)

> summary(items.pca)
Importance of components:
                         PC1   PC2   PC3    PC4    PC5 ...
Standard deviation     2.087 1.489 1.379 0.9030 0.5027 ...
Proportion of Variance 0.435 0.222 0.190 0.0815 0.0253 ...
Cumulative Proportion  0.435 0.657 0.847 0.9288 0.9541 ...

The first four PCs each capture more than 5% of the variance, and jointly they account for 93% of the variance:

> sum((items.pca$sdev^2 / sum(items.pca$sdev^2))[1:4])
[1] 0.9288

They are therefore excellent candidates for replacing the 10 original consistency measures. Inspection of the rotation matrix provides insight into the relation between the original and the new variables. For instance, sorting the rotation matrix by PC4 shows that this component distinguishes between the token-based and the type-based measures.

> x = as.data.frame(items.pca$rotation[,1:4])

> x[order(x$PC4), ]

                   PC1         PC2         PC3         PC4
ConfriendsN 0.37204438 -0.28143109  0.07238358 -0.44609099
ConspelN    0.38823175 -0.22604151 -0.15599471 -0.40374288
ConphonN    0.40717952  0.17060014  0.07058176 -0.35127339
ConfbN      0.24870639  0.52615043  0.06499437 -0.06059884
ConffN      0.10793431  0.05825320 -0.66785576  0.05538818
ConfbV      0.25482902  0.52696962  0.06377711  0.10447280
ConffV      0.09828443  0.03862766 -0.67055578  0.13298443
ConfriendsV 0.33843465 -0.35438183  0.20236240  0.38326779
ConphonV    0.38450345  0.22507258  0.13966044  0.38454580
ConspelV    0.36685237 -0.32393895 -0.03194922  0.42952573

The principal components themselves are available in items.pca$x. That there is indeed no collinearity among these four principal components can be verified by application of collin.fnc():


> collin.fnc(items.pca$x, 1:4)$cnumber
[1] 1
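Equivalently, because the principal component scores are mutually uncorrelated by construction, their correlation matrix is, up to rounding, the identity matrix:

> round(cor(items.pca$x[, 1:4]), 4)   # off-diagonal entries are zero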

Finally, we add these four principal components to our data, first for the young age group, and then for the old age group. We then combine the two data frames into an expanded version of the original data frame english with the help of rbind(), which binds vectors or data frames row-wise.

> items$PC1 = items.pca$x[,1]

> items$PC2 = items.pca$x[,2]

> items$PC3 = items.pca$x[,3]

> items$PC4 = items.pca$x[,4]

> items2 = english[english$AgeSubject != "young", ]

> items2$PC1 = items.pca$x[,1]

> items2$PC2 = items.pca$x[,2]

> items2$PC3 = items.pca$x[,3]

> items2$PC4 = items.pca$x[,4]

> english2 = rbind(items, items2)

Sometimes, simpler solutions are possible. For the present data, one question of interest concerned the potential consequences of the frequency of use of a word as a noun or as a verb (e.g., the work, to work). Including two correlated frequency vectors is not advisable. As a solution, we include as a predictor the difference of the log frequency of the noun and that of the verb. (This is mathematically equivalent to considering the log of the ratio of the unlogged nominal and verbal frequencies, each incremented by one to avoid zero counts.) With this new predictor, we can investigate whether it matters whether a word is used more often as a noun or more often as a verb.

> english2$NVratio =

+ log(english2$NounFrequency+1) - log(english2$VerbFrequency+1)
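A quick check, which should return TRUE, confirms that this difference of logs is indeed the log of the ratio of the two frequencies, each incremented by one:

> all.equal(english2$NVratio,
+     log((english2$NounFrequency + 1) / (english2$VerbFrequency + 1)))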

Similarly, the frequencies of use in written and spoken language can be brought together in a ratio, WrittenSpokenFrequencyRatio, which is already available in the data frame.
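The reduction in collinearity among the frequency measures reported below can be checked with collin.fnc(). The following sketch assumes that the four original frequency measures are WrittenFrequency, WrittenSpokenFrequencyRatio, NounFrequency, and VerbFrequency, and again uses the young-subject subset of items; whether it reproduces the reported values exactly depends on which transformations of the noun and verb frequencies entered the original comparison.

> young = english2[english2$AgeSubject == "young", ]
> collin.fnc(young[, c("WrittenFrequency", "WrittenSpokenFrequencyRatio",
+     "NounFrequency", "VerbFrequency")], 1:4)$cnumber
> collin.fnc(young[, c("WrittenFrequency", "WrittenSpokenFrequencyRatio",
+     "NVratio")], 1:3)$cnumber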

With just three frequency measures, WrittenFrequency, WrittenSpokenFrequencyRatio, and NVratio, instead of four frequency measures, we reduce the condition number for the frequency measures from 9.45 to 3.44. In what follows, we restrict ourselves to the following predictors,

> english3 = english2[ , c("RTlexdec", "Word", "AgeSubject",
+     "WordCategory", "WrittenFrequency",
+     "WrittenSpokenFrequencyRatio", "FamilySize",
+     "InflectionalEntropy", "NumberSimplexSynsets",
+     "NumberComplexSynsets", "LengthInLetters", "MeanBigramFrequency",
+     "Ncount", "NVratio", "PC1", "PC2", "PC3", "PC4", "Voice")]

and create the corresponding data distribution object.

> english3.dd = datadist(english3)

> options(datadist = "english3.dd")


We also include the interaction of WrittenFrequency by AgeSubject observed above in the new model.

> english3.ols = ols(RTlexdec ~ Voice + PC1 + PC2 + PC3 + PC4 +
+     LengthInLetters + MeanBigramFrequency + Ncount +
+     rcs(WrittenFrequency, 5) + WrittenSpokenFrequencyRatio +
+     NVratio + WordCategory + AgeSubject +
+     FamilySize + InflectionalEntropy +
+     NumberSimplexSynsets + NumberComplexSynsets +
+     rcs(WrittenFrequency, 5) * AgeSubject, data = english3)

An anova summary shows remarkably few non-significant predictors: the principal components PC2–PC4, length, neighborhood density, and the number of simplex synsets. A procedure in the Design package for removing superfluous predictors from the full model is fastbw(), which implements a fast backwards elimination routine.
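The anova table referred to here is obtained as follows (output not shown):

> anova(english3.ols)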

> fastbw(english3.ols)

 Deleted              Chi-Sq df P      Residual df P      AIC   R2
 NumberSimplexSynsets  0.00  1  0.9742  0.00    1  0.9742 -2.00 0.734
 Ncount                0.05  1  0.8192  0.05    2  0.9737 -3.95 0.734
 PC3                   0.74  1  0.3889  0.80    3  0.8505 -5.20 0.734
 PC2                   0.90  1  0.3441  1.69    4  0.7924 -6.31 0.734
 LengthInLetters       1.15  1  0.2845  2.84    5  0.7252 -7.16 0.734
 PC4                   1.40  1  0.2364  4.24    6  0.6445 -7.76 0.734
 NVratio               4.83  1  0.0279  9.07    7  0.2476 -4.93 0.734
 WordCategory          2.01  1  0.1562 11.08    8  0.1971 -4.92 0.733

Approximate Estimates after Deleting Factors

                                     Coef      S.E.    Wald Z         P
Intercept                        6.865088 0.0203124 337.97550 0.000e+00
Voice=voiceless                 -0.009144 0.0025174  -3.63235 2.808e-04
PC1                              0.002687 0.0005961   4.50736 6.564e-06
MeanBigramFrequency              0.007509 0.0018326   4.09740 4.178e-05
WrittenFrequency                -0.041683 0.0047646  -8.74852 0.000e+00
WrittenFrequency'               -0.114355 0.0313057  -3.65285 2.593e-04
WrittenFrequency''               0.704428 0.1510582   4.66329 3.112e-06
WrittenFrequency'''             -0.886685 0.1988077  -4.46002 8.195e-06
WrittenSpokenFrequencyRatio      0.009739 0.0011305   8.61432 0.000e+00
AgeSubject=young                -0.275166 0.0187071 -14.70915 0.000e+00
FamilySize                      -0.010316 0.0022198  -4.64732 3.363e-06
InflectionalEntropy             -0.021827 0.0022098  -9.87731 0.000e+00
NumberComplexSynsets            -0.006295 0.0012804  -4.91666 8.803e-07
Frequency * AgeSubject=young     0.017493 0.0066201   2.64244 8.231e-03
Frequency' * AgeSubject=young   -0.043592 0.0441450  -0.98747 3.234e-01
Frequency'' * AgeSubject=young   0.010664 0.2133925   0.04998 9.601e-01
Frequency''' * AgeSubject=young  0.171251 0.2807812   0.60991 5.419e-01

Factors in Final Model

 [1] Voice                       PC1                         MeanBigramFrequency
 [4] WrittenFrequency            WrittenSpokenFrequencyRatio AgeSubject
 [7] FamilySize                  InflectionalEntropy         NumberComplexSynsets
[10] WrittenFrequency * AgeSubject

The output of fastbw() has two parts. The first part lists statistics summarizing why factors are deleted. As can be seen in the two columns of p-values, none of the deleted variables comes anywhere near explaining a significant part of the variance. Unsurprisingly, all predictors that did not reach significance in the anova table are deleted. In addition, WordCategory and NVratio, which just reached significance at the 5% level, are removed as well. The second part of the output of fastbw() lists the estimated coefficients for the remaining predictors, together with their associated statistics.

We should not automatically accept the verdict of fastbw(). First, it is only one of many available methods for searching for the most parsimonious model. Second, it often makes sense to remove predictors by hand, guided by one's theoretical knowledge of the predictors. In the present example, PC1 remains in the model as the single representative of the 10 control variables for orthographic consistency. We gladly accept the removal of the other three principal components. LengthInLetters is also deleted. Given the very small effect size we observed above for this variable, and given that a highly correlated control variable for orthographic form, MeanBigramFrequency, remains in the model, we have no regrets about word length either. With respect to WordCategory and NVratio, we need to exercise some caution. Not only did these predictors reach significance at the 5% level, we also have theoretical reasons for predicting that nouns should have a processing advantage compared to verbs in visual lexical decision. Third, we need to check at this point whether there are nonlinearities for other predictors besides written frequency. In fact, nonlinearities turn out to be required for FamilySize and WrittenSpokenFrequencyRatio, and once these nonlinearities are brought into the model, WordCategory and NVratio emerge as predictive after all (both p < 0.05).

> english3.olsA = ols(RTlexdec ~ Voice + PC1 + MeanBigramFrequency +
+     rcs(WrittenFrequency, 5) + rcs(WrittenSpokenFrequencyRatio, 3) +
+     NVratio + WordCategory + AgeSubject + rcs(FamilySize, 3) +
+     InflectionalEntropy + NumberComplexSynsets +
+     rcs(WrittenFrequency, 5):AgeSubject, data = english3, x = T, y = T)
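The need for the nonlinear spline terms for FamilySize and WrittenSpokenFrequencyRatio can be verified in the anova table of the new model, which reports separate tests for the nonlinear components of each rcs() term (output not shown):

> anova(english3.olsA)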

We summarize this model by means of Figure 6.8, removing confidence bands (which are extremely narrow) and the subtitles specifying how the partial effects are adjusted for the other predictors in the model (as this is a very long list with so many predictors).

> par(mfrow = c(4, 3), mar = c(4, 4, 1, 1), oma = rep(1, 4))

> plot(english3.olsA, adj.subtitle=F, ylim=c(6.4, 6.9), conf.int=F)

> par(mfrow = c(1, 1))


[Figure 6.8 here: eleven panels showing the partial effects on RTlexdec of Voice, PC1, MeanBigramFrequency, WrittenFrequency, WrittenSpokenFrequencyRatio, NVratio, WordCategory, AgeSubject, FamilySize, InflectionalEntropy, and NumberComplexSynsets.]

Figure 6.8: Partial effects of the predictors according to model english3.olsA.

