
> mat[1:10, ]
      [,1] [,2]
 [1,]  319  328
 [2,]   22   18
 [3,]    0    0
 [4,]    3    2
 [5,]  307  287
 [6,]   29   29
 [7,]  240  223
 [8,]    2    1
 [9,]    1    0
[10,]  523  527

The first row of mat lists the frequencies for the first word, the second row those for the second word, and so on. Now that mat has been properly filled with simulated frequencies of occurrence, we use it as input to the density estimation function. Before we do so, it is essential to apply a logarithmic transformation to remove most of the skew. As there are zero frequencies in mat, and as the logarithm of zero is undefined, we back off from zero by adding 1 to all cells of mat before taking the log.
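For readers who want to generate a matrix like mat themselves, a minimal lognormal-Poisson sketch follows. The sample size and the parameters of the lognormal distribution are assumptions for illustration, not the values used for Figure 4.14, so the resulting counts will differ from those shown above.

> # sketch only: per-word rates lambda are lognormal, and each word
> # contributes two Poisson counts with the same rate (hence the
> # correlated columns); mean = 2 and sd = 2 are assumed values
> lambda = exp(rnorm(1000, mean = 2, sd = 2))
> mat = cbind(rpois(1000, lambda), rpois(1000, lambda))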

> mat = log(mat+1)

We now use the same code as previously for the bivariate normal density,

> library(MASS)   # kde2d() lives in the MASS package
> persp(kde2d(mat[, 1], mat[, 2], n = 50),
+     phi = 30, theta = 20, d = 10, col = "lightblue",
+     shade = 0.75, box = TRUE, border = NA, ltheta = -100,
+     expand = 0.5, xlab = "log X", ylab = "log Y", zlab = "density")

but change the accompanying text.

> mtext("bivariate lognormal-Poisson", 3, 1)

The lower panels of Figure 4.14 illustrate two empirical densities. The left panel concerns the phonological similarity space of 4171 Dutch word forms with four phonemes. For each of these words, we calculated the type count of four-phoneme words that differ in only one phoneme, its phonological neighborhood size. For each word, we also calculated the rank of that word in its neighborhood. (If the word was the most frequent word in its neighborhood, its rank was 1, etc.) After removal of words with no neighbors and log transforms, we obtain a density that is clearly not strictly bivariate normal, but that might perhaps be considered as sufficiently approximating a bivariate normal distribution when considering a regression model.
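By way of illustration, a toy version of the neighborhood count can be written in a few lines. The function name hamming1 and the miniature lexicon are our own inventions for this sketch, not the code actually used for the Dutch lexicon; letters stand in for phonemes.

> # count the words in 'lexicon' differing from 'word' in exactly one position
> hamming1 = function(word, lexicon) {
+     a = strsplit(word, "")[[1]]
+     sum(sapply(lexicon, function(w)
+         sum(a != strsplit(w, "")[[1]]) == 1))
+ }
> lexicon = c("tent", "test", "text", "tart")  # hypothetical four-'phoneme' words
> hamming1("tent", lexicon)  # "test" and "text" are one-phoneme neighbors
[1] 2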


The lower right panel of Figure 4.14 presents the density for the (log) frequencies of 4633 Dutch monomorphemic noun stems in the singular and plural form. This distribution has the same kind of shape as that of the lognormal-Poisson variate in the upper right.

4.4 A numerical vector and a factor: analysis of variance

Up till now, we have considered the functional relation between two numerical vectors.

In this section, we consider how to analyse a numerical vector that is paired with a factor.

Consider again mean familiarity ratings and the class of the words in the ratings data frame:

> ratings[1:5, c("Word", "meanFamiliarity", "Class")]
        Word meanFamiliarity  Class
23    almond            3.72  plant
70       ant            3.60 animal
12     apple            5.84  plant
76   apricot            4.40  plant
79 asparagus            3.68  plant

We can use the lm() function to test whether there is a difference in mean familiarity between nouns for plants and nouns for animals. This is known as a ONE-WAY ANALYSIS OF VARIANCE.

> ratings.lm = lm(meanFamiliarity ~ Class, data = ratings)
> summary(ratings.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5122     0.1386  25.348  < 2e-16
Classplant    0.8547     0.2108   4.055 0.000117

The summary shows two highly significant p-values, so we may infer that the difference between the two group means must somehow be significant. But let's delve a little deeper into what is happening here. After all, Class is a factor and not a numerical variable representing a line for which a slope and an intercept make sense.

What lm() does for us with the factor Class is to recode its factor levels into one or more numerical vectors. Because Class has only two levels, one numerical vector suffices: a vector with zeros for the animals and with ones for the plants. This numerical vector is labeled Classplant, and lm() carries out its standard calculations with this vector just as it would for any other numerical variable. Hence, it reports an intercept and a slope. However, intercept and slope receive a special interpretation that crucially depends on how the factor levels are recoded numerically.

The numerical recoding of factor levels is referred to as DUMMY CODING. There are many different algorithms for dummy coding. (The help page for contr.treatment() provides further information.) The kind of dummy coding used in this book is known as TREATMENT CODING. R handles dummy coding automatically for us, but by way of illustration we add treatment dummy codes to our data frame by hand. For convenience, we first make a copy of ratings with only the columns relevant for the current discussion included.

> dummy = ratings[,c("Word", "meanFamiliarity", "Class")]

We now add the dummy codes, a 1 for plants and a 0 for animals, in a vector named Classplant, following R's naming conventions.

> dummy$Classplant = 1

> dummy[dummy$Class == "animal",]$Classplant = 0

> dummy[1:5, ]

        Word meanFamiliarity  Class Classplant
23    almond            3.72  plant          1
70       ant            3.60 animal          0
12     apple            5.84  plant          1
76   apricot            4.40  plant          1
79 asparagus            3.68  plant          1

It does not matter which factor level is assigned a 1 and which a 0, but some decision has to be made; R bases its decision on alphabetical order. Hence animal is singled out as the DEFAULT or REFERENCE LEVEL that is contrasted with the level plant. R labels the dummy vector with the factor name followed by the non-default factor level, hence the name Classplant. If we now run lm() on dummy with Classplant as predictor instead of Class, we obtain exactly the same table of coefficients as above:

> summary(lm(meanFamiliarity ~ Classplant, data = dummy))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5122     0.1386  25.348  < 2e-16
Classplant    0.8547     0.2108   4.055 0.000117
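Incidentally, there is no need to build the dummy codes by hand just to inspect them: model.matrix() shows the coding that lm() constructs internally, and its first five rows mirror the hand-made Classplant column above.

> model.matrix(~ Class, data = ratings)[1:5, ]
   (Intercept) Classplant
23           1          1
70           1          0
12           1          1
76           1          1
79           1          1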

Let's now study this table in some more detail. It lists two coefficients. First consider the coefficient labeled intercept. Since all we are doing is comparing the ratings for the two levels of the factor Class, the term 'intercept' must have a more general interpretation than 'the Y-value of a line when X = 0'. What the intercept actually represents here is the group mean for the default level, animal. In other words, the intercept is nothing else but the mean familiarity for the subset of animals:

> mean(ratings[ratings$Class == "animal", ]$meanFamiliarity)
[1] 3.512174
> coef(ratings.lm)[1]
(Intercept)
   3.512174


The t-value and its corresponding p-value answer the question whether the group mean for the animals, 3.5122, is significantly different from zero. It clearly is, but this information is not that interesting to us, as we are concerned with the difference between the two group means.

Consider therefore the second coefficient in the model, 0.8547. The value of this coefficient represents the contrast (i.e., the difference) between the group mean of the plants and that of the animals. When a word does not belong to the default class, i.e., it denotes a plant instead of an animal, the mean has to be adjusted upwards by adding 0.8547 to the intercept, the group mean for the animals. In other words, the group mean for the nouns denoting plants is 4.3669 (3.5122 + 0.8547). What the t-test in the above table of coefficients tells us is that this adjustment of 0.8547 is statistically significant. In other words, we have ample reason to suppose that the two group means differ significantly.
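This reconstructed group mean is easily verified against the data:

> mean(ratings[ratings$Class == "plant", ]$meanFamiliarity)
[1] 4.366857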

The t-value and p-value obtained here are identical to those for a straightforward t-test when we force t.test() to treat the variances of the familiarity ratings for plants and animals as identical:

> animals = ratings[ratings$Class == "animal", ]  # the subsets used below
> plants = ratings[ratings$Class == "plant", ]
> t.test(animals$meanFamiliarity, plants$meanFamiliarity,
+     var.equal = TRUE)
t = -4.0548, df = 79, p-value = 0.0001168
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.2742408 -0.4351257
sample estimates:
mean of x mean of y
 3.512174  4.366857

Note once more that the mean for animals is identical to the coefficient for the intercept, and that the mean for plants is the sum of the intercept and the coefficient adjusting for the level plant of the factor Class.

Whereas the function t.test() is restricted to comparing two group means, the lm() function can be applied to a factor with more than two levels. By way of example, consider the auxiliaries data set, which provides information on 285 Dutch verbs.

> head(auxiliaries)
     Verb     Aux VerbalSynsets Regularity
1 blijken    zijn             1  irregular
2 gloeien  hebben             3    regular
3 glimmen zijnheb             2  irregular
4  rijzen    zijn             4  irregular
5  werpen  hebben             3  irregular
6  delven  hebben             2  irregular

The column labeled Aux specifies what the appropriate auxiliary for the perfect tense is for the verb listed in the first column. Dutch has two auxiliaries for the perfect tense, zijn ('be') and hebben ('have'), and verbs subcategorize as to whether they select only zijn, only hebben, or both (depending on the aspect of the clause and the inherent aspect of the verb). The column VerbalSynsets specifies the number of verbal synsets in which a given verb appears in the Dutch WordNet. The final column categorizes the verbs as regular versus irregular.

We test whether the number of verbal synsets varies significantly with auxiliary by modeling VerbalSynsets as a function of Aux.

> auxiliaries.lm = lm(VerbalSynsets ~ Aux, data = auxiliaries)

Let's first consider the general question whether Aux helps explain at least some of the variation in the number of verbal synsets. This question is answered with the help of the anova() function.

> anova(auxiliaries.lm)
Analysis of Variance Table

Response: VerbalSynsets
           Df  Sum Sq Mean Sq F value    Pr(>F)
Aux         2  117.80   58.90  7.6423 0.0005859
Residuals 282 2173.43    7.71

The anova() function reports an F-value of 7.64, which, for 2 and 282 degrees of freedom, is highly significant (compare 1 - pf(7.6423, 2, 282)). What this test tells us is that there are significant differences in the mean number of synsets for the three kinds of verbs. However, it does not specify which of the (in this case 3) possible differences in the means might be involved: hebben versus zijn, hebben versus zijnheb, and zijn versus zijnheb.
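The p-value in the anova table can indeed be reproduced directly from the F distribution:

> 1 - pf(7.6423, 2, 282)  # reproduces the Pr(>F) of 0.0005859 reported above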

Some information as to which of these means are really different can be gleaned from the summary:

> summary(auxiliaries.lm)
Coefficients:
           Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.4670     0.1907  18.183  < 2e-16
Auxzijn      0.5997     0.7417   0.808 0.419488
Auxzijnheb   1.6020     0.4114   3.894 0.000123

From the summary we infer that the default or reference level is hebben: hebben precedes zijn and zijnheb in the alphabet. This explains why there is no row labeled Auxhebben in the summary table. Since hebben is the default, the intercept (3.4670) represents the group mean for hebben. There are two additional coefficients, one for the contrast between the group mean of hebben versus zijn, represented by the vector of dummy contrasts labeled Auxzijn, and one for the contrast between the group mean for hebben and that of zijnheb, represented by the dummy vector Auxzijnheb. Hence, we can reconstruct the other two group means from the table of coefficients. The mean for zijn is 3.4670 + 0.5997, and the mean for verbs allowing both auxiliaries is 3.4670 + 1.6020. The t-test for the intercept tells us that 3.4670 is unlikely to be zero, which is not of interest to us here. The coefficient of 0.5997 (for the verbs taking zijn) is not significant (p > 0.40). This indicates that there is no reason to suppose that the means of the verbs taking hebben and those taking zijn are different. The coefficient for verbs taking both auxiliaries is significant, so we know that this mean is really different from the mean for verbs selecting only hebben.
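These reconstructed group means are easily checked against the raw data:

> # group means: 3.4670 (hebben), 3.4670 + 0.5997 (zijn),
> # and 3.4670 + 1.6020 (zijnheb)
> tapply(auxiliaries$VerbalSynsets, auxiliaries$Aux, mean)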

There is one comparison that is left out in this example: zijn versus zijnheb.

When a factor has more than three levels, there will be more comparisons that do not appear in the table of coefficients. This is because this table lists only those pairwise comparisons that involve the default level, the reference level that is mapped onto the intercept.

A question that often arises when a factor has more than two levels is which group means are actually different. In the present example, one might consider renaming the factor levels so that the missing comparison appears in the table of coefficients. This is not recommended, however, for two reasons. The first is that it is cumbersome to do so; the second is that there is a statistical snag when MULTIPLE COMPARISONS are carried out on the same data.

Recall that we accept the outcome of a statistical experiment as surprising when its p-value is extreme, for instance, below α = 0.05. When we are interested in the differences between, for instance, three group means, we have to be careful how to define what we count as extreme. The proper definition of an extreme probability is, in this case, that at least one of the outcomes is truly surprising. Now, if we simply carry out three separate t-tests with α = 0.05, the probability of surprise for at least one comparison increases from 0.05 to 0.143. To see this, we model our statistical experiment as a random variable with a probability of success equal to 0.05 and a probability of failure equal to 0.95. The probability of at least one success is the same as one minus the probability of no successes at all, hence

> 1 - pbinom(0, 3, 0.05)
[1] 0.142625
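Equivalently, because the three tests are treated as independent, the complement rule gives the same number:

> 1 - 0.95^3
[1] 0.142625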

In other words, the probability that at least one out of three experiments will be successful in producing a p-value less than 0.05 just by chance is 0.14. This example illustrates that when we carry out multiple comparisons, we run the risk of serious INFLATION IN SURPRISE. This is not what we want.

There are several remedies, of which I discuss two. The first is known as BONFERRONI'S CORRECTION. For n comparisons, simply divide α by n: any comparison that produces a p-value less than α/n is sure to be significant at the α significance level. Applied to our example, we begin by noting that Aux has 3 levels and therefore 3 pairwise comparisons of two means are at issue. Since n = 3, any pairwise comparison that yields a p-value less than 0.05/3 = 0.0167 can be accepted as significant. If Aux had had 4 levels, the number of possible pairwise comparisons would be 6, so α = 0.0083 would have been appropriate.
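For convenience, R bundles all pairwise t-tests with Bonferroni-adjusted p-values in a single function; a minimal sketch for the present data:

> # prints a table of p-values, each already multiplied by the number of
> # comparisons, so they can be compared against alpha = 0.05 directly
> pairwise.t.test(auxiliaries$VerbalSynsets, auxiliaries$Aux,
+     p.adjust.method = "bonferroni")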

The second remedy is to make use of TUKEY'S HONESTLY SIGNIFICANT DIFFERENCE, available in R as TukeyHSD(). This method has greater power to detect significant differences than the Bonferroni method, but has as a disadvantage that the means for each level of the factor should be based on equal numbers of observations. The implementation of Tukey's HSD in R incorporates an adjustment for sample size that produces sensible results also for mildly unbalanced designs.

For the present example, the counts of verbs, cross-classified by the auxiliary they select, point to a very unbalanced design.

> xtabs(~ auxiliaries$Aux)
auxiliaries$Aux
 hebben    zijn zijnheb
    212      15      58

Hence, the Bonferroni adjustment is required. We could apply TukeyHSD() to these data, but the results would be meaningless. To illustrate how to carry out multiple comparisons using Tukey's honestly significant difference, consider the following (simplified) example from the help page of TukeyHSD(). From the built-in data sets in R, we select the data frame named warpbreaks, which gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. For more information on this data set, type ?warpbreaks.
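Unlike the auxiliaries data, the warpbreaks design is balanced, with 18 looms at each level of tension, so Tukey's method is appropriate here:

> xtabs(~ warpbreaks$tension)
warpbreaks$tension
 L  M  H
18 18 18

We now run a one-way analysis of variance.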

> warpbreaks.lm = lm(breaks ~ tension, data = warpbreaks)

> anova(warpbreaks.lm)
Analysis of Variance Table

Response: breaks
          Df Sum Sq Mean Sq F value   Pr(>F)
tension    2 2034.3  1017.1  7.2061 0.001753
Residuals 51 7198.6   141.1

> summary(warpbreaks.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    36.39       2.80  12.995  < 2e-16
tensionM      -10.00       3.96  -2.525 0.014717
tensionH      -14.72       3.96  -3.718 0.000501

Residual standard error: 11.88 on 51 degrees of freedom
Multiple R-Squared: 0.2203, Adjusted R-squared: 0.1898
F-statistic: 7.206 on 2 and 51 DF, p-value: 0.001753

The table of coefficients suggests that there are significant contrasts of medium and high tension compared to low tension. In order to make use of TukeyHSD(), we have to rerun this analysis using a function specialized for analysis of variance, aov().

> warpbreaks.aov = aov(breaks ~ tension, data = warpbreaks)

The summary of the aov object gives exactly the same output as the anova() function applied to the lm object:

> summary(warpbreaks.aov)

            Df Sum Sq Mean Sq F value   Pr(>F)
tension      2 2034.3  1017.1  7.2061 0.001753
Residuals   51 7198.6   141.1


Also note that the F-test in this summary yields the same results as the F-test following the table of coefficients in the summary of warpbreaks.lm. Both F-values tell exactly the same story: there are statistically significant differences in the number of breaks as a function of the amount of tension. Let's now apply TukeyHSD():

> TukeyHSD(warpbreaks.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = breaks ~ tension, data = warpbreaks)

$tension
          diff       lwr        upr     p adj
M-L -10.000000 -19.55982 -0.4401756 0.0384598
H-L -14.722222 -24.28205 -5.1623978 0.0014315
H-M  -4.722222 -14.28205  4.8376022 0.4630831

This table lists the differences in the means, the lower and upper end points of the confidence intervals, and the adjusted p-value. A comparison of the adjusted p-values for the M-L and H-L comparisons with the p-values listed in the table of coefficients for warpbreaks.lm above shows that the adjusted p-values are more conservative. For the visualization shown in Figure 4.15, simply type

> plot(TukeyHSD(warpbreaks.aov))

Above, we fitted a linear model to the auxiliary data using lm(). Alternatively, we could have used the aov() function. However, both methods, which are underlyingly identical, may be inappropriate. We have already seen that the numbers of observations for the three levels of Aux differ widely. More importantly, there are also substantial differences in their variances.

> tapply(auxiliaries$VerbalSynsets, auxiliaries$Aux, var)
   hebben      zijn   zijnheb
 5.994165 18.066667 11.503932
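As an aside that the text does not pursue, the homogeneity of these variances could also be tested formally, for instance with Bartlett's test:

> # a small p-value indicates that the variances differ significantly
> bartlett.test(VerbalSynsets ~ Aux, data = auxiliaries)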

It is crucial, therefore, to check whether a non-parametric test also provides support for differences in the number of synsets for verbs with different auxiliaries. The test we illustrate here is the KRUSKAL-WALLIS RANK SUM TEST:

> kruskal.test(auxiliaries$VerbalSynsets, auxiliaries$Aux)

        Kruskal-Wallis rank sum test

data:  auxiliaries$VerbalSynsets and auxiliaries$Aux
Kruskal-Wallis chi-squared = 11.7206, df = 2, p-value = 0.002850

The small p-value supports our intuition that the numbers of synsets are not uniformly distributed over the three kinds of verbs.


Figure 4.15: Family-wise 95% confidence intervals for Tukey's honestly significant difference for the warpbreaks data. The significant differences are those for which the confidence intervals do not intersect the dashed zero line.

4.4.1 Two numerical vectors and a factor: analysis of covariance

In this section, we return to the analysis of the mean size ratings. What we have done thus far is to analyse these data either with linear regression (the first example in section 4.3.2) or with analysis of variance (section 4.4). In linear regression, we used a numerical vector as predictor; in analysis of variance, the predictor was a factor. The technical term for analyses with both numeric predictors and factorial predictors is ANALYSIS OF COVARIANCE. In R, the same function lm() is used for all three kinds of analyses (regression, analysis of variance, and analysis of covariance), as all three are built on the same fundamental principles.

Recall that we observed a non-linear relation between familiarity and size rating, and that we fitted a linear model with a quadratic term to the subset of nouns denoting plants.

We could fit a separate regression model to the subset of nouns denoting animals, but what we really need is a model that tailors the regression lines to both subsets of nouns simultaneously. This is accomplished in the following linear model, in which we include both meanFamiliarity and the factor Class as predictors.

> ratings.lm = lm(meanSizeRating ~ meanFamiliarity * Class +
+     I(meanFamiliarity^2), data = ratings)

> summary(ratings.lm)
Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                  4.42894    0.54787   8.084  7.6e-12
meanFamiliarity             -0.63131    0.29540  -2.137  0.03580
I(meanFamiliarity^2)         0.10971    0.03801   2.886  0.00508
Classplant                  -1.01248    0.41530  -2.438  0.01711
meanFamiliarity:Classplant  -0.21179    0.09779  -2.166  0.03346
---
Residual standard error: 0.3424 on 76 degrees of freedom
Multiple R-Squared: 0.8805, Adjusted R-squared: 0.8742
F-statistic: 140 on 4 and 76 DF, p-value: < 2.2e-16

Let’s consider the elements of this model by working through the table of coefficients.

As usual, there is an intercept, which represents a modified group mean for the subset of nouns denoting animals. We are dealing with a modified group mean because this mean is calibrated for words with zero meanFamiliarity. As familiarity ratings range between 1 and 7 in this experiment, this group mean is a theoretical construct. The next two coefficients define the nonlinear effect of meanFamiliarity, one for the linear term and one for the quadratic term. These coefficients likewise concern the subset of nouns for animals.
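To make these coefficients concrete, the fitted mean size rating for the animals at a given familiarity can be obtained with predict(); the familiarity value of 4 used here is our own choice, purely for illustration.

> # the animal curve at meanFamiliarity = 4: since Classplant and the
> # interaction are zero for animals, this evaluates
> # 4.42894 - 0.63131 * 4 + 0.10971 * 4^2
> predict(ratings.lm,
+     newdata = data.frame(meanFamiliarity = 4, Class = "animal"))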

The last two coefficients summarize how the preceding coefficients should be adjusted for the subset of nouns denoting plants: Classplant lowers the intercept for the plants by 1.01248, and the interaction meanFamiliarity:Classplant adjusts the slope of the linear term for the plants by -0.21179.