
General considerations


There are two very different ways in which statistical models are used. Ideally, a model is used to test a pre-specified hypothesis, or a set of hypotheses. One fits a model to the data, removes overly influential outliers, uses bootstrap validation, and if required shrinks the estimated coefficients. Only after this process is completed does one inspect the anova and summary tables, to see whether the p-values and the direction of the effects are as predicted by one's hypotheses. The p-values in the summary tables are correct under these circumstances, and only under these circumstances.

In practice, this ideal procedure is hardly ever realistic, for a variety of reasons. First, it is often the case that our initial hypotheses are very underspecified. Under these circumstances, we engage in statistical modeling in order to explore the potential relevance of predictors, to learn about their functional form, and more in general to come to a better understanding of the structure of our data. In this exploratory process, we screen predictors for significant p-values, remove variables accordingly, and gradually develop a model that we feel is both parsimonious and adequate. The p-values of such a final model are still informative, but far from exact. According to some, they are even totally worthless and completely uninterpretable. This highlights the crucial importance of model validation, for instance by means of the bootstrap, as this will inform us about the extent to which we might be overfitting the data. It is equally crucial to replicate one's experiment with new materials. The same factors should be predictive, the magnitudes of the coefficients should be similar, and one would hope to find that the model for the original experiment provides reasonable predictions for the new data.
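In the Design package used in this chapter, such a bootstrap validation is straightforward. As a sketch, where exploratory.model stands for whatever model the exploratory process produced (fitted with x = T, y = T):

> validate(exploratory.model, B = 200)   # optimism-corrected fit statistics

Large differences between the apparent and the optimism-corrected statistics signal overfitting.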

What you should avoid at all times is what statisticians refer to as cherry-picking. You should not tweak the data by removing data points so that a non-significant effect becomes significant. It is not bad to remove data points, but one should have reasons for removing them that are completely independent of whether, as a result, predictors will be significant. Overly influential outliers have to be removed, and so should any other data points that are suspect. For instance, in experiments using lexical decision, response latencies of less than 200 milliseconds are probably artefactual, simply because the time for reading the stimulus, combined with the time required for planning and carrying out the movements involved in pushing the response button, already requires at least 200 milliseconds.

Similarly, one should not hunt around for a method that will make an effect significant. It is true that there are often several different methods available for modeling a given data set. And yes, there is no single best model. However, when different modeling techniques have been considered, and when each technique is appropriate, then the combined evidence should be taken into account. A predictor that happens to be significant in only one analysis but not in the others should not be reported as significant.

The examples in this chapter illustrate the steps in data analysis: the construction of an initial model, the exploration of nonlinear relations, model criticism, and validation. All these steps are important, and crucial for understanding your data. As you build up experience with regression modeling, you will find that model criticism in particular almost always allows theoretically well-supported predictors to emerge more strongly.

A final methodological issue that should be mentioned is the unfortunate practice in psycholinguistics of dichotomizing continuous variables. For instance, Baayen et al. [1997] studied frequency effects in visual word recognition by contrasting high-frequency words with low-frequency words. The two sets of words were matched in the mean for a number of other lexical variables. However, this dichotomization of frequency reduces an information-rich continuous variable to an information-poor two-level factor. If frequency were a treatment that we could administer to words, like raising the temperature or the humidity in an agricultural experiment, then it would make sense to maximize one's chances of finding an effect by contrasting observations subjected to a fixed very low level of the treatment with observations subjected to a fixed very high level of the treatment. Unfortunately, frequency is a property of our experimental units; it cannot be administered independently, and it is correlated with many other lexical variables. Due to this correlational structure, dichotomization of linguistic variables almost always leads to factor levels with overlapping or nearly overlapping distributions of the original variable: it is nearly impossible to build contrasts for extreme values on one linguistic variable while matching for a host of other correlated linguistic variables. As a consequence, the enhanced statistical power obtained by comparing two very different treatment levels is not available. In these circumstances, dichotomization comes with a severe loss of statistical power: precise information is lost, and nonlinearities become impossible to detect. Furthermore, samples obtained through dichotomization tend to be small, and to get ever smaller the more variables are being matched for. Such samples are also non-random in the extreme, and hence do not allow proper statistical inference. To make matters even worse, dichotomization may also have various other adverse side effects, including spurious significance [see, e.g., Cohen, 1983, Maxwell and Delaney, 1993, MacCallum et al., 2002]. Avoid it. Use regression.


6.7 Exercises

1. Analyse the effect of PC1 on the naming latencies in the english2 data set that we created in section 6.2.2. Attach the Design package, make a data distribution object, and set the datadist variable to point to this object with the options() function.

First fit a model with AgeSubject, WrittenFrequency, and PC1 as predictors. Use a restricted cubic spline with three knots for WrittenFrequency, and include an interaction of WrittenFrequency by AgeSubject. Is the linear effect of PC1 significant? Now allow the effect of PC1 to be nonlinear with a restricted cubic spline with three knots. Plot the partial effect of PC1 in this new model, and explain the difference with respect to the first model.
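A minimal sketch of the setup, assuming the naming latencies of english2 are stored in RTnaming (rename as appropriate for your workspace):

> library(Design)
> english2.dd = datadist(english2)
> options(datadist = "english2.dd")
> english2.ols = ols(RTnaming ~ AgeSubject * rcs(WrittenFrequency, 3) +
+   PC1, data = english2, x = T, y = T)
> anova(english2.ols)             # is the linear effect of PC1 significant?
> english2.olsA = ols(RTnaming ~ AgeSubject * rcs(WrittenFrequency, 3) +
+   rcs(PC1, 3), data = english2, x = T, y = T)
> plot(english2.olsA, PC1 = NA)   # partial effect of PC1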

2. Exercise 5.3 addressed the prediction of the underlying voice specification of the stem-final obstruent in Dutch verbs with the help of a classification tree. Ernestus and Baayen [2003] compared several statistical models for the finalDevoicing data set, including a logistic regression model. Load the data, and use the lrm() function from the Design package to model the dependent variable Voice as a function of the other variables in the data frame. Use fastbw() to remove irrelevant predictors from the model.
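A sketch of the first steps; the dot in the formula is shorthand for all remaining columns of finalDevoicing (spell out the predictors if lrm() complains about the shorthand):

> library(Design)
> finalDevoicing.dd = datadist(finalDevoicing)
> options(datadist = "finalDevoicing.dd")
> finalDevoicing.lrm = lrm(Voice ~ ., data = finalDevoicing, x = T, y = T)
> fastbw(finalDevoicing.lrm)   # fast backward elimination of predictors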

3. Check that the danger of overfitting has been reduced for the penalized model dutch.lrm.pen by means of bootstrap validation.
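For example (dutch.lrm.pen must have been fitted with x = T, y = T for the bootstrap to work):

> validate(dutch.lrm.pen, B = 200)   # compare its optimism with dutch.lrm's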

4. We fit a logistic regression model to the data set etymology with as dependent variable the Regularity of the verb, and the ordered factor EtymAge (etymological age) as the main predictor of interest.

> etymology$EtymAge = ordered(etymology$EtymAge, levels = c("Dutch",
+   "DutchGerman", "WestGermanic", "Germanic", "IndoEuropean"))
> library(Design)
> etym.dd = datadist(etymology)
> options(datadist = "etym.dd")
> etymology.lrm = lrm(Regularity ~ rcs(WrittenFrequency, 3) +
+   rcs(FamilySize, 3) + NcountStem + InflectionalEntropy +
+   Auxiliary + Valency + NVratio + WrittenSpokenRatio + EtymAge,
+   data = etymology, x = T, y = T)

Warning message:
Variable EtymAge is an ordered factor.
You should set
options(contrasts=c("contr.treatment","contr.treatment"))
or Design will not work properly.
in: Design(eval(m, sys.parent()))

The warning message tells us that the defaults for the dummy coding of factors have to be reset. We do as instructed.

> options(contrasts = c("contr.treatment", "contr.treatment"))

Rerun the model, inspect the result by means of an anova table, and validate it. You will observe considerable overfitting, so use the pentrace() function to find an optimal penalty for shrinking the coefficients. Make a plot of the partial effects of the predictors in the penalized model.
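A sketch of these remaining steps; the penalty grid, and the penalty of 0.6 chosen below, are illustrative only: read the actual optimum off the pentrace() output.

> anova(etymology.lrm)
> validate(etymology.lrm, B = 200)                # note the large optimism
> pentrace(etymology.lrm, seq(0, 0.8, by = 0.05))
> etymology.lrm.pen = update(etymology.lrm, penalty = 0.6)  # illustrative
> plot(etymology.lrm.pen, EtymAge = NA)   # likewise for the other predictors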

5. Consider again the breakpoint analysis of the frequencies of references to years in the Frankfurter Allgemeine Zeitung (faz). Explain why the model

> faz.bothA = lm(LogFrequency ~ ShiftedLogDistance +
+   ShiftedLogDistance : PastBreakPoint, data = faz)

is a correct alternative formulation of the model presented in the main text, and also explain why the model

> faz.bothB = lm(LogFrequency ~ ShiftedLogDistance * PastBreakPoint,
+   data = faz)

is incorrect for our purposes.

6. Compare the lexical richness of Lewis Carroll's Alice's adventures in Wonderland with that of Lewis Carroll's Through the looking-glass, available as the data set through, using compare.richness.fnc() for equal text sizes, i.e., for the number of tokens in the smaller of the two texts. Use the same method to compare Alice's adventures in Wonderland with Baum's The wonderful wizard of Oz (oz) and with Melville's Moby Dick (moby).
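A sketch of the first comparison; alice and through are the vectors of tokens supplied with the book's data sets, and truncation yields equal text sizes:

> n = min(length(alice), length(through))
> compare.richness.fnc(alice[1:n], through[1:n])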

7. Plag et al. [1999] studied morphological productivity for selected affixes in the British National Corpus (BNC). The BNC consists of three subcorpora: written English, spontaneous conversations (the demographic subcorpus), and spoken English in more formal settings (the context-governed subcorpus). Frequency spectra for the English suffix -ness calculated for these subcorpora are available as the data sets nessw, nessdemog and nesscg. Convert them into spc objects with spc(). Then fit the finite Zipf-Mandelbrot LNRE model to each of the spectra. Inspect the goodness of fit, and refit with the Generalized Inverse Gauss-Poisson model where necessary. Plot the growth curve of the vocabulary at 40 equally spaced intervals in the range from zero to the size of the sample of written words with -ness. Comment on the relation between the shape of the growth curves and the estimated numbers of types in the population. Finally, calculate the growth rates of the vocabulary both at the sample size of the largest subcorpus and at that of the smallest subcorpus. Use the function Vm() from the zipfR package, which takes as first argument a frequency spectrum and as second argument the spectrum element (1 for the hapax legomena).
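A sketch for the written subcorpus, assuming the spectra come with columns m (the frequency class) and Vm (the number of types in that class); nessdemog and nesscg are handled identically:

> library(zipfR)
> nessw.spc = spc(m = nessw$m, Vm = nessw$Vm)
> nessw.fzm = lnre("fzm", nessw.spc)    # finite Zipf-Mandelbrot
> nessw.fzm                             # inspect the goodness of fit
> sizes = floor(seq(1, N(nessw.spc), length = 40))
> plot(lnre.vgc(nessw.fzm, sizes))      # vocabulary growth curve
> Vm(nessw.spc, 1) / N(nessw.spc)       # growth rate at this sample size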


8. Tyler et al. [2005] combined fMRI and priming data in a study addressing the extent to which phonological and semantic processes recruit the same brain areas. Figure 6.20, reconstructed from the graphics coordinates of their Figure 2b, summarizes the main structure of one of their subanalyses. The authors argue that the priming scores (horizontal axis) for the semantic condition are significantly correlated with the intensity of the most significant voxel (vertical axis), which is located in an area of the brain typically associated with semantic processing. They also argue that there is no such correlation for the morphological condition. Figure 6.20 is based on the data set imaging. Carry out an analysis of covariance with FilteredSignal as dependent variable in the model, and test whether there is a significant interaction of BehavioralScore by Condition. Then apply model criticism, and use this to evaluate the conclusions reached by Tyler and colleagues.
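A sketch of the analysis; the threshold of 2.5 standardized residuals used for model criticism is a common rule of thumb, not part of the original study:

> imaging.lm = lm(FilteredSignal ~ BehavioralScore * Condition,
+   data = imaging)
> anova(imaging.lm)    # is the interaction significant?
> imaging.lmA = update(imaging.lm,
+   subset = abs(rstandard(imaging.lm)) < 2.5)
> anova(imaging.lmA)   # do the conclusions survive model criticism?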

[Figure 6.20: scatterplot of imaging$FilteredSignal (vertical axis) against imaging$BehavioralScore (horizontal axis), with r = 0.82 for the semantic condition and r = −0.24 for the morphological condition.]

Figure 6.20: Signal intensity in fMRI at the peak voxel in the left medial fusiform gyrus and priming scores for semantically related (card/paper) and morphologically related (begin/began) conditions. Each data point represents a brain-damaged patient. After Tyler et al. [2005].


Chapter 7

Mixed Models

Consider a study addressing the consequences of adding white noise to the comprehension of words presented auditorily over headphones to a group of subjects, using auditory lexical decision latencies as a measure of speed of lexical access. In such a study, the presence or absence of white noise would be the treatment factor, with two levels (noise versus no noise). In addition, we would need identifiers for the individual words (items), and identifiers for the individual participants (or subjects) in the experiment. The item and subject factors, however, differ from the treatment factor in that we would normally only regard the treatment factor as REPEATABLE.

A factor is repeatable if the set of possible levels for that factor is fixed, and if, moreover, each of these levels can be repeated. In our example, the treatment factor is repeatable, because we can take any new acoustic signal and either add or not add a fixed amount of white noise. We would not normally regard the identifiers of items or subjects as repeatable. Items and subjects are sampled randomly from populations of words and participants, and replicating the experiment would involve selecting other words and other participants. For these new units, we would need new identifiers. In other words, we would be introducing new levels of the subject and item factors into the experiment that had not been seen previously.

To see the far-reaching consequences of this, imagine that we have eight subjects and eight items, and that we create two factors, each with eight levels, using contrast coding. One of the subjects and one of the items will be mapped onto the intercept; the other subjects and items will receive coefficients specifying how they differ from the intercept. How useful is this model for predicting response latencies for new subjects and items? A moment's thought will reveal that it is completely useless. New subjects and new items have new identifiers that do not match the identifiers that were used in building the contrasts and the model using these contrasts. We can still assign new data points to the levels of the treatment factor, noise versus no noise, because these levels are repeatable. But subjects and items are not repeatable, hence we cannot use our model to make predictions for new subjects and new items. In short, the model does not generalize to the populations of subjects and items. It is tailored to the specific subjects and items in the experiment only.
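This is easy to see in R; the subject identifiers below are of course hypothetical:

> subjects = factor(paste("S", 1:8, sep = ""))
> contrasts(subjects)   # an 8 by 7 matrix of treatment contrasts

The first subject is absorbed into the intercept, and the remaining seven each receive a dummy column; a new subject S9 corresponds to no column at all, so the fitted coefficients say nothing about it.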


The statistical literature therefore makes a crucial distinction between factors with repeatable levels, for which we use FIXED-EFFECT terms, and factors with levels randomly sampled from a much larger population, for which we use RANDOM-EFFECT terms. MIXED-EFFECT MODELS, or more simply, MIXED MODELS, are models which incorporate both fixed and random effects.

While fixed-effect factors are modeled by means of contrasts, random effects are modeled as random variables with a mean of zero and unknown variance. For instance, the participants in a reaction time experiment will differ with respect to how quickly they respond. Some tend to be slow, others tend to be fast. Across the population of participants, the average adjustment required to account for differences in speed will be zero. The adjustments required for individual subjects will in general not be zero; instead, they will vary around zero with some unknown standard deviation. In mixed models, the standard deviations associated with random effects are parameters that are estimated, just as the coefficients for the fixed effects are parameters that are estimated.
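As a preview, a mixed model for the white noise experiment sketched above could be specified as follows with the lme4 package; the data frame lexdec.noise and the variables RT, Noise, Subject, and Word are hypothetical names for the design just described:

> library(lme4)
> noise.lmer = lmer(RT ~ Noise + (1 | Subject) + (1 | Word),
+   data = lexdec.noise)
> summary(noise.lmer)   # lists the estimated standard deviations of the
>                       # by-subject and by-word random intercepts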
