
6.5 Models for lexical richness

The frequencies of linguistic units such as words, word bigrams and trigrams, syllables, constructions, etc. pose a special challenge for statistical analysis. This section illustrates this challenge by means of an investigation of lexical richness in Alice's Adventures in Wonderland. The data set alice is based on a version obtained from Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page) from which the header and trailer were removed. The resulting text was loaded into R with scan("alice.txt", what = "character") and converted to lower case with tolower(). This ensures that variants such as Went and went are considered as tokens of the same word type. To clarify the distinction between TYPES and TOKENS, consider the first sentence of Alice's Adventures in Wonderland.


Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do.

There are 21 words in this sentence, of which two are used twice. We will refer to the number of unique words as the number of types, and to the number of words regardless of their identity as the number of tokens.
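
As a minimal sketch (not from the text, with ad hoc variable names), these counts can be reproduced in R by typing in the sentence without punctuation and splitting on whitespace:

> s = tolower("Alice was beginning to get very tired of sitting by her
+     sister on the bank and of having nothing to do")
> tokens = strsplit(s, "[[:space:]]+")[[1]]
> length(tokens)           # number of tokens
[1] 21
> length(unique(tokens))   # number of types ("to" and "of" each occur twice)
[1] 19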

The question that we consider here is how to characterize the vocabulary richness of Alice's Adventures in Wonderland. Intuitively, vocabulary richness (or lexical richness) should be quantifiable in terms of the number of different word types. However, the number of different word types depends on the number of tokens.

If we read through a text or corpus, and at regular intervals keep note of how many different types we have encountered, we find that, unsurprisingly, the number of types increases, first rapidly, and then more and more slowly. This phenomenon is illustrated in the upper left panel of Figure 6.16. For 40 equally spaced measurement points in 'token time', the corresponding count of different types is graphed. I refer to this curve as the GROWTH CURVE OF THE VOCABULARY. The panel to its right shows the rate at which the vocabulary is increasing: quickly at first, more and more slowly as we proceed through the text. The VOCABULARY GROWTH RATE is estimated by the ratio of the number of HAPAX LEGOMENA (types with a frequency of 1) to the number of tokens sampled. The growth rate is a probability: the probability that, after having read N tokens, the next token sampled represents an unseen type, a word type that did not occur among the preceding N tokens [Good, 1953, Baayen, 2001].
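
These definitions can be tracked by hand, as in the hedged sketch below (this is not the implementation used later in this section; it assumes that alice is the vector of lower-cased tokens, and the names sizes and growth are ad hoc):

> sizes = floor(seq(length(alice)/40, length(alice), length.out = 40))
> growth = t(sapply(sizes, function(n) {
+     freqs = table(alice[1:n])               # frequencies in the first n tokens
+     c(Tokens = n, Types = length(freqs),
+       GrowthRate = sum(freqs == 1)/n)       # hapax legomena / tokens
+ }))
> head(growth, 3)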

The problem that arises is that, although we could select the total number of types counted for the full text as a measure of lexical richness, this measure would not lend itself well to comparison with longer or shorter texts. Therefore, considerable effort has been invested in developing measures of lexical richness that would supposedly be independent of the number of tokens sampled. The remaining six panels of Figure 6.16 illustrate that these measures have not been particularly successful. The third panel in the upper row shows the worst measure of all, the type-token ratio, obtained by dividing the number of types by the number of tokens. It is highly correlated (r = 0.99) with the growth rate of the vocabulary shown in the panel to its left. The panel in the upper right explores the idea that word frequencies might follow a lognormal distribution. If so, the mean log frequency might be expected to remain roughly constant and in fact to narrow down to its true value as the sample size increases. We return to this issue below; here we note that there is no sign that the curve is anywhere near reaching a stable value. The bottom panels illustrate the systematic variability in four more complex measures that have been put forward in the literature. None of these putative constants is a true constant.

The only measure of these last four that is, at least under the simplifying assumption that words are used randomly and independently, truly constant is Yule's K, but due to the non-random way in which Lewis Carroll used the words in Alice's Adventures in Wonderland, even K fails to be constant.

Before considering the implications of this conclusion, we first introduce the function that was used to obtain Figure 6.16, growth.fnc(). We instruct it to calculate lexical measures at 40 intervals with 648 tokens in each interval.

> alice[1:4]

[1] "alice’s" "adventures" "in" "wonderland"

> alice.growth = growth.fnc(text = alice, size = 648, nchunks = 40)

The output of growth.fnc() is a growth object, and its contents can be inspected with head() or tail().

> head(alice.growth, 3)

Chunk Tokens Types HapaxLegomena DisLegomena TrisLegomena

1 1 646 269 175 38 20

2 2 1292 440 264 74 26

3 3 1938 578 337 94 42

       Yule       Zipf TypeTokenRatio    Herdan  Guiraud
1 109.36556 -0.6607349      0.4164087 0.7392036 41.57137
2 103.78227 -0.7533172      0.3405573 0.7300149 61.41866
3  99.61543 -0.7628707      0.2982456 0.7186964 76.35996
     Sichel Lognormal
1 0.1412639 0.4508406
2 0.1681818 0.5339446
3 0.1626298 0.5794592

The first three columns list the indices of the chunks, the corresponding (cumulative) numbers of tokens, and the counts of different types in the text up to and including the current chunk. The next three columns list the numbers of HAPAX, DIS, AND TRIS LEGOMENA, the words that are counted exactly once, exactly twice, or exactly three times at a given text size. The remaining columns list various measures of lexical richness: Yule's K [Yule, 1944], the Zipf slope [Zipf, 1935], the type-token ratio, Herdan's C [Herdan, 1960], Guiraud's R [Guiraud, 1954], Sichel's S [Sichel, 1986], and the mean of log frequency [Carroll, 1967]. Once a growth object has been created, Figure 6.16 is obtained straightforwardly by applying the standard plot() function to the growth object.

> plot(alice.growth)
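
To make the measures listed above concrete, here is a hedged sketch of textbook formulas for some of them, collected in a small helper function (richness.fnc is a hypothetical name, not part of languageR); growth.fnc() may parametrize some measures differently, so the values need not coincide exactly with the output shown above:

> richness.fnc = function(txt) {
+     freqs = table(txt)                      # word frequencies
+     spect = table(freqs)                    # frequencies of frequencies
+     N = length(txt)                         # number of tokens
+     V = length(freqs)                       # number of types
+     m = as.numeric(names(spect))
+     Vm = as.numeric(spect)
+     c(TypeTokenRatio = V/N,
+       Herdan = log(V)/log(N),
+       Guiraud = V/sqrt(N),
+       Sichel = sum(freqs == 2)/V,
+       Yule = 10000 * (sum(m^2 * Vm) - N)/N^2)
+ }
> richness.fnc(alice[1:646])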

Let's return to the issue of the variability of the lexical constants. This variability would not be much of a problem if a constant's range of variability within a given text were very small compared to its range of variability across texts. Unfortunately, this is not the case, as shown by Tweedie and Baayen [1998] and Hoover [2003]. The within-text variability can be of the same order of magnitude as the between-text variability.

There are two approaches to overcoming this problem. A practical solution is to compare the vocabulary size (number of types) across texts for the same text sizes. For larger texts, a random sample of the same size as the smallest text in the comparison set has to be selected. The concomitant data loss (all the other words in the larger text that are discarded) is taken for granted. The function compare.richness.fnc() carries out such comparisons. By way of example, we split the text of Alice's Adventures in Wonderland into unequal parts.

Figure 6.16: The vocabulary growth curve and selected measures of lexical richness, all of which depend on the text size.

> aiw1 = alice[1:17000]

> aiw2 = alice[17001:25942]

If we straightforwardly compare these texts by examining the number of types, we find that there is a highly significant difference in vocabulary richness.

> compare.richness.fnc(aiw1, aiw2)

comparison of lexical richness for aiw1 and aiw2

with approximations of variances based on the LNRE models gigp (X2 = 19.43) and gigp (X2 = 18.96)

Tokens Types HapaxLegomena GrowthRate

aiw1 17000 2110 1002 0.05894

aiw2 8942 1442 712 0.07962

two-tailed tests:

                             Z p
Vocabulary Size        18.5709 0
Vocabulary Growth Rate -6.7915 0


In order to evaluate differences in the observed numbers of types, the variances of these type counts have to be estimated. compare.richness.fnc() does this by fitting word frequency models (see below) to each text, and selecting for each text the model with the best goodness of fit. (Models with a better goodness of fit have a lower chi-squared value.) Given the estimates of the required variances, Z scores are obtained that evaluate the difference between the number of types in the first and the second text. Because aiw1 has more tokens than aiw2, this difference is positive. Hence the Z score is also positive.

Its very large value, 18.57, is associated with a very small p-value, effectively zero.
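
The structure of this test is presumably along the following lines (a hedged sketch, not the languageR code; richness.z is a hypothetical name and the variance estimates are taken as given):

> richness.z = function(V1, V2, var1, var2) {
+     z = (V1 - V2)/sqrt(var1 + var2)         # difference scaled by its standard error
+     c(Z = z, p = 2 * (1 - pnorm(abs(z))))   # two-tailed p-value
+ }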

When we reduce the size of the larger text to that of the smaller one, the differences in lexical richness are no longer significant, as expected.

> aiw1a = aiw1[1:length(aiw2)]

> compare.richness.fnc(aiw1a, aiw2)

comparison of lexical richness for aiw1a and aiw2

with approximations of variances based on the LNRE models gigp (X2 = 17.2) and gigp (X2 = 18.96)

Tokens Types HapaxLegomena GrowthRate

aiw1a 8942 1437 701 0.07839

aiw2 8942 1442 712 0.07962

two-tailed tests:

                             Z      p
Vocabulary Size        -0.1529 0.8784
Vocabulary Growth Rate -0.3379 0.7355

Note that compare.richness.fnc() compares texts not only with respect to their vocabulary sizes, but also with respect to their growth rates. A test of growth rates is carried out because two texts may have made use of the same number of types, but may nevertheless differ substantially with respect to the rate at which unseen types are expected to appear.

The other approach to the problem of lexical richness is to develop better statistical models. The challenge that this poses is best approached by first considering in some more detail the problems with the models proposed by Herdan [1960] and Zipf [1935].

In fact, there are two kinds of problems. The first is illustrated in Figure 6.17. The upper panel plots log types against log tokens. The double log transformation changes a curve into what looks like a straight line. Herdan proposed that the slope of this line is a text characteristic that is invariant with respect to text length. This slope is known as Herdan's C and was plotted in the lower left panel of Figure 6.16 for a range of text sizes. A plot of the residuals, shown in the upper right panel of Figure 6.17, shows that the residuals are far from random. Instead, they point to the presence of some curvature that the straight line fails to capture. In other words, the regression model proposed by Herdan is too simple. This is the first problem. The second problem is that when we estimate the slope of the regression line at 40 equally spaced intervals for varying text sizes, the estimated slope changes systematically. This is clearly visible in the lower left panel of Figure 6.16.


Figure 6.17: Herdan's law (upper left) and Zipf's law (lower left) and the corresponding residuals (right panels) for Alice's Adventures in Wonderland.

Zipf's law is beset by exactly the same problems. The lower left panel of Figure 6.17 plots log frequency against log rank. The overall pattern is that of a straight line, as shown by the ordinary least squares regression line in grey. The slope of this line, the Zipf slope, is supposed to be a textual characteristic that is independent of the sample size. But the residuals (see the lower right panel of Figure 6.17) again point to systematic problems with the goodness of fit. And the lower right panel of Figure 6.16 shows that the slope of this regression line also changes systematically as we vary the size of the text, a phenomenon first noted by Orlov [1983]. We could try to fit more complicated regression models to the data using quadratic terms or cubic splines. Unfortunately, although this might help to obtain a better fit for a fixed text size, it would leave the second problem unsolved. Any non-trivial change in the text size leads to a non-trivial change in the values of the regression coefficients. Before explaining why these changes occur, we pause


to discuss the code for Figure 6.17.

The object alice.growth is a growth object. Internal to that object is a data frame, which we extract as follows:

> alice.g = alice.growth@data$data

> head(alice.g, 3)

Chunk Tokens Types HapaxLegomena DisLegomena TrisLegomena Yule

1 1 646 269 175 38 20 109.36556

2 2 1292 440 264 74 26 103.78227

3 3 1938 578 337 94 42 99.61543

        Zipf TypeTokenRatio    Herdan  Guiraud    Sichel Lognormal
1 -0.6607349      0.4164087 0.7392036 41.57137 0.1412639 0.4508406
2 -0.7533172      0.3405573 0.7300149 61.41866 0.1681818 0.5339446
3 -0.7628707      0.2982456 0.7186964 76.35996 0.1626298 0.5794592

The upper left panel of Figure 6.17 is obtained by regressing log Types on log Tokens,

> plot(log(alice.g$Tokens), log(alice.g$Types))

> alice.g.lm = lm(log(alice.g$Types) ~ log(alice.g$Tokens))

> abline(alice.g.lm, col = "darkgrey")

The summary of the model

> summary(alice.g.lm)

Coefficients:

                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.810291   0.041288   43.84   <2e-16
log(alice.g$Tokens) 0.599329   0.004454  134.55   <2e-16

Residual standard error: 0.0243 on 38 degrees of freedom
Multiple R-Squared: 0.9979, Adjusted R-squared: 0.9979
F-statistic: 1.81e+04 on 1 and 38 DF, p-value: < 2.2e-16

shows that we have been extremely successful, with an R-squared of 0.998. But the residual plot shows that the model is nevertheless inadequate.

> plot(log(alice.g$Tokens), resid(alice.g.lm))

> abline(h=0)

The lower left panel of Figure 6.17 is obtained with zipf.fnc(). Its output is a data frame with the word frequencies, the frequencies of these frequencies, and the associated ranks.

> z = zipf.fnc(alice, plot = T)

> head(z, n = 3)

frequency freqOfFreq rank

114 1593 1 1

113 836 1 2


112 710 1 3

> tail(z, n = 3)

frequency freqOfFreq rank

3 3 228 1052

2 2 394 1446

1 1 1188 2634

When the argument plot is set to TRUE, it shows the RANK-FREQUENCY STEP FUNCTION in the graphics window, as illustrated in the lower left panel of Figure 6.17. The code it executes is simply

> plot(log(z$rank), log(z$frequency), type = "S")

The step function (obtained with type = "S") highlights that, especially for the lowest frequencies, large numbers of words share exactly the same frequency but have different (arbitrary) ranks. We fit a linear model predicting frequency from the highest rank with that frequency, and add the regression line.

> z.lm = lm(log(z$frequency) ~ log(z$rank))

> abline(z.lm, col = "darkgrey")

Finally, we add the plot with the residuals at each rank.

> plot(log(z$rank), resid(z.lm))

> abline(h=0)
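
Following up on the earlier remark about more complex regression models, a quadratic term in log rank can be added and compared to the straight-line fit, as in the hedged sketch below (z.qlm is a hypothetical name; any improvement holds only for this particular text size):

> z.qlm = lm(log(z$frequency) ~ log(z$rank) + I(log(z$rank)^2))
> anova(z.lm, z.qlm)    # does the quadratic term improve on the straight line?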

So why is it that the slopes of the regression models proposed by Herdan and Zipf change systematically as the text size is increased? A greater text size implies a greater sample size, and under normal circumstances, a greater sample size would lead one to expect not only more precise estimates but also more stable estimates. Consider, for instance, what happens if we regress reaction time on frequency for increasing samples of words from the data set english of English monomorphemic and monosyllabic words. We simplify by restricting ourselves to the data pertaining to the young age group, and by ignoring all other predictors in the model.

> young = english[english$AgeSubject == "young",]

> young = young[sample(1:nrow(young)), ]

The last line randomly reorders the rows in the data frame. We next define a vector with sample sizes,

> samplesizes = seq(57, 2284, by = 57)

and create vectors for storing the coefficients, their standard errors, and the lower bound of the 95% confidence interval.

> coefs = rep(0, 40)

> stderr = rep(0, 40)

> lower = rep(0, 40)


We loop over the sample sizes, select the relevant subset of the data, fit the model, and extract the statistics of interest.

> for (i in 1:length(samplesizes)) {
+   young.s = young[1:samplesizes[i], ]
+   young.s.lm = lm(RTlexdec ~ WrittenFrequency, data = young.s)
+   coefs[i] = coef(young.s.lm)[2]
+   stderr[i] = summary(young.s.lm)$coef[2, 2]
+   lower[i] = qt(0.025, young.s.lm$df.residual) * stderr[i]
+ }

Finally, we plot the coefficients as a function of sample size, and add the 95% confidence intervals.

> plot(samplesizes, coefs, ylim = c(-0.028, -0.044), type = "l",
+   xlab = "sample size", ylab = "coefficient for frequency")

> points(samplesizes, coefs)

> lines(samplesizes, coefs - lower, col = "darkgrey")

> lines(samplesizes, coefs + lower, col = "darkgrey")

What we see is that after some initial fluctuations the estimates of the coefficient become stable, and that the confidence interval becomes narrower as the sample size is increased.

This is the normal pattern: we expect that as the sample size grows larger, the difference between the sample mean and the population mean will approach zero. (This is known as the LAW OF LARGE NUMBERS.) However, this pattern is unlike anything that we see for our lexical measures.

The reason that our lexical measures misbehave is that word frequency distributions, and even more so the distributions of bigrams and trigrams, are characterized by large numbers of very low-probability elements. Such distributions are referred to as LNRE distributions, where the acronym LNRE stands for Large Number of Rare Events [Chitashvili and Khmaladze, 1989, Baayen, 2001]. Many of the rare events in the population do not occur in a given sample, even when that sample is large. The joint probability of the unseen words is usually so substantial that the relative frequencies in the sample become inaccurate estimates of the real probabilities. Since the relative frequencies in the sample sum to 1, they leave no space for the probabilities of the unseen types in the population. Hence, the sample relative frequencies have to be adjusted so that they become slightly smaller, in order to free probability space for the unseen types [Good, 1953, Gale and Sampson, 1995, Baayen, 2001]. An estimate for the joint probability of the unseen types is the growth rate of the vocabulary. For Alice's Adventures in Wonderland, this probability equals 0.05.

In other words, the likelihood of observing a new word at the end of the text is 1 out of 20. It is not surprising, therefore, that lexical measures have to be updated continuously as the text sample is increased.
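
This estimate can be checked directly from the token vector, as in the minimal sketch below (alice.freqs is an ad hoc name): the Good-Turing estimate of the unseen probability mass is the proportion of hapax legomena in the sample.

> alice.freqs = table(alice)
> sum(alice.freqs == 1)/length(alice)   # 1188/25942, approximately 0.046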

The package zipfR, developed by Evert and Baroni [2006], provides tools for fitting the two most important and useful LNRE models, the Generalized Inverse Gauss-Poisson model of Sichel [1986] and the finite Zipf-Mandelbrot model of Evert [2004].


Figure 6.18: Estimated coefficient for written frequency for English lexical decision times for increasing sample size, with 95% confidence interval.

An object type that is fundamental to the zipfR package is the FREQUENCY SPECTRUM. A frequency spectrum is a table with frequencies of frequencies. When working with raw text, we can make a frequency spectrum within R. (This, however, is feasible only with texts or small corpora with less than a million words.) By way of illustration, we return to Alice's Adventures in Wonderland, and apply table() twice:

> alice.table = table(table(alice))

> head(alice.table)

1 2 3 4 5 6

1188 394 228 150 101 53

> tail(alice.table)

522 532 620 710 836 1593

1 1 1 1 1 1

There are 1188 hapax legomena, 394 dis legomena, 228 tris legomena, and steadily decreasing counts of words with higher frequencies. At the tail of the frequency spectrum we see that the highest frequency, 1593, is realized by only a single word. To see which words have the highest frequencies, we apply table() to the text, but now only once.

After sorting, we see that the highest frequency is realized by the definite article.

> tail(sort(table(alice)))
alice
  it  she    a   to  and  the
 522  532  620  710  836 1593

In order to convert alice.table into a spectrum object, we apply spc(). Its first argument, m, should specify the word frequencies; its second argument, Vm, should specify the frequencies of these word frequencies.

> alice.spc = spc(m = as.numeric(names(alice.table)),
+   Vm = as.numeric(alice.table))

> alice.spc
    m   Vm
1   1 1188
2   2  394
3   3  228
4   4  150
5   5  101
6   6   53
7   7   59
8   8   53
9   9   29
10 10   37
...
    N    V
25942 2634

Spectrum objects have a summary method, which lists the first ten elements of the spectrum, together with the number of tokens N and the number of types V in the text. A spectrum behaves like a data frame, so we can verify that the counts of types and tokens are correct with

> sum(alice.spc$Vm)                # types
[1] 2634
> sum(alice.spc$m * alice.spc$Vm)  # tokens
[1] 25942

For large texts and corpora, frequency spectra should be created by independent software. For a corpus of Dutch newspapers of some 80 million words (part of the Twente Nieuws Corpus), a frequency spectrum is available as the data set twente. We convert this data frame into a zipfR spectrum object with spc().

> twente.spc = spc(m=twente$m, Vm = twente$Vm)

> N(twente.spc) # ask for number of tokens


[1] 78934379

> V(twente.spc)  # ask for number of types
[1] 912289

Note that a frequency spectrum provides a very concise summary of a frequency distribution. We have nearly a million different words (defined as sequences of characters separated by spaces), but twente.spc has a mere 4639 rows.
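
Since a spectrum behaves like a data frame, this can be checked directly (a trivial sketch, assuming nrow() applies to spectrum objects as it does to data frames):

> nrow(twente.spc)    # number of distinct frequency classes
[1] 4639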

We return to Alice's Adventures in Wonderland and fit an LNRE model to this text with lnre(). This function takes two arguments: the type of model, and a frequency spectrum. We first choose as a model the Generalized Inverse Gauss-Poisson model, gigp.

> alice.lnre.gigp = lnre("gigp", alice.spc)

A summary of the model is obtained by typing the name of the model object to the prompt.

> alice.lnre.gigp

Generalized Inverse Gauss-Poisson (GIGP) LNRE model.

Parameters:

Shape:            gamma = -0.7097631
Lower decay:          B = 0.02788357
Upper decay:          C = 0.03338946
[ Zipf size:          Z = 29.94957 ]
Population size:      S = 6021.789

Sampling method: Poisson, with exact calculations.

Parameters estimated from sample of size N = 25942:

V V1 V2 V3 V4 V5

Observed: 2634.00 1188.00 394.00 228.00 150.00 101.00 ...

Expected: 2622.03 1169.79 455.43 228.43 137.02 91.98 ...

Goodness-of-fit (multivariate chi-squared test):

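
Beyond this excerpt, a fitted LNRE model can be used to obtain expected counts at arbitrary sample sizes. The following is a hedged sketch of one possible next step, using zipfR's EV() and EVm() with the model fitted above:

> EV(alice.lnre.gigp, 2 * N(alice.spc))      # expected number of types at twice the text size
> EVm(alice.lnre.gigp, 1, 2 * N(alice.spc))  # expected number of hapax legomena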