
Discrete distributions

The CELEX lexical database [Baayen et al., 1995] lists the frequencies of a large number of English words in a corpus of 18.6 million words. Table 3.1 provides these frequencies for four words: the high-frequency definite article the, the medium-frequency word president, and two low-frequency words, hare and harpsichord. It also lists the RELATIVE FREQUENCIES of these words, which are obtained by dividing a word's frequency by the size of the corpus. These relative frequencies are estimates of the PROBABILITIES of these words in English.


DRAFT

Table 3.1: Frequencies and relative frequencies of four words in the version of the Cobuild corpus underlying the CELEX frequency counts (corpus size: 18580121 tokens).

               Frequency   Relative Frequency
the              1093547           0.05885575
president           2469           0.00013288
hare                 153           0.00000823
harpsichord           16           0.00000086

In the simplest model for text generation, the selection of a word for inclusion in a text is similar to sampling marbles from a vase. The likelihood of sampling a red marble is given by the proportion of red marbles in that vase. Crucially, we sample with replacement, and we assume that the probabilities of words do not change over time. We also assume independence: the outcome of one trial does not affect the outcome of the next trial. It is obvious that these assumptions of what is known as the urn model involve substantial simplifications. The probability of observing the, a high-probability word, adjacent to another instance of the in real language is very small. In spoken language such sequences may occasionally occur, for instance due to hesitations on the part of the speaker, but in carefully edited written texts a sequence of two instances of the is highly improbable. On the other hand, it is also clear that the is indeed very much more frequent than hare or harpsichord, and for questions at high aggregation levels, even simplifying assumptions can provide us with surprising leverage.

By way of example, consider the question of how the frequencies of these words compare to their frequencies observed in other, smaller, corpora of English such as the Brown corpus [Kučera and Francis, 1967] (1 million words). Table 3.2 lists the probabilities (relative frequencies) for the four words in Table 3.1, as well as the frequencies observed in the Brown corpus and the frequencies one would expect given CELEX. These expected frequencies are easy to calculate. For instance, if 0.05885575 is the proportion of word tokens in CELEX representing the word type the, then a similar proportion of tokens should represent this type in a 1 million word corpus, i.e., 1000000 * 0.05885575 = 58856 tokens. As shown in Table 3.2, the expected counts are smaller for the and president, larger for hare, and right on target for harpsichord.

Table 3.2: Probabilities (estimated from CELEX), expected frequencies and observed frequencies in the Brown corpus.

                      p   expected frequency   observed frequency
the          0.05885575                58856                69971
president    0.00013288                  133                  382
hare         0.00000823                    8                    1
harpsichord  0.00000086                    1                    1

Should we be surprised by the observed differences? In order to answer this question,


we need to make some assumptions about the properties of the distribution of a word's frequency. There are 382 occurrences of the noun president in the Brown corpus, but the Brown corpus is only one sample from American English as spoken in the early sixties.

If additional corpora were compiled from the same kind of textual materials using the same sampling criteria, the number of occurrences of the noun president would still vary from corpus to corpus. In other words, the frequency of a word in a corpus is a random variable. The statistical experiment associated with this random variable involves creating a corpus of one million words, followed by counting how often president is used in this corpus. For repeated experiments sampling one million words, we expect this random variable to assume values similar to the 382 tokens observed in the Brown corpus.

But what we really want to know is the magnitude of the fluctuations of the frequency of president across corpora.

At this point, we need some further terminology. Let’s define two probabilities, the probability of observing a specific word and the probability of observing any other word.

We call the former probability p the PROBABILITY OF SUCCESS, and the latter probability q the PROBABILITY OF FAILURE. The probability of failure is 1 minus the probability of success. In the case of hare, these probabilities are p = 0.0000082 and q = 0.9999918. Furthermore, let the NUMBER OF TRIALS (n) denote the size of the corpus. Each token in the corpus is regarded as a trial which can result either in a success (hare is observed) or in a failure (some other word is observed). Given the previously mentioned simplifying assumption that words are used independently and randomly in text, it turns out that we can model the frequency of a word as a BINOMIALLY DISTRIBUTED RANDOM VARIABLE with PARAMETERS p and n. (The textbook example of a binomially distributed random variable is the count of heads observed when tossing a coin n times that has probability p of turning up heads.) The properties of the binomial distribution are well known, and make it possible to obtain better insight into how much variability we may expect for our word frequencies across corpora, given our simplifying assumptions.

There are two kinds of properties that we need to distinguish. On the one hand, there are the properties of the POPULATION; on the other hand, there are the properties of a given SAMPLE. When we consider the properties of the population, we consider what we expect to happen on average across an infinite series of experiments. When we consider the properties of a sample, we consider what has actually occurred in a finite, usually small, series of experiments. We need tools for both kinds of properties. For instance, we want to know whether an observed frequency of 382 is surprising for president given that p = 0.000133 according to the CELEX counts and n = 1,000,000. This is a question about the population: how often will we observe this frequency across an infinite series of samples of one million words? Is this close to what one would expect on average? In this book, we will mostly use properties of the population, but sometimes it is also useful to know what a sample of a given size might look like. R provides tools for both kinds of questions.

Consider the upper left panel of Figure 3.1. The horizontal axis graphs frequency, the vertical axis the probability of that frequency, given that the word the is binomially distributed with parameters n = 1,000,000 and p = 0.059. The tool that we use here




Figure 3.1: The frequencies (horizontal axis) and the probabilities of these frequencies (vertical axis) for three words under the assumption that word frequencies are binomially distributed. Upper panels show the population distributions, lower panels the sample distributions for 500 random corpora.



is the dbinom() function, which is often referred to as the FREQUENCY FUNCTION and also as the PROBABILITY DENSITY FUNCTION. It requires three input values: a frequency (or a vector of frequencies), and values for the two parameters that define a binomial distribution, n and p. dbinom() returns the probability of that frequency (or a vector of such probabilities in case a vector of frequencies was supplied). For instance, the expected probability of observing the exactly 59000 times, averaged over an infinite series of corpora of one million words, given the probability of success p = 0.05885575, is

> dbinom(59000, 1000000, 0.05885575)
[1] 0.001403392

The upper panels of Figure 3.1 show, for each of the three words from Table 3.2, the probabilities of the frequencies with which these words are expected to occur. For each word and each frequency, we used dbinom() to calculate these probabilities given a sample size n = 1,000,000 and the word's population probability p as estimated by its relative frequency in CELEX.

The panel for the shows frequencies that are more or less centered around the mean frequency, 58856, the expected count listed in Table 3.2. We can see that the probability of observing values greater than 60000 is infinitesimally small, hence we have solid grounds to be surprised by the frequency of 69971 observed in the Brown corpus given the CELEX counts. The next panel of Figure 3.1 shows the distribution of frequencies for hare. This is a low-frequency word, and we can now see the individual high-density lines for the individual frequencies. The pattern is less symmetric. The highest probability is 0.1391, which occurs for a frequency of 8, in conformity with the expected value we saw earlier in Table 3.2. The value actually observed in the Brown corpus, 1, is clearly atypically low. The upper right panel, finally, shows that for the very low-frequency word harpsichord, a frequency of zero is actually slightly more likely than the frequency of 1 listed in Table 3.2 (which rounded the expected frequency 0.86 to the nearest actually possible, discrete, number of occurrences).
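The readings taken from these panels can be checked directly with dbinom(); in this sketch the range 0:30 is an arbitrary window, chosen simply because it is wide enough to cover all non-negligible frequencies for hare.

```r
n <- 1000000
p <- 0.0000082                  # probability of success for hare

probs <- dbinom(0:30, n, p)     # probability of each frequency 0, 1, ..., 30
(0:30)[which.max(probs)]        # the mode: a frequency of 8
round(max(probs), 3)            # its probability: 0.139

# for harpsichord, a frequency of 0 is indeed more probable than 1
dbinom(0, n, 0.00000086) > dbinom(1, n, 0.00000086)   # TRUE
```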

The panels in the second row of Figure 3.1 correspond to those in the first row. The difference concerns the way in which the probabilities were obtained. The probabilities for the top row are those one would obtain for the frequencies observed across an infinite series of corpora (experiments) of one million words. They are population probabilities.

The probabilities in the second row are those one might observe for a particular run of just 500 corpora (experiments) of one million words. They illustrate the kind of irregularities in the shape of a distribution that are typical of the actual samples with which we have to deal in practice. The irregularities that characterize sample distributions are most clearly visible in the lower left panel, but also to some extent in the lower central panel. Note that here the mode (the frequency with the highest sample probability) has an elevated value with respect to the immediately surrounding frequencies, compared to the upper central panel. Below, we discuss the tool for simulating random samples of a binomial random variable that we used to make these plots.

Figure 3.1 illustrates how the parameter p, the probability of success, affects the shape of the distribution. The other parameter, the number of trials (corpus size) n, likewise

co-determines the shape of the distribution.

Figure 3.2: The frequencies (horizontal axis) and the probabilities of these frequencies (vertical axis) for the, assuming that its frequency is binomially distributed with p = 0.05885575 and n = 1000 (left panel) or n = 50 (right panel).

Figure 3.2 illustrates this for the population, i.e., across an infinite series of corpora of n = 1000 (left) and n = 50 (right) word tokens. The left panel is still more or less symmetric, but by the time the corpus size is reduced to only 50 tokens, the symmetry is gone.

It is important to realize that the values that a binomially (n, p)-distributed random variable can assume are bounded by 0 and n. In the present example, this is intuitively obvious: a word need not occur in a corpus of size n, and so may have zero frequency.

But a word can never occur more often than the corpus size. The upper bound, therefore, is n, for a boring but theoretically possible corpus consisting of just one word repeated n times. It is also useful to keep in mind that the EXPECTED (or mean) frequency is n * p, as p specifies the proportion of the n trials that are successful.
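The n * p rule can be verified from first principles by weighting every possible frequency by its probability; the parameter values below are the ones used for the left panel of Figure 3.2.

```r
n <- 1000
p <- 0.05885575

n * p                            # 58.85575 expected occurrences
# the same mean, obtained by summing frequency * probability over 0, ..., n
sum((0:n) * dbinom(0:n, n, p))   # agrees up to floating-point rounding
```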

Let's now have a closer look at the tools that R provides for working with the binomial distribution. There are four such tools, the functions dbinom(), qbinom(), pbinom(), and rbinom(). R provides similar functions for a wide range of other random variables.

Once you know how to use them for the binomial distribution, you know how to use the corresponding functions for any other distribution implemented in R.

First consider the observed frequency of 1 for hare, where one would expect 8 given the counts in CELEX. What is the probability of observing such a low count under chance



conditions? To answer this question, we use the function dbinom() that we already introduced above. Given an observed value (its first argument), and given the parameters n and p (its second and third arguments), it returns the requested probability:

> dbinom(1, size = 1000000, prob = 0.0000082)
[1] 0.002252102

In this example, I have spelled out the names of the second and third parameters, the size n and the probability p, in order to make it easier to interpret the function call, but the shorter version works just as well, as long as the arguments are provided in exactly this order:

> dbinom(1, 1000000, 0.0000082)
[1] 0.002252102

Of course, if we think 1 is a low frequency, then 0 must also be a low frequency. So maybe we should ask what the probability is of observing a frequency of 1 or lower. Since the event of observing a count of 1 and the event of observing a count of 0 are mutually exclusive, we may add these two probabilities,

> dbinom(0, size = 1000000, prob = 0.0000082) +
+ dbinom(1, size = 1000000, prob = 0.0000082)
[1] 0.002526746

or, equivalently,

> sum(dbinom(0:1, size = 1000000, prob = 0.0000082))
[1] 0.002526746

When dbinom() is supplied with a vector of frequencies, it returns a vector of probabilities, which we add using sum(). Another way to proceed is to make use of the pbinom() function, which immediately produces the sum of the probabilities for the supplied frequency as well as the probabilities of all smaller frequencies:

> pbinom(1, size = 1000000, prob = 0.0000082)
[1] 0.002526746

The low probability that we obtain here suggests that there is indeed reason for surprise about the low frequency of hare in the Brown corpus, at least from the perspective of CELEX.

Recall that the Brown corpus mentions the word president 382 times, whereas we would expect only 133 occurrences given CELEX. In this case, we can ask what the probability is of observing a frequency of 382 or higher. This probability is the same as one minus the probability of observing a frequency of 381 or less.

> 1 - pbinom(381, size = 1000000, prob = 0.00013288)
[1] 0



The resulting probability is indistinguishable from zero given machine precision, and provides ample reason for surprise.
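The zero is an artifact of subtracting a number extremely close to 1 from 1. When the actual magnitude of such a tail probability matters, pbinom() can compute the upper tail directly through its lower.tail argument, which avoids the cancellation:

```r
# P(frequency >= 382): the upper tail, computed without subtracting from 1
pbinom(381, size = 1000000, prob = 0.00013288, lower.tail = FALSE)
# returns a tiny but strictly positive probability
```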

We used the function dbinom() to make the upper panels of Figure 3.1 and the panels of Figure 3.2. Here is the code producing the left panel of Figure 3.2.

> n = 1000
> p = 0.05885575
> frequencies = seq(25, 95, by = 1)   # 25, 26, 27, ..., 94, 95
> probabilities = dbinom(frequencies, n, p)
> plot(frequencies, probabilities, type = "h",
+   xlab = "frequency", ylab = "probability of frequency")

The first two lines define the parameters of the binomial distribution. The third line defines a range of frequencies for which the corresponding probabilities have to be provided. The fourth line calculates these probabilities. Since frequencies is a vector, dbinom() provides a probability for each frequency in this vector. The last two lines plot the probabilities against the frequencies, provide sensible labels, and specify, by means of type = "h", that a vertical line (a 'high-density line') should be drawn downwards from each point on the density curve.

Thus far, we have considered functions for using the population properties of the binomial distribution. But it is sometimes useful to know what a sample from a given distribution would look like. The lower panels of Figure 3.1, for instance, illustrated the variability that is typically observed in samples. The tool for investigating random samples from a binomial distribution is the function rbinom(). This function produces binomially distributed RANDOM NUMBERS. A random number is a number that simulates the outcome of a statistical experiment. A binomial random number simulates the number of successes one might observe given a success probability p and n trials. Technically, random numbers are never truly random, but for practical purposes they are a good approximation to randomness.

The following lines of code illustrate how to make the lower panel for hare in Figure 3.1. We first define the number of random numbers, the corpus size (the number of trials in one binomial experiment), and the probability of success.

> s = 500       # the number of random numbers
> n = 1000000   # number of trials in one experiment
> p = 0.0000082 # probability of success

Next, we use rbinom() to produce the random numbers representing the simulated frequencies of hare in the samples. This function takes three arguments: the number of random numbers required, and the two parameters of the binomial distribution, n and p.

We feed the output of rbinom() into xtabs() to obtain a table listing for each simulated frequency how often that frequency occurs across the 500 simulation runs. We divide the resulting vector of counts by the number of simulation runs s to obtain the proportions (relative frequencies) of the simulated frequencies.

> x = xtabs( ~ rbinom(s, n, p) ) / s


> x
rbinom(s, n, p)
    2     3     4     5     6     7     8     9    10
0.012 0.028 0.062 0.086 0.126 0.118 0.138 0.132 0.084
   11    12    13    14    16    17    18    19
0.090 0.058 0.044 0.008 0.006 0.004 0.002 0.002

Note that in this simulation there are no instances where hare is observed not at all or only once. If you rerun this simulation, more extreme outcomes may be observed occasionally. This is because rbinom() simulates the randomness that is inherent in the sampling process. For plotting, we convert the cell names in the table to numbers with as.numeric():

> plot(as.numeric(names(x)), x, type = "h", xlim = c(0, 30),
+   xlab = "frequency", ylab = "sample probability of frequency")

Recall that pbinom(x, n, p) produces the summed probability of values smaller than or equal to x, which is why it is referred to as the CUMULATIVE DISTRIBUTION FUNCTION. It has a mirror image (technically, its INVERSE function), qbinom(y, n, p), the QUANTILE FUNCTION, which takes this summed probability as input and produces the corresponding count x.

> pbinom(4, size = 10, prob = 0.5)

[1] 0.3769531 # from count to cumulative probability

> qbinom(0.3769531, size = 10, prob = 0.5)

[1] 4 # from cumulative probability to count

Quantile functions are useful for checking whether a random variable is indeed binomially distributed. Consider, for example, the frequencies of the Dutch determiner het in the consecutive stretches of 1000 words of a Dutch novel that gave its name to a fair-trade brand in Europe, 'Max Havelaar' (by Eduard Douwes Dekker, 1820 - 1887). The data set havelaar contains these counts for the 99 consecutive complete stretches of 1000 words in this novel.

> havelaar$Frequency

 [1] 13 19 19 14 20 18 16 16 17 32 25 10  9 12 15
[16] 22 26 16 23 10 12 11 16 13  8  4 16 13 13 11
[31] 11 18 12 16 10 18 10 11  9 18 15 36 22 10  7
[46] 20  5 13 12 14  9  6  8  7  9 11 14 16 10  9
[61] 12 11  6 20 11 12 12  1  9 11 11  7 13 13 10
[76]  9 13  7  8 16 11 15  8 16 26 23 13 11 15 12
[91]  7  9 18  8 21  5 16 11 13

Are these frequencies binomially distributed? As a first step, we estimate the probability of success from the sample, while noting that the number of trials n is 1000:

> n = 1000

> p = mean(havelaar$Frequency / n)



In order to see whether the observed frequencies indeed follow a binomial distribution, we plot the quantiles of an (n, p)-binomially distributed random variable against the sorted observed frequencies. Recall that the quantile for a given proportion p is the smallest observed value such that all observed values less than or equal to that value account for the proportion p of the data. If we plot the observed quantiles against the quantiles of a truly (n, p)-binomially distributed random variable, we should obtain a straight line if the observed frequencies are indeed binomially distributed. We therefore define a vector of proportions

> qnts = seq(0.005, 0.995, by=0.01)

and use the quantile() function to obtain the corresponding expected and observed frequencies for these percentage points, which we then graph.

> plot(qbinom(qnts, n, p), quantile(havelaar$Frequency, qnts),
+   xlab = paste("quantiles of (", n, ",", round(p, 4),
+     ")-binomial", sep = ""),
+   ylab = "frequencies")

As can be seen in Figure 3.3, the points in the resulting QUANTILE-QUANTILE PLOT do not follow a straight line. Especially the higher frequencies are too high for a binomially (1000, 0.0134)-distributed random variable.


Figure 3.3: Quantile-quantile plot for inspecting whether the frequency of the Dutch determiner het in the novel Max Havelaar is binomially distributed.



To summarize, here is a short characterization of the four functions for working with the binomial distribution with n trials and success probability p:

dbinom(x, n, p)   THE PROBABILITY DENSITY FUNCTION
                  the probability of the value x
qbinom(q, n, p)   THE QUANTILE FUNCTION
                  the largest value for the first q% of ranked data points
pbinom(x, n, p)   THE CUMULATIVE DISTRIBUTION FUNCTION
                  the proportion of values less than or equal to x
rbinom(k, n, p)   THE RANDOM NUMBER GENERATOR
                  k binomially distributed random numbers
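These four functions hang together tightly, and their relations are easy to verify; the sketch below uses the coin-tossing example mentioned earlier (n = 10, p = 0.5), chosen purely for illustration.

```r
n <- 10
p <- 0.5

# the cumulative probability is the running sum of the point probabilities
abs(sum(dbinom(0:4, n, p)) - pbinom(4, n, p)) < 1e-12   # TRUE

# the quantile function inverts the cumulative distribution function
qbinom(pbinom(4, n, p), n, p)                           # 4

# random numbers always fall within the possible range 0, ..., n
draws <- rbinom(1000, n, p)
all(draws >= 0 & draws <= n)                            # TRUE
```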

Thus far, we used the binomial distribution to gain some insight in the
