
Descriptive statistics for ordinal data

In the document Corpus linguistics (pages 173-179)

Let us turn, next, to a design with one nominal and one ordinal variable: a test of the second of the three hypotheses introduced at the beginning of this chapter.

Again, it is restated here together with the background assumption from which it is derived:

(12) Assumption: Animate items occur before inanimate items.

Hypothesis: The s-possessive will be used when the modifier is high in Animacy, the of-possessive will be used when the modifier is low in Animacy.

The constructions are operationalized as before. The data used are based on the same data set, except that cases with proper names are now included. For expository reasons, we are going to look at a ten-percent subsample of the full sample, giving us 23 s-possessives and 18 of-possessives.

Animacy was operationally defined in terms of the annotation scheme shown in Table 5.5 (based on Zaenen et al. 2004).

As pointed out above, Animacy hierarchies are a classic example of ordinal data, as the categories can be ordered (although there may be some disagreement about the exact order), but we cannot say anything about the distance between one category and the next, and there is more than one conceptual dimension involved (I ordered them according to dimensions like “potential for life”, “touchability” and “conceptual independence”).

Table 5.5: A simple annotation scheme for Animacy

| Animacy category | Definition | Rank |
| --- | --- | --- |
| human (hum) | Real or fictional humans and human-like beings | 1 |
| organization (org) | Groups of humans acting with a common purpose | 2 |
| other animate (ani) | Real or fictional animals, animal-like beings and plants | 3 |
| human attribute (hat) | Body parts, organs, etc. of humans | 4 |
| concrete touchable (cct) | Physical entities that are incapable of life and can be touched | 5 |
| concrete nontouchable (ccn) | Physical entities that are incapable of life and cannot be touched | 6 |
| location (loc) | Physical places and regions | 7 |
| time (tim) | Points in and periods of time | 8 |
| event (evt) | Events | 9 |
| abstract (abs) | Other abstract entities | 10 |

We can now formulate the following prediction:

(13) Prediction: The modifiers of the s-possessive will tend to occur high on the Animacy scale, the modifiers of the of-possessive will tend to occur low on the Animacy scale.

Note that phrased like this, it is not yet a quantitative prediction, since “tend to” is not a mathematical concept. While frequency for nominal data and mean (or “average”) for cardinal data are used in everyday language with something close to their mathematical meaning, we do not have an everyday word for dealing with differences in ordinal data. We will return to this point presently, but first, let us look at the data impressionistically. Table 5.6 shows the annotated sample (cases are listed in the order in which they occurred in the corpus).

A simple way of finding out whether the data conform to our prediction would be to sort the entire data set by the rank assigned to the examples and check whether the s-possessives cluster near the top of the list and the of-possessives near the bottom. Table 5.7 shows this ranking.

Table 5.7 shows that the data conform to our hypothesis: among the cases whose modifiers have an animacy of rank 1 to 3, s-possessives dominate; among those with a modifier of rank 4 to 10, of-possessives make up an overwhelming majority.
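The impressionistic comparison just described can be made explicit with a short tally. The following is a minimal sketch in Python, not part of the original analysis: the rank lists are transcribed from Table 5.7, the cutoff between ranks 3 and 4 is the one used in the paragraph above, and the names `s_ranks`, `of_ranks` and `tally` are hypothetical.

```python
# Animacy ranks of the modifiers, transcribed from Table 5.7
# (assumption: the ranks are used exactly as listed in that table).
s_ranks = [1] * 14 + [2] * 5 + [3] * 2 + [5, 6]
of_ranks = [1, 1, 2, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 9, 10, 10, 10, 10]


def tally(ranks, cutoff=3):
    """Count modifiers with rank 1..cutoff and with rank cutoff+1..10."""
    high = sum(1 for rank in ranks if rank <= cutoff)
    return {"rank 1-3": high, "rank 4-10": len(ranks) - high}


s_tally = tally(s_ranks)
of_tally = tally(of_ranks)

print("s-possessive: ", s_tally)
print("of-possessive:", of_tally)
```

The tally only restates what the sorted list shows; it depends on where the cutoff is placed, which is one reason a less arbitrary summary is needed.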

However, we need a less impressionistic way of summarizing data sets coded as ordinal variables, since not all data sets will be as straightforwardly interpretable as this one. So let us turn to the question of an appropriate descriptive statistic for ordinal data.

Table 5.6: A sample of s- and of-possessives annotated for Animacy (BROWN)

(a) s-possessive

| No. | Example | Animacy | Rank |
| --- | --- | --- | --- |
| 1 | its [administration] policy | org | 2 |
| 2 | her professional roles | hum | 1 |
| 3 | their burden | hum | 1 |
| 4 | its [word] musical frame | ccn | 6 |
| 5 | its [sect] metaphysic | org | 2 |
| 6 | your management climate | org | 2 |
| 7 | their families | hum | 1 |
| 8 | Lumumba’s death | hum | 1 |
| 9 | his arts or culture | hum | 1 |
| 10 | her life | hum | 1 |
| 11 | its [monument] reputation | cct | 5 |
| 12 | their impulses and desires | hum | 1 |
| 13 | its [board] members’ duties | org | 2 |
| 14 | our national economy | org | 2 |
| 15 | the convict’s climactic reappearance | hum | 1 |
| 16 | its [bird] wing | ani | 3 |
| 17 | her father | hum | 1 |
| 18 | his voice | hum | 1 |
| 19 | her brain | hum | 1 |
| 20 | his brown face | hum | 1 |
| 21 | his expansiveness | hum | 1 |
| 22 | its [snake] black, forked tongue | ani | 3 |
| 23 | the novelist’s carping phrase | hum | 1 |

(b) of-possessive

| No. | Example | Animacy | Rank |
| --- | --- | --- | --- |
| 1 | the invasion of Cuba | loc | 8 |
| 2 | a joint session of Congress | org | 2 |
| 3 | [...] enemies of peaceful coexistence | evt | 7 |
| 4 | the word of God | hum | 1 |
| 5 | the volume of the cylinder opening [...] | cct | 5 |
| 6 | the depths of the fourth dimension | abs | 10 |
| 7 | the views of George Washington | hum | 1 |
| 8 | all the details of the pattern | abs | 10 |
| 9 | the makers of constitutions | ccn | 6 |
| 10 | the extent of ethical robotism | abs | 10 |
| 11 | the number of new [...] construction projects [...] | cct | 5 |
| 12 | the expanding [...] economy of the 1960’s | tim | 9 |
| 13 | hyalinization of [...] glomerular arterioles | hat | 4 |
| 14 | the possible forms of nonverbal expression | evt | 7 |
| 15 | the maintenance of social stratification [...] | abs | 10 |
| 16 | knowledge of the environment | cct | 5 |
| 17 | the bow of the nearest skiff | cct | 5 |
| 18 | the corner of the car | cct | 5 |

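Table 5.7: The annotated sample from Table 5.6 ordered by animacy rank

| Anim. | Type | No. |
| --- | --- | --- |
| 1 | s | (a 2) |
| 1 | s | (a 3) |
| 1 | s | (a 7) |
| 1 | s | (a 8) |
| 1 | s | (a 9) |
| 1 | s | (a 10) |
| 1 | s | (a 12) |
| 1 | s | (a 15) |
| 1 | s | (a 17) |
| 1 | s | (a 18) |
| 1 | s | (a 19) |
| 1 | s | (a 20) |
| 1 | s | (a 21) |
| 1 | s | (a 23) |
| 1 | of | (b 4) |
| 1 | of | (b 7) |
| 2 | s | (a 1) |
| 2 | s | (a 5) |
| 2 | s | (a 6) |
| 2 | s | (a 13) |
| 2 | s | (a 14) |
| 2 | of | (b 2) |
| 3 | s | (a 16) |
| 3 | s | (a 22) |
| 4 | of | (b 13) |
| 5 | s | (a 11) |
| 5 | of | (b 5) |
| 5 | of | (b 11) |
| 5 | of | (b 16) |
| 5 | of | (b 17) |
| 5 | of | (b 18) |
| 6 | s | (a 4) |
| 6 | of | (b 9) |
| 7 | of | (b 3) |
| 7 | of | (b 14) |
| 8 | of | (b 1) |
| 9 | of | (b 12) |
| 10 | of | (b 6) |
| 10 | of | (b 8) |
| 10 | of | (b 10) |
| 10 | of | (b 15) |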

5.3.1 Medians

As explained above, we cannot calculate a mean for a set of ordinal values, but we can do something similar. The idea behind calculating a mean value is, essentially, to provide a kind of mid-point around which a set of values is distributed – it is a so-called measure of central tendency. Thus, if we cannot calculate a mean, the next best thing is to simply list our data ordered from highest to lowest and find the value in the middle of that list. This value is known as the median – a value that splits a sample or population into a higher and a lower portion of equal sizes.

For example, the rank values for the Animacy of our sample of s-possessives are shown in Figure 5.1a. There are 23 values, thus the median is the twelfth value in the series (marked by a dot labeled M) – there are eleven values above it and eleven below it. The twelfth value in the series is a 1, so the median value of s-possessive modifiers in our sample is 1 (or human).

(a) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 5 6 (M marks the twelfth value, 1)

(b) 1 1 2 4 5 5 5 5 5 6 7 7 8 9 10 10 10 10 (M falls between the ninth and tenth values, 5 and 6)

Figure 5.1: Medians for (a) the s-possessives and (b) the of-possessives in Table 5.7

If the sample consists of an even number of data points, we simply calculate the mean between the two values that lie in the middle of the ordered data set.

For example, the rank values for the Animacy of our sample of of-possessives are shown in Figure 5.1b. There are 18 values, so the median falls between the ninth and the tenth value (marked again by a dot labeled M). The ninth and tenth values are 5 and 6 respectively, so the median for the of-possessive modifiers is (5 + 6)/2 = 5.5 (i.e., it falls between concrete touchable and concrete nontouchable).
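The two cases (odd and even sample size) can be checked with a few lines of code. The following Python sketch is only an illustration under stated assumptions: the rank lists are transcribed from Figure 5.1, the helper `median_by_hand` is a hypothetical name that spells out the procedure described above, and `statistics.median` from the Python standard library is used as a cross-check.

```python
from statistics import median

# Animacy ranks as listed in Figure 5.1 (transcribed by hand).
s_ranks = [1] * 14 + [2] * 5 + [3] * 2 + [5, 6]
of_ranks = [1, 1, 2, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 9, 10, 10, 10, 10]


def median_by_hand(values):
    """Return the middle value of the ordered list; for an even number
    of values, return the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: mean of the middle pair


s_median = median_by_hand(s_ranks)
of_median = median_by_hand(of_ranks)

# Cross-check against the standard library implementation.
assert s_median == median(s_ranks)
assert of_median == median(of_ranks)

print("median (s-possessive): ", s_median)
print("median (of-possessive):", of_median)
```

Note that averaging the two middle values treats the ranks as numbers for a moment; the result 5.5 is best read as "between rank 5 and rank 6" rather than as a position on an interval scale.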

Using the idea of a median, we can now rephrase our prediction in quantitative terms:

(14) Prediction: The modifiers of the s-possessive will have a higher median on the Animacy scale than the modifiers of the of-possessive.

Our data conform to this prediction, as 1 is higher on the scale than 5.5. As before, this does not prove or disprove anything, as, again, we would expect some random variation. Again, we will return to this issue in Chapter 6.

5.3.2 Frequency lists and mode

Recall that I mentioned above the possibility of treating ordinal data like nominal data. Table 5.8 shows the relative frequencies for each animacy category (alternatively, we could also calculate expected frequencies in the way described in Section 5.3 above).

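Table 5.8: Relative frequencies for the Animacy values of possessive modifiers

| Rank | Category | s-possessive (abs.) | s-possessive (rel.) | of-possessive (abs.) | of-possessive (rel.) |
| --- | --- | --- | --- | --- | --- |
| 1 | human | 14 | 0.609 | 2 | 0.111 |
| 2 | organization | 5 | 0.217 | 1 | 0.056 |
| 3 | other animate | 2 | 0.087 | 0 | – |
| 4 | human attribute | 0 | – | 1 | 0.056 |
| 5 | concrete touchable | 1 | 0.043 | 5 | 0.278 |
| 6 | concrete nontouchable | 1 | 0.043 | 1 | 0.056 |
| 7 | location | 0 | – | 1 | 0.056 |
| 8 | time | 0 | – | 1 | 0.056 |
| 9 | event | 0 | – | 2 | 0.111 |
| 10 | abstract | 0 | – | 4 | 0.222 |
| Total | | 23 | 1.000 | 18 | 1.000 |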

This table also nicely shows the preference of the s-possessive for animate modifiers (human, organization, other animate) and the preference of the of-possessive for the categories lower on the hierarchy. The table also shows, however, that the modifiers of the of-possessive are much more evenly distributed across the entire Animacy scale than those of the s-possessive.
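Absolute and relative frequencies of this kind are easy to derive from the annotated sample. The following Python sketch is one possible way of doing so, under the assumption that the category codes are transcribed correctly from Table 5.6 (in corpus order) and that the scale order is the one given in Table 5.5; the names `s_cats`, `of_cats`, `SCALE` and `frequency_table` are hypothetical.

```python
from collections import Counter

# Animacy codes of the modifiers in corpus order, transcribed from Table 5.6.
s_cats = ("org hum hum ccn org org hum hum hum hum cct hum "
          "org org hum ani hum hum hum hum hum ani hum").split()
of_cats = ("loc org evt hum cct abs hum abs ccn "
           "abs cct tim hat evt abs cct cct cct").split()

# Categories ordered as in Table 5.5, from highest to lowest Animacy.
SCALE = ["hum", "org", "ani", "hat", "cct", "ccn", "loc", "tim", "evt", "abs"]


def frequency_table(codes):
    """Return {category: (absolute, relative)} for every category on the scale."""
    counts = Counter(codes)  # categories that never occur count as 0
    total = len(codes)
    return {cat: (counts[cat], counts[cat] / total) for cat in SCALE}


s_table = frequency_table(s_cats)
of_table = frequency_table(of_cats)

for cat in SCALE:
    s_abs, s_rel = s_table[cat]
    of_abs, of_rel = of_table[cat]
    print(f"{cat}  {s_abs:>2}  {s_rel:.3f}  {of_abs:>2}  {of_rel:.3f}")
```

Counting by category code rather than by rank keeps the result independent of the numerical rank assigned to each category.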

For completeness’ sake, let me point out that there is a third measure of central tendency that is especially suited to nominal data (but can also be applied to ordinal and cardinal data): the mode. The mode is simply the most frequent value in a sample, so the modifiers of the of-possessive have a mode of 5 (or concrete touchable) and those of the s-possessive have a mode of 1 (or human) with respect to animacy (similarly, we could have said that the mode of s-possessive modifiers is discourse-old and the mode of of-possessive modifiers is discourse-new). There may be more than one mode in a given sample.

For example, if we had found just a single additional modifier of the type abstract in the sample above (which could easily have happened), its frequency would also be five; in this case, the of-possessive modifiers would have two modes (concrete touchable and abstract).
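The single-mode and two-mode cases can be illustrated as follows. This is a minimal Python sketch, assuming the rank lists from Figure 5.1 and Python 3.8 or later (for `statistics.multimode`, which returns all most frequent values as a list); the hypothetical extra abstract modifier is represented by appending rank 10.

```python
from statistics import multimode

# Animacy ranks as listed in Figure 5.1 (transcribed by hand).
s_ranks = [1] * 14 + [2] * 5 + [3] * 2 + [5, 6]
of_ranks = [1, 1, 2, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 9, 10, 10, 10, 10]

s_modes = multimode(s_ranks)    # most frequent rank(s) among s-possessive modifiers
of_modes = multimode(of_ranks)  # most frequent rank(s) among of-possessive modifiers

# Hypothetical case from the text: one additional abstract modifier (rank 10).
of_modes_extra = multimode(of_ranks + [10])

print("modes (s-possessive): ", s_modes)
print("modes (of-possessive):", of_modes)
print("modes (of-possessive, one more abstract):", of_modes_extra)
```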

The concept of mode may seem useful in cases where we are looking for a single value by which to characterize a set of nominal data, but on closer inspection it turns out that it does not actually tell us very much: it tells us what the most frequent value is, but it does not tell us how much more frequent that value is than the next most frequent one, how many other values occur in the data at all, etc. Thus, it is always preferable to report the frequencies of all values, and, in fact, I have never come across a corpus-linguistic study reporting modes.
