
5.1.3 Tables with counts: Correspondence Analysis

In the preceding section we used principal components analysis for analyzing a two-way table of measurements (i.e., real-valued numbers). For two-way contingency tables, correspondence analysis provides an attractive alternative. Like principal components analysis, correspondence analysis seeks to provide a low-dimensional map of the data.

The correspondence map is made in two steps. First, two matrices of distances are calculated, one for the distances between columns, and one for the distances between rows.

In daily life, you may have encountered distance matrices for geographical distances between major cities. The cities are listed in both margins of the table. Hence, a distance matrix is always a square matrix. The distances on the main diagonal are zero, as the distance of a city to itself is zero. Furthermore, the distances above the main diagonal are the mirror image of the distances below the main diagonal: A distance matrix is symmetric. For this reason, some distance tables for cities show only the upper or the lower triangle of the distance matrix.

In correspondence analysis, we regard row vectors (or column vectors) as profiles of ‘cities’, and calculate the distances between them. There are many different ways in which distances (or dissimilarities) between vectors can be computed; the on-line help pages for dist() document a range of options. The distance measure that is used in correspondence analysis is the so-called chi-squared distance. Given a contingency table with 20 rows and 5 columns, correspondence analysis constructs two distance matrices: a 20 by 20 matrix specifying the distances between the rows, and a 5 by 5 matrix specifying the distances between the columns.
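To make the two distance matrices concrete, here is a minimal sketch in base R. It assumes the usual textbook definition of the chi-squared distance (Euclidean distance between profiles, with each dimension weighted by the inverse of its average share). The table of counts and the helper name chisq.dist() are invented for illustration, and the code is not taken from corres.fnc().

```r
# Toy contingency table: 4 rows (texts) by 3 columns (tags).
# The counts are invented for illustration only.
counts <- matrix(c(10,  5,  5,
                    8,  6,  6,
                    2,  9, 12,
                    3, 10, 11),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("text", 1:4), paste0("tag", 1:3)))

# Hypothetical helper: chi-squared distances between the rows of a table.
chisq.dist <- function(tab) {
  P <- tab / sum(tab)            # relative frequencies
  profiles <- P / rowSums(P)     # row profiles: each row sums to 1
  col.mass <- colSums(P)         # average profile (column masses)
  # dividing each column by sqrt(mass) turns the ordinary Euclidean
  # distance computed by dist() into the chi-squared distance
  dist(sweep(profiles, 2, sqrt(col.mass), "/"))
}

row.d <- as.matrix(chisq.dist(counts))      # 4 by 4: distances between rows
col.d <- as.matrix(chisq.dist(t(counts)))   # 3 by 3: distances between columns

round(row.d, 3)
round(col.d, 3)
```

Both matrices are square and symmetric with zeros on the main diagonal, just like the table of distances between cities.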

The second step in correspondence analysis is to represent these distances as faithfully as possible in a two-dimensional scatterplot, a low-dimensional map. The larger the distance between two rows, the farther apart these two rows should be in the map for rows. Likewise, dissimilar columns should be far apart, while similar columns should be near to each other in the map for columns. In correspondence analysis, we superimpose the row and column maps, analogous to the superposition of the PC scores and the loadings on these PCs in the biplot. Thanks to the chi-squared distance measure, we ensure that proximity between rows and columns in the merged map is as good an approximation as possible of the correlation between rows and columns. The set of functions illustrated in the following examples extends the code of Murtagh [2005].
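The second step can be sketched with a singular value decomposition. What follows is the standard textbook formulation of correspondence analysis, written in base R for an invented table of counts. We assume, but have not verified, that corres.fnc() arrives at equivalent coordinates; the object names are ours.

```r
# Invented counts: 4 texts by 3 tags.
counts <- matrix(c(10,  5,  5,
                    8,  6,  6,
                    2,  9, 12,
                    3, 10, 11),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("text", 1:4), paste0("tag", 1:3)))

P  <- counts / sum(counts)   # relative frequencies
r  <- rowSums(P)             # row masses
cm <- colSums(P)             # column masses

# Standardized residuals from the independence model
S <- (P - outer(r, cm)) / sqrt(outer(r, cm))
dec <- svd(S)

# Coordinates on the first two factors (principal coordinates)
k <- 1:2
row.coord <- (dec$u[, k] %*% diag(dec$d[k])) / sqrt(r)
col.coord <- (dec$v[, k] %*% diag(dec$d[k])) / sqrt(cm)
rownames(row.coord) <- rownames(counts)
rownames(col.coord) <- colnames(counts)

# Superimpose the row map and the column map
if (interactive()) {
  plot(rbind(row.coord, col.coord), type = "n",
       xlab = "Factor 1", ylab = "Factor 2")
  text(row.coord, rownames(row.coord))
  text(col.coord, rownames(col.coord), col = "blue")
}
```

For this small table two factors suffice to reproduce the chi-squared distances between the rows exactly. For larger tables the two-dimensional map is only the best available approximation.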

Ernestus et al. [2006] studied register variation and diachronic variation in the use of syntactic constructions in Medieval French. For 29 authors (some of whom are anonymous), and often for several manuscript versions of the same text, the counts of the 35 most frequent tag trigrams were calculated. Texts with more than 2000 words were subdivided into chunks of 2000 words.

The data of this study are available in the form of two data frames. The oldFrench data frame contains the counts of tag trigrams (columns) for 342 texts. The oldFrenchMeta data frame provides metadata on these texts, including information on author, region of origin, date of composition, register and topic.

> oldFrench[1:3, 1:4]
      T30.16.00 T00.31.51 T16.00.31 T00.60.31
Abe.2        11         2         1         6
Abe.3        13         4         6         5
Abe.4         7         1         4         2
> oldFrenchMeta[1:3, ]
  Textlabels Codes Author Topic Genre Region Year
1        Abe Abe.2   Meun    12 prose     R2 1325
2        Abe Abe.3   Meun    12 prose     R2 1325
3        Abe Abe.4   Meun    12 prose     R2 1325

In both data frames, rows represent text fragments. Rows are ordered alphabetically by the codes for the fragments. As a consequence, the information in the two data frames is perfectly aligned. As will become apparent below, this alignment allows us to select subsets of rows from oldFrench using information in oldFrenchMeta with R’s subscripting mechanism.

The columns of oldFrench represent the frequencies of the tag trigrams in the text fragments. What we would like to know is whether there are systematic differences in the frequencies of these tag trigrams as a function of author, topic, genre, region, and time. As a first step, we make use of the function corres.fnc(), which takes a data frame with counts as input and produces as output a correspondence analysis object. This object can subsequently be summarized and plotted.

> oldFrench.ca = corres.fnc(oldFrench)

Let’s first inspect the summary. As its output is rather voluminous, we specify head = TRUE, so that only the first six lines of relevant tables are shown.

> summary(oldFrench.ca, head = TRUE)
Call:
corres.fnc(oldFrench)

Eigenvalue rates:

  0.1704139 0.1326913 0.06854973 0.05852097 0.05394474 ...

Factor 1

          coordinates correlations contributions
T30.16.00      -0.113        0.074         0.012
T00.31.51      -0.560        0.464         0.103
T16.00.31      -0.139        0.053         0.006
T00.60.31      -0.122        0.050         0.006
T16.00.33      -0.085        0.020         0.003
T02.00.30       0.293        0.227         0.027
...

Factor 2

          coordinates correlations contributions
T30.16.00       0.119        0.082         0.017
T00.31.51       0.205        0.062         0.018
T16.00.31       0.255        0.179         0.024
T00.60.31       0.162        0.090         0.014
T16.00.33      -0.220        0.139         0.029
T02.00.30       0.166        0.073         0.011
...

The summary of oldFrench.ca begins by listing the eigenvalue rates. These rates have a similar interpretation as the proportions of the variance explained by the principal components in principal components analysis. The larger the rate, the more successful a factor is in accounting for differences among the distances between the texts. The first rate pertains to the first factor, the X-axis in a correspondence map, the second rate to the second factor, the Y-axis in the map. Higher dimensions are seldom considered in correspondence analysis. (For inspection of higher dimensions, specify n = a and the summary will display the first a dimensions.)

The summary then proceeds with two tables that specify, for the first two factors, how the distances between the columns relate to the distances between rows. As we called summary() with head = TRUE, only the first six tag trigrams are shown. For each tag trigram, its coordinate on the relevant axis is listed first, followed by its correlation with that axis. These correlations, however, are not standard correlations. They are more comparable to the loadings in principal components analysis, and as such they provide an important guide to the interpretation of the dimensions. The final column provides a measure for the extent to which a tag trigram (a row in these summary tables) contributes to the explanatory value of the factor.
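The three kinds of quantities in the summary can be sketched for an invented table. We assume the standard definitions from the correspondence analysis literature: eigenvalue rates as proportions of the total inertia, correlations as squared cosines, and contributions as mass-weighted squared coordinates divided by the eigenvalue. The summary method for corres.fnc() may compute or scale these differently, so the code below illustrates the concepts and does not re-implement that function.

```r
# Invented counts: 4 texts by 3 tags.
counts <- matrix(c(10,  5,  5,
                    8,  6,  6,
                    2,  9, 12,
                    3, 10, 11),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("text", 1:4), paste0("tag", 1:3)))

P  <- counts / sum(counts)
r  <- rowSums(P)
cm <- colSums(P)
dec <- svd((P - outer(r, cm)) / sqrt(outer(r, cm)))

keep <- dec$d > 1e-8        # drop the trivial dimension (singular value 0)
d    <- dec$d[keep]
eig  <- d^2                 # eigenvalues (principal inertias)

# Eigenvalue rates: the share of the total inertia carried by each factor
rates <- eig / sum(eig)

# Coordinates of the columns (tags) on the retained factors
G <- sweep(dec$v[, keep, drop = FALSE], 2, d, "*") / sqrt(cm)
rownames(G) <- colnames(counts)

# Contributions: how much each tag adds to a factor (each column sums to 1)
contrib <- sweep(cm * G^2, 2, eig, "/")

# Squared cosines: how well a factor represents a tag (each row sums to 1)
cos2 <- G^2 / rowSums(G^2)

round(rates, 3)
round(cbind(coordinates = G[, 1], correlations = cos2[, 1],
            contributions = contrib[, 1]), 3)
```

Under these definitions the correlations and contributions are never negative, even for tags with negative coordinates. This agrees with the pattern in the summary shown above.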

The attractiveness of correspondence analysis resides in the possibilities it offers for visualization. For instance, we can query whether the difference between prose and poetry is reflected in the frequencies with which particular tag trigrams are used. Figure 5.6 shows that there is a clear separation of prose and poetry on the first factor, which is carried primarily by the tag trigrams T00.30.01, T00.31.51 and T51.10.00.

This correspondence plot has a number of features that are controlled by a range of options. First, the texts of the two genres are shown with different colors. Second, tags are represented with their own font size, and also with another color. Third, we have not shown all 35 tags, which would clutter the center of the plot, but only those tags that drive the separation of the genres. Although

> plot(oldFrench.ca)

is sufficient to obtain a correspondence plot, the result, with 342 texts and 35 tag trigrams, is an extremely cluttered scatterplot. We therefore consider the plot method for correspondence objects in some more detail.


It is often useful to plot text properties as specified in the metadata rather than the identifiers of the texts themselves. By default, plot() uses the row names of the data frame serving as input to corres.fnc() for labeling the row data points in the scatterplot. We override this default with the option for row labels, which we set to point to, for instance, the genre labels in oldFrenchMeta by setting rlabels = oldFrenchMeta$Genre.

The option for row colors, rcol, allows us to specify different colors for the levels of Genre. This option should point to a vector that specifies, for each row (text), the color with which it is to be displayed. For instance, we can convert the factor oldFrenchMeta$Genre into a numerical vector with as.numeric(). The first factor level will now be paired with a 1, the second factor level with a 2, and so on. We then use these numbers as identifiers of colors by setting rcol = as.numeric(oldFrenchMeta$Genre).
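A small example with an invented factor shows how the conversion works. Factor levels are ordered alphabetically by default, so here poetry becomes 1 and prose becomes 2. Which colors these numbers select depends on the palette of your R installation.

```r
# An invented genre factor for four texts
genre <- factor(c("prose", "poetry", "prose", "poetry"))

levels(genre)                 # "poetry" "prose"

# Each observation is replaced by the position of its level
codes <- as.numeric(genre)
codes                         # 2 1 2 1

# The numbers index into the current color palette
cols <- palette()[codes]
cols
```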

We scale down the row font size with rcex = 0.5. As it makes no sense to add 35 column names to the plot, we restrict the tag trigrams to be shown to those that have extreme values in the first or last decile on either axis with extreme = 0.1. Finally, we set the color for the column names to blue (ccol = "blue"). This completes our plot instructions.

> plot(oldFrench.ca, rlabels = oldFrenchMeta$Genre,
+   rcol = as.numeric(oldFrenchMeta$Genre), rcex = 0.5,
+   extreme = 0.1, ccol = "blue")

In Figure 5.6, colors have been changed to greyscales; the colors will be shown on your computer screen when the preceding lines of code are used.
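The idea behind extreme = 0.1 can be illustrated with quantile(). The sketch below selects the items whose coordinate falls in the first or last decile on either of two axes. The coordinates and the helper is.extreme() are invented, and we do not claim that the plot method for correspondence objects is implemented in exactly this way.

```r
# Invented coordinates of 20 tags on two axes
axis1 <- (1:20 - 10.5) / 10                 # -0.95, -0.85, ..., 0.95
axis2 <- c(axis1[10:1], axis1[20:11])       # the same values in another order
tags  <- paste0("tag", 1:20)

# Hypothetical helper: TRUE for values below the p-th or above the
# (1 - p)-th quantile
is.extreme <- function(x, p = 0.1) {
  x < quantile(x, p) | x > quantile(x, 1 - p)
}

# A tag is shown if it is extreme on either axis
shown <- is.extreme(axis1) | is.extreme(axis2)
tags[shown]
```

Here eight of the twenty tags survive: two at each end of each axis.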

When we zoom in on the prose, we find indications of diachronic change. As a first step, we exclude those texts for which the approximate date of composition is not known. Because the rows of oldFrench and oldFrenchMeta are synchronized, we subscript oldFrench with information in oldFrenchMeta.

> prose = oldFrench[oldFrenchMeta$Genre == "prose" &
+   !is.na(oldFrenchMeta$Year),]

Texts for which we have no information on their approximate date of origin are labeled as missing data with NA. The function is.na() returns TRUE for those cells in its input vector that contain missing data. By negating this vector of truth values, we obtain a condition on the rows that allows only non-missing information into the new data frame.
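The following sketch replays this selection on two small invented data frames. The names counts, meta and the text codes are made up, and only the column names Codes, Genre and Year follow oldFrenchMeta. It also shows a simple check that the two data frames are indeed aligned before one is used to subscript the other.

```r
# Invented counts and metadata for four text fragments
counts <- data.frame(tagA = c(11, 13, 7, 4),
                     tagB = c(2, 4, 1, 9),
                     row.names = c("txtA.1", "txtA.2", "txtB.1", "txtC.1"))
meta <- data.frame(Codes = c("txtA.1", "txtA.2", "txtB.1", "txtC.1"),
                   Genre = c("prose", "prose", "poetry", "prose"),
                   Year  = c(1325, NA, 1200, 1240))

# The rows must be in the same order in both data frames
stopifnot(identical(rownames(counts), as.character(meta$Codes)))

is.na(meta$Year)      # FALSE  TRUE FALSE FALSE
!is.na(meta$Year)     #  TRUE FALSE  TRUE  TRUE

# Dated prose: both conditions must hold for a row to be selected
keep <- meta$Genre == "prose" & !is.na(meta$Year)
dated.prose <- counts[keep, ]
dated.info  <- meta[keep, ]

dated.prose
dated.info
```

Because the same logical vector is applied to both data frames, the two subsets remain aligned.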

We likewise create a version of oldFrenchMeta that is synchronized with prose,

> proseinfo = oldFrenchMeta[oldFrenchMeta$Genre == "prose" &
+   !is.na(oldFrenchMeta$Year),]

and because the chronological information is coarse, we set a major boundary at the year 1250.

> proseinfo$Period = as.factor(proseinfo$Year <= 1250)

We apply corres.fnc() and plot the result, disabling the addition of the column names with addcol = F.


Figure 5.6: Correspondence analysis of the frequencies of 35 tag trigrams in 342 Old French text fragments. Text fragments are labeled by register (prose versus poetry); only highly predictive tag trigrams are displayed.

> prose.ca = corres.fnc(prose)
> plot(prose.ca, addcol = F, rcol = as.numeric(proseinfo$Period) + 1,
+   rlabels = proseinfo$Year, rcex = 0.7)

As can be seen in Figure 5.7, the texts from 1250 or before, shown in light grey (or green on the computer screen), reveal some separation from texts dated after 1250, shown in dark grey (or red on the computer screen).
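The color coding follows from the way the logical comparison is turned into a factor and then into numbers. A short example with invented years illustrates the chain. The levels of the factor are FALSE and TRUE, in that order, so adding 1 maps the later texts to color 2 and the earlier texts to color 3, the positions of the palette that appear as red and green according to the description above.

```r
# Invented years of composition
year <- c(1200, 1250, 1251, 1325)

period <- as.factor(year <= 1250)
levels(period)                         # "FALSE" "TRUE"

as.numeric(period)                     # 2 2 1 1
col.index <- as.numeric(period) + 1    # 3 3 2 2
col.index
```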

Let’s now consider the prose texts for which the approximate date of composition is unknown, labeled as NA in oldFrenchMeta$Year. Can anything be said about their date of composition? To address this issue, we first select the relevant texts and store them in a separate data frame.

> proseSup = oldFrench[oldFrenchMeta$Genre == "prose" &
+   is.na(oldFrenchMeta$Year),]

We add these additional data to the correspondence plot with corsup.fnc(), a function for adding so-called supplementary rows or supplementary columns.

> corsup.fnc(prose.ca, bycol = F, supp = proseSup, font = 2,
+   cex = 0.8, labels = substr(rownames(proseSup), 1, 4))

By default, corsup.fnc() proceeds on the assumption that we add supplementary columns. In the present example, we are dealing with supplementary rows, so we change the default by specifying bycol = F. The supplementary rows themselves are specified with supp = proseSup, and we label them with the manuscript identifiers provided by the row names, after stripping off the fragment numbers with substr(). Figure 5.7 locates the fragments more or less at the transition area of the early and late texts, perhaps with a slight bias towards the late texts. The advantage of not including the undated texts from the beginning in the correspondence analysis is that we establish a correspondence map on the basis of known data, against which we pit unknown supplementary data.
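How a supplementary row can be placed on an existing map is sketched below for invented counts. We use the standard transition formula of correspondence analysis, in which the profile of a new row is projected onto the axes defined by the active data. The helper project.rows() is hypothetical, and we assume, without having checked the source, that corsup.fnc() follows the same principle.

```r
# Active data: invented counts for 4 dated texts by 3 tags
active <- matrix(c(10,  5,  5,
                    8,  6,  6,
                    2,  9, 12,
                    3, 10, 11),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("text", 1:4), paste0("tag", 1:3)))

P  <- active / sum(active)
r  <- rowSums(P)
cm <- colSums(P)
dec <- svd((P - outer(r, cm)) / sqrt(outer(r, cm)))

k <- 1:2
# Axes of the map, expressed in terms of the columns (standard coordinates)
Gamma <- dec$v[, k] / sqrt(cm)
# Coordinates of the active rows
F.active <- (dec$u[, k] %*% diag(dec$d[k])) / sqrt(r)

# Hypothetical helper: project rows of counts onto the existing axes
project.rows <- function(supp) {
  profiles <- supp / rowSums(supp)   # only the profile matters, not the size
  profiles %*% Gamma
}

# An invented undated text, which played no role in building the map
undated <- matrix(c(6, 7, 8), nrow = 1,
                  dimnames = list("undated1", colnames(active)))
project.rows(undated)
```

Projecting the active rows in the same way returns their original coordinates, which shows that the supplementary points are placed on the map without changing it.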

Finally, consider a sociolinguistic data set, variationLijk, which provides the frequency counts in eight subcorpora of spoken Dutch for 32 words ending in the Dutch suffix -lijk [Keune et al., 2005]. The subcorpora are constructed with contrasts along three dimensions: country (Flanders versus the Netherlands), sex (male versus female), and education level (high versus mid). We load the data, and display the first four columns for the first five rows.

> variationLijk[1:5, 1:4]
            nlfemaleHigh nlfemaleMid nlmaleHigh nlmaleMid
afhankelijk            1           1          3         4
belachelijk            7           4          7         3
dadelijk               8          13          6        10
degelijk               1           1          1         1
duidelijk             11           6         14         8

The full set of column names

> colnames(variationLijk)
[1] "nlfemaleHigh" "nlfemaleMid"  "nlmaleHigh"   "nlmaleMid"
[5] "vlfemaleHigh" "vlfemaleMid"  "vlmaleHigh"   "vlmaleMid"

reflects the design of this data set, with nl representing the Netherlands, and vl representing Flanders. A chi-squared test shows that the words in -lijk are not uniformly distributed over the subcorpora.

> chisq.test(variationLijk)
...
X-squared = 575.3482, df = 217, p-value < 2.2e-16
...

Figure 5.7: Correspondence analysis of the frequencies of 35 tag trigrams in 125 Old French prose fragments. Text fragments are labeled by approximate date of origin; texts dating from 1250 or earlier are shown in light grey, texts located later in time are shown in dark grey. The texts in black represent supplementary rows representing texts of unknown date.

This chi-squared test is rather uninformative, however. We have lots and lots of data points, so it is unlikely a priori that the test will report a non-significant p-value. Furthermore, all that this test tells us is that the counts are not proportionally distributed in the table. The correspondence plot shown in Figure 5.8 is much more revealing,

> variationLijk.ca = corres.fnc(variationLijk)

> plot(variationLijk.ca)

The subcorpora from the Netherlands (labels beginning with nl) cluster at the left hand side of the plot, and those from Flanders (vl) cluster at the right hand side of the plot. Vriendelijk, ‘friendly’, emerges from this plot as characteristic for female speakers from Flanders with a medium education level.

Figure 5.8: Correspondence analysis of the frequencies of 32 words ending in the Dutch suffix -lijk in 8 subcorpora of spoken conversational Dutch.
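The earlier remark that a chi-squared test is bound to be significant when there are lots of data points can be illustrated with an invented table. In the sketch below, the second table has exactly the same proportions as the first, but ten times as many observations. The chi-squared statistic grows by the same factor of ten, and the p-value shrinks accordingly, although the pattern in the table is unchanged.

```r
# Invented 3 by 3 table of counts with only small departures from independence
small <- matrix(c(20, 25, 30,
                  22, 24, 28,
                  25, 25, 27),
                nrow = 3, byrow = TRUE)

# Same proportions, ten times the sample size
large <- 10 * small

test.small <- chisq.test(small)
test.large <- chisq.test(large)

test.small$statistic
test.large$statistic

# The statistic scales with the number of observations
test.large$statistic / test.small$statistic

# The p-value drops, even though the proportions are identical
c(small = test.small$p.value, large = test.large$p.value)
```

A correspondence map, by contrast, depends only on the proportions in the table, which is one reason why it is more informative here than the test.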