

5.2 Classification

5.2.2 Discriminant analysis

Discriminant analysis is used to predict an item’s class on the basis of a set of numerical predictors. As in principal components analysis, the idea is to represent the items in a low-dimensional space, typically a plane that can be inspected with the help of a scatterplot. Instead of principal components, the analysis produces LINEAR DISCRIMINANTS. In both PCA and discriminant analysis, the new axes are linear combinations of the original variables. But in discriminant analysis, the linear discriminants are chosen such that the means of the groups are as different as possible, while the variance around these means within the groups is as small as possible. We illustrate the use of discriminant analysis with a study in authorship attribution [Spassova, 2006].
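To make the contrast with PCA concrete before turning to the authorship data, here is a minimal illustrative sketch using R’s built-in iris data (not part of the Spassova study): lda() in the MASS package chooses linear combinations of the four flower measurements that maximally separate the three species.

```r
> library(MASS)
> iris.lda = lda(iris[ , 1:4], iris$Species)
> iris.lda$scaling   # weights defining the linear discriminants LD1 and LD2
> plot(iris.lda)     # the items in the plane spanned by LD1 and LD2
```

The scaling matrix plays the role that the rotation matrix plays in prcomp(): each column defines one new axis as a weighted combination of the original variables.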

Five texts from each of three Spanish writers were selected for analysis. Metadata on the texts are given in spanishMeta.

> spanishMeta = spanishMeta[order(spanishMeta$TextName),]

> spanishMeta

   Author YearOfBirth  TextName PubDate Nwords    FullName
1       C        1916 X14458gll    1983   2972        Cela
2       C        1916 X14459gll    1951   3040        Cela
3       C        1916 X14460gll    1956   3066        Cela
4       C        1916 X14461gll    1948   3044        Cela
5       C        1916 X14462gll    1942   3053        Cela
6       M        1943 X14463gll    1986   3013     Mendoza
7       M        1943 X14464gll    1992   3049     Mendoza
8       M        1943 X14465gll    1989   3042     Mendoza
9       M        1943 X14466gll    1982   3039     Mendoza
10      M        1943 X14467gll    2002   3045     Mendoza
11      V        1936 X14472gll    1965   3037 VargasLLosa
12      V        1936 X14473gll    1963   3067 VargasLLosa
13      V        1936 X14474gll    1977   3020 VargasLLosa
14      V        1936 X14475gll    1987   3016 VargasLLosa
15      V        1936 X14476gll    1981   3054 VargasLLosa

From each text, fragments of approximately 3000 words were extracted. These text fragments were tagged, and the relative frequencies of tag trigrams were obtained. These relative frequencies are available as the data set spanish; rows represent tag trigrams and columns represent text fragments.

> dim(spanish)
[1] 120  15

> spanish[1:5, 1:5]

          X14461gll X14473gll X14466gll X14459gll X14462gll
P.A.N4     0.027494  0.006757  0.000814  0.024116  0.009658
VDA.J6.N5  0.000786  0.010135  0.003257  0.001608  0.005268
C.P.N5     0.008641  0.001126  0.001629  0.003215  0.001756
P.A.N5     0.118617  0.118243  0.102606  0.131833  0.118525
A.N5.JQ    0.011783  0.006757  0.014658  0.008039  0.000878

As we are interested in differences and similarities between texts, we transpose this matrix, so that we can consider the texts to be points in tag space.

> spanish.t = t(spanish)

It is instructive to begin with an unsupervised exploration of these data, for instance with principal components analysis.

> spanish.pca = prcomp(spanish.t, center = T, scale = T)

> spanish.x = data.frame(spanish.pca$x)

> spanish.x = spanish.x[order(rownames(spanish.x)), ]

> library(lattice)

> super.sym = trellis.par.get("superpose.symbol")

> splom(~spanish.x[ , 1:3], groups = spanishMeta$Author,
+   panel = panel.superpose,
+   key = list(
+     title = " ",
+     text = list(levels(spanishMeta$FullName)),
+     points = list(pch = super.sym$pch[1:3],
+       col = super.sym$col[1:3])
+   )
+ )

Figure 5.17 suggests some authorial structure: Cela and Mendoza occupy different regions in the plane spanned by PC1 and PC2. VargasLLosa, however, seems to be indistinguishable from the other two authors.

Let’s now replace unsupervised clustering by supervised classification. We order the rows of spanish.t so that they are synchronized with the author information in spanishMeta, and load the MASS package in order to have access to the function for linear discriminant analysis, lda().

> spanish.t = spanish.t[order(rownames(spanish.t)),]

> library(MASS)

lda() takes two arguments, the matrix of numerical predictors and a vector with class labels. A first attempt comes with a warning about collinearity.


[Scatterplot matrix of PC1, PC2, and PC3, with separate plotting symbols for Cela, Mendoza, and VargasLLosa.]

Figure 5.17: Principal components analysis of 15 Spanish texts from 3 authors.


[The 15 texts plotted in the plane spanned by LD1 and LD2; the letters C, M, and V mark the texts by Cela, Mendoza, and VargasLLosa.]

Figure 5.18: Linear discriminant analysis of 15 Spanish texts by author.

> spanish.lda = lda(spanish.t, spanishMeta$Author)
Warning message:
variables are collinear in: lda.default(x, grouping, ...)

The columns in spanish.t are too correlated for lda() to work properly. We therefore continue our analysis with the first 8 principal components, which, as revealed by the summary (not shown) of the PCA object, capture almost 80% of the variance in the data. These principal components are, by definition, uncorrelated, so the warning message should disappear.
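The cumulative proportion of variance can be checked directly from the standard deviations stored in the PCA object, without printing the full summary:

```r
> cumsum(spanish.pca$sdev^2) / sum(spanish.pca$sdev^2)  # cumulative proportions
> summary(spanish.pca)$importance[3, 8]  # cumulative proportion at PC8
```

The third row of the importance matrix returned by summary.prcomp() holds these same cumulative proportions.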

> spanish.pca.lda = lda(spanish.x[ , 1:8], spanishMeta$Author)

> plot(spanish.pca.lda)

Figure 5.18 shows a clear separation of the texts by author. We can query the model for the probability with which it assigns texts to authors with predict(), supplied with the model object as first argument and the input data as second argument. A table with the desired probabilities is available under the name posterior, which we round to four decimal digits for ease of interpretation.

> round(predict(spanish.pca.lda, spanish.x[ , 1:8])$posterior, 4)

               C      M      V
X14458gll 1.0000 0.0000 0.0000
X14459gll 1.0000 0.0000 0.0000
X14460gll 1.0000 0.0000 0.0000
X14461gll 1.0000 0.0000 0.0000
X14462gll 0.9999 0.0000 0.0001
X14463gll 0.0000 0.9988 0.0012
X14464gll 0.0000 1.0000 0.0000
X14465gll 0.0000 0.9965 0.0035
X14466gll 0.0000 0.9992 0.0008
X14467gll 0.0000 0.8416 0.1584
X14472gll 0.0000 0.0001 0.9998
X14473gll 0.0000 0.0000 1.0000
X14474gll 0.0000 0.0014 0.9986
X14475gll 0.0000 0.0150 0.9850
X14476gll 0.0001 0.0112 0.9887

It is clear that each text is assigned to its own author with a very high probability.
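The corresponding hard classifications can be cross-tabulated against the actual authors; given the posterior probabilities above, all fifteen texts fall on the diagonal of this confusion table.

```r
> table(spanishMeta$Author,
+   predict(spanish.pca.lda, spanish.x[ , 1:8])$class)
```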

Unfortunately, this table is rather misleading, because the model seriously overfits the data. It has done its utmost to find a representation of the data that separates the groups as well as possible. This is fine as a solution for this particular sample of texts, but it does not guarantee that prediction will be accurate for unseen text fragments as well.

That a problem is lurking in the background is indicated by the group means, as provided by a summary of the discriminant object, abbreviated here for convenience.

> spanish.pca.lda
...

Group means:

        PC1        PC2        PC3         PC4        PC5
C -4.820024 -2.7560056  1.3985890 -0.94026140  0.2141179
M  3.801425  2.9890677  0.6494555 -0.01748498  0.4472681
V  1.018598 -0.2330621 -2.0480445  0.95774638 -0.6613860
          PC6        PC7         PC8
C -0.02702131 -0.5425466  0.86906543
M  1.75549883 -0.6416654  0.09646039
V -1.72847752  1.1842120 -0.96552582
...

There are differences among these group means, but they are not that large, and we may wonder whether any are actually significant. A statistical test appropriate for answering this question is a MULTIVARIATE ANALYSIS OF VARIANCE, available in R as the function manova(). It considers a group of numerical vectors as the dependent variable, and takes one or more factors as predictors. We use it to ascertain whether the group means of the dependent variables differ significantly. (Running a series of separate one-way analyses of variance, one for each PC, would run into the same problem of inflated p-values as discussed in Chapter 4 for a series of t-tests where a one-way analysis of variance is appropriate.)

> spanish.manova =
+   manova(cbind(PC1, PC2, PC3, PC4, PC5, PC6, PC7, PC8) ~ Author,
+     data = spanish.x)

There are several methods for evaluating the output of manova(). We use R’s default, the Pillai-Bartlett statistic, which approximately follows an F-distribution.

> summary(spanish.manova)
          Df Pillai approx F num Df den Df  Pr(>F)
Author     2 1.6283   3.2854     16     12 0.02134
Residuals 12
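summary.manova() also supports the other common multivariate test statistics; Wilks’ lambda, for instance, can be requested instead of the default Pillai-Bartlett statistic:

```r
> summary(spanish.manova, test = "Wilks")
```

For well-behaved data the different statistics usually lead to the same qualitative conclusion.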

The p-value is sufficiently small to suggest that there are indeed significant differences among the group means. On the other hand, the evidence for such differences is not overwhelming, and certainly not strong enough to inspire confidence in the perfect classification by author obtained with lda().

In order to gauge the extent to which our results might generalize, we carry out a leave-one-out cross-validation. We run 15 different discriminant analyses, each of which is trained on 14 texts and used to predict the author of the remaining held-out text.

The proportion of correct attributions will give us improved insight into how well the model would perform when confronted with new texts by one of these three authors. Although lda() has an option for carrying out leave-one-out cross-validation (CV = TRUE), we cannot use this option here, because the orthogonalization of our input (resulting in spanish.x) takes the data from all authors and all texts into account. We therefore implement cross-validation ourselves, and begin with making sure that the texts in spanish.t and spanishMeta are in sync. We then set the number of PCs to be considered to 8 and define a vector with 15 empty strings to store the predicted authors.

> spanish.t = spanish.t[order(rownames(spanish.t)), ]

> n = 8

> predictedClasses = rep("", 15)

Next, we loop over the 15 texts. In each pass through the loop, we create a training data set and a vector with the corresponding information on the author by omitting the i-th text. Following orthogonalization, we make sure that the texts remain in sync with the vector of authors, and then apply lda(). Finally, we obtain the predicted authors for the full data set on the basis of the model for the training data, but select only the i-th element and store it in the i-th cell of predictedClasses.

> for (i in 1:15) {
+   training = spanish.t[-i, ]
+   trainingAuthor = spanishMeta[-i, ]$Author
+   training.pca = prcomp(training, center = T, scale = T)
+   training.x = data.frame(training.pca$x)
+   training.x = training.x[order(rownames(training.x)), ]
+   training.pca.lda = lda(training.x[ , 1:n], trainingAuthor)
+   # project all 15 texts into the training PCA space, then classify
+   predictedClasses[i] = as.character(
+     predict(training.pca.lda,
+       data.frame(predict(training.pca, spanish.t))[ , 1:n])$class[i])
+ }


Finally, we compare the observed and predicted authors.

> data.frame(obs = as.character(spanishMeta$Author),
+   pred = predictedClasses)

   obs pred
1    C    V
2    C    C
3    C    C
4    C    C
5    C    V
6    M    M
7    M    M
8    M    M
9    M    M
10   M    V
11   V    M
12   V    V
13   V    V
14   V    M
15   V    M

The number of correct attributions is

> sum(predictedClasses == as.character(spanishMeta$Author))
[1] 9

which reaches significance according to a binomial test: the likelihood of observing 9 or more successes in 15 trials, each with a 1/3 probability of success by chance, is 0.03.

> sum(dbinom(9:15, 15, 1/3))
[1] 0.03082792
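The same one-sided probability is reported by binom.test(), which computes exactly this tail sum and additionally provides a confidence interval for the proportion of correct attributions:

```r
> binom.test(9, 15, p = 1/3, alternative = "greater")
```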

We conclude that there is significant authorial structure, albeit not as crisp and clear as Figure 5.18 suggested at first. We may therefore expect our discriminant model to achieve some success at predicting the authorial hand of unseen texts from one of these three authors.