
5.2.3 Support vector machines

Support vector machines are a relatively recent development in classification, and their performance is often excellent. A support vector machine for a binary classification problem tries to find a hyperplane in multidimensional space such that ideally all elements of a given class are on one side of that hyperplane, and all the other elements are on the other side. Furthermore, it allocates a margin around that hyperplane, and the points that lie exactly the margin distance away from the hyperplane are called its support vectors. In other words, whereas discriminant analysis tries to separate groups by focusing on the group means, support vector machines target the border area where the groups meet, and seek to set up a boundary there.
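As a toy illustration of these concepts (a sketch that is not part of the analysis below: it uses R's built-in iris data, restricted to two well-separated species, rather than our texts):

> library(e1071)
> iris2 = droplevels(subset(iris, Species != "virginica"))
> iris.svm = svm(Species ~ Petal.Length + Petal.Width, data = iris2,
+   kernel = "linear")
> iris.svm$index   # row numbers of the points on or inside the margin

The index component lists exactly those observations that serve as support vectors and hence determine the position of the separating hyperplane.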

Let’s re-examine the Old French texts studied previously with the help of correspondence analysis. Instead of clustering (unsupervised), we now apply classification (supervised) with the svm() function from the e1071 package.

> library(e1071)

Correspondence analysis revealed a clear difference in the use of tag trigrams across prose and poetry. We give svm() the reverse task of determining how much support our a priori classification into prose versus poetry receives from the use of tag trigrams across our texts. The first argument that we supply to svm() is the data frame with counts; the second argument is the vector specifying the genre for each row in the data frame.

> genre.svm = svm(oldFrench, oldFrenchMeta$Genre)

Typing the object name at the prompt results in a brief summary of the parameters used for the classification (many possibilities are offered; we have simply used the defaults) and of the number of support vectors.

> genre.svm

Call:
svm.default(x = oldFrench, y = oldFrenchMeta$Genre)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.02857143

Number of Support Vectors:  158
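The defaults are not necessarily optimal. The e1071 package also provides tune.svm() for selecting cost and gamma by means of cross-validation; the following sketch uses a grid of candidate values of our own choosing, and is not part of the analysis proper.

> genre.tune = tune.svm(oldFrench, oldFrenchMeta$Genre,
+   gamma = c(0.01, 0.03, 0.1), cost = c(0.5, 1, 10))
> summary(genre.tune)   # reports the best parameter combination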

There is no straightforward way to visualize the classification. Some intuition about the support vectors can be gleaned by means of multidimensional scaling, with a special plot symbol for the observations that are chosen as support vectors; in Figure 5.19 these are shown as plus symbols. Note that the plus symbols are especially dense in the border area between the two (color-coded) genres.

> plot(cmdscale(dist(oldFrench)),
+   col = c("blue", "red")[as.integer(oldFrenchMeta$Genre)],
+   pch = c("o", "+")[1:nrow(oldFrenchMeta) %in% genre.svm$index + 1])

The second and third lines of this plot command illustrate a feature of subscripting that has not yet been explained, namely, that a vector can be subscripted with an index vector that is longer than the vector itself, as long as all indices refer to legitimate elements of the vector.

> c("blue", "red")[c(1, 2, 1, 2, 2, 1)]

[1] "blue" "red" "blue" "red" "red" "blue"


Figure 5.19: Multidimensional scaling for registers in Old French on the basis of tag trigram frequencies, with support vectors highlighted by the plus symbol. Black points represent poetry, grey points represent prose.

In the second line of the plot command, as.integer(oldFrenchMeta$Genre) is a vector with ones and twos, corresponding to the levels poetry and prose. This vector is mapped onto a vector with blue representing poetry and red representing prose.

The same mechanism is at work for the third line. The vector between the square brackets is dissected as follows. The index extracted from the model object

> genre.svm$index

[1] 2 3 6 13 14 15 16 17

refers to the row numbers in oldFrench of the support vectors. The vector

1:nrow(oldFrenchMeta)


is the vector of all row numbers. The %in% operator checks for set membership. The result is a vector that is TRUE for the support vectors and FALSE for all other rows. When 1 is added to this vector, TRUE is first converted to 1 and FALSE to 0, so the result is a vector of ones and twos, which are in turn mapped onto the o and + symbols.
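A toy example with made-up indices illustrates the idiom:

> 1:6 %in% c(2, 5)
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE
> 1:6 %in% c(2, 5) + 1
[1] 1 2 1 1 2 1
> c("o", "+")[1:6 %in% c(2, 5) + 1]
[1] "o" "+" "o" "o" "+" "o"

Note that %in% binds more tightly than +, so no additional parentheses are required.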

A comparison of the predicted classes with the actual classes shows that only a single text is misclassified.

> xtabs( ~ oldFrenchMeta$Genre + predict(genre.svm))
                   predict(genre.svm)
oldFrenchMeta$Genre poetry prose
             poetry    198     0
             prose       1   143

However, the model might be overfitting the data, so we carry out 10-fold cross-validation by running svm() with the option cross (by default 0) set to 10.

> genre.svm = svm(oldFrench, oldFrenchMeta$Genre, cross = 10)

The summary now specifies the average accuracy as well as the accuracy in each separate cross-validation run.

> summary(genre.svm)
...
10-fold cross-validation on training data:

Total Accuracy: 96.78363
Single Accuracies:
97.05882 97.05882 97.05882 94.11765 97.14286 97.05882 97.05882 97.05882 100 94.28571

An average success rate of 0.97 (so roughly 11 misclassifications out of 342 texts) shows that genre is indeed very well predictable from the authors’ syntactic habits.
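As a quick check (not part of the original analysis), the error count can be recovered from the model object, which stores the cross-validated accuracy in the component tot.accuracy:

> (1 - genre.svm$tot.accuracy / 100) * nrow(oldFrench)   # roughly 11 texts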

Classification by Region, by contrast, poses a more serious challenge.

> region.svm = svm(oldFrench, oldFrenchMeta$Region, cross = 10)
> xtab = xtabs( ~ oldFrenchMeta$Region + predict(region.svm))
> xtab

To calculate the sum of the correct classifications, we extract the diagonal elements

> diag(xtab)
 R1  R2  R3
 86 152  46

take their sum and divide by the total number of observations.


> sum(diag(xtab))/sum(xtab)
[1] 0.8304094

Unfortunately, this success rate is severely inflated due to overfitting, as shown by 10-fold cross-validation.

> summary(region.svm)
...
10-fold cross-validation on training data:

Total Accuracy: 61.9883
Single Accuracies:
64.70588 67.64706 67.64706 50 57.14286 64.70588 44.11765 70.58824 73.52941 60

However, a success rate of 62% still compares favorably with a baseline classifier that always assigns the majority class, R2.

> max(xtabs( ~ oldFrenchMeta$Region))/nrow(oldFrench)
[1] 0.4473684

This success rate differs significantly from the cross-validated success rate. To see this, we bring together the number of successes and failures for both classifiers into a contingency table (out of 342 texts, the baseline classifies 0.447 × 342 ≈ 153 correctly, the cross-validated support vector machine 0.620 × 342 ≈ 212),

> cbind(c(153, 342-153), c(212, 342-212))
     [,1] [,2]
[1,]  153  212
[2,]  189  130

and apply a chi-squared test:

> chisq.test(cbind(c(153, 342-153), c(212, 342-212)))

        Pearson’s Chi-squared test with Yates’ continuity correction

data:  cbind(c(153, 342 - 153), c(212, 342 - 212))
X-squared = 19.7619, df = 1, p-value = 8.771e-06

An alternative test that produces the same low p-value is the proportions test:

> prop.test(c(153, 212), c(342, 342))
...
data:  c(153, 212) out of rep(342, 2)
X-squared = 19.7619, df = 1, p-value = 8.771e-06
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.2490838 -0.0959454
sample estimates:
   prop 1    prop 2
0.4473684 0.6198830


In summary, support vector machines are excellent classifiers and probably one’s best choice if the goal is to achieve optimal classification performance for an application. Their disadvantage is that they are difficult to interpret and provide little insight into what factors drive the classification.

5.3 Exercises

1. Burrows [1992], in a study using principal components analysis of English authorial hands, observed that one of his principal components represented time. Burrows’ study was based on a careful selection of texts from the same register (novels written in the first person singular). Explore whether time is a latent variable for productivity for the subset of literary texts (labeled with L in the column Registers), using the year of birth as specified in the last column of the data frame (Birth).

Run a principal components analysis using the correlation matrix. Make sure to exclude the last three columns from the data frame before running prcomp(). Then use pairscor.fnc() (available if you have attached the languageR package), which, like pairs(), creates a scatterplot matrix. Unlike pairs(), it lists correlations in the lower triangle of the matrix. Use the output of pairscor.fnc() to determine whether there is a principal component that represents time. Finally, use a biplot to investigate which affixes were used most productively by the early authors and which by the late authors.

2. Consider the lexical measures for English monosyllabic monomorphemic words in the data set lexicalMeasures. Calculate the correlation matrix (excluding the first column, which lists the words) using the Spearman correlation. Square the correlation matrix, and use multidimensional scaling to study whether the measures CelS, NsyC, NsyS, Vf, Dent, Ient, NVratio and Fdif form a cluster.

3. Ernestus and Baayen [2003] studied whether it is predictable whether a stem-final obstruent in Dutch alternates with respect to its voice specification. The data set finalDevoicing is a data frame with 1697 monomorphemic Dutch words, together with the properties of their onsets, vowels, codas, etc. The dependent variable is Voice, which specifies whether the final obstruent is voiced instead of voiceless when it is syllable-initial (as, for instance, in the plural of muis ’mouse’: mui-zen ’mice’).

Use a classification tree to trace the probabilistic grammar underlying voice alternation in Dutch. Calculate the classification accuracy, and compare it with a baseline model that always selects voiceless. Details on the factors and their levels are available in the description of the data set — type ?finalDevoicing to the R prompt.

4. The data set spanishFunctionWords provides the relative frequencies of the most common function words in the Spanish texts studied above by means of the frequencies of tag trigrams. Analyze this data set with linear discriminant analysis with cross-validation. As in the analysis of tag trigrams, first orthogonalize the data with principal components analysis. Which measure is the better predictor for authorship attribution: tag trigram frequency or function word frequency?

5. The data set regularity specifies for 700 Dutch verbs whether they are regular or irregular, along with numeric predictors such as frequency and family size, and a categorical predictor, the auxiliary selected by the verb for the past perfect. Investigate whether a verb’s regularity is predictable from these variables using support vector machines. After loading the data, we convert the factor Auxiliary into a numeric predictor, as support vector machines cannot handle factors.

> regularity$AuxNum = as.numeric(regularity$Auxiliary)

Exclude columns 1, 8 and 10 (the columns labeling the verbs, their regularity, and the auxiliary) from the data frame when supplying it as the first argument to svm(). Use 10-fold cross-validation and formally test whether the cross-validated accuracy is superior to that of the baseline model that always selects the majority class, regular.
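A possible shape for the call is sketched below; the name Regularity for the dependent variable is an assumption to be checked against the documentation of the data set.

> regularity.svm = svm(regularity[ , -c(1, 8, 10)],
+   regularity$Regularity, cross = 10)   # Regularity: assumed column name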


Chapter 6

Regression modeling

Section 4.3 introduced the basics of linear regression and analysis of covariance. This chapter begins with a recapitulation of the central concepts and ideas introduced in Chapter 4. It then broadens the horizon on linear regression in several ways. Section 6.2 discusses multiple linear regression and various analytical strategies for dealing with multiple predictors simultaneously. Section 6.3 introduces the GENERALIZED LINEAR MODEL, which extends the linear modeling approach to binary dependent variables (successes versus failures, correct versus incorrect responses, NP or PP realizations of the dative, etc.) and to factors with ordered levels (e.g., low, mid and high education level). (The VARBRUL program used widely in sociolinguistics implements the generalized linear model for binary variables.) Finally, section 6.4 outlines a method for dealing with breakpoints, and section 6.5 discusses the special care required for dealing with word frequency distributions.

6.1 Introduction

Consider again the ratings data set that we studied in Chapter 4. We are interested in whether the rated size (averaged over subjects) of the referents of 81 English nouns can be predicted from the subjective estimates of these words’ familiarity and from the class of their referents (plant versus animal). We begin with fitting an analysis of covariance model with meanFamiliarity as a nonlinear numeric predictor and Class as a factorial predictor. The SIMPLE MAIN EFFECTS, i.e., main effects that are not involved in any interactions, are separated by plus symbols in the formula for lm().

> ratings.lm = lm(meanSizeRating ~ meanFamiliarity +
+   I(meanFamiliarity^2) + Class, data = ratings)
> summary(ratings.lm)
...
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            4.09872    0.53870   7.609 5.75e-11
meanFamiliarity       -0.38880    0.27983  -1.389   0.1687
I(meanFamiliarity^2)   0.07056    0.03423   2.061   0.0427
Classplant            -1.89252    0.08788 -21.536  < 2e-16

This model has four coefficients: a coefficient for the intercept, coefficients for the linear and quadratic terms of meanFamiliarity, and a coefficient for the contrast between the levels of the factor Class: the group mean for the subset of plants is 1.89 units lower than that for the animals, the reference level mapped onto the intercept. Although we want our model to be as simple as possible, we leave the non-significant coefficient for the linear effect of meanFamiliarity in the model, for technical reasons, given that the quadratic term is significant.
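As a quick sanity check of how the four coefficients combine into a fitted value (a sketch; the familiarity value 4 is arbitrary and not taken from the text), the prediction for a plant with meanFamiliarity equal to 4 is 4.09872 − 0.38880 · 4 + 0.07056 · 16 − 1.89252 ≈ 1.78:

> f = 4   # an arbitrary value of meanFamiliarity
> sum(coef(ratings.lm) * c(1, f, f^2, 1))   # prediction by hand
> predict(ratings.lm,
+   newdata = data.frame(meanFamiliarity = f, Class = "plant"))

Both commands should return the same value.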

The model that we ended up with in Chapter 4 was more complex, in that it contained an INTERACTION term for Class by meanFamiliarity:

> ratings.lm = lm(meanSizeRating ~ meanFamiliarity * Class +
+   I(meanFamiliarity^2), data = ratings)
> summary(ratings.lm)
...
Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                  4.42894    0.54787   8.084  7.6e-12
meanFamiliarity             -0.63131    0.29540  -2.137  0.03580
I(meanFamiliarity^2)         0.10971    0.03801   2.886  0.00508
Classplant                  -1.01248    0.41530  -2.438  0.01711
meanFamiliarity:Classplant  -0.21179    0.09779  -2.166  0.03346

This model has three main effects and one interaction. The interpretation of the main effect of Class, which is no longer a simple main effect because of the presence of an interaction in which it is involved, is not as straightforward as in the previous model. In that model, the effect of Class is very similar to the difference in the group means for animals and plants. (It is not identical to this difference because meanFamiliarity is also in the model.) In the new model with the interaction, everything is recalibrated, and the main effect by itself is no longer very informative. In fact, a main effect need not be significant as long as it is involved in interactions that are significant, in which case it normally has to be retained in the model.
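This recalibration can be made concrete with a sketch (output not shown): the raw difference in group means is close to the Class coefficient of the additive model (−1.89), but quite different from the Classplant coefficient of the interaction model (−1.01).

> with(ratings, tapply(meanSizeRating, Class, mean))   # raw group means
> coef(ratings.lm)["Classplant"]   # contrast in the interaction model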

Thus far, we have inspected this model with summary(), which tells us whether the coefficients are significantly different from zero. There is another way to look at these data, using anova():

> anova(ratings.lm)
Analysis of Variance Table

Response: meanSizeRating
                      Df Sum Sq Mean Sq  F value    Pr(>F)
meanFamiliarity        1  3.599   3.599  30.6945 4.162e-07
Class                  1 60.993  60.993 520.2307 < 2.2e-16
I(meanFamiliarity^2)   1  0.522   0.522   4.4520   0.03815
meanFamiliarity:Class  1  0.550   0.550   4.6907   0.03346
Residuals             76  8.910   0.117


This summary tells us, by means of F-tests, whether a predictor contributes significantly to explaining the variance in the dependent variable. It does so in a sequential way, by ascertaining whether a predictor further down the list has anything to contribute over and above the predictors higher up in the list. Hence the output of anova() for a model fit with lm() is referred to as a SEQUENTIAL ANALYSIS OF VARIANCE TABLE. A sequential ANOVA table answers different questions than the summary() function. To see why, we fit a series of separate models, each with one additional predictor.

> ratings.lm1 = lm(meanSizeRating ~ meanFamiliarity, ratings)
> ratings.lm2 = lm(meanSizeRating ~ meanFamiliarity + Class, ratings)
> ratings.lm3 = lm(meanSizeRating ~ meanFamiliarity + Class +
+   I(meanFamiliarity^2), ratings)
> ratings.lm4 = lm(meanSizeRating ~ meanFamiliarity * Class +
+   I(meanFamiliarity^2), ratings)

We compare the first and the second model to test whether Class is predictive given that meanFamiliarity is in the model. In the same way, we compare the second and the third model to ascertain whether we need the quadratic term, and the third and the fourth model to verify that we need the interaction. We carry out all these comparisons simultaneously with

> anova(ratings.lm1, ratings.lm2, ratings.lm3, ratings.lm4)
Analysis of Variance Table

Model 1: meanSizeRating ~ meanFamiliarity
Model 2: meanSizeRating ~ meanFamiliarity + Class
Model 3: meanSizeRating ~ meanFamiliarity + Class + I(meanFamiliarity^2)
Model 4: meanSizeRating ~ meanFamiliarity * Class + I(meanFamiliarity^2)
  Res.Df    RSS Df Sum of Sq        F  Pr(>F)
1     79 70.975
2     78  9.982  1    60.993 520.2307 < 2e-16
3     77  9.460  1     0.522   4.4520 0.03815
4     76  8.910  1     0.550   4.6907 0.03346

and obtain the same results as produced with anova(ratings.lm). Each successive row in a sequential ANOVA table evaluates whether adding a new predictor is justified given the other predictors in the preceding rows. By contrast, the summary() function evaluates whether the coefficients are significantly different from zero in a model containing all other predictors. This is a different question, which often results in different p-values.
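One consequence of sequentiality is that the order in which predictors enter the formula matters: when predictors are correlated, their shared variance is credited to whichever term comes first. A sketch (output not shown):

> anova(lm(meanSizeRating ~ Class + meanFamiliarity, data = ratings))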

An interaction of Class by the quadratic term for meanFamiliarity turns out not to be necessary.

> ratings.lm5 = lm(meanSizeRating ~ meanFamiliarity * Class +
+   I(meanFamiliarity^2) * Class, data = ratings)
> anova(ratings.lm5)
Analysis of Variance Table

Response: meanSizeRating
                           Df Sum Sq Mean Sq  F value    Pr(>F)
meanFamiliarity             1  3.599   3.599  30.7934 4.128e-07
Class                       1 60.993  60.993 521.9068 < 2.2e-16
I(meanFamiliarity^2)        1  0.522   0.522   4.4663   0.03790
meanFamiliarity:Class       1  0.550   0.550   4.7058   0.03323
Class:I(meanFamiliarity^2)  1  0.145   0.145   1.2449   0.26810
Residuals                  75  8.765   0.117

With a minimal change in the specification of the model, the replacement of the second asterisk in the model formula by a colon, we obtain a very different result:

> ratings.lm6 = lm(meanSizeRating ~ meanFamiliarity * Class +
+   I(meanFamiliarity^2) : Class, data = ratings)
> anova(ratings.lm6)
Analysis of Variance Table

Response: meanSizeRating
                           Df Sum Sq Mean Sq  F value    Pr(>F)
meanFamiliarity             1  3.599   3.599  30.7934 4.128e-07
Class                       1 60.993  60.993 521.9068 < 2.2e-16
meanFamiliarity:Class       1  0.095   0.095   0.8166   0.36906
Class:I(meanFamiliarity^2)  2  1.122   0.561   4.8002   0.01092
Residuals                  75  8.765   0.117

It would now seem as if the interaction is significant after all. In order to understand what is going on, we inspect the table of coefficients.

> summary(ratings.lm6)
...
Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                        4.16838    0.59476   7.008 8.95e-10
meanFamiliarity                   -0.48424    0.32304  -1.499   0.1381
Classplant                         1.02187    1.86988   0.546   0.5864
meanFamiliarity:Classplant        -1.18747    0.87990  -1.350   0.1812
Classanimal:I(meanFamiliarity^2)   0.09049    0.04168   2.171   0.0331
Classplant:I(meanFamiliarity^2)    0.20304    0.09186   2.210   0.0301

Note that the coefficients for meanFamiliarity, Classplant and their interaction are no longer significant. This may happen when a complex interaction is added to a model. The last two lines show that we now have two quadratic coefficients, one for the animals (0.09) and one for the plants (0.20). This is what we asked for when we specified the interaction I(meanFamiliarity^2):Class without including a main effect for the quadratic term in the formula for ratings.lm6. The question, however, is whether we need these two coefficients. At first glance, the two coefficients look fairly different, but the standard error of the second coefficient is quite large, 0.09. A quick and dirty estimate of the confidence interval for the second coefficient is 0.20 ± 2 × 0.09, which includes the value of the first coefficient. Clearly, these two coefficients are not significantly different. This is why the anova() and summary() functions reported


a non-significant effect for model ratings.lm5. What we are asking with the formula of ratings.lm6 is whether the individual coefficients of the quadratic terms of meanFamiliarity for the plants and the animals are different from zero. This they are.

We are not asking whether we need two different coefficients. This we do not. What this example shows is that the main effect of a term in the model, here the quadratic term for meanFamiliarity, should be specified explicitly in the model when the question of interest is whether an interaction term is justified.
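With the main effect in place, the question whether two separate quadratic coefficients are needed reduces to a comparison of the nested models ratings.lm4 (one quadratic coefficient) and ratings.lm5 (one per class):

> anova(ratings.lm4, ratings.lm5)

This reproduces the non-significant F-value of 1.2449 reported above for Class:I(meanFamiliarity^2) in the anova table of ratings.lm5.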

The conventions governing the specification of main effects and interactions in the formula of a model are both straightforward and flexible. It is often convenient not to have to spell out all interactions for models with many predictors. The following overview shows how combinations of predictors and their interactions can be specified using parentheses, the plus and minus symbols, and the ^ operator. With ^2, for instance, we denote that all pairwise interactions of the predictors enclosed within parentheses should be included in the model.

a + b + c
a + b + c + a:b                         or  a * b + c
a + b + c + a:b + a:c + b:c             or  (a + b + c)^2
a + b + c + a:b + a:c + b:c + a:b:c     or  (a + b + c)^3
a + b + c + a:b + a:c                   or  (a + b + c)^2 - b:c

Thus, the formula for ratings.lm5, for instance, can be simplified to

meanSizeRating ~ (meanFamiliarity + I(meanFamiliarity^2)) * Class
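One way to convince oneself that two formulas specify the same model is to compare their expanded term labels (a sketch; output not shown):

> f1 = ~ meanFamiliarity * Class + I(meanFamiliarity^2) * Class
> f2 = ~ (meanFamiliarity + I(meanFamiliarity^2)) * Class
> attr(terms(f1), "term.labels")
> attr(terms(f2), "term.labels")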