
The proportions of bootstrap trees that support the subtrees of our original tree are obtained with the help of prop.clades().

> props = prop.clades(papuan.dist.tr, btr)/B

> props

 [1] 1.000 0.600 0.865 0.050 0.100 0.115 0.200 0.315 0.555 0.680 0.625
[12] 0.445 0.920

We plot the original tree

> plot(papuan.dist.tr, type = "u", font = as.numeric(as.factor(fonts)))

and add the thermometers with nodelabels().

> nodelabels(thermo = props, piecol = c("black", "grey"))

The proportion of bootstrap support decreases as one moves to the center of the graph.

This points to a lack of consensus with respect to how the subtrees should be linked. A different way of bringing this uncertainty out into the open is to plot a CONSENSUS TREE. In a consensus tree, subgroups that are not observed in all bootstrap trees (strict consensus) or in a majority of all bootstrap trees (majority-rule consensus) will be collapsed.

The result is a tree with multichotomies. The lower left tree of Figure 5.13 shows such a multichotomy in the center, where 8 branches come together. The ape package provides the function consensus() for constructing a consensus tree for a list of trees, given a proportion p specifying the required level of consensus.

> btr.consensus = consensus(btr, p = 0.5)

Consensus trees come with a plot method, and can be visualized straightforwardly with plot(). Some extra steps are required to plot the tree with fonts representing geographical areas.

> x = btr.consensus$tip.label

> x

 [1] "Anem"       "Ata"        "Bilua"      "Buin"       "Nasioi"
 [6] "Motuna"     "Kol"        "Sulka"      "Mali"       "Kuot"
[11] "Lavukaleve" "Rotokas"    "Savosavo"   "Touo"       "Yeli_Dnye"

> x = data.frame(Language = x, Node = 1:length(x))

> x = merge(x, papuan.meta, by.x = "Language", by.y = "Language")

> head(x)

    Language Node Family            Geography
1       Anem    1 Papuan Bismarck Archipelago
2        Ata    2 Papuan Bismarck Archipelago
3      Bilua    3 Papuan     Central Solomons
4       Buin    4 Papuan         Bougainville
5        Kol    7 Papuan Bismarck Archipelago
6       Kuot   10 Papuan Bismarck Archipelago

> x = x[order(x$Node),]

> x$Geography = as.factor(x$Geography)

> plot(btr.consensus, type = "u", font = as.numeric(x$Geography))

The consensus tree shows that the grouping of Bilua, Kuot, Lavukaleve, Rotokas, and Yeli Dnye is inconsistent across bootstrap runs. We should at the same time keep in mind that a given bootstrap run makes use of roughly 80 of the 125 available grammatical traits. A loss of about a third of the available grammatical markers may have had severe adverse consequences for the goodness of the clustering. Therefore, replication studies with a larger set of languages and an even broader range of grammatical traits may well support the interesting similarity in geographical and grammatical topology indicated by the original tree constructed with all 125 traits currently available.
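The figure of roughly 80 traits is a general property of bootstrap resampling: a sample of size n drawn with replacement contains on average n(1 - (1 - 1/n)^n), about 0.632n, distinct elements. This is easily checked in R:

```r
# Expected number of distinct traits in a bootstrap resample of the
# 125 grammatical traits: n * (1 - (1 - 1/n)^n), which approaches
# n * (1 - exp(-1)), about 0.632 * n, as n grows.
n <- 125
n * (1 - (1 - 1/n)^n)   # approximately 79
```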

5.2 Classification

In the previous section, we have been concerned with discerning clusters and groupings for data points described by the rows of numerical matrices. When we visualized data, we often used color coding or changes in font size to distinguish subsets of data points.

But information on these subsets was never used in the calculations; we only added it to our plots afterwards. In this section, we change our perspective from CLUSTERING to CLASSIFICATION, and take information on subsets (classes) of data points as our point of departure. Our aim is now to ascertain whether the class of a data point can be predicted.

5.2.1 Classification trees

In Chapters 1 and 2 we started exploring data on the dative alternation in English [Bresnan et al., 2007]. The dependent variable in this study is a factor with levels n (the dative is realized as an NP, as in John gave Mary the book) and p (the dative is realized as a PP, as in John gave the book to Mary). For 3263 verb tokens in corpora of written and spoken English, the values of a total of 12 variables were determined, in addition to the realization of the dative, coded as RealizationOfRecipient in the data set dative.

> colnames(dative)


 [1] "Speaker"                "Modality"
 [3] "Verb"                   "SemanticClass"
 [5] "LengthOfRecipient"      "AnimacyOfRec"
 [7] "DefinOfRec"             "PronomOfRec"
 [9] "LengthOfTheme"          "AnimacyOfTheme"
[11] "DefinOfTheme"           "PronomOfTheme"
[13] "RealizationOfRecipient" "AccessOfRec"
[15] "AccessOfTheme"

Short descriptions of these variables are available with ?dative. The question that we address here is whether the realization of the recipient as NP or PP can be predicted from the other variables. The technique that we introduce here is CART analysis, an acronym for Classification And Regression Trees. This section restricts itself to discussing classification trees. (When the dependent variable is not a factor but a numerical variable, the same principles apply and the result is a regression tree.)

An initial classification tree for the dative alternation is shown in Figure 5.14. The tree outlines a decision procedure for determining the realization of the recipient as NP or PP. Each split in the tree is labeled with a decision rule. The decision rule at the root, the top node of the tree, asks whether or not the factor AccessOfRec has the level given. If so, we follow the left branch; otherwise, we follow the right branch. At each next branch a new decision rule is considered that directs us to a new branch in its subtree. This process is repeated until a leaf node, a node with no further splits, is reached. For a data point for which the accessibility of the recipient is given, for which the accessibility of the theme is given, and for which the pronominality of the theme is nonpronominal, we go left, right, and left, at which point we reach a leaf node for which the predicted outcome is NP. This outcome is supported by 119 observations and contradicted by only 17.

The leaf nodes of the tree specify a partition of the data, i.e., a division of the data set into a series of non-overlapping subsets that jointly comprise the full data set. Hence, CART analysis is often referred to as RECURSIVE PARTITIONING. For any node, the algorithm for growing a tree inspects all predictors and selects the one that is most useful.

The algorithm begins with the root node, which represents the full data set, and creates two subsets. For each of these subsets, it creates two new subsets, for which in turn new subsets are created, and so on. Without a stopping criterion, the tree would keep growing until its leaves contained single observations only. Such leaves would be pure, in the sense that only one level of the dependent variable would be represented at any leaf node. But such leaf nodes would also be trivially pure, and would not allow generalization: The tree would severely overfit the data. Therefore, the tree growing algorithm stops when there are too few observations at a node, by default 20. In addition, the tree growing algorithm refuses to implement useless splits. For a split to be useful, the daughter nodes should be purer than the mother node, in the sense that the ratio of NP to PP realizations in the daughter nodes should be more extreme (i.e., closer to 1 or to 0) than in the mother node. How exactly NODE IMPURITY is assessed is a technical issue that need not concern us here. What is important is that the usefulness of a predictor is assessed by its success in reducing the impurity in the mother node, and its success in creating purer daughter nodes. The vertical parts of the branches in the tree diagram are proportional to the achieved reduction in node heterogeneity, and provide a graphical representation of the explanatory value of a split.
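By way of illustration, the following sketch computes the Gini index, rpart's default impurity measure for classification trees, for a mother node with the NP/PP counts of the dative data (2414 NP, 849 PP) and for the two daughter nodes created by the AccessOfRec split. The helper gini() is our own, not part of rpart.

```r
# Gini impurity: 1 - sum(p_k^2) over the class proportions p_k.
# A pure node has impurity 0; a 50/50 node has impurity 0.5.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(2414, 849))              # mother node, about 0.385
g.left  <- gini(c(2030, 272))   # AccessOfRec = given
g.right <- gini(c(384, 577))    # AccessOfRec = accessible, new

# The weighted mean impurity of the daughters is smaller,
# so the split is useful:
(2302 * g.left + 961 * g.right) / 3263   # about 0.29
```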


Figure 5.14: Initial (unpruned) CART tree for the realization of the recipient (NP or PP) in written and spoken English.

The tree shown in Figure 5.14 was grown by the function rpart() from the rpart package.

> library(rpart)

> dative.rp = rpart(RealizationOfRecipient ~ .,
+     data = dative[ ,-c(1, 3)])  # exclude the columns with speakers and verbs

In this formula, the dot following the tilde is shorthand for all variables in the data frame with the exception of the dependent variable. The tree object dative.rp is visualized with plot() and labeled with text().



Figure 5.15: Cost-complexity cross-validation plot for the unpruned CART tree (Figure 5.14) for the realization of the recipient in English.

> plot(dative.rp, compress = T, branch = 1, margin = 0.1)

> text(dative.rp, use.n = T, pretty = 0)

The plot options are explained in detail in the help for plot.rpart(), and the options for labelling in the help for text.rpart(). When the option use.n is set to TRUE, counts are added to the leaf nodes. By setting pretty to zero, we force the use of the full names of the factor levels, instead of the codes that rpart() produces by default.

The problem with this initial tree is that it still overfits the data. It implements too many splits that have no predictive value for new data. To increase the prediction accuracy of the tree, we have to prune it by snipping off useless branches. This is done with the help of an algorithm known as COST-COMPLEXITY PRUNING. Cost-complexity pruning pits the size of the tree (in terms of its number of leaf nodes) against its success in reducing the impurity in the tree by means of a cost-complexity parameter cp. The larger the value of cp, the greater the number of branches that is pruned. For very large cp, all that remains of the tree is its root stump. When cp is very low, it is too small to induce any pruning.

How should we evaluate the balance between success in classification accuracy on the one hand and the complexity of one's theory (gauged by its number of leaf nodes) on the other hand? The answer to this question is 10-fold cross-validation. For successive values of cp, and hence for successive tree sizes, we take the data and randomly divide it into 10 equally sized parts. We then select the first part, put it aside, and build a tree for the remaining 9 parts lumped together. Next, we evaluate how well this tree predicts the realization of the recipient for the held-out part by comparing its misclassification rate with the misclassification rate for the root model, the simplest possible model, without any predictors. The result is a relative error score. We repeat this process for each of the nine remaining parts. What we end up with is, for each tree size, 10 relative error scores that inform us how well the model generalizes to unseen data. Of course, it would be better to evaluate the model against genuinely new data, but in the absence of a second equivalent data set, cross-validation provides a reasonable way of assessing predictivity.
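The procedure just described can also be sketched by hand. The following is a simplified illustration, not the exact algorithm rpart runs internally (rpart works with relative error over a sequence of cp values); it assumes the dative data frame is loaded, and reports raw held-out misclassification rates for a single cp value.

```r
library(rpart)

# By-hand 10-fold cross-validation for one value of cp.
set.seed(1)                     # reproducible fold assignment
folds <- sample(rep(1:10, length.out = nrow(dative)))
cvErrors <- numeric(10)
for (i in 1:10) {
  train   <- dative[folds != i, -c(1, 3)]   # drop Speaker and Verb
  heldout <- dative[folds == i, -c(1, 3)]
  fit  <- rpart(RealizationOfRecipient ~ ., data = train, cp = 0.041)
  pred <- predict(fit, newdata = heldout, type = "class")
  cvErrors[i] <- mean(pred != heldout$RealizationOfRecipient)
}
mean(cvErrors)   # average misclassification rate on held-out data
```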

Figure 5.15, obtained with plotcp(), plots the means of these error scores.

> plotcp(dative.rp)

The horizontal axis displays the values of the cost-complexity parameter cp at which branches are pruned. The corresponding sizes of the pruned tree are shown at the top of the plot. The vertical axis represents the cross-validation error. The small vertical lines for each point mark one standard error above and below the mean. The dotted line represents one standard error above the mean for the lowest point in the graph. A common selection rule for the cost-complexity parameter is to select the leftmost point that is still under this dotted line. In this example, this leftmost point would also be the rightmost point. To be a little conservative, we prune the tree (with prune()) at cp = 0.041, and obtain a tree with 6 leaves, as shown in Figure 5.16.

> dative.rp1 = prune(dative.rp, cp = 0.041)

> plot(dative.rp1, compress = T, branch = 1, margin = 0.1)

> text(dative.rp1, use.n = T, pretty = 0)

We accept the predictors in this tree as statistically significant, and note that here cross-validation has taken over the function of the p-values of classical statistics associated with the t, F, or chi-squared distributions.

A verbal summary of the model is obtained by typing the object name to the prompt.

> dative.rp1
n= 3263

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 3263 849 NP (0.74 0.26)
   2) AccessOfRec=given 2302 272 NP (0.88 0.12)
     4) AccessOfTheme=accessible,new 1977 116 NP (0.94 0.06) *
     5) AccessOfTheme=given 325 156 NP (0.52 0.48)
      10) PronomOfTheme=nonpronominal 136 17 NP (0.88 0.12) *
      11) PronomOfTheme=pronominal 189 50 PP (0.26 0.74) *
   3) AccessOfRec=accessible,new 961 384 PP (0.40 0.60)
     6) SemanticClass=a,c,f,p 531 232 NP (0.56 0.44)
      12) LengthOfTheme>=4.5 209 44 NP (0.79 0.21) *
      13) LengthOfTheme< 4.5 322 134 PP (0.42 0.58) *
     7) SemanticClass=t 430 85 PP (0.20 0.80) *


The first line mentions the number of data points. The second line provides a legend for the remainder, each line of which consists of a node number, the splitting criterion, the number of observations in the subtree dominated by the node, the number of observations not belonging to the predicted class (the loss), the predicted realization, and the probabilities of the NP and PP realizations.


Figure 5.16: Cost-complexity pruned CART tree for the realization of the recipient in English.

How successful is the model in predicting the realization of the recipient? To answer this question, we pit the predictions of the CART tree against the actually observed realizations. We extract the predictions from the model with predict().

> head(predict(dative.rp1))

             n          p
[1,] 0.9413252 0.05867476
[2,] 0.9413252 0.05867476
[3,] 0.9413252 0.05867476
[4,] 0.9413252 0.05867476
[5,] 0.9413252 0.05867476
[6,] 0.9413252 0.05867476

Each row of the input data frame is paired with two probabilities, one for each level of the dependent variable. In the present example, we have a probability for the realization as NP and one for the realization as PP. We choose the realization with the largest probability (see section 7.4 for a more precise evaluation method using the somers2() function).

Our choice is therefore NP if the first column has a value greater than or equal to 0.5, and PP otherwise.

> choiceIsNP = predict(dative.rp1)[,1] >= 0.5

> choiceIsNP[1:6]

[1] TRUE TRUE TRUE TRUE TRUE TRUE

We combine this vector with the original observations

> preds = data.frame(obs = dative$RealizationOfRecipient, choiceIsNP)

> head(preds)
  obs choiceIsNP

1 n TRUE

2 n TRUE

3 n TRUE

4 n TRUE

5 n TRUE

6 n TRUE

and cross-tabulate.

> xtabs( ~ obs + choiceIsNP, data = preds)
   choiceIsNP
obs FALSE TRUE
  n   269 2145
  p   672  177

Of a total of 3263 data points, only 269 + 177 = 446 are misclassified (13.7%). This compares favorably to a baseline classifier that simply predicts the most likely realization for all data points, and that therefore is in error for all and only the data points with PP as realization.

> xtabs( ~ RealizationOfRecipient, dative)
RealizationOfRecipient
   n    p
2414  849

The misclassification rate for this baseline model is 849/3263 = 26%.
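Both rates can be recomputed directly from the counts in the cross-tabulations above:

```r
# Confusion matrix for the pruned tree, with counts copied from the
# cross-tabulation of observed realizations against choiceIsNP:
confusion <- matrix(c( 269, 2145,
                       672,  177),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(obs = c("n", "p"),
                                    choiceIsNP = c("FALSE", "TRUE")))
misclassified <- confusion["n", "FALSE"] + confusion["p", "TRUE"]
misclassified / sum(confusion)   # 446/3263, about 0.137
849 / 3263                       # baseline error rate, about 0.26
```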

An important property of CART trees is that they deal very elegantly with interactions. Interactions arise when the effects of two predictors are not independent, i.e., when the effect of one predictor is codetermined by the value of another predictor. Figure 5.16 illustrates many interactions. For instance, SemanticClass appears only in the right branch of the tree; hence it is relevant only for clauses in which the accessibility of the recipient is not given. We have here an interaction of SemanticClass by AccessOfRec. The other three predictors in the model also interact with AccessOfRec. Furthermore, LengthOfTheme interacts with SemanticClass, and PronomOfTheme with AccessOfTheme. Whereas such complex interactions can be quite difficult to understand in regression models, they are transparent and easy to grasp in classification and regression trees.