
5.1.5 Tables with distances: Hierarchical cluster analysis

The final technique for tracing groups in numerical tables that we consider in this chapter is hierarchical cluster analysis. This is the name for a family of techniques for clustering data and displaying the clusters in a tree-like format. Like multidimensional scaling, these techniques require a distance object as input.

There are many different ways to form clusters. One way is to begin with an initial cluster containing all data points, and then to proceed by successively PARTITIONING clusters into smaller clusters. One of the functions in R that uses this DIVISIVE CLUSTERING approach is diana(). This method is reported to have difficulties finding optimal divisions for smaller clusters, but when the goal is to find a few large clusters, it is an attractive method.

Figure 5.9: Year of birth and sex as reflected in the first and third dimensions of a multidimensional scaling of string-based cross-entropies for the spontaneous spoken Dutch of 165 speakers.

More commonly, clustering begins small, with single points, which are then agglomerated into groups, and these groups into larger groups, and so on. AGGLOMERATIVE CLUSTERING is implemented in the function hclust(). The clustering depends to a considerable extent on the criteria used for combining points and groups of points into larger clusters. Which criteria hclust() should use is specified by means of the option method. The default in R is complete, which evaluates the dissimilarity between two clusters as the maximum of the dissimilarities between the individual members of these clusters.
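The choice of linkage criterion can change the resulting tree considerably. The following sketch, which uses R's built-in eurodist distances between European cities rather than the present data, makes the comparison concrete:

> # complete linkage: distance between clusters is the maximum
> # of the pairwise dissimilarities between their members
> plot(hclust(eurodist, method = "complete"))
> # average linkage: the mean of the pairwise dissimilarities
> plot(hclust(eurodist, method = "average"))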

By way of example, we consider 23 lexical measures characterizing 2233 monomorphemic and monosyllabic English words, as available in the english data set. For convenience, the information pertaining to just the words and their associated measures is available separately as the data set lexicalMeasures. Brief information on these measures can be obtained with ?lexicalMeasures or help(lexicalMeasures).

> lexicalMeasures[1:5, 1:6]

    Word     CelS       Fdif       Vf    Dent    Ient
1    doe 3.912023  1.0216510 1.386294 0.14144 0.02114
2  whore 4.521789  0.3504830 1.386294 0.42706 0.94198
3 stress 6.505784  2.0893560 1.609438 0.06197 1.44339
4   pork 5.017280 -0.5263339 1.945910 0.43035 0.00000
5   plug 4.890349 -1.0445450 2.197225 0.35920 1.75393

All these measures are correlated to some extent. A matrix listing all pairwise correlations between these variables, the CORRELATION MATRIX of this data set, is obtained simply by applying cor() to lexicalMeasures after excluding the first column, which is not numeric:

> lexicalMeasures.cor = cor(lexicalMeasures[, -1])

> lexicalMeasures.cor[1:5, 1:5]

            CelS        Fdif          Vf        Dent        Ient
CelS  1.00000000  0.04553879  0.66481876  0.25211726 -0.04662943
Fdif  0.04553879  1.00000000 -0.13101020 -0.02376464 -0.12678869
Vf    0.66481876 -0.13101020  1.00000000  0.68828793  0.08484806
Dent  0.25211726 -0.02376464  0.68828793  1.00000000 -0.06582160
Ient -0.04662943 -0.12678869  0.08484806 -0.06582160  1.00000000

Even correlations that seem quite small, such as the correlation of CelS (frequency) and Ient (inflectional entropy), are significant, thanks to the large number of words in this data set.

> cor.test(lexicalMeasures$CelS, lexicalMeasures$Ient)

        Pearson's product-moment correlation

data:  lexicalMeasures$CelS and lexicalMeasures$Ient
t = -2.2049, df = 2231, p-value = 0.02757
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.087940061 -0.005158676
sample estimates:
        cor
-0.04662943
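The t-statistic reported by cor.test() is the standard one for a correlation coefficient, t = r * sqrt((n - 2) / (1 - r^2)), and its dependence on n explains why even a tiny r reaches significance here. A sketch recomputing the test by hand:

> r = -0.04662943; n = 2233
> t.stat = r * sqrt((n - 2) / (1 - r^2))   # about -2.2049, as above
> 2 * pt(-abs(t.stat), df = n - 2)         # two-sided p-value, about 0.028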

The question of interest to Baayen et al. [2006] was whether word frequency (CelS) enters into stronger correlations with measures of a word's form (such as its length) or with measures of its meaning (such as its morphological family size or its number of synsets in WordNet). The answer to this question may contribute to understanding the role of frequency in lexical processing. The ubiquitous effect of word frequency in reaction time experiments has often been interpreted as reflecting the processing load of a word's form.

But if word frequency happens to be more tightly correlated with semantic measures, this would suggest that it might be useful to reconceptualize frequency as a measure of one’s familiarity with a word’s meaning. In an experimental task such as lexical decision, it might then be thought of as gauging, at least in part, semantic processing load.

Figure 5.10: Agglomerative hierarchical cluster analysis of 23 lexical variables.

A hierarchical cluster analysis is ideal for exploring the correlational structure of these 23 measures. However, the above correlation matrix is not the best starting point for a cluster analysis. Correlations can be both positive and negative, whereas for a matrix of distances it is desirable to have only non-negative values. This requirement is easy to satisfy by squaring the correlation matrix. (When we square the matrix, each of its elements is squared.)
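Note that in R the caret operates element-wise on a matrix; it does not perform matrix multiplication. A two-line check:

> m = matrix(1:4, nrow = 2)
> m^2        # each cell squared: 1, 4, 9, 16
> m %*% m    # the matrix product: a different matrix altogether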

> (lexicalMeasures.cor^2)[1:5, 1:5]
            CelS        Fdif          Vf        Dent        Ient
CelS 1.000000000 0.002073781 0.441983979 0.063563114 0.002174303
Fdif 0.002073781 1.000000000 0.017163673 0.000564758 0.016075372
Vf   0.441983979 0.017163673 1.000000000 0.473740272 0.007199192
Dent 0.063563114 0.000564758 0.473740272 1.000000000 0.004332483
Ient 0.002174303 0.016075372 0.007199192 0.004332483 1.000000000

Another consideration is that cor() works best for reasonably symmetric vectors. However, many of the present measures have skewed distributions, or distributions with more than one peak (MULTIMODALITY). Therefore, it makes sense to make use of Spearman correlations.
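Since cor() computes Spearman correlations with the same interface, the two measures of association can be compared directly for any pair of variables, for instance:

> cor(lexicalMeasures$CelS, lexicalMeasures$Vf)                       # Pearson
> cor(lexicalMeasures$CelS, lexicalMeasures$Vf, method = "spearman")  # rank-based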

> lexicalMeasures.cor = cor(lexicalMeasures[, -1], method = "spearman")^2

> lexicalMeasures.cor[1:5, 1:5]

             CelS         Fdif         Vf        Dent        Ient
CelS 1.0000000000 0.0004464715 0.44529233 0.097394824 0.003643291
Fdif 0.0004464715 1.0000000000 0.02163948 0.001183269 0.017550778
Vf   0.4452923284 0.0216394843 1.00000000 0.533855660 0.011743931
Dent 0.0973948244 0.0011832693 0.53385566 1.000000000 0.001875520
Ient 0.0036432911 0.0175507780 0.01174393 0.001875520 1.000000000

The last preparatory step is to convert this matrix into a distance object.

> lexicalMeasures.dist = dist(lexicalMeasures.cor)
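By default, dist() computes Euclidean distances between the rows of its input, so each measure is now a point in a 23-dimensional space of squared correlations. The distance object stores only the lower triangle; a corner of it can be inspected by converting it back to a matrix:

> as.matrix(lexicalMeasures.dist)[1:3, 1:3]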

The cluster analysis itself is straightforward. First consider agglomerative clustering, for which we use hclust() to carry out the cluster analysis, and plclust() to plot the dendrogram.

> lexicalMeasures.clust = hclust(lexicalMeasures.dist)

> plclust(lexicalMeasures.clust)
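(In more recent versions of R, plclust() has been removed; the plot method for hclust objects produces the same dendrogram.)

> plot(lexicalMeasures.clust)   # equivalent in current R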

Figure 5.10 shows that the highest split separates three measures of orthographic consistency from all other measures. The next split isolates another four measures of orthographic consistency, and the same holds for the next split as well. The fourth split starts to become interesting, in that its left branch groups together four semantic measures: family size (Vf), derivational entropy (Dent), and two synset counts (NsyS, NsyC). It also contains frequency (CelS). The right branch dominates various measures of form, such as the count of neighbors (Ncou) and word length (Len). But this right branch also contains two measures that are not measures of form: inflectional entropy (Ient, a measure of the complexity of a word's inflectional paradigm) and the ratio of the word's frequencies as a noun and as a verb (NVratio). In other words, the clustering algorithm that we used reveals some structure, but a clear separation of measures of form and measures of meaning is not obtained.

Let's now consider divisive clustering with the diana() function from the cluster package. We feed the output of diana() into pltree(), which handles the graphics.

The result is shown in Figure 5.11.

> library(cluster)

> pltree(diana(lexicalMeasures.dist))

Divisive clustering succeeds in bringing all measures that do not pertain to meaning together in one cluster at the left of the dendrogram, the left branch of the third main split.

Again, frequency (CelS) does not side with the measures of word form.

If you want to know to which clusters the variables are assigned, you first have to decide how many clusters you think you need, and then use this number as the second argument of cutree(). Here, we opt for five clusters.


Figure 5.11: Divisive hierarchical cluster analysis of 23 lexical variables.

> cutree(diana(lexicalMeasures.dist), 5)

[1] 1 2 1 1 1 1 1 2 2 2 2 3 3 4 4 3 3 5 5 4 4 5 1

When combined with the names of the measures, and with the classification of these measures in the data set lexicalMeasuresClasses, we obtain a very close correspondence between the class of a variable and its cluster number. The only exception is the Fdif measure, which gauges the difference between a word's frequency in speech versus writing.

> x = data.frame(measure = rownames(lexicalMeasures.cor),
+     cluster = cutree(diana(lexicalMeasures.dist), 5),
+     class = lexicalMeasuresClasses$Class)

> x = x[order(x$cluster), ]

> x

     measure cluster   class
1       CelS       1 Meaning
3         Vf       1 Meaning
4       Dent       1 Meaning
5       Ient       1 Meaning
6       NsyS       1 Meaning
7       NsyC       1 Meaning
23   NVratio       1 Meaning
2       Fdif       2 Meaning
8        Len       2    Form
9       Ncou       2    Form
10      Bigr       2    Form
11      InBi       2    Form
12     spelV       3    Form
13     spelN       3    Form
16  friendsV       3    Form
17  friendsN       3    Form
14     phonV       4    Form
15     phonN       4    Form
20       fbV       4    Form
21       fbN       4    Form
18       ffV       5    Form
19       ffN       5    Form
22 ffNonzero       5    Form
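This correspondence can be verified at a glance by cross-tabulating cluster number against variable class in the data frame x just constructed:

> xtabs(~ cluster + class, data = x)
       class
cluster Form Meaning
      1    0       7
      2    4       1
      3    4       0
      4    4       0
      5    3       0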

As a second example of cluster analysis, we consider data published by Dunn et al. [2005] on the phylogenetic classification of Papuan and Oceanic languages using grammatical features. The vocabularies of the Papuan languages are so different that classification based on the amount of lexical overlap in basic word lists is bound to fail. Dunn and colleagues showed that it is possible to probe the classification of the Papuan languages in an interesting and revealing way using nonlexical, grammatical traits. Their data set, available as phylogeny, contains 125 binary features for 15 Papuan and 16 Oceanic languages. The first column specifies the language, the second the language family, and the remaining 125 columns the grammatical properties, such as whether a language has prenasalized stops. Presence is coded by 1, absence by 0.

> phylogeny[1:5, 1:5]
  Language Family Frics PrenasalizedStops PhonDistBetweenLAndR
1   Motuna Papuan     1                 0                    0
2      Kol Papuan     1                 0                    1
3  Rotokas Papuan     1                 0                    0
4      Ata Papuan     1                 0                    0
5     Kuot Papuan     1                 0                    1

The left panel of Figure 5.12 shows the dendrogram obtained by applying divisive clustering using diana(). We first create a distance object appropriate for binary data:

> phylogeny.dist = dist(phylogeny[, 3:ncol(phylogeny)], method = "binary")
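With method = "binary", dist() computes the Jaccard distance: the proportion of mismatching features among those features that are present in at least one of the two languages. A two-vector check:

> dist(rbind(c(1, 1, 0, 0), c(1, 0, 1, 0)), method = "binary")
          1
2 0.6666667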


We also create a vector of language names, in which the names of the Papuan languages are set in upper case with toupper():

> plotnames = as.character(phylogeny$Language)

> plotnames[phylogeny$Family=="Papuan"] =
+     toupper(plotnames[phylogeny$Family=="Papuan"])

Divisive clustering and visualization are now straightforward:

> library(cluster)

> plot(diana(dist(phylogeny[, 3:ncol(phylogeny)],
+     method = "binary")), labels = plotnames, cex = 0.8,
+     main = " ", xlab = " ", col = c("black", "white"), which.plot = 2)

We note a fairly clear separation of Papuan and Oceanic languages.

The right panel of Figure 5.12 shows an unrooted tree obtained with an algorithm known as neighbor-joining, which is often used for phylogeny estimation. In what follows, we use the ape package developed by Paradis and described in detail, together with other packages for phylogenetic analysis, in Paradis [2006]. We load the ape package, the only package in this book that is not loaded automatically. (This is because the ape package depends on the nlme package, which is incompatible with the lme4 package that is loaded by languageR.) We then apply the nj() function to obtain a phylogenetic tree object.

> library(ape)

> phylogeny.dist.tr = nj(phylogeny.dist)
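The function nj() returns an object of class "phylo", the tree representation used throughout the ape package:

> class(phylogeny.dist.tr)
[1] "phylo"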

The plot method for phylogenetic tree objects has a wide variety of options. One option, illustrated in the right panel of Figure 5.12, is to use different fonts to highlight subsets of observations. Since the leaf nodes (or tips) of the tree are labeled by default with the row numbers of the observations in the input distance matrix, we need to do some extra preparatory work to get the names of the languages into the plot. We begin with the row numbers, which are available as a character vector in the tree object as tip.label. We use these row numbers to reconstruct the names of the language families,

> families = as.character(

+     phylogeny$Family[as.numeric(phylogeny.dist.tr$tip.label)])

and also the names of the languages themselves:

> languages = as.character(

+     phylogeny$Language[as.numeric(phylogeny.dist.tr$tip.label)])

We substitute the language names for the row names in the tree object,

> phylogeny.dist.tr$tip.label = languages

and plot the tree.


Figure 5.12: Divisive clustering with diana() (in the cluster package) and the corresponding unrooted tree obtained with the neighbor-joining algorithm nj() (in the ape package) of 16 Oceanic and 15 Papuan languages, using 125 grammatical traits [Dunn et al., 2005]. (Divisive coefficient: 0.47.)

> plot(phylogeny.dist.tr, type = "u",
+     font = as.numeric(as.factor(families)))

The option type = "u" requests an unrooted tree. In unrooted trees, all nodes have at least three connecting branches, and there is no longer a single root node that can be considered the common ancestor of all tip nodes. It is easy to see that the two dendrograms shown in Figure 5.12 point to basically the same topology.

As mentioned above, the focus of the study of Dunn and colleagues was the internal classification of the Papuan languages, as it is here that traditional word-based classification fails most dramatically. The tree presented in the upper left of Figure 5.13 shows that the unrooted phylogenetic tree groups languages according to geographic region, as indicated by different fonts (plain: Bismarck Archipelago; bold: Bougainville; italic: Central Solomons; bold italic: Louisiade Archipelago). This striking result is reproduced as follows.

> papuan = phylogeny[phylogeny$Family == "Papuan",]


Figure 5.13: Unrooted phylogenetic trees for the subset of Papuan languages in the data of Dunn et al. [2005], obtained with the neighbor-joining algorithm. The fonts represent geographical areas (plain: Bismarck Archipelago; bold: Bougainville; italic: Central Solomons; bold italic: Louisiade Archipelago). The upper right tree adds thermometers for bootstrap support to the tree in the upper left. The lower left tree is a consensus tree across 200 bootstrap trees.


> papuan$Language = as.factor(as.character(papuan$Language))

> papuan.meta = papuan[ ,1:2]

> papuan.mat = papuan[, 3:ncol(papuan)]

> papuan.meta$Geography = c(
+     "Bougainville", "Bismarck Archipelago", "Bougainville",
+     "Bismarck Archipelago", "Bismarck Archipelago", "Central Solomons",
+     "Bougainville", "Louisiade Archipelago", "Bougainville",
+     "Bismarck Archipelago", "Bismarck Archipelago",
+     "Bismarck Archipelago", "Central Solomons", "Central Solomons",
+     "Central Solomons")

> papuan.dist = dist(papuan.mat, method = "binary")

> papuan.dist.tr = nj(papuan.dist)

> plot(papuan.dist.tr, type = "u",
+     font = as.numeric(as.factor(papuan.meta$Geography)))

The clustering techniques that we have considered in this section are not based on a formal model, but on reasonable, yet nevertheless heuristic, procedures. As a consequence, there are no hard and fast criteria to help decide what kind of clustering (agglomerative or divisive) is optimal for a given data set. When a cluster analysis is reported, only one dendrogram tends to be shown, even though the authors may have tried out a variety of clustering techniques. Typically, the dendrogram shown is the one that best fits the authors' hypothesis about the data. This is fine, as long as the reader keeps in mind that the dendrogram probably depicts an optimal solution, and not the only possible one.

A technique that provides a means for validating a cluster analysis is the BOOTSTRAP. The bootstrap is a general technique that we will also use in the chapters on regression modeling. The basic idea of the bootstrap, as applied to the present data, is that we sample (with replacement) from the columns of our data matrix. For each sample, we construct the distance matrix and grow the corresponding unrooted tree with the neighbor-joining algorithm. Finally, we compare our original dendrogram with the dendrograms for the bootstrap samples, and calculate the proportion of bootstrapped dendrograms that support the groupings (subtrees, or clades in the terminology of phylogenetics) in the original tree. In this way, we obtain insight into the extent to which the clustering depends on the idiosyncrasies of the set of grammatical traits that happened to be selected for analysis.

The proportion of support for the different subtrees is shown in the upper right panel of Figure 5.13 by means of thermometers: the higher the temperature, the greater the proportional support for a subtree. The bootstrap analysis underlying this panel closely follows the example of Paradis [2006:117]. We begin by defining the number of bootstrap runs, and prepare a list in which to save the bootstrap trees.

> B = 200

> btr = list()


> length(btr) = B

We now create 200 bootstrap trees, sampling with replacement from the columns of our data matrix.

> for (i in 1:B) {

+     trB = nj(dist(papuan.mat[, sample(ncol(papuan.mat), replace = TRUE)],
+         method = "binary"))
+     trB$tip.label = as.character(papuan.meta$Language[as.numeric(
+         trB$tip.label)])
+     btr[[i]] = trB
+ }

The proportions of bootstrap trees that support the subtrees of our original tree are obtained with the help of prop.clades().

> props = prop.clades(papuan.dist.tr, btr)/B

> props

 [1] 1.000 0.600 0.865 0.050 0.100 0.115 0.200 0.315 0.555 0.680 0.625
[12] 0.445 0.920

We plot the original tree

> plot(papuan.dist.tr, type = "u",
+     font = as.numeric(as.factor(papuan.meta$Geography)))

and add the thermometers with nodelabels().

> nodelabels(thermo = props, piecol = c("black", "grey"))

The proportion of bootstrap support decreases as one moves to the center of the graph.

This points to a lack of consensus with respect to how the subtrees should be linked. A different way of bringing this uncertainty out into the open is to plot a CONSENSUS TREE. In a consensus tree, subgroups that are not observed in all bootstrap trees (strict consensus) or in a majority of all bootstrap trees (majority-rule consensus) are collapsed.

The result is a tree with multichotomies. The lower left tree of Figure 5.13 shows such a multichotomy in the center, where eight branches come together. The ape package provides the function consensus() for constructing a consensus tree from a list of trees, given a proportion p specifying the required level of consensus.

> btr.consensus = consensus(btr, p = 0.5)

Consensus trees come with a plot method, and can be visualized straightforwardly with plot(). Some extra steps are required to plot the tree with fonts representing geographical areas.

> x = btr.consensus$tip.label

> x

[1] "Anem" "Ata" "Bilua" "Buin" "Nasioi"

[6] "Motuna" "Kol" "Sulka" "Mali" "Kuot"

159

DRAFT

[11] "Lavukaleve" "Rotokas" "Savosavo" "Touo" "Yeli_Dnye"

> x = data.frame(Language = x, Node = 1:length(x))

> x = merge(x, papuan.meta, by.x = "Language", by.y = "Language")

> head(x)

  Language Node Family            Geography
1     Anem    1 Papuan Bismarck Archipelago
2      Ata    2 Papuan Bismarck Archipelago
3    Bilua    3 Papuan     Central Solomons
4     Buin    4 Papuan         Bougainville
5      Kol    7 Papuan Bismarck Archipelago
6     Kuot   10 Papuan Bismarck Archipelago

> x = x[order(x$Node),]

> x$Geography = as.factor(x$Geography)

> plot(btr.consensus, type = "u", font = as.numeric(x$Geography))

The consensus tree shows that the grouping of Bilua, Kuot, Lavukaleve, Rotokas, and Yeli Dnye is inconsistent across bootstrap runs. We should at the same time keep in mind that a given bootstrap run will make use of only roughly 80 of the 125 available grammatical traits (see the sketch below). A loss of about a third of the available grammatical markers may have had severe adverse consequences for the goodness of the clustering. Therefore, replication studies with a larger set of languages and an even broader range of grammatical traits may well support the interesting similarity in geographical and grammatical topology indicated by the original tree constructed with all 125 traits currently available.
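The figure of roughly 80 distinct traits per run follows from sampling the 125 columns with replacement: the expected number of distinct columns is 125 * (1 - (1 - 1/125)^125), which is approximately 79. A quick check, analytically and by simulation:

> 125 * (1 - (1 - 1/125)^125)    # analytic expectation, about 79.2
> mean(replicate(1000, length(unique(sample(125, replace = TRUE)))))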