3.3 Cluster analysis and multidimensional scaling

The techniques of cluster analysis and multidimensional scaling both take a dissimilarity matrix as their input. The matrix of similarities that the φ coefficients describe can thereby be transformed into a matrix of dissimilarities by defining a distance function (e.g., the Euclidean distance) of their feature vectors.8 The feature vector that is used in this work describes each symbol in terms of the individual φ values of its matrix row.9 Such dissimilarity (or distance) matrices can in turn be used to induce a structural grouping of the individual symbols. In this section, I want to present two data mining methods of processing such dissimilarity matrices that allow for an easier interpretation of possible patterns in the data, viz. cluster analysis with dendrograms (3.3.1) and multidimensional scaling (3.3.2). Both techniques also provide a visual representation of the results.

8For the place of articulation distinctions, the context on which the φ values are calculated already defines a dissimilarity relation and therefore automatically yields a dissimilarity (rather than a similarity) of symbols in the form of the φ value directly.

9The reason for taking row vectors rather than column vectors is that the linear order from left to right is considered to be more salient in languages, in particular for the vowel harmony visualization in Chapter 6.

3.3.1 Cluster analysis with dendrograms

Cluster analysis is the process of grouping the data into classes of similar objects. A cluster is thereby a group of objects that are most similar to one another but dissimilar to the objects in other clusters (cf. Han and Kamber 2006; Steinhausen and Langer 1977; Kaufmann and Rousseeuw 1990 [2005]). Dissimilarities are assessed on the basis of the values which describe the individual objects in the cluster. For the purpose of the present investigations, the objects of the clustering are symbols (i.e., phones or letters) in the data sets under consideration. Their dissimilarities are based on their φ values, which in turn are derived from the respective contexts that determine the (dis-)similarity relationship in the particular approach in question.

In this section, I want to explore the computation of the clustering analysis together with how to interpret its results. The basis for the following description is a dissimilarity matrix, which resembles the matrix in Table 3.5 in that each object (i.e., each symbol) in the data set occurs in one of the rows and columns, respectively. For each pair of symbols, the respective cell in the matrix represents the dissimilarity value for the pair. I will mainly use normalized dissimilarity values, where two symbols i and j are highly similar if their dissimilarity value is close to 0, while their dissimilarity value becomes larger (with a maximum of 1 for the highest dissimilarity) the more they differ. Since the dissimilarity of two objects is a symmetric relation and the dissimilarity of an object to itself is by definition 0, the measured dissimilarity d(i, j) between objects i and j satisfies d(i, j) = d(j, i) and d(i, i) = 0 and yields the following matrix, where the upper triangle is omitted due to the symmetric property of the matrix.

D = \begin{pmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{pmatrix}

For the φ values in Table 3.5, the dissimilarity matrix based on the Euclidean distance of the matrix row vectors would look as in Table 3.7. The Euclidean distance between two symbols is calculated by summing up the squares of the differences between their values for each index position in the vector (e.g., d_a(a, e) = (−0.22413088 − 0.018913003)^2 = 0.05907033) and then taking the square root of the sum. For the matrix row vectors of the two symbols a and e in Table 3.5, the squared differences for each index position (the five columns of the matrix) are given as d_i(a, e) in Table 3.6:

The distance value for the symbols a and e given their matrix row values is thus the dissimilarity value given in Table 3.7 for the pair of symbols a and e:

d(a, e) = \sqrt{d_a(a, e) + d_e(a, e) + d_n(a, e) + d_r(a, e) + d_w(a, e)}
        = \sqrt{0.05907033 + 0.01099928 + 0.01685451 + 0.0008002013 + 0.001010145}
        = 0.2978833

Table 3.6: Euclidean distance between the matrix row vectors for two symbols and their φ values.

          a        e        n        r        w
a        −0.2241  −0.2659   0.2327   0.2175   0.0607
e         0.0189  −0.3708   0.1029   0.2458   0.0289
d_i(a,e)  0.0590   0.0109   0.0168   0.0008   0.0010

Table 3.7: Dissimilarity matrix for the bigram φ values from Table 3.5. The dissimilarity value corresponds to the Euclidean distance between the matrix row vectors for each symbol.

     a          e          n          r          w
w    0.6583065  0.5977484  0.2593637  0.4218040  0

The corresponding normalized dissimilarity matrix is obtained by first rescaling the φ values, which range over the interval [−1, 1], to the interval [0, 1] (adding one and dividing by two) and then computing the Euclidean distances as before. Since this rescaling is linear, it simply halves all distance values. The resulting normalized dissimilarity matrix is shown in Table 3.8:

Table 3.8: Dissimilarity matrix for the normalized bigram φ values from Table 3.5.

     a           e           n           r           w
w    0.32915325  0.29887419  0.12968184  0.21090201  0
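
In R, this computation can be reproduced with the built-in dist() function. The following is a minimal sketch that uses the (rounded) φ row vectors of the symbols a and e from Table 3.6; the complete φ matrix from Table 3.5 would be processed in exactly the same way, and the resulting values match Tables 3.7 and 3.8 up to rounding.

    # Rounded phi row vectors for a and e (Table 3.6); the full matrix of
    # Table 3.5 would simply contribute three more rows (n, r, w).
    phi <- rbind(
      a = c(-0.2241, -0.2659, 0.2327, 0.2175, 0.0607),
      e = c( 0.0189, -0.3708, 0.1029, 0.2458, 0.0289)
    )
    colnames(phi) <- c("a", "e", "n", "r", "w")

    # Euclidean distances between the row vectors (cf. Table 3.7)
    d <- dist(phi, method = "euclidean")
    print(d)                    # d(a, e) is approx. 0.2979

    # Normalized variant: rescale the phi values from [-1, 1] to [0, 1]
    # before computing the distances (cf. Table 3.8); all distances are halved.
    phi_norm <- (phi + 1) / 2
    d_norm <- dist(phi_norm, method = "euclidean")
    print(d_norm)               # d(a, e) is approx. 0.1489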

Based on such dissimilarity matrices, the clustering algorithms can be performed. There exist many algorithms in the literature (cf. Han and Kamber 2006), which can be categorized according to several features. The method that will be used here belongs to the category of hierarchical clustering methods. Hierarchical methods create a hierarchical decomposition of the objects in the data set. Such methods can be classified as either being agglomerative (also called bottom-up) or divisive (also known as top-down). Both approaches differ in the direction in which the clusters are generated. In the agglomerative approach, each object is initially considered to be its own cluster. The algorithm then successively merges the objects or clusters that are least dissimilar until all clusters or objects are merged into one, which represents the topmost level of the hierarchy. In the divisive approach, on the other hand, all objects start off as belonging to the same cluster. Each successive iteration then splits the cluster into smaller clusters until each object constitutes its own cluster. The major disadvantage of hierarchical methods is that once a step in the clustering process has been performed (for instance, a merge of different clusters into one in the agglomerative approach), it can never be undone. Sometimes a different grouping at a later stage would lead to a better result, but this grouping is no longer possible because part of it has already been grouped with other objects or clusters in a previous step. However, the greedy nature of the method is also useful in that it reduces the computational costs as opposed to other clustering methods that have to take all combinatorial possibilities into account.

There are a number of methods for determining the next merge or split on the basis of the dissimilarities of the given objects. Earlier experiments (cf. Powers 1997) have shown that Ward’s method (Ward 1963) is particularly well suited for the groups that are to be established for linguistic classes, as it attempts to create more balanced clusters, where the differences in the number of elements for the individual clusters are kept to a minimum. This corresponds to the fact that phonological classes are relatively evenly sized (see the results in Part II of this thesis), with only rare cases of outliers that represent clusters with only a single member as opposed to other clusters that subsume the rest of the objects. Ward’s minimum variance method belongs to the class of agglomerative hierarchical clustering algorithms. To form clusters, it minimizes an objective function, the sum of squared errors, which is also known from multivariate analysis of variance (Bortz 2005:575). The method thereby successively merges those objects (or clusters) whose fusion leads to the minimum increase in the sum of squared errors. For a more detailed description of the clustering method see Legendre and Legendre (1998:329-334) and Kaufmann and Rousseeuw (1990 [2005]:230-234).10

In either agglomerative or divisive hierarchical clustering, the user can specify in advance the number of clusters into which the objects in the data set should be partitioned. If the number of clusters is not known beforehand, a better solution is to represent the process of hierarchical clustering graphically with the help of so-called dendrograms. A dendrogram shows how the objects in the data set are grouped together step by step. The user can therefore see in one representation all the possible clusterings that the procedure would generate depending on the number of clusters that are specified in advance. How to interpret a dendrogram can best be explained with an example. Figure 3.1 shows the dendrogram that has been created on the basis of the normalized dissimilarity matrix in Table 3.8.11

10Powers (1997) notes that Ward’s method is similar to a weighted median method in determining the merging of clusters at each step.

11See Johnson (2008) for a description of how to generate dendrograms in R (R Development Core Team 2010) with the hclust function.

Figure 3.1: Dendrogram for the normalized dissimilarity matrix from Table 3.8 generated with the Ward method of the hclust function in R.

The dendrogram in Figure 3.1 shows the clustering of the English letters in our toy corpus according to their bigram frequencies and the resulting φ values. The underlying dissimilarity matrix has been generated from the similar behavior of symbols with respect to their bigram neighbors. The more similar two symbols are in terms of their bigram frequencies, the lower their dissimilarity value. The clustering in the dendrogram thus indicates which symbols are more similar by grouping them together on a lower level of the hierarchical tree structure. The height axis shows the distance value at which the merging of two objects or clusters has been performed. For individual objects this distance value can be read directly from the dissimilarity matrix. For instance, the symbols a and e are merged at a distance of 0.14894166, which directly corresponds to their dissimilarity value in Table 3.8. The merging points at a higher level are determined by the cost function (the sum of squared errors, see above) of the Ward method. All clusterings for the individual cluster numbers can be inspected in the dendrogram. In the case of two clusters, they would consist of C1 = {a, e} and C2 = {w, n, r}. For three clusters, they fall into C1 = {a, e}, C2 = {w} and C3 = {n, r}. Finally, the four-cluster configuration would yield C1 = {a}, C2 = {e}, C3 = {w} and C4 = {n, r}. The different clusterings are obtained by "cutting" the tree at the position where the desired number of clusters would "fall down". If the tree is cut at height position 0.5, for instance, two clusters would be generated. The interpretation of the cutting point is that no element in the generated clusters is further away from the others than the distance value that the height position indicates. In the example of two clusters, no element in either cluster is further away than 0.5 in terms of its dissimilarity value.12 A dendrogram is in that sense a summary of all possible sets of clusters that the hierarchical clustering procedure would generate.
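
A minimal sketch of how such a dendrogram can be produced in R is given below. It assumes that d_norm is the dist object for all five symbols, computed from the full normalized φ matrix of Table 3.5 as in the earlier sketch; note that newer versions of R name the Ward criterion "ward.D" (or "ward.D2"), whereas older versions simply call it "ward".

    # d_norm: normalized dissimilarities for all five symbols (assumed to have
    # been computed with dist() from the full phi matrix, as sketched above).
    hc <- hclust(d_norm, method = "ward.D")   # older R versions: method = "ward"

    # Dendrogram as in Figure 3.1
    plot(hc, xlab = "", sub = "")

    # "Cutting" the tree into a fixed number of clusters
    cutree(hc, k = 2)     # e.g. {a, e} vs. {w, n, r}
    cutree(hc, k = 3)     # e.g. {a, e}, {w}, {n, r}
    cutree(hc, h = 0.5)   # cut at height 0.5 instead of a fixed cluster number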

3.3.2 Multidimensional scaling

Agglomerative hierarchical clustering methods and dendrograms for their graphical representation yield a grouping of objects in the data sets according to their dissimilarities. As mentioned above, one of the disadvantages of hierarchical clustering techniques is the fact that clusters on a lower level automatically determine the membership of objects to a cluster at a higher level. This leads to the problem that the dissimilarity for individual object pairs that do not belong to the same merge step can no longer be obtained. It is only possible to compare the clusters to which they belong. If the data set contains an inherent structure which spreads across several dimensions, only one of the dimensions can be displayed in the dendrogram. A technique which does not cluster the objects but arranges them according to their similarity in a geometric space (and thus makes it possible to compare the distance of any object pair in the data set) is multidimensional scaling (MDS). Like clustering methods, MDS analyzes the (dis-)similarity of a set of objects. MDS attempts to model such data as distances among points in a (typically two-dimensional) geometric space. It shares some of the motivations with visual analytics: the goal is to have a graphical representation of the structure of the data, which is much easier to interpret than a matrix of numbers and which displays the essential information in the data, thereby smoothing out noise.

There are a number of variants of MDS that have been suggested in the literature. Among other things, they differ in the type of geometry into which the data is mapped, the mapping function, the algorithms for the optimal data representation, etc. (see Borg and Groenen 1997; Kruskal and Wish 1978 for a comprehensive presentation of MDS methods). In what follows, I will give a brief description of classical metric multidimensional scaling (implemented in the function cmdscale() in R), which I will use in Chapters 6 and 7. The dissimilarity matrix of Table 3.8 again serves as the starting point for the computation of the MDS representation. Initially, the data set has five dimensions, which are represented by the five symbols in the matrix. The matrix row vector has a value for each of the five dimensions, which can be interpreted as a point in a five-dimensional geometric space. The idea of MDS is then to reduce the number of dimensions in such a way that the resulting lower dimensional representation maximally corresponds to the distances in the higher dimensional configuration. The goal of the underlying algorithms is to ensure that the distances between the points in the resulting configuration are approximately equal to the dissimilarities. The exact mathematical processes behind the method are beyond the scope of the present work. For this, the interested reader is referred to a textbook on multidimensional scaling (e.g., Borg and Groenen 1997).

12In fact, the value is much lower (somewhere around 0.2) as the two major clusters are split up into more fine-grained clusters at a much lower level.
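
A corresponding sketch of the MDS computation in R, again assuming that d_norm is the normalized dissimilarity object from Table 3.8 (as in the earlier sketches), could look as follows; cmdscale() returns one coordinate per object and dimension, which can then be plotted as in Figure 3.2.

    # Classical (metric) MDS: reduce the dissimilarities to two dimensions.
    mds <- cmdscale(d_norm, k = 2)

    # Plot the configuration with the symbols as point labels (cf. Figure 3.2).
    plot(mds[, 1], mds[, 2], type = "n",
         xlab = "Dimension 1", ylab = "Dimension 2")
    text(mds[, 1], mds[, 2], labels = rownames(mds))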

The interesting aspect of MDS for the present purposes is that it enables the user to inspect the data and to explore their structure visually. The rationale behind the visual representation can best be explained by looking at distances between cities and their graphical display on a map. The cities are also defined in terms of a multidimensional vector, with the number of cities as the number of dimensions and the travel distance from one city to the other as the value for each dimension. The map, on the other hand, is a two-dimensional representation of these distances that still contains all the information of the higher dimensional matrix of distance values. In the case of mapping the distances of cities, there is (at least for idealized straight-line distances on a plane) no loss in the reduction from the higher to the lower dimensional space, as all distances can still be reconstructed from the reduced representation of the map. In general, MDS methods attempt to reduce the higher dimensional space in such a way that the information loss is kept to a minimum.
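
The city example can be tried out directly in R, which ships with the eurodist data set (road distances between 21 European cities); a brief sketch of a two-dimensional MDS solution of these distances essentially redraws the map:

    # Road distances between 21 European cities (built-in data set).
    city_map <- cmdscale(eurodist, k = 2)

    # Plot the cities; the second axis is flipped only so that north ends up at
    # the top -- MDS solutions are determined only up to rotation and reflection.
    plot(city_map[, 1], -city_map[, 2], type = "n", asp = 1,
         xlab = "", ylab = "")
    text(city_map[, 1], -city_map[, 2], labels = rownames(city_map), cex = 0.8)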

Figure 3.2 shows the dissimilarity matrix of Table 3.8, which has been reduced to two dimensions (instead of the five dimensions that the original matrix contains). As in the clustering result in Figure 3.1, the most important distinction in the objects is between the symbols {a, e} and the symbols {w, n, r} in the first dimension of the MDS. Again, the symbols {n, r} form a tighter cluster as opposed to the symbol w, as the larger distance (in the second dimension) indicates. It can also be seen that the symbols a and e are very similar to each other on the first dimension, whereas they exhibit a larger distinction in the second dimension.


Figure 3.2: Classical (metric) multidimensional scaling plot for the normalized dissimilarity matrix of Table 3.8 for two dimensions.

Which aspect of the data the different dimensions reflect cannot be determined automatically; this interpretation is left to the user. In the example of the English bigrams, the first dimension would indicate a distinction between vowels {a, e} and consonants {w, n, r}, with the glide w showing a somewhat intermediate behavior. The interpretation of the second dimension is much more difficult in this example. It has to be kept in mind that the method yields a fixed number of dimensions even if the internal structure of the data would suggest a lower (or higher) dimensional interpretation. The user should therefore select an appropriate number of dimensions to which the high-dimensional vector space is to be reduced, and should have a motivation for choosing this number.
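
One way to support this choice, sketched below under the assumption that d_norm from the earlier sketches is available, is to inspect the eigenvalues that cmdscale() can return: dimensions with large positive eigenvalues carry most of the structure, so a sharp drop after the first one or two eigenvalues suggests that a low-dimensional solution is adequate.

    # Request the eigenvalues alongside the point configuration.
    mds_full <- cmdscale(d_norm, k = 4, eig = TRUE)

    # Eigenvalues and the share of the total (absolute) eigenvalue mass that
    # each dimension accounts for -- a rough, scree-plot-like criterion.
    round(mds_full$eig, 4)
    round(mds_full$eig / sum(abs(mds_full$eig)), 2)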

3.4 Visualization

Section 3.2 has introduced the statistical measure on which the subsequent induction processes are built. The aim of the measure is to abstract away from the absolute frequencies of the individual symbols in order to obtain a value for the strength of association (or correlation) of two symbols. Given these values, we can summarize the relationship between the individual symbols in a square matrix, as shown in Table 3.5. These values already include all the information about the inherent structure in the data as well as its underlying patterns. Yet they are difficult to interpret, mainly for two reasons: (i) the arrangement of the table does not follow from the structure that might be contained in the data (in the example in Table 3.5 the symbols are in alphabetical order); (ii) the interpretation of important distinctions in the results is not provided by the values as such, i.e., their differences are gradual and not clear-cut.13

In this section, I want to present the visualization technique that will be employed in Sections 4 and 6 in order to preprocess and visually represent the results in a way that allows for an easier interpretation of the structure that has been inferred from the data. This section thereby provides an introduction to the emerging field of visual analytics, whose aim is to represent abstract data so that interesting patterns and relations become visible to the researcher.

3.4.1 Why data visualization?

Pictures have always been an appropriate means to communicate information, even before the invention of written language. Pictures can convey a wealth of information and are processed much more quickly than text.14 When looking at the four sets of data in Anscombe’s Quartet (Anscombe 1973), given in Table 3.9, it is hard to see any pattern. One common approach to get a better understanding of the data is to employ statistical measures that try to summarize the individual data points in order to capture a general trend in the data. Four of these measures (mean, standard deviation, correlation and linear regression) are given at the bottom of Table 3.9. They would all suggest that the four data sets in Table 3.9 show the same distribution.

13For this purpose, the test for statistical significance would be a possible means to establish such distinctions. However, it is not useful due to the fact that most associations would be significant.

14The American mathematician John Tukey said that “the greatest value of a picture is when it forces us to notice what we never expected to see” (cited from Fry 2008:1).

Table 3.9: Data for Anscombe’s Quartet.

            Set A              Set B              Set C              Set D
            X      Y           X      Y           X      Y           X      Y
 0         10      8.04       10      9.14       10      7.46        8      6.58
 1          8      6.95        8      8.14        8      6.77        8      5.76
 2         13      7.58       13      8.74       13     12.74        8      7.71
 3          9      8.81        9      8.77        9      7.11        8      8.84
 4         11      8.33       11      9.26       11      7.81        8      8.47
 5         14      9.96       14      8.10       14      8.84        8      7.04
 6          6      7.24        6      6.13        6      6.08        8      5.25
 7          4      4.26        4      3.10        4      5.39       19     12.50
 8         12     10.84       12      9.13       12      8.15        8      5.56
 9          7      4.82        7      7.26        7      6.42        8      7.91
10          5      5.68        5      4.74        5      5.73        8      6.89

mean        9.00   7.50        9.00   7.50        9.00   7.50        9.00   7.50
std         3.32   2.03        3.32   2.03        3.32   2.03        3.32   2.03
corr           0.82               0.82               0.82               0.82
lin. reg.   y = 3.00 + 0.500x  y = 3.00 + 0.500x  y = 3.00 + 0.500x  y = 3.00 + 0.500x

Another way to get an overview of the data sets is to plot all data points in a graph. Figure 3.3 shows the scatter plots of all four data sets together with their linear regression lines. A quick look at the graphs helps us to get a clear picture of how the data points are distributed. In fact, the graphs clearly show that the data sets are quite different from one another.
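
Both observations can be verified directly in R, which includes Anscombe’s data as the built-in data set anscombe (columns x1-x4 and y1-y4). The following sketch computes the summary statistics of Table 3.9 and draws scatter plots with regression lines in the spirit of Figure 3.3.

    # Summary statistics per data set: they are (nearly) identical (cf. Table 3.9).
    stats <- sapply(1:4, function(i) {
      x <- anscombe[[paste0("x", i)]]
      y <- anscombe[[paste0("y", i)]]
      coefs <- coef(lm(y ~ x))
      c(mean_x = mean(x), sd_x = sd(x), mean_y = mean(y), sd_y = sd(y),
        corr = cor(x, y), intercept = unname(coefs[1]), slope = unname(coefs[2]))
    })
    colnames(stats) <- paste("Set", LETTERS[1:4])
    round(stats, 2)

    # Scatter plots with regression lines: the four sets look very different.
    op <- par(mfrow = c(2, 2))
    for (i in 1:4) {
      x <- anscombe[[paste0("x", i)]]
      y <- anscombe[[paste0("y", i)]]
      plot(x, y, main = paste("Set", LETTERS[i]))
      abline(lm(y ~ x))
    }
    par(op)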

Human beings are intensely visual creatures that can easily interpret and extract meaning from visual representations while they have a hard time detecting patterns among rows of numbers. For that reason, visualizing data can help us to reveal trends and patterns in huge amounts of information and is the fastest way to communicate them to others (Murray 2010:1). The emerging field of visual analytics, whose aim is to graphically represent information so that interesting patterns and relations become visible at a glance, provides the required methodology to make patterns more easily accessible to human perception.

3.4.2 Visual analytics

Thomas and Cook (2005) define visual analytics as “the science of analytical reasoning