
4.3 Hierarchical Clustering Procedures

The category of hierarchical clustering procedures can initially be subdivided into agglomerative and divisive procedures. The agglomerative procedures start at an initial partition in which each observation forms a separate cluster; new clusters are obtained by joining observations, and then clusters of observations, consecutively. The divisive procedures start at a partition with just one cluster containing all observations and divide this cluster into an increasing number of clusters in the following steps. The agglomerative hierarchical algorithms differ mainly in the way the distance matrix D is used. All agglomerative hierarchical procedures start by joining into one cluster those two observations that are separated by the shortest distance, i.e. the smallest value $d_{i,j}$ in the distance matrix D. After the first clustering step, a new distance matrix D exists, which is of dimension n−1 and contains the distances between all observations as well as between the cluster of the two joined observations and all other observations. The way this distance between the cluster(s) and the other observations or clusters is calculated is the source of differentiation between the algorithms. A selection of the most frequently used agglomerative algorithms follows.
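To make the first step concrete, the following short sketch (in Python with NumPy and SciPy, rather than the XploRe environment used for the analysis in this paper; all variable names are illustrative) builds a Euclidean distance matrix D for a small random data set and locates its smallest off-diagonal entry, i.e. the first pair of observations to be joined.

```python
# Illustrative sketch: construct the distance matrix D and find the closest pair.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                   # 10 observations, 3 variables

D = squareform(pdist(X, metric="euclidean"))   # n x n distance matrix D
np.fill_diagonal(D, np.inf)                    # ignore the zero self-distances
i, j = np.unravel_index(np.argmin(D), D.shape) # smallest d_ij -> first cluster
print(f"first merge: observations {i} and {j} at distance {D[i, j]:.3f}")
```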

4.3.1 Single Linkage Algorithm

The single linkage method is also called the nearest-neighbor method, because the new distance between a cluster and an observation is calculated as the minimum of the distances between each observation within the cluster and the observation outside the cluster. If objects i and j are joined in one cluster, the new distance between this cluster and object l is calculated in the single linkage method the following way:22

$$ d_{ij,l} = \min(d_{i,l}, d_{j,l}) \qquad (11) $$

The single linkage algorithm combines the two objects or clusters with the smallest distance between their closest neighbors. The main shortcoming of this algorithm is the so-called chaining property: single linkage tends to build overly large groups due to its weakness in detecting poorly separated clusters. On the other hand, this property helps to detect outliers.
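As an illustration of the update rule in equation (11), the following hedged sketch (the function name merge_single is hypothetical and not taken from the paper) computes the new row of distances between the cluster {i, j} and all remaining objects.

```python
# Single linkage update of equation (11): d_{ij,l} = min(d_{i,l}, d_{j,l}).
import numpy as np

def merge_single(D, i, j):
    """Distances from the new cluster {i, j} to every object, as the elementwise minimum."""
    return np.minimum(D[i], D[j])

# tiny example with four objects
D = np.array([[ 0.0,  2.0, 6.0, 10.0],
              [ 2.0,  0.0, 5.0,  9.0],
              [ 6.0,  5.0, 0.0,  4.0],
              [10.0,  9.0, 4.0,  0.0]])
print(merge_single(D, 0, 1))   # [0. 0. 5. 9.]: cluster {0,1} is 5 away from object 2, 9 from object 3
```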

4.3.2 Complete Linkage Algorithm

The complete linkage algorithm, in contrast, combines into a new cluster those objects or clusters which have the shortest distance in the distance matrix D, the latter now being calculated as the furthest distance between the observations within one cluster and an outside observation. Hence, it is also called the furthest neighbor method. The new distance to object l after joining objects i and j is computed by23

$$ d_{ij,l} = \max(d_{i,l}, d_{j,l}) \qquad (12) $$

Both the single linkage and the complete linkage algorithms are independent of the distance measure used in the process as long as the ordering of the distances is preserved. A discussion of the different algorithms can be found in Härdle & Simar (2002), Gordon (1999) or Johnson & Wichern (1998).24
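The invariance with respect to order-preserving transformations of the distances can be checked numerically: squaring all pairwise distances is a monotone transformation and should leave the merge sequence of single and complete linkage unchanged (assuming no tied distances). The snippet below is only an illustrative sketch using SciPy's linkage routine on random data, not part of the paper's analysis.

```python
# Check: single and complete linkage depend only on the ordering of the distances.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
d = pdist(X)                                   # condensed distance vector

for method in ("single", "complete"):
    Z1 = linkage(d, method=method)             # original distances
    Z2 = linkage(d ** 2, method=method)        # monotone transformation of the distances
    print(method, "merge order unchanged:",
          np.array_equal(Z1[:, :2], Z2[:, :2]))
```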

4.3.3 Ward’s Error Sum of Squares Method

Ward (1963) proposed a clustering algorithm that is not based on joining those objects or clusters with the smallest distance between them, but those for which the loss of information resulting from grouping observations or clusters is smallest, based on the deviations of every observation from the mean of its

22If two clusters are joined, the resulting distance would be $d_{ij,lm} = \min(d_{i,l}, d_{j,l}, d_{i,m}, d_{j,m})$. A detailed derivation of the distance matrices with simple and illustrative examples is given in Dillon & Goldstein (1984), pp. 168ff.

23The distance $d_{ij,lm}$ would be calculated correspondingly, as in the footnote above.

24A computational approach using the software environment XploRe, which is also utilized for this analysis, can be found in Mucha & Sofyan (2000).

cluster (Dillon & Goldstein (1984)). For the Ward method, the underlying distance measure is the within-cluster sum of squares calculated by

$$ d_{ij}^2 = \sum_{k=1}^{p} (x_{jk} - v_{ik})^2 \qquad (13) $$

with the following notation:

c, number of clusters (i = 1, 2, ..., c)
n, number of objects to be classified (i.e. countries, j = 1, 2, ..., 192)
p, number of variables (i.e. k = 1, 2, ..., 9)
x_{jk}, value of the kth variable on observation j
v_{ik}, the cluster mean of the kth variable in cluster i.25

The resulting dissimilarity index from the Ward procedure is the Ward variance obtained after every step in the clustering process, i.e. after joining clusters consecutively:

$$ E = \sum_{i=1}^{c} \sum_{j \in C_i} \sum_{k=1}^{p} (x_{jk} - v_{ik})^2 \qquad (14) $$

This measure E is called the error sum of squares. According to the Ward procedure, those two observations or clusters are combined into one cluster whose merger increases the value of E the least.26

The Ward procedure differs from the two linkage algorithms described above in that it does not combine those observations or clusters that are separated by the shortest distance, but those whose combination has the least impact on a combined measure of within-cluster homogeneity. The Ward method is preferred in most empirical research and is far less sensitive to the chaining mentioned above.

25It is calculated the following way: $v_{ik} = \frac{1}{n_i} \sum_{j \in C_i} x_{jk}$, where $n_i$ is the number of observations in cluster i and $j \in C_i$ denotes the observations contained in cluster i. The notation is taken from Romesburg (1984) but slightly adapted to be consistent with the notation of this paper.

26An easy-to-follow empirical approach is given in Backhaus, Erichson, Plinke & Weiber (1996).
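As a rough sketch of the Ward criterion (again in Python rather than XploRe, with hypothetical helper names such as ess and best_ward_merge), the following code evaluates, for every candidate pair of current clusters, the increase in the error sum of squares E caused by merging them, and returns the pair with the smallest increase.

```python
# Ward criterion: merge the pair of clusters whose union increases E the least.
import numpy as np
from itertools import combinations

def ess(X):
    """Within-cluster sum of squares of the rows of X around their mean."""
    return ((X - X.mean(axis=0)) ** 2).sum()

def best_ward_merge(X, clusters):
    """clusters: list of index arrays; returns (increase in E, a, b) for the best merge."""
    best = None
    for a, b in combinations(range(len(clusters)), 2):
        merged = np.concatenate([clusters[a], clusters[b]])
        delta = ess(X[merged]) - ess(X[clusters[a]]) - ess(X[clusters[b]])
        if best is None or delta < best[0]:
            best = (delta, a, b)
    return best

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
clusters = [np.array([i]) for i in range(6)]   # initial partition: every observation alone
print(best_ward_merge(X, clusters))
```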

4.3.4 Graphical device: The Dendrogram

There exists a useful graphical device to represent the stepwise clustering process of all three algorithms defined above, which is called the dendrogram and looks as follows (for a simple random example of eight observations):

[Figure 1: Example of a Dendrogram. Eight observations on the horizontal axis; vertical axis: Distance or E]

The horizontal axis contains the observations that are joined sequentially, and the vertical axis gives the value of the distance between the observations or clusters that are joined in the case of the linkage algorithms, and the value of E, the error sum of squares after the respective clustering step, in the case of the Ward approach. The dendrogram thus shows in an illustrative way which observations and clusters are combined at which stage.
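A dendrogram such as the one sketched in Figure 1 can be produced, for instance, with SciPy's hierarchy module; the snippet below is only an illustrative example with eight random observations, not the data analysed in this paper.

```python
# Plot a dendrogram for eight random observations (illustrative only).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))                    # eight random observations

Z = linkage(X, method="ward")                  # or "single" / "complete"
dendrogram(Z, labels=np.arange(1, 9))          # leaves on the x-axis, merge heights on the y-axis
plt.ylabel("Distance or E")
plt.title("Example of a Dendrogram")
plt.show()
```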

4.3.5 Other Agglomerative Hierarchical Procedures

Several other agglomerative hierarchical procedures have been proposed. However, most of them are used less frequently in practice.

One algorithm that is often mentioned in the literature is the average linkage algorithm, which takes the distance between two clusters as the average of the distances between all items in each of them.

Other algorithms are the Centroid algorithm, the Median algorithm and the flexible method. Without going into further detail, their basic properties shall be given shortly. The way the distance between two objects or clusters that are to be grouped is measured depends on the algorithm, as described above.

If clusters or observations i and j are joined in one cluster, its distance to the observation or group l can generally be written as

$$ d_{ij,l} = \delta_1 d_{i,l} + \delta_2 d_{j,l} + \delta_3 d_{i,j} + \delta_4 \, |d_{i,l} - d_{j,l}| \qquad (15) $$

with notation as above (Mucha & Sofyan (2000)). In this case, the mentioned algorithms would define the δs the following way:

Algorithm           δ1               δ2               δ3    δ4
Single linkage      1/2              1/2              0     −1/2
Complete linkage    1/2              1/2              0     1/2
Average linkage     n_i/(n_i+n_j)    n_j/(n_i+n_j)    0     0

Table 1: Different Agglomerative Hierarchical Clustering Algorithms27

Bergs (1981) found that the Ward procedure, in general, produced very reasonable partitions and was more efficient than the other algorithms. This is the reason why some algorithms are treated only briefly in this section, and also one reason why the Ward algorithm will mainly be used throughout the empirical analysis that follows in this paper.
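Equation (15) together with the coefficients of Table 1 (this general recurrence is commonly known as the Lance-Williams update formula) can be implemented directly. The sketch below is illustrative only; the average linkage coefficients are those of the standard scheme, and the function name update_distance is hypothetical.

```python
# General update formula (15) with the delta coefficients of Table 1.
def update_distance(d_il, d_jl, d_ij, n_i, n_j, method="single"):
    if method == "single":
        d1, d2, d3, d4 = 0.5, 0.5, 0.0, -0.5
    elif method == "complete":
        d1, d2, d3, d4 = 0.5, 0.5, 0.0, 0.5
    elif method == "average":
        d1, d2, d3, d4 = n_i / (n_i + n_j), n_j / (n_i + n_j), 0.0, 0.0
    else:
        raise ValueError(f"unknown method: {method}")
    return d1 * d_il + d2 * d_jl + d3 * d_ij + d4 * abs(d_il - d_jl)

# With delta_4 = -1/2 the formula collapses to min(d_il, d_jl), with +1/2 to max(d_il, d_jl).
print(update_distance(6.0, 5.0, 2.0, 1, 1, "single"))    # 5.0
print(update_distance(6.0, 5.0, 2.0, 1, 1, "complete"))  # 6.0
```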

4.3.6 Divisive Algorithms

Divisive algorithms work in the opposite direction to the agglomerative methods. They start at an initial partition with just one cluster containing all observations. The algorithms subdivide the clusters in a stepwise optimal process similar to the agglomerative techniques. A discussion of these techniques can be found in Gordon (1999), p. 90. However, the divisive algorithms are found less often in the literature and in empirical work, even though Gordon (1999) mentions that their use should sometimes be preferred, since researchers are mainly interested in larger clusters.

27The table is taken from Mucha & Sofyan (2000) and adjusted to the notation of this paper.