
The focus in this section lies on clustering algorithms. In addition, improvement and repair heuristics are explained. The last part covers other algorithms that are important for the implementation.

Cluster Analysis

Clustering is a process in which a set of data is divided into subsets such that the elements within each subset are similar according to some distance function. These subsets are called clusters and can be seen as groups of similar items. The terms cluster analysis and clustering are used synonymously. Cluster analysis is commonly used in statistics, where the data are usually statistical observations. For a simple illustration of a clustering example, see Figure 3.2.

The elements are typically represented as vectors in n-dimensional space, in which each dimension defines an element’s property.

A distance metric is a function that takes two vectors as input and returns a non-negative number based on the relative positions of the vectors to each other. A common distance metric is the Euclidean distance, where the distance function $d$ for the vectors $x$ and $y$ is defined

Figure 3.2: Cluster analysis example. (a) Original set. (b) Result of the cluster analysis. Elements are clustered according to their color property.

as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Other metrics are, for example, the Manhattan distance, the Hamming distance, or the Mahalanobis distance (used, e.g., where normalization of the vectors is needed).
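As a small illustration (our own, not from the original text), the first three of these metrics can be written for equal-length numeric vectors as follows; the Mahalanobis distance is omitted, since it additionally requires a covariance matrix for the normalization:

```python
import math

# Sketches of the distance metrics mentioned above, for equal-length
# numeric vectors. Function names are our own.
def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # number of positions in which the vectors differ
    return sum(a != b for a, b in zip(x, y))

euclidean((0, 0), (3, 4))      # 5.0
manhattan((0, 0), (3, 4))      # 7
hamming((1, 0, 1), (1, 1, 1))  # 1
```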

Girvan-Newman Clustering

The Girvan-Newman algorithm [6] is based on finding components in a network. It was successfully applied to various social networks; one use case is the detection of online communities.

A simple approach to cluster a network is to detect its connected components. However, since real-world networks are highly connected and usually consist of just one giant component, the betweenness measure is used instead.

Betweenness can be defined on the nodes $V$ and on the edges $E$ of a graph $G = (V, E)$.

In the following, $\sigma_{st}$ denotes the number of shortest paths (SP) between the nodes $s$ and $t$, $\sigma_{st}(v)$ denotes the number of SP between $s$ and $t$ that "run through" the vertex $v$, and $\sigma_{st}(e)$ denotes the number of SP between $s$ and $t$ that "run along" the edge $e$.

This leads to the following formula definitions [9, 4]:

Node betweenness of vertex $v$:
$$\sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

Edge betweenness of edge $e$:
$$\sum_{s \neq t \in V} \frac{\sigma_{st}(e)}{\sigma_{st}}$$
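The edge betweenness definition can be evaluated directly, if inefficiently, by enumerating all shortest paths. The following sketch (our own illustration, assuming a small connected graph given as a dict of neighbour sets; not the thesis implementation) computes the formula above:

```python
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """All simple s-t paths of minimum length (fine for tiny graphs)."""
    found = []
    def dfs(v, path):
        if v == t:
            found.append(tuple(path))
            return
        for w in adj[v]:
            if w not in path:
                dfs(w, path + [w])
    dfs(s, [s])
    best = min(len(p) for p in found)
    return [p for p in found if len(p) == best]

def edge_betweenness_naive(adj):
    """Sum of sigma_st(e) / sigma_st over all node pairs s != t."""
    bet = {tuple(sorted((u, v))): 0.0 for u in adj for v in adj[u]}
    for s, t in combinations(adj, 2):
        sps = all_shortest_paths(adj, s, t)   # sigma_st = len(sps)
        for p in sps:
            for i in range(len(p) - 1):
                bet[tuple(sorted(p[i:i + 2]))] += 1 / len(sps)
    return bet

# On a 4-cycle, each edge carries one direct pair fully and one of the
# two shortest paths of each of the two "diagonal" pairs, giving 2.0:
square = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}
edge_betweenness_naive(square)   # every edge -> 2.0
```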

The Girvan-Newman algorithm is shown as pseudocode in Algorithm 3.1.

The algorithm works as follows: at the beginning, the graph consists of one component. In each iteration the algorithm determines the edge e with the highest betweenness value and removes it. Then the algorithm checks whether the graph can be split, i.e., whether one component can be separated. If this is the case, the original component (before splitting), which stands for a cluster, is split into two subclusters that are hierarchically attached to the original cluster. This gives two corresponding subdendrograms. After that, the next iteration is started.

Algorithm 3.1: The Girvan-Newman algorithm.

input : A connected graph G = (V, E)
output: A hierarchical clustering of the graph G

1 while edges left in graph G do
2     calculate betweenness values for edges e ∈ E;
3     delete edge e with highest betweenness;
4     if graph splits into more components then
5         build a new cluster for each component;
6         add each new cluster as subcluster to its parent cluster;
7     end
8 end
9 return cluster hierarchy;

This procedure continues until no edges are left, so that the graph is fully transformed into a dendrogram and the hierarchical clustering process is finished.

Note that no edge weights are considered. Only the shortest paths influence the betweenness.
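The pseudocode above can be sketched as a plain-Python implementation. This is a minimal, unoptimized illustration, not the thesis implementation: the graph representation (dict of neighbour sets), all function names, and the Brandes-style BFS accumulation for the betweenness values are our own choices, and the sketch yields a flat partition per split rather than building a full dendrogram:

```python
from collections import deque

def edge_betweenness(adj):
    """Unweighted edge betweenness: one breadth-first search per source
    node, with back-propagation of path counts. adj: node -> set of
    neighbours."""
    bet = {tuple(sorted((u, v))): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}   # predecessors on shortest paths
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):      # farthest nodes first
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[tuple(sorted((v, w)))] += c
                delta[v] += c
    # each unordered pair (s, t) was counted from both endpoints
    return {e: b / 2 for e, b in bet.items()}

def components(adj):
    """Connected components as frozensets of nodes."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v])
        seen |= comp
        comps.append(frozenset(comp))
    return comps

def girvan_newman(adj):
    """Yield the component partition each time an edge removal splits
    the graph, until no edges are left."""
    adj = {v: set(ws) for v, ws in adj.items()}  # work on a copy
    n = len(components(adj))
    while any(adj.values()):
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)   # edge with highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > n:             # graph split: record new clusters
            n = len(comps)
            yield comps
```

On a toy graph of two triangles joined by a bridge, the bridge has the highest betweenness, so the first recorded partition separates the two triangles.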

To illustrate the Girvan-Newman algorithm a detailed example follows.

Example

Consider an undirected weighted graph $G = (V, E)$ with vertices $V = \{1, \ldots, 11\}$. Its adjacency matrix is given in Figure 3.4, and the graph itself is shown in Figure 3.5.

To calculate the edge betweenness values the first step is to calculate the number of shortest paths between nodes. An efficient method that is based on Breadth-First-Search is described in [6]. The resulting SP values are shown in Table 3.1.

Next, the proportion of the number of SP passing through each edge has to be calculated. For each node pair $(s, t)$ with $s \neq t \in V$ the set of all SP has to be determined, and for each edge $e \in E$ the number of paths that contain $e$ gives the value for $\sigma_{st}(e)$.


Figure 3.4: Adjacency matrix of the example graph. Empty values indicate that there is no edge between the specific nodes (0-values were omitted for readability).

Figure 3.5: The example graph.

Figure 3.6: Both shortest paths between the nodes 1 and 4.

For example, for $s = 1$ and $t = 4$ there are 2 shortest paths (therefore $\sigma_{1,4} = 2$), namely $\langle(1,2),(2,4)\rangle$ and $\langle(1,3),(3,4)\rangle$. This set of shortest paths contains the edges $(1,2)$, $(1,3)$, $(2,4)$ and $(3,4)$. Each of these edges occurs in exactly one shortest path (between 1 and 4), so $\sigma_{1,4}((1,2)) = \sigma_{1,4}((1,3)) = \sigma_{1,4}((2,4)) = \sigma_{1,4}((3,4)) = 1$. Therefore, each of the proportions $\frac{\sigma_{1,4}(e)}{\sigma_{1,4}}$ is $\frac{1}{2}$. See Figure 3.6 for illustration.
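This hand calculation can be checked mechanically. The snippet below rebuilds only the square 1-2-4-3 from Figure 3.6 (our own encoding of that subgraph) and enumerates the shortest paths by brute force:

```python
from collections import Counter

# Neighbour sets of the square subgraph from Figure 3.6.
adj = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}

def all_shortest_paths(s, t):
    """Enumerate all simple s-t paths, keep those of minimum length."""
    found = []
    def dfs(v, path):
        if v == t:
            found.append(tuple(path))
            return
        for w in sorted(adj[v]):
            if w not in path:
                dfs(w, path + [w])
    dfs(s, [s])
    best = min(len(p) for p in found)
    return [p for p in found if len(p) == best]

paths = all_shortest_paths(1, 4)   # [(1, 2, 4), (1, 3, 4)]
sigma = len(paths)                 # sigma_{1,4} = 2
# count how many of these paths run along each edge
on_edge = Counter(tuple(sorted(p[i:i + 2]))
                  for p in paths for i in range(len(p) - 1))
ratios = {e: c / sigma for e, c in on_edge.items()}  # every edge -> 0.5
```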

The resulting betweenness values after the first iteration are shown in Table 3.2, and the graph in Figure 3.7a. The edges $(4, 5)$ and $(7, 8)$ have the highest betweenness value of 56. If binary clustering is desired, only one of these edges is removed; otherwise all edges with the highest value are removed from the graph.

Table 3.1: Number of shortest paths between nodes ($\sigma_{st}$). Only the upper triangular matrix is shown; the other half is symmetric since the graph is undirected.

Table 3.2: Betweenness values after the first iteration.

In this case the graph splits into three components.

After that, the betweenness calculation process starts again (for each component). The exact values for the next iteration are omitted, but the resulting graph is shown in Figure 3.7b.

The result of the clustering process can now be seen in Figure 3.8.

(a) Graph after the first iteration. (b) Graph after the second iteration.

Figure 3.7: Example graph after iterations 1 and 2. Edges with the highest betweenness value (red) are removed at that iteration.

Figure 3.8: Resulting clustering of the example graph represented as a dendrogram.