Simple approaches - Clustering with Spectral Methods

Considering problem 4.1, we see that the set of entities together with the weighted binary relation forms a graph with an edge weight function. Therefore we restrict ourselves to the following problem:

Problem 4.2 (Graph clustering)

Input: a graph G with an edge weight s and a cost function c on the partitions of G

Output: a partition P with minimal cost with respect to all possible partitions of G

Note that this problem description is more flexible with respect to the binary relation. The general cluster problem usually involves a complete relation on the entity set. The graph structure offers several possibilities to compact this information without loss. For example, a partial relation may be completed by calculating shortest paths. We also use the following additional notation:

Definition 4.3

Let G = (V, E) be a graph with an edge weight s and P = (C₁, . . . , C_r) a partition of G. An edge which has its source and its target in different components is called an inter–cluster edge. The collection of all such edges is denoted by I_P and its weight (with respect to s) is defined by:

$$s(I_P) = \sum_{e \in I_P} s(e)$$

The quotient

$$\frac{s(I_P)}{\sum_{e \in E} s(e)}$$

is the ratio of uncovered weight to the total weight. Edges which are not inter–cluster edges are cluster edges or inner–cluster edges.
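The two quantities of Definition 4.3 can be computed directly from an edge list. The following is a minimal sketch, assuming the graph is given as a dictionary from vertex pairs to weights and the partition as a dictionary from vertices to component indices; the function name and the sample data are hypothetical and only serve as illustration.

```python
def inter_cluster_ratio(edges, cluster_of):
    """Return (s(I_P), s(I_P) / total weight).

    edges      -- dict mapping a vertex pair (u, v) to its weight s(e)  (assumed format)
    cluster_of -- dict mapping each vertex to the index of its component
    """
    total = sum(edges.values())
    # An edge is an inter-cluster edge if its end vertices lie in different components.
    inter = sum(w for (u, v), w in edges.items() if cluster_of[u] != cluster_of[v])
    return inter, inter / total


# Hypothetical example: a path a - b - c - d split into {a, b} and {c, d}.
edges = {("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "d"): 4.0}
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
inter_weight, ratio = inter_cluster_ratio(edges, cluster_of)
print(inter_weight, ratio)
```

Only the edge (b, c) is an inter–cluster edge here, so the uncovered weight is 1 out of a total of 8.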

Although the following approaches were seldom stated in terms of graphs in the literature we will use graph notation to describe and discuss them.

Therefore assume as input a graph G = (V, E) with an edge weight s.

Single linkage

Approach 1 (Single linkage)

As additional input let ∆ ∈ R be a threshold value. Denote the set of all edges with weight greater than ∆ by E′. Then the undirected connected components of (V, E′) form the cluster partition.

This method simply eliminates all elements of the binary relation which have a degree of similarity below ∆. It is quite economical since its runtime is O(|V| + |E|) and it needs no additional space. The weak point of this procedure is the choice of ∆, since a bad choice may lead to components that are too small or too large. If a component is too large, then vertices belong to the same component although they have little in common. This effect is called chaining, since one connecting path is sufficient to put two vertices into the same component. Consider the following example.

Figure 4.1 shows the input graph. The degree of similarity is expressed by the thickness of the edges. If we choose the threshold value ∆ too small, then we obtain a single component. But the vertices with a square shape and those with a rhombus shape have little in common. In fact, the only connection is a path containing circle–shaped vertices. A more intuitive partition would use all square–shaped vertices as one component, all circular vertices as another component and all rhombus–shaped vertices as a final component.

Because of this chaining effect, single linkage is seldom used on its own. Often it is used as a preprocessing step. The graph created by single linkage is also called the ∆–threshold graph.
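Approach 1 can be sketched in a few lines: drop every edge whose weight does not exceed ∆ and collect the connected components of what remains. This is a minimal sketch under assumed input formats (a vertex list and a dictionary of weighted vertex pairs); the function name and the example graph are hypothetical. Unlike the description above, it builds an adjacency list for convenience.

```python
from collections import defaultdict


def single_linkage(vertices, edges, delta):
    """Connected components of the threshold graph (edges with weight > delta)."""
    adj = defaultdict(list)
    for (u, v), w in edges.items():
        if w > delta:  # keep only edges above the threshold
            adj[u].append(v)
            adj[v].append(u)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        seen.add(start)
        stack, comp = [start], []
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        components.append(sorted(comp))
    return components


# Hypothetical path graph with one weak edge between 3 and 4.
vertices = [1, 2, 3, 4, 5]
edges = {(1, 2): 0.9, (2, 3): 0.8, (3, 4): 0.2, (4, 5): 0.7}
print(single_linkage(vertices, edges, 0.5))
print(single_linkage(vertices, edges, 0.1))
```

With ∆ = 0.5 the weak edge is removed and two components remain; with ∆ = 0.1 the whole path collapses into one component, which is the chaining effect in miniature.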

Complete linkage

Chaining effects depend on the existence of a few edges with a relatively high degree of similarity. To thwart this we consider more edges. The optimal situation is that every possible edge exists and has a high degree of similarity. This is the idea behind complete linkage, which is best described recursively:

Approach 2 (Complete linkage)

As additional input let ∆ ∈ R be a threshold value. Start with a partition where every single vertex is a component on its own. If there are two components such that their union induces a complete subgraph with edges of weight greater than ∆, then join them into a single component. If no pair of components fulfills this condition, output the current partition.

Figure 4.1: Chaining example

Figure 4.2: An example for a graph with non disjoint maximal complete subgraphs

Note that this approach is not equivalent to finding a maximal complete subgraph with a certain edge weight, since these maximal complete subgraphs are in general not disjoint.

Consider the graph in figure 4.2.

The maximal complete subgraphs are {1,2,3} and {3,4,5}, but they are not disjoint. Therefore clustering by maximal complete subgraphs would require overlapping partitions.

But this also has a drawback. Overlapping components may have many vertices in common, and this is usually not desired. Figure 4.3 shows such a graph. Every maximal complete subgraph contains {0,1,2,3,4,5}. The bold edges are only a visual aid; they are exactly the edges which connect two elements of {0,1,2,3,4,5}.

Figure 4.3: Example of a graph where maximal complete subgraphs have many vertices in common
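The overlap of maximal complete subgraphs can be checked mechanically. The sketch below enumerates maximal complete subgraphs with the basic Bron–Kerbosch recursion (a standard technique, not one named in the text). The edge set is an assumption: figure 4.2 is not reproduced here, so the graph is taken to be two triangles sharing vertex 3, which is consistent with the maximal complete subgraphs {1,2,3} and {3,4,5} stated above. Edges are assumed to be already filtered by the weight threshold.

```python
def maximal_cliques(adj):
    """Enumerate all maximal complete subgraphs (basic Bron-Kerbosch, no pivoting).

    adj -- dict mapping each vertex to the set of its neighbours
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already processed vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Assumed edge set: two triangles sharing vertex 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
cliques = maximal_cliques(adj)
shared = set(cliques[0]) & set(cliques[1])
print(cliques, shared)
```

Vertex 3 belongs to both maximal complete subgraphs, so no partition of the vertex set can contain both of them as components.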

Complete linkage is more complex and requires more run time. Testing whether the union of two partition components P₁, P₂ forms a complete subgraph requires O(|P₁| · |P₂|) time. A straightforward implementation performs at most |V| − 1 joins, inspects O(|V|²) pairs of components before each join and spends O(|V|²) time per test, which implies that complete linkage runs in O(|V|⁵). It needs no additional space. As with single linkage, one difficulty is the choice of ∆. A poor choice may lead to components of relatively small size.
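Approach 2 can be sketched as a greedy join loop. This is a minimal sketch, assuming weights are stored per unordered vertex pair and that a missing pair means "no edge"; the function name and example graph are hypothetical. Because every component is already a complete subgraph above the threshold, only the pairs between the two components need to be tested, which is the O(|P₁| · |P₂|) check mentioned above.

```python
from itertools import combinations


def complete_linkage(vertices, weight, delta):
    """Greedily join components whose union is complete with all weights > delta.

    weight -- dict mapping frozenset({u, v}) to s(e); absent pairs are non-edges
    """
    def joinable(a, b):
        # a and b are each complete already, so only cross pairs are checked.
        return all(weight.get(frozenset((u, v)), float("-inf")) > delta
                   for u in a for v in b)

    components = [[v] for v in vertices]  # every vertex starts on its own
    joined = True
    while joined:
        joined = False
        for i, j in combinations(range(len(components)), 2):
            if joinable(components[i], components[j]):
                components[i] = components[i] + components[j]
                del components[j]
                joined = True
                break  # restart the scan with the updated partition
    return [sorted(c) for c in components]


# Hypothetical graph: two triangles sharing vertex 3, all weights 1.0.
pairs = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
weight = {frozenset(p): 1.0 for p in pairs}
result = complete_linkage([1, 2, 3, 4, 5], weight, 0.5)
print(result)
```

The result depends on the order in which pairs are inspected: here vertex 3 ends up with {1, 2}, but a different scan order could just as well attach it to {4, 5}.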

Single and complete linkage are opposed in their methods. Both have difficulties with certain situations. In the next section we will consider such problems. Before we move on we briefly show some other approaches. These use a restricted version of problem 4.2, namely:

Problem 4.4 (k–graph clustering)

Input: a graph G with an edge weight s, an integer k and a cost function c on the partitions of G

Output: a partition P of size k with minimal cost with respect to all possible partitions of G with size k

Note the following difference: in problem 4.4 the size of the final partition is fixed. Sometimes the problem is weakened so that k only bounds the size of the partition, either as an upper bound or as a lower bound. The following functions are often considered in the literature:

• maximal covered weight:

As cost function we use the number of inter–cluster edges. This implies that the final partition has as few inter–cluster edges as possible. A weighted version uses the weight of the inter–cluster edges as cost. Minimising this function implies that the sum of degrees of similarity between vertices which are adjacent but in different cluster components is as low as possible.

• inter–cluster distance:

Here we use the average or the maximal weight of the inter–cluster edges as cost. The intentions are similar to maximal covered weight.

• minimal diameter:

The diameter d of a graph is the minimal number such that every pair of vertices can be connected by a path with at most d edges. As cost function we choose the maximum diameter of all partition components. A similar approach is the content of [HS00].

• minimal average distance:

The cost is the sum, over all cluster components, of the average distance between two vertices of the component. This distance can simply be the minimal number of edges connecting them or the weight of such a path.
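As an illustration of one of these cost functions, the minimal diameter cost can be evaluated with a breadth-first search inside each component. This is a minimal sketch, assuming distances are counted in edges and paths must stay inside the component; returning infinity for a disconnected component is a design choice of this sketch, not something the text prescribes. All names and the example graph are hypothetical.

```python
from collections import deque


def hop_distances(adj, source, allowed):
    """Breadth-first distances (in edges) from source, staying inside `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y in allowed and y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def max_component_diameter(adj, partition):
    """Cost of a partition: the largest diameter among its components."""
    worst = 0
    for comp in partition:
        allowed = set(comp)
        for v in comp:
            dist = hop_distances(adj, v, allowed)
            if len(dist) < len(allowed):
                return float("inf")  # component is not connected (sketch's convention)
            worst = max(worst, max(dist.values()))
    return worst


# Hypothetical path graph 1 - 2 - 3 - 4 - 5.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(max_component_diameter(adj, [[1, 2, 3], [4, 5]]))
print(max_component_diameter(adj, [[1, 2], [3], [4, 5]]))
```

The first partition has cost 2 because of the component {1, 2, 3}; splitting off vertex 3 lowers the cost to 1 at the price of a larger partition.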

All approaches have in common that they try to ensure that the cluster components are highly connected. Every method measures this in its own way. For more details and other procedures see also [Fas99] and [Gon85].
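For very small graphs, problem 4.4 can be solved by exhaustive search, which makes the problem statement concrete. The sketch below is not one of the procedures from the cited literature: it simply tries every assignment of vertices to k components and uses the weight of the inter–cluster edges as cost. Its running time is exponential in |V|, and the function name and example data are hypothetical.

```python
from itertools import product


def best_k_partition(vertices, edges, k):
    """Exhaustive search for a partition of size k with minimal inter-cluster weight."""
    best_cost, best = float("inf"), None
    for labels in product(range(k), repeat=len(vertices)):
        if len(set(labels)) != k:
            continue  # the partition must have exactly k non-empty components
        cluster_of = dict(zip(vertices, labels))
        cost = sum(w for (u, v), w in edges.items()
                   if cluster_of[u] != cluster_of[v])
        if cost < best_cost:
            best_cost, best = cost, cluster_of
    return best_cost, best


# Hypothetical path graph with one weak edge between 2 and 3.
vertices = [1, 2, 3, 4]
edges = {(1, 2): 3.0, (2, 3): 1.0, (3, 4): 4.0}
cost, assignment = best_k_partition(vertices, edges, 2)
print(cost, assignment)
```

For k = 2 the search cuts the weakest edge and separates {1, 2} from {3, 4}.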
