5.1.2 Subspace Clustering Algorithms


Given a set X of data points in some multidimensional space D, a subspace clustering algorithm aims to find a subset Xk of data points together with a subset Dk of dimensions such that the points in Xk are closely clustered in the subspace defined by Dk.
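Stated formally, the task is to find pairs (Xk, Dk) with Xk ⊆ X and Dk ⊆ D such that the projection of the points in Xk onto the dimensions in Dk forms a dense, tightly clustered set; different pairs may involve different, possibly overlapping, sets of dimensions.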

The most critical part of subspace clustering is the subspace generation. Given a d-dimensional space, there are 2^d possible subsets of dimensions (for d = 30 already more than 10^9), so it is computationally infeasible to examine each possible subset to find subspaces of interest for a predefined pattern. Since exhaustive search is clearly not viable, every algorithm is based on some kind of heuristic that speeds up the search in this huge combinatorial space. A number of subspace clustering algorithms with strategies for narrowing down the search space have been proposed in the past, some of which are enumerated in Section 2.3.1. As suggested by Parsons et al. [110], the existing algorithms can be categorized into bottom-up and top-down strategies.

The bottom-up approaches exploit a so-called "downward closure property" (or monotonicity property): if a subspace S contains a cluster, then any subspace T ⊆ S must also contain a cluster. The property is used for pruning: if a subspace T does not have high enough density, then any superspace S with T ⊆ S can be excluded from the search space. A common implementation of a bottom-up approach starts from one-dimensional dense subspaces and iteratively considers an increasing number of dimensions, combining adjacent dense units until no more new dense units are found.

A typical algorithm has three major steps (a minimal sketch of this strategy follows the list):


1. generate dense units (candidate subspaces) using an Apriori-like approach;

2. assign cluster membership to each object;

3. remove outliers whose distance to the cluster center exceeds a critical value.
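To make this concrete, the following minimal Python sketch performs step 1, the Apriori-like generation of dense units on a regular grid, in the spirit of grid-based bottom-up algorithms; cluster formation by merging adjacent dense units (step 2) and outlier removal (step 3) are omitted for brevity. The function and parameter names (dense_units, bins, tau) are our own choices for illustration.

from collections import defaultdict
from itertools import combinations

def dense_units(points, bins=5, tau=0.05):
    # points: list of d-dimensional tuples with coordinates in [0, 1);
    # bins:   grid resolution per dimension; tau: density threshold as a
    # fraction of all points. A unit is a frozenset of (dimension, interval)
    # pairs. By the downward closure property, a (p+1)-dimensional unit can
    # only be dense if every p-dimensional subset of it is dense.
    n, d = len(points), len(points[0])
    min_count = tau * n

    def count(unit):
        return sum(all(int(pt[dim] * bins) == iv for dim, iv in unit)
                   for pt in points)

    # one-dimensional dense units
    current = set()
    for dim in range(d):
        hist = defaultdict(int)
        for pt in points:
            hist[int(pt[dim] * bins)] += 1
        current |= {frozenset([(dim, iv)])
                    for iv, c in hist.items() if c >= min_count}

    dense, p = set(current), 1
    while current:
        # join pairs of p-dimensional units into (p+1)-dimensional candidates,
        # pruning every candidate that has a non-dense p-dimensional subset
        candidates = set()
        for a, b in combinations(current, 2):
            u = a | b
            if (len(u) == p + 1 and len({dim for dim, _ in u}) == p + 1
                    and all(frozenset(s) in current
                            for s in combinations(u, p))):
                candidates.add(u)
        current = {u for u in candidates if count(u) >= min_count}
        dense |= current
        p += 1
    return dense

The pruning in the candidate loop is where the downward closure property pays off: the vast majority of the 2^d dimension subsets are never examined because one of their lower-dimensional projections already failed the density test.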

The top-down approach starts with an initial configuration in which the data is clustered using the full feature space with equally weighted dimensions. Each dimension is then assigned a weight for each cluster, characterizing the relevance of the dimension to that cluster. Subsequently the annotated clusters are re-clustered taking into account the weights assigned in the preceding step. Since the approach involves multiple iterations of re-clustering over the full set of dimensions, sampling techniques are typically used to improve performance.
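A minimal Python sketch of one such top-down iteration, assuming a NumPy array X and an initial labeling; the inverse-variance weighting used here is one plausible choice of relevance measure, not the scheme of any specific published algorithm.

import numpy as np

def topdown_step(X, labels, k, eps=1e-9):
    # One re-clustering pass: weight each dimension per cluster by the
    # inverse of the cluster's spread along it, normalize the weights,
    # and reassign every point to the centroid with the smallest
    # weighted distance. Assumes every cluster is non-empty.
    centers = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    spread = np.stack([X[labels == j].var(axis=0) + eps for j in range(k)])
    weights = 1.0 / spread
    weights /= weights.sum(axis=1, keepdims=True)
    dist = np.stack([((X - centers[j]) ** 2 * weights[j]).sum(axis=1)
                     for j in range(k)], axis=1)
    return dist.argmin(axis=1), weights

Iterating this step until the labels stabilize approximates the scheme described above; dimensions whose weight is near zero for a cluster are effectively excluded from that cluster's subspace.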

Both families of approaches require some form of parametrization. Bottom-up approaches generally require the specification of a density threshold and a bin size. Top-down approaches require the desired number of clusters (similar to k-means) and the average number of dimensions included in a subspace.

In this chapter we use Proclus, which is one of the most established algorithms and has demonstrated advantages over a number of subspace clustering techniques [102]. Proclus [4] takes a top-down approach and extends the traditional k-medoid clustering algorithm. The k-medoid algorithm starts with an initial partition and then iteratively assigns objects to medoids, computes the quality of the clustering, and improves the partition and the medoids. Proclus extends k-medoid by associating each medoid with a subspace and improving both partitions and subspaces iteratively.
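As a point of reference, a minimal sketch of the k-medoid loop that Proclus builds on; a random-swap improvement stands in here for the full swap search of classical k-medoid/PAM.

import numpy as np

def k_medoid(X, k, iters=100, rng=np.random.default_rng(0)):
    # Keep a swap of one medoid against a random point whenever it lowers
    # the total L1 distance of all points to their nearest medoid.
    def cost(ms):
        d = np.abs(X[:, None, :] - X[ms][None, :, :]).sum(axis=2)
        return d.min(axis=1).sum()
    medoids = rng.choice(len(X), size=k, replace=False)
    best = cost(medoids)
    for _ in range(iters):
        cand = medoids.copy()
        cand[rng.integers(k)] = rng.integers(len(X))
        c = cost(cand)
        if c < best:
            medoids, best = cand, c
    return medoids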

Taking two input parameters, the number of clusters k and the average number of dimensions l, the algorithm proceeds in three phases. (1) In the initialization phase a set of k medoid candidates is selected by picking a representative sample from the entire data and choosing the medoids from the representatives using a greedy method. (2) In the iterative phase the medoids are improved and a subspace for each medoid is computed.

This is done by going through the following steps. First, a random set of k medoids is selected from the representatives and the optimal set of dimensions is determined for each medoid. Then all objects are assigned to the nearest medoid. If the current clustering is better than the previous one, it is kept. The bad medoids are then determined and replaced with random representatives, and these steps are repeated until the clustering no longer changes. (3) In the last phase, the cluster refinement phase, once the best medoids are found, the clustering is improved by determining optimal dimension sets for the medoids and reassigning the objects to clusters. Algorithm 1 presents the pseudocode from [4], describing the algorithmic steps in more detail.
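The greedy method used in the initialization phase (GREEDY in Algorithm 1 below) is a farthest-point heuristic: starting from a random point, it repeatedly adds the sample point farthest from the points already chosen, so that the medoid candidates are well spread out. A minimal sketch, assuming a NumPy sample array S; the L1 metric is our choice, made for consistency with the rest of this section.

import numpy as np

def greedy(S, m, rng=np.random.default_rng(0)):
    # Farthest-point selection: maintain for every point its distance to
    # the nearest already-chosen point and always pick the maximizer.
    chosen = [int(rng.integers(len(S)))]
    dist = np.abs(S - S[chosen[0]]).sum(axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.abs(S - S[nxt]).sum(axis=1))
    return S[chosen]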

A number of reviews and surveys compare and classify the subspace clustering approaches. The survey by Parsons et al. [110] mentioned above organizes the techniques in a hierarchy of algorithmic strategies and provides a small experiment on representative algorithms of each class. Kriegel et al. present a more thorough systematization in an updated survey [90], where the broader problem of clustering high-dimensional data is discussed. The recent work of Müller et al. [102] presents a systematic evaluation of subspace clustering algorithms in terms of the quality of the generated output and performance. According to [102], Proclus is one of the best partitioning algorithms and has a good runtime compared to other techniques; we rely on this and use Proclus in our experiments.

Algorithm 1 PROCLUS(No. of Clusters: k, Avg. Dimensions: l)
{Ci is the ith cluster}
{Di is the set of dimensions associated with cluster Ci}
{Mcurrent is the set of medoids in the current iteration}
{Mbest is the best set of medoids found so far}
{N is the final set of medoids with associated dimensions}
{A, B are constant integers}

/* 1. Initialization Phase: select set of k medoid candidates */
S = random sample of size A·k
M = GREEDY(S, B·k)

/* 2. Iterative Phase: improve medoids and compute subspace for each medoid */
BestObjective = ∞
Mcurrent = random set of medoids {m1, m2, ..., mk} ⊆ M
repeat
    /* Approximate the optimal set of dimensions */
    for each medoid mi ∈ Mcurrent do
        let δi be the distance from mi to the nearest other medoid
        Li = points in the sphere centered at mi with radius δi
    end for
    L = {L1, ..., Lk}
    (D1, D2, ..., Dk) = FindDimensions(k, l, L)
    /* Form the clusters */
    (C1, ..., Ck) = AssignPoints(D1, ..., Dk)
    ObjectiveFunction = EvaluateClusters(C1, ..., Ck, D1, ..., Dk)
    if ObjectiveFunction < BestObjective then
        BestObjective = ObjectiveFunction
        Mbest = Mcurrent
        compute the bad medoids in Mbest
    end if
    compute Mcurrent by replacing the bad medoids in Mbest with random points from M
until (termination criterion)

/* 3. Cluster Refinement Phase: improve quality of the partitions and subspaces */
L = {C1, ..., Ck}
(D1, D2, ..., Dk) = FindDimensions(k, l, L)
(C1, ..., Ck) = AssignPoints(D1, ..., Dk)
N = (Mbest, D1, D2, ..., Dk)
return N
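For concreteness, the following Python sketch implements the iterative phase of Algorithm 1 in simplified form. FindDimensions here keeps, for each medoid, the l dimensions along which its locality is tightest, whereas the original uses a z-score criterion across all medoids; the locality radius and all point-to-medoid distances use the Manhattan segmental (L1) metric of [4], but EvaluateClusters compares against the medoids rather than the cluster centroids of the original. The function names mirror the pseudocode; everything else is our own simplification.

import numpy as np

rng = np.random.default_rng(0)

def segmental(p, m, dims):
    # Manhattan segmental distance: average L1 distance over dims
    return np.abs(p[dims] - m[dims]).mean()

def find_dimensions(X, medoids, l):
    # Simplified FindDimensions: collect the points within the sphere
    # reaching to the nearest other medoid (the locality L_i) and keep the
    # l dimensions with the smallest average distance to the medoid.
    dims = []
    for i, m in enumerate(medoids):
        others = np.delete(medoids, i, axis=0)
        delta = np.abs(others - m).sum(axis=1).min()
        local = X[np.abs(X - m).sum(axis=1) <= delta]
        dims.append(np.argsort(np.abs(local - m).mean(axis=0))[:l])
    return dims

def assign_points(X, medoids, dims):
    # Assign every point to the medoid with the smallest segmental distance
    d = np.array([[segmental(p, m, Di) for m, Di in zip(medoids, dims)]
                  for p in X])
    return d.argmin(axis=1)

def evaluate_clusters(X, labels, medoids, dims):
    # Average within-cluster segmental distance; lower is better
    return sum(segmental(p, medoids[c], dims[c])
               for p, c in zip(X, labels)) / len(X)

def iterative_phase(X, k, l, iters=20):
    idx = rng.choice(len(X), size=k, replace=False)
    best_score, best_idx, best_dims, best_labels = np.inf, idx, None, None
    for _ in range(iters):
        medoids = X[idx]
        dims = find_dimensions(X, medoids, l)
        labels = assign_points(X, medoids, dims)
        score = evaluate_clusters(X, labels, medoids, dims)
        if score < best_score:
            best_score, best_idx = score, idx
            best_dims, best_labels = dims, labels
        # treat the medoid of the smallest cluster as the bad one and
        # replace it with a random point, as in the iterative phase
        counts = np.bincount(best_labels, minlength=k)
        idx = best_idx.copy()
        idx[counts.argmin()] = rng.integers(len(X))
    return best_idx, best_dims, best_labels

On a toy dataset with two clusters hidden in different dimension subsets, e.g.

X = rng.normal(size=(400, 10))
X[:200, [1, 4, 7]] = rng.normal(0.0, 0.05, size=(200, 3))
X[200:, [0, 2]] = rng.normal(3.0, 0.05, size=(200, 2))
idx, dims, labels = iterative_phase(X, k=2, l=3)

the recovered dimension sets should typically concentrate on {1, 4, 7} and {0, 2}, respectively.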