
5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis

Subspace cluster visualization remains a challenging task because subspace clustering results contain multiple types of information, such as subspaces, the cluster membership of objects, and the overlap between subspaces and clusters. Existing subspace visualization techniques have been detailed in Section 2.4.2. To develop effective visualization systems for subspace cluster analysis, it is necessary to consider the different tasks involved in the data analysis and to use them as a basis for exploring the design space.

Next, we describe the main tasks that an appropriate subspace cluster visualization technique needs to address, and thereby provide a generic and reusable characterization. We also analyze the design space and provide (1) a classification and (2) a reasoned analysis of common design alternatives, from which a baseline design space is derived. This analysis serves as a baseline not only for the design of our proposed subspace cluster visualization system; it also allows a comparison with existing approaches and helps identify empty areas in the design space for future work.

Scope of Subspace Cluster Analysis

Clustering abstracts a larger data set to a smaller number of groups that are presumably more amenable to analysis and interpretation. Standard clustering algorithms rely on a fixed set of dimensions used in the similarity function of the clustering algorithm. Typically, the selection of dimensions is done outside of the clustering algorithm. Subspace clustering methods, on the other hand, provide an extended output that also includes the set of dimensions relevant to finding the groups, possibly described with weights indicating the importance of each dimension for the found result. Depending on the subspace method used, clusters can overlap in both dimensions and records. In principle, analysis of the subspace clustering can be done without considering the identified dimensions. In our work, we are interested in jointly analyzing the clustering results and the sets of selected dimensions, to provide enhanced analysis capabilities.

Tasks

The analysis of properties and relationships within and among clusters is an important task in cluster analysis. We break this general analysis task down into a series of subtasks:

T1 Reveal properties of individual clusters

When analyzing clustering output it is necessary to understand the main features of each generated cluster. In particular, once the clustering output has been generated and a visualization is constructed to represent it, it is necessary to perceive the following information:

T1.1 How many records does the cluster contain?

T1.2 How many dimensions are involved and what are their weights?

T1.3 How are the data values distributed in each of the contained dimensions? (homogeneity of cluster members, central and outlier elements, subgrouping of clusters)

T2 Enable cluster comparison

Once the output has been considered and each cluster has been characterized visually, it is important to display the information in a way that meaningful comparisons can be made among the clusters. It is important to understand how similar (or distant) clusters are, which translates into:

T2.1 How do clusters differ with respect to contained records and involved dimensions?

T2.2 Is there overlap between records and dimensions or are they distinct?

T3 Indicate the quality of the generated cluster output

Subspace clustering algorithms, like many methods that work on multidimensional spaces, are heavily based on heuristics and depend on parameterization. For this reason, clustering outputs are not always optimal. Although research in subspace clustering has greatly improved clustering quality, it is still important to be able to judge the output quality by considering the following:

T3.1 How good is the clustering quality produced by a given algorithm?

T3.2 How sensitive is the output with respect to parameter variations?

We take these task considerations as a baseline for developing the ClustNails system presented in the next Section. While we have not formally evaluated the degree to which ClustNails fulfills each of these criteria, we find that they are at the core of the functionality that ClustNails offers.

Design Space

In terms of the previously described tasks, the information entities of interest to be visualized are: Elements (data records, clusters, dimensions); Relationships (membership of records in clusters, cluster overlap with respect to records and dimensions); and Attributes (cluster size, dimension distribution, dimension weight, etc.).

We identify two main categories of visualization solutions for the representation of the subspace clustering output: Cluster-Centric (CC) and Data-Centric (DC). Cluster-centric solutions focus on the representation of the clusters first, with the intent to allow their comparison. Data-centric solutions focus on the representation of the data values, with the intent to ease the interpretation of each cluster in terms of its internal distributions.

There is a natural tension between these two extremes. Cluster-centric solutions scale much better in terms of number of data items and dimensions. Their higher level of abstraction allows an easier comparison between the cluster features, however, at the expense of limiting their interpretation. On the contrary, data-centric views ease cluster interpretation but do not scale very well with respect to data size and dimensionality.

In our analysis of the design space, we explored several alternative visual designs and isolated some basic ones for both approaches. Discussing them briefly helps motivate our proposed final solution.

Record-Centric Designs

In record-centric designs, each visual item represents a record. A 2D scatterplot projection is often used to identify clusters of data elements in traditional clustering; however, it is not clear how to extend this design so that information about cluster dimensions is included. Parallel coordinates plots (PCP) could in principle be extended to represent subspace clusters by drawing lines between adjacent axes only when both belong to the cluster being drawn. But this creates complicated ordering problems, with extreme cases in which the polyline of a whole record might not be drawn at all because its axes are never adjacent. Also, PCPs do not scale well to data of even moderate dimensionality, which is the main focus of subspace clustering. Heat maps (or matrix/tabular representations) can be extended more easily by using different color scales for included and not included dimensions. In addition, their design allows for easy reordering of records and dimensions so that the structure of the clusters can be more easily perceived.

Cluster-Centric Designs

In cluster-centric designs, each visual item represents a cluster. A 2D scatterplot projection is possible, such as the one presented in VISA [14] (see also Figure 5.7). The clusters are projected with MDS, or similar techniques, taking into account their similarity according to some predefined criteria (e.g., the number of shared dimensions). This solution groups clusters according to their similarity, but their visibility and understanding are often hindered by the amount of overlap between the items. A matrix comparing one cluster to another in terms of their shared dimensions and records is also possible, but its effectiveness depends on how well rows and columns are ordered, and it is not necessarily the most compact design. Finally, icons or glyphs can be used to provide a rich representation of each cluster, so that every single icon conveys information about cluster dimensions, records, and weights in an integrated fashion.

In ClustNails we integrate the best of the two approaches in a multiple-view user interface (see Figure 5.5). A cluster-centric view based on sorted icons provides support for cluster understanding and comparison (T1.1, T1.2, and T2.2). A data-centric view based on sorted and compressed heat maps provides support in interpreting and comparing the clusters in terms of their data distribution (T1.3, T2.1, T2.2). Together, these views in turn help in interpreting the quality of the generated output (T3.1 and T3.2). In the following section, we describe the whole system and its views in detail.

5.1.4 The ClustNails System

ClustNails is designed as an interactive visualization tool for subspace clustering analysis. It integrates a number of subspace clustering algorithms with novel visual representations and ordering techniques to help analysts generate subspace clusters from multidimensional data and identify interesting patterns from the visualization models. We next provide an overview of the design and main functionalities of the system, as well as a detailed description of the visualization and ordering techniques applied.

Overview

ClustNails integrates the OpenSubspace library of Weka [106], which contains a range of subspace clustering algorithms including Clique, Doc, Fires, Proclus, MineClus, INSCY, P3c, Schism, Statpc, and Subclu. The system takes multidimensional data as input, clusters the objects using a user-selected subspace clustering algorithm, and displays the clustering result in a multi-view user interface. A number of ordering functions allow the analyst to examine the results and compare clusters from different perspectives. Various user interactions allow the user to select clustering algorithms, parameters, and the order of the clustering results in the visualization panels. A linking-and-brushing function is implemented such that dimensions/clusters of interest can be highlighted in different views. By placing the mouse cursor over an item (record, dimension, or cluster) in the visualization panel, the analyst can see detailed information about the item in a tooltip.

[Figure 5.2 diagram. Recoverable labels: data space, visual space, subspace cluster view, cluster and dimension ordering.]

Figure 5.2: Workflow of subspace cluster analysis using the ClustNails system.

Figure 5.2 illustrates the workflow supported by our tool. Figure 5.2 (left) shows that the system loads a d-dimensional data set as input and a user-selected clustering algorithm computes the subspace clusters, provided as a list of clusters, each containing a subset of records and a subset of dimensions. Figure 5.2 (middle) shows that each cluster is quantified in terms of the number of instances and associated number of dimensions; this information, together with the records for each subspace cluster, is visualized in a multiple-view visualization panel, which includes a Spikes view for cluster-centric analysis (top) and a HeatNails view for record-centric analysis (bottom). Figure 5.2 (right) shows that the order of clusters, dimensions, and records can be rearranged in each view for easy comparison between clusters. Next, we describe the different views and supported ordering strategies.
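As a rough illustration of the clustering output described above (a list of clusters, each with a subset of records and a subset of dimensions), the following minimal Python sketch may help. The class name `SubspaceCluster` and the helpers `summarize` and `overlap` are hypothetical names chosen for this example; they are not part of ClustNails or OpenSubspace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubspaceCluster:
    """One entry of the clustering output (hypothetical container)."""
    records: frozenset  # ids of the member records
    dims: frozenset     # indices of the dimensions that define the cluster


def summarize(clusters):
    """Number of records and number of dimensions per cluster (T1.1, T1.2)."""
    return [(len(c.records), len(c.dims)) for c in clusters]


def overlap(a, b):
    """Shared records and shared dimensions of two clusters (T2.2)."""
    return a.records & b.records, a.dims & b.dims


# Toy output with two clusters that share one record and one dimension.
clusters = [
    SubspaceCluster(frozenset({0, 1, 2, 3}), frozenset({0, 2})),
    SubspaceCluster(frozenset({3, 4, 5}), frozenset({2, 5, 7})),
]
sizes = summarize(clusters)
shared_records, shared_dims = overlap(clusters[0], clusters[1])
```

Storing records and dimensions as sets keeps the overlap questions of task T2 down to simple set intersections.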

Visualization Components

Visualization of Clusters: the Spikes View

The Spikes view is a cluster-oriented view and provides a matrix of thumbnails, each representing a subspace cluster. Each cluster is visualized in a circular area that contains radial spikes. The spikes represent the individual dimensions (the subspace) that define the given cluster, and the spike length is scaled according to the weight (importance) of a dimension for the cluster (see below for the definition). The radial dimension sequence is identical for each spike-glyph. The number of records in the cluster is represented by the area size of the inner circle.

Subspace clustering algorithms provide as output a subset of dimensions $D_k$ for each cluster $SC_k$, as well as the set of instances (records) of this cluster, $X_k$. Given a dimension $m$ within the set of dimensions $D_k$ of a subspace cluster $SC_k$, we define the weight of that dimension in that cluster as:

\[
w^k_m = \frac{\sum_{x^m_i \in X^m_k} \left| x^m_i - c^k_m \right|}{|X_k|}, \tag{5.2}
\]

where $c^k_m$ is the center of the points in $X_k$ along dimension $m$, $x^m_i$ is the value in dimension $m$ of the point $x_i$ of this cluster, and $|X_k|$ is the number of elements in $SC_k$. The smaller $w^k_m$ is, the more compact the points are around the center in dimension $m$. This implies that dimensions with smaller weights have better clustered points and are defined as more important for a cluster. We normalize the weights $w^k_m$ for all dimensions of all clusters to the interval [0,1] and map the corresponding values inversely to the length of the spike. The lower $w^k_m$ (the more important the dimension), the longer the corresponding spike.

Note that owing to our definition of $w^k_m$, the relationship between weights and importance is inverse, and we reflect this by an inverse mapping between weights and the size of the visual attribute (the spikes). Also note that if the given subspace clustering algorithm natively outputs weights for each dimension, those weights can be mapped to the spike length instead.
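The weight definition and the inverse mapping to spike length can be sketched in a few lines of Python. This is only an illustration under two assumptions that the text does not fix: the cluster center is taken to be the arithmetic mean, and the normalization to [0,1] is a min-max scaling over all weights of all clusters. The function names are hypothetical.

```python
def dimension_weights(points, dims):
    """Mean absolute deviation from the cluster center per dimension (Eq. 5.2).

    points: records of one cluster, each a sequence of numbers
    dims:   indices of the dimensions in the cluster's subspace
    """
    n = len(points)
    weights = {}
    for m in dims:
        center = sum(p[m] for p in points) / n  # assumption: center = mean
        weights[m] = sum(abs(p[m] - center) for p in points) / n
    return weights


def spike_lengths(all_weights):
    """Min-max normalize over all clusters (assumption), then invert."""
    flat = [w for cluster_w in all_weights for w in cluster_w.values()]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero if all weights are equal
    return [
        {m: 1.0 - (w - lo) / span for m, w in cluster_w.items()}
        for cluster_w in all_weights
    ]


weights_a = dimension_weights([[1.0, 0.0], [1.0, 4.0], [1.0, 8.0]], [0, 1])
weights_b = dimension_weights([[0.0, 2.0], [2.0, 2.0]], [0])
lengths = spike_lengths([weights_a, weights_b])
```

In this toy input, dimension 0 of the first cluster is perfectly compact (weight 0) and therefore receives the longest spike, while its widely spread dimension 1 receives the shortest.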

Figure 5.3: Two subspace clusters visualized as spikes. The clusters share common dimensions, but the importance of the dimensions differs between the clusters. Dim29 and dim32 show shorter spikes in the left cluster than in the right cluster, as they are considered less important for the definition of that cluster according to our measure $w^k_m$. Furthermore, the left cluster has fewer dimensions and more objects than the right cluster.

The visual representation for each subspace cluster is a circle in the Spikes view. Each spike in a circle represents a dimension contained in that subspace. The length of the spike represents the weight of the dimension for that particular cluster (the longer, the more important). The order of the dimensions is identical for each cluster. The area of the inner circles indicates the number of records within each cluster. Figure 5.3 illustrates the Spikes view.

The resulting Spikes view allows users to quickly recognize overlapping dimensions by comparing the spike patterns of the different clusters. To support this comparison, the background is divided into pie slices that are colored alternately with two colors (gray and light red), which makes it easier to compare spike angles across clusters.

Visualization of Records: the HeatNails View

The HeatNails view is an extended heat map displaying the data values and dimensions. Rows represent dimensions, and columns represent data items (records). Each HeatNail cell represents a data value of a record in one dimension. Data items are grouped by clusters. These clusters are aligned next to each other and separated by black lines. Data values are normalized globally and mapped to an appropriate color scale. A yellow-to-green color scale is used for dimensions that are members of the given cluster, while a gray scale is used for the remaining dimensions of the data set per cluster (see Figure 5.4 (bottom)). This allows for an effective visual perception of the distribution of values across dimensions, and of the relation between dimensions and clusters with respect to their inclusion in the cluster definition.
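The cell layout described above could be prepared roughly as follows. This is a hedged sketch, not the ClustNails implementation: the function name `heatnails_cells` is hypothetical, global normalization is assumed to be min-max scaling over the whole data set, and the two color scales are represented only by the labels "color" and "gray".

```python
def heatnails_cells(data, clusters):
    """Return, per cluster, a rows-by-columns grid of (scale, intensity) cells.

    data:     list of records, each a list of numeric values
    clusters: list of (record_ids, subspace_dims) pairs
    """
    values = [v for rec in data for v in rec]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against constant data
    n_dims = len(data[0])
    blocks = []
    for record_ids, dims in clusters:
        block = []
        for m in range(n_dims):  # one row per dimension
            # Member dimensions use the color scale, all others the gray scale.
            scale = "color" if m in dims else "gray"
            row = [(scale, (data[i][m] - lo) / span) for i in record_ids]
            block.append(row)
        blocks.append(block)
    return blocks


data = [[0.0, 10.0], [5.0, 10.0]]
blocks = heatnails_cells(data, [([0, 1], {0})])
```

Each block corresponds to one cluster; a renderer would place the blocks side by side and draw a separator between them.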

Figure 5.4: HeatNails visualization. Bottom: showing the distribution of dimension values for all dimensions (rows) and records (columns). Top: showing histograms for the values of all dimensions per cluster for comparison purposes.

We also give a summary representation of the values of the dimensions occurring in the clusters. The distribution of dimension values of each cluster is discretized into a histogram and visualized by color (for dimensions included) and gray scales (for dimensions not included). This allows for easy comparison between clusters with respect to data values. Figure 5.4 (top) shows these histogram views. Finally, depending on the clustering algorithm, it is possible that records are members of multiple clusters. We illustrate this by marking the cluster IDs of multi-cluster members at the bottom of the display. Like the Spikes view, the HeatNails view allows the quick recognition of overlapping dimensions across the clusters, here by means of the color and gray-scale patterns. Both the Spikes and HeatNails views incorporate linking-and-brushing functionality. Clicking on any set of dimensions/clusters of interest in one view highlights the same dimensions/clusters in all other views.

Ordering Heuristics

Ordering is implemented to support perception of structural similarities of clusters with respect to dimensions and value distributions. As ordering problems for clusters, dimensions, and records are typically complex NP-complete combinatorial optimization problems [9], we rely on heuristics to order dimensions, records, clusters, and values in the various displays. Our essential idea is to place similar or closely related objects together to help the analyst find interesting patterns.

Dimension Ordering

To find a global ordering of the dimensions, we compute a frequency value for each dimension, denoting the number of subspace clusters that use this dimension. We order the list of dimensions by this frequency value, starting the sequence with the dimension that is most frequently used by the set of subspace clusters. The next positions are filled as follows: the dimension that co-occurs most frequently with the previously positioned dimension is placed next. If no co-occurrence is found, the most frequent of the remaining dimensions is positioned next in the ordering vector. The dimension ordering can be applied to both the Spikes view and the HeatNails view.
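The greedy dimension ordering can be sketched as follows. The function name `order_dimensions` is hypothetical, and the tie-breaking rules (higher global frequency first, then smaller dimension index) are assumptions of this sketch, since the text does not specify them. Dimensions that appear in no cluster are not part of the returned order.

```python
from collections import Counter


def order_dimensions(cluster_dims):
    """Greedy global dimension order from a list of per-cluster dimension sets."""
    freq = Counter(d for dims in cluster_dims for d in dims)
    remaining = set(freq)

    def most_frequent(candidates):
        # Assumed tie-break: smaller dimension index wins.
        return min(candidates, key=lambda d: (-freq[d], d))

    order = [most_frequent(remaining)]
    remaining.discard(order[-1])
    while remaining:
        prev = order[-1]
        # Count how often each unplaced dimension shares a cluster with `prev`.
        cooc = Counter(
            d
            for dims in cluster_dims
            if prev in dims
            for d in dims
            if d in remaining
        )
        if cooc:
            nxt = min(cooc, key=lambda d: (-cooc[d], -freq[d], d))
        else:
            nxt = most_frequent(remaining)  # fallback: no co-occurrence found
        order.append(nxt)
        remaining.discard(nxt)
    return order


dim_order = order_dimensions([{5, 2}, {5, 2, 7}, {5, 1}, {9}])
```

In the toy input, dimension 5 is used by three clusters and starts the sequence; 2 and 7 follow through co-occurrence, and 1 and 9 are placed by the frequency fallback.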

Subspace Cluster Ordering

A useful visual representation of subspace clustering results should arrange similar subspaces next to each other to reduce the user's visual search time. We propose an ordering strategy that is formalized in the following. Using the dimension weights defined in Equation 5.2, we propose a measure for the global interestingness $I^g_{SC_k}$ of a cluster $SC_k$:

\[
I^g_{SC_k} = \frac{\sum_{m \in D_k} w^k_m}{|D_k|}, \tag{5.3}
\]

where $w^k_m$ is the weight of dimension $m \in D_k$ of $SC_k$, and $|D_k|$ is the number of dimensions in this subspace cluster. That is, we define the global interestingness of a cluster $k$ as the average of the weights of the dimensions contained in this subspace cluster. This measure is used to determine the first cluster in the ordering. We then use the subspace cluster distance (Equation 5.4) employed in [14] to find the most similar cluster, which is placed next to the initial cluster.

This distance function is a convex sum of subspace distance and object distance:

\[
\mathrm{dist}(SC_i, SC_j) = \beta \left( 1 - \frac{|D_i \cap D_j|}{|D_i \cup D_j|} \right) + (1 - \beta) \left( 1 - \frac{|X_i \cap X_j|}{\min\{|X_i|, |X_j|\}} \right) \quad [14] \tag{5.4}
\]

where $|D_i \cap D_j|$ is the number of common dimensions of the two subspaces $i$ and $j$, and $|X_i \cap X_j|$ is the number of shared objects of the two subspaces. We continue this placement until all clusters are placed.
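A minimal Python sketch of this cluster ordering is given below. Several points are assumptions rather than statements of the original method: the subspace term is read as shared dimensions over the union of dimensions, the mixing parameter (called `beta` here) defaults to 0.5, the first cluster is the one with the lowest mean weight (since lower weights mean more important dimensions), and each subsequent cluster is the one closest to the most recently placed cluster. All function names are hypothetical.

```python
def interestingness(weights):
    """Mean dimension weight of one cluster (Eq. 5.3); weights: dim -> weight."""
    return sum(weights.values()) / len(weights)


def cluster_distance(a, b, beta=0.5):
    """Convex sum of subspace distance and object distance (Eq. 5.4 as read here)."""
    dims_a, recs_a = a
    dims_b, recs_b = b
    d_sub = 1.0 - len(dims_a & dims_b) / len(dims_a | dims_b)
    d_obj = 1.0 - len(recs_a & recs_b) / min(len(recs_a), len(recs_b))
    return beta * d_sub + (1.0 - beta) * d_obj


def order_clusters(clusters, weights, beta=0.5):
    """Greedy chain: pick a start cluster, then repeatedly append the nearest one."""
    remaining = list(range(len(clusters)))
    # Assumption: start with the cluster that has the lowest mean weight.
    start = min(remaining, key=lambda k: interestingness(weights[k]))
    order = [start]
    remaining.remove(start)
    while remaining:
        last = clusters[order[-1]]
        nxt = min(remaining, key=lambda k: cluster_distance(last, clusters[k], beta))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Each cluster is a (dimensions, records) pair of sets.
clusters = [({0, 1}, {0, 1, 2}), ({0, 1, 2}, {2, 3}), ({5}, {7, 8})]
weights = [{0: 0.2, 1: 0.4}, {0: 0.1, 1: 0.1, 2: 0.1}, {5: 0.5}]
cluster_order = order_clusters(clusters, weights)
```

With these toy values the second cluster has the lowest mean weight and is placed first; the first cluster shares dimensions and a record with it and follows, and the disjoint third cluster comes last.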

Record Ordering

Two different types of record ordering strategies are implemented in HeatNails. One strategy is to order the records from min to max with respect to their values in the dimension that has the largest variance among all dimensions. A second strategy is to order the records according to the Euclidean distance across the contained dimensions of the given subspace, based on a selected starting record. The starting record, in turn, may either be selected by the user or selected automatically as the record that shows the largest variance over all dimensions.
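Both record ordering strategies can be illustrated with a short Python sketch. The function names are hypothetical, population variance is used, and the second strategy is interpreted here as sorting records by their distance to the starting record; a nearest-neighbor chain would be another plausible reading of the text.

```python
import math


def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def order_by_max_variance_dim(records):
    """Strategy 1: sort record indices by value in the highest-variance dimension."""
    n_dims = len(records[0])
    top = max(range(n_dims), key=lambda m: variance([r[m] for r in records]))
    return sorted(range(len(records)), key=lambda i: records[i][top])


def order_by_distance(records, dims, start=None):
    """Strategy 2: sort by Euclidean distance to a start record within `dims`."""
    if start is None:
        # Automatic choice: the record whose values vary most over all dimensions.
        start = max(range(len(records)), key=lambda i: variance(records[i]))
    ref = records[start]

    def dist(i):
        return math.sqrt(sum((records[i][m] - ref[m]) ** 2 for m in dims))

    return sorted(range(len(records)), key=dist)


records = [[3.0, 0.0], [1.0, 0.1], [2.0, 0.2]]
by_value = order_by_max_variance_dim(records)
by_distance = order_by_distance(records, dims=[0])
```

Both functions return record indices rather than reordered data, so the same permutation can be applied to every row of a heat map.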

Value Ordering

A value ordering facility is implemented in the HeatMap view and visible in the top summary row of the HeatNails view. In each row the distribution of values in a given dimension is shown. To that end, we sort the values from min to max, and bin them into a user-selectable number of bins. In this view the distribution of values per dimension and cluster is indicated in the form of a color-coded histogram. The histograms help in understanding the distribution of data values within each dimension, and may support finding out why a particular dimension was selected or not by the clustering algorithm.
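The binning step behind these histograms might look like the following sketch. Equal-width bins between the minimum and maximum value are an assumption, as the text only states that the number of bins is user-selectable; the function name `value_histogram` is hypothetical.

```python
def value_histogram(values, n_bins):
    """Count values in `n_bins` equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # constant data falls into the first bin
    counts = [0] * n_bins
    for v in sorted(values):  # sorted from min to max, as in the text
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum joins the last bin
        counts[idx] += 1
    return counts


hist = value_histogram([0.0, 0.1, 0.5, 0.9, 1.0], n_bins=2)
```

The resulting counts per bin could then be mapped to color intensity, using the color scale for included dimensions and the gray scale for the others.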

Summary and Discussion of the ClustNails System Design

ClustNails is an integrated system for visual subspace cluster analysis. Its design features (1) a number of subspace clustering algorithms from which the user can choose and (2) a set of visual representations for the most important aspects of the output of automatic subspace cluster analysis.

Regarding (1), we provide access to a number of state-of-the-art algorithms as contained in the OpenSubspace library [106]. The list of integrated algorithms is extensive.

Regarding (2), we composed a visual display of three aspects. The Spikes view is inspired by star glyphs and distinguishes clusters from each other, in terms of included
