

2.4 Visual Analytics for High-Dimensional Data

2.4.2 Subspace Cluster Analysis and Visualization

As traditional full-space clustering is often not effective for revealing a meaningful clustering structure in high-dimensional data (see Section 2.3.1), several approaches in the emerging research field of subspace clustering [90] aim at discovering meaningful clusters in locally relevant subspaces. The problem of finding clusters in high-dimensional data can be divided into two sub-problems: subspace search and cluster search. The first aims at finding the subspaces in which clusters exist, the second at finding the actual clusters. The large majority of existing algorithms consider the two problems simultaneously and produce a set of clusters, where each cluster is typically represented by a set of clustered objects (rows of the original data table) and the subset of relevant dimensions (columns of the original data table). Several methods have been proposed that differ in their cluster search strategy and in their constraints on the overlap of clusters and dimensions [38, 84, 107]. Kriegel et al. [90] categorize these algorithms into four classes: (1) projected clustering; (2) “soft” projected clustering; (3) subspace clustering; (4) hybrid. The first two generate clusters that do not overlap, that is, every object belongs to only one cluster. Subspace clustering and hybrid algorithms may generate clusters that do overlap.
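The cluster representation described above (a set of objects plus a subset of relevant dimensions) can be sketched as a small data structure. This is only an illustrative sketch: the names `SubspaceCluster`, `object_overlap`, `dimension_overlap`, and `is_partitioning`, as well as the toy clusters, are assumptions introduced here and do not come from any of the cited algorithms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubspaceCluster:
    """One subspace cluster: clustered objects plus their relevant dimensions."""
    objects: frozenset     # row indices of the original data table
    dimensions: frozenset  # column indices of the original data table


def object_overlap(a, b):
    """Objects (rows) shared by two clusters."""
    return a.objects & b.objects


def dimension_overlap(a, b):
    """Dimensions (columns) shared by two clusters."""
    return a.dimensions & b.dimensions


def is_partitioning(clusters):
    """True if no object belongs to more than one cluster."""
    seen = set()
    for c in clusters:
        if seen & c.objects:
            return False
        seen |= c.objects
    return True


# Hypothetical toy result: c1 and c2 share object 2 and dimension 1.
c1 = SubspaceCluster(frozenset({0, 1, 2}), frozenset({0, 1}))
c2 = SubspaceCluster(frozenset({2, 3}), frozenset({1, 4}))
c3 = SubspaceCluster(frozenset({4, 5}), frozenset({2, 3}))
```

Under these assumptions, a result such as `[c1, c3]` has the non-overlapping form produced by projected clustering, whereas `[c1, c2, c3]` contains the kind of object and dimension overlap that subspace clustering and hybrid algorithms may produce.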

While extensive research has been carried out on designing subspace clustering algorithms, surprisingly little attention has been paid to developing visualization support for subspace clustering. To our knowledge, only a few subspace cluster visualization systems exist.

Figure 2.6: (a) VISA system [14]. Left: MDS projection for the global view of clusters. Right: Matrix of subspace clusters for in-depth view. (b) Heidi Matrix [141] over a subspace.

The VISA system [14] implements both a global view and an in-depth view (see Figure 2.6(a)) to help interpret the subspace clustering result. In the global view, the subspace clusters are projected onto a 2D display using a multidimensional scaling (MDS) projection. The aim is to show the similarity between clusters in terms of the number of records and dimensions in each cluster. Each cluster is represented as a colored circle, where color encodes the number of dimensions and size encodes the number of instances. The in-depth view shows the detailed characteristics of the clustering result, including the data items in each cluster and their values, using a matrix representation. It uses different color codes to visualize all characteristics of an object: black for unselected dimensions, brightness for areas of interest, and hue for value. The MDS projection in VISA provides a good overview of the clustering results. However, using circles of different sizes in the MDS projection can be problematic: the distance between two clusters can be obscured by the radii of the circles, and the overlap between clusters often causes a cluttered display. The in-depth view shows detailed characteristics of the clustering result, but as shown in Figure 2.6(a), both hue and brightness are relatively weak at showing differences and variations between numbers and values in unselected dimensions.

Heidi Matrix [141] uses a complex arrangement of subspaces in a matrix representation. This matrix is based on the computation of the k-Nearest Neighbors (kNN) in each subspace (see Figure 2.6(b)). Rows and columns represent the data items, and each entry (i, j) in the matrix represents the number of subspaces in which i and j are neighbors. A categorical coloring scheme is used to color the cells according to the particular combination of subspaces in which two data items are neighbors. In addition, rows and columns are ordered according to the output generated by a clustering algorithm. The biggest advantage of Heidi Matrix is that it displays the full information of the data and the subspace clustering result. However, the rather abstract visual mapping scheme makes the results difficult to interpret, and to the best of our knowledge its effectiveness has not yet been evaluated. The scalability of the visualization is another critical issue because it requires n × n display space, where n is the number of data items.
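The matrix entries described above can be illustrated with a small sketch that counts, for each pair of items, the subspaces in which they are kNN neighbors. It is not the Heidi Matrix implementation: the function names, the Euclidean distance, the directed reading of "neighbor" (j is among the k nearest neighbors of i), and the toy points are all assumptions made for illustration. The sketch also omits the categorical coloring by subspace combination and the cluster-based row and column ordering.

```python
import math


def knn(points, dims, k):
    """For each point, return the set of its k nearest neighbors in subspace dims."""
    def dist(p, q):
        # Euclidean distance restricted to the dimensions of the subspace.
        return math.sqrt(sum((p[d] - q[d]) ** 2 for d in dims))

    neighbors = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: (dist(p, points[j]), j))
        neighbors.append(set(others[:k]))
    return neighbors


def neighbor_count_matrix(points, subspaces, k):
    """Entry (i, j) counts the subspaces in which j is a k-nearest neighbor of i."""
    n = len(points)
    counts = [[0] * n for _ in range(n)]  # n x n cells, one per item pair
    for dims in subspaces:
        nn = knn(points, dims, k)
        for i in range(n):
            for j in nn[i]:
                counts[i][j] += 1
    return counts


# Hypothetical toy data: four 2D points, three subspaces, k = 1.
points = [(0.0, 0.0), (0.2, 5.0), (5.0, 0.1), (5.3, 5.2)]
subspaces = [(0,), (1,), (0, 1)]
counts = neighbor_count_matrix(points, subspaces, k=1)
```

Because the result always has one cell per pair of items, the sketch also makes the scalability concern concrete: the number of cells grows quadratically with the number of data items.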

Figure 2.7: Visualization techniques applied in Ferdosi’s work [52]. Left: 1D subspace. Middle: 2D subspace. Right: Subspace with 3 or more dimensions.

Ferdosi et al. [52] proposed an algorithm for finding interesting subspaces in astronomical data as well as a visual system for displaying the results. The algorithm identifies candidate subspaces from the data and ranks them by a quality metric based on density estimation and morphological operators. The resulting subspaces are visualized in different forms: line graphs for 1-dimensional subspaces, 2D scatterplots for 2-dimensional subspaces, and principal component analysis (PCA) projections for subspaces with higher dimensionalities (see Figure 2.7). Ferdosi’s work provides some interesting insight into subsets of dimensions in astronomical data with a high density of data objects. However, the algorithm does not assign objects to subspaces. Hence, compared to VISA and Heidi Matrix, the subspace clustering information is partially missing from both the data mining and the visualization, meaning there is no direct way of comparing subspaces.

All of the above-mentioned visualization systems lack a visualization of overlapping dimensions and overlapping clusters, so it is difficult to see and compare such overlap information in their visual representations. In Section 5.1 we propose a visual tool for investigating subspace clustering results that also represents dimension and object overlap among clusters.

We note that if we apply one of these subspace clustering visualizations, we immediately inherit two main challenges of this paradigm that are still considered open research issues, namely the efficiency challenge (relating to subspace cluster search) and the redundancy challenge (relating to the typical redundancy of the generated outputs). In Section 5.2 the redundancy problem is addressed by our proposed analytical workflow.

3 Quality Measures based Visual Analysis of High-Dimensional Data

“Measure what is measurable, and make measurable what is not so.”

Galileo Galilei

Contents

3.1 Quality Measures for Scatterplots and Parallel Coordinates
    3.1.1 Overview and Problem Description
    3.1.2 Quality Measures for Scatterplots with Unclassified Data
    3.1.3 Quality Measures for Scatterplots with Classified Data
    3.1.4 Quality Measures for Parallel Coordinates with Unclassified Data
    3.1.5 Quality Measures for Parallel Coordinates with Classified Data
    3.1.6 Application on Real Data Sets
    3.1.7 Evaluation of the Measures’ Performance Using Synthetic Data
    3.1.8 Conclusion and Future Work
3.2 Quality Measures and Human Perception – An Empirical Study
    3.2.1 Measures
    3.2.2 Empirical Evaluation
    3.2.3 Results
    3.2.4 Discussion
    3.2.5 Guidelines
    3.2.6 Conclusion and Future Work

Visual exploration of multivariate data typically requires projection onto lower-dimensional representations. The number of possible representations grows rapidly with the number of dimensions, and manual exploration quickly becomes ineffective or even infeasible. In this chapter, we propose automatic analysis methods to extract potentially relevant visual structures from a set of candidate visualizations. Based on these features, the visualizations are ranked in accordance with a specified user task. The user is provided with a manageable number of potentially useful candidate visualizations that can be used as a starting point for interactive data analysis. This can effectively ease the task of finding truly useful visualizations and potentially speed up the data exploration task. In Section 3.1, we therefore present quality measures for class-based as well as non-class-based scatterplot and parallel coordinates visualizations. The proposed analysis methods are evaluated on real and synthetic data sets, and the results are presented in Sections 3.1.6 and 3.1.7. Section 3.2 presents an empirical study that compares the measures’ rankings with user perception. The study helped us to derive further factors that must be taken into account when designing new measures that fit the users’ perception.

Parts of this chapter appeared in the following publications: [132, 133, 134].

3.1 Quality Measures for Scatterplots and Parallel Coordinates

In this section, we present an automated approach that supports the user in exploring high-dimensional data. The basic idea is to generate different projections from the high-dimensional data set and to automatically identify potentially relevant visual or data structures in this set of candidates. These structures are used to determine the relevance of each projection to common predefined analysis tasks. The user may then use the projection with the highest relevance as the starting point of the visual interactive analysis. We present relevance measures for typical analysis tasks based on scatterplots and parallel coordinates. The experiments on class-labeled and non-class-labeled data sets demonstrate the potential of our quality measures to find interesting projections and visualizations and thus speed up the exploration process.
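The generate-score-rank idea can be sketched as follows for axis-aligned 2D scatterplots. The scoring function used here, the absolute Pearson correlation, is only a stand-in chosen for brevity. It is an assumption and not one of the quality measures proposed in this chapter, and the names `abs_pearson`, `rank_scatterplots`, and the toy columns are likewise hypothetical.

```python
from itertools import combinations
import math


def abs_pearson(xs, ys):
    """Stand-in quality measure (an assumption): absolute Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear structure
    return abs(sxy) / math.sqrt(sxx * syy)


def rank_scatterplots(columns, measure, top_k):
    """Score every pair of dimensions and return the top_k (score, dim_a, dim_b)."""
    scored = [(measure(columns[a], columns[b]), a, b)
              for a, b in combinations(sorted(columns), 2)]
    scored.sort(key=lambda t: -t[0])  # highest relevance first
    return scored[:top_k]


# Hypothetical toy data set with three dimensions.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
}
top = rank_scatterplots(columns, abs_pearson, top_k=2)
```

With d dimensions there are d(d-1)/2 such candidate scatterplots, which is why an automatic ranking that returns only a small `top_k` can serve as a starting point for interactive analysis. Any other measure with the same signature could be passed in place of `abs_pearson`.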