

4.2.4 Discussion and Further Research

This study shows that quality measures have so far been developed and validated on too few and too simple data sets. Real-world data is much more complex, and as data complexity rises, a more systematic development of the measures is needed. As we saw in the previous section, real data sets exhibit additional aspects that existing measures do not yet cover. In the following, we list the issues that emerged from this study and that we deem important for further research in the area of quality metrics.

Taxonomy-based evaluation and systematization

A large number of metrics for cluster separation in scatterplots have been developed. They all try to discover good views that display the data clusters. We believe there are two main reasons for this variety of measures: the measures have different strengths, and a unified picture of existing approaches is missing. First, the measures have different strengths with respect to the factors of the taxonomy. They cannot cover the entire spectrum of data characteristics and therefore focus on a subset of them. Applying a metric to data characteristics it cannot cope with leads to wrong results. Therefore, guided by the taxonomy presented before, an evaluation of the existing metrics is needed to help users choose the right measure for their data. Second, the variety of measures makes the development of new ones difficult, since a unifying picture is missing.

Guided by the taxonomy axes, the existing approaches can be evaluated and their ranges of success marked along each axis. This analysis would provide a systematization of current approaches, identify the data characteristics that have to be addressed in the future, and guide researchers through the variety of approaches.

Taxonomy-based measure development

Once the gaps of existing measures are identified, new research can be conducted to cover the data characteristics missing so far. We believe that it is hard to develop a single measure that covers all these factors, but having different measures and being aware of their coverage along these axes helps to avoid false rankings in the future.

New taxonomies for different visualization techniques

While this taxonomy focuses on one prominent visualization technique, the scatterplot, there are also metrics designed for other high-dimensional visualization techniques, as categorized in Section 4.1.4. Different visualization techniques will need different factors to characterize a given pattern (e.g., cluster separation). Even though building such a taxonomy is laborious, it can benefit the development of metrics for these techniques.

New taxonomies for different quality metric factors

We have seen in Section 4.1.4 that measures quantify different patterns, and a systematization of the influencing factors is missing for these other patterns too. Factors like correlation, outliers, complex patterns, image quality, or feature preservation lack such a taxonomy. With all these taxonomies available, which would be the ideal scenario, it would be possible to identify interrelations between different patterns and how they are represented in visualizations. We believe that these insights can help in combining measures to identify more than one pattern.

Metrics for dimension reduction properties

Dimension reduction techniques are often used to reduce the dimensionality of data sets before displaying them on the screen. The metrics are always applied to dimension-reduced data sets, so artifacts introduced by these techniques cannot be ruled out. How different data characteristics are maintained or obscured by these techniques can be studied by comparing different techniques, or the same technique with different parameter settings, on the same data set. As far as we know, there are no studies reporting on this type of analysis, and we believe it to be an interesting topic for future research. Quality measures can also be designed to automatically detect structure changes caused by a change of parameter or technique. Properties like noise invariance, rotation invariance, and scalability with respect to data points and dimensions can be explored for new quality metrics.
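As an illustration of how a property such as rotation invariance could be checked for a quality metric, the following minimal sketch compares two toy measures on a synthetic two-class scatterplot before and after rotating the view. Both measures (`centroid_score`, `x_axis_score`), the synthetic data, and the rotation angle are assumptions made for this example; they are not measures discussed in this work.

```python
import math
import random

def centroid(points):
    """Mean position of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def centroid_score(class_a, class_b):
    """Toy separation measure: centroid distance over mean within-class spread."""
    ca, cb = centroid(class_a), centroid(class_b)
    between = math.hypot(ca[0] - cb[0], ca[1] - cb[1])
    spread = [math.hypot(x - c[0], y - c[1])
              for pts, c in ((class_a, ca), (class_b, cb)) for x, y in pts]
    return between / (sum(spread) / len(spread))

def x_axis_score(class_a, class_b):
    """Toy measure that only looks at the horizontal axis (hypothetical)."""
    return abs(centroid(class_a)[0] - centroid(class_b)[0])

def rotate(points, angle):
    """Rotate 2-D points around the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Two synthetic classes separated along the horizontal axis (assumed data).
class_a = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(50)]
class_b = [(rng.gauss(4.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(50)]
rot_a, rot_b = rotate(class_a, math.pi / 2), rotate(class_b, math.pi / 2)

# The distance-based score is unchanged by the rotation; the axis-bound one is not.
print(centroid_score(class_a, class_b), centroid_score(rot_a, rot_b))
print(x_axis_score(class_a, class_b), x_axis_score(rot_a, rot_b))
```

A measure built only from Euclidean distances keeps its value under rotation, whereas a measure tied to a screen axis can rank the same structure very differently. The same pattern of test (transform the view, recompute, compare) could be applied to noise or parameter changes.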

5 Visual Subspace Analysis of High-Dimensional Data

“Visual ideas combined with technology combined with personal interpretation equals photography. Each must hold its own; if it doesn’t, the thing collapses.”

Arnold Newman

Contents

5.1 Visual Exploration for Subspace Clustering
5.1.1 Motivation
5.1.2 Subspace Clustering Algorithms
5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis
5.1.4 The ClustNails System
5.1.5 Use Case and System Comparison
5.1.6 Conclusions and Future Work
5.2 Visual Analytics of Subspace Search
5.2.1 Introduction
5.2.2 Subspace Analysis
5.2.3 Proposed Analytical Workflow
5.2.4 Application
5.2.5 Discussion and Possible Extensions
5.2.6 Conclusions

Subspace clustering addresses an important problem in clustering multidimensional data. In sparse multidimensional data, many dimensions are irrelevant and obscure the cluster boundaries. Subspace clustering helps by mining the clusters present in only locally relevant subsets of dimensions. However, it is not trivial for analysts to understand the result of subspace clustering. In addition to the grouping information, relevant sets of dimensions and overlaps between groups, both in terms of dimensions and records, need to be analyzed. In Section 5.1, we present an interactive visualization system called ClustNails to analyze, navigate, relate, and understand subspace clustering results. Real-world data sets are used to demonstrate the functionality of the system.

Additionally, high-dimensional data spaces often consist of combined features that measure different properties. In this case, the relationships between the various properties may not be clear to analysts a priori, since they can only be revealed if appropriate feature combinations (subspaces) of the data are taken into consideration. Considering just a single subspace is, however, often not sufficient, since different subspaces may show complementary, conjoint, or contradicting relations between data items. Useful information may consequently remain embedded in sets of subspaces of a given high-dimensional input data space.
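The effect described above, where structure is visible only in some subspaces and obscured in the full space, can be sketched on synthetic data. The example below is a minimal illustration under stated assumptions: the data generator, the group labels, and the toy `separation` score are all hypothetical, and the known labels are used only to make the effect measurable. Actual subspace clustering algorithms have to discover both the groups and their relevant dimensions.

```python
import itertools
import math
import random

def project(rows, dims):
    """Keep only the given dimensions of each record."""
    return [tuple(r[d] for d in dims) for r in rows]

def centroid(rows):
    """Component-wise mean of a list of records."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

def separation(group_a, group_b, dims):
    """Toy score: centroid distance over mean within-group spread in a subspace."""
    a, b = project(group_a, dims), project(group_b, dims)
    ca, cb = centroid(a), centroid(b)
    spread = [math.dist(p, c) for pts, c in ((a, ca), (b, cb)) for p in pts]
    return math.dist(ca, cb) / (sum(spread) / len(spread))

rng = random.Random(1)  # fixed seed so the sketch is reproducible

def make_group(center):
    # Dimensions 0 and 1 carry the group structure; 2 and 3 are uniform noise.
    return [(rng.gauss(center, 0.3), rng.gauss(center, 0.3),
             rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(40)]

group_a, group_b = make_group(0.0), make_group(3.0)
all_dims = (0, 1, 2, 3)

# Score every two-dimensional subspace and the full space.
scores = {dims: separation(group_a, group_b, dims)
          for dims in itertools.combinations(all_dims, 2)}
best = max(scores, key=scores.get)
full = separation(group_a, group_b, all_dims)

print("best 2-D subspace:", best)
print("score there vs. full space:", scores[best], full)
```

In this setup the two noise dimensions dominate the distances in the full space, so the groups separate clearly only in the subspace formed by the two structured dimensions.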

Relying on the notion of subspaces, in Section 5.2 we propose a novel method for the visual analysis of high-dimensional data in which we employ an interestingness-guided subspace search algorithm to detect a candidate set of subspaces. Using properly defined subspace similarity functions, we provide an interactive exploration environment to compare and relate subspaces with respect to their topological and dimension similarities. Real and synthetic data sets are used to demonstrate our approach.
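To make the two notions of subspace similarity more concrete, the sketch below shows one plausible way to compute them. It rests on assumptions that are not taken from this text: dimension similarity is modeled as the Jaccard overlap of the dimension sets, and topological similarity as the average overlap of each record's k nearest neighbors in the two subspaces. The function names and the small data set are hypothetical.

```python
import math

def dimension_similarity(dims_a, dims_b):
    """Jaccard overlap of two dimension sets (assumed definition)."""
    a, b = set(dims_a), set(dims_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def knn(data, dims, i, k):
    """Indices of the k nearest neighbors of record i within subspace `dims`."""
    def dist_to(j):
        return math.sqrt(sum((data[i][m] - data[j][m]) ** 2 for m in dims))
    others = [j for j in range(len(data)) if j != i]
    return set(sorted(others, key=dist_to)[:k])

def topological_similarity(data, dims_a, dims_b, k):
    """Mean fraction of shared k nearest neighbors per record (assumed definition)."""
    shared = [len(knn(data, dims_a, i, k) & knn(data, dims_b, i, k)) / k
              for i in range(len(data))]
    return sum(shared) / len(shared)

# Dimension 1 is a rescaled copy of dimension 0; dimension 2 is unrelated.
data = [(0.0, 0.0, 5.0), (1.0, 2.0, 0.0), (3.0, 6.0, 9.0),
        (7.0, 14.0, 2.0), (15.0, 30.0, 7.0), (31.0, 62.0, 1.0)]

print(dimension_similarity((0, 1), (1, 2)))            # shares one of three dimensions
print(topological_similarity(data, (0,), (1,), k=2))   # same neighborhoods
print(topological_similarity(data, (0,), (2,), k=2))   # different neighborhoods
```

The subspaces (0) and (1) share no dimension, yet they induce identical neighborhoods, which is why it can be useful to look at both kinds of similarity side by side.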

Parts of this chapter appeared in the following publications [135, 136].

5.1 Visual Exploration for Subspace Clustering

In this section, we introduce a visual subspace cluster analysis system called ClustNails. It integrates several novel visualization techniques with various user interaction facilities to support the navigation and interpretation of subspace clustering results. We demonstrate the effectiveness of the proposed system by analyzing real-world data sets and comparing it to other existing visual subspace cluster analysis systems.

This section is organized as follows. In Section 5.1.1, we elaborate on the aspects that motivated our research in this area. In Section 5.1.2, we introduce the subspace clustering problem and point to important overview articles in this area. In Section 5.1.3, we explain the challenges in designing effective visualization tools for subspace clustering analysis tasks. In Section 5.1.4, we provide an overall view of the system as well as detailed visualization and ordering techniques. In Section 5.1.5, we validate the system with real-world data sets and compare it with a state-of-the-art system. Section 5.1.6 concludes.