Thesis Structure - Visual Analytics of Patterns in High-Dimensional Data

their impact and directions for future research (Chapter 3) [134].

• A systematization of techniques that use quality metrics to support the visual exploration of meaningful patterns in high-dimensional data. We reflect on how different quality measure methods relate to each other and how the approach can be developed further. For this purpose, we provide an overview of approaches that use quality metrics in high-dimensional data visualization and propose a systematization based on a thorough literature review. We carefully analyze the papers and derive a set of factors for discriminating the quality metrics, visualization techniques, and the process itself. A quality metrics pipeline is proposed to model all the encountered varieties of metrics (Chapter 4) [27].

• A visual subspace cluster analysis system (ClustNails) to understand the result of subspace clustering. In subspace clustering, in addition to the grouping information (clusters), the relevance of dimensions for particular groups and the overlaps between groups, both in terms of dimensions and records, need to be analyzed. ClustNails integrates several novel visualization techniques with various user interaction facilities to support navigating and interpreting the result of subspace clustering algorithms (Chapter 5) [136].

• A novel method for the visual analysis of high-dimensional data that supports understanding the data from different perspectives and investigating alternative clusterings. We employ an interestingness-guided subspace search algorithm to detect a candidate set of interesting subspaces that may contain important patterns for further analysis. Based on appropriately defined subspace similarity functions, we visualize the subspaces and provide navigation facilities to interactively explore large sets of subspaces. Our approach allows users to effectively compare and relate subspaces, identifying complementary or contradicting relations among them and thus alternative clusterings (Chapter 5) [135].

1.3 Thesis Structure

After illustrating the problem in the previous section and enumerating the contributions of this thesis, the remainder of the thesis is structured as follows.

Chapter 2 provides a brief overview of important related work in the field of high-dimensional data analysis, covering three main areas. Section 2.1 introduces the common challenges when analyzing high-dimensional data and presents dimension reduction techniques that reduce the data complexity. Section 2.2 describes important visualization techniques for high-dimensional data. Section 2.3 introduces standard automatic techniques from the Data Mining community and presents quality measures, i.e., automated ranking functions that judge the quality of a visualization with respect to a given task. Section 2.4 presents some examples where the interplay between visualization, automation, and interaction is far more beneficial than any of these techniques alone.

Chapter 3 proposes eight new quality metrics for different tasks and two visualization types: scatterplot matrices and parallel coordinates. The metrics are tested on synthetic and real data sets to demonstrate their effectiveness. To ensure that the metrics reflect the user’s perception, a selected subset of measures for scatterplot matrices is evaluated and compared with the user’s perception. We found that both perform similarly. Based on this study, we have formulated guidelines for further evaluation of existing metrics.

Based on a literature review, Chapter 4 introduces a systematization of different quality measures for high-dimensional data visualization. Their relation is described through characteristic factors, such as the visualization technique or the purpose, to arrive at a coherent and unified picture of these techniques. By putting the existing methods into a common framework, we hope to ease the generation of new research in the field and to spot relevant gaps to bridge with future research. Section 4.2 then briefly presents the results of a qualitative data analysis that led to a visual cluster separability taxonomy. These results are the basis for the follow-up discussion of relevant aspects that arise when analyzing clusters visually and of what future work needs to focus on.

Chapter 5 presents two interactive systems that help to make sense of high-dimensional data sets with respect to different clusterings. Searching in subspaces is needed because automatic pattern search is done through clustering algorithms, and it is not feasible to search for clusters in the full space of high-dimensional data. Section 5.1 introduces a visual tool, ClustNails, to investigate subspace clustering results for different state-of-the-art subspace clustering algorithms. This tool is intended to support the interpretation of the result with respect to the subspace cluster relations. With this tool, questions such as the following can be answered: how many objects and how many dimensions do clusters contain, which dimensions overlap between clusters, and which objects are shared by several clusters?

Section 5.2 goes one step further and presents an analytical approach to support the identification of alternative clusterings in these spaces. High dimensionality provides different facets of the data. For example, in a data set about people we might find clusters from the perspective of taste in music (rock music, classical music, jazz, etc.), but at the same time we might find different groupings of the same people that describe their level of sporting activity. Both views on the data are valid but provide different insights. To discover such alternative clusterings in high-dimensional data, in this section we propose an analytical workflow that starts by searching the set of possible subspaces to identify interesting ones. We then group these subspaces according to their data similarity and provide filtering mechanisms for further interactive investigation. Supported by interaction, different clusterings of the data can be identified.

Chapter 6 concludes the thesis and gives an overview of further research questions that seem worth investigating in the future.

A schematic overview of the chapter interrelations is shown in Figure 1.2.


[Figure 1.2 (diagram). Boxes: Chapter 1: Introduction; Chapter 2: High-Dimensional Data Analysis; Chapter 3: QM-based Visual Analysis of HD Data; Chapter 4: A Model of HD Data Visualization; Chapter 5: Visual Subspace Analysis of HD Data; Chapter 6: Conclusion and Future Work. Guiding questions shown in the diagram: "what is interesting?", "how do we visualize and interact with that?", "how do subspaces relate to each other?"]

Figure 1.2: Schematic overview of the interrelation of chapters in this thesis.

Parts of this thesis were published in:

1. A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M. Magnor, and D. Keim. Combining automated analysis and visualization techniques for effective exploration of high dimensional data. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (VAST), pages 59-66, 2009.

The contributions: for this publication I took the lead on the computer science research part of the paper, implementing the data space measures and also leading the writing of the paper itself. G. Albuquerque and M. Eisemann implemented the image quality metrics and provided their description in the paper as well as some parts of the evaluation section concerning these metrics. The Histogram Density measures were programmed by myself. J. Schneidewind gave advice on structuring the paper and presenting the results. D. Keim accompanied the project with suggestions for improvements to the application and the text. H. Theisel and M. Magnor gave advice on the project. All parts of the paper were revised several times by me; thus, in this thesis I use the paper text without citation marks. G. Albuquerque’s thesis (title unknown at the time of my submission) might also contain some text passages of this paper for the parts of the project in which she took part.

2. A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. A. Keim. Automated Visual Analysis Methods for an Effective Exploration of High-Dimensional Data. IEEE Transactions on Visualization and Computer Graphics (TVCG), 17(5):584-597, May 2011.

The contributions: publication 1 was selected as one of the best papers of the VAST’09 conference, and this publication is an invited extension of it. As primary author, I was responsible for writing the paper, generating new use cases, testing our measures, and describing further research directions in this area. G. Albuquerque implemented, described, and tested the new CSM measure. P. Bak gave advice on structuring the experiments and presenting the results. D. Keim accompanied the paper with suggestions for improvements to the application and the text. M. Eisemann, H. Theisel and M. Magnor gave advice on the paper. All parts of the paper were revised several times by me; thus, in this thesis I use the paper text without citation marks. G. Albuquerque’s thesis (title unknown at the time of my submission) might also contain some text passages of this paper for the parts of the project in which she took part.

3. A. Tatu, P. Bak, E. Bertini, D. A. Keim, and J. Schneidewind. Visual quality metrics and human perception: an initial study on 2D projections of large multidimensional data. In Proceedings of the Working Conference on Advanced Visual Interfaces (AVI), pages 49-56. ACM, 2010.

The contributions: for this publication I took primary responsibility and, additionally, took the lead on the automatic evaluation. P. Bak took the lead on the human experiment. Together we compared the results and evaluated them statistically. E. Bertini, D. Keim and J. Schneidewind accompanied the paper with suggestions for improvements to the experimental design and the text. All parts of the paper were revised several times by me; thus, in this thesis I use the paper text without citation marks.

4. D. J. Lehmann, G. Albuquerque, M. Eisemann, A. Tatu, D. A. Keim, H. Schumann, M. Magnor and H. Theisel. Visualisierung und Analyse multidimensionaler Datensätze. Informatik-Spektrum, Springer Berlin/Heidelberg, 33(6):589-600, 2010.

The contributions: this publication was authored by D. J. Lehmann. My contribution was to describe the use of quality metrics for high-dimensional data. This thesis was inspired by the discussions of this paper.

5. E. Bertini, A. Tatu, and D. A. Keim. Quality Metrics in High-Dimensional Data Visualization: An Overview and Systematization. Proceedings of the IEEE Symposium on Information Visualization (InfoVis), 17(12):2203-2212, Dec. 2011.

The contributions: this publication was authored equally by E. Bertini and myself. We decided to show this by listing our names alphabetically in the author list. E. Bertini and I conducted the literature review, came up with the systematization and description model of quality metrics, and described this process in the paper. D. Keim played the devil's advocate to test our model and gave advice for improvement. All parts of the paper were written and revised several times by both leading authors. Thus, in this thesis I use the paper text without citation marks.

6. M. Sedlmair, A. Tatu, T. Munzner, and M. Tory. A taxonomy of visual cluster separation factors. Computer Graphics Forum (EuroVis), 31(3pt4):1335-1344, June 2012.

The contributions: M. Sedlmair took the lead in writing this publication. M. Sedlmair and I conducted the qualitative analysis of the more than 800 plots and labeled all the cases with different keywords. Based on these, M. Sedlmair and T. Munzner came up with the taxonomy and described it in the paper. I tested special cases, such as the influence of the grid size, during the writing process of the paper. M. Tory accompanied the paper with suggestions for improvements to the analysis and taxonomy and revised the text. In this thesis, I describe the results presented in that paper, without using the text, and I provide further ideas for research in this area.

7. A. Tatu, F. Maaß, I. Färber, E. Bertini, T. Schreck, T. Seidl, and D. Keim. Subspace Search and Visualization to Make Sense of Alternative Clusterings in High-Dimensional Data. IEEE Symposium on Visual Analytics Science and Technology (VAST), pages 63-72, 2012.

The contributions: for this publication I took the lead on the project and the paper writing. F. Maaß implemented the subspace tool, advised by myself, E. Bertini and T. Schreck. T. Schreck gave advice on structuring the paper and presenting the results and provided initial sections of the paper. I. Färber provided an initial section on subspace clustering. T. Seidl and D. Keim gave advice on the project. Major parts of the paper were written by myself, and all the other parts were revised several times by me. Thus, in this thesis I use the paper text without citation marks.

8. A. Tatu, L. Zhang, E. Bertini, T. Schreck, D. A. Keim, S. Bremm, and T. von Landesberger. ClustNails: Visual Analysis of Subspace Clusters. Tsinghua Science and Technology, Special Issue on Visualization and Computer Graphics, 17(4):419-428, Aug. 2012.

The contributions: for this publication I took the lead on the project and the paper writing. I implemented the subspace tool, supported for some components by L. Zhang. E. Bertini and T. Schreck gave advice on structuring the paper and presenting the results and provided initial sections that I shaped for the final submission. D. A. Keim, S. Bremm, and T. von Landesberger gave advice on the project. Major parts of the paper were written by myself, and I revised all the other parts written by my co-authors several times to shape the final paper version. Thus, in this thesis I use the paper text without citation marks.

Other publications to which I contributed but that are not included in this thesis:

1. M. Schaefer, L. Zhang, T. Schreck, A. Tatu, J. A. Lee, M. Verleysen and D. A. Keim. Improving projection-based data analysis by feature space transformations. In Proceedings of SPIE 8654, Visualization and Data Analysis, 2013.

2. B. Bustos, D. A. Keim, D. Saupe, T. Schreck and A. Tatu. Methods and User Interfaces for Effective Retrieval in 3D Databases (in German). Datenbank-Spektrum - Zeitschrift fuer Datenbanktechnologie und Information Retrieval, dpunkt.verlag, 7(20):23-32, 2007.

2 High-Dimensional Data Analysis

“You can observe a lot by watching.”

Yogi Berra

Contents

2.1 Basic Techniques for High-Dimensional Data Analysis
    2.1.1 Common Challenges with High-Dimensional Data
    2.1.2 Feature Selection and Feature Extraction
2.2 Information Visualization Techniques for High-Dimensional Data
    2.2.1 Information Visualization Techniques
    2.2.2 Limitations while Visualizing High-Dimensional Data
2.3 Automated Techniques for High-Dimensional Data
    2.3.1 Data Mining Techniques for High-Dimensional Data
    2.3.2 Quality Measures for High-Dimensional Data Visualizations
2.4 Visual Analytics for High-Dimensional Data
    2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis
    2.4.2 Subspace Cluster Analysis and Visualization

High-dimensional data contains complex patterns, and different data analysis approaches have been developed during the past years to uncover the possible hidden patterns in this data. As outlined in the following, this thesis is related to a number of broader areas in data analysis and visualization of high-dimensional data.

In this chapter, Section 2.1 describes the main challenges when dealing with high-dimensional data and some basic techniques to reduce its dimensionality. Section 2.2 gives an overview of existing visualization techniques for high-dimensional data and identifies the visualization challenges that arise due to the data complexity. Section 2.3 presents a series of automated techniques from Data Mining for pattern analysis in high-dimensional data, focusing on clustering. The second part of that section presents mechanisms to quantify the quality of visualizations, called quality metrics. Due to the limitations of a purely visual-interactive solution or a solely automatic approach, in Section 2.4 we present works from related fields where the interplay of visualization and automation together with interactive features can provide better solutions to the tasks at hand. All examples in these sections are in the context of pattern finding and understanding of high-dimensional data.

Parts of this chapter appeared in [27, 132, 133, 134, 135, 136].

2.1 Basic Techniques for High-Dimensional Data Analysis

2.1.1 Common Challenges with High-Dimensional Data

Before presenting different techniques to analyze high-dimensional data sets, we will discuss two common challenges in this area.

The first issue is the so-called curse of dimensionality, which is known to make high-dimensional analysis problems difficult. The term was formulated by R. Bellman [20] in the context of dynamic programming and describes the fact that the data becomes sparse as dimensionality increases. In other words, in high-dimensional data everything tends to be nearly equidistant, which makes it hard to distinguish between objects. Additionally, many existing Data Mining algorithms have a complexity that is exponential in the number of data dimensions. With increasing dimensionality, these algorithms become computationally intractable and therefore inapplicable in many real applications.

The second issue is that the meaning of similarity in a high-dimensional space is therefore diminished. It was shown in [28] that, as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. This problem influences the design of similarity functions for objects in high-dimensional spaces.
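The convergence of nearest and farthest distances described above can be observed in a small experiment. The following is a minimal sketch and is not taken from [28]: it assumes points drawn uniformly from the unit hypercube, the Euclidean distance, and a hypothetical helper `relative_contrast` that reports (d_max - d_min) / d_min for a random query point. Under these assumptions, the reported value should shrink as the number of dimensions grows.

```python
import math
import random


def relative_contrast(num_points: int, dim: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption (illustrative only): the query and all data points are
    sampled uniformly from the unit hypercube [0, 1]^dim, and distances
    are Euclidean.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the contrast between farthest and nearest neighbor per dimensionality.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(500, dim), 3))
```

A contrast close to zero means that the nearest and the farthest point are almost equally far away from the query, which is the situation in which distance-based similarity loses its discriminative power.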

2.1.2 Feature Selection and Feature Extraction

A simple, but sometimes very effective, way to deal with high-dimensional data is to reduce the number of dimensions by eliminating those that seem to be irrelevant.

Dimension reduction can be achieved by either feature selection [61] or feature extraction [44]. Feature selection is the problem of selecting from a large space of input features (or dimensions) a smaller number of features that optimize a measurable criterion, e.g., the accuracy of a classifier [97].
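To illustrate the idea of keeping the features that score best under a measurable criterion, the following is a minimal filter-style sketch. It is an illustration only and not a method from the cited works: the function name `select_features` is hypothetical, and the per-dimension variance is assumed as a stand-in criterion because it needs no class labels. Any other per-dimension score could be passed in instead.

```python
from statistics import pvariance
from typing import Callable, List, Sequence


def select_features(
    rows: Sequence[Sequence[float]],
    k: int,
    criterion: Callable[[Sequence[float]], float] = pvariance,
) -> List[int]:
    """Return the sorted indices of the k dimensions with the highest score.

    Assumption (illustrative only): each dimension is scored on its own,
    by default with its population variance, and the top k are kept.
    """
    columns = list(zip(*rows))  # transpose: one tuple of values per dimension
    scores = [criterion(column) for column in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    data = [
        [1.0, 5.0, 0.0],
        [2.0, 5.0, 10.0],
        [3.0, 5.0, 20.0],
    ]
    # The constant second dimension carries no variance and is dropped.
    print(select_features(data, k=2))
```

Scoring each dimension independently is a simplification: it ignores interactions between dimensions, which subset-based selection strategies take into account at a higher computational cost.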

Feature extraction methods reduce the dimensionality of the data by forming a new set of dimensions as a linear or nonlinear combination of the original dimensions. These synthetic dimensions represent most (or all) of the structure of the original data set by using fewer attributes. Depending on the training data, the methods can be supervised or unsupervised. “Supervised methods rely on class labels and optimize the performance of a supervised learning algorithm, typically a classifier. Unsupervised methods rely on quality criteria measured from the output of an unsupervised learning method, typically a clustering algorithm. However, many algorithms have variations for both supervised and unsupervised learning” [119]. Most automatic feature selection methods rely on supervised information (e.g., class labeled data) to perform the selection. Consequently, they are not directly applicable to the explorative analysis problem.

To convey the fundamental principle of feature extraction techniques, the next paragraphs describe two traditional dimension reduction methods: principal component analysis (PCA) [83] and multidimensional scaling (MDS) [41].

PCA tries to preserve the variance in the data and transforms the set of possibly correlated dimensions into a new set of linearly uncorrelated dimensions, called principal components, each of which is a linear combination of the original dimensions. The first component contains the largest variance of the original dimension set, the second component is linearly uncorrelated to the previous one and also contains the maximal possible
