
Visual Analytics of Patterns in High-Dimensional Data

Dissertation zur Erlangung des akademischen Grades eines Dr. rer. nat.

vorgelegt von

Andrada Tatu

an der

Mathematisch-Naturwissenschaftliche Sektion
Fachbereich Informatik und Informationswissenschaft

Tag der mündlichen Prüfung: 12. Juli 2013

Referenten:
Prof. Dr. Daniel A. Keim, Universität Konstanz
Prof. Dr. Oliver Deussen, Universität Konstanz
Prof. Dr. Giuseppe Santucci, Sapienza Università di Roma

Konstanzer Online-Publikations-System (KOPS)
URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-243266


Pentru părinții mei iubitori.


Acknowledgements

This dissertation is the most important milestone in my academic career. One of the joys of completion is to look back and remember all the mentors, friends, collaborators, colleagues and family who have guided, supported, and inspired me along this fulfilling journey.

First and foremost, I would like to express my deep appreciation to my advisor, Professor Dr. Daniel Keim, who has stirred my interest in Visual Analytics early on in my studies. He has not only been a strong supporter of my work, but he has also allowed me great freedom to develop my thesis. Without his guidance and persistent help, this dissertation would not have been possible. As a part of his group, I was able to perfect my research skills and draw appropriate conclusions.

In addition, I would like to thank my committee members, Professor Dr. Oliver Deussen and Professor Dr. Giuseppe Santucci, for their encouraging and insightful comments and their analytic questions that prompted me to shape my ideas comprehensively.

I am especially grateful to Dr. Enrico Bertini and Dr. Tobias Schreck, who closely accompanied my research during these years and motivated me to seek perfect solutions.

Many of the results reported here represent joint efforts. Their recommendations and instructions have enabled me to assemble and finish the dissertation effectively.

I would also like to express my gratitude to my collaborators for their guidance and inspirations in these past years, and especially name Ines Färber, Professor Dr. Thomas Seidl, Professor Dr. Tamara Munzner, Dr. Michael Sedlmair, Professor Dr. Melanie Tory, Georgia Albuquerque, Dr. Martin Eisemann, Dr. Jörn Schneidewind and Dr. Peter Bak.

I am grateful to my colleagues for creating a pleasant working atmosphere. A special thank you goes to Svenja Simon (for her friendship and tricky R programming sessions), Miloš Krstajić (for supporting all my moods and encouraging me throughout these years), Dr. Florian Mansmann (for getting me into the group and becoming a lovely friend), David Spretke (for accompanying me from the first day of my Bachelor studies to the last of my doctoral work as a friend and hardworking colleague), Dr. Andreas Stoffel (for always keeping his door open and the helpful debugging sessions), Christian Rohrdantz (for helpful suggestions and mental support during the writing phase and preparation of my defense talk), Dr. Leishi Zhang (for the great collaboration during the ClustNails project), Dr. Daniela Oelke (for initial paper writing suggestions and providing me the thesis template), and Sabine Kuhr (for her support in administrative work). I am very happy that, in many cases, my friendship with all of you has enriched my time beyond our shared time in the office.

Special thanks goes to my student assistant Fabian Maaß, who implemented parts of the subspace visualization system and whose creativity shaped the research outcome.


This acknowledgement would not be complete without extending my sincere thanks to our DBVIS support team, which really made my life easier by providing fast, anytime technical support, computational power, and storage opportunities for my projects. I would like to specially mention Florian Stoffel and Juri Buchmüller.

Special thanks go to Mrs. Anna Dowden-Williams from the Academic Staff Development for proofreading most of my research papers and this thesis, which has profoundly improved its overall composition.

My deepest appreciation and gratitude goes, however, to my family, who has encouraged my studies from the start and provided me with the moral and emotional support needed through the entire process. They believed in my dream and helped me to fulfill it.

I will be forever grateful for your unconditional love and support.

I also gratefully acknowledge the financial support received from the German Research Foundation (DFG) under research grant DFG-611 within the DFG Priority Program “Scalable Visual Analytics: Interactive Visual Analysis Systems of Complex Information Spaces” (SPP 1335). I also acknowledge being an associated PhD student of the GK-1042 PhD Graduate Program “Explorative Analysis and Visualization of Large Information Spaces”.


Abstract

Due to the technological progress over the last decades, today’s scientific and commercial applications are capable of generating, storing, and processing massive amounts of data. This influences the type of data generated: with each data entry, different aspects are combined and stored in one common database. Often the describing attributes are numeric; we call data with more than a handful of attributes (dimensions) high-dimensional. Making use of these types of data archives poses new challenges to analysis techniques.

The work of this thesis centers around the question of finding interesting patterns (meaningful information) in high-dimensional data sets. This task is highly challenging because of the so-called curse of dimensionality, which expresses that as dimensionality increases, the data becomes sparse. This phenomenon disturbs standard analysis techniques.

Automatic techniques have to deal with the data complexity, which not only increases their runtime but also degrades their computation functions (such as distance functions). Moreover, exploring these data sets visually is hindered by the high number of dimensions that have to be displayed on the two-dimensional screen space.

This thesis is motivated by the idea that searching for interesting patterns in this kind of data can be done through a mixed approach of automation, visualization, and interaction. The amount of patterns a visualization contains can be measured by so-called quality metrics. These automated functions can then filter the high number of high-dimensional visualizations and present the user with a pre-filtered subset of good candidates for further investigation. We propose quality metrics for scatterplots and parallel coordinates, focusing on different user tasks such as identifying clusters and correlations. We also evaluate these measures with regard to (1) their ability to identify clusters in a variety of real and synthetic data sets and (2) their correlation with human perception of clusters in scatterplots.

A thorough discussion of the results follows, reflecting on their impact and on directions for future research.

As quality metrics were developed for a large number of different high-dimensional visualization techniques, we present our reflections on how these methods are related to each other and how the approach can be developed further. For this purpose, we provide an overview of approaches that use quality metrics in high-dimensional data visualization and propose a systematization based on a comprehensive literature review.

In high-dimensional data, patterns often exist only in a subset of the dimensions. Subspace clustering techniques aim at finding these subspaces where clusters exist, clusters which might otherwise be hidden if a traditional clustering algorithm is applied. While subspace clustering approaches tackle the sparsity problem in high-dimensional data well, designing effective visualizations to help analyze the clustering results is not trivial. In addition to the cluster membership information, the relevant sets of dimensions and the overlaps of memberships and dimensions also need to be considered. Although a number of techniques (for example, scatterplots, heat maps, dendrograms, hierarchical parallel coordinates) exist for visualizing traditional clustering results, little research has been done on visualizing subspace clustering results. Moreover, while extensive research has been carried out on designing subspace clustering algorithms, surprisingly little attention has been paid to developing effective visualization tools for analyzing the clustering results. Appropriate visualization techniques will not only help in monitoring the clustering process but, with special mining techniques, could also enable the domain expert to guide and even steer the subspace clustering process to reveal the patterns of interest. Toward this goal, we envision a concept that combines subspace clustering algorithms with interactive, scalable visual exploration techniques. This work includes the task of comparative visualization and feedback-guided computation of alternative clusterings.


Zusammenfassung

Bedingt durch den technologischen Fortschritt der letzten Jahrzehnte sind heutige kommerzielle Applikationen in der Lage, riesige Datenmengen zu erzeugen, zu speichern und zu verarbeiten. Diese Entwicklung beeinflusst auch die Natur der erzeugten Daten, d.h. dass für jeden Dateneintrag unterschiedliche Aspekte in der gleichen Datenbank gespeichert werden. Oft sind die beschreibenden Attribute numerisch. Datensätze, die mehr als fünf solcher Attribute (Dimensionen) beinhalten, nenne ich hochdimensional. Der wertbringende Gebrauch solcher Datenarchive bringt neue Herausforderungen an Analysetechniken mit sich.

Die vorliegende Dissertation bearbeitet die Fragestellung, wie interessante Muster (bedeutende Information) in hochdimensionalen Räumen gefunden werden können. Diese Aufgabenstellung ist durch das Problem des Fluches der Dimensionalität äußerst herausfordernd. Dieses Problem besagt, dass Daten im hochdimensionalen Raum spärlich vorkommen. Herkömmliche Analysetechniken werden dadurch beeinträchtigt. Automatische Methoden müssen die Datenkomplexität nicht nur ihre Laufzeit, sondern auch ihre Berechnungsfunktionen (z.B. Distanzfunktionen) betreffend einbeziehen. Außerdem wird die visuelle Exploration dieser Daten durch die Zweidimensionalität der Darstellungen beeinträchtigt.

Diese Dissertation stützt sich auf das Konzept, dass die Suche nach interessanten Mustern in hochdimensionalen Datenmengen mit einem kombinierten Ansatz von automatischen, visuellen und interaktiven Methoden durchgeführt werden kann. Die Ausprägung der Muster einer Visualisierung kann durch sogenannte Qualitätsmaße gemessen werden. Durch diese automatischen Funktionen kann die große Menge an hochdimensionalen Visualisierungen eingegrenzt und dem Benutzer eine ausgewählte Menge zur weiteren Untersuchung zur Verfügung gestellt werden. Ich schlage Qualitätsmaße für Scatterplots und Parallele Koordinaten vor, die sich auf unterschiedliche Aufgaben, wie die Identifikation von Gruppen oder Korrelationen, konzentrieren. Zusätzlich werden diese Techniken bezüglich (1) ihrer Fähigkeit, Cluster in unterschiedlichen realen und synthetischen Datensätzen zu identifizieren, und (2) ihrer Korrelation mit der menschlichen Wahrnehmung untersucht.

Der ausführlichen Diskussion dieser Resultate folgen Überlegungen für die zukünftige Forschung.

Da viele verschiedene Qualitätsmaße für eine Reihe weiterer hochdimensionaler Visualisierungen entwickelt wurden, werde ich Vorschläge für deren Vernetzung und Weiterentwicklung vorstellen. Hierfür wird eine Übersicht über die verschiedenen Ansätze erstellt, welcher eine Systematisierung zugrunde liegt, die aufgrund einer umfassenden Literaturauswertung zustande kam.

Im hochdimensionalen Raum existieren manche Muster nur in verschiedenen Unterräumen des Datenraumes. Subspace-Clustering-Algorithmen wurden entwickelt, um Unterräume zu finden, in denen Cluster existieren, die durch traditionelle Clustering-Algorithmen nicht gefunden werden würden. Obwohl diese Algorithmen spärlich mit Daten besetzte, hochdimensionale Räume gut explorieren können, ist das Entwickeln von effektiven Visualisierungstechniken, um diese Clusteringresultate zu analysieren, nicht trivial. Zusätzlich zu der Clusterzugehörigkeit von Elementen müssen die relevanten Attributmengen eines Clusters und die Objekt- und Dimensionsüberlappungen von Subspaceclustern dargestellt werden. Auch wenn eine Reihe von Techniken für die Visualisierung von traditionellen Clustering-Resultaten existiert (z.B. Scatterplots, Heatmaps, Dendrogramme, hierarchische Parallele Koordinaten), gibt es nur wenige Ansätze, um das Resultat von Subspace-Clustering-Algorithmen zu visualisieren. Außerdem wurden bisher erstaunlich wenige Ansätze vorgestellt, die eine visuelle Analyse der Subspace-Clustering-Ergebnisse unterstützen können, obwohl im Bereich der Subspace-Clustering-Algorithmen viel Forschung betrieben wurde. Angemessene Visualisierungstechniken, die von speziellen Methoden zur Extraktion von Informationen unterstützt werden, würden nicht nur die Nachverfolgung der Clustering-Ergebnisse ermöglichen, sondern auch Fachleuten dabei helfen, den Subspace-Clustering-Prozess so zu steuern, dass relevante Muster zum Vorschein kommen. Dieses Ziel vor Augen stelle ich ein Konzept vor, das Subspace-Clustering-Algorithmen mit interaktiven skalierbaren Visualisierungen kombiniert. Meine Ansätze widmen sich deshalb der Aufgabe der Visualisierung zum Vergleich von alternativen Clustergruppen, die durch Nutzerfeedback gesteuert werden.


Contents

1 Introduction
  1.1 Need for Visual Interactive Data Exploration
  1.2 Contributions of the Thesis
  1.3 Thesis Structure

2 High-Dimensional Data Analysis
  2.1 Basic Techniques for High-Dimensional Data Analysis
    2.1.1 Common Challenges with High-Dimensional Data
    2.1.2 Feature Selection and Feature Extraction
  2.2 Information Visualization Techniques for High-Dimensional Data
    2.2.1 Information Visualization Techniques
    2.2.2 Limitations while Visualizing High-Dimensional Data
  2.3 Automated Techniques for High-Dimensional Data
    2.3.1 Data Mining Techniques for High-Dimensional Data
    2.3.2 Quality Measures for High-Dimensional Data Visualizations
  2.4 Visual Analytics for High-Dimensional Data
    2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis
    2.4.2 Subspace Cluster Analysis and Visualization

3 Quality Measures based Visual Analysis of High-Dimensional Data
  3.1 Quality Measures for Scatterplots and Parallel Coordinates
    3.1.1 Overview and Problem Description
    3.1.2 Quality Measures for Scatterplots with Unclassified Data
    3.1.3 Quality Measures for Scatterplots with Classified Data
    3.1.4 Quality Measures for Parallel Coordinates with Unclassified Data
    3.1.5 Quality Measures for Parallel Coordinates with Classified Data
    3.1.6 Application on Real Data Sets
    3.1.7 Evaluation of the Measures’ Performance Using Synthetic Data
    3.1.8 Conclusion and Future Work
  3.2 Quality Measures and Human Perception – An Empirical Study
    3.2.1 Measures
    3.2.2 Empirical Evaluation
    3.2.3 Results
    3.2.4 Discussion
    3.2.5 Guidelines
    3.2.6 Conclusion and Future Work

4 A Systematization of Quality Metrics in High-Dimensional Data Visualization
  4.1 Quality Metrics in High-Dimensional Data Visualization
    4.1.1 Background
    4.1.2 Methodology
    4.1.3 Quality Metrics Pipeline
    4.1.4 Systematic Analysis
    4.1.5 Examples
    4.1.6 Findings
    4.1.7 Directions for Further Research
    4.1.8 Limitations
    4.1.9 Conclusion and Future Work
  4.2 Visual Cluster Separation Factors: Sketching a Taxonomy
    4.2.1 Introduction
    4.2.2 Method
    4.2.3 Visual Cluster Separation Taxonomy
    4.2.4 Discussion and Further Research

5 Visual Subspace Analysis of High-Dimensional Data
  5.1 Visual Exploration for Subspace Clustering
    5.1.1 Motivation
    5.1.2 Subspace Clustering Algorithms
    5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis
    5.1.4 The ClustNails System
    5.1.5 Use Case and System Comparison
    5.1.6 Conclusions and Future Work
  5.2 Visual Analytics of Subspace Search
    5.2.1 Introduction
    5.2.2 Subspace Analysis
    5.2.3 Proposed Analytical Workflow
    5.2.4 Application
    5.2.5 Discussion and Possible Extensions
    5.2.6 Conclusions

6 Conclusion and Future Work
  6.1 Summary of Contributions and Future Work

List of Figures
List of Tables

A Appendix
  A.1 Original Data Dimensions for Used Data Sets
  A.2 Empirical Study
    A.2.1 General Questions Form
    A.2.2 Experiment Form
    A.2.3 Additional Experiment Results
  A.3 Quality Metrics Pipelines for the Literature Review
  A.4 Hierarchical Grouping of Interesting Subspaces

Bibliography


1 Introduction

“Everybody gets so much information all day long that they lose their common sense.”

Gertrude Stein

Contents

1.1 Need for Visual Interactive Data Exploration
1.2 Contributions of the Thesis
1.3 Thesis Structure

1.1 Need for Visual Interactive Data Exploration

Today, data is produced everywhere: everything is recorded, from production processes in industry to employees’ working behavior and their personal data. Even animals are equipped with sensors and all their movements are recorded over long periods of time; the click behavior of internet users is traced, and supermarket purchases are stored for later analysis. Since today’s technology allows for inexpensive and abundant storage space, even more data will be stored in the near future. At the same time, these advantages reveal the problem of how to handle the data most effectively. The gap between the generated data and the understanding of it increases [154], which also poses a challenge for analysis techniques: it is difficult to filter and extract relevant information since not only the volume increases, but also the complexity.

Visualization has long been used as an effective tool to explore and make sense of data, especially when analysts need to generate hypotheses about the information that is hidden in the data. While some techniques and commercial products have proven useful in providing effective solutions, modern databases can store data of a complexity that goes well beyond the limits of human understanding.

The goal of this thesis is pattern finding in high-dimensional or multidimensional data.

The methods presented here work with numerical data sets with a large number of objects and a large number of dimensions, also called attributes. Depending on the application area, a large number of objects can already start at hundreds and go up to thousands. The same is true for the describing attributes, or features, of the objects. In this work, we call all data sets with more than a hundred objects and more than ten dimensions high-dimensional. An example of analysis tasks based on a customer database will be described later in this section.

Classical data exploration requires the user to find interesting phenomena in the data interactively, starting with an initial visual representation. In [36] the authors suggest that “the purpose of visualization is insight, not pictures”. Techniques for high-dimensional data visualization can also incorporate automated analysis components to reduce the data’s complexity and to effectively guide the user during the interactive exploration process. This process is called visual analytics. “Visual analytics strives to facilitate the analytical reasoning process by creating software that maximizes human capacity to perceive, understand, and reason about complex and dynamic data and situations” [137].

Patterns are also not a new concept when analyzing data. Witten and Frank expressed this perfectly in [154]: “There is nothing new about this” (patterns). “People have been seeking patterns in data since human life began. Hunters seek patterns in animal migration behavior, farmers seek patterns in crop growth, politicians seek patterns in voter opinion, and lovers seek patterns in their partners’ responses. A scientist’s job (like a baby’s) is to make sense of data, to discover the patterns that govern how the physical world works and encapsulate them in theories that can be used for predicting what will happen in new situations.”

In large-scale multivariate data sets, purely interactive exploration becomes ineffective or even infeasible since the number of possible representations grows rapidly with the number of dimensions. Methods are needed that help the user automatically find effective and expressive visualizations. Effective and efficient analysis methods for large multidimensional data are necessary to understand the complexity of the information hidden in these databases. Data dimensionality is often the major limiting factor.
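The growth of the view space is easy to quantify for the simplest case of axis-pair views: a scatterplot matrix of a d-dimensional data set already contains d(d−1)/2 distinct plots, as the following minimal sketch (an illustration, not part of the thesis) shows:

```python
def num_scatterplots(d: int) -> int:
    """Number of distinct axis-pair scatterplots (one half of a SPLOM) for d dimensions."""
    return d * (d - 1) // 2

for d in (5, 20, 100):
    print(d, num_scatterplots(d))  # 10, 190, 4950
```

Already at 100 dimensions, no analyst can inspect all 4950 pairwise views by hand, which motivates the automated view ranking discussed below.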

For automatic pattern detection, a typically employed paradigm is clustering: identifying groups of objects based on their mutual similarity. Unlike in traditional settings, for the aforementioned high-dimensional data, considering all features simultaneously is no longer effective due to the so-called curse of dimensionality [28]. As dimensionality increases, the distances between any two objects become less discriminative. Moreover, the probability of many dimensions being irrelevant for the underlying cluster structure increases. In such data sets it can be observed that each object may participate in different groupings, meaning that objects may have different roles. In comparison, in classical clustering each object belongs to one cluster, and the data set is partitioned into a number of clusters. “For example, in customer segmentation, we observe for each customer multiple possible behaviors which should be detected as clusters. In other domains, such as sensor networks each sensor node can be assigned to multiple clusters according to different environmental events. In gene expression analysis, objects should be detected in multiple clusters due to the various functions of each gene. In general, multiple groupings are desired as they characterize different views of the data” [103].

If we consider, for example, a customer database with a large number of customers (rows in the table) described by a large number of attributes (columns in the table), we may ask how these customers relate to each other and what kind of patterns, in this case groups, can be identified in this database. Figure 1.1 shows a toy example of such multiple valid groupings for one database. We can have groups like “rich oldies”, “healthy sporties”, “unhealthy gamers”, “unemployed people”, “average people”, and “sport professionals”¹. To facilitate the data analysis in this direction, we present in Chapter 5 visual interactive systems and new analysis methods to support the understanding and comparison of different groupings in high-dimensional data.

¹This image appeared in the tutorial slides of Müller et al. [104] and the describing story is made up by myself.

Figure 1.1: Multiple valid and interesting groupings of a high-dimensional data set [104].

As already mentioned, this thesis is about visual analytics of patterns in high-dimensional data. To assist the analysis of such data sets, effective information visualization techniques, which provide a mapping of data properties to the screen, have been developed and are needed to make sense of the complex data at hand. The visualization of large complex information spaces typically involves mapping high-dimensional data to lower-dimensional visual representations. The challenge for the analyst is to find an insightful mapping while the dimensionality of the data, and consequently the number of possible mappings, increases.

As we will see later in Chapter 2, numerous expressive and effective low-dimensional visualizations for high-dimensional data sets have been proposed in the past, such as scatterplots and scatterplot matrices (SPLOM) [37], parallel coordinates [78], glyph-based techniques [147], pixel-based displays [145] and geometrically transformed displays [86, 145]. However, finding information-bearing and user-interpretable visual representations automatically remains a difficult task since there could be a large number of possible representations. In addition, it could be difficult to explain their relevance to the user.

Finding relations, patterns, and trends over numerous dimensions is also difficult because the projection of n-dimensional objects onto 2D spaces necessarily carries some form of information loss. Projection techniques like multidimensional scaling (MDS) and principal component analysis (PCA) offer traditional solutions by creating data embeddings that try, as much as possible, to preserve the distances of the original multidimensional space in the 2D projection. These techniques have, however, severe problems in terms of interpretation, as it is no longer possible to interpret the observed patterns in terms of the dimensions of the original data space.
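The information loss of low-dimensional views can be made concrete with a toy example (illustrative only; the data and function are assumptions of this sketch, not definitions from the thesis): two clusters that differ in a single dimension are indistinguishable in any 2D view that omits that dimension:

```python
import random
import statistics

rng = random.Random(1)
# Two synthetic clusters that differ ONLY in the third dimension.
cluster_a = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(-5, 1)) for _ in range(100)]
cluster_b = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(+5, 1)) for _ in range(100)]

def centroid_separation(pts_a, pts_b, dims):
    """Distance between the cluster centroids restricted to the given dimensions,
    i.e. how separated the clusters appear in that 2D view."""
    ca = [statistics.mean(p[d] for p in pts_a) for d in dims]
    cb = [statistics.mean(p[d] for p in pts_b) for d in dims]
    return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5

print(centroid_separation(cluster_a, cluster_b, (0, 1)))  # near 0: clusters overlap in this view
print(centroid_separation(cluster_a, cluster_b, (0, 2)))  # near 10: clusters clearly separated
```

A score of this kind, computed per view, is also the intuition behind the view-ranking quality measures discussed next: it lets a system rank the view on dimensions (0, 2) far above the view on (0, 1).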

Mechanisms to measure the quality of visualizations are therefore needed. In the past, quality measures have been developed for different areas, such as measures for data quality (outliers, missing values, sampling rate, level of detail), clustering quality (purity, the F-measure combining precision and recall, the Rand index [114], the silhouette coefficient [85], etc.), association rule quality (support and confidence [7], information gain [40], etc.), or the distance distribution measure in SURFING [16], a subspace search algorithm described and used in Chapter 5 to filter data spaces and find interesting subspaces. For visualizations, a number of authors have started introducing quality measures to quantify their importance. The rationale behind this method is that quality measures can help users reduce the search space by filtering out views with low information content. In the ideal system, users can select one or more measures and the system optimizes the visualization to reflect the user’s choice. This thesis also contributes to the field of quality measures; in Chapter 3, new measures are presented for scatterplot matrices and parallel coordinates plots.
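Of the clustering-quality measures named above, the Rand index [114] is perhaps the simplest to state: the fraction of object pairs on which two partitions agree. A minimal sketch (not the implementation used in this thesis):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs that two clusterings treat the same way:
    both place the pair in one cluster, or both separate it."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical partitions, merely relabeled
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # ~0.33: mostly disagreeing partitions
```

Because it compares pair memberships rather than label values, the index is invariant to relabeling, which is exactly what comparing alternative clusterings requires.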

However, there is one problem with these measures: the lack of empirical validation based on user studies. Such studies are needed to inspect the underlying assumption that the patterns captured by these measures correspond to the patterns captured by the human eye. Since many different patterns can be analyzed, this thesis starts with clusters in visualizations and pursues research in this direction by comparing some of the most promising quality measures for filtering cluster-bearing visualizations against the judgement of humans looking at the same visualizations.

The analysis of high-dimensional data is a ubiquitously relevant, yet notoriously difficult problem. Problems exist both in automatic data analysis and in the visualization of this kind of data. On the visual-interactive side, the limited number of available visual variables and the limited short-term memory of human analysts make it difficult to effectively visualize data with high numbers of dimensions. In Chapter 5 we tackle this problem from the visual-interactive side. We present a visual-interactive tool to make sense of clusters in different subspaces, as well as an approach to identify subspaces that might show complementary clusterings.

In summary, the focus of this thesis is to contribute to both sides of pattern finding in high-dimensional data: the automatic and the visual-interactive part. We believe that both parts are needed simultaneously to solve the problem. We therefore present automatic mechanisms, namely quality measures, to reduce the number of alternative visualizations of high-dimensional data, and on the other side we visualize the relations between results to support the user in an interactive pattern-finding process.

1.2 Contributions of the Thesis

This dissertation provides visual analytics mechanisms for pattern finding in high-dimensional data. In achieving this goal, we supply the following contributions:

Quality measures for scatterplots and parallel coordinates plots are developed. Visual quality metrics have recently been devised to automatically extract interesting visual projections from a large number of available candidates in the exploration of high-dimensional databases. The metrics permit, for instance, searching within a large set of scatterplots (e.g., in a scatterplot matrix) and selecting the views that contain the best separation among clusters. The rationale behind these techniques is that automatic selection of “best” views is not only useful but also necessary when the number of potential projections exceeds the limit of human interpretation (Chapter 3) [132, 133].

Validating the measures through a perceptual study. We present a perceptual study investigating the relationship between human interpretation of clusters in 2D scatterplots and the measures that were automatically extracted from these plots. Specifically, we compare a series of selected metrics and analyze how they predict human detection of clusters. A thorough discussion of results follows with reflections on their impact and directions for future research (Chapter 3) [134].

A systematization of techniques that use quality metrics to help in the visual exploration of meaningful patterns in high-dimensional data. We present reflections on how different quality measure methods are related to each other and how the approach can be developed further. For this purpose, we provide an overview of approaches that use quality metrics in high-dimensional data visualization and propose a systematization based on a thorough literature review. We carefully analyze the papers and derive a set of factors for discriminating the quality metrics, visualization techniques, and the process itself. A quality metrics pipeline is proposed to model all the encountered varieties of metrics (Chapter 4) [27].

A visual subspace cluster analysis system (ClustNails) to understand the result of subspace clustering. In subspace clustering, in addition to the grouping information (clusters), the relevance of dimensions for particular groups and the overlaps between groups, both in terms of dimensions and records, need to be analyzed. ClustNails integrates several novel visualization techniques with various user interaction facilities to support navigating and interpreting the result of subspace clustering algorithms (Chapter 5) [136].

A novel method for the visual analysis of high-dimensional data, for understanding high-dimensional data from different perspectives and investigating alternative clusterings. We employ an interestingness-guided subspace search algorithm to detect a candidate set of interesting subspaces that may contain important patterns for further analysis. Based on appropriately defined subspace similarity functions, we visualize the subspaces and provide navigation facilities to interactively explore large sets of subspaces. Our approach allows users to effectively compare and relate subspaces, identifying complementary or contradicting relations among them and thus identifying alternative clusterings (Chapter 5) [135].
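The subspace similarity functions themselves are defined in Chapter 5. Purely for illustration, one plausible ingredient of such a function is the overlap of the dimension sets of two subspaces, e.g. their Jaccard similarity (an assumption of this sketch, not the thesis’s actual definition):

```python
def dimension_jaccard(dims_a, dims_b):
    """Jaccard similarity of two subspaces, compared via their dimension sets:
    |intersection| / |union|, ranging from 0 (disjoint) to 1 (identical)."""
    a, b = set(dims_a), set(dims_b)
    return len(a & b) / len(a | b)

# Two hypothetical subspaces sharing 2 of 4 dimensions in total.
print(dimension_jaccard({0, 1, 2}, {1, 2, 3}))  # 0.5
```

A similarity of this kind lets highly overlapping subspaces be placed near each other in a layout, so that complementary or contradicting subspaces stand out visually.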

1.3 Thesis Structure

After illustrating the problem in the previous section and enumerating the contributions of this thesis, the remainder of the thesis is structured as follows.

Chapter 2 provides a brief overview of important related work in the field of high-dimensional data analysis, covering three main areas. Section 2.1 introduces the common challenges when analyzing high-dimensional data and presents dimension reduction techniques that reduce the data complexity. Section 2.2 describes important visualization techniques for high-dimensional data. Section 2.3 introduces standard automatic techniques from the Data Mining community and presents quality measures, i.e., automated ranking functions that judge the quality of a visualization with respect to a given task. Section 2.4 presents some examples where the interplay between visualization, automation, and interaction is far more beneficial than any of these techniques alone.

Chapter 3 proposes eight new quality metrics for different tasks and two visualization types: scatterplot matrices and parallel coordinates. The metrics are tested on a set of synthetic and real data sets to demonstrate their effect. To ensure that the metrics reflect the


user’s perception, a selected subset of measures for scatterplot matrices is evaluated and compared with the users’ judgments. We found that both perform similarly. Based on this study, we have formulated guidelines for further evaluation of existing metrics.

Based on a literature review, Chapter 4 introduces a systematization of different quality measures for high-dimensional data visualization. Their relations are described through characteristic factors, such as the visualization technique or the purpose, yielding a coherent and unified picture of these techniques. By putting the existing methods into a common framework, we hope to ease the generation of new research in the field and the spotting of relevant gaps to bridge with future work. Subsequently, Section 4.2 briefly presents the results of a qualitative data analysis that led to a taxonomy of visual cluster separability. These results are the basis for the following discussion of relevant aspects that arise when analyzing clusters visually and of what future work needs to focus on.

Chapter 5 presents two interactive systems that help to make sense of high-dimensional data sets with respect to different clusterings. Searching in subspaces is needed because automatic pattern search is done through clustering algorithms, and it is not feasible to search for clusters in the full space of high-dimensional data. Section 5.1 introduces a visual tool, ClustNails, to investigate the results of different state-of-the-art subspace clustering algorithms. The tool is intended to support the interpretation of these results with respect to the relations between subspace clusters. It answers questions such as: How many objects and dimensions does a cluster contain? Which dimensions overlap between clusters? Which objects are shared by several clusters?

Section 5.2 goes one step further and presents an analytical approach to support the identification of alternative clusterings in these subspaces. High dimensionality provides different facets of the data: in a data set about people, for example, we might find clusters from a taste-of-music perspective (rock music, classical music, jazz, etc.), but at the same time there may be different groupings of the same people describing their level of sportive activity. Both views on the data are valid, but they provide different insights. To discover such alternative clusterings in high-dimensional data, this section proposes an analytical workflow that searches the set of possible subspaces to identify interesting ones. We then group these subspaces according to their data similarity and provide filtering mechanisms for further interactive investigation.

Supported by interaction, different clusterings of the data can be identified.

Chapter 6 concludes the thesis and gives an overview of further research questions that seem worth investigating in the future.

A schematic overview of the chapter interrelations is shown in Figure 1.2.



Figure 1.2: Schematic overview of the interrelation of chapters in this thesis.

Parts of this thesis were published in:

1. A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M. Magnor, and D. Keim. Combining automated analysis and visualization techniques for effective exploration of high dimensional data. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (VAST), pages 59-66, 2009.

The contributions: for this publication I took the lead on the computer science research part of the paper, implementing the data space measures, and also led the writing of the paper itself. G. Albuquerque and M. Eisemann implemented the image quality metrics and provided their description in the paper as well as parts of the evaluation section concerning these metrics. The Histogram Density measures were programmed by myself. J. Schneidewind gave advice on structuring the paper and presenting the results. D. Keim accompanied the project with suggestions for improvements to the application and text. H. Theisel and M. Magnor gave advice to the project. All parts of the paper were revised several times by me; thus, in this thesis I use the paper text without citation marks. G. Albuquerque’s thesis (title unknown at the time of my submission) might also contain some text passages of this paper for the parts of the project she took part in.

2. A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. A. Keim. Automated Visual Analysis Methods for an Effective Exploration of High-Dimensional Data. IEEE Transactions on Visualization and Computer Graphics (TVCG), 17(5):584-597, May 2011.


The contributions: publication 1 was selected as one of the best papers of the VAST ’09 conference, and this publication is an invited extension of it. As primary author, I was responsible for writing the paper, generating new use cases, testing our measures, and describing further research directions in this area. G. Albuquerque implemented, described, and tested the new CSM measure. P. Bak gave advice on structuring the experiments and presenting the results. D. Keim accompanied the paper with suggestions for improvements to the application and text. M. Eisemann, H. Theisel, and M. Magnor gave advice to the paper. All parts of the paper were revised several times by me; thus, in this thesis I use the paper text without citation marks. G. Albuquerque’s thesis (title unknown at the time of my submission) might also contain some text passages of this paper for the parts of the project she took part in.

3. A. Tatu, P. Bak, E. Bertini, D. A. Keim, and J. Schneidewind. Visual quality metrics and human perception: an initial study on 2D projections of large multidimensional data. In Proceedings of the Working Conference on Advanced Visual Interfaces (AVI), pages 49-56. ACM, 2010.

The contributions: for this publication I took primary responsibility and, additionally, the lead on the automatic evaluation. P. Bak took the lead on the human experiment. Together we compared the results and evaluated them statistically. E. Bertini, D. Keim, and J. Schneidewind accompanied the paper with suggestions for improvements to the experimental design and text. All parts of the paper were revised several times by me; thus, in this thesis I use the paper text without citation marks.

4. D. J. Lehmann, G. Albuquerque, M. Eisemann, A. Tatu, D. A. Keim, H. Schumann, M. Magnor, and H. Theisel. Visualisierung und Analyse multidimensionaler Datensätze. Informatik-Spektrum, Springer Berlin/Heidelberg, 33(6):589-600, 2010.

The contributions: this publication was authored by D. Lehmann. My contribution was to describe the use of quality metrics for high-dimensional data. This thesis was inspired by the discussions around this paper.

5. E. Bertini, A. Tatu, and D. A. Keim. Quality Metrics in High-Dimensional Data Visualization: An Overview and Systematization. IEEE Transactions on Visualization and Computer Graphics (Proceedings InfoVis), 17(12):2203-2212, Dec. 2011.

The contributions: this publication was authored equally by E. Bertini and myself.

We decided to show this by enumerating our names alphabetically in the authors list.

E. Bertini and I conducted the literature review, came up with the systematization and description model of quality metrics, and described this process in the paper. D. Keim played the devil’s advocate to test our model and gave advice for improvement. All parts of the paper were written and revised several times by both leading authors. Thus, in this thesis I use the paper text without citation marks.

6. M. Sedlmair, A. Tatu, T. Munzner, and M. Tory. A taxonomy of visual cluster separation factors. Computer Graphics Forum (EuroVis), 31(3pt4):1335-1344, June 2012.

The contributions: M. Sedlmair took the lead in writing this publication. M. Sedlmair and I conducted the qualitative analysis of the over 800 plots and labeled all the cases with different keywords. Based on these, M. Sedlmair and T. Munzner came up with the taxonomy and described it in the paper. I tested special cases, such as the influence of the grid size, during the writing process of the paper. M. Tory accompanied the paper with suggestions for improvements to the analysis and taxonomy and revised the text. In this thesis, I describe the results presented in that paper without using its text, and I provide further ideas for research in this area.

7. A. Tatu, F. Maaß, I. F¨arber, E. Bertini, T. Schreck, T. Seidl, and D. Keim. Sub- space Search and Visualization to Make Sense of Alternative Clusterings in High-Dimensional Data. IEEE Symposium on Visual Analytics Science and Technology (VAST), pages 63-72, 2012.

The contributions: for this publication I took the lead on the project and the paper writing. F. Maaß implemented the subspace tool, advised by myself, E. Bertini, and T. Schreck. T. Schreck gave advice on structuring the paper and presenting the results by providing initial sections of the paper. I. Färber provided an initial section on subspace clustering. T. Seidl and D. Keim gave advice to the project. Major parts of the paper were written by myself, and all other parts were revised several times by me. Thus, in this thesis I use the paper text without citation marks.

8. A. Tatu, L. Zhang, E. Bertini, T. Schreck, D. A. Keim, S. Bremm, and T. von Landes- berger. ClustNails: Visual Analysis of Subspace Clusters. Tsinghua Science and Technology, Special Issue on Visualization and Computer Graphics, 17(4):419- 428, Aug. 2012.

The contributions: for this publication I took the lead on the project and the paper writing. I implemented the subspace tool, supported for some components by L. Zhang. E. Bertini and T. Schreck gave advice on structuring the paper and presenting the results and provided initial sections that I shaped for the final submission. D. A. Keim, S. Bremm, and T. von Landesberger gave advice to the project. Major parts of the paper were written by myself, and I revised all other parts by my co-authors several times to shape the final paper version. Thus, in this thesis I use the paper text without citation marks.

Other publications to which I contributed but which are not included in this thesis:

1. M. Schaefer, L. Zhang, T. Schreck, A. Tatu, J. A. Lee, M. Verleysen, and D. A. Keim. Improving projection-based data analysis by feature space transformations. In Proceedings of SPIE 8654, Visualization and Data Analysis, 2013.

2. B. Bustos, D. A. Keim, D. Saupe, T. Schreck, and A. Tatu. Methods and User Interfaces for Effective Retrieval in 3D Databases (in German). Datenbank-Spektrum, Zeitschrift für Datenbanktechnologie und Information Retrieval, dpunkt.verlag, 7(20):23-32, 2007.


2 High-Dimensional Data Analysis

„You can observe a lot by watching.”
Yogi Berra

Contents

2.1 Basic Techniques for High-Dimensional Data Analysis
2.1.1 Common Challenges with High-Dimensional Data
2.1.2 Feature Selection and Feature Extraction
2.2 Information Visualization Techniques for High-Dimensional Data
2.2.1 Information Visualization Techniques
2.2.2 Limitations while Visualizing High-Dimensional Data
2.3 Automated Techniques for High-Dimensional Data
2.3.1 Data Mining Techniques for High-Dimensional Data
2.3.2 Quality Measures for High-Dimensional Data Visualizations
2.4 Visual Analytics for High-Dimensional Data
2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis
2.4.2 Subspace Cluster Analysis and Visualization

High-dimensional data contains complex patterns, and different data analysis approaches have been developed during the past years to uncover the hidden patterns in this data. As outlined in the following, this thesis is related to a number of broader areas in the analysis and visualization of high-dimensional data.

In this chapter, Section 2.1 describes the main challenges when dealing with high-dimensional data and some basic techniques to reduce its dimensionality. Section 2.2 gives an overview of existing visualization techniques for high-dimensional data and identifies the visualization challenges that arise due to the data complexity. Section 2.3 presents a series of automated techniques from Data Mining for pattern analysis in high-dimensional data, focusing on clustering; its second part presents mechanisms to quantify the quality of visualizations, called quality metrics. Due to the limitations of a purely visual-interactive solution or a sole automatic approach, Section 2.4 presents works from related fields where the interplay of visualization and automation, together with interactive features, can provide better solutions to the tasks at hand. All examples in these sections are in the context of finding and understanding patterns in high-dimensional data.

Parts of this chapter appeared in [27, 132, 133, 134, 135, 136].


2.1 Basic Techniques for High-Dimensional Data Analysis

2.1.1 Common Challenges with High-Dimensional Data

Before presenting different techniques to analyze high-dimensional data sets, we will discuss two common challenges in this area.

The first issue is the so-called curse of dimensionality, which makes analysis problems in high dimensions notoriously difficult. The term was coined by R. Bellman [20] in the context of dynamic programming and describes the fact that, as dimensionality increases, the data becomes sparse. In other words, in high-dimensional data everything tends to be basically equidistant, making it hard to draw any distinctions between objects. Additionally, many existing Data Mining algorithms have a complexity that is exponential in the number of data dimensions.

With increasing dimensionality, these algorithms become computationally intractable and therefore inapplicable in many real applications.

The second issue concerns the diminished meaning of similarity in high-dimensional spaces. It was shown in [28] that, as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. This problem influences the design of similarity functions for objects in high-dimensional spaces.
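This distance-concentration effect is easy to reproduce empirically. The following pure-Python sketch (illustrative only; the point counts and dimensionalities are arbitrary choices, not taken from [28]) compares the ratio of the nearest to the farthest neighbor distance for random points in low- and high-dimensional unit cubes:

```python
import math
import random

def distance_spread(n_points, n_dims, seed=0):
    """Min and max Euclidean distance from the first of n_points
    uniformly random points in [0, 1]^n_dims to all the others."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = pts[0]
    dists = [math.dist(query, p) for p in pts[1:]]
    return min(dists), max(dists)

# In 2 dimensions the nearest neighbor is far closer than the farthest
# one; in 1000 dimensions the two distances nearly coincide.
lo_min, lo_max = distance_spread(200, 2)
hi_min, hi_max = distance_spread(200, 1000)
print(f"min/max distance ratio in    2-D: {lo_min / lo_max:.3f}")
print(f"min/max distance ratio in 1000-D: {hi_min / hi_max:.3f}")
```

The ratio approaches 1 as the dimensionality grows, which is exactly the effect that undermines nearest-neighbor-based similarity queries.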

2.1.2 Feature Selection and Feature Extraction

A simple, but sometimes very effective, way to deal with high-dimensional data is to reduce the number of dimensions by eliminating those that seem to be irrelevant.

Dimension reduction can be achieved by either feature selection [61] or feature extraction [44]. Feature selection is the problem of selecting, from a large space of input features (or dimensions), a smaller number of features that optimize a measurable criterion, e.g., the accuracy of a classifier [97].

Feature extraction methods reduce the dimensionality of the data by forming a new set of dimensions as linear or nonlinear combinations of the original dimensions. These synthetic dimensions represent most (or all) of the structure of the original data set using fewer attributes. Depending on the training data, the methods can be supervised or unsupervised. “Supervised methods rely on class labels and optimize the performance of a supervised learning algorithm, typically a classifier. Unsupervised methods rely on quality criteria measured from the output of an unsupervised learning method, typically a clustering algorithm. However, many algorithms have variations for both supervised and unsupervised learning” [119]. Most automatic feature selection methods rely on supervised information (e.g., class-labeled data) to perform the selection. Consequently, they are not directly applicable to the explorative analysis problem.

To explain the fundamental principle of feature extraction techniques, the next paragraphs describe the two traditional dimension reduction methods: principal component analysis (PCA) [83] and multidimensional scaling (MDS) [41].

PCA tries to preserve the variance in the data and transforms the set of possibly correlated dimensions into a new set of linearly uncorrelated dimensions, called principal components, that are linear combinations of the original dimensions. The first component contains the largest variance of the original dimension set; the second component is linearly uncorrelated to the previous one and again contains the maximal possible


variance, and so on. The data set can be reduced by maintaining only a smaller set of principal components as transformed dimensions.
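As an illustration of this principle (a minimal pure-Python sketch, not an implementation used in this thesis), the first principal component is the dominant eigenvector of the sample covariance matrix, computed here by simple power iteration:

```python
import random

def first_principal_component(data, iters=200):
    """Dominant eigenvector of the sample covariance matrix (power
    iteration) and the projection of the centered data onto it."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    # Power iteration converges to the eigenvector with the largest
    # eigenvalue, i.e., the direction of maximal variance.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    projected = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, projected
```

Projecting the centered data onto this component yields the one-dimensional representation with maximal variance; further components follow analogously after removing the variance already explained.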

MDS tries to preserve the pairwise distances between the data points. There are many variants of MDS, depending on the distance functions used [31]. The simplest version is linear MDS, also called classical scaling; its solution is very closely related to PCA when a Euclidean distance function is used.
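The core computation of classical scaling can be sketched as follows: double-centering the matrix of squared Euclidean distances recovers the inner-product (Gram) matrix of the centered coordinates, whose top eigenvectors then give the low-dimensional embedding (the eigendecomposition itself is omitted here for brevity):

```python
def double_center(sq_dists):
    """Classical-scaling step: B = -1/2 * J D J, where D holds squared
    Euclidean distances and J = I - (1/n) 11^T is the centering matrix.
    B equals the Gram matrix X_c X_c^T of the centered coordinates."""
    n = len(sq_dists)
    row_mean = [sum(row) / n for row in sq_dists]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq_dists[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]
```

Because the result coincides with the Gram matrix of the centered points, the eigenvectors of B reproduce the PCA solution for Euclidean distances, which is the relation to PCA mentioned above.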

All these techniques rely on the idea that the variation of the data can be explained by a smaller number of transformed features. Their main difference to feature selection methods is that, instead of choosing a subset of dimensions from the data, they create new dimensions defined as functions over all dimensions. They also do not consider class labels; their computation relies on the data points alone.

General problems of these techniques are that the mapping is often not unique, that several parameters influence the result, and that the resulting dimensions are sometimes hard to interpret: the original dimensions come from a specific domain and have a certain interpretation (like age or income), but their linear combinations can hardly be interpreted.

Koren and Carmel propose a series of new methods for creating projections of high-dimensional data sets using linear transformations [89]. For non-labeled data, they propose a generalization of PCA, the normalized PCA, which normalizes the squared pairwise distances to reduce the dominance of the large distances that typically occur in the standard PCA transformation. For labeled data, their methods integrate the class labels of the data into the computation, resulting in projections with a clearer separation between the classes. Compared to traditional PCA or MDS, these methods have the advantage that they also capture intra-cluster shapes.

In addition to PCA and MDS, further techniques based on linear or non-linear transformations of the original features have been developed to obtain a reduced set of synthetic dimensions. Detailed surveys can be found in [111, 153]. Another prominent group of dimension reduction techniques, which we recall shortly at this point, relies on signal processing techniques that, when applied to a data vector, transform it into a numerically different vector [64]. Examples are the Discrete Fourier Transform, the Cosine Transform, and the Wavelet Transform. Since input and transformed data vectors have the same length, the data is reduced by truncating the transformed vector (e.g., the wavelet coefficients) at a user-specified threshold.
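A minimal sketch of this truncation idea for the Haar wavelet (one decomposition level only; real applications recurse on the averages and choose the threshold from the data):

```python
def haar_step(vec):
    """One Haar level: pairwise averages and detail coefficients
    (the length of vec must be even)."""
    avgs = [(vec[i] + vec[i + 1]) / 2 for i in range(0, len(vec), 2)]
    dets = [(vec[i] - vec[i + 1]) / 2 for i in range(0, len(vec), 2)]
    return avgs, dets

def inverse_haar_step(avgs, dets):
    """Reconstruct the original vector from averages and details."""
    out = []
    for a, d in zip(avgs, dets):
        out += [a + d, a - d]
    return out
```

Zeroing the detail coefficients below a threshold halves the stored information (only the averages and the few large details remain) while the reconstruction stays close to the original vector, which is exactly the user-controlled trade-off described above.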

2.2 Information Visualization Techniques for High-Dimensional Data

2.2.1 Information Visualization Techniques

The representation of high-dimensional data is one of the main research challenges in visualization. Several techniques have been developed in recent years to deal with the problem of representing relations among many dimensions on a computer display, which is inherently two-dimensional. By additionally using visual variables such as color and shape, data visualizations can go somewhat beyond 2D, but they still face various issues when representing high-dimensional data sets. Classic approaches include parallel coordinates, scatterplot matrices, glyph-based and pixel-oriented techniques [145]. Figure 2.1 shows some examples


for these techniques taken from [145].



Figure 2.1: High-dimensional visualization techniques taken from [145]. A: Scatterplot matrix showing on the diagonal a histogram plot for each dimension. Selected points are marked in red in all plots. B: Parallel coordinates plot of a seven-dimensional data set. One polyline representing one data point is highlighted in red. C: Star glyphs in a MDS layout. D: Dense pixel displays representing a 14-dimensional data set.

Scatterplots and Scatterplot Matrices [37]

2D scatterplots are one of the most commonly used visualization techniques in data analysis. The data is represented by points in a rectangular box, with the value of one variable (dimension) determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. To represent a data set of higher dimensionality, a common approach is to build a scatterplot matrix (SPLOM) [37].

Figure 2.1A shows an example of such a matrix for a four-dimensional data set, where every pair of dimensions is represented in one scatterplot. The matrix shows every plot twice, being symmetrical with respect to the diagonal. Additionally, on the diagonal, dimension histograms show the value distribution for each dimension. Selected points are highlighted in red, and a purple rectangle indicates their region.


Parallel Coordinates [78]

Another important visualization method for multivariate data sets is parallel coordinates.

Parallel coordinates were first introduced by Inselberg [77] and are used in several tools, e.g., XmdvTool [146] and VIS-STAMP [60], for visualizing multivariate data. The basic idea is that each dimension1 of the data is a vertical line, so the axes of the plot are a collection of parallel lines. Each data point is a polyline that crosses each dimension axis by intersecting it at its dimension value. Figure 2.1B shows an example of parallel coordinates for a seven-dimensional data set where one data point’s polyline is highlighted in red. In comparison to scatterplots, parallel coordinates can show data sets of higher dimensionality in one display. In a SPLOM, a higher-dimensional data set can be visualized by plotting every two-dimensional combination in one scatterplot. For both parallel coordinates and SPLOMs the ordering is important: for parallel coordinates the order of the axes (dimensions), and analogously for the SPLOM the order of rows and columns, since different orderings make different relations in the data visible. It is important to decide the order of the dimensions that are presented to the user. The effectiveness of these techniques, however, is highly related to the dimensionality of the data under inspection. Because the available resolution decreases as the number of data dimensions increases, it becomes very difficult, if not impossible, to explore the whole set of available orderings manually. In Section 2.3.2, we describe the notion of quality metrics, which are mechanisms to automatically quantify the quality of the display, and in Section 3.1.4, we introduce new quality metrics to determine the best ordering in parallel coordinates with respect to a given task.
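The basic mapping can be sketched with a hypothetical minimal helper (axis placement and scaling conventions vary between tools; the names below are illustrative):

```python
def polyline(record, mins, maxs, axis_order):
    """Vertices of a record's polyline: the k-th axis sits at x = k,
    and y is the record's value on that axis, min-max normalized to
    [0, 1] per dimension."""
    verts = []
    for x, dim in enumerate(axis_order):
        span = (maxs[dim] - mins[dim]) or 1.0  # constant axis -> y = 0
        verts.append((x, (record[dim] - mins[dim]) / span))
    return verts
```

Rendering then simply connects these vertices; changing `axis_order` (the n!/2 essentially distinct orderings mentioned above) changes which adjacent-axis relations the viewer can read off.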

Glyph-based techniques [147]

“Glyphs are graphical entities that convey one or more data values via attributes such as shape, size, color, and position” [147]. A variety of glyphs has been proposed in the literature so far; just to name some, there are star glyphs, face glyphs, profile glyphs, and box glyphs. An overview of multivariate glyphs can be found in [147]. They all have in common that there is one graphical representation per object, but they use different encodings for the object’s attributes (e.g., length, area, color). In Figure 2.1C star glyphs are exemplified. As the name suggests, each object is represented by a star-shaped glyph, where the value of each dimension is represented by the length of evenly spaced rays. The ray ends are connected by a polyline.

Pixel-oriented techniques [145]

Pixel-oriented techniques “map each value to individual pixels and create a filled polygon to represent each dimension” [145]. In Figure 2.1D a 14-dimensional data set is represented by dense pixel displays, showing each dimension in a separate rectangle and each data value as a colored pixel in the rectangle. The values are sorted according to the tenth dimension, which is marked with a black border. Several challenges of these techniques are apparent here. One is the already mentioned ordering of data values to spot correlated dimensions; another is the ordering of dimensions so as to position similar dimensions close to each other on the screen. Different colormaps can also reveal different patterns in the data, so choosing a suitable colormap for each data set and task is yet another challenge. Additionally, positioning the dimensions on the screen is not trivial, since layouts other than the grid layout are possible.
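The basic layout step can be sketched as follows (an illustrative toy helper, not the layout algorithm of any particular system): all records are brought into a shared order by one reference dimension, and each dimension's values are then placed row by row in its own rectangle.

```python
def pixel_layout(data, sort_dim):
    """Per dimension, the values in the shared pixel order obtained by
    sorting all records by the reference dimension sort_dim."""
    order = sorted(range(len(data)), key=lambda i: data[i][sort_dim])
    return [[data[i][d] for i in order] for d in range(len(data[0]))]

def pixel_position(index, width):
    """Row-major (row, column) position of a value in its rectangle."""
    return divmod(index, width)
```

Because every rectangle uses the same order, a dimension correlated with the reference dimension shows a smooth color gradient, while an uncorrelated one looks noisy.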

1We use the terms dimension and attribute (as well as feature, variable, column and axis) interchangeably in this thesis. We choose among them based on the context of the discussion, while attempting to be consistent with their use in the literature.


2.2.2 Limitations while Visualizing High-Dimensional Data

As previously demonstrated, there are different ways to represent high-dimensional data on the screen, and all of them bring a number of challenges with them: the scalability of the display, the ordering of displayed objects or dimensions, the positioning of objects on the screen, and the high number of possible visual mappings. Providing solutions for some of these problems would ease the exploration of high-dimensional data. By an appropriate sorting of dimensions and an appropriate mapping to visual variables, clutter can be reduced, and these visualization methods could allow to overview and relate high-dimensional data sets [49]. The data dimensionality causes problems in the visual mapping stage, meaning it is unclear which mapping is best, i.e., which data dimension should be mapped to which visual variable.

Because of the high number of possible mappings for a high-dimensional data set, automated methods are needed to restrict this number. One way to judge the quality of these mappings is to compute quality measures for the displayed data (see Chapter 3 for more details); another is to reduce the number of dimensions by dimensionality reduction techniques (see Section 2.1.2).

Enriching Visualizations

Static visualization techniques are not flexible enough to reveal complex high-dimensional patterns; interaction is needed at this point. Different solutions have been proposed to make visualizations interactive and to support a dynamic use for high-dimensional data. These include brushing and linking [46], panning and zooming [19], focus-plus-context [92], and magic lenses [29].

“Brushing and linking refers to the connecting of two or more views of the same data, such that a change to the representation in one view affects the representation in the other views as well. . . . Panning and zooming refers to the actions of a movie camera that can scan sideways across a scene (panning) or move in for a closeup or back away to get a wider view (zooming). . . . When zooming is used, the more detail is visible about a particular item, the less can be seen about the surrounding items. Focus-plus-context is used to partly alleviate this effect. The idea is to make one portion of the view – the focus of attention – larger, while simultaneously shrinking the surrounding objects. The farther an object is from the focus of attention, the smaller it is made to appear. . . . Magic lenses are directly manipulable transparent windows that, when overlapped on some other data type, cause a transformation to be applied to the underlying data, thus changing its appearance” [15]. A full exemplification of these techniques is out of the scope of this work, and more details can be read in [15]2.

Patterns that are only visible in subspaces of the original data space also need specialized visualizations to disclose the relations between the different subspaces from which they originate, as well as their possible object overlap. In Chapter 5 we present a visual-interactive tool for this purpose.

2The cited descriptions of these techniques are from Chapter 10: User Interfaces and Visualization, by Marti Hearst. This chapter can also be found online at http://people.ischool.berkeley.edu/~hearst/irbook/10/node3.html#SECTION00122000000000000000f (last accessed on 03/13).



2.3 Automated Techniques for High-Dimensional Data

In this section, we present automated methods for analyzing high-dimensional data. Section 2.3.1 discusses different data mining approaches to extract patterns from data, with a focus on clustering. We present general approaches, enumerate approaches that have been especially developed for coping with high-dimensional data, and explain the difference between clustering in a dimension-reduced data set and subspace clustering. Besides automated pattern extraction, Section 2.3.2 introduces automation to judge the quality of visualizations, namely quality metrics. Given the huge number of possible visual representations for high-dimensional data, the user is assisted in finding the right visual mapping or the right projection for the data. Our contribution to this area, consisting of new measures, a quality-measures pipeline, and a systematization of existing measures, is outlined in Chapters 3 and 4.

2.3.1 Data Mining Techniques for High-Dimensional Data

Data Mining refers to extracting, or mining, knowledge (interesting patterns) from large amounts of data [64]. To extract these patterns, different intelligent methods have been developed in the past. One important method, and the one closest to this thesis, is clustering. Clustering takes the data set as input and groups the objects according to their similarity into different groups, called clusters. Thereby, the similarity between objects of one group is maximized, while the similarity between objects of different groups is minimized. That means that objects of one group are very similar to each other, while dissimilar to objects of other groups. The similarity is calculated on the full attribute space using different distance functions, such as the Euclidean, Minkowski, or city-block distance.
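These distance functions are instances of the Minkowski family L_p, where p = 1 gives the city-block distance and p = 2 the Euclidean distance; a minimal sketch:

```python
def minkowski(a, b, p):
    """Minkowski (L_p) distance between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

print(minkowski([0, 0], [3, 4], 2))  # Euclidean distance: 5.0
print(minkowski([0, 0], [3, 4], 1))  # city-block distance: 7.0
```

Note that in high-dimensional spaces the choice of p matters: as discussed in Section 2.1.1, all such distances tend to concentrate as the dimensionality grows.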

State of the Art Clustering

There are different criteria to classify the existing clustering algorithms. We roughly differentiate them into hierarchical clustering algorithms and partitioning clustering algorithms and enumerate some of the best-known representatives. For further details, please refer to the surveys [21, 155] or the original papers of the algorithms.

Hierarchical clustering organizes objects into groups that are themselves grouped into larger groups, consecutively building up a hierarchy of clusters. Representatives of this category, which we will also use later in Section 5.2, are hierarchical clusterings with different linkage methods, such as single-linkage, complete-linkage, average-linkage, or minimum variance [144]. In recent years, aiming at large-scale data, new hierarchical algorithms have appeared that improve clustering performance.

Examples include BIRCH [162], an algorithm designed to use a height-balanced tree to store summaries of the original data, which can achieve linear computational complexity.
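The agglomerative idea behind such hierarchical clusterings can be sketched as follows for the single-linkage case: repeatedly merge the two clusters whose closest members are nearest. The function name and the stopping criterion (a target number of clusters) are our own simplifications for this one-dimensional sketch.

```python
# Naive single-linkage agglomerative clustering in one dimension.
# Each point starts as its own cluster; the two clusters with the
# smallest single-linkage distance are merged until k clusters remain.

def single_linkage(points, k):
    clusters = [[p] for p in points]          # start: every point is a cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the closest pair
    return clusters

# Two well-separated groups on the number line
clusters = single_linkage([1.0, 1.2, 9.0, 9.3], 2)
```

A real implementation would of course cache the pairwise distances instead of recomputing them in every merge step.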

The partitioning methods divide all data objects into a fixed number of groups, without any hierarchical structure. Major representatives of this category are the density-based algorithms DBSCAN [50] and OPTICS [10], and relocation methods like the k-medoids and k-means methods [56].
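The relocation idea of k-means can be illustrated with a bare-bones sketch of Lloyd's algorithm in one dimension: assign each point to its nearest center, then move each center to the mean of its group, and repeat. The hand-picked initial centers and the fixed iteration count are simplifications of this sketch, not part of the original method's specification.

```python
# Minimal k-means (Lloyd's algorithm) in one dimension.

def kmeans_1d(points, centers, iterations=10):
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # assignment step: each point joins its nearest center
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[idx].append(p)
        # update step: each center moves to the mean of its group
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = kmeans_1d([1.0, 2.0, 9.0, 10.0], centers=[0.0, 5.0])
```

On this toy input the centers converge to 1.5 and 9.5, splitting the points into the two obvious groups.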


Clustering in High Dimensions

For high-dimensional data sets, the challenge is to design effective and efficient clustering algorithms that can cope with the high number of objects and dimensions and with the noise level of this kind of data. Therefore, a number of different algorithms were proposed to cluster this type of data.

CURE [57] is a hierarchical clustering algorithm that can discover arbitrary cluster shapes and utilizes a random-sampling strategy to reduce computational complexity.

Density-based clustering (DENCLUE) [70] is a well-known approach to density-based clustering of high-dimensional data. To make computations more feasible, the data is indexed using a B+-tree. The algorithm is built on the idea that the influence of each data point on its neighborhood can be modeled using a so-called influence function. The overall density of the data space can then be modeled analytically as the sum of the influence functions applied to all data points. Clusters are determined by identifying local maxima of this overall density function.
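The density-estimation idea can be sketched with a Gaussian influence function in one dimension: each data point contributes an influence to every location, and the overall density is the sum of these contributions. The parameter names and the one-dimensional setting are simplifications for illustration; the actual algorithm additionally uses spatial indexing and hill climbing to locate the local maxima.

```python
from math import exp

def influence(x, point, sigma=1.0):
    """Gaussian influence of a single data point on location x."""
    return exp(-((x - point) ** 2) / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density at x: sum of the influences of all data points."""
    return sum(influence(x, p, sigma) for p in data)

data = [1.0, 1.5, 8.0]
# The density is higher near the group around 1.0-1.5 (a local maximum
# of the density function) than near the isolated point at 8.0.
```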

Although these algorithms can deal with large-scale data, they are sometimes not sufficient for analyzing high-dimensional data. Due to the previously described curse of dimensionality, algorithms relying on distance functions no longer perform well in high-dimensional spaces. To overcome this problem, dimension reduction (see Section 2.1.2) is used in cluster analysis to reduce the dimensionality of the data sets. However, dimensionality reduction methods cause some loss of information, may destroy the interpretability of the results, and can even distort the real clusters. Moreover, such techniques do not actually remove any of the original attributes from the analysis.

This is problematic when there is a large number of irrelevant attributes: the irrelevant information may mask the real clusters, even after transformation. Another way to tackle this problem is to use subspace clustering algorithms, which search for clusters in different subsets of the dimensions of the same data set. Different subspaces may contain different meaningful clusters. The problem is how to identify such subspace clusters efficiently.

A large number of algorithms for subspace clustering have been developed in the past; we briefly describe some representatives next. CLIQUE (CLustering In QUEst) [6] employs a bottom-up approach and searches all subspaces for dense rectangular cells, i.e., cells with a high density of points. Clusters are generated by merging these cells. OptiGrid [71] is designed to obtain an optimal grid partitioning using cutting hyperplanes. It uses density estimations similar to DENCLUE to find, via a set of linear projections, a plane that separates two significantly dense half-spaces and goes through a point of minimal density. In Section 5.1 we use the k-medoid-based algorithm PROCLUS (PROjected CLUStering) [4], one of the most robust algorithms for subspace clustering. It defines a cluster as a densely distributed subset of data objects in a subspace.

ORCLUS (arbitrarily ORiented projected CLUster generation) [5] follows a similar approach but uses non-axis-parallel subspaces to find the clusters. Further elaborations on the problem of subspace clustering are given in Section 2.4.2 and Section 5.1.2.
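The first, bottom-up step of a grid-based subspace approach like CLIQUE can be illustrated in miniature: partition a one-dimensional subspace into equal-width cells and keep the cells whose point count reaches a density threshold. The cell width and threshold here are arbitrary choices for this sketch; the real algorithm then combines dense cells of lower-dimensional subspaces into higher-dimensional candidates and merges adjacent dense cells into clusters.

```python
# Toy sketch of CLIQUE-style dense-cell identification in one dimension.

def dense_cells_1d(values, cell_width, threshold):
    """Return the indices of grid cells containing at least `threshold` points."""
    counts = {}
    for v in values:
        cell = int(v // cell_width)               # which grid cell v falls into
        counts[cell] = counts.get(cell, 0) + 1
    return {cell for cell, n in counts.items() if n >= threshold}

# One dimension of a data set: a dense region in [0, 2), a stray point at 9.5
cells = dense_cells_1d([0.1, 0.4, 1.2, 1.5, 1.8, 9.5],
                       cell_width=2.0, threshold=3)
```

Only the cell covering [0, 2) survives the threshold; the stray point's cell is discarded as sparse.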

Other Data Mining Techniques

In addition to clustering, many other data mining techniques have been developed in the past. Mainly, they mine frequent patterns, associations, correlations, or outliers.
