
Need for Visual Interactive Data Exploration

1.2 Contributions of the Thesis
1.3 Thesis Structure

1.1 Need for Visual Interactive Data Exploration

Today, data is produced everywhere. Everything is recorded, from production processes in industry to employees’ working behavior and their personal data. Even animals are equipped with sensors that record all their movements over long periods of time, the click behavior of internet users is traced, and supermarket purchases are stored for later analysis. Since today’s technology provides inexpensive and abundant storage space, even more data will be stored in the near future. At the same time, these advantages raise the problem of how to handle the data most effectively. The gap between the data generated and our understanding of it is increasing [154], which also poses a challenge for analysis techniques: it is difficult to filter and extract relevant information, since not only the volume of the data increases but also its complexity.

Visualization has long been used as an effective tool to explore and make sense of data, especially when analysts need to generate hypotheses about the information hidden in the data. While some techniques and commercial products have proven useful in providing effective solutions, modern databases can still store data whose complexity goes well beyond the limits of human understanding.

The goal of this thesis is pattern finding in high-dimensional or multidimensional data.

The methods presented here work with numerical data sets that have a large number of objects and a large number of dimensions, also called attributes. Depending on the application area, a large number of objects can start at hundreds and go up to thousands. The same is true for the describing attributes, or features, of the objects. In this work, we call all data sets with more than one hundred objects and more than ten dimensions high-dimensional. An example of analysis tasks based on a customer database is described later in this section.

Classical data exploration requires the user to find interesting phenomena in the data interactively, starting from an initial visual representation. In [36], the authors state that “the purpose of visualization is insight, not pictures”. Techniques for high-dimensional data visualization can also incorporate automated analysis components to reduce complexity and to guide the user effectively during the interactive exploration process. This process is called visual analytics. “Visual analytics strives to facilitate the analytical reasoning process by creating software that maximizes human capacity to perceive, understand, and reason about complex and dynamic data and situations” [137].

Patterns are also not a new concept when analyzing data. Witten and Frank expressed this perfectly in [154]: “There is nothing new about this” (patterns). “People have been seeking patterns in data since human life began. Hunters seek patterns in animal migration behavior, farmers seek patterns in crop growth, politicians seek patterns in voter opinion, and lovers seek patterns in their partners’ responses. A scientist’s job (like a baby’s) is to make sense of data, to discover the patterns that govern how the physical world works and encapsulate them in theories that can be used for predicting what will happen in new situations.”

In large-scale multivariate data sets, purely interactive exploration becomes ineffective or even infeasible, since the number of possible representations grows rapidly with the number of dimensions. Methods are needed that help the user to find effective and expressive visualizations automatically. Effective and efficient analysis methods for large multidimensional data are necessary to understand the complexity of the information hidden in these databases. Data dimensionality is often the major limiting factor.
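To give a rough feeling for how quickly the number of possible representations grows, the following minimal sketch counts two kinds of views for a data set with d dimensions. The counting conventions are assumptions made for illustration only: one scatterplot per unordered pair of dimensions, and one parallel coordinates plot per axis ordering, where an ordering and its mirror image count as the same plot. The function names are hypothetical.

```python
from math import comb, factorial


def count_scatterplots(d: int) -> int:
    """Number of distinct 2D scatterplots: one per unordered pair of dimensions."""
    return comb(d, 2)


def count_axis_orderings(d: int) -> int:
    """Number of parallel coordinates axis orderings.

    Assumption: an ordering and its mirror image show the same plot,
    so each such pair is counted once.
    """
    if d < 2:
        return 1
    return factorial(d) // 2


for d in (5, 10, 20):
    print(f"d={d}: {count_scatterplots(d)} scatterplots, "
          f"{count_axis_orderings(d)} axis orderings")
```

Under these assumptions, the number of scatterplots grows quadratically with d, while the number of axis orderings grows factorially, which makes exhaustive manual inspection impractical even for moderate d.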

For automatic pattern detection, a typically employed paradigm is clustering, that is, identifying groups of objects based on their mutual similarity. For the aforementioned high-dimensional data, however, considering all features simultaneously, as traditional clustering methods do, is no longer effective due to the so-called curse of dimensionality [28]. As dimensionality increases, the distances between any two objects become less discriminative.

Moreover, the probability that many dimensions are irrelevant for the underlying cluster structure increases. In such data sets, it can be observed that each object may participate in different groupings, meaning that objects may have different roles. In classical clustering, by comparison, each object belongs to one cluster, and the data set is partitioned into a number of clusters. “For example, in customer segmentation, we observe for each customer multiple possible behaviors which should be detected as clusters. In other domains, such as sensor networks each sensor node can be assigned to multiple clusters according to different environmental events. In gene expression analysis, objects should be detected in multiple clusters due to the various functions of each gene. In general, multiple groupings are desired as they characterize different views of the data” [103].
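The loss of distance contrast mentioned above in connection with the curse of dimensionality can be observed in a small experiment. The sketch below is not taken from the thesis. It assumes uniformly distributed random points in the unit hypercube and Euclidean distance, and it reports the relative contrast (d_max - d_min) / d_min of the distances from a random query point to all other points. The function name and parameters are hypothetical.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

In such a setting, the relative contrast shrinks as the dimensionality grows: the nearest and the farthest neighbor end up at almost the same distance, so a similarity measure computed over all dimensions separates objects less and less well.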

If we consider, for example, a customer database with a large number of customers (rows in the table) described by a large number of attributes (columns in the table), we may ask how these customers relate to each other and what kinds of patterns, in this case groups, can be identified in this database. Figure 1.1 shows a toy example of such multiple valid groupings for one database. We can have groups like “rich oldies”, “healthy sporties”, “unhealthy gamers”, “unemployed people”, “average people” and “sport professionals”1. To facilitate data analysis in this direction, we present in Chapter 5 visual interactive systems and new analysis methods that support the understanding and comparison of different groupings in high-dimensional data.

Figure 1.1: Multiple valid and interesting groupings of a high-dimensional data set [104].

1 This image appeared in the tutorial slides of Müller et al. [104]; the accompanying story is my own.

As already mentioned, this thesis is about visual analytics of patterns in high-dimensional data. To assist the analysis of such data sets, effective information visualization techniques that map data properties to the screen have been developed and are needed to make sense of the complex data at hand. The visualization of large, complex information spaces typically involves mapping high-dimensional data to lower-dimensional visual representations. The challenge for the analyst is to find an insightful mapping while the dimensionality of the data, and consequently the number of possible mappings, increases.

As we will see later in Chapter 2, numerous expressive and effective low-dimensional visualizations for high-dimensional data sets have been proposed in the past, such as scatterplots and scatterplot matrices (SPLOM) [37], parallel coordinates [78], glyph-based techniques [147], pixel-based displays [145] and geometrically transformed displays [86, 145]. However, finding information-bearing and user-interpretable visual representations automatically remains a difficult task since there could be a large number of possible representations. In addition, it could be difficult to explain their relevance to the user.

Finding relations, patterns, and trends over numerous dimensions is also difficult because the projection of n-dimensional objects onto 2D spaces necessarily entails some loss of information. Projection techniques like multidimensional scaling (MDS) and principal component analysis (PCA) offer traditional solutions by creating data embeddings that try to preserve, as far as possible, the distances of the original multidimensional space in the 2D projection. These techniques, however, have severe problems in terms of interpretation, as it is no longer possible to interpret the observed patterns in terms of the dimensions of the original data space.
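The information loss caused by projection can be made concrete with a very small example. The sketch below does not implement MDS or PCA. As a deliberate simplification, it uses an axis-parallel projection (the kind of view a single scatterplot shows) and compares pairwise Euclidean distances before and after dropping one dimension. The three points and the helper names are hypothetical.

```python
import math
from itertools import combinations


def project(point, dims):
    """Axis-parallel projection: keep only the coordinates listed in dims."""
    return tuple(point[i] for i in dims)


def distance_table(points, dims):
    """Map each pair of labels to (original distance, projected distance)."""
    table = {}
    for (name_p, p), (name_q, q) in combinations(points.items(), 2):
        table[(name_p, name_q)] = (
            math.dist(p, q),
            math.dist(project(p, dims), project(q, dims)),
        )
    return table


# Hypothetical toy data: three objects in a three-dimensional space.
points = {
    "A": (0.0, 0.0, 0.0),
    "B": (0.0, 0.0, 5.0),
    "C": (3.0, 4.0, 0.0),
}

for pair, (d_orig, d_proj) in distance_table(points, dims=(0, 1)).items():
    print(pair, f"original={d_orig:.3f}", f"projected={d_proj:.3f}")
```

In this toy case, A and B differ only in the dropped dimension and therefore coincide in the 2D view, although they are as far apart in the original space as A and C. An orthogonal projection can only preserve or shrink distances, which is why embedding methods try to choose the projection so that as little distance information as possible is lost.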

Mechanisms to measure the quality of visualizations are therefore needed. In the past, quality measures have been developed for different areas, such as data quality (outliers, missing values, sampling rate, level of detail), clustering quality (purity, F-measure (combining precision and recall), Rand index [114], silhouette coefficient [85], etc.), and association rule quality (support and confidence [7], information gain [40], etc.), as well as the distance distribution measure in SURFING [16], a subspace search algorithm described and used in Chapter 5 to filter data spaces and find interesting subspaces. For visualizations, a number of authors have started introducing quality measures to quantify their importance. The rationale behind this approach is that quality measures can help users reduce the search space by filtering out views with low information content. In the ideal system, users can select one or more measures, and the system optimizes the visualization so as to reflect the user’s choice. This thesis also contributes to the field of quality measures: in Chapter 3, new measures are presented for scatterplot matrices and parallel coordinates plots.
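The idea of reducing the search space with a quality measure can be sketched in a few lines. The example below is only an illustration and does not reproduce the measures presented in Chapter 3. As a stand-in, it scores every scatterplot view (pair of dimensions) by the absolute Pearson correlation of the two dimensions and keeps the best-scoring views. The toy data, the column names, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uninformative
    return cov / math.sqrt(var_x * var_y)


def rank_views(columns, top_k):
    """Score every pair of dimensions and return the top_k best views
    as (score, dimension_a, dimension_b) tuples, best first."""
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


# Hypothetical toy data: column name -> values for six objects.
columns = {
    "age": [23, 35, 47, 51, 62, 70],
    "income": [20, 34, 45, 52, 60, 71],
    "shoe_size": [42, 38, 44, 39, 41, 40],
}

for score, a, b in rank_views(columns, top_k=2):
    print(f"{a} vs {b}: score={score:.2f}")
```

Replacing the scoring function changes which views are kept, which corresponds to the user selecting a different measure. Correlation is used here only because it is short to compute. A measure aimed at clusters or other patterns would be plugged in at the same place.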

However, there is one problem with these measures: the lack of empirical validation based on user studies. Such studies are needed to test the underlying assumption that the patterns captured by these measures correspond to the patterns captured by the human eye. Since many different patterns can be analyzed, in this thesis we start with clusters in visualizations: we compare some of the most promising quality measures for filtering visualizations that show clusters against human judgement of the same visualizations.

The analysis of high-dimensional data is a ubiquitously relevant, yet notoriously difficult problem. Problems exist both in automatic data analysis and in the visualization of this kind of data. On the visual-interactive side, the limited number of available visual variables and the limited short-term memory of human analysts make it difficult to visualize data with a high number of dimensions effectively. In Chapter 5, we tackle this problem from the visual-interactive side. We present a visual-interactive tool to make sense of clusters in different subspaces, as well as an approach to identify subspaces that might show complementary clusterings.

In summary, the focus of this thesis is to contribute to both sides of pattern finding in high-dimensional data: the automatic part and the visual interactive part. We believe that both parts are needed together to solve the problem. Therefore, on the one side, we present automatic mechanisms, namely quality measures, to reduce the number of alternative visualizations of high-dimensional data, and, on the other side, we visualize the relations between results to support the user in an interactive pattern-finding process.