
2.3 Automated Techniques for High-Dimensional Data

In this section, we present automated methods for analyzing high-dimensional data. Section 2.3.1 discusses different data mining approaches to extract patterns from data. The focus is on clustering. We present general approaches, enumerate approaches that have been especially developed for coping with high-dimensional data, and present the difference between clustering in a dimension-reduced data set and subspace clustering. Besides automated pattern extraction, in Section 2.3.2 we introduce automation to judge the quality of visualizations, namely by quality metrics. Given the huge number of possible visual representations for high-dimensional data, the user is assisted in finding the right visual mapping or the right projection for the data. Our contribution to this area, consisting of new measures, a quality measures pipeline, and a systematization of existing measures, is outlined in Chapters 3 and 4.

2.3.1 Data Mining Techniques for High-Dimensional Data

Data Mining refers to extracting, or mining, knowledge (interesting patterns) from large amounts of data [64]. In order to extract these data patterns, different intelligent methods have been developed in the past. One important method, which is also the closest to this thesis, is clustering. Clustering takes the data set as input and groups the objects according to their similarity into different groups, called clusters. Thereby, the similarity between objects of one group is maximized, and the similarity between objects of different groups is minimized. That means that objects of one group are very similar to each other, while dissimilar to objects of other groups. The similarity is calculated on the full attribute space, using different distance functions, like the Euclidean, Minkowski, or City-block distance.
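For illustration, the following is a minimal sketch of these distance functions using NumPy; the two data objects a and b are hypothetical examples, and the Minkowski order p = 3 is an arbitrary choice.

```python
import numpy as np

# Two hypothetical data objects described by four attributes each.
a = np.array([1.0, 3.5, 2.0, 0.5])
b = np.array([2.0, 1.5, 2.5, 1.5])

def minkowski(x, y, p):
    """Minkowski distance of order p; p=1 is the City-block, p=2 the Euclidean distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

euclidean  = minkowski(a, b, 2)   # straight-line distance in attribute space
city_block = minkowski(a, b, 1)   # sum of absolute attribute differences
minkowski3 = minkowski(a, b, 3)   # general Minkowski distance with p = 3

print(euclidean, city_block, minkowski3)
```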

State of the Art Clustering

There are different criteria to classify the existing clustering algorithms. We would like to differentiate them roughly into hierarchical clustering algorithms and partitioning clustering algorithms and enumerate some of the best-known representatives. For further details please refer to the following surveys [21, 155] or the original papers of the algorithms.

Hierarchical clustering organizes objects into groups that are themselves grouped into larger groups, consecutively building up a hierarchy of clusters. Representatives for this category, which we will also use later in Section 5.2, are hierarchical clusterings with different linkage methods, like single-linkage, complete-linkage, average-linkage, or minimum variance [144]. In recent years, aiming at algorithms that can handle large-scale data, new hierarchical algorithms appeared that improve the clustering performance.

Examples include BIRCH [162], an algorithm designed to use a height-balanced tree to store summaries of the original data, which allows it to achieve a linear computational complexity.
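The following sketch illustrates how such linkage-based hierarchies can be built with SciPy's agglomerative clustering; the random data set and the cut into three clusters are assumed for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))  # 50 hypothetical objects with 8 attributes

# Build one hierarchy per linkage criterion mentioned above.
for method in ("single", "complete", "average", "ward"):
    tree = linkage(data, method=method)                  # (n-1) x 4 merge table
    labels = fcluster(tree, t=3, criterion="maxclust")   # cut the hierarchy into 3 clusters
    print(method, np.bincount(labels)[1:])               # cluster sizes per linkage method
```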

The partitioning methods divide all the data objects into a fixed number of groups, without any hierarchical structure. Major representatives for this category are algorithms like the density-based DBSCAN [50] and OPTICS [10], or relocation methods like the k-medoids and k-means methods [56].
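As a minimal sketch of both flavors, the following code runs scikit-learn's KMeans (a relocation method) and DBSCAN (a density-based method) on synthetic data; the parameter values (k = 3, eps, min_samples) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data with three hypothetical groups in five dimensions.
data, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# Relocation-based partitioning: every object is assigned to one of k clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

# Density-based partitioning: clusters are dense regions, label -1 marks noise.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(data)

print(np.bincount(kmeans_labels), set(dbscan_labels))
```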

Clustering in High Dimensions

For high-dimensional data sets, the challenge is to design effective and efficient clustering algorithms that can cope with the high number of objects and dimensions, and with the noise level of this kind of data. Therefore, a number of different algorithms have been proposed to cluster this type of data.

CURE [57] is a hierarchical clustering algorithm that can explore arbitrary cluster shapes and utilizes a random sampling strategy to reduce computational complexity.

Density-based clustering (DENCLUE) [70] is a well-known approach for density-based clustering of high-dimensional data. To make computations more feasible, the data is indexed using a B+-tree. The algorithm is built on the idea that the influence of each data point on its neighborhood can be modeled using a so-called influence function. The overall density of the data space can then be modeled analytically as the sum of the influence functions of all data points. Clusters are determined by identifying local maxima of this overall density function.
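The following is a minimal sketch of the influence-function idea on one-dimensional data, assuming a Gaussian influence function with bandwidth sigma and a naive grid search for local maxima; the original algorithm additionally relies on the B+-tree index and a more efficient hill-climbing procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 1D data drawn from two groups.
points = np.concatenate([rng.normal(0.0, 0.3, 100), rng.normal(3.0, 0.3, 100)])
sigma = 0.5  # assumed bandwidth of the influence function

def influence(x, p):
    """Gaussian influence of data point p on location x."""
    return np.exp(-((x - p) ** 2) / (2 * sigma ** 2))

def overall_density(x):
    """Overall density at x: sum of the influence of every data point."""
    return np.sum(influence(x, points))

# Naive search for local maxima of the overall density on a grid.
grid = np.linspace(points.min(), points.max(), 400)
dens = np.array([overall_density(x) for x in grid])
is_max = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
print("density attractors near:", grid[1:-1][is_max])
```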

Although these algorithms can deal with large-scale data, they are sometimes not sufficient to analyze high-dimensional data. Due to the previously described curse of dimensionality, algorithms relying on distance functions can no longer perform well in high-dimensional spaces. To overcome this problem, dimension reduction (see Section 2.1.2) is used in cluster analysis to reduce the dimensionality of the data sets. However, dimensionality reduction methods cause some loss of information, may destroy the interpretability of the results, and may even distort the real clusters. Moreover, such techniques do not actually remove any of the original attributes from the analysis.

This is problematic when there is a large number of irrelevant attributes. The irrelevant information may mask the real clusters, even after transformation. Another way to tackle this problem is to use subspace clustering algorithms, which search for data clusters in different subsets of the dimensions of the same data set. Different subspaces may contain different meaningful clusters. The problem here is how to identify such subspace clusters efficiently.

A large number of algorithms for subspace clustering have been developed in the past, and we pick some representatives to be briefly described next. CLIQUE (CLustering In QUEst) [6] employs a bottom-up approach and searches all subspaces for dense rectangular cells with a high density of points. The clusters are generated by merging these rectangles. OptiGrid [71] is designed to obtain an optimal grid partitioning using cutting hyperplanes. It uses density estimations similar to DENCLUE to find a plane that separates two significantly dense half spaces and goes through a point of minimal density, using a set of linear projections. In Section 5.1 we use the k-medoid-based algorithm PROCLUS (PROjected CLUstering) [4], one of the most robust algorithms for subspace clustering. It defines a cluster as a densely distributed subset of data objects in a subspace.

ORCLUS (arbitrarily ORiented projected CLUster generation) [5] uses a similar approach but employs non-axis-parallel subspaces to find the clusters. Further elaborations on the problem of subspace clustering are given in Section 2.4.2 and Section 5.1.2.
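To make CLIQUE's bottom-up search more tangible, the following sketch counts points in equal-width grid cells of every one-dimensional subspace and keeps cells whose relative density exceeds a threshold; the grid resolution xi and threshold tau are assumed parameters, and the merging of dense cells into higher-dimensional subspaces and into clusters is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.random((500, 6))   # 500 hypothetical objects with 6 attributes in [0, 1]
xi, tau = 10, 0.02            # assumed grid resolution and density threshold

def dense_cells(dims):
    """Grid cells of the subspace spanned by dims holding more than tau of all points."""
    cells = np.floor(data[:, dims] * xi).astype(int)
    cells = np.minimum(cells, xi - 1)   # points on the upper border fall into the last cell
    keys, counts = np.unique(cells, axis=0, return_counts=True)
    return keys[counts / len(data) > tau]

# Bottom-up step: dense cells in every one-dimensional subspace.
for dim in range(data.shape[1]):
    print("dimension", dim, "has", len(dense_cells([dim])), "dense cells")

# Higher-dimensional candidate subspaces would be formed from dimensions that
# already contain dense cells; this merge step is omitted in the sketch.
```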

Other Data Mining Techniques

In addition to clustering techniques, many other techniques have been developed in the past. Mainly, they mine frequent patterns, associations, correlations, or outliers in data. A frequent pattern is a set of items that occur frequently in a data set. This term was first proposed by [7] in the context of frequent itemsets and association rule mining. By mining frequent patterns, the goal is to identify regularities in the data, like products often purchased together in basket data analysis. Frequent patterns form the foundation for many essential data mining tasks, such as association analysis, correlation analysis, classification (associative classification), and cluster analysis (frequent pattern-based clustering). “Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a dataset” [63]. As mentioned in Section 1.1, support and confidence can characterize the quality of association rules. The rules are generated based on the frequent itemsets identified in the data. One problem, however, is that for low support and confidence levels the resulting set of association rules becomes very large. Using higher levels of support and confidence can remove useful rules, so a mechanism is needed to detect the right confidence level. Visualization can help to overcome this issue and supports the user in identifying the right rules. In Section 3.1 we will present image-based quality measures to identify correlation among data attributes and attributes forming strong groups (clusters) in the data.
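As a minimal sketch of how support and confidence are computed, the following code evaluates a single candidate rule on a small hypothetical list of market-basket transactions; real association-rule miners such as Apriori enumerate frequent itemsets systematically rather than testing one rule.

```python
# Hypothetical basket data: each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)

rule = ({"bread"}, {"milk"})
print("support:", support(rule[0] | rule[1]))   # how often the rule applies at all
print("confidence:", confidence(*rule))         # how reliable the rule is
```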

In classification analysis the data is often classified (labeled), and a model is derived to distinguish these data classes. This model is trained on a subset of the data, called the training set. Another subset of the data, the so-called test set, is used to validate the model. The model can be represented by classification rules, decision trees, neural networks, or mathematical formulas and is used to classify new data. However, users often need to predict missing values in the data, rather than class labels. When the predicted values are numerical, the process is named prediction. Our work on quality metrics with labeled data (see Section 3.1.3 and Section 3.1.5) can be seen as a complementary way to identify the attributes that best distinguish the classes in the data, which is relevant for building the classification model. Classification is also referred to as supervised learning, because the training set is used to teach how to classify new data. Clustering is referred to as unsupervised learning, since there are no class labels for training, and clusters or classes are established to group the data elements.

In some applications, such as fraud detection, rare events can be of interest. The analysis of such outlier data is referred to as outlier mining. Outliers can be detected, for example, by using statistical tests, but also by some quality metrics. Examples of quality metrics for outliers are marked in Table 4.2 later in Chapter 4.
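For illustration, a minimal sketch of such a statistical test: values whose z-score exceeds three standard deviations are flagged as outliers; both the synthetic data and the threshold of 3 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
values = np.concatenate([rng.normal(0, 1, 200), [8.0, -9.5]])  # two injected outliers

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]   # assumed threshold of three standard deviations
print("detected outliers:", outliers)
```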

2.3.2 Quality Measures for High-Dimensional Data Visualizations

General Measures

Quality metrics (or measures) in visualization have a long history. While in our work we focus only on their specific use in high-dimensional data analysis, they have a broader scope than we can describe here. Early attempts to calculate quality metrics can be traced back to the work of Tufte [139], who proposed metrics such as the data-to-ink ratio and the lie factor, which respectively optimize the use of the visualization space and reduce the distortions that a visualization may introduce. Later, in 1997, Richard Brath proposed a rich set of metrics to characterize the quality of business visualizations [32] and, around the same period, Miller et al. advocated the use of visualization metrics as a way to compare visualizations [100]. The graph drawing community developed its own set of metrics, most notably aesthetic metrics such as those found in the foundational work of Ware et al. on cognitive measurements of graph aesthetics [149]. Later, the term quality metrics assumed a more specific meaning; in particular, it appeared in the context of a number of papers related to clutter reduction and scalability [24, 26, 80, 82, 112].

For the sake of completeness, it is worth mentioning that the word metric is also used in the context of information visualization user studies as a way to indicate how the elements of interest are measured (e.g., [108, 113]).

Scatterplot Measures

The idea of using measures calculated over the data or over the visualization space to select interesting projections has already been proposed in foundational works like Projection Pursuit [54, 74] and Grand Tour [13]. Projection Pursuit searches for low-dimensional (one- or two-dimensional) projections that expose interesting structures, using a “Projection Pursuit Index” that considers inter-point distances and their variation. Grand Tour adopts a more interactive approach by allowing the user to easily navigate through many viewing directions, creating a movie-like presentation of the whole original space.
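The following is a minimal sketch of the Projection Pursuit idea: random two-dimensional linear projections are scored with a simple index and the best-scoring one is kept. The index used here, the variance of inter-point distances, is an assumed stand-in for the actual indices in [54, 74], and the random search is a crude replacement for proper optimization.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 10))    # hypothetical high-dimensional data

def index(projected):
    """Assumed interestingness index: variation of the inter-point distances."""
    return pdist(projected).var()

best_score, best_basis = -np.inf, None
for _ in range(100):                                    # crude random search over projections
    basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))   # random orthonormal 2D basis
    score = index(data @ basis)
    if score > best_score:
        best_score, best_basis = score, basis

print("best index value:", best_score)
```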

More recently, several works appeared in the visualization community that propose different forms of quality measures. Examples are graph-theoretic measures for scatterplot matrices [151], measures over pixel-based visualizations [120], measures based on clutter reduction for visualizations [25, 112], and composite measures to find several data structures such as outliers, correlations, and sub-clusters [82]. We present a systematization of works on quality measures in Chapter 4 and propose a quality measures pipeline to describe the process of these measures. Additionally, several factors are derived to characterize the measures in a common language, and implications for further research are raised. At this point, it seems important to provide a short description of the first two categories and postpone the details of the others to Chapter 4.

First, the scagnostics measures [140] play an important role since they are a major source of inspiration for our work. As an alternative to Projection Pursuit, the scagnostics method [140] was proposed to analyze structures in scatterplots. Since the specifics of the method were never published, Wilkinson et al. [151] took the opportunity to present the scagnostics ideas and apply them to high-dimensional data. They describe detailed graph-theoretic measures for scatterplots. This means that graphs and their properties (like the convex hull, alpha hull, or Minimum Spanning Tree (MST)) are used as bases for computing the scagnostics measures. Their scagnostics indices assess five aspects of the point distribution (outliers, shape, trend, density, and coherence), proposing nine characteristic indices for the distribution of points in scatterplots: outlying, skewed, clumpy, convex, skinny, striated, stringy, straight, and monotonic. Originally, these indices are used to form a SPLOM of scagnostics, where each axis is a scagnostics measure. Here, each data scatterplot is represented by a point according to its measures. The scagnostics SPLOM was used to spot unusual scatterplots regarding their data distribution (see Figure 2.2A).

These indices were also used as ranking functions in data SPLOMs supporting different analysis tasks [152] as shown in Figure 2.2B.
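To convey the graph-theoretic flavor of these measures, the following sketch computes an outlying-style score from the Minimum Spanning Tree of a scatterplot: the share of total edge length contributed by unusually long edges. The synthetic data and the edge-length cutoff (q3 + 1.5 * IQR) are assumptions; the exact definitions in [151] differ in their details.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(5)
points = np.vstack([rng.normal(0, 1, (100, 2)), [[6, 6], [-5, 7]]])  # two injected outliers

# Minimum Spanning Tree over the complete distance graph of the scatterplot.
dist = squareform(pdist(points))
mst = minimum_spanning_tree(dist)
edges = mst.data                      # edge lengths of the MST

# Assumed outlying-style score: share of total edge length in unusually long edges.
q1, q3 = np.percentile(edges, [25, 75])
cutoff = q3 + 1.5 * (q3 - q1)
outlying = edges[edges > cutoff].sum() / edges.sum()
print("outlying score:", round(outlying, 3))
```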

Figure 2.2: (A) Scagnostics SPLOM having as axes the scagnostics measures and showing each data scatterplot as a point in the measures scatterplot [152]. (B) Scagnostics indices used as quality measures to rank data scatterplots [152].

Second, the approach most similar to ours presented in Chapter 3 is Pixnostics, proposed by Schneidewind et al. [120]. They also use image-analysis techniques to rank the different lower-dimensional views of the data set and present only the best ranked to the user. The method does not only provide valuable lower-dimensional projections to the user, but also optimized parameter settings for pixel-level visualizations. However, while their approach concentrates on pixel-level visualizations, we focus on scatterplots and parallel coordinates.

We contribute to the field of quality metrics by proposing image-based and data-based measures for classified and non-classified data in scatterplots and parallel coordinates in Section 3.1. In Section 3.1.2 we present an image-based measure for non-classified scatterplots in order to quantify the structures and correlations between the respective dimensions. Our measure could for example be used as an additional index in a scagnostics matrix.

Parallel to our work from Section 3.1 published in [133], Sips et al. [129] developed a class consistency visualization algorithm. Similar to our Histogram Density measures, the class consistency method proposes measures to rank 2D scatterplots. It filters the highest ranked scatterplots and presents them in an ordinary scatterplot matrix.

Parallel Coordinates Measures

Measures were not only used to rank a high number of visualizations regarding their structures, but also with the purpose of optimizing visualizations for high-dimensional data representation. One major factor handled by these measures is optimizing the ordering of elements (like axes or data points) in the visualization. Aiming at dimension reordering, Ankerst et al. [9] presented a method based on similarity clustering of dimensions, placing similar dimensions close to each other. Yang [159] developed a method to generate interesting projections, also based on the similarity between the dimensions. Similar dimensions are clustered and used to create a lower-dimensional projection of the data.
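As a sketch of similarity-driven axis reordering, the greedy heuristic below orders dimensions so that each axis is followed by its most correlated remaining neighbor; it is an assumed simplification for illustration, not the clustering-based procedure of Ankerst et al. [9].

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(size=(300, 6))
data[:, 3] = data[:, 0] + 0.1 * rng.normal(size=300)   # dimension 3 resembles dimension 0

# Pairwise similarity between dimensions: absolute Pearson correlation.
sim = np.abs(np.corrcoef(data, rowvar=False))

order, remaining = [0], set(range(1, data.shape[1]))
while remaining:
    # Greedily append the dimension most similar to the last placed axis.
    nxt = max(remaining, key=lambda d: sim[order[-1], d])
    order.append(nxt)
    remaining.remove(nxt)

print("axis order for parallel coordinates:", order)
```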

As an alternative to the methods for dimension reordering for parallel coordinates, we propose a method based on the structures present in the low-dimensional embeddings of the data set. Three different kinds of measures to rank these embeddings are presented in Section 3.1.4 for class-based and non-class-based visualizations.

Evaluating Measures

A common denominator of all these works is the total absence of user studies able to inspect the relationship between human-detected and machine-detected data patterns. While it is certainly clear how these measures can help users deal with large data spaces, there are a number of open issues related to the human perception of the structures captured automatically by the suggested algorithms. In Section 3.2 we focus on the question of whether there is a correlation between what the human eye perceives and what the machine detects.

Despite the lack of user studies specifically focused on the issues discussed above, there are a number of user studies focused on the detection of visual patterns which are worth mentioning here. A large literature exists on the detection of pre-attentive features, notably the work of Healey focused on visualization [67], and on the Gestalt Laws [148], which are often taken as the basis for the detection of patterns from visual representations. Some more specific works focused on visualization are: [25] and [68], based on the perception of density in pixel-based scatterplots and in visualizations based on “pexels” (perceptual texture elements), respectively; [81] on the study of thresholds for the detection of patterns in parallel coordinates; and [65] on the correlation between visualization performance and similarity with natural images. The study presented in [118] is also relevant and very similar to ours presented in Section 3.2 in terms of experiment design. Users ranked a series of images in terms of their perception of the degree of clutter exposed by the image, and the study correlated the user ranking with the ranking given by the suggested measure, named feature congestion.