Visual Interactive Systems for High-Dimensional Data Analysis

2.4 Visual Analytics for High-Dimensional Data

2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis

As presented in the previous chapter, combining data visualization with interactive and automated components speeds up the analysis of high-dimensional data sets. As a consequence, many interactive systems have been developed recently to support the user in analyzing high-dimensional data sets. Since there is a large number of interactive systems in the literature, presenting a full summary would overload this section. Hence, in the following paragraphs, we identify only the four main domains related to this thesis and enumerate a selection of visual interactive systems for visual feature selection, visual clustering, visual classification, and dimension reordering.

Visual Feature Selection

Reducing high-dimensional data to a smaller subset of features that express the data characteristics is a crucial task in high-dimensional data analysis. To this end, data features are compared, for example by computing correlations or data variation, to identify their importance in expressing the data characteristics. Since fully automated feature selection methods are often infeasible due to the data complexity and dimensionality, visual-interactive systems have been developed to deal with this problem. We illustrate three examples of such systems in Figure 2.3 with a short description, and point to more literature in this field in the next paragraphs.

Figure 2.3: Visual interactive feature selection systems. A: Rank-by-Feature Framework presented in [125]. B: Feature selection supported by quality measures [82]. C: DimStiller for feature selection [76].

Among existing work on visual-interactive selection or comparison of features, the Rank-by-Feature Framework [125] (see Figure 2.3A) provides a sorted visual overview of the correlation among pairs of features. In [82], the selection of input features was supported by a measure of the interestingness of the visual view provided by candidate features (see Figure 2.3B). An interactive dimensionality reduction workflow was presented in [76], relying on visual approaches to guide users in selecting features (see Figure 2.3C).
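The idea of a sorted overview of pairwise feature correlations can be illustrated with a minimal sketch. It is not the implementation of [125]; the function names `pearson` and `rank_feature_pairs`, the toy data, and the choice of the Pearson coefficient as ranking criterion are assumptions made only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant feature counts as uncorrelated
    return cov / sqrt(var_x * var_y)


def rank_feature_pairs(columns):
    """Return (name_a, name_b, r) tuples sorted by decreasing |r|."""
    scored = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, key=lambda t: abs(t[2]), reverse=True)


# Hypothetical toy data: three features measured on five records.
toy = {
    "height": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [2.1, 3.9, 6.2, 8.0, 9.9],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
ranking = rank_feature_pairs(toy)
for a, b, r in ranking:
    print(f"{a:>6} vs {b:<6} r = {r:+.3f}")
```

In an interactive system, such a ranked list would typically drive the order in which feature pairs are shown to the user, who then inspects the top-ranked pairs visually.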

In [33] and [34], interactive visual comparison was proposed to relate data described in different given feature spaces based on 2D mappings and tree structures extracted from the different data spaces. Furthermore, in [93] a visual design based on network and heat map visualization was proposed to relate clusterings in different subsets of dimensions.

In [159], dimensions are hierarchically clustered based on a simple value-oriented similarity measure. Based on this structure, user navigation can take place to identify interesting subspaces. In a recent work [161], the output of this simple search method was visualized by tree- and matrix-based views, where each dimension combination was represented by a single MDS plot.

In summary, many of these methods can be used to compare data with respect to different criteria. However, most of them assume the feature selection to be performed globally and do not take the subspace search problem directly into account. One focus of this thesis is to show that local selection of features is essential when analyzing patterns of high-dimensional data. The analysis is then performed in different subspaces of the data, and related work on visual analysis tools that deal especially with subspaces will be presented in the next subsection.

Visual Clustering

Identifying and relating groups of data is a key explorative data analysis task. Often, user interaction is needed to identify and revise the number and characteristics of data clusters found by automatic search methods. To this end, visual-interactive approaches are useful. Although many methods have been proposed, we can only highlight a few of them in an exemplary manner. In [124], interactive exploration of hierarchically clustered data along a dendrogram data structure is proposed to help users find the right level of clusters for their tasks (see Figure 2.4A). In [159], the parallel coordinates approach serves as a basic display to show data clustering results, allowing users to compare clusters along their high-dimensional data space. Also, 2D projections, possibly in conjunction with glyph-based representations of clusters, are widely employed; a recent example is [35] (see Figure 2.4B).
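What it means to choose a level of clusters along a dendrogram can be sketched as follows. The sketch builds a single-linkage hierarchy and returns the partition for a requested number of clusters. It is not the system of [124]; the single-linkage criterion, the function names `agglomerate` and `cut`, and the toy points are assumptions for illustration.

```python
from math import dist


def agglomerate(points):
    """Single-linkage agglomerative clustering.

    Returns a list of partitions: levels[0] holds n singleton clusters,
    and each further level merges the two closest clusters of the previous one.
    """
    clusters = [[i] for i in range(len(points))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                d = min(
                    dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        levels.append([sorted(c) for c in clusters])
    return levels


def cut(levels, k):
    """Return the partition with exactly k clusters (1 <= k <= n)."""
    n = len(levels[0])
    return levels[n - k]


# Hypothetical toy data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
levels = agglomerate(points)
for k in (3, 2):
    print(k, "clusters:", cut(levels, k))
```

An interactive tool would let the user move through these levels, for example with a slider over the dendrogram, instead of fixing `k` in advance.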

Figure 2.4: Interactive visual analysis systems for clustering in high-dimensional visualization. A: Interactive exploration of hierarchically clustered data along a dendrogram [124]. B: (a) Grouping icons to form clusters based on visual similarity. (b) User-defined grouping of icons [35].

These approaches to visualization and clustering in high-dimensional data spaces all have in common that they are based on a given full (or reduced) dimensionality of the input data set. Thereby, they show only a singular perspective of the usually multi-faceted high-dimensional data, which might not be the most relevant one. As we will show in this thesis, it is also useful to explore high-dimensional data for patterns in different subsets of its full high-dimensional input space to increase potential data insight.

Visual Classification

Classification uses a model that distinguishes data classes, created from a labeled training data set, to label new data. The classification model can be represented by decision trees. With purely automatic approaches, problems like over-fitting the model or tree pruning are difficult to tackle [86]. Visualization can help to overcome these problems, for example by incorporating the user in the tree construction process.

Ankerst et al. present in [11] a user-centered approach that combines the domain knowledge of users with the computational strengths of the computer to create rules that satisfy the user’s constraints and to generate visualizations of these patterns. Additionally, human pattern recognition, supported by adequate data visualizations, can be used to increase the effectiveness of decision trees. In Figure 2.5A the visual classification shows the decision tree, visualizing each attribute value by a colored pixel arranged in bars. Each attribute bar is sorted, and the attribute with the purest value distribution is selected as the split attribute of the decision tree. This procedure is repeated until all leaves contain pure classes. The split is marked with a black vertical line, and the leaves are underlined with a black line. Compared to standard visualizations of decision trees, additional information is encoded in a compact way, namely: size of the nodes (number of training records for the corresponding node), quality of the split (visible in the purity of the resulting partitions), and class distribution (frequency and location of the training instances of all classes).
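The step of sorting each attribute and selecting the one with the purest split can be made concrete with a small sketch. How purity is judged is not specified in the description above, so the use of Gini impurity here is an assumption, as are the function names and the toy table; the sketch is an automatic stand-in for a decision that the approach in [11] supports visually.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Sort records by attribute value and try every split between
    distinct neighbouring values. Returns (weighted impurity, threshold)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best


def purest_attribute(table, labels):
    """Pick the attribute whose best split yields the purest partitions."""
    scored = {name: best_split(col, labels) for name, col in table.items()}
    name = min(scored, key=lambda key: scored[key][0])
    return name, scored[name]


# Hypothetical toy table: attribute "a" separates the classes, "b" does not.
table = {"a": [1, 2, 3, 4, 5, 6], "b": [6, 1, 5, 2, 4, 3]}
labels = ["x", "x", "x", "y", "y", "y"]
print(purest_attribute(table, labels))
```

Applying `purest_attribute` recursively to the resulting partitions, until every partition is pure, corresponds to the repeated procedure described above.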

Figure 2.5: Interactive visual analysis systems for classification in high-dimensional data. A: Visual classification from [11] illustrates the decision tree for DNA training data having 19 attributes, visualizing each attribute value by a colored pixel arranged in bars. B: Decision tree construction system [142], representing the tree in a node-link diagram, displaying split points on the links and the split attributes on the nodes.

Figure 2.5B shows a recent example from [142] of an interactive system for decision tree construction. Here the authors pursue the same goal, i.e., to bring the domain-specific knowledge of the user into the construction of the tree. A tight integration of visualization, interaction and automation supports domain experts in growing, pruning, optimizing and analyzing decision trees [142]. Compared to the previous example, the tree representation is a more classic one, since the tree is represented by a node-link diagram. Internal and leaf nodes are represented by node glyphs, and each parent-child relationship is represented by a link from parent to child node. The advantage of this visual representation is that it makes it easier to count the leaves while also showing which nodes are on the same level [142]. The main view displays split points on the links, using the width to encode the number of items and color to encode the class membership of the items. The split attribute is shown on the nodes of the tree. These are visualized as rectangles containing relevant information like split attribute, class distribution, split points, and class histogram. Additional linked views support the user in constructing and optimizing the decision tree.

Dimension Reordering

As already discussed, dimension ordering is a relevant component of high-dimensional data visualization and exploration, as different orderings can expose different patterns. Ankerst et al. formulated dimension ordering as an optimization problem in [9] and demonstrated that it is an NP-complete problem that must thus be solved through heuristics. Peng et al. [112] apply dimension reordering to a series of n-dimensional visualization techniques to reduce clutter. Matrix-based visualizations, starting from the seminal work of Bertin [22], have also been heavily researched in terms of the patterns they can expose through reordering. In Section 5.1 we use dimension reordering and cluster reordering to make relationships among dimensions and clusters apparent in our ClustNails system.
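Because the ordering problem is solved through heuristics, a simple example may help to show what such a heuristic can look like. The following nearest-neighbour sketch places similar dimensions next to each other. It is not the algorithm of [9] or [112]; the Euclidean distance between value vectors (assumed to be normalized to [0, 1]), the function name `greedy_order`, and the toy data are assumptions for illustration.

```python
from math import dist


def greedy_order(columns, start):
    """Nearest-neighbour heuristic for dimension ordering.

    Starting from `start`, repeatedly append the not yet placed dimension
    whose value vector is closest to the most recently placed one.
    The result is a plausible ordering, not a guaranteed optimum.
    """
    order = [start]
    remaining = set(columns) - {start}
    while remaining:
        last = columns[order[-1]]
        # sorted() makes tie-breaking deterministic (alphabetical).
        nxt = min(sorted(remaining), key=lambda name: dist(last, columns[name]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical toy data: four dimensions, four records, values in [0, 1].
columns = {
    "d1": [0.0, 0.1, 0.2, 0.3],
    "d2": [0.9, 0.8, 0.7, 0.6],
    "d3": [0.05, 0.15, 0.25, 0.35],
    "d4": [0.8, 0.8, 0.6, 0.6],
}
print(greedy_order(columns, "d1"))
```

The greedy choice keeps the cost low at quadratic effort in the number of dimensions, at the price of depending on the start dimension; trying several start dimensions and keeping the best ordering is a common refinement.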

In [59], Guo also addresses ways to integrate visual and computational measures for picking and ordering variables for display on parallel coordinates. He describes a human-centered exploration environment, which incorporates a coordinated suite of computational and visualization methods to explore high-dimensional data and find patterns in these spaces. The main difference between this approach and our approach presented in Section 3.1.4 is that Guo searches for locally defined patterns in subspaces, while our work concentrates on finding global patterns in a 2-dimensional projection of the data set.

To summarize, ordering plays an important role in different areas, such as ordering the axes of parallel coordinates, reducing clutter in scatterplot matrices, and supporting similarity search in glyph-based visualizations or pixel-based displays.