Information Visualization Techniques for High-Dimensional Data

coordinates, as transformed dimensions.

MDS tries to preserve the pairwise distances between the data points. There are many variants of MDS, depending on the distance function used [31]. The simplest version is linear MDS, also called classical scaling; its solution is very closely related to PCA when a Euclidean distance function is used.
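The relation between classical scaling and PCA can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the random example data are illustrative and not taken from [31]. It embeds the same points once from their Euclidean distance matrix and once via PCA and compares the pairwise distances of the two embeddings. Assuming distinct eigenvalues, the two configurations should differ at most by a reflection of the axes, so their pairwise distances should agree up to numerical precision.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix of the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, k):
    """Classical scaling: k-dimensional coordinates from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def pca_scores(X, k):
    """Coordinates of the rows of X on the first k principal components."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]


rng = np.random.default_rng(0)                 # arbitrary example data
X = rng.normal(size=(20, 5))

Y_mds = classical_mds(pairwise_distances(X), 2)
Y_pca = pca_scores(X, 2)

# Both embeddings should have the same pairwise distances.
print(np.allclose(pairwise_distances(Y_mds), pairwise_distances(Y_pca)))
```

Comparing distances instead of coordinates avoids having to align the signs of the axes, which neither method determines uniquely.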

All these techniques rely on the idea that the variation in the data can be explained by a smaller number of transformed features. Their main difference from feature selection methods is that, instead of choosing a subset of the original dimensions, they create new dimensions defined as functions over all dimensions. They also do not consider class labels; their computation relies only on the data points.

General problems of these techniques are that the mapping is often not unique and that they have several parameters that influence the result. In addition, the resulting dimensions can be difficult to interpret: the original dimensions come from a specific domain and have a certain interpretation (like age, income, etc.), but their linear combinations can hardly be interpreted.

Koren and Carmel propose a series of new methods for creating projections from high-dimensional data sets using linear transformations [89]. For non-labeled data, they propose a generalization of PCA, the normalized PCA, which normalizes the squared pairwise distances to reduce the dominance of large distances that normally occurs in the standard PCA transformation. For labeled data, their methods integrate the class labels into the computation, resulting in projections with a clearer separation between the classes. Compared to traditional PCA or MDS, these methods have the advantage that they also capture intra-cluster shapes.

In addition to PCA and MDS presented above, further techniques have been developed that use linear or non-linear transformations of the original features to obtain a reduced set of synthetic dimensions. Detailed surveys can be found in [111, 153]. Another prominent group of dimension reduction techniques, which we briefly recall here, relies on signal processing techniques that, when applied to a data vector, transform it into a numerically different vector [64]. Examples are the Discrete Fourier Transform, the Cosine Transform and the Wavelet Transform. Since the input and the transformed data vectors have the same length, the data is reduced by truncating the transformed vector (e.g. the wavelet coefficients) according to a user-specified threshold.
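The truncation step can be illustrated with the Discrete Fourier Transform. The sketch below assumes NumPy; the function name truncate_spectrum, the synthetic data vector and the threshold value are hypothetical choices made for illustration and are not taken from [64]. Coefficients whose magnitude falls below the threshold are set to zero, so only the remaining coefficients need to be stored, and the inverse transform yields an approximation of the original vector.

```python
import numpy as np


def truncate_spectrum(x, threshold):
    """Zero all Fourier coefficients whose magnitude is below `threshold`."""
    coeffs = np.fft.fft(x)
    keep = np.abs(coeffs) >= threshold
    return np.where(keep, coeffs, 0.0), int(keep.sum())


n = 64
t = np.arange(n)
# Hypothetical data vector: two strong periodic components and one weak one.
x = (np.sin(2 * np.pi * 3 * t / n)
     + 0.5 * np.cos(2 * np.pi * 5 * t / n)
     + 0.01 * np.sin(2 * np.pi * 10 * t / n))

reduced, n_kept = truncate_spectrum(x, threshold=1.0)
x_approx = np.fft.ifft(reduced).real           # reconstruction from kept coefficients
max_error = float(np.max(np.abs(x - x_approx)))
print(n_kept, "of", n, "coefficients kept; max error", max_error)
```

In this example the weak component is discarded, so the reconstruction error is bounded by its amplitude; a higher threshold would trade more accuracy for a stronger reduction.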

2.2 Information Visualization Techniques for High-Dimensional Data

2.2.1 Information Visualization Techniques

The representation of high-dimensional data is one of the main research challenges in visualization. Several techniques have been developed in recent years to deal with the problem of representing relations among many dimensions on a computer display, which is inherently two-dimensional. By using further visual variables such as color and shape, data visualizations can go somewhat beyond 2D, but they still face issues when representing high-dimensional data sets. Classic approaches include parallel coordinates, scatterplot matrices, glyph-based and pixel-oriented techniques [145]. Figure 2.1 shows some examples of these techniques taken from [145].

Figure 2.1: High-dimensional visualization techniques taken from [145]. A: Scatterplot matrix showing on the diagonal a histogram plot for each dimension. Selected points are marked in red in all plots. B: Parallel coordinates plot of a seven-dimensional data set. One polyline representing one data point is highlighted in red. C: Star glyphs in an MDS layout. D: Dense pixel displays representing a 14-dimensional data set.

Scatterplots and Scatterplot Matrices [37]

2D scatterplots are among the most commonly used visualization techniques in data analysis. The data is represented by points in a rectangular box; the value of one variable (dimension) determines a point's position on the horizontal axis, and the value of the other variable determines its position on the vertical axis. To represent a data set of higher dimensionality, a common approach is to build a scatterplot matrix (SPLOM) [37].

Figure 2.1A shows an example of such a matrix for a four-dimensional data set, where every pair of dimensions is represented in one scatterplot. The matrix shows every plot twice, since it is symmetric with respect to the diagonal. Additionally, histograms on the diagonal show the value distribution of each dimension. Selected points are highlighted in red and a purple rectangle indicates their region.
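The structure of such a matrix can be made explicit with a short sketch. The function name splom_layout and the dimension names are hypothetical; the sketch only enumerates which plot belongs in which cell and does not draw anything.

```python
from itertools import combinations


def splom_layout(dim_names):
    """Describe every cell of a scatterplot matrix for the given dimensions."""
    cells = {}
    for row, y_dim in enumerate(dim_names):
        for col, x_dim in enumerate(dim_names):
            if row == col:
                # Diagonal cells show the value distribution of one dimension.
                cells[(row, col)] = ("histogram", x_dim)
            else:
                # Off-diagonal cells plot the column dimension against the row dimension.
                cells[(row, col)] = ("scatter", x_dim, y_dim)
    return cells


dims = ["a", "b", "c", "d"]           # hypothetical four-dimensional data set
cells = splom_layout(dims)
unique_pairs = list(combinations(dims, 2))

print(len(cells), "cells,", len(unique_pairs), "distinct pairs of dimensions")
```

For d dimensions the matrix has d * d cells but only d * (d - 1) / 2 distinct pairs, because each pair appears once above and once below the diagonal with swapped axes.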

Parallel Coordinates [78]

Another important visualization method for multivariate data sets is parallel coordinates. Parallel coordinates were first introduced by Inselberg [77] and are used in several tools, e.g. XmdvTool [146] and VIS-STAMP [60], for visualizing multivariate data. The basic idea is that each dimension¹ of the data is drawn as a vertical line, so the axes of the plot form a collection of parallel lines. Each data point is a polyline that intersects each dimension axis at its value in that dimension. Figure 2.1B shows an example of parallel coordinates for a seven-dimensional data set where the polyline of one data point is highlighted in red. In comparison to scatterplots, parallel coordinates can show data sets of higher dimensionality in one display, whereas a SPLOM visualizes a higher-dimensional data set by plotting every two-dimensional combination in a separate scatterplot. For both parallel coordinates and SPLOMs, the ordering is important: the order of the axes (dimensions) in parallel coordinates and, analogously, the order of rows and columns in the SPLOM, since different orderings make different relations in the data visible. It is therefore important to decide the order in which the dimensions are presented to the user. The effectiveness of both techniques, however, strongly depends on the dimensionality of the data under inspection. Because the available resolution decreases and the number of possible orderings grows as the number of data dimensions increases, it becomes very difficult, if not impossible, to explore the whole set of available orderings manually. In Section 2.3.2, we describe the notion of quality metrics, which are mechanisms to automatically quantify the quality of a display, and in Section 3.1.4, we introduce new quality metrics to determine the best ordering in parallel coordinates with respect to a given task.
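The construction of a polyline and the effect of the axis order can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the example data are hypothetical, each axis is scaled to [0, 1] by min-max normalization, and an axis order and its mirror image are counted as one ordering because they place the same dimensions next to each other.

```python
from math import factorial


def polyline(record, mins, maxs, axis_order):
    """Vertices (axis position, normalized value) of one record's polyline."""
    vertices = []
    for position, dim in enumerate(axis_order):
        span = maxs[dim] - mins[dim]
        # Put constant dimensions in the middle of the axis (an arbitrary choice).
        height = 0.5 if span == 0 else (record[dim] - mins[dim]) / span
        vertices.append((position, height))
    return vertices


def distinct_axis_orderings(n_dims):
    """Number of axis orders, counting an order and its mirror image once."""
    return factorial(n_dims) // 2 if n_dims > 1 else 1


data = [(1.0, 10.0, 100.0), (3.0, 30.0, 200.0), (2.0, 20.0, 400.0)]
mins = [min(col) for col in zip(*data)]
maxs = [max(col) for col in zip(*data)]

print(polyline(data[2], mins, maxs, axis_order=[0, 1, 2]))
print(polyline(data[2], mins, maxs, axis_order=[2, 0, 1]))
print(distinct_axis_orderings(7))
```

The same record yields a differently shaped polyline for each axis order, and the number of orders grows factorially with the number of dimensions, which is why a manual exploration quickly becomes infeasible.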

Glyph-based techniques [147]

“Glyphs are graphical entities that convey one or more data values via attributes such as shape, size, color, and position” [147]. A variety of glyphs has been proposed in the literature, for example star glyphs, face glyphs, profile glyphs and box glyphs. An overview of multivariate glyphs can be found in [147]. They all have in common that they use one graphical representation per object, but they use different encodings for the object's attributes (e.g. length, area, color). Figure 2.1C shows star glyphs as an example. As the name suggests, each object is represented by a star-shaped glyph, where the value of each dimension is represented by the length of one of several evenly spaced rays. The ray ends are connected by a polyline.
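The geometry of a star glyph can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that the values of an object have already been normalized to [0, 1]; it computes the ray end points and the closed outline that connects them, without drawing them.

```python
import math


def star_glyph(values, center=(0.0, 0.0), max_radius=1.0):
    """Ray end points of a star glyph for values normalized to [0, 1]."""
    n = len(values)
    points = []
    for i, value in enumerate(values):
        angle = 2 * math.pi * i / n          # evenly spaced rays
        radius = value * max_radius          # ray length encodes the value
        points.append((center[0] + radius * math.cos(angle),
                       center[1] + radius * math.sin(angle)))
    return points


def closed_outline(points):
    """Polyline connecting the ray ends, closed by repeating the first point."""
    return points + points[:1]


# Hypothetical object with four normalized attribute values.
outline = closed_outline(star_glyph([1.0, 0.5, 1.0, 0.5]))
print(outline)
```

Placing many such glyphs, for instance at the positions of an MDS layout as in Figure 2.1C, only requires a different center for each object.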

Pixel-oriented techniques [145]

Pixel-oriented techniques “map each value to individual pixels and create a filled polygon to represent each dimension” [145]. In Figure 2.1D, a 14-dimensional data set is represented by dense pixel displays that show each dimension in a separate rectangle and each data value as a colored pixel in that rectangle. The values are sorted according to the tenth dimension, which is marked with a black border. Here we can see several challenges for these techniques. One is the already mentioned ordering of the data values, which is needed to spot correlated dimensions; another is the ordering of the dimensions, which should position similar dimensions close to each other on the screen. Different colormaps can also reveal different patterns in the data, so choosing a suitable colormap for each data set and task is yet another challenge. Additionally, positioning the dimensions on the screen is not trivial, since layouts other than the grid layout are possible.
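The role of the value ordering can be illustrated with a small sketch. It is not the arrangement used in [145]: the function name pixel_panels, the row-major pixel layout and the linear mapping to gray levels from 0 to 255 are assumptions made for illustration. All records are sorted by one dimension, and the same order is used in every panel, so a dimension that is correlated with the sorting dimension shows a similar gradient.

```python
def pixel_panels(rows, sort_dim, width):
    """One grid of gray levels (0..255) per dimension, with shared record order."""
    ordered = sorted(rows, key=lambda r: r[sort_dim])
    n_dims = len(rows[0])
    panels = []
    for d in range(n_dims):
        column = [r[d] for r in ordered]
        lo, hi = min(column), max(column)
        span = hi - lo
        # Linear colormap; constant dimensions are mapped to level 0.
        levels = [0 if span == 0 else round(255 * (v - lo) / span) for v in column]
        # Row-major layout: fill each panel line by line.
        panels.append([levels[i:i + width] for i in range(0, len(levels), width)])
    return panels


rows = [(3, 30), (1, 10), (2, 20), (4, 0)]     # hypothetical two-dimensional data
panels = pixel_panels(rows, sort_dim=0, width=2)
print(panels[0])
print(panels[1])
```

The panel of the sorting dimension shows a monotone gradient, while the second panel breaks the gradient at the record whose value does not follow the trend.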

¹ We use the terms dimension and attribute (as well as feature, variable, column and axis) interchangeably in this thesis. We choose among them based on the context of the discussion, while attempting to be consistent with their use in the literature.

2.2.2 Limitations while Visualizing High-Dimensional Data

As previously demonstrated, there are different ways to represent high-dimensional data on the screen, and all of them bring a number of challenges. As already identified, these challenges are due to the scalability of the display, the ordering of displayed objects or dimensions, the positioning of objects on the screen, and the high number of possible visual mappings. Providing solutions for some of these problems would ease the exploration of high-dimensional data. With an appropriate sorting of dimensions and an appropriate mapping to visual variables, clutter can be reduced, and these visualization methods could make it possible to overview and relate high-dimensional data sets [49]. The data dimensionality causes problems in the visual mapping stage: it is unclear which mapping is the best, that is, which data dimension should be mapped to which visual variable.

Because of the high number of possible mappings for a high-dimensional data set, automated methods are needed to restrict this number. One way is to judge the quality of these mappings by computing quality measures for the displayed data (see Chapter 3 for more details); another is to reduce the number of dimensions with dimensionality reduction techniques (see Section 2.1.2).

Enriching Visualizations

Static visualization techniques are not flexible enough to reveal complex high-dimensional patterns, so interaction is needed at this point. Different solutions have been proposed to make visualizations interactive and to support a dynamic use for high-dimensional data. These include brushing and linking [46], panning and zooming [19], focus-plus-context [92] and magic lenses [29].

“Brushing and linking refers to the connecting of two or more views of the same data, such that a change to the representation in one view affects the representation in the other views as well. . . . Panning and zooming refers to the actions of a movie camera that can scan sideways across a scene (panning) or move in for a closeup or back away to get a wider view (zooming). . . . When zooming is used, the more detail is visible about a particular item, the less can be seen about the surrounding items. Focus-plus-context is used to partly alleviate this effect. The idea is to make one portion of the view – the focus of attention – larger, while simultaneously shrinking the surrounding objects. The farther an object is from the focus of attention, the smaller it is made to appear. . . . Magic lenses are directly manipulable transparent windows that, when overlapped on some other data type, cause a transformation to be applied to the underlying data, thus changing its appearance” [15]. A full illustration of these techniques is beyond the scope of this work; more details can be found in [15].²
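The core of brushing and linking is that all views share the identity of the selected records. The following sketch illustrates this idea under simple assumptions; the function names, the rectangular brush and the color names are hypothetical and are not taken from [15] or [46].

```python
def brush(records, x_dim, y_dim, x_range, y_range):
    """Indices of the records inside a rectangular brush in one scatterplot."""
    selected = set()
    for index, record in enumerate(records):
        if (x_range[0] <= record[x_dim] <= x_range[1]
                and y_range[0] <= record[y_dim] <= y_range[1]):
            selected.add(index)
    return selected


def linked_colors(n_records, selected, highlight="red", normal="gray"):
    """Per-record colors that any other view of the same data can reuse."""
    return [highlight if i in selected else normal for i in range(n_records)]


records = [(1, 5, 9), (2, 6, 1), (8, 7, 3), (3, 1, 4)]   # hypothetical data
selected = brush(records, x_dim=0, y_dim=1, x_range=(0, 3), y_range=(4, 8))
colors = linked_colors(len(records), selected)
print(sorted(selected), colors)
```

Because the selection is stored as record indices and not as screen positions, a second view, for example a parallel coordinates plot of the same records, can highlight exactly the same objects.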

Patterns that are visible only in subspaces of the original data space also need specialized visualizations that disclose the relations between the different subspaces from which they originate as well as their possible object overlap. In Chapter 5 we present a visual-interactive tool for this purpose.

² The cited descriptions of each technique are from Chapter 10: User Interfaces and Visualization, by Marti Hearst. This chapter can also be found online at http://people.ischool.berkeley.edu/~hearst/irbook/10/node3.html#SECTION00122000000000000000f (last accessed on 03/13).