2. Visual Analytics and Related Research Fields 11

2.5. Data Processing

2.5.2. Data Mining

Data mining, generally speaking, is the process of extracting hidden patterns from data algorithmically. Data mining methods include classification, regression, clustering, summarization (see Section 2.5.1), dependency modeling, change and deviation detection, association analysis, and outlier detection [FPsS96, HK06].

• Classification concentrates on finding a model or function that describes and distinguishes data classes or concepts for the purpose of determining the classes of unlabeled objects. More specifically, the outputs are categorical values.

³ Attribute selection techniques can be regarded as linear orthogonal projection techniques.

• Clustering is an unsupervised method for grouping objects, which maximizes intra-group similarity and at the same time maximizes inter-group dissimilarity. In contrast to classification, the objects to be clustered have unknown classes.

• Association analysis (also called association rule learning or mining) focuses on the search for interesting relationships among data items. In other words, it tries to discover association rules showing attribute–value conditions that frequently occur together in a given data set.

• Regression methods model the data by fitting an approximating function to it (e.g., linear regression). In contrast to association rules, the input and output data are continuous in nature.

• Outlier analysis tries to find outliers in the data. Outliers are data items that are extraordinary or that do not comply with the general behavior of the data.
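To make the association analysis method above concrete, the following is a minimal sketch of how support and confidence of a candidate rule A → B can be computed over a set of transactions; the transaction data is purely illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Illustrative market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.666...
```

An algorithm such as Apriori enumerates candidate itemsets and keeps only rules whose support and confidence exceed user-chosen thresholds.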

As can be seen, there are many types of data mining methods. For the purposes of this thesis, we concentrate on clustering in the following.

2.5.2.1. Clustering

Clustering is an important data mining technique, which has its roots in statistics and has been applied in various areas. Clustering supports the examination of large amounts of data by revealing structure in the data set and by exposing the characteristics of groups of data [HK06]. It abstracts, in an unsupervised manner, the data objects into a limited number of groups (i.e., clusters). The clustering results can serve as input to various further applications.

The objective of clustering is to group together objects having a high similarity within the clusters (intra-cluster compactness) while at the same time having a high dissimilarity between the clusters (inter-cluster separability).

The common criterion for grouping objects is their similarity, which is very often based on a predefined distance function depending on the nature of the data. Many algorithms exist, and the choice of clustering algorithm depends on the type of input data and on the purpose of the clustering. In general, it is difficult to assess the effectiveness of unsupervised methods (including clustering), as it is a subjective matter [HTF09].

For multi-dimensional categorical (including binary) and continuous data, several distance functions are used on a regular basis in clustering (e.g., Euclidean, Minkowski, Tanimoto, Mahalanobis). However, for complex data objects such as pictures, videos, graphs, trajectories, or 3D models, specialized distance functions need to be defined and used. In these cases, the objects are often described by so-called feature vectors – multi-dimensional continuous/categorical vectors, which are used for measuring similarity between objects by applying the standard metrics mentioned above. For this thesis, the definitions of graph and trajectory similarity are of special interest. The feature sets applied to these types of data objects are described in Sections 3.6 and 4.6, respectively. For defining similarity between sets of objects (clusters), there are several standard approaches including single linkage, complete linkage, and average linkage.
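The distance functions and linkage criteria named above can be sketched as follows; the points and the parameter `p` are illustrative.

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance between two vectors; p=2 yields the Euclidean distance."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def linkage_distance(cluster_a, cluster_b, mode="single"):
    """Distance between two clusters (sets of points) under a linkage criterion."""
    d = [minkowski(a, b) for a in cluster_a for b in cluster_b]
    if mode == "single":      # single linkage: closest pair of points
        return min(d)
    if mode == "complete":    # complete linkage: farthest pair of points
        return max(d)
    return sum(d) / len(d)    # average linkage: mean over all pairs

a = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
b = [np.array([4.0, 0.0]), np.array([6.0, 0.0])]
print(linkage_distance(a, b, "single"))    # 3.0
print(linkage_distance(a, b, "complete"))  # 6.0
print(linkage_distance(a, b, "average"))   # 4.5
```

Single linkage tends to produce elongated, chain-like clusters, whereas complete linkage favors compact ones; average linkage is a compromise between the two.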

Clustering techniques can be broadly categorized into partitioning, hierarchic, density-based, grid-based, statistical model-based, and neural network methods. The allocation of the techniques is not unique, as various algorithms combine multiple methods to gain better clustering results [HK06]. In the following, we concentrate on neural network methods, in particular self-organizing maps, as they are relevant for the approaches applied in the visual analysis of both data types of interest to the thesis.

A self-organizing map (SOM) is a neural network learning algorithm with a strong disposition for visualization [Ves99, Koh01]. SOM combines dimension reduction and clustering. It preserves topological properties of the data set (i.e., two data points that are close in the original space remain close in the lower-dimensional space). The algorithm can handle large data sets and offers good clustering results. SOM clustering has previously been applied successfully to many different data types including documents [HKLK97], audio [RM06], and images [Bar08].

In this algorithm, a network of prototype vectors is iteratively trained to represent a set of input data vectors.

The network is often assumed to be a 2-dimensional regular grid. During training, the algorithm iterates over the input data vectors. For each input vector, it finds the best matching prototype and adjusts it, as well as a number of its network neighbors, toward the input vector. In the course of the process, the considered neighborhood size and the strength of the adjustment (the learning rate) are reduced. The training result is a set of prototype vectors representing the input data. In addition, the low-dimensional arrangement of prototypes on the network yields a topological ordering of the prototype vectors, approximating the topology of the data samples in the original data space.
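The training procedure described above can be sketched as follows. The grid size, the linear decay schedules, and the Gaussian neighborhood kernel are illustrative choices, not the specific settings used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 5, 5, 3
prototypes = rng.random((grid_h, grid_w, dim))     # prototype vector per grid cell
coords = np.dstack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                               indexing="ij")).astype(float)

data = rng.random((200, dim))                      # illustrative input vectors
n_iters = 1000
for t in range(n_iters):
    x = data[t % len(data)]
    # find the best matching unit (BMU): the prototype closest to x
    dists = np.linalg.norm(prototypes - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # learning rate and neighborhood radius decay over the training process
    lr = 0.5 * (1.0 - t / n_iters)
    sigma = 2.0 * (1.0 - t / n_iters) + 0.1
    # Gaussian neighborhood on the grid: the BMU and its grid neighbors
    # are pulled toward the input vector, with decreasing strength
    grid_dist = np.linalg.norm(coords - np.array(bmu, dtype=float), axis=2)
    h = np.exp(-grid_dist**2 / (2 * sigma**2))
    prototypes += lr * h[..., None] * (x - prototypes)
```

Because neighboring grid cells are updated together, prototypes that end up adjacent on the grid represent similar regions of the input space, which is the source of the topology preservation noted above.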

The main parameterization required by the algorithm includes the initialization of the prototype vectors and the specification of the learning parameters. The latter includes the duration of the training process, the definition of the neighborhood kernel, and the degree of vector adjustment (the learning rate). While a number of rules of thumb exist for the parameter setting [Koh01,Ves99], finding good settings for a given data set usually requires experimentation and evaluation by the user.

With respect to the computational complexity, several enhancements of the original SOM calculation have been proposed [STM06, Koi94, LAR99].

It is difficult to compare the effectiveness of self-organizing maps to other clustering methods. Preliminary comparisons to the k-means algorithm [Ult95] suggest that SOM yields better clustering results; however, as these comparisons are based on a limited set of examples, they do not generalize to all data sets. Self-organizing maps and k-means are nevertheless related: when the neighborhood size is set to zero, SOM is equivalent to the k-means algorithm [Kas97].
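The relation noted above can be seen directly in the update rule: with a zero-size neighborhood, only the best matching prototype moves, which is exactly the online (sequential) k-means update. A minimal sketch with illustrative data and a fixed learning rate:

```python
import numpy as np

rng = np.random.default_rng(1)
centroids = rng.random((4, 2))   # 4 prototypes, i.e., cluster centers
data = rng.random((500, 2))

lr = 0.1
for x in data:
    # with neighborhood size zero, only the winner (BMU) is adjusted
    bmu = np.argmin(np.linalg.norm(centroids - x, axis=1))
    centroids[bmu] += lr * (x - centroids[bmu])
```

Restoring a nonzero neighborhood, so that grid neighbors of the winner also move, turns this loop back into SOM training.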

Clustering quality The quality of the clustering results plays an important role when choosing the appropriate data representation. Various approaches to the algorithmic assessment of clustering quality are described in several surveys including [P¨04, LJB06, KL96, FFG08]. From the portfolio of the proposed measures, a subset is applicable to the results of multiple clustering algorithms (e.g., quantization error, compactness, proximity), while the rest apply only to SOM results (e.g., topographic product, topographic error). The general measures focus on the assessment of cluster compactness and inter-cluster separation, while the SOM-specific measures also consider specific properties of the SOM output, in particular the preservation of topology.
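As an example of a general measure mentioned above, the quantization error is the mean distance between each data sample and its best matching prototype; a sketch with illustrative data:

```python
import numpy as np

def quantization_error(data, prototypes):
    """Mean distance from each sample to its nearest prototype (lower is better)."""
    # pairwise distances between all samples and all prototypes
    d = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(2)
data = rng.random((100, 3))
prototypes = rng.random((10, 3))
print(quantization_error(data, prototypes))
```

SOM-specific measures such as the topographic error additionally require the grid positions of the prototypes, since they check whether the first and second best matching units are grid neighbors.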

Visualization of Clustering Results Visualization is often key to understanding otherwise abstract clustering results. While certain clustering approaches implicitly yield visual representations (e.g., dendrograms for hierarchic clustering [SS02]), many other clustering techniques need specific post-processing of the results in order to visually represent them. In the case of high-dimensional data outputs (e.g., multi-dimensional feature vectors in k-means clustering), parallel coordinate or star views [FWR99, Kan01] or projection-based approaches are common [EC01]. When focusing on SOM result visualization, the visualization techniques usually show the SOM grid with the reference prototypes (e.g., using multi-dimensional techniques), by labeling the cells with the most common members [Kas97], or by showing the member nearest to the prototype. An overview of SOM visualization techniques is provided in [Ves99]. The visual assessment of clustering quality is supported by the display of the distances between SOM cells using the so-called U- and U*-Matrix [Ult03], by the exploration of the topology properties using vector fields [PDR06], or by showing the SOM topology via nearest neighbor connections [PRD05]. The distribution of prototype values across the SOM grid is presented using so-called component planes [Koh01], where each dimension of the prototype vectors is shown in a separate heat-map.
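A U-matrix-style display as mentioned above can be computed roughly as follows: for each SOM cell, average the distances between its prototype and the prototypes of its grid neighbors; high values then mark cluster borders. This sketch uses a 4-neighborhood, which is one of several common choices.

```python
import numpy as np

def u_matrix(prototypes):
    """Per-cell mean distance to the prototypes of the 4-neighbors on the grid."""
    h, w, _ = prototypes.shape
    u = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            neigh = [(i + di, j + dj)
                     for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= i + di < h and 0 <= j + dj < w]
            u[i, j] = np.mean([np.linalg.norm(prototypes[i, j] - prototypes[ni, nj])
                               for ni, nj in neigh])
    return u

rng = np.random.default_rng(3)
heat = u_matrix(rng.random((5, 5, 3)))   # render `heat` as a heat-map
```

The resulting matrix is typically rendered as a gray-scale or color heat-map over the SOM grid, so that "walls" of high distance visually separate the clusters.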

For the exploration and refinement of the clustering results, interaction techniques such as focus, zoom, filtering, or multiple-linked views are provided [SS02,CL03,NHM07].
