
Defects outside the area of interest are removed as outliers. Among the linkage methods evaluated, single linkage achieved the best results with respect to the performance criteria defined for this application.

The study addresses the problem of detecting clusters of defects around the border of a metal work piece in the manufacturing process of car body parts. The objective is to recognize the increase of defect elements on the border and to decide at which point in time the machine requires sharpening or a change of blade.

Hierarchical clustering is the approach used to obtain clusters of defects. Several linkage metrics were evaluated during the study; the single linkage method produced the best results according to the performance criteria of this application. In the following sections, the theoretical concepts of hierarchical clustering, the structure of the data, and the experimental evaluation of the method are discussed respectively.

The general aim of cluster analysis is to find clusters (groups) whose elements are close to each other by some notion of distance, given a set of observations in a dataset. Hierarchical clustering is one such method: it creates a sequence of nested partitions, i.e. a hierarchy of clusters visualized in an upside-down tree-like structure called a dendrogram. The clusters in the hierarchy range from the lowest level of the tree (the leaves), where each observation forms its own cluster, to the highest level (the root), where all points form one cluster. There are two approaches to hierarchical clustering: agglomerative and divisive.

Agglomerative clustering works in a bottom-up manner, where each object is initially considered a single-element cluster (leaf). It repeatedly merges pairs of similar clusters until all points are grouped into one root cluster. Divisive clustering is the counterpart, working in a top-down manner: starting from a root cluster, it recursively splits heterogeneous clusters until each point is in its own cluster. [1] [2]

Merging and splitting of clusters is performed based on similarity/dissimilarity measures. In order to measure the dissimilarity between clusters of observations, different cluster agglomeration (linkage) methods are available. Among them, the common types are complete (maximum), single (minimum), average, centroid, and Ward's minimum variance method. These linkage methods use the maximum, minimum, or average value of the computed pairwise dissimilarity measure to link clusters.
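As an illustration, the following minimal sketch builds the merge hierarchy under each of the linkage methods named above. It uses Python's SciPy with purely illustrative random data; the library choice and all variable names are assumptions, since the original implementation is not shown here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
points = rng.random((20, 2))            # 20 illustrative 2-D observations

# Condensed matrix of pairwise Euclidean dissimilarities
distances = pdist(points, metric="euclidean")

# The common linkage methods named above; each returns the full merge
# history (which clusters are joined at each step, and at what distance)
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(distances, method=method)
    print(method, "-> distance of the final merge:", Z[-1, 2])
```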

The centroid and Ward's minimum variance methods merge clusters at each step based on the dissimilarity between cluster centroids and on the minimum total within-cluster variance, respectively. For our work, agglomerative clustering with the single linkage method based on Euclidean distance was used to form the clusters. In order to identify clusters, the dendrogram must be cut at a certain height or number of clusters, specified by the user, to group the observations. In this study, the number of clusters was specified, and it was dependent on the size of the dataset.
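A sketch of this cutting step, under the same assumptions as above (SciPy, illustrative data): the tree built with single linkage on Euclidean distances is cut into a user-specified number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = rng.random((60, 3))        # illustrative standardized (X, Y, D) rows

# Agglomerative clustering with single linkage on Euclidean distances
Z = linkage(points, method="single", metric="euclidean")

# Cut the dendrogram so that at most k clusters remain
k = 6                               # user-specified number of clusters
labels = fcluster(Z, t=k, criterion="maxclust")
print("cluster label of each observation:", labels)
```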

The dataset used in the project is a synthetically generated set of observations for the metal work pieces (800 mm x 100 mm) of a car panel. Approximately 25,000 datasets have been generated for analysis.

Records in the dataset represent defects that occur during the bending of the work piece in the production line. Each dataset contains n observations and k variables; its structure is described in Table 1.
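For illustration, one record could be modelled as below. The variable names follow the text (ID, X, Y, D, C), while the types and units are assumptions, and the meaning of C is not detailed in this section.

```python
from dataclasses import dataclass

@dataclass
class DefectRecord:
    ID: int     # unique identifier of the defect (not used for clustering)
    X: float    # x-coordinate of the defect on the work piece
    Y: float    # y-coordinate of the defect on the work piece
    D: float    # depth of the defect (in micrometres)
    C: str      # additional attribute; type assumed, not used for clustering
```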

The variables of interest for the clustering analysis are the location of the defect (the X and Y variables) and the depth of the defect (D).

Since the variables X, Y, and D were measured in different units, they had to be standardized (scaled) in order to make them comparable. The variables ID and C have no influence on the formation of clusters and were therefore excluded as inputs to the clustering technique.
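A minimal sketch of this standardization (z-scores per column; the data values are illustrative):

```python
import numpy as np

# Illustrative (X, Y, D) rows: coordinates in mm, depth in micrometres
data = np.array([[120.0, 15.0,  4.0],
                 [760.0, 90.0, 22.0],
                 [410.0, 55.0, 11.0]])

# Standardize each variable to zero mean and unit variance
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
print(scaled)
```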

In order to assess whether or not the datasets had meaningful clusters, a statistical clustering-tendency method, the Hopkins statistic, was applied. The Hopkins statistic measures the feasibility of clustering by testing the spatial randomness of the data. Its value is a probability indicating whether the given data (D) has a non-random structure or is uniformly distributed. The mean nearest-neighbour distance in a simulated dataset (a uniformly random counterpart of D), divided by the sum of the mean nearest-neighbour distances in the real dataset D and in the simulated dataset, gives the Hopkins statistic (H). If the value of H is less than 0.5, it is concluded that the dataset D has meaningful clusters. [2]
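A sketch of the Hopkins statistic as described above, assuming SciPy; the sample size m and the helper name are assumptions, not part of the original study.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(data: np.ndarray, m: int = 100, seed: int = 0) -> float:
    """H < 0.5 suggests the data has meaningful (non-random) clusters.

    m must be at most the number of observations in data.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    tree = cKDTree(data)

    # Nearest-neighbour distances from m uniformly random points
    # (the simulated dataset) to the real data
    uniform = rng.uniform(data.min(axis=0), data.max(axis=0), size=(m, d))
    u_dist = tree.query(uniform, k=1)[0]

    # Nearest-neighbour distances within the real data; k=2 because the
    # closest point to a sampled observation is the observation itself
    sample = data[rng.choice(n, size=m, replace=False)]
    w_dist = tree.query(sample, k=2)[0][:, 1]

    # Mean simulated distance over the sum of both mean distances
    return u_dist.mean() / (u_dist.mean() + w_dist.mean())
```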

To specify the number of clusters, the heuristic formula of [3] has been used: k = ⌊n/10⌋, where n is the number of observations in the dataset. This makes the number of clusters depend on the size of the dataset and also ensures that outliers are isolated in small clusters.

The key idea behind the outlier-detection methodology was to use the sizes of the resulting initial clusters as indicators of the presence of outliers. The outliers in this case are the clusters whose number of elements is less than a threshold parameter (τ). τ was fixed to 50 and can be replaced based on any performance criterion a production manager sees fit. Figure 1 portrays the original data and the results obtained throughout the process of clustering and removing outliers for a sample dataset (dataset 13800). The sample dataset contains 1351 observations and 5 variables.
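A sketch of this outlier-removal step under the assumptions above (single linkage, the ⌊n/10⌋ heuristic, and τ = 50; the function and variable names are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def remove_outliers(points: np.ndarray, tau: int = 50) -> np.ndarray:
    """Drop all observations belonging to initial clusters smaller than tau."""
    k = max(1, len(points) // 10)     # heuristic number of clusters [3]
    Z = linkage(points, method="single", metric="euclidean")
    labels = fcluster(Z, t=k, criterion="maxclust")

    sizes = np.bincount(labels)       # number of elements per cluster id
    keep = sizes[labels] >= tau       # True where the cluster is large enough
    return points[keep]
```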

Applying the heuristic formula, clustering this dataset results in 135 initial clusters, as shown in Figure 1b. One final cluster remains with a number of elements greater than τ. The remaining 134 clusters are removed as outliers; these were defects further away from the border of the work piece. After clustering and removing outliers, the further interest of the analysis was to examine the defect depth ranges within each cluster. Figure 1d shows a histogram of depth ranges, classified as 0-9, 10-19, and > 20 µm, for the final clusters.
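The classification into these depth ranges can be sketched as follows (the depth values are illustrative):

```python
import numpy as np

depths = np.array([3.2, 11.5, 24.0, 7.8, 19.9, 31.4])   # depths in micrometres

# Bin edges matching the ranges 0-9, 10-19 and > 20 used in Figure 1d
counts, _ = np.histogram(depths, bins=[0, 10, 20, np.inf])
print(dict(zip(("0-9", "10-19", "> 20"), counts)))
```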

For dataset 13800, the histogram shows that the work piece has defects with depth > 20 µm within its final cluster. This result can be used as an indicator that this work piece will not be usable as a car body part, and also as an indicator of an overdue change or sharpening of the blade used in production.

A time series showing how defects with high depths grow across the manufacturing process, as well as the growth of outliers across all datasets with a step size of 1000 files, is illustrated in Figure 1e and 1f.

Tab. 1: Data structure of the metal work piece data




References:

[1] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.

[2] A. Kassambara, "STHDA: Statistical Tools For High-Throughput Data Analysis" [Online]. Available: http://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning. [Accessed July 2016]

[3] A. Loureiro, L. Torgo and C. Soares, "Outlier Detection Using Clustering Methods: A Data Cleaning Application," in Proceedings of the KDNet Symposium on Knowledge-Based Systems for the Public Sector, 2004.

AUTHORS

Ruth Tesfaye Zibello, M. Sc.
Research Associate, Faculty of Electrical Engineering and Information Technology
ruth.zibello@hs-offenburg.de

Prof. Dr. rer. nat. Tobias Lauer
Faculty of Electrical Engineering and Information Technology, Research Group Analytics and Data Science
Teaching areas: Parallel Programming, Operating Systems
tobias.lauer@hs-offenburg.de

Prof. Dr. rer. nat. Stephan Trahasch
Faculty of Electrical Engineering and Information Technology, Research Group Analytics and Data Science
Teaching areas: Data Mining, Big Data Analytics, IT-Security
stephan.trahasch@hs-offenburg.de
http://analytics.hs-offenburg.de

Figure 1 e-f: Results of clustering, removing outliers, and time series

