
projected pairwise distance of the mean points, as given in [17], it might happen that two microclusters that have a different subspace orientation are grouped because the mean of the one microcluster fits into the subspace of the other one although their subspace preferences are different.

full-dimensional noise cluster, the correlation clusters have a random dimensionality which is below the full dimensionality.

For all experiments, we report the results obtained with the best parameter setting considered. Settings are explored by a grid search over buff_size ∈ {10, 15, 20}, the distance threshold ∈ {0.1, 0.15, 0.2, 0.3}, and minMcs ∈ {1, 2, 3}. The parameters that were already introduced by previous methods are fixed.
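The grid search described above can be sketched as follows. This is only an illustration: the key names `max_dist` and `min_mcs` as well as the `evaluate` callback are hypothetical and do not come from the original implementation, and how "best" is scored is left to the caller.

```python
from itertools import product

# Parameter values as listed in the text; the key names are assumptions.
PARAM_GRID = {
    "buff_size": [10, 15, 20],
    "max_dist": [0.1, 0.15, 0.2, 0.3],  # hypothetical name for the distance threshold
    "min_mcs": [1, 2, 3],               # hypothetical spelling of minMcs
}


def enumerate_settings(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def best_setting(grid, evaluate):
    """Pick the setting with the highest score; `evaluate` is supplied by the caller."""
    return max(enumerate_settings(grid), key=evaluate)


settings = enumerate_settings(PARAM_GRID)
print(len(settings))  # 3 * 4 * 3 combinations
```

With three, four, and three candidate values, the grid contains 36 settings per experiment.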

Runtime Experiments

The first set of experiments investigates the performance of the proposed method in terms of runtime. Figure 4.2(a) shows the results for varying numbers of data objects. The y-axis shows the measured runtime in log scale. While the runtimes of both static methods increase rather quickly, even for moderately sized databases, the CorrStream variants are able to process 42,000 data objects within only a few seconds. When varying the dimensionality of the feature space from 4 to 24 dimensions, the outcome is quite similar. As depicted in Figure 4.2(b), the static algorithms need much more time than our method. Again, the CorrStream variants significantly outperform the static competitors and need only a few seconds.

Quality Experiments

The next set of experiments investigates the considered methods in terms of clustering quality. To this end, we compare the resulting clusterings of each approach to the ground truth labels. Precisely, we use precision and recall values to examine the performance. First, Figure 4.3 compares the results of CorrStream using the ERiC approach for the offline phase against the static ERiC algorithm. In both plots the ERiC algorithm shows higher precision and recall values than CorrStream. This might be caused by aging as well as by treating microclusters as noise if their buffers have not been filled. Figure 4.4 shows the corresponding results when using ORCLUS for the offline procedure. Interestingly, the CorrStream results are better in the experiments with increasing dimensionality. An explanation for this result might be that during the first iteration of ORCLUS, all data points are assigned to the closest cluster center with respect to the Euclidean distance. This leads to a situation where clusters, or rather groups of points that are treated as intermediate clusters during the iterations, span several actual clusters.

Such intermediate clusters, which typically occur if differently oriented correlation clusters intersect, finally form a "false" subspace and thus might absorb data points that actually belong to another cluster. CorrStream reduces this problem by using the correlation distance for the assignment of a data point as soon as the initialization of a microcluster is completed.
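The effect can be illustrated with a small two-dimensional sketch. It rests on stated assumptions: the residual distance used here (the distance of a point from the line through a cluster's center) is only a simplified stand-in for the correlation distance, and the two clusters and the query point are made up for the example.

```python
import math


def euclidean_distance(point, center):
    """Plain Euclidean distance to the cluster center."""
    return math.dist(point, center)


def residual_distance(point, center, direction):
    """Distance of `point` from the line through `center` along `direction`.

    Simplified stand-in for a correlation distance: only the component
    orthogonal to the cluster's orientation is counted.
    """
    dx, dy = point[0] - center[0], point[1] - center[1]
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    along = dx * ux + dy * uy              # component along the cluster line
    return math.hypot(dx - along * ux, dy - along * uy)


# Two hypothetical, differently oriented clusters that intersect near x = 4.
clusters = {
    "A": {"center": (0.0, 0.0), "direction": (1.0, 0.0)},  # along the x-axis
    "B": {"center": (4.0, 1.0), "direction": (0.0, 1.0)},  # along the line x = 4
}
point = (3.5, 0.0)  # lies exactly on the line of cluster A

by_euclid = min(clusters, key=lambda c: euclidean_distance(point, clusters[c]["center"]))
by_residual = min(clusters, key=lambda c: residual_distance(
    point, clusters[c]["center"], clusters[c]["direction"]))
print(by_euclid, by_residual)
```

The point lies on the line of cluster A but is closer to the center of cluster B, so the Euclidean assignment picks B while the orientation-aware distance picks A.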

Figure 4.3: Precision and recall measurements when considering ERiC. (a) Results for varying numbers of data objects. (b) Results for varying dimensionalities.

Figure 4.4: Precision and recall measurements when considering ORCLUS. (a) Results for varying numbers of data objects. (b) Results for varying dimensionalities.
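The text does not state how precision and recall are derived from a clustering and the ground truth labels. One common choice, assumed here purely for illustration, is pair counting: a pair of objects is a true positive if both the clustering and the ground truth place the two objects together. The function name and the toy labels below are hypothetical.

```python
from itertools import combinations


def pair_precision_recall(truth, predicted):
    """Pair-counting precision and recall of a clustering against ground truth."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_pred and same_truth:
            tp += 1          # pair correctly grouped together
        elif same_pred:
            fp += 1          # pair grouped together although labels differ
        elif same_truth:
            fn += 1          # pair separated although labels agree
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth = [0, 0, 1, 1]
predicted = [0, 0, 0, 1]
precision, recall = pair_precision_recall(truth, predicted)
print(precision, recall)
```

In the toy example, one of the three co-clustered pairs is correct and one of the two true pairs is recovered, which gives a precision of 1/3 and a recall of 1/2.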


Figure 4.5: The throughput of the online phase by considering different dimensionalities.

Finally, as throughput is one of the major criteria for streaming algorithms, we evaluate CorrStream in terms of throughput, i.e., the number of data objects processed per millisecond, for various dimensionalities. The plot in Figure 4.5 shows the throughput of the online phase for different dimensionalities of the feature space. As can be seen, the throughput of the algorithm decreases with increasing dimensionality. Nevertheless, the decline flattens for higher dimensionalities, and the throughput is still about 363 data objects per second for 24-dimensional feature spaces.
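As a rough sketch of how such a throughput figure can be obtained, the following helper times a processing callback over a batch of objects and reports the rate both per second and per millisecond. The function name and the no-op `process` callback are placeholders, not part of CorrStream.

```python
import time


def measure_throughput(points, process):
    """Return (objects per second, objects per millisecond) for `process`."""
    start = time.perf_counter()
    for point in points:
        process(point)
    # Guard against a zero interval on very coarse clocks.
    elapsed_s = max(time.perf_counter() - start, 1e-9)
    per_second = len(points) / elapsed_s
    return per_second, per_second / 1000.0


# Hypothetical usage with a placeholder that does no real work.
stream = [(float(i), 0.0) for i in range(10_000)]
per_second, per_millisecond = measure_throughput(stream, lambda p: None)
print(f"{per_second:.0f} objects/s, {per_millisecond:.2f} objects/ms")
```

Reporting both units side by side makes it easy to compare values stated per second with values stated per millisecond.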

Influence of Parameters

In the following block of experiments, we discuss the influence of the parameters that are newly introduced in this work and must be set by the user for CorrStream: the distance threshold parameter, which gives the allowed maximum distance of a data point to a microcluster center for the assignment step, and the buff_size parameter, which regulates the size of the buffer, or in other words the number of points used for the initial PCA. We omit the discussion of the minMcs parameter, as this parameter is ERiC-specific and is not used in the ORCLUS-like variant. Note that the ORCLUS-specific k_0 parameter, i.e., the parameter that defines the initial number of clusters, which is subsequently reduced to k during the iterations, is implicitly given by the number of microclusters.
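The role of buff_size can be sketched as follows. This is a minimal illustration, not the CorrStream implementation: the class and helper names are hypothetical, and instead of a full PCA the sketch only computes the mean and covariance matrix that a PCA would take as input once the buffer is full.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def mean_and_covariance(points):
    """Mean vector and (population) covariance matrix of the buffered points."""
    n, d = len(points), len(points[0])
    mean = tuple(sum(p[i] for p in points) / n for i in range(d))
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(d)] for i in range(d)]
    return mean, cov


@dataclass
class MicroCluster:
    buff_size: int
    buffer: List[Tuple[float, ...]] = field(default_factory=list)
    mean: Optional[Tuple[float, ...]] = None
    cov: Optional[List[List[float]]] = None

    @property
    def initialized(self) -> bool:
        return self.mean is not None

    def add(self, point):
        """Buffer points; initialize once `buff_size` points have arrived."""
        if self.initialized:
            return  # updating an initialized microcluster is out of scope here
        self.buffer.append(point)
        if len(self.buffer) == self.buff_size:
            # A PCA of `cov` would yield the subspace orientation at this point.
            self.mean, self.cov = mean_and_covariance(self.buffer)


mc = MicroCluster(buff_size=3)
for p in [(0.0, 0.0), (1.0, 1.0)]:
    mc.add(p)
print(mc.initialized)   # buffer not yet full
mc.add((2.0, 2.0))
print(mc.initialized, mc.mean)
```

Until the buffer holds buff_size points, the microcluster has no orientation, which is why a larger buffer delays initialization but bases the first PCA on more points.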

The plot in Figure 4.6 describes the influence of the distance threshold parameter. The precision and recall values slightly decrease with increasing threshold values, cf. Figure 4.6(a). This can be explained by the enlarged absorption radius, as more distant points can be absorbed by a microcluster, especially during the initialization phase. Furthermore, as can be seen in Figure 4.6(b), the number of produced microclusters decreases with increasing threshold values, and thus so does the required computation time.
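The relation between the distance threshold, the number of microclusters, and the number of distance computations can be reproduced with a toy online pass. The sketch makes simplifying assumptions: it uses plain Euclidean distance, never updates centers, and runs on synthetic points on a line; `online_pass` and the threshold values are illustrative only.

```python
import math


def online_pass(points, max_dist):
    """Assign each point to an existing center within `max_dist` or open a new one.

    Returns (number of microclusters, number of distance computations).
    """
    centers = []
    n_distances = 0
    for point in points:
        absorbed = False
        for center in centers:
            n_distances += 1
            if math.dist(point, center) <= max_dist:
                absorbed = True
        if not absorbed:
            centers.append(point)  # the point seeds a new microcluster
    return len(centers), n_distances


# Synthetic stream: 20 points spaced one unit apart on a line.
points = [(float(i), 0.0) for i in range(20)]
results = {max_dist: online_pass(points, max_dist) for max_dist in (0.5, 1.5, 3.5)}
for max_dist, (n_clusters, n_distances) in results.items():
    print(max_dist, n_clusters, n_distances)
```

On this toy stream, a larger threshold yields fewer microclusters and therefore fewer distance computations per incoming point, in line with the trend described for Figure 4.6(b).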

Figure 4.6: Performance measures for various values of the distance threshold parameter. (a) Precision and recall for varying threshold values. (b) Runtime (left axis) and the absolute number of microclusters (right axis) for varying threshold values.

Figure 4.7: Performance measures for various values of the buff_size parameter. (a) Precision and recall for varying values of buff_size. (b) Runtime (left axis) and the absolute number of microclusters (right axis) for varying values of buff_size.

Figure 4.7 shows the results when considering various buffer sizes. While varying this parameter hardly affects the measured precision and recall values, the runtime measurements show that the size of the initial microcluster buffers has a rather high impact on the performance, which generally seems to benefit from larger buffers.

To summarize, our findings show that the choice of the parameters affects the efficiency of the online phase. An inappropriate selection of parameters might lead to increased runtimes. For instance, if the distance threshold parameter is chosen poorly, the number of generated microclusters may increase, which in turn leads to a higher number of necessary distance computations when assigning incoming data objects to existing microclusters. Note that the increased computational costs do not necessarily lead to a significant improvement in the overall clustering quality. On the other hand, it might happen that the distance threshold and buff_size parameters are combined in such a way that microclusters do not initialize.

However, this strongly depends on the data distribution that is given by the underlying data generating process.
