
5. Strongly Closed Itemset Mining from Transactional Data Streams

5.5. Discussion

Figure 5.20.: Empirical results for the task of product configuration recommendation. We show the fraction of queries of Algorithm Baseline and our Algorithm 9 in comparison to Algorithm Naive. The box plots display the distribution of the averages over all input data sets for varying ˜∆.

the choice of ˜∆ and the time of updating the family of strongly closed sets. It is also very likely that the improvement will not continue indefinitely with decreasing ˜∆: as the number of closed sets increases with decreasing ˜∆, they lose their characteristic sharp drop in frequency and hence become similar to the frequent itemsets used by Algorithm Baseline.

To illustrate the effect of ˜∆ on the number of strongly closed sets, in Figure 5.21 we report the number of strongly closed sets for the values of ˜∆ used in this experiment.

The number of strongly closed sets increases quickly with decreasing ˜∆. In fact, it increases faster than the savings do. It thus appears that the contribution of each additional closed set diminishes as the number of closed sets grows.

Mining strongly closed patterns for lower values of ˜∆ takes longer and requires more memory. The parameter ˜∆ must, therefore, be chosen carefully for a specific application scenario (see Section 5.3.1 for the effect of ˜∆ on both runtime and memory). The time to compute the recommendation given a family of strongly closed sets is negligible in all our experiments.

The experimental results presented in this section clearly demonstrate the potential of strongly closed itemsets for the computer-aided product configuration task.

Figure 5.21.: Number of strongly closed sets for the product configuration data sets for varying ˜∆.

parameter ∆, which has the clear semantic interpretation that any proper superset must have a support count that is at least ∆ transactions lower. These two advantageous properties motivated our choice of this particular pattern class. Our algorithm relies on reservoir sampling to obtain a fixed-size data set from a growing data stream. The sample size is chosen by Hoeffding's inequality, providing the following probabilistic guarantee: with probability at least 1 − δ, the discrepancy between the relative frequency of an itemset X in the stream S_t and that in the sample D_t is at most ε for all X ⊆ I. The fixed sample size enables our algorithm to heavily utilize some of the nice algebraic and algorithmic properties of this class of itemsets. The careful investigation of the impact of the parameters of our algorithm presented in Section 5.3 has shown that it achieves high F-scores, requires only a small fraction of the memory required by the Batch algorithm, and is in most settings clearly faster than the Batch algorithm. The approximation and speed-up results clearly indicate the suitability of our algorithm for mining strongly closed itemsets even from massive transactional data streams.
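The interplay between the sample size and the probabilistic guarantee can be illustrated with a small sketch. The following Python snippet is only an illustration, not the implementation used in this chapter: hoeffding_sample_size applies the standard Hoeffding bound with a union bound over all itemsets (the exact formula in the thesis may differ), and reservoir_update is textbook reservoir sampling.

```python
import math
import random


def hoeffding_sample_size(num_items: int, epsilon: float, delta: float) -> int:
    """Illustrative sample size: with probability at least 1 - delta, the
    relative frequency of every itemset over num_items items deviates from
    its frequency in the stream by at most epsilon.  Standard Hoeffding bound
    plus a union bound over all 2**num_items itemsets; the thesis's exact
    formula may differ."""
    return math.ceil(((num_items + 1) * math.log(2) + math.log(1.0 / delta))
                     / (2.0 * epsilon ** 2))


def reservoir_update(reservoir, t, transaction, sample_size, rng=random):
    """Textbook reservoir sampling: after processing the t-th transaction
    (t is 1-based), reservoir is a uniform sample of size sample_size
    (or of all transactions seen so far, if t < sample_size)."""
    if len(reservoir) < sample_size:
        reservoir.append(transaction)      # fill phase
    else:
        j = rng.randrange(t)               # uniform in {0, ..., t-1}
        if j < sample_size:
            reservoir[j] = transaction     # keep with probability sample_size / t
```

Under this illustrative formula, for example, 100 items with ε = 0.01 and δ = 0.05 yield a sample of roughly 365,000 transactions.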

In Section 5.4 we have demonstrated the suitability of strongly closed sets for two practical applications: the classical problem of concept drift detection and the task of computer-aided product configuration recommendation for hyper-configurable systems.

The potential of strongly closed sets for concept drift detection has been demonstrated with a basic algorithm that computes families of strongly closed sets from a stream divided into consecutive mini-batches. The Jaccard distance was used to compare the families of strongly closed sets from two successive windows. The plots of the Jaccard distance as a function of the stream length clearly showed that a simple threshold can be used to signal a drift. The algorithm has been evaluated over diverse settings, varying both the characteristics of the data streams and the parameters of the algorithm.
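As an illustration of the detection scheme described above, the following sketch compares the strongly closed families of successive mini-batches via the Jaccard distance and raises an alarm when a threshold is exceeded; the threshold value is a hypothetical setting, and the mining of the families themselves is assumed to be done elsewhere.

```python
def jaccard_distance(family_a, family_b):
    """Jaccard distance between two families of itemsets, each given as a
    collection of frozensets; 0 means identical families, 1 means disjoint ones."""
    a, b = set(family_a), set(family_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def drift_alarms(families_per_batch, threshold=0.5):
    """Signal a drift whenever the Jaccard distance between the strongly
    closed families of two successive mini-batches exceeds the threshold
    (an illustrative value, not taken from the experiments)."""
    return [i for i in range(1, len(families_per_batch))
            if jaccard_distance(families_per_batch[i - 1],
                                families_per_batch[i]) > threshold]
```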

We considered data streams with different paces of drift and varying commonality between the distributions before and after the drift. For the algorithm, the relative strength of the closure ˜∆, the delay between consecutive miner runs, and the buffer size have been varied.

In all experiments, the drifts are clearly visible in the plots of the Jaccard distance, demonstrating that strongly closed sets are a good indicator of concept drift for a wide range of drift types and algorithmic settings. Both characteristics are important: the former for detecting drifts of different types, the latter for detecting drifts even when the algorithmic parameters are not perfectly tuned.

The algorithm to detect concept drifts with strongly closed sets serves mainly as a proof of concept. It is obviously less sophisticated than the StreamKrimp algorithm (van Leeuwen and Siebes, 2008), which is tailored to the specific problem of concept drift detection. Our algorithm could, however, be developed further into a more sophisticated one that does not need to compute a new family of strongly closed itemsets for each batch.

As a second practical application, the task of product configuration recommendation has been introduced. This problem arose from an industrial project. Since the data is not publicly available and differs significantly from classical benchmark data sets, its characteristics have been reported and compared to those of a classical benchmark data set. The essential difference is that the real-world data exhibits much more intrinsic structure, whereas the benchmark data is rather uniform. In particular, the real-world data contains items that imply other items as well as items that exclude other items.

Strongly closed itemsets mined on historical data contain only item combinations that have been sold together, which implies that they are valid solutions. To solve the problem of product configuration recommendation, an algorithm querying strongly closed sets has been proposed. It has been compared empirically to a naive and a baseline algorithm and turned out to be superior to both, requiring up to 23.4% fewer user queries than the baseline algorithm, which in turn is superior to the naive algorithm. We conclude that strongly closed sets provide a clear algorithmic advantage over the other two approaches and are indeed suitable for the problem of computer-aided product configuration recommendation. Additionally, they implicitly capture inclusions and exclusions of items that would have to be modeled explicitly with other machine learning approaches. This is a huge benefit, as it eliminates the manual work needed to define these concepts.
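To make the kind of query involved more concrete, here is a hypothetical recommendation step: it restricts the family to the strongly closed sets consistent with the user's current selection and ranks the remaining items by how often they occur in these sets. This sketch is purely illustrative; it is not Algorithm 9 from this chapter.

```python
from collections import Counter


def rank_extensions(closed_family, selected):
    """Hypothetical recommendation step (not Algorithm 9 from the thesis):
    keep the strongly closed sets that contain the current selection and
    rank the not-yet-selected items by how many of these sets contain them."""
    selected = frozenset(selected)
    consistent = [y for y in closed_family if selected <= y]
    counts = Counter(item for y in consistent for item in y - selected)
    return counts.most_common()   # list of (item, count), best candidate first
```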

We close this section with two problems for future research. The speed-up results reported in Section 5.3.6 can be improved further by exploiting that |C_{∆,D_t}| is typically (much) smaller than the sample size s calculated by Hoeffding's inequality (for a detailed discussion of the size of C_{∆,D_t}, see Boley et al. (2009b)). In such cases, the closure σ_{∆,D_{t′}}(C ∪ {e}) can be computed from C_{∆,D_t} without any database access to D_{t′}, even when the closure of C ∪ {e} has not been calculated for D_t. For example, instead of computing σ_{∆,D_t}(C ∪ {e}) in line 2 of Algorithm 6, we can return

⋂ {Y ∈ C_{∆,D_t} : C ∪ {e} ⊆ Y},

as C_{∆,D_t} is a closure system.
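A minimal sketch of this shortcut is given below. The function name is a placeholder, and it assumes that the stored family is a closure system containing the full item set, so the intersection is always well-defined.

```python
def closure_without_db(closed_family, base):
    """Compute the closure of `base` by intersecting all members of the
    stored closed family that contain it, avoiding any database access.
    Assumes the family is a closure system, i.e. closed under intersection
    and containing the full item set."""
    base = frozenset(base)
    supersets = [y for y in closed_family if base <= y]
    result = supersets[0]
    for y in supersets[1:]:
        result &= y            # running intersection of all closed supersets
    return result
```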

Besides the landmark model considered in this work, mining strongly closed itemsets under the sliding window model would be another interesting related problem. Since our algorithm handles both insertion and deletion of transactions, it might be applicable to this setting if the reservoir sampling is simply replaced by a sliding window. It might, however, be possible to design a more sophisticated algorithm tailored to this setting.
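The straightforward adaptation could look as follows. The routines add_transaction and remove_transaction are placeholders for update steps that maintain the family of strongly closed sets under insertions and deletions; the sketch only shows how the sliding window would drive them.

```python
from collections import deque


def sliding_window_stream(stream, window_size, add_transaction, remove_transaction):
    """Sketch of a sliding window variant: keep the last `window_size`
    transactions and notify placeholder update routines as transactions
    enter and leave the window."""
    window = deque()
    for transaction in stream:
        window.append(transaction)
        add_transaction(transaction)          # transaction enters the window
        if len(window) > window_size:
            remove_transaction(window.popleft())   # oldest transaction leaves
```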

For the practical applications, more sophisticated algorithms tailored to the specific problems would be preferable to our proof-of-concept solutions.