

5. Strongly Closed Itemset Mining from Transactional Data Streams

5.2. The Strongly Closed Stream Mining Algorithm

5.2.2. Incremental Update

Note that the sample size $s$ in (5.3) depends only on the error and confidence parameters $\epsilon$ and $\delta$. That is, $s$ does not change with increasing data stream length. Hence, both denominators on the left-hand side of (2.1) (page 27) will be fixed (i.e., $s$) for the entire mining process from the $s$-th transaction onward. More precisely, for any transaction database $\mathcal{D}$ of size $s$ and $\tilde{\Delta} \in [0,1]$, the family of relatively $\tilde{\Delta}$-closed itemsets of $\mathcal{D}$ is equal to the family $\mathcal{C}_{\Delta,\mathcal{D}}$ of absolutely $\Delta$-closed itemsets for $\Delta = \lceil s\tilde{\Delta} \rceil$. This allows us to consider the following problem equivalent to the $\tilde{\Delta}$-Closed Set Listing problem (Problem 4):

2 We note that Hoeffding's inequality applies to samples without replacement as well (Hoeffding, 1963). A tighter bound can be derived from Serfling's inequality (Serfling, 1974). The improvement becomes, however, marginal with increasing data stream length.


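Algorithm 5 Update $\mathcal{C}_{\Delta,\mathcal{D}_t}$

input: data sets $\mathcal{D}_{\mathrm{del}}$, $\mathcal{D}_{\mathrm{ins}}$ over $I$ and $\Delta \in \mathbb{N}$
require: totally ordered set $(I, \leq)$, data set $\mathcal{D}_t$ over $I$, and $\mathcal{C}_{\Delta,\mathcal{D}_t}$
output: $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ for $\mathcal{D}_{t'} = \mathcal{D}_t \ominus \mathcal{D}_{\mathrm{del}} \oplus \mathcal{D}_{\mathrm{ins}}$

Main:
 1: $\mathcal{C}_{\Delta,\mathcal{D}_{t'}} \leftarrow \{\emptyset\}$
 2: ListClosed($\emptyset$, $\emptyset$, $\min I$)

ListClosed($C$, $N$, $i$):
 1: $X \leftarrow \{k \in I \setminus C : k \geq i\}$
 2: if $X \neq \emptyset$ then
 3:     $e \leftarrow \min X$; $Ce \leftarrow C \cup \{e\}$
 4:     if $\mathcal{D}_{\mathrm{del}}[Ce] = \emptyset \wedge \mathcal{D}_{\mathrm{ins}}[Ce] = \emptyset$ then $C' \leftarrow$ Closure_α($C$, $e$, $\mathcal{C}_{\Delta,\mathcal{D}_t}$)   // Case (α)
 5:     else if $\mathcal{D}_{\mathrm{ins}}[Ce] = \emptyset$ then $C' \leftarrow$ Closure_β($C$, $e$, $\mathcal{D}_{\mathrm{del}}[Ce]$, $\mathcal{C}_{\Delta,\mathcal{D}_t}$)   // Case (β)
 6:     else if $\mathcal{D}_{\mathrm{del}}[Ce] = \emptyset$ then $C' \leftarrow$ Closure_γ($C$, $e$, $\mathcal{D}_{\mathrm{ins}}[Ce]$, $\mathcal{C}_{\Delta,\mathcal{D}_t}$)   // Case (γ)
 7:     else $C' \leftarrow \sigma_{\Delta,\mathcal{D}_{t'}}(Ce)$   // Case (δ)
 8:     if $C' \cap N = \emptyset$ then
 9:         add $(C, e, N, C', \uparrow)$ to $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$
10:         ListClosed($C'$, $N$, $e + 1$)
11:     else
12:         add $(C, e, N, C', \downarrow)$ to $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$
13:     $Y \leftarrow \{k \in I \setminus C : k > e\}$
14:     if $Y \neq \emptyset$ then
15:         $e' \leftarrow \min Y$
16:         ListClosed($C$, $N \cup \{e\}$, $e'$)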

Problem 5: Listing $\Delta$-Closed sets from data streams: Given $\mathcal{D}_t$, $\mathcal{D}_{\mathrm{del}}$, $\mathcal{D}_{\mathrm{ins}}$ for $S_t$ and $S_{t'}$ as defined in Section 5.2.1, an integer $\Delta > 0$, and the family $\mathcal{C}_{\Delta,\mathcal{D}_t}$ of $\Delta$-closed itemsets of $\mathcal{D}_t$, generate all elements of $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ for $\mathcal{D}_{t'} = \mathcal{D}_t \ominus \mathcal{D}_{\mathrm{del}} \oplus \mathcal{D}_{\mathrm{ins}}$.

Instead of generating $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ from scratch, our goal is to design a much faster practical algorithm by reducing the number of evaluations of the closure operator for $\mathcal{D}_{t'}$. This is motivated by the fact that the execution of the closure operator is the most expensive part of the algorithm. We make use of the fact that the expected number of changes in $\mathcal{D}_{t'}$ w.r.t. $\mathcal{D}_t$ becomes smaller and smaller as $t'$ increases (cf. (5.2)). Accordingly, our focus in the design of the updating algorithm is on quickly deciding whether an element $C' \in \mathcal{C}_{\Delta,\mathcal{D}_t}$ remains $\Delta$-closed in $\mathcal{D}_{t'}$, where $C'$ is obtained by $C' = \sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\})$ for some $C \in \mathcal{C}_{\Delta,\mathcal{D}_t}$ and $e \in I$. Below, we show that in all cases in which at least one of the support sets $\mathcal{D}_{\mathrm{del}}[C \cup \{e\}]$ or $\mathcal{D}_{\mathrm{ins}}[C \cup \{e\}]$ is empty, the problem above can be decided much faster than with the naive way of using Algorithm 1. As we empirically demonstrate in Section 5.3.6, a considerable speed-up over the naive algorithm can be achieved in this way.

We first briefly sketch the algorithm computing $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ from $\mathcal{C}_{\Delta,\mathcal{D}_t}$ (see Algorithm 5). It requires four auxiliary pieces of information for all strongly closed itemsets in $\mathcal{C}_{\Delta,\mathcal{D}_t}$, except for the empty set (cf. line 1 of Main). Hence, to simplify the notation, the set variables $\mathcal{C}_{\Delta,\mathcal{D}_t}$ and $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ in all algorithms of this section store quintuples, where the first component is the strongly closed itemset itself; the other four components are specified below.

Algorithm 5 is a divide and conquer algorithm that recursively calls ListClosed with some $\Delta$-closed set $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$, forbidden set $N \subseteq I$, and minimum candidate generator element $i$. It first determines the next smallest generator element $e$ (line 3) and calculates the closure $C' = \sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$ in lines 4–7; these steps are discussed in detail below. We store $C'$, together with some auxiliary information (lines 9 and 12). The algorithm then calls ListClosed recursively for generating further $\Delta$-closed supersets of $C'$. In particular, if $C'$ does not contain any forbidden item from $N$, then the last element of the quintuple stored for $C'$ is $\uparrow$ (line 9); otherwise it is $\downarrow$ (line 12). After all elements of $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ have been generated that are supersets of $C$, contain $e$, but do not contain any element in $N$, the algorithm generates all closed sets in $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ that are supersets of $C$ and do not contain any element from $N \cup \{e\}$ (lines 13–16).
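The recursion scheme described above can be illustrated with a minimal Python sketch. It is not the incremental update itself: every closure is recomputed from scratch, which corresponds to always taking case (δ). The closure routine is an assumption reconstructed from how Algorithm 1 is used in this section (an item is added whenever it lowers the support count by fewer than $\Delta$), and the names `support_count`, `closure`, and `list_closed_from_scratch` are hypothetical. Like line 1 of Main, the sketch assumes that the empty set is closed.

```python
def support_count(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def closure(db, items, delta, x):
    """Assumed Delta-closure: keep adding any item whose addition lowers
    the support count by fewer than delta, until no such item is left."""
    closed = set(x)
    changed = True
    while changed:
        changed = False
        base = support_count(db, closed)
        for i in items:
            if i not in closed and base - support_count(db, closed | {i}) < delta:
                closed.add(i)
                changed = True
                break  # the support of `closed` changed, so recompute `base`
    return frozenset(closed)


def list_closed_from_scratch(db, items, delta):
    """Enumerate Delta-closed sets with the ListClosed recursion scheme,
    recomputing every closure (no reuse of an older family)."""
    order = sorted(items)
    found = [frozenset()]  # mirrors line 1 of Main

    def rec(c, n, i):
        x = [k for k in order if k not in c and k >= i]
        if not x:
            return
        e = x[0]
        c_new = closure(db, order, delta, c | {e})
        if not (c_new & n):  # no forbidden item: report and descend
            found.append(c_new)
            later = [k for k in order if k > e]
            if later:
                rec(c_new, n, later[0])
        y = [k for k in order if k not in c and k > e]
        if y:  # continue with e added to the forbidden set
            rec(c, n | {e}, y[0])

    rec(frozenset(), frozenset(), order[0])
    return found
```

On a toy database with three copies of the transaction ab and three copies of c, calling the function with $\Delta = 2$ lists each closed set exactly once, which is the irredundancy property that the forbidden set $N$ is meant to provide.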

Example 1. Using the transactions in Figure 5.1, we show how Algorithm 5 updates the family of 2-closed itemsets for the first eight transactions (cf. Figure 5.1b) to that for the last eight (cf. Figure 5.1c). For $I = \{a, b, c, d\}$ with $a < b < c < d$ and $\mathcal{C}_{\Delta,\mathcal{D}_t} = \{d, ad, bd, cd\}$, the input to the algorithm for this update consists of $\mathcal{D}_{\mathrm{del}} = \{t_1\}$, $\mathcal{D}_{\mathrm{ins}} = \{t_9\}$, and $\Delta = 2$. The algorithm first initializes $\mathcal{C}_{\Delta,\mathcal{D}_{t'}} \leftarrow \{\emptyset\}$ (line 1 of Main) and then calls ListClosed($\emptyset$, $\emptyset$, $a$) (line 2 of Main). The recursive calls of the function ListClosed are visualized in Figure 5.2. The edges corresponding to lines 1–12 are labeled with the value of the variable $Ce$ (cf. line 3) and the case used for the update in lines 4–7; unlabeled edges correspond to lines 14–16.

Theorem 2. Algorithm 5 generates all elements of $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ correctly, irredundantly, in total time $O(|I| \cdot |\mathcal{C}_{\Delta,\mathcal{D}_{t'}}| \cdot \|\mathcal{D}_{t'}\|_0)$, with delay $O(|I|^2 \|\mathcal{D}_{t'}\|_0)$, and in space $O(|I| + \|\mathcal{D}_{t'}\|_0)$.

Proof. Regarding the correctness, we only need to show that $C'$ computed in lines 4–7 satisfies $C' = \sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$. The correctness of Closure_α (Algorithm 6), Closure_β (Algorithm 7), and Closure_γ (Algorithm 8) is shown below in Lemmas 1, 2, and 3, respectively. The proofs of the irredundancy and the time and space complexity are immediate from Boley et al. (2010) and Gély (2005) by noting that Algorithm 5 must call the closure operator for all elements in $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ in the worst case.

In the rest of this section, we give the algorithms for the cases distinguished in lines 4–7 (case (δ) is trivial) and prove their correctness.

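Case (α) We first consider the case that the set $C \cup \{e\}$ with $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ and $e \in I$ to be extended for further $\Delta$-closed sets satisfies

$$\mathcal{D}_{\mathrm{del}}[C \cup \{e\}] = \emptyset \text{ and } \mathcal{D}_{\mathrm{ins}}[C \cup \{e\}] = \emptyset \tag{5.4}$$

(line 4 of Algorithm 5). The closure $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$ for this case can be computed by Algorithm 6; the correctness of Algorithm 6 is stated in Lemma 1 below.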


[Figure 5.2: tree of the recursive ListClosed calls, abbreviated LC. Nodes: LC(∅,∅,a), LC(ad,∅,b), LC(ad,b,c), LC(∅,a,b), LC(bd,a,c), LC(∅,ab,c), LC(c,ab,d), LC(∅,abc,d). Edge labels: a:α, abd:α, acd:α, b:β, bcd:β, c:γ, cd:α, d:α.]

Figure 5.2.: Call stack for the update in Example 1. Labeled edges correspond to lines 1–12 of Algorithm 5; unlabeled to lines 14–16. The first component of an edge label denotes $Ce$ (cf. line 3 of Algorithm 5), the second the case applied (cf. lines 4–7).

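Algorithm 6 Closure_α

input: $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ with $\mathcal{D}_{t'} = \mathcal{D}_t \ominus \mathcal{D}_{\mathrm{del}} \oplus \mathcal{D}_{\mathrm{ins}}$, $e \in I$, and $\mathcal{C}_{\Delta,\mathcal{D}_t}$
require: $\mathcal{D}_t$
output: $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$

 1: if $\exists (C, e, N, C', q) \in \mathcal{C}_{\Delta,\mathcal{D}_t}$ for some $N$, $C'$, and $q$ then return $C'$
 2: else return $\sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\})$   // $\sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\}) = \sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$ for this case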

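Lemma 1. Algorithm 6 is correct, i.e., for all $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ and for all $e \in I$, the output of the algorithm is $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$.

Proof. Condition (5.4) implies that $\mathcal{D}_t[C \cup \{e\}] = \mathcal{D}_{t'}[C \cup \{e\}]$, where $\mathcal{D}_{t'} = \mathcal{D}_t \ominus \mathcal{D}_{\mathrm{del}} \oplus \mathcal{D}_{\mathrm{ins}}$. Hence, $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\}) = \sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\})$ and $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\}) \in \mathcal{C}_{\Delta,\mathcal{D}_t}$, from which the proof is immediate for both cases considered in lines 1–2.

Example 2. In our running Example 1, the first call LC($\emptyset$, $\emptyset$, $a$) in Figure 5.2 corresponds to case (α), as $\mathcal{D}_{\mathrm{ins}}[a] = \mathcal{D}_{\mathrm{del}}[a] = \emptyset$. Algorithm 6 returns $ad$ as the closure of $a$ in line 1, i.e., we do not need to (re)evaluate the closure operator on $a$.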
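Case (α) amounts to a table lookup with a fallback. The following Python sketch illustrates this under the assumption that the stored quintuples are indexed by the pair $(C, e)$; the dictionary layout and the names `closure_alpha`, `old_closures`, and `recompute` are hypothetical and stand in for the data structures of the actual implementation, which the text does not specify.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Itemset = FrozenSet[str]


def closure_alpha(
    c: Itemset,
    e: str,
    old_closures: Dict[Tuple[Itemset, str], Itemset],
    recompute: Callable[[Itemset], Itemset],
) -> Itemset:
    """Return the closure of c ∪ {e} when neither D_del nor D_ins touches it.

    old_closures maps (C, e) to the closure C' stored for the old family;
    recompute is a stand-in for the closure operator (line 2 of Algorithm 6).
    """
    cached = old_closures.get((c, e))
    if cached is not None:
        return cached  # line 1: reuse the stored closure
    return recompute(c | {e})  # line 2: fall back to the closure operator
```

With the entry for $(\emptyset, a)$ from Example 2 stored as $ad$, the lookup succeeds and the fallback is never invoked.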

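Case (β) We now turn to the case that $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ and $e \in I$ fulfill

$$\mathcal{D}_{\mathrm{del}}[C \cup \{e\}] \neq \emptyset \text{ and } \mathcal{D}_{\mathrm{ins}}[C \cup \{e\}] = \emptyset \tag{5.5}$$

(line 5 of Algorithm 5). In Proposition 1 below we first prove some monotonicity results that will also be used for case (γ).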

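Algorithm 7 Closure_β

input: $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ with $\mathcal{D}_{t'} = \mathcal{D}_t \ominus \mathcal{D}_{\mathrm{del}} \oplus \mathcal{D}_{\mathrm{ins}}$, $e \in I$, $\mathcal{D}_{\mathrm{del}}[C \cup \{e\}]$, and $\mathcal{C}_{\Delta,\mathcal{D}_t}$
require: $\mathcal{D}_t$
output: $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$

 1: if there exists $(C, e, N, C', q)$ in $\mathcal{C}_{\Delta,\mathcal{D}_t}$ for some $N$, $C'$, and $q$ then
 2:     $C'.\mathrm{count} \leftarrow |\mathcal{D}_t[C']| - |\mathcal{D}_{\mathrm{del}}[C']|$
 3:     for all $i \in I \setminus C'$ do
 4:         $C'.\Delta_i \leftarrow |\mathcal{D}_t[C' \cup \{i\}]|$
 5:         if $C'.\mathrm{count} - C'.\Delta_i + |\mathcal{D}_{\mathrm{del}}[C' \cup \{i\}]| < \Delta$ then
 6:             return $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$
 7:     return $C'$
 8: else
 9:     return $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$

Proposition 1. Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be transaction databases over $I$. If $\mathcal{D}_1 \subseteq \mathcal{D}_2$, then for all $\Delta \in \mathbb{N}$,

$$\mathcal{C}_{\Delta,\mathcal{D}_1} \subseteq \mathcal{C}_{\Delta,\mathcal{D}_2}. \tag{5.6}$$

Furthermore, for all $\Delta \in \mathbb{N}$ and for all $X \subseteq I$,

$$\sigma_{\Delta,\mathcal{D}_1}(X) \supseteq \sigma_{\Delta,\mathcal{D}_2}(X). \tag{5.7}$$

Proof. Let $C \in \mathcal{C}_{\Delta,\mathcal{D}_1}$ for some $\Delta \in \mathbb{N}$ and let $\mathcal{D}' = \mathcal{D}_2 \setminus \mathcal{D}_1$. Then, for any $e \in I \setminus C$, we have

$$\begin{aligned}
|\mathcal{D}_2[C \cup \{e\}]| &= |\mathcal{D}_1[C \cup \{e\}]| + |\mathcal{D}'[C \cup \{e\}]| \\
&\leq |\mathcal{D}_1[C]| - \Delta + |\mathcal{D}'[C]| \\
&= |\mathcal{D}_2[C]| - \Delta,
\end{aligned}$$

where the inequality follows from $C \in \mathcal{C}_{\Delta,\mathcal{D}_1}$ and from the anti-monotonicity of support sets. Hence, $C \in \mathcal{C}_{\Delta,\mathcal{D}_2}$, completing the proof of (5.6).

To show (5.7), suppose that during the calculation of $\sigma_{\Delta,\mathcal{D}_2}(X)$, the items in $\sigma_{\Delta,\mathcal{D}_2}(X) \setminus X$ have been added to $X$ in the order $e_1, \ldots, e_k$. Let $X_0 = X$ and $X_i = X \cup \{e_1, \ldots, e_{i-1}, e_i\}$ for all $i \in [k]$. Then $|\mathcal{D}_2[X_{i-1}]| - |\mathcal{D}_2[X_i]| < \Delta$ for all $i \in [k]$ (see Algorithm 1). Since $\mathcal{D}_2[X_{i-1}] \supseteq \mathcal{D}_2[X_i]$ and $\mathcal{D}_1 \subseteq \mathcal{D}_2$, we have $|\mathcal{D}_1[X_{i-1}]| - |\mathcal{D}_1[X_i]| < \Delta$ for all $i$. Thus, as Algorithm 1 is Church-Rosser, all $e_i$ will be added to $\sigma_{\Delta,\mathcal{D}_1}(X)$ as well, implying (5.7).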

Using Proposition 1, we have the following result for Algorithm 7 concerning case (β):

Lemma 2. Algorithm 7 is correct, i.e., for all $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ and for all $e \in I$, the output of the algorithm is $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$.

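Proof. By Condition (5.5), $\mathcal{D}_{t'}[C \cup \{e\}] \subseteq \mathcal{D}_t[C \cup \{e\}]$ and hence Proposition 1 implies that there is no $Y \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ with $C \cup \{e\} \subsetneq Y \subsetneq \sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\})$. Furthermore, if $\sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\}) \notin \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$, then $\sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\}) \subsetneq \sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$. Thus, to check whether $C' = \sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\})$ remains closed in $\mathcal{D}_{t'}$, it suffices to test whether

$$|\mathcal{D}_{t'}[C']| - |\mathcal{D}_{t'}[C' \cup \{i\}]| \geq \Delta \tag{5.8}$$

further holds for all items $i \in I \setminus C'$ (lines 2–6 of Algorithm 7). If so, the algorithm returns $C'$ in line 7, implying the correctness of Algorithm 7 for the case that $C' \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$; the claim is trivial for the other two cases (lines 6 and 9).

Algorithm 8 Closure_γ

input: $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ with $\mathcal{D}_{t'} = \mathcal{D}_t \ominus \mathcal{D}_{\mathrm{del}} \oplus \mathcal{D}_{\mathrm{ins}}$, $e \in I$, $\mathcal{D}_{\mathrm{ins}}[C \cup \{e\}]$, and $\mathcal{C}_{\Delta,\mathcal{D}_t}$
require: $\mathcal{D}_t$
output: $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$

 1: if there exists $(C, e, N, C', q)$ in $\mathcal{C}_{\Delta,\mathcal{D}_t}$ for some $N$, $C'$, and $q$ then
 2:     $C'' \leftarrow C \cup \{e\}$; $\mathcal{D}' \leftarrow (\mathcal{D}_t \oplus \mathcal{D}_{\mathrm{ins}})[C'']$
 3:     repeat
 4:         for all $i \in C' \setminus C''$ do
 5:             if $|\mathcal{D}'| - |\mathcal{D}'[i]| < \Delta$ then
 6:                 $C'' \leftarrow C'' \cup \{i\}$; $\mathcal{D}' \leftarrow \mathcal{D}'[i]$
 7:     until $\mathcal{D}'$ has not been changed in Loop 4–6
 8:     return $C''$
 9: else
10:     return $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$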

We note that in our implementation of Algorithm 7 we do not calculate $C'.\mathrm{count}$ and $C'.\Delta_i$ in lines 2 and 4, but store and maintain them consistently. In this way, the condition in line 5 can be decided from $\mathcal{D}_{\mathrm{del}}$, without any access to $\mathcal{D}_t$. It is important to mention that with increasing stream length, the number of elements to be deleted from $\mathcal{C}_{\Delta,\mathcal{D}_t}$ becomes smaller (cf. (5.2)) and typically, most of the elements of $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ are calculated by terminating in line 7.

Example 3. In our running Example 1, the call of LC($\emptyset$, $a$, $b$) in Figure 5.2 corresponds to case (β) because $\mathcal{D}_{\mathrm{ins}}[b] = \emptyset$ and $\mathcal{D}_{\mathrm{del}}[b] \neq \emptyset$. Since $(\emptyset, a, b, bd, \uparrow) \in \mathcal{C}_{\Delta,\mathcal{D}_t}$, Algorithm 7 only needs to compute support queries on $\mathcal{D}_{\mathrm{del}}$ in lines 2, 4, and 5. For all $i$ considered in line 3, the condition in line 5 is not fulfilled. Hence, the algorithm returns $bd$ in line 7, without calling the closure operator.
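The test performed in lines 2–6 of Algorithm 7 can be sketched in Python as follows. The sketch assumes, as in the implementation note above, that the old support count of $C'$ and the old support counts of its one-item extensions are already stored, so that only $\mathcal{D}_{\mathrm{del}}$ is scanned. The function and parameter names are hypothetical, and the toy data in the usage note is made up for illustration; it is not the data of Figure 5.1.

```python
def support_count(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def remains_closed_after_deletion(c_closed, items, delta,
                                  count_old, ext_counts_old, d_del):
    """Check condition (5.8) for every item outside c_closed.

    count_old         -- |D_t[C']|, maintained from the previous step
    ext_counts_old[i] -- |D_t[C' ∪ {i}]|, maintained from the previous step
    d_del             -- the deleted transactions; D_ins is assumed not to
                         contain any transaction supporting C' (case (β))
    """
    count_new = count_old - support_count(d_del, c_closed)
    for i in items:
        if i in c_closed:
            continue
        ext_new = ext_counts_old[i] - support_count(d_del, c_closed | {i})
        if count_new - ext_new < delta:
            return False  # C' is no longer closed; the closure must be recomputed
    return True  # corresponds to returning C' in line 7
```

For instance, with $C' = bd$, $\Delta = 2$, a stored count of 4 for $bd$ and stored extension counts of 1 for $a$ and 0 for $c$, deleting one transaction $bd$ leaves $C'$ closed, whereas deleting two such transactions reduces the gap to item $a$ to 1 and the check fails.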

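Case (γ) Finally we discuss the case that $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ and $e \in I$ satisfy the condition

$$\mathcal{D}_{\mathrm{del}}[C \cup \{e\}] = \emptyset \text{ and } \mathcal{D}_{\mathrm{ins}}[C \cup \{e\}] \neq \emptyset \tag{5.9}$$

(see line 6 of Algorithm 5). The correctness for this case is also shown by using Proposition 1.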

Lemma 3. Algorithm 8 is correct, i.e., for all $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ and for all $e \in I$, the output of the algorithm is $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$.

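Proof. The proof is automatic for the case that the condition in line 1 of Algorithm 8 is false. Consider the case that it is true. Proposition 1 with Condition (5.9) implies that $\mathcal{C}_{\Delta,\mathcal{D}_t} \subseteq \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ (i.e., all $\Delta$-closed itemsets in $\mathcal{C}_{\Delta,\mathcal{D}_t}$ are preserved) and that $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\}) \subseteq \sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\})$. Thus, when calculating $\sigma_{\Delta,\mathcal{D}_{t'}}(C \cup \{e\})$ in Loop 3–7, it suffices to consider only the elements in $\sigma_{\Delta,\mathcal{D}_t}(C \cup \{e\}) \setminus (C \cup \{e\})$, from which the claim is immediate for this case.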

Compared to case (β), we need to calculate support counts in the entire sample $\mathcal{D}_{t'}$ for this case. However, the inner loop (lines 4–6) iterates over a typically much smaller set than the general closure algorithm (cf. lines 2–5 of Algorithm 1). Analogously to case (β), the number of new $\Delta$-closed itemsets to be added to $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ becomes smaller with increasing stream length, and hence, most of the elements of $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ are calculated in the "then" part (lines 2–8) of the "if" statement.

Example 4. In our running Example 1, the call of LC($\emptyset$, $ab$, $c$) in Figure 5.2 corresponds to case (γ) since item $c$ occurs only in $\mathcal{D}_{\mathrm{ins}}$ (i.e., $\mathcal{D}_{\mathrm{ins}}[c] \neq \emptyset$ and $\mathcal{D}_{\mathrm{del}}[c] = \emptyset$). Since $(\emptyset, ab, c, cd, \uparrow) \in \mathcal{C}_{\Delta,\mathcal{D}_t}$, the algorithm goes into the loop 4–6, iterating over all elements of $cd \setminus c$. The condition in line 5 is not satisfied for $d$ and thus $c$ is returned as a new closed itemset in line 8, without calling the closure operator.
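The restricted closure computation of lines 2–8 of Algorithm 8 can be sketched in Python as follows. The sketch takes the updated sample as a plain list of transactions in place of $\mathcal{D}_t \oplus \mathcal{D}_{\mathrm{ins}}$, which is an assumption about the representation; the names `closure_gamma`, `db_new`, and `old_closure` are hypothetical, and the toy transactions in the usage note are invented for illustration.

```python
def closure_gamma(db_new, c, e, old_closure, delta):
    """Compute the closure of c ∪ {e} in the updated sample, trying only
    the items of the old closure as candidates (justified by (5.7))."""
    current = set(c) | {e}
    rows = [t for t in db_new if current <= t]  # support set of c ∪ {e}
    changed = True
    while changed:  # the repeat ... until loop of lines 3-7
        changed = False
        for i in sorted(old_closure - current):
            kept = [t for t in rows if i in t]
            if len(rows) - len(kept) < delta:
                current.add(i)   # item i belongs to the new closure
                rows = kept      # shrink the support set accordingly
                changed = True
    return frozenset(current)
```

With an old closure $cd$ of $c$ and $\Delta = 2$, an updated sample in which two of the three transactions containing $c$ lack $d$ yields $c$ itself, mirroring the outcome of Example 4; if only one of them lacks $d$, the result is $cd$ again.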

Controlling the Time and Space Complexity of the Update

Although by Theorem 2 Algorithm 5 does not improve the worst-case time and space complexity of the batch algorithm (Boley et al., 2009b) calculating the family of strongly closed sets from scratch, our experimental results presented in Section 5.3.6 clearly demonstrate a considerable speed-up on artificial and real-world data sets. The total time depends on the cardinality of $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$, which can be exponential in $|I|$. The time and space of the update can be controlled by selecting the parameter $\Delta$ in a way that $|\mathcal{C}_{\Delta,\mathcal{D}_{t'}}| < K$ for some reasonably small $K$. The value of $K$ may depend, e.g., on time or space capacity constraints. Once $K$ has been fixed, the value of $\Delta$ can automatically be adjusted when the number of elements in $\mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ that have already been enumerated exceeds $K$. More precisely, suppose Algorithm 5 has generated a subset $\mathcal{C}' \subseteq \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ with $|\mathcal{C}'| = K + 1$. For all $C \in \mathcal{C}'$, let $\Delta_C$ be the strength of $C$ and denote $\Delta' = \min_{C \in \mathcal{C}'} \Delta_C$. Clearly, $\Delta' \geq \Delta$. Let $\mathcal{C}'' = \{C \in \mathcal{C}' : \Delta_C > \Delta'\}$. For the set obtained we have $\mathcal{C}'' \subseteq \mathcal{C}_{\Delta'+1,\mathcal{D}_{t'}}$ and $|\mathcal{C}''| \leq K$.
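The adjustment step can be sketched in a few lines of Python. The sketch assumes that the strength $\Delta_C$ of every enumerated set is available in a dictionary; the name `raise_delta` and this representation are hypothetical.

```python
def raise_delta(strengths, k):
    """If more than k closed sets have been enumerated, drop the weakest ones.

    strengths maps each enumerated closed set to its strength Delta_C.
    Returns None if no adjustment is needed; otherwise returns the new
    parameter Delta' + 1 and the surviving sets, where Delta' is the
    smallest strength among the enumerated sets.
    """
    if len(strengths) <= k:
        return None
    weakest = min(strengths.values())
    kept = {c: s for c, s in strengths.items() if s > weakest}
    return weakest + 1, kept
```

Since at least one set attains the minimum strength, the surviving family has at most $K$ elements whenever exactly $K + 1$ sets were enumerated.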

This change of $\Delta$ to $\Delta' + 1$ requires, however, the maintenance of auxiliary pieces of information for all already generated strongly closed sets, as well as the reconstruction of the quintuples for the closed sets remaining. More precisely, suppose $\Delta_{C,e} = |\mathcal{D}_t[C]| - |\mathcal{D}_t[C \cup \{e\}]|$ has been calculated correctly for all $C \in \mathcal{C}_{\Delta,\mathcal{D}_t}$ and for all $e \in I \setminus C$. Notice that the strength of $C$ in $\mathcal{D}_t$ is given by $\min_{e \in I \setminus C} \Delta_{C,e}$, where the $\Delta_{C,e}$ are obtained as a byproduct of the algorithm computing the closure operator (cf. Algorithm 1). One can see that if $C \in \mathcal{C}_{\Delta,\mathcal{D}_{t'}}$ and $C$ has not been recalculated by calling the closure operator, then $\Delta_{C,e}$ can be updated to its value $\Delta'_{C,e}$ in $\mathcal{D}_{t'}$ by

$$\Delta'_{C,e} = \Delta_{C,e} + |\mathcal{D}_{\mathrm{ins}}[C]| - |\mathcal{D}_{\mathrm{ins}}[C \cup \{e\}]| - |\mathcal{D}_{\mathrm{del}}[C]| + |\mathcal{D}_{\mathrm{del}}[C \cup \{e\}]|$$

for all $e \in I \setminus C$. Thus, the complexity of the update for this case depends on the cardinality of $\mathcal{D}_{\mathrm{ins}}$ and $\mathcal{D}_{\mathrm{del}}$ only, which become smaller and smaller with increasing $t'$ by (5.2). Finally, utilizing the algebraic properties of closure systems, the quintuples can be reconstructed by a top-down traversal of the enumeration tree corresponding to Algorithm 5.
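The update rule for the support gaps touches only the inserted and deleted transactions, as the following Python sketch illustrates. The names `support_count` and `updated_gap` are hypothetical, and the toy transactions in the usage note are invented for illustration.

```python
def support_count(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def updated_gap(gap_old, c, e, d_ins, d_del):
    """Update |D[C]| - |D[C ∪ {e}]| after inserting d_ins and deleting d_del,
    without scanning the old sample."""
    ce = c | {e}
    return (gap_old
            + support_count(d_ins, c) - support_count(d_ins, ce)
            - support_count(d_del, c) + support_count(d_del, ce))
```

As a check, take an old sample with transactions ab, ab, a, a, b, so that the gap between $a$ and $ab$ is 2. Deleting one transaction a and inserting ab and c gives an updated gap of 1, which agrees with counting directly in the new sample.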