Incremental Update - The Strongly Closed Stream Mining Algorithm


5. Strongly Closed Itemset Mining from Transactional Data Streams

5.2. The Strongly Closed Stream Mining Algorithm

5.2.2. Incremental Update

Note that the sample size s in (5.3) depends only on the error and confidence parameters ε and δ. That is, s does not change with increasing data stream length. Hence, both denominators in the LHS of (2.1) (page 27) will be fixed (i.e., s) for the entire mining process from the s-th transaction onward. More precisely, for any transaction database D of size s and ˜∆ ∈ [0,1], the family of relatively ˜∆-closed itemsets of D is equal to the family C_∆,D of absolutely ∆-closed itemsets for ∆ = ⌈s˜∆⌉. This allows us to consider the following problem equivalent to the ˜∆-Closed Set Listing problem (Problem 4):

2 We note that Hoeffding’s inequality applies to samples without replacement as well (Hoeffding, 1963). A tighter bound can be derived from Serfling’s inequality (Serfling, 1974). The improvement becomes, however, marginal with increasing data stream length.
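The rounding ∆ = ⌈s˜∆⌉ stated above can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are illustrative, and it assumes that relative closedness compares the support gap divided by s against ˜∆, which is how we read (2.1).

```python
from fractions import Fraction
from math import ceil


def absolute_delta(s: int, rel_delta: Fraction) -> int:
    """Absolute threshold: ceil(s * rel_delta) for a sample of fixed size s."""
    return ceil(s * rel_delta)


def thresholds_agree(s: int, rel_delta: Fraction) -> bool:
    """Check that gap/s >= rel_delta holds exactly when gap >= absolute threshold,
    for every integer support gap in 0..s."""
    delta = absolute_delta(s, rel_delta)
    return all(
        (Fraction(gap, s) >= rel_delta) == (gap >= delta)
        for gap in range(s + 1)
    )


# Exact rationals avoid floating-point rounding inside the ceiling.
print(absolute_delta(8, Fraction(1, 4)))      # 2
print(absolute_delta(10, Fraction(3, 20)))    # 2, because 10 * 3/20 = 1.5
print(thresholds_agree(10, Fraction(3, 20)))  # True
```

Because support gaps are integers, rounding the threshold up does not change which itemsets pass the test.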

Algorithm 5 Update C_∆,D_t

input: data sets D_del, D_ins over I and ∆ ∈ ℕ
require: totally ordered set (I, ≤), data set D_t over I, and C_∆,D_t
output: C_∆,D_t′ for D_t′ = D_t ⊖ D_del ⊕ D_ins

Main:
1: C_∆,D_t′ ← {∅}
2: ListClosed(∅, ∅, min I)

ListClosed(C, N, i):
1: X ← {k ∈ I \ C : k ≥ i}
2: if X ≠ ∅ then
3:     e ← min X; C_e ← C ∪ {e}
4:     if D_del[C_e] = ∅ ∧ D_ins[C_e] = ∅ then C′ ← Closure_α(C, e, C_∆,D_t)   // Case (α)
5:     else if D_ins[C_e] = ∅ then C′ ← Closure_β(C, e, D_del[C_e], C_∆,D_t)   // Case (β)
6:     else if D_del[C_e] = ∅ then C′ ← Closure_γ(C, e, D_ins[C_e], C_∆,D_t)   // Case (γ)
7:     else C′ ← σ_∆,D_t′(C_e)   // Case (δ)
8:     if C′ ∩ N = ∅ then
9:         add (C, e, N, C′, ↑) to C_∆,D_t′
10:        ListClosed(C′, N, e + 1)
11:    else
12:        add (C, e, N, C′, ↓) to C_∆,D_t′
13:    Y ← {k ∈ I \ C : k > e}
14:    if Y ≠ ∅ then
15:        e′ ← min Y
16:        ListClosed(C, N ∪ {e}, e′)

Problem 5: Listing ∆-Closed sets from data streams: Given D_t, D_del, D_ins for S_t and S_t′ as defined in Section 5.2.1, an integer ∆ > 0, and the family C_∆,D_t of ∆-closed itemsets of D_t, generate all elements of C_∆,D_t′ for D_t′ = D_t ⊖ D_del ⊕ D_ins.

Instead of generating C_∆,D_t′ from scratch, our goal is to design a much faster practical algorithm by reducing the number of evaluations of the closure operator for D_t′. This is motivated by the fact that the execution of the closure operator is the most expensive part of the algorithm. We make use of the fact that the expected number of changes in D_t′ w.r.t. D_t becomes smaller and smaller as t′ increases (cf. (5.2)). Accordingly, our focus in the design of the updating algorithm is on quickly deciding whether an element C′ ∈ C_∆,D_t remains ∆-closed in D_t′, where C′ is obtained by C′ = σ_∆,D_t(C ∪ {e}) for some C ∈ C_∆,D_t and e ∈ I. Below, we show that in all of the cases when at least one of the support sets D_del[C ∪ {e}] or D_ins[C ∪ {e}] is empty, the problem above can be decided much faster than with the naive way of using Algorithm 1. As we empirically demonstrate in Section 5.3.6, a considerable speed-up over the naive algorithm can be achieved in this way.
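To make the setting of Problem 5 concrete, the following minimal sketch builds the updated sample from D_t, D_del, and D_ins and decides which of the four situations applies to a candidate set C ∪ {e}. It is an illustration under assumptions, not the thesis implementation: transactions are modelled as frozensets of single-character items, the samples as Python lists (multisets), and the names `apply_update`, `support_set`, and `update_case` are hypothetical.

```python
from collections import Counter


def apply_update(d_t, d_del, d_ins):
    """Updated sample: remove D_del from D_t (as multisets), then add D_ins.
    Assumes every deleted transaction actually occurs in d_t."""
    bag = (Counter(d_t) - Counter(d_del)) + Counter(d_ins)
    return list(bag.elements())


def support_set(db, itemset):
    """All transactions of db that contain every item of itemset."""
    return [t for t in db if itemset <= t]


def update_case(d_del, d_ins, c_e):
    """Name of the case for the candidate C ∪ {e}, based only on whether
    the deleted and inserted transactions contain it."""
    touched_by_del = bool(support_set(d_del, c_e))
    touched_by_ins = bool(support_set(d_ins, c_e))
    if not touched_by_del and not touched_by_ins:
        return "alpha"   # support set unchanged
    if not touched_by_ins:
        return "beta"    # support set only shrinks
    if not touched_by_del:
        return "gamma"   # support set only grows
    return "delta"       # both change: recompute the closure


# Toy data, invented for this illustration.
d_del = [frozenset("bd")]
d_ins = [frozenset("c")]
for item in "abc":
    print(item, update_case(d_del, d_ins, frozenset(item)))
```

Only the small sets D_del and D_ins are scanned for the decision, which is why it stays cheap as the stream grows.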

We first briefly sketch the algorithm computing C_∆,D_t′ from C_∆,D_t (see Algorithm 5). It requires four auxiliary pieces of information for all strongly closed itemsets in C_∆,D_t, except for the empty set (cf. line 1 of Main). Hence, to simplify the notation, the set variables C_∆,D_t and C_∆,D_t′ in all algorithms of this section store quintuples, where the first component is the strongly closed itemset itself; the other four components are specified below.

Algorithm 5 is a divide and conquer algorithm that recursively calls ListClosed with some ∆-closed set C ∈ C_∆,D_t′, forbidden set N ⊆ I, and minimum candidate generator element i. It first determines the next smallest generator element e (line 3) and calculates the closure C′ = σ_∆,D_t′(C ∪ {e}) in lines 4–7; these steps are discussed in detail below. We store C′, together with some auxiliary information (lines 9 and 12). The algorithm then calls ListClosed recursively for generating further ∆-closed supersets of C′. In particular, if C′ does not contain any forbidden item from N, then the last element of the quintuple stored for C′ is ↑ (line 9); otherwise it is ↓ (line 12). After all elements of C_∆,D_t′ have been generated that are supersets of C, contain e, but do not contain any element in N, the algorithm generates all closed sets in C_∆,D_t′ that are supersets of C and do not contain any element from N ∪ {e} (lines 13–16).
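The recursion described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the author's implementation: it always recomputes the closure (i.e., it uses only case (δ)), it collects only the closures C′ and omits the quintuple bookkeeping and the initial ∅ entry, and the closure routine is our reading of the fixed-point procedure the text attributes to Algorithm 1 (add every item whose addition reduces the support by less than ∆). Under this definition the full item set is vacuously ∆-closed, and the sketch makes no attempt to filter it. All names and the toy data are hypothetical.

```python
from itertools import combinations


def support_count(db, itemset):
    return sum(1 for t in db if itemset <= t)


def closure(db, items, delta, x):
    """Smallest delta-closed superset of x, computed as a fixed point."""
    c = set(x)
    changed = True
    while changed:
        changed = False
        for i in sorted(items - c):
            if support_count(db, c) - support_count(db, c | {i}) < delta:
                c.add(i)
                changed = True
    return frozenset(c)


def list_closed(db, items, delta):
    order = sorted(items)
    found = []

    def rec(c, forbidden, start):
        candidates = [k for k in order[start:] if k not in c]   # line 1
        if not candidates:                                      # line 2
            return
        e = candidates[0]                                       # line 3
        c_new = closure(db, items, delta, c | {e})              # case (δ) only
        nxt = order.index(e) + 1
        if not (c_new & forbidden):                             # line 8
            found.append(c_new)                                 # line 9
            rec(c_new, forbidden, nxt)                          # line 10
        rec(c, forbidden | {e}, nxt)                            # lines 13-16

    rec(frozenset(), frozenset(), 0)
    return found


def brute_force(db, items, delta):
    """Reference: test every subset directly against the definition."""
    out = set()
    for r in range(len(items) + 1):
        for combo in combinations(sorted(items), r):
            x = frozenset(combo)
            if all(support_count(db, x) - support_count(db, x | {i}) >= delta
                   for i in items - x):
                out.add(x)
    return out


# Toy sample, invented for this illustration.
db = [frozenset(t) for t in ("ad", "ad", "bd", "bd", "cd", "cd", "d", "d")]
items = frozenset("abcd")
family = list_closed(db, items, 2)
print(sorted("".join(sorted(c)) for c in family))
# ['abcd', 'ad', 'bd', 'cd', 'd']
```

On this sample every closed set is produced exactly once, and the result coincides with the brute-force enumeration.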

Example 1. Using the transactions in Figure 5.1, we show how Algorithm 5 updates the family of 2-closed itemsets for the first eight transactions (cf. Figure 5.1b) to that for the last eight (cf. Figure 5.1c). For I = {a, b, c, d} with a < b < c < d and C_∆,D_t = {d, ad, bd, cd}, the input to the algorithm for this update consists of D_del = {t₁}, D_ins = {t₉}, and ∆ = 2. The algorithm first initializes C_∆,D_t′ ← {∅} (line 1 of Main) and then calls ListClosed(∅, ∅, a) (line 2 of Main). The recursive calls of the function ListClosed are visualized in Figure 5.2. The edges corresponding to lines 1–12 are labeled with the value of the variable C_e (cf. line 3) and the case used for the update in lines 4–7; unlabeled edges correspond to lines 14–16.

Theorem 2. Algorithm 5 generates all elements of C_∆,D_t′ correctly, irredundantly, in total time O(|I| · |C_∆,D_t′| · ‖D_t′‖₀), with delay O(|I|² ‖D_t′‖₀), and in space O(|I| + ‖D_t′‖₀).

Proof. Regarding the correctness, we only need to show that C′ computed in lines 4–7 satisfies C′ = σ_∆,D_t′(C ∪ {e}). The correctness of Closure_α (Algorithm 6), Closure_β (Algorithm 7), and Closure_γ (Algorithm 8) is shown below in Lemmas 1, 2, and 3, respectively. The proofs of the irredundancy and the time and space complexity are immediate from Boley et al. (2010) and Gély (2005) by noting that Algorithm 5 must call the closure operator for all elements in C_∆,D_t′ in the worst case.

In the rest of this section, we give the algorithms for the cases distinguished in lines 4–7 (case (δ) is trivial) and prove their correctness.

Case (α) We first consider the case that the set C ∪ {e} with C ∈ C_∆,D_t′ and e ∈ I to be extended for further ∆-closed sets satisfies

D_del[C ∪ {e}] = ∅ and D_ins[C ∪ {e}] = ∅   (5.4)

(line 4 of Algorithm 5). The closure σ_∆,D_t′(C ∪ {e}) for this case can be computed by Algorithm 6; the correctness of Algorithm 6 is stated in Lemma 1 below.

[Figure 5.2 (call tree, image omitted): calls LC(∅,∅,a), LC(ad,∅,b), LC(ad,b,c), LC(∅,a,b), LC(bd,a,c), LC(∅,ab,c), LC(c,ab,d), LC(∅,abc,d); edge labels a:α, abd:α, acd:α, b:β, bcd:β, c:γ, cd:α, d:α.]

Figure 5.2.: Call stack for the update in Example 1. Labeled edges correspond to lines 1–12 of Algorithm 5; unlabeled to lines 14–16. The first component of an edge label denotes C_e (cf. line 3 of Algorithm 5), the second the case applied (cf. lines 4–7).

Algorithm 6 Closure_α

input: C ∈ C_∆,D_t′ with D_t′ = D_t ⊖ D_del ⊕ D_ins, e ∈ I, and C_∆,D_t
require: D_t
output: σ_∆,D_t′(C ∪ {e})

1: if ∃(C, e, N, C′, q) ∈ C_∆,D_t for some N, C′, and q then return C′
2: else return σ_∆,D_t(C ∪ {e})   // σ_∆,D_t(C ∪ {e}) = σ_∆,D_t′(C ∪ {e}) for this case

Lemma 1. Algorithm 6 is correct, i.e., for all C ∈ C_∆,D_t′ and for all e ∈ I, the output of the algorithm is σ_∆,D_t′(C ∪ {e}).

Proof. Condition (5.4) implies that D_t[C ∪ {e}] = D_t′[C ∪ {e}], where D_t′ = D_t ⊖ D_del ⊕ D_ins. Hence, σ_∆,D_t′(C ∪ {e}) = σ_∆,D_t(C ∪ {e}) and σ_∆,D_t′(C ∪ {e}) ∈ C_∆,D_t, from which the proof is immediate for both cases considered in lines 1–2.

Example 2. In our running Example 1, the first call LC(∅,∅,a) in Figure 5.2 corresponds to case (α), as D_ins[a] = D_del[a] = ∅. Algorithm 6 returns ad as the closure of a in line 1, i.e., we do not need to (re)evaluate the closure operator on a.
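Case (α) amounts to a lookup in the stored family. The sketch below illustrates this under assumptions of our own: the old family is modelled as a dictionary keyed by the pair (C, e), which drops the components N and q of the stored quintuples, and `closure_on_old_sample` is a hypothetical callback standing in for an evaluation of the closure operator on D_t.

```python
def closure_alpha(old_family, c, e, closure_on_old_sample):
    """Return the closure of C ∪ {e}; valid only when neither D_del nor D_ins
    contains a transaction supporting C ∪ {e}."""
    stored = old_family.get((c, e))
    if stored is not None:                   # line 1: reuse the stored closure
        return stored
    return closure_on_old_sample(c | {e})    # line 2: evaluate on the old sample


# Illustration with a stand-in for the closure operator (hypothetical).
evaluations = []

def stand_in_closure(itemset):
    evaluations.append(itemset)
    return frozenset(itemset)

old_family = {(frozenset(), "a"): frozenset("ad")}
print(sorted(closure_alpha(old_family, frozenset(), "a", stand_in_closure)))
print(len(evaluations))   # 0: the stored entry was reused
```

The dictionary lookup replaces a pass over the sample, which is the whole point of distinguishing this case.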

Case (β) We now turn to the case that C ∈ C_∆,D_t′ and e ∈ I fulfill

D_del[C ∪ {e}] ≠ ∅ and D_ins[C ∪ {e}] = ∅   (5.5)

(line 5 of Algorithm 5). In Proposition 1 below we first prove some monotonicity results that will be used also for case (γ).

Algorithm 7 Closure_β

input: C ∈ C_∆,D_t′ with D_t′ = D_t ⊖ D_del ⊕ D_ins, e ∈ I, D_del[C ∪ {e}], and C_∆,D_t
require: D_t
output: σ_∆,D_t′(C ∪ {e})

1: if there exists (C, e, N, C′, q) in C_∆,D_t for some N, C′, and q then
2:     C′.count ← |D_t[C′]| − |D_del[C′]|
3:     for all i ∈ I \ C′ do
4:         C′.∆_i ← |D_t[C′ ∪ {i}]|
5:         if C′.count − C′.∆_i + |D_del[C′ ∪ {i}]| < ∆ then
6:             return σ_∆,D_t′(C ∪ {e})
7:     return C′
8: else
9:     return σ_∆,D_t′(C ∪ {e})

Proposition 1. Let D₁ and D₂ be transaction databases over I. If D₁ ⊆ D₂, then for all ∆ ∈ ℕ,

C_∆,D₁ ⊆ C_∆,D₂ .   (5.6)

Furthermore, for all ∆ ∈ ℕ and for all X ⊆ I,

σ_∆,D₁(X) ⊇ σ_∆,D₂(X) .   (5.7)

Proof. Let C ∈ C_∆,D₁ for some ∆ ∈ ℕ and let D′ = D₂ ⊖ D₁. Then, for any e ∈ I \ C, we have

|D₂[C ∪ {e}]| = |D₁[C ∪ {e}]| + |D′[C ∪ {e}]|
              ≤ |D₁[C]| − ∆ + |D′[C]|
              = |D₂[C]| − ∆ ,

where the inequality follows from C ∈ C_∆,D₁ and from the anti-monotonicity of support sets. Hence, C ∈ C_∆,D₂, completing the proof of (5.6).

To show (5.7), suppose that during the calculation of σ_∆,D₂(X), the items in σ_∆,D₂(X) \ X have been added to X in the order e₁, . . . , e_k. Let X₀ = X and X_i = X ∪ {e₁, . . . , e_{i−1}, e_i} for all i ∈ [k]. Then |D₂[X_{i−1}]| − |D₂[X_i]| < ∆ for all i ∈ [k] (see Algorithm 1). Since D₂[X_{i−1}] ⊇ D₂[X_i] and D₁ ⊆ D₂, we have |D₁[X_{i−1}]| − |D₁[X_i]| < ∆ for all i. Thus, as Algorithm 1 is Church-Rosser, all e_i will be added to σ_∆,D₁(X) as well, implying (5.7).
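Both claims of Proposition 1 can be checked exhaustively on a small instance. The following sketch does so under assumptions: the closure routine is our reading of the fixed-point procedure referred to as Algorithm 1, the two samples are invented toy data with D₁ a prefix of D₂, and all function names are hypothetical.

```python
from itertools import combinations


def support_count(db, itemset):
    return sum(1 for t in db if itemset <= t)


def closure(db, items, delta, x):
    """Add items whose addition reduces the support by less than delta,
    until no such item is left."""
    c = set(x)
    changed = True
    while changed:
        changed = False
        for i in sorted(items - c):
            if support_count(db, c) - support_count(db, c | {i}) < delta:
                c.add(i)
                changed = True
    return frozenset(c)


def subsets(items):
    ordered = sorted(items)
    return [frozenset(combo)
            for r in range(len(ordered) + 1)
            for combo in combinations(ordered, r)]


def check_proposition(d1, d2, items, delta):
    """Return (claim 5.6 holds, claim 5.7 holds) for d1 contained in d2."""
    closed1 = {x for x in subsets(items) if closure(d1, items, delta, x) == x}
    closed2 = {x for x in subsets(items) if closure(d2, items, delta, x) == x}
    families_nested = closed1 <= closed2
    closures_nested = all(
        closure(d1, items, delta, x) >= closure(d2, items, delta, x)
        for x in subsets(items)
    )
    return families_nested, closures_nested


d2 = [frozenset(t) for t in ("ad", "ad", "bd", "bd", "cd", "cd", "d", "d")]
d1 = d2[:5]                      # a sub-multiset of d2
print(check_proposition(d1, d2, frozenset("abcd"), 2))   # (True, True)
```

In this instance cd is closed in the larger sample but not in the smaller one, so the inclusion in (5.6) is strict.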

Using Proposition 1, we have the following result for Algorithm 7 concerning case (β):

Lemma 2. Algorithm 7 is correct, i.e., for all C ∈ C_∆,D_t′ and for all e ∈ I, the output of the algorithm is σ_∆,D_t′(C ∪ {e}).

Proof. By Condition (5.5), D_t′[C ∪ {e}] ⊆ D_t[C ∪ {e}] and hence Proposition 1 implies that there is no Y ∈ C_∆,D_t′ with C ∪ {e} ⊊ Y ⊊ σ_∆,D_t(C ∪ {e}). Furthermore, if σ_∆,D_t(C ∪ {e}) ∉ C_∆,D_t′, then σ_∆,D_t(C ∪ {e}) ⊊ σ_∆,D_t′(C ∪ {e}). Thus, to check whether C′ = σ_∆,D_t(C ∪ {e}) remains closed in D_t′, it suffices to test whether

|D_t′[C′]| − |D_t′[C′ ∪ {i}]| ≥ ∆   (5.8)

further holds for all items i ∈ I \ C′ (lines 2–6 of Algorithm 7). If so, the algorithm returns C′ in line 7, implying the correctness of Algorithm 7 for the case that C′ ∈ C_∆,D_t′; the claim is trivial for the other two cases (lines 6 and 9).

Algorithm 8 Closure_γ

input: C ∈ C_∆,D_t′ with D_t′ = D_t ⊖ D_del ⊕ D_ins, e ∈ I, D_ins[C ∪ {e}], and C_∆,D_t
require: D_t
output: σ_∆,D_t′(C ∪ {e})

1: if there exists (C, e, N, C′, q) in C_∆,D_t for some N, C′, and q then
2:     C″ ← C ∪ {e}; D′ ← (D_t ⊕ D_ins)[C″]
3:     repeat
4:         for all i ∈ C′ \ C″ do
5:             if |D′| − |D′[i]| < ∆ then
6:                 C″ ← C″ ∪ {i}; D′ ← D′[i]
7:     until D′ has not been changed in Loop 4–6
8:     return C″
9: else
10:    return σ_∆,D_t′(C ∪ {e})

We note that in our implementation of Algorithm 7 we do not calculate C′.count and C′.∆_i in lines 2 and 4, but store and maintain them consistently. In this way, the condition in line 5 can be decided from D_del, without any access to D_t. It is important to mention that with increasing stream length, the number of elements to be deleted from C_∆,D_t becomes smaller (cf. (5.2)) and typically, most of the elements of C_∆,D_t′ are calculated by terminating in line 7.

Example 3. In our running Example 1, the call of LC(∅,a,b) in Figure 5.2 corresponds to case (β) because D_ins[b] = ∅ and D_del[b] ≠ ∅. Since (∅, a, b, bd, ↑) ∈ C_∆,D_t, Algorithm 7 only needs to compute support queries on D_del in lines 2, 4, and 5. For all i considered in line 3, the condition in line 5 is not fulfilled. Hence, the algorithm returns bd in line 7, without calling the closure operator.
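The test performed in case (β) can be sketched as follows. This is an illustration under assumptions rather than the implementation described in the text: the counts are recomputed from the samples instead of being stored and maintained, the fallback simply evaluates our reading of the closure operator on the updated sample, and all names and the toy data are hypothetical.

```python
def count(db, itemset):
    """Number of transactions in db that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def closure(db, items, delta, x):
    """Fixed-point closure, used here only as the fallback."""
    c = set(x)
    changed = True
    while changed:
        changed = False
        for i in sorted(items - c):
            if count(db, c) - count(db, c | {i}) < delta:
                c.add(i)
                changed = True
    return frozenset(c)


def closure_beta(d_t, d_del, d_new, items, delta, c, e, c_old):
    """c_old is the stored closure of C ∪ {e} in the old sample d_t.
    Assumes no inserted transaction contains C ∪ {e}."""
    support_new = count(d_t, c_old) - count(d_del, c_old)          # line 2
    for i in sorted(items - c_old):                                # line 3
        gap = (support_new - count(d_t, c_old | {i})
               + count(d_del, c_old | {i}))                        # lines 4-5
        if gap < delta:
            return closure(d_new, items, delta, c | {e})           # line 6
    return c_old                                                   # line 7


# Toy data, invented for this illustration: one copy of bd is deleted.
items = frozenset("abcd")
d_t = [frozenset(t) for t in ("ad", "ad", "bd", "bd", "bd", "cd", "cd", "d")]
d_del = [frozenset("bd")]
d_new = list(d_t)
d_new.remove(frozenset("bd"))

c_old = closure(d_t, items, 2, {"b"})
result = closure_beta(d_t, d_del, d_new, items, 2, frozenset(), "b", c_old)
print(sorted(result))   # ['b', 'd']
```

Here bd keeps a support gap of 2 to every extension after the deletion, so the stored closure is returned without a fallback evaluation.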

Case (γ) Finally we discuss the case that C ∈ C_∆,D_t′ and e ∈ I satisfy the condition

D_del[C ∪ {e}] = ∅ and D_ins[C ∪ {e}] ≠ ∅   (5.9)

(see line 6 of Algorithm 5). The correctness proof for this case also uses Proposition 1.

Lemma 3. Algorithm 8 is correct, i.e., for all C ∈ C_∆,D_t′ and for all e ∈ I, the output of the algorithm is σ_∆,D_t′(C ∪ {e}).

Proof. The claim is immediate for the case that the condition in line 1 of Algorithm 8 is false. Consider the case that it is true. Proposition 1 with Condition (5.9) implies that C_∆,D_t ⊆ C_∆,D_t′ (i.e., all ∆-closed itemsets in C_∆,D_t are preserved) and that σ_∆,D_t′(C ∪ {e}) ⊆ σ_∆,D_t(C ∪ {e}). Thus, when calculating σ_∆,D_t′(C ∪ {e}) in Loop 3–7, it suffices to consider only the elements in σ_∆,D_t(C ∪ {e}) \ (C ∪ {e}), from which the claim is immediate for this case.

Compared to case (β), we need to calculate support counts in the entire sample D_t′ for this case. However, the inner loop (lines 4–6) iterates over a typically much smaller set than the general closure algorithm (cf. lines 2–5 of Algorithm 1). Analogously to case (β), the number of new ∆-closed itemsets to be added to C_∆,D_t′ becomes smaller with increasing stream length, and hence, most of the elements of C_∆,D_t′ are calculated in the “then” part (lines 2–8) of the “if” statement.

Example 4. In our running Example 1, the call of LC(∅,ab,c) in Figure 5.2 corresponds to case (γ) since item c occurs only in D_ins (i.e., D_ins[c] ≠ ∅ and D_del[c] = ∅). Since (∅, ab, c, cd, ↑) ∈ C_∆,D_t, the algorithm goes into the loop 4–6, iterating over all elements of cd \ c. The condition in line 5 is not satisfied for d and thus c is returned as a new closed itemset in line 8, without calling the closure operator.
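The restricted closure computation of case (γ) can be sketched as follows. The sketch covers only the “then” branch and rests on assumptions: the stored closure of C ∪ {e} in the old sample is passed in directly as `c_old`, samples are Python lists of frozensets, and the full closure routine (our reading of Algorithm 1) is included only for comparison. Names and toy data are hypothetical.

```python
def closure(db, items, delta, x):
    """Unrestricted fixed-point closure, for comparison only."""
    c = set(x)
    changed = True
    while changed:
        changed = False
        for i in sorted(items - c):
            sup = sum(1 for t in db if c <= t)
            sup_i = sum(1 for t in db if (c | {i}) <= t)
            if sup - sup_i < delta:
                c.add(i)
                changed = True
    return frozenset(c)


def closure_gamma(d_t, d_ins, delta, c, e, c_old):
    """Closure of C ∪ {e} in the updated sample, assuming that no deleted
    transaction contains C ∪ {e}. Only items of c_old are candidates."""
    current = set(c) | {e}                                     # line 2
    support = [t for t in d_t + d_ins if current <= t]         # line 2
    changed = True
    while changed:                                             # lines 3-7
        changed = False
        for i in sorted(c_old - current):                      # line 4
            support_i = [t for t in support if i in t]
            if len(support) - len(support_i) < delta:          # line 5
                current.add(i)                                 # line 6
                support = support_i
                changed = True
    return frozenset(current)                                  # line 8


# Toy data, invented for this illustration.
items = frozenset("abcd")
d_t = [frozenset(t) for t in ("ad", "ad", "bd", "bd", "cd", "cd", "d", "d")]
c_old = closure(d_t, items, 2, {"c"})          # cd in the old sample

one = closure_gamma(d_t, [frozenset("c")], 2, frozenset(), "c", c_old)
two = closure_gamma(d_t, [frozenset("c")] * 2, 2, frozenset(), "c", c_old)
print(sorted(one), sorted(two))   # ['c', 'd'] ['c']
```

With two inserted copies of the transaction c, the gap between c and cd reaches ∆ = 2, so c becomes a new closed set; with a single copy the old closure cd is confirmed.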

Controlling the Time and Space Complexity of the Update

Although by Theorem 2 Algorithm 5 does not improve the worst-case time and space complexity of the batch algorithm (Boley et al., 2009b) calculating the family of strongly closed sets from scratch, our experimental results presented in Section 5.3.6 clearly demonstrate a considerable speed-up on artificial and real-world data sets. The total time depends on the cardinality of C_∆,D_t′, which can be exponential in |I|. The time and space of the update can be controlled by selecting the parameter ∆ in a way that |C_∆,D_t′| < K for some reasonably small K. The value of K may depend, e.g., on time or space capacity constraints. Once K has been fixed, the value of ∆ can automatically be adjusted when the number of elements in C_∆,D_t′ that have already been enumerated exceeds K. More precisely, suppose Algorithm 5 has generated a subset C′ ⊆ C_∆,D_t′ with |C′| = K + 1. For all C ∈ C′, let ∆_C be the strength of C and denote ∆′ = min_{C∈C′} ∆_C. Clearly, ∆′ ≥ ∆. Let C″ = {C ∈ C′ : ∆_C > ∆′}. For the set obtained we have C″ ⊆ C_∆′+1,D_t′ and |C″| ≤ K.
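The adjustment of ∆ described above can be sketched in a few lines. The sketch assumes that the strength ∆_C of every enumerated set is already available in a dictionary; the function name `tighten` and the strength values are invented for illustration.

```python
def tighten(strengths):
    """Given the strengths of K + 1 enumerated closed sets, return the raised
    threshold (minimum strength plus one) and the sets that survive it."""
    weakest = min(strengths.values())
    kept = {c: s for c, s in strengths.items() if s > weakest}
    return weakest + 1, kept


# Four enumerated sets while only K = 3 are allowed (hypothetical strengths).
K = 3
strengths = {
    frozenset("ad"): 2,
    frozenset("bd"): 2,
    frozenset("cd"): 3,
    frozenset("d"): 6,
}
new_delta, kept = tighten(strengths)
print(new_delta, len(kept))   # 3 2
```

At least one set attains the minimum strength and is dropped, so at most K sets remain, and every remaining set has strength at least the raised threshold.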

This change of ∆ to ∆′ + 1 requires, however, the maintenance of auxiliary pieces of information for all already generated strongly closed sets, as well as the reconstruction of the quintuples for the closed sets remaining. More precisely, suppose ∆_C,e = |D_t[C]| − |D_t[C ∪ {e}]| has been calculated correctly for all C ∈ C_∆,D_t and for all e ∈ I \ C. Notice that the strength of C in D_t is given by min_{e∈I\C} ∆_C,e, where the values ∆_C,e are obtained as a byproduct of the algorithm computing the closure operator (cf. Algorithm 1). One can see that if C ∈ C_∆,D_t′ and C has not been recalculated by calling the closure operator, then ∆_C,e can be updated by

∆_C,e = ∆_C,e + |D_ins[C]| − |D_ins[C ∪ {e}]| − |D_del[C]| + |D_del[C ∪ {e}]|

for all e ∈ I \ C. Thus, the complexity of the update for this case depends on the cardinality of D_ins and D_del only, which become smaller and smaller with increasing t′ by (5.2). Finally, utilizing the algebraic properties of closure systems, the quintuples can be reconstructed by a top-down traversal of the enumeration tree corresponding to Algorithm 5.
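The update rule for ∆_C,e can be checked against a direct recomputation on the updated sample. The following minimal sketch uses invented toy data and hypothetical function names; it only touches D_ins and D_del when updating, as the text describes.

```python
def count(db, itemset):
    """Number of transactions in db that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def updated_gap(old_gap, d_ins, d_del, c, e):
    """Support gap between C and C ∪ {e} in the updated sample, derived from
    the old gap and the inserted and deleted transactions only."""
    c_e = c | {e}
    return (old_gap
            + count(d_ins, c) - count(d_ins, c_e)
            - count(d_del, c) + count(d_del, c_e))


def strength(gaps):
    """Strength of C: the smallest gap over all items outside C."""
    return min(gaps.values())


# Toy data, invented for this illustration.
d_t = [frozenset(t) for t in ("ad", "ad", "bd", "bd", "cd", "cd", "d", "d")]
d_del = [frozenset("ad")]
d_ins = [frozenset("c")]
c = frozenset("d")

old_gaps = {e: count(d_t, c) - count(d_t, c | {e}) for e in "abc"}
new_gaps = {e: updated_gap(g, d_ins, d_del, c, e) for e, g in old_gaps.items()}
print(old_gaps, new_gaps, strength(new_gaps))
```

The updated gaps coincide with those obtained by scanning the full updated sample, while the update itself never reads D_t.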

In the document Mining Frequent Itemsets from Transactional Data Streams with (pages 92-99)