

In the document Diversity Driven Parallel Data Mining (pages 52-58)

4.2 The KRIMP Algorithm

4.2.2 KRIMP Algorithm Details

The Krimp algorithm describes the interaction of several different constructs: Minimum Description Length (MDL) is the fundamental construct for measuring the encoding of the database. The code lengths come from ideal codes, whose lengths are based on Shannon entropy. The orderings of the code table and of the candidate table are heuristic, and affect how the tables cover the transactions in the database.

MDL

The Krimp algorithm is based on the Minimum Description Length (MDL) principle, first described by Rissanen in his seminal 1978 paper “Modeling by Shortest Data Description” [Ris78]. MDL is a form of modeling based on compression, where the compression’s goal is to minimize the size of the encoding of the model, measured in bits [BBHK10, p. 106]. To decompress the model, two elements are required: the encoded model and the decompression rule(s).

In Krimp the de-/compression rule(s) are facilitated by a code table. The code table is a simple translation table with two columns: the itemset to be encoded and its code length [VvLS11]. (An example code table is found in Figure 4.3b.) The use of prefix-free codes is assumed, where no code is the prefix of a longer code in the code table [CT12, p. 106]. Additionally, the actual code itself and the encoding are unimportant to the Krimp algorithm; only the lengths of the codes are relevant. Moreover, the Krimp algorithm uses Crude MDL, which minimizes the sum of the encoded model and the de-/compression rule(s), and not Refined MDL, where the de-/compression rules are encoded with the database [Grü07, pp. 10-11].

Code Lengths

When considering the performance of compression in this context, better compression has a shorter output, as measured in bits. For the purposes of finding frequent itemsets in a database, the presence of itemsets that appear more frequently should be reflected by

Figure 4.2: The Standard Code Table for the database shown in Figure 4.3a.

shorter codes. Optimal code lengths can be calculated using Shannon entropy because of the relationship between code lengths and probability distributions [LV97, p. 353]. The calculation for the code length of an itemset is shown in Equations 4.6 and 4.7 [VvLS11].

L(X) = −log2(P(X)) (4.6)

where L(X) is the length of the code of an itemset X ∈ X in bits, and P(X) is the probability of X being used in the encoding, i.e.,

P(X) = usage(X) / ∑c∈CT usage(c) (4.7)

The numerator in Equation 4.7 contains usage, which is related to, but not synonymous with, support (see Definition 4.1). The usage of an itemset is the number of times it is actually used to encode the database at a given iteration of the Krimp algorithm; similarly, the summation in the denominator of Equation 4.7 sums over all of the usages calculated in that iteration. In contrast, the support of an itemset is invariant across iterations, depending only on the database, irrespective of the usage employed by the encoding.

The Standard Code Table shown in Figure 4.2 has the code lengths for each of the four items in the database shown in Figure 4.3a. The probability of item {A} (and likewise for {B} and {C}) per Equation 4.7 is P({A}) = 10/37 = 0.27. The code length from Equation 4.6 is L({A}) = −log2(0.27) = 1.89 bits. Similarly, Figure 4.3b shows the relationship between itemset usage and the code lengths for the sample database.

Itemset {A, B, C, D} has a usage of 6, where the sum of all usages is 6 + 3 + 1 + 1 + 1 = 12. Per Equation 4.7, P({A, B, C, D}) = 6/12 = 0.5. The code length per Equation 4.6 is L({A, B, C, D}) = −log2(0.5) = 1 bit.
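Equations 4.6 and 4.7 translate directly into a few lines of Python. The sketch below reproduces the worked values above; the item usages (A, B, C at 10 and D at 7) are assumptions read off the example, consistent with P({A}) = 10/37 and L({D}) = 2.4 bits, not taken from Figure 4.2 itself.

```python
import math

def code_lengths(usages):
    """Shannon-optimal code lengths per Equations 4.6 and 4.7:
    L(X) = -log2(usage(X) / sum of all usages)."""
    total = sum(usages.values())
    return {x: -math.log2(u / total) for x, u in usages.items() if u > 0}

# Item usages consistent with the Standard Code Table example (Figure 4.2).
item_lengths = code_lengths({"A": 10, "B": 10, "C": 10, "D": 7})
print(round(item_lengths["A"], 2))  # 1.89 bits, matching L({A}) above

# Itemset usages as in Figure 4.3b: 6 + 3 + 1 + 1 + 1 = 12.
set_lengths = code_lengths({"ABCD": 6, "ABC": 3, "AB": 1, "C": 1, "D": 1})
print(set_lengths["ABCD"])  # 1.0 bit, since -log2(6/12) = 1
```

Note that itemsets with zero usage are skipped, since −log2(0) is undefined; this matches the summation condition in Equation 4.11 below.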

Covers

Krimp considers itemsets from a set of candidate itemsets, F, which are discovered via an algorithm external to the Krimp algorithm, typically with minsup = 1. Each


(a) A transaction database shown with each transaction covered by an itemset and the associated code lengths.

(b) A code table showing the respective lengths for encoded itemsets for possible itemsets from a database.

Figure 4.3: A transaction database and its corresponding code table.

candidate itemset, F ∈ F, is used iteratively in the code table, CT, in an attempt to cover a transaction, i.e., when F is a subset of tn ∈ D, the usage of F, usage(F), is incremented and used for the calculation of the code length of F with the formulas in Equations 4.6 and 4.7. It is important to note that Krimp only considers non-overlapping itemsets for encoding a transaction [VvLS11].

Definition 4.7 Cover. The cover of a transaction, t, is the set of itemsets X ∈ CS whose union is the largest subset of t found while iterating through CT. CS is the set of itemsets used for encoding, each paired with a code length, L(X). The elements of t which are not in the union of itemsets from CS are covered by the single items, i, from the Standard Code Table, ST, and are considered part of CT for encoding purposes [VvLS11].

cover(CT, t) = ⋃i=1..|CT| CTi ⊂ (t \ cover(CTi−1, t)) (4.8)

The distinction between CT and CS is necessary in Definition 4.7 because CT maintains the list of itemsets that will be used in an attempt to cover the database, while CS is the list of itemsets actually used to cover the database in a given iteration: ∀Xp : usageCS(Xp) > 0. When calculating the total encoded size of the database, the size of the encoding table is also used, which would be negatively affected by the inclusion of elements not used for the encoding [VvLS11].

In Figure 4.3a, each transaction is covered completely by an itemset from Figure 4.3b. However, because of the exclusionary principle of non-overlapping itemsets, the itemsets {A} and {B} are not considered part of the encoding table CS.
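The greedy, non-overlapping covering described above can be sketched as a single pass over the code table. This is an illustrative simplification, not the authors' implementation; it assumes the code table is already in Standard Cover Order and contains the singleton items at the bottom, so every transaction is fully covered.

```python
def standard_cover(t, code_table):
    """Cover transaction t greedily with non-overlapping itemsets,
    taking code-table entries in their given (Standard Cover) order."""
    remaining = set(t)
    cover = []
    for itemset in code_table:       # assumed already in Standard Cover Order
        if itemset <= remaining:     # itemset must fit the still-uncovered part
            cover.append(itemset)
            remaining -= itemset     # non-overlap: claimed items are removed
    return cover

# Code table in Standard Cover Order, with singletons last (cf. Figure 4.3b).
ct = [frozenset("ABCD"), frozenset("ABC"), frozenset("AB"),
      frozenset("A"), frozenset("B"), frozenset("C"), frozenset("D")]
print(standard_cover("ABD", ct))  # covers "ABD" as {A, B} plus {D}
```

Because {A, B} claims A and B first, the overlapping itemsets {A} and {B} can never be used on that transaction, mirroring the exclusion noted above.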

MDL Size

The MDL size is the sum of the size of the encoded database and the size of the code table used to encode the database. The size of the encoded database, D, is simply the sum of the sizes of all the transactions.

The size of a transaction is the sum of the code lengths of the itemsets and items used in its cover.

Definition 4.8 Transaction Code Length. The length of the code used to cover a transaction [VvLS11].

L(t|CT) = ∑X∈cover(CT,t) L(codeCT(X)) (4.9)

Each transaction in Figure 4.3a is covered by exactly one itemset, so each transaction's encoded length is simply the code length of that covering itemset.

The size of the encoded database is found from the sum of all transactions.

Definition 4.9 Encoded Database Size. The sum of all of the sizes of the transac-tions in the database [VvLS11].

L(D|CT) = ∑t∈D L(t|CT) (4.10)

The size of the encoded database in Figure 4.3a using the code table from Figure 4.3b is ∑i=1..n L(codeCT(ti)) = 1 + 1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 3.58 + 3.58 + 3.58 = 22.74 bits.
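Equations 4.9 and 4.10 amount to a double sum over covers. The sketch below reproduces the 22.74-bit figure; the covers list, with one covering itemset per transaction at the usages quoted above, is an assumption about Figure 4.3a's layout.

```python
def encoded_db_size(covers, length):
    """L(D|CT): sum the code lengths used in every transaction's cover
    (Equations 4.9 and 4.10)."""
    return sum(length[x] for cover in covers for x in cover)

# Code lengths from Figure 4.3b; one covering itemset per transaction,
# at usages 6, 3, 1, 1, 1 as in the worked example.
length = {"ABCD": 1.0, "ABC": 2.0, "AB": 3.58, "C": 3.58, "D": 3.58}
covers = [["ABCD"]] * 6 + [["ABC"]] * 3 + [["AB"], ["C"], ["D"]]
print(round(encoded_db_size(covers, length), 2))  # 22.74 bits
```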

The size of the code table, CT, used to encode the database, D, is the sum of all the code lengths for each itemset, codeCT(X), in the code table. The itemsets themselves, however, also have to be encoded. The Standard Code Table is used for the code lengths of each individual item in each itemset, codeST(X) [VvLS11].

Definition 4.10 Encoded Code Table Size. The sum of the code lengths for each itemset, X, used to encode database, D, plus the size of each item, i, in X as encoded in the Standard Code Table.

L(CT|D) = ∑X∈CT:usageD(X)≠0 (L(codeST(X)) + L(codeCT(X))) (4.11)


The size of the code table in Figure 4.3b is the size of each of the codes, which is just the lengths themselves, ∑X∈CT L(codeCT(X)) = 1 + 2 + 3.58 + 3.58 + 3.58 = 13.74 bits, plus the encoded size of each itemset identifier. The code size of each itemset identifier is taken from the Standard Code Table, from which the code length of each item is used (see Figure 4.2). Itemset identifier codeST({A, B, C, D}) has a size of 1.89 + 1.89 + 1.89 + 2.4 = 8.07 bits. Similarly, codeST({A, B, C}) = 6.18 bits, codeST({A, B}) = 4.29 bits, codeST({C}) = 1.89 bits, and codeST({D}) = 2.4 bits. Summed together, the identifiers total 22.83 bits.

As described in Section 4.2.2, and here more formally, the MDL size is the size of the encoded database and the size of the coding rules, i.e., the Code Table, CT.

Definition 4.11 MDL Size. The minimum description length of a database, D, is the size of the encoded database plus the size of the code table, CT, used to encode it.

L(D, CT) = L(D|CT) + L(CT|D) (4.12)

The total MDL size for the sample database shown in Figure 4.3a is L(D|CT) = 22.74 bits plus L(CT|D) = ∑X∈CT L(codeST(X)) + ∑X∈CT L(codeCT(X)) = 22.83 bits + 13.74 bits, for a total of 59.31 bits.
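Equations 4.11 and 4.12 combine into two small helpers. This is a sketch, with the quoted component totals plugged in directly; only itemsets with non-zero usage are assumed to appear in the itemsets argument, as the summation condition in Equation 4.11 requires.

```python
def code_table_size(itemsets, length_ct, length_st):
    """L(CT|D) per Equation 4.11: each used itemset costs its own code
    plus the Standard Code Table codes spelling out its items."""
    return sum(length_ct[x] + sum(length_st[i] for i in x) for x in itemsets)

def mdl_size(db_bits, ct_bits):
    """L(D, CT) = L(D|CT) + L(CT|D) per Equation 4.12."""
    return db_bits + ct_bits

# Plugging in the totals from the worked example:
# L(D|CT) = 22.74 bits, identifier codes 22.83 bits, itemset codes 13.74 bits.
print(round(mdl_size(22.74, 22.83 + 13.74), 2))  # 59.31 bits
```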

The order in which Krimp considers the itemsets from CT to cover the transactions is specific. The goal of MDL is to find the shortest codes; therefore the itemsets in CT are heuristically ordered by decreasing cardinality, because larger itemsets cover more items in a transaction. They are then ordered by decreasing support, because the more an itemset is used, the shorter its code. Finally, they are ordered lexicographically, in essence as a tie-breaker. This is called Standard Cover Order [VvLS11].
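This three-level ordering translates naturally into a composite sort key; a minimal sketch, where negated numeric components give the descending parts and a sorted item tuple gives the ascending lexicographic tie-break:

```python
def standard_cover_order(itemsets, support):
    """Order itemsets by cardinality desc, support desc, then
    lexicographically asc (Standard Cover Order)."""
    return sorted(itemsets,
                  key=lambda x: (-len(x), -support[x], tuple(sorted(x))))

sup = {frozenset("ABC"): 3, frozenset("AC"): 7, frozenset("AB"): 7}
order = standard_cover_order(list(sup), sup)
# {A,B,C} comes first (largest); {A,B} precedes {A,C} on the lexicographic
# tie-break, since both have cardinality 2 and support 7.
```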

Definition 4.12 Standard Cover Order. The ordering of itemsets, X, within the code table, CT, that prefers itemset cardinality, then support, then lexicographics [VvLS11].

|X| ↓ supportD(X) ↓ lexicography ↑ (4.13)

Compress the Database

The general Krimp algorithm is shown in Algorithm 4.1. It first creates the Standard Code Table as described in Algorithm 4.2. Each of the individual items in the database is used to create a code table, where each item is assigned a code (Equation 4.6) based on its probability (Equation 4.7). The Standard Code Table is used in the Krimp algorithm for coding individual items not covered by larger itemsets in CT and for encoding CT itself [VvLS11].

Candidate itemsets from the set of interesting itemsets, F, are examined singly in turn according to the Standard Candidate Order. The candidate itemsets are heuristically

Algorithm 4.1: The Krimp Algorithm [VvLS11]

Input: A transaction database D and a candidate set F, both over a set of items I
Output: A heuristic solution to the Minimal Coding Set Problem, code table CT

1 CT ← Standard Code Table(D)
2 FO ← F in Standard Candidate Order
3 foreach F ∈ FO \ I do
4   CTC ← (CT ∪ F) in Standard Cover Order
5   if L(D, CTC) < L(D, CT) then
6     CT ← CTC
7   end
8 end
9 return CT
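Under the definitions above, Algorithm 4.1 can be sketched end to end in Python. This is an illustrative simplification, not the reference implementation: covers, usage-based code lengths, and the MDL size are recomputed from scratch for every candidate, and the candidate list is assumed to arrive already in Standard Candidate Order.

```python
import math

def greedy_cover(ct, t):
    """Cover t with non-overlapping itemsets, first-fit in code-table order."""
    remaining, cov = set(t), []
    for x in ct:
        if x <= remaining:
            cov.append(x)
            remaining -= x
    return cov

def total_size(ct, db, st_len):
    """L(D, CT): encoded database plus encoded code table (Eqs. 4.9-4.12)."""
    covers = [greedy_cover(ct, t) for t in db]
    usage = {x: 0 for x in ct}
    for cov in covers:
        for x in cov:
            usage[x] += 1
    total = sum(usage.values())
    length = {x: -math.log2(u / total) for x, u in usage.items() if u > 0}
    db_bits = sum(length[x] for cov in covers for x in cov)
    ct_bits = sum(length[x] + sum(st_len[i] for i in x)
                  for x in ct if usage[x] > 0)
    return db_bits + ct_bits

def krimp(db, candidates):
    """Greedily accept candidates that shrink L(D, CT), per Algorithm 4.1."""
    items = sorted({i for t in db for i in t})
    n = sum(len(t) for t in db)
    st_len = {i: -math.log2(sum(i in t for t in db) / n) for i in items}
    support = lambda x: sum(x <= t for t in db)
    ct = [frozenset([i]) for i in items]        # line 1: Standard Code Table
    best = total_size(ct, db, st_len)
    for f in candidates:                        # assumed Standard Candidate Order
        trial = sorted(ct + [f],                # re-sort into Standard Cover Order
                       key=lambda x: (-len(x), -support(x), tuple(sorted(x))))
        size = total_size(trial, db, st_len)
        if size < best:                         # keep F only if MDL size improves
            ct, best = trial, size
    return ct, best

db = [frozenset("AB")] * 4 + [frozenset("C")]
ct, bits = krimp(db, [frozenset("AB")])
print(frozenset("AB") in ct)  # True: {A, B} compresses this tiny database
```

A candidate that never covers anything leaves the encoding unchanged and is therefore rejected, since its MDL size shows no improvement.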

Algorithm 4.2: The Standard Code Table Algorithm [VvLS11]

Input: A transaction database D over a set of items I
Output: The standard code table CT for D

1 CT ← ∅
2 foreach i ∈ I do
3   insert i into CT
4   usageCT(i) ← supportD(i)
5   codeCT(i) ← optimal code for i
6 end
7 return CT

ordered so that the itemsets likely to allow the most compression are found earliest in the search space, i.e., in a greedy fashion. Standard Candidate Order orders the itemsets primarily by their support, because these are the itemsets that will have the shortest codes. They are then ordered secondarily by cardinality and, finally, again as a tie-breaker, lexicographically [VvLS11].
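The candidate ordering mirrors the cover ordering with the first two keys swapped; as a sketch:

```python
def standard_candidate_order(candidates, support):
    """Order candidates by support desc, cardinality desc, then
    lexicographically asc (Standard Candidate Order)."""
    return sorted(candidates,
                  key=lambda f: (-support[f], -len(f), tuple(sorted(f))))

sup = {frozenset("A"): 9, frozenset("AB"): 9, frozenset("BC"): 4}
order = standard_candidate_order(list(sup), sup)
# {A,B} precedes {A}: equal support, larger cardinality wins;
# {B,C} trails on its lower support.
```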

Definition 4.13 Standard Candidate Order. The ordering of candidate itemsets F ∈ F that prefers itemset support, then cardinality and finally lexicographics [VvLS11].

supportD(F) ↓ |F| ↓ lexicography ↑ (4.14)

The candidate itemset, F, is provisionally inserted into the code table, CTC, in Standard Cover Order (see Definition 4.12). Each transaction, tn, in the database, D, is checked for being covered by the itemsets in CT; a transaction is covered by itemset X when the itemset is a subset of the transaction, tn, as described in Algorithm 4.3. After all transactions have been maximally covered by the itemsets from CT, the MDL size


(See Definition 4.11) with the candidate itemset, F, is compared to the previous MDL size without F. If the MDL size shows improvement, candidate itemset F is retained in CT, and if not, F is discarded. The next candidate itemset is pulled from F and evaluated in the same process until all itemsets from F have been evaluated [VvLS11].

Algorithm 4.3: The Standard Cover Algorithm [VvLS11]

Input: Transaction t ∈ D and code table CT, with CT and D over a set of items I
Output: A cover of t using non-overlapping elements of CT

1 S ← largest element X of CT in Standard Cover Order for which X ⊆ t
2 if t \ S = ∅ then
3   Res ← {S}
4 else
5   Res ← {S} ∪ Standard Cover(t \ S, CT)
6 end
7 return Res

In an attempt to explore areas of the solution space that the Krimp algorithm may have excluded due to its heuristics, Vreeken et al. introduced code table pruning as an enhancement. After a new itemset, F, is accepted into the code table, CT, the Pruning algorithm, as shown in Algorithm 4.4, iterates through all of the accepted codes whose usage in the code table declined compared to before F’s acceptance. With a lower usage, and therefore a longer code length, their presence in the code table may be detrimental to the overall MDL size. The MDL size of the database is calculated with each of these itemsets removed from the code table, and if a compression improvement is shown in its absence, the itemset is removed from CT [VvLS11].
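The pruning step can be sketched abstractly. In this sketch, size_fn stands in for the full L(D, CT) computation and prune_candidates for the itemsets whose usage declined; both are placeholders for illustration, not part of the authors' interface.

```python
def prune(ct, size_fn, prune_candidates):
    """Try removing each candidate itemset; keep a removal whenever the
    total encoded size L(D, CT) improves (code table pruning)."""
    best = size_fn(ct)
    for x in prune_candidates:
        trial = [y for y in ct if y != x]
        trial_size = size_fn(trial)
        if trial_size < best:           # removal helped: itemset was dead weight
            ct, best = trial, trial_size
    return ct

# Toy size function: one bit per table entry, three bits per uncovered item.
def size_fn(ct):
    covered = set().union(*ct) if ct else set()
    return len(ct) + 3 * len({"a", "b", "c"} - covered)

ct = [frozenset("ab"), frozenset("c"), frozenset("a")]
print(prune(ct, size_fn, [frozenset("a"), frozenset("c")]))
# {a} is dropped (redundant under {a,b}); {c} is kept, since removing it
# would leave item c uncovered and enlarge the encoding.
```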
