Discussion of Variants for Support Computation

MiTemP: Mining Temporal Patterns

4.3 Support Computation

4.3.1 Discussion of Variants for Support Computation

The definition of support is not straightforward for a pattern representation as complex as the one chosen in this thesis. Before the actual definition of support for our approach is given, different variants and their advantages and disadvantages are discussed. Because of the relational data, the different objects in scenes, and the temporal dimension, it is not obvious what a good estimate of the support would be. Agrawal et al. [AS94] and Dehaspe [Deh98], for instance, count itemsets and matches of queries, respectively. Höppner decided to use an observation time semantics with a sliding window. The different approaches are discussed briefly in the following.

Match Counting with Key Parameter

For the task of frequent pattern discovery in logic, Dehaspe [Deh98] introduced an extra key parameter in order to determine what is counted. Entities are uniquely identified by each binding of the variables in the key [Deh98, p. 34]. Such a key parameter could, e.g., be the player object of the predicate inBallControl(player) in a pattern. In this case, the support is defined as the number of different player objects (which are actually in ball control) in all matches of the pattern. A disadvantage of this support definition is that the key parameter must be part of each pattern in order to get a support greater than zero. Thus, it is not possible to compare two different patterns if they do not share this key parameter. An advantage of this approach is its clear semantics: it is clearly stated what is counted (e.g., player instances that are in ball control).
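The counting rule can be illustrated with a minimal sketch. The match representation (one dictionary of variable bindings per match), the variable names, and the function `key_support` are assumptions made for this illustration only; they are not taken from [Deh98].

```python
# Hypothetical matches of one pattern: each match is a set of variable bindings.
matches = [
    {"player": "p7", "opponent": "q3"},
    {"player": "p7", "opponent": "q5"},
    {"player": "p9", "opponent": "q3"},
]

def key_support(matches, key):
    """Count the distinct bindings of the key variable over all matches."""
    return len({m[key] for m in matches if key in m})

print(key_support(matches, "player"))  # distinct players over all matches
print(key_support(matches, "ball"))    # key is not part of the pattern
```

Three matches bind only two different players, so the support with key `player` is 2, whereas a key that does not occur in the pattern yields a support of zero.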

Match Counting without Key Parameter

The difficulty with the key parameter is that it must be part of every pattern. Intuitively, it would be more elegant to find a solution that counts occurrences of patterns in a scene without this restriction. At first glance, it should be possible to simply count occurrences of patterns while using each predicate instance at most once. The simple example in Fig. 4.6 would lead to two matches for the pattern A∧B without any ambiguity, as the temporal gap between the predicate instances of the two matches is large. The dashed line illustrates the window. Two other (still quite simple) examples illustrate the problems of this counting approach.

Fig. 4.7 shows two sequences in which the predicates A and B appear. If the pattern A∧B has to be matched, the first sequence shows that either one or two matches can be found, depending on the order of the pattern matching process.

Figure 4.6: Counting without key parameter

Figure 4.7: Assignment and match reuse problem

If the first match is taken in a greedy way, i.e., taking the predicate instances in the order they appear, the match would consist of the first A and the first B predicate instance, as marked by the shaded bars. However, if the first A was combined with the second B and the first B with the second (shorter) A, we would get two valid matches! This simple example already illustrates what we call the assignment problem. This problem has also been identified in [Höp03, p. 51]. In the worst case, it could even lead to a violation of the anti-monotonicity condition of the support, i.e., the support of a more special pattern could be higher than that of its more general parents. In this case, the completeness of the pattern mining algorithm could no longer be guaranteed: frequent patterns might be excluded because their more general patterns are wrongly assumed to be infrequent. This might also lead to missed frequent patterns in subsequent steps.

A way to handle this problem is to identify one of the optimal matchings, i.e., to find an assignment that leads to the maximal possible support. However, this would lead to efficiency problems, as all assignment combinations would have to be checked.
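The assignment problem can be reproduced with a small sketch. The interval values are hypothetical and only mimic the situation described for the first sequence of Fig. 4.7. The match criterion (two instances form a match if they overlap in time) is likewise an assumption chosen to keep the example short; it is not the matching definition of this thesis.

```python
from itertools import permutations

# Hypothetical predicate instances as (start, end), ordered by start time.
A = [(0, 10), (2, 4)]   # the second A is the shorter one
B = [(1, 5), (7, 9)]

def compatible(a, b):
    """Assumed match criterion: the two instances overlap in time."""
    return a[0] <= b[1] and b[0] <= a[1]

def greedy_count(A, B):
    """Pair each A with the first unused compatible B, in order of appearance."""
    used, count = set(), 0
    for a in A:
        for j, b in enumerate(B):
            if j not in used and compatible(a, b):
                used.add(j)
                count += 1
                break
    return count

def exhaustive_count(A, B):
    """Check every assignment of instances and keep the best count."""
    if len(A) > len(B):          # the criterion is symmetric, so swapping is safe
        A, B = B, A
    best = 0
    for perm in permutations(range(len(B)), len(A)):
        best = max(best, sum(compatible(A[i], B[j]) for i, j in enumerate(perm)))
    return best

print(greedy_count(A, B), exhaustive_count(A, B))
```

With these values the greedy strategy finds one match, while the exhaustive check finds two. The exhaustive variant enumerates all assignments, which is exactly the source of the efficiency problem mentioned above.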

A second problem is illustrated in the second sequence in Fig. 4.7. Here, again, the pattern A∧B has to be matched. If we do not have any further restrictions, the greedy match would again lead to an assignment of the first appearing predicate instances of A and B, as marked by the shaded bars. If we add the restriction that A must be before B, the match found for the more general pattern (without the restriction) no longer matches the specialized pattern. However, there is still a valid match if the A predicate is combined with the second B predicate. This is problematic, as we cannot just reuse the matches of the more general pattern(s) and thus cannot avoid a complete scan of the sequence without risking missed matches! This problem cannot even be solved by identifying one of the optimal matchings. We refer to this problem as the match reuse problem. However, it is possible to restrict the search in the sequence to the neighborhoods (specified by the window size) of the matches of the more general patterns.
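The match reuse problem can be sketched in a few lines. The interval values are hypothetical and chosen to resemble the second sequence of Fig. 4.7; the `before` test (a ends strictly earlier than b starts) is an assumed simplification of the temporal relation.

```python
# Hypothetical predicate instances as (start, end).
A = [(3, 6)]
B = [(1, 4), (8, 10)]   # the first B overlaps A, the second B lies after A

def before(a, b):
    """Assumed 'before' relation: a ends strictly earlier than b starts."""
    return a[1] < b[0]

# Greedy match of the general pattern A∧B: first A with first B.
general_match = (A[0], B[0])

# Reusing only the stored match misses the specialized pattern ...
reused = [m for m in [general_match] if before(*m)]

# ... whereas a rescan of the sequence still finds a valid match.
rescanned = [(a, b) for a in A for b in B if before(a, b)]

print(reused)
print(rescanned)
```

The stored match does not satisfy the added restriction, so `reused` is empty, although the rescan finds the combination of A with the second B.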

Computing the support without a key parameter not only leads to high computational cost or a possible loss of anti-monotonicity. The semantics of the matches is also not as clear as it appears at first glance. It is not motivated why certain predicate instances should be grouped together into a match (besides maximizing the support). Depending on the application, it might even be desired to allow some predicate instances to be reused (and thus, one or more counting attributes should be used).

It is possible to come up with further modifications, e.g., a dynamic selection of a counting parameter. One idea is to select from each pattern the predicate with the fewest occurrences in the sequence as the counting parameter. But this can also lead to strange effects: a minimal change in the support of single predicates that leads to another counting parameter can result in enormous changes of the support value of the pattern¹. A way to avoid this effect would be to dynamically select the "minimal predicate" (the one with the fewest occurrences) from the set of matches as key parameter, but with such a definition of support the semantics becomes even harder to comprehend, as the support computation could be based on different key values for different patterns.

Observation Time with Strict Window Semantics

Höppner [Höp03] chose a different way to compute the support. His measure is based on an observation time semantics. The support is defined as the total length of all time intervals in which a pattern can be observed for a given window size. It is assumed that only a part of the sequence can be seen, i.e., the sliding window determines what can be observed. The interval of the window is taken into account for the support computation only if the pattern can be matched with the information available in the current window. Given the complete length of the sequence and the summed-up length of the intervals where the pattern holds, the likelihood of observing the pattern in a randomly chosen time window can be computed easily.

¹ Personal communication with Frank Höppner; e-mail correspondence, October 26 to November 10, 2005

Besides its clear semantics, another advantage of this support is that it is not necessary to collect all possible matches for a pattern in the sequence. As soon as one match has been found at a window position, the matching can be stopped and the window can be moved to the next position (where the observation changes w.r.t. the window, i.e., where a predicate "leaves" or "enters" the window). On the other hand, it is not possible to distinguish whether just one or many matches occur at a window position.
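The observation time semantics and the early termination can be sketched as follows. This is a strongly simplified illustration under several assumptions that are not part of [Höp03]: time is discrete, the window is moved in unit steps, a predicate instance counts as visible only if it lies completely inside the window, and the pattern is fixed to "A before B". The names `observable` and `observation_support` are hypothetical.

```python
def observable(instances, lo, hi):
    """True if some A ends before some B starts, both fully inside [lo, hi]."""
    visible = [(p, s, e) for p, s, e in instances if lo <= s and e <= hi]
    # any() stops at the first match: one match per window position suffices.
    return any(pa == "A" and pb == "B" and ea < sb
               for pa, _, ea in visible for pb, sb, _ in visible)

def observation_support(instances, seq_len, window):
    """Count window positions where the pattern is observable and
    relate them to the number of all window positions."""
    positions = range(0, seq_len - window + 1)
    hits = sum(observable(instances, t, t + window) for t in positions)
    return hits, hits / len(positions)

# Hypothetical sequence of length 12 with one A and one B instance.
instances = [("A", 2, 3), ("B", 5, 6)]
print(observation_support(instances, seq_len=12, window=5))
```

For this input the pattern is observable at two of the eight window positions, which corresponds to a relative support of 0.25. The number of matches inside a window plays no role, which mirrors the limitation mentioned above.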

Observation Time with Memory

As already outlined by Höppner [Höp03, p. 56], time points that have been passed by the time window could be memorized and thus still be used for the matching process even if they are not visible at the current window position.

The advantage of this extension is that information which is actually provided in the sequence, and which can easily be stored, can be used to identify more pattern instances even if they do not fit into the selected window size.

Maximal Match Length

It should also be briefly discussed what the effect of a slightly different support definition would be. If the interval of the whole match (from the earliest start time point to the latest end time point of its predicates) was taken as the basis for support computation, huge support values could be generated. Here, again, it would not be guaranteed that the support value does not increase when the pattern complexity increases! If a pattern is specialized by adding a predicate, the support could increase.

For instance, if the pattern ((A, B), {(1,2,{before})}, CR) was extended to ((A, B, C), {(1,2,{before}), (1,3,{before}), (2,3,{before})}, CR), the latest end time of a match (given by predicate C) could cover a region that has not been covered by the more general pattern. Another problem is that the support values could differ enormously depending on the match. In order to have a fair support computation, it would be necessary, again, to check all matches and select the one with the greatest (or smallest) length. These problems show that this variant is not really an alternative.
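A short numeric sketch makes the violation concrete. The interval values are hypothetical, and `match_length` is an illustrative helper that measures a match from its earliest start to its latest end, as described for this variant.

```python
def match_length(match):
    """Length from the earliest start to the latest end of a match."""
    return max(e for _, e in match) - min(s for s, _ in match)

# Hypothetical instances as (start, end): A before B before C.
A, B, C = (0, 2), (3, 5), (6, 20)

general = match_length([A, B])       # pattern with A before B
special = match_length([A, B, C])    # specialization extended by C

print(general, special)
```

Here the match of the more general pattern has length 5, whereas the match of its specialization has length 20, so a support based on the match length would grow under specialization.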

Due to the problems with the support definitions based on match counting, with or without key parameter, the observation time semantics (with memory) is chosen for support computation in this thesis. The convincing advantages of using observation time as support are the clear semantics and the better efficiency, as not all matches have to be collected, let alone processed further. The anti-monotonicity property holds for this support definition (as will be shown in Section 4.4), and the support intervals of previous steps (i.e., of more general patterns) can be reused in order to restrict the search to parts of the temporal sequence during support computation.

The next section addresses the matching of patterns, as this is needed for the support definition presented in Section 4.3.3.

In the document Temporal Pattern Mining in Dynamic Environments (pages 110-114)