MiTemP: Mining Temporal Patterns

4.3 Support Computation

4.3.1 Discussion of Variants for Support Computation

Defining support is not straightforward for a pattern representation as complex as the one chosen in this thesis. Before the actual definition of support for our approach is given, different variants and their advantages and disadvantages are discussed. Because the data is relational, scenes contain different objects, and there is a temporal dimension, it is not obvious what a good estimate of the support is. Agrawal et al. [AS94] and Dehaspe [Deh98], for instance, count itemsets and matches of queries, respectively. Höppner [Höp03] decided to use an observation time semantics with a sliding window. The different approaches are discussed briefly in the following.

Match Counting with Key Parameter

In the task of frequent pattern discovery in logic, Dehaspe [Deh98] introduced an extra key parameter in order to determine what is counted. Entities are uniquely identified by each binding of the variables in key [Deh98, p. 34]. Such a key parameter could, e.g., be the different player objects of the predicate inBallControl(player) in a pattern. In this case, the support is defined by the number of different player objects (which are actually in ball control) in all matches of the pattern. A disadvantage of this support definition is that the key parameter must be part of each pattern in order to get a support greater than zero. Thus, it is not possible to compare two patterns if they do not share this key parameter. An advantage of this approach is its clear semantics: it is clearly stated what is counted (e.g., player instances that are in ball control).
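The key-based counting described above can be sketched as follows. This is a minimal illustration, not Dehaspe's implementation: the function name `support_with_key`, the representation of a match as a dictionary of variable bindings, and the sample bindings are all assumptions made for this example.

```python
def support_with_key(matches, key):
    """Count the distinct bindings of the key variable over all matches.

    Matches that do not bind the key variable contribute nothing, which is
    why a pattern without the key parameter gets support zero.
    """
    return len({m[key] for m in matches if key in m})


# Hypothetical matches of a pattern that contains inBallControl(player).
matches_with_key = [
    {"player": "p7", "start": 10},
    {"player": "p7", "start": 42},  # same player again: not counted twice
    {"player": "p9", "start": 55},
]

# Hypothetical matches of a pattern that does not contain the key variable.
matches_without_key = [{"team": "red", "start": 3}]

print(support_with_key(matches_with_key, "player"))     # 2
print(support_with_key(matches_without_key, "player"))  # 0
```

The second call shows the disadvantage named in the text: the two patterns cannot be compared meaningfully, because one of them does not contain the key parameter at all.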

Match Counting without Key Parameter

The difficulty with the counting key parameter is that it must be part of every pattern. Intuitively, it would be more elegant to find a solution that counts occurrences of patterns without this restriction. At first glance, it should be possible to simply count the occurrences of a pattern, taking each predicate instance into account only once. The simple example in Fig. 4.6 would lead to two matches for the pattern A∧B without any ambiguity, as the temporal gap between the predicate instances of the two matches is large. The dashed line illustrates the window. Two other (still quite simple) examples illustrate the problems of this counting approach.

Fig. 4.7 shows two sequences where predicates A and B appear. If the pattern A∧B has to be matched, in the first sequence it can be seen that either one or two matches could be found depending on the order of the pattern matching process.

Figure 4.6: Counting without key parameter

Figure 4.7: Assignment and match reuse problem

If the first match is taken greedily, i.e., by taking the predicate instances in the order in which they appear, the match consists of the first A and the first B predicate instance, as marked by the shaded bars. However, if the first A were combined with the second B and the first B with the second (shorter) A, we would get two valid matches. This simple example already illustrates what we call the assignment problem, which has also been identified in [Höp03, p. 51]. In the worst case, it could even violate the anti-monotonicity condition of the support, i.e., the support of a more special pattern could be higher than that of its more general parents. The completeness of the pattern mining algorithm could then no longer be guaranteed: frequent patterns might be excluded because they are assumed to be infrequent due to seemingly infrequent more general patterns. This might also lead to missed frequent patterns in subsequent steps.

A way to handle this problem is to identify one of the optimal matchings, i.e., to find an assignment that leads to the maximal possible support. However, this would lead to efficiency problems, as all assignment combinations would have to be checked.
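The difference between greedy and optimal assignment can be made concrete with a small sketch. The interval data below is hypothetical and is not taken from Fig. 4.7. The compatibility test (the two instances overlap and jointly fit into one window) is likewise a simplifying assumption for this example, not the matching semantics of the thesis.

```python
WINDOW = 12  # hypothetical window width


def compatible(a, b):
    """Assumed match test: a and b overlap and jointly fit into one window."""
    overlap = a[0] < b[1] and b[0] < a[1]
    span = max(a[1], b[1]) - min(a[0], b[0])
    return overlap and span <= WINDOW


def greedy_count(a_insts, b_insts):
    """Pair each A (in order) with the first unused compatible B."""
    used, count = set(), 0
    for a in a_insts:
        for j, b in enumerate(b_insts):
            if j not in used and compatible(a, b):
                used.add(j)
                count += 1
                break
    return count


def optimal_count(a_insts, b_insts, used=frozenset()):
    """Exhaustive search over all assignments (exponential in general)."""
    if not a_insts:
        return 0
    first, rest = a_insts[0], a_insts[1:]
    best = optimal_count(rest, b_insts, used)  # leave `first` unmatched
    for j, b in enumerate(b_insts):
        if j not in used and compatible(first, b):
            best = max(best, 1 + optimal_count(rest, b_insts, used | {j}))
    return best


# (start, end) intervals: a long first A, a shorter second A, and two Bs.
A = [(0, 10), (2, 4)]
B = [(3, 5), (8, 12)]

print(greedy_count(A, B))   # 1: first A takes first B, second A is stranded
print(optimal_count(A, B))  # 2: first A with second B, second A with first B
```

For a pattern with only two predicates, the optimal assignment is a bipartite matching problem. The exhaustive search is used here only because it mirrors the "check all assignment combinations" formulation in the text.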

A second problem is illustrated by the second sequence in Fig. 4.7. Here, again, the pattern A∧B has to be matched. Without any further restrictions, the greedy match would again assign the first appearing predicate instances of A and B, as marked by the shaded bars. If we add the restriction that A must be before B, the match of the more general pattern (without the restriction) no longer matches the pattern. However, there is still a valid match if the A predicate is combined with the second B predicate. This is problematic, as we cannot just reuse the matches of the more general pattern(s) and thus cannot avoid a complete scan of the sequence without risking missing matches. This problem cannot even be solved by identifying one of the optimal matchings. We refer to it as the match reuse problem. However, it is possible to restrict the search by only searching in the neighborhoods (specified by the window) of the matches of the more general patterns.
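The match reuse problem and the neighborhood restriction can be sketched as follows. The intervals, the window width, and the helper names (`fits`, `before`, `first_match`, `neighborhood`) are assumptions for this illustration. In particular, defining the neighborhood as the stored match extended by one window width on each side is only one possible reading of the text.

```python
WINDOW = 10  # hypothetical window width


def fits(a, b):
    """Both instances fit into one window."""
    return max(a[1], b[1]) - min(a[0], b[0]) <= WINDOW


def before(a, b):
    """Temporal restriction: a ends before b starts."""
    return a[1] < b[0]


def first_match(a_insts, b_insts, restriction=lambda a, b: True):
    """Greedy: return the first pair that fits the window and the restriction."""
    for a in a_insts:
        for b in b_insts:
            if fits(a, b) and restriction(a, b):
                return (a, b)
    return None


def neighborhood(match, insts):
    """Instances that intersect the stored match extended by one window."""
    lo = min(s for s, _ in match) - WINDOW
    hi = max(e for _, e in match) + WINDOW
    return [i for i in insts if i[1] >= lo and i[0] <= hi]


A = [(5, 7)]
B = [(1, 3), (9, 11), (40, 42)]

general = first_match(A, B)                      # match of A∧B
reused = general if before(*general) else None   # reuse fails for "A before B"
rescanned = first_match(A, B, before)            # full scan finds a match
local = first_match(neighborhood(general, A),
                    neighborhood(general, B), before)  # restricted scan

print(general)    # ((5, 7), (1, 3))
print(reused)     # None
print(rescanned)  # ((5, 7), (9, 11))
print(local)      # ((5, 7), (9, 11))
```

The stored match of the general pattern cannot be reused, yet a valid match of the specialized pattern exists. The restricted scan finds it without looking at the distant instance (40, 42).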

Computing the support without a key parameter not only leads to high computational cost or a possible loss of anti-monotonicity. The semantics of the matches is also not as clear as it appears at first glance: nothing motivates why certain predicate instances should be grouped into a match (besides maximizing the support). Depending on the application, it might even be desirable to allow some predicate instances to be reused (in which case one or more counting attributes should be used).

Further modifications are conceivable, e.g., a dynamic selection of the counting parameter. One idea is to select, for each pattern, the predicate with the fewest occurrences in the sequence as the counting parameter. But this can also lead to strange effects: a minimal change in the support of single predicates that leads to another counting parameter can result in enormous changes of the support value of the pattern.¹ A way to avoid this effect would be to dynamically select the “minimal predicate” (the one with the fewest occurrences) from the set of matches as key parameter, but with such a definition the semantics of the support becomes even harder to comprehend, as the support computation could be based on different key values for different patterns.
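The instability of a dynamically selected counting parameter can be shown with a small sketch. The occurrence counts and matches below are invented for this illustration, and `dynamic_support` is a hypothetical helper that implements only the first idea from the text: use the predicate with the fewest occurrences in the sequence as the counting parameter.

```python
def dynamic_support(occurrences, matches):
    """Select the rarest predicate as key and count its distinct instances.

    occurrences: predicate name -> number of instances in the whole sequence
    matches:     list of dicts, predicate name -> id of the matched instance
    """
    key = min(occurrences, key=occurrences.get)
    return key, len({m[key] for m in matches})


# Hypothetical matches of A∧B: 50 different A instances, but only 2 B instances.
matches = [{"A": i, "B": i % 2} for i in range(50)]

print(dynamic_support({"A": 100, "B": 101}, matches))  # ('A', 50)
print(dynamic_support({"A": 102, "B": 101}, matches))  # ('B', 2)
```

Two additional occurrences of A in the sequence switch the counting parameter from A to B, and the support of the unchanged set of matches drops from 50 to 2.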

Observation Time with Strict Window Semantics

Höppner [Höp03] chose a different way to compute the support. His measurement is based on an observation time semantics. The support is defined by the length of all time intervals where a pattern can be observed for a given window size. It is assumed that only a part of the sequence can be seen, i.e., the sliding window determines what can be observed. Only if the pattern can be matched by the given information in the current window, the interval of the window is taken into account for the support computation. Having the complete length of the sequence and the summed-up length of the intervals where the pattern holds, the likelihood of observing the pattern in a randomly chosen time window can be computed easily.

¹ Personal communication with Frank Höppner; e-mail correspondence, October 26 to November 10, 2005.

Besides its clear semantics, another advantage of this support is that it is not necessary to collect all possible matches of a pattern in the sequence. Once one match has been found at a window position, the matching can be stopped and the window can be moved to the next position (where the observation changes w.r.t. the window, i.e., where a predicate “leaves” or “enters” the window). On the other hand, this support cannot distinguish whether just one or many matches occur at a window position.
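A strongly simplified sketch of observation time support with a strict window is given below. It is not Höppner's definition: time is discretized to integer window positions, the pattern is just A∧B without temporal relations, and an instance counts as visible as soon as it intersects the window. All names and numbers are assumptions for this example.

```python
def visible(inst, t, w):
    """Instance [start, end) intersects the window [t, t + w)."""
    start, end = inst
    return start < t + w and end > t


def observation_support(a_insts, b_insts, horizon, w):
    """Count window positions at which A∧B is observable.

    Returns the number of positions and the relative frequency, i.e., the
    chance of observing the pattern at a randomly chosen window position.
    any() stops at the first visible instance, so not all matches at a
    window position are enumerated.
    """
    positions = range(horizon - w + 1)
    hits = sum(
        1
        for t in positions
        if any(visible(a, t, w) for a in a_insts)
        and any(visible(b, t, w) for b in b_insts)
    )
    return hits, hits / len(positions)


# Hypothetical sequence of length 20, window width 5.
print(observation_support([(2, 4)], [(6, 8)], horizon=20, w=5))          # (2, 0.125)
# A second A instance at the same place does not change the support.
print(observation_support([(2, 4), (3, 4)], [(6, 8)], horizon=20, w=5))  # (2, 0.125)
```

The second call illustrates the limitation mentioned in the text: one match and several matches at the same window positions yield the same support.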

Observation Time with Memory

As already outlined by Höppner [Höp03, p. 56], time points that the time window has already passed could be memorized and thus still be used for the matching process even if they are no longer visible at the current window position.

The advantage of this extension is that information that actually is provided in the sequence and can easily be stored could be used to identify more pattern instances even if they do not fit in the selected window size.
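The effect of memory can be illustrated with a small sketch under one possible reading of the idea, which is an assumption and not the definition used later in the thesis: at a window position, the pattern A∧B counts as observable if at least one of the two instances is visible in the window and the other one is either visible or has already been seen by the window. Time is discretized, and all names and numbers are hypothetical.

```python
def visible(inst, t, w):
    """Instance [start, end) intersects the window [t, t + w)."""
    start, end = inst
    return start < t + w and end > t


def seen(inst, t, w):
    """Instance started before the right window edge: visible or memorized."""
    return inst[0] < t + w


def observable(a, b, t, w, memory):
    if memory:
        return (seen(a, t, w) and seen(b, t, w)
                and (visible(a, t, w) or visible(b, t, w)))
    return visible(a, t, w) and visible(b, t, w)


def support(a_insts, b_insts, horizon, w, memory):
    """Number of integer window positions at which A∧B is observable."""
    return sum(
        1
        for t in range(horizon - w + 1)
        if any(observable(a, b, t, w, memory)
               for a in a_insts for b in b_insts)
    )


# A and B are too far apart to be visible in one window of width 5.
A, B = [(2, 4)], [(12, 14)]
print(support(A, B, horizon=20, w=5, memory=False))  # 0
print(support(A, B, horizon=20, w=5, memory=True))   # 6
```

With the strict window the two instances are never visible together. With memory, A is remembered while B is visible (window positions 8 to 13), so the pattern instance is found although it does not fit into the window.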

Maximal Match Length

The effect of a slightly different support definition should also be discussed briefly. If the interval of the whole match (from the earliest start time point to the latest end time point of its predicates) were taken as the basis for support computation, huge support values could be generated. Here, again, it would not be guaranteed that the support value does not increase when the pattern complexity increases: if a pattern is specialized by adding a predicate, the support could increase.

For instance, if the pattern ((A, B), {(1, 2, {before})}, CR) were extended to ((A, B, C), {(1, 2, {before}), (1, 3, {before}), (2, 3, {before})}, CR), the latest end time of a match (contributed by predicate C) could cover a region that was not covered by the more general pattern. Another problem is that the support values could differ enormously depending on the match. For a fair support computation, it would again be necessary to check all matches and select the one with the highest (or lowest) value. These problems show that this variant is not a real alternative.
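The loss of anti-monotonicity under this variant can be checked with a few hypothetical intervals. The helper `match_length` and the interval values are assumptions for this illustration; each instance is a (start, end) pair.

```python
def match_length(match):
    """Length from the earliest start to the latest end of a match."""
    return max(end for _, end in match) - min(start for start, _ in match)


# Hypothetical instances with A before B before C.
a, b = (0, 2), (3, 5)
c_short, c_long = (6, 8), (6, 20)

general = match_length([a, b])               # pattern (A, B)
special_long = match_length([a, b, c_long])  # pattern (A, B, C), one match
special_short = match_length([a, b, c_short])  # pattern (A, B, C), other match

print(general)        # 5
print(special_long)   # 20
print(special_short)  # 8
```

The more special pattern (A, B, C) gets a larger value than its more general parent (A, B), and the value depends strongly on which C instance happens to be part of the match.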

Because of the problems with the support definitions based on match counting, with or without key parameter, the observation time semantics (with memory) is chosen for support computation in this thesis. The convincing advantages of using observation time as support are its clear semantics and its better efficiency, as not all matches have to be collected or processed further. The anti-monotonicity property holds for this support definition (as will be shown in Section 4.4), and the support intervals of previous steps (i.e., of more general patterns) can be reused to restrict the search to parts of the temporal sequence during support computation.

The next section addresses the matching of patterns as this is needed for the support definition presented in Section 4.3.3.