

4.4. Usage-based Task Tree Generation

4.4.4. Complexity Analysis

When recording users of websites, large numbers of action instances may be recorded. The generation of task trees from these action instances can, hence, become time-consuming.

Therefore, we perform a complexity analysis as an estimation of the performance of our approach for larger input data. We do this by first considering the individual steps that we take in our generation process. Then, we define a complexity for each of them. Finally, we derive a complexity for the whole approach. The overall task tree generation approach is shown in Algorithm 4.3. In the following paragraphs, we describe each of the steps in the algorithm and their complexities.

Initially, we perform an alternating iteration and sequence detection (Lines 1 to 4 in Algorithm 4.3). The iteration detection itself can be done by reading the input task instance list once. During this read, we can directly store which actions are repeated, as well as the positions of the repeated actions. This storing has a complexity of O(1), as the information can be stored, e.g., in an array or a list. Afterwards, the instances of the repeated actions are replaced by respective iteration instances, which requires at most a second read of the input data. Hence, the iteration detection itself has a complexity of O(n), where n is the number of processed action instances.
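To illustrate the single-pass nature of this step, the following Python sketch collapses runs of identical actions into iteration instances. The function name and the tuple representation of an iteration instance are illustrative choices, not the actual implementation of our approach.

```python
def detect_iterations(instances):
    """Minimal sketch of a single-pass iteration detection (illustrative,
    not the actual implementation): runs of identical actions are collapsed
    into one iteration instance each. The read and the replacement are
    combined into one pass here; the complexity stays O(n)."""
    result = []
    i = 0
    while i < len(instances):
        j = i
        # extend the run while the same action instance is repeated
        while j + 1 < len(instances) and instances[j + 1] == instances[i]:
            j += 1
        if j > i:
            # replace the repeated action instances by one iteration instance
            result.append(("iteration", instances[i], j - i + 1))
        else:
            result.append(instances[i])
        i = j + 1
    return result
```

For example, detect_iterations(["a", "b", "b", "b", "c"]) yields ["a", ("iteration", "b", 3), "c"].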

The complexity of the sequence detection is similar, but due to the choosing of sequences to be merged, its determination requires more insight. Through a single read of the input task instance list, all n-grams can be determined, including their size and locations in the input data. The only boundary here is the size of the memory, as an action instance list of length n contains $\sum_{i=2}^{n-1} i = \frac{n(n+1)}{2} - (n+1)$ permutations of n-grams $l'$ with a length $1 < |l'| < n$.

During the single read, the detected n-grams can directly be combined into n-gram sets, which represent the same action combinations. The assignment of n-grams to n-gram sets can be done with a complexity of O(1). For this, we can use an algorithm that is capable of using an n-gram as a unique index of the corresponding n-gram set in an array of n-gram sets.
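A minimal sketch of this bucketing, assuming that hashing an n-gram stands in for the constant-time index lookup described above (all names are illustrative):

```python
from collections import defaultdict

def collect_ngram_sets(instances, max_len):
    """Sketch of the n-gram bucketing (illustrative names): during a single
    read, every n-gram ending at the current position is appended to the
    set of identical n-grams, keyed by its content. Hashing the n-gram
    stands in for the constant-time index lookup described in the text."""
    ngram_sets = defaultdict(list)   # n-gram content -> start positions
    for end in range(len(instances)):
        for length in range(2, min(max_len, end + 1) + 1):
            ngram = tuple(instances[end - length + 1:end + 1])
            ngram_sets[ngram].append(end - length + 1)
    return ngram_sets
```

For instance, collect_ngram_sets(["a", "b", "a", "b"], 2) records the 2-gram ("a", "b") with the start positions 0 and 2.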

From all n-gram sets, we determine which n-gram sets need to be replaced first. We initially choose those n-gram sets with, first, the highest number of n-grams and, second, the longest n-grams.

Algorithm 4.3 Simplified task detection process with complexities.

 1: repeat
 2:     // iteration detection                                            O(n)
 3:     // sequence detection                                             O(n)
 4: until no new sequence or iteration detected                           O(n²)
 5:
 6: repeat
 7:     // compare sequences                                              O(n⁴)
 8:     // choose sequences to be merged first                            O(n²)
 9:
10:     if there are sequences to be merged then
11:         for all sequence pairs to be merged do
12:             // adapt flattened task instances                         O(n²)
13:
14:             repeat
15:                 // iteration detection on flattened task instances    O(n)
16:                 // sequence detection on flattened task instances     O(n)
17:             until no new sequence or iteration detected               O(n²)
18:
19:             // harmonize parent tasks                                 O(n)
20:         end for                                                       O(n³)
21:     end if
22: until no sequences to be merged                                       O(n⁵)

This can be done during the initial reading of the input data. When having read the next action instance from the input data, several new n-grams are completed. For each of these n-grams, we know to which of the n-gram sets it belongs. We also know the current number of n-grams in a set, as well as the n-gram length. Hence, we can determine at any time during the first read which n-gram sets currently contain the most n-grams, and which length these n-grams have. This can be done by maintaining pointers to these n-gram sets. These pointers are also available at the end of reading the input data. Hence, the initial choosing of the n-gram sets which contain the most and the longest n-grams also has a complexity of O(1). After this initial choosing, only an amount of n-gram sets remains that is a fraction of n. This is because a task instance list of length n contains only $n + 1 - |l'|$ permutations of n-grams $l'$ of length $|l'|$. These n-grams must be identical to be added to an n-gram set. As the minimum number of identical n-grams per set is two, at most $\frac{n + 1 - |l'|}{2}$ n-gram sets will remain after the first choosing. The subsequent choosing process will discard at least one of the sets in any repetition and, therefore, runs at most $\frac{n + 1 - |l'|}{2}$ times. This shows that, considering an unlimited amount of memory, the sequence detection also has a maximum complexity of O(n).
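As a quick plausibility check of this bound (a constructed example, not taken from any recorded data set): for $n = 10$ and $|l'| = 3$, the list contains only $10 + 1 - 3 = 8$ three-grams, so at most

\[
\frac{n + 1 - |l'|}{2} = \frac{10 + 1 - 3}{2} = 4
\]

n-gram sets of at least two identical 3-grams can remain after the first choosing.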

The iteration and sequence detection are repeated alternately (Lines 1 to 4 in Algorithm 4.3). In each cycle, either the iteration detection or the sequence detection will detect at least one iteration or one sequence. In the worst case scenario, only two action or task instances in the task instance list are combined into one new task instance per cycle, resulting in at most n − 1 cycles for the iteration and sequence detection. Hence, the complexity of the alternation is also O(n). As the iteration and sequence detection already have a complexity of O(n), and as they are called in the alternation, the resulting complexity of the iteration and sequence detection is O(n²).
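Written as a single bound (a restatement of the argument above, not an additional result):

\[
\underbrace{(n-1)}_{\text{cycles in the worst case}} \cdot \underbrace{O(n)}_{\text{iteration and sequence detection per cycle}} = O(n^2).
\]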

After the iteration and sequence detection, we perform a merging of similar sequences (Lines 6 to 20 in Algorithm 4.3). For this, we first have to determine similar sequences, which is done by comparing each sequence with every other sequence. This means performing n(n − 1) comparisons, where n is the number of sequences. Considering an algorithm for comparing two sequences with a complexity of at most O(n²), as is the case for the Myers diff algorithm used in this thesis [89], the complexity of the detection of similar sequences is O(n⁴). Afterwards, we perform a choosing of similar sequences to be merged first. The complexity of this choosing increases with an increasing number of sequences that are shared between the pairs. But still, any pair is handled at most three times, also resulting in a maximum complexity of O(n²), where n is the number of sequences.
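The bound for the detection of similar sequences is, hence, the product of the number of comparisons and the cost per comparison:

\[
\underbrace{n(n-1)}_{\text{sequence pairs}} \cdot \underbrace{O(n^2)}_{\text{diff per pair}} = O(n^4).
\]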

The subsequent flattening of task instances needs to be done for every task instance and also for every detected delta. Hence, the complexity of this step is linearly dependent on the product of the number of task instances and the number of deltas. Therefore, it is at most O(n²). Afterwards, we perform an alternating iteration and sequence detection on the flattened instances, which, as shown above, also has a complexity of O(n²). Furthermore, we do a harmonization of parent tasks, which can be done in linear time (O(n)). Finally, after a merging, we check if there are further similar sequences and merge them if required.

Through this repetition of the merging process, the complexity of the overall merging process increases to O(n⁵). Referring to Algorithm 4.3, the major complexity issue is therefore the repeated execution of Line 7, which already has a complexity of O(n⁴).
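Combining the annotations of Algorithm 4.3 (a summary of the reasoning above, assuming, as implied by the overall bound, at most O(n) repetitions of the outer merging loop):

\[
\underbrace{O(n)}_{\text{merge rounds (Lines 6 to 22)}} \cdot \Big( \underbrace{O(n^4)}_{\text{Line 7}} + \underbrace{O(n^2)}_{\text{Line 8}} + \underbrace{O(n^3)}_{\text{Lines 11 to 20}} \Big) = O(n^5).
\]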

The complexity of our approach of O(n⁵) seems to be quite high. But this considers the worst case scenario, in which we have as few identical n-grams in the task instance list as possible. The complexity also depends strongly on the number of distinct available actions and action combinations. If, for example, the user can perform only a small set of distinct actions and only a small number of distinct action combinations, even large task instance lists will contain many identical n-grams. Hence, already the sequence and iteration detection will perform many more reductions of the number of entries in the task instance list within one cycle than in the worst case scenario. The same applies for the detection and merging of similar sequences. With a decreasing number of possible action combinations, fewer similar sequences are detected and merged. In addition, after a certain data set size is reached, no new actions or action combinations are recorded, because the previously recorded data already contains all actions and action combinations possible in a software. Hence, the number of iteration and sequence detection cycles, as well as the number of required sequence comparisons, does not increase anymore, even if the data set size increases. Furthermore, action combinations performed only once in a data set are not detected and, hence, there is no corresponding detection cycle for them in the iteration and sequence detection.

Additionally, the approach has some opportunities for runtime improvements, which are not considered in the complexity analysis. For example, one aim in this thesis is to determine how many action instances a detected sequence should cover to consider it representative for typical user behavior. This implies not detecting sequences that are not representative. Through this, we intend to reduce the number of cycles in the iteration and sequence detection, as well as the number of sequence comparisons for the subsequent merging. In fact, in our case studies, we only compare and merge those sequences with each other that cover most of the recorded action instances and which are, hence, the most representative ones. Furthermore, when comparing sequences for detecting similar ones, it is possible to skip comparisons based on the knowledge about the sequences' structures. For example, if the task list of one sequence is much longer than that of another sequence, the similarity level $sim(s_i, s_j)$ can be estimated to be lower than $sim_{min}$. In this case, the comparison does not need to be done. We also skip sequence pairs whose deltas are at the beginning or end of the task lists. This can also be checked before applying a complex diff algorithm on the task lists. Hence, it is sufficient to only compare sequences with each other whose first and last elements of the corresponding task lists are the same. In addition, if a sequence $s_1$ is the direct or indirect child of a sequence $s_2$, the comparison of $s_1$ and $s_2$ can be skipped, because parent tasks are not merged with their children. This is ensured by the choosing process applied for identifying sequences to merge. All this reduces the number of required comparisons significantly. Finally, the remaining comparisons can be done in parallel, as they are independent from each other, which can further reduce the runtime. We implemented the optimizations named in this chapter in our case studies.
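The pre-filters named in this paragraph can be sketched as follows; the Sequence class, its attributes, and the concrete length heuristic for estimating $sim(s_i, s_j)$ against $sim_{min}$ are assumptions made for illustration, not the exact implementation used in our case studies.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sequence:
    """Illustrative stand-in for a detected sequence and its task list."""
    task_list: List[str]
    parent: Optional["Sequence"] = None

    def is_ancestor_of(self, other: "Sequence") -> bool:
        node = other.parent
        while node is not None:
            if node is self:
                return True
            node = node.parent
        return False

def should_compare(s1: Sequence, s2: Sequence, sim_min: float = 0.7) -> bool:
    """Pre-filters applied before running the expensive diff; the length
    heuristic and the threshold value are assumptions for illustration."""
    t1, t2 = s1.task_list, s2.task_list
    # if one task list is much longer than the other, sim(s1, s2) is
    # estimated to stay below sim_min, so the diff can be skipped
    if min(len(t1), len(t2)) / max(len(t1), len(t2)) < sim_min:
        return False
    # deltas at the beginning or end of the task lists are skipped anyway,
    # so both lists must share their first and their last element
    if t1[0] != t2[0] or t1[-1] != t2[-1]:
        return False
    # parent tasks are never merged with their direct or indirect children
    if s1.is_ancestor_of(s2) or s2.is_ancestor_of(s1):
        return False
    return True
```

Only pairs for which should_compare returns true would then be passed to the actual diff-based comparison; these remaining comparisons can be distributed over several threads or processes, as they are independent from each other.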