
In case of the child axis, only the number of context nodes is used as a parameter for result size estimation. The method is independent of the actual data distribution in a context set. For the descendant axis, this approach would fail, because there is no proportionality between the context set size and the number of its descendants.

If, for example, the root is in the context set, the number of further nodes does not affect the result size. It is thus important to look inside the context set at the data it actually contains.

Recalling Eq. 2.9, we can exactly determine the number of descendants of a specific node by merely looking at its pre-/postorder ranks and level. For arbitrary context sets, an initial pruning suffices to ensure overlap-free results, as described for the staircase join algorithm (Sec. 2.3.1). Hence, the exact result size is calculated using the pruned context set $CS_{pr}$:

\[
|CS_{pr}/descendant| = \sum_{v \in CS_{pr}} post(v) - \sum_{v \in CS_{pr}} pre(v) + \sum_{v \in CS_{pr}} level(v). \tag{3.10}
\]
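The calculation of Eq. 3.10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` type, the function names, 0-based pre-/postorder ranks and a root level of 0 are choices made here, and `prune` is a simple stand-in for the pruning step of the staircase join, not a reproduction of Algorithm 1.

```python
from typing import Iterable, List, NamedTuple


class Node(NamedTuple):
    pre: int    # preorder rank (assumed 0-based)
    post: int   # postorder rank (assumed 0-based)
    level: int  # depth in the tree (assumed: root = 0)


def prune(context: Iterable[Node]) -> List[Node]:
    """Drop context nodes that lie in the subtree of another context node."""
    kept: List[Node] = []
    max_post = -1
    for v in sorted(context, key=lambda n: n.pre):
        # Visiting nodes in preorder, a node whose postorder rank is not
        # larger than that of the last kept node is inside its subtree
        # (or is a duplicate of it).
        if v.post > max_post:
            kept.append(v)
            max_post = v.post
    return kept


def descendant_count(context: Iterable[Node]) -> int:
    """Exact |CS_pr/descendant| as the three sums of Eq. 3.10."""
    pruned = prune(context)
    return (sum(v.post for v in pruned)
            - sum(v.pre for v in pruned)
            + sum(v.level for v in pruned))
```

For a five-node tree in which the root `a` has the children `b` and `c`, and `b` has the leaf children `d` and `e`, the context set {b, d, c} is pruned to {b, c}, and the three sums give 5 − 5 + 2 = 2, the two descendants of `b`.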

3.4.1 Sampling

Although the calculation is based on simple additions, in case of large context sets a lookup on all 3 values turns into a costly operation. To provide appropriate estimates rather than exact calculation, sampling techniques offer a suitable solution in this case. Data access will thus be limited to a small subset of the actual context sequence.

The remaining task is to slice out an appropriate sample. If single nodes are selected randomly, it is impossible to take into account the effect of pruning, i.e., to have a result estimation without duplicate elimination. It is thus necessary to pick out larger slices of the context set and to run the pruning Algorithm 1 on the sample before single node results are added up. Pruning can also be applied “on the fly” in the process of addition. With the correct result size for the pruned sample available, the descendant estimate for the complete context set is calculated by

\[
estimate(|CS/descendant|) = \frac{|CS|}{|Sample|} \cdot |Sample_{pr}/descendant|.
\]

Experimental results have confirmed good performance and acceptable accuracy when picking out two slices of about 100 nodes each.
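A minimal sketch of the slice-based estimate, assuming the context sequence is given in document order as `(pre, post, level)` tuples. The function names and the `starts` parameter are hypothetical; the text does not say where the slices are taken, so the caller supplies the (non-overlapping) start positions here. Pruning is applied on the fly while the sample is summed up.

```python
from typing import List, Sequence, Tuple

Node = Tuple[int, int, int]  # (pre, post, level), an assumed representation


def pruned_descendants(nodes: Sequence[Node]) -> int:
    """Exact descendant count of a node sequence, pruning on the fly."""
    total, max_post = 0, -1
    for pre, post, level in sorted(nodes):   # tuples sort by pre first
        if post > max_post:                  # not inside a counted subtree
            total += post - pre + level
            max_post = post
    return total


def estimate_descendants(context: Sequence[Node],
                         starts: Sequence[int],
                         slice_len: int = 100) -> float:
    """Scale the exact result of a few contiguous slices to the whole set."""
    sample: List[Node] = []
    for s in starts:
        sample.extend(context[s:s + slice_len])
    if not sample:
        return 0.0
    return len(context) / len(sample) * pruned_descendants(sample)
```

Taking contiguous slices rather than single random nodes keeps ancestor/descendant pairs together in the sample, so that the effect of pruning is reflected in the scaled result.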

3.5 Parent Axis

Unlike in the case of the child axis, every node is known to have exactly one parent. However, whereas two different nodes $v$, $w$ never have common children, this does not hold for any other axis step. More precisely, in case of the parent axis, on average $fan$ nodes will have the same parent:

\[
fan = \frac{|D| - 1}{|D_{par}|}.
\]

In terms of tree structure, $fan$ can be thought of as the average branching factor or fan-out. Applying this information allows us to bound the estimate by

\[
\frac{|CS|}{fan} \le |CS/parent| \le |CS|, \tag{3.11}
\]

and $|CS/parent|$ would be expected to tend “rightwards”, i.e., towards $|CS|$, for small context sets, as the probability for the few selected nodes to have common parents is low, and “leftwards” if $CS$ includes nearly all document nodes. To move the estimate correctly between these bounds, a stochastic model is needed that describes how many different parents are encountered starting from $|CS|$ nodes. Given the set of all possible parent nodes $D_{par}$, we will first determine the probability for each of these nodes not to belong to the result.

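Looking at one node $v \in D_{par}$, there are $fan$ children pointing at $v$ as their parent. There is thus a chance of $\frac{|D| - fan}{|D|}$ for the first node $c_1 \in CS$ not to point at $v$. The overall probability for $v$ not to be a parent of any node $c_i \in CS$ is, in the product form evaluated by Algorithm 5,

\[
P(v \notin CS/parent) =
\begin{cases}
\prod_{i=1}^{|CS|} \left(1 - \frac{fan}{|D| - |CS| + i}\right) & \text{if } |CS| \le fan, \\
\prod_{i=1}^{fan} \left(1 - \frac{|CS|}{|D| - fan + i}\right) & \text{if } |CS| > fan.
\end{cases}
\]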

The second case, i.e., $|CS| > fan$, is not necessary for the correctness of the equation, but it reduces the overall number of arithmetic operations significantly, namely to $O(fan)$. With $P(v \notin CS/parent)$, the number of encountered parents can be estimated by the complementary probability

\[
estimate(|CS/parent|) = |D_{par}| \cdot (1 - P(v \notin CS/parent)). \tag{3.12}
\]

Note that the stochastic model enables dynamic adjustment of the calculation based on statistical information. With respect to $|CS|$, the resulting estimate lies within the bounds given in Eq. 3.11.
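As an illustration, Eq. 3.12 can be evaluated with a short Python function. This is a sketch under assumptions, not the implementation used in the experiments: the function and parameter names are hypothetical, and the probability is computed with the loop over the context nodes only (the `else` branch of `parents_of_group` in Algorithm 5), because that form also works for a non-integer average fan-out.

```python
def estimate_parents(doc_size: int, num_parents: int, cs_size: int) -> float:
    """Estimate |CS/parent| from |D|, |D_par| and |CS| following Eq. 3.12."""
    fan = (doc_size - 1) / num_parents   # average fan-out
    p_missed = 1.0                       # P(v not in CS/parent)
    for i in range(1, cs_size + 1):
        # One factor per context node. A factor is clamped at 0 once the
        # context set is too large to avoid all children of v.
        p_missed *= max(0.0, 1 - fan / (doc_size - cs_size + i))
    return num_parents * (1 - p_missed)
```

For a document with 101 nodes and 20 parent nodes (so $fan = 5$), a single context node yields $20 \cdot 5/101 \approx 0.99$ expected parents, close to the upper bound $|CS| = 1$, while 100 context nodes yield all 20 parents, which is the lower bound $|CS|/fan$.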

3.5.1 Fan-Out Specific Groups

First experiments using the proposed model have shown only poor prediction quality. A closer look has revealed that the assumption of $fan$ as a common fan-out leads to overestimating the result for medium-sized context sets. In fact, a document normally contains very few nodes with high fan-outs, which therefore have a high chance of being the parent of any of the randomly chosen context nodes. On the other hand, there is a high number of nodes having only one child. These nodes are less likely to be reached by a parent step. To enhance our model to deal with fan-out differences, more statistical information has to be collected.

Instead of using a common average parent fan-out, the set of all document nodes is partitioned by the fan-out of their parents. The equivalence relation $\sim_{fan}$ formally defines the partitioning by its equivalence classes:

\[
v \sim_{fan} w \Leftrightarrow |v/parent/child| = |w/parent/child|, \quad \text{for all } v, w \in D
\]

Every resulting group $G_i$ thus represents a subset of nodes having parents with the same fan-out. For each group $G_i$ we maintain its cardinality $g\_size_i$ and the number of its parents $g\_par_i$; $g\_fan_i$ and $g\_frac_i$ can be derived (Eq. 3.5, 3.6).

These statistics make it possible to split the context set according to $g\_frac$ (Eq. 3.7) into subsets with the same parent fan-out, and to calculate the estimate for the parent step for each of these groups. The overall result estimate is the sum of all group-specific results. Algorithm 5 summarizes the introduced parent estimation procedure.
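A possible Python transcription of Algorithm 5 is sketched below. It keeps the names of the pseudocode, but two points are assumptions made here: the statistics are passed as arguments instead of being global data, and a non-integer loop bound is read as "while i ≤ bound", i.e., it is rounded down with `math.floor`.

```python
import math
from typing import Sequence


def parents_of_group(g_cs: float, g_size: int, g_par: int) -> float:
    """Expected number of distinct parents reached from g_cs nodes of a group."""
    g_fan = g_size / g_par
    if g_cs > g_size - g_fan:
        return float(g_par)          # every parent of the group is reached
    est = 1.0                        # P(a given parent is not reached)
    if g_cs > g_fan:
        # Shorter loop: one factor per child of the parent.
        for i in range(1, math.floor(g_fan) + 1):
            est *= 1 - g_cs / (g_size - g_fan + i)
    else:
        # One factor per context node (assumption: bound rounded down).
        for i in range(1, math.floor(g_cs) + 1):
            est *= 1 - g_fan / (g_size - g_cs + i)
    return (1 - est) * g_par


def estimate_par(cs_size: int, g_size: Sequence[int], g_par: Sequence[int],
                 doc_size: int) -> float:
    """Sum the group-specific estimates over all fan-out groups."""
    result = 0.0
    for size, par in zip(g_size, g_par):
        # Share of the context set expected to fall into this group.
        g_cs = cs_size / doc_size * size
        result += parents_of_group(g_cs, size, par)
    return result
```

With two groups of sizes 20 and 12 and fan-outs 1 and 2, a context set of 16 out of 32 nodes is split into 10 and 6 nodes; the first group contributes exactly 10 parents, the second about 4.64 of its 6 parents.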

3.5.2 Tag-Name Specific Groups

Algorithm 5: Parent Estimation

    estimate_par(cs_size) ≡
    stat. data: g_size: array(num), g_par: array(num), doc_size
    begin
        result ← 0;
        foreach group i do
            result ← result + parents_of_group(cs_size / doc_size ∗ g_size[i], g_size[i], g_par[i]);
        return result;
    end

    parents_of_group(g_cs, g_size, g_par) ≡
    begin
        g_fan ← g_size / g_par; est ← 1;
        if g_cs > g_size − g_fan then return g_par
        if g_cs > g_fan then
            for i ← 1 to g_fan do
                est ← est ∗ (1 − g_cs / (g_size − g_fan + i));
            return (1 − est) ∗ g_par
        else
            for i ← 1 to g_cs do
                est ← est ∗ (1 − g_fan / (g_size − g_cs + i));
            return (1 − est) ∗ g_par
    end

As for the child axis, context sets containing nodes sharing the same tag-name undermine the quality of the estimation, because their parents' specific fan-outs may differ highly from the average. To solve this problem, nodes can be partitioned according to their tag-name rather than according to parent fan-outs. If we know that all context nodes have the same tag-name, $CS \subseteq D_{tag}$, only the statistics of that specific group are taken into account. Estimation is thus simplified to a single invocation of the parents_of_group function (Algorithm 5):

\[
estimate(|CS/parent|) = parents\_of\_group(|CS|, g\_size_{tag}, g\_par_{tag})
\]

For randomly chosen context sets, as considered before, calculation can also be performed using all tag-name groups instead of the groups resulting from the fan-out partitioning. However, the partitioning by tag-names differs from the one on parent fan-outs in that the corresponding sets of parent nodes belonging to the tag-name groups may overlap. Hence, it is not possible to simply add up all group-related results. Applying $\sum_i g\_par_i / |D_{par}|$ as a correction factor to the final result prevents overestimation, but the accuracy still falls behind the one achieved with the parent fan-outs, as can be seen in the experimental study presented at the end of this chapter.

3.5.3 Summarization of Groups

Although the group-based estimation model enables high-quality estimates, its performance decreases as every group requires a separate call of the parents_of_group function. In order to regain performance, the number of groups has to be reduced.

If we reunite the groups with similar parent fan-outs, the accuracy loss remains small, whereas the speed-up of the calculation should be significant.

Experiments with candidate summarization techniques have shown that the groups with the smallest fan-outs must be kept separate. If we sort the groups according to their fan-out, the sizes of the groups decrease for higher fan-outs. This observation enables a simple summarization technique: using the cardinality of the first group in the fan-out sorted group sequence as a size limit, the following groups are repeatedly reunited for as long as the resulting summarized group does not exceed the limit. A small example in Table 3.1 shows the reunion of six fan-out groups into three summarized groups applying the described procedure.

g_fan   g_size   g_size (summarized)   g_fan (summarized)
-----   ------   -------------------   ------------------
1       20       20                    1
-----   ------   -------------------   ------------------
2       12       19                    2.368
3       7
-----   ------   -------------------   ------------------
5       3        8                     7.5
8       4
13      1

Table 3.1: Example for group summarization.
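The summarization step can be sketched as follows. Two details are inferred from the values in Table 3.1 rather than stated in the text, and should be read as assumptions: a summarized group is closed as soon as adding the next group would push it over the size limit, and its fan-out is the size-weighted mean of the fan-outs of its members. The names `summarize` and `_combine` are hypothetical.

```python
from typing import List, Tuple

Group = Tuple[float, int]  # (g_fan, g_size)


def _combine(members: List[Group]) -> Group:
    """Merge groups into one; fan-out is the size-weighted mean (assumed)."""
    size = sum(s for _, s in members)
    fan = sum(f * s for f, s in members) / size
    return (fan, size)


def summarize(groups: List[Group]) -> List[Group]:
    """Merge fan-out groups, using the first group's size as the size limit."""
    groups = sorted(groups)              # ascending fan-out
    if not groups:
        return []
    limit = groups[0][1]
    merged: List[Group] = [groups[0]]    # smallest fan-out stays separate
    current: List[Group] = []
    for fan, size in groups[1:]:
        # Close the current summarized group before it would exceed the limit.
        if current and sum(s for _, s in current) + size > limit:
            merged.append(_combine(current))
            current = []
        current.append((fan, size))
    if current:
        merged.append(_combine(current))
    return merged
```

Applied to the six groups of Table 3.1, `summarize([(1, 20), (2, 12), (3, 7), (5, 3), (8, 4), (13, 1)])` produces three groups of sizes 20, 19 and 8, with the fan-outs 1, 45/19 ≈ 2.368 and 7.5.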

The summarization process can be done during the statistics generation to maintain the summarized fan-out groups rather than the exact ones. The resulting smaller statistics table also saves on storage requirements.