
Estimation on the ancestor axis has turned out to be the most demanding task in result size estimation, even though, unlike for the descendant axis, the variation among the result sizes of single nodes remains small. The entries of the level table can be interpreted directly as tuples $\langle v, |v/\mathrm{ancestor}| \rangle$, with $\mathrm{height}(T_D)$ as the maximum number of ancestors of a single node. We can thus define a simple upper bound for the ancestor axis:

$$|CS/\mathrm{ancestor}| < |CS| \cdot \mathrm{height}(T_D). \tag{3.13}$$

Starting from this upper bound as a first rough estimate, we refine the estimation step by step. The approximation obviously suffers from two generalizations:

(1) Most nodes lie on levels smaller than $\mathrm{height}(T_D)$. In unbalanced trees, which are typical of XML documents, not even a large fraction of the nodes is located on the deepest level.

(2) Ancestor sets of single nodes are highly overlapping with each other, e.g., every single ancestor set contains at least the root node.

To deal with (1) means to provide a more suitable model for assuming the level distribution of the context set nodes. To be more exact, the level distribution of the ancestor-pruned context set is required to exclude overlapping caused by context nodes being themselves ancestors of other context nodes. The pruned context set model alone, however, does not fully resolve the overlap problem mentioned in (2).

In order to calculate an overlap-free result size, a method is needed that avoids counting the common nodes on the paths from the context nodes up to the root more than once.
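The gap between the rough bound and the true result size can be illustrated on a toy tree. The following is a minimal sketch only: the tree, the node names, and the helper functions are hypothetical and serve purely to show how strongly overlapping ancestor sets inflate the bound of Eq. 3.13.

```python
# Toy tree as a child -> parent map; the root has parent None.
# (Hypothetical data, chosen only for illustration.)
PARENT = {
    "root": None,
    "a": "root", "b": "root",
    "a1": "a", "a2": "a",
    "b1": "b",
    "a1x": "a1",
}

def ancestors(node):
    """Return the set of proper ancestors of a node."""
    result = set()
    current = PARENT[node]
    while current is not None:
        result.add(current)
        current = PARENT[current]
    return result

def height():
    """Maximum number of ancestors of any single node."""
    return max(len(ancestors(n)) for n in PARENT)

def upper_bound(context_set):
    """Rough bound in the spirit of Eq. 3.13: |CS| * height."""
    return len(context_set) * height()

def exact_ancestor_count(context_set):
    """Size of the duplicate-free union of all ancestor sets."""
    union = set()
    for node in context_set:
        union |= ancestors(node)
    return len(union)

cs = ["a1", "a2", "b1", "a1x"]
bound = upper_bound(cs)           # 4 context nodes * height 3
exact = exact_ancestor_count(cs)  # {root, a, b, a1}
```

In this example the bound is three times the exact result, because the root and the node `a` are counted once per context node.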

3.6.1 Level Model of the Pruned Context Set

There are two possible ways of building a level model of the ancestor-pruned context set: using a level histogram, or examining the context set itself by means of sampling techniques.

The histogram-based approach is realized by maintaining an ancestor-pruned level histogram of the entire document, i.e., a table holding the number of leaf nodes for each level. This makes it possible to provide a level model for arbitrary context sets by a straightforward proportional partitioning according to the histogram $L^{pr}$:

$$|CS^{pr}_i| = \frac{|CS|}{D} \cdot L^{pr}_i,$$

where $|CS^{pr}_i|$ denotes the number of context nodes at level $i$. In contrast to the following sampling-based approach, this method does not touch the data of the current context set.
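The proportional partitioning can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: the function name and the dictionary layout are hypothetical, and the normalizing constant is assumed to be the total number of nodes recorded in the histogram.

```python
def level_model_from_histogram(cs_size, pruned_histogram):
    """Distribute |CS| over the levels in proportion to the pruned histogram.

    pruned_histogram maps level -> number of leaf nodes on that level.
    Assumption: the histogram total is used as the normalizing constant.
    """
    total = sum(pruned_histogram.values())
    if total == 0:
        return {level: 0.0 for level in pruned_histogram}
    return {
        level: cs_size * count / total
        for level, count in pruned_histogram.items()
    }

# Hypothetical histogram: most leaves sit on the deeper levels.
histogram = {1: 0, 2: 10, 3: 30, 4: 60}
model = level_model_from_histogram(50, histogram)
```

The model depends only on the size of the context set, never on its members, which is why it fits uniformly distributed context sets best.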

If sampling is used to obtain a level model, a small sample of the context set is extracted and examined with regard to its ancestor-pruned level distribution.

In this case, proportional expansion yields the level model for the complete context set:

$$|CS^{pr}_i| = \frac{|CS|}{|\mathit{Sample}|} \cdot \mathit{Sample}^{pr}_i$$

The sampling technique applied here differs from the one used in the descendant case. Sampling can be improved by picking single nodes from the complete range of the context set instead of cutting out dense slices of nodes only at the beginning and the end of the document. The difference results from the fact that the combined pruning process plays only a minor role for ancestor estimation, whereas the level distribution of nodes can vary considerably in different parts of the document. Experiments have shown that sampling performs best when the test nodes are picked by a cursor moving over the context sequence in steps of increasing distance; for instance, the step range between two test nodes starts at 1 and is increased by a factor of 1.1 with every cursor step. Thus, the first part of the sampling, with context nodes at smaller distances, makes pruning effects visible, whereas the larger steps towards the end ensure that no part of the context set is left untouched.
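The cursor-based sampling and the subsequent proportional expansion might look as follows. This is a sketch under assumptions that are not taken from the text: context nodes are represented as hypothetical `(pre, post, level)` tuples in document order, a node is treated as an ancestor of another if it has the smaller preorder and the larger postorder rank, and the initial step is at least 1.

```python
from collections import Counter

def sample_positions(n, start_step=1.0, growth=1.1):
    """Indices visited by a cursor whose step grows by `growth` per move."""
    positions = []
    pos, step = 0.0, start_step
    while pos < n:
        positions.append(int(pos))
        pos += step
        step *= growth
    return positions

def pruned_level_counts(sample):
    """Count sample nodes per level, skipping ancestors of other sample nodes.

    In document order, a node has a descendant in the sample exactly when
    the next sample node has a smaller postorder rank (assumed encoding).
    """
    counts = Counter()
    for i, (_pre, post, level) in enumerate(sample):
        has_descendant = i + 1 < len(sample) and sample[i + 1][1] < post
        if not has_descendant:
            counts[level] += 1
    return counts

def level_model_from_sample(context_set, start_step=1.0, growth=1.1):
    """Expand the pruned level counts of the sample to the whole context set."""
    if not context_set:
        return {}
    indices = sample_positions(len(context_set), start_step, growth)
    sample = [context_set[i] for i in indices]
    scale = len(context_set) / len(sample)
    return {lvl: scale * c for lvl, c in pruned_level_counts(sample).items()}

# Hypothetical context set: root, leaf a, node b, and b's child c.
cs = [(0, 3, 0), (1, 0, 1), (2, 2, 1), (3, 1, 2)]
model = level_model_from_sample(cs)
```

With a growth factor of 1.1 the first few sample nodes are direct neighbours in the context sequence, so ancestor/descendant pairs are still detected, while later steps skip ahead.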

The choice between these two methods depends mainly on the expected node distribution within context sets. The histogram-based approach shows its advantages when the context set consists of nodes that are uniformly distributed over the whole document tree (Fig. 3.1(a)). However, since context sets that are intermediate results of node tests often consist of nodes sharing a specific feature, e.g., an identical tag name, a uniform distribution is not always an appropriate assumption. Figure 3.1(b) demonstrates the superiority of the sampling-based approach for tag-name specific context sets.

3.6.2 Overlap-Free Ancestor Estimation

In our estimation method we make use of the probabilistic parent step model to avoid multiple counting of common ancestor nodes. It propagates the parent estimation “level-wise” up the tree while the result sizes of the individual levels are summed up:

$$\mathrm{estimate}(|CS/\mathrm{ancestor}|) = \sum_{i=\mathrm{height}(T_D)}^{1} \mathrm{estimate}(|CS_i/\mathrm{parent}|). \tag{3.14}$$

Similarly to $|CS^{pr}_i|$, which was introduced to denote the number of context nodes at level $i$ in the pruned context set model, we have to extend the context set $CS_i$ considered for the parent estimation at level $i$ by those nodes that are parents of nodes on the level below. Note that the two sets $CS^{pr}_i$ and $CS_{i+1}/\mathrm{parent}$ are disjoint because of the ancestor pruning already performed. Their cardinalities can thus be added:

$$|CS_i| = |CS^{pr}_i| + |CS_{i+1}/\mathrm{parent}|. \tag{3.15}$$

Figure 3.1: Histogram vs. sampling-based model of the context set’s level distribution compared with the actual distribution (tested on shakespeare.xml [Bos]). Panel (a): uniform random node distribution.

It remains to aggregate the statistical information related to each level so that parent estimation can be performed within the scope of a single level. First experiments using only the overall number of nodes per level and the number of their parents have shown insufficient accuracy. It is therefore important to enhance the statistics by a “level-wise” construction of summarized fan-out groups, as described in the previous section. Algorithm 6 presents the process of ancestor estimation using these group statistics.

Algorithm 6: Ancestor Estimation

estimate_anc(cs_pr: array(num)) ≡
stat. data: g_frac: array(num), g_size: array(num), g_par: array(num)
begin
    result ← 0; level_res ← 0;
    for i ← height(T_D) to 1 do
        cs_i ← level_res + cs_pr[i];
        level_res ← 0;
        foreach group j of level i do
            level_res ← level_res + parents_of_group(cs_i ∗ g_frac[j], g_size[j], g_par[j]);
        result ← result + level_res;
    return result;
end
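Algorithm 6 can be transcribed into Python roughly as follows. Two points are assumptions rather than part of the text. First, `parents_of_group` is only a hypothetical stand-in for the probabilistic parent step model, which is defined elsewhere: it assumes a uniform fan-out within a group and estimates the expected number of distinct parents hit. Second, the per-group results of one level are summed, which is what the reset of `level_res` before the inner loop suggests. The dictionary-based layout of the statistics is likewise illustrative.

```python
def parents_of_group(n, g_size, g_par):
    """Hypothetical parent step model (assumption, not the original one).

    Expected number of distinct parents when n of the g_size nodes of a
    group with g_par parents are selected, assuming uniform fan-out.
    """
    if n <= 0 or g_size <= 0 or g_par <= 0:
        return 0.0
    n = min(n, g_size)
    fan_out = g_size / g_par
    return g_par * (1.0 - (1.0 - n / g_size) ** fan_out)

def estimate_anc(cs_pr, g_frac, g_size, g_par, height):
    """Level-wise ancestor estimation following the structure of Algorithm 6.

    cs_pr:  level -> number of pruned context nodes on that level
    g_frac, g_size, g_par: level -> list of per-group statistics
    """
    result = 0.0
    level_res = 0.0
    for i in range(height, 0, -1):
        # Context of level i: its own pruned nodes plus parents from below.
        cs_i = level_res + cs_pr.get(i, 0.0)
        level_res = 0.0
        for frac, size, par in zip(g_frac[i], g_size[i], g_par[i]):
            level_res += parents_of_group(cs_i * frac, size, par)
        result += level_res
    return result

# Hypothetical statistics: 4 nodes on level 2 with 2 parents on level 1,
# which in turn share a single parent (the root).
g_frac = {2: [1.0], 1: [1.0]}
g_size = {2: [4], 1: [2]}
g_par = {2: [2], 1: [1]}
estimate = estimate_anc({2: 4}, g_frac, g_size, g_par, height=2)
```

In this small example all four leaves are in the context set, so the estimate equals the exact number of ancestors: two nodes on level 1 plus the root.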

3.6.3 Tag-Name Specific Context Sets

Result size estimation on the parent axis has already revealed the necessity of using tag-name related statistical information if the context set only contains nodes with the same tag name. As we have divided the process of ancestor estimation into multiple estimation steps on the parent axis, we can reuse the presented methods and the collected statistics.

In Eq. 3.15, the context nodes belonging to a certain level are divided into the two sets $CS^{pr}_i$ and $CS_{i+1}/\mathrm{parent}$. Instead of adding both cardinalities before the calculation, we now treat both parts separately. Whereas parent estimation for the result size of the level below is performed using the level-related statistics as before, the $CS^{pr}_i$ part, i.e., the set of newly introduced context nodes having the specified tag name, is calculated using the statistics of its respective tag-name group.

It remains to reunite the separately evaluated parent step estimates of both parts. However, we cannot simply add up their results because the two parent sets, the union of all level related fan-out groups, on the one hand, and the parents of the tag-name group, on the other hand, overlap to a great extent. As overestimation occurs more frequently, it seems convenient to use the maximum of both single parent estimates as the overall result of the current level.
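One level of this tag-name specific variant could be sketched as below. All names are hypothetical, and the parent step model is passed in as a callable because its definition lies outside this section; `capped_parent_step` is a deliberately crude placeholder used only to make the example runnable.

```python
def combine_level_estimate(from_below, tag_nodes, level_groups,
                           tag_group, parent_step):
    """Estimate the parents of one level for a tag-name specific context set.

    from_below:   estimated parents propagated up from the level below
    tag_nodes:    pruned context nodes with the given tag name on this level
    level_groups: list of (frac, size, parents) fan-out groups of the level
    tag_group:    (size, parents) statistics of the tag-name group
    parent_step:  callable (n, size, parents) -> estimated parent count
    """
    # Part 1: nodes coming from the level below, using level statistics.
    level_part = sum(
        parent_step(from_below * frac, size, par)
        for frac, size, par in level_groups
    )
    # Part 2: newly introduced tag-name nodes, using tag-name statistics.
    tag_part = parent_step(tag_nodes, *tag_group)
    # The two parent sets overlap heavily, so take the maximum, not the sum.
    return max(level_part, tag_part)

def capped_parent_step(n, size, parents):
    """Placeholder model (assumption): one parent per node, capped."""
    return min(n, parents)

groups = [(0.5, 20, 4), (0.5, 20, 8)]
level_dominates = combine_level_estimate(6, 10, groups, (10, 5),
                                         capped_parent_step)
tag_dominates = combine_level_estimate(2, 10, groups, (10, 5),
                                       capped_parent_step)
```

Taking the maximum means the result for a level never falls below either single estimate, while avoiding the double counting that a plain sum would introduce.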