
3.10 Experimental Studies

3.10.4 Storage Requirements

Table 3.3 presents an overview of the histogram tables generated for all introduced estimation methods. For all three test documents, the number of entries in the histograms remains quite small. In fact, the space required for their maintenance in the Monet system is exceeded by the meta-data belonging to each histogram BAT.

axis       histogram / description                                 number of entries
                                                                   D     dblp   shakespeare

general statistics
parent     fan-out group size                                      50    73     214
parent     fan-out group's parents                                 50    73     214
parent     summarized fan-out group size                            5     4       4
parent     summarized fan-out group's parents                       5     4       4
ancestor   summarized level group size                             59    24      37
ancestor   summarized level group's parents                        59    24      37
ancestor   level model of leaf nodes                               13     6       8

tag-name specific statistics
all        tag-name group size                                     74    29      22
parent     tag-name group's parents                                74    29      22
child      tag-name group's children                               74    29      22
sibling    tag-name group's fol-siblings                           74    29      22
sibling    tag-name group's pre-siblings                           74    29      22
sibling    average number of fol-siblings of a single node
           within the tag-name groups                              74    29      22
sibling    average number of pre-siblings of a single node
           within the tag-name groups                              74    29      22

Table 3.3: Overview of the collected statistics. For each histogram table, the number of generated entries is shown.

Looking at the generated entry numbers, all histogram sizes except the one for the fan-out groups depend on the number of tag-names or levels in the documents. Using summarization for the fan-out groups reduces the variation of their cardinalities to fewer than 10 distinct values, which explains the observed constant estimation performance.

The ancestor pruned level model is only required if the histogram-based approach is applied for ancestor estimation on arbitrary context sets.

3.11 Conclusion

We have developed accurate estimation methods for all XPath axes. In contrast to applying standard join and selection size estimation, the proposed techniques are tree-aware, in that they exploit the characteristics of the respective axes. Furthermore, all estimations are calculated in a small fraction of the time needed to evaluate the corresponding axis and require only minimal overhead for generating and storing statistical information.

Although the approaches for the particular axes differ widely from one another, using sampling, parametric, or histogram-based techniques, parent estimation has become the central starting point for several other methods, such as ancestor, following-sibling, and preceding-sibling estimation.

In all cases we additionally provided a specialized estimation variant that exploits information about tag-name restrictions of the context set. Since preceding name tests are often encountered during the evaluation of entire path expressions, it was important to make use of this available information, especially since equally named nodes typically have common, but sometimes exceptional, characteristics.

The development of the proposed techniques was always accompanied by extensive testing of the achieved accuracy. Ranging from randomly composed context sets to those resulting from name tests, we examined both the average and the worst-case deviation. Whereas all methods have shown good average prediction quality, the worst-case deviation often suffers from difficulties inherent in XPath, such as the fact that a single context node may itself have more descendants or followings than the union of all others.

When applying the methods for strategic query optimization, it is important to note that at least the sampling techniques always require the presence of the actual context set, which restricts optimization to single step expressions. However, we regard this limitation as admissible, since changes within the non-associative order of step expressions incur overhead for storing intermediate results, as shown in the introduction (Sec. 1.4).

4 Access Patterns and Cost Models

Cost modeling for the XPath operations does not differ conceptually from cost modeling for other database operations. In the domain of main memory database systems, [Man02] is the only work that provides an applicable framework for developing physical cost models for arbitrary kinds of database operations. [ADHW99] also performed a detailed analysis of typical database operations with respect to their CPU processing, but does not provide a general method for deriving cost functions. The following analysis of the introduced XPath operations is therefore based directly on the cost modeling process described in [Man02].

Physical cost models enable rough estimations of the time required for the execution of an operation. For application of the models, usually two kinds of information are needed (a small data-structure sketch follows the list below):

• Hardware characteristics, such as cache sizes and their respective latencies.

• Cardinalities of the processed data: sizes of each input operand and of the output.
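As an illustration, these two kinds of information could be gathered in plain parameter records like the following. This is only a minimal sketch; the field names and the fixed number of memory levels are assumptions made here for illustration, not part of [Man02].

```c
/* Hypothetical parameter records for applying a physical cost model.
 * All names and the fixed number of memory levels are illustrative only. */
#define N_LEVELS 3                     /* e.g. L1 cache, L2 cache, main memory   */

typedef struct {
    long   capacity[N_LEVELS];         /* size of each memory level in bytes     */
    long   line_size[N_LEVELS];        /* cache-line / page size in bytes        */
    double lat_seq[N_LEVELS];          /* sequential miss latency per level (ns) */
    double lat_rand[N_LEVELS];         /* random miss latency per level (ns)     */
} hardware_t;

typedef struct {
    long input_card[2];                /* cardinality of each input operand      */
    long output_card;                  /* cardinality of the produced output     */
} cardinalities_t;
```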

In order to generate a model, different cost factors have to be considered. Mainly, memory access costs $T_{Mem}$ and CPU calculation costs $T_{CPU}$ have to be distinguished. In disk-based or distributed databases, disk I/O and network communication costs, respectively, would also have to be taken into account. Because for typical database operations $T_{Mem}$ dominates $T_{CPU}$, this analysis concentrates on modeling the cache utilization.
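Expressed as code, this decomposition of the total cost might look as follows. The function name is ours, and the disk and network terms are included only to indicate how the model would extend beyond the main-memory setting.

```c
/* Sketch: total execution cost of an operation as the sum of its cost factors.
 * In a main-memory DBMS only t_mem and t_cpu apply; t_io and t_net would be
 * added for disk-based or distributed systems.  T_Mem typically dominates. */
double total_cost(double t_mem, double t_cpu, double t_io, double t_net)
{
    return t_mem + t_cpu + t_io + t_net;
}
```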

We start this chapter by introducing the terms and notation in the field of hierarchical memory access, thus outlining the generic cost model approach of [Man02].

As the memory access patterns are similar for all XPath operations, the following sections aim at obtaining a generic model rather than starting the analysis for each axis separately. The remaining differences among the axes, as well as the CPU cost calibration, are discussed subsequently.

4.1 Hierarchical Memory Access Models

Unlike the main memory access model of early MMDBMS research [GMS92], we cannot consider every request for a main memory address a random access and treat all data accesses equally with respect to the time it takes the CPU to fetch them from memory. The hierarchical system of cache levels in modern computers requires us to distinguish how many caches the requested data has to pass on its way to the CPU.


If a data item is available in the first-level cache L1, CPU access is considerably faster than for any data that has to be fetched from main memory, causing misses on all cache levels.

Just as conventional disk-based database systems estimate the number of I/O operations an operation requires, the basic approach of [Man02] is to predict the number of cache misses $M_i$ on each cache level $L_i$, $i = 1 \ldots N$. With further information about the cache miss latencies $l_i$ of the system, the total memory access costs can be calculated by

$$ T_{Mem} = \sum_{i=1}^{N} (M_i \cdot l_i). \qquad (4.1) $$
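Translated into code, Equation 4.1 is a simple weighted sum over the cache levels. The sketch below reuses the illustrative N_LEVELS constant from the record above and assumes the per-level miss counts and latencies are already known.

```c
/* T_Mem according to Equation 4.1: sum over all levels of M_i * l_i. */
double t_mem_simple(const double misses[N_LEVELS],   /* M_i per cache level */
                    const double latency[N_LEVELS])  /* l_i per cache level */
{
    double t_mem = 0.0;
    for (int i = 0; i < N_LEVELS; i++)
        t_mem += misses[i] * latency[i];
    return t_mem;
}
```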

This simple model assumes constant miss latencies for all cache misses on the same level. Modern CPUs, however, are able to recognize sequential data access patterns and, in this case, enhance the access bandwidth by prefetching the expected memory addresses into the caches. Modeling this feature therefore requires distinguishing between sequential and random cache misses, denoted by $M_i^s$ and $M_i^r$, and scoring them with the respective latencies for random and sequential access, $l_i^r$ and $l_i^s$:

$$ T_{Mem} = \sum_{i=1}^{N} \left( M_i^s \cdot l_i^s + M_i^r \cdot l_i^r \right). \qquad (4.2) $$

The latencies $l_i^r$ and $l_i^s$, as well as other cache characteristics, can be collected by calibrating the hardware in advance. For instance, the Calibrator tool [Man] allows these parameters to be obtained (see [Man02] for a detailed explanation of its functionality).
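Equation 4.2 merely refines the previous sum by scoring sequential and random misses with separate latencies. A corresponding sketch, again with illustrative names and with latencies as they might be delivered by a calibration run, could look as follows.

```c
/* T_Mem according to Equation 4.2:
 * sum over all levels of  M_i^s * l_i^s  +  M_i^r * l_i^r.
 * Sequential misses are cheaper because the CPU prefetches the expected
 * cache lines; the latency arrays would be filled by calibrating the
 * target hardware in advance, e.g. with the Calibrator tool.            */
double t_mem_prefetch(const double seq_misses[N_LEVELS],   /* M_i^s */
                      const double rand_misses[N_LEVELS],  /* M_i^r */
                      const double lat_seq[N_LEVELS],      /* l_i^s */
                      const double lat_rand[N_LEVELS])     /* l_i^r */
{
    double t_mem = 0.0;
    for (int i = 0; i < N_LEVELS; i++)
        t_mem += seq_misses[i] * lat_seq[i] + rand_misses[i] * lat_rand[i];
    return t_mem;
}
```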