

4.4.5 Estimation Based Tests

When used for query plan optimization, cost models are run on estimated parameters rather than on measured ones. For the descendant and child steps, we are able to provide estimates for all required values. To get a first impression of the resulting quality losses, we repeated the accuracy tests for both axes, this time based on estimates only. The results are presented in Fig. 4.19.

As in the previous accuracy tests, the cost models still match the order of magnitude. On the other hand, result size estimation errors are even intensified due to their multiple occurrences within the cost functions. For instance, the descendant estimation, being the most error-prone result size estimation technique, is applied in the child step cost function both to obtain the number of touched data items and to determine |CSpr|, which is further used to calculate the number of jumps as well as the stack usage and the cache misses.

[Figure 4.19: Estimation-based modeling of child and descendant steps on D.xml, combined with the visualization of the underlying error when estimating |CSpr|. Plot of time in microseconds over test number; series: desc measures, est desc model, chl measures, est chl model, 1/50 of desc-pr deviation.]
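The effect of a single estimate entering a cost function in several places can be illustrated with a small sketch. The functional form and all coefficients below are invented for illustration and do not reproduce the thesis's calibrated child step model; the point is only that one erroneous estimate feeds the touched-items, jump, stack, and cache-miss terms simultaneously:

```python
# Toy cost function: |CS_pr| (cs_pr) feeds the jump, stack, and
# cache-miss terms; the touched-items count also stems from the same
# descendant estimator. Coefficients are hypothetical.

def child_step_cost(cs_pr, touched_items, w_jump=5.0, w_stack=1.0, w_miss=20.0):
    jumps = cs_pr                              # jumps grow with |CS_pr|
    stack = cs_pr                              # stack usage grows with |CS_pr|
    misses = 0.1 * touched_items + 0.5 * cs_pr # misses depend on both inputs
    return w_jump * jumps + w_stack * stack + w_miss * misses

true_cs_pr, true_touched = 1000, 4000
est_error = 1.5  # a 50% overestimate from the descendant estimator

exact = child_step_cost(true_cs_pr, true_touched)
# The same erroneous estimate enters every term of the cost function:
estimated = child_step_cost(true_cs_pr * est_error, true_touched * est_error)
print(f"relative cost error: {estimated / exact - 1:.2f}")  # -> 0.50
```

In this linear toy model the 50% input error carries through to the full cost; with terms that are nonlinear in |CSpr|, the deviation can grow further.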

4.5 Conclusion

We have undertaken a detailed examination of the XPath operations with respect to their data access and cache usage, complemented by an algorithmic run-time analysis.

With the focus on data access, we identified a generic access pattern based on the cost modeling approach of [Man02], which is parameterized by axis and context specific values. Thus, the same basic model can be adapted to serve cache miss estimation for all axis step operations. A concluding experimental study confirmed the reliability of the models developed for the second-level cache.

Furthermore, we have composed physical cost functions combining the data access and further CPU processing expenses. The latter have to be calibrated once for each algorithm to obtain the hardware-dependent factors. Different kinds of accuracy tests have demonstrated the cost functions' ability to match real execution times rather accurately when the calculation is based on measured parameters.
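The calibration idea can be sketched as follows. The CPU part of a cost function is a weighted sum of counted operations, and the hardware-dependent weights are fitted once from measurements; the counts and timings below are fabricated for illustration, and the one-parameter fit is an ordinary least-squares fit through the origin, not the thesis's actual calibration procedure:

```python
# Fit a single per-operation cost w minimizing sum((t - w*n)^2),
# where n are operation counts and t the measured run times.

def calibrate(op_counts, measured_times):
    num = sum(n * t for n, t in zip(op_counts, measured_times))
    den = sum(n * n for n in op_counts)
    return num / den

counts = [1000, 2000, 4000, 8000]        # counted operations per test run
times = [2.1, 3.9, 8.2, 15.8]            # microseconds, invented values
w = calibrate(counts, times)
print(f"cost per operation: {w:.5f} microsec")
```

Once such hardware-dependent factors are fixed for an algorithm, only the operation counts need to be predicted at optimization time.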

As cost modeling accompanied the development of the axis step operations, it has already yielded some improvements. In the case of the level-based axis steps, the stack-based approach enabling sequential access as well as the use of 1-byte level information were motivated by data access considerations. On the other hand, the detailed cost functions in their present state cannot be deployed for the purpose of query optimization. Although the actual calculation is simple and its execution takes just a few clock cycles, a large set of required parameters is not available at optimization time but depends on accurate estimations. In some cases our result size estimation techniques help to provide the needed values; in others, however, accurate estimations are still missing. Since query optimization only requires capturing rough proportionalities rather than precise predictions of execution times, it would be advisable to identify the most characteristic parts of the cost functions and to try to further reduce the needed information, even at the price of decreased accuracy.

Future Work

Obviously, this thesis leaves a large set of questions unanswered and has even revealed a number of new ones. Some of them are mentioned here, along with promising ideas for future improvements.

5.1 Improving the Evaluation of Step Expressions

Recent work on the efficient implementation of further XQuery operations, such as copying parts of the document tree for element construction, has led to new considerations concerning the basic encoding of the XML tree structure. Instead of storing pre-/postorder ranks, we could as well maintain the combination of preorder rank and descendant size ⟨pre(v), |v/descendant|⟩ for every node in the document. Recalling Eq. 2.9 makes us realize the equivalence between the two encodings with respect to the information about the XML structure they represent. We will not discuss the task of element creation here, but we would like to point out that such a change would also enable improvements in the introduced axis step algorithms. Whereas postorder conditions such as post(v) < post(w) could always be replaced by the similar condition pre(v) + |v/descendant| < pre(w), the advantage becomes apparent when considering the skipping techniques. Knowing the number of descendants of any node allows us to perform exact skipping even in those cases where we previously had to move the scanning cursor to the calculated lower bound of a descendant block. Hence, when scanning a void-BAT, data access could be further reduced to its minimum, touching only the relevant nodes.
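The core properties of the ⟨pre(v), |v/descendant|⟩ encoding can be made concrete with a minimal sketch. The five-node document below is invented for illustration; `size` maps a node's preorder rank to |v/descendant|:

```python
# Invented document: a(0) with children b(1) and e(4);
# b has children c(2) and d(3). Nodes are identified by preorder rank.
size = {0: 4, 1: 2, 2: 0, 3: 0, 4: 0}

def follows(w, v):
    """pre(v) + |v/descendant| < pre(w): w lies after v's whole subtree.
    This is the stated replacement for the postorder comparison."""
    return v + size[v] < w

def descendants(v):
    """Exact skipping: v's descendants occupy the contiguous preorder
    interval (pre(v), pre(v) + |v/descendant|] -- nothing outside it
    needs to be touched."""
    return list(range(v + 1, v + size[v] + 1))

print(descendants(1))   # c and d, the subtree below b -> [2, 3]
print(follows(4, 1))    # e comes after b's entire subtree -> True
```

Because the descendant block is a closed preorder interval, a scan can jump directly past it instead of probing a calculated lower bound.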

In the case of the child, parent, and sibling axes, it would be an interesting approach to exploit the available descendant size information in order to enable further skipping of irrelevant descendants. Although such algorithms would lose the advantage of scanning only the smaller level relation, they could serve as an alternative implementation that potentially shows significant advantages when applied to small-sized context sets. Furthermore, it should also be possible to use descendant size information for developing variants of the level-based algorithms that could work on sparse oid node sets as well. The latter would provide the opportunity to further extend query optimization to these axes.
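How descendant sizes could drive skipping on the child axis can be sketched as follows. This is a hypothetical illustration, not the thesis's level-based implementation: the first child of v sits at pre(v) + 1, and each further child is reached by jumping over the preceding child's entire subtree. The tiny document is the same invented five-node tree as above:

```python
# Invented document: a(0) with children b(1) and e(4);
# b has children c(2) and d(3). `size` maps preorder rank to |v/descendant|.
size = {0: 4, 1: 2, 2: 0, 3: 0, 4: 0}

def children(v):
    """Enumerate the children of v by subtree jumps, touching no node
    below the first level -- an alternative to scanning the level relation."""
    result = []
    c = v + 1            # first child, if v has descendants at all
    end = v + size[v]    # last preorder rank inside v's subtree
    while c <= end:
        result.append(c)
        c += size[c] + 1 # jump over c's whole subtree to the next sibling
    return result

print(children(0))   # b and e -> [1, 4]
print(children(1))   # c and d -> [2, 3]
```

Each iteration touches exactly one child, so the cost is proportional to the fanout rather than to the subtree size, which is precisely where small context sets would benefit.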

So far, we have not touched upon the field of predicate evaluation. However, for the efficient evaluation of complete step expressions, it will become essential to enlarge the scope of query processing so that it includes predicate evaluation as well.
