Query Optimization in XPath - Methods and Cost Models for XPath Query Processing in Main Memory

When looking for starting points for query optimization, we have to identify exchangeable parts within query plans that may differ in their execution performance but remain semantically equivalent. However, path expressions are neither commutative nor associative with respect to the ordering of their steps.

[WJLY03] resolves the problem by regarding an axis step as a containment join of the two sets CS and NS, redefining CS, NS, and RS as sets of node pairs; e.g., every c ∈ CS is a node tuple of the form ⟨c1, c2⟩. The result of an axis step can then be specified as

\[ S(CS, NS) = \{\langle c_1, n_2\rangle \mid \langle c_1, c_2\rangle \in CS,\ \langle n_1, n_2\rangle \in NS,\ n_1 \text{ on axis } S \text{ of } c_2\}. \]

Any path expression CS/S0/S1/.../Sn performed on document D is translated to

\[ S_n(\dots S_1(S_0(CS_0, NS_0), NS_0) \dots, NS_0), \quad CS_0 = \{\langle c, c\rangle \mid c \in CS\}, \quad NS_0 = \{\langle n, n\rangle \mid n \in D\}. \]

The use of these node pairs makes it possible to rewrite the query plans that emerge from a path expression. For any two step expressions S_i and S_{i+1}, where S_{i+1} directly follows S_i in the path under consideration, the step evaluation can be exchanged as follows:

\[ S_{i+1}(S_i(CS, NS), NS) \equiv S_i(CS, S_{i+1}(NS, NS)). \]

To meet XPath semantics, the final result of the path expression has to be post-processed to yield a sorted, duplicate-free node sequence.
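The exchange rule can be checked on a small example. The following Python sketch is only an illustration under stated assumptions: the toy tree `PARENT`, the helper names `step`, `lift`, and `finish`, and the choice of the child and descendant axes are hypothetical and not taken from [WJLY03].

```python
# Hypothetical toy tree given as child -> parent; ids follow document order.
PARENT = {1: 0, 2: 1, 3: 1, 4: 0}
NODES = [0, 1, 2, 3, 4]

def child(c, n):
    """True if n is on the child axis of c."""
    return PARENT.get(n) == c

def descendant(c, n):
    """True if n is on the descendant axis of c."""
    p = PARENT.get(n)
    while p is not None:
        if p == c:
            return True
        p = PARENT.get(p)
    return False

def step(axis, cs, ns):
    """S(CS, NS): emit <c1, n2> whenever n1 lies on the axis of c2."""
    return {(c1, n2) for (c1, c2) in cs for (n1, n2) in ns if axis(c2, n1)}

def lift(nodes):
    """Turn a plain node set into identity pairs <v, v>."""
    return {(v, v) for v in nodes}

def finish(pairs):
    """Post-processing: sorted, duplicate-free node sequence."""
    return sorted({n for (_, n) in pairs})

cs0, ns0 = lift([0]), lift(NODES)
left = step(descendant, step(child, cs0, ns0), ns0)    # S1(S0(CS, NS), NS)
right = step(child, cs0, step(descendant, ns0, ns0))   # S0(CS, S1(NS, NS))
```

Both plans yield the same pairs here, but the inner join of the second plan materializes every ancestor/descendant pair of the document, which hints at the intermediate result sizes node pairs can cause.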

Although this approach enables query optimization to a great extent, the node pairs introduce a significant overhead in terms of intermediate result sizes, especially for the major XPath axes. Whereas the previously described evaluation process defined in [BBC⁺03] requires duplicate elimination for the result of each step expression, the node pairs have to store every combination of nodes related by the specified axis, which contradicts the aim of finding query plans with smaller intermediate results.

A different way of query optimization is suggested by the authors of [OMFB02]. They searched for symmetries in the semantics of path expressions and derived a large set of path equivalences. Applying simple rewriting rules makes it possible to exchange parts of the path for equivalent expressions that can be evaluated faster. The symmetries found are mainly concerned with rewriting reverse axes into forward axes, since the former cause far higher evaluation costs in SAX-like, stream-based XML processing. In the context of DBMS-based querying, however, the differences between forward and reverse axes become negligible. Furthermore, the suggested path rewriting often leads to a higher complexity of the resulting expressions. For example, reverse axes are often replaced by forward axes combined with further predicates. Thus, the approach would even cause additional costs for predicate evaluation in our case.

Figure 1.2: Query plans for the example path expression, changing the evaluation order of axis step S and node test N inside the step expressions.

Recent research has generated more ideas on XPath query optimization. For instance, [KG02] performs logical optimization of path queries using information obtained from the corresponding DTDs. However, all these approaches consider the best query plan only in terms of an abstractly defined optimality. Without any knowledge about the actual implementation of the algorithms used for path evaluation, they cannot provide a concrete cost model for a given query plan.

We will thus pursue the long-term objective of query optimization from the opposite side. As described in the outline of this thesis (Sec. 1.1), we start our analysis at the level of the algorithms, with the aim of providing detailed physical cost models for each XPath operation introduced. A query optimizer may later apply these models to choose between possible query plans in any given situation.

As the above described methods of path rewriting and containment join ordering would not be beneficial in our case, we constrain the optimization problem to identifying the best query plan for the evaluation of single step expressions. As with entire path expressions, the order in which the axis step and its associated node test are evaluated can become an optimization issue. Figure 1.2 depicts two semantically equivalent query plans executing axis step S and node test N. The second version can be regarded as a typical selection push-down, which reduces the cardinality of the node set before the axis step is evaluated. Intuitively, we would expect the node test push-down to perform better in every case. Surprisingly, however, our further analysis will identify considerable advantages of the first version.

The XPath Accelerator and its Axis Evaluation

As mentioned before, this study is part of the Pathfinder project, whose principal aim is the construction of an XQuery engine. The project's underlying data model, the so-called XPath accelerator described in detail in [GKT04], is the first subject of this section.

The goal of this thesis is to exploit the performance of a relational database system for XPath processing. However, using an RDBMS means working on relational tables, which do not allow a natural storage of tree-shaped XML data. To resolve this problem, an encoding is needed that maps the structure of an XML document and also supports efficient querying along all XPath axes. Recent work on the subject has shown that the pre/post plane [Gru02] appears to be a very efficient XML encoding, at least for query-intensive usage.

In a nutshell, all nodes of the XML tree are labeled with preorder and postorder values. These two enumerations of the nodes are sufficient to represent the tree structure of the document. To be more precise, we can define an order on the XML sequence of nodes. If a, b are nodes in an XML document D, then

\[ a < b, \text{ if } a \text{ appears before } b \text{ in a sequential read of } D \qquad (2.1) \]

For element nodes whose start tag is separated from the end tag, only the start tag is taken into account. This order is called document order [FMM⁺03].

The enumeration of nodes in document order assigns an integer value to every node v ∈ D, called the preorder value pre(v).

If the end tag is considered instead of the start tag, a similar order can be defined on the XML node sequence. Again, the enumeration according to that order assigns an integer value to every node v ∈ D, the postorder value post(v).

With the above definitions of pre- and postorder, we can derive the respective values from the textual representation of D and store them in a simple relational table containing the tuples ⟨pre(v), post(v)⟩ for every node v ∈ D. Figure 2.1 shows the transformation of a small sample document into a pre/post table. This process can be executed efficiently with the help of a SAX [SAX] based parser. The SAX events startElement and endElement, in combination with a stack, suffice to build the pre/post table within one sequential read, and the stack never contains more elements than height(T_D), the height of the XML tree T_D corresponding to D. See [Gru02] for more detail on the implementation of the SAX callback procedures.
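A minimal sketch of such a one-pass construction, using Python's standard `xml.sax` module, might look as follows. It is not the callback code of [Gru02]; the class name `PrePostHandler`, the function `pre_post_table`, and the restriction to element nodes with ranks starting at 0 are assumptions made for illustration.

```python
import xml.sax

class PrePostHandler(xml.sax.ContentHandler):
    """Collects (pre, post, tag) for element nodes in one sequential read."""

    def __init__(self):
        super().__init__()
        self.pre = 0          # next preorder rank (assigned at the start tag)
        self.post = 0         # next postorder rank (assigned at the end tag)
        self.stack = []       # currently open elements as (pre, tag)
        self.table = []       # finished rows (pre, post, tag)
        self.max_depth = 0    # largest stack size seen so far

    def startElement(self, name, attrs):
        self.stack.append((self.pre, name))
        self.pre += 1
        self.max_depth = max(self.max_depth, len(self.stack))

    def endElement(self, name):
        pre, tag = self.stack.pop()
        self.table.append((pre, self.post, tag))
        self.post += 1

def pre_post_table(xml_text):
    """Return the pre/post table sorted by pre, plus the maximum stack depth."""
    handler = PrePostHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return sorted(handler.table), handler.max_depth

table, depth = pre_post_table("<a><b><c/></b><d/></a>")
# table == [(0, 3, 'a'), (1, 1, 'b'), (2, 0, 'c'), (3, 2, 'd')], depth == 3
```

The recorded stack depth never exceeds the number of nested open elements, which matches the bound on the stack size stated above.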

Figure 2.1: Textual representation and pre/post table for an XML document.

For a complete database storage of XML documents, additional node-specific data such as tag names and node kinds has to be collected as well. Since this data belongs to each particular node, the pre/post table can easily be extended to store these additional attributes. Either pre- or postorder values may be chosen as primary keys because of their uniqueness. A relation containing this data may look like this:

pre post tag-name text kind

Notice that the described mapping between the pre/post table and the textual representation of D is bijective, meaning that the XML document structure can be restored completely from the pre/post table in the database.
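To illustrate the reverse direction, the following Python sketch rebuilds the element structure from rows of the form (pre, post, tag). The function name `rebuild` and the restriction to element nodes without text or attributes are assumptions made to keep the example short.

```python
def rebuild(table):
    """Serialize the element structure encoded by (pre, post, tag) rows."""
    out = []
    stack = []  # open elements as (post, tag)
    for pre, post, tag in sorted(table):  # visit nodes in document order
        # An open element with a smaller postorder rank has already ended,
        # so it cannot be an ancestor of the current node: close it.
        while stack and stack[-1][0] < post:
            out.append(f"</{stack.pop()[1]}>")
        out.append(f"<{tag}>")
        stack.append((post, tag))
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(out)

rows = [(0, 3, "a"), (1, 1, "b"), (2, 0, "c"), (3, 2, "d")]
text = rebuild(rows)
# text == "<a><b><c></c></b><d></d></a>"
```

The loop relies only on the two ranks: a node's parent is the nearest preceding node in preorder whose postorder rank is larger.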

2.1 Main Memory DB specific Adaptations of the Data Model

The XML encoding introduced so far can be implemented in any relational database system. A more detailed description of the data model actually used in the Pathfinder project includes adaptations that are specific to main memory databases. Our chosen MMDB system, Monet, comes with the restriction that it supports only binary tables (BATs) on the physical level of storage. Relations with more than two attributes have to be fragmented vertically. Setting up the XPath accelerator on the basis of Monet thus means that the single table containing all document data first has to be split into several BATs, one for each attribute. The primary key has to be maintained in each of these tables. Preorder values are chosen for this purpose: being dense and ascending integers, they are suitably represented by a void column that causes no additional storage overhead. Table 2.1 shows the fully fragmented pre/post relation.
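The idea of vertical fragmentation with an implicit key can be sketched in Python. This is not Monet code: the function names `fragment` and `lookup` are hypothetical, and a plain list whose position stands in for the dense preorder rank is only an analogy to a void column.

```python
# Hypothetical rows of the wide relation: (pre, post, tag).
rows = [(0, 3, "a"), (1, 1, "b"), (2, 0, "c"), (3, 2, "d")]

def fragment(rows, attributes):
    """Split a relation keyed by dense preorder ranks into one column per attribute."""
    rows = sorted(rows)
    # The key is never stored; this only works if pre is dense and starts at 0.
    assert [r[0] for r in rows] == list(range(len(rows))), "pre must be dense from 0"
    return {name: [r[i + 1] for r in rows] for i, name in enumerate(attributes)}

def lookup(bats, pre):
    """Reassemble one tuple; the list position plays the role of the key."""
    return (pre,) + tuple(column[pre] for column in bats.values())

bats = fragment(rows, ["post", "tag"])
# bats == {"post": [3, 1, 0, 2], "tag": ["a", "b", "c", "d"]}
```

Because the key is implied by the position, each fragment stores only its attribute values, and a tuple is reassembled by positional access.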

BAT          Head  Tail     Content
doc_prepost  pre   post     Preorder and postorder ranks
doc_level    pre   level    Preorder ID of a node and its level
doc_tag      pre   name     Preorder ID and tag-name of all element nodes
doc_text     pre   text     Preorder ID and text value of all text nodes
doc_pi       pre   pi       Preorder ID and value of all processing instruction nodes
doc_com      pre   comment  Preorder ID and value of comment nodes
doc_aname    attr  name     Attribute ID and name of attribute
doc_avalue   attr  value    Attribute ID and value of attribute
doc_aowner   attr  owner    Attribute ID and preorder ID of its owner node

Table 2.1: Representation of the XML document with BATs.

This data model is aimed at enabling a highly efficient evaluation of all XPath axes. The discussion of this issue follows in the next section, but to complete the introduction of the data model it is important to mention that fast support for the child/parent and sibling axes requires another table, doc_level, holding the preorder identifier of a node and its level in the document tree:

\[ \mathit{level}(v) = |v/\mathtt{ancestor}| \qquad (2.2) \]

Notice that the structural redundancy introduced by the level BAT leads to only a small storage overhead, as 1-byte integer values suffice to hold the level information. For typical XML instances, the number of hierarchical levels remains quite small. The well-known shakespeare.xml [Bos], for instance, an XML document containing all plays of Shakespeare, has a tree height of 7. Thus, we can at least expect height(T_D) < 255.
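As an illustration of Equation (2.2), the level of every node can be derived from the pre/post ranks alone and stored in one byte per node. The sketch below assumes dense preorder ranks starting at 0; the function name `levels` and the use of a Python `bytearray` as a stand-in for the level BAT are hypothetical.

```python
def levels(prepost):
    """Return one byte per node (indexed by pre) holding its number of ancestors."""
    out = bytearray(len(prepost))
    stack = []  # postorder ranks of the currently open ancestors
    for pre, post in sorted(prepost):  # visit nodes in document order
        # Drop nodes that ended before the current one; they are not ancestors.
        while stack and stack[-1] < post:
            stack.pop()
        out[pre] = len(stack)  # raises ValueError if a level exceeds 255
        stack.append(post)
    return out

level = levels([(0, 3), (1, 1), (2, 0), (3, 2)])
# list(level) == [0, 1, 2, 1]
```

A level above 255 would be rejected by the assignment into the byte array, which mirrors the assumption that the tree height stays below that bound.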

For reasons of evaluation performance, attribute nodes are numbered separately. For most axis steps, XPath semantics excludes attributes from the result. The cost of selections that filter out attributes is thus saved if axis step operations can work directly on attribute-free tables. Nevertheless, attributes and other nodes may reside in the same context sequence. It is thus important to choose a numbering scheme for attributes that uses the same data type but does not interfere with the preorder numbering of other nodes. A possible solution would be to mark attributes by a leading indicator bit.
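One way to read the indicator-bit idea is sketched below in Python. The 32-bit identifier width, the constant `ATTR_FLAG`, and the helper names are assumptions for illustration; the source does not fix these details.

```python
ATTR_FLAG = 1 << 31  # assumption: 32-bit unsigned ids, top bit marks attributes

def attr_id(number):
    """Encode the separately counted attribute number as a flagged identifier."""
    if not 0 <= number < ATTR_FLAG:
        raise ValueError("attribute number does not fit below the flag bit")
    return number | ATTR_FLAG

def is_attr(identifier):
    """True if the identifier denotes an attribute rather than a preorder rank."""
    return bool(identifier & ATTR_FLAG)

def attr_number(identifier):
    """Strip the flag bit to recover the attribute number."""
    return identifier & (ATTR_FLAG - 1)

# A mixed context sequence: preorder ranks 3 and 7 plus attribute number 0.
context = [3, attr_id(0), 7]
non_attributes = [i for i in context if not is_attr(i)]
# non_attributes == [3, 7]
```

Both kinds of identifiers share one integer type, and a single bit test separates them when they meet in the same context sequence.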

Experiments have shown that the overall storage volume of the database increases by a factor of ≈1.5 in comparison to the textual representation of the XML document.
