Path Traversal - Location Paths - Storing and Querying Large XML Instances

3.3 Optimizations

3.4.2 Location Paths

3.4.2.2 Path Traversal

Figure 3.7: Class diagram: location path expressions

Figure 3.7 depicts the most important expressions of location paths. In this work, a LocationPath is specified as an expression with an optional root Expression and several AxisSteps, and an AxisStep consists of an Axis, a NodeTest and zero or more Expressions as predicates. Both the conventional and iterative versions of the LocationPath and AxisStep

3.4. Evaluation

Algorithm 25 DiskNode iterators
Require: DATA := database reference, PRE := pre value, P := PRE (pre cursor)
Child.Next() : Node


expressions (prefixed with Eval and Iter) have an Iterator() method, which returns the next evaluated node, or null if no more results are found. The Iterator() method of the Axis enumeration returns an iterator for the specified Node. NodeTest.Matches() returns true if the specified node complies with the test. A test may accept all nodes (AnyKindTest), nodes of a specific Kind (KindTest), or elements and attributes with a certain name (NameTest)¹³. If the test is successful, the predicates are evaluated.
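The three kinds of node test can be illustrated with a minimal Python sketch. The `Node` record, the string-valued `kind` field and the lower-case `matches()` method are assumptions made for this illustration; they are not taken from Figure 3.7.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical node record: a kind plus an optional name."""
    kind: str                   # e.g. "element", "attribute", "text"
    name: Optional[str] = None  # only elements and attributes carry a name


class AnyKindTest:
    """Accepts every node, like node()."""
    def matches(self, node: Node) -> bool:
        return True


@dataclass(frozen=True)
class KindTest:
    """Accepts nodes of one specific kind, like text()."""
    kind: str

    def matches(self, node: Node) -> bool:
        return node.kind == self.kind


@dataclass(frozen=True)
class NameTest:
    """Accepts elements or attributes with a certain name."""
    kind: str  # "element" or "attribute"
    name: str

    def matches(self, node: Node) -> bool:
        return node.kind == self.kind and node.name == self.name


nodes = [Node("element", "a"), Node("text"), Node("attribute", "id")]
any_hits = [n for n in nodes if AnyKindTest().matches(n)]
text_hits = [n for n in nodes if KindTest("text").matches(n)]
name_hits = [n for n in nodes if NameTest("element", "a").matches(n)]
```

Wildcard name tests and the more advanced tests mentioned in the footnote are left out of this sketch.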

We will now take a closer look at the individual expressions, starting bottom-up with the axis implementations. Both database (DiskNode) and constructed (MemNode) nodes offer particular implementations for all XPath axes: while the memory-based iterators are similar to conventional tree traversals, the database variants have been derived from the Staircase Join algorithms and, hence, are more interesting in this context.

Algorithm 25 lists the Next() methods of the iterators of 10 of the 12 XPath axes, all of which can be implemented without considerable effort. As indicated in the last section, all nodes are traversed one by one. A simple integer (P) references the current pre value.

null is returned if an axis will return no more nodes. Otherwise, P is updated, and a new Node instance is created and returned. For some axes, which do not return a self reference, the pre cursor needs to be initialized differently. As an example, in the prolog of the Following.Next() method, the descendants of the initial node are skipped; after that, the cursor is incremented by the node's attribute size (asize). All algorithms are optimal in the sense that only the relevant nodes are touched (in contrast, e.g., the pre/post/level encoding requires a traversal of all descendants to find child nodes).
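The cursor-based traversal can be sketched in Python for the child, descendant and following axes. The table layout is an assumption of this sketch: the list index serves as pre value, `size` counts the nodes of a subtree including the node itself and its attributes, and `asize` is the number of attributes plus one. The iterators return pre values rather than Node instances, and none of this is copied from Algorithm 25.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    kind: str
    size: int   # assumed: nodes in the subtree, incl. the node and attributes
    asize: int  # assumed: number of attributes plus one


class ChildIter:
    def __init__(self, table: List[Row], pre: int) -> None:
        self.table = table
        self.p = pre + table[pre].asize   # skip the node and its attributes
        self.end = pre + table[pre].size  # first pre value behind the subtree

    def next(self) -> Optional[int]:
        if self.p >= self.end:
            return None
        pre = self.p
        self.p += self.table[pre].size    # jump over the child's subtree
        return pre


class DescendantIter(ChildIter):
    def next(self) -> Optional[int]:
        if self.p >= self.end:
            return None
        pre = self.p
        self.p += self.table[pre].asize   # step over attributes only
        return pre


class FollowingIter:
    def __init__(self, table: List[Row], pre: int) -> None:
        self.table = table
        self.p = pre + table[pre].size    # prolog: skip all descendants

    def next(self) -> Optional[int]:
        if self.p >= len(self.table):
            return None
        pre = self.p
        self.p += self.table[pre].asize   # step over attributes only
        return pre


def drain(it) -> List[int]:
    """Collects all pre values until the iterator returns None."""
    result = []
    while (pre := it.next()) is not None:
        result.append(pre)
    return result


# <a id=".."><b><c/></b>text<d x=".."/></a>
TABLE = [
    Row("element", 7, 2),    # 0: a
    Row("attribute", 1, 1),  # 1: @id
    Row("element", 2, 1),    # 2: b
    Row("element", 1, 1),    # 3: c
    Row("text", 1, 1),       # 4: text
    Row("element", 2, 2),    # 5: d
    Row("attribute", 1, 1),  # 6: @x
]
children = drain(ChildIter(TABLE, 0))
descendants = drain(DescendantIter(TABLE, 0))
following = drain(FollowingIter(TABLE, 2))
```

In this sketch, the child iterator never visits pre value 3, because it jumps from b directly to the text node by adding the size of b.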

The two missing axes are preceding and preceding-sibling. As the nodes of the axes are to be returned in reverse document order (as is the case with the ancestor axes), and as no direct reference to left siblings is available in our encoding, a NodeIterator is used to cache the results in ascending order, and then return them in backward direction.

Even so, only those nodes that contribute to the final result are touched. To avoid caching, the database table can also be traversed backwards, with every touched node matched against the axis. A reverse traversal is much slower, however, as many irrelevant nodes are touched (including attributes, which are never addressed by the preceding axes). Moreover, the usual prefetching strategies of hard disks are optimized for forward reads. As the preceding and following axes are rarely used in practice, implementation details are skipped, and the reader is referred to the source code of BASEX [Grü10].
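The caching idea for preceding-sibling can still be illustrated with a minimal Python sketch. It is not the BASEX implementation: the `parent` column, the meaning of `size` and `asize` (subtree size including the node, and attribute count plus one) and the class name are assumptions made for this example.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    kind: str
    size: int              # assumed: nodes in the subtree, incl. the node
    asize: int             # assumed: number of attributes plus one
    parent: Optional[int]  # assumed: pre value of the parent, None for the root


class PrecedingSiblingIter:
    def __init__(self, table: List[Row], pre: int) -> None:
        self.cache: List[int] = []
        parent = table[pre].parent
        if parent is not None:
            # Walk the children of the parent in ascending order and
            # cache every sibling found before the initial node.
            p = parent + table[parent].asize
            while p < pre:
                self.cache.append(p)
                p += table[p].size

    def next(self) -> Optional[int]:
        # Return the cached siblings in backward direction.
        return self.cache.pop() if self.cache else None


# <a id=".."><b><c/></b>text<d x=".."/></a>
TABLE = [
    Row("element", 7, 2, None),  # 0: a
    Row("attribute", 1, 1, 0),   # 1: @id
    Row("element", 2, 1, 0),     # 2: b
    Row("element", 1, 1, 2),     # 3: c
    Row("text", 1, 1, 0),        # 4: text
    Row("element", 2, 2, 0),     # 5: d
    Row("attribute", 1, 1, 5),   # 6: @x
]
it = PrecedingSiblingIter(TABLE, 5)
siblings = [it.next(), it.next(), it.next()]
```

Only the siblings b and the text node are visited; the descendant c and the attributes are skipped by the size-based jumps.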

¹³ The XQuery specification defines more advanced tests for element, attribute and document nodes, which have been excluded from this overview.

On the next, higher level, the IterStep expression loops through all nodes that are returned by the axis iterator. Its Next() method has many similarities with the Filter variant, which was presented in 3.4.1.3. Algorithm 26 will only yield valid results if at most one positional test is specified, which must additionally be placed as the first predicate. Note that this limitation is not intrinsic to iterative processing; instead, it was introduced to simplify the presented pseudo-code. If a separate context position is managed and cached for each predicate, the algorithm will be able to process arbitrary positional predicates.

Algorithm 26 IterStep.Iterator.Next() : Node
Require:
  AXIS := XPath axis
  TEST := node test
  PREDS := filter predicates
  CONTEXT := query context
  ITERATOR := node iterator, generated from the AXIS and input node
  POS := 0 (current context position)

loop
  POS := POS + 1
  node := ITERATOR.Next()
  if node = null then
    return null
  else if TEST.Matches(node) then
    CONTEXT.VALUE := node
    CONTEXT.POS := POS
    for pred in PREDS do
      break if the truth value of pred is false
    end for
    return node if all predicate tests were successful
  end if
end loop

In contrast to the filter expression, the step iterator additionally performs the node test before the predicates are considered. Moreover, the context item and position are not reset, as this is done once for all steps by the path expression.
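The step iterator can be sketched in Python as follows. The `Context` class, the tuple-based nodes and the use of plain callables for the node test and the predicates are assumptions of this sketch. It also differs from the listing in one respect: the position is advanced only for nodes that pass the node test, which is how XPath counts positions within a step, whereas the listing increments POS at the top of the loop.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Node = Tuple[str, int]  # (name, id): a stand-in for real nodes


class Context:
    """Hypothetical query context holding the context item and position."""
    def __init__(self) -> None:
        self.value: Optional[Node] = None
        self.pos: int = 0


class IterStep:
    def __init__(self,
                 axis_nodes: Iterable[Node],
                 test: Callable[[Node], bool],
                 preds: List[Callable[[Context], bool]],
                 context: Context) -> None:
        self.iterator = iter(axis_nodes)  # nodes delivered by the axis
        self.test = test
        self.preds = preds
        self.context = context
        self.pos = 0

    def next(self) -> Optional[Node]:
        while True:
            node = next(self.iterator, None)
            if node is None:
                return None               # axis is exhausted
            if self.test(node):
                self.pos += 1             # assumption: count tested nodes only
                self.context.value = node
                self.context.pos = self.pos
                # all() stops at the first predicate that yields false
                if all(pred(self.context) for pred in self.preds):
                    return node


axis = [("a", 0), ("b", 1), ("b", 2), ("c", 3), ("b", 4)]
ctx = Context()
step = IterStep(axis, lambda n: n[0] == "b", [lambda c: c.pos >= 2], ctx)
first, second, third = step.next(), step.next(), step.next()
```

The example corresponds roughly to a step such as `b[position() >= 2]`: the first b is rejected by the predicate, the other two are returned one call at a time.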

Finally, Algorithm 27 shows the Next() method of the IterPath, which creates and triggers the axis step iterators and returns the results of the last step. The ITERATORS variable contains references to the step iterators, the first of which is initialized before the first call (the optional root expression is excluded from this algorithm; it is treated the same way as the axis steps). The original context value and position are cached before and


Algorithm 27 IterPath.Iterator.Next() : Node
Require:
  ITERATORS := iterator array
  CONTEXT := query context
  ITERATORS[0] := STEPS[0].Iterator()
  P := 0 (index on current iterator)

cache context value and position
loop
  node := ITERATORS[P].Next()
  if node = null then
    P := P − 1
    break if P < 0
  else if P < #ITERATORS − 1 then
    P := P + 1
    CONTEXT.VALUE := node
    ITERATORS[P] := STEPS[P].Iterator()
  else
    break
  end if
end loop
restore context value and position
return node

restored after the evaluation. In the main loop, the next item is requested from the current iterator, which is referenced by P. If the iterator is exhausted, the next higher iterator is addressed by decrementing P. If the leftmost iterator (i.e., the first location step) returns null, the loop is terminated, as the location path will return no more results. Otherwise, if the current iterator is not the rightmost, P is incremented, the evaluated node is set as context node and the iterator of the next location step is initialized. If P points to the last location step, the evaluated node is returned as result.
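The interplay of the step iterators can be sketched in Python. In this sketch, a step is assumed to be a callable that reads the context node and returns an iterable of result nodes; the `Context` class, the dictionary-based tree and the `child_step` helper are hypothetical. Only the context value is cached and restored, as positions are not used in the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


class Context:
    """Hypothetical query context holding the current context node."""
    def __init__(self, value: str) -> None:
        self.value = value


Step = Callable[[Context], Iterable[str]]


class IterPath:
    def __init__(self, steps: List[Step], context: Context) -> None:
        self.steps = steps
        self.context = context
        self.iterators: List[Optional[Iterator[str]]] = [None] * len(steps)
        self.iterators[0] = iter(steps[0](context))  # initialized before the first call
        self.p = 0                                   # index of the current iterator

    def next(self) -> Optional[str]:
        saved = self.context.value                   # cache the context value
        node = None
        while self.p >= 0:
            node = next(self.iterators[self.p], None)
            if node is None:
                self.p -= 1                          # exhausted: back to the previous step
            elif self.p < len(self.iterators) - 1:
                self.p += 1                          # descend to the next step
                self.context.value = node
                self.iterators[self.p] = iter(self.steps[self.p](self.context))
            else:
                break                                # last step: node is a result
        self.context.value = saved                   # restore the context value
        return node


CHILDREN: Dict[str, List[str]] = {
    "root": ["a1", "a2"], "a1": ["b1", "b2"], "a2": ["b3"],
}


def child_step(ctx: Context) -> Iterable[str]:
    """Stand-in for child::*: returns the children of the context node."""
    return CHILDREN.get(ctx.value, [])


ctx = Context("root")
path = IterPath([child_step, child_step], ctx)
results = []
while (n := path.next()) is not None:
    results.append(n)
```

The loop condition `self.p >= 0` replaces the explicit break of the listing, so that further calls after exhaustion keep returning `None`.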

3.4.2.3 Optimizations

The proposed framework offers room for numerous tweaks and improvements, both conceptual and technical, which will further speed up the evaluation of location paths. Some of the most important optimizations are sketched in the following. First of all, the IterStep algorithm is clearly suboptimal regarding the evaluation of positional predicates:

• All nodes are iterated, even if a position test is specified that filters the resulting nodes to a small subset. In the example query descendant::node()[1], the iterator could be canceled after the first hit.

• Predicates using the last() function (which, used as a predicate, selects the last item of a sequence) are not supported, as the number of results is not known in advance. As last() reduces the processed nodes to a single item, the conventional evaluation, which caches all nodes, is highly undesirable.
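Both observations can be illustrated with a simplified Python sketch. It assumes that the positional predicate is a fixed integer (`position`) or a plain `last()` test (`return_last`); these parameters, the `touched` counter and the class name are hypothetical and only serve to show how many nodes are requested from the axis.

```python
from typing import Callable, Iterable, Optional


class PosStep:
    def __init__(self,
                 axis_nodes: Iterable[int],
                 test: Callable[[int], bool],
                 position: Optional[int] = None,
                 return_last: bool = False) -> None:
        self.iterator = iter(axis_nodes)
        self.test = test
        self.position = position        # assumed: fixed position such as [1]
        self.return_last = return_last  # assumed: step has a single [last()]
        self.pos = 0
        self.skip = False               # True: no more results will follow
        self.touched = 0                # nodes requested from the axis

    def next(self) -> Optional[int]:
        if self.skip:
            return None
        last = None
        while True:
            node = next(self.iterator, None)
            if node is None:
                self.skip = self.return_last
                return last             # None unless a last node was buffered
            self.touched += 1
            if not self.test(node):
                continue
            self.pos += 1
            if self.return_last:
                last = node             # buffer a single node, not all of them
            elif self.position is None or self.pos == self.position:
                # with a fixed position, no later node can match
                self.skip = self.position is not None
                return node


first = PosStep(range(1, 1001), lambda n: True, position=1)
first_hit, first_next = first.next(), first.next()

final = PosStep(range(1, 1001), lambda n: True, return_last=True)
last_hit, last_next = final.next(), final.next()
```

With `position=1`, a single node is requested from the axis; with `return_last=True`, all nodes are still visited, but only one of them is kept in memory at a time.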

Algorithm 28 IterStep.Iterator.Next() : Node
Require: see Algorithm 26, plus:
  POSITION := position expression (optional)
  RETURNLAST := flag for returning the last node (optional)
  SKIP := false

return null if SKIP
last := null
loop
  POS := POS + 1
  node := ITERATOR.Next()
  if node = null then
    SKIP := RETURNLAST
    return last
  else if TEST.Matches(node) then
    CONTEXT.VALUE := node
    CONTEXT.POS := POS
    for pred in PREDS do
      break if the truth value of pred is false
    end for
    if all predicate tests were successful then
      SKIP := POSITION.Skip(CONTEXT)
      return node
    end if
    if RETURNLAST then
      last := node
    end if
  end if
end loop

Algorithm 28 includes optimizations for the two requirements: RETURNLAST will be true if the step contains a single last() predicate, and POSITION will be assigned if the first predicate is an implementation-defined POSITION expression (see Section 3.5 for its definition). The SKIP flag indicates that the iterator will return no more results. It is set to true if the iterator is exhausted and the RETURNLAST flag is true, or if the Skip() method of the position expression returns true for the given context, indicating that
