that occur in a query are not immediately processed, but instead collected in a pending update list, which is executed only at the end of the query execution. While this solution seems unfamiliar at first glance, it helps to simplify error handling and to minimize side effects, which are caused by the heterogeneity of XML documents.

Another advantage is that pre values, which may occur as intermediate references to other database nodes, need not be changed during query execution. As has been shown in 2.4.2.5, deletions and insertions only affect database nodes n with n ≥ pre. As a consequence, before updates are carried out, all operations are sorted by the pre value of the target nodes. All updates are then applied in a backward manner: the operation with the largest pre value is executed first, followed by the remaining operations with decreasing pre values. This way, all used pre values will remain valid until the end of query execution.
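The backward application order can be illustrated with a minimal sketch. It is not the actual XQUF implementation: a plain Python list stands in for the node table, list indexes stand in for pre values, and the operation tuples and the name apply_updates are assumptions made for this example only.

```python
def apply_updates(nodes, pending):
    """Apply ('delete', pre) and ('insert', pre, value) operations.

    List indexes stand in for pre values (an assumption of this sketch).
    """
    result = list(nodes)
    # Largest pre value first: a change only shifts entries at or after
    # its own position, so the smaller pre values of the remaining
    # operations still point at their intended targets.
    for op in sorted(pending, key=lambda op: op[1], reverse=True):
        if op[0] == "delete":
            del result[op[1]]
        elif op[0] == "insert":
            result.insert(op[1], op[2])
        else:
            raise ValueError(f"unknown operation {op[0]!r}")
    return result


nodes = ["a", "b", "c", "d", "e"]
# All pre values refer to the state before any update is applied.
pending = [("delete", 1), ("insert", 3, "x"), ("delete", 4)]
print(apply_updates(nodes, pending))  # ['a', 'c', 'x', 'd']
```

If the same operations were applied in their original order instead, the deletion at position 1 would shift all following entries, and the insertion would end up in front of "e" rather than "d".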

Implementation details and performance results on XQUF are beyond the scope of this work and can be looked up in Kircher’s bachelor thesis [Kir10].

3.2 Query Processing

A raw XPath or XQuery expression is represented as a simple string, and classical compiler techniques are needed to convert the input into executable code. As the literature on compiler construction offers a broad terminology for classifying the necessary transformation steps, we will base our wording on the XQuery Recommendation, which divides query processing into two phases: Static Analysis and Dynamic Evaluation [BCF+07]. As an extension, the first phase was split up into two transformation steps, namely analysis and compilation. The second phase is represented by the evaluation and serialization steps.

3.2.1 Analysis

Before a query can be executed, the input is interpreted and transformed into an executable data structure. The process of splitting the incoming byte stream into atomic tokens is called lexical analysis. In the subsequent syntax analysis step, an expression tree is built from the tokens, using a formal grammar. The grammar of XQuery and other XML Recommendations is based on the EBNF notation [Wir77] and can be parsed by an LL(1) parser⁴ in a rather straightforward manner.
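The two analysis steps can be sketched for a toy grammar. This is only an illustration, not the XQuery grammar: the assumed EBNF is expr = term { ("+" | "-") term } and term = number | "(" expr ")", and the function names tokenize and parse are hypothetical. The parser decides each step from a single look-ahead token, which is the LL(1) property.

```python
import re

# One token: optional leading whitespace, then a number or an operator.
TOKEN = re.compile(r"\s*(\d+|[-+()])")


def tokenize(text):
    """Lexical analysis: split the input string into atomic tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens):
    """Syntax analysis: build an expression tree with one look-ahead token."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def term():
        token = advance()
        if token is not None and token.isdigit():
            return int(token)
        if token == "(":
            tree = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return tree
        raise SyntaxError(f"unexpected token {token!r}")

    def expr():
        tree = term()
        while peek() in ("+", "-"):
            op = advance()
            tree = (op, tree, term())
        return tree

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse(tokenize("1 + (2 - 3)")))  # ('+', 1, ('-', 2, 3))
```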

The division into lexical and syntax analysis can help to keep the resulting code more readable. The complexity of XQuery, however, requires a lexical scanner to have knowledge of its current lexical state, as detailed, for example, in a W3C working draft on parsing XPath and XQuery [Boa05]. The following list demonstrates that a simple token "for" can have different semantics, as it might occur in a:

• FLWOR expression: for $i in 1 to 10 return $i

• text node: <xml>marked for delivery</xml>

• comment: (: needed for result output :)

• node constructor: element for { "text" }

• variable: declare variable $for := 1;

• location path: /xml/for/sub

• string: "for all of us"
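The need for lexical states can be illustrated with a strongly simplified scanner. It is a sketch under several assumptions: only three states (default, string, comment) are tracked, nested comments and escaped quotes are ignored, and the names is_word and classify_for are hypothetical. A real XQuery scanner has to distinguish many more states, such as element content and node constructors.

```python
def is_word(query, i, word):
    """True if 'word' starts at offset i and is not part of a longer name."""
    end = i + len(word)
    before = query[i - 1] if i > 0 else " "
    after = query[end] if end < len(query) else " "
    return query.startswith(word, i) and not (before.isalnum() or after.isalnum())


def classify_for(query):
    """Return the lexical state in which each 'for' of the query occurs."""
    found = []
    state = "default"
    i = 0
    while i < len(query):
        if state == "default":
            if query.startswith("(:", i):
                state = "comment"
                i += 2
                continue
            if query[i] == '"':
                state = "string"
                i += 1
                continue
        elif state == "comment" and query.startswith(":)", i):
            state = "default"
            i += 2
            continue
        elif state == "string" and query[i] == '"':
            state = "default"
            i += 1
            continue
        if is_word(query, i, "for"):
            found.append(state)
            i += 3
            continue
        i += 1
    return found


query = 'for $i in (1, 2) return "for all" (: for :)'
print(classify_for(query))  # ['default', 'string', 'comment']
```

Only the first occurrence may be treated as the start of a FLWOR expression. The other two are plain characters of a string and a comment.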

To avoid the distinction between too many different lexical states, which all lead to different scanning branches, it is common to merge the two steps and scan and convert the input by a single parser. While existing analyzers and parser generators, such as FLEX, BISON, or JAVACC, could have been used to convert the XQuery grammar to an executable parser, we decided to write our own parser to get better performance – an approach that has also been taken by other query processors, such as SAXON or QIZX. The parser performs the following steps:

• The static context is initialized. It contains global information on the query, such as default namespaces, variables, functions, or statically known documents.

• The input is analyzed and converted to expressions, all of which form the expression tree (synonymous: query plan).

• Parse errors are raised if the input does not comply with the LL(1) grammar and extra-grammatical constraints.

• As a function may call another function that has not yet been declared in a query, all function calls need to be verified after the whole query has been parsed.
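The deferred verification of function calls in the last step can be sketched as follows. The class StaticContext and its methods are hypothetical names chosen for this example and do not reflect the actual parser code: calls are merely recorded while the query is parsed and are checked against the declared functions once parsing has finished.

```python
class StaticContext:
    """Hypothetical container for declarations collected during parsing."""

    def __init__(self):
        self.functions = set()    # declared (name, arity) pairs
        self.pending_calls = []   # calls encountered while parsing

    def declare(self, name, arity):
        self.functions.add((name, arity))

    def call(self, name, arity):
        # The declaration may follow later in the query, so the call
        # is only recorded here and not checked yet.
        self.pending_calls.append((name, arity))

    def verify_calls(self):
        """Check all recorded calls after the whole query has been parsed."""
        for name, arity in self.pending_calls:
            if (name, arity) not in self.functions:
                raise NameError(f"unknown function {name}#{arity}")


ctx = StaticContext()
ctx.declare("local:a", 0)
ctx.call("local:b", 1)      # used before its declaration
ctx.declare("local:b", 1)
ctx.verify_calls()          # passes: both functions are known by now
```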

⁴ LL(1) means: Left to right, Leftmost derivation, one look-ahead

[Figure 3.1: Class diagram with expression types. Classes shown: Expression; Value, Item, Sequence; Node, DiskNode, MemNode, ...; String, Integer, Date; ItemSequence, NodeSequence, RangeSequence; Compute; Set, Union, Intersect, Except; Function, Boolean, Position, Number, ...; Filter, Path, Step, FLWOR, ...]

Figure 3.1 depicts some of the expressions that are created by the parsing step, or will be computed by the evaluation step. All expressions are derived from the abstract Expression class. A Value is either an Item or Sequence. Items of type Node may either refer to a database node or a main-memory fragment, created by a node constructor. Sequences are further subdivided into ItemSequence, NodeSequence, and RangeSequence types, which offer optimizations for items of a specific type. All other expressions, such as Set and its subtypes, Filter, etc. are derived from the Compute expression.
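A part of this class hierarchy can be sketched as follows. The class names are taken from Figure 3.1, whereas all methods and the concrete optimization are assumptions of this example: the RangeSequence shown here stores only its two bounds, which is one conceivable way to specialize a sequence for a specific item type.

```python
class Expression:
    """Abstract base class of all expressions."""

    def evaluate(self):
        raise NotImplementedError


class Value(Expression):
    def evaluate(self):
        # A value is already the result of its own evaluation.
        return self


class Item(Value):
    def __init__(self, value):
        self.value = value

    def size(self):
        return 1

    def items(self):
        yield self


class Sequence(Value):
    pass


class ItemSequence(Sequence):
    def __init__(self, items):
        self._items = list(items)

    def size(self):
        return len(self._items)

    def items(self):
        return iter(self._items)


class RangeSequence(Sequence):
    """Integer range that keeps its bounds instead of all items (assumption)."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def size(self):
        return max(0, self.end - self.start + 1)

    def items(self):
        # Items are created on demand while the sequence is iterated.
        for number in range(self.start, self.end + 1):
            yield Item(number)


big = RangeSequence(1, 10**9)
print(big.size())  # 1000000000, computed without creating any item
```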

3.2.2 Compilation

Formal Semantics were defined for XQuery 1.0 [DFF+07] in an attempt to standardize the normalization of queries, including the atomization of effective boolean values, and static typing. Due to the complexity of the language and its subtleties, this effort was discontinued with Version 3.0. All implementations may choose their own compilation steps as long as the query results conform to the specification.

Compilation includes all steps that simplify and optimize the expression tree (details will be discussed in 3.3):

• Static operations will be pre-evaluated.

• Expressions will be rewritten if their arguments always yield true or false.

• FLWOR expressions and location paths will be simplified.

• Predicates will be rewritten to access available index structures.

• Unknown tags or attributes will be removed.

• Static type checks will be performed before the query is evaluated.
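The first two steps of this list can be sketched on a toy expression tree. The representation is an assumption of this example and not the one used by the query processor: binary expressions are tuples of the form (operator, left, right), integers and booleans are static values, and strings such as "$x" stand for operands that are unknown at compile time.

```python
import operator

# Arithmetic operators that can be pre-evaluated on static integers.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(tree):
    """Return a simplified copy of a binary expression tree."""
    if not isinstance(tree, tuple):
        return tree  # leaf: static value or unknown operand
    op, left, right = tree[0], fold(tree[1]), fold(tree[2])
    # Pre-evaluate static operations. 'type(...) is int' excludes booleans,
    # which are a subclass of int in Python.
    if op in OPS and type(left) is int and type(right) is int:
        return OPS[op](left, right)
    if op == "and":
        # Rewrite if an argument always yields false, or if both are true.
        if left is False or right is False:
            return False
        if left is True and right is True:
            return True
    return (op, left, right)


print(fold(("+", 1, ("*", 2, 3))))               # 7
print(fold(("+", "$x", ("*", 2, 3))))            # ('+', '$x', 6)
print(fold(("and", ("=", "$x", 1), False)))      # False
```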

3.2.3 Evaluation

In the evaluation step, the resulting item sequence of an expression is computed. For simple and static queries, all necessary computation steps might have been performed in the optimization step, and the root expression to be evaluated might already contain
