(1)

XML Databases

12. Updates

Silke Eckstein Andreas Kupfer

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

(2)

Plan for today

1. Finish chapter 11:
– Short recapitulation of XPath Accelerator encoding
– 11.6 Staircase join

2. Chapter 12:
– Updates

3. Tutorial
– Presentation of exercise 8
– Presentation of exercise 9

XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 2

(3)

11. XML storage – details

11.1 Introduction
11.2 Node-based encoding
11.3 Path-based XPath Accelerator encoding
11.4 Evaluation in SQL
11.5 Skeleton compression
11.6 Staircase join
11.7 Overview and References

(4)

11.3 XPath Accelerator

• XPath axes in the pre/post plane
– Plane partitions ≡ XPath axes; the context node ∘ is arbitrary!
– Pre/post plane regions ≡ major XPath axes

The major XPath axes descendant, ancestor, following, preceding correspond to rectangular pre/post plane windows. [Gru08]
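The correspondence between axes and plane windows can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the lecture material: the helper names `encode` and `axis`, the tuple-based tree representation, and the sample document are assumptions made for the example.

```python
from collections import namedtuple

# One row of a simplified encoding table: preorder rank, postorder rank, tag.
Node = namedtuple("Node", "pre post tag")

def encode(tree):
    """Assign pre/post ranks to a tree given as (tag, [children])."""
    rows, post = [], 0
    def visit(node):
        nonlocal post
        tag, children = node
        pre = len(rows)      # preorder rank = number of nodes seen so far
        rows.append(None)    # reserve the slot; the post rank is not known yet
        for child in children:
            visit(child)
        rows[pre] = Node(pre, post, tag)
        post += 1
    visit(tree)
    return rows

# Each major axis is a rectangular window relative to the context node v.
WINDOWS = {
    "descendant": lambda n, v: n.pre > v.pre and n.post < v.post,
    "ancestor":   lambda n, v: n.pre < v.pre and n.post > v.post,
    "following":  lambda n, v: n.pre > v.pre and n.post > v.post,
    "preceding":  lambda n, v: n.pre < v.pre and n.post < v.post,
}

def axis(accel, v, name):
    """Tags of all nodes inside the window of axis `name` for context node v."""
    return [n.tag for n in accel if WINDOWS[name](n, v)]

# Hypothetical sample document: <a><c><d/><e/></c><f><g/><h><j/></h></f></a>
DOC = ("a", [("c", [("d", []), ("e", [])]),
             ("f", [("g", []), ("h", [("j", [])])])])
ACCEL = encode(DOC)
F = ACCEL[4]                 # context node f
```

For any context node, the four windows together with the node itself cover the whole plane, so every other node falls into exactly one of the four axes.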

(5)

• XPath Accelerator encoding
– XML fragment f and its skeleton tree
– Pre/post encoding of f: table accel

(6)

11.4 Evaluation in SQL

• XPath axes and pre/post plane windows
– Window definitions for axis α, name test t (* = don't care)

(7)

• Result of path(·) simplified and unnested
– path(fn:root()/descendant::a/child::text())

SELECT DISTINCT v₁.*
FROM accel v₃, accel v₂, accel v₁
WHERE v₁ INSIDE window(child::text(), v₂)
  AND v₂ INSIDE window(descendant::a, v₃)
  AND v₃.pre = 0
ORDER BY v₁.pre

– An XPath location path of n steps leads to an n-fold self join of encoding table accel.
– The join conditions (the two INSIDE window predicates) are
• conjunctions ✓ of
• range or equality predicates ✓.
Together they form a multi-dimensional window!
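The 3-fold self join can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the columns par, kind, and name, the tiny sample document, and the hand-expanded window predicates (child expressed as par = pre of the parent) are illustrative choices, not the window table from the lecture.

```python
import sqlite3

# Hypothetical encoding of <r><a>x</a><b><a>y</a></b></r>.
# Columns: pre, post, par (pre rank of parent), kind, name (tag or text content).
ROWS = [
    (0, 5, None, "elem", "r"),
    (1, 1, 0,    "elem", "a"),
    (2, 0, 1,    "text", "x"),
    (3, 4, 0,    "elem", "b"),
    (4, 3, 3,    "elem", "a"),
    (5, 2, 4,    "text", "y"),
]

# path(fn:root()/descendant::a/child::text()) with both windows written out.
QUERY = """
SELECT DISTINCT v1.pre, v1.name
FROM accel v3, accel v2, accel v1
WHERE v1.par = v2.pre AND v1.kind = 'text'        -- window(child::text(), v2)
  AND v2.pre > v3.pre AND v2.post < v3.post       -- window(descendant::a, v3)
  AND v2.kind = 'elem' AND v2.name = 'a'
  AND v3.pre = 0                                  -- v3 is the root
ORDER BY v1.pre
"""

def run_path():
    """Load the sample encoding into an in-memory database and run the query."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accel (pre INTEGER PRIMARY KEY, post INTEGER, "
                "par INTEGER, kind TEXT, name TEXT)")
    con.executemany("INSERT INTO accel VALUES (?, ?, ?, ?, ?)", ROWS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows
```

Calling run_path() returns the two text nodes below the a elements in document order; every predicate in the WHERE clause is a plain range or equality comparison, which is what makes the query digestible for an ordinary relational optimizer.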

(8)

11. XML storage – details

11.2 Node-based encoding
11.3 Path-based XPath Accelerator encoding
11.4 Evaluation in SQL
11.5 Skeleton compression
11.6 Staircase join

(9)

11.6 Staircase Join

• Enhancing tree awareness
– We now know that the XPath Accelerator is a true isomorphism with respect to the XML skeleton tree structure.
• Witnessed by our discussion of shredder (ε) and serializer (ε⁻¹).
– We will now see how the database kernel can benefit from a more elaborate tree awareness (beyond document order and the semantics of the four major XPath axes).
– This will lead to the design of staircase join ⋈, the core of MonetDB/XQuery's XPath engine.

(10)

• Tree awareness?
– Document order and XPath semantics aside, what are further tree properties of value to a relational XML processor?
• The size of the subtree rooted in node a is 4.
• The leaf-to-root paths of nodes b, c meet in node d.
• The subtrees rooted in e and a are necessarily disjoint.

(11)

• Tree awareness: Subtree size
– Tree property subtree size (see previous slide) is implicitly present in a pre/post-based tree encoding:

post(v) - pre(v) = size(v) - level(v)

– To exploit property subtree size, we were able to find a means on the SQL language level, i.e., outside the database kernel.
⇒ This led to window shrink-wrapping for the XPath descendant axis.
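The identity and the resulting shrink-wrapped descendant window can be checked on a small tree. The sketch below assumes that size(v) counts the descendants of v (excluding v itself), that the root has level 0, and that h is the maximum level in the document; the function names and the sample document are hypothetical.

```python
def encode(tree):
    """Return rows (pre, post, size, level, tag) for a tree given as (tag, [children])."""
    rows, post = [], 0
    def visit(node, level):
        nonlocal post
        tag, children = node
        pre = len(rows)
        rows.append(None)                 # reserve the slot for this node
        for child in children:
            visit(child, level + 1)
        size = len(rows) - pre - 1        # number of descendants (assumption)
        rows[pre] = (pre, post, size, level, tag)
        post += 1
    visit(tree, 0)
    return rows

def descendants_plain(accel, v):
    """Unbounded descendant window: pre > pre(v) and post < post(v)."""
    return [r[4] for r in accel if r[0] > v[0] and r[1] < v[1]]

def descendants_shrunk(accel, v, h):
    """Shrink-wrapped window: both dimensions are bounded on both sides."""
    return [r[4] for r in accel
            if v[0] < r[0] <= v[1] + h and v[0] - h <= r[1] < v[1]]

# Hypothetical sample document: <a><c><d/><e/></c><f><g/><h><j/></h></f></a>
DOC = ("a", [("c", [("d", []), ("e", [])]),
             ("f", [("g", []), ("h", [("j", [])])])])
ACCEL = encode(DOC)
H = max(level for _, _, _, level, _ in ACCEL)
IDENTITY_HOLDS = all(post - pre == size - level
                     for pre, post, size, level, _ in ACCEL)
```

The extra bounds follow from the identity: the last descendant of v in preorder has pre rank pre(v) + size(v) = post(v) + level(v) ≤ post(v) + h, and the first descendant in postorder has post rank post(v) - size(v) = pre(v) - level(v) ≥ pre(v) - h.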

(12)

• Tree awareness on the SQL level
– Shrink-wrapping for the descendant axis
– path(Q), where Q ≡ (c)/following::node()/descendant::node() [Gru08, Gru02]

SELECT DISTINCT v₂.pre
FROM accel v₁, accel v₂
WHERE v₁.pre > c.pre AND v₁.pre < v₂.pre
  AND v₁.post > c.post AND v₁.post > v₂.post
  AND v₂.pre <= v₁.post + h AND v₂.post >= v₁.pre - h
ORDER BY v₂.pre

(h denotes the height of the document tree.)

(13)

• Tree awareness: Meeting ancestor paths
– Evaluation of axis ancestor can clearly benefit from knowledge about the exact element node where several given node-to-root paths meet.
• For example: For context nodes c₁…cₙ, determine their lowest common ancestor v = lca(c₁…cₙ).
⇒ Above v, produce result nodes once only. (This still produces duplicate nodes below v.)
– This knowledge is present in the encoding but is not as easily expressed on the level of commonly available relational query languages (such as SQL or relational algebra).

(14)

• Tree awareness: Disjoint subtrees
– An XPath location step cs/α is evaluated for a context node sequence cs.
• This "set-at-a-time" processing mode is key to the efficient evaluation of queries against bulk data. We want to map this into set-oriented operations on the RDBMS. (Remember: a location step is translated into a join between the context node sequence and the document encoding table accel.)
– But: If two context nodes c_i, c_j ∈ cs are in α-relationship, duplicates and out-of-order results may occur.
• Need an efficient way to identify the c_i ∈ cs which are not in α-relationship with any other c_j (for α = descendant: "c_i, c_j in disjoint subtrees?").
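A small experiment shows the problem. The sketch below evaluates the descendant step separately for each context node and concatenates the results; the encoding rows are written out by hand for a hypothetical sample document, and the function names are assumptions for illustration.

```python
from collections import namedtuple

Node = namedtuple("Node", "pre post tag")

# Hand-written pre/post encoding of the hypothetical document
# <a><c><d/><e/></c><f><g/><h><j/></h></f></a>
ACCEL = [Node(0, 7, "a"), Node(1, 2, "c"), Node(2, 0, "d"), Node(3, 1, "e"),
         Node(4, 6, "f"), Node(5, 3, "g"), Node(6, 5, "h"), Node(7, 4, "j")]

def descendants(accel, c):
    """All nodes in the descendant window of context node c."""
    return [n for n in accel if n.pre > c.pre and n.post < c.post]

def step_per_node(accel, cs):
    """Naive evaluation: one window query per context node, results concatenated."""
    out = []
    for c in cs:
        out.extend(descendants(accel, c))
    return out

# Context nodes a and f are in descendant relationship (f lies below a).
CS = [ACCEL[0], ACCEL[4]]
NAIVE = [n.tag for n in step_per_node(ACCEL, CS)]
```

Here NAIVE lists g, h, j twice, so a duplicate-eliminating and re-sorting pass would be needed to restore XPath semantics. Dropping f from the context beforehand avoids both.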

(15)

• Staircase Join: An injection of tree awareness
– Since we fail to explain the remaining tree properties (meeting ancestor paths, disjoint subtrees) at the relational language level interface, we opt to invade the database kernel in a controlled fashion:
• Inject a new relational operator, staircase join ⋈_s, into the relational query engine.
• Query translation and optimization in the presence of ⋈_s continues to work like before (e.g., selection pushdown).
• The ⋈_s algorithm encapsulates the necessary tree knowledge. ⋈_s is a local change to the database kernel.
– Remember: All of this is optional. XPath Accelerator is a purely relational XML document encoding, working on top of any RDBMS.

(16)

• Tree awareness: Window overlap, coverage
– Location step (c₁, c₂, c₃, c₄)/descendant::node(). The pairs (c₁, c₂) and (c₃, c₄) are in descendant relationship:
• Window overlap and coverage (descendant axis)

(17)

Axis window overlap (descendant axis)

Axis window overlap (ancestor axis)

(18)

Axis window overlap (following axis)

Axis window overlap (preceding axis)

(19)

• Context node sequence pruning
– We can turn these observations about axis window overlap and coverage into a simple strategy to prune the initial context node sequence for an XPath location step.
• Context node sequence pruning: Given cs/α, determine minimal cs⁻ ⊆ cs, such that cs/α = cs⁻/α.
We will see that this minimization leads to axis step evaluation on the pre/post plane, which never emits duplicate nodes or out-of-order results.

(20)

• Context node pruning: following axis
– Once context pruning for the following axis is complete, all remaining context nodes relate to each other on the ancestor/descendant axes:
• Covering nodes c₁, c₂ in descendant relationship

(21)

• Empty regions in the pre/post plane
Relating two context nodes (c₁, c₂) on the plane
Empty regions? Given c₁, c₂ on the left, why are the regions U, S marked Ø guaranteed to not hold any nodes?

(22)

• Context pruning (following axis)
– (c₁, c₂)/following::node() ≡ S ∪ T ∪ W
                             ≡ T ∪ W (region S is empty)
                             ≡ (c₂)/following::node()

(23)

• Context pruning (following axis)
Replace context node sequence cs by singleton sequence (c), c ∈ cs, with post(c) minimal.

(24)

• Context pruning (preceding axis)
Replace context node sequence cs by singleton sequence (c), c ∈ cs, with pre(c) maximal.
– Regardless of initial context size, axes following and preceding yield simple single region queries.
– We focus on descendant and ancestor now.
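Both pruning rules fit in one line each. The following sketch compares the pruned single-region query with the union over the full context on a hand-encoded sample document; the names prune_following, prune_preceding, and step are hypothetical.

```python
from collections import namedtuple

Node = namedtuple("Node", "pre post tag")

# Hand-written encoding of <a><c><d/><e/></c><f><g/><h><j/></h></f></a>
ACCEL = [Node(0, 7, "a"), Node(1, 2, "c"), Node(2, 0, "d"), Node(3, 1, "e"),
         Node(4, 6, "f"), Node(5, 3, "g"), Node(6, 5, "h"), Node(7, 4, "j")]

def following(accel, c):
    return [n for n in accel if n.pre > c.pre and n.post > c.post]

def preceding(accel, c):
    return [n for n in accel if n.pre < c.pre and n.post < c.post]

def prune_following(cs):
    """Keep only the context node with minimal post rank."""
    return [min(cs, key=lambda c: c.post)]

def prune_preceding(cs):
    """Keep only the context node with maximal pre rank."""
    return [max(cs, key=lambda c: c.pre)]

def step(accel, cs, axis_fn):
    """Reference semantics: duplicate-free union in document order."""
    hits = {n for c in cs for n in axis_fn(accel, c)}
    return sorted(hits)          # Node tuples sort by pre rank first

CS = [ACCEL[1], ACCEL[5], ACCEL[6]]          # context nodes c, g, h
FULL_FOLLOWING = step(ACCEL, CS, following)
PRUNED_FOLLOWING = following(ACCEL, prune_following(CS)[0])
FULL_PRECEDING = step(ACCEL, CS, preceding)
PRUNED_PRECEDING = preceding(ACCEL, prune_preceding(CS)[0])
```

After pruning, a single window query already delivers the result in document order, so neither duplicate elimination nor sorting is needed.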

(25)

• More empty regions
Remaining context nodes c₁, c₂ after pruning for descendant axis
Empty region? Why is region Z marked Ø guaranteed to be empty?

(26)

• Context pruning (descendant axis)

• The region marked Ø above is a region of type Z (previous slide). In general, a non-singleton sequence remains.

(27)

• Context pre-processing: Pruning

– prune_context_desc(context : TABLE(pre,post))
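The algorithm body is not reproduced in this text. One possible reading, given here as a hedged sketch rather than the lecture's own pseudocode, is a single pass over the context in document order that keeps a node only if its post rank exceeds every post rank seen so far.

```python
from collections import namedtuple

Node = namedtuple("Node", "pre post tag")

def prune_context_desc(context):
    """Drop every context node that lies in the subtree of an earlier one.

    Assumes `context` is sorted by pre rank (document order). A node c is a
    descendant of an earlier node c' exactly if post(c) < post(c'), so it is
    enough to compare against the largest post rank seen so far.
    """
    result = []
    max_post = -1                 # post ranks are assumed to start at 0
    for c in context:
        if c.post > max_post:
            result.append(c)
            max_post = c.post
    return result

# Hypothetical context taken from <a><c><d/><e/></c><f><g/><h><j/></h></f></a>:
# d lies below c, and j lies below f.
CONTEXT = [Node(1, 2, "c"), Node(2, 0, "d"), Node(4, 6, "f"), Node(7, 4, "j")]
PRUNED = prune_context_desc(CONTEXT)
```

The surviving nodes have strictly increasing pre and post ranks, which is exactly the staircase shape in the plane.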


(28)

• "Staircases" in the pre/post plane
– Note that after context pruning, the remaining context nodes form a proper "staircase" in the plane. (This is an important assumption in the following.)
• Context pruning & "staircase"

(29)

• Flashback: Intersecting ancestor paths
– Even with pruning applied, duplicates and out-of-order results may still be generated due to intersecting ancestor paths.
• We have observed this before: apply function ancestors(c₁, c₂) where c₁ (c₂) denotes the element node with tag d (e) in the sample tree. (Nodes c₁, c₂ would not have been removed during pruning.)
Remember: ancestors((d,e)) yielded (a,b,a,c).

Simulate XPath ancestor via parent axis:

declare function ancestors($n as node()*) as node()*
{ if (fn:empty($n)) then ()
  else (ancestors($n/..), $n/..) }

(30)

• Separation of ancestor paths

– Idea: try to separate the ancestor paths by defining suitable cuts in the XML fragment tree.

• Stop node-to-root traversal if a cut is encountered.

Path separation (ancestor axis)

(31)

• Parallel scan along the pre dimension

– Separating ancestor paths

Scan partitions (intervals): [p₀, p₁), [p₁, p₂), [p₂, p₃).

• Can scan in parallel. Partition results may be concatenated.

• Context pruning reduces numbers of partitions to scan.

(32)

• Basic Staircase Join (descendant)
– ⋈_s desc(accel : TABLE(pre,post), context : TABLE(pre,post))

(33)

• Partition scan (sub-routine)
– scanpartition(pre₁, pre₂, post, θ)
Notation accel[i] does not imply random access to document encoding:
• Access is strictly forward sequential (also between invocations of scanpartition(·)).
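Since the algorithm figures are not reproduced here, the following Python sketch shows one way the descendant staircase join and its partition scan could look. It is an assumption-laden illustration: accel is a list indexed by pre rank, context is already pruned and in document order, and the function names merely mirror those on the slides.

```python
import operator
from collections import namedtuple

Node = namedtuple("Node", "pre post tag")

def scanpartition(accel, pre1, pre2, post, theta, result):
    """Scan accel[pre1..pre2] and copy nodes whose post rank satisfies theta."""
    for i in range(pre1, pre2 + 1):
        if theta(accel[i].post, post):
            result.append(accel[i])

def staircase_join_desc(accel, context):
    """Descendant step for a pruned context; one forward pass over accel."""
    result = []
    # Partition between two successive context nodes: (pre(c1), pre(c2)).
    for c1, c2 in zip(context, context[1:]):
        scanpartition(accel, c1.pre + 1, c2.pre - 1, c1.post, operator.lt, result)
    # Last partition: from the last context node to the end of the document.
    if context:
        c = context[-1]
        scanpartition(accel, c.pre + 1, len(accel) - 1, c.post, operator.lt, result)
    return result

# Hand-written encoding of <a><c><d/><e/></c><f><g/><h><j/></h></f></a>
ACCEL = [Node(0, 7, "a"), Node(1, 2, "c"), Node(2, 0, "d"), Node(3, 1, "e"),
         Node(4, 6, "f"), Node(5, 3, "g"), Node(6, 5, "h"), Node(7, 4, "j")]
CONTEXT = [ACCEL[1], ACCEL[6]]        # pruned context (c, h)
RESULT = [n.tag for n in staircase_join_desc(ACCEL, CONTEXT)]
```

The index i only ever grows across the successive scanpartition calls, so the list indexing stands in for a forward sequential cursor over the encoding table.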

(34)

• Basic Staircase Join (ancestor)
– ⋈_s anc(accel : TABLE(pre,post), context : TABLE(pre,post))
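For the ancestor axis the partitions lie in front of each context node, and the post rank of the context node that closes a partition serves as the boundary. The sketch below is again a hypothetical illustration under the same assumptions (accel indexed by pre rank, context pruned for the ancestor axis and in document order).

```python
import operator
from collections import namedtuple

Node = namedtuple("Node", "pre post tag")

def scanpartition(accel, pre1, pre2, post, theta, result):
    """Scan accel[pre1..pre2] and copy nodes whose post rank satisfies theta."""
    for i in range(pre1, pre2 + 1):
        if theta(accel[i].post, post):
            result.append(accel[i])

def staircase_join_anc(accel, context):
    """Ancestor step for a pruned context; partitions end at each context node."""
    result = []
    if not context:
        return result
    # First partition: from the start of the document up to the first context node.
    first = context[0]
    scanpartition(accel, 0, first.pre - 1, first.post, operator.gt, result)
    # Remaining partitions: between c1 and c2, bounded by post(c2).
    for c1, c2 in zip(context, context[1:]):
        scanpartition(accel, c1.pre + 1, c2.pre - 1, c2.post, operator.gt, result)
    return result

# Hand-written encoding of <a><c><d/><e/></c><f><g/><h><j/></h></f></a>
ACCEL = [Node(0, 7, "a"), Node(1, 2, "c"), Node(2, 0, "d"), Node(3, 1, "e"),
         Node(4, 6, "f"), Node(5, 3, "g"), Node(6, 5, "h"), Node(7, 4, "j")]
CONTEXT = [ACCEL[2], ACCEL[7]]        # context (d, j); neither is an ancestor of the other
RESULT = [n.tag for n in staircase_join_anc(ACCEL, CONTEXT)]
```

The shared ancestor a is emitted only once, in the first partition, because the second partition starts behind d and therefore never revisits the common part of the two node-to-root paths.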

(35)

• Basic Staircase Join: Summary
– The operation of staircase join is perhaps most closely described as merge join with a dynamic range predicate: the join predicate traces the staircase boundary:
• ⋈_s scans the accel and context tables and populates the result table sequentially in document order,
• ⋈_s scans both tables once for an entire context sequence,
• ⋈_s never delivers duplicate nodes.
– ⋈_s works correctly only if prune_context(·) has previously been applied.
• prune_context(·) may be inlined into ⋈_s, thus performing context pruning on-the-fly.

(36)

• Skip ahead, if possible
– While scanning the partition associated with c₁, c₂ in (c₁, c₂)/descendant::node():
– v is outside the staircase boundary, thus not part of the result.
– No node beyond v in result (Ø-region of type Z).
⇒ Can terminate scan early and skip ahead to pre(c₂).

(37)

• Skipping for the descendant axis
– scanpartition_desc(pre₁, pre₂, post)

– Note: keyword break transfers control out of innermost enclosing loop (cf. C, Java).
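A skipping variant of the partition scan can be sketched as follows. The code is a hypothetical reconstruction, not the slide's pseudocode: it additionally counts how many rows of accel are touched so that the effect of the early break becomes visible.

```python
from collections import namedtuple

Node = namedtuple("Node", "pre post tag")

def scanpartition_desc(accel, pre1, pre2, post, result):
    """Scan accel[pre1..pre2]; stop at the first node outside the staircase."""
    touched = 0
    for i in range(pre1, pre2 + 1):
        touched += 1
        if accel[i].post < post:
            result.append(accel[i])
        else:
            break                 # all later nodes of the partition follow c: skip
    return touched

def staircase_join_desc_skip(accel, context):
    """Descendant step with skipping; returns (result, number of rows touched)."""
    result, touched = [], 0
    # Upper bound of each partition: next context node, or end of the document.
    bounds = [c.pre for c in context[1:]] + [len(accel)]
    for c, next_pre in zip(context, bounds):
        touched += scanpartition_desc(accel, c.pre + 1, next_pre - 1, c.post, result)
    return result, touched

# Hand-written encoding of <a><c><d/><e/></c><f><g/><h><j/></h></f></a>
ACCEL = [Node(0, 7, "a"), Node(1, 2, "c"), Node(2, 0, "d"), Node(3, 1, "e"),
         Node(4, 6, "f"), Node(5, 3, "g"), Node(6, 5, "h"), Node(7, 4, "j")]
CONTEXT = [ACCEL[1], ACCEL[6]]        # pruned context (c, h)
RESULT, TOUCHED = staircase_join_desc_skip(ACCEL, CONTEXT)
```

In preorder, the descendants of a context node directly follow it as one contiguous run, so the first node with a larger post rank proves that the rest of the partition cannot contribute.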

(38)

• Effectiveness of skipping
– Enable skipping in scanpartition(·). Then, for each node in context, we either
1. hit a node to be copied into table result, or
2. encounter an offside node (node v on previous slide) which leads to a skip to a known pre value (→ positional access).
– To produce the final result, ⋈_s thus never touches more than |context| + |result| nodes in the plane (without skipping: |context| + |accel|).
• In practice: > 90% of nodes in table accel are skipped.

(39)

11.7 References

• "Database-Supported XML Processors" [Gru08]
– T. Grust
– Lecture, Uni Tübingen, WS 08/09
• "XML and Databases" [Scholl07]
– M. Scholl
– Lecture, Uni Konstanz, WS 07/08
• "Staircase Join: Teach a Relational DBMS to Watch its (Axis) Steps" [GKT03]
– Torsten Grust, Maurice van Keulen, Jens Teubner
– In Proc. 29th Int'l Conference on Very Large Databases (VLDB), pages 524-535, 2003.
– http://www.informatik.uni-konstanz.de/~grust/files/staircase-join.pdf
• "Accelerating XPath Location Steps" [Gru02]
– T. Grust
– ACM SIGMOD 2002, June 4–6, Madison, Wisconsin, USA
– http://www-db.informatik.uni-tuebingen.de/files/publications/xpath-accel.pdf

(40)

Plan for today

1. Finish chapter 11:
– Short recapitulation of XPath Accelerator encoding
– 11.6 Staircase join

2. Chapter 12:
– Updates

3. Tutorial
– Presentation of exercise 8
– Presentation of exercise 9

(41)

12. Updates

12.2 XQuery Update Facility
12.3 Impact on XPath Accelerator Encoding

(42)

12.1 Introduction

• Throughout the course, up to now, we have not looked into updates to XML documents at all.
– If we want to discuss efficiency/performance issues w.r.t. mappings of XML documents to databases, though, we need to take modifications into account as well as pure retrieval operations.
– As always during physical database design, there is a trade-off between accelerated retrieval and update performance.
– The following examples are formulated in XQuery Update, an extension to XQuery that had W3C Candidate Recommendation status when these slides were written (http://www.w3.org/TR/xquery-update-10/).

(43)

• Updates and tree structures
– During our discussion of XQuery, we have seen that tree construction has been a major concern. Updates, however, cannot be expressed with XQuery.
• Yet, we need to be able to specify modifications of existing XML documents/fragments as well.
• We certainly need to be able to express:
– modification of all aspects (name, attributes, attribute values, text contents) of XML nodes, and
– modifications of the tree structure (insert/delete/replace nodes or subtrees), as well as renaming and transforming nodes.
• As in the SQL case, target node(s) of such modifications should be identifiable by means of queries.

(44)

12. Updates

(45)

12.2 XQuery Update Facility

• XQuery Update Facility: New XQuery expressions

ExprSingle ::= FLWORExpr
             | QuantifiedExpr
             | TypeswitchExpr
             | IfExpr
             | InsertExpr
             | DeleteExpr
             | RenameExpr
             | ReplaceExpr
             | TransformExpr
             | OrExpr

Syntax and examples taken from the W3C web site.

– N.B. Updating expressions (insert, delete, rename, replace) lead to a loss of type/validation information at the affected nodes. Such information may be recovered by revalidation.

(46)

• Node insertion
– An insert expression is an updating expression that inserts copies of zero or more nodes into a designated position with respect to a target node. [Scholl07]

Syntax

InsertExpr ::= "insert" ("node" | "nodes") SourceExpr InsertExprTargetChoice TargetExpr
InsertExprTargetChoice ::= (("as" ("first" | "last"))? "into") | "after" | "before"
SourceExpr ::= ExprSingle
TargetExpr ::= ExprSingle

(47)

• Node insertion: Examples

Insert a year element after the publisher of the first book.

insert node <year>2005</year>
  after fn:doc("bib.xml")/books/book[1]/publisher

Navigating by means of several bound variables, insert a new police report into the list of police reports for a particular accident.

insert node $new-police-report
  as last into fn:doc("insurance.xml")/policies
    /policy[id = $pid]
    /driver[license = $license]
    /accident[date = $accdate]
    /police-reports

(48)

• Node deletion
– A delete expression deletes zero or more nodes from an XDM instance.
– The keywords node and nodes may be used interchangeably, regardless of how many nodes are actually deleted.

Syntax

DeleteExpr ::= "delete" ("node" | "nodes") TargetExpr
TargetExpr ::= ExprSingle

Delete the last author of the first book in a given bibliography.

delete node fn:doc("bib.xml")/books/book[1]/author[last()]

Delete all email messages that are more than 365 days old.

delete nodes /email/message
  [fn:current-date() - date > xs:dayTimeDuration("P365D")]

(49)

• Node replacement

Syntax

ReplaceExpr ::= "replace" ("value" "of")? "node" TargetExpr "with" ExprSingle
TargetExpr ::= ExprSingle

– Replace takes two forms, depending on whether value of is specified:
• If value of is not specified, a replace expression replaces one node with a new sequence of zero or more nodes. The replacement nodes occupy the position in the node hierarchy that was formerly occupied by the node that was replaced.
– Hence, an attribute node can be replaced only by zero or more attribute nodes, and an element, text, comment, or processing instruction node can be replaced only by zero or more element, text, comment, or processing instruction nodes.
• If value of is specified, a replace expression is used to modify the value of a node while preserving its node identity.

(50)

• Node replacement: Examples

Replace the publisher of the first book with the publisher of the second book.

replace node fn:doc("bib.xml")/books/book[1]/publisher
  with fn:doc("bib.xml")/books/book[2]/publisher

Increase the price of the first book by ten percent.

replace value of node fn:doc("bib.xml")/books/book[1]/price
  with fn:doc("bib.xml")/books/book[1]/price * 1.1

(51)

• Renaming nodes
– A rename expression replaces the name property of a data model node with a new QName.

Syntax

RenameExpr ::= "rename" "node" TargetExpr "as" NewNameExpr
NewNameExpr ::= ExprSingle

Rename the first author element of the first book to principal-author.

rename node fn:doc("bib.xml")/books/book[1]/author[1]
  as "principal-author"

Rename the first author element of the first book to the QName that is the value of the variable $newname.

rename node fn:doc("bib.xml")/books/book[1]/author[1]
  as $newname

(52)

• Renaming is local!
– The effects of a rename expression are limited to its target node; descendants are not affected. Global change of names or namespaces needs explicit iteration.

Example (Change all QNames from prefix abc to xyz and new namespace URI http://xyz/ns for node $root and its descendants.)

for $node in $root//abc:*
let $localName := fn:local-name($node),
    $newQName := fn:concat("xyz:", $localName)
return (
  rename node $node as fn:QName("http://xyz/ns", $newQName),
  for $attr in $node/@abc:*
  let $attrLocalName := fn:local-name($attr),
      $attrNewQName := fn:concat("xyz:", $attrLocalName)
  return
    rename node $attr as fn:QName("http://xyz/ns", $attrNewQName)
)

(53)

• Node transformation
– ... creates modified copies of existing nodes. Each copied node obtains a new node identity. The resulting XDM instance can contain both newly created and previously existing nodes.
Node transformation is a non-updating expression, since it does not modify existing nodes!

Syntax

TransformExpr ::= "copy" "$"VarName ":=" ExprSingle
                  ("," "$"VarName ":=" ExprSingle)*
                  "modify" ExprSingle
                  "return" ExprSingle

– Idea:
1. Bind variables of copy clause (non-updating expressions),
2. update copies (only!) as per modify clause,
3. construct result by return (copied/modified and/or other nodes).

(54)

• Node transformation: Examples

Return a sequence consisting of all employee elements that have Java as a skill, excluding their salary child elements.

for $e in //employee[skill = "Java"]
return
  copy $je := $e
  modify delete node $je/salary
  return $je

Copy a node, modify the copy, then return the original and the modified copy.

let $oldx := /a/b/x
return
  copy $newx := $oldx
  modify (rename node $newx as "newx",
          replace value of node $newx with $newx * 2)
  return ($oldx, $newx)

– N.B. Underlying persistent data not changed by these examples!
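The copy/modify/return idea can be mimicked outside XQuery to make the "copies only" semantics tangible. The following Python sketch is merely an analogy using the standard library's xml.etree.ElementTree and copy.deepcopy; it is not how an XQuery processor implements transform expressions, and the function name and sample document are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

def java_employees_without_salary(root):
    """Analogue of: for $e in //employee[skill = "Java"]
                    return copy $je := $e modify delete node $je/salary return $je"""
    result = []
    for e in root.iter("employee"):
        if any(s.text == "Java" for s in e.findall("skill")):
            je = copy.deepcopy(e)             # copy clause: new node identity
            for salary in je.findall("salary"):
                je.remove(salary)             # modify clause: touches the copy only
            result.append(je)                 # return clause
    return result

# Hypothetical sample data.
STAFF = ET.fromstring(
    "<staff>"
    "<employee><name>Ann</name><skill>Java</skill><salary>10</salary></employee>"
    "<employee><name>Bob</name><skill>C</skill><salary>20</salary></employee>"
    "</staff>"
)
COPIES = java_employees_without_salary(STAFF)
```

After the call, the elements in COPIES lack their salary children, while the employee elements inside STAFF are unchanged.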

(55)

• On the semantics of the XQuery Update Facility
– Formally specifying the exact semantics of the XQuery UF is non-trivial for several reasons:
• Formal update semantics are always a lot more involved than retrieval semantics.
• Updates and bulk operations do not go together well (cf. SQL set-oriented updates).
• XUF uses a notion of "snapshots" and "pending update lists" to work around some of the subtleties.
• The details are beyond the scope of this lecture.

(56)

12. Updates

(57)

12.3 Impact on XPath Acc. Enc.

• Text node updates
– Obviously, replacing the value of a text (or attribute, comment, processing instruction) node has little impact on the XML representation.

Replacing text by text

<a>
  foo
  bar
</a>

⇓ replace text "bar" by "foo"

<a>
  foo
  foo
</a>

(58)

• Text node updates
– Translated into, e.g., the XPath Accelerator representation, we see that
• replacing text nodes by text nodes has local impact only on the pre/post encoding of the updated tree,
• similar observations can be made for updates on comment and processing instruction nodes.
⇒ The update leads to a local relational update.

(59)

• Structural updates

Inserting a new subtree

<a>
  <c><d/><e/></c>
  <f><g/>
     <h><j/></h>
  </f>
</a>

⇓ insert node <k><l/><m/></k> into /a/f/g

<a>
  <c><d/><e/></c>
  <f><g><k><l/><m/></k></g>
     <h><j/></h>
  </f>
</a>

– Question: What are the effects w.r.t. our structure encoding...?

(60)

• Insertion: Global impact on encoding

– Global shifts in the pre/post plane

(61)

• Insertion: Global impact on pre/post plane

Insert a subtree of n nodes below parent element v
1. post(v) ← post(v) + n
2. ∀ v' ∈ v/following::node(): pre(v') ← pre(v') + n; post(v') ← post(v') + n
3. ∀ v' ∈ v/ancestor::node(): post(v') ← post(v') + n

Update cost (tree of N nodes): O(N) for step 2 + O(log N) for step 3.
Step 3 is not so much a problem of cost but of locking. Why?
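The three renumbering rules can be checked against a full re-encoding of the updated document. The sketch below is illustrative only and rests on stated assumptions: the new subtree is appended as the last child of v, tags are unique so that they can serve as dictionary keys, and the helper names are hypothetical.

```python
def encode(tree):
    """Map each (unique) tag to its (pre, post) ranks; tree = (tag, [children])."""
    table, counter = {}, {"pre": 0, "post": 0}
    def visit(node):
        tag, children = node
        pre = counter["pre"]
        counter["pre"] += 1
        for child in children:
            visit(child)
        table[tag] = (pre, counter["post"])
        counter["post"] += 1
    visit(tree)
    return table

def insert_last_child(accel, v_tag, subtree):
    """Apply rules 1-3 for inserting `subtree` as the last child of node v_tag."""
    new = encode(subtree)
    n = len(new)
    v_pre, v_post = accel[v_tag]
    size_v = sum(1 for p, q in accel.values() if p > v_pre and q < v_post)
    updated = {}
    for tag, (pre, post) in accel.items():
        if pre > v_pre and post > v_post:        # rule 2: following nodes shift
            updated[tag] = (pre + n, post + n)
        elif pre <= v_pre and post >= v_post:    # rules 1 and 3: v and its ancestors
            updated[tag] = (pre, post + n)
        else:                                    # preceding nodes, descendants of v
            updated[tag] = (pre, post)
    for tag, (pre, post) in new.items():         # ranks of the inserted nodes
        updated[tag] = (v_pre + size_v + 1 + pre, v_post + post)
    return updated

BEFORE = ("a", [("c", [("d", []), ("e", [])]),
                ("f", [("g", []), ("h", [("j", [])])])])
NEW = ("k", [("l", []), ("m", [])])
AFTER = ("a", [("c", [("d", []), ("e", [])]),
               ("f", [("g", [NEW]), ("h", [("j", [])])])])
UPDATED = insert_last_child(encode(BEFORE), "g", NEW)
```

In this example only two following nodes (h, j) are shifted, but an insertion near the beginning of a large document would renumber almost every row, which is the O(N) term; the ancestor path, in contrast, is short but always includes the root, so every insertion writes to the root's row.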

(62)

12.4 Overview

1. Introduction
2. XML Basics
3. Schema definition
4. XML query languages I
5. Mapping relational data to XML
6. SQL/XML
7. XML processing
8. XML query languages II – XQuery Data Model
9. XML query languages III – XQuery
10. XML storage I – Overview
11. XML storage II
12. Updates
13. Systems

(63)

12.4 References

• "Database-Supported XML Processors" [Gru08]
– T. Grust
– Lecture, Uni Tübingen, WS 08/09
• "XML and Databases" [Scholl07]
– M. Scholl
– Lecture, Uni Konstanz, WS 07/08

(64)

Questions, Ideas, Comments

• Now, or ...
• Room: IZ 232
• Office hour: Tuesday, 12:30 – 13:30, or by appointment
• Email: eckstein@ifis.cs.tu-bs.de