XML Databases
1. Introduction, 27.10.08
Silke Eckstein Andreas Kupfer
Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
1. Introduction
1.1 Motivation
1.2 Relational Databases – Repetition
1.3 Why use XML?
1.4 XML & Databases
1.5 XML Fundamentals
1.6 Organisational matters
1.7 Overview
1.8 References
XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig
"If I invent another programming language, its name will contain the letter X"
(N. Wirth, Software Pioniere Konferenz, Bonn 2001)
1.1 Motivation
• Within the last 10 years XML has become the de facto standard for data exchange over the web
–Examples:
•The latest office documents
•SVG graphics files
•Lots of configuration files
•Some web CMSs store page contents in XML format
•MPEG-7 is a standard for describing media metadata in XML format
•…
–In order to see examples of XML-structured documents, browse through your computer's file system and check for file contents starting with "<?xml "!
• Why is XML relevant from a DB perspective?
–XML is becoming the data "format"
•Amount of XML is ever increasing
•DBMS are good at handling GBs and TBs of data
–Accepted model for semi-structured data
•Overcome limitations of structured data
•Extend usefulness of DBMS
–DB technology is not limited to DBMS
•App servers, application integration
[Fisch05]
Aim of this lecture
Give answers to the following questions:
• What (additional) concepts do we need in order to store XML data in an RDBMS?
• What concepts are crucial in order to build native XML-DBMS systems?
1.2 Relational Databases
What is a Database?
• A database (DB) is a collection of related data
–Represents some aspects of the real world
•Universe of Discourse (UoD)
–Data is logically coherent
–Is provided for an intended group of users and applications
[EN06, 1.1]
What is a Database Management System?
• A database management system (DBMS) is a collection of programs to maintain a database, i.e. for
–Definition of Data and Structure
–Physical Construction
–Manipulation
–Sharing/Protecting
–Persistence/Recovery
[EN06, 1.1]
Why not use the File System?
• File management systems are physical interfaces
[Figure: file system holding Account Data, Customer Data and Loans, accessed directly by App 1 and App 2 (Balance Sheets, Customer Letters, Money Transfer)]
File Systems
• Advantages
–Fast and easy access
• Disadvantages
–Uncontrolled redundancy
–Inconsistent data
–Limited data sharing and access rights
–Poor enforcement of standards
–Excessive data and access paths maintenance
• Databases are logical interfaces
–Controlled redundancy
–Data consistency & integrity constraints
–Integration of data
–Effective and secure data sharing
–Backup and recovery
• However…
–More complex
–More expensive data access
• Databases control redundancy
–Same data used by different applications/tasks is only stored once
–Access via a single interface provided by DBMS
–Redundancy only purposefully used to speed up data access (e.g. materialized views)
• Databases are well-structured
–Catalog (data dictionary) contains all meta-data
–Defines the structure of the data in the database
[EN06, 1.6.1, 1.3]
• Databases aim at efficient manipulation of data
–Physical tuning allows for good data allocation
–Indexes speed up search and access
–Query plans are optimized for improved performance
[EN06, 1.3]
• Example: Simple Index
Index File
AccNo
1278945
5539783
9134354
Data File
AccNo     type       balance
1278945   saving     € 312.10
2437954   saving     € 1324.82
4543032   checking   € -43.03
5539783   saving     € 12.54
7809849   checking   € 7643.89
8942214   checking   € -345.17
9134354   saving     € 2.22
9543252   saving     € 524.89
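The index example can be sketched in a few lines of Python. This is only an illustration under assumptions: the index file is read as a sparse index with one entry per block of three records (which matches the three account numbers shown), and the names `DATA_FILE`, `BLOCK_SIZE` and `lookup` are made up for this sketch.

```python
from bisect import bisect_right

# Data file: records sorted by account number (values from the example).
DATA_FILE = [
    (1278945, "saving", 312.10),
    (2437954, "saving", 1324.82),
    (4543032, "checking", -43.03),
    (5539783, "saving", 12.54),
    (7809849, "checking", 7643.89),
    (8942214, "checking", -345.17),
    (9134354, "saving", 2.22),
    (9543252, "saving", 524.89),
]

BLOCK_SIZE = 3  # assumption: one index entry per block of three records

# Index file: first key of each block plus the position where that block starts.
INDEX_POS = list(range(0, len(DATA_FILE), BLOCK_SIZE))
INDEX_KEYS = [DATA_FILE[pos][0] for pos in INDEX_POS]


def lookup(acc_no):
    """Return the record for acc_no, or None if no such account exists."""
    # Binary search for the last index entry whose key is <= acc_no.
    slot = bisect_right(INDEX_KEYS, acc_no) - 1
    if slot < 0:
        return None
    start = INDEX_POS[slot]
    # Scan only the single block the index entry points to.
    for record in DATA_FILE[start:start + BLOCK_SIZE]:
        if record[0] == acc_no:
            return record
    return None
```

A call such as `lookup(7809849)` touches the index and one block instead of scanning the whole data file.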
• Isolation between applications and data
–Database employs data abstraction by providing data models
–Applications work only on the conceptual representation of data
•Data is strictly typed (Integer, Timestamp, VarChar, …)
•Details on where data is actually stored and how it is accessed are hidden by the DBMS
•Applications can access and manipulate data by invoking abstract operations (e.g. SQL Select statements)
–DBMS-controlled parts of the file system are strongly protected against outside manipulation (tablespaces)
[EN06, 1.3]
• Example: Schema is changed and table-space moved without an application noticing
[Figure: the application sends its query to the DBMS, which manages the table on Disk 1 and Disk 2]
Query issued by the application:
SELECT AccNo FROM account WHERE balance>0
Table before the change:
AccNo     balance
1278945   € 312.10
2437954   € 1324.82
4543032   € -43.03
5539783   € 12.54
Table after the change:
AccNo     type       balance
1278945   saving     € 312.10
2437954   saving     € 1324.82
4543032   checking   € -43.03
5539783   saving     € 12.54
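The schema-change half of this example can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not the setup used in the lecture: SQLite has no table-spaces, so only the added column is shown, and the in-memory database and variable names are assumptions.

```python
import sqlite3

# The application's query never changes.
QUERY = "SELECT AccNo FROM account WHERE balance > 0 ORDER BY AccNo"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (AccNo INTEGER PRIMARY KEY, balance REAL)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1278945, 312.10), (2437954, 1324.82), (4543032, -43.03), (5539783, 12.54)],
)

before = [row[0] for row in conn.execute(QUERY)]

# Schema change: the table gains a column the application knows nothing about.
conn.execute("ALTER TABLE account ADD COLUMN type TEXT")
conn.execute("UPDATE account SET type = 'saving' WHERE AccNo != 4543032")
conn.execute("UPDATE account SET type = 'checking' WHERE AccNo = 4543032")

# The unchanged query still returns the same accounts.
after = [row[0] for row in conn.execute(QUERY)]
```

Because the application names the columns it needs instead of relying on the physical layout, `before` and `after` hold the same account numbers.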
• Databases support multiple views of the data
–Views provide a different perspective of the DB
•A user's conceptual understanding or task-based excerpt of all data (e.g. aggregations)
•Security considerations and access control (e.g. projections)
–For the application, a view does not differ from a table
–Views may contain subsets of a DB and/or contain virtual data
•Virtual data is derived from the DB (mostly by simple SQL statements, e.g. joins over several tables)
•Can either be computed at query time or materialized upfront
[EN06, 1.3]
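Both kinds of view mentioned above (a projection for access control, an aggregation as virtual data) can be illustrated with `sqlite3`. The view names `account_types` and `accounts_per_type` are hypothetical, and SQLite computes views at query time only; materialized views are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (AccNo INTEGER PRIMARY KEY, type TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [
        (1278945, "saving", 312.10),
        (2437954, "saving", 1324.82),
        (4543032, "checking", -43.03),
        (5539783, "saving", 12.54),
        (7809849, "checking", 7643.89),
        (8942214, "checking", -345.17),
        (9134354, "saving", 2.22),
        (9543252, "saving", 524.89),
    ],
)

# Projection: users of this view never see the balance column.
conn.execute("CREATE VIEW account_types AS SELECT AccNo, type FROM account")

# Aggregation: virtual data derived from the table at query time.
conn.execute(
    "CREATE VIEW accounts_per_type AS "
    "SELECT type, COUNT(*) AS n FROM account GROUP BY type"
)

# For the application, querying a view looks exactly like querying a table.
visible_columns = [
    d[0] for d in conn.execute("SELECT * FROM account_types").description
]
counts = dict(conn.execute("SELECT type, n FROM accounts_per_type"))
```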
• Sharing of data and support for atomic multi-user transactions
–Multiple users and applications may access the DB at the same time
–Concurrency control is necessary for maintaining consistency
–Transactions need to be atomic and isolated from each other
[EN06, 1.3]
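Atomicity can be demonstrated with a money transfer that either completes fully or leaves no trace. The sketch below uses `sqlite3` with a single connection, so it shows atomicity only, not isolation between concurrent users. Storing balances in cents and the helper names `make_db`, `transfer` and `balances` are assumptions made for the illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory account table; balances are stored in cents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (AccNo INTEGER PRIMARY KEY, balance INTEGER)"
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)",
        [(1278945, 31210), (5539783, 1254)],
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates happen or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE AccNo = ?",
            (amount, src),
        )
        (left,) = conn.execute(
            "SELECT balance FROM account WHERE AccNo = ?", (src,)
        ).fetchone()
        if left < 0:
            # The debit above is undone by the rollback.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE AccNo = ?",
            (amount, dst),
        )


def balances(conn):
    """Return {AccNo: balance} for all accounts."""
    return dict(conn.execute("SELECT AccNo, balance FROM account"))
```

A failed transfer raises `ValueError` and the debit that was already executed is rolled back, so the database never shows money leaving one account without arriving at the other.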
• Persistence of data and disaster recovery
–Data needs to be persistent and accessible at all times
–Quick recovery from system crashes without data loss
–Recovery from natural disasters (fire, earthquakes, …)
[EN06, 1.3]
1.3 Why use XML?
• Bioinformatics example:
–Search in TRANSPATH database for molecule "TLR4"
• Presentation and processing of database query results
–Flat file
–Web page
–HTML text
–XML text
[Figure: flat file, annotated with Key Originator, Molecule name, Species, Links to other DBs, Gene Ontology references, Reactions the molecule participates in, Publications]
[Figure: web page, annotated with Key Originator, Molecule name, Species, Links to other DBs, Gene Ontology references, Reactions the molecule participates in, Publications]
[Figure: HTML, annotated with Key Originator, Molecule name, Species]
[Figure: XML, annotated with Key Originator, Molecule name, Species, Links to other DBs, Gene Ontology references]
• Flat files
–Little layout information
–Suitable for presentation only to a limited extent
–Can be parsed, but cumbersome
• HTML
–Only layout information
–Good for presentation
–Automatic processing difficult
–Generation of other presentation formats just as difficult
• Solution
–Separation of layout and content
• What is XML?
–The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web.
–Base specifications:
•XML 1.0, W3C Recommendation Feb '98
•XML 1.1 (2nd Ed.), W3C Recommendation Aug '06
•Namespaces, W3C Recommendation Jan '99
[Fisch05]
• What is XML now then?
–XML is semi-structured text
–XML is a tag-based markup language (like HTML)
•eXtensible Markup Language
–XML was designed to exchange data
–XML tags are not predefined
•Tags are defined in a separate schema
–XML is designed to be self-descriptive
–XML is a W3C Recommendation
–XML became highly popular due to its simplicity and flexibility
• XML Data Example
–Syntax, no abstract model
–Documents, elements and attributes
–Tree-based, nested, hierarchically organized structure
[Fisch05]
<Buch>
  <Autor id="1234567890">Rainer Eckstein</Autor>
  <Autor id="1234568723">Silke Eckstein</Autor>
  <Titel>XML und Datenmodellierung</Titel>
  <Untertitel>XML-Schema ...</Untertitel>
  <Verlag id="3-89864">dpunkt.Verlag</Verlag>
</Buch>
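To make the tree structure of the book example visible, the document can be parsed and walked with Python's standard `xml.etree.ElementTree` module. The helper `walk` and the tuple layout it yields are choices made for this sketch only.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<Buch>
  <Autor id="1234567890">Rainer Eckstein</Autor>
  <Autor id="1234568723">Silke Eckstein</Autor>
  <Titel>XML und Datenmodellierung</Titel>
  <Untertitel>XML-Schema ...</Untertitel>
  <Verlag id="3-89864">dpunkt.Verlag</Verlag>
</Buch>"""


def walk(element, depth=0):
    """Yield (depth, tag, attributes, text) for every element in document order."""
    text = (element.text or "").strip()
    yield depth, element.tag, dict(element.attrib), text
    for child in element:
        yield from walk(child, depth + 1)


root = ET.fromstring(BOOK_XML)
nodes = list(walk(root))

# Print the tree with indentation reflecting the nesting depth.
for depth, tag, attributes, text in nodes:
    print("  " * depth + tag, attributes, text)
```

The root element `Buch` sits at depth 0 and its five children at depth 1; attributes such as `id` travel with the element they belong to.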
1.4 XML & Databases
• Database world
–1970 relational databases
–1990 nested relational model and object oriented databases
–1995 semi-structured databases
• Documents world
–1974 SGML (Standard Generalized Markup Language)
–1990 HTML (Hypertext Markup Language)
–1992 URL (Uniform Resource Locator)
• Data + documents = information
–1996 XML (Extensible Markup Language)
–URI (Uniform Resource Identifier)
[Fisch05]
• Information systems have different degrees of data structure rigidity
–Structured, e.g., relational databases
•Structure explicitly specified in schema
•Every tuple in a table has the same attributes and domains
•Queries can take advantage of structure
–Unstructured, e.g., information retrieval systems
•Often just full text with no or only limited structure information
•Properties of data usually unknown
•Queries difficult to evaluate
• But there is also something in between
–Semi-structured, e.g., XML
•Structure of data follows a template, but still allows for a degree of flexibility
•Data instances following the same schema may have a different structure
•Often, complex relationships between data are allowed (associations, inheritance, sub-classing, aggregation, etc.)
•Queries often involve those relationships
• Relational data
–Killer application: banking
–Invented as a mathematically clean abstract data model
–Philosophy: schema first, then data
• XML
–1st killer application: publishing industry
–Invented as a syntax for data, only later an abstract data model
–Philosophy: data and schemas should not be correlated, data can exist with or without schema, or with multiple schemas
• Relational data
–Never had a standard syntax for data
–Strict rules for data normalization, flat tables
–Order is irrelevant, textual data supported but not primary goal
• XML
–Standard syntax existed
–No data normalization, flexibility is a must, nesting is good
–Order may be very important, textual data support a primary goal
• Data-Centric XML
–XML is used to store or transport regularly structured and fine grained data
–Data can be mapped to relational tables with some tricks
–Is often designed to be processed by machines
[Figure: data-centric XML document annotated with mapping questions: Table, Columns, Aggregated Columns?, Foreign Keys?, Another table?]
• Document-Centric XML
–Just loosely structured with a lot of unstructured text
–Often intended for human consumption
–Querying and processing quite difficult
–Advantages of relational DBs don't pay off
–Additional IR techniques advantageous
• XML documents thus can store all kinds of data
• Thus, is an XML document already a database?
–Generally speaking… yes. But a crappy one!
–For allowing effective XML use, we additionally need
•Storage schemes for efficiently storing even huge documents
•Query Languages
•Schema Languages
•Support for data integrity and transactions (ACID)
•Support for data security
•Programming Interfaces
•… and all the other things we know from real DBMS systems
• Many of these requirements can be fulfilled by specialized standards and technologies
–Storage:
•XML document on the file system
–Queries:
•Simple queries with XPath
•Complex queries with XQuery
–Schemas:
•Simple schemas with DTD
•Complex schemas with XML Schema (XSD)
–Programming Interfaces:
•Provided by various implementations of SAX, DOM, StAX, …
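Two of these building blocks, simple path queries and the difference between tree-based and event-based programming interfaces, can be tried with the Python standard library. This is a sketch under assumptions: `xml.etree.ElementTree` supports only a limited subset of XPath, it stands in for a DOM-style interface here, and the class name `TagCounter` is made up for the example.

```python
import xml.sax
import xml.etree.ElementTree as ET

BOOK_XML = """<Buch>
  <Autor id="1234567890">Rainer Eckstein</Autor>
  <Autor id="1234568723">Silke Eckstein</Autor>
  <Titel>XML und Datenmodellierung</Titel>
  <Untertitel>XML-Schema ...</Untertitel>
  <Verlag id="3-89864">dpunkt.Verlag</Verlag>
</Buch>"""

# Tree-based access: the whole document is parsed into memory first,
# then queried with path expressions (a subset of XPath).
book = ET.fromstring(BOOK_XML)
authors = [autor.text for autor in book.findall("./Autor")]
publisher = book.find("./Verlag[@id='3-89864']").text


class TagCounter(xml.sax.ContentHandler):
    """Event-based access: count start tags while the parser streams the text."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


# SAX never builds a tree; the handler only sees one event at a time.
counter = TagCounter()
xml.sax.parseString(BOOK_XML.encode("utf-8"), counter)
```

The tree-based variant is convenient for queries, while the event-based one keeps memory use low for huge documents; neither offers storage management, transactions or access control, which is the gap a DBMS fills.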
• Still, those isolated technologies are not yet a real DBMS
• The topic of XML Databases deals with integrating them into a fully functional DBMS
• Two options
–Integrating XML support into RDBMS systems
•Especially suited for data-centric XML
–Building native XML-DBMS systems
•Suited for data-centric and document-centric XML
• What are XML supporting RDBMS?
–They map XML data into relational tables
–Main problem: How to create an efficient and meaningful mapping?
• What are native XML databases?
–"Native" is a marketing term
–Common agreement:
•Native XML DBs work with a logical model of the XML document (not directly with the data)
–i.e. nodes, attributes, types, tree structure, CDATA entries, …
•XML is the primary form of storage
•Are not limited to a particular storage model (could use a relational DB, an object DB, file system, etc.)
–Main problem: How to query and store efficiently?
• Example (very simple):
• RDBMS with XML support: Relational Mapping
Flights
id   airline   origin   destination
1    ABC Air   Dallas   Fort Worth
Flight
id   departure   arrival   flight_ref
1    09:15       09:16     1
2    11:15       11:16     1
3    13:15       13:16     1
• Native XML-DBMS systems: Native Mapping
Tags
id   parent   name          value
1    null     Flights       null
2    1        Airline       ABC Air
3    1        Origin        Dallas
4    1        Destination   Fort Worth
5    1        Flight        null
6    5        Departure     09:15
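The two mappings of the flight example can be produced from one XML document with a short Python sketch. The source document is not part of the example, so `FLIGHTS_XML` below is an assumed reconstruction, and the function names `to_relational` and `to_node_table` are hypothetical. In the node table, Departure is recorded as a child of its Flight node.

```python
import xml.etree.ElementTree as ET

# Assumed source document; only the resulting tables are given in the example.
FLIGHTS_XML = """<Flights>
  <Airline>ABC Air</Airline>
  <Origin>Dallas</Origin>
  <Destination>Fort Worth</Destination>
  <Flight><Departure>09:15</Departure><Arrival>09:16</Arrival></Flight>
  <Flight><Departure>11:15</Departure><Arrival>11:16</Arrival></Flight>
  <Flight><Departure>13:15</Departure><Arrival>13:16</Arrival></Flight>
</Flights>"""


def to_relational(root, flights_id=1):
    """Schema-specific mapping: one Flights row plus one Flight row per flight."""
    flights_row = (
        flights_id,
        root.findtext("Airline"),
        root.findtext("Origin"),
        root.findtext("Destination"),
    )
    flight_rows = [
        (i, flight.findtext("Departure"), flight.findtext("Arrival"), flights_id)
        for i, flight in enumerate(root.findall("Flight"), start=1)
    ]
    return flights_row, flight_rows


def to_node_table(root):
    """Generic mapping: one (id, parent, name, value) row per element."""
    rows = []

    def visit(element, parent_id):
        node_id = len(rows) + 1  # ids are assigned in document order
        text = (element.text or "").strip()
        rows.append((node_id, parent_id, element.tag, text or None))
        for child in element:
            visit(child, node_id)

    visit(root, None)
    return rows


root = ET.fromstring(FLIGHTS_XML)
flights_row, flight_rows = to_relational(root)
node_rows = to_node_table(root)
```

The relational mapping has to know the element names in advance, whereas the node table works for any document but needs self-joins on `parent` to answer even simple queries.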
1.5 XML Fundamentals
• Reasons for the XML success:
–XML is a general data representation format
–XML is human readable
–XML is machine readable
–XML is internationalized (Unicode)
–XML is platform independent
–XML is vendor independent
–XML is endorsed by the World Wide Web Consortium
–XML is not a new technology
–XML is not only a data representation format, it's a full infrastructure of technologies
[Fisch05]
• W3C: World Wide Web Consortium
–Established in 1994
–Initiator: Tim Berners-Lee
–Over 400 member organizations from more than 40 countries
–Mission:
•"To lead the World Wide Web to its full potential by developing protocols and guidelines that ensure long-term growth for the Web."
• W3C Process
[Figure: W3C process. Source: Mario Jeckle, www.jeckle.de]
• Structure of XML documents
–XML prolog
–Document Type Definition (DTD)
–Document Instance
–Have to be well-formed
<Bücher>
  <Buch>
    <Autor id="1234567890">Rainer Eckstein</Autor>
    <Autor id="1234568723">Silke Eckstein</Autor>
    <Titel>XML und Datenmodellierung</Titel>
    <Untertitel>XML-Schema ...</Untertitel>
    <Verlag id="3-89864">dpunkt.Verlag</Verlag>
  </Buch>
</Bücher>
• Document Type Definition
–Validity
<!DOCTYPE Bücher [
  <!ELEMENT Bücher (Buch)* >
  <!ELEMENT Buch (Autor+, Titel, Untertitel?, Verlag) >
  <!ELEMENT Autor (#PCDATA) >
  <!ATTLIST Autor
    id    ID    #REQUIRED
    email CDATA #IMPLIED
  >
  <!ELEMENT Titel (#PCDATA) >
  <!ELEMENT Untertitel (#PCDATA) >
  <!ELEMENT Verlag (#PCDATA) >
]>
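The difference between well-formedness and validity can be sketched with the Python standard library. The standard library has no DTD validator, so the check below is a hand-written stand-in and not a real validator: it covers only the content model of Buch (rewritten as a regular expression) and the required id attribute of Autor, and it does not check ID uniqueness or undeclared attributes. All names and sample documents are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Content model (Autor+, Titel, Untertitel?, Verlag) as a regular expression
# over the child tag names, each followed by a single space.
BUCH_MODEL = re.compile(r"(Autor )+Titel (Untertitel )?Verlag ")


def is_well_formed(text):
    """True if the text parses as XML at all (tags properly nested and closed)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def buch_matches_dtd(buch):
    """Check one Buch element against the content model and the required id."""
    child_tags = "".join(child.tag + " " for child in buch)
    authors_have_id = all("id" in autor.attrib for autor in buch.findall("Autor"))
    return BUCH_MODEL.fullmatch(child_tags) is not None and authors_have_id


# Hypothetical sample documents for the three cases.
GOOD = '<Buch><Autor id="a1">A</Autor><Titel>T</Titel><Verlag>V</Verlag></Buch>'
NO_TITLE = '<Buch><Autor id="a1">A</Autor><Verlag>V</Verlag></Buch>'
BROKEN = "<Buch><Titel>T</Buch>"
```

`BROKEN` is not even well-formed, `NO_TITLE` is well-formed but violates the content model, and `GOOD` passes both checks.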
• XML Schema
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Bücher">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Buch" maxOccurs="unbounded" minOccurs="0">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="Autor" maxOccurs="unbounded">
                <xsd:complexType>
                  <xsd:simpleContent>
                    <xsd:extension base="xsd:string">
                      <xsd:attribute name="id" type="xsd:ID"/>
                      <xsd:attribute name="email" type="xsd:string"/>
                    </xsd:extension>
                  </xsd:simpleContent>
                </xsd:complexType>
              </xsd:element>
              ...
</xsd:schema>
• Misunderstanding about XML
–"Data is self-describing."
–Tags don't hold semantics, they only hold the structure of the information
–The interpretation of the tags is in the application that handles the data, not in the tags themselves.
[Fisch05]
• XML as a family of technologies
–XML Information Set
–XML Schema
–XML Query
–The Extensible Stylesheet Language Transformations (XSLT)
–XLink, XPointer
–XML Forms
–XML Protocol
–XML Encryption
–XML Signature
–Others
–… almost all the pieces needed for a good XML-based information hub
[Fisch05]
[Figure: overview of XML technologies. Source: Mario Jeckle, www.jeckle.de]
• Overview of XML Technologies
–W3C Standards
•Data: XML, Namespaces, Infoset, Schema
•Communication: SOAP, Encryption, WSDL, UDDI
•Processing: XPath, XSLT, XQuery, XUpdate, XQuery Text
•Integration: RDF, OWL
–Other Standards
•Vertical domains: RosettaNet, ebXML, SBML, GML
•Workflow: BPEL
•Interfaces: DOM, SAX, JAXP, SQL/XML
[Fisch05]
1.6 Organisational matters
• Who is who?
–Silke Eckstein
•(Lecture, exams)
–Andreas Kupfer
•(Tutorial)
–Regine Dalkıran
•(Office)
–Wolf-Tilo Balke
•(Head)
• In case of questions, don't hesitate to ask us.
• Lectures:
–Monday, 9:45 – 11:15, (IZ 131, lecture)
–Monday, 11:30 – 12:15, (IZ 131, tutorial)
• Office hours:
–Silke Eckstein: Tuesday, 12:30 – 13:30, IZ 232
–Andreas Kupfer: Friday, 10:30 – 11:30, IZ 213
• Course homepage:
–http://infbsdb1.idb.cs.tu-bs.de/eckstein/xmldatabases
–Lecture notes, links, latest news etc.
• Assignments:
–Presentations as well as programming
–Details will be announced
• Credits: 4
• Exams: Oral
–Master students: agree on a certain week in Feb./Mar.
–Diploma students: by appointment
Please contact R. Dalkiran (regine.dalkiran at tu-braunschweig.de) for an exam appointment.
1.7 Overview
1. Introduction
2. XML Basics
3. Schema definition
4. XML query languages I
5. Mapping relational data to XML
6. SQL/XML
7. XML processing
8. XML query languages II
9. XML storage I
10. XML storage - index
11. XML storage - native
12. Updates / Transactions
13. Systems
14. XML Benchmarks
1.8 References
• http://www.w3.org/ [W3C]
• XML in a Nutshell [HM04]
–Harold & Means
–O'Reilly, 2004, ISBN 0596007647
• Beginning XML Databases [Pow07]
–Gavin Powell
–Wiley & Sons, 2007, ISBN 0471791202
• XML und Datenbanken [Sch02]
–Harald Schöning
–Hanser, 2002, ISBN 3446220089
• XQuery: Grundlagen und fortgeschrittene Methoden [LS04]
–Lehner & Schöning
–Dpunkt-Verlag, 2004, ISBN 3898642666
• XML & Datenbanken. Konzepte, Sprachen und Systeme [KM02]
–Klettke & Meyer
–Dpunkt-Verlag, 2002, ISBN 3898641481
• Peter Fischer, "XML und Datenbanken", Lecture, ETH Zürich, WS 05/06 [Fisch05]
• Fundamentals of Database Systems [EN06]
–Elmasri & Navathe
–Addison Wesley, 2006, ISBN 032141506X
• Now, or ...
• Room: IZ 232
• Office hours: Tuesday, 12:30 – 13:30 or by appointment
• Email: eckstein@ifis.cs.tu-bs.de