XML Databases
3. Schema Definition, 11.11.09
Silke Eckstein Andreas Kupfer
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
3. Schema Definition
3.1 Introduction
3.2 Document Type Definitions – DTDs
3.3 XML Schema
3.4 DTDs vs. XML Schema
3.5 Validation
3.6 Overview
3.7 References
3.1 Introduction
• Structure of XML documents
– XML prolog
• Document Type Definition (DTD)
– Document instance
– Have to be well-formed (see last week)
<Bücher>
<Buch>
<Autor id="1234567890">Rainer Eckstein</Autor>
<Autor id="1234568723">Silke Eckstein</Autor>
<Titel>XML und Datenmodellierung</Titel>
<Untertitel>XML-Schema ...</Untertitel>
<Verlag id="3-89864">dpunkt.Verlag</Verlag>
</Buch>
</Bücher>
• Valid XML
– More often than not, applications that operate on XML data require the XML input data to conform to a specific XML dialect.
– This requirement is stricter than mere XML well-formedness.
– The (hard-coded) application logic relies on, e.g.,
• the presence or absence of specifically named elements [attributes],
• the order of child elements within an enclosing element,
• attributes having exactly one of several expected values, . . .
– If the input data fails to meet the requirements, results are often disastrous.
• DTDs – Document Type Definitions
– The XML Recommendation includes technology that enables applications to rigidly specify the XML dialect (the document type) they expect to see: DTDs (Document Type Definitions).
– XML parsers use the DTD to ensure that input data is not only well-formed but also conforms to the DTD (in XML speak: the input data is valid).
• Valid XML documents ⊂ well-formed XML documents
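The difference between the two notions can be seen in a small sketch. The note dialect below is hypothetical and serves only as an illustration: the document is well-formed, but a validating parser would be expected to reject it.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- hypothetical dialect: a note has exactly one <to>, then one <body> -->
  <!ELEMENT note (to, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<!-- Well-formed: every tag is closed and properly nested.
     Not valid: <to> is missing and <cc> is not declared in the DTD. -->
<note>
  <body>See you on Tuesday.</body>
  <cc>Andreas</cc>
</note>
```

A non-validating parser accepts this document without complaint, so an application that relies on the presence of to has to check for it itself.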
• XML Schema
– Besides DTDs, there exists another schema description language: XML Schema
– It is more sophisticated than DTDs but has a less compact syntax.
• Document validation is critical if
– distinct organizations (B2B) need to share XML data: they then also share the DTDs / the schemas,
– applications need to discover and explore yet unknown XML dialects,
– high-speed XML throughput is required (once the input is validated, we can abandon a lot of runtime checks).
3.2 Document Type Definitions
• A document's DTD is directly attached to its XML text using a DOCTYPE declaration:
DOCTYPE Declaration
<?xml version="1.1"?>
<!DOCTYPE t de di>
<t>
...
</t>
– The DOCTYPE declaration follows the XML declaration (<?xml ... ?>); comments (<!-- ... -->) and processing instructions (<? ... ?>) in between are OK.
– The first parameter t of the DOCTYPE declaration is required to match the document's root element tag.
– The document type definition itself consists of an external subset (de ≡ SYSTEM "uri") as well as an internal subset (di ≡ [ ... ], i.e., embedded in the document itself).
• Internal and external DTD
DOCTYPE Declaration – external DTD
<?xml version="1.1"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting>
– Both subsets are optional. Should clashes occur, declarations in the internal subset override those in the external subset (the internal subset is read first, and for entity and attribute declarations the first declaration wins).
DOCTYPE Declaration – internal DTD
<?xml version="1.1" encoding="UTF-8" ?>
<!DOCTYPE greeting [
<!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>
• The ELEMENT Declaration
– The DTD ELEMENT declaration, in some sense, defines the vocabulary available in an XML dialect.
– Any XML element t to be used in the dialect needs to be introduced via
<!ELEMENT t cm>
• The content model cm of the element defines which element content is considered valid.
• Whenever an application encounters a t element anywhere in a valid document, it may assume that t's content conforms to cm.
• Element content models
Content model   Valid content
ANY             arbitrary well-formed XML content
EMPTY           no child elements or character data allowed (attributes OK)
Children        only child elements, no character data; order and occurrence of child elements must match a regular expression over tag names and the constructors , | + * ?
Mixed           character data, optionally interspersed with child elements (but see constraints below)
• Content model "Children"
– Regular expressions provide control over the exact order and occurrence of child nodes below an element node:
Reg. exp.      Semantics
t (tag name)   child element with tag t
c1 , c2        c1 followed by c2
c1 | c2        c1 or, alternatively, c2
c+             c, one or more times
c*             c, zero or more times
c?             optional c
– Example (abstract DTD; since a single group may not mix , and |, the choice B|C gets its own parentheses):
<!ELEMENT A ((B|C),(D|E)*)>
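To make the abstract content model concrete, the following sketch reads it as "either B or C, followed by any number of D or E elements". The EMPTY declarations for B to E are assumptions, added only so that the DTD is complete.

```xml
<?xml version="1.0"?>
<!DOCTYPE A [
  <!ELEMENT A ((B|C),(D|E)*)>
  <!-- assumed for illustration: all child elements are empty -->
  <!ELEMENT B EMPTY>
  <!ELEMENT C EMPTY>
  <!ELEMENT D EMPTY>
  <!ELEMENT E EMPTY>
]>
<!-- valid: C, then D E D matches (D|E)* -->
<A><C/><D/><E/><D/></A>
<!-- also valid:  <A><B/></A>            (D|E)* may match nothing    -->
<!-- not valid:   <A><D/></A>            neither B nor C comes first -->
<!-- not valid:   <A><B/><C/><D/></A>    B and C are alternatives    -->
```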
• Content model "Mixed"
– A mixture of character data and child elements
– The types of the child elements may be constrained, but not their order or their number of occurrences:
<!ELEMENT anreisebeschreibung
  (#PCDATA | auto | bahn | flugzeug)* >
<anreisebeschreibung>
You can reach us in several ways:
<bahn> by train: 1 km from Warnemünde station </bahn>
<auto> by car: 19 km from the A19 motorway Rostock - Berlin </auto>
<flugzeug> by plane: 55 km from Rostock-Laage, 235 km from Berlin-Tegel </flugzeug>
You will find us right on the waterfront promenade.
</anreisebeschreibung>
• Elements with mixed content are typical for document-centric XML
– Free text interspersed with markup
• to highlight something
• to provide certain structures, e.g. addresses or tables
• etc.
– For elements with mixed content, white space (#PCDATA) is regarded as significant and is thus reported to the application.
• In all other content models, an XML parser will not report white space contained in an element to its underlying application.
• Example: DTD and valid XML encoding academic titles
<?xml version="1.1"?>
<!DOCTYPE academic [
<!ELEMENT academic (Prof?, (Dr, (rernat|emer|phil)*)?,
                    Firstname, Middlename*, Lastname) >
<!ELEMENT Prof EMPTY >
<!ELEMENT Dr EMPTY >
<!ELEMENT rernat EMPTY >
<!ELEMENT emer EMPTY >
<!ELEMENT phil EMPTY >
<!ELEMENT Firstname (#PCDATA) >
<!ELEMENT Middlename (#PCDATA) >
<!ELEMENT Lastname (#PCDATA) >
]>
<academic>
  <Prof/><Dr/><emer/>
  <Firstname>Don</Firstname>
  <Middlename>E</Middlename>
  <Lastname>Knuth</Lastname>
</academic>
• The ATTLIST Declaration
– Using the DTD ATTLIST declaration, validation of XML documents is extended to attributes.
– The ATTLIST declaration associates a list of attribute names ai with their owning element named t:
ATTLIST Declaration
<!ATTLIST t a1 τ1 d1
            …
            an τn dn >
• The attribute types τi define which values are valid for attribute ai.
• The defaults di indicate whether ai is required or optional (and, if absent, whether a default value should be assumed for ai).
• In XML, the attributes of an element are unordered. The ATTLIST declaration prescribes no order of attribute usage.
• Via attribute types, control over the valid attribute values can be exercised:
Attribute Type τi    Semantics
CDATA                character data (no <, but &lt;, ...)
(v1|v2|...|vm)       enumerated literal values
ID                   value is document-wide unique identifier for owner element
IDREF                references an element via its ID attribute
• Example:
Academic.xml (fragment)
<!ELEMENT academic (Firstname, Middlename*, Lastname) >
<!ATTLIST academic
  title (Prof|Dr) #REQUIRED
  type CDATA #IMPLIED >
<academic title="Dr" type="rer.nat."> ... </academic>
• Attribute defaulting in DTDs:
Attribute Default di   Semantics
#REQUIRED              element must have attribute ai
#IMPLIED               attribute ai is optional
v (a value)            attribute ai is optional; if absent, default value v for ai is assumed
#FIXED v               attribute ai is optional; if present, it must have value v (if absent, v is assumed)
• Examples of attribute-list declarations:
<!ATTLIST termdef
id ID #REQUIRED
name CDATA #IMPLIED>
<!ATTLIST list
type (bullets|ordered|glossary) "ordered">
<!ATTLIST form
method CDATA #FIXED "POST">
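How these defaults play out in a document instance is sketched below. The element declarations (doc, and the content models of termdef, list and form) are assumptions added to obtain a complete DTD; only the three ATTLIST declarations are taken from the examples above.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!-- element declarations are assumed for illustration -->
  <!ELEMENT doc     (termdef | list | form)*>
  <!ELEMENT termdef (#PCDATA)>
  <!ELEMENT list    EMPTY>
  <!ELEMENT form    EMPTY>
  <!ATTLIST termdef
    id   ID    #REQUIRED
    name CDATA #IMPLIED>
  <!ATTLIST list
    type (bullets|ordered|glossary) "ordered">
  <!ATTLIST form
    method CDATA #FIXED "POST">
]>
<doc>
  <termdef id="t1">schema</termdef>  <!-- valid: id given, optional name omitted -->
  <list/>                            <!-- valid: type defaults to "ordered" -->
  <list type="glossary"/>            <!-- valid: one of the enumerated values -->
  <form/>                            <!-- valid: method is assumed to be "POST" -->
  <!-- not valid: <termdef>schema</termdef>  (required id is missing) -->
  <!-- not valid: <form method="GET"/>       (contradicts the #FIXED value) -->
</doc>
```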
• Cross-referencing via ID and IDREF
– Well-formed XML documents essentially describe tree-structured data.
– Attributes of type ID and IDREF may be used to encode graph structures in XML. A validating XML parser can check such a graph encoding for consistent connectivity.
– To establish a directed edge between two XML document nodes a and b
1. attach a unique identifier to node b (using an ID attribute),
2. refer to b from a via this identifier (using an IDREF attribute),
3. for an outdegree > 1 (see below), use an IDREFS attribute.
• Example
Graph.xml
<?xml version="1.1"?>
<!DOCTYPE graph [
<!ELEMENT graph (node+) >
<!ELEMENT node ANY > <!-- attach arbitrary data to a node -->
<!ATTLIST node
  id ID #REQUIRED
  edges IDREFS #IMPLIED > <!-- we may have nodes with outdegree 0 -->
]>
<graph>
<node id="A">a</node>
<node id="B" edges="A C">b</node>
<node id="C" edges="D">c</node>
<node id="D">d</node>
<node id="E" edges="D D">e</node>
</graph>
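What "consistent connectivity" means can be seen from two hypothetical variations of Graph.xml. Under the DTD above, a validating parser would be expected to reject both, although each is still well-formed. The node ids F, G and Z are made up for illustration.

```xml
<!-- variation 1: dangling reference, no node carries id="Z" -->
<graph>
  <node id="A">a</node>
  <node id="F" edges="A Z">f</node>
</graph>

<!-- variation 2: duplicate identifier, id="A" is used twice -->
<graph>
  <node id="A">a</node>
  <node id="A">a again</node>
  <node id="G" edges="A">g</node>
</graph>
```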
• Drawbacks of the ID/IDREF concept
– IDs have to be unique document-wide
– Only attributes can act as identifiers and references (not element content)
– Example:
<!ELEMENT person (name)>
<!ATTLIST person id ID #REQUIRED>
<!ELEMENT department EMPTY>
<!ATTLIST department id ID #REQUIRED>
<!ELEMENT project EMPTY>
<!ATTLIST project personInCharge IDREF #REQUIRED>

<person id='p0001'>
<name>Meier</name>
</person>
…
<department id='d0001'/>
…
<project personInCharge='p0001'/>
<project personInCharge='d0001'/>
– Not possible: guaranteeing that only persons are referenced as persons in charge. (But: see XML Schema below.)
• Attributes versus Elements:
Elements                    Attributes
Cardinalities 1, ?, +, *    #REQUIRED, #IMPLIED
Alternatives                No alternatives
No defaults                 Defaults
No fixed values             Fixed values
No enumeration types        Enumeration types
Content with spaces         No spaces in attribute values
Order                       No order
• Usage of attributes and elements:
Elements                             Attributes
Representation of data               Representation of metadata
"Visible" information                Additional information (for interpretation / processing)
Data objects and their components    Characteristics
Suitable for complex information     Suitable for unstructured, non-hierarchical data
                                     Alternative conditions through attributes; not through presence or absence of elements
• Other DTD features
– User-defined entities via <!ENTITY e d> declarations (usage: &e;)
<!ENTITY phb "The Pointy-Haired Boss">
– Parameter entities ("DTD macros") via <!ENTITY % e d> (usage: %e;)
<!ENTITY % ident "ID #REQUIRED">
...
<!ATTLIST character id %ident; >
– Conditional sections in DTDs via <![INCLUDE[ ... ]]> and <![IGNORE[ ... ]]> (like parameter-entity references inside declarations, these are only allowed in the external DTD subset and wrap complete declarations)
<!ENTITY % withCharacterIDs "INCLUDE" >
<![%withCharacterIDs;[
<!ATTLIST bubble
  speaker %ident;
  to %ident; >
]]>
<!ATTLIST bubble
  tone (angry|question|...) #IMPLIED >
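A general entity such as phb is expanded wherever it is referenced in the document instance. A minimal sketch; the strip element and its text are assumptions made for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE strip [
  <!ENTITY phb "The Pointy-Haired Boss">
  <!ELEMENT strip (#PCDATA)>
]>
<!-- the application sees: "The Pointy-Haired Boss calls another meeting." -->
<strip>&phb; calls another meeting.</strip>
```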
• Concluding remarks
– DTD syntax:
• Pro: compact, easy to understand
• Con: not in XML
– DTD functionality:
• no distinguishable types (everything is character data)
• no further value constraints (e.g., cardinality of sequences)
• no built-in scoping (but: use XML namespaces, xmlns)
– From a database perspective, DTDs are a poor schema definition language. (But: see XML Schema below.)
3.3 XML Schema
• XML Schema
– With XML Schema, the W3C provides a schema description language for XML documents that goes far beyond the capabilities of the "native" DTD concept. Specifically:
1. XML Schema descriptions are valid XML documents themselves.
2. XML Schema provides a rich set of built-in data types. (Modelled after the SQL and Java type systems.)
3. Far-reaching control over the values a data type can assume (facets).
4. Users can extend this type system via user-defined types.
5. XML element (and attribute) types may even be derived by inheritance.
• Some XML Schema Constructs
1. Declaring an element
<xsd:element name="author"/>
No further typing specified: the author element defaults to xsd:anyType, so its content is not constrained (declare type="xsd:string" to admit string values only).
2. Declaring an element with bounded occurrence
<xsd:element name="author" minOccurs="1"
             maxOccurs="unbounded"/>
Absence of minOccurs/maxOccurs implies exactly once.
3. Declaring a typed element
<xsd:element name="Birthday" type="xsd:date"/>
Content of Birthday takes the format YYYY-MM-DD.
4. XML Schema distinguishes 3 kinds of simple types: atomic types, list types and union types.
• Simple types cannot contain elements or attributes and
• are either built-in or derived from other simple types (e.g. using restrictions).
• XML Schema's built-in simple types (examples):
– decimal, double, float
– integer
– boolean
– time
– hexBinary
– string, normalizedString, token
– language, Name, NCName
– DTD types (ID, IDREF, IDREFS, etc.)
Source: http://www.learn-xml-schema-tutorial.com/
• Simple types can be restricted using facets:
Restricting the value space of a simple type (enumeration)
<xsd:simpleType name="Tone">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="question"/>
    <xsd:enumeration value="angry"/>
    <xsd:enumeration value="screaming"/>
  </xsd:restriction>
</xsd:simpleType>
Restricting the value space of a simple type (regular expression)
<xsd:simpleType name="AreaCode">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="0[0-9]+"/>
    <xsd:minLength value="3"/>
    <xsd:maxLength value="5"/>
  </xsd:restriction>
</xsd:simpleType>
• Other facets: length, maxInclusive, minExclusive, …
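Range facets and the two non-atomic kinds of simple types (list and union) follow the same pattern. The sketch below is an illustration only; the type names Percentage, PercentageList and PercentageOrUnknown and the element share are assumptions, not part of the lecture's examples.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- range facets on a numeric base type: integers from 0 to 100 -->
  <xsd:simpleType name="Percentage">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="100"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- list type: whitespace-separated Percentage values, e.g. "10 20 70" -->
  <xsd:simpleType name="PercentageList">
    <xsd:list itemType="Percentage"/>
  </xsd:simpleType>

  <!-- union type: either a Percentage or the literal "unknown" -->
  <xsd:simpleType name="PercentageOrUnknown">
    <xsd:union memberTypes="Percentage">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="unknown"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:union>
  </xsd:simpleType>

  <xsd:element name="share" type="PercentageOrUnknown"/>
</xsd:schema>
```

With these definitions, <share>42</share> and <share>unknown</share> would be accepted, whereas <share>101</share> would not.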
• Complex types
– … are built from simple types using type constructors.
Declaring sequenced content
<xsd:complexType name="Address" >
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:element name="address" type="Address"/>
– An xsd:complexType may be used anonymously (no name attribute).
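An anonymous complex type is written inline, inside the element declaration that uses it, and cannot be referenced from elsewhere. A shortened sketch of the address declaration in this style (only three of the child elements are repeated here):

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="address">
    <!-- no name attribute: the type exists only for this element -->
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="name"   type="xsd:string"/>
        <xsd:element name="street" type="xsd:string"/>
        <xsd:element name="city"   type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The named variant is preferable whenever the type is to be reused or extended, as in the derivation of new types from a base type.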
• New complex types may be derived from an existing (base) type.
Deriving a new complex type
<xsd:complexType name="UKAddress">
<xsd:complexContent>
<xsd:extension base="Address">
<xsd:sequence>
<xsd:element name="postcode"
type="UKPostcode"/>
</xsd:sequence>
<xsd:attribute name="exportCode"
type="xsd:positiveInteger"
fixed="1"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
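Because UKAddress extends Address, an element declared with type Address may carry UKAddress content if the instance announces this via xsi:type. The following instance is a sketch under several assumptions: the address element of type Address is declared, the schema has no target namespace, and UKPostcode (whose definition is not shown here) is a string-like type. All values are made up.

```xml
<address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:type="UKAddress"
         exportCode="1">
  <!-- inherited from Address, in the declared order -->
  <name>Jane Doe</name>
  <street>1 Example Road</street>
  <city>Cambridge</city>
  <state>Cambridgeshire</state>
  <zip>12345</zip>
  <!-- added by the extension; comes after the inherited content -->
  <postcode>CB1 1AA</postcode>
</address>
```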
• Content Models
Example, Nested Choice and Sequence Groups
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:choice>
      <xsd:group ref="shipAndBill"/>
      <xsd:element name="singleUSAddress" type="USAddress"/>
    </xsd:choice>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
<xsd:group name="shipAndBill">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
  </xsd:sequence>
</xsd:group>
• Mixed Content
– With attribute mixed="true", an xsd:complexType admits mixed content.
– In contrast to DTDs, order and number of occurrences of elements can be specified.
Example: Snippet of Customer Letter
<letterBody>
  <salutation>Dear Mr.<name>Robert Smith</name>.</salutation>
  Your order of <quantity>1</quantity> <productName>Baby Monitor</productName>
  shipped from our warehouse on <shipDate>1999-05-21</shipDate>. ....
</letterBody>
Example: Snippet of Schema for Customer Letter
<xsd:element name="letterBody">
  <xsd:complexType mixed="true">
    <xsd:sequence>
      <xsd:element name="salutation">
        <xsd:complexType mixed="true">
          <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="quantity" type="xsd:positiveInteger"/>
      <xsd:element name="productName" type="xsd:string"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
      <!-- etc. -->
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
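• Attributes are declared within their owner element (more precisely: within the complex type of the owner element, after its content model).
Declaring attributes
<xsd:element name="strip">
  <xsd:complexType>
    <xsd:attribute name="copyright"/>
    <xsd:attribute name="year" type="xsd:gYear"/> ...
  </xsd:complexType>
</xsd:element>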
– Other xsd:attribute modifiers:
• use (required, optional, prohibited),
• fixed,
• default.
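The three modifiers can be sketched on the strip element as follows. The attributes lang and format and their values are hypothetical and serve only to show default and fixed.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="strip">
    <xsd:complexType>
      <!-- must be present in every <strip> -->
      <xsd:attribute name="year" type="xsd:gYear" use="required"/>
      <!-- may be omitted; the value "en" is then assumed -->
      <xsd:attribute name="lang" type="xsd:language" default="en"/>
      <!-- may be omitted; if present, it must read "comic" -->
      <xsd:attribute name="format" type="xsd:string" fixed="comic"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Under this sketch, <strip year="2009"/> would be accepted, while <strip/> (no year) and <strip year="2009" format="novel"/> would not.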
• XML schemas and target namespaces
– A schema can be viewed as a collection (vocabulary) of type definitions and element declarations whose names belong to a particular namespace called a target namespace.
– Target namespaces enable us to distinguish between definitions and declarations from different vocabularies.
• E.g., target namespaces enable us to distinguish between the declaration for element in the XML Schema language vocabulary, and a declaration for element in a hypothetical chemistry language vocabulary.
• The former is part of the http://www.w3.org/2001/XMLSchema target namespace, and the latter is part of another target namespace.
• Example, Purchase Order Schema with Target Namespace, po1.xsd:
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:po="http://www.example.com/PO1"
        targetNamespace="http://www.example.com/PO1"
        elementFormDefault="unqualified"
        attributeFormDefault="unqualified">
  <element name="purchaseOrder"
           type="po:PurchaseOrderType"/>
  <element name="comment" type="string"/>
  …
  <complexType name="PurchaseOrderType">
    <sequence>
      <element name="shipTo" type="po:USAddress"/>
      <element name="billTo" type="po:USAddress"/>
      <element ref="po:comment" minOccurs="0"/>
      <!-- etc. -->
    </sequence>
    <!-- etc. -->
  </complexType>
  <complexType name="USAddress">
    <sequence>
      <element name="name" type="string"/>
      <element name="street" type="string"/>
      <!-- etc. -->
    </sequence>
  </complexType>
  <!-- etc. -->
</schema>
• Example, a Purchase Order with Unqualified Locals, po1.xml
<?xml version="1.1"?>
<apo:purchaseOrder xmlns:apo="http://www.example.com/PO1"
                   orderDate="1999-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <!-- etc. -->
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <!-- etc. -->
  </billTo>
  <apo:comment>Hurry, my lawn is going wild</apo:comment>
  <!-- etc. -->
</apo:purchaseOrder>
• Uniqueness constraints, keys and referential integrity
– xsd:unique element consists of
• one selector element
– selects a set of elements for which uniqueness has to be guaranteed
• one or more field elements
– identify elements or attributes which have to have unique values
• based on XPath expressions (see lecture 5)
– xsd:key element
• analogous to the xsd:unique element
– xsd:keyref element
• also analogous to the xsd:unique element
• with an additional attribute refer, which contains the name of an xsd:key (or xsd:unique) element
• Example: uniqueness constraints, keys and referential integrity (figure)
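As a sketch of how these constructs fit together, the person/project scenario from the ID/IDREF discussion could be expressed as follows. The wrapper element company and the constraint names personKey and projectLeader are assumptions made for this illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="company">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="person" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="name" type="xsd:string"/>
            </xsd:sequence>
            <xsd:attribute name="id" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="project" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="personInCharge" type="xsd:string"
                           use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>

    <!-- every person must have an id, unique among all persons -->
    <xsd:key name="personKey">
      <xsd:selector xpath="person"/>
      <xsd:field xpath="@id"/>
    </xsd:key>

    <!-- personInCharge must match the id of some person -->
    <xsd:keyref name="projectLeader" refer="personKey">
      <xsd:selector xpath="project"/>
      <xsd:field xpath="@personInCharge"/>
    </xsd:keyref>
  </xsd:element>
</xsd:schema>
```

Since the key selects person elements only, a project whose personInCharge names a department would be rejected, which is the guarantee that IDREF cannot give. The identity constraints are scoped to the company element in which they are declared.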
• Other XML Schema Concepts
– Fixed and default element content,
– support for null values,
– reuse concepts (inheritance, model groups).
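Default element content and null values can be sketched in a few lines. The element names order, country and shipDate and the default value "DE" are assumptions chosen for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="order">
    <xsd:complexType>
      <xsd:sequence>
        <!-- default content: an empty <country/> is treated as "DE" -->
        <xsd:element name="country" type="xsd:string" default="DE"/>
        <!-- null value: the element may be empty if marked as nil -->
        <xsd:element name="shipDate" type="xsd:date" nillable="true"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

In an instance, a null ship date is written as <shipDate xsi:nil="true"/>, with the xsi prefix bound to http://www.w3.org/2001/XMLSchema-instance. Note that an element default only applies when the element is present but empty, whereas an attribute default applies when the attribute is absent.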
3.4 DTDs vs. XML Schema
Document Type Definition                          XML Schema
Not XML                                           XML (XML tools can be used)
Compact syntax                                    Verbose
No datatypes (everything is character data)       Sophisticated type system
Few possibilities to constrain the schema         Diverse possibilities to constrain the schema
• Only "*", "+", "?" as cardinality constraints   • Full flexibility for cardinality constraints
• Only rudimentary key concept                    • Mature key concept
Reuse only through entities                       Reuse concepts available (type definitions, model groups etc.)
3.5 Validation
Source: Mario Jeckle, www.jeckle.de
3.6 Overview
Introduction and Basics
1. Introduction
2. XML Basics
3. Schema Definition
4. XML Processing
Querying XML
5. XPath & SQL/XML Queries
6. XQuery Data Model
7. XQuery
XML Updates
8. XML Updates & XSLT
Producing XML
9. Mapping relational data to XML
Storing XML
10. XML storage
11. Relational XML storage
12. Storage Optimization
Systems
13. Technology Overview
3.7 References
• XML Schema Recommendations
1. XML Schema Part 0: Primer, 2nd Edition, W3C Recommendation, 28 October 2004, http://www.w3.org/TR/xmlschema-0/ [XMLSchema0-04]
• non-normative document intended to provide an easily readable description of the XML Schema facilities
2. XML Schema Part 1: Structures, 2nd Edition, W3C Recommendation, 28 October 2004, http://www.w3.org/TR/xmlschema-1/
• how to define elements, attributes, content models etc.
3. XML Schema Part 2: Datatypes, 2nd Edition, W3C Recommendation, 28 October 2004, http://www.w3.org/TR/xmlschema-2/
• standard datatypes and mechanisms to build user-defined datatypes
• XML und Datenmodellierung
– R. and S. Eckstein
– Dpunkt-Verlag, 2004, ISBN 3898642224
• XML in a Nutshell [HM04]
– Harold & Means
– O'Reilly, 2004, ISBN 0596007647
• M. Scholl, "XML and Databases", Lecture, Uni Konstanz, WS07/08 [Scholl07]
Questions, Ideas, Comments
• Now, or ...
• Room: IZ 232
• Office hour: Tuesday, 12:30 – 13:30, or by appointment
• Email: eckstein@ifis.cs.tu-bs.de