Wolf-Tilo Balke
Benjamin Köhncke
Institut für Informationssysteme
Technische Universität Braunschweig
Relational Database Systems 2
13. Non-Standard Applications
13.1 XML databases
13.2 Deductive Databases
13.3 Spatial databases and GIS 13.4 Bio-information systems
13 Non-Standard Applications
• Information systems have different degrees of data structure rigidness
– Structured, e.g., relational databases
• Structure explicitly specified in schema
• Every tuple in a table has the same attributes and domains
• Queries can take advantage of structure
– Unstructured, e.g., information retrieval systems
• Often just full text with no or only limited structure information
• Properties of data usually unknown
13.1 XML Databases
• But there is also something in between
– Semi-structured, e.g., XML
• Structure of data follows a template, but still allows for a degree of flexibility
• Data instances following the same schema may have a different structure
• Often, complex relationships between data are allowed (associations, inheritance,
sub-classing, aggregation, etc.)
• Queries often involve those relationships
13.1 XML Databases
• What is XML?
– XML is semi-structured text
– XML is a tag-based markup-language (like HTML)
• eXtensible Markup Language
– XML was designed to exchange data – XML tags are not predefined
• Tags are defined in a separate schema
– XML is designed to be self-descriptive – XML is a W3C Recommendation
– XML became highly popular due to its simplicity and
13.1 XML Databases
• What is XML? Example
• Structure defined in an XML schema
13.1 XML Databases
<?xml version="1.0"?>
<note date=“14.07.08” >
<to>Anna</to>
<from>Tom</from>
<heading>Reminder</heading>
<body>Don't forget to learn for the Rel. DBS exam!</body>
</note>
<xs:element name="note">
<xs:complexType>
<xs:attribute name=“date" type="xs:date"/>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
• So, but has that to do with databases? Another example:
13.1 XML Databases
<?xml version="1.0"?>
<notes>
<note date=“14.07.08” >
<to>Anna</to>
<from>Tom</from>
<heading>Reminder</heading>
<body>Don't forget to learn for the Rel. DBS exam!</body>
</note>
<note date=“15.07.08” >
<to>Tom</to>
<from>Anna</from>
<heading>Re: Reminder</heading>
<body>Naah – I would NEVER do that!</body>
</note>
…
• Now we want to retrieve all body texts of notes from Tom written at the 15.07.2008
• We need a query language!
• A simple query language for path expressions
– W3C XPath
– Query takes advantage of document structure!
13.1 XML Databases
/notes/note*@date=’15.07.2008’ and from=‘Tom’+/body
• Path expressions are simple, but limited.
• A more complex query language is e.g. XQuery
13.1 XML Databases
for $book in collection("biblio")/book:book,
$author in collection("biblio")/bio:author let $name := $author/name,
$birth := $author/birth
where $book/author = $name and
$birth le xs:date('1929-12-31')
return <author-info> {$name, $birth, $book/title } </author-info>
• Data-Centric XML
– XML is used to store or transport regularly structured and fine grained data
– Data can be mapped to relational tables with some tricks
– Is often designed to be pro- cessed by machines
13.1 XML Databases
Table
Columns
• Document-Centric XML
– Just loosely structured with a lot of unstructured text – Often intended to for
human consumption – Querying and proc-
essing quite difficult – Advantages of rela-
tional DBs don‟t pay of
– Additional IR techni- ques advantageous
13.1 XML Databases
• XML documents thus can store all kinds of data
• Thus, is an XML document already a database?
– Generally speaking… yes. But a crappy one!
– For allowing effective XML use, we additionally need
•
Storage schemes for efficiently storing even huge documents
•
Query Languages
•
Schema Languages
•
Support for data integrity and transactions (ACID)
•
Support for data security
•
Programming Interfaces
•
… and all the other thing we know from real DBMS systems
13.1 XML Databases
• Many of these requirements can be fulfilled by specialized standards and technologies
– Storage:
•
XML document on the file system
– Queries:
•
Simple queries with XPath
•
Complex queries with XQuery
– Schemas:
•
Simple schemas with DTD
•
Complex schemas XML-Schema (XSD)
– Programming Interfaces:
13.1 XML Databases
• Still, those isolated technologies are not yet a real DBMS
• The topic of XML Databases deals with integrating them into a fully functional DBMS
• Two options
– Integrating XML support into RDMS systems
• Especially suited for data-centric XML
– Building native XML-DBMS systems
• Suited for data-centric and document centric XML
13.1 XML Databases
• What are XML supporting RDBMS?
– Maps XML data into relational tables
– Main problem: How to create an efficient and meaningful mapping?
• What are native XML databases?
– „Native“ is a marketing term – Common Agreement:
•
Native XML DBs works with a logical model of the XML document (not directly with the data)
– i.e. nodes, attributes, types, tree structure, CDATA entries, …
• XML
is the primary form of storage
•
Are not limited to a particular storage model (could use a relational DB, an object DB, file system, etc)
13.1 XML Databases
• Example (very simple):
13.1 XML Databases
id airline origin destination 1 ABC Air Dallas Fort Worth
id departure arrival flight_ref
1 09:15 09:16 1
2 11:15 11:16 1
3 13:15 13:16 1
Flights
Flight
Relational Mapping
Native Mapping
id parent name value
1 null Flights null
2 1 Airline ABC Air
3 1 Origin Dallas
4 1 Destination Fort Worth
5 1 Flight Null
Tags
• RDBMS with XML support
• Native XML-DBMS systems
13.1 XML Databases
• Since ancient times, people dream of intelligent machines
– Golden robots of Hephaestus
– Archytas‟ wooden pigeon (400 BC)
– Leonardo da Vinci‟s mechanical knight (1495) – The Turk of Wolfgang von Kempelen (1770) – …
13.2 Deductive DBS and Expert Systems
• In the 20
thcentury, the field of A.I. (Artificial Intelligence) became popular
– 1950: Alan Turing
•
“The brain is just like a complex machine.”
•
Turing test
– 1956: Darthmouth Conference
•
Founding of the A.I. laboratories
– 1965: H. A. Simon
•
"Machines will be capable, within twenty years, of doing any work a man can do“
– 1967: Marvin Minsky
•
"Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved."
13.2 Deductive DBS and Expert Systems
• In the initial phase of A.I. research, people were highly motivated and full of visions
– High amount of research money available, mainly from the military (DARPA)
• In the mid seventies, the great visions died…
– A long series of failures took its toll – The A.I. winter
– Most research funds were not granted for general A.I.
• Change of research direction
– Do not imitate the full human brain, but find
intelligent algorithms for solving particular difficult problems
13.2 Deductive DBS and Expert Systems
• Main critique – Hubert Dreyfus (UC Berkeley, USA)
– Expertise cannot readily be extracted from human experts
– Much knowledge is not explicit, but somehow embodied
•
The brain is not simply hardware running a program based on discrete symbolic
calculations
13.2 Deductive DBS and Expert Systems
• In the 1980ies, A.I. focused on well-defined problem domains building first commercially successful systems
– Knowledge-based systems or „expert systems‟
• Idea: Create a system which can draw
conclusions and thus support people in difficult decisions
– Simulate a human expert
– Main idea: extract knowledge of experts and just cheaply copy it to all places
you might need it
13.2 Deductive DBS and Expert Systems
• Expert Systems were supposed to be especially useful in
– Medical diagnosis
• Great failure up to now
– Production and machine failure diagnosis
• Works quite well
– Financial services
• Widely used and successful
13.2 Deductive DBS and Expert Systems
• Usually this is based on interference rules and specific problem data
– Rule: All frogs are green – Fact: Hektor is a frog
– Implies new fact: Hektor is green
• Also, uncertainly can be supported
– Rule: Almost all birds can fly except ostriches, chicken and penguins
– Fact: Tweety is a bird – Query: Can Tweety fly?
•
Only few species are ostrichs, chicken or penguins
•
Tweety can fly with high probability
13.2 Deductive DBS and Expert Systems
• Common architecture of an expert system
– User Interface: Usually based on a question-response dialog – Inference Engine: Tries to deduce an answer based on the
knowledge base and the problem data
– Explanation System: Explains to the user why a certain answer was given or question asked
– Knowledge Base: Set of rules and base facts
– Problem Data: Facts provided for a specific problem via user interface
13.2 Deductive DBS and Expert Systems
Interface Explanation System Problem Data
• Expert systems have to keep and manage valuable data in their knowledge base
– Basically expert systems just support another query type, but have the same requirements like a normal database system
• A deductive DBS is a database system with limited support for reasoning
• All the goodies of databases (transactions, recovery, etc.)
• Queries based on recursive views are possible
• Efficient query optimization
13.2 Deductive DBS and Expert Systems
• System may deduce new facts using rules
– Leads to inference chains
• Most systems heavily rely on mathematical logics
– First-Order Predicate Logics
13.2 Deductive DBS and Expert Systems
• Deductive queries/programs are often stated in Prolog or Datalog
– Prolog is a logical programming language created in 1972
– Datalog is a subset of Prolog especially designed for deductive databases
• No predicates are allowed as arguments
• Only fix-point iteration
• Efficient bottom-up evaluation
13.2 Deductive DBS and Expert Systems
• Datalog example
– Facts
•
parent(bill, mary).
•
parent(mary, john).
– Rules
•
ancestor(X, Y) :- parent(X, Y)
•
ancestor(X, Y) :- ancestor(X, Z) , ancestor(Z, Y)
– Query
•
:- ancestor(bill, X)
– Answer
•
ancestor(bill, mary)
13.2 Deductive DBS and Expert Systems
• Why don‟t normal databases do the trick?
– SQL queries can be read as follows
• “If some tuples exist in the FROM tables satisfying the WHERE conditions, then the SELECT tuples are the answer”
– Datalog is a query language that has the same if-then flavor, but…
• The an intermediate answer table can appear in the FROM clause, i.e., recursion
13.2 Deductive DBS and Expert Systems
• Example: a public transport information system
• Database stores connected stops as facts, e.g.,
– connection(Maschplatz, Hamburgerstr, 2 minutes).
– Transitive closure contains all connections
• Additional rules
– connection(X,Y,T) :- connection (X,Z,T1),
13.2 Deductive DBS and Expert Systems
• How long does it take to go from „Hauptbahnhof‟ to
„Rathaus‟?
– Only solution in standard SQL-92: create a view materializing all connections
•
Big challenge on storage space and data consistency
– In deductive DBS: query using recursive rule
• ?connection(HBF, RH, X)
•
Can be answered efficiently by binding the start and goal stop and deducing everything in between
– Deductive databases are no longer in the marketplace, but their technology was integrated into SQL-99
•
E.g., „common table expressions‟ in IBM DB2
13.2 Deductive DBS and Expert Systems
• MYCIN
– Developed 1970 at Stanford University, USA – Medical expert system for treating infections
•
Diagnosis of infection types and recommended antibiotics (antibiotics names usually end with ~mycin)
– Around 600 rules (also supporting uncertainty)
– MYCIN was treated as a success by the project team
•
Experiments showed good results, especially with rare infections
– … but was never used in practice
•
Too clumsy
13.2 Notable Expert Systems
• MYCIN example rule
• MYCIN used backward chaining from goal to facts for reasoning
– Try to apply rules concluding the reasoning process by asking the user about the rules premises
13.2 Notable Expert Systems
If the organism 1) stains grampos 2) has coccus shape 3) grows in chains then there is a suggestive evidence of 0.7 that it is streptococcus
• MYCIN example
–
13.2 Notable Expert Systems
---PATIENT-1---
1) Patient's name: FRED SMITH 2) Sex: MALE
3) Age: 55
4) Have you been able to obtain positive cultures from a site at which Fred Smith has an infection?
YES
---INFECTION-1---
5) What is the infection? PRIMARY-BACTEREMIA
6) Please give the date when signs of INFECTION-1 appeared. 5/5/75
The most recent positive culture associated with the primary bacteremia will be referred to as:
---CULTURE-1---
7) From what site was the specimen for CULTURE-1 taken? BLOOD 8) Please give the date when this culture was obtained. 5/9/75 The first significant organism from this blood culture will be called:
---ORGANISM-1---
9) Enter the identity of ORGANISM-1. UNKNOWN
• MYCIN explain example
13.2 Notable Expert Systems
** Did you use RULE 163 to find out anything about ORGANISM-1?
RULE163 was tried in the context of ORGANISM-1, but it failed because it is not true that the patient has had a genito-urinary tract manipulative
procedure (clause 3).
** Why didn't you consider streptococcus as a possibility?
The following rule could have been used to determine that the identity of ORGANISM-1 was streptococcus: RULE033
But clause 2 (“the morphology of the organism is coccus”) was already known to be false for ORGANISM-1, so the rule was never tried.
• Dipmeter Advisor
– Developed 1980 by Schlumberger Doll Research
– Support for oil drilling operations – Very simple system with ~90 rules
– However, system was a huge success!
• Heavily used in field and appreciated by it‟s users
• Sold as fully functional unit including hardware
• First great success after A.I. winter!
13.2 Notable Expert Systems
• Dipmeter Advisor: Function
– Dipmeters analyze ground properties by conductivity measurements
– Usually end up with an dipmeter log
• Hard to interpret, even by experts
13.2 Notable Expert Systems
• Dipmeter Advisor: Function
– Dipmeter advisor creates log analysis and provides a summary using rules
13.2 Notable Expert Systems
example rule
example summary
• NASA Shine
– Spacecraft Health Inference Engine
– Development started in mid 70s by NASA and JPL (Jet Prolusion Lab) for the Deep Space Network
•
Commercially used by ViaSpace
– Multi-purpose inference system
– Detects system failures within complex mission critical machineries
– Designed to run in real-time in embedded and distributed systems
13.2 Notable Expert Systems
• NASA Shine: currently used by
– Deep Space Network
– Lockheed Martin F-35 Lightning 35 – McDonnel Douglas F/A-18 Hornet – NASA CEV (Crew Exploration
Vehicle)
– NASA Ares Rocket Program – NASA Voyager spacecrafts – Lockheed Martin X-33
– Galileo Space Probe
– Extreme Ultraviolet Explorer – …
13.2 Notable Expert Systems
• Database technology has eased the handling of relational data and provides efficient querying
– Typical queries
•
List the names of all bookstore with more than ten thousand titles
•
List the names of the customers with highest sales in the year 2007
• But what about queries with a spatial dimension?
– List all bookstores within ten miles of Hannover
– List the average amounts for
purchases of customers who live in Braunschweig and its adjoining area
13.3 Spatial Databases and GIS
• A Geographical Information System (GIS) is any information system capable of providing
geographically referenced information
– This includes integrating, editing, analyzing, sharing, and displaying information
• For storing and querying the information a specialized spatial database is used
– Highly optimized to store and query data related to objects in space, including points, lines and polygons
13.3 Spatial Databases and GIS
• The basic idea is to integrate special spatial functionality into (or on top of) a DBMS
– Besides many commercial and open-source specialized GIS implementations, there are database extensions – As of Oracle version 8i
Oracle Spatial is offered, simlarly IBM offers a
DB2 Spatial Extender and the Informix
Spatial DataBlade
13.3 Spatial Databases and GIS
Spatial application DBMS
Interface to DBMS
Interface to spatial application
Taxonomy Data types Operations Query language Algorithms Access methods
Core Spatial Functionality
• The Open Geospatial Consortium (OGC) is an international industry consortium developing publicly available geo-processing specifications
– Consists of 334 companies, government agencies and universities participating in a consensus process
– Open interfaces and protocols defined by OpenGIS Specifications support interoperable solutions that „geo- enable‟ the Web, wireless and location-based services, and mainstream IT
– Open Geospatial Consortium protocols include e.g., the Web Map Service (WMS) and Web Feature Service (WFS)
13.3 Data Exchange
• Cro-Magnon hunters already stored information mixing graphical elements with attributes
– Associated with animal drawings are track lines and tallies thought to depict migration routes
13.3 History of GIS
• It is conjectured that also star formations are shown in very early artifacts
– A representation of the Pleiades?
13.3 History of GIS
• The sky disk found at Nebra, Germany has been dated to about 1600 BC
– A bronze disk of around 30 cm diameter, patinated blue- green and inlaid with gold astronomical symbols
– Interpreted generally as a sun or full moon, a lunar crescent, and stars
•
including a cluster interpreted as the Pleiades
– Two golden arcs along the sides, marking the angle between the solstices, were added later
13.3 History of GIS
• In 1854, John Snow depicted a cholera outbreak in London
– Points on a map represented the locations of individual cases
– The study of the
distribution of cholera led to the source of the disease, a contaminated water pump in the
middle of the cholera outbreak
13.3 History of GIS
• Applications for geographic information system technology are…
– Scientific investigations, resource management, asset management, environmental impact assessment, urban
planning, cartography, criminology, history, sales, marketing, and logistics
• For example,…
– GIS might be used to find wetlands that need protection from pollution
– GIS can be used by a company to site a new business location to take advantage of a previously
underserved market
13.3 Spatial Databases and GIS
• How does it work?
– Users interact with an interface or pose queries directly
• Basically there are three types of queries
– Basic spatial operations on all data types
• E.g., IsEmpty, Envelope, Boundary,…
– Topological/set operators
• E.g., Disjoint, Touch, Contains,…
– Spatial analysis operators
13.3 Spatial Queries
• Form an entity to hold country names, populations, and geographies
– CREATE TABLE Country( Name varchar(30), Population Integer,
Shape Polygon);
• Form an entity to hold river names, sources, lengths, and geographies
– CREATE TABLE River( Name varchar(30), Source varchar(30), Distance Integer,
Shape LineString);
13.3 Spatial Data Entity Creation
• Find all the countries that border on Germany
– SELECT C1.Name
FROM Country C1, Country C2
WHERE Touch(C1.Shape, C2.Shape) = 1 AND C2.Name = „Germany‟
• Find all the countries through which the Danube river runs
– SELECT C.Name, R.Name FROM Country C, River R
WHERE Intersect(C.Shape, R.Shape) = 1 AND R.Name = „Danube‟
13.3 Example Spatial Query
• A main issue is the indexing in spatial databases, since geographical data is high dimensional
– 3 spatial dimensions
– Often also a temporal dimension – Dimensions for the actual
data attributes
• Typical high-dimensional index structures include
R-Trees, Grid File indexes, etc.
13.3 Database Indexing
• For traversing indexes space filling curves can be utilized (e.g., Hilbert curve, Z-curve)
– They achieve a better ordering of multidimensional objects in a tree node
– This ordering has to be good, in the sense that it
should group similar data rectangles together to minimize the area and perimeter of the resulting
minimum bounding rectangles
13.3 Database Indexing
• The Human Genome Project
– "Unravelling the three billion or so base pairs of our entire DNA has been compared with landing on the Moon, splitting the atom and even inventing the wheel." , Nature, February 2001
– "[...the publication of] the human genome sequence is likely to be greeted with the same awestruck
feeling that accompanied the landing of the first human on the moon [...]", Science, Februar 2001
13.4 Bio-Information Systems
13.4 Bio-Information Systems
13.4 Bio-Information Systems
Sequence Proteins Networks Organism
• Hundreds of molecular biological databases exist:
– Presently more than 1078 – 110 more than last year – Database issue of journal
"Nucleic Acid Research", always in January,
13.4 Bio-Information Systems
http://nar.oxfordjournals.org/content/vol36/suppl_1/index.dtl
13.4 Bio-Information Systems
13.4 Bio-Information Systems
National Center for Biotechnology Information at National Institutes of Health, USA
13.4 Bio-Information Systems
DNA & RNA Proteins Networks Organism
Sequence
Structure
Amino acid sequences Nucleid acid
sequences
DNA RNA
Gen Genom
Enzym functions Protein protein
interactions all
Genexpression Pathways+
Variation Mutation+
Diseases Immun biology
Drug
development Antibody research
other Genetic engineering
Model organisms Microarray data
Metabolic pathways
Signal transduction pathways
• Highly overlapping data
• Highly linked databases
• Data integration needed!
13.4 Bio-Information Systems
Source: GenMapper (Do / Rahm, http://ducati.izbi.unileipzig.de:8080/
GenMapper/servlet/gui.MainFrame)
• Use case:
"Retrieve sequences for all human expressed sequence tags (ESTs) that by BLAST are >60%
identical over >50 amino acids to mouse channel genes expressed in central nervous system."
• Sources:
– A Practitioner’s Guide to Data Management and Data Integration in
Bioinformatics, Barbara A. Eckman in Bioinformatics by Zoe Lacroix and Terence Critchlow, 2003, Morgan Kaufmann.
– U. Leser / F. Naumann, VL "Informationsintegration", WS 06/07 bzw. 05/06
13.4 Use Case
• Involved data sources and tools:
– Mouse Genome Database (MGD) @ Jackson Labs – SwissProt @ EBI
– BLAST tool @ NCBI
– GenBank nucleotide sequence database @ NCBI
• All freely available
13.4 Use Case
• Traditional approach:
browsing
– Search for channel
sequences expressed in CNS tissues using the MGD query form
13.4 Use Case
• MGD result:
– 14 genes from 17 experiments
• Look at details for each of the 14 genes
– On average 5 SwissProt links per gene
13.4 Use Case
13.4 Use Case
• Visit each SwissProt page
– launch BLAST search
• Inspect each BLAST result
– eliminate non-human sequence hits
– check other constraints (e.g. >60% identity over >
50 amino acids)
• For each remaining entry:
– retrieve complete EST sequence from GenBank
• How often did we click?
– 1 + (Search)
– 14 + (Gene details) – 14*5 + (SwissProt) – 14*5 + (Blast)
– 14*5*X (Genbank)
13.4 Use Case
• What we really would like to have:
SELECT g.accnum, g.sequence FROM genbank g, blast b,
swissprot s, mgd m WHERE m.exp = “CNS”
AND m.defn LIKE “%channel%”
AND m.spid = s.id AND s.seq = b.query AND b.hit = g.accnum
AND b.percentid > 60 AND b.alignlen > 50
13.4 Use Case
• E.g. DB2s Information Integrator:
13.4 Use Case
Information Integrator
• Why do we need wrappers at all?
• All classical biological databases emerged from books:
– Data collections for DNA, protein sequences and structures etc.
– Published annually / quarterly
– Book – issue – CDROM – FTP – WWW
• Flat, text oriented data model
– Structured according to "Entries"
– Many items as free text
– For human, not for machines – Flat file format
• For data exchange
• Sometimes still for storage
– Increasingly use of XML formats