• Keine Ergebnisse gefunden

Relational Database Systems 2

N/A
N/A
Protected

Academic year: 2021

Aktie "Relational Database Systems 2"

Copied!
74
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Wolf-Tilo Balke

Benjamin Köhncke

Institut für Informationssysteme

Technische Universität Braunschweig

Relational Database Systems 2

13. Non-Standard Applications

(2)

13.1 XML databases

13.2 Deductive Databases

13.3 Spatial databases and GIS 13.4 Bio-information systems

13 Non-Standard Applications

(3)

• Information systems have different degrees of data structure rigidness

Structured, e.g., relational databases

• Structure explicitly specified in schema

• Every tuple in a table has the same attributes and domains

• Queries can take advantage of structure

Unstructured, e.g., information retrieval systems

• Often just full text with no or only limited structure information

• Properties of data usually unknown

13.1 XML Databases

(4)

• But there is also something in between

Semi-structured, e.g., XML

• Structure of data follows a template, but still allows for a degree of flexibility

• Data instances following the same schema may have a different structure

• Often, complex relationships between data are allowed (associations, inheritance,

sub-classing, aggregation, etc.)

• Queries often involve those relationships

13.1 XML Databases

(5)

• What is XML?

– XML is semi-structured text

– XML is a tag-based markup-language (like HTML)

• eXtensible Markup Language

– XML was designed to exchange data – XML tags are not predefined

• Tags are defined in a separate schema

– XML is designed to be self-descriptive – XML is a W3C Recommendation

– XML became highly popular due to its simplicity and

13.1 XML Databases

(6)

• What is XML? Example

• Structure defined in an XML schema

13.1 XML Databases

<?xml version="1.0"?>

<note date=“14.07.08” >

<to>Anna</to>

<from>Tom</from>

<heading>Reminder</heading>

<body>Don't forget to learn for the Rel. DBS exam!</body>

</note>

<xs:element name="note">

<xs:complexType>

<xs:attribute name=“date" type="xs:date"/>

<xs:sequence>

<xs:element name="to" type="xs:string"/>

<xs:element name="from" type="xs:string"/>

<xs:element name="heading" type="xs:string"/>

<xs:element name="body" type="xs:string"/>

</xs:sequence>

(7)

• So, but has that to do with databases? Another example:

13.1 XML Databases

<?xml version="1.0"?>

<notes>

<note date=“14.07.08” >

<to>Anna</to>

<from>Tom</from>

<heading>Reminder</heading>

<body>Don't forget to learn for the Rel. DBS exam!</body>

</note>

<note date=“15.07.08” >

<to>Tom</to>

<from>Anna</from>

<heading>Re: Reminder</heading>

<body>Naah – I would NEVER do that!</body>

</note>

(8)

• Now we want to retrieve all body texts of notes from Tom written at the 15.07.2008

• We need a query language!

• A simple query language for path expressions

– W3C XPath

– Query takes advantage of document structure!

13.1 XML Databases

/notes/note*@date=’15.07.2008’ and from=‘Tom’+/body

(9)

• Path expressions are simple, but limited.

• A more complex query language is e.g. XQuery

13.1 XML Databases

for $book in collection("biblio")/book:book,

$author in collection("biblio")/bio:author let $name := $author/name,

$birth := $author/birth

where $book/author = $name and

$birth le xs:date('1929-12-31')

return <author-info> {$name, $birth, $book/title } </author-info>

(10)

Data-Centric XML

– XML is used to store or transport regularly structured and fine grained data

– Data can be mapped to relational tables with some tricks

– Is often designed to be pro- cessed by machines

13.1 XML Databases

Table

Columns

(11)

Document-Centric XML

– Just loosely structured with a lot of unstructured text – Often intended to for

human consumption – Querying and proc-

essing quite difficult – Advantages of rela-

tional DBs don‟t pay of

– Additional IR techni- ques advantageous

13.1 XML Databases

(12)

• XML documents thus can store all kinds of data

• Thus, is an XML document already a database?

– Generally speaking… yes. But a crappy one!

– For allowing effective XML use, we additionally need

Storage schemes for efficiently storing even huge documents

Query Languages

Schema Languages

Support for data integrity and transactions (ACID)

Support for data security

Programming Interfaces

… and all the other thing we know from real DBMS systems

13.1 XML Databases

(13)

• Many of these requirements can be fulfilled by specialized standards and technologies

– Storage:

XML document on the file system

– Queries:

Simple queries with XPath

Complex queries with XQuery

– Schemas:

Simple schemas with DTD

Complex schemas XML-Schema (XSD)

– Programming Interfaces:

13.1 XML Databases

(14)

• Still, those isolated technologies are not yet a real DBMS

• The topic of XML Databases deals with integrating them into a fully functional DBMS

• Two options

– Integrating XML support into RDMS systems

• Especially suited for data-centric XML

– Building native XML-DBMS systems

• Suited for data-centric and document centric XML

13.1 XML Databases

(15)

• What are XML supporting RDBMS?

– Maps XML data into relational tables

– Main problem: How to create an efficient and meaningful mapping?

• What are native XML databases?

– „Native“ is a marketing term – Common Agreement:

Native XML DBs works with a logical model of the XML document (not directly with the data)

i.e. nodes, attributes, types, tree structure, CDATA entries, …

XML

is the primary form of storage

Are not limited to a particular storage model (could use a relational DB, an object DB, file system, etc)

13.1 XML Databases

(16)

• Example (very simple):

13.1 XML Databases

id airline origin destination 1 ABC Air Dallas Fort Worth

id departure arrival flight_ref

1 09:15 09:16 1

2 11:15 11:16 1

3 13:15 13:16 1

Flights

Flight

Relational Mapping

Native Mapping

id parent name value

1 null Flights null

2 1 Airline ABC Air

3 1 Origin Dallas

4 1 Destination Fort Worth

5 1 Flight Null

Tags

(17)

• RDBMS with XML support

• Native XML-DBMS systems

13.1 XML Databases

(18)

• Since ancient times, people dream of intelligent machines

– Golden robots of Hephaestus

– Archytas‟ wooden pigeon (400 BC)

– Leonardo da Vinci‟s mechanical knight (1495) – The Turk of Wolfgang von Kempelen (1770) – …

13.2 Deductive DBS and Expert Systems

(19)

• In the 20

th

century, the field of A.I. (Artificial Intelligence) became popular

– 1950: Alan Turing

“The brain is just like a complex machine.”

Turing test

– 1956: Darthmouth Conference

Founding of the A.I. laboratories

– 1965: H. A. Simon

"Machines will be capable, within twenty years, of doing any work a man can do“

– 1967: Marvin Minsky

"Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved."

13.2 Deductive DBS and Expert Systems

(20)

• In the initial phase of A.I. research, people were highly motivated and full of visions

– High amount of research money available, mainly from the military (DARPA)

• In the mid seventies, the great visions died…

– A long series of failures took its toll – The A.I. winter

– Most research funds were not granted for general A.I.

• Change of research direction

– Do not imitate the full human brain, but find

intelligent algorithms for solving particular difficult problems

13.2 Deductive DBS and Expert Systems

(21)

• Main critique – Hubert Dreyfus (UC Berkeley, USA)

– Expertise cannot readily be extracted from human experts

– Much knowledge is not explicit, but somehow embodied

The brain is not simply hardware running a program based on discrete symbolic

calculations

13.2 Deductive DBS and Expert Systems

(22)

• In the 1980ies, A.I. focused on well-defined problem domains building first commercially successful systems

Knowledge-based systems or „expert systems‟

Idea: Create a system which can draw

conclusions and thus support people in difficult decisions

– Simulate a human expert

Main idea: extract knowledge of experts and just cheaply copy it to all places

you might need it

13.2 Deductive DBS and Expert Systems

(23)

Expert Systems were supposed to be especially useful in

– Medical diagnosis

• Great failure up to now

– Production and machine failure diagnosis

• Works quite well

– Financial services

• Widely used and successful

13.2 Deductive DBS and Expert Systems

(24)

• Usually this is based on interference rules and specific problem data

Rule: All frogs are green – Fact: Hektor is a frog

– Implies new fact: Hektor is green

• Also, uncertainly can be supported

Rule: Almost all birds can fly except ostriches, chicken and penguins

Fact: Tweety is a birdQuery: Can Tweety fly?

Only few species are ostrichs, chicken or penguins

Tweety can fly with high probability

13.2 Deductive DBS and Expert Systems

(25)

• Common architecture of an expert system

User Interface: Usually based on a question-response dialogInference Engine: Tries to deduce an answer based on the

knowledge base and the problem data

Explanation System: Explains to the user why a certain answer was given or question asked

Knowledge Base: Set of rules and base facts

Problem Data: Facts provided for a specific problem via user interface

13.2 Deductive DBS and Expert Systems

Interface Explanation System Problem Data

(26)

• Expert systems have to keep and manage valuable data in their knowledge base

– Basically expert systems just support another query type, but have the same requirements like a normal database system

• A deductive DBS is a database system with limited support for reasoning

• All the goodies of databases (transactions, recovery, etc.)

• Queries based on recursive views are possible

• Efficient query optimization

13.2 Deductive DBS and Expert Systems

(27)

• System may deduce new facts using rules

– Leads to inference chains

• Most systems heavily rely on mathematical logics

First-Order Predicate Logics

13.2 Deductive DBS and Expert Systems

(28)

• Deductive queries/programs are often stated in Prolog or Datalog

Prolog is a logical programming language created in 1972

Datalog is a subset of Prolog especially designed for deductive databases

• No predicates are allowed as arguments

• Only fix-point iteration

• Efficient bottom-up evaluation

13.2 Deductive DBS and Expert Systems

(29)

Datalog example

Facts

parent(bill, mary).

parent(mary, john).

Rules

ancestor(X, Y) :- parent(X, Y)

ancestor(X, Y) :- ancestor(X, Z) , ancestor(Z, Y)

Query

:- ancestor(bill, X)

Answer

ancestor(bill, mary)

13.2 Deductive DBS and Expert Systems

(30)

• Why don‟t normal databases do the trick?

SQL queries can be read as follows

• “If some tuples exist in the FROM tables satisfying the WHERE conditions, then the SELECT tuples are the answer”

Datalog is a query language that has the same if-then flavor, but…

• The an intermediate answer table can appear in the FROM clause, i.e., recursion

13.2 Deductive DBS and Expert Systems

(31)

• Example: a public transport information system

• Database stores connected stops as facts, e.g.,

connection(Maschplatz, Hamburgerstr, 2 minutes).

Transitive closure contains all connections

• Additional rules

connection(X,Y,T) :- connection (X,Z,T1),

13.2 Deductive DBS and Expert Systems

(32)

• How long does it take to go from „Hauptbahnhof‟ to

„Rathaus‟?

– Only solution in standard SQL-92: create a view materializing all connections

Big challenge on storage space and data consistency

– In deductive DBS: query using recursive rule

?connection(HBF, RH, X)

Can be answered efficiently by binding the start and goal stop and deducing everything in between

– Deductive databases are no longer in the marketplace, but their technology was integrated into SQL-99

E.g., „common table expressions‟ in IBM DB2

13.2 Deductive DBS and Expert Systems

(33)

MYCIN

– Developed 1970 at Stanford University, USA – Medical expert system for treating infections

Diagnosis of infection types and recommended antibiotics (antibiotics names usually end with ~mycin)

– Around 600 rules (also supporting uncertainty)

– MYCIN was treated as a success by the project team

Experiments showed good results, especially with rare infections

– … but was never used in practice

Too clumsy

13.2 Notable Expert Systems

(34)

MYCIN example rule

• MYCIN used backward chaining from goal to facts for reasoning

– Try to apply rules concluding the reasoning process by asking the user about the rules premises

13.2 Notable Expert Systems

If the organism 1) stains grampos 2) has coccus shape 3) grows in chains then there is a suggestive evidence of 0.7 that it is streptococcus

(35)

MYCIN example

13.2 Notable Expert Systems

---PATIENT-1---

1) Patient's name: FRED SMITH 2) Sex: MALE

3) Age: 55

4) Have you been able to obtain positive cultures from a site at which Fred Smith has an infection?

YES

---INFECTION-1---

5) What is the infection? PRIMARY-BACTEREMIA

6) Please give the date when signs of INFECTION-1 appeared. 5/5/75

The most recent positive culture associated with the primary bacteremia will be referred to as:

---CULTURE-1---

7) From what site was the specimen for CULTURE-1 taken? BLOOD 8) Please give the date when this culture was obtained. 5/9/75 The first significant organism from this blood culture will be called:

---ORGANISM-1---

9) Enter the identity of ORGANISM-1. UNKNOWN

(36)

MYCIN explain example

13.2 Notable Expert Systems

** Did you use RULE 163 to find out anything about ORGANISM-1?

RULE163 was tried in the context of ORGANISM-1, but it failed because it is not true that the patient has had a genito-urinary tract manipulative

procedure (clause 3).

** Why didn't you consider streptococcus as a possibility?

The following rule could have been used to determine that the identity of ORGANISM-1 was streptococcus: RULE033

But clause 2 (“the morphology of the organism is coccus”) was already known to be false for ORGANISM-1, so the rule was never tried.

(37)

Dipmeter Advisor

– Developed 1980 by Schlumberger Doll Research

– Support for oil drilling operations – Very simple system with ~90 rules

– However, system was a huge success!

• Heavily used in field and appreciated by it‟s users

• Sold as fully functional unit including hardware

• First great success after A.I. winter!

13.2 Notable Expert Systems

(38)

Dipmeter Advisor: Function

– Dipmeters analyze ground properties by conductivity measurements

– Usually end up with an dipmeter log

• Hard to interpret, even by experts

13.2 Notable Expert Systems

(39)

Dipmeter Advisor: Function

– Dipmeter advisor creates log analysis and provides a summary using rules

13.2 Notable Expert Systems

example rule

example summary

(40)

NASA Shine

Spacecraft Health Inference Engine

– Development started in mid 70s by NASA and JPL (Jet Prolusion Lab) for the Deep Space Network

Commercially used by ViaSpace

– Multi-purpose inference system

– Detects system failures within complex mission critical machineries

– Designed to run in real-time in embedded and distributed systems

13.2 Notable Expert Systems

(41)

NASA Shine: currently used by

– Deep Space Network

– Lockheed Martin F-35 Lightning 35 – McDonnel Douglas F/A-18 Hornet – NASA CEV (Crew Exploration

Vehicle)

– NASA Ares Rocket Program – NASA Voyager spacecrafts – Lockheed Martin X-33

– Galileo Space Probe

– Extreme Ultraviolet Explorer – …

13.2 Notable Expert Systems

(42)

• Database technology has eased the handling of relational data and provides efficient querying

– Typical queries

List the names of all bookstore with more than ten thousand titles

List the names of the customers with highest sales in the year 2007

• But what about queries with a spatial dimension?

– List all bookstores within ten miles of Hannover

– List the average amounts for

purchases of customers who live in Braunschweig and its adjoining area

13.3 Spatial Databases and GIS

(43)

• A Geographical Information System (GIS) is any information system capable of providing

geographically referenced information

– This includes integrating, editing, analyzing, sharing, and displaying information

• For storing and querying the information a specialized spatial database is used

– Highly optimized to store and query data related to objects in space, including points, lines and polygons

13.3 Spatial Databases and GIS

(44)

• The basic idea is to integrate special spatial functionality into (or on top of) a DBMS

– Besides many commercial and open-source specialized GIS implementations, there are database extensions – As of Oracle version 8i

Oracle Spatial is offered, simlarly IBM offers a

DB2 Spatial Extender and the Informix

Spatial DataBlade

13.3 Spatial Databases and GIS

Spatial application DBMS

Interface to DBMS

Interface to spatial application

Taxonomy Data types Operations Query language Algorithms Access methods

Core Spatial Functionality

(45)

• The Open Geospatial Consortium (OGC) is an international industry consortium developing publicly available geo-processing specifications

– Consists of 334 companies, government agencies and universities participating in a consensus process

Open interfaces and protocols defined by OpenGIS Specifications support interoperable solutions that „geo- enable‟ the Web, wireless and location-based services, and mainstream IT

– Open Geospatial Consortium protocols include e.g., the Web Map Service (WMS) and Web Feature Service (WFS)

13.3 Data Exchange

(46)

• Cro-Magnon hunters already stored information mixing graphical elements with attributes

– Associated with animal drawings are track lines and tallies thought to depict migration routes

13.3 History of GIS

(47)

• It is conjectured that also star formations are shown in very early artifacts

– A representation of the Pleiades?

13.3 History of GIS

(48)

• The sky disk found at Nebra, Germany has been dated to about 1600 BC

– A bronze disk of around 30 cm diameter, patinated blue- green and inlaid with gold astronomical symbols

– Interpreted generally as a sun or full moon, a lunar crescent, and stars

including a cluster interpreted as the Pleiades

– Two golden arcs along the sides, marking the angle between the solstices, were added later

13.3 History of GIS

(49)

• In 1854, John Snow depicted a cholera outbreak in London

Points on a map represented the locations of individual cases

– The study of the

distribution of cholera led to the source of the disease, a contaminated water pump in the

middle of the cholera outbreak

13.3 History of GIS

(50)

Applications for geographic information system technology are…

– Scientific investigations, resource management, asset management, environmental impact assessment, urban

planning, cartography, criminology, history, sales, marketing, and logistics

• For example,…

– GIS might be used to find wetlands that need protection from pollution

– GIS can be used by a company to site a new business location to take advantage of a previously

underserved market

13.3 Spatial Databases and GIS

(51)

How does it work?

– Users interact with an interface or pose queries directly

• Basically there are three types of queries

Basic spatial operations on all data types

• E.g., IsEmpty, Envelope, Boundary,…

Topological/set operators

• E.g., Disjoint, Touch, Contains,…

Spatial analysis operators

13.3 Spatial Queries

(52)

• Form an entity to hold country names, populations, and geographies

– CREATE TABLE Country( Name varchar(30), Population Integer,

Shape Polygon);

• Form an entity to hold river names, sources, lengths, and geographies

– CREATE TABLE River( Name varchar(30), Source varchar(30), Distance Integer,

Shape LineString);

13.3 Spatial Data Entity Creation

(53)

• Find all the countries that border on Germany

– SELECT C1.Name

FROM Country C1, Country C2

WHERE Touch(C1.Shape, C2.Shape) = 1 AND C2.Name = „Germany‟

• Find all the countries through which the Danube river runs

– SELECT C.Name, R.Name FROM Country C, River R

WHERE Intersect(C.Shape, R.Shape) = 1 AND R.Name = „Danube‟

13.3 Example Spatial Query

(54)

• A main issue is the indexing in spatial databases, since geographical data is high dimensional

– 3 spatial dimensions

– Often also a temporal dimension – Dimensions for the actual

data attributes

• Typical high-dimensional index structures include

R-Trees, Grid File indexes, etc.

13.3 Database Indexing

(55)

• For traversing indexes space filling curves can be utilized (e.g., Hilbert curve, Z-curve)

– They achieve a better ordering of multidimensional objects in a tree node

– This ordering has to be good, in the sense that it

should group similar data rectangles together to minimize the area and perimeter of the resulting

minimum bounding rectangles

13.3 Database Indexing

(56)

• The Human Genome Project

– "Unravelling the three billion or so base pairs of our entire DNA has been compared with landing on the Moon, splitting the atom and even inventing the wheel." , Nature, February 2001

– "[...the publication of] the human genome sequence is likely to be greeted with the same awestruck

feeling that accompanied the landing of the first human on the moon [...]", Science, Februar 2001

13.4 Bio-Information Systems

(57)

13.4 Bio-Information Systems

(58)

13.4 Bio-Information Systems

Sequence Proteins Networks Organism

(59)

• Hundreds of molecular biological databases exist:

– Presently more than 1078 – 110 more than last year – Database issue of journal

"Nucleic Acid Research", always in January,

13.4 Bio-Information Systems

http://nar.oxfordjournals.org/content/vol36/suppl_1/index.dtl

(60)

13.4 Bio-Information Systems

(61)

13.4 Bio-Information Systems

National Center for Biotechnology Information at National Institutes of Health, USA

(62)

13.4 Bio-Information Systems

DNA & RNA Proteins Networks Organism

Sequence

Structure

Amino acid sequences Nucleid acid

sequences

DNA RNA

Gen Genom

Enzym functions Protein protein

interactions all

Genexpression Pathways+

Variation Mutation+

Diseases Immun biology

Drug

development Antibody research

other Genetic engineering

Model organisms Microarray data

Metabolic pathways

Signal transduction pathways

(63)

• Highly overlapping data

• Highly linked databases

• Data integration needed!

13.4 Bio-Information Systems

Source: GenMapper (Do / Rahm, http://ducati.izbi.unileipzig.de:8080/

GenMapper/servlet/gui.MainFrame)

(64)

• Use case:

"Retrieve sequences for all human expressed sequence tags (ESTs) that by BLAST are >60%

identical over >50 amino acids to mouse channel genes expressed in central nervous system."

• Sources:

A Practitioner’s Guide to Data Management and Data Integration in

Bioinformatics, Barbara A. Eckman in Bioinformatics by Zoe Lacroix and Terence Critchlow, 2003, Morgan Kaufmann.

– U. Leser / F. Naumann, VL "Informationsintegration", WS 06/07 bzw. 05/06

13.4 Use Case

(65)

• Involved data sources and tools:

– Mouse Genome Database (MGD) @ Jackson Labs – SwissProt @ EBI

– BLAST tool @ NCBI

– GenBank nucleotide sequence database @ NCBI

• All freely available

13.4 Use Case

(66)

• Traditional approach:

browsing

– Search for channel

sequences expressed in CNS tissues using the MGD query form

13.4 Use Case

(67)

• MGD result:

– 14 genes from 17 experiments

• Look at details for each of the 14 genes

– On average 5 SwissProt links per gene

13.4 Use Case

(68)

13.4 Use Case

• Visit each SwissProt page

– launch BLAST search

• Inspect each BLAST result

– eliminate non-human sequence hits

– check other constraints (e.g. >60% identity over >

50 amino acids)

(69)

• For each remaining entry:

– retrieve complete EST sequence from GenBank

• How often did we click?

– 1 + (Search)

– 14 + (Gene details) – 14*5 + (SwissProt) – 14*5 + (Blast)

– 14*5*X (Genbank)

13.4 Use Case

(70)

• What we really would like to have:

SELECT g.accnum, g.sequence FROM genbank g, blast b,

swissprot s, mgd m WHERE m.exp = “CNS”

AND m.defn LIKE “%channel%”

AND m.spid = s.id AND s.seq = b.query AND b.hit = g.accnum

AND b.percentid > 60 AND b.alignlen > 50

13.4 Use Case

(71)

• E.g. DB2s Information Integrator:

13.4 Use Case

Information Integrator

(72)

• Why do we need wrappers at all?

• All classical biological databases emerged from books:

– Data collections for DNA, protein sequences and structures etc.

– Published annually / quarterly

– Book – issue – CDROM – FTP – WWW

• Flat, text oriented data model

– Structured according to "Entries"

– Many items as free text

– For human, not for machines – Flat file format

For data exchange

Sometimes still for storage

– Increasingly use of XML formats

13.4 Bio-Information Systems

(73)

• Flat files

13.4 Bio-Information Systems

Key Author

Molecule name Species

Links to other databases References to

Gene Ontology Reactions in which

the molecule

participates

(74)

Key Author

Molecule name Species

Links to other databases References to

Gene Ontology

• XML format

13.4 Bio-Information Systems

Referenzen

ÄHNLICHE DOKUMENTE

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 12 EN 1.6.1.. 1.1 Characteristics

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2.. 2 Physical

Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 4 SKS 10.5!. 3.1 Introduction to

– Both of the two child nodes to the left and right of the deleted element have the minimum number of elements (L-1) and then can then be joined into a legal single node with

• Cost estimate: (Height of the tree or 1 for hash index) plus #pages that contain overflow lists. 5.5

• For a non-cluster index the expected number of blocks accessed is #blocks ≈ #result size. • If no index is given, #blocks

• DB administrators may provide optimization hints to override optimizer heuristics and results. – Uses explain statement‟s PLAN_TABLE

8.1 Basic join order optimization 8.2 Join cost and size estimations 8.3 Left-deep join trees.. 8.4 Dynamic programming 8.5