Wolf-Tilo Balke
Benjamin Köhncke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Relational Database Systems 2
14. Non-Standard Applications 2
14.1 Distributed databases & P2P 14.2 Multimedia databases
14.3 IR & Web search 14.4 Data warehousing
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2
14 Non-Standard Applications
• Up to now we have only considered scenarios where a central DBMS is responsible for
– Keeping the data consistent and persistent
– Optimizing query plans and minimizing disk accesses – Guaranteeing the ACID properties of interaction
– Controlling the data security and privacy
• Especially, all meta-information about the processes is concentrated at the DBMS
14.1 Distributed Databases
• But organizations usually have a variety of different systems to keep data records
– Customer database, human resources, product catalogs, financial records
• How can a complete view of the distributed data be provided, keeping all the advantages of a single central database system?
– Integrate data, but avoid single point of failure
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 4
14.1 Distributed Databases
• A distributed database is a database
– Under the control of a central database management system
– Operations run on multiple computers located in the same physical location, or may be dispersed over a
network of interconnected computers
– Storage devices are not all attached to a common CPU
– Collections of data can be distributed across multiple physical locations
14.1 Distributed Databases
• Typical scenarios for distributed databases apply to intranets in companies
• Solutions needed for
– Dealing with redundancy – Distributed access control – Distributed querying
– Distributed transactions – Distributed recovery
– …
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 6
14.1 Distributed Databases
• Database fragmentation
– Horizontal (row distribution) – Vertical (column and/or
table distribution)
• Distributing databases (top-down)
– You have a database
– How to split and allocate data to individual sites?
• Integrating databases (bottom-up)
– Combine existing databases
– How to deal with heterogeneity & autonomy?
14.1 Distributed Databases
• Major advantages
– Reflects organizational structure
• Database fragments are located in the departments they relate to
– Local autonomy
• A department can control its own data and enforce policies
– Improved availability
• A fault in one database system will only affect one fragment, instead of the entire database
– Improved performance
• The database fragments enable parallelized operations, allowing load on the databases to be balanced among servers
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 8
14.1 Distributed Databases
• But not only managed data access within some structured organization is necessary, but also data access via the Internet
– Distributed information sources – Deep Web / Hidden Web
– E-Commerce, comparison shopping
• Data transfer over the Web makes up for a large amount of bandwidth
– About 90% of network resources (including file sharing applications)
14.1 Distributed Databases
• Avoiding a single point of failure is mostly beneficial for organizations
– The company‟s central database is down, the complete company is out of business?
– On the other hand multiple points of access may pose severe security issues…
• For unstructured data the
peer-to-peer paradigm is often used for content access
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 10
14.1 Distributed Databases
• P2P systems are overlay architectures, with the following characteristics
– Two logically separate networks – Mostly IP based signaling
– Decentralized and self organizing
– Employ distributed shared resources
14.1 Distributed Databases
• First application: file sharing
– Classical application of P2P systems
• Large distributed database of files (music , videos, etc.) for free download
• The application provides a unified view
• Napster, Gnutella & Co
– First large scale occurrence of digital copyright infringement
• Strong reactions from industry,
e.g. Recording Industry Association of America (RIAA)
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 12
14.1 Distributed Databases
• But also legal distribution of software/updates
– Basic idea of distributing software updates or patches in a P2P fashion
• Obviously used for obtaining updates for P2P client software (Gnutella
& Co)
• But also for a wide variety of other software distributions
– Prominent examples
• Patches for the game „World of Warcraft‟ by Blizzard Entertainment
• Linux company Lindows distributes their Linspire (prev. LindowsOS) via P2P
– Technology used
• Today mostly BitTorrent (Block-based File Swarming)
• Microsoft‟s Avalanche (File Swarming with Network Coding)
14.1 Distributed Databases
• Also for data management within
organizations the peer-to-peer paradigm begins to be used
– E.g., Microsoft Groove for Office document management
– From the advertisements:
• Empower business teams
without sacrificing centralized control
• Bring the data center to your teams, wherever they work
• Help business teams collaborate easily and with enhanced security across boundaries
• …
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 14
14.1 Distributed Databases
• Example: distributed data systems are important in astronomy
– No site can hold all information
• Telescope image archives are already in the multi-TB range
• Promise to grow larger quickly with the increasing size of digital detectors and the advent
of new all-sky surveys
14.1 Distributed DBS
• Much of the astronomical information is dynamic
– Static catalogs and indexes quickly become obsolete
• Astronomers use multiple types of data
– images, spectra, time series, catalogs, journal articles,...
– All should be easily located and easily accessed with query terms and syntax natural to the discipline
• Astronomers need to know the provenance of the data they are using and all details about it
– No one data center is able to have expertise in the wide range of astronomical instrumentation and data sets
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16
14.1 Distributed DBS
• Sample distributed datasets at NASA
14.1 Distributed DBS
Solar System Exploration – Lunar and planetary science data and mission information
Heliophysics –
Space and solar physics data and mission information
Universe Exploration – Astrophysics data and mission information
http://nssdc.gsfc.nasa.gov/
• NASA/MIT “Hopping Mars Exploration Microbots”
– Data collection in harsh, unreachable environments – Mars, Moon, etc (incl.
underground caves)
– Large swarms (1000+ units)
– Single units are small, cheap and expendable
– Communication using low-power, short-range wireless
• Units relay messages towards the mothership
• Cave exploration possible by building message chains
– Huge advantage vs. single unit robots (rovers etc.)
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 18 WV 12.1
14.1 Distributed DBS
• Relational databases efficiently store and retrieve structured data
– Bank accounts, customer data,…
• How to achieve persistent storage of media like
– Text documents
– Vectorgraphics, CAD – Image, audio, video
• What about content-based retrieval?
– Efficiently searching media content – Standardization of meta-data
(e.g., MPEG-7, MPEG-21)
14.2 Multimedia Databases
• Find all images in the database that show a sun set!
• What are their common characteristics? Can we only retrieve them by meta-data annotations?
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 20
14.2 Multimedia Databases
• Characterization by low level features
– Color information may help to discriminate images – A frog is no sun set
– But not all frogs are green…
14.2 Multimedia Databases
• Different types of features can be combined with other features to aid retrieval
– For instance Fourier transform for textures
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 22
14.2 Multimedia Databases
• Also continuous data can be described
– For perception of audio data psychoacoustic models are helpful
– Waveforms can actually be described by some features similar to image features
14.2 Multimedia Databases
• Harmonies in music allow to recognize and compare melodies
– Query by Humming and ‘sounds like’ search
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 24
14.2 Multimedia Databases
• Videos combine many techniques used in audio and image retrieval and interleave them respecting the structure of the video
14.2 Multimedia Databases
Story Unit Story Unit Story Unit Structural
Unit
Structural Unit
Structural Unit Structural
Unit Structural
Unit
Shot Shot Shot Shot Shot Shot
Frames
• The Video is broken down to its shots and the shots are individually compared for similarity
– Efficient methods needed for shot detection and effective video similarity measures
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 26
14.2 Multimedia Databases
• Clustering techniques allow to estimate the similarity of entire video sequences and movies
– Very interesting, e.g., for finding movies of similar genre, or detecting copyright infringements
14.2 Multimedia Databases
• Result presentation for videos is another major problem for the interface design
– Automatic generation of storyboards or trailers
– Results from film theory hint at how directors achieve certain effects and can be exploited
• E.g., extracting special effects or characteristic scenes
– The audio track of movies can be used to find important phrases or keywords
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 28
14.2 Multimedia Databases
• In summary: a variety of mathematical techniques can be used for describing media perception
– Fourier transform, wavelets, random field models, multi resolution analysis, etc.
– They describe different aspects of images, audio, and video files
14.2 Multimedia Databases
• Multi-dimensional indexing techniques provide means for efficient retrieval
– Filter and refine
– Karhunen-Loeve transform – R-trees, X-trees, M-trees
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 30
14.2 Multimedia Databases
• State of the art is interesting to see:
– Actually, very beta…
• Operating since 2005, not much changed
• Bought YouTube for USD 1.65 billion, not much changed
• Offered copyrighted material from 2006 („online TV‟), stopped service middle of 2007
– Results are not focused on what is shown in the video, or how the video is probably perceived by users
– Search focuses only on keywords in video title, meta- data, and surrounding text
– Simple extension of classical Web search without taking the medium into account
14.2 Multimedia DBS
• Look for “sunflower” in image search
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 32
14.2 Multimedia DBS
• In contrast look for “sunflower” videos
14.2 Multimedia DBS
• Results for “sunflower”
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 34
14.2 Multimedia DBS
• Extremely relevant for practical applications is the retrieval of textual documents
– The importance of information retrieval (IR) was already recognized in the 1940ies
– In contrast to relational data, texts are unstructured – The goal is to find documents
from a large collection that are relevant with respect to an information need
14.3 IR & Web Search
• Origins in period immediately after World War II
– Tremendous scientific progress during the war – Rapid growth in amount of scientific publications
available
• The “Memex Machine”
– Conceived by Vannevar Bush, President Roosevelt's science advisor
– Outlined in 1945 Atlantic Monthly article titled “As We May Think”
– Foreshadows the development of hypertext (the Web) and information retrieval system
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 36
14.3 IR & Web Search
14.3 IR & Web Search
• Relational databases are rather useless for this task…
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 38
14.3 IR & Web Search
IR System DBMS
Imprecise Semantics Precise Semantics
Keyword search SQL
Unstructured data format Structured data
Read-Mostly. Add docs occasionally Expect reasonable number of updates Page through ranked result list Generate full answer sets
• First systems used the “bag of words” model
– The content of a text is characterized by the terms that it contains
• Treat all the words in a document as index terms for that document
• Disregard order, structure, meaning, etc. of the words
– Relevance regarding a query is measured using the matching
between text terms and query terms and the (normalized)
number of term occurrences
• Simple, but effective…
14.3 IR & Web Search
• But there is more to it: words in documents have a certain importance for the document
• Idea: Hans Peter Luhn (IBM), 1958
– Terms that appear often in a document should get high weights
– Terms that appear in many documents should get low weights
– Leads to the idea of term frequency and inverse document frequency
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 40
14.3 IR & Web Search
• First and most influential system was the SMART system by Gerald Salton
(Cornell, 1965)
– Weighting with TFxIDF measure
– Documents and queries are considered as points in a high dimensional vector space
– The cosine similarity assesses the relevance of a document
with respect to a query
14.3 IR & Web Search
t1 d2
d1 d3
d t3
φ θ
• But text search is far more complicated
– Synonyms, hyponyms, etc.
– Phrases and distance between terms (n-grams) – Citations and references to other documents – …
• Moreover, retrieval efficiency and effectiveness is crucial for the user
– Inverted file indexes, stop-word lists,...
– Relevance feedback, query refinement,…
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 42
14.3 IR & Web Search
• The WWW started with a telephone book kept up by Sir Tim Berners-Lee at CERN and quickly developed into a vast variety of
interconnected Web servers
– Berners-Lee‟s first Web server
14.3 IR & Web Search
• The advent of the World Wide Web opened up another challenge in IR: distributed IR
– Basically the Web is a large document collection scattered over a multitude of servers
• How can the best pages in the WWW be found efficiently?
– First search engines just relied on crawling Web sites and building up a large index
– Traditional IR techniques then were used on that index and returned a list of best matching pages
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 44
14.3 IR & Web Search
• The first engines (1993-1994) all relied only on IR techniques
– Infoseek, Lycos, AltaVista, Inktomi, HotBot, etc.
• In 1998 Google also took the structure of the Web into account
– Link analysis favored pages that are pointed to from many others and from more important pages
14.3 IR & Web Search
• Actually link analysis plus IR techniques
proved to be a quantum leap in result relevance
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 46
14.3 IR & Web Search
46,47
17,16 13,76
12,87
9,74
worldwide market share as of Dec. 2007
Google Yahoo!
Baidu MSN others
• The innovation was the PageRank algorithm invented by Google‟s founders Larry Page and Sergey Brin
– The relative importance of a site is derived from the hyperlinks pointing to it
– Links propagate scores through the system
– Getting many links from important sites improves the „belief‟ that a site is relevant regarding a topic
14.3 IR & Web Search
• How to be top of the list?
– Manipulating IR techniques in Web search is commonplace for business advantage
• Example: Hommingberger Gepardenforelle
– In April 2005, Heise started a competition in which sites would rank highest in Google for query „Hommingberger Gepardenforelle”
– The closing day was December, 15
thDatenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 48
14.1 IR & Web Search
• All optimization techniques to boost the rank at Google were allowed
– White Techniques: Recommended by search engines
• Good Design
• Useful and diverse content
• Active user community
• …
– Black Techniques: Forbidden and penalized by search engines
• Link Farming
• Index Spamming
• Cloaking
• …
14.1 IR & Web Search
• Before the competition, obviously no major search engine listed the term
• Results grew quickly since more and more participants entered the contest
– Two checkpoints:
short-term success and long-term
success
– Search engines also included Yahoo!,
MSN, and Seekport
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 50
14.1 IR & Web Search
• The techniques used differed
– In the long run, white techniques prevailed
– Most important was to use the good reputation of already existing Web sites to profit from high
PageRank scores
• The respective Wikipedia page was among the first results not only for the Google engine
– Also taking the keyword into the page address was a good idea as the first four results of Google showed
1. www.hommingberger-gepardenforelle.de 2. www.hommingberger-gepardenforelle.net
3. de.wikipedia.org/wiki/Hommingberger_Gepardenforelle 4. www.hommingberger-gepardenforelle-page.de
14.1 IR & Web Search
• The winner‟s Web site was well interconnected and emulated a small
company specializing on the fish
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 52
14.1 IR & Web Search
• Similar results were obtained in English
language contests, e.g., with query „nigritude ultramarine‟
– Here the contest winner stated that his goal in
entering the contest was to "prove that real content trumps all the shady optimization tricks that someone can figure out"
– Instead of resorting to such tactics, he simply wrote a Weblog entry and asked his readers to link to it
14.1 IR & Web Search
• But search engine optimization can also occur from the outside (Google bombing)
– Because of the way that Google's algorithm works, a page will be ranked higher, if the sites that link to that page use consistent anchor text
– A Google bomb is created if many sites link to the page in this manner
– Example: a search for "miserable failure“
on September 29, 2006 brought up the official George W. Bush biography
number one on Google, Yahoo!, and MSN and number two on Ask.com
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 54
14.1 IR & Web Search
• For business-oriented data relational systems build a good foundation, but…
– There is a huge flood of updates in productive databases
– For data analysis purposes data often has to be transformed and aggregated
– Reports have to be generated quickly to support important decisions
• Should such stress be put on top of
operational database systems’ workloads?
14.4 Data Warehousing
• Basic idea: Don‟t put stress on your crucial
DBMSs, but use a second independent system
– Data Warehouses
• provide a unified view of business data and
• provide retrieval of data without slowing down the operational systems
• facilitate decision support system applications such as trend reports or market analysis
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 56
14.4 Data Warehousing
• The data warehousing concept was intended to
– Provide an architectural model for the flow of data from operational systems to decision support environments
– Address the various problems associated with this flow
• Especially the high costs associated with it, i.e., an enormous amount of redundancy of information was required to support the multiple
decision support environment that usually existed
– Unify multiple decision support environments that typically operated independently
• Each serving different users, but often requiring the same data
• The process of gathering, cleaning and integrating data from various sources (legacy systems) was in part replicated for each environment
– Allow for new requirements necessitating gathering, cleaning and integrating new data from the operational systems that
were logically related to prior gathered data
14.4 Data Warehousing
• The concept of data warehousing dates back to the 1980ies
– In 1983 Teradata introduces a database
management system specifically designed for decision support
– In 1988 Barry Devlin and Paul Murphy introduce the term „business data warehouse‟ (IBM Systems Journal) – From 1990 on several warehousing systems are
released (Red Brick Systems, Prism Solutions,…)
– In 1997 Oracle releases version 8i with support for star queries
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 58
14.4 Data Warehousing
8 i
• A data warehouse provides a common data
model for all data of interest regardless of the data's source
– Data is usually scattered over several systems in
companies: sales invoices, order receipts, production data, etc.
– For reporting and analysis the data would have to be retrieved from each respective source, transformed into a common model and then aggregated
• Before loading into the warehouse all data can be cleaned
– Inconsistencies can be identified and resolved
14.4 Data Warehousing
• Ways to build a data warehouse
– Top down
• The data warehouse is designed using a normalized enterprise data model
• Atomic data, i.e., data at the lowest level of detail, is stored in the warehouse
• Dimensional aggregations (data marts) containing data needed for
specific business processes or specific departments are created directly from the data warehouse
– Bottom up
• Data marts are first created to provide reporting and analytical capabilities for specific business processes
• Data marts contain atomic data and, if necessary, also aggregated data
• These data marts can eventually be unioned together to create a comprehensive data warehouse
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 60
14.4 Data Warehousing
• Global enterprise data models are optimized for efficient retrieval
– The most simple schema is the star model
– The model consists of a (few) central fact tables that are are connected to multiple dimensions
– All dimensions are denormalized with each dimension being represented by a single table
• If dimensions are normalized into several related tables with minimized redundancy, the snowflake model evolves
14.4 Data Warehousing
• Example
– Each dimension table has a primary key Id, relating to one of the primary key columns of the fact table
– The non-primary key Units_Sold of the fact table represents a measure or metric that can be used in calculations and analysis – The non-primary key columns of the dimension tables
represent additional attributes of the dimensions
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 62
14.4 Data Warehousing
• Queries on data warehouses follow the paradigm of online analytical processing (OLAP)
– Exploiting the multidimensional data model of the warehouse allows for complex analytical and ad- hoc queries with a rapid execution time
– The heart of any OLAP system is an OLAP cube consisting of
numeric facts called measures
that are categorized by dimensions
14.4 Data Warehousing
• The OLAP cube metadata is typically directly created from a star or snowflake schema
– Measures are derived from the records in the fact table and dimensions are derived from the
dimension tables
• The output of an OLAP query is typically displayed in a matrix format
– The dimensions form the row and column of the matrix, and the measures determine the values
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 64
14.4 Data Warehousing
• The most important mechanism in OLAP for performance is the use of aggregations
– Aggregations are built from the fact table by changing the granularity on specific dimensions and
aggregating data along these dimensions
– The number of possible aggregations is determined by every possible combination of dimension granularities
• The combination of all possible aggregations and the base data contains the answers to every query which can be answered from the data
14.4 Data Warehousing
• Potentially a large number of aggregations has to be precalculated
– Often only a predetermined number are fully
calculated, while the remainder are solved on demand
• The problem of deciding which aggregations to calculate is called the view selection problem
– The objective of view selection is typically to
minimize the average time to answer OLAP queries – View selection is NP-complete, but many strategies
are used: greedy or genetic algorithms, A* search,…
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 66
14.4 Data Warehousing
• Data warehousing has a history of innovations that did not really receive market acceptance
– Executive information systems
• EIS was a much-hyped, forward-looking development sporting attractive graphical user interfaces, new- fangled mice, and touch screens
• The flaw was in the workflow: senior executives spend their time in meetings, reading printed reports and making presentations, and delegate the analyzing of data…
• The actual analysts need powerful tools, rather than patronizingly simple „business cockpits‟
14.4 Data Warehousing
– Database-embedded OLAP
• It seemed only a matter of time before database vendors embedded OLAP directly into relational database engines (ROLAP)… Well, it didn‟t happen….
• Oracle still uses a version of the Express engine (Express was acquired by Oracle in the early 1990ies) and stores its multidimensional data in the relational database, but as a MOLAP, i.e., stored as a set of blobs
• Microsoft bundles Analysis Services with SQL Server, but it doesn‟t even pretend that the two are integrated at the
product level
Datenbanksysteme 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 68
14.4 Data Warehousing
– New visualization methods
• Business graphics are more colorful, easier to produce and faster than ever, but the popular basic chart types haven‟t changed in decades
• Numerous new visualization techniques have been introduced, but not been widely adopted, despite the
apparent need for them in business intelligence applications with large data volumes to report
• One problem seems to be that the more advanced charts are not immediately intuitive for the reader