Data Warehousing
& Mining Techniques
& Mining Techniques
Wolf-Tilo Balke Silviu Homoceanu
Institut für Informationssysteme
Technische Universität Braunschweig
• Last week:
– What is a Data Warehouse – Applications and users
– Lifecycle and phases
2. Summary
Lifecycle and phases
• Architecture and Data model
2. Architecture
2.1 Basic Architecture 2.2 Storage Structures 2.3 Tier Architectures
2. Architecture
2.3 Tier Architectures 2.4 Distributed DW
2.5 DW Data Modeling
• Architecture of a DW
2.1 Basic Architecture
Users
Analysis
Warehouse
Operational
Data Sources Staging Area
Purchasing
Data Marts
Summary
Data Raw Data
Metadata
Analysis
Reporting Operational
System
Operational System
Purchasing
Sales
• The Data Staging Area
– Is both a storage and process area (the ETL process)
– It represents everything that
2.1 Basic Architecture
Warehouse Data
Sources Staging Area
Data Marts
It represents everything that
happens between the operational source system and the data presentation area
– The key architectural requirement for data staging
area is that it is off-limits to business users and
does not provide query and presentation services
• Customers aren’t invited to visit the kitchen…
– Similar to a restaurant’s kitchen, the data staging area should be accessible only to skilled professionals
2.1 Basic Architecture
• The Data Presentation Area
– Is where data is organized, stored and made available for queries, report writers,
2.1 Basic Architecture
Warehouse Data
Sources Staging Area
Data Marts
for queries, report writers,
and other analytical processing
– This area is the Warehouse as far as the business
community is concerned
• Storage structure
– After extraction from the operational data, in DW information is stored in databases
– The databases are operated by a DBMS
2.2 Storage Structures
– Different database structures can be used for a DW:
• Relational model (RDB) operated by a RDBMS
• MultiDimensional model (MDB) operated by a
MDBMS
• RDB and MDB are complementary and do not have to exclude each other
– In the staging area some RDBMS can be used,
however it must be off-limits to user queries because of performance reasons
2.2 Storage Structures
of performance reasons
– By default, normalized databases are excluded from
the presentation area, which should be strictly
multi-dimensionally (MDBMS)
• DB in relational model
– A database is seen as a collection of predicates over a finite set of variables
– The content of the DB is modeled as a set of relations in which all predicates are satisfied
2.2 Relational DB
relations in which all predicates are satisfied
Books Title
ISBN (PK) Price
Publisher Name ID (PK)
• A relation is defined as a set of tuples that have the same attributes
– It is usually described as a table
2.2 Relational DB
Attribute
Relation Tuple
• Multidimensional DB (MDB) are optimized for DW and OLAP applications
– They are created using input from the staging area – Designed for efficient and convenient storage and
retrieval of large volumes of data
2.2 Multidimensional DB
retrieval of large volumes of data
– Stored, viewed and analyzed from different
perspectives called dimensions
• Example: an automobile manufacturer wants to increase sale volumes
– Evaluation requires to view historical sale volume figures from multiple dimensions
– Sales volume by model,
2.2 MDB
– Sales volume by model, by color, by dealer,
over time
• A relational structure of the given evaluation would be
2.2 MDB
Model Color Sales volume
Mini VAN Blue 324
Mini VAN Black 113
Mini VAN Red 18
Mini VAN Red 18
Sedan Black 160
Sedan Blue 115
Sedan Red 6
Sports coupe Red 16
2.2 Multidim. Structure
113 324 18
160 115 6
Mini VAN
Sedan
289 451 40
455
281
* 1560
160 115 6
16 12 16
Coupe Sedan
Blue Red
Black
281
44
*
• The complexity grows quickly with the number of dimensions and the number of positions
– Example: 3 dimensions with 10 values each and no indexes
– If we consider viewing information
2.2 MDB
– If we consider viewing information in a RDB it would result in a
worst case of 10 3 =1000
records view
• Now, if we consider performance
– For responding to a query when car type = Sedan, color = Blue, and dealer = Berg
• RDBMS has to search through 1000 records to find the right record
2.2 MDB
right record
• MDB has more knowledge about where data lies
• The maximum of searches in the case of MDB is of 30 positions
• Average case 18 vs. 501 positions
• If the query is more relaxed
– Total sales across all dealers for all colors when car type = sedan
• RDBMS still has to go through the 1000 records
• MDB, however, goes only through a slice of 10x10
2.2 MDB
• MDB, however, goes only through a slice of 10x10
• Performance advantages
– MDBs are an order of magnitude faster than RDBMSs
– Performance benefits are more for queries that generate cross-tab views of data (the case of DW)
• Conclusion
2.2 MDB
• Conclusion
– The performance advantages offered by MDBs facilitates the development of
interactive decision support applications
like OLAP that can be impractical in a
relational environment
• Any database manipulation is possible with both technologies
• MDBs however offer some advantages in the context of DW:
2.2 RDB vs. MDB
in the context of DW:
– Ease of data presentation – Ease of maintenance
– Performance
• Ease of data presentation
– Data views are natural output of the MDBs
– Obtaining the same views in RDB requires a complex query
• Example with Walmart and Sybase:
– select sum(sales.quantity_sold) from sales, products, product_categories, manufacturers, stores, cities where manufacturer_name = ‘Colgate’
2.2 RDB vs. MDB
manufacturers, stores, cities where manufacturer_name = ‘Colgate’
and product_category_name = ‘toothpaste’
and cities.population < 40 000
and trunc(sales.date_time_of_sale) = trunc(sysdate-1) and sales.product_id = products.product_id
and sales.store_id = stores.store_id
and products.product_category_id = product_categories.product_category_id and products.manufacturer_id = manufacturers.manufacturer_id
and stores.city_id = cities.city_id
• Ease of data presentation
– Top k queries cannot be expressed well in SQL
• Find the five cheapest hotels in Frankfurt
– SELECT * FROM hotels h WHERE h.city = Frankfurt AND 5 >
(SELECT count(*) FROM hotels h1 WHERE h1.city = Frankfurt
2.2 RDB vs. MDB
(SELECT count(*) FROM hotels h1 WHERE h1.city = Frankfurt AND h1.price < h.price);
• Some RDBMS extended the functionality of SQL with STOP AFTER functionality
– SELECT * FROM hotels WHERE city = Frankfurt Order By price
STOP AFTER 5;
• Ease of maintenance
– No additional overhead to translate user queries into requests for data
• Data is stored as it is viewed
– RDBs use indexes and sophisticated joins which
2.2 RDB vs. MDB
– RDBs use indexes and sophisticated joins which
require significant maintenance and storage to
provide same intuitiveness
• Performance
– Performance of MDBs can be matched by RDBs through database tuning
– Not possible to tune the database for all possible ad- hoc queries
2.2 RDB vs. MDB
hoc queries
– Aggregate navigators are helping RDBs to catch up with MDBs as far as aggregation queries are
concerned
• When MDBs are in-appropriate?
– If the dataset types are not highly related, using a MDB results in a sparse representation
2.2 MDB
113 324 18
160 115 6
16 12 16
Mini VAN
Coupe Sedan
Blue Red
Black
34 25
45
Smith
Fox James
115 3
6
34
• When MDBs are appropriate?
– In the case of highly interrelated dataset types
MDBs are recommended for greatest ease of access and analysis
– Examples of applications
2.2 MDB
– Examples of applications
• Financial Analysis and Reporting
• Budgeting
• Promotion Tracking
Quality Assurance and Quality Control
• Popular DW architectures
– Generic Two-Tier Architecture – Independent Data Mart
– Dependent Data Mart and Operational Data Store – Logical Data Mart and Active Warehouse
2.3 Tier Architectures
– Logical Data Mart and Active Warehouse – Three-Tier Architecture
• Other
– One-Tier Architecture
– N-Tier Architecture
• Generic Two-Tier Architecture
– Data is not completely current in the DW – Periodic extraction
2.3 Layered Architectures
• Data analysis comes in two flavors
– Depending on the execution place of the analysis
• Thin Client
– Analytics are executed on the server – Client just displays
2.3 Layered Architectures
– Client just displays
– This architecture fits well for Internet/Intranet DW access
Server Client HTTP, IIOP
• Fat Client
– The server just delivers the data – Analytics are executed on the client
– Communication between client and server must be able to sustain large data transfers
2.3 Layered Architectures
Server Client
ODBC, JDBC, NFS
• Independent Data Mart
– Mini warehouses – limited in scope
– Separate ETL for each independent Data Mart – High Data Marts access complexity
2.3 Layered Architectures
High Data Marts access complexity
• Dependent Data Mart and Operational Data Store
– Single ETL for the DW
– Data Marts are loaded from the DW
2.3 Layered Architectures
– More simple
data access than
in the previous
case
• Logical Data Mart and Active Warehouse
– The ETL is near real-time
– Data Marts are not separate databases, but logical views of the DW
2.3 Layered Architectures
2.3 DW vs. Data Marts
DW Data Marts
Application independent
Specific DSS application Centralized, Decentralized by user area Planned Organic, possibly not planned
DW Data Marts
Scope
DW Data Marts
Multiple subjects One central subject
DW Data Marts
Many internal and external sources
Few internal and external sources
Data
Sources Subjects
DW Data Marts
Historical, detailed, summarized
Some history, detailed, summarized
Lightly
denormalized
Highly denormalized
Other characteristics
DW Data Marts
Flexible Restrictive
Data-oriented Project oriented
Long life Short life
Large Start small, becomes
• Generic Three-Tier Architecture
– Derived data
• Data that had been selected, formatted, and aggregated for DSS support
2.3 Layered Architectures
Data mart metadata Derived data
Data Mart
DSS support
– Reconciled data
• Detailed, current data
intended to be the single, authoritative source for all
DW metadata
Operational metadata Reconciled data
DW and ODS
Operational data
• One-Tier Architecture
– Theoretically possible
– Might be interesting for mobile applications
• N-Tier Architecture
– Higher tier architecture is also possible
But the complexity grows with the number of tier-interfaces
2.3 Layered Architectures
Higher tier architecture is also possible
• But the complexity grows with the number of tier-interfaces
• Web-based Architecture
– Advantages:
• Usage of existing software, reduction of costs, platform
independence
• In most cases the economics and technology greatly favor a single centralized DW
• But in some cases, distributed DW make sense
• Types of distributed DW
– Geographically distributed
2.4 Distributed DW
– Geographically distributed
• Local DW/global DW
– Technologically distributed DW
• Logically one DW, physically more DW
– Independently evolving distributed DW
• Geographically distributed
– In the case of corporations spread around the world
• Information is needed both locally and globally
– A distributed DW makes sense
2.4 Distributed DW
• When much processing occurs at the local level
• Even though local branches report to the same balance sheet, the local
organizations are their own companies
USA
2.4 Distributed DW
Europe Site A
Local DW
Asia
Site B
Local DW
All IBM
IBM/Teradata
Sybase
Local
operational Local
operational
USA
HQ
Local DW
Global DW operational
processing operational
processing
Local operational
processing
• Technologically distributed DW
– Placing the DW on the distributed technology of a vendor
– Advantages
The entry cost is cheap – large centralized hardware is
2.4 Distributed DW
• The entry cost is cheap – large centralized hardware is expensive
• No theoretical limit to how much data can be placed in the
DW – we can add new servers to the network
– As the DW starts to expand network data
communication starts playing an important role
• Example: Let’s simplify and consider we have 4 nodes holding each data regarding the last 4 years
• Now let’s consider we have a
2.4 Distributed DW
• Now let’s consider we have a query which needs to access the data from the last 4 years:
such a query arises the issue of transporting large amount of data between processors
2005
2006
2007
2008
• Independently evolving distributed DW
– In practice there are many cases in which independent DW are developed concurrently and uncontrolled in the same organization
• The first step many corporations make is to build a DW for
2.4 Distributed DW
• The first step many corporations make is to build a DW for financial or marketing
• Once it is successfully set up, other parts of the organization follow
independently the process resulting
in the coexistence of more indepen-
dent DW in the same organization
• Data Modeling / DB Design - Basics
– Is the process of creating a data model by analyzing the requirements needed to support the business
processes of an organization
• It is sometimes called database
2.5 Data Modeling
• It is sometimes called database
modeling/design because a data
model is eventually implemented
in a database
• Data models
– Provide the definition and format of data
– Graphical representations of the data within a specific area of interest
Enterprise Data Model: represents the integrated data
2.5 Data Modeling
• Enterprise Data Model: represents the integrated data requirements of a complete business organization
• Subject Area Data Model: Represents the data
requirements of a single business area or application
2.5 Phases
Requirement Analysis
Conceptual Design Functional
Analysis
Data requirements
Conceptual schema
Physical Design Application
Program Design
Transaction Implementation
Logical Design
Logical schema DBMS Independent
DBMS Dependent
• Conceptual Design
– Transforms data requirements to conceptual model
– Conceptual model describes data entities, relationships, constraints, etc. on high-level
• Does not contain any implementation details
• Independent of used software and hardware
• Logical Design (next lecture)
2.5 Phases
• Logical Design (next lecture)
– Maps the conceptual data model to the logical data model used by the DBMS
• e.g. relational model, dimensional model, …
• Technology independent conceptual model is adapted to the used DBMS software
• Physical Design (next lecture)
• Going from one phase to the next:
• The phase must be complete
– The result serves as input for the next phase
• Often automatic transition is possible with additional designer feedback
2.5 Phases
designer feedback
Conceptual
Design Logical
Design
Physical Design ER-diagram,
UML, … Tables,
• Highest conceptual grouping of ideas
– Data tends to naturally cluster with data from the same or similar categories relevant to the
organization
• The major relationships between subjects have
2.5 Conceptual Model
• The major relationships between subjects have been defined
– Least amount of detail
• Conceptual design
– See RDB1 course
– Entity-Relationship (ER) Modeling
• Entities - “things” in the real world
E.g. Car, Account, Product
2.5 Conceptual Model
Conceptual Design
ER-diagram, UML, …
Car Account Product
– E.g. Car, Account, Product
• Attributes – property of an entity, entity type, or relationship type
– E.g. color of a car, balance of an account, price of a product
• Relationships – between entities there can be relationships, which also can have attributes
Car Account Product
Car ColorColor
2.5 Conceptual Model
Student Professor
registration number registration
number
name
name department
id
attends teaches
instantiates time
day of week day of
week
room
semester
Lecture instance 1
N N 1 N
1
title credits
id Lecture
Course of
enrolls
part of
prereq.
curriculum semester curriculum
semester
N
N N
N
N N
• Conceptual design in usually done using the Unified Modeling Language (UML)
– Class Diagram, Component Diagram, Object Diagram, Package Diagram…
– For Data Modeling only Class Diagrams are used
2.5 Conceptual Model
Conceptual Design
ER-diagram, UML, …
– For Data Modeling only Class Diagrams are used
• Entity type becomes class
• Relationships become associations
• There are special types of associations like:
aggregation, composition, or generalization
CLASS NAME
attribute 1 : domain attribute n : domain operation 1
…
• Logical design arranges data into a logical structure
– Which can be mapped into the storage objects supported by DBMS
• In the case of RDB, the storage objects are tables which
2.5 Logical Model
Logical Design
Tables, Columns,
…
• In the case of RDB, the storage objects are tables which store data in rows and columns
AttributeTuple
• Physical design specifies the physical configuration of the database on the storage media
– Detailed specification of:
data elements, data types,
2.5 Physical Model
Physical Design
Tablespaces Indexes
data elements, data types, indexing options, and
other parameters
residing in the DBMS
data dictionary
• Managing Complex Data Relationships
– Helps keep track of the complex environment that is a DW
• Many complex relationships exist, with the ability to change over time
2.5 Data Model in DW
over time
– Transformations and integration from various systems of record need to be worked out and maintained
– Provides the means of supplying users with a
roadmap through the data and relationships
• Modeling business queries
– Goal
• Define the purpose, and decide on the subject(s) for the data warehouse
• Identify questions of interest
2.5 Conceptual Model
Time
• Identify questions of interest
– Subject
• Who bought the products?
(customers and their structure)
• Who sold the product? (sales organization) What was sold? (product structure)
Customers Employees
Products Business
Model
• For Conceptual design in DW conventional techniques like E/R or UML are not appropriate
– Lack of necessary semantics for modeling the multidimensional data model
– E/R are constituted to
2.5 Conceptual Model
– E/R are constituted to
• Remove redundancy in the data model
• Facilitate retrieval of individual records
– Therefore optimize OLTP
• Components
– Facts: a fact is a focus of interest for decision-making, e.g., sales, shipments..
– Measures: attributes that describe facts from
different points of view, e.g. , each sale is measured by
2.5 Conceptual Model in DW
different points of view, e.g. , each sale is measured by its revenue
– Dimensions: discrete attributes which determine the granularity adopted to represent facts, e.g. ,
product, store, date
– Hierarchies: are made up of dimension attributes
• Conceptual design models for DW
– Multidimensional Entity Relationship (ME/R) Model – Multidimensional UML (mUML)
– Other methods e.g., Dimension Fact Model,
2.5 Conceptual Model
Other methods e.g., Dimension Fact Model,
Totok approach, etc.
• ME/R Model
– Its purpose is to create an intuitive representation of the multidimensional data that is optimized for
high-performance access
– It represents a specialization and evolution of the E/R
2.5 Multidim. E/R Model
– It represents a specialization and evolution of the E/R to allow specification of multidimensional
semantics
• ME/R notation was influenced by the following considerations
– Specialization of the E/R model
• All new elements of the ME/R have to be specializations of the E/R elements
• In this way the flexibility and power of expression of the E/R
2.5 Multidim. E/R Model
• In this way the flexibility and power of expression of the E/R models are not reduced
– Minimal expansion of the E/R model
• Easy to understand/learn/use: the number of additional elements should be small
Representation of the multidimensional semantics
• There are 3 main ME/R constructs
– The fact node – The level node
– A special binary classification edge
2.5 Multidim. E/R Model
A special binary classification edge
Fact
Characteristics
Classification level
• Lets consider a store scenario designed in E/R
– Entities bear little semantics
– E/R doesn’t support classification levels
2.5 Multidim. E/R Model
Package
Article Store
Package District City Name
Date
is sold Is
in Is packed
in
Belongs Belongs
Is in 1
n n
n
m
• ME/R notation:
2.5 Multidim. E/R Model
Article Prod. Group
Prod. Family Prod. Categ
Sales Characteristics Store
City District
Region Country
Week Month Day Quarter
Year
• ME/R notation:
– Sales was elected as fact node
– The dimensions are product, geographical area and time
2.5 Multidim. E/R Model
– The dimensions are represented through the so called Basic
Classification Level
– Alternative paths in the classification level are also
Sales Characteristics Store
Article
Day
• UML is a general purpose modeling language
• It can be tailored to specific domains through the use of the following mechanisms
– Stereotypes: building new elements
2.5 Unified Modeling Language
– Stereotypes: building new elements – Tagged values: new properties
– Constraints: new semantics
• Stereotype
– Grants a special semantics to an UML construction without modifying it
– There are 4 possible representations of the stereotype in UML
2.5 mUML
stereotype in UML
Icon Decoration Label None
Fact 2 <<Fact>>
Fact 3
Fact 4
• Tagged value
– Define properties by using a pair of tag and data value
• Tag = Value
• E.g. formula=“UnitsSold*UnitPrice”
2.5 mUML
• E.g. formula=“UnitsSold*UnitPrice”
<<Fact-Class>>
Sales UnitsSold: Sales UnitPrice: Price /VolumeSold: Price
{formula=“UnitsSold*UnitPrice”
, parameter=“UnitsSold,
2.5 mUML
<<Dimensional-Class>>
Week
<<Dimensional-Class>>
Month
<<Dimensional-Class>>
Quarter
<<Dimensional-Class>>
Year
<<Dimensional-Class>>
City
<<Dimensional-Class>>
Region
<<Dimensional-Class>>
Land
<<Roll-up>>
Distributor Country
<<Roll-up>>
Country
<<Roll-up>>
Region
<<Roll-up>>
Year
<<Roll-up>>
Quarter
<<Shared -Roll-up>>
Year
1..2
<<Fact-Class>>
Sales
<<Dimensional-Class>>
Day
<<Dimension>>
Time
<<Dimensional-Class>>
Store
<<Dimensional-Class>>
Prod. Group
<<Dimensional-Class>>
Product
<<Dimension>>
Geography
<<Roll-up>>
Product Group
<<Roll-up>>
<<Roll-up>> City Week
<<Roll-up>>
Month