Data Warehousing
& Mining Techniques
Exercise Sheet 1 April 20th, 2021
• A data warehouse is basically a (large) database storing copies of transactional data for analytical purposes (and not for the daily business!)
• It is mostly defined by Inmon to fulfill the following characteristics w.r.t. to the data
contained in it:
– Subject orientation – Integration
– Non-volatility – Time variancy
Exercise 1.1
• OLTP systems are used for day-to-day tasks like process invoices, register
incoming materials and so on.
– The data stored by those systems represent the current state of the business (stored in ODS)
• OLAP allows multi-dimensional analysis of the data, i.e. on historical data and not on the ODS!
– So a data warehouse is an OLAP system?
Exercise 1.2
• Not really!
• OLAP systems are usually build upon data warehouses. The DW stores the data and OLAP uses it. However, sometimes the term OLAP is
used as a synonym for a data warehouse ….
Exerise 1.2
The Data Warehouse …
… and OLAP
More on OLAP in Lecture 6
• In Data Mining we build a model of our data and try to produce insights based on the model.
• Examples:
– The price of gasoline usually increases during public holidays
– Customers who spent more then $100 on dumb bells are likely to buy protein supplements
Exercise 1.3
Exercise 2.1
Classical SDLC Requirements gathering
Analysis Design Programming
Testing Integration Implementation
We know what we want and therefore build the system in a top down fashion!
DW SDLC
Implement warehouse Integrate data
Test for bias Program against data
Design DSS system Analyze results
Understand requirements
We don‘t know what we want but we will learn it during the implementation!
Exercise 2.2
Talk to the end users; determine the relevant ODS, KPIs and design the logical and physical model
Lecture 3
Use the input from the design phase, prototpye it and go back to the user
Deploy the solution into production Day-to-day business, esp. management of ETL
Lecture 7
Improve the system over time
The case for 𝐁𝐢𝐨𝐧𝐭𝐞𝛘
• Mid-size manufacturer of antibiotics (in Germany)
• Chance to register a new antibiotic
• However, now they are in dire need for a new analytics solution
1. Name some advantages of a data warehouse in comparison to the legacy solution.
2. Give some examples of subjects which may be suitable for Bionteχ.
3. Would you agree, that the data integration is a moderate task based on the information you have?
4. Name and explain a few examples for analytics based on the data warehouse.
• Holistic view on the company
• Raw data allows for more ways to analyse the data
• Data preparation is standardized (less error prone)
• More complex queries and analytics possible
(how to query over multiple excel spreadsheets with reasonable effort
Exercise 3.1
• Some examples for subjects:
– Customers (activities, orders, aquisitions, custom antibiotica formulas etc.)
– Product (batches, costs, sales, etc.)
– Regions (sales, customers, potential admission regions etc.)
– Suppliers (…)
Exercise 3.2
• Definitively not!
• Inmon: subject-oriented, integrated, …
– We need to know which data we want and how to integrate it into a common schema
• ERP and CRM may have different formats [classic: product numbers in production and sales are not identical]
• External reports my contain different date formats or currencies
• Building Data Pipelines is time consuming
Exercise 3.3
Exercise 3.4
Data Warehouse
OLAP Data Mining
Visualization Reporting
• What did we sell in the last year and where?
• What‘s the value for a certain KPI?
• Etc.
How much units of the new antibiotic can we sell in the US if registered there next year?
Typically a collection of KPIs and emerging trends
Explain the data in a visual format
Exercise 3.5
Some DB
Exercise 3.5 Bonus
• So, what is a data lake?
– Basically what you want it to be …
• Typically the idea of storing all data in its raw
format in a central place and then run analytics on it
AWS Lake Formation
However …
• Data Warehouse Architecture
– Basic architectures – Storage models
– Layers
– Middleware
Next Lecture