Data Warehousing & OLAP
Wolf-Tilo Balke Silviu Homoceanu
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de
Summary
• Last week:
– Storage structures: MDB
– Architectures: N-tier architectures
– Data Modeling – Conceptual Modeling
• Multidimensional Entity Relationship (ME/R) Model
• Multidimensional UML (mUML)
• This week:
– Data Modeling (continued)
3. DW Modeling
3.1 Logical Modeling: Cubes, Dimensions, Hierarchies
3.2 Physical Modeling: Star, Snowflake, Array storage
3.1 Logical Model
• Goal of the logical model
– Confirm the subject areas
– Create ‘real’ facts and dimensions from the subjects we have identified
– Establish the needed granularity for our dimensions
• Logical structure of the multidimensional model
– Cubes: Sales, Purchase, Price, Inventory
– Dimensions: Product, Time, Geography, Client
(Figure: the four cubes with their measures – Purchase: Amount, Price: Unit Price, Inventory: Stock, Sales: Turnover – over the dimension hierarchies Store → City → District → Region → Country; Article → Prod. Group → Prod. Family → Prod. Categ.; Day → Week and Day → Month → Quarter → Year; plus the Client dimension)
3.1 Dimensions
• Dimensions are entities within the data model, chosen for the purpose of analysis
– One dimension can be used to define more than one cube
– They are hierarchically organized
(Figure: the Article → Prod. Group → Prod. Family → Prod. Categ. hierarchy is shared by the Purchase, Price, and Sales cubes)
• Dimension hierarchies are organized in classification levels (e.g., Day, Month, …)
– The dependencies between the classification levels are described by the classification schema through functional dependencies
• An attribute B is functionally dependent on an attribute A, denoted A ⟶ B, if for all a ∈ dom(A) there exists exactly one b ∈ dom(B) corresponding to it
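The functional-dependency condition above can be checked mechanically. A minimal Python sketch (the function name and sample data are illustrative, not from the lecture):

```python
def holds_fd(rows, a, b):
    """Check whether attribute a functionally determines attribute b,
    i.e., every value of a maps to exactly one value of b."""
    seen = {}
    for row in rows:
        if row[a] in seen and seen[row[a]] != row[b]:
            return False
        seen[row[a]] = row[b]
    return True

days = [
    {"Day": "2008-11-13", "Month": "2008-11", "Year": 2008},
    {"Day": "2008-12-18", "Month": "2008-12", "Year": 2008},
]
print(holds_fd(days, "Day", "Month"))   # True: Day -> Month holds
print(holds_fd(days, "Year", "Month"))  # False: one Year, two Months
```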
• Classification schemas
– The classification schema of a dimension D is a semi-ordered set of classification levels ({D.K0, …, D.Kk}, ⟶)
– With a smallest element D.K0, i.e. there is no classification level with smaller granularity
• A fully-ordered set of classification levels is called a path
– If we consider the classification schema of the time dimension, then we have the following paths
• T.Day ⟶ T.Week
• T.Day ⟶ T.Month ⟶ T.Quarter ⟶ T.Year
– Here T.Day is the smallest element
• Classification hierarchies
– Let D.K0 ⟶ … ⟶ D.Kk be a path in the classification schema of dimension D
– A classification hierarchy over this path is a balanced tree which
• Has as nodes dom(D.K0) ∪ … ∪ dom(D.Kk) ∪ {ALL}
• And whose edges respect the functional dependencies
• Example: classification hierarchy for the path of the product dimension (Article ⟶ Prod. Group ⟶ Prod. Family ⟶ Prod. Categ.)
(Figure: tree with root ALL; categories Electronics, Clothes, …; families Video, Audio, …; groups Video recorder, Camcorder, TV, …; articles TR-34, TS-56, …)
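The balanced-tree view of such a hierarchy can be sketched as nested Python dicts, with articles as leaves; the helper and the exact leaf set are illustrative:

```python
# Product classification hierarchy: ALL -> categories -> families
# -> product groups -> articles (leaves, stored as lists).
hierarchy = {
    "ALL": {
        "Electronics": {
            "Video": {"Video recorder": ["TR-34"], "Camcorder": ["TS-56"]},
            "Audio": {},
        },
        "Clothes": {},
    }
}

def articles_under(node):
    """Collect all leaf articles below a classification node."""
    if isinstance(node, list):          # a leaf group's article list
        return node
    out = []
    for child in node.values():
        out.extend(articles_under(child))
    return out

print(articles_under(hierarchy["ALL"]["Electronics"]))  # ['TR-34', 'TS-56']
```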
3.1 Cubes
• Cubes consist of data cells with one or more measures
• A cube schema S(G,M) consists of a granularity G = (D1.K1, …, Dn.Kn) and a set M = (M1, …, Mm) representing the measures
– A cube C is a set of cube cells, C ⊆ dom(G) × dom(M)
• The coordinates of a cell are the classification nodes from dom(G) corresponding to the cell
– E.g., Sales ((Article, Day, Store, Client), (Turnover))
– …
3.1 Cubes
• Example: a 4-dimensional cube (supplier, city, quarter, product)
(Figure: sales figures arranged by Supplier = s1, s2, s3; cities Berlin, München, Paris, Braunschweig; quarters Q1–Q4; product groups Computer, Video, Audio, Telephones)
• We can now imagine n-dimensional cubes
– The n-D cube is called the base cuboid
– The topmost cuboid, the 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid
– The full data cube is formed by the lattice of cuboids
• But things can get complicated pretty fast
(Figure: lattice of cuboids for the dimensions time, item, location, supplier – the 0-D apex cuboid `all`; the 1-D cuboids time, item, location, supplier; 2-D cuboids such as (time,item), …, (location,supplier); the 3-D cuboids; and the 4-D base cuboid)
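The lattice is exactly the power set of the dimension set, so it doubles with every added dimension. A small sketch (dimension names taken from the figure):

```python
from itertools import combinations

def cuboids(dimensions):
    """All 2^n cuboids of the lattice, from the apex (0-D) to the base (n-D)."""
    return [combo for k in range(len(dimensions) + 1)
            for combo in combinations(dimensions, k)]

lattice = cuboids(["time", "item", "location", "supplier"])
print(len(lattice))  # 16 cuboids for 4 dimensions
print(lattice[0])    # () -- the apex cuboid
print(lattice[-1])   # ('time', 'item', 'location', 'supplier') -- the base cuboid
```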
3.1 Basic Operations
• Basic operations of the multidimensional model on the logical level
– Selection
– Projection
– Cube join
– Aggregation (e.g., sum)
• Multidimensional selection
– The selection on a cube C((D1.K1, …, Dg.Kg), (M1, …, Mm)) through a predicate P is defined as σP(C) = {z ∈ C : P(z)}, if all variables in P are either:
• Classification levels K which functionally depend on a classification level in the granularity, i.e. Di.Ki ⟶ K
• Measures from (M1, …, Mm)
– E.g. σP.Prod_group=“Video”(Sales)
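The selection keeps exactly the cells satisfying P. A sketch in Python, using a hypothetical Article → Prod_group lookup to stand in for the functional dependency:

```python
# Illustrative FD lookup: Article -> Prod_group.
prod_group = {"TR-34": "Video", "TS-56": "Video", "E71": "Mobile"}

# Cube with granularity (Article, Day) and measure Turnover.
sales = {
    ("TR-34", "2008-11-13"): {"Turnover": 500.0},
    ("E71",  "2008-11-13"): {"Turnover": 300.0},
}

def select(cube, pred):
    """sigma_P(C): keep only the cells whose coordinates/measures satisfy P."""
    return {coord: m for coord, m in cube.items() if pred(coord, m)}

# sigma_{Prod_group = "Video"}(Sales)
video = select(sales, lambda coord, m: prod_group[coord[0]] == "Video")
print(list(video))  # [('TR-34', '2008-11-13')]
```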
• Multidimensional projection
– The projection of a function F of a measure M of cube C is defined as
πF(M)(C) = {(g, F(m)) ∈ dom(G) × dom(F(M)) : (g, m) ∈ C}
– E.g., the price projection πTurnover, Sold_items(Sales)
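The projection keeps the coordinates and restricts the measures; a minimal sketch with illustrative measure names:

```python
# Cube with two measures per cell.
sales = {
    ("TR-34", "2008-11-13"): {"Turnover": 500.0, "Sold_items": 5},
    ("TS-56", "2008-12-18"): {"Turnover": 240.0, "Sold_items": 2},
}

def project(cube, keys):
    """pi_{M'}(C): keep only the measures named in keys, coordinates unchanged."""
    return {coord: {k: m[k] for k in keys} for coord, m in cube.items()}

only_turnover = project(sales, ["Turnover"])
print(only_turnover[("TR-34", "2008-11-13")])  # {'Turnover': 500.0}
```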
• Join operations between cubes are usual
– E.g., if turnover were not provided, it could be calculated with the help of the unit price from the Price cube
• Two cubes C1(G1, M1) and C2(G2, M2) can only be joined if they have the same granularity (G1 = G2 = G)
– C1 ⋈ C2 = C(G, M1 ∪ M2)
(Figure: joining Price (Unit Price) with Sales (Units_Sold))
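The equal-granularity join merges the measure sets of cells with matching coordinates; a sketch mirroring the Sales ⋈ Price example (names and values illustrative):

```python
# Two cubes with the same granularity (Article, Day).
sales = {("TR-34", "2008-11-13"): {"Units_Sold": 5}}
price = {("TR-34", "2008-11-13"): {"Unit_Price": 100.0}}

def join(c1, c2):
    """C1 |><| C2 = C(G, M1 u M2): same granularity, merged measure sets."""
    return {g: {**c1[g], **c2[g]} for g in c1.keys() & c2.keys()}

joined = join(sales, price)
cell = joined[("TR-34", "2008-11-13")]
# Turnover can now be derived from the joined measures:
print(cell["Units_Sold"] * cell["Unit_Price"])  # 500.0
```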
• When the granularities are different but we still need to join the cubes, aggregation has to be performed
– E.g., Sales ⋈ Inventory: aggregate Sales((Day, Article, Store, Client)) to Sales((Month, Article, Store, Client))
(Figure: the Sales and Inventory cubes over the Store, Article, and Time hierarchies)
• Aggregation: the most important operation for OLAP
• Aggregation functions
– Build a single value from a set of values, e.g. in SQL: SUM, AVG, COUNT, MIN, MAX
– Example: SUM(P.Prod_group, G.City, T.Month)(Sales)
• Comparing granularities
– A granularity G = {D1.K1, …, Dg.Kg} is finer than G’ = {D1’.K1’, …, Dh’.Kh’} if and only if for each Dj’.Kj’ ∈ G’ there exists a Di.Ki ∈ G such that Di.Ki ⟶ Dj’.Kj’
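The finer-than test reduces to reachability along the functional dependencies; a sketch using the lecture's time and product levels:

```python
# Functional dependencies between classification levels, as parent edges.
fd = {"Day": "Month", "Month": "Quarter", "Quarter": "Year",
      "Article": "Prod_group", "Prod_group": "Prod_categ"}

def determines(level, target):
    """Does level -> ... -> target hold along the FD chain?"""
    while level is not None:
        if level == target:
            return True
        level = fd.get(level)
    return False

def finer(g, g_prime):
    """G is finer than G' iff every level in G' is determined by some level in G."""
    return all(any(determines(k, kp) for k in g) for kp in g_prime)

print(finer({"Day", "Article"}, {"Month", "Prod_categ"}))  # True
print(finer({"Month"}, {"Day"}))                           # False
```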
3.1 Change Support
• Classification schemas, cube schemas, and classification hierarchies are all designed in the building phase and considered fixed
– Practice has proven otherwise: a DW grows old, too
– Changes are strongly connected to the time factor
– This leads to the time validity of these concepts
• Reasons for schema modification
– New requirements
– Modification of the data source
• E.g., Saturn sells a lot of electronics
– Let's consider mobile phones
• They built their DW on 01.03.2003
3.1 Classification Hierarchy
• A classification hierarchy of their data until 01.07.2008 could look like this:
(Figure: Mobile Phone splits into GSM and 3G; Nokia 3600 and O2 XDA sit under GSM, the BlackBerry Bold under 3G)
• After 01.07.2008, 3G becomes hip and affordable and many phone makers start migrating towards 3G-capable phones
– Let's say O2 makes its XDA 3G-capable
(Figure: the O2 XDA moves from the GSM branch to the 3G branch, next to the BlackBerry Bold)
• After 01.04.2010, phone makers already develop 4G-capable phones
(Figure: a 4G branch with the “Best phone ever” is added next to GSM and 3G)
• It is important to trace the evolution of the data
– It can explain which data was available at which moment in time
– Such a versioning of the classification hierarchy can be performed by constructing a validity matrix
• When is something valid? Use timestamps to mark it!
• Annotated change data
(Figure: the hierarchy annotated with validity intervals – e.g., GSM [01.03.2003, ∞), 3G [01.04.2005, ∞), 4G [01.04.2010, ∞); the O2 XDA under GSM for [01.03.2003, 01.07.2008) and under 3G for [01.07.2008, ∞); the BlackBerry Bold [01.03.2006, ∞))
• The tree can be stored as dimension metadata
– The storage form is a validity matrix
• Rows are parent nodes
• Columns are child nodes

Parent \ Child | GSM             | 3G              | 4G              | Nokia 3600      | O2 XDA                   | Berry Bold      | Best phone
Mobile phone   | [01.03.2003, ∞) | [01.04.2005, ∞) | [01.04.2010, ∞) |                 |                          |                 |
GSM            |                 |                 |                 | [01.04.2005, ∞) | [01.03.2003, 01.07.2008) |                 |
3G             |                 |                 |                 |                 | [01.07.2008, ∞)          | [01.03.2006, ∞) |
4G             |                 |                 |                 |                 |                          |                 | [01.04.2010, ∞)
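An “as was” lookup against the validity matrix is a simple interval check. A sketch using the O2 XDA intervals from the example (representation is illustrative):

```python
from datetime import date

# Validity matrix entries: (parent, child) -> [start, end) interval,
# with None marking an open end.
validity = {
    ("GSM", "O2 XDA"): (date(2003, 3, 1), date(2008, 7, 1)),
    ("3G",  "O2 XDA"): (date(2008, 7, 1), None),
}

def parent_as_was(child, when):
    """Return the parent node of `child` that was valid at time `when`."""
    for (parent, c), (start, end) in validity.items():
        if c == child and start <= when and (end is None or when < end):
            return parent
    return None

print(parent_as_was("O2 XDA", date(2005, 1, 1)))  # GSM
print(parent_as_was("O2 XDA", date(2009, 1, 1)))  # 3G
```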
• Deleting a node in a classification hierarchy
– Should be performed only in exceptional cases
• It can lead to information loss
• How do we solve it?
– Soon GSM phones will not be produced anymore
• But we might still have some in our warehouses, to be delivered
• Or we might want to query data from when GSM was sold
• Just mark the end validity date of the GSM branch in the validity matrix
• Query classification
– Having the validity information, we can support queries like “as is versus as is”
• Regards all the data as if the only valid classification hierarchy is the present one
• In the case of the O2 XDA, it is considered as if it had always been a 3G phone
(Figure: the current hierarchy – GSM: Nokia 3600; 3G: O2 XDA, BlackBerry Bold; 4G: Best phone ever)
• As is versus as was
– Orders the classification hierarchy by the validity matrix information
• The O2 XDA was a GSM phone until 01.07.2008 and a 3G phone afterwards
(Figure: the hierarchy with the O2 XDA appearing under GSM before 01.07.2008 and under 3G afterwards)
• As was versus as was
– Past-time hierarchies can be reproduced
– E.g., query data with an older classification hierarchy
(Figure: an earlier state of the Mobile Phone hierarchy with only the GSM and 3G branches)
• Like versus like
– Only data whose classification hierarchy remained unmodified is evaluated
– E.g., the Nokia 3600 and the BlackBerry Bold
3.2 Schema Modification
• Improper modification of a schema (e.g., deleting a dimension) can lead to
– Data loss
– Inconsistencies
• Data is incorrectly aggregated or adapted
• Proper schema modification is complex, but
– It brings flexibility for the end user
• The possibility to ask “As Is vs. As Was” queries and so on
• Alternatives
– Schema evolution
– Schema versioning
• Schema evolution
– Modifications can be performed without data loss
– It involves schema modification and data adaptation to the new schema
– This data adaptation process is called instance adaptation
(Figure: the Purchase and Sales cubes adapted to a modified product hierarchy)
• Schema evolution
– Advantage
• Faster query execution in a DW with many schema modifications
– Disadvantages
• It limits the end user's flexibility to query based on past schemas
• Only queries based on the current schema are supported
• Schema versioning
– Also no data loss
– All the data corresponding to all the schemas is always available
– After a schema modification, the data is held in its belonging schema
• Old data – old schema
• New data – new schema
(Figure: two schema versions of the Purchase/Sales cubes, one with and one without the Prod. Family level)
• Schema versioning
– Advantages
• Allows higher flexibility, e.g., “As Is vs. As Was” queries
– Disadvantages
• Adaptation of the data to the queried schema is done on the spot
• This results in longer query run time
3.2 Physical Model
• Defining the physical structures
– Setting up the database environment
– Performance tuning strategies (next lecture)
• Indexing
• Partitioning
• Materialization
• Goal:
– Define the actual storage architecture
– Decide on how the data is to be accessed and how it is arranged
• Physical implementation of the multidimensional model can be:
– Relational
• Snowflake schema
• Star schema
– Multidimensional
• Matrices
• Relational model, goals:
– As little loss of semantic knowledge as possible, e.g., classification hierarchies
– The translation of multidimensional queries must be efficient
– The RDBMS should be able to run the translated queries efficiently
– The maintenance of the resulting tables should be easy and fast, e.g., when loading new data
3.2 Relational Model
• Going from multidimensional to relational
– Representations for cubes, dimensions, classification hierarchies and attributes
– Implementation of cubes without the classification hierarchies is easy
• A table can be seen as a cube
• A column of a table can be considered as a dimension mapping
• A tuple in the table represents a cell in the cube
• If we interpret only a part of the columns as dimensions, we can use the rest as measures
• The resulting table is called a fact table

Article       | Store            | Day        | Sales
Laptops       | Hannover, Saturn | 13.11.2008 | 6
Mobile Phones | Hannover, Saturn | 18.12.2008 | 24
3.2 Snowflake Schema
• Snowflake schema
– Simple idea: use a table for each classification level
• This table includes the ID of the classification level and other attributes
• Two neighboring classification levels are connected by 1:n relationships, e.g., from n Days to 1 Month
• The measures of a cube are maintained in a fact table
• Besides the measures, it also holds the foreign key IDs of the smallest classification levels
• Why “snowflake”?
– The facts/measures are in the center
– The dimensions spread out in each direction and branch out with their granularity
3.2 Snowflake Example
(Figure: fact table Sales (Product_ID, Day_ID, Store_ID, Sales, Revenue) with 1:n dimension tables – Product → Product group → Product category; time dimension Day → Month → Quarter → Year and Day → Week → Year; location dimension Store → State → Region → Country)
• Snowflake schema – advantages
– With a snowflake schema the size of the dimension tables is reduced and queries run faster
• If a dimension is very sparse (most measures corresponding to the dimension have no data)
• And/or a dimension has a long list of attributes which may be queried
• Snowflake schema – disadvantages
– Fact tables are responsible for 90% of the storage requirements
• Thus, normalizing the dimensions usually leads to insignificant improvements
– Normalization of the dimension tables can reduce the performance of the DW because it leads to a large number of tables
• E.g., when connecting dimensions with coarse granularity, these tables are joined with each other during queries
• A query which connects Product category with Year and Country is clearly not performant (10 tables need to be joined)
(Figure: the snowflake example again, highlighting the 10 tables touched by a query over Product category, Year, and Country)
3.2 Star Schema
• Star schema
– Basic idea: use a denormalized schema for all the dimensions
• A star schema can be obtained from the snowflake schema through the denormalization of the tables belonging to a dimension
3.2 Star Schema – Example
(Figure: fact table Sales (Product_ID, Time_ID, Geo_ID, Sales, Revenue) with denormalized dimension tables Product (Product_ID, Product group, Product category, Description, …), Time (Time_ID, Day, Week, Month, Quarter, Year), and Geography (Geo_ID, Store, State, Region, Country, …))
3.2 Star Schema
• Advantages
– Improves query performance for often-used data
– Fewer tables and a simple structure
– Efficient query processing with regard to dimensions
• Disadvantages
– In some cases, high overhead of redundant data
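The single-join behaviour of a star schema can be tried out directly. A sketch using sqlite3; the table layout and sample rows are illustrative, not the lecture's exact DDL:

```python
import sqlite3

# Minimal star schema: one fact table plus a denormalized Product dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (
  product_id INTEGER PRIMARY KEY,
  description TEXT, prod_group TEXT, prod_categ TEXT);
CREATE TABLE sales (
  product_id INTEGER REFERENCES product(product_id),
  day TEXT, revenue REAL);
INSERT INTO product VALUES
  (10, 'E71',    'Mobile Phones', 'Electronics'),
  (11, 'PS-42A', 'TV',            'Electronics');
INSERT INTO sales VALUES (10, '2008-11-13', 300.0),
                         (11, '2008-11-13', 700.0),
                         (10, '2008-12-18', 200.0);
""")

# Because the dimension is denormalized, one join suffices to
# aggregate revenue by product category.
row = con.execute("""
  SELECT p.prod_categ, SUM(s.revenue)
  FROM sales s JOIN product p USING (product_id)
  GROUP BY p.prod_categ
""").fetchone()
print(row)  # ('Electronics', 1200.0)
```

In a snowflake schema the same query would additionally join the product group and product category tables.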
3.2 Snowflake vs. Star
• Snowflake
– The structure of the classifications is expressed in the table schemas
– The fact table and the dimension tables are normalized
• Star
– The entire classification is expressed in just one table
– The fact table is normalized, while in the dimension tables the normalization is broken
• This leads to redundancy of information in the dimension tables
3.2 Examples
• Snowflake

Product_ID | Description | Brand   | Prod_group_ID
10         | E71         | Nokia   | 4
11         | PS-42A      | Samsung | 2
12         | 5800        | Nokia   | 4
13         | Bold        | Berry   | 4

Prod_group_ID | Description  | Prod_categ_ID
2             | TV           | 11
4             | Mobile Pho.. | 11

Prod_categ_ID | Description
11            | Electronics

• Star

Product_ID | Description | … | Prod. group | Prod. categ
10         | E71         | … | Mobile Ph.. | Electronics
11         | PS-42A      | … | TV          | Electronics
12         | 5800        | … | Mobile Ph.. | Electronics
13         | Bold        | … | Mobile Ph.. | Electronics
3.2 Snowflake to Star
• When should we go from snowflake to star?
– Heuristics-based decision
• When typical queries relate to coarser granularity (like product category)
• When the volume of data in the dimension tables is relatively low compared to the fact table
– In this case a star schema leads to negligible overhead through redundancy, but performance is improved
• When modifications of the classifications are rare compared to insertions of fact data
– In this case these modifications are controlled through the data load process of the ETL, reducing the risk of data anomalies
3.2 Do We Have a Winner?
• Snowflake or star?
– It depends on the necessity
• Fast query processing or efficient space usage
– However, most of the time a mixed form is used
• The starflake schema: some dimensions stay normalized corresponding to the snowflake schema, while others are denormalized according to the star schema
• The starflake schema
– The decision on how to deal with the dimensions is influenced by
• Frequency of the modifications: if the dimensions change often, normalization leads to better results
• Amount of dimension elements: the bigger the dimension tables, the more space normalization saves
• Number of classification levels in a dimension: more classification levels introduce more redundancy in the star schema
• Materialization of aggregates for the dimension levels: if the aggregates are materialized, a normalization of the …
3.2 More Schemas
• Galaxies
– In practice we usually have more measures, described by different dimensions
• Thus, more fact tables
(Figure: two fact tables, Sales (Product_ID, Store_ID, Sales, Revenue) and Receipts (Product_ID, Date_ID, …), sharing the Product dimension and connecting to the Store, Date, and Vendor dimensions)
3.2 Even More Schemas
• Other schemas
– Fact constellations
• Pre-calculated aggregates
– Factless fact tables
• Fact tables that do not have non-key data
– Can be used for event tracking or to inventory the set of possible occurrences
– …
• Relational model – disadvantages
– The representation of multidimensional data can be implemented relationally with a finite set of transformation steps, however:
• Multidimensional queries first have to be translated to the relational representation
• Direct interaction with the relational data model is not fit for the end user
3.2 Multidimensional Model
• What about storing the data multidimensionally?
• The basic data structure for multidimensional data storage is the array
• The elementary data structures are the cubes and the dimensions
– C = ((D1, …, Dn), (M1, …, Mm))
• The storage is intuitive as arrays of arrays, physically linearized
• Linearization example: a 2-D cube with |D1| = 5, |D2| = 4, i.e., 20 cube cells
– Query: jackets sold in March?
• The measure is stored in cube cell D1[4], D2[3]
• The 2-D cube is physically stored as a linear array, so D1[4], D2[3] becomes array cell 14
– (Index(D2) – 1) * |D1| + Index(D1)
– Linearized index = 2 * 5 + 4 = 14
(Figure: the 5×4 grid of cube cells and their linear array positions, months Jan (1) to Apr (4) along D2)
3.2 Linearization
• Generalization:
– Given a cube C = ((D1, D2, …, Dn), (M1:Type1, M2:Type2, …, Mm:Typem)), the index of a cube cell z with coordinates (x1, x2, …, xn) can be linearized as follows:
• Index(z) = x1 + (x2 - 1) * |D1| + (x3 - 1) * |D1| * |D2| + … + (xn - 1) * |D1| * … * |Dn-1|
= 1 + ∑i=1..n ((xi - 1) * ∏j=1..i-1 |Dj|)
3.2 Problems in Array Storage
• Influence of the order of the dimensions in the cube definition
– In the cube, the cells of D2 are ordered one under the other, e.g., sales of all pants involve a column in the cube
– After linearization, this information is spread among several data blocks or pages
– If we consider that a data block can hold 5 cells, a query over all products sold in January can be answered with just 1 block read, but a query of all sold pants involves reading 4 blocks
(Figure: the linearized 5×4 cube from before, with one month Jan (1) to Apr (4) per column of 5 cells)
• Solution: use caching techniques
– But caching and swapping are also performed by the operating system
– The MDBMS has to manage its caches such that the OS doesn't perform any damaging swaps
• Storage of dense cubes
– If cubes are dense, array storage is more efficient; however, operations suffer due to the large cubes
– Solution: store dense cubes not linearly, but on 2 levels
• The first contains indexes and the second the data cells stored in blocks
• Optimization procedures like indexes (trees, bitmaps), physical partitioning, and compression (run-length encoding) can be used
• Storage of sparse cubes
– All the cells of a cube, including empty ones, have to be stored
– Sparseness leads to data being stored in many physical blocks or pages
• The query speed is affected by the large number of block accesses on the secondary memory
– Solution:
• Do not store empty blocks or pages, but adapt the index structure
• 2-level data structure: the upper layer holds all possible combinations of the sparse dimensions, the lower layer holds the dense ones
• 2-level cube storage
(Figure: a sparse upper level over the Marketing campaign and Customer dimensions pointing into dense lower-level blocks over Product, Time, and Geo)
3.2 Physical Models
• Based on the physical model used:
– DOLAP (Desktop OLAP)
– MOLAP (Multidimensional OLAP)
– ROLAP (Relational OLAP)
– HOLAP (Hybrid OLAP)
(Figure: timeline of OLAP server flavors – OLAP, ROLAP, HOLAP, MOLAP, DOLAP)
• DOLAP
– Developed as an extension to the production system reports
• It downloads a small hypercube from a central point (data mart or DW)
• Performs multidimensional analysis while disconnected from the data source
• The computation occurs on the client
– Requires little investment
– Lacks the ability to manage large data sets
• MOLAP
– The presentation layer provides the multidimensional view
– The OLAP server stores data in a multidimensional structure
• Computation occurs in this layer during the loading step (not at query time)
(Figure: MOLAP architecture – presentation layer, MOLAP client, MOLAP server with its interface, and the underlying MDB)
3.2 MOLAP
• Advantages
– Excellent performance
• Fast data retrieval
• Optimal for slicing and dicing
• Complex calculations
• All calculations are pre-generated when the cube is created
(Figure: the lattice of cuboids again – all aggregates from the apex cuboid down to the base cuboid are pre-computed)
• Disadvantages
– Limited amount of data it can handle
• A cube can be derived from a large amount of data, but only summary-level information will be included in the cube
– Requires additional investment
• Cube technologies are often proprietary
– Enormous amount of overhead
• An input file of 200 MB can expand to 5 GB with calculations
• Products:
– Cognos (IBM), Essbase (Oracle), Microsoft Analysis Services, Palo (open source)
3.2 ROLAP
• ROLAP
– The presentation layer provides the multidimensional view
– The ROLAP server generates SQL queries from the OLAP requests to query the RDBMS
– Data is stored in RDBs
• Special schema design: e.g., star, snowflake
• Special indexes: e.g., bitmap, R-trees
• Advantages
– Proven technology (relational model, DBMS)
– Can handle large amounts of data (VLDBs)
• Disadvantages
– Limited SQL functionality
• Products
– Microsoft Analysis Services, Siebel Analytics (now Oracle BI), MicroStrategy, Mondrian (open source)
3.2 ROLAP vs. MOLAP
• Based on OLAP needs

OLAP needs                  | MOLAP | ROLAP
User benefits:
  Multidimensional view     |   √   |   √
  Excellent performance     |   √   |   -
  Analytical flexibility    |   √   |   -
  Real-time data access     |   -   |   √
  High data capacity        |   -   |   √
MIS benefits:
  Easy development          |   √   |   -
  Low structure maintenance |   -   |   √
  Low aggregate maintenance |   √   |   -
3.2 HOLAP
• HOLAP
– Best of both worlds
• Storing detailed data in RDBs
• Storing aggregated data in MDBs
– Different partitioning approaches between MOLAP and ROLAP
• Vertical
• Horizontal
(Figure: HOLAP architecture – presentation layer and HOLAP server, with both an RDBMS and an MDDB underneath)
• Vertical partitioning
– Aggregations are stored in MOLAP for fast query performance
– Detailed data in ROLAP, to optimize the time of cube processing (loading the data from the OLTP)
• Horizontal partitioning
– HOLAP stores some slice of the data, usually the more recent one (i.e., sliced by the Time dimension), in MOLAP for fast query performance
– Older data in ROLAP
• Other approaches
– Store some cubes in MOLAP and others in ROLAP, leveraging the fact that in a large cuboid there will be dense and sparse sub-regions
3.2 Conclusions
• ROLAP
– RDBMS – star/snowflake schema
– For detailed and larger volumes of data (TB)
• MOLAP
– MDBMS – cube structures, array-based storage
– For summarized and relatively “small” volumes of data (50 GB)
• HOLAP is emerging as the OLAP server of choice
Summary
• Logical model
– Cubes, dimensions, hierarchies, classification levels
• Physical level
– Relational implementation through:
• Star schema: improves query performance for often-used data
– Fewer tables and a simple structure
– Efficient query processing with regard to dimensions
– In some cases, high overhead of redundant data
• Snowflake schema: reduces the size of the dimension tables
– However, through dimension normalization – a large number of tables
– Array-based storage
• How to perform linearization
• Problems:
– Order of dimensions – solution: caching
– Dense cubes, sparse cubes – solution: 2-level storage