• Keine Ergebnisse gefunden

Data Warehousing & OLAP

N/A
N/A
Protected

Academic year: 2021

Aktie "Data Warehousing & OLAP"

Copied!
15
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Data Warehousing

& OLAP

Wolf-Tilo Balke Silviu Homoceanu

Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

• Last week:

– Storage structures: MDB

– Architectures: N-Tier Architectures – Data Modeling – Conceptual Modeling

Multidimensional Entity Relationship (ME/R) Model

Multidimensional UML (mUML)

• This week:

– Data Modeling (continued)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2

Summary

3. Data Modeling 3.1 Logical Modeling:

Cubes, Dimensions, Hierarchies 3.2 Physical Modeling:

Star, Snowflake, Array storage

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 3

3. DW Modeling

• Goal of the Logical Model – Confirm the subject areas

– Create ‘real’ facts and dimensions from the subjects that we have identified

– Establish the needed granularity for our dimensions

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 4

3.1 Logical Model

• Logical structure of the multidimensional model

– Cubes: Sales, Purchase, Price, Inventory – Dimensions: Product, Time, Geography, Client

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 5

3.1 Logical Model

Purchase Amount Store

City District Region Country

Article Prod. Group Prod. Family

Prod. Categ

Week

Day Month

Quarter Year

Price Unit Price

Inventory Stock Sales Turnover Client

• Dimensions are…

analysis purpose chosen entities, within the data model

– One dimension can be used to define more than one cube

– They are hierarchically organized

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 6

3.1 Dimensions

Purchase Amount Article

Prod. Group Prod. Family

Prod. Categ

Price Unit Price

Sales Turnover

(2)

• Dimension hierarchies are organized in classification levels (e.g., Day, Month, …)

– The dependencies between the classification levels are described by the classification schema through functional dependencies

An attribute B is functionally dependent on an attribute A, denoted A ⟶ B, if for all a ∈

dom(A)

there exists exactly one b ∈

dom(B) corresponding to it

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 7

3.1 Dimensions

Week

Day Month

Quarter Year

• Classification schemas

– The classification schema of a dimension D is a semi- ordered set of classification levels

({D.K

0

, …, D.K

k

}, ⟶ )

– With a smallest element D.K

0

, i.e. there is no classification level with smaller granularity

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 8

3.1 Dimensions

• A fully-ordered set of classification levels is called a Path

– If we consider the classification schema of the time dimension, then we have the following paths

T.Day T.Week

T.Day T.Month T.Quarter T.Year

– Here T.Day is the smallest element

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 9

3.1 Dimensions

Day Month

Quarter Year

Week

• Classification hierarchies

– Let D.K

0

⟶ …⟶ D.K

k

be a path in the classification schema of dimension D

– A classification hierarchy concerning these path is a balanced tree which

Has as nodes dom(D.K

0

) … dom(D.K

k

) {ALL}

And its edges respect the functional dependencies

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 10

3.1 Dimensions

• Example: classification hierarchy from the path product dimension

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 11

3.1 Dimensions

Article Prod. Group Prod. Family

Prod. Categ

ALL Electronics

Video Audio

Video recorder Video

recorder Camcorder

TR-34 TS-56

TV

Clothes

Article Prod. Group Prod. Family Category

• Cubes consist of data cells with one or more measures

• If a cube schema S(G,M) consists of a granularity G=(D

1

.K

1

, …, D

n

.K

n

) and a set M=(M

1

, …, M

m

) representing the measure

– A Cube (C C C) is a set of cube cells, C C ⊆ dom(G) x dom(M)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 12

3.1 Cubes

(3)

• The coordinates of a cell are the classification nodes from dom(G) corresponding to the cell

– Sales ((Article, Day, Store, Client), (Turnover)) – …

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 13

3.1 Cubes

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 14

3.1 Cubes

927 103 812 102

39 580 30 501 680 952 605818825

31 512 14 400

818

Supplier = s1 Supplier = s2 Supplier = s3

818

Berlin MünchenParis Braunschweig

Q1 Q2 Q3 Q4

Computer Video

Audio Telephones Computer Video

Audio Telephones Computer Video

Audio Telephones

• 4 dimensions (supplier, city, quarter, product)

• We can now imagine n-dimensional cubes – n-D cube is called a base cuboid

– The top most cuboid, the 0-D, which holds the highest level of summarization is called apex cuboid – The full data cube is

formed by the lattice of cuboids

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 15

3.1 Cubes

• But things can get complicated pretty fast

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16

3.1 Cubes

all

time supplier

time,item time,location time,supplier

item,location item,supplier

location,supplier

time,item,location time,item,supplier

time,location,supplier

item,location,supplier

time, item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid item location

• Basic operations of the multidimensional model on the logical level

– Selection – Projection – Cube join – Sum – Aggregation

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 17

3.1 Basic Operations

• Multidimensional Selection

– The selection on a cube C((D

1

.K

1

,…, D

g

.K

g

), (M

1

, …, M

m

)) through a predicate P, is defined as σ

P

(C) = {z Є C:P(z)}, if all variables in P are either:

Classification levels

K , which functionally depend on a

classification level in the granularity of K, i.e. D

i.Ki⟶ K

Measures from (M

1, …, Mm)

– E.g. σ

P.Prod_group=“Video”

(Sales)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 18

3.1 Basic Operations

(4)

• Multidimensional projection

– The projection of a function of a measure F(M) of cube C is defined as

-F(M)(C) = { (g,F(m))∈dom(G) x dom(F(M)): (g,m) ∈C}

– E.g. , price projection -

turnover, sold_items

(Sales)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 19

3.1 Basic Operations

Sales Turnover Sold_items

• Join operations between cubes is usual – E.g. if turnover would not be provided, it could be

calculated with the help of the unit price from the price cube

• 2 cubes C

1

(G

1

, M

1

) and C

2

(G

2

, M

2

) can only be joined, if they have the same granularity (G

1

= G

2

= G)

– C

1

⋈C

2

= C(G, M

1

∪ M

2

)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 20

3.1 Basic Operations

Price Unit Price Sales

Units_Sold

• When the granularities are different, but we still need to join the cubes, aggregation has to be performed

– E.g. , Sales ⋈ Inventory: aggregate Sales((Day, Article, Store, Client)) to Sales((Month, Article, Store, Client))

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 21

3.1 Basic Operations

Store City District Region Country

Article Prod. Group Prod. Family

Prod. Categ

Week Month Day Quarter

Year

Inventory Stock

Sales Turnover Client

• Aggregation: most important operation for OLAP operations

• Aggregation functions

– Build a single values from set of value, e.g. in SQL:

SUM, AVG, Count, Min, Max

– Example: SUM

(P.Product_group, G.City, T.Month)

(Sales)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 22

3.1 Basic Operations

• Comparing granularities

– A granularity G={D

1

.K

1

, …, D

g

.K

g

} is finer than G’={D

1

’.K

1

’, …, D

h

’.K

h

’}, if and only if

for each D

j

’.K

j

’∈ G’ ∃ D

i

.K

i

∈ G where D

i

.K

i

⟶ D

j

’.K

j

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 23

3.1 Basic Operations

• Classification schema, cube schema, classification hierarchy are all designed in the building phase and considered as fix

– Practice has proven otherwise – DW grow old, too

– Changes are strongly connected to the time factor – This lead to the time validity of these concepts

• Reasons for schema modification – New requirements

– Modification of the data source

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 24

3.1 Change support

(5)

• E.g. Saturn sells a lot of electronics – Lets consider mobile phones

They built their DW on 01.03.2003

A classification hierarchy of their data until 01.07.2008 could look like this:

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 25

3.1 Classification Hierarchy

Mobile Phone

GSM 3G

Nokia 3600 O2 XDA BlackBerry

Bold

• After 01.07.2008 3G becomes hip and affordable and many phone makers start migrating towards 3G capable phones

– Lets say O2 makes its XDA 3G capable

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 26

3.1 Classification Hierarchy

Mobile Phone

GSM 3G

Nokia 3600 O2 XDA BlackBerry

Bold

• After 01.04.2010 phone makers already develop 4G capable phones

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 27

3.1 Classification Hierarchy

Mobile Phone

GSM 3G

Nokia 3600 O2 XDA BlackBerry

Bold 4G

Best phone ever

• It is important to trace the evolution of the data

– It can explain which data was available at which moment in time

– Such a versioning system of the classification hierarchy can be performed by constructing a validity matrix

When is something, valid?

Use timestamps to mark it!

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 28

3.1 Classification Hierarchy

• Annotated Change data

Relational Database Systems 1 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 29

3.1 Classification Hierarchy

Mobile Phone

GSM 3G

Nokia 3600 O2 XDA BlackBerry

Bold 4G

Best phone ever

[01.03.2003, ∞)

[01.04.2005, ∞) [01.04.2010, ∞)

[01.04.2010, ∞) [01.04.2005, ∞)

[01.03.2006, ∞) [01.07.2008, ∞) [01.03.2003, 01.07.2008)

• The tree can be stored as dimension metadata – The storage form is a validity matrix

Rows are parent nodes

Columns are child nodes

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 30

3.1 Classification Hierarchy

GSM 3G 4G Nokia 3600 O2 XDA Berry Bold Best phone

Mobile phone

[01.03.2003, ∞) [01.04.2005, ∞) [01.04.2010, ∞)

GSM [01.04.2005, ∞) [01.03.2003,

01.07.2008)

3G [01.07.2008, ∞) [01.03.2006, ∞)

4G [01.04.2010

, ∞) Nokia 3600

O2 XDA Berry Bold Best phone

(6)

• Deleting a node in a classification hierarchy – Should be performed only in exceptional cases

It can lead to information loss

• How do we solve it?

– Soon GSM phones will not be produced anymore

But we might have some more in our warehouses, to be delivered

Or we might want to query data since when GSM was sold

Just mark the end validity date of the GSM branch in the validity matrix

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 31

3.1 Classification Hierarchy

• Query classification

– Having the validity information we can support queries like as is versus as is

Regards all the data as if the only valid classification hierarchy is the present one

In the case of O2 XDA, it will be considered as it has always been a 3G phone

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 32

3.1 Classification Hierarchy

Mobile Phone

GSM 3G

Nokia 3600 O2 XDA BlackBerry Bold 4G

Best phone ever

• As is versus as was

– Orders the classification hierarchy by the validity matrix information

O2 XDA was a GSM phone until 01.07.2008 and a 3G phone afterwards

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 33

3.1 Classification Hierarchy

Mobile Phone

GSM 3G

Nokia 3600 O2 XDA BlackBerry

Bold BlackBerry

Bold 4G

Best phone ever Best phone

ever

• As was versus as was – Past time hierarchies can be

reproduced

– E.g., query data with an older classification hierarchy

• Like versus like

– Only data whose classification hierarchy remained unmodified, is evaluated

– E.g. the Nokia 3600 and the Black Berry

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 34

3.1 Classification Hierarchy

Mobile Phone

GSM 3G

Nokia 3600 O2 XDA BlackBerry Bold

… …

• Improper modification of a schema (deleting a dimension) can lead to

– Data loss – Inconsistencies

Data is incorrectly aggregated or adapted

• Proper schema modification is complex but – It brings flexibility for the end user

The possibility to ask “As Is vs. As Was” queries and so on

• Alternatives – Schema evolution – Schema versioning

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 35

3.1 Schema Modification

• Schema evolution

– Modifications can be performed without data loss – It involves schema modification and data adaptation

to the new schema

– This data adaptation process is called Instance adaptation

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 36

3.1 Schema Modification

Purchase Amount Article

Prod. Group Prod. Family

Prod. Categ

Price Unit Price

Sales Turnover

(7)

• Schema evolution – Advantage

Faster to execute queries in DW with many schema modifications

– Disadvantages

It limits the end user flexibility to query based on the past schemas

Only actual schema based queries are supported

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 37

3.1 Schema Modification

• Schema versioning – Also no data loss

– All the data corresponding to all the schemas are always available

– After a schema modification the data is held in their belonging schema

Old data - old schema

New data - new schema

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 38

3.1 Schema Modification

Purchase Amount Article

Prod. Group Prod. Family

Prod. Categ

Price Unit Price

Sales Turnover

Purchase Amount Article

Prod. Group

Prod. Categ Sales

Turnover

….

• Schema versioning – Advantages

Allows higher flexibility, e.g., “As Is vs. As Was”, etc. queries – Disadvantages

Adaptation of the data to the queried schema is done on the spot

This results in longer query run time

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 39

3.1 Schema Modification

• Defining the physical structures – Setting up the database environment – Performance tuning strategies (next lecture)

Indexing

Partitioning

Materialization

• Goal:

– Define the actual storage architecture

– Decide on how the data is to be accessed and how it is arranged

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 40

3.2 Physical Model

• Physical implementation of the multidimensional paradigm model can be:

– Relational

Snowflake-schema

Star-schema – Multidimensional

Matrixes

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 41

3.2 Physical Model

• Relational model, goals:

– As low loss of semantically knowledge as possible e.g., classification hierarchies

– The translation from multidimensional queries must be efficient

– The RDBMS should be able to run the translated queries efficiently

– The maintenance of the present tables should be easy and fast e.g., when loading new data

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 42

3.2 Physical Model

(8)

• Going from multidimensional to relational – Representations for cubes, dimensions, classification

hierarchies and attributes

– Implementation of cubes without the classification hierarchies is easy

A table can be seen as a cube

A column of a table can be considered as a dimension mapping

A tuple in the table represents a cell in the cube

If we interpret only a part of the columns as dimensions we can use the rest as measures

The resulting table is called a fact table

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 43

3.2 Relational Model

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 44

3.2 Relational Model

818 Product

Geography Time 13.11.2008 18.12.2008

Article Store Day Sales

Laptops Hannover, Saturn 13.11.2008 6 Mobile Phones Hannover Saturn 18.12.2008 24 Laptops Braunschweig

Saturn

18.12.2008 3 Laptops

Mobile p.

• Snowflake-schema

– Simple idea: use a table for each classification level

This table includes the ID of the classification level and other attributes

2 neighbor classification levels are connected by 1:n connections e.g., from n Days to 1 Month

The measures of a cube are maintained in a fact table

Besides measures, there are also the foreign key IDs for the smallest classification levels

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 45

3.2 Relational Model

• Snowflake?

– The facts/measures are in the center – The dimensions spread

out in each direction and branch out with their granularity

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 46

3.2 Snowflake Schema

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 47

3.2 Snowflake Example

Sales Product_ID Day_ID Store_ID Sales Revenue Product

Product_ID Description Brand Product_gro up_ID

Product group Product_group_ID Description Product_categ_ID

Product category Product_category_ID Description

Store Store_ID Description State_ID

State State_ID Description Region_ID Region

Region_ID Description Country_ID

Country Country_ID Description

Day Day_ID Description Month_ID Week_ID

Week Week_ID Description Year_ID

Year Year_ID Description Month

Month_ID Description Quarter_ID

Quarter Quarter_ID Description Year_ID n

n n

n n

n

n

n n

n n

n n

1 1

1

1 1

1 1

1

1 1 1

1

1

fact table

dimension tables

time

location

• Snowflake schema – Advantages

– With a snowflake schema the size of the dimension tables will be reduced and queries will run faster

If a dimension is very sparse (most measures corresponding to the dimension have no data)

And/or a dimension has long list of attributes which may be queried

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 48

3.2 Snowflake Schema

(9)

• Snowflake schema – Disadvantages

– Fact tables are responsible for 90% of the storage requirements

•Thus, normalizing the dimensions usually lead to insignificant improvements

– Normalization of the dimension tables can reduce the performance of the DW because it leads to a large number of tables

•E.g., when connecting dimensions with coarse granularity these tables are joined with each other during queries

•A query which connects Product category with Year and Country is clearly not performant (10 tables need to be connected)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 49

3.2 Snowflake Schema

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 50

3.2 Snowflake Example

Sales Product_ID Day_ID Store_ID Sales Revenue Product

Product_ID Description Brand Product_gro up_ID

Product group Product_group_ID Description Product_categ_ID

Product category Product_category_ID Description

Store Store_ID Description State_ID

State State_ID Description Region_ID Region

Region_ID Description Country_ID

Country Country_ID Description

Day Day_ID Description Month_ID Week_ID

Week Week_ID Description Year_ID

Year Year_ID Description Month

Month_ID Description Quarter_ID

Quarter Quarter_ID Description Year_ID n

n n

n n

n

n

n n

n n

n n

1 1

1

1 1

1 1

1

1 1 1

1

1

• Star schema

– Basic idea: use a denormalized schema for all the dimensions

A star schema can be obtained from the snowflake schema through the denormalization of the tables belonging to a dimension

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 51

3.2 Relational Model

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 52

3.2 Star Schema - Example

Sales Product_ID Time_ID Geo_ID Sales Revenue Product

Product_ID Product group Product category Description

Geography Geo_ID Store State Region Country

Time Time_ID Day Week Month Quarter Year n n n

1 1

1

• Advantages

– Improves query performance for often-used data – Less tables and simple structure

– Efficient query processing with regard to dimensions

• Disadvantages

– In some cases, high overhead of redundant data

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 53

3.2 Star Schema

• Snowflake vs. Star

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 54

3.2 Snowflake vs. Star

–The structure of the classifications are expressed in table schemas –The fact table and

dimension tables are normalized

–The entire classification is expressed in just one table –The fact table is normalized

while in the dimension table the normalization is broken

•This leads to redundancy of information in the dimension tables

(10)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 55

3.2 Examples

• Snowflake • Star

Product_ID Description Brand Prod_group_ID

10 E71 Nokia 4

11 PS-42A Samsung 2

12 5800 Nokia 4

Bold Berry 4

Prod_group_ID Description Prod_categ_ID

2 TV 11

4 Mobile Pho.. 11

Prod_categ_ID Description

11 Electronics

Product_

ID

Description Prod.

group Prod. categ

10 E71 Mobile Ph.. Electronics

11 PS-42A TV Electronics

12 5800 Mobile Ph.. Electronics

13 Bold Mobile Ph.. Electronics

• When should we go from Snowflake to star?

– Heuristics-based decision

When typical queries relate to coarser granularity (like product category)

When the volume of data in the dimension tables is relatively low compared to the fact table

–In this case a star schema leads to negligible overhead through redundancy, but performance is improved

When modifications on the classifications are rare compared to insertion of fact data

–In this case these modifications controlled through the data load process of the ETL reducing the risk of data anomalies

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 56

3.2 Snowflake to Star

• Snowflake or Star?

– It depends on the necessity

Fast query processing or efficient space usage – However, most of the time a mixed form

is used

The Starflake schema: some dimensions stay normalized corresponding to the snowflake schema, while others are denormalized according to the star schema

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 57

3.2 Do we have a winner?

• The Starflake schema

– The decision on how to deal with the dimensions is influenced by

Frequency of the modifications: if the dimensions change often, normalization leads to better results

Amount of dimension elements: the bigger the dimension tables, the more space normalization saves

Number of classification levels in a dimension: more classification levels introduce more redundancy in the star schema

Materialization of aggregates for the dimension levels:

if the aggregates are materialized, a normalization of the dimension can bring better response time

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 58

3.2 Our forces combined

• Galaxies

– In pratice we usually have more measures described by different dimensions

Thus, more fact tables

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 59

3.2 More Schemas

Sales Product_ID Store_ID Sales Revenue Store

Store_ID

Date Date_ID

Product Product_ID

Receipts

Product_ID Date_ID

Vendor Vendor_ID

• Other schemas – Fact constellations

Pre-calculated aggregates – Factless fact tables

Fact tables do not have non-key data

–Can be used for event tracking or to inventory the set of possible occurrences

– …

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 60

3.2 Even More Schemas

(11)

• Relational model - disadvantages – The representation of the

multidimensional data can be implemented relationally with a finite set of transformation steps, however:

Multidimensional queries have to be first translated to the relational representation

A direct interaction with the relational data model is not fit for the end user

• What about storing the data multidimensionally?

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 61

3.2 Relational Model

• The basic data structure for multidimensional data storage is the array

• The elementary data structures are the cubes and the dimensions

– C=((D

1

, …, D

n

), (M

1

, …, M

m

))

• The storage is intuitive as arrays of arrays, physically linearized

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 62

3.2 Multidimensional Model

• Linearization example: 2D cube |D

1

| = 5, |D

2

| = 4, cube cells = 20

– Query: Jackets sold in March?

Measure stored in cube cell D

1[4], D2[3]

The 2D cube is physically stored as a linear array, so

D1[4], D2[3] becomes array cell 14 –(Index(D2) – 1) * |D1| + Index(D1) –Linearized Index = 2 * 5 + 4 = 14

• Generalization…

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 63

3.2 Multidimensional Model

1 1 3 6 47

2 2 53

7 8 89

6 4

9 11 11 16 1217

10 12 1313

15 18 1619

14 14 17 5 19 10 25 15 27 20

Jan (1) Feb(2) Mar(3) Apr(4) D1 D2

• Generalization:

– Given a cube C=((D

1

, D

2

, …, D

n

), (M

1

:Type

1

, M

2

:Type

2

, …, M

m

:Type

m

)), the index of a cube cell z with coordinates (x

1

, x

2

, …, x

n

) can be linearized as follows:

•Index(z) = x1+ (x2- 1) * |D1| + (x3- 1) * |D1| * |D2| + … + (xn- 1) * |D1| * … * |Dn-1| =

= 1+ ∑

i=1

n((xi- 1) * ∏

j=1 i-1|Di|)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 64

3.2 Linearization

• Influence of the order of the dimensions in the cube definition

– In the cube the cells of D

2

are ordered one under the other

e.g., sales of all pants involves a column in the cube

– After linearization, the information is spread among more data blocks or pages

– If we consider a data block can hold 5 cells, a query over all products sold in January can be answered with just 1 block read, but a query of all sold pants, involves reading 4 blocks

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 65

3.2 Problems in Array-Storage

1 1 3 6 47

2 2 53

7 8 89

6 4

9 11 11 16 1217

10 12 1313

15 18 1619

14 14 17 5 19 10 25 15 27 20

Jan (1) Feb(2) Mar(3) Apr(4) D1 D2

• Solution: use caching techniques – But…caching and swapping is performed also

by the operating system – MDBMS has to manage its caches

such that the OS doesn’t perform any damaging swaps

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 66

3.2 Problems in Array-Storage

(12)

• Storage of dense cubes

– If cubes are dense, array storage is more efficient.

However, operations suffer due to the large cubes – Solution: store dense cubes not linear but on 2

levels

The first contains indexes and the second the data cells stored in blocks

Optimization procedures like indexes (trees, bitmaps), physical partitioning, and compression (run-length- encoding) can be used

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 67

3.2 Problems in Array-Storage

• Storage of sparse cubes

– All the cells of a cube, including empty ones, have to be stored

– Sparseness leads to data being stored in many physical blocks or pages

•The query speed is affected by the large number of block accesses on the secondary memory

– Solution:

•Do not store empty blocks or pagesbut adapt the index structure

•2 level data structure: upper layer holds all possible combinations of the sparse dimensions, lower layer holds dense dimensions

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 68

3.2 Problems in Array-Storage

• 2 level cube storage

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 69

3.2 Problems in Array-Storage

Marketing campaign

Customer

Product

Time Geo

Dense low level Sparse upper level

• Based on the physical model used:

– DOLAP (Desktop OLAP) – MOLAP (Multidimensional OLAP) – ROLAP (Relational OLAP) – HOLAP (Hybrid OLAP)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 70

3.2 Physical Models

T i m e

O L A P R O L A P H O L A P M O L A P D O L A P

• DOLAP

– Developed as extension to the production system reports

It downloads a small hypercube from a central point (data mart or DW)

Performs multidimensional analysis while disconnected from the data source

The computation occurs on the client – Requires little investment

– They lack the ability to manage large data sets

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 71

3.2 Physical Models

• MOLAP

– Presentation layer provides the multidimensional view – The OLAP server stores data in a

multidimensional structure

Computation occurs in this layer during the loading step (not at query)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 72

3.2 Physical Models

MOLAP Interface

MDB Client

Presentation

Server

Data

(13)

• Advantages

– Excellent performance

Fast data retrieval

Optimal for slicing and dicing

Complex calculations

All calculations are pre-generated when the cube is created

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 73

3.2 MOLAP

• All calculations are pre-generated when the cube is created

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 74

3.2 MOLAP

all

time supplier

time,item time,location time,supplier

item,location

item,supplier location,supplier

time,item,location

time,item,supplier time,location,supplier

item,location,supplier

time, item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

item location

• Disadvantages

– Limited amount of data it can handle

•Cube can be derived from large amount of data, but only summary level information will be included in the cube

– Requires additional investment

•Cube technology are often proprietary

– Enormous amount of overhead

•An input file of 200 MB can expand to 5 GB with calculations

• Products:

– Cognos (IBM), Essbase (Oracle), Microsoft Analysis Service, Palo (open source)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 75

3.2 MOLAP

• ROLAP

– Presentation layer provides the multidimensional view – The ROLAP Server generates

SQL queries, from the OLAP OLAP requests, to query the RDBMS

– Data is stored in RDBs

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 76

3.2 ROLAP

Server ROLAP

Server

RDBMS Client

Presentation

Data

• Special schema design: e.g., star, snowflake

• Special indexes: e.g., bitmap, R-Trees

• Advantages

– Proven technology (relational model, DBMS) – Can handle large amounts of data (VLDBs)

• Disadvantages

– Limited SQL functionalities

• Products

– Microsoft Analysis Service, Siebel Analytics (now Oracle BI), Micro Strategy, Mondrian (open source)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 77

3.2 ROLAP

• Based on OLAP needs

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 78

3.2 ROLAP vs. MOLAP

OLAP needs MOLAP ROLAP

User Benefits

Multidimensional View

Excellent Performance -

Analytical Flexibility -

Real-Time Data Access -

High Data Capacity -

MIS Benefits

Easy Development -

Low Structure Maintenance -

Low Aggregate Maintenance -

(14)

• HOLAP

– Best of both worlds

Storing detailed data in RDBs

Storing aggregated data in MDBs – Different partitioning approaches

between MOLAP and ROLAP

Vertical

Horizontal

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 79

3.2 HOLAP

Server HOLAP

Server

RDBMS Presentation

Data

MDDB

• Vertical partitioning

– Aggregations are stored in MOLAP for fast query performance,

– Detailed data in ROLAP to optimize time of cube processing (loading the data from the OLTP)

• Horizontal partitioning

– HOLAP stores some slice of data, usually the more recent one (i.e. sliced by Time dimension) in MOLAP for fast query performance

– Older data in ROLAP

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 80

3.2 HOLAP

• Other approaches

– Store some cubes in MOLAP and others in ROLAP, leveraging the fact that in a large cuboid, there will be dense and sparse sub-regions

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 81

3.2 HOLAP

• ROLAP

– RDBMS - star/snowflake schema

– For detailed and larger volumes of data (TB)

• MOLAP

– MDBMS - Cube structures, array based storage – For summarized and relatively “small” volumes of data

(50GB)

• HOLAP is emerging as the OLAP server of choice

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 82

3.2 Conclusions

• Logical Model

– Cubes, Dimensions, Hierarchies, Classification Levels

• Physical Level

– Relational Implementation through:

Star schema: improves query performance for often-used data

–Less tables and simple structure

–Efficient query processing with regard to dimensions –In some cases, high overhead of redundant data

Snowflake schema: reduce the size of the dimension tables

–However, through dimension normalization - large number of

tables

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 83

Summary

• Physical Level – Array based storage

How to perform linearization

Problems:

–Order of dimensions – solution: caching

–Dense Cubes, Sparse Cubes - solution: 2 level storage

– MOLAP, ROLAP, HOLAP

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 84

Summary

(15)

• DW Optimization / Indexes – Bitmap indexes

– Tree based indexes – Hash indexes

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 85

Next lecture

Referenzen

ÄHNLICHE DOKUMENTE

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig