• Keine Ergebnisse gefunden

Data Warehousing & Data Mining

N/A
N/A
Protected

Academic year: 2021

Aktie "Data Warehousing & Data Mining"

Copied!
13
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Data Warehousing

& Data Mining

Wolf-Tilo Balke Silviu Homoceanu

Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

• B-Trees are not fit for multidimensional data

• R-Trees

–MBR as geometry to build multidimensional indexes –Inefficient because they allow overlapping between

neighboring MBRs

–R+-trees -improve the search performance

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2

Summary

• UB-Trees

–Reduce multidimensional data to one dimension in order to use B-Tree indexes

–Z-Regions, Z-Curve, use the advantage of bit operations to make optimal jumps

• Bitmap indexes

–Great for indexing tables with set-like attributes e.g., Gender: Male/Female

–Operations are efficient and easy to implement (directly supported by hardware)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 3

Summary

5. Optimization

5.1 Partitioning 5.2 Joins

5.3 Materialized Views

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 4

5. Optimization

• Breaking the data into several physical units that can be handled separately

• Granularity and partitioning are key to efficient implementation of a warehouse

• The question is not whether to use partitioning but how to do it

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 5

5.1 Partitioning

• Why partitioning?

–Flexibilityin managing data –Smaller physical units allow

•Easy restructuring

•Free indexing

•Sequential scans, if needed

•Easy reorganization

•Easy recovery

•Easy monitoring

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 6

5.1 Partitioning

(2)

• In DWs, partitioning is done to improve:

–Business query performance, i.e., minimize the amount of data to scan

–Data availability, e.g., back-up/restores can run at the partition level

–Database administration, e.g., adding new columns to a table, archiving data, recreating indexes, loading tables

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 7

5.1 Partitioning

• Possible approaches:

–Data partitioning

where data is usually partitioned by

•Date

•Line of business

•Geography

•Organizational unit

•Combinations of these factors –Hardware partitioning

•Hardware partitioning makes data available to different processing nodes by ensuring that sub-processes are capable of running on the different nodes

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 8

5.1 Partitioning

• At which level should data partitioning occur?

–Possibilities are

•on application

•or DBMS level

• Partitioning on DBMS level is clear, but it also makes sense to partition at application level

–E.g., allows different definitions for each year

•Important since DWs spans many years and as business evolves DWs change, too

•Think for instance about changing tax laws

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 9

5.1 Data Partitioning

vs.

• Data partitioning, involves:

–Splitting out the rows of a table into multiple tables i.e., horizontal partitioning

–Splitting out the columns of a table into multiple tables i.e., vertical partitioning

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 10

5.1 Data Partitioning

Master table

Horizontal Vertical

Primary key

• Horizontal partitioning

–The set of tuples of a table is split among disjoint table parts

–Definition: A set of Relations {R1,…, Rn}represent a horizontal partitioningof Master-Relation R if and only if Ri⊆ R, Ri⋂ RjØ and R ∪iRi, for 1≤ i, j ≤ n –According to the partitioning procedure we have

different horizontal partitioning solutions

•Range partitioning, List partitioning and Hash partitioning

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 11

5.1 Data Partitioning

• Range Partitioning

–Selects a partition by determining if the partitioning key is inside a certain range

–A partition can be represented as a restriction on the master-relation

•Ri= σPi(R), where Piis the partitioning predicate. The partitioning predicate can involve more attributes

P1: Country = ‘Deutschland’ and Year = 2009 P2: Country = ‘Deutschland’ and Year < 2009 P3: Country ≠‘Deutschland’

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 12

5.1 Horizontal Partitioning

(3)

• List Partitioning

–A partition is assigned a list of values. If the partitioning key has one of these values, the partition is chosen

•For example all rows where the column Country is either Iceland, Norway, Sweden, Finland or Denmark could build a partition for the Scandinavian countries

–Is also expressed as a simple restrictionon the master relation

•The partitioning predicate involves just one attribute P1: City IN (‘Hamburg’, ‘Hannover’, ‘Berlin’)

P2: City IN (DEFAULT) – represents tuples which do not fit to P1

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 13

5.1 Horizontal Partitioning

• Hash Partitioning

–The value of a hash function determines membership in a partition. Assuming there are four partitions, the hash function could return a value from 0 to 3

–For each tuple t, of the master-table R, the hash function will associate it to a partition table Ri

•Ri {t1, …, tm/tj∈R and H(tj) H(tk) for 1 ≤ j, k ≤ m}

–This kind of partitioning is particularly used in parallel processing

–The choosing of the hash function is decisive. The goal is to achieve an equal distribution of the data

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 14

5.1 Horizontal Partitioning

• Horizontal partitioning

in Data Warehousing partitions data by

–Time dimension

•Periods, such as week or month can be used or the data can be partitioned by the age of the data

•E.g., if the analysis is usually done on last month's data the table could be partitioned into monthly segments

–A dimensionother than time

•If queries usually run on a grouping of data: e.g., each branch tends to query on its own data and the dimension structure is not likely to change then partition the table on this dimension –On table size

•If a dimension cannot be used, partition the table by a predefined size. If this method is used, metadata must be created to identify what is contained in each partition

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 15

5.1 Horizontal Partitioning

• Vertical Partitioning

–Involves creating tables with fewer columns and using additional tables to store the remaining columns –Different physical storage might be used e.g., storing

infrequently used or very wide columns on a different device

–Usually called row splitting

–Row splitting creates a one-to-one relationship between the partitions

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16

5.1 Vertical Partitioning

• In DW, common vertical partitioning means

–Moving seldom used columns from a highly-used

table to another table

–Creating a viewacross the two newly created tables restores the original table with a performance penalty

•However, performance will increase when accessing the highly-used data e.g., for statistical analysis

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 17

5.1 Vertical Partitioning

• In DWs with very large dimension tables (e.g., Amazon - a customer table with tens of millions of records) we have

–Most of the attributes are rarely, if at all, queried –E.g., the address attribute is not as interesting as

evaluating customers per age-group

–But we must still maintain the link between the fact table and the completecustomer dimension, which has high performance costs!

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 18

5.1 Vertical Partitioning

(4)

• The solution is to use Mini-Dimensions, a special case of vertical partitioning

–Many dimension attributes are used very frequently as browsing constraints

•In big dimensions these constraints can be hard to find among the lesser used ones

–Logical groups of often used constraints can be separated into small dimensions which are very well indexed and easily accessible for browsing

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 19

5.1 Vertical Partitioning

• Mini-Dimensions, e.g., the Demography table

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 20

5.1 Vertical Partitioning

ProdID TimeID GeoID CustomerID

DemogrID Profit

Qty

CustomerID Last Name First Name Address DemogrID

DemogrID Age group Income group

Area Fact table

Customer table

Demography

–All variables in these mini-dimensions must be presented as distinct classes

–The key to the mini-dimension can be placed as a foreign key in both the fact table and dimension table from which it has been broken off

–Mini-dimensions, as their name suggests, should be kept small and compact

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 21

5.1 Vertical Partitioning

• Advantages

–Records used together are grouped together –Each partition can be optimized for performance –Security, recovery

–Partitions stored on different disks: contention –Take advantage of parallel processing capability

• Disadvantages

–Slow retrieval across partitions (expensive joins) –Complexity

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 22

5.1 Partitioning

• Use partitioning when:

–A table is > 2GB (from Oracle) –A Table is > 100 Million rows (praxis) –Think about it, if table is > 1 million rows

• Partitioning does not come for free!

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 23

5.1 Partitioning

• Partitioning management

–Partitioning should be transparent outside the DBMS

•The applications work with the Master-Table at logical level

•The conversion to the physical partition tables is performed internally by the DBMS

•It considers also data consistency as if the data were stored in just one table

–Partitioning transparency is not yet a standard. Not all DBMS support it!

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 24

5.1 Partitioning

(5)

• Partitions in praxis

–Oracle supports Range-, List-, Hash-, Interval-, System- Partitions as well as combinations of these methods –E.g., partitioning in Oracle:

CREATE TABLE SALES(

ProdID NUMBER, GeoID NUMBER, TimeID DATE, Profit NUMBER)

PARTITION BY RANGE(timeID)(

PARTITION before 2008

VALUES LESS THAN (TO_DATE (’01-JAN- 2008’, ‘DD-MM-YYYY’)),

PARTITION 2008

VALUES LESS THAN (TO_DATE (’01-JAN- 2009’, ‘DD-MM-YYYY’))

);

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 25

5.1 Partitioning Management

• Partitions in praxis

–In Oracle partitioning is performed with the help of the LESS THAN function. How can we partition data in the current year?

•ALTER TABLE Sales

ADD PARTITION after2008 VALUES LESS THAN (MAXVALUE);

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 26

5.1 Partitioning Management

• Partitioning:

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 27

5.1 Partitioning Management

RowID ProdID GeoID TimeID Profit

121 132 2 05.2007 8K

122 12 2 08.2008 7K

123 15 1 09.2007 5K

124 14 3 01.2009 3K

125 143 2 03.2009 1,5K

126 99 3 05.2007 1K

RowID ProdID GeoID TimeID Profit

121 132 2 05.2007 8K

123 15 1 09.2007 5K

126 99 3 05.2007 1K

RowID ProdID GeoID TimeID Profit

122 12 2 08.2008 7K

RowID ProdID GeoID TimeID Profit

124 14 3 01.2009 3K

125 143 2 03.2009 1,5K

• In the data cleaning phase, records can be updated. For partition split tables, this means data migration:

–UPDATE Sales SET TimeID

= ‘05.2008' WHERE RowID

= 121;

•ERROR at line 1: ORA-14402:

updating partition key column would cause a partition change

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 28

5.1 Partitioning Management

RowID ProdID GeoID TimeID Profit

121 132 2 05.2007 8K

123 15 1 09.2007 5K

126 99 3 05.2007 1K

RowID ProdID GeoID TimeID Profit

122 12 2 08.2008 7K

124 14 3 01.2009 3K

125 143 2 03.2009 1,5K

• Data migration between partitions is by default disabled

–ALTER TABLE Sales ENABLE ROW MOVEMENT;

–ROW MOVEMENT deletes the record from one partition and inserts it into another

•The issue is that RowID is automatically changed!

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 29

5.1 Partitioning Management

RowID ProdID GeoID TimeID Profit

121 132 2 05.2007 8K

123 15 1 09.2007 5K

126 99 3 05.2007 1K

RowID ProdID GeoID TimeID Profit

122 12 2 08.2008 7K

124 14 3 01.2009 3K

125 143 2 03.2009 1,5K

13256 132 2 05.2008 8k

• Often queries over several partitions are needed

–This results in joinsover the data

–Though joins are generally expensive operations, the overall cost of the query may strongly differ with the chosen evaluation plan for the joins

• Joins are commutative and associative

–R ⋈ S ≡ S ⋈ R

–R ⋈ (S ⋈ T) ≡ (S ⋈ R) ⋈ T

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 30

5.2 Join Optimization

(6)

• This allows to evaluate individual joins in any order

–Results in join trees

•Different join trees may show very different evaluation performance

–Join trees have different shapes –Within a shape, there are different

relation assignments possible

• Example: R ⋈ S ⋈ T ⋈ U

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 31

5.2 Join Optimization

R S

T U

R S T U

• Number of possible join trees grows rapidly with number of join relations

–For n relations, there are T(n) different tree shapes

•“Any number of 1 ≤ i ≤ n-1 relations may be in the left subtree and ordered in T(i) shapes while the remaining n-i relations form the right subtree and can be arranged in T(n-i) shapes.”

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 32

5.2 Join Optimization

=

=

=

1

1 () ( )

) (

1 ) 1 (

n i

i n T i T n T T

• Optimizer has 3 choices

–Consider all possible join trees

•Usually not possible –Consider a subset of all trees

•i.e. restrict to trees of certain shapes –Use heuristics to pick a certain shape

• Classical join order optimization is discussed in more detail in the RDB2 lecture

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 33

5.2 Join Optimization

• Relational optimization of star-joins

–Star schema comprises

a bigfact table and many smalldimension tables

–An OLAP SQL query joins dimension and fact tables usually in the WHERE clause

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 34

5.2 Join Optimization in DW

Sales Product_ID Time_ID Geo_ID Sales Revenue Product

Product_ID Product group Product category Description

Geography Geo_ID Store State Region Country

Time Time_ID Day Week Month Quarter Year n n n

1 1

1

sales.ProdID = product.ProdID AND sales.TimeID = time.TimeID

AND sales.GeoID = geo.GeoID AND time.Year = 2009 AND geo.Country = “Deutschland ”

AND product.group = “wash machines”

• If the OLAP query specifies restrictions or group by’s on n dimensions, an n+1 order join is necessary

–Joins can be performed only pair-wise, resulting in (n+1)!

possible joining orders

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 35

5.2 Join Optimization in DW

Sales Geo

Time Product

σcountry= ‘Germany‘

σcountry = ‘Germany‘

σmonth = ‘Jan 2009‘

σmonth = ‘Jan 2009‘

σgroup = ‘Electronics‘

σgroup = ‘Electronics‘

• To reduce the number of join-orders, heuristics are used

–In OLTP heuristics show that it is not a good idea to join tables that are not connected via some attribute to each other

•E.g., Geo with Time relation leads to a Cartesian product

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 36

5.2 Join Heuristics

Sales Product_ID Time_ID Geo_ID Sales Revenue

Geography Geo_ID Store State Region Country

Time Time_ID Day Week Month Quarter Year n n n

1 1

1

55

Geo Time

σcountry=„Germany“

σcountry=„Germany“ σσmonth=„Jan 2009“month=„Jan 2009“

(7)

• But this heuristic rule from OLTP is not suitable for DW!

–E.g., join Sales with Geo in the following case:

•Sales has 10 mil records, in Germany there are 10stores, in January 2009 there were products sold in 20days, and the Electronics group has 50products

–If 20%of our sales were performed in Germany, the selectivity is small and an index would not help that much

•The intermediate result would still comprise 2 mil records

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 37

5.2 Join Heuristics

Sales Geo

σcountry= „Germany“

σcountry = „Germany“

• In star-joins a cross product of the dimension tables is recommended

–Geo dimension –10stores –Time dimension 20days –Product dimension 50products –10*20*50= 10 000 records after

performing the cross product of the dimensions

–The selectivity is in this case0.1%which is fit for using an index

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 38

5.2 Dimensional Cross Product

55 55

Geo Time

Product σcountry=„Germany“

σcountry=„Germany“ σσmonth=„Jan 2009“month=„Jan 2009“

σ

group=„Electronics“

σ

group=„Electronics“

Sales

• But dimension cross products can also become expensive

–If the restrictions on the dimensions are not restrictive enough or if there are many dimension tables

•E.g., if we would have query on 1000stores and 1000 customers, the cross product would comprise 1.000 * 20 * 50 * 1.000 * = 100 mil records…

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 39

5.2 Dimensional Cross Product

• E.g., In Germany the company has 1000 stores, it sold in 20 days in January, and it has 1000 types of electronics products

–1.000 * 20 * 1.000 * = 20 mil records…

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 40

5.2 Dimensional Cross Product

55 55

Geo Time

Product σcountry=„Germany“

σcountry=„Germany“ σσmonth=„Jan 2009“month=„Jan 2009“

σ

group=„Electronics“

σ

group=„Electronics“

Sales

20 000

1 000 20

20 000 000 1 000

100 000 000

• The IBM DB2 solution for expensive dimension cross products is to build semi-joins of the dimension tables with indexes

–A B*-Tree index will be on the fact table (sales) for each dimension (B*-Trees are unidimensional)

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 41

5.2 Star-Join Optimization

Super

Super

Geo

Index on Sales(GeoID) σcountry=„Germany“

σcountry=„Germany“

1 000

(pointers) IDs of sales in Germany!

GeoGermany m

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 42

5.2 Star-Join Optimization

GeoID Store City Country

1 S1 BS Germany

2 S2 BS Germany

3 S3 HAN Germany

4 S4 Lyon France

GeoID ID

1 1

1 2

4 3

GeoID ID

1 1

1 2

S⋉

σcountry=„Germany“

σcountry=„Germany“

Geo

Index on Sales

GeoID Store City Country

1 S1 BS Germany

2 S2 BS Germany

3 S3 HAN Germany

(8)

• Reduces the fact table to what we need, based on indexes

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 43

5.2 Star-Join Optimization

S⋉

S⋉

σcountry=„Germany“

σcountry=„Germany“

σgroup=„Wash machines“

σgroup=„Wash machines“

σmonth=„Jan 2009“

σmonth=„Jan 2009“

Geo Index on Sales(GeoID) S⋉

S⋉

Product Index on Sales(ProdID) S⋉

S⋉

Time Index on Sales(TimeID) S⋂

S⋂

List of TID (tuple IDs) m sales IDs , and GIDs

n sales IDs and PIDs

p sales IDs and TIDs TempGeo

TProd

TTime

• List of TID (tuple IDs) has a worst case of min(m, n, p) elements because of S⋂

• The complete operator tree

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 44

5.2 Star-Join Optimization

TempGeo

TempTime TempProduct

List of TID (tuple IDs)

Min (m, n, p) 1 000

20 1000

• Materialized Views (MV)

–Views whose tuples are stored

in the database are said to be materialized

–They provides fast access, like a (very high-level) cache –Need to maintain the view as the underlying tables

change

•Ideally, we want incremental view maintenance algorithms

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 45

5.3 Materialized Views

How can we use MV in DW?

–E.g., we have queries requiring us to join the Sales table with another table and aggregate the result

•SELECT P.Categ, SUM(S.Qty) FROM Product P, Sales S WHERE P.ProdID=S.ProdID GROUP BY P.Categ

•SELECT G.Store, SUM(S.Qty) FROM Geo G, Sales S WHERE G.GeoID=S.GeoID GROUP BY G.Store

•….

–There are more solutions to speed up such queries

•Pre-compute the two joins involved (product with sales and geo with sales)

•Pre-compute each query in its entirety

•Or use an already materialized view

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 46

5.3 Materialized Views

Having the following view materialized

–CREATE MATERIALIZED VIEW Totalsales (ProdID, GeoID, total) AS SELECT S.ProdID, S.GeoID, SUM(S.Qty) FROM Sales S GROUP BY S.ProdID, S.GeoID

We can use it in our 2 queries

–SELECT P.Categ, SUM(T.Total) FROM Product P, Totalsales T WHERE P.ProdID=T.ProdID GROUP BY P.Categ –SELECT G.Store, SUM(T.Total) FROM Geo G, Totalsales T

WHERE G.GeoID=T.GeoID GROUP BY G.Store

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 47

5.3 Materialized views

• MV issues

–Utilization

•What views should we materialize, and what indexes should we build on the pre-computed results?

–Choiceof materialized views

•Given a query and a set of materialized views, can we use the materialized views to answer the query?

–Maintenance

•How frequently should we refresh materialized views to make them consistent with the underlying tables?

•And how can we do this incrementally?

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 48

5.3 Materialized views

(9)

• Materialized views utilization has to be transparent

–Queries are internally rewritten to use the available MVs by the query rewriter

–The query rewriter performs integration of the MV based on the query execution graph

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 49

5.3 Utilization of MV

• E.g., materialized views utilization, mono-block query

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 50

5.3 Utilization of MV

σF σF Sales Product

σP Geo σP

σG σG σ SalesπPrice, Group, Store

Query Q

Sales Product

σP

σP

σ Sales, Invoiceπ Price, Group

MV M

σF σF

MV M

Geo σG

σG

σ Salesπ Store,Price,Group

Query Q`

• Integration of MV

–Valid replacement: A query Q` represents a valid replacement of query Q by utilizing the materialized view M, if Q and Q` always deliver the same result set –For general relational queries, the problem of finding a

valid replacement is NP-complete

•But there are praxis relevant solutions for special cases like star-queries

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 51

5.3 Integration of MV

• In order to be able to integrate MV M in Q and obtain Q`, the following conditions need to be respected

–The selection condition in M cannot be more restrictive than the one in Q

–The projection from Q has to be a subset of the projection from M

–It has to be possible to derive the aggregation functions of π(Q) fromπ(M)

–Additional selection conditions in Q have to be possible also on M

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 52

5.3 Integration of MV

• How do we use MV even when there is no perfect match?

• If the selection in M is more restrictive than the selection in Q

–Split Q in Qaand Qbsuch that σ(Qa) = (σ(Q) ⋀ σ(M)) andσ(Qb) = (σ(Q) ⋀ ¬σ(M))

–Multiblock queries

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 53

5.3 Integration of MV

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 54

5.3 Integration of MV

σF[Q]

σF[Q]

Sales Product

σP Geo σP

σG σG σSalesπPrice, Group, Store

Query Q

Sales Product σP

σP

σ Sales, InvoiceπPrice, Group

MV M

σF[M]

σF[M]

σF[Q] ⋀ σF[M]

σF[Q] ⋀ σF[M]

MV M

Geo σG

σG

σSalesπStore

Query Q`

Sales σF[Q] ⋀ ¬σF[M]

σF[Q] ⋀ ¬σF[M]

Product σP σP σ SalesπPrice, Group, Store

ALL

ALL

(10)

• In DW, materialized views are often used to store aggregated results

–The number of nodes in the lattice of cuboids is

•|n| = ∏

j=1 n2 = 2n –n = 3, |n| = 8 and we

would need to materialize 2-D cuboids 1-D cuboids and

0D cuboids; in total 7 views

–n = 16, |n| = 65534, … too much to materialize –What should we materialize?

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 55

5.3 MV in DW

Then there are classification levels. Each dimension can have more granularity levels

–If we consider the cuboid lattice for just 2 dimensions, e.g., Product with 4 aggregation levels and Geo with 5 levels we obtain a lattice with 24+5 = 512 nodes

But we can reduce this number through functional

dependencies

–Instead of 8 aggregation nodes we reduce it to 4

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 56

5.3 MV in DW

Article Prod. Group Prod. Family

Prod. Categ

(Article, Group, Category) (Article, Group) (Article, Category) (Article)

(Article)

(Group, Category) (Group)

(Group)

(Category) (Category)

ALL ALL

• In the case of 2 dimensions with 4 respectively 5 granularity levels

–It can be reduced to (4+1)* (5+1)=30

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 57

5.3 MV in DW

all CO R ST C S

PF PG

A PC

CO,PC R,PC ST,PC C,PC S,PC

S,PF S,PG

S,A C,PF

C,PG C,A ST,PF

ST,PG ST,A R,PF

R,PG R,A CO,PF

CO,PG CO,A

CO= country, R = region, ST = state, C = city, S = store, PC = product category, PF = family, PG = group, A = article

• Choosing the views to materialize

–Static choice:

•The choice is performed at a certain time point by the DB administrator (not very often) or by an algorithm

•The set of MVs remains unmodified until the next refresh

•The chosen MVs correspond to older queries –Dynamical choice:

•The MV set adapts itself according to new queries

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 58

5.3 Choosing the MV to Use

• Static choice

–Choose which views to materialize, in concordance with the “benefit” they bring

•The benefit is computed based on a cost function –Thecost functioninvolves

•Query costs

•Statistical approximations of the frequency of the query

•Actualization/maintenance costs

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 59

5.3 Static Choice

The problem of choosing what to materialize is now a classical

knapsack problem

–We have a maximum MV storage size and the cost of each node in the lattice

The choice algorithm is greedy

–Input: the lattice of cuboids, the expected cardinality of each node, and the maximum storage size available to save MVs

–It calculates the nodes from the lattice which bring the highest benefit according to the cost function, until there is no more space to store MVs

–Output: the list of lattice nodes to be materialized

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 60

5.3 Static Choice

(11)

• E.g., choose which nodes to materialize from the lattice

–MV storage of 2000 records

–Possible aggregations: Month 1-12, Region 1-50, Group 1-100

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 61

5.3 Static Choice

ALL

(A1) Month12 (A2) Region50 (A3) Group100

(A1,A2) Month,Region600

(A1,A3) Month,Group1200

(A2,A3) Region,Group5000

(A1,A2,A3) Month, Region, Group6000

Based on the cost function the algorithm may choose the following nodes to materialize

–(A1,A2), from which we can calculate A1, A2and ALL reading 600 records instead of 6000

–(A3), because Group queries had a very high frequency –(A1, A3) offers the next

best benefit –The next node

chosen by the algorithm as benefit would have been (A2,A3) but it doesn’t fit into the

remaining space; 5000 > 2000 – 600 – 100 - 1200

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 62

5.3 Static Choice

ALL

(A1) Month12 (A2) Region50 (A3) Group100

(A1,A2) Month,Region600

(A1,A3) Month,Group1200

(A2,A3) Region,Group5000

(A1,A2,A3) Month, Region, Group6000

• Disadvantages

of static choice

–OLAP applications are interactive

•Usually, the user runs a series of queries to explain a behavior he has observed, which happened for the first time

So now the query set comprises hard to predict, ad-hoc queries

–Even if the query pattern would be observed after a while, it is unknown for how much time it will remain used

•Queries are always changing

–Often modification to the data leads to high update effort

There are, however, also for OLAP applications, some often repeating queries that should in any case be statically materialized

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 63

5.3 Choosing the MV to Use

• Dynamic choice of MV

–Monitor the queries being executed over time –Maintain a materialized view processing plan (MVPP)

by incorporating most frequently executed queries –Modify MVPP incrementally by executing MVPP

generation algorithm (in background) –Decide on the views to be materialized –Reorganize the existing views

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 64

5.3 Choosing the MV to Use

It works on the same principle as caching, but with

semantic

knowledge

Considered factors for calculating the benefit are:

–Time of the last access –Frequency

–Size of the materialized view

–Costs, a new calculation or actualization would produce for a MV

–Number of queries which were answered with the MV –Number of queries which could be answered with this MV

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 65

5.3 Dynamic Choice of MV

• Dynamic update of the cache

–In each step, the benefit of MV in the cache as well as of the query are calculated

–All MVs as well as the query result are sorted according to the benefit

–The cache is then filled with MV in the order of their benefit, from high to low

•This way it can happen that one or more old MVs are replaced, to insert the result of the current query

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 66

5.3 Dynamic Choice of MV

(12)

• Maintenance of MV

–Keeping a materialized view up-to-datewith the underlying data

–Important questions

•Howdo we refresh a view when an underlying table is refreshed?

•Whenshould we refresh a view in response to a change in the underlying table?

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 67

5.3 Maintenance of MV

• Materialized views can be maintained by re- computation on every update

–Not the best solution

• A better option is incremental view maintenance

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 68

5.3 How to Refresh a MV

• Incremental view maintenance

–Changes to database relations are used to compute changes to materialized view, which is then updated –Considering that we have a materialized view V, and that the basis relations suffer modifications through inserts, updates or deletes, we can calculate V` as follows

•V` = (V -∆-(V)) ∪ ∆+(V), where ∆-(V) and ∆+(V) represent deleted respectively inserted tuples

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 69

5.3 How to Refresh a MV

• Immediate

–As part of the transaction that modifies the underlying data tables

•Advantage: materialized view is always consistent

•Disadvantage: updates are slowed down

• Deferred

–Some time later, in a separate transaction

•Advantage: can scale to maintain many views without slowing updates

•Disadvantage: view briefly becomes inconsistent

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 70

5.3 When to Refresh a MV

• Deferred refresh comes in 3 flavors

–Lazy:delay refresh until next query on view, then refresh before answering the query

–Periodic (Snapshot): refresh periodically; queries are possibly answered using outdated version of view tuples; widely used for DW

–Event-based:e.g., refresh after a fixed number of updates to underlying data tables

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 71

5.3 When to Refresh a MV

• Partitioning: Horizontal or Vertical

–Records used together are grouped together –However: slow retrieval across partitions –Mini-Dimensions

• Joins: for DW it is sometimes better to perform cross product on dimensions first

• Materialized Views: we can’t materialize everthing

–Static or Dynamic choice of what to materialize –The benefit cost function is decisive

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 72

Summary

(13)

• Queries!

–OLAP queries –SQL for DW –MDX

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 73

Next lecture

Abbildung

table to another table

Referenzen

ÄHNLICHE DOKUMENTE

Data Warehousing &amp; OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 3..

Data Warehousing &amp; OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing &amp; OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing &amp; OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing &amp; OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Data Warehousing &amp; OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Relational Database Systems 1 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 54. 12.2 Supply

Data Warehousing &amp; OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig