• Keine Ergebnisse gefunden

Chapt. 7 Multidimensional Hierarchical Clustering

N/A
N/A
Protected

Academic year: 2022

Aktie "Chapt. 7 Multidimensional Hierarchical Clustering"

Copied!
29
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Prof. Bayer, DWH, Ch.7, SS2002 1

Chapt. 7 Multidimensional Hierarchical Clustering

Fig. 3.1 Hierarchies in the `Juice and More´ schema

Year (3)

Month (12) TIME

Region (8)

Nation (7)

Trade Type (2)

Business Type (7) CUSTOMER

Type (5)

Brand (8)

Category (19)

Container (10)

PRODUCT

Sales

Organization (5) Distribution Channel (3) DISTRIBUTION

All Products All Customer All Distributions All Time

(2)

PRODKEY CUSTKEY DISTKEY TIMEKEY

SALES DISTCOST PRODKEY

PRODUCT 2180 rows

TYPE BRAND CATEGORY CONTAINER

...

CUSTKEY CUSTOMER

7064 rows

REGION NATION TRADE-TYPE

DISTKEY DISTRIBUTION

12 rows

SALESORG CHANNEL

...

TIMEKEY

36 TIME rows FACT

26M rows

...

(3)

Prof. Bayer, DWH, Ch.7, SS2002 3

Size of completely aggregated Cube (6*9*20*11)*(9*8*3*8)*(6*4)*(4*13)

--- = (5*8*19*10)*(8*7*2*7)*(5*3)*(3*12)

4*6*6*9*11*13 185.328

--- = --- = 7.96 larger than base cube 5*5*7*7*19 23.275

Base Cube has 2.245.024.000 cells * 4 B ~ 9 GB Number of available facts: 26 million

(4)

Sparsity:

26*106

--- = 0,0116 2,245* 109

100 - 1.16 = 98.84 % sparsity

(5)

Prof. Bayer, DWH, Ch.7, SS2002 5

Hierarchically aggregated Cube (1+5+40+760+7600) = 8406 (1+8+56+112+784) = 961

(1+5+15) = 21

(1+3+24) = 28

= 4.749.961.608 Size of base cube 2.145.024.000 Number of aggregate cells 2.504.937.608

==> Juice and More database has 96 times more

hierarchically aggregated cells than occupied base cells!

(6)

Star-Joins

Restrictions on several dimension tables, which are then joined with fact table

In addition: grouping, computation of aggregates, sorting of results.

Example:

Select <MEASURE AGGREGATION>

From Fact F, Customer C, DISTRIBUTION D, Product P, Time T

Where F. ProdKey = P. ProdKey AND F. CustKey = C. CustKey AND F.TIMEKEY = T.TIMEKEY AND F.DISTKEY = D.DISTKEY AND

<CUSTOMER RESTRICTION> AND

<DISTRIBUTION RESTRICTION> AND

(7)

Prof. Bayer, DWH, Ch.7, SS2002 7

Select <MEASURE AGGREGATION>

From Fact F

Where F. ProdKey BETWEEN Pkey1 AND Pkey2 AND F. DistKey BETWEEN Dkey1 AND Dkey2 AND F. CustKey BETWEEN Ckey1 AND Ckey2 AND F. TimeKey BETWEEN Tkey1 AND Tkey2

(8)

Key Question:

How to compute star-joins efficiently?

Secondary indexes on foreign keys of fact table (standard B-trees), see chapter 5 for details - intersect result lists

-retrieve tuples from fact table randomly

Bitmaps

(9)

Prof. Bayer, DWH, Ch.7, SS2002 9

Bitmap Index Intersection

bitmap for organization

= „TM“

bitmap for region

= „Asia

1...1.11 1.1...1.1. 1.1...1.1. ...1.1.... ..1.1...1.

11.1... 1.11...1 .1.1..1... 1.1.1... .1..1.1...

1... 1.1... ...1... ... ....1...

Page 1 Page 2 Page 3 Page 4 Page 5

result of bitmap intersection

accessed disk pages

(shaded)

34 % of

tuples

32 % of

tuples

10 % of

tuples

80 % of

pages

(10)

Problem: for small result sets of a few %, almost all pages of the facts table must be fetched from disk, if the hits in the result set are not clustered on disk.

Ex: with 8 KB pages 20 to 400 tuples per page,

i.e. at 0.25% to 5% hits in the result almost all pages must be fetched.

At least tuple clustering, preferably page clustering, are desirable, but how??

Goal: Code hierarchies in such a way, that for star- joins with the Fact table we have to join only with a query box on the Fact table

(11)

Prof. Bayer, DWH, Ch.7, SS2002 11

Basic Idea for Multidimensional Clustering

1L}

0.5L;

Juice Apple

1L;

OJ 0.7L;

OJ 0.33L;

0 {OJ

1

m

1L}

OJ 0.7L;

OJ 0.33L;

1 {OJ

1

m

0.5L}

{A-Juice

2 4 m

1L}

Juice Apple

0.5L Juice

Apple

1 {

2

m

0.33L}

2 {OJ

1

m m22 {OJ 0.7L} m32 {OJ1L} m 52 {A-Juice 1L}

Orange Juice Apple Juice

0,33L 0,7L 0,5L 1L

Product Category All Products

All

0 1

0 2 0 1

Level Label Member Ordinal (e.g.,1) Member Label (e.g., 0.7L) Legend:

Example Hierarchy in Member Set Representation AppleJuice

1 1L

(12)

Dimension D consists of

Value Set V = [[ v1, v2, ... vn ]]

Hierarchy H of height h consisting of h+1 hierarchy levels H = [[L0 , L1 ,..., Lh ]]

Level Li is a set of sets = [[m1i, ..., mji ]] with mki  V

mki get names, e.g. „Orange Juice“ as label(m11), in general label(mki)

Constraint: every m i+1 must be a subset of some m i

(13)

Prof. Bayer, DWH, Ch.7, SS2002 13

Hierarchic Relationships

The children of mki are all those sets mli+1 of the lower level i+1 with the property:

mli+1  mki , formally:

children(mki ) := [[mli+1  Li+1 : mli+1  mki ]]

parent(mki ) := [[mli-1  Li-1 : mli-1  mki ]]

Principle: the children of m are numbered by the bijective function ordm starting at 1 or 0

(14)

Hierarchic Relationships

The children of mki are all those sets mli+1 of the lower level i+1 with the property:

mli+1  mki , formally:

children(mki ) := [[mli+1  Li+1 : mli+1  mki ]]

parent(mki ) := [[mli-1  Li-1 : mli-1  mki ]]

Principle: the children of m are numbered by the bijective

(15)

Prof. Bayer, DWH, Ch.7, SS2002 15

Enumeration and Surrogate Functions Let A be an enumeration type

A = [[ a0, a1, ... ak ]]

f : A --> (0, 1 ,..., k ) defined as f (ai ) = i

then i is called the surrogate of ai

(16)

Hierarchies and composite Surrogates

Basic Idea: concatenate the surogates of successive hierarchy levels (compound surrogates cs)

Note: the root ALL of the hierarchy is not encoded Def: compound surrogate cs for hierarchy H

ordm : children (m) --> [[0, 1, ..., |children(m)| -1]]

cs (H, mi) := ord father (mi) (mi) if i=1

:=cs (H, father ( mi))  ord father (mi) (mi) otherwise

(17)

Prof. Bayer, DWH, Ch.7, SS2002 17

Example:

REGION f(REGION)

South Europe 0

Middle Europe 1

Northern Europe 2

Western Europe 3

North America 4

Latin America 5

Asia 6

Australia 7

(a)

(18)

0

CUSTOMER

South Europe North America Asia

Retail

Wholesale Kana´s Sushi Bar

Joe‘sSports Bar

... ...

Bar

4 6

2

1

1 0

Retail Canada 1 USA

0

... ...

... ...

... ...

Australia

7

Wholesale

0

(19)

Prof. Bayer, DWH, Ch.7, SS2002 19

Example: the path

North America --> USA --> Retail --> Bar has the compound surrogate 4112

Next Idea: for every hierarchy level determine the

higest branching degree (plus a safety margin for future extensions) and code by fixed number of bits.

surrogates (H,i) := max [[ cardinality (children (H,m)) : m  level (H, i-1) ]]

(20)

let l

i

:= log

2

surrogates (H,i)

then l

i

bits are needed for the surrogates of level i

let  be a path  = m

0

 m

1

 m

2

 ...  m

h

to a leaf m

h

of hierarchy H:

(21)

Prof. Bayer, DWH, Ch.7, SS2002 21

cs (H,) = cs (H,m

h

)

 

m

 

l l lh

father

m

ord

1 1

 2

2 3

 

m

 

l l lh

father

m

ord

2 2

 2

3 4

 

m

 

h

father

m

ord

h

:

=

... +

+

+

(22)

Example:

cs (H, Bar) = 100 001 1 010 = 538

l

1

=3 l

2

=3 l

3

=1 l

4

=3

number of bits needed at certain level

(23)

Prof. Bayer, DWH, Ch.7, SS2002 23

Properties of MHC Encoding

very compact coding of fixed length

• lexicographic order of composite keys remains, i.e. isomorphic to integer ordering

• point restrictions on arbitrary hierarchy levels lead to interval restrictions on the compound surrogates

(24)

Example: path to USA is:

North America --> USA

4 = 1002 1 = 0012

leads to range on cs:

100 001 0 0002 to 100 001 1 1112 and to the decimal range:

528 to 543 or [528 : 543]

==> star join with restriction North America.USA leads to an interval restriction on the fact table

(25)

Prof. Bayer, DWH, Ch.7, SS2002 25

Complex Hierarchies

time with months and weeks, both restrictions lead to intervals on the level of days

• Example of Fig. 4-4

• proposal for multiple hierarchies: choose the most useful (depending on the query profile) or consider multiple hierarchies as several independent

hierarchies. Caution, this increases the number of dimensions !!!

• Time variant hierarchies: extend by time interval of validity , see Example Fig. 4-5,

(26)

(a) (b) YEAR

MONTH WEEK

DAY

REGION

NATION

TRADE TYPE

CUSTOMER TYPE

CUSTOMER SIZE

CUSTOMER

(27)

Prof. Bayer, DWH, Ch.7, SS2002 27

CUSTOMER

South Europe North America ...

USA Canada

Retail Wholesale

Bar Restaurant

Joe ‘s Sports Bar

Year <= 1997 Year > 1997 Fig. 4-5 Change of a hierarchy over the time

(28)

Orange Juice

(29)

Prof. Bayer, DWH, Ch.7, SS2002 29

Apple

Juice

Processing a query box Asia

in sort order with the Tetris algorithm

Referenzen

ÄHNLICHE DOKUMENTE

In the years following the UK’s decision to remain in the EEC, and through successive Treaties which slowly but ineluctably bound both the UK and the Republic of Ireland ever more

3: Phase diagram representing the tunneling, field- suppressed and ballistic transport regimes in the field-doping parameter plane.. We first compare,

25 Ad- ditionally, we have presented the complete phase diagram in a field-doping parameter plane, illustrating the crossover from the field suppressed to the ballistic

Hardware Diagnostic Reference Manyal - This manual provides a functional description and detailed operating instructions for diagnostic programs.. System Operator~s

Machine language programs can also be written from BASIC with the individual bytes of the program written in decimal values in DATA statements.. The decimal

was observed (Fig. The amount of extractable binder of such low molecular weight prior to humid aging was unexpected. Examination of the GPC results obtained for

The method of introducing special coordinates is used in general relativity to handle parts of the characteristic Cauchy problem for the vacuum field equations and to handle

At first (investigation A) blood samples were drawn and tested in four different ways (I: drawn and tested imme- diately; II: drawn one minute post partum, stored at room