Prof. Bayer, DWH, Ch.7, SS2002 1
Chapt. 7 Multidimensional Hierarchical Clustering
Fig. 3.1 Hierarchies in the `Juice and More´ schema
Year (3)
Month (12) TIME
Region (8)
Nation (7)
Trade Type (2)
Business Type (7) CUSTOMER
Type (5)
Brand (8)
Category (19)
Container (10)
PRODUCT
Sales
Organization (5) Distribution Channel (3) DISTRIBUTION
All Products All Customer All Distributions All Time
PRODKEY CUSTKEY DISTKEY TIMEKEY
SALES DISTCOST PRODKEY
PRODUCT 2180 rows
TYPE BRAND CATEGORY CONTAINER
...
CUSTKEY CUSTOMER
7064 rows
REGION NATION TRADE-TYPE
DISTKEY DISTRIBUTION
12 rows
SALESORG CHANNEL
...
TIMEKEY
36 TIME rows FACT
26M rows
...
Prof. Bayer, DWH, Ch.7, SS2002 3
Size of completely aggregated Cube (6*9*20*11)*(9*8*3*8)*(6*4)*(4*13)
--- = (5*8*19*10)*(8*7*2*7)*(5*3)*(3*12)
4*6*6*9*11*13 185.328
--- = --- = 7.96 larger than base cube 5*5*7*7*19 23.275
Base Cube has 2.245.024.000 cells * 4 B ~ 9 GB Number of available facts: 26 million
Sparsity:
26*106
--- = 0,0116 2,245* 109
100 - 1.16 = 98.84 % sparsity
Prof. Bayer, DWH, Ch.7, SS2002 5
Hierarchically aggregated Cube (1+5+40+760+7600) = 8406 (1+8+56+112+784) = 961
(1+5+15) = 21
(1+3+24) = 28
= 4.749.961.608 Size of base cube 2.145.024.000 Number of aggregate cells 2.504.937.608==> Juice and More database has 96 times more
hierarchically aggregated cells than occupied base cells!
Star-Joins
Restrictions on several dimension tables, which are then joined with fact table
In addition: grouping, computation of aggregates, sorting of results.
Example:
Select <MEASURE AGGREGATION>
From Fact F, Customer C, DISTRIBUTION D, Product P, Time T
Where F. ProdKey = P. ProdKey AND F. CustKey = C. CustKey AND F.TIMEKEY = T.TIMEKEY AND F.DISTKEY = D.DISTKEY AND
<CUSTOMER RESTRICTION> AND
<DISTRIBUTION RESTRICTION> AND
Prof. Bayer, DWH, Ch.7, SS2002 7
Select <MEASURE AGGREGATION>
From Fact F
Where F. ProdKey BETWEEN Pkey1 AND Pkey2 AND F. DistKey BETWEEN Dkey1 AND Dkey2 AND F. CustKey BETWEEN Ckey1 AND Ckey2 AND F. TimeKey BETWEEN Tkey1 AND Tkey2
Key Question:
How to compute star-joins efficiently?
• Secondary indexes on foreign keys of fact table (standard B-trees), see chapter 5 for details - intersect result lists
-retrieve tuples from fact table randomly
• Bitmaps
Prof. Bayer, DWH, Ch.7, SS2002 9
Bitmap Index Intersection
bitmap for organization
= „TM“
bitmap for region
= „Asia“
1...1.11 1.1...1.1. 1.1...1.1. ...1.1.... ..1.1...1.
11.1... 1.11...1 .1.1..1... 1.1.1... .1..1.1...
1... 1.1... ...1... ... ....1...
Page 1 Page 2 Page 3 Page 4 Page 5
result of bitmap intersection
accessed disk pages
(shaded)
34 % of
tuples
32 % of
tuples
10 % of
tuples
80 % of
pages
Problem: for small result sets of a few %, almost all pages of the facts table must be fetched from disk, if the hits in the result set are not clustered on disk.
Ex: with 8 KB pages 20 to 400 tuples per page,
i.e. at 0.25% to 5% hits in the result almost all pages must be fetched.
At least tuple clustering, preferably page clustering, are desirable, but how??
Goal: Code hierarchies in such a way, that for star- joins with the Fact table we have to join only with a query box on the Fact table
Prof. Bayer, DWH, Ch.7, SS2002 11
Basic Idea for Multidimensional Clustering
1L}
0.5L;
Juice Apple
1L;
OJ 0.7L;
OJ 0.33L;
0 {OJ
1
m
1L}
OJ 0.7L;
OJ 0.33L;
1 {OJ
1
m
0.5L}
{A-Juice
2 4 m
1L}
Juice Apple
0.5L Juice
Apple
1 {
2
m
0.33L}
2 {OJ
1
m m22 {OJ 0.7L} m32 {OJ1L} m 52 {A-Juice 1L}
Orange Juice Apple Juice
0,33L 0,7L 0,5L 1L
Product Category All Products
All
0 1
0 2 0 1
Level Label Member Ordinal (e.g.,1) Member Label (e.g., 0.7L) Legend:
Example Hierarchy in Member Set Representation AppleJuice
1 1L
Dimension D consists of
Value Set V = [[ v1, v2, ... vn ]]
Hierarchy H of height h consisting of h+1 hierarchy levels H = [[L0 , L1 ,..., Lh ]]
Level Li is a set of sets = [[m1i, ..., mji ]] with mki V
mki get names, e.g. „Orange Juice“ as label(m11), in general label(mki)
Constraint: every m i+1 must be a subset of some m i
Prof. Bayer, DWH, Ch.7, SS2002 13
Hierarchic Relationships
The children of mki are all those sets mli+1 of the lower level i+1 with the property:
mli+1 mki , formally:
children(mki ) := [[mli+1 Li+1 : mli+1 mki ]]
parent(mki ) := [[mli-1 Li-1 : mli-1 mki ]]
Principle: the children of m are numbered by the bijective function ordm starting at 1 or 0
Hierarchic Relationships
The children of mki are all those sets mli+1 of the lower level i+1 with the property:
mli+1 mki , formally:
children(mki ) := [[mli+1 Li+1 : mli+1 mki ]]
parent(mki ) := [[mli-1 Li-1 : mli-1 mki ]]
Principle: the children of m are numbered by the bijective
Prof. Bayer, DWH, Ch.7, SS2002 15
Enumeration and Surrogate Functions Let A be an enumeration type
A = [[ a0, a1, ... ak ]]
f : A --> (0, 1 ,..., k ) defined as f (ai ) = i
then i is called the surrogate of ai
Hierarchies and composite Surrogates
Basic Idea: concatenate the surogates of successive hierarchy levels (compound surrogates cs)
Note: the root ALL of the hierarchy is not encoded Def: compound surrogate cs for hierarchy H
ordm : children (m) --> [[0, 1, ..., |children(m)| -1]]
cs (H, mi) := ord father (mi) (mi) if i=1
:=cs (H, father ( mi)) ord father (mi) (mi) otherwise
Prof. Bayer, DWH, Ch.7, SS2002 17
Example:
REGION f(REGION)
South Europe 0
Middle Europe 1
Northern Europe 2
Western Europe 3
North America 4
Latin America 5
Asia 6
Australia 7
(a)
0
CUSTOMER
South Europe North America Asia
Retail
Wholesale Kana´s Sushi Bar
Joe‘sSports Bar
... ...
Bar
4 6
2
1
1 0
Retail Canada 1 USA
0
... ...
... ...
... ...
Australia
7
Wholesale
0
Prof. Bayer, DWH, Ch.7, SS2002 19
Example: the path
North America --> USA --> Retail --> Bar has the compound surrogate 4112
Next Idea: for every hierarchy level determine the
higest branching degree (plus a safety margin for future extensions) and code by fixed number of bits.
surrogates (H,i) := max [[ cardinality (children (H,m)) : m level (H, i-1) ]]
let l
i:= log
2surrogates (H,i)
then l
ibits are needed for the surrogates of level i
let be a path = m
0 m
1 m
2 ... m
hto a leaf m
hof hierarchy H:
Prof. Bayer, DWH, Ch.7, SS2002 21
cs (H,) = cs (H,m
h)
m
l l lhfather
m
ord
1 1 2
2 3
m
l l lhfather
m
ord
2 2 2
3 4
m
hfather
m
ord
h:
=... +
+
+
Example:
cs (H, Bar) = 100 001 1 010 = 538
l
1=3 l
2=3 l
3=1 l
4=3
number of bits needed at certain level
Prof. Bayer, DWH, Ch.7, SS2002 23
Properties of MHC Encoding
• very compact coding of fixed length
• lexicographic order of composite keys remains, i.e. isomorphic to integer ordering
• point restrictions on arbitrary hierarchy levels lead to interval restrictions on the compound surrogates
Example: path to USA is:
North America --> USA
4 = 1002 1 = 0012
leads to range on cs:
100 001 0 0002 to 100 001 1 1112 and to the decimal range:
528 to 543 or [528 : 543]
==> star join with restriction North America.USA leads to an interval restriction on the fact table
Prof. Bayer, DWH, Ch.7, SS2002 25
Complex Hierarchies
• time with months and weeks, both restrictions lead to intervals on the level of days
• Example of Fig. 4-4
• proposal for multiple hierarchies: choose the most useful (depending on the query profile) or consider multiple hierarchies as several independent
hierarchies. Caution, this increases the number of dimensions !!!
• Time variant hierarchies: extend by time interval of validity , see Example Fig. 4-5,
(a) (b) YEAR
MONTH WEEK
DAY
REGION
NATION
TRADE TYPE
CUSTOMER TYPE
CUSTOMER SIZE
CUSTOMER
Prof. Bayer, DWH, Ch.7, SS2002 27
CUSTOMER
South Europe North America ...
USA Canada
Retail Wholesale
Bar Restaurant
Joe ‘s Sports Bar
Year <= 1997 Year > 1997 Fig. 4-5 Change of a hierarchy over the time
Orange Juice
Prof. Bayer, DWH, Ch.7, SS2002 29
Apple
Juice
Processing a query box Asia
in sort order with the Tetris algorithm