Data Warehousing
& Data Mining
Wolf-Tilo Balke Silviu Homoceanu
Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
6. Optimization
6.1 Multidimensional storage 6.2 DW Optimization / Indexes
Tree based indexes Bitmap indexes Hash Based
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2
6. Optimization
• The basic storage structure is the multidimensional array
– Customized based upon
• The data e.g., sparse or dense
• Characteristics of the secondary memory e.g., block- or page- oriented
– Cube data cells are stored sequentially
• Multidimensional cubes are linearized
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 3
6.1 Multidimensional Storage
• Linearization
– For n dimensions D
1to D
nthere are |D
1| * |D
2| *…* |D
n| addressed cube cells
– E.g., 2D cube
•
|D
1| = 5, |D
2| = 4, cube cells = 20 – To access a measure in a cube we go
through dimensions
•
Sold Jackets in February are stored in cube cell D
1[4], D
2[3]
•
After linearization D
1[4], D
2[3] becomes array cell 14
–This represents (Index(D2) – 1) * |D1| + Index(D1) (2 full rows + theremaining 4 elements from the incomplete row) –Linearized Index = 2 * 5 + 4 = 14
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 4
6.1 Multidimensional Storage
1 1 3 6 47
2 2 53
7 8 89
6 4
9 11 11 16 1217
10 12 1313
15 18 1619
14 14 17 5 19 10 25 15 27 20
Jan (1) Feb(2) Feb(3) Feb(4) D1
D2
• General idea of linearization
– Considering a cube C=((D
1, D
2, …, D
n), (M
1:Type
1, M
2:Type
2, …, M
m:Type
m)), the index of a cube cell z with coordinates (x
1, x
2, …, x
n) can be linearized as follows:
• Index(z) = x
1+ (x
2- 1) * |D
1| + (x
3- 1) * |D
1| * |D
2| + … + (x
n- 1) * |D
1| * … * |D
n-1| =
= 1+ ∑
i=1
n
((x
i- 1) * ∏
j=1 i-1
|D
i|)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 5
6.1 Multidimensional Storage
• Problems in array-storage
– Influence of the order of the dimensions in the cube definition
• In the cube the cells of D
2are ordered one under the other, e.g., Pants A query of sales of all pants involves a column in the cube
• After linearization, the information is spread among more data blocks or pages
• If we consider a data block can hold 5 cells, a query over all products sold in January can be answered with just 1 block read, but a query of all sold pants, involves reading 4 blocks
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 6
6.1 Problems in Array-Storage
1 1 3 6 47
2 2 53
7 8 89
6 4
9 11 11 16 1217
10 12 1313
15 18 1619
14 14 17 5 19 10 25 15 27 20
Jan (1) Feb(2) Feb(3) Feb(4) D1
D2
• The problem of dimensions order can be diminished by using caching solutions
– Caching and swapping is performed also by the operating system
– MDBMS has to manage its caches such that the OS doesn’t perform any damaging swaps
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 7
6.1 Problems in Array-Storage
• Storage of dense cubes
– If cubes are dense, array storage is more efficient.
However, operations suffer due to the large cubes – The solution is to store dense cubes not linear but on
2 levels
• The first contains an indexes and the second the data cells stored in blocks
• Different optimization procedures like indexes (trees, bitmaps), physical partitioning, and compression (run- length-encoding) can be used
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 8
6.1 Problems in Array-Storage
• Storage of sparse cubes
– All the cells of a cube, including empty ones, have to be stored
– Sparseness leads to data being stored in many physical blocks or pages
•
The query speed is affected by the large number of block accesses on the secondary memory
– Solutions:
•Do not store empty blocks or pages: if there are large empty
portions of the array, they will not be physically stored, but the index structure will be adapted
•2 level data structure: upper layer holds all possible
combinations of the sparse dimensions, lower layer holds dense dimensions
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 9
6.1 Problems in Array-Storage
• 2 level cube storage
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 10
6.1 Problems in Array-Storage
Marketing campaign
Customer
Product
Time Geo
Dense low level Sparse upper level
• Indexes are used to optimize queries. OLAP queries have an aggregation role
– How many articles from product group washing devices were sold in 2008 for each month in each region
• Very big detailed data set (lots of sales in the sales fact table)
• Such aggregation (2008, region) queries on big data sets are costly: e.g., consider 100 GB of sales data stored in a star schema; for this query the whole set needs to be read…at an average speed of 40 MB/s it still takes 43 minutes only to read the data
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 11
6.2 Indexes
• Restriction role
– Besides aggregation, restrictions are the most time- consuming
– Typologies of restrictions based on query types
• Range query
–Restricted through intervals in each dimension
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 12
6.2 Remember OLAP Queries?
Product (Article)
Time (Days)
• Partial range query
– Some dimensions are not restricted – Geometrically described as a sub-space
• Partial match query
– Restricts more dimensions on a point, while other dimensions remain
unspecified
– Geometrically described as a hyper-level in the data set
• Point query
– Restricted to a point on all dimensions
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 13
6.2 Remember OLAP Queries?
Product (Article)
Time (Days)
Product (Article)
Time (Days)
Product (Article)
Time (Days)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 14
6.2 Optimization Procedures
R1 R2
MV1 Base-layer of
the data
Layer of materialization
Layer of partitioning
Layer of Index struct.
MV2
P1 P2 P3
B*-Baum kB-Baum
R*-Baum UB-Baum
Logical access paths
Physical access paths
• Why index?
– Consider a 10 GB table; at 10 MB/s read speed we need 17 minutes for a full table scan
– Consider an OLAP query: the number of Bosch S500 washing machines sold in Braunschweig last month?
• Applying restrictions (product, location) the selectivity would be strongly reduced
–If we have 30 location, 10000 products and 24 months in the DW, the selectivity is
1/30 * 1/ 10000 * 1/24 = 0,00000014
– So…we read 10 GB for 1,4KB of data
…not very smart
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 15
6.2 Index Structures
• Reduce the size of read pages to a minimum with indexes
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16
6.2 Index Structures
Product (Article)
Time (Days) Full table scan
Product (Article)
Time (Days) Cluster primary
index
Product (Article)
Time (Days) More secondary Indexes, bitmap indexes
Product (Article)
Time (Days) Optimal multi- dimensional index
Scanned data Selected data
• Types of queries
– k-nearest-neighbor-Search (k-NN-Search)
• Find the first K objects with the smallest distance to the query object
• Such a query is usually performed with approximations (with an error rate)
– Reverse-nearest-neighbor-Search
• Find all the objects whose nearest neighbor is the queried object
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 17
6.2 Index Structures
• Curse of Dimensionality (Richard Bellman) – The volume increases exponentially with the
dimensions number
– More dimensions mean more comparisons will be performed
• At the present time there is no real scalable indexing
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 18
6.2 Index Structures
• Classification criteria for index structures – Clustering
•
Data with high probability of being used together – Dimensionality
•
Refers to the number of attributes used to calculate the index key
– Symmetry
•
The order of the index attribute is not performance relevant – Tuple identifier (TID)
•
TID s are position numbers pointing to the physical storage place of the corresponding data
– Dynamical behavior
•
Effort needed for dynamical modifications can strongly vary from index to index
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 19
6.2 Index Structures
• In the beginning…there were B-Trees
– Data structures for storing sorted data with amortized run times for insertion and deletion
– Basic structure of a node
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 20
6.2 Trees
…
Tree Node
Node Pointers Key Value Data Pointer
• B-Trees as primary indexes in OLTP
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 21
6.2 Trees
1, Adams, $ 887,00 2, Bertram, $19,99 3, Behaim, $ 167,00 4, Cesar, $ 1866,00 5, Miller, $179,99 6, Naders, $ 682,56 7, Ruth, $ 8642,78 8, Smith, $675,99 9, Tarrens, $ 99,00
2 6 7
1 3 4 5 8 9
• What’s wrong with B-Trees?
– K-D-B trees are useful for point data only
• Exact-point lookup!
• Not good for storing geometrical data and multi- dimensional data
– The R-Tree provided a way to do that (thanks to Guttman ‘84)
• R-tree represents data objects in intervals in several dimensions
–Exact-point and range lookups!
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 22
6.2 B-Trees
• Demands
– Can store d-dimensional hypercubes
– Performs point, line, and box queries as fast as possible...
– ...but also keeps memory usage in check!
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 23
6.2 R-Trees
• R-Trees are recommended for lower dimensionality
– Up to 10 dimensions
• More scalable variants:
– R
+-Trees, R
*-Trees und X-Trees – Each up to 20 dimensions
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 24
6.2 R-Trees
• There is no total ordering of objects in the multidimensional space that preserves spatial proximity
– E.g. in time dimension it makes sense to keep data belonging to consecutive quarters clustered together
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 25
6.2 R-Trees
• R-Trees
– Can organize any-dimensional data
• Representing the data by a minimum bounding rectangle (MBR)
– Each node bounds it’s children – A node can have many objects in it
• E.g.,
–node capacity of 3 (in praxis node capacity is of 100s)
–2 dimensions
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 26
6.2 R-Trees
• The leaves point to the actual objects (stored on disk probably)
• The height is always log n (it is height balanced)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 27
6.2 R-Trees
R1 R2
R3 R4 R5 R6
R1
R4 R3
R5
R6 R2
R3.1 R3.2 R3.3 R4.1 R4.2 R4.3
R4.1
…
Points to data tuples
leafs Non-leafs
Root
• 2 types of nodes
– Non-leaf nodes, contain entries of the form (I, P)
• I represents a vector I=(I
1, I
2, …, I
n) describing the boundaries of the element on the n dimensions, e.g., time and location, for R1 we have I=([08 Qtr1, 09 Qtr1], [a, b])
• P represents a pointer (e.g., disk page address) to the node with its
children
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 28
6.2 R-Trees
R1 R2
R3 R4
R1
R4 R3
R5
R6 R2
08 Qtr1 08 Qtr2 08 Qtr3 08 Qtr4 09 Qtr1 Time
Location
a b c d e f g
• Leaf nodes, contain entries of the form (I, RID) – I same as for non-leaf
– RID represents an unique tuple identifier
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 29
6.2 R-Trees
R3.1 R3.2 R3.3
Points to data tuples
• Operations – Let:
• E
Ibe the rectangle part of an index entry E
• E
pbe the tupel-identifier or pointer
• S be the search rectangle
–E.g., ([08 Qtr3, 09 Qtr1], [a, b])• T be the root of the R-Tree – Search:
• Start from the root node
• If multiple sub-trees contain the point of interest then follow all
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 30
6.2 R-Trees - Search
• Search (T, S) – If T is not a leaf
•
Check each entry E to determine whether E
Ioverlaps S
•
For all overlapping entries, invoke Search(E
p, S) – If T is a leaf
•
Check all entries E to determine whether E
Ioverlaps S
–If so, E is a qualifying record• No good performance guarantees
– In worst case all paths must be searched (due to overlapping)
• Search algorithms try to cut out irrelevant regions („Pruning“)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 31
6.2 R-Trees - Search
• Search on sales on last 2 quarters of 08, locations e and f
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 32
6.2 R-Trees - Search
R1
R3 R4
R3.1 R3.2 R3.3 R4.1 R4.2 R4.3
R1
R4 R3
R5
R6 R2
08 Qtr1 08 Qtr2 08 Qtr3 08 Qtr4 09 Qtr1 Time
a b c d e f g
R5.3 R6.2 R6.3
• Insert, general idea
– New index records are added to the leaves!
• Nodes that overflow are split
• Splits propagate up the tree
• Node splitting is not trivial
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 33
6.2 R-Trees - Insert
• Insert (T, E)
– Find position for new record
•
Invoke ChooseLeaf to select a leaf node L in which to place E – Add record to leaf node:
•
If L has room for E then insert E and return
•
Otherwise, invoke SplitNode to obtain L and LL containing E and all the old entries of L
– Propagate changes upwards
•
Invoke AdjustTree on L, also passing LL if a split was performed – Grow tree taller
•
If node split propagation caused the root to split, create a new root whose children are the two resulting nodes.
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 34
6.2 R-Trees - Insert
• ChooseLeaf(E) – Initialize
•
Set N to be the root node – Leaf check
•
If N is a leaf, return N – Choose sub-tree
•
Let F be the entry in N whose rectangle F
Ineeds least
enlargementto include E
•Resolve ties
by choosing the entry with the rectangle of
smallest area– Descend until a leaf is reached
•
Set N to be the child node pointed to by F
pand repeat from Leaf check
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 35
6.2 R-Trees
• Node splitting
– A full node contains M entries
– Divide the collection of M+1 entries between 2 nodes.
– Objective: Make it as unlikely as possible for the resulting two new nodes to be examined on subsequent searches.
– Heuristic: The total area of two covering rectangles after a split should be minimized
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 36
6.2 R-Trees - Insert
• Node splitting
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 37
6.2 R-Trees - Insert
Bad split Good split
• Node splitting methods – Exhaustive algorithm
• Generate all possible groups and choose the best with minimum area
• Number of possibilities ~ 2
M-1• For M ~ 50 Number of possibilities ~ 600 Trillion
–Which is even more than theObama administration spent with the crisis!!!
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 38
6.2 R-Trees - Insert
• Quadratic-cost algorithm – A heuristic to find a small-area split
– Cost is quadratic in M and linear in the number of dimensions
– Pick two of the M+1 entries to be the first elements of the two new groups
• Calculate the MBR for each pair, and choose the one with the largest MBR
• These 2 objects are the new starting points for the resulting 2 nodes
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 39
6.2 R-Trees - Insert
• Quadratic-cost algorithm
– Assign remaining entries to groups one at a time
• Calculate d1 as the necessary volume difference to include the current entry in MBR1 and d2 for MBR2
• Calculate d1 and d2 for all the remaining entries
• Insert the entry with the highest preference into the corresponding node
–Preference = max(d1, d2) – min(d1, d2)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 40
6.2 R-Trees - Insert
• Linear cost algorithm
– Identical to Quadratic with the following differences:
• Uses a linear procedure to identify the starting entries
–Find in each dimension 2 rectangles»the rectangle with the highest minimum coordinates
»and the rectangle with the lowest maximum coordinates
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 41
6.2 R-Trees - Insert
C
A D
B E
On X:
highest minimum coordinates -> E lowest maximum coordinates -> A On Y:
highest minimum coordinates -> D lowest maximum coordinates -> C
– Calculate the difference on the corresponding dimension between the coordinates of corresponding rectangles, and normalize it by the maximum on that dimension
– The two starting points are the two entries with the highest normalized difference
– Order the next entries so that the volume growth is the smallest from one step to another
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 42
6.2 R-Trees - Insert
C
A D
B E
Dx = (Ex – Ax)/max(x) Dy = (Dy - Cy)/max(y)
If (Dx > Dy) then choose E and A as starting points Else choose D and C
• Quadratic vs. Linear cost algorithm – Quadratic:
• Choose two objects that create as much empty space as possible
– Linear:
• Choose two objects that are furthest apart – Linear node-split is simple, fast, and as good as
quadratic!
– Quality of the splits is slightly worse!
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 43
6.2 R-Trees - Insert
• Delete entry
– Start a normal search of the entry to delete (FindLeaf) – Delete the record from the leaf (DeleteRecord) – Condense Tree if needed (if there are now nodes
which only have few entries)
• At condensation the node to be condensed is deleted as a whole and the entries which should remain are then inserted
• If the root has just one child, it will be the new root
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 44
6.2 R-Trees - Delete
• Update
– If the datasets are updated the existent rectangles can be changed
– In this case the index entry must be deleted updated and inserted
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 45
6.2 R-Trees - Update
• Let
– M = Maximum number of entries in a node.
– m <= M/2
– N = Number of records
• Properties
– Every leaf node contains between m and M index records
•
Root node is the exception
– For each index record (I, RID) in a leaf node, I is the smallest rectangle that spatially contains the n dimensional data object represented in the indicated tuple
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 46
6.2 R-Trees
– Every non-leaf node has between m and M children
• Root node is the exception
– For each entry (I, P) in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node
– The root node has at least two children unless it is a leaf
– All leaves appear on the same level – Height of a tree = ceiling(log
mN)-1
– Worst case utilization for all nodes except the root is m/M
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 47
6.2 R-Trees
• Advantages
– Efficient for non-point queries – No downward cascading splits – Guaranteed utilization
• Disadvantages
– Dimension dependent fan-out
– Overlapping regions - search performance problem
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 48
6.2 R-Trees
• R
+-Trees enhances retrieval performance by avoiding visiting multiple paths
when searching for point queries – No overlap for MBRs at the same
level (internal nodes) – Specific object’s entry might be
duplicated
– Insertions might lead to a series of update operations in a chain-reaction
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 49
6.2 R-Tree Variations
• Compared to R-trees,
– Nodes are not guaranteed to be at least half filled – The entries of any internal node do not overlap – An object ID may be stored in more than one leaf
node
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 50
6.2 R-Tree Variations
A
B
C A, B and C do not overlap
• R
+-Trees – Advantages
• Because nodes are not overlapped with each other, point query performance benefits, e.g., a single path is followed and fewer nodes are visited than with the R-tree
– Disadvantages
• Since rectangles are duplicated, an R
+-Tree can be larger than an R-Tree built on same data set
• Construction and maintenance of R+ trees is more complex than the construction and maintenance of R-Trees
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 51
6.2 R-Tree Variations
• R
*-Trees
– Node split is more sophisticated
• When a node overflows, p entries are extracted and reinserted in the tree (p might be 25%)
– Considers minimization of:
• Overlapping between minimum bounding rectangles at the same level
• Perimeter of the produced minimum bounding rectangles – Insertion is more expensive while retrievals are faster
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 52
6.2 R-Tree Variations
• Combination of B
*-Tree and Z-curve = Universal B-Tree (UB-tree)
• Z-curve is used to map multidimensional points to one- dimensional values (Z-values)
• Z-values are used as keys in B
*-Tree
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 53
6.2 UB-Trees
Data part
8 17
8 17 3939 5151
28 28 Index part
• Concept of Z-Regions
– To create a disjunctive partitioning of the multidimensional space
– This allows for very efficient processing of multidimensional range queries
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 54
6.2 UB-Trees
1
3 4
2 5
7 8
6
9
11 12
10 13
15 16
14 17 19 20
18 21
23 24
22
25 27 28
26 29
31 32
30
33
35 36
34 37
39 40
38
41
43 44
42 45
47 48
46 49 51 52
50 53
55 56
54
57 59 60
58 61
63 64
62 0
1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
• Z-Regions
– The space covered by an interval on the Z-Curve – Defined by two Z-Addresses a and b
• We call b the region address of [a : b]
– Each Z-Region maps exactly onto one page on secondary storage
• I.e., to one leaf page of the B
*-Tree – E.g., of Z-Regions
• [1:9], [10, 18], [19, 28]…
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 55
6.2 UB-Trees
1
3 4
2 5
7 8
6
9
11 12
10 13
15 16 14 17
19 20
18 21
23 24
22
25
27 28
26 29
31 32
30
33
35 36
34 37
39 40 38
41
43 44
42 45
47 48 46 49
51 52
50 53
55 56
54
57
59 60
58 61
63 64
62
• Z-Value address representation
– Calculated through bit interleaving of the coordinates of the tuple
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 56
6.2 UB-Trees
1
3 4
2 5
7 8
6
9
11 12
10 13
15 16
14 17
19 20
18 21
23 24 22
25
27 28
26 29
31 32 30
33
35 36
34 37
39 40
38
41
43 44
42 45
47 48
46 49
51 52
50 53
55 56 54
57
59 60
58 61
63 64 62 0
1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
Tuple = 51, x = 4, y = 5
y
x X = 4 = 100
Y = 5 = 101
Z-value = 110010
• Why Z-Values?
– With Z-Values we reduce the dimensionality of the data to one dimension
– Z-Values are then used as keys in B
*-trees
• Using B
*-Trees results in high node filling degree (at least 50%)
• Logarithmical complexity at search, insert and delete
–Guaranteed maximum node-accesses to locate a key is– Z-Values are very important for range queries!
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 57
6.2 UB-Trees
log݂ܽ݊ −ݑݐ(݊ + 1 2 )ඈ
• Range queries (RQ) in UB-Trees
– Each query can be specified by 2 coordinates
• q
a(the upper left corner of the query rectangle)
• q
b(the lower right corner of the query rectangle) – RQ-algorithm
1. Starts with q
aand calculates its Z-Region
1. Z-Region of qais [10:18]
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 58
6.2 UB-Trees
1
3 4
2 5
7 8
6
9 11 12
10 13
15 16 14 17
19 20
18 21
23 24
22
25
27 28
26 29
31 32
30
33 35 36
34 37
39 40 38
41 43 44
42 45
47 48 46 49
51 52
50 53
55 56
54
57
59 60
58 61
63 64
62
• Range queries (RQ) in UB-Trees
2. The corresponding page is loaded and filtered with the query predicate
1. Tupels 15 and 16 fulfill the predicate
3. The next region (inside the query rectangle) on the Z- curve is calculated
1. The next jump point on the Z-curve is 27
4. Repeat steps 2 and 3 until the
end-address of the last filtered region is bigger than q
bData Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 59
6.2 UB-Trees
1
3 4
2 5
7 8
6
9
11 12
10 13
15 16
14 17
19 20
18 21
23 24
22
25
27 28
26 29
31 32
30
33
35 36
34 37
39 40
38
41
43 44
42 45
47 48
46 49
51 52
50 53
55 56
54
57
59 60
58 61
63 64
62
• The critical part of the algorithm is calculating the jump point on the Z-curve which is inside the query rectangle
– If this takes too long it eliminates the advantage obtained through optimized disk access – How is the jump point optimally calculated?
• From 3 points: q
a, q
band the current Z-Region
• By performing bit operations
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 60
6.2 UB-Trees
• Range Queries, UB-Trees and DW – In DW we have hierarchical organization of
dimensions
– No intervals for hierarchical restrictions
– Naive restrictions lead to many point queries instead of one interval on UB-Tree
– This is why we need Multidimensional Hierarchical Clustering (MHC)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 61
6.2 UB-Trees
• With MHC UB-Trees can:
– Artificial encode hierarchies:
• Mapping of hierarchy restrictions to range restrictions
• Mapping is used for physical clustering of the fact table – Increase computation and space efficiency
• However, modification of query algorithms is necessary
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 62
6.2 MHC
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 63
6.2 MHC
Item Product Group Category Sector
Video Audio
Camcorder VCR
TR-780 TRV-30 GR-AX 200 GV-500 SLV-E800 Brown Goods White Goods
ALL
...
...
ID 2 11 5 8 21
Item Product Group Category Sector
Video
Audio Video
Audio
Camcorder VCR
TR-780 TRV-30
TR-780 TRV-30 GR-AX 200 GV-500 SLV-E800 Brown Goods White Goods
Brown Goods White Goods ALL
...
...
ID 2 11 5 8 21
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 64
6.2 MHC
...
Item Product Group Category Sector
Video Audio
Camcorder VCR
TR-780 TRV-30 GR-AX 200 GV-500 SLV-E800 Brown Goods White Goods
ALL
...
...
ID 2 11 5 8 21
0 1
00 01
0 1
Surrogate 00100000
0000 0001 0000 0001 0010
00100001 00110000 00110001 00110010
32 33 48 49 50
...
Item Product Group Category Sector
Video
Audio Video
Audio
Camcorder VCR
TR-780 TRV-30
TR-780 TRV-30 GR-AX 200 GV-500 SLV-E800 Brown Goods White Goods
Brown Goods White Goods ALL
...
...
ID 2 11 5 8 21
0 1
00 01
0 1
Surrogate 00100000
0000 0001 0000 0001 0010
00100001 00110000 00110001 00110010
32 33 48 49 50
• Bitmap Indexes
– Lets assume a relation Expenses with three attributes: Nr, Shop and Sum
– A bitmap index for attribute Shop looks like this
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 65
6.2 Bitmap Indexes
Nr Shop Sum
1 Saturn 150
2 Real 65
3 P&C 160
4 Real 45
5 Saturn 350
6 Real 80
Value Vector P&C 001000
Real 010101
Saturn 100010
• A bitmap index for an attribute of relation is:
– A collection of bit-vectors
– The number of bit-vectors represents the number of distinct values of the attribute in the relation – The length of each bit-vector is called the
cardinality of the relation
– The bit-vector for value v has 1 in position i, if the i
threcord has v in attribute A, and it has 0 otherwise
• Records are allocated permanent numbers
• There is a mapping between record numbers and record addresses
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 66
6.2 Bitmap Indexes
• Advantages
– Very efficient when used for partial match queries – They offer the advantage of buckets
• In our example each index vector is a bucket
–E.g., the Saturn bitmap vector is a bucketof 2, telling us that records having value Saturn in attribute Shop are first and 5threcord in the table
– They can also help answer range queries
– Efficient hardware support for bitmap operations (AND, OR, XOR, NOT)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 67
6.2 Bitmap Indexes
Value Vector P&C 001000
Real 010101
Saturn 100010
• Assume:
– There are n records in the table
– Attribute A has m distinct values in the table
• The size of a bitmap index on attribute A is m*n
• Significant number of 0’s is m is big, and of 1’s if m is small
– Opportunity to compress
• Run Length Encoding (RLE)
• Gzip (Lempel-Ziv, LZ)
• Byte-Aligned Bitmap Compression (BBC): variable byte length encoding (Oracle patent)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 68
6.2 Bitmap Indexes
• Handling modification
– Assume record numbers are not changed
– Deletion
• Tombstone replaces deleted record
• Corresponding bit is set to 0
• E.g. delete the 5
threcord
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 69
6.2 Bitmap Indexes
Nr Shop Sum
1 Saturn 150
2 Real 65
3 P&C 160
4 Real 45
5 Saturn 350
6 Real 80
Value Vector P&C 001000
Real 010101
Saturn 100010
Value Vector P&C 001000
Real 010101
Saturn 100000
Before After
• Insertion record is assigned the next record number
– A bit of value 0 or 1 is appended to each bit vector
– If new record contains a new value of the attribute, add one bit-vector
• E.g., insert new record with REWE as shop
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 70
6.2 Bitmap Indexes
Nr Shop Sum
1 Saturn 150
2 Real 65
3 P&C 160
4 Real 45
5 Saturn 350
6 Real 80
7 REWE 23
Value Vector P&C 001000
Real 010101
Saturn 100010
Value Vector P&C 0010000
Real 0101010
Saturn 1000100 REWE 0000001
Before After
• Modification
– Change the bit corresponding to the old value of the modified record to 0 – Change the bit corresponding to
the new value of the modified record to 1
– If the new value is a new value of A, then insert a new bit-vector: e.g., replace Shop for record 2 to REWE
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 71
6.2 Bitmap Indexes
Nr Shop Sum 1 Saturn 150
2 REWE 65
3 P&C 160
4 Real 45
5 Saturn 350
6 Real 80
Value Vector P&C 001000
Real 010101
Saturn 100010
Value Vector P&C 001000
Real 000101
Saturn 100010
REWE 010000
Before
After
• Select
– Basic AND, OR bit operations:
•
E.g., select the sums we have spent in Saturn and P&C
– Bitmap indexes should be used when selectivity is high
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 72
6.2 Bitmap Indexes
Nr Shop Sum 1 Saturn 150
2 Real 65
3 P&C 160
4 Real 45
5 Saturn 350
6 Real 80
Value Vector P&C 001000 Real 010101 Saturn 100010 Saturn OR P&C = Result
1 0 1
0 0 0
0 1 1
0 0 0
1 0 1
0 0 0
• Advantages
– Operations are efficient and easy to implement (directly supported by hardware)
• Disadvantages
– For each new value of an attribute a new bitmap-vector is introduced
•
If we bitmap index an attribute like birthday (only day) we have 365 vectors: 365/8 bits ≈ 46 Bytes for a record, just for that
•
Solution to such problems is multi-component bitmaps – Not fit for range queries where many bitmap vectors have
to be read
•
Solution: range-encoded bitmap indexes
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 73
6.2 Bitmap indexes
• Multi-component bitmap indexes – Encoding using a different numeration system
• E.g., for the month attribute, between 0 and 11 values can be encoded as x = 4 *y+z, where 0 ≤ y ≤2, and 0 ≤z ≤3, called <3,4> basis encoding
• 9 = 4*2+1
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 74
6.2 Bitmap indexes
Month Dec Nov Oct Sep Aug Jul Jun Mai Apr Mar Feb Jan M A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0
9 0 0 1 0 0 0 0 0 0 0 0 0
X Y Z
M A2,1 A1,1 A0,1 A3,0 A2,0 A1,0 A0,0
9 1 0 0 0 0 1 0
• Advantage of multi-component bitmap indexes
– If we have 100 (0..99) different Days to index we can use a multi-component bitmap index with basis of
<10,10>
– The storage is reduced from 100 to 20 bitmap-vectors (10 for y and 10 for z)
– The read-access for a point (1 day out of 100) query needs however 2 read operations instead of just 1
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 75
6.2 Multi-component bitmap indexes
• Range-encoded bitmap indexes
– Idea: set the bits of all bitmap vectors to 1 if they are higher or equal to the given value
– Query people born between March and August
•
For normal encoded bitmap indexes read 6 vectors, for range- encoded indexes, we can solve the query with just 2 vectors read:
((NOT A2) AND A7)Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 76
6.2 Bitmap indexes
Month Dec Nov Oct Sep Aug Jul Jun Mai Apr Mar Feb Jan M A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0
0 1 1 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 0 0 0
5 1 1 1 1 1 1 1 0 0 0 0 0
11 1 0 0 0 0 0 0 0 0 0 0 0
• If the query is limited only on one side, (e.g., persons born in or after March), 1vector is enough (NOT A
1)
• For point queries, 2 vector reads are however necessary!
– E.g., persons born in March: ((NOT A
1) AND A
2)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 77
6.2 Range-encoded bitmap indexes
• Combine multi-component bitmap indexes with range-encoding bitmap indexes and we have multi-component-range-encoding bitmap indexes
• Interval-encoded bitmap indexes – Each bitmap-vector represents an interval – It also needs to read at most 2 vectors, but the
storage is half, compared to range-encoded bitmap indexes
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 78
6.2 Bitmap indexes flavors
• Partitions the range of key values for each key into several buckets
• Dynamic structure using a grid directory – Grid array: a 2 dimensional array with pointers to
buckets (this array can be large, disk resident) G(0,…, n
x-1, 0, …, n
y-1)
– Linear scales: two 1 dimensional arrays that used to access the grid array (main memory) X(0, …, n
x-1), Y(0, …, n
y-1)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 79
6.2 Grid Files
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 80
6.2 Grid Files
X Y
0 6 25 32
A F K O Z
• Properties
– Supports multi-dimensional data, but not high number of dimension
– Every key is treated as primary key
– The index structure adapts itself dynamically to maintain storage efficiency
– Guarantee two disk accesses for point queries – Values of key must be in linearly-ordered domain
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 81
6.2 Grid Files
• Exact Match Search: at most 2 I/Os assuming linear scales fit in memory
– First use liner scales to determine the index into the cell directory
– Access the cell directory to retrieve the bucket address (may cause 1 I/O if cell directory does not fit in memory) – Access the appropriate bucket (1 I/O)
• Range Queries:
– Use linear scales to determine the index into the cell directory
– Access the cell directory to retrieve the bucket addresses of buckets to visit
– Access the buckets
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 82
6.2 Grid Files: Queries
• E.g., Find 5<X<9 AND “Mat”<Y<“Robot”
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 83
6.2 Grid Files: Range queries
X Y
0 6 25 32
A F K O Z
• Determine the bucket into which insertion must occur
– If space in bucket, insert – Else, split bucket
– If bucket split causes a cell directory to split do so and adjust linear scales
• Insertion of these new entries potentially requires a complete reorganization of the cell directory… expensive!!!
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 84
6.2 Grid Files: Insert
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 85
6.2 Grid Files: Insert
a) b)
c) d)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 86
6.2 Grid Files: Insert
d) e)
f) g)
• Delete the data node, and – Merge data pages/blocks if
possible
– Merge directory pages if possible
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 87
6.2 Grid Files: Delete
a)
b) c)
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 88
6.2 Grid Files: Delete
c) d)
e) f)
• Optimization – Partitioning – Joins
– Materialized Views
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 89