
Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de


Academic year: 2021


Multimedia Databases

Wolf-Tilo Balke Silviu Homoceanu

Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

13 Indexes for Multimedia Data

• 13.1 R-Trees
• 13.2 M-Trees

Multimedia Databases – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

13.0 Indexes for Multimedia Data

• Multimedia databases
– Images
– Audio data
– Video data
• Description of multimedia objects
– Usually (multidimensional) real-valued feature vectors
– But also: skeletons, chain codes, ...
• Sequential search for similar objects in a database is very inefficient
• How can we speed up the search?

13.0 Indexes for Multimedia Data

• Speed up search through indexing
– Efficient management of multidimensional information
• Pre-structuring of the data for the subsequent search
• Efficient data structures, combined with search and comparison algorithms
– Transition from set semantics to list semantics
• To what degree does an object from the database satisfy the query?

13.0 Indexes for Multimedia Data

• Requirements for a multidimensional index structure
– Correctness and completeness of the corresponding indexing algorithms
– Scalability with growing dimensionality
– Support for objects that are not real-valued vectors
– Search efficiency (sublinear)

13.0 Indexes for Multimedia Data

– Different types of queries:
• Exact search
• Area search
• Nearest-neighbor search
• ...
– Efficient update operations
– Support for various distance functions
– Low memory requirements

13.0 Indexes for Multimedia Data

• Fundamental problem:
– The more dimensions, the more comparisons are needed
– There is currently no truly scalable indexing
– Cause: the “Curse of Dimensionality” (Richard Bellman)
• The volume of the space grows exponentially with the number of its dimensions
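The exponential growth of volume can be made concrete with a small calculation. The sketch below computes how long the edge of a hypercube must be to cover a fixed fraction of the unit cube. The function name `edge_for_fraction` and the 10% fraction are illustrative choices, not part of the lecture.

```python
def edge_for_fraction(fraction: float, dims: int) -> float:
    """Edge length of a hypercube covering `fraction` of the unit cube's volume."""
    # A cube with edge e has volume e**dims, so e = fraction**(1/dims).
    return fraction ** (1.0 / dims)


if __name__ == "__main__":
    for dims in (1, 2, 10, 100):
        print(dims, round(edge_for_fraction(0.10, dims), 3))
```

In 10 dimensions, a region holding only 10% of the volume already spans roughly 79% of every axis, which is why region-based pruning loses its effect as the dimensionality grows.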

13.0 Query Types

• Exact search
– Point search
– Area search
• k-nearest-neighbor search (k-NN search)
– Find the k objects that have the smallest distance to the reference object given in the query
– Due to its high cost, k-NN search is usually computed only approximately (with a specified error)
• Reverse-nearest-neighbor search
– Find all objects whose nearest neighbor is the object given in the query
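As a baseline for the index structures that follow, exact k-NN search can be done by a sequential scan. The following is a minimal sketch that assumes Euclidean distance on feature vectors. The name `knn_scan` is hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def knn_scan(query: Vector, objects: List[Vector], k: int) -> List[Tuple[float, Vector]]:
    """Return the k objects closest to `query` as (distance, object), nearest first."""
    # One distance computation per stored object: the cost is linear in the database size.
    scored = ((math.dist(query, obj), obj) for obj in objects)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    database = [(0, 0), (1, 1), (5, 5), (2, 2)]
    for distance, obj in knn_scan((0.9, 0.9), database, k=2):
        print(obj, round(distance, 3))
```

Every query touches every object, which is exactly the cost an index tries to avoid.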

13.0 Tree Structures

• Search in database systems
– B-tree structures allow exact search with logarithmic cost

[Figure: example B-tree over the keys 1 to 9; the inner node holds 2, 6, 7 and the leaves hold 1, 3, 4, 5, 8, 9]

13.0 Tree Structures

• Search in multimedia databases
– The data is multidimensional, but B-trees support only one-dimensional search
• Is it possible to extend tree structures to multidimensional data?

13.0 Tree Structures

• The basic idea of multidimensional trees
– Describe sets of points through geometric regions that contain the points (clusters)
– The search considers the clusters, not the individual points
– Clusters can contain each other, resulting in a hierarchical structure

13.0 Tree Structures

• Differentiating criteria for tree structures:
– Cluster construction:
• Completely fragmenting the space, or
• Grouping data locally
– Cluster overlap:
• Overlapping, or
• Disjoint
– Balance:
• Balanced, or
• Unbalanced

13.0 Tree Structures

– Object storage:
• Objects in leaves and internal nodes, or
• Objects only in the leaves
– Geometry:
• Hyper-spheres,
• Hyper-cubes,
• ...

13.1 R-Trees

• The R-tree (Guttman, 1984) is the prototype of a multidimensional extension of the classical B-tree
• Frequently used for low-dimensional applications (up to about 10 dimensions), such as geographic information systems
• More scalable versions: R+-trees, R*-trees and X-trees (each up to about 20 dimensions for uniformly distributed data)

13.1 R-Tree Structure

• Dynamic index structure (insert, update and delete are possible)
• Data structure
– Data pages are leaf nodes and store clustered point data and data objects
– Directory pages are the internal nodes and store directory entries
– Multidimensional data are structured with the help of Minimum Bounding Rectangles (MBRs)
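A minimal sketch of the MBR as a data structure, assuming a rectangle is stored as a pair of corner tuples (lower, upper). The names `mbr_of_points` and `intersects` are illustrative and not taken from the lecture.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


def mbr_of_points(points: List[Point]) -> Rect:
    """Smallest axis-parallel rectangle that contains all the points."""
    lows = tuple(min(coords) for coords in zip(*points))
    highs = tuple(max(coords) for coords in zip(*points))
    return lows, highs


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one point."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]))


if __name__ == "__main__":
    print(mbr_of_points([(1, 5), (3, 2), (2, 4)]))
```

Storing only two corners per region keeps directory entries small, and the intersection test costs one comparison pair per dimension.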

13.1 R-Tree Example

[Figure: example R-tree. The root holds the entries R1, R2 and R3; the leaf-level MBRs are R4 to R11. Three points O, P and Q are marked.]

13.1 R-Tree Characteristics

• Local grouping for clustering
• Overlapping clusters (the more the clusters overlap, the less efficient the index)
• Height-balanced tree structure (therefore all the children of a node have about the same number of successors)
• Objects are stored only in the leaves
– Internal nodes are used for navigation
• MBRs are used as the geometry

13.1 R-Tree Properties

• The root has at least two children (unless it is a leaf)
• Each internal node has between m and M children
• M and m ≤ M / 2 are predefined parameters
• For each entry (I, child-pointer) in an internal node, I is the smallest rectangle that contains the rectangles of the child node

13.1 R-Tree Properties

• For each index entry (I, tuple-id) in a leaf, I is the smallest bounding rectangle that contains the data object (with the ID tuple-id)
• All leaves in the tree are on the same level
• All leaves have between m and M index records

13.1 Operations of R-Trees

• The essential operations for using and managing an R-tree are
– Search
– Insert
– Update
– Delete
– Splitting

13.1 Searching in R-Trees

• The tree is searched recursively from the root to the leaves
– One path is selected
– If the requested record has not been found in that subtree, the next path is traversed
• The path selection is arbitrary

13.1 Searching in R-Trees

• No guarantee for good performance
• In the worst case, all paths must be traversed (due to overlaps of the MBRs)
• Search algorithms try to exclude as many irrelevant regions as possible (“pruning”)

13.1 Search Algorithm

• All index entries that intersect the search rectangle S are traversed
– The search in internal nodes
• Check each entry for intersection with S
• For all intersecting entries, continue the search in their children
– The search in leaf nodes
• Check all entries to determine whether they intersect S
• Add all matching objects to the result set
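The search procedure can be sketched as a short recursion. The `Node` class and its field names below are assumptions made for illustration; a real R-tree stores its nodes as disk pages.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


def intersects(a: Rect, b: Rect) -> bool:
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


@dataclass
class Node:
    is_leaf: bool
    # Leaf entries: (MBR, tuple_id); internal entries: (MBR, child Node).
    entries: List[tuple] = field(default_factory=list)


def range_search(node: Node, s: Rect, result: List[object]) -> List[object]:
    """Collect the ids of all objects whose MBR intersects the search rectangle s."""
    for mbr, payload in node.entries:
        if not intersects(mbr, s):
            continue  # pruned: nothing below this entry can intersect s
        if node.is_leaf:
            result.append(payload)
        else:
            range_search(payload, s, result)
    return result


if __name__ == "__main__":
    leaf1 = Node(True, [(((0, 0), (1, 1)), "a"), (((2, 2), (3, 3)), "b")])
    leaf2 = Node(True, [(((8, 8), (9, 9)), "c")])
    root = Node(False, [(((0, 0), (3, 3)), leaf1), (((8, 8), (9, 9)), leaf2)])
    print(range_search(root, ((0.5, 0.5), (2.5, 2.5)), []))
```

Because MBRs may overlap, the loop can descend into several children of the same node; this is where the worst case of traversing all paths comes from.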

13.1 Example

• Check only 7 nodes instead of 12

[Figure: range search with rectangle S in the example R-tree; the branches marked X are pruned, and all the objects in node R8 are checked]

13.1 Insert

• Procedure
– The best leaf page is chosen (ChooseLeaf) according to spatial criteria
• Best leaf: the leaf that needs the smallest volume growth to include the new object
– The object is inserted there if there is enough room (number of objects in the node < M)

13.1 Insert

– If there is no room left in the node, this is an overflow and the node is divided (SplitNode)
• The goal of the split is minimal overlap and as little dead space as possible
– The interval of the parent node must be adapted to the new object (AdjustTree)
– If the split propagates up to the root, a new root is created whose children are the two nodes resulting from splitting the old root

13.1 R-Tree Insert Example

• P can be inserted into either R7 or R9
• R7 needs more additional space to include P, but the result does not overlap

[Figure: three views of R2 with its children R7, R8 and R9 and the new point P, comparing the insertion of P into R7 and into R9]

13.1 Heuristics

• An object is always inserted into the node for which it produces the smallest increase in volume
• If it falls in the interior of an MBR, no enlargement is needed
• If there are several possible nodes, select the one with the smallest volume
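These heuristics amount to a comparison on two keys: volume growth first, current volume as the tie-breaker. The sketch below applies them to a point that is to be inserted. The helper names `volume`, `enlarge` and `choose_entry` are hypothetical.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


def volume(r: Rect) -> float:
    v = 1.0
    for lo, hi in zip(r[0], r[1]):
        v *= hi - lo
    return v


def enlarge(r: Rect, p: Sequence[float]) -> Rect:
    """Smallest rectangle that contains both r and the point p."""
    lows = tuple(min(lo, x) for lo, x in zip(r[0], p))
    highs = tuple(max(hi, x) for hi, x in zip(r[1], p))
    return lows, highs


def choose_entry(mbrs: List[Rect], p: Sequence[float]) -> int:
    """Index of the MBR with the smallest volume growth; ties go to the smaller MBR."""
    def cost(i: int) -> Tuple[float, float]:
        r = mbrs[i]
        return (volume(enlarge(r, p)) - volume(r), volume(r))
    return min(range(len(mbrs)), key=cost)


if __name__ == "__main__":
    candidates = [((0, 0), (2, 2)), ((3, 0), (4, 1))]
    print(choose_entry(candidates, (2.5, 0.5)))
```

ChooseLeaf would apply this choice at every level from the root down to a leaf.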

13.1 Insert with Overflow

[Figure: inserting P into the full node R7 splits it into R7 and R7b; afterwards the tree has the root entries R1, R2, R3 and the leaf-level MBRs R4, R5, R6, R7, R7b, R8, R9, R10, R11]

13.1 SplitNode

• If an object is inserted into a full node, the M+1 objects are divided among two new nodes
• The goal of splitting is that subsequent searches should rarely need to traverse both resulting nodes
– Therefore use small MBRs; this leads to minimal overlap with other MBRs

13.1 Split Example

• Calculate the minimum total area of two rectangles, and minimize the dead space

[Figure: a bad split next to a better split]

13.1 Overflow Problem

• Deciding how exactly to perform the split is not trivial
– The objects of the old MBR can be divided among two new MBRs in many different ways
– The volume of both resulting MBRs should remain as small as possible
– The naive approach of checking all splits and calculating the resulting volumes is not feasible
• Two approaches
– With quadratic cost
– With linear cost

13.1 Overflow Problem

• Procedure with quadratic cost
– Compute the necessary MBR for each pair of objects and choose the pair with the largest MBR
– Since these two objects should not end up in the same MBR, they are used as the starting points of two new MBRs
– For every other object, compute the difference between the volume increases it would cause in the two MBRs

13.1 Overflow Problem

– Insert the object with the largest difference (the clearest preference for one of the two MBRs) into the MBR that it enlarges less, and compute that MBR again
– Repeat this procedure for all unallocated objects
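A compact sketch of the quadratic procedure follows. It makes two assumptions: the seeds are the pair with the largest common MBR, as described on the slides, and the next object is the one with the largest difference in volume growth, as in the PickNext step of Guttman's original algorithm. The minimum fill m is ignored for brevity, and all function names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


def volume(r: Rect) -> float:
    v = 1.0
    for lo, hi in zip(r[0], r[1]):
        v *= hi - lo
    return v


def union(a: Rect, b: Rect) -> Rect:
    lows = tuple(min(x, y) for x, y in zip(a[0], b[0]))
    highs = tuple(max(x, y) for x, y in zip(a[1], b[1]))
    return lows, highs


def quadratic_split(rects: List[Rect]) -> Tuple[List[int], List[int]]:
    """Divide the entries (given by index) among two groups."""
    # Seeds: the pair whose common MBR is largest (quadratic number of pairs).
    s0, s1 = max(combinations(range(len(rects)), 2),
                 key=lambda ij: volume(union(rects[ij[0]], rects[ij[1]])))
    groups: Tuple[List[int], List[int]] = ([s0], [s1])
    mbrs = [rects[s0], rects[s1]]
    rest = [k for k in range(len(rects)) if k not in (s0, s1)]

    def growth(k: int, g: int) -> float:
        return volume(union(mbrs[g], rects[k])) - volume(mbrs[g])

    while rest:
        # Assumption: take the entry with the strongest preference for one group.
        k = max(rest, key=lambda k: abs(growth(k, 0) - growth(k, 1)))
        g = 0 if growth(k, 0) <= growth(k, 1) else 1
        groups[g].append(k)
        mbrs[g] = union(mbrs[g], rects[k])  # recompute the MBR of that group
        rest.remove(k)
    return groups


if __name__ == "__main__":
    data = [((0, 0), (1, 1)), ((1, 0), (2, 1)), ((9, 9), (10, 10)), ((8, 9), (9, 10))]
    print(quadratic_split(data))
```

A full implementation would also stop early and hand all remaining entries to one group once the other could no longer reach m entries otherwise.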

13.1 Overflow Problem

• Procedure with linear cost
– In each dimension:
• Find the rectangle with the highest minimum coordinate and the rectangle with the smallest maximum coordinate
• Determine the distance between these two coordinates, and normalize it by the extent of all the rectangles in this dimension
– Choose as starting points of the new MBRs the two objects with the highest normalized distance

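13.1 Example

• x-direction: select A and E, as $d_x = \mathrm{diff}_x / \mathrm{max}_x = 5 / 14$
• y-direction: select C and D, as $d_y = \mathrm{diff}_y / \mathrm{max}_y = 8 / 13$
• Since $d_x < d_y$, C and D are chosen for the split

[Figure: five rectangles A to E; the total extent is 14 in x and 13 in y, and the separations are 5 in x and 8 in y]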

13.1 Overflow Problem

– Assign each remaining object to the MBR with the smallest volume growth
• The linear procedure is a simplification of the quadratic method
• It is usually sufficient, providing a split of similar quality (minimal overlap of the resulting MBRs)
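The seed selection of the linear procedure can be sketched as follows. The function name `linear_pick_seeds` is hypothetical, and the rectangle coordinates in the usage part are assumed values, chosen only so that they reproduce the ratios 5/14 and 8/13 of the example; they are not read from the original figure.

```python
from typing import List, Tuple

Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


def linear_pick_seeds(rects: List[Rect]) -> Tuple[int, int]:
    """Indices of the two seed rectangles with the highest normalized separation."""
    best = None
    for d in range(len(rects[0][0])):
        # Rectangle with the highest lower bound and rectangle with the lowest upper bound.
        hi_low = max(range(len(rects)), key=lambda i: rects[i][0][d])
        lo_high = min(range(len(rects)), key=lambda i: rects[i][1][d])
        extent = max(r[1][d] for r in rects) - min(r[0][d] for r in rects)
        if hi_low == lo_high or extent == 0:
            continue  # this dimension cannot separate two different rectangles
        separation = (rects[hi_low][0][d] - rects[lo_high][1][d]) / extent
        if best is None or separation > best[0]:
            best = (separation, lo_high, hi_low)
    if best is None:
        return 0, 1  # degenerate input: fall back to the first two entries
    return best[1], best[2]


if __name__ == "__main__":
    # Assumed coordinates for A, B, C, D, E (indices 0 to 4).
    rects = [((0, 3), (3, 6)), ((4, 5), (6, 8)), ((4, 0), (7, 2)),
             ((5, 10), (9, 13)), ((8, 4), (14, 7))]
    print(linear_pick_seeds(rects))
```

Only one pass over the rectangles is needed per dimension, which is where the linear cost comes from.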

13.1 Delete

• Procedure
– Search for the leaf node containing the object to delete (FindLeaf)
– Delete the object (deleteRecord)
– The tree is condensed (CondenseTree) if the resulting node has < m objects
– When condensing, the underfull node is erased completely and its remaining objects are reinserted
– If the root is left with just one child, that child becomes the new root

13.1 Example

• An object from R9 is deleted (1 object remains in R9, but m = 2)
– Because it holds too few objects, R9 is deleted and R2 is reduced (CondenseTree)

[Figure: R2 with its children R7, R8 and R9 before the deletion, and with R7 and R8 afterwards; the tree then has the root entries R1, R2, R3 and the leaf-level MBRs R4, R5, R6, R7, R8, R10, R11]

13.1 Update

• If a record is updated, its bounding rectangle can change
• The index entry must then be deleted, updated and re-inserted

13.1 Block Access Cost

• The most efficient search in R-trees is performed when the overlap and the dead space are minimal
• Avoiding overlap is only possible if the data points are known in advance

[Figure: R-tree with the root entries A, B, C over the leaf rectangles D to N, and a search region S]

13.1 Improved Versions of R-Trees

• Where are R-trees inefficient?
– They allow overlap between neighboring MBRs
• R+-trees (Sellis et al., 1987)
– Overlap of neighboring MBRs is prohibited
– This may lead to identical leaf entries occurring more than once in the tree
– Improved search efficiency, but scalability similar to R-trees

13.1 R+-Trees

• Overlaps are not permitted (A and P)
• Data rectangles are divided and may be present in several leaves (e.g., G)

[Figure: R+-tree over the rectangles D to N with the non-overlapping regions A, B, C and P and a search region S; G is stored in two leaves]

13.1 Operations in R+-Trees

• Differences to the R-tree
– Insert
• A data object can be inserted into several leaves
• Splitting continues downwards, since no overlaps are allowed
– Delete
• There is no longer a minimum number of children

13.1 Performance

• The main advantage of R+-trees is improved search performance
• Especially for point queries, this saves 50% of access time
• The drawback is the low occupancy of nodes caused by the many splits
• R+-trees often degenerate as the number of changes increases

13.1 More Versions

• R*-trees and X-trees improve the performance of R+-trees (Kriegel et al., 1990/1996)
– Improved split algorithm in R*-trees
– “Extended nodes” in X-trees allow sequential search of larger objects
– Scalable up to 20 dimensions

13.2 M-Trees

• The M-tree (Ciaccia et al., 1997) allows the use of arbitrary metrics for comparing objects (“metric trees”)
– R-trees only work with Euclidean metrics, but what about, for example, the edit distance?
– Use the triangle inequality to check subtrees
– The geometry is determined by the distance function

13.2 Metric Space

• A metric space is a pair M = (U, d)
– U is the universe of all possible values
– d is a metric
• For all x, y, z ∈ U:
– d(x, y) ≥ 0, and d(x, y) = 0 iff x = y
– d(x, y) = d(y, x)
– d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)

13.2 Triangle Inequality

• Precomputed: distances for all pairs of points

      a      b      c
a     0      6.70   7.07
b            0      2.30
c                   0

• Task: find the object with the smallest distance to Q
• The distance between Q and a is 2
• The distance between Q and b is 7.81
• Can c be the best object?
• $d(Q, b) \le d(Q, c) + d(b, c) \;\Rightarrow\; d(Q, c) \ge d(Q, b) - d(b, c) = 7.81 - 2.30 = 5.51$
• No. Therefore a is better

[Figure: the points a, b, c and the query point Q]
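The pruning argument of this example can be checked with a few lines. The function name `lower_bound` is hypothetical; the numbers are those given in the example.

```python
def lower_bound(d_query_known: float, d_known_candidate: float) -> float:
    """Lower bound on d(Q, candidate), derived from the triangle inequality."""
    # d(Q, known) <= d(Q, cand) + d(known, cand) and, symmetrically,
    # d(known, cand) <= d(known, Q) + d(Q, cand), hence the absolute difference.
    return abs(d_query_known - d_known_candidate)


best_so_far = 2.0                      # d(Q, a)
bound_c = lower_bound(7.81, 2.30)      # uses d(Q, b) and the precomputed d(b, c)
can_skip_c = bound_c > best_so_far     # c cannot beat a, so d(Q, c) is never computed

if __name__ == "__main__":
    print(round(bound_c, 2), can_skip_c)
```

The gain is that d(Q, c), which may be an expensive distance function, is replaced by a subtraction on values that are already known.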

13.2 Partitioning

• The M-tree partitions the objects into ε-environments with a certain radius
• A balanced partitioning is obtained by choosing $P_1 = \{\, p \mid d(p, v) \le r_v \,\}$ and $P_2 = \{\, p \mid d(p, v) > r_v \,\}$ so that $|P_1| \approx |P_2| \approx |P| / 2$
• For a query q with d(q, x) < r, only $P_2$ must be considered

13.2 M-Trees

• M-trees are similar to R-trees, but use the distance information

13.2 M-Trees

• Each node N has a region Reg(N)
– $\mathrm{Reg}(N) = \{\, p \mid p \in U,\ d(p, v_N) \le r_N \,\}$
– With $v_N$ as the so-called “routing object” and $r_N$ as the radius of the region (“covering radius”)
• All indexed points p are guaranteed to have a distance of at most $r_N$ from $v_N$
• Queries q with $d(q, v_N) > r_N + r$ do not need to consider node N (r is the search radius of the query)

13.2 M-Trees

• Internal nodes have
– A routing object
– The radius of their region, and
– A distance to the parent node
• Leaf nodes have
– The values of the indexed objects, and
– Their distance from the parent node

13.2 M-Trees

• Precomputed distances to the respective parent nodes allow fast searching (“fast pruning”)
• $d(v_P, v_N)$ is precomputed. We do not need to compute $d(q, v_N)$ if $|d(q, v_P) - d(v_P, v_N)| > r_N + r$
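Both pruning rules can be combined in a range query. The following is a minimal in-memory sketch; the classes `Entry` and `MNode`, their field names and the function `range_query` are assumptions made for illustration and do not reproduce the node layout of the original M-tree paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Entry:
    obj: object                      # routing object (internal) or indexed object (leaf)
    dist_to_parent: float            # precomputed distance to the parent routing object
    radius: float = 0.0              # covering radius r_N; 0 for leaf entries
    child: Optional["MNode"] = None  # None for leaf entries


@dataclass
class MNode:
    entries: List[Entry] = field(default_factory=list)


def range_query(node: MNode, q: object, r: float,
                dist: Callable[[object, object], float],
                d_q_parent: Optional[float] = None,
                out: Optional[list] = None) -> list:
    """Return all indexed objects within distance r of q."""
    out = [] if out is None else out
    for e in node.entries:
        # Fast pruning: uses only the precomputed distance, no call to dist().
        if d_q_parent is not None and abs(d_q_parent - e.dist_to_parent) > e.radius + r:
            continue
        d = dist(q, e.obj)
        if d > e.radius + r:
            continue  # the query ball and the region do not intersect
        if e.child is None:
            out.append(e.obj)
        else:
            range_query(e.child, q, r, dist, d, out)
    return out


if __name__ == "__main__":
    metric = lambda a, b: abs(a - b)  # a simple metric on numbers
    leaf = MNode([Entry(9, 1.0), Entry(10, 0.0), Entry(12, 2.0)])
    root = MNode([Entry(10, 0.0, radius=2.0, child=leaf)])
    print(range_query(root, 11, 1.5, metric))
```

For leaf entries the covering radius is 0, so the same two tests decide whether an object is reported; the first test is what saves distance computations.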


13.2 M-Trees

• Insertion is performed as in R-trees, choosing the smallest expansion of the region radius
• On overflow, a split is performed
– No volumes are calculated, however (unlike with the MBRs in the R-tree)
– Delete the node and choose two new routing objects
– Heuristic: minimize the maximum of the two resulting region radii
– Then assign the remaining objects to the two new regions, alternately giving each routing object its nearest unassigned neighbor (balanced split)

13.2 M-Trees

• M-trees overview
– Allow a variety of distance functions
– Use the triangle inequality for pruning
– The supported dimensionality is also very limited

Next lecture

• Indexes for Multimedia Data
– Curse of Dimensionality
– Dimension Reduction
– GEMINI Indexing
