13.0 Indexes for Multimedia Data

Academic year: 2021

Multimedia Databases

Wolf-Tilo Balke Silviu Homoceanu

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

13 Indexes for Multimedia Data

13.1 R-Trees
13.2 M-Trees

13.0 Indexes for Multimedia Data

• Multimedia databases
Images
Audio data
Video data
• Description of multimedia objects
Usually (multidimensional) real-valued feature vectors
But also: skeletons, chain codes, ...
• The sequential search for similar objects in databases is very inefficient
How can we speed up the search?

• Speed up search through indexing
Efficient management of multidimensional information
Pre-structuring of data for the subsequent search functionality
Efficient data structures, combined with search and comparison algorithms
Transition from set semantics to list semantics
To which degree does the object from

• Requirements for a multidimensional index structure
Correctness and completeness of the corresponding indexing algorithms
Scalability with dimension growth
Support for objects which are not real-valued vectors
Search efficiency (sublinear)

Different types of queries:
Exact search
Area search
Nearest neighbors search
...
Efficient update operations
Support for various distance functions
Low memory requirements

• Fundamental problem:
The more dimensions, the more comparisons are needed
There is currently no truly scalable indexing
Cause: “Curse of Dimensionality” (Richard Bellman)
The volume of space grows exponentially with the number of its dimensions
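The exponential growth of volume can be made concrete with a small sketch. It is only an illustration under two assumptions that are not part of the slides: each axis is cut into 10 intervals, and the unit hypercube is compared with its inscribed ball of radius 0.5. The function names are hypothetical.

```python
import math

def grid_cells(dims: int, bins_per_axis: int = 10) -> int:
    """Number of grid cells when every axis is cut into bins_per_axis intervals."""
    return bins_per_axis ** dims

def inscribed_ball_fraction(dims: int) -> float:
    """Share of the unit hypercube's volume covered by the inscribed ball (radius 0.5)."""
    # Volume of a d-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * radius^d
    ball_volume = math.pi ** (dims / 2) / math.gamma(dims / 2 + 1) * 0.5 ** dims
    return ball_volume  # the unit hypercube itself has volume 1

for dims in (2, 5, 10, 20):
    print(dims, grid_cells(dims), f"{inscribed_ball_fraction(dims):.6f}")
```

The number of cells grows as 10^d, while the fraction of the cube that lies within a fixed distance of its center shrinks quickly, which is one reason why region-based pruning loses its effect in high dimensions.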

13.0 Query Types

Exact search
Point search
Area search
k-nearest-neighbor search (k-NN search)
Find the k objects that have the least distance to the object given as reference in the request
k-NN search is usually only computed approximately (with a specified error) due to the high cost
Reverse-nearest-neighbor search
Find all the objects whose nearest neighbor is provided in
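As a baseline for the k-NN query type, the following minimal sketch performs the sequential scan that an index is meant to avoid. The Euclidean distance and the sample points are assumptions chosen for illustration; any distance function could be passed in.

```python
import heapq
import math

def euclidean(p, q):
    """Euclidean distance between two equally long coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_linear_scan(objects, query, k, dist=euclidean):
    """Exact k-NN by comparing the query with every object (linear cost)."""
    return heapq.nsmallest(k, objects, key=lambda o: dist(o, query))

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
nearest = knn_linear_scan(points, (1.2, 1.2), k=2)
print(nearest)
```

Every query touches all objects here, so the cost is linear in the database size and serves as the reference point for sublinear index structures.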

13.0 Tree Structures

• Search in database systems
B-tree structures allow exact search with logarithmic costs

[Figure: B-tree with inner keys 2, 6, 7 and leaf keys 1, 3, 4, 5, 8, 9]

• Search in multimedia databases
The data is multidimensional; B-trees, however, support only one-dimensional search
• Are there any possibilities to extend tree functionality for multidimensional data?

• The basic idea of multidimensional trees
Describe the sets of points through geometric regions, which comprise the points (clusters)
The clusters are considered for the actual search and not the individual points
Clusters can contain each other, resulting in a hierarchical structure

• Differentiating criteria for tree structures:
Cluster construction:
Completely fragmenting the space or
Grouping data locally
Cluster overlap:
Overlapping or
Disjoint
Balance:
Balanced or unbalanced
Object storage:
Objects in leaves and nodes, or
Objects only in the leaves
Geometry:
Hyper-spheres,
Hyper-cubes,
...

13.1 R-Trees

• The R-tree (Guttman, 1984) is the prototype of a multi-dimensional extension of the classical B-trees
• Frequently used for low-dimensional applications (used up to about 10 dimensions), such as geographic information systems
• More scalable versions: R+-Trees, R*-Trees and X-Trees (each up to 20 dimensions for uniformly distributed data)

13.1 R-Tree Structure

• Dynamic Index Structure (insert, update and delete are possible)
• Data structure
Data pages are leaf nodes and store clustered point data and data objects
Directory pages are the internal nodes and store directory entries
Multidimensional data are structured with the help of Minimum Bounding Rectangles (MBRs)

13.1 R-Tree Example

[Figure: example R-tree with root entries R1, R2, R3, second-level entries R4 to R11, and data points P, O, Q]

13.1 R-Tree Characteristics

• Local grouping for clustering
• Overlapping clusters (the more the clusters overlap, the more inefficient the index is)
• Height-balanced tree structure (therefore all the children of a node in the tree have about the same number of successors)
• Objects are stored only in the leaves
Internal nodes are used for navigation
• MBRs are used as the geometry

13.1 R-Tree Properties

• The root has at least two children
• Each internal node has between m and M children
• M and m ≤ M / 2 are pre-defined parameters
• For each entry (I, child-pointer) in an internal node, I is the smallest rectangle that contains the rectangles of the child nodes
• For each index entry (I, tuple-id) in a leaf, I is the smallest bounding rectangle that contains the data object (with the ID tuple-id)
• All the leaves in the tree are on the same level
• All leaves have between m and M index records
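The rectangle I of a directory entry can be computed from the rectangles below it. The sketch assumes a hypothetical representation that is not prescribed by the slides: a rectangle is a pair (lower corner, upper corner) of coordinate tuples.

```python
from typing import Sequence, Tuple

# Assumed representation: (lower corner, upper corner), one coordinate per dimension
Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]

def mbr(rects: Sequence[Rect]) -> Rect:
    """Smallest axis-parallel rectangle that contains all given rectangles."""
    dims = range(len(rects[0][0]))
    lows = tuple(min(r[0][i] for r in rects) for i in dims)
    highs = tuple(max(r[1][i] for r in rects) for i in dims)
    return lows, highs

def volume(rect: Rect) -> float:
    """Product of the side lengths (area in 2-D)."""
    result = 1.0
    for low, high in zip(rect[0], rect[1]):
        result *= high - low
    return result

children = [((0, 0), (2, 1)), ((1, 3), (4, 5))]
parent = mbr(children)
print(parent, volume(parent))
```

The difference between the volume of the parent rectangle and the space actually covered by its children is the dead space that the split heuristics try to keep small.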

13.1 Operations of R-Trees

• The essential operations for the use and management of an R-tree are
Search
Insert
Update
Delete
Splitting

13.1 Searching in R-Trees

• The tree is searched recursively from the root to the leaves
One path is selected
If the requested record has not been found in that sub-tree, the next path is traversed
• The path selection is arbitrary
• No guarantee for good performance
• In the worst case, all paths must be traversed (due to overlaps of the MBRs)
• Search algorithms try to exclude as many irrelevant regions as possible (“pruning”)

13.1 Search Algorithm

• All the index entries which intersect with the search rectangle S are traversed
The search in internal nodes
Check each entry for intersection with S
For all intersecting entries continue the search in their children
The search in leaf nodes
Check all the entries to determine whether they intersect S
Add all matching objects to the result set
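The two cases of the search can be sketched as one recursive function. The `Node` class and the tuple layout of the entries are assumptions made for this example; they are not taken from the slides or from a particular library.

```python
from dataclasses import dataclass, field

def intersects(a, b):
    """True if two rectangles (lower corner, upper corner) overlap in every dimension."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]))

@dataclass
class Node:
    is_leaf: bool
    # Leaf entries: (MBR, tuple-id); internal entries: (MBR, child Node)
    entries: list = field(default_factory=list)

def search(node, s, result=None):
    """Collect the tuple-ids of all leaf entries whose MBR intersects the rectangle s."""
    if result is None:
        result = []
    for rect, payload in node.entries:
        if not intersects(rect, s):
            continue  # pruning: nothing below this entry can intersect s
        if node.is_leaf:
            result.append(payload)
        else:
            search(payload, s, result)
    return result

leaf1 = Node(True, [(((0, 0), (1, 1)), "a"), (((2, 2), (3, 3)), "b")])
leaf2 = Node(True, [(((6, 6), (7, 7)), "c"), (((8, 1), (9, 2)), "d")])
root = Node(False, [(((0, 0), (3, 3)), leaf1), (((6, 1), (9, 7)), leaf2)])
hits = search(root, ((2.5, 2.5), (6.5, 6.5)))
print(hits)
```

Because MBRs may overlap, the loop can descend into several children of the same node, which is exactly the source of the worst case in which all paths are traversed.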

13.1 Example

[Figure: range search with search rectangle S in the example R-tree]

• Check only 7 nodes instead of 12
Check all the objects in node R8

13.1 Insert

• Procedure
The best leaf page is chosen (ChooseLeaf) considering the spatial criteria
Best leaf: the leaf that needs the smallest volume growth to include the new object
The object will be inserted there if there is enough room (number of objects in the node < M)
If there is no room left in the node, this counts as an overflow and the node is divided (SplitNode)
The goal of the split is minimal overlap and as little dead space as possible
The MBR (interval) of the parent node must be adapted to the new object (AdjustTree)
If the splitting reaches the root, create a new root whose children are the two split nodes of the old root

13.1 R-Tree Insert Example

[Figure: inserting point P into region R2, which contains R7, R8 and R9]

• Inserting P either in R7 or R9
• In R7, it needs more space, but does not overlap

13.1 Heuristics

• An object is always inserted into the node whose volume it increases the least
• If it falls in the interior of an MBR, no enlargement is needed
• If there are several possible nodes, then select the one with the smallest volume
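These heuristics amount to a comparison key per candidate node. The sketch below assumes the same hypothetical rectangle representation (lower corner, upper corner) and treats a point as a rectangle of zero extent; `choose_entry` is an illustrative name, not Guttman's ChooseLeaf verbatim.

```python
def volume(rect):
    """Product of the side lengths of a rectangle (lower corner, upper corner)."""
    result = 1.0
    for low, high in zip(rect[0], rect[1]):
        result *= high - low
    return result

def enlarge(rect, obj):
    """Smallest rectangle that contains both rect and obj."""
    lows = tuple(min(a, b) for a, b in zip(rect[0], obj[0]))
    highs = tuple(max(a, b) for a, b in zip(rect[1], obj[1]))
    return lows, highs

def choose_entry(mbrs, obj):
    """Index of the MBR needing the least volume growth; ties go to the smaller MBR."""
    def key(i):
        growth = volume(enlarge(mbrs[i], obj)) - volume(mbrs[i])
        return (growth, volume(mbrs[i]))
    return min(range(len(mbrs)), key=key)

# The point (6, 3) enlarges the first MBR by 8 and the second by 2
outside = choose_entry([((0, 0), (4, 4)), ((5, 0), (7, 2))], ((6, 3), (6, 3)))
# An object inside both MBRs needs no enlargement, so the smaller MBR wins
inside = choose_entry([((0, 0), (10, 10)), ((2, 2), (5, 5))], ((3, 3), (4, 4)))
print(outside, inside)
```

Applying the same choice at every level from the root downwards leads to the leaf in which the object is finally stored.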

13.1 Insert with Overflow

[Figure: R7 overflows when P is inserted and is split into R7 and R7b; resulting tree: root with R1, R2, R3 above R4, R5, R6, R7, R7b, R8, R9, R10, R11]

13.1 SplitNode

• If an object is inserted in a full node, then the M+1 objects will be divided among two new nodes
• The goal in splitting is that it should rarely be needed to traverse both resulting nodes on subsequent searches
Therefore use small MBRs. This leads to minimal overlapping with other MBRs

13.1 Split Example

• Calculate the minimum total area of two rectangles, and minimize the dead space

[Figure: bad split vs. better split]

13.1 Overflow Problem

• Deciding on how exactly to perform the splits is not trivial
All objects of the old MBR can be divided in different ways on two new MBRs
The volume of both resulting MBRs should remain as small as possible
The naive approach of checking all possible splits and calculating the resulting volumes is not feasible
• Two approaches
With quadratic cost
With linear cost

• Procedure with quadratic cost
Compute for each pair of objects the necessary MBR and choose the pair with the largest MBR
Since these two objects should not occur in the same MBR, they will be used as starting points for two new MBRs
Compute for all other objects the difference of the necessary volume increase with respect to both MBRs
Insert the object with the largest difference (Guttman's PickNext) into the MBR that needs the smaller increase, and compute that MBR again
Repeat this procedure for all unallocated objects
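A minimal sketch of the quadratic procedure follows. It is written under stated assumptions rather than as the definitive algorithm: the seeds are the pair with the largest common MBR, as described above, the next object is picked as in Guttman's PickNext (largest difference between the two volume increases), and the minimum fill m of a node is ignored for brevity.

```python
from itertools import combinations

def combine(a, b):
    """Smallest rectangle (lower corner, upper corner) containing a and b."""
    return (tuple(min(x, y) for x, y in zip(a[0], b[0])),
            tuple(max(x, y) for x, y in zip(a[1], b[1])))

def volume(rect):
    result = 1.0
    for low, high in zip(rect[0], rect[1]):
        result *= high - low
    return result

def quadratic_split(rects):
    """Distribute rectangles over two groups; returns (index groups, group MBRs)."""
    # Seeds: the pair whose common MBR is largest (quadratic number of pairs)
    i, j = max(combinations(range(len(rects)), 2),
               key=lambda p: volume(combine(rects[p[0]], rects[p[1]])))
    groups, mbrs = ([i], [j]), [rects[i], rects[j]]
    rest = [k for k in range(len(rects)) if k not in (i, j)]

    def growth(k, g):
        return volume(combine(mbrs[g], rects[k])) - volume(mbrs[g])

    while rest:
        # Pick the object with the strongest preference for one of the groups
        k = max(rest, key=lambda c: abs(growth(c, 0) - growth(c, 1)))
        g = 0 if growth(k, 0) <= growth(k, 1) else 1
        groups[g].append(k)
        mbrs[g] = combine(mbrs[g], rects[k])
        rest.remove(k)
    return groups, mbrs

rects = [((0, 0), (1, 1)), ((1, 0), (2, 1)), ((8, 8), (9, 9)), ((9, 8), (10, 9))]
groups, mbrs = quadratic_split(rects)
print(groups, mbrs)
```

For the four sample rectangles, the two clusters in opposite corners end up in separate groups, so the resulting MBRs do not overlap.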

• Procedure with linear cost
In each dimension:
Find the rectangle with the highest minimum coordinate and the rectangle with the smallest maximum coordinate
Determine the distance between these two coordinates, and normalize it by the total extent of all the rectangles in this dimension
Determine the two starting points of the new MBRs as the two objects with the highest normalized distance

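13.1 Example

[Figure: rectangles A, B, C, D, E; total extent 14 in x-direction and 13 in y-direction; separation 5 in x-direction and 8 in y-direction]

x-direction: select A and E, as d_x = diff_x / max_x = 5 / 14
y-direction: select C and D, as d_y = diff_y / max_y = 8 / 13
Since d_x < d_y, C and D are chosen for the split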

13.1 Overflow Problem

Assign all remaining objects to the MBR with the smallest volume growth
• The linear process is a simplification of the quadratic method
• It is usually sufficient, providing a split of similar quality (minimal overlap of the resulting MBRs)
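The seed selection of the linear procedure can be sketched as follows. The coordinates of A to E are hypothetical: they were chosen only so that the ratios 5 / 14 and 8 / 13 from the example are reproduced, since the figure's real coordinates are not available. The sketch assumes that the rectangles have a non-zero total extent in every dimension.

```python
def linear_pick_seeds(rects):
    """Return (normalized separation, index with lowest high side, index with highest low side)."""
    best = None
    for d in range(len(rects[0][0])):
        hi_low = max(range(len(rects)), key=lambda i: rects[i][0][d])
        lo_high = min(range(len(rects)), key=lambda i: rects[i][1][d])
        # Total extent of all rectangles along dimension d (assumed to be non-zero)
        extent = max(r[1][d] for r in rects) - min(r[0][d] for r in rects)
        separation = (rects[hi_low][0][d] - rects[lo_high][1][d]) / extent
        if best is None or separation > best[0]:
            best = (separation, lo_high, hi_low)
    return best

names = "ABCDE"
rects = [
    ((0, 3), (4, 6)),     # A: smallest maximum x coordinate (4)
    ((2, 5), (7, 8)),     # B
    ((3, 0), (8, 2)),     # C: smallest maximum y coordinate (2)
    ((5, 10), (10, 13)),  # D: highest minimum y coordinate (10)
    ((9, 4), (14, 9)),    # E: highest minimum x coordinate (9)
]
separation, first, second = linear_pick_seeds(rects)
print(names[first], names[second], separation)
```

Each dimension is scanned a constant number of times, so the cost is linear in the number of rectangles, in contrast to the pairwise comparison of the quadratic procedure.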

13.1 Delete

• Procedure
Search the leaf node with the object to delete (FindLeaf)
Delete the object (DeleteRecord)
The tree is condensed (CondenseTree) if the resulting node has < m objects
When condensing, a node is completely erased and the objects of the node which should have remained are reinserted
If the root remains with just one child, the child will become the new root

13.1 Example

• An object from R9 is deleted (1 object remains in R9, but m = 2)
Because it holds too few objects, R9 is deleted and R2 is reduced (CondenseTree)

[Figure: R2 before the deletion (R7, R8, R9) and after it (R7, R8); resulting tree: root with R1, R2, R3 above R4, R5, R6, R7, R8, R10, R11]

13.1 Update

• If a record is updated, its surrounding rectangle can change
• The index entry must then be deleted, updated and then re-inserted

13.1 Block Access Cost

• The most efficient search in R-trees is performed when the overlap and the dead space are minimal

[Figure: example R-tree with top-level entries A, B, C over the rectangles D to N]

Avoiding overlapping is only possible if data points are known in advance

13.1 Improved Versions of R-Trees

• Where are R-trees inefficient?
They allow overlapping between neighboring MBRs
• R+-Trees (Sellis et al., 1987)
Overlapping of neighboring MBRs is prohibited
This may lead to identical leaves occurring more than once in the tree
Improves search efficiency, but scalability is similar to that of R-trees

13.1 R+-Trees

[Figure: example R+-tree with top-level entries A, B, C, P; the rectangle G appears in two leaves]

• Overlaps are not permitted (A and P)
• Data rectangles are divided and may be present (e.g., G) in several leaves

13.1 Operations in R+-Trees

• Differences to the R-tree
Insert
A data object can be inserted into several leaves
Splitting continues downwards, since no overlaps are allowed
Delete
There is no longer a minimum number of children

13.1 Performance

• The main advantage of R+-trees is the improved search performance
• Especially for point queries, this saves 50% of access time
• The drawback is the low occupancy of nodes resulting from many splits
• R+-trees often degenerate with an increasing number of changes

13.1 More Versions

• R*-trees and X-trees improve the performance of the R+-trees (Kriegel et al., 1990/1996)
Improved split algorithm in R*-trees
“Extended nodes” in X-trees allow sequential search of larger objects
Scalable up to 20 dimensions

13.2 M-Trees

• M-tree (Ciaccia et al., 1997) allows the use of arbitrary metrics for comparison of objects (“metric trees”)
R-trees only work with Euclidean metrics, but what about, for example, the edit distance?
Use the triangle inequality to check sub-trees
Geometry is determined by the distance function

13.2 Metric Space

• A metric space is a pair M = (U, d)
U is the universe of all possible values
d is a metric
• For all x, y, z ∈ U:
d(x, y) ≥ 0, d(x, y) = 0 iff x = y
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
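The edit distance mentioned earlier is one example of a metric on strings. The sketch below implements the standard Levenshtein distance and checks the three axioms on a small sample. Such a check is only a sanity test on the chosen sample, not a proof; the helper name `satisfies_metric_axioms` is hypothetical.

```python
from itertools import product

def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

def satisfies_metric_axioms(sample, d):
    """Check positivity, symmetry and the triangle inequality on all triples of sample."""
    for x, y, z in product(sample, repeat=3):
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):
            return False
        if d(x, y) != d(y, x):
            return False
        if d(x, y) > d(x, z) + d(z, y):
            return False
    return True

words = ["tree", "three", "trie", "free", ""]
print(edit_distance("kitten", "sitting"), satisfies_metric_axioms(words, edit_distance))
```

A function such as the squared difference (x − y)² fails the same check, because it violates the triangle inequality; such a function could not be used for pruning in a metric tree.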

13.2 Triangle Inequality

• Precomputed:
Distances for all pairs of points
• Task: Find the object with the smallest distance to Q
• Distance between Q and a is 2
• Distance between Q and b is 7.81
• Can c be the best object?
• d(Q, b) ≤ d(Q, c) + d(b, c), therefore 7.81 − 2.30 = 5.51 ≤ d(Q, c)
No: d(Q, c) ≥ 5.51 > 2 = d(Q, a), so c can be excluded without computing d(Q, c)

Precomputed distances:
    a     b     c
a         6.70  7.07
b               2.30
c
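The same argument can be written as a short computation. The values are the ones from the example; the function name `lower_bound` is an assumption made for this sketch.

```python
def lower_bound(d_q_b, d_b_c):
    """Lower bound for d(Q, c) that follows from the triangle inequality."""
    return abs(d_q_b - d_b_c)

d_q_a = 2.0    # computed distance between Q and a (best candidate so far)
d_q_b = 7.81   # computed distance between Q and b
d_b_c = 2.30   # precomputed distance between b and c

bound = lower_bound(d_q_b, d_b_c)
c_can_be_best = bound < d_q_a
print(round(bound, 2), c_can_be_best)
```

Only subtraction and comparison are needed, so an expensive distance computation between Q and c is avoided.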

13.2 Partitioning

• The M-tree partitions the objects into ε-environments with a certain radius
A balanced partitioning is obtained by choosing
P1 = { p | d(p, v) ≤ rv } and
P2 = { p | d(p, v) > rv }
so that |P1| ≈ |P2| ≈ |P| / 2
For a range query around q with radius r (all x with d(q, x) < r) whose region lies completely outside the region around v, i.e. d(q, v) − r > rv, only P2 must be considered
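A balanced partitioning around a reference object v can be sketched by using the median distance as the radius rv. This choice of the median, the one-dimensional sample data and the function names are assumptions for illustration only.

```python
import statistics

def ball_partition(points, v, d):
    """Split points into P1 (inside the ball around v) and P2 (outside), roughly in halves."""
    r_v = statistics.median(d(p, v) for p in points)
    p1 = [p for p in points if d(p, v) <= r_v]
    p2 = [p for p in points if d(p, v) > r_v]
    return r_v, p1, p2

def partitions_to_search(q, r, v, r_v, d):
    """Which partitions can contain objects within distance r of q: (search P1?, search P2?)."""
    d_q_v = d(q, v)
    # By the triangle inequality, results x satisfy d_q_v - r <= d(x, v) <= d_q_v + r
    return d_q_v - r <= r_v, d_q_v + r > r_v

def distance(a, b):
    return abs(a - b)

r_v, p1, p2 = ball_partition([1, 2, 3, 10, 11, 12], v=0, d=distance)
print(r_v, p1, p2, partitions_to_search(11, 1, 0, r_v, distance))
```

For the query at 11 with radius 1, the first partition is excluded after a single distance computation between q and v.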

13.2 M-Trees

• M-trees are similar to R-trees, but use the distance information
• Each node N has a region Reg(N)
Reg(N) = {p | p ∈ U, d(p, vN) ≤ rN}
With vN as the so-called “routing object” and rN as the radius of the area (“covering radius”)
• All the indexed points p are guaranteed to have a distance of at most rN from vN
Queries q with search radius r and d(q, vN) > rN + r don’t need to consider node N

• Internal nodes have
A routing object
The radius of their region and
A distance to the parent node
• Leaf nodes have
The values of the indexed objects and
Their distance from the parent node

• Precomputed distances to the respective parent nodes allow fast searching (“fast pruning”)
• d(vP, vN) is precomputed. We don’t need d(q, vN) if |d(q, vP) − d(vP, vN)| > rN + r
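Both pruning rules can be combined in a small range search. The `Entry` class, the hand-built two-level tree and the counting wrapper are assumptions made for this sketch; they are not the layout of a real M-tree implementation. Leaf entries are treated as entries with covering radius 0.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class Entry:
    obj: Any                               # routing object vN, or indexed object in a leaf
    dist_to_parent: float                  # precomputed d(vP, obj); unused at the root level
    radius: float = 0.0                    # covering radius rN; 0 for leaf entries
    child: Optional[List["Entry"]] = None  # entries of the child node; None in a leaf

class CountingDistance:
    """Wraps a distance function and counts how often it is evaluated."""
    def __init__(self, d):
        self.d, self.calls = d, 0
    def __call__(self, a, b):
        self.calls += 1
        return self.d(a, b)

def range_query(entries, q, r, d, d_q_parent=None, out=None):
    """Collect all indexed objects within distance r of q."""
    out = [] if out is None else out
    for e in entries:
        # Fast pruning: decided from precomputed distances, d(q, e.obj) is not computed
        if d_q_parent is not None and abs(d_q_parent - e.dist_to_parent) > e.radius + r:
            continue
        d_q = d(q, e.obj)
        if e.child is None:
            if d_q <= r:
                out.append(e.obj)
        elif d_q <= e.radius + r:  # otherwise the region cannot contain a result
            range_query(e.child, q, r, d, d_q, out)
    return out

leaf_a = [Entry(0, 2.0), Entry(2, 0.0), Entry(4, 2.0)]
leaf_b = [Entry(17, 3.0), Entry(20, 0.0), Entry(23, 3.0)]
root = [Entry(2, 0.0, 2.0, leaf_a), Entry(20, 0.0, 3.0, leaf_b)]
dist = CountingDistance(lambda a, b: abs(a - b))
found = range_query(root, 22, 1.5, dist)
print(found, dist.calls)
```

In this example the first subtree is discarded by its covering radius and one leaf entry by the precomputed parent distance, so fewer distances are computed than objects are stored.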

• Insert is performed as in R-trees, choosing the smallest expansion of the region radius
• At overflow, a split is performed
However, no volumes are calculated (unlike with the MBRs in the R-tree)
Delete the node and choose two new routing objects
Heuristic: Minimize the maximum of the two resulting region radii
Then assign the remaining objects to the two new regions, alternating between the nearest neighbors of the two routing objects (Balanced Split)

• M-Trees overview
Allow a variety of distance functions
Use triangle inequality for pruning
The dimensionality is also very limited

Next lecture

• Indexes for Multimedia Data
Curse of Dimensionality
Dimension Reduction
GEMINI Indexing
