
Similarity Search

Pruning with Reference Sets

Nikolaus Augsten

nikolaus.augsten@sbg.ac.at

Department of Computer Sciences

University of Salzburg

http://dbresearch.uni-salzburg.at

WS 2019/20

Version February 24, 2020

Augsten (Univ. Salzburg) Similarity Search WS 2019/20 1 / 22

Outline

1 Reference Sets
    Definition
    Upper and Lower Bound
    Combined Approach


Reference Sets [GJK+02]

Reference sets take advantage of the triangle inequality for metrics.

An extended version of the triangle inequality is:

$$|\delta(a, b) - \delta(b, c)| \le \delta(a, c) \le \delta(a, b) + \delta(b, c),$$

where δ() is a metric distance function computed between a, b, and c.
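The extended inequality can be checked numerically for any candidate distance function. The following is a minimal sketch, not part of the original slides: the helper name `check_extended_triangle` and the sample points are assumptions chosen for illustration, and the Euclidean distance stands in for an arbitrary metric.

```python
import itertools
import math


def check_extended_triangle(points, dist, eps=1e-9):
    """Return True if |d(a,b) - d(b,c)| <= d(a,c) <= d(a,b) + d(b,c)
    holds for every ordered triple of distinct points."""
    for a, b, c in itertools.permutations(points, 3):
        ab, bc, ac = dist(a, b), dist(b, c), dist(a, c)
        # eps absorbs floating-point rounding in the comparisons
        if not (abs(ab - bc) - eps <= ac <= ab + bc + eps):
            return False
    return True


# Hypothetical sample points in the plane; math.dist is the Euclidean metric.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
holds_for_euclidean = check_extended_triangle(points, math.dist)

# The squared difference is not a metric, so the check fails for it.
holds_for_squared = check_extended_triangle([0, 1, 2], lambda x, y: (x - y) ** 2)
```

A function that fails this check cannot be used safely with reference sets, because the bounds derived from it would no longer be guaranteed.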


Illustration: Triangle Inequality

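[Figure: four configurations of the points a, b, and c]

- a, b, c in general position: $\delta(a, c) < \delta(a, b) + \delta(b, c)$
- a, b, c in general position: $|\delta(a, b) - \delta(b, c)| < \delta(a, c)$
- b on the line between a and c: $\delta(a, c) = \delta(a, b) + \delta(b, c)$
- c on the line between a and b: $|\delta(a, b) - \delta(b, c)| = \delta(a, c)$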


Notation

- $S_1$ and $S_2$ are sets of trees (unordered forests)
- $K \subseteq S_1 \cup S_2$ is called the reference set
- $T_i \in S_1$ and $T_j \in S_2$ are trees
- $k_l \in K$, $1 \le l \le |K|$, is a tree of the reference set


Reference Vector

- $v_i$ is a vector of size $|K|$ that stores the distance from $T_i \in S_1$ to each tree $k_l \in K$
- the $l$-th element of $v_i$ stores the edit distance between $T_i$ and $k_l$: $v_{il} = \delta_t(T_i, k_l)$
- $v_j$ is the respective vector for $T_j \in S_2$
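Computing the reference vectors amounts to one distance computation per tree and reference tree. The sketch below illustrates this under loud assumptions: trees are represented only by their label sets, and `label_set_distance` (the size of the symmetric difference, which is a metric) is a stand-in for the tree edit distance, which is not implemented here. All names are hypothetical.

```python
def label_set_distance(a, b):
    """Stand-in metric (NOT the tree edit distance): number of labels
    that occur in exactly one of the two label sets."""
    return len(a ^ b)


def reference_vectors(trees, reference_set, dist):
    """Map each tree name to its vector of distances to the reference trees.

    The l-th entry of the vector for tree T is dist(T, k_l)."""
    return {
        name: [dist(tree, ref) for ref in reference_set]
        for name, tree in trees.items()
    }


# Hypothetical toy data: each "tree" is reduced to its set of node labels.
trees = {
    "T1": {"a", "b"},
    "T2": {"a", "c", "d"},
    "T3": {"a", "b", "e"},
}
K = [trees["T1"]]  # reference set with a single reference tree
vectors = reference_vectors(trees, K, label_set_distance)
```

The cost of this step is |S| · |K| distance computations, which is the quantity behind the tradeoff in the choice of the reference set size.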


Metric Space

Metric space of the reference set:

- the elements of the reference set define the basis of a metric space
- the vector $v_k$ represents tree $T_k$ as a point in this space
- $v_{kl}$ is the coordinate of the point on the $l$-th axis

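Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.

[Figure: $T_i$ and $T_j$ plotted as points with coordinates $(v_{i1}, v_{i2})$ and $(v_{j1}, v_{j2})$; the axes are $v_{k1}$ and $v_{k2}$]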


Upper and Lower Bound

From the triangle inequality it follows that for all $1 \le l \le |K|$

$$|v_{il} - v_{jl}| \le \delta_t(T_i, T_j) \le v_{il} + v_{jl}$$

Upper bound:

$$\delta_t(T_i, T_j) \le \min_{1 \le l \le |K|} (v_{il} + v_{jl}) = u_t(T_i, T_j)$$

Lower bound:

$$\delta_t(T_i, T_j) \ge \max_{1 \le l \le |K|} |v_{il} - v_{jl}| = l_t(T_i, T_j)$$

Approximate join: match all trees with $\delta_t(T_i, T_j) \le \tau$

- Upper bound: If $u_t(T_i, T_j) \le \tau$, then $T_i$ and $T_j$ match.
- Lower bound: If $l_t(T_i, T_j) > \tau$, then $T_i$ and $T_j$ do not match.
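The two bounds and the resulting filter decision can be sketched in a few lines. This is an illustrative sketch only: the function names and the three-way result labels are assumptions, and the reference vectors are taken as plain lists of numbers.

```python
def upper_bound(v_i, v_j):
    """u_t: smallest sum of the two distances over all reference trees."""
    return min(a + b for a, b in zip(v_i, v_j))


def lower_bound(v_i, v_j):
    """l_t: largest absolute difference of the two distances over all reference trees."""
    return max(abs(a - b) for a, b in zip(v_i, v_j))


def classify(v_i, v_j, tau):
    """Decide a pair from its reference vectors where possible.

    Returns "match", "no match", or "compute" when neither bound
    decides and the exact distance is still required."""
    if upper_bound(v_i, v_j) <= tau:
        return "match"
    if lower_bound(v_i, v_j) > tau:
        return "no match"
    return "compute"
```

Only pairs classified as "compute" need the expensive exact distance, so the filter pays off when most pairs fall into one of the other two outcomes.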


Example: Similarity Join with Reference Sets

$S_1 = \{T_1, T_2, T_3\}$, $S_2 = \{T_4, T_5, T_6\}$, $\tau = 2$

Reference set $K = \{T_1\}$, $1 \le l \le |K| = 1$

Distances:

      T1  T2  T3  T4  T5  T6
T1     0   4   1   1   4   1
T2         0   4   4   1   4
T3             0   1   4   1
T4                 0   4   1
T5                     0   4
T6                         0

Reference vectors:

$v_1 = (0)$, $v_2 = (4)$, $v_3 = (1)$, $v_4 = (1)$, $v_5 = (4)$, $v_6 = (1)$

The clusters $C_1 = \{T_1, T_3, T_4, T_6\}$ and $C_2 = \{T_2, T_5\}$ are well separated.
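The example can be replayed in code to see how many pairs the bounds decide. The sketch below uses the distance table and the reference set from the example; the variable names and the bookkeeping of exact distance computations are assumptions added for illustration.

```python
from itertools import product

S1 = ["T1", "T2", "T3"]
S2 = ["T4", "T5", "T6"]
TAU = 2

# Upper triangle of the distance table from the example.
DIST = {
    ("T1", "T2"): 4, ("T1", "T3"): 1, ("T1", "T4"): 1, ("T1", "T5"): 4, ("T1", "T6"): 1,
    ("T2", "T3"): 4, ("T2", "T4"): 4, ("T2", "T5"): 1, ("T2", "T6"): 4,
    ("T3", "T4"): 1, ("T3", "T5"): 4, ("T3", "T6"): 1,
    ("T4", "T5"): 4, ("T4", "T6"): 1,
    ("T5", "T6"): 4,
}


def delta(a, b):
    """Look up the (symmetric) distance between two trees by name."""
    if a == b:
        return 0
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]


K = ["T1"]  # reference set
v = {t: [delta(t, k) for k in K] for t in S1 + S2}  # reference vectors

matches, exact_calls = [], 0
for ti, tj in product(S1, S2):
    upper = min(a + b for a, b in zip(v[ti], v[tj]))
    lower = max(abs(a - b) for a, b in zip(v[ti], v[tj]))
    if upper <= TAU:
        matches.append((ti, tj))      # decided by the upper bound
    elif lower > TAU:
        continue                      # pruned by the lower bound
    else:
        exact_calls += 1              # bounds undecided: compute the distance
        if delta(ti, tj) <= TAU:
            matches.append((ti, tj))
```

With this data, eight of the nine pairs are decided by the bounds. Only the pair (T2, T5) needs an exact distance computation, because cluster $C_2$ contains no reference tree.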


Reference Set — Tradeoff

Small reference set: efficient reference vector computation

- to construct the vectors, we have to compute $|S|$ distances for each additional tree in the reference set
- for small reference sets the construction of the vectors is cheaper

Large reference set: effective filters

- large reference sets make $u_t$ and $l_t$ more effective
- thus, once the reference vectors are computed, fewer distance computations are needed

Where is the optimum?


Well Separated Clusters

We cluster the set S = S 1 ∪ S 2 .

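The clusters are well separated for threshold $\tau$ if

- trees within a cluster have small distance (within $\tau/2$)
- trees from different clusters have large distance (more than $3\tau/2$)

Example: Two well separated clusters $C_1$ and $C_2$.

[Figure: clusters $C_1$ and $C_2$; distances within a cluster are at most $\tau/2$, the distance between the clusters is more than $3\tau/2$]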
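Whether a given clustering is well separated can be tested directly from the definition. The sketch below assumes the thresholds $\tau/2$ within a cluster and $3\tau/2$ between clusters, which are the values the pruning arguments on the following slides rely on; the function name and the one-dimensional toy data are hypothetical.

```python
from itertools import combinations


def well_separated(clusters, dist, tau):
    """Check the two conditions for well separated clusters.

    Assumed thresholds: at most tau/2 within a cluster,
    more than 3*tau/2 between different clusters."""
    # Condition 1: every pair inside a cluster is within tau/2.
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            if dist(a, b) > tau / 2:
                return False
    # Condition 2: every pair across two clusters is more than 3*tau/2 apart.
    for c1, c2 in combinations(clusters, 2):
        for a in c1:
            for b in c2:
                if dist(a, b) <= 1.5 * tau:
                    return False
    return True


# Hypothetical toy data: points on a line with the absolute difference as metric.
clusters = [[0.0, 1.0], [10.0, 10.5]]
separated_for_tau_2 = well_separated(clusters, lambda x, y: abs(x - y), tau=2)
separated_for_tau_8 = well_separated(clusters, lambda x, y: abs(x - y), tau=8)
```

The same clustering can be well separated for one threshold and not for another, as the two calls show: for the larger threshold the clusters are no longer far enough apart.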


Trees within the Same Cluster — Upper Bound Applies

Upper bound: $\delta_t(T_i, T_j) \le v_{il} + v_{jl} = \delta_t(T_i, k_l) + \delta_t(k_l, T_j)$

Assumption:

- $T_i \in S_1$ and $T_j \in S_2$ are in the same cluster $C$.
- The clusters are well separated.
- The cluster $C$ contains a reference tree $k_l \in K$.

In this case $v_{il} \le \tau/2$ and $v_{jl} \le \tau/2 \Rightarrow \delta_t(T_i, T_j) \le \tau$.

Result: If $T_i$, $T_j$, and $k_l$ are in the same cluster

- from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ match
- thus we need not compute $\delta_t(T_i, T_j)$

[Figure: $T_i$, $T_j$, and $k_l$ inside one cluster of extent $\tau/2$, connected by the distances $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$]


Trees in Different Clusters — Lower Bound Applies

Lower bound: $\delta_t(T_i, T_j) \ge |v_{il} - v_{jl}| = |\delta_t(T_i, k_l) - \delta_t(k_l, T_j)|$

Assumption:

- $T_i \in S_1$ is in cluster $C_1$, $T_j \in S_2$ is in cluster $C_2$.
- The clusters are well separated.
- The cluster $C_1$ contains a reference tree $k_l \in K$.

In this case $v_{il} \le \tau/2$ and $v_{jl} > 3\tau/2 \Rightarrow \delta_t(T_i, T_j) > \tau$.

Result: If $T_i$ and $k_l$ are in the same cluster, but $T_j$ is in a different cluster

- from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ do not match
- thus we need not compute $\delta_t(T_i, T_j)$

[Figure: $T_i$ and $k_l$ in cluster $C_1$, $T_j$ in cluster $C_2$; each cluster has extent $\tau/2$ and the clusters are more than $3\tau/2$ apart; the distances $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$ are drawn as edges]
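The step from the cluster assumptions to the conclusion can be written out explicitly. This is a sketch that assumes the well-separatedness thresholds are $\tau/2$ within a cluster and $3\tau/2$ between clusters, so that $v_{il} \le \tau/2$ and $v_{jl} > 3\tau/2$.

```latex
\begin{align*}
\delta_t(T_i, T_j) &\ge |v_{il} - v_{jl}| \\
                   &\ge v_{jl} - v_{il} \\
                   &> \tfrac{3\tau}{2} - \tfrac{\tau}{2} = \tau
\end{align*}
```

With the values of the earlier join example ($\tau = 2$, $v_{il} = 1$, $v_{jl} = 4$) this gives $\delta_t(T_i, T_j) \ge 3 > 2$, so the pair is pruned.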


Optimum — Well Separated Clusters

Optimum:

clusters are well separated

the reference set contains one tree per cluster

Guha et al. [GJK+02] find clusters by sampling and estimate a reference set size.


Combined Approach

We combine the previous approaches:

- lower bound with traversal strings
- upper bound with constrained edit distance
- reference sets

Instead of one vector $v_i$ we compute two vectors for each tree $T_i$:

- lower bound: $v_i^{\mathrm{l}}$ contains the traversal string distance
- upper bound: $v_i^{\mathrm{u}}$ contains the constrained edit distance


Metric Space

Metric space of the reference set:

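- the elements of the reference set define the basis of a metric space
- each tree $T_k$ is represented by a rectangle in this metric space
- $v_k^{\mathrm{l}}$ and $v_k^{\mathrm{u}}$ are two opposite corners of this rectangle

Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.

[Figure: $T_i$ and $T_j$ drawn as rectangles with opposite corners $(v_{i1}^{\mathrm{l}}, v_{i2}^{\mathrm{l}})$, $(v_{i1}^{\mathrm{u}}, v_{i2}^{\mathrm{u}})$ and $(v_{j1}^{\mathrm{l}}, v_{j2}^{\mathrm{l}})$, $(v_{j1}^{\mathrm{u}}, v_{j2}^{\mathrm{u}})$; the axes are $v_{k1}^{\mathrm{l/u}}$ and $v_{k2}^{\mathrm{l/u}}$]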


Combined Approach: Triangle Inequality

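The triangle inequality changes as follows:

(a) For all $1 \le l \le |K|$

$$\delta_t(T_i, T_j) \le v_{il}^{\mathrm{u}} + v_{jl}^{\mathrm{u}}$$

(b) For all $1 \le l \le |K|$

$$\delta_t(T_i, T_j) \ge
\begin{cases}
v_{jl}^{\mathrm{l}} - v_{il}^{\mathrm{u}} & \text{if } v_{jl}^{\mathrm{l}} > v_{il}^{\mathrm{u}} \\
v_{il}^{\mathrm{l}} - v_{jl}^{\mathrm{u}} & \text{if } v_{il}^{\mathrm{l}} > v_{jl}^{\mathrm{u}} \\
0 & \text{otherwise}
\end{cases}$$

Note:

- If $v_{jl}^{\mathrm{l}} > v_{il}^{\mathrm{u}}$ or $v_{il}^{\mathrm{l}} > v_{jl}^{\mathrm{u}}$, then $[v_{il}^{\mathrm{l}}, v_{il}^{\mathrm{u}}]$ and $[v_{jl}^{\mathrm{l}}, v_{jl}^{\mathrm{u}}]$ are disjoint intervals.
- In all other cases we cannot give a lower bound (other than 0).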
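The reasoning behind both inequalities can be spelled out as follows. The sketch assumes that the two stored values bracket the true tree edit distance to the reference tree, that is, $v_{il}^{\mathrm{l}} \le \delta_t(T_i, k_l) \le v_{il}^{\mathrm{u}}$ and likewise for $T_j$; this is what using the traversal string distance as a lower bound and the constrained edit distance as an upper bound provides.

```latex
\begin{align*}
\text{(a)}\quad \delta_t(T_i, T_j)
  &\le \delta_t(T_i, k_l) + \delta_t(k_l, T_j)
   \le v_{il}^{\mathrm{u}} + v_{jl}^{\mathrm{u}} \\[4pt]
\text{(b)}\quad \delta_t(T_i, T_j)
  &\ge \delta_t(T_j, k_l) - \delta_t(T_i, k_l)
   \ge v_{jl}^{\mathrm{l}} - v_{il}^{\mathrm{u}}
\end{align*}
```

The bound in (b) is only useful when the right-hand side is positive, which is the first case of the case distinction; swapping the roles of $T_i$ and $T_j$ yields the second case.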


Combined Approach: Upper and Lower Bounds

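Upper bound:

$$u_t(T_i, T_j) = \min_{1 \le l \le |K|} \left( v_{il}^{\mathrm{u}} + v_{jl}^{\mathrm{u}} \right)$$

Lower bound:

$$l_t(T_i, T_j) = \max_{1 \le l \le |K|}
\begin{cases}
v_{jl}^{\mathrm{l}} - v_{il}^{\mathrm{u}} & \text{if } v_{jl}^{\mathrm{l}} > v_{il}^{\mathrm{u}} \\
v_{il}^{\mathrm{l}} - v_{jl}^{\mathrm{u}} & \text{if } v_{il}^{\mathrm{l}} > v_{jl}^{\mathrm{u}} \\
0 & \text{otherwise}
\end{cases}$$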
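The combined bounds translate directly into code once each tree carries two vectors. The following is a minimal sketch under the assumption that the lower and upper vectors are given as equally long lists of numbers; how they are computed (traversal string distance, constrained edit distance) is outside the sketch, and the function names are hypothetical.

```python
def combined_upper_bound(up_i, up_j):
    """u_t: smallest sum of the two upper values over all reference trees."""
    return min(a + b for a, b in zip(up_i, up_j))


def combined_lower_bound(lo_i, up_i, lo_j, up_j):
    """l_t: largest gap between the intervals [lo, up] of the two trees
    over all reference trees; 0 if the intervals overlap everywhere."""
    best = 0
    for li, ui, lj, uj in zip(lo_i, up_i, lo_j, up_j):
        if lj > ui:            # interval of T_j lies entirely above that of T_i
            best = max(best, lj - ui)
        elif li > uj:          # interval of T_i lies entirely above that of T_j
            best = max(best, li - uj)
        # otherwise the intervals overlap and contribute only the trivial bound 0
    return best
```

As with a single vector per tree, a pair matches if the upper bound is at most the threshold and is pruned if the lower bound exceeds it; the exact distance is needed only when neither test decides.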


Illustration: Cases for the Lower Bound

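Case $v_{jl}^{\mathrm{l}} > v_{il}^{\mathrm{u}}$: lower bound is $v_{jl}^{\mathrm{l}} - v_{il}^{\mathrm{u}}$

[Figure: on the axis $v_{kl}$ the values appear in the order $v_{il}^{\mathrm{l}}$, $v_{il}^{\mathrm{u}}$, $v_{jl}^{\mathrm{l}}$, $v_{jl}^{\mathrm{u}}$; the lower bound is the gap between $v_{il}^{\mathrm{u}}$ and $v_{jl}^{\mathrm{l}}$]

Case $v_{il}^{\mathrm{l}} > v_{jl}^{\mathrm{u}}$: lower bound is $v_{il}^{\mathrm{l}} - v_{jl}^{\mathrm{u}}$

[Figure: on the axis $v_{kl}$ the values appear in the order $v_{jl}^{\mathrm{l}}$, $v_{jl}^{\mathrm{u}}$, $v_{il}^{\mathrm{l}}$, $v_{il}^{\mathrm{u}}$; the lower bound is the gap between $v_{jl}^{\mathrm{u}}$ and $v_{il}^{\mathrm{l}}$]

All other cases ($[v_{il}^{\mathrm{l}}, v_{il}^{\mathrm{u}}]$ and $[v_{jl}^{\mathrm{l}}, v_{jl}^{\mathrm{u}}]$ overlap): no lower bound

[Figure: on the axis $v_{kl}$ the values appear, for example, in the order $v_{jl}^{\mathrm{l}}$, $v_{il}^{\mathrm{l}}$, $v_{jl}^{\mathrm{u}}$, $v_{il}^{\mathrm{u}}$]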


[GJK+02] Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 287–298, Madison, Wisconsin, 2002. ACM Press.
