Pruning with Reference Sets

Academic year: 2022

Similarity Search

Pruning with Reference Sets

Nikolaus Augsten

nikolaus.augsten@sbg.ac.at

Dept. of Computer Sciences, University of Salzburg

http://dbresearch.uni-salzburg.at

Version January 18, 2017

Winter semester 2016/2017

Outline

1 Reference Sets

Definition

Upper and Lower Bound

Combined Approach

Reference Sets Definition

Reference Sets [?]

Reference sets take advantage of the triangle inequality for metrics.

An extended version of the triangle inequality is:

$|\delta(a, b) - \delta(b, c)| \le \delta(a, c) \le \delta(a, b) + \delta(b, c)$,

where $\delta$ is a metric distance function and $a$, $b$, and $c$ are arbitrary objects of the metric space.
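The inequality can be checked numerically for any metric. The sketch below is only an illustration: it uses the string edit distance as a stand-in metric (the slides work with the tree edit distance), and the names `edit_distance`, `satisfies_extended_triangle`, and the word list are assumptions made for this example.

```python
from itertools import permutations


def edit_distance(s: str, t: str) -> int:
    """Unit-cost string edit distance (insert, delete, rename); a stand-in metric."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # rename a to b, or keep
        prev = curr
    return prev[-1]


def satisfies_extended_triangle(dist, a, b, c) -> bool:
    """Check |d(a,b) - d(b,c)| <= d(a,c) <= d(a,b) + d(b,c)."""
    ab, bc, ac = dist(a, b), dist(b, c), dist(a, c)
    return abs(ab - bc) <= ac <= ab + bc


# Hypothetical sample objects; every ordered triple is checked.
words = ["tree", "three", "free", "trie"]
all_ok = all(satisfies_extended_triangle(edit_distance, a, b, c)
             for a, b, c in permutations(words, 3))
```

Both sides of the inequality are what the reference-set filters rely on: the right side gives an upper bound and the left side a lower bound on a distance that has not been computed.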

Illustration: Triangle Inequality

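[Figure: four arrangements of the points $a$, $b$, $c$.]

$a$, $b$, $c$ form a proper triangle: $\delta(a, c) < \delta(a, b) + \delta(b, c)$

$a$, $b$, $c$ form a proper triangle: $|\delta(a, b) - \delta(b, c)| < \delta(a, c)$

$b$ lies on the line between $a$ and $c$: $\delta(a, c) = \delta(a, b) + \delta(b, c)$

$c$ lies on the line between $a$ and $b$: $|\delta(a, b) - \delta(b, c)| = \delta(a, c)$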

Notation

$S_1$ and $S_2$ are sets of trees (unordered forests)

$K \subseteq S_1 \cup S_2$ is called the reference set

$T_i \in S_1$ and $T_j \in S_2$ are trees

$k_l \in K$, $1 \le l \le |K|$, is a tree of the reference set

Reference Vector

$v_i$ is a vector of size $|K|$ that stores the distance from $T_i \in S_1$ to each tree $k_l \in K$

the $l$-th element of $v_i$ stores the edit distance between $T_i$ and $k_l$: $v_{il} = \delta_t(T_i, k_l)$

$v_j$ is the respective vector for $T_j \in S_2$
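A minimal sketch of the vector construction, assuming an arbitrary distance function is passed in. To keep the example self-contained, integers stand in for trees and the absolute difference stands in for the tree edit distance; `reference_vectors` and `toy_dist` are hypothetical names.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def reference_vectors(trees: Sequence[Hashable],
                      reference_set: Sequence[Hashable],
                      dist: Callable[[Hashable, Hashable], float]
                      ) -> Dict[Hashable, List[float]]:
    """Map each tree T to its vector v with v[l] = dist(T, k_l)."""
    return {t: [dist(t, k) for k in reference_set] for t in trees}


def toy_dist(a: int, b: int) -> int:
    """Stand-in metric (assumption): absolute difference of two integers."""
    return abs(a - b)


# Integers play the role of trees; K = [0, 5] is the reference set.
vectors = reference_vectors([0, 1, 5, 6], [0, 5], toy_dist)
# vectors[1] is [1, 4]: distance 1 to reference 0 and distance 4 to reference 5.
```

Constructing the vectors costs one distance computation per pair of tree and reference tree, which is the cost that the later tradeoff discussion refers to.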

Metric Space

Metric space of the reference set:

the elements of the reference set define the basis of a metric space

the vector $v_k$ represents tree $T_k$ as a point in this space

$v_{kl}$ is the coordinate of the point on the $l$-th axis

Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.

[Figure: $T_i$ and $T_j$ plotted as points with coordinates $(v_{i1}, v_{i2})$ and $(v_{j1}, v_{j2})$ on the axes $v_{k1}$ and $v_{k2}$.]

Reference Sets Upper and Lower Bound

Upper and Lower Bound

From the triangle inequality it follows that for all $1 \le l \le |K|$

$|v_{il} - v_{jl}| \le \delta_t(T_i, T_j) \le v_{il} + v_{jl}$

Upper bound:

$\delta_t(T_i, T_j) \le \min_{1 \le l \le |K|} (v_{il} + v_{jl}) = u_t(T_i, T_j)$

Lower bound:

$\delta_t(T_i, T_j) \ge \max_{1 \le l \le |K|} |v_{il} - v_{jl}| = l_t(T_i, T_j)$

Approximate join: match all trees with $\delta_t(T_i, T_j) \le \tau$

Upper bound: If $u_t(T_i, T_j) \le \tau$, then $T_i$ and $T_j$ match.

Lower bound: If $l_t(T_i, T_j) > \tau$, then $T_i$ and $T_j$ do not match.
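The two bounds and the resulting filter can be sketched as follows. The function names and the three-way result labels are assumptions for illustration; the inputs are reference vectors that have already been computed.

```python
from typing import Sequence


def upper_bound(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """u_t: the smallest sum v_il + v_jl over all reference trees."""
    return min(a + b for a, b in zip(v_i, v_j))


def lower_bound(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """l_t: the largest difference |v_il - v_jl| over all reference trees."""
    return max(abs(a - b) for a, b in zip(v_i, v_j))


def classify(v_i: Sequence[float], v_j: Sequence[float], tau: float) -> str:
    """Decide a pair from its reference vectors alone, if possible."""
    if upper_bound(v_i, v_j) <= tau:
        return "match"
    if lower_bound(v_i, v_j) > tau:
        return "no match"
    return "undecided"  # the exact distance must be computed


near = classify([1, 4], [0, 5], tau=2)    # u_t = min(1, 9) = 1
far = classify([1, 4], [6, 1], tau=2)     # l_t = max(5, 3) = 5
unsure = classify([3, 3], [3, 3], tau=2)  # u_t = 6, l_t = 0
```

Only the pairs that come back as "undecided" require an exact distance computation.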

Example: Similarity Join with Reference Sets

$S_1 = \{T_1, T_2, T_3\}$, $S_2 = \{T_4, T_5, T_6\}$, $\tau = 2$

Reference set $K = \{T_1\}$, $1 \le l \le |K| = 1$

Distances:

      T1  T2  T3  T4  T5  T6
T1     0   4   1   1   4   1
T2         0   4   4   1   4
T3             0   1   4   1
T4                 0   4   1
T5                     0   4
T6                         0

Reference Vectors:

$v_{1,1} = (0)$, $v_{2,1} = (4)$, $v_{3,1} = (1)$, $v_{4,1} = (1)$, $v_{5,1} = (4)$, $v_{6,1} = (1)$

The clusters $C_1 = \{T_1, T_3, T_4, T_6\}$ and $C_2 = \{T_2, T_5\}$ are well separated.
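The example can be worked through in code. The sketch below encodes the distance matrix from the example, builds the reference vectors for K = {T1}, and applies the upper and lower bound to every pair in S1 × S2. Trees are represented by their indices 1 to 6; the variable names are assumptions.

```python
# Upper triangle of the example's distance matrix, keyed by tree indices.
D = {
    (1, 2): 4, (1, 3): 1, (1, 4): 1, (1, 5): 4, (1, 6): 1,
    (2, 3): 4, (2, 4): 4, (2, 5): 1, (2, 6): 4,
    (3, 4): 1, (3, 5): 4, (3, 6): 1,
    (4, 5): 4, (4, 6): 1,
    (5, 6): 4,
}


def dist(a: int, b: int) -> int:
    """Look up the (symmetric) distance between trees a and b."""
    return 0 if a == b else D[(min(a, b), max(a, b))]


S1, S2, K, tau = [1, 2, 3], [4, 5, 6], [1], 2
v = {t: [dist(t, k) for k in K] for t in S1 + S2}

result = {}
for i in S1:
    for j in S2:
        u = min(a + b for a, b in zip(v[i], v[j]))       # upper bound
        l = max(abs(a - b) for a, b in zip(v[i], v[j]))  # lower bound
        if u <= tau:
            result[(i, j)] = "match"
        elif l > tau:
            result[(i, j)] = "no match"
        else:
            result[(i, j)] = "undecided"

undecided = [pair for pair, label in result.items() if label == "undecided"]
# undecided == [(2, 5)]
```

With the single reference tree T1, eight of the nine pairs are decided from the vectors alone. The pair (T2, T5) stays undecided because both trees lie in cluster C2, which contains no reference tree.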

Reference Set — Tradeoff

Small reference set: efficient reference vector computation

we have to compute $|S|$ distances, where $S = S_1 \cup S_2$, for each additional tree in the reference set to construct the vectors

for small reference sets the construction of the vectors is cheaper

Large reference set: effective filters

large reference sets make $u_t$ and $l_t$ tighter, because the minimum and maximum are taken over more reference trees

thus, once the reference vectors are computed, fewer distance computations are needed

Where is the optimum?
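The tradeoff can be made concrete by counting exact distance computations: $|K| \cdot |S|$ for the vectors plus one for every pair the filters leave undecided. The sketch below does this for the six-tree example (distance 1 within a cluster, 4 across clusters, $\tau = 2$). The cost model and the names `CLUSTER`, `dist`, and `exact_computations` are assumptions for illustration, and the count deliberately includes the trivial distance of a reference tree to itself.

```python
# Cluster membership of the six example trees (indices 1..6).
CLUSTER = {1: "C1", 3: "C1", 4: "C1", 6: "C1", 2: "C2", 5: "C2"}


def dist(a: int, b: int) -> int:
    """Example distances: 0 for identical trees, 1 within a cluster, 4 across."""
    if a == b:
        return 0
    return 1 if CLUSTER[a] == CLUSTER[b] else 4


def exact_computations(S1, S2, K, tau) -> int:
    """Distances needed for the vectors plus those for undecided pairs."""
    trees = S1 + S2
    v = {t: [dist(t, k) for k in K] for t in trees}
    undecided = 0
    for i in S1:
        for j in S2:
            u = min(a + b for a, b in zip(v[i], v[j]))
            l = max(abs(a - b) for a, b in zip(v[i], v[j]))
            if u > tau and l <= tau:
                undecided += 1
    return len(K) * len(trees) + undecided


S1, S2, tau = [1, 2, 3], [4, 5, 6], 2
naive = len(S1) * len(S2)                            # 9
cost_one_ref = exact_computations(S1, S2, [1], tau)      # 6 + 1 = 7
cost_two_refs = exact_computations(S1, S2, [1, 2], tau)  # 12 + 0 = 12
```

On this tiny input the second reference tree removes the last undecided pair but costs six extra distance computations, so it does not pay off. The vector cost grows linearly in $|S|$ while the number of pairs grows with $|S_1| \cdot |S_2|$, so the balance shifts for larger inputs.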

Well Separated Clusters

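We cluster the set $S = S_1 \cup S_2$.

The clusters are well separated for threshold $\tau$ if

trees within a cluster have small distance (within $\tau/2$)

trees from different clusters have large distance (more than $3\tau/2$)

Example: Two well separated clusters $C_1$ and $C_2$.

[Figure: two clusters $C_1$ and $C_2$; distances within each cluster are at most $\tau/2$, distances between the clusters exceed $3\tau/2$.]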
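The two conditions can be tested directly on a given clustering. The sketch below assumes the thresholds $\tau/2$ within a cluster and $3\tau/2$ between clusters (the latter is the value that makes the lower-bound argument yield a distance above $\tau$). Integers with the absolute difference stand in for trees and the tree edit distance; `well_separated` is a hypothetical helper.

```python
from itertools import combinations
from typing import Callable, Sequence


def well_separated(clusters: Sequence[Sequence[int]],
                   dist: Callable[[int, int], float],
                   tau: float) -> bool:
    """True if all intra-cluster distances are <= tau/2 and all
    inter-cluster distances are > 3*tau/2."""
    for cluster in clusters:
        if any(dist(a, b) > tau / 2 for a, b in combinations(cluster, 2)):
            return False
    for c1, c2 in combinations(clusters, 2):
        if any(dist(a, b) <= 1.5 * tau for a in c1 for b in c2):
            return False
    return True


def toy_dist(a: int, b: int) -> int:
    """Stand-in metric (assumption): absolute difference."""
    return abs(a - b)


good = well_separated([[0, 1], [10, 11]], toy_dist, tau=2)    # gap 9 > 3
too_close = well_separated([[0, 1], [3, 4]], toy_dist, tau=2)  # gap 2 <= 3
too_wide = well_separated([[0, 2], [10, 11]], toy_dist, tau=2)  # width 2 > 1
```

Note that this check itself needs all pairwise distances, so it is useful for understanding the definition rather than as part of the join.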

Trees within the Same Cluster — Upper Bound Applies

Upper bound: $\delta_t(T_i, T_j) \le v_{il} + v_{jl} = \delta_t(T_i, k_l) + \delta_t(k_l, T_j)$

Assumption: $T_i \in S_1$ and $T_j \in S_2$ are in the same cluster $C$.

The clusters are well separated.

The cluster $C$ contains a reference tree $k_l \in K$.

In this case $v_{il} \le \tau/2$ and $v_{jl} \le \tau/2$ $\Rightarrow$ $\delta_t(T_i, T_j) \le v_{il} + v_{jl} \le \tau$.

Result: If $T_i$, $T_j$, and $k_l$ are in the same cluster

from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ match

thus we need not compute $\delta_t(T_i, T_j)$

[Figure: $T_i$, $T_j$, and the reference tree $k_l$ in one cluster labeled $\tau/2$; the edges $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$ connect the three trees.]

Trees in Different Clusters — Lower Bound Applies

Lower bound: $\delta_t(T_i, T_j) \ge |v_{il} - v_{jl}| = |\delta_t(T_i, k_l) - \delta_t(k_l, T_j)|$

Assumption: $T_i \in S_1$ is in cluster $C_1$, $T_j \in S_2$ is in cluster $C_2$.

The clusters are well separated.

The cluster $C_1$ contains a reference tree $k_l \in K$.

In this case $v_{il} \le \tau/2$ and $v_{jl} > 3\tau/2$ $\Rightarrow$ $\delta_t(T_i, T_j) \ge v_{jl} - v_{il} > \tau$.

Result: If $T_i$ and $k_l$ are in the same cluster, but $T_j$ is in a different cluster

from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ do not match

thus we need not compute $\delta_t(T_i, T_j)$

[Figure: $T_i$ and $k_l$ in cluster $C_1$, $T_j$ in cluster $C_2$; both clusters are labeled $\tau/2$ and lie more than $3\tau/2$ apart; the edges $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$ connect the three trees.]

Optimum — Well Separated Clusters

Optimum:

clusters are well separated

the reference set contains one tree per cluster

Guha et al. [?] find clusters by sampling and estimate a reference set size.
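To see what "one tree per cluster" means operationally, the following naive sketch picks a new reference tree whenever a tree is farther than $\tau/2$ from all references chosen so far. This is not the sampling method of Guha et al., only a simple greedy illustration; integers with the absolute difference stand in for trees and the tree edit distance, and `greedy_reference_set` is a hypothetical name.

```python
from typing import Callable, List, Sequence


def greedy_reference_set(trees: Sequence[int],
                         dist: Callable[[int, int], float],
                         tau: float) -> List[int]:
    """Scan the trees once; a tree becomes a reference tree if no
    previously chosen reference tree is within tau/2 of it."""
    refs: List[int] = []
    for t in trees:
        if all(dist(t, k) > tau / 2 for k in refs):
            refs.append(t)
    return refs


def toy_dist(a: int, b: int) -> int:
    """Stand-in metric (assumption): absolute difference."""
    return abs(a - b)


# Two well separated clusters {0, 1} and {10, 11} for tau = 2.
refs = greedy_reference_set([0, 10, 1, 11], toy_dist, tau=2)
# refs == [0, 10]
```

If the clusters are well separated, the first tree seen from each cluster becomes its reference tree and all later trees of that cluster are skipped, so the result contains exactly one tree per cluster. The scan itself can cost up to $|K| \cdot |S|$ distance computations.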

Reference Sets Combined Approach

Combined Approach

We combine the previous approaches:

lower bound with traversal strings

upper bound with constrained edit distance

reference sets

Instead of one vector $v_i$ we compute two vectors for each tree $T_i$:

lower bound: $v_i^{l}$ contains the traversal string distance

upper bound: $v_i^{u}$ contains the constrained edit distance
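A minimal sketch of the two-vector construction. The lower-bound and upper-bound distance functions are passed in as parameters; the toy functions below are placeholders (assumptions) that merely bracket a stand-in true distance and are not the traversal string distance or the constrained edit distance.

```python
from typing import Callable, List, Sequence, Tuple


def bound_vectors(tree: int,
                  reference_set: Sequence[int],
                  lower_dist: Callable[[int, int], float],
                  upper_dist: Callable[[int, int], float]
                  ) -> Tuple[List[float], List[float]]:
    """Return (v_lower, v_upper) for one tree: per reference tree k_l,
    a lower and an upper bound on the exact distance to k_l."""
    v_lower = [lower_dist(tree, k) for k in reference_set]
    v_upper = [upper_dist(tree, k) for k in reference_set]
    return v_lower, v_upper


# Placeholder distances (assumptions): the "true" distance is |a - b|.
def true_dist(a: int, b: int) -> int:
    return abs(a - b)


def toy_lower(a: int, b: int) -> int:
    return abs(a - b) // 2   # never exceeds the true distance


def toy_upper(a: int, b: int) -> int:
    return 2 * abs(a - b)    # never falls below the true distance


K = [0, 10]
v_lower, v_upper = bound_vectors(7, K, toy_lower, toy_upper)
# v_lower == [3, 1], v_upper == [14, 6]
```

The point of using two cheaper distances is that the exact tree edit distance to the reference trees no longer has to be computed when the vectors are built.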

Metric Space

Metric space of the reference set:

the elements of the reference set define the basis of a metric space

each tree $T_k$ is represented by a rectangle in this metric space

$v_k^{l}$ and $v_k^{u}$ are two opposite corners of this rectangle

Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.

[Figure: $T_i$ and $T_j$ drawn as rectangles; on the axis $v^{l/u}_{k2}$, $T_i$ spans $[v^{l}_{i2}, v^{u}_{i2}]$ and $T_j$ spans $[v^{l}_{j2}, v^{u}_{j2}]$.]

Combined Approach: Triangle Inequality

The triangle inequality changes as follows:

(a) For all $1 \le l \le |K|$

$\delta_t(T_i, T_j) \le v^{u}_{il} + v^{u}_{jl}$

(b) For all $1 \le l \le |K|$

$\delta_t(T_i, T_j) \ge \begin{cases} v^{l}_{jl} - v^{u}_{il} & \text{if } v^{l}_{jl} > v^{u}_{il} \\ v^{l}_{il} - v^{u}_{jl} & \text{if } v^{l}_{il} > v^{u}_{jl} \\ 0 & \text{otherwise} \end{cases}$

Note:

If $v^{l}_{jl} > v^{u}_{il}$ or $v^{l}_{il} > v^{u}_{jl}$, then $[v^{l}_{il}, v^{u}_{il}]$ and $[v^{l}_{jl}, v^{u}_{jl}]$ are disjoint intervals.

In all other cases we cannot give a lower bound (other than 0).
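Both parts follow from the ordinary triangle inequality once the exact distances to the reference tree are replaced by their bounds. The derivation below is a sketch under the assumption that the two vectors bracket the exact distance, i.e. $v^{l}_{il} \le \delta_t(T_i, k_l) \le v^{u}_{il}$ for every tree and reference tree.

```latex
% Assumption: v^{l}_{il} \le \delta_t(T_i, k_l) \le v^{u}_{il}, and likewise for T_j.
\begin{align*}
\delta_t(T_i, T_j) &\le \delta_t(T_i, k_l) + \delta_t(k_l, T_j) \le v^{u}_{il} + v^{u}_{jl} \\
\delta_t(T_i, T_j) &\ge \delta_t(T_j, k_l) - \delta_t(T_i, k_l) \ge v^{l}_{jl} - v^{u}_{il} \\
\delta_t(T_i, T_j) &\ge \delta_t(T_i, k_l) - \delta_t(T_j, k_l) \ge v^{l}_{il} - v^{u}_{jl}
\end{align*}
```

The last two lines are only informative when their right-hand side is positive, which is exactly the case of disjoint intervals.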

Combined Approach: Upper and Lower Bounds

Upper bound:

$u_t(T_i, T_j) = \min_{1 \le l \le |K|} (v^{u}_{il} + v^{u}_{jl})$

Lower bound:

$l_t(T_i, T_j) = \max_{1 \le l \le |K|} \begin{cases} v^{l}_{jl} - v^{u}_{il} & \text{if } v^{l}_{jl} > v^{u}_{il} \\ v^{l}_{il} - v^{u}_{jl} & \text{if } v^{l}_{il} > v^{u}_{jl} \\ 0 & \text{otherwise} \end{cases}$
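The combined bounds translate directly into code. The sketch below takes the lower-bound and upper-bound vectors of two trees and evaluates $u_t$ and $l_t$; the function names and the sample vectors are assumptions for illustration.

```python
from typing import Sequence


def combined_upper_bound(vu_i: Sequence[float], vu_j: Sequence[float]) -> float:
    """u_t: minimum over all reference trees of vu_il + vu_jl."""
    return min(a + b for a, b in zip(vu_i, vu_j))


def combined_lower_bound(vl_i: Sequence[float], vu_i: Sequence[float],
                         vl_j: Sequence[float], vu_j: Sequence[float]) -> float:
    """l_t: maximum over all reference trees of the gap between the
    intervals [vl_il, vu_il] and [vl_jl, vu_jl]; 0 where they overlap."""
    best = 0
    for li, ui, lj, uj in zip(vl_i, vu_i, vl_j, vu_j):
        if lj > ui:      # interval of T_j lies entirely above that of T_i
            best = max(best, lj - ui)
        elif li > uj:    # interval of T_i lies entirely above that of T_j
            best = max(best, li - uj)
    return best


# Hypothetical vectors for two trees and two reference trees.
vl_i, vu_i = [1, 5], [2, 7]
vl_j, vu_j = [4, 6], [6, 9]
u = combined_upper_bound(vu_i, vu_j)              # min(2 + 6, 7 + 9) = 8
l = combined_lower_bound(vl_i, vu_i, vl_j, vu_j)  # gap 4 - 2 = 2 on axis 1
```

On the first axis the intervals [1, 2] and [4, 6] are disjoint and contribute a gap of 2; on the second axis [5, 7] and [6, 9] overlap and contribute 0.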

Illustration: Cases for the Lower Bound

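Case $v^{l}_{jl} > v^{u}_{il}$: lower bound is $v^{l}_{jl} - v^{u}_{il}$

[Figure: on the axis $v_{kl}$, the interval $[v^{l}_{il}, v^{u}_{il}]$ lies entirely to the left of $[v^{l}_{jl}, v^{u}_{jl}]$; the gap between them is the lower bound.]

Case $v^{l}_{il} > v^{u}_{jl}$: lower bound is $v^{l}_{il} - v^{u}_{jl}$

[Figure: on the axis $v_{kl}$, the interval $[v^{l}_{jl}, v^{u}_{jl}]$ lies entirely to the left of $[v^{l}_{il}, v^{u}_{il}]$; the gap between them is the lower bound.]

All other cases ($[v^{l}_{il}, v^{u}_{il}]$ and $[v^{l}_{jl}, v^{u}_{jl}]$ overlap): no lower bound

[Figure: on the axis $v_{kl}$, the two intervals overlap; the points appear in the order $v^{l}_{jl}$, $v^{l}_{il}$, $v^{u}_{jl}$, $v^{u}_{il}$.]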
