Pruning with Reference Sets

Similarity Search

Nikolaus Augsten

nikolaus.augsten@sbg.ac.at

Department of Computer Sciences

University of Salzburg

http://dbresearch.uni-salzburg.at

WS 2019/20

Version February 24, 2020

Outline

1 Reference Sets
    Definition
    Upper and Lower Bound
    Combined Approach

Reference Sets [GJK+02]

Reference sets take advantage of the triangle inequality for metrics.

An extended version of the triangle inequality is:

$$|\delta(a,b) - \delta(b,c)| \le \delta(a,c) \le \delta(a,b) + \delta(b,c),$$

where δ() is a metric distance function computed between a, b, and c.
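The left-hand inequality is not one of the usual metric axioms. A short derivation, assuming only symmetry and the standard triangle inequality, could read as follows:

```latex
\begin{align*}
\delta(a,b) &\le \delta(a,c) + \delta(c,b)
  &&\Rightarrow\quad \delta(a,b) - \delta(b,c) \le \delta(a,c),\\
\delta(b,c) &\le \delta(b,a) + \delta(a,c)
  &&\Rightarrow\quad \delta(b,c) - \delta(a,b) \le \delta(a,c),\\
&&&\Rightarrow\quad |\delta(a,b) - \delta(b,c)| \le \delta(a,c).
\end{align*}
```

Both lines use symmetry, $\delta(c,b) = \delta(b,c)$ and $\delta(b,a) = \delta(a,b)$; the last line combines the two bounds.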

Illustration: Triangle Inequality

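[Figure: four configurations of the points a, b, and c]

a, b, c form a triangle: $\delta(a,c) < \delta(a,b) + \delta(b,c)$

a, b, c form a triangle: $|\delta(a,b) - \delta(b,c)| < \delta(a,c)$

b lies on the line segment between a and c: $\delta(a,c) = \delta(a,b) + \delta(b,c)$

c lies on the line segment between a and b: $|\delta(a,b) - \delta(b,c)| = \delta(a,c)$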

Notation

$S_1$ and $S_2$ are sets of trees (unordered forests)

$K \subseteq S_1 \cup S_2$ is called the reference set

$T_i \in S_1$ and $T_j \in S_2$ are trees

$k_l \in K$, $1 \le l \le |K|$, is a tree of the reference set

Reference Vector

$v_i$ is a vector of size $|K|$ that stores the distance from $T_i \in S_1$ to each tree $k_l \in K$

the $l$-th element of $v_i$ stores the edit distance between $T_i$ and $k_l$: $v_{il} = \delta_t(T_i, k_l)$

$v_j$ is the respective vector for $T_j \in S_2$
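The construction of the reference vectors can be sketched in a few lines. This is an illustration only: `reference_vectors` is a hypothetical helper name, and the toy data uses integers with $|x - y|$ as a stand-in metric, not a tree edit distance.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def reference_vectors(
    trees: Sequence[T],
    reference_set: Sequence[T],
    dist: Callable[[T, T], float],
) -> list[list[float]]:
    """Return one vector per tree; entry l is dist(tree, reference_set[l])."""
    return [[dist(t, k) for k in reference_set] for t in trees]


# Toy stand-in (assumption): integers with |x - y| as the metric.
toy_trees = [0, 4, 1]
toy_refs = [0, 4]
vectors = reference_vectors(toy_trees, toy_refs, lambda x, y: abs(x - y))
print(vectors)  # [[0, 4], [4, 0], [1, 3]]
```

Each vector costs $|K|$ distance computations, so the whole table costs $|K|$ times the number of trees.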

Metric Space

Metric space of the reference set:

the elements of the reference set define the basis of a metric space

the vector $v_k$ represents tree $T_k$ as a point in this space

$v_{kl}$ is the coordinate of the point on the $l$-th axis

Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.

[Figure: axes $v_{k1}$ and $v_{k2}$; $T_i$ is plotted at $(v_{i1}, v_{i2})$ and $T_j$ at $(v_{j1}, v_{j2})$]

Upper and Lower Bound

From the triangle inequality it follows that for all $1 \le l \le |K|$

$$|v_{il} - v_{jl}| \le \delta_t(T_i, T_j) \le v_{il} + v_{jl}$$

Upper bound:

$$\delta_t(T_i, T_j) \le \min_{l,\, 1 \le l \le |K|} (v_{il} + v_{jl}) = u_t(T_i, T_j)$$

Lower bound:

$$\delta_t(T_i, T_j) \ge \max_{l,\, 1 \le l \le |K|} |v_{il} - v_{jl}| = l_t(T_i, T_j)$$

Approximate join: match all trees with $\delta_t(T_i, T_j) \le \tau$

Upper bound: If $u_t(T_i, T_j) \le \tau$, then $T_i$ and $T_j$ match.

Lower bound: If $l_t(T_i, T_j) > \tau$, then $T_i$ and $T_j$ do not match.
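The two filters translate directly into code. The sketch below assumes the reference vectors are already available as equally long, non-empty sequences; the function names `upper_bound`, `lower_bound`, and `filter_pair` are chosen here for illustration.

```python
from typing import Sequence


def upper_bound(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """u_t(T_i, T_j): minimum over l of v_il + v_jl (requires |K| >= 1)."""
    return min(a + b for a, b in zip(v_i, v_j))


def lower_bound(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """l_t(T_i, T_j): maximum over l of |v_il - v_jl| (requires |K| >= 1)."""
    return max(abs(a - b) for a, b in zip(v_i, v_j))


def filter_pair(v_i: Sequence[float], v_j: Sequence[float], tau: float) -> str:
    """Decide a pair from its reference vectors alone, if possible."""
    if upper_bound(v_i, v_j) <= tau:
        return "match"
    if lower_bound(v_i, v_j) > tau:
        return "no match"
    return "undecided"  # the exact distance must still be computed


print(filter_pair([1, 5], [1, 6], tau=2))  # match
print(filter_pair([0, 4], [4, 0], tau=2))  # no match
print(filter_pair([4], [4], tau=2))        # undecided
```

Only pairs reported as undecided require a call to the expensive tree edit distance.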

Example: Similarity Join with Reference Sets

$S_1 = \{T_1, T_2, T_3\}$, $S_2 = \{T_4, T_5, T_6\}$, $\tau = 2$

Reference set $K = \{T_1\}$, $1 \le l \le |K| = 1$

Distances:

        T_1  T_2  T_3  T_4  T_5  T_6
  T_1    0    4    1    1    4    1
  T_2         0    4    4    1    4
  T_3              0    1    4    1
  T_4                   0    4    1
  T_5                        0    4
  T_6                             0

Reference Vectors:

$v_{1,1} = (0)$, $v_{2,1} = (4)$, $v_{3,1} = (1)$, $v_{4,1} = (1)$, $v_{5,1} = (4)$, $v_{6,1} = (1)$

The clusters $C_1 = \{T_1, T_3, T_4, T_6\}$ and $C_2 = \{T_2, T_5\}$ are well separated.
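To see how much work the filters save on this example, the join can be replayed in a short script. This is a sketch under the assumption that the table is symmetric and that trees are identified by their index; `DIST`, `delta`, and `exact_calls` are names introduced here.

```python
# Upper-triangular distances from the example table, keyed by (i, j) with i < j.
DIST = {
    (1, 2): 4, (1, 3): 1, (1, 4): 1, (1, 5): 4, (1, 6): 1,
    (2, 3): 4, (2, 4): 4, (2, 5): 1, (2, 6): 4,
    (3, 4): 1, (3, 5): 4, (3, 6): 1,
    (4, 5): 4, (4, 6): 1,
    (5, 6): 4,
}


def delta(i: int, j: int) -> int:
    """Look up the (symmetric) tree distance between T_i and T_j."""
    return 0 if i == j else DIST[(min(i, j), max(i, j))]


S1, S2, K, TAU = [1, 2, 3], [4, 5, 6], [1], 2
vec = {i: [delta(i, k) for k in K] for i in S1 + S2}  # reference vectors

matches, exact_calls = [], 0
for i in S1:
    for j in S2:
        ub = min(a + b for a, b in zip(vec[i], vec[j]))
        lb = max(abs(a - b) for a, b in zip(vec[i], vec[j]))
        if ub <= TAU:                 # upper bound filter: certain match
            matches.append((i, j))
        elif lb > TAU:                # lower bound filter: certain non-match
            continue
        else:                         # undecided: fall back to exact distance
            exact_calls += 1
            if delta(i, j) <= TAU:
                matches.append((i, j))

print(matches)      # [(1, 4), (1, 6), (2, 5), (3, 4), (3, 6)]
print(exact_calls)  # 1
```

Eight of the nine candidate pairs are decided by the bounds. Only $(T_2, T_5)$ needs the exact distance, because both trees are far from the single reference tree $T_1$ and no reference tree lies in their cluster.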

Reference Set — Tradeoff

Small reference set: efficient reference vector computation
  we have to compute $|S|$ distances for each additional tree in the reference set to construct the vectors
  for small reference sets the construction of the vectors is cheaper

Large reference set: effective filters
  large reference sets make $u_t$ and $l_t$ more effective
  thus, once the reference vectors are computed, fewer distance computations are needed

Where is the optimum?
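One way to make the tradeoff concrete is a rough count of exact distance computations. This is a sketch and not a formula from [GJK+02]: the symbol $U(K)$, the set of pairs that neither bound decides, is introduced here for illustration.

```latex
\begin{align*}
\mathrm{cost}(K) &= \underbrace{|K| \cdot |S|}_{\text{build the vectors}}
                  + \underbrace{|U(K)|}_{\text{undecided pairs}},
  \qquad S = S_1 \cup S_2,\\
U(K) &= \{(T_i, T_j) \in S_1 \times S_2 :
          l_t(T_i, T_j) \le \tau < u_t(T_i, T_j)\}.
\end{align*}
```

Without a reference set every pair is undecided and the cost is $|S_1| \cdot |S_2|$. Adding a reference tree increases the first term by $|S|$ and can only shrink $U(K)$, since the minimum and maximum are then taken over more coordinates.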

Well Separated Clusters

We cluster the set $S = S_1 \cup S_2$.

The clusters are well separated for threshold $\tau$ if
  trees within a cluster have small distance (within $\tau/2$)
  trees from different clusters have large distance (more than $3\tau/2$)

Example: Two well separated clusters $C_1$ and $C_2$.

[Figure: clusters $C_1$ and $C_2$, each of diameter $\le \tau/2$, separated by a distance $> 3\tau/2$]
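The two conditions can be checked mechanically for a given clustering. The following sketch assumes hashable items and a symmetric distance function; `well_separated` is a hypothetical helper, and the usage lines again use integers with $|x - y|$ as a stand-in for trees and tree edit distance.

```python
from itertools import combinations
from typing import Callable, Hashable, Sequence


def well_separated(
    clusters: Sequence[Sequence[Hashable]],
    dist: Callable[[Hashable, Hashable], float],
    tau: float,
) -> bool:
    """True iff intra-cluster distances are <= tau/2 and
    inter-cluster distances are > 3*tau/2."""
    label = {t: c for c, members in enumerate(clusters) for t in members}
    for a, b in combinations(label, 2):
        d = dist(a, b)
        if label[a] == label[b]:
            if d > tau / 2:
                return False
        elif d <= 3 * tau / 2:
            return False
    return True


# Toy stand-in (assumption): integers with |x - y| as the metric.
toy_dist = lambda x, y: abs(x - y)
print(well_separated([[0, 1], [10, 11]], toy_dist, tau=2))  # True
print(well_separated([[0, 1], [10, 11]], toy_dist, tau=1))  # False
```

The check inspects all pairs, so it is meant for small illustrations rather than for the join itself.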

Trees within the Same Cluster — Upper Bound Applies

Upper bound: $\delta_t(T_i, T_j) \le v_{il} + v_{jl} = \delta_t(T_i, k_l) + \delta_t(k_l, T_j)$

Assumption:
  $T_i \in S_1$ and $T_j \in S_2$ are in the same cluster $C$.
  The clusters are well separated.
  The cluster $C$ contains a reference tree $k_l \in K$.

In this case $v_{il} \le \tau/2$ and $v_{jl} \le \tau/2$ $\Rightarrow$ $\delta_t(T_i, T_j) \le \tau$.

Result: If $T_i$, $T_j$, and $k_l$ are in the same cluster
  from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ match
  thus we need not compute $\delta_t(T_i, T_j)$

[Figure: $T_i$, $T_j$, and $k_l$ lie in one cluster of diameter at most $\tau/2$; the edges are labeled $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$]

Trees in Different Clusters — Lower Bound Applies

Lower bound: $\delta_t(T_i, T_j) \ge |v_{il} - v_{jl}| = |\delta_t(T_i, k_l) - \delta_t(k_l, T_j)|$

Assumption:
  $T_i \in S_1$ is in cluster $C_1$, $T_j \in S_2$ is in cluster $C_2$.
  The clusters are well separated.
  The cluster $C_1$ contains a reference tree $k_l \in K$.

In this case $v_{il} \le \tau/2$ and $v_{jl} > 3\tau/2$, so $|v_{il} - v_{jl}| \ge v_{jl} - v_{il} > \tau$ $\Rightarrow$ $\delta_t(T_i, T_j) > \tau$.

Result: If $T_i$ and $k_l$ are in the same cluster, but $T_j$ is in a different cluster
  from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ do not match
  thus we need not compute $\delta_t(T_i, T_j)$

[Figure: clusters $C_1$ and $C_2$, each of diameter $\le \tau/2$ and more than $3\tau/2$ apart; $T_i$ and $k_l$ lie in $C_1$, $T_j$ lies in $C_2$; the edges are labeled $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$]

Optimum — Well Separated Clusters

Optimum:

clusters are well separated

the reference set contains one tree per cluster

Guha et al. [GJK+02] find clusters by sampling and estimate a reference set size.

Combined Approach

We combine the previous approaches:
  lower bound with traversal strings
  upper bound with constrained edit distance
  reference sets

Instead of one vector $v_i$ we compute two vectors for each tree $T_i$:
  lower bound: $v_i^l$ contains the traversal string distance
  upper bound: $v_i^u$ contains the constrained edit distance

Metric Space

Metric space of the reference set:

the elements of the reference set define the basis of a metric space

each tree $T_k$ is represented by a rectangle in this metric space

$v_k^l$ and $v_k^u$ are two opposite corners of this rectangle

Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.

[Figure: axes $v_{k1}^{l/u}$ and $v_{k2}^{l/u}$; $T_i$ is the rectangle with corners $(v_{i1}^l, v_{i2}^l)$ and $(v_{i1}^u, v_{i2}^u)$, $T_j$ the rectangle with corners $(v_{j1}^l, v_{j2}^l)$ and $(v_{j1}^u, v_{j2}^u)$]

Combined Approach: Triangle Inequality

The triangle inequality changes as follows:

(a) For all $1 \le l \le |K|$

$$\delta_t(T_i, T_j) \le v_{il}^u + v_{jl}^u$$

(b) For all $1 \le l \le |K|$

$$\delta_t(T_i, T_j) \ge
\begin{cases}
v_{jl}^l - v_{il}^u & \text{if } v_{jl}^l > v_{il}^u \\
v_{il}^l - v_{jl}^u & \text{if } v_{il}^l > v_{jl}^u \\
0 & \text{otherwise}
\end{cases}$$

Note:

If $v_{jl}^l > v_{il}^u$ or $v_{il}^l > v_{jl}^u$, then $[v_{il}^l, v_{il}^u]$ and $[v_{jl}^l, v_{jl}^u]$ are disjoint intervals.

In all other cases we cannot give a lower bound (other than 0).

Combined Approach: Upper and Lower Bounds

Upper bound:

$$u_t(T_i, T_j) = \min_{l,\, 1 \le l \le |K|} (v_{il}^u + v_{jl}^u)$$

Lower bound:

$$l_t(T_i, T_j) = \max_{l,\, 1 \le l \le |K|}
\begin{cases}
v_{jl}^l - v_{il}^u & \text{if } v_{jl}^l > v_{il}^u \\
v_{il}^l - v_{jl}^u & \text{if } v_{il}^l > v_{jl}^u \\
0 & \text{otherwise}
\end{cases}$$
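The combined bounds can be sketched as follows. The code assumes that each tree comes with two equally long, non-empty vectors satisfying $v_{il}^l \le \delta_t(T_i, k_l) \le v_{il}^u$; the function names `combined_upper` and `combined_lower` are illustrative.

```python
from typing import Sequence


def combined_upper(vu_i: Sequence[float], vu_j: Sequence[float]) -> float:
    """u_t: minimum over l of v_il^u + v_jl^u (requires |K| >= 1)."""
    return min(a + b for a, b in zip(vu_i, vu_j))


def combined_lower(
    vl_i: Sequence[float], vu_i: Sequence[float],
    vl_j: Sequence[float], vu_j: Sequence[float],
) -> float:
    """l_t: largest gap between disjoint intervals [v^l, v^u]; 0 if all overlap."""
    best = 0
    for li, ui, lj, uj in zip(vl_i, vu_i, vl_j, vu_j):
        if lj > ui:            # interval of T_j lies entirely above that of T_i
            best = max(best, lj - ui)
        elif li > uj:          # interval of T_i lies entirely above that of T_j
            best = max(best, li - uj)
    return best


# Axis 1: [1, 2] vs [5, 7] are disjoint (gap 3); axis 2: [1, 4] vs [3, 6] overlap.
print(combined_lower([1, 1], [2, 4], [5, 3], [7, 6]))  # 3
print(combined_upper([2, 4], [7, 6]))                  # 9
```

With threshold $\tau = 2$ this pair would be pruned by the lower bound, since $3 > 2$.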

Illustration: Cases for the Lower Bound

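Case $v_{jl}^l > v_{il}^u$: lower bound is $v_{jl}^l - v_{il}^u$

[Figure: on the $v_{kl}$ axis the points appear in the order $v_{il}^l$, $v_{il}^u$, $v_{jl}^l$, $v_{jl}^u$; the gap between $v_{il}^u$ and $v_{jl}^l$ is the lower bound]

Case $v_{il}^l > v_{jl}^u$: lower bound is $v_{il}^l - v_{jl}^u$

[Figure: on the $v_{kl}$ axis the points appear in the order $v_{jl}^l$, $v_{jl}^u$, $v_{il}^l$, $v_{il}^u$; the gap between $v_{jl}^u$ and $v_{il}^l$ is the lower bound]

All other cases ($[v_{il}^l, v_{il}^u]$ and $[v_{jl}^l, v_{jl}^u]$ overlap): no lower bound

[Figure: on the $v_{kl}$ axis the points appear in the order $v_{jl}^l$, $v_{il}^l$, $v_{jl}^u$, $v_{il}^u$; the two intervals overlap]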

[GJK+02] Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 287–298, Madison, Wisconsin, 2002. ACM Press.
