Similarity Search
Pruning with Reference Sets
Nikolaus Augsten
nikolaus.augsten@sbg.ac.at
Department of Computer Sciences
University of Salzburg
http://dbresearch.uni-salzburg.at
WS 2021/22
Version January 10, 2022
Outline
1 Reference Sets
  Definition
  Upper and Lower Bound
  Combined Approach
2 Conclusion
Reference Sets Definition
Reference Sets [GJK+02]
Reference sets take advantage of the triangle inequality for metrics.
An extended version of the triangle inequality is:
|δ(a, b) − δ(b, c)| ≤ δ(a, c) ≤ δ(a, b) + δ(b, c),
where δ is a metric distance function and a, b, and c are any three objects in its domain.
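The left-hand inequality is not one of the usual metric axioms. A short derivation, sketched here using only symmetry and the standard triangle inequality, shows where it comes from:

```latex
\begin{align*}
\delta(a,b) &\le \delta(a,c) + \delta(c,b)
  &&\Rightarrow\quad \delta(a,b) - \delta(b,c) \le \delta(a,c),\\
\delta(b,c) &\le \delta(b,a) + \delta(a,c)
  &&\Rightarrow\quad \delta(b,c) - \delta(a,b) \le \delta(a,c),
\end{align*}
and the two lines together give $|\delta(a,b) - \delta(b,c)| \le \delta(a,c)$.
```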
Illustration: Triangle Inequality
a, b, c not on a line: δ(a, c) < δ(a, b) + δ(b, c) and |δ(a, b) − δ(b, c)| < δ(a, c)
b between a and c on a line: δ(a, c) = δ(a, b) + δ(b, c)
c between a and b on a line: |δ(a, b) − δ(b, c)| = δ(a, c)
Notation
S_1 and S_2 are sets of trees (unordered forests)
K ⊆ S_1 ∪ S_2 is called the reference set
T_i ∈ S_1 and T_j ∈ S_2 are trees
k_l ∈ K, 1 ≤ l ≤ |K|, is a tree of the reference set
Reference Vector
v_i is a vector of size |K| that stores the distance from T_i ∈ S_1 to each tree k_l ∈ K
the l-th element of v_i stores the edit distance between T_i and k_l: v_il = δ_t(T_i, k_l)
v_j is the respective vector for T_j ∈ S_2
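The construction of the reference vectors can be sketched in a few lines. The names `reference_vectors` and `toy_dist` are illustrative assumptions, not taken from the slides. To keep the example self-contained, integers with the absolute difference as metric stand in for trees; a real implementation would plug in the tree edit distance δ_t.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def reference_vectors(
    trees: Sequence[T],
    reference_set: Sequence[T],
    dist: Callable[[T, T], float],
) -> List[List[float]]:
    """Return one vector per tree: vector[l] = dist(tree, reference_set[l])."""
    return [[dist(t, k) for k in reference_set] for t in trees]


def toy_dist(a: int, b: int) -> int:
    """Stand-in metric (assumption): absolute difference of integers."""
    return abs(a - b)


S1 = [0, 3, 10]      # stand-ins for the trees of S_1
K = [0, 10]          # reference set, |K| = 2
V1 = reference_vectors(S1, K, toy_dist)
print(V1)  # [[0, 10], [3, 7], [10, 0]]
```

Each vector costs |K| distance computations, so building the vectors for all trees costs |S| · |K| calls to the distance function.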
Interpretation in |K|-Dimensional Space
The reference set defines the basis of a |K|-dimensional space.
Vector v_k represents tree T_k as a point in this space.
v_kl is the coordinate of the point on the l-th axis (1 ≤ l ≤ |K|).
Example: Two trees T_i and T_j represented as points by the reference set K = {k_1, k_2}.
[Figure: T_i at (v_i1, v_i2) and T_j at (v_j1, v_j2), plotted against the axes v_k1 and v_k2.]
Reference Sets Upper and Lower Bound
Upper and Lower Bound
From the triangle inequality it follows that for all 1 ≤ l ≤ |K|
|v_il − v_jl| ≤ δ_t(T_i, T_j) ≤ v_il + v_jl
Upper bound:
δ_t(T_i, T_j) ≤ min_{1 ≤ l ≤ |K|} (v_il + v_jl) = u_t(T_i, T_j)
Lower bound:
δ_t(T_i, T_j) ≥ max_{1 ≤ l ≤ |K|} |v_il − v_jl| = l_t(T_i, T_j)
Approximate join: match all trees with δ_t(T_i, T_j) ≤ τ
Upper bound: If u_t(T_i, T_j) ≤ τ, then T_i and T_j match.
Lower bound: If l_t(T_i, T_j) > τ, then T_i and T_j do not match.
Otherwise (l_t(T_i, T_j) ≤ τ < u_t(T_i, T_j)), δ_t(T_i, T_j) must be computed.
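The two bounds and the resulting three-way decision can be sketched as follows. The function names and the sample vectors are assumptions made for illustration; the vectors are lists of precomputed distances to the reference trees.

```python
from typing import Sequence


def upper_bound(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """u_t: the smallest sum v_il + v_jl over all reference trees."""
    return min(a + b for a, b in zip(v_i, v_j))


def lower_bound(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """l_t: the largest difference |v_il - v_jl| over all reference trees."""
    return max(abs(a - b) for a, b in zip(v_i, v_j))


def classify(v_i: Sequence[float], v_j: Sequence[float], tau: float) -> str:
    """Decide a pair from its reference vectors alone, if possible."""
    if upper_bound(v_i, v_j) <= tau:
        return "match"
    if lower_bound(v_i, v_j) > tau:
        return "no match"
    return "compute"  # bounds are inconclusive: the exact distance is needed


print(classify([1, 5], [1, 6], tau=2))  # match     (u_t = 2)
print(classify([1, 5], [6, 2], tau=2))  # no match  (l_t = 5)
print(classify([4, 4], [4, 5], tau=2))  # compute   (l_t = 1, u_t = 8)
```

Only pairs classified as "compute" require a call to the expensive tree edit distance.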
Example: Similarity Join with Reference Sets
S_1 = {T_1, T_2, T_3}, S_2 = {T_4, T_5, T_6}, τ = 2
Reference set K = {T_1}, 1 ≤ l ≤ |K| = 1
Distances:
      T_1  T_2  T_3  T_4  T_5  T_6
T_1    0    4    1    1    4    1
T_2         0    4    4    1    4
T_3              0    1    4    1
T_4                   0    4    1
T_5                        0    4
T_6                             0
Reference Vectors:
v_1 = (0), v_2 = (4), v_3 = (1), v_4 = (1), v_5 = (4), v_6 = (1)
The clusters C_1 = {T_1, T_3, T_4, T_6} and C_2 = {T_2, T_5} are well separated.
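The example can be replayed in code to see how many pairs the bounds decide. This is a sketch under the assumption that the distance table above is available as a lookup; the dictionary `D` and the helper `dist` are illustrative names, and trees are identified by their indices.

```python
from itertools import product

# Pairwise distances from the example table (upper triangle only).
D = {
    (1, 2): 4, (1, 3): 1, (1, 4): 1, (1, 5): 4, (1, 6): 1,
    (2, 3): 4, (2, 4): 4, (2, 5): 1, (2, 6): 4,
    (3, 4): 1, (3, 5): 4, (3, 6): 1,
    (4, 5): 4, (4, 6): 1,
    (5, 6): 4,
}


def dist(i: int, j: int) -> int:
    """Look up the distance between trees T_i and T_j."""
    return 0 if i == j else D[(min(i, j), max(i, j))]


S1, S2, K, tau = [1, 2, 3], [4, 5, 6], [1], 2

# Reference vectors: one distance per reference tree.
v = {t: [dist(t, k) for k in K] for t in S1 + S2}

matches, exact_calls = [], 0
for i, j in product(S1, S2):
    u = min(a + b for a, b in zip(v[i], v[j]))
    l = max(abs(a - b) for a, b in zip(v[i], v[j]))
    if u <= tau:                 # upper bound: certain match
        matches.append((i, j))
    elif l > tau:                # lower bound: certain non-match
        continue
    else:                        # inconclusive: compute the exact distance
        exact_calls += 1
        if dist(i, j) <= tau:
            matches.append((i, j))

print(matches)      # [(1, 4), (1, 6), (2, 5), (3, 4), (3, 6)]
print(exact_calls)  # 1
```

Of the nine candidate pairs, only (T_2, T_5) needs an exact distance computation: both trees lie in cluster C_2, which contains no reference tree, so neither bound is conclusive.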
Reference Set — Tradeoff
Small reference set: efficient reference vector computation
  each additional tree in the reference set costs |S| distance computations to extend the vectors, where S = S_1 ∪ S_2
  for small reference sets the construction of the vectors is cheaper
Large reference set: effective filters
  large reference sets make the bounds u_t and l_t tighter
  thus, once the reference vectors are computed, fewer distance computations are needed
Where is the optimum?
Well Separated Clusters
We cluster the set S = S_1 ∪ S_2.
The clusters are well separated for threshold τ if
  trees within a cluster have small distance (at most τ/2)
  trees from different clusters have large distance (more than 3τ/2)
Example: Two well separated clusters C_1 and C_2.
[Figure: clusters C_1 and C_2, each with internal distances ≤ τ/2, separated by a distance > 3τ/2.]
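The definition can be checked directly for a given clustering. The sketch below is an assumption-laden illustration: `well_separated` is a hypothetical helper, and points on a line with the absolute difference as metric stand in for trees.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def well_separated(
    clusters: Sequence[Sequence[T]],
    dist: Callable[[T, T], float],
    tau: float,
) -> bool:
    """Check both conditions of the definition for threshold tau."""
    # Condition 1: distances within a cluster are at most tau/2.
    for cluster in clusters:
        if any(dist(a, b) > tau / 2 for a, b in combinations(cluster, 2)):
            return False
    # Condition 2: distances across clusters are more than 3*tau/2.
    for c1, c2 in combinations(clusters, 2):
        if any(dist(a, b) <= 3 * tau / 2 for a in c1 for b in c2):
            return False
    return True


def toy_dist(a: int, b: int) -> int:
    """Stand-in metric (assumption): absolute difference of integers."""
    return abs(a - b)


clusters = [[0, 1], [10, 11]]
print(well_separated(clusters, toy_dist, tau=2))  # True
print(well_separated(clusters, toy_dist, tau=8))  # False
```

For τ = 8 the second condition fails: the closest pair across the clusters has distance 9, which is not more than 3τ/2 = 12.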
Trees within the Same Cluster — Upper Bound Applies
Upper bound: δ_t(T_i, T_j) ≤ v_il + v_jl = δ_t(T_i, k_l) + δ_t(k_l, T_j)
Assumption: T_i ∈ S_1 and T_j ∈ S_2 are in the same cluster C.
  The clusters are well separated.
  The cluster C contains a reference tree k_l ∈ K.
In this case v_il ≤ τ/2 and v_jl ≤ τ/2 ⇒ δ_t(T_i, T_j) ≤ τ.
Result: If T_i, T_j, and k_l are in the same cluster
  from v_il and v_jl we conclude that T_i and T_j match
  thus we need not compute δ_t(T_i, T_j)
[Figure: k_l, T_i, and T_j inside one cluster of size τ/2, connected by v_il, v_jl, and δ_t(T_i, T_j).]
Trees in Different Clusters — Lower Bound Applies
Lower bound: δ_t(T_i, T_j) ≥ |v_il − v_jl| = |δ_t(T_i, k_l) − δ_t(k_l, T_j)|
Assumption: T_i ∈ S_1 is in cluster C_1, T_j ∈ S_2 is in cluster C_2.
  The clusters are well separated.
  The cluster C_1 contains a reference tree k_l ∈ K.
In this case v_il ≤ τ/2 and v_jl > 3τ/2 ⇒ δ_t(T_i, T_j) ≥ v_jl − v_il > 3τ/2 − τ/2 = τ.
Result: If T_i and k_l are in the same cluster, but T_j is in a different cluster
  from v_il and v_jl we conclude that T_i and T_j do not match
  thus we need not compute δ_t(T_i, T_j)
[Figure: T_i and k_l in cluster C_1, T_j in cluster C_2; each cluster has internal distances ≤ τ/2 and the clusters are > 3τ/2 apart; edges v_il, v_jl, and δ_t(T_i, T_j).]
Optimum — Well Separated Clusters
Optimum:
clusters are well separated
the reference set contains one tree per cluster
Guha et al. [GJK+02] find clusters by sampling and estimate a reference set size.
Reference Sets Combined Approach
Combined Approach
We combine the previous approaches:
  lower bound with traversal strings
  upper bound with constrained edit distance
  reference sets
Instead of one vector v_i we compute two vectors for each tree T_i:
  lower bound: v_i^l contains the traversal string distances to the reference trees
  upper bound: v_i^u contains the constrained edit distances to the reference trees
Thus v_il^l ≤ δ_t(T_i, k_l) ≤ v_il^u for every reference tree k_l ∈ K.
Interpretation in |K|-Dimensional Space
The reference set defines the basis of a |K|-dimensional space.
Each tree T_k is a (hyper)rectangle in this space.
v_k^l and v_k^u are two opposite corners of this rectangle.
Example: Two trees T_i and T_j represented as rectangles by the reference set K = {k_1, k_2}.
[Figure: T_i and T_j drawn as rectangles with opposite corners v_i^l, v_i^u and v_j^l, v_j^u in the plane spanned by the two reference axes.]
Combined Approach: Triangle Inequality
The triangle inequality changes as follows:
(a) For all 1 ≤ l ≤ |K|
δ_t(T_i, T_j) ≤ v_il^u + v_jl^u
(b) For all 1 ≤ l ≤ |K|
δ_t(T_i, T_j) ≥ { v_jl^l − v_il^u   if v_jl^l > v_il^u
                { v_il^l − v_jl^u   if v_il^l > v_jl^u
                { 0                 otherwise
Note:
If v_jl^l > v_il^u or v_il^l > v_jl^u, then [v_il^l, v_il^u] and [v_jl^l, v_jl^u] are disjoint intervals.
In all other cases we cannot give a lower bound (other than 0).
Combined Approach: Upper and Lower Bounds
Upper bound:
u_t(T_i, T_j) = min_{1 ≤ l ≤ |K|} (v_il^u + v_jl^u)
Lower bound:
l_t(T_i, T_j) = max_{1 ≤ l ≤ |K|} { v_jl^l − v_il^u   if v_jl^l > v_il^u
                                   { v_il^l − v_jl^u   if v_il^l > v_jl^u
                                   { 0                 otherwise
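A minimal sketch of the combined bounds, assuming each tree comes with two precomputed vectors: `vl` (lower-bound distances to the reference trees) and `vu` (upper-bound distances). The function names and the sample values are illustrative assumptions.

```python
from typing import Sequence


def combined_upper(vu_i: Sequence[float], vu_j: Sequence[float]) -> float:
    """u_t: the smallest sum of upper-bound distances over all reference trees."""
    return min(a + b for a, b in zip(vu_i, vu_j))


def combined_lower(
    vl_i: Sequence[float], vu_i: Sequence[float],
    vl_j: Sequence[float], vu_j: Sequence[float],
) -> float:
    """l_t: the largest gap between disjoint intervals, or 0 if all overlap."""
    best = 0.0
    for li, ui, lj, uj in zip(vl_i, vu_i, vl_j, vu_j):
        if lj > ui:            # interval of T_j lies entirely above that of T_i
            best = max(best, lj - ui)
        elif li > uj:          # interval of T_i lies entirely above that of T_j
            best = max(best, li - uj)
        # overlapping intervals contribute only the trivial bound 0
    return best


# Two reference trees: intervals [1, 2] and [6, 8] for T_i, [5, 6] and [7, 9] for T_j.
vl_i, vu_i = [1, 6], [2, 8]
vl_j, vu_j = [5, 7], [6, 9]
print(combined_lower(vl_i, vu_i, vl_j, vu_j))  # 3.0
print(combined_upper(vu_i, vu_j))              # 8
```

On the first axis the intervals are disjoint and yield the gap 5 − 2 = 3; on the second axis they overlap and contribute nothing to the lower bound.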
Illustration: Cases for the Lower Bound
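Case v_jl^l > v_il^u: lower bound is v_jl^l − v_il^u
[Figure: on the axis v_kl, the interval [v_il^l, v_il^u] lies entirely below [v_jl^l, v_jl^u]; the gap between them is the lower bound.]
Case v_il^l > v_jl^u: lower bound is v_il^l − v_jl^u
[Figure: on the axis v_kl, the interval [v_jl^l, v_jl^u] lies entirely below [v_il^l, v_il^u]; the gap between them is the lower bound.]
All other cases ([v_il^l, v_il^u] and [v_jl^l, v_jl^u] overlap): no lower bound
[Figure: on the axis v_kl, the two intervals overlap.]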
Conclusion