Similarity Search
Pruning with Reference Sets
Nikolaus Augsten
nikolaus.augsten@sbg.ac.at
Dept. of Computer Sciences, University of Salzburg
http://dbresearch.uni-salzburg.at
Version January 18, 2017
Winter semester 2016/2017
Outline
1 Reference Sets
  Definition
  Upper and Lower Bound
  Combined Approach
Reference Sets Definition
Reference Sets [?]
Reference sets take advantage of the triangle inequality for metrics.
An extended version of the triangle inequality is:
$$|\delta(a, b) - \delta(b, c)| \le \delta(a, c) \le \delta(a, b) + \delta(b, c),$$
where $\delta$ is a metric distance function and $a$, $b$, $c$ are the objects being compared.
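The inequality can be checked numerically for any metric. The sketch below is only an illustration: it uses the unit-cost string edit distance as a stand-in metric (the slides use the tree edit distance), and the names `edit_distance`, `satisfies_extended_triangle`, and the word list are assumptions made for this example.

```python
from itertools import permutations


def edit_distance(s: str, t: str) -> int:
    """Unit-cost string edit distance (Levenshtein), a metric on strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # rename or keep
        prev = curr
    return prev[-1]


def satisfies_extended_triangle(delta, a, b, c) -> bool:
    """Check |delta(a,b) - delta(b,c)| <= delta(a,c) <= delta(a,b) + delta(b,c)."""
    ab, bc, ac = delta(a, b), delta(b, c), delta(a, c)
    return abs(ab - bc) <= ac <= ab + bc


# Hypothetical sample objects; any triple must satisfy the inequality.
words = ["tree", "three", "there", "trie"]
all_ok = all(satisfies_extended_triangle(edit_distance, a, b, c)
             for a, b, c in permutations(words, 3))
```

Both sides of the inequality are used later: the right side gives an upper bound, the left side a lower bound.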
Illustration: Triangle Inequality
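[Figure: four arrangements of the points $a$, $b$, $c$]
$a$, $b$, $c$ form a proper triangle: $\delta(a, c) < \delta(a, b) + \delta(b, c)$ and $|\delta(a, b) - \delta(b, c)| < \delta(a, c)$
$b$ lies on the line between $a$ and $c$: $\delta(a, c) = \delta(a, b) + \delta(b, c)$
$c$ lies on the line between $a$ and $b$: $|\delta(a, b) - \delta(b, c)| = \delta(a, c)$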
Notation
$S_1$ and $S_2$ are sets of trees (unordered forests)
$K \subseteq S_1 \cup S_2$ is called the reference set
$T_i \in S_1$ and $T_j \in S_2$ are trees
$k_l \in K$, $1 \le l \le |K|$, is a tree of the reference set
Reference Vector
$v_i$ is a vector of size $|K|$ that stores the distance from $T_i \in S_1$ to each tree $k_l \in K$
the $l$-th element of $v_i$ stores the edit distance between $T_i$ and $k_l$: $v_{il} = \delta_t(T_i, k_l)$
$v_j$ is the respective vector for $T_j \in S_2$
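The construction of the reference vectors can be sketched as follows. This is a minimal illustration, not the lecture's implementation: `reference_vectors` is a hypothetical helper, and `toy_delta` (absolute difference of integers) merely stands in for the tree edit distance $\delta_t$ so that the example runs on its own.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def reference_vectors(trees: Dict[str, T],
                      reference_set: Sequence[T],
                      delta: Callable[[T, T], float]) -> Dict[str, List[float]]:
    """Map each tree name to its vector of distances to the reference trees."""
    return {name: [delta(tree, k) for k in reference_set]
            for name, tree in trees.items()}


def toy_delta(a: int, b: int) -> int:
    """Stand-in metric (assumption); replace with a tree edit distance."""
    return abs(a - b)


# Hypothetical data: integers play the role of trees.
S1 = {"T1": 0, "T2": 7}
K = [0, 10]
vectors = reference_vectors(S1, K, toy_delta)
# vectors["T2"][1] is v_{2,2}, the distance from T2 to the second reference tree.
```

Each vector needs $|K|$ distance computations, one per reference tree.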
Metric Space
Metric space of the reference set:
the elements of the reference set define the basis of a metric space
the vector $v_k$ represents tree $T_k$ as a point in this space
$v_{kl}$ is the coordinate of the point on the $l$-th axis
Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.
[Figure: $T_i$ and $T_j$ plotted as points with coordinates $(v_{i1}, v_{i2})$ and $(v_{j1}, v_{j2})$; the axes are $v_{k1}$ and $v_{k2}$.]
Reference Sets Upper and Lower Bound
Upper and Lower Bound
From the triangle inequality it follows that for all $1 \le l \le |K|$
$$|v_{il} - v_{jl}| \le \delta_t(T_i, T_j) \le v_{il} + v_{jl}$$
Upper bound:
$$\delta_t(T_i, T_j) \le \min_{1 \le l \le |K|} \left( v_{il} + v_{jl} \right) = u_t(T_i, T_j)$$
Lower bound:
$$\delta_t(T_i, T_j) \ge \max_{1 \le l \le |K|} |v_{il} - v_{jl}| = l_t(T_i, T_j)$$
Approximate join: match all pairs of trees $T_i \in S_1$, $T_j \in S_2$ with $\delta_t(T_i, T_j) \le \tau$
Upper bound: If $u_t(T_i, T_j) \le \tau$, then $T_i$ and $T_j$ match.
Lower bound: If $l_t(T_i, T_j) > \tau$, then $T_i$ and $T_j$ do not match.
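The two bounds and the resulting filter can be written down directly from the reference vectors. The following is a sketch under the assumption that the vectors are plain Python sequences of equal length; the function names are chosen for this example only.

```python
from typing import Optional, Sequence


def upper_bound(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """u_t: smallest sum of distances via any reference tree."""
    return min(a + b for a, b in zip(v_i, v_j))


def lower_bound(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """l_t: largest absolute difference of distances to any reference tree."""
    return max(abs(a - b) for a, b in zip(v_i, v_j))


def filter_pair(v_i: Sequence[float], v_j: Sequence[float],
                tau: float) -> Optional[bool]:
    """Return True (match), False (no match), or None (bounds do not decide)."""
    if upper_bound(v_i, v_j) <= tau:
        return True
    if lower_bound(v_i, v_j) > tau:
        return False
    return None  # the exact distance has to be computed for this pair
```

Only pairs for which `filter_pair` returns `None` require an exact distance computation.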
Example: Similarity Join with Reference Sets
$S_1 = \{T_1, T_2, T_3\}$, $S_2 = \{T_4, T_5, T_6\}$, $\tau = 2$
Reference set $K = \{T_1\}$, $1 \le l \le |K| = 1$
Distances:
        $T_1$  $T_2$  $T_3$  $T_4$  $T_5$  $T_6$
$T_1$     0      4      1      1      4      1
$T_2$            0      4      4      1      4
$T_3$                   0      1      4      1
$T_4$                          0      4      1
$T_5$                                 0      4
$T_6$                                        0
Reference Vectors:
$v_{1,1} = (0)$, $v_{2,1} = (4)$, $v_{3,1} = (1)$, $v_{4,1} = (1)$, $v_{5,1} = (4)$, $v_{6,1} = (1)$
The clusters $C_1 = \{T_1, T_3, T_4, T_6\}$ and $C_2 = \{T_2, T_5\}$ are well separated.
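To see which pairs the bounds decide in this example, the join can be traced with a short script. This is an illustrative sketch: the distances are copied from the table above, while the variable names and the classification into three lists are assumptions made for the example.

```python
from itertools import product

NAMES = ["T1", "T2", "T3", "T4", "T5", "T6"]
# Upper triangle of the distance table, row by row (diagonal included).
ROWS = [
    [0, 4, 1, 1, 4, 1],
    [0, 4, 4, 1, 4],
    [0, 1, 4, 1],
    [0, 4, 1],
    [0, 4],
    [0],
]
DIST = {}
for i, row in enumerate(ROWS):
    for offset, d in enumerate(row):
        j = i + offset
        DIST[(NAMES[i], NAMES[j])] = d
        DIST[(NAMES[j], NAMES[i])] = d  # distances are symmetric

S1, S2, K, TAU = ["T1", "T2", "T3"], ["T4", "T5", "T6"], ["T1"], 2

# Reference vectors: distance of every tree to each reference tree.
vec = {t: [DIST[(t, k)] for k in K] for t in NAMES}

matches, non_matches, undecided = [], [], []
for ti, tj in product(S1, S2):
    u = min(a + b for a, b in zip(vec[ti], vec[tj]))       # upper bound u_t
    l = max(abs(a - b) for a, b in zip(vec[ti], vec[tj]))  # lower bound l_t
    if u <= TAU:
        matches.append((ti, tj))
    elif l > TAU:
        non_matches.append((ti, tj))
    else:
        undecided.append((ti, tj))

# Only undecided pairs need an exact distance computation.
result = matches + [p for p in undecided if DIST[p] <= TAU]
```

With $K = \{T_1\}$ the bounds decide eight of the nine pairs. Only $(T_2, T_5)$ remains undecided, because cluster $C_2$ contains no reference tree.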
Reference Set — Tradeoff
Small reference set: efficient reference vector computation
each additional tree in the reference set costs $|S|$ distance computations ($S = S_1 \cup S_2$) to construct the vectors
for small reference sets the construction of the vectors is cheaper
Large reference set: effective filters
large reference sets make the bounds $u_t$ and $l_t$ tighter
thus, once the reference vectors are computed, fewer distance computations are needed
Where is the optimum?
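The tradeoff can be made concrete with a simple cost model that counts only distance computations. The model and all numbers below are assumptions for illustration; in particular, the number of undecided pairs depends on the data and on the choice of the reference set.

```python
def vector_construction_cost(size_s1: int, size_s2: int, size_k: int) -> int:
    """|S| = |S1| + |S2| distance computations per reference tree."""
    return size_k * (size_s1 + size_s2)


def join_cost_with_filter(size_s1: int, size_s2: int, size_k: int,
                          undecided_pairs: int) -> int:
    """Vector construction plus one exact distance per undecided pair."""
    return vector_construction_cost(size_s1, size_s2, size_k) + undecided_pairs


def join_cost_nested_loop(size_s1: int, size_s2: int) -> int:
    """Baseline: one exact distance for every pair in S1 x S2."""
    return size_s1 * size_s2


# Hypothetical setting: 1000 trees per set, 10 reference trees,
# and 5% of the 1,000,000 pairs left undecided by the bounds.
filtered = join_cost_with_filter(1000, 1000, 10, 50_000)
baseline = join_cost_nested_loop(1000, 1000)
```

In this model the filter pays off as long as the construction cost plus the undecided pairs stays below $|S_1| \cdot |S_2|$.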
Well Separated Clusters
We cluster the set $S = S_1 \cup S_2$.
The clusters are well separated for threshold $\tau$ if
trees within a cluster have small distance (within $\tau/2$)
trees from different clusters have large distance (more than $3\tau/2$)
Example: Two well separated clusters $C_1$ and $C_2$.
[Figure: clusters $C_1$ and $C_2$, each with internal distances $\le \tau/2$, at a distance $> 3\tau/2$ from each other.]
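The two conditions can be tested mechanically for a given clustering. The sketch below assumes that the clusters are given as lists and that a distance function is available; `well_separated` and the toy data (numbers on a line with the absolute difference as distance) are hypothetical and only illustrate the definition.

```python
from itertools import combinations, product
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def well_separated(clusters: Sequence[Sequence[T]],
                   delta: Callable[[T, T], float],
                   tau: float) -> bool:
    """Check the two conditions of well separated clusters for threshold tau."""
    # Condition 1: trees within a cluster are within tau/2 of each other.
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            if delta(a, b) > tau / 2:
                return False
    # Condition 2: trees from different clusters are more than 3*tau/2 apart.
    for c1, c2 in combinations(clusters, 2):
        for a, b in product(c1, c2):
            if delta(a, b) <= 3 * tau / 2:
                return False
    return True


def toy_delta(a: float, b: float) -> float:
    """Stand-in metric (assumption); replace with a tree edit distance."""
    return abs(a - b)


toy_clusters = [[0.0, 1.0], [10.0, 10.5]]
separated_for_2 = well_separated(toy_clusters, toy_delta, tau=2)
separated_for_8 = well_separated(toy_clusters, toy_delta, tau=8)
```

Whether a clustering is well separated depends on $\tau$: the toy clusters pass the check for $\tau = 2$ but fail it for $\tau = 8$, since the gap of 9 between them does not exceed $3\tau/2 = 12$.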
Trees within the Same Cluster — Upper Bound Applies
Upper bound: $\delta_t(T_i, T_j) \le v_{il} + v_{jl} = \delta_t(T_i, k_l) + \delta_t(k_l, T_j)$
Assumption:
$T_i \in S_1$ and $T_j \in S_2$ are in the same cluster $C$.
The clusters are well separated.
The cluster $C$ contains a reference tree $k_l \in K$.
In this case $v_{il} \le \tau/2$ and $v_{jl} \le \tau/2 \Rightarrow \delta_t(T_i, T_j) \le \tau$.
Result: If $T_i$, $T_j$, and $k_l$ are in the same cluster
from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ match
thus we need not compute $\delta_t(T_i, T_j)$
[Figure: $T_i$, $T_j$, and the reference tree $k_l$ in one cluster of diameter $\tau/2$; the edges $v_{il}$ and $v_{jl}$ bound $\delta_t(T_i, T_j)$ from above.]
Trees in Different Clusters — Lower Bound Applies
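Lower bound: $\delta_t(T_i, T_j) \ge |v_{il} - v_{jl}| = |\delta_t(T_i, k_l) - \delta_t(k_l, T_j)|$
Assumption:
$T_i \in S_1$ is in cluster $C_1$, $T_j \in S_2$ is in cluster $C_2$.
The clusters are well separated.
The cluster $C_1$ contains a reference tree $k_l \in K$.
In this case $v_{il} \le \tau/2$ and $v_{jl} > 3\tau/2$, so $|v_{il} - v_{jl}| > 3\tau/2 - \tau/2 = \tau \Rightarrow \delta_t(T_i, T_j) > \tau$.
Result: If $T_i$ and $k_l$ are in the same cluster, but $T_j$ is in a different cluster
from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ do not match
thus we need not compute $\delta_t(T_i, T_j)$
[Figure: $T_i$ and the reference tree $k_l$ in cluster $C_1$, $T_j$ in cluster $C_2$; both clusters have diameter $\le \tau/2$ and are more than $3\tau/2$ apart; the edges $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$ form a triangle.]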
Optimum — Well Separated Clusters
Optimum:
clusters are well separated
the reference set contains one tree per cluster
Guha et al. [?] find clusters by sampling and estimate a reference set size.
Reference Sets Combined Approach
Combined Approach
We combine the previous approaches:
lower bound with traversal strings
upper bound with constrained edit distance
reference sets
Instead of one vector $v_i$ we compute two vectors for each tree $T_i$:
lower bound: $v_i^l$ contains the traversal string distance
upper bound: $v_i^u$ contains the constrained edit distance
Metric Space
Metric space of the reference set:
the elements of the reference set define the basis of a metric space
each tree $T_k$ is represented by a rectangle in this metric space
$v_k^l$ and $v_k^u$ are two opposite corners of this rectangle
Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.
[Figure: $T_i$ and $T_j$ drawn as rectangles; on the $k_2$ axis, $T_i$ spans $[v_{i2}^l, v_{i2}^u]$ and $T_j$ spans $[v_{j2}^l, v_{j2}^u]$.]
Combined Approach: Triangle Inequality
The triangle inequality changes as follows:
(a) For all $1 \le l \le |K|$
$$\delta_t(T_i, T_j) \le v_{il}^u + v_{jl}^u$$
(b) For all $1 \le l \le |K|$
$$\delta_t(T_i, T_j) \ge \begin{cases} v_{jl}^l - v_{il}^u & \text{if } v_{jl}^l > v_{il}^u \\ v_{il}^l - v_{jl}^u & \text{if } v_{il}^l > v_{jl}^u \\ 0 & \text{otherwise} \end{cases}$$
Note:
If $v_{jl}^l > v_{il}^u$ or $v_{il}^l > v_{jl}^u$ then $[v_{il}^l, v_{il}^u]$ and $[v_{jl}^l, v_{jl}^u]$ are disjoint intervals.
In all other cases the intervals overlap and we cannot give a lower bound (other than 0).
Combined Approach: Upper and Lower Bounds
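Upper bound:
$$u_t(T_i, T_j) = \min_{1 \le l \le |K|} \left( v_{il}^u + v_{jl}^u \right)$$
Lower bound:
$$l_t(T_i, T_j) = \max_{1 \le l \le |K|} \begin{cases} v_{jl}^l - v_{il}^u & \text{if } v_{jl}^l > v_{il}^u \\ v_{il}^l - v_{jl}^u & \text{if } v_{il}^l > v_{jl}^u \\ 0 & \text{otherwise} \end{cases}$$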
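The combined bounds operate on intervals instead of single distances. A minimal sketch, assuming each tree carries two sequences of equal length (lower-bound distances and upper-bound distances to the reference trees); the function and parameter names are made up for this example.

```python
from typing import Sequence


def combined_upper_bound(vu_i: Sequence[float], vu_j: Sequence[float]) -> float:
    """u_t: smallest sum of the upper-bound distances via any reference tree."""
    return min(a + b for a, b in zip(vu_i, vu_j))


def combined_lower_bound(vl_i: Sequence[float], vu_i: Sequence[float],
                         vl_j: Sequence[float], vu_j: Sequence[float]) -> float:
    """l_t: largest gap between disjoint intervals [vl, vu] of the two trees."""
    best = 0.0
    for li, ui, lj, uj in zip(vl_i, vu_i, vl_j, vu_j):
        if lj > ui:      # interval of T_j lies entirely above that of T_i
            gap = lj - ui
        elif li > uj:    # interval of T_i lies entirely above that of T_j
            gap = li - uj
        else:            # intervals overlap: no lower bound other than 0
            gap = 0.0
        best = max(best, gap)
    return best


# Hypothetical intervals for two reference trees.
vl_i, vu_i = [1, 3], [2, 5]
vl_j, vu_j = [5, 4], [7, 6]
u = combined_upper_bound(vu_i, vu_j)
l = combined_lower_bound(vl_i, vu_i, vl_j, vu_j)
```

In the example the intervals for the first reference tree, $[1, 2]$ and $[5, 7]$, are disjoint and yield the lower bound; the intervals for the second reference tree overlap and contribute 0.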