Similarity Search
Pruning with Reference Sets
Nikolaus Augsten
nikolaus.augsten@sbg.ac.at
Department of Computer Sciences
University of Salzburg
http://dbresearch.uni-salzburg.at
WS 2019/20
Version February 24, 2020
Augsten (Univ. Salzburg) Similarity Search WS 2019/20 1 / 22
Outline
1 Reference Sets
  Definition
  Upper and Lower Bound
  Combined Approach
Reference Sets [GJK+02]
Reference sets take advantage of the triangle inequality for metrics.
An extended version of the triangle inequality is:
$$|\delta(a, b) - \delta(b, c)| \le \delta(a, c) \le \delta(a, b) + \delta(b, c),$$
where $\delta()$ is a metric distance function computed between $a$, $b$, and $c$.
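The two-sided inequality can be checked numerically. The sketch below is only an illustration: the Euclidean distance from `math.dist` stands in for the metric $\delta$, and the points `a`, `b`, `c` and the helper name `triangle_bounds` are assumptions, not part of the slides.

```python
import math

def triangle_bounds(d_ab, d_bc):
    """Return (lower, upper) bounds on d(a, c) implied by d(a, b) and d(b, c)."""
    return abs(d_ab - d_bc), d_ab + d_bc

# Stand-in metric and data (assumed for illustration): Euclidean points.
a, b, c = (0.0, 0.0), (3.0, 0.0), (3.0, 4.0)

lower, upper = triangle_bounds(math.dist(a, b), math.dist(b, c))
d_ac = math.dist(a, c)

# The true distance must lie between the two bounds.
print(lower, d_ac, upper)
```

Only the distances to the intermediate point `b` are needed to bracket `d_ac`; this is the role a reference tree plays later.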
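Illustration: Triangle Inequality
[Figure: four configurations of the points $a$, $b$, $c$]
$a$, $b$, $c$ form a triangle: $\delta(a, c) < \delta(a, b) + \delta(b, c)$
$a$, $b$, $c$ form a triangle: $|\delta(a, b) - \delta(b, c)| < \delta(a, c)$
$a$, $b$, $c$ on a line in the order $a$, $b$, $c$: $\delta(a, c) = \delta(a, b) + \delta(b, c)$
$a$, $b$, $c$ on a line in the order $a$, $c$, $b$: $|\delta(a, b) - \delta(b, c)| = \delta(a, c)$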
Notation
$S_1$ and $S_2$ are sets of trees (unordered forests)
$K \subseteq S_1 \cup S_2$ is called the reference set
$T_i \in S_1$ and $T_j \in S_2$ are trees
$k_l \in K$, $1 \le l \le |K|$, is a tree of the reference set
Reference Vector
$v_i$ is a vector of size $|K|$ that stores the distance from $T_i \in S_1$ to each tree $k_l \in K$
the $l$-th element of $v_i$ stores the edit distance between $T_i$ and $k_l$: $v_{il} = \delta_t(T_i, k_l)$
$v_j$ is the respective vector for $T_j \in S_2$
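A reference vector is simply the list of distances to the reference trees, in a fixed order. The sketch below assumes the metric is passed in as a function `dist`; since no tree edit distance is available in the standard library, integers with the absolute difference serve as a hypothetical stand-in for trees and $\delta_t$.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def reference_vector(tree: T, reference_set: Sequence[T],
                     dist: Callable[[T, T], float]) -> list:
    """v_i: the l-th entry is dist(tree, k_l) for the l-th reference tree."""
    return [dist(tree, k) for k in reference_set]

# Stand-in data (assumed): integers instead of trees, |x - y| instead of
# the tree edit distance.
def abs_diff(x, y):
    return abs(x - y)

K = [0, 10]                       # reference set with two elements
v = reference_vector(7, K, abs_diff)
print(v)                          # distances from 7 to 0 and to 10
```

Each call costs $|K|$ distance computations, so building the vectors for all trees costs $|K|$ computations per tree.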
Metric Space
Metric space of the reference set:
  the elements of the reference set define the basis of a metric space
  the vector $v_k$ represents tree $T_k$ as a point in this space
  $v_{kl}$ is the coordinate of the point on the $l$-th axis
Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.
[Figure: $T_i$ and $T_j$ plotted as the points $(v_{i1}, v_{i2})$ and $(v_{j1}, v_{j2})$; the axes are $v_{k1}$ and $v_{k2}$.]
Upper and Lower Bound
From the triangle inequality it follows that for all $1 \le l \le |K|$
$$|v_{il} - v_{jl}| \le \delta_t(T_i, T_j) \le v_{il} + v_{jl}$$
Upper bound:
$$\delta_t(T_i, T_j) \le \min_{1 \le l \le |K|} (v_{il} + v_{jl}) = u_t(T_i, T_j)$$
Lower bound:
$$\delta_t(T_i, T_j) \ge \max_{1 \le l \le |K|} |v_{il} - v_{jl}| = l_t(T_i, T_j)$$
Approximate join: match all trees with $\delta_t(T_i, T_j) \le \tau$
  Upper bound: If $u_t(T_i, T_j) \le \tau$, then $T_i$ and $T_j$ match.
  Lower bound: If $l_t(T_i, T_j) > \tau$, then $T_i$ and $T_j$ do not match.
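The two bounds and the resulting filter can be sketched in a few lines. The function names and the three-valued result (`"match"`, `"no match"`, `"undecided"`) are assumptions made for this illustration; an undecided pair is one for which the exact distance still has to be computed.

```python
def upper_bound(v_i, v_j):
    """u_t: smallest sum v_il + v_jl over all reference trees."""
    return min(a + b for a, b in zip(v_i, v_j))

def lower_bound(v_i, v_j):
    """l_t: largest difference |v_il - v_jl| over all reference trees."""
    return max(abs(a - b) for a, b in zip(v_i, v_j))

def filter_pair(v_i, v_j, tau):
    """Decide a pair from its reference vectors alone, if possible."""
    if upper_bound(v_i, v_j) <= tau:
        return "match"        # exact distance cannot exceed tau
    if lower_bound(v_i, v_j) > tau:
        return "no match"     # exact distance must exceed tau
    return "undecided"        # bounds bracket tau: compute the exact distance

print(filter_pair([1, 4], [1, 5], tau=2))
print(filter_pair([0, 4], [4, 1], tau=2))
print(filter_pair([4], [4], tau=2))
```

Only the pairs reported as undecided require a call to the expensive distance function.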
Example: Similarity Join with Reference Sets
$S_1 = \{T_1, T_2, T_3\}$, $S_2 = \{T_4, T_5, T_6\}$, $\tau = 2$
Reference set $K = \{T_1\}$, $1 \le l \le |K| = 1$
Distances:
      T1  T2  T3  T4  T5  T6
  T1   0   4   1   1   4   1
  T2       0   4   4   1   4
  T3           0   1   4   1
  T4               0   4   1
  T5                   0   4
  T6                       0
Reference Vectors:
$v_1 = (0)$, $v_2 = (4)$, $v_3 = (1)$, $v_4 = (1)$, $v_5 = (4)$, $v_6 = (1)$
The clusters $C_1 = \{T_1, T_3, T_4, T_6\}$ and $C_2 = \{T_2, T_5\}$ are well separated.
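The example can be replayed in code. The sketch below takes the distance table, the sets, the threshold, and the reference set from the example; the variable names and the helper `decide` are assumptions introduced for the illustration.

```python
from itertools import product

names = ["T1", "T2", "T3", "T4", "T5", "T6"]
# Upper triangle of the distance table, row by row, diagonal first.
rows = [
    [0, 4, 1, 1, 4, 1],
    [0, 4, 4, 1, 4],
    [0, 1, 4, 1],
    [0, 4, 1],
    [0, 4],
    [0],
]
dist = {}
for i, row in enumerate(rows):
    for offset, d in enumerate(row):
        a, b = names[i], names[i + offset]
        dist[a, b] = dist[b, a] = d   # the distance is symmetric

S1, S2, K, tau = ["T1", "T2", "T3"], ["T4", "T5", "T6"], ["T1"], 2
vectors = {t: [dist[t, k] for k in K] for t in names}

def decide(v_i, v_j, tau):
    upper = min(a + b for a, b in zip(v_i, v_j))
    lower = max(abs(a - b) for a, b in zip(v_i, v_j))
    if upper <= tau:
        return "match"
    if lower > tau:
        return "no match"
    return "undecided"

outcome = {(a, b): decide(vectors[a], vectors[b], tau)
           for a, b in product(S1, S2)}
undecided = [pair for pair, result in outcome.items() if result == "undecided"]
print(undecided)
```

With $K = \{T_1\}$ the bounds settle eight of the nine pairs. The pair $(T_2, T_5)$ stays undecided: both trees are at distance 4 from $T_1$, so the upper bound is 8 and the lower bound is 0, and the exact distance has to be computed.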
Reference Set — Tradeoff
Small reference set: efficient reference vector computation
  we have to compute $|S|$ distances for each additional tree in the reference set to construct the vectors
  for small reference sets the construction of the vectors is cheaper
Large reference set: effective filters
  large reference sets make $u_t$ and $l_t$ more effective
  thus, once the reference vectors are computed, fewer distance computations are needed
Where is the optimum?
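One way to explore the question is to count distance computations: $|K| \cdot |S|$ for the reference vectors plus one for every pair the bounds leave undecided. The sketch below is a toy cost model under loud assumptions: integers with the absolute difference replace trees and the tree edit distance, the data consists of three groups of nearby values, and `join_cost` is a hypothetical helper.

```python
from itertools import product

def join_cost(S1, S2, K, dist, tau):
    """Exact distance computations: |K| * |S| for the vectors,
    plus one for every pair the bounds cannot decide."""
    V1 = [[dist(t, k) for k in K] for t in S1]
    V2 = [[dist(t, k) for k in K] for t in S2]
    undecided = 0
    for v_i, v_j in product(V1, V2):
        upper = min(a + b for a, b in zip(v_i, v_j))
        lower = max(abs(a - b) for a, b in zip(v_i, v_j))
        if not (upper <= tau or lower > tau):
            undecided += 1
    return len(K) * (len(S1) + len(S2)) + undecided

def abs_diff(x, y):
    return abs(x - y)

# Assumed toy data: three groups of values around 0, 100, and 200.
centers = (0, 100, 200)
S1 = [c + d for c in centers for d in range(0, 20, 2)]
S2 = [c + d for c in centers for d in range(1, 20, 2)]
tau = 38

reference_sets = ([0], [0, 100], [0, 100, 200], [0, 100, 200, 2])
costs = [join_cost(S1, S2, K, abs_diff, tau) for K in reference_sets]
print(costs)  # compare with len(S1) * len(S2) = 900 without any filter
```

In this toy data the cost drops while each new reference element covers a new group (260, 220, 180) and rises again once a fourth element is added to an already covered group (240). The numbers depend entirely on the assumed data and say nothing about real tree collections.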
Well Separated Clusters
We cluster the set $S = S_1 \cup S_2$.
The clusters are well separated for threshold $\tau$ if
  trees within a cluster have small distance (within $\tau/2$)
  trees from different clusters have large distance (more than $3\tau/2$)
Example: Two well separated clusters $C_1$ and $C_2$.
[Figure: clusters $C_1$ and $C_2$, each labeled $\le \tau/2$, separated by a distance $> 3\tau/2$.]
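The two conditions translate directly into a check over all pairs. This is a minimal sketch assuming a clustering is given as a list of lists and the metric as a function `dist`; the name `well_separated` and the integer stand-in data are assumptions for illustration.

```python
from itertools import combinations, product

def well_separated(clusters, dist, tau):
    """True if all pairs inside a cluster are within tau/2 and all pairs
    from different clusters are more than 3*tau/2 apart."""
    within_ok = all(dist(x, y) <= tau / 2
                    for cluster in clusters
                    for x, y in combinations(cluster, 2))
    across_ok = all(dist(x, y) > 3 * tau / 2
                    for c_a, c_b in combinations(clusters, 2)
                    for x, y in product(c_a, c_b))
    return within_ok and across_ok

# Stand-in data (assumed): integers with |x - y| as the metric.
def abs_diff(x, y):
    return abs(x - y)

clusters = [[0, 1], [10, 11]]
print(well_separated(clusters, abs_diff, tau=2))
print(well_separated(clusters, abs_diff, tau=8))
```

The property depends on the threshold: the same clustering can be well separated for one $\tau$ and not for another, because a larger $\tau$ also raises the required gap between clusters.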
Trees within the Same Cluster — Upper Bound Applies
Upper bound: $\delta_t(T_i, T_j) \le v_{il} + v_{jl} = \delta_t(T_i, k_l) + \delta_t(k_l, T_j)$
Assumption:
  $T_i \in S_1$ and $T_j \in S_2$ are in the same cluster $C$.
  The clusters are well separated.
  The cluster $C$ contains a reference tree $k_l \in K$.
In this case $v_{il} \le \tau/2$ and $v_{jl} \le \tau/2$, so $v_{il} + v_{jl} \le \tau$ $\Rightarrow$ $\delta_t(T_i, T_j) \le \tau$.
Result: If $T_i$, $T_j$, and $k_l$ are in the same cluster
  from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ match
  thus we need not compute $\delta_t(T_i, T_j)$
[Figure: $T_i$, $T_j$, and $k_l$ inside one cluster labeled $\tau/2$; the edges are labeled $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$.]
Trees in Different Clusters — Lower Bound Applies
Lower bound: $\delta_t(T_i, T_j) \ge |v_{il} - v_{jl}| = |\delta_t(T_i, k_l) - \delta_t(k_l, T_j)|$
Assumption:
  $T_i \in S_1$ is in cluster $C_1$, $T_j \in S_2$ is in cluster $C_2$.
  The clusters are well separated.
  The cluster $C_1$ contains a reference tree $k_l \in K$.
In this case $v_{il} \le \tau/2$ and $v_{jl} > 3\tau/2$, so $|v_{il} - v_{jl}| > \tau$ $\Rightarrow$ $\delta_t(T_i, T_j) > \tau$.
Result: If $T_i$ and $k_l$ are in the same cluster, but $T_j$ is in a different cluster
  from $v_{il}$ and $v_{jl}$ we conclude that $T_i$ and $T_j$ do not match
  thus we need not compute $\delta_t(T_i, T_j)$
[Figure: $T_i$ and $k_l$ in cluster $C_1$, $T_j$ in cluster $C_2$; each cluster is labeled $\le \tau/2$ and the gap between them $> 3\tau/2$; the edges are labeled $v_{il}$, $v_{jl}$, and $\delta_t(T_i, T_j)$.]
Optimum — Well Separated Clusters
Optimum:
clusters are well separated
the reference set contains one tree per cluster
Guha et al. [GJK+02] find clusters by sampling and estimate a reference set size.
Combined Approach
We combine the previous approaches:
  lower bound with traversal strings
  upper bound with constrained edit distance
  reference sets
Instead of one vector $v_i$ we compute two vectors for each tree $T_i$:
  lower bound: $v_i^l$ contains the traversal string distance
  upper bound: $v_i^u$ contains the constrained edit distance
Metric Space
Metric space of the reference set:
  the elements of the reference set define the basis of a metric space
  each tree $T_k$ is represented by a rectangle in this metric space
  $v_k^l$ and $v_k^u$ are two opposite corners of this rectangle
Example: Two trees $T_i$ and $T_j$ in the metric space defined by the reference set $K = \{k_1, k_2\}$.
[Figure: $T_i$ and $T_j$ drawn as rectangles; $T_i$ spans $[v_{i1}^l, v_{i1}^u]$ on the axis $v_{k1}^{l/u}$ and $[v_{i2}^l, v_{i2}^u]$ on the axis $v_{k2}^{l/u}$; $T_j$ spans $[v_{j1}^l, v_{j1}^u]$ and $[v_{j2}^l, v_{j2}^u]$.]
Combined Approach: Triangle Inequality
The triangle inequality changes as follows:
(a) For all $1 \le l \le |K|$
$$\delta_t(T_i, T_j) \le v_{il}^u + v_{jl}^u$$
(b) For all $1 \le l \le |K|$
$$\delta_t(T_i, T_j) \ge \begin{cases} v_{jl}^l - v_{il}^u & \text{if } v_{jl}^l > v_{il}^u \\ v_{il}^l - v_{jl}^u & \text{if } v_{il}^l > v_{jl}^u \\ 0 & \text{otherwise} \end{cases}$$
Note:
  If $v_{jl}^l > v_{il}^u$ or $v_{il}^l > v_{jl}^u$ then $[v_{il}^l, v_{il}^u]$ and $[v_{jl}^l, v_{jl}^u]$ are disjoint intervals.
  In all other cases we cannot give a lower bound (other than 0).
Combined Approach: Upper and Lower Bounds
Upper bound:
$$u_t(T_i, T_j) = \min_{1 \le l \le |K|} (v_{il}^u + v_{jl}^u)$$
Lower bound:
$$l_t(T_i, T_j) = \max_{1 \le l \le |K|} \begin{cases} v_{jl}^l - v_{il}^u & \text{if } v_{jl}^l > v_{il}^u \\ v_{il}^l - v_{jl}^u & \text{if } v_{il}^l > v_{jl}^u \\ 0 & \text{otherwise} \end{cases}$$
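The combined bounds can be sketched with two vectors per tree. The code below assumes each tree comes with a lower-bound vector `vl` and an upper-bound vector `vu` whose entries satisfy `vl[l] <= vu[l]`; how these are computed (traversal string distance, constrained edit distance) is outside the sketch, and all names and sample values are assumptions for illustration.

```python
def combined_upper_bound(vu_i, vu_j):
    """u_t: smallest sum of the upper-bound entries over all reference trees."""
    return min(a + b for a, b in zip(vu_i, vu_j))

def combined_lower_bound(vl_i, vu_i, vl_j, vu_j):
    """l_t: largest gap between disjoint intervals [vl, vu], or 0."""
    best = 0
    for lo_i, hi_i, lo_j, hi_j in zip(vl_i, vu_i, vl_j, vu_j):
        if lo_j > hi_i:          # interval of T_j lies entirely above T_i's
            gap = lo_j - hi_i
        elif lo_i > hi_j:        # interval of T_i lies entirely above T_j's
            gap = lo_i - hi_j
        else:                    # intervals overlap: no bound other than 0
            gap = 0
        best = max(best, gap)
    return best

def combined_filter(vl_i, vu_i, vl_j, vu_j, tau):
    if combined_upper_bound(vu_i, vu_j) <= tau:
        return "match"
    if combined_lower_bound(vl_i, vu_i, vl_j, vu_j) > tau:
        return "no match"
    return "undecided"

# Hypothetical interval vectors for two reference trees.
print(combined_filter([1, 6], [2, 8], [5, 1], [7, 3], tau=2))
# Overlapping intervals on the only axis: the bounds cannot decide.
print(combined_filter([1], [4], [3], [6], tau=2))
```

The design choice here is that no exact tree edit distance is computed even for the reference vectors; the price is that overlapping intervals yield only the trivial lower bound 0.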
Illustration: Cases for the Lower Bound
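Case $v_{jl}^l > v_{il}^u$: lower bound is $v_{jl}^l - v_{il}^u$
  [Figure: order on the $v_{kl}$ axis is $v_{il}^l$, $v_{il}^u$, $v_{jl}^l$, $v_{jl}^u$; the lower bound is the gap between $v_{il}^u$ and $v_{jl}^l$.]
Case $v_{il}^l > v_{jl}^u$: lower bound is $v_{il}^l - v_{jl}^u$
  [Figure: order on the $v_{kl}$ axis is $v_{jl}^l$, $v_{jl}^u$, $v_{il}^l$, $v_{il}^u$; the lower bound is the gap between $v_{jl}^u$ and $v_{il}^l$.]
All other cases ($[v_{il}^l, v_{il}^u]$ and $[v_{jl}^l, v_{jl}^u]$ overlap): no lower bound
  [Figure: the two intervals overlap on the $v_{kl}$ axis, for example in the order $v_{jl}^l$, $v_{il}^l$, $v_{jl}^u$, $v_{il}^u$.]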
[GJK+02] Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 287–298, Madison, Wisconsin, 2002. ACM Press.