Similarity Search
Traversal Strings and Constrained Edit Distance
Nikolaus Augsten
nikolaus.augsten@sbg.ac.at
Department of Computer Sciences
University of Salzburg
http://dbresearch.uni-salzburg.at
WS 2018/19
Version December 21, 2018
Augsten (Univ. Salzburg) Similarity Search WS 2018/19 1 / 21
Outline
1 Search Space Reduction for the Tree Edit Distance
    Similarity Join and Search Space Reduction
    Lower Bound: Traversal Strings
    Upper Bound: Constrained Edit Distance
Definition: Similarity Join
Definition (Similarity Join)
Given two sets of trees, $S_1$ and $S_2$, and a distance threshold $\tau$, let $\delta_t(T_i, T_j)$ be the tree edit distance between two trees $T_i \in S_1$ and $T_j \in S_2$. The similarity join of $S_1$ and $S_2$ reports all pairs of trees $(T_i, T_j) \in S_1 \times S_2$ such that
$\delta_t(T_i, T_j) \le \tau$.
Similarity Join Algorithm with Upper and Lower Bounds
simJoin(S_1, S_2)
    for each T_i ∈ S_1 do
        for each T_j ∈ S_2 do
            if upperBound(T_i, T_j) ≤ τ then output(T_i, T_j)
            else if lowerBound(T_i, T_j) > τ then /* do nothing */
            else if δ_t(T_i, T_j) ≤ τ then output(T_i, T_j)
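The join loop above can be sketched in Python. This is a minimal sketch under stated assumptions, not the lecture's implementation: the name `sim_join` and the three callables `lower_bound`, `upper_bound`, and `tree_dist` are hypothetical stand-ins that the caller must supply, and the sketch only fixes the order in which they are consulted.

```python
from typing import Callable, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")
Dist = Callable[[T, T], float]


def sim_join(
    s1: Sequence[T],
    s2: Sequence[T],
    tau: float,
    lower_bound: Dist,   # assumed: lower_bound(a, b) <= tree_dist(a, b)
    upper_bound: Dist,   # assumed: upper_bound(a, b) >= tree_dist(a, b)
    tree_dist: Dist,     # the expensive exact distance
) -> Iterator[Tuple[T, T]]:
    """Yield all pairs (ti, tj) with tree_dist(ti, tj) <= tau."""
    for ti in s1:
        for tj in s2:
            if upper_bound(ti, tj) <= tau:
                # The exact distance cannot exceed tau: report the pair.
                yield ti, tj
            elif lower_bound(ti, tj) > tau:
                # The exact distance must exceed tau: skip the pair.
                continue
            elif tree_dist(ti, tj) <= tau:
                # The bounds were inconclusive: compute the exact distance.
                yield ti, tj
```

The exact distance is computed only for pairs that neither bound can decide, so the result is the same as that of a join that evaluates `tree_dist` for every pair.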
Preorder and Postorder Traversal Strings
Each node label is a single character of an alphabet Σ.
Traversal Strings:
pre(T) is the string of T’s node labels in preorder
post(T) is the string of T’s node labels in postorder
Lemma (Tree Inequality)
Let $\mathrm{pre}(T_1)$ and $\mathrm{pre}(T_2)$ be the preorder strings, and $\mathrm{post}(T_1)$ and $\mathrm{post}(T_2)$ the postorder strings, of two trees $T_1$ and $T_2$.
Then
$\mathrm{pre}(T_1) \ne \mathrm{pre}(T_2)$ or $\mathrm{post}(T_1) \ne \mathrm{post}(T_2)$ $\Rightarrow$ $T_1 \ne T_2$
Proof.
The contrapositive is obviously true:
$T_1 = T_2$ $\Rightarrow$ $\mathrm{pre}(T_1) = \mathrm{pre}(T_2)$ and $\mathrm{post}(T_1) = \mathrm{post}(T_2)$
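The traversal strings can be computed with two short recursive functions. The sketch below assumes a tree is represented as a pair `(label, children)`; this representation, the helper `node`, and the two small trees are illustrative choices, not part of the lecture material.

```python
def node(label, *children):
    """Build a tree node as (label, list of child nodes)."""
    return (label, list(children))


def pre(t):
    """Preorder string: the label of the root, then the children left to right."""
    label, children = t
    return label + "".join(pre(c) for c in children)


def post(t):
    """Postorder string: the children left to right, then the label of the root."""
    label, children = t
    return "".join(post(c) for c in children) + label


# Two different trees over the labels a, b, a, c.
t1 = node("a", node("b"), node("a", node("c")))
t2 = node("a", node("b", node("a"), node("c")))

print(pre(t1), post(t1))   # abac bcaa
print(pre(t2), post(t2))   # abac acba
```

Here the preorder strings are equal but the postorder strings differ, so by the lemma the two trees cannot be equal.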
Notes: Traversal Strings and Tree Inequality
If the traversal strings of two trees are equal, the trees can still be different:
[Figure: two different trees, $T_1 \ne T_2$, each with the node labels a, b, a]
$\mathrm{pre}(T_1) = aba$, $\mathrm{pre}(T_2) = aba$
Lower Bound
Theorem (Lower Bound)
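If two trees are at tree edit distance $k$, then the string edit distance between their preorder traversals, and likewise between their postorder traversals, is at most $k$.
Proof.
Each tree edit operation maps to a single string edit operation (illustration on next slide):
Insertion ($\mathrm{ins}(v, p, k, m)$): Let $t_1, \dots, t_f$ be the subtrees rooted in the children of $p$. Then the preorder traversal of the subtree rooted in $p$ is
$p\,\mathrm{pre}(t_1) \dots \mathrm{pre}(t_{k-1})\,\mathrm{pre}(t_k) \dots \mathrm{pre}(t_m)\,\mathrm{pre}(t_{m+1}) \dots \mathrm{pre}(t_f)$.
Inserting $v$ as the $k$-th child of $p$ moves the subtrees $t_k, \dots, t_m$ below $v$:
$p\,\mathrm{pre}(t_1) \dots \mathrm{pre}(t_{k-1})\,v\,\mathrm{pre}(t_k) \dots \mathrm{pre}(t_m)\,\mathrm{pre}(t_{m+1}) \dots \mathrm{pre}(t_f)$.
The string distance is 1. Analogous reasoning applies to postorder, where $v$ is inserted after $\mathrm{post}(t_m)$.
Deletion: Inverse of insertion.
Rename: Renaming a node renames a single character of the string.
A sequence of $k$ tree edit operations therefore corresponds to at most $k$ string edit operations.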
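The effect of an insertion on the traversal strings can be checked on a small instance. The sketch below assumes trees are pairs `(label, children)`; the function `ins` and the single-character labels `1` to `4`, which stand for the subtrees $t_1, \dots, t_4$, are illustrative assumptions.

```python
def node(label, *children):
    """Build a tree node as (label, list of child nodes)."""
    return (label, list(children))


def pre(t):
    label, children = t
    return label + "".join(pre(c) for c in children)


def post(t):
    label, children = t
    return "".join(post(c) for c in children) + label


def ins(p, v_label, k, m):
    """Return a copy of p in which a new node v is the k-th child (1-based)
    and the former children k..m of p have become the children of v."""
    label, children = p
    v = (v_label, children[k - 1:m])
    return (label, children[:k - 1] + [v] + children[m:])


p_before = node("p", node("1"), node("2"), node("3"), node("4"))
p_after = ins(p_before, "v", 2, 3)

print(pre(p_before), pre(p_after))     # p1234 p1v234
print(post(p_before), post(p_after))   # 1234p 123v4p
```

In both traversal strings the insertion of the node adds exactly one character, so each string edit distance is 1.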
Illustration for the Lower Bound Proof (Preorder)
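[Figure: a node $p$ with the child subtrees $t_1, \dots, t_f$, before and after $\mathrm{ins}(v, p, k, m)$; $\mathrm{del}(v)$ is the inverse operation]
Before the insertion: $p\,\mathrm{pre}(t_1) \dots \mathrm{pre}(t_{k-1})\,\mathrm{pre}(t_k) \dots \mathrm{pre}(t_m)\,\mathrm{pre}(t_{m+1}) \dots \mathrm{pre}(t_f)$
After the insertion: $p\,\mathrm{pre}(t_1) \dots \mathrm{pre}(t_{k-1})\,v\,\mathrm{pre}(t_k) \dots \mathrm{pre}(t_m)\,\mathrm{pre}(t_{m+1}) \dots \mathrm{pre}(t_f)$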
Lower Bound
From the lower bound theorem it follows that
$\max(\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)), \delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2))) \le \delta_t(T_1, T_2)$
where $\delta_s$ and $\delta_t$ are the string and the tree edit distance, respectively.
The string edit distance can be computed faster:
string edit distance runtime: $O(n^2)$
tree edit distance runtime: $O(n^3)$
Similarity join: match all trees with $\delta_t(T_1, T_2) \le \tau$
if $\max(\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)), \delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2))) > \tau$ then $\delta_t(T_1, T_2) > \tau$
thus we do not have to compute the expensive tree edit distance
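The filter can be sketched as follows, assuming the traversal strings of both trees are already available. The function names are illustrative, and `string_edit_distance` is the standard dynamic program for the string edit distance with unit costs, which is assumed here to be the intended $\delta_s$.

```python
def string_edit_distance(s, t):
    """String edit distance with unit-cost insert, delete, and rename.
    Runs in O(len(s) * len(t)) time and keeps one row of the table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # rename cs to ct, or keep it
            ))
        prev = cur
    return prev[-1]


def traversal_lower_bound(pre1, post1, pre2, post2):
    """Lower bound on the tree edit distance from the traversal strings."""
    return max(string_edit_distance(pre1, pre2),
               string_edit_distance(post1, post2))


def can_prune(pre1, post1, pre2, post2, tau):
    """True if the pair cannot be within tree edit distance tau."""
    return traversal_lower_bound(pre1, post1, pre2, post2) > tau


# Traversal strings of two small example trees.
print(traversal_lower_bound("fdacbe", "abcdef", "fcdabe", "abdcef"))  # 2
print(can_prune("fdacbe", "abcdef", "fcdabe", "abdcef", tau=1))       # True
```

With $\tau = 1$ the pair is discarded without computing the tree edit distance. With $\tau = 2$ the lower bound is inconclusive and the pair has to be checked further.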
Example: Traversal String Lower Bound
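[Figure: two trees whose nodes are numbered in postorder, given here in bracket notation (a node followed by its children in parentheses)]
$T_1$: f(d(a, c(b)), e) with $\mathrm{pre}(T_1) = fdacbe$, $\mathrm{post}(T_1) = abcdef$
$T_2$: f(c(d(a, b)), e) with $\mathrm{pre}(T_2) = fcdabe$, $\mathrm{post}(T_2) = abdcef$
$\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)) = 2$
$\delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2)) = 2$
$\delta_t(T_1, T_2) = 2$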
Example: Traversal String Lower Bound
The string distances of preorder and postorder may be different.
The string distances and the tree distance may be different.
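[Figure: two trees, given here in bracket notation (a node followed by its children in parentheses)]
$T_1$: a(b, a(c)) with $\mathrm{pre}(T_1) = abac$, $\mathrm{post}(T_1) = bcaa$
$T_2$: a(b(a, c)) with $\mathrm{pre}(T_2) = abac$, $\mathrm{post}(T_2) = acba$
$\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)) = 0$
$\delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2)) = 2$
$\delta_t(T_1, T_2) = 3$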
Edit Mapping
Recall the definition of the edit mapping:
Definition (Edit Mapping)
An edit mapping $M$ between $T_1$ and $T_2$ is a set of node pairs that satisfy the following conditions:
(1) $(a, b) \in M \Rightarrow a \in N(T_1), b \in N(T_2)$
(2) for any two pairs $(a, b)$ and $(x, y)$ of $M$:
(i) $a = x \Leftrightarrow b = y$ (one-to-one condition)
(ii) $a$ is to the left of $x$ $\Leftrightarrow$ $b$ is to the left of $y$ (order condition), where "to the left of" means that $a$ precedes $x$ in both preorder and postorder
(iii) $a$ is an ancestor of $x$ $\Leftrightarrow$ $b$ is an ancestor of $y$ (ancestor condition)
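The conditions of the definition can be checked mechanically from preorder and postorder ranks. The sketch below is an illustration under stated assumptions: trees are pairs `(label, children)`, node labels are unique so that a label identifies a node, "ancestor" is read as proper ancestor, and all function names and the two trees are hypothetical.

```python
def node(label, *children):
    return (label, list(children))


def ranks(tree):
    """Map each label to its (preorder rank, postorder rank)."""
    out, counter = {}, {"pre": 0, "post": 0}

    def visit(t):
        label, children = t
        pre_rank = counter["pre"]
        counter["pre"] += 1
        for c in children:
            visit(c)
        out[label] = (pre_rank, counter["post"])
        counter["post"] += 1

    visit(tree)
    return out


def is_left_of(r, a, x):
    # a precedes x in both preorder and postorder
    return r[a][0] < r[x][0] and r[a][1] < r[x][1]


def is_ancestor(r, a, x):
    # a precedes x in preorder and follows x in postorder
    return r[a][0] < r[x][0] and r[a][1] > r[x][1]


def is_edit_mapping(m, r1, r2):
    if any(a not in r1 or b not in r2 for a, b in m):
        return False                                      # condition (1)
    for a, b in m:
        for x, y in m:
            if (a == x) != (b == y):
                return False                              # one-to-one
            if is_left_of(r1, a, x) != is_left_of(r2, b, y):
                return False                              # order
            if is_ancestor(r1, a, x) != is_ancestor(r2, b, y):
                return False                              # ancestor
    return True


r1 = ranks(node("a", node("b", node("d"), node("e")),
                node("c", node("f"), node("g"))))
r2 = ranks(node("a", node("d"),
                node("h", node("e"), node("i", node("f"), node("g")))))

good = {("a", "a"), ("d", "d"), ("e", "e"), ("c", "i"), ("f", "f"), ("g", "g")}
bad = {("d", "e"), ("e", "d")}   # swaps the left-to-right order
print(is_edit_mapping(good, r1, r2), is_edit_mapping(bad, r1, r2))  # True False
```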
Constrained Edit Distance
We compute a special case of the edit distance to get a faster algorithm.
lca(a, b) is the lowest common ancestor of a and b.
Additional requirement on the mapping $M$:
(4) for any pairs $(a_1, b_1)$, $(a_2, b_2)$, $(x, y)$ of $M$:
$\mathrm{lca}(a_1, a_2)$ is a proper ancestor of $x$ $\Leftrightarrow$ $\mathrm{lca}(b_1, b_2)$ is a proper ancestor of $y$.
Intuition: Distinct subtrees of $T_1$ are mapped to distinct subtrees of $T_2$.
Example: Constrained Edit Distance
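[Figure: two trees with mapped nodes connected by dashed (constrained) and dotted (unconstrained) lines; the trees are given here in bracket notation, a node followed by its children in parentheses]
$T_1$: a(b(d, e), c(f, g))
$T_2$: a(d, h(e, i(f, g)))
Constrained edit distance (dashed lines): $\delta_c(T_1, T_2) = 5$
constrained mapping $M_c = \{(a, a), (d, d), (c, i), (f, f), (g, g)\}$
edit sequence: ren(c, i), del(b), del(e), ins(h), ins(e)
Unconstrained edit distance (dotted lines): $\delta_t(T_1, T_2) = 3$
mapping $M_t = \{(a, a), (d, d), (e, e), (c, i), (f, f), (g, g)\}$
edit sequence: ren(c, i), del(b), ins(h)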
Example: Constrained Edit Distance
[Figure: the same two trees, in bracket notation $T_1$: a(b(d, e), c(f, g)) and $T_2$: a(d, h(e, i(f, g)))]
(e, e) violates the 4th condition of the constrained mapping:
assume $(e, e), (f, f), (d, d) \in M_c$
lca(e, f) in $T_1$ is a
a is a proper ancestor of d in $T_1$
lca(e, f) in $T_2$ is h
h is not a proper ancestor of d in $T_2$
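The violation discussed above can be checked mechanically. The sketch below is an illustration, not an algorithm for the constrained edit distance: it only tests the 4th condition for a given mapping. Each tree is encoded as a child-to-parent dictionary; the tree shapes are an assumption read off the edit sequences and the lca statements of the example, and all function names are hypothetical.

```python
def ancestors(parent, v):
    """Proper ancestors of v, nearest first."""
    out = []
    while v in parent:
        v = parent[v]
        out.append(v)
    return out


def lca(parent, u, v):
    """Lowest common ancestor of u and v (a node counts as its own ancestor here)."""
    on_path = {u, *ancestors(parent, u)}
    for w in [v] + ancestors(parent, v):
        if w in on_path:
            return w
    raise ValueError("nodes are not in the same tree")


def satisfies_condition_4(m, p1, p2):
    """Check the lca condition for all triples of pairs of the mapping m."""
    for a1, b1 in m:
        for a2, b2 in m:
            for x, y in m:
                in_t1 = lca(p1, a1, a2) in ancestors(p1, x)
                in_t2 = lca(p2, b1, b2) in ancestors(p2, y)
                if in_t1 != in_t2:
                    return False
    return True


# Assumed shapes: T1 = a(b(d, e), c(f, g)), T2 = a(d, h(e, i(f, g)))
p1 = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c", "g": "c"}
p2 = {"d": "a", "h": "a", "e": "h", "i": "h", "f": "i", "g": "i"}

m_c = {("a", "a"), ("d", "d"), ("c", "i"), ("f", "f"), ("g", "g")}
m_t = m_c | {("e", "e")}

print(lca(p1, "e", "f"), lca(p2, "e", "f"))   # a h
print(satisfies_condition_4(m_c, p1, p2))     # True
print(satisfies_condition_4(m_t, p1, p2))     # False
```

The triple loop is cubic in the size of the mapping, which is acceptable for a check on small examples.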
Complexity of the Constrained Edit Distance
Theorem (Complexity of the Constrained Edit Distance)
Let $T_1$ and $T_2$ be two trees with $|T_1|$ and $|T_2|$ nodes, respectively. There is an algorithm that computes the constrained edit distance between $T_1$ and $T_2$ with runtime
$O(|T_1||T_2|)$.
Proof.
See [?, ?].
Constrained Edit Distance: Upper Bound
Theorem (Upper Bound)
Let $T_1$ and $T_2$ be two trees, and let $\delta_t(T_1, T_2)$ be the unconstrained and $\delta_c(T_1, T_2)$ the constrained tree edit distance. Then
$\delta_t(T_1, T_2) \le \delta_c(T_1, T_2)$
Proof.
See [?].
Use of the Upper Bound
The constrained edit distance can be computed faster:
constrained edit distance runtime: $O(n^2)$
unconstrained edit distance runtime: $O(n^3)$
Similarity join: match all trees with $\delta_t(T_1, T_2) \le \tau$
if $\delta_c(T_1, T_2) \le \tau$ then also $\delta_t(T_1, T_2) \le \tau$
thus we do not have to compute the expensive tree edit distance