Traversal Strings and Constrained Edit Distance

(1)

Similarity Search

Nikolaus Augsten

nikolaus.augsten@sbg.ac.at

Department of Computer Sciences

University of Salzburg

http://dbresearch.uni-salzburg.at

WS 2021/22

Version October 26, 2021

(2)

1 Search Space Reduction for the Tree Edit Distance
    Similarity Join and Search Space Reduction
    Lower Bound: Traversal Strings
    Upper Bound: Constrained Edit Distance

2 Conclusion

(3)

1 Search Space Reduction for the Tree Edit Distance

Similarity Join and Search Space Reduction

(4)

Definition: Similarity Join

Definition (Similarity Join)

Given two sets of trees, $S_1$ and $S_2$, and a distance threshold $\tau$, let $\delta_t(T_i, T_j)$ be a function that assesses the edit distance between two trees $T_i \in S_1$ and $T_j \in S_2$. The similarity join operation between the two sets of trees reports in the output all pairs of trees $(T_i, T_j) \in S_1 \times S_2$ such that $\delta_t(T_i, T_j) \le \tau$.

(5)

Similarity Join Algorithm with Upper and Lower Bounds

simJoin(S_1, S_2)
    for each T_i ∈ S_1 do
        for each T_j ∈ S_2 do
            if upperBound(T_i, T_j) ≤ τ then output(T_i, T_j)
            else if lowerBound(T_i, T_j) > τ then /* do nothing */
            else if δ_t(T_i, T_j) ≤ τ then output(T_i, T_j)
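The filter loop above can be sketched in Python as follows. This is a minimal sketch, not the lecture's implementation: the function name `sim_join` and the three callables passed in are hypothetical, and the only assumption about them is that `lower_bound(x, y) <= exact(x, y) <= upper_bound(x, y)` holds for every pair.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def sim_join(s1: Sequence[T], s2: Sequence[T], tau: float,
             upper_bound: Distance, lower_bound: Distance,
             exact: Distance) -> List[Tuple[T, T]]:
    """Report all pairs with exact(x, y) <= tau, calling exact() only when needed."""
    result = []
    for ti in s1:
        for tj in s2:
            if upper_bound(ti, tj) <= tau:
                # The exact distance cannot exceed the upper bound: certain match.
                result.append((ti, tj))
            elif lower_bound(ti, tj) > tau:
                # The exact distance is at least the lower bound: certain non-match.
                continue
            elif exact(ti, tj) <= tau:
                # Undecided by the bounds: fall back to the expensive distance.
                result.append((ti, tj))
    return result
```

The order of the tests mirrors the pseudocode: the two cheap bounds are tried first, and the expensive distance is evaluated only for pairs that neither bound can decide.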

(7)

Lower Bound: Traversal Strings

Preorder and Postorder Traversal Strings

Each node label is a single character of an alphabet $\Sigma$.

Traversal Strings:

$\mathrm{pre}(T)$ is the string of $T$'s node labels in preorder
$\mathrm{post}(T)$ is the string of $T$'s node labels in postorder
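As a small illustration, the two traversal strings can be computed recursively. The tree encoding used here, a pair `(label, list of child trees)`, is an assumption made for this sketch and is not prescribed by the slides.

```python
def pre(tree):
    """Preorder string: the node's label first, then the children left to right."""
    label, children = tree
    return label + "".join(pre(child) for child in children)


def post(tree):
    """Postorder string: the children left to right, then the node's label."""
    label, children = tree
    return "".join(post(child) for child in children) + label


# Root a with children b and c, where c has a single child d.
example = ("a", [("b", []), ("c", [("d", [])])])
print(pre(example))   # abcd
print(post(example))  # bdca
```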

Lemma (Tree Inequality)

Let $\mathrm{pre}(T_1)$ and $\mathrm{pre}(T_2)$ be the preorder strings, and $\mathrm{post}(T_1)$ and $\mathrm{post}(T_2)$ be the postorder strings of two trees $T_1$ and $T_2$, respectively. Then

$\mathrm{pre}(T_1) \neq \mathrm{pre}(T_2) \ \text{or}\ \mathrm{post}(T_1) \neq \mathrm{post}(T_2) \ \Rightarrow\ T_1 \neq T_2$

Proof.

The contrapositive is obviously true:

$T_1 = T_2 \ \Rightarrow\ \mathrm{pre}(T_1) = \mathrm{pre}(T_2) \ \text{and}\ \mathrm{post}(T_1) = \mathrm{post}(T_2)$

(8)

Notes: Traversal Strings and Tree Inequality

If the traversal strings of two trees are equal, the trees can still be different:

[Figure: two different trees $T_1 \neq T_2$, each with the node labels a, b, a]

$\mathrm{pre}(T_1) = \text{aba}$, $\mathrm{pre}(T_2) = \text{aba}$
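The following sketch makes the note concrete with one pair of trees whose preorder strings are both aba. The two shapes are an assumption chosen to match these strings; they are not taken from the original figure. Trees are again encoded as `(label, list of child trees)`.

```python
def pre(tree):
    label, children = tree
    return label + "".join(pre(child) for child in children)


def post(tree):
    label, children = tree
    return "".join(post(child) for child in children) + label


# Assumed shapes: in t1 the nodes b and a are siblings, in t2 they form a chain.
t1 = ("a", [("b", []), ("a", [])])
t2 = ("a", [("b", [("a", [])])])

print(pre(t1), pre(t2))    # aba aba
print(post(t1), post(t2))  # baa aba
print(t1 == t2)            # False
```

For this particular pair the postorder strings differ, so the tree inequality lemma still detects that the trees are different.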

(9)

Lower Bound

Theorem (Lower Bound)

If two trees are at tree edit distance $k$, then the string edit distance between their preorder traversals is at most $k$, and so is the string edit distance between their postorder traversals.

Proof.

Tree operations map to string operations (illustration on next slide):

Insertion ($\mathrm{ins}(v, p, k, m)$): Let $t_1, \dots, t_f$ be the subtrees rooted in the children of $p$. Then the preorder traversal of the subtree rooted in $p$ is

$p\ \mathrm{pre}(t_1) \dots \mathrm{pre}(t_{k-1})\ \mathrm{pre}(t_k) \dots \mathrm{pre}(t_{k+m-1})\ \mathrm{pre}(t_{k+m}) \dots \mathrm{pre}(t_f)$.

Inserting $v$ moves the subtrees $t_k, \dots, t_{k+m-1}$ below $v$:

$p\ \mathrm{pre}(t_1) \dots \mathrm{pre}(t_{k-1})\ v\ \mathrm{pre}(t_k) \dots \mathrm{pre}(t_{k+m-1})\ \mathrm{pre}(t_{k+m}) \dots \mathrm{pre}(t_f)$.

The string distance is 1. The argument for postorder is analogous.

Deletion: Inverse of insertion.

Rename: A node rename replaces a single character of the string.
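The insertion case of the proof can be checked on a small tree. In this sketch, `insert_node` is a hypothetical helper that applies the insertion at the root of the given tree only; `k` is assumed to be 1-based and `m` the number of children that move below the new node. Trees are encoded as `(label, list of child trees)`.

```python
def pre(tree):
    label, children = tree
    return label + "".join(pre(child) for child in children)


def post(tree):
    label, children = tree
    return "".join(post(child) for child in children) + label


def insert_node(p, v, k, m):
    """Insert a node labelled v as k-th child of p; children k..k+m-1 move below v."""
    label, children = p
    i = k - 1
    new_node = (v, children[i:i + m])
    return (label, children[:i] + [new_node] + children[i + m:])


before = ("p", [("x", []), ("y", []), ("z", [])])
after = insert_node(before, "v", 2, 2)  # v adopts the subtrees y and z

print(pre(before), pre(after))    # pxyz pxvyz
print(post(before), post(after))  # xyzp xyzvp
```

In both traversal strings exactly one character, v, is inserted, so each string edit distance is 1.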

(10)

Illustration for the Lower Bound Proof (Preorder)

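[Figure: node $p$ with the child subtrees $t_1, \dots, t_f$ before and after $\mathrm{ins}(v, p, k, m)$; $\mathrm{del}(v)$ is the reverse direction. After the insertion, $v$ is a child of $p$ between $t_{k-1}$ and $t_{k+m}$ and the parent of $t_k, \dots, t_{k+m-1}$.]

Before: $p\ \mathrm{pre}(t_1) \dots \mathrm{pre}(t_{k-1})\ \mathrm{pre}(t_k) \dots \mathrm{pre}(t_{k+m-1})\ \mathrm{pre}(t_{k+m}) \dots \mathrm{pre}(t_f)$

After: $p\ \mathrm{pre}(t_1) \dots \mathrm{pre}(t_{k-1})\ v\ \mathrm{pre}(t_k) \dots \mathrm{pre}(t_{k+m-1})\ \mathrm{pre}(t_{k+m}) \dots \mathrm{pre}(t_f)$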

(11)

Lower Bound

From the lower bound theorem it follows that

$\max(\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)),\ \delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2))) \le \delta_t(T_1, T_2)$

where $\delta_s$ and $\delta_t$ are the string and the tree edit distance, respectively.

The string edit distance can be computed faster:

string edit distance runtime: $O(n^2)$
tree edit distance runtime: $O(n^3)$

Similarity join: match all trees with $\delta_t(T_1, T_2) \le \tau$

if $\max(\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)),\ \delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2))) > \tau$ then $\delta_t(T_1, T_2) > \tau$
thus we do not have to compute the expensive tree edit distance
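A minimal sketch of this filter, assuming trees encoded as `(label, list of child trees)` and the standard dynamic-programming string edit distance with unit costs. The names `traversal_lower_bound` and `can_prune` are hypothetical. The two test trees are the ones with traversal strings fdacbe/abcdef and fcdabe/abdcef.

```python
def pre(tree):
    label, children = tree
    return label + "".join(pre(child) for child in children)


def post(tree):
    label, children = tree
    return "".join(post(child) for child in children) + label


def levenshtein(s, t):
    """String edit distance with unit costs, O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # match or rename
        prev = cur
    return prev[-1]


def traversal_lower_bound(t1, t2):
    """Lower bound on the tree edit distance from both traversal strings."""
    return max(levenshtein(pre(t1), pre(t2)),
               levenshtein(post(t1), post(t2)))


def can_prune(t1, t2, tau):
    """True if the pair cannot be within tree edit distance tau."""
    return traversal_lower_bound(t1, t2) > tau


T1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
T2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
print(traversal_lower_bound(T1, T2))  # 2
print(can_prune(T1, T2, 1))           # True
```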

(12)

Example: Traversal String Lower Bound

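[Figure: trees $T_1$ and $T_2$. Below, each tree is written as root(children); subscripts give the postorder position of a node.]

$T_1 = f_6(d_4(a_1, c_3(b_2)), e_5)$

$\mathrm{pre}(T_1) = \text{fdacbe}$, $\mathrm{post}(T_1) = \text{abcdef}$

$T_2 = f_6(c_4(d_3(a_1, b_2)), e_5)$

$\mathrm{pre}(T_2) = \text{fcdabe}$, $\mathrm{post}(T_2) = \text{abdcef}$

$\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)) = 2$

$\delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2)) = 2$

$\delta_t(T_1, T_2) = 2$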

(13)

Example: Traversal String Lower Bound

The string distances of preorder and postorder may be different.

The string distances and the tree distance may be different.

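[Figure: trees $T_1$ and $T_2$, written below as root(children)]

$T_1 = a(b, a(c))$

$\mathrm{pre}(T_1) = \text{abac}$, $\mathrm{post}(T_1) = \text{bcaa}$

$T_2 = a(b(a, c))$

$\mathrm{pre}(T_2) = \text{abac}$, $\mathrm{post}(T_2) = \text{acba}$

$\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)) = 0$

$\delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2)) = 2$

$\delta_t(T_1, T_2) = 3$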
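Plugging the numbers of this example into the lower bound inequality shows that the bound holds but is not tight:

```latex
\max\bigl(\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)),\
          \delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2))\bigr)
  = \max(0, 2) = 2 \;\le\; 3 = \delta_t(T_1, T_2)
```

For a threshold $\tau = 2$ the lower bound therefore cannot prune this pair, although $\delta_t(T_1, T_2) = 3 > \tau$; the tree edit distance still has to be computed.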

(15)

Upper Bound: Constrained Edit Distance

Edit Mapping

Recall the definition of the edit mapping:

Definition (Edit Mapping)

An edit mapping $M$ between $T_1$ and $T_2$ is a set of node pairs that satisfy the following conditions:

(1) $(a, b) \in M \Rightarrow a \in N(T_1),\ b \in N(T_2)$
(2) for any two pairs $(a, b)$ and $(x, y)$ of $M$:
    (i) $a = x \Leftrightarrow b = y$ (one-to-one condition)
    (ii) $a$ is to the left of $x$ ¹ $\Leftrightarrow$ $b$ is to the left of $y$ (order condition)
    (iii) $a$ is an ancestor of $x$ $\Leftrightarrow$ $b$ is an ancestor of $y$ (ancestor condition)

¹ i.e., $a$ precedes $x$ in both preorder and postorder
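The three conditions can be tested mechanically with preorder and postorder ranks: a node is an ancestor of another if it comes earlier in preorder and later in postorder, and it is to the left if it comes earlier in both. The sketch below assumes unique node labels within each tree (so that labels identify nodes) and trees encoded as `(label, list of child trees)`; the function names are hypothetical.

```python
from itertools import product


def ranks(tree):
    """Map each label to its (preorder rank, postorder rank); labels must be unique."""
    result, counters = {}, {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        pre_rank = counters["pre"]
        counters["pre"] += 1
        for child in children:
            visit(child)
        result[label] = (pre_rank, counters["post"])
        counters["post"] += 1

    visit(tree)
    return result


def is_ancestor(r, a, x):
    return r[a][0] < r[x][0] and r[a][1] > r[x][1]


def is_left_of(r, a, x):
    return r[a][0] < r[x][0] and r[a][1] < r[x][1]


def is_edit_mapping(t1, t2, mapping):
    r1, r2 = ranks(t1), ranks(t2)
    if any(a not in r1 or b not in r2 for a, b in mapping):
        return False  # condition (1)
    for (a, b), (x, y) in product(mapping, repeat=2):
        if (a == x) != (b == y):
            return False  # one-to-one condition
        if is_left_of(r1, a, x) != is_left_of(r2, b, y):
            return False  # order condition
        if is_ancestor(r1, a, x) != is_ancestor(r2, b, y):
            return False  # ancestor condition
    return True


t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("c", []), ("b", [])])
print(is_edit_mapping(t1, t2, [("a", "a"), ("b", "b")]))              # True
print(is_edit_mapping(t1, t2, [("a", "a"), ("b", "b"), ("c", "c")]))  # False
```

The second mapping fails the order condition: b is to the left of c in the first tree but to the right of c in the second.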

(16)

Constrained Edit Distance

We compute a special case of the edit distance to get a faster algorithm.

$\mathrm{lca}(a, b)$ is the lowest common ancestor of $a$ and $b$.

Additional requirement on the mapping $M$:

(4) for any pairs $(a_1, b_1)$, $(a_2, b_2)$, $(x, y)$ of $M$:
    $\mathrm{lca}(a_1, a_2)$ is a proper ancestor of $x$ $\Leftrightarrow$ $\mathrm{lca}(b_1, b_2)$ is a proper ancestor of $y$.

Intuition: Distinct subtrees of $T_1$ are mapped to distinct subtrees of $T_2$.

(17)

Example: Constrained Edit Distance

[Figure: trees $T_1$ and $T_2$; the constrained mapping is drawn with dashed lines, the unconstrained mapping with dotted lines]

Constrained edit distance (dashed lines): $\delta_c(T_1, T_2) = 5$

constrained mapping $M_c = \{(a, a), (d, d), (c, i), (f, f), (g, g)\}$
edit sequence: ren(c, i), del(b), del(e), ins(h), ins(e)

Unconstrained edit distance (dotted lines): $\delta_t(T_1, T_2) = 3$

mapping $M_t = \{(a, a), (d, d), (e, e), (c, i), (f, f), (g, g)\}$
edit sequence: ren(c, i), del(b), ins(h)

(18)

Example: Constrained Edit Distance

[Figure: the trees $T_1$ and $T_2$ of the previous example]

$(e, e)$ violates the 4th condition of the constrained mapping:

assume $(e, e), (f, f), (d, d) \in M_c$
$\mathrm{lca}(e, f)$ in $T_1$ is $a$
$a$ is a proper ancestor of $d$ in $T_1$
$\mathrm{lca}(e, f)$ in $T_2$ is $h$
$h$ is not a proper ancestor of $d$ in $T_2$
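The argument can be replayed in code. The tree shapes below are an assumption: they are reconstructed from the flattened figure and the edit sequences of the example, with $T_1 = a(b(d, e), c(f, g))$ and $T_2 = a(d, h(e, i(f, g)))$. Each tree is stored as a hypothetical child-to-parent dictionary, which works here because the labels are unique within each tree.

```python
from itertools import product


def path_to_root(parent, node):
    """The node followed by all of its ancestors up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(parent, u, v):
    """Lowest common ancestor of u and v (a node counts as its own ancestor)."""
    on_u_path = set(path_to_root(parent, u))
    for w in path_to_root(parent, v):
        if w in on_u_path:
            return w
    raise ValueError("nodes are not in the same tree")


def is_proper_ancestor(parent, a, x):
    return a != x and a in path_to_root(parent, x)


def satisfies_condition_4(p1, p2, mapping):
    for (a1, b1), (a2, b2), (x, y) in product(mapping, repeat=3):
        left = is_proper_ancestor(p1, lca(p1, a1, a2), x)
        right = is_proper_ancestor(p2, lca(p2, b1, b2), y)
        if left != right:
            return False
    return True


# Assumed shapes, see the lead-in: child -> parent.
P1 = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c", "g": "c"}
P2 = {"d": "a", "h": "a", "e": "h", "i": "h", "f": "i", "g": "i"}

M_c = [("a", "a"), ("d", "d"), ("c", "i"), ("f", "f"), ("g", "g")]
M_t = M_c + [("e", "e")]

print(satisfies_condition_4(P1, P2, M_c))  # True
print(satisfies_condition_4(P1, P2, M_t))  # False
```

Under these assumed shapes, adding the pair (e, e) breaks the condition for exactly the reason given above: lca(e, f) is a in the first tree and h in the second, and only a is a proper ancestor of d.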

(19)

Complexity of the Constrained Edit Distance

Theorem (Complexity of the Constrained Edit Distance)

Let $T_1$ and $T_2$ be two trees with $|T_1|$ and $|T_2|$ nodes, respectively. There is an algorithm that computes the constrained edit distance between $T_1$ and $T_2$ with runtime $O(|T_1|\,|T_2|)$.

Proof.

See [Zha95, GJK⁺02].

(20)

Constrained Edit Distance: Upper Bound

Theorem (Upper Bound)

Let $T_1$ and $T_2$ be two trees, let $\delta_t(T_1, T_2)$ be the unconstrained and $\delta_c(T_1, T_2)$ be the constrained tree edit distance, respectively. Then

$\delta_t(T_1, T_2) \le \delta_c(T_1, T_2)$

Proof.

See [GJK⁺02].

(21)

Use of the Upper Bound

The constrained edit distance can be computed faster:

constrained edit distance runtime: $O(n^2)$
unconstrained edit distance runtime: $O(n^3)$

Similarity join: match all trees with $\delta_t(T_1, T_2) \le \tau$

if $\delta_c(T_1, T_2) \le \tau$ then also $\delta_t(T_1, T_2) \le \tau$
thus we do not have to compute the expensive tree edit distance
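Together with the traversal string lower bound, the constrained edit distance brackets the tree edit distance from both sides:

```latex
\max\bigl(\delta_s(\mathrm{pre}(T_1), \mathrm{pre}(T_2)),\
          \delta_s(\mathrm{post}(T_1), \mathrm{post}(T_2))\bigr)
  \;\le\; \delta_t(T_1, T_2) \;\le\; \delta_c(T_1, T_2)
```

Only pairs for which the lower bound is at most $\tau$ and the upper bound is greater than $\tau$ remain undecided and require the tree edit distance itself.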

(22)

Summary

Tree Edit Distance Complexity Search Space Reduction

Lower Bound: Traversal Strings

Upper Bound: Constrained Edit Distance
