The HHK Algorithm - Maximal Dual Simulations for SPARQL


5.1.2 The HHK Algorithm

The name HHK stems from the initials of the authors' last names: the algorithm was proposed as EfficientSimilarity by Henzinger, Henzinger, and Kopke [69]. The original algorithm contained a bug, pointed out and fixed by Ranzato [117]. Here, we refer to the fixed version of HHK.

As the name EfficientSimilarity suggests, HHK computes similarity classes. We adapt their algorithm to solve the simulation problem between labeled graphs Q and G with simulation candidate S_0. Any binary relation R ⊆ A × B has a characteristic function χ_R : A → 2^B with χ_R(a) := {b ∈ B | (a, b) ∈ R}. For (dual) simulations S ⊆ V_Q × V between graph pattern Q and data graph G, χ_S associates with each node v ∈ V_Q a set of (dual) simulating nodes χ_S(v) ⊆ V. Updating χ_S(v) (v ∈ V_Q) means updating S, e.g., χ_S(v) \ {w} translates to S \ {(v, w)}.
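The correspondence between a relation and its characteristic function can be illustrated with a small Python sketch. The function names characteristic and as_relation, as well as the concrete candidate relation with k = 3, are assumptions made for illustration only.

```python
from collections import defaultdict

def characteristic(relation):
    """Return chi_R as a dict mapping each a to the set {b | (a, b) in R}."""
    chi = defaultdict(set)
    for a, b in relation:
        chi[a].add(b)
    return chi

def as_relation(chi):
    """Inverse direction: rebuild the set of pairs from chi."""
    return {(a, b) for a, bs in chi.items() for b in bs}

# Hypothetical candidate relation: pattern nodes "w", "v"; data nodes 0..3.
S0 = {("w", 0), ("v", 1), ("v", 2), ("v", 3)}
chi = characteristic(S0)

# Removing node 3 from chi_S("v") is the same as removing ("v", 3) from S.
chi["v"].discard(3)
updated = as_relation(chi)
```

Storing S as a dictionary of sets makes the update χ_S(v) \ {w} a single set operation, which is the form in which the relation is manipulated in the remainder of this section.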

One problem of Algorithm 1, tackled by HHK, is that it always iterates over all edges of Q, whether or not this is necessary.

Example 5.2 Consider the extension of Example 5.1 by a single b-labeled edge, as depicted in Figure 5.2. Q is the pattern depicted in Figure 5.2(a) and G the graph in Figure 5.2(b). Let us further assume S_0 = {(w, 0), (v, 1), (v, 2), . . . , (v, k)}, already an optimization upon V_Q × V. As before, the first iteration removes (v, k), the second (v, k−1), and so forth. After k−1 iterations, S contains (w, 0) and (v, 1). We already know that (v, 1) will be removed, but only then can (w, 0) also be removed from S, leaving us with the empty simulation after at most k+1 iterations.

To complete a single iteration, Algorithm 1 checks the edges (v, a, v) ∈ E_Q as well as (w, b, v) ∈ E_Q. The former edge leads to the necessary removals. However, consideration of the latter edge does not influence the computation at all, unless (v, 1) is removed. Hence, in up to k iterations, edge (w, b, v) ∈ E_Q is traversed unnecessarily.

To overcome such unnecessary traversals, Henzinger et al. introduce so-called remove sets.

In our setting, there is a remove set for every label a ∈ Σ. Remove_a(v) stores all nodes u′ ∈ V that cannot reach any simulating node of v by an a-labeled edge, i.e., for all u′ ∈ Remove_a(v), there is no v′ ∈ χ_S(v) with u′ E^a v′. Thus, an edge w E_Q^a v is only considered if Remove_a(v) ≠ ∅, i.e., if there is a potential to update the simulation candidates of w.

Example 5.3 Reconsider Q, G, and S_0 as in Example 5.2. The remove sets are initialized to Remove_a(v) = {0, k}, Remove_b(v) = {1, 2, . . . , k}, and Remove_a(w) = Remove_b(w) = ∅. The set Remove_a(v) collects all the nodes of G that cannot reach χ_{S_0}(v). In this particular case, it is the set of all nodes that do not have an outgoing a-labeled edge.

In the first iteration of HHK, (w, b, v) ∈ E_Q is considered because Remove_b(v) ≠ ∅. By processing (w, b, v), we reduce χ_S(w) by all the nodes in Remove_b(v). In this case, χ_S(w) is not updated, but after processing Remove_b(v), there is no need to remember the nodes we just removed. Thus, we set Remove_b(v) to ∅. Furthermore, we process (v, a, v) ∈ E_Q, which removes the pair (v, k) from S. HHK now recomputes Remove_a(v) to {k−1}. We explain the details of this update below.

It is important to notice that in the next iteration, Remove_a(v) = {k−1} and Remove_b(v) = ∅. Thus, edge (w, b, v) ∈ E_Q is not considered and will not be reconsidered until (v, 1) has been removed from S.
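As a minimal sketch, the initial remove sets of node v in Example 5.3 can be recomputed from the rule that Remove_a(v) contains every node of V without an a-labeled edge into χ_S(v). The concrete data graph used here (an a-chain 1 → 2 → ... → k plus a single b-edge 0 → 1) is an assumption reconstructed from the description of the example, since Figure 5.2 is not reproduced here. The helper names predecessors and initial_remove and the choice k = 5 are likewise hypothetical.

```python
def predecessors(edges, label, target):
    """E^a w: all nodes with a `label`-edge into `target`."""
    return {s for (s, l, t) in edges if l == label and t == target}

def initial_remove(nodes, edges, chi, label, v):
    """Remove_a(v): V minus the union of E^a w over all w in chi_S(v)."""
    reaching = set()
    for w in chi[v]:
        reaching |= predecessors(edges, label, w)
    return set(nodes) - reaching

k = 5
V = range(k + 1)
# Assumed data graph: 0 -b-> 1 and the a-chain 1 -> 2 -> ... -> k.
E = {(0, "b", 1)} | {(i, "a", i + 1) for i in range(1, k)}
# Candidate relation S_0 of Example 5.2, stored as chi_S.
chi = {"w": {0}, "v": set(range(1, k + 1))}

remove_a_v = initial_remove(V, E, chi, "a", "v")   # {0, k}
remove_b_v = initial_remove(V, E, chi, "b", "v")   # {1, ..., k}
```

Applied literally to w, the same rule yields all of V rather than ∅. Under the assumed pattern this has no effect on the run, because w has no incoming edge in Q and its remove sets therefore never trigger an update.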

Correctly maintaining the remove sets is the key to the improved combined complexity of HHK, depicted in Algorithm 2. The integration of remove sets reduces the combined complexity of Algorithm 1 to O(|V_Q| · |V|³) [69]. In addition to initializing the working relation S in Line 1, the remove sets are initialized in Lines 2 to 4 as explained in Example 5.3. The algorithm proceeds by picking nodes v ∈ V_Q with non-empty remove sets, i.e., Remove_a(v) ≠ ∅ (a ∈ Σ), and considers every edge u E_Q^a v for an update of χ_S(u). Recall that Remove_a(v) contains all nodes from the database that cannot reach, by an a-labeled edge, any node simulating v. These nodes must be removed from χ_S(u) (Line 10). Upon removal of a node w ∈ Remove_a(v) from χ_S(u), w is no longer a candidate for simulating u. This information must be propagated to all predecessors of u. Therefore, for every node w′ that reaches w by some edge but no other node in χ_S(u), the respective remove set of u must include w′ (cf. Line 13), as it represents a potential to update predecessor nodes of u.

Algorithm 2: The HHK Algorithm

Input: Q = (V_Q, Σ, E_Q), G = (V, Σ, E), and S_0 ⊆ V_Q × V.
Output: Greatest simulation S ⊆ S_0 between Q and G.

 1  S ← S_0;
 2  forall v ∈ V_Q, a ∈ Σ do
 3      Remove_a(v) ← V \ ⋃_{w ∈ χ_S(v)} E^a w;
 4  end
 5  while there are v ∈ V_Q and a ∈ Σ with Remove_a(v) ≠ ∅ do
 6      Remove ← Remove_a(v);
 7      Remove_a(v) ← ∅;
 8      forall w ∈ Remove and u ∈ E_Q^a v do
 9          if w ∈ χ_S(u) then
10              χ_S(u) ← χ_S(u) \ {w};
11              forall b ∈ Σ and w′ ∈ E^b w do
12                  if w′E^b ∩ χ_S(u) = ∅ then
13                      Remove_b(u) ← Remove_b(u) ∪ {w′};
14                  end
15              end
16          end
17      end
18  end

Example 5.4 (Example 5.3 continued) Recall that, initially, Remove_a(v) = {0, k} and Remove_b(v) = {1, 2, . . . , k}. Let us first process Remove_b(v). As there is only a single edge to consider, every element of Remove_b(v) is tested for membership in χ_S(w) in Line 9. The tests will all be negative, Remove_b(v) = ∅ afterwards, and we proceed with Remove_a(v) and edge (v, a, v) ∈ E_Q. Since 0 ∉ χ_S(v) but k ∈ χ_S(v), k is removed from χ_S(v) (in Line 10). Now, for every predecessor of k in G, here only k−1, it is checked whether there is some a-labeled edge from k−1 to some node in χ_S(v) (which has just been updated). If not, k−1 is added to Remove_a(v). Hence, after this iteration, Remove_a(v) = {k−1} and all other remove sets are empty. Next, k−1 will be removed from χ_S(v) and Remove_a(v) is updated to {k−2}.

This procedure repeats until node 1 is removed from χ_S(v), i.e., eventually χ_S(v) = ∅. Node 1 has a b-predecessor, namely node 0, which cannot reach any node in χ_S(v) after 1 has been removed. Thus, Remove_b(v) must be updated to {0}. Next, this remove set together with the edge (w, b, v) ∈ E_Q is considered and χ_S(w) is updated to ∅.

The resulting relation is S = ∅, verifying that G does not simulate Q.
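The run described in Example 5.4 can be replayed with a compact Python sketch of Algorithm 2. It is a sketch under stated assumptions rather than the implementation behind the text: the function name hhk, the dictionary-of-sets representation, and the pending worklist that replaces the search for a non-empty remove set in Line 5 are choices made here. The data graph is again the a-chain with a single b-edge reconstructed from the example, and only (forward) simulation is covered, not dual simulation.

```python
from collections import defaultdict

def hhk(VQ, EQ, V, E, S0):
    """Greatest simulation S contained in S0 (sketch of Algorithm 2).

    EQ and E are sets of (source, label, target) triples.
    """
    labels = {l for (_, l, _) in EQ | E}
    pred = defaultdict(set)    # pred[(a, w)] = E^a w
    succ = defaultdict(set)    # succ[(a, w)] = w E^a
    for s, l, t in E:
        pred[(l, t)].add(s)
        succ[(l, s)].add(t)
    pred_q = defaultdict(set)  # pred_q[(a, v)] = E_Q^a v
    for s, l, t in EQ:
        pred_q[(l, t)].add(s)

    # Line 1: working relation S, stored as chi_S.
    chi = {v: {w for (x, w) in S0 if x == v} for v in VQ}
    # Lines 2-4: initial remove sets.
    remove = {}
    for v in VQ:
        for a in labels:
            reaching = set().union(*(pred[(a, w)] for w in chi[v]))
            remove[(v, a)] = set(V) - reaching
    # Worklist holding exactly the keys with a non-empty remove set.
    pending = [key for key, nodes in remove.items() if nodes]
    while pending:                                       # Line 5
        v, a = pending.pop()
        current, remove[(v, a)] = remove[(v, a)], set()  # Lines 6-7
        for w in current:                                # Line 8
            for u in pred_q[(a, v)]:
                if w in chi[u]:                          # Line 9
                    chi[u].discard(w)                    # Line 10
                    for b in labels:                     # Line 11
                        for w1 in pred[(b, w)]:
                            if not (succ[(b, w1)] & chi[u]):  # Line 12
                                if not remove[(u, b)]:
                                    pending.append((u, b))
                                remove[(u, b)].add(w1)   # Line 13
    return {(v, w) for v in VQ for w in chi[v]}

# Example 5.4 with k = 5 (graph reconstructed from the text, an assumption).
k = 5
VQ = {"w", "v"}
EQ = {("v", "a", "v"), ("w", "b", "v")}
V = set(range(k + 1))
E = {(0, "b", 1)} | {(i, "a", i + 1) for i in range(1, k)}
S0 = {("w", 0)} | {("v", i) for i in range(1, k + 1)}
result = hhk(VQ, EQ, V, E, S0)
print(result)  # set()
```

Swapping the remove set into the local variable current before the loop mirrors Lines 6 and 7: for the self-loop (v, a, v), nodes added to Remove_a(v) in Line 13 end up in the fresh set and are processed in a later iteration.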

Note that Lines 6 and 7 pick a remove set and mark it as processed. The reason for storing the current remove set in a local variable is that in a self-loop, u = v, and the nodes that were just removed must not be mixed up with the nodes that must be removed after the iteration. Example 5.4 exemplifies this algorithmic behavior for node v and its remove set Remove_a(v).

The first key to the complexity of HHK is the observation that if we pick Remove_a(v) in iteration i, then in any later iteration, Remove_a(v) is disjoint from its version in i [69]. Thus, for every node v ∈ V_Q and every label a, we have to consider at most |V| many disjoint sets Remove_a(v), that is, |Σ(Q)| · |V| many in total². Furthermore, each combination w ∈ Remove and u E_Q^a v occurs at most once and the test w ∈ χ_S(u) evaluates to true at most once because w is removed from χ_S(u) (Line 10) after Line 9 has been passed. There are at most |V| many a-predecessors of w (Line 11). Thus, the inner for-loop amounts to at most |Σ(Q)| · |V| iterations. In each of these iterations, the test in Line 12 is performed in O(|V|), once again assuming that w′E^a and χ_S(u) are stored in sorted order.

² Σ(Q) = {a | (v, a, w) ∈ E_Q}

Thus, the overall combined complexity of HHK (Algorithm 2) is O(|V_Q| · |Σ(Q)|² · |V|³). Compared to the result of Henzinger et al. [69], we get the factor of |Σ(Q)|² as our setting incorporates edge labels. Translated back to their setting, i.e., no edge labels and Q = G, we get a final complexity of O(|V|⁴) or O(|E| · |V|²). Henzinger et al., however, conclude their algorithm to be in O(|E| · |V|) because they make use of an additional data structure to perform Line 12 in O(1).
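Reading the counting argument as a product of its individual factors, the bound can be written out as follows. This is only a restatement of the estimate given in the text, with the grouping of factors chosen here for readability, and not a tighter analysis.

```latex
\begin{align*}
\underbrace{|V_Q|}_{\text{nodes } v}
\cdot \underbrace{|\Sigma(Q)| \cdot |V|}_{\text{remove sets per } v}
\cdot \underbrace{|\Sigma(Q)| \cdot |V|}_{\text{iterations of Line 11}}
\cdot \underbrace{|V|}_{\text{test in Line 12}}
&= |V_Q| \cdot |\Sigma(Q)|^2 \cdot |V|^3 \\
\text{no edge labels, } Q = G:\quad
|V| \cdot 1^2 \cdot |V|^3 &= |V|^4
\end{align*}
```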

Consider a single label a ∈ Σ. Then count_a : V × V_Q → ℕ is a two-dimensional array of non-negative integers maintained to preserve (5.1).

count_a(w′, u) = |w′E^a ∩ χ_S(u)|    (5.1)

Upon removing w from χ_S(u), count_a(w′, u) is decremented by 1 for every a-predecessor w′ of w. The test in Line 12 is then reduced to checking whether count_a(w′, u) = 0.
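The bookkeeping behind (5.1) can be sketched in Python as follows. The function names init_counts and remove_candidate are hypothetical, the nested dictionaries stand in for the two-dimensional arrays, and the small graph (k = 3) is the one reconstructed from Example 5.2. For brevity the sketch scans the edge set to find predecessors; an actual implementation would presumably use a predecessor index instead.

```python
from collections import defaultdict

def init_counts(E, chi):
    """count[a][(w1, u)] = |w1 E^a ∩ chi_S(u)|, i.e. equation (5.1)."""
    count = defaultdict(lambda: defaultdict(int))
    for (w1, a, w) in E:
        for u, candidates in chi.items():
            if w in candidates:
                count[a][(w1, u)] += 1
    return count

def remove_candidate(E, chi, count, u, w):
    """Remove w from chi_S(u); return the (label, node) pairs whose count hit 0."""
    if w not in chi[u]:
        return set()
    chi[u].discard(w)
    exhausted = set()
    for (w1, a, target) in E:
        if target == w:                  # w1 is an a-predecessor of w
            count[a][(w1, u)] -= 1
            if count[a][(w1, u)] == 0:   # replaces the test in Line 12
                exhausted.add((a, w1))
    return exhausted

# Assumed graph of Example 5.2 with k = 3.
E = {(0, "b", 1), (1, "a", 2), (2, "a", 3)}
chi = {"w": {0}, "v": {1, 2, 3}}
count = init_counts(E, chi)
exhausted = remove_candidate(E, chi, count, "v", 3)   # {("a", 2)}
```

Each returned pair (a, w′) corresponds to one addition to Remove_a(u) in Line 13; in the example, removing node 3 from χ_S(v) reports that node 2 must enter Remove_a(v).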

On the one hand, count_a is beneficial as it makes the cost of Line 12 constant. On the other hand, employing these matrices is quite memory-consuming as their entries are integer values. For every node v ∈ V_Q and every label a ∈ Σ(Q), count_a(·, v) stores O(|V|) many integer values to enable the constant-time check in Line 12. Observing the current trend of the node set sizes in knowledge graphs, like Wikidata [87], lets the count structures appear infeasible. Furthermore, creating this auxiliary structure is quite time-consuming.

They regain their benefits if query and data graph have similar extents, e.g., as in the original setting of similarity checking: comparing two (enormous) state spaces with each other. Hence, we conclude that HHK delivers a solution to the simulation problem with a data complexity of O(|V|²) with count arrays and O(|V|³) without.

From the document Non-Standard Semantics for Graph Query Languages (pages 114-117)