5.1.2 The HHK Algorithm

The name HHK stems from the initials of its authors' last names: the algorithm was proposed as EfficientSimilarity by Henzinger, Henzinger, and Kopke [69]. The original algorithm contained a bug, which was pointed out and fixed by Ranzato [117]. Here, we refer to the fixed version of HHK.

As the name EfficientSimilarity suggests, HHK computes similarity classes. We adapt their algorithm to solve the simulation problem between labeled graphs Q and G with simulation candidate S_0. Any binary relation R ⊆ A × B has a characteristic function χ_R : A → 2^B with χ_R(a) := {b ∈ B | (a, b) ∈ R}. For (dual) simulations S ⊆ V_Q × V between graph pattern Q and data graph G, χ_S associates with each node v ∈ V_Q a set of (dual) simulating nodes χ_S(v) ⊆ V. Updating χ_S(v) (v ∈ V_Q) means updating S, e.g., χ_S(v) \ {w} translates to S \ {(v, w)}.
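The correspondence between a relation and its characteristic function can be illustrated with a minimal Python sketch. The function names `characteristic` and `to_relation` and the small relation `S0` are hypothetical and chosen only for illustration.

```python
def characteristic(relation, domain):
    """Return chi_R as a dict: each a in domain maps to {b | (a, b) in relation}."""
    chi = {a: set() for a in domain}
    for a, b in relation:
        chi[a].add(b)
    return chi


def to_relation(chi):
    """Inverse direction: rebuild the set of pairs from chi."""
    return {(a, b) for a, bs in chi.items() for b in bs}


# Hypothetical candidate relation over pattern nodes "v", "w" and data nodes 0..3.
S0 = {("w", 0), ("v", 1), ("v", 2), ("v", 3)}
chi = characteristic(S0, {"v", "w"})

chi["v"].discard(3)      # chi_S(v) \ {3} ...
S = to_relation(chi)     # ... corresponds to S_0 \ {("v", 3)}
```

Storing χ_S as one set per pattern node makes the membership test and the removal of a single candidate cheap, which is what the algorithm below relies on.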

One problem of Algorithm 1, tackled by HHK, is that it always iterates over all edges of Q, whether or not this is necessary.

Example 5.2 Consider the extension of Example 5.1 by a single b-labeled edge, as depicted in Figure 5.2. Q is the pattern depicted in Figure 5.2(a) and G the graph in Figure 5.2(b). Let us further assume S_0 = {(w, 0), (v, 1), (v, 2), ..., (v, k)}, already an optimization upon V_Q × V. As before, the first iteration removes (v, k), the second (v, k−1), and so forth. After k−1 iterations, S contains (w, 0) and (v, 1). We already know that (v, 1) will be removed, but only after its removal can (w, 0) be removed from S as well, leaving us with the empty simulation after at most k+1 iterations.

To complete a single iteration, Algorithm 1 checks for edges (v, a, v) ∈ E_Q as well as (w, b, v) ∈ E_Q. The former edge leads to the necessary removals. However, consideration of the latter edge does not influence the computation at all, unless (v, 1) is removed. Hence, in up to k iterations, edge (w, b, v) ∈ E_Q is traversed unnecessarily.

To overcome such unnecessary traversals, Henzinger et al. introduce so-called remove sets.

In our setting, there is a remove set for every pattern node v ∈ V_Q and every label a ∈ Σ. Remove_a(v) stores all nodes u′ ∈ V that cannot reach any simulating node of v, i.e., for all u′ ∈ Remove_a(v), there is no v′ ∈ χ_S(v) with u′ E_a v′. Thus, an edge w E_Q^a v is only considered if Remove_a(v) ≠ ∅, i.e., if there is a potential to update the simulation candidates of w.

Example 5.3 Reconsider Q, G, and S_0 as in Example 5.2. The remove sets are initialized to Remove_a(v) = {0, k}, Remove_b(v) = {1, 2, ..., k}, and Remove_a(w) = Remove_b(w) = ∅. The set Remove_a(v) collects all the nodes of G that cannot reach χ_{S_0}(v). In this particular case, it is the set of all nodes that do not have an outgoing a-labeled edge.
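The initialization of the remove sets for v can be reproduced with a short Python sketch. Since Figure 5.2 is not reproduced here, the data graph below is a reconstruction from the description in the examples and therefore an assumption: a b-labeled edge from 0 to 1 followed by an a-labeled chain 1 → 2 → ... → k, with k = 4. The function name `init_remove_sets` and the triple encoding of edges are likewise hypothetical.

```python
def init_remove_sets(V, E, chi, labels):
    """Remove_a(v) = V minus all a-predecessors of nodes in chi[v]."""
    remove = {}
    for v, candidates in chi.items():
        for a in labels:
            # Nodes with an a-labeled edge into some candidate of v.
            can_reach = {x for (x, lbl, y) in E if lbl == a and y in candidates}
            remove[(a, v)] = set(V) - can_reach
    return remove


# Assumed reconstruction of the data graph G of Figure 5.2(b) for k = 4.
k = 4
V = set(range(k + 1))
E = {(0, "b", 1)} | {(i, "a", i + 1) for i in range(1, k)}

# chi_{S_0}: v may be simulated by 1..k, w only by 0.
chi = {"v": set(range(1, k + 1)), "w": {0}}
remove = init_remove_sets(V, E, chi, ["a", "b"])
```

Under these assumptions, `remove[("a", "v")]` is {0, 4} and `remove[("b", "v")]` is {1, 2, 3, 4}, matching the example for k = 4. Applied literally, the formula yields all of V for the remove sets of w, whereas the example lists them as empty; since no pattern edge leads into w, these sets never trigger an update either way.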

In the first iteration of HHK, (w, b, v) ∈ E_Q is considered because Remove_b(v) ≠ ∅. By processing (w, b, v), we reduce χ_S(w) by all the nodes in Remove_b(v). In this case, χ_S(w) is not updated, but after processing Remove_b(v), there is no need to remember the nodes we just removed. Thus, we set Remove_b(v) to ∅. Furthermore, we process (v, a, v) ∈ E_Q, which removes the pair (v, k) from S. HHK now recomputes Remove_a(v) to {k−1}. We explain the details of this update below.

It is important to notice that in the next iteration, Remove_a(v) = {k−1} and Remove_b(v) = ∅. Thus, edge (w, b, v) ∈ E_Q is not considered and will not be reconsidered until (v, 1) has been removed from S.

Correctly maintaining the remove sets is the key to the improved combined complexity of HHK, depicted in Algorithm 2. The integration of remove sets reduces the combined complexity of Algorithm 1 to O(|V_Q| · |V|³) [69]. In addition to initializing the working relation S in Line 1, the remove sets are initialized in Lines 2 to 4 as explained in Example 5.3. The algorithm proceeds by picking nodes v ∈ V_Q with non-empty remove sets, i.e., Remove_a(v) ≠ ∅ (a ∈ Σ), and considers every edge u E_Q^a v for an update of χ_S(u). Recall that Remove_a(v) contains all nodes of the database that cannot reach, by an a-labeled edge, any node simulating v. These nodes must be removed from χ_S(u) (Line 10). Upon removal of a node w ∈ Remove_a(v) from χ_S(u), w is no longer a candidate for simulating u. This information must be propagated to all predecessors of u. Therefore, for every node w′ that reaches w by some b-labeled edge but reaches no other node in χ_S(u) by such an edge, the respective remove set of u must include w′ (cf. Line 13), as it represents a potential to update predecessor nodes of u.

Algorithm 2: The HHK Algorithm

Input:  Q = (V_Q, Σ, E_Q), G = (V, Σ, E), and S_0 ⊆ V_Q × V.
Output: Greatest simulation S ⊆ S_0 between Q and G.

 1  S ← S_0;
 2  forall v ∈ V_Q, a ∈ Σ do
 3      Remove_a(v) ← V \ ⋃_{w ∈ χ_S(v)} E_a w;
 4  end
 5  while there are v ∈ V_Q and a ∈ Σ with Remove_a(v) ≠ ∅ do
 6      Remove ← Remove_a(v);
 7      Remove_a(v) ← ∅;
 8      forall w ∈ Remove and u ∈ E_Q^a v do
 9          if w ∈ χ_S(u) then
10              χ_S(u) ← χ_S(u) \ {w};
11              forall b ∈ Σ and w′ ∈ E_b w do
12                  if w′E_b ∩ χ_S(u) = ∅ then
13                      Remove_b(u) ← Remove_b(u) ∪ {w′};
14                  end
15              end
16          end
17      end
18  end

Example 5.4 (Example 5.3 continued) Recall that, initially, Remove_a(v) = {0, k} and Remove_b(v) = {1, 2, ..., k}. Let us first process Remove_b(v). As there is only a single edge to consider, every element of Remove_b(v) is tested for membership in χ_S(w) in Line 9. The tests will all be negative, Remove_b(v) = ∅ afterwards, and we proceed with Remove_a(v) and edge (v, a, v) ∈ E_Q. Since 0 ∉ χ_S(v) but k ∈ χ_S(v), k is removed from χ_S(v) (in Line 10). Now every predecessor of k in G, here only k−1, is checked for an a-labeled edge from k−1 to some node in χ_S(v) (which has just been updated). If there is none, k−1 is added to Remove_a(v). Hence, after this iteration, Remove_a(v) = {k−1} and all other remove sets are empty. Next, k−1 will be removed from χ_S(v) and Remove_a(v) is updated to {k−2}.

This procedure repeats until node 1 is removed from χ_S(v), i.e., eventually χ_S(v) = ∅. Node 1 has a b-predecessor, namely node 0, and it cannot reach any node in χ_S(v) after 1 has been removed. Thus, Remove_b(v) must be updated to {0}. Next, this remove set together with the edge (w, b, v) ∈ E_Q is considered and χ_S(w) is updated to ∅.

The resulting relation is S = ∅, verifying that G does not simulate Q.

Note that Lines 6 and 7 pick a remove set and mark it as processed. The reason for storing the current remove set in a local variable is that for a self-loop, u = v, and the nodes that were just removed must not be mixed up with the nodes that are to be removed after the iteration. Example 5.4 exemplifies this algorithmic behavior for node v and its remove set Remove_a(v).
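Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation of [69]: graphs are encoded as sets of (source, label, target) triples, sets are plain Python sets rather than sorted arrays, and the function name `hhk` and all helper names are hypothetical. The data graph is again the assumed reconstruction of Figure 5.2 for k = 4. Comments refer to the line numbers of Algorithm 2.

```python
def hhk(VQ, EQ, V, E, S0, labels):
    """Greatest simulation S within S0 between pattern Q and data graph G."""
    # Line 1: working relation S, stored as chi_S.
    chi = {v: set() for v in VQ}
    for v, x in S0:
        chi[v].add(x)

    # Index data-graph edges by label for predecessor/successor lookups.
    pred, succ = {}, {}
    for x, a, y in E:
        pred.setdefault((a, y), set()).add(x)
        succ.setdefault((a, x), set()).add(y)

    # Pattern predecessors: (a, v) -> {u | (u, a, v) in EQ}.
    qpred = {}
    for u, a, v in EQ:
        qpred.setdefault((a, v), set()).add(u)

    # Lines 2-4: Remove_a(v) = V minus a-predecessors of chi_S(v).
    remove = {}
    for v in VQ:
        for a in labels:
            reach = set()
            for w in chi[v]:
                reach |= pred.get((a, w), set())
            remove[(a, v)] = set(V) - reach

    # Lines 5-18.
    while True:
        key = next((k for k, s in remove.items() if s), None)
        if key is None:
            break
        a, v = key
        current = remove[key]      # Line 6: local copy of the picked set
        remove[key] = set()        # Line 7: fresh set, so self-loops do not mix
        for u in qpred.get((a, v), ()):
            for w in current:
                if w in chi[u]:                      # Line 9
                    chi[u].discard(w)                # Line 10
                    for b in labels:                 # Line 11
                        for w1 in pred.get((b, w), ()):
                            # Line 12: w1 has no b-successor left in chi_S(u).
                            if not (succ.get((b, w1), set()) & chi[u]):
                                remove[(b, u)].add(w1)   # Line 13
    return {(v, x) for v, xs in chi.items() for x in xs}


# Pattern Q: self-loop (v, a, v) and edge (w, b, v), as in Example 5.2.
VQ = ["v", "w"]
EQ = {("v", "a", "v"), ("w", "b", "v")}

# Assumed reconstruction of G for k = 4: 0 -b-> 1 -a-> 2 -a-> 3 -a-> 4.
k = 4
V = set(range(k + 1))
E = {(0, "b", 1)} | {(i, "a", i + 1) for i in range(1, k)}
S0 = {("w", 0)} | {("v", i) for i in range(1, k + 1)}

result = hhk(VQ, EQ, V, E, S0, ["a", "b"])
```

On this input, the sketch ends with the empty relation, in line with Example 5.4. Because `remove[key]` is replaced by a fresh set before the loop body runs, additions made in Line 13 for a self-loop end up in the new set and not in the one currently being iterated.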

The first key to the complexity of HHK is the observation that if we pick Remove_a(v) in iteration i, then in any later iteration, Remove_a(v) is disjoint from its version in iteration i [69]. Thus, for every node v ∈ V_Q and every label a, we have to consider at most |V| many disjoint sets Remove_a(v), i.e., |Σ(Q)| · |V| many in total². Furthermore, each combination of w ∈ Remove and u E_Q^a v occurs at most once, and the test w ∈ χ_S(u) evaluates to true at most once because w is removed from χ_S(u) (Line 10) after Line 9 has been passed. There are at most |V| many a-predecessors of w (Line 11). Thus, the inner for-loop amounts to at most |Σ(Q)| · |V| iterations. In each of these iterations, the test in Line 12 is performed in O(|V|), once again assuming that w′E_a and χ_S(u) are stored in sorted order.

² Σ(Q) = {a | (v, a, w) ∈ E_Q}

Thus, the overall combined complexity of HHK (Algorithm 2) is O(|V_Q| · |Σ(Q)|² · |V|³). Compared to the result of Henzinger et al. [69], we get the factor of |Σ(Q)|² as our setting incorporates edge labels. Translated back to their setting, i.e., no edge labels and Q = G, we get a final complexity of O(|V|⁴) or O(|E| · |V|²). Henzinger et al., however, conclude their algorithm to be in O(|E| · |V|) because they make use of an additional data structure to perform Line 12 in O(1).

Consider a single label a ∈ Σ. Then count_a : V × V_Q → ℕ is a two-dimensional array of non-negative integers maintained to preserve (5.1).

    count_a(w′, u) = |w′E_a ∩ χ_S(u)|        (5.1)

Upon removing w from χ_S(u), count_a(w′, u) is decremented by 1 for every a-predecessor w′ of w. The test in Line 12 is then reduced to checking whether count_a(w′, u) = 0.
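The maintenance of the counters can be sketched in Python as follows. As an assumption made for brevity, the counters are kept in a dictionary keyed by (label, data node, pattern node) instead of one dense two-dimensional array per label; the function names `build_pred`, `init_counts`, and `remove_candidate` are hypothetical, and the data graph is the assumed chain for k = 4 used in the examples.

```python
def build_pred(E):
    """Index edges: (label, target) -> set of sources."""
    pred = {}
    for x, a, y in E:
        pred.setdefault((a, y), set()).add(x)
    return pred


def init_counts(E, chi):
    """count[(a, w1, u)] = number of a-successors of w1 inside chi[u], cf. (5.1)."""
    count = {}
    for x, a, y in E:
        for u, candidates in chi.items():
            if y in candidates:
                count[(a, x, u)] = count.get((a, x, u), 0) + 1
    return count


def remove_candidate(w, u, chi, count, pred, labels):
    """Remove w from chi[u], update counters, return the (b, w1) pairs that hit 0."""
    if w not in chi[u]:
        return []
    chi[u].discard(w)
    zeroed = []
    for b in labels:
        for w1 in pred.get((b, w), ()):
            count[(b, w1, u)] -= 1
            if count[(b, w1, u)] == 0:    # constant-time version of Line 12
                zeroed.append((b, w1))
    return zeroed


# Assumed chain graph for k = 4: 0 -b-> 1 -a-> 2 -a-> 3 -a-> 4.
k = 4
E = {(0, "b", 1)} | {(i, "a", i + 1) for i in range(1, k)}
chi = {"v": set(range(1, k + 1)), "w": {0}}

pred = build_pred(E)
count = init_counts(E, chi)
zeroed = remove_candidate(4, "v", chi, count, pred, ["a", "b"])
```

Each pair returned in `zeroed` names a node w′ and a label b for which w′ would be added to Remove_b(u) in Line 13. The dictionary only stores entries for existing edges, which softens but does not remove the memory cost discussed next.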

On the one hand, count_a is beneficial as it makes the cost of Line 12 constant. On the other hand, employing these matrices is quite memory-consuming as their entries are integer values. For every node v ∈ V_Q and every label a ∈ Σ(Q), count_a(·, v) stores O(|V|) many integer values to enable the constant-time check in Line 12. Given the current trend of node set sizes in knowledge graphs like Wikidata [87], the count structures appear infeasible. Furthermore, creating this auxiliary structure is quite time-consuming.

They regain their benefits if query and data graph have similar extents, e.g., in the original setting of similarity checking, where two (enormous) state spaces are compared with each other. Hence, we conclude that HHK solves the simulation problem with a data complexity of O(|V|²) with count arrays and O(|V|³) without them.