Efficiency - Pruning for Sparql Queries - Non-Standard Semantics for Graph Query Languages

5.3 Pruning for Sparql Queries


(5.34)

Derived from (5.33), we also handle x ≠ c by simply assuming the inverse vector of c. Negation of x = y and general negations of built-in filter conditions would require negated types of inequalities. However, conjunctions of the atomic built-in filter conditions above are manageable. Disjunctions only work as long as, in C₁ ∨ C₂ ∨ ... ∨ C_k, every clause C_i has the form x = c_i (1 ≤ i ≤ k). As soon as the clauses are mixed, we can derive no restrictions at all in terms of additional inequalities. For instance, if a query is filtered by the condition x = c ∨ y = c, we cannot assert additional constraints as in Equation (5.33), since x could be matched by anything as long as c matches y. The same applies to arbitrary mixes of conjunctions and disjunctions.
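The case distinction for disjunctions can be sketched as follows. This is only a minimal illustration under our own assumptions: atomic conditions are modelled as (variable, constant) pairs, and the function name `restriction_from_disjunction` is hypothetical, not part of sparqlSim.

```python
from typing import Dict, List, Optional, Set, Tuple

# An atomic filter condition "variable = constant", e.g. ("x", "c1").
# This pair encoding is an assumption made for illustration only.
Atom = Tuple[str, str]


def restriction_from_disjunction(clauses: List[Atom]) -> Optional[Dict[str, Set[str]]]:
    """Return {variable: allowed constants} if every clause constrains the
    same variable; return None if no restriction can be derived."""
    variables = {var for var, _ in clauses}
    if len(variables) != 1:
        # Mixed (or empty) disjunction: any single variable may be matched
        # by anything, so no additional constraint can be asserted.
        return None
    (var,) = variables
    return {var: {const for _, const in clauses}}
```

For x = c₁ ∨ x = c₂, the sketch restricts x to {c₁, c₂}; for the mixed condition x = c ∨ y = c, it returns `None`, mirroring the argument above.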

As a last construct, we mention bound(x) and ¬bound(x). The latter is implemented by an additional constraint, namely

$$
\sum_{x \in \mathit{Var} \,:\, \beta(x) = x} x \;\le\; 0. \tag{5.35}
$$

The former case requires, once again, negated types of inequalities, which may be implemented but are beyond the scope of this thesis.

For (5.34) and (5.35), we had to use inequalities with complex expressions on their left-hand sides. But as long as only summation is used, they are just shorthands.

Let {x₁, x₂, ..., x_m} = {x ∈ Var | β(x) = x}. Then (5.35) unfolds to

$$
\begin{aligned}
x_1 &\le 0\\
x_2 &\le 0\\
&\;\;\vdots\\
x_m &\le 0
\end{aligned}
\tag{5.36}
$$
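To illustrate why the summed inequality is only a shorthand, the following minimal sketch compares both forms. It assumes that variable valuations are bit-vectors, modelled here as non-negative Python integers, and that summation acts as bitwise OR. Both are assumptions made for illustration, and all function names are hypothetical.

```python
from itertools import product
from typing import Sequence


def summed_holds(values: Sequence[int]) -> bool:
    """Summed form: the OR-sum of all bit-vectors is <= 0, i.e. all-zero."""
    total = 0
    for vector in values:
        total |= vector
    return total == 0


def unfolded_holds(values: Sequence[int]) -> bool:
    """Unfolded form: every single bit-vector is <= 0, i.e. all-zero."""
    return all(vector == 0 for vector in values)


def forms_agree(bits: int, count: int) -> bool:
    """Exhaustively compare both forms for `count` vectors of `bits` bits."""
    return all(
        summed_holds(vs) == unfolded_holds(vs)
        for vs in product(range(2 ** bits), repeat=count)
    )
```

Since no component can be negative, the sum is zero exactly if every summand is zero, which is why the unfolding does not change the set of solutions.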

5.3.4 Efficiency

In this set of experiments, we used our implementation sparqlSim and both database systems, Virtuoso and RDFox (Appendix A.1). To emulate a querying setting incorporating maximal dual simulation pruning for Sparql, we first compute the pruning (DB_prune) with sparqlSim. Afterwards, we load the pruned database DB_prune into the database system and execute the query as described in Appendix A.1. All experiments were performed on is69.

The results have recently been published in [93, 92]. We did conduct further experiments using Wikidata queries. However, the results are no more conclusive than the ones we already obtained on DBpedia and LUBM. The reasons for this potentially stem from the 10% sample of Wikidata and the implied selection procedure of Wikidata queries (cf. Appendix A.3).

To be self-contained, we include the result tables of our experiments in Table A.8 in Appendix A.4. Column T_sparqlSim reports the pruning time of our implementation. Columns T(DB₀) and T(DB_prune) contain the running times of the database systems on the full dataset (DB₀) and the pruned dataset (DB_prune). Columns for the combined times of sparqlSim's pruning and the system's querying on the pruned dataset are captioned P.

In summary, the DBMS Virtuoso [47], which builds upon relational database system techniques to answer Sparql queries, is almost always faster than maximal dual simulation pruning combined with querying on the pruned database. Only for query L₂ does our pruning approach have a consistent impact on both database systems on LUBM. In 20 out of 31 queries on DBpedia, RDFox benefits from the reduction by sparqlSim's pruning. Although sparqlSim's effectiveness improves upon the baseline only by 5%, RDFox exhibits consistently lower computation times on the pruned database. With a maximal gain of more than 40 seconds and a minimal loss of less than 2 seconds, sparqlSim's method appears to be a viable complement to RDFox.
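The relationship between the reported columns can be sketched as below. The class and its field names are hypothetical. Measuring the gain as the difference between the time on DB₀ and the combined time P is our assumption for this illustration, not necessarily the measure behind the numbers reported above.

```python
from dataclasses import dataclass


@dataclass
class QueryTiming:
    """Running times (in seconds) for one query on one database system."""
    t_sparqlsim: float  # pruning time of sparqlSim (column T_sparqlSim)
    t_full: float       # query time on the full dataset, T(DB_0)
    t_pruned: float     # query time on the pruned dataset, T(DB_prune)

    @property
    def combined(self) -> float:
        """Column P: pruning time plus query time on the pruned dataset."""
        return self.t_sparqlsim + self.t_pruned

    @property
    def gain(self) -> float:
        """Positive if pruning plus querying beats querying DB_0 directly
        (assumed definition, see the lead-in)."""
        return self.t_full - self.combined
```

With placeholder values such as `QueryTiming(1.0, 5.0, 2.0)`, which are not measurements from Table A.8, the combined time is 3.0 seconds and the gain is 2.0 seconds.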

5.4 Summary

In this chapter, we started with a characterization of several dual simulation-related problems. As it turned out, each problem was reducible to DualSim, the special dual simulation problem, which, given two graphs Q and G and a dual simulation candidate S₀ ⊆ V_Q × V, asks for the greatest dual simulation S ⊆ S₀. Therefore, we concentrated on algorithmic principles and solutions for this particular problem.
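As a reminder of what DualSim asks for, the following sketch computes the greatest dual simulation contained in a candidate S₀ by repeatedly removing unstable pairs. It follows the usual definition of dual simulation on edge-labelled graphs and is a naïve illustration, not the algorithm implemented in sparqlSim. The triple encoding of edges and all names are our assumptions.

```python
from typing import Set, Tuple

Edge = Tuple[str, str, str]  # (source, label, target)
Pair = Tuple[str, str]       # (query node, data node)


def _is_stable(v: str, w: str, q_edges: Set[Edge], g_edges: Set[Edge],
               sim: Set[Pair]) -> bool:
    """Check the forward and backward simulation conditions for (v, w)."""
    for src, label, tgt in q_edges:
        if src == v:
            # Forward step: w needs a `label`-successor simulating tgt.
            if not any(s == w and l == label and (tgt, t) in sim
                       for s, l, t in g_edges):
                return False
        if tgt == v:
            # Backward step: w needs a `label`-predecessor simulating src.
            if not any(t == w and l == label and (src, s) in sim
                       for s, l, t in g_edges):
                return False
    return True


def greatest_dual_simulation(q_edges: Set[Edge], g_edges: Set[Edge],
                             s0: Set[Pair]) -> Set[Pair]:
    """Remove unstable pairs from s0 until a fixpoint is reached."""
    sim = set(s0)
    changed = True
    while changed:
        changed = False
        for v, w in list(sim):
            if not _is_stable(v, w, q_edges, g_edges, sim):
                sim.discard((v, w))
                changed = True
    return sim
```

For the query edge (x, a, y) and the data edges (1, a, 2) and (3, b, 4), starting from all pairs, only (x, 1) and (y, 2) survive.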

To get an overview of existing solutions, we presented a naïve iteration and HHK, the most popular similarity algorithm, as measured by its reception in the literature⁹. Furthermore, we sketched more space-efficient solutions that are not required in our setting (small query/large data graph). Based on graph database-specific assumptions, we developed a system of inequalities (SOI) approach that first transforms the DualSim problem into an instance of an SOI. Finding dual simulations is the same task as finding solutions to the constructed SOI.
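The idea of solving DualSim through an SOI can be sketched as follows. We assume one variable per query node, valued by a set of data nodes, and two inequalities per query edge, one for the forward and one for the backward direction. Python sets stand in for the bit-vectors and bit-matrices of an actual implementation. The encoding and all names are illustrative assumptions, not the interface of sparqlSim.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, label, target)
Adjacency = Dict[str, Set[str]]


def solve_soi(q_edges: Iterable[Edge], g_edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Greatest solution of the SOI derived from the query edges."""
    q_edges = list(q_edges)
    # Per-label adjacency in both directions (stand-in for bit-matrices).
    fwd: Dict[str, Adjacency] = {}
    bwd: Dict[str, Adjacency] = {}
    data_nodes: Set[str] = set()
    for s, label, t in g_edges:
        fwd.setdefault(label, {}).setdefault(s, set()).add(t)
        bwd.setdefault(label, {}).setdefault(t, set()).add(s)
        data_nodes.update((s, t))

    # One variable per query node, initially allowing every data node.
    x: Dict[str, Set[str]] = {}
    for v, _, w in q_edges:
        x.setdefault(v, set(data_nodes))
        x.setdefault(w, set(data_nodes))

    # (lhs, rhs, adj) reads: x[lhs] <= image of x[rhs] under adj.
    soi: List[Tuple[str, str, Adjacency]] = []
    for v, label, w in q_edges:
        soi.append((w, v, fwd.get(label, {})))  # targets are successors
        soi.append((v, w, bwd.get(label, {})))  # sources are predecessors

    unstable = deque(range(len(soi)))
    while unstable:
        lhs, rhs, adj = soi[unstable.popleft()]
        image: Set[str] = set()
        for node in x[rhs]:
            image |= adj.get(node, set())
        narrowed = x[lhs] & image
        if narrowed != x[lhs]:
            x[lhs] = narrowed
            # Inequalities with x[lhs] on their right-hand side may now
            # be violated and must be re-examined.
            unstable.extend(i for i, (_, r, _) in enumerate(soi) if r == lhs)
    return x
```

The worklist here is processed in plain first-in, first-out order. Any other order reaches the same greatest solution, but may need a different number of steps.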

In a first experiment (cf. Section 5.2.6), we were able to show that our algorithm does indeed perform better than the naïve algorithm and HHK. Constructing the count data structure, which is required for an efficient fixpoint retrieval, always took longer than the evaluation without it. Therefore, we presented results only for the cubic version of HHK (cf. Section 5.1.2). In terms of a feasibility evaluation, our solution seems to be the best fit for tasks related to database querying.

Based on our SOI approach, we implemented and formally justified the maximal dual simulation semantics of Chapter 4 as a pruning semantics for Sparql query processing. Its formalization is tightly coupled with our implementation sparqlSim. We gave sketches of how Sparql's union operator and some built-in filter conditions can be reintegrated into the SOI representation.

Briefly, our efficiency evaluation shows that a full-fledged (commercial) database system like Virtuoso often works faster if no external mechanism interferes. Sometimes Virtuoso needed more time to answer the query on the pruned database than on the full one. One reason for this may be that the heuristics used for finding the right query execution plan fail when confronted with the (degenerated) statistics of the pruned database.

Of course, the structure of our experimental setting presupposes that our tool runs as an integrated component in the database management system. Furthermore, the data structures we used are highly optimized towards the necessary fixpoint operations. If we cannot build upon similar indexing techniques, as we used them throughout this chapter, the running times will become much slower. In that case, our techniques are still feasible additions to database systems that already build on bit-matrix representations of the data, e.g., Redisgraph [31].

We observed that the order in which unstable inequalities are processed by Algorithm 3 plays an important rôle in the overall evaluation time. As discussed, we aimed for straightforward optimizations and ordering heuristics (cf. Section 5.2.5), but more involved insights, also based on the mathematical structures, might boost our implementation even further.

⁹ According to the ACM Digital Library, 537 citations on Oct 21, 2019.

CHAPTER 6 Conclusion

The individual chapters already provide summary sections. Therefore, we will only briefly reflect on the goals we set out in Chapter 1. Furthermore, we provide a perspective on simulations we obtained during the process of this thesis.

We believe that tractable graph pattern matching up to (dual) simulations has not been conducted on extensive datasets because the existing algorithms, most of which build upon [86], do not scale well with large data graphs. Tractability, i.e., polynomial worst-case complexity, only accounts for the worst imaginable case. In our experiments, we could show that the worst case occurs only rarely in real data and sometimes in synthetic data. The implementation of our new solution devised for solving the maximal dual simulation problem outperforms competitors by several orders of magnitude, although they sometimes exhibit a better worst-case complexity. Thereby, we could also show that graph pattern matching up to dual simulations handles the usual amount of data in large-scale real-world knowledge graphs. Although our software prototype cannot compete with industry-standard relational database technology in terms of runtime, it provides enhancements for other software prototypes. Our tool can be used as a pruning mechanism for Sparql using conjunctions, disjunctions, unions, and some built-in filter conditions. Thereby, we sketched the potential of dual simulation pattern matching.

Our implementation was grounded on a sound basis of formal results, justifying that our ultimate maximal dual simulation semantics is a correct approximation of the Sparql semantics. At first, it seemed as if dual simulations must not be combined by Sparql operators since, otherwise, tractability or correctness would be lost. Each of the correct semantics we developed contributed an idea that could later be used to tackle the union-closed Sparql semantics of maximal dual simulations.

6.1 Perspective

We devised several semantic notions based on simulations. Right from the beginning, we recaptured graph schemas and their semantics, as given by the modeled database instances. As early as in this step, the mathematical objects of simulations showed an enormous fragility towards change. While the graph schema model had a sound basis on classical simulations, the lack of root nodes in modern graph data immediately turned a well-known preorder into a reflexive relation with hardly any meaning. We had to adopt two assumptions in exchange for root nodes:

1. Connectedness of graph schemas, which is a property guaranteed by root nodes (cf. Section 2.1.1), and

2. incorporation of backward steps as simulation steps, culminating in dual simulations.

The second assumption, i.e., employment of dual simulations, also guarantees a property formerly ensured by root nodes, namely reachability of all other nodes. Once we rectified our simulation preorder, and thereby justified the semantics of graph schemas, we tried to add a small feature, namely mandatory edges, which immediately destroyed any useful properties for deriving a sensible semantics. It was only the fallback to deterministic graph schemas that helped here. According to the literature, the implications and the concrete limits of modality and alternating simulation (i.e., modal refinement) are open [79].

Going onward, the same issues as for simulations occurred with the notions of similarity and bisimilarity in Chapter 3. Also, essentially the same fixes applied. On our last quest in Chapter 4, we only added two operators and immediately rendered the resulting query semantics incomparable, whereas it was still comparable in the case of basic graph patterns (cf. Theorem 3.24).

Based on this brief experience report, we are curious about the following:

1. Are there other mathematically well-founded notions that are fragile in the sense described above?

2. Are there other robust notions? We suspect graph isomorphisms to be a robust matching notion.

3. If Sparql and general relational algebra are not the right query languages for simulations, which one is?
