
The ultimate objective of an RDF triple store is to answer users’ queries. System performance is therefore measured by the response time of a single query, or by the total response time of a set of queries. In this context, we refer to a workload as a set of queries issued during a certain time period within the system’s life span. If the workload’s time period lies in the past, the query set is considered a history workload. Otherwise, a future workload refers to the set of queries that the system is expected to receive at some points in time in the future, where the expectation is expressed as a numerical probability value.

The system’s adaptation to the workload comprises the steps the system takes in order to exploit its history workload, aiming to increase the performance of the future workload. In order to give a precise definition of the workload, we first state the definitions of the basic concepts used within this thesis. We have defined the RDF graph in Chapter 2, Definition 2.1.

The basic element of a SPARQL query is the triple pattern, which we define in the following:

Definition 3.1 (Triple Pattern) A triple pattern t is defined as a triplet t = (ŝ, p̂, ô), where each element is either a constant or a variable. The triple pattern answer is defined as ta = {dt ∈ D | match(t, dt) = 1}.
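To make Definition 3.1 concrete, the following minimal sketch represents triples as plain 3-tuples and variables as strings starting with "?" (both representation choices are illustrative assumptions, not the thesis’s implementation):

```python
# Sketch of triple-pattern matching (Definition 3.1).
# Triples are plain 3-tuples; variables are strings starting with "?".

def is_variable(term):
    return isinstance(term, str) and term.startswith("?")

def match(pattern, data_triple):
    """True iff every constant in the pattern equals the corresponding
    position of the data triple (variables match anything)."""
    return all(is_variable(p) or p == d for p, d in zip(pattern, data_triple))

def answer(pattern, dataset):
    """The triple pattern answer ta = {dt in D | match(t, dt) = 1}."""
    return [dt for dt in dataset if match(pattern, dt)]

D = [("alice", "knows", "bob"),
     ("bob", "knows", "carol"),
     ("alice", "age", "30")]

print(answer(("alice", "knows", "?x"), D))  # [('alice', 'knows', 'bob')]
```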

The SPARQL query and its answer can now be defined precisely:

Definition 3.2 (Query, Query Answer) We refer to a query q as a set of triple patterns {t1, t2, ..., tn}. This set composes a query graph qG = (qV, qE), where qV is the set of graph vertices given by qV = {v | ∃t ∈ q : t = (v, p̂, ô) ∨ t = (ŝ, p̂, v)}, and qE is the set of edges, one per triple pattern. The query answer qa is the set of all sub-graphs of an RDF graph G that match the query graph qG after substituting the corresponding variables. A query graph qG matches G1 = (V1, E1), a connected sub-graph of G, if |qE| = |E1|, |qV| = |V1|, and the following condition holds:

∀e1 ∈ E1, ∃! e2 ∈ qE : match(t, d) = 1, with t = mapToTriple(e2), d = mapToTriple(e1),

where mapToTriple(e) is the function that maps a graph edge e to the corresponding triple as given by Definition 2.2.
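The construction of the query graph in Definition 3.2 can be sketched as follows; the tuple representation of patterns is the same illustrative assumption as above:

```python
# Sketch: building the query graph qG = (qV, qE) from a set of triple
# patterns (Definition 3.2). Vertices are the subjects and objects;
# each triple pattern contributes one edge labelled by its predicate.

def query_graph(patterns):
    qV = set()
    qE = []
    for s, p, o in patterns:
        qV.update((s, o))        # qV collects subject and object terms
        qE.append((s, p, o))     # one labelled edge per triple pattern
    return qV, qE

q = [("?x", "knows", "?y"), ("?y", "livesIn", "berlin")]
qV, qE = query_graph(q)
print(sorted(qV))  # ['?x', '?y', 'berlin']
```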

The query length is an important measurement of a query and is given by the following definition:

Definition 3.3 (Query Length) Given a query q and its query graph qG = (qV, qE), let q̂G be the undirected version of qG. The distance between any two vertices v1, v2 ∈ qV, which we denote as d(v1, v2), is the count of vertices in the shortest path from v1 to v2 in q̂G. The length of q is the maximum distance between any two vertices in its graph, which is given by:

ql = max{∀v1, v2 ∈ qV, d(v1, v2) ≠ ∞} d(v1, v2) (3.2)
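The query length of Definition 3.3 can be computed with a breadth-first search over the undirected query graph. The sketch below follows the definition literally, counting vertices (not edges) on the shortest path; the names and the tuple representation are illustrative assumptions:

```python
from collections import deque
from itertools import combinations

# Sketch of the query length ql (Definition 3.3): the maximum finite
# distance between any two vertices of the undirected query graph,
# where a distance is the count of vertices on the shortest path.

def query_length(qV, qE):
    adj = {v: set() for v in qV}
    for s, _p, o in qE:                # undirected version of qG
        adj[s].add(o)
        adj[o].add(s)

    def dist(src, dst):                # BFS shortest path
        seen, frontier = {src}, deque([(src, 0)])
        while frontier:
            v, d = frontier.popleft()
            if v == dst:
                return d + 1           # count vertices, not edges
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
        return float("inf")            # unreachable pairs are ignored

    finite = [dist(a, b) for a, b in combinations(qV, 2)
              if dist(a, b) != float("inf")]
    return max(finite) if finite else 0

qE = [("?x", "knows", "?y"), ("?y", "livesIn", "berlin")]
qV = {"?x", "?y", "berlin"}
print(query_length(qV, qE))  # 3  (the path ?x -> ?y -> berlin has 3 vertices)
```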

We can now state a clear definition of the workload:

Definition 3.4 (Queries Workload) A collected workload up to time t is defined as a set: Qt={(q1, f1),(q2, f2), ...,(qm, fm)}, where qi is a SPARQL query, and fi is the frequency of its appearance in the workload. The workload answer Qta is the set of the query answers of Qt.

Q(t1, t2) refers to the workload collected in the time period [t1, t2).
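A collected workload Qt as in Definition 3.4 can be sketched as a query-to-frequency mapping; representing each query by a canonical string is an illustrative simplification (a real system would pair parsed query structures with their frequencies):

```python
from collections import Counter

# Sketch of a collected workload Qt = {(q1, f1), ..., (qm, fm)}
# (Definition 3.4): queries paired with their appearance frequencies.

log = [
    "SELECT ?x WHERE { ?x knows bob }",
    "SELECT ?x WHERE { ?x knows bob }",
    "SELECT ?y WHERE { alice knows ?y }",
]
Qt = Counter(log)                      # maps each query to its frequency fi
print(Qt["SELECT ?x WHERE { ?x knows bob }"])  # 2
```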

3.3.1 Real-world Workload Analysis

There are many live services that provide SPARQL endpoints, allowing users to run their own queries on RDF data-sets. The users’ query logs are not publicly available; however, multiple research works [12, 30, 68, 61] have deeply analyzed real query logs and characterized their properties, such that testing queries simulating the real logs can be generated relatively easily. The main observations are the following:

• Frequent patterns often exist, with different levels of distribution and impact. Some of these patterns are frequent only within a very limited time period [12]; such limited periods are explained by users tuning their queries until they get satisfying results.

• There is a detectable correlation between the data set in use and the distribution of complexity among the issued queries. See Figure 3.4.

• There is a correlation between the queries’ shapes and both their evaluation time and their result size.

From the above points, a workload-aware system, in which the workload is used as a measurement subject to adapt the storage structure, should not assume fixed trends. Instead, the system should also adapt to the workload itself. This implies evaluating the workload properties dynamically at run time and measuring their effectiveness. These measurements are further used to increase the impact of highly effective properties and to diminish the impact of those with low effectiveness. Having this functionality allows the system to apply the adaptation at different levels of workload quality.

Figure 3.4: Percentages of queries exhibiting a different number of triples (in colors) for each dataset, for valid queries (left-hand side of each bar) and unique queries (right-hand side of each bar), as appearing in [12].

3.3.2 Evaluation Locality

In this subsection, we consider how workload queries are projected onto and interact with an RDF graph. From Section 2.6 and Definition 3.2, we have seen that query execution is the process of finding all the sub-graphs in the main RDF graph that match the given query’s graph. How those sub-graphs are spread in terms of locality has a big effect on the execution complexity and performance. From an analytical point of view, we classify this locality interaction between the query and the data-set into the following two aspects:

• The data-set aspect.

• The query-graph aspect.

Locality with respect to the data-set

The RDF data sets that form a big graph usually have a non-uniform access rate: some parts of the graph are accessed heavily, while most parts of the graph are accessed at a much lower rate [66]. In a set of real-world queries targeting DBpedia, more than 90% of the queries touch only 163 frequent sub-graphs [60]. Detecting these hot parts of the RDF graph is one of the fundamental methods to make systems more adaptable to the workload.

However, there are different methods and approaches to detect those frequent parts, and different strategies to employ them for the benefit of the system.

Some systems use the detected frequent patterns to recognize which parts of the data graph are the most relevant for replication [37, 31]; others consider those parts when partitioning the data [26, 60]; and others use them to detect the most important pieces of data to cache in main memory [59]. These methods and their challenges were reviewed in Chapter 2. One of the most challenging factors affecting the outcomes of these approaches is their flexibility in coping with different levels of workload heterogeneity, and their ability to tune themselves dynamically when the workload changes over time.
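The common core of these approaches, detecting the hot parts of the graph from the workload, can be sketched as frequency-weighted access counting over query answers. The data layout, function names, and the top-k cutoff below are illustrative assumptions, not any cited system’s actual method:

```python
from collections import Counter

# Hypothetical sketch of hot-part detection: count how often each data
# triple appears in the answers of the workload queries, weighted by
# query frequency, and take the most frequently accessed triples as
# the "hot" part of the graph.

def hot_triples(workload_answers, top_k=2):
    """workload_answers: list of (answer_triples, query_frequency)."""
    counts = Counter()
    for triples, freq in workload_answers:
        for t in triples:
            counts[t] += freq          # weight each access by frequency
    return [t for t, _ in counts.most_common(top_k)]

answers = [
    ([("alice", "knows", "bob")], 10),
    ([("alice", "knows", "bob"), ("bob", "knows", "carol")], 3),
    ([("carol", "age", "30")], 1),
]
print(hot_triples(answers))
# [('alice', 'knows', 'bob'), ('bob', 'knows', 'carol')]
```

The hot triples returned here would then feed whichever strategy the system employs: replication, partitioning, or in-memory caching.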

Locality with respect to the query-graphs

The shape of the query itself and its layout have a very important effect on how widely its evaluation spreads within the RDF graph. In most cases, this can be estimated merely by observing the query graph and the distribution of variables and constants within its structure. As introduced earlier in Section 2.6.1, a query can be classified as bounded or unbounded, and each type has its own localization impact:

• Bounded queries: the query graph has at least one constant vertex within its structure. The query execution then stays within a limited locality in the RDF graph, namely the location of the constant vertex and its neighbours, given that the system has an appropriate index to locate that vertex within its indexing structure, as explained in Section 2.4.

• Unbounded queries: if the query graph contains no constants in any of its vertices, but has constants within its predicates, the number of sub-graphs that may match the query graph is unlimited. The only possible way to determine an answer for such a query is to follow how the constant edges are connected within the query graph and find matching sub-graphs in the RDF graph. Thus, the execution in this case is widespread in terms of data locality.
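The bounded/unbounded distinction above depends only on the syntactic shape of the query graph, so it can be checked directly on the triple patterns. A minimal sketch, again assuming that variables are strings starting with "?":

```python
# Sketch of the bounded/unbounded classification (Section 2.6.1):
# a query is bounded if at least one subject or object position holds
# a constant vertex; otherwise it is unbounded.

def is_variable(term):
    return term.startswith("?")

def is_bounded(patterns):
    """True iff some triple pattern has a constant subject or object."""
    return any(not is_variable(s) or not is_variable(o)
               for s, _p, o in patterns)

print(is_bounded([("?x", "knows", "bob")]))  # True  (constant vertex 'bob')
print(is_bounded([("?x", "knows", "?y")]))   # False (unbounded)
```

Note that constants in the predicate position do not make a query bounded: only constant vertices anchor the evaluation to a limited locality.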