
The Resource Description Framework (RDF) [34] has been widely used to model data on the web. Despite its simple triple-based structure, RDF showed a high ability to model the complex relationships between web entities and preserve their semantics. It provided the scalability that allowed RDF data to grow from billions [19] to trillions of triples [62]. The naming rules of Tim Berners-Lee [11] defined the methodology to provide a unique URI-based name to each thing modeled by an RDF data-set. This allowed data from different sources to be linked into one big cloud of linked RDF data [79] and enabled querying this cloud. Accompanied by the Web Ontology Language (OWL), the RDF graph represents a big knowledge graph [8]. That enables the web to build an “understanding” of human knowledge and evolve its applications. The medical and health semantic knowledge graphs are important examples in this regard [73, 1, 20]. As a result, RDF data experienced a rapid increase both in size and in the complexity of the embedded relationships [22]. To keep up with that increase, specialized and dedicated systems have appeared to store the RDF triples and provide the service of querying them. However, these systems had to deal with many challenges in managing such big data and efficiently processing its queries. This management operation requires many data structures, including multiple data-wide indexes, replications, a dictionary, statistics, and materialized query results. In the context of the huge RDF data size, these structures place extreme storage space requirements on an RDF system, which become even more challenging in a main-memory environment.
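To make the storage discussion concrete, the following minimal Python sketch (with illustrative names that do not reproduce any particular system's implementation) shows the dictionary structure mentioned above: every URI or literal is mapped to a compact integer ID so that triples can be stored and indexed as ID-triples, which is one source of the extra space these structures consume.

```python
class Dictionary:
    """Minimal term dictionary: maps URIs/literals to integer IDs and back."""

    def __init__(self):
        self.term_to_id = {}   # term -> ID, used when loading triples
        self.id_to_term = []   # ID -> term, used when printing query results

    def encode(self, term):
        # Assign the next free ID the first time a term is seen.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def encode_triple(self, s, p, o):
        return (self.encode(s), self.encode(p), self.encode(o))


d = Dictionary()
print(d.encode_triple("http://example.org/alice",
                      "http://xmlns.com/foaf/0.1/knows",
                      "http://example.org/bob"))   # -> (0, 1, 2)
```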

RDF Indexing

One of the most important challenges is how the data should be indexed to provide the required efficiency of query answering while, at the same time, avoiding the high storage overhead coming from data redundancy in indexes. Indexing in RDF triple stores emerged as a hot research topic, as query evaluation was feasible only with the existence of the required indexes. However, the objective of indexing is always to decrease the query execution time, and the constraint is the extra storage space. The system's index needs are tightly related to the workload trends, and the storage constraint is related to the ratio of space availability to data size, besides the space needs of other data structures in the system. Unfortunately, all the known triple stores made a fixed design choice regarding the objective and constraint of the indexes and thus used a fixed design scheme which the system had to live with. Some of them were very space conservative, like Stratustore [75], which used only one index; others were exhaustive, like RDF-3X [56] and Hexastore [81], which used more than six indexes; while others preferred to stay in between and use three indexes, like Rya [64], MAPSIN [69], and AMADA [15]. However, such fixed designs are only suitable under fixed circumstances of workload and space. For example, the single index of Stratustore could show enough performance if the workload is of a single type such that it only requires the SPO index. On the other hand, a diverse workload might need the comprehensive indexes of RDF-3X to provide the expected performance, but only if the system can assign the required space. Space availability is highly related to the data size and the system's space requirements, which are far from being fixed parameters.
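As a rough illustration of this trade-off, the sketch below (plain Python, with hypothetical helper names) builds nested-dictionary indexes for any subset of the six triple permutations. A space-conservative store keeps only SPO, while an exhaustive store maintains all six permutations so that every triple pattern can be answered by a direct lookup; the price is that the raw data is stored several times over.

```python
from collections import defaultdict

# The six triple permutations used by exhaustive stores (SPO, SOP, PSO, POS, OSP, OPS).
ORDERS = ["spo", "sop", "pso", "pos", "osp", "ops"]


def build_indexes(triples, orders=ORDERS):
    """Build one nested-dictionary index per requested permutation."""
    indexes = {order: defaultdict(lambda: defaultdict(set)) for order in orders}
    for s, p, o in triples:
        values = {"s": s, "p": p, "o": o}
        for order in orders:
            a, b, c = (values[k] for k in order)
            indexes[order][a][b].add(c)
    return indexes


# A store that keeps only SPO answers subject-bound patterns directly, but a
# pattern bound on the object needs a full scan unless a POS/OPS index exists.
triples = [("alice", "knows", "bob"), ("bob", "knows", "carol"), ("alice", "age", "30")]
idx = build_indexes(triples)
print(idx["pos"]["knows"]["bob"])   # -> {'alice'}: who knows bob, via the POS index
```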

Data Replication

As the size of RDF sources is rapidly increasing, the resources of centralized systems have been facing difficulties in maintaining such big data and efficiently querying them. This highly motivated the move toward distributed RDF triple stores, where several working nodes cooperate in storing and querying the global RDF data set. However, this move has brought further challenges. The RDF data set, which is also modeled as a graph, needs to be partitioned such that each working node receives at least one partition. In this case, a query that needs data from more than one working node has to pay the communication cost, which is the network cost required to move data across the physical network. This data can be relatively big and may dominate the total query execution time. There exist two main directions to overcome the cost of these intermediate results:

1. Performing better partitioning to decrease the size of queries' cross-node intermediate results.

2. Supporting the initial partitioning with replication.

Recalling the complexity and linkage of RDF data-sets, performing an optimal partitioning as mentioned in the first point is a difficult task. For this reason, Partout [26] proposed to adapt the partitioning to the workload using an initial workload sample at system startup. Unfortunately, the performance degrades badly when the workload does not keep the same trend as the used sample. Thus, the attention shifted towards supporting the initial partitioning with replication. Instead of moving the queries' intermediate results across the network, a working node may find the needed data in replications and avoid the expensive communication cost. However, replication consumes more storage space, and there should be a wise decision about the triples to be replicated in order to increase the ratio of replication utilization. Since the replication is performed to support the partitioning, the utilization of the replication depends on the strategy of the used partitioning (e.g., graph partitioning or hash-based partitioning). Moreover, the replication is highly related to the workload, because the shape, length, locality, and arrival rate of the queries determine which triples are most needed for replication. Considerable work has been done to utilize replication (these works are reviewed in Chapter 2), where part of the works considered using a workload history to identify the more important data for replication and aim to save storage space. However, all of the related works either assume the existence of some initial workload, or rely on fixed parameters and thresholds which are not clearly connected to or calculated from the workload. Although the storage space is already identified as the replication constraint, none of the related works has implemented the adaption as a function that is dynamically delimited by the available space. In this context, if the data size happens to be small or big compared to the available storage size, the given systems have no ability to replicate more or less data accordingly.
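The following sketch (plain Python with hypothetical function names) illustrates the combination discussed here: triples are hash-partitioned by subject, and a node then replicates the remote triples it fetched most often, with the amount of replication delimited by an explicit space budget rather than by fixed thresholds.

```python
import hashlib


def node_of(subject, num_nodes):
    """Hash-based partitioning: a triple is placed on the node of its subject."""
    return int(hashlib.md5(subject.encode()).hexdigest(), 16) % num_nodes


def replicate_by_workload(remote_fetch_counts, space_budget):
    """Choose which remote triples to replicate locally.

    remote_fetch_counts: dict mapping a remote triple to how often the local
    node had to fetch it over the network (a workload statistic).
    space_budget: how many additional triples the node is allowed to store.
    """
    ranked = sorted(remote_fetch_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [triple for triple, _count in ranked[:space_budget]]
```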

Universal Adaption

Assume that a working node has a given limited amount of unused space: should the node employ it in building more replications, or use it to support its local indexes?

The objective function of the replication is to decrease the query execution time by avoiding the communication cost of the intermediate results as well as balancing the load between the working nodes. The constraint is again the storage space. Recalling what we introduced earlier about the indexes' objective and constraint, we can see that the indexes and the replication share the same objective and constraint.

As a matter of fact, building more indexes can be seen as replicating data locally for faster processing, while replication is replicating remote data for faster access.

This makes a clear baseline for a single optimization operation that considers both of them in the same domain. Moreover, replications and indexes are not the only storage consumers in an RDF triple store. Materializing query results or caching some of the join operations provides considerable benefit under the usual workload environment. These cached results also share the same objective and constraint as the indexes and replications, and thus they can flexibly fit into the universal adaption model.
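Under this view, one universal adaption step can be sketched as a single selection problem: indexes, replications, and cached join results are all candidates with an estimated query-time benefit and a space cost, competing for the same budget. The greedy ranking below is only an illustrative Python sketch of that idea (the names and the benefit/cost estimates are assumptions), not a concrete algorithm from the literature.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A storage consumer competing for the shared space budget."""
    kind: str        # "index", "replication", or "cached_result" (illustrative labels)
    name: str
    benefit: float   # estimated reduction in workload execution time if materialized
    cost: float      # storage space the structure would occupy


def select_under_budget(candidates, space_budget):
    """Rank indexes, replications, and cached results together by benefit per
    unit of space, and keep adding structures while the budget allows."""
    chosen, used = [], 0.0
    for c in sorted(candidates, key=lambda c: c.benefit / c.cost, reverse=True):
        if used + c.cost <= space_budget:
            chosen.append(c)
            used += c.cost
    return chosen
```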

Workload Analysis

As was earlier pointed out by [26, 37, 60, 31], the historical workload can be an effective subject for tuning the system resources toward a more efficient processing of the future workload. However, there should be no fixed assumptions about the RDF workload [12], as the workload properties change with many practical factors like the data-sets, the applications, and temporal factors. To deal with this dynamic workload status, the system should adapt its analysis to the workload and measure its effectiveness in order to increase the impact of the effective parameters and suppress the impact of those with low effectiveness.
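One simple way to realize this feedback loop, shown as an assumed Python sketch below, is to keep a weight per workload-derived parameter and update it with the benefit its past decisions actually delivered; a parameter that stops being effective then decays and loses its influence on the next adaptation round.

```python
def update_weight(weight, observed_benefit, learning_rate=0.2):
    """Exponentially weighted update of one workload parameter's influence.

    observed_benefit: measured effectiveness of the decisions this parameter
    drove in the last adaptation round (e.g. saved execution time, normalized).
    A persistently ineffective parameter decays toward zero weight.
    """
    return (1 - learning_rate) * weight + learning_rate * observed_benefit


# Example: a parameter that kept paying off (benefit 0.9) versus one that did not (0.0).
w_useful, w_useless = 0.5, 0.5
for _ in range(5):
    w_useful = update_weight(w_useful, 0.9)
    w_useless = update_weight(w_useless, 0.0)
print(round(w_useful, 2), round(w_useless, 2))   # -> 0.77 0.16
```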