
In this chapter, we presented the principles of the Resource Description Framework (RDF) as a complete and standalone data model with its own specification, vocabulary, and serialization formats. The RDF model maps directly to a graph and can therefore be queried with a graph-based query language such as SPARQL. Query processing performance depends heavily on how well the triple data are structured and indexed. In this context, some systems use an extensive number of indexes to gain enough flexibility in query evaluation and thereby improve performance. However, this requires a lot of storage space.
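The trade-off between index flexibility and storage space can be illustrated with a minimal sketch: each additional triple permutation answers a different access pattern directly, but every permutation stores the full data set again. The class and IRIs below are illustrative, not taken from any particular system.

```python
from collections import defaultdict

# Illustrative in-memory triple store keeping three index permutations.
# Each permutation answers one family of triple patterns directly, but
# each also stores every triple once more - the storage/flexibility trade-off.
class TripleIndexes:
    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s, p, o):
        # every insertion is written to all permutations
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        return self.spo[s][p]          # pattern (s, p, ?o)

    def subjects(self, p, o):
        return self.pos[p][o]          # pattern (?s, p, o)

store = TripleIndexes()
store.add("dbr:Berlin", "dbo:country", "dbr:Germany")
store.add("dbr:Munich", "dbo:country", "dbr:Germany")
print(sorted(store.subjects("dbo:country", "dbr:Germany")))  # → ['dbr:Berlin', 'dbr:Munich']
```

Systems that aim for maximal flexibility extend this idea to all six permutations (plus aggregate indexes), which multiplies the storage footprint accordingly.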

The RDF data model is used for web-scale data, which forces the storage system to be designed on big-data principles. The move towards distributed RDF storage systems is therefore seen as vital. However, this imposes new challenges in keeping the system scalable and optimized in terms of storage space, which in this case must host not only many indexes but also replications of the distributed partitioned data, keeping in mind that these replications in turn need to be well indexed.

System   Partitioning   M.Memory   Adaption   Advantages
Huang    METIS          No         No         Lower communication cost

Table 2.3: The most related systems which employ workload adaption

Many research works have considered methods to manage RDF graph partitioning. Two important directions are graph partitioning and hash partitioning. While both have their strengths and weaknesses, some works suggested analyzing the workload and performing workload-aware partitioning. Since this can create unstable behavior when the workload is not representative enough, other works suggested workload-aware replication to support a static partitioning (graph-based or hash-based). We identified storage space as the common constraint that should tune the workload adaption for replication and indexes. In this thesis, we will propose and describe in the upcoming chapters a universal adaption concept in which indexes, replication, and the join cache compete for the limited storage space, aiming to increase the system performance.
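The hash partitioning direction mentioned above can be sketched in a few lines: each triple is assigned to a worker node by hashing its subject, so all triples sharing a subject land on the same node and subject-centered (star) patterns stay local. The node count and IRIs are illustrative.

```python
import hashlib

# Minimal sketch of subject-based hash partitioning across worker nodes.
# NUM_NODES and the triples are illustrative values, not from a real deployment.
NUM_NODES = 4

def node_for(subject):
    # stable hash: same subject always maps to the same node
    digest = hashlib.md5(subject.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

triples = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Berlin", "rdfs:label", '"Berlin"'),
    ("dbr:Munich", "dbo:country", "dbr:Germany"),
]

partitions = {i: [] for i in range(NUM_NODES)}
for s, p, o in triples:
    partitions[node_for(s)].append((s, p, o))
```

Graph partitioning (e.g. with METIS) instead minimizes the number of edges cut between partitions, which favors path-shaped queries at the price of a much more expensive partitioning step.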

Workload Analysis

In this chapter, we formulate the adaption system and its related cost model. Initially, we define the cost model that enables the system's universal adaption. We detail the role and effect of the workload, and how we structure and analyse a collection of workload queries in order to estimate the resource access rates, which represent the moving heart of the cost model triangle. This cost model is used to adapt the storage space in terms of indexes, replication, and the join cache. For each of these options, we provide the necessary formulas for cost, benefit, and access rate. For a summary and description of all the mathematical symbols used in this chapter, please refer to Appendix B.2.
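To preview how cost, benefit, and access rate interact, the following is a minimal greedy sketch (all names and numbers hypothetical, not the cost model defined later in this chapter): candidate storage structures of all three kinds compete for one budget, ranked by how much time they save per unit of storage.

```python
from dataclasses import dataclass

# Hypothetical sketch: indexes, replications, and join-cache entries
# compete for a fixed storage budget. Each candidate carries a storage
# cost, a per-access benefit, and a workload-derived access rate; we
# greedily pick by benefit density = benefit * access_rate / cost.
@dataclass
class Candidate:
    name: str
    cost: float          # storage units consumed
    benefit: float       # time saved per access
    access_rate: float   # accesses per time unit, estimated from the workload

    @property
    def density(self):
        return self.benefit * self.access_rate / self.cost

def allocate(candidates, budget):
    chosen, used = [], 0.0
    for c in sorted(candidates, key=lambda c: c.density, reverse=True):
        if used + c.cost <= budget:
            chosen.append(c.name)
            used += c.cost
    return chosen

cands = [
    Candidate("POS index", cost=40, benefit=2.0, access_rate=30),           # density 1.50
    Candidate("border replication", cost=60, benefit=5.0, access_rate=5),   # density ~0.42
    Candidate("join cache: Q7", cost=10, benefit=8.0, access_rate=2),       # density 1.60
]
print(allocate(cands, budget=60))  # → ['join cache: Q7', 'POS index']
```

The point of the sketch is only the competition itself: a rarely accessed replication loses its storage to an index or cache entry that the workload touches more often.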

Contents

3.1 Why Adaption? . . . . 46
3.2 Universal Adaption . . . . 47
3.3 The Role of the Workload . . . . 51
3.4 Workload Rules . . . . 56
3.5 Heat Queries . . . . 59
3.6 Heat Query Specific Rule . . . . 65
3.7 Summary . . . . 66



Database tuning is a well-known concept in database management systems; it refers to the directed change of system parameters towards a better use of the available system resources. Applying such tuning has an even deeper impact on RDF management systems, where indexes are vital and storage space is precious. Consider, as a case example, the SPARQL online service of dbpedia.org¹. The service is hosted on two CentOS 6 virtual machines, each with 8 Intel Xeon 2.30 GHz cores, a 200 GB SSD, and 64 GB of main memory [78].

Assume we run two federated RDF stores on each working node, and that each node receives its partition of the RDF data set from a graph partitioning tool like METIS [45]. The system further needs to decide, at each node, on the size and type of replication, the number and type of indexes, the main memory and/or hard disk allocation, and the number of threads to be used for each query, given that each node has 8 hardware threads. According to an analytic study of a large log of real users' queries [12], about 80% of the queries targeting the DBpedia data set were short queries. Moreover, the access rate of the given online service suddenly doubled within a period of three months. Given only these two measurements, the system may perform important adaption steps to substantially reduce its total execution time, considering the following points:

¹Available at: http://dbpedia.org/sparql/

• The ratio of 80% short queries means that the system does not need replication at the borders of the partitions (needed to support long queries, as shown in Section 5.1), unless it is able to recognize the size and locality of the 20% longer queries.

• The high number of arriving queries could mean that there are waiting queries in the queue most of the time. This means that the system should focus on achieving better throughput (i.e. executing more queries) rather than a high parallelization rate per query. By this, the system avoids the thread synchronization cost, which typically increases with the number of threads used per query on a single working node. Moreover, the distributed nodes should try to execute each query locally and avoid any network communication cost (see details in Section 8.2).

• Given that the system does not need border replication for most queries, the storage space can still be used for replications that enhance the load balance between the working nodes at the cluster scale, which is reflected in the number of waiting queries in the queue of each working node.
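The throughput-versus-parallelism trade-off in the second point can be sketched as a simple rule of thumb (hypothetical policy, not the scheduling strategy defined later): when the queue is full, each query gets one thread; when the node is idle, a single query may use all hardware threads.

```python
# Hypothetical thread-allocation rule for one working node.
# HARDWARE_THREADS matches the 8-core nodes of the DBpedia example;
# the policy itself is illustrative only.
HARDWARE_THREADS = 8

def threads_per_query(queue_length):
    if queue_length == 0:
        return HARDWARE_THREADS  # idle node: parallelize the single query
    # waiting queries present: spread hardware threads over them,
    # guaranteeing at least one thread per query (favor throughput)
    return max(1, HARDWARE_THREADS // (queue_length + 1))

print(threads_per_query(0), threads_per_query(1), threads_per_query(7))  # → 8 4 1
```

Under heavy load this degenerates to one thread per query, which avoids the per-query synchronization overhead entirely and maximizes the number of queries completed per time unit.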

From the above example, we see the system's ability to perform adaption steps that increase performance, based on only two or three measurements which are relatively easy to obtain and maintain. However, we also notice an overlap between the storage optimization decisions regarding replications, indexes, and the queries' arrival rate. This clearly motivates the need for universal adaption decisions that consider these overlapping measures.