RDF Indexing - Universal Workload-based Graph Partitioning and Storage Adaption for Distributed

As mentioned earlier, the first systems that handled RDF-modeled data used classical relational DBMSs. However, it soon became clear that such systems are efficient only with the support of well-designed indexes. The later native RDF stores were largely classified by their index structures, and to some extent it is not possible to differentiate between the indexes, the tables, or any other data container in the system, as the whole RDF data set is actually stored in indexes. The RDF indexing approaches are classified here into key-value indexes and graph-based indexes.

2.4.1 Key-value indexes

The main objective of index design is to speed up query processing by decreasing the cost of joins and providing fast retrieval of triple data. As defined in Section 2.2, a SPARQL query is a set of triple patterns, and the system should be able to answer any triple pattern using its indexes. Each triple pattern consists of exactly three elements, each of which is either a constant or a variable, with at least one constant and at least one variable (excluding rare cases where a triple pattern has no variables). An optimal index receives a triple pattern and returns all the triples in the data set that match it. For this purpose, the constant, or a combination of two constants, in the triple pattern is used as the index key, and the index delivers the matching triples as output. Consider for instance the following triple pattern: t₁ = (:newton, ?y, ?x). This pattern has one constant in the subject position and two variables in the predicate and object positions. To evaluate t₁, we need an index that has the subject as its key (usually called an S index). We perform a lookup on that index using the key ":newton" and expect the index to return a list of all triples that have ":newton" as subject. However, if t₁ = (:newton, :was_born, ?x), then we need an index that has both the subject and the predicate as its key (usually called an SP index). In an extreme case, the triple pattern may have three constants, as in t₁ = (:newton, :was_born, :england). In this case, the required index should use subject, predicate, and object as its key (usually called SPO). Depending on the implementation, the SPO index may answer all three triple patterns.
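The mapping from a triple pattern to the index it requires can be sketched as follows. This is a minimal illustration, not code from any particular RDF store: the function name `required_index` and the convention of representing a variable as `None` are assumptions made for the example.

```python
from typing import Optional, Tuple

# A triple pattern as (subject, predicate, object); None marks a variable.
# This representation is an assumption made for the sketch.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def required_index(pattern: Pattern) -> str:
    """Name the index whose key consists of exactly the pattern's constants."""
    key = "".join(
        name for name, term in zip("SPO", pattern) if term is not None
    )
    if not key:
        raise ValueError("the pattern needs at least one constant to form a key")
    return key


print(required_index((":newton", None, None)))                # S
print(required_index((":newton", ":was_born", None)))         # SP
print(required_index((":newton", ":was_born", ":england")))   # SPO
```

The three calls correspond to the three Newton patterns discussed above.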

Indexes are implemented using two main approaches: sorted lists and hash tables. A sorted index is a list that contains all the triples sorted in the order of the key. The SPO index is thus sorted on the subject, then on the predicate, and finally on the object. In this manner, the SPO index can answer all three triple patterns about Newton. The cost of a lookup operation in such an index is logarithmic in the data size.
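A sorted SPO index and its prefix lookup can be sketched with a plain Python list and binary search. The sample triples beyond the Newton example and the helper name `prefix_lookup` are assumptions for illustration; a real store would typically use an on-disk structure such as a B+-tree rather than an in-memory list.

```python
from bisect import bisect_left

# Hypothetical sample data; only the :newton/:was_born triple comes from the text.
triples = [
    (":newton", ":was_born", ":england"),
    (":newton", ":field", ":physics"),
    (":einstein", ":was_born", ":germany"),
    (":einstein", ":field", ":physics"),
]

# The SPO index: triples sorted on subject, then predicate, then object.
spo_sorted = sorted(triples)


def prefix_lookup(index, prefix):
    """Return all triples whose leading elements equal the tuple `prefix`."""
    # Binary search finds the first candidate in logarithmic time,
    # because a shorter tuple sorts before any tuple it is a prefix of.
    start = bisect_left(index, prefix)
    matches = []
    for i in range(start, len(index)):
        if index[i][:len(prefix)] != prefix:
            break  # sorted order: no later triple can match
        matches.append(index[i])
    return matches


# The same index serves the S, SP, and SPO patterns about Newton.
print(prefix_lookup(spo_sorted, (":newton",)))
print(prefix_lookup(spo_sorted, (":newton", ":was_born")))
print(prefix_lookup(spo_sorted, (":newton", ":was_born", ":england")))
```

Finding the start position is logarithmic; collecting the matches then costs time proportional to the number of results.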

A hash index is implemented as a hash table that contains all the triples hashed on the key. In this case, the SP index uses a key that combines the subject and the predicate, and both must be given to perform a lookup. Thus, we need three hash indexes to answer the three triple patterns about Newton. However, a hash index is faster and has, on average, constant lookup time with respect to the data size. To combine the benefits of fast data access and low storage space, hybrid indexes are used. Such an index is hashed on the first element of the key and sorted on the second and third. An SPO index of this type can answer all three triple patterns about Newton: it is hashed on the subject and can thus look up any subject in constant time, and any of that subject's predicates in logarithmic time, since it is sorted on the predicates.
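The difference between a pure hash index and a hybrid hash-sort index can be sketched as follows. All names (`build_sp_hash`, `build_hybrid_spo`, `hybrid_lookup`) and the sample data are hypothetical, and Python dictionaries and lists stand in for whatever structures a real system would use.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical sample data.
triples = [
    (":newton", ":was_born", ":england"),
    (":newton", ":field", ":physics"),
    (":einstein", ":was_born", ":germany"),
]


def build_sp_hash(triples):
    """Pure hash SP index: the full (subject, predicate) key is required."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[(s, p)].append((s, p, o))
    return index


def build_hybrid_spo(triples):
    """Hybrid SPO index: hashed on subject, sorted on (predicate, object)."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    for pairs in index.values():
        pairs.sort()
    return index


def hybrid_lookup(index, s, p=None, o=None):
    """Answer S, SP, or SPO patterns from one hybrid index."""
    if p is None and o is not None:
        raise ValueError("an SPO ordering cannot serve (s, ?p, o) as a prefix")
    pairs = index.get(s, [])          # hash step on the subject
    if p is None:
        return [(s, pp, oo) for pp, oo in pairs]
    prefix = (p,) if o is None else (p, o)
    start = bisect_left(pairs, prefix)  # binary search on the sorted part
    result = []
    for pp, oo in pairs[start:]:
        if (pp, oo)[:len(prefix)] != prefix:
            break
        result.append((s, pp, oo))
    return result


sp_hash = build_sp_hash(triples)
hybrid = build_hybrid_spo(triples)
print(sp_hash.get((":newton", ":was_born")))   # needs both key parts
print(hybrid_lookup(hybrid, ":newton"))
print(hybrid_lookup(hybrid, ":newton", ":was_born"))
```

The pure hash index cannot be probed with the subject alone, which is why a separate hash index per pattern shape is needed, whereas the hybrid index serves all three shapes.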

Depending on which of the triple's elements form the key and in which order, there are six index types (assuming hash-sort indexes): SPO, SOP, OPS, OSP, POS, and PSO. RDF-3X [56] builds all six types, which allows high index efficiency and flexibility in answering any triple pattern. However, due to the high storage overhead of a full set of indexes, some systems build only the most frequently referenced indexes, identified as SPO, POS, and OSP, which typical key-value stores maintain in separate containers. In contrast to hash-based indexes, fully sort-based indexes can be used to process range queries more efficiently. For example, the sort-based SPO index is built by sorting the triples on S, then on P, and then on O.
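The choice of an index ordering for a given pattern can be sketched as a prefix check. The helper `choose_index` is hypothetical and assumes that an index can serve a pattern whenever the pattern's constants occupy the leading positions of the index ordering, as is the case for sorted and hash-sort indexes.

```python
from itertools import permutations

# The six possible orderings of subject, predicate, and object.
ALL_ORDERS = ["".join(order) for order in permutations("SPO")]

# The reduced set mentioned in the text.
REDUCED = ["SPO", "POS", "OSP"]


def choose_index(bound, available):
    """Return an ordering whose leading positions are exactly the bound ones.

    `bound` is a string such as "S", "SO", or "SPO" naming the positions
    of the pattern's constants. Returns None if no ordering fits.
    """
    for order in available:
        if set(order[:len(bound)]) == set(bound):
            return order
    return None


for bound in ["S", "P", "O", "SP", "SO", "PO", "SPO"]:
    print(bound, "->", choose_index(bound, REDUCED))
```

Under this prefix assumption, the three orderings SPO, POS, and OSP already place every combination of constants at the front of some index, which is consistent with systems keeping only these three.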

2.4.2 Graph-based indexes

Another method of RDF storage is to store the RDF data set holistically as a graph. Each unique subject or object in the data set is a vertex that is associated with one adjacency list for its outgoing edges and another for its incoming edges. Each property edge is listed in the outgoing edges of its subject and in the incoming edges of its object. This allows queries to be processed literally as graph operations, as shown in Section 2.6.

In practice, however, the system still needs a general hash index that looks up the vertex of any subject or object in the data set. Thus, the two adjacency lists together with the general index can still be considered equivalent to the SPO and OPS indexes explained in the previous section. This storage approach is followed by Trinity.RDF [85], which is built on Trinity [72], a key-value store that serves as a distributed graph processing system.
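The graph-based layout can be sketched as a hash index from vertex labels to two adjacency lists. The class `GraphStore` and its methods are hypothetical and do not reflect the actual data layout of Trinity.RDF; the sketch only illustrates why the outgoing and incoming lists behave like SPO and OPS indexes.

```python
from collections import defaultdict


class GraphStore:
    """Minimal in-memory sketch of a graph-based RDF store (illustrative only)."""

    def __init__(self):
        # General hash index: vertex label -> its two adjacency lists.
        self.vertices = defaultdict(lambda: {"out": [], "in": []})

    def add(self, s, p, o):
        # The edge appears in the outgoing list of its subject
        # and in the incoming list of its object.
        self.vertices[s]["out"].append((p, o))
        self.vertices[o]["in"].append((p, s))

    def outgoing(self, s, p=None):
        """Triples with subject s (optionally predicate p), like an SPO lookup."""
        vertex = self.vertices.get(s)
        if vertex is None:
            return []
        return [(s, pp, o) for pp, o in vertex["out"] if p is None or pp == p]

    def incoming(self, o, p=None):
        """Triples with object o (optionally predicate p), like an OPS lookup."""
        vertex = self.vertices.get(o)
        if vertex is None:
            return []
        return [(s, pp, o) for pp, s in vertex["in"] if p is None or pp == p]


g = GraphStore()
g.add(":newton", ":was_born", ":england")
g.add(":newton", ":field", ":physics")
g.add(":einstein", ":field", ":physics")

print(g.outgoing(":newton"))
print(g.incoming(":physics", ":field"))
```

For simplicity the adjacency lists here are unsorted and scanned linearly; sorting them on the predicate would give the same lookup behaviour as the hybrid hash-sort index of the previous section.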
