
The rules of border replication and of load-balancing replication both target vertices located in the neighbouring working nodes. Since the intersection of their vertex sets is not empty, the two rule sets need to be aggregated. However, the aggregation is performed on the operational level and not on the access level.

This is because the benefit function of the border-replication rules differs from the benefit function of the load-balancing rules.

Definition 5.7 (Replication Aggregated Rules) The operational rules of both border replication and load-balancing replication are aggregated as follows:

$$R_{rep}^{op} = \{\, \$^{op} \mid \$^{op} = \mathrm{aggregate}_{op}(\$_1^{op}(\chi), \$_2^{op}(\chi)),\ \forall \chi \in X,\ \$_1^{op}(\chi) \in R_{bo}^{op},\ \$_2^{op}(\chi) \in R_{ba}^{op} \,\}$$

The set $R_{rep}^{op}$ contains the operational rules that represent the replications of the vertices in the remote nodes from the perspective of a certain working node. That set is comparable with the sets of index operational rules and join-cache operational rules that were given in the previous chapter.
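As an illustration, the following sketch mirrors Definition 5.7 under simplifying assumptions of our own: an operational rule is represented only by its benefit value for an index $\chi$, and $\mathrm{aggregate}_{op}$ is instantiated as a sum, which is just one possible choice since the aggregation operator is left abstract here.

```python
# Minimal sketch of Definition 5.7 (names and data shapes are assumptions):
# each operational rule set maps an index chi to a benefit value, and
# R_rep^op is built by aggregating, per chi, the border-replication rule
# with the load-balancing rule.
from typing import Dict

OpRules = Dict[str, float]  # hypothetical: chi identifier -> benefit of its operational rule


def aggregate_op(op1: float, op2: float) -> float:
    """Assumed aggregation of two operational rules for the same chi.
    The definition leaves aggregate_op abstract; summing benefits is one choice."""
    return op1 + op2


def aggregate_replication_rules(r_bo_op: OpRules, r_ba_op: OpRules) -> OpRules:
    """Build R_rep^op over the chis targeted by both rule sets
    (their intersection is non-empty, as stated above)."""
    X = r_bo_op.keys() & r_ba_op.keys()
    return {chi: aggregate_op(r_bo_op[chi], r_ba_op[chi]) for chi in X}


# Usage: two small rule sets over overlapping chis.
r_rep_op = aggregate_replication_rules({"chi1": 0.5, "chi2": 0.75},
                                       {"chi2": 0.25, "chi3": 1.0})
print(r_rep_op)  # {'chi2': 1.0}
```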

5.8 Summary

We summarize the chapter in the following points:

• In a system of distributed working nodes, the RDF graph is partitioned and assigned to the nodes.

• The border regions create a performance problem because the queries in that region could require synchronization across the working nodes.

• That border region problem is overcome with border replication.

• Incorporating border replication into the cost model answers the questions of what data to replicate and how much.

• The border replication has a general access function related to the vertices’ distance from the border and defines a general access rule. On the other hand, its specific access function is derived from the heat queries by projection and defines the specific access rule.

• Both of the rules are aggregated into one border access rule. A benefit function is attached to that rule to create the set of border-replication operational rules.

• Another purpose of replication is load balancing, which aims to increase the system throughput.

• The access rate to these replications is related to the nodes’ load-imbalance factor and to the access of the data in their source nodes. We derive that access to define an access rule for each index in the system. The operational rule is defined by adding the benefit function, which is related to the system throughput.

• The operational rules of the border replication and the load-balancing replication are aggregated into one set of operational rules. Those rules are comparable with the index and join-cache operational rules.

Universal Adaption

The previous two chapters presented separate approaches for adapting the indexes, the join cache, and the replications to the workload and the storage space. The adaption process for each of the three concluded with a set of operational rules that are comparable with each other. This chapter deals with performing the universal adaption of the indexes, join cache, and replications using their derived operational rules.

Contents

6.1 System Architecture
6.2 Storage Space Optimizer
6.3 Creating The Proposed and Assigned Rules
6.4 Summary


[Figure 6.1: Chapter’s scope. The diagram shows the replications engine, indexes engine, and join-cache engine together with the workload analysis module and the access rules; their operational rules feed the universal adaption covered in Chapter 6.]

6.1 System Architecture

In Chapter 2, we presented the design options of a distributed triple store while reviewing the related work on RDF triple stores. We considered federated shared-nothing nodes, where each node hosts an adopted version of a central triple store.

This approach is also followed by [37, 26, 38, 83, 31]. It provides the system with enough flexibility to adapt the indexes and replication layers using the adaption parameters that were explained in Chapter 3. The main components and the architecture of our adaptable RDF triple store (UniAdapt) are shown in Figure 6.2.

Our distributed system $H$ is a set of $n$ hosts. A host $h_i$ can directly send any message to any other host $h_j$ using the underlying network. We refer to the delay incurred by sending a message in the network as:

$$\mathrm{delay} = \mathrm{size}(msg) \cdot Z \cdot \zeta \tag{6.1}$$

where $Z$ is the network transfer rate and $\zeta$ is a random function representing the size of the current traffic in the network at the time the message is sent. Since we assume that the network is dedicated to connecting the working hosts, the traffic in the network essentially consists of system messages and the data synchronization between the hosts, i.e., moving the intermediate results of running queries across hosts. Each node $h$ works independently of the other nodes and receives its own share of the RDF graph.
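For concreteness, a minimal sketch of the delay model of Equation (6.1) follows; the constant chosen for $Z$, the distribution behind $\zeta$, and the function names are our own assumptions, since only the product form of the model is specified here.

```python
# Illustrative sketch of the message-delay model of Equation (6.1):
# delay = size(msg) * Z * zeta. Z is treated here as a constant per-byte
# factor and zeta is drawn uniformly; both choices are assumptions.
import random

Z = 1e-6  # assumed network transfer-rate factor (seconds per byte)


def zeta() -> float:
    """Assumed random traffic factor; the model only states that zeta
    represents the current network traffic at the time the message is sent."""
    return random.uniform(1.0, 2.0)


def message_delay(msg: bytes) -> float:
    """Delay charged for sending msg from one host to another (Equation 6.1)."""
    return len(msg) * Z * zeta()


# Example: cost of shipping a 4 KiB intermediate result between two hosts.
print(message_delay(b"\x00" * 4096))
```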

Each node has its own storage optimizer that builds the node’s replication layer by looking at its neighbours. The node runs queries on the available local data (its main share plus the replications) and returns the result to a selected node that assembles the final query result. Each node also makes its own decision about the type and quantity of replications to be built from the neighbours’ main shares, using its own optimizer for this purpose. The initial partitioning is made by a single node (node 1 in Figure 6.2), and the results are distributed to the other nodes.

Within each working node, there is a main-memory query engine, which is supported by a hard-disk query processing engine based on RDF-3X [56]. The storage layer is composed of a dictionary and indexes. The dictionary is a dual hash-table structure that maps each string in the raw data set to a compressed integer code and performs a reverse mapping of any integer code back to its original textual representation. Using this dictionary, each textual triple in the data set is converted to an integer triple and stored in the appropriate indexes. The dictionary concept has been used by many works, but [17] was the first to apply it to RDF systems.
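A minimal sketch of such a dual hash-table dictionary is given below; the class and method names are hypothetical and the encoding scheme (sequential integer codes) is an assumption, since the concrete implementation is not fixed in the text.

```python
# Sketch of a dual hash-table dictionary (hypothetical interface): one table
# maps strings to compressed integer codes, the other maps codes back to
# their textual representation, so textual triples become integer triples.
from typing import Dict, Tuple


class Dictionary:
    def __init__(self) -> None:
        self.str_to_id: Dict[str, int] = {}
        self.id_to_str: Dict[int, str] = {}

    def encode(self, term: str) -> int:
        """Map a string to its integer code, assigning a new code on first use."""
        code = self.str_to_id.get(term)
        if code is None:
            code = len(self.str_to_id)
            self.str_to_id[term] = code
            self.id_to_str[code] = term
        return code

    def decode(self, code: int) -> str:
        """Reverse mapping from an integer code back to the original string."""
        return self.id_to_str[code]

    def encode_triple(self, s: str, p: str, o: str) -> Tuple[int, int, int]:
        return (self.encode(s), self.encode(p), self.encode(o))


# Usage: a textual triple is converted to an integer triple before indexing.
d = Dictionary()
print(d.encode_triple("ex:Alice", "foaf:knows", "ex:Bob"))  # (0, 1, 2)
print(d.decode(2))  # ex:Bob
```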

The system has a collection of indexes where the RDF data actually reside, as was explained earlier in Section 2.4. Each index may contain local or replicated data. The system keeps a record that distinguishes the local data from the replicated data; this record costs a single bit per triple.
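The following sketch illustrates this single-bit bookkeeping on a per-index basis; the layout and names are our own and are not the actual storage format of UniAdapt.

```python
# Sketch of the single-bit local/replicated bookkeeping (our own layout):
# every triple stored in an index carries one extra bit marking whether it
# belongs to the node's main share or was replicated from a neighbour.
from typing import List, Tuple

Triple = Tuple[int, int, int]  # integer-encoded (subject, predicate, object)


class Index:
    def __init__(self) -> None:
        self.triples: List[Triple] = []
        self.replicated: List[bool] = []  # one bit of bookkeeping per stored triple

    def insert(self, triple: Triple, replicated: bool) -> None:
        self.triples.append(triple)
        self.replicated.append(replicated)

    def local_triples(self) -> List[Triple]:
        """Triples belonging to the node's own share of the RDF graph."""
        return [t for t, r in zip(self.triples, self.replicated) if not r]

    def replicated_triples(self) -> List[Triple]:
        """Triples replicated from neighbouring nodes."""
        return [t for t, r in zip(self.triples, self.replicated) if r]
```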

The storage optimizer is responsible for managing the storage layer by collecting and analysing the workload in order to decide what type of data to assign to each index, including the join cache. A decision is also made about the size and type of the replications, allowing UniAdapt to show universal adaption behaviour.

The storage optimizer and the universal adaption are explained in more detail in the next section.