Indexes in The Cost Model - Universal Workload-based Graph Partitioning and Storage Adaption fo

The dynamic index approach allows an index to grow or shrink according to storage availability, the workload, and the integration with other storage consumers.

This requires fitting the dynamic indexes into the cost model of Section 3.2.1. To do so, we need three quantities for placing a data element in a certain index: its benefit, its cost, and its access rate.

4.6.1 Index Cost

Any triple can be indexed in any index; however, to ensure index consistency, triples are assigned to indexes at the level of vertices. For instance, if a vertex v in the data graph is to be indexed in the SPo index, then all triples that have v as subject must be indexed. Thus, the space cost of assigning v to SPo is the number of those indexed triples. The system can easily measure this value at indexing time. We refer to that measurement with the following function:

storageCost(v, χ)

The function indexLookup(χ, key) given in Definition 4.1 requires the key to be a triple pattern consistent with the index type χ. We therefore need to map a vertex v to the triple pattern that can be used as the key for a given index type. For that purpose we use the function key(v, χ). For example, if χ is SPo, the following triple pattern is returned for the vertex: (v, ?x, ?y), where ?x and ?y are variables. If χ is OPs, the returned triple pattern has v in the object position.
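The mapping from a vertex to an index key, and the resulting storage cost, can be sketched as follows. This is a minimal illustration and not the system's implementation: it assumes that the first letter of the index type names the bound position (S, P, or O), represents variables as `None`, and counts matching triples in an in-memory list. The names `storage_cost`, `Triple`, and `Pattern` are hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
# A triple pattern; None stands for a variable such as ?x or ?y.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Assumption: the first letter of the index type gives the bound position.
_BOUND_POSITION = {"S": 0, "P": 1, "O": 2}


def key(v: str, index_type: str) -> Pattern:
    """Return the triple pattern with v bound in the leading position of index_type."""
    pos = _BOUND_POSITION[index_type[0].upper()]
    pattern: List[Optional[str]] = [None, None, None]
    pattern[pos] = v
    return (pattern[0], pattern[1], pattern[2])


def storage_cost(v: str, index_type: str, triples: List[Triple]) -> int:
    """Count the triples that must be indexed when v is assigned to index_type."""
    pattern = key(v, index_type)
    return sum(
        1
        for t in triples
        if all(p is None or p == c for p, c in zip(pattern, t))
    )


data = [("a", "p", "b"), ("a", "q", "c"), ("b", "p", "a")]
print(key("a", "SPo"), storage_cost("a", "SPo", data))
print(key("a", "OPs"), storage_cost("a", "OPs", data))
```

In this toy data set, vertex `a` appears twice as subject and once as object, so assigning it to SPo costs more space than assigning it to OPs.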

4.6.2 Index Benefit

The effective benefit according to the general cost model in Formula 3.1 is the product of the performance difference η and the access probability ρ of the resource, which in this case is a single indexed item.

The absolute benefit η of having a vertex v ∈ V in index type χ is the performance gain that the system achieves by making that assignment.

From the perspective of query execution, we consider both the level of a single triple pattern and the level of the whole query. Recall from Section 2.6.3 that there are three possibilities for evaluating a triple pattern using the available indexes:

1. Use the optimal index if available, and directly provide the answer.

2. Use the sub-optimal index if available, plus an extra filter operation.

3. None of the above is available, which requires a full data scan.
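The three possibilities above can be illustrated with a small sketch. It is not the evaluation procedure of Section 2.6.3: it assumes that an index is optimal for a pattern when it is keyed on exactly the bound positions, and sub-optimal when it is keyed on a non-empty proper subset of them. The `Index` class and the `evaluate` function are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None = variable


def matches(pattern: Pattern, triple: Triple) -> bool:
    """True if every bound component of the pattern equals the triple's component."""
    return all(p is None or p == c for p, c in zip(pattern, triple))


class Index:
    """Toy index keyed on the given triple positions (0 = S, 1 = P, 2 = O)."""

    def __init__(self, positions: Tuple[int, ...], triples: Sequence[Triple]):
        self.positions = positions
        self.table: Dict[tuple, List[Triple]] = {}
        for t in triples:
            self.table.setdefault(tuple(t[i] for i in positions), []).append(t)

    def lookup(self, pattern: Pattern) -> List[Triple]:
        return self.table.get(tuple(pattern[i] for i in self.positions), [])


def evaluate(pattern: Pattern, data: Sequence[Triple], indexes: Sequence[Index]):
    bound = {i for i, c in enumerate(pattern) if c is not None}
    # 1. Optimal index: keyed on exactly the bound positions, answers directly.
    for idx in indexes:
        if set(idx.positions) == bound:
            return "optimal", idx.lookup(pattern)
    # 2. Sub-optimal index: keyed on part of the bound positions, needs a filter.
    for idx in indexes:
        if idx.positions and set(idx.positions) < bound:
            return "sub-optimal", [t for t in idx.lookup(pattern) if matches(pattern, t)]
    # 3. No usable index: full data scan.
    return "full-scan", [t for t in data if matches(pattern, t)]


data = [("a", "p", "b"), ("a", "q", "c"), ("b", "p", "a")]
sp_index = Index((0, 1), data)
s_index = Index((0,), data)
print(evaluate(("a", "p", None), data, [sp_index, s_index]))
print(evaluate(("a", "p", None), data, [s_index]))
print(evaluate(("a", "p", None), data, []))
```

All three calls return the same result set; only the access path, and therefore the cost, differs.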

In the second case, the benefit of having the optimal index available for vertex v instead of the sub-optimal one is the removal of the extra filter cost (see Section 2.6.3 for details). In the third case, the benefit is the avoidance of a full data scan. We can formulate the triple lookup time tripleLookupTime(v, χ) for a vertex v in an index type $\chi_2^{l_2}$, given that v is currently indexed by index $\chi_1^{l_1}$:

\[
\mathit{tripleLookupTime}(v, \chi_2^{l_2}) =
\begin{cases}
\text{full-data scan time}, & \text{if } \mathit{getOptimalIndex}(\mathit{key}(v)) = \emptyset \\
\mathit{filterTime}(\chi_1^{l_1}, \mathit{key}(v)) + \Delta, & \text{if } \mathit{getOptimalIndex}(\mathit{key}(v)) = \chi_1^{l_1} \\
0, & \text{otherwise}
\end{cases}
\tag{4.2}
\]

where $\mathit{filterTime}(\chi_1^{l_1}, \mathit{key}(v))$ is the time required to filter out the extra triples resulting from not using the optimal index², and $\Delta$ is the difference in access time of v between $\chi_1^{l_1}$ and $\chi_2^{l_2}$, which arises because $l_1$ and $l_2$ may refer to different physical media: $\Delta = \alpha(\chi_1^{l_1}) - \alpha(\chi_2^{l_2})$, with $\Delta = 0$ for $l_1 = l_2$. Here $\alpha(\chi)$ is the index access-time function given in Definition 4.1.
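The case distinction of Formula 4.2 can be written as a short function. This is a sketch under stated assumptions: the scan time, filter time, and access times are passed in as plain numbers, because the text does not specify how the system measures them, and the parameter names are hypothetical. `optimal` stands for the result of getOptimalIndex(key(v)), with `None` playing the role of ∅.

```python
from typing import Optional


def delta(alpha_current: float, alpha_target: float,
          loc_current: str, loc_target: str) -> float:
    """Access-time difference between the two indexes; zero on the same medium."""
    if loc_current == loc_target:
        return 0.0
    return alpha_current - alpha_target


def triple_lookup_time(optimal: Optional[str], current: str,
                       full_scan_time: float, filter_time: float,
                       delta_value: float) -> float:
    """Evaluate the three cases of Formula 4.2."""
    if optimal is None:
        # No index is available for the key: a full data scan is required.
        return full_scan_time
    if optimal == current:
        # Second case of the formula: filter cost plus media difference.
        return filter_time + delta_value
    return 0.0


d = delta(alpha_current=3.0, alpha_target=1.0, loc_current="hdd", loc_target="ram")
print(triple_lookup_time(None, "SPo", 100.0, 5.0, d))
print(triple_lookup_time("SPo", "SPo", 100.0, 5.0, d))
print(triple_lookup_time("OPs", "SPo", 100.0, 5.0, d))
```

The numbers are arbitrary and carry no unit; they only show which branch contributes which term.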

The above absolute benefit function applies when the query has a single triple pattern and thus requires only one data access path operation. However, in the case of more than one triple pattern, the execution engine needs to perform a further join evaluation phase, which results in multiple join trees that differ from each other in join order, required indexes, and performance (see Section 2.6.4 for details). The query optimizer works similarly to RDF-3X [56] and selects the tree that is expected to show the best performance. If some indexes are unavailable, however, the optimizer chooses the best-performing tree for which the system holds all required indexes. The performance difference between the best tree and the chosen tree is credited as benefit to the absent indexes for the given triple patterns. We label this tree performance difference treeTime(v, χ). It can only feasibly be calculated at the level of triple patterns, not at the level of single vertices, because the query trees in which the performance difference is found are composed of triple patterns. The calculated value can be generalized to all the vertices considered in the join operation.

² See Section 2.6.3 for more details about the optimal index.
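The tree performance difference can be sketched as follows, assuming each candidate join tree carries an estimated execution time and the set of indexes it requires. The `JoinTree` type, its fields, and the `tree_time` function are hypothetical; the actual optimizer and its cost estimates are described in Section 2.6.4 and are not reproduced here.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class JoinTree(NamedTuple):
    name: str
    estimated_time: float
    required_indexes: FrozenSet[str]


def tree_time(trees: List[JoinTree],
              available: FrozenSet[str]) -> Tuple[float, FrozenSet[str]]:
    """Return the time lost by not running the best tree, and the absent indexes."""
    # Best tree overall, ignoring index availability.
    best = min(trees, key=lambda t: t.estimated_time)
    # Trees whose required indexes are all held by the system.
    feasible = [t for t in trees if t.required_indexes <= available]
    if not feasible:
        raise ValueError("no join tree can run with the available indexes")
    chosen = min(feasible, key=lambda t: t.estimated_time)
    absent = best.required_indexes - available
    return chosen.estimated_time - best.estimated_time, absent


trees = [
    JoinTree("t1", 10.0, frozenset({"SPo", "OPs"})),
    JoinTree("t2", 25.0, frozenset({"SPo"})),
]
print(tree_time(trees, frozenset({"SPo"})))
print(tree_time(trees, frozenset({"SPo", "OPs"})))
```

With only SPo available, the optimizer falls back to `t2`, and the difference of 15.0 is credited to the absent OPs index. With both indexes available, the difference is zero.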

However, using the operation rules (Section 4.7) removes the need to go down to the level of vertices; the calculation can stay at the level of patterns instead.

The index benefit function at the level of a vertex can then be given by the following formula:

\[
\eta_{idx}(v, \chi) = \mathit{tripleLookupTime}(v, \chi) + \mathit{treeTime}(v, \chi) \tag{4.3}
\]

4.6.3 Index Access Rate

The last parameter of our cost model to determine for the indexes is the access rate. According to the cost model of Section 3.2.1, the benefit of assigning a vertex to a certain index should be multiplied by the access rate of that assignment.
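As a small numeric illustration, the absolute benefit of Formula 4.3 can be combined with an access rate ρ to obtain the effective benefit described for Formula 3.1. The input values below are invented for the example, and the function names are hypothetical.

```python
def eta_idx(triple_lookup_time: float, tree_time: float) -> float:
    """Absolute index benefit of Formula 4.3: lookup gain plus tree gain."""
    return triple_lookup_time + tree_time


def effective_benefit(eta: float, rho: float) -> float:
    """Effective benefit: absolute benefit multiplied by the access rate."""
    return eta * rho


# Hypothetical measurements for one vertex and one index type.
eta = eta_idx(triple_lookup_time=6.0, tree_time=15.0)
print(eta)
print(effective_benefit(eta, rho=0.5))
```

A vertex with a large absolute benefit but a low access rate can therefore rank below a frequently accessed vertex with a smaller absolute benefit.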

That access rate is based on the workload. The workload analysis methods were explained in Chapter 3. They yield two types of access rates: specific access rates given by the heat query, and general access rates based on average measurements. Moreover, the specific access rates are further generalized by the anonymization process.

The specific access rate of the heat query is expressed by an access rule (Section 3.6). In the next section, we use that rule to derive the index-specific rule using the projection property.

In document: Universal Workload-based Graph Partitioning and Storage Adaption for Distributed RDF Stores (pages 90-93)