
The main purpose of the heat query structure is to estimate the access rates of the resources based on the workload. However, those resources also have other access rates, which are calculated based on the average behavior of the workload. Thus, the access rates given by the heat queries are considered specific rules, and the average access rates are considered general rules. These rules can then be aggregated for each resource using the rules' properties explained earlier in Section 3.4. For this reason, we transform the heat queries into a specific access rule in the following definition:

Definition 3.8 (Heat Query Specific Rule). The heat queries set $H$ (Section 3.5) defines a single rule for each index $\chi \in X$ as:

$r_{he}(\chi) = (s_{he}, \hat{V}_{he}, a_{he})$

where:

$s_{he} = \{\, s \mid s = q_G,\ \forall (q_G, F, X) \in H \,\}$ is the set of query graphs of all heat queries in the system, as defined in Section 3.5.

$\hat{V}_{he}$ is the answer of $H$, $a_{he} = \{\, (v, a) \mid \forall v \in \hat{V}_{he}, \forall \chi \in X,\ a = access(v, \chi) \,\}$, and $access(v, \chi)$ is given by Formula 3.7.

We further define $R_{he}$ as the set of all heat query specific rules:

$R_{he} = \{\, r_{he} \mid \forall \chi \in X,\ r_{he} = r_{he}(\chi) \,\}$
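To make Definition 3.8 concrete, the following is a minimal sketch in Python. It uses simplified, hypothetical stand-ins for the heat query structure of Section 3.5 and a placeholder for the access function of Formula 3.7; all names and signatures are illustrative assumptions, not the thesis' implementation.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for the heat query structure of Section 3.5.
@dataclass
class HeatQuery:
    query_graph: frozenset  # q_G: the connected patterns of the heat query
    frequencies: dict       # F: recorded access frequency per pattern item
    indexes: frozenset      # X: indexes used to evaluate the included queries

def access(v, chi, heat_queries):
    """Placeholder for Formula 3.7: access rate of vertex v under index chi."""
    return sum(h.frequencies.get(v, 0) for h in heat_queries if chi in h.indexes)

def heat_query_rule(chi, heat_queries, answer_vertices):
    """Build r_he(chi) = (s_he, V_he, a_he) following Definition 3.8."""
    s_he = {h.query_graph for h in heat_queries}            # source: all query graphs
    v_he = frozenset(answer_vertices)                       # the answer of H
    a_he = {v: access(v, chi, heat_queries) for v in v_he}  # access value per vertex
    return (s_he, v_he, a_he)

def all_heat_query_rules(X, heat_queries, answer_vertices):
    """R_he: one heat query specific rule per index chi in X."""
    return {chi: heat_query_rule(chi, heat_queries, answer_vertices) for chi in X}
```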

[Figure 3.7: Workload rules' maps. The workload's arriving queries feed the general statistics and the heat queries; these yield the base of general rules, the heat queries access rules, and the heat join-map access rule, from which other access rules are derived by projection.]

The heat query represents an access rule for each index in the system. The source of this rule is the heat query itself, which has a set of connected patterns and their frequency of access. The access formula assigns an access value to each vertex in $\hat{V}_{he}$. Since the formula is a function of the heat query, the access function is a function of the source, which means that we can find the access value of each pattern in the heat query. This enables Property 4 (Source Ordering) of the rules' properties (Section 3.4), which allows ordering a rule by its source's elements without having to maintain its exact set of vertices. This property enables an important optimization in the universal space adaptation algorithm, as we explain later in Section 6.2.2.
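The sketch below illustrates how the Source Ordering property could be exploited: a rule's vertices are enumerated in decreasing access order by walking the source's patterns, without first materializing the full vertex set. The function names and the per-pattern access map are illustrative assumptions.

```python
import heapq

def order_rule_by_source(patterns, pattern_access, vertices_of):
    """Hypothetical sketch of Property 4 (Source Ordering): enumerate a
    rule's vertices in decreasing access order by walking the source's
    patterns, without materializing and sorting the full vertex set.

    patterns       -- the connected patterns of the heat query (the source)
    pattern_access -- access value per pattern, derived via Formula 3.7
    vertices_of    -- callable that lazily yields the vertices a pattern matches
    """
    # Max-heap over the source's patterns only: memory is bounded by the
    # number of patterns, independent of how many vertices the rule covers.
    heap = [(-pattern_access[p], i, p) for i, p in enumerate(patterns)]
    heapq.heapify(heap)
    while heap:
        neg_a, _, p = heapq.heappop(heap)
        for v in vertices_of(p):
            yield v, -neg_a  # every vertex of p inherits p's access value
```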

3.7 Summary

This chapter presented the methods that we use to store and analyse the workload for the purpose of deriving the data vertices' access rates.

• The analysis of real-world queries shows that the RDF workload often contains detectable trends and frequent patterns.

• We divide the storage space into resource units. Each unit can be utilized by one consumer, which is an equal-sized piece of data. The consumption can be realized through one out of multiple index options. A triangle of resource, consumer, and option forms an assignment.

• For each assignment, we calculate the performance benefit and the access rate. The product of those two values gives the assignment's effective benefit.

• We use the workload to detect the access rates of vertices and indexes.

• The workload analysis is performed using access rules. Each rule is designed to look for a certain trend in the workload. A rule can be projected on a specific region of the data, and two rules targeting a shared region of the data can be aggregated (see the sketch after this list).

• The access rules that look for trends targeting specific vertices in the data graph are called specific rules. The general rules look for average trends.

• A heat query is composed by combining multiple queries. It records the counts of the frequent queries' items and the statistics of the indexes used to evaluate the included queries.

• We use the heat queries set to create one specific access rule that assigns an access value to any vertex in the data set. That rule can be projected on any defined region in the data set.

• An access rule is transformed into an operational rule by providing a relative benefit function.

• The methodology of the rules allows the flexible plugging of future rules.
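As a concluding illustration of this methodology, here is a minimal sketch of an access rule interface with hypothetical `project` and `aggregate` operations. It reflects the properties summarized above under assumed names, not the thesis' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class AccessRule:
    source: object       # where the rule's trend was detected (e.g. a heat query)
    vertices: frozenset  # region of the data graph the rule currently covers
    access: dict         # vertex -> access value

    def project(self, region):
        """Restrict the rule to a defined region of the data."""
        kept = self.vertices & frozenset(region)
        return AccessRule(self.source, kept,
                          {v: a for v, a in self.access.items() if v in kept})

    def aggregate(self, other):
        """Combine two rules targeting a shared region (summed access here)."""
        merged = dict(self.access)
        for v, a in other.access.items():
            merged[v] = merged.get(v, 0) + a
        return AccessRule((self.source, other.source),
                          self.vertices | other.vertices, merged)
```

A future rule would then plug in simply by producing such an object from its own trend detector.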

Local Storage

In typical key-value RDF stores, the triple data are stored in indexes. To achieve the required query execution performance, a triple store requires multiple data-wide indexes. Due to their high space impact, stores choose to build only some of the indexes. The decision for specific indexes is based on observations of the workload and the store's storage-saving strategy. Instead of this fixed-indexes strategy, we let the indexes dynamically adapt to the status of the workload and the storage space. Moreover, we integrate the indexes into the cost model and define two access rules which are integrated into operational rules. Those rules are comparable to the join cache and replications, allowing a universal storage adaptation.
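As a rough sketch of the idea (not the chapter's actual algorithm, which follows below), dynamic adaptation can be pictured as ranking candidate index assignments by their effective benefit, i.e. performance benefit times access rate as summarized in Section 3.7, relative to their size, and keeping what fits the space budget. All names and numbers here are illustrative assumptions.

```python
def adapt_indexes(candidates, budget):
    """Hypothetical greedy sketch of dynamic index adaptation: keep the
    index assignments with the highest effective benefit per storage unit
    that fit the available space budget.

    candidates -- iterable of (index_name, size, perf_benefit, access_rate)
    budget     -- available storage units
    """
    ranked = sorted(candidates,
                    key=lambda c: c[2] * c[3] / c[1],  # benefit density
                    reverse=True)
    chosen, used = [], 0
    for name, size, perf, rate in ranked:
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen

# Example: with a budget of 10 units, SPO and POS fit, OSP does not.
print(adapt_indexes([("SPO", 6, 1.0, 0.9), ("POS", 4, 0.8, 0.7),
                     ("OSP", 5, 0.6, 0.2)], budget=10))
```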

Contents

4.1 Storage Scarceness
4.2 System Storage Hierarchy
4.3 Indexes
4.4 Problem of Fixed Indexes
4.5 Dynamic Indexes
4.6 Indexes in the Cost Model
4.7 Index Rules
4.8 Index Rules Aggregation
4.9 Cache Index
4.10 Dynamic Indexes Evaluation
4.11 Summary



In 1965, Moore formulated what was later called Moore's law [54]. He predicted that the number of transistors per area unit would double approximately every two years. This rate set accurate trends for speed, size, and price in the digital world. The hard disk contains electromagnetic components besides the normal electronic circuits. The average number of gigabytes per price unit since 1982 followed a similar exponential trend, approximately doubling every four years. This behaviour is shown in Figure 4.2, which plots the average price of hard disks since 1982.

The price data was collected by [52] from different sources. However, as the plot also makes clear, the exponential trend has changed towards a linear trend since 2015. This is clearer in Figure 4.3, which has a linear y-axis.

On the other hand, the size of the data in the digital world doubles every two years, and is expected to double every year in the next decade. The ratio of data size growth to disk size growth was, unfortunately, greater than one, and is getting much bigger.
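A back-of-the-envelope calculation makes this widening gap concrete, using the doubling periods cited above (data volume every two years, affordable disk capacity every four years):

```python
# Back-of-the-envelope: data volume doubles every 2 years, affordable disk
# capacity doubles every 4 years (the trends cited above). The ratio of
# data size to affordable capacity then itself doubles every 4 years.
for years in (4, 8, 12):
    data = 2 ** (years / 2)  # growth factor of data volume
    disk = 2 ** (years / 4)  # growth factor of capacity per price unit
    print(f"after {years:2d} years: data x{data:.0f}, disk x{disk:.0f}, "
          f"ratio x{data / disk:.0f}")
# after  4 years: data x4, disk x2, ratio x2
# after  8 years: data x16, disk x4, ratio x4
# after 12 years: data x64, disk x8, ratio x8
```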

The main memory, or RAM, is still small in size compared to the big-data scale. Although it followed Moore's law more closely, it has also deviated and failed to catch up starting from 2015.

Figure 4.2: Average hard disk price through the years 1980-2019 as appeared in [52]

Figure 4.3: Average hard disk price through the years 2015-2019 as appeared in [52]

These trends of the hard disk and main memory, as well as the trend of data size increase, highly motivate all works that aim at a wise and optimized usage of the storage space. This should not, however, be interpreted as marking the works that always try to save storage space as winners.

This is due to the following reasons:

• The ratio of the RDF data-set size to the available storage space could be small.

• The RDF system does not use the storage space only to store the raw data; there are multiple storage employment levels for the sake of query performance. Increasing the space within the RDF system does not only mean more space for new data, but also better query execution time.