
In the previous section, we showed the importance of optimizing the system’s resources, which can be classified into space resources and processing resources. Our adaptable system aims to adapt its resources using its knowledge of the environmental parameters that affect performance. These environment parameters include the workload, the query arrival trends, and the data-set types. We denote these parameters as adaption subjects. The storage space is divided into a set of space resources, such that each resource is the smallest unit of storage that the system considers for optimization. Each resource can be filled with a unit of data equal to its size, which we refer to as the consumer. The storage resource may maintain the unit of data in one of multiple indexes, where each index represents a potential option.

2 Details on the threading cost are given in Chapter 8.

Each resource can be assigned to one consumer and one option, which together form the adaption triangle shown in Figure 3.2. As different triangles can contain the same resource and may deliver different benefits to the system, the optimizer can select the best triangle for each resource. In this context, we can now state our optimization problem as follows:

Given a set of resources, a set of consumers, a set of employment options, and the current knowledge of the adaption subjects, what is the best assignment of consumers to resources, and with which options, such that we obtain the best query execution time?
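The selection step in this problem statement can be illustrated with a minimal Python sketch. It assumes that a benefit estimate is already available for every candidate triangle and that resources are optimized independently of each other; the class `Triangle`, the function `best_triangle_per_resource`, and all identifiers and numbers in the example are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triangle:
    """One candidate adaption triangle (hypothetical representation)."""
    resource: str   # identifier of the storage unit
    consumer: str   # unit of data that would fill the resource
    option: str     # index type that would hold the data
    benefit: float  # benefit estimate, assumed to come from the cost model


def best_triangle_per_resource(triangles):
    """Return, for each resource, the candidate triangle with the highest benefit."""
    best = {}
    for t in triangles:
        current = best.get(t.resource)
        if current is None or t.benefit > current.benefit:
            best[t.resource] = t
    return best


# Illustrative candidates only; the benefit values are made up.
candidates = [
    Triangle("r1", "c1", "hash_index", 0.20),
    Triangle("r1", "c2", "hash_index", 0.35),
    Triangle("r1", "c2", "sorted_index", 0.30),
    Triangle("r2", "c1", "sorted_index", 0.10),
]
chosen = best_triangle_per_resource(candidates)
for resource, triangle in sorted(chosen.items()):
    print(resource, triangle.consumer, triangle.option, triangle.benefit)
```

Treating the resources independently keeps the sketch short; a real optimizer would also have to weigh the cost of each option against its benefit.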

To answer the above optimization problem, the system needs a cost model that can anticipate the benefits of the potential triangles associated with each resource, and this is what the adaption subjects are used for. The system builds its knowledge about these subjects and uses it to anticipate the benefit of employing the system’s resources with different choices. For space resources, these choices are the unit of data and the chosen type of index. The selected unit of data should exist in another source index prior to the optimization process, and the source index must differ from the destination index in place (e.g. memory, hard disk, or remote) or in type. This difference generates a performance benefit to the system, which the optimizer considers when deciding which unit of data to assign to each available storage space resource.

From the above context, we can describe the space-adaptable system as a system that is capable of making optimized decisions about how to employ each of its storage units in order to achieve the best query execution performance within the current knowledge of the workload. At the same time, these cumulative decisions make the whole layer of indexes dynamically changeable in size, with the different local indexes competing for space in order to maintain more data. Moreover, viewed from a higher level, the distributed system is able to dynamically set its replications and local indexes, and the triples move between the nodes in the direction that achieves lower query execution times.

In order for the system to have the required level of universal adaption, it needs to have a unified cost model considering all of the data sources and storage options.

The cost model should assign both benefit and cost values to the proposed decision options, keeping in mind that these cost and benefit values need to be relative and on a single measurement scale, in order to make them comparable on a single selection line.

Figure 3.2: Components of the adaption model

The space-adaptable method can be generalized to the adaption of the processing resources in the system, keeping in mind that this adaption is related to the query arrival trends more than to the workload knowledge.

3.2.1 The Cost Model

In order to achieve the universal adaption objective described in the previous section, we aim for a generalized cost model whose input is kept at the level of a single resource unit r. Please refer to Appendix B.2 for clarification of any mathematical symbol used.

We assume that each resource unit can be consumed by any of c consumer units.

As mentioned earlier, there are multiple options to utilize resource unit r with the selected consumer c. We refer to these options here as a function op(r, c), which returns for each resource and consumer the set of available unique option identifiers. Each option op_i ∈ op(r, c) delivers a benefit to the performance that we denote η(op_i). More precisely, η(op_i) is the ratio of the maximum execution time that could be saved when the system uses the option op_i to the execution time when option op_i is not available to the system. However, the effective system benefit of having option op_i depends on how much the system needs to use the resource r, which we denote as the resource’s access, and which is in turn mapped to the probability of this access, ρ. The general benefit formula of exploiting a certain resource unit with a certain option can be defined as follows:

$$\mathrm{benefit}(op_i) = \eta(op_i) \cdot \rho(op_i), \qquad op_i \in op(r, c) \tag{3.1}$$
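As an illustration of Formula 3.1, the following Python sketch computes η as the fraction of execution time saved and multiplies it by the access probability ρ. The function names `eta` and `benefit`, and the timings used in the example, are hypothetical; in particular, how the two execution times are measured is an assumption of this sketch and not part of the model definition.

```python
def eta(time_without_option: float, time_with_option: float) -> float:
    """Ratio of the execution time saved by an option to the time without it."""
    if time_without_option <= 0:
        raise ValueError("time_without_option must be positive")
    # An option that makes things slower is treated as saving nothing (assumption).
    saved = max(time_without_option - time_with_option, 0.0)
    return saved / time_without_option


def benefit(eta_value: float, access_probability: float) -> float:
    """Formula 3.1: benefit(op_i) = eta(op_i) * rho(op_i)."""
    if not 0.0 <= access_probability <= 1.0:
        raise ValueError("access_probability must lie in [0, 1]")
    return eta_value * access_probability


# Made-up numbers: 200 ms without the option, 50 ms with it,
# and the resource is expected to be accessed with probability 0.5.
e = eta(200.0, 50.0)
b = benefit(e, 0.5)
print(e, b)
```

With these numbers, the option saves three quarters of the execution time, but because the resource is needed only half of the time, the effective benefit is half of that saving.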

Besides a unified and generalized benefit formula, the system needs to find a generalized cost of each option within each resource unit. A suitable and measurable value is the ratio of resource consumption needed to employ this option.

In order to apply this model to the storage space, we need to do the following three things:

• define the storage employment options per single storage unit;

• derive a method of evaluating the benefit of each option, given that the cost of each option is easily measured by the space size needed for each option with respect to the total available size; and

• derive a method to evaluate the probability of accessing each option, depending on the analysis of the adaption subjects, as we explain in the next subsection.
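The storage cost described in the list above, namely the space an option needs relative to the total available space, can be sketched in Python as follows. Ranking the options by benefit per unit of cost is an assumption made here for illustration only; the text requires only that benefit and cost lie on a single comparable scale. The function names, option names, sizes, and benefit values are all hypothetical.

```python
def space_cost(option_size: int, total_available: int) -> float:
    """Cost of an option as the fraction of the available space it consumes."""
    if total_available <= 0:
        raise ValueError("total_available must be positive")
    if option_size <= 0:
        raise ValueError("option_size must be positive")
    return option_size / total_available


def rank_options(options, total_available):
    """Sort option names by benefit per unit of cost, highest first.

    `options` is a list of (name, benefit, size) tuples. The benefit/cost
    ratio is an assumed selection criterion, not one taken from the text.
    """
    scored = []
    for name, option_benefit, size in options:
        cost = space_cost(size, total_available)
        scored.append((option_benefit / cost, name))
    scored.sort(reverse=True)
    return [name for _, name in scored]


# Made-up options competing for 1000 units of storage.
options = [
    ("index_a", 0.30, 400),
    ("index_b", 0.20, 100),
    ("replica_c", 0.25, 500),
]
ranking = rank_options(options, 1000)
print(ranking)
```

In this example the smallest option ranks first although its benefit is the lowest, because it consumes only a tenth of the available space.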

3.2.2 The Resources’ Access Rate

According to the cost model of the previous section, the benefit of utilizing some resource r with a certain consumer c is not used in the optimization process until it is related to the extent of its future usage. We derive this future usage, which appears in Formula 3.1, from the system’s previous knowledge of the performance parameters that we have called adaption subjects. Those subjects are related to the workload and the query arrival trends. The system collects the history of those subjects and analyses it to derive a resource’s access value. That access value represents the system’s usage of that resource when it is employed by certain data structured in a certain option.

The workload analysis process aims to derive the resources’ access values. It can be classified into three categories:

• General analysis, which gathers general statistical metrics measured over the entire collected history without specifying certain data parts. This includes, for example, the average query length, the average query shape, and the average query arrival rate per time unit. This kind of analysis simulates the hard observations that non-adaptable systems use to make fixed assumptions about the workload.

Figure 3.3: Average number of hits per day versus the DBpedia version, as appeared in [78]

• Specific analysis, which targets specific data parts that play the role of a consumer c in Formula 3.1. An example is the count (frequency) of a specific data vertex in the workload.

• Generalized analysis, which is originally a specific analysis that has been generalized to give expectations about other parts of the data or other consumers. An example is generalizing a specific vertex’s frequency to other vertices that have an edge with the same predicate. The motivation behind this generalization approach is to enable the system to work with a smaller amount of collected adaption subjects. However, we treat such generalized analysis with more caution by measuring its effectiveness, in order to decrease the impact of bad generalization.
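The difference between the specific and the generalized analysis can be sketched in Python as follows. The sketch assumes that the workload history is available as a flat list of accessed vertex identifiers and that the data is a list of (subject, predicate, object) triples. The `confidence` weight, which discounts generalized estimates, is a hypothetical stand-in for the measured effectiveness mentioned above, and all names and data in the example are made up.

```python
from collections import Counter, defaultdict


def specific_frequency(workload):
    """Specific analysis: count how often each vertex occurs in the workload."""
    return Counter(workload)


def generalized_frequency(vertex, triples, counts, confidence=0.5):
    """Estimate the frequency of a vertex from observed vertices that
    share an edge predicate with it.

    `confidence` (assumed to lie in [0, 1]) discounts the estimate,
    because a generalization is less reliable than a direct observation.
    """
    if vertex in counts:
        return float(counts[vertex])  # a direct observation needs no generalization

    predicates_of = defaultdict(set)  # vertex -> predicates of its edges
    for subject, predicate, obj in triples:
        predicates_of[subject].add(predicate)
        predicates_of[obj].add(predicate)

    own = predicates_of.get(vertex, set())
    peers = [v for v in counts if predicates_of.get(v, set()) & own]
    if not peers:
        return 0.0
    average = sum(counts[v] for v in peers) / len(peers)
    return confidence * average


# Made-up data: three triples and a short access history.
triples = [
    ("alice", "knows", "bob"),
    ("carol", "knows", "dave"),
    ("erin", "likes", "pizza"),
]
workload = ["alice", "alice", "bob", "alice"]
counts = specific_frequency(workload)

carol_estimate = generalized_frequency("carol", triples, counts)
erin_estimate = generalized_frequency("erin", triples, counts)
print(counts["alice"], carol_estimate, erin_estimate)
```

Here "carol" never occurs in the workload, but she shares the predicate "knows" with two observed vertices, so she receives a discounted estimate, while "erin" shares no predicate with any observed vertex and receives none.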