Relation ABCD
7.1. Related Work on Workload Decomposition &
Representation
Query optimization as well as physical design has to decompose workloads for optimal query-plan respectively physical-design estimation. Therefore, researchers propose several approaches with respect to different optimization criteria that we briefly introduce and compare with our approach in the following (cf. Table 7.1).
An approach for optimal disk performance on mixed workloads proposed by Turby-fill [Tur88] characterizes (disk) access pattern for OLTP, batches, and ad-hoc queries2 based on heuristics (i.e., experiments). The access patterns are used to aggregate similar workload parts and to decide how these patterns are executed – (a) in parallel with possible drawbacks to OLTP or (b) in different time windows.
In contrast to the heuristic-based approach by Turbyfill, Raatikainen [Raa93] dis-cusses general hardness of automatic workload classification and of cluster analysis, which arise from classification. Raatikainen argues that (statistical) workload classi-fication is not necessarily feasible for query-optimization purposes. On the one hand, classification is highly dependent on similarity function for clusters. On the other hand, stability and validity of clusters are identified as an issue. Finally, Raatikainen shows instability of response times for clustering algorithms, thus, such an approach is not suitable for our purpose – at least in terms of query processing due to time constraints and their volatility.
1In Chapter 2, we give a coarse overview for corresponding approaches in Row Stores.
2Nowadays, batches and ad-hoc queries summarized as OLAP.
Nevertheless, Chaudhuri et al. use a clustering approach which targets on superior support of self-managing DBMSs by reduction of size and diversity of workload in-formation [CGN02]. The used distance function is focused on index selection in this approach (i.e., to reduce complexity for the index selection problem). However, work-loads have to be known in advance. Whenever a workload is not known in advance, Chaudhuri et al. use sampling approaches additionally (i.e., additional complexity for quality results). Holze et al. also target on the reduction of size and diversity of workload information. Therefore, they use a clustering approach based on the gener-alized Minkowski metric to evaluate the similarity [HGR09]. In more detail, Holze et al. generalize the approach of Chaudhuri et al. [CGN02] for ongoing workloads (i.e., without sampling). Both approaches concern the problem of cluster stability stated by Raatikainen. That is, the minimal number of clusterkis determined by a quality loss measure for new clusters that in total must be below threshold δ. Chaudhuri et al. solve the validity problem by the assumption that workloads are known in ad-vance respectively sampling; whereas Holze et al. solve this problem with a learning phase. That is, cluster adaptation freezes after learning phase that ends whenever the maximum distance of new elements to existing clusters fall short of thresholdδ.
Hence, periodical execution of cluster analysis is limited to an applicable amount by user interaction at expense of cluster quality.
Another point of view for workload decomposition is the temporal aspect. The temporal aspect is suitable whenever different workload types occur periodically (e.g., OLAP and ETL, daily operations and reorganizational tasks). That is, decomposed workload parts represent similar workload tasks for a period of time, thus, optimiza-tion (e.g., physical design) is focused on workload parts for a corresponding time period instead of optimization for (24/7) overall workload. Agrawal et al. name such temporary workload parts with the term sequence (i.e., a set of SQL state-ments). The following two points are unique characteristics for this approach: (a) it does not distinguish between different physical design approaches (e.g., indexes, materialized views) and (b) it preserves query respectively (SQL)-statement order.
In contrast to the previously introduced approaches, the following two approaches focus on OLAP respectively DWH workloads. That is, the other approaches are not limited to or are at least not especially designed a single for workload domain.
Ghosh et al. present a clustering-based decomposition in which template-query plans represent workload clusters [GCCK07]. A template-query plan is a query plan in which all operations remain but identifying values (e.g., table name) are removed (like a query with bind variables). A template-query plan represents all queries within a certain distance (i.e., a cluster); whereby a hierarchical similarity approach – called SIMCHECK – is used to reduce complexity. Query plans are reused in this approach whenever a similar cluster exists (i.e., a template-query plan). As a result, Ghosh et al. reduce query-optimization cost for complex queries greatly.
Nevertheless, template-query plans are sub-optimal for similar queries but according to the authors fairly close to optimal. An approach by Favre et al. [FBB07] is designed for workload evolution (i.e., the queries to be analyzed) in DWHs. That is, Favre et al. do not focus on workload analysis for physical design, but they focus on query
sets, which represent workloads after evolution. Therefore, they analyze updates on data and schema to update the workload (i.e., add, delete, or adapt queries).
One may be surprised at our set of related approaches. To the best of our knowl-edge, besides our approach exists no approach that supports workload decomposition and representation for physical-design estimation and query optimization. Therefore, we present related decomposition approaches for query processing (e.g., by Ghosh et al. [GPSH02]) as well as for physical design (e.g., by Chaudhuri et al. [CGN02]).
We consider the access-pattern approach for mixed workloads by Turbyfill [Tur88]
due to the fact that our approach inherently considers optimal access (patterns) by storage-architecture selection respectively query dispatching to different stor-age architectures. In more detail, we also consider disk-access patterns for mixed OLTP/OLAP workloads because our approach is not limited to main-memory DBS.
The heuristic-based approach for workload patterns is related to our heuristic frame-work for physical design and query processing. In contrast to our approach, this approach is limited to a certain DBMS (type). We additionally provide iterative improvement for our framework.
Furthermore, we discuss the considerations of Raatikainen [Raa93] to show hard-ness of workload clustering (cf. Lange et al. [LRBB04]; e.g., cluster may be found where no (natural) cluster exists). In line with Raatikainen, we argue that clustering is not suitable for workload representation in general. On the one hand, we state that clustering algorithms are too complex for a multi-purpose approach. That is, identifi-cation of optimal number of cluster is hard, even if workload is not known in advance (cf. Chaudhuri et al. [CGN02]), but cluster stability and validity are more difficult for ongoing workloads (cf., by Holze et al. [HGR09]) that implies costly cluster re-calculation (e.g., by user interaction). Due to the extension of ideas by Chaudhuri et al. [CGN02], the approach by Holze et al. [HGR09] is feasible for OLTP and OLAP in a broad outline. However, we argue that workload clustering is not cost-efficient for query optimization, but we do not prevent workload clustering with our approach.
Clustering on top of our workload-representation approach is still applicable even though our superjacent approaches may be adapted. On the other hand, clustering hides the workload sequence, which is crucial for design tasks and query processing in mixed OLTP/OLAP in our opinion.
In line with Agrawal et al. [ACN06], we argue workload sequences are an impor-tant aspect for quality of physical design due to the fact that requirements and/or drawbacks for special tasks (e.g., ETL) are hidden in overall workloads. That is, tuning aspects are not considered that can have crucial impact on performance of a workload sequence (e.g., indexes not dropped before ETL/reorganization). The ap-proach is designed for, but not limited to, DWH environments – we argue, workload sequence with variable requirements can arise anywhere (not for DWHs only). In comparison, our approach is not limited to certain domain and not focused on access structures, but preserves order of SQL statements as well (i.e., the sequence).
Another approach also targets on DWHs especially on complexity reduction of workload representation for query processing is proposed by Ghosh et al. [GPSH02].
The approach reduces query-evaluation time (i.e., for query plan creation) for
sim-OLTP/ Arbitrary Preserve Workload Architecture
Approach OLAP Cost Model Query Order Aggregation Independent
Turbyfill [Tur88] • ◦ ◦ • ◦
Raatikainen [Raa93] − ◦ ◦ • −
Evrendilek and Dogac [ED95] − ◦ − ◦ •
Chaudhuri et al. [CGN02] − ◦ ◦ • −
Agrawal et al. [ACN06]3 − ◦ • • •
Ghosh et al. [GCCK07] ◦ − ◦ • ◦
Favre et al. [FBB07] ◦ ◦ ◦ ◦ •
Holze et al. [HGR09] • ◦ ◦ • •
AQUA2 • • • • •
Table 7.1.:Comparison of key aspects for workload decomposition and representation.
Legend: • fulfilled, ◦ not fulfilled,− no information available.
ilar queries. They focus on complex OLAP queries where query-evaluation can be notable. That is, they reduce query-evaluation time by orders of magnitude, but evaluation time is in the range of milliseconds and below even without optimization.
We assume, benefits for query evaluation are less significant for less complex queries.
We argue, query execution on the most suitable architecture with optimal query plan is more crucial than savings in query-evaluation time. However, we do not directly support reuse of query-plans. As mentioned before for other clustering approaches, clustering for complexity reduction still is applicable on top our representation.
The approach of Favre et al. [FBB07] is located in the DWH domain as well.
Their approach solves issues for the correctness of workload information induced via DDL and DML in ETL. In other words, the approach prevents usage of old invalid workload information for workload analysis. In our domain, schema modifications in a bulk are not common, thus, a complex analysis for schema evaluation is not necessary. However, the basic idea is of interest for automatic redesign in hybrid systems. Nevertheless, evolution of workload information (e.g., column renaming) can be easily automated via DDL monitoring or recalculation of DBMS statistics via DML monitoring (i.e., aging) which is only necessary for reusing query-plans.
We summarize key aspects for the above considerations in Table 7.1. We highlight that in contrast to our approach, neither approach supports arbitrary cost models due to specialization of approaches for a certain domain in one architecture even though some approaches are not architecture-dependent directly (e.g., [ACN06, HGR09]).
Finally, we consider an approach, which is marginally related only, by Evrendilek and Dogac [ED95] for two reasons. First, they propose a decomposition approach for distributed processing even though the approach is focused on queries. Second, the approach follows similar ideas for (sub-) query distribution including availability of cost estimates at global level. Both aspects are related to our hybrid-system approach. Evrendilek and Dogac propose a cost model for query distribution in distributed DBMSs with respect to data fragmentation and highlight the idea of
3Designed for DWHs, but sequence of different workload types arise elsewhere too.
load balancing for replicated data; whereby due to NP-completeness of the (query) assignment problem a greedy approach is proposed. However, this approach is not transparent for cost models and does not support any workload administration (e.g., aggregation) in contrast to our approach. Moreover, cost-based query optimization is only applicable for joins according to the authors.