
Scalable Data Analytics in the Cloud

3.2 XDB Overview

XDB is a parallel database built using a middleware approach that leverages an existing single-node database for query processing (i.e., MySQL in the current prototype). In the remainder of this section, we first present the architecture of an XDB cluster. Afterward, we show how complex analytical queries can be formulated in XDB and how these queries are compiled, optimized, and executed.

3.2.1 System Architecture

Figure 3.1 shows the system architecture of XDB. An XDB cluster consists of one Master Node, which accepts analytical SQL queries or complex functions from clients; several Query Coordinators, which are responsible for coordinating and monitoring query execution; and finally a large number of Compute Nodes, which actually execute sub-queries of the given query in parallel over the partitioned data. In the following, we briefly discuss the tasks of each node type.

Figure 3.1.: System Architecture of XDB

Master Node: The Master Node hosts two components, a Compiler-and-Optimizer and a Master Tracker. The Compiler-and-Optimizer component takes a SQL query or a function as input and produces a compile plan using the catalog (which holds metadata and statistics about the tables). The compile plan is then optimized using rule-based and cost-based optimizations for join reordering and predicate push-down. The second component, the Master Tracker, has two tasks: First, it monitors the cluster resources (Query Coordinators and Compute Nodes) and restarts these nodes if they fail. Second, the Master Tracker assigns a compile plan to a Query Coordinator, which is responsible for coordinating and monitoring its execution.
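The Master Tracker's two tasks described above can be sketched as follows. This is a minimal illustration, not XDB's actual code: the class and method names (`MasterTracker`, `is_alive`, `execute`) are our own assumptions, and the round-robin assignment policy is one plausible choice the text does not specify.

```python
# Hypothetical sketch of the Master Tracker: (1) monitor cluster
# resources and restart failed ones, (2) assign compile plans to
# Query Coordinators. Names and the round-robin policy are assumptions.

class MasterTracker:
    def __init__(self, coordinators):
        # coordinators: Query Coordinator handles exposing is_alive(),
        # restart(), and execute(plan)
        self.coordinators = coordinators
        self.next = 0

    def monitor(self):
        # Task 1: restart failed Query Coordinators (Compute Nodes
        # would be probed analogously).
        for c in self.coordinators:
            if not c.is_alive():
                c.restart()

    def assign(self, compile_plan):
        # Task 2: hand the compile plan to one Query Coordinator,
        # here via simple round-robin.
        c = self.coordinators[self.next % len(self.coordinators)]
        self.next += 1
        c.execute(compile_plan)
        return c
```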

Query Coordinator: A Query Coordinator hosts two components: a Query Scheduler and a Query Tracker. The Query Scheduler takes a compile plan from the Master Node, splits it into multiple partial execution plans, and sends those plans to different Compute Nodes for parallel execution. The result of each partial execution plan is materialized locally by each Compute Node. Materialization serves several purposes: implementing fault tolerance, adaptive query parallelization, and distributed query processing over the individual MySQL database instances. The second component, the Query Tracker, is responsible for monitoring and redeploying a partial execution plan in case of a Compute Node failure. We separate the Query Coordinator from the Master Node so that the number of Query Coordinators can be scaled independently based on the load in the system.
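The splitting step can be illustrated with a toy sketch. The plan and partition representation below is a simplification we assume for illustration (an SQL template with a `{part}` placeholder); XDB's real partial execution plans are richer.

```python
# Illustrative sketch (not XDB's actual code) of a Query Scheduler
# deriving one partial execution plan per horizontal partition.

def split_plan(compile_plan, partitions):
    """compile_plan: an SQL template with a {part} placeholder for the
                     partition-local table name (our simplification).
       partitions:   mapping of partition name -> Compute Node address.
    """
    partial_plans = []
    for part, node in partitions.items():
        partial_plans.append({
            "node": node,                       # where to execute
            "sql": compile_plan.format(part=part),
            "materialize": True,                # result kept locally for
                                                # fault tolerance
        })
    return partial_plans
```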

Compute Node: As discussed before, a Compute Node receives a partial execution plan (represented as SQL queries) from the Query Coordinator; its Executor component then executes this plan. The Executor signals the Query Coordinator once a partial execution plan has finished.
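A minimal Executor loop, under the assumptions above, might look as follows. The `run_sql` callable and the coordinator handle stand in for the real MySQL connection and the RPC layer, which the text does not detail.

```python
# Hypothetical Executor loop for a Compute Node: run each SQL fragment
# of a partial execution plan locally, then signal the Query Coordinator.
# run_sql and coordinator are placeholders for the real database
# connection and communication layer.

def execute_partial_plan(partial_plan, run_sql, coordinator):
    results = []
    for fragment in partial_plan["fragments"]:
        # Each fragment is a plain SQL query over local (or federated)
        # tables; its result is materialized locally.
        results.append(run_sql(fragment))
    coordinator.signal_finished(partial_plan["id"])
    return results
```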

CREATE FUNCTION basketAnalysis (OUT frequentItems TABLE)
BEGIN
VAR t1 =
  SELECT
    l1.l_partkey, l2.l_partkey,
    p1.p_type, p2.p_type, count(*) as frequency
  FROM
    lineitem l1, lineitem l2, part p1, part p2
  WHERE l1.l_orderkey = l2.l_orderkey
    AND l1.l_partkey = p1.p_partkey
    AND l2.l_partkey = p2.p_partkey
    AND l1.l_partkey != l2.l_partkey
    AND l1.l_shipdate >= date '1998-03-01'
    AND l2.l_shipdate >= date '1998-03-01'
    AND l1.l_shipdate <= date '1998-03-31'
    AND l2.l_shipdate <= date '1998-03-31'
  GROUP BY
    l1.l_partkey, l2.l_partkey, p1.p_type, p2.p_type
  HAVING frequency >= 2;

:frequentItems =
  CALL similarity(:t1, p1.p_type, p2.p_type, 0.3);
END;

Figure 3.2.: Basket Analysis Function (TPC-H Schema)

3.2.2 Query Language and Execution

XDB supports analytical queries over arbitrary database schemata, which are partitioned over different Compute Nodes using our elastic partitioning scheme. An analytical query in XDB can either be implemented as a pure SQL statement2, or it can be defined using a function in FunSQL [25]. Different from a pure SQL statement, a FunSQL function supports the definition of a data flow graph (as other languages like Scope [122] or Pig [47] do). A FunSQL function in XDB typically consists of assignments of SQL statements to variables and calls to other functions.

An example of a FunSQL function is given in Figure 3.2, which implements a basket analysis on top of the TPC-H schema to retrieve frequent combinations of products with similar names that are placed together in the same order. The FunSQL function uses a SQL statement and a call to another function, which is implemented using a user-defined operator (UdOp): while the SQL statement returns products (parts) bought together in a given time frame, the user-defined code extracts those combinations that have similar product descriptions (column p_type) using a string similarity metric.
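To make the UdOp concrete, the following sketch shows one possible implementation of such a similarity filter. The actual metric used by the similarity function in Figure 3.2 is not specified in the text; we assume a token-based Jaccard similarity here purely for illustration, with rows shaped like the result of the SQL statement (partkeys, two p_type strings, frequency).

```python
# A hypothetical similarity UdOp: keep product pairs whose p_type
# descriptions exceed a similarity threshold. The Jaccard metric and
# the row layout are our own assumptions, not XDB's implementation.

def jaccard(a, b):
    # Token-based Jaccard similarity between two description strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def similarity(rows, threshold=0.3):
    # rows: tuples (l1_partkey, l2_partkey, p_type1, p_type2, frequency);
    # keep pairs with sufficiently similar product descriptions.
    return [r for r in rows if jaccard(r[2], r[3]) >= threshold]
```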

Analytical SQL queries and FunSQL functions are compiled by the Master Node into data flow graphs (i.e., compile plans), which consist of relational operators as well as user-defined operators

2 Currently, XDB supports only a subset of SQL'92, but XDB is conceptually not limited to this subset.

(UdOps) representing calls to functions implemented by user-defined code. Rule-based optimizations like selection-pushdown are then applied to the compile plan.
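A rule such as selection pushdown can be illustrated on a toy plan tree. The dict-based operator representation below is our own simplification for illustration, not XDB's actual compile-plan data structure, and the rule shown handles only the simplest case (a selection directly above a scan).

```python
# Toy illustration of rule-based selection pushdown on a compile plan
# represented as nested operator dicts (an assumed representation).

def push_down_selections(plan):
    """Merge a SELECT sitting directly above a SCAN into the scan itself,
    so the predicate is evaluated as early as possible."""
    if plan["op"] == "select" and plan["child"]["op"] == "scan":
        scan = plan["child"]
        # Attach the predicate to the scan and drop the select node.
        return dict(scan, predicates=scan.get("predicates", []) + [plan["pred"]])
    if "child" in plan:
        # Recurse into unary operators further down the plan.
        return dict(plan, child=push_down_selections(plan["child"]))
    return plan
```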

The compile plan is then handed over to one Query Coordinator. The Query Scheduler of the Query Coordinator first parallelizes the given compile plan using the attached catalog information about partitioning schemes and the locations of partitions. For parallelization, we re-use the framework presented in [122]. As a next step, the Query Scheduler decides which intermediate results to materialize for fault tolerance. Details about the fault-tolerance model implemented in XDB are discussed in Section 3.4.

Finally, before executing a compile plan, the Query Scheduler decides for each partial compile plan on which Compute Node to execute it. To execute the partial compile plans, the Query Scheduler generates a partial execution plan for each partial compile plan and sends it to the given location. For code generation, the relational operators in each partial compile plan are turned into query fragments (SQL queries) over the given partitions. If a query fragment accesses data from a remote Compute Node, XDB uses federated tables in MySQL to read table data from a different database instance. The complete execution plan thus forms a dependency graph of partial execution plans, which defines the order of their execution.
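Such a dependency graph implies a topological execution order: a partial plan can run only once all plans whose materialized intermediate results it reads have finished. The sketch below illustrates this ordering with Kahn's algorithm; the graph representation (plan id mapped to the set of ids it depends on) is an assumption for illustration.

```python
# Illustrative topological ordering of partial execution plans by their
# dependencies (Kahn's algorithm). Representation is assumed, not XDB's.
from collections import deque

def execution_order(deps):
    """deps: plan id -> set of plan ids it depends on."""
    indeg = {p: len(d) for p, d in deps.items()}
    dependents = {p: [] for p in deps}
    for p, d in deps.items():
        for q in d:
            dependents[q].append(p)
    # Plans without dependencies (e.g., base-table scans) can start first.
    ready = deque(sorted(p for p, n in indeg.items() if n == 0))
    order = []
    while ready:
        p = ready.popleft()
        order.append(p)
        for q in dependents[p]:
            indeg[q] -= 1
            if indeg[q] == 0:
                ready.append(q)
    if len(order) != len(deps):
        raise ValueError("cycle in plan dependency graph")
    return order
```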