

4.3 Optimization by data flow transformation

4.3.5 Optimization of communication costs by semi-join reduction and other

Network I/O is a main cost factor in data flows with UDFs, since executing such flows on parallel data analytics systems often requires shipping massive amounts of data to remote compute nodes, where the UDFs perform expensive computations. In fact, Sarma et al. [2013] found that communication and data shipment costs may dominate the total execution costs of data flows in many cases. Transferring data over the network requires broadcasting, re-partitioning, and shuffling of the data, which greatly impacts the overall performance of data analytics systems. Thus, a main objective during data flow optimization is to reduce communication and data shipment costs as much as possible.

In the context of distributed database systems, semi-join reducers were introduced for join optimization with the goal of identifying matching records before an actual join operator is executed. By inserting semi-join reducers into a data flow, an optimizer can avoid shipping records over the network that are not part of the final join result. Consider Figure 4.10, which shows a distributed setting with three compute nodes, where a UDF ▷◁3(A, B, C) performing a 3-way join on the data sets A, B, C shall be executed. A is a mid-sized data set with many attributes stored on node 1, B is a large data set with few attributes stored on node 2, and C is a small data set with a medium number of attributes per record stored on node 3. The computation of the 3-way join can be performed on any of these nodes at different costs. A rule of thumb in such scenarios is to ship the smaller data sets to the nodes where the larger data sets reside in order to reduce communication costs. Since data set B is the largest in our example, a cost-based optimizer decides that all computation should be carried out at node 2, where B is stored; A and C need to be shipped to node 2. Introducing semi-joins at nodes 1 and 3 now attempts to reduce communication costs by first identifying records that qualify for the join condition. To this end, the data flow optimizer introduces a projection with duplicate elimination (unification) on the join attribute a of data set B at node 2 and sends this intermediate data set to nodes 1 and 3. At nodes 1 and 3, semi-joins of the form A ⋉ B and B ⋊ C, respectively, are executed to identify records that contribute to the final result of the 3-way join. Only these records are then sent to node 2, where the final join result is computed.
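The reduction described above can be sketched in a few lines of plain Python, with lists of dictionaries standing in for the distributed data sets. All names below (semi_join_reduce, the sample records) are illustrative assumptions, not part of any real system's API:

```python
# Sketch of the semi-join reduction from the example: node 2 projects B onto
# the join attribute and ships the (small) key set; nodes 1 and 3 then keep
# only records that have a join partner in B.

def semi_join_reduce(a, b, c, key):
    """Reduce A and C to the records that can join with B on `key`."""
    # Node 2: projection with duplicate elimination on the join attribute of B.
    b_keys = {rec[key] for rec in b}

    # Node 1: A semijoin B -- keep only records of A matching some record in B.
    a_reduced = [rec for rec in a if rec[key] in b_keys]
    # Node 3: B semijoin C -- keep only records of C matching some record in B.
    c_reduced = [rec for rec in c if rec[key] in b_keys]

    # Only the reduced sets are shipped to node 2 for the final 3-way join.
    return a_reduced, c_reduced

a = [{"a": 1, "x": "u"}, {"a": 2, "x": "v"}, {"a": 9, "x": "w"}]
b = [{"a": 1, "y": 10}, {"a": 2, "y": 20}]
c = [{"a": 2, "z": True}, {"a": 7, "z": False}]

a_red, c_red = semi_join_reduce(a, b, c, "a")
# a_red keeps the records with a in {1, 2}; c_red keeps the record with a = 2.
```

Only `a_red` and `c_red` travel over the network, so the records with keys 9 and 7, which cannot contribute to the join result, are never shipped.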

Instead of sending entire data sets or join-attribute columns over the network, bit vectors [Chan and Ioannidis, 1998; Valduriez, 1987] or Bloom filters [Bloom, 1970] can be applied to exclude irrelevant records without actually evaluating the join predicate. To accomplish this, a bit vector v of size n is created. Each value of the join attribute a of data set B is mapped to a value in the interval [1, . . . , n] by an appropriate hash function, and the corresponding bits in v are set accordingly. The bit vector v and the hash function are sent to the remote nodes 1 and 3 to identify potential join partners, which are ultimately sent to node 2 for join processing. The bit vector v shipped to nodes 1 and 3 is significantly smaller than the projected join-key column that a semi-join reducer would ship from node 2 to the remote nodes. On the other hand, the set of join candidates identified by bit vector filtering may be significantly larger due to hash collisions; the benefit of filtering highly depends on the choice of a proper hash function and the size of the bit vector. Bit vector filtering is not only valuable for 2-way or multi-way joins, but also for UDFs that involve two phases, such as custom intersections, groupings, etc. In the first phase, such operators can broadcast a bit vector to remote nodes to skip records in the second phase [Miner and Shook, 2012].
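A minimal sketch of this filtering step, again with illustrative names and Python's built-in `hash` standing in for the shipped hash function:

```python
# Bit-vector filtering sketch: node 2 hashes B's join keys into a bit vector
# v of size n and ships only v; a remote node drops every record whose hashed
# key hits an unset bit. Hash collisions may let false positives through,
# which the final join at node 2 still eliminates.

def build_bit_vector(keys, n):
    v = [False] * n
    for k in keys:
        v[hash(k) % n] = True  # set the bit for this join key
    return v

def filter_by_bit_vector(records, key, v):
    n = len(v)
    # A record is a join *candidate* iff its key's bit is set in v.
    return [rec for rec in records if v[hash(rec[key]) % n]]

b_keys = [1, 2]                      # join keys of B at node 2
v = build_bit_vector(b_keys, n=64)   # only these 64 bits are shipped

a = [{"a": 1}, {"a": 2}, {"a": 9}]   # data set A at node 1
candidates = filter_by_bit_vector(a, "a", v)
# candidates contains at least the true matches (keys 1 and 2); a record like
# key 9 survives only if its hash collides with a set bit.
```

Shrinking n reduces shipment cost but raises the collision probability, which is exactly the trade-off noted above between bit vector size, hash function quality, and the number of false candidates shipped back.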

Semi-join reducers were first introduced by Bernstein et al. [1981] and later improved by Apers et al. [1983] to reduce data shipment costs in distributed database queries.

Although a study showed that this technique was beneficial in distributed database systems of the 1980s only for some types of queries [Bernstein and Goodman, 1981], it was re-discovered in the 1990s and 2000s to optimize different types of queries, for example, star joins or top-n queries [Stocker et al., 2001]. In parallel data analytics systems, where huge data sets are shipped between nodes, this technique is also promising for reducing data transfer costs, particularly when n-way joins are involved. To the best of our knowledge, semi-join reducers are currently not contained in any data analytics system's optimizer, but can be added manually by the developer. There are plans to


Figure 4.10: Semi-join reduction of user-defined 3-way join operator to decrease network traffic.

integrate semi-join reducers into the Calcite optimizer for Hive, with promising initial results; however, the integration has not been finished as of September 2016 (see footnote 17).

Apart from introducing semi-join reducers, the data shipment strategy itself is an important subject of adjustment during data flow optimization. Intuitively, when two or more data sets or partitions shall be combined in a distributed setting, two data shipment strategies are deemed beneficial: (1) ship as a whole, which ships entire data sets to remote nodes, and (2) fetch as needed, where compute nodes fetch records on demand. Clearly, both approaches have severe disadvantages: the first results in high volumes of shipped data, and the second yields a vast number of messages exchanged between the compute nodes. Different strategies have been proposed to overcome this problem, namely the introduction of on-the-fly indexing, data compression, and partition pulling. Another option for optimizing network transfer and data shuffling costs is to pull and replicate entire partitions on certain compute nodes if they are sufficiently small [Graefe, 2009]. In the same way, Alexandrov et al. [2015]

propose to re-use existing partitions by first computing sets of interesting partitionings for a given data flow and enforcing such partitioning in early stages of the data

17http://issues.apache.org/jira/browse/CALCITE-468, last accessed: 2016-10-31.

flow execution. In addition, data analytics systems, such as Flink, Spark, or Cloudera, commonly apply techniques for data compression before shipment across the network.
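The trade-off between the two shipment strategies can be made concrete with a toy cost model. All sizes and overheads below are assumed numbers chosen for illustration, not measurements from the text:

```python
# Back-of-the-envelope comparison of the two shipment strategies:
# "ship as a whole" pays for the full data volume once, while
# "fetch as needed" pays a per-record message overhead on top of the payload.

def ship_as_whole_cost(num_records, record_bytes):
    # Entire data set crosses the network exactly once.
    return num_records * record_bytes

def fetch_as_needed_cost(num_requested, record_bytes, msg_overhead_bytes):
    # One request/response round trip per fetched record.
    return num_requested * (record_bytes + 2 * msg_overhead_bytes)

# Shipping a 10M-record set wholesale vs. fetching only 1% of its records
# (100-byte records, assumed 60 bytes of per-message overhead):
whole = ship_as_whole_cost(10_000_000, 100)
fetch = fetch_as_needed_cost(100_000, 100, 60)
# With a small fraction requested, fetch-as-needed moves far fewer bytes;
# as the requested fraction grows, the per-message overhead dominates and
# shipping the data set as a whole becomes cheaper.
```

Techniques like on-the-fly indexing, compression, and partition pulling can be read as ways of moving the break-even point of this trade-off: they shrink either the shipped volume or the number of exchanged messages.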