
4. Adaption of the n-way Dissemination Algorithm 61

4.3. Experimental Results

To obtain first results on the usability of the adapted n-way dissemination algorithm for an allreduce, the algorithm was implemented as a GPI2 routine within version 1.0.1. The runtime of the algorithm was measured with one timestamp taken right before the call to the allreduce and one taken right after the return of the routine.

The experiments were conducted on different machines. The first and smallest test system was Aenigma, equipped with dual-socket 6-core Intel Westmere X5670 @2.93 GHz nodes, which are connected through an IB Quad Data Rate (QDR) interconnect in a fat tree topology.

The second system the algorithm was tested on was CASE, a system with dual-socket 12-core Ivy Bridge E5-2695 v2 @2.4 GHz nodes, connected through an IB Fourteen Data Rate (FDR) network, also with a fat tree topology.

Different aspects have to be considered when using the n-way dissemination algorithm. For example, the choice of n depends not only on P, but also on the underlying network. As described in the previous section, the runtime of the algorithm depends on the message transfer times, which in turn are influenced by the bandwidth and the implicit parallelism of the network.

Thus, the choice of n was investigated in preliminary tests on Aenigma. For this test, the n-way dissemination algorithm was implemented with the sum as the reduction operation and one integer as the initial data of each node. For each number of ranks, the algorithm was run 10^6 times for each possible n. Figure 4.4 shows the importance of a good choice of n, i.e., the one that results in the fastest allreduce algorithm.

[Figure: time in µs over number of nodes; curves for fastest n and smallest n.]

Figure 4.4.: Comparison of n-way dissemination algorithm average runtimes with smallest possible n and fastest n on Aenigma.

The average runtimes of the n-way dissemination algorithm with the smallest possible n for the adaption are compared to those of the fastest possible n on Aenigma. In the worst cases, i.e., for 8 and 10 nodes, the runtime increases by more than 57 %, showing the importance of a well-chosen n. The next question to investigate is therefore how to choose n. In a first approach, the choice of n was implemented within the allreduce itself. In most applications, an allreduce is used several times during the runtime for the same group. It is hence possible to time the allreduce with the different possible values of n and take the n with the lowest average runtime for the following allreduces. In the first run of an allreduce, all possible n were run 10 times each, and the one with the lowest average runtime was then chosen for the following allreduces.

This introduces a high overhead for the first usage of the allreduce, but as seen in Fig. 4.5 for Aenigma, the average runtime of the allreduce after 10^5 runs is still much better than that of the native GPI2-1.0.1 allreduce and even the allreduces of the two available MPI implementations, MVAPICH 2.2.0 and OpenMPI 1.6.5. The native GPI2 allreduce is implemented as a binomial spanning tree. The figure shows the results for an allreduce with 255 doubles as the workload. This is the maximum defined by the GASPI specification and is thus used as the maximum message size for experiments with collective operations.

Figure B.1a in App. B shows the results for the same test with only one integer as the workload. Also there, the n-way dissemination algorithm shows faster runtimes than the native GPI2 algorithm, but the MPI implementations are approximately on the same level as the n-way allreduce. This comparison between the different message sizes emphasizes the relevance of the results with the n-way dissemination algorithm, because congestion of the network is to be expected for algorithms with butterfly-like communication schemes and large messages.

[Figure: time in µs over number of nodes; curves for n-way, GPI2-1.0.1, MVAPICH 2.2.0 and OpenMPI 1.6.5.]

Figure 4.5.: Comparison of different allreduces on Aenigma with 255 doubles and sum as reduce operation.

Figure B.1b in App. B additionally shows that using an adapted n-way dissemination algorithm also makes sense for barriers, because the n-way dissemination barrier is faster than the other three barriers (GPI2-1.0.1, MVAPICH 2.2.0 and OpenMPI 1.6.5).

The same tests were conducted on CASE, where a faster interconnect and newer processors are installed. Instead of the QDR interconnect connecting the compute nodes of Aenigma, CASE is equipped with an FDR interconnect. This has a higher bandwidth and a lower latency (see Tab. 2.1 on p. 16), which should be visible in the results. Even though all algorithms benefit from the newer interconnect, the n-way dissemination algorithm should profit more from the higher bandwidth, because it sends more messages per communication round through the network than, e.g., the native binomial spanning tree or dissemination algorithm of GPI2.

On this cluster, the choice of n was done in the same way as described above, i.e., in the first executed allreduce. Different from the tests on Aenigma, the allreduces were run only 10^3 times instead of 10^5 times, because of time constraints. Also different from the tests on Aenigma is the available MPI implementation: on this system, IntelMPI 4.1.3 was the fastest available MPI implementation, and the n-way dissemination algorithm was therefore compared against it.

Figure 4.6 shows the results of the allreduce tests with one integer as the workload and a summation as the reduction operation. The MPI_Allreduce shows runtime results similar to those of the native GPI2 implementation. The implementation of gaspi_allreduce with the n-way dissemination algorithm, on the other hand, is significantly faster. For 8 nodes and more, the n-way dissemination algorithm is consistently faster, reaching a peak speedup of over 46 % for 48 nodes. In most other cases, the n-way dissemination algorithm is between 25 % and 37 % faster than the MPI implementation.

The difference between the MPI barrier and an n-way barrier is even more significant, reaching over 58 % faster runtimes for 64 nodes, as seen in Fig. 4.7. Overall, the n-way dissemination

[Figure: time in µs over 2 to 96 nodes; curves for n-way, GPI2-1.0.1 and IntelMPI 4.1.3.]

Figure 4.6.: Comparison of average runtimes of the allreduce operation with 1 integer and sum on CASE.

algorithm is consistently faster by more than 42 % when using 8 nodes or more. Different from the allreduce, the GPI2 barrier is also much faster than the MPI barrier, with a performance increase of 11 % to over 41 %.

[Figure: time in µs over 2 to 96 nodes; curves for n-way, GPI2-1.0.1 and IntelMPI 4.1.3.]

Figure 4.7.: Comparison of average runtimes of the barrier operation on CASE.

The results confirmed the expectation that a network with a higher bandwidth has an immense impact on the runtimes of the n-way dissemination algorithm in comparison to the binomial tree algorithm and the original dissemination algorithm, which are used to implement the GPI2 allreduce and barrier, respectively.