

4.4. Discussion

This chapter has introduced an adaption to the n-way dissemination algorithm, which enables the use of this algorithm for the allreduce operation. The algorithm is well suited for a split-phase allreduce, as defined in the GASPI specification, due to its low number of communication rounds, especially in comparison to tree-based algorithms. In addition, it involves all ranks in the computation of partial results in each communication round. This reduces the imbalance introduced by tree-based algorithms, where all ranks enter the routine but most ranks have no computation or communication to perform and thus idle.

In Sec. 4.3, the experimental results show that the performance of the algorithm is highly dependent on the underlying network and the general system configuration. On Aenigma, the system with the lower-bandwidth, higher-latency interconnect, the benefits of transferring multiple messages per communication round are not reflected in the runtime results. On CASE, on the other hand, these effects are clearly visible. Considering the development of networks and the further increase of available bandwidth in the near future, the transfer of multiple or larger messages per communication round will play an increasingly important role. Algorithms with symmetric communication schemes, like the butterfly algorithm, which have congested the network with fairly small messages in the past, will have to be reconsidered for future implementations of collective communication routines.

Another interesting observation is the change in runtime differences from the barrier to the allreduce operation. Again, these observations only consider runtimes for more than eight nodes, because there is no meaningful observation to be made for smaller numbers of nodes. While the n-way dissemination algorithm is at least 42 % faster than the MPI barrier on CASE, the difference between the MPI allreduce and the n-way allreduce is only 25 %, even though the algorithm has not changed. This change might be due to a change of algorithms in the IntelMPI implementation, i.e., a different algorithm might be used in the barrier than in the allreduce operation. It is common practice to use several algorithms, depending on message and group sizes [98], so it is reasonable to assume the same for a highly optimized MPI implementation such as IntelMPI 4.1.3. Another possible explanation for the reduced runtime differences might be a different approach to computing the partial results. Both possibilities should be taken into consideration for future research on this topic.

In the experiments in this chapter, the choice of n was made by running the allreduce multiple times with all possible n and then choosing the one with the lowest runtime. While this procedure works well for a limited number of n to test and for applications with intensive use of the allreduce operation, it will not be applicable on systems with very high bandwidth, where potentially ten or more different n can be tested. The first allreduce will have a considerably higher runtime than the following allreduces, which cannot be amortized in applications where only a few allreduces are needed. In addition, this procedure is prone to errors through jitter in the network or single higher runtimes due to congestion in the network or contention on the resources. Instead, other options have to be investigated for the choice of n. This could either be runtime checks during the installation and configuration of GASPI or environment variables given by the user at the time of installation. Depending on variables like topology, interconnect type, bandwidth or latency, static but network-specific lookup tables could be generated for the choice of n.
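A static, network-specific lookup table could, for instance, map node counts to a precomputed n. The following sketch is purely illustrative: the function name choose_n, the environment variable GASPI_COLL_NWAY_N and all table values are assumptions, not part of GASPI or of the implementation discussed here.

#include <stdlib.h>

/* Hypothetical lookup: map the number of nodes to a statically chosen n
 * (messages per round). All values are placeholders, not measured results. */
static int choose_n(int num_nodes)
{
    const char *env = getenv("GASPI_COLL_NWAY_N"); /* assumed user override */
    if (env != NULL)
        return atoi(env);

    if (num_nodes <= 8)   return 1;  /* plain dissemination */
    if (num_nodes <= 64)  return 2;
    if (num_nodes <= 256) return 3;
    return 4;                        /* assumed high-bandwidth default */
}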

Another point for future research might be the restriction imposed by all butterfly-like algorithms when used for an allreduce operation: they may only be used for associative and commutative reduction operations if the result needs to be identical on all participating nodes.

This limits the usage of algorithms like the adapted n-way algorithm or Bruck's algorithm to maximum or minimum operations as well as sums or products of integers. This issue can be resolved by using multiple underlying algorithms for an allreduce, chosen based not only on message size and group size, but also on the datatype and operation. Another option would be to further adapt these algorithms to internally order the computation of the partial results. This adaption would also imply increasing message sizes in the communication rounds and thus a further parameter to consider when choosing the number of messages to transfer per round.

Furthermore, different network topologies and interconnects can have a significant impact on the runtime of algorithms used for collective communication. Thus, future work should investigate this impact within GASPI applications. To do so, further GASPI implementations will have to be provided and made accessible on different systems.

The following chapter will include the results presented in this chapter in the implementation of an allreduce routine for a GASPI library for collective communication routines. In addition to the comparisons made in this chapter, the next chapter will compare the n-way dissemination algorithm to other allreduce implementations. Apart from the allreduce, reduce and broadcast routines have also been implemented for the collective library and will hence be presented in the following chapter.

5. Collective Communication Routines for GASPI

As described in Sec. 2.3.2, many different collective communication operations exist. While many of these are defined and implemented in the MPI standard, GASPI only defines the barrier and the allreduce operation. As other collective routines are also demanded and in use by application programmers, a collective communication library on top of GASPI has been designed: GASPI_COLL. This library extends the GASPI standard with several additional collective communication routines but also offers an allreduce with potentially different algorithms than those used in the implementation of the specification. As this library is designed on top of GASPI, it is portable to every machine or system with a GASPI installation.

All algorithms used have been introduced in Sec. 2.3.3. Here, only additional adaptions to the existing algorithms and the reasons for the choice of the given algorithms will be explained. The semantics of GASPI_COLL closely follow those of GASPI, e.g., a call to gaspi_coll_allreduce will take the same arguments as a call to gaspi_allreduce. In addition, the limits imposed by the GASPI specification are adopted, i.e., the maximum number of elements an array in an allreduce may have and the maximum number of members in a group. Special emphasis needs to be put on the restriction imposed by the specification that no two collective routines of the same type may run at the same time. This transfers much management overhead from the library to the application programmer.

The following sections will describe the group management routines in Sec. 5.1 and the memory management in Sec. 5.2. The implemented collective routines are described in Sec. 5.3. For each collective routine, runtime comparisons with given MPI implementations have been made.

Because the GASPI implementation GPI2 does not include all collective routines implemented in GASPI_COLL, runtime comparisons were only made where applicable. The results are presented in Sec. 5.4, before Sec. 5.5 discusses the results and summarizes this chapter.

5.1. Group Management

Because GASPI_COLL is implemented on top of GASPI, only some parts of the provided group structure can be re-used, while other parts need to be newly implemented. Thus, each group used in GASPI will also have to be created in GASPI_COLL if any of the routines provided by GASPI_COLL are to be used in that group. This is done through gaspi_coll_group_create.

typedef struct {
    gaspi_number_t      groupSize;
    gaspi_rank_t       *groupMembers;
    gaspi_rank_t        myID;

    gaspi_pointer_t     segmentPtr;
    gaspi_segment_id_t  segmentID;

    int                 allreduce_ctr;
    int                 allreduce_status;
    int                 allreduce_start_data_offset[2];
    nway_struct         nway_allreduce;
    brucks_struct       brucks_allreduce;
    pairwise_struct     pw_allreduce;
    tree_struct         binomial_allreduce;

    int                 reduce_ctr;
    int                 reduce_status;
    int                 reduce_start_data_offset[2];
    tree_struct         binomial_reduce;

    int                 broadcast_ctr;
    int                 broadcast_status;
    int                 broadcast_start_data_offset[2];
    tree_struct         binomial_broadcast;

    int                 alltoall_ctr;
    int                 alltoall_status;

} gaspi_coll_group;

Listing 5.1: Definition of the group structure in GASPI_COLL.

When the group is not needed anymore, gaspi_coll_group_delete will free all previously allocated resources. This means that the application programmer is in full control of the overhead introduced by GASPI_COLL.
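A typical life cycle might look as follows. This is a hedged sketch: the argument lists of gaspi_coll_group_create and gaspi_coll_group_delete are assumed (the text only names the routines), as is the header name.

#include <GASPI.h>
/* #include "gaspi_coll.h"  -- header name assumed */

void example_group_lifecycle(void)
{
    gaspi_group_t group = GASPI_GROUP_ALL;

    /* Mirror the GASPI group inside GASPI_COLL before its first use
     * (signature assumed). */
    gaspi_coll_group_create(group);

    /* ... calls to gaspi_coll_allreduce / gaspi_coll_reduce on 'group' ... */

    /* Free the internal segment and algorithm structs again. */
    gaspi_coll_group_delete(group);
}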

The group structure of GASPI_COLL is defined as shown in List. 5.1. The internal usage of such a newly defined structure mainly serves the purpose of not repeatedly calling the same GASPI routines from within the library. In addition, application programmers will most likely use collective routines repeatedly in the same group; thus, information necessary for the implementation of the different algorithms is stored in further structs, listed in App. C. This circumvents the re-calculation of, e.g., communication peers in each call of the collective routine. Each rank of the group has its own copy of the group struct and the included algorithm structs.

The groupSize, *groupMembers and myID fields are mainly needed for the calculation of the communication peers in the initialization phase of the collective routines and for the communication within the routines. Each group will create its own segments, to and from which the communication within the routine is done through the GASPI communication routines. To access the segment and to hand the correct segment ID to the GASPI communication routines, segmentID and *segmentPtr are needed.

The different counters for the collective routines are especially needed for determining the correct communication buffers. Even though the library implementation sticks to the restriction that two collective routines of the same type may not be used concurrently, two succeeding collective routines may very well have some overlap in communication, as Fig. 5.1 shows.

Figure 5.1.: Two succeeding allreduces, where the second allreduce on rank 3 overlaps with the first allreduce on ranks 0-2.

To ensure that two succeeding collective routines do not overwrite each other's communicated data, collective routines with an odd counter write into a different buffer than those with an even counter. If the library did not take care of this, the user would be forced to use a barrier between every two uses of the same collective routine, which would not conform with the goals of GASPI and a performant, scalable application. Accordingly, the values in *_start_data_offset give the offsets where the odd and even data buffers start. A more detailed description of the memory partitioning is found in Sec. 5.2.
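Conceptually, the buffer selection reduces to the parity of the per-routine counter. The sketch below is illustrative only; it uses the field names from Listing 5.1, while the actual offset handling inside GASPI_COLL is not shown in the text.

#include <GASPI.h>

/* Pick the base offset for the current allreduce from the parity of its
 * counter, so two succeeding allreduces never write into the same buffer.
 * Field names as in Listing 5.1; the helper itself is hypothetical. */
static gaspi_offset_t allreduce_base_offset(const gaspi_coll_group *grp)
{
    int parity = grp->allreduce_ctr % 2;  /* 0 = even, 1 = odd */
    return (gaspi_offset_t)grp->allreduce_start_data_offset[parity];
}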

5.2. Memory Management

The implementation of a library for collective communication routines as a GASPI application introduces memory-specific management overhead. The memory resources used will be shared with the actual application; thus, it is necessary to keep the memory requirements of a collective communication library as low as possible while still employing the GASPI-defined semantics. A GASPI-based communication library will use the communication routines defined by the GASPI specification to ensure compatibility with all GASPI implementations, especially one-sided communication routines with weak synchronization primitives. Using one-sided communication routines within the GASPI_COLL routines makes it necessary to manage not only memory accesses but also the notification buffers.

The semantics of collective routines in GASPI can only be deduced from the definition of the allreduce operation. Different from other communication routines, where an offset on an already registered memory segment is handed to the routine, the allreduce takes a pointer to the actual data to be reduced as an input argument, as well as a pointer to the location where the result should be stored. This directly implies a copy of the data to an internal memory segment which is registered for one-sided communication. In a library which uses GASPI routines, this means a GASPI memory segment of sufficient size needs to be allocated and registered internally. This segment is also necessary for the weak synchronization of the communication routines, which also involves the internal handling of message notifications.

Another important property of collective communication routines imposed by the GASPI specification is the limitation that no two collective routines of the same type may be run concurrently within the same group. As described through Fig. 5.1, this does not guarantee that write accesses of succeeding collective routines do not overlap. The allocated memory segment for collective communication routines is thus structured such that for all routines, two buffers are available: one for the routines started with an odd counter and one for the routines started with an even counter, as depicted in Fig. 5.2. This prevents the overwriting of data from two succeeding collective operations.

Figure 5.2.: Partitioning of the collective segment for a group in GASPI_COLL.

Each collective routine in GASPI_COLL will have its own memory partition and according offsets, which need to be used in the one-sided communication operations. The offsets are always calculated in dependence of the maximum possible buffer size used in collective communication: 255 doubles. This limit has been adopted from the GASPI specification of the allreduce and will also hold for the reduce and the broadcast operation. Through this, a fixed limit on the maximum size of an internal memory segment per group is given. In addition to the message size, the number of communication rounds has to be considered for the allocation of a sufficiently large internal segment.

As an example, Fig. 5.3 shows the partitioning of the allreduce segment for the n-way dissemination algorithm (Fig. 5.3a) and for the BST (Fig. 5.3b). Let n be the number of messages sent per round, so that the number of rounds is $k = \lceil \log_{n+1} P \rceil$. The portion of the segment dedicated to allreduces is then used by the n-way dissemination algorithm as follows: the initial data of the rank is copied to the very beginning of the segment (green). In every round, the received partial results are written to the end of the segment, starting at offset $(k+1) \cdot \texttt{ELEMENT\_OFFSET}$. The total memory requirement for Bruck's algorithm with n messages per round is thus $(\lceil \log_{n+1} P \rceil \cdot (n+2) + 2) \cdot \texttt{ELEMENT\_OFFSET}$.
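The segment size implied by this formula can be computed directly. The following is a minimal sketch under the assumption that ELEMENT_OFFSET equals the byte size of one maximum-sized buffer (255 doubles); the actual definition inside GASPI_COLL, which may include padding or alignment, is not shown in the text.

#include <math.h>
#include <stddef.h>

/* Assumption: one buffer slot holds 255 doubles, the limit adopted from the
 * GASPI allreduce specification. The real ELEMENT_OFFSET may differ, e.g.,
 * through additional padding or alignment. */
#define ELEMENT_OFFSET (255 * sizeof(double))

/* Allreduce share of the segment for P ranks and n messages per round,
 * following the formula (ceil(log_{n+1} P) * (n + 2) + 2) * ELEMENT_OFFSET. */
static size_t allreduce_segment_bytes(int P, int n)
{
    size_t k = (size_t)ceil(log((double)P) / log((double)(n + 1)));
    return (k * (size_t)(n + 2) + 2) * ELEMENT_OFFSET;
}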

Figure 5.3.: Partitioning of the allreduce segment for different algorithms: (a) n-way dissemination algorithm, (b) BST.

The binomial spanning tree has much smaller memory requirements than the n-way dissemination algorithm or Bruck's algorithm. The array needs to hold the initial data, the partial results received by the children, the newly calculated partial result and the final result received by the parent. The classical binary tree has at most 2 children and a binomial spanning tree at most $\lceil \log_2 P \rceil$ children. The total memory requirement is $(\lceil \log_2 P \rceil + 3) \cdot \texttt{ELEMENT\_OFFSET}$ bytes.
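Under the same assumption about ELEMENT_OFFSET as in the previous sketch (and reusing its definition and <math.h>), the BST requirement translates to the following illustrative helper.

/* BST allreduce share for P ranks, following
 * (ceil(log2 P) + 3) * ELEMENT_OFFSET; ELEMENT_OFFSET as assumed above. */
static size_t bst_allreduce_segment_bytes(int P)
{
    size_t max_children = (size_t)ceil(log2((double)P));
    return (max_children + 3) * ELEMENT_OFFSET;
}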

The notification buffer of every GASPI segment is limited by the implementation; in the case of the GPI2 implementation, this limit is set to 65535 notifications. This number is currently high enough for all implemented collective communication routines. To be usable with other GASPI implementations in the future, a query within the group creation will have to check the number of available notifications per segment. Depending on the retrieved number, different steps will have to be taken, including the exclusion of certain algorithms that need too many notifications, or the allocation of multiple internal segments instead of only one per group.
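GASPI offers gaspi_notification_num to query this limit, so such a check could roughly look as follows; the function name check_notification_budget and the fallback strategy are assumptions for illustration, not the actual behaviour of GASPI_COLL.

#include <GASPI.h>

/* Sketch of a notification check during group creation: query how many
 * notifications one segment offers and signal if an algorithm would need
 * more. The reaction (disabling algorithms, allocating further segments)
 * is only hinted at here. */
static gaspi_return_t check_notification_budget(gaspi_number_t needed)
{
    gaspi_number_t available = 0;
    gaspi_return_t ret = gaspi_notification_num(&available);
    if (ret != GASPI_SUCCESS)
        return ret;

    if (available < needed)
        return GASPI_ERROR; /* caller excludes the algorithm or adds a segment */

    return GASPI_SUCCESS;
}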

5.3. Collective Routines

This section will briefly describe the routines implemented in GASPI_COLL and the underlying algorithms, before the next section shows experimental results with these algorithms.

The broadcast routine also sticks to the semantics and limitations given by the GASPI specification, as far as applicable. The initial data of the participating ranks are handed to the routine via pointers and are internally copied for further use within the collective routine. The same is true for the result buffer: the address is handed to the routine and the final result will be written into this location by the reduce routine. Another limit imposed by the GASPI specification is the size of the message buffers.

Allreduce

The allreduce has been implemented with different underlying communication algorithms, all of which have been described in Sec. 2.3.3 and Chap. 4. Depending on message size, data type, reduction operation and group size, the algorithm to be used can be chosen. The BST algorithm (p. 25) can be used for any kind of reduction operation and data type. The experiments in the next section will show for which message sizes and group sizes the algorithm is most performant. The PE algorithm (p. 27), the adapted n-way dissemination algorithm (Chap. 4) and Bruck's algorithm (p. 29) cannot be used for non-associative reduction operations, but may show better results when used, e.g., for maximum or minimum operations or the barrier.
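Such a choice could be expressed as a simple dispatch over operation, datatype and group size. The enum, function name and cut-off values below are illustrative assumptions and do not reflect the selection logic actually implemented in GASPI_COLL.

#include <GASPI.h>

/* Illustrative algorithm dispatch; thresholds are placeholders. */
typedef enum { ALG_BST, ALG_PE, ALG_NWAY, ALG_BRUCK } allreduce_alg_t;

static allreduce_alg_t select_allreduce_alg(gaspi_operation_t op,
                                            gaspi_datatype_t  type,
                                            gaspi_number_t    group_size)
{
    /* Floating-point sums stay on the BST so that the fixed reduction order
     * yields identical results on all ranks. */
    int is_float = (type == GASPI_TYPE_FLOAT || type == GASPI_TYPE_DOUBLE);
    if (is_float && op == GASPI_OP_SUM)
        return ALG_BST;

    /* Minimum/maximum and integer operations may use butterfly-like schemes. */
    return (group_size <= 8) ? ALG_PE : ALG_NWAY;
}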

As already mentioned above, the GASPI_COLL allreduce will take the same arguments as the original GASPI allreduce but will have the GASPI_COLL prefix gaspi_coll_:

gaspi_coll_allreduce(gaspi_pointer_t   buffer_send,
                     gaspi_pointer_t   buffer_receive,
                     gaspi_number_t    num,
                     gaspi_operation_t operation,
                     gaspi_datatype_t  datatype,
                     gaspi_group_t     groupID,
                     gaspi_timeout_t   timeout);

Listing 5.2: GASPI_COLL allreduce routine.

This leads to an easy adaption of application code and less hassle for the programmer when testing the library allreduce instead of the native GASPI allreduce.
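For example, an existing call could be switched as follows; a minimal sketch assuming the standard GASPI allreduce arguments and the blocking timeout GASPI_BLOCK.

#include <GASPI.h>
/* gaspi_coll_allreduce is declared in the (assumed) GASPI_COLL header. */

void example_allreduce_swap(void)
{
    double send[4] = {1.0, 2.0, 3.0, 4.0};
    double recv[4];

    /* Native GASPI allreduce ... */
    gaspi_allreduce(send, recv, 4, GASPI_OP_SUM, GASPI_TYPE_DOUBLE,
                    GASPI_GROUP_ALL, GASPI_BLOCK);

    /* ... becomes the GASPI_COLL variant with an unchanged argument list. */
    gaspi_coll_allreduce(send, recv, 4, GASPI_OP_SUM, GASPI_TYPE_DOUBLE,
                         GASPI_GROUP_ALL, GASPI_BLOCK);
}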

Reduce

Unlike in the allreduce, where all participating ranks have the final result upon successful return of the operation, in the reduce operation this only holds true for the root rank. All ranks contribute their own data to be reduced into a final result, which the root rank will then hold. Because the root rank also contributes data, a source buffer and a destination buffer are necessary for the reduce routine, neither of which has to lie within a registered segment.

gaspi_coll_reduce(gaspi_rank_t      root,
                  gaspi_pointer_t   buffer_send,
                  gaspi_pointer_t   buffer_receive,
                  gaspi_number_t    num,
                  gaspi_operation_t operation,
                  gaspi_datatype_t  datatype,
                  gaspi_group_t     groupID,
                  gaspi_timeout_t   timeout);

Listing 5.3: GASPI_COLL reduce routine.

To complete the reduction, the user needs to specify which reduction operation is to be used. The predefined reduction operations are the same as those specified in the GASPI specification: sum, minimum and maximum. Additionally, the user needs to specify the datatype and number of elements to be reduced. From these two arguments, the message size is internally calculated for the transfer of data. If the number of elements to be reduced is larger than 1, the reduction operation will be applied element-wise.
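Element-wise application simply loops over the num elements; a minimal sketch for a sum of doubles (GASPI_COLL's internal reduction kernels are not shown in the text):

#include <GASPI.h>

/* Element-wise sum: partial[i] is folded into result[i] for all num elements.
 * GASPI_COLL additionally needs such kernels for minimum, maximum and the
 * other predefined datatypes. */
static void reduce_sum_double(double *result, const double *partial,
                              gaspi_number_t num)
{
    for (gaspi_number_t i = 0; i < num; ++i)
        result[i] += partial[i];
}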

A return of the operation with GASPI_SUCCESS on any non-root rank implies that the work to be done by this rank has been completed and the local buffers may be reused. If the rank is a leaf node in the underlying binomial spanning tree, the work consists of posting a write request to the internal queue. If the rank is an inner node, the work consists of waiting for the data to be received from the child nodes, computing a partial result and transferring this partial result to the parent node. For the root rank, the successful return implies not only that
