
3. HPC Communication Libraries and Languages 33

3.3. MPI - Message Passing Interface

3.3.1. Basic Concepts

The general concepts described in this section are necessary for all programs using MPI communication routines. Before any MPI communication routine can be used, MPI has to be initialized through the MPI_Init routine. Only after this call returns successfully may other library routines be called. At the end of an MPI program, the routine MPI_Finalize has to be called to clean up and free allocated resources. After this routine has returned successfully, no further calls to MPI library routines may be made.

Groups and Communicators

MPI communication relies on the concept of communicators, which are responsible for the actual communication, the distinction between different kinds of messages and the separation of communication universes. To this end, the communicator provides the scope for communication routines. This scope consists of contexts, groups, virtual topologies and attribute caching. The contexts partition the communication space, such that collective communication does not interfere with point-to-point communication and different communicators do not interfere with each other. For a detailed description of the elements of communicators, please refer to the MPI standard [75], Chap. 6; however, to emphasize the difference between groups in MPI and groups in GASPI, MPI groups are described here.

A group defines the ranks of a set of processes; thus, two groups consisting of the same set of processes but ranked in a different order are considered two different groups. Several routines make a group's ranks and properties accessible and allow groups to be created and compared. Based on these groups and ranks, the communicator enables communication between the processes in the group. An important distinction has to be made between two types of communicators. On the one hand, there are intracommunicators, which enable communication within a single group of processes. On the other hand, there are intercommunicators, which enable communication between two non-overlapping groups of processes.

While communicators are necessary for all communication routines, the completion calls described in the following subsection are only needed for non-blocking communication calls, i.e., calls that return immediately after they have initiated the communication, without waiting for the communication to actually finish.

Completion Calls

Asynchronous or non-blocking MPI communication routines need completion calls to complete the communication. With blocking communication routines, all local buffers can be reused after the successful return of the communication routine. With non-blocking communication, this is delayed to the completion calls, i.e., the buffers can only be reused safely after the successful return of the corresponding completion call. Which communication call is to be completed by a completion call is defined through request handles, which are an argument of both non-blocking communication routines and completion routines. The different completion calls can roughly be split into waiting routines and testing routines.

The waiting routines block until the desired non-blocking communications, identified through request handles, are complete. No matter whether the waiting function returns successfully or not, the routine updates the status (an argument of the routines) of the communication. One can either wait for one particular communication, for any one out of a given array of handles, for some of the communications associated with a given array of handles, or for all communications whose handles are in the request handle array to be completed.

For the testing routines, the situation is quite similar. They are available in the same variants, but they return immediately with one or more flags stating whether the communication identified through the request handle(s) is complete. If it is (or they are) complete, the testing routines act as if they were waiting routines, i.e., they indicate that the local buffers may be used again.

Even though many different completion calls are defined in the MPI standard, only the MPI_Wait(req, status) and MPI_Test(request, flag, status) routines are supported. Each of these checks on exactly one collective communication routine.

While groups and communicators are necessary for all communication routines, and completion calls are necessary for all non-blocking routines, the following concepts of windows, epochs and synchronization calls are specific to one-sided communication routines.

Windows

Windows are used to make a process's memory region visible and accessible to other processes participating in one-sided communication with this process, a concept similar to the memory regions in IB Verbs. A window may be created in several different ways, but all of them require a communicator. First, a window may be created via MPI_Win_create, a routine through which already allocated memory is exposed for RMA. Alternatively, new memory can be allocated for the created window by using either MPI_Win_allocate, which directly exposes it to RMA, or MPI_Win_allocate_shared, which allocates the memory as shared memory. The latter allows remote processes to directly store and load data into or from the window. A last possibility is to create a window to which memory is dynamically attached later in the program. This is especially useful if it is not clear from the outset how much RMA-exposed memory is needed on a given process.

All processes keep a public and a private copy of their window, and these have to be kept synchronized. This is done in different ways, depending on the chosen memory model, the target communication and other factors. One of the most important tools for window synchronization is MPI_Win_sync, which enables the application to synchronize at any necessary point. Further synchronization calls will be described in the following subsection together with epochs, another necessary concept of one-sided communication in MPI.

Epochs and Synchronization Calls

A further structure needed by all one-sided communication calls is the epoch, because RMA routines may only be called within epochs. Epochs are delimited by different synchronization calls, and the user must distinguish between active target communication and passive target communication. In active target communication, the target process is engaged in the synchronization, while in passive target communication, the target process is not involved in the data transfer at all. For all RMA communication, an access epoch has to be created on the origin process. In active target communication, an exposure epoch additionally has to be induced on the target process.

The origin process can only start passive target communication within a pair of locking calls. The passive target epoch is started on one process, where a lock type describes whether the calling process has exclusive access to the window, whether the access is shared, or whether the epoch shall be started on all processes of the window group. The epoch is then ended by unlocking the window. After the return of an unlocking call, the communication is complete on both sides. Since the target process is not involved in synchronization and one might need some of the transferred data before the end of the epoch, it is possible to flush a window. Calling one of the flush functions makes the calling process wait for the completion of one or all RMA operations on a given window.

For active target communication, there are several different possibilities of delimiting the access epoch and the exposure epoch. The most general approach is to use MPI_Win_fence, a collective synchronization call which starts and ends access epochs as well as exposure epochs in all processes in the group of the window. A more resource-saving possibility is to pair only those processes that need to communicate. This is done on the origin process through MPI_Win_start and MPI_Win_complete, and on the target process by calling MPI_Win_post and MPI_Win_wait. Alternatively, one may also call MPI_Win_test to check whether all communication in this epoch has completed. If so, it behaves as if a call to MPI_Win_wait had been made.

The ending of an epoch always implies completion of the communication on the origin process as well as on the target process.

Within these epochs, the one-sided communication calls described in the next section can be executed.