
In this section, the API specified by the MPI standard in its current 3.1 version is briefly described. MPI offers API sets for: point-to-point communication, collective communication, one-sided communication, parallel IO, derived data types, virtual topologies, group and communicator management, and more.


Figure 5.1: Simplified overview of MPI communication and buffering for small and medium buffers (typically smaller than a megabyte) in a four-process application with a counterclockwise ring communication pattern.

5.1.1 Data Types

Instead of using types from the C or Fortran programming languages, MPI defines its own data types. This allows MPI libraries to adapt and align memory properly when transferring messages across machines or software stacks with incompatible type representations.

MPI provides a set of basic types, such as MPI_INT and MPI_DOUBLE. In addition to this, users can create their own derived data types, for example, to represent vectors or structures. Data types need to be specified in all MPI operations that perform message transfers or perform distributed arithmetic on buffers.
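As an illustration, the following sketch commits a derived data type describing one column of a small row-major matrix and uses it in a single transfer. The matrix size, ranks, and message tag are arbitrary choices, and at least two processes are assumed.

```c
#include <mpi.h>

/* Sketch: a derived type describing a strided column of a 4x4 matrix
 * of doubles, sent as one message. Ranks and tag are illustrative. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double matrix[4][4] = {0};
    MPI_Datatype column_t;

    /* 4 blocks of 1 double, separated by a stride of 4 doubles. */
    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    if (rank == 0) {
        MPI_Send(&matrix[0][1], 1, column_t, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&matrix[0][1], 1, column_t, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column_t);
    MPI_Finalize();
    return 0;
}
```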

5.1.2 Groups and Communicators

Groups are ordered sets of processes in MPI. Communicators enable communication among the processes of a group. After initialization, MPI libraries provide a communicator that includes all the processes that were started as part of an application: the MPI_COMM_WORLD communicator. This communicator is usually sufficient when managing the communication of a few processes. However, when applications reach a certain level of complexity, it helps to create groups and communicators to modularize the code of MPI applications.

A typical application will first duplicate a communicator, then split or divide it in ways that benefit the clarity of the algorithms in the distributed application. Communicators can be manipulated directly, or alternatively, groups can be created first and communicators then created from them. The latter approach requires more steps but has flexibility advantages. For example, there are union and intersection operations that can be applied to groups but not to communicators.
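The sketch below illustrates both approaches, under the assumption of an application that wants to separate even and odd ranks; the particular subgroup layout is purely illustrative.

```c
#include <mpi.h>

/* Sketch: two ways to derive a smaller communicator, assuming the
 * application wants to separate even and odd ranks. */
void build_subcommunicators(void) {
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* 1) Direct manipulation: duplicate, then split by color. */
    MPI_Comm dup_comm, split_comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
    MPI_Comm_split(dup_comm, world_rank % 2 /* color */,
                   world_rank /* key: keep ordering */, &split_comm);

    /* 2) Group based: extract the group, build a subgroup, then create
     *    a communicator from it. Group operations such as
     *    MPI_Group_union or MPI_Group_intersection are available here. */
    MPI_Group world_group, even_group;
    MPI_Comm even_comm;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    int n_even = (world_size + 1) / 2;
    int ranks[n_even];
    for (int i = 0; i < n_even; i++) ranks[i] = 2 * i;
    MPI_Group_incl(world_group, n_even, ranks, &even_group);
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    /* Clean up (even_comm is MPI_COMM_NULL on odd ranks). */
    if (even_comm != MPI_COMM_NULL) MPI_Comm_free(&even_comm);
    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    MPI_Comm_free(&split_comm);
    MPI_Comm_free(&dup_comm);
}
```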

5.1.3 Point-to-Point Communication

The point-to-point set of operations in the MPI standard allows for the transmission of bytes from one specific process to another. There are several variants available with different send modes: default, ready, synchronous, and buffered. In the default (standard) mode, completion of the send implies nothing about the state of the receiver. Ready sends may only be started once the matching receive has already been posted. Synchronous sends complete only when the matching receive has started. Finally, buffered sends rely on user-provided buffers for their operation.
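The following sketch lists the calls for the four blocking send modes; the buffer, tag values, and destination rank are placeholders, and it is assumed that the destination posts four matching receives.

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the four blocking send modes; buffer, tags and the
 * destination rank are placeholders. */
void send_mode_examples(double *buf, int count, int dest) {
    /* Standard (default) mode: MPI decides whether to buffer. */
    MPI_Send(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

    /* Synchronous mode: completes only after the matching receive
     * has started. */
    MPI_Ssend(buf, count, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);

    /* Ready mode: correct only if the matching receive is already posted. */
    MPI_Rsend(buf, count, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD);

    /* Buffered mode: requires a user-provided buffer attached beforehand. */
    int pack_size;
    MPI_Pack_size(count, MPI_DOUBLE, MPI_COMM_WORLD, &pack_size);
    int bufsize = pack_size + MPI_BSEND_OVERHEAD;
    void *attach = malloc(bufsize);
    MPI_Buffer_attach(attach, bufsize);
    MPI_Bsend(buf, count, MPI_DOUBLE, dest, 3, MPI_COMM_WORLD);
    MPI_Buffer_detach(&attach, &bufsize);
    free(attach);
}
```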

In addition to send modes, these operations also have blocking and non-blocking versions. Blocking versions do not return until the operations have been completed, while non-blocking versions return immediately. When using non-blocking versions, the application needs to check the status of the operations with the wait or test operations.

The non-blocking operations allow for the overlap of communication and computation.
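A minimal sketch of this pattern, assuming a single neighbor rank and a placeholder local computation, could look as follows.

```c
#include <mpi.h>

/* Sketch: non-blocking exchange with one neighbor, overlapping the
 * transfer with local work. compute_interior() is a placeholder. */
void overlap_example(double *sendbuf, double *recvbuf, int n, int neighbor) {
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Work that does not depend on recvbuf can proceed here. */
    /* compute_interior(); */

    /* Alternatively, MPI_Test can be used to poll without blocking. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```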

Figure 5.1 provides a simplified visual overview of the interactions of application and MPI communication buffers during point-to-point communication. In the figure, four processes are depicted performing a send and a receive each in a counterclockwise ring pattern. As can be seen, several copies are performed from application buffers to MPI communication buffers and the other way around. This is common in most implementations with internal eager or rendezvous communication protocols at the byte transfer layer; these protocols are used when transferring small to medium buffers (below the sizes at which DMA transfers become more efficient). Typical MPI implementations, including MPICH (discussed in Sec. 5.3), decompose collective operations into multiple point-to-point messages; therefore, this figure also applies to most types of communication that do not rely on network hardware acceleration (such as RDMA or hardware collective operations).

5.1.4 One-Sided Communication

Network hardware with Remote Direct Memory Access (RDMA) features can improve communication performance by reducing latencies and overheads related to buffering and synchronization. In addition, RDMA allows for better overlap of communication and computation. MPI added its one-sided communication API to allow implementations to efficiently support RDMA hardware. This API was introduced in version 2.0 of the standard, and was updated to match more recent RDMA-capable network hardware in version 3.0.

With these operations, MPI implementations can reduce the amount of memory needed for buffering and the number of memory copies performed in typical communication protocols (refer to Fig. 5.1). In cases where hardware RDMA is not available, most MPI libraries fall back to an internal point-to-point based implementation; this way, applications that use one-sided communication remain portable.

In this mode of communication, MPI processes create memory windows that can be accessed by remote processes. They can then read and write data to their own and remote buffers.


Figure 5.2: Put and get operations, both initiated by process 0, using MPI one-sided communication.

Synchronization operations are also provided to prevent race conditions. Figure 5.2 provides an illustration of a possible interaction between two processes. In this case, the process with MPI rank 0 transfers some data from its own address space (blue) towards the memory window of the remote process with rank 1. Rank 0 also transfers data from the remote window of rank 1 (yellow) into its own address space. The same operations can be performed by the process with rank 1 on the window created by rank 0.
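A minimal sketch of the interaction in Figure 5.2 is shown below, assuming exactly two processes, a window of eight doubles, and fence-based synchronization (one of several synchronization schemes offered by the API); the offsets and counts are illustrative.

```c
#include <mpi.h>

/* Sketch of the interaction in Fig. 5.2, assuming exactly two ranks:
 * each rank exposes a window, and rank 0 issues both a put and a get. */
void one_sided_example(void) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local[8] = {0};       /* memory exposed through the window */
    double send_data[4] = {0};   /* private buffers used by rank 0 */
    double recv_data[4] = {0};
    MPI_Win win;

    MPI_Win_create(local, sizeof(local), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);          /* open the access epoch */
    if (rank == 0) {
        /* Write 4 doubles into the first half of rank 1's window ... */
        MPI_Put(send_data, 4, MPI_DOUBLE, 1, 0, 4, MPI_DOUBLE, win);
        /* ... and read 4 doubles from the second half of it. */
        MPI_Get(recv_data, 4, MPI_DOUBLE, 1, 4, 4, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);          /* close the epoch; data is now visible */

    MPI_Win_free(&win);
}
```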

5.1.5 Collective Communication

Figure 5.3: Sequence diagram of a naive all-reduce operation implementation.

MPI also provides operations that work on groups of processes. These can be synchronization operations such as a barrier, data transfer operations such as broadcasts, and collective operations on data such as reductions. With MPI 3.0, non-blocking versions of these operations were introduced. Neighborhood collectives were also introduced with MPI 3.0; these can be more efficient on certain communication patterns, such as those generated by stencil-based distributed solvers.
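As a brief illustration of the non-blocking variants, the following sketch overlaps a broadcast with unrelated work before completing it; the buffer contents and root rank are placeholders.

```c
#include <mpi.h>

/* Sketch: a non-blocking broadcast (available since MPI 3.0) overlapped
 * with unrelated work and completed with MPI_Wait. */
void ibcast_example(int *config, int n) {
    MPI_Request req;
    MPI_Ibcast(config, n, MPI_INT, 0 /* root */, MPI_COMM_WORLD, &req);

    /* Work that does not use config can run here. */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```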

The use of collectives is highly recommended to MPI application developers. The internal collective algorithms implemented within MPI libraries perform well, based on long-term research related to their efficiency; in addition to this, MPI implementations can take advantage of hardware network collectives when available. It is very unlikely that users will achieve better performance than the well-tuned internal implementations provided by Open MPI or MPICH.

Take for example the sequence diagram of a naive algorithm presented in Fig. 5.3: an application developer may be tempted to perform this sequence of operations with MPI point-to-point operations (a gather, followed by a reduction and finally a broadcast), instead of relying on the well-researched and optimized internal implementations abstracted by the MPI_ALLREDUCE operation.
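For reference, the equivalent collective is a single call; the buffer sizes in this sketch are illustrative.

```c
#include <mpi.h>

/* Sketch: the single call that replaces the gather/reduce/broadcast
 * sequence of Fig. 5.3. */
void sum_across_ranks(const double *local, double *global, int n) {
    MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```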

5.1.6 Parallel IO

MPI offers an abstraction over parallel file systems through its MPI-IO API, introduced in version 2.0 of the standard. The Input-Output (IO) API makes applications portable across multiple distributed file system implementations. In contrast to POSIX or proprietary parallel IO APIs, MPI IO benefits from its integration with MPI. For example, its operations can work transparently with MPI data types, including complex derived data types created by MPI application developers.

Similar to other parts of the MPI standard, the IO API is designed to allow for efficient implementations. Several optimizations are possible. For example, since the networks of HPC systems typically have lower latencies and higher bandwidth than their file systems, implementations can rely more on the network and minimize the number of accesses to the distributed parallel file system. Implementations can also streamline the order and reduce the number of IO operations.
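The following sketch shows a simple collective write in which each rank stores a contiguous block at a disjoint file offset; the file name and block size are placeholders.

```c
#include <mpi.h>

/* Sketch: each rank writes its block of doubles to a disjoint region
 * of a shared file using a collective write. */
void write_checkpoint(const double *block, int n) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```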

5.1.7 Virtual Topologies

The development of distributed memory applications poses additional challenges to computer scientists. Any type of abstraction that can simplify the description of distributed algorithms is a welcome addition to any programming model. MPI virtual topologies are one such feature: developers can define them to simplify the implementation of distributed algorithms, while at the same time exposing communication patterns to the MPI implementation. These patterns can be used by MPI libraries to improve the ordering of ranks based on the proximity of actual processes in the real network topology of a supercomputer.

With virtual topologies, applications may define a Cartesian grid or a graph, where the nodes are the processes and the edges indicate which processes communicate with each other. The lack of an edge does not impede communication between processes, so arbitrary point-to-point communication is still possible. For example, a Cartesian grid topology may be defined to simplify the communication of an algorithm that operates on a checkerboard type of data distribution. Figure 5.4 depicts the organization of 9 MPI processes in a 3 by 3 Cartesian grid virtual topology with wraparound. The top number indicates their ordered ranks in the MPI_COMM_WORLD communicator, while the bottom pair indicates their location in the Cartesian grid communicator. The more general case is the graph topology, where any arbitrary relationship can be described.
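A sketch of how such a grid could be created is shown below; enabling reordering and the row-wise neighbor query are illustrative choices, and at least nine processes are assumed.

```c
#include <mpi.h>

/* Sketch: the 3x3 periodic grid of Fig. 5.4. Reordering is allowed so
 * the library may renumber ranks to match the physical network. */
void cart_example(void) {
    int dims[2]    = {3, 3};
    int periods[2] = {1, 1};   /* wraparound in both dimensions */
    MPI_Comm cart_comm;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* reorder */, &cart_comm);

    if (cart_comm != MPI_COMM_NULL) {
        int cart_rank, coords[2], left, right;
        MPI_Comm_rank(cart_comm, &cart_rank);
        MPI_Cart_coords(cart_comm, cart_rank, 2, coords);

        /* Neighbors along dimension 1, e.g. for a ring exchange per row. */
        MPI_Cart_shift(cart_comm, 1, 1, &left, &right);

        MPI_Comm_free(&cart_comm);
    }
}
```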
