
Due to the complexity of the system, several special modes of operation were added to the resource manager daemons and the MPI library. These allow for quick, isolated, and precise testing of the different pattern detection, reduction, and scheduling algorithms. Testing these separate aspects of the infrastructure would have been much more time-consuming and less precise during regular application runs.

12 Elastic MPI Performance

In this chapter, the performance and scaling of the new MPI operations are evaluated. For the measurements presented here, a simple test application that performs redundant MPI communication was used. The application runs indefinitely and adapts to new resources based on a precomputed schedule. Performance data was collected every time the test application adapted. Sweeps from 16 to 512 or 1024 processes were performed. Measurements were accumulated from 10 separate runs (with different allocations) for each type of SuperMUC node.

12.1 MPI INIT ADAPT

Figure 12.1: MPI INIT ADAPT latency.

Figure 12.1 presents the mean time and standard deviation for the MPI INIT ADAPT operation. The times observed are indistinguishable from those with standard MPICH and SLURM with the provided PMI2 implementation. Poor scaling with increased numbers of processes is observed; this may become a target for optimization in the future.

These times are observed from both the original processes in an application launch and those created by the resource manager on an expansion. The latency of this operation is hidden from preexisting processes thanks to the design of the MPI COMM ADAPT BEGIN operation, as described in Sec. 6.2.3.
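To illustrate the intended usage, the following minimal C sketch shows how both kinds of processes might call the operation. The prototype and the status constants shown here are assumptions made for illustration only and may differ from the actual interface of the elastic MPI library.

#include <mpi.h>
#include <stdio.h>

/* Assumed prototype of the extension; the actual signature is defined by the
   elastic MPI library and may differ. */
int MPI_Init_adapt(int *argc, char ***argv, int *local_status);

/* Hypothetical status values: whether this process belongs to the original
   launch or was created by the resource manager during an expansion. */
enum { ADAPT_STATUS_PREEXISTING = 0, ADAPT_STATUS_JOINING = 1 };

int main(int argc, char **argv)
{
    int local_status;

    /* Original and newly created processes call the same operation; the
       returned status tells each process which role it plays. */
    MPI_Init_adapt(&argc, &argv, &local_status);

    if (local_status == ADAPT_STATUS_JOINING) {
        /* Newly created process: it enters the pending adaptation window
           before taking part in the computation. */
        printf("joining an existing application\n");
    } else {
        /* Preexisting process: it keeps computing; the initialization
           latency of the joining processes is hidden from it. */
        printf("part of the original launch\n");
    }

    MPI_Finalize();
    return 0;
}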

12.2 MPI PROBE ADAPT

Figure 12.2: MPI PROBE ADAPT latency.

As mentioned in Sec. 6.2.2, this operation has been designed such that the general case is very fast. As can be seen in Fig. 12.2, it is the fastest operation in the MPI extension.

When the adaptation flag is false, the latency of this operation is about 1 millisecond at 512 processes. The performance is much slower when the adaptation flag is set to true. As explained before, the expectation is that resource adaptations will be infrequent; therefore, the low latency of this operation when no adaptations need to take place is more important. The latency in the true case is dominated by the TBON protocol between the SRUN program and the daemons.
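The sketch below makes the fast path concrete: an application polls for pending adaptations at iteration boundaries and only pays the TBON synchronization cost when an adaptation has actually been scheduled. The prototype and the helper routines are assumptions for illustration, not a restatement of the library's header.

#include <mpi.h>

/* Assumed prototype: sets *pending to a non-zero value when the resource
   manager has scheduled an adaptation for this application. */
int MPI_Probe_adapt(int *pending, int *local_status, MPI_Info *info);

void do_one_iteration(void);   /* hypothetical application kernel */
void perform_adaptation(void); /* hypothetical; see the sketch in Sec. 12.4 */

void compute_loop(void)
{
    int pending, local_status;
    MPI_Info info = MPI_INFO_NULL;

    for (;;) {
        do_one_iteration();

        /* Cheap in the common case: when no adaptation is pending, the call
           returns in about a millisecond even at 512 processes. */
        MPI_Probe_adapt(&pending, &local_status, &info);

        if (pending) {
            /* Rare, slower path: requires synchronization with SRUN and the
               SLURMD daemons over the TBON before the adaptation window is
               opened. */
            perform_adaptation();
        }
    }
}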

12.3 MPI COMM ADAPT BEGIN

Figure 12.3: MPI COMM ADAPT BEGIN latency from a number of staying processes to a new total (16|32 up to 512|1024 processes), for Sandy Bridge and Haswell nodes.

A full sweep of all possible combinations of preexisting processes and expansion processes is not presented for this operation, since its latency is dominated by the size of the biggest process group. Because of this, balanced cases are presented, where preexisting process groups are of the same size as expansion process groups, for resulting process groups of double the size of the preexisting ones. It is also worth mentioning that a reduction of resources does not impact the performance of this operation, since preexisting leaving processes participate in the same way as preexisting staying processes during adaptation windows, due to their required participation during data repartitions.

As can be seen in Fig. 12.3, the implementation is successful in hiding the latencies related to the creation of new processes on new resources from preexisting processes. The measured times are significantly lower than the initialization times required by the child processes. Unfortunately, linear scaling has been observed due to the inherited implementation of the accept and connect routines from MPICH. These operations are reused in the current implementation and could be targets for optimization in the future.

12.4 MPI COMM ADAPT COMMIT

Figure 12.4: MPI COMM ADAPT COMMIT latency.

The last operation to be evaluated is MPI COMM ADAPT COMMIT. This operation is evaluated on the total number of processes, since it operates on the consolidated process group after an adaptation. This operation is in general very fast and has good scalability properties. It has not been a target for optimization. The reason for this is that all required synchronization takes place in the MPI COMM ADAPT BEGIN operation and its result is stored in the MPI library. When this operation is called, the process group and communicator metadata are updated locally in the memory of the process.
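Putting the two window operations together, a complete adaptation window might look as follows. The prototypes and the repartitioning helper are assumptions for illustration, not a restatement of the library's actual interface.

#include <mpi.h>

/* Assumed prototypes of the window operations; the actual signatures are
   defined by the elastic MPI library and may differ. */
int MPI_Comm_adapt_begin(MPI_Comm *intercomm, MPI_Comm *new_comm_world,
                         int *staying_count, int *leaving_count,
                         int *joining_count);
int MPI_Comm_adapt_commit(void);

/* Hypothetical application routine that redistributes data blocks over the
   new process group. */
void repartition_data(MPI_Comm new_comm_world,
                      int staying, int leaving, int joining);

void perform_adaptation(void)
{
    MPI_Comm intercomm, new_comm_world;
    int staying, leaving, joining;

    /* Staying, leaving, and joining processes all enter the window.  For
       preexisting processes the creation and initialization of the new
       processes has already completed, so this call mainly establishes the
       new communication structures. */
    MPI_Comm_adapt_begin(&intercomm, &new_comm_world,
                         &staying, &leaving, &joining);

    /* Application-specific repartitioning over the new communicator; leaving
       processes hand off their data here, which is why they participate in
       the window as well. */
    repartition_data(new_comm_world, staying, leaving, joining);

    /* Purely local: updates the process group and communicator metadata that
       was prepared and stored during MPI_Comm_adapt_begin. */
    MPI_Comm_adapt_commit();
}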

13 Elastic Resource Manager Performance

A selection of resource manager operations is evaluated in this chapter. This selection contains all operations that impact the performance of MPI operations during normal computations. The operations that were not included are very numerous, but are either performed locally by one of the resource manager components, or do not impact the performance of preexisting MPI processes thanks to the latency hiding features described in previous chapters.

13.1 Tree Based Overlay Network (TBON) Latency

The communication between SRUN and the SLURMD daemons that manage the execution of an MPI application is important for the MPI PROBE ADAPT operation when the adaptation flag is set to true. The algorithm for probing has two sides: the side at each MPI process and the side at each SLURMD daemon. When the adaptation flag is set to true, multiple synchronization operations between the SRUN program and each daemon take place.

These synchronization operations are performed over the Tree Based Overlay Network that connects SRUN to each SLURMD daemon. Because of this, the latency of messages over the TBON can impact the overhead of MPI processes when they are required to adapt.
