
10.3 Elastic Schedulers

10.3.1 Elastic Runtime Scheduler (ERS)

As mentioned before, the Elastic Runtime Scheduler (ERS) fulfills the extra requirements of resource-elasticity (Sec. 10.1.4). These extra requirements are a consequence of the added flexibility of malleable jobs. The design decisions made when developing this scheduler were motivated by the following observations:

1. The scalability of a distributed application with respect to its allocated resources is input dependent.

2. Empirical and history-based methods for performance prediction require performance measurements at multiple resource allocation sizes.

3. Machine learning research applicable to job scheduling is still in its early stages; moreover, these techniques require additional storage and databases. These requirements would add premature complications to the current prototype.

4. It is desirable to be able to optimize applications that are running for the first time, as well as applications with new input sets.

5. In addition to backfilling, expansions of running jobs can be performed to minimize the number of idle nodes.

6. Both system-wide and individual application performance metrics must be optimized.


The first observation comes from the experience of running simulation codes that perform differently with differently sized inputs, or even inputs of similar size but with different geometries. In addition to affecting overall performance, the input size also determines the available parallelism of the application. Its available parallelism in turn determines the amount of resources it can use efficiently.

The second observation is that history-based performance predictors require large collections of data. In the case of distributed systems, resource adaptations can involve the movement of large amounts of data and require expensive repartitioning sequences.

This makes the collection of empirical data considerably more expensive computationally than on shared-memory systems, where a reconfiguration does not involve the movement of memory over a network. A history of empirical data can be created as applications are executed. Given enough time, a predictor may collect adequate amounts of data.

The third observation is that machine learning research, as it relates to scheduling on distributed-memory systems, is currently limited. These techniques have great potential and may be applied in this domain in the near future. The storage requirement is related to the second observation; once it is in place and enough samples are collected, these methods will become feasible. Data collection may need to be repeated on new system installations, depending on the method.

The next observation is that the system should be able to handle, with acceptable efficiency, new elastic jobs that have no execution history. It should also be possible to efficiently execute jobs that run only once. Applications are often run multiple times, but each time with different input sets.

Another observation is that the added flexibility of elastic execution offers the potential to further reduce the number of idle nodes in distributed systems. With the addition of support for malleable jobs, it is easier to perform the backfilling operation by filling up idle nodes with resource expansions. Additionally, jobs can be started at different node counts. These are advantages over systems that support only rigid jobs.
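
The following C sketch illustrates this idea with a simple greedy policy: idle nodes are first offered to running malleable jobs that still have scalability headroom, and whatever remains stays available for regular backfilling of queued jobs. The job fields and the policy are assumptions chosen for illustration and do not reproduce the Elastic Backfilling algorithm described later in this section.

    /* Illustrative sketch: consume idle nodes with expansions of running
     * malleable jobs before regular backfilling of queued jobs.  The job
     * fields and the greedy policy are assumptions for this example. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        int allocated_nodes;   /* nodes currently held by the job      */
        int max_useful_nodes;  /* upper bound of efficient scalability */
    } job_t;

    /* Greedily expand running malleable jobs into the idle-node pool. */
    static int expand_into_idle_nodes(job_t *jobs, int njobs, int idle_nodes)
    {
        for (int i = 0; i < njobs && idle_nodes > 0; i++) {
            int headroom = jobs[i].max_useful_nodes - jobs[i].allocated_nodes;
            int grant = headroom < idle_nodes ? headroom : idle_nodes;
            if (grant > 0) {
                jobs[i].allocated_nodes += grant;
                idle_nodes -= grant;
                printf("expand %s by %d node(s)\n", jobs[i].name, grant);
            }
        }
        return idle_nodes; /* still-idle nodes remain for queued jobs */
    }

    int main(void)
    {
        job_t running[] = { { "jobA", 8, 12 }, { "jobB", 4, 16 } };
        int left = expand_into_idle_nodes(running, 2, 6);
        printf("%d idle node(s) left for regular backfilling\n", left);
        return 0;
    }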

The final observation is that the overall efficiency of an HPC system depends on the efficiency of the individual applications running, and not only on the number of idle nodes. Both system-wide and application efficiency metrics can be further improved with resource-elastic execution models.

The ERS makes scheduling decisions at a configurable rate. On each evaluation, the elastic scheduling algorithm performs the following steps (a sketch of this control flow is provided after the list):

1. Iterate the list of jobs and make a list of selected running jobs that are elastic and have performance data available.

2. Process the performance data to generate the performance model of the selected jobs.

3. Compute a range of optional and mandatory resource adaptations on the set of jobs.

4. Provide a resource offer based on the ranges to the Elastic Batch Scheduler (EBS).

5. Perform Elastic Backfilling (described later in this section).

6. Use the srun realloc message to apply individual resource adaptations.

7. Start any new jobs based on the batch definitions received from the EBS.

8. Wait until the system reaches a steady state before applying further resource transformations.
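
The C skeleton below sketches the control flow of one evaluation pass over these steps. All helper functions are hypothetical stubs that only print the step they stand for; the actual logic lives inside the elastic resource manager, and the EBS-dependent steps (4 and 7) remain future work.

    /* Minimal sketch of one ERS evaluation pass (steps 1-8).  Every helper
     * below is a placeholder stub used only to show the control flow. */
    #include <stdio.h>

    static int  select_elastic_jobs(void)    { puts("1: select elastic jobs with performance data"); return 1; }
    static void build_models(void)           { puts("2: build performance models"); }
    static void compute_ranges(void)         { puts("3: compute adaptation ranges"); }
    static void offer_to_ebs(void)           { puts("4: resource offer to the EBS (future work)"); }
    static void elastic_backfilling(void)    { puts("5: elastic backfilling"); }
    static void apply_adaptations(void)      { puts("6: apply adaptations via the srun realloc message"); }
    static void start_new_jobs(void)         { puts("7: start jobs from EBS batch definitions (future work)"); }
    static void wait_for_steady_state(void)  { puts("8: wait for steady state"); }

    /* One evaluation of the elastic scheduling algorithm. */
    static void ers_evaluate(void)
    {
        if (!select_elastic_jobs())
            return;                 /* nothing elastic to optimize this round */
        build_models();
        compute_ranges();
        offer_to_ebs();
        elastic_backfilling();
        apply_adaptations();
        start_new_jobs();
        wait_for_steady_state();
    }

    int main(void)
    {
        ers_evaluate();             /* the real scheduler runs this at a configurable rate */
        return 0;
    }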


Figure 10.6: Efficiency (elements per second per process, top) and MPI time to compute time ratio (bottom) of a Cannon’s matrix-matrix multiply kernel, plotted over the number of MPI processes for the 4096x4096 input. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented. A line is added for the constant 0.1 boundary of the ratio.

The remainder of this section describes steps 2, 3, and 5. Step 1 is trivial, since the scheduler simply iterates the list of running jobs and builds an additional list of those that are marked as elastic and have their performance data populated.
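
A minimal sketch of this filtering step is shown below; the job_record_t fields are assumptions chosen for illustration and do not match the actual job record layout of the resource manager.

    /* Step 1 as a sketch: collect running jobs that are elastic and already
     * have performance data.  Field names are illustrative assumptions. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        int  id;
        bool running;                /* job is currently executing          */
        bool elastic;                /* job is marked as elastic            */
        bool perf_data_populated;    /* performance data has been collected */
    } job_record_t;

    static size_t select_candidates(job_record_t *jobs, size_t njobs,
                                    job_record_t **selected)
    {
        size_t count = 0;
        for (size_t i = 0; i < njobs; i++)
            if (jobs[i].running && jobs[i].elastic && jobs[i].perf_data_populated)
                selected[count++] = &jobs[i];
        return count;
    }

    int main(void)
    {
        job_record_t jobs[] = {
            { 1, true,  true,  true  },   /* selected                */
            { 2, true,  false, true  },   /* not elastic             */
            { 3, true,  true,  false },   /* no performance data yet */
        };
        job_record_t *selected[3];
        size_t n = select_candidates(jobs, 3, selected);
        printf("%zu candidate job(s), first id: %d\n", n, selected[0]->id);
        return 0;
    }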

Steps 4 and 7 are not yet implemented, since they depend on the availability of the EBS. As mentioned before, the EBS has not been implemented as part of this work and is instead postponed as future work. Step 6 has already been described extensively in Chap. 9, where the design of the elastic resource manager and its interaction with the elastic MPI library have been documented. The final step has also been described in that chapter. When a transformation is triggered on a job via the srun realloc message command, its status changes from JOB_RUNNING to JOB_ADAPTING. With the MPI_COMM_ADAPT_COMMIT operation, each application notifies the resource manager when its adaptation is completed.

Its job record is then updated from the status JOB_ADAPTING to JOB_RUNNING; this state change marks the application as eligible for adaptations again and makes its released resources available for other jobs. At that moment, the resource manager updates the credentials for SRUN based on its new allocation.
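
The state bookkeeping around a single adaptation can be sketched as follows, assuming a simplified job structure and state enumeration; the real transitions are implemented in the elastic resource manager described in Chap. 9.

    /* Sketch of the job-state transitions around an adaptation.  The types
     * are simplified assumptions; only the transition logic mirrors the
     * description above. */
    #include <stdio.h>

    typedef enum { JOB_RUNNING, JOB_ADAPTING } job_state_t;

    typedef struct { int id; job_state_t state; int nodes; } job_t;

    /* srun realloc message accepted: the job starts adapting. */
    static void on_realloc_message(job_t *job)
    {
        job->state = JOB_ADAPTING;   /* not eligible for further adaptations */
    }

    /* MPI_COMM_ADAPT_COMMIT completed: the adaptation is finished. */
    static void on_adapt_commit(job_t *job, int new_nodes)
    {
        job->nodes = new_nodes;      /* released nodes return to the idle pool */
        job->state = JOB_RUNNING;    /* eligible for adaptations again         */
        /* here the resource manager would also refresh the srun credentials */
    }

    int main(void)
    {
        job_t job = { 42, JOB_RUNNING, 16 };
        on_realloc_message(&job);
        on_adapt_commit(&job, 12);
        printf("job %d running on %d nodes\n", job.id, job.nodes);
        return 0;
    }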
