
10.3 Elastic Schedulers

10.3.3 Elastic Backfilling

The elastic backfilling algorithm is designed to reduce the number of idle nodes by taking advantage of the extra flexibility provided by resource-elasticity. Elastic backfilling is performed in step 5 of the Elastic Runtime Scheduler (ERS) algorithm.

The algorithm takes the following parameters of each job description as input:

1. Current node count.

2. Estimated time of completion.

3. Its resource range (from its entry in the RRV).

4. Average adaptation time.

The estimated time of completion is computed linearly from the completion time provided by the user and the starting node count. This value is currently set with the -t or --time option of the SRUN launcher. The minimum number of nodes is set through an extension that reads an environment variable. Both these values will be specified in additional options in batch descriptions once the Elastic Batch Scheduler (EBS) is developed.
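The linear estimate can be sketched as follows. The rescaling rule shown here is an assumption derived from the linear-scaling model, and the function name is hypothetical:

```python
def estimate_completion(user_time, start_nodes, current_nodes):
    """Hypothetical linear estimate of a job's completion time.

    user_time is the walltime the user gave via -t/--time for
    start_nodes nodes; under the linear-scaling assumption the
    estimate rescales inversely with the current node count.
    """
    return user_time * start_nodes / current_nodes

# A job submitted with a 120-minute limit on 4 nodes, now running on 8:
print(estimate_completion(120.0, 4, 8))  # 60.0
```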

The third input is produced by the heuristic explained previously. It can have a reduction of nodes or an optional expansion. The average adaptation time is computed from previously measured adaptations. It is not available if the application has not adapted before.

This time is measured from the moment the srun realloc message is sent to when the status flag of the application is changed back to JOB RUNNING.

The algorithm performs two basic operations: time balancing and resource filling. These operations are applied to sets of jobs and are described in the remainder of this section.

Time Balancing

Time balancing is an operation that transforms the number of nodes of each job in a job set such that their completion times become as close as possible. This can lower the makespan of the schedule and reduce wait times. It is a transformation that can be described with linear algebra. In this subsection, the transformation will be described for cases with two, three and four applications. Extrapolating from these, the technique can be understood for an arbitrary number of applications. This operation is only applied if applications are expected to retain their efficiency levels with the new resource allocations.

Consider the two jobs presented in Fig. 10.7. If t0 and t1 are the estimated completion times of jobs 0 and 1, and their node counts are n0 and n1 respectively, then the following linear system can be solved to find a vector of x0 and x1, such that the expected completion times of both jobs match. This transformation assumes linear scaling; this can be assumed to be approximately true within efficient node ranges, if the performance model determines these accurately. Defining x0' = 1/x0 and x1' = 1/x1, the following linear system can be solved to get the scaling factors of the time balancing operation:

( 1/t0  -1/t1 ) ( x0' )   (    0    )
(  n0     n1  ) ( x1' ) = ( n0 + n1 )          (10.4)

After that, we can directly multiply x0'*n0 and x1'*n1 to get our scaled node counts and apply the time balancing operation. In the same manner, time balancing can be applied to a set of three applications.
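As a concrete sketch of the operation, the two-job system of Eq. (10.4) can be solved directly by substitution. The completion times and node counts used here are hypothetical:

```python
# Time balancing for two jobs: solve Eq. (10.4) by hand for the scaling
# factors x0' = 1/x0 and x1' = 1/x1. The job parameters are hypothetical.
t0, n0 = 100.0, 8   # job 0: estimated completion time, node count
t1, n1 = 400.0, 8   # job 1: estimated completion time, node count

# Rows of Eq. (10.4):
#   (1/t0)*x0' - (1/t1)*x1' = 0        (equal completion times)
#    n0*x0'    +  n1*x1'    = n0 + n1  (node conservation)
# Substituting x0' = (t0/t1)*x1' into the second row:
x1p = (n0 + n1) / (n0 * t0 / t1 + n1)
x0p = (t0 / t1) * x1p

# Floor to natural node counts (a surplus of one node remains here,
# which resource filling can reclaim).
new_n0 = int(x0p * n0)
new_n1 = int(x1p * n1)
print(new_n0, new_n1)  # 3 12
```

Both jobs now have the same expected completion time (250 time units under the linear-scaling assumption), with the shorter job shrunk and the longer job expanded.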

10 Monitoring and Scheduling Infrastructure

Figure 10.7: Time balancing applied to two jobs.

Figure 10.8 presents a similar scenario as before, but with one extra application. The following linear system can be solved to obtain the three scaling factors x0', x1' and x2':

( 1/t0  -1/t1    0   ) ( x0' )   (      0       )
(   0    1/t1  -1/t2 ) ( x1' ) = (      0       )
(  n0     n1     n2  ) ( x2' )   ( n0 + n1 + n2 )

Similarly, the following linear system can be solved to obtain the four scaling factors x0', x1', x2' and x3' for a set of 4 applications:

( 1/t0  -1/t1    0      0   ) ( x0' )   (         0         )
(   0    1/t1  -1/t2    0   ) ( x1' )   (         0         )
(   0     0    1/t2  -1/t3  ) ( x2' ) = (         0         )
(  n0     n1    n2     n3   ) ( x3' )   ( n0 + n1 + n2 + n3 )

As can be deduced, these linear systems follow a simple pattern as the number of jobs to be transformed increases. The current implementation creates matrices in augmented form based on the selected jobs and solves the linear system with Gaussian elimination. These are solved quickly even for transformations with thousands of jobs. The current number of idle nodes (Nidle) can be added to the total node count in the last equation.

For example, the last equation in the 4 job example becomes:

x0'*n0 + x1'*n1 + x2'*n2 + x3'*n3 = n0 + n1 + n2 + n3 + Nidle

Solving the modified linear system makes the time balancing operation produce a re-source scaling vector that can fill these idle nodes as well.

Since the scaling factors are real numbers, while node counts are natural numbers, a floor operation is applied to each of the results: x0'*n0, x1'*n1, x2'*n2, etc. This means that the operation may produce a surplus of nodes, instead of zero, in some cases.

Any remaining nodes can be filled with the resource filling operation described next.
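The whole time balancing operation described above can be sketched as follows. This is an illustration of the augmented-matrix approach, not the scheduler's actual code, and the argument names are assumptions:

```python
import math

def time_balance(times, nodes, idle=0):
    """Sketch of time balancing for k jobs: build the augmented matrix
    following the pattern above and solve it with Gaussian elimination.

    times: estimated completion times t_i
    nodes: current node counts n_i
    idle:  idle nodes added to the node total of the last equation
    """
    k = len(times)
    A = [[0.0] * (k + 1) for _ in range(k)]      # augmented form
    for i in range(k - 1):                       # equal-completion rows
        A[i][i] = 1.0 / times[i]
        A[i][i + 1] = -1.0 / times[i + 1]
    A[k - 1][:k] = [float(n) for n in nodes]     # node-conservation row
    A[k - 1][k] = float(sum(nodes) + idle)

    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * k                                # back substitution
    for r in range(k - 1, -1, -1):
        s = A[r][k] - sum(A[r][c] * x[c] for c in range(r + 1, k))
        x[r] = s / A[r][r]

    # Floor the scaled node counts (a small tolerance guards against
    # floating-point error); any surplus is left for resource filling.
    return [math.floor(xp * n + 1e-9) for xp, n in zip(x, nodes)]

print(time_balance([100, 200, 300], [6, 6, 6]))  # [3, 6, 9]
```

In the three-job example all jobs end up with an expected completion time of 200 time units, using exactly the original 18 nodes.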

Resource Filling

This operation takes the number of nodes that are idle and expands applications according to their estimated efficient maximum number of nodes, determined by the performance model and heuristic. The resource filling operation is much simpler than time balancing.

The algorithm starts by creating a list of jobs ordered by their remaining runtime, from highest to lowest. This reordering helps lower the makespan of the schedule. It then expands the jobs one by one until all idle nodes are filled or all candidate jobs reach their maximum node count.

Figure 10.8: Time balancing applied to three jobs.

Figure 10.9: Resource filling applied to two jobs.

Figure 10.9 provides an illustration of this operation applied to two jobs. In this case the completion times do not need to match. The operation is performed so that the idle node count is minimized, and it generally produces a reduction in the number of nodes available for potential application starts. Job priorities are not taken into consideration.
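The resource filling operation can be sketched as follows. The dictionary keys are hypothetical names for the job attributes described above:

```python
def resource_fill(jobs, idle):
    """Sketch of resource filling. `jobs` is a list of dicts with
    hypothetical keys: 'remaining' (estimated remaining runtime),
    'nodes' (current count) and 'max_nodes' (efficient maximum from
    the performance model). Returns the leftover idle node count.
    """
    # Visit jobs by remaining runtime, highest to lowest.
    for job in sorted(jobs, key=lambda j: j["remaining"], reverse=True):
        if idle == 0:
            break
        grant = min(idle, job["max_nodes"] - job["nodes"])
        job["nodes"] += grant   # expansion; completion times need not match
        idle -= grant
    return idle

jobs = [
    {"remaining": 50, "nodes": 4, "max_nodes": 6},
    {"remaining": 90, "nodes": 4, "max_nodes": 8},
]
print(resource_fill(jobs, 5), [j["nodes"] for j in jobs])  # 0 [5, 8]
```

The longer-running job is expanded first, which mirrors the makespan-lowering reordering described in the text.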

Resource Scaling Vector (RSV)

As mentioned earlier, the Elastic Batch Scheduler (EBS) is currently not implemented. The Elastic Runtime Scheduler (ERS) produces schedules for jobs that are started with the SRUN command. Multiple applications can be launched simultaneously, and the SRUN command blocks until sufficient resources are available for the number of nodes required by the application. The priority of the jobs is assigned based on their arrival time, where earlier jobs have higher priority, following a First-Come First-Serve (FCFS) policy.

The ERS needs to produce a resource scaling vector (RSV) to minimize the makespan of all the jobs, running or blocked with a reservation. The RSV specifies new node counts for a subset of the running jobs. This vector is applied to the running system through an srun realloc message per application. If the new node count is greater than the current value, then an expansion is started. In contrast, if the new node count is smaller, a reduction is started. If the new value is the same, then the message is not generated. This can occur due to rounding during any of the transformations, even though the individual application was a candidate for adaptation.

Figure 10.10: Possible schedule of a set of elastic jobs ordered by priority in the queue.
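The per-job decision can be summarized as follows. The function and its return values are illustrative, not the actual message interface:

```python
def realloc_action(current, new):
    """Illustrative decision for one job's entry in the RSV: which
    srun realloc action, if any, the new node count triggers."""
    if new > current:
        return "expansion"
    if new < current:
        return "reduction"
    return None  # equal counts (e.g. after rounding): no message sent

print(realloc_action(8, 12), realloc_action(8, 4), realloc_action(8, 8))
```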

The elastic backfilling algorithm applies traditional backfilling together with the previously described time balancing and resource filling techniques. The following rules summarize the heuristic used to apply these three techniques.

1. If one or more higher priority jobs were delayed due to lack of resources:

a) Select a set of running jobs with resources that add up to the number of required resources to start the high priority jobs. Include any idle nodes based on availability.

b) If a set is found, apply time balancing to it.

c) Create reservations on the time balanced nodes for the high priority jobs.

2. If after starting new jobs and applying time balancing there are still idle nodes:

a) Select jobs that fit in the gaps and apply traditional backfilling.

b) Apply resource filling to fill any remaining idle nodes.

The first rule is an attempt to minimize the wait times of high priority jobs. The second rule shares the same goal as traditional backfilling techniques: to try to minimize the number of idle nodes. It is better than backfilling only when resource-elastic jobs are available to further fill the gaps. As mentioned before, because the EBS is not available, its shim always returns that the queue is empty. This means that currently new jobs are never started with backfilling.
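The set selection of rule 1a could be sketched as follows. The greedy largest-first choice is an assumption of this sketch, since the text does not specify how the set is chosen:

```python
def select_for_reservation(running, idle, required):
    """Pick running jobs whose node counts, together with idle nodes,
    cover the nodes required by delayed high-priority jobs (rule 1a).
    Greedy largest-first selection is an assumption, not the thesis'
    stated policy. Returns job ids, or None if no set covers the need.
    """
    needed = required - idle          # idle nodes are included first
    chosen = []
    for job in sorted(running, key=lambda j: j["nodes"], reverse=True):
        if needed <= 0:
            break
        chosen.append(job["id"])
        needed -= job["nodes"]
    return chosen if needed <= 0 else None

running = [{"id": 0, "nodes": 8}, {"id": 1, "nodes": 4}, {"id": 2, "nodes": 2}]
print(select_for_reservation(running, idle=2, required=12))  # [0, 1]
```

If a set is found, time balancing would then be applied to it and reservations created on the balanced nodes, as rules 1b and 1c describe.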

Figure 10.10 illustrates an alternative schedule produced thanks to the support of resource-elasticity in the presented prototype. The job queue matches that of Fig. 10.1. As can be observed, in this case job 2 has been started at an initial smaller node count and later expanded. This allows the start of the jobs based on their priorities, therefore ensuring fairness. Additionally, the makespan is shorter. These are clear improvements to fairness and performance when compared to the previous static schedule. Ensuring efficient per-application operation, with the techniques described in this chapter, is an additional benefit.

11 Evaluation Setup

The evaluation has been performed on the SuperMUC [13] petascale system. This supercomputer is managed by the Leibniz Supercomputing Center (LRZ) and is located in Garching, Germany. The resources of this HPC system are managed by an IBM Load Leveler resource manager.

There were some challenges encountered when testing the custom resource manager and communication library. As may be expected, it is not possible to replace the resource manager from the HPC system in production. Additionally, the new resource manager is composed of multiple daemons in a distributed memory setup. This system is shared among many users; this means that jobs need to wait for undetermined amounts of time in a queue. To overcome these challenges a set of scripts and custom binaries were developed.

11.1 Elastic Resource Manager Nesting in SuperMUC

A set of scripts has been written to allow the presented resource manager to be bootstrapped inside a job allocation. The scripts parse the set of host names that were allocated with each Load Leveler job that is submitted for testing the infrastructure. A slurm.conf file is generated dynamically for each job. With the configuration in place, the set of daemons are started at each host of the allocation. A few seconds are allowed for the resource manager to bootstrap itself and become capable of supporting new application starts. An HPC cluster of the size of the Load Leveler resource allocation is emulated this way in SuperMUC. In summary, the set of hosts allocated for a Load Leveler job becomes a test parallel system. Applications built with the custom MPI implementation can be launched inside Load Leveler jobs once the custom resource manager has finished bootstrapping itself inside an allocation.
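The configuration-generation step can be sketched as follows. The configuration keys shown are a minimal illustration rather than the thesis' actual generated slurm.conf, and the input format mimics LOADL_PROCESSOR_LIST, the host list that Load Leveler exports inside an allocation:

```python
def make_slurm_conf(processor_list):
    """Build a minimal slurm.conf body from a Load Leveler host list
    (whitespace-separated host names, one entry per allocated
    processor, possibly repeated). Keys shown are illustrative.
    """
    hosts = sorted(set(processor_list.split()))   # unique host names
    nodelist = ",".join(hosts)
    return (
        "ClusterName=emulated\n"
        f"NodeName={nodelist}\n"
        f"PartitionName=test Nodes={nodelist} Default=YES State=UP\n"
    )

# The real scripts write this file, start the daemons on every host of
# the allocation, and wait a few seconds before launching applications.
print(make_slurm_conf("node03 node01 node03 node02"))
```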

This setup has the disadvantage that different performance is observed with each job allocation, due to the different subset of nodes allocated by the Load Leveler in each run. There is not much control over the selection of the nodes. The job description may ask for the nodes to be allocated inside a single island (a set of racks with lower latency across nodes), and not much else. In general, a different set of nodes is expected with each new test run. This requires that multiple tests be performed to smooth out the variability of measurements due to different node sets.

Another disadvantage is that, as normal job submissions to the Load Leveler, the test jobs need to wait in the job queue. Jobs will have variable wait times depending on the size of the queue and the number of resources. It is common to observe wait times of multiple days for test jobs with allocations of more than a hundred nodes. Because of this, only modest node counts (64 maximum) were requested with test jobs.
