
Multi-tasking


Since multi-tasking is a straightforward extension of uni-tasking, familiarity with uni-tasking as described in Chapter 11 is assumed.

12.1 Scalability

When multi-tasking, multiple processors are deployed against a job to reduce the job's overall wall clock time. The cpu time (summed over all processors) is not reduced¹ compared to uni-tasking. The degree to which multiple processors are able to reduce a job's wall clock time is a measure of the job's parallel efficiency or scalability. For instance, if a job which takes 10 hours of wall clock time on one processor can be divided into pieces so that it takes one hour using 10 processors, then the job is 100% parallel. The job is then said to scale well, which means it scales linearly with the number of processors. It should be kept in mind that a perfectly linear scaling with the number of processors assumes a dedicated environment. In a non-dedicated environment, other executables will steal processors, and the job's captured processors must wait for the stolen processors to be re-claimed before continuing. The wait time degrades scaling.

How parallel must a job be? Almost linear scaling may be good enough for a small number of processors, but it is not for large numbers of processors. For instance, if a job is 99.5% parallel, it will saturate only about 14.8 out of 16 processors. That is, on 16 processors the job runs only about 15 times as fast as the uni-tasked job. However, the same job will only give a speedup of 49 on 64 processors, and on 1024 processors the speedup is only 168! This is a consequence of Amdahl's law.
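For reference, the quoted figures follow directly from Amdahl's law (the formula itself is not written out in the original): with parallel fraction p and N processors, the ideal speedup is

    speedup(N) = 1 / ((1 - p) + p/N)

so for p = 0.995 this gives 1/(0.005 + 0.995/16) ≈ 14.9 on 16 processors, ≈ 48.7 on 64 processors, and ≈ 167.5 on 1024 processors, consistent with the numbers above.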

12.2 When to multi-task

If the number of processors exceeds the number of jobs in the system, some processors stand idle and overall system efficiency degrades unless jobs are multi-tasked. In an operational environment, if a forecast is required every 4 hours but takes 8 hours on a single processor, then multi-tasking will be useful. Other examples include long running experiments that take too long to complete, or cases where it is impossible to exhaust a monthly computer time allocation using one processor. In environments where multiple processors are dedicated to a single job, it is essential to attain high parallel efficiency, otherwise system efficiency and overall throughput suffer. In the past, multi-tasking has not been used in any significant way at GFDL. The reason is that there have always been enough jobs in the system to keep all processors busy. Typically, in the middle of 1997, there were about 30 batch jobs in GFDL's Cray T90 at all times, and the system efficiency averaged over 24 hours was about 93%. At the point where a system becomes undersaturated, overall system efficiency will drop significantly unless multi-tasking is used.

¹ In fact, it may increase. However, the focus is changed from cpu time to wall clock time. Wall clock time is the time it takes to get results back.

12.3 Approaches to multi-tasking

Two approaches to multi-tasking MOM have been experimented with at GFDL. They were:

• The fine grained approach to multi-tasking, where parallelism is applied at the level of each nested do loop. This is also known as "autotasking". All processors work simultaneously on each nested do loop, and there are many parallel regions.

• The coarse grained approach to multi-tasking, where parallelism is applied at the level of latitude rows. For example, all work associated with solving equations for one latitude row is assigned to one processor, all work associated with solving the equations for another latitude row is assigned to a second processor, and so forth. Then all processors work simultaneously and independently. This implies that there is only one parallel region.

In a nutshell, the fine grained approach to multi-tasking gives about 85% parallelism. That translates into a speedup of a little less than 4 on 8 processors. Why is the parallel efficiency so low? The reason is basically that there is not enough work within the many parallel regions, and that the range of the parallel loop "j" index is not the same for all loops. Increasing the model resolution can somewhat address the first problem but not the second. The result is that processors stand idle and the parallel performance degrades. This is clearly not a long term solution. The coarse grained approach is the best and yields a near linear speedup with the number of processors.
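As a rough illustration only (plain Python, not MOM's Cray autotasking or its Fortran loops; the array names and the toy update are invented for this sketch), the structural difference between the two approaches looks like this:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    imt, jmt = 64, 32                      # toy grid: imt longitudes, jmt latitude rows
    a = np.random.rand(jmt, imt)
    b = np.zeros_like(a)
    c = np.zeros_like(a)

    def fine_row_1(j):                     # body of the first "nested do loop"
        b[j, :] = 2.0 * a[j, :]

    def fine_row_2(j):                     # body of the second "nested do loop"
        c[j, :] = b[j, :] + a[j, :]

    def coarse_block(rows):                # all work for a block of latitude rows
        for j in rows:
            b[j, :] = 2.0 * a[j, :]
            c[j, :] = b[j, :] + a[j, :]

    with ThreadPoolExecutor(max_workers=4) as pool:
        # fine grained: every loop nest is its own parallel region, with a
        # synchronization point after each one; note the differing "j" ranges
        # from loop to loop, the issue cited above
        list(pool.map(fine_row_1, range(1, jmt - 1)))
        list(pool.map(fine_row_2, range(2, jmt - 2)))

        # coarse grained: a single parallel region; each worker owns a
        # contiguous block of latitude rows and does all work for those rows
        list(pool.map(coarse_block, np.array_split(np.arange(1, jmt - 1), 4)))

The contrast to notice is the number of parallel regions and synchronization points, not the arithmetic, which is arbitrary here.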

12.4 The distributed memory paradigm

A distributed memory paradigm is one where each processor has its own chunk of memory but the memory is not shared among processors. This means that an array on one processor cannot be dimensioned larger than the processor's local memory, and the processor cannot easily access arrays dimensioned in other processors' memory. Accessing arrays in another processor's memory is possible by making "communication" calls to transfer data between processors.

MOM uses a distributed memory paradigm. The method builds on the coarse grained approach and assumes that both the baroclinic and barotropic parts are divided among processors with distributed memories; therefore "communication" calls must be added to exchange boundary cells between processors at critical points within the code. The "communication" calls are made via a message passing module which supports SHMEM as well as MPI protocols. For details, refer to http://www.gfdl.gov/vb. The advantages of the distributed memory paradigm are:

1. Higher parallel efficiency is attainable.

2. No complicated "ifdef" structure is needed to partition code differently for uni-tasking or multi-tasking on shared or distributed memory platforms.

3. Only two time levels are required for 3-D prognostic data on disk (or ramdrive) as opposed to three time levels with the coarse grained shared memory method.

Most options have been tested in parallel. It bears stating that when an option is parallelized, it means that answers are the same to the last bit of machine precision regardless of the number of processors used². Some options will probably never be parallelized. One example is the stream function method. This method requires land mass perimeter integrals which cut across processors in complicated ways, and this is a recipe for poor scaling. The implicit free surface method is better since it does not require perimeter integrals. However, global sums are still required, which degrade scaling, but not as much as the island integrals.

Improvement can be made to the existing global sum reductions because they are only a crude first attempt. However, even if the global sum reductions were no problem, the accuracy of the method depends on the number of iterations, which is tied to the grid size³. The best scaling is achieved by the explicit free surface option, which does not require any global sum reductions and whose number of iterations (sub-cycles) depends on the ratio of internal to external gravity wave speed, independent of the number of grid cells.
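The scaling cost of a global sum can be made concrete with a small sketch (illustrative Python only; MOM's actual reduction lives in its message passing module and is not shown here). Even when each processor's local partial sum is cheap, combining the partial sums couples all processors and takes a number of communication stages that grows with the processor count:

    def recursive_doubling_sum(partial_sums):
        # simulate a pairwise-exchange ("recursive doubling") combine across P
        # processors; returns processor 0's result and the number of stages,
        # which grows like log2(P). (For a non power-of-two P this simplified
        # variant only guarantees the complete sum on processor 0.)
        vals = list(partial_sums)
        p = len(vals)
        step, stages = 1, 0
        while step < p:
            nxt = vals[:]
            for rank in range(p):
                partner = rank ^ step
                if partner < p:
                    nxt[rank] = vals[rank] + vals[partner]
            vals = nxt
            step *= 2
            stages += 1
        return vals[0], stages

    for p in (16, 64, 1024):
        total, stages = recursive_doubling_sum([1.0] * p)
        print(p, "processors:", total, "in", stages, "communication stages")
    # each stage adds message latency, so the combine cost grows with P while
    # the local work per processor shrinks -- the reason global sums degrade
    # scaling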

12.5 Domain Decomposition

If all arrays were globally dimensioned⁴ on all processors, available memory could easily be exceeded with even modest sized problems. Each processor should only dimension arrays large enough to cover the portion of the domain being worked on by that specific processor. Refer to Fig 12.1a, which is an example of a domain with i = 1, ..., imt longitudes and jrow = 1, ..., jmt latitudes divided among 9 processors arranged in a 2d (two dimensional) domain decomposition. Fig 12.1b gives the arrangement for a 1d (one dimensional) domain decomposition in latitude using the same 9 processors. In both cases, two dimensional arrays need only be dimensioned large enough to cover the area worked on by each processor. As discussed below, this area will actually be one or two cells larger at the processor boundaries to include the boundary cells required by the numerics. Boundary cells on each processor must be updated with predicted values from the domain of the adjoining processor. The arrows indicate places where communication takes place across boundaries between processors.

Figure 12.1c indicates three cases: one for 9 processors, 100 processors, and 900 processors.

Within each case, three domain shapes are considered: imt = jmt, imt = 2*jmt, and imt = jmt/2. The tables indicate how many communication calls are required. Since both processors which share a boundary require data from the other processor, two communication calls are required at each boundary, and the same number of words is transferred across the boundary for each processor. Two things are worth noting. First, as the number of processors increases, the number of words transferred in the 2d domain decomposition is much less than in the 1d domain decomposition. Second, for large numbers of processors, the 1d domain decomposition requires half as many communication calls as the 2d domain decomposition. An additional problem for 2d domain decomposition, which is not indicated in Figure 12.1c, is that the equations are not symmetric with respect to latitude and longitude. Specifically, a wrap around or cyclic condition is placed on longitudes for global simulations. No such condition is needed along latitudes. The implication is that many additional communication calls will be needed to impose the cyclic condition on intermediate computations for a 2d domain decomposition. These communication calls are not needed in a 1d decomposition in latitude because each latitude strip, including the cyclic boundary, resides on one processor. To lessen the need for extra communication calls in a 2d decomposition, additional points can be added near the cyclic boundary, thereby extending the domain slightly.

² This is necessary; otherwise science may become processor dependent.

³ As the number of grid cells increases, the number of iterations must increase to keep the same level of accuracy.

⁴ Dimensioned by the full number of grid cells in latitude, longitude, and depth.

Whether 1d or 2d decomposition is better depends on the latency involved in issuing a communication call versus the time taken to transfer the data. Another factor is whether polar filtering is used, because it would also require extra communication in a 2d decomposition.
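A back-of-the-envelope comparison can be sketched as follows (illustrative Python; the counting assumptions are mine and not necessarily those used to build Figure 12.1c: one halo row or column per internal boundary, a square processor grid for the 2d case, and corner and cyclic exchanges ignored):

    import math

    def one_d(imt, jmt, nproc):
        calls = 2 * (nproc - 1)                    # two calls per internal latitude boundary
        words = 4 * imt                            # an interior processor sends and receives
        return calls, words                        #   one full latitude row to each neighbour

    def two_d(imt, jmt, nproc):
        px = py = int(round(math.sqrt(nproc)))     # assume a square processor grid
        calls = 2 * (px * (py - 1) + py * (px - 1))
        words = 4 * (imt // px + jmt // py)        # one halo row plus one halo column,
        return calls, words                        #   sent and received, per interior processor

    imt = jmt = 360                                # arbitrary square grid for illustration
    for nproc in (9, 100, 900):
        print(nproc, "procs  1d:", one_d(imt, jmt, nproc), " 2d:", two_d(imt, jmt, nproc))
    # for large processor counts the 1d decomposition needs roughly half as many
    # calls, while the 2d decomposition moves far fewer words per processor

The absolute numbers depend on the counting convention; only the two trends noted above are meant to be illustrated.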

At this point, the 1d domain decomposition of Fig 12.1b is what is done in MOM.

Executing the 3 degree global test case #0 for MOM on multiple processors of a T3E using 1-D decomposition in latitude indicates that scaling falls off quickly when there are fewer than 8 latitude rows per processor. This result has dire implications for climate models. For instance, a 2 degree global ocean model will have about 90 latitude rows. Efficient scaling implies that only 11 processors can be used. On the GFDL T3E, it takes about 10 processors to equal the speed of one T90 processor. If it is assumed that processors on the next system are 2.5 times as fast as the T3E processors and the ratio of network speed to processor speed stays the same, a two degree climate model will only gain a factor of about 3 in speed over a T90 processor.

If the processor speed increases faster than the network speed, then the situation gets worse.

The implication is that 2-D domain decomposition will be needed for climate modeling if an order of magnitude speedup is to be attained.

On the other hand, a 1/5 degree global ocean model with 50 levels has 900 latitude rows and requires 2 gigawords of storage with a fully opened memory window. Using 9 rows per processor implies that 100 processors can be used with 1-D decomposition. Each processor must have at least 20 megawords. Realistically, by the next procurement, each processor will have at least 32 megawords of memory. If the GFDL T3E had enough processors, then a speedup of 25 times that of one T90 processor would be realized. It seems that 2-D decomposition is more important for coarse resolution models.
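For concreteness, the arithmetic behind these sizing estimates appears to be the following (my reconstruction, using the figures stated above: 8-9 rows per processor for efficient scaling, 10 T3E processors per T90 processor, and next-system processors 2.5 times as fast as a T3E processor):

    2 degree model:   90 rows / 8 rows per processor  ≈ 11 processors
                      11 * 2.5 / 10                   ≈ 2.8, i.e. a factor of about 3 over one T90 processor

    1/5 degree model: 900 rows / 9 rows per processor = 100 processors
                      2 gigawords / 100 processors    = 20 megawords per processor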

Since the majority of simulations with MOM have been carried out on vector machines at GFDL, the idea has been to compute values over land rather than to incur the cost of starting extra vectors to skip around land cells. In an earlier implementation, extra logic was in place to skip computations on land cells. It turned out that the speed increase on vector machines was not worth the extra coding complexity so the logic was removed. On scalar machines, it may be beneficial to skip computation over land areas. A 2d domain decomposition would in principle more easily allow computation over land to be skipped by not assigning processors to areas containing all land cells. In a 1d decomposition, extra logic would need to be added to eliminate computation over land cells.
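The point about land can be illustrated with a small sketch (plain Python; the land/sea mask and block counts here are invented for illustration and have nothing to do with any real MOM grid). A 2d block can fall entirely on land and simply be left without a processor, whereas a 1d latitude strip spans all longitudes and is rarely all land:

    import numpy as np

    # toy land/sea mask: 1 = ocean, 0 = land
    imt, jmt = 36, 18
    mask = np.ones((jmt, imt), dtype=int)
    mask[4:14, 10:20] = 0            # a rectangular "continent"

    def idle_blocks_2d(mask, px, py):
        # count 2d blocks containing only land; these need no processor at all
        blocks = [blk for row in np.array_split(mask, py, axis=0)
                      for blk in np.array_split(row, px, axis=1)]
        return sum(1 for blk in blocks if not blk.any())

    def idle_strips_1d(mask, nproc):
        # count latitude strips containing only land (rare, since a strip
        # spans all longitudes)
        strips = np.array_split(mask, nproc, axis=0)
        return sum(1 for s in strips if not s.any())

    print("all-land 2d blocks:", idle_blocks_2d(mask, px=6, py=3))
    print("all-land 1d strips:", idle_strips_1d(mask, nproc=18))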

12.5.1 Calculating row boundaries on processors

The number of processors is read in through namelist as the variable num_processors. The model domain is divided into num_processors pieces, with each piece containing the same number⁵ of latitude rows. Each processor is assigned the task of working on its own piece of the domain, starting with latitude index jrow = jstask and ending with latitude index jrow = jetask. Each processor has its own memory window to process only those latitude rows within its own task.

⁵ The important point is to have the same amount of work on each processor. If more work is required on some rows than others, then the number of rows on each processor should be different. It is assumed here that the same amount of work is on every row and that each processor should get the same number of rows.

The starting and ending rows of each processor’s task are given by

    jstask = nint((pn - 1) * calculated_rows) + (2 - jbuf)          (12.1)
    jetask = nint(pn * calculated_rows) + (1 + jbuf)                (12.2)

where pn is the processor number (pn = 1, ..., num_processors), the number of buffer rows "jbuf" is explained in Section 11.3.1, and

    calculated_rows = float(jmt - 2) / num_processors               (12.3)

The latitude rows for which the tracer and baroclinic equations are solved within the processor's task are controlled by the starting and stopping rows

    jscomp = jstask + jbuf                                          (12.4)
    jecomp = jetask - jbuf                                          (12.5)
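A minimal Python transcription of equations (12.1)-(12.5) (not the MOM Fortran itself; the helper name task_rows is invented here) reproduces the task limits quoted in Section 12.5.2 for the second order case with num_processors = 3, jmt = 14 and jbuf = 1:

    import math

    def nint(x):
        # Fortran NINT for the non-negative arguments used here: round half up
        return int(math.floor(x + 0.5))

    def task_rows(pn, num_processors, jmt, jbuf):
        calculated_rows = float(jmt - 2) / num_processors          # eq. (12.3)
        jstask = nint((pn - 1) * calculated_rows) + (2 - jbuf)     # eq. (12.1)
        jetask = nint(pn * calculated_rows) + (1 + jbuf)           # eq. (12.2)
        jscomp = jstask + jbuf                                     # eq. (12.4)
        jecomp = jetask - jbuf                                     # eq. (12.5)
        return jstask, jetask, jscomp, jecomp

    for pn in (1, 2, 3):
        print("processor", pn, task_rows(pn, num_processors=3, jmt=14, jbuf=1))
    # processor 1: (1, 6, 2, 5)   processor 2: (5, 10, 6, 9)   processor 3: (9, 14, 10, 13)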

12.5.2 Communications

Figure 12.2a gives an example of multi-tasking with num_processors = 3 processors and jmt = 14 latitude rows for a second order memory window. Note that the latitude rows on disk (or ramdrive) for each processor in Figure 12.2b look like a miniature version of Figure 11.4, where jstask = 1 and jetask = jmt. For example, on the disk (or ramdrive) of processor #1, the global latitude index runs from jrow = 1 to jrow = 6. On this processor, the task limits are jstask = 1 and jetask = 6, and the limits for integrating prognostic equations are from jscomp = 2 through jecomp = 5. On processor #2, the task limits are jstask = 5 and jetask = 10, and the limits for integrating prognostic equations are from jscomp = 6 through jecomp = 9.

Apart from dividing up the domain into pieces, the new aspect in Figure 12.2 is communication between processors, indicated by short arrows pointing to the boundary rows. In the second order memory window, there is one boundary row at the borders of each task. Look at latitude row jrow = 6 on the updated disk of processor #1. This row cannot be updated to τ+1 by processor #1's MW because data from jrow = 7 is needed. Instead, data at τ+1 from jrow = 6 on processor #2 is copied into the jrow = 6 slot of processor #1 by a call to the communication routine after all processors have finished working on their tasks. Similarly, jrow = 5 on processor #2 is updated with τ+1 data from jrow = 5 on processor #1, and so forth.

The situation with fourth order numerics is similar, except that more communication is required at the end of the timestep. This is illustrated in Figure 12.3a. Note that since jbuf = 2 for fourth order windows, the task for processor #1 ranges from jstask = 1 to jetask = 7. For processor #2, the task ranges from jstask = 4 to jetask = 11, and so forth. However, the rows on which prognostic equations are solved within each task are the same as in the case with a second order memory window. Additional communication is indicated by the extra arrows, which are needed because there are now two buffer rows on the borders of each task (i.e. jbuf = 2).

For all processor numbers from pn = 2 to num_processors, the following prescribes the required communication for second order numerics (a sketch applying these rules follows the fourth order list below):

• copy all data from latitude row "jstask+1" on processor "pn" to latitude row "jetask" on processor "pn-1".

• copy all data from latitude row "jetask-1" on processor "pn-1" to latitude row "jstask" on processor "pn".

When a fourth order memory window is involved, the following communication is required:

• copy all data from latitude row "jstask+3" on processor "pn" to latitude row "jetask" on processor "pn-1".

• copy all data from latitude row "jstask+2" on processor "pn" to latitude row "jetask-1" on processor "pn-1".

• copy all data from latitude row "jetask-3" on processor "pn-1" to latitude row "jstask" on processor "pn".

• copy all data from latitude row "jetask-2" on processor "pn-1" to latitude row "jstask+1" on processor "pn".
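The sketch below (plain Python, not MOM's message passing module; the per-processor dictionaries are invented stand-ins for each processor's disk or ramdrive copy) applies the second order rules to the Figure 12.2 example with num_processors = 3, jmt = 14 and jbuf = 1. The fourth order case would add the analogous jstask+2/jstask+3 and jetask-2/jetask-3 copies from the second list.

    # task limits for the three processors, as quoted in Section 12.5.2
    limits = {1: (1, 6), 2: (5, 10), 3: (9, 14)}
    jbuf = 1

    tasks = {}
    for pn, (jstask, jetask) in limits.items():
        # each processor holds tau+1 data only on its computed rows jscomp..jecomp
        rows = {jrow: ("tau+1 from processor", pn)
                for jrow in range(jstask + jbuf, jetask - jbuf + 1)}
        tasks[pn] = {"jstask": jstask, "jetask": jetask, "rows": rows}

    # second order exchange; each assignment stands in for one communication call
    for pn in range(2, len(tasks) + 1):
        lower, upper = tasks[pn - 1], tasks[pn]
        # jstask+1 on "pn" fills boundary row jetask on "pn-1"
        lower["rows"][lower["jetask"]] = upper["rows"][upper["jstask"] + 1]
        # jetask-1 on "pn-1" fills boundary row jstask on "pn"
        upper["rows"][upper["jstask"]] = lower["rows"][lower["jetask"] - 1]

    print(tasks[1]["rows"][6])   # jrow 6 on processor #1 now holds processor #2's tau+1 data
    print(tasks[2]["rows"][5])   # jrow 5 on processor #2 now holds processor #1's tau+1 data
    # (the domain boundary rows jrow = 1 and jrow = jmt are not exchanged here)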

12.5.3 The barotropic solution

The 2-D barotropic equation is divided into tasks in the same way as was done for the prognostic equation. Since the processor boundaries are the same, communication involves the same rows. Each processor dimensions arrays for only that part of the domain being worked on by the specific processor and the actual memory requirement is small. Therefore, within each processor’s task no memory window is needed. All indexing into 2-D arrays is in terms of the absolute global index “jrow”.


Figure 12.1: a) A 2d domain decomposition using 9 processors. b) Rearranging the 9 processors for a 1d domain decomposition in latitude. c) Comparison of 1d and 2d domain decomposition giving number of communication calls and words transferred for 9, 100, and 900 processors.

