
Refer to Figure 3.11 which schematically shows the difference between code structures. Let the coarse grained approach be viewed as the MOM 1 code structure (although MOM 2 can simulate this) and the fine grained approach be viewed as the MOM 2 code structure.

In the coarse grained approach (MOM 1 coding), the parallel region shown in red is the latitude loop, which wraps around all of the dynamics and physics. On a PVP (Parallel Vector Processor) like the CRAY YMP, this work can be reduced to a succession of doubly nested do loops as indicated, where the scalar region is marked in blue and the vector region is marked in green. Each processor gets all the work for one latitude row and all processors work simultaneously on separate latitudes. Since the parallel loop contains the totality of doubly nested do loops, there is no shortage of work for any processor. To ensure a static load balance, jmt-2 should be an integral multiple of the number of processors; otherwise some will stand idle on any CRAY system.

[18] This is done to conserve memory. However, it is an interesting counter-intuitive point that sometimes it is better to do redundant calculations than to save intermediate results in temporary work arrays. Depending on the speed of the load and store operations compared with the multiply and add operations, if the redundant computation contains little work it may be faster to compute redundantly. In practice, this depends on the computer and the compiler optimizations. This can be verified by executing the script run timer, which exercises the timing utilities in module timer.F by solving a tracer equation in various ways. It is one measure of a mature compiler when the speed differences implied by solving the equations in various ways are relatively small. On the CRAY YMP the differences are about 10%, whereas on the SGI workstation they can reach 100%.
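To make the coarse grained structure concrete, the following is a minimal sketch and not actual MOM 1 code: the array name t, the sizes imt, jmt, km, and the work inside the loops are placeholders, and the comment stands in for the CRAY autotasking directive (e.g. CMIC$ DO ALL) that would mark the latitude loop as the parallel region.

      program coarse
c     Minimal sketch of coarse grained (MOM 1 style) parallelism:
c     the parallel region is the single latitude loop, and each
c     iteration contains complete doubly nested (depth, longitude)
c     loops, so every processor gets a full latitude row of work.
      integer imt, jmt, km
      parameter (imt=182, jmt=122, km=15)
      real t(imt,km,jmt)
      integer i, j, k
c     parallel region (on a CRAY this loop would carry an
c     autotasking directive such as CMIC$ DO ALL)
      do 110 j=2,jmt-1
c       all dynamics and physics for latitude row j:
c       scalar work per row, vectorized inner loop over longitude
        do 100 k=1,km
          do 90 i=2,imt-1
            t(i,k,j) = float(i+k+j)
   90     continue
  100   continue
  110 continue
      end

Because the parallel loop has jmt-2 iterations and each iteration holds all of the row's work, the load balances statically whenever jmt-2 divides evenly among the processors.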

In the fine grained approach (MOM 2 coding), a group of latitudes is solved within an expanded memory window. An outer loop controls the number of these latitude groups [19] and wraps around all of the dynamics and physics. Again, this work can be viewed as a succession of triply nested do loops. However, each triplet is now itself a parallel region with the scalar and vector region on the inside. In contrast to coarse grained parallelism (MOM 1 code), all processors work in parallel on each triplet. When one triplet is finished, all processors synchronize and move on to the next triplet and so forth until all loops within the group of latitudes are finished. Then the next group of latitudes is solved in a similar manner until all rows are solved.
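A corresponding minimal sketch of the fine grained (MOM 2 style) structure is given below. It is hypothetical rather than actual MOM 2 code: the array name, the window size, the number of latitude groups, and the work are placeholders, and the disk traffic is indicated only by comments. The point is that the parallel loops are now the j loops inside the window, each followed by a synchronization.

      program fine
c     Minimal sketch of fine grained (MOM 2 style) parallelism:
c     an outer loop moves the memory window northward over groups
c     of latitudes; inside, each triply nested loop is itself a
c     parallel region over the window rows j = jsmw,...,jemw, and
c     all processors synchronize before the next triplet starts.
      integer imt, km, jmw, jsmw, jemw, ngroup
      parameter (imt=182, km=15, jmw=5)
      parameter (jsmw=2, jemw=jmw-1)
      parameter (ngroup=40)
      real t(imt,km,jmw)
      integer i, j, k, n
      do 300 n=1,ngroup
c       (read the next group of latitude rows into the window)
c       parallel region: first triplet of loops for this group
        do 110 j=jsmw,jemw
          do 100 k=1,km
            do 90 i=2,imt-1
              t(i,k,j) = float(i+k+j)
   90       continue
  100     continue
  110   continue
c       synchronization here, then the next parallel triplet, and
c       so on until all loops for this group are finished
c       (write the updated rows back to disk)
  300 continue
      end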

Using ATExpert on GFDL's CRAY C90 with Fortran 77, MOM 2 with 1° horizontal resolution and 15 vertical levels (TEST CASE 0) was estimated to have 80% parallel efficiency. This means that on an 8 processor system, only about 3 processors will be saturated. Equivalently, about 3 such jobs would be required to saturate the system. On a 32 processor system, about 4.4 processors would be occupied, which implies about 8 such jobs are needed to saturate the system. This is clearly not desirable in a dedicated environment or one with hundreds of processors. However, it may be tolerable in a multiprogramming environment where the number of jobs significantly exceeds the number of processors. The important consideration is that the job mix should not dry up to the point of leaving fewer jobs than processors.
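These estimates are consistent with the standard Amdahl's law arithmetic, assuming (an interpretation added here, not a statement from ATExpert itself) that the 80% figure is the parallel fraction f of the work:

    S(N) = 1 / ((1 - f) + f/N),  with f = 0.8
    S(8)  = 1 / (0.2 + 0.8/8)  = 1 / 0.300 = 3.3
    S(32) = 1 / (0.2 + 0.8/32) = 1 / 0.225 = 4.4

That is, an 8 processor system behaves like about 3.3 saturated processors and a 32 processor system like about 4.4, which is where the "about 3 jobs" (8/3.3) and "about 8 jobs" (32/4.4) figures above come from.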

In general, the fine grained approach has lower parallel efficiency than the coarse grained approach. The reason is basically that there is not enough work within the many parallel regions, and that the range of the parallel loop index is not the same for all loops. Increasing the model resolution can somewhat address the first problem but not the second. The result is that processors occasionally stand idle. As compilers get smarter, and are better able to "fuse" parallel regions, the efficiency of the fine grained approach can be expected to increase. However, it is difficult to imagine that efficiencies will approach those of the coarse grained approach.
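As an illustration of what such fusion means, consider the hypothetical sketch below (array names and sizes are placeholders): two parallel regions that share the same loop range j = jsmw,...,jemw could be merged, by a compiler or by hand, into one parallel region, halving the number of synchronization points.

      program fuse
c     Hypothetical illustration of fusing parallel regions that
c     share the same loop range (j = jsmw,...,jemw).
      integer imt, km, jmw, jsmw, jemw
      parameter (imt=182, km=15, jmw=5)
      parameter (jsmw=2, jemw=jmw-1)
      real a(imt,km,jmw), b(imt,km,jmw)
      integer i, j, k
c     unfused: two parallel regions, two synchronization points
      do 110 j=jsmw,jemw
        do 100 k=1,km
          do 90 i=2,imt-1
            a(i,k,j) = float(i+k+j)
   90     continue
  100   continue
  110 continue
      do 210 j=jsmw,jemw
        do 200 k=1,km
          do 190 i=2,imt-1
            b(i,k,j) = 2.0*a(i,k,j)
  190     continue
  200   continue
  210 continue
c     fused: one parallel region, one synchronization point
      do 310 j=jsmw,jemw
        do 300 k=1,km
          do 290 i=2,imt-1
            a(i,k,j) = float(i+k+j)
            b(i,k,j) = 2.0*a(i,k,j)
  290     continue
  300   continue
  310 continue
      end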

To illustrate the problem of too little work, consider a memory window of size jmw=5. Calculations would normally be thought of as involving a latitude loop index ranging from j = jsmw to jemw (where jsmw=2 and jemw=4), and so the appropriate number of processors would be 3. However, the latitude loop index for calculating meridional fluxes would range from j = jsmw-1 to jemw, because meridional fluxes are needed on the northern and southern faces of all cells on latitude rows 2 through 4. This means that all processors will first busy themselves by calculating meridional fluxes for the northern faces of cells on rows 1 through 3.

When done, one processor calculates the flux for the northern face of cells on row 4 while the other two processors stand idle, because they do not advance to the next triplet until all are synchronized at the bottom of the loop. Since not all parallel loops have the same range of indices, there must necessarily be idle processors regardless of the amount of work inside the loops.
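The following hypothetical sketch (names and sizes are placeholders, not MOM 2 code) shows the two loop ranges in question: with three processors, the four-iteration flux loop leaves two processors idle while the third finishes row jemw, whereas the subsequent three-iteration loop divides evenly.

      program imbal
c     Hypothetical sketch of the load imbalance described above for
c     jmw=5 (jsmw=2, jemw=4) and three processors.
      integer imt, km, jmw, jsmw, jemw
      parameter (imt=182, km=15, jmw=5)
      parameter (jsmw=2, jemw=jmw-1)
      real advfn(imt,km,jmw), t(imt,km,jmw)
      integer i, j, k
c     parallel region: meridional fluxes on northern cell faces,
c     j = jsmw-1,...,jemw (four iterations for three processors)
      do 110 j=jsmw-1,jemw
        do 100 k=1,km
          do 90 i=2,imt-1
            advfn(i,k,j) = float(i+k+j)
   90     continue
  100   continue
  110 continue
c     synchronization, then parallel region: flux differences on
c     rows j = jsmw,...,jemw (three iterations, one per processor)
      do 210 j=jsmw,jemw
        do 200 k=1,km
          do 190 i=2,imt-1
            t(i,k,j) = advfn(i,k,j) - advfn(i,k,j-1)
  190     continue
  200   continue
  210 continue
      end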

[19] Or equivalently, the number of times the memory window is moved northward to solve for all latitudes.


Figure 3.2: a) Loading the memory from disk. b) Updating the central row and writing results back to disk.


Figure 3.3: a) A slice through the volume of three dimensional prognostic variables on disk. Note that only two tracers (nt=2) are assumed. For nt > 2, a new box is added for each tracer. b) Schematic of a memory window of size jmw=6 holding the slice. c) Two dimensional schematic of the memory window. d) One dimensional schematic of the memory window.


Figure 3.4: Schematic of dataflow between disk and memory for one timestep with a memory window size of jmw=3 in MOM 2.


Figure 3.5: Example of dataflow between disk and memory for one timestep with a memory window size of jmw=4 and the biharmonic option in MOM 2.


Figure 3.6: Example of dataflow between disk and memory for one timestep with a memory window size of jmw=5 and the biharmonic option in MOM 2.


Figure 3.7: Schematic of dataflow between disk and memory for one timestep with a memory window size of jmw=5 in MOM 2.


[Figure: schematic of read/writes between disk and the memory window for a stationary memory window (jmw = jmt); calculated rows and boundary rows in the MW are indicated.]