

3.2. SCMF implementation SOMA [34]

3.2.3 a. Hybrid parallelism

For the efficient utilization of modern high-performance computers, parallelism is important.

We employ a hybrid strategy: (i) To connect multiple shared-memory systems, an MPI-based parallelization is used. (ii) On shared-memory systems, like multi-core CPUs or an accelerator, the #pragma-based approaches of OpenMP or OpenACC are used to implement the parallel code.

MPI-based parallelization and load-balancing The SCMF model introduces implicit parallelism because all macromolecules interact only via quasi-instantaneous, fluctuating interaction fields. By definition of a macromolecule, distinct macromolecules are not connected by bonds. At initialization, entire macromolecules are efficiently distributed among the MPI ranks via parallel IO.

⁵ The C99 standard is used because of its fixed-width data types.

In the first stage of the SCMF simulation cycle, each MPI rank, $\alpha$, counts the number of particles, $n_i^{\alpha}(c)$, of the different segment types, $i$, that stem from its "own" molecules, and an MPI-all-reduce summation combines these grid-based partial occupation numbers of the MPI ranks into the density fields, $n_i(c) = \sum_{\alpha} n_i^{\alpha}(c) = \rho_0 \Delta L^3 \hat{\rho}_i(c)$, after each MC step. The resulting density fields on the entire grid are available on each MPI rank, and each MPI rank locally computes the interaction fields, $\omega_i(c)$, according to Eq. (2.52).

We observed that computing the interaction fields, ωi(c), from the occupation numbers, ni(c), redundantly on each MPI rank is faster than a distributed computation with subsequent communication. The calculation is a simple iteration over the grid cells with cache/coalescence-optimized memory accesses.

For typical parameters, the number, $n_i(c)$, of particles of type $i$ in a grid cell is less than or on the order of $\rho_0 \Delta L^3 = \sqrt{\bar N / N}\,(\Delta L/b)^3 \sim 10^2$. Therefore, we can safely encode these data as unsigned integers of 16-bit width. Thus, the occupation data of a system with linear dimension $L$ require a total memory of $2 n_t (L/\Delta L)^3 \approx 2 n_t N^{3/2} (L/R_{e0})^3$ bytes (the approximation corresponds to a grid spacing $\Delta L \approx b$).
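As a numerical illustration (with assumed but typical values of $n_t = 2$ monomer types, $N = 32$, and a cubic box of linear size $L = 10\,R_{e0}$), the occupation data amount to only

\[
  2\, n_t\, N^{3/2} \left(\frac{L}{R_{e0}}\right)^{3}\ \text{bytes}
  = 2 \cdot 2 \cdot 32^{3/2} \cdot 10^{3}\ \text{bytes}
  \approx 0.7\ \text{MB}
\]

per copy of the fields.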

Compared to the configuration data – i.e., the spatial positions, {r}, of all particles require about $24 N \sqrt{\bar N}\,(L/R_{e0})^3$ bytes – the occupation data, ni(c), are smaller by a factor of $\sqrt{\bar N / N}\,(12/n_t) \sim 10^2$. The MPI ranks exchange only the occupation data via the MPI-all-reduce collective sum operation, limiting the required throughput. Depending on the accelerator of the ranks and the MPI implementation, this operation might not require an intermediate copy of the occupation data to CPU memory. A CUDA-aware MPI implementation in combination with Nvidia GPUs as accelerators enables MPI operations directly on the GPU memory. This approach scales the application over multiple compute nodes using only a single MPI-all-reduce operation.
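To make the communication pattern concrete, the following is a minimal sketch – not the SOMA source; the function name, the flat array layout, and the OpenACC data management are assumptions – of the single collective operation per MC step for a 16-bit occupation array that resides in device memory:

#include <mpi.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the partial occupation numbers of all ranks in place, so that
 * afterwards every rank holds the complete fields n_i(c). */
void reduce_occupation_fields(uint16_t *n, size_t n_elements, MPI_Comm comm)
{
    /* With a CUDA-aware MPI, the device pointer can be handed to MPI
     * directly; host_data exposes it inside the structured block. */
    #pragma acc host_data use_device(n)
    {
        MPI_Allreduce(MPI_IN_PLACE, n, (int)n_elements,
                      MPI_UINT16_T, MPI_SUM, comm);
    }
}

Without a CUDA-aware MPI, the same call would simply be preceded and followed by an update of the host copy of n.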

In the second stage of the SCMF simulation cycle, each MPI rank propagates only its "own" molecules via a short MC simulation. This part is the most compute-intensive but does not require MPI communication.

Our molecule-based parallelization strategy without spatial domain decomposition enables an easy load balancing between MPI ranks: the slowest rank sends a number of its molecules to the fastest rank until the execution time per MC step is balanced across all MPI ranks. This might be necessary because the MPI ranks may use heterogeneous accelerators or because the sizes of the molecules differ among the ranks.
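The balancing decision itself needs only timing information. A hedged sketch of one possible realization – the names and the 10% imbalance tolerance are illustrative assumptions, not SOMA's actual interface – could look as follows:

#include <mpi.h>

/* Identify the slowest and the fastest rank from the measured time per
 * MC step and let them exchange molecules if the imbalance is too large. */
void balance_load(double time_per_step, MPI_Comm comm)
{
    struct { double t; int rank; } mine, slowest, fastest;
    MPI_Comm_rank(comm, &mine.rank);
    mine.t = time_per_step;

    MPI_Allreduce(&mine, &slowest, 1, MPI_DOUBLE_INT, MPI_MAXLOC, comm);
    MPI_Allreduce(&mine, &fastest, 1, MPI_DOUBLE_INT, MPI_MINLOC, comm);

    if (slowest.t > 1.1 * fastest.t) {   /* assumed 10% tolerance */
        if (mine.rank == slowest.rank) {
            /* pop a number of molecules and send them to fastest.rank */
        }
        if (mine.rank == fastest.rank) {
            /* receive the molecules and push them to the local configuration */
        }
    }
}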

Spatial domain decomposition The advantage of the previously described MPI-parallel scheme is its simplicity. Only a single MPI API call is required to synchronize all ranks during the main simulation routines. However, there are situations in which this simple scheme must fail. Because the scheme does not distribute the molecules based on their spatial position, the occupation-number grid ni(c) and the interaction fields ωi(c) must be held completely on every rank. Thus, the memory required for these fields cannot be reduced by increasing the number of MPI ranks, i.e., compute nodes. This lack of scaling exhausts the memory resources for very large system sizes, $nN \in \mathcal{O}(10^9)$ particles.

The solution is a spatial domain decomposition. It is a well-known concept applied in many popular MD codes [116, 138, 139]. Each domain is assigned a region in space, and it is guaranteed that only molecules with a center of mass inside this region can reside on a rank of the domain. With this restriction, each rank has to hold only the sections of the aforementioned fields inside its domain region. Thus, if the system size is doubled and the number of domains is doubled, the memory requirement per rank does not change. Inside each domain, multiple MPI ranks are still possible, and their parallelization scheme works as described in the previous section. The domain decomposition adds another layer of parallelism to the simulation. In addition, the number of ranks participating in the MPI-all-reduce operation is limited to the MPI ranks inside a domain, decreasing the all-to-all communication.

However, the domain decomposition adds two complications to the communication scheme. (i) Molecules may diffuse outside of their original domain and have to be transferred to another domain during a simulation and (ii) spatially adjacent domains interact over their boundary. The communication of this interaction has to be organized.

Note that the interaction between particles is only indirect via the density fields within the SCMF algorithm. Thus, a domain decomposition does not require duplication of particles to reside in multiple domains – so-called ghost particles.

Transferring a chain from one domain to another implies that the molecule is popped from the configuration of one rank and subsequently transferred to the other rank, where it is pushed to the configuration. Although this is nontrivial because data have to be pushed to and popped from the device memory, the basic concept is shared with the load-balancing scheme described earlier. Consequently, both features share a common molecule-transfer implementation.
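A simplified sketch of such a transfer – hypothetical names; the container bookkeeping and buffer management are omitted – moves the positions of one molecule between the device memories via the host:

#include <mpi.h>

/* Sender side: copy the molecule's 3*N coordinates from the device and send. */
void pop_and_send_molecule(double *r, int N, int dest, MPI_Comm comm)
{
    #pragma acc update self(r[0:3*N])        /* device -> host */
    MPI_Send(r, 3 * N, MPI_DOUBLE, dest, 0, comm);
    /* ... remove the molecule from the local configuration ... */
}

/* Receiver side: receive the coordinates and copy them to the device. */
void recv_and_push_molecule(double *r, int N, int src, MPI_Comm comm)
{
    MPI_Recv(r, 3 * N, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
    #pragma acc update device(r[0:3*N])      /* host -> device */
    /* ... append the molecule to the local configuration ... */
}

With a CUDA-aware MPI, the update directives could be avoided in the same way as for the all-reduce sketch above.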

The communication between adjacent domains is an entirely new aspect. Even if a molecule has its center of mass inside one domain, some of its particles can overlap into a neighbor domain. For this reason, each domain has so-called boundary layers that extend the fields into its spatial neighbors. For the MC moves of particles, there is no difference whether a simulation is performed with or without a domain decomposition.

Particles can freely enter the boundary layers of their domain. If a molecule diffuses from one domain to the next, it is transferred as soon as its center of mass moves across the domain border. Only the calculation and communication of the occupation numbers ni differ. In the first step, all ranks in a domain use an all-reduce operation to add up the occupation numbers of all the molecules they own. Afterward, one rank of each domain communicates with one rank of each adjacent domain; both add the counts of the other's boundary layers to their local fields. In the last step, the neighbor-communicating rank broadcasts the final occupation numbers to all ranks in its domain. Each rank can subsequently prepare the interaction fields ωi, including the inside of the boundary layer, for the MC move. The result of a simulation is the same for a run with and without a domain decomposition if they use the same seed and input; the integrated test suite of SOMA tests this identity. The physical results do not depend on the parallel execution scheme.
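Schematically – with illustrative names, a 1D decomposition, and the actual halo exchange abbreviated – the three communication steps per MC step can be arranged as follows:

#include <mpi.h>
#include <stddef.h>
#include <stdint.h>

/* n holds the domain's section of the fields including its boundary layers. */
void reduce_domain_fields(uint16_t *n, size_t n_elements,
                          int domain_root, MPI_Comm domain_comm)
{
    /* (1) Sum the partial occupation numbers of all ranks inside the domain. */
    MPI_Allreduce(MPI_IN_PLACE, n, (int)n_elements,
                  MPI_UINT16_T, MPI_SUM, domain_comm);

    int rank;
    MPI_Comm_rank(domain_comm, &rank);
    if (rank == domain_root) {
        /* (2) Exchange the boundary-layer counts with the root ranks of the
         *     left and right neighbor domains (e.g. via MPI_Sendrecv) and
         *     add the received counts to the local fields -- omitted here. */
    }

    /* (3) Broadcast the completed occupation numbers inside the domain. */
    MPI_Bcast(n, (int)n_elements, MPI_UINT16_T, domain_root, domain_comm);
}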

The size of the boundary layer depends on the maximum deviation of a segment position from the center of mass of its molecule. The bond connectivity of a molecule dictates this maximum deviation, i.e., a linear chain behaves differently than a star, a dendrimer, or a network. For a linear homopolymer, it is reasonable to assume that the molecular configuration is close to equilibrium; thus, the boundary layer scales with the end-to-end distance, $\sqrt{\langle R_e^2 \rangle} \propto \sqrt{N}$. In the case of diblock copolymers, the extension of the molecules increases as a function of the incompatibility χN. Describing the maximum extension of the fluctuating chain is in any case nontrivial, even though extreme-value statistics discusses this kind of situation [125]. However, it is safe to assume half of the average contour length, $L_c/2 = b_0(N-1)/2$, corresponding to a fully extended molecule without stretching its backbone, as an upper bound for the size of the boundary layer. Although it may be safe to use a smaller boundary layer, the implementation expects the user to choose a safe size. If a particle leaves the boundary layer of its domain, the code detects the error and aborts. The size of the boundary layer also limits the number of possible domains: the boundary layer of one domain must not exceed half the size of its neighbor domain, because it must not overlap with the boundary layer of its second-next neighbor. For simplicity, the current version of SOMA supports a spatial domain decomposition only in a single direction, i.e., parallel to the x-axis.
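Because the boundary layer must not exceed half the width of a neighbor domain, the conservative half-contour-length estimate directly bounds the number of domains along x. A small, hypothetical helper makes the relation explicit:

#include <math.h>

/* Upper bound on the number of 1D domains along x for a box of length Lx,
 * using the conservative boundary-layer size L_c/2 = b0*(N-1)/2.          */
unsigned int max_domains_x(double Lx, double b0, unsigned int N)
{
    const double boundary = 0.5 * b0 * (N - 1);   /* L_c / 2 */
    /* boundary <= half a domain width requires Lx/n_domains >= 2*boundary */
    return (unsigned int)floor(Lx / (2.0 * boundary));
}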

Single-rank parallelism The second level of our hybrid strategy concerns the parallelism within each MPI rank, on a shared-memory system or using an accelerator. In the first stage of the SCMF simulation cycle, the occupation numbers, ni(c), are gathered and the interaction fields, ωi(c), are computed, whereas in the second stage the particle configuration is propagated by a short MC simulation.

The computational effort of both stages scales with the number of particles; however, profiling showed that on all tested computer architectures the first stage takes only a fraction of the computation time compared to the second one. One reason is that the MC moves require multiple expensive generations of pseudo-random numbers. As a consequence, we focus in the following only on the MC moves.

In the course of an MC move, we choose a sequence of particles independent of the instantaneous configuration, {r}, try to displace each particle using local, random displacement or SMC proposals, and accept each attempt according to the Metropolis acceptance rate. In this second stage of the SCMF cycle, particles interact via the bonded interactions and with the temporarily constant interaction fields, ωi(c). Thus, particles that are not directly connected by bonds can be moved independently and simultaneously because the outcome of the MC lottery of one particle does not depend on or influence the outcome of the MC lottery of another particle. In order to exploit this intrinsic parallelism, we use a hierarchical scheme:

1. Parallelization over distinct molecules: By definition, particles that belong to distinct molecules are not connected by bonds and therefore can be moved independently and simultaneously. Thus, we assign each molecule to a parallel thread, i.e., each thread processes an entire molecule by sequentially choosing a particle of this molecule at random for an MC move. On average, each particle of the molecules is attempted to be moved once during an MC step. This simple parallelization over distinct molecules is efficient, provided that each MPI rank owns sufficiently more molecules than the maximal number of parallel threads.

Conventional multi-core systems fulfill this requirement, but highly parallel accelerators may be undersaturated with systems of moderate sizes. Snippet 3.1 shows the implementation of this scheme using OpenMP and OpenACC.



Fig. 3.6. Example of a linear polymer molecule with N = 32 monomers. Since particles are only bonded to their two nearest neighbors, every second monomer is independent. The coloring marks the two independent sets.

// parallel iteration of all molecules on rank
#pragma acc parallel loop vector_length(tuning_parameter)
#pragma omp parallel for
for (uint64_t mol = 0; mol < N_local_mol; mol++) {
    const unsigned int N = get_N(mol);
    // sequential iteration of particles
    #pragma acc loop seq
    for (int mono = 0; mono < N; mono++) {
        // select random particle to move
        const unsigned int i = rng(N);
        // Monte-Carlo scheme for particle i
    }
}

Code 3.1 Parallel molecule iteration scheme: implemented with OpenMP for shared memory systems and OpenACC for accelerators.

2. Iteration over independent particles within a molecule / “SET” algorithm:

The second approach exploits a finer level of parallelism by identifying sets of independent particles within a single molecule. All particles within one set belong to the same molecule but are not directly bonded with one another and therefore can be moved independently and simultaneously. An example of such a partitioning for a linear molecule is depicted in Figure 3.6: all even particles belong to set 0, whereas all odd particles belong to set 1. Our implementation of set generation uses a simple heuristic for the initialization of the sets of independent particles, aiming for a minimal number of sets with a large and approximately equal number of independent particles in each set (a sketch of one possible heuristic of this kind is given after Code 3.2). Since the architecture of a molecule type is fixed in the course of the simulation, the sets of independent particles for each type of molecule only need to be computed at initialization and remain constant afterward.

A full MC move sweeps in parallel over all distinct molecules as in the previous scheme. For each molecule, the algorithm sequentially works on every set in a predefined random order. The independent particles that belong to the same set are assigned to parallel threads that attempt the MC move. Thus, the total number of parallel threads is the product of the number of polymers and the (minimal) number of independent particles in a set. Snippet 3.2 demonstrates the implementation of this second layer of parallelism in OpenACC.

In this scheme, the attempt probability for each particle is exactly 1, as opposed to the former scheme, yielding a slightly faster dynamics. We also note that the sequential iteration over sets breaks detailed balance, but global balance is still obeyed because the sequence of particle moves is independent of the configuration, {r}.

#pragma acc parallel loop vector_length(tuning_parameter)
#pragma omp parallel for
for (uint64_t mol = 0; mol < N_local_mol; mol++) {
    // Generate random permutation of the sets
    // and store result in set_permutation[]

    // Flat 2d array, sorting elements in independent sets.
    // Computed at initialization and only accessed now.
    const unsigned int *sets = get_sets(mol);

    // Iterate sets in random order.
    #pragma acc loop seq
    for (unsigned int i_set = 0; i_set < n_sets; i_set++) {
        const unsigned int set_id = set_permutation[i_set];
        // Second parallel layer: iteration of set elements
        #pragma acc loop vector
        for (unsigned int i_p = 0; i_p < set_length[set_id]; i_p++) {
            unsigned int ip = sets[set_id * max_member + i_p];
            // MC scheme for particle ip
        }
    }
}

Code 3.2 MC scheme iteration with two levels of parallelization utilizing independent sets.
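As referenced in the description of the SET algorithm above, the independent sets can be initialized with a simple greedy coloring of a molecule's bond graph. The following sketch shows one possible heuristic – not necessarily the one used in SOMA; the CSR-style arrays bond_start/bond_partner and all other names are assumptions: each particle receives the smallest set index that none of its bonded neighbors already uses.

/* Greedy assignment of particles to independent sets for one molecule type.
 * Neighbors of particle p are bond_partner[bond_start[p] .. bond_start[p+1]-1]. */
void build_independent_sets(unsigned int N, const unsigned int *bond_start,
                            const unsigned int *bond_partner, unsigned int *set_of)
{
    for (unsigned int p = 0; p < N; p++)
        set_of[p] = (unsigned int)-1;               /* mark as unassigned */

    for (unsigned int p = 0; p < N; p++) {
        unsigned int color = 0;
        int clash = 1;
        while (clash) {
            clash = 0;
            for (unsigned int b = bond_start[p]; b < bond_start[p + 1]; b++) {
                if (set_of[bond_partner[b]] == color) {
                    color++;                        /* set taken by a bonded neighbor */
                    clash = 1;
                    break;
                }
            }
        }
        set_of[p] = color;   /* a linear chain ends up with the two alternating
                                sets of Fig. 3.6 */
    }
}

Grouping the particles by their resulting set index then yields the flat sets array that is accessed in Code 3.2.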