
5.4. Band Parallelization


Density functional calculations in the framework of a wave-function-based method require the evaluation of (at least) the number of occupied KS-states. When the number of atoms in a system grows, both its volume and its number of (valence) electrons grow linearly. To be able to tackle the more than quadratically growing workload we apply the real-space parallelization to the simulation volume as pointed out in Section 5.3. We still have to face the challenge of a linearly growing number of KS-states.

Matrix size $N$   BLACS grid   Time [sec]   System
3500              $16^2$       10.1         GST555
3500              $64^2$       11.0         GST555
14336             $16^2$       141.0        GST888
14336             $32^2$       94.1         GST888
14336             $64^2$       85.8         GST888

Table 5.4.: Performance data of the ScaLAPACK routine PDSYGVX on JUGENE for various node numbers. A block size of 64 has been used in all runs. Using all available $16^3$ MPI processes of the domain decomposition, solving the generalized eigenvalue problem of dimension 14336 can be as fast as 85.8 sec, where a workstation would need more than ten hours.
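As an aside, what PDSYGVX computes can be illustrated with a serial stand-in: the sketch below uses SciPy's generalized symmetric eigensolver for the problem $Hc = \varepsilon Sc$ that the parallel driver solves on the distributed BLACS grid. Matrix size, random seed and construction of $H$ and $S$ are arbitrary illustrations, not taken from the thesis.

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative size only; the production runs use N = 3500 or 14336.
N = 500
rng = np.random.default_rng(0)

# Random symmetric "Hamiltonian" H and symmetric positive-definite
# "overlap" S, mimicking the subspace matrices H_nm and S_nm.
A = rng.standard_normal((N, N))
H = 0.5 * (A + A.T)
B = rng.standard_normal((N, N))
S = B @ B.T + N * np.eye(N)      # shifted to guarantee positive definiteness

# Serial equivalent of the ScaLAPACK call PDSYGVX:
# solve H c = eps S c for all eigenpairs.
eps, C = eigh(H, S)

# The eigenvectors C define the subspace rotation applied to the bands.
print(eps[:5])
```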

Different $\sigma$- and $k$-points are parallelized on the outermost level as pointed out in Section 5.2. However, large systems often treat only the $\Gamma$-point, such that the maximum is a 2-fold parallelization over the spins, if applicable. Therefore, we focus in this section on the number of bands $N_{\mathrm{bands}}$. We employ a simple distribution of the total number of bands onto $p_{\mathrm{bnd}}$ parallel sets as $N_{\mathrm{bands}} = p_{\mathrm{bnd}} \times n_{\mathrm{bnd}}$. Each set stores $n_{\mathrm{bnd}}$ bands. We use the term sets here, since each set is internally parallelized in a domain decomposition such that we cannot speak of a single MPI process. $n_{\mathrm{bnd}}$ differs by 1 band between the sets if the total number of bands is not an integer multiple of $p_{\mathrm{bnd}}$.
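A minimal sketch of this band distribution, assuming a block scheme in which the first few sets receive the remainder bands (the thesis only states that $n_{\mathrm{bnd}}$ differs by at most one band between sets; the function name is illustrative):

```python
def band_distribution(n_bands_total: int, p_bnd: int, set_index: int):
    """Return (first_band, n_bnd) owned by band set `set_index`.

    Assumed block distribution: the remainder bands go to the first
    few sets, so n_bnd differs by at most 1 between sets.
    """
    base, remainder = divmod(n_bands_total, p_bnd)
    n_bnd = base + (1 if set_index < remainder else 0)
    first = set_index * base + min(set_index, remainder)
    return first, n_bnd

# Example: 14336 bands on 5 band sets -> n_bnd = 2868, 2867, 2867, 2867, 2867
print([band_distribution(14336, 5, s) for s in range(5)])
```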

The DIIS band update method, as introduced in Section 4.4.2, acts on each band separately. Therefore, each band set performs the DIIS update on its locally stored bands. Workload and memory are distributed simultaneously. However, band parallelization is more complex in the case of reorthogonalization by means of the subspace rotation, as pointed out in the following section.

5.4.1. Subspace matrix setup

The parallelization of bands introduces substantial communication into the matrix setup step. Consider the subspace rotation algorithm as described in Section 4.4.3 (serial) and Section 5.3.8 (parallel).

The total number of bands is parallelized by $p_{\mathrm{bnd}} > 1$ band sets, i.e. $N_{\mathrm{bands}} = p_{\mathrm{bnd}} \times n_{\mathrm{bnd}}$. The distribution subdivides the matrices $H_{nm}$ and $S_{nm}$ into blocks of $n_{\mathrm{bnd}} \times n_{\mathrm{bnd}}$ elements. The diagonal blocks consist of scalar products of $|\hat{\tilde{H}}\tilde{\Psi}_m\rangle$ ($|\hat{\tilde{O}}\tilde{\Psi}_m\rangle$) and bands $|\tilde{\Psi}_n\rangle$, both in local storage as in Section 5.3.8. In addition, we need to evaluate $p_{\mathrm{bnd}}(p_{\mathrm{bnd}}-1)$ off-diagonal matrix blocks. These consist of scalar products between the locally stored $|\hat{\tilde{H}}\tilde{\Psi}_m\rangle$ and bands $|\tilde{\Psi}_n\rangle$ stored in the other sets.
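As a schematic illustration of this block structure, the following NumPy sketch assembles $H_{nm}$ from $p_{\mathrm{bnd}} \times p_{\mathrm{bnd}}$ blocks; array names and sizes are hypothetical, and the grid integration weight as well as the domain decomposition of the grid are omitted for brevity.

```python
import numpy as np

# Hypothetical sizes: n_g grid points, p_bnd band sets, n_bnd bands per set
n_g, n_bnd, p_bnd = 1000, 4, 3
rng = np.random.default_rng(1)

psi   = [rng.standard_normal((n_g, n_bnd)) for _ in range(p_bnd)]  # |Psi~_n> of each set
h_psi = [rng.standard_normal((n_g, n_bnd)) for _ in range(p_bnd)]  # |H~ Psi~_m> of each set

# H_nm decomposes into p_bnd x p_bnd blocks of n_bnd x n_bnd elements:
#   block (i, i): only data local to band set i is needed,
#   block (i, j), i != j: the bands |Psi~_n> of the remote set i must be
#                         exchanged before set j can evaluate its column block.
H = np.block([[psi[i].T @ h_psi[j] for j in range(p_bnd)]
              for i in range(p_bnd)])
assert H.shape == (p_bnd * n_bnd, p_bnd * n_bnd)
```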

Figure 5.17.: Setup of matrix elements in band parallelization: In this example the inner real-space parallelization level contains 6 processes in a 3×2×1 domain decomposition. A 2×2 subset of the 6 grid processes contributes to the BLACS grid, since square grids achieve the highest performance on square matrix operations in ScaLAPACK. White-colored processes idle during the ScaLAPACK call. Bands are distributed over 5 parallel sets. Each band set evaluates the columns of the matrix $H_{nm}$ (and equivalently $S_{nm}$) that match the locally stored bands (indicated in red). Those matrix elements that result from scalar products of two locally stored bands are marked in a darker tone and can be computed without exchanging wave functions along the band communicator. For the other matrix elements, volume-intensive communication between the band sets is necessary.

Figure 5.17 shows the setup procedure schematically. For these scalar products, the wave functions need to be exchanged in explicit send-receive operations via the band communicator band_comm. A straightforward implementation without the exploitation of hermiticity would require $p_{\mathrm{bnd}}-1$ communication cycles. Then, each band set has access to the $N_{\mathrm{bands}} \times n_{\mathrm{bnd}}$ matrix elements $H_{nm}$ ($S_{nm}$) that are evaluated involving the locally stored $|\hat{\tilde{H}}\tilde{\Psi}_m\rangle$ ($|\hat{\tilde{O}}\tilde{\Psi}_m\rangle$). Since the generalized eigenvalue problem is solved in each band set separately, access to the complete matrices is needed. Therefore, a reduction operation along the band communicator is performed as the final step of the matrix setup procedure. The matrix elements are stored on the distributed BLACS grid, which has a similar process distribution in each band set. Therefore, the reduction operation only works on two $M \times M$ matrices, where $M$ is defined in the ScaLAPACK context as

$$M = \left\lceil \frac{N_{\mathrm{bands}}}{\mathrm{BS} \cdot N_p} \right\rceil \cdot \mathrm{BS} \qquad (5.29)$$

with the block size BS and a BLACS grid of $N_p \times N_p$ processes. These two communication operations are introduced additionally by the band parallelization; of the two, the exchange of the wave functions takes considerably more time than the reduction operation.
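As a numerical illustration (taking, for this purpose, the Table 5.4 parameters $N_{\mathrm{bands}} = 14336$, $\mathrm{BS} = 64$ and a $16 \times 16$ BLACS grid), Eq. (5.29) gives $M = \lceil 14336/(64 \cdot 16) \rceil \cdot 64 = 14 \cdot 64 = 896$, i.e. the reduction acts on two local $896 \times 896$ matrices per BLACS process.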

We exploit that the Hamiltonian and the overlap operator are hermitian (symmetric in the case of real wave functions) quantities, i.e. $H_{nm} = H_{mn}^{*}$ and $S_{nm} = S_{mn}^{*}$. This leads to a simple setup of $N_{\mathrm{bands}}(N_{\mathrm{bands}}+1)/2$ elements of the lower triangular matrix when all bands are stored locally. Savings of 44 % compared to the full setup of $N_{\mathrm{bands}}^2$ elements can be achieved for $N_{\mathrm{bands}} = 3500$, as shown in Table 5.5. This is a good result considering that roughly 12 % of the time for the full matrix setup is spent in the application of the operators; halving the remaining 88 % of scalar-product work accounts for the observed 44 %.

                           $T_{\mathrm{full}}$ [sec]   $T_{\mathrm{sym}}$ [sec]   Speedup
no band parallelization    250.88                      141.40                     1.774
16× band parallelization   23.86                       13.39                      1.782

Table 5.5.: Times for the subspace matrix setup exploiting the symmetry of symmetric/hermitian operators (sym) compared to the setup of all matrix elements (full). The system was Ge125Sb250Te500 on a $16^3$ domain decomposition, with and without band parallelization. In both cases roughly 44 % of the matrix setup time is saved.

In band parallelization, $N_{\mathrm{bands}} = n_{\mathrm{bnd}} \times p_{\mathrm{bnd}}$, we also have to evaluate matrix blocks of locally stored bands with remotely stored bands. The symmetry allows us to compute all matrix blocks in $\lfloor p_{\mathrm{bnd}}/2 \rfloor$ cycles instead of $p_{\mathrm{bnd}}-1$ cycles. Figure 5.18 shows this scenario for $p_{\mathrm{bnd}}=3$ (odd) and $p_{\mathrm{bnd}}=4$ (even). The odd case is advantageous because the number of off-diagonal matrix blocks that needs to be evaluated, $p_{\mathrm{bnd}}(p_{\mathrm{bnd}}-1)/2$, divides evenly among the band sets, i.e. $(p_{\mathrm{bnd}}-1)/2$ blocks per set is an integer. We thus only need to perform $\lfloor p_{\mathrm{bnd}}/2 \rfloor = (p_{\mathrm{bnd}}-1)/2$ cycles, in which each process evaluates $(p_{\mathrm{bnd}}-1)/2$ off-diagonal blocks in total.

In the case of even $p_{\mathrm{bnd}}$, $(p_{\mathrm{bnd}}-1)/2$ is not an integer, such that only half of the processes need to perform a block evaluation in the last cycle. Nevertheless, also for a 16-fold band parallelization we observe 44 % savings in the matrix setup time, compare Table 5.5. A sketch of such a cycle schedule is given below.
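A minimal sketch of one possible cycle schedule that realizes this counting, assuming a ring-offset pairing of the band sets (the actual pairing used in the code may differ, and the function name is illustrative):

```python
def offdiag_schedule(p_bnd: int):
    """Return, per communication cycle, the (evaluating_set, partner_set)
    pairs of off-diagonal blocks under a ring-offset pairing.

    Every unordered pair of band sets appears exactly once over all
    floor(p_bnd/2) cycles; in the last cycle of an even p_bnd only half
    of the sets evaluate a block while their partners idle.
    """
    cycles = []
    for c in range(1, p_bnd // 2 + 1):
        pairs = []
        for i in range(p_bnd):
            j = (i + c) % p_bnd
            # For even p_bnd the last cycle pairs the sets mutually
            # (i <-> i + p_bnd/2); only the lower-ranked set works.
            if 2 * c == p_bnd and i >= p_bnd // 2:
                continue
            pairs.append((i, j))
        cycles.append(pairs)
    return cycles

# Sanity check: every unordered pair of band sets is covered exactly once.
for p in (3, 4, 5, 16):
    seen = [frozenset(pair) for cyc in offdiag_schedule(p) for pair in cyc]
    assert len(seen) == len(set(seen)) == p * (p - 1) // 2
```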

Figure 5.18.: Setup of subspace matrix elements of hermitian operators. Only the lower triangular matrix is evaluated (red). In band parallelization (3-fold on the left, 4-fold on the right) matrix elements between locally stored bands and remotely stored bands can be evaluated in $\lfloor p_{\mathrm{bnd}}/2 \rfloor$ cycles. In the second cycle (green) of the 4-fold band parallelization only half of the processes need to evaluate their matrix blocks.

A workflow description of the ScaLAPACK interface can be found in Appendix C.1.9.

5.4.2. Performance

[Figure 5.19: SCF-iteration time [sec] (64–512) versus the number of band sets (1–64) on the lower axis and the corresponding number of processes (4096–262144) on the upper axis, for Ge125Sb250Te500; curves: ideal, SR, DIIS+SR.]

Figure 5.19.: Strong scaling of the SCF-iteration time in band parallelization. Increasing the number of processes in the band parallelization, we find that the execution time decreases for up to a 32-fold parallelization. The underlying real-space grid parallelization is a $16 \times 16 \times 16$ domain decomposition, i.e. 4096 JUGENE cores. The last data point (64 band sets) has been run on $2^{18}$ cores, 88.9 % of all available nodes.

As outlined in Section 5.1.2, intensive all-to-all communication causes a communication time which grows proportionally to the number of communication partners, i.e. the number of parallel band sets in this case. Figure 5.19 shows the strong scaling behavior on JUGENE in band parallelization. We can easily see a maximum of the speedup at a band parallelization with 32 parallel sets, each with an internal domain decomposition on $16^3$ MPI processes.
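The existence of such an optimum can be made plausible with a toy cost model that merely restates the argument above (the model and its parameters are illustrative assumptions, not fitted to the measured data): the computational work shrinks like $1/p_{\mathrm{bnd}}$ while the wave-function exchange grows with the number of communication partners.

```python
# Toy cost model (illustrative only): compute time is divided by the number
# of band sets, communication grows with the number of partners.
def scf_time_model(p_bnd: int, t_compute: float = 512.0, t_comm: float = 0.5) -> float:
    return t_compute / p_bnd + t_comm * (p_bnd - 1)

best = min((scf_time_model(p), p) for p in (1, 2, 4, 8, 16, 32, 64))
print(best)  # with these toy parameters the minimum lies at p_bnd = 32
```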
