
5.4. Band Parallelization


Density functional calculations in the framework of a wave-function-based method require the evaluation of (at least) the number of occupied KS-states. When the number of atoms in a system grows, both its volume and its number of (valence) electrons grow linearly. To be able to tackle the more than quadratically growing workload we apply the real-space parallelization to the simulation volume as pointed out in Section 5.3. We still have to face the challenge of a linearly growing number of KS-states.

Matrix size $N$   BLACS grid   Time [sec]   System
3500              $16^2$       10.1         GST555
3500              $64^2$       11.0         GST555
14336             $16^2$       141.0        GST888
14336             $32^2$       94.1         GST888
14336             $64^2$       85.8         GST888

Table 5.4.: Performance data of the ScaLAPACK routine PDSYGVX on JUGENE for various node numbers. A block size of 64 has been used in all runs. Using all available $16^3$ MPI processes of the domain decomposition, solving the generalized eigenvalue problem of dimension 14336 can be as fast as 85.8 sec, where a workstation would need more than ten hours.
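As an aside, what PDSYGVX computes can be illustrated with a serial stand-in: the sketch below uses SciPy's generalized symmetric eigensolver for the problem $Hc = \varepsilon Sc$ that the parallel driver solves on the distributed BLACS grid. Matrix size, random seed and construction of $H$ and $S$ are arbitrary illustrations, not taken from the thesis.

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative size only; the production runs use N = 3500 or 14336.
N = 500
rng = np.random.default_rng(0)

# Random symmetric "Hamiltonian" H and symmetric positive-definite
# "overlap" S, mimicking the subspace matrices H_nm and S_nm.
A = rng.standard_normal((N, N))
H = 0.5 * (A + A.T)
B = rng.standard_normal((N, N))
S = B @ B.T + N * np.eye(N)      # shifted to guarantee positive definiteness

# Serial equivalent of the ScaLAPACK call PDSYGVX:
# solve H c = eps S c for all eigenpairs.
eps, C = eigh(H, S)

# The eigenvectors C define the subspace rotation applied to the bands.
print(eps[:5])
```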

Different $\sigma$- and $k$-points are parallelized on the outermost level as pointed out in Section 5.2. However, large systems often treat only the $\Gamma$-point, such that the maximum is a 2-fold parallelization over the spins, if applicable. Therefore, we focus in this section on the number of bands $N_{\mathrm{bands}}$. We employ a simple distribution of the total number of bands onto $p_{\mathrm{bnd}}$ parallel sets as $N_{\mathrm{bands}} = p_{\mathrm{bnd}} \times n_{\mathrm{bnd}}$. Each set stores $n_{\mathrm{bnd}}$ bands. We use the term sets here, since each set is internally parallelized in a domain decomposition such that we cannot speak of a single MPI process. $n_{\mathrm{bnd}}$ differs by 1 band between the sets if the total number of bands is not an integer multiple of $p_{\mathrm{bnd}}$.
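A minimal sketch of this band distribution, assuming a block scheme in which the first few sets receive the remainder bands (the thesis only states that $n_{\mathrm{bnd}}$ differs by at most one band between sets; the function name is illustrative):

```python
def band_distribution(n_bands_total: int, p_bnd: int, set_index: int):
    """Return (first_band, n_bnd) owned by band set `set_index`.

    Assumed block distribution: the remainder bands go to the first
    few sets, so n_bnd differs by at most 1 between sets.
    """
    base, remainder = divmod(n_bands_total, p_bnd)
    n_bnd = base + (1 if set_index < remainder else 0)
    first = set_index * base + min(set_index, remainder)
    return first, n_bnd

# Example: 14336 bands on 5 band sets -> n_bnd = 2868, 2867, 2867, 2867, 2867
print([band_distribution(14336, 5, s) for s in range(5)])
```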

The DIIS band update method, as introduced in Section 4.4.2, acts on each band separately. Therefore, each band set performs the DIIS update on its locally stored bands. Workload and memory are distributed simultaneously. However, band parallelization is more complex in the case of reorthogonalization by means of the subspace rotation, as pointed out in the following section.

5.4.1. Subspace matrix setup

The parallelization of bands introduces substantial communication into the matrix setup step. Consider the subspace rotation algorithm as described in Section 4.4.3 (serial) and Section 5.3.8 (parallel).

The total number of bands is parallelized by $p_{\mathrm{bnd}} > 1$ band sets, i.e. $N_{\mathrm{bands}} = p_{\mathrm{bnd}} \times n_{\mathrm{bnd}}$. The distribution subdivides the matrices $H_{nm}$ and $S_{nm}$ into blocks of $n_{\mathrm{bnd}} \times n_{\mathrm{bnd}}$ elements. The diagonal blocks consist of scalar products of $|\hat{\tilde{H}}\tilde{\Psi}_m\rangle$ ($|\hat{\tilde{O}}\tilde{\Psi}_m\rangle$) and bands $|\tilde{\Psi}_n\rangle$, both in local storage as in Section 5.3.8. In addition, we need to evaluate $p_{\mathrm{bnd}}(p_{\mathrm{bnd}}-1)$ off-diagonal matrix blocks. These consist of scalar products between the locally stored $|\hat{\tilde{H}}\tilde{\Psi}_m\rangle$ and bands $|\tilde{\Psi}_n\rangle$ stored in the other sets.
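As a schematic illustration of this block structure, the following NumPy sketch assembles $H_{nm}$ from $p_{\mathrm{bnd}} \times p_{\mathrm{bnd}}$ blocks; array names and sizes are hypothetical, and the grid integration weight as well as the domain decomposition of the grid are omitted for brevity.

```python
import numpy as np

# Hypothetical sizes: n_g grid points, p_bnd band sets, n_bnd bands per set
n_g, n_bnd, p_bnd = 1000, 4, 3
rng = np.random.default_rng(1)

psi   = [rng.standard_normal((n_g, n_bnd)) for _ in range(p_bnd)]  # |Psi~_n> of each set
h_psi = [rng.standard_normal((n_g, n_bnd)) for _ in range(p_bnd)]  # |H~ Psi~_m> of each set

# H_nm decomposes into p_bnd x p_bnd blocks of n_bnd x n_bnd elements:
#   block (i, i): only data local to band set i is needed,
#   block (i, j), i != j: the bands |Psi~_n> of the remote set i must be
#                         exchanged before set j can evaluate its column block.
H = np.block([[psi[i].T @ h_psi[j] for j in range(p_bnd)]
              for i in range(p_bnd)])
assert H.shape == (p_bnd * n_bnd, p_bnd * n_bnd)
```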

Figure 5.17.: Setup of matrix elements in band parallelization: In this example the inner real-space parallelization level contains 6 processes in a 3×2×1 domain decomposition. A 2×2 subset of the 6 grid processes contributes to the BLACS grid, since square grids achieve the highest performance on square matrix operations in ScaLAPACK. White-colored processes idle during the ScaLAPACK call. Bands are distributed over 5 parallel sets. Each band set evaluates the columns of the matrix $H_{nm}$ (and equivalently $S_{nm}$) that match the locally stored bands (indicated in red). Those matrix elements that result from scalar products of two locally stored bands are marked in a darker tone and can be computed without exchanging wave functions along the band communicator. For the other matrix elements, volume-intensive communication between the band sets is necessary.

Figure 5.17 shows the setup procedure schematically. For these scalar products, the wave functions need to be exchanged in explicit send-receive operations via the band communicator band_comm. A straightforward implementation without the exploitation of hermiticity would require $p_{\mathrm{bnd}}-1$ communication cycles. Then, each band set has access to the $N_{\mathrm{bands}} \times n_{\mathrm{bnd}}$ matrix elements $H_{nm}$ ($S_{nm}$) that are evaluated involving the locally stored $|\hat{\tilde{H}}\tilde{\Psi}_m\rangle$ ($|\hat{\tilde{O}}\tilde{\Psi}_m\rangle$). Since the generalized eigenvalue problem is solved in each band set separately, access to the complete matrices is needed. Therefore, a reduction operation along the band communicator is performed as the final step of the matrix setup procedure. The matrix elements are stored on the distributed BLACS grid, which has a similar process distribution in each band set. Therefore, the reduction operation only works on two $M \times M$ matrices, where $M$ is defined in the ScaLAPACK context as

$$M = \left\lceil \frac{N_{\mathrm{bands}}}{\mathrm{BS} \cdot N_p} \right\rceil \cdot \mathrm{BS} \qquad (5.29)$$

with the block size BS and a BLACS grid of $N_p \times N_p$ processes. These two communication operations are introduced additionally by the band parallelization; of the two, the exchange of the wave functions takes considerably more time than the reduction operation.
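As a numerical illustration (taking, for this purpose, the Table 5.4 parameters $N_{\mathrm{bands}} = 14336$, $\mathrm{BS} = 64$ and a $16 \times 16$ BLACS grid), Eq. (5.29) gives $M = \lceil 14336/(64 \cdot 16) \rceil \cdot 64 = 14 \cdot 64 = 896$, i.e. the reduction acts on two local $896 \times 896$ matrices per BLACS process.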

We exploit that the Hamiltonian and the overlap operator are hermitian (symmetric in the case of real wave functions) quantities, i.e. $H_{nm} = H_{mn}^{*}$ and $S_{nm} = S_{mn}^{*}$. This leads to a simple setup of $N_{\mathrm{bands}}(N_{\mathrm{bands}}+1)/2$ elements of the lower triangular matrix when all bands are stored locally. Savings of 44 % compared to the full setup of $N_{\mathrm{bands}}^2$ elements can be achieved for $N_{\mathrm{bands}} = 3500$, as shown in Table 5.5. This is a good result considering that roughly 12 % of the time for the full matrix setup is spent in the application of the operators; halving the remaining 88 % of scalar-product work accounts for the observed 44 %.

                           $T_{\mathrm{full}}$ [sec]   $T_{\mathrm{sym}}$ [sec]   Speedup
no band parallelization    250.88                      141.40                     1.774
16× band parallelization   23.86                       13.39                      1.782

Table 5.5.: Times for the subspace matrix setup exploiting the symmetry of symmetric/hermitian operators (sym) compared to the setup of all matrix elements (full). The system was Ge125Sb250Te500 on a $16^3$ domain decomposition, with and without band parallelization. In both cases roughly 44 % of the matrix setup time is saved.

In band parallelization, $N_{\mathrm{bands}} = n_{\mathrm{bnd}} \times p_{\mathrm{bnd}}$, we also have to evaluate matrix blocks of locally stored bands with remotely stored bands. The symmetry allows us to compute all matrix blocks in $\lfloor p_{\mathrm{bnd}}/2 \rfloor$ cycles instead of $p_{\mathrm{bnd}}-1$ cycles. Figure 5.18 shows this scenario for $p_{\mathrm{bnd}}=3$ (odd) and $p_{\mathrm{bnd}}=4$ (even). The odd case is advantageous because the number of off-diagonal matrix blocks that needs to be evaluated, $p_{\mathrm{bnd}}(p_{\mathrm{bnd}}-1)/2$, divides evenly among the band sets, i.e. $(p_{\mathrm{bnd}}-1)/2$ blocks per set is an integer. We thus only need to perform $\lfloor p_{\mathrm{bnd}}/2 \rfloor = (p_{\mathrm{bnd}}-1)/2$ cycles, in which each process evaluates $(p_{\mathrm{bnd}}-1)/2$ off-diagonal blocks in total.

In the case of even $p_{\mathrm{bnd}}$, $(p_{\mathrm{bnd}}-1)/2$ is not an integer, such that only half of the processes need to perform a block evaluation in the last cycle. Nevertheless, also for a 16-fold band parallelization we observe 44 % savings in the matrix setup time, compare Table 5.5. A sketch of such a cycle schedule is given below.
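A minimal sketch of one possible cycle schedule that realizes this counting, assuming a ring-offset pairing of the band sets (the actual pairing used in the code may differ, and the function name is illustrative):

```python
def offdiag_schedule(p_bnd: int):
    """Return, per communication cycle, the (evaluating_set, partner_set)
    pairs of off-diagonal blocks under a ring-offset pairing.

    Every unordered pair of band sets appears exactly once over all
    floor(p_bnd/2) cycles; in the last cycle of an even p_bnd only half
    of the sets evaluate a block while their partners idle.
    """
    cycles = []
    for c in range(1, p_bnd // 2 + 1):
        pairs = []
        for i in range(p_bnd):
            j = (i + c) % p_bnd
            # For even p_bnd the last cycle pairs the sets mutually
            # (i <-> i + p_bnd/2); only the lower-ranked set works.
            if 2 * c == p_bnd and i >= p_bnd // 2:
                continue
            pairs.append((i, j))
        cycles.append(pairs)
    return cycles

# Sanity check: every unordered pair of band sets is covered exactly once.
for p in (3, 4, 5, 16):
    seen = [frozenset(pair) for cyc in offdiag_schedule(p) for pair in cyc]
    assert len(seen) == len(set(seen)) == p * (p - 1) // 2
```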

Figure 5.18.: Setup of subspace matrix elements of hermitian operators. Only the lower triangular matrix is evaluated (red). In band parallelization (3-fold on the left, 4-fold on the right) matrix elements between locally stored bands and remotely stored bands can be evaluated in $\lfloor p_{\mathrm{bnd}}/2 \rfloor$ cycles. In the second cycle (green) of the 4-fold band parallelization only half of the processes need to evaluate their matrix blocks.

A workflow description of the ScaLAPACK interface can be found in Appendix C.1.9.

5.4.2. Performance

[Figure 5.19: SCF-iteration time [sec] (64–512) versus the number of band sets (1–64) on the lower axis and the corresponding number of processes (4096–262144) on the upper axis, for Ge125Sb250Te500; curves: ideal, SR, DIIS+SR.]

Figure 5.19.: Strong scaling of the SCF-iteration time in band parallelization. Increasing the number of processes in the band parallelization, we find that the execution time decreases for up to a 32-fold parallelization. The underlying real-space grid parallelization is a $16 \times 16 \times 16$ domain decomposition, i.e. 4096 JUGENE cores. The last data point (64 band sets) has been run on $2^{18}$ cores, 88.9 % of all available nodes.

As outlined in Section 5.1.2, intensive all-to-all communication causes a communication time which grows proportionally to the number of communication partners, i.e. the number of parallel band sets in this case. Figure 5.19 shows the strong scaling behavior on JUGENE in band parallelization. We can easily see a maximum of the speedup at a band parallelization with 32 parallel sets, each with an internal domain decomposition on $16^3$ MPI processes.
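The existence of such an optimum can be made plausible with a toy cost model that merely restates the argument above (the model and its parameters are illustrative assumptions, not fitted to the measured data): the computational work shrinks like $1/p_{\mathrm{bnd}}$ while the wave-function exchange grows with the number of communication partners.

```python
# Toy cost model (illustrative only): compute time is divided by the number
# of band sets, communication grows with the number of partners.
def scf_time_model(p_bnd: int, t_compute: float = 512.0, t_comm: float = 0.5) -> float:
    return t_compute / p_bnd + t_comm * (p_bnd - 1)

best = min((scf_time_model(p), p) for p in (1, 2, 4, 8, 16, 32, 64))
print(best)  # with these toy parameters the minimum lies at p_bnd = 32
```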
