
7.2 Parallelization over the Modes

7.2.4 SEIK

The analysis algorithm of the SEIK filter is very similar to that of the SEEK filter. Hence, the parallelization is almost identical in both cases. In discussing the parallelization of SEIK, we focus on its unique parts. The parallel SEIK analysis algorithm is shown as algorithm 7.4, while the serial analysis has been shown as algorithm 3.4.

An additional operation in the analysis algorithm of SEIK compared with SEEK is the matrix-matrix product in line 7. Here the ensemble matrix projected onto the observation space (T1 in the pseudo code) is multiplied with the matrix T defined by equation (2.62). As has been discussed in section 3.3, this operation is implemented most efficiently by taking the particular choice of T into account. Accordingly, this multiplication involves the subtraction of the global ensemble mean of T1 from each column of this matrix. This mean is computed like the means in the EnKF, i.e., by calculating local means followed by an allreduce summation. The computed ensemble mean is then subtracted from each of the local columns of T1. In line 21, the product T t5 is computed. Following the discussion in section 3.3, the mean value of the elements of t5 is computed and subsequently subtracted from each element. The final element is initialized with the negative of the mean value. The product T t5 does not require communication, since t5 is allocated on each process. Other additional operations in the analysis phase of SEIK are the computation of the ensemble mean in line 13, which is computed as in the EnKF, and the initialization of matrix G in line 10. This operation is parallelized by initializing only the rp local columns. These are required for the subsequent computation of Uinv, which is a matrix-matrix product of type 1 followed by an allgather operation. Since the solver step in line 20 is not parallelized and several global communication operations are performed, we cannot expect the mode-parallel SEIK analysis algorithm to scale perfectly.
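To make the operator implementation of T more concrete, the following sketch shows the two applications discussed above in Python with numpy and mpi4py. It is purely illustrative and not part of the thesis code; the function names (global_column_mean, apply_T_from_right, apply_T_from_left) and the flag owns_last_global_column are assumptions, and a column-block distribution of the ensemble over the processes is presumed.

# Illustrative sketch only, assuming each process holds its Np local
# ensemble columns; names are hypothetical, not taken from the thesis code.
import numpy as np
from mpi4py import MPI

def global_column_mean(A_p, N, comm):
    """Mean over all N global columns of a column-distributed matrix:
    local sums followed by an allreduce summation."""
    local_sum = np.ascontiguousarray(A_p.sum(axis=1))
    global_sum = np.empty_like(local_sum)
    comm.Allreduce(local_sum, global_sum, op=MPI.SUM)   # global MPI operation
    return global_sum / N

def apply_T_from_right(T1_p, N, owns_last_global_column, comm):
    """T1 * T with T from Eq. (2.62) applied as an operator: subtract the
    global column mean of T1 from each column and drop the last global
    column, so one process ends up holding one column less."""
    mean = global_column_mean(T1_p, N, comm)
    T2_p = T1_p - mean[:, np.newaxis]
    if owns_last_global_column:
        T2_p = T2_p[:, :-1]
    return T2_p

def apply_T_from_left(t5, N):
    """T * t5 without communication (t5 of length r = N - 1 is held by
    every process): subtract sum(t5)/N from each element and append the
    negative of that value as the final element."""
    mean = t5.sum() / N
    return np.concatenate([t5 - mean, [-mean]])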

A particular parallelization issue of the SEIK filter is that matrix T2 consists of only r columns, while T1 contains N = r + 1 columns. Hence, the application of T is problematic for the load balancing of the analysis algorithm. Since the forecast phase usually requires the most computation time, we chose a configuration in which each process holds the same number Np = k of ensemble states (i.e., the same number of columns in the local matrices Xp and T1p). Computing the product T1p T reduces the overall number of columns by one. Accordingly, one of the processes (usually the one with the highest rank) holds only k - 1 local columns of T1p T, while all other processes hold k local columns. Due to this, one of the processes executes fewer operations than the others and will complete its work earlier. However, this is inevitable if the ensemble has to be distributed evenly in order to obtain the best speedup in the forecast phase. For the parallel algorithm, this has no special implications as long as the number of columns in matrix T2p is not reduced to zero on one of the processes.
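As a small illustration of this bookkeeping, a hypothetical helper (assuming N is an integer multiple of the number of processes) could compute the local column counts before and after the application of T as follows:

# Hypothetical helper illustrating the load balancing described above:
# with N ensemble columns spread evenly over n_procs processes, every
# process holds k = N / n_procs columns of Xp and T1p; after multiplying
# with T, the highest-ranked process holds k - 1 columns of T1p * T.
def local_column_counts(N, n_procs):
    assert N % n_procs == 0, "assumes an evenly distributed ensemble (Np = k)"
    k = N // n_procs
    ensemble_cols = [k] * n_procs                     # columns of Xp, T1p
    projected_cols = [k] * (n_procs - 1) + [k - 1]    # columns of T1p * T
    assert projected_cols[-1] > 0, "no process may end up with zero columns"
    return ensemble_cols, projected_cols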

In the resampling algorithm of SEIK, a new ensemble of states is computed on the basis of the forecasted state ensemble X. The parallel algorithm is shown as algorithm 7.5. It can be compared with the serial algorithm 3.8. The Cholesky decomposition in line 2 is performed equally by all processes.

Subroutine SEIK Analysis Mode(step, n, N, x, Uinv, Xp)
int step {time step counter, input}
int n {state dimension, input}
int N {ensemble size, input}
real x(n) {local state estimate, output}
real Uinv(r, r) {inverse eigenvalue matrix, output}
real Xp(n, Np) {local ensemble matrix, input/output}
real T2, t4, t5, t6, y, d, ∆x {fields to be allocated}
real T1p, T2p, T3p, t4p, Gp, Uinvp, xp, ∆xp {fields to be allocated}
int r {rank of covariance matrix, r = N - 1}
int rp {number of local columns of covariance matrix}
int Np {local ensemble size}
int m {dimension of observation vector}
int i {ensemble loop counter}

1: call Get Dim Obs(step, m) {by each process}
2: Allocate fields: T2(m, r), t4(r), t5(r), t6(N), y(m), d(m), ∆x(n), T1p(m, Np),
3:    T2p(m, rp), T3p(m, rp), t4p(rp), Gp(r, rp), Uinvp(r, rp), xp(n), ∆xp(n)
4: for i = 1, Np do
5:    call Measurement Operator(step, n, m, Xp(:, i), T1p(:, i)) {user supplied}
6: end for
7: T2p ← T1p T {implemented with T as operator}
8: allgather T2 from T2p {global MPI operation}
9: call RinvA(step, m, r, T2p, T3p) {operate only on local columns}
10: Gp ← (N^-1 (T^T T)^-1)p {implemented as direct initialization}
11: Uinvp ← Gp + T2^T T3p {matrix-matrix product of type 1}
12: allgather Uinv from Uinvp {global MPI operation}
13: xp ← N^-1 Σ(i=1..Np) Xp(:, i) {get local ensemble mean state}
14: allreduce summation of x from xp {global MPI operation}
15: call Measurement Operator(step, n, m, x, d) {user supplied}
16: call Measurement(step, m, y) {user supplied}
17: d ← y - d
18: t4p ← T3p^T d {matrix-matrix product of type 2}
19: allgather t4 from t4p {global MPI operation}
20: solve Uinv t5 = t4 for t5 {by each process}
21: t6 ← T t5 {implemented with T as operator}
22: ∆xp ← Xp t6(jp : jp+Np-1) {local increment, mat.-vec. product of type 3}
23: allreduce summation of ∆x from ∆xp {global MPI operation}
24: x ← x + ∆x {by each process}
25: De-allocate local analysis fields

Algorithm 7.4: Structure of the parallel filter analysis routine for the SEIK algorithm. The arrays T2p and t5 are introduced for clarity. Their contents are stored respectively in T1p and t4. The index jp denotes the index of the first column of Xp in X.
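As an illustration of the column-block pattern in lines 10 to 12 of algorithm 7.4 (a matrix-matrix product of type 1 followed by an allgather of the column blocks), the following sketch assembles Uinv. It is an assumed mpi4py/numpy fragment, not the thesis code; the names (build_T, assemble_Uinv, first_local_col) are hypothetical, and G is obtained here by explicitly inverting T^T T rather than by the direct initialization used in the algorithm.

# Illustrative sketch of lines 10-12 of Algorithm 7.4 (assumed shapes and
# names): each process computes only its rp local columns of G and Uinv;
# the full Uinv is then assembled by gathering the column blocks.
import numpy as np
from mpi4py import MPI

def build_T(N):
    """Explicit T of Eq. (2.62): T(i,j) = delta_ij - 1/N for i <= r,
    last row equal to -1/N, with r = N - 1."""
    r = N - 1
    T = -np.ones((N, r)) / N
    T[:r, :] += np.eye(r)
    return T

def assemble_Uinv(T2, T3_p, N, first_local_col, comm):
    """Local columns Uinv_p = G_p + T2^T T3_p (matrix-matrix product of
    type 1), followed by an allgather of the per-process column blocks."""
    T = build_T(N)
    G = np.linalg.inv(T.T @ T) / N              # small r x r matrix
    rp = T3_p.shape[1]
    cols = slice(first_local_col, first_local_col + rp)
    Uinv_p = G[:, cols] + T2.T @ T3_p           # only rp local columns
    blocks = comm.allgather(Uinv_p)             # global MPI operation
    return np.concatenate(blocks, axis=1)       # full r x r Uinv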

Subroutine SEIK Resample Mode(n, N, x, Uinv, Xp)
int n {state dimension, input}
int N {ensemble size, input}
real x(n) {state analysis vector, input}
real Uinv(r, r) {inverse eigenvalue matrix, input}
real Xp(n, Np) {ensemble matrix, input/output}
real T1, T2p, C, Ωp^T, X {fields to be allocated}
int r {rank of covariance matrix, r = N - 1}
int Np {local ensemble size}

1: Allocate fields: T1(r, N), T2p(N, Np), C(r, r), Ωp^T(r, Np), X(n, N)
2: Cholesky decomposition: Uinv = C C^T {by each process}
3: initialize Ωp^T {local columns}
4: solve C^T T1p = Ωp^T for T1p {local columns}
5: T2p ← T T1p {implemented with T as operator}
6: allgather X from Xp {global MPI operation}
7: for i = 1, Np do
8:    Xp(:, i) ← x
9: end for
10: Xp ← Xp + N^1/2 X T2p {matrix-matrix product of type 1 with DGEMM}
11: De-allocate local analysis fields

Algorithm 7.5: Structure of the parallel resampling routine for the SEIK algorithm.

The matrix T1p is not allocated in the program. Its contents are stored in Ωp^T. To avoid the allocation of X, lines 6 to 10 can be implemented in block formulation.

The solver step for the local columns of T1 in line 4 and the product T T1p (line 5) are parallelized. The latter operation is implemented as in the analysis algorithm. The initialization of the new ensemble matrix in line 10 is also executed in parallel. Since this operation requires information on all ensemble members in X ∈ R^(n×N), this matrix is initialized by all processes using an allgather operation (line 6). This operation will be very costly due to the large dimension of X. To avoid the requirement to store the full matrix X, we use a block formulation for the resampling. Therefore a loop is built around lines 5 to 10.

In each cycle of this loop, only a block of rows of the global matrix X is allocated and gathered at a time. In line 10, only the corresponding rows of Xp are updated.
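A possible form of this block formulation is sketched below in Python with numpy and mpi4py. The names (resample_blocked, block_size) and the simple rank-ordered allgather of column blocks are assumptions for illustration, not the thesis implementation.

# Illustrative sketch of the block formulation of lines 5-10 of
# Algorithm 7.5: only a block of rows of the global ensemble matrix X is
# gathered and used at a time; names and block size are assumptions.
import numpy as np
from mpi4py import MPI

def resample_blocked(X_p, x, T2_p, N, comm, block_size=1000):
    """X_p: local ensemble columns (n x Np, updated in place),
    x: analysis state vector (n), T2_p: N x Np weights from line 5."""
    n = X_p.shape[0]
    for i0 in range(0, n, block_size):
        i1 = min(i0 + block_size, n)
        # gather this row block of the full X from all column blocks X_p
        col_blocks = comm.allgather(np.ascontiguousarray(X_p[i0:i1, :]))
        X_block = np.concatenate(col_blocks, axis=1)      # (i1-i0) x N
        # line 8: initialize the local columns with the analysis state
        X_p[i0:i1, :] = x[i0:i1, np.newaxis]
        # line 10: add the weighted ensemble perturbations (DGEMM)
        X_p[i0:i1, :] += np.sqrt(N) * X_block @ T2_p
    return X_p

# usage (illustrative): resample_blocked(X_p, x, T2_p, N, MPI.COMM_WORLD)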

7.2.5 Comparison of Communication and Memory Requirements

For comparison of the communication requirements of the three filter algorithms, table 7.2 summarizes the sizes of the arrays involved in MPI operations.

The amount of data communicated in the mode-parallel analysis algorithm of SEIK is larger than for SEEK. This is caused by the product T1p T in line 7 of algorithm 7.4 and by the computation of the ensemble mean in line 14.

Table 7.2: Sizes of arrays involved in global MPI operations in the analysis and resampling phases of the SEEK and SEIK algorithms and in the analysis phase of the EnKF algorithm. Next to the matrix size, the name of the matrix is given, as well as whether the MPI operation is an allgather (g) or an allreduce (r) operation.

              EnKF             SEEK              SEIK
analysis      mN (T1, g)       mr (T1, g)        mr (T2, g)
              nN (T5, g)       r^2 (Uinv, g)     r^2 (Uinv, g)
              m (t2, r)        r (t3, g)         r (t4, g)
              n (x, r)         n (∆x, r)         n (x, r)
                                                 n (∆x, r)
                                                 m (in T1p T, r)
resampling                     r^2 (U, g)        nN (X, g)
                               nr (V, g)
                               r^2 (T2, g)

In the resampling algorithm of SEEK, the global mode matrix V ∈ R^(n×r) has to be initialized by all processes using an allgather operation. Analogously, the ensemble matrix X ∈ R^(n×N) has to be initialized in the resampling algorithm of SEIK. In the resampling algorithm of SEEK, the much smaller matrices U and T2 are also gathered.

The communication requirements of the EnKF algorithm are similar to those of the SEEK and SEIK algorithms. In the EnKF, the ensemble update is computed within the analysis, while SEEK and SEIK have additional resampling routines. Due to this, the EnKF includes the allgather operation on the matrix T5 ∈ R^(n×N), which is the analogue of the allgather operations on V or X performed in the resampling phases of SEEK and SEIK, respectively.

Concerning memory requirements, the mode decomposition only permits the distribution of some fields which hold ensemble quantities. Other arrays, which hold ensembles of observation-related vectors like T1 in SEEK and EnKF, are not decomposed. Thus, the scalability of the memory requirements is limited. Next to these non-distributed arrays, additional private arrays have to be allocated. Some of these, like T2p ∈ R^(m×rp) in algorithm 7.1, involve the observation dimension. These arrays increase the overall memory requirements. Other arrays, which involve the state dimension n, are less problematic. Using block formulations, it is not necessary to allocate these arrays in their full size. A particular memory issue is the allocation of the full mode matrix V ∈ R^(n×r) in the resampling algorithm of SEEK. As has been discussed in section 7.2.2, the allocation of this very large array can only be avoided by a block formulation. This, however, requires gathering the full information on V twice. In the case of the EnKF algorithm, the allocation of the matrix T3 ∈ R^(m×m) is required. If very large data sets have to be assimilated, this memory requirement can be problematic. In this case, the sequential assimilation of independent observation batches of smaller dimension m will reduce the memory requirements.
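A minimal sketch of such a batch-wise assimilation loop is given below; analysis_step is a hypothetical routine standing for one analysis update with a single observation batch, and the batches are assumed to have mutually independent observation errors.

# Hypothetical sketch: assimilating independent observation batches
# sequentially keeps the observation-space work arrays of size
# batch_size x batch_size instead of m x m.
def assimilate_in_batches(y, batch_size, analysis_step):
    """y: full observation vector of dimension m;
    analysis_step: assumed routine performing one analysis update
    with the given observation batch."""
    for j0 in range(0, len(y), batch_size):
        analysis_step(y[j0:j0 + batch_size])   # batch of dimension <= batch_size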