SEIK - Filtering with Domain Decomposition

7.3 Filtering with Domain Decomposition

7.3.4 SEIK

As in the case of mode-decomposed ensemble and mode matrices, the analysis algo-rithm of the SEIK filter for domain-decomposition is very similar to that of the SEEK filter. The parallel SEIK analysis algorithm for domain-decomposition is shown as algorithm 7.9. Again we discuss the differences to the SEEK algorithm.

For domain-decomposition, a process knows the full state ensemble for its local do-main. Thus, the computation of ensemble means does not require any MPI operations.

Accordingly, the product of matrixT1_p with matrixT in line 7 involves no communi-cations of data. The same is true for the computation of the ensemble mean in line 13 and the application of T to t5 in line 20. Due to this, the amount of communicated data is equal for the analysis algorithms of SEEK and SEIK in the case of domain-decomposition. The algorithm contains several operations which are executed without parallelization. These are the initializations of G and Uinv, the solver step for t5, and the computation of t6. Most costly will be the solver step for t5in line 19, since it involves the inversion ofUinv ∈R^r×r. These operations, together with the required communication operations, will limit the parallel efficiency of the domain-decomposed analysis. The parallel efficiency will be, however, better than in the case of mode-decomposition, since there the amount of communicated data is much higher than for domain-decomposition.

For domain-decomposed states, the resampling algorithm of SEIK, shown as algo-rithm 7.10, has the benefit that no communication operations are required at all. The operations on the small r×r and r ×(r+ 1) matrices are performed equally by all processes. They can be expected to require negligible time compared with the com-putation of the new ensemble states. The operations on the ensemble matrix are fully parallelized. Hence, the domain-decomposed resampling algorithm of SEIK can be ex-pected to show a nearly ideal speedup. To reduce the required memory, we implement the ensemble transformation in line 11 using a block formulation. It is analogous to the block structure described for the SEEK resampling algorithm.

7.3.5 Comparison of Communication and Memory Require-ments

Table 7.3 summarizes the size of the communicated arrays in the domain-decomposed filter algorithms. The numbers assume that no communication is performed in the implementation of the measurement operator and in the subroutine RinvA.

Since we have usually n À m > N, r for realistic large scale models, it is obvi-ous from table 7.3, that with domain decomposition significantly less data has to be communicated between processes. The smallest amount is in the SEIK algorithm. Its analysis algorithm communicates only two arrays of sizesr×r and r. The resampling algorithm of SEIK is even executed without any communication of data. The largest amount will be in the EnKF algorithm, since here arrays involving the dimension m are communicated.

Comparing the mode-decomposed algorithms (7.1 to 7.5) with the algorithms using domain decomposition (7.6 to 7.10), the smaller memory requirements of the domain-decomposed filter algorithms become visible. Using domain-decomposition, all arrays

Subroutine SEIK Analysis Domain(step,n_p,N,x_p,Uinv,X_p) int step {time step counter,input}

int n_p {state dimension on local domain, input} int N {ensemble size, input}

real x_p(n_p) {local state estimate, output}

real Uinv(r, r) {inverse eigenvalue matrix, output} real X_p(n_p, N) {local ensemble matrix, input/output} real t4,t5,t6,G,∆Uinv,y_p,d_p, {fields to be allocated} real T1_p,T2_p,T3_p,t4_p,∆Uinv_p {fields to be allocated} int m_p {dimension of local observation vector}

int i {ensemble loop counter}

int r {rank of covariance matrix, r=N −1}

1: call Get Dim Obs(step, m_p) {get observation dimension, user supplied}

2: Allocate fields: t4(r),t5(r),t6(N),G(r, r),∆Uinv(r, r),y_p(m_p),d_p(m_p),

3: T1_p(m_p, N),T2_p(m_p, r),T3_p(m_p, r),t4_p(r),∆Uinv_p(r, r)

4: for i=1,N do

5: call Measurement Operator(step, n_p, m_p,X_p(:, i),T1_p(:, i)) {local domain}

6: end for

7: T2_p ←T1_p T {implemented with T as operator}

8: call RinvA(step, m, r,T2_p,T3_p) {operate only on local domain}

9: G←(N⁻¹(T^T T)⁻¹) {by each process; implemented as direct initialization}

10: ∆Uinv_p ←T2_p^TT3_p {matrix-matrix product of type 3}

11: allreduce summation of∆Uinv from∆Uinv_p {global MPI operation}

12: Uinv←G+∆Uinv {by each process}

13: x_p ←N⁻¹P_N

i=1X_p(:, i) {get ensemble mean state for local domain}

14: call Measurement Operator(step, n_p, m_p,x_p,d_p) {user supplied}

15: call Measurement(step, m_p,y_p) {user supplied}

16: d_p ←y_p−d_p

17: t4_p ←T3_p^Td_p {matrix-matrix product of type 3}

18: allreduce summation oft4 from t4_p {global MPI operation}

19: solve Uinv t5=t4for t5 {by each process}

20: t6←T t5 {implemented with T as operator}

21: x_p ←x_p+X_p t6 {matrix-vector product of type 2}

22: De-allocate local analysis fields

Algorithm 7.9: Structure of the parallel filter analysis routine for the SEIK algorithm for domain decomposed states. The arrays T2_p and G are not allocated but stored respectively in T1_p and Uinv. Analogously, the contents of t5 are stored in t4.

Subroutine SEIK Resample Domain(n_p,N,x_p,Uinv,X_p) int n_p {state dimension on local domain, input} int N {ensemble size, input}

real x_p(n_p) {state analysis vector, input}

real Uinv(r, r) {inverse eigenvalue matrix, input} real X_p(n_p, N) {ensemble matrix, input/output} real T1,T2,Ω^T,C,T3_p {fields to be allocated} int r {rank of covariance matrix, r =N −1}

1: Allocate fields: T1(r, N),T2(N, N),Ω^T(r, N),C(r, r),T3_p(n_p, N)

2: Cholesky decomposition: Uinv =C C^T {by each process}

3: initialize Ω^T {by each process}

4: solve C^TT1=Ω^T for T1 {by each process}

5: T2←T T1 {implemented withT as operator}

6: for i=1,N do

7: T3_p(:, i)←X_p(:, i)

8: X_p(:, i)←x_p

9: end for

10: X_p ←X_p+N^1/2 T3_p T2 {matrix-matrix product of type 2}

11: De-allocate local analysis fields

Algorithm 7.10: Structure of the parallel resampling routine for the SEIK algorithm for domain decomposed states. The matrix T1 is never allocated in the program. Its contents are stored inΩ^T. Lines 6 to 10 can be implemented with a block formulation.

Then only some rows of T3_p are allocated.

Table 7.3: Sizes of arrays involved in global MPI operations in the analysis and re-sampling phases of the SEEK and SEEK algorithms and in the analysis phase of the EnKF algorithm for domain-decomposed states. Next to the matrix size, the name of the matrix is given as well as the information whether the MPI operation is an allgather (g) or allreduce (r) operation.

EnKF SEEK SEIK

analysis mN (T1, g) r² (∆Uinv, r) r² (∆Uinv, r) mN (D, g) r (t3, r) r (t4, r) 1 (m, r)

resampling r² (T1, r)

Im Dokument Parallel Filter Algorithms for Data Assimilation in Oceanography (Seite 108-111)