9.5 Parallel Efficiency of Filter Algorithms

9.5.2 Speedup of the Filter Part for Mode-decomposition

[Figure 9.10 — four panels: "Execution Time, N=60" and "Speedup, N=60" (left), "Execution Time, N=240" and "Speedup, N=240" (right); x-axis: processors (2 to 12), y-axes: time [sec] and speedup; curves: EnKF, SEEK, SEIK]

Figure 9.10: Execution time and speedup for the filter update phases as a function of the number of processes. In the experiments, the mode-decomposed filters were applied. Displayed are mean values and standard deviations over ten experiments for each combination of filter algorithm and number of processes. The left hand side shows results for N = 60, the right hand side for N = 240.

The relative differences in the execution times are smaller for N = 240 than for N = 60. Using the larger ensemble size, the SEIK filter remains the fastest algorithm while the EnKF algorithm is still the slowest filter, even if the generation of the observation ensemble is neglected. The execution time for the EnKF triples while that for SEEK and SEIK increases tenfold. The small increase in the execution time for the EnKF is due to the fact that the time for the initialization of the observation ensemble only approximately doubles, since several of these operations do not depend on the size of N. The time for the remaining part of the EnKF quadruples. The increase in the execution time of SEIK is dominated by the computation of the new ensemble matrix in line 10 of the resampling algorithm 7.5. For SEEK, the increase in time is also dominated by the resampling phase. Here most of the time is spent in the computation of T1p in line 8 of algorithm 7.2 and the computation of the new modes in line 15.

The speedup of the mode-parallel filter algorithms is rather disappointing. This becomes apparent from the bottom row of figure 9.10, which shows the speedup for the experiments with N = 60 and N = 240. The fluctuations in the speedup are mainly due to cache effects on the computer used for the experiments: the numerical efficiency of matrix operations like matrix-matrix products depends on the dimensions of the involved matrices. For N = 60, the best speedup is obtained with the SEEK filter. Using 12 processes, a speedup of about 3.2 is obtained, which corresponds to a parallel efficiency of 27%. The worst speedup is exhibited by the EnKF algorithm. It stagnates at a value of about 1.2 when 12 processes are used. This corresponds to a parallel efficiency of 10%. The speedup is slightly better for the large ensemble size of N = 240. Here the speedup for SEEK and SEIK reaches 4.4 and 4.7, respectively. Thus an efficiency between 37% and 39% is obtained with 12 processes. The speedup of the EnKF is twice as large as for N = 60, stagnating at a value of about 2.4 with 12 processes.
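The efficiency values quoted above follow directly from the definitions of speedup, S(p) = T(1)/T(p), and parallel efficiency, E(p) = S(p)/p. A minimal sketch of this computation (the speedup values are those read off figure 9.10; the function name is illustrative):

```python
def parallel_efficiency(speedup: float, processes: int) -> float:
    """Parallel efficiency E = S / p, expressed as a percentage."""
    return 100.0 * speedup / processes

# Speedup values for 12 processes, as quoted in the text:
print(round(parallel_efficiency(3.2, 12)))  # SEEK, N=60  -> 27
print(round(parallel_efficiency(1.2, 12)))  # EnKF, N=60  -> 10
print(round(parallel_efficiency(4.4, 12)))  # SEEK, N=240 -> 37
print(round(parallel_efficiency(4.7, 12)))  # SEIK, N=240 -> 39
```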

The low parallel efficiency of SEEK and SEIK is mainly due to the extensive communication which is needed in the algorithms. For increasing ensemble size, the time for computations grows relative to the time for communications. Thus the parallel efficiency increases for larger ensembles. The differing efficiency of SEEK and SEIK for N = 60 is due to the different number of operations performed in their resampling phases. The amount of communication in the resampling phases of both algorithms is practically equal for N = 60. Since SEIK performs fewer operations, the allgather operation for X in line 6 of algorithm 7.5 is more dominant for the execution time than the allgather operation performed for V in SEEK. Since the time to perform the allgather operation increases with an increasing number of processes, the efficiency decreases for a larger number of processes. Using more than 6 processes, the allgather operation in SEIK even lasts longer than the computation of the new ensemble states. Therefore, the execution time of SEIK increases if the number of processes exceeds a value of 8. Hence, the speedup of SEIK decreases for the experiments using more than 8 processes.
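This trade-off can be illustrated with a toy cost model: the computational work is divided among the p processes, while the allgather of the full ensemble matrix moves an amount of data per process that grows with (p - 1)/p, so its cost does not shrink with p. The model below is purely qualitative; the coefficients and dimensions are arbitrary and not taken from the experiments:

```python
def model_time(p: int, n: int = 1000, N: int = 60,
               t_flop: float = 1e-9, t_word: float = 1e-8) -> float:
    """Toy execution-time model for one resampling phase.

    The computation (~ n * N^2 flops) is shared by p processes,
    while an allgather of the n x N ensemble matrix delivers
    (p - 1)/p * n * N words to every process.
    """
    compute = t_flop * n * N * N / p           # parallelized flops
    allgather = t_word * (p - 1) / p * n * N   # communication, grows with p
    return compute + allgather

# Speedup saturates well below p as communication takes over:
for p in (1, 2, 4, 8, 12):
    print(p, round(model_time(1) / model_time(p), 2))
```

Because the computation scales with N^2 but the communicated data only with N, the same model reproduces the observation that efficiency improves for larger ensembles.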

For models with larger state dimension n, the speedup of the SEEK and SEIK filters will also be limited by the required initialization of the full ensemble or mode matrix by allgather operations. Also the differences between SEEK and SEIK will remain for increasing n, since the amount of communication and the complexity of the most expensive floating point operations in the resampling algorithm both scale with O(n).

The small speedup of the EnKF filter is due to several factors. To examine the reasons in detail, the execution time and the speedup of different groups of operations are displayed in figure 9.11 for the EnKF with N = 240. In the serial experiment,

[Figure 9.11 — two panels: "Execution Time" and "Speedup"; x-axis: processors (2 to 12), y-axes: time [sec] and speedup; curves: total, lines 4-14, lines 15-19, line 20, lines 21-28]

Figure 9.11: Execution times and speedup for the groups of operations in the EnKF analysis algorithm for N = 240. Shown are means and standard deviations analogous to figure 9.10. The line numbers given in the legend of the diagrams refer to those in algorithm 7.3.

the generation of the observation ensemble and the initialization of the residual matrix (lines 15 to 19 in algorithm 7.3) take together about the same time as the ensemble update with its preparations (lines 21 to 28). The ensemble update shows a better speedup than the initialization of the residuals. The speedup for the ensemble update stagnates, however, at a value of about 3.5. This is due to the allgather operation performed to initialize the matrix T5 ∈ R^{n×N}. The generation of the observation ensemble also shows only a limited speedup, since this operation requires the eigenvalue decomposition of the observation error covariance matrix R ∈ R^{m×m}. The decomposition is independent of the local ensemble size and is not parallelized. The speedup of the other parts of the EnKF algorithm is worse than that of the ensemble update and the initialization of the residual matrix. The computation of the matrix T3 ∈ R^{m×m} in line 13 takes about 97% of the execution time of the operations in lines 4 to 14.

Since this operation is not parallelized, the speedup for this part of the algorithm will be approximately constant with a value of one. The complexity of the solver step for the representer amplitudes in line 20 is O(m^3 + m^2 N). It is dominated by the LU decomposition of the matrix T3, which is performed by the LAPACK routine DGESV. Thus, the achievable speedup of the solver step is very small.
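The stagnation caused by such serial parts is an instance of Amdahl's law: if a fraction f of the work is not parallelized, the speedup on p processes is bounded by 1/(f + (1 - f)/p), with limit 1/f for p → ∞. A short illustration (the serial fractions below are illustrative, not measured values from the experiments):

```python
def amdahl_speedup(serial_fraction: float, processes: int) -> float:
    """Amdahl's law: attainable speedup with a non-parallelized fraction f."""
    f = serial_fraction
    return 1.0 / (f + (1.0 - f) / processes)

# A fully parallel part scales ideally ...
print(amdahl_speedup(0.0, 12))              # -> 12.0
# ... but a 25% serial fraction already caps 12 processes at:
print(round(amdahl_speedup(0.25, 12), 2))   # -> 3.2
# and no number of processes can exceed the limit 1 / f:
print(1.0 / 0.25)                           # -> 4.0
```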

Overall, this discussion showed that the small speedup for the EnKF is caused by a combination of a high amount of communication and operations which are performed serially or do not scale well in terms of performance. The speedup of the ensemble update would be larger if the communication was faster relative to the computations. The solver step in line 20 and the computation of T3 in line 13 will, however, remain a limiting factor for the parallel efficiency of the EnKF algorithm. The speedup will be larger if the dimension of the observation vector is smaller relative to the state dimension. This can be achieved by using an EnKF analysis algorithm which sequentially assimilates batches of observations, as has been discussed in section 3.4.

In addition, a better speedup can be expected for larger models if the amount of observational data remains constant.
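To see why batching helps, split the m observations into k sequential batches of size m/k: under the O(m^3 + m^2 N) cost model quoted above for the solver step, the cubic term drops by a factor of k^2 and the quadratic term by a factor of k. A rough sketch of this operation count (constants are ignored; the dimensions below are arbitrary examples, not values from the experiments):

```python
def solver_cost(m: int, N: int, batches: int = 1) -> float:
    """Operation count ~ m^3 + m^2 * N for the representer solver,
    when m observations are assimilated in `batches` sequential
    batches of m / batches observations each (one solve per batch)."""
    mb = m / batches                       # observations per batch
    return batches * (mb**3 + mb**2 * N)   # cubic term shrinks as 1/batches^2

m, N = 1200, 240
full = solver_cost(m, N)               # single analysis, all observations
split = solver_cost(m, N, batches=4)   # four sequential batches
print(full / split)                    # about an order of magnitude cheaper
```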