9.5 Parallel Efficiency of Filter Algorithms

9.5.1 Efficiency of the Framework

To study the parallel efficiency and the speedup of the data assimilation framework, data assimilation experiments are performed with the three ESKF algorithms using different numbers of parallel model tasks. Since FEOM does not apply domain-decomposition, a configuration with mode-decomposed filters is applied. To reduce the computation time of the experiments in comparison to those in the preceding section, the data assimilation experiments are performed over a time period of 10 days.

The interval between subsequent analyses is set to 12 hours. To compute the speedup, the state ensemble has to be divided evenly over the available model tasks. For this reason, an ensemble size of N = 36 (r = 35) is chosen. This ensemble size has the following properties:

• The ensemble is sufficiently large to provide a realistic data assimilation experiment. On the other hand, the ensemble is small enough to perform a large number of experiments.

• To assess the speedup, a large variety of different numbers of model tasks is required. To ensure that each model task evolves the same number of ensemble states, the chosen numbers of model tasks need to be divisors of the ensemble size. In addition, the number of possible parallel model tasks is limited by the number of processors in the computer system used for the experiments.

Using N = 36, the experiments can be executed with 1, 2, 3, 4, 6, 9, 12, 18, and 36 parallel model tasks. This enables efficient use of the available 24 processors of the Sun Fire 6800.
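The admissible task counts are simply the divisors of the ensemble size. As a minimal sketch (the function and variable names are illustrative and not part of the framework), they can be enumerated together with the number of ensemble states each task would evolve:

```python
def divisors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


ENSEMBLE_SIZE = 36  # N = 36, as chosen in the text

# Every divisor is a task count for which all tasks evolve equally many states.
task_counts = divisors(ENSEMBLE_SIZE)
states_per_task = {tasks: ENSEMBLE_SIZE // tasks for tasks in task_counts}

print(task_counts)          # [1, 2, 3, 4, 6, 9, 12, 18, 36]
print(states_per_task[18])  # 2
```

With 18 model tasks, for example, each task evolves two ensemble states per forecast phase.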

Using the configuration described above, the execution time for a single-processor, i.e. serial, experiment is about 9 hours on the Sun Fire 6800. The execution time decreases to about 35 minutes when 18 parallel model tasks are used. Using a single processor, the execution time for the EnKF algorithm was about 18 seconds. The analysis and the resampling phases of SEEK lasted about 0.2 and 2.2 seconds, respectively. The analysis phase of SEIK took 0.4 seconds while the resampling phase lasted about 1 second. Thus, the analysis phase of SEIK is slower than that of SEEK, but the resampling phase is faster. This is consistent with the computational complexity of the algorithms, which was discussed in section 3.4.

Figure 9.8 shows speedup and parallel efficiency for filtering experiments using the configuration of the framework where the filter is executed by one process of each model task. The speedup is computed from the total execution time of one series of experiments. Thus, the times for the initialization of the model and the filter are included, as well as the time for the user analysis routines. The user analysis routines compute the filter-estimated variances and write the estimated state to a disk file. Each model task is executed by a single process. Hence, the total number of processes for an experiment equals the number of model tasks and the number of filter processes. This configuration has been chosen to allow for a maximal number of parallel model tasks.

Figure 9.8: Speedup (left hand side) and parallel efficiency (right hand side) in dependence on the number of parallel model tasks for the framework with a filter process on each model task. (Plot not reproduced. Both panels show curves labeled ideal, EnKF, SEEK, and SEIK over 0 to 18 parallel model tasks; the speedup axis spans 0 to 18 and the parallel efficiency axis spans 0.8 to 1.05.)

Figure 9.9: Speedup (left hand side) and parallel efficiency (right hand side) in dependence on the number of parallel model tasks for the framework with disjoint process sets for filter and model. The filter part is computed by a single process. (Plot not reproduced. Both panels show curves labeled ideal, EnKF, SEEK, and SEIK over 0 to 18 parallel model tasks; the parallel efficiency axis spans 0.8 to 1.05.)

This choice does not limit the significance of the results when the speedup in relation to the used number of model tasks is considered. Since here the number of processes in a model task does not change, the computation time for the forecast of a single state is independent of the number of parallel model tasks. Using a filter process on each model task minimizes the amount of communication between model and filter (see section 8.3.1). In fact, since each model task is executed by a single process, no communication between model and filter is conducted. Thus, the parallel efficiency of the program is limited only by the serial parts of the model and the filter algorithms, by the communication performed within the filters, and by possible different times to compute the forecast of different model states.

The speedup in figure 9.8 is excellent for all three filter algorithms. The small differences between the filters are not statistically significant. The sensitivity of the results was examined using 10-fold experiments with the same number of model tasks. Due to variations in the total execution time of the experiments, a standard deviation of about 3% results for the speedup. Thus, the filter framework yields equal values of the speedup for the three ESKF algorithms. The parallel efficiency of the data assimilation system decreases slightly when the number of parallel model tasks is increased. With 18 model tasks an efficiency of about 85% is obtained.
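The efficiency quoted for 18 model tasks can be cross-checked against the rounded execution times given earlier in this section (about 9 hours serial, about 35 minutes with 18 tasks). The sketch below assumes the usual definitions, speedup as serial time divided by parallel time and parallel efficiency as speedup divided by the number of tasks. The function names are illustrative, and because the timings are rounded the result is only approximate:

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup: serial execution time divided by parallel execution time."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, n_tasks: int) -> float:
    """Parallel efficiency: speedup divided by the number of parallel tasks."""
    return speedup(t_serial, t_parallel) / n_tasks


# Rounded timings from the text, in minutes (assumed to refer to the same setup).
T_SERIAL_MIN = 9 * 60   # about 9 hours on a single processor
T_PARALLEL_MIN = 35     # about 35 minutes with 18 model tasks
N_TASKS = 18

s = speedup(T_SERIAL_MIN, T_PARALLEL_MIN)
e = parallel_efficiency(T_SERIAL_MIN, T_PARALLEL_MIN, N_TASKS)
print(f"speedup ~ {s:.1f}, efficiency ~ {e:.2f}")
```

The rounded timings give a speedup of roughly 15.4 and an efficiency of roughly 0.86, which agrees with the reported value of about 85% within the stated 3% standard deviation.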

For comparison, figure 9.9 shows speedup and parallel efficiency for experiments using disjoint process sets for the model and filter parts of the program. In these experiments the filter is executed on a single process only. Thus, the parallel efficiency is limited by the serial operations of the filter, serial parts of the model, and by the communication required to exchange the state vectors between filter and model. Further, different computation times for the forecasts can limit the efficiency when other processes have to wait for one of the model tasks to complete its work.
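The waiting effect can be illustrated with a toy model, which is not taken from the experiments: if all tasks synchronize at the end of a forecast phase, the phase lasts as long as the slowest task, and the efficiency of the phase is the total work divided by the number of tasks times that duration. The per-state forecast times below are hypothetical values in arbitrary units:

```python
def forecast_phase_time(task_times: list[list[float]]) -> float:
    """Wall-clock time of a synchronized forecast phase: the slowest task decides."""
    return max(sum(times) for times in task_times)


def forecast_efficiency(task_times: list[list[float]]) -> float:
    """Useful work divided by the processor time occupied during the phase."""
    total_work = sum(sum(times) for times in task_times)
    n_tasks = len(task_times)
    return total_work / (n_tasks * forecast_phase_time(task_times))


# Hypothetical forecast times: 3 model tasks, 2 ensemble states each.
balanced = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
imbalanced = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.5]]  # one slow state evolution

print(forecast_efficiency(balanced))    # 1.0
print(forecast_efficiency(imbalanced))  # below 1: two tasks wait for the third
```

In this toy case a single state evolution that takes 50% longer lowers the efficiency of the phase to about 0.87, although no serial code or communication is involved.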

Using disjoint process sets, the speedup is very similar to the speedup obtained by the configuration with a filter process on each model task. The small differences are again not statistically significant. The standard deviation of the speedup again amounts to about 3%. Due to these uncertainties, no more detailed conclusions can be drawn from the values of the speedup. In particular, it is not possible to determine which of the two process configurations, filter processes joint with the model processes or disjoint process sets for model and filter, is more efficient.

The deviation from an optimal parallel efficiency of the data assimilation system is caused by varying execution times of the state evolutions on different model tasks. Since the processes are synchronized at the end of a forecast phase, this desynchronization reduces the speedup of the forecast phase. The influence of the analysis and resampling phases is negligible. For the EnKF, which is the most costly of the three filter algorithms, the execution time for the analysis and resampling phases amounts to less than 0.1% of the total execution time for the serial experiment. In addition, the influence of the serial model initialization and the execution of the user analysis routine is negligible. These phases last about 6 and 10 seconds, respectively, in the serial experiment.
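A rough check of these magnitudes uses the rounded timings given in this section. The sketch below makes the deliberately pessimistic assumption that the filter, the model initialization, and the user analysis routine are entirely serial, and then applies Amdahl's law. That law is brought in here only for illustration; the text itself does not use it, and the names are illustrative:

```python
def amdahl_speedup(serial_fraction: float, n_tasks: int) -> float:
    """Upper bound on speedup when a fixed fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_tasks)


# Rounded timings of the serial experiment, in seconds.
T_TOTAL_S = 9 * 3600  # about 9 hours in total
T_FILTER_S = 18       # EnKF analysis and resampling
T_INIT_S = 6          # model initialization
T_USER_S = 10         # user analysis routine

filter_fraction = T_FILTER_S / T_TOTAL_S
# Assumption: all three parts are treated as purely serial.
serial_fraction = (T_FILTER_S + T_INIT_S + T_USER_S) / T_TOTAL_S

bound = amdahl_speedup(serial_fraction, 18)
print(f"filter share: {filter_fraction:.4%}")
print(f"speedup bound at 18 tasks: {bound:.1f} (efficiency {bound / 18:.3f})")
```

Under these assumptions the filter accounts for about 0.06% of the serial run, and the serial parts alone would still permit an efficiency of about 98% with 18 tasks. This supports the attribution of most of the observed loss to the varying forecast times.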

From the document Parallel Filter Algorithms for Data Assimilation in Oceanography (pages 159-162)