
4.3 Numerical Tests

4.3.5 Strong Scaling of the Implicit Solver

In the following, we investigate the parallel efficiency of the implicit solver with and without preconditioning and compare it against the performance of the explicit solver. Both time integration schemes are implemented within the DGSEM framework FLEXI, and FLEXI in combination with the explicit time marching scheme has demonstrated its ability to scale perfectly on several tens of thousands of cores in [6, 63]. As preconditioners, we employ the LU and the ILU(0) NoFillIn decompositions.

Computations of large-scale problems with a memory-intensive solver require the use of high-performance computing (HPC) systems. The computational work is distributed over several processors in order to reduce the wall-clock time. Holding the total number of DOFs constant, a perfectly strong-scaling solver halves the wall-clock time when the number of processors is doubled. In the FLEXI code, the grid elements are distributed to the processors, so that the upper limit on the number of processors is given by the number of elements, i.e., when each processor owns a single element.

The lower limit is related to the memory available on each processor, since for small numbers of processors the number of elements per processor increases and consequently also the storage per processor. The hardware configuration of the supercomputer used and the environment settings are detailed in Appendix D.

Since the behavior of implicit schemes depends on the considered test case due to the different requirements for the non-linear and linear solvers, we show the parallel efficiency investigations exemplarily for the three-dimensional traveling vortex described in Appendix C with the polynomial degree N = 4 and the temporal ESDIRK4-6 method. In order to examine the scaling of the implicit solver, we consider 10 cases with different numbers of elements. All mesh configurations are based on the domain Ω = [0,2]^3. The coarsest mesh consists of 6 elements in each direction and is then refined by doubling the number of elements in one direction at a time. Table 4.2 lists the different problem sizes and also the maximum number of nodes (each node consists of 24 processors) on which the elements are distributed such that each core owns only 9 elements.

The baseline configuration describes the setup against which the parallel efficiency is measured. Table 4.3 lists the number of nodes required for the baseline computation of each case. For the explicit time marching scheme, the baseline of each case consists of one node (24 cores).


case                    1     2     3     4     5     6     7     8     9     10
#total elements 6^3 ·   2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9
#elements in x          6     12    12    12    24    24    24    48    48    48
#elements in y          6     6     12    12    12    24    24    24    48    48
#elements in z          6     6     6     12    12    12    24    24    24    48
max #nodes              1     2     4     8     16    32    64    128   256   512

Table 4.2: Mesh configurations for the investigation of parallel efficiency. The maximum number of nodes is related to 9 elements on each core.
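The mesh cases of Table 4.2 and their maximum node counts can be reproduced with the following hypothetical Python sketch (not part of FLEXI; names are chosen for illustration only), assuming 24 cores per node and a minimum of 9 elements per core as stated above:

# Hypothetical sketch: reproduces the mesh cases of Table 4.2, assuming
# 24 cores per node and at least 9 elements per core.
def mesh_cases(n_cases=10, n0=6, cores_per_node=24, min_elems_per_core=9):
    cases = []
    dims = [n0, n0, n0]            # elements in x, y, z
    for case in range(n_cases):
        total = dims[0] * dims[1] * dims[2]
        max_cores = total // min_elems_per_core
        cases.append({"case": case + 1,
                      "elements": tuple(dims),
                      "total_elements": total,
                      "max_nodes": max_cores // cores_per_node})
        dims[case % 3] *= 2        # refine by doubling one direction at a time
    return cases

for c in mesh_cases():
    print(c)

The printed node counts range from 1 (case 1, 6^3 elements) to 512 (case 10, 48^3 elements), in agreement with the last row of Table 4.2.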

For the implicit scheme, however, the baseline in case 10 is computed on two nodes when no preconditioning is used, since the problem does not fit on one node. The memory-consuming LU preconditioner in particular restricts the baseline to two nodes in case 9 and to four nodes in case 10. In contrast, the ILU(0) NoFillIn preconditioning causes no issues in any case due to its low-storage format and the low number of GMRES iterations, which requires only small Krylov subspace dimensions. The number of nodes of each baseline is doubled until the maximum number is reached.

case                  1   2   3   4   5   6   7   8   9   10
Explicit              1   1   1   1   1   1   1   1   1   1
Implicit noPrecond    1   1   1   1   1   1   1   1   1   2
Implicit LU           1   1   1   1   1   1   1   1   2   4
Implicit ILU(0)       1   1   1   1   1   1   1   1   1   1

Table 4.3: Number of nodes for the baseline computations for the investigation of parallel efficiency.

The explicit scheme is run for 100 time steps and the implicit scheme for one time step with CFL = 10^3, for which the performance index (PID) is measured, neglecting the time for reading or writing files and for initialization. The performance index is defined as the time needed for the update of one DOF on one core,

PID = \frac{\text{wall-clock time} \cdot \#\text{cores}}{\#\text{DOFs} \cdot \#\text{time steps} \cdot \#\text{RK-stages}} .

We repeat the calculations five times and take the median of the performance indices in order to reduce statistical effects. The parallel efficiency is then computed as

\text{parallel efficiency} = \frac{\text{PID}_B}{\text{PID}_k} \cdot 100\,\% ,

where PID_B denotes the baseline median performance index and PID_k the median performance index on k nodes. The baseline median PID is computed on a single node except for the cases indicated in Table 4.3.
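The two definitions above can be summarized in a short, hypothetical Python sketch (not part of FLEXI; function and variable names are chosen for illustration only):

# Hypothetical sketch: PID and parallel efficiency as defined above.
from statistics import median

def pid(wall_clock_time, n_cores, n_dofs, n_time_steps, n_rk_stages):
    """Performance index: time to update one DOF on one core."""
    return wall_clock_time * n_cores / (n_dofs * n_time_steps * n_rk_stages)

def median_pid(wall_clock_times, n_cores, n_dofs, n_time_steps, n_rk_stages):
    """Median PID over repeated runs (five repetitions in this study)."""
    return median(pid(t, n_cores, n_dofs, n_time_steps, n_rk_stages)
                  for t in wall_clock_times)

def parallel_efficiency(pid_baseline, pid_k):
    """Efficiency on k nodes relative to the baseline, in percent."""
    return pid_baseline / pid_k * 100.0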

In Figure 4.23, the parallel efficiency of the explicit and the implicit scheme, with either no preconditioner or the LU/ILU(0) NoFillIn preconditioner, is plotted over the number of processors for all cases listed in Table 4.2. For each problem size, the efficiency starts at 100% with the baseline run on the lowest number of processors. For the explicit scheme, the top left plot in Figure 4.23 shows superlinear scaling for almost all cases.

This scaling above 100% can be attributed to caching effects of the processors or to poorly performing baseline simulations. The last symbols show very large error bars, since the number of DOFs per core is decreased down to a minimum of 9 elements per processor. In this case, the internal work is small compared to the communication between the cores. Hence, for the highest number of cores in each case, other jobs running on the supercomputer at the same time significantly impact the parallel performance and induce the large error bars. The parallel efficiencies of the implicit schemes show a similar behavior: no preconditioning exhibits the strongest superlinear scaling, followed by the ILU(0) NoFillIn preconditioner, however with a large loss in performance at the last two symbols (nine/eighteen elements per core) for each problem size. The LU preconditioner shows only small fluctuations around 100%. The parallel efficiency represents only relative results with respect to the baseline performance. If we want to compare the different solvers, we need to look at the total PID, which is plotted in Figure 4.24.


[Figure: four panels (explicit, implicit-noPrecond, implicit-LU, implicit-ILU0 NoFillIn) showing the parallel efficiency in % over the number of cores for the meshes 6^3·2^1 to 6^3·2^9.]

Figure 4.23: Parallel efficiency of a strong scaling of the explicit and implicit solver for different mesh sizes, corresponding to each color. The baseline simulation is generally computed on a single node (24 cores), except for noPrecond and LU due to memory restrictions. The last symbol of each line corresponds to the case in which each processor owns only 9 elements. Every simulation is recomputed five times in order to obtain a statistical median. The deviations are represented by error bars.

[Figure: four panels (explicit, implicit-noPrecond, implicit-LU, implicit-ILU0 NoFillIn) showing the PID in µs/DOF over the number of DOFs per core for the meshes 6^3·2^0 to 6^3·2^9.]

Figure 4.24: Performance index of a strong scaling of the explicit and implicit solver for different preconditioners and different mesh sizes, corresponding to each color. Here, the first symbol of each line corresponds to the case in which each processor owns only 9 elements. Every simulation is recomputed five times in order to obtain a statistical median. The deviations are represented by error bars.


Here, the x-axis shows the number of DOFs per core, so that the highest number of cores is represented by the first symbols and the baseline point corresponds to the rightmost symbol of each case. Hence, the highest variability of the PID is visible for the smallest numbers of DOFs per core. For the explicit case, the top left plot of Figure 4.24 shows that the PID is independent of the problem size. While each case possesses a different time step size due to the CFL condition, the computational effort per DOF remains the same. For small numbers of DOFs per core, the PID can decrease, which is attributed to caching effects where most of the data fits into the cache of the processor. However, due to other jobs running on the supercomputer at the same time, the variability of the results increases. This dependency on the shared network resources is not desirable, so that 10,000 DOFs per core is a useful guideline for achieving good performance.

In contrast, the PID of the implicit scheme in Figure 4.24, which uses time steps 1,000 times larger than those of the explicit scheme, varies with the problem size. This is due to the dependence of the number of GMRES iterations on the physics of the problem, whereas the PID of the explicit scheme is related to the call of one DG operator, whose cost is independent of the time step size. Hence, it is not possible to compare the PID across different resolutions for a given preconditioner, but only the PID of each resolution case across the different implicit solvers. We choose a constant CFL number for all problem sizes in order to compare the PID among the different preconditioners. However, in each case the coarsest mesh obviously corresponds to the highest PID due to the largest time step.

Since the PID refers to the update of one time stage and the explicit PID stays around one, the implicit scheme is faster if its PID is approximately lower than 1,000. Thus, we can highlight that implicit schemes are more efficient for very fine meshes. The differences of the PID values between the illustrated preconditioners are not due to parallel communication, since the preconditioners are applied element-locally, but rather due to the variations in the number of GMRES iterations. GMRES operates on global vectors considering all grid elements, so that several communications are necessary per GMRES iteration. Note that the number of GMRES iterations remains constant for different numbers of cores, which is again explained by the element-local property. It is remarkable that for every preconditioner and spatial resolution the PID is minimized around the guideline of 10,000 DOFs per core. The worst preconditioner, with the highest values, is the LU decomposition for every problem size. The ILU(0) NoFillIn preconditioner provides the most efficient acceleration in every case. Using no preconditioner results in a very high PID for very low numbers of DOFs per core, since the work on one processor is smaller than with a preconditioner and hence the communication dominates the computational effort. At the guideline of 10,000 DOFs per core, no preconditioning shows a smaller PID than the LU preconditioning for cases 8, 9, and 10. Still, the implicit solver with ILU(0) NoFillIn is the fastest scheme.

Comparing the explicit and the implicit time stepping methods, the explicit scheme needs 1,000 time steps for the given setting while the implicit method needs only one. Thus, for 10,000 DOFs per core, implicit time integration with ILU(0) NoFillIn outperforms the explicit scheme for every spatial resolution case, and by a factor of more than 6 on the finest grid.
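As a rough estimate, neglecting the different numbers of RK stages of the explicit and implicit schemes and assuming an explicit PID of about 1 µs/DOF as observed in Figure 4.24, the speed-up of the implicit over the explicit scheme can be written as

\frac{t_{\text{explicit}}}{t_{\text{implicit}}} \approx \frac{\Delta t_{\text{implicit}}}{\Delta t_{\text{explicit}}} \cdot \frac{\text{PID}_{\text{explicit}}}{\text{PID}_{\text{implicit}}} \approx \frac{1000 \cdot 1\,\mu\text{s/DOF}}{\text{PID}_{\text{implicit}}} ,

so that an implicit PID below roughly 1,000 µs/DOF implies a net gain; an implicit PID of about 160 µs/DOF, for example, would correspond to the factor of more than 6 observed on the finest grid.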