

8.3 Measurements

8.3.1 AthenaMP combinations

With AthenaMP, OC is easily done on an individual VM. It was already found that running only one AthenaMP instance in OC mode does not yield any benefits in the case of local data, because it does not influence the overall profile, see Subsection 8.2.1. Running multiple instances in parallel, for example two as in Subsection 8.2.2, gives the expected improvement in CPU efficiency. How many instances and which combination give the best result remains to be answered. Possible scenarios are plentiful, e.g. running as many instances as possible as parallel processes. On a 4-core VM, adding one additional process could look like: 4+1; 3+2; 3+1+1; 2+2+1; 2+1+1+1 and 1+1+1+1+1. The numbers indicate how many parallel processes are run in one AthenaMP instance; different instances are separated by a plus sign.
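The combinations listed above are, in effect, the integer partitions of the total process count (five processes on the 4-core VM), excluding the single-instance case. The following is a minimal sketch that enumerates them; the function and the cut-off are illustrative only and not part of the production setup:

    def partitions(n, max_part=None):
        """Yield all integer partitions of n as descending tuples."""
        if max_part is None:
            max_part = n
        if n == 0:
            yield ()
            return
        for first in range(min(n, max_part), 0, -1):
            for rest in partitions(n - first, first):
                yield (first,) + rest

    # Five processes on a 4-core VM; limiting the largest instance to four
    # processes excludes the trivial single-instance partition "5".
    for combo in partitions(5, max_part=4):
        print("+".join(str(p) for p in combo))
    # -> 4+1, 3+2, 3+1+1, 2+2+1, 2+1+1+1, 1+1+1+1+1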

In addition to that, there can be more than five parallel processes on a 4-core VM.

Some scenarios can already be excluded beforehand. If there are too many parallel processes, the RAM requirement becomes too high, whereas the CPU efficiency does not benefit.

For the scenario with many AthenaMP instances, the memory footprint is larger than for the scenario with fewer instances. This is due to the fact that the processes share part of their memory to reduce redundancies. AthenaMP increases the sharing by having the processes share additional data, such as the detector geometry. This additional AthenaMP sharing only takes place between processes within the same instance.
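To make the memory argument concrete, a rough, purely illustrative model: the AthenaMP-shared data is paid once per instance and the private data once per process, so splitting the same number of processes across more instances duplicates the shared part. The numbers below are placeholders, not measured values:

    def estimated_footprint(instances, shared_per_instance_gb, private_per_process_gb):
        """Rough memory model: the AthenaMP-shared data (e.g. detector geometry)
        is counted once per instance, the private part once per process."""
        return sum(shared_per_instance_gb + n * private_per_process_gb
                   for n in instances)

    shared, private = 1.5, 0.7  # GB, placeholder values for illustration only
    print(estimated_footprint([8], shared, private))           # one 8-process instance
    print(estimated_footprint([2, 2, 2, 2], shared, private))  # four 2-process instances
    # The second layout pays the shared part four times, hence the larger footprint.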

After excluding some scenarios, what remains is to measure and compare. The results of the initial measurements are shown in Figure 8.6.

Figure 8.6: Runtimes of different AthenaMP OC configurations. The standard scenario of eight processes on the 8-core configuration is marked in blue. Note: for better readability, the wall time does not start at zero.

At first glance, this looks surprising, as all scenarios seem to take less time than the standard configuration. Even the ones that are not overcommitted, such as 7+1, took less time. The explanation can be found in the way the jobs are executed and in the resulting files. First of all, a fair comparison has to be ensured; the interpretation of the results is done afterwards. Invoking AthenaMP once with eight parallel processes delivers the results in a single merged file. In the case of the 7+1 invocation, the results are split among two files, one containing 7/8 of the results, the other 1/8. These two files would have to be merged in addition.

If further merging steps are considered for the multiple AthenaMP instance scenarios, the durations change accordingly. This means that the wall time of, for example, the 7+1 scenario has to be increased by the time it takes to merge the two resulting output files. This additional merging time, which has to be added to all scenarios with multiple output files, is extrapolated from the standard scenario¹. In Figure 8.7, the wall time corrections of the additional merging have been applied.
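One way to express the applied correction is sketched below, assuming the per-file merging time is extrapolated from the standard single-output scenario and scales with the number of extra output files; the function name and the numbers are assumptions of this illustration, not the actual bookkeeping used:

    def corrected_wall_time(wall_time_s, n_output_files, merge_time_standard_s):
        """Add the extrapolated time needed to merge the additional output files.

        A single-instance job already ends with one merged file, so only the
        extra (n_output_files - 1) merge steps are charged on top.
        """
        extra_merges = max(n_output_files - 1, 0)
        return wall_time_s + extra_merges * merge_time_standard_s

    # e.g. the 7+1 scenario produces two output files that still need one merge
    # (all numbers are placeholders).
    print(corrected_wall_time(wall_time_s=20000, n_output_files=2,
                              merge_time_standard_s=600))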

Figure 8.7: Runtimes of different AthenaMP OC configurations after applying merging corrections. The standard scenario of eight processes on the 8-core configuration is marked in blue. Note: for better readability, the wall time does not start at zero.

What can be seen is that the order of the combinations changes, as some would have to undergo more merging than others. More importantly, the difference between the standard eight-process scenario and the others shrinks significantly. Some scenarios even take longer than the standard one.

In order to avoid differences due to the merging, a workaround has been found. Instead of comparing single jobs with each other, sets of jobs were compared. This was done in the following way: in the usual scenario, four jobs run successively and produce four output files. The output is therefore the same as for the 2+2+2+2 scenario that processes four times the workload of a usual job. In both cases, four output files, containing roughly 3000 physics events each, are produced. These are not merged further. Figure 8.8 compares the runtime, divided by four, of these scenarios.
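The comparison under this workaround then reduces to dividing the wall time of the combined job by the number of standard jobs it replaces; a minimal sketch with placeholder numbers:

    def per_job_wall_time(total_wall_time_s, jobs_replaced=4):
        """Wall time per equivalent standard job for a combined scenario
        (e.g. 2+2+2+2 processing four times the usual workload)."""
        return total_wall_time_s / jobs_replaced

    # Compare against the average wall time of four standard jobs run successively
    # (all values are invented placeholders).
    combined = per_job_wall_time(76000)               # combined 2+2+2+2 job
    standard = sum([20000, 19800, 20100, 20050]) / 4  # four standard jobs
    print(combined, standard)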

Figure 8.8: Runtimes of different AthenaMP configurations after applying the workaround. Not all scenarios are overcommitted. Note: for better readability, the wall time does not start at zero.

Without the need to apply merging corrections, the difference in the runtime is significant. This holds even for cases in which technically no OC is performed (2+2+2+2 and 4+4). The speed-up in the non-OC cases can be explained by the additional flexibility and by the fact that multiple cores are active during the merging. This becomes apparent when looking at the profile of the 2+2+2+2 scenario, shown in Figure 8.9.

¹ Initially, it was attempted to compute and measure the additional merging steps manually. Due to configuration errors, they could not be executed correctly and the results were inconsistent with the original 8-core AthenaMP job. After consulting experts, it was decided that extrapolating would be the better alternative.

Figure 8.9: Profile for the 2+2+2+2 scenario. Note: no logarithmic scale is used for the network and disk activity.

Especially towards the end, after 18000 s, and in between the RAWtoESD and ESDtoAOD steps, between 16000 s and 17000 s, the CPU usage stays high. In the usual scenario, the CPU usage would drop down to 12.5% during these serial merging steps.

This can be seen, for example, after 9000 s in Figure 8.4.
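The 12.5% plateau is simply one active core out of eight during the serial merge; the second figure below is an illustrative assumption for the 2+2+2+2 case (four overlapping single-core merges), not a measured value:

    # Standard scenario: a single-core merge on an 8-core VM.
    print(1 / 8 * 100)   # 12.5 % CPU utilisation during the serial merging step
    # Illustrative 2+2+2+2 case with four overlapping single-core merges.
    print(4 / 8 * 100)   # 50.0 % (assumption for illustration, not a measurement)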

What has to be highlighted at this point is that the single-core merging step does not reach the disk read/write limit². This is why multiple CPU cores can execute different single-core merging processes at the same time without being stuck in I/O wait.³ Finally, this explains the ordering of the processes in Figure 8.9. The higher the number of AthenaMP instances, the better the CPU utilisation during the merging period, and the shorter the overall wall time. This is limited by the memory footprint, which increases with each additional instance.
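The argument that several single-core merges fit under the disk read/write limit can be phrased as a back-of-the-envelope check; the rates below are invented placeholders, not measurements from the test setup:

    def merges_fit_on_disk(n_merges, io_rate_per_merge_mb_s, disk_limit_mb_s):
        """True if n concurrent single-core merge processes stay below the
        disk read/write limit, i.e. none of them is forced into I/O wait."""
        return n_merges * io_rate_per_merge_mb_s <= disk_limit_mb_s

    # Placeholder rates: four concurrent merges at 40 MB/s against a 400 MB/s disk.
    print(merges_fit_on_disk(4, 40, 400))  # True: the CPU, not the disk, is the bottleneck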

In conclusion, this section has shown that there are many possibilities to overcommit. Due to the particulars of the merging, the comparison between all scenarios is not trivial. The combinations that provide the highest event throughput are the ones that exploit the serialisation part the most. Instead of testing all possible combinations, a better way to fully use the CPU cycles during the serialisation is described in Subsection 8.3.3.