
The ATLAS reconstruction software was sped up by a factor of more than three for Run 2 events compared to the software at the end of Run 1. Table 11 shows the impact of the different changes on runtime. These measurements represent the state of the software at different times during development and show how each set of changes affects reconstruction in a modern environment. The increase in runtime from release 17.7.0 to 18.9.50 for the new data set shows that improvements for one data set can lead to worse performance for another. Where an effect could be attributed to one change or a small set of changes, it was measured separately; these changes are shown in Table 12.

5.7.1 How the Results were Measured

The differences between the Run 1 events and the Run 2 events, as defined in Subsection 4.1.1, demonstrate how strongly the choice of input data has to be taken into consideration when measuring the impact of a change. Some changes decreased the speed for Run 1 events while having a very positive effect when processing Run 2 events. I could not conduct the measurements independently of one another, as some optimizations only work in a release that already includes other optimizations, which may affect the outcome. Switching from building with 32-bit to 64-bit register support increased the memory usage of the reconstruction software by 20% while runtime decreased by the same amount. Compilation in an SLC 6 environment using GCC 4.6 leads to a slight increase in runtime for Run 1 events and a slight decrease for Run 2 events compared to compilation under SLC 5 with GCC 4.3.

For these two tests it should be stressed that the test machine ran SLC 6 in both cases, so the SLC 5 compiled releases were also run under SLC 6. The same holds for the tests with 32-bit and 64-bit compiled releases; for the tests shown in Table 12, SLC 6 was used in both cases.

The speedup from running on SLC 5 to SLC 6 is around 10%, while the slowdown from SLC 5 compiled binaries to SLC 6 compiled binaries was only 4%, so in total reconstruction was faster also for Run 1 events after the change. The IMF preload showed only a 2-3% speedup with respect to GNU libm in later tests using SLC 6, indicating that many performance inefficiencies in libm have been fixed since SLC 5, when the original tests had been performed. The new magnetic field was tested using a wrapper that introduced even more unit conversions instead of the newly implemented interface, because the wrapper allowed testing with the same release without requiring code changes wherever the field is accessed.

Therefore, not the full impact of the change was visible, only the effect of the cache and the slightly improved internal memory access, which led to worse performance when running with old data. Unit tests had shown better field performance, which also becomes visible when running with Run 2 events. The impact is even higher in newer releases, where the magnetic field is not called through a wrapper but directly, avoiding a deep call chain and several unit conversions. Changes in how, and how often, the field is called make a comparison across releases difficult though.

The Eigen migration and the Event Data Model update took 11 months in total, so many other improvements went into the release at the same time. It is therefore not possible to express their effect in a single number, which is why they are not listed in this table. All seeding optimizations combined had a large impact, reducing runtime by more than one third for Run 2 events. The largest single improvement stems from the backtracking update, which significantly reduces the number of times the combinatorial track filter needs to run. It serves as an example of the effect an algorithmic change can have and as motivation for the new tracking methods presented in Chapter 6. The backtracking update is particularly effective for expected future events, almost doubling reconstruction speed.

Multiplying all measured speedups leads to a theoretical speedup factor of 1.66 for Run 1 events and 4.4 for Run 2 events, compared to the measured actual speedup between the first and last release of 2.25 and 4.51, respectively. The discrepancy is due to interactions between the different optimizations. The predicted and the actual improvement differ by 26% for Run 1 events and by only 2% for Run 2 events, which is not much considering the large number of projects and the large gains achieved. The software is in much better shape now thanks to the cleanup projects, e.g. in the magnetic field, the EDM and the Eigen migration, which are only partially reflected in computational performance but have a huge impact on maintainability.
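The comparison can be made explicit. Denoting the individual speedup factors from Table 12 by S_i and assuming, as a rough estimate, that they act independently, the expected combined speedup is their product; the relative deviation from the measured overall speedup then follows directly from the numbers quoted above:

\[
S_{\text{theory}} = \prod_i S_i, \qquad
\frac{S_{\text{meas}} - S_{\text{theory}}}{S_{\text{meas}}} =
\begin{cases}
\dfrac{2.25 - 1.66}{2.25} \approx 26\% & \text{(Run 1 events)}\\[6pt]
\dfrac{4.51 - 4.40}{4.51} \approx 2\% & \text{(Run 2 events)}
\end{cases}
\]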

Figure 55: Core cycles and instructions retired for releases 17 (upper) and 19 (lower). The number of instructions retired per core cycle is worse for the newer release than for the older one. Nevertheless, the newer release processes more events in the same time. The number of events was chosen to achieve similar runtimes.

5.7.2 Interpretation of the Results

The results highlight the importance of testing changes with the data expected during production. They also show that the effect of a change can differ between the environment in which it was first tested and the final environment, which includes all other improvements and may have changed in further, unpredicted ways. An improvement can become unnecessary or even potentially harmful: both the trigonometric functions of libm and the system allocator of SLC 6 now deliver results comparable to the libraries they were replaced with. The difference is that the system libraries may be updated during normal system maintenance or during a switch to a newer OS. These libraries may then perform better, which would go unnoticed if no new tests are performed.

The performance of the software measured in instructions per cycle changed for the worse, as the comparison between release 17 and release 19 in Figure 55 shows. This means the low-level hardware utilization has not improved. At the same time, the goals set for the speed improvement were met. The improvements to the ATLAS code are algorithmic rather than improving the exploitation of hardware features, although some, such as the new EDM, prepare for better hardware utilization in the future. It also indicates that the libraries used hardly improve hardware utilization either, but rather the algorithms employed.
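The seemingly contradictory observation of Figure 55, a lower IPC yet more events processed in the same time, follows from a simple relation, assuming a fixed clock frequency f: the runtime per event is the number of retired instructions divided by the product of IPC and f, so a sufficiently large algorithmic reduction of the instruction count per event outweighs a drop in IPC.

\[
t_{\text{event}} = \frac{N_{\text{instructions}}}{\mathrm{IPC} \cdot f}
\]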

The xAOD format marks an important step from an object-based design towards Structure of Arrays style storage of data. The chosen access pattern through a wrapper allows the data to be used in a similar way as before. Eliminating an intermediate conversion step for analysis speeds up the internal workflow and opens the door to vectorization by improving data locality. Vectorization was not achieved using vectorizing libraries because of the matrix dimensions typically used. The introduction of the Eigen library, while improving speed, did not deliver the desired vectorization, which could have served as an example of outsourcing this complexity. While efforts to include vectorization in computationally expensive code regions persist, a higher priority for many groups is the preparation of the ATLAS code for parallelization through multithreading.
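A minimal sketch in C++ illustrates the idea; the type names are hypothetical and do not correspond to the actual xAOD classes. Each attribute is stored in one contiguous array, and a lightweight proxy restores the familiar object-style access:

    #include <cstddef>
    #include <vector>

    // Structure of Arrays: one contiguous vector per attribute,
    // which improves data locality and is amenable to vectorization.
    struct TrackContainer {
        std::vector<float> pt, eta, phi;
    };

    // Lightweight wrapper giving object-like access, so client code
    // can keep calling pt(), eta() and phi() as before.
    class TrackProxy {
    public:
        TrackProxy(const TrackContainer& c, std::size_t i) : m_c(c), m_i(i) {}
        float pt()  const { return m_c.pt[m_i];  }
        float eta() const { return m_c.eta[m_i]; }
        float phi() const { return m_c.phi[m_i]; }
    private:
        const TrackContainer& m_c;
        std::size_t m_i;
    };

Because all values of one attribute are adjacent in memory, loops over a single variable become cache-friendly and, in principle, vectorizable, while the per-object interface of the client code is preserved.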

The optimizations with the highest impact during LS1 were algorithmic optimizations in relatively small parts of the code. While the affected algorithms consumed a lot of CPU time, changing them in some cases also affected the amount of CPU time spent in subsequent algorithms. Many of the changes affected the final result, which always requires the approval of the groups concerned with physics efficiency, but can also only be judged by a physicist able to distinguish between important and less important results. A workflow to streamline the collaboration between physicists and computer scientists could therefore improve code quality and speed. Some deficiencies might have been spotted sooner if a clear code review scheme had been in place that did not leave single persons responsible for portions of the code. Another problem that could be addressed by a well-defined workflow is the unclear responsibility for certain code parts, some of which were written by people who have already left ATLAS.

Despite the improvements to the ATLAS code during LS1, large parts of the code could profit from a thorough analysis and optimization, which is why I expect the largest potential with the smallest effort here. Parallelization through multithreading is currently being implemented. The importance of multithreading depends largely on the development of the CPU market, which is expected to increase the number of cores faster than memory, to a point where ATLAS is no longer able to utilize them by multi-processing alone.

6 Analysis and Implementation of Tracking Improvements

Tracking is, as the analysis in Chapter 4 shows, one of the most expensive steps in the current implementation of the ATLAS reconstruction, with a runtime increasing polynomially with the number of space points. Through in-depth optimizations, the performance goals could be met, but the complexity remains a problem for the medium-term requirements and will become unmanageable in the long term. Tests show that a 10-fold increase in pileup, which is the workload expected for the HL-LHC, leads to a 150-fold increase in runtime, see [60]. One of the technical solutions to cope with the increasing computing problem is to exploit idle parallel resources. In Section 4.7 I have shown that the tracking can be efficiently parallelized. In this chapter I explore the competitiveness of GPUs and CPUs for tracking. Alternatively, I present a low accuracy and low complexity tracking approach which has the potential to reduce the tracking time to an insignificant factor.
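Assuming, for a rough estimate, that the tracking runtime follows a power law in the pileup \(\mu\), the measured 150-fold increase for a 10-fold pileup increase corresponds to an effective exponent of about 2.2:

\[
\frac{t(10\mu)}{t(\mu)} = 10^{k} \approx 150
\quad\Longrightarrow\quad
k = \frac{\log 150}{\log 10} \approx 2.2,
\]

i.e. a worse-than-quadratic growth of the tracking time with pileup, which motivates the reduced-complexity approach presented in this chapter.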

This low complexity approach is suitable both for the expected shift in computational resources towards highly parallel hardware and for the increasing complexity of the data. This chapter first presents the low complexity tracking approach and then the comparative study of GPU and CPU parallel tracking.