
The reconstruction software has grown over decades of development, starting even before ATLAS was founded, since simulations had to be conducted for the ATLAS design studies before its approval. Pre-Athena code was written in Fortran, but Fortran was abandoned in favour of C++ with the introduction of Athena in 2000. Much of the code was converted to C++ over several years, while some modules were only converted recently.

[Figure: Computing resource consumption of all jobs in 2012. Simulation 47.8%, Reconstruction 22.1%, Analysis 20.7%, Other 9.4%.]

Figure 23: Developers with at least one code submission per quarter. The colors refer to the different development domains. Since the end of Run 1 in 2012 the number has slowly recovered, but it is still far below its 2009 level [4].

Thousands of modules have been written since then by thousands of developers, most of whom are no longer part of ATLAS. Many of the developers were PhD students who do not stay with ATLAS after finishing their studies. This leads to quick developer turnover and abandoned code. The current ATLAS svn repository contains 455 Athena algorithms in 2293 packages and 7 million lines of code [4]. An analysis of the SVN logs shows that the number of developers committing code at least once within 3 months has seen a steady decline since the start of Run 1 (see Figure 23), falling below 50% of the previous 750 developers per quarter by the beginning of LS 1. Most developers have a physics background with little focus on computer science, although the small number of people who are responsible for the majority of svn activities (see Figure 24) have experience in software development and follow the ATLAS coding rules. Both figures illustrate that ATLAS depends on a small core developer team. To reduce this dependency, ATLAS needs to increase the number of skilled and dedicated developers.

Figure 24: Load distribution of software development in ATLAS measured by number of svn commits per person. In 2014, 42 developers contributed 50% of the changes. Data extracted from svn.

Figure 25: Number of new software package versions committed to ATLAS SVN each month. Data extracted from svn.

The heterogeneity of developers means that many parts of the software are developed by people without formal education in computing, a problem common in scientific computing [50], [51]. This is in part due to the fact that the institutes that are part of ATLAS have an obligation to participate in activities required for operation. An estimated 600 people are required for the operation of ATLAS, about one sixth of its 3600 members [52]. These activities include monitoring of the detector, hardware and software calibration, and software development. Almost all ATLAS institutes come from the field of physics, and software development is not a fundamental part of most physics study programs. This conflict leaves results short of what could be achieved with a comparable workforce of dedicated developers. As a consequence, the software contains inefficiencies, untested and unreviewed code, memory leaks and unsafe pointer handling, and is not well documented. This situation is made possible by the insufficient code quality assurance mechanisms in ATLAS, which are unsuitable for a project with thousands of code changes every month (see Figure 25). The introduction of Jira [53] as an issue tracking and management system led to clearer responsibilities, so that many of the problems found by the Coverity static code checker [54] were tracked and fixed. While Coverity led to the discovery and elimination of many problems, it does not replace tests, which can assure not only that the behaviour is not undefined, but also that it is the desired behaviour. Currently, the only automated tests are those integrated into the ATLAS nightly build system.
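To make this distinction concrete, the following minimal sketch uses an invented helper (totalMomentum, not taken from the ATLAS code base): a static checker finds no undefined behaviour in it even if the formula were wrong, whereas a simple test against a known value verifies the desired behaviour.

    // Illustrative only: hypothetical helper and test, not actual ATLAS code.
    #include <cassert>
    #include <cmath>

    // Total momentum from transverse momentum and pseudorapidity: p = pT * cosh(eta).
    // A wrong formula here would contain no undefined behaviour for a static
    // checker to flag; only a test against an expected value catches it.
    double totalMomentum(double pt, double eta) {
        return pt * std::cosh(eta);
    }

    int main() {
        // At eta = 0 the total momentum must equal the transverse momentum.
        assert(std::abs(totalMomentum(42.0, 0.0) - 42.0) < 1e-9);
        return 0;
    }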


3.4.1 Building and Testing in ATLAS

The ATLAS build system distinguishes final releases, development (dev) nightlies, development-validation (devval) nightlies and migration releases. Final releases represent a fully consistent state of the software that should be bug free. Development nightlies should also be free of major bugs, but represent ongoing development. Within devval nightlies, developers can test whether their changes work well with a full build before they are integrated into dev. Migration releases are used to implement changes that will break the release for a prolonged period, so that all changes can be integrated into a dev or devval release once the migration is complete. Release numbers follow the scheme W.X.Y.Z: final releases are defined by the numbers W for major changes and X for minor changes, while Y and Z are used for bug fixes and performance improvements. The dev and devval nightlies have a one-week cycle: they are rebuilt every night, overwriting the build that is one week old. A change is always first included in the devval release, from where it can be promoted to dev once the developer deems it stable. These multiple levels of fully built releases prevent non-working code from going into production but require a lot of resources, as a full build on a dedicated high-end machine takes about 9 hours. It also means that an error makes the whole release unusable for the entire developer base until the error is fixed and the release rebuilt, which happens the following day at the earliest.
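To illustrate the weekly cycle, the following minimal sketch computes which nightly slot tonight's build would overwrite, assuming seven slots indexed by weekday; the slot naming is chosen for illustration and is not necessarily the convention used by ATLAS.

    // Minimal sketch of a rotating seven-slot nightly scheme; the slot naming
    // is illustrative and not necessarily what the ATLAS build system uses.
    #include <ctime>
    #include <iostream>
    #include <string>

    // Tonight's build overwrites the slot of the current weekday, which still
    // holds the build from exactly one week ago.
    std::string nightlySlotForToday() {
        std::time_t now = std::time(nullptr);
        std::tm* local = std::localtime(&now);
        return "rel_" + std::to_string(local->tm_wday);  // 0 = Sunday ... 6 = Saturday
    }

    int main() {
        std::cout << "Tonight's build replaces " << nightlySlotForToday() << "\n";
        return 0;
    }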

A release may be validated with respect to its physics results. Such a release can be used in production and is barred from changes that affect an algorithm's result, except for critical bugfixes ("frozen"), to maintain reproducibility. A release is usually validated for either simulation or reconstruction but not for both, because the time schedules for the two purposes differ. For this reason, separate versions are validated separately and, due to the policy of frozen releases, development also continues separately.

For performance monitoring, a Runtime Tester (RTT) system has been set up. This system runs a series of predefined jobs after a new nightly has been built, scans the logs for crashes, warnings and errors, and allows inspecting the logs and results. This is a minimal test of whether a whole domain runs without crashing; checking the results for sanity remains manual and is done at the developers' discretion.
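The log scan itself is conceptually simple; the sketch below is a hypothetical illustration of such a post-job scan (the patterns, file handling and exit codes are assumptions, not the actual RTT implementation).

    // Hypothetical post-job log scan in the spirit of the RTT step described
    // above; the patterns and the reporting format are assumptions.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char* argv[]) {
        if (argc < 2) { std::cerr << "usage: scanlog <logfile>\n"; return 2; }

        // Patterns that typically indicate a failed or suspicious job.
        const std::vector<std::string> patterns = {"FATAL", "ERROR", "WARNING",
                                                   "segmentation violation"};
        std::ifstream log(argv[1]);
        std::string line;
        std::size_t lineNo = 0, hits = 0;
        while (std::getline(log, line)) {
            ++lineNo;
            for (const auto& p : patterns) {
                if (line.find(p) != std::string::npos) {
                    std::cout << lineNo << ": " << line << "\n";
                    ++hits;
                    break;
                }
            }
        }
        // A non-zero exit code lets the nightly system flag the job for inspection.
        return hits == 0 ? 0 : 1;
    }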

Performance is monitored in a similar way, comparing performance between two releases or nightlies. Before allowing a release into production, a physics validation is performed, which is the process of manually verifying the sanity of the results of the whole software chain. In modules responsible for one percent or more of the total CPU time, inefficiencies are spotted faster, because usually only experienced developers are assigned to work on such code. The vast number of modules requiring only a few milliseconds per event remains a problem, as they are too small to attract attention and remain unchanged even though they might have become completely unnecessary. A study has shown hundreds of tables in a job's output that are created but never filled. Another problem is that the reconstruction software is divided into different domains, reflecting the organizational structure of ATLAS. In some cases, responsibilities are unclear where the areas of multiple groups are touched, slowing down development.
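Such a release-to-release comparison amounts to diffing per-algorithm timing summaries; the sketch below is a hypothetical illustration (the input maps, algorithm names, numbers and the 5% threshold are assumptions, not the actual ATLAS monitoring code).

    // Hypothetical comparison of per-algorithm CPU time between two releases;
    // names, numbers and the threshold are invented for illustration.
    #include <iostream>
    #include <map>
    #include <string>

    // Report algorithms whose time per event grew by more than `threshold`
    // (relative) from the reference release to the candidate release.
    void reportRegressions(const std::map<std::string, double>& reference,
                           const std::map<std::string, double>& candidate,
                           double threshold = 0.05) {
        for (const auto& [name, newTime] : candidate) {
            auto it = reference.find(name);
            if (it == reference.end()) continue;  // new algorithm, nothing to compare against
            if (newTime > it->second * (1.0 + threshold))
                std::cout << name << ": " << it->second << " ms -> " << newTime
                          << " ms per event\n";
        }
    }

    int main() {
        std::map<std::string, double> previousRelease = {{"TrackFinding", 320.0}, {"Clustering", 45.0}};
        std::map<std::string, double> newRelease      = {{"TrackFinding", 410.0}, {"Clustering", 44.0}};
        reportRegressions(previousRelease, newRelease);
        return 0;
    }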

3.4.2 EDM Design Considerations

Considering the development of the C++ language over the past years, many well-meant decisions affecting large parts of the software should be revised. Generally, a strictly object-oriented design was enforced, leading to very deep inheritance chains. Members of EDM classes were created lazily, although they are in practice always accessed. Lazy initialization means that allocation and initialization take place at the time of access instead of at the time of object creation. This can save computing time and memory if the members are not always accessed; however, dynamic memory allocation is expensive, so time is wasted when dynamic memory is allocated for multiple small objects rather than for all of them at once. Other parts of the EDM were identified through dynamic casts, as the associated cost was expected to diminish to that of an integer comparison in future C++ versions, which never happened. Some design choices, such as the originally strictly object-oriented structure intended to allow easy maintenance, were not made for high performance. This illustrates that the software was built not for high performance, but to perform the needed work. The consequences of these decisions continue to affect performance to this day.

Figure 26: Seven performance dimensions as taken from [55]. The dimensions are orthogonal to one another with the exception of symmetric multithreading.
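To illustrate the two EDM cost patterns discussed above, the following sketch (with invented class names, not the actual ATLAS EDM) contrasts lazily allocated members with eager ones, and dynamic_cast-based type identification with a stored integer type ID. If the members are read for every event anyway, the lazy variant only adds heap allocations and branches without saving anything.

    // Illustrative sketch only; the classes are invented, not the ATLAS EDM.
    #include <iostream>
    #include <memory>

    // Lazy initialization: each member is heap-allocated on first access.
    // Worthwhile only if members are often never read; otherwise it adds an
    // allocation and a branch per member compared to plain eager members.
    class LazyParticle {
    public:
        double& momentum() {
            if (!m_momentum) m_momentum = std::make_unique<double>(0.0);
            return *m_momentum;
        }
        double& energy() {
            if (!m_energy) m_energy = std::make_unique<double>(0.0);
            return *m_energy;
        }
    private:
        std::unique_ptr<double> m_momentum;  // empty until first access
        std::unique_ptr<double> m_energy;
    };

    // Eager alternative: one allocation for the whole object, no branch on access.
    class EagerParticle {
    public:
        double& momentum() { return m_momentum; }
        double& energy()   { return m_energy; }
    private:
        double m_momentum = 0.0;
        double m_energy   = 0.0;
    };

    // Type identification: dynamic_cast inspects the inheritance hierarchy at
    // run time, whereas a stored integer type ID reduces the check to a single
    // comparison; the hoped-for cheapening of dynamic_cast itself never came.
    struct EDMObject { virtual ~EDMObject() = default; int typeId = 0; };
    struct Cluster : EDMObject { Cluster() { typeId = 1; } };

    bool isClusterViaCast(const EDMObject* o) { return dynamic_cast<const Cluster*>(o) != nullptr; }
    bool isClusterViaId(const EDMObject* o)   { return o->typeId == 1; }

    int main() {
        Cluster c;
        LazyParticle p;
        p.momentum() = 12.3;  // triggers the hidden allocation
        std::cout << isClusterViaCast(&c) << " " << isClusterViaId(&c) << "\n";
        return 0;
    }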

At the time, staying within the permissible resource limits was not problematic, as computing resources were sufficient for the problem size. This changes drastically for Run 2, on the one hand because of the greater complexity and larger number of events, and on the other because of the change in hardware development explained in the next section.