
depending on the jobtype. Therefore, in order to process all events delivered by the ATLAS detector within a reasonable time, many jobs have to perform the same processing in parallel. Jobs doing the same processing on different input data are called “similar jobs”. This is not to be confused with executing a job twice on the same input data, which would be called the “same job”. Jobs that do not consist of the same transformations/substeps are “different jobs”, or of a “different jobtype”, even if the input dataset is the same.
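To make the distinction concrete, a minimal sketch follows; the Job structure and the relation() helper are purely illustrative and not part of the ATLAS production system.

from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    transformations: tuple  # ordered substeps of the jobtype, e.g. ("evgen",) or ("sim", "digi")
    input_data: str         # identifier of the processed input data

def relation(a: Job, b: Job) -> str:
    """Classify two jobs according to the terminology used above."""
    if a.transformations != b.transformations:
        return "different jobs"   # different jobtype, even for the same input dataset
    if a.input_data == b.input_data:
        return "same job"         # identical processing repeated on identical input
    return "similar jobs"         # identical processing on different input data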

On the bottom-most layer are the requests, see Figure 5.3. Requests consist of many different tasks that do different types of processing. They are typically logically linked together, meaning that the outputs of jobs from one task are inputs of jobs of another task of the same request. Thereby, requests can represent a whole processing chain, excluding analysis.

At certain points in time, data reprocessing campaigns, see Subsection 5.3.6, are started. A campaign includes all jobs, tasks or requests that were executed on the data that had to be reprocessed.

Conditions data

During data taking it is important to know the exact conditions, i.e. the state of the whole experiment, in order to achieve a precise simulation and reconstruction. This conditions data set consists of hundreds of parameters, including, for example, alignment, beam position, magnetic field map, cabling, calibration, corrections, detector status, noise, pulse shapes, timing and dead channels [109]. Each subsystem provides these parameters, which are stored in the conditions database. They are updated on the fly, some much more frequently than others. An example would be the cabling, which stays the same much longer than the alignment parameters.
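As a rough illustration of how such time-dependent parameters can be looked up, the following sketch stores each update with its timestamp and returns the value valid at a given time; it is a toy model, not the actual ATLAS conditions database interface.

import bisect

class ConditionsParameter:
    """Toy interval-of-validity store: a value recorded at time t is valid
    until the next update. Purely illustrative."""

    def __init__(self):
        self._times = []   # update timestamps, assumed monotonically increasing
        self._values = []

    def update(self, timestamp, value):
        self._times.append(timestamp)
        self._values.append(value)

    def value_at(self, timestamp):
        # latest update that is not later than the requested time
        i = bisect.bisect_right(self._times, timestamp) - 1
        if i < 0:
            raise KeyError("no conditions recorded before this time")
        return self._values[i]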

5.2 Monte Carlo simulation

Monte Carlo simulation generally consists of three subsequent computations: event generation, simulation and digitisation. Each part is described in the following subsections.

5.2.1 Event generation

Mostly external, meaning non-ATLAS-specific, tools are used to generate events. In the event generation, collisions (events) and immediate decay products are generated. This means everything that would happen in an actual collision up to the point when the particles hit the detector. Depending on the lifetime of the resulting particles (cτ < 10 mm), they are already decayed by the event generator and only the resulting particles are considered to interact with the detector. Particles with a longer lifetime are handled later on, by the simulation. Before the events are passed to the simulation, the specific beam conditions are applied.

3Limitations are e.g. the maximum lifetime of a Grid job. Another reason why the wall time should not be too long is error handling: if a long job fails towards the end, much computing time is wasted.
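A short sketch of the decay-length criterion described in the paragraph above; the 10 mm cut is taken from the text, while the particle representation and function name are illustrative.

C_TAU_CUT_MM = 10.0  # decay-length cut c*tau quoted above, in millimetres

def split_by_lifetime(particles):
    """Partition generated particles into those decayed already by the event
    generator (c*tau below the cut) and those handed to the detector simulation."""
    decayed_in_generator, passed_to_simulation = [], []
    for name, c_tau_mm in particles:
        if c_tau_mm < C_TAU_CUT_MM:
            decayed_in_generator.append(name)
        else:
            passed_to_simulation.append(name)
    return decayed_in_generator, passed_to_simulation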

The resulting data can be uniquely identified via the software version and inputs, e.g. job parameters and random seed [110]. For reproducibility, the processor chip architecture has to be taken into account, since it influences the pseudo-random numbers, which may be generated differently, especially between different manufacturers. The information about the parents of unstable particles is preserved.
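As an illustration of how the quantities listed above could identify a generation output, the sketch below hashes the software version, job parameters, random seed and CPU architecture into one key; this is a hypothetical scheme, not the bookkeeping actually used by ATLAS.

import hashlib
import json

def evgen_identity_key(software_version: str, job_parameters: dict,
                       random_seed: int, cpu_architecture: str) -> str:
    """Build a reproducibility identifier from the quantities named in the text
    (software version, job parameters, random seed) plus the chip architecture,
    which can change the generated pseudo-random numbers. Illustrative only."""
    payload = json.dumps(
        {
            "version": software_version,
            "parameters": job_parameters,
            "seed": random_seed,
            "architecture": cpu_architecture,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()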

In Figure 5.4 the profile of a single-core event generation job that was run on a four-core VM can be seen. Since the job does not process any data and the output file is very small, there is very little disk activity. The job is entirely CPU bound (25% CPU usage corresponds to one CPU core), using very little memory and almost no network bandwidth. There is almost no variation of the resource usage during the whole processing. The generator that was used was Pythia v8.8186.

Figure 5.4: Profile of a single-core event generation job run on the VM at CERN, processing 1000 events, using Athena version 19.2.3.6.
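The monitoring setup used to record these profiles is not detailed here; a minimal sketch of how such a profile could be sampled, assuming the psutil library is available, might look as follows.

import time
import psutil  # assumed third-party monitoring library; not necessarily the tool used here

def sample_profile(duration_s=600, interval_s=10):
    """Periodically record CPU, memory, disk and network usage, yielding a coarse
    resource profile like the one in Figure 5.4. On a four-core VM, 25% total CPU
    usage corresponds to one fully used core."""
    samples = []
    disk0, net0 = psutil.disk_io_counters(), psutil.net_io_counters()
    psutil.cpu_percent()  # prime the counter; the very first reading is otherwise meaningless
    for _ in range(int(duration_s / interval_s)):
        time.sleep(interval_s)
        disk, net = psutil.disk_io_counters(), psutil.net_io_counters()
        samples.append({
            "cpu_percent": psutil.cpu_percent(),
            "memory_used_bytes": psutil.virtual_memory().used,
            "disk_bytes": (disk.read_bytes - disk0.read_bytes)
                          + (disk.write_bytes - disk0.write_bytes),
            "network_bytes": (net.bytes_sent - net0.bytes_sent)
                             + (net.bytes_recv - net0.bytes_recv),
        })
        disk0, net0 = disk, net
    return samples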

Event generation fluctuation

The fluctuations of the wall time of the different workflows will become extremely important for the model later on in Chapter 6. Initially, the same event generation job, with the same random seed input, was run 50 times on two different machines. The specifications can be found in the appendix: for the VM at CERN see Subsection A.7.2, for the VM at Göttingen see Subsection A.7.1.

In Table 5.1 it can be seen that the job duration does not fluctuate much, though the fluctuation is slightly higher on the CERN VM than on the one in Göttingen.

               Average wall time [s]   Standard deviation [%]
CERN VM        4488                    1.59
Göttingen VM   5475                    0.52

Table 5.1: Summary of executing the same event generation job 50 times on two different VMs, located at CERN and Göttingen.

The standard deviation of the sample (s) is obtained by:

s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}   (5.1)

where n is the number of measurements, x_i the observed values and x̄ is the mean value of the measurements.

The standard error of the mean (σ_x̄) is obtained by:

\sigma_{\bar{x}} = \frac{s}{\sqrt{n}}   (5.2)

where s is taken from Equation 5.1 and n is the number of measurements.

A plot depicting the wall time average over 1, 2, ..., 50 jobs with the corresponding estimation of the standard error of the mean has been created, see Figure 5.5. It can be described as a sliding window analysis, where the step size is one and the window size increases by one with each step.
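A minimal sketch of this analysis, assuming the measured wall times are available as a plain list of seconds; it applies Equations 5.1 and 5.2 to each growing window.

import math

def expanding_mean_and_sem(wall_times):
    """For the first 1, 2, ..., N jobs, return the average wall time and the
    standard error of the mean, following Equations 5.1 and 5.2."""
    points = []
    for n in range(1, len(wall_times) + 1):
        window = wall_times[:n]
        mean = sum(window) / n
        if n > 1:
            s = math.sqrt(sum((x - mean) ** 2 for x in window) / (n - 1))  # Eq. 5.1
            sem = s / math.sqrt(n)                                          # Eq. 5.2
        else:
            s, sem = 0.0, 0.0  # not defined for a single measurement
        points.append((n, mean, sem))
    return points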

In order to exclude effects that appear due to the ordering of the jobs, the same plot has been created multiple times. The only difference is that the ordering of the input data points has been rearranged pseudo-randomly by hand, see Figures A1, A2 and A3 in the Appendix.

The fluctuations are very small, considering that the y-axis does not start at zero. Approaching a high number of jobs (n > 25), the wall time average converges within a reasonably small standard error of the mean.

The same investigation has been performed for similar jobs4. Since similar jobs are different from each other, it is expected that the variation in wall time will become larger than before, which is confirmed by the numbers in Table 5.2.

Figure 5.6 shows the wall time average for an increasing number of jobs. The plot shows the behaviour for the jobs at Göttingen. The same plot showing the results for the CERN VM can be found in the Appendix, see Figure A4. The fluctuations of the wall time average increased together with the standard deviation.

4Similar jobs have different random seeds as inputs for each job

Figure 5.5: The wall time average over an incrementing number of the same event generation jobs, run at Göttingen, starting at one. The black error bars show the standard error of the mean (see Equation 5.2). Note: in order to improve readability, the y-axis does not start at zero.

               Average wall time [s]   Standard deviation [%]
CERN VM        4510                    3.97
Göttingen VM   5368                    3.34

Table 5.2: Summary of executing 41 similar event generation jobs.

Overall these fluctuations are still very small and they also converge, albeit only after a larger number of jobs is included.

Changing the workflow to an older version, which was used in 2015, does not show different results in terms of fluctuations.

5.2.2 Simulation

The simulation workflow simulates the detector and the physics interactions of the particles with it. This is done by the Geant4 [111] toolkit, which models the physics and the particle transport through the detector. It takes the resulting data from the event generation as input.

The truth information from the event generation is kept and particles from the simulation step are added.

In Figure 5.7 the profile of an MC simulation job is shown. The whole processing is CPU bound with very little disk and network activity. In this case the job ran on four cores in parallel.


Figure 5.6: The wall time average over an incrementing number of similar event generation jobs, starting at one, at the VM at Göttingen. The black error bars show the standard error of the mean (see Equation 5.2). Note: in order to improve readability, the y-axis does not start at zero.

There is little variation of the resource usage during the processing, from about 200 s to about 5600 s. Towards the end of the job, after 5500 s, it can be observed how the individual processes finish the simulation of the last events at different times and the CPU usage decreases in steps (100% to 75% to 50% to 25%) to zero. The memory profile of the job is rather flat and a low RAM requirement can be seen.
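The stepwise drop at the end of the job can be reproduced with a small sketch: each worker process stops at its own finish time, so the total CPU usage falls by one core (25% on a four-core VM) per finished worker. The finish times in the example are placeholders, not measured values.

def cpu_tail(finish_times_s, n_cores=4, t_start_s=5500, t_stop_s=5700, step_s=10):
    """Total CPU usage (in percent of all cores) while the worker processes
    finish their last events one after another."""
    profile = []
    for t in range(t_start_s, t_stop_s + 1, step_s):
        running = sum(1 for finish in finish_times_s if finish > t)
        profile.append((t, 100.0 * running / n_cores))
    return profile

# Example with hypothetical per-worker finish times (seconds since job start):
tail = cpu_tail([5540, 5580, 5630, 5660])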

Monte-Carlo simulation fluctuation

The fluctuations of the wall time of the different workflows will become extremely important for the model later on in Chapter 6. The variation among repetitions of the same simulation job was much lower (0.78% for CERN and 0.28% for Göttingen) than for similar simulation jobs. This is why only the results for similar jobs are shown, see Table 5.3; in any case, they include the same-job fluctuations.

Figure 5.8 shows the wall time average over an increasing number of similar Monte-Carlo simulation jobs. The outputs of the similar event generation jobs in Subsection 5.2.1 were taken as inputs for these simulation jobs.

The plot depicts the results for the VM in Göttingen. The same plot, showing the results of the VM at CERN, can be found in the Appendix, see Figure A5. Overall the fluctuations are rather small, on the order of a few percent of the wall time. Towards a higher number of jobs (n > 30) a convergence can be seen.

Figure 5.7: Profile of a multi-core simulation job run on a four-core VM at CERN, processing 100 events, using Athena version 19.2.3.3.

               Average wall time [s]   Standard deviation [%]
CERN VM        5010                    4.37
Göttingen VM   3097                    4.42

Table 5.3: Summary of executing 26 similar Monte-Carlo simulation jobs at CERN and 38 similar Monte-Carlo simulation jobs at Göttingen.