

7.2 Cloud measurement

7.2.1 HNSciCloud: large scale

The single-VM measurements were already presented along with the Model validation.

In the later stages of the prototyping phase, additional resources became available. Ten VMs, with eight CPU cores each, were provisioned per provider. Across the three providers this increased the scale of the tests significantly, to a total of 240 CPU cores.

No information was available on which VMs were co-located on the same underlying hardware. Resource contention among the ten VMs of a provider was therefore possible, as all tests were run in parallel on all VMs.

The same workflows were run on all providers, see Tables 7.6, 7.7 and 7.8. The in-depth details of the underlying workflows, with which the tests can be repeated, can be found in the Appendix in Section A.6. On a coarse level, the workflows that were run correspond to the different ATLAS job categories, namely event generation, simulation, reconstruction and digitisation. A more fine-grained division was made for the reconstruction workflows, because reconstruction is data intensive and therefore more complex, so additional parameters could be varied.

Each job category was repeated at least ten times per VM. With ten VMs on each of the three providers, this amounts to around 300 measurements per job category.

Individual jobs failed for various reasons. An attempt was made to re-run each failed job, but time constraints limited the number of repetitions. Once this became clear, more than ten jobs per category and per VM were run, so that the total number of measurements per provider and job category stayed above 100 even with job failures. The exception is Reco 2, with slightly fewer jobs. Due to the failures, the averages presented below can consist of a varying number of underlying measurements.

Exoscale

In Table 7.6, all Exoscale results are summarised, together with their standard deviations. The wall time is split into its components.

From the previous measurements it is already understood that the different job categories represent a diverse mixture of jobs. This is reflected in the measurements.

A difference can be observed between the measurements over ten VMs on a Cloud provider and those on single VMs in a controlled environment, see Sections 5.2 and 5.3.

The standard deviation is increased for the larger setup: on top of the fluctuations that were already observed, there are infrastructure fluctuations. These result from differences among the VMs and the more volatile environment.


Table 7.6: In this table all test results of the jobs run on the Exoscale infrastructure are shown (wall time, CPU time, idle time and I/O wait time, all in seconds). Each line constitutes similar jobs. Jobs that took less than 500 seconds CPU time are excluded, as these failed.

One example is the differing performance of the VMs, as illustrated in Figure 7.3: VMs 5 and 7 take much longer to complete the same workload than the other VMs, and the eight faster VMs also differ slightly in their performance. Such performance differences can have various root causes, which are not necessarily stable in time. In addition, different workflows may be more or less sensitive to these differences, so the impact varies.

Within individual VMs, single jobs also appear that deviate strongly from the others. The three such cases are job number 10 on VM 1, job number 3 on VM 5 and job number 3 on VM 8. These outliers represent an additional type of short-term fluctuation.
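As an illustration of how such heterogeneity could be spotted in the raw measurements, the following sketch flags VMs whose mean wall time lies well above the median of the per-VM means, and jobs that deviate strongly from their own VM's mean. It is only a minimal example, not the analysis code used for this work; the data layout (a dictionary mapping VM numbers to lists of wall times) and the thresholds are assumptions.

# Minimal sketch (assumed data layout, not the analysis code used in this work):
# flag slower VMs and per-VM outlier jobs from wall-time measurements [s].
from statistics import mean, stdev, median

def flag_heterogeneity(wall_times_by_vm, vm_factor=1.5, job_sigmas=3.0):
    # Mean wall time per VM and the median of those means as a robust reference.
    vm_means = {vm: mean(times) for vm, times in wall_times_by_vm.items()}
    reference = median(vm_means.values())
    slow_vms = [vm for vm, m in vm_means.items() if m > vm_factor * reference]

    # Jobs deviating by more than job_sigmas standard deviations from their VM's mean.
    outlier_jobs = []
    for vm, times in wall_times_by_vm.items():
        if len(times) < 2:
            continue
        sigma = stdev(times)
        for job_no, wall_time in enumerate(times, start=1):
            if sigma > 0 and abs(wall_time - vm_means[vm]) > job_sigmas * sigma:
                outlier_jobs.append((vm, job_no, wall_time))
    return slow_vms, outlier_jobs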

In Figure 7.4 the same VMs as in Figure 7.3 are used.

Ignoring the fact that a different workflow is run, what can be seen immediately is that now only VM 5 takes longer to finish the workload. VM 7, which was slower than VM 5 in Figure 7.3, now performs at the level of the other eight faster VMs.

This drastic change in performance hints at a change to VM 7 that lay outside of the influence or knowledge of the Cloud procurer. Luckily for the procurer, the two figures are in chronological order, so this change represents an increase in overall computing power.

This shows that including volatile hardware can have a negative impact on the stability of the wall time, making predictions more difficult and more error-prone. Assuming that the provider offers better hardware as a bonus, the model can however still predict a lower bound for the event throughput.

IBM

An overview of the IBM results can be found in Table 7.7.

Figure 7.3: Wall times [s] of all Reco 6 jobs executed on Exoscale. The jobs are organised according to the VMs they ran on and in the same order.

Figure 7.4: Wall times [s] of all Digi Reco 2 jobs executed on Exoscale. The jobs are organised according to the VMs they ran on and in the same order.


Table 7.7: In this table all test results of the jobs run on the IBM infrastructure are shown (wall time, CPU time, idle time and I/O wait time, all in seconds). Jobs that took less than 500 seconds CPU time are excluded, as these failed.

The VM performance is lower than on Exoscale. Different job types are affected to different degrees by the differences between the two infrastructures.

Some workflows are significantly slower, such as Digi Reco 2, whereas others are not slowed down as much, such as Reco 7.

The differences in fluctuations are shown and described in detail later.

The IBM infrastructure was also found to be heterogeneous, as was seen before for Exoscale in Figure 7.3.

T-Systems

One difference between the T-Systems VMs and the VMs from the other providers is that eight of the T-Systems VMs were located on the same underlying host. This was achieved by choosing a “dedicated host” as the underlying infrastructure for these VMs in the T-Systems interface.

The T-Systems results are summarised in Table 7.8.

In comparison to what has been seen for Exoscale in Table 7.6 and for IBM in Table 7.7, the wall times are increased. Interestingly, different job categories are affected differently by the differences in the infrastructure: the slow-down factor between the wall times from Exoscale and T-Systems differs between the job categories, ranging from around 1.4 for event generation to around 3.75 for simulation.
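The slow-down factor is simply the ratio of the mean wall times of the two providers, computed per job category. The following minimal sketch illustrates this; the numerical values are placeholders chosen only to reproduce the quoted factors, not measurements.

# Minimal sketch: slow-down factor per job category, defined as the ratio of the
# mean wall times on T-Systems and on Exoscale. The numbers below are
# hypothetical placeholders, not measured values.
mean_wall_exoscale = {"EvGen": 1000.0, "MC Sim": 2000.0}   # hypothetical means [s]
mean_wall_tsystems = {"EvGen": 1400.0, "MC Sim": 7500.0}   # hypothetical means [s]

slow_down = {
    category: mean_wall_tsystems[category] / mean_wall_exoscale[category]
    for category in mean_wall_exoscale
}
print(slow_down)  # {'EvGen': 1.4, 'MC Sim': 3.75}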

The fluctuations are also higher, as is shown and discussed in detail later in Table 7.9.

T-Systems      Wall Time [s]     CPU Time [s]        Idle Time [s]     I/O Wait Time [s]
Reco 7         15348 ± 2886      61478 ± 11789       27058 ± 5703      17861 ± 3272
Digi Reco 1     2789 ± 524       10497 ± 1546         7588 ± 1200        208 ± 53
Digi Reco 2    25821 ± 6151     168476 ± 16656       23207 ± 5077       1441 ± 573

Table 7.8: In this table all test results of the jobs run on the T-Systems infrastructure are shown. Jobs that took less than 500 seconds CPU time are excluded, as these failed.

Figure 7.5 highlights the heterogeneity of the infrastructure that was also observed for Exoscale.

One issue that can be spotted when looking at Figure 7.5 is that there appears to be a pattern. This wall time pattern can be seen throughout VMs 2-8 and VM 10. The series of jobs was started at the same time on all VMs, and the VMs in question were located on the same host.

The pattern manifests itself especially in the second job, which took longer than the neighbouring ones. In addition, from the third job onward, the wall time seems to be rising steadily. Due to the almost flat wall time distribution in VMs 1 and 9, the possibility that the pattern originates from the differences between the jobs can be excluded. This has also been double-checked by repeating the same jobs on different VMs. No pattern of this kind has been found in the other VMs' wall time distributions.

In the role of a customer, the possibilities to investigate the origin of this pattern were very limited. One possible factor is that the VMs contended for a hardware resource, which could have impacted the different workflows differently. Another explanation would be outside interference, such as from neighbouring VMs. Finally, it could also have been a combination of the two, or something entirely different.

Comparison

First of all, for all three providers the wall times fluctuated more when using multiple VMs than with single VMs. The standard deviations of the wall time for the three providers are compared in Table 7.9.

The different workflows themselves seem to be more or less stable in their duration.


Figure 7.5: Wall times [s] of all Reco 5 jobs executed on T-Systems. The jobs are organised according to the VMs they ran on and in the same order.

For example, EvGen has a lower standard deviation than Reco 6 for all three providers. It has to be kept in mind, however, that the tests took place at different periods in time. A change in outside influences, such as the VM speed-up that was demonstrated in Figures 7.3 and 7.4, could have taken place between and during the tests.

EvGen seems to be impacted the least by the VM heterogeneity. This was not investigated in detail, but the EvGen workflow differs the most from the other workflows: it is the only workflow without the built-in functionality of running on multiple CPU cores. Modern CPUs are very complex, making it probable that the different workflows do not exercise the same CPU features and optimisations. Especially for code that is less modern and less optimised for specific CPU functionalities, the differences within a heterogeneous mixture of CPUs can appear smaller.

The numbers in Table 7.9 also indicate differences between the providers. Exoscale and IBM appear to fluctuate less than T-Systems. An important thing to keep in mind is that a high standard deviation is not necessarily a reflection of an unstable infrastructure.

This can be seen from Figures 7.3, 7.4 and 7.5: there appears to be a mixture of more and less powerful VMs, which results in a large standard deviation.

Due to the independence of the workflows running on the different VMs, a large variation in performance is not negative. What counts in the end is to have a high event throughput. Therefore a diverse and difficult to predict infrastructure with high throughput is preferable over a homogeneous infrastructure with a low throughput.

Wall Time      Exoscale        IBM             T-Systems
               StdDev [%]      StdDev [%]      StdDev [%]
EvGen          4.70            3.02            3.07
Digi Reco 1    10.21           8.52            18.77
Digi Reco 2    8.12            15.85           23.82

Table 7.9: Comparing the Clouds. Displayed are the standard deviations of the wall time. A comparison between the wall times can be found in the Appendix in Table A2.

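The percentages in Table 7.9 are relative standard deviations, i.e. the standard deviation of the wall time expressed as a percentage of its mean; this can be checked against the absolute numbers in Table 7.8. A minimal sketch of this quantity, with placeholder input values:

# Minimal sketch: relative standard deviation of the wall time in percent,
# i.e. the sample standard deviation divided by the mean, as quoted in Table 7.9.
from statistics import mean, stdev

def wall_time_stddev_percent(wall_times):
    return 100.0 * stdev(wall_times) / mean(wall_times)

# Usage with placeholder values (not measured data):
print(round(wall_time_stddev_percent([2700.0, 2850.0, 2600.0, 2950.0]), 2))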

A high variability can, however, lead to low performance if the provider does not account for it correctly. This could be the case if the provider guarantees a computing power equal to the average of the infrastructure. Depending on how the VMs are provisioned, the customer could get ‘unlucky’ and receive less than the agreed-upon performance.

Different hardware types or generations within the infrastructure make an accurate prediction of the overall performance more difficult. This can be seen in Subsection 7.3.3, where the model results are shown.

Taking into account that there are two different generations of hardware reduces the standard deviation in the wall time, as can be seen in Table 7.10.

A comparison between these results and the ones in Table 7.9 shows that the large spread reflected in the standard deviation of T-Systems decreases. The IBM results are impacted less when split into two categories. The biggest impact can be found in Digi Reco 1 and 2 (IBM and T-Systems) and in MC Sim (T-Systems).

The reason why the T-Systems standard deviation benefits more from this split is that the difference between the faster and the regular machines is bigger than for IBM.
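A minimal sketch of such a split: measurements are grouped by the hardware generation of the VM they ran on, and the relative standard deviation is computed per group. The input layout and the way the generation is identified are assumptions; this is an illustration, not the actual analysis behind Table 7.10.

# Minimal sketch: split wall-time measurements by hardware generation and compute
# the relative standard deviation per group. How the generation of a VM is
# identified (e.g. from the reported CPU model) is assumed and not shown here.
from collections import defaultdict
from statistics import mean, stdev

def stddev_percent_by_generation(measurements):
    # measurements: iterable of (generation, wall_time) pairs -- an assumed layout.
    groups = defaultdict(list)
    for generation, wall_time in measurements:
        groups[generation].append(wall_time)
    return {
        generation: 100.0 * stdev(times) / mean(times)
        for generation, times in groups.items()
        if len(times) > 1
    }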