
3.5 Lattice QCD based on OpenCL: CL2QCD

3.5.1 Implementation Details

The host programme is written in C++11, which allows independent work on different sections of the code. In addition, the object orientation of C++ is well suited for this purpose, as there are many interdependencies between programme parts. Furthermore, future developments like additional fermion discretisations can be incorporated in a natural way.

Several levels of output have been implemented which can be controlled by compile- and runtime options (via command line). Thus, debugging information can be removed from the code entirely at the compiler level when performance is crucial, i.e. in production runs. Simulation parameters can be passed to the application either by input files or command line arguments.
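As a rough illustration of this two-level control, the following sketch combines a compile-time switch with a runtime verbosity option; the macro LOG_DEBUG, the define ENABLE_DEBUG_LOG and the option --log-level are purely illustrative and do not correspond to the actual CL2QCD interface.

    #include <cstdlib>
    #include <iostream>
    #include <string>

    // Compile-time switch: building with -DENABLE_DEBUG_LOG keeps the debug
    // output; without it, the statement vanishes entirely from the binary.
    #ifdef ENABLE_DEBUG_LOG
    #define LOG_DEBUG(msg) do { if (g_log_level >= 2) std::cerr << "[debug] " << msg << '\n'; } while (0)
    #else
    #define LOG_DEBUG(msg) do { } while (0)
    #endif

    static int g_log_level = 1;  // runtime level: 0 = quiet, 1 = info, 2 = debug

    int main(int argc, char** argv) {
        // Runtime switch: a command-line option selects the verbosity,
        // e.g. ./app --log-level=2
        for (int i = 1; i < argc; ++i) {
            std::string arg = argv[i];
            if (arg.rfind("--log-level=", 0) == 0)
                g_log_level = std::atoi(arg.c_str() + 12);
        }
        LOG_DEBUG("debug output, compiled in only if requested");
        std::cout << "running with log level " << g_log_level << '\n';
        return 0;
    }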

Details on the compilation procedure can be found in Appendix A.6.1. The aforementioned ILDG-compatible I/O has been implemented utilising the Lattice QCD Interchange Message Encapsulation (LIME)²² scheme for C.

At several points, the implemented algorithms rely on random numbers. Numerically, these are generated using Pseudo Random Number Generators (PRNGs), of which RANLUX [Lüscher, 1994] is the standard choice, as it is designed to give reliable random numbers also in parallel applications.

20 See http://cssm.sasr.edu.au/ildg/ .

21 See https://github.com/etmc/tmLQCD for the latest code version.

22 See http://usqcd.jlab.org/usqcd-docs/c-lime/ .


The original implementation by Lüscher is used in the host code, whereas an open-source OpenCL implementation of RANLUX, RANLUXCL²³, is used in the OpenCL parts. The initialisation follows the usual RANLUX rules on the host and the device. Each OpenCL thread runs on its own RANLUX PRNG state. In addition, the simpler NR3 [Press et al., 2007] generator is implemented, which was used for testing but not for simulations.
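As a minimal host-side sketch of this one-state-per-thread layout, a buffer holding one PRNG state per work-item could be created as follows; the struct PrngState and its size are placeholders and do not reproduce the actual RANLUXCL state format.

    #include <CL/cl.h>
    #include <cstddef>

    // One PRNG state per OpenCL work-item; each kernel instance then reads
    // and writes the entry at its global id, so no two threads share a stream.
    struct PrngState { cl_uint word[28]; };  // illustrative size only

    cl_mem create_prng_states(cl_context ctx, std::size_t global_work_size) {
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    global_work_size * sizeof(PrngState),
                                    nullptr, &err);
        // Seeding would follow here, e.g. via a dedicated initialisation
        // kernel applying the usual RANLUX seeding rules per work-item.
        return err == CL_SUCCESS ? buf : nullptr;
    }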

All mathematical functionality required for the algorithms has been implemented as OpenCL kernels. Kernel compilation is carried out as indicated in Figure 8. To simplify debugging, each kernel is built as a stand-alone programme. To reduce compilation time, the binary files created by the OpenCL compiler are saved for later re-use; of course, this functionality checks that the current hardware and code version have not changed with respect to the former compilation. Already mentioned in Figure 7 was the possibility to pass compile-time arguments in a C-like manner. This allows many kernel arguments (e.g. NT, NS, . . . ) to be saved by "hard coding" them into the kernel itself. To simplify kernel calling, the corresponding actions (see Figure 8) have been encapsulated in wrapper functions. All data types used in the kernels are implemented as structures [Bach et al., 2013a]. To access neighbouring sites, threads need to know the corresponding site index based on their own index (i.e. their global id). As this relation is non-trivial, especially with EO preconditioning, traditional applications provide the indices via precomputed arrays. Since memory access is the limiting factor on the GPU, a functionality to calculate these indices directly was developed. This is explained in Appendix A.6.3.
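The following sketch illustrates how such compile-time arguments can be attached as preprocessor defines when building an OpenCL programme, and how the resulting binary can be queried for caching; the function names are illustrative and error handling is omitted, so this is not the actual CL2QCD build code.

    #include <CL/cl.h>
    #include <string>
    #include <vector>

    // Build kernel source with the lattice extents "hard coded" as
    // preprocessor defines, so they need not be passed as kernel arguments.
    cl_program build_with_defines(cl_context ctx, cl_device_id dev,
                                  const char* source, int nt, int ns) {
        cl_int err = CL_SUCCESS;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &source, nullptr, &err);
        const std::string options = "-DNT=" + std::to_string(nt) +
                                    " -DNS=" + std::to_string(ns);
        clBuildProgram(prog, 1, &dev, options.c_str(), nullptr, nullptr);
        return prog;
    }

    // The compiled binary can be queried and written to disk for later reuse,
    // avoiding recompilation as long as hardware and code are unchanged
    // (assuming here that the programme was built for a single device).
    std::vector<unsigned char> get_binary(cl_program prog) {
        size_t size = 0;
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, nullptr);
        std::vector<unsigned char> binary(size);
        unsigned char* ptr = binary.data();
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, nullptr);
        return binary;
    }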

CL2QCD went through a couple of major changes during the development process. At the beginning of the development, the target machine, the LOEWE-CSC, represented a quite heterogeneous system. Accordingly, one aim was to investigate possible hybrid applications of the given architecture, i.e. the usage of CPUs and GPUs together. As the most recent system used, SANAM, has a clear focus on GPU performance, the code was heavily restructured, setting aside hybrid usage for the moment. This was reasonable as the four AMD FirePro S10000 provided by each SANAM node are an ideal test ground for a Multi-GPU application.

The two different structures will be presented in the following.

In the first code structure, the class gaugefield is the main object: it holds the buffer of the gauge field, and the OpenCL environment is initialised by it. The idea was that the physical gauge field is managed by one single object which in turn controls several devices in a possibly hybrid way. The devices are represented by opencl_module objects that hold device-related parts such as kernels and buffers. Problem-related functionality and algorithm logic is implemented by child classes of gaugefield and opencl_module, cf. Figure 9. For hybrid applications, "tasks" can be defined for specific problems, each of which may then contain several device objects. Possible hybrid applications were reported in [Philipsen et al., 2011]. For example, a splitting of fermionic observables like (2.79) into a CPU part (contraction) and a GPU part (inversion) might be beneficial. However, this structural approach has some drawbacks: the fusion of many components (gauge field buffer, OpenCL environment, . . . ) into the gaugefield and opencl_module objects made the application inflexible, especially for Multi-GPU usage, where the gauge field has to be distributed among multiple devices.

23 See https://bitbucket.org/ivarun/ranluxcl .


Figure 9: Structure of the gaugefield and opencl_module classes in the first version of CL2QCD. (a) gaugefield: Basic, Transportcoefficients, Heatbath, HMC, Inverter. (b) opencl_module: Basic, Kappa, Random, Heatbath, Spinor LA, Correlators, Fermions, HMC.

Also, the entanglement of the opencl_module objects, which came about naturally during the development, is not a good way of coding, as it misses a clean separation of the different levels of the application. Furthermore, this hampered the programming process, as code changes always immediately affected many parts of the programme.
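Schematically (class and member names are simplified for illustration and do not reproduce the actual interfaces), this first structure can be pictured as follows:

    #include <memory>
    #include <vector>

    // The gaugefield object owns the field and the OpenCL environment and
    // steers one opencl_module per device; algorithmic functionality enters
    // through child classes of both base classes.
    class opencl_module {                      // device-related part
    public:
        virtual ~opencl_module() = default;
        virtual void fill_kernels() = 0;       // compile problem-specific kernels
    };

    class opencl_module_hmc : public opencl_module {
    public:
        void fill_kernels() override { /* HMC kernels and buffers */ }
    };

    class gaugefield {                         // holds buffer and OpenCL context
    public:
        virtual ~gaugefield() = default;
        std::vector<std::unique_ptr<opencl_module>> modules;  // one per device/task
    };

    class gaugefield_hmc : public gaugefield {
    public:
        void perform_hmc_step() { /* algorithm logic, tied to the modules */ }
    };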

Accordingly, the code saw a major refactoring, which clearly separated OpenCL initialisation, buffer management, and opencl_modules into different objects. In addition, it introduced a clear distinction between high- and low-level functionality by introducing more abstraction, for example making the algorithm logic independent of the actual buffer type used and thus more versatile. This was achieved by creating three distinct packages: the Physics package provides objects representing gauge or spinor fields (Lattices) as well as methods to act on them (Fermionmatrix) or work with them (Algorithm). The Hardware package maps the objects used in the Physics package onto the actual hardware using the Buffer classes. The System object takes over the OpenCL initialisation; in particular, it initiates as many Device instances as demanded. Similar to the first version, the kernels are grouped by problem into opencl_modules, however stripped of buffer and algorithmic content; these are collected in the Code namespace. Figure 10 shows the structure of the two packages. A third package, meta, holds objects representing the parameters and provides a variety of helper functions.
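A schematic sketch of the resulting separation (namespace and class names are simplified and do not reproduce the actual interfaces) could look like this:

    #include <memory>
    #include <vector>

    namespace hardware {
        class Device { /* one OpenCL device */ };
        namespace buffers { class Gaugefield { /* device buffer for the links */ }; }
        class System {                           // OpenCL initialisation
        public:
            std::vector<std::unique_ptr<Device>> devices;
        };
    }

    namespace physics {
        namespace lattices {
            class Gaugefield {                   // physical object, buffer type hidden
                std::unique_ptr<hardware::buffers::Gaugefield> buffer;
            };
        }
        namespace algorithms {
            // Algorithm logic works on physics::lattices objects only and is
            // therefore independent of how they are mapped onto the hardware.
            void hmc(lattices::Gaugefield& gf);
        }
    }

    namespace meta {
        struct Inputparameters { int nt, ns; /* ... */ };
    }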

In Section 3.3 it was argued that LQCD is always limited by memory bandwidth. Accordingly, for optimal performance on a GPU, the data layout must reflect the memory access characteristics of the hardware.

By default, a memory object is stored as an array of structures (AoS); on a GPU, however, a so-called structure of arrays (SoA) approach to data storage is favourable. The difference is depicted in Figure 11 using the example of a spinor object, which is a structure of twelve complex numbers. In the SoA pattern the same element of all structures lies consecutively in memory, as opposed to AoS, which holds one complete structure after the other. As GPUs execute multiple concurrent threads, these will simultaneously request the same element of a structure. However, the memory controller always schedules read accesses of a certain length.


Figure 10: Simplified structure of the two main packages in the second version of CL2QCD. (a) Physics package: Lattices, Prng, Fermionmatrix, Algorithm. (b) Hardware package: System, Device, Buffers, Code. Note that not all objects are classes and possible subcontent is not shown.

In the worst case, an AoS layout then delivers only one useful structure element per read, needlessly discarding the rest of the data that was read along with it. This is improved with SoA, where (given a suitable alignment of the elements of the structure) all data per read is used. It turns out that this is crucial for high performance. On the CPU, in turn, AoS is in general preferable to SoA. Both strategies are used in CL2QCD, depending on the hardware in use.

AoS: ψ(0).e0 ψ(0).e1 ψ(0).e2 ... ψ(0).e11 ψ(1).e0 ψ(1).e1 ... ψ(2).e0 ...

SoA: ψ(0).e0 ψ(1).e0 ψ(2).e0 ... ψ(N).e0 ψ(0).e1 ψ(1).e1 ψ(2).e1 ...

Figure 11: Sketch of the AoS against the SoA pattern, showing a twelve-component spinor object ψ. N denotes the number of elements of the vector of spinors.
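In a C++-like notation (type and member names are illustrative), the two layouts for the spinor field of Figure 11 could be sketched as follows:

    #include <complex>
    #include <cstddef>
    #include <vector>

    using complex = std::complex<double>;

    // AoS: one spinor struct per site, its twelve components contiguous.
    struct SpinorAoS { complex e[12]; };
    using FieldAoS = std::vector<SpinorAoS>;          // field[site].e[c]

    // SoA: component c of all sites stored contiguously, so threads with
    // consecutive site indices read consecutive memory when they all access
    // the same component -- the coalesced pattern favoured on GPUs.
    struct FieldSoA {
        std::size_t num_sites;
        std::vector<complex> data;                    // size: 12 * num_sites
        complex& at(std::size_t site, std::size_t c) {
            return data[c * num_sites + site];        // component-major layout
        }
    };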

On the GPU, the OpenCL compiler is part of the driver and is also responsible for optimisations of the kernel code. Thus, the compilation process is an important part of GPU usage. Since the beginning of the development, multiple miscompilations have been observed on various AMD driver generations. These manifest themselves in compilation failures or incorrect code output, although the correctness of the kernel code could be verified in other environments, e.g. on the CPU. The likelihood of such behaviour increases with code complexity. Furthermore, the compiler heavily influences the register usage of the compiled kernel code, which can be much higher than the kernel requirements would suggest and can lead to the usage of scratch registers. Whether this is the case can be extracted from the binary files produced by the compiler, as these hold information about the compiled code, e.g. register usage statistics. As this register spilling influences performance massively, rewriting the kernel code is sometimes useful and may help to avoid it.

To ensure code correctness, we therefore incorporated regression tests for all kernels. These check kernel output against reference values for various sets of input parameters and compile options. In particular, these tests verify that execution on CPUs and GPUs yields the same results. In addition, similar tests for executables were added.
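A regression test of this kind might be sketched as follows; the callable passed as run is a hypothetical stand-in for executing a given kernel with fixed input on the chosen device type, and the tolerance handling is simplified.

    #include <cassert>
    #include <cmath>
    #include <cstdio>
    #include <functional>

    // Check one kernel against a stored reference value on both device types.
    // 'run' executes the kernel with fixed input parameters on either the
    // CPU (use_gpu = false) or the GPU (use_gpu = true).
    void regression_test(const char* kernel_name,
                         const std::function<double(bool use_gpu)>& run,
                         double reference, double tolerance) {
        const double cpu = run(false);
        const double gpu = run(true);
        assert(std::fabs(cpu - reference) < tolerance);
        assert(std::fabs(gpu - reference) < tolerance);
        assert(std::fabs(cpu - gpu) < tolerance);   // both backends must agree
        std::printf("%s: ok\n", kernel_name);
    }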

Apart from that, in particular the HMC and inverter executables were tested against the reference code, tmlqcd. During code development, this was done by modifying the tmlqcd source code itself to give the required interim results.


Figure 12: Plaquette as a function of the number of HMC trajectories, measured using the HMC implementations of tmlqcd and CL2QCD, respectively. Parameters were set according to the sample input file sample-hmc0.input provided in the tmlqcd package, which constitutes a test setup with a quoted reference value for the plaquette.

The full HMC of CL2QCD was then tested against the corresponding one from tmlqcd using reference input provided by the latter. Figure 12 shows this exemplarily for one test case. One sees that both programmes give comparable results and converge to the reference value quoted for this setup in the tmlqcd package; the remaining differences are due to the different random numbers used. The HMC without fermions can additionally be used to verify the correctness of the heatbath implementation. Moreover, the inverter was tested, for example by comparing the chiral condensate calculated on a specific configuration; the results agree with reference values provided by my colleague Florian Burger.

For details on the installation process of CL2QCD, see Appendix A.6.1.