
(1) Introduction to GPU Programming: Of Threads and Kernels. Andreas Herten, ESM User Forum, Forschungszentrum Jülich, 7 February 2019. Member of the Helmholtz Association.

(2) Outline
GPUs at JSC: JUWELS, JURECA, JURON
GPU Architecture: Empirical Motivation, Comparisons, 3 Core Features (Memory, Asynchronicity, SIMT), High Throughput, Summary
Programming GPUs: Libraries, Directives, Languages, Abstraction Libraries/DSL, Tools, Advanced Topics
Using GPUs on JURECA & JUWELS: Compiling, Resource Allocation

(3)–(4) JUWELS – Jülich’s New Scalable System. 2500 nodes with Intel Xeon CPUs (2 × 24 cores); 48 nodes with 4 NVIDIA Tesla V100 cards; 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #26). Next: Booster!

(5) JURECA – Jülich’s Multi-Purpose Supercomputer. 1872 nodes with Intel Xeon E5 CPUs (2 × 12 cores); 75 nodes with 2 NVIDIA Tesla K80 cards (look like 4 GPUs); JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing; 1.8 (CPU) + 0.44 (GPU) + 5 (KNL) PFLOP/s peak performance (Top500: #44); Mellanox EDR InfiniBand.

(6)–(7) JURON – A Human Brain Project Prototype. 18 nodes with IBM POWER8NVL CPUs (2 × 10 cores); per node: 4 NVIDIA Tesla P100 cards (16 GB HBM2 memory), connected via NVLink; GPU: 0.38 PFLOP/s peak performance. [Photo: the HBP pilot systems JULIA and JURON.]

(8) GPU Architecture.

(9) Why?.

(10) Status Quo Across Architectures: Performance. [Chart: theoretical peak double-precision performance (GFLOP/s) of INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and INTEL Xeon Phis, 2008–2016; most recent entries Tesla P100 and Tesla V100. Graphic: Rupp [2]]

(11) Status Quo Across Architectures: Memory Bandwidth. [Chart: theoretical peak memory bandwidth (GB/s) of INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and INTEL Xeon Phis, 2008–2016; most recent entries Tesla P100 and Tesla V100. Graphic: Rupp [2]]

(12)–(13) CPU vs. GPU, a matter of specialties: transporting many vs. transporting one. Graphics: Lee [3] and Shearings Holidays [4]

(14) CPU vs. GPU Chip. [Diagram: CPU die with large control logic, cache, a few ALUs, and attached DRAM, versus GPU die dominated by many ALUs with its own DRAM.]

(15)–(17) GPU Architecture Overview. Aim: hide latency; everything else follows. Three core features: Memory, Asynchronicity, SIMT.

(18)–(24) Memory. GPU memory ain’t no CPU memory. The GPU is an accelerator / extension card, i.e. a separate device from the CPU (host vs. device) with its own DRAM. Separate memory, but Unified Virtual Addressing (UVA) and Unified Memory (UM). Memory transfers need special consideration: do as little as possible! Formerly: explicitly copy data to/from the GPU; now this can be done automatically (performance…?). Host–device link: PCIe < 16 GB/s or NVLink ≈ 80 GB/s; device memory: HBM2 < 900 GB/s. P100: 16 GB RAM, 720 GB/s; V100: 16 (32) GB RAM, 900 GB/s.
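The Unified Memory model can be illustrated with a minimal sketch (not from the slides; kernel and array names are illustrative, error checking omitted): a single cudaMallocManaged allocation is visible to host and device, and the runtime migrates the data.

#include <cuda_runtime.h>

// Minimal Unified Memory sketch: one pointer, usable on host and device.
__global__ void increment(int n, float *v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float *v;
    cudaMallocManaged(&v, n * sizeof(float));   // reachable from CPU and GPU
    for (int i = 0; i < n; i++) v[i] = i;       // touched on the host
    increment<<<(n + 255) / 256, 256>>>(n, v);  // touched on the device
    cudaDeviceSynchronize();                    // migration handled by the runtime
    cudaFree(v);
    return 0;
}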

(25)–(29) Processing Flow: CPU → GPU → CPU.
1 Transfer data from CPU memory to GPU memory, transfer program.
2 Load GPU program, execute on SMs, get (cached) data from memory; write back.
3 Transfer results back to host memory.
[Diagram: CPU and CPU memory connected via the interconnect to the GPU’s scheduler, L2 cache, and DRAM.]
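For comparison, the same three steps spelled out with explicit copies; a sketch only (kernel name, computation, and sizes are made up, error checking omitted):

#include <cuda_runtime.h>

__global__ void kernel(int n, float *data) {       // placeholder computation
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_data = new float[n];                   // host buffer
    for (int i = 0; i < n; i++) h_data[i] = i;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));         // device buffer

    // 1: transfer data from CPU memory to GPU memory
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2: execute the GPU program on the SMs
    kernel<<<(n + 255) / 256, 256>>>(n, d_data);

    // 3: transfer results back to host memory
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}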

(32) Async: following different streams. Problem: memory transfer is comparably slow. Solution: do something else in the meantime (computation)! → Overlap tasks: copy and compute engines run separately (streams), so the copy for one chunk can overlap with the computation on another. The GPU needs to be fed: schedule many computations. The CPU can do other work while the GPU computes; synchronize when needed. [Diagram: serial copy–compute sequence vs. interleaved copy/compute in multiple streams.]
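A hedged sketch of how such overlap is typically expressed with CUDA streams (the two-chunk scheme and all names are illustrative, not from the slides; truly asynchronous copies additionally require pinned host memory):

#include <cuda_runtime.h>

__global__ void process(float *chunk, int n) {       // stand-in computation
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] = chunk[i] * 2.0f + 1.0f;
}

void pipeline(float *h_data, int n_total) {
    const int chunk = n_total / 2;
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) {
        cudaMalloc(&d_buf[s], chunk * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int s = 0; s < 2; s++) {
        // While the copy engine moves data for stream s, the compute engine
        // can already run the kernel of the other stream.
        cudaMemcpyAsync(d_buf[s], h_data + s * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
        cudaMemcpyAsync(h_data + s * chunk, d_buf[s], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    for (int s = 0; s < 2; s++) {                     // synchronize with the CPU
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
    }
}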

(35)–(45) SIMT: of threads and warps.
CPU: Single Instruction, Multiple Data (SIMD): one vector instruction computes A0+B0=C0, A1+B1=C1, A2+B2=C2, A3+B3=C3 at once. Simultaneous Multithreading (SMT): several hardware threads share one core.
GPU: Single Instruction, Multiple Threads (SIMT). CPU core ≊ GPU multiprocessor (SM). Working unit: set of threads (32, a warp). Fast switching of threads (large register file). Branching if.
[Diagram: Tesla V100 streaming multiprocessor with its cores and tensor cores. Graphics: Volta pictures [5]]

(46) New: Tensor Cores. New in Volta: 8 tensor cores per streaming multiprocessor (SM), 640 in total for V100. Performance: 125 TFLOP/s (half precision). They calculate A × B + C = D on 4 × 4 matrices (A, B: half precision) → 64 floating-point FMA operations per clock (mixed precision: FP16 multiply, FP32 accumulate).
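As an illustration of how the tensor cores can be reached from a library, a hedged cuBLAS sketch (assuming cuBLAS from CUDA 9 or later; matrix size, fill code, and the choice of FP32 output are illustrative): a mixed-precision GEMM with tensor operations enabled.

#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Mixed-precision GEMM, C = alpha*A*B + beta*C, eligible for the tensor cores.
void tensor_core_gemm(int n) {
    __half *A, *B;                                    // FP16 inputs
    float  *C;                                        // FP32 accumulate/output
    cudaMallocManaged(&A, n * n * sizeof(__half));
    cudaMallocManaged(&B, n * n * sizeof(__half));
    cudaMallocManaged(&C, n * n * sizeof(float));
    // fill A, B, C

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH); // permit tensor core math

    float alpha = 1.0f, beta = 1.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n,
                         B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_32F, n,
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
}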

(47)–(49) Low Latency vs. High Throughput: maybe the GPU’s ultimate feature. CPU: minimizes latency within each thread. GPU: hides latency with computations from other thread warps. [Diagram: a CPU core processing a few threads (T1–T4) with costly context switches while others are ready or waiting, vs. a GPU streaming multiprocessor keeping many warps (W1–W4) in flight and switching to whichever is ready.]

(50) CPU vs. GPU: let’s summarize this!
CPU, optimized for low latency: + large main memory; + fast clock rate; + large caches; + branch prediction; + powerful ALU; − relatively low memory bandwidth; − cache misses costly; − low performance per watt.
GPU, optimized for high throughput: + high-bandwidth main memory; + latency tolerant (parallelism); + more compute resources; + high performance per watt; − limited memory capacity; − low per-thread performance; − extension card.

(51) Programming GPUs.

(52) Preface: CPU. A simple CPU program! SAXPY: y = a·x + y on vectors x, y, with single precision; part of BLAS (Level 1).
void saxpy(int n, float a, float * x, float * y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
float a = 42; int n = 10;
float x[n], y[n];
// fill x, y
saxpy(n, a, x, y);

(53)–(57) Libraries. Programming GPUs is easy: just don’t! Use applications & libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, … (Wizard graphic: Breazell [6])

(58) cuBLAS: parallel algebra. GPU-parallel BLAS (all 152 routines). Single, double, complex data types. Constant competition with Intel’s MKL. Multi-GPU support. → https://developer.nvidia.com/cublas http://docs.nvidia.com/cuda/cublas

(59)–(65) cuBLAS: code example.
float a = 42; int n = 10;
float x[n], y[n];
// fill x, y

// Initialize
cublasHandle_t handle;
cublasCreate(&handle);

// Allocate GPU memory
float * d_x, * d_y;
cudaMallocManaged(&d_x, n * sizeof(x[0]));
cudaMallocManaged(&d_y, n * sizeof(y[0]));

// Copy data to GPU
cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);

// Call BLAS routine
cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

// Copy result to host
cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);

// Finalize
cudaFree(d_x); cudaFree(d_y);
cublasDestroy(handle);

(68) Thrust: iterators! Iterators everywhere! Thrust is to CUDA what the STL is to C++. Template library, based on iterators. Data-parallel primitives (scan(), sort(), reduce(), …). Fully compatible with plain CUDA C (comes with the CUDA Toolkit). Great with [](){} lambdas! → http://thrust.github.io/ http://docs.nvidia.com/cuda/thrust/

(69) Thrust: code example.
int a = 42; int n = 10;
thrust::host_vector<float> x(n), y(n);
// fill x, y

thrust::device_vector<float> d_x = x, d_y = y;

using namespace thrust::placeholders;
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(),
                  a * _1 + _2);

y = d_y;

(70) Thrust: code example with lambdas.
#include <thrust/for_each.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

constexpr int gGpuThreshold = 10000;

void saxpy(float *x, float *y, float a, int N) {
    auto r = thrust::counting_iterator<int>(0);
    auto lambda = [=] __host__ __device__ (int i) { y[i] = a * x[i] + y[i]; };
    if (N > gGpuThreshold)
        thrust::for_each(thrust::device, r, r + N, lambda);
    else
        thrust::for_each(thrust::host, r, r + N, lambda);
}

(71) Programming GPUs: Directives.

(72)–(74) GPU Programming with Directives: keepin’ you portable.
Annotate usual source code by directives:
#pragma acc loop
for (int i = 0; i < 1; i++) {}
Also: generalized API functions, e.g. acc_copy();
The compiler interprets the directives and creates the according instructions.
Pro: portability (another compiler? No problem! To it, it’s a serial program; different target architectures from the same code); easy to program.
Con: compiler support limited; raw power hidden; somewhat harder to debug.

(75) GPU Programming with Directives: the power of… two.
OpenMP: standard for multithreaded programming, on the CPU; on the GPU since 4.0, better since 4.5.
#pragma omp target map(tofrom:y), map(to:x)
#pragma omp teams num_teams(10) num_threads(10)
#pragma omp distribute
for ( ) {
    #pragma omp parallel for
    for ( ) {
        // …
    }
}
OpenACC: similar to OpenMP, but more specifically for GPUs; less prescriptive, more descriptive.
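A complete OpenMP 4.5 offloading variant of the SAXPY example might look as follows (a sketch; the combined construct and map clauses are one reasonable choice, not taken from the slides):

void saxpy_omp(int n, float a, float *x, float *y) {
    // Map x to the device, map y to the device and back, and distribute the
    // loop over teams of threads on the GPU.
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}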

(76) OpenACC: code example.
void saxpy_acc(int n, float a, float * x, float * y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
int a = 42; int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);

(77) OpenACC: code example.
void saxpy_acc(int n, float a, float * x, float * y) {
    #pragma acc parallel loop copy(y) copyin(x)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
int a = 42; int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);

(78) Programming GPUs: Languages.

(79)–(83) Programming GPU Directly. Finally… Two solutions:
OpenCL: Open Computing Language by the Khronos Group (Apple, IBM, NVIDIA, …), 2009. Platform: programming language (OpenCL C/C++), API, and compiler. Targets CPUs, GPUs, FPGAs, and other many-core machines. Fully open source. Different compilers available.
CUDA: NVIDIA’s GPU platform, 2007. Platform: drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, … Only NVIDIA GPUs. Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA is needed for the last step. Also: CUDA Fortran.
Choose what flavor you like, what colleagues/collaboration is using. Hardest: come up with a parallelized algorithm.

(84)–(92) CUDA’s Parallel Model: in software, threads and blocks.
Methods to exploit parallelism: threads are grouped into a block; blocks are grouped into a grid; threads and blocks can be arranged in 3D.
Parallel function: kernel
__global__ void kernel(int a, float * b) { }
Access own ID by global variables: threadIdx.x, blockIdx.y, …
Execution entity: threads. Lightweight → fast switching! 1000s of threads execute simultaneously → order non-deterministic!

(93)–(99) CUDA SAXPY: with runtime-managed data transfers.
// Specify kernel
__global__ void saxpy_cuda(int n, float a, float * x, float * y) {
    // ID variables select this thread's element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against too many threads
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int a = 42; int n = 10;
float * x, * y;
// Allocate GPU-capable memory
cudaMallocManaged(&x, n * sizeof(float));
cudaMallocManaged(&y, n * sizeof(float));
// fill x, y

// Call kernel: 2 blocks, each 5 threads
saxpy_cuda<<<2, 5>>>(n, a, x, y);
// Wait for kernel to finish
cudaDeviceSynchronize();
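The hard-coded <<<2, 5>>> only fits n = 10; for general n the launch configuration is usually computed, for example like this (block size 256 is a common but arbitrary choice, not from the slides):

// Round the block count up so every element gets a thread;
// the guard if (i < n) in the kernel absorbs the overhang.
const int threadsPerBlock = 256;
const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
saxpy_cuda<<<blocks, threadsPerBlock>>>(n, a, x, y);
cudaDeviceSynchronize();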

(100) Programming GPUs: Abstraction Libraries/DSL.

(101)–(102) Abstraction Libraries & DSLs. Libraries with ready-programmed abstractions; partly a compiler/transpiler is necessary. They have different backends to choose from for the targeted accelerator. Positioned between Thrust, OpenACC, and CUDA. Examples: Kokkos, Alpaka, Futhark, HIP, C++AMP, …

(103) An Alternative: Kokkos. From Sandia National Laboratories. C++ library for performance portability: data-parallel patterns, architecture-aware memory layouts, …
Kokkos::View<double*> x("X", length);
Kokkos::View<double*> y("Y", length);
double a = 2.0;
// Fill x, y
Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int& i) {
    x(i) = a*x(i) + y(i);
});
→ https://github.com/kokkos/kokkos/

(104) Programming GPUs: Tools.

(105)–(106) GPU Tools: the helpful helpers helping helpless (and others).
NVIDIA:
cuda-gdb: GDB-like command-line utility for debugging.
cuda-memcheck: like Valgrind’s memcheck, for checking errors in memory accesses.
Nsight: IDE for GPU development, based on Eclipse (Linux, OS X) or Visual Studio (Windows).
nvprof: command-line profiler, including detailed performance counters.
Visual Profiler: timeline profiling and annotated performance experiments.
OpenCL: CodeXL (open source, GPUOpen/AMD): debugging, profiling.

(107) nvprof: command that line. Usage: nvprof ./app
$ nvprof ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 99.19%  262.43ms    301  871.86us  863.88us  882.44us  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.58%  1.5428ms      2  771.39us  764.65us  778.12us  [CUDA memcpy HtoD]
  0.23%  599.40us      1  599.40us  599.40us  599.40us  [CUDA memcpy DtoH]

==37064== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 61.26%  258.38ms      1  258.38ms  258.38ms  258.38ms  cudaEventSynchronize
 35.68%  150.49ms      3  50.164ms  914.97us  148.65ms  cudaMalloc
  0.73%  3.0774ms      3  1.0258ms  1.0097ms  1.0565ms  cudaMemcpy
  0.62%  2.6287ms      4  657.17us  655.12us  660.56us  cuDeviceTotalMem
  0.56%  2.3408ms    301  7.7760us  7.3810us  53.103us  cudaLaunch
  0.48%  2.0111ms    364  5.5250us     235ns  201.63us  cuDeviceGetAttribute
  0.21%  872.52us      1  872.52us  872.52us  872.52us  cudaDeviceSynchronize

(108) nvprof: command that line. With metrics: nvprof --metrics flop_sp_efficiency ./app
$ nvprof --metrics flop_sp_efficiency ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
[Matrix Multiply Using CUDA] - Starting...
==37122== NVPROF is profiling process 37122, command: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
GPU Device 0: "Tesla P100-SXM2-16GB" with compute capability 6.0
MatrixA(1024,1024), MatrixB(1024,1024)
Computing result using CUDA Kernel...
==37122== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==37122== Replaying kernel "void matrixMulCUDA<int=32>(float*, float*, float*, int, int)" (0 of 2)... done
Performance= 26.61 GFlop/s, Time= 80.697 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
==37122== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37122== Profiling result:
==37122== Metric result:
Device "Tesla P100-SXM2-16GB (0)"
  Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
    Invocations  Metric Name         Metric Description            Min     Max     Avg
            301  flop_sp_efficiency  FLOP Efficiency(Peak Single)  22.96%  23.40%  23.15%

(109) Visual Profiler: your new favorite tool.

(110) Advanced Topics: so many more interesting things to show! Optimize memory transfers to reduce overhead. Optimize applications for the GPU architecture. Drop-in BLAS acceleration with NVBLAS ($LD_PRELOAD). Tensor Cores for deep learning. Libraries, abstractions: Kokkos, Alpaka, Futhark, HIP, C++AMP, … Use multiple GPUs: on one node, across many nodes → MPI. … Some of that: addressed at dedicated training courses.

(111) Summary of Acceleration Possibilities. Three routes from application to GPU: Libraries (drop-in acceleration), Directives (easy acceleration), Programming Languages (flexible acceleration).

(112) Using GPUs on JURECA & JUWELS.

(113) Compiling.
CUDA: Module: module load CUDA/9.2.88. Compile: nvcc file.cu. Default host compiler: g++; use nvcc_pgc++ for the PGI compiler. cuBLAS: g++ file.cpp -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcublas -lcudart
OpenACC: Module: module load PGI/18.7-GCC-7.3.0. Compile: pgc++ -acc -ta=tesla file.cpp
MPI: Module: module load MVAPICH2/2.3-GDR (also needed: GCC/7.3.0). Enabled for CUDA (CUDA-aware); no need to copy data to host before transfer.
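To illustrate what CUDA-awareness buys: device pointers can be handed to MPI calls directly; a sketch (buffer name, message size, and ranks are illustrative, not from the slides):

#include <mpi.h>
#include <cuda_runtime.h>

// CUDA-aware MPI sketch: pass device pointers straight to MPI;
// the library stages the data or uses GPUDirect internally.
void exchange(int rank, int n) {
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    // ... fill d_buf on the GPU ...
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaFree(d_buf);
}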

(114) Running: dedicated GPU partitions.
JUWELS: --partition=gpus, 48 nodes (job limits: < 1 d).
JURECA: --partition=gpus, 70 nodes (job limits: < 1 d, ≤ 32 nodes); --partition=develgpus, 4 nodes (job limits: < 2 h, ≤ 2 nodes); --partition=vis (JURECA only).
Needed: resource configuration with --gres, e.g. --gres=gpu:4 or --gres=mem1024,gpu:2.
→ See online documentation.

(115) Example: 96 tasks in total, running on 4 nodes; per node: 4 GPUs.
#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=24
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4
srun ./gpu-prog

(116)–(120) Conclusion, Resources. GPUs provide highly parallel computing power. We have many devices installed at JSC, ready to be used!
Training courses by JSC: CUDA Course, 1–3 April 2019; OpenACC Course, 28–29 October 2019. Generally: see the online documentation and sc@fz-juelich.de.
Further consultation via our lab: NVIDIA Application Lab in Jülich; contact me!
Interested in JURON? Get access!
Thank you for your attention! a.herten@fz-juelich.de

(121) APPENDIX.

(122) Appendix: Glossary, References.

(123) Glossary I
API: A programmatic interface to software by well-defined functions. Short for application programming interface. 72, 73, 74, 79, 80, 81, 82, 83
CUDA: Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++. 68, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 102, 113, 116, 117, 118, 119, 120, 125
DSL: A Domain-Specific Language is a specialization of a more general language to a specific domain. 2, 100, 101, 102
JSC: Jülich Supercomputing Centre, the supercomputing institute of Forschungszentrum Jülich, Germany. 2, 116, 117, 118, 119, 120, 123
JURECA: A multi-purpose supercomputer with 1800 nodes at JSC. 2, 5, 112, 114

(124) Glossary II
JURON: One of the two HBP pilot systems in Jülich; name derived from Juelich and Neuron. 6, 7
JUWELS: Jülich’s new supercomputer, the successor of JUQUEEN. 2, 3, 4, 112, 114
MPI: The Message Passing Interface, an API definition for multi-node computing. 110, 113
NVIDIA: US technology company creating GPUs. 3, 4, 5, 6, 7, 79, 80, 81, 82, 83, 105, 106, 116, 117, 118, 119, 120, 123, 124, 125, 126
NVLink: NVIDIA’s communication protocol connecting CPU ↔ GPU and GPU ↔ GPU with high bandwidth. 6, 7, 125
OpenACC: Directive-based programming, primarily for many-core machines. 75, 76, 77, 101, 102, 113, 116, 117, 118, 119, 120

(125) Glossary III
OpenCL: The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. 79, 80, 81, 82, 83, 105, 106
OpenMP: Directive-based programming, primarily for multi-threaded machines. 75
P100: A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory. 6, 7
Pascal: GPU architecture from NVIDIA (announced 2016). 125
POWER: CPU architecture from IBM, earlier: PowerPC. See also POWER8. 125
POWER8: Version 8 of IBM’s POWER processor, available also under the OpenPOWER Foundation. 6, 7, 125

(126) Glossary IV
SAXPY: Single-precision A × X + Y. A simple code example of scaling a vector and adding an offset. 52, 93, 94, 95, 96, 97, 98, 99
Tesla: The GPU product line for general-purpose computing of NVIDIA. 3, 4, 5, 6, 7
Thrust: A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/. 68
Volta: GPU architecture from NVIDIA (announced 2017). 46
CPU: Central Processing Unit. 3, 4, 5, 6, 7, 12, 13, 14, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 52, 75, 79, 80, 81, 82, 83, 124, 125

(127) Glossary V
GPU: Graphics Processing Unit. 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 47, 48, 49, 51, 53, 54, 55, 56, 57, 58, 66, 67, 71, 72, 73, 74, 75, 78, 79, 80, 81, 82, 83, 97, 98, 99, 100, 104, 105, 106, 110, 112, 114, 115, 116, 117, 118, 119, 120, 123, 124, 125, 126
HBP: Human Brain Project. 124
SIMD: Single Instruction, Multiple Data. 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45
SIMT: Single Instruction, Multiple Threads. 15, 16, 17, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45
SM: Streaming Multiprocessor. 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46
SMT: Simultaneous Multithreading. 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45

(128) References I
[2] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/ (pages 10, 11).
[6] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/ (pages 53–57, 66, 67).

(129) References: Images, Graphics I
[1] Alexandre Debiève. Bowels of computer. Freely available at Unsplash. URL: https://unsplash.com/photos/FO7JIlwjOtU.
[3] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/ (pages 12, 13).
[4] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/ (pages 12, 13).
[5] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf.
