
(1) GPU Accelerators at JSC – Supercomputing Introduction Course, 29 November 2019. Andreas Herten, Forschungszentrum Jülich. Member of the Helmholtz Association.

(2) Outline. GPUs at JSC: JUWELS, JURECA, JURON. GPU Architecture: Empirical Motivation, Comparisons, 3 Core Features (Memory, Asynchronicity, SIMT), High Throughput, Summary. Programming GPUs: Libraries, Directives, CUDA C/C++, Performance Analysis, Advanced Topics. Using GPUs on JURECA & JUWELS: Compiling, Resource Allocation. Slide 1/40.

(3)–(4) JUWELS – Jülich’s New Large System. 2500 nodes with Intel Xeon CPUs (2 × 24 cores). 46 + 10 nodes with 4 NVIDIA Tesla V100 cards each. 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #26). 2020: Booster! Slide 2/40.

(5) JURECA – Jülich’s Multi-Purpose Supercomputer. 1872 nodes with Intel Xeon E5 CPUs (2 × 12 cores). 75 nodes with 2 NVIDIA Tesla K80 cards (which look like 4 GPUs). JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing. 1.8 (CPU) + 0.44 (GPU) + 5 (KNL) PFLOP/s peak performance (Top500: #44). Mellanox EDR InfiniBand. Slide 3/40.

(6)–(7) JULIA / JURON. JURON – A Human Brain Project Pilot System. 18 nodes with IBM POWER8NVL CPUs (2 × 10 cores). Per node: 4 NVIDIA Tesla P100 cards (16 GB HBM2 memory), connected via NVLink. GPU: 0.38 PFLOP/s peak performance. Slide 4/40.

(8) GPU Architecture.

(9) Why?.

(10) Status Quo Across Architectures – Performance. [Plot: theoretical peak double-precision performance in GFLOP/s over the years 2008–2018 for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phis. Graphic: Rupp [2].]

(11) Status Quo Across Architectures – Memory Bandwidth. [Plot: theoretical peak memory bandwidth in GB/s over the years 2008–2018 for the same CPU, GPU, and Xeon Phi product lines; Tesla P100 and V100 at the top. Graphic: Rupp [2].]

(12)–(13) CPU vs. GPU – a matter of specialties: transporting one vs. transporting many. Graphics: Lee [3] and Bob Adams [4]. Slide 8/40.

(14) CPU vs. GPU Chip. [Diagram comparing chip layouts: control logic, ALUs, cache, and DRAM on the CPU side versus many small ALUs and their own DRAM on the GPU side.] Slide 9/40.

(15)–(17) GPU Architecture Overview. Aim: hide latency – everything else follows. Three core features: memory, asynchronicity, SIMT. Slide 10/40.

(18)–(25) Memory. GPU memory ain’t no CPU memory. GPU: accelerator / extension card → separate device from the CPU, with separate memory – but Unified Virtual Addressing (UVA) and Unified Memory (UM). Memory transfers need special consideration: do as little as possible! Formerly: explicitly copy data to/from the GPU; now: done automatically by Unified Memory (performance…?). Host ↔ device link: PCIe 3 (<16 GB/s) or NVLink (≈80 GB/s); device memory: HBM2 (<900 GB/s). P100: 16 GB RAM, 720 GB/s; V100: 32 GB RAM, 900 GB/s. [Diagram: host (control, ALUs, cache, DRAM) connected to device DRAM.] Slide 11/40.
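To make the Unified Memory point concrete, here is a minimal sketch (not from the original slides) using cudaMallocManaged; the kernel scale and the problem size are illustrative assumptions.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float));     // one pointer, valid on host and device (UM)
        for (int i = 0; i < n; i++) data[i] = 1.0f;      // touched on the host first
        scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);  // kernel access migrates pages to the GPU
        cudaDeviceSynchronize();                         // wait before reading the result on the host
        cudaFree(data);
        return 0;
    }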

(26)–(30) Processing Flow: CPU → GPU → CPU. 1 Transfer data from CPU memory to GPU memory, transfer program. 2 Load GPU program, execute on SMs, get (cached) data from memory; write back. 3 Transfer results back to host memory. [Diagram: CPU and CPU memory connected via the interconnect to the GPU scheduler, L2 cache, and DRAM.] Slide 12/40.
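The three steps map directly onto the classic explicit-copy pattern; a minimal sketch (not from the original slides), with the kernel and buffer size chosen for illustration:

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *h = new float[n];                                        // host buffer
        for (int i = 0; i < n; i++) h[i] = 1.0f;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);    // 1: CPU memory → GPU memory
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                    // 2: execute on the SMs
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);    // 3: results back to host memory
        cudaFree(d);
        delete[] h;
        return 0;
    }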

(31)–(32) GPU Architecture Overview. Aim: hide latency – everything else follows. SIMT, asynchronicity, memory. Slide 13/40.

(33) Async – following different streams. Problem: memory transfer is comparably slow. Solution: do something else in the meantime (computation)! → Overlap tasks. Copy and compute engines run separately (streams), so copies and kernels from different streams can overlap. The GPU needs to be fed: schedule many computations. The CPU can do other work while the GPU computes; synchronize when results are needed. Slide 14/40.
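A minimal sketch (not from the original slides) of how such overlap is expressed with CUDA streams; the kernel work, the number of streams, and the chunking are illustrative assumptions, and h_data would need to be pinned (cudaMallocHost) for the copies to be truly asynchronous.

    #include <cuda_runtime.h>

    __global__ void work(float *chunk, int n) { /* ... compute on one chunk ... */ }

    void process(float *h_data, int n) {
        const int nStreams = 4;
        int chunk = n / nStreams;                        // assumes n is divisible by nStreams
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaStream_t streams[nStreams];
        for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < nStreams; s++) {
            size_t off = (size_t)s * chunk;
            // the copy of one chunk can overlap with the kernel of another (different streams)
            cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
            cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();                         // synchronize with all streams at the end
        for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);
        cudaFree(d_data);
    }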

(34)–(35) GPU Architecture Overview. Aim: hide latency – everything else follows. SIMT, asynchronicity, memory. Slide 15/40.

(36)–(45) SIMT. SIMT = SIMD ⊕ SMT. CPU: Single Instruction, Multiple Data (SIMD) – one vector instruction computes A0..A3 + B0..B3 = C0..C3 – plus Simultaneous Multithreading (SMT) – several threads share one core. GPU: Single Instruction, Multiple Threads (SIMT). CPU core ≊ GPU multiprocessor (SM). Working unit: set of 32 threads (a warp). Fast switching of threads (large register file). Branching via if. [Figures: scalar vs. vector addition; SMT threads on CPU cores; Tesla V100 chip and one of its multiprocessors. Graphics: Nvidia Corporation [5].] Slide 16/40.
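To illustrate the branching point (an illustration, not from the slides): if threads within one warp take different paths of an if, the warp executes both paths one after the other with inactive lanes masked off.

    __global__ void branchy(float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Lanes of one warp diverge here: the warp first executes the if-branch
        // (odd lanes masked off), then the else-branch (even lanes masked off) –
        // the two paths are serialized instead of running in parallel.
        if (i % 2 == 0)
            out[i] = 2.0f * i;
        else
            out[i] = 0.5f * i;
    }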

(46)–(48) Low Latency vs. High Throughput – maybe the GPU’s ultimate feature. CPU: minimizes latency within each thread. GPU: hides latency with computations from other thread warps. [Diagram: a CPU core (low latency) processes threads T1–T4 with context switches; a GPU streaming multiprocessor (high throughput) interleaves warps W1–W4 – processing, context switch, ready, waiting.] Slide 17/40.

(49) CPU vs. GPU – let’s summarize this! CPU, optimized for low latency: + large main memory, + fast clock rate, + large caches, + branch prediction, + powerful ALU; − relatively low memory bandwidth, − cache misses costly, − low performance per watt. GPU, optimized for high throughput: + high-bandwidth main memory, + latency tolerant (parallelism), + more compute resources, + high performance per watt; − limited memory capacity, − low per-thread performance, − extension card. Slide 18/40.

(50) Programming GPUs.

(51) Preface: CPU – a simple CPU program! SAXPY: y = a·x + y with single-precision vectors x and y. Part of BLAS Level 1 (used by LAPACK).

    void saxpy(int n, float a, float * x, float * y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    float a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y

    saxpy(n, a, x, y);

Slide 20/40.

(52)–(56) Libraries. Programming GPUs is easy: just don’t! Use applications & libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, … Wizard graphic: Breazell [6]. Slide 21/40.

(57) cuBLAS – parallel algebra. GPU-parallel BLAS (all 152 routines). Single, double, and complex data types. In constant competition with Intel’s MKL. Multi-GPU support. → https://developer.nvidia.com/cublas, http://docs.nvidia.com/cuda/cublas. Slide 22/40.

(58)–(64) cuBLAS Code example (the slide builds up this listing step by step; the annotations are kept as comments):

    float a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y

    // Initialize
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Allocate GPU memory
    float * d_x, * d_y;
    cudaMallocManaged(&d_x, n * sizeof(x[0]));
    cudaMallocManaged(&d_y, n * sizeof(y[0]));

    // Copy data to GPU
    cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);

    // Call BLAS routine
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

    // Copy result to host
    cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);

    // Finalize
    cudaFree(d_x);
    cudaFree(d_y);
    cublasDestroy(handle);

Slide 23/40.
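The slide omits error handling; a minimal sketch (an addition, not from the slides) of checking CUDA runtime and cuBLAS status codes, which could wrap every call in the listing above:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    // Abort with a message if a CUDA runtime call failed
    #define CHECK_CUDA(call)                                            \
        do {                                                            \
            cudaError_t err_ = (call);                                  \
            if (err_ != cudaSuccess) {                                  \
                fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                        cudaGetErrorString(err_), __FILE__, __LINE__);  \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    // Abort if a cuBLAS call did not return CUBLAS_STATUS_SUCCESS
    #define CHECK_CUBLAS(call)                                          \
        do {                                                            \
            cublasStatus_t st_ = (call);                                \
            if (st_ != CUBLAS_STATUS_SUCCESS) {                         \
                fprintf(stderr, "cuBLAS error %d at %s:%d\n",           \
                        (int)st_, __FILE__, __LINE__);                  \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    // Usage: CHECK_CUBLAS(cublasCreate(&handle));
    //        CHECK_CUDA(cudaMallocManaged(&d_x, n * sizeof(float)));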

(65) Programming GPUs Directives.

(66)–(68) GPU Programming with Directives – keepin’ you portable. Annotate serial source code with directives: #pragma acc loop for (int i = 0; i < 1; i++) {}; OpenACC: especially for GPUs; OpenMP: has GPU support. The compiler interprets the directives and creates the according instructions. Pro: portability (another compiler? No problem! To it, it’s a serial program; different target architectures from the same code), easy to program. Con: compiler support still growing, not all the raw power available, harder to debug, easy to program wrong. Slide 25/40.

(69)–(70) OpenACC Code example – first with the kernels directive, then with an explicit parallel loop and data clauses:

    void saxpy_acc(int n, float a, float * x, float * y) {
        #pragma acc kernels
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    void saxpy_acc(int n, float a, float * x, float * y) {
        #pragma acc parallel loop copy(y) copyin(x)
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    float a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y

    saxpy_acc(n, a, x, y);

Slide 26/40.
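If saxpy_acc were called repeatedly, the copy/copyin clauses would move x and y across the PCIe/NVLink bus on every call; a hypothetical variant (not from the slides) using an OpenACC data region to hoist the transfers:

    void saxpy_many(int n, float a, float * x, float * y, int reps) {
        // keep x and y resident on the GPU for all repetitions; copy y back once at the end
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            for (int r = 0; r < reps; r++) {
                #pragma acc parallel loop present(x, y)
                for (int i = 0; i < n; i++)
                    y[i] = a * x[i] + y[i];
            }
        }
    }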

(71) Programming GPUs CUDA C/C++.

(72)–(77) Programming GPUs Directly – finally…
OpenCL: Open Computing Language by the Khronos Group (Apple, IBM, NVIDIA, …), 2009. Platform: programming language (OpenCL C/C++), API, and compiler. Targets CPUs, GPUs, FPGAs, and other many-core machines. Fully open source.
CUDA: NVIDIA’s GPU platform, 2007. Platform: drivers, programming language (CUDA C/C++), API, compiler, tools, … Only NVIDIA GPUs. Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA is needed for the last step. Also: CUDA Fortran.
HIP: AMD’s new unified programming model for AMD (via ROCm) and NVIDIA GPUs, 2016+.
Choose what flavor you like, what colleagues/collaborations are using. Hardest part: coming up with a parallelized algorithm. Slide 28/40.

(78)–(86) CUDA’s Parallel Model – in software: threads, blocks. Methods to exploit parallelism: threads → block, blocks → grid; threads & blocks can be arranged in 3D. Parallel function: kernel – __global__ void kernel(int a, float * b) { }. Access your own ID via the global variables threadIdx.x, blockIdx.y, … Execution entity: threads – lightweight → fast switching! 1000s of threads execute simultaneously → order is non-deterministic! Slide 29/40.
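A common launch-configuration pattern for n elements (an illustration, not from the slides): round the number of blocks up so that every element gets a thread and guard inside the kernel; the block size of 256 is an arbitrary but typical choice.

    __global__ void kernel(int n, float * b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // guard: the last block may have surplus threads
            b[i] = 2.0f * b[i];
    }

    // Host side: one thread per element, 256 threads per block
    void launch(int n, float * d_b) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
        kernel<<<blocks, threadsPerBlock>>>(n, d_b);
    }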

(87)–(93) CUDA SAXPY – with runtime-managed data transfers (annotations kept as comments):

    // Specify kernel
    __global__ void saxpy_cuda(int n, float a, float * x, float * y) {
        // ID variables
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Guard against too many threads
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    float a = 42;
    int n = 10;
    // Allocate GPU-capable memory
    float * x, * y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    // fill x, y

    // Call kernel: 2 blocks, each 5 threads
    saxpy_cuda<<<2, 5>>>(n, a, x, y);
    // Wait for kernel to finish
    cudaDeviceSynchronize();

Slide 30/40.
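Kernel launches return no error code themselves; a minimal sketch (not from the slides, assuming <cstdio> and the variables from the listing above) of catching launch and execution errors:

    saxpy_cuda<<<2, 5>>>(n, a, x, y);
    cudaError_t launchErr = cudaGetLastError();        // errors from the launch itself (e.g. a bad configuration)
    if (launchErr != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(launchErr));

    cudaError_t execErr = cudaDeviceSynchronize();     // errors that occur while the kernel runs
    if (execErr != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(execErr));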

(94) Programming GPUs Performance Analysis.

(95)–(97) GPU Tools – the helpful helpers helping the helpless (and others). NVIDIA: cuda-gdb – GDB-like command-line utility for debugging; cuda-memcheck – like Valgrind’s memcheck, for checking errors in memory accesses; Nsight – IDE for GPU development, based on Eclipse (Linux, OS X) or Visual Studio (Windows); nvprof – command-line profiler, including detailed performance counters; Visual Profiler – timeline profiling and annotated performance experiments; new: Nsight Systems and Nsight Compute, successors of the Visual Profiler. OpenCL: CodeXL (open source, GPUOpen/AMD) – debugging, profiling. Slide 32/40.

(98) nvprof – command that line.

    $ nvprof ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
    ==37064== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
    ==37064== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     99.19%  262.43ms    301  871.86us  863.88us  882.44us  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
      0.58%  1.5428ms      2  771.39us  764.65us  778.12us  [CUDA memcpy HtoD]
      0.23%  599.40us      1  599.40us  599.40us  599.40us  [CUDA memcpy DtoH]

    ==37064== API calls:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     61.26%  258.38ms      1  258.38ms  258.38ms  258.38ms  cudaEventSynchronize
     35.68%  150.49ms      3  50.164ms  914.97us  148.65ms  cudaMalloc
      0.73%  3.0774ms      3  1.0258ms  1.0097ms  1.0565ms  cudaMemcpy
      0.62%  2.6287ms      4  657.17us  655.12us  660.56us  cuDeviceTotalMem
      0.56%  2.3408ms    301  7.7760us  7.3810us  53.103us  cudaLaunch
      0.48%  2.0111ms    364  5.5250us     235ns  201.63us  cuDeviceGetAttribute
      0.21%  872.52us      1  872.52us  872.52us  872.52us  cudaDeviceSynchronize
      0.15%  612.20us   1505     406ns     361ns  1.1970us  cudaSetupArgument
      0.12%  499.01us      3  166.34us  140.45us  216.16us  cudaFree

Slide 33/40.

(99) Visual Profiler. Slide 34/40.

(100) Advanced Topics – so much more interesting things to show! Optimize memory transfers to reduce overhead. Optimize applications for the GPU architecture. Drop-in BLAS acceleration with NVBLAS ($LD_PRELOAD). Tensor Cores for deep learning. Libraries and abstractions: Kokkos, Alpaka, Futhark, HIP, C++AMP, … Use multiple GPUs: on one node, and across many nodes → MPI. Some of that: addressed at dedicated training courses. Slide 35/40.
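For the multi-GPU-on-one-node point, a minimal sketch (not from the slides) that distributes independent work across all visible devices with the CUDA runtime; the kernel work and the per-GPU problem size are illustrative assumptions.

    #include <cuda_runtime.h>

    __global__ void work(float *part, int n) { /* ... compute on this device's share ... */ }

    void run_on_all_gpus(int nPerGpu) {
        int nGpus = 0;
        cudaGetDeviceCount(&nGpus);
        float *d_part[16];                               // assumes at most 16 GPUs per node
        for (int dev = 0; dev < nGpus; dev++) {
            cudaSetDevice(dev);                          // subsequent calls target this GPU
            cudaMalloc(&d_part[dev], nPerGpu * sizeof(float));
            work<<<(nPerGpu + 255) / 256, 256>>>(d_part[dev], nPerGpu);  // launches are asynchronous
        }
        for (int dev = 0; dev < nGpus; dev++) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();                     // wait for this device's kernel
            cudaFree(d_part[dev]);
        }
    }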

(101) Using GPUs on JURECA & JUWELS.

(102) Compiling.
CUDA: Module: module load CUDA/10.1.105. Compile: nvcc file.cu. Default host compiler: g++; use nvcc_pgc++ for the PGI compiler. cuBLAS: g++ file.cpp -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcublas -lcudart.
OpenACC: Module: module load PGI/19.3-GCC-8.3.0. Compile: pgc++ -acc -ta=tesla file.cpp.
MPI: Module: module load MVAPICH2/2.3.2-GDR (also needed: GCC/8.3.0). Enabled for CUDA (CUDA-aware); no need to copy data to the host before transfer.
Slide 37/40.
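To illustrate what CUDA-aware means (a sketch, not from the slides): device pointers can be handed directly to MPI calls, so no staging through host memory is required. The exchange function and buffer size are hypothetical.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Exchange a device buffer between ranks 0 and 1 without staging through host memory
    void exchange(int rank, int n) {
        float *d_buf;
        cudaMalloc(&d_buf, n * sizeof(float));
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);      // device pointer passed directly
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaFree(d_buf);
    }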

(103) Running – dedicated GPU partitions.
JUWELS: --partition=gpus – 46 nodes (job limits: <1 d); --partition=develgpus – 10 nodes (job limits: <2 h, ≤ 2 nodes).
JURECA: --partition=gpus – 70 nodes (job limits: <1 d, ≤ 32 nodes); --partition=develgpus – 4 nodes (job limits: <2 h, ≤ 2 nodes).
Needed: resource configuration with --gres, e.g. --gres=gpu:4 or --gres=mem1024,gpu:2 with --partition=vis (JURECA only).
→ See the online documentation. Slide 38/40.

(104) Example: 96 tasks in total, running on 4 nodes; per node: 4 GPUs.

    #!/bin/bash -x
    #SBATCH --nodes=4
    #SBATCH --ntasks=96
    #SBATCH --ntasks-per-node=24
    #SBATCH --output=gpu-out.%j
    #SBATCH --error=gpu-err.%j
    #SBATCH --time=00:15:00
    #SBATCH --partition=gpus
    #SBATCH --gres=gpu:4

    srun ./gpu-prog

Slide 39/40.

(105)–(109) Conclusion, Resources. GPUs provide highly parallel computing power, and we have many devices installed at JSC, ready to be used! Training courses by JSC: CUDA Course, 4–6 May 2020; OpenACC Course, 26–27 October 2019. Generally: see the online documentation and sc@fz-juelich.de. Further consultation via our lab: NVIDIA Application Lab in Jülich – contact me! Interested in JURON? Get access! Thank you for your attention! a.herten@fz-juelich.de. Slide 40/40.

(110) Appendix.

(111) Appendix: Glossary, References. Slide 2/10.

(112) Glossary I.
AMD – Manufacturer of CPUs and GPUs. 72, 73, 74, 75, 76, 77, 112, 115
API – A programmatic interface to software by well-defined functions. Short for application programming interface. 72, 73, 74, 75, 76, 77
CUDA – Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++. 2, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 102, 105, 106, 107, 108, 109, 114
HIP – GPU programming model by AMD to target their own and NVIDIA GPUs with one combined language. Short for Heterogeneous-compute Interface for Portability. 72, 73, 74, 75, 76, 77
Slide 3/10.

(113) Glossary II.
JSC – Jülich Supercomputing Centre, the supercomputing institute of Forschungszentrum Jülich, Germany. 2, 105, 106, 107, 108, 109, 113
JURECA – A multi-purpose supercomputer with 1800 nodes at JSC. 2, 5, 101, 103
JURON – One of the two HBP pilot systems in Jülich; name derived from Juelich and Neuron. 6, 7
JUWELS – Jülich’s new supercomputer, the successor of JUQUEEN. 2, 3, 4, 101, 103
MPI – The Message Passing Interface, an API definition for multi-node computing. 100, 102
NVIDIA – US technology company creating GPUs. 3, 4, 5, 6, 7, 72, 73, 74, 75, 76, 77, 95, 96, 97, 105, 106, 107, 108, 109, 112, 114, 115
Slide 4/10.

(114) Glossary III.
NVLink – NVIDIA’s communication protocol connecting CPU ↔ GPU and GPU ↔ GPU with high bandwidth. 6, 7, 114
OpenACC – Directive-based programming, primarily for many-core machines. 66, 67, 68, 69, 70, 102, 105, 106, 107, 108, 109
OpenCL – The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. 72, 73, 74, 75, 76, 77, 95, 96, 97
OpenMP – Directive-based programming, primarily for multi-threaded machines. 66, 67, 68
P100 – A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory. 6, 7
Pascal – GPU architecture from NVIDIA (announced 2016). 114
Slide 5/10.

(115) Glossary IV.
POWER – CPU architecture from IBM, earlier: PowerPC. See also POWER8. 115
POWER8 – Version 8 of IBM’s POWER processor, available also within the OpenPOWER Foundation. 6, 7, 115
ROCm – AMD software stack and platform to program AMD GPUs. Short for Radeon Open Compute (Radeon is the GPU product line of AMD). 72, 73, 74, 75, 76, 77
SAXPY – Single-precision A × X + Y. A simple code example of scaling a vector and adding an offset. 51, 87, 88, 89, 90, 91, 92, 93
Tesla – The GPU product line for general-purpose computing of NVIDIA. 3, 4, 5, 6, 7
Slide 6/10.

(116) Glossary V.
CPU – Central Processing Unit. 3, 4, 5, 6, 7, 12, 13, 14, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 51, 72, 73, 74, 75, 76, 77, 112, 114, 115
GPU – Graphics Processing Unit. 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 50, 52, 53, 54, 55, 56, 57, 65, 66, 67, 68, 71, 72, 73, 74, 75, 76, 77, 91, 92, 93, 94, 95, 96, 97, 100, 101, 103, 104, 105, 106, 107, 108, 109, 112, 113, 114, 115
HBP – Human Brain Project. 113
SIMD – Single Instruction, Multiple Data. 36, 37, 38, 39, 40, 41, 42, 43, 44, 45
SIMT – Single Instruction, Multiple Threads. 15, 16, 17, 31, 32, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45
Slide 7/10.

(117) Glossary VI.
SM – Streaming Multiprocessor. 36, 37, 38, 39, 40, 41, 42, 43, 44, 45
SMT – Simultaneous Multithreading. 36, 37, 38, 39, 40, 41, 42, 43, 44, 45
Slide 8/10.

(118) References I.
[2] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/ (pages 10, 11).
[6] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/ (pages 52–56).
Slide 9/10.

(119) References: Images, Graphics I.
[1] Alexandre Debiève. Bowels of computer. Freely available at Unsplash. URL: https://unsplash.com/photos/FO7JIlwjOtU.
[3] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/. License: Creative Commons BY-ND 2.0 (pages 12, 13).
[4] Bob Adams. Picture: Hylton Ross Mercedes Benz Irizar coach. URL: https://www.flickr.com/photos/satransport/13197324714/. License: Creative Commons BY-SA 2.0 (pages 12, 13).
[5] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf (pages 43–45).
Slide 10/10.
