GPU ACCELERATORS AT JSC – OF THREADS AND KERNELS

(1)GPU ACCELERATORS AT JSC OF THREADS AND KERNELS 21 May 2019. Andreas Herten. Member of the Helmholtz Association. Forschungszentrum Jülich.

(2) Outline
GPUs at JSC: JUWELS, JURECA, JURON
GPU Architecture: Empirical Motivation, Comparisons, 3 Core Features (Memory, Asynchronicity, SIMT), High Throughput, Summary
Programming GPUs: Libraries, OpenACC/OpenMP, CUDA C/C++, Performance Analysis, Advanced Topics
Using GPUs on JURECA & JUWELS: Compiling, Resource Allocation
Member of the Helmholtz Association. 21 May 2019. Slide 1 41.

(3) JUWELS – Jülich’s New Large System. 2500 nodes with Intel Xeon CPUs (2 × 24 cores). 46 + 10 nodes with 4 NVIDIA Tesla V100 cards. 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #26). 2020: Booster! Member of the Helmholtz Association. 21 May 2019. Slide 2 41.

(5) JURECA – Jülich’s Multi-Purpose Supercomputer 1872 nodes with Intel Xeon E5 CPUs (2 × 12 cores) 75 nodes with 2 NVIDIA Tesla K80 cards (look like 4 GPUs) JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing 1.8 (CPU) + 0.44 (GPU) + 5 (KNL) PFLOP/s peak performance (Top500: #44) Mellanox EDR InfiniBand Member of the Helmholtz Association. 21 May 2019. Slide 3 41.

(6) JULIA. JURON. JURON – A Human Brain Project Pilot System 18 nodes with IBM POWER8NVL CPUs (2 × 10 cores) Per Node: 4 NVIDIA Tesla P100 cards (16 GB HBM2 memory), connected via NVLink GPU: 0.38 PFLOP/s peak performance Member of the Helmholtz Association. 21 May 2019. Slide 4 41.

(8) GPU Architecture. Member of the Helmholtz Association. 21 May 2019. Slide 5 41.

(9) Why?.

(10) Status Quo Across Architectures – Performance. [Figure: Theoretical Peak Performance, Double Precision (GFLOP/sec) over end of year, 2008–2016, comparing NVIDIA Tesla GPUs, AMD Radeon GPUs, INTEL Xeon CPUs, and INTEL Xeon Phis; Tesla V100 at the top of the 2016 data points. Graphic: Rupp [2]] Member of the Helmholtz Association. 21 May 2019. Slide 7 41.

(11) Status Quo Across Architectures – Memory Bandwidth. [Figure: Theoretical Peak Memory Bandwidth Comparison (GB/sec) over end of year, 2008–2016, comparing NVIDIA Tesla GPUs, AMD Radeon GPUs, INTEL Xeon CPUs, and INTEL Xeon Phis; Tesla P100 and V100 at the top of the 2016 data points. Graphic: Rupp [2]] Member of the Helmholtz Association. 21 May 2019. Slide 7 41.

(12) CPU vs. GPU – A matter of specialties: transporting many vs. transporting one. Graphics: Lee [3] and Bob Adams [4]. Member of the Helmholtz Association. 21 May 2019. Slide 8 41.

(14) CPU vs. GPU Chip. [Figure: chip schematics contrasting the CPU (Control, Cache, a few ALUs, DRAM) with the GPU (many ALUs, DRAM).] Member of the Helmholtz Association. 21 May 2019. Slide 9 41.

(15) GPU Architecture Overview. Aim: Hide Latency Everything else follows. Member of the Helmholtz Association. 21 May 2019. Slide 10 41.

(16) GPU Architecture Overview. Aim: Hide Latency Everything else follows. SIMT. Asynchronicity Memory. Member of the Helmholtz Association. 21 May 2019. Slide 10 41.

(18) Memory – GPU memory ain’t no CPU memory
GPU: accelerator / extension card → separate device from CPU, with its own DRAM
Separate memory, but UVA (Unified Virtual Addressing) and UM (Unified Memory)
Memory transfers need special consideration! Do as little as possible!
Formerly: Explicitly copy data to/from GPU. Now: Done automatically (performance…?)
Host ↔ Device connection: PCIe <16 GB/s or NVLink ≈80 GB/s; GPU memory: HBM2 <900 GB/s
P100: 16 GB RAM, 720 GB/s; V100: 16 (32) GB RAM, 900 GB/s
[Figure: host (Control, Cache, ALUs, DRAM) and device (DRAM), connected via PCIe/NVLink]
Member of the Helmholtz Association. 21 May 2019. Slide 11 41.
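
As a minimal sketch of the Unified Memory model mentioned above (the kernel name scale and the launch configuration are made up for illustration): one pointer serves host and device, and the host reads it again only after synchronizing.
float * a;
int n = 10;
cudaMallocManaged(&a, n * sizeof(float));   // one allocation, visible to host and device
for (int i = 0; i < n; i++)                 // initialize on the host
    a[i] = i;
scale<<<1, n>>>(n, a);                      // hypothetical kernel working on a
cudaDeviceSynchronize();                    // wait before touching a on the host again
cudaFree(a);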

(25) Processing Flow – CPU → GPU → CPU
1 Transfer data from CPU memory to GPU memory, transfer program
2 Load GPU program, execute on SMs, get (cached) data from memory; write back
3 Transfer results back to host memory
[Figure: CPU and CPU memory connected via the interconnect to the GPU (scheduler, SMs, L2 cache, DRAM)]
Member of the Helmholtz Association. 21 May 2019. Slide 12 41.
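
The same three steps spelled out with explicit transfers, as a rough sketch (the kernel process and the launch configuration are illustrative only):
int n = 10;
float h_data[10];                           // host memory, filled on the CPU
float * d_data;
cudaMalloc(&d_data, n * sizeof(float));
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);   // 1: data to the GPU
process<<<1, n>>>(n, d_data);                                            // 2: execute on the SMs
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);   // 3: results back to the host
cudaFree(d_data);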

(30) GPU Architecture Overview. Aim: Hide Latency Everything else follows. SIMT. Asynchronicity Memory. Member of the Helmholtz Association. 21 May 2019. Slide 13 41.

(32) Async – Following different streams
Problem: Memory transfer is comparably slow
Solution: Do something else in the meantime (computation)! → Overlap tasks
Copy and compute engines run separately (streams)
GPU needs to be fed: Schedule many computations
CPU can do other work while GPU computes; synchronization
[Figure: timeline in which copies and computations of different streams overlap]
Member of the Helmholtz Association. 21 May 2019. Slide 14 41.
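
Such overlap is expressed with CUDA streams; a rough sketch (buffer names and the kernel are placeholders, and the host buffer would need to be allocated with cudaMallocHost for the copy to be truly asynchronous):
cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copyStream);   // copy engine busy ...
kernel<<<blocks, threads, 0, computeStream>>>(d_other);                   // ... while the compute engine runs
cudaStreamSynchronize(copyStream);
cudaStreamSynchronize(computeStream);
cudaStreamDestroy(copyStream);
cudaStreamDestroy(computeStream);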

(33) GPU Architecture Overview. Aim: Hide Latency Everything else follows. SIMT. Asynchronicity Memory. Member of the Helmholtz Association. 21 May 2019. Slide 15 41.

(35) SIMT – SIMT = SIMD ⊕ SMT
CPU: Single Instruction, Multiple Data (SIMD): one instruction operates on a whole vector (A0 + B0 = C0, A1 + B1 = C1, …); Simultaneous Multithreading (SMT): several threads share one core
GPU: Single Instruction, Multiple Threads (SIMT)
CPU core ≊ GPU multiprocessor (SM)
Working unit: set of threads (32, a warp)
Fast switching of threads (large register file). Branching if.
[Figures: scalar vs. vector addition, SMT threads on CPU cores, SIMT, and the Tesla V100 multiprocessor. Graphics: Nvidia Corporation [5]]
Member of the Helmholtz Association. 21 May 2019. Slide 16 41.
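
To illustrate the branching point: threads of one warp that take different sides of an if are executed serially, one path after the other, as in this small, made-up kernel.
__global__ void branchy(float * out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)          // even and odd lanes of the same warp diverge:
        out[i] = 1.0f;       // the warp runs this path with the odd lanes masked off ...
    else
        out[i] = -1.0f;      // ... and then this path with the even lanes masked off
}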

(45) Low Latency vs. High Throughput – Maybe GPU’s ultimate feature
CPU: Minimizes latency within each thread
GPU: Hides latency with computations from other thread warps
[Figure: CPU core (low latency) runs threads T1–T4 one after the other with context switches; GPU streaming multiprocessor (high throughput) interleaves warps W1–W4. Legend: processing, context switch, ready, waiting]
Member of the Helmholtz Association. 21 May 2019. Slide 17 41.

(48) CPU vs. GPU – Let’s summarize this!
CPU, optimized for low latency: + large main memory; + fast clock rate; + large caches; + branch prediction; + powerful ALU; − relatively low memory bandwidth; − cache misses costly; − low performance per watt.
GPU, optimized for high throughput: + high bandwidth main memory; + latency tolerant (parallelism); + more compute resources; + high performance per watt; − limited memory capacity; − low per-thread performance; − extension card.
Member of the Helmholtz Association. 21 May 2019. Slide 18 41.

(49) Programming GPUs. Member of the Helmholtz Association. 21 May 2019. Slide 19 41.

(50) Preface: CPU – A simple CPU program!
SAXPY: y = a·x + y (vectors x, y; single precision). Part of BLAS Level 1
void saxpy(int n, float a, float * x, float * y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
float a = 42; int n = 10;
float x[n], y[n];
// fill x, y
saxpy(n, a, x, y);
Member of the Helmholtz Association. 21 May 2019. Slide 20 41.

(51) Programming GPUs Libraries. Member of the Helmholtz Association. 21 May 2019. Slide 21 41.

(55) Libraries – Programming GPUs is easy: Just don’t!
Use applications & libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, …
Wizard: Breazell [6]
Member of the Helmholtz Association. 21 May 2019. Slide 22 41.

(57) cuBLAS Parallel algebra. GPU-parallel BLAS (all 152 routines). Single, double, complex data types Constant competition with Intel’s MKL Multi-GPU support → https://developer.nvidia.com/cublas http://docs.nvidia.com/cuda/cublas. Member of the Helmholtz Association. 21 May 2019. Slide 23 41.

(58) cuBLAS – Code example
float a = 42; int n = 10;
float x[n], y[n];
// fill x, y
cublasHandle_t handle;
cublasCreate(&handle);                              // Initialize
float * d_x, * d_y;
cudaMallocManaged(&d_x, n * sizeof(x[0]));          // Allocate GPU memory
cudaMallocManaged(&d_y, n * sizeof(y[0]));
cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);     // Copy data to GPU
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);
cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);         // Call BLAS routine
cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);     // Copy result to host
cudaFree(d_x); cudaFree(d_y);
cublasDestroy(handle);                              // Finalize
Member of the Helmholtz Association. 21 May 2019. Slide 24 41.

(65) Programming GPUs OpenACC/OpenMP. Member of the Helmholtz Association. 21 May 2019. Slide 25 41.

(66) GPU Programming with Directives – Keepin’ you portable
Annotate serial source code with directives:
#pragma acc loop
for (int i = 0; i < 1; i++) {};
OpenACC: especially for GPUs; OpenMP: has GPU support
Compiler interprets directives, creates according instructions
Pro: portability (another compiler? No problem! To it, it’s a serial program; different target architectures from the same code); easy to program
Con: compiler support still maturing; not all the raw power available; harder to debug; easy to program wrong
Member of the Helmholtz Association. 21 May 2019. Slide 26 41.

(69) OpenACC – Code example (kernels directive)
void saxpy_acc(int n, float a, float * x, float * y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
float a = 42; int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);
Member of the Helmholtz Association. 21 May 2019. Slide 27 41.

(70) OpenACC – Code example (parallel loop with explicit data clauses)
void saxpy_acc(int n, float a, float * x, float * y) {
    #pragma acc parallel loop copy(y) copyin(x)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
float a = 42; int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);
Member of the Helmholtz Association. 21 May 2019. Slide 27 41.
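
When several offloaded loops reuse the same arrays, the copies implied by copy/copyin can be hoisted into an enclosing data region so x and y travel only once; a sketch under the assumption that the arrays have length n:
#pragma acc data copyin(x[0:n]) copy(y[0:n])
{
    #pragma acc parallel loop present(x, y)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    #pragma acc parallel loop present(x, y)   // second pass reuses the device copies
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}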

(71) Programming GPUs CUDA C/C++. Member of the Helmholtz Association. 21 May 2019. Slide 28 41.

(72) Programming GPU Directly – Finally…
Two solutions:
OpenCL – Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, …), 2009. Platform: programming language (OpenCL C/C++), API, and compiler. Targets CPUs, GPUs, FPGAs, and other many-core machines. Fully open source. Different compilers available.
CUDA – NVIDIA’s GPU platform, 2007. Platform: drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, … Only NVIDIA GPUs. Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA needed for last step. Also: CUDA Fortran.
Choose what flavor you like, what colleagues/collaboration is using
Hardest: Come up with parallelized algorithm
Member of the Helmholtz Association. 21 May 2019. Slide 29 41.

(77) CUDA’s Parallel Model – In software: Threads, Blocks
Methods to exploit parallelism: Threads → Block; Blocks → Grid; threads & blocks in 3D
Parallel function: kernel
__global__ void kernel(int a, float * b) { }
Access own ID by global variables threadIdx.x, blockIdx.y, …
Execution entity: threads. Lightweight → fast switching! 1000s of threads execute simultaneously → order non-deterministic!
Member of the Helmholtz Association. 21 May 2019. Slide 30 41.
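
In practice the launch configuration is derived from the problem size; a common pattern (the block size of 256 and the kernel call are illustrative):
int n = 1000000;
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up so every element gets a thread
kernel<<<numBlocks, threadsPerBlock>>>(n, d_data);             // excess threads are filtered by an i < n guard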

(86) CUDA SAXPY – With runtime-managed data transfers
__global__ void saxpy_cuda(int n, float a, float * x, float * y) {   // Specify kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;                   // ID variables
    if (i < n)                                                       // Guard against too many threads
        y[i] = a * x[i] + y[i];
}
float a = 42; int n = 10;
float * x, * y;
cudaMallocManaged(&x, n * sizeof(float));    // Allocate GPU-capable memory
cudaMallocManaged(&y, n * sizeof(float));
// fill x, y
saxpy_cuda<<<2, 5>>>(n, a, x, y);            // Call kernel: 2 blocks, each 5 threads
cudaDeviceSynchronize();                     // Wait for kernel to finish
Member of the Helmholtz Association. 21 May 2019. Slide 31 41.
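
Because x and y are managed memory, the host can check the result directly after cudaDeviceSynchronize(); a small verification, assuming the fill was x[i] = 1 and y[i] = 2 so that every element should now be a + 2:
int errors = 0;
for (int i = 0; i < n; i++)
    if (y[i] != a + 2.0f)
        errors++;
printf("SAXPY check: %d mismatches\n", errors);   // requires <cstdio>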

(93) Programming GPUs Performance Analysis. Member of the Helmholtz Association. 21 May 2019. Slide 32 41.

(94) GPU Tools – The helpful helpers helping helpless (and others)
NVIDIA:
cuda-gdb – GDB-like command line utility for debugging
cuda-memcheck – Like Valgrind’s memcheck, for checking errors in memory accesses
Nsight – IDE for GPU development, based on Eclipse (Linux, OS X) or Visual Studio (Windows)
nvprof – Command line profiler, including detailed performance counters
Visual Profiler – Timeline profiling and annotated performance experiments
OpenCL: CodeXL (Open Source, GPUOpen/AMD) – debugging, profiling
Member of the Helmholtz Association. 21 May 2019. Slide 33 41.

(96) nvprof – Command that line
$ nvprof ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling result:
Time(%)   Time      Calls  Avg       Min       Max       Name
 99.19%   262.43ms  301    871.86us  863.88us  882.44us  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.58%   1.5428ms  2      771.39us  764.65us  778.12us  [CUDA memcpy HtoD]
  0.23%   599.40us  1      599.40us  599.40us  599.40us  [CUDA memcpy DtoH]
==37064== API calls:
Time(%)   Time      Calls  Avg       Min       Max       Name
 61.26%   258.38ms  1      258.38ms  258.38ms  258.38ms  cudaEventSynchronize
 35.68%   150.49ms  3      50.164ms  914.97us  148.65ms  cudaMalloc
  0.73%   3.0774ms  3      1.0258ms  1.0097ms  1.0565ms  cudaMemcpy
  0.62%   2.6287ms  4      657.17us  655.12us  660.56us  cuDeviceTotalMem
  0.56%   2.3408ms  301    7.7760us  7.3810us  53.103us  cudaLaunch
  0.48%   2.0111ms  364    5.5250us  235ns     201.63us  cuDeviceGetAttribute
  0.21%   872.52us  1      872.52us  872.52us  872.52us  cudaDeviceSynchronize
  0.15%   612.20us  1505   406ns     361ns     1.1970us  cudaSetupArgument
  0.12%   499.01us  3      166.34us  140.45us  216.16us  cudaFree
Member of the Helmholtz Association. 21 May 2019. Slide 34 41.

(97) Visual Profiler. Member of the Helmholtz Association. 21 May 2019. Slide 35 41.

(98) Advanced Topics So much more interesting things to show! Optimize memory transfers to reduce overhead Optimize applications for GPU architecture Drop-in BLAS acceleration with NVBLAS ($LD_PRELOAD) Tensor Cores for Deep Learning Libraries, Abstractions: Kokkos, Alpaka, Futhark, HIP, C++AMP, … Use multiple GPUs On one node Across many nodes → MPI. … Some of that: Addressed at dedicated training courses. Member of the Helmholtz Association. 21 May 2019. Slide 36 41.

(99) Using GPUs on JURECA & JUWELS. Member of the Helmholtz Association. 21 May 2019. Slide 37 41.

(100) Compiling
CUDA – Module: module load CUDA/10.1.105. Compile: nvcc file.cu. Default host compiler: g++; use nvcc_pgc++ for PGI compiler. cuBLAS: g++ file.cpp -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcublas -lcudart
OpenACC – Module: module load PGI/19.3-GCC-8.3.0. Compile: pgc++ -acc -ta=tesla file.cpp
MPI – Module: module load MVAPICH2/2.3.1-GDR (also needed: GCC/8.3.0). Enabled for CUDA (CUDA-aware); no need to copy data to host before transfer
Member of the Helmholtz Association. 21 May 2019. Slide 38 41.

(101) Running – Dedicated GPU partitions
JUWELS: --partition=gpus (46 nodes; job limits: <1 d), --partition=develgpus (10 nodes; job limits: <2 h, ≤ 2 nodes)
JURECA: --partition=gpus (70 nodes; job limits: <1 d, ≤ 32 nodes), --partition=develgpus (4 nodes; job limits: <2 h, ≤ 2 nodes)
Needed: resource configuration with --gres, e.g. --gres=gpu:4 or --gres=mem1024,gpu:2; --partition=vis (only JURECA)
→ See online documentation
Member of the Helmholtz Association. 21 May 2019. Slide 39 41.

(102) Example – 96 tasks in total, running on 4 nodes; per node: 4 GPUs
#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=24
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4
srun ./gpu-prog
Member of the Helmholtz Association. 21 May 2019. Slide 40 41.

(103) Conclusion, Resources
GPUs provide highly parallel computing power. We have many devices installed at JSC, ready to be used!
Training courses by JSC: CUDA Course April 2020; OpenACC Course 28 – 29 October 2019. Generally: see online documentation and sc@fz-juelich.de
Further consultation via our lab: NVIDIA Application Lab in Jülich; contact me!
Interested in JURON? Get access!
Thank you for your attention! a.herten@fz-juelich.de
Member of the Helmholtz Association. 21 May 2019. Slide 41 41.

(108) APPENDIX. Member of the Helmholtz Association. 21 May 2019. Slide 1 9.

(109) Appendix Glossary References. Member of the Helmholtz Association. 21 May 2019. Slide 2 9.

(110) Glossary I
API A programmatic interface to software by well-defined functions. Short for application programming interface. 72, 73, 74, 75, 76.
CUDA Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++. 2, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 100, 103, 104, 105, 106, 107, 111.
JSC Jülich Supercomputing Centre, the supercomputing institute of Forschungszentrum Jülich, Germany. 2, 103, 104, 105, 106, 107, 110.
JURECA A multi-purpose supercomputer with 1800 nodes at JSC. 2, 5, 99, 101.
JURON One of the two HBP pilot systems in Jülich; name derived from Juelich and Neuron. 6, 7.
JUWELS Jülich’s new supercomputer, the successor of JUQUEEN. 2, 3, 4, 99, 101.
Member of the Helmholtz Association. 21 May 2019. Slide 3 9.

(111) Glossary II
MPI The Message Passing Interface, an API definition for multi-node computing. 98, 100.
NVIDIA US technology company creating GPUs. 3, 4, 5, 6, 7, 72, 73, 74, 75, 76, 94, 95, 103, 104, 105, 106, 107, 110, 111, 112, 113.
NVLink NVIDIA’s communication protocol connecting CPU ↔ GPU and GPU ↔ GPU with high bandwidth. 6, 7, 112.
OpenACC Directive-based programming, primarily for many-core machines. 2, 65, 66, 67, 68, 69, 70, 100, 103, 104, 105, 106, 107.
OpenCL The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. 72, 73, 74, 75, 76, 94, 95.
Member of the Helmholtz Association. 21 May 2019. Slide 4 9.

(112) Glossary III
OpenMP Directive-based programming, primarily for multi-threaded machines. 2, 65, 66, 67, 68.
P100 A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory. 6, 7.
Pascal GPU architecture from NVIDIA (announced 2016). 112.
POWER CPU architecture from IBM, earlier: PowerPC. See also POWER8. 112.
POWER8 Version 8 of IBM’s POWER processor, available also under the OpenPOWER Foundation. 6, 7, 112.
SAXPY Single-precision A × X + Y. A simple code example of scaling a vector and adding an offset. 50, 86, 87, 88, 89, 90, 91, 92.
Member of the Helmholtz Association. 21 May 2019. Slide 5 9.

(113) Glossary IV
Tesla The GPU product line for general purpose computing of NVIDIA. 3, 4, 5, 6, 7.
CPU Central Processing Unit. 3, 4, 5, 6, 7, 12, 13, 14, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 50, 72, 73, 74, 75, 76, 111, 112.
GPU Graphics Processing Unit. 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49, 51, 52, 53, 54, 55, 56, 57, 65, 66, 67, 68, 71, 72, 73, 74, 75, 76, 90, 91, 92, 93, 94, 95, 98, 99, 101, 102, 103, 104, 105, 106, 107, 110, 111, 112, 113.
HBP Human Brain Project. 110.
SIMD Single Instruction, Multiple Data. 35, 36, 37, 38, 39, 40, 41, 42, 43, 44.
Member of the Helmholtz Association. 21 May 2019. Slide 6 9.

(114) Glossary V. SIMT Single Instruction, Multiple Threads. 15, 16, 17, 30, 31, 33, 34, 35, 36, 37, 38, 39,. 40, 41, 42, 43, 44. SM Streaming Multiprocessor. 35, 36, 37, 38, 39, 40, 41, 42, 43, 44 SMT Simultaneous Multithreading. 35, 36, 37, 38, 39, 40, 41, 42, 43, 44. Member of the Helmholtz Association. 21 May 2019. Slide 7 9.

(115) References I
[2] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/ (pages 10, 11).
[6] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/ (pages 52–56).
Member of the Helmholtz Association. 21 May 2019. Slide 8 9.

(116) References: Images, Graphics I
[1] Alexandre Debiève. Bowels of computer. Freely available at Unsplash. URL: https://unsplash.com/photos/FO7JIlwjOtU.
[3] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/. License: Creative Commons BY-ND 2.0 (pages 12, 13).
[4] Bob Adams. Picture: Hylton Ross Mercedes Benz Irizar coach. URL: https://www.flickr.com/photos/satransport/13197324714/. License: Creative Commons BY-SA 2.0 (pages 12, 13).
[5] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf (pages 42–44).
Member of the Helmholtz Association. 21 May 2019. Slide 9 9.
