
(1) GPU Introduction. JSC OpenACC Course 2019, 28 October 2019. Andreas Herten, Forschungszentrum Jülich. Member of the Helmholtz Association.

(2) Outline. Introduction: GPU History, Architecture Comparison, Jülich Systems, App Showcase. Platform: 3 Core Features (Memory, Asynchronicity, SIMT), High Throughput, Summary. Programming GPUs: Libraries, GPU programming models, CUDA.

(3) History of GPUs: a short but parallel story.
1999: Graphics computation pipeline implemented in dedicated graphics hardware; computations using the OpenGL graphics library [2]; »GPU« coined by NVIDIA [3].
2001: NVIDIA GeForce 3 with programmable shaders (instead of a fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI.
2007: CUDA. 2009: OpenCL.
2019: Top500: 25 % with NVIDIA GPUs (#1, #2) [4]; Green500: 8 of top 10 with GPUs [5].
2021: Aurora: first (?) US exascale supercomputer, based on Intel GPUs; Frontier: first (?) US more-than-exascale supercomputer, based on AMD GPUs.

(4) Status Quo Across Architectures: Theoretical Peak Performance, Double Precision. [Chart: peak GFLOP/s versus end of year, 2008–2018, for INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and INTEL Xeon Phis. Graphic: Rupp [6].]

(5) Status Quo Across Architectures: Theoretical Peak Memory Bandwidth Comparison. [Chart: peak memory bandwidth in GB/s versus end of year, 2008–2018, for INTEL Xeon CPUs, NVIDIA Tesla GPUs (up to Tesla P100 and V100), AMD Radeon GPUs, and INTEL Xeon Phis. Graphic: Rupp [6].]

(6) JURECA – Jülich’s Multi-Purpose Supercomputer. 1872 nodes with Intel Xeon E5 CPUs (2 × 12 cores); 75 nodes with 2 NVIDIA Tesla K80 cards each (which look like 4 GPUs); JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing. 1.8 (CPU) + 0.44 (GPU) + 5 (KNL) PFLOP/s peak performance (Top500: #44). Mellanox EDR InfiniBand.

(7) JUWELS – Jülich’s New Scalable System. 2500 nodes with Intel Xeon CPUs (2 × 24 cores); 48 nodes with 4 NVIDIA Tesla V100 cards. 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #26). Next: Booster!

(8) Getting GPU-Acquainted. TASK. Some applications: N-Body, GEMM, Mandelbrot, Dot Product. Location of code: 1-Introduction-…/Tasks/getting-started/. See Instructions.rst for hints.

(9) Getting GPU-Acquainted. TASK. [Benchmark charts for the applications: DGEMM (GFLOP/s vs. size of square matrix; 1/2/4 GPUs, SP and DP), N-Body (GFLOP/s vs. number of particles; CPU vs. GPU), Mandelbrot (MPixel/s vs. width of image; CPU vs. GPU), DDot (performance vs. vector length; CPU vs. GPU).]

(10) Platform.

(11) CPU vs. GPU: a matter of specialties. Transporting many (GPU) vs. transporting one (CPU). Graphics: Lee [7] and Shearings Holidays [8].

(12) CPU vs. GPU Chip. [Diagram: the CPU die is dominated by Control logic, Cache, and a few large ALUs next to DRAM; the GPU die is dominated by many small ALUs next to DRAM.]

(13) GPU Architecture Overview. Aim: hide latency; everything else follows. Three core features: Memory, Asynchronicity, SIMT.

(15/16) Memory. GPU memory ain’t no CPU memory. The GPU is an accelerator / extension card → a separate device from the CPU with separate memory, but with Unified Virtual Addressing (UVA) and Unified Memory (UM). Memory transfers need special consideration! Do as little as possible! Formerly: explicitly copy data to/from the GPU. Now: done automatically (performance…?). Interconnects: PCIe 3 (<16 GB/s) or NVLink (≈80 GB/s). Device memory (HBM2, <900 GB/s): P100 16 GB RAM at 720 GB/s; V100 32 GB RAM at 900 GB/s.
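To make the formerly/now contrast concrete, a minimal sketch in CUDA C using only standard runtime calls; the fill and kernel-launch steps are left as comments:

    // Unified Memory: one allocation, visible to both CPU and GPU;
    // the runtime migrates pages on demand (the "done automatically" above).
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    // ... fill x on the CPU, launch a kernel that reads/writes x ...
    cudaDeviceSynchronize();  // ensure the GPU is finished before the CPU touches x again
    cudaFree(x);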

(17–19) Processing Flow: CPU → GPU → CPU. 1. Transfer data from CPU memory to GPU memory, transfer program. 2. Load GPU program, execute on the SMs, get (cached) data from memory; write back. 3. Transfer results back to host memory. [Diagram: CPU and CPU memory connected via the interconnect to the GPU’s scheduler, L2 cache, and DRAM.]
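The three steps map directly onto explicit CUDA runtime calls; a hedged sketch, where the buffer names, size, and kernel are placeholders:

    // 1. Transfer data from CPU memory to GPU memory
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    // 2. Execute the GPU program on the SMs
    kernel<<<blocks, threads>>>(d_data);
    // 3. Transfer results back to host memory
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // implicitly waits for the kernel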


(21) Async: following different streams. Problem: memory transfer is comparably slow. Solution: do something else in the meantime (computation)! → Overlap tasks. Copy and compute engines run separately (streams): while one stream’s data is being copied, another stream’s computation proceeds. The GPU needs to be fed: schedule many computations. The CPU can do other work while the GPU computes; synchronization when needed.
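A sketch of such overlap with two CUDA streams; the kernel and buffer names are illustrative, and note that asynchronous copies only truly overlap when the host buffer is page-locked, hence cudaMallocHost:

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    float *h_a;
    cudaMallocHost(&h_a, bytes);  // pinned host memory, required for real async copies
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1);  // copy engine, stream 1
    compute<<<blocks, threads, 0, s2>>>(d_b);                      // compute engine, stream 2
    // ... the CPU is free to do other work here ...
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);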


(23–25) SIMT. SIMT = SIMD ⊕ SMT. CPU: Single Instruction, Multiple Data (SIMD): one vector instruction computes C = A + B element-wise (A0+B0=C0, …, A3+B3=C3); Simultaneous Multithreading (SMT): multiple threads share one core. GPU: Single Instruction, Multiple Threads (SIMT). CPU core ≊ GPU multiprocessor (SM). Working unit: a set of 32 threads (a warp). Fast switching of threads (large register file). Branching via if. Example: the Tesla V100 multiprocessor. Graphics: Nvidia Corporation [9].
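The "branching via if" point can be made concrete with a hypothetical kernel: when threads of one warp take different paths, both paths are executed one after the other, with the non-participating threads masked off.

    __global__ void diverge(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)            // even and odd lanes of the same warp diverge here
            x[i] = 2.0f * x[i];    // first pass: even lanes active
        else
            x[i] = x[i] + 1.0f;    // second pass: odd lanes active
    }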

(26) Low Latency vs. High Throughput: maybe the GPU’s ultimate feature. CPU: minimizes latency within each thread. GPU: hides latency with computations from other thread warps. [Diagram: a CPU core processes threads T1–T4 with costly context switches; a GPU streaming multiprocessor keeps warps W1–W4 in flight, switching to a ready warp whenever one is waiting.]

(27) CPU vs. GPU: let’s summarize this!
CPU, optimized for low latency: + large main memory; + fast clock rate; + large caches; + branch prediction; + powerful ALU; − relatively low memory bandwidth; − cache misses costly; − low performance per watt.
GPU, optimized for high throughput: + high-bandwidth main memory; + latency tolerant (parallelism); + more compute resources; + high performance per watt; − limited memory capacity; − low per-thread performance; − extension card.

(28) Programming GPUs.

(29) Summary of Acceleration Possibilities. Application: Libraries (drop-in acceleration), Directives (easy acceleration), Programming Languages (flexible acceleration).

(30/31) Libraries. Programming GPUs is easy: just don’t! Use applications & libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, … Wizard: Breazell [10].
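As a hedged illustration of the drop-in idea: a SAXPY (y = a·x + y, see slide 37) needs no hand-written kernel when cuBLAS is used; here d_x and d_y are assumed to be device (or managed) float buffers of length n.

    #include <cublas_v2.h>

    cublasHandle_t handle;
    cublasCreate(&handle);
    float a = 42.0f;
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);  // y = a*x + y, entirely on the GPU
    cublasDestroy(handle);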


(33) ! Parallelism. Libraries are not enough? You need to write your own GPU code?

(34) Primer on Parallel Scaling: Amdahl’s Law. Total time t = t_s + t_p. With N processors: t(N) = t_s + t_p/N. Speedup: s(N) = t/t(N) = (t_s + t_p)/(t_s + t_p/N), the possible maximum speedup for N parallel processors. [Chart: speedup vs. number of processors (1–4096) for parallel portions of 50 %, 75 %, 90 %, 95 %, and 99 %.]
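A quick worked example: with a 95 % parallel portion (t_s = 0.05, t_p = 0.95), s(N) = 1 / (0.05 + 0.95/N), so s(1024) ≈ 19.6, while the limit s(∞) = 1/0.05 = 20; the 5 % serial part alone caps the speedup, no matter how many processors are added.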

(35) ! Parallelism. Parallel programming is not easy! Things to consider: Is my application computationally intensive enough? What are the levels of parallelism? How much data needs to be transferred? Is the gain worth the pain?

(36) Possibilities. Different levels of closeness to the GPU when GPU programming, which can ease the pain: OpenACC, OpenMP (directives); Thrust, Kokkos, SYCL (C++ abstractions); PyCUDA, CuPy, Numba (Python). Other alternatives (for completeness): CUDA Fortran, HIP, OpenCL.

(37) CUDA SAXPY. SAXPY: y = a·x + y on vectors (single precision).

    // Specify kernel
    __global__ void saxpy_cuda(int n, float a, float *x, float *y)
    {
        // ID variables
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Guard against too many threads
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    float a = 42.0f;
    int n = 10;
    float *x, *y;
    // Allocate GPU-capable memory
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    // fill x, y

    // Call kernel: 2 blocks, each 5 threads
    saxpy_cuda<<<2, 5>>>(n, a, x, y);
    // Wait for kernel to finish
    cudaDeviceSynchronize();

(38) CUDA Threading Model. Warp the kernel, it’s a thread! Methods to exploit parallelism: Thread → Block (threads 0–5 in each of blocks 0–2 in the diagram), Block → Grid. Threads & blocks in up to 3D. Execution entity: threads; lightweight → fast switching! 1000s of threads execute simultaneously → order is non-deterministic! OpenACC takes care of threads and blocks for you! → The block configuration is just an optimization, and can be derived from the problem size as sketched below.
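For arbitrary n, the block configuration of the SAXPY example is usually computed by rounding up; a common sketch, where threadsPerBlock = 256 is just an illustrative choice:

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up: every element gets a thread
    saxpy_cuda<<<blocks, threadsPerBlock>>>(n, a, x, y);       // the in-kernel guard absorbs the excess threads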

(39) Summary of Acceleration Possibilities. Application: Libraries (drop-in acceleration), Directives (easy acceleration) ← this course, Programming Languages (flexible acceleration).

(40) Conclusions. GPUs achieve performance through specialized hardware → threads. Faster time-to-solution, lower energy-to-solution. GPU acceleration can be done by different means: libraries are the easiest, CUDA the fullest; OpenACC is a good compromise. Thank you for your attention! a.herten@fz-juelich.de

(41) Appendix: Glossary, References.

(42) Glossary I.
AMD: Manufacturer of CPUs and GPUs. 3
API: A programmatic interface to software by well-defined functions; short for application programming interface. 43
ATI: Canada-based GPU manufacturing company; bought by AMD in 2006. 3
CUDA: Computing platform for GPUs from NVIDIA; provides, among others, CUDA C/C++. 2, 3, 36, 37, 38, 40, 43
JSC: Jülich Supercomputing Centre, the supercomputing institute of Forschungszentrum Jülich, Germany. 42
JURECA: A multi-purpose supercomputer with 1800 nodes at JSC. 6
JUWELS: Jülich’s new supercomputer, the successor of JUQUEEN. 7

(43) Glossary II.
NVIDIA: US technology company creating GPUs. 3, 6, 7, 42, 43
OpenACC: Directive-based programming, primarily for many-core machines. 1, 36, 38
OpenCL: The Open Computing Language; a framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. 3, 36
OpenGL: The Open Graphics Library, an API for rendering graphics across different hardware architectures. 3
OpenMP: Directive-based programming, primarily for multi-threaded machines. 36
SAXPY: Single-precision A × X + Y; a simple code example of scaling a vector and adding an offset. 37
Tesla: The GPU product line for general-purpose computing of NVIDIA. 6, 7

(44) Glossary III.
Thrust: A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/. 36

(45) References: Images, Graphics I.
[1] Igor Ovsyannykov. Yarn. Freely available at Unsplash. URL: https://unsplash.com/photos/hvILKk7SlH4.
[6] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/ (pages 4, 5).
[7] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/ (page 11).
[8] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/ (page 11).

(46) References: Images, Graphics II.
[9] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf (pages 24, 25).
[10] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/ (pages 30, 31).

(47) References: Literature I.
[2] Kenneth E. Hoff III et al. “Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware”. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH ’99. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1999, pp. 277–286. ISBN: 0-201-48560-5. DOI: 10.1145/311535.311567. URL: http://dx.doi.org/10.1145/311535.311567 (page 3).
[3] Chris McClanahan. “History and Evolution of GPU Architecture”. In: A Survey Paper (2010). URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf (page 3).
[4] Jack Dongarra et al. TOP500. Nov. 2016. URL: https://www.top500.org/lists/2016/11/ (page 3).

(48) References: Literature II.
[5] Jack Dongarra et al. Green500. Nov. 2016. URL: https://www.top500.org/green500/lists/2016/11/ (page 3).
