# 12 Accelerators and GPU Programming with CUDA

## Background

* Performance scalability is getting more and more limited
  * Single-thread performance increases are limited
  * Multicore scaling is limited by memory bandwidth
  * Alternative: growing number of nodes
* Technological limitations
  * Floor space
  * Power, energy, heat
  * Harder to use all transistors
* Question: **Can we use transistors in a better way?**
  * Is `general purpose computing` the right way?
* Idea: ***extra hardware*** to `accelerate computation`
  * **`Specialized operations`**
  * No longer capable of (easily) sustaining general operations
  * "Hosted" by a (general purpose) host system
  * Impact on programmability

### Examples of Accelerators

![](https://i.imgur.com/HClCjZn.png =500x)

## GPU Architecture

### Graphics Processing Units (GPUs)

* Originally developed to support fast graphics displays
  * Triangle drawing, shading, texture mapping, ...
  * Paved the way for high-end graphics workstations
* Evolution
  * Originally a hardwired pipeline
  * Increasingly programmable over the years
  * Led to usage of GPUs for general programming
    * `Shader programming`
    * "Re-use" of languages/libraries like `OpenGL`
* First **`fully programmable GPUs`** around 2006
  * Aka `General Purpose GPUs` (**`GPGPUs`**)
  * NVIDIA introduces **`CUDA`** to program GPUs
* Today most GPUs are **`programmable`**
  * From AMD GPUs to `integrated SoCs` as on the `Raspberry Pi`
  * Programming varies and is often `low-level`

### Potential of GPUs

![](https://i.imgur.com/656QuuZ.png =500x)

### Concepts of a GPU

* Key concept: **vector / SIMD processing**
  * Massively parallel processing `in a single chip`
* Advantages:
  * `Reduces complexity` of the `processor`
  * Increases `energy efficiency`
* Disadvantages:
  * Reduced flexibility (must vectorize the problem)
  * Additional programming complexity
* GPUs are typically **designed hierarchically**
  * Example in the following based on NVIDIA
  * Other GPUs (AMD, Intel, ...) are similar

### Architectural Diagram of NVIDIA's Ampere u-Arch

![](https://i.imgur.com/9sMvkrT.jpg)

* **`Compute hierarchy`**
  * SM = Streaming Multiprocessor
  * TPC = Texture Processing Cluster
  * GPC = Graphics Processing Cluster
* Central (shared) L2 cache
* Connection to the host via PCI Express 4.0
* Connection between GPUs via NVLINK 3.0
* Multiple memory controllers
  * HBM2 = High Bandwidth Memory 2
* GigaThread Engine with MIG control
  * Schedules kernels to GPC/TPC/SM
  * Enables sharing of the GPU
  * Further advanced in successor generations

#### Streaming Multiprocessor in Ampere u-Arch

![](https://i.imgur.com/YIgsbfX.png)

* 32 FP64 CUDA cores
* 64 FP32 CUDA cores
* 64 INT32 CUDA cores
* 4 3rd-gen Tensor Cores
  * (FP16/FP32/FP64, INT4/8, BF16/TF32, and binary)
* 4x 64 KB of 32-bit registers
* 192 KB shared memory (L1 data cache)
* Thread blocks as the basic scheduling unit
  * Mapped to warps, each executing a fixed number of threads with a single instruction counter

#### NVIDIA's Ampere u-Arch: Tensor Core

![](https://i.imgur.com/vwZZQLG.png)

* Programmable matrix-multiply and accumulate (MMA instructions)
* Useful for AI/deep learning and HPC applications
* New in Ampere: full FP64 (double) support

#### NVIDIA's Ampere u-Arch: Memory Hierarchy

* Registers
  * CUDA threads have their own registers
* L0 cache / processing-block private
  * 16 KB
* L1 I-cache
  * \>48 KB instruction / constant cache
* Unified L1 data cache and shared memory
  * 192 KB (up to 164 KB shared memory)
  * Small, but very high bandwidth / low latency
* Global L2 cache
  * 40 MB
* High Bandwidth Memory (HBM)
  * 6 HBM2 stacks, up to 48 GB
  * 5120-bit bus, 2x6 memory controllers, 1555 GB/s total bandwidth

![](https://i.imgur.com/VdbAE2V.png)
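The hierarchy parameters above (SM count, shared memory per SM, register file, L2 size) can also be inspected at runtime. The following is a minimal sketch, not from the slides, using the CUDA runtime call `cudaGetDeviceProperties`; the values printed depend on the installed GPU.

```c
/* Sketch: query the compute and memory hierarchy of device 0 at runtime. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("GPU:                     %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Shared memory per SM:    %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("L2 cache:                %d MB\n", prop.l2CacheSize / (1024 * 1024));
    printf("Global memory:           %zu GB\n", prop.totalGlobalMem >> 30);
    printf("Memory bus width:        %d bit\n", prop.memoryBusWidth);
    return 0;
}
```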
## Integration

### System Integration

![](https://i.imgur.com/eQBaCkz.png)

### NVIDIA's DGX-1 Architecture

* "Supercomputer in a Box"
* ![](https://i.imgur.com/xQSX4sn.png)
* Intended for **`Deep Learning`**
  * 8 P100 GPUs with 16 GB DRAM each
  * Xeon host CPUs
  * Peak at 170 TFlop/s (at FP16)
* ![](https://i.imgur.com/Dhp74Ey.png =400x)

### NVIDIA's DGX A100 Architecture

![](https://i.imgur.com/QlvyRm4.png)

* Similar concept as in the DGX-1, but with 8 A100-SXM4-40G GPUs and NVSwitch
* ![](https://i.imgur.com/elqFQrW.png =400x)

### IBM Summit System @ ORNL

![](https://i.imgur.com/dd1MR7Y.png)

### Connecting an Accelerator

* Specialized hardware attached to a host
* **Most common** (as of now): **`PCIe`**
  * Add-on cards
  * Access and data transfer via the PCI bus
  * Advantage: easy to integrate
  * Disadvantage: `PCIe` can be a `bottleneck`
* **Alternative**: **`direct processor connection`**
  * Needs new interfaces
  * Removes bottlenecks
* One step **further**: **`CPU integration`**
  * Direct access from `within the CPU`
  * Either fully integrated or with packaged dies
* External booster concept
* ![](https://i.imgur.com/Thqu2JK.png =300x)

### Memory

* Generally, GPUs have a **`separate memory`**
  * Requires explicit data transfers, which complicates programming
  * Transfers are typically expensive over PCI Express
* Newer GPU generations support a **`unified memory option`**
  * `Shared address space` between CPU and GPU
  * Allows `direct access`
  * Note: not necessarily symmetric performance
* Requires **`logical integration`**
  * The `AMD Fusion` approach integrates CPU and GPU in one package
  * NVIDIA provided a software version based on transparent `paging`
  * Newer NVIDIA systems feature a `new HW link`: NVLINK (1 and 2)

## CUDA Programming

* Differences to CPU programming
  * Accelerators have **no OS**
  * Computation in the form of **kernels** for the target architecture
  * Requires a **host program** to run on a host system
* Tasks
  * Definition and programming of a kernel
  * Setting up a data environment
  * Dispatching a kernel
  * Synchronization
  * Collecting results

### Offloading of Kernels

![](https://i.imgur.com/ZYvF1HY.png =500x)

### Common Options

#### CUDA

* Targets NVIDIA GPUs
* Language extensions on top of C/C++
* Combined with a runtime
* AMD version: HIP

#### OpenCL

* Targets any kind of (supported) device
* Language extension on top of C/C++
* Combined with a runtime

#### OpenMP

* Extensions in OpenMP since version 4.0
* General extension for any accelerator
* Accelerators are seen as "devices" and "targets"

#### OpenACC

* Specifically targets GPUs
* Grew out of OpenMP
* Focus on `loop parallelism`

### Kernel Programming with NVIDIA's CUDA

* Base concept: **`CUDA thread`**
  * Used to define all forms of parallelism in the GPU
* **`Thread blocks`**: groups of CUDA threads
  * Executed by a single Streaming Multiprocessor
  * Mapped to warps for SIMD and multithreaded execution
  * NVIDIA classifies CUDA as `Single Instruction, Multiple Thread` (`SIMT`)
* **`Thread blocks`** are required to `execute independently` and `in any order`
  * Different thread blocks `cannot communicate directly`
  * They can be `coordinated` using `atomic memory operations` in the GPU memory (see the sketch below)
* CUDA programs (host and kernel) are compiled with a separate compiler
  * `nvcc`
  * Includes language additions as well as access to CUDA runtime functions
* The ISA (Instruction Set Architecture) for CUDA-capable NVIDIA GPUs is `PTX` (`Parallel Thread Execution`)
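As a small illustration of these concepts (not part of the original slides): in the sketch below each thread block cooperates through shared memory and `__syncthreads()`, while the blocks themselves only coordinate through an atomic update in global memory. The kernel name `block_sum` and the fixed block size of 256 threads are arbitrary choices for this sketch.

```c
// Sketch: cross-block coordination via atomics (block size of 256 assumed).
__global__ void block_sum(const float *x, float *result, int n)
{
    __shared__ float partial[256];            // shared memory, visible to one thread block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;  // global CUDA thread index

    partial[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();                          // synchronizes within the thread block only

    // Tree reduction inside the block (threads of the same block may cooperate)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Blocks cannot communicate directly; one thread per block
    // publishes its partial sum via an atomic update in global memory.
    if (tid == 0) atomicAdd(result, partial[0]);
}
```

A launch such as `block_sum<<<(n + 255) / 256, 256>>>(d_x, d_result, n);` with `*d_result` initialized to zero sums the array; the atomic keeps the result correct even though the blocks run independently and in any order.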
### Code and Data for the GPU

```c
// Functions can run on the host or the device (GPU)

/* Prefixes for functions */
__global__   // Kernels to be executed on the GPU
             // Kernels are configurable
__device__   // GPU functions that can be called from within kernels
__host__     // Functions for the CPU (also the default)

// Variables declared in __device__ or __global__ functions are allocated
// in the GPU memory, accessible by all processors in the GPU
```

#### Kernel Invocation

```c
// Function call syntax for the GPU
name<<<dimGrid, dimBlock>>>(parameter list);
// dim3 dimGrid:  number of thread blocks (x, y, z)
// dim3 dimBlock: number of threads per block (x, y, z)

/* Example */

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Monolithic kernel
__global__ void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Copy data to the device
cudaMalloc(&d_x, nbytes);
cudaMalloc(&d_y, nbytes);
cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, nbytes, cudaMemcpyHostToDevice);

// Invoke DAXPY with 256 threads per thread block
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

// Copy data back from the device
cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);

// Free device memory
cudaFree(d_x);
cudaFree(d_y);
```

![](https://i.imgur.com/0l68Ejr.png)

### Host Synchronization

* All kernel launches are asynchronous
  * Control returns to the CPU immediately
  * The kernel executes after all previous CUDA calls have completed
* **`cudaMemcpy()`** is **synchronous**
  * Control returns to the CPU after the copy completes
  * The copy starts after all previous CUDA calls have completed
* **`cudaThreadSynchronize()`**
  * Blocks until all previous CUDA calls have completed
* **`cudaDeviceSynchronize()`**
  * Blocks until all previous calls on the current CUDA device have completed
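A short sketch of these rules, reusing the names from the DAXPY example above (the CPU-work placeholder is hypothetical):

```c
// The launch below returns immediately, so the host can do independent
// CPU work before synchronizing and reading the results.
daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);   // asynchronous: returns at once

/* ... independent CPU work can run here, overlapping with the GPU ... */

cudaDeviceSynchronize();    // block until the kernel on the current device is done

// The synchronous copy alone would also have waited for the kernel,
// since it starts only after all previous CUDA calls have completed.
cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);
```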
### Asynchrony Rules in CUDA

* All CUDA kernel invocations are **asynchronous**
  * The kernel is just scheduled on the GPU
  * The CPU continues executing
  * Synchronization through `explicit sync routines` or a `synchronizing MemCopy`
* ![](https://i.imgur.com/UgEzYOt.png)
* However, **multiple kernel invocations serialize** by default
  * `Each kernel` **waits** for the completion of the `previous one`
  * A `synchronizing MemCopy` **waits** for `all kernels`
* ![](https://i.imgur.com/EkMVT82.png)

### CUDA Streams

* Streams represent queues for kernel execution
  * Host CPU code places work into the stream/queue and continues
  * The GPU schedules kernels from the queue when resources are free
* **Overlapping**
  * Operations within the `same stream` are `ordered` (`FIFO`) and `cannot overlap`
  * Operations in `different streams` are `unordered` and `can overlap`
* Stream management

```c
cudaStream_t stream;            // Declares a stream handle
cudaStreamCreate(&stream);      // Allocates a stream
cudaStreamDestroy(stream);      // Deallocates a stream
cudaStreamSynchronize(stream);  // Synchronizes host until work in the stream has completed
```

* Kernel invocation on a particular stream

```c
// size = bytes of dynamic shared memory (0 if unused)
name<<<blocks, threads, size, stream>>>(<params>);
```

#### CUDA Stream Example

* Starting `two kernels`

![](https://i.imgur.com/RMM0IjG.png =300x)

```c
foo<<<blocks, threads>>>();
foo<<<blocks, threads>>>();
```

* Starting `two kernels` on `two streams`

![](https://i.imgur.com/Hl5HKVF.png =300x)

```c
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
foo<<<blocks, threads, 0, stream1>>>();
foo<<<blocks, threads, 0, stream2>>>();
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```

* Starting `two kernels` with `one extra stream`

![](https://i.imgur.com/mCYSl3w.png =300x)

```c
cudaStream_t stream;
cudaStreamCreate(&stream);
foo<<<blocks, threads>>>();             // default stream (S0)
foo<<<blocks, threads, 0, stream>>>();  // extra stream (S1)
cudaStreamDestroy(stream);
```

* The default stream (stream 0) is `synchronizing` with respect to all streams

![](https://i.imgur.com/1GAdR8J.png =300x)

```c
for (int i = 0; i < ns; i++) {
    cudaStreamCreate(&s[i]);
    kernel<<<1, 64, 0, s[i]>>>();
    kernel<<<1, 1>>>();   // Stream 0 is the default stream
}
```

#### Per-Stream Memory Copies

* **Synchronizing** **`MemCopy`** (synchronizes to stream 0)
  * `cudaMemcpy(void* d, void* s, size_t c, cudaMemcpyKind kind)`
* **Async** **`MemCopy`**
  * `cudaMemcpyAsync(void* d, void* s, size_t c, cudaMemcpyKind kind, cudaStream_t stream = 0)`
* **Async** copy requires **`pinned host memory`**
  * Memory that is not pageable, i.e., can be safely transferred at any time
  * Special allocation/deallocation calls
    * `cudaMallocHost(...)` / `cudaHostAlloc(...)`
    * `cudaFreeHost(...)`
  * Calls to pin/unpin memory
    * `cudaHostRegister(...)` / `cudaHostUnregister(...)`

#### Streams Can Also Help with Transfer Concurrency

* **Synchronous**

```c
cudaMemcpy(...);
foo<<<...>>>();
```

![](https://i.imgur.com/EorGQEP.png =300x)

* **Asynchronous**, **`same stream`**

```c
cudaMemcpyAsync(..., stream1);
foo<<<..., stream1>>>();
```

![](https://i.imgur.com/QLU8yhu.png =300x)

* **Asynchronous**, **`different streams`**

```c
cudaMemcpyAsync(..., stream1);
foo<<<..., stream2>>>();
```

![](https://i.imgur.com/119x4Sq.png =300x)

### Advanced CUDA Synchronization Topics

* **Explicit Synchronization**
  * Synchronize everything: `cudaDeviceSynchronize()`
  * Synchronize one stream: `cudaStreamSynchronize(stream)`
  * Synchronize using events -> look it up (see the sketch below)
* **Stream Priorities**
  * High-priority streams will **preempt** lower-priority streams
  * `cudaDeviceGetStreamPriorityRange(&low, &high)`
    * A lower number means higher priority
  * Create using a special API
    * `cudaStreamCreateWithPriority(&stream, flags, priority)`
* **Stream Callbacks**
  * `Registration` of host-side callbacks with respect to a stream
  * `Invocation` **when** the tasks in the stream are **complete**
* **Direct Assembler**
  * `PTX assembler` code can be embedded into CUDA C code
* More information
  * http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
  * https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
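For the "synchronize using events" bullet above, the following sketch (not from the slides) shows the common pattern with the CUDA event API; `kernel`, `blocks`, and `threads` are placeholders:

```c
// Sketch: synchronizing with and timing a kernel using CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);            // record into the default stream
kernel<<<blocks, threads>>>();        // asynchronous launch
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);           // block the host until 'stop' has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Events can also order work between streams without blocking the host, e.g. `cudaStreamWaitEvent(stream2, stop, 0)` makes `stream2` wait until the event has been recorded.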
### CUDA Pros & Cons

* CUDA enables straightforward programming of GPUs
  * **Direct mapping** of computation to GPUs
  * **Exposes parallelism**
* **Drawback 1**
  * Very **`low level`**
  * Manual mapping and optimization
  * Be aware of thread mapping and address alignments
* **Drawback 2**
  * **`Bound`** to one vendor (NVIDIA)
  * But AMD is developing 3rd-party support for their GPUs
* **Drawback 3**
  * Want to use the **`host system`** as well, which is also a parallel system
  * Requires a **`separate programming system`**

## OpenMP Target

### Alternative Example: OpenMP Target and Device Capabilities

> Added in OpenMP 4.0, refined and extended since then

* **`Host-centric execution model`**
  * Offloading ***target regions*** to ***target devices***
  * A ***target region*** starts execution as a `single thread` on the device
* **`Device`** is a generic concept
  * Can be **`any kind of accelerator`** (different compilers will support different hardware)
  * Can also be another `CPU`
* **`Data environment`** on the device
  * Directives define the **`data environment`**
    * `Host` variables are called ***original variables***
    * `Device` variables are called ***corresponding variables***
    * Host and device variables may **share storage**
  * ... and define the **`mapping`**
    * `to`: copy from host to device
    * `from`: copy from device to host
    * `tofrom`: both

### Target Directive

```c
// Creates a device data environment
// Executes the target region (structured block) on the specified device
#pragma omp target device(0)
structured-block

// Synchronous:
// The encountering thread waits until the device has finished
//
// device clause:
// Defines the target device
```

### Target Directive with a **`Data Environment`**

```c
// Variables referenced in the target region
// either are declared in the region
// or have to be explicitly added to the data environment
#pragma omp target device(0) map(to: x, y)
structured-block

// map clause: to, from, tofrom
// to:   value copied to the device at the beginning of the block
// from: value copied from the device at the end of the block
```

### Explicit Data Environment

```c
#pragma omp target data device(0) map(to: x, y)
structured-block
// Defines the data environment on a device
// Exists while the structured block is executed on the host
//
// if (scalar-expression) clause
// false: the device is the host

#pragma omp target update device(0) to(x, y)
// Updates corresponding or original variables
// from: GPU -> CPU
// to:   CPU -> GPU
// A list item can only appear in a to or from clause, but not both
```

### Declare Variables and Functions for the Target

```c
// Specifies (static) variables and functions to be mapped to a device
#pragma omp declare target
block of definitions
#pragma omp end declare target

// Function:
// A device-specific version is created by the compiler
//
// Variable:
// A corresponding variable is created in the initial device data environment
// - For all devices
// - Initialized variable: value copied to the device
//
// Additional clauses are available to map variables to specific devices
```

```c
/* Example: daxpy */
#pragma omp declare target
void daxpy(int n, double a, double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
#pragma omp end declare target

...

#pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
daxpy(n, 2.0, x, y);
```
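To keep data on the device across several target regions, one option is an enclosing device data environment as introduced with the `target data` construct above. The sketch below is not from the slides; it uses array-section syntax (`x[0:n]`) because `x` and `y` are pointers here.

```c
/* Sketch: keep x and y on the device across two target regions. */
void scaled_daxpy(int n, double a, double *x, double *y)
{
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        /* No host/device transfer in between: the data stays on the device. */

        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * y[i];
    }   /* y is copied back to the host here */
}
```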
### Programming Challenges

* OpenMP ***unifies*** **`host`** and **`accelerator`** ***programming*** in one model
  * Still have to manage `target device` and `host code` separately
  * Not as `fine-grained control` as with CUDA
* CUDA and OpenMP require ***careful memory management***
  * **`Data transfers`** are **`expensive`**
  * Keep data on the GPU as long as possible
    * Across kernel invocations
  * Requires **`careful`** **`scheduling of kernels`** and **`data transfers`**
* Higher-level programming ***abstractions*** exist
  * Can hide the use of accelerators
  * Often combined with automatic code generation
  * ***Performance or generality is often problematic***
  * Example: CuPy, a NumPy-compatible matrix library with CUDA acceleration

### Beyond the Single GPU or Accelerator

* Goal: exploit the entire node
  * Possibly ***multiple GPUs***
  * The **host** itself is typically a **large NUMA system**
  * Schedule work across all HW threads on CPU and GPU
  * ***Neither*** `CUDA` ***nor*** `OpenMP` provides explicit support for this
    * ***Manual*** chunking and load distribution
    * Load balancing to compensate for heterogeneous performance
* **GPUs** (accelerators) in a **larger HPC system**
  * Must cover **multiple nodes** by combining with `MPI`
  * MPI processes are CUDA or OpenMP programs with target/device code
  * Additional complication for load balance
  * Additional complication: ***#GPUs not equal to #cores***
    * Need to decide which MPI process handles a GPU
    * **Recommendation** is often: **one MPI process for each GPU** (see the sketch below)
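A minimal sketch of the "one MPI process per GPU" recommendation (not from the slides): mapping each rank to a device via the global rank modulo the device count is a simplification, real codes typically use the node-local rank.

```c
/* Sketch: each MPI process binds itself to "its" GPU. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);

    // All subsequent CUDA calls of this process (cudaMalloc, kernel
    // launches, ...) go to the selected device.
    cudaSetDevice(rank % ndevices);

    /* ... distribute work, launch kernels, exchange results via MPI ... */

    MPI_Finalize();
    return 0;
}
```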