# 12 Accelerators and GPU Programming with CUDA
## Background
* Performance scalability is increasingly limited
* Single-thread performance gains are limited
* Multicore scaling limited by memory bandwidth
* Alternative: growing number of nodes
* Technological limitations
* Floorspace
* Power, Energy, Heat
* Harder to use all transistors
* Question: **Can we use transistors in a better way?**
* Is `general purpose computing` the right way?
* Idea: ***extra hardware*** to `accelerate computation`
* **`Specialized operations`**
* No longer capable of (easily) sustaining general operations
* ”Hosted” by a (general purpose) host system
* Impact on programmability
### Examples of Accelerators

## GPU Architecture
### Graphics Processing Units (GPUs)
* Originally developed to support fast graphics displays
* Triangle drawing, shading, texture mapping, etc.
* Paved the way for high-end graphics workstations
* Evolution
* Originally hardwired pipeline
* Increasingly programmable over the years
* Led to usage of GPUs for general programming
* `Shader programming`
* “Re-Use” of languages/libraries like `OpenGL`
* First **`fully programmable GPUs`** around 2006
* Aka `General Purpose GPUs` (**`GPGPUs`**)
* NVIDIA introduces **`CUDA`** to program GPUs
* Today most GPUs are **`programmable`**
* From AMD GPUs to `integrated SoCs` as on the `Raspberry Pi`
* Programming varies and is often `low-level`
### Potential of GPUs

### Concepts of a GPU
* Key concept: **vector / SIMD processing**
* Massively parallel processing `in a single chip`
* Advantages:
* `Reduces complexity` of the `processor`
* Increases `energy efficiency`
* Disadvantages
* Reduced flexibility (must vectorize problem)
* Additional programming complexity
* GPUs are typically **designed hierarchically**
* Example in the following based on NVIDIA
* Other GPUs (AMD, Intel, etc.) are similar
### Architectural Diagram of NVIDIA’s Ampere u-Arch

* **`Compute hierarchy`**
* SM = Streaming Multiprocessor
* TPC = Texture Processing Cluster
* GPC = Graphics Processing Cluster
* Central (Shared) L2 Cache
* Connection to host via PCI Express 4.0
* Connection between GPUs via NVLink 3.0
* Multiple memory controllers
* HBM 2 = High Bandwidth Memory 2
* Gigathread Engine with MIG Control
* Schedule kernels to GPC/TPC/SM
* Enables sharing of GPU
* Further advanced in successor generations
#### Streaming Multiprocessor in Ampere u-Arch

* 32 FP64 CUDA cores
* 64 FP32 CUDA cores
* 64 INT32 CUDA cores
* 4 3rd Gen Tensor Cores
* (FP16/FP32/FP64, INT4/8, BF16/TF32, and binary)
* 4x 64K 32-bit registers
* 192 KB shared memory
* (L1 Data Cache)
* Thread blocks as basic scheduling unit
* Mapped to warps for execution, each with a fixed number of threads sharing a single instruction counter
#### NVIDIA’s Ampere u-Arch: Tensor Core

* Programmable matrix-multiply and accumulate (MMA Instructions)
* Useful for AI/Deep Learning and HPC applications
* New in Ampere: Full FP64 (double) support
#### NVIDIA’s Ampere u-Arch: Memory Hierarchy
* Registers
* CUDA threads have own registers
* L0 cache / Processing Block-private
* 16 KB
* L1 I-cache
* \>48 KB instruction / constant cache
* Unified L1 data cache and Shared memory
* 192 KB (up to 164 KB shared memory)
* Small, but very high bandwidth / Low Latency
* Global L2 Cache
* 4 MB
* High Bandwidth Memory (HBM)
* 6 HBM2 stacks, up to 48 GB
* 5120-bit interface, 2x 6 memory controllers, 1555 GB/s total bandwidth
## Integration
### System Integration

### NVIDIA’s DGX-1 Architecture
* “Supercomputer in a Box”
* Intended for **`Deep Learning`**
* 8 P100 GPUs with 16GB DRAM each
* Xeon host CPUs
* Peak at 170 TFlop/s (at FP16)
### NVIDIA’s DGX A100 Architecture

* Similar concept as in the DGX-1, but with 8 A100-SXM4-40G GPUs and NVSwitch
### IBM Summit System @ ORNL

### Connecting an Accelerator
* Specialized hardware attached to a host
* **Most common** (as of now): **`PCIe`**
* Add-on cards
* Access and data transfer via PCI bus
* Advantage: easy to integrate
* Disadvantage: `PCI` can be a `bottleneck`
* **Alternative**: **`direct processor connection`**
* Needs new interfaces
* Removes bottlenecks
* One step **further**: **`CPU integration`**
* Direct access from `within CPU`
* Either fully integrated or with packaged dies
* External booster concept
### Memory
* Generally GPUs have a **`separate memory`**
* Requires explicit data transfers, which complicates programming
* Typically expensive when going through PCI Express
* Newer GPU generations support a **`unified memory option`**
* `Shared address space` between CPU and GPU
* Allows `direct access`
* Note: not necessarily symmetric performance
* Requires **`logical integration`**
* `AMD Fusion` approach integrates CPU and GPU in one package
* NVIDIA provided software version based on transparent `paging`
* Newer NVIDIA systems feature `new HW link`: NVLINK (1 and 2)
## CUDA Programming
* Differences to CPU programming
* Accelerators have **no OS**
* Computation in the form of **kernels** for the target architectures
* Requires a **host program** running on the host system
* Tasks
* Definition and programming of a kernel
* Setting up a data environment
* Dispatching a kernel
* Synchronization
* Collect results
### Offloading of Kernels

### Common Options
#### CUDA
* Targets NVIDIA GPUs
* Language extensions on top of C/C++
* Combined with runtime
* AMD version: HIP
#### OpenCL for any (supported) device
* Targets any kind of device
* Language extension on top of C/C++
* Combined with runtime
#### OpenMP
* Extensions in OpenMP since version 4.0
* General extension for any accelerator
* Accelerators are seen as “devices” and “targets”
#### OpenACC
* Specifically targets GPUs
* Grew out of OpenMP
* Focus on `loop parallelism` (see the sketch below)
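A minimal, illustrative OpenACC sketch of the DAXPY loop (not from the lecture; the routine name and the `copyin`/`copy` data clauses are just one possible way to express it):
```c
/* DAXPY with OpenACC loop parallelism: a single directive offloads the loop */
void daxpy_acc(int n, double a, double *restrict x, double *restrict y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```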
### Kernel Programming with NVIDIA‘s CUDA
* Base concept: **`CUDA thread`**
* Used to define all forms of parallelism in the GPU
* **`Thread Blocks`**: group of CUDA threads
* Executed by a single Streaming Multiprocessor
* Mapped to warps for SIMD and multithreaded execution
* NVIDIA classifies CUDA as `Single Instruction, Multiple Thread` (`SIMT`)
* **`Thread blocks`** are required to `execute independently` and `in any order`
* Different thread blocks `cannot communicate directly`
* Can be `coordinated` using `atomic memory operations` in the GPU memory (see the sketch after this list)
* CUDA programs (host and kernel) are compiled with a separate compiler
* `nvcc`
* Includes language additions as well as access to CUDA runtime functions
* The ISA (Instruction Set Architecture) for CUDA-capable NVIDIA GPU is `PTX` (`Parallel Thread Execution`)
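A minimal sketch of such coordination via atomics (illustrative only; the kernel name `block_sum` and the `result` pointer are made-up): every thread adds its contribution to one location in global memory, so thread blocks never communicate directly.
```c
// Illustrative sketch: thread blocks coordinate only through atomics on
// global GPU memory. Each thread atomically accumulates into *result.
__global__ void block_sum(const double *x, int n, double *result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, x[i]);   // atomicAdd on double needs compute capability >= 6.0
}
```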
### Code and Data for the GPU
```c
// Functions can run on the host or the device (GPU)
/* Prefix for functions */
__global__
// Kernels to be executed on the GPU
// Kernels are configurable
__device__
// GPU functions that can be called from within kernels
__host__
// Functions for the CPU
// (also default)
// Variables declared in __device__ or __global__ functions are allocated to the GPU memory accessible by all processors in the GPU
```
#### Kernel Invocation
```c
// Function call syntax for the GPU
name<<<dimGrid, dimBlock>>>(parameter list);
// dim3 dimGrid: number of thread blocks (x, y, z)
// dim3 dimBlock: number of threads per block (x, y, z)

/* Example */
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y){
    for (int i=0; i<n; ++i)
        y[i] = a*x[i] + y[i];
}
// Monolithic DAXPY kernel in CUDA: one thread per element
__global__
void daxpy(int n, double a, double *x, double *y){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Allocate device memory and copy data to the device
cudaMalloc(&d_x, nbytes); cudaMalloc(&d_y, nbytes);
cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, nbytes, cudaMemcpyHostToDevice);
// Invoke DAXPY with 256 threads per thread block
int nblocks = (n+255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
// Copy results back from the device
cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_x); cudaFree(d_y);
```
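The kernel above is "monolithic": it assumes one CUDA thread per element. A common alternative, sketched here for illustration (the kernel name and the launch configuration are arbitrary choices, not from the lecture), is a grid-stride loop, which works for any `n` independently of the launch size:
```c
// Grid-stride loop variant of DAXPY: each thread handles several elements,
// striding by the total number of launched threads.
__global__
void daxpy_stride(int n, double a, double *x, double *y){
    for (int i = blockIdx.x*blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x*gridDim.x)
        y[i] = a*x[i] + y[i];
}
// Example launch with a fixed, modest grid size
daxpy_stride<<<32, 256>>>(n, 2.0, d_x, d_y);
```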

### Host Synchronization
* All kernel launches are asynchronous
* Control returns to CPU immediately
* Kernel executes after all previous CUDA calls have completed
* **`cudaMemcpy()`** is **synchronous**
* Control returns to CPU after copy completes
* Copy starts after all previous CUDA calls have completed
* **`cudaThreadSynchronize()`**
* Blocks until all previous CUDA calls complete
* **`cudaDeviceSynchronize()`**
* Blocks until all preceding CUDA calls on the current device have completed (see the sketch below)
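A minimal sketch of the typical pattern, reusing the DAXPY names from the example above: launch asynchronously, optionally do independent host work, then synchronize before touching the results.
```c
// Kernel launch is asynchronous: control returns to the CPU immediately
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
// ... independent host work can run here ...
cudaDeviceSynchronize();   // wait until the kernel has finished
// The synchronous copy below would also wait, but the explicit sync makes the point
cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);
```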
### Asynchrony Rules in CUDA
* All CUDA kernel invocations are **asynchronous**
* Kernel just scheduled on GPU
* CPU continues executing
* Synchronization through `explicit sync routines` or `synchronizing MemCopy`
* However, **multiple kernel invocations serialize** by default
* `Each kernel` **waiting** on the completion of the `previous one`
* `Synchronizing MemCopy` **waiting** on `all kernels`
### CUDA Streams
* Streams represent queues for kernel execution
* Host CPU code places work into the stream/queue and continues
* GPU schedules kernels from queue when resources are free
* **Overlapping**
* Operations within the `same stream` are `ordered` (`FIFO`) and `cannot overlap`
* Operations in `different streams` are `unordered` and `can overlap`
* Stream Management
```c
cudaStream_t stream;
// Declares a stream handle
cudaStreamCreate(&stream);
// Allocates a stream
cudaStreamDestroy(stream);
// Deallocates a stream
cudaStreamSynchronize(stream);
// Synchronizes host until work in stream has completed
```
* Kernel invocation on particular stream
```c
name<<<blocks, threads, size, stream>>>(<params>)
```
#### CUDA Stream Example
* Starting `two kernels`

```c
foo<<<blocks,threads>>>();
foo<<<blocks,threads>>>();
```
* Starting `two kernels` on `two streams`

```c
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
foo<<<blocks,threads,0,stream1>>>();
foo<<<blocks,threads,0,stream2>>>();
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```
* Starting `two kernels` with `one extra stream`

```c
cudaStream_t stream;
cudaStreamCreate(&stream);
foo<<<blocks,threads>>>();           // S0: default stream
foo<<<blocks,threads,0,stream>>>();  // S1: extra stream
cudaStreamDestroy(stream);
```
* The `default stream` (stream 0) is `synchronizing` to all other streams

```c
cudaStream_t s[ns];
for (int i=0; i<ns; i++) {
    cudaStreamCreate(&s[i]);
    kernel<<<1,64,0,s[i]>>>();
    // Stream 0 is the default stream
    kernel<<<1,1>>>();
}
```
#### Per Stream Memory Copies
* **Synchronizing** **`MemCopy`** (synchronizes to stream 0)
* `cudaMemcpy (void* d, void* s, size_t c, cudaMemcpyKind kind)`
* **Async** **`MemCopy`**
* `cudaMemcpyAsync (void* d, void* s, size_t c, cudaMemcpyKind kind, cudaStream_t stream = 0 )`
* **Async** copy requires **`pinned host memory`** (see the sketch after this list)
* Memory not pageable, i.e., can be safely transferred at any time
* Special allocation/deallocation calls
* `cudaMallocHost(...) / cudaHostAlloc(...)`
* `cudaFreeHost(...)`
* Calls to pin/unpin memory
* `cudaHostRegister(...) / cudaHostUnregister(...)`
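A minimal sketch of a pinned allocation feeding an asynchronous copy (assumes `d_x`, `nbytes`, `stream`, `blocks`, `threads`, and a kernel `foo` are already set up as in the earlier examples):
```c
double *h_x;
cudaMallocHost(&h_x, nbytes);            // pinned (page-locked) host allocation
// ... initialize h_x on the host ...
cudaMemcpyAsync(d_x, h_x, nbytes, cudaMemcpyHostToDevice, stream);
foo<<<blocks, threads, 0, stream>>>();   // ordered after the copy in the same stream
cudaStreamSynchronize(stream);           // wait before reusing or freeing h_x
cudaFreeHost(h_x);
```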
#### Streams Can Also Help Transfer Concurrency
* **Synchronous**
```c
cudaMemcpy(...);
foo<<<...>>>();
```

* **Asynchronous** **`Same Stream`**
```c
cudaMemcpyAsync(...,stream1);
foo<<<...,stream1>>>();
```

* **Asynchronous** **`Different Streams`**
```c
cudaMemcpyAsync(...,stream1);
foo<<<...,stream2>>>();
```
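A sketch of how this is typically exploited (illustrative only; it assumes a pinned host buffer `h_x`, a device buffer `d_x`, an array `s[]` of created streams, a placeholder kernel `work` taking a pointer and a count, and `n` divisible by the chunking):
```c
// Chunk the data so the copy of one chunk overlaps with the kernel on another
const int NS = 4;                       // number of streams (illustrative)
int chunk = n / NS;
for (int k = 0; k < NS; k++) {
    int off = k * chunk;
    cudaMemcpyAsync(d_x + off, h_x + off, chunk*sizeof(double),
                    cudaMemcpyHostToDevice, s[k]);
    work<<<(chunk+255)/256, 256, 0, s[k]>>>(d_x + off, chunk);
    cudaMemcpyAsync(h_x + off, d_x + off, chunk*sizeof(double),
                    cudaMemcpyDeviceToHost, s[k]);
}
cudaDeviceSynchronize();                // wait for all streams to finish
```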

### Advanced CUDA Synchronization Topics
* **Explicit Synchronization**
* Synchronize everything: `cudaDeviceSynchronize()`
* Synchronize one stream: `cudaStreamSynchronize(stream)`
* Synchronize using events -> see the sketch after this list
* **Stream Priorities**
* High priority streams will **preempt** lower priority streams
* `cudaDeviceGetStreamPriorityRange(&low, &high)`
* Lower number is higher priority
* Create using special API
* `cudaStreamCreateWithPriority(&stream, flags, priority)`
* **Stream Callbacks**
* `Registration` of host side callbacks with respect to a stream
* `Invocation` **when** tasks in the stream are **complete**
* **Direct Assembler**:
* `PTX Assembler` code can be embedded into CUDA-C code
* More information
* http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
* https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
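A minimal sketch of event-based cross-stream synchronization (the kernels `foo`/`bar` and the two streams are placeholders): stream2 waits for a point recorded in stream1 without blocking the host.
```c
cudaEvent_t ev;
cudaEventCreate(&ev);
foo<<<blocks, threads, 0, stream1>>>();   // producer work in stream1
cudaEventRecord(ev, stream1);             // mark this point in stream1
cudaStreamWaitEvent(stream2, ev, 0);      // stream2 waits for the event; host does not block
bar<<<blocks, threads, 0, stream2>>>();   // consumer work in stream2
cudaEventDestroy(ev);
```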
### CUDA Pros & Cons
* CUDA enables straightforward programming of GPUs
* **Direct mapping** of computation to GPUs
* **Exposes parallelism**
* **Drawback 1**
* Very **`low level`**
* Manual mapping and optimization
* Be aware of thread mapping and address alignments
* **Drawback 2**
* **`Bound`** to one vendor (NVIDIA)
* But AMD is developing 3rd-party support for their GPUs
* **Drawback 3**
* Want to use **`host system`** as well, which is also a parallel system
* Requires a **`separate programming system`**
## OpenMP Target
### Alternative Example: OpenMP Target and Device Capabilities
> Added in OpenMP 4.0, refined and extended since then
* **`Host-centric execution model`**
* Offloading ***target regions*** to ***target devices***
* ***Target region*** starts execution as a `single thread` on the device
* **`Device`** is a generic concept
* Can be **`any kind of accelerator`** (different compilers will support different hardware)
* Can also be another `CPU`
* **`Data environment`** on the device
* Directives define the **`data environment`**
* `Host` variables called ***original variables***
* `Device` variables called ***corresponding variables***
* Host and device variables may **share storage**
* ... define the **`mapping`**
* `to`: copy from host to device
* `from`: copy from device to host
* `tofrom`: both
### Target directive
```c
// Creates a device data environment
// Executes the target region (structured-block) on the specified device
#pragma omp target device(0)
Structured-block
// Synchronous
// The encountering thread waits until the device has finished
// Device clause
// Defines the target device
```
### Target directive with **`Data Environment`**
```c
// Variables referenced in target region
// Either are declared in the region
// Or have to be explicitly added to the data environment
#pragma omp target device(0),map(to:x,y)
Structured-block
// map clause
// to, from, tofrom
// to
// value copied to device at the beginning of the block
// from
// value copied from device at the end of the block
```
### Explicit Data Environment
```c
#pragma omp target data device(0) map(to:x,y)
Structured-block
// Defines the data environment on a device
// Exists while the structured block is executed on the host
// if(scalar-expression) clause
// false: the device is the host
#pragma omp target update device(0) to(x,y)
// Stand-alone directive (no structured block)
// Update corresponding or original variable
// from: GPU->CPU
// to: CPU->GPU
// A list item can only appear in a to or from clause, but not both.
```
### Declare Variables and Functions for Target
```c
// Specifies (static) variables and functions to be mapped to a device
#pragma omp declare target
block of definitions
#pragma omp end declare target
// Function:
// A device specific version is created by the compiler
//
// Variable:
// Corresponding variable created in the initial device data environment
// - For all devices
// - Initialized variable: Value copied to the device.
//
// Additional clauses available to map variables to specific devices
```
```c
/* Example: daxpy */
#pragma omp declare target
void daxpy(int n, double a, double *x, double *y){
    #pragma omp parallel for
    for (int i=0; i<n; ++i)
        y[i] = a*x[i] + y[i];
}
#pragma omp end declare target
...
#pragma omp target map(to:x) map(tofrom:y)
daxpy(n, 2.0, x, y);
```
### Programming Challenges
* OpenMP ***unifies*** **`host`** and **`accelerator`** ***programming*** in one model
* Still have to manage `target device` and `host code` separately
* Offers less `fine-grained control` than CUDA
* CUDA and OpenMP require ***careful memory management***
* **`Data transfers`** are **`expensive`**
* Keep data on GPU as long as possible
* Across kernel invocations (see the sketch after this list)
* Requires **`careful`** **`scheduling of kernels`** and **`data transfers`**
* Higher level programming ***abstractions*** exist
* Can hide use of accelerators
* Often combined with automatic code generation
* ***Performance or generality is often problematic***
* CuPy: NumPy-compatible matrix library with CUDA acceleration
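A minimal OpenMP sketch of keeping data resident on the device across two kernel invocations (illustrative; `a`, `b`, `n`, `x`, `y` are assumed to be declared elsewhere, and array sections are used in the map clauses):
```c
/* Transfer x and y once, run two target regions on the resident data */
#pragma omp target data map(to:x[0:n]) map(tofrom:y[0:n])
{
    #pragma omp target
    #pragma omp parallel for
    for (int i=0; i<n; ++i) y[i] = a*x[i] + y[i];   // first kernel

    #pragma omp target
    #pragma omp parallel for
    for (int i=0; i<n; ++i) y[i] = b*y[i];          // second kernel reuses device data
}
```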
### Beyond the Single GPU or Accelerator
* Goal: exploit the entire node
* Possibly ***multiple GPUs***
* **Host** itself **is** typically a **large NUMA system**
* Schedule work across all HW threads on CPU and GPU
* ***Neither*** `CUDA` ***nor*** `OpenMP` provide explicit support for this
* ***Manual*** chunking and load distribution
* Load balancing to compensate for heterogeneous performance
* **GPUs** (Accelerators) in a **larger HPC system**
* Must cover **multiple nodes** by combining with `MPI`
* Each MPI process contains CUDA or OpenMP target/device code
* Additional complication for load balance
* Additional complication: ***#GPUs not equal #cores***
* Need to decide which MPI process handles which GPU
* **Recommendation** is often: **one MPI process for each GPU** (see the sketch below)
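A minimal sketch of the "one MPI process per GPU" recommendation (illustrative; real codes usually derive a node-local rank, e.g. via a split communicator, before selecting the device):
```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   // bind this MPI process to one GPU
    // ... allocate device memory, launch kernels, exchange results via MPI ...
    MPI_Finalize();
    return 0;
}
```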