# 12 Accelerators and GPU Programming with CUDA

## Background

* Performance scalability is getting more and more limited
  * Single-thread performance increases are limited
  * Multicore scaling is limited by memory bandwidth
  * Alternative: growing number of nodes
* Technological limitations
  * Floor space
  * Power, energy, heat
  * Harder to use all transistors
* Question: **Can we use transistors in a better way?**
  * Is `general purpose computing` the right way?
* Idea: ***extra hardware*** to `accelerate computation`
  * **`Specialized operations`**
  * No longer capable of (easily) sustaining general operations
  * "Hosted" by a (general purpose) host system
  * Impact on programmability

### Examples of Accelerators

![](https://i.imgur.com/HClCjZn.png =500x)

## GPU Architecture

### Graphics Processing Units (GPUs)

* Originally developed to support fast graphics displays
  * Triangle drawing, shading, texture mapping, ...
  * Paved the way for high-end graphics workstations
* Evolution
  * Originally a hardwired pipeline
  * Increasingly programmable over the years
  * Led to usage of GPUs for general programming
    * `Shader programming`
    * "Re-use" of languages/libraries like `OpenGL`
* First **`fully programmable GPUs`** around 2006
  * Aka `General Purpose GPUs` (**`GPGPUs`**)
  * NVIDIA introduces **`CUDA`** to program GPUs
* Today most GPUs are **`programmable`**
  * From AMD GPUs to `integrated SoCs` as on the `Raspberry Pi`
  * Programming varies and is often `low-level`

### Potential of GPUs

![](https://i.imgur.com/656QuuZ.png =500x)

### Concepts of a GPU

* Key concept: **vector / SIMD processing**
  * Massively parallel processing `in a single chip`
* Advantages:
  * `Reduces complexity` of the `processor`
  * Increases `energy efficiency`
* Disadvantages:
  * Reduced flexibility (must vectorize the problem)
  * Additional programming complexity
* GPUs are typically **designed hierarchically**
  * Example in the following based on NVIDIA
  * Other GPUs (AMD, Intel, ...) are similar

### Architectural Diagram of NVIDIA's Ampere u-Arch

![](https://i.imgur.com/9sMvkrT.jpg)

* **`Compute hierarchy`**
  * SM = Streaming Multiprocessor
  * TPC = Texture Processing Cluster
  * GPC = Graphics Processing Cluster
* Central (shared) L2 cache
* Connection to the host via PCI Express 4.0
* Connection between GPUs via NVLINK 3.0
* Multiple memory controllers
  * HBM2 = High Bandwidth Memory 2
* GigaThread Engine with MIG control
  * Schedules kernels to GPC/TPC/SM
  * Enables sharing of the GPU
  * Further advanced in successor generations

#### Streaming Multiprocessor in Ampere u-Arch

![](https://i.imgur.com/YIgsbfX.png)

* 32 FP64 CUDA cores
* 64 FP32 CUDA cores
* 64 INT32 CUDA cores
* 4 3rd-gen Tensor Cores
  * (FP16/FP32/FP64, INT4/8, BF16/TF32, and binary)
* 4x 64 KB of 32-bit registers
* 192 KB shared memory (L1 data cache)
* Thread blocks as the basic scheduling unit
  * Mapped to warps, each executing a fixed number of threads with a single instruction counter

#### NVIDIA's Ampere u-Arch: Tensor Core

![](https://i.imgur.com/vwZZQLG.png)

* Programmable matrix-multiply and accumulate (MMA instructions)
* Useful for AI/deep learning and HPC applications
* New in Ampere: full FP64 (double) support

#### NVIDIA's Ampere u-Arch: Memory Hierarchy

* Registers
  * CUDA threads have their own registers
* L0 cache / processing-block private
  * 16 KB
* L1 I-cache
  * \>48 KB instruction / constant cache
* Unified L1 data cache and shared memory
  * 192 KB (up to 164 KB shared memory)
  * Small, but very high bandwidth / low latency
* Global L2 cache
  * 40 MB
* High Bandwidth Memory (HBM)
  * 6 HBM2 stacks, up to 48 GB
  * 5120-bit bus, 2x6 memory controllers, 1555 GB/s total bandwidth

![](https://i.imgur.com/VdbAE2V.png)
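The hierarchy parameters above (SM count, shared memory per SM, register file, L2 size) can also be inspected at runtime. The following is a minimal sketch, not from the slides, using the CUDA runtime call `cudaGetDeviceProperties`; the values printed depend on the installed GPU.

```c
/* Sketch: query the compute and memory hierarchy of device 0 at runtime. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("GPU:                     %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Shared memory per SM:    %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("L2 cache:                %d MB\n", prop.l2CacheSize / (1024 * 1024));
    printf("Global memory:           %zu GB\n", prop.totalGlobalMem >> 30);
    printf("Memory bus width:        %d bit\n", prop.memoryBusWidth);
    return 0;
}
```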
## Integration

### System Integration

![](https://i.imgur.com/eQBaCkz.png)

### NVIDIA's DGX-1 Architecture

* "Supercomputer in a Box"
* ![](https://i.imgur.com/xQSX4sn.png)
* Intended for **`Deep Learning`**
  * 8 P100 GPUs with 16 GB DRAM each
  * Xeon host CPUs
  * Peak at 170 TFlop/s (at FP16)
* ![](https://i.imgur.com/Dhp74Ey.png =400x)

### NVIDIA's DGX A100 Architecture

![](https://i.imgur.com/QlvyRm4.png)

* Similar concept as in the DGX-1, but with 8 A100-SXM4-40G GPUs and NVSwitch
* ![](https://i.imgur.com/elqFQrW.png =400x)

### IBM Summit System @ ORNL

![](https://i.imgur.com/dd1MR7Y.png)

### Connecting an Accelerator

* Specialized hardware attached to a host
* **Most common** (as of now): **`PCIe`**
  * Add-on cards
  * Access and data transfer via the PCI bus
  * Advantage: easy to integrate
  * Disadvantage: `PCIe` can be a `bottleneck`
* **Alternative**: **`direct processor connection`**
  * Needs new interfaces
  * Removes bottlenecks
* One step **further**: **`CPU integration`**
  * Direct access from `within the CPU`
  * Either fully integrated or with packaged dies
* External booster concept
* ![](https://i.imgur.com/Thqu2JK.png =300x)

### Memory

* Generally, GPUs have a **`separate memory`**
  * Requires explicit data transfers, which complicates programming
  * Transfers are typically expensive over PCI Express
* Newer GPU generations support a **`unified memory option`**
  * `Shared address space` between CPU and GPU
  * Allows `direct access`
  * Note: not necessarily symmetric performance
* Requires **`logical integration`**
  * The `AMD Fusion` approach integrates CPU and GPU in one package
  * NVIDIA provided a software version based on transparent `paging`
  * Newer NVIDIA systems feature a `new HW link`: NVLINK (1 and 2)

## CUDA Programming

* Differences to CPU programming
  * Accelerators have **no OS**
  * Computation in the form of **kernels** for the target architecture
  * Requires a **host program** to run on a host system
* Tasks
  * Definition and programming of a kernel
  * Setting up a data environment
  * Dispatching a kernel
  * Synchronization
  * Collecting results

### Offloading of Kernels

![](https://i.imgur.com/ZYvF1HY.png =500x)

### Common Options

#### CUDA

* Targets NVIDIA GPUs
* Language extensions on top of C/C++
* Combined with a runtime
* AMD version: HIP

#### OpenCL

* Targets any kind of (supported) device
* Language extension on top of C/C++
* Combined with a runtime

#### OpenMP

* Extensions in OpenMP since version 4.0
* General extension for any accelerator
* Accelerators are seen as "devices" and "targets"

#### OpenACC

* Specifically targets GPUs
* Grew out of OpenMP
* Focus on `loop parallelism`

### Kernel Programming with NVIDIA's CUDA

* Base concept: **`CUDA thread`**
  * Used to define all forms of parallelism in the GPU
* **`Thread blocks`**: groups of CUDA threads
  * Executed by a single Streaming Multiprocessor
  * Mapped to warps for SIMD and multithreaded execution
  * NVIDIA classifies CUDA as `Single Instruction, Multiple Thread` (`SIMT`)
* **`Thread blocks`** are required to `execute independently` and `in any order`
  * Different thread blocks `cannot communicate directly`
  * They can be `coordinated` using `atomic memory operations` in the GPU memory (see the sketch below)
* CUDA programs (host and kernel) are compiled with a separate compiler
  * `nvcc`
  * Includes language additions as well as access to CUDA runtime functions
* The ISA (Instruction Set Architecture) for CUDA-capable NVIDIA GPUs is `PTX` (`Parallel Thread Execution`)
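As a small illustration of these concepts (not part of the original slides): in the sketch below each thread block cooperates through shared memory and `__syncthreads()`, while the blocks themselves only coordinate through an atomic update in global memory. The kernel name `block_sum` and the fixed block size of 256 threads are arbitrary choices for this sketch.

```c
// Sketch: cross-block coordination via atomics (block size of 256 assumed).
__global__ void block_sum(const float *x, float *result, int n)
{
    __shared__ float partial[256];            // shared memory, visible to one thread block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;  // global CUDA thread index

    partial[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();                          // synchronizes within the thread block only

    // Tree reduction inside the block (threads of the same block may cooperate)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Blocks cannot communicate directly; one thread per block
    // publishes its partial sum via an atomic update in global memory.
    if (tid == 0) atomicAdd(result, partial[0]);
}
```

A launch such as `block_sum<<<(n + 255) / 256, 256>>>(d_x, d_result, n);` with `*d_result` initialized to zero sums the array; the atomic keeps the result correct even though the blocks run independently and in any order.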
### Code and Data for the GPU

```c
// Functions can run on the host or the device (GPU)

/* Prefixes for functions */
__global__   // Kernels to be executed on the GPU
             // Kernels are configurable
__device__   // GPU functions that can be called from within kernels
__host__     // Functions for the CPU (also the default)

// Variables declared in __device__ or __global__ functions are allocated
// in the GPU memory, accessible by all processors in the GPU
```

#### Kernel Invocation

```c
// Function call syntax for the GPU
name<<<dimGrid, dimBlock>>>(parameter list);
// dim3 dimGrid:  number of thread blocks (x, y, z)
// dim3 dimBlock: number of threads per block (x, y, z)

/* Example */

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Monolithic kernel
__global__ void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Copy data to the device
cudaMalloc(&d_x, nbytes);
cudaMalloc(&d_y, nbytes);
cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, nbytes, cudaMemcpyHostToDevice);

// Invoke DAXPY with 256 threads per thread block
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

// Copy data back from the device
cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);

// Free device memory
cudaFree(d_x);
cudaFree(d_y);
```

![](https://i.imgur.com/0l68Ejr.png)

### Host Synchronization

* All kernel launches are asynchronous
  * Control returns to the CPU immediately
  * The kernel executes after all previous CUDA calls have completed
* **`cudaMemcpy()`** is **synchronous**
  * Control returns to the CPU after the copy completes
  * The copy starts after all previous CUDA calls have completed
* **`cudaThreadSynchronize()`**
  * Blocks until all previous CUDA calls have completed
* **`cudaDeviceSynchronize()`**
  * Blocks until all previous calls on the current CUDA device have completed
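A short sketch of these rules, reusing the names from the DAXPY example above (the CPU-work placeholder is hypothetical):

```c
// The launch below returns immediately, so the host can do independent
// CPU work before synchronizing and reading the results.
daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);   // asynchronous: returns at once

/* ... independent CPU work can run here, overlapping with the GPU ... */

cudaDeviceSynchronize();    // block until the kernel on the current device is done

// The synchronous copy alone would also have waited for the kernel,
// since it starts only after all previous CUDA calls have completed.
cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);
```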
### Asynchrony Rules in CUDA

* All CUDA kernel invocations are **asynchronous**
  * The kernel is just scheduled on the GPU
  * The CPU continues executing
  * Synchronization through `explicit sync routines` or a `synchronizing MemCopy`
* ![](https://i.imgur.com/UgEzYOt.png)
* However, **multiple kernel invocations serialize** by default
  * `Each kernel` **waits** for the completion of the `previous one`
  * A `synchronizing MemCopy` **waits** for `all kernels`
* ![](https://i.imgur.com/EkMVT82.png)

### CUDA Streams

* Streams represent queues for kernel execution
  * Host CPU code places work into the stream/queue and continues
  * The GPU schedules kernels from the queue when resources are free
* **Overlapping**
  * Operations within the `same stream` are `ordered` (`FIFO`) and `cannot overlap`
  * Operations in `different streams` are `unordered` and `can overlap`
* Stream management

```c
cudaStream_t stream;            // Declares a stream handle
cudaStreamCreate(&stream);      // Allocates a stream
cudaStreamDestroy(stream);      // Deallocates a stream
cudaStreamSynchronize(stream);  // Synchronizes host until work in the stream has completed
```

* Kernel invocation on a particular stream

```c
// size = bytes of dynamic shared memory (0 if unused)
name<<<blocks, threads, size, stream>>>(<params>);
```

#### CUDA Stream Example

* Starting `two kernels`

![](https://i.imgur.com/RMM0IjG.png =300x)

```c
foo<<<blocks, threads>>>();
foo<<<blocks, threads>>>();
```

* Starting `two kernels` on `two streams`

![](https://i.imgur.com/Hl5HKVF.png =300x)

```c
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
foo<<<blocks, threads, 0, stream1>>>();
foo<<<blocks, threads, 0, stream2>>>();
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```

* Starting `two kernels` with `one extra stream`

![](https://i.imgur.com/mCYSl3w.png =300x)

```c
cudaStream_t stream;
cudaStreamCreate(&stream);
foo<<<blocks, threads>>>();             // default stream (S0)
foo<<<blocks, threads, 0, stream>>>();  // extra stream (S1)
cudaStreamDestroy(stream);
```

* The default stream (stream 0) is `synchronizing` with respect to all streams

![](https://i.imgur.com/1GAdR8J.png =300x)

```c
for (int i = 0; i < ns; i++) {
    cudaStreamCreate(&s[i]);
    kernel<<<1, 64, 0, s[i]>>>();
    kernel<<<1, 1>>>();   // Stream 0 is the default stream
}
```

#### Per-Stream Memory Copies

* **Synchronizing** **`MemCopy`** (synchronizes to stream 0)
  * `cudaMemcpy(void* d, void* s, size_t c, cudaMemcpyKind kind)`
* **Async** **`MemCopy`**
  * `cudaMemcpyAsync(void* d, void* s, size_t c, cudaMemcpyKind kind, cudaStream_t stream = 0)`
* **Async** copy requires **`pinned host memory`**
  * Memory that is not pageable, i.e., can be safely transferred at any time
  * Special allocation/deallocation calls
    * `cudaMallocHost(...)` / `cudaHostAlloc(...)`
    * `cudaFreeHost(...)`
  * Calls to pin/unpin memory
    * `cudaHostRegister(...)` / `cudaHostUnregister(...)`

#### Streams Can Also Help with Transfer Concurrency

* **Synchronous**

```c
cudaMemcpy(...);
foo<<<...>>>();
```

![](https://i.imgur.com/EorGQEP.png =300x)

* **Asynchronous**, **`same stream`**

```c
cudaMemcpyAsync(..., stream1);
foo<<<..., stream1>>>();
```

![](https://i.imgur.com/QLU8yhu.png =300x)

* **Asynchronous**, **`different streams`**

```c
cudaMemcpyAsync(..., stream1);
foo<<<..., stream2>>>();
```

![](https://i.imgur.com/119x4Sq.png =300x)

### Advanced CUDA Synchronization Topics

* **Explicit Synchronization**
  * Synchronize everything: `cudaDeviceSynchronize()`
  * Synchronize one stream: `cudaStreamSynchronize(stream)`
  * Synchronize using events -> look it up (see the sketch below)
* **Stream Priorities**
  * High-priority streams will **preempt** lower-priority streams
  * `cudaDeviceGetStreamPriorityRange(&low, &high)`
    * A lower number means higher priority
  * Create using a special API
    * `cudaStreamCreateWithPriority(&stream, flags, priority)`
* **Stream Callbacks**
  * `Registration` of host-side callbacks with respect to a stream
  * `Invocation` **when** the tasks in the stream are **complete**
* **Direct Assembler**
  * `PTX assembler` code can be embedded into CUDA C code
* More information
  * http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
  * https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
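For the "synchronize using events" bullet above, the following sketch (not from the slides) shows the common pattern with the CUDA event API; `kernel`, `blocks`, and `threads` are placeholders:

```c
// Sketch: synchronizing with and timing a kernel using CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);            // record into the default stream
kernel<<<blocks, threads>>>();        // asynchronous launch
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);           // block the host until 'stop' has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Events can also order work between streams without blocking the host, e.g. `cudaStreamWaitEvent(stream2, stop, 0)` makes `stream2` wait until the event has been recorded.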
### CUDA Pros & Cons

* CUDA enables straightforward programming of GPUs
  * **Direct mapping** of computation to GPUs
  * **Exposes parallelism**
* **Drawback 1**
  * Very **`low level`**
  * Manual mapping and optimization
  * Be aware of thread mapping and address alignments
* **Drawback 2**
  * **`Bound`** to one vendor (NVIDIA)
  * But AMD is developing 3rd-party support for their GPUs
* **Drawback 3**
  * Want to use the **`host system`** as well, which is also a parallel system
  * Requires a **`separate programming system`**

## OpenMP Target

### Alternative Example: OpenMP Target and Device Capabilities

> Added in OpenMP 4.0, refined and extended since then

* **`Host-centric execution model`**
  * Offloading ***target regions*** to ***target devices***
  * A ***target region*** starts execution as a `single thread` on the device
* **`Device`** is a generic concept
  * Can be **`any kind of accelerator`** (different compilers will support different hardware)
  * Can also be another `CPU`
* **`Data environment`** on the device
  * Directives define the **`data environment`**
    * `Host` variables are called ***original variables***
    * `Device` variables are called ***corresponding variables***
    * Host and device variables may **share storage**
  * ... and define the **`mapping`**
    * `to`: copy from host to device
    * `from`: copy from device to host
    * `tofrom`: both

### Target Directive

```c
// Creates a device data environment
// Executes the target region (structured block) on the specified device
#pragma omp target device(0)
structured-block

// Synchronous:
// The encountering thread waits until the device has finished
//
// device clause:
// Defines the target device
```

### Target Directive with a **`Data Environment`**

```c
// Variables referenced in the target region
// either are declared in the region
// or have to be explicitly added to the data environment
#pragma omp target device(0) map(to: x, y)
structured-block

// map clause: to, from, tofrom
// to:   value copied to the device at the beginning of the block
// from: value copied from the device at the end of the block
```

### Explicit Data Environment

```c
#pragma omp target data device(0) map(to: x, y)
structured-block
// Defines the data environment on a device
// Exists while the structured block is executed on the host
//
// if (scalar-expression) clause
// false: the device is the host

#pragma omp target update device(0) to(x, y)
// Updates corresponding or original variables
// from: GPU -> CPU
// to:   CPU -> GPU
// A list item can only appear in a to or from clause, but not both
```

### Declare Variables and Functions for the Target

```c
// Specifies (static) variables and functions to be mapped to a device
#pragma omp declare target
block of definitions
#pragma omp end declare target

// Function:
// A device-specific version is created by the compiler
//
// Variable:
// A corresponding variable is created in the initial device data environment
// - For all devices
// - Initialized variable: value copied to the device
//
// Additional clauses are available to map variables to specific devices
```

```c
/* Example: daxpy */
#pragma omp declare target
void daxpy(int n, double a, double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
#pragma omp end declare target

...

#pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
daxpy(n, 2.0, x, y);
```
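To keep data on the device across several target regions, one option is an enclosing device data environment as introduced with the `target data` construct above. The sketch below is not from the slides; it uses array-section syntax (`x[0:n]`) because `x` and `y` are pointers here.

```c
/* Sketch: keep x and y on the device across two target regions. */
void scaled_daxpy(int n, double a, double *x, double *y)
{
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        /* No host/device transfer in between: the data stays on the device. */

        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * y[i];
    }   /* y is copied back to the host here */
}
```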
### Programming Challenges

* OpenMP ***unifies*** **`host`** and **`accelerator`** ***programming*** in one model
  * Still have to manage `target device` and `host code` separately
  * Not as `fine-grained control` as with CUDA
* CUDA and OpenMP require ***careful memory management***
  * **`Data transfers`** are **`expensive`**
  * Keep data on the GPU as long as possible
    * Across kernel invocations
  * Requires **`careful`** **`scheduling of kernels`** and **`data transfers`**
* Higher-level programming ***abstractions*** exist
  * Can hide the use of accelerators
  * Often combined with automatic code generation
  * ***Performance or generality is often problematic***
  * Example: CuPy, a NumPy-compatible matrix library with CUDA acceleration

### Beyond the Single GPU or Accelerator

* Goal: exploit the entire node
  * Possibly ***multiple GPUs***
  * The **host** itself is typically a **large NUMA system**
  * Schedule work across all HW threads on CPU and GPU
  * ***Neither*** `CUDA` ***nor*** `OpenMP` provides explicit support for this
    * ***Manual*** chunking and load distribution
    * Load balancing to compensate for heterogeneous performance
* **GPUs** (accelerators) in a **larger HPC system**
  * Must cover **multiple nodes** by combining with `MPI`
  * MPI processes are CUDA or OpenMP programs with target/device code
  * Additional complication for load balance
  * Additional complication: ***#GPUs not equal to #cores***
    * Need to decide which MPI process handles a GPU
    * **Recommendation** is often: **one MPI process for each GPU** (see the sketch below)
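A minimal sketch of the "one MPI process per GPU" recommendation (not from the slides): mapping each rank to a device via the global rank modulo the device count is a simplification, real codes typically use the node-local rank.

```c
/* Sketch: each MPI process binds itself to "its" GPU. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);

    // All subsequent CUDA calls of this process (cudaMalloc, kernel
    // launches, ...) go to the selected device.
    cudaSetDevice(rank % ndevices);

    /* ... distribute work, launch kernels, exchange results via MPI ... */

    MPI_Finalize();
    return 0;
}
```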