
12 Accelerators and GPU Programming with CUDA

Background

  • Performance scalability is getting more and more limited
    • Single thread performance increases limited
    • Multicore scaling limited by memory bandwidth
    • Alternative: growing number of nodes
  • Technological limitations
    • Floorspace
    • Power, Energy, Heat
    • Harder to use all transistors
  • Question: can we use transistors in a better way?
    • Is general purpose computing the right way?
  • Idea: extra hardware to accelerate computation
    • Specialized operations
    • No longer capable of (easily) sustaining general operations
    • “Hosted” by a (general-purpose) host system
    • Impact on programmability

Examples of Accelerators

(Figure: examples of accelerators.)

GPU Architecture

Graphics Processing Units (GPUs)

  • Originally developed to support fast graphics displays
    • Triangle drawing, shading, texture mapping, …
    • Paved the way for high-end graphics workstations
  • Evolution
    • Originally hardwired pipeline
    • Increasingly programmable over the years
    • Led to usage of GPUs for general programming
      • Shader programming
      • “Re-Use” of languages/libraries like OpenGL
  • First fully programmable GPUs around 2006
    • Aka General-Purpose GPUs (GPGPUs)
    • NVIDIA introduced CUDA to program GPUs
  • Today most GPUs are programmable
    • From AMD GPUs to integrated SoCs as on the Raspberry Pi
    • Programming varies and is often low-level

Potential of GPUs

(Figure: potential of GPUs.)

Concepts of a GPU

  • Key concept: vector / SIMD processing
    • Massively parallel processing in a single chip
  • Advantages:
    • Reduces complexity of the processor
    • Increases energy efficiency
  • Disadvantages
    • Reduced flexibility (must vectorize problem)
    • Additional programming complexity
  • GPUs are typically designed hierarchically
    • Example in the following based on NVIDIA
    • Other GPUs (AMD, Intel, …) are similar

Architectural Diagram of NVIDIA’s Ampere µ-Arch


  • Compute hierarchy
    • SM = Streaming Multiprocessor
    • TPC = Texture Processing Cluster
    • GPC = Graphics Processing Cluster
  • Central (Shared) L2 Cache
  • Connection to host via PCI Express 4.0
  • Connection between GPUs via NVLink 3.0
  • Multiple memory controllers
    • HBM 2 = High Bandwidth Memory 2
  • Gigathread Engine with MIG Control
    • Schedule kernels to GPC/TPC/SM
    • Enables sharing of GPU
    • Further advanced in successor generations

Streaming Multiprocessor in the Ampere µ-Arch


  • 32 FP64 CUDA cores
  • 64 FP32 CUDA cores
  • 64 INT32 CUDA cores
  • 4 3rd-generation Tensor Cores (FP16/FP32/FP64, INT4/INT8, BF16/TF32, and binary)
  • 4 × 16 K 32-bit registers (256 KB register file per SM)
  • 192 KB combined shared memory / L1 data cache
  • Thread blocks as the basic scheduling unit
    • Mapped to warps, which execute a fixed number of threads under a single instruction counter (sketched below)
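
The warp model above means that all threads of a warp share one instruction counter. A minimal sketch (not from the slides; the kernel name is illustrative) of what this implies for data-dependent branches:

__global__ void divergent(float *out){
    // Threads of one warp share an instruction counter, so a data-dependent
    // branch forces both paths to be executed, with inactive lanes masked off.
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = 1.0f;   // half of each warp takes this path ...
    else
        out[i] = -1.0f;  // ... the other half takes this one; the two are serialized
}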

NVIDIA’s Ampere µ-Arch: Tensor Core


  • Programmable matrix-multiply and accumulate (MMA Instructions)
  • Useful for AI/Deep Learning and HPC applications
  • New in Ampere: Full FP64 (double) support

NVIDIA’s Ampere µ-Arch: Memory Hierarchy

  • Registers
    • CUDA threads have own registers
  • L0 cache / Processing Block-private
    • 16 KB
  • L1 I-cache
    • >48KB Instruction / Constant cache
  • Unified L1 data cache and Shared memory
    • 192 KB (up to 164 KB shared memory)
    • Small, but very high bandwidth / Low Latency
  • Global L2 Cache
    • 4 MB
  • High Bandwidth Memory (HBM)
    • 6 HBM2 stacks, up to 48 GB
    • 5120-bit interface, 2 × 6 memory controllers, 1555 GB/s total bandwidth

Integration

System Integration

(Figure: system integration of accelerators.)

NVIDIA’s DGX-1 Architecture

  • “Supercomputer in a Box”
  • Intended for Deep Learning
    • 8 P100 GPUs with 16GB DRAM each
    • Xeon host CPUs
    • Peak at 170 TFlop/s (at FP16)

NVIDIA’s DGX A100 Architecture


  • Similar concept to the DGX-1, but with 8 A100-SXM4-40G GPUs and NVSwitch

IBM Summit System @ ORNL

(Figure: the IBM Summit system at ORNL.)

Connecting an Accelerator

  • Specialized hardware attached to a host
  • Most common (as of now): PCIe
    • Add-on cards
    • Access and data transfer via PCI bus
    • Advantage: easy to integrate
    • Disadvantage: PCI can be a bottleneck
  • Alternative: direct processor connection
    • Needs new interfaces
    • Removes bottlenecks
  • One step further: CPU integration
    • Direct access from within CPU
    • Either fully integrated or with packaged dies
  • External booster concept

Memory

  • Generally GPUs have a separate memory
    • Requires explicit data transfers, which complicates programming
    • Transfers over PCI Express are typically expensive
  • Newer GPU generations support a unified memory option (see the sketch after this list)
    • Shared address space between CPU and GPU
    • Allows direct access
    • Note: not necessarily symmetric performance
  • Requires logical integration
    • AMD Fusion approach integrates CPU and GPU in one package
    • NVIDIA provides a software-based version using transparent paging (managed memory)
    • Newer NVIDIA systems feature a dedicated HW link: NVLink (1 and 2)
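
As a concrete illustration of the unified memory option, here is a minimal sketch (not from the slides) using CUDA managed memory; the kernel name scale and the size n are assumptions for the example:

// Managed (unified) memory: one allocation visible to both CPU and GPU
__global__ void scale(int n, double a, double *x){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

double *x;
cudaMallocManaged(&x, n*sizeof(double));   // single pointer valid on host and device
for (int i = 0; i < n; ++i) x[i] = 1.0;    // initialize on the host, no explicit copy
scale<<<(n+255)/256, 256>>>(n, 2.0, x);
cudaDeviceSynchronize();                   // pages migrate on demand; performance is not symmetric
cudaFree(x);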

CUDA Programming

  • Differences to CPU programming
    • Accelerators have no OS
    • Computation in the form of kernels for the target architectures
    • Requires a host program running on the host system
  • Tasks
    • Definition and programming of a kernel
    • Setting up a data environment
    • Dispatching a kernel
    • Synchronization
    • Collect results

Offloading of Kernels

Common Options

CUDA

  • Targets NVIDIA GPUs
  • Language extensions on top of C/C++
  • Combined with runtime
  • AMD version: HIP

OpenCL

  • Targets any kind of device
  • Language extension on top of C/C++
  • Combined with runtime

OpenMP

  • Extensions in OpenMP since version 4.0
  • General extension for any accelerator
  • Accelerators are seen as “devices” and “targets”

OpenACC

  • Specifically targets GPUs
  • Grew out of OpenMP
  • Focus on loop parallelism
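
For comparison, a minimal OpenACC sketch (added here for illustration, not from the slides) of the DAXPY loop used later in this chapter:

// OpenACC: loop parallelism expressed via directives; the compiler generates the kernel
void daxpy_acc(int n, double a, double *restrict x, double *restrict y){
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i=0; i<n; ++i)
        y[i] = a*x[i] + y[i];
}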

Kernel Programming with NVIDIA’s CUDA

  • Base concept: CUDA thread
    • Used to define all forms of parallelism in the GPU
  • Thread Blocks: group of CUDA threads
    • Executed by a single Streaming Multiprocessor
    • Mapped to warps for SIMD and multithreaded execution
    • NVIDIA classifies CUDA as Single Instruction, Multiple Thread (SIMT)
  • Thread blocks are required to execute independently and in any order
    • Different thread blocks cannot communicate directly
    • They can be coordinated using atomic memory operations on GPU memory (see the sketch after this list)
  • CUDA programs (host and kernel) are compiled with a separate compiler
    • nvcc
    • Includes language additions as well as access to CUDA runtime functions
    • The ISA (Instruction Set Architecture) for CUDA-capable NVIDIA GPUs is PTX (Parallel Thread Execution)
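
A hedged sketch of the coordination point above (not from the slides; kernel and variable names are illustrative): thread blocks cannot communicate directly, but they can coordinate through atomic operations on global GPU memory.

// All blocks update the same counter in global memory via atomicAdd
__global__ void count_nonzero(int n, const double *x, unsigned int *count){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n && x[i] != 0.0)
        atomicAdd(count, 1u);
}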

Code and Data for the GPU

// Functions can run on the host or the device (GPU)

/* Prefix for functions */

__global__
// Kernels to be executed on the GPU
// Kernels are configurable

__device__
// GPU functions that can be called from within kernels

__host__
// Functions for the CPU 
// (also default)


// Variables declared in __device__ or __global__ functions are allocated in GPU memory, which is accessible by all processors of the GPU

Kernel Invocation

// Function call syntax for the GPU
name<<<dimGrid, dimBlock>>>( parameter list)

// dim3 dimGrid:  number of thread blocks (x, y, z)
// dim3 dimBlock: number of threads per block (x, y, z)

/* Example */

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y){
    for (int i=0; i<n; ++i)
        y[i]=a*x[i]+y[i];
}

// Monolithic Kernel
__global__
void daxpy(int n, double a, double *x, double *y){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
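
The kernel above is “monolithic”: it assumes one CUDA thread per element. A common alternative (shown here as an illustration, not taken from the original slide) is a grid-stride loop, which decouples the launch size from n:

// Grid-stride variant: each thread processes multiple elements
__global__
void daxpy_gridstride(int n, double a, double *x, double *y){
    int stride = blockDim.x*gridDim.x;
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a*x[i] + y[i];
}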

//Copy data to device
cudaMalloc(&d_x, nbytes); cudaMalloc(&d_y, nbytes);
cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, nbytes, cudaMemcpyHostToDevice);

// invoke DAXPY with 256 threads per Thread Block
int nblocks = (n+255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

//Copy data from device
cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);

//Free device memory
cudaFree(d_x); cudaFree(d_y);

Host Synchronization

  • All kernel launches are asynchronous
    • Control returns to CPU immediately
    • Kernel executes after all previous CUDA calls have completed
  • cudaMemcpy() is synchronous
    • Control returns to CPU after copy completes
    • Copy starts after all previous CUDA calls have completed
  • cudaThreadSynchronize() (deprecated; use cudaDeviceSynchronize())
    • Blocks until all previous CUDA calls complete
  • cudaDeviceSynchronize()
    • Blocks until the current device has completed all previously issued CUDA calls
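
A small sketch tying these rules together (assembled for illustration, reusing the names from the DAXPY example above):

daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);            // asynchronous: control returns immediately
// ... independent CPU work can run here ...
cudaDeviceSynchronize();                              // block until the kernel (and all prior CUDA calls) completed
cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost); // a synchronous copy would also have waited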

Asynchrony Rules in CUDA

  • All CUDA kernel invocations are asynchronous
    • Kernel just scheduled on GPU
    • CPU continues executing
    • Synchronization through explicit sync routines or synchronizing MemCopy
  • However, multiple kernel invocations serialize by default
    • Each kernel waits for the completion of the previous one
    • A synchronizing MemCopy waits for all preceding kernels

CUDA Streams

  • Streams represent queues for kernel execution
    • Host CPU code places work into the stream/queue and continues
    • GPU schedules kernels from queue when resources are free
  • Overlapping
    • Operations within the same stream are ordered (FIFO) and cannot overlap
    • Operations in different streams are unordered and can overlap
  • Stream Management
    cudaStream_t stream;
    // Declares a stream handle

    cudaStreamCreate(&stream);
    // Allocates a stream

    cudaStreamSynchronize(stream);
    // Synchronizes host until work in stream has completed

    cudaStreamDestroy(stream);
    // Deallocates a stream
    
  • Kernel invocation on a particular stream
    name<<<blocks, threads, shared_mem_bytes, stream>>>(params);
    

CUDA Stream Example

  • Starting two kernels
    foo<<<blocks,threads>>>();
    foo<<<blocks,threads>>>();

  • Starting two kernels on two streams
    cudaStream_t stream1, stream2;

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    foo<<<blocks,threads,0,stream1>>>();
    foo<<<blocks,threads,0,stream2>>>();

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);

  • Starting two kernels with one extra stream
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    foo<<<blocks,threads>>>();            // default stream (stream 0)
    foo<<<blocks,threads,0,stream>>>();   // extra stream
    cudaStreamDestroy(stream);

  • The default stream (stream 0) synchronizes with all other streams
    for (int i=0; i<ns; i++) {
        cudaStreamCreate(&s[i]);

        kernel<<<1,64,0,s[i]>>>();

        // Stream 0 is the default stream
        kernel<<<1,1>>>();
    }
    

Per Stream Memory Copies

  • Synchronizing MemCopy (synchronizes to stream 0)
    • cudaMemcpy (void* d, void* s, size_t c, cudaMemcpyKind kind)
  • Async MemCopy
    • cudaMemcpyAsync (void* d, void* s, size_t c, cudaMemcpyKind kind, cudaStream_t stream = 0 )
  • Async copy requires pinned host memory
    • Memory not pageable, i.e., can be safely transferred at any time
    • Special allocation/deallocation calls
      • cudaMallocHost(...) / cudaHostAlloc(...)
      • cudaFreeHost(...)
    • Calls to pin/unpin memory
      • cudaHostRegister(...) / cudaHostUnregister(...)
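
A minimal sketch of pinned host memory with an asynchronous copy (for illustration only; stream, d_x, d_y, nbytes, and nblocks are reused from the earlier examples):

double *h_x;
cudaMallocHost(&h_x, nbytes);                     // pinned (page-locked) host allocation
// ... fill h_x on the host ...
cudaMemcpyAsync(d_x, h_x, nbytes, cudaMemcpyHostToDevice, stream);
daxpy<<<nblocks, 256, 0, stream>>>(n, 2.0, d_x, d_y);
cudaStreamSynchronize(stream);                    // wait before reusing or freeing h_x
cudaFreeHost(h_x);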

Streams Can Also Help with Transfer Concurrency

  • Synchronous
    cudaMemcpy(...);
    foo<<<...>>>();
    
  • Asynchronous Same Stream
    cudaMemcpyAsync(...,stream1);
    foo<<<...,stream1>>>();
    
  • Asynchronous Different Streams
    cudaMemcpyAsync(...,stream1);
    foo<<<...,stream2>>>();
    

Advanced CUDA Synchronization Topics

CUDA Pros & Cons

  • CUDA enables straightforward programming of GPUs
    • Direct mapping of computation to GPUs
    • Exposes parallelism
  • Drawback 1
    • Very low level
    • Manual mapping and optimization
    • Be aware of thread mapping and address alignments
  • Drawback 2
    • Bound to one vendor (NVIDIA)
    • But AMD is developing comparable support (HIP) for their GPUs
  • Drawback 3
    • One typically wants to use the host system as well, which is itself a parallel system
    • Requires a separate programming system

OpenMP Target

Alternative Example: OpenMP Target and Device Capabilities

Added in OpenMP 4.0, refined and extended since then

  • Host-centric execution model
    • Offloading target regions to target devices
    • Target region starts execution as a single thread on the device
  • Device is a generic concept
    • Can be any kind of accelerator (different compilers will support different hardware)
    • Can also be another CPU
  • Data environment on the device
    • Directives define the data environment
      • Host variables called original variables
      • Device variables called corresponding variables
      • Host and device variables may share storage
  • Map clauses define the mapping
    • to: copy from host to device
    • from: copy from device to host
    • tofrom: both directions

Target directive

// Creates a device data environment
// Executes the target region (structured-block) on the specified device
#pragma omp target device(0)
    Structured-block

// Synchronous
// The encountering thread waits until the device has finished
// Device clause
// Defines the target device

Target directive with Data Environment

// Variables referenced in target region
// Either are declared in the region
// Or have to be explicitly added to the data environment
#pragma omp target device(0),map(to:x,y)
    Structured-block

// map clause
// to, from, tofrom
// to
// value copied to device at the beginning of the block
// from
// value copied from device at the end of the block

Explicit Data Environment

#pragma omp target data device(0) map(to:x,y)
    Structured-block
// Defines the data environment on a device
// Exists while the structured block is executed on the host
// if(scalar-expression) clause
// If it evaluates to false, the device is the host.

#pragma omp target update device(0) to(x,y)
// Standalone directive (no associated structured block)
// Update corresponding or original variable
// from: GPU->CPU
// to: CPU->GPU
// A list item can only appear in a to or from clause, but not both.
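
A hedged example (assembled here for illustration) combining both directives: the target data region keeps x and y on the device across two target regions, and target update refreshes y on the host in between.

#pragma omp target data map(to:x[0:n]) map(tofrom:y[0:n])
{
    #pragma omp target
    #pragma omp parallel for
    for (int i=0; i<n; ++i)
        y[i] = a*x[i] + y[i];

    // Copy the current y back without closing the device data environment
    #pragma omp target update from(y[0:n])
    // ... inspect y on the host ...

    #pragma omp target
    #pragma omp parallel for
    for (int i=0; i<n; ++i)
        y[i] = a*x[i] + y[i];
}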

Declare Variables and Functions for Target

// Specifies (static) variables and functions to be mapped to a device
#pragma omp declare target
    block of definitions
#pragma omp end declare target

// Function:
// A device specific version is created by the compiler
// 
// Variable:
// Corresponding variable created in the initial device data environment
//     - For all devices
//     - Initialized variable: Value copied to the device.
//     
// Additional clauses available to map variables to specific devices
/* Example: daxpy */

#pragma omp declare target
void daxpy(int n, double a, double *x, double *y){
    #pragma omp parallel for
    for (int i=0; i<n; ++i)
        y[i]=a*x[i]+y[i];
}
#pragma omp end declare target

...

#pragma omp target map(to:x) map(tofrom:y)
    daxpy(n,2.0,x,y);

Programming Challenges

  • OpenMP unifies host and accelerator programming in one model
    • Still have to manage target device and host code separately
    • Offers less fine-grained control than CUDA
  • CUDA and OpenMP require careful memory management
    • Data transfers are expensive
    • Keep data on GPU as long as possible
      • Across kernel invocations
    • Requires careful scheduling of kernels and data transfers
  • Higher level programming abstractions exist
    • Can hide use of accelerators
    • Often combined with automatic code generation
    • Performance or generality is often problematic
    • Example: CuPy, a NumPy-compatible array library with CUDA acceleration

Beyond the Single GPU or Accelerator

  • Goal: exploit the entire node
    • Possibly multiple GPUs
    • Host itself is typically a large NUMA system
  • Schedule work across all HW threads on CPU and GPU
    • Neither CUDA nor OpenMP provides explicit support for this
    • Manual chunking and load distribution
    • Load balancing to compensate for heterogeneous performance
  • GPUs (Accelerators) in a larger HPC system
    • Must cover multiple nodes by combining with MPI
      • MPI processes run CUDA or OpenMP target/device code
      • Additional complication for load balancing
    • Additional complication: the number of GPUs does not equal the number of cores
      • Need to decide which MPI process handles which GPU
      • Common recommendation: one MPI process per GPU (see the sketch below)
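
A hedged sketch (not from the slides) of the “one MPI process per GPU” recommendation: each rank selects a device based on its node-local rank.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv){
    MPI_Init(&argc, &argv);

    // Determine the rank within the node via a shared-memory sub-communicator
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Map node-local ranks round-robin to the GPUs of this node
    int ngpus;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(local_rank % ngpus);

    // ... per-rank CUDA work and MPI communication ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}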