12 Accelerators and GPU Programming with CUDA
Background
- Performance scalability is increasingly limited
- Single-thread performance gains are limited
- Multicore scaling limited by memory bandwidth
- Alternative: growing number of nodes
- Technological limitations
- Floorspace
- Power, Energy, Heat
- Harder to use all transistors
- Question: Can we use transistors in a better way?
- Is general-purpose computing the right way?
- Idea: extra hardware to accelerate computation
- Specialized operations
- No longer capable of (easily) sustaining general operations
- “Hosted” by a (general-purpose) host system
- Impact on programmability
Examples of Accelerators
GPU Architecture
Graphics Processing Units (GPUs)
- Originally developed to support fast graphics displays
- Triangle drawing, shading, texture mapping, …
- Paved the way for high-end graphics workstations
- Evolution
- Originally hardwired pipeline
- Increasingly programmable over the years
- Led to usage of GPUs for general programming
- Shader programming
- “Re-use” of languages/libraries like OpenGL
- First fully programmable GPUs around 2006
- Aka General-Purpose GPUs (GPGPUs)
- NVIDIA introduced CUDA to program GPUs
- Today most GPUs are programmable
- From AMD GPUs to integrated SoCs as on the Raspberry Pi
- Programming varies and is often low-level
Potential of GPUs
Concepts of a GPU
- Key concept: vector / SIMD processing
- Massively parallel processing in a single chip
- Advantages:
- Reduces complexity of the processor
- Increases energy efficiency
- Disadvantages:
- Reduced flexibility (must vectorize problem)
- Additional programming complexity
- GPUs are typically designed hierarchically
- Example in the following based on NVIDIA
- Other GPUs (AMD, Intel, …) are similar
Architectural Diagram of NVIDIA’s Ampere µ-Arch
Compute hierarchy
- SM = Streaming Multiprocessor
- TPC = Texture Processing Cluster
- GPC = Graphics Processing Cluster
- Central (Shared) L2 Cache
- Connection to host via PCI Express 4.0
- Connection between GPUs via NVLink 3.0
- Multiple memory controllers
- HBM 2 = High Bandwidth Memory 2
- Gigathread Engine with MIG Control
- Schedule kernels to GPC/TPC/SM
- Enables sharing of GPU
- Further advanced in successor generations
Streaming Multiprocessor in Ampere µ-Arch
- 32 FP64 CUDA cores
- 64 FP32 CUDA cores
- 64 INT32 CUDA cores
- 4 3rd Gen Tensor Cores
- (FP16/FP32/FP64, INT4/8, BF16/TF32, and binary)
- 4x 64 KB register file (32-bit registers)
- 192 KB shared memory / L1 data cache (combined)
- Thread blocks as basic scheduling unit
- Mapped to warps for execution: a fixed number of threads sharing a single instruction counter
NVIDIA’s Ampere µ-Arch: Tensor Core
- Programmable matrix-multiply and accumulate (MMA Instructions)
- Useful for AI/Deep Learning and HPC applications
- New in Ampere: Full FP64 (double) support
NVIDIA’s Ampere µ-Arch: Sparsity
NVIDIA’s Ampere µ-Arch: Memory Hierarchy
- Registers: CUDA threads have their own registers
- L0 instruction cache: private per processing block
- L1 I-cache: >48 KB instruction / constant cache
- Unified L1 data cache and shared memory
- 192 KB (up to 164 KB shared memory)
- Small, but very high bandwidth / low latency
- Global L2 cache
- High Bandwidth Memory (HBM)
- 6 HBM2 stacks, up to 48 GB
- 5120-bit interface, 2x 6 memory controllers, 1555 GB/s total bandwidth
Integration
System Integration
NVIDIA’s DGX-1 Architecture
- “Supercomputer in a Box”
- Intended for Deep Learning
- 8 P100 GPUs with 16GB DRAM each
- Xeon host CPUs
- Peak at 170 TFlop/s (at FP16)
NVIDIA’s DGX A100 Architecture
- Similar concept as in the DGX-1, but with 8 A100-SXM4-40G GPUs and NVSwitch
IBM Summit System @ ORNL
Connecting an Accelerator
- Specialized hardware attached to a host
- Most common (as of now): PCIe
- Add-on cards
- Access and data transfer via the PCIe bus
- Advantage: easy to integrate
- Disadvantage: PCIe can be a bottleneck
- Alternative: direct processor connection
- Needs new interfaces
- Removes bottlenecks
- One step further: CPU integration
- Direct access from within the CPU
- Either fully integrated or with packaged dies
- External booster concept
Memory
- Generally, GPUs have a separate memory
- Requires explicit data transfers, which complicates programming
- Transfers are typically expensive through PCI Express
- Newer GPU generations support a unified memory option (see the sketch below)
- Shared address space between CPU and GPU
- Allows direct access
- Note: not necessarily symmetric performance
- Requires logical integration
- The AMD Fusion approach integrates CPU and GPU in one package
- NVIDIA provided a software version based on transparent paging
- Newer NVIDIA systems feature a new HW link: NVLink (1 and 2)
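As a minimal sketch of the unified memory option (assuming a CUDA-capable system; the kernel `scale` and all sizes are illustrative, not taken from the slides):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: scales every vector element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data;

    // Unified (managed) memory: one pointer usable from host and device.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();            // wait before the host touches the data again

    printf("data[0] = %f\n", data[0]);  // expect 2.0
    cudaFree(data);
    return 0;
}
```

Note that access performance is not necessarily symmetric: the runtime migrates pages between host and device on demand.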
CUDA Programming
- Differences to CPU programming
- Accelerators have no OS
- Computation in the form of kernels for the target architectures
- Requires a host program running on the host system
- Tasks
- Definition and programming of a kernel
- Setting up a data environment
- Dispatching a kernel
- Synchronization
- Collect results
Offloading of Kernels

Common Options
CUDA
- Targets NVIDIA GPUs
- Language extensions on top of C/C++
- Combined with runtime
- AMD version: HIP
OpenCL for any (supported) device
- Targets any kind of device
- Language extension on top of C/C++
- Combined with runtime
OpenMP
- Extensions in OpenMP since version 4.0
- General extension for any accelerator
- Accelerators are seen as “devices” and “targets”
OpenACC
- Specifically targets GPUs
- Grew out of OpenMP
- Focus on loop parallelism
Kernel Programming with NVIDIA‘s CUDA
- Base concept: CUDA thread
- Used to define all forms of parallelism in the GPU
- Thread blocks: groups of CUDA threads
- Executed by a single Streaming Multiprocessor
- Mapped to warps for SIMD and multithreaded execution
- NVIDIA classifies CUDA as Single Instruction, Multiple Thread (SIMT)
- Thread blocks are required to execute independently and in any order
- Different thread blocks cannot communicate directly
- Can be coordinated using atomic memory operations in the GPU memory
- CUDA programs (host and kernel) are compiled with a separate compiler: nvcc
- Includes language additions as well as access to CUDA runtime functions
- The ISA (Instruction Set Architecture) for CUDA-capable NVIDIA GPUs is PTX (Parallel Thread Execution)
Code and Data for the GPU
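The slide’s code is not reproduced above; as a minimal sketch (the kernel name `vecAdd` and all sizes are illustrative), a kernel is a function marked `__global__` that runs on the device, with one CUDA thread per data element:

```cpp
// Kernel: executed on the GPU; each CUDA thread handles one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end
        c[i] = a[i] + b[i];
}
```

The device data it operates on is allocated with cudaMalloc() and filled via cudaMemcpy(), as sketched under “Kernel Invocation” below.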
Kernel Invocation
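Again only a hedged sketch (buffer names are illustrative): the host sets up the data environment, dispatches the kernel with an execution configuration `<<<grid, block>>>`, and collects the results.

```cpp
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n);  // defined above

void launch_vecAdd(const float *h_a, const float *h_b, float *h_c, int n) {
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);

    // Set up the data environment on the device.
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Dispatch the kernel: a grid of thread blocks, 256 threads per block.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);

    // Collect results; this cudaMemcpy waits for the kernel to finish.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```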

Host Synchronization
- All kernel launches are asynchronous
- Control returns to CPU immediately
- Kernel executes after all previous CUDA calls have completed
- cudaMemcpy() is synchronous
- Control returns to CPU after copy completes
- Copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
- Blocks until all previous CUDA calls complete
cudaDeviceSynchronize()
- Blocks until all previous calls on the current device have completed (sketched below)
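A short hedged sketch of these rules (kernel and helper names are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void kernel1(float *data) { /* illustrative device work */ }
void do_independent_host_work() { /* CPU work that can overlap with the kernel */ }

void sync_example(float *h_data, float *d_data, size_t bytes) {
    // Kernel launches are asynchronous: control returns to the CPU immediately.
    kernel1<<<64, 256>>>(d_data);
    do_independent_host_work();        // overlaps with the running kernel

    // cudaMemcpy() is synchronous: it starts after all previous CUDA calls
    // (including kernel1) have completed and returns only when the copy is done.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // Explicit barrier: block the host until the device has finished everything.
    cudaDeviceSynchronize();
}
```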
Asynchrony Rules in CUDA
- All CUDA kernel invocations are asynchronous
- Kernel just scheduled on GPU
- CPU continues executing
- Synchronization through explicit sync routines or a synchronizing MemCopy

- However, multiple kernel invocations serialize by default
- Each kernel waits for the completion of the previous one
- A synchronizing MemCopy waits for all kernels

CUDA Streams
- Streams represent queues for kernel execution
- Host CPU code places work into the stream/queue and continues
- GPU schedules kernels from queue when resources are free
- Overlapping
- Operations within the same stream are ordered (FIFO) and cannot overlap
- Operations in different streams are unordered and can overlap
- Stream Management
- Kernel invocation on particular stream
CUDA Stream Example
- Starting two kernels
- Starting two kernels on two streams (sketched below)
- Starting two kernels with one extra stream
- The default stream (stream 0) synchronizes with all streams
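The slide code itself is not reproduced here; a minimal sketch of the two-streams case (kernel names are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float *x) { /* illustrative work */ }
__global__ void kernelB(float *y) { /* illustrative work */ }

void two_streams(float *d_x, float *d_y) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same stream: kernels are ordered (FIFO) and cannot overlap.
    // Different streams: unordered, may run concurrently if resources allow.
    kernelA<<<64, 256, 0, s1>>>(d_x);   // 4th launch parameter selects the stream
    kernelB<<<64, 256, 0, s2>>>(d_y);

    // Wait for both streams before reusing the data on the host.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```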
Per Stream Memory Copies
- Synchronizing MemCopy (synchronizes to stream 0)
cudaMemcpy (void* d, void* s, size_t c, cudaMemcpyKind kind)
- Async MemCopy
cudaMemcpyAsync (void* d, void* s, size_t c, cudaMemcpyKind kind, cudaStream_t stream = 0 )
- Async copy requires pinned host memory (see the sketch below)
- Memory is not pageable, i.e., it can be safely transferred at any time
- Special allocation/deallocation calls
cudaMallocHost(...) / cudaHostAlloc(...)
cudaFreeHost(...)
- Calls to pin/unpin memory
cudaHostRegister(...) / cudaHostUnregister(...)
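A hedged sketch combining pinned host memory with asynchronous copies on a stream (names and sizes are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void process(float *d, int n) { /* illustrative device work */ }

void async_copy_example(int n) {
    float *h_buf, *d_buf;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned (page-locked) host memory: required for truly asynchronous copies.
    cudaMallocHost(&h_buf, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));

    // Copy in, compute, copy out: all queued on the same stream (FIFO order);
    // the host continues immediately after enqueuing the three operations.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);   // wait before the host reads h_buf

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
}
```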
Streams Can Also Help Transfer Concurrency
- Synchronous

- Asynchronous, Same Stream

- Asynchronous, Different Streams
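A sketch of the “different streams” variant (the chunking scheme and kernel are illustrative): the input is split into chunks so the copy of one chunk can overlap with the kernel working on another; the host buffer must be pinned.

```cpp
#include <cuda_runtime.h>

__global__ void work(float *d, int n) { /* illustrative per-chunk kernel */ }

void pipelined(float *h_pinned, float *d_buf, int n, int nChunks) {
    int chunk = n / nChunks;              // assume n divides evenly
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t s = streams[c % 2];  // alternate between two streams
        int off = c * chunk;

        // Copies for chunk c can overlap with the kernel of the previous chunk,
        // since consecutive chunks are placed in different streams.
        cudaMemcpyAsync(d_buf + off, h_pinned + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        work<<<(chunk + 255) / 256, 256, 0, s>>>(d_buf + off, chunk);
        cudaMemcpyAsync(h_pinned + off, d_buf + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();              // wait for all chunks

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}
```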

Advanced CUDA Synchronization Topics
- Explicit Synchronization
- Synchronize everything:
cudaDeviceSynchronize()
- Synchronize one stream:
cudaStreamSynchronize(stream)
- Synchronize using events (see the event sketch below)
- Stream Priorities
- High priority streams will preempt lower priority streams
cudaDeviceGetStreamPriorityRange(&low, &high)
- Lower number is higher priority
- Create using special API
cudaStreamCreateWithPriority(&stream, flags, priority)
- Stream Callbacks
- Registration of host-side callbacks with respect to a stream
- Invocation when tasks in the stream are complete
- Direct Assembler: PTX assembler code can be embedded into CUDA-C code
- More information
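Since the slides only point to events, here is a hedged sketch of event-based synchronization between two streams (kernel and variable names are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void produce(float *d) { /* illustrative producer kernel */ }
__global__ void consume(float *d) { /* illustrative consumer kernel */ }

void event_sync(float *d_buf) {
    cudaStream_t s1, s2;
    cudaEvent_t done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&done);

    produce<<<64, 256, 0, s1>>>(d_buf);
    cudaEventRecord(done, s1);            // mark a point in stream s1

    // Stream s2 waits (on the device, not the host) until the event fires.
    cudaStreamWaitEvent(s2, done, 0);
    consume<<<64, 256, 0, s2>>>(d_buf);

    cudaStreamSynchronize(s2);            // host waits only at the very end

    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```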
CUDA Pros & Cons
- CUDA enables straightforward programming of GPUs
- Direct mapping of computation to GPUs
- Exposes parallelism
- Drawback 1
- Very low level
- Manual mapping and optimization
- Be aware of thread mapping and address alignments
- Drawback 2
- Bound to one vendor (NVIDIA)
- But AMD is developing 3rd-party support for their GPUs
- Drawback 3
- Want to use the host system as well, which is also a parallel system
- Requires a separate programming system
OpenMP Target
Alternative Example: OpenMP Target and Device Capabilities
Added in OpenMP 4.0, refined and extended since then
Host-centric execution model
- Offloading target regions to target devices
- Target region starts execution as a single thread on the device
- Device is a generic concept
- Can be any kind of accelerator (different compilers will support different hardware)
- Can also be another CPU
- Data environment on the device
- Directives define the data environment
- Host variables are called original variables
- Device variables are called corresponding variables
- Host and device variables may share storage
- Directives also define the mapping:
- to: copy from host to device
- from: copy from device to host
- tofrom: both
Target directive
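A minimal hedged sketch of a bare target construct (array and sizes are illustrative): the region is offloaded to the default device and starts there as a single thread; the statically sized array is mapped tofrom implicitly.

```cpp
#include <cstdio>

#define N 8

int main() {
    int v[N];

    // Offloaded to the default device; executed there by a single initial thread.
    #pragma omp target
    for (int i = 0; i < N; ++i)
        v[i] = i * i;

    printf("v[3] = %d\n", v[3]);   // 9, copied back by the implicit tofrom map
    return 0;
}
```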
Target directive with Data Environment
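A hedged sketch with an explicit data environment on the target construct itself (names are illustrative); to and from control the copy direction:

```cpp
#include <cstdio>

#define N 1024

int main() {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0 * i; }

    // to:   copy host -> device on entry
    // from: copy device -> host on exit
    #pragma omp target map(to: a, b) map(from: c)
    #pragma omp teams distribute parallel for
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);   // 30.0
    return 0;
}
```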
Explicit Data Environment
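A hedged sketch of an explicit, longer-lived data environment (target data region; names are illustrative) that keeps the arrays on the device across two kernels:

```cpp
#define N 1024

void two_kernels(double *a, double *b) {
    // Create the device data environment once; a and b stay resident on the
    // device across both target regions inside the structured block.
    #pragma omp target data map(tofrom: a[0:N]) map(to: b[0:N])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; ++i)
            a[i] += b[i];

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; ++i)
            a[i] *= 2.0;
    }   // a is copied back to the host here
}
```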
Declare Variables and Functions for Target
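A hedged sketch of declaring a global variable and a function for the device (names are illustrative), so they can be referenced from target regions:

```cpp
#define N 256

// Compile device versions of this variable and function.
#pragma omp declare target
int offset = 10;
int add_offset(int x) { return x + offset; }
#pragma omp end declare target

void run(int *v) {
    #pragma omp target map(tofrom: v[0:N])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < N; ++i)
        v[i] = add_offset(i);   // calls the device version of add_offset
}
```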
Programming Challenges
- OpenMP unifies host and accelerator programming in one model
- Still have to manage target device and host code separately
- Not as fine-grained control as with CUDA
- CUDA and OpenMP require careful memory management
- Data transfers are expensive
- Keep data on the GPU as long as possible
- Across kernel invocations
- Requires careful scheduling of kernels and data transfers
- Higher level programming abstractions exist
- Can hide use of accelerators
- Often combined with automatic code generation
- Performance or generality is often problematic
- CuPy: NumPy-compatible matrix library with CUDA acceleration
Beyond the Single GPU or Accelerator
- Goal: exploit the entire node
- Possibly multiple GPUs
- Host itself is typically a large NUMA system
- Schedule work across all HW threads on CPU and GPU
- Neither CUDA nor OpenMP provides explicit support for this
- Manual chunking and load distribution
- Load balancing to compensate for heterogeneous performance
- GPUs (Accelerators) in a larger HPC system
- Must cover multiple nodes by combining with MPI
- MPI processes run CUDA or OpenMP target/device code
- Additional complication for load balance
- Additional complication: #GPUs not equal #cores
- Need to decide which MPI process handles a GPU
- Recommendation is often: one MPI process for each GPU (sketched below)
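As a hedged sketch of that recommendation (assuming MPI plus CUDA; splitting the communicator by shared-memory node is one common way to obtain a node-local rank):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Determine a node-local rank by splitting the communicator per node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Bind this MPI process to one of the node's GPUs.
    int num_gpus;
    cudaGetDeviceCount(&num_gpus);
    cudaSetDevice(local_rank % num_gpus);

    // ... per-process CUDA work on the selected GPU ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```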