
[ENCCS Webinar]
Practical Introduction to
GPU Programming


Yonglei Wang (ENCCS / NSC@LiU)


GPU & HPC


What is a GPU?

  • A GPU is a specialized electronic circuit.
  • Originally developed for computer graphics and image processing.
  • Now evolved into a general-purpose accelerator for massively parallel computing.

[Images: NVIDIA RTX 4090 & H200; AMD MI300 series & Intel Arc series]

Top 10 HPCs

Ref: TOP500 list released on Nov. 18, 2024


HPCs in EU

[Images: EuroHPC AI Factories ecosystem; EuroHPC quantum infrastructure]

GPU Architecture


CPU vs. GPU

Ref: GPU Programming: When, Why and How?


GPU architecture

[Images: SIMD and GPU SIMT execution models]
  • Computer architecture is characterized by four categories according to Flynn’s taxonomy:
    • Single instruction stream, single data stream (SISD)
    • Single instruction stream, multiple data streams (SIMD)
    • Multiple instruction streams, single data stream (MISD)
    • Multiple instruction streams, multiple data streams (MIMD)
  • GPUs are based on the Single Instruction, Multiple Threads (SIMT) model, a thread-oriented extension of SIMD (a minimal sketch below)
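A minimal CUDA sketch of SIMT (kernel name and sizes are illustrative): every thread executes the same kernel code, but selects its own data element through its thread index.

```cuda
// SIMT: each thread runs the same instruction stream on its own element.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n)            // threads whose index falls outside the array do nothing
        x[i] = a * x[i];
}
```

One consequence of SIMT worth remembering: threads of a warp that take different branches are executed one branch at a time (divergence), so uniform control flow is faster.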

Nvidia GPU architecture

[Image: Nvidia GPU architecture]
  • Nvidia (microarchitectures): Tesla (2006), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020), Ada Lovelace (2022), Hopper (2022), and Blackwell (2024)
  • GPU > GPC > TPC > SM > CUDA core + Tensor core
  • Each SM has its own L1 cache; the L2 cache is shared by all SMs on the GPU (a query sketch below)
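The hierarchy above can be inspected at runtime with the CUDA runtime API; a minimal sketch (fields are from the standard cudaDeviceProp struct):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the SM count and cache sizes of device 0.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("GPU: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("L2 cache: %d KB\n", prop.l2CacheSize / 1024);
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    return 0;
}
```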

Nvidia GPU architecture

[Images: RTX 3090 block diagram; SM memory hierarchy]
  • Ampere GPU (RTX 3090) has 7 GPCs, 42 TPCs, and 84 SMs
  • RT (Ray Tracing) cores are dedicated to performing the ray-tracing rendering math computation
  • Each SM has L1 cache (up to 128 KB), and L2 (up to 6144 KB) cache is shared among GPCs

Nvidia GPU architecture

[Images: thread batching; hardware model]
  • SIMT enables programmers to achieve thread-level parallelism on SMs
  • Threads are grouped into cooperative thread arrays (CTAs, i.e. thread blocks), which can be indexed in 1D, 2D, or 3D; each CTA is scheduled onto an SM
  • GPU memory is hierarchical (a sketch follows this list)
    • registers and local memory at thread level
    • shared memory at block level
    • global memory at grid level
    • constant and texture memory are two specialized memory types
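A minimal sketch showing where these memory levels appear in a kernel (assumes a launch with 256 threads per block; names are illustrative):

```cuda
// Per-block partial sums: registers (thread level), shared memory
// (block level), and global memory (grid level) all appear here.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                  // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;            // v lives in a register
    tile[threadIdx.x] = v;
    __syncthreads();                             // make the tile visible block-wide

    // Tree reduction inside the block, entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];               // global memory: visible grid-wide
    // (constant/texture memory are separate read-optimized paths, not used here)
}
```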

AMD GPU architecture

[Images: node-level architecture of an MI250 system; MI250 package block diagram]
  • LUMI, one of the EuroHPC JU machines, features AMD MI250 GPUs.
  • Left: node-level architecture of a system based on the AMD Instinct MI250 accelerator.
    • Each package has two GCDs (GPU compute dies).
    • The MI250 OAMs attach to the host system via PCIe Gen 4 x16 links (yellow lines).
  • Right: block diagram of a package consisting of two GCDs.

Intel GPU architecture

[Images: Intel Xe HPC architecture]
  • Intel Xe HPC Core: 8 matrix math units and 8 vector math units
  • Xe HPC Cores are combined with ray-tracing units into an Xe HPC Slice
  • Xe HPC Core > Xe HPC Slice > Xe HPC Stack > Xe HPC Link
  • The Xe GPU family consists of a series of microarchitectures
    • integrated/low power (Xe-LP)
    • high performance gaming (Xe-HPG)
    • datacenter/high performance (Xe-HP)
    • high performance computing (Xe-HPC)

Comparison of NVIDIA, AMD, and Intel GPUs

[Image: comparison of NVIDIA, AMD, and Intel GPUs]
  • GPU vendors pursue different strategies to meet market needs.
  • GPUs aimed at AI research and HPC differ in throughput for single and mixed precision, in memory capacity and bandwidth, and in FP64 compute capability (used in scientific computing).
  • Notable advancements include integrated memory in AMD and Intel GPUs, multi-die GPU packages from AMD, and stackable designs from Intel.

GPU Programming Models


GPU compute APIs

  • Standard C/C++ & Fortran programming
  • Directive-based models
    • OpenACC
    • OpenMP offloading
  • Non-portable kernel-based models
    • CUDA for Nvidia GPU
    • HIP for AMD GPU
    • oneAPI for Intel GPU
  • Portable kernel-based models
    • SYCL, OpenCL, Alpaka, Kokkos, etc.
  • High-level programming languages
    • Python
    • Julia

Standard C/C++ & Fortran programming

[Images: standard C++ & Fortran parallelism examples]
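The slide's code images did not survive conversion; as a stand-in, a minimal sketch of standard-language parallelism, assuming C++17 parallel algorithms (compilers such as nvc++ with -stdpar can offload this to a GPU):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    // Standard C++: the execution policy requests parallel (possibly
    // offloaded) execution with no GPU-specific syntax at all.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [](float a, float b) { return 2.0f * a + b; });  // y = 2x + y
    return 0;
}
```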

Directive-based models

Serial code is annotated with directives that tell the compiler to run specific loops and regions on the GPU.

Two representative directive-based programming models are OpenACC and OpenMP offloading (a minimal offloading sketch follows).

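A minimal OpenMP offloading sketch (function and array names are illustrative; compile flags vary by compiler, e.g. -mp=gpu for nvc):

```cpp
// The pragma asks the compiler to run the loop on the GPU and to map
// x to the device and y both to and from the device.
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```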

Overview of OpenACC

  • OpenACC is not a separate GPU programming language; it is based on compiler directives.
  • OpenACC allows you to express parallelism in your (C/C++ & Fortran) code.
  • OpenACC can be used with both Nvidia and AMD GPUs.
    • Use a single source code across different GPU vendors without modification.
  • Key directives for OpenACC
    • Compute constructs:
      • parallel and kernels - used to create a parallel region.
    • Loop constructs:
      • loop, collapse, gang, worker, vector, etc. - used to efficiently distribute threads over work-sharing tasks.
    • Data management clauses:
      • copy, create, copyin, copyout, delete, and present - for managing data transfer between host and device.
    • Other constructs:
      • reduction, atomic, cache, etc. - for special operations that would otherwise slow down or break parallel computation.
  • For more information about OpenACC directives, refer to the OpenACC specification.

Key directives for OpenACC

[Images: OpenACC kernels construct in C and Fortran]
  • Compute constructs: parallel vs. kernels (a combined sketch below)
  • The kernels construct in C/C++ & Fortran
  • `nvc -fast -Minfo=all -acc=gpu -gpu=cc80 Hello_World.c`
  • `nvfortran -fast -Minfo=all -acc=gpu Hello_World_OpenACC.f90`
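A minimal sketch contrasting the two compute constructs (loops are illustrative; with kernels, -Minfo=all reports what the compiler chose to parallelize):

```cpp
void demo(int n, float *x)
{
    // parallel: the programmer asserts the loop is safe to parallelize.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;

    // kernels: the compiler analyzes the region and decides for itself
    // which loops to turn into GPU kernels.
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        x[i] += 1.0f;
}
```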

Key directives for OpenACC

[Images: OpenACC vector addition in C and Fortran]
  • Loop and data clauses (a vector-addition sketch follows this list)
  • copyin(list) - Allocates memory on GPU and copies data from CPU (host) to GPU when entering a region.
  • copyout(list) - Allocates memory on GPU and copies data from GPU to CPU (host) when exiting a region.
  • copy(list) - Allocates memory on GPU, copies data from CPU (host) to GPU when entering a region, and copies data back to CPU (host) when exiting a region.
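A minimal vector-addition sketch using these clauses (names are illustrative):

```cpp
void vec_add(int n, const float *a, const float *b, float *c)
{
    // copyin: host->device on entry; copyout: device->host on exit.
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```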

Non-portable kernel-based models

  • Developers write low-level code that communicates directly with the GPU hardware.
  • Two representative programming models are CUDA and HIP.
[Image: CUDA C example]
  • The qualifier __global__ defines a device kernel
  • Calling device functions from the main program:
    • C/C++ example: c_function()
    • CUDA example: cuda_function<<<1,1>>>()
    • <<< >>> specifies the number of thread blocks and threads per block used to launch the kernel (a sketch below)
  • Make sure to synchronize threads
    • __syncthreads() - synchronizes all threads within a thread block
    • cudaDeviceSynchronize() - blocks the host until the preceding kernel call has completed
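A minimal sketch of a kernel definition, launch, and synchronization (block/thread counts are illustrative):

```cuda
#include <cstdio>

__global__ void hello()   // __global__: runs on the device, callable from the host
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();          // launch 2 thread blocks of 4 threads each
    cudaDeviceSynchronize();    // block the host until the kernel finishes
    return 0;
}
```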

CUDA example: Vector addition

[Images: CUDA vector addition code]

Source: Vector_Addition.cu
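The code images are not preserved here; the following is a minimal sketch of what a Vector_Addition.cu program typically contains (not the webinar's exact source):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and fill host arrays.
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device arrays and copy inputs host -> device.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back and check one element.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);   // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```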


Performance and profiling

[Image: matrix addition timing example]

Using CUDA APIs, we can measure the time taken to execute CUDA kernel functions.
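A hedged sketch using CUDA events, inserted into the host code of the earlier vector-addition sketch (vec_add and the device pointers are those illustrative names):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                         // enqueue 'start' before the kernel
vec_add<<<blocks, threads>>>(da, db, dc, n);
cudaEventRecord(stop);                          // enqueue 'stop' after the kernel
cudaEventSynchronize(stop);                     // wait until 'stop' has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```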


Performance and profiling

[Image: CUDA profiling tools]
  • Nvidia provides profiling tools to measure traces and events of CUDA applications.
    • Nsight Compute: profiles kernel calls; both the visual GUI and the command-line interface can be used to inspect kernel-level profiling information.
    • Nsight Graphics: for profiling graphics applications, with GUI-based analysis of the results.
    • Nsight Systems: provides system-wide profiling
      • e.g., when the application mixes CPU and GPU programming models (MPI, OpenMP, and CUDA).
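For example (options vary by tool version, and `my_app` is a placeholder for your executable): `nsys profile ./my_app` records a system-wide timeline, while `ncu ./my_app` collects detailed per-kernel metrics.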

Non-portable kernel-based models

Hipification is the process of converting CUDA code to HIP, enabling code to run on both AMD and NVIDIA GPUs.

  • HIP code uses similar keywords to CUDA code, making it easy to port simple CUDA kernels to HIP by changing a few library-specific calls (see the sketch after this list).
    • Kernel launches
      • cuda_function<<<blocks, threads>>>()
      • hip_function<<<blocks, threads>>>()
    • Synchronization
      • cudaDeviceSynchronize()
      • hipDeviceSynchronize()
  • HIP provides tools (hipify-perl or hipify-clang) to convert CUDA syntax to HIP syntax.
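A hedged sketch of what hipification changes in practice (names are illustrative; the kernel body is typically untouched and only runtime-API prefixes change):

```cpp
#include <hip/hip_runtime.h>

// The kernel is identical to its CUDA counterpart.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void run(float *da, float *db, float *dc, int n, int blocks, int threads)
{
    vec_add<<<blocks, threads>>>(da, db, dc, n);  // same launch syntax as CUDA
    hipDeviceSynchronize();                       // was: cudaDeviceSynchronize()
    // Likewise: hipMalloc/hipMemcpy replace cudaMalloc/cudaMemcpy.
}
```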

Portable kernel-based models

  • Cross-platform portability ecosystems typically provide a higher-level abstraction layer that offers a convenient and portable programming model for GPU programming.
  • For C++, the most notable cross-platform portability ecosystems are Alpaka, Kokkos, OpenCL, and SYCL (a SYCL sketch follows this list).
  • Pros and cons of cross-platform portability ecosystems
    • Pros
      • The amount of code duplication is minimized
      • Less knowledge of underlying architecture is needed for initial porting (Kokkos, SYCL)
    • Cons
      • These models are relatively new and not very popular yet
      • Limited learning resources compared to CUDA & HIP
      • Some low-level APIs and separate-source kernel models are less user friendly
  • References and code examples
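As one example of the single-source style, a minimal SYCL 2020 vector addition using unified shared memory (a sketch assuming a SYCL 2020 toolchain such as DPC++ or AdaptiveCpp):

```cpp
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue q;                               // selects a default device
    const int n = 1 << 20;
    float *a = sycl::malloc_shared<float>(n, q); // visible to host and device
    float *b = sycl::malloc_shared<float>(n, q);
    float *c = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // The lambda is the kernel; it runs once per index on the device.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}
```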

Comparison of GPU compute APIs

| API | Portability | Ease of Use | Performance | Primary Vendor | Best For |
| --- | --- | --- | --- | --- | --- |
| OpenACC | Medium (NVIDIA & AMD) | High (directive-based) | Medium-High | NVIDIA, AMD (limited) | Scientific computing and HPC with minimal code modifications |
| OpenMP Offloading | High (cross-vendor) | Medium (directive-based) | High | Intel, AMD, NVIDIA | Parallelizing CPU and GPU workloads using OpenMP pragmas for performance portability |
| CUDA | Low (NVIDIA only) | Medium | Very High | NVIDIA | High-performance compute on NVIDIA GPUs |
| HIP | Medium (AMD & NVIDIA) | Medium | High | AMD, NVIDIA (via hipify) | Porting CUDA applications to AMD GPUs with minimal code changes |
| SYCL | High (cross-vendor) | Medium | High | Intel, AMD, NVIDIA | Heterogeneous computing with single-source C++ |
| OpenCL | High (cross-vendor) | Medium | Medium | Cross-vendor | General-purpose parallel programming across multiple hardware architectures |

Python libraries for GPU programming

| Library | Best For | Supports |
| --- | --- | --- |
| Numba.cuda | General CUDA GPU programming | NVIDIA |
| CuPy | NumPy-like GPU acceleration | NVIDIA |
| PyCUDA | Low-level CUDA programming | NVIDIA |
| PyOpenCL | Cross-vendor GPU programming | NVIDIA, AMD, Intel |
| SYCL (dpctl) | Cross-platform parallelism | Intel, AMD, NVIDIA |
| TensorFlow | Deep learning | NVIDIA, AMD |
| PyTorch | Machine learning | NVIDIA, AMD |
| OpenMP (Numba) | CPU & GPU parallelism | Intel, AMD, NVIDIA |

Python libraries for AI research

| API/Framework | Primary Use | GPU Support | Multi-GPU Support | Best For |
| --- | --- | --- | --- | --- |
| TensorFlow | DL, neural networks, custom ML models | NVIDIA (CUDA), AMD (ROCm) | Yes, using MirroredStrategy or DistributionStrategy | General-purpose ML and DL |
| PyTorch | DL and neural networks | NVIDIA (CUDA), AMD (ROCm) | Yes, using DataParallel and DistributedDataParallel | Research and production DL, flexibility |
| Hugging Face (Transformers) | NLP with transformers | NVIDIA, AMD (via PyTorch or TensorFlow) | Yes, via PyTorch or TensorFlow | Pre-trained transformer models for NLP tasks |
| Keras | High-level neural network API | NVIDIA (CUDA), AMD (ROCm) | Yes, via TensorFlow | Simplified DL with a TensorFlow backend |
| JAX | High-performance ML and scientific computing | NVIDIA (CUDA), AMD (ROCm), Intel (oneAPI) | Yes, using XLA for GPU acceleration | Numerical computing, DL with high performance |
| RAPIDS | GPU-accelerated data science and ML | NVIDIA | Yes, via Dask and cuML | Data science with cuML, cuDF, and Dask |

Julia libraries for AI research

  • Julia is a high-performance, dynamic programming language designed for numerical scientific computing.
  • It was designed specifically to solve the two-language problem.
    • Interpreted languages like Python and R translate instructions line by line.
    • Compiled languages like C/C++ and Fortran are translated by compiler prior to running code.
    • Julia provides both high performance and ease of use in a single language.
  • Key Features of Julia
    • Just-In-Time (JIT) compilation to achieve high performance
    • Simple & expressive syntax (like Python)
    • Powerful for numerical & scientific computing with built-in parallel computing support
    • Interoperability with other programming languages
[Image: key features of Julia]
| Library | Purpose | Similar to | GPU Support? |
| --- | --- | --- | --- |
| Flux.jl | Deep learning | PyTorch, TensorFlow | ✅ Yes (CUDA) |
| MLJ.jl | Machine learning | scikit-learn | ✅ Yes |
| Turing.jl | Probabilistic modeling, Bayesian learning | PyMC3, Stan | ✅ Yes |
| AlphaZero.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |
| Knet.jl | Deep learning, fast GPU AI research | PyTorch | ✅ Yes |
| ReinforcementLearning.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |

ENCCS
Lesson materials &
Training events


Lesson materials & Seasonal training events

[Image: ENCCS lesson materials]

Nvidia bootcamps

[Image: N-Ways to GPU Programming bootcamp]

Take-home message

  • GPU & HPC
  • GPU architecture
  • GPU programming models
    • GPU compute APIs
    • Standard C/C++ & Fortran programming
    • Directive-based models
      • OpenACC, OpenMP offloading
    • Non-portable kernel-based models
      • CUDA, HIP
    • Portable kernel-based models
      • SYCL, OpenCL, Alpaka, Kokkos, etc.
    • High-level programming languages
      • Python, Julia
  • Lesson materials
  • Training events & Nvidia bootcamps