[ENCCS Webinar]
Practical Introduction to
GPU Programming
Yonglei Wang (ENCCS / NSC@LiU)
GPU & HPC
What is a GPU?
[Figure: NVIDIA RTX 4090 & H200]
[Figure: AMD MI300 series vs. Intel Arc series]
HPC systems in the EU
GPU Architecture
NVIDIA GPU architecture
AMD GPU architecture
Intel GPU architecture
Comparison of NVIDIA, AMD, and Intel GPUs
GPU Programming Models
GPU compute APIs
Standard C/C++ & Fortran programming
Directive-based models
Serial code is annotated with directives that tell the compiler to run specific loops and regions on the GPU.
Two representative directive-based programming models are OpenACC and OpenMP offloading.
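As a minimal sketch of the idea, assuming a SAXPY routine (the function and variable names are illustrative, not from the slides), a single OpenMP directive offloads the loop to the GPU:

```c
#include <stdio.h>

#define N 1024

/* SAXPY (y = a*x + y): one OpenMP directive offloads the loop to the GPU.
   The map clauses control host <-> device data movement. */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(N, 2.0f, x, y);
    printf("y[0] = %f\n", y[0]);  /* expect 4.0 */
    return 0;
}
```

If the compiler does not support offloading, the directive is simply ignored and the loop runs serially on the CPU, which is what makes directive-based models attractive for incremental porting.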
Overview of OpenACC
- `parallel` and `kernels` - used to create a parallel region.
- `loop`, `collapse`, `gang`, `worker`, `vector`, etc. - designed to efficiently allocate threads for work-sharing tasks.
- `copy`, `create`, `copyin`, `copyout`, `delete`, and `present` - for managing data transfer between host and device.
- `reduction`, `atomic`, `cache`, etc. - for special operations that prevent slowing down parallel computation.

Key directives for OpenACC
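A minimal sketch combining `parallel loop` with the data clauses listed above (vector addition; names are illustrative, not from the slides):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* copyin: host -> device on entry; copyout: device -> host on exit */
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f\n", c[0]);  /* expect 3.0 */
    return 0;
}
```

Such code is compiled with the NVIDIA HPC SDK compilers, for example: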
```bash
# -acc=gpu enables OpenACC offloading; -gpu=cc80 targets
# compute capability 8.0 devices (e.g. A100)
nvc -fast -Minfo=all -acc=gpu -gpu=cc80 Hello_World.c
nvfortran -fast -Minfo=all -acc=gpu Hello_World_OpenACC.f90
```
Non-portable kernel-based models
- `__global__` - defines a device kernel.
- `__syncthreads()` - synchronizes all threads within a thread block.
- `cudaDeviceSynchronize()` - synchronizes a kernel call on the host.
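A minimal CUDA sketch using these constructs (vector addition; names are illustrative). `__syncthreads()` is only needed when threads in a block share intermediate results, so it does not appear in this simple elementwise kernel:

```cuda
#include <cstdio>

// __global__ marks a device kernel; each thread handles one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory, for brevity
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // grid of 256-thread blocks
    cudaDeviceSynchronize();  // host waits for the kernel to finish

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```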
Performance and profiling
Using CUDA APIs, we can measure the time taken to execute CUDA kernel functions.
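Continuing the `vecAdd` sketch above (the kernel name and launch configuration are carried over from it), the CUDA event API brackets the launch with timestamped events:

```cuda
// Measure kernel execution time with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // block the host until the stop event completes

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```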
Non-portable kernel-based models
Hipification is the process of converting CUDA code to HIP, enabling the same code to run on both AMD and NVIDIA GPUs; an illustrative workflow is sketched below.
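For example, with the `hipify-perl` tool shipped with ROCm (file names here are hypothetical), the translation is largely mechanical because HIP mirrors the CUDA runtime API:

```bash
# Translate a CUDA source file to HIP; typical substitutions are
# cudaMalloc -> hipMalloc, cudaMemcpy -> hipMemcpy,
# cudaDeviceSynchronize -> hipDeviceSynchronize
hipify-perl vecAdd.cu > vecAdd.hip.cpp

# Compile the result with the HIP compiler driver
hipcc vecAdd.hip.cpp -o vecAdd
```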
Portable kernel-based models
Comparison of GPU compute APIs
API | Portability | Ease of Use | Performance | Primary Vendor | Best For |
---|---|---|---|---|---|
OpenACC | Medium (NVIDIA & AMD) | High (directive-based) | Medium-High | NVIDIA, AMD (limited) | Scientific computing and HPC with minimal code modifications |
OpenMP Offloading | High (Cross-vendor) | Medium (directive-based) | High | Intel, AMD, NVIDIA | Parallelizing CPU and GPU workloads using OpenMP pragmas for performance portability |
CUDA | Low (NVIDIA only) | Medium | Very High | NVIDIA | High-performance compute on NVIDIA GPUs |
HIP | Medium (AMD & NVIDIA) | Medium | High | AMD, NVIDIA (via Hipify) | Porting CUDA applications to AMD GPUs with minimal code changes |
SYCL | High (Cross-vendor) | Medium | High | Intel, AMD, NVIDIA | Heterogeneous computing with single-source C++ |
OpenCL | High (Cross-vendor) | Medium | Medium | Cross-vendor | General-purpose parallel programming across multiple hardware architectures |
Python libraries for GPU programming
Library | Best For | Supports |
---|---|---|
Numba.cuda | General CUDA GPU programming | NVIDIA |
CuPy | NumPy-like GPU acceleration | NVIDIA |
PyCUDA | Low-level CUDA programming | NVIDIA |
PyOpenCL | Cross-vendor GPU programming | NVIDIA, AMD, Intel |
SYCL (dpctl) | Cross-platform parallelism | Intel, AMD, NVIDIA |
TensorFlow | Deep learning | NVIDIA, AMD |
PyTorch | Machine learning | NVIDIA, AMD |
OpenMP (Numba) | CPU & GPU parallelism | Intel, AMD, NVIDIA |
Python libraries for AI research
API/Framework | Primary Use | GPU Support | Multi-GPU Support | Best For |
---|---|---|---|---|
TensorFlow | DL, neural networks, custom ML models | NVIDIA (CUDA), AMD (ROCm) | Yes, using MirroredStrategy or other tf.distribute strategies | General-purpose ML and DL |
PyTorch | DL and neural networks | NVIDIA (CUDA), AMD (ROCm) | Yes, using DataParallel and DistributedDataParallel | Research and production DL, flexibility |
Hugging Face (Transformers) | NLP with transformers | NVIDIA, AMD (via PyTorch or TensorFlow) | Yes, using PyTorch or TensorFlow for multi-GPU | Pre-trained transformer models for NLP tasks |
Keras | High-level neural networks API | NVIDIA (CUDA), AMD (ROCm) | Yes, using TensorFlow for multi-GPU | Simplified DL with TensorFlow backend |
JAX | High-performance ML and scientific computing | NVIDIA (CUDA), AMD (ROCm), Intel (oneAPI) | Yes, using XLA for GPU acceleration | Numerical computing, DL with high performance |
RAPIDS | GPU-accelerated data science and ML | NVIDIA GPUs | Yes, via Dask and cuML for ML tasks | Data science with cuML, cuDF, and Dask |
Julia libraries for AI research
Library | Purpose | Similar to | GPU Support? |
---|---|---|---|
Flux.jl | Deep learning | PyTorch, TensorFlow | ✅ Yes (CUDA) |
MLJ.jl | Machine learning | scikit-learn | ✅ Yes |
Turing.jl | Probabilistic modeling, Bayesian learning | PyMC3, Stan | ✅ Yes |
AlphaZero.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |
Knet.jl | Deep learning, Fast GPU AI research | PyTorch | ✅ Yes |
ReinforcementLearning.jl | Reinforcement learning | RLlib | ✅ Yes |
ENCCS
Lesson materials & Seasonal training events
Take-home message