
[ENCCS Webinar]
Practical Introduction to
GPU Programming


Yonglei Wang (ENCCS / NSC@LiU)


GPU & HPC


What is a GPU?

  • A GPU is a specialized electronic circuit.
  • Originally developed for computer graphics and image processing.
  • Now evolved into a general-purpose accelerator for massively parallel computing.

[Images: NVIDIA RTX 4090 & H200; AMD MI300 series & Intel Arc series]

Top 10 HPCs

Ref: TOP500 list released on Nov. 18, 2024


HPCs in EU

[Images: EuroHPC AI Factories ecosystem; EuroHPC quantum infrastructure]

GPU Architecture


CPU vs. GPU

Ref: GPU Programming: When, Why and How?


GPU architecture

[Images: SIMD and GPU SIMT execution models]
  • Computer architecture is characterized by four categories according to Flynn’s taxonomy:
    • Single instruction stream, single data stream (SISD)
    • Single instruction stream, multiple data streams (SIMD)
    • Multiple instruction streams, single data stream (MISD)
    • Multiple instruction streams, multiple data streams (MIMD)
  • GPUs are based on the Single Instruction, Multiple Threads (SIMT) model, a thread-oriented extension of SIMD (a minimal sketch below)
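A minimal CUDA sketch of SIMT (kernel name and sizes are illustrative): every thread executes the same kernel code, but selects its own data element through its thread index.

```cuda
// SIMT: each thread runs the same instruction stream on its own element.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n)            // threads whose index falls outside the array do nothing
        x[i] = a * x[i];
}
```

One consequence of SIMT worth remembering: threads of a warp that take different branches are executed one branch at a time (divergence), so uniform control flow is faster.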

Nvidia GPU architecture

[Image: Nvidia GPU architecture]
  • Nvidia (microarchitectures): Tesla (2006), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020), Ada Lovelace (2022), Hopper (2022), and Blackwell (2024)
  • GPU > GPC > TPC > SM > CUDA core + Tensor core
  • Each SM has its own L1 cache; the L2 cache is shared by all SMs on the GPU (a query sketch below)
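The hierarchy above can be inspected at runtime with the CUDA runtime API; a minimal sketch (fields are from the standard cudaDeviceProp struct):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the SM count and cache sizes of device 0.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("GPU: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("L2 cache: %d KB\n", prop.l2CacheSize / 1024);
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    return 0;
}
```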

Nvidia GPU architecture

[Images: RTX 3090 block diagram; SM memory hierarchy]
  • Ampere GPU (RTX 3090) has 7 GPCs, 42 TPCs, and 84 SMs
  • RT (Ray Tracing) cores are dedicated to performing the ray-tracing rendering math computation
  • Each SM has L1 cache (up to 128 KB), and L2 (up to 6144 KB) cache is shared among GPCs

Nvidia GPU architecture

[Images: thread batching; hardware model]
  • SIMT enables programmers to achieve thread-level parallelism on SMs
  • Threads are grouped into cooperative thread arrays (CTAs, i.e. thread blocks), which can be indexed in 1D, 2D, or 3D; each CTA is scheduled onto an SM
  • GPU memory is hierarchical (a sketch follows this list)
    • registers and local memory at thread level
    • shared memory at block level
    • global memory at grid level
    • constant and texture memory are two specialized memory types
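A minimal sketch showing where these memory levels appear in a kernel (assumes a launch with 256 threads per block; names are illustrative):

```cuda
// Per-block partial sums: registers (thread level), shared memory
// (block level), and global memory (grid level) all appear here.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                  // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;            // v lives in a register
    tile[threadIdx.x] = v;
    __syncthreads();                             // make the tile visible block-wide

    // Tree reduction inside the block, entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];               // global memory: visible grid-wide
    // (constant/texture memory are separate read-optimized paths, not used here)
}
```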

AMD GPU architecture

[Images: node-level architecture of an MI250 system; MI250 package block diagram]
  • LUMI, one of the EuroHPC JU machines, features AMD MI250 GPUs.
  • Left: node-level architecture of a system based on the AMD Instinct MI250 accelerator.
    • Each package has two GCDs (GPU compute dies).
    • The MI250 OAMs attach to the host system via PCIe Gen 4 x16 links (yellow lines).
  • Right: block diagram of a package consisting of two GCDs.

Intel GPU architecture

[Images: Intel Xe HPC architecture]
  • Intel Xe HPC Core: 8 matrix math units and 8 vector math units
  • Xe HPC Cores are combined with ray-tracing units into an Xe HPC Slice
  • Xe HPC Core > Xe HPC Slice > Xe HPC Stack > Xe HPC Link
  • The Xe GPU family consists of a series of microarchitectures
    • integrated/low power (Xe-LP)
    • high performance gaming (Xe-HPG)
    • datacenter/high performance (Xe-HP)
    • high performance computing (Xe-HPC)

Comparison of NVIDIA, AMD, and Intel GPUs

[Image: comparison of NVIDIA, AMD, and Intel GPUs]
  • GPU vendors pursue different strategies to meet market needs.
  • GPUs aimed at AI research and HPC differ in throughput for single and mixed precision, in memory capacity and bandwidth, and in FP64 compute capability (used in scientific computing).
  • Notable advancements include integrated memory in AMD and Intel GPUs, multi-die GPU packages from AMD, and stackable designs from Intel.

GPU Programming Models


GPU compute APIs

  • Standard C/C++ & Fortran programming
  • Directive-based models
    • OpenACC
    • OpenMP offloading
  • Non-portable kernel-based models
    • CUDA for Nvidia GPU
    • HIP for AMD GPU
    • oneAPI for Intel GPU
  • Portable kernel-based models
    • SYCL, OpenCL, Alpaka, Kokkos, etc.
  • High-level programming languages
    • Python
    • Julia

Standard C/C++ & Fortran programming

[Images: standard C++ & Fortran parallelism examples]
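The slide's code images did not survive conversion; as a stand-in, a minimal sketch of standard-language parallelism, assuming C++17 parallel algorithms (compilers such as nvc++ with -stdpar can offload this to a GPU):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    // Standard C++: the execution policy requests parallel (possibly
    // offloaded) execution with no GPU-specific syntax at all.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [](float a, float b) { return 2.0f * a + b; });  // y = 2x + y
    return 0;
}
```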

Directive-based models

Serial code is annotated with directives that tell the compiler to run specific loops and regions on the GPU.

Two representative directive-based programming models are OpenACC and OpenMP offloading (a minimal offloading sketch follows).

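A minimal OpenMP offloading sketch (function and array names are illustrative; compile flags vary by compiler, e.g. -mp=gpu for nvc):

```cpp
// The pragma asks the compiler to run the loop on the GPU and to map
// x to the device and y both to and from the device.
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```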

Overview of OpenACC

  • OpenACC is not a separate GPU programming language; it is based on compiler directives.
  • OpenACC allows you to express parallelism in your (C/C++ & Fortran) code.
  • OpenACC can be used with both Nvidia and AMD GPUs.
    • Use a single source code across different GPU vendors without modification.
  • Key directives for OpenACC
    • Compute constructs:
      • parallel and kernels - used to create a parallel region.
    • Loop constructs:
      • loop, collapse, gang, worker, vector, etc. - used to efficiently distribute threads over work-sharing tasks.
    • Data management clauses:
      • copy, create, copyin, copyout, delete, and present - for managing data transfer between host and device.
    • Other constructs:
      • reduction, atomic, cache, etc. - for special operations that would otherwise slow down or break parallel computation.
  • For more information about OpenACC directives, refer to the OpenACC specification.

Key directives for OpenACC

[Images: OpenACC kernels construct in C and Fortran]
  • Compute constructs: parallel vs. kernels (a combined sketch below)
  • The kernels construct in C/C++ & Fortran
  • `nvc -fast -Minfo=all -acc=gpu -gpu=cc80 Hello_World.c`
  • `nvfortran -fast -Minfo=all -acc=gpu Hello_World_OpenACC.f90`
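A minimal sketch contrasting the two compute constructs (loops are illustrative; with kernels, -Minfo=all reports what the compiler chose to parallelize):

```cpp
void demo(int n, float *x)
{
    // parallel: the programmer asserts the loop is safe to parallelize.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;

    // kernels: the compiler analyzes the region and decides for itself
    // which loops to turn into GPU kernels.
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        x[i] += 1.0f;
}
```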

Key directives for OpenACC

[Images: OpenACC vector addition in C and Fortran]
  • Loop and data clauses (a vector-addition sketch follows this list)
  • copyin(list) - Allocates memory on GPU and copies data from CPU (host) to GPU when entering a region.
  • copyout(list) - Allocates memory on GPU and copies data from GPU to CPU (host) when exiting a region.
  • copy(list) - Allocates memory on GPU, copies data from CPU (host) to GPU when entering a region, and copies data back to CPU (host) when exiting a region.
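A minimal vector-addition sketch using these clauses (names are illustrative):

```cpp
void vec_add(int n, const float *a, const float *b, float *c)
{
    // copyin: host->device on entry; copyout: device->host on exit.
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```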

Non-portable kernel-based models

  • Developers write low-level code that communicates directly with the GPU hardware.
  • Two representative programming models are CUDA and HIP.
[Image: CUDA C example]
  • The qualifier __global__ defines a device kernel
  • Calling device functions from the main program:
    • C/C++ example: c_function()
    • CUDA example: cuda_function<<<1,1>>>()
    • <<< >>> specifies the number of thread blocks and threads per block used to launch the kernel (a sketch below)
  • Make sure to synchronize threads
    • __syncthreads() - synchronizes all threads within a thread block
    • cudaDeviceSynchronize() - blocks the host until the preceding kernel call has completed
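A minimal sketch of a kernel definition, launch, and synchronization (block/thread counts are illustrative):

```cuda
#include <cstdio>

__global__ void hello()   // __global__: runs on the device, callable from the host
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();          // launch 2 thread blocks of 4 threads each
    cudaDeviceSynchronize();    // block the host until the kernel finishes
    return 0;
}
```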

CUDA example: Vector addition

[Images: CUDA vector addition code]

Source: Vector_Addition.cu
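The code images are not preserved here; the following is a minimal sketch of what a Vector_Addition.cu program typically contains (not the webinar's exact source):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and fill host arrays.
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device arrays and copy inputs host -> device.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back and check one element.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);   // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```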


Performance and profiling

[Image: matrix addition timing example]

Using CUDA APIs, we can measure the time taken to execute CUDA kernel functions.
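A hedged sketch using CUDA events, inserted into the host code of the earlier vector-addition sketch (vec_add and the device pointers are those illustrative names):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                         // enqueue 'start' before the kernel
vec_add<<<blocks, threads>>>(da, db, dc, n);
cudaEventRecord(stop);                          // enqueue 'stop' after the kernel
cudaEventSynchronize(stop);                     // wait until 'stop' has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```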


Performance and profiling

[Image: CUDA profiling tools]
  • Nvidia provides profiling tools to measure traces and events of CUDA applications.
    • Nsight Compute: profiles kernel calls; both the visual GUI and the command-line interface can be used to inspect kernel-level profiling information.
    • Nsight Graphics: for profiling graphics applications, with GUI-based analysis of the results.
    • Nsight Systems: provides system-wide profiling
      • e.g., when the application mixes CPU and GPU programming models (MPI, OpenMP, and CUDA).
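For example (options vary by tool version, and `my_app` is a placeholder for your executable): `nsys profile ./my_app` records a system-wide timeline, while `ncu ./my_app` collects detailed per-kernel metrics.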

Non-portable kernel-based models

Hipification is the process of converting CUDA code to HIP, enabling code to run on both AMD and NVIDIA GPUs.

  • HIP code uses similar keywords to CUDA code, making it easy to port simple CUDA kernels to HIP by changing a few library-specific calls (see the sketch after this list).
    • Kernel launches
      • cuda_function<<<blocks, threads>>>()
      • hip_function<<<blocks, threads>>>()
    • Synchronization
      • cudaDeviceSynchronize()
      • hipDeviceSynchronize()
  • HIP provides tools (hipify-perl or hipify-clang) to convert CUDA syntax to HIP syntax.
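A hedged sketch of what hipification changes in practice (names are illustrative; the kernel body is typically untouched and only runtime-API prefixes change):

```cpp
#include <hip/hip_runtime.h>

// The kernel is identical to its CUDA counterpart.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void run(float *da, float *db, float *dc, int n, int blocks, int threads)
{
    vec_add<<<blocks, threads>>>(da, db, dc, n);  // same launch syntax as CUDA
    hipDeviceSynchronize();                       // was: cudaDeviceSynchronize()
    // Likewise: hipMalloc/hipMemcpy replace cudaMalloc/cudaMemcpy.
}
```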

Portable kernel-based models

  • Cross-platform portability ecosystems typically provide a higher-level abstraction layer that offers a convenient and portable programming model for GPU programming.
  • For C++, the most notable cross-platform portability ecosystems are Alpaka, Kokkos, OpenCL, and SYCL (a SYCL sketch follows this list).
  • Pros and cons of cross-platform portability ecosystems
    • Pros
      • The amount of code duplication is minimized
      • Less knowledge of underlying architecture is needed for initial porting (Kokkos, SYCL)
    • Cons
      • These models are relatively new and not very popular yet
      • Limited learning resources compared to CUDA & HIP
      • Some low-level APIs and separate-source kernel models are less user friendly
  • References and code examples
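As one example of the single-source style, a minimal SYCL 2020 vector addition using unified shared memory (a sketch assuming a SYCL 2020 toolchain such as DPC++ or AdaptiveCpp):

```cpp
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue q;                               // selects a default device
    const int n = 1 << 20;
    float *a = sycl::malloc_shared<float>(n, q); // visible to host and device
    float *b = sycl::malloc_shared<float>(n, q);
    float *c = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // The lambda is the kernel; it runs once per index on the device.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}
```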

Comparison of GPU compute APIs

| API | Portability | Ease of Use | Performance | Primary Vendor | Best For |
| --- | --- | --- | --- | --- | --- |
| OpenACC | Medium (NVIDIA & AMD) | High (directive-based) | Medium-High | NVIDIA, AMD (limited) | Scientific computing and HPC with minimal code modifications |
| OpenMP Offloading | High (cross-vendor) | Medium (directive-based) | High | Intel, AMD, NVIDIA | Parallelizing CPU and GPU workloads using OpenMP pragmas for performance portability |
| CUDA | Low (NVIDIA only) | Medium | Very High | NVIDIA | High-performance compute on NVIDIA GPUs |
| HIP | Medium (AMD & NVIDIA) | Medium | High | AMD, NVIDIA (via hipify) | Porting CUDA applications to AMD GPUs with minimal code changes |
| SYCL | High (cross-vendor) | Medium | High | Intel, AMD, NVIDIA | Heterogeneous computing with single-source C++ |
| OpenCL | High (cross-vendor) | Medium | Medium | Cross-vendor | General-purpose parallel programming across multiple hardware architectures |

Python libraries for GPU programming

| Library | Best For | Supports |
| --- | --- | --- |
| Numba.cuda | General CUDA GPU programming | NVIDIA |
| CuPy | NumPy-like GPU acceleration | NVIDIA |
| PyCUDA | Low-level CUDA programming | NVIDIA |
| PyOpenCL | Cross-vendor GPU programming | NVIDIA, AMD, Intel |
| SYCL (dpctl) | Cross-platform parallelism | Intel, AMD, NVIDIA |
| TensorFlow | Deep learning | NVIDIA, AMD |
| PyTorch | Machine learning | NVIDIA, AMD |
| OpenMP (Numba) | CPU & GPU parallelism | Intel, AMD, NVIDIA |

Python libraries for AI research

| API/Framework | Primary Use | GPU Support | Multi-GPU Support | Best For |
| --- | --- | --- | --- | --- |
| TensorFlow | DL, neural networks, custom ML models | NVIDIA (CUDA), AMD (ROCm) | Yes, using MirroredStrategy or DistributionStrategy | General-purpose ML and DL |
| PyTorch | DL and neural networks | NVIDIA (CUDA), AMD (ROCm) | Yes, using DataParallel and DistributedDataParallel | Research and production DL, flexibility |
| Hugging Face (Transformers) | NLP with transformers | NVIDIA, AMD (via PyTorch or TensorFlow) | Yes, via PyTorch or TensorFlow | Pre-trained transformer models for NLP tasks |
| Keras | High-level neural network API | NVIDIA (CUDA), AMD (ROCm) | Yes, via TensorFlow | Simplified DL with a TensorFlow backend |
| JAX | High-performance ML and scientific computing | NVIDIA (CUDA), AMD (ROCm), Intel (oneAPI) | Yes, using XLA for GPU acceleration | Numerical computing, DL with high performance |
| RAPIDS | GPU-accelerated data science and ML | NVIDIA | Yes, via Dask and cuML | Data science with cuML, cuDF, and Dask |

Julia libraries for AI research

  • Julia is a high-performance, dynamic programming language designed for numerical scientific computing.
  • It was designed specifically to solve the two-language problem.
    • Interpreted languages like Python and R translate instructions line by line.
    • Compiled languages like C/C++ and Fortran are translated by compiler prior to running code.
    • Julia provides both high performance and ease of use in a single language.
  • Key Features of Julia
    • Just-In-Time (JIT) compilation to achieve high performance
    • Simple & expressive syntax (like Python)
    • Powerful for numerical & scientific computing with built-in parallel computing support
    • Interoperability with other programming languages
[Image: key features of Julia]
| Library | Purpose | Similar to | GPU Support? |
| --- | --- | --- | --- |
| Flux.jl | Deep learning | PyTorch, TensorFlow | ✅ Yes (CUDA) |
| MLJ.jl | Machine learning | scikit-learn | ✅ Yes |
| Turing.jl | Probabilistic modeling, Bayesian learning | PyMC3, Stan | ✅ Yes |
| AlphaZero.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |
| Knet.jl | Deep learning, fast GPU AI research | PyTorch | ✅ Yes |
| ReinforcementLearning.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |

ENCCS
Lesson materials &
Training events


Lesson materials & Seasonal training events

[Image: ENCCS lesson materials]

Nvidia bootcamps

[Image: N-Ways to GPU Programming bootcamp]

Take-home message

  • GPU & HPC
  • GPU architecture
  • GPU programming models
    • GPU compute APIs
    • Standard C/C++ & Fortran programming
    • Directive-based models
      • OpenACC, OpenMP offloading
    • Non-portable kernel-based models
      • CUDA, HIP
    • Portable kernel-based models
      • SYCL, OpenCL, Alpaka, Kokkos, etc.
    • High-level programming languages
      • Python, Julia
  • Lesson materials
  • Training events & Nvidia bootcamps