---
title: Intro to GPU Programming
tags: presentation
slideOptions:
  theme: 'black'
  margin: 0.08
  transition: 'fade'
  center: true
---
<style>
.reveal {
font-size: 32px;
}
</style>
<p style="text-align: center"><b>
<font size=6 color=blueyellow>[ENCCS Webinar]</font><br>
<font size=24 color=blueyellow>Practical Introduction to <br>GPU Programming</font>
</b></p>
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$
Yonglei Wang (ENCCS / NSC@LiU)
---
<p style="text-align: center"><b><font size=12 color=gold>GPU & HPC</font></b></p>
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>What is a GPU?</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- A GPU is a specialized electronic circuit.
- Originally developed for computer graphics and image processing.
- It has since evolved into a general-purpose accelerator for massively parallel computing.
*(Images: **NVIDIA RTX 4090** & **H200**)*
----
## <p style="text-align: center"><b><font size=6 color=blueyellow>What is a GPU?</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- A GPU is a specialized electronic circuit.
- Originally developed for computer graphics and image processing.
- It has since evolved into a general-purpose accelerator for massively parallel computing.
*(Images: **AMD MI300 series** vs. **Intel's ARC series**)*
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Top 10 HPCs</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
Ref: [TOP500 list released on Nov. 18, 2024](https://top500.org/lists/top500/2024/11/)
----
<p style="text-align: center"><b><font size=6 color=blueyellow>HPCs in EU</font></b></p>
---
<p style="text-align: center"><b><font size=12 color=gold>GPU Architecture</font></b></p>
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>CPU vs. GPU</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
Ref: [GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/)
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- Computer architecture is classified into four categories according to Flynn’s taxonomy:
  - Single instruction stream, single data stream (SISD)
  - Single instruction stream, multiple data streams (SIMD)
  - Multiple instruction streams, single data stream (MISD)
  - Multiple instruction streams, multiple data streams (MIMD)
- GPUs are based on ==Single Instruction Multiple Threads (SIMT)==
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- Nvidia (microarchitectures): Tesla (2006), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020), Ada Lovelace (2022), Hopper (2022), and Blackwell (2024)
- ==GPU --> GPC --> TPC --> SM --> CUDA core + Tensor core==
- Each SM has its own L1 cache, while the L2 cache is shared among SMs
----
## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- ==Ampere GPU== (RTX 3090) has 7 GPCs, 42 TPCs, and 84 SMs
- RT (Ray Tracing) cores are dedicated to performing the ray-tracing rendering math computation
- Each SM has an L1 cache (up to 128 KB), and the L2 cache (up to 6144 KB) is shared among GPCs
----
## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- ==SIMT== enables programmers to achieve thread-level parallelism in SMs
- An SM executes a scalable multi-threaded array, usually called a ==CTA== (cooperative thread array), which can organize threads in 1D, 2D, and 3D
- ==GPU has hierarchical memories== (see the sketch below)
  - registers and local memory at the thread level
  - shared memory at the block level
  - global memory at the grid level
  - constant and texture memory are two specialized memory types
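A minimal CUDA sketch of this hierarchy (the kernel name `scale` and the sizes are illustrative, not from the lesson): each thread keeps `i` and `v` in registers, stages data in block-level `__shared__` memory, and reads/writes grid-level global memory.

```cpp
#include <cstdio>

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256];                     // shared memory: block level
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i and v live in registers: thread level
    float v = (i < n) ? in[i] : 0.0f;               // read from global memory: grid level
    tile[threadIdx.x] = 2.0f * v;
    __syncthreads();                                // make the tile visible to the whole block
    if (i < n) out[i] = tile[threadIdx.x];          // write back to global memory
}

int main(void) {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));      // unified memory keeps the sketch short
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);                // expect 2.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```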
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>AMD GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- LUMI, one of the EuroHPC JU machines, features AMD MI250 GPUs.
- Left: node-level architecture of a system based on the AMD Instinct MI250 accelerator.
  - Each package has two GCDs (GPU compute dies).
  - The MI250 OAMs attach to the host system via PCIe Gen 4 x16 links (yellow lines).
- Right: block diagram of a package consisting of two GCDs.
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>Intel GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- Intel X$^e$ HPC Core: 8 matrix math units and 8 vector math units
- X$^e$ HPC Cores are combined with ray tracing units into an X$^e$ HPC Slice
- ==X$^e$ HPC Core --> X$^e$ HPC Slice --> X$^e$ HPC Stack --> X$^e$ HPC Link==
- The X$^e$ GPU family consists of a series of microarchitectures:
  - integrated/low power (X$^e$-LP)
  - high performance gaming (X$^e$-HPG)
  - datacenter/high performance (X$^e$-HP)
  - high performance computing (X$^e$-HPC)
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>Comparison of NVIDIA, AMD, and Intel GPUs</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- GPU vendors follow different strategies to produce GPUs for different market needs.
- GPUs for AI research and HPC differ in throughput for single and mixed precision, in memory capacity and bandwidth, and in the number of FP64 compute cores (used in scientific computing).
- Notable advancements include integrated memory in AMD and Intel GPUs, multiple GPU dies on a single package from AMD, and stackable designs from Intel.
---
<p style="text-align: center"><b><font size=12 color=gold>GPU Programming Models</font></b></p>
---
<p style="text-align: center"><b><font size=6 color=blueyellow>GPU compute APIs</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- **Standard C/C++ & Fortran programming**
- **Directive-based models**
  - OpenACC
  - OpenMP offloading
- **Non-portable kernel-based models**
  - CUDA for Nvidia GPUs
  - HIP for AMD GPUs
  - oneAPI for Intel GPUs
- **Portable kernel-based models**
  - SYCL, OpenCL, Alpaka, Kokkos, *etc.*
- **High-level programming languages**
  - Python
  - Julia
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Standard C/C++ & Fortran programming</font></b></p>
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Directive-based models</font></b></p>
<!-- .slide: style="font-size: 18px;" -->
The serial code is annotated with directives that tell the compiler to run specific loops and regions on the GPU.

Two representative directive-based programming models are [**OpenACC**](https://www.openacc.org/) and [**OpenMP offloading**](https://www.openmp.org/).
<div style="clear: both;">
<div style="margin-left 10em; margin-right 10em;">
<img src="https://hackmd.io/_uploads/rymWsewxJe.jpg" alt="" height=460x> <img src="https://hackmd.io/_uploads/S1GCigDe1g.jpg" alt="" height=460x>
</div>
</div>
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Overview of OpenACC</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- OpenACC is not a GPU programming language; it is based on compiler directives.
- OpenACC allows you to express parallelism in your (C/C++ & Fortran) code.
- OpenACC can be used with both Nvidia and AMD GPUs.
- Use a single source code across different GPU vendors without modification.
- Key directives for OpenACC
  - Compute constructs:
    - `parallel` and `kernels` - used to create a parallel region
  - Loop constructs:
    - `loop`, `collapse`, `gang`, `worker`, `vector`, etc. - designed to efficiently allocate threads for work-sharing tasks
  - Data management clauses:
    - `copy`, `create`, `copyin`, `copyout`, `delete`, and `present` - for managing data transfer between host and device
  - Other constructs:
    - `reduction`, `atomic`, `cache`, *etc.* - for special operations that would otherwise slow down parallel computation
- For more information about OpenACC directives, see the [OpenACC specification](https://www.openacc.org/sites/default/files/inline-files/OpenACC_2_0_specification.pdf).
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Key directives for OpenACC</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- **Compute constructs**: `parallel` vs. `kernels` (see the sketch below)
- ==Kernels in C/C++ & Fortran==
- `nvc -fast -Minfo=all -acc=gpu -gpu=cc80 Hello_World.c`
- `nvfortran -fast -Minfo=all -acc=gpu Hello_World_OpenACC.f90`
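A minimal C sketch contrasting the two compute constructs (the loop bodies are illustrative, not from the slides); it should compile with the `nvc` command above:

```c
#include <stdio.h>
#define N 1000000

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

    /* kernels: the compiler analyzes the region and decides what to parallelize */
    #pragma acc kernels
    for (int i = 0; i < N; i++) b[i] = 2.0f * a[i];

    /* parallel loop: the programmer asserts that the loop iterations are independent */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) b[i] += a[i];

    printf("b[1] = %f\n", b[1]);  /* expect 3.0 */
    return 0;
}
```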
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Key directives for OpenACC</font></b></p>
<!-- .slide: style="font-size: 22px;" -->
- **Loop and data clauses** (see the sketch below)
  - `copyin(list)` - allocates memory on the GPU and copies data from the host (CPU) to the GPU when entering a region
  - `copyout(list)` - allocates memory on the GPU and copies data from the GPU back to the host (CPU) when exiting a region
  - `copy(list)` - allocates memory on the GPU, copies data from the host to the GPU when entering a region, and copies the data back to the host when exiting a region
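A sketch of a structured data region (array names and sizes are illustrative): `a` is copied in once and `b` copied out once, so the two loops reuse the device copies instead of transferring data twice.

```c
#include <stdio.h>
#define N 1000000

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = 1.0f;

    #pragma acc data copyin(a[0:N]) copyout(b[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++) b[i] = 2.0f * a[i];

        #pragma acc parallel loop  /* reuses the device copies: no extra transfer */
        for (int i = 0; i < N; i++) b[i] += a[i];
    }
    printf("b[0] = %f\n", b[0]);  /* expect 3.0 */
    return 0;
}
```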
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Non-portable kernel-based models</font></b></p>
<!-- .slide: style="font-size: 22px;" -->
- Developers write low-level code that communicates directly with the GPU hardware.
- Two representative programming models are [CUDA](https://developer.nvidia.com/cuda-toolkit) and [HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/).
<!-- .slide: style="font-size: 18px;" -->
- A qualifier `__global__` defines a device kernel
- Calling device functions in the main program:
  - C/C++ example, ==c_function()==
  - CUDA example, ==cuda_function<<<1,1>>>()==
- ==<<< >>>== specifies the number of thread blocks and threads per block used to launch the kernel
- Make sure to synchronize threads (see the sketch below)
  - `__syncthreads()` - synchronizes all threads within a thread block
  - `cudaDeviceSynchronize()` - makes the host wait for a kernel call to finish
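Putting these pieces together, a minimal CUDA sketch (the kernel name is illustrative):

```cpp
#include <cstdio>

// __global__ marks a kernel: it runs on the device and is launched from the host.
__global__ void hello_kernel(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello_kernel<<<2, 4>>>();  // <<<blocks, threads>>>: 2 blocks of 4 threads each
    cudaDeviceSynchronize();   // host waits until the kernel has finished
    return 0;
}
```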
----
<p style="text-align: center"><b><font size=6 color=blueyellow>CUDA example: Vector addition</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
Source: [Vector_Addition.cu](https://enccs.github.io/gpu-programming/7-non-portable-kernel-models/#vector-addition)
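A condensed sketch in the spirit of the linked example (the original uses explicit `cudaMemcpy`; unified memory is used here for brevity):

```cpp
#include <cstdio>

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // one element per thread
}

int main(void) {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```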
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Performance and profiling</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
Using CUDA event APIs, we can measure the time taken to execute CUDA kernel functions, as sketched below.
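A fragment assuming the `vector_add` kernel and buffers from the previous slide; the `cudaEvent*` calls are the standard CUDA timing APIs:

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                        // enqueue 'start' before the kernel
vector_add<<<blocks, threads>>>(a, b, c, n);   // kernel to be timed
cudaEventRecord(stop);                         // enqueue 'stop' after the kernel

cudaEventSynchronize(stop);                    // wait until 'stop' has been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```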
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Performance and profiling</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- Nvidia provides profiling tools to measure traces and events of CUDA applications.
- ==Nsight Compute==: profiles kernel calls; both the *visual GUI* and the *command-line interface* can be used to inspect profiling information for kernel calls.
- ==Nsight Graphics==: useful for analyzing profiling results through a GUI.
- ==Nsight Systems==: provides system-wide profiling
  - *e.g.*, when the application mixes CPU and GPU programming (MPI, OpenMP, and CUDA).
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Non-portable kernel-based models</font></b></p>
==Hipification== is the process of converting CUDA code to HIP, enabling the code to run on both AMD and NVIDIA GPUs.
- HIP code uses similar keywords to CUDA code, making it easy to port simple CUDA kernels to HIP by changing a few library-specific calls.
  - Kernel launch:
    - `cuda_function<<<blocks, threads>>>()`
    - `hip_function<<<blocks, threads>>>()`
  - Synchronization:
    - `cudaDeviceSynchronize()`
    - `hipDeviceSynchronize()`
- HIP provides tools (`hipify-perl` or `hipify-clang`) to convert CUDA syntax to HIP syntax, as sketched below.
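For instance, `hipify-perl hello.cu > hello.hip.cpp` would produce HIP source much like this sketch (the kernel mirrors the CUDA example earlier; the file names are illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Same triple-chevron launch syntax as CUDA; only the runtime calls change.
__global__ void hello_kernel(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello_kernel<<<2, 4>>>();  // unchanged from the CUDA version
    hipDeviceSynchronize();    // hipDeviceSynchronize() replaces cudaDeviceSynchronize()
    return 0;
}
```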
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Portable kernel-based models</font></b></p>
<!-- .slide: style="font-size: 22px;" -->
- ==Cross-platform portability ecosystems== typically provide a higher-level abstraction layer that offers a convenient and portable programming model for GPU programming.
- For C++, the most notable cross-platform portability ecosystems are [**Alpaka**](https://github.com/alpaka-group/alpaka), [**Kokkos**](https://github.com/kokkos/kokkos), [**OpenCL**](https://www.khronos.org/opencl/), and [**SYCL**](https://www.khronos.org/sycl/).
- Pros and cons of cross-platform portability ecosystems
  - Pros
    - the amount of code duplication is minimized
    - less knowledge of the underlying architecture is needed for initial porting (Kokkos, SYCL)
  - Cons
    - these models are relatively new and not yet widely adopted
    - limited learning resources compared to CUDA & HIP
    - some low-level APIs and separate-source kernel models are less user friendly
- References and code examples (see the SYCL sketch below)
  - [Alpaka and openPMD workshop](https://www.hlrs.de/training/2024/alpaka-openpmd-hack)
  - [LAMMPS KOKKOS package](https://docs.lammps.org/Speed_kokkos.html)
  - [GROMACS package](https://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html)
  - ==[GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/)==
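As a taste of the single-source style, a minimal SYCL 2020 sketch of vector addition using unified shared memory (not taken from the lesson material):

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    sycl::queue q;  // picks a default device (a GPU if one is available)

    // Unified shared memory: pointers usable on both host and device.
    float *a = sycl::malloc_shared<float>(n, q);
    float *b = sycl::malloc_shared<float>(n, q);
    float *c = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Single-source kernel: the lambda body runs on the device.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}
```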
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Comparison of GPU compute APIs</font></b></p>
<!-- .slide: style="font-size: 18px;" -->
| API | Portability | Ease of Use | Performance | Primary Vendor | Best For |
| :--: | :--: | :--: | :--: | :--: | :--: |
| OpenACC | Medium (NVIDIA & AMD) | High (directive-based) | Medium-High | NVIDIA, AMD (limited) | Scientific computing and HPC with minimal code modifications |
| OpenMP Offloading | High (Cross-vendor) | Medium (directive-based) | High | Intel, AMD, NVIDIA | Parallelizing CPU and GPU workloads using OpenMP pragmas for performance portability |
| CUDA | Low (NVIDIA only) | Medium | Very High | NVIDIA | High-performance compute on NVIDIA GPUs |
| HIP | Medium (AMD & NVIDIA) | Medium | High | AMD, NVIDIA (via Hipify) | Porting CUDA applications to AMD GPUs with minimal code changes |
| SYCL | High (Cross-vendor) | Medium | High | Intel, AMD, NVIDIA | Heterogeneous computing with single-source C++ |
| OpenCL | High (Cross-vendor) | Medium | Medium | Cross-vendor | General-purpose parallel programming across multiple hardware architectures |
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Python libraries for GPU programming</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
| Library | Best For | Supports |
| :-----: | :------: | :------: |
| Numba.cuda | General CUDA GPU programming | NVIDIA |
| CuPy | NumPy-like GPU acceleration | NVIDIA |
| PyCUDA | Low-level CUDA programming | NVIDIA |
| PyOpenCL | Cross-vendor GPU programming | NVIDIA, AMD, Intel |
| SYCL (dpctl) | Cross-platform parallelism | Intel, AMD, NVIDIA |
| TensorFlow | Deep learning | NVIDIA, AMD |
| PyTorch | Machine learning | NVIDIA, AMD |
| OpenMP (Numba) | CPU & GPU parallelism | Intel, AMD, NVIDIA |
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Python libraries for AI research</font></b></p>
<!-- .slide: style="font-size: 18px;" -->
| API/Framework | Primary Use | GPU Support | Multi-GPU Support | Best For |
| :--: | :--: | :--: | :--: | :--: |
| TensorFlow | DL, neural networks, custom ML models | NVIDIA (CUDA), AMD (ROCm) | Yes, using Mirrored Strategy or Distribution Strategy | General-purpose ML and DL |
| PyTorch | DL and neural networks | NVIDIA (CUDA), AMD (ROCm) | Yes, using DataParallel and DistributedDataParallel | Research and production DL, flexibility |
| Hugging Face (Transformers) | NLP with transformers | NVIDIA, AMD (via PyTorch or TensorFlow) | Yes, using PyTorch or TensorFlow for multi-GPU | Pre-trained transformer models for NLP tasks |
| Keras | High-level neural networks API | NVIDIA (CUDA), AMD (ROCm) | Yes, using TensorFlow for multi-GPU | Simplified DL with TensorFlow backend |
| JAX | High-performance ML and scientific computing | NVIDIA (CUDA), AMD (ROCm), Intel (oneAPI) | Yes, using XLA for GPU acceleration | Numerical computing, DL with high performance |
| RAPIDS | GPU-accelerated data science and ML | NVIDIA GPUs | Yes, via Dask and cuML for ML tasks | Data science with cuML, cuDF, and Dask |
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Julia libraries for AI research</font></b></p>
- Julia is a high-performance, dynamic programming language designed for numerical scientific computing.
- It was designed specifically to solve the ==two-language problem==:
  - interpreted languages like Python and R translate instructions line by line
  - compiled languages like C/C++ and Fortran are translated by a compiler before the code runs
- Julia provides both high performance and ease of use in a single language.
- Key features of Julia
  - just-in-time (JIT) compilation for high-performance computing
  - simple & expressive syntax (like Python)
  - powerful numerical & scientific computing with built-in parallel computing support
  - interoperability with other programming languages
<!-- .slide: style="font-size: 16px;" -->
| Library | Purpose | Similar to | GPU Support? |
| :-----: | :-----: | :--------: | :----------: |
| Flux.jl | Deep learning | PyTorch, TensorFlow | ✅ Yes (CUDA) |
| MLJ.jl | Machine learning | scikit-learn | ✅ Yes |
| Turing.jl | Probabilistic modeling, Bayesian learning | PyMC3, Stan | ✅ Yes |
| AlphaZero.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |
| Knet.jl | Deep learning, Fast GPU AI research | PyTorch | ✅ Yes |
| ReinforcementLearning.jl | Reinforcement Learning | DeepMind AlphaZero | ✅ Yes |
---
<p style="text-align: center"><b><font size=12 color=gold>ENCCS <br>Lesson materials & <br>Training events</font></b></p>
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Lesson materials & Seasonal training events</font></b></p>
<!-- .slide: style="font-size: 20px;" -->
- Link: https://hackmd.io/@yonglei/mermaid-enccs-lesson
- Seasonal training events
  - GPU Programming / OpenACC-CUDA workshop
  - Practical machine/deep learning
  - Python/Julia HPDA/HPC
  - Best practice HPC training
  - Quantum autumn school
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia bootcamps</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- [N-Ways to GPU Programming, Apr. 8-9](https://enccs.se/events/bootcamp-n-days-gpu-programming/)
- [AI for Science, May 27-28](https://events.vsc.ac.at/event/186/)
- [Multi-GPU Programming, Jun. 17–18](https://events.vsc.ac.at/event/187/)
- [AI Multi-GPU Multi-Node Profiling, Jul. 9-10](https://events.vsc.ac.at/event/188/)
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Take-home message</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- GPU & HPC
- GPU architecture
- GPU programming models
  - GPU compute APIs
    - Standard C/C++ & Fortran programming
    - Directive-based models
      - OpenACC, OpenMP offloading
    - Non-portable kernel-based models
      - CUDA, HIP
    - Portable kernel-based models
      - SYCL, OpenCL, Alpaka, Kokkos, etc.
    - High-level programming languages
      - Python, Julia
- Lesson materials
- Training events & Nvidia bootcamps