---
title: Intro to GPU Programming
tags: presentation
slideOptions:
  theme: 'black'
  margin: 0.08
  transition: 'fade'
  center: true
---
<style>
.reveal {
font-size: 32px;
}
</style>
<p style="text-align: center"><b>
<font size=6 color=blueyellow>[ENCCS Webinar]</font><br>
<font size=24 color=blueyellow>Practical Introduction to <br>GPU Programming</font>
</b></p>
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$
Yonglei Wang (ENCCS / NSC@LiU)
---
<p style="text-align: center"><b><font size=12 color=gold>GPU & HPC</font></b></p>
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>What is a GPU?</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- A GPU is a specialized electronic circuit.
- Originally developed for computer graphics and image processing.
- It has since evolved into a general-purpose accelerator for massively parallel computing.
*(Images: **NVIDIA RTX 4090** & **H200**)*
----
## <p style="text-align: center"><b><font size=6 color=blueyellow>What is a GPU?</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- A GPU is a specialized electronic circuit.
- Originally developed for computer graphics and image processing.
- It has since evolved into a general-purpose accelerator for massively parallel computing.
*(Images: **AMD MI300 series** vs. **Intel's ARC series**)*
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Top 10 HPCs</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
Ref: [TOP500 list released on Nov. 18, 2024](https://top500.org/lists/top500/2024/11/)
----
<p style="text-align: center"><b><font size=6 color=blueyellow>HPCs in EU</font></b></p>
---
<p style="text-align: center"><b><font size=12 color=gold>GPU Architecture</font></b></p>
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>CPU vs. GPU</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
Ref: [GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/)
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- Computer architecture is classified into four categories according to Flynn’s taxonomy:
  - Single instruction stream, single data stream (SISD)
  - Single instruction stream, multiple data streams (SIMD)
  - Multiple instruction streams, single data stream (MISD)
  - Multiple instruction streams, multiple data streams (MIMD)
- GPUs are based on ==Single Instruction Multiple Threads (SIMT)==
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- Nvidia (microarchitectures): Tesla (2006), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020), Ada Lovelace (2022), Hopper (2022), and Blackwell (2024)
- ==GPU --> GPC --> TPC --> SM --> CUDA core + Tensor core==
- Each SM has its own L1 cache, while the L2 cache is shared among SMs
----
## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- ==Ampere GPU== (RTX 3090) has 7 GPCs, 42 TPCs, and 84 SMs
- RT (Ray Tracing) cores are dedicated to performing the ray-tracing rendering math computation
- Each SM has an L1 cache (up to 128 KB), and the L2 cache (up to 6144 KB) is shared among GPCs
----
## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- ==SIMT== enables programmers to achieve thread-level parallelism in SMs
- An SM executes a scalable multi-threaded array, usually called a ==CTA== (cooperative thread array), which can organize threads in 1D, 2D, and 3D
- ==GPU has hierarchical memories== (see the sketch below)
  - registers and local memory at the thread level
  - shared memory at the block level
  - global memory at the grid level
  - constant and texture memory are two specialized memory types
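A minimal CUDA sketch of this hierarchy (the kernel name `scale` and the sizes are illustrative, not from the lesson): each thread keeps `i` and `v` in registers, stages data in block-level `__shared__` memory, and reads/writes grid-level global memory.

```cpp
#include <cstdio>

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256];                     // shared memory: block level
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i and v live in registers: thread level
    float v = (i < n) ? in[i] : 0.0f;               // read from global memory: grid level
    tile[threadIdx.x] = 2.0f * v;
    __syncthreads();                                // make the tile visible to the whole block
    if (i < n) out[i] = tile[threadIdx.x];          // write back to global memory
}

int main(void) {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));      // unified memory keeps the sketch short
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);                // expect 2.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```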
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>AMD GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- LUMI, one of the EuroHPC JU machines, features AMD MI250 GPUs.
- Left: node-level architecture of a system based on the AMD Instinct MI250 accelerator.
  - Each package has two GCDs (GPU compute dies).
  - The MI250 OAMs attach to the host system via PCIe Gen 4 x16 links (yellow lines).
- Right: block diagram of a package consisting of two GCDs.
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>Intel GPU architecture</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- Intel X$^e$ HPC Core: 8 matrix math units and 8 vector math units
- X$^e$ HPC Cores are combined with ray tracing units into an X$^e$ HPC Slice
- ==X$^e$ HPC Core --> X$^e$ HPC Slice --> X$^e$ HPC Stack --> X$^e$ HPC Link==
- The X$^e$ GPU family consists of a series of microarchitectures:
  - integrated/low power (X$^e$-LP)
  - high performance gaming (X$^e$-HPG)
  - datacenter/high performance (X$^e$-HP)
  - high performance computing (X$^e$-HPC)
---
## <p style="text-align: center"><b><font size=6 color=blueyellow>Comparison of NVIDIA, AMD, and Intel GPUs</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- GPU vendors follow different strategies to produce GPUs for different market needs.
- GPUs for AI research and HPC differ in throughput for single and mixed precision, in memory capacity and bandwidth, and in the number of FP64 compute cores (used in scientific computing).
- Notable advancements include integrated memory in AMD and Intel GPUs, multiple GPU dies on a single package from AMD, and stackable designs from Intel.
---
<p style="text-align: center"><b><font size=12 color=gold>GPU Programming Models</font></b></p>
---
<p style="text-align: center"><b><font size=6 color=blueyellow>GPU compute APIs</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- **Standard C/C++ & Fortran programming**
- **Directive-based models**
  - OpenACC
  - OpenMP offloading
- **Non-portable kernel-based models**
  - CUDA for Nvidia GPUs
  - HIP for AMD GPUs
  - oneAPI for Intel GPUs
- **Portable kernel-based models**
  - SYCL, OpenCL, Alpaka, Kokkos, *etc.*
- **High-level programming languages**
  - Python
  - Julia
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Standard C/C++ & Fortran programming</font></b></p>
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Directive-based models</font></b></p>
<!-- .slide: style="font-size: 18px;" -->
The serial code is annotated with directives that tell the compiler to run specific loops and regions on the GPU.

Two representative directive-based programming models are [**OpenACC**](https://www.openacc.org/) and [**OpenMP offloading**](https://www.openmp.org/).
<div style="clear: both;">
<div style="margin-left 10em; margin-right 10em;">
<img src="https://hackmd.io/_uploads/rymWsewxJe.jpg" alt="" height=460x> <img src="https://hackmd.io/_uploads/S1GCigDe1g.jpg" alt="" height=460x>
</div>
</div>
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Overview of OpenACC</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- OpenACC is not a GPU programming language; it is based on compiler directives.
- OpenACC allows you to express parallelism in your (C/C++ & Fortran) code.
- OpenACC can be used with both Nvidia and AMD GPUs.
- Use a single source code across different GPU vendors without modification.
- Key directives for OpenACC
  - Compute constructs:
    - `parallel` and `kernels` - used to create a parallel region
  - Loop constructs:
    - `loop`, `collapse`, `gang`, `worker`, `vector`, etc. - designed to efficiently allocate threads for work-sharing tasks
  - Data management clauses:
    - `copy`, `create`, `copyin`, `copyout`, `delete`, and `present` - for managing data transfer between host and device
  - Other constructs:
    - `reduction`, `atomic`, `cache`, *etc.* - for special operations that would otherwise slow down parallel computation
- For more information about OpenACC directives, see the [OpenACC specification](https://www.openacc.org/sites/default/files/inline-files/OpenACC_2_0_specification.pdf).
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Key directives for OpenACC</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- **Compute constructs**: `parallel` vs. `kernels` (see the sketch below)
- ==Kernels in C/C++ & Fortran==
- `nvc -fast -Minfo=all -acc=gpu -gpu=cc80 Hello_World.c`
- `nvfortran -fast -Minfo=all -acc=gpu Hello_World_OpenACC.f90`
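A minimal C sketch contrasting the two compute constructs (the loop bodies are illustrative, not from the slides); it should compile with the `nvc` command above:

```c
#include <stdio.h>
#define N 1000000

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

    /* kernels: the compiler analyzes the region and decides what to parallelize */
    #pragma acc kernels
    for (int i = 0; i < N; i++) b[i] = 2.0f * a[i];

    /* parallel loop: the programmer asserts that the loop iterations are independent */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) b[i] += a[i];

    printf("b[1] = %f\n", b[1]);  /* expect 3.0 */
    return 0;
}
```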
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Key directives for OpenACC</font></b></p>
<!-- .slide: style="font-size: 22px;" -->
- **Loop and data clauses** (see the sketch below)
  - `copyin(list)` - allocates memory on the GPU and copies data from the host (CPU) to the GPU when entering a region
  - `copyout(list)` - allocates memory on the GPU and copies data from the GPU back to the host (CPU) when exiting a region
  - `copy(list)` - allocates memory on the GPU, copies data from the host to the GPU when entering a region, and copies the data back to the host when exiting a region
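A sketch of a structured data region (array names and sizes are illustrative): `a` is copied in once and `b` copied out once, so the two loops reuse the device copies instead of transferring data twice.

```c
#include <stdio.h>
#define N 1000000

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = 1.0f;

    #pragma acc data copyin(a[0:N]) copyout(b[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++) b[i] = 2.0f * a[i];

        #pragma acc parallel loop  /* reuses the device copies: no extra transfer */
        for (int i = 0; i < N; i++) b[i] += a[i];
    }
    printf("b[0] = %f\n", b[0]);  /* expect 3.0 */
    return 0;
}
```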
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Non-portable kernel-based models</font></b></p>
<!-- .slide: style="font-size: 22px;" -->
- Developers write low-level code that communicates directly with the GPU hardware.
- Two representative programming models are [CUDA](https://developer.nvidia.com/cuda-toolkit) and [HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/).
<!-- .slide: style="font-size: 18px;" -->
- A qualifier `__global__` defines a device kernel
- Calling device functions in the main program:
  - C/C++ example, ==c_function()==
  - CUDA example, ==cuda_function<<<1,1>>>()==
- ==<<< >>>== specifies the number of thread blocks and threads per block used to launch the kernel
- Make sure to synchronize threads (see the sketch below)
  - `__syncthreads()` - synchronizes all threads within a thread block
  - `cudaDeviceSynchronize()` - makes the host wait for a kernel call to finish
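Putting these pieces together, a minimal CUDA sketch (the kernel name is illustrative):

```cpp
#include <cstdio>

// __global__ marks a kernel: it runs on the device and is launched from the host.
__global__ void hello_kernel(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello_kernel<<<2, 4>>>();  // <<<blocks, threads>>>: 2 blocks of 4 threads each
    cudaDeviceSynchronize();   // host waits until the kernel has finished
    return 0;
}
```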
----
<p style="text-align: center"><b><font size=6 color=blueyellow>CUDA example: Vector addition</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
Source: [Vector_Addition.cu](https://enccs.github.io/gpu-programming/7-non-portable-kernel-models/#vector-addition)
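A condensed sketch in the spirit of the linked example (the original uses explicit `cudaMemcpy`; unified memory is used here for brevity):

```cpp
#include <cstdio>

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // one element per thread
}

int main(void) {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```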
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Performance and profiling</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
Using CUDA event APIs, we can measure the time taken to execute CUDA kernel functions, as sketched below.
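A fragment assuming the `vector_add` kernel and buffers from the previous slide; the `cudaEvent*` calls are the standard CUDA timing APIs:

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                        // enqueue 'start' before the kernel
vector_add<<<blocks, threads>>>(a, b, c, n);   // kernel to be timed
cudaEventRecord(stop);                         // enqueue 'stop' after the kernel

cudaEventSynchronize(stop);                    // wait until 'stop' has been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```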
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Performance and profiling</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- Nvidia provides profiling tools to measure traces and events of CUDA applications.
- ==Nsight Compute==: profiles kernel calls; both the *visual GUI* and the *command-line interface* can be used to inspect profiling information for kernel calls.
- ==Nsight Graphics==: useful for analyzing profiling results through a GUI.
- ==Nsight Systems==: provides system-wide profiling
  - *e.g.*, when the application mixes CPU and GPU programming (MPI, OpenMP, and CUDA).
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Non-portable kernel-based models</font></b></p>
==Hipification== is the process of converting CUDA code to HIP, enabling the code to run on both AMD and NVIDIA GPUs.
- HIP code uses similar keywords to CUDA code, making it easy to port simple CUDA kernels to HIP by changing a few library-specific calls.
  - Kernel launch:
    - `cuda_function<<<blocks, threads>>>()`
    - `hip_function<<<blocks, threads>>>()`
  - Synchronization:
    - `cudaDeviceSynchronize()`
    - `hipDeviceSynchronize()`
- HIP provides tools (`hipify-perl` or `hipify-clang`) to convert CUDA syntax to HIP syntax, as sketched below.
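For instance, `hipify-perl hello.cu > hello.hip.cpp` would produce HIP source much like this sketch (the kernel mirrors the CUDA example earlier; the file names are illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Same triple-chevron launch syntax as CUDA; only the runtime calls change.
__global__ void hello_kernel(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello_kernel<<<2, 4>>>();  // unchanged from the CUDA version
    hipDeviceSynchronize();    // hipDeviceSynchronize() replaces cudaDeviceSynchronize()
    return 0;
}
```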
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Portable kernel-based models</font></b></p>
<!-- .slide: style="font-size: 22px;" -->
- ==Cross-platform portability ecosystems== typically provide a higher-level abstraction layer that offers a convenient and portable programming model for GPU programming.
- For C++, the most notable cross-platform portability ecosystems are [**Alpaka**](https://github.com/alpaka-group/alpaka), [**Kokkos**](https://github.com/kokkos/kokkos), [**OpenCL**](https://www.khronos.org/opencl/), and [**SYCL**](https://www.khronos.org/sycl/).
- Pros and cons of cross-platform portability ecosystems
  - Pros
    - the amount of code duplication is minimized
    - less knowledge of the underlying architecture is needed for initial porting (Kokkos, SYCL)
  - Cons
    - these models are relatively new and not yet widely adopted
    - limited learning resources compared to CUDA & HIP
    - some low-level APIs and separate-source kernel models are less user friendly
- References and code examples (see the SYCL sketch below)
  - [Alpaka and openPMD workshop](https://www.hlrs.de/training/2024/alpaka-openpmd-hack)
  - [LAMMPS KOKKOS package](https://docs.lammps.org/Speed_kokkos.html)
  - [GROMACS package](https://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html)
  - ==[GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/)==
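As a taste of the single-source style, a minimal SYCL 2020 sketch of vector addition using unified shared memory (not taken from the lesson material):

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    sycl::queue q;  // picks a default device (a GPU if one is available)

    // Unified shared memory: pointers usable on both host and device.
    float *a = sycl::malloc_shared<float>(n, q);
    float *b = sycl::malloc_shared<float>(n, q);
    float *c = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Single-source kernel: the lambda body runs on the device.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}
```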
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Comparison of GPU compute APIs</font></b></p>
<!-- .slide: style="font-size: 18px;" -->
| API | Portability | Ease of Use | Performance | Primary Vendor | Best For |
| :--: | :--: | :--: | :--: | :--: | :--: |
| OpenACC | Medium (NVIDIA & AMD) | High (directive-based) | Medium-High | NVIDIA, AMD (limited) | Scientific computing and HPC with minimal code modifications |
| OpenMP Offloading | High (Cross-vendor) | Medium (directive-based) | High | Intel, AMD, NVIDIA | Parallelizing CPU and GPU workloads using OpenMP pragmas for performance portability |
| CUDA | Low (NVIDIA only) | Medium | Very High | NVIDIA | High-performance compute on NVIDIA GPUs |
| HIP | Medium (AMD & NVIDIA) | Medium | High | AMD, NVIDIA (via Hipify) | Porting CUDA applications to AMD GPUs with minimal code changes |
| SYCL | High (Cross-vendor) | Medium | High | Intel, AMD, NVIDIA | Heterogeneous computing with single-source C++ |
| OpenCL | High (Cross-vendor) | Medium | Medium | Cross-vendor | General-purpose parallel programming across multiple hardware architectures |
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Python libraries for GPU programming</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
| Library | Best For | Supports |
| :-----: | :------: | :------: |
| Numba.cuda | General CUDA GPU programming | NVIDIA |
| CuPy | NumPy-like GPU acceleration | NVIDIA |
| PyCUDA | Low-level CUDA programming | NVIDIA |
| PyOpenCL | Cross-vendor GPU programming | NVIDIA, AMD, Intel |
| SYCL (dpctl) | Cross-platform parallelism | Intel, AMD, NVIDIA |
| TensorFlow | Deep learning | NVIDIA, AMD |
| PyTorch | Machine learning | NVIDIA, AMD |
| OpenMP (Numba) | CPU & GPU parallelism | Intel, AMD, NVIDIA |
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Python libraries for AI research</font></b></p>
<!-- .slide: style="font-size: 18px;" -->
| API/Framework | Primary Use | GPU Support | Multi-GPU Support | Best For |
| :--: | :--: | :--: | :--: | :--: |
| TensorFlow | DL, neural networks, custom ML models | NVIDIA (CUDA), AMD (ROCm) | Yes, using Mirrored Strategy or Distribution Strategy | General-purpose ML and DL |
| PyTorch | DL and neural networks | NVIDIA (CUDA), AMD (ROCm) | Yes, using DataParallel and DistributedDataParallel | Research and production DL, flexibility |
| Hugging Face (Transformers) | NLP with transformers | NVIDIA, AMD (via PyTorch or TensorFlow) | Yes, using PyTorch or TensorFlow for multi-GPU | Pre-trained transformer models for NLP tasks |
| Keras | High-level neural networks API | NVIDIA (CUDA), AMD (ROCm) | Yes, using TensorFlow for multi-GPU | Simplified DL with TensorFlow backend |
| JAX | High-performance ML and scientific computing | NVIDIA (CUDA), AMD (ROCm), Intel (oneAPI) | Yes, using XLA for GPU acceleration | Numerical computing, DL with high performance |
| RAPIDS | GPU-accelerated data science and ML | NVIDIA GPUs | Yes, via Dask and cuML for ML tasks | Data science with cuML, cuDF, and Dask |
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Julia libraries for AI research</font></b></p>
- Julia is a high-performance, dynamic programming language designed for numerical scientific computing.
- It was designed specifically to solve the ==two-language problem==:
  - interpreted languages like Python and R translate instructions line by line
  - compiled languages like C/C++ and Fortran are translated by a compiler before the code runs
- Julia provides both high performance and ease of use in a single language.
- Key features of Julia
  - just-in-time (JIT) compilation for high-performance computing
  - simple & expressive syntax (like Python)
  - powerful numerical & scientific computing with built-in parallel computing support
  - interoperability with other programming languages
<!-- .slide: style="font-size: 16px;" -->
| Library | Purpose | Similar to | GPU Support? |
| :-----: | :-----: | :--------: | :----------: |
| Flux.jl | Deep learning | PyTorch, TensorFlow | ✅ Yes (CUDA) |
| MLJ.jl | Machine learning | scikit-learn | ✅ Yes |
| Turing.jl | Probabilistic modeling, Bayesian learning | PyMC3, Stan | ✅ Yes |
| AlphaZero.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |
| Knet.jl | Deep learning, Fast GPU AI research | PyTorch | ✅ Yes |
| ReinforcementLearning.jl | Reinforcement Learning | DeepMind AlphaZero | ✅ Yes |
---
<p style="text-align: center"><b><font size=12 color=gold>ENCCS <br>Lesson materials & <br>Training events</font></b></p>
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Lesson materials & Seasonal training events</font></b></p>
<!-- .slide: style="font-size: 20px;" -->
- Link: https://hackmd.io/@yonglei/mermaid-enccs-lesson
- Seasonal training events
  - GPU Programming / OpenACC-CUDA workshop
  - Practical machine/deep learning
  - Python/Julia HPDA/HPC
  - Best practice HPC training
  - Quantum autumn school
----
<p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia bootcamps</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- [N-Ways to GPU Programming, Apr. 8-9](https://enccs.se/events/bootcamp-n-days-gpu-programming/)
- [AI for Science, May 27-28](https://events.vsc.ac.at/event/186/)
- [Multi-GPU Programming, Jun. 17–18](https://events.vsc.ac.at/event/187/)
- [AI Multi-GPU Multi-Node Profiling, Jul. 9-10](https://events.vsc.ac.at/event/188/)
---
<p style="text-align: center"><b><font size=6 color=blueyellow>Take-home message</font></b></p>
<!-- .slide: style="font-size: 24px;" -->
- GPU & HPC
- GPU architecture
- GPU programming models
  - GPU compute APIs
    - Standard C/C++ & Fortran programming
    - Directive-based models
      - OpenACC, OpenMP offloading
    - Non-portable kernel-based models
      - CUDA, HIP
    - Portable kernel-based models
      - SYCL, OpenCL, Alpaka, Kokkos, etc.
    - High-level programming languages
      - Python, Julia
- Lesson materials
- Training events & Nvidia bootcamps