<style>
.reveal { font-size: 32px; }
</style>

<p style="text-align: center"><b>
<font size=6 color=blueyellow>[ENCCS Webinar]</font><br>
<font size=24 color=blueyellow>Practical Introduction to <br>GPU Programming</font>
</b></p>

$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ Yonglei Wang (ENCCS / NSC@LiU)

---

<p style="text-align: center"><b><font size=12 color=gold>GPU & HPC</font></b></p>

---

## <p style="text-align: center"><b><font size=6 color=blueyellow>What is a GPU?</font></b></p>

<!-- .slide: style="font-size: 24px;" -->

- A GPU is a specialized electronic circuit.
- Originally developed for computer graphics and image processing.
- It has since evolved into a general-purpose accelerator for massively parallel computing.

| ![](https://m.media-amazon.com/images/I/51EL-FaK4XL._AC_UF1000,1000_QL80_.jpg =40%x) ![](https://decisive-angel-514477a09e.media.strapiapp.com/h200_391352089d.png =40%x) |
|:-:|
| **NVIDIA RTX 4090** & **H200** |

----

## <p style="text-align: center"><b><font size=6 color=blueyellow>What is a GPU?</font></b></p>

<!-- .slide: style="font-size: 24px;" -->

- A GPU is a specialized electronic circuit.
- Originally developed for computer graphics and image processing.
- It has since evolved into a general-purpose accelerator for massively parallel computing.

| ![](https://www.amd.com/content/dam/amd/en/images/pr-feed/1213366.jpg =40%x) ![](https://i.pcmag.com/imagery/articles/02ON9V98apUZIE57Y9j2BLQ-4..v1662676440.jpg =40%x) |
|:--:|
| **AMD MI300 series** *vs.* **Intel Arc series** |

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Top 10 HPC systems</font></b></p>

| ![](https://hackmd.io/_uploads/ryEDEbxakg.jpg =95%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

Ref: [TOP500 list released on Nov. 18, 2024](https://top500.org/lists/top500/2024/11/)

----

<p style="text-align: center"><b><font size=6 color=blueyellow>HPC systems in the EU</font></b></p>

| ![](https://hackmd.io/_uploads/HklE9beT1l.png =45%x) ![EuroHPC-AI-Factories-ecosystem](https://hackmd.io/_uploads/Hk5YcZlTyx.jpg =30%x) <br> ![quantum](https://hackmd.io/_uploads/SyYq5Zlayl.png =45%x)|
| :-: |

---

<p style="text-align: center"><b><font size=12 color=gold>GPU Architecture</font></b></p>

---

## <p style="text-align: center"><b><font size=6 color=blueyellow>CPU vs. GPU</font></b></p>
| ![](https://enccs.github.io/gpu-programming/_images/CPUAndGPU.png =75%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

Ref: [GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/)

---

## <p style="text-align: center"><b><font size=6 color=blueyellow>GPU architecture</font></b></p>

| ![simd](https://hackmd.io/_uploads/ryvSZGxa1x.png =25%x) ![gpu-simt](https://hackmd.io/_uploads/B1Offzep1x.jpg =60%x) |
| :--: |

<!-- .slide: style="font-size: 24px;" -->

- Flynn's taxonomy classifies computer architectures into four categories:
    - Single instruction stream, single data stream (SISD)
    - Single instruction stream, multiple data streams (SIMD)
    - Multiple instruction streams, single data stream (MISD)
    - Multiple instruction streams, multiple data streams (MIMD)
- GPUs are based on the ==Single Instruction, Multiple Threads (SIMT)== model

---

## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>

| ![nvidia-gpu](https://hackmd.io/_uploads/BytNifg61e.jpg =80%x) |
| :--: |
| |

<!-- .slide: style="font-size: 24px;" -->

- Nvidia (microarchitectures): Tesla (2006), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020), Ada Lovelace (2022), Hopper (2022), and Blackwell (2024)
- ==GPU --> GPC --> TPC --> SM --> CUDA core + Tensor core==
- Each SM has its own L1 cache; the L2 cache is shared by all SMs

----

## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>

| ![rtx-3090](https://hackmd.io/_uploads/Sk0w6Mx6Je.jpg =65%x) ![sm-memory](https://hackmd.io/_uploads/S1S_TGlTyg.png =32%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

- The ==Ampere GPU== (RTX 3090) has 7 GPCs, 42 TPCs, and 84 SMs
- RT (ray tracing) cores are dedicated units for ray-tracing rendering computations
- Each SM has an L1 cache (up to 128 KB); the L2 cache (up to 6144 KB) is shared among GPCs

----

## <p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia GPU architecture</font></b></p>

| ![thread-batching](https://hackmd.io/_uploads/rJTQgQg6Jl.png =26%x) ![hardware-model-1](https://hackmd.io/_uploads/SJmExQeTye.png =29%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

- ==SIMT== enables programmers to achieve thread-level parallelism in SMs
- Threads are organized into scalable multi-threaded arrays, usually called ==CTAs== (cooperative thread arrays, *i.e.*, thread blocks), which can be arranged in 1D, 2D, and 3D
- ==GPUs have hierarchical memories== (see the code sketch on the next slide)
    - registers and local memory at thread level
    - shared memory at block level
    - global memory at grid level
    - constant and texture memories are two specialized memory types
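----

<p style="text-align: center"><b><font size=6 color=blueyellow>Thread hierarchy & memory levels in code (sketch)</font></b></p>

<!-- .slide: style="font-size: 22px;" -->

A minimal CUDA sketch (illustrative, not taken from the webinar code) of how the thread hierarchy and memory levels above appear in a kernel; the kernel name `block_sum` and the block size of 256 are assumptions:

```cpp
__global__ void block_sum(const float *in, float *out) {
    // Registers: per-thread local variables live here.
    int tid = threadIdx.x;                           // index within the block (CTA)
    int gid = blockIdx.x * blockDim.x + threadIdx.x; // index within the grid

    // Shared memory: visible to all threads of one block.
    __shared__ float tile[256];
    tile[tid] = in[gid];    // global memory (grid level) -> shared memory (block level)
    __syncthreads();        // synchronize all threads of the block

    // Thread 0 of each block reduces its tile and writes the result
    // back to global memory, visible to the whole grid.
    if (tid == 0) {
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) s += tile[i];
        out[blockIdx.x] = s;
    }
}

// Launch with a 1D grid of 1D blocks of 256 threads, e.g.:
//   block_sum<<<n / 256, 256>>>(d_in, d_out);
```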
---

## <p style="text-align: center"><b><font size=6 color=blueyellow>AMD GPU architecture</font></b></p>

| ![amd-nodelevel](https://hackmd.io/_uploads/SkrBMQgTJg.png =27%x) ![amd-gpu-gcd](https://hackmd.io/_uploads/rkcrfXg6Je.png =60%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

- LUMI, one of the EuroHPC JU machines, features AMD MI250 GPUs.
- Left: Node-level architecture of a system based on the AMD Instinct MI250 accelerator.
    - Each package has two GCDs (GPU compute dies)
    - The MI250 OAMs attach to the host system via PCIe Gen 4 x16 links (yellow lines).
- Right: Block diagram of a package consisting of two GCDs

---

## <p style="text-align: center"><b><font size=6 color=blueyellow>Intel GPU architecture</font></b></p>

| ![intel-gpu](https://hackmd.io/_uploads/SyP1KQga1x.png =75%x)![intel-gpu2](https://hackmd.io/_uploads/BJAQFQlpyg.png =60%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

- Intel X$^e$ HPC Core: 8 matrix math units and 8 vector math units
- X$^e$ HPC Cores are grouped, together with ray tracing units, into an X$^e$ HPC Slice
- ==X$^e$ HPC Core --> X$^e$ HPC Slice --> X$^e$ HPC Stack --> X$^e$ HPC Link==
- The X$^e$ GPU family comprises a series of microarchitectures
    - integrated/low power (X$^e$-LP)
    - high performance gaming (X$^e$-HPG)
    - datacenter/high performance (X$^e$-HP)
    - high performance computing (X$^e$-HPC)

---

## <p style="text-align: center"><b><font size=6 color=blueyellow>Comparison of NVIDIA, AMD, and Intel GPUs</font></b></p>

| ![compare-gpu](https://hackmd.io/_uploads/BkCGs7xTJe.png =90%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

- Each GPU vendor follows its own strategy to produce GPUs for market needs.
- GPUs used for AI research and HPC differ in throughput for single and mixed precision, in memory capacity and bandwidth, and in FP64 compute cores (used in scientific computing).
- Recent advances include integrated memory in AMD and Intel GPUs, multiple GPU dies on a single package from AMD, and stackable designs from Intel.

---

<p style="text-align: center"><b><font size=12 color=gold>GPU Programming Models</font></b></p>

---

<p style="text-align: center"><b><font size=6 color=blueyellow>GPU compute APIs</font></b></p>

<!-- .slide: style="font-size: 24px;" -->

- **Standard C/C++ & Fortran programming**
- **Directive-based models**
    - OpenACC
    - OpenMP offloading
- **Non-portable kernel-based models**
    - CUDA for Nvidia GPUs
    - HIP for AMD GPUs
    - oneAPI for Intel GPUs
- **Portable kernel-based models**
    - SYCL, OpenCL, Alpaka, Kokkos, *etc.*
- **High-level programming languages**
    - Python
    - Julia

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Standard C/C++ & Fortran programming</font></b></p>

| ![cppppp](https://hackmd.io/_uploads/H1TV6El61e.png =46%x) ![cppppppppp](https://hackmd.io/_uploads/HysB6ElTyg.png =50%x) |
| :--: |

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Directive-based models</font></b></p>

<!-- .slide: style="font-size: 18px;" -->

The serial code is annotated with directives telling the compiler to run specific loops and regions on the GPU.

Two representative directive-based programming models are [**OpenACC**](https://www.openacc.org/) and [**OpenMP offloading**](https://www.openmp.org/).

<div style="clear: both;">
<div style="margin-left: 10em; margin-right: 10em;">
<img src="https://hackmd.io/_uploads/rymWsewxJe.jpg" alt="" height=460x>
<img src="https://hackmd.io/_uploads/S1GCigDe1g.jpg" alt="" height=460x>
</div>
</div>

----

<p style="text-align: center"><b><font size=6 color=blueyellow>Overview of OpenACC</font></b></p>

<!-- .slide: style="font-size: 24px;" -->

- OpenACC is not a GPU programming language; it is based on compiler directives.
- OpenACC allows you to express parallelism in your (C/C++ & Fortran) code.
- OpenACC can be used with both Nvidia and AMD GPUs.
- A single source code can be used across different GPU vendors without modification.
- Key directives for OpenACC
    - Compute Constructs:
        - `parallel` and `kernels`
        - used to create a parallel region.
    - Loop Constructs:
        - `loop`, with clauses such as `collapse`, `gang`, `worker`, `vector`, etc.
        - designed to efficiently distribute threads over work-sharing tasks.
    - Data Management Clauses:
        - `copy`, `create`, `copyin`, `copyout`, `delete`, and `present`
        - for managing data transfer between host and device.
    - Other Constructs:
        - `reduction`, `atomic`, `cache`, *etc.*
        - for special operations that would otherwise slow down parallel computation.
- For more information about OpenACC directives, see the [OpenACC 2.0 specification](https://www.openacc.org/sites/default/files/inline-files/OpenACC_2_0_specification.pdf). A minimal example combining these directives is sketched on the next slide.
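----

<p style="text-align: center"><b><font size=6 color=blueyellow>OpenACC in practice (sketch)</font></b></p>

<!-- .slide: style="font-size: 22px;" -->

A minimal, self-contained sketch (illustrative, not the webinar's code) combining a compute construct with data clauses; the file name `vec_add.c` is an assumption:

```c
#include <stdio.h>
#define N 1000000

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    /* copyin: host -> device on entry; copyout: device -> host on exit */
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    printf("c[0] = %.1f\n", c[0]);   /* expect 3.0 */
    return 0;
}
```

Compile, for example, with `nvc -fast -Minfo=all -acc=gpu vec_add.c`.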
----

<p style="text-align: center"><b><font size=6 color=blueyellow>Key directives for OpenACC</font></b></p>

| ![kernel-c](https://hackmd.io/_uploads/SyBBErxp1x.png =85%x) ![kernel-f](https://hackmd.io/_uploads/SkoFVBxaJe.png =85%x) |
| :--: |
| |

<!-- .slide: style="font-size: 24px;" -->

- **Compute Constructs**: `parallel` vs. `kernels`
- ==Kernels in C/C++ & Fortran==
    - `nvc -fast -Minfo=all -acc=gpu -gpu=cc80 Hello_World.c`
    - `nvfortran -fast -Minfo=all -acc=gpu Hello_World_OpenACC.f90`

----

<p style="text-align: center"><b><font size=6 color=blueyellow>Key directives for OpenACC</font></b></p>

| ![aaaa-vector-add-c](https://hackmd.io/_uploads/Hk2xIHlpJg.png =60%x) ![aaaa-vector-add-f](https://hackmd.io/_uploads/H1S-IHgaye.png =60%x) |
| :-: |
| |

<!-- .slide: style="font-size: 22px;" -->

- **Loop and Data Clauses**
    - `copyin(list)`
        - Allocates memory on the GPU and copies data from the CPU (host) to the GPU when entering a region.
    - `copyout(list)`
        - Allocates memory on the GPU and copies data from the GPU to the CPU (host) when exiting a region.
    - `copy(list)`
        - Allocates memory on the GPU, copies data from the CPU (host) to the GPU when entering a region, and copies data back to the CPU (host) when exiting a region.

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Non-portable kernel-based models</font></b></p>

<!-- .slide: style="font-size: 22px;" -->

- Developers write low-level code that communicates directly with the GPU hardware.
- Two representative programming models are [CUDA](https://developer.nvidia.com/cuda-toolkit) and [HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/).

| ![cuda-c](https://hackmd.io/_uploads/SyADl8xa1l.png =60%x) |
| :-: |
| |

<!-- .slide: style="font-size: 18px;" -->

- The `__global__` qualifier defines a device kernel
- Calling device functions in the main program:
    - C/C++ example, ==c_function()==
    - CUDA example, ==cuda_function<<<1,1>>>()==
- ==<<< >>>== specifies the number of thread blocks and threads per block when launching a kernel
- Make sure to synchronize threads
    - `__syncthreads()` - synchronizes all threads within a thread block
    - `cudaDeviceSynchronize()` - blocks the host until all preceding kernel calls on the device have completed

----

<p style="text-align: center"><b><font size=6 color=blueyellow>CUDA example: Vector addition</font></b></p>

| ![cuda-vec-add-1](https://hackmd.io/_uploads/HJ3EX8gTyl.png =45%x) ![cuda-vec-add-2](https://hackmd.io/_uploads/ByMSQ8eake.png =51%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

Source: [Vector_Addition.cu](https://enccs.github.io/gpu-programming/7-non-portable-kernel-models/#vector-addition)

----

<p style="text-align: center"><b><font size=6 color=blueyellow>Performance and profiling</font></b></p>

| ![matrix-add](https://hackmd.io/_uploads/rkTpCIlayl.png =60%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

Using CUDA APIs, we can measure the time taken to execute CUDA kernel functions (a minimal sketch follows on the next slide).
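----

<p style="text-align: center"><b><font size=6 color=blueyellow>Timing a kernel with CUDA events (sketch)</font></b></p>

<!-- .slide: style="font-size: 22px;" -->

A minimal, self-contained sketch of kernel timing with CUDA events; the kernel `inc_kernel` and the problem size are assumptions for illustration:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread increments one element.
__global__ void inc_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    // Events mark points on a stream; the elapsed time between two
    // recorded events measures the kernel's execution time.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    inc_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);              // wait for the stop event
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // milliseconds between events
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```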
----

<p style="text-align: center"><b><font size=6 color=blueyellow>Performance and profiling</font></b></p>

| ![cuda-profiling](https://hackmd.io/_uploads/BkKWJceakl.png =90%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

- Nvidia provides profiling tools to collect traces and events of CUDA applications.
- ==Nsight Compute==: profiles kernel calls; both the *visual GUI profiler* and the *command-line interface* can be used to inspect profiling information for kernel calls.
- ==Nsight Graphics==: targets graphics applications; useful for analyzing profiling results through a GUI.
- ==Nsight Systems==: provides system-wide profiling
    - *e.g.*, when an application involves mixed programming on CPU and GPU (that is, MPI, OpenMP, and CUDA).

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Non-portable kernel-based models</font></b></p>

==Hipification== is the process of converting CUDA code to HIP, enabling code to run on both AMD and NVIDIA GPUs.

- HIP uses keywords similar to CUDA's, making it easy to port simple CUDA kernels to HIP by changing a few library-specific calls (see the sketch on the next slide).
    - launching kernel functions
        - `cuda_function<<<blocks, threads>>>()`
        - `hip_function<<<blocks, threads>>>()`
    - synchronization
        - `cudaDeviceSynchronize()`
        - `hipDeviceSynchronize()`
- HIP provides tools (`hipify-perl` or `hipify-clang`) to convert CUDA syntax to HIP syntax.
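----

<p style="text-align: center"><b><font size=6 color=blueyellow>HIP vector addition (sketch)</font></b></p>

<!-- .slide: style="font-size: 22px;" -->

A minimal HIP sketch (illustrative, not the webinar's code) mirroring the CUDA vector addition; the comments note the one-to-one CUDA equivalents:

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>   // CUDA equivalent: <cuda_runtime.h>

// Kernel syntax is identical in CUDA and HIP.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    hipMallocManaged(&a, n * sizeof(float));  // <-> cudaMallocManaged()
    hipMallocManaged(&b, n * sizeof(float));
    hipMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Same triple-chevron launch syntax as CUDA.
    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    hipDeviceSynchronize();                   // <-> cudaDeviceSynchronize()

    printf("c[0] = %.1f\n", c[0]);            // expect 3.0
    hipFree(a); hipFree(b); hipFree(c);       // <-> cudaFree()
    return 0;
}
```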
---

<p style="text-align: center"><b><font size=6 color=blueyellow>Portable kernel-based models</font></b></p>

<!-- .slide: style="font-size: 22px;" -->

- ==Cross-platform portability ecosystems== typically provide a higher-level abstraction layer that offers a convenient and portable programming model for GPU programming.
- For C++, the most notable cross-platform portability ecosystems are [**Alpaka**](https://github.com/alpaka-group/alpaka), [**Kokkos**](https://github.com/kokkos/kokkos), [**OpenCL**](https://www.khronos.org/opencl/), and [**SYCL**](https://www.khronos.org/sycl/) (a minimal SYCL sketch follows on the next slide).
- Pros and cons of cross-platform portability ecosystems
    - Pros
        - The amount of code duplication is minimized
        - Less knowledge of the underlying architecture is needed for initial porting (Kokkos, SYCL)
    - Cons
        - These models are relatively new and not yet widely adopted
        - Limited learning resources compared to CUDA & HIP
        - Some low-level APIs and separate-source kernel models are less user-friendly
- References and code examples
    - [Alpaka and openPMD workshop](https://www.hlrs.de/training/2024/alpaka-openpmd-hack)
    - [LAMMPS KOKKOS package](https://docs.lammps.org/Speed_kokkos.html)
    - [GROMACS package](https://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html)
    - ==[GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/)==
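----

<p style="text-align: center"><b><font size=6 color=blueyellow>SYCL vector addition (sketch)</font></b></p>

<!-- .slide: style="font-size: 22px;" -->

A minimal single-source SYCL 2020 sketch (illustrative, not the webinar's code), using unified shared memory to keep it short:

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    sycl::queue q;   // selects a default device (a GPU if one is available)

    // Unified shared memory is accessible from both host and device.
    float *a = sycl::malloc_shared<float>(n, q);
    float *b = sycl::malloc_shared<float>(n, q);
    float *c = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Single-source C++: the lambda body is compiled for the device.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    printf("c[0] = %.1f\n", c[0]);   // expect 3.0
    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}
```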
---

<p style="text-align: center"><b><font size=6 color=blueyellow>Comparison of GPU compute APIs</font></b></p>

<!-- .slide: style="font-size: 18px;" -->

| API | Portability | Ease of Use | Performance | Primary Vendor | Best For |
| :--: | :--: | :--: | :--: | :--: | :--: |
| OpenACC | Medium (NVIDIA & AMD) | High (directive-based) | Medium-High | NVIDIA, AMD (limited) | Scientific computing and HPC with minimal code modifications |
| OpenMP Offloading | High (Cross-vendor) | Medium (directive-based) | High | Intel, AMD, NVIDIA | Parallelizing CPU and GPU workloads using OpenMP pragmas for performance portability |
| CUDA | Low (NVIDIA only) | Medium | Very High | NVIDIA | High-performance compute on NVIDIA GPUs |
| HIP | Medium (AMD & NVIDIA) | Medium | High | AMD, NVIDIA (via Hipify) | Porting CUDA applications to AMD GPUs with minimal code changes |
| SYCL | High (Cross-vendor) | Medium | High | Intel, AMD, NVIDIA | Heterogeneous computing with single-source C++ |
| OpenCL | High (Cross-vendor) | Medium | Medium | Cross-vendor | General-purpose parallel programming across multiple hardware architectures |

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Python libraries for GPU programming</font></b></p>

<!-- .slide: style="font-size: 24px;" -->

| Library | Best For | Supports |
| :-----: | :------: | :------: |
| Numba.cuda | General CUDA GPU programming | NVIDIA |
| CuPy | NumPy-like GPU acceleration | NVIDIA |
| PyCUDA | Low-level CUDA programming | NVIDIA |
| PyOpenCL | Cross-vendor GPU programming | NVIDIA, AMD, Intel |
| SYCL (dpctl) | Cross-platform parallelism | Intel, AMD, NVIDIA |
| TensorFlow | Deep learning | NVIDIA, AMD |
| PyTorch | Machine learning | NVIDIA, AMD |
| OpenMP (Numba) | CPU & GPU parallelism | Intel, AMD, NVIDIA |

----

<p style="text-align: center"><b><font size=6 color=blueyellow>Python libraries for AI research</font></b></p>

<!-- .slide: style="font-size: 18px;" -->

| API/Framework | Primary Use | GPU Support | Multi-GPU Support | Best For |
| :--: | :--: | :--: | :--: | :--: |
| TensorFlow | DL, neural networks, custom ML models | NVIDIA (CUDA), AMD (ROCm) | Yes, using MirroredStrategy or other distribution strategies | General-purpose ML and DL |
| PyTorch | DL and neural networks | NVIDIA (CUDA), AMD (ROCm) | Yes, using DataParallel and DistributedDataParallel | Research and production DL, flexibility |
| Hugging Face (Transformers) | NLP with transformers | NVIDIA, AMD (via PyTorch or TensorFlow) | Yes, using PyTorch or TensorFlow for multi-GPU | Pre-trained transformer models for NLP tasks |
| Keras | High-level neural networks API | NVIDIA (CUDA), AMD (ROCm) | Yes, using TensorFlow for multi-GPU | Simplified DL with TensorFlow backend |
| JAX | High-performance ML and scientific computing | NVIDIA (CUDA), AMD (ROCm), Intel (oneAPI) | Yes, using XLA for GPU acceleration | Numerical computing, DL with high performance |
| RAPIDS | GPU-accelerated data science and ML | NVIDIA GPUs | Yes, via Dask and cuML for ML tasks | Data science with cuML, cuDF, and Dask |

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Julia libraries for AI research</font></b></p>

- Julia is a high-performance, dynamic programming language designed for numerical and scientific computing.
- It was designed specifically to solve the ==two-language problem==.
    - Interpreted languages like Python and R translate instructions line by line.
    - Compiled languages like C/C++ and Fortran are translated by a compiler before the code runs.
    - Julia provides both high performance and ease of use in a single language.
- Key features of Julia
    - Just-In-Time (JIT) compilation for high-performance computing
    - Simple & expressive syntax (like Python)
    - Powerful for numerical & scientific computing, with built-in parallel computing support
    - Interoperability with other programming languages

| ![julia-feature](https://hackmd.io/_uploads/SkxNasep1e.png =50%x) |
| :-: |
| |

<!-- .slide: style="font-size: 16px;" -->

| Library | Purpose | Similar to | GPU Support? |
| :-----: | :-----: | :--------: | :----------: |
| Flux.jl | Deep learning | PyTorch, TensorFlow | ✅ Yes (CUDA) |
| MLJ.jl | Machine learning | scikit-learn | ✅ Yes |
| Turing.jl | Probabilistic modeling, Bayesian learning | PyMC3, Stan | ✅ Yes |
| AlphaZero.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |
| Knet.jl | Deep learning, fast GPU AI research | PyTorch | ✅ Yes |
| ReinforcementLearning.jl | Reinforcement learning | DeepMind AlphaZero | ✅ Yes |

---

<p style="text-align: center"><b><font size=12 color=gold>ENCCS <br>Lesson materials & <br>Training events</font></b></p>

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Lesson materials & Seasonal training events</font></b></p>

| ![lesson-material](https://hackmd.io/_uploads/S1rlH2lTkl.png =55%x) |
| :-: |
| |

<!-- .slide: style="font-size: 20px;" -->

- Link: https://hackmd.io/@yonglei/mermaid-enccs-lesson
- Seasonal training events
    - GPU Programming / OpenACC-CUDA workshop
    - Practical machine/deep learning
    - Python/Julia HPDA/HPC
    - Best practice HPC training
    - Quantum autumn school

----

<p style="text-align: center"><b><font size=6 color=blueyellow>Nvidia bootcamps</font></b></p>

| ![N-Ways-GPU-2025-04_image-1](https://hackmd.io/_uploads/By4w8ngakx.png =50%x) |
| :-: |
| |

<!-- .slide: style="font-size: 24px;" -->

- [N-Ways to GPU Programming, Apr. 8-9](https://enccs.se/events/bootcamp-n-days-gpu-programming/)
- [AI for Science, May 27-28](https://events.vsc.ac.at/event/186/)
- [Multi-GPU Programming, Jun. 17-18](https://events.vsc.ac.at/event/187/)
- [AI Multi-GPU Multi-Node Profiling, Jul. 9-10](https://events.vsc.ac.at/event/188/)

---

<p style="text-align: center"><b><font size=6 color=blueyellow>Take-home message</font></b></p>

<!-- .slide: style="font-size: 24px;" -->

- GPU & HPC
- GPU architecture
- GPU programming models
- GPU compute APIs
    - Standard C/C++ & Fortran programming
    - Directive-based models
        - OpenACC, OpenMP offloading
    - Non-portable kernel-based models
        - CUDA, HIP
    - Portable kernel-based models
        - SYCL, OpenCL, Alpaka, Kokkos, etc.
    - High-level programming languages
        - Python, Julia
- Lesson materials
- Training events & Nvidia bootcamps
{"description":"title: ENCCS kickoff August 2024tags: presentationmargin: 0.08#center: falseslideOptions:theme: 'moon'margin: 0.08transition: 'fade'","contributors":"[{\"id\":\"4378d700-74cc-43e9-9c0a-18f572c846ae\",\"add\":163451,\"del\":60825}]","title":"[Webinar] Practical Introduction to GPU Programming"}
    849 views