![](https://media.enccs.se/2024/08/GPU-why-when-how-2024.jpg)

<p style="text-align: center"><b><font size=5 color=blueyellow>GPU Programming: Why, When, and How -- Day 1</font></b></p>

:::success
**GPU Programming: Why, When, and How — Schedule**: https://hackmd.io/@yonglei/gpu-programming-schedule-2024
:::

## <font color=red>(Nov. 12) Day 1</font>

| Time | Content | Instructor |
| :---------: | :-----: | :--------: |
| 09:00-09:15 | Welcome | Yonglei/Thor |
| 09:15-09:40 | [Why GPUs?](https://enccs.github.io/gpu-programming/1-gpu-history/) | Binod (Jim-Viktor) |
| 09:40-10:20 | [GPU hardware and software ecosystem](https://enccs.github.io/gpu-programming/2-gpu-ecosystem/) | Cristian |
| 10:20-10:30 | Break | |
| 10:30-11:00 | [What problems fit to GPU?](https://enccs.github.io/gpu-programming/3-gpu-problems/) | Pedro |
| 11:00-11:30 | [GPU programming concepts](https://enccs.github.io/gpu-programming/4-gpu-concepts/) | Cristian (Stepas) |
| 11:30-12:00 | [Introduction to GPU programming models](https://enccs.github.io/gpu-programming/5-intro-to-gpu-prog-models/) | Stepas |
| 12:00-13:00 | Lunch break | |
| 13:00-14:20 | [Directive-based models (OpenACC, OpenMP)](https://enccs.github.io/gpu-programming/6-directive-based-models/) | Wei (Stepas) |
| 14:20-14:30 | Break | |
| 14:30-16:00 | [Non-portable kernel-based models (CUDA, HIP)](https://enccs.github.io/gpu-programming/7-non-portable-kernel-models/) | Cristian (Jonathan) |

---

## Exercises and Links

:::warning
- Slides for [Why GPUs?](https://github.com/ENCCS/gpu-programming/tree/main/content/slides)
:::

---

:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::

## Questions, answers and information

- Is this how to ask a question?
    - Yes, and an answer will appear like so!

### 1. Why GPUs?

- Do you guys have something like a tutorial on CUDA management?
    - We might cover this topic in a CUDA-based GPU programming workshop. You can follow our lesson materials for CUDA workshops, like [HERE](https://enccs.github.io/OpenACC-CUDA-beginners/).

### 2. GPU hardware and software ecosystem

- The webpage lesson material for "GPU hardware and software ecosystem" is available here: https://enccs.github.io/gpu-programming/2-gpu-ecosystem/
- Why doesn't the GPU have an L3 cache?
    - Some of the reasons: a) it is supposed to already have fast access to its own RAM, b) the control paths on GPUs are simpler compared to general-purpose CPUs.
    - GPUs also have their own cache hierarchy: an L1 cache shared by all threads within an SM (Streaming Multiprocessor), and an L2 cache shared across SMs. :+1: :+1:
- Can I do GPU programming with the CUDA, ROCm, or oneAPI suites on iGPUs or dedicated GPUs in consumer-grade devices (RTX 4050 etc., AMD 780M, Intel Arc 370, ...)?
    - Yes - at least CUDA works / is accessible for development on consumer-level NVIDIA GPUs. :+1:
    - Intel works OK too. AMD only officially supports a few select consumer-grade models, but in practice things often work on recent integrated and desktop cards.
    - Note that those GPUs may not support double precision (FP64, 64-bit) operations. :+1:
        - Most AMD and NVIDIA GPUs support FP64, even if with noticeably lower performance. Consumer Intel GPUs do not support FP64 at all, but there is software emulation (`export OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1` before running your OpenMP/OpenCL/SYCL program).
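For NVIDIA cards you can quickly check what a device reports (compute capability, number of SMs, cache and memory sizes) with the CUDA runtime API. A minimal sketch, added to these notes as an illustration rather than taken from the lesson, assuming the CUDA toolkit is installed (compile with `nvcc`); on AMD the HIP equivalent is `hipGetDeviceProperties`, and `rocm-smi`/`rocminfo` give similar information:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                        // query device 0
    printf("Name:                 %s\n", prop.name);
    printf("Compute capability:   %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors (SM): %d\n", prop.multiProcessorCount);
    printf("L2 cache size:        %d bytes\n", prop.l2CacheSize);
    printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Global memory:        %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```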
- Later in the workshop, various programming models will be presented, some of which aim to be portable and convenient, i.e. they do not depend on the underlying architecture or require intimate knowledge of low-level GPU programming. Those approaches will work on any devices supported by the respective compilers/toolchains. :+1:
- If OpenAPI supports CPUs, GPUs and FPGAs, it would have been more logical for it to be the most popular one. But the trend is with CUDA. Any reason for that? Are there any efficiency differences that need to be considered when choosing? Or is it only a hardware/compiler choice?
    - P.S.: There's oneAPI and OpenCL, no "OpenAPI" :)
    - CUDA has been around for ~17 years and there's lots of code written with it.
    - It is also much more mature/stable, with more documentation, more examples, ...
    - People joke about CUDA being hard to install (https://x.com/mrsiipa/status/1855271042484228559); with most other GPU frameworks (incl. oneAPI) it's worse.
    - We even have a comment in this doc down below: _Getting the GPU kernel module, CUDA driver and application library versions to match is a dependency hell of its own._ For oneAPI to target NVIDIA GPUs, you still need to install CUDA, and on top of that install matching versions of oneAPI and the oneAPI CUDA plugin.
    - oneAPI is relatively new and was also started by Intel -> it takes time for other vendors to see the benefit of a uniform toolset/API. :+1:
        - Not sure other vendors see uniformity as a benefit :)
    - Just because code is portable does not mean it is equally efficient. One still needs to optimize the code (especially across CPU vs GPU vs FPGA -- usually you need very different algorithms to use the hardware efficiently, even if the same code will somehow work on all three). And there is some complexity added; if you only care about NVIDIA GPUs, you might not want to spend the additional effort.
    - OpenCL has been around for a very long time, but it never gained traction in HPC (it's very popular in the embedded world!) due to the hardware near-monopoly of many years. :+1:
- If I have an ML model working on CUDA, what's the best way to port it to an FPGA?
    - Likely not advised, as it will take effort just to get it running and **a lot** of effort for it to run with efficiency comparable to worked-out CUDA-based solutions. Still, one could start with portable frameworks (SYCL or OpenCL).
- How can simultaneous *access* to memory (i.e. reading from the _same_ location many times) cause an out-of-memory problem?
    - You mean in the exercise? That's probably just ambiguous wording: cores "trying to access memory" -> cores "trying to claim some memory for themselves".

### X. Break until 10:30

### 3. What problems fit to GPU?

- Lesson materials: https://enccs.github.io/gpu-programming/3-gpu-problems/#what-problems-fit-to-gpu
- What does it mean to write your own kernels?
    - This will be covered in later sections: kernels are specific pieces of code written to run on a GPU. They can be written in a vendor-specific language (CUDA/HIP), in an abstracted way (SYCL), or generated implicitly (OpenMP/OpenACC); see the minimal sketch below. :+1:
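To make "writing your own kernel" concrete, here is a minimal CUDA sketch (HIP is nearly identical) added to these notes as an illustration, not a lesson example:

```c
#include <stdio.h>
#include <cuda_runtime.h>

// The kernel: every GPU thread scales one element of the array.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) x[i] *= a;
}

int main(void) {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));        // unified memory, for brevity
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);     // enough blocks of 256 threads
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                     // expect 2.000000
    cudaFree(x);
    return 0;
}
```

The function marked `__global__` is the kernel: it runs on the GPU, one instance per thread, and each thread picks the element it is responsible for from its block and thread indices.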
:::success
==**Discussion**==

What type of problems have you used GPUs for?
- Gromacs simulations of proteins
- Convolutional neural networks for classifying spectrograms
- Solving time-dependent partial differential equations in k-space
- Explicit numerical schemes for shallow-water equations
- Solving PDEs using spectral methods

How large was the performance boost?
- ~10x faster than regular CPU computations
- ~100x faster (interestingly, LUMI runs about 10x slower than my local GPU and UCloud - NVIDIA)
    - Comment: you probably didn't bind the proper CPUs and GPUs together on LUMI
- 50-100x faster; cuFFT worked very well
- 10-50x (compared to CPUs + MPI)
- ~100x speedup for FFT
:::

### 4. GPU programming concepts

- Lesson material: https://enccs.github.io/gpu-programming/4-gpu-concepts/
- Is one CUDA thread equivalent to a CPU thread?
    - What do you mean by "equivalent"? In the sense that it is mapped to a single physical core, yes. In the practical sense, a GPU thread is lighter and more restricted. :+1: :+1:
    - A big difference is that, within HPC, it's usually best to use one CPU thread per CPU core and let the CPU do out-of-order execution and other magic to hide memory-access and instruction latency and fully utilize the hardware. A GPU core is much simpler, and you _must_ oversubscribe it by launching many GPU threads per GPU core to avoid the hardware stalling while waiting for memory.
- All threads in the warp should execute the same instruction? Can we use an `if-else` statement in the code?
    - You can use `if-else` statements, but this will decrease efficiency: when some threads execute the `if` branch, the other threads have to wait.
    - Since NVIDIA Volta introduced "independent thread scheduling", it is a bit more complicated than that, and two threads in a warp can execute different instructions. But it is very limited, and "warp divergence" is still best avoided.
- How do you define the size of the group? Why is it useful to know the index of a thread?
    - In order to map the computation (for example, the element of an array which gets computed) to the GPU thread. You can deduce the global index of the GPU thread only if you know the size of the group.
    - It should be as large as possible, though there is a maximum value. In practice one would test different sizes and see which one gives the best performance.
    - For specific hardware and kernel properties, you can use the CUDA Occupancy Calculator (https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#occupancy-calculator, https://xmartlabs.github.io/cuda-calculator/). You specify your GPU and the properties of the kernel (registers used, size of shared memory -- this will be discussed later today), and it can tell you which group size would be optimal.
    - A similar tool for Intel GPUs: https://oneapi-src.github.io/oneAPI-samples/Tools/GPU-Occupancy-Calculator/
- What is the most efficient way (best data access pattern, i.e. within a warp the data needs to be close together) to loop over multidimensional data? (e.g. a classical finite-difference stencil)
    - The grid can be 1D, 2D or 3D. You can also map from a 1D grid to 3D arrays, but it is convenient for the GPU grid to have the same dimensionality as the algorithm.
- How do you implement saxpy if the matrix size is bigger than the number of available threads, so that we can't allocate one thread per element?
    - Divide and conquer! Compute slices or batches of the matrix.
    - If one wants to fix the number of threads in the grid and there are fewer threads than the size of the problem, one can instruct the kernel to process more than one index. A small loop in the kernel will not hurt performance (see the sketch below).
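For example, a grid-stride loop version of saxpy (a minimal CUDA sketch assuming the usual `y = a*x + y` operation; not a lesson file) lets a fixed-size grid cover an arbitrarily large array:

```c
// Grid-stride loop: the whole grid sweeps over the array in steps of
// gridDim.x * blockDim.x, so any grid size can cover any problem size.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}
```

Launched, e.g., as `saxpy<<<num_blocks, 256>>>(n, a, d_x, d_y)` with device pointers `d_x`, `d_y`; when `n` is larger than the total number of threads, each thread simply handles several elements.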
- In practice there is no limit on the number of threads.
    - Recent CUDA GPUs allow up to 2'147'483'647 groups, each with up to 1024 threads :)
- You mean it is possible to oversubscribe?
    - And it is recommended to do so! Because when one warp waits for some data, other warps can keep computing.
- In the material it says "There should be many more blocks than SMPs present on the device. Also in order to ensure a good occupancy of the CUDA cores there should be more warps active on a given SMP than SIMT units." On a GPU node with 4 GPUs, is there a recommendation for how many blocks or warps to fit?
    - In general I would recommend fixing the block size and letting the number of blocks depend on the size of the problem. The MI250X has 110 CUs, so if the number of blocks (groups) is 111 one would see a big drop in performance -- but it depends on the application.

### 5. Introduction to GPU programming models

- Lesson material: https://enccs.github.io/gpu-programming/5-intro-to-gpu-prog-models/

:::success
==**Discussion**==

Which GPU programming frameworks have you already used previously, if any?
- CUDA, Bohrium (https://bohrium.readthedocs.io/), Futhark (https://futhark-lang.org/)
- PyTorch (not directly writing code for it, but using CUDA)
    - I am curious how this runs on LUMI?
- OpenMP, SYCL, CUDA, HIP
- CUDA :+1:
- PyCUDA + CUDA C/C++
- Learning Julia programming
- CUDA, HIP, JAX, Numba, CuPy

What did you find most challenging? What was most useful?
- Understanding that GPU programming is orthogonal to programming the same thing for CPUs with e.g. OpenMP. Performance, and understanding why your performance increases/decreases.
- Index trouble in CUDA C/C++. Low-level code can give very good performance and control.
- As a physicist first, the hardest part was justifying the time spent on code to my supervisor :) :+1: :+1:
- Deciding what to learn, a low-level or a high-level language, and whether the speed-up gained is worth the time spent.
    - I was vindicated by the way, simulations that used to take weeks now take hours
        - Great!
- With CUDA and HIP, the availability of libraries like cuFFT and hipFFT makes life easier.
- Getting the GPU kernel module, CUDA driver and application library versions to match is a dependency hell of its own.
:::

### X. Lunch break until 13:00

---

### 6. Directive-based models

- Lesson material: https://enccs.github.io/gpu-programming/6-directive-based-models/

#### ==Setup & Notes for "Directive-based models (OpenACC, OpenMP)"==

```bash
module purge
module load LUMI/24.03
module load partition/G
module load rocm/6.0.3
module load PrgEnv-cray/8.5.0
```

**Commands for compiling the OpenACC examples**

- For Fortran:
```bash
ftn -O2 -hacc -o ex1_acc ex1_acc.f90
```
- For C/C++:
:::danger
For the Cray compiler: OpenACC is supported only by the Cray Fortran compiler. The C and C++ compilers have no support for OpenACC.
:::

**Commands for compiling the OpenMP offloading examples**

- For C/C++ (`cc` for C, `CC` for C++):
```bash
cc -O2 -fopenmp -o ex1_omp ex1_omp.c
```
- For Fortran:
```bash
ftn -O2 -homp -o ex1_omp ex1_omp.f90
```

**Allocate an interactive resource**

```bash
srun -N 1 -n 1 -c 8 -t 1:00:00 --gres=gpu:1 -A project_465001310 -p small-g --pty $SHELL
```

or allocate the resource and launch your executable with `srun`:

```bash
salloc -A project_465001310 -N 1 -n 1 -c 8 -t 1:00:00 -p small-g --gpus-per-node=1
```

For more debug information:

```bash
export CRAY_ACC_DEBUG=2
```

---

- Do we need to create our own dir?
    - The best is to go to the scratch folder `/scratch/project_465001310` and make your own folder with `mkdir $USER`. Then clone the repository into your own folder.
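Putting those steps together as shell commands (a sketch -- adjust paths to your own setup; the repository is the lesson repo linked at the top of this page):

```bash
cd /scratch/project_465001310        # project scratch space
mkdir -p $USER && cd $USER           # your own folder
git clone https://github.com/ENCCS/gpu-programming.git
cd gpu-programming/content/examples  # the exercise sources live here
```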
- Can we go a bit slower, please?
    - We will try in a minute :)
- Is ex1 the `kernels` version and ex2 the `parallel loop` version?
- What do we compile, where, and with what command?
    - You can get the code examples via `git clone`, or you can create new files and copy/paste the code examples from the webpage lesson material into them.
    - The commands used to compile the source code are available here: https://hackmd.io/wh641ux0Q8ez0iC3yfJAqA?both#Setup-amp-Notes-for-Directive-based-models-OpenACC-OpenMP
- `srun ./vec_add_kernels` does not output anything
    - The code itself has no print statements; the output comes from `export CRAY_ACC_DEBUG=2`, which prints extra debug information.
- Is this warning expected: `This compilation contains OpenACC directives. -h acc is not active so the directives are ignored.`?
    - This warning says that one has to add `-h acc` so that the compiler generates the GPU code. Without the extra flag it is just CPU code.
- After compiling, why is he only using `srun`? Is he in an interactive shell? -- Yes, thanks.
    - Is this the command he used? `salloc -A project_465001310 -N 1 -n 1 -c 8 -t 1:00:00 -p small-g --gpus-per-node=1`
    - For 1 GPU it is OK to use `-c 8`, but for more than 1 GPU per node use `-c 7`.
        - Why is that?
            - https://lumi-supercomputer.github.io/LUMI-training-materials/User-Updates/Update-202308/lumig-lownoise/
- I get the following warning when compiling the OMP version of "Exercise: Adding data mapping clauses" with `cc -O2 -fopenmp -o ex4_omp ex4_omp.c`: `warning: ex4_omp.c:14:5: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]`
    - Which modules are loaded? Did you load the `partition/G` and `rocm` modules?
    - This is fine: the loop is not vectorized via SIMD because it is offloaded to the GPU instead. (Why exactly it gives a warning, I'm not sure.)
        - I see. Thanks :)
- Could you repeat why there wasn't much difference between the last two examples, and what the difference between them was, please?
    - Do you mean the examples with/without the explicit "memory copy"?
        - Yes, got the answer :) thank you.
- (for Stepas) What is the runtime of the OpenMP example on the whole CPU, given that the reported time was the sum over threads? Naively 20 s / 64 ≈ 0.3 s would be faster than the GPU examples, but probably it is not that simple.
    - Indeed, for small problems full parallelization over CPU cores might be faster than GPU offloading (due to the need to copy data to the GPU and back at least once); the data-mapping sketch below shows where those copies appear in the code.
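For reference, this is the kind of construct the "data mapping clauses" exercise is about -- a minimal stand-alone sketch in the spirit of the lesson's vector-add examples (an illustration, not the actual `ex4_omp.c`); the `map` clauses make the host-to-device and device-to-host copies explicit:

```c
#include <stdio.h>
#define NX (1 << 20)

static double vecA[NX], vecB[NX], vecC[NX];

int main(void) {
    for (int i = 0; i < NX; i++) { vecA[i] = 1.0; vecB[i] = 2.0; }

    /* Offload the loop to the GPU: map(to:) copies the inputs in,
       map(from:) copies the result back when the region ends. */
    #pragma omp target teams distribute parallel for \
            map(to: vecA[0:NX], vecB[0:NX]) map(from: vecC[0:NX])
    for (int i = 0; i < NX; i++)
        vecC[i] = vecA[i] + vecB[i];

    printf("vecC[0] = %f\n", vecC[0]);   /* expect 3.000000 */
    return 0;
}
```

This compiles with the `cc -O2 -fopenmp ...` command from the setup notes above.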
- Some test results, gathered on a small-g allocation (1 GPU node, CPU with only 2 cores/threads):

```
salloc -A project_465001310 -N 1 -n 1 -c 8 -t 1:00:00 -p small-g --gpus-per-node=1
srun --interactive --pty --jobid=<jobid> $SHELL

[nid005086]:~/testing> rocm-smi
========================== ROCm System Management Interface ==========================
==================================== Concise Info ====================================
Device  [Model : Revision]  Temp    Power   Partitions      SCLK    MCLK     Fan  Perf    PwrCap  VRAM%  GPU%
        Name (20 chars)     (Edge)  (Avg)   (Mem, Compute)
=======================================================================================
0       [0x0b0c : 0x00]     47.0°C  223.0W  N/A, N/A        800Mhz  1600Mhz  0%   manual  500.0W  0%     0%
        AMD INSTINCT MI200 (
=======================================================================================
================================ End of ROCm SMI Log =================================
```

Runs of `data_sum.c` in two variants of directives, with array size `100*1024*1024` and a changing size `k` of the inner computation loop:

```
a) #pragma omp parallel for
b) #pragma omp target teams distribute parallel for simd map(to:vecA[0:NX],vecB[0:NX]) map(from:vecC[0:NX])
```

| k | CPU, s | GPU, s |
| :-: | :----: | :----: |
| 1 | 0.8058 | 0.6439 |
| 16 | 0.8224 | 0.6637 |
| 64 | 9.394 | 0.7275 |
| 128 | 23.56 | 0.7657 |

It is clear how the CPU-parallelized computation starts scaling roughly linearly with the amount of work, while almost all of the GPU time is spent on fixed-cost tasks (that is, one-time memory copying).

- Barely related, but I think this plot (matrix multiplication in Julia) nicely illustrates the point: ![Screenshot_20241112_143459](https://hackmd.io/_uploads/ryBitCxMkx.png)
    - It's very much related, thanks!
- How do we allocate the GPUs so that they show up in the `rocm-smi` tool?
    - Make sure that you are actually logged into the compute node (either doing the allocation and log-in via a single command, or allocating and then starting an interactive terminal on the node). Then `rocm-smi` should show you the number of assigned GPUs.
- Probably missing something essential here? `martinli@uan03:/scratch/project_465001310/martinli/gpu-programming/content/examples/demo> rocm-smi` gives `cat: /sys/module/amdgpu/initstate: No such file or directory ERROR:root:Driver not initialized (amdgpu not found in modules)`
    - You are not logged into the GPU node. The login node does not have GPUs.
        - I did run `salloc` before `rocm-smi`...
            - `salloc` only asks for the node; `srun` will then run on the node.
                - Ok, I see. Thanks :)

---

### X. Exercise until 14:30

### 7. Non-portable kernel-based models

- Lesson material: https://enccs.github.io/gpu-programming/7-non-portable-kernel-models/
- Is there a typo on the slide? Should it not be `vector_add<<<...>>>(Ah, Bh, ...)` instead of `Ad`, `Bd` and so on? Thanks :)
- When is the array transferred in the unified memory scheme?
    - The data is transferred as needed, so it will be transferred before the kernel runs.
- Why do we subtract 1 when defining the block dimension?
    - This is a "round up" in integer arithmetic: `(N + 256 - 1) / 256`. (Okay, got it -- so that any remainder will be assigned to a full block?)
        - Yes, you always have to assign full blocks. :+1:
- I didn't quite get how the +1 avoids bank conflicts
    - Bank conflicts are mostly due to stepping through arrays using powers of two, so the same banks are repeatedly hit. The simplest remedy is to pad the fastest-moving dimension so it is not a power of two, i.e. you use `array[8][9]` instead of `array[8][8]`.
        - [Here](https://cuda-programming.blogspot.com/2013/02/bank-conflicts-in-shared-memory-in-cuda.html) is a link to an old article about bank conflicts. New architectures might be different (more banks per SMP/CU, or more threads in the warp/wavefront), but the basic idea is the same.
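A common illustration of that padding trick (a minimal CUDA sketch added for these notes, not the exercise code) is a shared-memory matrix transpose, where each tile row gets one extra element:

```c
#define TILE 32

// Shared-memory tile transpose. The "+ 1" makes each row 33 floats wide,
// so the 32 threads of a warp reading a tile column hit 32 different
// banks instead of all landing on the same one.
__global__ void transpose(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];   // padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // swap block offsets for the write
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

Launched with `dim3 threads(TILE, TILE)` and `dim3 blocks((n + TILE - 1) / TILE, (n + TILE - 1) / TILE)`; dropping the `+ 1` gives the same result, just with conflicting shared-memory accesses in the column-wise read `tile[threadIdx.x][threadIdx.y]`.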
- Do we have to do the memory optimization manually, or was that session just for practice? Are there built-in memory optimization tools that do as well as manual tuning?
    - Unfortunately, yes. We need to do the optimizations manually, but we can use tools which indicate which parts of the code are inefficient, and sometimes they give suggestions. :+1:
    - OpenMP sometimes detects certain patterns or problems and has something already implemented in the compiler. For example, a reduction in OpenMP is done by just adding a hint about which variable is to be reduced and with which operation, e.g. `!$omp target teams distribute parallel do reduction(+:locdot)`. :+1:
- What about suite-related tools, i.e. `cudaMalloc` etc.? Or do they do different tasks?
    - `cudaMalloc` and `hipMalloc` do the same thing, but one works with CUDA C and the other one with HIP C.
        - A possible replacement for the manual allocation?
            - `cudaMallocManaged` and `hipMallocManaged` allocate data that can be migrated between the CPU and the GPU.

### X. Exercise until 15:55

---

## Reflections and quick feedback

:::warning
One thing that you liked or found useful for your projects?
- Good introduction to the GPU basics. The HackMD interaction allows clearing up doubts on the go instead of waiting until the end.
- I liked HackMD. Will the (final) notes be available to us after the course? In a GitHub repo or similar?
    - These HackMD documents will stay available online; you can come back to them later.

One thing that was confusing/suboptimal, or something we should do to improve the learning experience?
- The practical session was hard to follow (+3)
- Could you state the break length in minutes, because of the time zones?
:::

---

:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::

---