
<p style="text-align: center"><b><font size=5 color=blueyellow>GPU Programming: Why, When, and How - Day 2</font></b></p>
:::success
**GPU Programming: Why, When, and How — Schedule**: https://hackmd.io/@yonglei/gpu-programming-schedule-2024
:::
## <font color=red>(Nov. 13) Day 2</font>
| Time | Content | Instructor |
| :---------: | :-----: | :--------: |
| 09:00-10:30 | [Portable kernel-based models (C++ stdpar, OpenCL, SYCL)](https://enccs.github.io/gpu-programming/8-portable-kernel-models/) | Andrey |
| 10:30-10:40 | Break | |
| 10:40-12:00 | Exercises (on various models) | |
| 12:00-13:00 | Lunch break | |
| 13:00-14:15 | [High-level language support](https://enccs.github.io/gpu-programming/9-language-support/) | Yonglei, Hicham |
| 14:15-14:30 | Break | |
| 14:30-15:45 | [Multi-GPU programming with MPI](https://enccs.github.io/gpu-programming/10-multiple_gpu/) | Hicham, Andrey |
| 15:45-16:00 | Q/A | |
---
## Exercises and Links
:::warning
- Exercises for [XXX]()
:::
---
:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::
## Questions, answers and information
- Is this how to ask a question?
- Yes, and an answer will appear like so!
### 8. Portable kernel-based models
- Lesson material: https://enccs.github.io/gpu-programming/8-portable-kernel-models/
- Can one mix two models, such as StdPar and SYCL, to get more control only when needed?
- It is possible to use two (or more) programming models in the same file: not only StdPar and SYCL, but also OpenMP offloading, native CUDA/HIP, etc. (in the latter case you sacrifice portability). Sharing data between them is harder (i.e., ensuring there are no unnecessary copies when a StdPar algorithm and a SYCL kernel are launched on the same device), since there is no unified notion of a buffer; but it can work if you are willing to limit yourself to a particular implementation.
- E.g., AdaptiveCpp implements StdPar on top of SYCL, so you can use the same USM pointers in both and it will work fine; but that goes beyond what either standard can guarantee and might not work with other compilers. See the sketch after this answer.
- Similarly, both AdaptiveCpp and Intel oneAPI (not sure about nvc++ and roc-stdpar) have extensions that give more control over StdPar (like specifying which device to use). That can be another approach here.
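- A minimal sketch of such mixing, assuming AdaptiveCpp's StdPar-on-SYCL implementation (sharing USM pointers between the two models is an AdaptiveCpp behaviour, not something the standards guarantee, so other compilers may differ):
- ```cpp
  // Illustrative only: the same USM allocation used from StdPar and SYCL.
  #include <sycl/sycl.hpp>
  #include <algorithm>
  #include <execution>

  int main()
  {
      constexpr size_t n = 1024;
      sycl::queue q{sycl::default_selector_v};
      float* x = sycl::malloc_shared<float>(n, q);

      // StdPar: simple, no explicit queue or index handling.
      std::for_each_n(std::execution::par_unseq, x, n,
                      [](float& v) { v = 1.0f; });

      // SYCL: drop down to it when more control is needed
      // (explicit queue, range, indexing, dependencies).
      q.parallel_for(sycl::range<1>(n),
                     [=](sycl::id<1> i) { x[i] *= 2.0f; }).wait();

      sycl::free(x, q);
  }
  ```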
- Is `do concurrent` in Fortran comparable to `std::for_each` StdPar in C++ for GPU offloading?
- Similar, yes: both express a data-parallel loop that a compiler can offload to the GPU; see the sketch below. Not really a Fortran expert, though (--Andrey)
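- For reference, a minimal C++ StdPar sketch of such a loop (illustrative; with an offloading compiler such as `nvc++ -stdpar=gpu` it can run on the GPU, much like `do concurrent` with `nvfortran -stdpar=gpu`):
- ```cpp
  // The C++ analogue of a simple Fortran `do concurrent` loop.
  #include <algorithm>
  #include <execution>
  #include <vector>

  int main()
  {
      const std::size_t n = 1 << 20;
      std::vector<float> x(n, 1.0f);
      const float scale = 2.0f;

      // Equivalent of:  do concurrent (i = 1:n)
      //                   x(i) = scale * x(i)
      //                 end do
      std::for_each(std::execution::par_unseq, x.begin(), x.end(),
                    [scale](float& xi) { xi = scale * xi; });
  }
  ```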
- In SYCL, is it possible to define a kernel as a function, similar to how it is done in CUDA?
- Sort of. It cannot be a plain *function* (when the kernel is invoked, only the work-item index is passed to it, so you need some other way to supply the remaining arguments), but it can be a *functor* (function-like object), either defined explicitly as a class or returned from a helper function. Example of the latter:
- ```cpp
auto kernel(float* gm_x, const float* gm_v, float step)
{
return [=](sycl::id<1> id)
{
gm_x[id] = gm_x[id] + gm_v[id] * step;
};
}
// `kernel` actually returns a `struct` which has fields containing gm_x, gm_v, and step, as well as has `operator()` defined.
// But we don't need to think about all that complexity in most cases, and just use it like this:
q.submit([&](sycl::handler& cgh)
{
cgh.parallel_for(sycl::range<1>(n),
kernel(d_x, d_v, step));
});
// ^^^ Might look a bit like dark magic, but not much different from CUDA's <<<...>>> syntax
```
- For acpp, how much of a difference is there between `-O2` and `-O3` optimization?
    - It depends :) There is also the `-ffast-math` flag you can try. In my experience there is typically nothing dramatic (not the 10x speed-up you can sometimes get from going from `-O0` to `-O1`). But for HPC use you rarely want _less_ optimization, so `-O3` is a good default. There are cases where you might not want some of the things `-O3` does, but then your main option is controlling individual optimization flags manually: the `-O[0-3]` flags are just shorthands for pre-determined sets of optimizations, so you can selectively enable the ones you need.
- If you have completed the SYCL exercise, please put a "O" down here:
- O
- O
---
### 9. High-level language support
- Lesson material: https://enccs.github.io/gpu-programming/9-language-support/
:::success
==**Discussion**==
Have you used Python for programming/data analysis?
- Yes and almost daily use: o o o o o o
- yes and random use: o(matplotlib) o o
- never use: o o
- Others
- Mathematica & C++
Have you used Julia for programming/data analysis?
- Yes and almost daily use: o o
- yes and random use:
- never use: o o o o o
- Others
- Mathematica & C++
:::
#### Exercise until XX:58 and share results
### 9.2 HIP-Python
Presentation and hands-on examples can be accessed from:
```
$ git clone https://github.com/HichamAgueny/HIP-Python_examples.git
```
or from the project folder on LUMI:
```
$ cd /project/project_465001310/workshop_software/HIP-Python_examples
```
- After sourcing the environment, `hip` is not found?
```
# Allocate an interactive GPU node (30 min, 1 GPU)
salloc -A project_465001310 -t 00:30:00 -p dev-g -N 1 --gpus 1
```
```
# Load the LUMI GPU software stack and Cray Python
module load LUMI/24.03 partition/G
module load cray-python/3.11.7
```
```
# Create a virtual environment, then install HIP-Python (from TestPyPI) and NumPy
python -m venv MyVirtEnv_hip_pyt
source MyVirtEnv_hip_pyt/bin/activate
python -m pip install --upgrade pip
python -m pip install -i https://test.pypi.org/simple/ hip-python
python -m pip install numpy
```
```
# In later sessions, just re-activate the environment
source MyVirtEnv_hip_pyt/bin/activate
```
#### X. Break until XX:30
:::danger
==Upcoming Julia training events==
- https://enccs.se/events/gpu-programming-using-julia-a-practical-intro-webinar/
- https://enccs.se/events/julia-for-hpc-autumn-2024/
:::
### 10. Multiple GPU programming with MPI
- Lesson material: https://enccs.github.io/gpu-programming/10-multiple_gpu/
- Presentation and hands-on examples can be accessed from the commands below (a minimal GPU-aware MPI sketch follows them):
```
$ git clone https://github.com/HichamAgueny/multiGPU_MPI_examples.git
```
- or
```
$ cd /project/project_465001310/workshop_software/multiGPU_MPI_examples
```
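- The lesson's bandwidth test exchanges messages of increasing size between two MPI ranks, each driving its own GPU. Below is a minimal, illustrative C++ ping-pong sketch of the same idea, assuming HIP and a GPU-aware MPI library (e.g. with `MPICH_GPU_SUPPORT_ENABLED=1` on LUMI); it is not the lesson's actual MPI-OpenACC code:
```cpp
// Run with 2 MPI ranks; each rank is bound to its own GPU.
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    hipSetDevice(rank);  // adjust to the node's GPU count if needed

    for (size_t bytes = 128; bytes <= (1 << 30); bytes *= 2)
    {
        char* d_buf = nullptr;
        hipMalloc((void**)&d_buf, bytes);
        int count = static_cast<int>(bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        // Device pointers are passed directly to MPI; a GPU-aware library
        // moves the data GPU-to-GPU without staging through the host.
        if (rank == 0) {
            MPI_Send(d_buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(d_buf, count, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(d_buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(d_buf, count, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
        double dt = MPI_Wtime() - t0;

        if (rank == 0)  // the round trip moves the message twice
            printf("Size (B) %zu  Time (s) %.5f  Bandwidth (GB/s) %.5f\n",
                   bytes, dt, 2.0 * bytes / dt / 1e9);
        hipFree(d_buf);
    }
    MPI_Finalize();
    return 0;
}
```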
- Share computational results
```
--Measuring the Bandwidth during Data Transfer: MPI-OpenACC test--
--nbr of MPI processes: 2
--nbr of gpus on each node: 2
--nbr of nodes: 1
--Time (s) 0.00002 Data size (B) 128 Bandwidth (GBps) 0.01127
--Time (s) 0.00002 Data size (B) 256 Bandwidth (GBps) 0.02279
--Time (s) 0.00003 Data size (B) 512 Bandwidth (GBps) 0.03812
--Time (s) 0.00003 Data size (B) 1024 Bandwidth (GBps) 0.07676
--Time (s) 0.00003 Data size (B) 2048 Bandwidth (GBps) 0.16091
--Time (s) 0.00003 Data size (B) 4096 Bandwidth (GBps) 0.26746
--Time (s) 0.00003 Data size (B) 8192 Bandwidth (GBps) 0.55175
--Time (s) 0.00003 Data size (B) 16384 Bandwidth (GBps) 1.03747
--Time (s) 0.00004 Data size (B) 32768 Bandwidth (GBps) 1.58152
--Time (s) 0.00006 Data size (B) 65536 Bandwidth (GBps) 2.04924
--Time (s) 0.00011 Data size (B) 131072 Bandwidth (GBps) 2.42541
--Time (s) 0.00019 Data size (B) 262144 Bandwidth (GBps) 2.80556
--Time (s) 0.00035 Data size (B) 524288 Bandwidth (GBps) 3.01661
--Time (s) 0.00071 Data size (B) 1048576 Bandwidth (GBps) 2.93665
--Time (s) 0.00104 Data size (B) 2097152 Bandwidth (GBps) 4.04334
--Time (s) 0.00220 Data size (B) 4194304 Bandwidth (GBps) 3.80894
--Time (s) 0.00434 Data size (B) 8388608 Bandwidth (GBps) 3.86179
--Time (s) 0.00718 Data size (B) 16777216 Bandwidth (GBps) 4.67406
--Time (s) 0.01317 Data size (B) 33554432 Bandwidth (GBps) 5.09540
--Time (s) 0.02545 Data size (B) 67108864 Bandwidth (GBps) 5.27424
--Time (s) 0.04962 Data size (B) 134217728 Bandwidth (GBps) 5.40975
--Time (s) 0.09821 Data size (B) 268435456 Bandwidth (GBps) 5.46681
--Time (s) 0.19425 Data size (B) 536870912 Bandwidth (GBps) 5.52768
--Time (s) 0.38879 Data size (B) 1073741824 Bandwidth (GBps) 5.52351
```
-
```
--Measuring the Bandwidth during Data Transfer: MPI-OpenACC test--
--nbr of MPI processes: 2
--nbr of gpus on each node: 2
--nbr of nodes: 1
--Time (s) 0.00001 Data size (B) 128 Bandwidth (GBps) 0.03234
--Time (s) 0.00054 Data size (B) 256 Bandwidth (GBps) 0.00095
--Time (s) 0.00001 Data size (B) 512 Bandwidth (GBps) 0.16500
--Time (s) 0.00003 Data size (B) 1024 Bandwidth (GBps) 0.07522
--Time (s) 0.00003 Data size (B) 2048 Bandwidth (GBps) 0.12547
--Time (s) 0.00003 Data size (B) 4096 Bandwidth (GBps) 0.24359
--Time (s) 0.00063 Data size (B) 8192 Bandwidth (GBps) 0.02599
--Time (s) 0.00005 Data size (B) 16384 Bandwidth (GBps) 0.59684
--Time (s) 0.00005 Data size (B) 32768 Bandwidth (GBps) 1.20071
--Time (s) 0.00005 Data size (B) 65536 Bandwidth (GBps) 2.42263
--Time (s) 0.00006 Data size (B) 131072 Bandwidth (GBps) 4.64262
--Time (s) 0.00006 Data size (B) 262144 Bandwidth (GBps) 8.51413
--Time (s) 0.00007 Data size (B) 524288 Bandwidth (GBps) 14.92709
--Time (s) 0.00009 Data size (B) 1048576 Bandwidth (GBps) 23.43134
--Time (s) 0.00013 Data size (B) 2097152 Bandwidth (GBps) 32.32676
--Time (s) 0.00018 Data size (B) 4194304 Bandwidth (GBps) 45.86786
--Time (s) 0.00030 Data size (B) 8388608 Bandwidth (GBps) 56.48567
--Time (s) 0.00052 Data size (B) 16777216 Bandwidth (GBps) 64.95599
--Time (s) 0.00096 Data size (B) 33554432 Bandwidth (GBps) 70.25147
--Time (s) 0.00197 Data size (B) 67108864 Bandwidth (GBps) 68.24202
--Time (s) 0.00391 Data size (B) 134217728 Bandwidth (GBps) 68.60877
--Time (s) 0.00752 Data size (B) 268435456 Bandwidth (GBps) 71.38320
--Time (s) 0.01448 Data size (B) 536870912 Bandwidth (GBps) 74.14020
--Time (s) 0.02831 Data size (B) 1073741824 Bandwidth (GBps) 75.85600
```
- The exercise compile script (it seems to pull two different Fortran versions?) and the submit script (wrong account) have errors.
- ==Some C++/OpenMP/MPI examples are available at https://enccs.github.io/gpu-programming/10-multiple_gpu/== (see the sketch below for how such codes hand device buffers to MPI)
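- For the C++/OpenMP variant, the key idea is exposing the device address of a mapped buffer to MPI, analogous to OpenACC's `host_data use_device`. A minimal, illustrative sketch (assuming a GPU-aware MPI library; run with 2 ranks):
```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    std::vector<double> buf(n, rank);
    double* p = buf.data();

    // Map the buffer to the device for the duration of this region.
    #pragma omp target data map(tofrom: p[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            p[i] *= 2.0;                      // some device-side work

        // Inside this block, `p` refers to the device address.
        #pragma omp target data use_device_ptr(p)
        {
            if (rank == 0)
                MPI_Send(p, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(p, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}
```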
## Reflections and quick feedback
:::warning
One thing that you liked or found useful for your projects?
- Very good demo/type-along parts for Julia and Python
One thing that was confusing/suboptimal, or something we should do to improve the learning experience?
- xx
:::
---
:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::
---