![](https://media.enccs.se/2025/10/Frame-7-1536x768.jpg)

<p style="text-align: center"><b><font size=5 color=blueyellow>GPU Programming: When, Why, and How - Day 3</font></b></p>

:::success
**Nov. 25-27, 2025, 09:00-12:30 (CET)**
:::

:::success
**GPU Programming: When, Why, and How - Schedule**: https://hackmd.io/@ENCCS-Training/gpu-programming-2025-schedule
:::

## Schedule

| Time | Contents |
| :---------: | :----------: |
| 09:00-09:10 | Welcome and Recap |
| 09:10-10:30 | Multi-GPU programming with MPI |
| 10:30-10:40 | Break |
| 10:40-11:30 | Example problem: stencil computation |
| 11:30-11:40 | Break |
| 11:40-12:20 | Preparing code for GPU porting <br>Recommendations |
| 12:20-12:30 | Q/A & Summary |

---

## Note for [Multiple GPU programming with MPI](https://enccs.github.io/gpu-programming/10-multiple_gpu/)

:::warning
### Slides and codes are available in this GitHub repo
```
git clone https://github.com/HichamAgueny/multigpu_mpi_course.git
cd multigpu_mpi_course

# Load the LUMI software stack
module load LUMI/24.03 partition/G
module load cpeCray
```

### Example 1: setting the GPU device
```
# For MPI-OpenACC
cd example_1/setDevice_acc
# For MPI-OpenMP
cd example_1/setDevice_omp

# Compile
./compile.sh
# Submit a job
sbatch script.slurm
# View the output file
vi setDevice_accxxxxxx.out
```

### Example 2: MPI with host staging
```
# For MPI-OpenACC
cd example_2/mpiacc
# For MPI-OpenMP
cd example_2/mpiomp

# Compile
./compile.sh
# Submit a job
sbatch script.slurm
# View the output file
vi staging_mpiacc-xxxxxx.out
```

### Example 3: GPU-aware MPI
```
# For MPI-OpenACC
cd example_3/gpuaware_mpiacc
# For MPI-OpenMP
cd example_3/gpuaware_mpiomp

# Compile
./compile.sh
# Submit a job
sbatch script.slurm
# View the output file
vi gpuaware_mpiacc-xxxxxx.out
```
:::

---

:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and the like.
:::

## Questions, answers and information

### 5. [Multiple GPU programming with MPI](https://enccs.github.io/gpu-programming/10-multiple_gpu/)

- ==Will the slides from yesterday's HZDR sessions on alpaka and C++ std porting be available?==
  - YW: I will try to get those slides from the instructors and upload them to the GitHub repo.
- ==Today's slides are also a good reference as learning material!==
  - :+1: will be updated
- Only Fortran codes were used. Can these also be ported to C/C++?
  - As Hicham explained, since OpenACC is not supported for C++ on LUMI, there are only Fortran exercises today. But these examples can easily be ported (for OpenMP).
  - You can also port OpenACC C/C++ to OpenMP using `clacc`; see https://documentation.sigma2.no/code_development/guides/cuda_translating-tools.html#translate-openacc-to-openmp-with-clacc
  - and also https://www.exascaleproject.org/highlight/clacc-an-open-source-openacc-compiler-and-source-code-translation-project/
- Is it possible to use C++ for both the OpenMP and MPI collectives used for reading the GPU topology?
  - Yes, see the function calls on the course page https://enccs.github.io/gpu-programming/10-multiple_gpu/
  - You just need to change the compiler command and flags compared to the course exercises (a C++ sketch is shown below).
- What were the observations on the achieved bandwidth when using `f90` or `C++` codes? Are they approximately the same, or is C++ the best?
  - I ran some benchmarks with MPI-CUDA C and MPI-OpenACC Fortran: the bandwidth was roughly the same. At large scale (large datasets) there could be some difference, so it also depends on the compiler.
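- For reference, here is a minimal C++ sketch in the spirit of Examples 1 and 3 above: pick a GPU per MPI rank from the node-local rank, then exchange a device buffer with GPU-aware MPI. This is only an illustration, not code from the course repo; it assumes a GPU-aware Cray MPICH (`MPICH_GPU_SUPPORT_ENABLED=1`) and an OpenMP-offload-enabled compiler wrapper, and the buffer name and size are made up.
  ```cpp
  #include <mpi.h>
  #include <omp.h>
  #include <algorithm>
  #include <cstdio>

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      // Node-local rank: split the world communicator per shared-memory node.
      MPI_Comm node_comm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                          MPI_INFO_NULL, &node_comm);
      int local_rank;
      MPI_Comm_rank(node_comm, &local_rank);

      // Bind this rank to one of the GPUs visible on the node (round-robin).
      int num_devices = omp_get_num_devices();
      int my_dev = (num_devices > 0) ? local_rank % num_devices : 0;
      if (num_devices > 0) omp_set_default_device(my_dev);
      std::printf("rank %d -> node-local rank %d -> GPU %d of %d\n",
                  rank, local_rank, my_dev, num_devices);

      // GPU-aware point-to-point: pass MPI the *device* address of the buffer
      // via use_device_ptr, so no host staging copies are needed.
      const int n = 1 << 20;
      double* buf = new double[n];
      std::fill(buf, buf + n, static_cast<double>(rank));
      #pragma omp target enter data map(to: buf[0:n])
      #pragma omp target data use_device_ptr(buf)
      {
          if (size > 1) {
              if (rank == 0)
                  MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
              else if (rank == 1)
                  MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
          }
      }
      #pragma omp target exit data map(from: buf[0:n])

      delete[] buf;
      MPI_Comm_free(&node_comm);
      MPI_Finalize();
      return 0;
  }
  ```
  On LUMI-G this would be built with the `CC` wrapper and OpenMP offload enabled (after loading the GPU programming environment), analogous to what the `compile.sh` scripts above do for the Fortran versions.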
- It seems that the GPU-aware OpenACC example is slower: I get ~0.019 GB/s, while the OpenMP version goes up to ~80 GB/s.
  - Can you re-run it?
  - I have gotten the same result for 4 different runs and re-compiles of the OpenACC example.
  - For OpenACC, I also get ~0.019 GB/s.
  - I ran the exercise from the course page repo (no bandwidth report, just the time) and the OpenMP version is >40% faster.
  - You are right, there was a factor missing (the recent change had not been committed). This is now fixed; you should be able to get about 76-80 GB/s.
  - ==All the executions report the same numbers across the runs==

---

### 6. [Multi-framework example: stencil computation](https://enccs.github.io/gpu-programming/13-examples/)

---

### 7. [Preparing code for GPU porting](https://enccs.github.io/gpu-programming/11-gpu-porting/)

#### Question:
- When running the hipify example `/gpu-lecture_day_3/stencil/gpu-programming/content/examples/exercise_hipify/Hipify_clang`, ~~I ran into the following problem~~: ~~`srun: error: Unable to create step for job 15001777: More processors requested than permitted`~~ (I was still in the SLURM job of the previous exercise)
  - the script referred to as `./run_hipify-clang` is actually named `./run_hipify-clang.sh`
  - ~~also: `.../Hipify_perl` runs into the same SLURM error~~
  - ==The file `compile_acc.sh` is missing execute permission - this needs to be fixed with `chmod` (e.g. `chmod +x compile_acc.sh`).==
  - (a small sketch of what hipify produces is appended at the end of this page)
- How long will the project allocation for this course remain open, to finish the remaining exercises or play with the code?
- Thank you very much for an informative and well-run course! :+1:

:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::

---
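#### Appendix: what hipify produces (reference sketch)

For the hipify question above: a minimal, generic illustration of the kind of translation `hipify-clang`/`hipify-perl` perform. This is not a file from the course repository, and error checking is omitted. CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaFree`, ...) are rewritten to their `hip*` equivalents and the HIP runtime header is typically added, while kernel bodies are left untouched; depending on the hipify version, the `<<<...>>>` launch is either kept or rewritten to `hipLaunchKernelGGL`.

```cpp
// Translated (HIP) version of a tiny CUDA example; the "was ..." comments
// mark what hipify changed relative to the original CUDA source.
#include <hip/hip_runtime.h>   // added by hipify
#include <vector>
#include <cstdio>

__global__ void scale(double* x, double a, int n) {  // kernel source unchanged
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    std::vector<double> h(n, 1.0);
    double* d = nullptr;

    hipMalloc((void**)&d, n * sizeof(double));            // was: cudaMalloc
    hipMemcpy(d, h.data(), n * sizeof(double),
              hipMemcpyHostToDevice);                     // was: cudaMemcpy(..., cudaMemcpyHostToDevice)
    scale<<<(n + 255) / 256, 256>>>(d, 2.0, n);           // launch syntax kept here
    hipMemcpy(h.data(), d, n * sizeof(double),
              hipMemcpyDeviceToHost);                     // was: cudaMemcpy(..., cudaMemcpyDeviceToHost)
    hipFree(d);                                           // was: cudaFree
    std::printf("h[0] = %f (expected 2.0)\n", h[0]);
    return 0;
}
```

Such a translated file is then compiled with `hipcc` (or the corresponding HIP-enabled compiler wrapper on LUMI).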