
<p style="text-align: center"><b><font size=5 color=blue>GPU Programming: When, Why, and How - Day 3</font></b></p>
:::success
**Nov. 25-27, 2025, 09:00-12:30 (CET)**
:::
:::success
**GPU Programming: When, Why, and How - Schedule**: https://hackmd.io/@ENCCS-Training/gpu-programming-2025-schedule
:::
## Schedule
| Time | Contents |
| :---------: | :----------: |
| 09:00-09:10 | Welcome and Recap |
| 09:10-10:30 | Multi-GPU programming with MPI |
| 10:30-10:40 | Break |
| 10:40-11:30 | Example problem: stencil computation |
| 11:30-11:40 | Break |
| 11:40-12:20 | Preparing code for GPU porting <br>Recommendations |
| 12:20-12:30 | Q/A & Summary |
---
## Note for [Multiple GPU programming with MPI](https://enccs.github.io/gpu-programming/10-multiple_gpu/)
:::warning
### Slides and codes are available in this GitHub repo
```
git clone https://github.com/HichamAgueny/multigpu_mpi_course.git
cd multigpu_mpi_course

# Load the LUMI software stack
module load LUMI/24.03 partition/G
module load cpeCray
```
### Example 1:
```
# For MPI-OpenACC
cd example_1/setDevice_acc
# For MPI-OpenMP
cd example_1/setDevice_omp

# Compile
./compile.sh
# Submit a job
sbatch script.slurm
# View the output file
vi setDevice_accxxxxxx.out
```
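
Example 1 shows how each MPI rank selects its own GPU. The course code is Fortran, but the same idea can be sketched in C++ with OpenMP offloading (an illustration only, not the course code; the variable names are made up): the node-local rank is obtained from an `MPI_COMM_TYPE_SHARED` sub-communicator and mapped to a device with `omp_set_default_device` (with OpenACC the corresponding call would be `acc_set_device_num`).
```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Ranks on the same node get consecutive numbers in a node-local communicator
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Map each node-local rank to one of the visible GPUs (round-robin if oversubscribed)
    int num_devices = omp_get_num_devices();
    int my_device   = (num_devices > 0) ? local_rank % num_devices : 0;
    omp_set_default_device(my_device);

    std::printf("rank %d (local %d) uses device %d of %d\n",
                rank, local_rank, my_device, num_devices);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```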
### Example 2:
```
# For MPI-OpenACC
cd example_2/mpiacc
# For MPI-OpenMP
cd example_2/mpiomp

# Compile
./compile.sh
# Submit a job
sbatch script.slurm
# View the output file
vi staging_mpiacc-xxxxxx.out
```
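
Example 2 stages the communication through host memory: device data is copied back to the host, handed to MPI from there, and the received data is copied back to the device. A minimal C++/OpenMP sketch of that pattern (illustration only; the message size and names are assumptions, not the course code):
```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                              // illustrative message size
    std::vector<double> buf(n, rank);
    double *data = buf.data();
    #pragma omp target enter data map(to: data[0:n])    // keep a copy on the device

    if (size >= 2) {
        if (rank == 0) {
            // Stage out: bring device data back to the host before calling MPI
            #pragma omp target update from(data[0:n])
            MPI_Send(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // Stage in: push the freshly received host data to the device
            #pragma omp target update to(data[0:n])
        }
    }

    #pragma omp target exit data map(delete: data[0:n])
    MPI_Finalize();
    return 0;
}
```
In OpenACC the staging would use `update self(...)` / `update device(...)` instead.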
### Example 3:
```
# For MPI-OpenACC
cd example_3/gpuaware_mpiacc
# For MPI-OpenMP
cd example_3/gpuaware_mpiomp

# Compile
./compile.sh
# Submit a job
sbatch script.slurm
# View the output file
vi gpuaware_mpiacc-xxxxxx.out
```
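
Example 3 avoids the host staging altogether: with GPU-aware MPI the device address is handed directly to the MPI call. In OpenMP that is done inside a `use_device_ptr` region (the OpenACC counterpart is `host_data use_device`); on LUMI the GPU-aware path typically also needs `MPICH_GPU_SUPPORT_ENABLED=1` set in the job script. Again only a hedged C++ sketch, not the course code:
```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                              // illustrative message size
    std::vector<double> buf(n, rank);
    double *data = buf.data();
    #pragma omp target enter data map(to: data[0:n])

    if (size >= 2) {
        // Inside this region `data` is the device address: MPI moves GPU memory directly
        #pragma omp target data use_device_ptr(data)
        {
            if (rank == 0)
                MPI_Send(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(data, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    #pragma omp target exit data map(from: data[0:n])
    MPI_Finalize();
    return 0;
}
```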
:::
---
:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::
## Questions, answers and information
### 5. [Multiple GPU programming with MPI](https://enccs.github.io/gpu-programming/10-multiple_gpu/)
- ==Will yesterday's HZDR slides on alpaka and C++ std porting be available?==
- YW: I will ask the instructors for these slides and upload them to the GitHub repo.
- ==Today's slides are also good reference/learning material!==
- :+1: will be updated
- Only Fortran codes were used. Can these also be ported to C/C++?
- As Hicham explained, since OpenACC is not supported for C++ on LUMI, there are only Fortran exercises today. But these examples can easily be ported (for OpenMP).
- And you can port OpenACC C/C++ to OpenMP using `clacc`; see https://documentation.sigma2.no/code_development/guides/cuda_translating-tools.html#translate-openacc-to-openmp-with-clacc
- and also https://www.exascaleproject.org/highlight/clacc-an-open-source-openacc-compiler-and-source-code-translation-project/
- For both the OpenMP and MPI calls used for reading the GPU topology, is it possible to use C++?
- Yes, see the function calls on the course page https://enccs.github.io/gpu-programming/10-multiple_gpu/
- You just need to change the compiler command and flags of course compared to the course exercises.
- What were the observations on the achieved bandwidth when using `f90` or `C++` codes? Are they approximately the same, or is `C++` best?
- I did run some benchmarks with MPI-CUDA C and MPI-OpenACC Fortran - the bandwidth was roughly the same. Probably at large scale (large datasets) there could be some difference. So it also depends on the compiler.
- It seems that the GPU-aware OpenACC example is slower: I get ~0.019 GBps. The OpenMP version goes up to ~80 GBps.
- Can you re-run it again?
- I have gotten the same result for 4 different runs and re-compiles of the OpenACC example.
- For OpenACC, I also get ~0.019 GBps.
- I ran the exercise from the course page repo (no bandwidth report, just time) and the OpenMP version is >40% faster.
- You are right, there was a factor missing (the recent change was not committed). This is now fixed. You should be able to get about 76-80 GBps.
- ==All the executions report the same numbers across runs==
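
For context on the GB/s figures discussed above: such a number is normally obtained by timing a repeated exchange and dividing the bytes moved by the elapsed time, which is exactly where a missing factor (counting only one direction, or dropping the element size) skews the result. A rough host-side C++ sketch of the measurement, with assumed message size and iteration count (the course benchmark itself is Fortran); run it with at least two ranks:
```cpp
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 24;                  // assumption: 16 Mi doubles per message
    const int iters = 100;                  // assumption: number of ping-pong repetitions
    std::vector<double> buf(n, 1.0);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), n, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), n, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        // Two messages per iteration, sizeof(double) bytes per element: the kind of
        // factor that is easy to drop and then under-report the bandwidth.
        double gbytes = 2.0 * iters * n * sizeof(double) / 1.0e9;
        std::printf("effective bandwidth: %.2f GB/s over %.3f s\n", gbytes / elapsed, elapsed);
    }
    MPI_Finalize();
    return 0;
}
```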
---
### 6. [Multi-framework example: stencil computation](https://enccs.github.io/gpu-programming/13-examples/)
---
### 7. [Preparing code for GPU porting](https://enccs.github.io/gpu-programming/11-gpu-porting/)
#### Question:
- When running the hipify example `/gpu-lecture_day_3/stencil/gpu-programming/content/examples/exercise_hipify/Hipify_clang`, ~~I ran into the following problem~~:
~~`srun: error: Unable to create step for job 15001777: More processors requested than permitted`~~ (I was still in the SLURM job of the previous exercise)
- the script `./run_hipify-clang` seems to actually be named `./run_hipify-clang.sh`
- ~~also: `.../Hipify_perl` runs into the same SLURM error~~
- ==The file `compile_acc.sh` is missing execute permission for the user - this needs to be corrected with `chmod` (e.g. `chmod +x compile_acc.sh`).==
- How long will the project allocation for this course remain open to complete remaining exercises or play with the code?
- Thank you very much for an informative and well-run course! :+1:
:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::
---