
<p style="text-align: center"><b><font size=5 color=blue>GPU Programming: Why, When, and How -- Day 3</font></b></p>
:::success
**GPU Programming: Why, When, and How — Schedule**: https://hackmd.io/@yonglei/gpu-programming-schedule-2024
:::
## <font color=red>(Nov. 14) Day 3</font>
| Time | Content | Instructor |
| :---------: | :-----: | :--------: |
| 09:00-10:00 | [Preparing code for GPU porting](https://enccs.github.io/gpu-programming/11-gpu-porting/) | Cristian (Fortran to GPU), Andrey |
| 10:00-10:30 | [Recommendations and discussions](https://enccs.github.io/gpu-programming/12-recommendations/) | Yonglei |
| 10:30-10:45 | Break | |
| 10:45-11:50 | [Problem examples](https://enccs.github.io/gpu-programming/13-examples/) | Stepas, Andrey |
| 11:50-12:00 | Wrap-up | Yonglei |
| 12:00-13:00 | Lunch break | |
| 13:00-15:50 | [Bring your code session](https://hackmd.io/@yonglei/gpu-programming-byc-2024) | |
| 15:50-16:00 | Summary of this workshop | |
---
## Exercises and Links
:::warning
- Exercises for [XXX]()
:::
---
:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::
## Questions, answers and information
- Is this how to ask a question?
- Yes, and an answer will appear like so!
## 11. Preparing code for GPU porting
- Lesson material: https://enccs.github.io/gpu-programming/11-gpu-porting/
**NOTE**: If you're particularly interested in what was skipped for now, tune in for the automatic porting tools (second part of this episode) later in the session, or during the afternoon's "bring-your-own-code" session!
---
## 12. General recommendations
- Episode material: https://enccs.github.io/gpu-programming/12-recommendations/
:::success
==Discussion hints==:
- Has your mental model of how GPUs work and how they are programmed changed during this workshop?
- Do you have a better idea yet about what framework may be right for your code?
- What other questions do you have? Ask us anything!
:::
- Data parallelism and data access optimization were new to me, as was the need to consider the GPU architecture.
- The deep consideration of memory layout and the grouping of threads was a new concern.
- Seeing the principles of GPU programming is interesting, even though it is a different world for people who usually just do Python or high-level programming for data analysis/software development.
- Having only worked with low-level GPU stuff before, the high-level options were quite interesting to hear about :+1:
- How mature do you consider ROCm to be, compared to CUDA? Do you find that it takes more time to port a code to run on AMD GPUs than on NVIDIA GPUs?
- Most effort goes into porting the code the first time, to _any_ flavor. After that, there are [tools](https://enccs.github.io/gpu-programming/11-gpu-porting/#porting-between-different-gpu-frameworks) today that can translate between variants of the same approach (e.g. directive-to-directive, or vendor-to-vendor kernel translation) with good results; see the sketch below.
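  - For illustration, a minimal sketch of what such a vendor-to-vendor translation amounts to in practice: the CUDA runtime API maps almost one-to-one onto HIP (the HIP equivalents that a tool like *hipify* would emit are shown in the comments).
    ```cpp
    #include <cuda_runtime.h>  // HIP: #include <hip/hip_runtime.h>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // identical in HIP
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));      // HIP: hipMalloc(&d_x, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d_x, n);  // same launch syntax works with hipcc
        cudaDeviceSynchronize();                  // HIP: hipDeviceSynchronize();
        cudaFree(d_x);                            // HIP: hipFree(d_x);
    }
    ```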
- How much do you gain from using fine-grained control (ROCm/CUDA) as opposed to OpenACC or OpenMP, or even better, the parallel STL extensions?
- It really depends on the case, but you'll see one example in the next section. Hint: lower-level is still good to know :)
- In OpenMP it can be very difficult to use, for example, `__shared__` memory, so for that kind of problem one would see a big impact (10 to 100x); see the sketch below.
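  - A minimal sketch of the kind of `__shared__`-memory staging that is hard to express with OpenMP offloading directives: a per-block tree reduction through the fast on-chip scratchpad (CUDA/HIP kernel syntax; assumes a launch with 256 threads per block).
    ```cpp
    // Each block sums 256 elements via __shared__ memory and writes one
    // partial sum per block; this explicit on-chip staging is where the
    // low-level APIs can gain over directive-based models.
    __global__ void block_sum(const float *in, float *out, int n) {
        __shared__ float tile[256];                 // on-chip, one copy per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
            if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
    }
    ```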
- If you have already optimized the slowest part of the code 1-2 times and it still is the slowest part, when to move on?
- You might want to check your theoretical algorithm. If your program spends time on what it is _supposed_ to do, then it is already good, and you can spend _your_ time on something else.
- Is there any particular reason to use HIPify over a header that translates "GPU_method" to the corresponding CUDA/HIP function calls like we do in the Dirac code?
- The *hipify* tool might be more up to date than the header, but in practice a header-only porting approach might work for many cases (https://github.com/cschpc/hop); see the sketch below.
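  - A minimal sketch of such a portability header (the `gpu*` names here are illustrative, not hop's actual API):
    ```cpp
    // gpu.h (hypothetical): map a common gpu* API onto CUDA or HIP at
    // compile time, so application code builds unchanged with nvcc or hipcc.
    #if defined(__HIP_PLATFORM_AMD__)
      #include <hip/hip_runtime.h>
      #define gpuMalloc             hipMalloc
      #define gpuMemcpy             hipMemcpy
      #define gpuMemcpyHostToDevice hipMemcpyHostToDevice
      #define gpuFree               hipFree
    #else
      #include <cuda_runtime.h>
      #define gpuMalloc             cudaMalloc
      #define gpuMemcpy             cudaMemcpy
      #define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
      #define gpuFree               cudaFree
    #endif
    ```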
- Exercise points to the wrong account/project number (clang), and something is wrong with the CUDA container?
```
FATAL: While checking image: could not open image /scratch/project_465000485/Container/cuda_11.4.3-devel-ubuntu20.04.sif:
failed to retrieve path for /scratch/project_465000485/Container/cuda_11.4.3-devel-ubuntu20.04.sif:
lstat /pfs/lustrep2/scratch/project_465000485/Container: permission denied
```
- You will need to pull a container ```singularity pull docker://nvcr.io/nvidia/cuda:11.4.3-devel-ubuntu20.04```
- try to use `project_465001310` for the project number
- I think the present exercise has not been updated to this year's modules and project
- This is correct (unfortunately).
- Take it as extra exercise :).
- Please use this `module load rocm/6.0.3` instead.
- and ```$ export PATH=/opt/rocm-6.0.3/bin:$PATH```
- Do I need to prepare ```Container/cuda_11.4.3-devel-ubuntu20.04.sif```?
- You will need to pull the container ```singularity pull docker://nvcr.io/nvidia/cuda:11.4.3-devel-ubuntu20.04```
- Looks like it worked; I also had to change the Slurm partition from `eap`.
- Cool :)
## Exercise + Break until 10:45
## 13. Programming example(s): stencil computation
- Lesson material: https://enccs.github.io/gpu-programming/13-examples/
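For reference, the discussion below concerns a CPU loop of roughly this shape (a hedged sketch, not the lesson's exact code): one Jacobi-style sweep over a 2D grid, parallelized over rows with OpenMP, which becomes memory-bound once the grid no longer fits in cache.

```cpp
#include <vector>
#include <cstddef>

// One update sweep of a 2D stencil: each interior point becomes the
// average of its four neighbours (reading u, writing unew).
void sweep(const std::vector<double> &u, std::vector<double> &unew,
           std::size_t nx, std::size_t ny) {
    #pragma omp parallel for
    for (std::size_t i = 1; i < nx - 1; ++i)
        for (std::size_t j = 1; j < ny - 1; ++j)
            unew[i * ny + j] = 0.25 * (u[(i - 1) * ny + j] + u[(i + 1) * ny + j]
                                     + u[i * ny + j - 1] + u[i * ny + j + 1]);
}
```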
:::success
==Poll==
Are you considering using / interested in any of the following models/frameworks? (Write an 'O' next to your choices.)
- OpenMP offloading (C++) - oo
- SYCL (C++) - ooo
- Python (numba/CUDA) - oooo
- Julia - O
- Others
- maybe want to use OpenACC
- STL, cuBLAS...
:::
- "However, for larger grid sizes the parallelization becomes inefficient – as the individual chunks of the grid get too large to fit into CPU cache, threads become bound by the speed of RAM reads/writes" Question: how do deal with this issue?
- This is a memory-bound problem, so there is not much you can do about it directly; data locality is already reasonable. So, pretty much, there are no easy solutions.
- What are the hard solutions then?
- You can try improving cache locality by, e.g., padding memory so that each row is better aligned, or by controlling how exactly OpenMP schedules iterations between threads (see the padding sketch below). Various prefetch instructions could be tried, but the access pattern is regular, so they are unlikely to help much. Things like non-temporal/streaming stores might also be relevant (one would need to check how well the compiler has already optimized the code).
- Since the grid is too large for the cache, you may also consider domain decomposition with MPI, and possibly a combined MPI+OpenMP approach.
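  - As a sketch of the padding idea (the 64-byte cache-line size and grid dimensions are illustrative assumptions):
    ```cpp
    #include <cstdlib>
    #include <cstddef>

    int main() {
        const std::size_t nx = 2000, ny = 2000;
        const std::size_t line = 64 / sizeof(double);              // doubles per cache line
        const std::size_t stride = (ny + line - 1) / line * line;  // row length rounded up
        // Every row now starts on a fresh 64-byte boundary; index as
        // grid[i * stride + j], leaving the padding columns j >= ny unused.
        double *grid = static_cast<double *>(
            std::aligned_alloc(64, nx * stride * sizeof(double)));
        std::free(grid);
    }
    ```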
---
## Reflections and last feedback
:::warning
One thing that you liked or found useful for your projects?
- General overview with strengths and weaknesses of different programming models
- Having mostly worked with CUDA before, it was helpful to get an overview of other low-level and high-level options. Also, I had not realized how branching affects threads within a warp, so learning about that was useful.
One thing that was confusing/suboptimal, or something we should do to improve the learning experience?
- xx
:::
:::danger
**NOTE ON THE AFTERNOON SESSION**
The session is open-ended, with no particular schedule. If you have questions/comments/code or ideas to discuss, let us know of your presence by writing in the Zoom chat (and allow a few minutes for someone available to respond).
:::
---
:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::
---