---
# System prepended metadata

title: AMD hackathon 2026-04-20

---

## When

Tuesday afternoon

## Attendees

Christos, Daniel, Edoardo, Hannes, Ioannis, Philip, Will, (Andreas?)

## Preparation

see https://github.com/C2SM/icon4py/pull/1047

and https://github.com/dganellari/icon4py/blob/amd_profiling/amd_scripts/PROFILING_RESULTS.md

NCU profile from GH200: https://polybox.ethz.ch/index.php/s/22F7mtw9wE9KqCk

### Before the hackathon
- [x] Prepare a walkthrough of what a GT4Py program looks like in icon4py, what the initial and final SDFG look like, and what the generated HIP code looks like
    - [ ] GT4Py, SDFG: **[Philip]**
    - [ ] HIP code: **[Ioannis]**
- [x] Make everyone aware of the steps to execute a single icon4py stencil and the greenline test, and how to profile them on our system

### During the hackathon
- [ ] Ask whether they went through the generated code and whether there is anything we can improve there
    - [x] ~~i.e. why is `fuse_tasklet` beneficial? **[Edoardo]**~~ No longer the case on latest gt4py
- [ ] `hipMalloc{Async}` performance **[Edoardo]**
- [ ] HIP kernel launch configuration **[Yakup, Philip]**
    - [ ] Work group size: for the moment (256, 1) is the best
    - [ ] Work group assignments to XCDs
    - [ ] Figure out if CPX is better than SPX on the AMD system
- [ ] How to set the SGPR/VGPR limits (registers) **[Ioannis]**
- [ ] `gpu_kloop` (each thread goes through multiple vertical levels) **[Philip, Ioannis]**
- [ ] Fix roofline generation, which is broken by the overly long file name **[Ioannis]**
- [x] Investigate loading neighbor tables to shared memory (swizzling, etc.) **[Daniel, Ioannis]**
- [ ] Investigate further kernel fusion (launch fewer kernels)
    - [ ] Map fusion is currently restricted by neighbor accesses and data dependencies, so improving it is not straightforward at the moment. Strategies we can try in the future include kernel inlining, use of shared memory, or use of extra buffers to increase coalesced memory accesses
    - [ ] Is it possible to reduce kernel launches? **[Daniel, Philip]**
    - [ ] Is it possible to launch kernels in parallel in different streams? Do we need a different stream assignment? **[Yakup]**
- [ ] Reevaluate `reframe` tests with any changes **[Christos]**
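
The launch configuration and `gpu_kloop` items above can be made concrete with a small sketch. It is plain Python, not the code GT4Py/DaCe generates; the function names, the `levels_per_thread` parameter, and the domain sizes are hypothetical and chosen only for illustration. It shows how the number of work groups changes when each thread loops over several vertical levels instead of one.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def launch_config(n_horizontal, n_levels, block=(256, 1), levels_per_thread=1):
    """Return (grid_x, grid_y), the number of work groups per dimension.

    block: threads per work group along (horizontal, vertical).
    levels_per_thread: vertical levels each thread loops over (the "k-loop").
    All names here are hypothetical, not GT4Py/DaCe API.
    """
    block_x, block_y = block
    grid_x = ceil_div(n_horizontal, block_x)
    grid_y = ceil_div(n_levels, block_y * levels_per_thread)
    return grid_x, grid_y


def levels_for_thread(thread_k, n_levels, levels_per_thread):
    """Vertical levels visited by the thread with global vertical index thread_k."""
    start = thread_k * levels_per_thread
    return range(start, min(start + levels_per_thread, n_levels))


# Hypothetical domain: 20480 horizontal points, 80 vertical levels.
one_level = launch_config(20480, 80)                         # one level per thread
full_kloop = launch_config(20480, 80, levels_per_thread=80)  # one thread per column
print(one_level, full_kloop)
```

With a full k-loop the vertical grid dimension collapses to one, so far fewer work groups are launched and each thread does more work. Whether that trade-off pays off on the AMD system is exactly what the items above are meant to find out.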

### Homework

- [ ] Use the HIP async memory pool from CuPy
- [ ] K-unrolling should be even more beneficial on AMD than NVIDIA: understand why we don't see this
- [ ] Try horizontal loop blocking **[Ioannis]**
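
As a reference point for the horizontal loop blocking item, here is a minimal pure-Python sketch of the idea, assuming a simple neighbor-sum stencil. The function names, the data layout (`field[k][i]`), and the ring connectivity are hypothetical and not taken from icon4py. The horizontal range is split into blocks, and each block's slice of the neighbor table is reused across all vertical levels before moving on to the next block.

```python
def neighbor_sum_reference(field, neighbors):
    """Unblocked reference: out[k][i] = sum of field[k][j] over neighbors j of i."""
    return [[sum(row[j] for j in nbrs) for nbrs in neighbors] for row in field]


def neighbor_sum_blocked(field, neighbors, block_size):
    """Same stencil, iterating over horizontal blocks in the outer loop."""
    n_levels = len(field)
    n_cells = len(neighbors)
    out = [[0.0] * n_cells for _ in range(n_levels)]
    for start in range(0, n_cells, block_size):
        # Slice of the neighbor table for this block, reused for every level.
        block_nbrs = neighbors[start:start + block_size]
        for k in range(n_levels):
            row = field[k]
            for offset, nbrs in enumerate(block_nbrs):
                out[k][start + offset] = sum(row[j] for j in nbrs)
    return out


# Hypothetical toy data: 5 cells on a ring, 2 vertical levels.
neighbors = [[(i - 1) % 5, (i + 1) % 5] for i in range(5)]
field = [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]]
blocked = neighbor_sum_blocked(field, neighbors, block_size=2)
print(blocked == neighbor_sum_reference(field, neighbors))
```

The blocked and unblocked versions produce identical results; only the iteration order changes. On the GPU the block size would be a tuning parameter.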
