# [ICON4Py] GPU kernel level optimizations (NVIDIA+AMD)

- Shaped by: Ioannis
- Appetite (FTEs, weeks): 1 cycle
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->

## Problem

After the performance improvements in the DaCe backends from the addition of various transformations, we are at a point where it makes sense to look in more detail at the GPU kernels generated from the SDFG maps. For this cycle we would like to profile and analyze the GPU kernels using low-level analysis tools on NVIDIA and AMD GPUs to see if we can further improve GPU utilization.

This project is beneficial for:

- Improving GH200 and A100 GPU performance
- Initial thinking for [[DaCe] Rethink What HorizontalMapSplitRange Should Be](https://hackmd.io/8k6PO_eoTgayVlVZDOx2jw?view)
- Exploring the AMD ecosystem and profilers
- Understanding the AMD GPU architecture and how to improve performance
- Comparing MI300A and GH200 GPU performance in more detail

## Appetite

Full cycle

## Solution

The tasks we would like to tackle are divided into two fronts:

- NVIDIA (GH200 and A100 GPUs)
    - Profile and improve vertical solvers (scans)
    - Profile and improve kernels that have low memory utilization
    - Profile kernels that operate on a small domain (usually halos)
- AMD (MI300A GPU)
    - Build `icon4py` on `Beverin`
    - Compare diffusion and dycore granule performance on MI300A vs GH200 GPU
    - Find out which profiling tools exist in the AMD ecosystem and how to use them
    - Make sure that we use the latest ROCm
    - Profile the AMD GPU kernels of the `vertically_implicit_solver` stencils and see how to improve them
    - Compare the standard deviation of AMD kernels to that of NVIDIA CUDA kernels

## Rabbit holes

<!-- Details about the solution worth calling out to avoid problems -->

Use the tools available in the latest ROCm (7.x.x) and avoid fixing any issues related to them.
If the latest ROCm does not work, fall back to the latest working version.

## No-gos

Don't spend a lot of time fixing the AMD software stack; report issues instead.

## Progress

<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [ ]