# [Greenline] Muphys clean-up, performance improvement, paper
<!-- VEB Optimizer -->
- Shaped by: Will, Hannes
- Appetite (FTEs, weeks): 1/2 cycle
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
Initial performance numbers for muphys-gt4py were reported at the ICON all-hands meeting (Oct. 27-29). The comparison with Kokkos was not favorable for GT4Py: passable for the DaCe backend, poor for GTFN. Lex (SPCL) and Georgiana (DKRZ) suggest a joint publication that uses muphys to evaluate different programming paradigms, e.g., Kokkos, SYCL, GT4Py, and Fortran with the Fortran2DaCe compiler. In such a publication, the GT4Py version should shine in terms of both readability and performance.
## Appetite
MuPhys-GT4Py has always been a "hobby". In the mid-term, however, this graupel scheme might gain acceptance in the NWP-DWD-MCH-EXCLAIM sphere. Having GT4Py represented in a comparative paper is an opportunity for higher visibility.
## Solution
The Muphys-GT4Py implementation currently in icon4py:main is the most naive one possible. The code needs to be cleaned up, which might already lead to better performance. Hannes has made a number of suggestions, including using the new "named collections". We should also consider possible new GT4Py features that could make the code more elegant.
In order to allow others to make changes to Muphys-GT4Py, it is necessary to **automate the verification** procedure, which uses CDO to compare fields with the Fortran results.
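As a rough illustration of what an automated check could look like, the sketch below compares candidate fields against reference fields with a relative tolerance. It is a dependency-free stand-in, not the CDO-based procedure: the function name `verify_fields`, the field names, and the tolerance values are all assumptions, and in practice the values would be read from the Fortran and GT4Py output files.

```python
import math
from typing import Dict, Mapping, Sequence


def verify_fields(
    reference: Mapping[str, Sequence[float]],
    candidate: Mapping[str, Sequence[float]],
    rel_tol: float = 1e-12,  # hypothetical tolerance, to be agreed upon
    abs_tol: float = 0.0,
) -> Dict[str, float]:
    """Return {field_name: max_abs_diff} for every field outside the tolerance."""
    failures: Dict[str, float] = {}
    for name, ref_values in reference.items():
        cand_values = candidate[name]  # raises KeyError if a field is missing
        if len(ref_values) != len(cand_values):
            raise ValueError(f"field {name!r}: sizes differ")
        within_tolerance = all(
            math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
            for r, c in zip(ref_values, cand_values)
        )
        if not within_tolerance:
            # Report the largest pointwise deviation to help locate the problem.
            failures[name] = max(abs(r - c) for r, c in zip(ref_values, cand_values))
    return failures


if __name__ == "__main__":
    # Hypothetical flattened fields standing in for Fortran and GT4Py results.
    fortran = {"t": [280.0, 281.5], "qv": [1e-3, 2e-3]}
    gt4py = {"t": [280.0, 281.5 + 1e-3], "qv": [1e-3, 2e-3]}
    print(verify_fields(fortran, gt4py))
```

Returning the failing fields with their largest deviation, rather than a single pass/fail flag, makes it easier to spot a problem confined to one field.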
After the cleanup we would need to look at the generated code in order to (potentially) make **improvements in the GTFN and DaCe backends**.
There is also a **bug in the DaCe code generation** which contaminates only the temperature field. Edoardo has been looking into this. Realistically, we can only present DaCe performance results once this bug is fixed. Additionally, the current muphys branch uses an experimental GT4Py branch which fixes the performance of transformations for GTFN. This branch needs to be made available in GT4Py/main.
There are also several procedural steps to create a level playing field for a comparison, e.g., a common repository for all schemes considered (which would ensure reproducibility of results), agreed-upon definitions of how to time runs properly, and support for alternative data layouts.
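One possible starting point for a shared timing definition is sketched below: warm-up runs are excluded so that compilation and caching are not measured, and the minimum and median over several repeats are reported. The name `time_run` and the default counts are assumptions, not an agreed convention. For GPU backends the callable would also have to block until the device work has finished, which this sketch does not handle.

```python
import statistics
import time
from typing import Callable, Dict


def time_run(run: Callable[[], None], warmup: int = 1, repeats: int = 5) -> Dict[str, object]:
    """Time `run` with a monotonic clock, discarding the warm-up calls."""
    for _ in range(warmup):
        run()  # not timed: absorbs compilation and first-touch effects

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()  # assumed to return only once all work is complete
        samples.append(time.perf_counter() - start)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "samples": samples,
    }


if __name__ == "__main__":
    # Hypothetical workload standing in for one graupel run.
    result = time_run(lambda: sum(i * i for i in range(100_000)), warmup=2, repeats=5)
    print(f"min {result['min']:.6f} s, median {result['median']:.6f} s")
```

Whatever definition is adopted, each scheme in the comparison should state the same items: what is excluded, how many repeats are taken, and which statistic is reported.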
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
Apart from locating and correcting the above-mentioned bug in DaCe code generation, it is probably not wise to invest large amounts of time in backend improvements.
## No-gos
We will not actually contribute to the paper in this cycle; we will only try to get a Muphys-GT4Py version that is "presentable".
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Preparation and infrastructure for muphys comparison
  - [x] Agree on focus of paper: are we just users of tools like Kokkos and GT4Py, or do we talk about scientific aspects of the tools?
  - [x] Common repository
  - [ ] Agreed-upon definitions for timing
- [x] Clean-up of muphys code
  - [x] Automate the Graupel verification, based on Fortran results
  - [x] Simple restructuring for better readability, e.g., single `precip` scan operator rather than four
  - [x] Verification test based on run_graupel_only
  - [x] Utilize named collections
  - [ ] Propose potential GT4Py extensions to improve readability
- [x] Backends
  - [x] DaCe: isolate and solve code generation bug
  - [x] GTFN: understand the poor performance. GTFN doesn't inline many things; it is unclear whether we want to invest time in improving it
## Performance measurement
### Hannes
graupel_only, 100 iterations, dace_gpu:

| | icon4py | gt4py | DaCe | R2B6 [s] | R2B7 maxfrac [s] |
| -------------------- | ------- | ------- | ------- | ------------------ | ------------------ |
| before scans inlined | 42e4b86 | 3ef7255 | ab9eaef | 1.4409005641937256 | 7.051928281784058 |
| scans inlined | b959d57 | 3ef7255 | ab9eaef | 1.5913982391357422 | 6.980992555618286 |
| nothing after scan | f02584a | | | 1.5492215156555176 | 6.5803093910217285 |
| + q_inout, mask=True | n/a | | | 1.1194305419921875 | 4.470679759979248 |
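
To make the rows easier to compare, the timings can be turned into speedups relative to the first row. The sketch below copies the numbers from the table; the helper name `speedups` and the dictionary layout are only illustrative.

```python
from typing import Dict

# Timings in seconds, copied from the table above.
TIMINGS: Dict[str, Dict[str, float]] = {
    "before scans inlined": {"R2B6": 1.4409005641937256, "R2B7 maxfrac": 7.051928281784058},
    "scans inlined": {"R2B6": 1.5913982391357422, "R2B7 maxfrac": 6.980992555618286},
    "nothing after scan": {"R2B6": 1.5492215156555176, "R2B7 maxfrac": 6.5803093910217285},
    "+ q_inout, mask=True": {"R2B6": 1.1194305419921875, "R2B7 maxfrac": 4.470679759979248},
}


def speedups(timings: Dict[str, Dict[str, float]], baseline: str) -> Dict[str, Dict[str, float]]:
    """Speedup of each row over the baseline row (values above 1 mean faster)."""
    base = timings[baseline]
    return {
        label: {grid: base[grid] / seconds for grid, seconds in row.items()}
        for label, row in timings.items()
    }


if __name__ == "__main__":
    for label, row in speedups(TIMINGS, "before scans inlined").items():
        print(label, {grid: round(value, 2) for grid, value in row.items()})
```

Relative to "before scans inlined", the last row comes out at about 1.29x on R2B6 and about 1.58x on R2B7 maxfrac.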