# Muphys performance

## brainstorming

- explore single-kernel version
- check how f2dace version is structured
- explore no `mask`/`is_sig_present` version (as a potential performance improvement which requires domain scientist approval)
- correctness: check that we produce the correct flux outputs
- jax version: single scan instead of vmap-ed scan
- use gt4py metrics for timing
- bindings: consider integrating with c/c++ bindings into the muphys_cpp repo
- gt4py scan frontend: allow `init` from field to avoid k shift in temperature (might be a requirement if we want to make an only-scan gt4py implementation).

## TODOs (GT4Py/DaCe):

- [ ] Will (or Giorgiana?): get references that include the fluxes
- [ ] (Who?) version with multiple output domains https://github.com/C2SM/icon4py/pull/995
  - [ ] correct
  - [ ] and fast...
- [x] Hannes: using inout buffers for Q https://github.com/C2SM/icon4py/pull/999
  - [x] explore if we can remove the separate in/out from the program (without hurting performance):
    - seems like we can: there is small performance variability that slightly favors the version where dace cannot be aware that it's the same buffer, but it is not significant. I will go for the version where the graupel program only gets a single inout q container.
- [ ] Edoardo: Check separately the changes in https://github.com/C2SM/icon4py/pull/998
  - [ ] refactoring of user code
  - [ ] TaskletFusion
- [x] Philip: Figuring out why `TaskletFusion` is so beneficial in [Edoardo's PR#998](https://github.com/C2SM/icon4py/pull/998)
  - [x] Deploy it, i.e. make it run.
  - [x] Check the theory that it is related to [`MoveDataflowIntoIfBody`](https://github.com/GridTools/gt4py/blob/main/src/gt4py/next/program_processors/runners/dace/transformations/move_dataflow_into_if_body.py); the effect could be similar to masking. Does not seem to be the case.
  - [x] Find location where TF can be integrated
  - [x] Do more experiments with it
  - CONCLUSION: Performance depends on where TF is performed in the pipeline: it must run before `MoveDataflowIntoIfBody`, which in turn must run before `MoveTaskletIntoMap` (but the second condition is most likely a compiler artifact). It was decided that the current state is "good enough"; spending more time would lead to more insights, but it is not worth it given the time constraints.
  - [ ] :warning: There is a [PR](https://github.com/GridTools/gt4py/pull/2457) that introduces the `TaskletFusion` change in GT4Py proper, but it leads to a degradation in one stencil of the dycore (lowered it twice, so it is probably not nondeterministic behaviour).
- [ ] Philip: Why do we have two kernels?
  - [x] :exclamation: There is some strange copy somewhere between the two Maps that prevents even manual fusing -> Is needed because in GT4Py we cannot manipulate the indices we want to access. In FORTRAN they do `kp1 = min(k_max, k + 1)`; we should probably have a similar concept in GT4Py that allows us to express that case.
  - [x] I profiled the code using `nsys` and at least for `R02B07` the copy takes $\approx 600\,\mu\text{s}$, which is less than 1% of the run time.
    - [ ] Check whether it is similar on the larger grids.
- [ ] Edoardo: In F2DaCe the scan loop, i.e. the loop that processes the vertical levels, is outside the kernel, i.e. each level results in a kernel launch. Edoardo is working on a PR.
- [ ] Philip: hack f2dace version with swapped loops for comparison
- [ ] Philip: Check strides in CPU mode; they might be off if memory is supplied by GT4Py.
- [ ] Hannes: allow single precision in the driver + test
- [ ] Will (or Giorgiana): reference data for single precision
- [x] Hannes: check if we should rewrite https://github.com/C2SM/icon4py/pull/1000/changes#r2716174318
  - doesn't seem required, no significant difference
- [ ] Ioannis: If-statement grouping in DaCe
  - [ ] [perf[next-dace]: RemoveAliasingScalars and FuseHorizontalConditionBlocks transformations gt4py#2469](https://github.com/GridTools/gt4py/pull/2469)
  - [ ] [feat[next-dace]: Enable setting gpu_maxnreg attribute in maps gt4py#2464](https://github.com/GridTools/gt4py/pull/2464)
  - [ ] [Improvements in Graupel code icon4py#1033](https://github.com/C2SM/icon4py/pull/1033)
- [ ] Ioannis: Performance indeterminism
  ```
  For 100 iterations it took 0.8099613189697266 seconds!
  For 100 iterations it took 0.7903952598571777 seconds!
  For 100 iterations it took 0.8035836219787598 seconds!
  For 100 iterations it took 0.7906265258789062 seconds!
  For 100 iterations it took 0.7940170764923096 seconds!
  For 100 iterations it took 0.806774377822876 seconds!
  For 100 iterations it took 0.8065989017486572 seconds!
  For 100 iterations it took 0.7770366668701172 seconds!
  For 100 iterations it took 0.8079164028167725 seconds!
  For 100 iterations it took 0.79966139793396 seconds!
  ```
  due to lower achieved ILP, depending on the order of the calculations that happen right before the scan in the same CUDA kernel
- [x] Ioannis: Scan unrolling
  - Not helpful because of the large number of registers necessary
- [ ] Hannes: remove more masks
  - most likely the is_level_active masks are not completely removed from the code; remove them explicitly
  - note: numerical differences are larger (~1e-8), but maybe there is a different problem
- [ ] Edoardo(?): Pass to transform partial copies into inout buffers and only update the non-copied area
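
To put a number on the run-to-run variability listed under "Performance indeterminism", the ten timings can be summarized with the Python standard library. This is only a sketch for reading the log: the values are copied from the runs above, and the variable names are illustrative rather than part of the driver.

```python
import statistics

# Wall-clock seconds for 100 iterations, copied from the runs listed above.
timings_s = [
    0.8099613189697266,
    0.7903952598571777,
    0.8035836219787598,
    0.7906265258789062,
    0.7940170764923096,
    0.806774377822876,
    0.8065989017486572,
    0.7770366668701172,
    0.8079164028167725,
    0.79966139793396,
]

mean_s = statistics.mean(timings_s)
stdev_s = statistics.stdev(timings_s)  # sample standard deviation
spread_s = max(timings_s) - min(timings_s)  # slowest minus fastest run

print(f"mean   = {mean_s:.4f} s")
print(f"stdev  = {stdev_s:.4f} s ({100 * stdev_s / mean_s:.2f} % of mean)")
print(f"spread = {spread_s:.4f} s ({100 * spread_s / min(timings_s):.2f} % of fastest run)")
```

Reporting the spread relative to the fastest run makes it easier to judge whether a later optimization is larger than the noise.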
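
The `kp1 = min(k_max, k + 1)` pattern mentioned in the two-kernel item can be illustrated in plain Python. This is a toy sketch, not GT4Py or the graupel code: the column values and the neighbour difference are made up, and the function names are hypothetical. It only shows that a clamped neighbour index gives the same result as first materializing a shifted copy, which is the step that currently separates the two Maps.

```python
def shifted_copy(field):
    """Materialize the k+1 neighbour in a separate buffer (the extra copy)."""
    k_max = len(field) - 1
    return [field[min(k_max, k + 1)] for k in range(len(field))]


def diff_with_copy(field):
    # Two passes: first build the shifted buffer, then consume it.
    field_kp1 = shifted_copy(field)
    return [field_kp1[k] - field[k] for k in range(len(field))]


def diff_with_clamped_index(field):
    # One pass: clamp the neighbour index in place, as in the Fortran code.
    k_max = len(field) - 1
    out = []
    for k in range(len(field)):
        kp1 = min(k_max, k + 1)
        out.append(field[kp1] - field[k])
    return out


temperature = [280.0, 275.5, 270.0, 262.5]  # made-up column of four levels
assert diff_with_copy(temperature) == diff_with_clamped_index(temperature)
print(diff_with_clamped_index(temperature))  # [-4.5, -5.5, -7.5, 0.0]
```

At the last level the clamped index points back at the level itself, so the difference is zero there without any special-casing.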