[DaCe] Optimization IX

# [DaCe] Optimization IX

- Shaped by: Philip
- Appetite (FTEs, weeks):
- Developers:

## Problem

We essentially continue our work on the DaCe optimizer. The current state is as follows:

- [`15_to_28`](https://github.com/C2SM/icon4py/pull/772): Currently being cleaned up by Hannes. According to Christoph, the corrector is more or less done, while the predictor is not. Giacomo has started on it, but it is unclear whether he will continue.
- `29_to_37`: Currently handled by Christoph; no ETA known.
- `39_to_60`: Very near completion; currently handled by Ioannis.

The work would be pretty straightforward given what we have done before. However, it was suggested that the stencils be cleaned up first, before new stencils are analyzed. We consider this a dependency, or even a task within the scope of this project; however, we do not handle it.

### Branches

Here are the branches to use:

- ICON4Py: [`all_opts`](https://github.com/C2SM/icon4py/pull/812) (probably needs an update) + `main`.
- GT4Py: `main`.
- Benchmark Repo: `all_opts`.

#### Left Overs

A list of standing issues in DaCe and/or GT4Py can be found [here](https://hackmd.io/tZ3BKzwNTlWwv81fW2H2ww). In previous cycles this list was part of the respective shaping document, but it has now been decided that it should become its own document.

Something that has become urgent is the `n_lev` issue, which is related to the promotion of a `Scalar` to a `symbol` on an `InterstateEdge`; we are working on it.

## Appetite

As long as it takes.

## Solution

## No-Go

## Rabbit holes

Uncountable.

## Important TODOs

- [ ] Failing verification
    - [ ] `dy_15_to_28_predictor`
        - Multiple writes (race conditions) in multiple access nodes
    - [x] `calculate_nabla2_and_smag_coefficients_for_vn`
        - ~~Started failing after switching to using only the default stream. Not sure why. SDFG looks simple~~
        - Problem with https://github.com/GridTools/gt4py/pull/2178
        - It seems to be the same bug that affects `apply_diffusion_to_theta_and_exner`.
    - [x] `apply_diffusion_to_theta_and_exner`
        - Problem with https://github.com/GridTools/gt4py/pull/2178
        - Philip tackles this
- [ ] DaCe transformations
    - [ ] Enable `VerticalMapFusion` only if all the input edges of an `AccessNode` are generated by maps
        - See the `dy_41_to_60_corrector` SDFG, where the same map is split between `1:14` and `14:80` (the issue originates from the pattern shown below around `next_w`)
        - ![image](https://hackmd.io/_uploads/rJWqdh4Pgx.png)
    - [ ] Fix `next_w` copies in `41_to_60_corrector`/`39_to_60_predictor`
        - ![image](https://hackmd.io/_uploads/rJcyK24Pex.png)
- [ ] Check `30_to_38` to see if there's anything to improve
    - [x] Corrector/predictor in SL2 look good
    - [x] See SL1 performance overhead for predictor below
- [ ] Other performance aspects
    - [ ] Compare SDFGs of stencils that have differences between SL1 and SL2 performance

## Progress

- [ ] `15_to_28`
    - [x] https://github.com/C2SM/icon4py/pull/772
    - [x] Check if the corrector is really okay.
    - [ ] Check predictor when ready --> failing
    - [x] Check the state of the current integration into the benchmark repo (fixed values and flavors).
    - [x] Update the OpenACC reference time.
    - [x] Actual work:
        - [x] Fix `ConstantSubstitution` for `AccessNode`s as well
        - [x] Check why the current version is failing
- [ ] `30_to_38`
    - [ ] Ask Christoph when it will be done and what its state is.
    - [ ] See https://github.com/C2SM/icon4py/pull/802
    - [x] Find out which flavors there are and which values are correct.
    - [x] Integrate into the optimizer.
    - [x] Determine the OpenACC reference time.
    - [ ] Actual work:
        - Check the SDFGs to see if there's anything to improve
- [ ] `39_to_60`
    - [x] Write `41_to_60_corrector` in a single fieldop
        - [x] Not really possible because `next_w` and `vertical_mass_flux` need to be written for one extra level
        - [x] Check if they really have to be written for this extra level by running the `timeloop` tests in `icon4py`
    - [x] Write `39_to_60_predictor` in a single fieldop
        - [x] Same as above
    - [x] Symbol substitution fails for `n_lev` if it's set from the `gtx.program`
        - [x] Works if it's passed as a parameter to the `gtx.program` and then to the `gtx.fieldop`
        - [x] Need Edoardo's fix for the `symbolic expressions`
            - [ ] https://github.com/C2SM/icon4py/pull/784
    - [x] Check why `41_to_60_corrector` verification fails on GPU only with benchmark_2
    - [ ] Check why CI needs specific parallelism, otherwise `CUDA_ILLEGAL_ADDRESS_ERROR`
    - [ ] Check SL1 vs SL2 performance
    - [ ] In `VerticalMapFusion` check if there are cases where we can fuse maps that are before/after an `AccessNode` that has overlapping edges
        - To fuse the maps successfully in this case, we would have to duplicate the computations of the overlapping range, which might not be beneficial
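
The duplication trade-off in the last `VerticalMapFusion` item can be made concrete with a toy cost model. The sketch below is plain Python and does not use the DaCe or GT4Py APIs; `LevelRange`, `duplicated_levels` and `duplication_overhead` are hypothetical names introduced only for illustration, and ranges are assumed to be half-open like Python slices (so `1:14` covers levels 1 to 13). It counts how many vertical levels would be computed more than once if every map touching the `AccessNode` kept its full range after fusion.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LevelRange:
    """Half-open range of vertical levels [start, stop); hypothetical helper type."""

    start: int
    stop: int


def _level_counts(ranges: Iterable[LevelRange]) -> Counter:
    # Count how many of the given ranges cover each vertical level.
    return Counter(level for r in ranges for level in range(r.start, r.stop))


def duplicated_levels(ranges: Iterable[LevelRange]) -> int:
    """Number of redundant level computations if every range is computed in full."""
    return sum(count - 1 for count in _level_counts(ranges).values())


def duplication_overhead(ranges: Iterable[LevelRange]) -> float:
    """Redundant work relative to the number of distinct levels (0.0 means none)."""
    counts = _level_counts(ranges)
    if not counts:
        return 0.0
    return sum(count - 1 for count in counts.values()) / len(counts)


# Adjacent split as in `1:14` / `14:80`: the ranges do not overlap.
adjacent = [LevelRange(1, 14), LevelRange(14, 80)]
# Assumed example of overlapping edges: levels 10..13 are covered twice.
overlapping = [LevelRange(0, 14), LevelRange(10, 80)]

print(duplicated_levels(adjacent))        # 0
print(duplicated_levels(overlapping))     # 4
print(duplication_overhead(overlapping))  # 0.05
```

Under this simplified model, a small overhead such as the 5% in the assumed example might still be acceptable if fusion removes a kernel launch or an intermediate copy, but that would have to be confirmed by measurements on the actual SDFGs.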