[DaCe] Optimization VIII

# [DaCe] Optimization VIII

- Shaped by: Philip
- Appetite (FTEs, weeks):
- Developers:

## Problem

There are different lines of work we will pursue.

First, we will continue measuring the stencils, this time from the dycore. We do not expect to need any additional transformations; however, we expect to discover bugs and cases that are not yet handled well or not handled at all. Previously we also used specialization level 2, in which we replace almost every size variable with its value. We did that for several reasons; however, we should also start switching down to specialization level 1, in which we only replace values related to the vertical dimension, which is closer to what we will have in practice.

The second kind of task is consolidation. We have created multiple transformations, such as [`SplitEdge`](https://github.com/GridTools/gt4py/pull/1988), [`HorizontalSplitMapRange` and `VerticalSplitMapRange`](https://github.com/GridTools/gt4py/pull/1992), which should be reviewed and included in GT4Py `main`.

The third line of work is to start creating heuristics to guide optimization. I think that most of our current guidelines are fine:
- Remove transient data whenever possible.
- Fuse as much as possible.
- Avoid loading data from memory multiple times (which is the idea behind the transformations above).

However, there is some potential to improve (besides the things that are listed in [Left Overs](https://hackmd.io/zUfWuOlLQL6ap9EcPLjjRA#Left-Overs)). Most of all, this concerns the decision of when to apply `LoopBlocking`, with which parameter, and with which thread block size. A first step was already done in [PR#2020](https://github.com/GridTools/gt4py/pull/2020), which limits the block size. As we have seen a tremendous performance increase in the diffusion stencil when `LoopBlocking` is used, we should make the pipeline more intelligent, especially regarding how `LoopBlocking` interacts with the thread block size.
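The difference between the two specialization levels mentioned above can be illustrated with a small Python sketch. This is not GT4Py code: the function `specialize`, the symbol names, and the rule that vertical sizes are recognized by the tag `KDim` in their name are assumptions made purely for illustration. Only the level semantics follow the description in this document, and level 2 is simplified to "every known size symbol" although in practice it is "almost every".

```python
from typing import Mapping

# Hypothetical naming rule: a size symbol belongs to the vertical dimension
# if its name contains this tag. Real symbol names may differ.
VERTICAL_TAG = "KDim"


def specialize(symbols: Mapping[str, int], level: int) -> dict[str, int]:
    """Return the size symbols that would be replaced by their values.

    level 1: only symbols related to the vertical dimension are replaced.
    level 2: every known size symbol is replaced (simplification of
             "almost every size variable").
    """
    if level == 1:
        return {name: val for name, val in symbols.items() if VERTICAL_TAG in name}
    if level == 2:
        return dict(symbols)
    raise ValueError(f"unknown specialization level: {level}")


# Illustrative sizes only; they are not taken from a real grid.
sizes = {
    "__w_size_CellDim": 20480,
    "__w_size_KDim": 80,
    "__vn_size_EdgeDim": 30720,
    "__vn_size_KDim": 80,
}

sl1 = specialize(sizes, level=1)  # horizontal sizes stay symbolic
sl2 = specialize(sizes, level=2)  # everything becomes a compile-time constant

print("SL1 replaces:", sorted(sl1))
print("SL2 replaces:", sorted(sl2))
```

With level 1 the horizontal extents remain symbolic, so any heuristic that relies on knowing them at optimization time (for example a choice of thread block size) cannot assume concrete values.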
#### Left Overs

During our previous work, not all aspects of the problem were fully solved. Here is an incomplete list of the most severe/important issues the optimizer currently has. Note that this is not a list of tasks; it is more of a reminder of things that could be done if there is spare time, or a list of particular issues that might surface while working with a stencil. They were copied from the [previous shaping document](https://hackmd.io/z6UQXnPvR72Kv_WLAIdCZg).

##### DaCe Related Issues

Here are all issues that are closely related to DaCe.
- The MapFusion PR in DaCe has finally been merged. However, in GT4Py we also need parallel map fusion; there is a [DaCe PR](https://github.com/spcl/dace/pull/1965) open, but it has not been reviewed yet.
- In certain situations, the `InlineMultistateSDFG` transformation can cause some [trouble](https://github.com/spcl/dace/issues/1959) because it splits the writes to transients, i.e. there is more than one AccessNode that writes to a transient.
- An error in the code generator, related to `concat_where`, was found. In order for kernels to run concurrently, they have to be submitted to different CUDA streams, which have to be created at the beginning. However, DaCe uses more streams than it creates, which causes a segmentation fault. The current workaround is to limit it to one stream, but this is not a real solution, as `concat_where` is a textbook example for concurrent streams.
- In some cases `gt_substitute_compiletype_symbols()` does not work (this is a limitation of the implementation).
- Not directly affecting this project, but during the upgrade to DaCe main it was realized that `is_accessed_downstream()` has to be changed. Before, it explored the graph manually, but this is no longer possible because of the new structure of the SDFG. The current solution is to rely on `StateReachability`, which is called by the function on its own.
  This is okay in some cases, but in most cases it should be run inside a pipeline. This needs to be changed, but it also triggers the [`PatternTransformation._pipeline_results` issue](https://github.com/spcl/dace/issues/1911) that also affects the MapFusion PR.
- There is an issue regarding the launch of kernels; it mostly happens for fully specialized SDFGs, i.e. all sizes and strides are replaced with their values. There is a [fix](https://github.com/spcl/dace/pull/1968) for that, but it has not been accepted yet.
- CUDA provides an API call to copy 2D arrays; however, there are some limitations in [DaCe](https://github.com/spcl/dace/issues/1953). This must be addressed.

##### Misc

- The `SplitAccessNode` transformation uses the producers for splitting; this means that either a fragment for each producer is generated or the node is not split at all. It should be extended such that splits can be merged, e.g. we have three producers writing `[0:5]`, `[5:10]` and `[10:15]` respectively and two consumers reading `[0:10]` and `[10:15]`. Currently this split is not possible, but it should be.
- We also allow promotion in the horizontal (i.e. adding a dummy dimension in the horizontal). Empirical observations have shown that this is beneficial most of the time. But we should add a check to ensure that the computation is not too heavy, like computing the average wind velocity.
- We currently assume constant folding; for example, we assume that `limited_area` is always `True` and `extra_diffu` is `False`. This is fine for MCH, but in general we should find a solution for that. For doing this we use `substitute_compiletype_symbols()`, which does not work correctly, see above.
- The NASA team has fixed the `ScalarToSymbol` promotion. It might be time to enable it in our simplify pipeline.
- The pattern matching is not very fast; we should implement our own to speed up at least the fusion transformations, because they have a very regular and simple pattern.
- Sometimes the reduction nodes are already expanded when we get them from the lowering. I do not think that this is a big problem, but we have to figure out why this is the case.
- It is very likely that we have dynamic allocations, i.e. calls to `new`, which we have to address.
- There is also an updated state fusion transformation, which seems better than the current one and should be integrated.
- Double buffering needs a refactoring.
- The order in auto optimize might need to be updated. This is mostly motivated by the introduction of the "If Mover", as the current order might create dependencies. It should also run after K-blocking, in which case the "global map restriction" (which is probably useless anyway) should be removed.
- Proper handling of Maps in K-Blocking; currently they are always considered dependent on K, which is not correct.

##### Strides

- For optimal performance the Map variables must be set in a specific way. Usually we rely on the names of the Map variables; however, in some cases this is not possible (especially if the `CopyToMap` transformation has been run). The right way would be to look at the access pattern and then find the correct iteration order based on the strides of the accessed containers. Note that this means that we have to know the values of the strides at optimization time. As a call to `CopyToMap` is essentially a sign that some fusion or elimination did not work, we should address that problem.
- There is a transformation that corrects the strides of containers in NestedSDFGs. However, views are currently ignored; there is an experimental [PR](https://github.com/GridTools/gt4py/pull/1784) for this.

##### GPU Transform

- There is an issue in the GT GPU transformation. In order to set the GPU block size and the iteration order correctly, we essentially run `CopyToMap` in it (the background is that the code generator runs it to transform some Memlets it cannot handle). However, it seems that we do not catch all cases.
- There is a bug in DaCe's GPU transformation (see [DaCe Issue#1773](https://github.com/spcl/dace/issues/1773) and [GT4Py PR#1741](https://github.com/GridTools/gt4py/pull/1741/files)). However, it has low priority, as it only affects a unit test that covers an edge case.
- I (Philip) remember that trivial tasklets were a big problem in the past, but I have not seen them in a long time and I wonder if they are actually still a problem, because they are handled very poorly (implementation-wise).

## Appetite

As long as it takes.

## Solution

We will focus on some stencils and try to optimize them, which in this context means identifying the patterns that limit performance.
- [`compute_derived_horizontal_winds_and_ke_and_horizontal_advection_of_w_and_contravariant_correction`](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/compute_edge_diagnostics_for_velocity_advection.py), previously known as `fused_velocity_advection_stencil_1_to_7_predictor`.
- [`compute_horizontal_advection_of_w`](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/compute_edge_diagnostics_for_velocity_advection.py), previously known as `fused_velocity_advection_stencil_1_to_7_corrector` (although I am not sure if there is anything left to optimize).
- [`compute_advection_in_horizontal_momentum_equation`](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/compute_advection_in_horizontal_momentum_equation.py), previously known as `fused_velocity_advection_stencil_19_to_20`.

If there is time, we will also look at the stencils previously known as `fused_velocity_advection_stencil_8_to_13` and `fused_velocity_advection_stencil_15_to_18`, but some things there might still change.

## No-Go

## Rabbit holes

Uncountable.
## Progress

### Cleanups

- [ ] Merge all PRs of gt4py to main
    - [ ] [feat[dace][next]: Updated SingleStateGlobalSelfCopyElimination gt4py#2060](https://github.com/GridTools/gt4py/pull/2060) **[Philip]**
        - [x] Review (currently addressing Edoardo's comments)
        - [x] Verification
        - [ ] Performance test
    - [x] [fix[next][dace]: Do not apply map_promoter on symbolic range gt4py#2071](https://github.com/GridTools/gt4py/pull/2071)
        - [x] Review (first round of review done by Philip)
        - [x] Verification
        - [x] Performance test
    - [x] Approved PRs
        - [x] See what's necessary for https://github.com/GridTools/gt4py/pull/2070
            - https://exclaimgroup.slack.com/archives/C08R6LX1R6H/p1749626704734599?thread_ts=1749626236.910799&cid=C08R6LX1R6H
- [x] Create icon4py_staging branch in gt4py based on main + concat_where
    - `icon4py_staging 8a4308bf3 (tag: icon4py_staging_20250613)`
- [ ] Merge changes in icon4py
    - [ ] [1_to_13](https://github.com/C2SM/icon4py/pull/762)
    - [x] [19_to_20](https://github.com/C2SM/icon4py/pull/735)
    - [ ] [41_to_60](https://github.com/iomaganaris/icon4py/tree/41_to_60_opt)

### Performance

- [ ] Check SL1
    - Varying performance compared to SL2
      ![image](https://hackmd.io/_uploads/HJ8D24PXee.png)
    - [ ] Figure out why
- [ ] 41_to_60
    - [ ] Find out issue with n_lev **[Ioannis]**
        - [x] Check details in https://exclaimgroup.slack.com/archives/C08R6LX1R6H/p1749650787427709
        - [x] Created the fix described above in https://github.com/GridTools/gt4py/commit/2e2e543ce749009481626950bf67030f4e26db9a (see result below [11.6.2025])
        - [ ] Proper fix in [fix[next][dace]: Fix compile time symbol substitution gt4py#2085](https://github.com/GridTools/gt4py/pull/2085)
    - [ ] **[TL;DR] Need to find fix for half levels**
        - [ ] Idea 1 (higher level): Add `OuterDomain` and `ComputationDomain` to the calls of fieldops. Fieldops execute only in the `ComputationDomain`, but they can read from and write to the `OuterDomain` (currently implicitly because of `Koff`s)
            - The fieldops called by a program turn into `concat_where`s
            - If some part of the `ComputationDomain` is accessed then its edge and any temporary should be removed from the SDFG
        - [ ] Idea 2 (Workaround): Add `concat_where`s for `0 <= KDim < n_lev(=vertical_end-1)` and/or remove the smaller domain (Klevel 80) fieldops that set things to 0
        - [ ] Idea 3 (SDFG level): Find cases of `(T) => (G)`. Replace `(T)` with `(G)` and later write to `(G)` the part that was originally written to `(G)` (needs a separate state)
    - [ ] Find remaining issues with copies/maps
        - [ ] See https://cscs-lugano.slack.com/archives/C06DWMYLKTJ/p1750329661766379
        - [ ] Unless we can find a solution in the frontend, we need a transformation that merges AccessNodes with patterns like `(T) => (A)`, where `(A)` is a `Global AccessNode` and `(T)` is a `Transient AccessNode`, when there are unnecessary copies between them. This might be the case for `(A) => (T)` as well
    - [ ] `41_to_60_corrector`
        - [ ] Optimizations in [41_to_60_opt](https://github.com/C2SM/icon4py/compare/main...41_to_60_opt)
            - [ ] `vertical_mass_flux_at_cells_on_half_levels`
                - Written by the program from 2 fieldops
                    - `_set_surface_boundary_condtion_for_computation_of_w`
                        - [80] level (set by `domain`)
                        - Just sets this to 0
                    - `_vertically_implicit_solver_at_predictor_step_before_solving_w`
                        - [0:80) levels (set by `domain`)
                        - We create a temporary from this fieldop that is later copied to the Global AccessNode ([0:80))
                        - Sets levels 0 and 80 to 0
                - Is it possible to get rid of the first fieldop and do everything
            - [ ] `tridiagonal_alpha_coeff_at_cells_on_half_levels`
                - Written by the program by 2 fieldops
                    - `_set_surface_boundary_condtion_for_computation_of_w`
                        - [80] level (set by `domain`)
                        - Just sets this to 0
                    - `_vertically_implicit_solver_at_corrector_step_before_solving_w`
                        - Computed for the domain of the fieldop [0:80)
            - [x] `next_w`
                - [x] Fixed by https://github.com/GridTools/gt4py/pull/2060
                - Written by the program by 3 fieldops
                    - `_set_surface_boundary_condtion_for_computation_of_w`
                        - [80] level (set by `domain`)
                        - Just sets this to 0
                    - `_vertically_implicit_solver_at_corrector_step_before_solving_w`
                        - Gets computed by a forward `scan` for [0:80)
                        - Then a backward scan for (80:0]
                        - And level 0 is set to 0
                    - `_vertically_implicit_solver_at_corrector_step_after_solving_w`
                        - Gets computed (partially) for [0:80)
                - The situation with `next_w` is even more difficult because there is a temporary generated by `_vertically_implicit_solver_at_corrector_step_before_solving_w`, part of which gets copied to another temporary, which in turn gets combined with the computation taking place in `_vertically_implicit_solver_at_corrector_step_after_solving_w` (14 levels).
                  The outcome of this variable is then copied to `next_w`
- [ ] 8_to_13
    - [ ] Slight regression compared to the old results, not yet understood
        - [ ] Maybe related to strides in field accesses

### Functionality

- [ ] Integrate the remaining existing stencils and validate/measure them
    - [ ] 14_to_28 **[Giacomo]**
    - [ ] 30_to_38 once ready
    - [ ] 41_to_60_corrector **[Ioannis]**
        - [ ] `benchmark_2` verification test fails with `icon4py` `main` (4566d53) and `gt4py` `icon4py_staging` (d34a7f78) for `next_rho`, `next_exner`, `next_theta_v`
            - Problem probably related to `Found transient 'lambda_13___tmp7' that has multiple overlapping incoming edges. Might indicate an error.` and `UserWarning: Found transient 'lambda_1___tmp1' that has multiple overlapping incoming edges. Might indicate an error.`
        - [x] `benchmark_2` verification tests pass with `icon4py` `41_to_60_opt` (f56cb92) and `gt4py` `icon4py_staging` (d34a7f78), but we have unnecessary copies to/from `next_w`, `vertical_mass_flux_at_cells_on_half_levels` and `tridiagonal_alpha_coeff_at_cells_on_half_levels`
            - [x] Unnecessary copies of `next_w` fixed by https://github.com/GridTools/gt4py/pull/2060

### Integration

- [ ] Verification of icon4py **[Edoardo]**
    - [x] 30_to_40 failing (result validation)
        - fixed in [#2071](https://github.com/GridTools/gt4py/pull/2071)
    - [ ] 41_to_60 failing (result validation)
- [ ] Verification of blueline
    - Blocked, awaiting icon4py verification and new `icon4py_staging` tag

### PASC

- [x] Create nice plot with the latest results tomorrow/Thursday
    - [x] Figure out 41_to_60 predictor/corrector
    - [x] Show the long stencil names in a nice way
    - [x] Polish the plot (colors, notes, etc)

### Latest results

#### 11.6.2025

```
Directory: /scratch/mch/ioannmag/cycle29//icon4py_custom
Branch: benchmark_dace_base_19_20_1_13_simpl (main + https://github.com/C2SM/icon4py/pull/762) [branch from my fork https://github.com/iomaganaris/icon4py/tree/benchmark_dace_base_19_20_1_13_simpl]
Commit ID: 196beb0030784a7e1a23d3a262f2fb0e297c987d

Directory: /scratch/mch/ioannmag/cycle29//dace_custom
Branch: gt4py-next-integration
Commit ID: d779cd1e91e6b519426f463184e3ffd36e7ceaf5

Directory: /scratch/mch/ioannmag/cycle29//gt4py_custom
Branch: icon4py_staging
Commit ID: e631f00376528f78fcd40d590704f7398630dc08

Directory: /scratch/mch/ioannmag/cycle29//benchmark_2
Branch: master
Commit ID: 0595a68c0a29383c29703c83995684925527f3f7
```

![image](https://hackmd.io/_uploads/HJSKjEvmeg.png)

**Warning** The results below include changes that improve `41_to_60` but are not yet reviewed/merged

```
Directory: /scratch/mch/ioannmag/cycle29//icon4py_custom
Branch: iomaganaris:41_to_60_1_to_13_opt
Commit ID: 51ceceefa0749b54b8ae23d4413a302ccb985a67

Directory: /scratch/mch/ioannmag/cycle29//dace_custom
Branch: gt4py-next-integration
Commit ID: d779cd1e91e6b519426f463184e3ffd36e7ceaf5

Directory: /scratch/mch/ioannmag/cycle29//gt4py_custom
Branch: iomaganaris:icon4py_staging
Commit ID: 2e2e543ce749009481626950bf67030f4e26db9a

Directory: /scratch/mch/ioannmag/cycle29//benchmark_2
Branch: master
Commit ID: 0595a68c0a29383c29703c83995684925527f3f7
```

![image](https://hackmd.io/_uploads/H1AwiNwQxg.png)

#### 10.6.2025

```
Directory: /scratch/mch/ioannmag/cycle29//icon4py_custom
Branch: benchmark_dace_base_19_20_1_13_simpl_nomain
Commit ID: 9b93010016d9374bd1281cb0acd67765f1251eec

Directory: /scratch/mch/ioannmag/cycle29//dace_custom
Branch: gt4py-next-integration
Commit ID: 4b49068dfbd1d839a562c1d4535c7ea340597c07

Directory: /scratch/mch/ioannmag/cycle29//gt4py_custom
Branch: icon4py_staging_main_100625_dev
Commit ID: 26fee729e690d9d65d193b83783b93df234d90da

Directory: /scratch/mch/ioannmag/cycle29//benchmark_2
Branch: master
Commit ID: 721b7e65ff3d9209d98e3724aebf4405440aba88
```

![image](https://hackmd.io/_uploads/S1rsJp87el.png)