# [DaCe] Optimization IV <!-- [DaCe] Optimization IV -- Live Free or Die Hard -->

###### tags: `cycle 25 - 09/24`

- Shaped by: Philip
- Appetite (FTEs, weeks):
- Developers: Philip, Christoph Müller (Support)

## Problem

The optimization pipeline is still under development. Several parts could be improved (see [PR#1639](https://github.com/GridTools/gt4py/pull/1639) for a summary), but they are not the main obstacle. The main problem is that we currently do not have representative (as in big) SDFGs. It is therefore not clear which transformations are missing and which ones need to be updated, fixed, or improved.

From basic considerations we think that some of the issues might be related to:

- Strides of temporaries (which might also influence the iteration order).
- DaCe's `InlineSDFG` transformation has some restrictions, which may or may not be an issue.
- The [fusion transformations](https://github.com/spcl/dace/pull/1629) are not yet merged into the DaCe repository (not a real issue yet, as they also live in GT4Py, but their review will take some time).
- When generating reductions, DaCe seems to use atomic operations quite aggressively, even when they are not needed.

However, we essentially do not know for certain which components we are missing. In this project we want to identify and start addressing these points, with the goal of having a performant and verifying combined stencil.

## Appetite

While the whole task of getting the optimization pipeline ready for use will take longer, we should have something running that is reasonably fast by the end of the cycle.

#### Availability

Philip may only be able to work 66% on this project, as he also has to work 33% for LUMI.

## Solution

We will proceed in a similar way as in [cycle 23](https://hackmd.io/klvzLnzMR6GZBWtRU8HbDg), where we first implemented the optimization pipeline. After some discussion with Christoph Müller, we selected these stencils:

- [apply_diffusion_to_vn](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/apply_diffusion_to_vn.py)
- [apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence.py)

The latter stencil is interesting because fusing it results in a nested neighbourhood reduction, which is known to perform poorly.

In a first step we will set up a simple testing pipeline in the spirit of the one used in cycle 23. Its main purpose is to automate running and timing as much as possible, so that changes to the optimization pipeline can be tested quickly. Most likely we can reuse some of the infrastructure created in cycle 23.

We will then proceed iteratively. In the beginning, we will apply the pipeline that we currently have. From this we will get a broad overview of what is missing, in the sense of "the transformations were not able to do _X_" or "failed doing _Y_". Here we will look purely at the SDFG level; for this we will need help and feedback from SPCL (most likely during the weekly meetings or in a dedicated appointment). Furthermore, we will also look at the generated code to see what could be improved there; for this we will collaborate with Christoph and Ioannis. Then we will address the identified issues and start again from the beginning.
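
To make the SDFG-level inspection step concrete, here is a minimal sketch of the kind of probing we plan to do: reporting the strides of transients and driving DaCe's inlining and fusion passes. It uses a toy `dace.program` as a stand-in; the real SDFGs come out of the GT4Py lowering, and the program and array names here are illustrative only.

```python
import dace
from dace.transformation.interstate import InlineSDFG
from dace.transformation.dataflow import MapFusion


# Toy program standing in for a lowered ICON4Py stencil; the real SDFGs
# are produced by the GT4Py DaCe backend and are considerably larger.
@dace.program
def toy_stencil(inp: dace.float64[64, 80], out: dace.float64[64, 80]):
    tmp = inp + 1.0      # becomes a transient array in the SDFG
    out[:] = tmp * 2.0


# Keep the raw SDFG so the transient is still visible.
sdfg = toy_stencil.to_sdfg(simplify=False)

# One suspected issue: temporaries with unfavourable strides, which can
# also dictate the iteration order. Report them for all transient arrays.
for name, desc in sdfg.arrays.items():
    if desc.transient and isinstance(desc, dace.data.Array):
        print(f"transient {name}: shape={desc.shape}, strides={desc.strides}")

# Drive the passes whose limitations we want to probe: inline nested
# SDFGs, fuse adjacent maps, then clean up.
sdfg.apply_transformations_repeated(InlineSDFG)
sdfg.apply_transformations_repeated(MapFusion)
sdfg.simplify()
```

On the actual combined stencils the same inspection would run after our GT4Py pipeline, so that stride or inlining problems show up before we look at the generated code.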
#### Baseline

As a performance baseline we will use runtime data obtained from the ICON Fortran run with OpenACC; for convenience we use the EXCLAIM version of ICON (BLUE). The experiment, i.e. the grid, is not yet determined, but we will use a global grid that is large enough to saturate the GPU, and we will try both A100 and GH200. Christoph will provide us with the specifics.

In the beginning we will use random input data, mostly for convenience.[^1] The stencils listed above do not have a large data dependency. However, in the later stages, and for a really fair comparison, we have to use serialized data. To obtain this data we will use the serialize mode of Liskov (currently it does not work; Christoph is investigating this).

#### See Also

The optimization part in a [larger context](https://hackmd.io/OSw9YiwcQImPqpU46FyoKw#optimizations).

## Rabbit holes

The main point is to identify and solve the main performance blockers, not to solve every little detail we might find. The main targets are the transformations, not the testing pipeline.

In certain cases a transformation might be an inappropriate way to solve a problem, because it would be very difficult and tricky to write. However, if a small modification of GT4Py would do the trick, we will solve the problem on that level. Furthermore, if the problem is in the code generator, we will not address it but report back to SPCL.

## No-gos

- Modifying things in the code generator.

## Progress

<!-- Don't fill during shaping.
This area is for collecting TODOs during building.
As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [ ] Update PR ([PR#1639](https://github.com/GridTools/gt4py/pull/1636))

<!--=========================================-->

[^1]: I had a discussion with Christoph and according to him, random data will lead to a very different runtime behaviour in the (FIND_CORRECT_ONE) stencil, as it is extremely runtime dependent. However, in the beginning (and for verification) it should be fine.
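
As a starting point for the timing part of the testing pipeline described under Baseline, a minimal sketch of a harness that feeds random input data to a compiled stencil. `compiled_stencil`, the field names, and the sizes are hypothetical placeholders; for GPU runs one would additionally need device synchronization or DaCe's built-in instrumentation.

```python
import time
import numpy as np


def time_stencil(run, args, warmup=3, reps=20):
    """Time repeated calls to a compiled stencil; returns the median runtime in s.

    `run` is any callable taking keyword arguments (e.g. a compiled SDFG
    or GT4Py program), `args` is the dictionary of those arguments.
    """
    for _ in range(warmup):  # warm-up calls to hide JIT/caching effects
        run(**args)
    timings = []
    for _ in range(reps):
        start = time.perf_counter()
        run(**args)
        timings.append(time.perf_counter() - start)
    return float(np.median(timings))


# Random input data, as planned for the first iterations. The field names
# and shapes are placeholders, not the real stencil signature.
rng = np.random.default_rng(42)
args = {
    "vn": rng.random((20480, 80)),
    "out": np.zeros((20480, 80)),
}
# `compiled_stencil` would be the program produced by the GT4Py/DaCe backend:
# print(time_stencil(compiled_stencil, args))
```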