# [GT4Py/DaCe] Fieldview SDFG-Transformations II

- Shaped by: Philip, Edoardo
- Appetite (FTEs, weeks):
- Developers:

## Problem

In the last [cycle (23)](https://hackmd.io/klvzLnzMR6GZBWtRU8HbDg?view) we started setting up an optimization pipeline that showed [good initial results](https://docs.google.com/presentation/d/1i1GCPEgcX-nUPk0X-csYewtVpz-PwTWOwzdSjNRskdk/edit#slide=id.g21bee19dfce_1_0) on the nabla4 stencil. However, the corresponding [PR](https://github.com/GridTools/gt4py/pull/1594) is still under review. Furthermore, since there was no frontend yet, the stencil IR had to be written by hand in the last cycle, which is a limiting factor[^1].

In cycle 24 some icon4py stencils will be parsed by the GT4Py frontend and automatically translated to SDFG. From our previous experience (the optimization pipeline for the JaCe prototype) we know that one important transformation is missing: parallel Map fusion. Some smaller ones, such as a custom redundant array elimination, are missing as well.

An even more important missing piece is a criterion that decides whether two Maps should be fused. Currently, if two (serial) Maps can be fused, they are fused, regardless of whether this is beneficial. Implementing the hook for applying such a criterion is not hard; finding an appropriate criterion is. To come up with one, we would need a large set of representative stencils.

Another missing aspect is optimization inside kernels, which becomes relevant when serial Maps are nested inside a parallel Map. While fusing global Maps together is beneficial, it is not clear whether the same holds for (sequential) Maps inside kernels. To test this, we would need kernels whose content is not limited to Tasklets.
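To make the two missing pieces concrete, here is a minimal sketch in plain Python. It does not use the DaCe API: `ToyMap`, `fuse_parallel`, `always_fuse` and `run` are hypothetical names introduced only for illustration. The sketch assumes that "parallel Map fusion" means merging two Maps that share the same iteration range and have no data dependency on each other, and it shows how the fusion could be gated by a pluggable criterion instead of being applied unconditionally.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyMap:
    """Toy stand-in for an SDFG Map: an iteration range and a per-index body."""
    name: str
    size: int
    body: Callable[[int], None]


# A criterion receives both candidate Maps and decides whether to fuse them.
Criterion = Callable[[ToyMap, ToyMap], bool]


def always_fuse(first: ToyMap, second: ToyMap) -> bool:
    """Mimics the current behaviour: fuse whenever fusion is possible."""
    return True


def fuse_parallel(
    first: ToyMap, second: ToyMap, should_fuse: Criterion = always_fuse
) -> Optional[ToyMap]:
    """Fuse two independent Maps over the same range, if the criterion agrees."""
    if first.size != second.size:
        return None  # ranges differ, so the Maps cannot be fused
    if not should_fuse(first, second):
        return None  # fusion is possible but judged not beneficial

    def fused_body(i: int) -> None:
        first.body(i)
        second.body(i)

    return ToyMap(f"{first.name}+{second.name}", first.size, fused_body)


def run(toy_map: ToyMap) -> None:
    """Execute the Map sequentially; a real backend would run it in parallel."""
    for i in range(toy_map.size):
        toy_map.body(i)


# Two independent Maps that both read `a` but write to different arrays.
a = [1.0, 2.0, 3.0]
b = [0.0] * len(a)
c = [0.0] * len(a)


def double(i: int) -> None:
    b[i] = 2.0 * a[i]


def increment(i: int) -> None:
    c[i] = a[i] + 1.0


fused = fuse_parallel(
    ToyMap("double", len(a), double), ToyMap("increment", len(a), increment)
)
if fused is not None:
    run(fused)  # `a` is traversed once instead of twice
```

Passing a different `should_fuse` callable changes the decision without touching the fusion logic itself, which is the separation the text asks for: the mechanism is simple, while the choice of criterion remains the open question.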
## Solution

From the above discussion, we conclude that the best course of action is to invest roughly two to three weeks into cleaning up and finalizing a first version of the optimization pipeline. This includes:
- Merging the current (working) state of the pipeline into `main`.
- Implementing the last transformations that we know are needed:
    * Parallel Map fusion
    * A custom version of redundant array removal ([rule 1](https://hackmd.io/klvzLnzMR6GZBWtRU8HbDg?view#Requirements-on-SDFG))
- Resolving some of the most pressing `TODO`s in the code base:
    * Resolving a bug in the GPU transformation
    * Improving Map fusion
- Ensuring that the optimization pipeline produces SDFGs with reasonable performance on the icon4py stencils that can be lowered by the GT4Py backend. For this cycle, reasonable performance means performance comparable to the ITIR-legacy execution.

##### Extension

We could start involving SPCL in the design of the optimization pipeline. For this we need to share the `gtir` development branch and provide instructions on how to generate the SDFG for the icon4py stencils. Their feedback on the design of the decision criteria for applying Map fusion would be very useful. From basic considerations about how a GPU works, the following rules should give us a valid first-order approximation:
- Fuse two (global) Maps if this increases the operational intensity, with the goal of decreasing memory loads.
- Do not fuse if the resulting kernel would require too many registers, i.e. maximize the theoretical occupancy.

One problem is that the SDFG nodes are processed in non-deterministic order. We might also have to investigate this aspect and make some changes in DaCe.

## Appetite

Technically this project can take any amount of time; however, as stated above, we limit it to 2-3 weeks.

## Rabbit holes

The complexity of some transformations, especially Map fusion, is very large.
However, we will limit ourselves to the cases that are expected in GT4Py SDFGs.

## No-gos

## Progress

- [x] Finishing Initial PR ([PR#1594](https://github.com/GridTools/gt4py/pull/1594))
- [ ] Porting to DaCe ([PR#1629](https://github.com/spcl/dace/pull/1629))
- [x] Passing the tests
- [ ] Get approval from Devs

[^1]: Nabla4 was relatively simple, but if we were to write more stencils by hand, we would have to make sure that the GTIR we use is comparable to what we will get later; thus it stands to reason that we should also run the GT4Py optimizations. Otherwise we might start to optimize for SDFGs that we will not get in the end (examples are common subexpression elimination or "this part is only used in this branch, so move the computation inside it"). However, we could also start without them and simply declare that everything we cannot handle easily at the DaCe level is left to GT4Py.