# [GT4Py/DaCe] Fieldview SDFG-Transformations II
<!-- Add the tag for the current cycle number on top -->
- Shaped by: Philip, Edoardo
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
In the last [cycle (23)](https://hackmd.io/klvzLnzMR6GZBWtRU8HbDg?view) we started setting up an optimization pipeline that showed [good initial results](https://docs.google.com/presentation/d/1i1GCPEgcX-nUPk0X-csYewtVpz-PwTWOwzdSjNRskdk/edit#slide=id.g21bee19dfce_1_0) on the nabla4 stencil.
However, the corresponding [PR](https://github.com/GridTools/gt4py/pull/1594) is still under review.
Furthermore, since there was no frontend yet, the stencil IR had to be written by hand in the last cycle, which is a limiting factor[^1]. In cycle 24, some icon4py stencils will be parsed by the GT4Py frontend and automatically translated to SDFGs.
From our previous experience (the optimization pipeline for the JaCe prototype) we know that one important transformation is still missing: parallel Map fusion (besides some smaller ones, such as a custom redundant array elimination).
An even more important missing piece is the ability to apply a criterion that decides whether two Maps should be fused.
Currently, if two (serial) Maps can be fused, they will be fused, regardless of whether this is beneficial.
While implementing support for such a criterion is not hard, finding an appropriate one is.
To come up with such a criterion we would need a large set of representative stencils.
Another missing aspect is the optimization inside kernels (this is relevant when we have serial Maps inside a parallel Map).
While fusing global Maps is beneficial, it is not clear whether the same holds for (sequential) Maps inside kernels.
However, for testing this we would need kernels whose content is not limited to Tasklets.
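
To make the distinction between serial and parallel Map fusion concrete, here is a minimal plain-Python sketch. It is not DaCe code: each Map is modeled as a simple loop, and all function names are hypothetical and chosen only for illustration.

```python
# Illustration only: each "Map" is modeled as a Python loop over the same range.

def serial_unfused(a):
    """Two Maps where the second consumes the output of the first."""
    n = len(a)
    tmp = [0.0] * n
    out = [0.0] * n
    for i in range(n):  # Map 1: produces the intermediate array `tmp`
        tmp[i] = a[i] * 2.0
    for i in range(n):  # Map 2: consumes `tmp`
        out[i] = tmp[i] + 1.0
    return out


def serial_fused(a):
    """Serial fusion: the intermediate array shrinks to a scalar in one loop."""
    n = len(a)
    out = [0.0] * n
    for i in range(n):
        t = a[i] * 2.0
        out[i] = t + 1.0
    return out


def parallel_unfused(a):
    """Two independent Maps over the same range that read the same input."""
    n = len(a)
    x = [0.0] * n
    y = [0.0] * n
    for i in range(n):  # Map 1
        x[i] = a[i] + 1.0
    for i in range(n):  # Map 2: does not depend on Map 1
        y[i] = a[i] * a[i]
    return x, y


def parallel_fused(a):
    """Parallel fusion: one loop, and `a[i]` is loaded only once per index."""
    n = len(a)
    x = [0.0] * n
    y = [0.0] * n
    for i in range(n):
        v = a[i]
        x[i] = v + 1.0
        y[i] = v * v
    return x, y
```

In both cases the fused variant computes the same result; the potential gain is fewer passes over memory, which is why a fusion criterion matters.
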
<!--
Essentially, there are three possible ways how to move forward:
1. Trying to further optimize Nabla4 and fully beat the baseline.
2. Repeating the exercise that we did for Nabla4 with another kernel.
3. Tying up some of the loose ends.
The main issue with the first path is that it is unclear how beneficial this is to other stencils.
Furthermore, we are unsure if this can be achieved by transformations alone or whether we would need to modify the DaCe code generator.
The main issue with the second path is that we essentially lack a "good" baseline like the one we had for Nabla4.
Besides, for this to be beneficial, the new stencil we select would need to have a very different compute pattern.
The main issue with the third path is that we would not find/implement new, yet unknown transformations.
However, it would allow us to create a solid foundation on which we could later easily build on.
-->
## Solution
From the above discussion, we conclude that the best course of action would be to invest roughly two to three weeks into cleaning up and finalizing a first version of the optimization pipeline.
This includes:
- Merging the current (working) state of the pipeline into `main`.
- Implementing the last transformations that we know are needed:
  * Parallel Map fusion
  * A custom version of redundant array removal ([rule 1](https://hackmd.io/klvzLnzMR6GZBWtRU8HbDg?view#Requirements-on-SDFG))
<!--
- Adding the possibility to specify a fusion criterion and implementing a prototype, see below.
As a side note, DaCe currently processes the nodes (Maps) in a more or less arbitrary and non-deterministic order; we might consider improving on that in the future, but we will ignore it for now.
-->
- Resolving some of the most pressing `TODO`s in the code base:
  * Resolving a bug in the GPU transformation
  * Improving Map fusion
- Ensuring that the optimization pipeline produces SDFGs with reasonable performance on the icon4py stencils that can be lowered by the GT4Py backend. For this cycle, reasonable performance means performance comparable to the ITIR-legacy execution.
##### Extension
<!--
We could also try to merge the Map fusion transformations back to DaCe.
However, in that case we would need more time because the current implementation inherently assumes that the SDFG obeys the [outlined structures](https://hackmd.io/klvzLnzMR6GZBWtRU8HbDg?view#Requirements-on-SDFG).
-->
We could start involving SPCL in the design of the optimization pipeline. For that, we need to share the `gtir` development branch and provide instructions on how to generate the SDFGs for the icon4py stencils.
Their feedback on the design of the decision criteria for applying Map fusion would be very useful.
From basic considerations about how a GPU works, the following rules should give us a valid first-order approximation:
- Fuse two (global) Maps if this leads to an increase in the operational intensity, with the goal of reducing memory loads.
- Do not fuse if the resulting kernel would require too many registers, i.e. maximize the theoretical occupancy.
One problem is that the SDFG nodes are processed in a non-deterministic order.
We might also have to investigate this aspect and make some changes in DaCe.
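
The two fusion rules listed above could be prototyped roughly as follows. This is a sketch under stated assumptions, not an existing DaCe or GT4Py API: `KernelEstimate`, `should_fuse`, and the `max_registers` default of 64 are hypothetical, and how the estimates would actually be obtained from an SDFG is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelEstimate:
    """Hypothetical cost estimate for one kernel (not a DaCe type)."""

    flops: int        # floating point operations performed by the kernel
    bytes_moved: int  # bytes loaded from and stored to global memory
    registers: int    # estimated registers needed per thread

    @property
    def operational_intensity(self) -> float:
        # Work performed per byte of global memory traffic.
        return self.flops / self.bytes_moved


def should_fuse(first, second, fused, max_registers=64):
    """Decide whether two Maps should be fused under the two rules.

    `max_registers` is an arbitrary placeholder budget, not a measured value.
    """
    # Rule 2: reject fusion if the fused kernel exceeds the register budget,
    # since that would lower the theoretical occupancy.
    if fused.registers > max_registers:
        return False
    # Rule 1: fuse only if the operational intensity increases compared to
    # running both kernels one after the other.
    unfused_flops = first.flops + second.flops
    unfused_bytes = first.bytes_moved + second.bytes_moved
    return fused.operational_intensity > unfused_flops / unfused_bytes


# Example with made-up numbers: fusion removes one intermediate array.
first = KernelEstimate(flops=100, bytes_moved=400, registers=20)
second = KernelEstimate(flops=100, bytes_moved=400, registers=24)
fused_light = KernelEstimate(flops=200, bytes_moved=500, registers=40)
fused_heavy = KernelEstimate(flops=200, bytes_moved=500, registers=96)
```

With these made-up numbers, `should_fuse(first, second, fused_light)` accepts the fusion, while `fused_heavy` is rejected because of its register estimate. Keeping the estimates separate from the decision function would make it easy to swap in a different criterion once representative stencils are available.
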
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
Technically, this project can take any amount of time. However, as stated above, we limit it to 2-3 weeks.
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
The complexity of some transformations, especially Map fusion, is very high.
However, we will limit ourselves to the cases that are expected in GT4Py SDFGs.
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Finishing Initial PR ([PR#1594](https://github.com/GridTools/gt4py/pull/1594))
- [ ] Porting to DaCe ([PR#1629](https://github.com/spcl/dace/pull/1629))
- [x] Passing the tests
- [ ] Get approval from Devs
<!--========================================================-->
[^1]: Nabla4 was relatively simple, but if we were to write more stencils by hand, we would have to make sure that the GTIR we use is comparable to what we will get later; thus it stands to reason that we should also run the GT4Py optimizations.
Otherwise we might start to optimize for SDFGs that we will not get in the end (examples are common subexpression elimination or "this part is only used in this branch, so move the computation inside it").
However, we could also start without them and simply declare that everything we cannot handle easily on the DaCe level is left for GT4Py.