[GT4Py] Refactoring of the DaCe Backend

# [GT4Py] Refactoring of the DaCe Backend

###### tags: `cycle 18`

- Shaped by: [Philip](mailto:phimuell@cscs.ch)
- Appetite (FTEs, ??):
- Developers:

## Problem

Until very recently I thought that ITIR was not suitable for creating SDFGs and that Field View would be better suited to the task. While I still think that this is the case, I have revisited my previously pessimistic view of ITIR. I came to the conclusion that, after some probably non-trivial transformations, ITIR could become suitable for the job.

According to my understanding, ITIR, at least for `FieldOperator`s, describes stencil operations _relative_ to the current location of an iterator in a field. This operation is then executed, potentially in parallel, for every location. (This pattern can be seen inside the loop in `fendef_embedded()` in the file `/src/gt4py/next/iterator/embedded.py`.) This mechanism should in principle be nicely captured by DaCe maps.

In its simplest form, all operations would just be placed in a single tasklet. While simple, this approach prevents DaCe from optimizing the operations; thus they should be split and kept small such that the dataflow aspect is preserved. However, we should consider such an approach as a prototype.

The actual problem, however, is that the translation currently generates the SDFG in a top-down manner, which leads, in my view, to suboptimal SDFGs that

- are hard for DaCe to analyse,
- add unnecessary complexity to the optimization (time is spent optimizing away things that should not be there in the first place),
- might actually prevent some optimizations from happening, such as memory pattern detection,
- have memlets with too large a volume, since almost all of them (unnecessarily) transfer the whole array.

According to Eduardo, the reason for this is that arrays are available everywhere in case they appear inside a capture.

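
The execution model described above, a stencil written relative to an iterator position and applied independently at every location, can be sketched in plain Python. This is only an illustration under stated assumptions: `Iterator`, `diff_stencil` and `apply_stencil` are hypothetical names and are not part of the GT4Py or DaCe API, and the one-dimensional difference stencil `dA[i] = A[i+1] - A[i]` is chosen purely for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Iterator:
    """Hypothetical stand-in for an ITIR iterator: a field plus a position."""

    field: List[float]
    pos: int

    def shift(self, offset: int) -> "Iterator":
        # Move relative to the current position; the field is not copied.
        return Iterator(self.field, self.pos + offset)

    def deref(self) -> float:
        # Read the single element the iterator currently points to.
        return self.field[self.pos]


def diff_stencil(a: Iterator) -> float:
    # dA[i] = A[i+1] - A[i], expressed relative to the current location.
    return a.shift(1).deref() - a.deref()


def apply_stencil(
    stencil: Callable[[Iterator], float],
    field: List[float],
    domain: Iterable[int],
) -> List[float]:
    # Every iteration is independent of the others, which is why this loop
    # corresponds to a (potentially parallel) map over the domain.
    return [stencil(Iterator(field, i)) for i in domain]


A = [1.0, 4.0, 9.0, 16.0]
dA = apply_stencil(diff_stencil, A, range(len(A) - 1))
print(dA)
```

Each call to the stencil reads only the elements it dereferences (here `A[i]` and `A[i+1]`), so in this model a map over the domain with one small memlet per access would carry all the data an iteration needs; no iteration requires the whole array.
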
As an example, let's consider the following stencil, which comes from `test_domain()` in `${GT4PY_ROOT}/tests/next_tests/integration_tests/feature_tests/ffront_tests/test_execution.py`.

```math
A_{i,j} := A_{i,j} + A_{i,j}.
```

The current implementation, after a round of `auto_optimize()`, produces the following SDFG:

![Optimized SDFG of adding two fields.](https://hackmd.io/_uploads/Hk2cFlFM6.png)

But using DaCe's Python frontend directly leads to the following (desired) SDFG:

![SDFG of the adding stencil by the DaCe Python frontend.](https://hackmd.io/_uploads/SyJhYxtz6.png)

Another stencil (the diff stencil) that leads to a strange SDFG is

```math
dA_{i} := A_{i+1} - A_{i}.
```

![The optimized SDFG for 'diff-stencil'](https://hackmd.io/_uploads/rJ_aYxFGa.png)

As we see, DaCe was not able to remove the (obvious) indirection from the graph. The generated code does not look that bad, but I think this is due to the simplicity of the operation.

## Appetite

## Solution

In a first step, the SDFG should be constructed bottom-up. For this, every closure should be considered as a single map, and then its internals should be constructed.

- In a first round we should figure out which elements are actually accessed.
- For each of these accesses a separate memlet is constructed.
- Then the operation tree is built.

One of the problems is shifting, i.e. indirection.

## Rabbit holes

Shifting, i.e. indirection, could be a problem. While there are more sophisticated solutions, a simple one could be to first copy the data into a temporary (transient), which could enable k-caching. However, it is a potential rabbit hole.

## No-gos

# Comments

## The DaCe API

- We should check and use new features from the DaCe API.
- Stateful (the array is an example).

## ITIR

- Verbose.
- `shift` is an example that leads to very complicated constructs.
- Have a representation with fewer degrees of freedom.
- Currently we do it in one go, which is very painful.
- Peter's recommendation: we would have to introduce another IR on our side and one on the DaCe side (maybe merge the two?). The problem is that the two representations are very different.
- The ITIR will be updated/cleaned up "soon", so the gap should become smaller; therefore now is not the best time to do it.
- It would be good to know what is valid in ITIR, because we should assume that what we get is valid.
- There are three different types of shifts (Cartesian, NeighborHoodOffsetProvider, StridedNeighborHoodOffsetProvider).
- If we had a shift operator for each of these, it could be simplified.
- Make an IR that is in the middle between the two.
- Do not extend the IR, because then it does not have to be reworked much -> something like a FieldIR.
- We need some kind of long-term planning to better decide what we should do.
- Either work on the DaCe backend or on the foundation of them, i.e. cleaning up DaCe and gt4py.

## Conclusion

- Spending time on the DaCe backend is probably not a good idea.
- Philip's concern: what should we improve (there are documents somewhere) without knowing the deeper problem?
- Improve/replace iterator IR?
- Improve the DaCe user API?
- Improve DaCe transforms?
- Talk with the DaCe people about our (Eduardo & Peter) experiences with the DaCe API.
- Turning off the ITIR optimizations, because they discard some stuff DaCe could use.
- Starting from ITIR might not be the best thing, FieldView might be better for the job; reducing interference from the ITIR passes.
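
The ITIR comment about having one shift operator per kind of shift can be illustrated with a small sketch. The three functions below are hypothetical and are not GT4Py API. They only assume the usual meaning of the three cases: a Cartesian shift adds a fixed offset, a neighbor-table shift looks the neighbor up in a connectivity table, and a strided shift computes the neighbor index from a fixed stride. The formula used for the strided case in particular is an assumption.

```python
from typing import Sequence


def cartesian_shift(index: int, offset: int) -> int:
    # Structured grid: the neighbor is a fixed distance away,
    # so the new index is an affine function of the old one.
    return index + offset


def table_shift(index: int, neighbor: int, table: Sequence[Sequence[int]]) -> int:
    # Unstructured grid: the neighbor is looked up in a connectivity table,
    # which is a genuine indirection (data-dependent access).
    return table[index][neighbor]


def strided_shift(index: int, neighbor: int, max_neighbors: int) -> int:
    # Regular connectivity (assumed formula): the neighbor index is
    # computed from a fixed stride, so no table lookup is needed.
    return index * max_neighbors + neighbor


# Hypothetical connectivity: row i lists the neighbors of element i.
connectivity = [[1, 2], [2, 0], [0, 1]]

print(cartesian_shift(3, 1))
print(table_shift(1, 0, connectivity))
print(strided_shift(2, 1, 3))
```

With separate operators, only `table_shift` involves a data-dependent access. The other two are plain index arithmetic, which a backend could in principle translate into memlet subsets directly.
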