# [DaCe] Overhaul of Redundant Array Removal Transformations
<!-- VEB Optimizer -->
- Shaped by: Philip
- Appetite (FTEs, weeks): ??
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
The problems with the redundant array removal transformations that ship with DaCe are wide and varied, but the most important points are:
- They are buggy; the main problem is that they either ignore `other_subset` or handle it in a very naïve way.
- They do not apply in all cases where we would need them, even in relatively simple ones.
- Sometimes they are just plain wrong, because the involved `subset`s are not handled correctly.
- They do not really propagate the new strides into the nested SDFG.
- They sometimes create Views[^noteAboutViews].
- They are hard to debug.
For these reasons most, but not all, of the heavy lifting is already done by GT4Py-specific transformations.
Thus, there are only a few places left where the transformations shipped by DaCe are actually needed.
As a side remark, the GT4Py transformations have their fair share of problems as well, although they are a bit less severe.
For example, they are somewhat optimized for the full specialization case.
Furthermore, one has to distinguish between removing redundant copies on the top level and redundant copies within Maps.
This project only focuses on the removal of top level copies, as the ones inside Maps usually only involve scalars or arrays with a static size of about 6, which, according to the assembly, are handled by the compiler.
Thus, handling top level copies is much more important than handling the ones inside Maps (ignoring the by now famous counter example of `apply_diffusion_in_w_...`; but that is another matter).
However, transformations should still be able to handle both cases, if possible.
A contributing factor to the large importance of redundant array transformations was the decision that, in order to keep the lowering simple, it should not try to avoid each and every unnecessary (top level) copy.
This was motivated because DaCe, as a dataflow optimization framework, should be rather good at removing these excess copies, especially given the fact that we do not have Views or other crazy global updates.
This turned out not to be the case.
Therefore, this project has two goals:
1. Create a GT4Py replacement of the "built-in" DaCe transformations that we need.
If time permits, some of the less specialized transformations might be merged back into DaCe, but this is an optional stretch goal with the lowest priority.
2. Refactor the GT4Py transformations.
The main reason is that some of them are old and have experienced some mission creep, some even have become obsolete.
Furthermore, most of the replacements (first point) should not be handled by creating new transformations, but by extending (in the sense of implementing the `TODO`s) transformations that are already in GT4Py.
## Appetite
Due to the nature of the project and the kind of transformations, it is hard to give an estimate.
However, it is likely that 6-8 weeks will be enough, although there is no guarantee.
Still, given our experience with the stock DaCe transformations and the importance of redundant array removal, this project is something we want no matter what, at least in the midterm.
## Solution
The aim of this project is to increase the quality of the redundant array removal transformations in GT4Py and to get independent of DaCe in that regard. But first of all one should get familiar with [ADR18](https://github.com/GridTools/gt4py/blob/main/docs/development/ADRs/next/0018-Canonical_SDFG_in_GT4Py_Transformations.md) which governs the structure of SDFGs that we expect/allow/care about.
In short, we have SDFGs that are kind of in SSA style, which means that every transient is written only once (for every transient there is only one `AccessNode` with an incoming degree $\not= 0$) and every location of global data is written only once (which means that there can be multiple `AccessNode`s with an incoming degree $\not= 0$ for global data, but `G[i, j]` is written only once and never overwritten).
> While ADR18 asserts this behaviour for global data, some parts of the toolchain do not fully adhere to it, meaning that in the same SDFG certain locations of global data are written multiple times.
> This _is_ an issue of the GT4Py toolchain, but it is probably simpler to drop this requirement from ADR18.
> However, for that one would have to check the entire transformation pipeline.
> Thus the _current_ recommendation is to not rely on this assumption in any work carried out here.
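The write-once property for transients can be sketched as a small check. The tuple representation below is a hypothetical stand-in for DaCe's `AccessNode`s; real code would walk the states and use the graph's in-degree:

```python
from collections import Counter

def check_write_once(writes):
    """`writes` lists one (data_name, is_transient) pair for every
    AccessNode with incoming degree != 0. Returns the transients that
    violate the SSA style, i.e. those written by more than one node."""
    counts = Counter(name for name, is_transient in writes if is_transient)
    return sorted(name for name, n in counts.items() if n > 1)

# `tmp` is written by two AccessNodes -> violates the canonical form;
# the global `G` may legitimately have several writing AccessNodes.
violations = check_write_once(
    [("tmp", True), ("tmp", True), ("G", False), ("G", False)]
)
```

Note that, in line with the caveat above, the check deliberately ignores multiple writes to global data.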
### Replacing Stock DaCe Transformations
The "solution"/approach here is quite simple.
First, one disables calling them in [`gt_simplify()`](https://github.com/GridTools/gt4py/blob/ecab48ebfafc11d27c6e4167b86fa18649227aa5/src/gt4py/next/program_processors/runners/dace/transformations/simplify.py#L127) and then starts to look for the problems.
There is one issue, though: while the stencils in the dycore can quite easily be checked, simply by timing them before and after, this is harder for the green line only stencils, such as the interpolation stencils.
While they are not "directly" performance relevant, as they are called only once, they should nevertheless be fast.
Furthermore, they provide an important source of patterns that the optimizer should be able to handle.
However, to be fair, until now nobody has really looked at them, except when they fail in CI.
What could be done here is a semi-automatic inspection.
I.e. run the green line before and after disabling the redundant array transformations and then use a script that crawls through the resulting SDFGs and compares, for example, the
- numbers of states,
- numbers of Maps,
- numbers of top level `AccessNode`s and their sizes.
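Such a comparison script could be sketched as follows; the summary dicts and metric names are hypothetical, a real script would extract these numbers from the SDFG files produced before and after the change:

```python
def compare_summaries(before, after):
    """Report metric deltas (after - before) for the metrics above."""
    metrics = ("num_states", "num_maps", "num_top_level_access_nodes")
    return {m: after[m] - before[m] for m in metrics}

# Hypothetical summaries of one stencil before/after disabling the
# stock DaCe redundant array transformations.
before = {"num_states": 12, "num_maps": 7, "num_top_level_access_nodes": 30}
after = {"num_states": 12, "num_maps": 7, "num_top_level_access_nodes": 35}
delta = compare_summaries(before, after)
```

A nonzero delta (here five extra top level `AccessNode`s) would flag a stencil for manual inspection.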
As noted above, it is expected that the majority of cases can be handled by simply extending/perfecting the transformations that we already have, and that new transformations will probably be the exception.
### Refactoring Transformations
The second aspect of the project is about the refactoring of transformations.
We will now discuss the transformations that are there.
Keep in mind that they should also be extended.
Note that the listing below does not have a particular order.
##### `DistributedBufferRelocator`
This is one of the few transformations that acts on multiple states.
It is also very old, dating back to before structured control flow regions were a thing, and it has not fully incorporated that change.
This transformation is mostly important for lower specialization levels, i.e. when we do not know which `if` branch will be selected.
On the highest specialization level, it is not particularly useful.
Nevertheless, it is something that should be done correctly.
On a high level the transformation takes code that looks like:
```python=
if bool_condition:
    tmp = expensive_foo(...)
else:
    tmp = expensive_bar(...)
...  # Other awesome code.
global_result = tmp
```
and rewrites it to:
```python=
if bool_condition:
    global_result = expensive_foo(...)
else:
    global_result = expensive_bar(...)
...  # Other awesome code.
```
I.e. instead of computing the result inside a transient and then copy the transient in the global, it is more efficient to directly write the result in the global data.
Obviously this can only be done if certain constraints are fulfilled, such as (not exhaustive):
- Everything that is computed and stored in `tmp` is also copied to `global_result` (there is `MapSplitter`, which can be used to adjust the size of the "generating Map"; it has some restrictions, but it could be used as a template to overcome or lessen this restriction).
- `tmp` is not read anywhere else (by ADR18 it can not be written to anywhere else).
- Between the `if` and the location where `tmp` is copied into the result, `global_result` is not used.
The main issue is that currently it generates something more along the lines of:
```python=
if bool_condition:
    tmp = expensive_foo(...)
    global_result = tmp
else:
    tmp = expensive_bar(...)
    global_result = tmp
...  # Other awesome code.
```
The main reason for that is that it detects the usage of `tmp` in the other branch and thus assumes that `tmp` has to be kept alive.
This must be improved.
As written above, this transformation is not important for high specialization levels, but something that should be done right.
As such its priority is medium.
##### `MultiStateGlobalSelfCopyElimination` & `MultiStateGlobalSelfCopyElimination2`
As their names indicate, both transformations essentially perform the same operation, but in slightly different ways.
They handle the case where global data is copied into a transient in one state and in another state that transient is copied back to the same global data.
ADR18 guarantees that these copies are pointwise, i.e. there is no shift performed (shifts are always performed by Maps; See also the `MapToCopy` transformation).
The main difference is that `MultiStateGlobalSelfCopyElimination` looks for this case by focusing on the global data, while `MultiStateGlobalSelfCopyElimination2` uses the transient data to find it.
A note on history: the transformation was originally invented to allow state fusion in certain cases.
Then, due to a later change to ICON4Py, another case emerged which could not be handled by the original transformation.
It was first tried to update it, but this proved difficult, and it was thus decided to create `MultiStateGlobalSelfCopyElimination2` as a "temporary" solution.
The obvious goal here is to unify them into a single transformation.
However, it should be scheduled at the end as there are more important things that should be taken care of.
As an alternative, one could also aim at improving state fusion[^noteGT4PyStateFusion].
##### `SingleStateGlobalDirectSelfCopyElimination`
This is a very simplistic transformation that removes copies of global data into itself, i.e. self assignments.
ADR18 guarantees us that the copy is pointwise, thus in the expression `g[a:b] = g[c:d]` we are guaranteed that `a == c` and `b == d` holds.
There is actually nothing much to do there.
##### `SingleStateGlobalSelfCopyElimination`
This is very similar to the `SingleStateGlobalDirectSelfCopyElimination` and the multi state version, but a bit different and actually more powerful.
It essentially matches the pattern `(G) -> (T) -> (G)`, i.e. some data is copied from a global data container into a transient and then back into the same global data.
ADR18 guarantees us that there is no shift so in principle the transient and the copies themselves are redundant (which was the first iteration).
The issues start if `T` is also written to (and read) by other sources.
The transformation then tries to figure out if the three nodes behave as one node and merges them (second iteration).
Or, if this would generate a cycle, it tries to figure out if it can remove the write backs, i.e. some parts of the connections of the write edge from `T` to `G` (third iteration).
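The merge case can be sketched with plain Python lists; the container names, sizes, and the extra write are illustrative stand-ins for the actual SDFG nodes:

```python
# (G) -> (T) -> (G): the global is copied into a transient, another
# source also writes into T, and T is written back into G.
G = list(range(8))
T = list(G)            # (G) -> (T): pointwise copy into the transient
T[2:4] = [42, 43]      # another source writing into T
G_after = list(T)      # (T) -> (G): write back

# Merged form: the three nodes behave as one, so the other source can
# write directly into G and the transient disappears.
G_merged = list(range(8))
G_merged[2:4] = [42, 43]

assert G_after == G_merged
```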
##### `DoubleWriteRemover`
A relatively simple transformation that targets the following pattern:
```python=
for i in range(N):
    temp[i] = ...
A[:, slice1] = temp
A[:, slice2] = temp
```
which it rewrites to:
```python=
for i in range(N):
    temp_scalar = ...
    A[i, slice1] = temp_scalar
    A[i, slice2] = temp_scalar
```
The transformation is actually relatively good.
Note this transformation currently also applies inside Maps, which is technically not wrong but also not super [useful](https://github.com/GridTools/gt4py/pull/2445) and it should be changed.
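The equivalence of the two forms can be checked with a small sketch, using nested lists as a stand-in for `A` and concrete columns `0` and `2` in place of `slice1`/`slice2` (the filler expression `2.0 * i` is likewise illustrative):

```python
N = 4
A_before = [[0.0] * 4 for _ in range(N)]
A_after = [[0.0] * 4 for _ in range(N)]

# Original pattern: fill the temporary array first ...
temp = [0.0] * N
for i in range(N):
    temp[i] = 2.0 * i
# ... then copy it twice into A (column-wise copies, written as loops).
for i in range(N):
    A_before[i][0] = temp[i]
    A_before[i][2] = temp[i]

# Rewritten pattern: a scalar suffices, both writes happen in the loop.
for i in range(N):
    temp_scalar = 2.0 * i
    A_after[i][0] = temp_scalar
    A_after[i][2] = temp_scalar

assert A_before == A_after
```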
##### `CopyChainRemover`
This transformation was originally designed to handle `concat_where()` expressions.
Such expressions look similar to the following image.

The sub expressions are computed and stored into temporaries, `T` and `S`, before they are written to the result on the whole domain, `F`.
The transformation now changes this such that `T` and `S` are bypassed and the results are written directly into `F`.
Thus the basic assumption of this transformation was that everything that is computed and, by extension, written into `S` and `T` is then copied into `F`.
A main restriction of this design is that `T` and `S` must be transients.
At a later stage the transformation was extended to address the situation where `T` or `S` are non-transient data.
While this transformation is generally good, there was some mission creep and the integration of the two modes (which the transformation calls "PULL" and "PUSH" mode) could be a bit cleaner.
Thus, this transformation does not have very high priority.
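The bypass can be sketched with plain lists; the split point and the computed expressions are hypothetical stand-ins for the `concat_where()` sub-expressions:

```python
# Via temporaries: the sub-expressions are stored in T and S first,
# then copied into the full-domain result F.
T = [float(i) for i in range(5)]          # lower part of the domain
S = [float(i * i) for i in range(5, 10)]  # upper part of the domain
F_via_tmp = T + S

# After CopyChainRemover: T and S are bypassed and the results are
# written directly into F.
F_direct = [0.0] * 10
for i in range(5):
    F_direct[i] = float(i)
for i in range(5, 10):
    F_direct[i] = float(i * i)

assert F_via_tmp == F_direct
```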
##### `RemovePointwiseViews`
This is a relatively simple transformation that removes Views; however, currently it only acts inside Maps.
It is the only transformation in GT4Py that removes Views, and since such a transformation is important, it could serve as a starting point for a general purpose version.
##### `GT4PyMapBufferElimination`
This is a very old transformation; it matches the pattern `MapExit --> (Transient) --> (GlobalData)`, i.e. a Map that writes into a transient that is then copied into global data.
This pattern is essentially the same as for `CopyChainRemover` and in fact this transformation is much older.
Thus this transformation should be removed; all cases where it applied should be (and probably already are) handled by `CopyChainRemover`.
##### `RemoveAccessNodeCopies`
This is a special purpose transformation that is needed to handle a special case that arises in the vertical solvers.
Nothing needs to be done here as solving it in general is hard.
For completeness, this transformation matches `(G) --> (T1) --> (T2) --> (G)` and eliminates the copying.
It is needed because of limitations of the other transformations.
### Integrating Them in `gt_simplify()`
Because it is impossible to inject transformations into DaCe's simplify pass, we created our own version of it, which is essentially a wrapper around it that:
- Performs some pre-processing steps, such as inlining nested SDFGs.
- Calls DaCe's simplify function, where all parts are disabled that we do not need.
- Performs some post-processing steps.
The steps above are repeated until a fix point has been reached.
The fix point is detected by comparing hashes, which has the side effect of clearing some internal caches and prevents some errors.
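The fix-point loop can be sketched as follows; the function names are hypothetical, and the toy `step` merely stands in for one pre-processing/simplify/post-processing round:

```python
def run_until_fixpoint(obj, step, hash_of, max_iter=100):
    """Apply `step` to `obj` until `hash_of(obj)` stops changing."""
    prev_hash = None
    for _ in range(max_iter):
        cur_hash = hash_of(obj)
        if cur_hash == prev_hash:
            break  # nothing changed since the last round -> fix point
        prev_hash = cur_hash
        step(obj)
    return obj

def halve_evens(xs):
    """Toy 'pass': halve every even entry in place."""
    xs[:] = [x // 2 if x % 2 == 0 else x for x in xs]

data = [8, 3, 6]
run_until_fixpoint(data, step=halve_evens, hash_of=tuple)
```

In the real wrapper the hash is the SDFG hash, whose computation also clears the internal caches mentioned above.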
Currently, most of the redundant array removal transformations we have are run by passing them to `sdfg.apply_transformations_repeated()`, which is convenient, but also carries the risk of nondeterministic behaviour, i.e. the order might influence which node remains.
This has become more important because names have become, and will increasingly become, important to guide other transformations, see [this project](https://hackmd.io/SuiIF_DpQ9G7kzSmCuMAxg).
However, (sadly) we do not aim for 100% determinism (lack of resources), thus we should only invest time where there is reason to believe that it matters.
For example, `SingleStateGlobalDirectSelfCopyElimination` is irrelevant in this regard.
Furthermore, the order in which `DoubleWriteRemover` is applied should be irrelevant.
A transformation where order is important is `CopyChainRemover`.
As its name implies, it was made to remove chains of copies; while from a functional point of view it is irrelevant where the transformation starts, it is important to ensure that the same node remains at the end.
Thus we need to create a special matcher for (at least) this transformation.
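One way such a matcher could ensure a deterministic survivor is to compute, per chain of pointwise copies, the sink node that must remain regardless of match order; the edge-list representation and function name below are hypothetical:

```python
def chain_survivors(copy_edges):
    """Given (src, dst) pairs of pointwise-copy edges, return the nodes
    that receive a copy but never feed one, i.e. the chain sinks. These
    are the nodes that must survive, independent of the order in which
    the matches are applied."""
    sources = {src for src, _ in copy_edges}
    return {dst for _, dst in copy_edges if dst not in sources}

# Two chains: a -> b -> c and x -> y; the survivors are c and y.
survivors = chain_survivors([("a", "b"), ("b", "c"), ("x", "y")])
```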
## Rabbit holes
Integrating the transformations and making them stable might be a challenge on its own, due to the wide variety of patterns that we might encounter.
But as outlined above, handling every conceivable pattern is not needed.
Furthermore, we do not aim for general purpose transformations, we target the GT4Py case which is simple.
Thus, there are no strange dimension reordering operations (except adding or removing trivial ones).
## No-gos
Creating `Views` is strictly forbidden.
They are not handled properly by [DaCe](https://github.com/spcl/dace/issues/1612) nor by [GT4Py](https://github.com/GridTools/gt4py/pull/2334) and introduce aliasing and other nasty stuff.
They are only a convenient tool at the source code level.
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
- [x] Subtask A
- [x] Subtask X
- [ ] Task 2
- [x] Subtask H
- [ ] Subtask J
- [ ] Discovered Task 3
- [ ] Subtask L
- [ ] Subtask S
- [ ] Task 4
<!------------------------------------------>
[^noteAboutViews]:
While Views are convenient for programmers, they are completely useless and actually a liability at the DaCe level.
This is because they destroy the SSA property of the SDFG, since writing to data container `X` can now lead to an update of container `Y`.
One always has to trace everything, and data can now be aliased.
[^noteGT4PyStateFusion]:
GT4Py has its own state fusion transformation, although it is not a full replacement of the stock DaCe version.
Instead it exploits the particular structure of SDFG generated by GT4Py, see [ADR18](https://github.com/GridTools/gt4py/blob/main/docs/development/ADRs/next/0018-Canonical_SDFG_in_GT4Py_Transformations.md) for more.
The main reason for its creation was that DaCe lacked some analysis steps.