[DaCe] Miscellaneous Tasks

# [DaCe] Miscellaneous Tasks

- Shaped by: Philip
- Appetite (FTEs, weeks):
- Developers:

## Problem

The optimization pipeline has accumulated a lot of technical debt over the last years, and some of it has become a real issue and blocker. For some of these issues a dedicated project has been shaped, but for others that does not really make sense. This document contains a list of things that should or could be done. Another way of seeing it is as a project that aims to shape other projects.

> Currently the order of the tasks in this document does not reflect any particular priority.

##### Look At The GPU Transformations

The GPU transformations have become outdated. Most of it is only a wrapper around the one supplied by DaCe, i.e. `sdfg.apply_gpu_transformations()`, although our transformation massages the SDFG further. In particular, it also runs `TrivialGPUMapElimination`, an obscure transformation that handles the particular case where a trivial Map is created because an operation is performed on a GPU scalar. This should by now be handled [differently](https://github.com/GridTools/gt4py/blob/badbc3d9df5839c46a81caa632c89a24fcb762f3/src/gt4py/next/program_processors/runners/dace/transformations/gpu_utils.py#L86), but it must be checked. The other parts are mostly fine, although some small improvements here and there are probably possible.

##### Iteration and Stride Order

Currently the iteration order is set in a very crude way, which assumes that CPU and GPU have different memory allocation schemes. The lowering does not specify any memory order, so DaCe's default, C (row-major) order, is used. For CPU, the assumption is that the global memory, i.e. the arguments, is also allocated in this way and thus no change is needed. In GPU mode the assumption is that the outside memory is allocated in FORTRAN (column-major) order.
Thus the optimizer changes the [iteration order of the Maps](https://github.com/GridTools/gt4py/blob/badbc3d9df5839c46a81caa632c89a24fcb762f3/src/gt4py/next/program_processors/runners/dace/transformations/auto_optimize.py#L774)[^noteIterationOrder] and the [strides](https://github.com/GridTools/gt4py/blob/badbc3d9df5839c46a81caa632c89a24fcb762f3/src/gt4py/next/program_processors/runners/dace/transformations/auto_optimize.py#L789) accordingly. For GPU this currently works (if you are working on it, please check it) and Philip vaguely remembers that it also once worked for CPU. However, there seems to be no special casing in GT4Py's memory allocation, so it is probably wrong for CPU. This should be checked.

It is important to note that the scheme we currently apply is not good. Ideally one would compute the optimal strides and iteration order based on the compute pattern and the strides of the arguments. While desirable, this is not really feasible, so we should just check that we do the right thing for CPU. As a special case, one should also check what happens if an input, or better a transient, has three dimensions.

##### Update Stride Propagation

GT4Py has utilities to propagate strides into nested SDFGs. These are important when data is replaced, e.g. during redundant array removal, and the new data layout needs to be propagated into nested SDFGs. While the current implementation has proved useful, its dimension matching is rather simple. In the meantime [`associate_dimensions()`](https://github.com/GridTools/gt4py/blob/751e64c192afaf53720aafa8fb841cd66291ca9f/src/gt4py/next/program_processors/runners/dace/transformations/utils.py#L653) has been introduced, which addresses this problem in a more stable way. Furthermore, the utilities should also propagate the strides to [`View`s](https://github.com/GridTools/gt4py/pull/1784); although we should get rid of them, they might still be there.
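
To make the two layouts from the iteration and stride order discussion concrete, here is a minimal sketch in plain Python. It is not GT4Py or DaCe code: the helper names `c_strides`, `fortran_strides` and `iteration_order` are hypothetical, strides are counted in elements, and the rule "the innermost loop runs over the dimension with the smallest stride" is an assumption about what the optimizer is trying to achieve.

```python
from typing import Sequence, Tuple


def c_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides for C (row-major) order: the last dimension is contiguous."""
    strides = [1] * len(shape)
    for dim in range(len(shape) - 2, -1, -1):
        strides[dim] = strides[dim + 1] * shape[dim + 1]
    return tuple(strides)


def fortran_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides for FORTRAN (column-major) order: the first dimension is contiguous."""
    strides = [1] * len(shape)
    for dim in range(1, len(shape)):
        strides[dim] = strides[dim - 1] * shape[dim - 1]
    return tuple(strides)


def iteration_order(strides: Sequence[int]) -> Tuple[int, ...]:
    """Dimension indices from outermost to innermost loop.

    Assumption: the dimension with the smallest stride should be iterated
    innermost, so that consecutive iterations touch neighbouring memory.
    """
    return tuple(sorted(range(len(strides)), key=lambda dim: strides[dim], reverse=True))


# The three-dimensional case mentioned above, for an array of shape (4, 5, 6).
shape = (4, 5, 6)
row_major = c_strides(shape)
column_major = fortran_strides(shape)
print(row_major, iteration_order(row_major))
print(column_major, iteration_order(column_major))
```

For the shape `(4, 5, 6)` the C strides are `(30, 6, 1)` and the FORTRAN strides are `(1, 4, 20)`, so the preferred loop nests are exactly reversed. This is why assuming the wrong layout for the arguments on CPU would make the chosen iteration order counterproductive.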

##### Getting Rid Of `View`s

`View`s are hard to handle and their support in GT4Py and DaCe is not stellar. Furthermore, while they simplify something in one place, they make a lot of other things considerably more complicated. Thus we should get rid of them completely, i.e. never generate them (the lowering currently generates them in one place) and consider the existence of a `View` an error. See also the [[DaCe] Overhaul of Redundant Array Removal Transformations](/dB_3vSsqTYiscnlnUXbc-w) project.

##### Make `MapFusionHorizontal` More Powerful by Fusing According to Size

Currently, we can only fuse Maps if they have the same range. While this is a real limitation for symbolic sizes, it is not necessarily one if the ranges are fully known. If the bounds are numbers, it is enough that the ranges have the same size, i.e. `endIdx - startIdx` is the same for both ranges. Consider the following situation:

```python
for i in dace.map[2:12]:
    z[i] = foo(i, ...)
for j in dace.map[1:11]:
    y[j] = bar(j, ...)  # No dependency on `z`
```

In this case we can apply the transformation:

```python
for l in dace.map[0:10]:
    i = 2 + l
    z[i] = foo(i, ...)
    j = 1 + l
    y[j] = bar(j, ...)
```

:exclamation: In certain cases this could also be done for the vertical case, but it is probably not worth implementing the logic to figure out whether we have a case we can handle. In such cases the ["dataflow inliner"](https://hackmd.io/H2mXKtePQIaa35RHTylQ9w) is most likely a more suitable choice.

##### Better Array Removal Unspecialized Mode

See the project [[DaCe] Overhaul of Redundant Array Removal Transformations](/dB_3vSsqTYiscnlnUXbc-w), especially [this section](https://hackmd.io/dB_3vSsqTYiscnlnUXbc-w#DistributedBufferRelocator).
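
Returning to the size-based fusion proposed for `MapFusionHorizontal`: the check and the index shift can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the transformation itself. The names `common_range`, `run_separately` and `run_fused` are hypothetical, ranges are half-open `(start, stop)` pairs with integer bounds, and the map bodies are replaced by stand-in tuples.

```python
from typing import Dict, Optional, Tuple

Range = Tuple[int, int]  # half-open [start, stop), integer bounds only


def common_range(first: Range, second: Range) -> Optional[Tuple[int, int, int]]:
    """Return (size, offset_first, offset_second) if both ranges have the same size.

    Returns None if the sizes differ or the ranges are empty, i.e. if this
    kind of fusion does not apply.
    """
    size_first = first[1] - first[0]
    size_second = second[1] - second[0]
    if size_first != size_second or size_first <= 0:
        return None
    return size_first, first[0], second[0]


def run_separately(first: Range, second: Range) -> Tuple[Dict[int, tuple], Dict[int, tuple]]:
    """Reference: two independent loops, one per original Map."""
    z = {i: ("foo", i) for i in range(*first)}
    y = {j: ("bar", j) for j in range(*second)}
    return z, y


def run_fused(first: Range, second: Range) -> Tuple[Dict[int, tuple], Dict[int, tuple]]:
    """Fused version: one loop over [0, size) with per-Map offsets."""
    fused = common_range(first, second)
    if fused is None:
        raise ValueError("ranges differ in size and cannot be fused this way")
    size, offset_i, offset_j = fused
    z: Dict[int, tuple] = {}
    y: Dict[int, tuple] = {}
    for l in range(size):
        i = offset_i + l
        z[i] = ("foo", i)
        j = offset_j + l
        y[j] = ("bar", j)
    return z, y


# The ranges from the example: [2:12] and [1:11] both have size 10.
assert run_fused((2, 12), (1, 11)) == run_separately((2, 12), (1, 11))
```

The sketch only covers fully numeric bounds. For symbolic bounds the size comparison would have to be done symbolically, which is exactly the case where the limitation remains.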

##### Move Towards Free Functions

Some transformations do useful things; however, because they are implemented as `SingleStateTransformation`s they are, in some instances, [hard to use elsewhere](https://github.com/GridTools/gt4py/blob/badbc3d9df5839c46a81caa632c89a24fcb762f3/src/gt4py/next/program_processors/runners/dace/transformations/split_access_nodes.py#L304). It would be simpler to reuse this functionality if some of it were moved into free functions.

What is not so clear yet is what should happen with the `can_be_applied()` functions. Should they also be turned into free functions and, if so, all of them? The best approach is probably to split the functionality into "can the transformation be applied on a _structural_ level" and "is it beneficial to apply the transformation". The first check should probably be turned into a free function, while the second kind of check should remain inside the transformation.

##### Refactor Everything

In the course of the last 1.5 years the optimizer has accumulated a lot of technical debt and is in dire need of a big refactoring. For example, the name `MapSplitter` might be accurate, but we should rename it.

## Appetite

At some point we will need it.

## Solution

## Rabbit holes

## No-gos

## Progress

- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
    - [x] Subtask A
    - [x] Subtask X
- [ ] Task 2
    - [x] Subtask H
    - [ ] Subtask J
- [ ] Discovered Task 3
    - [ ] Subtask L
    - [ ] Subtask S
- [ ] Task 4

[^noteIterationOrder]: As a small side remark, there is no clear definition of the order in which DaCe's code generation emits the loops/kernels for a multi-dimensional Map. We currently rely on one particular behaviour of the code generator.