[DaCe] Extended Fusion/Inlining Transformation

# [DaCe] Extended Fusion/Inlining Transformation  - Shaped by: Philip - Appetite (FTEs, weeks): - Developers:  ## Problem The canonical situation for Map fusion is: ```python= for i, j in map[0:N, 0:M]: T[i, j, ...] = foo(i, j, ...) for k, l in map[0:N, 0:M]: out[k, l, ...] = bar(k, l, T[k, l, ...], ...) ``` Which can then be fused to: ```python= for i, j in map[0:N, 0:M]: t[...] = foo(i, j, ...) out[k, l, ...] = bar(k, l, t[...], ...) ``` While in most cases `t` is a scalar, it is also possible that it is an array, usually with a small size though. This implies that two Maps can only be fused if the following is true: _one iteration of the first Map must generate all data than an iteration of the second Map needs_. A corollary from this is, that the two Maps need to have the same iteration space. However, in the following example (1D laplap) the Maps can not be fused: ```python= for i in map[1:(N-1)]: lap[i - 1] = A[i - 1] - A[i + 1] for i in map[2:(N-2)]: laplap[i - 2] = lap[i - 1] - lap[i + 1] ``` The first and less severe issue is that the Maps have a different ranges. The second and more important problem is, that the data dependencies can not be meet, since the second loop needs `lap[i - 1]` and `lap[i + 1]` which are generated by two different iterations. However, it is known that in the case of the `laplap` fusing them, i.e. elimination of the transient, would be better. Another pattern, in which this "inlining" seems to be beneficial are some [interpolation function](https://github.com/spcl/icon-dace/blob/734248a8c5d0ef27a35949d546a8fb7394a19192/src/atm_dyn_iconam/mo_solve_nonhydro.f90#L1979). In `icon4py` the corresponding stencil is [_compute_contravariant_correction_of_w](https://github.com/C2SM/icon4py/blob/85bfae43edc28d1fa48406774a8829e01a0689de/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/compute_contravariant_correction_of_w.py#L17). However, the pattern there the pattern is slightly different, but actually more illustrative: ```python= for i, k in map[0:N, 0:K]: w[i, k] = calculate_weight_at(i, k, ...) for i, k in map[0:N, 1:K]: out[i, k] = foo(w[i, k], w[i, k - 1], ...) ``` It turns out that on GPU the following pattern is better: ```python= for i, k in map[0:N, 1:K]: w_m0 = calculate_weight_at(i, k, ...) w_m1 = calculate_weight_at(i, k - 1, ...) out[i, k] = foo(w_m0, w_m1, ...) ``` Thus, we need a transformation that is able to transform the first situation into the second version. ## Appetite I would say it is medium hard. A first version should be simple, to properly extend, such that the cases we need are handled and to integrate it into the pipeline is more difficult. A first working solution (~80% solution) should take about 2 weeks. ## Solution The solution is to create something like an "inlining" or "replicating" Map fusion. The process to do this, is actually relatively simple under some condition, such as that the Map only has a single output, that is used in one other Map. These restrictions can be lifted, but will increase the complexity of the transformation. The approach is simple and on a conceptual level follows what was done in the "interpolation example" above. In the following however, we will look at the LapLap example that is shown as SDFG below. ![1D LapLap](https://hackmd.io/_uploads/rkUxPzmuxx.png) For this the body of the first Map should be seen as a function or in DaCe lingo, as a nested SDFG. Therefore, the first step is to wrap the body of the first Map into a nested SDFG, which is a relative simple task. In the LapLap example the nested SDFG/function only has the input `a`. It is important that in the following we assume that the whole container is passed into the nested SDFG (in `C` speak this means that the base pointer is passed as argument), this means that all accesses are done inside the nested SDFG. However, the nested SDFG also has additional arguments, besides the symbols that the body needs, such as strides and so on, the Map parameters of the first Map have to be included ever time. In the second step we add this nested SDFG to the second Map. However, it is important that this is done twice, once for `t[j-1]` and once for `t[j+1]`. While passing field connectors, in the example `a` and the symbols such as strides to the nested SDFG the Map parameters are a bit more difficult. For this one has to understand _how_ the data is written. From the image we see that the subset that writes to `t` is `i - 1` and if we want to read for example `j + 1` we must solve the equation $\mathtt{i} - 1 = \mathtt{j} + 1$ for $\mathtt{i}$, i.e $\mathtt{i} = \mathtt{j} + 2$. This has to be done for every dimension. In the example above this task is trivial, however, it might be more complicated in some cases, especially if there are `max()` expressions introduced by `concat_where`s, but they should be limited. There are also a few other things that have to be considered. The main important question is when do we call the transformation? It has to happen somewhere during the first phase of the auto optimizer, where the top Maps are fused. However, it can not happen too late, where the Maps have become too big, i.e. they already have multiple outputs, but also not too early, since the Maps are too small then. The most likely solution would be to first run vertical Map fusion with enabled `require_exclusive_intermediates` and `require_only_intermediates` [flags](https://github.com/spcl/dace/pull/2090). ## Rabbit holes The index computation could potentially be quite trouble some. However, we should focus on the GT4Py case, which is most likely simple. Furthermore, the correct integration into the pipeline and the control logic, when to actually apply might not be simple. But this particular aspect should be considered as a secondary task. ## No-gos DaCe has something that is called `OTFMapFusion`, but it is not fully clear if it does that furthermore, it is old and not maintained. We should therefore not spend any time on it. ## Progress  - [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls)) - [x] Subtask A - [x] Subtask X - [ ] Task 2 - [x] Subtask H - [ ] Subtask J - [ ] Discovered Task 3 - [ ] Subtask L - [ ] Subtask S - [ ] Task 4