# [DaCe] Make the Optimizer _less_ Indeterministic
- Shaped by: Philip
- Appetite (FTEs, weeks): 6-8 Weeks
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
In previous cycles we have realized that sometimes there is indeterministic behaviour that leads to suboptimal performance.
An early example was `apply_diffusion_to_w_and_(...)` where, depending in which order parallel instructions _inside_ a kernel were generated, there was a 20% performance penalty (one could argue that this is a bug in the compiler).
However, there is another kind of indeterministic behaviour that stems from the fact that the order in which nodes are processed by a transformation matters.
Here we explicitly restrict ourselves to top level nodes, see [here](https://github.com/GridTools/gt4py/blob/16eb0d5bc735b46ff736f70bdfd141f7780d95d4/src/gt4py/next/program_processors/runners/dace/transformations/auto_optimize.py#L401) for the relevant function in the optimizer.
Known instances where this is the case are:
- `compute_theta_rho_face_values_and_pressure_gradient_and_update_vn`
- `compute_advection_in_vertical_momentum_equation`
The source of indeterministic behaviour are wide and vary and are not fully known, but some have been identified:
- The order in which Maps are fused.
We think that the `*Vertical` operations are less affected than the `*Horizontal` ones, as they are much more restricted.
- The order in which `AccessNode`s are split.
Although, this is partially addressed in [GT4Py PR#2383 (not merged yet)](https://github.com/GridTools/gt4py/pull/2383) at least for cases where the splitter was called directly by the optimizer.
- The order in which Maps are split, i.e. `gt_horizontal_map_split_fusion()` and `gt_vertical_map_split_fusion()`, not the badly named `MapSplitter` transformation.
Furthermore, we think that splitting Maps is more prone to indeterministic behaviour than fusing them.
Again it is believed that horizontal splitting is here is more at risk to lead to indeterministic behaviour.
This is also the reason for the behaviour of `compute_theta_rho_(...)`, see 'Case Study'.
- Transients inside a Map.
In one of our experiments we observed huge fluctuations on the number of transients inside a kernel.
In one case we observed 428 nested AccessNodes and in the other case we observed 434 of them.
Note addressed in this project.
- Order in which redundant arrays are removed.
The main issue here is that depending on the order either one or the other name name might survive.
Note addressed in this project.
The main reason why we also want to make regular Map fusion deterministic is, because this can also lead to a performance gain in the optimizer, because the matching, especially for the horizontal case, is very expensive.
In this project we only aim to make the fusing and splitting process more deterministic.
The goal is not to make everything deterministic and to ensure that the right path is taken.
The main goal of this project is, to remove some sources of indeterministic behaviour in the optimization step that operates on the top level.
As a secondary goal we will also be able to perform some refactoring in a critical part of the code.
### Case Study
Let's look at `compute_theta_rho_(...)` as a concrete example.
The main issue here is, that a Map, which computes `next_vn` on a small subdomain, can be split in different ways.


In the fast case, first image, the Map range in K is `0:35`.
Originally the Map operated over the whole domain, but was split into `0:35` and `35:80`.
The second fragment was then fused into another Map and the first fragment remained.
In the second image, we see the slow case, where the Map was split into `0:5` and `5:80`.
The first Map fragment, `0:5`, was then integrated into another Map and the second, larger, fragment remained.
Thus depending on which split is performed first, we get different results.
## Appetite
Fixing everything would take a very long time.
However, in the reduced scope of this project it should take something like 3 weeks.
## Solution
As stated above, the main goal of this project is not to get the good behaviour back, but to get a more deterministic behaviour, i.e. reduce the indeterministic aspects.
We also combine it with some general but related cleanups.
### Preconditions and Assumptions
For this project to work, we have to make some assumptions, that we considering as preconditions on the lowering _has_ to give us.
- :warning: The labels of Maps must be stable, i.e. initially the Map that surrounds an operation must always have the same name.
Only if the GTIR changes it is allowed to change as well.
- :warning: The labels of Maps must be unique within an entire SDFG, this also includes nested SDFGs
We have to require this because the inliner does not rename them.
- :warning: The names of non-GT4Py-Program-argument-datacontainer (i.e. everything that is _not_ supplied from the outside to a GT4Py program) of an SDFG must be stable and have the same name as long as the GTIR it is constructed from does not change.
- :warning: It would be good, also for debugging, if the non-Program-arguments, see above, are unique within the entire SDFG, including nested SDFGs.
However, by a [not yet merged change in `gt_inline_nested_sdfgs()`](https://github.com/GridTools/gt4py/pull/2385) this is no longer a hard requirement, but remains a desirable property.
> Edoardo is pretty sure that they are fulfilled, but he is currently checking it.
### Preliminary Work
There was already some work done in that regard.
While most of it is not done, the remaining work is small.
- Deterministic Map labels in MapFusion ([DaCe PR#2226](https://github.com/spcl/dace/pull/2226)):
This PR ensures that the label of the fused Map is deterministic regardless of the order in which the Maps are fused together.
- Stable inlining of nested SDFGs ([GT4Py PR#2385](https://github.com/GridTools/gt4py/pull/2385)):
This PR perform the inlining in a fixed deterministic order, which is most importantly for the names of transients.
:exclamation: Not merged.
- Stable splitting of `AccessNodes` ([GT4Py PR#2383](https://github.com/GridTools/gt4py/pull/2383)):
This function makes `gt_split_access_nodes()` stable by processing the nodes in alphabetical order.
It might be possible that this will solve the indeterministic behaviour in `compute_advection_in_vertical_momentum_equation`, but this was not checked.
:exclamation: Not merged.
### Stabilize Horizontal Map Processing
Here we are discussing a way how to make horizontal Map fusing [`MapFusionHorizontal`](https://github.com/GridTools/gt4py/blob/16eb0d5bc735b46ff736f70bdfd141f7780d95d4/src/gt4py/next/program_processors/runners/dace/transformations/auto_optimize.py#L524) and [`gt_horizontal_map_split_fusion()`](https://github.com/GridTools/gt4py/blob/16eb0d5bc735b46ff736f70bdfd141f7780d95d4/src/gt4py/next/program_processors/runners/dace/transformations/auto_optimize.py#L592) deterministic.
While they are very different operation they have some parts in common.
Both look at two Maps and compute if they should be split and how.
Thus, the common part is how to feed them Map pairs in a stable way?
To that end, we realize first that we can process each state separately.
Inside a state, we make use of the [assumption from above](https://hackmd.io/SuiIF_DpQ9G7kzSmCuMAxg?both#Preconditions-and-Assumptions) that the labels of Maps are unique.
Thus, we simply collect all Maps in a state and sort them according to their label.
The question is what happens to the list of sorted Maps that have not been processed yet?
There splitting and fusing now diverge.
For fusing it is possible to keep using the list (when the fact that `HorizontalMapFusion` preserves the first Map, is exploited).
This becomes harder for the splitting.
While reusing the list would be good, it is perfectly okay to just drop it after a split and start anew.
The only thing that should be ensured is that when `canHorizontalMerge(map1, map2)` is known to not applicable then `canHorizontalMerge(map2, map1)` should not be checked.
The current matcher does not do that.
### Stabilize Vertical Map Processing
Here we are discussing a way of how to make vertical Map processing, i.e. [`MapFusionVertical`](https://github.com/GridTools/gt4py/blob/16eb0d5bc735b46ff736f70bdfd141f7780d95d4/src/gt4py/next/program_processors/runners/dace/transformations/auto_optimize.py#L469) and [`gt_vertical_map_split_fusion()`](https://github.com/GridTools/gt4py/blob/16eb0d5bc735b46ff736f70bdfd141f7780d95d4/src/gt4py/next/program_processors/runners/dace/transformations/auto_optimize.py#L575), _more_ stable.
Due to the more restricted pattern vertical Map processing should be handled differently than the horizontal processing.
Since the two Maps are always connected by an `AccessNode` we can base our matching algorithm on the `AccessNode`s.
Thus we first scan for `AccessNode`s in a state and then sort them in a deterministic order.
Although not fully correct in all cases `sorted(listOfAllAccessNodesInState, key=lambda ac: (ac.data, constaingState.in_degree(ac), containingState.out_degree(ac)))` should be good enough due to our "near SSA style" SDFGs.
The approach outlined above works, when the `AccessNode` is written by a single Map and read by a single Map, but will not work if multiple Maps (consumers and/or producers) are involved[^mapFusionProducerNote].
To handle them we need a way to sort process the Maps in a deterministic order.
An idea to handle that can be found in the [`concurrent_subgraphs()` section below](https://hackmd.io/SuiIF_DpQ9G7kzSmCuMAxg?both#concurrent_subgraphs-sink-source_nodes-amp-in-out_edges).
> Note: By carefully handling it might be possible to avoid the rescanning of the state in certain cases.
> While this is desirable, it is not a primary goal of this project and a 80% solution is completely fine.
> So do not waste time on it.
### `concurrent_subgraphs()`, `{sink, source}_nodes()` & `{in, out}_edges()`
> Note this is a stretch goal
These are all helper functions within DaCe and are a possible source of indeterministic behaviour, since the order in which they return things is not fully deterministic.
Currently they mainly depend on the order in which they were inserted (NetworkX is not fully clear on that), but which is not worth much if somewhere a `set` is used.
So it might be a good idea to stabilize them, such that they give the same output.
For that we need to order nodes in a general way, which probably does not exist.
A very simple way could be a tuple with the following structure: `(${TYPE_ID}, ${SELF_ID}, input_degree, output_degree)` (maybe not the best choices and maybe a different order).
In the above `${TYPE_ID}` is used to identify the type, it should probably be `type(node).__name__`.
`${SELF_ID}` is then something that identifies a node, examples what could be used are given here:
- `AccessNode` -> `.data`
- `NestedSDFG` -> `.label`
- `Taskelt` -> `.label`
- `Map{Entry, Exit}` -> `.map.label`
- `LibraryNode` -> ?
The best thing would probably be a utility function that orders nodes in a particular way and can be reused.
Probably this function should unconditionally used in `concurrent_subgraphs()` and `{sink, source}_nodes()`, it is difficult to use them in `{in, out}_edges()` for several reasons:
- There can be multiple connections between nodes.
- It might be expensive to call it every time
### Array Removal
Another source of indeterministic behaviour is the order in which arrays are removed.
The main issue here, is that depending on the order they are applied, i.e. which is the last standing `AccessNode`, and this will influence the processing order even if we use the schemes outlined above.
While important we will ignore that aspect for now as there is [another project regarding this](https://hackmd.io/dB_3vSsqTYiscnlnUXbc-w).
## Rabbit holes
Form a tool chain performance point of view, it might be interesting to avoid rescanning as much as possible.
Furthermore, the ideas outlined above will not guarantee a fully deterministic behaviour, after all the project is called "... _less_ indeterministic ...".
Thus they will not work in all cases, however, the GT4Py case is probably so simple, that we should get away with them.
## No-gos
It is important that this project is not about updating the logic that governs fusion/splitting but about getting a _more_ deterministic behaviour overall.
Note that it is unclear how a success can be verified as there is no way how to explicitly trigger indeterministic behaviour.
Furthermore, the update in Map fusing only targets the fusing of top level Maps and the splitting of them, it is not about the fusing of nested Maps (they have different problems).
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
- [x] Subtask A
- [x] Subtask X
- [ ] Task 2
- [x] Subtask H
- [ ] Subtask J
- [ ] Discovered Task 3
- [ ] Subtask L
- [ ] Subtask S
- [ ] Task 4
<!------------------------------------->
[^mapFusionProducerNote]:
Note `VerticalMapFusion` is on a fundamental level only able to handle one producer, but the node can have multiple consumer (same or different state).
On the other hand the vertical Map splitter can handle multiple producers, actually that is one of the main application of it.