
# [DaCe] Optimize the Optimizer

- Shaped by: Philip
- Appetite (FTEs, weeks):
- Developers:

## Problem

The current DaCe optimizer pipeline of GT4Py is very slow and has become essentially unusable for larger SDFGs. This has to be fixed.

## Solution

First, the pipeline must be profiled. Flame graphs should work well for this, because they also make it possible to distinguish a call to `gt_simplify()` at the very beginning from one made later on. There are multiple ways to obtain them:

- https://github.com/gaogaotiantian/viztracer
- NVTX ranges (these need manual modifications of the code)

From our experience, we expect two main areas to show up in the tracing. We now discuss them in more detail.

### `gt_simplify()`

Note that we discuss the entirety of `gt_simplify()` here, not a specific subpass of the GT4Py simplification pipeline. One of the main reasons for the large runtime is its monolithic design, something inherited from its DaCe counterpart. The current design of `gt_simplify()` is roughly as follows:

- Run some GT4Py-specific pre-cleaning transformations that we need.
- Run vanilla DaCe `simplify`.
- Run some GT4Py post transformations.

The above steps are executed in a loop to ensure that a fixed-point state has been reached. This is actually desirable behaviour, as later optimizations might modify things that blocked previous transformations from applying. However, every time `gt_simplify()` is called, the whole array of transformations is run, regardless of whether there is any chance that they can apply.

`InlineSDFGs` can serve as an example. Running this transformation is crucial for GT4Py. However, once it has run, running it again has no real use: every nested SDFG that was not inlined in a previous call can most likely not be inlined anyway, because it is needed to provide state in a dataflow region, i.e. inside a Map. Thus, it should be excluded from later invocations.
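
The effect can be illustrated with a toy model. The sketch below does not use DaCe or GT4Py: the "graph" is a plain dictionary, `run_to_fixed_point`, `inline_nested` and `fuse_states` are hypothetical names, and the `skip` parameter only mimics the idea of excluding passes. It counts how often each pass is called while looping to a fixed point, once with everything enabled and once with the inlining pass excluded.

```python
from typing import Callable, Dict, Iterable

# A toy "pass": mutates the graph and returns how often it applied.
Pass = Callable[[dict], int]


def run_to_fixed_point(
    graph: dict, passes: Dict[str, Pass], skip: Iterable[str] = ()
) -> Dict[str, int]:
    """Run all non-skipped passes in a loop until none of them applies.

    Returns the number of calls made to each pass.
    """
    skipped = set(skip)
    calls = {name: 0 for name in passes}
    changed = True
    while changed:
        changed = False
        for name, apply in passes.items():
            if name in skipped:
                continue
            calls[name] += 1
            if apply(graph) > 0:
                changed = True
    return calls


def inline_nested(graph: dict) -> int:
    """Toy stand-in for inlining: handles everything inlinable in one go."""
    applied = graph["inlinable_nested_sdfgs"]
    graph["inlinable_nested_sdfgs"] = 0
    return applied


def fuse_states(graph: dict) -> int:
    """Toy stand-in for state fusion: fuses one pair of states per call."""
    if graph["fusable_states"] > 0:
        graph["fusable_states"] -= 1
        return 1
    return 0


passes = {"inline": inline_nested, "fuse": fuse_states}
graph = {"inlinable_nested_sdfgs": 3, "fusable_states": 2}

# First invocation: everything runs; inlining is useful only in the first sweep.
first_calls = run_to_fixed_point(graph, passes)

# A later phase makes more states fusable, but nothing new becomes inlinable.
graph["fusable_states"] = 2
later_calls = run_to_fixed_point(graph, passes, skip={"inline"})

print(first_calls)  # {'inline': 3, 'fuse': 3}
print(later_calls)  # {'inline': 0, 'fuse': 3}
```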
The same is true for other transformations such as `PruneSymbols` and `PruneConnectors`, which are in fact only run to work around an issue in the inlining transformation itself. Thus, there is no need to run some transformations multiple times, while running others multiple times is fine. An example of the latter is probably the (GT4Py) `StateFusion` transformation, which might fail to fuse states because it was unable to analyze the dataflow properly. However, since the number of data containers is reduced during the `_gt_auto_process_dataflow_inside_maps()` phase, it might then be able to analyze the dataflow properly and perform the fusion.

This indicates that we should modularize `gt_simplify()` into smaller pieces, or at least make it simpler to configure for a particular situation. In fact, there is already a simple mechanism for disabling certain parts: `gt_simplify()` provides the `skip` argument, a list of transformations that should be skipped, i.e. not run in that particular call of `gt_simplify()`. However, currently only `GT_SIMPLIFY_DEFAULT_SKIP_SET` is provided. A simple solution would therefore be to provide more defaults and actively use them inside `gt_auto_optimize()`.

Another possibility would be a deeper refactoring, which will take a long time, maybe too long. This is a direction that the DaCe developers want to take, and if this route is chosen, we should actively involve them.

Another issue is that, while DaCe `simplify()` and `gt_simplify()` allow disabling certain transformations, it is not possible to inject or replace transformations from the outside. This is one reason for the current design of `gt_simplify()`.

From this discussion we propose the following actions:

- Modify DaCe's `SimplifyPass` such that it is possible to modify the list of `Pass`es that are run. This does not have to be fancy; an additional argument that defaults to the current list may already be enough.
  This will allow us to use a better design in `gt_simplify()`, i.e. we can run them in one pipeline.
- Then we should make better use of the `skip` argument that is already provided. For this we have to look at the different phases of `gt_auto_optimize()` and figure out which transformations can be run inside them.

However, some aspects are important here:

- Regardless of what is done, there should still be a `gt_simplify()` that runs the entire simplification pipeline, if requested.
- It should remain a fixed-point transformation, or idempotent.

### Rescanning All The Time

Another issue might be that the SDFG is scanned multiple times. One reason is that we do not make use of the pipeline that is provided by DaCe. There are, however, some known issues that block this; see [here](https://hackmd.io/tZ3BKzwNTlWwv81fW2H2ww) and search for `find_successor_state()`, `is_accessed_downstream()` and `_pipeline_results`.

### Vertical/Horizontal Split Map Transformation

There are other problems that affect the Map fusion transformation. While the matching, i.e. the search for candidates, is quite simple for Serial/Vertical Map fusion, it is much more involved for Parallel/Horizontal Map fusion. The reason is that we have to match every Map against every other Map, which is an $\mathcal{O}\left(N^2\right)$ problem. A simple mitigation is to run serial Map fusion first, as this reduces the number of Maps. This is already done in the initial Map fusion process, which is probably good enough.

Another issue comes from the Map fusion transformations that split Maps, i.e. [`gt_horizontal_map_fusion()` and `gt_vertical_map_fusion()`](https://github.com/GridTools/gt4py/blob/ac93ffd89defbf25f596e93e585a61485292125b/src/gt4py/next/program_processors/runners/dace/transformations/map_fusion_extended.py). They currently work by performing the split and then running Map fusion on the entire SDFG.
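
To make the cost of matching over the entire SDFG concrete, the following toy sketch counts candidate pairs. It does not use any DaCe or GT4Py API: Maps are represented by plain strings, and `all_candidate_pairs` and `pairs_touching` are hypothetical helpers. The numbers of Maps (200) and of Maps produced by a split (2) are assumptions chosen purely for illustration.

```python
from itertools import combinations


def all_candidate_pairs(maps):
    """Every unordered pair of Maps, as a matching over the whole SDFG would inspect."""
    return list(combinations(maps, 2))


def pairs_touching(maps, changed):
    """Only the pairs in which at least one Map belongs to `changed`."""
    changed = set(changed)
    return [(a, b) for a, b in combinations(maps, 2) if a in changed or b in changed]


maps = [f"map_{i}" for i in range(200)]
split_result = ["map_0", "map_1"]  # assumed outcome of one split

global_pairs = all_candidate_pairs(maps)
local_pairs = pairs_touching(maps, split_result)

print(len(global_pairs), len(local_pairs))  # 19900 397
```

With 200 Maps, a matching over the whole SDFG inspects 19900 pairs, while only 397 of them involve one of the two Maps produced by the split.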
Instead, they should restrict the fusion transformation to the Maps that were involved in the split. Ioannis already did some initial work on that: https://github.com/iomaganaris/gt4py/tree/improve_horizontal_map_fusion

## Appetite

> Added after betting table

It will take at least one cycle.

## Rabbit holes

See [No-gos](https://hackmd.io/B4GBJaEZRfmFhhgIEt_iIQ?both=#No-gos).

## No-gos

This is not about fixing [currently existing limitations](https://hackmd.io/tZ3BKzwNTlWwv81fW2H2ww) but about improving the performance of the SDFG transformations themselves.

## Progress

- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
    - [x] Subtask A
    - [x] Subtask X
- [ ] Task 2
    - [x] Subtask H
    - [ ] Subtask J
- [ ] Discovered Task 3
    - [ ] Subtask L
    - [ ] Subtask S
- [ ] Task 4