# [DaCe] Library Node for `broadcast + concat_where` Expressions

<!-- Optimizer No Seimei -->

- Shaped by: Philip
- Appetite (FTEs, weeks): 4-5 weeks
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->

## Problem

There is a particular problem when a `concat_where` is used together with a `broadcast` expression. Consider the following code:

```python=
var = concat_where(
    1 < KDim & 3 < EdgeDim,
    foo(...),
    broadcast(0.0, KDim, EdgeDim)
)
```

`broadcast` leads to the materialization of a temporary on the whole domain, which is filled with zeros. Then the section that is actually needed, in this case `[0, 0:3]`, is copied into the result `var`. This is very bad because the whole temporary array has to be materialized. In the current version this is handled in a very crude way by the splitting tools. However, this is more a lucky coincidence than something carefully engineered; it is rather fragile and should be handled in a more explicit way.

## Appetite

Although this project _looks_ simple, the details are hard and might take some time to complete. It is hard to estimate, but it might take 4 to 5 weeks.

## Solution

The solution is to use a specific syntactic construct that the optimizer detects and understands, which in DaCe-speak means a library node. The first step, i.e. the creation of the library node, has already been started, see [GT4Py PR#2386](https://github.com/GridTools/gt4py/pull/2386), but there are still things left to do. First of all the library node is not finalized, but this is actually a minor detail. As long as an instance of the library node exists it has, for syntactical reasons, to write to memory (in the beginning this is always a transient). Then we need at least three transformations.
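To illustrate the idea behind the library node, here is a plain-Python sketch (no DaCe; the class name `BroadcastNode` and its methods are made up for illustration, not the API from PR#2386): instead of writing a full array of zeros, the node only records the scalar and the shape, and materialization becomes an explicit fallback.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the proposed library node: instead of
# allocating a full array of zeros, it only records the scalar value
# and the logical shape of the broadcast.
@dataclass(frozen=True)
class BroadcastNode:
    value: float
    shape: tuple  # (rows, cols)

    def read(self, i, j):
        # Any in-bounds read yields the broadcast scalar, so consumers
        # can be rewired to use `value` directly (the inlining case).
        rows, cols = self.shape
        assert 0 <= i < rows and 0 <= j < cols
        return self.value

    def materialize(self):
        # Fallback used only when the node could not be eliminated;
        # this corresponds to the "expanding transformation" below.
        rows, cols = self.shape
        return [[self.value] * cols for _ in range(rows)]

bc = BroadcastNode(0.0, (8, 8))
assert bc.read(3, 5) == 0.0
assert bc.materialize()[7][7] == 0.0
```

The transformations described next are, in this picture, the rules that decide when a `read` can replace the array access and when `materialize` is unavoidable.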
The first transformation "inlines" the library node. In Python terms, the following code shows how the example above would look without the splitting tools:

```python=
for i, j in dace.map[0:N, 0:M]:
    a[i, j] = value_to_broadcast
for i, j in dace.map[hstart:hend, vstart:vend]:
    out[i, j] = foo(i, j, a[i + 3, j - 2], ...)
```

which is then transformed to:

```python=
for i, j in dace.map[hstart:hend, vstart:vend]:
    out[i, j] = foo(i, j, value_to_broadcast, ...)
```

Please note that in the second loop the ranges and accesses are not pointwise, but since we have a library node we can handle this without a problem. The first thing that has to be checked is whether `a` can be removed, which can be decided by a simple check of its degree and whether it is single-use data. Another issue is if the intermediate, i.e. `a`, is involved in a neighbourhood access. This aspect is hard to handle, because we would need to modify the Tasklet that performs the access. Since this case is very unlikely, we will ignore it.

The second transformation is concerned with splitting the output of the library node. In code this would transform

```python=
a[:, :] = value_to_broadcast
b = foo(a[hstart1:hstop1, vstart1:vstop1], ...)
c = bar(a[hstart2:hstop2, vstart2:vstop2], ...)
```

into

```python=
b = foo(value_to_broadcast, ...)
c = bar(value_to_broadcast, ...)
```

It is important to note that the slices of `a` read by `foo()` and `bar()` can overlap. This transformation is needed to solve the original issue, i.e. that the broadcast results in a field covering the entire domain; the field has to be cut down to the range where it is actually used/needed. It is essentially needed to handle all cases where inlining could not be applied directly. These kinds of transformations can make use of what already exists, i.e. `SplitAccessNode` and `SplitConsumerMemlet`.

The third transformation is the expanding transformation, which is already included in Edoardo's PR.
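The applicability check for inlining can be sketched in plain Python. This is a toy model, not DaCe's API: the graph representation, the function name `can_inline_broadcast`, and all parameter names are made up for illustration. It captures the three conditions named above: single producer, single-use transient, and no neighbourhood access.

```python
# Toy dataflow model: `consumers` and `producers` map a data name to the
# set of nodes reading from / writing to it. All names are illustrative.
def can_inline_broadcast(consumers, producers, intermediate,
                         transients, neighbourhood_reads):
    # The intermediate must be written only by the broadcast node and
    # read by exactly one consumer (the degree check from the text) ...
    if len(producers[intermediate]) != 1:
        return False
    if len(consumers[intermediate]) != 1:
        return False
    # ... it must be single-use transient data ...
    if intermediate not in transients:
        return False
    # ... and it must not feed a neighbourhood access, which would
    # require rewriting the Tasklet (the case the pitch ignores).
    if intermediate in neighbourhood_reads:
        return False
    return True

consumers = {"a": {"foo_map"}}
producers = {"a": {"broadcast"}}
assert can_inline_broadcast(consumers, producers, "a",
                            transients={"a"}, neighbourhood_reads=set())
# The neighbourhood-access case is rejected instead of handled:
assert not can_inline_broadcast(consumers, producers, "a",
                                transients={"a"},
                                neighbourhood_reads={"a"})
```

The real check would operate on SDFG access nodes rather than string names, but the shape of the decision is the same.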
However, this transformation should only be needed in the following cases:

- When we write directly to global memory.
- If there is a `concat_where` whose (transient) intermediate can not be split; this happens for example if the result is used in a neighbourhood access expression, in which case we have to write to the transient.

### Integration Into The Optimizer

The main question is how to integrate it into the current optimizer. The operation belongs into the first phase, i.e. top level data flow optimization, but the question is where to put it there. It is relatively clear that the whole thing should not run in the last iteration, because we might need to process the Maps that were created by the expansion. It is also relatively clear that we should run it after Map fusion has run at least once[^notAboutMapFusion], but it probably has to run only once. However, the following things are not so clear and need further investigation:

- Should it run before or after the splitting tools, or both?
- When should we expand the library nodes that could not be eliminated?

## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->

## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->

## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
  - [x] Subtask A
  - [x] Subtask X
- [ ] Task 2
  - [x] Subtask H
  - [ ] Subtask J
- [ ] Discovered Task 3
  - [ ] Subtask L
  - [ ] Subtask S
- [ ] Task 4

<!--------------------------------------------------->
[^notAboutMapFusion]: There is some background knowledge needed.
The optimizer has different stages; one of the first is the top level stage, which operates only on nodes at the top level. In that phase several steps are performed repeatedly until a fixed point is reached.
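The fixed-point iteration the footnote refers to can be sketched with a toy driver. Everything here is made up for illustration (the function names, the dict-based state, the two example passes); it only shows the ordering question: the new pass becomes effective once Map fusion has stopped making progress, and the loop as a whole repeats until no pass changes anything.

```python
def top_level_stage(state, passes, max_iters=20):
    # Run the passes repeatedly; after any pass reports a change the
    # loop restarts from the first pass. We stop when a full sweep
    # changes nothing, i.e. when the fixed point is reached.
    for _ in range(max_iters):
        if not any(p(state) for p in passes):
            break
    return state

# Toy passes over a counter-like state, standing in for real SDFG passes.
def map_fusion(state):
    if state["maps"] > 1:
        state["maps"] -= 1
        return True
    return False

def broadcast_inlining(state):
    # Only effective once fusion has exposed the producer/consumer pair,
    # modelling "run it after Map fusion has run at least once".
    if state["maps"] == 1 and state["broadcasts"] > 0:
        state["broadcasts"] -= 1
        return True
    return False

s = top_level_stage({"maps": 3, "broadcasts": 1},
                    [map_fusion, broadcast_inlining])
assert s == {"maps": 1, "broadcasts": 0}
```

Where exactly the splitting tools and the expansion of leftover library nodes slot into this pass list is precisely the open question from the integration section.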