# [GT4Py/DaCe] Fieldview SDFG-transformations
- Shaped by: Philip, Edoardo
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
Currently GT4Py is in the transition from ITIR to GTIR.
The SDFG produced from GTIR is very inefficient for code generation. First of all, the fieldview representation contains many simple maps, one for each field operation, with temporary fields between them. Additionally, some patterns, like neighbor reductions or boundary conditions, can be transformed to generate optimized code.
In order to use DaCe for program optimization, we need transformations and a driver that decides which transformations to apply. In this cycle we will focus on the transformations, to ensure that we can optimize simple programs; this project therefore has an exploratory character.
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
## Solution
In this project we will proceed in the following order.
#### Small Examples
In a first step we will work on the examples that are currently used to test the new DaCe backend.
These are small examples where the combined IR has been written by hand.
Here we test whether the transformations that ship with DaCe work for these cases.
From our experience with the JaCe prototype, we do not expect any major issues at this stage.
#### Nabla4
Nabla4 is a stencil used in ICON's dycore.
The main reason we selected it is that a hand-written version of this stencil was developed in a previous cycle, which can serve as a viable baseline.
However, because the frontend is not yet able to generate a viable interface for the backend, we will have to write this stencil by hand; this should not be a big issue.
A realistic goal for this cycle is to set up a pipeline and to perform first optimizations of this stencil.
An alternative or additional stencil to test would be advection stencil 20 (old name).
#### End of the cycle
Once the work of lowering from the GT4Py frontend to SDFG is done in https://hackmd.io/OSw9YiwcQImPqpU46FyoKw, apply this pipeline also to the selected combined program [velocity advection stencil 1 to 7](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/fused_velocity_advection_stencil_1_to_7.py).
## Rabbit holes
Solving the full optimization problem, i.e. which transformations to apply and where.
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Initial Merge ([PR#1594](https://github.com/GridTools/gt4py/pull/1594))
Currently under review.
## Requirements on SDFG
> This document has been superseded by ADR0018, which was part of [PR#1594](https://github.com/GridTools/gt4py/pull/1594).
We distinguish between intrastate optimization, i.e. the optimization of the data flow within a state, and interstate optimization, i.e. the optimization between states, which is not limited to state fusion but includes all transformations whose goal is to _promote_ state fusion.
For this project, and for the time being, we focus on intrastate optimization and rely on DaCe, especially its simplify pass, for interstate optimization.
The only restriction we impose on global memory is:
1. Global memory is either used as input (only read from) or as output (only written to) but never for both.
- [Rationale]: Allows removing double buffering that DaCe may not remove on its own, since it cannot verify that it is not needed.
- [Note]: GT4Py wants to support this style; however, for compatibility with FORTRAN, there is the following amendment to this rule: "The same global memory is allowed to be used as input and output at the same time, iff the output depends _elementwise_ on the input." This formulation allows writing expressions such as `a += 1` with only one buffer for `a`. Phrased more technically: using global memory for both input and output is allowed if the two computations `tmp = computation(global_memory); global_memory = tmp;` and `global_memory = computation(global_memory);` are equivalent.
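The amendment above can be sketched in plain Python (a toy model, not GT4Py or DaCe code; `in_place_apply`, `buffered_apply` and the kernels are hypothetical names used only for illustration):

```python
def in_place_apply(field, kernel):
    # global_memory = computation(global_memory): overwrites field[i]
    # while later iterations still read field, i.e. reusing the same
    # global memory for input and output.
    for i in range(len(field)):
        field[i] = kernel(field, i)
    return field

def buffered_apply(field, kernel):
    # tmp = computation(global_memory); global_memory = tmp
    tmp = [kernel(field, i) for i in range(len(field))]
    return tmp

add_one = lambda f, i: f[i] + 1             # elementwise: "a += 1"
shift = lambda f, i: f[(i + 1) % len(f)]    # reads a neighbor: NOT elementwise

a = [1.0, 2.0, 3.0]
# Elementwise: both variants agree, so in-place use of global memory is fine.
assert in_place_apply(list(a), add_one) == buffered_apply(list(a), add_one)
# Neighbor access: the in-place variant reads already-overwritten values.
assert in_place_apply(list(a), shift) != buffered_apply(list(a), shift)
```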
For the SDFG state machine we assume that:
2. After we start our intrastate optimization pipeline, the number of states remains fixed.
- [Rationale]: See rule 8.
- [Note]: If we have an iterative process, i.e. run interstate and intrastate optimization repeatedly, then the number of states may change between two intrastate optimization runs, as long as an interstate optimization run happens in between.
3. An interstate edge can only access scalars but not arrays (even if they have shape (1,)), i.e. use them in assignments or conditions.
- [Rationale]: If an array is also used in interstate edges, it becomes very tedious to verify whether we can remove it (see rule 8).
- [Note]: Running simplify might actually result in a violation of this rule, but see rule 6 for more information.
4. The state graph does not contain any cycles, i.e. implementing a for/while loop using states is not allowed; the new loop construct or serial maps must be used instead.
- [Rationale]: This is a simplification that makes it much simpler to define what "downstream" really means.
- [Note]: Currently the code generator does not support the `LoopRegion` construct, so it is transformed into a state machine. This is not an issue yet, as the construct is not used.
The rules we impose on transients are a bit more complicated; however, while they sound restrictive, they are very liberal.
It is important that these rules only have to be met after simplify was called on the SDFG:
5. Downstream of a write access, i.e. in all states that follow the state the access node is located in, there is no other access node that is used to write to that array.
- [Rationale 1]: Together with rule 8, this rule essentially ensures that assignments in the SDFG follow SSA style, while still allowing expressions such as:
```python
a: Array = None
if cond:
    a = true_branch()
else:
    a = false_branch()
```
(**NOTE:** This could also be done with references; however, they are strongly discouraged.)
- [Rationale 2]: This still allows reductions with WCR, as they write to the same access node, and loops whose body modifies a transient that outlives the loop body, as they use the same access node. (**NOTE:** WCR is not handled by the fusion transformation and blocks fusion.)
6. Every access node that reads from an array that was not written to in the same state must be a source node.
- [Rationale]: Together with rules 2, 3, 7 and 8 this simplifies checking whether a transient can be safely removed or is used somewhere else. It guarantees that the set of transients that cannot be removed (because they transmit data from one state to another) remains constant during intrastate optimization of an SDFG, and is given by all source transient nodes in all states plus the scalars. If a transient is only used within a state, rule 8 guarantees that it has multiple outgoing edges.
- [Note]: To prevent issues caused by simplify violating rule 3, the set is extended with the transient sink arrays: if such nodes are not removed by simplify, they are used somewhere else, either as a source node in another state or in an interstate edge.
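As a toy illustration of this bookkeeping (an assumed, simplified model that represents each state as a list of producer/consumer edges between data names; this is not the DaCe API):

```python
def must_keep(states, transients, scalars):
    """Set of data that must survive intrastate optimization: the scalars
    plus every transient that is a source node in some state, extended
    (per the note above) with the transient sink nodes."""
    keep = set(scalars)
    for edges in states:
        writes = {dst for _, dst in edges}
        reads = {src for src, _ in edges}
        sources = reads - writes   # read but never written in this state
        sinks = writes - reads     # written but never read in this state
        keep |= (sources | sinks) & set(transients)
    return keep

# Two states: state 0 computes tmp1 from input a via the purely local
# tmp0; state 1 consumes tmp1. Only tmp1 crosses a state boundary.
states = [
    [("a", "tmp0"), ("tmp0", "tmp1")],
    [("tmp1", "b")],
]
print(must_keep(states, {"tmp0", "tmp1"}, {"cond"}))  # tmp0 is removable
```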
7. It is _recommended_ that a write access node only have one incoming edge.
- [Rationale]: This case is handled poorly by some DaCe transformations, thus we should avoid it as much as possible.
8. No two access nodes in a state can refer to the same array. (Note: an SDFG can still be constructed using different access nodes for the same underlying data; simplify will combine them.)
- [Rationale]: Together with rule 5 this guarantees SSA style.
For maps we assume the following:
9. The names of map variables (iteration variables) follow these conventions:
- 9.1: All map variables iterating over the same dimension (disregarding the actual range) have the same deterministic name, which includes `gtx.Dimension.value`.
- 9.2: The names of horizontal dimensions (`kind` attribute) always end in `__gtx_horizontal`.
- 9.3: The names of non-local dimensions never end in `__gtx_localdim`.
- 9.4: The names of local dimensions always end in `__gtx_localdim`.
- [Rationale 1]: Without this rule it is very hard to tell which map variable does what; this way we can transmit information from GT4Py to DaCe. See also rule 10.
- [Rationale 2]: Renaming is extremely difficult and error-prone in DaCe, so the easiest approach is to construct the map variables in such a way that we do not have to rename them.
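A hypothetical helper (not part of GT4Py; the exact naming scheme, the `i_` prefix, and the kind strings are assumptions for illustration) shows how such deterministic names could be derived:

```python
def map_variable_name(dim_value, kind):
    # dim_value mirrors gtx.Dimension.value; kind mirrors the dimension's
    # `kind` attribute ("horizontal", "vertical" or "local" are assumed
    # spellings). The suffixes are the ones mandated by rules 9.2-9.4.
    name = f"i_{dim_value}"
    if kind == "horizontal":
        return name + "__gtx_horizontal"
    if kind == "local":
        return name + "__gtx_localdim"
    return name  # non-horizontal, non-local: no reserved suffix

print(map_variable_name("Cell", "horizontal"))  # i_Cell__gtx_horizontal
print(map_variable_name("K", "vertical"))       # i_K
print(map_variable_name("V2E", "local"))        # i_V2E__gtx_localdim
```

Because the name depends only on the dimension, two maps iterating over the same dimension automatically get identical variable names, which is exactly what rule 10 relies on.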
10. Two map ranges, i.e. pairs of map/iteration variable and range, can be merged if they have the same name and cover the same range.
- [Rationale]: Because of rule 9, we will only fuse maps that actually make sense to fuse.
- [Note]: This rule might be dropped in the future.
11. If the names of the iteration variables differ, the maps can never be merged, even if they cover the same range. (This is a simplification that works for the moment and might be lifted later; however, it is hard to get right.)
- [Rationale]: To make rule 10 stronger.
- [Note]: This rule might be dropped in the future.
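Rules 10 and 11 together can be sketched as a simple mergeability predicate (assumed semantics, illustrative only; real map fusion in DaCe involves more checks):

```python
def can_merge(params_a, params_b):
    """params_*: dict mapping iteration-variable name -> (start, stop)."""
    if set(params_a) != set(params_b):
        return False  # rule 11: different names are never merged
    # rule 10: same names must also cover the same ranges
    return all(params_a[p] == params_b[p] for p in params_a)

ij = {"i_Cell__gtx_horizontal": (0, 100), "i_K": (0, 80)}
assert can_merge(ij, dict(ij))
assert not can_merge(ij, {**ij, "i_K": (0, 60)})  # same names, ranges differ
# Same ranges but different names (rule 11): still not mergeable.
assert not can_merge(ij, {"i_Edge__gtx_horizontal": (0, 100), "i_K": (0, 80)})
```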
## Ioannis
Together with Ioannis I looked at the GPU code (source and assembly) and we came to the following conclusions:
- DaCe is faster than the naive version (which is comparable to the GTFN code) because it uses `restrict`, which allows the compiler to do more optimizations, especially loading through the non-coherent cache. The code is probably also better organized.
- DaCe is slower than the fast GPU version, henceforth called kloop, because it does not do k blocking, i.e. there is no inner serial loop over k.
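The k-blocking pattern, an inner serial loop over a block of k levels per thread, can be illustrated with a plain Python sketch (a toy model with a trivial kernel; the real stencils run on the GPU and reuse neighbor loads across the inner loop):

```python
def stencil_no_blocking(inp):
    # One "thread" per (column, k) point: no inner serial loop,
    # which is how the current DaCe-generated code behaves.
    return [[2.0 * v for v in column] for column in inp]

def stencil_k_blocked(inp, k_block=4):
    # kloop style: each "thread" handles a whole block of k levels.
    out = [[0.0] * len(col) for col in inp]
    for c, column in enumerate(inp):            # parallel dimension
        for k0 in range(0, len(column), k_block):
            # Inner serial loop over one k block: in a real stencil,
            # values loaded here can be reused across iterations.
            for k in range(k0, min(k0 + k_block, len(column))):
                out[c][k] = 2.0 * column[k]
    return out

field = [[float(k) for k in range(10)] for _ in range(3)]
# Both variants compute the same result; only the loop structure differs.
assert stencil_k_blocked(field) == stencil_no_blocking(field)
```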