# [DaCe] ITIR vs. Field View II -- Optimizing Jax-SDFG

###### tags: `cycle 20`

- Shaped by: Philip
- Appetite (FTEs, weeks):
- Developers: Philip (?Edoardo)

## Problem

During our [previous project](https://hackmd.io/@gridtools/BynVNnAu6) we performed a first benchmark, which [showed](https://docs.google.com/presentation/d/1z1_6fy_A3pBJPq0kbUdfHw0kySyJedMHr3eQdJyd_EQ/edit#slide=id.g2b8a4951582_23_6) that ITIR outperforms Jax based SDFGs.
After a very fruitful [discussion with Lex and Ben (SPCL)](https://hackmd.io/@gridtools/B1IiqScsT) we came to the conclusion that, while the ITIR based SDFGs are already pretty good, improving them further will prove very hard.
Therefore we decided to freeze the current ITIR based SDFG implementation and consider it as the target/gold standard that we want to reach and beat with the Jax based SDFGs.

Thus the goal of this project is to improve the performance of the Jax based SDFGs.
Current SDFGs generated from Jax look similar to the image below, which shows the `Calculate_Nabla_4` stencil. This stencil was selected as a guinea pig for the initial phase of the project.

![Jax SDFG of Nabla4](https://hackmd.io/_uploads/Bk7CHBsi6.png)

The above SDFG can be roughly divided into two parts.
The lower part contains a large Map (`0:30720, 0:65`) that covers all edges and K levels.
The upper part contains several different Maps.
Most of them are one dimensional over the range `0:30720` (essentially they represent computations that are independent of K); however, they are not fused together.
We assume that the existence of some additional `0:4` dimensions (which are generated by advanced indexing, for example `prop[C2E]`) prevents DaCe from fusing them together.

##### ITIR Subproblem

It seems that there is still a bug in the ITIR to SDFG translator.
The bug manifests itself as an invalid memory access.

## Appetite

At least a cycle.

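To make the extra `0:4` dimension from the Problem section concrete, here is a minimal plain-Python model of what advanced indexing such as `prop[C2E]` does to the iteration space. It uses neither Jax nor DaCe. The names `conn` and `prop`, the sizes and the values are all made up for illustration, and the assumption of four neighbours per element is taken only from the `0:4` range seen in the SDFG.

```python
# Plain-Python model of the extra `0:4` dimension; no Jax or DaCe involved.
# `conn` and `prop` are hypothetical stand-ins with made-up sizes and values.

NUM_ELEMENTS = 3    # stands in for 30720
NUM_NEIGHBOURS = 4  # the extra `0:4` dimension

# Hypothetical connectivity table: four neighbour indices per element.
conn = [
    [0, 1, 2, 3],
    [2, 3, 4, 5],
    [4, 5, 0, 1],
]
prop = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

# Advanced indexing `prop[conn]` is a gather whose result has the shape
# (NUM_ELEMENTS, NUM_NEIGHBOURS), i.e. its iteration space is `0:3, 0:4`.
gathered = [
    [prop[conn[i][n]] for n in range(NUM_NEIGHBOURS)]
    for i in range(NUM_ELEMENTS)
]

# Reducing over the neighbour dimension is one dimensional again (`0:3`),
# like the other K-independent computations.
summed = [sum(row) for row in gathered]

print(gathered)
print(summed)
```

The gather is the only step whose range differs from the surrounding one dimensional computations, which matches the assumption about why the Maps are not fused.
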
## Solution

Since we are still in the prototype phase we might not choose the most general approach, but the fastest and simplest one.
Thus, if we can make the initial SDFG easier to digest for DaCe by modifying the Jax to SDFG translator, we will do that.
However, we also want to take this opportunity to get to know the DaCe transformation framework better.

What we basically do is reformulate the "Map Space", see picture above, to make it more performant.
During the CSCS+SPCL meeting we came up with the following four patterns that we want to try.

##### 1) 2 Kernel Launches 1D + 2D

This approach involves making two big Maps that run one after the other.
The upper/first Map would cover the range `0:30720`, while the second one would go over `0:30720, 0:65`.
In essence we try to refine the structure that is already visible, in its infancy, in the above picture.

###### Approach

Looking at the image above we think that the reason DaCe does not fuse the Maps properly is that some of them have this additional `0:4` iteration dimension.
These dimensions are generated by advanced indexing.
To overcome this issue we could do one of the following:

- Unroll these dimensions.
- Transform Maps such as `Map[..., it=0:4]` to `Map[...] { Map[it=0:4, Seq] }`, i.e. turn the additional dimension into a sequential submap.

##### 2) 2D Kernel Launch

This is basically what ITIR does.
In essence it means that every Map is modified in such a way that it has the same range, which would allow DaCe to put everything into one single Map.

###### Approach

An idea would be to wrap the Map inside another Map, i.e. `Map[iEdge=0:30720] { WORK(iEdge) } -> Map[iEdge=0:30720, iK=0:65] { WORK(iEdge) }`.
The work would not depend on the additional parameter `iK` of that Map space, but it should trick DaCe into fusing it.
Let's hope that it is this easy.

##### 3) 1D Kernel Launch with Inner Sequential Loop

We would start a GPU kernel that goes over all edges.
Inside this kernel we would have a `for`-loop, thus we would go from the Map `Map[0:30720, 0:65; GPU]`, that we have seen in the second approach, to two Maps `Map1[0:30720; GPU] { Map2[0:65; Seq] }`.

###### Approach

Achieving this should be simple once we have reached the second pattern, because we just have to split the big Map.

##### 4) 2D Kernel Launch with K-Blocking

Essentially a combination of the second and third idea with some refinement, which leads to the following Maps: `Map1[itE=0:30720, itK=0:65:KBlock; GPU] { Map2[itKInner=itK:(itK+KBlock), Seq] }`.
Note that this is already a very deep optimization strategy, thus it has the lowest priority.

###### Approach

It will probably come naturally once we have solved patterns 2 and 3, since DaCe already has a tiling transformation.

## Rabbit holes

Solving the general problem is very hard, so we are focusing on the stencils that we selected in the previous cycle.

Writing DaCe transformations is the final goal.
But it could be quite hard, thus it might be advisable to do the modifications directly inside the Jax to DaCe translator.
Furthermore, we will try to schedule a training session with one of the SPCL guys to give us a crash course in transformations (it is less about the transformations themselves and more about the infrastructure that DaCe has for them, which is quite extensive and complicated).

### DaCe Transformation Best Practices

At the DaCe Meeting (2024-02-22) a set of best practice rules was discussed:

* Best Practices
    * Use existing infrastructure (pattern matching, subgraph)
    * GPU best practices guide
    * Access Nodes
        * Don't just remove, have to check (view, descriptor)
        * Modifying data descriptor/access without checking -> easy failures
        * Can I remove this? Is this used somewhere else?
            * Maybe have API for it
    * Utility passes: State reachability, access sets
        * `analysis.py` (`StateReachability`, `AccessSets`, `FindAccessNodes`, `AliasesWith`)
        * Could use pipeline to make use of utility passes
* What to avoid
    * Legacy approaches?
    * Pattern Matching: Don't modify what you're not matching
        * What you modify has to be fully part of the matching

## No-gos

Writing a DaCe transformation that could be added to the DaCe repo.

## Progress
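
As a reference for the work ahead, the four Map patterns from the Solution section can be modelled as plain Python loop nests. This is only a sketch of the iteration spaces, not DaCe code: `work_1d`, the tiny sizes and `K_BLOCK` are hypothetical stand-ins, and the sequential/GPU schedules are only indicated in comments. It shows that all four patterns compute the same result and differ only in how the loops are arranged.

```python
# Plain-Python model of the four Map patterns; no Jax or DaCe involved.
NUM_EDGES = 6  # stands in for 30720
NUM_K = 5      # stands in for 65
K_BLOCK = 2    # hypothetical block size; deliberately does not divide NUM_K


def work_1d(i_edge):
    """Hypothetical K-independent work (upper part of the SDFG)."""
    return 2.0 * i_edge


def empty_field():
    return [[0.0] * NUM_K for _ in range(NUM_EDGES)]


def pattern_1():
    # 1) Two launches: Map[0:NUM_EDGES], then Map[0:NUM_EDGES, 0:NUM_K].
    pre = [work_1d(e) for e in range(NUM_EDGES)]
    out = empty_field()
    for e in range(NUM_EDGES):
        for k in range(NUM_K):
            out[e][k] = pre[e] + k
    return out


def pattern_2():
    # 2) One 2D launch: the K-independent work is wrapped into the 2D Map,
    #    so it is recomputed for every K level.
    out = empty_field()
    for e in range(NUM_EDGES):
        for k in range(NUM_K):
            out[e][k] = work_1d(e) + k
    return out


def pattern_3():
    # 3) 1D launch over edges with an inner sequential loop over K.
    out = empty_field()
    for e in range(NUM_EDGES):      # Map1[0:NUM_EDGES; GPU]
        pre_e = work_1d(e)
        for k in range(NUM_K):      # Map2[0:NUM_K; Seq]
            out[e][k] = pre_e + k
    return out


def pattern_4():
    # 4) 2D launch over (edge, K block) with a sequential loop per block.
    out = empty_field()
    for e in range(NUM_EDGES):
        for k0 in range(0, NUM_K, K_BLOCK):
            pre_e = work_1d(e)
            # Clamp the block end so the last, shorter block stays in range.
            for k in range(k0, min(k0 + K_BLOCK, NUM_K)):
                out[e][k] = pre_e + k
    return out


reference = pattern_1()
print(all(p() == reference for p in (pattern_2, pattern_3, pattern_4)))
```

Since `K_BLOCK = 2` does not divide `NUM_K = 5`, the last block is shorter, which is why the inner range is clamped with `min`. The inner Map range `itK:(itK+KBlock)` from pattern 4 would need a similar guard unless `KBlock` divides the number of K levels.
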