# Cartesian GTC: CUDA optimizations and plain-CUDA GTC backend
###### tags: `cycle 1`
shaped by: Hannes, Felix
## Goals
For best performance, we need to look for optimizations beyond GridTools C++. We therefore introduce a new plain-CUDA backend in GTC together with dedicated optimization passes.
The goals of this project are:
1. port known optimizations from GT4Py (continuation of the previous project [GT4Py passes to GTC migration](https://hackmd.io/@gridtools/r1Dyrw6vP))
2. add a plain-CUDA backend
3. implement more optimizations to achieve decent GTBench performance with the plain-CUDA backend
This project is interesting for several reasons:
- This project is a natural next step in transitioning from the old GT4Py backends to GTC-based backends. It will help to solidify the IR design and transfer knowledge to more people.
- We continue moving optimizations from the current GT4Py to GTC, which is a requirement for deprecating the old backends. All Cartesian GT4Py users (i.e. Vulcan, FVM-LAM) will profit immediately from this project.
- GTBench was used as an example application in the recent DaCe exploration project. This project can serve as a comparison of DaCe- and Eve-based optimizations and can help identify which optimizations are best tackled in which framework.
- Vulcan showed interest in a plain-CUDA backend and already has a Dawn-inspired implementation based on the old GT4Py infrastructure.
## Appetite
full cycle
possible developers:
- Felix
- Hannes?
- contributions from Anton (previous backend work for Dawn)
- maybe contributions from Eddie
## Solution
### 1. Porting of existing GT4Py optimizations (50%+)
#### Remove unnecessary temporaries
The naive implementation of the GridTools parallel model introduces many intermediate temporaries. Remove them!
Add extensive testing for the critical patterns (e.g. self assignment).
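The following minimal sketch (plain loops with invented names, not GT4Py/GTC code) shows both the removable case and why the self-assignment pattern needs care:
```cpp
// Naive lowering: the assignment goes through a temporary that is copied back.
void naive_lowering(const float *in, float *out, float *tmp, int n) {
    for (int i = 0; i < n; ++i)
        tmp[i] = in[i] + 1.0f;
    for (int i = 0; i < n; ++i)
        out[i] = tmp[i]; // tmp is unnecessary here: out can be written directly
}

// "Self assignment": the written field is also read with an offset, so all reads
// must see the values from before the update. Writing into a directly would make
// iteration i read the already-updated a[i - 1], so this temporary must stay.
void self_assignment(float *a, float *tmp, int n) {
    for (int i = 1; i < n - 1; ++i)
        tmp[i] = a[i - 1] + a[i + 1];
    for (int i = 1; i < n - 1; ++i)
        a[i] = tmp[i];
}
```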
#### Merge HorizontalExecutions and VerticalLoops
Implement the StageMerger optimizations from GT4Py.
Keep the GridTools parallel model in mind (e.g. implement a checker before the OIR → GTCpp lowering).
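The legality constraint can be illustrated with plain loops (invented names, not actual OIR/GTCpp output):
```cpp
// Stage 2 reads tmp only at the same (i, j) point, so both statements can live
// in one merged horizontal execution / loop body.
void merged_stages(const float *a, float *tmp, float *out, int ni, int nj) {
    for (int j = 0; j < nj; ++j)
        for (int i = 0; i < ni; ++i) {
            tmp[j * ni + i] = 2.0f * a[j * ni + i];
            out[j * ni + i] = tmp[j * ni + i] + 1.0f; // same-point read: merge is legal
        }
}

// If stage 2 instead read tmp with a horizontal offset, e.g.
//     out[j * ni + i] = tmp[j * ni + (i - 1)];
// it would need values of the *completed* first stage at neighboring points, so the
// checker has to reject the merge (or the first stage's compute domain must be extended).
```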
#### Demote temporaries
See Eddie's work.
Lower the dimensionality of temporaries (3D → 2D, 3D → scalar). This needs extensions of the IRs and code generators.
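As a hedged sketch of what demotion means in the generated code (invented kernel, assuming the producing and consuming stages were already merged):
```cpp
__global__ void demoted_temporary(const float *a, const float *b, float *out, int ni, int nj) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= ni || j >= nj)
        return;
    const int idx = j * ni + i;
    // before: tmp was a full 3D temporary field, written in one stage and read in the next
    // after:  it is produced and consumed at the same point, so 3D -> scalar (a register)
    const float tmp = a[idx] + b[idx];
    out[idx] = 2.0f * tmp;
}
```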
### 2. Plain-CUDA backend (~20%)
Implement a SID-based plain-CUDA backend, building on Anton's work for Dawn and possibly on extensions from Vulcan's work.
Several variations:
- Standard GridTools pattern `warpsize x lines + halo warps`
- J-scan + shuffles (see below)
By default, we would apply J-scan + shuffles for all pure horizontal stencils and the standard GridTools pattern for all vertical stencils.
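As a rough, non-authoritative sketch of the thread-to-domain mapping behind the standard pattern (halo warps and extents are elided, all names are invented):
```cpp
__global__ void standard_pattern(const float *in, float *out, int ni, int nj, int nk) {
    // blockDim.x == warpSize covers 32 i-points, blockDim.y == the number of j-lines;
    // additional warps of the block (not shown) would compute the halo points needed
    // by stages with non-zero extents.
    const int i = blockIdx.x * warpSize + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= ni || j >= nj)
        return;
    for (int k = 0; k < nk; ++k) { // each thread column marches over k
        const int idx = (k * nj + j) * ni + i;
        out[idx] = in[idx] + 1.0f;
    }
}
```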
### 3. GTBench optimizations (~30%)
#### K-caches
In a first step, implement k-caching per interval (i.e. the caches can be restarted for each interval, as is done in DaCe).
The ring buffer can be implemented with a C++ helper struct; however, AMD compilers might not like it.
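One possible shape of that helper, as a non-binding sketch (invented names; written as a shift register rather than a rotating index so the entries can stay in registers):
```cpp
// Caches the k-levels [MinusOffset, PlusOffset] around the current level.
template <class T, int MinusOffset, int PlusOffset>
struct k_cache {
    static constexpr int size = PlusOffset - MinusOffset + 1;
    T data[size];

    // Access relative to the current k-level, e.g. at<-1>() for k - 1.
    template <int Offset>
    __device__ T &at() {
        static_assert(MinusOffset <= Offset && Offset <= PlusOffset, "offset outside cache");
        return data[Offset - MinusOffset];
    }

    // Advance to the next k-level of a forward solve: shift all entries by one slot;
    // the caller then refills at<PlusOffset>() from memory.
    __device__ void slide() {
#pragma unroll
        for (int i = 0; i < size - 1; ++i)
            data[i] = data[i + 1];
    }
};

// Usage inside the forward k-loop (field_at/out_at are placeholders):
//   cache.slide();
//   cache.at<1>() = field_at(i, j, k + 1);  // fill the newly exposed slot
//   out_at(i, j, k) = cache.at<-1>() + cache.at<0>() + cache.at<1>();
```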
Analysis: Detect that the k-cache pattern is met.
As an improvement (nice to have), grouping of VerticalLoops could be implemented. Currently each oir.VerticalLoop contains exactly one Interval. To apply k-caching over the full k-axis we would need to build VerticalLoop groups that span the whole axis.
#### J-scan + shuffle
This strategy is expected to work well for _all_ horizontal stencils, including those with costly math function calls (like pow, exp) for which the current GT gpu_horizontal backend does not perform well.
- Analysis pass to detect that the pattern is applicable to a stencil.
- Implement a variation of the plain-CUDA backend that uses this pattern (see the sketch below).
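A minimal, hedged sketch of the idea on a 5-point Laplacian (invented kernel; assumes one warp per block, `ni` a multiple of `warpSize`, and ignores halo exchange at warp boundaries):
```cpp
__global__ void laplacian_jscan(const float *in, float *out, int ni, int nj, int k_stride) {
    const unsigned mask = 0xffffffffu;
    const int i = blockIdx.x * warpSize + threadIdx.x; // a warp covers 32 consecutive i-points
    const int base = blockIdx.z * k_stride;            // one k-level per block in z

    float jm1 = in[base + 0 * ni + i]; // j - 1 line, kept in a register
    float jc  = in[base + 1 * ni + i]; // current j line
    for (int j = 1; j < nj - 1; ++j) {
        const float jp1 = in[base + (j + 1) * ni + i]; // next j line

        // i-neighbors of the current line come from the neighboring lanes.
        const float im1 = __shfl_up_sync(mask, jc, 1);   // value at i - 1
        const float ip1 = __shfl_down_sync(mask, jc, 1); // value at i + 1
        // NOTE: lanes 0 and 31 just get their own value back; a real backend needs
        // halo handling at warp boundaries (overlapping warps or extra loads).

        if (i > 0 && i < ni - 1)
            out[base + j * ni + i] = -4.0f * jc + im1 + ip1 + jm1 + jp1;

        jm1 = jc; // slide the register window in j
        jc = jp1;
    }
}
```
Each j-line is loaded from memory only once, and i-neighbor accesses stay inside the warp without shared memory, which is the intended benefit of the pattern.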
#### Compile-time domain sizes (nice to have)
If time is left, we can explore compile-time domain sizes as an experiment (sketched below).
- Requires modifications to the frontend to propagate the domain sizes *at compile time*
- SID supports compile-time strides.
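A tiny sketch of the effect; the helper below is hypothetical, and the only assumption carried over from above is that SID accepts compile-time strides:
```cpp
#include <type_traits>

// j_stride may be a runtime int or a std::integral_constant<int, N>; in the latter
// case the compiler sees a constant stride and can strength-reduce the multiply.
template <class JStride>
__host__ __device__ float load_ij(const float *field, int i, int j, JStride j_stride) {
    return field[i + j * j_stride];
}

// runtime domain size:       load_ij(ptr, i, j, nj_stride);
// compile-time domain size:  load_ij(ptr, i, j, std::integral_constant<int, 128>{});
```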
#### GTBench performance comparison
Create a small report comparing GTBench performance across the different tools and hardware:
Tools:
- GTBench GridTools
- GTBench4Py DaCe
- GTBench4Py traditional GT4Py
- GTBench4Py from this work (maybe in several variations)
Hardware:
- P100
- A100
- Mi50 (at CSCS)
- Mi100 (at Cray)