# Cartesian GTC: CUDA optimizations and plain-CUDA GTC backend
###### tags: `cycle 1`
shaped by: Hannes, Felix
## Goals
For best performance, we need to look for optimizations beyond GridTools C++. We therefore introduce a new plain-CUDA backend in GTC together with dedicated optimization passes.
The goals of this project are:
1. port known optimizations from GT4Py (continuation of the previous project [GT4Py passes to GTC migration](https://hackmd.io/@gridtools/r1Dyrw6vP))
2. add a plain-CUDA backend
3. implement more optimizations to achieve decent GTBench performance with the plain-CUDA backend
This project is interesting for several reasons:
- This project is a natural next step in transitioning from the old GT4Py backends to GTC-based backends. It will help to solidify the IR design and transfer knowledge to more people.
- We continue moving optimizations from the current GT4Py to GTC, which is a requirement for deprecating the old backends. All Cartesian GT4Py users (i.e. Vulcan, FVM-LAM) will profit immediately from this project.
- GTBench was used as an example application in the recent DaCe exploration project. This project can serve as a comparison of DaCe- and Eve-based optimizations and can help identify which optimizations are best tackled in which framework.
- Vulcan showed interest in a plain-CUDA backend and already has a Dawn-inspired implementation based on the old GT4Py infrastructure.
## Appetite
full cycle
possible developers:
- Felix
- Hannes?
- contributions from Anton (previous backend work for Dawn)
- maybe contributions from Eddie
## Solution
### 1. Porting of existing GT4Py optimizations (50%+)
#### Remove unnecessary temporaries
The naive implementation of the GridTools parallel model introduces many intermediate temporaries. Remove them!
Add extensive testing for the critical patterns (e.g. self assignment).
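The following minimal sketch (plain loops with invented names, not GT4Py/GTC code) shows both the removable case and why the self-assignment pattern needs care:
```cpp
// Naive lowering: the assignment goes through a temporary that is copied back.
void naive_lowering(const float *in, float *out, float *tmp, int n) {
    for (int i = 0; i < n; ++i)
        tmp[i] = in[i] + 1.0f;
    for (int i = 0; i < n; ++i)
        out[i] = tmp[i]; // tmp is unnecessary here: out can be written directly
}

// "Self assignment": the written field is also read with an offset, so all reads
// must see the values from before the update. Writing into a directly would make
// iteration i read the already-updated a[i - 1], so this temporary must stay.
void self_assignment(float *a, float *tmp, int n) {
    for (int i = 1; i < n - 1; ++i)
        tmp[i] = a[i - 1] + a[i + 1];
    for (int i = 1; i < n - 1; ++i)
        a[i] = tmp[i];
}
```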
#### Merge HorizontalExecutions and VerticalLoops
Implement the StageMerger optimizations from GT4Py.
Keep the GridTools parallel model in mind (e.g. implement a checker before the OIR → GTCpp lowering).
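The legality constraint can be illustrated with plain loops (invented names, not actual OIR/GTCpp output):
```cpp
// Stage 2 reads tmp only at the same (i, j) point, so both statements can live
// in one merged horizontal execution / loop body.
void merged_stages(const float *a, float *tmp, float *out, int ni, int nj) {
    for (int j = 0; j < nj; ++j)
        for (int i = 0; i < ni; ++i) {
            tmp[j * ni + i] = 2.0f * a[j * ni + i];
            out[j * ni + i] = tmp[j * ni + i] + 1.0f; // same-point read: merge is legal
        }
}

// If stage 2 instead read tmp with a horizontal offset, e.g.
//     out[j * ni + i] = tmp[j * ni + (i - 1)];
// it would need values of the *completed* first stage at neighboring points, so the
// checker has to reject the merge (or the first stage's compute domain must be extended).
```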
#### Demote temporaries
See Eddie's work.
Lower the dimensionality of temporaries (3D → 2D, 3D → scalar). This needs extensions of the IRs and code generators.
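As a hedged sketch of what demotion means in the generated code (invented kernel, assuming the producing and consuming stages were already merged):
```cpp
__global__ void demoted_temporary(const float *a, const float *b, float *out, int ni, int nj) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= ni || j >= nj)
        return;
    const int idx = j * ni + i;
    // before: tmp was a full 3D temporary field, written in one stage and read in the next
    // after:  it is produced and consumed at the same point, so 3D -> scalar (a register)
    const float tmp = a[idx] + b[idx];
    out[idx] = 2.0f * tmp;
}
```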
### 2. Plain-CUDA backend (~20%)
Implement a SID-based plain-CUDA backend, building on Anton's work for Dawn and possibly on extensions from Vulcan's work.
Several variations:
- Standard GridTools pattern `warpsize x lines + halo warps`
- J-scan + shuffles (see below)
By default, we would apply J-scan + shuffles for all pure horizontal stencils and the standard GridTools pattern for all vertical stencils.
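As a rough, non-authoritative sketch of the thread-to-domain mapping behind the standard pattern (halo warps and extents are elided, all names are invented):
```cpp
__global__ void standard_pattern(const float *in, float *out, int ni, int nj, int nk) {
    // blockDim.x == warpSize covers 32 i-points, blockDim.y == the number of j-lines;
    // additional warps of the block (not shown) would compute the halo points needed
    // by stages with non-zero extents.
    const int i = blockIdx.x * warpSize + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= ni || j >= nj)
        return;
    for (int k = 0; k < nk; ++k) { // each thread column marches over k
        const int idx = (k * nj + j) * ni + i;
        out[idx] = in[idx] + 1.0f;
    }
}
```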
### 3. GTBench optimizations (~30%)
#### K-caches
In a first step, implement k-caching per interval (i.e. the caches can be restarted for each interval, as is done in DaCe).
The ring buffer can be implemented with a C++ helper struct; however, AMD compilers might not like it.
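One possible shape of that helper, as a non-binding sketch (invented names; written as a shift register rather than a rotating index so the entries can stay in registers):
```cpp
// Caches the k-levels [MinusOffset, PlusOffset] around the current level.
template <class T, int MinusOffset, int PlusOffset>
struct k_cache {
    static constexpr int size = PlusOffset - MinusOffset + 1;
    T data[size];

    // Access relative to the current k-level, e.g. at<-1>() for k - 1.
    template <int Offset>
    __device__ T &at() {
        static_assert(MinusOffset <= Offset && Offset <= PlusOffset, "offset outside cache");
        return data[Offset - MinusOffset];
    }

    // Advance to the next k-level of a forward solve: shift all entries by one slot;
    // the caller then refills at<PlusOffset>() from memory.
    __device__ void slide() {
#pragma unroll
        for (int i = 0; i < size - 1; ++i)
            data[i] = data[i + 1];
    }
};

// Usage inside the forward k-loop (field_at/out_at are placeholders):
//   cache.slide();
//   cache.at<1>() = field_at(i, j, k + 1);  // fill the newly exposed slot
//   out_at(i, j, k) = cache.at<-1>() + cache.at<0>() + cache.at<1>();
```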
Analysis: Detect that the k-cache pattern is met.
As an improvement (nice to have), grouping of VerticalLoops could be implemented. Currently each oir.VerticalLoop contains exactly one Interval. To apply k-caching over the full k-axis we would need to build VerticalLoop groups that span the whole axis.
#### J-scan + shuffle
This strategy is expected to work well for _all_ horizontal stencils, including those with costly math function calls (like pow, exp) for which the current GT gpu_horizontal backend does not perform well.
- Analysis pass to detect that the pattern is applicable to a stencil.
- Implement a variation of the plain-CUDA backend that uses this pattern (see the sketch below).
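A minimal, hedged sketch of the idea on a 5-point Laplacian (invented kernel; assumes one warp per block, `ni` a multiple of `warpSize`, and ignores halo exchange at warp boundaries):
```cpp
__global__ void laplacian_jscan(const float *in, float *out, int ni, int nj, int k_stride) {
    const unsigned mask = 0xffffffffu;
    const int i = blockIdx.x * warpSize + threadIdx.x; // a warp covers 32 consecutive i-points
    const int base = blockIdx.z * k_stride;            // one k-level per block in z

    float jm1 = in[base + 0 * ni + i]; // j - 1 line, kept in a register
    float jc  = in[base + 1 * ni + i]; // current j line
    for (int j = 1; j < nj - 1; ++j) {
        const float jp1 = in[base + (j + 1) * ni + i]; // next j line

        // i-neighbors of the current line come from the neighboring lanes.
        const float im1 = __shfl_up_sync(mask, jc, 1);   // value at i - 1
        const float ip1 = __shfl_down_sync(mask, jc, 1); // value at i + 1
        // NOTE: lanes 0 and 31 just get their own value back; a real backend needs
        // halo handling at warp boundaries (overlapping warps or extra loads).

        if (i > 0 && i < ni - 1)
            out[base + j * ni + i] = -4.0f * jc + im1 + ip1 + jm1 + jp1;

        jm1 = jc; // slide the register window in j
        jc = jp1;
    }
}
```
Each j-line is loaded from memory only once, and i-neighbor accesses stay inside the warp without shared memory, which is the intended benefit of the pattern.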
#### Compile-time domain sizes (nice to have)
If time is left, we can explore compile-time domain sizes as an experiment (sketched below).
- Requires modifications to the frontend to propagate the domain sizes *at compile time*
- SID supports compile-time strides.
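A tiny sketch of the effect; the helper below is hypothetical, and the only assumption carried over from above is that SID accepts compile-time strides:
```cpp
#include <type_traits>

// j_stride may be a runtime int or a std::integral_constant<int, N>; in the latter
// case the compiler sees a constant stride and can strength-reduce the multiply.
template <class JStride>
__host__ __device__ float load_ij(const float *field, int i, int j, JStride j_stride) {
    return field[i + j * j_stride];
}

// runtime domain size:       load_ij(ptr, i, j, nj_stride);
// compile-time domain size:  load_ij(ptr, i, j, std::integral_constant<int, 128>{});
```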
#### GTBench performance comparison
Create a small report comparing GTBench performance across the different tools and hardware:
Tools:
- GTBench GridTools
- GTBench4Py DaCe
- GTBench4Py traditional GT4Py
- GTBench4Py from this work (maybe in several variations)
Hardware:
- P100
- A100
- Mi50 (at CSCS)
- Mi100 (at Cray)