# [GT4Py] Research: ICON structured grid strided access standalone benchmark - IV
- Shaped by:
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
In [cycle23-24](https://hackmd.io/3XZx9ktnTWG7RgoxSR7o1Q) we explored the performance of the `nabla4` and `mo_intp_rbf_rbf_vec_interpol` kernels on the `structured` and the `torus` grid (without periodic boundaries). For this cycle we would like to understand how inlining kernels, and the overcomputation it entails, affects performance compared to `separate` kernel execution, both for more compute- or memory-intensive kernels (a modified `nabla4`) and when inlining an additional interpolation. Furthermore, there are a few proposed improvements to be checked. By the end of the cycle we should have a presentable report covering the experiments done so far and their outcomes.
## Appetite
Full cycle
## Solution
- [x] Try Felix's last changes (https://github.com/GridTools/gridtools/pull/1787)
- [x] Compared to the previous `master` without these improvements
- [x] No big changes in runtime with the latest changes in `master`
- [x] Compared to simple `__ldg`
- [x] Same
- [x] 2 differences compared to the custom CUDA kernel
- [x] Many more integer instructions
- [x] Expected because of shifting
- [x] Shouldn't be a bottleneck since the compute % is still too low
- [x] Many more LDG instructions (+144%)
- [x] +65% store instructions
- [x] Fixed with single neighbor read
- [x] Neighbors are still accessed per K level
- [x] `__restrict__` is missing from `neighbors`?
- [x] See https://github.com/iomaganaris/gridtools/tree/loop-blocking-opt
- [x] In the unrolled version, still some reads in the epilogue loop
- [x] Maybe because of how the loops are created?
- [x] 5% faster than without unrolling and without `__restrict__`s
- [x] Non vertical fields (`primal_normal_vert_v1`, `primal_normal_vert_v2`, `inv_vert_vert_length` and `inv_primal_edge_length`) are read for every K level
- [x] If we only load the neighbors once and the above fields at every k level, GTFN reaches the same number of reads as the custom CUDA kernel (see the access-pattern sketch below)
- [x] The CUDA kernel is much faster (27%) because of fewer loads for the fields that are independent of the vertical level
- [x] If we avoid unrolling and use a single for loop, `gtfn` is within 6% (slower) of the CUDA kernel that loads the k-independent fields at every vertical level
- [x] Best version of `nabla4` kernel is with `unroll 1` and `__restrict__`s
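A minimal hand-written sketch of the access pattern discussed above (single neighbor read, `__restrict__`, k-independent fields hoisted out of the k loop). This is not the GTFN-generated code nor the benchmark's custom CUDA kernel; the names (`nabla4_like`, `u_vert`, `inv_len`), the flat `[element][k]` layouts and the stencil body are placeholder assumptions.

```cuda
__global__ void nabla4_like(double* __restrict__ out,           // [n_edges][n_k]
                            const double* __restrict__ u_vert,  // [n_vertices][n_k], vertical
                            const double* __restrict__ inv_len, // [n_edges], k-independent
                            const int* __restrict__ e2c2v,      // [n_edges][4]
                            int n_edges, int n_k) {
    const int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_edges) return;

    // Hoisted out of the k loop: one read per edge instead of one per k level.
    const int v0 = e2c2v[4 * e + 0];
    const int v1 = e2c2v[4 * e + 1];
    const int v2 = e2c2v[4 * e + 2];
    const int v3 = e2c2v[4 * e + 3];
    const double w = inv_len[e];

    for (int k = 0; k < n_k; ++k) {
        // Only the vertical (k-dependent) fields are loaded inside the k loop.
        const double acc = u_vert[v0 * n_k + k] + u_vert[v1 * n_k + k]
                         + u_vert[v2 * n_k + k] + u_vert[v3 * n_k + k];
        out[e * n_k + k] = w * acc;  // placeholder stencil body
    }
}
```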
- [x] Finalize https://github.com/fthaler/gridtools/pull/1
- [x] V100
- [x] Without const neighbors: `0.0036853439807891845`
- [x] With const neighbors: `0.003628000020980835` (~2% faster)
- [x] With const input fields: `0.0027156479358673097` (~25% faster)
- [x] Make sure that `gpu_naive` `inlined` versions are reasonable
- [x] Generate plots with latest data
- [ ] `mem_async` on `nabla4` for `K levels`
- [x] https://github.com/GridTools/icon_structured_benchmark/tree/cuda_pipeline
- [x] Very small speed up and no difference in registers
- [x] Better to use `LDGSTS.128.BYPASS` instructions whenever possible
- [ ] Then think about how to do that for `inlined`/`inlined_v2v` (see the `memcpy_async` sketch below)
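A minimal sketch of the `cuda::memcpy_async`/`cuda::pipeline` pattern behind the `cuda_pipeline` experiment above, assuming sm_80+ hardware (where the copy can be serviced by `LDGSTS`; 16-byte-aligned, 16-byte-per-thread chunks should map to the `.128.BYPASS` form). `stage_and_consume`, `TILE`, the flat layout and the trivial consumer are illustrative assumptions, not the benchmark kernels.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>

template <int TILE>
__global__ void stage_and_consume(const double* __restrict__ field,
                                  double* __restrict__ out, int n) {
    namespace cg = cooperative_groups;
    __shared__ double tile[TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto block = cg::this_thread_block();
    auto pipe = cuda::make_pipeline(block, &state);

    const int base = blockIdx.x * TILE;
    const int count = min(TILE, n - base);  // guard the tail block

    pipe.producer_acquire();
    // Whole-block global -> shared copy, no round trip through registers.
    cuda::memcpy_async(block, tile, field + base, sizeof(double) * count, pipe);
    pipe.producer_commit();

    pipe.consumer_wait();
    const int i = threadIdx.x;
    if (i < count)
        out[base + i] = 2.0 * tile[i];  // placeholder consumer
    pipe.consumer_release();
}
```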
- [ ] Try `e2v` and `e2c2v` separately in structured version
- [x] Try `nabla4` with compressed `e2c2v` with `per-vertex` order in `unstructured` [Nvidia V100]
- [x] `per-orientation`
- [x] `gpu_naive`: 0.004318720102310181
- [x] `gpu_kloop`: 0.0030766080617904666
- [x] `per-vertex`
- [x] `gpu_naive`: 0.003964927911758423 ~8% faster
- [x] `gpu_kloop`: 0.002456576108932495 ~20% faster
- [ ] Try on GH200 as well
- [x] Check how `e2ecv` accesses can be improved in inlined version
- [ ] ~~For every edge there is an `e2ecv` entry with 4 fields (one for every vertex)~~
- [ ] ~~Each vertex can potentially have different values depending on the index of the `e2ecv` entry~~
- [ ] ~~If we know that the values of `primal_normal_vert1/2` is the same for every vertex or the entries in `e2ecv` for the same vertex is the same then we can simplify this~~
- [ ] ~~`primal_normal_vert1/2` length would be the same as the number of vertexes then~~
- [ ] ~~If there is only 1 entry for every vertex in the `primal_normal_vert1/2` fields, then `e2ecv` can be simplified similarly to `e2c2v` and we can further collapse the calculations in the inlined kernel~~
- [x] Don't pursue
- [ ] Use vertical fields for all the fields to saturate mem BW
- [ ] `gpu_naive` and `gpu_kloop` versions
- [x] if `gpu_naive` is faster than `gpu_kloop` and `separate` is faster than `inlined` then `inlined_cached` should be faster
- [x] `structured inlined` is only a bit faster than separate. That's why `inlined_cached` is now faster
- [ ] Should implement `cached` for both `unstructured` and `structured`
- [x] For `gpu_kloop` the `k iterations` and `thread block` size are restricted to fit in shared memory (see the rough budget check below)
- [x] Large enough thread block size and `k` iterations (256 thread block size, 10 `k` iterations)
- [x] Simple inlining of `unstructured` to shared memory doesn't work better
- [ ] We can try saving to GPU main memory
- [ ] Probably not good enough
- [ ] See if adding more artificial fields + ops goes to this direction
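As a rough shared-memory budget check for the numbers quoted above (assuming double precision and one cached value per thread per k iteration; an estimate, not a measured figure):

```latex
\[
  256~\text{threads} \times 10~k\text{-iterations} \times 8~\text{B}
  = 20480~\text{B} \approx 20~\text{KiB per cached field per thread block}
\]
```

Only a few such per-field buffers fit in the shared memory available per block (on the order of tens to a couple of hundred KiB depending on the GPU), which is what couples the thread block size and the number of k iterations.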
- [ ] Add another interpolation `e2v`
- [ ] Target is to prove that best version of `inlined` can be slower than `separate`
- [ ] If we add an extra kernel for inlining, the overcomputation increases, similarly to `nabla4_vertical`. Depending on how many extra (uncached) memory loads that brings, the inlined version will end up closer to `separate` or slightly slower (as in the `nabla4_vertical` case, for example)
- [ ] My (Ioannis') guess is that the overcomputation of `nabla4` and `interpolate` won't be that expensive (especially with `v2v`) since there are not so many memory loads, so inlining will still be faster; the shared memory version will also be faster (see the rough overcomputation estimate below)
- [ ] With `nabla4_vertical` and `interpolate` the inlined version will be slower; with caching it will be much faster
- [x] `nabla4_vertical` has much less loads in the overcomputations compared to `nabla4 & interpolate inlined` when inlined with `kernel[e2v]`
- [x] `nabla4 & interpolate inlined` reads per `k level` per element of `sqr_sin2_cos2[e2v]`: 108
- [x] `nabla4_vertical` reads per `k level` per element: 19
- [x] --> reinforced behavior of `nabla4_vertical` (cached will be much faster and `inlined` slower than `separate`)
- [x] --> `nabla4 & interpolate & kernel[e2v]` very big register pressure to keep things in registers to speed up `gpu_kloop` implementation
- [ ] Maybe still useful to get an estimate on how expensive the overcomputation of already inlined kernels is
- [ ] Maybe inlining more than 2 kernels is not a good idea?
- [ ] Should depend on the arithmetic intensity of the kernel?
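A back-of-the-envelope estimate of the overcomputation from inlining a vertex-level producer into an edge-level consumer through `e2v` (assuming a triangular torus mesh, i.e. roughly E ≈ 3V and vertex degree 6, and no caching of the inlined value):

```latex
\[
  \text{evaluations per vertex value per } k \text{ level}
  \;\approx\; \frac{2E}{V} \;\approx\; \frac{2 \cdot 3V}{V} \;=\; 6
\]
```

Each additional inlined hop through a neighbor table multiplies this factor again, which is consistent with the much higher per-element read counts quoted above for `nabla4 & interpolate inlined` versus `nabla4_vertical`.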
- [ ] `sqr_sin2_cos2` kernel
- [x] Based on `e2v`
- [x] `e2v2e2c2v`
- [x] 12 vertices, 10 unique, 2 reuses --> many loads and computations
- [x] Compute bound (1 neighbor table, 4 inputs, vertical)
```
out[e, k] = sqr_sin2_cos2(in1[v, k], in2[v, k]):
    return ( sqr(sin(in1[e2v[e, 0], k])^2 + cos(in1[e2v[e, 0], k])^2)
           + sqr(sin(in1[e2v[e, 1], k])^2 + cos(in1[e2v[e, 1], k])^2)
           + sqr(sin(in2[e2v[e, 0], k])^2 + cos(in2[e2v[e, 0], k])^2)
           + sqr(sin(in2[e2v[e, 1], k])^2 + cos(in2[e2v[e, 1], k])^2) ) / 4  // = 1 always
```
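A possible standalone CUDA form of the pseudocode above, as a sketch under assumed flat `[element][k]` layouts and an `[edge][2]` `e2v` table; this is not the benchmark's actual implementation.

```cuda
__global__ void sqr_sin2_cos2(double* __restrict__ out,        // [n_edges][n_k]
                              const double* __restrict__ in1,  // [n_vertices][n_k]
                              const double* __restrict__ in2,  // [n_vertices][n_k]
                              const int* __restrict__ e2v,     // [n_edges][2]
                              int n_edges, int n_k) {
    const int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_edges) return;

    // e2v is read once per edge, not once per k level.
    const int v0 = e2v[2 * e + 0];
    const int v1 = e2v[2 * e + 1];

    auto sqr = [](double x) { return x * x; };

    for (int k = 0; k < n_k; ++k) {
        auto term = [&](const double* f, int v) {
            const double x = f[v * n_k + k];
            return sqr(sqr(sin(x)) + sqr(cos(x)));  // (sin^2 + cos^2)^2 == 1 up to rounding
        };
        out[e * n_k + k] =
            (term(in1, v0) + term(in1, v1) + term(in2, v0) + term(in2, v1)) / 4.0;
    }
}
```

Because the expression evaluates to 1 up to rounding, the kernel adds arithmetic work without changing the numerics of whatever consumes its output, which is the point of using it to raise arithmetic intensity.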
- [ ] Write a presentable report
- [ ] At least at the end of the cycle
- [x] Do K loop in consecutive K levels
- [x] threadblocksize * gridsize = 1
- [x] Not pursued: if one thread doesn't process the whole vertical dimension, we need synchronization
- [ ] Partly compile time offsets
- [ ] Quickly check. Good to know
- [x] ~~See how more neighbor accesses influence performance~~
- [ ] ~~Partly visible by the `inlined` and `inlined_v2v` performance~~
- [x] Covered above
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
- [ ] Create `unstructured` `cached` version with minimal overcomputation
- [ ] Compare with other `unstructured` versions (should be faster because of less loads)
- [ ] ~~Not trivial~~ Ultra painful
- [ ] How to figure out which edges need to be calculated only once and stored in shared memory?
- [ ] The edges 0, 2, 4 need to be calculated at every step and the values then passed on to the neighboring vertices
- [ ] Need to handle multiple scenarios (blockDim, dimensions of the grid, etc.). Very difficult to nail down, and the required if-statements would probably hurt performance
- [ ] How to create the indexes to access the shared memory?
- [ ] For each vertex save all edges. When we compute an edge we save it in the corresponding field for 2 vertices
- [x] Probably needs assumptions of structured grid --> `x_dim` at least
- [x] For all the vertices "inside" the thread block calculate the west/south edges; for the vertices on the north and east sides calculate all edges (see the generic halo-staging sketch below)
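A generic, heavily simplified sketch of the halo-staging idea described above: a structured 2D point grid with placeholder producer/consumer formulas, assuming `blockDim == (BX, BY)`. It illustrates computing the producer once per block (tile plus a one-point north/east halo) and serving the consumer from shared memory; it is not the actual edge/vertex bookkeeping of the ICON kernels.

```cuda
template <int BX, int BY>
__global__ void consumer_with_cached_producer(const double* __restrict__ in,
                                              double* __restrict__ out,
                                              int nx, int ny) {
    // Producer values for the block's tile plus a one-point halo on the
    // north/east side, computed once and shared.
    __shared__ double cache[BY + 1][BX + 1];

    const int x0 = blockIdx.x * BX;
    const int y0 = blockIdx.y * BY;

    // Cooperative fill of the (BX+1) x (BY+1) tile, halo included.
    for (int idx = threadIdx.y * blockDim.x + threadIdx.x;
         idx < (BX + 1) * (BY + 1); idx += blockDim.x * blockDim.y) {
        const int lx = idx % (BX + 1);
        const int ly = idx / (BX + 1);
        const int gx = min(x0 + lx, nx - 1);  // clamp at the domain boundary
        const int gy = min(y0 + ly, ny - 1);
        cache[ly][lx] = 2.0 * in[gy * nx + gx];  // placeholder producer
    }
    __syncthreads();

    const int gx = x0 + threadIdx.x;
    const int gy = y0 + threadIdx.y;
    if (gx < nx && gy < ny)
        // The consumer reads its own value and the east/north neighbors,
        // all served from shared memory instead of being recomputed.
        out[gy * nx + gx] = cache[threadIdx.y][threadIdx.x]
                          + cache[threadIdx.y][threadIdx.x + 1]
                          + cache[threadIdx.y + 1][threadIdx.x];
}
```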
- [x] Add vertical fields
- [x] Make all fields of `nabla4` vertical
- [x] `structured`
- [x] `unstructured`
- [x] Compare performance for these 2
- [x] Compare `inlined` and `separate` versions and see if `separate` is better than best of `inlined` due to more memory transfers
## Results
- [x] Latest runtimes (`bd2c802`)

- [x] [GTFN backend] Results on GH for torus_128, 80 Klevels
| Commit | Median Runtime [s] | Comment |
| ------- | -------------- | ------------------------------- |
| 3d5f405 | 0.001284031987 | master |
| ffcf790 | 0.001296352029 | master |
| daf2892 | 0.001301056027 | master |
| 12ca60f | 0.001297888041 | master |
| 85a0c72 | 0.001299936056 | Latest master |
| 494f8de | 0.002684895992 | Loop blocking without unrolling |
| ca11767 | 0.001519840002 | Only read neighbor tables twice (???) |
| 805897d | 0.001279 | Baseline master |
| 32daaa5 | 0.001096 | k blocking (no unroll) |
| 4908e55 | 0.001052 | latest master (const neighbors) |
| 4908e55 | 0.000832 | latest master w/ const in-fields |

- [x] Latest runtimes (`49945e5`)


## TODO after meeting with Hannes and Christoph
- [ ] Add another `c2v` kernel
- [ ] Try the `unstructured` `cached` version following Christoph's description
- [ ] Optimize for MI300A
- [ ] Try cache hints and `non_temporal_loads` (see the sketch below)
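A small sketch of what the cache-hint experiments could look like. The kernel body is a placeholder streaming copy, not one of the benchmark kernels; the HIP branch uses the clang non-temporal builtins that would be the relevant ones on MI300A, while the CUDA branch uses the streaming/read-only load intrinsics.

```cuda
__global__ void streaming_copy(const double* __restrict__ in,
                               double* __restrict__ out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__HIP_DEVICE_COMPILE__)
    // MI300A / HIP: clang non-temporal builtins hint that the data is
    // read and written exactly once, so it should not pollute the caches.
    const double v = __builtin_nontemporal_load(in + i);
    __builtin_nontemporal_store(v, out + i);
#else
    // CUDA: __ldcs/__stcs mark the access as streaming (evict-first);
    // __ldg would instead route read-only data through the read-only cache path.
    const double v = __ldcs(in + i);
    __stcs(out + i, v);
#endif
}
```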