# [GT4Py] Research: ICON structured grid strided access standalone benchmark - IV
- Shaped by:
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
In [cycle23-24](https://hackmd.io/3XZx9ktnTWG7RgoxSR7o1Q) we explored the performance of the `nabla4` and `mo_intp_rbf_rbf_vec_interpol` kernels on the `structured` and the `torus` grid (without periodic boundaries). For this cycle we would like to understand how inlining kernels, and the overcomputation it entails, affects performance compared to `separate` kernel execution, both for more compute- or memory-intensive kernels (a modified `nabla4`) and when inlining an additional interpolation. Furthermore, there are a few proposed improvements to be checked. By the end of the cycle we should have a presentable report covering the experiments done so far and their outcomes.
## Appetite
Full cycle
## Solution
- [x] Try Felix's last changes (https://github.com/GridTools/gridtools/pull/1787)
- [x] Compared to the previous `master` without these improvements
- [x] No big changes in runtime with the latest changes in `master`
- [x] Compared to simple `__ldg`
- [x] Same
- [x] 2 differences compared to the custom CUDA kernel
- [x] Many more integer instructions
- [x] Expected because of shifting
- [x] Shouldn't be a bottleneck since the compute % is still too low
- [x] Many more LDG instructions (+144%)
- [x] +65% store instructions
- [x] Fixed with single neighbor read
- [x] Neighbors are still accessed per K level
- [x] `__restrict__` is missing from `neighbors`?
- [x] See https://github.com/iomaganaris/gridtools/tree/loop-blocking-opt
- [x] In the unrolled version, still some reads in the epilogue loop
- [x] Maybe because of how the loops are created?
- [x] 5% faster than without unrolling and without `__restrict__`s
- [x] Non vertical fields (`primal_normal_vert_v1`, `primal_normal_vert_v2`, `inv_vert_vert_length` and `inv_primal_edge_length`) are read for every K level
- [x] If we only load the neighbors once and the above fields at every k level, GTFN reaches the same number of reads as the custom CUDA kernel (see the access-pattern sketch below)
- [x] The CUDA kernel is much faster (27%) because of fewer loads for the fields that are independent of the vertical level
- [x] If we avoid unrolling and use a single for loop, `gtfn` is within 6% (slower) of the CUDA kernel that loads the k-independent fields at every vertical level
- [x] Best version of `nabla4` kernel is with `unroll 1` and `__restrict__`s
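A minimal hand-written sketch of the access pattern discussed above (single neighbor read, `__restrict__`, k-independent fields hoisted out of the k loop). This is not the GTFN-generated code nor the benchmark's custom CUDA kernel; the names (`nabla4_like`, `u_vert`, `inv_len`), the flat `[element][k]` layouts and the stencil body are placeholder assumptions.

```cuda
__global__ void nabla4_like(double* __restrict__ out,           // [n_edges][n_k]
                            const double* __restrict__ u_vert,  // [n_vertices][n_k], vertical
                            const double* __restrict__ inv_len, // [n_edges], k-independent
                            const int* __restrict__ e2c2v,      // [n_edges][4]
                            int n_edges, int n_k) {
    const int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_edges) return;

    // Hoisted out of the k loop: one read per edge instead of one per k level.
    const int v0 = e2c2v[4 * e + 0];
    const int v1 = e2c2v[4 * e + 1];
    const int v2 = e2c2v[4 * e + 2];
    const int v3 = e2c2v[4 * e + 3];
    const double w = inv_len[e];

    for (int k = 0; k < n_k; ++k) {
        // Only the vertical (k-dependent) fields are loaded inside the k loop.
        const double acc = u_vert[v0 * n_k + k] + u_vert[v1 * n_k + k]
                         + u_vert[v2 * n_k + k] + u_vert[v3 * n_k + k];
        out[e * n_k + k] = w * acc;  // placeholder stencil body
    }
}
```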
- [x] Finalize https://github.com/fthaler/gridtools/pull/1
- [x] V100
- [x] Without const neighbors: `0.0036853439807891845`
- [x] With const neighbors: `0.003628000020980835` (~2% faster)
- [x] With const input fields: `0.0027156479358673097` (~25% faster)
- [x] Make sure that `gpu_naive` `inlined` versions are reasonable
- [x] Generate plots with latest data
- [ ] `mem_async` on `nabla4` for `K levels`
- [x] https://github.com/GridTools/icon_structured_benchmark/tree/cuda_pipeline
- [x] Very small speed up and no difference in registers
- [x] Better to use `LDGSTS.128.BYPASS` instructions whenever possible
- [ ] Then think about how to do that for `inlined`/`inlined_v2v` (see the `memcpy_async` sketch below)
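A minimal sketch of the `cuda::memcpy_async`/`cuda::pipeline` pattern behind the `cuda_pipeline` experiment above, assuming sm_80+ hardware (where the copy can be serviced by `LDGSTS`; 16-byte-aligned, 16-byte-per-thread chunks should map to the `.128.BYPASS` form). `stage_and_consume`, `TILE`, the flat layout and the trivial consumer are illustrative assumptions, not the benchmark kernels.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>

template <int TILE>
__global__ void stage_and_consume(const double* __restrict__ field,
                                  double* __restrict__ out, int n) {
    namespace cg = cooperative_groups;
    __shared__ double tile[TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto block = cg::this_thread_block();
    auto pipe = cuda::make_pipeline(block, &state);

    const int base = blockIdx.x * TILE;
    const int count = min(TILE, n - base);  // guard the tail block

    pipe.producer_acquire();
    // Whole-block global -> shared copy, no round trip through registers.
    cuda::memcpy_async(block, tile, field + base, sizeof(double) * count, pipe);
    pipe.producer_commit();

    pipe.consumer_wait();
    const int i = threadIdx.x;
    if (i < count)
        out[base + i] = 2.0 * tile[i];  // placeholder consumer
    pipe.consumer_release();
}
```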
- [ ] Try `e2v` and `e2c2v` separately in structured version
- [x] Try `nabla4` with compressed `e2c2v` with `per-vertex` order in `unstructured` [Nvidia V100]
- [x] `per-orientation`
- [x] `gpu_naive`: 0.004318720102310181
- [x] `gpu_kloop`: 0.0030766080617904666
- [x] `per-vertex`
- [x] `gpu_naive`: 0.003964927911758423 ~8% faster
- [x] `gpu_kloop`: 0.002456576108932495 ~20% faster
- [ ] Try on GH200 as well
- [x] Check how `e2ecv` accesses can be improved in inlined version
- [ ] ~~For every edge there is an `e2ecv` entry with 4 fields (one for every vertex)~~
- [ ] ~~Each vertex can potentially have different values depending on the index of the `e2ecv` entry~~
- [ ] ~~If we know that the values of `primal_normal_vert1/2` is the same for every vertex or the entries in `e2ecv` for the same vertex is the same then we can simplify this~~
- [ ] ~~`primal_normal_vert1/2` length would be the same as the number of vertexes then~~
- [ ] ~~If there is only 1 entry for every vertex in the `primal_normal_vert1/2` fields, then `e2ecv` can be simplified similarly to `e2c2v` and we can further collapse the calculations in the inlined kernel~~
- [x] Don't pursue
- [ ] Use vertical fields for all the fields to saturate mem BW
- [ ] `gpu_naive` and `gpu_kloop` versions
- [x] if `gpu_naive` is faster than `gpu_kloop` and `separate` is faster than `inlined` then `inlined_cached` should be faster
- [x] `structured inlined` is only a bit faster than separate. That's why `inlined_cached` is now faster
- [ ] Should implement `cached` for both `unstructured` and `structured`
- [x] For `gpu_kloop` the `k iterations` and `thread block` size are restricted to fit in shared memory (see the rough budget check below)
- [x] Large enough thread block size and `k` iterations (256 thread block size, 10 `k` iterations)
- [x] Simple inlining of `unstructured` to shared memory doesn't work better
- [ ] We can try saving to GPU main memory
- [ ] Probably not good enough
- [ ] See if adding more artificial fields + ops goes to this direction
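As a rough shared-memory budget check for the numbers quoted above (assuming double precision and one cached value per thread per k iteration; an estimate, not a measured figure):

```latex
\[
  256~\text{threads} \times 10~k\text{-iterations} \times 8~\text{B}
  = 20480~\text{B} \approx 20~\text{KiB per cached field per thread block}
\]
```

Only a few such per-field buffers fit in the shared memory available per block (on the order of tens to a couple of hundred KiB depending on the GPU), which is what couples the thread block size and the number of k iterations.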
- [ ] Add another interpolation `e2v`
- [ ] Target is to prove that best version of `inlined` can be slower than `separate`
- [ ] If we add an extra kernel for inlining, the overcomputation increases, similarly to `nabla4_vertical`. Depending on how many extra (uncached) memory loads that brings, the inlined version will end up closer to `separate` or slightly slower (as in the `nabla4_vertical` case, for example)
- [ ] My (Ioannis') guess is that the overcomputation of `nabla4` and `interpolate` won't be that expensive (especially with `v2v`) since there are not so many memory loads, so inlining will still be faster; the shared memory version will also be faster (see the rough overcomputation estimate below)
- [ ] With `nabla4_vertical` and `interpolate` the inlined version will be slower; with caching it will be much faster
- [x] `nabla4_vertical` has much less loads in the overcomputations compared to `nabla4 & interpolate inlined` when inlined with `kernel[e2v]`
- [x] `nabla4 & interpolate inlined` reads per `k level` per element of `sqr_sin2_cos2[e2v]`: 108
- [x] `nabla4_vertical` reads per `k level` per element: 19
- [x] --> reinforced behavior of `nabla4_vertical` (cached will be much faster and `inlined` slower than `separate`)
- [x] --> `nabla4 & interpolate & kernel[e2v]` very big register pressure to keep things in registers to speed up `gpu_kloop` implementation
- [ ] Maybe still useful to get an estimate on how expensive the overcomputation of already inlined kernels is
- [ ] Maybe inlining more than 2 kernels is not a good idea?
- [ ] Should depend on the arithmetic intensity of the kernel?
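A back-of-the-envelope estimate of the overcomputation from inlining a vertex-level producer into an edge-level consumer through `e2v` (assuming a triangular torus mesh, i.e. roughly E ≈ 3V and vertex degree 6, and no caching of the inlined value):

```latex
\[
  \text{evaluations per vertex value per } k \text{ level}
  \;\approx\; \frac{2E}{V} \;\approx\; \frac{2 \cdot 3V}{V} \;=\; 6
\]
```

Each additional inlined hop through a neighbor table multiplies this factor again, which is consistent with the much higher per-element read counts quoted above for `nabla4 & interpolate inlined` versus `nabla4_vertical`.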
- [ ] `sqr_sin2_cos2` kernel
- [x] Based on `e2v`
- [x] `e2v2e2c2v`
- [x] 12 vertices, 10 unique, 2 reuses --> many loads and computations
- [x] Compute bound (1 neighbor table, 4 inputs, vertical)
```
out[e, k] = sqr_sin2_cos2(in1[v, k], in2[v, k]):
    return ( sqr(sin(in1[e2v[e, 0], k])^2 + cos(in1[e2v[e, 0], k])^2)
           + sqr(sin(in1[e2v[e, 1], k])^2 + cos(in1[e2v[e, 1], k])^2)
           + sqr(sin(in2[e2v[e, 0], k])^2 + cos(in2[e2v[e, 0], k])^2)
           + sqr(sin(in2[e2v[e, 1], k])^2 + cos(in2[e2v[e, 1], k])^2) ) / 4  // = 1 always
```
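A possible standalone CUDA form of the pseudocode above, as a sketch under assumed flat `[element][k]` layouts and an `[edge][2]` `e2v` table; this is not the benchmark's actual implementation.

```cuda
__global__ void sqr_sin2_cos2(double* __restrict__ out,        // [n_edges][n_k]
                              const double* __restrict__ in1,  // [n_vertices][n_k]
                              const double* __restrict__ in2,  // [n_vertices][n_k]
                              const int* __restrict__ e2v,     // [n_edges][2]
                              int n_edges, int n_k) {
    const int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_edges) return;

    // e2v is read once per edge, not once per k level.
    const int v0 = e2v[2 * e + 0];
    const int v1 = e2v[2 * e + 1];

    auto sqr = [](double x) { return x * x; };

    for (int k = 0; k < n_k; ++k) {
        auto term = [&](const double* f, int v) {
            const double x = f[v * n_k + k];
            return sqr(sqr(sin(x)) + sqr(cos(x)));  // (sin^2 + cos^2)^2 == 1 up to rounding
        };
        out[e * n_k + k] =
            (term(in1, v0) + term(in1, v1) + term(in2, v0) + term(in2, v1)) / 4.0;
    }
}
```

Because the expression evaluates to 1 up to rounding, the kernel adds arithmetic work without changing the numerics of whatever consumes its output, which is the point of using it to raise arithmetic intensity.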
- [ ] Write a presentable report
- [ ] At least at the end of the cycle
- [x] Do K loop in consecutive K levels
- [x] threadblocksize * gridsize = 1
- [x] Not pursued: if one thread doesn't process the whole vertical dimension, we need synchronization
- [ ] Partly compile time offsets
- [ ] Quickly check. Good to know
- [x] ~~See how more neighbor accesses influence performance~~
- [ ] ~~Partly visible by the `inlined` and `inlined_v2v` performance~~
- [x] Covered above
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
- [ ] Create `unstructured` `cached` version with minimal overcomputation
- [ ] Compare with other `unstructured` versions (should be faster because of less loads)
- [ ] ~~Not trivial~~ Ultra painful
- [ ] How to figure out which edges need to be calculated only once and stored in shared memory?
- [ ] The edges 0, 2, 4 need to be calculated at every step and the values then passed on to the neighboring vertices
- [ ] Need to handle multiple scenarios (blockDim, dimensions of the grid, etc.). Very difficult to nail down, and the required if-statements would probably hurt performance
- [ ] How to create the indexes to access the shared memory?
- [ ] For each vertex save all edges. When we compute an edge we save it in the corresponding field for 2 vertices
- [x] Probably needs assumptions of structured grid --> `x_dim` at least
- [x] For all the vertices "inside" the thread block calculate the west/south edges; for the vertices on the north and east sides calculate all edges (see the generic halo-staging sketch below)
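A generic, heavily simplified sketch of the halo-staging idea described above: a structured 2D point grid with placeholder producer/consumer formulas, assuming `blockDim == (BX, BY)`. It illustrates computing the producer once per block (tile plus a one-point north/east halo) and serving the consumer from shared memory; it is not the actual edge/vertex bookkeeping of the ICON kernels.

```cuda
template <int BX, int BY>
__global__ void consumer_with_cached_producer(const double* __restrict__ in,
                                              double* __restrict__ out,
                                              int nx, int ny) {
    // Producer values for the block's tile plus a one-point halo on the
    // north/east side, computed once and shared.
    __shared__ double cache[BY + 1][BX + 1];

    const int x0 = blockIdx.x * BX;
    const int y0 = blockIdx.y * BY;

    // Cooperative fill of the (BX+1) x (BY+1) tile, halo included.
    for (int idx = threadIdx.y * blockDim.x + threadIdx.x;
         idx < (BX + 1) * (BY + 1); idx += blockDim.x * blockDim.y) {
        const int lx = idx % (BX + 1);
        const int ly = idx / (BX + 1);
        const int gx = min(x0 + lx, nx - 1);  // clamp at the domain boundary
        const int gy = min(y0 + ly, ny - 1);
        cache[ly][lx] = 2.0 * in[gy * nx + gx];  // placeholder producer
    }
    __syncthreads();

    const int gx = x0 + threadIdx.x;
    const int gy = y0 + threadIdx.y;
    if (gx < nx && gy < ny)
        // The consumer reads its own value and the east/north neighbors,
        // all served from shared memory instead of being recomputed.
        out[gy * nx + gx] = cache[threadIdx.y][threadIdx.x]
                          + cache[threadIdx.y][threadIdx.x + 1]
                          + cache[threadIdx.y + 1][threadIdx.x];
}
```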
- [x] Add vertical fields
- [x] Make all fields of `nabla4` vertical
- [x] `structured`
- [x] `unstructured`
- [x] Compare performance for these 2
- [x] Compare `inlined` and `separate` versions and see if `separate` is better than best of `inlined` due to more memory transfers
## Results
- [x] Latest runtimes (`bd2c802`)

- [x] [GTFN backend] Results on GH for torus_128, 80 Klevels
| Commit | Median Runtime [s] | Comment |
| ------- | -------------- | ------------------------------- |
| 3d5f405 | 0.001284031987 | master |
| ffcf790 | 0.001296352029 | master |
| daf2892 | 0.001301056027 | master |
| 12ca60f | 0.001297888041 | master |
| 85a0c72 | 0.001299936056 | Latest master |
| 494f8de | 0.002684895992 | Loop blocking without unrolling |
| ca11767 | 0.001519840002 | Only read neighbor tables twice (???) |
| 805897d | 0.001279 | Baseline master |
| 32daaa5 | 0.001096 | k blocking (no unroll) |
| 4908e55 | 0.001052 | latest master (const neighbors) |
| 4908e55 | 0.000832 | latest master w/ const in-fields |

- [x] Latest runtimes (`49945e5`)


## TODO after meeting with Hannes and Christoph
- [ ] Add another `c2v` kernel
- [ ] Try the `unstructured` `cached` version following Christoph's description
- [ ] Optimize for MI300A
- [ ] Try cache hints and `non_temporal_loads` (see the sketch below)
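A small sketch of what the cache-hint experiments could look like. The kernel body is a placeholder streaming copy, not one of the benchmark kernels; the HIP branch uses the clang non-temporal builtins that would be the relevant ones on MI300A, while the CUDA branch uses the streaming/read-only load intrinsics.

```cuda
__global__ void streaming_copy(const double* __restrict__ in,
                               double* __restrict__ out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__HIP_DEVICE_COMPILE__)
    // MI300A / HIP: clang non-temporal builtins hint that the data is
    // read and written exactly once, so it should not pollute the caches.
    const double v = __builtin_nontemporal_load(in + i);
    __builtin_nontemporal_store(v, out + i);
#else
    // CUDA: __ldcs/__stcs mark the access as streaming (evict-first);
    // __ldg would instead route read-only data through the read-only cache path.
    const double v = __ldcs(in + i);
    __stcs(out + i, v);
#endif
}
```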