# [GT4Py] Research: ICON structured grid strided access standalone benchmark - III (2 kernel combinations)
- Shaped by:
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
In [cycle21-22](https://hackmd.io/3Yr0zW7NSdud64JuPNMe_w) we explored the performance of the `nabla4` kernel and implemented the `mo_intp_rbf_rbf_vec_interpol` kernel for the `structured` and the `torus` grid (without periodic boundaries). This cycle we would like to explore the performance benefits of the structured implementation when these two kernels are called one after the other.
## Appetite
Full cycle
## Solution
Based on a conversation with Hannes and Ioannis, the following cases should be investigated:
- Unstructured (based on Christoph's exploration, 2 separate kernels are faster than 1 inlined kernel; according to Christoph the inlined version is probably only faster in the structured case, because the intermediate result can be reused there)
- 2 kernels with tmp from the 1st one as input of the 2nd one (1st kernel output is the same as the current one)
- 1 kernel inlined
- 1 kernel with 1 neighbor table with c2(faraway)v
- Structured
- 2 kernels with tmp from the 1st one as input of the 2nd one (1st kernel output is the same as the current one)
- 1 kernel inlined (should be faster)
On top of that it might be beneficial to check the `cuda::pipeline` feature.
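A minimal sketch of what a `cuda::pipeline`-based k loop could look like (schematic only: the kernel name, field layout, and the one-block-per-level-slice assumption are hypothetical, not the benchmark's code). The point is that the copy of the next k level into shared memory overlaps with the computation on the current one:

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

// Schematic: `in` holds n_lev consecutive k levels of `edges_per_level`
// doubles each; every block stages one full level at a time, so this sketch
// implicitly assumes a single block per level slice.
__global__ void kloop_pipelined(const double* __restrict__ in, double* out,
                                int n_lev, int edges_per_level) {
    extern __shared__ double smem[];  // 2 * edges_per_level doubles
    auto block = cg::this_thread_block();

    constexpr int stages = 2;  // double buffering
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, stages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    // Prefetch level 0.
    pipe.producer_acquire();
    cuda::memcpy_async(block, smem, in, sizeof(double) * edges_per_level, pipe);
    pipe.producer_commit();

    for (int k = 0; k < n_lev; ++k) {
        // Kick off the copy of level k+1 while level k is still in flight.
        if (k + 1 < n_lev) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem + ((k + 1) % stages) * edges_per_level,
                               in + (k + 1) * edges_per_level,
                               sizeof(double) * edges_per_level, pipe);
            pipe.producer_commit();
        }

        pipe.consumer_wait();  // level k is now in shared memory
        const double* level = smem + (k % stages) * edges_per_level;
        // ... nabla4 / interpolate computation on `level` would go here ...
        if (threadIdx.x < edges_per_level)
            out[k * edges_per_level + threadIdx.x] = level[threadIdx.x];  // placeholder
        pipe.consumer_release();
    }
}
```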
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
- ~~CPU version this cycle~~ Done anyway, to assist with debugging the inlined version
## Progress
- [x] Run `unstructured_naive` implementation of `nabla4` + `interpolate` kernels on CPU
- [x] Design class which instantiates the data for the `nabla4` and `interpolate` kernels
- [x] Output of `nabla4` needs to be the input of `interpolation` kernel
- [x] Figure out how to handle indexes into the first kernel's output, which has smaller dimensions
1. Transform the `v2e` indexes from the full region to the region of edges covered by `z_nabla4` (a small sketch of this remapping follows this list)
- Can be done up front and doesn't interfere with the kernel computation in the unstructured version; the structured version needs extra computation during kernel execution
- `EdgeDim` needs to exclude the halos of the first kernel (i.e. the size of `z_nabla2`); `(x_dim', y_dim')` for the second kernel then become `(x_dim - 2 * halo_nabla, y_dim - 2 * halo_nabla)`
2. ~~Compute `z_nabla4` with size of `EdgeDim`. That way the `v2e` indexes don't need any transformation but the unstructured grid will need the halo information and some extra computations to store the output in the right index~~
- ~~Extra work for handling the existing tests, etc~~
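A minimal host-side sketch of option 1 above (the helper name and a row-major single-index edge encoding are assumptions; the real tables live in the benchmark's data class). Each `v2e` entry, which indexes the full edge region, is translated once, up front, into an index into the halo-reduced region that `z_nabla4` is computed on:

```cuda
// Hypothetical helper: remap one v2e entry from the full edge region
// (x_dim x y_dim, row-major) to the halo-reduced region of z_nabla4
// ((x_dim - 2*halo) x (y_dim - 2*halo)). Returns -1 for edges inside
// the halo, which have no z_nabla4 value.
inline int remap_edge_index(int full_idx, int x_dim, int y_dim, int halo) {
    const int j = full_idx / x_dim;  // row in the full region
    const int i = full_idx % x_dim;  // column in the full region
    const int x_inner = x_dim - 2 * halo;
    const int y_inner = y_dim - 2 * halo;
    const int ii = i - halo;
    const int jj = j - halo;
    if (ii < 0 || jj < 0 || ii >= x_inner || jj >= y_inner) return -1;
    return jj * x_inner + ii;
}

// Applied once to the whole table before launching the kernels, e.g.:
//   for (int& e : v2e) e = remap_edge_index(e, x_dim, y_dim, halo_nabla);
```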
- [x] Run `structured_naive` implementation of `nabla4` + `interpolate` kernels on CPU
- [ ] Check timers with nsys and make sure that there is no large overhead in combined kernel
- [x] `unstructured` inlined
- [x] For each vertex we need to find its edges and then run the nabla4 kernel for each edge --> is this the `complex` neighbor?
- [x] Is it actually possible to inline `nabla4` with `interpolate` since they don't have the same input size?
- [x] How is something like this done in `icon`?
- [x] Discussed with Christoph: inlining is done at the neighbor-table level, as already done here
- [x] Need to figure out what's going on with `e2v` and `e2c2v` indexes for the inlined version
- [x] In the separate benchmarks `e2c2v` is filtered with halo = 2, while `v2e` is always filtered with halo = 1 and its elements are transformed so that they refer to indexes of the `nabla4` output vector (which has the same size as the filtered `e2c2v`). In the inlined benchmark `e2c2v` stays the same as the original and `v2e` is filtered with halo = 3 (2 + 1), so that the `v2e` indexes can refer directly into `e2c2v`. All the vectors in the `sanity-checks` have been permuted so that the same data are checked in both the separate and inlined versions.
- [x] `structured` inlined
- [x] CPU version for validation
- [x] Compute the v2e2c2v neighbor per (i, j) vertex on the fly
- [x] Naive version
- [x] Simple version
- [ ] Caching version
- [ ] `v2e2c2v` version
- [x] Construct the `v2e2c2v` neighbor table in ~~python~~ C++ (same as `e2c2v[v2e]`; a sketch follows this block)
- [x] Take care of duplicate entries and remove them; make sure the indices are used in the kernel in the correct order
- [x] The `v2e2c2v` neighborhood should have size 7 (instead of 24)
- [ ] Figure out whether the `e2ecv` neighbors are reasonable. 1 entry per vertex means that there is no compression possible
- [x] Pass it to the inlined implementation
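A minimal host-side sketch of how `v2e2c2v` could be assembled as `e2c2v[v2e]` with duplicates removed (the flat container layout and names are assumptions; the real code has to keep exactly the neighbor order the kernel expects):

```cuda
#include <algorithm>
#include <vector>

// Hypothetical flat neighbor tables (row-major):
//   v2e:   n_vertices x V2E   (edge index per vertex)
//   e2c2v: n_edges    x E2C2V (vertex index per edge)
// Composing the two and deduplicating shrinks the 6 * 4 = 24 raw entries per
// vertex down to the 7 unique vertices mentioned above (padded with -1).
std::vector<int> build_v2e2c2v(const std::vector<int>& v2e, int V2E,
                               const std::vector<int>& e2c2v, int E2C2V,
                               int n_vertices, int max_neighbors /* = 7 */) {
    std::vector<int> table(n_vertices * max_neighbors, -1);
    for (int v = 0; v < n_vertices; ++v) {
        std::vector<int> unique;
        for (int ve = 0; ve < V2E; ++ve) {
            const int e = v2e[v * V2E + ve];
            if (e < 0) continue;  // missing / skipped edge
            for (int ev = 0; ev < E2C2V; ++ev) {
                const int nv = e2c2v[e * E2C2V + ev];
                // Keep the first occurrence only; the insertion order is the
                // order in which the inlined kernel will read the neighbors.
                if (nv >= 0 && std::find(unique.begin(), unique.end(), nv) == unique.end())
                    unique.push_back(nv);
            }
        }
        for (int n = 0; n < max_neighbors && n < (int)unique.size(); ++n)
            table[v * max_neighbors + n] = unique[n];
    }
    return table;
}
```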
- [ ] Check grid dimensions of inlined structured gpu_naive
- [x] Create gpu_kloop implementations of the unstructured kernels
- [x] Create gpu_kloop implementation for the structured kernel
- [ ] Need to try limiting the registers in the inlined gpu_kloop kernels
- [x] Structured inlined implementation with `z_nabla4_e2` caching in shared memory to avoid recomputation (schematic sketch below)
- [ ] Pick a kernel and do a sweep with maxnreg from 64 to 128 with step 8
- [ ] Sweep the block size as well
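A schematic of the shared-memory caching idea from the `z_nabla4_e2` item above (the thread-to-edge/vertex mapping, names, and neighbor offsets are placeholders, and block-boundary edges that would also need staging are ignored). Each block computes every intermediate edge value once into shared memory, and all vertex threads of the block then reuse it instead of recomputing it per neighbor:

```cuda
// Schematic only: the real structured kernel maps threads to (i, j, k) and
// uses strided compile-time neighbor offsets.
__global__ void inlined_cached_schematic(const double* __restrict__ edge_in,
                                         double* __restrict__ vertex_out,
                                         int edges_per_block, int verts_per_block) {
    extern __shared__ double z_nabla4_smem[];  // edges_per_block doubles

    // Phase 1: one intermediate (z_nabla4-like) value per edge of this block,
    // computed exactly once.
    for (int e = threadIdx.x; e < edges_per_block; e += blockDim.x) {
        const int edge = blockIdx.x * edges_per_block + e;
        z_nabla4_smem[e] = /* nabla4 stencil on */ edge_in[edge];  // placeholder
    }
    __syncthreads();

    // Phase 2: every vertex combines a few cached edge values instead of
    // recomputing them (6 edges per vertex; offsets are placeholders).
    for (int v = threadIdx.x; v < verts_per_block; v += blockDim.x) {
        double acc = 0.0;
        for (int n = 0; n < 6; ++n)
            acc += z_nabla4_smem[(v + n) % edges_per_block];  // placeholder v2e offsets
        vertex_out[blockIdx.x * verts_per_block + v] = acc;
    }
}
```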
- [ ] Check why `inlined` `gpu_kloop` is faster than everything else
- [ ] Every thread computes ~20 klevels
- [ ] Fewer reads from global memory
- [ ] ~10% L1 hits --> Not sure how this justifies the extra performance
- [ ] Improve `gpu_kloop` `inlined_cached`
- [x] ~~Load to cache only the necessary fields~~
- [x] Shouldn't help, because in the non-`cached` version we already keep all the variables in registers, which is faster than L1 (shared)
- [ ] Try reducing the grid dimensions and increasing the k repetitions to take advantage of the values saved in shared memory
- [x] Try `__maxnreg__()` on `unstructured` `inlined` and `unstructured` `inlined` `v2v` (sketch below)
- [x] Improved performance with register limit
- [x] Improved implementations with minimal loads
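A minimal sketch of the per-kernel register cap, assuming a CUDA 12.4+ toolkit where the `__maxnreg__` attribute is available (the kernel name, body, and the value 96 are placeholders; the sweep items above vary exactly this value). Older alternatives are `__launch_bounds__` or the global `-maxrregcount=N` nvcc flag:

```cuda
// Cap this kernel at 96 registers per thread (placeholder value and body).
__global__ void __maxnreg__(96) inlined_v2v_placeholder(const double* __restrict__ in,
                                                        double* __restrict__ out, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];  // placeholder body
}
```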
- [x] `structured_inlined_cached_kloop` branch results (31/7)


- [x] Generate plot with current runtimes (`68e91d7`)

- [ ] Answer the following questions about the plot
- [x] What happens if `unstructured_gpu_kloop_inlined_v2v` k repetitions is 20
- [x] `unstructured_gpu_kloop_{inlined,inlined_v2v}` improved with k-repetitions 80 (maxKdim)
- [x] Why is `unstructured_gpu_naive_inlined_v2v` slower than `unstructured_gpu_naive_inlined`?
- [x] Fixed that by increasing the blockDim.z (K) and the number of registers used by `unstructured_gpu_naive_inlined_v2v`
- [x] Check if `unstructured_gpu_naive_inlined_v2v` `gpu_kloop` can also be improved by increasing the number of registers and the blockDim.z size --> the `blockDim.z` change doesn't help but an increased number of registers does
- [x] The `v2v` `gpu_kloop` version is slower because it uses more registers while having only slightly fewer reads
- [x] Check if `structured_gpu_naive_inlined_cached` can be improved
- [ ] See if `v2ecv` can be improved in `unstructured`
- [x] `structured_gpu_kloop_inlined_cached` can't be improved, because increasing the k loops or the number of k levels in the thread block size requires more shared memory than is available
- [x] Why is `unstructured` `gpu_kloop` `inlined` 2x slower than `inlined_v2v`?
- [x] That's because of the compression of vertices in `v2v`. We only need to load 6 vertices for every `v2e2c2v` instead of `24` in `e2c2v[v2e]`
- [x] The `inlined` kernel has 72 + 80 (k reps) * 54 `ld.global.f64` loads, while `inlined_v2v` has 72 + 80 (k reps) * 20 (about 2.6x fewer; quick check below)
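Quick check of that ratio from the counts above:

$$\frac{72 + 80 \cdot 54}{72 + 80 \cdot 20} = \frac{4392}{1672} \approx 2.63$$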
- [x] Generate plot with current runtimes (`7a25dee`)

- [x] Generate plot with current runtimes (`c0c46e9`)

- [ ] Next things to try
- [ ] Check how `e2ecv` can be improved
- [ ] Try Felix's last changes (https://github.com/GridTools/gridtools/pull/1787)
- [ ] Compare against the previous `master` without these improvements
- [ ] Compare against a simple `__ldg`
- [ ] Make sure that `gpu_naive` `inlined` versions are reasonable
- [ ] `mem_async` on `nabla4` for `K levels`
- [ ] Then think about how to do that for `inlined`/`inlined_v2v`
- [ ] Do K loop in consecutive K levels
- [ ] Partly compile time offsets
- [ ] See how more neighbor accesses influence performance
- [ ] Partly visible from the `inlined` and `inlined_v2v` performance
- [ ] Find representative kernels
- [ ] Check how they compare with `nabla4`
## Extras
- [x] Try out `sid::as_const` in the `gtfn` backend to see if it reduces memory loads in the gpu backend with loop blocking
- Didn't work
- [x] See if there is any update from Felix in `gridtools::storage` that might be applicable (`__ldg`)
- [ ] No change there yet. The `gridtools::storage` change is only worth doing if the manual change in the `nabla4` kernel shows an improvement
- [ ] Need to check if Felix's changes with `ldg` and the `k loop` help with reading the neighbor tables only once (sketch below)
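A minimal sketch of the "simple `__ldg`" idea from the items above (field names, the flat 6-neighbor `v2e` layout, and the reduction of the interpolation to a plain weighted sum are assumptions, not the actual kernel or the gridtools::storage API). The k-independent neighbor indices and coefficients are loaded once through the read-only data path and kept in registers for the whole k loop:

```cuda
__global__ void interpolate_ldg(const int* __restrict__ v2e,         // n_vertices x 6
                                const double* __restrict__ coeff,     // n_vertices x 6
                                const double* __restrict__ z_nabla4,  // n_lev x n_edges
                                double* __restrict__ out,             // n_lev x n_vertices
                                int n_vertices, int n_edges, int n_lev) {
    const int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_vertices) return;

    // The neighbor table and coefficients do not depend on k:
    // read them once through __ldg and keep them in registers.
    int nbh[6];
    double c[6];
    for (int n = 0; n < 6; ++n) {
        nbh[n] = __ldg(&v2e[v * 6 + n]);
        c[n]   = __ldg(&coeff[v * 6 + n]);
    }

    // Reuse the cached indices and weights for every k level.
    for (int k = 0; k < n_lev; ++k) {
        double acc = 0.0;
        for (int n = 0; n < 6; ++n)
            acc += c[n] * z_nabla4[k * n_edges + nbh[n]];
        out[k * n_vertices + v] = acc;
    }
}
```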