# [GT4Py] Research: ICON structured grid strided access standalone benchmark - V

- Shaped by:
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->

## Problem

In [cycle25](https://hackmd.io/-XvvJNt2SeCmcIRHhNs43g) we investigated, among other things, the performance of a fully vertical `nabla4` kernel inlined with the `interpolate` kernel. For this cycle we want to see the results of `inlining` and `caching` using shared memory with an additional kernel. Ideally this kernel should use `c2v`, so that the overcomputation becomes 3 times bigger (`c2v` size is 3). On top of that we would like to see a cached version of the unstructured kernel and investigate some extra optimizations regarding cache hints on the GPU. By the end of the cycle there should be a descriptive and presentable report with all the experiments.

## Appetite

Full cycle

## Solution

- Try cache hints and non_temporal_loads
- Try unstructured cached version with Christoph's description
- Add another `c2v` kernel

## Progress

- [x] Write report
  - [x] Figure out the type of the report
    - [x] A presentation (PowerPoint) would probably be best
  - [x] Start listing the assumptions made for the grids and the permutations of the unstructured grid
  - [x] Create a list with kernel names
    - [x] Mention for each one of them the different things tried
  - [x] Mention details about important kernels/implementations
    - [x] WIP
  - [x] Present results in a nice way
  - [x] Write conclusions
    - [x] Mention why we explored particular kernels
  - [x] Address Christoph's comments
    - [x] Make a picture regarding edge ordering (per-vertex / per-orientation)
    - [x] Reason about grid size
    - [x] Be more specific about k-loop iterations
    - [x] Mention for `gpu_kloop` unrolling that going through k, k+1, etc. in each thread should be better
- [ ] Figure out differences in performance in torus_256 compared to torus_128
  - [x] Probably more cache hits in `structured` versions compared to `unstructured` in `256` than in `128` because of the smaller amount of data (150 MB of data read vs. 50 MB of L2 cache)
    - [x] ~~Kernel performance more dependent on cache hits~~
    - [x] Need to validate with `ncu`
      - [x] Same cache hit % in L1 and L2 for `128` and `256` for `nabla4`
      - [x] Need to check `unstructured_gpu_kloop`
  - [ ] Try 40 K iterations for unstructured v2v
- [x] Rerun LDGSTS on GH200
  - [x] Much slower compared to the default version for `structured`
- [x] Check `ncu` output of unstructured per-vertex
  - [x] Much better cache hits, slower though
  - [ ] Try with more `K` iterations
- [x] Check occupancy for `structured_gpu_naive_inlined_cached` on `GH200`
  - [x] Try maxnreg 48 with 2 `blockDim.z`
- [x] Update `icon-structured` README
- [x] Run `nabla4-interpolate` kernel with only reads and writes
  - [x] Gather `ncu` profiles and share with Lukas

![image](https://hackmd.io/_uploads/HyUatEQ7kl.png)
![image](https://hackmd.io/_uploads/rkL-iE77kx.png)

- [x] Tried Lukas' suggestion of doing the `__syncthreads()` in the k-loop in `nabla4_interpolate_structured_inlined_cached` (see the sketch below)
  - [x] ~12% slower compared to saving all the k levels in shared memory and doing only one sync ([implementation](https://github.com/GridTools/icon_structured_benchmark/blob/a91e0d661cb09d6afca3868147330ec418648566/include/nabla4_interpolate_structured_inlined_cached.hpp#L113))
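A minimal sketch of the two barrier placements that were compared, assuming a toy 1D field and a simplified stencil (the real code is the linked `nabla4_interpolate_structured_inlined_cached` implementation; the names, sizes and the "stencil" here are illustrative only):

```cpp
#include <cuda_runtime.h>

constexpr int KSIZE = 80;  // assumed number of vertical levels

// Variant A (Lukas' suggestion): one level in shared memory, a barrier in
// every k iteration (~12% slower in the experiment above). Assumes the
// launch exactly covers [0, stride) in x.
__global__ void per_level_sync(const double* in, double* out, int stride) {
    extern __shared__ double lvl[];  // blockDim.x doubles
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < KSIZE; ++k) {
        lvl[threadIdx.x] = in[k * stride + i];  // "nabla4" stage
        __syncthreads();                        // barrier on every level
        const int left = threadIdx.x > 0 ? threadIdx.x - 1 : 0;
        out[k * stride + i] = 0.5 * (lvl[threadIdx.x] + lvl[left]);  // "interpolate"
        __syncthreads();                        // before lvl is overwritten
    }
}

// Variant B (current main): stage all k levels first, then a single barrier.
// The real kernel processes k in chunks that fit into shared memory.
__global__ void single_sync(const double* in, double* out, int stride) {
    extern __shared__ double lvls[];  // KSIZE * blockDim.x doubles
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < KSIZE; ++k)
        lvls[k * blockDim.x + threadIdx.x] = in[k * stride + i];
    __syncthreads();  // one barrier for the whole column
    for (int k = 0; k < KSIZE; ++k) {
        const int left = threadIdx.x > 0 ? threadIdx.x - 1 : 0;
        out[k * stride + i] =
            0.5 * (lvls[k * blockDim.x + threadIdx.x] + lvls[k * blockDim.x + left]);
    }
}
```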
- [ ] Inline `c2v` stencil
  - [x] Find ICON stencil with `c2v` neighbor
    - [x] Christoph will provide
      - [x] [verts2cells_scalar_lib](https://github.com/C2SM/libmath-interpolation/blob/17ce8f5ef0ceb18561c5397da4d1fdf86280623d/src/mo_lib_interpolation_scalar.F90#L41)
  - [x] Create `cpu_kfirst` version for `interpolate_c2v`
  - [x] Create a test that compares the `numpy` results with the `C++` kernel using serialized data for `c2v`
  - [x] Create `gpu` implementations
    - [x] `gpu_naive`
      - [x] `unstructured`
      - [x] `structured`
        - [x] Assumes that the 2 cells (upward, downward) correspond to thread `(i, j)`: 4 unique vertices instead of 6
          - [ ] Check how much this optimization improves performance compared to assigning only 1 cell to thread `(i, j)`
            - [ ] Launch the grid with `(x_dim, 2*y_dim)` and get `(i', j')` where `i' = i` and `j' = j / 2`. The color (upward or downward) is `j % 2` (see the index sketch after this list block)
    - [x] `gpu_kloop`
      - [x] `unstructured`
      - [x] `structured`
  - [x] Figure out `c2v` vector filtering
    - [x] Done based on the `8192` torus but needs more tests
    - [x] To be consistent, use `halo(vertex) = 1` for `verts2cells`
      - [x] The first column is discarded for uniformity even if it could be calculated for `verts2cells`
        - Shouldn't influence performance
        - [x] Check if it can be fixed in the future --> done in [92a9f8f83f54647f12c558d09e8272fb3689da04](https://github.com/GridTools/icon_structured_benchmark/commit/92a9f8f83f54647f12c558d09e8272fb3689da04)
    - [x] `c2v` doesn't have any orientation notion like the edges. The only difference in ordering would be per color (upward or downward cells grouped together). Most cache reuse should be enabled by ordering cells like `UpwardDownwardUpwardDownward...`, though, so I'll only pursue this option at the moment
  - [x] Create `inlined` implementations
    - [x] `gpu_naive`
      - [x] `unstructured`
      - [x] `structured`
    - [x] `gpu_kloop`
      - [x] `unstructured`
      - [x] `structured`
  - [x] Create `inlined_c2v2e2c2v` implementations
    - [x] `gpu_kloop`
      - [x] `unstructured`
    - [x] `gpu_naive`
      - [x] `unstructured`
    - [ ] Is it possible to construct the `c2v2e` and `c2v2e2e2c2v` vectors for upward and downward cells the same way, to avoid the runtime check for upward/downward cells in the kernel?
      - [x] My conclusion at the moment is **no**
      - [ ] Part of the master thesis?
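A minimal index-math sketch of the alternative one-cell-per-thread launch from the `gpu_naive`/`structured` item above; the kernel name and the interleaved upward/downward cell numbering are assumptions, and the actual gather/interpolation is elided:

```cpp
#include <cuda_runtime.h>

// Alternative launch: the grid covers (x_dim, 2*y_dim) threads, one cell
// each. Even raw rows handle the upward cell of row j, odd rows the
// downward one.
__global__ void c2v_one_cell_per_thread(int* cell_of_thread, int x_dim, int y_dim) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j_raw = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= x_dim || j_raw >= 2 * y_dim) return;
    const int j = j_raw / 2;      // row in the structured cell grid
    const int color = j_raw % 2;  // 0 = upward cell, 1 = downward cell
    // Flat cell id, assuming upward/downward cells are interleaved per (i, j).
    // The current version instead processes both colors in one (i, j) thread,
    // so it can reuse the 4 unique vertices of the cell pair.
    cell_of_thread[j_raw * x_dim + i] = (j * x_dim + i) * 2 + color;
}
```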
- [x] Create `inlined_cached` implementations
  - [x] `gpu_kloop`
    - [x] `structured`
  - [x] `gpu_naive`
    - [x] `structured`
  - [x] Try saving only the `nabla4` output in shared memory and recomputing the rest
    - More k repetitions fit due to less shared memory being occupied
    - Same performance as saving both `nabla4` and `interpolate` to shared memory (`only nabla4`: [commit](https://github.com/GridTools/icon_structured_benchmark/commit/e833281d40a37b9d9c5ec2f6780f198abca2fe24), `nabla4 & interpolate`: [commit](https://github.com/GridTools/icon_structured_benchmark/commit/9133df9a7b881980d284322ea5930cf15d382d5e))
  - [x] Write sanity checks
  - [x] Check `ncu` reports and optimize
- [x] See if it makes sense to create inlined combinations of `nabla4 & interpolate v2e` and `interpolate c2v`

![runtimes_torus_accel_128_80_roofline](https://hackmd.io/_uploads/rJM0LO9Lyl.png)

Latest `main` (`9133df9`):

![image](https://hackmd.io/_uploads/ByruTTMDJx.png)

With vectorized writes (`double2`) and `__stcs` in all `strided` implementations ([68a6a98](https://github.com/GridTools/icon_structured_benchmark/commit/68a6a9855bd15b6216ab20a0853b31cbb1ea123a)); see the store sketch further below:

![image](https://hackmd.io/_uploads/Hy_FlMNPJx.png)

- [x] `inlined_v2v_e` with an `e_v2e2c2v` vector which stores the edge indices for each unique vertex read from `v2v`
  - [x] Slower due to use of the stack (necessary for accessing unique elements dynamically at runtime)
- [x] After ~30 repetitions of the benchmark there is GPU clock throttling, as observed with `nsys` using `nsys profile --gpu-metrics-device=0`
- [ ] Try `inlined_cached` with a `k_loop` that saves each level in `shared mem`, with only `nabla4` cached using `LDGSTS` instructions (pipeline)
  - [x] ~2x slower compared to the `inlined_cached` version with `nabla4` cached in shared mem
    - [x] Improved a bit by loading every edge per k iteration with the pipeline instead of having a pipeline per edge orientation
    - [x] The problem is the lower (-25%) achieved occupancy. If the shared memory were larger it would be better
  - [ ] Ask Lukas if there is something I can improve
- [ ] Implement an `unstructured` `inlined_cached` approach by computing the maximum indexes of the edges of a certain thread block (see the host-side sketch below)
  - [ ] Check if it makes sense?
  - [ ] Naive approach
    - [ ] Go through the `v2e` vector and group it per thread block (32 entries)
    - [ ] For each one of these groups save in another vector the `v2e[0]` of the first entry and the `v2e[5]` of the last one
      - [ ] `per-vertex` the range will be much smaller
    - [ ] Give this vector as an input to the kernel
    - [ ] Go through all the inputs with the threads of the thread block and calculate `nabla4` for them
    - [ ] Save it to `shared memory`
    - [ ] Access it by reading `v2e` and subtracting the first index of this thread block
  - [ ] `v2v` approach
    - [ ] Same as above but with the `v2e2c2v` neighbor
  - [x] **Probably no benefit in performance based on the outcome of `inlined_cached`**
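A hedged host-side sketch of the naive grouping step described above; the function name, the 6-neighbor row-major `v2e` layout, and ascending edge ordering within a row are assumptions:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// For every thread block of 32 consecutive vertices, record the first and
// last edge index reachable through v2e, i.e. v2e[0] of the block's first
// vertex and v2e[5] of its last. The kernel can then stage the nabla4
// results for that contiguous edge range in shared memory and index them
// as v2e[v * 6 + n] - range.first.
std::vector<std::pair<int, int>> block_edge_ranges(const std::vector<int>& v2e,
                                                   int n_vertices,
                                                   int block_size = 32) {
    std::vector<std::pair<int, int>> ranges;
    for (int first = 0; first < n_vertices; first += block_size) {
        const int last = std::min(first + block_size, n_vertices) - 1;
        ranges.emplace_back(v2e[first * 6 + 0], v2e[last * 6 + 5]);
    }
    return ranges;
}
```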
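For reference, a minimal sketch of the vectorized-write pattern from the `double2`/`__stcs` measurement above (the kernel here is a stand-in; the real change is in commit 68a6a98). `__stcs` is the cache-streaming store hint, which fits because outputs are written once and never re-read:

```cpp
#include <cuda_runtime.h>

// Each thread packs two consecutive doubles and issues one 16-byte
// streaming store. Assumes out is 16-byte aligned and n is even.
__global__ void store_double2_stcs(const double* __restrict__ in,
                                   double* __restrict__ out, int n) {
    const int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 1 < n) {
        const double2 v = make_double2(in[i], in[i + 1]);
        __stcs(reinterpret_cast<double2*>(out + i), v);
    }
}
```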
- [ ] Try TMA with structured
  - [ ] Is it worth it?
    - [ ] Check cache hits in pure inlined compared to cached, and in k_loop_iter inlined and cached with the pipeline
      - [ ] Inlined and cached should have pretty good L1/L2 cache hits
  - [ ] Is it possible to align GPU memory using the gridtools builder?
    - [ ] If not, is it possible to somehow align the blocks to the 16-byte requirement?
  - [ ] TMA should only load the elements of `u_vert` and `v_vert` for 3 different consecutive batches for each j level of the thread block:
    - `(i-1, j+1) : (i-1+blockDim.x, j+1)`
    - `(i,   j  ) : (i  +blockDim.x, j  )`
    - `(i+1, j-1) : (i+1+blockDim.x, j-1)`
  - [ ] Needs a 16-byte-aligned global memory address
    - [x] Not true if we load, for example, index `17` and `sizeof(element)` is `4 bytes`
  - [ ] Need to load `blockDim.x+1` elements for `j-1`, `j` and `j+1` together into shared memory
    - [ ] Probably need to load `blockDim.x+2` and add `halo = 1` to `threadIdx.x` to calculate the addresses in shared memory
      - [ ] We load another column every time to address the 16-byte requirement
      - [ ] Then compute the necessary indexes in shared memory
  - [ ] Need to figure out how to align the gridtools arrays at a certain offset
    - [x] Should be possible using `halos`
      - [x] It's possible for the `cpu`. Need to check for the `gpu`
  - [x] **Probably no benefit in performance based on the outcome of `inlined_cached`**
- [x] Figure out which cache hints make sense
  - [ ] Makes more sense in the `inlined` versions (?)
  - [ ] Maybe more prominent differences in `256` (?)
  - [ ] Execute benchmarks on GH200
  - [x] No specific cache hint makes sense, because most of the loaded data are going to be reused by the next thread. Maybe it makes sense for the data that are written, because they are touched only once; however, the performance didn't change there
- [x] Partly compile-time offsets
  - [x] Quickly check. Good to know
  - [x] Not much of a difference in register usage or runtime in the couple of kernels I tried
- [x] Jacopo's script for visualizing offset providers: https://github.com/C2SM/icon4py/blob/documentation_dycore02/model/atmosphere/dycore/docs/ext/offset_providers.py

- `icon4py` changes
  - Fix the order of southeast edges to be according to the vertices, similarly to the other edges (north and east)
  - Permute `e2c2v` based on orientation or vertices
- `icon-structured-benchmark` features
  - Filter the `e2c2v`, `e2ecv`, `v2e` and `c2v` neighbor tables based on `halo`s, with `per-vertex` or `per-orientation` ordering for edges; `c2v` only has `per-vertex` ordering

## Latest `main` ([5272141](https://github.com/GridTools/icon_structured_benchmark/commit/5272141d9f812adeec2c6dfd94aa729901a8e1dd))

![runtimes_torus_accel_64_80_roofline](https://hackmd.io/_uploads/Hyxc6g3Pyg.jpg)
![runtimes_torus_accel_128_80_roofline](https://hackmd.io/_uploads/B1kk0l3w1e.jpg)
![runtimes_torus_accel_256_80_roofline](https://hackmd.io/_uploads/r1ZZRxhvke.jpg)

## Conclusions

- `gpu_kloop` is beneficial overall (**~30%**)
- Inlining more than 2 kernels is not beneficial due to the increase in register usage
  - If more than 2 kernels are inlined, using shared memory to cache the intermediate results is better
- Small differences between `inlined_v2v` and `structured`; `structured` performs a bit better and is probably of the same complexity to apply in ICON
- Apart from cache hints and compile-time offsets there is probably not much more to check, unless somebody gives some feedback

## Future work

- Felix's suggestions
  - `j-scan`: blocking the horizontal domain (and use of an SFC?)
  - Shuffle instead of shared mem (see the sketch below)
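A minimal sketch of the shuffle idea, assuming the needed neighbor value lives in an adjacent lane of the same warp (kernel name and access pattern are illustrative; warp-boundary neighbors would still need shared memory or a global read):

```cpp
#include <cuda_runtime.h>

// Pass a just-computed value from the neighboring lane through registers
// with __shfl_up_sync instead of staging it in shared memory, removing the
// shared-memory round trip and its __syncthreads().
__global__ void interpolate_via_shuffle(const double* __restrict__ nabla4_in,
                                        double* __restrict__ out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const double mine = (i < n) ? nabla4_in[i] : 0.0;
    double left = __shfl_up_sync(0xffffffffu, mine, 1);  // lane-1's value
    if ((threadIdx.x & 31) == 0) left = mine;  // lane 0: no in-warp neighbor
    if (i < n) out[i] = 0.5 * (mine + left);
}
```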
- Lukas' suggestions
  - Try loading 32 bytes at once instead of 8. Then L1 memory should be skipped and there should be more loads from L2 directly to shared mem
    - [x] Done in the [aligned_accesses branch](https://github.com/GridTools/icon_structured_benchmark/tree/aligned_accesses)
  - In the pipeline version, try calculating `nabla4` on the fly (inlined with `interpolate`) to reduce the shared mem size and add more computation that overlaps with the memory loads
    - Isn't helpful because of the neighbors that need to be kept in registers for this inlined combination
  - Try limiting registers (see the sketch below)
    - Useful, but a sweep is necessary, since the smallest number of registers without spilling is not the best
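A sketch of how such a sweep could look, assuming a placeholder kernel; `__launch_bounds__` (or compiling the file with `nvcc -maxrregcount=N`) caps registers per thread and trades spills against occupancy, so several caps have to be timed:

```cpp
#include <cuda_runtime.h>

// The second __launch_bounds__ parameter (min resident blocks per SM)
// indirectly caps the registers the compiler may use for this kernel.
template <int MaxThreadsPerBlock, int MinBlocksPerSM>
__global__ void __launch_bounds__(MaxThreadsPerBlock, MinBlocksPerSM)
kernel_under_test(const double* __restrict__ in, double* __restrict__ out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // stand-in for the real kernel body
}

// Sweep a few occupancy targets and time each; check spills with
// `nvcc --ptxas-options=-v` or ncu, since the fastest cap is usually not
// the tightest one that avoids spilling.
void register_sweep(const double* d_in, double* d_out, int n) {
    const dim3 block(128);
    const dim3 grid((n + block.x - 1) / block.x);
    kernel_under_test<128, 2><<<grid, block>>>(d_in, d_out, n);
    kernel_under_test<128, 4><<<grid, block>>>(d_in, d_out, n);
    kernel_under_test<128, 8><<<grid, block>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
}
```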