# [GT4Py] Research: ICON structured grid strided access standalone benchmark - II (torus grid without periodic boundaries)

<!-- Add the tag for the current cycle number in the top bar -->

- Shaped by:
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->

## Problem

In [cycle20](https://hackmd.io/WP1oWR6ZQEOsln4B82ZOMQ) we saw that the structured approach for the torus files can provide a significant speed-up compared to the unstructured support on CPU. However, the fact that the torus grids had periodic boundaries made the calculation of the offsets needed for the `e2c2v` neighbors very compute-intensive, with lots of `modulo` operations. The goal for this cycle is twofold: first, to run the `nabla4` kernel on a structured grid without periodic boundaries to simplify the offset calculation (similar to what would happen in a real ICON scenario); second, to implement a GPU backend to analyze the performance on GPUs.

## Appetite

<!-- Explain how much time we want to spend and how that constrains the solution -->

* Run `nabla4` kernel in a reduced grid without periodic boundaries: 1 week
* GPU implementation & optimizations: 2 weeks

## Solution

<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->

For the structured version, replace the for-loop over the edges with two for-loops (`i,j`) that go through the vertices, which have Cartesian coordinates. That way we can skip the vertices and edges that involve periodic boundaries, and the offsets to the neighboring vertices are computed with just `+1` or `-1` along the `longitudinal` or `latitudinal` axis.

![IMG_2575 Large](https://hackmd.io/_uploads/rJrCdG2JA.jpg)
_Drawing that showcases the grid with halos and the domain of edges, vertices and cells for processing a certain part of the grid. Used for discussion with Hannes and Ioannis_

For the unstructured version we can just remove the unnecessary elements from the `e2c2v` and `e2ecv` vectors. In both cases we should keep the same input fields for the other fields as now (some extra memory which isn't going to be accessed, in both the structured and unstructured cases). After having that in place we can start looking at the GPU performance by writing a kernel in CUDA.

## Rabbit holes

<!-- Details about the solution worth calling out to avoid problems -->

## No-gos

<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->

## Progress

<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [x] Implement benchmark without the edges that wrap around the torus (periodic boundaries) and with halo `1` around the computation area
- [x] Create validation tests
- [x] Benchmark first version on Santis and compare with previous results
    - [x] Get results of first implementation
- [x] Try simplifying the calculation of the `e2c2v` edges
    - [x] We only need to calculate 6 different neighbors for all `e2c2v` vertices --> not much gain
- [x] Try different compilation options
    - [x] Add `#pragma GCC ivdep` everywhere
    - [x] `--param vect-max-version-for-alias-checks=n` where `n > 10`
    - [x] Try again `#pragma omp simd` --> not necessary
    - [x] Try without `-fvect-cost-model=unlimited -msve-vector-bits=128`, etc. --> slightly slower
    - [x] Full set of compilation options
      ```
      -g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited -fPIC -fopenmp-simd -DNDEBUG --param vect-max-version-for-alias-checks=50
      ```
- [x] Clean up the new implementation
    - [x] Try to simplify `wrapper.cpp`
        - 
[x] Simplify `benchmark.hpp`
        - [x] `wrapper.cpp` is more difficult because of the python bindings
- [x] Gather results (`890b981`)
  ![k_sweep_acc_torus_cpu_ifirst](https://hackmd.io/_uploads/HyfCRMreA.png)
  ![k_sweep_acc_torus_cpu_kfirst](https://hackmd.io/_uploads/SyfRAfHeA.png)
  ![runtimes_torus_128_65](https://hackmd.io/_uploads/HJhky7BlC.png)
  ![runtimes_torus_64_65](https://hackmd.io/_uploads/By3J1XSlA.png)
  ![runtimes_torus_128_65_multiproc](https://hackmd.io/_uploads/Hk9bJmBlC.png)
- [x] Conclusions from the CPU implementation
    - [x] Runtimes of the `structured` grid without periodic boundaries are a bit smaller than the one with periodic boundaries due to the smaller domain they calculate
    - [x] `cpu_ifirst` in single core is 2x faster than the structured version with periodic boundaries (partly because of the smaller domain)
    - [x] `cpu_kfirst` has a very slight improvement because it only calculates the neighbor offsets once per edge
    - [x] Running the benchmark once per core of a CPU gets us **60%** better performance for `cpu_ifirst` and **28%** for `cpu_kfirst` --> should be cautious because of the high variability of this experiment
    - [x] The runtimes reported for the `cpu_ifirst` implementation were higher for the `structured` grid with halos compared to the one without halos, which is strange
- [x] Start GPU implementation
    - [x] L2 cache of H100: 50 MB
    - [x] Check memory transfer to the GPU --> gridtools storage takes care of this
    - [x] Use target_views of gridtools storage
    - [x] Figure out build system to build with NVCC --> need [fix in gridtools](https://github.com/GridTools/gridtools/pull/1776)
    - [x] Launch unstructured on the GPU
        - [x] Add error checking, etc.
    - [x] Implement structured kernel on GPU
    - [ ] Check why the unstructured and structured implementations have a different number of coalesced memory accesses --> same `e2c2v` neighbors calculated in similar order, so not sure why there are more uncoalesced accesses in the structured version
    - [x] Compare changing between ifirst and kfirst loops --> Why is `gpu` `ifirst` from the beginning? --> because of vertical kernels that access neighbors in the vertical direction
- [ ] GPU Unstructured implementation optimizations
    - [ ] Find optimal launch configuration for P100 --> Check with `nvprof`
    - [x] Persistent L2 cache for the neighbors --> Not beneficial on RTX 3080 because of the very small size of the persistent L2 memory. Needs to be evaluated on H100
        - [x] Figure out how to keep 2 vectors in persistent memory --> Check `cudaLaunchKernelEx` and `cudaLaunchConfig_t`, `numAttrs`, `cudaFuncSetAttribute` --> Not possible without combining the 2 vectors in memory
    - [ ] Shared memory for the neighbors --> Probably not worth the try because right now all the `k` levels are launched in parallel
    - [x] Edge order (permute `e2c2v` to have edges grouped based on their orientation)
- [ ] GPU Structured implementation optimizations
    - [ ] Find optimal launch configuration for P100 --> Check with `nvprof` --> `nvprof --metrics all` w/ `__launch_bounds__()` and `-Xptxas=-v`
    - [x] Find optimal launch configuration for H100
    - [x] Calculate the offsets for each thread block and save them in shared memory --> up to 5% faster for `klevels=16-65`, 2x slower for `klevels=1` and 0.5% slower for `klevels=180`
    - [x] Write the kernel such that we launch 1 thread/edge and check inside the kernel with an `if-else` which neighbor we should use depending on the edge index --> much better performance
- [x] General improvements
    - [ ] Figure out what to do with `e2ecv`
        - [ ] Should it just be increased linearly to get the best access pattern? This would mean that in normal scenarios the `e2ecv` vector and the ones that it corresponds to need to be properly permuted
        - [x] Should it be permuted based on the same permutation as `e2c2v`?
This means that the memory access will be consecutive only for parts of the grid --> currently selected
- [ ] Rerun the `cpu_ifirst` implementation on `Santis` after the latest fix with `#pragma GCC ivdep` in the `unstructured cpu_ifirst`
- [x] Enable the `gtfn` gpu backend as well
    - [x] Slower than the pure CUDA implementation
- [x] Try different types for `e2c2v` and `e2ecv`
    - [x] `int` might be faster than `uint32_t` in CUDA
    - [x] `uint32_t` is marginally the fastest of all the implementations and uses the least amount of registers in the kernels
      ```
      = k levels: 65 =
      nabla4_benchmark_unstructured_gpu_gridtools median runtimes (5063409_sizet): 0.012321447
      nabla4_benchmark_structured_torus_gpu_gridtools_halo median runtimes (5063409_sizet): 0.011105387
      = k levels: 65 =
      nabla4_benchmark_unstructured_gpu_gridtools median runtimes (5063409_int): 0.01198712
      nabla4_benchmark_structured_torus_gpu_gridtools_halo median runtimes (5063409_int): 0.010073352
      = k levels: 65 =
      nabla4_benchmark_unstructured_gpu_gridtools median runtimes (5063409_uint32): 0.010934761
      nabla4_benchmark_structured_torus_gpu_gridtools_halo median runtimes (5063409_uint32): 0.010092807
      ```
- [x] ~~Replicate cache pressure on the GPU~~
    - [ ] ~~Multiple kernels launched on the GPU~~
        - [ ] ~~With different processes (probably more overhead)~~
        - [ ] ~~With a single process but with different objects - streams (needs a bit of refactoring to launch the kernels asynchronously and have a single `cudaDeviceSynchronize()` in the end)~~
    - [ ] ~~Figure out how many times to replicate the kernel~~
        - [ ] ~~Should be something similar to a typical ICON execution~~
    - [x] Kernels are launched in order on the GPU in ICON-EXCLAIM (`ASYNC(1)`) so there is only 1 of them being executed on the GPU. There is no stress from multiple kernels being executed at the same time
- [x] Santis build flags
    - [x] CPU: `cmake .. -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited -fPIC -fopenmp-simd -DNDEBUG --param vect-max-version-for-alias-checks=50" -DINDEX_TYPE=sizet`
    - [x] GPU: `cmake .. -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -fPIC -DNDEBUG" -DIS_GPU=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_CUDA_FLAGS="-diag-suppress 177 -fPIC --save-temps --verbose --generate-line-info -Xptxas=-v --expt-relaxed-constexpr -DNDEBUG" -DINDEX_TYPE=sizet`
- [x] `gtfn` faster? --> No, it was timing the wrong thing
- [x] Check why there is an overhead of the C++ CUDA stream timers compared to the nsys profile
    - [x] Is the HostToDevice transfer the overhead? --> No
    - [x] Based on the nsys timeline, the overhead of the cuda timers over the time reported by nsys is the `cuLaunchKernel` time
        - [x] Can influence a bit the acceleration reported for small problem sizes
- [x] Create plots for CPU
    - [x] Run times for `128` and `64` with `gtfn`, `cpu_ifirst` and `cpu_kfirst` (**int64**)
      ![runtimes_torus_128_65](https://hackmd.io/_uploads/HJQ9Iv6Z0.png)
      ![runtimes_torus_64_65](https://hackmd.io/_uploads/ByI_Lw6WA.png)
    - [x] Acceleration with `K sweep` for `128` and `64` for the different backends
      ![k_sweep_acc_torus_cpu_ifirst](https://hackmd.io/_uploads/S1T4ODpWA.png)
      ![k_sweep_acc_torus_cpu_kfirst](https://hackmd.io/_uploads/BylBdvpbC.png)
    - [x] If time permits, for `512` and `256` as well
    - [x] Multi-process plots for 128 (mention acceleration) --> **int64_t: cpu_ifirst 68%, cpu_kfirst 120%**
      ![runtimes_torus_128_65_multiproc](https://hackmd.io/_uploads/r1R6IS6-A.png)
    - [x] Mention that `int64_t` is the default for `gtfn` for the indexes
    - [x] `int64` and `sizet` faster than `int` or `uint32` in most cases
    - [x] Larger acceleration in `int64` and `sizet` than `int` and `uint32` for the `cpu_ifirst`
structured version
- [x] Compared to the full grid
    - [x] Unstructured 2-10% slower than the full grid (???) --> probably related to the added `#pragma GCC ivdep` in the `cpu_ifirst` unstructured
    - [x] Structured `cpu_ifirst` up to 2x faster, `cpu_kfirst` <3% faster
- [x] Create plots for GPU
    - [x] For the fastest `INDEX_TYPE`
    - [x] Run times for `128` and `64` with `gtfn` and `gpu`
      (**int64**) ![runtimes_torus_64_65](https://hackmd.io/_uploads/B1xHViMCbR.png)
      (**sizet**) ![runtimes_torus_64_65](https://hackmd.io/_uploads/Sk_NazRZR.png)
      (**int64**) ![k_sweep_acc_torus_gpu](https://hackmd.io/_uploads/Bk2wjfR-R.png)
      (**sizet**) ![k_sweep_acc_torus_gpu](https://hackmd.io/_uploads/Sk9l6fR-C.png)
    - [x] Acceleration with `K sweep` for 128 and 64 for the different backends
    - [x] Mention that `int64_t` is the default for `gtfn` for the indexes
    - [x] Mention how the index types compare
        - [x] `int` marginally faster than `uint32` (<1%)
        - [x] `sizet` marginally faster than `int64` (<2.4%)
        - [x] `sizet` faster than `int` (~8%)
- [x] L2 Persistent
    - [x] If the neighbor array is larger than the `persistingL2CacheMaxSize` it doesn't provide better performance
    - [x] With `int/uint32` as `index_type` there is somewhat better performance because the neighbor tables fit the L2 persisting max size
    - [x] Needed to combine `e2c2v` and `e2ecv` into the same array
    - [x] torus: 128, k: 65, index type uint32 (**10-18%** speedup)
        - [x] With L2 pers `Med (5 runs): 1,105,280 ns (nsys)`, `Med (101 runs): 0.001102784037590027 (timer)`
        - [x] Without L2 pers `Med (5 runs): 1,307,680 ns (nsys)`, `Med (101 runs): 0.001216`
    - [x] torus: 128, k: 65, index type sizet (**12%** slowdown)
        - [x] With L2 pers `0.0013276159763336182`
        - [x] Without L2 pers `0.001179`
    - [x] torus: 64, k: 65, index type: uint32 (**3%** slowdown - **0.2%** speedup)
        - [x] With L2 pers `Med (5 runs): 5,023,808 ns (nsys)`, `Med (101 runs): 0.004969600200653076 (timer)`
        - [x] Without L2 pers `Med (5 runs): 5,035,040 ns (nsys)`, `Med (101 runs): 0.004793 (timer)`
    - [x] Try with `e2c2v` only
        - [x] torus: 128, k: 65, index type uint32 (**1%** speedup)
            - [x] With L2 pers `Med (5 runs): 1,284,256 ns`
        - [x] torus: 64, k: 65, index type: uint32 (**9.3%** slowdown)
            - [x] With L2 pers `Med (5 runs): 5,503,168 ns`
        - [x] I expect that with `index type: sizet` there won't be any benefit from the `L2 persistent` memory because only a small % of the arrays will fit inside it
    - [x] Try `inv_vert_vert_length_gt_tv` or `inv_primal_edge_length_gt_tv`
        - [x] `inv_vert_vert_length_gt_tv`
            - [x] torus: 128, k: 65, index type uint32 (**0.6%** speedup)
                - [x] With L2 pers `Med (5 runs): 1,299,392 ns`
            - [x] torus: 64, k: 65, index type: uint32 (**2%** slowdown)
                - [x] With L2 pers `Med (5 runs): 5,168,768 ns`
    - [x] **Conclusion**
        - [x] Acceleration achievable only in cases where whole vectors fit inside the `L2 persistent memory`. Otherwise small speed-up or slowdown
        - [x] It is possible to set only one block of contiguous memory as `L2 persistent memory`, so adding `e2c2v` or other neighbors needs some changes to the code
        - [x] If `e2c2v` and `e2ecv` fit in the `L2 persistent memory`, meaning up to 902700 edges with index type `uint32`, there is a measurable speed-up of up to 18%
- [x] Constant memory
    - [x] 64 KB --> Very small for `e2c2v` or `e2ecv`. Same for anything related to edge length at large scale --> Need to define what large scale means
- [ ] Try structured with i,j for the threads with a similar number of threads to the orientations
- [ ] Try out cooperative groups and [Distributed Shared Memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#distributed-shared-memory) to read the `e2c2v` into shared memory only once if possible --> Needs an H100 GPU
    - [ ] Refactor the structured implementation to use `cooperative groups`
- [x] Try vectorized loads for the neighbor tables
    - [x] For `torus_128` with `k: 65` and index type `int`
        - [x] 1D vector: `0.0013059999942779542`
        - [x] 1D vector with `int4`: `0.0012603839635848997` (3% speed-up)
        - [x] 1D vector with `int4` and improved config: `0.0012020959854125977` (8%) --> Faster than the `sizet` version
    - [x] Try with gridtools storage
        - [x] Transpose storage to `(4_c, 0)` from `(0, 4_c)` --> Uncoalesced memory accesses for the neighbor tables but vectorized loads
            - [x] For `torus_128` with `k: 65` and index type `int`: `0.004006048202514648`
        - [x] Added logic for vectorized memory
            - [x] For `torus_128` with `k: 65` and index type `int`: `0.0029010560512542725` (38% speed-up - 2.4x slowdown compared to the original version)
        - [x] Why are the strides set up like this in `gridtools::storage`?
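The layout trade-off behind the `(4_c, 0)` vs `(0, 4_c)` storages above can be sketched with NumPy byte strides. This is a minimal illustration, not the benchmark's actual storages; the sizes and array names are made up:

```python
import numpy as np

# Illustrative sizes: ~3 * 128 * 128 edges, 4 e2c2v neighbors per edge.
num_edges, num_neighbors = 3 * 128 * 128, 4

# (edge, neighbor) C-order: one edge's 4 entries are adjacent in memory,
# enabling a single vectorized (int4-style) load per edge.
e2c2v_edge_major = np.zeros((num_edges, num_neighbors), dtype=np.int32)

# (neighbor, edge) C-order: consecutive edges (i.e. consecutive threads) read
# adjacent addresses for the same neighbor slot (coalesced), but one edge's 4
# entries are num_edges elements apart, so no vectorized load per edge.
e2c2v_neighbor_major = np.zeros((num_neighbors, num_edges), dtype=np.int32)

# The byte strides make the trade-off visible:
assert e2c2v_edge_major.strides == (4 * num_neighbors, 4)  # neighbors adjacent
assert e2c2v_neighbor_major.strides == (4 * num_edges, 4)  # edges adjacent
```

Which layout wins depends on whether per-thread vector width (int4 loads) or inter-thread coalescing dominates, which matches the mixed results measured above.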
- [ ] TODOS
    - [ ] Avoid the use of shared memory and use registers outside the k-loop
        - [x] Ref `ij` implementation: `torus_128`, `K 65`, `10 repetitions` median: `1,341,488 ns`
        - [x] `ij` implementation with registers: `torus_128`, `K 65`, `10 repetitions` median: `831,184 ns`
    - [x] For loop with color/orientation instead of parallelism in the grid
        - [x] `torus_128`, `K 65`, `10 repetitions` median: `917,728 ns`
    - [x] Change the permutation in the sparse fields for coalesced memory accesses
        - [x] Unstructured
            - [x] Thread cache local permutation
                - [x] `torus_128`, `K 65`, `10 repetitions` median: `1,102,928 ns`
            - [x] Thread coalesced memory accesses permutation
                - [x] `torus_128`, `K 65`, `10 repetitions` median: `1,112,656 ns`
        - [x] Structured
            - [x] Thread cache local permutation
                - [x] `torus_128`, `K 65`, `10 repetitions` median: `898,080 ns` (**<1% slower** but faster in **NCU** ???)
            - [x] Thread coalesced memory accesses permutation
                - [x] `torus_128`, `K 65`, `10 repetitions` median: `922,825 ns` (**~3% slower** but faster in **NCU** ???)
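The modulo-free indexing that separates the structured implementations measured above from the periodic-boundary version of cycle20 can be sketched as follows. The helpers (`vertex_index`, `neighbors_periodic`, `neighbors_halo`) and the 4-neighbor stencil are illustrative simplifications, not the benchmark's real `e2c2v` pattern:

```python
# Row-major vertex indexing; row_stride is the allocated row width.
def vertex_index(i, j, row_stride):
    return i * row_stride + j

def neighbors_periodic(i, j, nx, ny):
    # Torus with periodic boundaries: every neighbor access pays a modulo.
    return [vertex_index(i, (j - 1) % nx, nx),
            vertex_index(i, (j + 1) % nx, nx),
            vertex_index((i - 1) % ny, j, nx),
            vertex_index((i + 1) % ny, j, nx)]

def neighbors_halo(i, j, row_stride):
    # Halo of width 1: (i, j) only iterates the interior, so +/-1 steps never
    # leave the allocation and the offsets are plain +/-1 and +/-row_stride.
    c = vertex_index(i, j, row_stride)
    return [c - 1, c + 1, c - row_stride, c + row_stride]

# 4x4 torus corner wraps around; the same point on a (4+2)-wide halo grid
# is reached with constant offsets only.
assert neighbors_periodic(0, 0, 4, 4) == [3, 1, 12, 4]
assert neighbors_halo(1, 1, 6) == [6, 8, 1, 13]
```

The halo variant trades a little unread memory (the halo ring) for constant-offset address arithmetic, which is what made the structured GPU kernels above cheap on index computation.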
- [x] K loop inside the kernel
    - [x] Thread coalesced memory accesses permutation
        - [x] Unstructured `torus_128`, `K 65`, `5 repetitions` median: `0.0007611839771270752 s`
        - [x] Structured `torus_128`, `K 65`, `5 repetitions` median: `0.0007422400116920471 s` (**2.5%** speedup)
        - [x] Unstructured `torus_64`, `K 65`, `5 repetitions` median: `0.0029232320785522463 s`
        - [x] Structured `torus_64`, `K 65`, `5 repetitions` median: `0.0026801600456237795 s` (**9%** speedup)
    - [x] Thread cache local permutation
        - [x] Unstructured `torus_128`, `K 65`, `5 repetitions` median: `0.0009497280120849609 s`
        - [x] Structured `torus_128`, `K 65`, `5 repetitions` median: `0.000814303994178772 s` (**16%** speedup compared to unstructured - **10%** slowdown compared to coalesced)
    - [x] After implementing k-blocking in unstructured as well and pulling all the variables that are reused inside the k-loop outside of it, into registers:
      ![runtimes_torus_128_65](https://hackmd.io/_uploads/HJwnDYEmA.png)
      ![runtimes_torus_64_65](https://hackmd.io/_uploads/rJ2oDtNXC.png)
      ![k_sweep_acc_torus_gpu](https://hackmd.io/_uploads/BJbaPFVQR.png)
- [ ] Check the performance difference between `gtfn` and the CUDA `unstructured` implementation
    - [x] Uncoalesced memory accesses in the output vector --> Fixed (23% speedup)
    - [x] Uncoalesced memory accesses in the neighbors vectors --> Fixed (1% speedup)
    - [x] Missing correct alignment --> Fixed with GridTools storage
    - [ ] **2.5x more memory loads**: is `neighbor_table_neighbors` called multiple times with the same arguments? This function populates arrays from the neighbor tables and reads them multiple times from memory (8 instead of 1); need to make sure that my prints don't mess things up and increase the number of reads (same number of sectors read with ncu)
        - [ ] Code path
            - [ ] https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/unstructured.hpp#L107
            - [ ] https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/unstructured.hpp#L85
            - [ ] https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/neighbor_table.hpp#L63
            - [ ] https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/sid_neighbor_table.hpp#L36 --> multiple reads of the same address
        - [ ] Without the `if` in https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/unstructured.hpp#L87 same number of memory requests as unstructured and 10% speedup compared to the ref `gtfn` (still 44% slower than the cuda unstructured) --> many more integer instructions. Most stalls in `neighbor_table_neighbors`
            - [ ] Increased ALU utilization (minor)
            - [ ] Stalls in `shift()` and `neighbor_table_neighbors()` (minor)
- [ ] Additional kernel
    - [ ] Unstructured (based on Christoph's exploration 2 kernels are faster than 1 inlined --> according to Christoph the inlined one is probably faster because of the reuse of the intermediate step)
        - [ ] 2 kernels with a tmp from the 1st one as input of the 2nd one (1st kernel output is the same as the current one)
        - [ ] 1 kernel inlined
        - [ ] 1 kernel with 1 neighbor table with c2(faraway)v
    - [ ] Structured
        - [ ] 2 kernels with a tmp from the 1st one as input of the 2nd one (1st kernel output is the same as the current one)
        - [ ] 1 kernel inlined (should be faster)
    - [ ] Figure out the 2nd kernel
        - [ ] Discussion with Christoph:
          ```
          What Christoph said is that basically there is no kernel that uses the output of nabla4 (ECV) and applies it to something else (CECV or VECV).
          Instead of nabla4 there is a very similar kernel (nabla2) which calculates 2 output fields. One of them is calculated in a very similar way to nabla4 (z_nabla2 output field) and the other one is unrelated at the moment (kh_smag_e_vp). Then the z_nabla2 field is used by an interpolation kernel (_mo_intp_rbf_rbf_vec_interpol_vertex) that calculates values on vertices based on an edge (VECV). This is the kernel that would make more sense to use for our experiment. It would need a rewrite from scratch of the nabla2 kernel to keep only the necessary parts (which however should be very similar to the nabla4 kernel) plus the implementation of the _mo_intp_rbf_rbf_vec_interpol_vertex kernel and all the combinations of nabla2 and _mo_intp_rbf_rbf_vec_interpol_vertex.
          An alternative to the above is again implementing nabla2 but keeping only the part that calculates kh_smag_e_vp. Then there is another kernel that does a CECV transformation (_temporary_fields_for_turbulence_diagnostics) that can be used for the exploration. This alternative however looks like it would need more work since basically there are 2 brand new kernels that need to be implemented.
          For that reason we concluded that it would make more sense to go with the first option since there is also lots of data reuse in the VECV kernel.
          ```
        - [x] `gt4py` implementations
            - [model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/calculate_nabla2_and_smag_coefficients_for_vn.py](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/calculate_nabla2_and_smag_coefficients_for_vn.py)
            - [model/common/src/icon4py/model/common/interpolation/stencils/mo_intp_rbf_rbf_vec_interpol_vertex.py](https://github.com/C2SM/icon4py/blob/main/model/common/src/icon4py/model/common/interpolation/stencils/mo_intp_rbf_rbf_vec_interpol_vertex.py)
    - [ ] `mo_intp_rbf_rbf_vec_interpol_vertex` implementation
        - [ ] `cpu_ifirst` and `cpu_kfirst`
            - [ ] Add test with `serialbox`
        - [ ] `gpu` implementation
            - [ ] Add sanity check
    - [ ] `nabla2` implementation
        - [ ] Tests, etc. might take time
    - [ ] Combine `nabla2` and `interpolation`
        - [ ] Launch the kernels together on CPU and GPU
        - [ ] Write the combined kernel for CPU and GPU
            - [ ] Figure out how to calculate the combined neighbor table and kernel
- [ ] `gtfn` performance
    - [ ] Created `copy` kernel
        - [x] `EdgeDim: 902700`, `KDim: 65`, `20 reps`
            - [x] `gtfn`: `0.00030164799094200135 s`
            - [x] `cuda`: `0.00034942400455474854 s` (15% slower)
    - [ ] Created `copy_neighbor` kernel
        - [ ] The `gridtools storage` implementation is slower --> Many more stalls and very few eligible threads
        - [ ] Does `gt4py` storage have different strides than `gridtools`?
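The planned two-kernel combination (an edge kernel producing a `tmp` field that a `v2e`-gather interpolation then consumes, in the spirit of `_mo_intp_rbf_rbf_vec_interpol_vertex`) can be sketched as below. All shapes, tables, coefficients and function names are toy stand-ins, not ICON's:

```python
import numpy as np

def kernel_edges(vn, inv_len):
    # Stand-in for the nabla2-like first step: produces the tmp edge field.
    return vn * inv_len

def kernel_interp_vertices(tmp_e, v2e, coeff):
    # VECV-style step: gather edge values per vertex and reduce with weights.
    return (tmp_e[v2e] * coeff).sum(axis=1)

vn = np.arange(6, dtype=float)           # 6 edges
inv_len = np.full(6, 0.5)
v2e = np.array([[0, 1, 2], [3, 4, 5]])   # 2 vertices x 3 incident edges
coeff = np.ones_like(v2e, dtype=float)   # interpolation weights

tmp = kernel_edges(vn, inv_len)          # field handed between the kernels
out = kernel_interp_vertices(tmp, v2e, coeff)
assert out.tolist() == [1.5, 6.0]
```

The inlined single-kernel variant would substitute `kernel_edges` directly into the gather, trading the `tmp` round-trip through memory for recomputation of shared edges.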
- [ ] Results
    - [ ] Edges: 902700, K levels: 65, Repetitions: 30, int32 for both index_type and vectors
        - [ ] `gtfn` with `m_index` check: `0.000391551986336708`
        - [ ] `gtfn` without `m_index` check: `0.00038412800431251526`
        - [ ] `gt::storage` with 128B alignment: `0.0003876159936189651`
        - [ ] `gt::storage` with 256B alignment: `0.0003779360055923462`
    - [ ] Need to check the above differences
        - [x] Check changes in the different repositories to see if something changed and `gt::storage` is now faster than `gtfn` --> Reproduced the same results with initializing one object for all the runs
        - [x] Check that the GPU clocks are stable --> needs elevated permissions and probably unnecessary since the kernels run close to each other
        - [x] Check if there is any difference with the above options in `ncu` --> No
        - [x] Check whether anything changes if I run the kernel with different data at every repetition --> See below
        - [x] Figure out why there are such low `eligible warps` in the `gt::storage` kernel --> The number of instructions in the `gt::storage/cuda` version is ~46% less than in the `gtfn` version. The `Warp Cycles Per Executed Instruction [cycle]` for `gt::storage/cuda` is ~48% less. This means that the total number of `Warp cycles` for the two versions is almost the same, which is what we're seeing in the runtimes as well. It's just that there is latency hiding in the `gtfn` version when loading the data from memory, due to the extra `int` operations executed
        - [x] Why does `gtfn` always generate `ld.global.s32` for the neighbors? --> https://github.com/GridTools/gridtools/commit/b40e0fc3fe4725223cf36842340b1f2555f584f1 (potentially a better fix exists)
            - [ ] ~~Find proper fix~~
        - [x] Why does `gt::storage` generate `ld.global.u32` with `int`? --> Looks to be the same for `gtfn` as well. Even if there is `ld.u32/64`, the output register is handled as an integer
        - [ ] Check why the strides are different between `gt4py.storage` and `gt::storage` and whether this plays any role in the `eligible warps`/`Long Scoreboard Stalls`
        - [ ] Try the above with more repetitions
        - [x] Initialize one object per run or one object for all the runs
            - [x] Better one object for all the runs
              ```
              with int64 both and initialization of object per run (30 repetitions)
              copy_neighbor_benchmark_gtfn_gpu median runtime: 0.00037510401010513306
              copy_neighbor_benchmark_gpu median runtime: 0.00041022399067878725

              with int64 both and one object for all runs for both (30 repetitions)
              copy_neighbor_benchmark_gtfn_gpu median runtime: 0.0003737919926643372
              copy_neighbor_benchmark_gpu median runtime: 0.00037880000472068786

              with int64 both, one object for all runs for both and 10 dry runs for the cuda version (30 repetitions)
              copy_neighbor_benchmark_gtfn_gpu median runtime: 0.00037259200215339664
              copy_neighbor_benchmark_gpu median runtime: 0.00037831999361515045
              ```
        - [x] torus 128, K 65, int64
            - [x] without `m_index` check
                - `nabla4_benchmark_unstructured_gtfn_gpu median runtime: 0.0011159999966621399`
                - `nabla4_benchmark_unstructured_gpu_gridtools median runtime: 0.000985535979270935`
            - [x] with `m_index` check (double the `ld.global.u64` compared to the other implementations and to the version without the `m_index` check)
                - `nabla4_benchmark_unstructured_gtfn_gpu median runtime: 0.0018831200003623962`
                - `nabla4_benchmark_unstructured_gpu_gridtools median runtime: 0.000996288001537323`
        - [x] The above behavior with double the number of loads for the neighbors is not observable with other simple kernels --> When a different field was accessed with the same neighbor it generated double the number of loads
- [x] `uint32` structured faster with `sizet` grid parameters
    - [x] Compare performance between `uint32` and `int32` for structured
        - [x] Same with the same grid parameters
        - [x] Almost the same generated code as well
    - [x] For `gtfn`, `int` and `uint32` have the same performance
- [x] Double loads for neighbors fix
    - [x] torus 128, K 65, `int32`, 10 repetitions
        - [x] Ref with extra loads: `nabla4_benchmark_unstructured_gtfn_gpu median runtime`: `0.0014720000028610228`
        - [x] `const variable` fix: `0.001332592010498047`
            - [x] `~10%` improvement
            - [x] https://github.com/GridTools/gridtools/commit/c7e7bb7c107424178fd09e95dc3326ff6bd29146
        - [x] Still ~40% slower than the unstructured CUDA version --> Many more 64-bit calculations (~35% more compute throughput, ~32% increase in duration according to NCU)
            - [x] Lots of warp stalls in https://github.com/iomaganaris/gridtools/blob/fix/m_index_type_unstructured/include/gridtools/fn/unstructured.hpp#L88 and https://github.com/iomaganaris/gridtools/blob/fix/m_index_type_unstructured/include/gridtools/sid/concept.hpp#L567
- [x] torus 128, K 65, `int32`, 10 dry runs, 20 repetitions
    - [x] Structured implementation
        - [x] Naive implementation: `0.0009215999841690063`
        - [x] K-blocking: `0.0006446560025215149`
            - [x] `~40% improvement`
    - [x] Structured implementation
        - [x] K-blocking: `0.0006077599823474884`
            - [x] `~6%` improvement
- [x] Latest results (`daca087`)
  ![k_sweep_acc_torus_gpu](https://hackmd.io/_uploads/SJVoKPQNA.png)
  ![runtimes_torus_128_65](https://hackmd.io/_uploads/rkTiFD740.png)
  ![runtimes_torus_64_65](https://hackmd.io/_uploads/r1ZhYv7NR.png)
- [ ] Why does `K 65` have lower acceleration?
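The effect of the `const variable` fix above (hoisting the neighbor entry instead of re-reading the table per field access) can be illustrated with a toy load counter. The class and the counts are purely illustrative; the real fix lives in the gridtools commit linked above:

```python
class CountingTable:
    """Wraps a neighbor table and counts every read of a table row."""
    def __init__(self, data):
        self.data, self.loads = data, 0
    def __getitem__(self, idx):
        self.loads += 1
        return self.data[idx]

e2c2v = CountingTable([[3, 7, 1, 5]])  # one edge, 4 neighbor vertices
u_vert = [0.0] * 8
v_vert = [0.0] * 8

# Before: each field access re-reads the neighbor entry from the table.
_ = [u_vert[e2c2v[0][n]] + v_vert[e2c2v[0][n]] for n in range(4)]
loads_before = e2c2v.loads  # two table reads per neighbor -> 8

e2c2v.loads = 0
# After: hoist the entry into a local once, reuse it for every field.
for n in range(4):
    nbr = e2c2v[0][n]
    _ = u_vert[nbr] + v_vert[nbr]
loads_after = e2c2v.loads   # one table read per neighbor -> 4
```

With two fields per neighbor the table traffic halves, consistent with the ~10% improvement measured once the duplicated `ld.global` instructions disappeared.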
- [x] `gtfn` optimizations (torus 128, K 65, `int32`, 10 dry runs, 20 repetitions, median runtime) - [x] `gridtools::sid::rename_numbered_dimensions<generated::Vertex_t, generated::K_t>(gridtools::sid::shift_sid_origin(gridtools::nanobind::as_sid(u_vert.first), u_vert.second))`: `0.0013217120170593262` (`43%` slower than handwritten CUDA) [Without if: `0.0013119199872016908` (`42%` slower than handwritten CUDA)] - [x] `gridtools::sid::rename_numbered_dimensions<generated::Vertex_t, generated::K_t>(gridtools::sid::shift_sid_origin(gridtools::nanobind::as_sid(u_vert.first, gridtools::nanobind::stride_spec<1, nanobind::any>{}), u_vert.second))`: `0.001023967981338501` (`29%` improvement, `11%` slower than handwritten CUDA) [Without if: `0.0010360640287399292` (`26%` improvement, `12%` slower than handwritten CUDA)] - [x] `gridtools::sid::rename_numbered_dimensions<generated::Vertex_t, generated::K_t>((gridtools::nanobind::as_sid(u_vert.first, gridtools::nanobind::stride_spec<1, nanobind::any>{})))`: [Without if: `0.0010396479964256285` (same as above)] - [x] `gridtools::sid::rename_numbered_dimensions<generated::Vertex_t, generated::K_t>(gridtools::sid::shift_sid_origin(gridtools::nanobind::as_sid(u_vert.first, gridtools::nanobind::stride_spec<1, <actual_stride>{}), u_vert.second))`: `0.0009746719896793366` (`35%` improvement, `5%` slower than handwritten CUDA) [Without if: `0.0009715040028095245` (`6.6%` improvement, `5%` slower than handwritten CUDA)] - [ ] `e2c2v load instructions gtfn` ``` ld.param.u32 %r1, [_ZN9gridtools2fn7backend9gpu_impl_6kernel...__param_1]; ld.param.u64 %rd13, [_ZN9gridtools2fn7backend9gpu_impl_6kernel...__param_3]; ld.param.u64 %rd14, [_ZN9gridtools2fn7backend9gpu_impl_6kernel...__param_3+8]; cvta.to.global.u64 %rd17, %rd13; # e2c2v mov.u32 %r5, %ctaid.x; shl.b32 %r6, %r5, 5; mov.u32 %r7, %tid.x; add.s32 %r8, %r6, %r7; # edge_index add.s32 %r16, %r1, %r8; # r16 = edge_index + r1 cvt.s64.s32 %rd30, %r16;# rd30 = edge_index + r1 add.s64 
add.s64 %rd40, %rd14, %rd30;
add.s64 %rd41, %rd40, %rd14;     # rd41 = rd40 + rd14 = rd14 + rd30 + rd14 = 2 * rd14 + edge_index + r1
shl.b64 %rd42, %rd41, 2;         # rd42 = 4 * rd41 = 4 * (2 * rd14 + edge_index + r1)
add.s64 %rd43, %rd17, %rd42;     # rd43 = rd17 + rd42 = rd17 + 4 * (2 * rd14 + edge_index + r1) = e2c2v + 8 * rd14 + 4 * (edge_index + r1)
ld.global.u32 %r18, [%rd43];
```
- [ ] `e2c2v load instructions handwritten CUDA`
  ```
  ld.param.u64 %rd1, [_Z7run_gpuiiN9gridtools7storage9gpu_impl_..._param_2];
  mov.u32 %r36, %ntid.x;
  mov.u32 %r37, %ctaid.x;
  mov.u32 %r38, %tid.x;
  mad.lo.s32 %r39, %r37, %r36, %r38;   # edge_index
  cvta.to.global.u64 %rd11, %rd1;      # e2c2v
  mul.wide.u32 %rd21, %r39, 4;         # edge_index * 4
  add.s64 %rd22, %rd11, %rd21;         # e2c2v + edge_index * 4
  ld.global.u32 %r48, [%rd22];         # *(e2c2v + edge_index * 4)
  ```
- [ ] `z_nabla2_e load instructions gtfn`
  ```
  ld.param.u64 %rd6, [_ZN9gridtools2fn7backend9gpu_impl_6kernel...__param_1+48];
  ld.param.u64 %rd12, [_ZN9gridtools2fn7backend9gpu_impl_6kernel...__param_2+32];
  ld.param.u32 %r1, [_ZN9gridtools2fn7backend9gpu_impl_6kernel...__param_1];
  mov.u32 %r5, %ctaid.x;
  shl.b32 %r6, %r5, 5;
  mov.u32 %r7, %tid.x;
  add.s32 %r8, %r6, %r7;               # edge_index
  mov.u32 %r9, %ctaid.y;
  shl.b32 %r10, %r9, 3;
  mov.u32 %r11, %tid.y;
  add.s32 %r2, %r10, %r11;             # k
  cvta.to.global.u64 %rd21, %rd6;
  add.s32 %r16, %r1, %r8;              # r16 = edge_index + r1
  cvt.s64.s32 %rd22, %r8;
  cvt.s64.s32 %rd23, %r2;
  mul.lo.s64 %rd28, %rd12, %rd23;      # rd28 = rd12 * rd23 = rd12 * k
  neg.s32 %r17, %r16;                  # r17 = -r16 = -(edge_index + r1)
  cvt.s64.s32 %rd29, %r17;             # rd29 = -(edge_index + r1)
  cvt.s64.s32 %rd30, %r16;             # rd30 = edge_index + r1
  add.s64 %rd31, %rd30, %rd22;         # rd31 = rd30 + rd22 = edge_index + r1 + edge_index
  add.s64 %rd32, %rd31, %rd29;         # rd32 = rd31 + rd29 = edge_index + r1 + edge_index - (edge_index + r1) = edge_index
  add.s64 %rd33, %rd32, %rd28;         # rd33 = rd32 + rd28 = edge_index + rd12 * k
  shl.b64 %rd34, %rd33, 3;             # rd34 = 8 * rd33 = 8 * (edge_index + rd12 * k)
  add.s64 %rd35, %rd21, %rd34;         # rd35 = rd21 + rd34 = rd6 + 8 * (edge_index + rd12 * k)
  ld.global.f64 %fd1, [%rd35];         # z_nabla2_e[]
  ```
- [ ] `z_nabla2_e load instructions handwritten cuda`
  ```
  ld.param.u32 %r16, [_Z7run_gpuiiN9gridtools7storage9gpu_impl..._param_8+16];
  ld.param.u64 %rd7, [_Z7run_gpuiiN9gridtools7storage9gpu_impl_..._param_8];
  mov.u32 %r36, %ntid.x;
  mov.u32 %r37, %ctaid.x;
  mov.u32 %r38, %tid.x;
  mad.lo.s32 %r39, %r37, %r36, %r38;   # edge_index
  mov.u32 %r40, %ntid.y;
  mov.u32 %r41, %ctaid.y;
  mov.u32 %r42, %tid.y;
  mad.lo.s32 %r1, %r41, %r40, %r42;    # k
  cvta.to.global.u64 %rd19, %rd7;
  mad.lo.s32 %r71, %r16, %r1, %r39;    # r71 = r16 * k + edge_index
  mul.wide.u32 %rd64, %r71, 8;
  add.s64 %rd65, %rd19, %rd64;         # rd65 = rd19 + rd64 = rd7 + 8 * r71 = rd7 + 8 * (r16 * k + edge_index)
  ld.global.f64 %fd25, [%rd65];
  ```
- [ ] `gtfn` execution
  - My theory is that the remaining ~10% slowdown compared to the handwritten CUDA version is caused by the shifting of the field pointers.
  - In the GPU implementation we first call [sid::multi_shift(ptr, strides, thread_idx);](https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/backend/gpu.hpp#L102), which shifts all the `ptr`s by an offset derived from the global thread index.
  - Then [make_iterator()](https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/unstructured.hpp#L125) is called, which returns the input field pointers to their original starting address.
  - Then the kernel is launched, and the input fields accessed inside the kernel are shifted again, either by their index in [deref()](https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/unstructured.hpp#L78) or based on their neighbor table using the horizontal shift in [deref(shift())](https://github.com/GridTools/gridtools/blob/master/include/gridtools/fn/unstructured.hpp#L85).
  - One idea for improvement is to avoid calling `multi_shift` on the `Ins` fields to shift their pointers.
  Only their `m_index` needs to be set up; the pointer shift is then done in `deref` or in the necessary neighbor `shift`.

## Comparisons between (opt) `gtfn`, `cuda_kblock` and `cuda_naive`

- Commit: `3ad077a`
- Optimizations of `gtfn`: `gridtools` fix in `horizontal_shift`, `stride 1` in `sid` in python bindings
- Index type: `int32`

![runtimes_torus_128_65](https://hackmd.io/_uploads/H1x4tLJB0.png)
![runtimes_torus_64_65](https://hackmd.io/_uploads/rkUEYIkHR.png)
![k_sweep_acc_torus_gpu](https://hackmd.io/_uploads/rk7QYI1BC.png)
![k_sweep_acc_torus_gpu_naive](https://hackmd.io/_uploads/B1mV5IyrR.png)

- [x] Validate interpolation kernel for `unstructured`
  - [x] Create unit tests based on the python implementation and gtfn/gt4py data
- [x] Write GPU version
- [x] Permute `v2e` based on the new edges permutation
  - [x] Find the `e2e` permutation and apply it to `v2e`, or calculate `v2e` from the vertex indexes --> needed for comparing the structured and unstructured versions
- [x] Need to think about how the 2 separate kernels are going to be connected and which `v2e` will be used, since `halo = 1` is needed in the 2nd computation --> ask how halos are handled in that case
- [x] Write structured implementation
- [ ] ~~Write nabla2 implementation~~
  - [ ] ~~Unstructured~~
  - [ ] ~~Structured~~
- [ ] Try out `x_dim` and `y_dim` as `constexpr` since they are going to be JITted
  - [ ] Same for the number of edges, etc.
- [ ] Based on discussion with Christoph
  - [ ] `nvprof --metrics inst_integer,inst_fp_32,inst_fp_64,ipc,achieved_occupancy,dram_read_transactions,dram_write_transactions,dram_read_throughput,dram_write_throughput,stall_sync` useful metrics for analysis (used in the Open Earth Compiler paper)
  - [ ] `k-blocking` should be done so that one thread computes consecutive `k` levels, to reuse memory accesses when we have access patterns like:
    ```
    smth = f(k) - f(k-1)
    smth = f(k+1) - f(k)
    ```
- [x] Try `__builtin_assume_aligned` --> no difference on Daint (CUDA 11.8) and ault (CUDA 12.2)
  - [x] Tried it by adding the following to the `unstructured naive` GPU kernel:
    ```
    (__builtin_assume_aligned(e2c2v_gt_tv.data(), 128));
    (__builtin_assume_aligned(e2ecv_gt_tv.data(), 128));
    (__builtin_assume_aligned(u_vert_gt_tv.data(), 128));
    (__builtin_assume_aligned(v_vert_gt_tv.data(), 128));
    (__builtin_assume_aligned(primal_normal_vert_v1_gt_tv.data(), 128));
    (__builtin_assume_aligned(primal_normal_vert_v2_gt_tv.data(), 128));
    (__builtin_assume_aligned(z_nabla2_e_gt_tv.data(), 128));
    (__builtin_assume_aligned(inv_vert_vert_length_gt_tv.data(), 128));
    (__builtin_assume_aligned(inv_primal_edge_length_gt_tv.data(), 128));
    (__builtin_assume_aligned(z_nabla4_e2_wp_gt_tv.data(), 128));
    ```
- [x] Add at compile time the ability to `__builtin_assume(can_deref(it));` in GTFN
  - [x] Taken care of by Hannes in https://github.com/GridTools/gridtools/pull/1785
- [x] The `nabla2` kernel implementation is exactly the same as `nabla4` --> no reason for an extra implementation
- [x] `gtfn` improvements with [loop-blocking](https://github.com/GridTools/gridtools/pull/1787) and the [GT_ASSUME fix](https://github.com/GridTools/gridtools/pull/1789)
  ```
  nabla4_benchmark_unstructured_gtfn_gpu (k loop: 10) median runtime: 0.000939983993768692 (10% faster than naive, 45% slower than the CUDA version)
  nabla4_benchmark_unstructured_gpu_gridtools median runtime: 0.0006452800035476685
  nabla4_benchmark_structured_torus_gpu_gridtools_halo median runtime: 0.0006144319772720337
  nabla4_benchmark_unstructured_gtfn_gpu_naive median runtime: 0.0010604159832000733
  nabla4_benchmark_unstructured_gpu_gridtools_naive median runtime: 0.001010591983795166
  nabla4_benchmark_structured_torus_gpu_gridtools_halo_naive median runtime: 0.0009190239906311036
  ```
- [x] The `gpu_naive/kloop` version of the `interpolate` kernel didn't generate the correct PTX with `p_u_out() += ...`: for every addition the compiler read and stored `p_u_out`.
  An improvement to this came from calculating the sum in a register and then storing it to `p_u_out` once at the end.

#### Results

`gridtools`: `27ec7f4`
`icon_structured_benchmark`: `695c93a`

##### nabla4

![k_sweep_acc_torus_gpu_naive](https://hackmd.io/_uploads/Ski5T2GvA.png)
![k_sweep_acc_torus_gpu](https://hackmd.io/_uploads/SJo5a3GvC.png)
![runtimes_torus_64_80](https://hackmd.io/_uploads/BkocT2MwA.png)
![runtimes_torus_128_80](https://hackmd.io/_uploads/BJicT3MwC.png)

##### MoIntpRbfRbfVecInterpol

`icon_structured_benchmark`: `3aa7b37`

![runtimes_torus_128_80](https://hackmd.io/_uploads/HJdxbom_0.png)
![runtimes_torus_64_80](https://hackmd.io/_uploads/rkFeZsXOR.png)
![k_sweep_acc_torus_gpu_naive](https://hackmd.io/_uploads/SkuxWs7dC.png)
![k_sweep_acc_torus_gpu_kloop](https://hackmd.io/_uploads/BJuxbsQ_A.png)