# [GT4Py] Research: ICON structured grid strided access standalone benchmark

<!-- Add the tag for the current cycle number in the top bar -->

- Shaped by:
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->

## Problem

ICON currently uses indirect accesses to reach the neighbors of cells/vertices/edges. This is a problem because it adds extra memory accesses and prevents optimizations of the memory access pattern, both by hand and by the compiler. The ICON grid, however, is largely structured. For the structured parts of ICON we can move from indirect accesses to accesses based on offsets/strides. This enables memory access pattern optimizations that could lead to better performance across ICON.

![image](https://hackmd.io/_uploads/B1HiLLG36.png)

The parts of the ICON grid that are unstructured and still need indirect accesses via a LUT are around the triangles connected to the vertices that have only 5 neighbors (inside the pentagons).

To evaluate this optimization it would be useful to create a standalone benchmark (probably in C++) that applies a small kernel to a fully structured torus grid using strided accesses and compares it with the same grid using indirect accesses.

## Appetite

This is more of a research project, so spending the whole cycle might be necessary to create a small standalone C++ benchmark of the indirect access and offset patterns and look into the performance benefits.

## Solution

The idea for this cycle is to create a standalone benchmark that evaluates the performance of a structured grid with strided and indirect accesses, using a simple kernel. To do this, the following steps can be followed (a minimal sketch of the two access patterns follows the list):

- Read [André Rösti's bachelor thesis](https://andreroesti.com/attachments/bachelor-thesis.pdf) to get some background from previous work (see: https://hackmd.io/7ytBqWOnSg6r0ykzNz-4Cw)
- Take netCDF files from the Fortran script (torus grid) or take the torus netCDF files generated by `David Strassmann` (fully structured torus grid)
- Read them with the GridManager, which populates the fields needed for the ICON grid execution (see: [icon4py GridManager](https://github.com/C2SM/icon4py/blob/main/model/common/src/icon4py/model/common/grid/grid_manager.py))
- Create a small C++ application that instantiates the same data structures as the ICON grid from the `icon4py GridManager`
- Using this grid data structure, implement in C++ the [nabla4 kernel](https://github.com/C2SM/icon4py/blob/a32640c8bacc65bbafe41ec9a68b772c7a3c1853/model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/calculate_nabla4.py#L23), which is simple enough and makes use of only E2C2V neighbors
- Switch to a structured strided grid and implement the kernel on top of that, without indirect accesses
- Benchmark the two versions for various grid sizes
- Check for any further improvements in the data access patterns and ids
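To make the comparison concrete, here is a minimal sketch of the two access patterns the benchmark contrasts. The function names, the 4-neighbor pattern, and the offsets are illustrative only, not the actual ICON `E2C2V` connectivity:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Indirect access: the neighbor ids of each edge come from a lookup table
// (one gathered load per neighbor, on top of the field loads themselves).
double sum_neighbors_indirect(const std::vector<std::array<int, 4>>& e2c2v,
                              const std::vector<double>& vertex_field,
                              std::size_t edge) {
    double acc = 0.0;
    for (int v : e2c2v[edge]) acc += vertex_field[v];
    return acc;
}

// Strided access: on a fully structured torus the neighbor ids can instead be
// computed from the edge's (i, j) position and the row stride, so no LUT is
// read at all (wraparound at the torus boundary omitted for brevity).
double sum_neighbors_strided(const std::vector<double>& vertex_field,
                             std::size_t i, std::size_t j, std::size_t nx) {
    const std::size_t base = j * nx + i;  // row-major vertex index
    return vertex_field[base] + vertex_field[base + 1] +
           vertex_field[base + nx] + vertex_field[base + nx + 1];
}
```

The interesting question is whether computing `base` and the strided offsets is cheaper than the gathered LUT loads once the connectivity tables no longer fit in cache.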
## Rabbit holes

<!-- Details about the solution worth calling out to avoid problems -->

## No-gos

<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the appetite or make the problem tractable -->

## Progress

<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

<!--
- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
    - [x] Subtask A
    - [x] Subtask X
-->

- [x] Read André Rösti's bachelor thesis
    - [ ] Reproduce any interesting results on Santis (optional, to validate performance with the new benchmarks)
    - Takeaways from the bachelor thesis:
        - The thesis focuses on 3 kernels: Laplace-of-Laplace, Horizontal Diffusion and Fastwaves (characteristics in Table 6.1)
        - There are different implementation variants along 2 axes:
            - Storage strategies (chasing/non-chasing, compressed/uncompressed)
            - Grid access strategies (naive, idxvar, shared, z-loop, z-loop sliced)
        - The overhead of indirect accesses is more prominent in kernels that are not computationally heavy
        - For smaller domain sizes the overhead of indirect accesses might be smaller because the LUT resides in cache (this also depends on the implementation of the access strategy) (section )
        - Depending on the access strategy there are different overheads as the Z domain increases. Z is special because it has regular structure: we can think of Z as a pillar where all the elements in the pillar need the same offsets from the LUT for their X and Y neighbors, and traversing the Z pillar itself uses stable offsets (`domain_size_x*domain_size_y`)
        - Grid and block size play a role in the performance of each access strategy
        - A z-curve organization of the regular grid (in contrast to row-major) resembles random irregular grids more closely
        - Non-chasing storage adds more memory requirements and invalidates caches faster
        - Compression helps with large domains and not very compute-intensive kernels, especially with non-chasing storage, since the neighbors' storage stays in cache for longer
        - Sometimes more complex strategies (compression, warp broadcasting, etc.) might not be worth it because of their added overheads
        - For a (large) 512x512x64 grid the slowdowns of the different kernels compared to the structured grid are:
            - laplap: 45% (row-maj), 50% (z-curve)
            - hordif: 25% (row-maj), 30% (z-curve)
            - fastwave: 4.4% (row-maj), 5% (z-curve)
- [x] Check what's needed for running the GridManager and getting the ICON/torus grid
    - [x] Read the GridManager code
        - [x] Useful [code](https://github.com/C2SM/icon4py/blob/main/model/common/src/icon4py/model/common/grid/grid_manager.py#L82) for figuring out the sizes of the connectivity matrices in the ICON grid
    - [x] Create a simple test script
- [x] Read and try to understand the ICON grid data structure vectors
    - [x] Try to draw some stuff on paper
    - [ ] Design C++ data structures similar to the IconGrid mapping
        - [ ] Alignment :thinking_face:
- [x] Figure out how to move the data from Python to C++
    - [x] Exporting a C++ function to Python might be easier
        - [x] Pass only the `E2C2V` matrix and the grid dims needed for `nabla4` (the rest of the `nabla4` inputs can be random)
    - [x] Done with `nanobind` (a binding sketch follows below)
- [x] Get some torus grids from Leonidas or David Strassman
    - [x] Contact Magdalena for that
    - [x] Get read permissions for the files
    - [x] Get the GridManager running with these inputs (start with ~~Leonidas' script~~ Strassman's torus files)
        - [x] Questions about the torus files:
            - [x] Why are there `-1`s?
            - [x] Shouldn't the elements in `E2C2V` be at most `VertexDim - 1`?
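- As referenced above, the data hand-off was done with `nanobind`. A minimal sketch of what such a binding can look like, assuming a hypothetical module name `icon_bench_ext` and a toy checksum entry point instead of the real kernel:

    ```cpp
    #include <nanobind/nanobind.h>
    #include <nanobind/ndarray.h>

    #include <cstddef>
    #include <cstdint>

    namespace nb = nanobind;

    // Read the E2C2V connectivity handed over from Python (GridManager) as a
    // zero-copy, C-contiguous CPU view and do something trivial with it.
    int64_t checksum_e2c2v(
        nb::ndarray<const int32_t, nb::ndim<2>, nb::c_contig, nb::device::cpu> e2c2v) {
        int64_t sum = 0;
        for (std::size_t e = 0; e < e2c2v.shape(0); ++e)
            for (std::size_t v = 0; v < e2c2v.shape(1); ++v)  // 4 vertices per edge
                sum += e2c2v(e, v);
        return sum;
    }

    NB_MODULE(icon_bench_ext, m) {
        m.def("checksum_e2c2v", &checksum_e2c2v);
    }
    ```

    From Python, the connectivity produced by the GridManager can then be passed directly, e.g. `icon_bench_ext.checksum_e2c2v(np.ascontiguousarray(e2c2v, dtype=np.int32))`.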
- [x] `_construct_diamond_edges` doesn't work
    - [x] It seems the input used for the logs below was problematic (`torus_100000_100000_100000.nc`, a very small grid file)

```
Dimension(value='E2C2V', kind=<DimensionKind.LOCAL: 'local'>): array([[ 1,  3, -1, -1],
       [ 1,  2,  3,  4],
       [ 3,  2,  4,  4],
       [ 2,  4,  3,  3],
       [ 2,  3,  4,  1],
       [ 4,  3,  1,  1],
       [ 3,  1,  4,  4],
       [ 3,  4,  1,  2],
       [ 1,  4,  2,  2],
       [ 4,  2,  1,  1],
       [ 4,  1,  2,  3],
       [ 2,  1, -1, -1]])
```

but

```
CellsDim: 8
VertexDim: 4
EdgeDim: 12
KDim: 65
E2C2VDim: 4
```

- [x] Inspect data structures
- [x] How to generate the data needed for the `nabla4` kernel?
    - [x] For benchmarking: generate random input data (a meaningful value range needs to be defined)
    - [x] For verification purposes (good to have sooner rather than later): need some data generated/validated through `icon4py`
        - [x] Is there some way to run `nabla4` on its own with `icon4py` and one of the existing grids?
        - [x] Would be good to run more than one test, or have an automatic way to generate the inputs/outputs for the tests --> check https://github.com/iomaganaris/icon4py/tree/validation/nabla4_serialbox
        - [x] Need to define specific optimization flags for generating the reference outputs (probably a debug build) --> for the naive kernel there was no difference between debug and `-Ofast` builds
        - [x] Bitwise comparison might be too strict (especially on GPU); probably better to compare with `allclose` and meaningful `atol`/`rtol` --> `rtol` needed to be 1e-4 instead of the default 1e-5 for the tests to pass
    - [x] Get the necessary input/output data from https://github.com/C2SM/icon4py/blob/59e757ff32bb9f2e2e64e3811121b97a8a28122b/model/atmosphere/diffusion/tests/diffusion_stencil_tests/test_calculate_nabla4.py#L72
    - [x] Check https://github.com/GridTools/gridtools_verification --> `serialbox` C++ API
- [x] Created a [repository](https://github.com/iomaganaris/icon-structured) with initial tests
- [ ] Visualization
    - [ ] Try [icon-vis](https://github.com/C2SM/icon-vis?tab=readme-ov-file#plotting-gribnetcdf-icon-data) and [iconarray](https://github.com/C2SM/iconarray) for visualizing data
    - [ ] Try `ncview`
    - [ ] https://hackmd.io/0x-YtL00Qq6G97C1oVaWkw#Code-sample-for-loading-ICON-NetCDF-files-with-UXarray --> Could only visualize the edges, but I still think something is wrong: there are too few edges shown (`torus_100000_100000_32768.nc`): ![Screenshot 2024-03-06 at 15.35.20](https://hackmd.io/_uploads/SylU1W8pp.png)
- [ ] Implement the unstructured version of the kernel on the above benchmarking framework (a simplified access-pattern sketch follows below)
    - [x] Custom `ifirst` or `kfirst` implementation?
    - [ ] GridTools C++ implementation?
    - [ ] Test different CPU implementations
    - [x] Find a way to run the benchmarks until the 95% CI is within 5% of the median (or up to a max number of runs) --> not necessary, since benchmarks with 101 repetitions already satisfy this condition
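- For reference, a simplified stand-in for the unstructured kernel with the same memory access pattern as `nabla4` (one value per edge and k level, gathered from the four `E2C2V` vertices); the real stencil applies physical coefficients that are omitted here, and all names are hypothetical:

    ```cpp
    #include <array>
    #include <cstddef>
    #include <vector>

    // For every edge and k level, gather the four E2C2V vertex values and
    // combine them; this mimics only the memory traffic of nabla4.
    void pseudo_nabla4_unstructured(const std::vector<std::array<int, 4>>& e2c2v,
                                    const std::vector<double>& u_vert,  // [vertex * nk + k]
                                    std::vector<double>& out,           // [edge * nk + k]
                                    std::size_t nk) {
        for (std::size_t e = 0; e < e2c2v.size(); ++e) {
            const auto& nbh = e2c2v[e];  // one LUT read, reused for all k levels
            for (std::size_t k = 0; k < nk; ++k)
                out[e * nk + k] = u_vert[nbh[0] * nk + k] - u_vert[nbh[1] * nk + k] +
                                  u_vert[nbh[2] * nk + k] - u_vert[nbh[3] * nk + k];
        }
    }
    ```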
- [ ] Implement structured grid
    - [x] [IIP (idea in progress)] Organize cells based on horizontal/longitudinal coordinates, starting from the top of the torus and moving down horizontally level by level, i.e. the first level has 6 cells looking north, the second level has the same number of cells looking south, etc. (need to validate whether that's the case in all torus files - would be good to be able to easily visualize them)
    - [x] Figure out a way to arrange cells/edges/vertices in memory based on the `SimpleGrid` for the `e2c2v` connectivity
        - [x] 3 edges per vertex: east, northeast and north
        - [x] Figure out what to do with `e2ecv` (probably I just need to sort the vectors it addresses based on the `e2c2v` indexes)
        - [x] Check whether the same logic applies to the torus files from David --> no, David's torus grids have a different organization
    - [ ] Check compilation flags and vectorization in the current version (https://godbolt.org/z/nGfh13h6P)
        - [x] GCC Clariden: `-Ofast -march=skylake -fopenmp -fopt-info-vec-missed -mtune=skylake -fvect-cost-model=unlimited -mavx2 -ffast-math -ftree-vectorize -fopenmp -fPIC` (faster than Intel :thinking_face: on Clariden)
        - [x] GCC Santis: `-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited -fPIC -fopenmp-simd`
        - [x] Intel: `-g -O3 -qopt-report=3 -qopenmp -march=skylake -mtune=skylake -std=c++20 -fPIC`
        - [x] To get vectorized code, try splitting the inner for loop into 3 different ones depending on the orientation of the edge. The edges also need to be sorted by orientation so that each loop accesses consecutive memory
        - [x] On Santis, typecasting probably prevents vectorization:

```
/bret/scratch/cscs/ioannmag/icon-structured/include/nabla4_structured.hpp:195:85: missed: couldn't vectorize loop
/bret/scratch/cscs/ioannmag/icon-structured/include/nabla4_structured.hpp:166:31: missed: not vectorized: relevant stmt not supported: _329 = (double) _328;
```

- [ ] Run the torus grids by David (they need additional work because they have a different order of edges)
    - [x] Apply a permutation to get a certain order of edges (north, east, southeast together)
        - [x] Figure out where this permutation needs to be applied: only `e2v`, or `c2v` and `e2c` as well? --> changing just `e2v` is not correct; the others need to be permuted as well
    - [x] Based on this organization, create new `nabla4_benchmark` classes that implement `get_e2c2v_vertices_north_edge`, `get_e2c2v_vertices_east_edge` and `get_e2c2v_vertices_southeast_edge` (an illustrative sketch follows below)
    - [x] Create validation tests (pytest) based on the above
    - [x] Create sanity checks that compare the output of the unstructured and the structured implementation for any torus grid
    - [x] Run benchmarks for different sizes / torus files
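- An illustrative sketch of how one of the per-orientation accessors can avoid the LUT on the structured torus. The function name mirrors the ones listed above, but the signature and the concrete offsets are made up; the real offsets depend on the chosen cell/edge/vertex ordering:

    ```cpp
    #include <array>
    #include <cstddef>

    // Hypothetical strided e2c2v for one edge orientation: vertices are assumed
    // row-major (v = j * nx + i) and wraparound at the periodic torus boundaries
    // is handled with modulo (cf. the "with/without modulo" experiment below).
    std::array<std::size_t, 4> get_e2c2v_vertices_east_edge(
        std::size_t i, std::size_t j, std::size_t nx, std::size_t ny) {
        const auto v = [nx, ny](std::size_t ii, std::size_t jj) {
            return (jj % ny) * nx + (ii % nx);
        };
        // the edge's two endpoint vertices plus the far vertices of the two
        // adjacent triangles (the "diamond"); offsets are illustrative only
        return {v(i, j), v(i + 1, j), v(i, j + 1), v(i + 1, j + ny - 1)};
    }
    ```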
- [x] Check the organization of edges. Sorting the `e2c2v` based on the orientation of the edges (`|` first, then `-` and then `\` edges) can probably help with vectorization and consecutive memory accesses --> 7-15% improvement --> that turned out not to be the real reason. It's better to organize the edges such that `e2c2v` rows with similar vertex ids sit together. By sorting them on the first index we get at least 3 indexes that are the same between `e2c2v[0]`, `e2c2v[1]` and `e2c2v[2]`, e.g. `E2C2V[0]: [0 6 23 1]`, `E2C2V[1]: [0 1 6 25]` and `E2C2V[2]: [0 25 1 24]` (a sketch of such a permutation follows after the plots below)
- [x] Remove casts to check whether it helps with vectorization on GH200 --> it did help with vectorization of the calculations that follow the `e2c2v` computation
- [x] Optimize the `cpu_kfirst` reads/calculations
- [ ] Try to vectorize the `e2c2v` calculations on GH200
- [ ] Benchmark
    - [x] Run with/without modulo --> same performance
    - [x] Run with/without edge permutation --> sorting the vertex indices based on the first index gives better cache locality and performance
    - [x] Try running the same application in multiple processes on a single CPU to cause more cache invalidation, similar to a normal run where multiple processes share the same L3 cache
- [ ] Check https://github.com/dawn-ico/grid-experiments/tree/main --> probably useful, because in the end we need to define an ordering of the edges, cells and vertices such that the offsets of the different neighbors can be derived deterministically
    - [ ] Check the existing ICON experiment visualization and organization
    - [ ] See if it's worth editing it so that the same ideas/script can be used for the torus files
- [ ] Plots
    - [x] Plot a sweep over k levels for the same torus (small to large)
        - [x] I expect that with multiple repetitions and k levels the acceleration will be more evident, because the `E2C2V` vector will be pushed out of the cache --> partly true. In `ifirst`, where we load the `e2c2v` multiple times and both the `e2c2v` and the other vectors get evicted from the cache due to their large size, we do see this behavior: the acceleration increases as the k dimension or the size of the torus increases. For small matrix sizes and `ifirst`, where everything more or less fits in the cache, we get worse performance (based on Santis) ![k_sweep_acc_torus_cpu_ifirst](https://hackmd.io/_uploads/ryfl8ixyA.png)
        - [x] Daint shows similar behavior. Only for `k = 1` is the structured performance much better, thanks to the better vectorization it achieves there. Computing the indexes costs much less than reading them from memory, but still more than reading them from the cache ![k_sweep_acc_torus_cpu_kfirst](https://hackmd.io/_uploads/HJM1UieyA.png)
        - [x] For a smaller number of edges the structured grid runtime gets closer to the unstructured one, because the computational intensity shifts towards the actual kernel computation instead of the offset calculations
        - [x] `cpu_ifirst` shows better acceleration for the structured grid because for every `k` we need to calculate or read the `e2c2v` from memory again. As `k` or the number of edges grows, the memory footprint of the data structures grows as well, so the `e2c2v` has to be loaded from main memory each time (for every `k`) because it has probably been evicted from the cache. The cost of calculating the offsets is smaller than the cost of reading the neighbors from RAM, but larger than reading them from the cache
            - [ ] Would be good to validate this with some profiling
        - [x] `cpu_kfirst` for larger arrays doesn't show performance that good, because for each k level we re-read horizontal data that already fits in the cache. This means the `e2c2v` is probably not evicted from the cache, and reading it thus costs less than calculating the offsets. This is supported by the fact that for `k = 1` and `3663792 edges` the structured grid performs better than the unstructured one, while for `k > 1` it performs a bit worse
            - [ ] Would be good to validate this with some profiling
    - [x] Plot runtimes with only 1 process/CPU and 1 CPU running the application ![runtimes_torus_128_65](https://hackmd.io/_uploads/HyvjrieJA.png) ![runtimes_torus_64_65](https://hackmd.io/_uploads/SkTjSjeJR.png)
    - [x] Plot runtimes with the same application running on every core of the CPU, which further stresses the cache (similar to a real-life scenario) --> large variations ![runtimes_torus_128_65_multiproc](https://hackmd.io/_uploads/Hk4qSixJA.png)
    - [x] Plot the number of edges on the X axis instead of the torus size
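- A sketch of the edge-reordering experiment referenced above: sort the `e2c2v` rows by their first vertex id and return the permutation, so that all other edge-indexed tables (`e2v`, `e2c`, `e2ecv`, the edge fields, ...) can be permuted consistently. Hypothetical helper, not the benchmark code:

    ```cpp
    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Reorder edges so that e2c2v rows with similar vertex ids are adjacent.
    std::vector<std::size_t> sort_edges_by_first_vertex(
        std::vector<std::array<int, 4>>& e2c2v) {
        std::vector<std::size_t> perm(e2c2v.size());
        std::iota(perm.begin(), perm.end(), std::size_t{0});
        std::stable_sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b) {
            return e2c2v[a][0] < e2c2v[b][0];
        });
        std::vector<std::array<int, 4>> sorted(e2c2v.size());
        for (std::size_t e = 0; e < e2c2v.size(); ++e) sorted[e] = e2c2v[perm[e]];
        e2c2v = std::move(sorted);
        return perm;  // new edge e was old edge perm[e]; apply to every edge-indexed array
    }
    ```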
- [x] Based on the cartesian coordinates it's also useful to look at [IcoChainSize](https://github.com/iomaganaris/icon4py/blob/597fc914c0c7842fe1b6d7dde2351e7659220c73/tools/src/icon4pytools/icon4pygen/icochainsize.py#L195)
- [ ] Run `torus_64` with `k = 2` and compare cache misses against `k = 1` on Santis --> difficult to make sure the measurements are valid on GH200; needs more work
- [x] Check results from Piz Daint
    - [x] Same behavior as on Santis
- [ ] Add a `gtfn` baseline
    - [x] To check with Hannes; implement the gtfn backend in C++ by hand
    - [x] The `gtfn` backend is ~2x slower than the naive `cpu_ifirst` implementation
        - [x] Check the overhead of the `stencil_executor` construction --> none
- [x] By using `std::vector<std::array<size_t, 4>>` for the `e2c2v` and `e2ecv` vectors I got an almost 2x speed-up for the `unstructured` kernels
- [x] **In the end, for the multiprocess case we get a 70% speed-up for `cpu_ifirst` and a 40% speed-up for `cpu_kfirst`**
- [ ] Use `gridtools` storage
    - [x] Refactor the code a bit to make this easier to implement
    - [ ] Check whether it's possible to use `gt4py` to pass the data around in Python --> [gt4py cartesian](https://github.com/GridTools/gt4py/blob/591ea4e180b3bc0f6dc9c34a0da0d579ceceb183/src/gt4py/storage/cartesian/interface.py#L270)
    - [x] Make sure that there are no asserts when accessing the data
    - [x] Move the target views outside the loop
    - [x] Far fewer NEON SIMD instructions for the `gridtools storage` version --> the problem seems to be that `inner_kernel_ifirst` is not inlined (a possible workaround is sketched after the log below):

```
(venv_py311_icon4py) [santis][ioannmag@nid005608 icon-structured-2]$ PYTHONPATH=$(pwd)/build_9c229bd:$PYTHONPATH perf stat -a -e SVE_INST_SPEC,ASE_INST_SPEC,INST_SPEC -- python3.11 run_torus_grid_perf.py ../torus_100000_100000_8192.nc --klevels 65 --repetitions 1 --backend cpu_ifirst
CellsDim: 160
VertexDim: 80
EdgeDim: 240
KDim: 65
E2C2VDim: 4
Longitude dimension: 8
Latitude dimension: 10
nabla4_benchmark_structured_torus_cpu_ifirst median runtime: nan

 Performance counter stats for 'system wide':

         1,487,621      SVE_INST_SPEC
        75,200,198      ASE_INST_SPEC
    15,241,763,996      INST_SPEC

       1.884444157 seconds time elapsed

(venv_py311_icon4py) [santis][ioannmag@nid005608 icon-structured-2]$ PYTHONPATH=$(pwd)/build_2285e51:$PYTHONPATH perf stat -a -e SVE_INST_SPEC,ASE_INST_SPEC,INST_SPEC -- python3.11 run_torus_grid_perf.py ../torus_100000_100000_8192.nc --klevels 65 --repetitions 1 --backend cpu_ifirst
CellsDim: 160
VertexDim: 80
EdgeDim: 240
KDim: 65
E2C2VDim: 4
Longitude dimension: 8
Latitude dimension: 10
nabla4_benchmark_structured_torus_cpu_ifirst median runtime: nan

 Performance counter stats for 'system wide':

         1,490,708      SVE_INST_SPEC
        33,734,088      ASE_INST_SPEC
    14,980,523,614      INST_SPEC

       1.843682832 seconds time elapsed
```
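- If `inner_kernel_ifirst` really is the blocker, one way to attack it is to force-inline the inner function so the compiler can vectorize across the (former) call. A hedged sketch with a hypothetical kernel; `[[gnu::always_inline]]` is the GCC/Clang spelling, matching the compilers used above:

    ```cpp
    #include <cstddef>

    // Force-inline the per-element kernel (stand-in for inner_kernel_ifirst):
    // an opaque call in the loop body blocks vectorization of the loop.
    [[gnu::always_inline]] inline double inner_kernel(const double* __restrict in,
                                                      std::size_t idx) {
        return 0.25 * in[idx];
    }

    void outer_loop(const double* __restrict in, double* __restrict out, std::size_t n) {
    #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)
            out[i] = inner_kernel(in, i);  // no call left after inlining
    }
    ```

    Whether this actually helps needs checking against the `#pragma omp simd` comparison below, since the pragmas themselves were sometimes counterproductive.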
- [x] Run the backends without `#pragma omp simd` and compare performance --> the `cpu_ifirst` implementations for `unstructured` are faster without the pragmas on Santis as well
- [x] Run the gridtools storage implementations on Santis
- [x] Baseline of the torus implementation with periodic boundaries (commit `9ced41e`)
  ![k_sweep_acc_torus_cpu_ifirst](https://hackmd.io/_uploads/HJtJMTmlR.png)
  ![k_sweep_acc_torus_cpu_kfirst](https://hackmd.io/_uploads/H1CyGpQx0.png)
  ![runtimes_torus_128_65_compressed](https://hackmd.io/_uploads/Hk5NzTXxC.png)
  ![runtimes_torus_64_65](https://hackmd.io/_uploads/S1ZZMp7gR.png)
  ![runtimes_torus_128_65_multiproc_compressed](https://hackmd.io/_uploads/HyDcfamlR.png)

### Nice to have

- [ ] Reproduction of André Rösti's results on an H100
- [ ] 3D visualization of the ICON grid from `netCDF4` files

### Ideas for next cycle

- Figure out whether it's worth trying a structured cartesian grid without periodic boundary conditions (based on a discussion with Christoph)
    - A simple parallelogram with cartesian dimensions corresponding to the subdivision of the ICON triangles
    - Figure out what to do with halo exchanges
    - This will simplify the offset calculations and enable better vectorization on GH200
- Try implementing the kernels on GPU
- Try larger kernels (more neighbor accesses, more compute-intensive or memory-bound) to see how the behavior changes