# [Benchmarking] gtbench (Clariden - Santis)
## :question: Purpose
Compare the performance of gtbench, which uses GridTools under the hood, going from A100 to H100, and compare multinode performance between Clariden (AMD CPU + A100) and Santis (GH200: Grace CPU + H100)
## :bookmark_tabs: GPU specs
[A100](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf): DP 9.7 TFLOPS - SP 19.5 TFLOPS / Mem BW 2 TB/s
[H100](https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet): DP 34 TFLOPS - SP 67 TFLOPS / Mem BW 3.35 TB/s
[Grace Hopper tuning guide](https://docs.nvidia.com/grace-performance-tuning-guide.pdf)
[Hopper tuning guide (H100)](https://docs.nvidia.com/cuda/pdf/Hopper_Tuning_Guide.pdf)
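For reference, the raw datasheet ratios (just arithmetic on the numbers above; stencil codes such as gtbench are typically memory-bandwidth bound, so the bandwidth ratio is the most relevant one):
```
# Quick spec-ratio check, H100 vs A100 (datasheet numbers above)
awk 'BEGIN {
    printf "DP FLOPS ratio: %.2fx\n", 34 / 9.7
    printf "SP FLOPS ratio: %.2fx\n", 67 / 19.5
    printf "Mem BW ratio:   %.2fx\n", 3.35 / 2.0
}'
```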
## :computer: A100 [Clariden]
```
ssh -A clariden.cscs.ch
cd /iopsstor/scratch/cscs/ioannmag/gtbench
uenv start /iopsstor/scratch/cscs/bcumming/prgenv-gnu-a100.squashfs
uenv view default
sbatch A100_gtbench.sh # edit domain size
sbatch A100_gtbench_ncu.sh # run benchmark with NSight Compute
sbatch A100_gtbench_nsys.sh # run benchmark with NSight Systems
```
Note: `libcudart.so` is missing from the uenv and its location needs to be added to `LD_LIBRARY_PATH`
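A minimal workaround sketch (the exact library path inside the uenv is an assumption and may differ on Clariden):
```
# Sketch only: adjust the path to wherever libcudart.so actually lives in the uenv
export LD_LIBRARY_PATH=/user-environment/env/default/lib64:$LD_LIBRARY_PATH
# Verify that the benchmark binary now resolves the CUDA runtime
ldd ./install/bin/benchmark | grep libcudart
```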
### cpu_ifirst single_node
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=single_node -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -march=native -Ofast -fvect-cost-model=unlimited"
```
### cpu_kfirst single_node
```
cmake .. -DGTBENCH_BACKEND=cpu_kfirst -DGTBENCH_RUNTIME=single_node -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -march=native -Ofast -fvect-cost-model=unlimited"
```
## :computer: H100 [Santis]
Setup environment for all builds
```
cd $SCRATCH
uenv start /bret/scratch/cscs/bcumming/images/prgenv-gnu-24.2.squashfs
uenv view default
source $SCRATCH/spack/share/spack/setup-env.sh
spack load boost # or boost+thread
```
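Optional sanity check that the expected toolchain is picked up before configuring (generic commands, nothing gtbench-specific):
```
which g++ && g++ --version | head -n 1     # compiler provided by the uenv view
which cmake && cmake --version | head -n 1
spack find --loaded                        # should list boost (or boost+thread)
```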
### cpu_ifirst single_node
`Optimized`
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=single_node -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast"
```
`Optimized (SVE vector length)`
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=single_node -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited"
```
### cpu_kfirst single_node
`Optimized`
```
cmake .. -DGTBENCH_BACKEND=cpu_kfirst -DGTBENCH_RUNTIME=single_node -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast"
```
### cpu_ifirst simple_mpi
```
# An empty string must be passed to CMAKE_CUDA_COMPILER, otherwise nvcc is picked up for compilation and causes many compilation errors
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=simple_mpi -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=""
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
```
`Optimized`
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=simple_mpi -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited"
```
### cpu_kfirst simple_mpi
```
# An empty string must be passed to CMAKE_CUDA_COMPILER, otherwise nvcc is picked up for compilation and causes many compilation errors
cmake .. -DGTBENCH_BACKEND=cpu_kfirst -DGTBENCH_RUNTIME=simple_mpi -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=""
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
```
### cpu_ifirst gcl
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=gcl -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited"
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
```
### cpu_ifirst ghex_comm mpi
- [ ] Need to find a fix for [the correct architecture target](https://github.com/iomaganaris/GHEX/blob/805ebcacd374197123483fc2b962febb0f2ad233/include/ghex/structured/rma_put.hpp#L58):
```
#if defined(__GNUC__) && !defined(__llvm__) && !defined(__INTEL_COMPILER)
__attribute__((optimize("no-tree-loop-distribute-patterns")))
#elif defined(__ARM_ARCH) && __ARM_ARCH == 9
__attribute__((target("arch=armv9-a")))
#elif defined(__SSE2__)
__attribute__((target("sse2")))
#endif
```
Need to check with `master`
Old commits:
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=MPI -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=""
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
```
`gridtools`: 2.3.2
`ghex`: master
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=MPI -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DGridTools_DIR=$(pwd)/_deps/gridtools-build
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64 srun -n 4 -t60 -A csstaff numactl -c 0,1,2,3 ./install/bin/benchmark --domain-size 128 128 32 --runs 1 --cart-dims 2 2
```
Optimized build:
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=MPI -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast"
```
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=MPI -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited"
```
- [ ] TODO `XPMEM`
### cpu_kfirst ghex_comm mpi
- [ ] Need to find a fix for [the correct architecture target](https://github.com/iomaganaris/GHEX/blob/805ebcacd374197123483fc2b962febb0f2ad233/include/ghex/structured/rma_put.hpp#L58)
`gridtools`: 2.3.2
`ghex`: master
```
cmake .. -DGTBENCH_BACKEND=cpu_kfirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=MPI -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64 srun -n 4 -t60 -A csstaff numactl -c 0,1,2,3 ./install/bin/benchmark --domain-size 128 128 32 --runs 1 --cart-dims 2 2
```
- [ ] TODO `XPMEM`
### cpu_ifirst ghex_comm UCX
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=UCX -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64:$(spack location -i ucx)/lib srun -n 1 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 1 1
```
`Optimized`
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=UCX -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited"
```
`XPMEM`
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=UCX -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited" -DGHEX_USE_XPMEM=ON
```
`Cray XPMEM`
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=UCX -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited" -DGHEX_USE_XPMEM=ON -DXPMEM_DIR=/opt/cray/xpmem/default
```
### cpu_kfirst ghex_comm UCX
- [ ] TODO
- [ ] TODO `XPMEM`
### cpu_ifirst ghex_comm libfabric
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=LIBFABRIC -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited"
```
`CRAY libfabric`
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=LIBFABRIC -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited" -DLIBFABRIC_DIR=/opt/cray/libfabric/1.15.2.0
```
`XPMEM`
```
cmake .. -DGTBENCH_BACKEND=cpu_ifirst -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=LIBFABRIC -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER="" -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_BUILD_TYPE=Custom -DCMAKE_CXX_FLAGS="-g -mcpu=neoverse-v2 -Ofast -msve-vector-bits=128 -fopt-info-vec-missed -fvect-cost-model=unlimited" -DLIBFABRIC_DIR=/opt/cray/libfabric/1.15.2.0 -DGHEX_USE_XPMEM=ON -DXPMEM_DIR=/opt/cray/xpmem/default
```
### cpu_kfirst ghex_comm libfabric
- [ ] TODO
- [ ] TODO `XPMEM`
### gpu simple_mpi
```
cmake .. -DGTBENCH_BACKEND=gpu -DGTBENCH_RUNTIME=simple_mpi -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_CUDA_ARCHITECTURES=90
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64 CUDA_VISIBLE_DEVICES=0 srun -n 1 -t5 -A csstaff gdb --args ./install/bin/benchmark --domain-size 128 128 128 --runs 1 --cart-dims 1 1
```
:fire: segfault
```
(gdb) #0 0x0000fffff757267c in vector_m2m ()
from /user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12
#1 0x0000fffff7577a4c in MPII_Segment_manipulate ()
from /user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12
#2 0x0000fffff7576c24 in MPIR_Segment_pack ()
from /user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12
#3 0x0000fffff7579df4 in MPIR_Typerep_pack ()
from /user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12
#4 0x0000fffff642deac in MPIDIG_isend_impl.constprop.0 ()
from /user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12
#5 0x0000fffff6459118 in PMPI_Sendrecv ()
from /user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12
#6 0x00000000004305ac in std::_Function_handler<void (std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&), gtbench::runtime::simple_mpi_impl::process_grid::impl::exchanger(gridtools::array<unsigned int, 3ul>, gridtools::array<unsigned int, 3ul>) const::{lambda(std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&)#2}>::_M_invoke(std::_Any_data const&, std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&) ()
#7 0x0000000000425f88 in std::_Function_handler<void (gtbench::numerics::solver_state&, float), gtbench::numerics::hdiff_stepper(float)::{lambda(gtbench::numerics::solver_state const&, std::function<void (std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&)>)#1}::operator()(gtbench::numerics::solver_state const&, std::function<void (std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&)>) const::{lambda(gtbench::numerics::solver_state&, float)#1}>::_M_invoke(std::_Any_data const&, gtbench::numerics::solver_state&, float&&) ()
#8 0x0000000000424188 in std::_Function_handler<void (gtbench::numerics::solver_state&, float), gtbench::numerics::advdiff_stepper(float)::{lambda(gtbench::numerics::solver_state const&, std::function<void (std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&)>)#1}::operator()(gtbench::numerics::solver_state const&, std::function<void (std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&)>) const::{lambda(gtbench::numerics::solver_state&, float)#1}>::_M_invoke(std::_Any_data const&, gtbench::numerics::solver_state&, float&&) ()
#9 0x000000000040b254 in gtbench::runtime::result gtbench::runtime::simple_mpi_impl::runtime_solve<gtbench::verification::analytical::repeated<gtbench::verification::analytical::advection_diffusion>, std::function<std::function<void (gtbench::numerics::solver_state&, float)> (gtbench::numerics::solver_state const&, std::function<void (std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&)>)> >(gtbench::runtime::simple_mpi_impl::runtime&, gtbench::verification::analytical::repeated<gtbench::verification::analytical::advection_diffusion>, std::function<std::function<void (gtbench::numerics::solver_state&, float)> (gtbench::numerics::solver_state const&, std::function<void (std::shared_ptr<gridtools::storage::data_store_impl_::data_store<gridtools::storage::gpu, float, gridtools::storage::info_impl_::info<gridtools::tuple<int, int, int>, gridtools::tuple<gridtools::integral_constant<int, 1>, int, int>, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul> >, gridtools::meta::list<gridtools::tuple<gridtools::integral_constant<int, 1>, int, int> const&, gridtools::layout_map_impl::layout_map<2, 1, 0>, std::integral_constant<int, 0>, gridtools::integral_constant<int, 32> >, false, false> >&)>)>, gtbench::vec<unsigned long, 3ul> const&, float, float) ()
#10 0x00000000004056c0 in main ()
```
### gpu gcl
Segfault :fire:
```
cmake .. -DGTBENCH_BACKEND=gpu -DGTBENCH_RUNTIME=gcl -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_CUDA_ARCHITECTURES=90
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64 CUDA_VISIBLE_DEVICES=0 srun -n 1 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 1 1
Running GTBENCH
Domain size: 512x512x32
Floating-point type: float
GridTools backend: gpu
srun: error: nid001770: task 0: Segmentation fault
srun: Terminating StepId=8178.0
```
### gpu ghex_comm mpi
- [ ] Need to find a fix for [the correct architecture target](https://github.com/iomaganaris/GHEX/blob/805ebcacd374197123483fc2b962febb0f2ad233/include/ghex/structured/rma_put.hpp#L58)
`gridtools`: 2.3.2
`ghex`: master
```
cmake .. -DGTBENCH_BACKEND=gpu -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=MPI -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=$(which nvcc) -DCMAKE_CUDA_ARCHITECTURES=90 -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64 srun -n 1 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 1 1 --device-mapping 0 # single mpi/gpu works
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64 srun -n 2 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 2 1 --device-mapping 0:1 # more than 1 rank/gpu fails
Domain size: 512x512x32
Floating-point type: float
GridTools backend: gpu
process_vm_readv: Bad address
Assertion failed in file ../src/mpid/ch4/shm/cray_common/cray_common_memops.c at line 461: 0
/user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12(MPL_backtrace_show+0x18) [0xffff858c6f98]
/user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12(+0x1a05e88) [0xffff85355e88]
/user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12(+0x1e29c04) [0xffff85779c04]
/user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/cray-mpich-8.1.28-qrxzbznekbtztqquqpfkuxi5hy5dzjk3/lib/libmpi_gnu_123.so.12(MPI_Irecv+0x10d0) [0xffff83f637b0]
install/lib64/liboomph_mpi.so(_ZN5oomph12communicator4recvEPNS_6detail14message_buffer13heap_ptr_implEmiiONS_4util15unique_functionIFviiELm40EEE+0xa0) [0xffff85b9cfa0]
./install/bin/benchmark() [0x45347c]
./install/bin/benchmark() [0x4549b4]
./install/bin/benchmark() [0x42a668]
./install/bin/benchmark() [0x428868]
./install/bin/benchmark() [0x40f2e0]
./install/bin/benchmark() [0x40f96c]
./install/bin/benchmark() [0x409580]
/lib64/libc.so.6(__libc_start_main+0xe8) [0xffff83363fa0]
./install/bin/benchmark() [0x40a808]
MPICH ERROR [Rank 1] [job id 7789.0] [Fri Feb 16 16:36:48 2024] [nid001812] - Abort(1): Internal error
```
- [ ] TODO `XPMEM`
### gpu ghex_comm UCX
```
cmake .. -DGTBENCH_BACKEND=gpu -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=UCX -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=$(which nvcc) -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64:$(spack location -i ucx)/lib srun -n 1 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 1 1 --device-mapping 0
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64:$(spack location -i ucx)/lib srun -n 2 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 2 1 --device-mapping 0:1
Running GTBENCH
Domain size: 512x512x32
Floating-point type: float
GridTools backend: gpu
GTBench runtime: ghex_comm[1708100821.905509] [nid001812:34293:0] tcp_ep.c:1178 UCX ERROR tcp_ep 0x3e593360 (state=CONNECTED): send(104) failed: Input/output error
[1708100821.905509] [nid001812:34294:0] tcp_ep.c:1178 UCX ERROR tcp_ep 0x387d9c10 (state=CONNECTED): send(106) failed: Input/output error
```
- [ ] TODO `XPMEM`
### gpu ghex_comm libfabric
Note: needs `spack load boost+thread`
```
cmake .. -DGTBENCH_BACKEND=gpu -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=LIBFABRIC -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=$(which nvcc) -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_CUDA_ARCHITECTURES=90
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64:$LD_LIBRARY_PATH:$(spack location -i boost+thread)/lib srun -n 1 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 1 1 --device-mapping 0
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64:$LD_LIBRARY_PATH:$(spack location -i boost+thread)/lib srun -n 2 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 2 1 --device-mapping 0:1
```
`XPMEM`
```
cmake .. -DGTBENCH_BACKEND=gpu -DGTBENCH_RUNTIME=ghex_comm -DGHEX_TRANSPORT_BACKEND=LIBFABRIC -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_CUDA_COMPILER=$(which nvcc) -DGHEX_USE_BUNDLED_LIBS=ON -DGHEX_USE_BUNDLED_GRIDTOOLS=OFF -DCMAKE_CUDA_ARCHITECTURES=90 -DGHEX_USE_XPMEM=ON
srun -n 1 -A csstaff -t5 cmake --build . --target install --parallel 1 --verbose 2>&1 | tee build_output.txt
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64:$LD_LIBRARY_PATH:$(spack location -i boost+thread)/lib srun -n 1 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 1 1 --device-mapping 0
LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64:$LD_LIBRARY_PATH:$(spack location -i boost+thread)/lib srun -n 2 -t5 -A csstaff numactl -N 0 ./install/bin/benchmark --domain-size 512 512 32 --runs 1 --cart-dims 2 1 --device-mapping 0:1
```
CUDA issue with a larger domain:
```
#!/bin/bash -l
#SBATCH --job-name=gtbench_single
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=288
#SBATCH --cpus-per-task=1
#SBATCH --partition=normal
#SBATCH --account=csstaff
#SBATCH -m cyclic:cyclic:cyclic
#SBATCH --uenv=/bret/scratch/cscs/bcumming/images/prgenv-gnu-24.2.squashfs
uenv view default
source $SCRATCH/spack/share/spack/setup-env.sh
export LD_LIBRARY_PATH=/user-environment/env/default/lib64:install/lib64:$LD_LIBRARY_PATH:$(spack location -i boost+thread)/lib
DOMAIN_SIZE=1260 # crashing
for procs_per_module in 1
do
set -x
srun -n $((4*procs_per_module)) --cpu-bind=verbose --ntasks-per-socket=$procs_per_module ./install/bin/benchmark --domain-size $DOMAIN_SIZE $DOMAIN_SIZE $DOMAIN_SIZE --cart-dims 4 $procs_per_module --runs 20 --device-mapping 0:1:2:3
set +x
done
```
```
...
terminate called after throwing an instance of 'std::runtime_error'
what(): GHEX Error: CUDA Call failed cudaEventSynchronize(m_event) (an illegal memory access was encountered) in /bret/scratch/cscs/ioannmag/gtbench_latest_gridtools/build_gpu_ghex_libfabric_xpmem/_deps/ghex-src/include/ghex/rma/event.hpp:87
srun: error: nid001788: task 0: Aborted (core dumped)
```
## :1234: gtbench
Various kernels (similar to stencil_benchmark), but implemented with the GridTools C++ library.
To read the results from the Slurm output, use:
```
grep slurm-63885.out -e "Columns per second:" | awk '{print $4}'
grep slurm-63885.out -e "Domain size:" | awk '{print $3}' | grep -e "^[0-9]*" -o
```
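A possible way to combine the two extractions into one table (domain size next to columns/s); the Slurm file name is just the example from above:
```
# Sketch: pair each "Domain size" with the corresponding "Columns per second"
paste \
  <(grep -e "Domain size:" slurm-63885.out | awk '{print $3}') \
  <(grep -e "Columns per second:" slurm-63885.out | awk '{print $4}')
```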
## H100 vs A100 comparisons
### :rocket: Results [H100 vs A100]

### :end: Conclusion (H100 vs A100)
Similar performance behavior to the stencil benchmarks:
**H100 is about 2x faster** than **A100**
<!-- ## Multinode comparisons
### :rocket: Results [Clariden vs Santis]
#### simple_mpi
#### gcl
#### ghex_mpi
Tried `GHEX` with `MPI` backend with/without `XPMEM` (neoverse-v2 builds):
`Without XPMEM`
```
Running GTBENCH
Domain size: 2048x2048x16
Floating-point type: float
GridTools backend: cpu_ifirst
GTBench runtime: ghex_comm
Median time: 3.76895s
Columns per second: 1.11286e+06
```
`With XPMEM`
```
Running GTBENCH
Domain size: 2048x2048x16
Floating-point type: float
GridTools backend: cpu_ifirst
GTBench runtime: ghex_comm
Median time: 3.7647s
Columns per second: 1.11411e+06
``` -->
### :end: Conclusions
#### Clariden vs Santis


Same behavior for the two CPUs: `ifirst` is faster when everything fits in cache, while `kfirst` becomes faster (up to 10%) once the working set falls out of cache, due to better cache locality.
The differences between `ifirst` and `kfirst` are noticeable in the `Elements per Second` graph. At first, while the domain is very small (a few kB or MB), the best performance cannot be reached. Once the memory footprint of the domain data fits in the cache, the achieved `Elements per Second` grows large. When the domain grows much bigger than the available caches, performance settles at a stable level, where `kfirst` is better thanks to its better cache locality.
Overall, GH200 shows better performance.
Memory size in 64x64x64 should be around `64^3*4(bytes)*6(storage vectors)=6.3MB`
Memory size in 128x128x128 should be around `128^3*4(bytes)*6(storage vectors)=50MB`
Memory size in 256x256x256 should be around `256^3*4(bytes)*6(storage vectors)=402MB`
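These footprints follow directly from `N^3 elements x 4 bytes (float) x 6 storage vectors`; a small, purely illustrative helper to reproduce them for other sizes:
```
# Approximate memory footprint of the domain data:
# N^3 elements * 4 bytes (float) * 6 storage vectors, in decimal MB
for N in 64 128 256; do
    awk -v n=$N 'BEGIN { printf "%d^3: %.1f MB\n", n, n*n*n*4*6/1e6 }'
done
```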
AMD EPYC L3 cache is at most 16 MB/core (256 MB / 4 CCX (8 cores each) / 2 (4 cores each)) (see: [Zen 3 review](https://www.anandtech.com/show/16529/amd-epyc-milan-review/4))
NVIDIA GH200 L3 cache is 117MB
AMD EPYC memory bandwidth drops significantly beyond ~48 MB, i.e. for domains larger than 64x64x64

GH200 `ifirst` performance starts dropping between 50MB and 400MB, i.e. for sizes larger than the L3 cache. Beyond that point `kfirst` becomes slightly faster due to its better data locality
#### Santis
##### CPU ifirst vs kfirst
###### Horizontal domain size sweep


In these diagrams we can see the effect of `ifirst` and `kfirst` depending on the horizontal domain size. The sizes where the optimal `Elements per Second` is reached match the memory footprints calculated in the previous domain size sweep, so we reach the same conclusion as before.
**Question 1**
Why is there so much variability in `ifirst` execution times/`Elements per Second`?
##### Santis communication backends
1. Best single-node performance is achieved with `GHEX` using the `libfabric` backend together with `XPMEM`

2. A hybrid approach of processes + threads does not perform better
3. Using 64 or 72 processes/module (256 or 288 per node) does not make a large difference; in some cases 64 ranks/module are slightly better

#### Santis multinode performance
**TODO**
### Notes
#### What do I want to run?
- [x] First try `cpu_ifirst` and `cpu_kfirst` on a single node/rank for sizes (128-256-512-1024-2048-4096-8192), then do the following with the best one
- [ ] ~~Do a sweep from 128x128 to 8192x8192 (128-256-512-1024-2048-4096-8192) size in single GH200 module (1-8-16-32-64-72 ranks) --> figure out best performance/module~~
- [ ] ~~Do a sweep from 128x128 to 8192x8192 (128-256-512-1024-2048-4096-8192) size in single node (1-8-16-32-64-72 ranks/module) --> figure out best performance/node~~
- [x] Try strong scaling with 576x576x576 domain with ifirst. Scale 1-2-4-16-32-64-72
- [x] Try strong scaling with 576x576x576 domain with ifirst. Scale 1-2-4-16-32-64-72 for ghex ucx/libfabric with XPMEM (cray)
- [x] Run some benchmarks with 64 vs 70 vs 72 with larger sizes
- [x] Try with GHEX with multiple threads/process
- [ ] Try 4 GPUs/node & compare with best perf of 4 CPUs/node
- [ ] In 2-4 nodes (or more nodes) check same strong scaling
- [x] Try all of the above with XPMEM as well? (for GHEX)
- [x] Check what columns/s is
- [x] columns/s is the slope of the runtime plot
- [x] Add plot with runtimes instead of columns/s for ifirst/kfirst
- [x] Add plot with performance per element (divide columns/s by column size) (Enrique's suggestion)
- [x] Figure out what the elements of a column are --> column = vertical column
- [ ] Keep horizontal size (x&y) same and change k size (vertical) and vice versa (cpu_ifirst - kfirst) (Hannes' suggestion)
- [x] NxNx64: 192 has the best performance/element
- [ ] 192x192xN: Do sweep in vertical dimension to see performance difference
- [ ] 512x512xN: Do sweep in vertical dimension to see performance difference
- [x] Discussed issues of GHEX with Fabian
- [ ] Check variability in H100 Columns/s plot (Edoardo's question)
- [ ] Add error bars in A100 vs H100 performance (Philip's suggestion)
- [ ] Check GPU to GPU transfer speeds (Christoph's suggestion)
- [ ] Check multinode performance with nodes closer together and further away (Christoph's suggestion)
- [ ] Add run scripts in the repository for certain experiments
<!-- Takeaways will be:
1. Best performance per node based on comm implementation
2. Best performance based on #ranks/#nodes
3. Best performance based on gridtools backend
4. CPU/GPU comparison -->
### Source code
The scripts to generate the plots can be found in [iomaganaris/gtbench@ioannmag/plots](https://github.com/iomaganaris/gtbench/tree/ioannmag/plots).