# General Performance Considerations
###### tags: `benchmarking`
## General
* Vectorization combined with unit-stride accesses (in terms of cache lines, so the hardware prefetcher can work) is required to get full CPU performance (see the first sketch after this list).
* Vectorization is also becoming more important on GPUs: the newest NVIDIA and AMD GPU generations require vectorized loads and stores to achieve full bandwidth.
* Unstructured grids only: scattered reads (gather instructions) are faster than scattered stores; writing data should always happen in a vectorized/coalesced way.
* Unit-stride dimension and vector size must be known at compile time to achieve reasonable performance.
* Finer-grained control of caches may become more widely available (non-temporal memory accesses on Intel/AMD/ARM CPUs and AMD GPUs, cache residency control on NVIDIA GPUs); the second sketch after this list shows one such mechanism.
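
To make the unit-stride point concrete, here is a minimal C++ sketch (the field layout, sizes, and the trivial scaling update are illustrative assumptions, not code from any backend): the first loop runs the innermost iteration over the contiguous dimension, which compilers can vectorize and the hardware prefetcher can follow; the second strides through memory and defeats both.

```cpp
#include <cstddef>
#include <vector>

// Illustrative 2D field stored row-major: index (i, k) -> i * nk + k.
// Good: the innermost loop runs over the contiguous (unit-stride) dimension,
// so loads/stores touch consecutive cache lines and the loop vectorizes cleanly.
void update_unit_stride(const std::vector<double>& in, std::vector<double>& out,
                        std::size_t ni, std::size_t nk) {
    for (std::size_t i = 0; i < ni; ++i)
        for (std::size_t k = 0; k < nk; ++k)          // stride 1
            out[i * nk + k] = 2.0 * in[i * nk + k];
}

// Bad: the innermost loop strides by nk elements, so every access hits a
// different cache line; vectorization needs gathers/scatters and the hardware
// prefetcher cannot keep up.
void update_strided(const std::vector<double>& in, std::vector<double>& out,
                    std::size_t ni, std::size_t nk) {
    for (std::size_t k = 0; k < nk; ++k)
        for (std::size_t i = 0; i < ni; ++i)          // stride nk
            out[i * nk + k] = 2.0 * in[i * nk + k];
}
```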
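
As one concrete instance of such cache control, the sketch below uses x86 AVX streaming-store intrinsics for non-temporal writes. It assumes an x86 CPU with AVX, a 32-byte-aligned output buffer, and a length that is a multiple of 4 doubles; it only illustrates the mechanism and is not a backend implementation.

```cpp
#include <immintrin.h>
#include <cstddef>

// Scaled copy that bypasses the cache on the store side: useful when the
// output will not be re-read soon, so caching it would only evict useful data.
// Assumes `out` is 32-byte aligned and n is a multiple of 4 doubles.
void scale_non_temporal(const double* in, double* out, std::size_t n, double a) {
    const __m256d va = _mm256_set1_pd(a);
    for (std::size_t i = 0; i < n; i += 4) {
        __m256d v = _mm256_loadu_pd(in + i);                // regular (cached) load
        _mm256_stream_pd(out + i, _mm256_mul_pd(v, va));    // non-temporal store
    }
    _mm_sfence();  // make the streaming stores globally visible
}
```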
## Physics
* For optimal locality, as many column operations (forward/backward loops) as possible should be fused into a single kernel/loop (see the first sketch after this list). In current GT4Py, each `computation(…)` corresponds to a GT multistage and thus triggers a global synchronization and a full-domain loop. This leads to very poor locality and hence poor performance in typical physics workloads.
* To compete with Fortran, we _need_ storage blocking on CPU backends (see the second sketch after this list).
    * Blocking allows for vectorized computations (with vector-sized k-caches) and unit-stride (in terms of cache lines) access patterns. Each can be implemented separately without blocking, but only with blocking can they be combined, which is crucial for maximum performance and is done manually in at least some models (e.g. ICON's `nproma` blocking).
    * On GPUs, blocking *may* also be advantageous due to improved spatial data locality (e.g. enabling CUDA 11 cache residency control, a smaller chance of cache conflicts, and possibly profiting from sector promotion/prefetching).
    * Performance experiments on a Fujitsu A64FX, using the COSMO vertical advection of a single velocity component, show about 20% lower run time for blocked storages compared to the best non-blocked implementation (a manually vectorized one _including software prefetching_) and more than 35% lower run time compared to a GridTools-like implementation. For more complex patterns (e.g. more k-cacheable fields, more consecutive forward/backward loops), an even higher speedup can be expected.
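
To illustrate the fusion argument, here is a hedged C++ sketch (the two column operations and the flat `column * nk + k` layout are made-up stand-ins, not GT4Py-generated code): in the unfused version each operation sweeps the whole domain before the next one starts, so the column data is evicted from cache in between; in the fused version both sweeps run back to back on the same column.

```cpp
#include <cstddef>
#include <vector>

// Unfused: each column operation sweeps the full domain before the next one
// starts, so by the time op2 reads `a`, the column written by op1 has long
// been evicted (assuming the domain is much larger than the cache).
void physics_unfused(std::vector<double>& a, std::vector<double>& b,
                     std::size_t ncol, std::size_t nk) {
    for (std::size_t c = 0; c < ncol; ++c)                      // op1: forward sweep
        for (std::size_t k = 1; k < nk; ++k)
            a[c * nk + k] += 0.5 * a[c * nk + k - 1];
    for (std::size_t c = 0; c < ncol; ++c)                      // op2: backward sweep
        for (int k = static_cast<int>(nk) - 2; k >= 0; --k)
            b[c * nk + k] = a[c * nk + k] + b[c * nk + k + 1];
}

// Fused: both sweeps run on the same column back to back, so the column data
// stays resident in cache (or registers) between the two passes.
void physics_fused(std::vector<double>& a, std::vector<double>& b,
                   std::size_t ncol, std::size_t nk) {
    for (std::size_t c = 0; c < ncol; ++c) {
        for (std::size_t k = 1; k < nk; ++k)                    // forward sweep
            a[c * nk + k] += 0.5 * a[c * nk + k - 1];
        for (int k = static_cast<int>(nk) - 2; k >= 0; --k)     // backward sweep
            b[c * nk + k] = a[c * nk + k] + b[c * nk + k + 1];
    }
}
```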
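
The blocking-plus-vectorization combination can be sketched as follows (the block length `VLEN`, the blocked index function, and the simple forward recurrence are illustrative assumptions, not COSMO/ICON or GridTools code): the horizontal dimension is split into vector-sized blocks that are contiguous in memory, so the innermost lane loop is unit-stride and vectorizable, while a small per-block array acts as a vector-sized k-cache for the recurrence.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Compile-time vector length: the innermost, unit-stride block dimension.
constexpr std::size_t VLEN = 8;

// Blocked layout: index (block, k, lane) -> (block * nk + k) * VLEN + lane,
// i.e. the VLEN horizontal points of one block are contiguous in memory.
inline std::size_t idx(std::size_t block, std::size_t k, std::size_t lane,
                       std::size_t nk) {
    return (block * nk + k) * VLEN + lane;
}

// Illustrative forward column recurrence: out(k) = in(k) + c * out(k-1),
// with out(-1) taken as 0. The k-cache `prev` keeps the previous level for
// all VLEN lanes in registers, the lane loop is unit-stride and vectorizes,
// and successive k levels of one block stay cache-resident.
void forward_recurrence_blocked(const std::vector<double>& in,
                                std::vector<double>& out,
                                std::size_t nblocks, std::size_t nk, double c) {
    for (std::size_t b = 0; b < nblocks; ++b) {
        std::array<double, VLEN> prev{};                        // vector-sized k-cache
        for (std::size_t k = 0; k < nk; ++k) {
            for (std::size_t lane = 0; lane < VLEN; ++lane) {   // stride 1
                prev[lane] = in[idx(b, k, lane, nk)] + c * prev[lane];
                out[idx(b, k, lane, nk)] = prev[lane];
            }
        }
    }
}
```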