# Benchmarking & Performance Notes
###### tags: `benchmarking`
[ToC]
## Stencil Benchmarks
With the ‘stencil benchmarks’ code available on GitHub (https://github.com/GridTools/stencil_benchmarks), you can:
- Find the sustainable peak memory bandwidth by running the [STREAM benchmark](https://www.cs.virginia.edu/stream) on CPUs and GPUs. It generates highly optimized kernel code with many tunable parameters (manual vectorization, non-temporal loads/stores, unrolling, etc.). A minimal triad sketch is shown at the end of this section.
- Learn more about architecture-specific features and behavior on Cartesian stencil codes and find the most efficient implementations.
  There are dozens of different implementations of the [COSMO](http://www.cosmo-model.org)-like horizontal diffusion and vertical advection stencils available, all with many tunable parameters.
You can follow the instructions in the README to get started. For details about how specific options change the generated code, you have to inspect the generated code directly (often printable with `--print-code` or `--print-kernels`) or look at the Jinja templates and the generating Python code.
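For reference, the kernels behind these bandwidth numbers are very simple. The following is a minimal, hypothetical C++/OpenMP sketch of the STREAM ‘triad’ kernel (array sizes, timing, and output are illustrative and not taken from the actual benchmark code; compile with e.g. `-O3 -fopenmp -march=native`):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Illustrative size only: make the arrays much larger than the last-level cache.
    const std::size_t n = std::size_t(1) << 27; // 1 GiB of doubles per array
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.5);
    const double scalar = 3.0;

    const auto start = std::chrono::steady_clock::now();
    // STREAM 'triad': two loads and one store per element.
#pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    // Counts three data streams; add a fourth if write-allocate traffic is included.
    std::printf("triad bandwidth: %.1f GB/s\n", 3.0 * n * sizeof(double) / seconds * 1e-9);
}
```

The real STREAM benchmark runs each kernel several times, verifies the results, and reports the best time; the kernels generated by `stencil_benchmarks` additionally apply the tunables mentioned above (non-temporal stores, unrolling, etc.).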
## GTBench
### What is GTBench?
GTBench is a benchmark application based on common compute patterns used in weather and climate applications. It is meant to give performance results that are representative of a full weather or climate model, but with the following advantages:
- It has only a few software dependencies and is thus easy to compile.
- It solves an equation that has an analytic solution and is thus easy to verify.
- It is relatively small and thus easy to port to new architectures or computational tools (e.g. DSLs[^1]).
- We have a performance model which enables performance predictions for future, not yet available hardware based on a few simple characteristics.
GTBench uses numerical methods heavily inspired by the [COSMO weather model](http://www.cosmo-model.org) to solve the [advection-diffusion equation](https://en.wikipedia.org/wiki/Convection-diffusion_equation) on a periodic Cartesian grid. Documentation of the numerical methods is available in the [GTBench repository](https://github.com/GridTools/gtbench/blob/master/doc/GTBench_Numerics.pdf). For validation of the results, the method of manufactured solutions is used in combination with spatial (grid) and temporal convergence analysis.
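For reference, the equation GTBench solves is of the general advection-diffusion form (the exact discretization, coefficients, and the manufactured solution are described in the numerics document linked above):

$$
\frac{\partial u}{\partial t} + \mathbf{v} \cdot \nabla u = D \, \nabla^2 u,
$$

where $u$ is the advected quantity, $\mathbf{v}$ the velocity field, and $D$ the diffusion coefficient.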
[^1]: A single-node implementation using the ‘old’ GT4Py is already available here: https://github.com/GridTools/gt4py-benchmarks. However, this documentation is all about the original GridTools C++ version.
### Building GTBench
Building GTBench should be straightforward and is documented in its [README](https://github.com/GridTools/gtbench/blob/master/README.md). Note that GTBench by default downloads and uses a matching GridTools version (and GHEX version if you enable the GHEX communication runtime), so you do not need to build and install GridTools or GHEX manually.
### Single-Node Reference Runs
As a first step towards evaluating hardware performance using GTBench, you should always build and run it in single-node configuration, that is, using its `single_node` runtime (check the GTBench README to find out how to select this when configuring with CMake). This does not require MPI and uses GridTools boundary conditions, which simply copy values from one side of the domain to the other to get a periodic data view.
As no network and no MPI is involved, this should always give an upper bound for performance achievable on the given hardware architecture.
General tips for running GTBench:
- For testing executions, use `--runs 1` to get much shorter run times.
- For final executions, always use the default value for `--runs` and check the reported standard deviations to make sure you get consistent performance results.
- Use large enough domain sizes to saturate the memory bandwidth.
- Always check the convergence tests with your chosen compiler flags; it can happen that the executables are super fast but produce wrong results…
#### On a CPU
- Try different compilers and compiler versions.
- Override the default compiler flags.
- Try `-march=native` (or the closest arch if native is not yet supported on very new hardware).
- Use `-Ofast` (or `-ffast-math`)[^2].
- Try GCC’s `-fvect-cost-model=unlimited` or `-fsimd-cost-model=unlimited` on architectures with large vector width.
- Try GCC’s `-mprefer-vector-width=…` on AMD64 architectures.
- Force a fixed-size vector width on ARM SVE (using `-msve-vector-bits=…` with GCC).
- Use a single socket on multi-socket systems when using the `single_node` runtime to avoid NUMA issues.
- Make sure to check your affinity settings!
- Do not trust the SLURM configuration, especially not on evaluation systems (use `numactl` to fix wrong SLURM configs).
- Use `numactl -s` to check thread and memory bindings and `numactl -H` to check if they make sense.
- If `numactl` is not available, maybe `OMP_DISPLAY_ENV` and `OMP_DISPLAY_AFFINITY` can help.
- Make use of `OMP_NUM_THREADS`, `OMP_PLACES` etc.
- GridTools comes with two CPU backends: `cpu_ifirst` and `cpu_kfirst`. On modern CPUs with many cores and large vector size, usually `cpu_ifirst` is faster, but always try both backends.
- Try to minimize the influence of frequency scaling (e.g., by fixing the frequency to the maximum possible value).
[^2]: Always try `-Ofast`/`-ffast-math` on GridTools codes: without them, conservative compilers like GCC follow the IEEE standard, which in some cases disables vectorization of STL math functions! GTBench does not actually use any fancy math functions, so it might not make a big difference here.
#### On a GPU
- Try Clang (with `-Ofast`) and NVCC on NVIDIA GPUs.
- Try to minimize the influence of frequency scaling (e.g., by fixing the frequency to the maximum possible value).
### Multi-Node Runs
- Always try all distributed runtimes (`gcl`, `simple_mpi`, `ghex_comm`).
- Run convergence tests to make sure you still get the correct results.
- Especially if you compare CPU with GPU architectures, make sure that your horizontal domain size is more or less square (the different data layouts used by different backends might favor different domain aspect ratios, due to the very different cost of communicating contiguous vs. scattered data).
- Try to run one or more MPI ranks per CPU socket. More ranks can be faster due to better cache locality and better overlapping of communication with computation.
  - For GPUs, this might require enabling simultaneous access by multiple processes to the same GPU (e.g., via NVIDIA’s Multi-Process Service).
### Performance Model
For the procurement phase of Alps, a performance model for GTBench was developed. It is able to predict performance on current GPU architectures reasonably well, based on a single variable: peak non-vectorized float STREAM bandwidth. The model is available [here](https://github.com/fthaler/gtbench-perfmodel).
## Performance Characteristics of Modern Hardware Architectures
### CPUs
- Relatively low peak memory bandwidth
- Large caches:
  - Very good performance on small domains (make sure to use large enough domains for measuring memory bandwidth).
  - Caching issues can have significant impact on overall performance. On some hardware you might have to offset allocation addresses to reduce cache conflicts (e.g., on Intel KNL, the least significant 12 bits of an address should be different, that is, page-aligned data might perform badly).
- Hyper-threading/simultaneous multithreading (SMT):
  - Is usually not required for achieving peak bandwidth and can even be a hindrance.
  - Can still be helpful for non-optimal access patterns.
- Vectorization:
  - Vectorized loads/stores are required to reach peak memory bandwidth.
  - Help the compiler to generate faster code by using aliasing hints (`__restrict__`) or, better, pragmas (e.g. `#pragma omp simd`) to explicitly allow/force vectorization; a minimal sketch follows at the end of this list.
  - Often `-ffast-math` is required to allow the use of vectorized math functions that are not fully IEEE-compliant.
  - Check whether overriding the compiler’s default vectorization cost model (e.g. with GCC’s `-fvect-cost-model=unlimited` or `-fsimd-cost-model=unlimited`) gives faster code.
- Hardware prefetching:
  - All modern architectures feature hardware prefetching with stream detection.
  - Only stride-1 streams (in units of cache lines) are detected, that is, reading one cache line after/before the other without gaps.
  - Even occasional single-cache-line gaps in a stream (e.g. when aligning the innermost dimension of a multidimensional array) _can_ hurt hardware prefetching performance seriously.
  - The number of detected streams is limited (though usually high enough).
  - Some hardware (e.g. Intel KNL) cannot track streams across 4 KB page boundaries. You _might_ find such behavior documented in the hardware architecture manual (often in a rather unclear formulation).
- Bandwidth wasting:
  - Without cache-avoiding stores, all data is first loaded from memory before being overwritten, effectively needing twice the bandwidth that just writing would require.
  - That is, cache-avoiding (‘non-temporal’, ‘streaming’) stores are required to reach peak bandwidth.
  - Non-temporal stores are a _bad_ way to avoid the additional load in real-world code as they invalidate the cache! A much better solution would be an instruction that stores a whole cache line without skipping/invalidating the cache but instead just avoids the unnecessary load (that’s what GPUs do).
  - Some architectures also provide non-temporal loads, but they are _probably_ not required to reach peak bandwidth (as they don’t avoid any unnecessary memory transfers).
  - Some ARM processors _might_ support a [write streaming mode](https://developer.arm.com/documentation/ka001402/latest) that automatically avoids loads from memory if it detects a write stream. This is a great solution (but we haven’t seen it implemented in HPC hardware yet)!
- NUMA:
  - On multi-socket systems (and sometimes even on single-socket systems!), be sure to use NUMA-node-local memory.
  - Take special care with huge page allocations: they can be helpful to reduce TLB pressure, but if you have multiple NUMA nodes, they can also introduce unneeded remote accesses. To force-disable huge pages for large allocations on a Linux system with transparent huge pages enabled, you have to use `madvise` with `MADV_NOHUGEPAGE` after the allocation (a minimal sketch follows below)!
- Reproducible performance:
  - It is almost impossible to get consistent, reproducible performance on modern CPUs and operating systems. Many things can influence performance: the starting addresses of your allocations (e.g. through changes in cache collisions), the addresses of your functions, stack variables, etc. Security measures like address space randomization, frequency throttling and boosting, but also functions like `malloc` that do not guarantee the same return value between benchmark runs, are completely outside our control.
  - Always run enough repetitions of a benchmark and, if possible, check the full distribution of run times. You might experience interesting multi-modal distributions (and sometimes be able to find the reason for them).
  - SLURM might allow you to disable frequency throttling (`--cpu-freq=high`).
  - Flushing the CPU cache is _not_ a good strategy to estimate cold-cache performance. First, depending on the CPU architecture, it might be difficult to actually invalidate cache lines directly (that is, by instructions); second, with most strategies you actually force deferred cache-line write-backs, which reduce the achievable memory bandwidth[^3].
  - Instead of flushing the cache, use different data on consecutive runs and make sure your total data size is significantly larger than the last-level cache.
- Always consult the vendor’s architecture/optimization manual! It can help to explain strange observations or find new optimizations.
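To illustrate the aliasing hints and SIMD pragma from the vectorization item above, here is a minimal, hypothetical sketch (not taken from GridTools) of a 1D three-point stencil: `__restrict__` promises the compiler that the input and output arrays do not alias, and `#pragma omp simd` explicitly requests vectorization of the loop.

```cpp
#include <cstddef>

// Hypothetical example: 1D three-point stencil along the contiguous dimension.
// __restrict__ rules out aliasing between `in` and `out`; #pragma omp simd asks
// the compiler to vectorize the loop even if its cost model is undecided.
void laplace_1d(const double* __restrict__ in, double* __restrict__ out, std::size_t n) {
#pragma omp simd
    for (std::size_t i = 1; i + 1 < n; ++i)
        out[i] = in[i - 1] - 2.0 * in[i] + in[i + 1];
}
```

With GCC and Clang, `-fopenmp-simd` honors the pragma without linking the full OpenMP runtime.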
[^3]: Note: GridTools still _tries_ to flush the cache (in a bad way, not respecting the cache line and leading to the mentioned write-back problem). This was just kept as simply removing the flusher led to less consistent timings in the daily benchmarks. As the CPU performance numbers are only used for detecting serious regressions, but are otherwise not too interesting, no effort was taken to implement a better strategy.
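Regarding the huge-page remark in the NUMA item above, a minimal, Linux-only sketch (the helper name and the 4 KiB page size are assumptions) of opting a large allocation out of transparent huge pages could look like this:

```cpp
#include <cstddef>
#include <cstdlib>
#include <sys/mman.h>

// Hypothetical helper: allocate a page-aligned buffer and ask the kernel not to
// back it with transparent huge pages, so first-touch NUMA placement stays
// page-granular. Free the returned pointer with std::free.
double* allocate_without_thp(std::size_t n_doubles) {
    const std::size_t page = 4096;
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    const std::size_t bytes = (n_doubles * sizeof(double) + page - 1) / page * page;
    void* ptr = std::aligned_alloc(page, bytes);
    if (ptr)
        madvise(ptr, bytes, MADV_NOHUGEPAGE); // Linux-specific; error handling omitted
    return static_cast<double*>(ptr);
}
```

The NUMA placement itself still happens at first touch, so initialize the buffer from the threads that will later use it.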
### Manycore CPUs With High-Bandwidth Memory
Most characteristics of normal CPUs apply, often in exaggerated form. Some additional ones are:
- Very high _theoretical_ peak bandwidth:
  - Often impossible to reach in real-world code due to the bandwidth wasting mentioned above.
- Cache issues:
  - Cache per core _can_ be very limited: then, properly employing hardware prefetching and vectorization is even more crucial.
  - Cache coherency can be a large cost factor, especially for writes. Thus, avoiding additional temporary buffers and instead using redundant computations is _often_ beneficial.
- Large vector width:
  - You need to vectorize everything, otherwise performance is very poor.
  - On ARM hardware with the Scalable Vector Extension (SVE), make sure to compile for a fixed vector width (e.g., with GCC, set `-msve-vector-bits=…` to the native vector width).
- Prefetching:
  - You have to use stride-1 streams everywhere, otherwise performance is poor.
  - As an alternative, you can try software prefetching, but this is in general very machine-dependent, hard to tune automatically (prefetching distance), and sometimes shows special characteristics. On Intel KNL, for example, prefetches with large strides seemed to fail completely (there is no ‘stride’ argument to the instruction, so it has to do with the address locations directly; prefetching seems to work per page or so, and as mentioned before, KNL hardware prefetching also behaves strangely at page boundaries).
- Special note on vertical tridiagonal solvers:
  - Because vectorization is so important, multiple columns have to be processed at once to enable vectorization. The number of columns processed together should match the native vector width or maybe the cache line size (to improve instruction-level parallelism).
  - Because hardware prefetching is so important, the upward and downward sweeps must have a unit-stride access pattern (that is, work on contiguous memory).
  - These two points together mean: you need a blocked storage layout (weather and climate codes often have this implemented; the inner block length is usually called `nproma`). A minimal sketch follows after this list.
- Complex code:
  - Some hardware requires incredibly complex code sequences to achieve peak memory bandwidth. As an example, see the A64FX code below. No compiler can do this for you (except maybe for the STREAM benchmark), so live with sub-optimal performance in this case and blame the hardware designers for not implementing something like the write streaming mode mentioned before.
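To make the tridiagonal-solver note above concrete, here is a minimal, hypothetical sketch (not GridTools or GTBench code; array names and layout are assumptions) of the Thomas algorithm sweeping over a block of `nproma` contiguous columns, so that the innermost loop is unit-stride and vectorizable:

```cpp
#include <cstddef>

// Hypothetical sketch: Thomas algorithm over a block of `nproma` independent columns.
// Layout: the value of a field f at level k and column i is f[k * nproma + i], i.e.
// the columns of a block are contiguous in memory, so the inner i-loops are
// unit-stride and vectorize. a: sub-diagonal, b: diagonal, c: super-diagonal
// (overwritten), d: right-hand side (overwritten), x: solution.
void thomas_blocked(const double* __restrict__ a, const double* __restrict__ b,
                    double* __restrict__ c, double* __restrict__ d,
                    double* __restrict__ x, std::size_t levels, std::size_t nproma) {
    // forward (downward) sweep
#pragma omp simd
    for (std::size_t i = 0; i < nproma; ++i) {
        c[i] /= b[i];
        d[i] /= b[i];
    }
    for (std::size_t k = 1; k < levels; ++k) {
#pragma omp simd
        for (std::size_t i = 0; i < nproma; ++i) {
            const std::size_t idx = k * nproma + i, prev = idx - nproma;
            const double m = 1.0 / (b[idx] - a[idx] * c[prev]);
            c[idx] *= m;
            d[idx] = (d[idx] - a[idx] * d[prev]) * m;
        }
    }
    // backward (upward) sweep
#pragma omp simd
    for (std::size_t i = 0; i < nproma; ++i)
        x[(levels - 1) * nproma + i] = d[(levels - 1) * nproma + i];
    for (std::size_t k = levels - 1; k-- > 0;) {
#pragma omp simd
        for (std::size_t i = 0; i < nproma; ++i) {
            const std::size_t idx = k * nproma + i;
            x[idx] = d[idx] - c[idx] * x[idx + nproma];
        }
    }
}
```

Choosing `nproma` as a small multiple of the native SIMD width (e.g. 8 or 16 doubles on 512-bit vectors) is usually a good starting point.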
Bonus: simplified Fujitsu A64FX code required to reach peak bandwidth (source array address in register `x0`, destination array address in `x1`, number of remaining elements in `x2`):
```arm
ptrue p0.d
// loop over all but last 512 elements with prefetching
sub x2, x2, #512
prefetch_loop_start%=:
// zero the destination cache line 4096 bytes ahead to avoid loading it
add x3, x1, #4096
dc zva, x3
// prefetch destination array 512 bytes ahead to L1 cache
prfm pstl1keep, [x1, #512]
// copy whole cache line (256 bytes)
ld1d {z0.d}, p0/z, [x0]
ld1d {z1.d}, p0/z, [x0, #1, mul vl]
ld1d {z2.d}, p0/z, [x0, #2, mul vl]
ld1d {z3.d}, p0/z, [x0, #3, mul vl]
st1d {z0.d}, p0, [x1]
st1d {z1.d}, p0, [x1, #1, mul vl]
st1d {z2.d}, p0, [x1, #2, mul vl]
st1d {z3.d}, p0, [x1, #3, mul vl]
// increase counters and loop
sub x2, x2, #32
add x0, x0, #256
add x1, x1, #256
cbnz x2, prefetch_loop_start%=
// loop over last 512 elements without prefetching
// ... (omitted)
```
### GPUs
- Very high peak bandwidth:
  - Often more than 90% of theoretical peak bandwidth can indeed be achieved.
  - Much easier to reach than on CPUs, as the problem of bandwidth wasting without non-temporal stores does not exist.
- Cache:
  - The cache size is often quite small (especially on AMD GPUs; Intel seems to be improving there quite a bit).
  - No global cache coherency, so cache handling is often faster than on manycore CPUs.
- Execution model:
  - GPUs use the [SIMT execution model](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads).
  - NVIDIA and AMD GPUs use a grid-based execution model with four levels from grid down to thread. NVIDIA calls the levels ‘grid’, ‘block’, ‘warp’, and ‘thread’; AMD calls them ‘grid’, ‘workgroup’, ‘wavefront’, and ‘thread’. Analogous to the SIMD model, the ‘threads’ are also sometimes referred to as ‘lanes’.
  - NVIDIA GPUs have a fixed warp size of 32. Older AMD GPUs have a fixed wavefront size of 64; recent AMD GPUs have a configurable wavefront size of 64 or 32.
- Memory types:
  - GPUs typically have multiple memory types or address spaces. To confuse people, NVIDIA calls them ‘global’, ‘local’, ‘shared’, and ‘constant’, while AMD calls them ‘global’, ‘private’/‘scratch’, ‘local’/‘group’/‘LDS’, and ‘constant’ (note the different usage of ‘local’!).
  - Global memory: the ‘normal’ memory, where `cudaMalloc(…)` etc. allocate data. Should be accessed in a ‘coalesced’ way, that is, neighboring threads in a warp/wavefront should access neighboring elements in memory. This minimizes the number of memory transactions and thus maximizes bandwidth.
  - Local (NVIDIA), private/scratch (AMD) memory: per-thread memory automatically allocated from global memory if register spills occur or thread-local arrays are accessed with dynamic indices that the compiler cannot optimize into register accesses. Should be avoided whenever possible; usually a sign of bad code (e.g. too big kernels, thread-local arrays) or failing compiler optimizations.
  - Shared (NVIDIA), local/group/LDS (AMD) memory: on-chip memory shared between all threads of a block/workgroup. Much faster than global memory, supports communication and synchronization between threads of a block/workgroup. Nowadays, though, communication via warp-level communication/synchronization functions is often faster. With some access patterns you can run into bank conflicts, which may lead to significantly higher access latency.
  - Constant memory: same physical memory as global memory, but accessed via special instructions and backed by a special constant cache. Optimized for cases where all threads of a warp access the same value. Used for storing kernel arguments.
- Vectorization:
  - On some hardware, vectorized loads/stores (manually implementable using the `float2`, `float4`, `double2`, etc. data types) are required to reach peak bandwidth. Special load instructions for read-only accesses might also be available and can improve bandwidth. Usually only the size in bytes of a single load matters for performance, that is, `float4` gives the same bandwidth as `double2`. See the copy-kernel sketch at the end of this section.
  - AMD’s GPU compilation pipeline (HIPCC/Clang) is able to auto-vectorize small loops (e.g., of size two or four, mimicking the `float2`/`float4` types) in a kernel (this might also work with Clang CUDA, but we never tested that).
  - If you use the auto-vectorization feature, you can also use the Clang builtins `__builtin_nontemporal_load(…)` and `__builtin_nontemporal_store(…)`, which _might_ actually be required on AMD GPUs to achieve peak bandwidth.
- For linear algebra operations, try using the tensor cores on NVIDIA GPUs or matrix core engines/MFMA instructions on AMD GPUs.
- Each new GPU generation brings quite a few changes, so don’t assume a previously highly tuned code is still optimal for the new generation.
- Reproducible performance:
  - Due to their ‘distance’ from the operating system, it is traditionally easier to get relatively small run-time variations on GPUs.
  - On NVIDIA GPUs you can disable frequency throttling/boosting by setting the environment variable `CUDA_AUTO_BOOST=0`.
  - If you cannot disable frequency throttling/boosting, make sure to run many repetitions and double-check whether small kernels give reasonable results (they might run at a low frequency because they finish too quickly to trigger frequency boosting).
  - Traditionally, cold vs. warm cache did not matter much on GPUs, as the cache size was very limited. Nowadays it does, though still less than on CPUs.
  - GPU timers (`cudaEvent*`) _might_ measure different things for different vendors (used to be kernel time on NVIDIA; starting time plus kernel time on AMD).
- Always consult the vendor’s architecture/optimization manual! It can help to explain strange observations or find new optimizations.
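To illustrate coalesced, vectorized global-memory access, here is a minimal, hypothetical CUDA sketch (kernel name and launch parameters are illustrative) of a copy kernel using `float4` loads and stores; consecutive threads of a warp access consecutive 16-byte elements, so the accesses coalesce:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical sketch: copy n float4 elements (i.e. 4 * n floats) from src to dst.
// Consecutive threads access consecutive 16-byte elements (coalesced); the
// grid-stride loop makes the kernel independent of the launch configuration.
__global__ void copy_float4(const float4* __restrict__ src,
                            float4* __restrict__ dst, std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(gridDim.x) * blockDim.x;
    for (std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
         i < n; i += stride)
        dst[i] = src[i]; // one 16-byte load and one 16-byte store per iteration
}

// Illustrative launch; block and grid sizes are tuning parameters, not magic values:
// copy_float4<<<1024, 256>>>(src, dst, n);
```

The analogous HIP kernel on AMD GPUs could additionally use the `__builtin_nontemporal_load`/`__builtin_nontemporal_store` builtins mentioned above.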
## Useful Tools
- [Numactl](https://linux.die.net/man/8/numactl): for setting and checking CPU and memory affinity (`OMP_DISPLAY_AFFINITY` might also be useful in this context)
- [Nsight Systems](https://developer.nvidia.com/nsight-systems): NVIDIA’s kernel tracing tool (for timelines etc.)
- [Nsight Compute](https://developer.nvidia.com/nsight-compute): NVIDIA’s kernel profiling tool (for hardware counters, roofline analysis, etc.)
- [Rocprof](https://rocmdocs.amd.com/en/latest/ROCm_Tools/ROCm-Tools.html): AMD’s ROCm profiling and tracing tool
- [Linux Perf](https://perf.wiki.kernel.org/index.php/Main_Page): useful for checking CPU vectorization/assembly code, finding and debugging low-level effects of code changes
- [Intel Memory Latency Checker](https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html): useful to get a reference for the maximum achievable STREAM bandwidth on x86-64 CPUs; also useful for checking NUMA latencies