###### tags: `benchmarking`

# Current GTBench4Py Performance State: End of June 2021

## Comparison of C++ with Tuned Python Performance

Measured on NVIDIA V100 (Tsa), domain size 512²×180, single precision. Default block sizes and default release compiler flags (fast-math disabled) are used in all cases.

| Configuration                 | Time  | Relative Time |
| ----------------------------- | ----- | ------------- |
| GTBench C++                   | 2.56s | 100.0%        |
| C++ Stencils & Python Driver¹ | 2.60s | 101.5%        |
| GT4Py gtc:cuda²               | 2.73s | 106.6%        |
| GT4Py gtc:gt:gpu²             | 2.80s | 109.4%        |
| GT4Py gtcuda²                 | 3.47s | 135.5%        |

Footnotes:

1. The GTBench C++ stencils are invoked through the GTBench Python bindings. Boundary conditions are handled on the Python side using slice assignment, while GTBench C++ uses GridTools boundary conditions (which require fewer kernel invocations).
2. These measurements were taken with all tunings described below enabled.

Results seem to be similar for significantly smaller domain sizes: in the present case, the gap between Python and C++ is still negligible on a 64²×180 grid. On even smaller domains the gap starts to grow, and at some point the Python overhead dominates the run time.

## “Out-of-the-box” Performance vs. Tuned Performance

Running GTBench4Py _without_ any tunings in GT4Py gives very poor performance. The following table shows the impact of the applied tunings, using the `gtc:cuda` backend as an example (a minimal sketch of the stencil-level tunings is given at the end of this note).

| Tuning                       | Time   | Relative Time |
| ---------------------------- | ------ | ------------- |
| None                         | 72.27s | 2647.3%       |
| Disable Argument Validation  | 72.05s | 2639.2%       |
| Remove Unneeded Device Syncs | 4.49s  | 164.5%        |
| Enforce Single Precision     | 2.73s  | 100.0%        |

## Notes on Reproducibility

### GTBench C++

Commit: https://github.com/fthaler/gtbench/commit/2ef4e39a91ac18f83ca6a903037a2dca80f21928

Compiled using CUDA 10.2.89 and GCC 7.2.0. Built on Tsa (using CMake 3.18, Boost 1.74.0):

```bash
export CUDACXX=/usr/local/cuda-10.2/bin/nvcc
cmake -DGTBENCH_RUNTIME=single_node -DGTBENCH_BACKEND=gpu -DGTBENCH_FLOAT=float ..
make -j 8
```

### GTBench4Py

Commit: https://github.com/fthaler/gt4py-benchmarks/commit/c3da5f890431e16815e65f8ca92c967c7b477bb9

GT4Py version: https://github.com/fthaler/gt4py/tree/gtbench-hacks

Execution commands (require Boost in the path):

```bash
export CUDA_HOME=/usr/local/cuda-10.2/
export CUDACXX=/usr/local/cuda-10.2/bin/nvcc

# Using GTBench C++ bindings
gtbench4py-benchmark single-node gtbench --gtbench-backend gpu --domain-size 512 512 180 --runs 1

# Using GT4Py + gtc:cuda
gtbench4py-benchmark single-node gt4py --gt4py-backend gtc:cuda --domain-size 512 512 180 --runs 1

# Using GT4Py + gtc:gt:gpu
gtbench4py-benchmark single-node gt4py --gt4py-backend gtc:gt:cuda --domain-size 512 512 180 --runs 1

# Using GT4Py + gtcuda
gtbench4py-benchmark single-node gt4py --gt4py-backend gtcuda --domain-size 512 512 180 --runs 1
```
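
## Appendix: Stencil-Level Tuning Sketch

For reference, below is a minimal sketch of what the single-precision and argument-validation tunings look like at the GT4Py level. It assumes the legacy `gt4py.gtscript` / `gt4py.storage` API as of mid-2021; the stencil (`copy_stencil`), the field shapes, and the availability of the `validate_args` keyword are illustrative assumptions and not taken from GTBench4Py itself. The device-sync tuning is not shown, as it lives inside the patched GT4Py branch rather than in user code.

```python
# Sketch only: single-precision fields plus skipped argument validation.
# Assumes the legacy gt4py.gtscript / gt4py.storage API (mid-2021 vintage).
import numpy as np

import gt4py.storage as gt_storage
from gt4py import gtscript
from gt4py.gtscript import PARALLEL, computation, interval

BACKEND = "gtc:cuda"
DTYPE = np.float32  # enforce single precision for all fields


@gtscript.stencil(backend=BACKEND)
def copy_stencil(src: gtscript.Field[DTYPE], dst: gtscript.Field[DTYPE]):
    # Trivial stencil body; stands in for the GTBench operators.
    with computation(PARALLEL), interval(...):
        dst = src


shape = (64, 64, 180)
src = gt_storage.from_array(
    np.random.rand(*shape).astype(DTYPE),
    backend=BACKEND,
    default_origin=(0, 0, 0),
    dtype=DTYPE,
)
dst = gt_storage.zeros(
    backend=BACKEND, default_origin=(0, 0, 0), shape=shape, dtype=DTYPE
)

# Skipping per-call argument validation removes Python-side overhead; whether
# the `validate_args` keyword is available depends on the GT4Py version used.
copy_stencil(src=src, dst=dst, validate_args=False)
```

The key points are that every field uses `np.float32` (so no implicit double-precision computation sneaks in) and that per-call argument validation is skipped to reduce Python overhead per stencil invocation.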