###### tags: `benchmarking`
# Current GTBench4Py Performance
State: end of June 2021.
## Comparison Of C++ with Tuned Python Performance
Measured on NVIDIA V100 (Tsa), domain size 512²×180, single precision. Default block sizes and default release compiler flags (fast-math disabled) are used in all cases.
| Configuration | Time | Relative Time |
| ----------------------------- | ----- | ------------- |
| GTBench C++ | 2.56s | 100.0% |
| C++ Stencils & Python Driver¹ | 2.60s | 101.5% |
| GT4Py gtc:cuda² | 2.73s | 106.6% |
| GT4Py gtc:gt:gpu² | 2.80s | 109.4% |
| GT4Py gtcuda² | 3.47s | 135.5% |
Footnotes:
1. GTBench C++ stencils are invoked through the GTBench Python bindings. Boundary conditions are handled on the Python side using sliced assignment (a sketch of this approach follows these footnotes), while GTBench C++ uses GridTools boundary conditions, which need fewer kernel invocations.
2. These measurements have all of the tunings described below enabled.
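
For illustration, here is a minimal sketch of what such a sliced-assignment boundary update can look like on the Python side. The function name, halo width, and periodic horizontal boundaries are assumptions made for this example and are not taken from GTBench4Py; the same slicing works on NumPy or CuPy arrays and on GT4Py storages.

```python
import numpy as np

def apply_periodic_boundary(field: np.ndarray, halo: int = 3) -> None:
    """Fill the horizontal halo by periodic copies of the opposite interior
    edge, using plain sliced assignment (hypothetical example)."""
    # x-direction: copy the interior edges into the halo regions
    field[:halo, :, :] = field[-2 * halo:-halo, :, :]
    field[-halo:, :, :] = field[halo:2 * halo, :, :]
    # y-direction
    field[:, :halo, :] = field[:, -2 * halo:-halo, :]
    field[:, -halo:, :] = field[:, halo:2 * halo, :]

# Example: a 64²×180 field with a 3-point halo in the horizontal directions
f = np.zeros((64 + 6, 64 + 6, 180), dtype=np.float32)
apply_periodic_boundary(f)
```

When the arrays live on the GPU (e.g. CuPy-backed storages), each of these sliced assignments launches its own copy kernel, which is why this approach needs more kernel invocations than the fused GridTools boundary conditions used by GTBench C++.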
Results are similar for significantly smaller domain sizes: the gap between Python and C++ is still negligible on a 64²×180 grid. On even smaller domains the gap starts to grow, and at some point the Python overhead dominates the run time.
## “Out-of-the-box” Performance vs. Tuned Performance
Running GTBench4Py _without_ any tunings in GT4Py gives terrible performance. The following table shows the impact of the tunings for the `gtc:cuda` backend; each row adds one tuning on top of the rows above it.
| Tuning | Time | Relative Time |
| ---------------------------- | -------- | ------------- |
| None | 72.27s | 2647.3% |
| Disable Argument Validation | 72.05s | 2639.2% |
| Remove Unneeded Device Syncs | 4.49s | 164.5% |
| Enforce Single-Precision | 2.73s | 100.0% |
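
For illustration, here is a minimal sketch of what the last two tunings can look like at the call site, assuming the standard 2021-era GT4Py cartesian API (`gtscript.stencil`, `gt4py.storage`, and the `validate_args` call keyword). The stencil, field names, and shapes are made up for the example, and the device-sync tuning is not shown here because it is not controlled from the call site in this sketch.

```python
import numpy as np
from gt4py import gtscript, storage
from gt4py.gtscript import PARALLEL, computation, interval

BACKEND = "gtc:cuda"
DTYPE = np.float32  # enforce single precision for fields and literals

@gtscript.stencil(backend=BACKEND)
def copy_stencil(inp: gtscript.Field[DTYPE], out: gtscript.Field[DTYPE]):
    with computation(PARALLEL), interval(...):
        out = inp  # placeholder body; GTBench's stencils are more involved

shape = (512, 512, 180)
inp = storage.from_array(
    np.random.rand(*shape).astype(DTYPE),
    backend=BACKEND, default_origin=(0, 0, 0), dtype=DTYPE,
)
out = storage.zeros(
    shape=shape, backend=BACKEND, default_origin=(0, 0, 0), dtype=DTYPE,
)

# Skip per-call argument validation (shape/dtype/origin checks) in the hot loop.
copy_stencil(inp, out, validate_args=False)
```

Skipping argument validation removes per-call Python-side checks on shapes, dtypes, and origins, which matters once the kernel launches are cheap relative to the Python overhead.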
## Notes on Reproducibility
### GTBench C++
Commit: https://github.com/fthaler/gtbench/commit/2ef4e39a91ac18f83ca6a903037a2dca80f21928
Compiled using CUDA 10.2.89 and GCC 7.2.0. Built on Tsa (using CMake 3.18, Boost 1.74.0):
```bash
export CUDACXX=/usr/local/cuda-10.2/bin/nvcc
# Run from a build directory inside the GTBench checkout (hence the trailing "..")
cmake -DGTBENCH_RUNTIME=single_node -DGTBENCH_BACKEND=gpu -DGTBENCH_FLOAT=float ..
make -j 8
```
### GTBench4Py
Commit: https://github.com/fthaler/gt4py-benchmarks/commit/c3da5f890431e16815e65f8ca92c967c7b477bb9
GT4Py version: https://github.com/fthaler/gt4py/tree/gtbench-hacks
Execution commands (requires Boost in the path):
```bash
export CUDA_HOME=/usr/local/cuda-10.2/
export CUDACXX=/usr/local/cuda-10.2/bin/nvcc
# Using GTBench C++ bindings
gtbench4py-benchmark single-node gtbench --gtbench-backend gpu --domain-size 512 512 180 --runs 1
# Using GT4Py + gtc:cuda
gtbench4py-benchmark single-node gt4py --gt4py-backend gtc:cuda --domain-size 512 512 180 --runs 1
# Using GT4Py + gtc:gt:gpu
gtbench4py-benchmark single-node gt4py --gt4py-backend gtc:gt:gpu --domain-size 512 512 180 --runs 1
# Using GT4Py + gtcuda
gtbench4py-benchmark single-node gt4py --gt4py-backend gtcuda --domain-size 512 512 180 --runs 1
```