# [DaCe] Optimization XI
<!-- Seventeen Moments of the Optimizer -->
- Shaped by: Philip
- Appetite (FTEs, weeks): $\infty$
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
The goal is to achieve a speed-up of about 1.5 over the OpenACC version of ICON in the `integrate_nh t_min` timer. Only single-GPU runs are relevant for this project.
We follow a parallel approach: each person picks a stencil, optimizes it, and moves on to the next one, starting with the stencils that have the biggest potential to improve the total time (i.e. the ones with the largest total time).
For each program:
- measure time on `ch1_medium` with the corresponding StencilTest
- compare time to the corresponding blueline timer; if not close, understand why:
- are the right timers being compared (kernel time only, or including Python overhead)? Prefer the kernel timers from the GT4Py metrics printed at the end of a pytest-benchmark report
- in case timing depends on input values (e.g. masks), make sure the input is initialized with relevant data
- also make sure that values such as `nflatlev` are correct; to get the right values you can check the "benchmark repo", where this was done before.
- optimize the program (including custom flags on the DaCe backend)
- integrate the changes into the granule (including custom backend options) and verify performance improvement in the blueline
Apart from program-specific optimizations, for each program we should also check the following:
- Is k-blocking enabled for every map? If yes, is it beneficial?
- Are there maps that are unnecessarily split?
Here is a short summary:
- The diffusion stencils are simple and are suited for onboarding new people (but are not too relevant in total time!)
Especially the two stencils `apply_diffusion_to_vn` and `apply_diffusion_to_w_and_[omitted]` have been studied in detail before.
- Previously we worked with the infamous "Benchmark Repo", which configured the stencils in a slightly different way; thus we will have some new cases and differences here, but the general direction should be the same.
- Previously we measured on Balfrin (A100), but we probably have to switch to Säntis; this should not matter that much, but it requires some re-tuning.
- The majority of transformations should already be there (with a few exceptions, see below); this project is rather about extending the transformations.
There are a set of transformations that are known to be missing:
- Some redundant array removal transformation ([PR#2273](https://github.com/GridTools/gt4py/pull/2273) merged).
- The advanced inlining/fusing transformation, shaped [here](https://hackmd.io/ot702TuJT7OJXp07WveiPA); Philip is working on it.
- Avoid vertical map splitting in some cases, Ioannis is working on it ([see diff](https://github.com/GridTools/gt4py/compare/main...iomaganaris:gt4py:opt_vertical_split#diff-d13c7cb95d0abfd455ad662d1eb0853bd95e6d13ce0828d405760f57ac808be6))
- Some transformation related to scan; Ioannis is working on it ([gt4py branch](https://github.com/iomaganaris/gt4py/tree/remove_copies))
- If we have no further ideas on how to deal with a stencil we will give them to SPCL and ask for their opinions and help.
### How-Tos
#### ICON4Py/GT4Py Baseline
We will use an integration branch in ICON4Py: [blueline_integration](https://github.com/C2SM/icon4py/tree/blueline_integration)
This branch pulls the main branch of GT4Py. The exact version of GT4Py is locked in `uv.lock`. You can pull the latest `blueline_integration`, then follow the steps below to check the installed GT4Py version:
```
% uv sync --extra all --extra cuda12
% uv pip list | grep gt4py
gt4py 1.0.9.post15+bb7826e9
```
The `blueline_integration` branch also points to a specific tag in the [GridTools/dace](https://github.com/GridTools/dace) fork repository, which points to a commit on [gt4py-next-integration](https://github.com/GridTools/dace/tree/gt4py-next-integration). You can check this in the [pyproject.toml](https://github.com/C2SM/icon4py/blob/blueline_integration/pyproject.toml) file. At the beginning of the cycle, the tag `__gt4py-next-integration_2025_09_25` is used.
#### Time the Stencils in Blueline
[Edoardo] Describe how to get the GT4Py timers and compare them to the OpenACC baseline.
The following steps describe how to build blueline dycore on Santis and how to collect the GT4Py stencil timers.
0. Install uv
Make sure that `uv` is installed by running `uv --version`, otherwise install it:
`curl -LsSf https://astral.sh/uv/install.sh | sh `
1. Load the uenv
```
uenv start icon/25.2:v3 --view=default
export LD_LIBRARY_PATH=/user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/gcc-13.2.0-iyisbcrs7nsbhcwcncsk52zmo3zjf33i/lib64:/user-environment/linux-sles15-neoverse_v2/gcc-13.2.0/nvhpc-25.1-tsfur7lqj6njogdqafhpmj5dqltish7t/Linux_aarch64/25.1/compilers/lib/:$LD_LIBRARY_PATH
```
The export command is needed as a workaround for a build issue in serialbox package.
2. Clone the icon-exclaim repository
Clone the GitHub [icon-exclaim](https://github.com/C2SM/icon-exclaim) repository. Note that part of these instructions follow the README in the [dsl](https://github.com/C2SM/icon-exclaim/tree/icon-dsl/dsl) folder.
The repository is configured to pull [icon-dsl](https://github.com/C2SM/icon-exclaim/tree/icon-dsl) branch as default. We need to switch to branch `add_timers`:
```
git switch add_timers
```
Inside your local icon-exclaim repository, install the dependencies, specifying the ICON4Py `blueline_integration` branch:
`./dsl/install_dependencies.sh --icon4py blueline_integration`
Note that the script will not switch branches if the icon4py dependency was previously installed on a different branch. In that case, you will have to switch the branch manually, or delete the repository and rerun the script.
You can check the python environment that is installed (make sure that the environment is loaded):
```
cd externals/icon4py
git branch -vv
uv sync --extra all --extra cuda12
uv pip list
```
For example, you can check the version of the GT4Py python package:
```
uv pip list | grep gt4py
gt4py 1.0.9.post15+bb7826e9
```
After the first installation, you can also pull the latest `blueline_integration` inside icon4py repository. You can also install a gt4py development branch from a local repository.
3. Build the blueline Fortran code
In order to build the blueline (Fortran + Python dynamical core) for GPU you need to run:
```
./dsl/setup.sh build_gpu2py
```
Please refer to the icon-exclaim [README](https://github.com/C2SM/icon-exclaim/blob/icon-dsl/dsl/README.md) for all available configurations.
After the build has completed, you will find a folder `build_gpu2py`. Go to `build_gpu2py/run`, edit the experiment configuration `exp.mch_icon-ch1_medium.run`, and set the backend enum value in the `run_nml` namelist:
```
icon4py_backend = 0 ! ICON4Py backend (0 = default)
```
Instead of `0`, use `2` for gtfn_gpu backend and `1` for dace_gpu.
4. First run
Make sure that the uenv is loaded (see step 1) and that you have selected the desired backend (see step 3).
We need to set up some GT4Py environment variables. These variables are needed for all runs, both when we want to collect the GT4Py timers and when we simply want to measure the total time using the Fortran timers, so it is convenient to source them from a file:
```
export GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT
export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1
export GT4PY_BUILD_JOBS=64
export PYTHONOPTIMIZE=2
```
Note that the GT4Py persistent cache will be created in `build_gpu2py/bin/gt4py_build_cache_dir`. If you need to keep separate caches, for example when switching between backends, you can create different folders and a symbolic link named `gt4py_build_cache_dir`:
```
ln -s gt4py_dace_build_cache_dir gt4py_build_cache_dir
ls -l
gt4py_build_cache_dir -> gt4py_dace_build_cache_dir
gt4py_dace_build_cache_dir
```
One extra variable needs to be set when we want to build and run the Python granule to collect the GT4Py timers:
```
export GT4PY_COLLECT_METRICS_LEVEL=10
```
Inside `build_gpu2py/run` run:
```
sbatch exp.mch_icon-ch1_medium.run
```
The job log is written to a file `LOG.exp.mch_icon-ch1_medium.run.<JOBID>.o`, so you can check the progress:
```
tail -f LOG.exp.mch_icon-ch1_medium.run.367624.o
```
The first run will have to build all gt4py programs, which includes lowering to SDFG and compiling them. This job will take ~30 minutes. You can follow the progress of gt4py lowering by counting the number of build directories inside the GT4Py cache:
```
ls -l build_gpu2py/bin/gt4py_build_cache_dir/.gt4py_cache | wc -l
```
5. Run and collect the GT4Py timers
Make sure that the variables from the previous step are still set:
```
export GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT
export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1
export GT4PY_BUILD_JOBS=64
export PYTHONOPTIMIZE=2
export GT4PY_COLLECT_METRICS_LEVEL=10
```
Then, inside `build_gpu2py/run`, run the same experiment as in the previous step:
```
sbatch exp.mch_icon-ch1_medium.run
```
At the end of the job, you will find a JSON file containing the GT4Py timers:
```
build_gpu2py/experiments/mch_icon-ch1_medium/gt4py_timers.json
```
6. Compare GT4Py timers against OpenACC baseline
The script below compares the performance of the GT4Py backends against the OpenACC baseline.
```
icon4py/scripts/compare_icon_icon4py.py
```
Inside the script, update the JSON file paths for the OpenACC and GT4Py measurements. The OpenACC JSON file for the Santis vCluster can be copied from here:
```
/capstor/scratch/cscs/ioannmag/cycle32/icon-exclaim/build_acc/run/bencher=exp.mch_icon-ch1_medium_stencils=0.373574=ACC.json
```
At the beginning of cycle 32, this is the performance comparison:
[Edoardo] add plot figure
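The core of the comparison can be sketched with plain Python. This is only an illustration: it assumes (hypothetically) that both JSON files map stencil names to accumulated times in seconds; the actual layout consumed by `compare_icon_icon4py.py` may differ.

```python
# Hedged sketch of the timer comparison: assumes each JSON file is a flat
# {stencil_name: total_time_seconds} mapping, which may not match the real
# gt4py_timers.json / OpenACC JSON layout.
import json
from pathlib import Path

def load_timers(path: Path) -> dict[str, float]:
    """Load a {stencil_name: total_time_seconds} mapping from a JSON file."""
    return json.loads(path.read_text())

def speedups(acc: dict[str, float], gt4py: dict[str, float]) -> dict[str, float]:
    """Speed-up of GT4Py over OpenACC per stencil (>1 means GT4Py is faster)."""
    return {name: acc[name] / gt4py[name] for name in acc.keys() & gt4py.keys()}

if __name__ == "__main__":
    # Made-up numbers just to show the output format.
    acc = {"apply_diffusion_to_vn": 1.5, "update_mass_flux_weighted": 0.10}
    dace = {"apply_diffusion_to_vn": 1.0, "update_mass_flux_weighted": 0.12}
    for name, s in sorted(speedups(acc, dace).items()):
        print(f"{name}: x{s:.2f}")
```

Only stencils present in both files are compared, so renamed or split stencils silently drop out; checking the set difference of the two key sets is a useful sanity check.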
#### Time the Stencils in Blueline with feature branch
Suppose you need to benchmark a new transformation which is not yet merged into GT4Py `main`. To test it, you can install GT4Py from your local repository.
The Python virtual environment is located here:
```
icon-exclaim/externals/icon4py/.venv
```
You can activate the Python environment and then install your feature branch:
```
source icon-exclaim/externals/icon4py/.venv/bin/activate
cd <your_local_gt4py_repo>
git switch <your_feature_branch>
uv pip install --editable .
```
#### Time the Stencils in StencilTest
[Ioannis] Describe how to run the StencilTest benchmark and how to look at the results.
1. Clone `icon4py` with necessary branches
```
git clone git@github.com:C2SM/icon4py.git
cd icon4py
git remote add dropd git@github.com:DropD/icon4py.git && git fetch dropd
# or simply `git clone git@github.com:C2SM/icon4py.git`
git checkout expand-continuous-benchmarking # Should be merged to `main` soon
```
2. Install `icon4py` in editable mode
```
# always make sure that the right `uenv` + `view` is started
uenv start --view default icon/25.2:v3
cd icon4py
# will create virtual environment in `.venv` folder
CC=$(which gcc) MPICH_CC=$(which gcc) CXX=$(which g++) MPICH_CXX=$(which g++) uv sync --extra all --extra cuda12 --python $(which python3)
```
3. Run `StencilTest` for getting baseline number
```
uenv start --view default icon/25.2:v3
cd icon4py
source .venv/bin/activate
export LD_LIBRARY_PATH=/user-environment/linux-sles15-zen3/gcc-12.3.0/gcc-13.2.0-4or6n7qyzqwxr3lcsis4e6sqgkc4obtv/lib64:/user-environment/linux-sles15-neoverse_v2/gcc-13.2.0/nvhpc-25.1-tsfur7lqj6njogdqafhpmj5dqltish7t/Linux_aarch64/25.1/compilers/lib:$LD_LIBRARY_PATH # necessary with icon/25.2:v3
export GT4PY_COLLECT_METRICS_LEVEL=10 # 10: use GT4Py CPU timer, 2: use NVTX markers
export GT4PY_BUILD_CACHE_LIFETIME=persistent
export GT4PY_BUILD_CACHE_DIR=$(pwd)/<folder_name> # needs to be removed or chosen differently every time `gt4py` is updated
srun -n 1 -t 30 -p debug pytest -m continuous_benchmarking -k "test_TestVerticallyImplicitSolverAtCorrectorStep[compile_time_domain-at_first_substep[False]__at_last_substep[False]__lprep_adv[True]]" -sv --backend=dace_gpu --grid=icon_benchmark model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_corrector_step.py
# extra options:
#   `--benchmark-only`: skips verification and runs the benchmark only a few times (a bit faster for development iterations)
#   `-k <StencilTest_name>[compile-time-<option>[__<variable_name>[<variable_value>]*]`: optionally selects a certain `StencilTest` with certain compile-time options and parameter values, if the `StencilTest` is parametrizable (run without `-k` to see all the options of the test)
# look at `GT4Py Timer Report` at the end for the runtime per stencil
# look at `$(pwd)/<folder_name>/.gt4py_cache/<stencil_name><hash>/program.sdfg` for `SDFG`
# look at `$(pwd)/<folder_name>/.gt4py_cache/<stencil_name><hash>/src/cpu/<name>.cpp` for `C++` CPU code
# look at `$(pwd)/<folder_name>/.gt4py_cache/<stencil_name><hash>/src/gpu/<name>.cu` for `C++/CUDA` GPU code
```
4. Run `StencilTest` with custom `gt4py`/`DaCe` repos
```
uenv start --view default icon/25.2:v3
git clone https://github.com/GridTools/gt4py
cd gt4py
git checkout <branch_name>
source ../icon4py/.venv/bin/activate
uv pip install -e . # install gt4py in editable mode
# make sure to merge your branch with `main` regularly
cd ..
# probably a custom `DaCe` won't be necessary, but in case it is:
git clone --recursive https://github.com/GridTools/dace # `--recursive` is necessary
cd dace
git checkout <branch_name>
uv pip install -e . # install DaCe in editable mode
cd ..
cd icon4py
export LD_LIBRARY_PATH=/user-environment/linux-sles15-zen3/gcc-12.3.0/gcc-13.2.0-4or6n7qyzqwxr3lcsis4e6sqgkc4obtv/lib64:/user-environment/linux-sles15-neoverse_v2/gcc-13.2.0/nvhpc-25.1-tsfur7lqj6njogdqafhpmj5dqltish7t/Linux_aarch64/25.1/compilers/lib:$LD_LIBRARY_PATH # necessary with icon/25.2:v3
export GT4PY_COLLECT_METRICS_LEVEL=10
export GT4PY_BUILD_CACHE_LIFETIME=persistent
export GT4PY_BUILD_CACHE_DIR=$(pwd)/<new_folder_name> # needs to be removed or chosen differently every time `gt4py` is updated
srun -n 1 -t 30 -p debug pytest -m continuous_benchmarking -k "test_TestVerticallyImplicitSolverAtCorrectorStep[compile_time_domain-at_first_substep[False]__at_last_substep[False]__lprep_adv[True]]" -sv --backend=dace_gpu --grid=icon_benchmark model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_corrector_step.py
# look at `GT4Py Timer Report` at the end for the runtime per stencil and compare with previous results
```
#### How to Add a Custom Backend
The main point of adding a custom backend is to explore different optimization options when using the `StencilTest`.
To do so, open the file `model/common/src/icon4py/model/common/model_backends.py` in ICON4Py.
There you will find the `BACKENDS` `dict`; at the bottom of the file there is an update of that `dict` where the DaCe backends are added.
You can simply create a new entry inside it by using the `make_custom_dace_backend()` function.
Note that, depending on the options you want to pass to `gt_auto_optimizer()`, you might have to modify the `make_custom_dace_backend()` function, which is in GT4Py.
You can then use the `--backend` switch of `pytest` to select that backend.
Applying this in Blueline is currently not possible (Hannes is working on it).
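The registration described above can be sketched as follows. Note that `make_custom_dace_backend()` and the `BACKENDS` `dict` exist in the source, but the stub below, the `dace_gpu_noblock` name, and the `blocking_dim` option are purely illustrative assumptions; the real factory in GT4Py has a different signature and forwards the options to `gt_auto_optimizer()`:

```python
# Hypothetical sketch of the BACKENDS registration in model_backends.py.
# The stub stands in for the real GT4Py factory, which builds an actual
# backend object; here we only record the options to show the mechanics.
def make_custom_dace_backend(gpu: bool, **auto_opt_options):
    """Stub for the GT4Py factory; returns a backend descriptor."""
    return {"device": "gpu" if gpu else "cpu", "auto_opt": auto_opt_options}

BACKENDS = {
    "dace_gpu": make_custom_dace_backend(gpu=True),
}

# A new entry with custom optimization options, selectable via
# `pytest --backend=dace_gpu_noblock` (name and option are made up):
BACKENDS["dace_gpu_noblock"] = make_custom_dace_backend(gpu=True, blocking_dim=None)
```

Keeping one entry per option combination makes A/B comparisons in the `StencilTest` timers straightforward.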
#### How Can I Figure Out Which Fortran Loop-Nests Compose an ICON4Py Stencil?
Check the timers in the Fortran source code; they were introduced in [PR#375](https://github.com/C2SM/icon-exclaim/pull/375).
Note that some stencils are split between different regions of the file.
#### How to Get The SDFG of a Stencil
Run the program as usual, set a `breakpoint()` in `${GT4PY_REPO}/src/gt4py/next/program_processors/runners/dace/workflow/translation.py` after the call to `gt_auto_optimize()`, and call `view()` or `save()` on the SDFG.
> How to handle the specialization?
Just run the test you are interested in with the `-k` parameter.
Here is an example of how to run one:
```bash
pytest -m continuous_benchmarking -k "test_TestVerticallyImplicitSolverAtCorrectorStep[compile_time_domain-at_first_substep[False]__at_last_substep[False]__lprep_adv[True]]" -sv --backend=dace_gpu --grid=icon_benchmark model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_corrector_step.py
```
#### How to write a `gt4py` transformation
Discuss with Philip, Ioannis or Edoardo and/or look at [gt4py DaCe transformations](https://github.com/GridTools/gt4py/tree/main/src/gt4py/next/program_processors/runners/dace/transformations).
### ToDo
- For better comparison the plotting script should also output the speed-up of the different stencils directly.
- The individual runtime of stencils is important but there should also be an overall speed-up.
Either an estimate based on the individual speed-ups and the number of times they are called or measured directly.
Ideally we have both, because this would also allow us to estimate the overhead due to Python.
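The estimate mentioned above can be computed directly from per-stencil data: the aggregate speed-up is the total baseline time divided by the total estimated new time. A minimal sketch (function name and input layout are made up for illustration):

```python
# Overall speed-up estimated from individual stencil speed-ups and call counts:
# total baseline time / total new time, where each stencil's new time is its
# baseline time divided by its measured speed-up.
def overall_speedup(stencils: list[tuple[float, int, float]]) -> float:
    """stencils: (baseline_time_per_call, n_calls, speedup) per stencil."""
    baseline = sum(t * n for t, n, _ in stencils)
    new = sum(t * n / s for t, n, s in stencils)
    return baseline / new

# Example: two equally expensive stencils, one sped up 2x, one unchanged.
print(overall_speedup([(1.0, 10, 2.0), (1.0, 10, 1.0)]))  # 20 / 15 ≈ 1.33
```

Comparing this estimate against a directly measured overall speed-up would expose the Python overhead as the gap between the two numbers.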
### Stencils
Here is a list of the stencils under consideration, who is looking at each one, and the progress report.
Here is [Rico's progress on the integration](https://docs.google.com/spreadsheets/d/1cpNl8EWSdAj0OSqlTBwyPNkHZMzLnLFW0e9JmaJgUA8/edit?gid=0#gid=0).
Current state (26-09-2025):

#### `apply_diffusion_to_theta_and_exner`
> Claimed by: (good start)
#### `calculate_enhanced_diffusion_coefficients_for_grid_point_cold_pools`
> Claimed by:
#### `apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence`
> Claimed by: (good start)
Based on the historical data, this stencil should result in two Maps (due to `concat_where` it might be one more).
The top Map might operate on a range that is too large, because we would need the inverse image.
Furthermore, this stencil is known to have an instability in the code generator which resulted in a performance difference of up to 20%, i.e. $\approx 1.4$ vs. $\approx 1.6$.
However, the fast version occurs more often than the slow one.
#### `apply_diffusion_to_vn`
> Claimed by: Ioannis
Based on the historical data, this stencil should result in a single Map and have a speed-up of $\approx 1.5$.
- [ ] Check why k-blocking isn't beneficial in this stencil even though it should have been
#### `calculate_diagnostic_quantities_for_turbulence`
> Claimed by:
#### `calculate_nabla2_and_smag_coefficients_to_vn`
> Claimed by:
#### `vertical_implicit_solver_at_corrector_step`
> Claimed by: Ioannis
#### `vertical_implicit_solver_at_predictor_step`
> Claimed by: Ioannis
#### `update_mass_flux_weighted`
> Claimed by:
The runtime of this stencil is very small, so a speed-up is not realistic.
However, OpenACC is faster than GTFN, which is faster than DaCe, so we should at least check what is going on there.
#### `apply_divergence_damping_and_update_vn`
> Claimed by:
This stencil is almost twice as fast as OpenACC.
We should figure out why this is the case.
#### `compute_theta_rho_face_values_and_pressure_gradient_and_update_vn`
> Claimed by: Christos / Philip
:warning: **If [`zero_origin` is disabled](https://github.com/C2SM/icon4py/pull/931), the performance drops in this stencil; it is not (yet) clear why** :warning:
##### Observation About Performance Penalty
- In the stencil tests I also see it on `main`/`extended-continious-benchmarking`; however, I have to modify the `start_lateral_edge` domain value, which is apparently 0 on Säntis but 638 in the test (see https://github.com/philip-paul-mueller/icon4py/commit/ffa77c0ab5a9e195902c533581d0e2c2192f9716).
- Almost every run results in the slow version, which differs from what I see in the performance validation.
- Edoardo/Hannes tested it on Santis, and there it appears that the penalty is triggered by setting `zero_field_origin` to `False`.
However, I did some tests locally on my laptop and concluded that the effect is _not_ caused by setting the flag to `False`.
Instead it seems to be genuinely indeterministic behaviour, as I see both versions.
What is super interesting/disturbing is that I see the bad case much more often than the good case.
#### `compute_averaged_vn_and_fluxes_and_prepare_tracer_advection`
> Claimed by:
It is already quite fast, but the comparison with GTFN indicates that there is more to gain.
#### `compute_horizontal_velocity_quantities_and_fluxes`
> Claimed by:
It is already 1 ms faster than OpenACC, but it needs to gain at least another 1 ms to reach the target mark.
#### `interpolate_rho_theta_v_to_half_levels_and_compute_pressure_buoyancy_acceleration`
> Claimed by: Ioannis
:warning: **We need to merge** [icon4py PR#762](https://github.com/C2SM/icon4py/pull/762) :warning:
#### `compute_perturbed_quantities_and_interpolation`
> Claimed by: Ioannis
:warning: **We need to merge** [icon4py PR#762](https://github.com/C2SM/icon4py/pull/762) :warning:
#### `compute_advection_in_horizontal_momentum_equation`
> Claimed by: Philip
It kind of looks okay, but there is a small Map that does very strange things, such as copying a slice of an input into a temporary that is then read.
To remove this kind of pattern we should use the "extended inline/fusing transformation" mentioned above.
#### `compute_advection_in_vertical_momentum_equation`
> Claimed by:
- Some maps that compute the final output could be merged (or probably not split in the first place)
- :warning: There's indeterministic behavior in the map fusion :warning:
#### `compute_contravariant_correction_and_advection_in_vertical_momentum_equation`
> Claimed by:
> In case Edoardo still has the SDFG, it would be good if he could take this one.
Here we are already significantly faster than OpenACC and GTFN.
What is interesting is that in the old version (`2025-09-25`) DaCe was faster than in the new version (`2025-09-26`).
Either the updated MapFusion has negative side effects or we have found (another) instability.
#### `compute_derived_horizontal_winds_and_ke_and_contravariant_correction`
> Claimed by: (good start)
This is already faster than OpenACC, so it is important to check why this is the case.
My (Philip's) guess is that the stencils are compiled into a single kernel.
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
This project is not about making it nice.
## Responsibilities
Edoardo:
- maintain the baseline branch of icon4py
- update the performance comparison plot with latest data
Ioannis:
- provide the OpenACC baseline and help matching stencil tests to OpenACC timers
Philip:
- coordinate optimizations (e.g. how they go into auto optimize)
## Progress
- [ ] Create an optimization baseline branch with the correct versions of gt4py and dace; decide who is the gatekeeper of this branch (Edoardo)
- [ ] Get the stencil test variant merged (even with a subset) (Ioannis -> Rico)
- [ ] Look into adding the variants to the metrics in gt4py (Hannes)
- [ ] perf repository merges (Ioannis)
- [ ] Implement multiple output domains for DaCe (Edoardo)
- [ ] Look into uint16 connectivity tables (Rico)
- [ ] Get set up for measuring performance