# [Green line] Benchmarking ICON4Py

###### tags: `cycle 17`

- Shaped by:
- Appetite (FTEs, weeks): full cycle
- Developers:

## Problem

In the previous cycle a lot of time was spent on collecting and fixing pieces of ICON4Py and GT4Py to allow testing of optimizations.

The goal of this project is to get everything in place for reliable performance measurements of ICON4Py stencils: from input data (realistic sizes for GPUs) to CI providing the benchmark results.

Components:

- CSCS CI for ICON4Py: gives access to the target hardware and solves the GitHub Actions minutes problem on the C2SM organization (see [not started project from cycle 16](https://hackmd.io/@gridtools/HypK6jCK3))
- GT4Py readiness: GT4Py needs to fully support GPU execution from Python
- ICON4Py readiness:
    - a reasonably sized mesh needs to work with the benchmark infrastructure
    - all stencils need to be fused
    - GPU execution is implemented
- DaCe: compare out-of-the-box performance of the DaCe backend to the gtfn backend

## Appetite

Full cycle; may require different developers for different pieces of the project.

## Solution for the sub projects

### CSCS CI

1. Implement CSCS CI infrastructure for the icon4py repository (Rico has the know-how).
2. Introduce branch protection rules, such that PRs can only be merged if tests and QA actions are green.
3. Test CPU and (once available) GPU backends
    - parametrize tests on backend
    - extend the tox setup to allow running with and without GPU, similar to GT4Py
4. Run benchmark tests on CI
5. Optional: Once the greenline is fully merged, transfer the test data to CSCS servers and run the `pytest.mark.datatests` on the CI. Data should be available on the container where the tests are run.

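
The backend parametrization from step 3 could be sketched roughly as follows. This is only a minimal sketch under assumptions: the backend labels are plain strings standing in for GT4Py backend objects (`gtfn_cpu` is a hypothetical label), GPU availability is approximated by checking whether CuPy is importable, and `ICON4PY_DISABLE_GPU` is a hypothetical environment variable that a tox environment could set to force a CPU-only run.

```python
from __future__ import annotations

import importlib.util
import os

# Hypothetical labels; real tests would use the GT4Py backend objects instead.
CPU_BACKENDS = ["gtfn_cpu"]
GPU_BACKENDS = ["gtfn_gpu"]


def gpu_available() -> bool:
    """Guess whether GPU tests can run (assumption: CuPy importable means GPU)."""
    # Hypothetical switch that a CPU-only tox environment could set.
    if os.environ.get("ICON4PY_DISABLE_GPU") == "1":
        return False
    return importlib.util.find_spec("cupy") is not None


def backends_to_test(with_gpu: bool) -> list[str]:
    """Return the backend labels a test should be parametrized over."""
    return CPU_BACKENDS + (GPU_BACKENDS if with_gpu else [])
```

In a pytest suite the resulting list could be passed to `pytest.mark.parametrize("backend", backends_to_test(gpu_available()))`, so that the same test body runs once per backend and the GPU case is simply absent on machines without a GPU.
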
### GT4Py readiness

From the GT4Py side we require 2 features to be able to run on GPUs:

- GPU backend (see https://github.com/GridTools/gt4py/pull/1325)
- GPU storage allocation (see https://github.com/GridTools/gt4py/pull/1319)

Additionally, remaining work on temporaries should be made available.

### ICON4Py readiness

- Fuse all remaining stencils and provide the test setup that allows the benchmarking mode
    - Stencils with output fields that have nlev and nlevp1 number of vertical layers need to be split into 2 field_operator calls (covered in [this](https://hackmd.io/PXXPT5gNRx-JJfYbp0jxfg) project)
- The test infrastructure should work with the simple mesh and a serialized mesh
    - test references need to be correct for any mesh (see this [unmerged commit](https://github.com/C2SM/icon4py/commit/f3b3042b8ad12ff6c8d20da3bc666f85932b346e) for some existing fixes that can serve as a starting point)
    - serialized test data needs to be made available on the relevant systems
- The mesh interfaces of `SimpleMesh` and `IconGrid` are different and need to be homogenized. Ideally both interfaces are consolidated together with a domain scientist such that all code using the meshes (tests, greenline) only needs to be changed once. Till can give an overview of the main differences.
- GPU allocation of Fields is set up:
    - ICON4Py infrastructure should replace `np_as_located_field` by the new mechanism that will be introduced in https://github.com/GridTools/gt4py/pull/1319. The backend needs to be passed to the allocation function.
    - We will keep the input data random. "Real" (serialized) data from ICON runs will be introduced later.

### DaCe

Add the features to the DaCe backend that are required to get a performance baseline for icon4py stencil kernels at the end of the cycle.

The performance baseline should consist of a translation from ITIR to DaCe SDFG which is functionally correct; no optimization or transformation needs to be applied at this stage.
It should support all icon4py stencil kernels in the dynamical core.

Features:

- **Complete reductions.** The implementation of the reduction operator in the baseline is not sufficient to run the icon4py stencils. A more generic reduction operator, based on write-conflict resolution memlets, is proposed in PR [#1332](https://github.com/GridTools/gt4py/pull/1332), but needs to be finalized and reviewed.
- **Tuple outputs.** Many icon4py stencil kernels produce multiple output fields. An SDFG can only have one output node, and so far in DaCe one node has been limited to contain only one array (one field). SPCL has recently introduced an API to represent nested data types, a concept similar to C structs. This data type should allow us to define SDFG nodes in DaCe containing multiple arrays, and therefore to represent tuples of fields.

### Combine ICON4Py benchmarking

The goal is to have a table with performance data automatically generated in CI. This can be used as a baseline for performance optimizations in GT4Py.

- Machines: clariden with A100
- Backends: gtfn_gpu, DaCe

## Rabbit holes

## No-gos

Implementing a full solution for storing historic data and plotting collected metrics was a huge time sink in the past. We envision having a proper setup in the future (e.g. elastic); therefore we explicitly skip storing benchmark data and plotting the results. We aim for a minimal solution that provides data in text format, e.g. CSV or JSON.

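
To illustrate how small such a text-only solution could be, here is a minimal sketch using only the Python standard library. All names (`time_stencil`, `collect`, `to_csv`, `to_json`) and the column layout are hypothetical and not part of ICON4Py or GT4Py; the stencils are represented by arbitrary zero-argument callables.

```python
from __future__ import annotations

import csv
import io
import json
import statistics
import time
from typing import Callable

# Hypothetical column layout of the generated table.
FIELDNAMES = ["stencil", "backend", "median_s", "min_s", "repeats"]


def time_stencil(fn: Callable[[], object], repeats: int = 5) -> dict:
    """Run fn several times and return simple wall-clock statistics."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "repeats": repeats,
    }


def collect(stencils: dict[str, Callable[[], object]], backend: str,
            repeats: int = 5) -> list[dict]:
    """Time every callable and tag each row with stencil name and backend."""
    return [
        {"stencil": name, "backend": backend, **time_stencil(fn, repeats)}
        for name, fn in stencils.items()
    ]


def to_csv(rows: list[dict]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows: list[dict]) -> str:
    """Serialize the rows as a JSON array."""
    return json.dumps(rows, indent=2)
```

A CI job could call `collect` once per backend and print or upload the result of `to_csv` as an artifact. For GPU backends the timed callable would additionally have to wait for the device to finish, which is left out of this sketch.
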
## TODOs

- [x] CICD
    - [x] Basic setup (Rico, Sam following along)
    - [x] Enable running stencil by stencil benchmarks
    - [x] Rewrite GitHub workflows to focus on code quality
    - [x] Enable running with GTFN backend
- [x] GT4Py readiness
    - [x] Rico pushes Peter to get the GPU backend PR merged
    - [x] Hannes works on GPU storage allocation
- [ ] ICON4Py readiness
    - [ ] Fuse stencils
        - [ ] Christoph prepares the work
    - [x] GPU backend infrastructure (currently blocked by GPU PRs)
    - [x] Mesh interface (blocked by merge)
- [ ] DaCe (Edoardo)
    - [x] Reductions
    - [x] Tuples
    - [ ] Temporaries
    - [ ] Offset providers

### Stencil Fusion

Note: The following dycore switches can be kept/ignored:

* only support idiv_method 1, delete control flow
* itime_scheme we keep flexible: requires porting of 1 additional stencil for itime_scheme 6
* divdamp_type: we support 32 and 3
* divdamp_order: we support 2, 4, and 24
* igradp_method: we use 3, throw everything else out
* nesting should be ported sometime in the future, for now remove all switches

#### References to implement

##### Not integrated into ICON

- [x] model/atmosphere/dycore/tests/test_fused_velocity_advection_stencil_1_to_7.py, blocked by nlev+1
- [x] model/atmosphere/dycore/tests/test_fused_velocity_advection_stencil_8_to_14.py, blocked by nlev+1

##### Integrated into ICON

- [x] model/atmosphere/dycore/tests/test_fused_velocity_advection_stencil_15_to_18.py
- [x] model/atmosphere/dycore/tests/test_fused_velocity_advection_stencil_19_to_20.py
- [x] model/atmosphere/dycore/tests/test_apply_diffusion_to_theta_and_exner.py
- [x] model/atmosphere/dycore/tests/test_apply_diffusion_to_vn.py
- [x] model/atmosphere/diffusion/diffusion_tests/test_apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence.py
- [x] stencils 39 - 40 on cells
    * contains nlevp1, however only has one output field with vertical length nlevp1, so not an issue
    * both only istep 1

#### Stencils to fuse

- [ ] stencils 01 - 13 on cells
    * stencils 1 - 9 are istep 1
    * stencil 10 is istep 2
    * stencils 11 - 13 are istep 1
    * **contains nlevp1, only possible after half level fix**
    * talk to Christoph about stencil 1 when taking this task
- [ ] stencils grad_green_gauss_cell to 28 on edges
    * Yes, grad_green_gauss_cell is on cells, but it can be merged with the advection stencil so it can be in this fused block
    * grad_green_gauss_cell, 14, 15, 16_fused_btraj are istep 1
    * 17 is istep 2
    * 18 - 22 are istep 1
    * 23 is istep 2
    * 24 is istep 1
    * 25 - 4th_order_divdamp is istep 2
    * 28 is always
- [x] stencil 29 is on boundary and standalone - no need to do anything
- [ ] stencils 30 to 38 on edges
    * stencil 30 is istep 1
    * stencil 31 is istep 2
    * stencil 32 is always
    * stencil 33 is istep 2
    * stencil 34 is always
    * stencils 35 - 38 are istep 1
    * **contains nlevp1, only possible after half level fix**
    * very similar to vel adv 1 to 7
- [ ] stencils 41 - 60 on cells
    * **contains nlevp1, only possible after half level fix**

### Discuss points next cycle

#### Index type unform /upper/lower /index interval types