# [ICON4Py] Benchmarking (2nd try)
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by: Philip, Magdalena
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->
We want to automatically track performance in ICON4Py. To this end, benchmark runs for entire granules and for individual stencils should be executed regularly and the results uploaded to bencher.
For stencil performance there is a custom repository that the DaCe performance engineers have been using to track the performance of the DaCe backend for ICON4Py stencils and to compare it against the OpenACC ICON baseline. This repository should become obsolete and its functionality should move into the ICON4Py repository.
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
## Solution
<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->
### granule benchmarks
- Diffusion see [PR-747](https://github.com/C2SM/icon4py/pull/747)
- add an analogous test for the dycore
- Run them on the Benchmark grids:
- [MCH CH2][icon-ch2] production
- [MCH medium size][mch-medium]
- [R02B07][R02B07]
- The backend should run in non-blocking (async) mode, with a sync added after `granule.run` (see the sketch after this list)
- run the tests in CI and upload the results to [bencher]
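
A minimal sketch of what such a granule benchmark could look like with the pytest-benchmark fixture; `diffusion_granule` and `device_sync` are placeholder fixtures, not the current icon4py test API:

```python
# Sketch only: `diffusion_granule` and `device_sync` are placeholder fixtures.
import pytest


@pytest.mark.benchmark(group="granules")
def test_benchmark_diffusion_granule(benchmark, diffusion_granule, device_sync):
    def run_and_sync() -> None:
        # with a non-blocking backend this call only launches the kernels ...
        diffusion_granule.run(dtime=10.0)
        # ... so synchronize before the timer stops, otherwise we measure
        # launch overhead instead of the actual granule run time
        device_sync()

    benchmark(run_and_sync)
```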
### StencilTests
We want to re-use the benchmark option of the StencilTest to reproduce the setup of the Benchmark-repo:
- add compile-time args to the StencilTest
- allow the StencilTest to run several code paths, i.e. run a test for each relevant compile-time variant
- Allow parametrization of the stencil test over the relevant compile-time (and possibly runtime) variants; all parametrized tests should also run and pass in verification mode. (For verification this might require fixing the numpy references.)
- Separate verification runs from benchmark runs in StencilTests (the reason for this is that we want to use separate grids for validation and for benchmarking):
- Stencil Benchmarks should run on
- [Medium-sized grid for single-node MCH][mch-medium]
- [MCH ICON-CH2][icon-ch2] production grid
- [R02B07][R02B07]
- Verification should run on
- [MCH-ICON2-Small][icon-ch2-small]
- [R02B04][R02B04]
- customize the backend for single-stencil timings (see the sketch after this list):
- make sure caching is enabled and the first run (compilation!) is excluded from the benchmark timings (the `benchmark` fixture might already do some warm-up runs)
- make sure the backend does not launch asynchronously but waits at the end of a stencil run (or run the async backend and wait explicitly in the benchmark code)
- for other options - see the backend parametrization in the [Benchmark-repo][benchmark-repo]
- configure the pytest-benchmark fixture to our needs (possibly already done)
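
A rough sketch of how a single-stencil timing could be set up with the pytest-benchmark marker, covering the warm-up/compilation and synchronization points above; `compiled_stencil`, `stencil_args` and `device_sync` are placeholder fixtures and the numbers are illustrative:

```python
# Sketch only: the fixtures are placeholders; the marker options are standard
# pytest-benchmark settings.
import pytest


@pytest.mark.benchmark(
    group="stencils",
    warmup=True,      # warm-up iterations absorb the first (compiling) call
    min_rounds=10,
    disable_gc=True,
)
def test_benchmark_sample_stencil(benchmark, compiled_stencil, stencil_args, device_sync):
    def run_and_sync() -> None:
        compiled_stencil(**stencil_args)
        # wait for the device so an async backend does not skew the timing
        device_sync()

    benchmark(run_and_sync)
```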
### bencher
- Run with the DaCe and GTFN backends on the benchmark grids (see above)
- reset the data in [bencher][bencher]
- add the OpenACC baseline (could be added as a [pytest benchmark][pytest-benchmark] run to the repo, then use `--benchmark-compare`; see the sketch after this list)
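
A hedged sketch of the comparison workflow, assuming the OpenACC baseline can be stored as a saved pytest-benchmark run; the test directory, run id and file name are illustrative:

```python
# Sketch only: paths and the saved-run id are illustrative; the flags are
# standard pytest-benchmark options.
import pytest

# One-off: record the baseline and store it as a saved run in the repo
# (e.g. `pytest model -k benchmark --benchmark-autosave`).

# Regular CI run: compare against the saved baseline and write the JSON
# report that is then uploaded to bencher.
pytest.main([
    "model",
    "-k", "benchmark",
    "--benchmark-compare=0001",
    "--benchmark-json=benchmark_result.json",
])
```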
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
[icon-ch2]: https://github.com/C2SM/icon4py/blob/6c97904d8bf3363703f1fc4fdfe4b673d79025ef/model/testing/src/icon4py/model/testing/definitions.py#L82
[mch-medium]: https://github.com/C2SM/icon4py/blob/6c97904d8bf3363703f1fc4fdfe4b673d79025ef/model/testing/src/icon4py/model/testing/definitions.py#L98
[R02B07]: https://github.com/C2SM/icon4py/blob/6c97904d8bf3363703f1fc4fdfe4b673d79025ef/model/testing/src/icon4py/model/testing/definitions.py#L74
[R02B04]: https://github.com/C2SM/icon4py/blob/6c97904d8bf3363703f1fc4fdfe4b673d79025ef/model/testing/src/icon4py/model/testing/definitions.py#L65
[icon-ch2-small]: https://github.com/C2SM/icon4py/blob/6c97904d8bf3363703f1fc4fdfe4b673d79025ef/model/testing/src/icon4py/model/testing/definitions.py#L90
[bencher]: https://bencher.dev/console/organizations/c2sm/projects
[pytest-benchmark]: https://pytest-benchmark.readthedocs.io/en/latest/
[benchmark-repo]: https://github.com/philip-paul-mueller/benchmark_2
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
- [x] Subtask A
- [x] Subtask X
- [ ] Task 2
- [x] Subtask H
- [ ] Subtask J
- [ ] Discovered Task 3
- [ ] Subtask L
- [ ] Subtask S
- [ ] Task 4
## SCRATCH
| scope \ mode | data test | standalone |
|:------------ | --------- | ---------- |
| unit test | -k 'datatest' | StencilTest |
| granule test | -k 'datatest' | Yilu's PR |
----
```python
# NOTE: the import paths below are assumptions about the icon4py module layout.
from typing import Any

import numpy as np
import pytest

import gt4py.next as gtx

from icon4py.model.common import dimension as dims
from icon4py.model.common.grid import base
from icon4py.model.common.utils import data_allocation as data_alloc
from icon4py.model.testing import helpers
from icon4py.model.testing.helpers import StencilTest


# `make_custom_backend` is a sketch of the planned backend-customization hook.
backend_variant1 = make_custom_backend(param1=True, param2="foo")
backend_variant2 = make_custom_backend(param1=False, param2="bar")
# run with: pytest ... --backend model.backends.module_foo.backend_variant1 --backend variant2


class SampleStencilTestForDace(StencilTest):
    PROGRAM = helpers.average_two_vertical_levels_downwards_on_cells
    OUTPUTS = ("average",)
    # map each compile-time variant to the arguments that are static for it
    STATIC_ARGS = {
        "variant1": None,
        "variant2": ("flag", "horizontal_start"),
    }

    # second step
    # @pytest.fixture
    # def backend(self, backend):
    #     if backend == dace_gpu:
    #         return backend_variant1
    #     return backend

    @staticmethod
    def reference(
        connectivities: dict[gtx.Dimension, np.ndarray],
        input_field: np.ndarray,
        **kwargs: Any,
    ) -> dict:
        shp = input_field.shape
        res = 0.5 * (input_field + np.roll(input_field, shift=-1, axis=1))[:, : shp[1] - 1]
        return dict(average=res)

    @pytest.fixture
    def input_data(self, grid: base.Grid) -> dict:
        input_field = data_alloc.random_field(grid, dims.CellDim, dims.KDim, extend={dims.KDim: 1})
        flag = 1
        horizontal_start = grid.get_horizontal_start(...)
        result = data_alloc.zero_field(grid, dims.CellDim, dims.KDim)
        return dict(
            input_field=input_field,
            average=result,
            flag=flag,
            horizontal_start=horizontal_start,
            horizontal_end=gtx.int32(grid.num_cells),
            vertical_start=gtx.int32(0),
            vertical_end=gtx.int32(grid.num_levels),
        )
```