# CI: Performance measurement

###### tags: `cycle 15`

Developers: Christopher, Sam
Support: Hannes, Christoph
Appetite: half cycle
Shaped by: Christoph, Till

## Background

To optimize the dynamical core, performance measurements should be automated, so that the optimization process stays smooth.

## Goals

* A CI plan which automatically performs wall-clock time measurements with the ICON timers for a branch of icon-exclaim and returns these wall-clock times in a convenient fashion (a list in the Jenkins web interface or a comment in an open PR). It should be possible to easily trigger a run with a specific GT4Py branch.
    - The parsing of the ICON LOG file can be done using the `performance` script of `probtest`, which produces a JSON version of the timings in the LOG file. See https://github.com/MeteoSwiss/probtest/blob/main/engine/performance.py
    - Use of the `performance` script can be seen in the `performance_check` function of the following buildbot script: https://github.com/C2SM/icon-exclaim/blob/icon-dsl/run/checksuite.icon-dev/icon-dev.checksuite#L513
    - We need to introduce one additional Jenkins stage in the icon-gt4py-pr plan (and add one additional `.sh` script) which runs the `performance` script, generating a JSON file that contains the timings. The information in the JSON should then be printed in the CI in a user-friendly way (a sketch of such a printer follows this list).
    - Optionally, the generated JSON files could be stored as build artifacts in Jenkins, making it easier to download them later for comparison with other runs. See https://www.jenkins.io/doc/pipeline/steps/core/#code-archiveartifacts-code-archive-the-artifacts
* A CI plan which automatically performs a wall-clock time measurement for the code generated by the icon4py CUDA backend and returns these wall-clock times in a convenient fashion (a list in the Jenkins web interface or a comment in an open PR). This depends on the [gt4py CUDA backend](https://hackmd.io/zzW6_SqvRE-jrHUmjUG8Pg) being done.
    - Use `pytest-benchmark` to benchmark all stencil functions (see https://pytest-benchmark.readthedocs.io/en/stable/index.html and the example after this list).
    - The benchmarking CI workflow is not triggered on every run. Experiment with whether we can trigger it based on a PR comment (by default `pytest-benchmark` is disabled).
    - We could also experiment with running the performance benchmarks on both the main branch and the current branch: the main-branch run exports its benchmarking data, which is then compared against the current branch's benchmarks using `py.test-benchmark compare` (sketched after this list).
* (optional) A script which triggers a kernel-by-kernel runtime and metrics measurement using NVIDIA tools for a branch of icon-exclaim and outputs the resulting values in a human-readable format (command sketch after this list). As a preparation for this task we need to find out how meaningful and stable the kernel names are.
* (optional) A script which triggers a kernel-by-kernel runtime and metrics measurement using NVIDIA tools for the icon4py CUDA backend (this depends on the [gt4py CUDA backend](https://hackmd.io/zzW6_SqvRE-jrHUmjUG8Pg) being done).
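For the user-friendly printing step of the new Jenkins stage, here is a minimal sketch of what the extra step could run. The file name `timings.json` and the flat `{timer_name: seconds}` schema are assumptions; adapt them to whatever the probtest `performance` script actually emits.

```python
# Minimal sketch of a user-friendly printer for the timing JSON.
# ASSUMPTIONS: the file name `timings.json` and the flat
# {timer_name: seconds} schema are hypothetical.
import json
import sys


def print_timings(path: str) -> None:
    with open(path) as f:
        timings: dict[str, float] = json.load(f)
    width = max(len(name) for name in timings)
    # Slowest timers first, so regressions are visible at the top of the CI log.
    for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{name:<{width}}  {seconds:>10.3f} s")


if __name__ == "__main__":
    print_timings(sys.argv[1] if len(sys.argv) > 1 else "timings.json")
```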
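For the `pytest-benchmark` goal, a minimal sketch of a stencil benchmark using the plugin's `benchmark` fixture. The stencil here is a self-contained NumPy stand-in; in icon4py, the compiled stencil program and its fields would be passed to `benchmark` instead.

```python
# Sketch of a stencil benchmark with pytest-benchmark's `benchmark` fixture.
# `roll_like_stencil` is a NumPy stand-in for a real icon4py stencil.
import numpy as np


def roll_like_stencil(vn: np.ndarray, geofac: float) -> np.ndarray:
    return geofac * (np.roll(vn, 1, axis=0) - vn)


def test_benchmark_roll_like_stencil(benchmark):
    vn = np.random.default_rng(42).random((10_000, 65))
    result = benchmark(roll_like_stencil, vn, 0.5)  # times repeated calls, returns the result
    assert result.shape == vn.shape
```

The plugin's `--benchmark-disable` and `--benchmark-only` flags could implement the "disabled by default, enabled on demand" behaviour needed for the comment-trigger experiment.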
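A sketch of the main-vs-branch comparison: `--benchmark-autosave` and the `compare` command are documented `pytest-benchmark` features, while the test path, branch name, and the `0001`/`0002` run ids (assigned by autosave under `.benchmarks/`) are placeholders.

```bash
git checkout main
pytest --benchmark-autosave atm_dyn_iconam/tests   # baseline run, saved e.g. as 0001

git checkout my-feature-branch                     # placeholder branch name
pytest --benchmark-autosave atm_dyn_iconam/tests   # current-branch run, saved e.g. as 0002

py.test-benchmark compare 0001 0002                # side-by-side report of both runs
```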
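For the optional kernel-by-kernel goals, a command sketch based on the NVIDIA Nsight tools. `./icon_experiment` is a placeholder for the actual model binary or launch script, and the exact summaries printed by `nsys stats` vary with the installed version.

```bash
nsys profile -o icon_run ./icon_experiment   # collect a timeline of the run
nsys stats icon_run.nsys-rep                 # prints summaries, incl. GPU time per kernel

# Per-kernel hardware metrics with Nsight Compute (replays kernels, much slower):
ncu --print-summary per-kernel ./icon_experiment
```

The kernel-name column of these summaries is what the preparation task needs to inspect: whether the generated kernel names are meaningful and stable across runs and rebuilds.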
## Dependencies

All of the goals related to the icon4py CUDA backend can only be worked on once that backend is finished, see [here](https://hackmd.io/zzW6_SqvRE-jrHUmjUG8Pg).

## Non-Goals

* Do not implement automatic data-crunching or plotting of any of the data.

## Known Tasks

* For icon-exclaim, check whether the ICON timers are already ACC-synchronized (there should be a PR coming by ICON22 soon).
* There already exists automated functionality which reads the ICON wall-clock times from a LOG file and writes them into a .csv file. See [here](https://gitlab.dkrz.de/icon/wiki/-/wikis/GPU-development/Validating-with-probtest-and-buildbot-references#performance-test). The same plan can be used to fail a branch/PR if the performance degrades too much (future work).

### TODO

### Stencil by stencil benchmarks (refactoring)

#### Sparse Field Tests (need more work):

- [ ] test_calculate_nabla2_of_theta
- [ ] test_calculate_nabla4
- [ ] test_mo_advection_traj_btraj_compute_o1_dsl.py
- [ ] mo_velocity_advection_stencil_19.py

#### Custom test patterns:

- [ ] test_mo_nh_diffusion_stencil_15.py
- [ ] test_mo_solve_nonhydro_stencil_20.py
- [ ] test_mo_solve_nonhydro_stencil_21.py
- [ ] mo_velocity_advection_stencil_02.py
- [ ] mo_velocity_advection_stencil_16.py

#### np.roll stencils:

- [ ] mo_velocity_advection_stencil_03.py
- [ ] mo_velocity_advection_stencil_06.py
- [ ] mo_velocity_advection_stencil_10.py
- [ ] mo_velocity_advection_stencil_20.py
- [ ] test_mo_solve_nonhydro_stencil_05.py
- [ ] test_mo_solve_nonhydro_stencil_08.py
- [ ] test_mo_solve_nonhydro_stencil_09.py
- [ ] test_mo_solve_nonhydro_stencil_10.py
- [ ] test_mo_solve_nonhydro_stencil_11_upper.py
- [ ] test_mo_solve_nonhydro_stencil_36.py
- [ ] test_mo_solve_nonhydro_stencil_38.py
- [ ] test_mo_solve_nonhydro_stencil_39.py
- [ ] test_mo_solve_nonhydro_stencil_40.py
- [ ] test_mo_solve_nonhydro_stencil_51.py

#### more complex cases:

- [ ] test_mo_solve_nonhydro_stencil_16_fused_btraj_traj_o1.py
- [ ] test_mo_solve_nonhydro_stencil_52.py
- [ ] test_truly_horizontal_diffusion_nabla_of_theta_over_steep_points.py

#### GTFN NeighborTable Error

```
atm_dyn_iconam/tests/conftest.py:58:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py310/lib/python3.10/site-packages/pytest_benchmark/fixture.py:125: in __call__
    return self._raw(function_to_benchmark, *args, **kwargs)
.tox/py310/lib/python3.10/site-packages/pytest_benchmark/fixture.py:147: in _raw
    duration, iterations, loops_range = self._calibrate_timer(runner)
.tox/py310/lib/python3.10/site-packages/pytest_benchmark/fixture.py:275: in _calibrate_timer
    duration = runner(loops_range)
.tox/py310/lib/python3.10/site-packages/pytest_benchmark/fixture.py:90: in runner
    function_to_benchmark(*args, **kwargs)
_external_src/gt4py/src/gt4py/next/ffront/decorator.py:279: in __call__
    backend(
_external_src/gt4py/src/gt4py/next/program_processors/otf_compile_executor.py:35: in __call__
    self.otf_workflow(stages.ProgramCall(program, args, kwargs))(
_external_src/gt4py/src/gt4py/next/otf/workflow.py:144: in __call__
    step_result = getattr(self, step_name)(step_result)
_external_src/gt4py/src/gt4py/next/program_processors/codegens/gtfn/gtfn_module.py:171: in __call__
    connectivity_parameters, connectivity_args_expr = self._process_connectivity_args(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = GTFNTranslationStep(language_settings=LanguageWithHeaderFilesSettings(formatter_key='cpp', formatter_style='LLVM', file_extension='cpp', header_extension='cpp.inc'), enable_itir_transforms=True, use_imperative_backend=False)
offset_provider = {'C2CE': <gt4py.next.iterator.embedded.StridedNeighborOffsetProvider object at 0x7f59b00c4820>, 'C2E': <gt4py.next.ite...at 0x7f59b00c7af0>, 'C2E2CO': <gt4py.next.iterator.embedded.NeighborTableOffsetProvider object at 0x7f59b00c4eb0>, ...}

    def _process_connectivity_args(
        self,
        offset_provider: dict[str, Connectivity | Dimension],
    ) -> tuple[list[interface.Parameter], list[str]]:
        parameters: list[interface.Parameter] = []
        arg_exprs: list[str] = []
        for name, connectivity in offset_provider.items():
            if isinstance(connectivity, Connectivity):
                if connectivity.index_type not in [np.int32, np.int64]:
>                   raise ValueError(
                        "Neighbor table indices must be of type `np.int32` or `np.int64`."
                    )
E                   ValueError: Neighbor table indices must be of type `np.int32` or `np.int64`.
```
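The `ValueError` above is raised because a connectivity's `index_type` is neither `np.int32` nor `np.int64`. A minimal sketch of the likely remedy, assuming the neighbor tables are NumPy arrays built in `conftest.py` (the table below is a stand-in, and where exactly to cast is an assumption):

```python
import numpy as np

# Stand-in for a neighbor table as loaded from a grid file; loaders sometimes
# hand back narrower or platform-dependent integer types, which GTFN rejects.
c2e_table = np.array([[0, 1, 2], [1, 2, 3]], dtype=np.int16)

if c2e_table.dtype not in (np.int32, np.int64):
    c2e_table = c2e_table.astype(np.int32)  # GTFN accepts only np.int32 / np.int64
```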