# [Blueline] Py2F: CI integration, more experiments and verification

<!-- Add the tag for the current cycle number in the top bar -->

- Shaped by: Sam
- Appetite (FTEs, weeks): 2 weeks
- Developers:

## Problem

<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->

We need to enable py2f integration tests on GPU in the CI to ensure the framework works correctly, as there is no easy way of testing locally without a GPU. In addition to the mch_r04b09 experiment, we should also test other experiments to check for consistent performance. It is also important to revisit our verification approach: document its current state and potentially try an extension to it (serializing granule outputs directly and comparing against a reference).

## Appetite

<!-- Explain how much time we want to spend and how that constrains the solution -->

2 weeks

## Solution

<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->

- Enable integration tests on CI that use GPU and the ICON grid:
  - Enable the various integration tests currently disabled in test_cli.py.
  - Run GPU tests that execute embedded Python in Fortran.
  - Execute the diffusion wrapper from Python to detect breaking changes early.
- Test other experiments, including the MCH production run on Balfrin:
  - Run other experiments to check performance consistency, such as APE (prioritize APE) and the MCH production run.
- Verification and debugging approach:
  - Serialize data with serialbox after each diffusion run and compare.
  - Run probtest after the full experiment.
  - Document the way we do verification.
- Better document the build steps for Py2F?
- Test the ITIR -> dace backend instead of ITIR -> gtfn.

## Rabbit holes

<!-- Details about the solution worth calling out to avoid problems -->

[Fill in any potential complications or areas where additional attention may be needed]

## No-gos

<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the appetite or make the problem tractable -->

- Do not write a whole new suite of tests; enable the ones that are currently disabled.
- No implementation of a fully fledged CI that tests the integration of the diffusion granule into ICON.
- Do not come up with a completely new verification framework.

## Progress

<!-- Don't fill during shaping. This area is for collecting TODOs during building. As the first task during building, add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [x] We can run both the APE and MCH experiments in ICON using the diffusion granule.
- [x] Merge this PR to enable integration into ICON: https://github.com/C2SM/icon4py/pull/449
- [ ] Merge this PR to enable calling the diffusion granule from ICON: https://github.com/C2SM/icon-exclaim/pull/268
- [x] Enable integration tests on CI that use GPU and the ICON grid.
- [ ] Verify the diffusion granule using the cuda_verify infrastructure.

# Progress Notes

#### 1. APE Experiment

- **Bug fixes:**
  - Resolved the APE host/device data bug, ensuring smoother data transfer and processing.
  - Established compatibility between the MCH and APE experiments, enabling them to run concurrently without conflicts (currently requires recompilation).

#### 2. CI Integration Tests

- **Local testing:**
  - Successfully ran the integration tests in a local Docker container on sys76 with nvhpc installed, replicating the production conditions closely in a controlled environment.
  - The initial PR with cscs-ci GPU tests was merged, laying a foundation for more comprehensive CI integration.
- **Pending:**
  - Reusing the container after it has been built remains to be implemented; this would reduce setup time for subsequent tests.

#### 3. Running Solve Non-Hydro

- **Test and debug:**
  - Disabled the malfunctioning stencils and ran the solve non-hydro JW test successfully.
  - Assisted Chia Rui in debugging the remaining solve non-hydro stencils, allowing us to run JW in icon4py on GPU.
- **Issue resolution:**
  - Solved the nanobind issue, which was critical for the stability of the solve non-hydro module.
  - Attempted to rerun the JABW tests after merging `py2f-with-optimisations`, to check that the recent optimisations are effective.

#### 4. Performance Investigations

- **GT4Py optimisations for the Python branch:**
  - **Diffusion timings:**
    - ICON OpenACC: 0.00158 s per timestep
    - ICON Py2F: 0.00203 s per timestep
    - icon4py: 0.001121 s per timestep on average (100 executions)
  - The diffusion runtime when called from Python is faster than when called from Fortran, primarily because of the Fortran/CFFI overheads. This suggests that the performance slowdowns in the dycore module are not caused by the gt4py stencils but by other factors.
- **Dycore granule, JW test case:**
  - Using the `optimisations-for-icon4py` branch and disabling logging are crucial for achieving optimal performance.
  - **Execution times:**
    - Diffusion execution is faster than OpenACC in the MCH experiment but slower in the JW test case (icon-nwp); this variability is expected given the different experimental conditions.
  - **TimeLoop:**
    - Average per timestep: 0.04299 s (11 runs).
    - Dycore substepping showed significant variability, likely due to differences in the experimental setup.
  - **Diffusion:**
    - JW test case: 0.00050 s per timestep, faster than the MCH experiment at 0.00156 s per timestep.
    - icon4py diffusion: consistently around 0.0012 s per timestep.
  - **Comparison:**
    - Non-hydro solve in the ICON DSL: 0.00525 s per timestep (0.02625 s with substepping) versus about 0.04 s per timestep for the icon4py dycore, i.e. the icon4py dycore is roughly 50% slower than the OpenACC version.

#### 5. Verification of MCH Experiment

- **Verification against the OpenACC build:**
  - The initial verification failed because `nudge_max_coeff` was multiplied by a factor of 5 twice: the value was both hardcoded in ICON and passed directly to diffusion.py, so the scaling was applied on both sides and the final results were wrong.
  - The fix was to set the DYNAMICS FACTOR to 1 on one side, ensuring the value is scaled exactly once and preventing the double multiplication.
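The `nudge_max_coeff` failure above can be illustrated with a small sketch. This is purely illustrative: the function names, the namelist value, and the use of 5 as the substep factor are assumptions; only the double-scaling pattern itself comes from the notes.

```python
# Hypothetical illustration of the double-scaling bug: if both the Fortran
# side and the Python granule apply the dynamics substep factor to
# nudge_max_coeff, the coefficient ends up scaled by factor**2 instead of
# factor. Names and values are illustrative, not taken from the ICON source.

DYNAMICS_SUBSTEP_FACTOR = 5.0  # illustrative factor


def scale_on_fortran_side(nudge_max_coeff: float, factor: float) -> float:
    """Scaling applied in ICON before passing the value through py2f."""
    return nudge_max_coeff * factor


def scale_on_python_side(nudge_max_coeff: float, factor: float) -> float:
    """Scaling applied again in diffusion.py on the received value."""
    return nudge_max_coeff * factor


raw = 0.075  # illustrative namelist value

# Buggy path: both sides apply the factor, so the value is scaled twice.
buggy = scale_on_python_side(
    scale_on_fortran_side(raw, DYNAMICS_SUBSTEP_FACTOR), DYNAMICS_SUBSTEP_FACTOR
)

# Fixed path: one side uses a factor of 1, so the value is scaled exactly once.
fixed = scale_on_python_side(
    scale_on_fortran_side(raw, DYNAMICS_SUBSTEP_FACTOR), 1.0
)

assert buggy == raw * 25.0  # scaled twice: factor**2
assert fixed == raw * 5.0   # scaled once, as intended
```

The general lesson is that a scaling constant should be owned by exactly one side of the Fortran/Python boundary; the other side should receive the already-scaled value (or a factor of 1).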
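For the serialize-and-compare verification step mentioned in the Solution, the comparison of a granule output field against a reference might look like the following sketch. `fields_match`, the tolerances, and the sample arrays are assumptions for illustration only; reading the fields back from serialbox savepoints (and probtest's statistical checks) is deliberately left out.

```python
import numpy as np


def fields_match(candidate: np.ndarray, reference: np.ndarray,
                 rtol: float = 1e-12, atol: float = 1e-14) -> bool:
    """Element-wise comparison of a serialized output field against a
    reference field, treating NaNs in matching positions as equal."""
    if candidate.shape != reference.shape:
        return False
    return bool(np.allclose(candidate, reference,
                            rtol=rtol, atol=atol, equal_nan=True))


# Usage with illustrative data standing in for fields read back from
# serialization (e.g. vn, w, theta_v after a diffusion step):
reference = np.random.default_rng(0).normal(size=(10, 65))
candidate = reference + 1e-15  # within tolerance

assert fields_match(candidate, reference)
assert not fields_match(candidate + 1e-6, reference)  # out of tolerance
```

Tight tolerances like these only make sense when comparing bit-similar builds; comparing GPU against CPU runs, or different compilers, typically needs looser, probtest-style statistical bounds.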