# [Blueline] Debug mch_ch2
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by:
- Appetite (FTEs, weeks): huge
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
Debug the problems in the mch_ch2 production experiment:
### Illegal memory access
The program crashes in the first predictor call, in `compute_theta_rho_face_values_and_pressure_gradient_and_update_vn`.
The crash is sensitive to code transformations / how the program is written.
For cases that run past this problem, the next problem appears:
### Model explodes in timestep 274
...
## Progress
- [x] Illegal memory access in the first call to the predictor might be resolved after restricting connectivities to their actual domain size (instead of the nproma size)
- [x] No crash after https://github.com/C2SM/icon4py/pull/791
- [x] Cleanup grid construction https://github.com/C2SM/icon4py/pull/797 (note temporarily https://github.com/C2SM/icon4py/pull/791)
- [x] and shrink connectivities https://github.com/C2SM/icon4py/pull/803
- [x] Fix computation of `kstart_dd3d` or pass from Fortran or delete the performance optimization, see https://github.com/C2SM/icon4py/pull/793
  - [x] We removed the special handling; it was not a bug.
- [x] Figure out if the solution slowly diverges from OpenACC or if there is a concrete event
- [x] Check if the divergence in later timesteps also shows up in verification mode: from timestep 226 on, verification fails, when 4th-order divergence damping becomes active.
- [x] Fix problem in timestep 226 which is due to a scaling issue in `max_nudge_coeff` (missing the factor of 5 for the default physics-dynamics-ratio)
- [x] Implement a clean solution https://github.com/C2SM/icon4py/pull/806
- [x] Synchronization issue (see [below](#Synchronization-issue-with-PYTHONOPTIMIZE)) is likely due to ghex comm not waiting for kernel to finish
- [x] Add a deviceSynchronize before the exchange in ghex.exchange to verify that this fixes the issue -> **this is not enough!** There is still a crash around timestep 1800.
- [x] Add a deviceSynchronize before `wait`
- [x] Implement a clean solution with cudaDeviceSynchronize(): https://github.com/C2SM/icon4py/pull/807
- [ ] The issue around timestep 1800 also appears in build_acc (sometimes)
- [x] Issue always appears with cuda-memcheck (which unfortunately takes 3 hours to run); it also always appears in icon-ch1 with memcheck
- [x] Issue also appears with icon-nwp when compiled with exclaim software stack and scripts
- [x] Issue does not appear for icon-nwp compiled with spack on balfrin
- [x] Issue appears for nblocks_c=1 as well as for nblocks_e=1
- [x] Issue also does not appear on santis with exclaim stack
- [x] Current Hypothesis: Difference in software stack or build scripts/compile flags
- [ ] Remove Liskov
- [ ] Update icon-exclaim to newest icon and try again
- [ ] Debug compile flags
- [ ] Debug exclaim software stack on balfrin
## Status of the project
The crash (illegal memory access) in the first call to the predictor is fixed once connectivities are restricted to their actual domain.
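A minimal sketch of that fix with NumPy, using hypothetical names and sizes (`c2e_padded`, `num_cells`): connectivity tables are allocated in nproma-sized blocks, and only the rows of the actual domain should be handed to the backend, so the fill values in the padded rows are never used as indices.
```python
import numpy as np

# Hypothetical sizes: a cell-to-edge connectivity allocated in nproma-sized
# blocks; rows beyond the actual number of cells hold the fill value -1.
nproma, nblocks, num_cells = 8, 4, 27
c2e_padded = np.full((nproma * nblocks, 3), -1, dtype=np.int32)
c2e_padded[:num_cells] = np.arange(num_cells * 3, dtype=np.int32).reshape(num_cells, 3)

# Restrict the connectivity to its actual domain before handing it over;
# indexing with the -1 fill rows on the GPU leads to illegal memory accesses.
c2e = c2e_padded[:num_cells]
assert (c2e >= 0).all()
```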
The crash in timestep 274 is related to divergence damping order 4 (or order 24 switching to 4): in the computation of the damping fields, `max_nudge_coeff` is not properly scaled (the factor of 5 for the default physics-dynamics ratio is missing).
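As an illustration only (the constant and function names are hypothetical; the value 5 is the default physics-dynamics ratio mentioned above), the fix amounts to applying the missing factor when the nudging coefficient enters the damping fields:
```python
# Hypothetical names; only the missing factor described above is the point.
DEFAULT_PHYSICS_DYNAMICS_RATIO = 5.0


def scaled_max_nudge_coeff(nudge_max_coeff: float) -> float:
    """Scale the nudging coefficient by the default physics-dynamics ratio
    before it is used to compute the divergence damping fields."""
    return nudge_max_coeff * DEFAULT_PHYSICS_DYNAMICS_RATIO
```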
We also have an MPI synchronization issue that manifests as
- failed verification for `vn` after diffusion after timestep 1
- a crash in the second timestep in non-verify mode
The cause is missing synchronization before the MPI exchanges (kernels have not finished when the exchange starts). This can be fixed by adding a cudaDeviceSynchronize and an MPI.Barrier before the exchange.
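A minimal sketch of that workaround, assuming CuPy for the device synchronization and mpi4py for the barrier; `exchange` is a placeholder for the ghex halo-exchange call, not the real API:
```python
import cupy as cp
from mpi4py import MPI


def synchronized_exchange(exchange, *fields, comm=MPI.COMM_WORLD):
    """Run a halo exchange only after all pending GPU kernels have finished.

    Placeholder wrapper: `exchange` stands in for the actual ghex call.
    """
    cp.cuda.runtime.deviceSynchronize()  # wait for kernels writing the halo data
    comm.Barrier()                       # make sure every rank has reached this point
    return exchange(*fields)
```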
Additionally, `build_acc` (and `build_gpu2py`) sometimes crash around timestep 1800 with the following (or a similar) error:
```
integrate_nh: call advection DOM: 1
mo_nwp_ecrad_interface::nwp_ecrad_radiation_reduced: ecrad radiation on reduced grid
upscale_rad_input: Upscaling of radiation input fields 1 => 0
downscale_rad_output: Downscaling of radiation output fields 0 => 1
Failing in Thread:1
Accelerator Fatal Error: call to cuCtxSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
File: /scratch/mch/vogtha/debug_icon_acc/icon-exclaim/src/atm_phy_nwp/mo_phys_nest_utilities.f90
Function: upscale_rad_input:197
Line: 962
srun: error: nid001085: task 1: Exited with exit code 1
srun: Terminating StepId=3021407.0
slurmstepd: error: *** STEP 3021407.0 ON nid001085 CANCELLED AT 2025-07-16T10:24:40 ***
srun: error: nid001085: tasks 0,2-5: Terminated
srun: Force Terminated StepId=3021407.0
+ exit 1
```
## Hannes scratch
### OpenACC crash
- I fixed the `max_nudge_coeff` issue in icon4py in some work directory where I had played with different things (but I think the only change on the Fortran side was some debugging prints); then it crashed around timestep 1800.
- I ran build_acc a few times from the same icon-exclaim directory; it crashed in a similar area. I was happy because I thought it was not my problem anymore. Then I remembered that Will reported he ran it to completion on santis, so:
> Christoph Müller:
> It would be interesting to know if a "faulty" OpenACC binary also has the error for nblocks_c=1
>
> Hannes Vogt:
> Apparently the crash is with nblocks_c=1. It seems I changed that for the mem consumption experiment and never switched it back. Let me now run this with nblocks_e=1.
>
> Hannes Vogt:
> I have the crash for both.
- [ ] The crash is most likely with commit `779d500f370962df8c8ef08c7dc6ae896b5c0c81`. Confirm by building again. See also the diff below
**The above also happens (sometimes) with a clean build, after fixing the submodule problem. Hannes stops investigating icon4py-related issues until the OpenACC build works reliably.**
### Synchronization issue with PYTHONOPTIMIZE
`build_gpu2py_verify` works with
- [x] A
- `GT4PY_BUILD_CACHE_LIFETIME=persistent`
- [x] B
- `GT4PY_BUILD_CACHE_LIFETIME=persistent`
- `GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1`
- [x] D
- `GT4PY_BUILD_CACHE_LIFETIME=persistent`
- `GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1`
- `PYTHONOPTIMIZE=2`
- `GT4PY_DEBUG=1`
- [x] E
- `GT4PY_BUILD_CACHE_LIFETIME=persistent`
- `GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1`
- `PYTHONOPTIMIZE=2`
- `GT4PY_DEBUG=1`
- explicit syncs after each gtfn call
but does not verify in `vn` with
- C
- `GT4PY_BUILD_CACHE_LIFETIME=persistent`
- `GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1`
- `PYTHONOPTIMIZE=2`
- F
- `GT4PY_BUILD_CACHE_LIFETIME=persistent`
- `GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE=1`
- `PYTHONOPTIMIZE=2`
- explicit sync after each granule call
For `build_gpu2py` (non-verify), the model crashes after the first timestep in the configurations that fail verification above (C and F).
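The "explicit syncs after each gtfn call" in configurations E and F were ad-hoc instrumentation. A sketch of what that can look like, assuming CuPy and a generic callable standing in for a compiled gtfn program (not the actual icon4py wrapping code):
```python
import functools

import cupy as cp


def sync_after_call(program):
    """Wrap a compiled program so every call is followed by a device sync,
    mimicking the explicit syncs used in configurations E and F."""

    @functools.wraps(program)
    def wrapper(*args, **kwargs):
        result = program(*args, **kwargs)
        cp.cuda.runtime.deviceSynchronize()
        return result

    return wrapper
```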
### Crash in timestep 274
- Starting from timestep 226, when 4th-order divergence damping is enabled, verification fails.