# [Blueline] Granule integration: Dycore verification, build system, granule interface improvements

Continuation of: https://hackmd.io/NwlnwrDQTAGnjzuSkNtKQw

- Shaped by: Christoph, Daniel, Hannes
- Appetite (FTEs, weeks): 1 cycle
- Developers:

## Problem

Currently missing:

- Dycore/diffusion granule verification for multiple processes
- Passing probtest
- An icon4py-consistent approach for handling double buffers
- Fully automated run script generation
- Build scripts for verification and substitution mode
- Optional arguments of the granule
- Overhead optimization
- Switch from the gtfn to the dace backend (as soon as it is available), then repeat all tests

We would also like to discuss the fundamental design of the granule interface.

## Appetite

1 cycle

## Solution

#### Dycore/diffusion granule verification

- Verify and probtest the dycore granule (CPU, GPU) for multiple processes
- Ensure we can execute an experiment with both granules integrated at the same time.
- There should be datatests in place on the icon4py side which are able to test parallel runs.

#### Find an icon4py-consistent approach for handling double buffers

- Ideally use the same mechanism as for green-line runs to handle double buffers
- This requires updating icon4py to the newest main

#### Fully automated runscript generation

Currently one has to copy-paste some of the parameters for the granule interface into ICON experiment `.run` scripts. This should be fully automated. Since ICON runscript generation is meta-bash (bash generating bash), one has to figure out how to generate the following:

```
# Path to your boost installation
export BOOST_ROOT=/user-environment/env/icon/include/boost/

# Path to the Python interpreter in your icon4py virtual environment
export PATH=/scratch/mch/cmueller/debug_granule_blue/icon4py/.venv/bin:$PATH

# Path to shared libraries in user environment
export LD_LIBRARY_PATH=/user-environment/env/icon/lib64:$LD_LIBRARY_PATH

# Whether running on CPU or GPU
export ICON4PY_BACKEND=CPU

export GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT
export GT4PY_BUILD_CACHE_DIR=/scratch/mch/cmueller/debug_granule_blue/cache/
```

```
export BOOST_ROOT=<path to boost> -> uenv
export ICON4PY_VENV=<path to venv with icon4py>
```

#### Build scripts for verification and substitution mode

- Introduce `build_cpu2py`, `build_gpu2py`, `build_cpu2py_verify` and `build_gpu2py_verify` as build modes in `setup.sh` of `icon-exclaim`.

#### Optional arguments of the granule

- One idea is to pass a nullptr if the argument does not exist and only generate gt4py fields for non-null pointers (see the sketch after this section)
- It is not clear whether another approach would be superior

#### Overhead optimization

- Evaluate whether overhead optimization is necessary
- Create a dictionary with re-usable GT4Py fields for both `now` and `new`, so we can get rid of the overhead of constructing fields at every timestep (see the sketch below). Be aware of the swapping pattern of the dycore.
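A minimal sketch of both ideas, assuming fields are identified by the integer address of their Fortran buffer (`_fields`, `get_field` and `wrap_field` are hypothetical names, not existing py2fgen API). An optional argument arriving as a nullptr simply skips field construction; keying the cache on the buffer address means the dycore's `now`/`new` swap just selects the other cached entry instead of rebuilding anything:

```python
from typing import Any, Callable, Optional

# Hypothetical cache of constructed GT4Py fields, keyed by the address of the
# underlying Fortran buffer. The dycore swaps the `now` and `new` buffers each
# timestep, so both addresses end up in the cache and the swap selects the
# other entry; no field is ever constructed twice.
_fields: dict[int, Any] = {}


def get_field(addr: int, wrap_field: Callable[[int], Any]) -> Optional[Any]:
    """Return the GT4Py field for `addr`, constructing it at most once.

    `wrap_field` stands in for the actual ptr -> gt4py field construction.
    """
    if addr == 0:
        # Optional argument passed as nullptr: no field is generated.
        return None
    field = _fields.get(addr)
    if field is None:
        field = _fields[addr] = wrap_field(addr)
    return field
```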
#### ITIR -> dace

- Test the (ITIR -> dace) backend instead of (ITIR -> gtfn)

#### Discuss fundamental design of the granule interface

- This should only be done after the current interface is fully working (i.e. after completing all of the above tasks except overhead optimization), so that we are able to see the necessary complexity
- The interface re-design should then start at the requirements-evaluation stage
- One idea is to use interfaces provided by the ICON community, such as the community interface or the memory manager

## Rabbit holes

## No-gos

## Future cycles

## Progress

(`?` marks work that is in progress but not yet merged)

- py2fgen/bindings:
  - [x] Pass sizes for all arrays
  - [x] Support for optional fields
  - [ ] Follow-up cleanup after the [cleanup continued PR](https://github.com/C2SM/icon4py/pull/665)
  - [x] Make generated code CPU/GPU agnostic
  - [x] Remove `xp = config.array_ns` from the Python template
  - [ ] Use ICON4Py dtype defines (wpfloat, etc.)
  - [?] Support for numpy arrays
  - [x] Pass the ICON4Py backend from Fortran
    - [ ] ~~via namelist?~~
  - [ ] Add a version check for the bindings
  - [ ] Move wrappers to a bindings package
  - [?] Cache fields and avoid reconstruction if pointers didn't change: see below
- [x] diffusion verifying:
  - [x] MCH and APE experiments fully verified
- [x] dycore verifying:
  - [x] merged with icon4py main (first time step works)
  - [x] swapping
  - [x] parallel runs
  - [x] MCH and APE experiments fully verified
- [x] Infrastructure
  - [x] Generate run script exports automatically
  - [x] Add verification and substitution modes for granules
- [ ] Passing probtest
- [ ] Switching to the DaCe backend

## Scratch

### Bindings cleanup

- ~~Correct field sizes instead of size via dimensions~~
  - Done: pass the dimension sizes to py2fgen instead, to make it more robust and more independent of Fortran ICON.

### GPU

Why doesn't cupy find the CUDA toolkit? We probably need to set the following variables to point to the uenv CUDA path,

```
export CUDA_PATH=/usr/local/cuda-9.2
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
```

as described in https://docs.cupy.dev/en/stable/install.html#cupy-always-raises-cupy-cuda-compiler-compileexception.

`uv pip install nvidia-cuda-runtime-cu12==12.3.*` works as a workaround.

### Flexible py2fgen

#### Idea

The generated (Python) code only ever constructs numpy/cupy arrays; the wrapping into fields is done on the Python side. `functools.cache` will take care that wrapping is only done once per pointer.

#### Design

- From Python's perspective a field can be `HOST` or `MAYBE_DEVICE` (meaning on device if compiled with OpenACC, otherwise on host)
- Pass an extra implicit `on_gpu` boolean flag from Fortran, which will create cupy arrays for `MAYBE_DEVICE` fields and numpy arrays in all other cases
- The function that converts ffi pointers to numpy/cupy arrays will be `functools.cache`d (parameters are `ptr`, `sizes`, `dtype`); see the sketch below
- The user can attach translation functions (possibly re-using the to_array functions of py2fgen) to user-specific data structures, such as gtx.Fields, in a decorator on the bindings function; this decorator also needs `functools.cache` to avoid reconstruction for already-known numpy/cupy arrays.
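A minimal sketch of the cached conversion, assuming the ffi pointer is first cast to an integer address (e.g. via `ffi.cast("intptr_t", ptr)`) so that the cache key is hashable; `to_array` and the surrounding setup are illustrative, not the actual generated bindings:

```python
import functools

import cffi
import numpy as np

ffi = cffi.FFI()  # in the generated bindings this would be the module-level ffi object


@functools.cache  # (addr, size, dtype, on_gpu) is hashable, so each buffer is wrapped only once
def to_array(addr: int, size: int, dtype: str, on_gpu: bool):
    """Wrap a Fortran buffer as a numpy or cupy array, without copying."""
    nbytes = size * np.dtype(dtype).itemsize
    if on_gpu:
        import cupy as cp  # only needed for MAYBE_DEVICE fields in GPU builds

        # Wrap device memory that we do not own; ICON keeps the buffer alive.
        mem = cp.cuda.UnownedMemory(addr, nbytes, owner=None)
        return cp.ndarray((size,), dtype=dtype, memptr=cp.cuda.MemoryPointer(mem, 0))
    # Host case: view the buffer through cffi, again without copying.
    return np.frombuffer(ffi.buffer(ffi.cast("void *", addr), nbytes), dtype=dtype)
```

The user-level decorator that lifts these arrays into gtx.Fields could be cached the same way, so that an already-known numpy/cupy array never triggers a second field construction.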