# [Blueline] UENV deployment

- Shaped by: Rico

## Current state

A UENV that makes ICON4Py available on santis has been demonstrated. It is not yet fully configured to make the best use of the hardware, and it does not build ICON4Py with all extras. It installs UV from source instead of from binaries, which is wasteful. Furthermore, no version for balfrin exists yet.

There is a pull request to the [software stack recipes](https://github.com/C2SM/software-stack-recipes) repository that has started to fix some of the above, but it is currently not buildable.

## Goal

Prepare a UENV with ICON4Py in a state ready (or close to ready) for building ICON-Exclaim with the DSL dycore. The current understanding is that this means:

- ICON4Py built with extras `all,cuda12`
- all Python dependencies that take part in utilizing MPI and CUDA built by spack, optimized for the architecture

This UENV should be CI-deployable through the [software stack recipes](https://github.com/C2SM/software-stack-recipes) repository.

This is a stepping stone on the way to an ICON-DSL UENV deployment. It will allow

- manual building of DSL-enabled ICON on top of it, to verify the provided ICON4Py install
- merging a modified version of the icon uenv, based on knowledge gained from the above verification.

## Uncertainties

There are interactions between the configuration of the underlying spack environment and the custom spack packages that are not (yet) very well understood (by me, Rico). It is unclear why the UENV in https://github.com/C2SM/software-stack-recipes/pull/7 does not build, and the error messages from spack are not very helpful.
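One way to check the goal that the MPI- and CUDA-related Python dependencies come from spack rather than from UV is to compare the distributions installed in the two locations. The following is a minimal sketch, assuming the spack-built packages and the UV-managed environment live in two separate `site-packages` directories. The function names and the directory layout are hypothetical and not taken from the actual recipe.

```python
import re
from importlib.metadata import distributions


def normalized_names(site_packages):
    """Return the normalized names of all distributions found in one directory."""
    names = set()
    for dist in distributions(path=[str(site_packages)]):
        name = dist.metadata.get("Name")
        if name:
            # PEP 503 style normalization, so "Mpi4Py" and "mpi4py" compare equal.
            names.add(re.sub(r"[-_.]+", "-", name).lower())
    return names


def shadowed_by_venv(spack_site_packages, venv_site_packages):
    """List packages that are installed both by spack and in the UV-managed venv.

    A non-empty result suggests that UV installed its own copy of a package
    that spack had already built.
    """
    spack_names = normalized_names(spack_site_packages)
    venv_names = normalized_names(venv_site_packages)
    return sorted(spack_names & venv_names)
```

Calling `shadowed_by_venv(<spack site-packages>, <venv site-packages>)` with the two directories of a built image returns the overlapping package names; an empty list means no distribution is present in both places. The check only compares names, so it does not detect version mismatches or packages that UV builds under a different name.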
## Steps

- [x] Try likely changes to the configuration to see if it will build
- [x] If the above does not work, revert to the last working version and re-apply changes incrementally (if that version still works on upgraded santis, that is)
- [x] Add a balfrin version as soon as the build passes CI for santis
- [x] Verify that the version of ICON4Py provided is ready to support the DSL version of ICON (as close as possible without actually building it)
- [x] Verify that no spack-built Python packages are re-installed by UV during the build
- [x] Merge the ICON dependencies from the "icon" uenv into this one

## Outcomes

Version without ICON dependencies:

- santis: `uenv image pull build::icon-dsl/25.8:1933897003`
- balfrin: `uenv image pull --build icon-dsl/25.8:1933897076`

Version **with** ICON dependencies:

- santis: `uenv image pull build::icon-dsl/25.8:1965073287`
- balfrin: `uenv image pull --build icon-dsl/25.8:1966746163`

## Learnings

- Some of the build errors came down to mistakes in the custom spack packages / uenv environment file.
- The `py-cupy` package has an always-on `cudnn` dependency, which we don't need. The `cudnn` package creates `version()`s (spack API) dynamically based on platform properties. Somehow this leads to `spack`, as run during the uenv build, not finding a compatible `8.8` version. Solution: a custom `py-cupy` package without the `cudnn` dependency. Nicer solution: check whether the problem persists in `spack 1.0` and upstream a version of the `py-cupy` package with a switchable dependency on `cudnn` (via a variant). See [Out of scope (below)](https://hackmd.io/gLCmP3sgQd2Z_16PSZRVrg#Rabbit-holes)
- The official `ghex` spack package sets the `GHEX_GPU_TYPE` build variable incorrectly (`CUDA` instead of `NVIDIA`). Reported here: https://github.com/ghex-org/spack-repos/issues/10

## Rabbit holes

It is probably out of scope to rigorously understand why the current state of the UENV fails. Trial-and-error development towards a working UENV should not extend past a handful of experiments.
In other words, the first step must be time-boxed or limited to a small set of tries (preferably defined up front, with reasons why they are likely to work).

## Out of scope

Porting the UENV to a newer version of `stackinator` based on `spack 1.0` is out of scope, unless it becomes clear that it is the only (or by far the most likely) path to a working UENV.