# [Greenline] Distributed tests in CI
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by: @msimberg
- Appetite (FTEs, weeks):
- Developers: @msimberg
## Problem
<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->
Several tests use MPI, but none of them currently run in CI. As an MVP, https://github.com/C2SM/icon4py/pull/692 enables the distributed tests that already pass. The tests that fail will be investigated in follow-ups. That PR does not run any tests with GPU backends; GPU testing will also be added in follow-up PRs within this project.
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
## Solution
<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->
- enable distributed tests (basic version, only CPU backends, disable what doesn't work), i.e. merge https://github.com/C2SM/icon4py/pull/692
The following open tasks remain and should be tackled in order of priority (low-priority tasks can be left for future work; medium-priority tasks are nice to have but do not block further work):
- (high priority) enable testing with `dace_gpu`
- (high priority) `test_distributed_geometry_attrs_for_inverse` has wrong values
- (high priority) `test_distributed_metrics_attrs` has wrong values with `COEFF_GRADEKIN`
- (high priority) `test_distributed_interpolation_rbf` hangs in CI with all backends (but not locally)
- (high priority) enable/fix testing with gauss3d experiment
- (high priority) enable testing with the Weisman-Klemp experiment (distributed and single-rank)
- (high priority/production configuration) run with GPU-aware MPI (only option for GPU backends, no copying is done?)
- (medium priority) enable testing with `gtfn_gpu`
- (medium/low priority?) run with 2 ranks and 4 ranks (currently only 4 ranks)
- (medium priority) check if we can use nox sessions, and in general unify with non-distributed tests
- (low priority) `test_run_solve_nonhydro_single_step` in `test_parallel_solve_nonhydro.py`: `ValueError: axes don't match` with `embedded` backend (this is part of [[GT4Py] Full support for embedded execution](/Mw4QnUwaQwqp_1ddjCXhhQ))
- (low priority) `test_distributed_metrics_attrs_no_halo_regional` in `test_parallel_metrics.py`: `ValueError: axes don't match` with `embedded` (this is part of [[GT4Py] Full support for embedded execution](/Mw4QnUwaQwqp_1ddjCXhhQ))
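Two of the tasks above, the test that hangs only in CI and the run with both 2 and 4 ranks, can be illustrated with a minimal sketch. This is not how the icon4py CI is implemented: the `mpirun` launcher, the timeout value, the test path, and the helper names `launch_command` and `run_with_timeout` are all assumptions made for illustration. The idea is to build one launch command per rank count and to bound each run with a timeout, so that a hang is reported as a failure instead of blocking the job.

```python
import subprocess
import sys

# Hypothetical defaults, not taken from the icon4py CI configuration.
RANK_COUNTS = (2, 4)
TIMEOUT_SECONDS = 600


def launch_command(ranks, test_path, launcher="mpirun"):
    """Build the command line that runs pytest on `ranks` MPI ranks."""
    # `-np` is the rank-count flag of mpirun/mpiexec; a Slurm-based CI
    # would more likely use `srun -n` instead (assumption).
    return [launcher, "-np", str(ranks), sys.executable, "-m", "pytest", "-v", test_path]


def run_with_timeout(command, timeout=TIMEOUT_SECONDS):
    """Run `command`; return its exit code, or None if it exceeded `timeout`."""
    try:
        return subprocess.run(command, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the launcher process on timeout. Ranks that the
        # launcher spawned may need separate cleanup.
        return None


if __name__ == "__main__":
    # Hypothetical test path, used only for illustration.
    for ranks in RANK_COUNTS:
        print(" ".join(launch_command(ranks, "path/to/mpi_tests")))
```

A CI job could loop over `RANK_COUNTS`, pass each command to `run_with_timeout`, and treat a `None` result as a failed run that needs investigation.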
This project also interacts with [[Greenline] Finish domain decomposition for global grids](/3aTBq9-6QbKZrHOtsA-GpA) and that project may add more tests that need to be investigated for failures.
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
Avoid focusing on low-priority tests, i.e. tests that are not functionally important for, e.g., the warm-bubble experiment.
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Basic tests, CPU backends: https://github.com/C2SM/icon4py/pull/692
- [x] run with GPU-aware MPI (only option for GPU backends, no copying is done?) (works in https://github.com/C2SM/icon4py/pull/1012)
- [ ] enable testing with `dace_gpu`
- [ ] `test_distributed_geometry_attrs_for_inverse` has wrong values
- [ ] `test_distributed_metrics_attrs` has wrong values with `COEFF_GRADEKIN`
- [ ] `test_distributed_interpolation_rbf` hangs in CI with all backends (but not locally)
- [ ] enable/fix testing with gauss3d experiment
- [ ] enable testing with the Weisman-Klemp experiment (distributed and single-rank)
- [ ] make sure all components are enabled in distributed CI pipelines
- [ ] enable testing with `gtfn_gpu`
- [ ] run with 2 ranks and 4 ranks (currently only 4 ranks)
- [ ] check if we can use nox sessions, and in general unify with non-distributed tests
During the project it was discovered that unit tests are not required checks in CSCS-CI. This is unrelated to distributed tests, but we will look at it as part of this project anyway.