# Understanding ICON4Py halo-exchanges

## Measurements

### MCH-CH1_medium on balfrin (1 node exclusive)

- reported time is `t_min` (in s) of the `integrate_nh` ICON Fortran timer
- icon4py backend: gtfn_gpu

| Configuration                                           | Blueline  | OpenACC    | Ratio |
| ------------------------------------------------------- | --------- | ---------- | ----- |
| 1 GPU                                                   | 0.10346   | 0.08078    | 1.28  |
| 4 GPU - MPICH_GPU_IPC_ENABLED=1 (default)               | 0.14216   | 0.04569    | 3.11  |
| 4 GPU - MPICH_GPU_IPC_ENABLED=0                         | 0.08175   | 0.05167    | 1.58  |
| 4 GPU - no exchange                                     | 0.06212   | N/A        | 1.36* |
| 4 GPU - sync instead of exchange                        | 0.06597   | N/A        | 1.44* |
| hwmalloc 2 MB 4 GPU - MPICH_GPU_IPC_ENABLED=1 (default) | 0.13808   |            |       |
| hwmalloc 2 MB 4 GPU - MPICH_GPU_IPC_ENABLED=0           | 0.07945   |            |       |
| no compute 4 GPU - MPICH_GPU_IPC_ENABLED=1 (default)    | (0.13216) |            |       |
| no compute 4 GPU - MPICH_GPU_IPC_ENABLED=0              | (0.06164) |            |       |
| 4 GPU - LIBFABRIC                                       | 0.09232** |            |       |

\*\) vs. fastest OpenACC
\*\*\) but does not seem to verify

- Blueline `MPICH_GPU_IPC_ENABLED=0`/`MPICH_GPU_IPC_ENABLED=1` = 0.58
- OpenACC `MPICH_GPU_IPC_ENABLED=0`/`MPICH_GPU_IPC_ENABLED=1` = 1.13
- `no exchange` / `sync instead of exchange` = 0.94 -> could be reduced by reducing Python overhead between calls

### MCH-CH1_medium on santis

- reported time is `t_min` (in s) of the `integrate_nh` ICON Fortran timer
- icon4py backend: gtfn_gpu

| Configuration                                      | Blueline | OpenACC  | Ratio |
| -------------------------------------------------- | -------- | -------- | ----- |
| 1 GPU                                              | 0.0705   | 0.6-0.8? | TODO  |
| 4 GPU (IPC_ENABLED=0)                              | 0.0630   | TODO     | TODO  |
| 4 GPU (IPC_ENABLED=1, IPC_CACHE_SIZE=50; defaults) | 1.8602   | 0.4-0.5? | TODO  |
| 4 GPU (IPC_ENABLED=1, IPC_CACHE_SIZE=32768)        | 1.5544*  | TODO     | TODO  |
| 4 GPU/4 nodes (IPC_ENABLED=1, IPC_CACHE_SIZE=50)   | 0.6422   | TODO     | TODO  |

\*\) likely just fluctuation, not a real performance change (2M handles or more also gives 1.8-2.0 seconds)

#### After hwmalloc allocation change (see further down for details)

| Configuration         | Blueline | OpenACC | Ratio |
| --------------------- | -------- | ------- | ----- |
| 1 GPU                 | 0.0688   | 0.0626  | 1.10  |
| 4 GPU (IPC_ENABLED=0) | 0.0583   | 0.0449  | 1.30  |
| 4 GPU (defaults)      | 0.0539   | 0.0407  | 1.32  |

#### After ghex optimizations

TODO: details on what was changed

- don't synchronize the stream before starting a send; instead assume work happens on the default stream and insert a dependency (using CUDA events; see the sketch after this list)
- don't synchronize unpacking streams before returning; same as above
- optimize pack/unpack kernels (flatten the 2D loop over elements and levels)
- micro-optimizations to create fewer events/do less synchronization
- don't check for request completion immediately (avoids the situation where starting a send triggers MPI to make progress on a receive, delaying the send)
- use dace backend
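The first two bullets replace the device-wide synchronization workaround with a stream-ordered dependency. A minimal sketch of that pattern, assuming CuPy (`pack_kernel` and `start_send` are hypothetical stand-ins for illustration, not actual GHEX code):

```python
import cupy as cp

comm_stream = cp.cuda.Stream(non_blocking=True)

def pack_kernel(field: cp.ndarray) -> None:
    # Stand-in for GHEX's pack kernel: anything launched inside
    # `with comm_stream:` below is ordered after the `ready` event.
    field += 0

def start_send(send_field: cp.ndarray) -> None:
    # Record an event on the current (default) stream, where the compute
    # kernels producing `send_field` are assumed to have been launched ...
    ready = cp.cuda.Event()
    ready.record()
    # ... and make only the communication stream wait on it, instead of
    # stalling everything with cp.cuda.runtime.deviceSynchronize().
    comm_stream.wait_event(ready)
    with comm_stream:
        pack_kernel(send_field)

# usage: produce data on the default stream, then hand it to the send path
field = cp.zeros(1024)
start_send(field)
```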
POC (mess of a) branch: https://github.com/ghex-org/GHEX/compare/master...msimberg:GHEX:pack-unpack-optimization

and disable sync in icon4py halo exchange:

```diff
diff --git a/model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py b/model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py
index cf5b3ce5c..68f79c2b4 100644
--- a/model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py
+++ b/model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py
@@ -244,10 +244,10 @@ class GHexMultiNodeExchange:
             )
             for f in sliced_fields
         ]
-        if hasattr(fields[0].array_ns, "cuda"):
-            # TODO(havogt): this is a workaround as ghex does not know that it should synchronize
-            # the GPU before the exchange. This is necessary to ensure that all data is ready for the exchange.
-            fields[0].array_ns.cuda.runtime.deviceSynchronize()
+        # if hasattr(fields[0].array_ns, "cuda"):
+        #     # TODO(havogt): this is a workaround as ghex does not know that it should synchronize
+        #     # the GPU before the exchange. This is necessary to ensure that all data is ready for the exchange.
+        #     fields[0].array_ns.cuda.runtime.deviceSynchronize()
         handle = self._comm.exchange(applied_patterns)
         log.debug(f"exchange for {len(fields)} fields of dimension ='{dim.value}' initiated.")
         return MultiNodeResult(handle, applied_patterns)
```

| Configuration    | Blueline | OpenACC | Ratio |
| ---------------- | -------- | ------- | ----- |
| 4 GPU (defaults) | 0.0485   | 0.0407  | 1.19  |

#### Additional minor optimizations

TODO: details on what was changed

- cache ghex patterns (see below)
- disable logs in solve_nonhydro.py (turn info into debug)

| Configuration    | Blueline | OpenACC | Ratio |
| ---------------- | -------- | ------- | ----- |
| 4 GPU (defaults) | 0.0470   | 0.0407  | 1.15  |

### MCH-CH2 on santis (12 hours of simulated time)

TL;DR: `nh_solve` average time is ~350 us slower with icon blueline (~4% slower); `nh_hdiff` average time is < 100 us slower (~1.5% slower).

OpenACC:

```
------------------------------- ------- ------------ -------- ------------ ------------ -------- ------------- -------------- ------------- -------------- ------------- -----
name                            # calls        t_min min rank        t_avg        t_max max rank total min (s) total min rank total max (s) total max rank total avg (s) # PEs
------------------------------- ------- ------------ -------- ------------ ------------ -------- ------------- -------------- ------------- -------------- ------------- -----
total                                 4       05m51s      [2]       05m51s       05m51s      [1]       351.529            [2]       351.529            [1]       351.529     4
L integrate_nh                     8640  0.14149618s      [0]  0.15865493s  0.65534902s      [3]       342.669            [1]       342.737            [0]       342.695     4
L nh_solve                        43200  0.00794792s      [0]  0.00835373s  0.15928721s      [1]        89.950            [0]        90.368            [1]        90.220     4
L nh_hdiff_initial_run                4  0.00414896s      [0]  0.00418174s  0.00421190s      [1]         0.004            [0]         0.004            [1]         0.004     4
L nh_hdiff                         8640  0.00218010s      [1]  0.00228803s  0.09231114s      [2]         4.895            [1]         4.967            [3]         4.942     4
```

icon blueline (with the optimizations mentioned above):

```
------------------------------- ------- ------------ -------- ------------ ------------ -------- ------------- -------------- ------------- -------------- ------------- -----
name                            # calls        t_min min rank        t_avg        t_max max rank total min (s) total min rank total max (s) total max rank total avg (s) # PEs
------------------------------- ------- ------------ -------- ------------ ------------ -------- ------------- -------------- ------------- -------------- ------------- -----
total                                 4       05m56s      [3]       05m56s       05m56s      [0]       356.247            [3]       356.247            [0]       356.247     4
L integrate_nh                     8640  0.14217210s      [3]  0.16081500s  4.54769516s      [1]       347.355            [1]       347.366            [2]       347.360     4
L nh_solve                        43200  0.00794911s      [0]  0.00870382s  2.86767006s      [3]        93.789            [0]        94.093            [1]        94.001     4
L nh_hdiff_initial_run                4  1.21267390s      [3]  1.22282827s  1.23601699s      [2]         1.213            [3]         1.236            [2]         1.223     4
L nh_hdiff                         8640  0.00224805s      [3]  0.00232436s  0.00423908s      [2]         5.019            [3]         5.022            [2]         5.021     4
```

## ghex caching of patterns in icon4py

See https://github.com/C2SM/icon4py/pull/873/files#diff-9f03ac545e4d8a40eb657bdfb51101d652953ad6d8ee50c4c7373389ec5896f2R131.

4 GPU with default options gives 0.0530. Within noise of "4 GPU (defaults)" above? It does not make things significantly better, but probably doesn't hurt either?

Update: based on profiles (with viztracer) and new timings, the caching does help. In the first tests the improvement looked like noise, but there is a measurable improvement when patterns are cached.
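The shape of the caching is roughly the following (a self-contained sketch with hypothetical names; the real change is in the PR linked above):

```python
from typing import Any, Callable, Dict

class PatternCache:
    """Build each ghex pattern once per horizontal dimension and reuse it,
    instead of re-creating it on every exchange."""

    def __init__(self, make_pattern: Callable[[Any], Any]) -> None:
        # `make_pattern` stands in for the (expensive) ghex pattern factory;
        # the name is illustrative, not the actual icon4py/ghex API.
        self._make_pattern = make_pattern
        self._cache: Dict[Any, Any] = {}

    def get(self, dim: Any) -> Any:
        # The dimension alone identifies the pattern here; the decomposition
        # is assumed fixed for the lifetime of the exchange object.
        if dim not in self._cache:
            self._cache[dim] = self._make_pattern(dim)
        return self._cache[dim]
```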
## Halo exchange strategy

### ICON Fortran

- yaxt?
- built-in communication module

### GHex

## Recommended settings

These are already in the experiment files:

```bash=
export CUDA_BUFFER_PAGE_IN_THRESHOLD_MS=0.001
export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
export FI_CXI_RX_MATCH_MODE=software
export FI_MR_CACHE_MONITOR=disabled
export NVCOMPILER_ACC_DEFER_UPLOADS=1
export NVCOMPILER_TERM=trace
```

This should be set additionally:

```bash
#SBATCH --constraint=thp_never
```

though it does not seem to make much of a difference in this case? It may make more of a difference when the CPU is used and/or at scale (the original reason for disabling THP is that the kernel will spend a lot of time coalescing pages into huge pages, introducing jitter; this is more likely at scale, and more likely to affect workloads that use the CPU more).

`FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0` may or may not be needed (cf. https://cscs-lugano.slack.com/archives/C06F0EQ9NMC/p1715610535984029?thread_ts=1715608951.921669&cid=C06F0EQ9NMC).

`FI_MR_CACHE_MONITOR=userfaultfd` may be sufficient (though hangs have been reported with that as well; `disabled` is so far the only guaranteed way to avoid hangs, but it comes with a performance penalty).

## hwmalloc allocation change

https://github.com/ghex-org/GHEX/compare/master...msimberg:GHEX:allocation-debug

1. changes hwmalloc segment sizes to be at least 2 MiB each
2. changes the "huge" allocation limit to be bigger than the largest "large" segment sizes (to make sure segments are likely to be used)
3. forces hwmalloc to never free allocations

It's unclear exactly which change has the biggest effect, but it's likely the combination of 1. and 2.
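Related to point 1. (and to the 2 MB alignment TODO below), a quick way to check whether a given device buffer actually starts on a 2 MiB boundary, assuming the fields are CuPy arrays:

```python
import cupy as cp

TWO_MIB = 2 * 1024 * 1024

def is_2mib_aligned(arr: cp.ndarray) -> bool:
    # `arr.data.ptr` is the raw CUDA device pointer of the array's first element.
    return arr.data.ptr % TWO_MIB == 0

buf = cp.zeros(1 << 20, dtype=cp.float64)  # 8 MiB buffer
print(is_2mib_aligned(buf))
```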
## Async MPI progress and multiple CUDA streams for MPICH

Setting

```
export MPICH_GPU_MAX_NUM_STREAMS=4 # default is 1
export MPICH_ASYNC_PROGRESS=1      # default is 0
```

leads to segfaults:

```
GTL_DEBUG: [3] cudaEventRecord: invalid resource handle
GTL_DEBUG: [2] cudaEventRecord: invalid resource handle
GTL_DEBUG: [0] cudaEventRecord: invalid resource handle
GTL_DEBUG: [0] cudaEventRecord: invalid resource handle
GTL_DEBUG: [2] cudaEventRecord: invalid resource handle
GTL_DEBUG: [3] cudaEventRecord: invalid resource handle
GTL_DEBUG: [3] cudaMemcpyAsync: invalid resource handle
Error: segmentation violation, address not mapped to object
```

Setting only `MPICH_ASYNC_PROGRESS` has no significant impact (maybe a bit slower than without?). The hope behind this change was that ghex halo exchanges could become more asynchronous (a guess, without any real analysis). That turns out not to be the case.

## TODO (which information to gather)

- [x] Collect timings on santis (target for coupled runs)
- [x] Find out "recommended settings" for icon on santis (from GB runs?)
- [x] Do both setups use the same env variables (or do the run scripts set different options)?
- [ ] pass in fields(?) to ghex comm so that ghex uses IPC directly?
- [x] Check reuse of patterns etc. with Till (cf. https://github.com/C2SM/icon4py/pull/873/files#diff-9f03ac545e4d8a40eb657bdfb51101d652953ad6d8ee50c4c7373389ec5896f2R131)
- [ ] GHEX optimizations (fabian)?
  - expose a mode that allows skipping the unpacking step (because the fields are already in the correct format)
- [x] Profile the entire model with `MPICH_GPU_IPC_ENABLED=1`, understand where in cray-mpich the overhead occurs and the conditions of this overhead (see [#Observations-in-PMAP-GO](#Observations-in-PMAP-GO)). Using `perf` should be enough, as we only care about native calls. Then try to reproduce this with a synthetic benchmark (see [script in PMAP-GO for a template](https://github.com/PMAP-Project/PMAP-GO/blob/b871b814db915a4c960fe724a850186b4477ee85/scripts/bench_halo_exchange.py)). Alternative: understand how much and what type of communication is being done? Probably easiest to collect using nsys anyway.
- [x] Compare 4 GPUs on 4 separate nodes vs. 4 GPUs on the same node.
- [x] How much are buffers reused/reallocated? cf. the hwmalloc section (probably not enough; can work around by changing hwmalloc settings, but the root cause is not clear)
- [x] Does icon4py need to cache/reuse patterns/descriptors/etc.? (cf. https://github.com/C2SM/icon4py/pull/873/files#diff-9f03ac545e4d8a40eb657bdfb51101d652953ad6d8ee50c4c7373389ec5896f2R131)
- [x] Are buffers aligned to 2 MB blocks? (cf. the hwmalloc section)
- [ ] Why is https://github.com/C2SM/icon4py/blob/47ddf3bab36eb5b0ad99c916edf8eb35cc94203b/model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py#L234 needed?
- [x] Try increasing segment sizes in hwmalloc to 2 MB (or a multiple of 2 MB): https://github.com/ghex-org/hwmalloc/blob/89e113a354582fe964bab124cd24d54252e8c9e0/include/hwmalloc/heap.hpp#L112-L114; cf. https://github.com/eth-cscs/alps-gh200-reproducers/tree/main/gpudirect-p2p-overalloc: makes both `MPICH_GPU_IPC_ENABLED=0/1` slightly faster
- [x] Increasing the segment sizes further, increasing the large segment limit, and never freeing buffers (https://github.com/ghex-org/hwmalloc/compare/master...msimberg:hwmalloc:allocation-debug, usable through https://github.com/ghex-org/GHEX/compare/master...msimberg:GHEX:allocation-debug) avoids all IPC handle creation during ghex halo exchanges
- [ ] understand _why_: does hwmalloc reallocate/free buffers more frequently without the change, or is it because the backing segments are too small?
- [x] Try libfabric: `GHEX_TRANSPORT_BACKEND=LIBFABRIC` (runs faster, but does not give correct results)
- [ ] Redo the no-compute and no-exchange timings with a timer around the Python call
- [ ] Try OpenMPI
- [ ] Try the libfabric ghex backend after john's fixes?
- [ ] Try the newest libfabric (though the IPC caching seems to be in MPICH, so it likely makes no difference)

## Misc

### Observations in PMAP-GO (Till)

A synthetic benchmark showed no performance degradation or scaling issues with IPC enabled; the issue only appeared when running the entire model. When profiling the entire model (with py-spy), a significant amount of the runtime was spent setting up some sort of IPC handles (in cray-mpich). However, increasing the number of fields exchanged in the synthetic benchmark still did not reproduce the issue without running the entire model. (A minimal two-rank benchmark sketch is at the end of this document.)

## Scratch

```
CC=nvc CFLAGS=-noswitcherror uv sync --extra cuda12 --extra all --no-cache --reinstall-package ghex
```
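For reference, a skeleton of the kind of synthetic two-rank benchmark described in the PMAP-GO observations above (a sketch assuming mpi4py on top of GPU-aware cray-mpich plus CuPy, with one GPU per rank; this is not the actual PMAP-GO script):

```python
import time

import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank  # assumes exactly 2 ranks, one GPU each (e.g. via slurm GPU binding)

# mpi4py accepts objects exposing __cuda_array_interface__, so the device
# pointers are handed straight to (GPU-aware) MPI, exercising the IPC/RDMA paths.
send_buf = cp.ones(1 << 22, dtype=cp.float64)  # 32 MiB per message
recv_buf = cp.empty_like(send_buf)

n_iter = 100
comm.Barrier()
t0 = time.perf_counter()
for _ in range(n_iter):
    comm.Sendrecv(send_buf, dest=peer, recvbuf=recv_buf, source=peer)
t_avg = (time.perf_counter() - t0) / n_iter
print(f"rank {rank}: {t_avg * 1e6:.1f} us per exchange")
```

To get closer to the model's behavior, one could vary the message sizes and reallocate the buffers between iterations, since the hwmalloc findings above suggest that buffer reallocation is what triggers the repeated IPC handle setup.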