# [DaCe] Halo Exchange Optimizations in fused SDFGs (continuation)

###### tags: `cycle 22`

- Shaped by: Christos
- Appetite (FTEs, weeks):
- Developers: Christos

## Problem

As of the previous cycle, halo exchanges are placed automatically in the fused SDFG: the user no longer needs to place them manually, because an analysis of the fused SDFG reveals where communication is needed. This analysis scans the fused SDFG (before any simplification/optimization) and gathers, for each inserted stencil, its output fields and the input fields of all the stencils below it (in the directed-graph sense). An output field that feeds the stencils below, and that is read there through neighbor accesses, should be exchanged ([see slide](https://docs.google.com/presentation/d/1as3nbVsTx_ZkZd45uYR7VzGMIzVGS_XStP5gUQSKgu0/edit#slide=id.g2d07f566a52_3_0)).

The above optimization works as expected and produces correct results (diffusion module tests). In terms of timing, we maintain the same performance. However, if we compare the fused SDFG with manually placed halo exchanges against the one with automated halo exchanges, the latter performs more updates. This happens because, even when there are neighbor accesses, the accessed elements may be owned by the MPI process itself, in which case no exchange is needed.

To further optimize the automated analysis, we need a deeper understanding of which neighboring elements are accessed. This must be done in Icon4Py, with a better understanding of the domain decomposition infrastructure, since GT4Py holds no information about owned/ghost elements. This investigation will reveal which halo layers we are accessing, and we will then exploit this information to choose the optimal amount of data to exchange (how many halo layers to exchange, if any).

## Appetite

Multiple cycles. **Keep in mind that in this cycle I will work for only 1 week.**

## Solution

1. Study the [ICON tutorial](https://www.dwd.de/DE/leistungen/nwv_icon_tutorial/pdf_einzelbaende/icon_tutorial2023.html) (especially chapters 8 & 9) to understand how data are organized, i.e. halos, halo layers, domain decomposition (as discussed with Christophe).
2. ~~Use the icochainsize tool (found in tools/src/icon4pytools/icon4pygen/icochainsize.py) to determine the halo layers we are accessing per field and stencil.~~
3. ~~Combine the above to automatically perform the absolute minimum of the halo exchanges needed.~~
4. ~~Extend step 2, define multiple GHEX patterns (according to the number of halo layers), and use this info to exchange only the needed data.~~

## Rabbit holes

<!-- Details about the solution worth calling out to avoid problems -->

## No-gos

<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the appetite or make the problem tractable -->

## Progress

<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [x] No need to use the icochainsize tool.
- [x] Perform only the needed halo updates:
    - [x] Use the offset provider to collect all the visited indices.
    - [x] Use the ICON4Py `DecompositionInfo` object to figure out which accessed indices belong to the halo.
    - [x] Check which of the halo indices are overcomputed (updated by computation and not via MPI).
    - [x] Combine all the above to determine whether the overcomputed halo elements suffice. If not, perform a halo update.
- [ ] Update only the not-yet-updated halo elements (task for another cycle!).
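
The checks listed under Progress can be sketched roughly as follows. This is a minimal illustration with plain NumPy arrays, not the actual Icon4Py code. The function names, the `connectivity` table (standing in for what the offset provider supplies), the `owner_mask` (standing in for what `DecompositionInfo` would tell us), the `overcomputed_mask`, and the `-1` skip value are all hypothetical and assumed only for this example.

```python
import numpy as np


def stale_halo_indices(connectivity, compute_indices, owner_mask,
                       overcomputed_mask, skip_value=-1):
    """Return the visited halo indices that were NOT updated by overcomputation.

    All arguments are hypothetical stand-ins:
      connectivity      -- (n_elements, max_neighbors) neighbor table
      compute_indices   -- elements on which the consuming stencil is evaluated
      owner_mask        -- True where the element is owned by this rank
      overcomputed_mask -- True where the producing stencil also wrote the element
    """
    # Step 1: collect every index visited through neighbor accesses.
    visited = connectivity[compute_indices].ravel()
    visited = np.unique(visited[visited != skip_value])

    # Step 2: keep only the visited indices that lie in the halo (not owned).
    halo_visited = visited[~owner_mask[visited]]

    # Step 3: drop halo indices that were already updated by overcomputation.
    return halo_visited[~overcomputed_mask[halo_visited]]


def needs_halo_exchange(connectivity, compute_indices, owner_mask,
                        overcomputed_mask, skip_value=-1):
    """Step 4: an exchange is needed only if some visited halo element is stale."""
    stale = stale_halo_indices(connectivity, compute_indices, owner_mask,
                               overcomputed_mask, skip_value)
    return stale.size > 0


# Toy decomposition: elements 0..3 are owned, elements 4 and 5 are halo.
connectivity = np.array([
    [1, 2],
    [0, 3],
    [0, 4],   # owned element 2 reads halo element 4
    [1, 5],   # owned element 3 reads halo element 5
    [2, -1],
    [3, -1],
])
owner_mask = np.array([True, True, True, True, False, False])
compute_indices = np.arange(4)  # the consuming stencil runs on owned elements

# Case A: only halo element 4 was overcomputed, so element 5 is stale.
partly_overcomputed = np.array([False, False, False, False, True, False])
stale_a = stale_halo_indices(connectivity, compute_indices, owner_mask,
                             partly_overcomputed)
exchange_a = needs_halo_exchange(connectivity, compute_indices, owner_mask,
                                 partly_overcomputed)

# Case B: both halo elements were overcomputed, so nothing is stale.
fully_overcomputed = np.array([False, False, False, False, True, True])
exchange_b = needs_halo_exchange(connectivity, compute_indices, owner_mask,
                                 fully_overcomputed)
```

In this toy setup, case A still requires an exchange because one visited halo element is stale, while case B can skip it. The array returned by `stale_halo_indices` is also the set that the open task (updating only the not-yet-updated halo elements) would need to communicate.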