# [Dace] Halo Exchange in fused sdfgs (icon4py diffusion module)

###### tags: `cycle 19`

- Shaped by: Christos & Edoardo
- Appetite (FTEs, weeks): At least a whole cycle
- Developers:

## Problem

I have started working (last two weeks of `cycle 18` + cool-down) on the driving code of the diffusion module in icon4py, specifically on the [**_do_diffusion_step** function](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/diffusion.py#L586). This function sequentially calls 12 SDFGs. The goal of this project is to fuse these SDFGs into a single SDFG.

**Current status:**

1. I have managed to extract the generated SDFGs (per stencil) and make them visible to the driving code. ~~For now this is a workaround but in the future I could rely on the caching implementation of ITIR_to_SDFG (not sure that it fully works -revisit with Edoardo-).~~ Add [SDFGConvertible methods](https://github.com/spcl/dace/blob/master/dace/frontend/python/common.py#L61) to the decorator.py in gt4py.
2. All the extracted SDFGs are now fused into one big SDFG (transient variables have been taken care of as well). However, the fused SDFG does not consider the halo exchange (communication layer).

Below is the fused SDFG of the **_do_diffusion_step** function (halo exchange not considered):

![Screenshot 2024-01-16 at 11.16.02](https://hackmd.io/_uploads/BkOBOCXtT.png)
![Screenshot 2024-01-16 at 11.16.21](https://hackmd.io/_uploads/Sk4Ld07Kp.png)

The solution to SDFG fusion requires automating SDFG extraction and (more challenging) SDFG invocation. The invocation step requires passing the field arguments and all required symbols, following the SDFG arglist signature. Another requirement is to extract from the SDFG signature the fields that are used as temporaries and that can be declared as transient arrays inside the dace program. It is not obvious how we can automate these two steps.
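
To make the second requirement concrete, here is a minimal sketch in plain Python of how fields could be split into call arguments and transient candidates. It does not use the DaCe API: `StencilSignature`, `split_fields`, and the stencil and field names are hypothetical, and in practice the read/write information would have to come from each SDFG's argument list rather than from hand-written records. The sketch also assumes the caller tells us which fields are externally visible, which is exactly the part that is hard to automate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StencilSignature:
    """Hypothetical summary of one stencil SDFG: the fields it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def split_fields(stencils, external_fields):
    """Split the fields touched by a stencil sequence into arguments and transients.

    A field is a transient candidate when it is not in ``external_fields``,
    i.e. the caller neither provides it nor inspects it afterwards.
    """
    written = set()
    touched = set()
    for stencil in stencils:
        # A non-external field must be written by an earlier stencil before it is read.
        missing = (stencil.reads - written) - external_fields
        if missing:
            raise ValueError(f"{stencil.name} reads undefined fields: {sorted(missing)}")
        written |= stencil.writes
        touched |= stencil.reads | stencil.writes
    arguments = sorted(touched & external_fields)
    transients = sorted(touched - external_fields)
    return arguments, transients


# Hypothetical two-stencil sequence: the second stencil consumes a temporary
# produced by the first one and updates the externally visible field "vn".
stencils = [
    StencilSignature("stencil_a", frozenset({"vn"}), frozenset({"tmp_nabla2"})),
    StencilSignature("stencil_b", frozenset({"vn", "tmp_nabla2"}), frozenset({"vn"})),
]
arguments, transients = split_fields(stencils, external_fields={"vn"})
```

Here `arguments` would be passed at invocation time, while the entries of `transients` are the fields that could be declared as transient arrays in the fused program.
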
We could look into the solution called "dace orchestration", which was developed for the Pace project, and see if it can be applied to our case. Manual fusion is acceptable only as a workaround to extract one fused SDFG and run some benchmarks.

On the other hand, there is interest in GT4Py in representing halo exchange in the DSL, which would allow us to describe the diffusion granule as a single GT4Py program. It therefore seems less important to focus on a dace-specific solution for SDFG fusion. If interest and time are available, we could start looking into a GT4Py representation of halo exchange.

If we keep using manual SDFG fusion, we could work instead on the DaCe library node for halo exchange and use it in the fused SDFG. This would be an interesting input to see what kind of optimization DaCe can perform at granule level.

For the halo exchange, we need to propose an interface that is generic enough to support current and future requirements, if possible. For the code generation, during library node expansion, we could focus on GHEX and generate calls to the GHEX C++ API. It is not clear what a default (Python-based) implementation could look like, and we don't know what limitations exist in DaCe with respect to MPI communication. We can simply focus on getting a GHEX-specialized C++ implementation for our use case and leave the default implementation for later.

## Appetite

<!-- Explain how much time we want to spend and how that constrains the solution -->

During the next cycle, we can get a working (but with limited interface support) DaCe library node for halo exchange in order to complete the diffusion granule.

## Solution

<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->

Based on available resources and priorities, the proposed solution is to keep using manual SDFG fusion to achieve a working dace-granule prototype for benchmarking.
In the next cycle, we can focus on proposing a DaCe library node for halo exchange and providing a specialized C++ implementation based on GHEX. Investigation of Pace is still relevant, in particular as input to future cycles. Automatic SDFG fusion is excluded from the next cycle's scope. If interest and time are available, we could start looking into a GT4Py representation of halo exchange, which could become the future approach to generate a single fused SDFG.

- [ ] Define the interface of a DaCe library node for halo exchange
- [ ] Provide a C++ specialization of the halo exchange library node based on GHEX
- [ ] Integrate the halo exchange library node in the fused SDFG
- [ ] Investigate [Pace](https://github.com/NOAA-GFDL/pace)
- [ ] Understand DaCe's status on MPI handling, in order to provide a default implementation of halo exchange.

~~- [ ] Adopt Pace's Dace orchestration strategy or develop our own~~

## Rabbit holes

<!-- Details about the solution worth calling out to avoid problems -->

## No-gos

<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->

## Progress

<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [x] Testing using the fused SDFG from the previous cycle
- [x] Halo Exchange as Python Callbacks (just for experimentation)
- [x] Halo Exchange through DaCe Library node. The SDFG node is expanded into native GHEX C++ calls. The approach I followed is as follows:
  - Wrote new Python binding methods (in GHEX) to expose the underlying C++ pointers of the already initialized GHEX comm objects ([happens in icon4py](https://github.com/C2SM/icon4py/blob/main/model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py#L120)).
    The pointers are returned to the Python interface as `uintptr_t`.
  - Pass the above pointers at runtime to the fused SDFG. The halo library node is expanded and the pointers are reinterpreted as the corresponding GHEX types.
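
The pointer hand-off described in the last item can be illustrated with a small, self-contained sketch using only Python's standard `ctypes` module. This is not the GHEX binding or the generated library node code: `CommObject`, `expose_pointer`, and `reinterpret` are hypothetical stand-ins, and in the real flow the reinterpretation happens in the expanded C++ code rather than in Python. The sketch only shows the idea of exporting an address as a plain integer and casting it back to a typed object that shares the same memory.

```python
import ctypes


class CommObject(ctypes.Structure):
    """Hypothetical stand-in for an already initialized communication object."""
    _fields_ = [("rank", ctypes.c_int), ("size", ctypes.c_int)]


def expose_pointer(obj):
    """Return the address of ``obj`` as a plain integer, like a uintptr_t binding would."""
    return ctypes.addressof(obj)


def reinterpret(address):
    """Cast an integer address back to a typed view of the same object."""
    pointer = ctypes.cast(ctypes.c_void_p(address), ctypes.POINTER(CommObject))
    return pointer.contents


comm = CommObject(rank=3, size=8)   # must stay alive while the address is in use
handle = expose_pointer(comm)       # integer that can be passed as a scalar argument
view = reinterpret(handle)          # typed access on the receiving side
view.rank = 5                       # writes through to the original object
```

The important caveat with this design is lifetime: the integer carries no ownership, so the object it points to has to outlive every call that receives the address.
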