# [DaCe] Toolchain Runtime Support
- Shaped by: Edoardo
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
There are multiple work items in this project:
- Optimization of the SDFG call
- Move the initialization of SDFG arglist outside the SDFG call (**down-prioritized**, see task description)
- Support for multi-node execution
- Support for parallel builds
- Make SDFG execution asynchronous with respect to python code (**newly added**, should be prioritized)
### Optimization of the SDFG call
This is the most important task, because it poses a risk for the usage of the GTIR-DaCe backend in the ICON granules.
Profiling of the granule execution with the DaCe backend shows that the preparation of the SDFG call takes a very long time. This code is part of the decoration stage in the OTF workflow, before the SDFG is executed by means of the DaCe `CompiledSDFG.fast_call()` method.
There are some obvious opportunities in the current code to reduce the overhead. For example, there is a runtime exception with some expensive checks that should be turned into an assert. But this is not enough, and this project aims at implementing more static bindings from gt4py program arguments to SDFG call arguments.
The term _bindings_ is not entirely appropriate here. It is rather a translation module from the canonical format of gt4py program arguments to the format of SDFG arguments required by the `CompiledSDFG.fast_call()` method.
The canonical argument list of gt4py programs starts with the sequence of arguments as defined in the program itself. The list is extended with the output field, if it was not already part of the program arguments. Besides, symbols used for implicit domain definitions are appended to the argument list. However, the shape and strides of the field array representation remain attributes of the underlying array objects. Note that field arguments can be tuples of fields, in which case they remain tuples.
The SDFG arglist is a flat sequence (no nested tuples) of `ctype` values. This sequence contains first all array arguments, in alphabetical order, then all scalar arguments (including SDFG symbols), also in alphabetical order. Therefore, some conversion needs to be done (see the sketch after this list):
- Tuples need to be flattened
- The position of gt4py fields and scalars does not correspond to the position in the SDFG call.
- Array shape and strides need to be represented as scalar values, and mapped to the correct argument position in the SDFG call.
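To make these conversion steps concrete, here is a minimal sketch of the flattening and reordering, assuming NumPy-like arrays; it is not the gt4py implementation, and the `__<name>_size_<dim>` / `__<name>_stride_<dim>` symbol names are only illustrative:

```python
import numpy as np


def flatten_to_sdfg_arglist(named_args: dict[str, object]) -> list[tuple[str, object]]:
    """Toy conversion: flatten tuples, expand arrays into shape/stride scalars,
    and order the result as arrays first, then scalars, both alphabetically."""
    arrays: dict[str, np.ndarray] = {}
    scalars: dict[str, object] = {}
    for name, value in named_args.items():
        # Tuples of fields are flattened into individual arguments.
        items = value if isinstance(value, tuple) else (value,)
        for i, item in enumerate(items):
            flat_name = f"{name}_{i}" if isinstance(value, tuple) else name
            if isinstance(item, np.ndarray):
                arrays[flat_name] = item
                # Shape and strides become scalar arguments with their own names.
                for dim, (size, stride) in enumerate(zip(item.shape, item.strides)):
                    scalars[f"__{flat_name}_size_{dim}"] = size
                    scalars[f"__{flat_name}_stride_{dim}"] = stride
            else:
                scalars[flat_name] = item
    return [(n, arrays[n]) for n in sorted(arrays)] + [(n, scalars[n]) for n in sorted(scalars)]
```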
### Move the initialization of SDFG arglist outside the SDFG call
The SPCL group agrees that the current API for calling an SDFG should be improved. One problem we have highlighted from the gt4py side is that the first SDFG call also performs the initialization of the `ctype` interface used to parse the SDFG call arguments.
In dace, we should find a solution to move the `_construct_args()` call into a separate method, so that it can be executed at compile time.
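A minimal sketch of the idea, assuming the current private DaCe interface in which `CompiledSDFG._construct_args()` returns the argument tuples consumed by `fast_call()` (these internals may change between dace versions); the wrapper class is hypothetical:

```python
from dace.codegen.compiled_sdfg import CompiledSDFG


class PreBoundSDFG:
    """Construct the SDFG arglist once, outside the call path."""

    def __init__(self, csdfg: CompiledSDFG, **static_kwargs):
        self._csdfg = csdfg
        # Done once, e.g. at compile/decoration time, instead of inside the first call.
        self._argtuple, self._initargtuple = csdfg._construct_args(static_kwargs)

    def __call__(self):
        # Hot path: skip argument parsing and go straight to fast_call().
        # Only valid as long as the buffers bound in static_kwargs stay alive.
        return self._csdfg.fast_call(self._argtuple, self._initargtuple, do_gpu_check=False)
```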
### Support for multi-node execution
The GTIR-DaCe backend seems to work fine with multi-node parallel execution on icon4py `main`. However, some preliminary tests show problems when we use precompiled programs with static-arg variants. The icon4py `main` branch does not use precompiled programs; instead, it still builds the programs on first call on each node.
The problem is related to loading a `CompiledSDFG` object. It could be that loading the same program on multiple ranks is problematic, but this is only speculation.
### Support for parallel builds
The dace cuda code generator is not re-entrant and therefore does not support parallel builds. We opened a [PR](https://github.com/spcl/dace/pull/1995) in dace to make the code generator thread-safe, but we need to spend some time addressing the review comments and adding some tests.
### Make SDFG execution asynchronous with respect to python code
Profiling of the GTFN backend shows that GTFN does not synchronize the stencil execution at the end of the program call. This allows overlapping stencil computation on the GPU with python execution, and therefore compensates for the python overhead.
This feature is currently missing in the DaCe backend, simply because DaCe code generation enforces that the SDFG is fully executed before returning from the SDFG call. Besides, the generated CUDA code queries for available cuda streams at runtime, which does not enforce serialization of multiple SDFG programs on the same cuda stream, as happens for the GTFN backend.
Enabling asynchronous execution of the SDFG would bring a great improvement to the toolchain by hiding the python overhead.
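The principle can be illustrated with a minimal CuPy sketch (illustration only, not the DaCe backend implementation): the program call merely enqueues kernels on a cuda stream and returns, so python-side overhead overlaps with the GPU computation, and synchronization happens only when the result is actually needed.

```python
import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)


def run_program_async(a, b, out):
    # Kernels are enqueued on `stream` and control returns to python immediately.
    with stream:
        cp.multiply(a, b, out=out)


a = cp.random.rand(10_000_000)
b = cp.random.rand(10_000_000)
out = cp.empty_like(a)

run_program_async(a, b, out)  # returns before the GPU work has finished
# ... python-side overhead (e.g. preparing the arguments of the next program call)
# runs here, overlapped with the GPU computation ...
stream.synchronize()  # synchronize only when the result is needed
```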
## Appetite
The full cycle. However, the tasks corresponding to the different work items are independent and could be taken by different people.
## Solution
We present here only the solution for the optimization of the SDFG call.
The solution is to implement, during the first week of the cycle, a Python conversion module consisting of partial functions that map from the gt4py canonical form to the dace SDFG args list (see the sketch below). Once this is implemented and profiled, we should be able to decide whether this is sufficient or whether we should think about a different solution.
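A minimal sketch of such a conversion module, under the assumption that the SDFG arglist layout is known at decoration time; all names (`ArgSpec`, the extractor helpers, the metadata fields) are hypothetical and only illustrate the use of partial functions:

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class ArgSpec:
    """Hypothetical metadata describing one position in the SDFG arglist."""
    kind: str        # "field", "scalar", "size" or "stride"
    gt4py_pos: int   # position in the canonical gt4py argument list
    dim: int = 0     # dimension index for "size"/"stride" arguments


def _get_arg(args: Sequence[Any], *, pos: int) -> Any:
    return args[pos]


def _get_size(args: Sequence[Any], *, pos: int, dim: int) -> int:
    return args[pos].shape[dim]


def _get_stride(args: Sequence[Any], *, pos: int, dim: int) -> int:
    return args[pos].strides[dim]


def make_extractors(sdfg_arglist: Sequence[ArgSpec]) -> list[Callable[[Sequence[Any]], Any]]:
    """Built once at decoration time: one partial function per SDFG argument."""
    extractors: list[Callable[[Sequence[Any]], Any]] = []
    for spec in sdfg_arglist:
        if spec.kind in ("field", "scalar"):
            extractors.append(functools.partial(_get_arg, pos=spec.gt4py_pos))
        elif spec.kind == "size":
            extractors.append(functools.partial(_get_size, pos=spec.gt4py_pos, dim=spec.dim))
        else:  # "stride"
            extractors.append(functools.partial(_get_stride, pos=spec.gt4py_pos, dim=spec.dim))
    return extractors


def build_sdfg_args(extractors: Sequence[Callable], args: Sequence[Any]) -> list[Any]:
    """Hot path, executed at every program call: just evaluate the extractors."""
    return [extract(args) for extract in extractors]
```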
A different solution would be to write new bindings (e.g. with nanobind, like the GTFN backend) to the dace SDFG, which would ease the mapping of shape and stride symbols for array arguments. This would be a change in the dace module, to provide a new `fast_call` interface. This kind of change could be combined with the work in dace to write a better API for the SDFG call, see [Move the initialization of SDFG arglist outside the SDFG call](https://hackmd.io/ftO583jkSIid0cNjmaTDmw#Move-the-initialization-of-SDFG-arglist-outside-the-SDFG-call).
## No-Go
## Rabbit holes
## Progress
Here is a list of PRs grouped according to the work items in this project:
* Optimization of the SDFG call
* Improved SDFG fast call [#2029](https://github.com/GridTools/gt4py/pull/2029) (merged)
* Found a bug in the program decorator for caching of the offset providers, which introduced a big latency in the program call. Work in progress: [#2044](https://github.com/GridTools/gt4py/pull/2044)
* Make SDFG execution asynchronous with respect to python code
* Added a config option for async SDFG call [#2035](https://github.com/GridTools/gt4py/pull/2035) (merged)
* Support for parallel builds
* Done, the fix is merged in dace repo: https://github.com/spcl/dace/pull/1995
* Support for multi-node execution
* These two PRs solve the issue with loading the program binary in a multi-rank granule run: [#2047](https://github.com/GridTools/gt4py/pull/2047) and [#2059](https://github.com/GridTools/gt4py/pull/2059)
* Note that the multi-rank granule test requires configuration of the dace backend with `make_persistent=False`.
* Move the initialization of SDFG arglist outside the SDFG call
* We decided that no work is needed here for the project goal.
During the project, the following technical debt was identified:
* The module `gt4py.next.program_processors.runners.dace.workflow.factory` imports `gt4py.next.program_processors.runners.gtfn.FileCache`. There are two problems with this:
1. This is a wrong dependency, from dace to gtfn backend.
2. A `gtfn_cache` folder is created when the module defining `gtfn.FileCache` is imported, which looks very strange inside the cache folder of the dace backend.
During the project, the following new issues were identified:
* The gt4py persistent cache is not working for the `muphys` granule test (https://github.com/C2SM/icon4py/pull/742): the program is recompiled at each application run. This was observed on a laptop CPU run. Note that it works for the gtfn backend.
* On Santis, GPU execution of the `muphys` granule test (https://github.com/C2SM/icon4py/pull/742) hangs during the SDFG transformations, apparently on `MoveDataflowIntoIfBody`. Note that the SDFG transformations, as well as the entire granule test, work on Balfrin GPU using the `icon.v1.rc4` uenv.