[DaCe] Toolchain Runtime Support

  • Shaped by: Edoardo
  • Appetite (FTEs, weeks):
  • Developers:

Problem

There are multiple work items in this project:

  • Optimization of the SDFG call
  • Move the initialization of SDFG arglist outside the SDFG call (down-prioritized, see task description)
  • Support for multi-node execution
  • Support for parallel builds
  • Make SDFG execution asynchronous with respect to Python code (newly added, should be prioritized)

Optimization of the SDFG call

This is the most important task, because it poses a risk for the usage of the GTIR-DaCe backend in the ICON granules.

Profiling of the granule execution with the DaCe backend shows that the preparation of the SDFG call takes a very long time. This code is part of the decoration stage in the OTF workflow and runs before the SDFG is executed by means of the DaCe CompiledSDFG.fast_call() method.

There are some obvious opportunities in the current code to reduce the overhead. For example, a runtime exception guarded by some expensive checks should be turned into an assert. However, this is not enough, and this project aims at implementing more static bindings from gt4py program arguments to SDFG call arguments.
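
As a minimal illustration of the assert idea (all names below are hypothetical, not the actual gt4py code), moving the validation behind an assert lets it be skipped entirely in optimized runs:

```python
def expensive_consistency_check(args) -> bool:
    # Placeholder for the costly validation currently performed on every call.
    return all(a is not None for a in args)

def prepare_call(args) -> None:
    # Current pattern: the check runs unconditionally on every call.
    if not expensive_consistency_check(args):
        raise ValueError("inconsistent SDFG call arguments")

def prepare_call_with_assert(args) -> None:
    # Proposed pattern: the check only runs in debug mode and is skipped
    # entirely when Python is started with -O.
    assert expensive_consistency_check(args), "inconsistent SDFG call arguments"
```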

The term bindings is not entirely appropriate here. It is rather a translation module from the canonical format of gt4py program arguments to the format of SDFG arguments required by the CompiledSDFG.fast_call() method.

The canonical argument list of gt4py programs starts with the sequence of arguments as defined in the program itself. The list is extended with the output field, if it is not already part of the program arguments. In addition, the symbols used for implicit domain definitions are appended to the argument list. However, the shape and strides of the field array representation remain attributes of the underlying array objects. Note that field arguments can be tuples of fields, in which case they remain tuples.

The SDFG arglist is a flat sequence (no nested tuples) of ctypes values. This sequence contains first all array arguments, in alphabetical order, then all scalar arguments (including SDFG symbols), also in alphabetical order. Therefore, some conversion is needed (see the sketch after the following list):

  • Tuples need to be flattened
  • The position of gt4py fields and scalars does not correspond to the position in the SDFG call.
  • Array shape and strides need to be represented as scalar values, and mapped to the correct argument position in the SDFG call.
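
A minimal sketch of such a translation, assuming NumPy/CuPy-like array buffers; the helper name to_sdfg_arglist and the shape/stride symbol naming scheme are illustrative assumptions, not the actual gt4py or dace implementation:

```python
from typing import Any, Sequence

def to_sdfg_arglist(names: Sequence[str], values: Sequence[Any]) -> list[Any]:
    """Illustrative translation from the gt4py canonical argument form to a
    flat, fast_call-style list: arrays first, then scalars, each group in
    alphabetical order, with shape/strides exposed as extra scalar symbols.
    The ctypes wrapping of the resulting values is omitted here."""
    flat: dict[str, Any] = {}
    for name, value in zip(names, values):
        if isinstance(value, tuple):  # tuples of fields are flattened
            for i, item in enumerate(value):
                flat[f"{name}_{i}"] = item
        else:
            flat[name] = value

    arrays: dict[str, Any] = {}
    scalars: dict[str, Any] = {}
    for name, value in flat.items():
        if hasattr(value, "__array_interface__") or hasattr(value, "__cuda_array_interface__"):
            arrays[name] = value
            # shape and strides become scalar symbols of the SDFG
            # (byte strides converted to element strides; symbol names are hypothetical)
            for dim, (size, stride) in enumerate(zip(value.shape, value.strides)):
                scalars[f"__{name}_size_{dim}"] = size
                scalars[f"__{name}_stride_{dim}"] = stride // value.itemsize
        else:
            scalars[name] = value

    # arrays first, then scalars, each group in alphabetical order
    return [arrays[n] for n in sorted(arrays)] + [scalars[n] for n in sorted(scalars)]
```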

Move the initialization of SDFG arglist outside the SDFG call

The SPCL group agrees that the current API for calling an SDFG should be improved. One problem we have highlighted from the gt4py side is that the first SDFG call also performs the initialization of the ctypes interface used to parse the SDFG call arguments.

In dace, we should find a solution to move the _construct_args() call into a separate method, so that it can be executed at compile time.
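
A minimal sketch of how this could look from the caller's side, assuming DaCe's current CompiledSDFG interface where _construct_args() returns the (call, init) ctypes tuples consumed by fast_call(); the helper name below is illustrative:

```python
from dace.codegen.compiled_sdfg import CompiledSDFG

def make_fast_caller(csdfg: CompiledSDFG, **static_kwargs):
    """Build the ctypes argument tuples once (e.g. at decoration/compile time)
    and return a cheap callable that only performs the raw library call.
    Only valid as long as the buffers and symbol values stay the same."""
    argtuple, initargtuple = csdfg._construct_args(static_kwargs)  # expensive, once

    def call() -> None:
        csdfg.fast_call(argtuple, initargtuple)  # cheap, per call

    return call
```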

Support for multi-node execution

The GTIR-DaCe backend seems to work fine with multi-node parallel execution on icon4py main. However, some preliminary tests show problems when we use precompiled programs with static-arg variants. The icon4py main branch does not use precompiled programs; instead, it still builds the programs on the first call on each node.

The problem is related to loading a CompiledSDFG object. It could be that loading the same program on multiple ranks is problematic, but this is just speculation.

Support for parallel builds

The dace cuda code generator is not re-entrant and therefore does not support parallel builds. We opened a PR in dace to make the code generator thread-safe, but we still need to address the review comments and add some tests.
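
For context, a minimal sketch of a parallel build driver on the caller's side, assuming a hypothetical build_one() compile callable (not an existing gt4py or dace API):

```python
from concurrent.futures import ThreadPoolExecutor

def build_all(programs, build_one, max_workers: int = 8):
    """Compile several programs concurrently. Running build_one from multiple
    threads is only safe once the dace cuda code generator is re-entrant;
    otherwise the builds must be serialized."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_one, programs))
```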

Make SDFG execution asynchronous with respect to Python code

Profiling of the GTFN backend shows that GTFN does not synchronize the stencil execution at the end of the program call. This allows the stencil computation on the GPU to overlap with Python execution and therefore compensates for the Python overhead.

This feature is currently missing in the DaCe backend, simply because DaCe code generation enforces that the SDFG is fully executed before returning from the SDFG call. Besides, the generated CUDA code queries for available cuda streams at runtime, so multiple SDFG programs are not serialized on the same cuda stream, as happens for the GTFN backend.

Enabling asynchronous execution of the SDFG would bring a great improvement to the toolchain, since it would hide the Python overhead.
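
For illustration, a minimal sketch of the intended pattern using CuPy (program_call is a placeholder for an asynchronous SDFG launch, not an existing API): GPU work is enqueued on a single stream without synchronizing, Python execution overlaps with the GPU, and synchronization happens only when the results are needed.

```python
import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)

def run_async(program_call, *args) -> None:
    """Enqueue the GPU work on `stream` and return immediately, so that the
    Python overhead of the next call overlaps with GPU execution."""
    with stream:
        program_call(*args)  # no device synchronization here

# ... more Python work / further programs enqueued on the same stream ...

stream.synchronize()  # synchronize only when the results are actually needed
```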

Appetite

The full cycle. However, the tasks corresponding to the different work items are independent and could be taken by different people.

Solution

We present here only the solution for the optimization of the SDFG call.

The solution is to implement, during the first week of the cycle, a Python conversion module that consists of partial functions for mapping from the gt4py canonical form to the dace SDFG args list (a sketch is shown below). Once this is implemented and profiled, we should be able to decide whether it is sufficient or whether we need a different solution.
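
A minimal sketch of the partial-function idea; the names are illustrative, and the real converter would additionally flatten tuples and extract shape/stride symbols as in the earlier to_sdfg_arglist sketch:

```python
import functools
from typing import Any, Sequence

def translate(positions: Sequence[int], *values: Any) -> list[Any]:
    """Per-call part: place each gt4py argument at its precomputed position
    in the SDFG argument list."""
    out: list[Any] = [None] * len(positions)
    for src, dst in enumerate(positions):
        out[dst] = values[src]
    return out

def make_arg_converter(positions: Sequence[int]):
    """Decoration-time part: bind the precomputed gt4py-to-SDFG position map,
    so that the per-call overhead is reduced to a single reordering pass."""
    return functools.partial(translate, positions)

# Example: with the precomputed map [2, 0, 1], the gt4py arguments (a, b, c)
# are reordered to the SDFG order [b, c, a].
convert = make_arg_converter([2, 0, 1])
assert convert("a", "b", "c") == ["b", "c", "a"]
```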

A different solution would be to write new bindings (e.g. nanobind, like the GTFN backend) to the dace SDFG, which ease the mapping of shape and stride symbols for array arguments. This would be a change in the dace module, providing a new fast_call interface. This kind of change could be combined with the work in dace to define a better API for the SDFG call, see Move the initialization of SDFG arglist outside the SDFG call.

No-Go

Rabbit holes

Progress

Here is a list of PRs grouped according to the work items in this project:

  • Optimization of the SDFG call

  • Make SDFG execution asynchronous with respect to Python code

    • Added a config option for async sdfg call #2035 (merged)
  • Support for parallel builds

  • Support for multi-node execution

    • These two PRs solve the issue with loading the program binary in multi-rank granule runs: #2047 and #2059
    • Note that the multi-rank granule test requires configuring the dace backend with make_persistent=False.
  • Move the initialization of SDFG arglist outside the SDFG call

    • We decided that no work is needed here for the project goal.

During the project, the following technical debt was identified:

  • The module gt4py.next.program_processors.runners.dace.workflow.factory imports gt4py.next.program_processors.runners.gtfn.FileCache. There are two problems with this:
    1. This is a wrong dependency, from the dace backend to the gtfn backend.
    2. A gtfn_cache folder is created at import time of gtfn.FileCache, which looks out of place inside the cache folder of the dace backend.

During the project, the following new issues were identified:

  • The gt4py persistent cache is not working for the muphys granule test (https://github.com/C2SM/icon4py/pull/742): the program is recompiled on each application run. This was observed in a CPU run on a laptop. Note that it works with the gtfn backend.

  • On Santis, GPU execution of the muphys granule test (https://github.com/C2SM/icon4py/pull/742) seems to hang in the MoveDataflowIntoIfBody SDFG transformation. Note that the SDFG transformations, as well as the entire granule test, work on Balfrin GPU using the icon.v1.rc4 uenv.