There are multiple work items in this project:
This is the most important task, because it represents a risk for usage of the GTIR-DaCe backend in the ICON granules.
The profiling of the granule execution with the DaCe backend shows that the preparation of the SDFG call takes a very long time. This code is part of the decoration stage in the OTF workflow, before the SDFG is executed by means of the DaCe CompiledSDFG.fast_call() method.
There are some obvious opportunities in the current code to reduce the overhead. For example, there is a runtime exception with some expensive checks that should be turned into an assert. But this is not enough, and this project aims at implementing some more static bindings from gt4py program arguments to SDFG call arguments.
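On the first point, the change is simply to demote the expensive validation to an assert, so that it is skipped when Python runs with optimizations enabled. A minimal sketch, with hypothetical function and check names (not the actual gt4py code):

```python
def _check_argument_types(args, expected_types):
    """Expensive validation of the SDFG call arguments (placeholder)."""
    return all(isinstance(arg, tp) for arg, tp in zip(args, expected_types))


def _prepare_sdfg_call(args, expected_types):
    # Before: the check was always paid, even in production runs:
    #     if not _check_argument_types(args, expected_types):
    #         raise RuntimeError("invalid SDFG call arguments")
    # After: the check only runs in debug mode and is stripped under `python -O`.
    assert _check_argument_types(args, expected_types), "invalid SDFG call arguments"
```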
The term bindings is not really appropriate here. It is more a translation module from the canonical format of gt4py program arguments to the format of SDFG arguments as required by the CompiledSDFG.fast_call() method.
The canonical argument list of a gt4py program starts with the sequence of arguments as defined in the program itself. The list is extended with the output field, if it was not already part of the program arguments. In addition, the symbols used for implicit domain definitions are appended to the argument list. However, the shape and strides of the field array representation remain attributes of the underlying array objects. Note that field arguments can be tuples of fields, in which case they remain tuples.
The SDFG arglist is a flat sequence (no nested tuples) of ctypes values. This sequence contains first all array arguments, in alphabetical order, then all scalar arguments (including SDFG symbols), also in alphabetical order. Therefore, there is some conversion to be done: tuples of fields have to be flattened into individual arrays, shape and stride values have to be extracted from the array objects and passed as scalar symbols, scalar values have to be converted to ctypes, and all arguments have to be reordered into the arrays-first, alphabetical layout expected by the SDFG.
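A minimal sketch of this conversion, assuming NumPy arrays for the fields; the helper names and the naming scheme of the shape/stride symbols are illustrative only and do not correspond to the actual gt4py or dace code:

```python
import ctypes

import numpy as np


def _flatten_tuples(names, values):
    """Expand tuple arguments into individual (name, value) pairs."""
    for name, value in zip(names, values):
        if isinstance(value, tuple):
            for i, item in enumerate(value):
                yield f"{name}_{i}", item  # element naming scheme is assumed
        else:
            yield name, value


def to_sdfg_arglist(names, values):
    """Map the gt4py canonical arguments to a flat SDFG-style arglist:
    arrays first, then scalars (including shape/stride symbols),
    both in alphabetical order."""
    arrays, scalars = {}, {}
    for name, value in _flatten_tuples(names, values):
        if isinstance(value, np.ndarray):
            arrays[name] = value
            # Shape and strides are attributes of the array in gt4py, but the
            # SDFG expects them as separate scalar symbols (names assumed here).
            for dim, (size, stride) in enumerate(zip(value.shape, value.strides)):
                scalars[f"__{name}_size_{dim}"] = size
                scalars[f"__{name}_stride_{dim}"] = stride // value.itemsize
        else:
            scalars[name] = value
    ordered_arrays = [arrays[name] for name in sorted(arrays)]
    ordered_scalars = [
        ctypes.c_long(scalars[name])
        if isinstance(scalars[name], int)
        else ctypes.c_double(scalars[name])
        for name in sorted(scalars)
    ]
    return ordered_arrays + ordered_scalars
```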
The SPCL group agrees that the current API for calling an SDFG should be improved. One problem we have highlighted, from the gt4py side, is that the first SDFG call also performs the initialization of the ctypes interface used to parse the SDFG call arguments.
In dace, we should find a solution to move the _construct_args() call to a separate method, so that it can be executed at compile time.
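As an illustration of the idea only (this is not an existing dace API; _construct_args() is a private helper of CompiledSDFG and its exact signature may differ between dace versions), one could imagine precomputing the argument tuples once and reusing them:

```python
# Sketch only: precompute the ctypes argument tuples once, e.g. at decoration
# time, and reuse them for every subsequent call of the compiled SDFG.

class PrecomputedSDFGCall:
    def __init__(self, compiled_sdfg, **static_kwargs):
        self.compiled_sdfg = compiled_sdfg
        # Today this parsing happens implicitly inside the first CompiledSDFG call;
        # here it would be triggered eagerly, before the program is ever executed.
        self.argtuple, self.initargtuple = compiled_sdfg._construct_args(static_kwargs)

    def __call__(self):
        # Skip argument parsing entirely and go straight to the fast path.
        return self.compiled_sdfg.fast_call(
            self.argtuple, self.initargtuple, do_gpu_check=False
        )
```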
The GTIR-DaCe backend seems to work fine with multi-node parallel execution on icon4py main. However, some preliminary tests show problems when we use precompiled programs with static-arg variants. The icon4py main branch does not use precompiled programs; instead, it still builds the programs on first call on each node.
The problem is related to loading a CompiledSDFG object. It could be that loading the same program on multiple ranks is problematic, but this is just speculation.
The dace CUDA code generator is not re-entrant and therefore does not support parallel builds. We opened a PR in dace to make the code generator thread-safe, but we need to spend some time to address the review comments and add some tests.
The profiling of the GTFN backend shows that GTFN does not synchronize the stencil execution at the end of the program call. This makes it possible to overlap stencil computation on the GPU with Python execution and therefore compensate for the Python overhead.
This feature is currently missing in the DaCe backend, simply because DaCe code generation enforces that the SDFG is fully executed before returning from the SDFG call. Besides, the generated CUDA code queries for available CUDA streams at runtime, which does not serialize multiple SDFG programs on the same CUDA stream, as happens for the GTFN backend.
Enabling asynchronous execution of the SDFG would bring a great improvement to the toolchain, by hiding the Python overhead.
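The desired behavior can be illustrated with a small CuPy sketch (this is only a conceptual illustration of the overlap we are after, not how the DaCe backend or its generated code currently works): all programs are enqueued on the same stream and control returns to Python immediately, so the host-side overhead of the next call is hidden behind the GPU work still in flight.

```python
import cupy as cp

# One stream shared by all program calls, mirroring the GTFN behavior of
# serializing the programs on a single CUDA stream.
_stream = cp.cuda.Stream(non_blocking=True)


def run_program_async(program, *fields):
    # Enqueue the GPU work and return immediately: the Python-side overhead of
    # preparing the next program call overlaps with this computation.
    with _stream:
        program(*fields)


def wait_for_results():
    # Synchronize only when the results are actually needed on the host.
    _stream.synchronize()
```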
The full cycle. However, the tasks corresponding to the different work items are independent and could be taken on by different people.
We present here only the solution for the optimization of the SDFG call.
The solution is to implement, during the first week of the cycle, a Python conversion module that consists of partial functions mapping from the gt4py canonical form to the dace SDFG args list. Once this is implemented and profiled, we should be able to decide whether it is sufficient or whether we should think about a different solution.
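A minimal sketch of what such a conversion module could look like, assuming that at decoration time we know, for every SDFG argument, where its value comes from in the canonical argument list; the helper names and the spec format are purely illustrative, not the actual gt4py design:

```python
import functools


def _positional(index, args):
    return args[index]


def _tuple_element(index, element, args):
    return args[index][element]


def _array_stride(index, dim, args):
    arr = args[index]
    return arr.strides[dim] // arr.itemsize


_GETTERS = {"pos": _positional, "tuple": _tuple_element, "stride": _array_stride}


def make_arglist_converter(spec):
    """Build a converter from a static `spec`: one (kind, *static_info) entry
    per SDFG argument, in the order expected by CompiledSDFG.fast_call()."""
    getters = [functools.partial(_GETTERS[kind], *info) for kind, *info in spec]

    def convert(args):
        # At call time only these cheap, pre-bound getters are evaluated.
        return tuple(getter(args) for getter in getters)

    return convert
```

For example, make_arglist_converter([("pos", 1), ("stride", 0, 0), ("pos", 0)]) would return a function that, given the canonical argument tuple, produces the three corresponding SDFG arguments in the required order, with all static bookkeeping done once at decoration time.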
A different solution would be to write new bindings to the dace SDFG (e.g. with nanobind, like the GTFN backend) that ease the mapping of shape and stride symbols for array arguments. This would be a change in the dace module, to provide a new fast_call interface. This kind of change could be combined with the work in dace to write a better API for the SDFG call, see Move the initialization of SDFG arglist outside the SDFG call.
Here is a list of PRs grouped according to the work items in this project:
Optimization of the SDFG call
Make SDFG execution asynchronous with respect to Python code
Support for parallel builds
Support for multi-node execution
Move the initialization of SDFG arglist outside the SDFG call
During the project, the following technical debt was identified:
gt4py.next.program_processors.runners.gtfn.dace.workflow.factory imports gt4py.next.program_processors.runners.gtfn.FileCache. There are two problems with this: the gtfn_cache folder is created at import time of the module gtfn.FileCache, and it looks very strange inside the cache folder of the dace backend.

During the project, the following new issues were identified:
The gt4py persistent cache is not working for the muphys granule test (https://github.com/C2SM/icon4py/pull/742). The program is recompiled at each application run. This was observed in a CPU run on a laptop. Note that it works for the gtfn backend.
On Santis, GPU execution of the muphys granule test (https://github.com/C2SM/icon4py/pull/742) shows that the SDFG transformations seem to hang in MoveDataflowIntoIfBody. Note that the SDFG transformations, as well as the entire granule test, work on Balfrin GPU using the icon.v1.rc4 uenv.