# DaCe/GT4Py weekly meeting

## 2024-11-20

- DaCe upgraded to v1.0.0 in GT4Py.
- No topics to discuss.

## 2024-11-06

- Lex continued the overview of ongoing DaCe projects:
  - GPU auto-tiling: based on auto-tuning, should be applicable to any SDFG program.
  - Optimizer for CPU/GPU co-scheduling, based on a normalized training set.
  - DaCeML: support for auto-differentiation starting from a Python application. Alternative approach to JAX-DSL.
- SPCL is interested in the SDFGs generated from ICON4Py:
  - Improve the stability of SDFG transformations and target the Gordon Bell Prize next year.
  - The GT4Py team can present to the DaCe team how to generate SDFGs in ICON4Py. A date at the end of November could work; on the other hand, it is desirable to wait for the DaCe v1.0 release and demo the GTIR lowering.

## 2024-10-23

- Introduction of new member, Giacomo Castiglioni, and round-the-table introductions.
- Lex gave an overview of ongoing projects in DaCe:
  - Fortran frontend
    - Giacomo will help with extending the Fortran frontend to support tracer advection granules (ECRAT was probably mentioned).
  - Representation of data objects in nested SDFG (https://github.com/spcl/dace/issues/1695).
  - Brief mentions of other ongoing projects:
    - auto-tiling
    - auto-differentiation

## 2024-10-02

### GTIR

- Briefly discussed how to represent masked arrays, in the sense of local neighbor fields that contain skip values.
- Requested review on the DaCe PR that extends the SDFG transformation for removal of trivial tasklets.

## 2024-09-18

### DaCe Orchestration

- Dycore requires a list of `PrognosticState` structs. For this we could use a `ContainerArray`, but the DaCe frontend seems to fail parsing it.
  - Probably we need to register the type of the array elements.
  - For practical purposes, given that the list has fixed length 2, just unpack it and expose 2 structs.

## 2024-08-14

### DaCe Orchestration

- DaCe Python frontend fails to create `View` nodes for `dace.data.Structure` field members.
  Support for data structures is an experimental feature, and most likely its representation and API will undergo some modification. For now, we can try a transformation pass from Lex's development branch (`multi_sdfg` branch): whenever it encounters a node trying to access a structure member (e.g. some `structure.field1`), it splits the data string on the field separator (`.`) and creates the hierarchy of view nodes.
- How to represent descriptors on user-defined data classes, so that in DaCe-orchestrated programs we do not need to create wrapper objects? We need to understand why the approach using `args_descr` on the DaCe program was deprecated.

### GTIR

Discussion about different representations for programs with multiple output fields, where a field is used as both input and output. The read-write order of tasklets from access nodes inside a map scope is undefined. A possible solution is to serialize the tasklets, in order to force the correct order. Another approach is to use a temporary field and serialize the field update, by using a separate state to write back the updated field.

## 2024-08-07

No topics to discuss.

## 2024-07-31

- Philip described the work done in GT4Py development cycle 23 on SDFG transformations for map fusion and K-blocking. He also presented benchmarking results on the nabla4 stencil. The custom transformations achieve performance on par with a hand-optimized CUDA version of the same stencil.
- The current transformation tries to fuse as many map nodes as it can in the SDFG. We would like to investigate smart criteria to decide whether to apply map fusion. Lex suggested getting in contact with Yakup to discuss GPU transformations. SPCL has also worked on estimation of compute intensity and register utilization, and DaCe should already have APIs to get this information.

## 2024-07-24

- Ben Weber is leaving SPCL in September, to start a PhD in the Electrical Engineering department.
- Philip updated on the status of SDFG transformations and showed some initial benchmarking results:
  - The reference test case is the nabla4 stencil, which Philip has written in Combined-IR. The DaCe backend lowers the Combined-IR to SDFG automatically, although the lowering program is not merged in the GT4Py repo yet.
  - The SDFG contains many map nodes. The optimization developed by Philip applies map fusion to create one large map over the `Edge` and `K` dimensions. In addition, the `Edge` stride is hard-coded to 1.
  - An additional optimization is to extract the maps that do not depend on the `K` index, and compute these results first. The K-map nodes are forced to sequential execution on each CUDA thread. This transformation makes it possible to apply K-blocking, with a step in loop unrolling.
  - The benchmarking results show that the manually optimized CUDA version (written by Ioannis) is still faster. One issue in the DaCe-generated program seems to be register pressure.

## 2024-07-03

- Validation of SDFGs now detects if an SDFG returns a scalar, which should be invalid (not supported).
- No additions to auto-optimize in the latest DaCe release. Work is in progress to make simplify more stable. SPCL is working on helpers for transformations, a kind of common utilities to manipulate the SDFG.
- No particular issue or topic to discuss.

## 2024-06-26

- DaCe orchestration: the structure type in DaCe currently requires the definition of lots of symbols for all strides of internal arrays. Work is ongoing on the DaCe side to reduce the complexity of this representation.
- Discussed the differences between symbols and scalars, and between free symbols and used symbols, in the SDFG representation.
- Lex is working on transforming symbols defined on inter-state edges to map-scope symbols. This should enable more inlining of nested SDFGs and therefore better analyzability.

## 2024-05-22

### Combined-IR to SDFG

- [Edoardo] Working on translation of the GT4Py pattern for neighbor reductions on unstructured grids.
- Suggestion from SPCL to start benchmarking the small test applications, to identify potential bottlenecks in local optimization.

### DaCe Fortran bindings for ICON-DSL

- [Christoph] For ICON-DSL we need to link all source files generated for the stencil SDFGs (~100 SDFGs) to the Fortran application. We encounter link errors caused by symbol collisions, since DaCe code generation uses the same names for internal functions in .cu/.cpp files, for example:
  ```
  __dace_init_cuda
  __dace_exit_cuda
  __dace_gpu_set_stream
  __dace_gpu_set_all_streams
  tasklet_toplevel_map_0_0_1
  __dace_runkernel_tasklet_toplevel_map_0_0_1
  ```
  Creating a separate compilation unit for each SDFG is not a viable approach, because we do not want to modify the ICON build configuration too much. We could modify the code generation by using the stencil name as a namespace or function prefix.

## 2024-05-15

### JaCe

- Does DaCe support CUDA code generation for multiple GPUs on the same node (no MPI)?
  - No, there is no support. There are several reasons for not exposing GPU selection, as a map attribute, to the user.

### Combined-IR to SDFG

- Large single tasklet vs. several small tasklets within map scope.
  - Better to use small tasklets, to enable reordering.
- Use or not use transient scalars to connect tasklets.
  - We should use scalars, because memlets are supposed to write to or read from a data container. Memlets without data access nodes are not analyzable, and can lead to unexpected bugs during SDFG transformations. As long as we use transient scalars as single-assignment nodes, the compiler will be able to remove unnecessary data movements.

## 2024-05-08

### DaCe Fortran bindings for ICON-DSL

- Use of symbols in the `__dace_init` function for allocation of internal temporary storage.
  - How to ensure that a one-size allocation fits all runtime symbol configurations, which could change from run to run?
    For example, the computation domain could be different, and the corresponding horizontal/vertical start/stop symbols should not be used to dimension temporary arrays.
- We can constrain code generation to use N CUDA streams, then we override the streams with the set provided by OpenACC at runtime.
- An open question is how to get access to the OpenACC GPU memory pool, so that DaCe CUDA allocations for temporary arrays use the OpenACC memory pool.

## 2024-04-17

### DaCe orchestration

- DaCe orchestration can be reused by ECMWF.
  - Christos is in contact with Christian for adoption of DaCe orchestration.

### Combined-IR to SDFG

- Approach to lowering Combined-IR to SDFG: start from scratch or extend the existing ITIR DaCe backend?
  - We are currently starting from scratch, as we won't target programs in ITIR representation for the foreseeable future.

### Jax2DaCe (JaCe): What's the goal?

- Make the prototype production-ready.
- Extend the support for JAX primitives.
- Make pyhpc run as fast as possible (depending on time budget, end of next cycle).

### Other topics/questions

- Status of multi-node execution in SDFG
  - early stage, large scope, requires domain decomposition
  - limited support for mpi4py
- Access to icon4py in order to generate stencil SDFGs
  - provide early access now so SPCL can generate and run the ITIR SDFGs
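
Side note on the symbol collisions from the 2024-05-22 notes: the "stencil name as function prefix" idea can be tried as a post-processing step on the generated sources, before touching the code generator. The following is only a minimal sketch under stated assumptions. It does not use any DaCe API, the helper `prefix_symbols` is hypothetical, the stencil name `nabla4` is just an example, and the sample source lines (including `state_t`) are made up for illustration.

```python
import re

# Symbol names copied from the link-error example in the 2024-05-22 notes.
COLLIDING_SYMBOLS = [
    "__dace_init_cuda",
    "__dace_exit_cuda",
    "__dace_gpu_set_stream",
    "__dace_gpu_set_all_streams",
    "tasklet_toplevel_map_0_0_1",
    "__dace_runkernel_tasklet_toplevel_map_0_0_1",
]


def prefix_symbols(source: str, stencil_name: str, symbols=COLLIDING_SYMBOLS) -> str:
    """Hypothetical helper: prepend `stencil_name` to whole-identifier matches."""
    # Longest names first, so a short name is never preferred over a longer one.
    alternatives = "|".join(
        re.escape(s) for s in sorted(symbols, key=len, reverse=True)
    )
    # '_' counts as a word character for \b, so a name that is embedded in a
    # longer identifier (e.g. inside __dace_runkernel_...) is not matched.
    pattern = re.compile(rf"\b(?:{alternatives})\b")
    return pattern.sub(lambda m: f"{stencil_name}_{m.group(0)}", source)


# Made-up excerpt standing in for a generated .cu/.cpp file.
generated = (
    "void __dace_init_cuda(state_t *s);\n"
    "void __dace_runkernel_tasklet_toplevel_map_0_0_1(state_t *s);\n"
    "void __dace_gpu_set_stream(state_t *s, int id);\n"
)
prefixed = prefix_symbols(generated, "nabla4")
print(prefixed)
```

A text-level rewrite like this is fragile (it also touches matches inside strings and comments), so it is only meant to check whether prefixing resolves the link errors; the namespace or prefix option in the code generation itself remains the cleaner fix.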