# Using GTScript in DaCe programs ###### tags: `cycle 3` `dace` The goal of this project is to implement frontend features and utilities in the DaCe and GT4Py repositories to integrate GTScript stencils in DaCe programs. To drive this development, a DaCe enabled versions, each representable by a single SDFG, of * The acoustic substep of the FV3 dycore * [MPDATA_GT4PY](https://github.com/ckuehnlein/MPDATA_GT4PY) application developed by Christian Kühnlein at ECMWF. will be developped. ### Program perspective: DaCe between stencils So far, GT4Py has demonstrated good optimizability on CPU and GPU for single stencils in small applications running on single nodes. However, limitations imposed by GTScript cause all existing user applications to rely on computations performed in pure python or other libraries such as numpy to integrate individual GTScript stencils into a program. This approach limits the scope of optimization to GTScript stencils but does not allow to exploit the optimization oportunities arising when looking at a whole program. The project shaped in this document provides a fast path forward to combine pure python components, Numpy and GTScript components in a single SDFG and hence a single compiled application, which can be wholistically optimized in future cycles. The program-scope view will enable the study of an entire program and identifying important performance aspects when going beyond the limitations of single-kernel, single-node stencils in subsequent ShapeUp cycles. This will provide a learning experience for design aspects for the GT4Py toolchain when such features are added to the core GT4Py toolchain and demonstrate the scalability and performance portability of the DSL and JIT approaches. #### Acoustic substep standalone FV3 To keep this development focused on the requirements of real applications, we will adapt two programs to a code structure which is amenable to this approach. As the largest application using GT4Py to date, FV3 is an interesting target for this. However, the large size of the codebase and its object oriented code structure which is not readily supported in the DaCe frontend are expected to make a transition within a cycle impossible. Adding support for the code structure will involve an iterative improvement on supported features and smaller changes to the FV3 code. We therefore target a standalone application which has to be extracted from the FV3 dycore during the cycle. This standalone application will contain most of the code patterns that are in the dycore, but has far fewer lines of code and is hence more amenable to experimental restructuring. ## Goals * Develop the tooling necessary to integrate an SDFG representation of GTScript stencils in DaCe programs. * As a driver for this development, the project focuses on the creation of a DaCe enabled versions, representable by a single SDFG, of two programs: * the [MPDATA_GT4PY](https://github.com/ckuehnlein/MPDATA_GT4PY) application developed by Christian Kühnlein at ECMWF. * The acoustic substep as a standalone program. The extraction of this substep and defining grids and removed features is part of the cycle. * The main product of this project should be frontend in the DaCe and GT4Py repositories. The general structure of the user codebases should be preserved and only be changed where necessary. In this way, it is ensured that the result is not only working for a snapshot of the user application, but can actually be maintained with negligible effort by the developers. In case of the FV3 standalone, the result should serve as a blueprint for the adaption of the remaining FV3 dycore. ### No-Goals * There is no focus on performance optimization in this project * This project does not aim to create a generic Python-to-C++ compiler. While the approach generalizes to more applications, some restrictions apply. This is mostly equivalent to the restrictions imposed by using DaCe's `dace.program` decorators, which are documented in the DaCe framework. However, adding functionality to both DaCe and GT4Py are within the scope of the cycle. ## Examples This is a simple diffusion program in 3D, which is already working as a proof of concept, which showcases a number of features that could not be done with current GTScript, namely: * reduction (`field.max()`) * iteration loop * boundary conditions ```python import dace import gt4py import numpy as np from gt4py import gtscript from gt4py.gtscript import Field @gtscript.as_sdfg def diffusion_step(c_in: Field[np.float32], c_out: Field[np.float32], delc: Field[np.float32], coeff: np.float32, dt: np.float32): with computation(PARALLEL), interval(1, -1): delc = ( -6.0 * c_in[0, 0, 0] + c_in[1, 0, 0] + c_in[0, 1, 0] + c_in[0, 0, 1] + c_in[-1, 0, 0] + c_in[0, -1, 0] + c_in[0, 0, -1] ) c_out = c_in + dt * coeff * delc delc = delc if delc>=0 else -delc dt = np.float32(0.1) coeff = np.float32(0.5) domain = (10, 10, 10) I = domain[0] J = domain[1] K = domain[2] @dace.program def swap(x, y): return y, x @dace.program def diffusion_periodic(c: dace.float32[I, J, K], threshold: dace.float32, maxiter: dace.int64): c_haloed = np.empty((I + 2, J + 2, K + 2), dtype=np.float32) c_tmp = np.empty_like(c_haloed) delc = np.empty((I, J, K+2), dtype=np.float32) delc[...] = threshold+1.0 c_haloed[1:I + 1, 1:J + 1, 1:K + 1] = c it = 0 tmp_c_tmp = np.empty_like(c_tmp[1:I + 1, 1:J + 1, 0:K + 2]) tmp_delc = np.empty((I, J, K-2), dtype=np.float32) tmp_delc[...] = delc[:,:,1:K-1] while tmp_delc.max() > threshold and it<maxiter: # set periodic BC: c_haloed[0, 1:J + 1, 1:K + 1] = c_haloed[I, 1:J + 1, 1:K + 1] c_haloed[I + 1, 1:J + 1, 1:K + 1] = c_haloed[1, 1:J + 1, 1:K + 1] c_haloed[1:I + 1, 0, 1:K + 1] = c_haloed[1:I + 1, J, 1:K + 1] c_haloed[1:I + 1, J + 1, 1:K + 1] = c_haloed[1:I + 1, 1, 1:K + 1] c_haloed[1:I + 1, 1:J + 1, 0] = c_haloed[1:I + 1, 1:J + 1, K] c_haloed[1:I + 1, 1:J + 1, K + 1] = c_haloed[1:I + 1, 1:J + 1, 1] # preparing output slice tmp_c_tmp[...] = c_tmp[1:I + 1, 1:J + 1, 0:K + 2] # stencil call diffusion_step(c_in=c_haloed, c_out=tmp_c_tmp, delc=delc, coeff=coeff, dt=dt) # writing back results c_tmp[1:I + 1, 1:J + 1, 0:K+2] = tmp_c_tmp tmp_delc[...] = delc[:,:,1:K-1] # swap c_tmp[...], c_haloed[...] = swap(c_tmp, c_haloed) it += 1 c[...] = c_haloed[1:I + 1, 1:J + 1, 1:K + 1] if __name__ == '__main__': c = np.zeros(domain, dtype='float32') c[0:2, 0:2, 0:2] = 1.0 res = diffusion_periodic(c, threshold=1e-8, maxiter=1000) print(c.min(), c.max()) ``` As a further proof of concept, an example of the vertical remap in FV3 has been implemented in this way. There, more features are emulated: * Applying stencils to different grids (staggered and unstaggered) * Arbitrary for-loop based operations in columns * Reusing program parts with slight compile-time parametrization in a single compiled unit ## Appetite / Project Plan * An initial suggestion is 1 dev for improvements to `@dace.program` and `@gtscript.as_sdfg` as well as applying them to FV3 acoustic substep. A second dev will be needed to adapt MPDATA in the second half of the cycle. Ideal candidates to develop this are Linus for the full cycle with Till joining for the second half. * Two people for the first two weeks by Vulcan to extract the acoustic substep standalone * Continuous part-time support by DaCe team * An active interaction with Christian Kühnlein and the FV3 Team is needed for the adaption part to ensure that proposed solutions are viable <style type="text/css"> td.inactive {background-color: #2e2e2e; } </style> <table> <thead> <tr> <th></th> <th>Week 1</th> <th>Week 2</th> <th>Week 3</th> <th>Week 4</th> <th>Week 5</th> <th>Week 6</th> </tr> </thead> <tbody> <tr> <td>CSCS 1 (Linus)</td> <td colspan=2> <tt>@dace.program</tt> and <tt>@gtscript.as_sdfg</tt> solutions to support existing code patterns </td> <td colspan=4> adapt acoustic substep, refine <tt>@dace.program</tt> and <tt>@gtscript.as_sdfg</tt> as needed </td> </tr> <tr> <td>CSCS 2 (Till)</td> <td colspan=3 class="inactive"></td> <td colspan=3>adapt MPDATA, refine <tt>@dace.program</tt> and <tt>@gtscript.as_sdfg</tt> as needed</td> </tr> <tr> <td>Vulcan</td> <td colspan=2>extract acoustic substep</td> <td colspan=4 class="inactive"></td> </tr> </tbody> </table>