# [DaCe] ITIR vs. Field View -- The Best Representation
###### tags: `cycle 19`
- Shaped by: Philip
- Appetite (FTEs, weeks): At least a whole cycle
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
Up to now it has not been clear which representation is the best starting point for going from GT4Py to DaCe.
There are two candidates:
- Iterator IR (ITIR):
  Used by the GTFN backend; it expresses stencil operations in a local, per-point sense.
- Field View IR (FV):
  Describes stencil operations in a global sense, essentially as NumPy operations with slicing/advanced indexing (see the sketch below).
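To make this distinction concrete, here is a toy 1D average stencil written both ways. This is plain NumPy/Python for illustration only, not actual ITIR or FV syntax:

```python
import numpy as np

inp = np.random.rand(10)

# Local (ITIR-like) formulation: the computation is defined per point,
# accessing neighbors through shifted positions of an iterator.
out_local = np.empty(8)
for i in range(1, 9):
    out_local[i - 1] = 0.5 * (inp[i - 1] + inp[i + 1])

# Global (field-view) formulation: the same stencil as a single
# whole-array expression over shifted slices.
out_fv = 0.5 * (inp[:-2] + inp[2:])

assert np.allclose(out_local, out_fv)
```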
From a theoretical point of view, both representations should be suitable for the task.
Thus the goal of this project is to answer, at least partially, the question of which representation we should use.
Furthermore, it is also unclear which of the produced SDFGs is better, and what "better" even means in this context.
## Appetite
I consider the most important aspect of this project to be the creation of a simple benchmarking infrastructure on which we can build.
- Fixing ITIR to DaCe $\sim$ 1 week
- Infrastructure
- Driver code $\sim$ 1 week
- Integrating stencils (including modifications) $\sim$ 1 week
- Benchmarking $\sim$ 1 week
- First Run
- Adaptations of the benchmarking.
## Solution
We will set up a small infrastructure that allows us to benchmark and compare stencils that were translated from ITIR, using GT4Py's current DaCe backend, with stencils that were translated using a prototype of a [Jax to DaCe translator (J2D)](https://hackmd.io/84NQjpebS8KXLB9_0PmhIA), as an approximation of an FV-based route.
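To illustrate the J2D route: a field-view style stencil written with `jax.numpy` is first lowered to a jaxpr, which is the representation the J2D prototype translates to an SDFG. The Laplacian here is a toy example, not one of the ICON stencils:

```python
import jax
import jax.numpy as jnp

def laplacian(inp):
    # Field-view formulation: whole-array slicing instead of per-point code.
    return (
        inp[:-2, 1:-1] + inp[2:, 1:-1]
        + inp[1:-1, :-2] + inp[1:-1, 2:]
        - 4.0 * inp[1:-1, 1:-1]
    )

# The printed jaxpr is what a Jax-to-DaCe translator starts from.
print(jax.make_jaxpr(laplacian)(jnp.zeros((10, 10))))
```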
#### Representative Stencils
To systematically evaluate the different routes we will use a set of _representative_ stencils from ICON.
The stencils listed below were suggested by [Christoph Müller](mailto:Christoph.Mueller@meteoswiss.ch) from MeteoSwiss as they capture the most common computational patterns.
- Neighbor Reductions:
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/mo_icon_interpolation_scalar_cells2verts_scalar_ri_dsl.py`
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/mo_velocity_advection_stencil_01.py`
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/mo_velocity_advection_stencil_20.py`
_As it is the best-studied stencil to date._
- `as_offset`:
In this initial phase of the project we will ignore these stencils, since the primitive is experimental in GT4Py and is only used by $\approx\!4$ stencils.
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/mo_solve_nonhydro_stencil_20.py`
- `model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/truly_horizontal_diffusion_nabla_of_theta_over_steep_points.py`
- Vertical Interpolation:
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/mo_solve_nonhydro_stencil_05.py`
- `model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/calculate_diagnostics_for_turbulence.py`
- Direct Neighbor Access:
- `model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/calculate_nabla4.py`
- `model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/calculate_nabla2_for_z.py`
- Scans:
With a few exceptions, scans are absent from the dycore, but they are very prevalent in the microphysics (it is not yet clear whether the microphysics will be ported at all).
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/mo_solve_nonhydro_stencil_52.py`
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/mo_solve_nonhydro_stencil_53.py`
- Fused:
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/fused_velocity_advection_stencil_1_to_7.py`
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/fused_velocity_advection_stencil_8_to_14.py`
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/fused_velocity_advection_stencil_15_to_18.py`
- `model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/fused_velocity_advection_stencil_19_to_20.py`
I have inspected all the suggested stencils and many of them should work out of the box with both translators.
However, some of them will have to be slightly modified, especially for the J2D translator.
#### Preliminary Work
As mentioned above, both translators have some issues that have to be fixed before a meaningful comparison can be done.
Note that the documents below only discuss the currently known issues.
###### ITIR to SDFG
Preliminary work related to the ITIR to SDFG translator is collected in document [[DaCe] Fixing ITIR to SDFG Issues For a Comparison](https://hackmd.io/@gridtools/ByRU0Qp_a).
###### Jax to SDFG
Preliminary work related to the Jax to SDFG translator is collected in document [[DaCe] Fixing Jax to SDFG Issues For a Comparison](https://hackmd.io/@gridtools/H1NZ2YfYp).
#### SDFG Metric
In order to compare the different translators we need a metric.
One aspect of this has to be performance.
However, since the translators are still far from optimal, there must be other aspects.
Let's use the fuzzy term "future potential (of an SDFG)" for this.
We (the GT4Py team) are most likely not able to properly judge an SDFG just by looking at it; we will have to rely on the SPCL people to tell us whether an SDFG is good or not, and for what reasons.
Another metric, suggested by Hannes Vogt, is the extensibility of the translators.
#### Actual Comparison
###### Infrastructure
To perform the actual comparison we will create a simple benchmarking infrastructure.
Its main purpose is to let us redo the comparison as we fix the translators and add new stencils.
The infrastructure will use the hook pattern, i.e. a base class implements the entire machinery for performing the actual benchmark.
A concrete test is then derived from it and only implements the steps needed to build the object that should be benchmarked, as in the sketch below.
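A minimal sketch of what such a base class could look like; all names and the `timeit`-based driver are hypothetical, not the final design:

```python
import abc
import timeit


class StencilBenchmark(abc.ABC):
    """Machinery for benchmarking one stencil; concrete tests fill in the hooks."""

    @abc.abstractmethod
    def build(self):
        """Translate the stencil and return the callable to benchmark."""

    @abc.abstractmethod
    def make_inputs(self) -> tuple:
        """Generate the (randomly initialized) input fields."""

    def run(self, repeat: int = 5, number: int = 20) -> float:
        # The driver is identical for every stencil and every translator;
        # only build() and make_inputs() differ between concrete tests.
        program = self.build()
        inputs = self.make_inputs()
        timings = timeit.repeat(lambda: program(*inputs), repeat=repeat, number=number)
        return min(timings) / number  # least-noisy per-call estimate
```

A concrete test for, say, an ITIR-translated stencil would then subclass `StencilBenchmark` and implement only the two hooks.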
###### Input Data
For running the stencils we will use several (about four) representative grids, which have yet to be selected.
We will most likely ask EXCLAIM for advice on that matter.
The grids and the data have to be large enough to saturate the GPU and host memory.
The fields will be randomly generated, e.g. as sketched below.
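A minimal sketch of such input generation, assuming hypothetical field shapes (the actual grid sizes are still to be decided):

```python
import numpy as np

def random_field(shape: tuple, dtype=np.float64, seed: int = 42) -> np.ndarray:
    """Reproducible random input field for the benchmarks."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, size=shape).astype(dtype)

# E.g. a cell/K field on a hypothetical grid with 50000 cells and 80 levels;
# the real shapes must be large enough to saturate the GPU.
theta = random_field((50_000, 80))
```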
###### Target Hardware
As target hardware we will use an A100 or H100, depending on availability.
###### Measurements
One of the primary quantities that we want to collect is runtime.
In our experience, the results obtained with the Python module `timeit` are quite reliable.
However, to gain more insight we will also explore different options.
Using `nvprof` might be difficult in an automated fashion, but it may be worth using for selected stencils.
Furthermore, DaCe has built-in capabilities to instrument the generated code, as sketched below.
However, it seems that it does not use the new CUDA profiling API [CUPTI](https://developer.nvidia.com/cupti), but it does support LIKWID, which seems to offer several counters on both the GPU and the CPU.
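A minimal sketch of DaCe's timer-based instrumentation on a toy program; the exact report API may differ between DaCe versions, and GPU event timing is available via a separate instrumentation type:

```python
import numpy as np
import dace

@dace.program
def axpy(a: dace.float64, x: dace.float64[1000], y: dace.float64[1000]):
    y[:] = a * x + y

sdfg = axpy.to_sdfg()
# Attach a wall-clock timer to every state of the SDFG.
for state in sdfg.states():
    state.instrument = dace.InstrumentationType.Timer

x, y = np.random.rand(1000), np.random.rand(1000)
sdfg(a=2.0, x=x, y=y)
print(sdfg.get_latest_report())  # timings collected during the last run
```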
## No-gos
Fixing other minor issues in the translators.
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Selecting the stencils.
- [ ] Fixing the issues in the translators
- [ ] [ITIR](https://hackmd.io/@gridtools/ByRU0Qp_a)
- [ ] [Jaxpr to SDFG](https://hackmd.io/@gridtools/H1NZ2YfYp)
- [ ] Creating (skeleton) benchmarking infrastructure
- [ ] Adding the stencils and adapting them as needed.
- [ ] Performing the tests (might be iterative as new bugs are discovered and tests have to be redone).
- [ ] Discussing the generated SDFGs with the SPCL people.