GT4Py, Eve & Dace
General Strengths and Weaknesses of Eve & Dace
1. Code Generation
- Dace provides code generators for many different architectures; the design of the SDFGs makes code generation for all of them relatively simple.
- Eve allows simple template-based code generation. This makes most domain-specific optimizations easy to implement.
- Dace’s generality could help in later stages of the project, for example for fusing boundary conditions with stencils in local models. Other features like reductions or branching might also be quite easy to add at a later stage, though not without some drawbacks (see below).
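To make the template-based approach concrete, here is a minimal sketch of the idea using only Python's standard `string.Template`. This is a hypothetical illustration, not the actual Eve API: the point is that an IR node maps directly to a source snippet, so the generated output pattern stays fully under our control.

```python
from string import Template

# Hypothetical illustration (not the actual Eve API): a template that a
# code generator could fill in for a vertical loop IR node.  Because the
# output text is spelled out directly, domain-specific code shapes are
# easy to produce and inspect.
K_LOOP_TEMPLATE = Template(
    "for (int k = $k_start; k < $k_end; ++k) {\n"
    "  $body\n"
    "}"
)

def generate_k_loop(k_start: int, k_end: int, body: str) -> str:
    """Render a vertical loop from the template; a real Eve generator
    would dispatch on IR node types in a similar template-driven way."""
    return K_LOOP_TEMPLATE.substitute(k_start=k_start, k_end=k_end, body=body)
```

The readability advantage mentioned below follows directly: what the template says is what the user reads.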
2. (Domain-specific) Low-level Optimizations
- Eve’s code generation should be flexible enough to generate whatever low-level optimizations are required. This could be an advantage for architectures that require a very particular code structure, inline assembly, or similar optimizations.
- This also allows Eve to generate user-readable code. Dace-generated code is not particularly nice to read, but still OK.
- Implementing domain-specific optimizations like GT’s CUDA thread placements or shuffles (as used in a more general j-scan implementation than the one currently in GT’s cuda_horizontal backend) within Dace requires specific plugins for its code generator. So while it is nice that Dace allows such plugins, it also shows that Dace’s built-in code generation might not be enough to reach full performance.
- Some optimizations like k-caches are inherently hard to get right in Dace’s graph-based representation. They require quite complex logic inside the pattern matching, which is neither easy to write nor easy to maintain.
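A toy example of why the pattern matching gets intricate (this is plain Python, not DaCe's actual transformation API): even recognizing the simplest producer → transient → consumer chain needs explicit neighborhood bookkeeping, and a real k-cache pattern adds many more side conditions (offsets, schedules, memlet subsets) on top.

```python
# Toy stand-in for a graph-pattern applicability check (hypothetical,
# not DaCe's API).  Nodes are named strings, kinds classify them, and
# edges are (src, dst) pairs.
def find_fusable_transients(edges, node_kinds):
    """Return transients with exactly one producer map and exactly one
    consumer map -- roughly the shape a k-cache pattern starts from."""
    producers, consumers = {}, {}
    for src, dst in edges:
        consumers.setdefault(src, []).append(dst)
        producers.setdefault(dst, []).append(src)
    matches = []
    for node, kind in node_kinds.items():
        if kind != "transient":
            continue
        prods = producers.get(node, [])
        cons = consumers.get(node, [])
        # Real patterns would additionally check offsets, schedules,
        # memlet subsets, write-conflict resolution, etc.
        if (len(prods) == 1 and len(cons) == 1
                and node_kinds[prods[0]] == "map"
                and node_kinds[cons[0]] == "map"):
            matches.append(node)
    return matches
```

Each extra side condition multiplies the cases the matcher has to handle, which is exactly the maintenance burden described above.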
3. Data Flow Optimizations
- Dace is of course very good at data flow optimizations. But here, too, its generality can make things quite complex, so we should make use of domain knowledge whenever it leads to simplifications.
- For more complex optimizations that require combined knowledge of data flow and the domain, such as performing halo exchanges at the right point in time, things still get tricky quickly. Maybe some work for Torsten’s group?
Possible Strategies for GTC
- Pure Eve: Dace is not used at all. This requires a custom data flow analysis for some optimizations, and also some state machine if control flow is to be supported inside the DSL.
- Eve for the part we know well (low-level, “multistage”-level optimization), Dace for full-program data flow optimizations. That is:
- Transform some Eve IR to a Dace SDFG using library nodes (Eve IR can be stored directly within the library nodes).
- Optimize the data flow in Dace without lowering.
- Transform back to the same Eve IR and lower from there.
- Perform low-level optimizations and code generation in Eve.
- Leave Eve as soon as possible:
- Transform some Eve IR (potentially pre-optimized) to a Dace SDFG using library nodes.
- Optimize the data flow in Dace, perform all lowering in Dace, while keeping and using domain knowledge as long as possible (possibly multiple levels of library nodes).
- Code generation using the Dace code generator. May require plugins to achieve all known domain-specific optimizations.
- Leave Eve as late as possible, but use Dace code generators.
- Perform all optimizations with Eve (using multiple levels of IRs).
- Lower to a low-level SDFG-compatible Eve IR (not using library nodes), with all optimizations included (e.g. k-caches) and generate an SDFG from there.
- Use the Dace code generator on this low-level SDFG, possibly applying general Dace transformations beforehand for further optimization.
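The library-node round trip in the strategies above can be sketched with plain dataclasses (hypothetical names, not the DaCe library-node API): the Eve IR is stored opaquely inside a graph node, the graph level only reorders or fuses nodes, and the payload comes back out untouched for Eve-side lowering.

```python
from dataclasses import dataclass

# Hypothetical sketch of a library node carrying opaque Eve IR.
@dataclass
class LibraryNode:
    name: str
    eve_ir: object            # opaque payload, never inspected here
    inputs: tuple = ()
    outputs: tuple = ()

def roundtrip(nodes, graph_transform):
    """Apply a graph-level transformation (e.g. a reordering) and hand
    the unchanged Eve IR payloads back for Eve's own lowering."""
    return [n.eve_ir for n in graph_transform(nodes)]
```

The key property is that the graph optimizer never needs to understand the payload, so Eve keeps full control of the low-level steps.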
Other Things to Think About
Some issues noticed during experimentation with GT4Py + Dace that should not be forgotten:
Dace-related:
- Code generation and compilation from Dace can be extremely slow, at least on Ault (where the file systems still seem to be quite bad). Debugging the GPU code generation was thus approximately as annoying as debugging GridTools template metaprogramming. There was not much time for investigation; disabling some `inspect.getframeinfo` calls already sped up the Python side quite a bit. But most of the time seemed to be spent in CMake and compilation, so this is probably not a problem of the Dace-internal transformations and should be improvable.
In general:
- We should have an easy-to-use debugging facility that can verify that untransformed code (and/or a NumPy reference or similar) produces the same output as the optimized code, given the same input (at least with some probability, i.e. over multiple runs on random input). When applying a whole chain of optimization passes, this facility should be able to detect after which pass the output becomes wrong, which is painful to find manually as soon as a few transformations are applied. Such a tool could save a huge amount of time when moving from simpler codes where everything works to more complex ones.
- We probably also need integration tests that make sure all optimizations are applied as expected. Tests at least at the complexity level of GTBench might be needed to make sure we don’t accidentally break previously working optimizations.
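Both points above can be sketched in a few lines of Python. All names here are hypothetical; the first helper bisects a pass chain against a reference on random inputs, the second asserts that expected transformations were actually applied (as a pipeline might report them).

```python
import numpy as np

# Hypothetical debugging helper: run the pass chain incrementally and
# report the first pass after which the optimized program no longer
# matches the reference on random input.
def first_broken_pass(program, passes, run, reference, n_trials=3, seed=0):
    """Return the index of the first pass whose output diverges from
    `reference`, or None if the whole chain agrees.

    program:   opaque IR object
    passes:    list of functions IR -> IR
    run:       function (IR, input array) -> output array
    reference: function input array -> expected output array
    """
    rng = np.random.default_rng(seed)
    inputs = [rng.random(16) for _ in range(n_trials)]
    for i, p in enumerate(passes):
        program = p(program)
        for x in inputs:
            if not np.allclose(run(program, x), reference(x)):
                return i
    return None

# Hypothetical integration-test helper: fail if the pipeline did not
# apply the expected transformations often enough, instead of only
# checking numerical results.
def assert_optimizations_applied(applied, expected):
    """applied: dict transformation-name -> times applied;
    expected: dict transformation-name -> minimum count."""
    missing = {name: n for name, n in expected.items()
               if applied.get(name, 0) < n}
    if missing:
        raise AssertionError(f"optimizations not applied: {missing}")
```

Wired into CI, the first helper turns "somewhere in ten passes the result broke" into a single failing pass index, and the second catches silently dropped optimizations before they cost a benchmark run.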