# Performance Measurement for Fused Diffusion

###### tags: `functional cycle 13`

Developers: Matthias or Christoph (after the other high-priority tasks)
Appetite: 2 weeks

## Background

In cycle 12, the stencils in the diffusion submodule were fused, analogous to an experiment once performed in dusk/dawn. In dusk/dawn this experiment led to a speedup of 1.48x for the diffusion submodule compared to OpenACC. Since gt4py behaves quite differently with respect to splitting the computational load into CUDA kernels, it is of interest to measure the runtime of those fused gt4py stencils. Preferably, this task is done on Balfrin.

## Known Steps

* Measure the runtime using the ICON timers, or measure kernel-wise runtimes with a CUDA profiler (e.g. Nsight Systems/Compute). A minimal timing sketch is appended at the end of this note.
* Try the imperative backend. Does the runtime change? Is the imperative backend able to compile the fused stencils in the first place? (See the backend-comparison sketch at the end of this note.)
* If the runtime is bad, perform a time-boxed profiling effort to understand what is wrong.
* If the CUDA kernels emitted by GT4Py suffer from over-computation, proceed to the steps below.

## Multiple Field Operators per Stencil Program (if required)

Currently, it is only possible to have one field operator per stencil program, at least for compiled programs that are integrated into Fortran. It would be interesting to lift this restriction, for example because it would provide a simple mechanism to break up gt4py's aggressive inlining of everything into a single CUDA kernel. A sketch of what such a program could look like is appended at the end of this note.

## Known Steps for Multiple Field Operators per Stencil Program (if required)

* Support passing more than one compute domain from ICON to the DSL.
* How does this interact with [this commit](https://github.com/C2SM/icon4py/commit/6c3d069bf8eb5c50b70646ad6cf24e36890b6b57)?
* There are assertions in icon4py that demand that only a single field operator be present, e.g. [here](https://github.com/C2SM/icon4py/blob/main/pyutils/src/icon4py/pyutils/backend.py#L57). Enumerate them, try to understand why they are there, and lift them subsequently.
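## Appendix: Code Sketches

A minimal timing sketch for the first known step. It assumes the fused diffusion program is available as a callable gt4py program (the name `apply_diffusion_fused` and its arguments below are placeholders, not the actual icon4py names) and that the fields live on a CUDA device, so CuPy events can measure GPU time while the first call, which triggers compilation, is excluded.

```python
# Minimal GPU timing sketch. `program` is any compiled gt4py program; the
# concrete fused-diffusion program and its arguments are placeholders.
import cupy as cp

def time_program(program, *args, n_iter=20, **kwargs):
    program(*args, **kwargs)             # warm-up call: triggers JIT compilation
    cp.cuda.Device().synchronize()
    start, stop = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    for _ in range(n_iter):
        program(*args, **kwargs)
    stop.record()
    stop.synchronize()
    return cp.cuda.get_elapsed_time(start, stop) / n_iter   # mean time in ms

# Usage (names are placeholders for the actual icon4py objects):
# mean_ms = time_program(apply_diffusion_fused, diagnostic_state, prognostic_state,
#                        offset_provider=grid.offset_providers)
# print(f"fused diffusion: {mean_ms:.3f} ms per call")
```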
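For the imperative-backend question, a sketch along these lines could compare the functional and imperative GTFN backends on the same fused program. The runner names and import path (`run_gtfn_gpu`, `run_gtfn_imperative`) as well as the `with_backend` method are assumptions that depend on the installed gt4py version and should be verified; `time_program` is the helper from the previous sketch.

```python
# Backend comparison sketch. The runner names/import path below are assumptions
# (they vary between gt4py versions); verify against the installed gt4py.
from gt4py.next.program_processors.runners.gtfn import run_gtfn_gpu, run_gtfn_imperative

def compare_backends(program, *args, **kwargs):
    backends = [("gtfn functional", run_gtfn_gpu), ("gtfn imperative", run_gtfn_imperative)]
    for name, backend in backends:
        try:
            mean_ms = time_program(program.with_backend(backend), *args, **kwargs)
        except Exception as exc:
            # Answers: is the imperative backend able to compile the fused stencil at all?
            print(f"{name}: failed to compile/run ({exc})")
        else:
            print(f"{name}: {mean_ms:.3f} ms per call")
```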
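For the second part of the pitch, a sketch of what a stencil program with more than one field operator, each with its own compute domain, could look like. The type-annotation and `domain` syntax may differ between gt4py versions, and the dimensions, field operators, and bounds are made up for illustration only.

```python
# Sketch of a gt4py program containing two field operators, each restricted to
# its own compute domain. Dimension/field names and bounds are illustrative.
import gt4py.next as gtx
from gt4py.next import Field, int32

CellDim = gtx.Dimension("Cell")
KDim = gtx.Dimension("K", kind=gtx.DimensionKind.VERTICAL)

@gtx.field_operator
def _smooth(f: Field[[CellDim, KDim], float]) -> Field[[CellDim, KDim], float]:
    return 0.5 * f

@gtx.field_operator
def _scale(f: Field[[CellDim, KDim], float], w: float) -> Field[[CellDim, KDim], float]:
    return w * f

@gtx.program
def two_step(
    f: Field[[CellDim, KDim], float],
    tmp: Field[[CellDim, KDim], float],
    out: Field[[CellDim, KDim], float],
    w: float,
    cell_start: int32, cell_end: int32,
    k_start: int32, k_end: int32,
):
    # Two field-operator calls in one program, each with its own compute domain.
    # This is the shape of program that the icon4py bindings generator currently
    # rejects via its single-field-operator assertions.
    _smooth(f, out=tmp, domain={CellDim: (cell_start, cell_end), KDim: (k_start, k_end)})
    _scale(tmp, w, out=out, domain={CellDim: (cell_start, cell_end), KDim: (k_start, k_end)})
```

Whether such a program ends up in one or several CUDA kernels, and therefore whether it actually breaks up the inlining, is exactly what the measurement steps above should clarify.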