# [Blue line] Measure performance and support dace in blue line
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by: Till, Christoph
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->
### Performance
In the last cycle we developed infrastructure to measure and analyse the performance of the blue line. That project was primarily aimed at guiding performance optimizations in GT4Py and did not yet give us quantitative numbers on where we stand with performance overall. The question we would like to answer now (and easily monitor in the future) is: how fast is the blue line with OpenACC compared to a GT4Py-enabled version for (1) the use cases and (2) the hardware we care about? To answer this question we face several subtasks and open questions:
1. While we can build icon-exclaim with OpenACC enabled and GT4Py disabled, the resulting code is not a vanilla icon-nwp: it differs in order to make verification work. As a result, the OpenACC version performs very poorly for some stencils, which artificially makes OpenACC look bad.
2. We don't really know what percentage of time we spend in code covered by GT4Py.
3. We have only run experiments on TSA, but we actually care about performance on Balfrin (A100) and Santis (GH200).
4. We lack an understanding of how to interpret the ICON timers. How do they compare to the NVTX range measurements we have made?
5. We have just picked an experiment that fully utilizes the GPU. We should have something that actually represents our use cases (both for MCH and Exclaim, i.e. LAM and global).
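
For item 2, a minimal sketch of how the share of runtime spent in GT4Py-covered code could be computed once per-region timings are available (for example from ICON timers or NVTX ranges). The `TimedRegion` type, the region names, and all numbers are hypothetical placeholders, not measurements. The overall-speedup estimate uses Amdahl's law and assumes the code not covered by GT4Py runs at unchanged speed.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TimedRegion:
    """One timed region of a run (hypothetical structure, not an ICON or NVTX type)."""

    name: str
    seconds: float
    covered_by_gt4py: bool


def gt4py_coverage_fraction(regions: Iterable[TimedRegion]) -> float:
    """Return the fraction of total measured time spent in GT4Py-covered regions."""
    regions = list(regions)
    total = sum(r.seconds for r in regions)
    if total <= 0:
        raise ValueError("total measured time must be positive")
    covered = sum(r.seconds for r in regions if r.covered_by_gt4py)
    return covered / total


def amdahl_overall_speedup(covered_fraction: float, covered_speedup: float) -> float:
    """Estimate the whole-run speedup if only the covered fraction gets faster."""
    if not 0.0 <= covered_fraction <= 1.0:
        raise ValueError("covered_fraction must be in [0, 1]")
    if covered_speedup <= 0:
        raise ValueError("covered_speedup must be positive")
    return 1.0 / ((1.0 - covered_fraction) + covered_fraction / covered_speedup)


if __name__ == "__main__":
    # Placeholder values for illustration only; real timings must come from a run.
    example = [
        TimedRegion("dycore_stencils", 6.0, covered_by_gt4py=True),
        TimedRegion("physics", 3.0, covered_by_gt4py=False),
        TimedRegion("io_and_setup", 1.0, covered_by_gt4py=False),
    ]
    fraction = gt4py_coverage_fraction(example)
    print(f"GT4Py-covered fraction: {fraction:.2f}")
    print(f"Overall speedup if covered code is 2x faster: "
          f"{amdahl_overall_speedup(fraction, 2.0):.2f}")
```

The covered fraction bounds what any GT4Py speedup can achieve for the full run, which is why it is worth measuring before comparing per-stencil numbers.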
### Dace in blue line
Currently, dace-generated stencils cannot be executed in the blue line. Being able to run them there would allow us to verify them and measure their performance. To enable this, the icon4pygen code needs to be extended to target the dace-generated C interface.
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Support balfrin in setup.sh script
- [ ] Make sure all production MCH experiments pass for all variants (ICON-CH1, ICON-CH2, kenda-CH1)
  - [x] build verification
  - [x] build substitution
  - [ ] build verification fused
  - [ ] build substitution fused
- [ ] Measure performance of gt4py + gtfn with at least one MCH production experiment
- [ ] Support dace in blue line
  - [x] Generate dace .h .cpp .cu files which provide C interface to dace
  - [x] Adjust icon4pygen to support calling dace from ICON
  - [x] Adjust cmake to compile generated dace code
  - [ ] Fix code generation of combined dimension stencils such as ECV
  - [ ] Pass correct cuda stream from OpenACC to dace
  - [ ] Create spack recipe for py-dace
  - [ ] Adjust no-spack jenkins plan for tsa
  - [ ] Adjust spack jenkins plan for daint/balfrin
- [ ] Measure performance of gt4py + gtfn with at least one MCH production experiment