# ICON dycore optimizations

- Shaped by: Christoph, Till
- Appetite (FTEs, weeks): 1.5 FTE, 1 cycle Till (GT4Py related work), 0.5 cycle Christoph (ICON related work)
- Developers:

## Problem / Objective

Increase the performance of the ICON dycore. The main focus of the project is optimization. We explicitly allow intermediate workarounds (that can easily be cleaned up later) when no short-term solution is available.

Dusk / Dawn showed they could achieve a 1.4x speedup. There are too many unknowns right now to estimate how far we will get, but the goal is to squeeze out as much as possible with reasonable workarounds.

## Appetite

Full cycle

## Solution

### Steps:

- Enable the temporary pass to use symbols for the number of vertices, edges, and cells (to allow code generation without a mesh). The proposed solution right now is to make this an option to the backend, pass the symbols there, and then also pass them at runtime. This is a workaround that we want to resolve properly in the next cycles.
- Enable temporaries on all stencils executed on (almost) the entire domain (e.g. stencils executed only on the lateral boundary are not considered).
- Check that everything works and has the same performance with the temporary pass. Fix small issues and default heuristics.
- Allow for stencil-by-stencil activation of temporaries in the integration, similar to how the imperative backend can be activated stencil by stencil.
- Implement temporary heuristics (continuation of project from last cycle: https://hackmd.io/oJR7phI8S0yUOfnmz2lTug)
- Test with all fused stencils (including fused nh solve stencils written during the cycle [fusion-project](https://hackmd.io/iQsw7AbCSBC2q5OYSZpTrg))

### Optimization workflow:

- Add timers for fused stencils
- Use Christopher's work to get these numbers from a Jenkins run
- Weigh timings such that we know which stencils to focus optimization efforts on
- Optimize stencil-by-stencil in isolation in Icon4Py using Sam's work (that way we don't need to run ICON to do the actual optimizations)

### Observations:

- Stencils executed on the lateral boundary are _not targeted_, as the temporary pass does not support (smart) extent analysis in the horizontal domain and just computes everywhere (which is large compared to the lateral boundary alone).
- Even for stencils on the inner domain, it is not clear how the horizontal overcomputation will affect performance (hopefully we can answer this after the cycle).
- `nproma` needs to be greater than the number of edges in the beginning so that we only have one block.

### Separate optimization strategies:

- Play around with 128 or (32, 4) tiling
- Register pressure, intentional spilling with launch bounds

## Rabbit holes

## No-gos

- Horizontal extents (start and end indices) and horizontal extent inference for temporaries will only be done if all other goals are reached.

## Tasks

- [ ] Make a single-GPU setup with a relevant size for performance optimizations (Christoph)
- [ ] Serialize with liskov and write small icon4py utils for setting up test cases with serialized data (Hannes)
- [ ] Write utils needed for extraction heuristics / extend trace shifts (Till)
- [ ] Use fused velocity advection stencils for first optimizations. Benchmarks from diffusion could serve as a rough guideline on what to focus on until Liskov supports fused stencils and we actually have timings for velocity advection (Till).
- [ ] Add the number of vertices, edges, and cells as arguments to stencil calls (manually) and use them in the temporary pass (needed for blue line)
- [ ] Test temporaries in blue line (Christoph). Blocked by task above.
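
The "weigh timings" step of the optimization workflow could look roughly like the sketch below. It assumes each fused stencil reports a mean time per call and a call count per time step; the `StencilTiming` class, the `rank_by_weight` helper, the stencil names, and all numbers are hypothetical placeholders, not measurements or existing icon4py APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StencilTiming:
    """Timer result for one fused stencil (hypothetical structure)."""

    name: str             # stencil identifier (placeholder names below)
    time_per_call: float  # mean seconds per call, e.g. from a timer report
    calls_per_step: int   # how often the stencil runs per time step


def rank_by_weight(timings):
    """Return (name, share) pairs sorted by share of total runtime, largest first."""
    totals = {t.name: t.time_per_call * t.calls_per_step for t in timings}
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("total runtime must be positive")
    shares = {name: total / grand_total for name, total in totals.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented placeholder values, only to show the ranking.
    sample = [
        StencilTiming("fused_velocity_advection_a", 2.0e-4, 2),
        StencilTiming("fused_velocity_advection_b", 1.0e-4, 10),
        StencilTiming("fused_nh_solve_a", 5.0e-4, 1),
    ]
    for name, share in rank_by_weight(sample):
        print(f"{name}: {share:.1%}")
```

Weighting by calls per step matters because a cheap stencil that runs many times per step can outweigh an expensive one that runs once.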