# [Blueline] Optimize dycore

- Shaped by: Till, Christoph
- Appetite (FTEs, weeks): full cycle
- Developers:

## Problem

The fused dycore stencils can already use temporaries to improve run and compile times. However, exactly where and under which conditions the temporaries are generated has not yet been examined, especially with respect to run time improvements.

The next issue is that Liskov generates GPU memory allocations in substitution mode, because the code generated by the bindings generator requires them to work. Both projects need to be changed to remove these allocations, in order to improve runtime and reduce the memory footprint.

Lastly, as discussed with NVIDIA employees several times, there are some optimizations specific to NVIDIA GPUs, such as L2 cache hints and launch bounds, which are worth trying.

## Appetite

Full cycle

## Solution

* *Understanding performance*: Gather performance data, find stencils with poor performance, and try to understand the reasons. Verify that the performance numbers we get from the timers in ICON can actually be trusted (e.g. using CUDA events).
* *Improve benchmarking infrastructure*: More automation of the performance measurements would let us iterate faster. This has to happen on `tsa` or `clariden` since `balfrin` will be offline for the next ~5 weeks.
    * Write a script to run the 3 versions (acc, no fuse, fuse) and then parse the output to compare runtimes of fused segments, probably based on nvprof output.
    * Then improve the heuristic which places temporaries to improve runtime.
* *Fix Liskov memory allocations*: Make the code generated by the bindings generator no longer depend on the memory allocation of the `_before` fields, then change Liskov so that these memory allocations are no longer generated.
* *Low level optimization / performance tuning*: Several optimizations have been discussed with NVIDIA:
    * Give L2 cache hints so neighbor lists stick in cache (ask Dmitry). Figure out if this is possible for gtfn.
    * Try to force spilling with launch bounds to increase occupancy. Figure out if this is possible with gtfn.
    * Try to make the neighbor lists stick in cache by using `TILE(32,4)` instead of 128 threads in the horizontal dimension. This is definitely possible with gtfn.
* *High level optimizations*: Look at the generated ITIR of the stencils with poor performance and fine-tune the optimization passes (e.g. temporary pass). We explicitly do not plan to invest time in writing new passes, as we have reached the limit of what we can do on ITIR (this time is better spent on combined IR).

## Rabbit holes

?

## No-gos

?

## Progress

- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
    - [x] Subtask A
    - [x] Subtask X
- [ ] Task 2
    - [x] Subtask H
    - [ ] Subtask J
- [ ] Discovered Task 3
    - [ ] Subtask L
    - [ ] Subtask S
- [ ] Task 4
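
As a possible starting point for the comparison script mentioned under *Improve benchmarking infrastructure*, the sketch below compares per-segment runtimes of the three versions (acc, no fuse, fuse). It is only an illustration under stated assumptions: the `segment,seconds` input format, the segment names, and all numbers are hypothetical, and the step that extracts such lines from the actual profiler output is deliberately left out.

```python
"""Compare per-segment runtimes of the three variants (acc, no fuse, fuse)."""

VARIANTS = ("acc", "no_fuse", "fuse")


def parse_timings(text):
    """Parse lines of the hypothetical form 'segment,seconds' into a dict.

    Blank lines and lines starting with '#' are skipped. Repeated segments
    are summed, so several invocations of one segment accumulate.
    """
    timings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, seconds = line.rsplit(",", 1)
        name = name.strip()
        timings[name] = timings.get(name, 0.0) + float(seconds)
    return timings


def compare(per_variant, baseline="acc"):
    """Return {segment: {variant: speedup relative to baseline}}.

    Only segments that were measured in all three variants are compared.
    """
    common = set.intersection(*(set(per_variant[v]) for v in VARIANTS))
    table = {}
    for segment in sorted(common):
        base = per_variant[baseline][segment]
        table[segment] = {v: base / per_variant[v][segment] for v in VARIANTS}
    return table


def slowest_fused(table, threshold=1.0):
    """List segments where the fused variant is not faster than the baseline."""
    return [s for s, row in table.items() if row["fuse"] <= threshold]


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    logs = {
        "acc": "seg_a,2.0\nseg_b,1.0\n",
        "no_fuse": "seg_a,2.5\nseg_b,1.25\n",
        "fuse": "seg_a,1.0\nseg_b,2.0\n",
    }
    per_variant = {v: parse_timings(t) for v, t in logs.items()}
    table = compare(per_variant)
    for segment, row in table.items():
        print(segment, {v: round(s, 2) for v, s in row.items()})
    print("fused not faster than acc:", slowest_fused(table))
```

Segments flagged by `slowest_fused` would be natural candidates for a closer look at where temporaries are placed. Summing repeated segments is a design choice of this sketch; averaging per invocation may be more appropriate depending on how the measurements are collected.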