# [GT4Py] Remove unnecessary overhead of `Program` calls

- Shaped by: @egparedes @havogt
- Appetite (FTEs, weeks): full cycle
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->

## Problem

Calling GT4Py programs in a fully JIT way entails a large Python processing overhead, because the call arguments must go through complex inspection and processing to identify the right version of the compiled code to call. Currently, not all of the potentially cacheable steps are actually cached in GT4Py, and the code has never been optimized for performance, which results in an overhead that is usually larger (sometimes by orders of magnitude) than the stencil computation itself.

## Appetite

~2 weeks. A large and unnecessary Python call overhead will kill any performance improvement in other parts of the toolchain, so this feature is essential for any model code using GT4Py. Additionally, development will profit, as pure-Python test cases can be run at the expected speed. Currently both Icon4py and PMAP-G are affected by this issue.

## Solution [DRAFT]

The optimization strategy should focus on reducing the overhead of the fast-call code path, which will likely be used many times across a model run, even if that makes the initialization steps slightly slower.
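To make the intent concrete, here is a minimal sketch of such a fast-call dispatch. All names (`ProgramDispatcher`, `compile_add`) are hypothetical and not the actual GT4Py API; the point is only that the warm path does a single cheap dict lookup, while all expensive processing runs once per argument signature:

```python
# Hypothetical sketch of a fast-call dispatch path (NOT the actual GT4Py API).
# The fast path is a single dict lookup keyed by the argument types; the
# expensive inspection/compilation runs only on a cache miss.

class ProgramDispatcher:
    def __init__(self, compile_fn):
        self._compile = compile_fn  # expensive: runs only on cache miss
        self._cache = {}            # argument-type key -> compiled callable

    def __call__(self, *args):
        key = tuple(type(a) for a in args)  # cheap cache key computation
        try:
            # Fast path: assume the compiled version is already cached.
            compiled = self._cache[key]
        except KeyError:
            # Slow path: heavy processing happens once per argument signature.
            compiled = self._cache[key] = self._compile(args)
        return compiled(*args)


# Toy usage: "compilation" just returns an implementation for the given types.
def compile_add(args):
    return lambda x, y: x + y

add = ProgramDispatcher(compile_add)
print(add(1, 2))      # first call with (int, int) compiles and caches
print(add(3, 4))      # subsequent calls hit the fast path
print(add(1.0, 2.5))  # a new signature triggers one more compilation
```

The design choice mirrors the strategy described above: initialization (the `except` branch) may be arbitrarily slow, as long as the warm path stays close to a plain dictionary lookup.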
For example, the default execution branch of a `try-except` block is usually faster than an equivalent conditional check:

```python
# This should be faster if we don't expect to enter the except block
try:
    cached_step_1()
    cached_step_2()
except CachedStepsAreInvalid:
    cached_step_1 = heavy_preprocessing_only_happens_once()
    cached_step_2 = other_heavy_preprocessing_only_happens_once(cached_step_1)

# This is usually slower for the `then` branch
if CachedStepsAreInvalid():
    cached_step_1 = heavy_preprocessing_only_happens_once()
    cached_step_2 = other_heavy_preprocessing_only_happens_once(cached_step_1)
else:
    cached_step_1()
    cached_step_2()
```

Microbenchmarks should be run across different CPython versions and architectures to make sure the implementation strategies are optimal. (Suggestions: `timeit()` microbenchmarks, or a toy test project using [`nox` with the `uv` backend](https://nox.thea.codes/en/stable/config.html#configuring-a-session-s-virtualenv) parameterized with different Python versions + [pytest-benchmark](https://pytest-benchmark.readthedocs.io/en/latest/).)

Also consider using `assert`s for basic sanity checks, which makes it very easy for the user to keep a safety net for catching errors during development and to disable it in production by just running the Python interpreter in optimized mode (`python -O`).

### Optimization steps

- Make sure all potentially cacheable workflow steps are cached and that the cache key computation is fast. Consider a multi-level cache key computation if it helps (e.g. first try using the `id()` of an instance as the cache key; if it is not found, compute its `hash` value, ...).
- _Freeze_ program arguments:
  - First, remove type checks and assume call arguments always have the same JIT-identifying attributes (e.g. types).
  - Later, evaluate whether it is also required to inline the actual values of some arguments which are expected to be constant across several calls (e.g. 
offset provider).

### Startup overhead

Additionally, the startup overhead of the first call is huge, even when the cache is already populated. A prototype to get rid of this overhead is implemented in https://github.com/GridTools/gt4py/pull/1474.

### Benchmark setup

Sync with the [DaCe optimization project](https://hackmd.io/6w2Oupl3QgWzxBawx6n7bA) to get a reasonable benchmark setup for ICON4Py.

## Rabbit holes

Avoid over-optimizing the code with fragile or convoluted optimization hacks which are not really needed for the input data sizes expected when running in a production HPC context. `Program`s from actual models like Icon4py (e.g. `StencilTest`s from the [icon4py stencil tests](https://github.com/C2SM/icon4py/tree/main/model/atmosphere/diffusion/tests/diffusion_stencil_tests)) and PMAP-G should be called with reasonable grid sizes to evaluate the actual overhead in context. The work should stop when reaching a predefined goal of reducing the `Program` call overhead to something around ~0.2%? of a single GPU-saturating call (assuming a warmup phase where all caches have been successfully initialized).

## No-gos

IR optimization passes are completely excluded from this project. The work in this project should only deal with the implementation details of the generated `Program` calls.

## Progress

<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->

- [x] Task 1 ([PR#xxxx](https://github.com/GridTools/gt4py/pulls))
  - [x] Subtask A
  - [x] Subtask X
- [ ] Task 2
  - [x] Subtask H
  - [ ] Subtask J
- [ ] Discovered Task 3
  - [ ] Subtask L
  - [ ] Subtask S
- [ ] Task 4
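As a starting point for the `timeit()` microbenchmarks suggested in the Solution section, the `try-except` vs. conditional comparison could be measured with a toy sketch like the following (the cache and step functions are hypothetical stand-ins, not GT4Py code):

```python
# Minimal timeit microbenchmark comparing a try/except fast path against an
# equivalent if/else check on a warm cache. Toy example, not GT4Py code.
import timeit

cache = {"step": lambda: None}

def call_try_except():
    # Warm path: the except branch is never taken once the cache is populated.
    try:
        cache["step"]()
    except KeyError:
        cache["step"] = lambda: None

def call_if_else():
    # The membership test is paid on every call, even when the cache is warm.
    if "step" not in cache:
        cache["step"] = lambda: None
    cache["step"]()

for fn in (call_try_except, call_if_else):
    elapsed = timeit.timeit(fn, number=1_000_000)
    print(f"{fn.__name__}: {elapsed:.3f}s")
```

Running such a script under each CPython version of interest (e.g. via the `nox`/`uv` setup mentioned above) would show whether the relative cost of the two styles holds across interpreters and architectures.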