# Design of a full-program-optimization framework for FV3Core

## Goal

To reach our performance goal, we need to enable full-program optimization across the entire dynamical core. We want to build this framework on the vertical demonstrator by Linus. To extend and stabilize the approach taken there, the framework should follow a design document, still to be written, that clearly states goals, limitations and requirements. We want DaCe to be our primary implementation of all required features, while keeping entry points for other tools to also do full-program optimization, provided they fulfill the requirements documented for each step.

After significant use, the current implementation showed some limitations that we would like to tackle. These are outlined in the tasks below.

## Prior work

The vertical demonstrator `computepath`, implemented [here](https://github.com/ai2cm/fv3core/blob/3348c6a2d15d539a273cb61bd070fed5c35aac70/fv3core/decorators.py#L577), showed very promising initial results and should be the basis for how we move forward.

## Tasks

To solidify the way our full-program optimization works, we want to tackle the following issues:

### 1. Stabilize performance irrespective of the nature of calls

#### Problem

The caching of the internal representation of the program state is not yet fully mature. As a result, call times differ vastly depending on whether the call to the full program is inside a loop, outside a loop, preceded by an earlier call to the program, or made with an SDFG passed in.
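
The behaviour we are after can be stated as an invariant: the expensive build step runs once, no matter how the program is called afterwards. The following is a toy sketch of that invariant only. `ToyProgram`, `_build` and `build_count` are hypothetical names, and the short sleep stands in for parsing and SDFG construction; none of this is FV3Core, GT4Py or DaCe code.

```python
import time


class ToyProgram:
    """Hypothetical stand-in for an orchestrated full program."""

    def __init__(self, func):
        self._func = func
        self._compiled = None  # cached "internal representation"
        self.build_count = 0   # how often the expensive step ran

    def _build(self):
        # Placeholder for parsing, SDFG construction and compilation.
        self.build_count += 1
        time.sleep(0.01)
        return self._func

    def __call__(self, *args):
        # Build lazily on the first call, then always reuse the cached result.
        if self._compiled is None:
            self._compiled = self._build()
        return self._compiled(*args)


def add_one(x):
    return x + 1


program = ToyProgram(add_one)
first = program(1)                        # pays the build cost once
in_loop = [program(i) for i in range(3)]  # calls inside a loop
after = program(10)                       # call after prior calls
```

In this sketch, `program.build_count` stays at 1 across all three call patterns. A real implementation would also have to decide what invalidates the cache, which the toy deliberately leaves out.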
#### Solution

We need to follow a structure similar to the one used for `FrozenStencil`, as done in [GT4Py](https://github.com/GridTools/gt4py/commit/fee908acdf74cf6dc65f725b6fda82d3e28056fb) or [FV3Core](https://github.com/ai2cm/fv3core/blob/cb327368e08a3150f482991039e624ba513b66b5/fv3core/utils/stencil.py#L124), so that calls are fast irrespective of how they are made.

#### Goals

The **minimal goal** is to have a version of the full-program optimization that is fast after the initial call to the program.

### 2. Clear definition of the front-end facing interface

#### Problem

Discussions with the FV3Core developers on the team showed that the current interface is not very intuitive to use.

#### Solution

In collaboration with the main FV3Core developers on AI2's side of the project, we want to design and implement a lightweight, intuitive front-end that can be easily applied in FV3Core (and, in the next step, in the full PACE model).

#### Goals

The **minimal goal** is to determine clear requirements on the front-end interface and to modify the implementation so that there is a clear strategy for applying it to the entire FV3Core framework. The know-how on how the interface works needs to be disseminated throughout the team.

The **ideal goal** is to have clear documentation of how the new interface works, with reproducible minimal working examples in standalone demo codes, to disseminate the work beyond the collaboration between these two teams.

### 3. Documentation of the full program framework and its requirements

#### Problem

Right now, the requirements and limitations of the framework are not documented.

#### Solution

We want to create a design document that clearly states what is allowed in the representation of the full program and how this interacts with the GT4Py stencils used inside it.
#### Goals

The **minimal goal** is to have a design document that our implementation of the full-program-optimization framework follows, with clear requirements on the code to be parsed and on all intermediate layers.

### 4. Reduce the amount of recursion used to build the fully orchestrated code

#### Problem

Right now, each call to a stencil recursively calls the GT4Py framework for specific stencils within the call stack to build the internal representation of the full program.

#### Solution

Ideally, this approach would be replaced by a flattening approach that does not need to recurse through the code in a DFS-like way, but rather generates all the stencil-based IRs after a pass that flattens the full program.

#### Goals

The **ideal goal** is to have a pass through the full program in which the complete program IR is built before we resolve the individual stencils.

## No-Goals

This project does not aim to implement any backend besides DaCe for parsing and representing the full codes.

Because this layer is complex and uses GT4Py, but only in a specific way, it is not yet clear where the resulting code should live. This problem should not be tackled in this project: we aim for an implementation that works before solving this (partially political) issue.

## Other issues

- It is currently unclear how this code should behave when backends other than `gtc:dace` are used for GT4Py stencils. Since we are aiming for a DaCe-only implementation for now, this should not be tackled, but it should be kept in mind when navigating the outlined tasks.
- The layer for distributed compilation could also be built in here if this is the best place for it. That part of the project is significant enough to warrant its own project, outlined [here](https://hackmd.io/@twicky/H15oYxRAu).
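
As a rough illustration of the flattening idea in Task 4, the sketch below walks a toy call graph with an explicit worklist instead of recursion, collects every stencil first, and only then runs a per-stencil build step. The call graph `CALLS`, the `stencil_` naming convention and the functions `flatten` and `build_stencil_irs` are all assumptions made for this example. The sketch also ignores call order and arguments, which a real program IR would need to preserve.

```python
from collections import deque

# Hypothetical toy program: each function lists what it calls, in order.
# Names starting with "stencil_" are leaves that would need a stencil IR.
CALLS = {
    "model_step": ["phase_a", "phase_b", "stencil_copy"],
    "phase_a": ["stencil_divergence", "stencil_copy"],
    "phase_b": ["stencil_fluxes"],
}


def flatten(entry):
    """Collect every stencil reachable from `entry` without recursion."""
    stencils = []  # unique stencils in discovery order
    seen = set()
    worklist = deque([entry])
    while worklist:
        name = worklist.popleft()
        if name in seen:
            continue
        seen.add(name)
        if name.startswith("stencil_"):
            stencils.append(name)
        else:
            worklist.extend(CALLS[name])
    return stencils


def build_stencil_irs(stencils):
    # Placeholder for the per-stencil build; runs once per unique stencil.
    return {name: f"IR({name})" for name in stencils}


flat = flatten("model_step")   # phase 1: flatten the whole program
irs = build_stencil_irs(flat)  # phase 2: resolve the individual stencils
```

The point of the two phases is that the full list of stencils is known before any single stencil is built, so `stencil_copy`, which is reached twice in the toy graph, is built only once.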