
# Implementation of Distributed Compilation for Full Program Optimization

## Goal

In the feasibility study on DaCe full program optimization, we found that we need to work out how to do full program optimization when distributed, ahead-of-time compilation is enabled for full codes.

## Prior work

The vertical demonstrator `computepath` implemented [here](https://github.com/ai2cm/fv3core/blob/3348c6a2d15d539a273cb61bd070fed5c35aac70/fv3core/decorators.py#L577) was **not** based on the latest work on distributed compilation by Eddie, implemented in [FutureStencil](https://github.com/ai2cm/fv3core/blob/master/fv3core/utils/future_stencil.py). We found that the two approaches do not work together seamlessly because of how heavily the DaCe code generation depends on the file system. We learned that other codes using DaCe solved this problem in the interface layer with a direct locking mechanism.

## Tasks

To make compilation for large-scale runs resource efficient, we need to solve the following tasks:

### 1. Make sure we parametrize codes so we do not need to run more than 9 different codes

#### Problem

If we take the easiest route and hard-code all the communication into the generated code, we still need to generate and compile code for every rank that runs. This is not feasible for large-scale runs, so we need a way to generate between 1 and 9 versions of the code (9 being the maximum number of different code paths found in FV3Core).

#### Solution

To do this, we need to parametrize the rank-specific information that does not correspond to a different code path.

### 2. Understand where distributed compilation should be controlled

#### Problem

It is currently unclear where the control for this infrastructure should live: inside the full-program-optimization framework or separately from it.

#### Solution

Investigate where the smoothest control interface for this utility is and move or design the code there.

### 3. Implementation of the `FutureStencil` feature for full program codes

#### Problem

Because the DaCe code generation depends so heavily on the file system, we need a way to distribute the work of generating full stencil IRs safely between ranks, similar to what was done in the `FutureStencil` work for individual stencils.

#### Solution

Implement a solution for distributing the work of generating executables and code across ranks in a safe way, similar to how it has been done in `FutureStencil`. This solution can directly leverage the `FutureStencil` class or implement a similar pattern.

## Non-Goals

We were already able to reduce compile times significantly by not code-generating the (unused) GT4Py executables and codes. We are still impacted, but not significantly blocked, by compilation time. The goal of this project is not to come up with the fastest possible solution but to focus on a stable, scalable solution.

## Other issues?
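
To make the parametrization in task 1 more concrete, here is a minimal sketch of how ranks could be grouped into at most 9 code paths so that only one rank per group has to generate and compile code. It rests on several assumptions that this document does not confirm: that a code path is determined solely by which tile edges a rank's subdomain touches, that ranks are numbered tile by tile and row by row within a tile, and that the first rank of each group does the compiling. The names `code_path_key` and `compile_plan` are hypothetical and do not exist in FV3Core or DaCe.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical key: (on_west, on_east, on_south, on_north) tile-edge flags.
CodePathKey = Tuple[bool, bool, bool, bool]


def code_path_key(rank: int, layout: Tuple[int, int]) -> CodePathKey:
    """Classify a rank by which tile edges its subdomain touches.

    Assumes tile-major, then row-major rank numbering (an assumption).
    """
    nx, ny = layout
    tile_rank = rank % (nx * ny)  # position within the tile; the tile index is ignored
    i, j = tile_rank % nx, tile_rank // nx
    return (i == 0, i == nx - 1, j == 0, j == ny - 1)


def compile_plan(total_ranks: int, layout: Tuple[int, int]) -> Dict[CodePathKey, List[int]]:
    """Group ranks by code path; the first rank of each group would compile."""
    groups: Dict[CodePathKey, List[int]] = defaultdict(list)
    for rank in range(total_ranks):
        groups[code_path_key(rank, layout)].append(rank)
    return dict(groups)


# Example: 6 tiles with a 4 x 4 rank layout each (96 ranks in total).
plan = compile_plan(total_ranks=6 * 4 * 4, layout=(4, 4))
compiling_ranks = sorted(ranks[0] for ranks in plan.values())
print(len(plan), compiling_ranks)
```

Under these assumptions, each axis contributes at most three distinct flag pairs (first, interior, last), so the number of groups stays between 1 and 9 regardless of the total rank count. All remaining rank-specific values, such as neighbor ranks, would then have to be passed in as runtime parameters rather than hard-coded.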