# GT4Py execution model: Horizontal Write Domain and Field Sizes

The relevant part of the GTScript semantics is defined in the [parallel model](https://github.com/GridTools/concepts/blob/master/GTScript-Parallel-model.md). However, it does not clearly define which (mainly horizontal) grid points are written per statement. We here show some cases to consider when defining this behavior and discuss approaches to resolve the issue.

## Glossary

In this document we use these definitions:

* **Extent**
  Computations can have an **extent** per field. This refers to the minimum and maximum offsets at which the field is accessed per dimension. We sometimes explicitly refer to the **input extent**, which includes only offsets of read accesses. The analogous output extent is always either 0 in all dimensions or undefined.
* **Iteration Space**
  Any assign statement implies writing on a specific subset of grid indices, which we here call the _iteration space_. In this document, we are mostly concerned with the iteration space of the horizontal indices. In the framework of the OIR, this is a property of `HorizontalExecution`s.
* **Domain** and **Output Domain**
  When API fields are written, they are written at least on the domain. Depending on the resolution of the issues in this document, they may also be written outside of it, but for any API output on which no other API write depends, the iteration space will coincide exactly with the domain.
* **Halo**
  The halo per field denotes the field-specific number of extra grid points that is needed in order to compute the outputs for all the API fields on the entire output domain. It is always assumed that API fields need to include at least the entire domain.

## Problem Statement

In general, it seems intuitive and is hence desirable that any field that has been written to and is later read can be assumed to have been written on all grid points that are being read. In most cases, this can indeed be satisfied. Nevertheless, cases can be found where this is not sufficient or is even a contradictory requirement. Here, we first highlight some of those considerations and comment on different ways of resolving the issue.

Although the pre-GTC pipeline and GTC resolve the issue differently, none of the current tests implemented in gt4py nor any user codes that were tried with GTC exhibit invalid memory accesses or even inconsistent results across backends. It can therefore be assumed that these considerations are somewhat pedantic.

### Behavior of old pipeline

In the old pipeline, the internal representation follows the structure of C++ GridTools. That is, each stencil is composed of a hierarchy of multi-stages, stages and intervals. Extents are defined on the level of stages. That is, multiple intervals are grouped together in a stage, with the extent on the stage level being the bounding box around all the individual extents of the intervals. The iteration space can then be inferred by going back through the list of stages and making sure that all (horizontal) grid points that will later be read are written, based on this stage-level extent.

### The OIR structure

In OIR, the hierarchy is different: `VerticalLoop`s contain `VerticalLoopSection`s, which contain `HorizontalExecution`s. This means that there is no correspondence between the code of `HorizontalExecution`s of different intervals, and hence it is not possible to create this bounding box on this level.
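For orientation, here is a simplified sketch of this containment hierarchy. The actual `gtc.oir` node definitions carry more attributes and are not plain dataclasses; the types used for statements and intervals below are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HorizontalExecution:
    """One horizontal sweep; the iteration space discussed in this
    document is a property of this node."""
    body: List[str]  # placeholder for the actual statement nodes


@dataclass
class VerticalLoopSection:
    """Covers one vertical interval and groups its horizontal executions."""
    interval: str  # placeholder for the actual Interval node
    horizontal_executions: List[HorizontalExecution]


@dataclass
class VerticalLoop:
    """One vertical loop with an iteration order and its sections."""
    loop_order: str  # "parallel", "forward" or "backward"
    sections: List[VerticalLoopSection]
```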
It is still possible to create a bounding box across sections (after stacking the extents of `HorizontalExecution`s, like the old pipeline would do with stages, but only within the same interval).

### No Grouping of Intervals in Parallel model

However, the GT4Py parallel model does not give any information about how the intervals should be grouped, and therefore we cannot make any assumption. In fact, it seems to be assumed that, to begin with, all sections go in their own interval. This would imply that the iteration spaces of what the user would perceive as different sections of a single vertical loop may grow: according to the logic of stacking the extents and iteration spaces of consecutive `VerticalLoop`s, these sections would be treated as if they consume each other's outputs, although in reality they work on different vertical subsets of the data.

### How does the user notice?

The user can notice this in two ways:

* If the iteration spaces grow unnecessarily, the resulting generated code will eventually demand more halo than is actually accessed. This is currently not a problem, since the required number of halo lines is a property of the generated `StencilObject` subclass, which is generated by the old pipeline from IIR.
* If an output is consumed at an offset before being written to a different output, the first output will actually be written outside of the domain. This means that the writing spills out into the halo lines. While hypothetically this would already have been the case so far, the parallel model may have introduced some inconsistencies around this.

### Fusing changes iteration space and extent

Not only does the lack of grouping cause the iteration spaces to stack unnecessarily, it also implies that when the `VerticalLoopMerger` groups them, the iteration spaces and therefore the semantics change in what is supposed to be only an optimization. It is likely due to this circumstance that none of this is a big issue currently: in the current GTC backends, intervals are merged, so for the implemented cases this works well enough that the behavior is the same as with the old pipeline.

### Same behavior as old pipeline not possible

As a consequence of the different loop order in OIR, it is not possible to replicate the exact behavior of the old pipeline in all cases.

### Lesser considerations

Besides the problems arising from the new parallel model, some other cases seem not to be well-defined.

#### Diagonal offsets

If a field is both written and read in a non-parallel vertical loop while also being read at a horizontal offset, one additional halo line needs to be read for each level of the interval. If the field is not a temporary, this could be resolved by reading from the halo. However, this case should likely be disallowed anyway, as is discussed in [this PR](https://github.com/GridTools/concepts/pull/39). If the field is a temporary, however, no halo can be provided, and this cannot be resolved for any non-constant interval size. (If the size is constant, it would require a number of halo lines equal to the size of the interval, which is unlikely to be useful.)

#### Different Write Domain for Sections

Since there is no link between horizontal executions of different sections, horizontal executions that conceptually belong together from the user's perspective may result in different iteration spaces, as the hypothetical example below illustrates.
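In this sketch (field names and the decorator invocation are invented for illustration, not taken from an existing test), both sections assign `tmp = inp` and conceptually form a single vertical loop. However, only the lower section's output is consumed at a horizontal offset, so only that `HorizontalExecution`'s iteration space has to grow in I:

```python
import numpy as np
from gt4py.gtscript import Field, PARALLEL, computation, interval, stencil


@stencil(backend="numpy")  # backend name is illustrative
def split_sections(inp: Field[np.float64], out: Field[np.float64]):
    with computation(PARALLEL):
        with interval(0, 1):
            tmp = inp  # read at an offset below: needs extent (0, 1) in I
        with interval(1, None):
            tmp = inp  # never read at an offset: plain domain suffices
    with computation(PARALLEL), interval(0, 1):
        out = tmp[1, 0, 0]  # only consumes the lower section's output
```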
#### Double write

If an API field is written and later read again with an offset, it is not clear whether this implies writing into the halo in the first write, or whether the read should require halo lines.

## Resolution

As a basic strategy for resolution, we propose to build a dependency graph: any read that, based on intervals and offsets, possibly consumes a value written by another horizontal execution forms a dependency edge, and iteration spaces are grown backwards along these edges (a minimal sketch of this propagation is given at the end of this document). The start of an implementation of this can be found [here](https://github.com/gronerl/gt4py/tree/oir_iteration_space). Some particular cases need to be considered.

### Shadowed reads

Sometimes writes will never be read, since in the reading `HorizontalExecution` the values are overwritten before they are read (indicating "bad" user code). Should the read access still be expanded in this case? A particular, hard-to-detect case of shadowing is masks: it could happen that a value is overwritten in two consecutive horizontal executions with complementary masks (the case of the if and else branches of the same conditional). In both branches we have to assume that the value is not overwritten, although the user could clearly see that it would be, which could potentially be unintuitive.

### Double write

As mentioned above, if an API field is written and later read again with an offset, it is not clear whether this implies writing into the halo in the first write, or whether the read should require halo lines. Another way of fixing this is to introduce a temporary as a working copy of the API field, from which only the domain part is eventually copied back to the API storage.

### Diagonal offsets

We propose to forbid diagonal reads, as mentioned above, also in the case where only temporaries are involved.

### Implementability in GT (C++)

Since this does not necessarily always conform to the GridTools model of extent analysis, it would be necessary to introduce a domain mask in the more exotic cases.

### Dependent-only Temporaries ("dead-ends")

A user may specify computations that never actually have an impact on any API field. If a backend does not delete those and still writes at least the domain, this may imply that an API field is unnecessarily required to provide halo lines, so it should be made sure that those are pruned.

### Possibly per-assignment domain-masking needed

To keep the OIR optimizable, it needs to be legal to fuse horizontal executions with different iteration spaces, at least when no direct write to an output field is affected. However, this implies that the iteration spaces of writes to temporaries can be enlarged even though the values are never read. This still implies that you could get illegal memory accesses when reading the input values needed to compute these extra values. Therefore, API fields will either require more halo lines, or a domain mask is required per statement.

### Topics not yet discussed

Topics that need special care are:

* How halo exchanges factor into this
* The topic gets more complicated for horizontal regions
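As referenced from the Resolution section above, the following is a minimal sketch of the proposed backward propagation of iteration spaces over the dependency graph. All names are hypothetical, shadowing and vertical intervals are ignored (the real analysis would only add an edge when the producing and consuming intervals can overlap), and the actual work in progress lives in the branch linked above.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Extent:
    """(low, high) offsets relative to the domain, per horizontal axis."""
    i: Tuple[int, int] = (0, 0)
    j: Tuple[int, int] = (0, 0)

    def union(self, other: "Extent") -> "Extent":
        return Extent(
            (min(self.i[0], other.i[0]), max(self.i[1], other.i[1])),
            (min(self.j[0], other.j[0]), max(self.j[1], other.j[1])),
        )

    def compose(self, offset: "Extent") -> "Extent":
        # A consumer with this iteration space reading at `offset`
        # requires the producer to cover the summed extent.
        return Extent(
            (self.i[0] + offset.i[0], self.i[1] + offset.i[1]),
            (self.j[0] + offset.j[0], self.j[1] + offset.j[1]),
        )


def propagate_iteration_spaces(
    order: List[str],
    reads: Dict[str, Dict[str, Extent]],
    writes: Dict[str, Set[str]],
) -> Dict[str, Extent]:
    """Grow each horizontal execution's iteration space so that every
    grid point read downstream has been written.

    order:  execution ids in program order
    reads:  per execution, the read extent per field
    writes: per execution, the set of written fields
    """
    space = {hexec: Extent() for hexec in order}  # start from the plain domain
    # Walk consumers back to front so each consumer's space is final
    # before it is propagated to its producers.
    for ci in reversed(range(len(order))):
        consumer = order[ci]
        for producer in order[:ci]:
            for fld, offset in reads[consumer].items():
                if fld in writes[producer]:  # potential dependency edge
                    space[producer] = space[producer].union(
                        space[consumer].compose(offset)
                    )
    return space
```

For the stencil from the "Different Write Domain for Sections" sketch, `propagate_iteration_spaces(["lower", "upper", "read"], {"lower": {}, "upper": {}, "read": {"tmp": Extent(i=(0, 1))}}, {"lower": {"tmp"}, "upper": {"tmp"}, "read": {"out"}})` grows the spaces of both producers of `tmp`, since intervals are ignored here; the interval-aware analysis would grow only `"lower"`.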