Multiple output domains

# Multiple output domains ## Overview From examples we collect problems that need to be addressed. Syntax discussion is left to the end. ## Examples ### Example 1: temporary is output - horizontal A field is output, but also used to compute a quantity on another domain. ```python= @field_operator def op(...): edge = foo(...) vertex = bar(edge, ...) return edge, vertex ``` - If `edge` wasn't output, it might or might not materialize as a temporary. - Problem A: `edge` is output (on a user-specified domain). Additionally it's consumed by `bar` to compute a `vertex` field. If `vertex`'s output domain is the _interior_, then the `edge` that enters `bar` needs to be available in _halo_. If the user specified `edge`'s domain as `interior` the domain of the output `edge` is smaller than the one of `edge` used in `bar`, but we cannot write to the output on a bigger domain than request by the user. How do we handle this case? ### Example 2: simple vertical staggering We have 2 output fields where one of them has one more k layer. No dependency between the fields (except maybe same input). ```python= @field_operator def op(...): k_field = foo(...) kp1_field = bar(...) return k_field, kp1_field ``` - Problem B: gtfn cannot execute this in a single kernel. In gtfn we can only specify one domain. Therefore needs to be executed as 2 kernels. For this simple example it is easy to split in a kernel that computes both in kp1_field[:-1] and an extra kernel for kp1_field[-1]. - There should be no problem for DaCe: it's just 2 maps which possibly share input. Map fusion with mask could be implemented as an optimization (or exists already?). ### Example 3: vertical staggering with dependency This is a slightly simpler problem than Example 1 as we know that we only have one more layer. However, it is more likely that a single kernel implementation is desirable for performance. ```python= @field_operator def op(...): kp1_field = bar(...) k_field = foo(kp1_field(K+1), ...) return k_field, kp1_field ``` This example is probably only an optimization problem and seems to exist as well if we write to a single output domain, like ```python= @field_operator def op(...): a = bar(...) b = foo(a(K+1), ...) return a, b op(..., domain=[0, nlev]) ``` How can we avoid creating the temporary on the shifted domain? ### Example 4: Scan Currently the scan range is defined by the single output domain. This has other problems. Should we exclude multiple domains if a scan is involved? - Problem C: We would need to change scan semantics. ## Discussion of problems ### Problem A: temporary and output disagree on domain #### Can we avoid materializing the temporary? If edge wasn't output we could inline the its computation into bar and we would iterate over only the vertex domain. This would do overcomputation: same intermediate edge value is computed multiple times. It is not obvious how we could efficiently write each edge once (in the edge output domain). => Probably we can't #### Can we compute it in a single kernel? We could run a kernel over the max(output_domains). In Example 1, we would run a kernel over number of edges to compute `edge` (and materialize the user-specified part of it to the given output array). We probably have to materialize also an `edge` temporary over a bigger domain (with halo). Then we would have to sync (globally) and compute the vertex output (masking out many threads (as n_vertex < n_edge)). => Doesn't sound very performant. #### Can we avoid materializing `edge` twice? The most obvious execution strategy would be to run a kernel to compute `edge` output, then a kernel to compute the `vertex` output. This is similar to temporary creation, with 2 important differences: - the field needs to materialize in a user-specified buffer (not one that we control) - and as a consequence we must only write to the part of the buffer that is specified by the user as domain. We must not write in the halo (which is needed to compute `vertex`). **Greenline**: (very long-term) we could use a fully functional GT4Py -> return a new field on each field_operator call. This would solve the problem as we would return a field with halo, where the user is not allowed to look into halo as it is implementation detail. **Blueline**: Options a. Add extra info on field about halo that would allow GT4Py to freely write there. On first thought this seems undesirable as longer term this should disappear. TODO: But maybe we would need this later as implementation detail anyway, then we could use this now explictly and later completly hide it. b. Force the user to specify the bigger domain (with halo) for `edge` in Example 1. This would allow us to use the same buffer for output and for consumption in `bar`. This is also not nice as it might be difficult to match the symbolic domains: user-specified and inferred one. Compile-time horizontal domains could be a way out. ## frontend syntax ... especially for fieldoperators with slicing ```python foo(..., out=(out_field[horizontal_start:horizontal_end, vertical_start:vertical_end], out2[other_domain])) ``` how does it interact with pre-compilation? ## GTIR representation Example ``` @field_operator def op(...): edge = foo(...) vertex = bar(edge, ...) return edge, vertex ``` ### Observations - The straight-forward backend implementation is one that has a named temporary `edge`. - The representation for the above would be `(λedge. (edge bar(edge, ...))) (foo(...))` (TODO: translate to gtir) - If we have to overcompute `edge` (for consumption in bar), but the requested output is smaller, can we avoid to copy from the bigger temporary? ## DaCe Probably DaCe wouldn't need big extensions. Check this! ## SCRATCH ```python= @field_operator def foo(...): ... return a,b foo(.., out=(a, b)) # implicit domain foo(a[a_domain], b[b_domain]) # explicit domain foo(a, b, domain=(domain_a, domain_b)) # -> could implicitly define halo ``` ## How to allow GT4Py to write in a bigger domain than output - add halo info to field - user specifies bigger output domain which matches intermediate temporaries - write where specified but possibly more (everywhere where you can); or more fine-grained: - for every output additionally specify if we care about values outside of compute domain ## Halo vs boundary