# GT4Py Brainstorming: stencil calls, staggering, halo exchanges
### GT Stencil Calls / stencil composition
Suppose we have a stencil that can be called from other stencils:
```python=
@stencil(...)
def callee(...):
    ...
# gets compiled to callee.cpp, callee.hpp, m_callee.py
```
Users expect to be able to call this normally, so it has to be optimized and compiled into its own C++ (target language) unit. Potentially that could be switched off with something like
```python=
@stencil(standalone=False, ...)
def callee(...):
    ...
```
We could allow calling it with control flow, for example:
```python
@stencil(...)
def caller(...):
    for i in range(X):
        callee(...)

    # convergence test ???
    for i in range(X):
        error = inf
        while error > MIN_ERROR:
            callee(...)
            with computation(...):
                a = field[...] + field[...]
            error = reduce(field_a - field_b, 0)

    # Optimizer could unroll calls in the IR, for example
    for i_range in range(0, X, 8):
        callee(...)
        callee(...)
        callee(...)
        callee(...)
        callee(...)
        callee(...)
        callee(...)
        callee(...)

# questions:
# gets compiled to caller.cpp, which #includes callee.hpp and calls callee at the C++ level?
# how does caller know where callee.cpp sits (the folder name contains a hash)?
# how does it know whether callee was or was not updated (caching?)
# are we writing a build system at this point?
```
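On the caching questions above: one plausible answer is content hashing. A minimal sketch, assuming a hypothetical `.gt_cache` layout in which the build folder name embeds a hash of the stencil source, so a changed `callee` automatically maps to a new folder:
```python
import hashlib
import inspect
import pathlib

def build_dir(func, backend):
    # Hypothetical cache layout: hashing the source together with the backend
    # name answers both "where does callee.cpp sit" and "was callee updated".
    src = inspect.getsource(func)
    key = hashlib.sha256(f"{backend}:{src}".encode()).hexdigest()[:16]
    return pathlib.Path(".gt_cache") / f"{func.__name__}_{key}"
```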
More questions arise:
* should this use the pre-compiled `callee` or inline callee and re-optimize it as part of the whole?
* compile time in large stencil libraries?
* optimization benefits from inlining?
* duplicating binary code when inlining and recompiling
* if pre-compiled, then we start implementing parts of a build system with dependent target lang units (can we avoid this?)
* implementing control flow raises testability questions
* should the control flow be lifted outside the target language and be implemented in Python? (see the sketch after this list)
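For the last point, a minimal sketch of lifted control flow, assuming `callee` is an ordinary compiled stencil and `compute_error` is a hypothetical host-side reduction:
```python
MIN_ERROR = 1e-6

# The convergence loop lives in Python; only the stencil body is compiled
# to the target language (field_a/field_b are assumed pre-allocated).
error = float("inf")
while error > MIN_ERROR:
    callee(field_a, field_b)
    error = compute_error(field_a, field_b)  # hypothetical host-side reduction
```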
We could restrict composition to simple calls:
```python=
@stencil(...)
def simple_calling(...):
    with computation(...), interval(...):
        callee(...)
    with computation(...), interval(...):
        other_callee(...)

for _ in range(X):
    simple_calling(...)
```
Here we do not allow control flow statements, but we essentially allow calling the different `apply` functions of the callee stencil. How does the runtime-determined domain get passed to stencils (in GT)?
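For comparison, a single top-level call already receives its domain at run time via keyword arguments; a sketch following the usual GT4Py calling convention (`nx`, `ny`, `nz` are placeholders). How this propagates into the nested `callee` calls is exactly the open question:
```python
# Runtime-determined domain passed at the call site, as for any compiled
# GT4Py stencil.
simple_calling(field_a, field_b, origin=(1, 1, 0), domain=(nx, ny, nz))
```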
### How to deal with staggering, if at all (to stagger or not to stagger)
In scientific-domain semantics, some fields carry values on grid points, while others carry values on edges (between grid points, hence one value less in that dimension). Fields whose values live between grid points are called **staggered**. Expressing operations that deal with both in an agnostic GTScript (which only deals with fields, not with where they sit in 3D space relative to each other) is certainly possible by applying transformations before/after running stencils. However, can and/or should we provide any help with that in GTScript?
Related link (?): https://en.wikipedia.org/wiki/Arakawa_grids
#### Staggering code example
```python
import numpy

field_a = numpy.zeros((5, 5, 5))
field_b = numpy.zeros((4, 5, 5))  # staggered in i-direction
???
```
normal field (`0`: compute domain, `x`: boundary/halo points)
```
xxxxx
x000x
x000x
x000x
xxxxx
```
staggered field (in i-direction), kept in the original storage size
```
xxxxx
x00xx
x00xx
x00xx
xxxxx
```
which is equivalent to storing it with one point less
```
xxxx
x00x
x00x
x00x
xxxx
```
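As an example of the "transform before/after running stencils" route mentioned above, a staggered field can be interpolated to grid points entirely outside of GTScript. A minimal NumPy sketch; the two-point average is only an assumption about the desired transformation:
```python
import numpy

staggered = numpy.zeros((4, 5, 5))  # values between the 5 grid points in i
# Average adjacent staggered values onto the 3 interior grid points.
centered = 0.5 * (staggered[:-1, :, :] + staggered[1:, :, :])  # shape (3, 5, 5)
```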
#### Proposals:
* pass a domain per field to stencils
* keep passing one domain for all fields, but declare staggering per field (see the sketch below)
* explicitly define the domain per assignment?
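A hypothetical sketch of the second proposal at the call site (the `staggering` keyword and its format are invented for illustration):
```python
# One domain for all fields; per-field metadata declares staggered axes.
stencil(
    field_a, field_b,
    domain=(5, 5, 5),
    staggering={"field_b": ("i",)},  # field_b has one point less in i
)
```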
## Halo exchanges
data layout:
```
[ exclusive region ][ first halo shell ][ second halo shell ][ third halo shell ]
```
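With this shell-ordered layout every region is a contiguous slice, so its offset is just the running sum of the preceding region sizes. A sketch with hypothetical sizes (`buf`, `n_exclusive`, `n_shell*` are placeholders):
```python
from itertools import accumulate

sizes = [n_exclusive, n_shell1, n_shell2, n_shell3]
offsets = [0, *accumulate(sizes)]
first_halo_shell = buf[offsets[1]:offsets[2]]  # contiguous by construction
```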
### On meshes
The following are my (Till's) personal notes on halo exchanges for meshes. The notation is descriptive and only meant for demonstration purposes.
Example stencil:
```python
def example():
    with location(Cell) as c:
        field_tmp = sum(field_in for c in cells(c))   # stmt 1
        field_out = sum(field_tmp for c in cells(c))  # stmt 2
```
### Naive approach
Use the new storages API to trigger a halo exchange before each stencil call. This would require Python bindings for GHEX. As debugging distributed code with a debug backend is useful anyway, the effort of writing the bindings is not lost if we do exchanges "inside" the stencils at some point.
GHEX: pattern initialization is global and synchronized.
__Synchronous exchanges__
```python
class DistributedStorage:
    # ...
    def acquire(self):
        # exchange field values on the halo region
        ...

# init storages
in_field = ...
out_field = ...

stencil(in_field, out_field)  # assumed to call acquire() on each storage argument
```
__Asynchronous exchanges__
```python
class DistributedStorage:
    # ...
    def acquire(self):
        # wait for pending exchanges to finish
        ...

    def release(self):
        # initiate a halo exchange
        ...

# init storages
in_field = ...
out_field = ...

stencil(in_field, out_field)  # assumed to acquire inputs and release outputs
```
Disadvantage: an exchange can only be initiated after the computation is done. For example, here `field` is both read with an offset and updated in the same sweep, so its updated halo can only be exchanged once the whole computation has finished:
```python
with computation(FORWARD):
    field = field[-1, 0, 0] + field[-1, 0, 1]
```
### Approach 1
### Approach 2
This approach is more complex, but allows for overlapping communication and computation even across time stepping loops.
1. Decompose the domain into `halo`, `foreign_halo`, `internal`. Computations on halos with external extent are never carried out, but must be replaced with halo exchanges. Halo computations without external extent may be computed on the same node.
__Code for decomposition__
Requirements:
- the global domain decomposition must provide which cells are foreign
- or the boundary cells, vertices, and edges must be known (these can always be computed)
```python
tag_cells(mesh, "halo", lambda c: c2c2c[c] in mesh.tagged_cells("foreign"))
tag_cells(mesh, "foreign_halo", lambda c: c2c2c[c] in mesh.tagged_cells("halo"))
tag_cells(mesh, "internal", lambda c: c not in mesh.tagged_cells("halo")
          and c not in mesh.tagged_cells("foreign_halo"))
```
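For completeness, `tag_cells` could be as simple as the following sketch (the mesh interface is invented):
```python
def tag_cells(mesh, tag, predicate):
    # Record all cells matching the predicate under `tag`, so later passes
    # can iterate mesh.tagged_cells(tag).
    mesh.set_tag(tag, [c for c in mesh.cells() if predicate(c)])
```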
__Stencil after decomposition__
```python
def example():
    with location(Cell, tag="halo") as c:
        field_tmp = sum(field_in for c in cells(c))
    with location(Cell, tag="foreign_halo") as c:
        field_tmp = sum(field_in for c in cells(c))
    with location(Cell, tag="internal") as c:
        field_tmp = sum(field_in for c in cells(c))
    with location(Cell, tag="halo", dependency={"field_tmp": ["external", "foreign_halo", "halo"]}) as c:
        field_out = sum(field_tmp for c in cells(c))
    with location(Cell, tag="foreign_halo", dependency={"field_tmp": ["halo", "foreign_halo", "internal"]}) as c:
        field_out = sum(field_tmp for c in cells(c))
    with location(Cell, tag="internal", dependency={"field_tmp": ["foreign_halo", "internal"]}) as c:
        field_out = sum(field_tmp for c in cells(c))
```
2. Merge statements with the same tag and the highest expected benefit:
```python
def example():
    with location(Cell, tag="halo") as c:
        field_tmp = sum(field_in for c in cells(c))
    with location(Cell, tag="foreign_halo") as c:
        field_tmp = sum(field_in for c in cells(c))
    with location(Cell, tag="halo", dependency={"field_tmp": ["external", "foreign_halo"]}) as c:
        field_out = sum(field_tmp for c in cells(c))
    with location(Cell, tag="foreign_halo", dependency={"field_tmp": ["halo", "foreign_halo"]}) as c:
        field_out = sum(field_tmp for c in cells(c))
    with location(Cell, tag="internal", dependency={"field_tmp": ["foreign_halo"]}) as c:
        field_tmp = sum(field_in for c in cells(c))
        field_out = sum(field_tmp for c in cells(c))
```
3. Computations on halos can now be replaced by a halo exchange.
```python
def example():
    with location(Cell, tag="foreign_halo") as c:
        field_tmp = sum(field_in for c in cells(c))
    halo_exchange(field_tmp, ["foreign_halo", "halo"])
    with location(Cell, tag="foreign_halo", dependency={"field_tmp": ["halo", "foreign_halo"]}) as c:
        field_out = sum(field_tmp for c in cells(c))
    halo_exchange(field_out, ["foreign_halo", "halo"])
    with location(Cell, tag="internal", dependency={"field_tmp": ["foreign_halo"]}) as c:
        field_tmp = sum(field_in for c in cells(c))
        field_out = sum(field_tmp for c in cells(c))
```
The halo exchange can be replaced by a separate send and receive to overlap communication and computation.
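A sketch of the split version in the same descriptive notation (`halo_send`/`halo_recv` are hypothetical):
```python
def example():
    with location(Cell, tag="foreign_halo") as c:
        field_tmp = sum(field_in for c in cells(c))
    req = halo_send(field_tmp, ["foreign_halo"])  # start communication early
    with location(Cell, tag="internal", dependency={"field_tmp": ["foreign_halo"]}) as c:
        field_tmp = sum(field_in for c in cells(c))  # compute while data is in flight
        field_out = sum(field_tmp for c in cells(c))
    halo_recv(field_tmp, ["halo"], req)  # must complete before the halo is read
    with location(Cell, tag="foreign_halo", dependency={"field_tmp": ["halo", "foreign_halo"]}) as c:
        field_out = sum(field_tmp for c in cells(c))
```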
### Open questions
- How to ensure halos are contiguous in memory?
  1. Additional information added to the tag
  2. Generate code for both the contiguous and the non-contiguous case
- Where to store the "patterns" object, and how to avoid re-initialization on subsequent stencil calls? (see the sketch after this list)
- How exactly to do loop merging on the halo region?
- Can we start computing when only part of the input fields has already been exchanged?
- Contiguous halos require support from storages (e.g. Atlas); this is problematic for the internal halo.
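For the "patterns" question, one option is a module-level cache keyed by mesh and halo spec, so only the first call pays the global, synchronized initialization. A sketch with invented names (`make_pattern` stands in for whatever the GHEX bindings expose):
```python
_pattern_cache = {}

def get_pattern(mesh, halo_spec):
    # Create the pattern once per (mesh, halo_spec) and reuse it on
    # subsequent stencil calls.
    key = (id(mesh), halo_spec)
    if key not in _pattern_cache:
        _pattern_cache[key] = make_pattern(mesh, halo_spec)  # globally synchronized
    return _pattern_cache[key]
```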