# How to write clean ICON4Py programs

The following is a summary of patterns that currently appear in ICON4Py's dycore and diffusion. *Rule* describes what to do now, *Reason* describes why this is the rule, *Ideal* describes how we would like to implement the pattern, and *Current status* describes why we cannot yet apply the ideal case.

## Readability vs performance

**Rule:** Sacrifice readability for performance only if absolutely needed, and then clearly document the implementation. "Absolutely needed" means: big impact on performance and no (short-term) readable alternative available.

**Reason:** We should try to avoid temporary solutions.

**Ideal:** Readable code should allow for best performance.

**Current status:** We violate this rule when we have programs that output on nlev and nlev+1. The more readable implementation would be to separate the nlev+1 outputs into a separate field_operator; however, we leave them together and execute an additional kernel on the nlev+1 level, see e.g. https://github.com/C2SM/icon4py/blob/c2e0c7fffda26d20a9caefc77869b92ed6362038/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/compute_edge_diagnostics_for_velocity_advection.py#L241.

## Compute domain

**Rule:** Make the domain as small as possible for the computation to be correct.

**Reason:** The domain of the program specifies the compute range for kernels. Domains that are too large lead to unnecessary overcompute.

**Ideal:** The compute domain should always be the interior and possibly boundary layers. The user should never specify compute in halos.

**Current status:** Because of GT4Py limitations, the user sometimes has to split computations in a way that later programs consume the output of previous programs (without halo exchanges). In such cases the user sometimes needs to specify overcomputation (i.e. computation in the halo for these fields).

## Boundary conditions

**Rule:** Discuss case-by-case.

**Ideal:** Apply boundary conditions together with the computation of the interior. This is mainly for readability.
Additionally, in some cases the boundary condition (if not covered by the compute domain) will never materialize.

**Current status:** Probably needs to be discussed on a case-by-case basis. If the field is used in several places, it might be necessary to materialize the boundary condition for clarity. Applying the *ideal* might require computing on an extra k-level which not all output fields might have. This could be resolved once GT4Py allows multiple output domains.

## Use of concat_where in the horizontal: compute domain is subset

**Rule:** Don't use `concat_where` if the compute domain is already a subset.

**Reason:** The compiler might not be able to recognize the subset (because domains are symbolic) and might not be able to apply transformations / dead-code elimination properly. This is an insufficiency of the toolchain. But it's also not clear why overspecifying the compute ranges would be good for readability.

**Ideal:** Probably rule == ideal. However, if parts of the program could be useful in another context, the `concat_where` might be semantically correct in the particular case. An example could be an operator with boundary conditions: if executed only in the interior, the `concat_where` for the boundary condition case could be eliminated, but the general case would require it.

**Current status:** Can be applied without limitations.

## Use of concat_where in the horizontal: protect halo

**Rule:** Don't protect the halo from being written. In the Fortran loop-nest implementation, every array is only written where it is strictly needed, sometimes with halo, sometimes without. In [Compute domain](#Compute-domain), the ideal case is to not compute in the halo; however, if we do compute in the halo for one field, we should not protect other fields from being written in the halo.

**Reason:** Protecting the halo pollutes user code with irrelevant halo computations. Also, optimizations see more complicated domains.

Downsides:

- This possibly introduces a bit of overcompute.
- Verification against Fortran requires excluding the halo. There is a mode for this pattern in icon-exclaim's `verify_field`.

**Ideal case:** see [Compute domain](#Compute-domain).

**Current status:** Can be applied (most likely) without limitations.

## Use of concat_where in the vertical

TODO: Are there special considerations?

## Use of in/out fields

**Rule:** While GT4Py does not advertise allowing fields to be input and output in a field_operator call from a program, we tolerate this in case the in/out fields are used point-wise. We allow this to avoid extra copies and an increased memory footprint.

**Important: GT4Py currently doesn't warn the user in case in/out fields are used for non-pointwise programs! An additional explicit copy is currently needed.**

**Examples:**

```python
@program
def acceptable(...):
    # point-wise: the same field may be passed as input and output
    a_pointwise_fop(bar, out=bar)
```

```python
@program
def wrong(...):
    # non-point-wise: reads of bar may see values already overwritten
    a_non_pointwise_fop(bar, out=bar)


@program
def correct(...):
    # explicit copy first, then read only from the copy
    copy(bar, out=bar_old)
    a_non_pointwise_fop(bar_old, out=bar)
```

**Ideal case:** No in/out fields. In Greenline we can deal with this by returning new fields (or swapping out the buffer). However, this is not feasible in blueline, and GT4Py is lacking features to support this.

**Current status:** As described in **Rule**. This might actually introduce more copies than required for non-pointwise programs: for correct semantics the user is required to add the copy, however GT4Py might introduce an intermediate temporary that would remove the need for the extra copy.

## Writing Zeros

(This is a special case of concat_where/boundary conditions)

**Rule:** Don't restrict reads by using `concat_where` and `0.0` if not needed. E.g. in

```
@field_operator
def foo(...):
    return concat_where(subset_of_domain, do_something, IRRELEVANT), some_other_stuff
```

don't use `0.0` for IRRELEVANT if not needed for correctness.

**Reason:** We will not be able to distinguish relevant from irrelevant zeros.
Additionally, the performance impact of trying to avoid a read is not clear.

**Ideal:** The pattern above is most likely an artifact of not supporting multiple output domains and should not exist.

**Current status:** Can be applied now. If a performance investigation shows that it makes a difference, we should use a slightly different, more expressive pattern.

## Non-supported configuration options

**Rule:** Obviously remove them. Sometimes there are remaining patterns from half-supported configuration options.

## Initialization of user-defined temporaries

**Rule:** Don't rely on zero-initialization of user-defined temporaries (e.g. the `local_fields` in `solve_non_hydro`). If zero-value boundaries are needed, they should be implemented in GT4Py programs.

**Reason:** We should reduce their lifetime to where they are used, i.e. allocate/deallocate. The allocation should not zero-initialize, for performance reasons (it should use `empty()` instead).

**Ideal:** Ideally we should not have user-defined temporaries, but that would require the whole dycore to be within GT4Py, which is not feasible in the mid-term.

**Current status:** Currently user-defined temporaries are zero-initialized. We should check that we don't rely on this behavior.

## Compile-time switches

**Rule:** Use compile-time switches for performance when possible (to be discussed).

**Example:**

```python
@field_operator
def fop(foo: float, ...):
    if foo > 42.0:
        bar(...)


@program
def not_ideal(foo: float, ...):  # foo changes on each call
    fop(foo, ...)


# no static compilation for foo possible
not_ideal(foo, ...)

# ---


@field_operator
def new_fop(apply_bar: bool, ...):
    if apply_bar:
        bar(...)


@program
def possibly_more_performant(apply_bar: bool):
    new_fop(apply_bar, ...)


# both variants of the boolean switch can be compiled ahead of time
possibly_more_performant.compile(apply_bar=[True, False])
possibly_more_performant(apply_bar=foo > 42.0)
```

**Ideal:**

**Current status:** Can be used today.
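
The reasoning behind the compile-time switch can be illustrated without GT4Py. The following is only a plain-Python sketch under stated assumptions: `functools.lru_cache` stands in for a compilation cache, and `bar`, `identity`, `specialize_on_value` and `specialize_on_flag` are hypothetical names, not part of GT4Py or ICON4Py. It shows that specializing on a float that changes on each call needs one variant per distinct value, while specializing on the derived boolean needs at most two.

```python
from functools import lru_cache
from typing import Callable, List

Field = List[float]


def bar(field: Field) -> Field:
    """Hypothetical stand-in for the work guarded by the switch."""
    return [2.0 * x for x in field]


def identity(field: Field) -> Field:
    """Variant used when the switch is off: returns an unchanged copy."""
    return list(field)


@lru_cache(maxsize=None)
def specialize_on_value(foo: float) -> Callable[[Field], Field]:
    """Stand-in for treating `foo` as static: one cached variant per distinct value."""
    return bar if foo > 42.0 else identity


@lru_cache(maxsize=None)
def specialize_on_flag(apply_bar: bool) -> Callable[[Field], Field]:
    """Stand-in for treating `apply_bar` as static: at most two cached variants."""
    return bar if apply_bar else identity


field = [0.0, 1.0, 2.0]
foo_values = [40.0, 41.5, 43.0, 44.5]  # foo changes on each call

for foo in foo_values:
    # Both strategies compute the same result for every value of foo.
    assert specialize_on_value(foo)(field) == specialize_on_flag(foo > 42.0)(field)

n_value_variants = specialize_on_value.cache_info().currsize
n_flag_variants = specialize_on_flag.cache_info().currsize
print(n_value_variants, n_flag_variants)  # prints "4 2"
```

In this sketch the number of float-keyed variants grows with every new value of `foo`, whereas the boolean-keyed cache never exceeds two entries, which is what makes it possible to build both variants ahead of time.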