# Imperative Style Backend
###### tags: `functional cycle 12`
Developers: Matthias, support from Hannes
Appetite: 5 Weeks
## Background
MeteoSwiss has strict requirements regarding the backend code that is generated by the gt4py toolchain:
- The code needs to be readily understood by the average HPC engineer
- The code needs to be profilable with the NVIDIA tools (especially `ncu`)
The first requirement ensures that the code can be debugged efficiently further down the line. Bugs may surface, e.g., when weather events trigger code paths not covered by the standard set of tests, or when changes from upstream ICON are ported into the DSL version. In these situations an HPC engineer needs to be able to react swiftly and resolve the problem efficiently. The second requirement ensures maintainability across future versions of the CUDA toolkit: experience shows that stencils may become faster or slower between versions, sometimes critically so.
## Preliminary Investigations
A middle way between a low-level CUDA backend and the current functional backend was proposed: a debug backend that still uses (U)SID and many of the `gtfn` facilities, but does not adhere to a functional style and instead commits intermediate results to temporary variables. The hope is that this makes the backend code more readable for engineers not well versed in template metaprogramming and the functional programming style in C++, while at the same time restoring the ability to profile the code with `ncu`.
## Example / Look and Feel
Let's consider an example:
```
@field_operator
def _mo_nh_diffusion_stencil_02(
    test_mask: Field[[CellDim, KDim], bool],
    kh_smag_ec: Field[[EdgeDim, KDim], float],
    vn: Field[[EdgeDim, KDim], float],
    e_bln_c_s: Field[[CellDim, C2EDim], float],
    geofac_div: Field[[CellDim, C2EDim], float],
    diff_multfac_smag: Field[[KDim], float],
) -> tuple[Field[[CellDim, KDim], float], Field[[CellDim, KDim], float]]:
    kh_c = where(
        test_mask,
        neighbor_sum(kh_smag_ec(C2E) * e_bln_c_s, axis=C2EDim) / diff_multfac_smag,
        1.0,
    )
    div = neighbor_sum(vn(C2E) * geofac_div, axis=C2EDim)
    return kh_c, div
```
This is a slightly modified version of the usual `mo_nh_diffusion_stencil_02`: a mask has been introduced so that conditionals are also covered by the exploration. With the current functional backend this would be translated to
```
return make_tuple(
(deref(test_mask) ? ([=](auto _step_3) {
return _step_3(_step_3(_step_3(0_c, 0_c), 1_c), 2_c);
}([=](auto _acc_1, auto _i_2) {
return (_acc_1 + (deref(shift(kh_smag_ec, C2E, _i_2)) *
tuple_get(_i_2, deref(e_bln_c_s))));
}) /
deref(diff_multfac_smag))
: 1.0),
[=](auto _step_6) {
return _step_6(_step_6(_step_6(0_c, 0_c), 1_c), 2_c);
}([=](auto _acc_4, auto _i_5) {
return (_acc_4 + (deref(shift(vn, C2E, _i_5)) * tuple_get(_i_5, deref(geofac_div))));
}));
```
An equivalent version of the above reads:
```
double acc_1 = 0.;
if (deref(test_mask)) {
  acc_1 += deref(shift(kh_smag_ec, C2E, 0_c)) * tuple_get(0_c, deref(e_bln_c_s));
  acc_1 += deref(shift(kh_smag_ec, C2E, 1_c)) * tuple_get(1_c, deref(e_bln_c_s));
  acc_1 += deref(shift(kh_smag_ec, C2E, 2_c)) * tuple_get(2_c, deref(e_bln_c_s));
  acc_1 /= deref(diff_multfac_smag);
} else {
  acc_1 = 1.;
}
double acc_4 = 0.;
acc_4 += (deref(shift(vn, C2E, 0_c)) * tuple_get(0_c, deref(geofac_div)));
acc_4 += (deref(shift(vn, C2E, 1_c)) * tuple_get(1_c, deref(geofac_div)));
acc_4 += (deref(shift(vn, C2E, 2_c)) * tuple_get(2_c, deref(geofac_div)));
return make_tuple(acc_1, acc_4);
```
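The accumulation above is just the unrolled form of the left fold that the nested `_step_x` applications express in the functional version. A small Python sketch, using made-up stand-in values for the dereferenced neighbor contributions, illustrates the equivalence:

```python
from functools import reduce

# Hypothetical neighbor contributions, standing in for
# deref(shift(vn, C2E, i)) * tuple_get(i, deref(geofac_div)).
contributions = [0.5, 1.25, -0.75]

def step(acc, i):
    # One reduction step: add the i-th neighbor contribution.
    return acc + contributions[i]

# The generated functional code spells out step(step(step(0, 0), 1), 2) ...
folded = step(step(step(0.0, 0), 1), 2)

# ... which is exactly a left fold, i.e. a plain accumulation loop.
looped = 0.0
for i in range(3):
    looped = step(looped, i)

assert folded == looped == reduce(step, range(3), 0.0) == sum(contributions)
```

The imperative backend therefore does not change what is computed, only how the computation is spelled out.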
It is hoped that this version is more digestible for people not familiar with USID and `gtfn`.
### Profiling
Profiling the original version of the compiled stencil yields the following (the columns on the right-hand side are "instructions executed", "stall_sb", and "stall_long_sb"; the latter two are among the most important stall reasons caused by memory operations):

It is evident that it is hard to correlate the performance counters and stall reasons with what is actually happening on the GPU. For example, it is not at all clear what the instructions executed on the `_step_x` lines actually mean, or why `_step_3` shows no memory stalls while `_step_6` (highlighted) does. For the actual lambdas that accumulate the result of the reduction, it is impossible to determine during which "layer" (iteration) of the `_step_x` calls the stalls happen (e.g., do they occur mostly in the first iteration, or with roughly the same frequency during all three iterations?).
The problems get worse as soon as one more level of indirection is generated:

It seems unreasonable that the highlighted line, which only contains a dereference and is executed ~105'000 times, never causes a single memory stall. The memory stalls in the other two lines could have about half a dozen sources per line.
With the "debug" version of the backend, the profiling looks much more comprehensible at first sight.

However, the stall reasons indicated are only a fraction of the stalls actually occurring. Inspecting the source code of the `shift` function, which is called on six of the lines above, reveals:

i.e., 92 and 813 additional stalls, with no way to determine which of the six lines calling `shift` actually caused them.
### Summary of Preliminary Investigation:
Emitting code that is closer to a conventional, procedural backend fulfills part of the MeteoSwiss requirements. It can certainly be argued that this style is more readily understood by the majority of performance engineers; however, the current state of the NVIDIA tools does not allow for proper profiling even with this style of generated code. It is hoped that future versions of `ncu` will be able to sum up the stalls correctly along the call graph.
## Description of Task
This task asks for the implementation of the imperative-style backend sketched out above. More formally:
- All lambdas should be "unwrapped" except for the outermost one
- Ternaries should be transformed into if-then-else constructs
- Reductions should be lowered to for-loops
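As an illustration of the ternary transformation, a minimal sketch of such a lowering over a toy IR could look as follows. The node and pass names here are purely illustrative and are not actual gt4py/ITIR APIs:

```python
from dataclasses import dataclass

# Toy expression/statement IR, loosely modeled on what an ITIR
# "statement dialect" could look like (names are illustrative only).
@dataclass
class Ternary:
    cond: object
    true_expr: object
    false_expr: object

@dataclass
class Assign:
    target: str
    value: object

@dataclass
class IfElse:
    cond: object
    true_body: list
    false_body: list

def lower_ternary(stmt):
    """Rewrite `x = c ? a : b` into `if (c) x = a; else x = b;`."""
    if isinstance(stmt, Assign) and isinstance(stmt.value, Ternary):
        t = stmt.value
        return IfElse(
            cond=t.cond,
            true_body=[Assign(stmt.target, t.true_expr)],
            false_body=[Assign(stmt.target, t.false_expr)],
        )
    return stmt

stmt = Assign("kh_c", Ternary("test_mask", "reduction / smag", "1.0"))
lowered = lower_ternary(stmt)
assert isinstance(lowered, IfElse)
assert lowered.true_body[0].value == "reduction / smag"
```

The key point is that the lowering replaces an expression node with a statement node, which is exactly why the ITIR dialect first needs a notion of statements.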
## Known steps
Roughly, the task could proceed as follows
1. Introduce a new dialect of ITIR that allows for a statement node
2. Write a pass that unwraps the lambda functions
3. Write a pass that transforms the ternaries into if-then-else constructs
4. Treat the scan builtin
5. Extend the code generator to handle
   - statements
   - neighbor reductions
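To sketch the code-generator extension for neighbor reductions, a toy emitter could produce a for-loop like the one below. All names are illustrative; note also that in the real `gtfn` backend the neighbor index typically has to remain a compile-time constant (e.g. for `tuple_get`), which this sketch glosses over:

```python
def emit_reduction_loop(acc, init, expr_template, num_neighbors):
    """Emit C-like source lowering a neighbor reduction to a for-loop.

    `expr_template` uses `{i}` as the neighbor-index placeholder;
    this is a hypothetical helper, not an actual gt4py API.
    """
    lines = [f"double {acc} = {init};"]
    lines.append(f"for (int i = 0; i < {num_neighbors}; ++i) {{")
    lines.append(f"  {acc} += {expr_template.format(i='i')};")
    lines.append("}")
    return "\n".join(lines)

code = emit_reduction_loop(
    "acc_4",
    "0.",
    "deref(shift(vn, C2E, {i})) * tuple_get({i}, deref(geofac_div))",
    3,
)
print(code)
```

Compared to the fully unrolled accumulation shown earlier, the loop form keeps the generated source compact and gives the profiler a single line per reduction to attribute stalls to.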
## Rabbit Holes and Non-Goals
- Do not adapt the instrumentation / interface; only the code generation of the field operators should be adapted.
- Do not try to move to a lower level than currently envisioned.