# [DaCe] Optimization VI
<!-- [DaCe] Optimization V -- The Return of the Optimizer -->
- Shaped by: Philip
- Appetite (FTEs, weeks):
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
In several of the [last](https://hackmd.io/@gridtools/HJ__Ec0pR) [cycles](https://hackmd.io/e8-76c5OR6e0PWKSDN-pEA) we worked on the optimization pipeline in GT4Py using DaCe.
We were able to lower and quite extensively verify two stencils from diffusion:
- `apply_diffusion_to_vn`
- `apply_diffusion_to_w_and_compute_horizontal_gradients_for_turbulence`
The pipeline generated a speed-up of roughly 1.6x compared to OpenACC.
See also the "Left Overs" section below regarding the regression.
We were also able to lower the following stencils (not all of them validated), pass them through the optimization pipeline, and run the result:
- `calculate_nabla2_and_smag_coefficients_for_vn`
- `calculate_diagnostic_quantities_for_turbulence`
- `calculate_enhanced_diffusion_coefficients_for_grid_point_cold_pools`
However, we did not spend as much time analyzing them as we did for the other two.
Furthermore, we are missing OpenACC reference timings for them.
This means that, besides `apply_diffusion_to_theta_and_exner` (which had an error that might be fixed by [now](https://github.com/GridTools/gt4py/pull/1814)), all stencils from diffusion can be lowered; however, not all stencils that can be lowered also validate (it is not yet clear why).
#### Left Overs
During our previous work, not all aspects of the problem were fully solved.
Here is an incomplete list of the most severe/important issues the optimizer currently has.
Note that this is not a list of tasks; it is rather a reminder of things that could be done if there is spare time, and of particular issues that might surface while working on a stencil.
##### DaCe Related Issues
Here are the issues that are closely related to DaCe.
- The MapFusion PR in DaCe has finally been merged.
However, in GT4Py we also need parallel map fusion; there is an open [DaCe PR](https://github.com/spcl/dace/pull/1965), but it has not been reviewed yet.
- In certain situations, the `InlineMultistateSDFG` transformation can cause [trouble](https://github.com/spcl/dace/issues/1959) because it splits the write to transients, i.e. more than one AccessNode ends up writing to the same transient (a small detection sketch follows after this list).
- An error in the code generator, related to `concat_where`, was found.
In order for kernels to run concurrently, they have to be submitted to different CUDA streams, which have to be created at the beginning.
However, DaCe uses more streams than it creates, which causes a segmentation fault.
The current workaround is to limit execution to one stream, but this is not a real solution, as `concat_where` is a textbook example for concurrent streams.
- In some cases `gt_substitute_compiletype_symbols()` does not work (this is a limitation of the implementation).
- Not directly affecting this project, but during the upgrade to DaCe main it was realized that `is_accessed_downstream()` has to be changed.
Previously it explored the graph manually, but this is no longer possible because of the new structure of the SDFG.
The current solution is to rely on `StateReachability`, which the function calls on its own.
This is okay in some cases, but in most cases it should run inside a pipeline. This needs to be changed, but it also triggers the [`PatternTransformation._pipeline_results` issue](https://github.com/spcl/dace/issues/1911) that also affects the MapFusion PR.
- There is an issue regarding the launch of kernels; it mostly happens for fully specialized SDFGs, i.e. when all sizes and strides are replaced with their values.
There is a [fix](https://github.com/spcl/dace/pull/1968) for that, but it has not yet been accepted.
- CUDA provides an API call to copy 2D arrays, but there are some limitations in [DaCe](https://github.com/spcl/dace/issues/1953).
This must be addressed.
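As a side note on the `InlineMultistateSDFG` item above: a quick way to spot SDFGs where a transient ends up with several writing AccessNodes is a small helper like the one below. This is a minimal sketch using only the public SDFG graph API; `find_split_transients` is a hypothetical name, not an existing GT4Py or DaCe function.

```python
import collections

import dace


def find_split_transients(sdfg: dace.SDFG) -> dict[str, int]:
    """Count, per transient, how many AccessNodes write to it (sketch only)."""
    writers: dict[str, int] = collections.defaultdict(int)
    for state in sdfg.states():
        for node in state.data_nodes():
            # An AccessNode with incoming edges represents a write to its data.
            if sdfg.arrays[node.data].transient and state.in_degree(node) > 0:
                writers[node.data] += 1
    # Transients with more than one writing AccessNode are the suspicious ones.
    return {name: count for name, count in writers.items() if count > 1}
```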
##### Misc
- The order of the transformations in auto optimize should be revisited. The first step (creating large Maps) is probably still good; the most important question is where to put the loop blocking in relation to the "if mover" (because the two could block each other).
- We currently assume constant folding, for example that `limited_area` is always `True` and `extra_diffu` is always `False`.
This is fine for MCH, but in general we should find a proper solution.
For this we use `gt_substitute_compiletype_symbols()`, which does not work correctly in all cases (see above); a small sketch of the general idea follows after this list.
- The NASA guys have fixed the `ScalarToSymbol` promotion; it might be time to enable it in our simplify pipeline.
- The pattern matching is not very fast; we should implement a custom one to speed up at least the fusion transformations, since they have a very regular and simple pattern.
- Sometimes the reduction nodes are already expanded when we get them from the lowering.
I do not think that this is a big problem, but we have to figure out why this is the case.
- It is very likely that we have dynamic allocations, i.e. calls to `new` which we have to address.
- There is also an updated state fusion transformation that seems better than the current one; it should be integrated.
- Double buffering needs a refactoring.
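To illustrate the constant-folding item above: the general idea of baking compile-time flags and sizes into the SDFG can be shown with plain DaCe, where `SDFG.specialize()` replaces free symbols with constants. This is only a sketch of the concept; in GT4Py the actual helper is the `gt_substitute_compiletype_symbols()` mentioned above, which also has to handle flags such as `limited_area`.

```python
import dace

N = dace.symbol("N")


@dace.program
def scale(inp: dace.float64[N], out: dace.float64[N]):
    out[:] = inp * 0.5


sdfg = scale.to_sdfg()
# Replace the free symbol with a fixed value so that later transformations and
# the code generator see a compile-time constant instead of a runtime symbol.
sdfg.specialize({"N": 1024})
```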
##### Strides
- For optimal performance the Map variables must be ordered in a specific way.
Usually we rely on the names of the Map variables; however, in some cases this is not possible (especially if the `CopyToMap` transformation has been run).
The right way would be to look at the access pattern and then derive the correct iteration order from the strides of the accessed containers; a small sketch of this idea follows after this list.
Note that this requires knowing the values of the strides at optimization time.
Since a call to `CopyToMap` is essentially a sign that some fusion or elimination did not work, we should also address that underlying problem.
- There is a transformation that corrects the strides of containers in NestedSDFGs.
However, views are currently ignored; there is an experimental [PR](https://github.com/GridTools/gt4py/pull/1784) for this.
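To make the stride-based iteration order from the first item more concrete, the sketch below ranks map parameters by the smallest stride of any array dimension they index. It is only an illustration: `param_stride_hints` is a hypothetical helper, it only looks at arrays read inside the map, and it assumes the strides are already known as integers at optimization time.

```python
import dace
from dace.sdfg import nodes


def param_stride_hints(sdfg: dace.SDFG, state: dace.SDFGState,
                       map_entry: nodes.MapEntry) -> dict[str, int]:
    """Smallest numeric stride of any array dimension indexed by each map parameter."""
    hints: dict[str, int] = {}
    for edge in state.out_edges(map_entry):  # memlets of data read inside the map
        memlet = edge.data
        if memlet.is_empty():
            continue
        strides = sdfg.arrays[memlet.data].strides
        for dim, index in enumerate(memlet.subset.min_element()):
            free = {str(s) for s in getattr(index, "free_symbols", set())}
            for param in map_entry.map.params:
                if param in free and str(strides[dim]).isdigit():
                    stride = int(strides[dim])
                    hints[param] = min(hints.get(param, stride), stride)
    return hints


# The parameter with the smallest hint should become the innermost (fastest
# varying) dimension of the map; on GPU it should map to the block's x-dimension.
```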
##### GPU Transform
- There is an issue in the GT GPU transformation.
In order to set the GPU block size and the iteration order correctly, we essentially run `CopyToMap` inside it (the background is that the code generator runs it to transform some Memlets it cannot handle). However, it seems that we do not catch all cases (a small sketch on setting block sizes per map follows after this list).
- There is a bug in DaCe's GPU transformation (see [DaCe Issue#1773](https://github.com/spcl/dace/issues/1773) and [GT4Py PR#1741](https://github.com/GridTools/gt4py/pull/1741/files)).
However, it has low priority as it only affects a unit test that covers an edge case.
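As a hedged illustration of what the GT GPU transformation ultimately has to decide: in DaCe the launch configuration can be set globally via the `compiler.cuda.default_block_size` config entry or per map via the `gpu_block_size` property. The helper below is a sketch only; `set_block_sizes` and the chosen block size are assumptions, not part of the existing transformation.

```python
import dace
from dace.sdfg import nodes


def set_block_sizes(sdfg: dace.SDFG, block_size=(32, 8, 1)) -> None:
    """Assign an explicit launch block size to every GPU device map (sketch)."""
    for node, _parent in sdfg.all_nodes_recursive():
        if (isinstance(node, nodes.MapEntry)
                and node.map.schedule == dace.ScheduleType.GPU_Device):
            node.map.gpu_block_size = list(block_size)
```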
## Appetite
So far only I, Philip, have worked on this project, but it needs more people: there is a lot of work, and I am the only one who knows the pipeline and how it works, so we should spread the knowledge.
The ideal candidate would be familiar with DaCe (knowing the core concepts is enough), but this is not mandatory.
It would also be helpful if the additional person knows something about GTIR and could thus suggest optimizations that are enabled by implicit assumptions of the IR.
## Solution
We will proceed in the same manner as before, i.e.
- Select a stencil.
- Lower it and pass it to the optimization pipeline. Fix any errors that arise from this.
- Look at the optimized SDFG for any obvious issues, such as redundant copies or unfused maps, and fix/improve/develop transformations to solve them.
Note that sometimes the SDFG does not look optimal, but the generated code does. So before doing any heavy work, look at the generated assembly to see if the compiler already performs the step in question. An example would be the fusing of serial `for` loops with bounds known at compile time, i.e. reductions.
- Time the stencil and check whether there is an improvement.
We then iterate this process until the SDFG and the timings are okay.
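A rough sketch of this inspect-and-time loop with plain DaCe is shown below. It is only an illustration: the real pipeline is GT4Py's own optimizer, and `example_stencil` as well as the array sizes are placeholders.

```python
import time

import numpy as np

import dace
from dace.transformation.auto.auto_optimize import auto_optimize

N = dace.symbol("N")


@dace.program
def example_stencil(a: dace.float64[N], b: dace.float64[N]):
    b[:] = a * 2.0 + 1.0


sdfg = example_stencil.to_sdfg()
sdfg.save("before.sdfg")                  # inspect in the SDFG viewer
auto_optimize(sdfg, dace.DeviceType.CPU)  # stand-in for the GT4Py pipeline
sdfg.save("after.sdfg")                   # compare against the unoptimized version

csdfg = sdfg.compile()
a = np.random.rand(1_000_000)
b = np.empty_like(a)
start = time.perf_counter()
csdfg(a=a, b=b, N=a.size)
print(f"runtime: {time.perf_counter() - start:.6f} s")
```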
### Additional Tools
For this process there is already an [infrastructure](https://github.com/philip-paul-mueller/benchmark_2) (Enrique should _not_ look at this) which can be used for the job.
However, there is one problem.
Previously we worked on one or two stencils, but this process will accelerate, as new stencils will have fewer (but more specific) issues.
Thus it becomes important to check for regressions, which was simple when there were just two stencils.
We therefore need a tool that allows comparing stencil timings against some reference.
Implementing a very simple tool for this should be straightforward, as the current benchmarker already outputs statistics.
So we can put a reference data dump somewhere, run the stencils, and make a comparison.
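A minimal sketch of such a comparison tool is given below; the file layout (a JSON mapping of stencil name to median runtime) and the tolerance are assumptions, not something the current benchmarker already produces.

```python
import json
import pathlib
import sys

TOLERANCE = 0.05  # allow a 5% slowdown before flagging a regression


def compare(reference_file: str, current_file: str) -> int:
    """Compare per-stencil median runtimes against a stored reference dump."""
    reference = json.loads(pathlib.Path(reference_file).read_text())
    current = json.loads(pathlib.Path(current_file).read_text())
    regressions = []
    for stencil, ref_time in reference.items():
        cur_time = current.get(stencil)
        if cur_time is None:
            print(f"WARNING: no current timing for {stencil}")
            continue
        if cur_time > ref_time * (1.0 + TOLERANCE):
            regressions.append((stencil, ref_time, cur_time))
    for stencil, ref_time, cur_time in regressions:
        print(f"REGRESSION: {stencil}: {ref_time:.6f}s -> {cur_time:.6f}s")
    return 1 if regressions else 0


if __name__ == "__main__":
    sys.exit(compare(sys.argv[1], sys.argv[2]))
```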
### Integration Branches
- GT4Py: all changes are merged to `main` branch.
- However, for `concat_where` the GT4Py branches listed under "Newest Version" below are currently needed.
- DaCe: `main` branch.
- ICON4Py: [`update_to_gtir_dace`](https://github.com/C2SM/icon4py/tree/update_to_gtir_dace) branch
- Originally branched from `update_to_gtir` to comply with GT4Py main.
- Compared to `update_to_gtir`, disabled tests in `atmosphere/advection`.
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
### Stencil Priority
Here is a list of stencils that we should look at next.
They are ranked according to their priority, based on Christoph's input:
- [`fused_velocity_advection_stencil_19_to_20`](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/fused_velocity_advection_stencil_19_to_20.py):
- What does the SDFG look like:
The SDFG looks good; however, there is a redundant copy that the chain remover did not get rid of.
The reason is that it cannot be shown that the full array is read.
Still, we should add a special case there.
- ICON grid global (keep skip values)
- Locally: It verifies on CPU & GPU
- MCH grid (remove skip values)
- Locally: It verifies on CPU & GPU
- Balfrin:
- [`fused_velocity_advection_stencil_15_to_18.py`](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/fused_velocity_advection_stencil_15_to_18.py):
- Version `a`, where `lvn_only := False`:
- What does the SDFG look like:
There are some redundant copies that are turned into maps (on GPU).
The main reason is that the copy-chain removal does not apply because `a1` (the array that would be removed) is a global; we have to extend it.
- ICON grid global (keep skip values)
- Locally: It verifies on CPU & GPU
- MCH grid (remove skip values)
- Locally: It verifies on CPU & GPU
- Balfrin:
- Version `b`, where `lvn_only := True`:
- What does the SDFG look like:
The SDFG is super small: one map with not that much inside.
- ICON grid global (keep skip values)
- Locally: It verifies on CPU & GPU
- MCH grid (remove skip values)
- Locally: It verifies on CPU & GPU
- Balfrin:
- [`fused_velocity_advection_stencil_8_to_13`](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/fused_velocity_advection_stencil_8_to_13.py):
- Version a, `istep == 0`:
- What does the SDFG look like:
There are some redundant copies that could not be removed by the chain remover; we have to improve it.
- ICON grid global (keep skip values)
- Locally: GPU & CPU
- MCH grid (remove skip values)
- Locally: GPU & CPU
- Balfrin:
- Version b, `istep == 1`:
- What does the SDFG look like:
There are a lot of redundant copies that the chain remover cannot get rid of.
- ICON grid global (keep skip values)
- Locally: Verifies on CPU and GPU
- MCH grid (remove skip values)
- Locally: Verifies on CPU and GPU
- Balfrin:
- [`fused_velocity_advection_stencil_1_to_7.py`](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/fused_velocity_advection_stencil_1_to_7.py):
- Version a, `istep == 1, lvn_only == True`:
- What does the SDFG look like:
There are a lot of invalid map ranges (either zero-sized or negative).
Furthermore, there are a lot of redundant copies that the chain remover could not remove.
- ICON grid global (keep skip values)
- Locally: Verifies on CPU and GPU
- MCH grid (remove skip values)
- Locally: Verifies on CPU and GPU
- Balfrin:
- Version b, `istep == 1, lvn_only == False`:
- What does the SDFG look like:
There are a lot of invalid map ranges (either zero-sized or negative).
Furthermore, there are a lot of redundant copies that the chain remover could not remove.
- ICON grid global (keep skip values)
- Locally: Verifies on CPU and GPU
- MCH grid (remove skip values)
- Locally: Verifies on CPU and GPU
- Balfrin:
- Version c, `istep == 2, lvn_only == False`:
- What does the SDFG look like:
There are some warnings about invalid map ranges, but I cannot see them; probably they were optimized away.
However, there is some strange self-copying going on, concerning `vn_ie`.
- ICON grid global (keep skip values)
- Locally: Verifies on CPU and GPU
- MCH grid (remove skip values)
- Locally: Verifies on CPU and GPU
- Balfrin:
- `calculate_nabla2_and_smag_coefficients_for_vn`:
- What does the SDFG look like:
One big Map is created.
- ICON grid global (keep skip values)
- Locally: Verifies on CPU and GPU
- MCH grid (remove skip values)
- Locally: Verifies on CPU and GPU
- Balfrin:
- `calculate_diagnostic_quantities_for_turbulence`: (No NumPy reference present)
- What does the SDFG look like:
There are two maps, but this is because they have different bounds in the vertical dimension.
- ICON grid global (keep skip values)
- Locally: Runs on CPU & GPU (but there is no reference)
- MCH grid (remove skip values)
- Locally: Runs on CPU & GPU (but there is no reference)
- Balfrin:
- `calculate_enhanced_diffusion_coefficients_for_grid_point_cold_pools`:
- What does the SDFG look like:
There are two maps, because there is a nested neighborhood reduction, so the data dependencies cannot be satisfied in a single map.
- ICON grid global (keep skip values)
- Locally: Runs on CPU & GPU (but there is no reference)
- MCH grid (remove skip values)
- Locally: Runs on CPU & GPU (but there is no reference)
- Balfrin:
- `apply_diffusion_to_theta_and_exner`:
- What does the SDFG look like:
- There are two maps, but this is because it is a nested reduction.
- First map runs the `_calculate_nabla2_for_z` function which has an `E2C` neighbor access
- Second map runs the rest of the operator with only `C2*` neighbors
- ICON grid global (keep skip values)
- Locally: Verifies on CPU and runs on GPU (It takes forever to run the NumPy reference)
- MCH grid (remove skip values)
- Locally: Verifies on CPU and runs on GPU (It takes forever to run the NumPy reference)
- Balfrin: Verifies on GPU with K = 3 (K = 80 takes forever)
- [ ] Performance improvement
- The stencil calls [_truly_horizontal_diffusion_nabla_of_theta_over_steep_points](https://github.com/C2SM/icon4py/blob/main/model/atmosphere/diffusion/src/icon4py/model/atmosphere/diffusion/stencils/truly_horizontal_diffusion_nabla_of_theta_over_steep_points.py#L20), where a `where` statement conditionally modifies the value of a variable.
- Instead of executing the extra calculations that update the variable only where the mask is `True`, the extra computations are executed unconditionally (a small NumPy illustration follows after the timing numbers below).
- There are 2 issues with this:
- 1. In CUDA there is an assignment to a new variable even if the value doesn't change (this has little influence on performance).
- 2. The extra computations in the CUDA kernel are outside the if-statement, so they are executed all the time. Since the condition is rarely `True` in real data (according to Christoph), this adds a lot of overhead.
- The results after changing the CUDA code by hand according to the above recommendation are the following:
- Reference DaCe
- 50% true: 0.00297852
- 5% true: 0.002612
- Opt DaCe
- 50% true: 0.00228161
- 30% true: 0.00195992
- 10% true: 0.00169644
- 5% true: 0.00161418
- OpenACC (according to Christoph the mask should be mostly `False`, but we don't know the percentage; we need to serialize the data to make sure)
- 0.001779
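The effect described in the performance item above can be illustrated with a small NumPy analogy; `expensive_update` and the 5% mask density are placeholders. `np.where` evaluates the expensive branch everywhere before selecting, whereas masked assignment only computes the update where the mask is set, which is the behaviour the hand-edited CUDA kernel restores.

```python
import numpy as np


def expensive_update(x: np.ndarray) -> np.ndarray:
    # Placeholder for the costly correction applied at steep points.
    return np.sqrt(np.abs(x)) * 1.5 + np.sin(x)


rng = np.random.default_rng(0)
theta = rng.random(1_000_000)
mask = rng.random(1_000_000) < 0.05  # the correction is rarely needed

# What the generated kernel effectively does: compute everywhere, then select.
out_unconditional = np.where(mask, expensive_update(theta), theta)

# What the hand-edited kernel does: compute only where the mask is set.
out_guarded = theta.copy()
out_guarded[mask] = expensive_update(theta[mask])

assert np.allclose(out_unconditional, out_guarded)
```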
The verifications were done on:
- ICON4Py: `update_to_gtir_dace_concat_where` (`e43e480d3fd3a2226d4110131b33244aa3a30ca5`)
- GT4Py: `edoardo/gtir-dace-next` (`5395f1de10bf754bb46067a538a1d5d197828636`) together with [PR#1910](https://github.com/GridTools/gt4py/pull/1910) (`fbe49ed930437c7e22f74574ad44baf4d21ce190`)
- DaCe: [philip/modified-gpu-launches](https://github.com/philip-paul-mueller/dace/tree/modified-gpu-launches) (`fbe49ed930437c7e22f74574ad44baf4d21ce190`)
#### Newest Version
No extensive tests have been made with this setup yet.
- GT4Py: It essentially needs Edoardo's `gtir-dace-next` branch with some additions; they are all combined in [`gtir-dace-next_phil`](https://github.com/philip-paul-mueller/gt4py/tree/gtir-dace-next_phil), which is Edoardo's branch (forked off at `86a66c44f7`) together with:
- [PR#1913](https://github.com/GridTools/gt4py/pull/1913), which modifies the GPU transformations.
- [PR#1915](https://github.com/GridTools/gt4py/pull/1915), which modifies some redundant-array transformations.
- DaCe: It must be [PR#1976](https://github.com/spcl/dace/pull/1976), which uses a native CUDA `memcpy` call in more cases.
- ICON4Py: The branch [`update_to_gtir_dace_concat_where`](https://github.com/C2SM/icon4py/tree/update_to_gtir_dace_concat_where), which is maintained by Edoardo; specifically we are using commit `120693d1b4d1f4c561a10d1af1afefd19243dfe4`.