# [Blueline] Granule performance profiling and correctness checking for running Python diffusion from ICON
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by: Sam, Abishek
- Appetite (FTEs, weeks): Full cycle
- Developers: <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->
- We need to check whether the outputs that the Python diffusion granule produces when called directly from Fortran are correct.
- Initial tests have shown that the Python diffusion granule called from Fortran for the CPU build is significantly slower than Fortran (5x slowdown). We need to identify the bottlenecks and understand if there is room for further optimisation.
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
Full cycle
## Solution
<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->
#### Benchmarking
Benchmark diffusion granule runtime on:
- single and multi node runs
- CPU and GPU
- GHEX execution (verify that it is actually running)
We want timings for the `init` and `run` subroutines separately. It is also important to vary the simulation length to see whether the one-off penalty from initialising the Python interpreter is amortised over longer simulations (see the sketch below).
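A minimal back-of-the-envelope sketch of the amortisation argument, with purely illustrative (not measured) numbers for the init and per-step costs:
```python
# Illustrative only: hypothetical init and per-step costs, not measured values.
def total_runtime(t_init: float, t_step: float, n_steps: int) -> float:
    """Total wall time = one-off initialisation cost + per-step cost * number of steps."""
    return t_init + t_step * n_steps

for n_steps in (6, 720, 100_000):
    total = total_runtime(t_init=180.0, t_step=0.002, n_steps=n_steps)
    print(f"{n_steps:>7} steps: {total:9.1f} s total, init share = {180.0 / total:.1%}")
```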
Profiling/benchmarking tools:
- [gprof](https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html), can profile Fortran and C code as long as it is compiled with profiling enabled (`-pg`).
#### Identifying bottlenecks
- Using a simple example that can be run locally, identify the bottlenecks in calling Python from Fortran via CFFI: how long it takes to set up the Python interpreter, how long it takes to convert pointers to numpy arrays and numpy arrays to gt4py fields, and how long the gt4py programs themselves take to execute (see the sketch after this list).
- Identify areas where we could potentially optimise.
- Talk to NOAA/NASA team about their approach for FV3 dycore.
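A self-contained sketch of the pointer-conversion measurement using CFFI and numpy only; the buffer below stands in for a pointer handed over from Fortran, and wrapping the resulting array into a gt4py field would be timed the same way:
```python
import time

import cffi
import numpy as np

ffi = cffi.FFI()

# Stand-in for a Fortran/C pointer: allocate a plain C double array.
n = 32_000 * 65
c_buf = ffi.new("double[]", n)

start = time.perf_counter()
# ffi.buffer exposes the C memory; np.frombuffer wraps it without copying.
arr = np.frombuffer(ffi.buffer(c_buf, n * 8), dtype=np.float64)
elapsed = time.perf_counter() - start
print(f"pointer -> numpy view: {elapsed * 1e6:.1f} µs for {n} doubles")
```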
#### Correctness
- Use Serialbox to serialise data from the Python granule builds (CPU and GPU) and compare it against the reference serialised data from the corresponding icon-dsl builds (see the comparison sketch after this list).
1. Compare CPU2Py against icon-dsl CPU
2. Compare ACC2Py against icon-dsl ACC
- Need to ensure that the order of dimensions passed from the Fortran CPU code matches what the Python CPU stencils expect.
- Investigate potential issues with non-contiguous memory and make sure we understand them.
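A hedged sketch of the comparison step, assuming both builds wrote Serialbox datasets with matching savepoints and field names; directory and prefix names are placeholders, and the exact Serialbox Python method names may differ slightly:
```python
import numpy as np
import serialbox as ser

# Placeholder paths/prefixes; adjust to wherever the two builds write their data.
ref = ser.Serializer(ser.OpenModeKind.Read, "./ser_reference", "diffusion")
py = ser.Serializer(ser.OpenModeKind.Read, "./ser_granule", "diffusion")

for savepoint in ref.savepoint_list():
    for name in ref.fieldnames():
        a = ref.read(name, savepoint)
        b = py.read(name, savepoint)
        if not np.allclose(a, b, rtol=1e-12, atol=1e-12):
            print(f"MISMATCH {name} at {savepoint}: "
                  f"max abs diff {np.max(np.abs(a - b))}")
```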
#### Profiling goals
##### high level
- Determine the overhead of interpreter initialisation
- Determine the overhead of running code through an already initialised Python interpreter (Fortran → C → Python)
##### python level
- Determine the runtime of converting pointers to numpy arrays
- Determine the runtime of allocating gt4py fields
- Determine the runtime of compiling stencils vs. using precompiled stencils
##### granule level
- Profile execution of the diffusion granule on CPU and GPU using the gtfn cached backends (a cProfile sketch follows below).
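A minimal sketch of Python-level profiling of one granule call with `cProfile`; the entry point `run_diffusion_step` is a placeholder, not the actual granule API:
```python
import cProfile
import pstats

def run_diffusion_step():
    ...  # placeholder for the real granule `run` call

profiler = cProfile.Profile()
profiler.enable()
run_diffusion_step()
profiler.disable()

# Top 20 entries by cumulative time show where a single call spends its time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```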
##### simple profiling with a very simple function
- Determine overhead of the call to CFFI itself.
- Can use timers outside the Fortran code.
- Determine overhead of interpreter initialisation.
- Determine overhead of calling the routine and returning value.
```
timer_start
call_routine()  # first call: includes interpreter initialisation
timer_end

timer_start
call_routine()  # second call: measures only the overhead of calling the routine
timer_end
```
##### profiling with different tools
- Can use `gprof` to get an overview of timings. This can potentially be done without manual timers, as it should tell us where time is spent in the different parts of the program.
- `pgprof` could be used for profiling GPU accelerated code as well.
- `nsys` could be used to profile the GPU kernels.
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
Applying any kind of optimisation is not in scope for this cycle.
## EXCLAIM planning meeting slides
https://docs.google.com/presentation/d/1wVFJy0eaxx8YLxHAw2aaCyiSRySEu_1GEQeqt-S34SY/edit#slide=id.g2c619506477_0_5
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
Running ICON-DSL over 6 timesteps. Experiment MCH-R04B09.
**OpenACC**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 6 | 0.00176s | 0.00178s | 0.00183s | 0.011 | 0.011 | 0.011 |
| nh_hdiff_run_initial | 1 | 0.00198s | 0.00198s | 0.00198s | 0.002 | 0.002 | 0.002 |
**CPU (nproma 32000)**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 6 | 0.62581s | 0.63509s | 0.64770s | 3.811 | 3.811 | 3.811 |
| nh_hdiff_run_initial | 1 | 0.66723s | 0.66723s | 0.66723s | 0.667 | 0.667 | 0.667 |
**GPU2Py (no debug, no optimisations)**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 6 | 0.61353s | 0.61503s | 0.61659s | 3.690 | 3.690 | 3.690 |
| nh_hdiff_run_initial | 1 | 03m02s | 03m02s | 03m02s | 182.069 | 182.069 | 182.069 |
**GPU2Py (no profiling, no debug, CachedProgram)**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 6 | 0.02157s | 0.02161s | 0.02166s | 0.130 | 0.130 | 0.130 |
| nh_hdiff_run_initial | 1 | 03m08s | 03m08s | 03m08s | 188.158 | 188.158 | 188.158 |
**GPU2Py (no profiling, no debug, CachedProgram, optimised extract_connectivity_args)**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 6 | 0.00277s | 0.00279s | 0.00281s | 0.017 | 0.017 | 0.017 |
| nh_hdiff_run_initial | 1 | 03m08s | 03m08s | 03m08s | 188.039 | 188.039 | 188.039 |
**GPU2Py (no profiling, no debug, CachedProgram, optimised extract_connectivity_args, optimised convert_args)**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 6 | 0.00221s | 0.00222s | 0.00223s | 0.013 | 0.013 | 0.013 |
| nh_hdiff_run_initial | 1 | 03m07s | 03m07s | 03m07s | 187.593 | 187.593 | 187.593 |
**GPU2Py (no profiling, no debug, CachedProgram, optimised extract_connectivity_args, optimised convert_args), PYTHONOPTIMIZE=1**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 6 | 0.00212s | 0.00212s | 0.00214s | 0.013 | 0.013 | 0.013 |
| nh_hdiff_run_initial | 1 | 03m05s | 03m05s | 03m05s | 185.177 | 185.177 | 185.177 |
**GPU2Py (no profiling, no debug, CachedProgram, optimised extract_connectivity_args, optimised convert_args), PYTHONOPTIMIZE=1, `math.prod`**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 6 | 0.00199s | 0.00200s | 0.00204s | 0.012 | 0.012 | 0.012 |
| nh_hdiff_run_initial | 1 | 03m04s | 03m04s | 03m04s | 184.615 | 184.615 | 184.615 |
None of the above GPU2Py runs verified, because wrong sizes were passed in CachedProgram.
**GPU2Py (no profiling, no debug, CachedProgram, optimised extract_connectivity_args, optimised convert_args), PYTHONOPTIMIZE=1, `math.prod`, CachedProgram fixes**
| name | # calls | t_min | t_avg | t_max | total min (s) | total max (s) | total avg (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nh_hdiff_run | 720 | 0.00215s | 0.00218s | 0.00229s | 1.567 | 1.567 | 1.567 |
| nh_hdiff_run_initial | 1 | 02m52s | 02m52s | 02m52s | 172.886 | 172.886 | 172.886 |
---
## GPU2Py Results Interpretation
Average runtime (`t_avg`) compared to OpenACC.
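A short sketch of how the percentages below are derived from the `t_avg` columns above (values copied from the tables; the OpenACC reference for `nh_hdiff_run` is 0.00178 s):
```python
t_avg_acc = 0.00178  # OpenACC nh_hdiff_run, from the table above

for label, t_avg in [
    ("CachedProgram + optimised args + fixes", 0.00218),
    ("math.prod variant", 0.00200),
]:
    ratio = t_avg / t_avg_acc
    print(f"{label}: {ratio:.2f}x the OpenACC time, i.e. ~{(ratio - 1) * 100:.0f}% slower")
```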
### No debug, no optimisations
- **nh_hdiff_run**: approximately 344.52 times slower
### No profiling, no debug, CachedProgram
- **nh_hdiff_run**: approximately 11.14x slower
### No profiling, no debug, CachedProgram, optimised `extract_connectivity_args`
- **nh_hdiff_run**: approximately 56% slower
### No profiling, no debug, CachedProgram, optimised `extract_connectivity_args`, `convert_args`
- **nh_hdiff_run**: approximately 25% slower
### No profiling, no debug, CachedProgram, optimised `extract_connectivity_args`, `convert_args`, `PYTHONOPTIMIZE=1`
- **nh_hdiff_run**: approximately 19% slower
### No profiling, no debug, CachedProgram, optimised `extract_connectivity_args`, `convert_args`, `PYTHONOPTIMIZE=1`, `math.prod`
- **nh_hdiff_run**: approximately 12% slower
### No profiling, no debug, CachedProgram, optimised `extract_connectivity_args`, `convert_args`, `PYTHONOPTIMIZE=1`, `math.prod`, CachedProgram fixes
- **nh_hdiff_run**: approximately 22.5% slower
### Optimisations
- Remove `isinstance` checks from `convert_args` and `extract_connectivity_args` by using type dispatching (see the sketch after this list).
- Use `CachedProgram` to cache the program and thereby avoid excessive lowering.
- Remove `_ensure_is_on_device` by loading connectivities onto the device at Grid initialisation on GPU.
- Use `PYTHONOPTIMIZE=1` (equivalent to `-O`).
- Use `math.prod` instead of `np.prod` in the unpack function.
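Illustrative sketch only (not the actual icon4py code) of the type-dispatching idea: a chain of `isinstance` checks in an argument-conversion helper is replaced by `functools.singledispatch`, so the per-type branch is resolved through the dispatch table:
```python
from functools import singledispatch

import numpy as np

@singledispatch
def convert_arg(arg):
    return arg  # default: pass the argument through unchanged

@convert_arg.register
def _(arg: np.integer):
    return int(arg)  # compiled bindings want plain Python ints

@convert_arg.register
def _(arg: np.ndarray):
    return np.ascontiguousarray(arg)

print(convert_arg(np.int64(3)), convert_arg("unchanged"))
```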
Other optimisation targets:
- Embedded `_maker` takes up ~15% of the time.
- `convert_args` still takes up ~30% of the time.
- `extract_connectivity_args` takes up 6% of the time. Cache it from the start?
- `unpack_gpu` takes up 5%.
- Size passing in `CachedProgram`.
### Fixes
- Pass sizes correctly from `CachedProgram` to the decorated program. The GT4Py nanobind bindings expect size arguments when no domain is specified, and do not expect them otherwise, so the sizes have to be passed conditionally. The bindings also expect integer arguments to be plain `int` rather than `np.integer`, so these had to be converted as well (a small illustration follows below).
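A hedged illustration of the integer-type part of the fix: sizes derived from numpy shapes are `np.int64`, while the nanobind-generated bindings expect builtin `int` (`compiled_program` is a placeholder, not a real symbol):
```python
import numpy as np

sizes = (np.int64(10), np.int64(80))      # e.g. derived from array shapes
size_args = tuple(int(s) for s in sizes)  # convert to builtin ints for the bindings

assert all(type(s) is int for s in size_args)
# compiled_program(*field_args, *size_args)  # placeholder call site
```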
## Relevant PRs
GT4Py: https://github.com/GridTools/gt4py/pull/1536
Icon4Py: https://github.com/C2SM/icon4py/pull/449