# [Blueline] Run Python granules from Fortran with py2fgen
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by: Sam, Magdalena, Abishek
- Appetite (FTEs, weeks): Full cycle
- Developers: Abishek, Sam <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->
Continuation of project [Diffusion granule interface to Fortran](https://hackmd.io/GQEhCwILSESwWtx4BVwzKg).
Currently, the CFFI plugin and bindings in `py2fgen` have only been tested on the case where we execute a simple Python function or GT4Py program without sparse fields from Fortran. For this tool to be useful, we want to use it to create CFFI plugins for granules and execute those from Fortran.
By the end of the cycle we want to understand whether we can use `py2fgen` to generate a cffi plugin for an entire granule.
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
Full cycle for 1 to 2 developers
## Solution
<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->
### Steps
#### Granule CFFI Plugin Generation
- Write a basic Python driver code for the diffusion granule (for testing).
- Modify the Python code generation so that it correctly calls the diffusion code.
- Write a Fortran driver code modeled on the Python driver code.
- Compile and run the Fortran driver which executes the diffusion granule in Python.
#### Additional cleanup
- Explore whether field sizes can be handled automatically within a subroutine using the `SIZE` function, similar to how it is done in the bindings generator.
- Pass `start_`, `end_indices` as additional arrays.
- Increase test coverage (code generation, parsing, and a granule integration test).
- Use `as_field` instead of the deprecated `np_as_located_field`; this requires adapting to the new field representation.
#### ICON integration
- Research how the cffi plugin (dynamic library) can be compiled as part of the ICON binary and integrated into the build system.
- Try to call the diffusion granule from ICON.
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
- Do not spend too much time coming up with an automatic process to determine the `SIZE` of Fortran arrays.
- Build integration into ICON may not be straightforward. Need to ensure we have access to the correct Python interpreter when running the ICON binary.
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
- Do not try to automate the build integration of py2fgen into ICON-DSL.
- No benchmarking/timing of the overhead incurred by calling Python from Fortran.
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
*cycle 20*
## TODO
##### Other changes
- [x] Allow embedding more than one function in DLL
##### Auto Array Size Determination
- [x] Implement automatic array size determination in `template.py`
- [x] Rewrite tests without passing array size arguments.
##### Fortran arrays in Python
- [x] Use `as_field` to construct GT4Py Fields.
*`as_field` creates a copy of the data; this has to be modified. Until then, use `np_as_located_field`.*
##### Granules
- [x] Write dummy Python driver code for diffusion granule.
- [x] Instantiate grid without serialized data (use grid file directly).
- [x] Write dummy Fortran driver code for diffusion granule.
- [x] Generate bindings, compile and run dummy diffusion granule.
##### Test Infrastructure
- [x] Improve testing infrastructure
- [x] Expand code generation tests
- [x] Expand parsing tests
- [x] Add test for `generate_and_compile_cffi_plugin`
- [x] Test error propagation
- [x] Add compilation tests (full integration)
- [x] Other minor cleanup tasks
##### Integration (CPU, GPU backends)
- [x] Run Diffusion Granule with GTFN CPU
- [x] Run with Chia Rui's diffusion granule
- [x] Run with pre-compiled stencils
- [x] Compile with GTFN GPU backend
- [x] Run basic example with GTFN GPU backend (Approach 1)
- [x] Use `$acc host_data use_device` directive to run basic example with GTFN GPU backend.
- [x] Cleanup switching between GPU and CPU modes.
- [x] Add tests for GPU and CPU modes (square, multi_return, diffusion).
- [x] Run diffusion granule on GPU locally.
##### Open Questions
- What is the overhead of the Python interpreter?
- What is the runtime difference between OpenACC and Python GTFN_GPU?
- What is the runtime difference between ICON CPU and Python GTFN_CPU?
- Why does `np_as_located_field` behave differently from `as_field`?
- Are the results produced at runtime correct?
- How can we fully automate this in the ICON build system?
### Automatic Array Size Determination
We can use assumed-size arrays in Fortran via `dimension(*)`, as in icon4pygen. Array sizes can then be determined at runtime with `SIZE()` in a wrapper subroutine that calls the C function (also similar to icon4pygen).
**Solution**
```Fortran
module square_plugin
   use, intrinsic :: iso_c_binding
   implicit none

   public :: run_square_plugin

   ! Declaration of the external C function
   interface
      subroutine square_wrapper(inp, result, n_Cell, n_K) bind(c, name="square_wrapper")
         import :: c_int, c_double
         integer(c_int), value :: n_Cell, n_K
         real(c_double), dimension(*), intent(in) :: inp
         real(c_double), dimension(*), intent(out) :: result
      end subroutine square_wrapper
   end interface

contains

   subroutine run_square_plugin(inp, result)
      real(c_double), dimension(:, :), intent(in) :: inp
      real(c_double), dimension(:, :), intent(out) :: result
      integer(c_int) :: n_Cell, n_K

      n_Cell = SIZE(inp, 1)
      n_K = SIZE(inp, 2)

      call square_wrapper(inp, result, n_Cell, n_K)
   end subroutine run_square_plugin

end module square_plugin
```
### Fortran Arrays in Python
Fortran arrays are passed to C as a pointer, which is forwarded to Python, where it is available as a CFFI `cdata` pointer object, for example `<cdata 'double *' 0x621e9345e910>`.
The pointer carries no information on:
- the total number of elements it references, or
- the shape of the array (its dimensions).

**This information needs to be passed through the Fortran interface.**
Then we can create a NumPy view of the data and create a GT4Py Field from it.
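A minimal sketch of this conversion, assuming the dimension sizes were passed through the interface; `field_from_pointer` and the dimension definitions are illustrative rather than py2fgen's actual generated code (and note the copy caveat for `as_field` in the TODO list above):
```python
import numpy as np
import gt4py.next as gtx

CellDim = gtx.Dimension("Cell")
KDim = gtx.Dimension("K")


def field_from_pointer(ffi, ptr, n_cell, n_k):
    # Wrap the raw C pointer in a buffer of the known total byte size,
    # then view it as a Fortran-ordered 2D NumPy array without copying.
    n_bytes = n_cell * n_k * np.dtype(np.float64).itemsize
    arr = np.frombuffer(ffi.buffer(ptr, n_bytes), dtype=np.float64)
    arr = arr.reshape((n_cell, n_k), order="F")
    # Construct a GT4Py field over (Cell, K) from the NumPy view.
    return gtx.as_field((CellDim, KDim), arr)
```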
### Propagating errors from Fortran to Python
We catch any exceptions that occur on the Python side and return a corresponding return code from the embedded Python function; thus all embedded functions return an `int`. In the Fortran interface the C-bound procedure is a Fortran `function` that returns this `int` return code. When we call the embedded function from Fortran we therefore call the respective wrapper subroutine with an additional return-code argument `rc`, check it after the call, and decide whether to abort Fortran execution.
Example:
```Fortran
integer(c_int) :: rc

call diffusion_wrapper(..., rc)
if (rc /= 0) then
   error stop 1
end if
```
If errors occur in the init code (e.g. an undefined symbol), the program does not crash; the error occurs inside CFFI.
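On the Python side, the pattern is a catch-all around each embedded function that maps success to `0` and any exception to a nonzero code. A sketch under these assumptions (the decorator name is hypothetical; py2fgen's generated code may differ):
```python
import logging


def report_return_code(func):
    # Wrap an embedded function so Python exceptions never propagate into
    # Fortran; instead they are logged and turned into an int return code.
    def wrapper(*args, **kwargs):
        try:
            func(*args, **kwargs)
            return 0  # success
        except Exception:
            logging.exception("embedded function '%s' raised", func.__name__)
            return 1  # nonzero signals failure to the Fortran caller

    return wrapper


@report_return_code
def diffusion_wrapper(*args):
    ...  # convert the passed pointers to fields and run the granule
```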
#### Booleans in Fortran
We need to declare logical arrays in the Fortran interface as `logical(c_int)`, otherwise we get garbage values. In memory such an array is represented as integers `0`/`1` (or `0`/`-1`, depending on the compiler). We therefore need to convert these values to booleans when constructing the NumPy array, since the GT4Py interface requires dtype `bool` for these fields.
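A one-line conversion suffices on the Python side (function name hypothetical): any nonzero integer is treated as `.true.`, which covers both the `1` and `-1` conventions.
```python
import numpy as np


def int_array_to_bool(arr: np.ndarray) -> np.ndarray:
    # Fortran logicals of kind c_int arrive as 32-bit integers:
    # 0 for .false., 1 or -1 for .true. (compiler-dependent).
    return arr != 0
```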
### Preliminary Performance Numbers
We should not call `.with_backend(...)(...)` in a loop in the diffusion granule: rebinding the backend on every call can invalidate the cache and lead to slowdowns. Instead, we bind the backend once with `.with_backend(...)` and call the resulting program instance inside the loop (see the sketch after the timings below).
For a 6-timestep ICON mch_r04b09 run we observed a 3x speedup from doing this, going from 200 seconds to 60 seconds:
- 60 seconds when building the stencils once: `/scratch/mch/skellerh/icon-exclaim/build/run/LOG.exp.mch_ch_r04b09_chia`
- 34.859 seconds using already built stencils: `/scratch/mch/skellerh/icon-exclaim/build/run/LOG.exp.mch_ch_r04b09_dsl.run.2051747.o`
- 7 seconds using the ICON CPU build: `/scratch/mch/skellerh/icon-cpu-build/run/LOG.exp.mch_ch_r04b09_dsl.run.2051742.o`
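A schematic of the pattern, with `run_program`, `backend`, `fields`, and `offsets` as placeholders for an actual GT4Py program, backend, field arguments, and offset providers:
```python
def run_timesteps(run_program, backend, fields, offsets, n_timesteps):
    # Bind the backend once. Calling run_program.with_backend(backend)(...)
    # inside the loop instead would rebind on every timestep and can
    # invalidate the cache.
    run_bound = run_program.with_backend(backend)
    for _ in range(n_timesteps):
        run_bound(*fields, offset_provider=offsets)
```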
### Fortran -> Python on GPU
First, all arrays have to be created on the device using `$acc enter data create(...)` statements, and the data then copied to the device using `$acc data copyin(...)`. Then we can execute the C function bound to the Fortran interface, which calls the Python function via CFFI. On the Python side, in order to create CuPy arrays from the passed pointer, the pointer must be a GPU device pointer and not a host pointer on the CPU. The pointer can then be used to create a CuPy array as follows:
```python
import cupy as cp

# ffi, ptr (a CFFI cdata device pointer), total_size (in bytes), device_id,
# sizes and dtype are assumed to be in scope.
ptr_val = int(ffi.cast("uintptr_t", ptr))
mem = cp.cuda.UnownedMemory(ptr_val, total_size, owner=ptr, device_id=device_id)
memptr = cp.cuda.MemoryPointer(mem, 0)
arr = cp.ndarray(shape=sizes, dtype=dtype, memptr=memptr, order="F")
```
#### Creating device pointers
##### Approach 1: C function to get `device_ptr`
Within the Fortran interface we can call a C function which returns the device pointer using the OpenACC C API.
```C
#include <openacc.h>
#include <stdio.h>
#include <stdint.h>

uintptr_t get_device_ptr_from_fortran_array(double *host_array) {
    void *device_ptr = acc_deviceptr(host_array);
    printf("Host pointer = %p\n", (void *)host_array);
    printf("Device pointer = %p\n", device_ptr);
    return (uintptr_t)device_ptr;
}
```
The C header then just takes a generic C pointer (`void *arg`) for each argument.
The Fortran interface then looks as follows:
```Fortran
module identity_plugin
   use, intrinsic :: iso_c_binding
   implicit none

   public :: identity

   interface
      function get_device_ptr_from_fortran_array(host_array) &
            bind(C, name="get_device_ptr_from_fortran_array") result(ptr)
         use, intrinsic :: iso_c_binding
         ! Assumes a contiguous block of memory seen as a 1D array from Fortran's perspective
         real(c_double), dimension(*), intent(in), target :: host_array
         type(c_ptr) :: ptr
      end function get_device_ptr_from_fortran_array

      function identity_wrapper(device_ptr, n_Cell, n_K) &
            bind(c, name="identity_wrapper") result(rc)
         import :: c_int, c_ptr
         type(c_ptr), value :: device_ptr
         integer(c_int), value :: n_Cell
         integer(c_int), value :: n_K
         integer(c_int) :: rc ! Stores the return code
      end function identity_wrapper
   end interface

contains

   subroutine identity(inp, rc)
      real(c_double), dimension(:, :), target :: inp
      integer(c_int) :: rc ! Stores the return code
      type(c_ptr) :: device_ptr
      integer(c_int) :: n_Cell
      integer(c_int) :: n_K

      n_Cell = SIZE(inp, 1)
      n_K = SIZE(inp, 2)

      ! Obtain the device pointer for the inp array
      device_ptr = get_device_ptr_from_fortran_array(inp)

      ! Pass the device pointer to the C function
      rc = identity_wrapper(device_ptr, n_Cell, n_K)
   end subroutine identity

end module identity_plugin
```
##### Approach 2: Use OpenACC `$acc host_data use_device(...)`
This directs the compiler to use the device address of each variable in the list, for instance when passing it to a procedure. This should let us circumvent the convoluted logic of [[#Approach 1 C function to get `device_ptr`]].
### Environment Variables
##### GPU
```
PYTHONUNBUFFERED=1;GT4PY_GPU=1;GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT;CUDACXX=/usr/local/cuda-12.3/bin/nvcc;GT4PY_BUILD_CACHE_DIR=/tmp/gt4py_cache
```
### Basic Timings
###### gtfn_cached_cpu
```
PYTHONUNBUFFERED=1;GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT;GT4PY_BUILD_CACHE_DIR=/tmp/gt4py_stencils_cpu
```
Without precompiled stencils:
```
Initialising diffusion...
Done running initialising diffusion.
Diffusion initialisation time: 18.66 seconds
Running diffusion...
Done running diffusion.
Diffusion run time: 58.12 seconds
```
With precompiled stencils:
```
Initialising diffusion...
Done running initialising diffusion.
Diffusion initialisation time: 3.02 seconds
Running diffusion...
Done running diffusion.
Diffusion run time: 35.01 seconds
```
###### gtfn_cached_gpu
```
PYTHONUNBUFFERED=1;GT4PY_GPU=1;GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT;CUDACXX=/usr/local/cuda-11.8/bin/nvcc;GT4PY_BUILD_CACHE_DIR=/tmp/gt4py_stencils_gpu/
```
Without precompiled stencils:
```
Initialising diffusion...
Done running initialising diffusion.
Diffusion initialisation time: 25.33 seconds
Running diffusion...
Done running diffusion.
Diffusion run time: 108.58 seconds
```
With precompiled stencils:
```
Initialising diffusion...
Done running initialising diffusion.
Diffusion initialisation time: 3.02 seconds
Running diffusion...
Done running diffusion.
Diffusion run time: 36.40 seconds
```
### Profiling
Using `nsys`:
```bash
/usr/local/bin/nsys profile -o profileOutput env CUDACXX=/usr/local/cuda-11.8/bin/nvcc GT4PY_BUILD_CACHE_DIR=/tmp/gt4py_stencils_gpu/ GT4PY_GPU=1 GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT PYTHONUNBUFFERED=1 ICON_GRID_LOC=/home/sk/Dev/icon4py/testdata python /home/sk/Dev/icon4py/tools/tests/py2fgen/test_diffusion_wrapper.py
```
# Diffusion Granule Integration Steps
https://hackmd.io/lnbpB4CZSyemf5rQwdOYgw