# [Blueline] Run Python granules from Fortran with py2fgen
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by: Sam, Magdalena, Abishek
- Appetite (FTEs, weeks): Full cycle
- Developers: Abishek, Sam <!-- Filled in at the betting table unless someone is specifically required here -->
## Problem
<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->
Continuation of project [Diffusion granule interface to Fortran](https://hackmd.io/GQEhCwILSESwWtx4BVwzKg).
Currently, the CFFI plugin and bindings in `py2fgen` have only been tested on the case where we execute a simple Python function or GT4Py program without sparse fields from Fortran. For this tool to be useful, we want to use it to create CFFI plugins for granules and execute those from Fortran.
By the end of the cycle we want to understand whether we can use `py2fgen` to generate a cffi plugin for an entire granule.
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
Full cycle for 1 to 2 developers
## Solution
<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->
### Steps
#### Granule CFFI Plugin Generation
- Write a basic Python driver code for the diffusion granule (for testing).
- Modify the Python code generation so that it correctly calls the diffusion code.
- Write a Fortran driver code modeled on the Python driver code.
- Compile and run the Fortran driver which executes the diffusion granule in Python.
#### Additional cleanup
- Explore whether field sizes can be handled automatically within a subroutine using the `SIZE` function, similar to how it is done in the bindings generator.
- Pass `start_`, `end_indices` as additional arrays.
- Increase test coverage (code generation, parsing, and a granule integration test).
- Use `as_field` instead of the deprecated `np_as_located_field`; this requires adapting to the new field representation.
#### ICON integration
- Research how the cffi plugin (dynamic library) can be compiled as part of the ICON binary and integrated into the build system.
- Try to call the diffusion granule from ICON.
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
- Do not spend too much time coming up with an automatic process to determine the `SIZE` of Fortran arrays.
- Build integration into ICON may not be straightforward. Need to ensure we have access to the correct Python interpreter when running the ICON binary.
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
- Do not try to automate the build integration of py2fgen into ICON-DSL.
- No benchmarking/timing of the overhead incurred by calling Python from Fortran.
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
*cycle 20*
## TODO
##### Other changes
- [x] Allow embedding more than one function in DLL
##### Auto Array Size Determination
- [x] Implement automatic array size determination in `template.py`
- [x] Rewrite tests without passing array size arguments.
##### Fortran arrays in Python
- [x] Use `as_field` to construct GT4Py Fields.
*`as_field` creates a copy of the data; this has to be modified. Until then, use `np_as_located_field`.*
##### Granules
- [x] Write dummy Python driver code for diffusion granule.
- [x] Instantiate grid without serialized data (use grid file directly).
- [x] Write dummy Fortran driver code for diffusion granule.
- [x] Generate bindings, compile and run dummy diffusion granule.
##### Test Infrastructure
- [x] Improve testing infrastructure
- [x] Expand code generation tests
- [x] Expand parsing tests
- [x] Add test for `generate_and_compile_cffi_plugin`
- [x] Test error propagation
- [x] Add compilation tests (full integration)
- [x] Other minor cleanup tasks
##### Integration (CPU, GPU backends)
- [x] Run Diffusion Granule with GTFN CPU
- [x] Run with Chia Rui's diffusion granule
- [x] Run with pre-compiled stencils
- [x] Compile with GTFN GPU backend
- [x] Run basic example with GTFN GPU backend (Approach 1)
- [x] Use `$acc host_data use_device` directive to run basic example with GTFN GPU backend.
- [x] Cleanup switching between GPU and CPU modes.
- [x] Add tests for GPU and CPU modes (square, multi_return, diffusion).
- [x] Run diffusion granule on GPU locally.
##### Open Questions
- What is the overhead of the Python interpreter?
- What is the runtime difference between OpenACC and Python GTFN_GPU?
- What is the runtime difference between ICON CPU and Python GTFN_CPU?
- Why does `np_as_located_field` behave differently from `as_field`?
- Are the results produced at runtime correct?
- How can we fully automate this in the ICON build system?
### Automatic Array Size Determination
We can use assumed-size arrays in Fortran via `dimension(*)`, as in icon4pygen. Array sizes can then be determined at runtime with `SIZE()` in a wrapper subroutine that calls the C function (also similar to icon4pygen).
**Solution**
```Fortran
module square_plugin
   use, intrinsic :: iso_c_binding
   implicit none

   public :: run_square_plugin

   ! Declaration of the external C function
   interface
      subroutine square_wrapper(inp, result, n_Cell, n_K) bind(c, name="square_wrapper")
         import :: c_int, c_double
         integer(c_int), value :: n_Cell, n_K
         real(c_double), dimension(*), intent(in) :: inp
         real(c_double), dimension(*), intent(out) :: result
      end subroutine square_wrapper
   end interface

contains

   subroutine run_square_plugin(inp, result)
      real(c_double), dimension(:, :), intent(in) :: inp
      real(c_double), dimension(:, :), intent(out) :: result
      integer(c_int) :: n_Cell, n_K

      n_Cell = SIZE(inp, 1)
      n_K = SIZE(inp, 2)

      call square_wrapper(inp, result, n_Cell, n_K)
   end subroutine run_square_plugin

end module square_plugin
```
### Fortran Arrays in Python
Fortran arrays are passed to C as a pointer, which is forwarded to Python, where it is available as a CFFI `cdata` pointer object, for example `<cdata 'double *' 0x621e9345e910>`.
The pointer carries no information on:
- the total number of elements it references, or
- the shape of the array (its dimensions).

**This information needs to be passed through the Fortran interface.**
Then we can create a NumPy view of the data and create a GT4Py Field from it.
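A minimal sketch of this conversion, assuming the dimension sizes were passed through the interface; `field_from_pointer` and the dimension definitions are illustrative rather than py2fgen's actual generated code (and note the copy caveat for `as_field` in the TODO list above):
```python
import numpy as np
import gt4py.next as gtx

CellDim = gtx.Dimension("Cell")
KDim = gtx.Dimension("K")


def field_from_pointer(ffi, ptr, n_cell, n_k):
    # Wrap the raw C pointer in a buffer of the known total byte size,
    # then view it as a Fortran-ordered 2D NumPy array without copying.
    n_bytes = n_cell * n_k * np.dtype(np.float64).itemsize
    arr = np.frombuffer(ffi.buffer(ptr, n_bytes), dtype=np.float64)
    arr = arr.reshape((n_cell, n_k), order="F")
    # Construct a GT4Py field over (Cell, K) from the NumPy view.
    return gtx.as_field((CellDim, KDim), arr)
```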
### Propagating errors from Fortran to Python
We catch any exceptions that occur on the Python side and return a corresponding return code from the embedded Python function; thus all embedded functions return an `int`. In the Fortran interface the C-bound procedure is a Fortran `function` that returns this `int` return code. When we call the embedded function from Fortran we therefore call the respective wrapper subroutine with an additional return-code argument `rc`, check it after the call, and decide whether to abort Fortran execution.
Example:
```Fortran
integer(c_int) :: rc

call diffusion_wrapper(..., rc)
if (rc /= 0) then
   error stop 1
end if
```
If errors occur in the init code (e.g. an undefined symbol), the program does not crash; the error occurs inside CFFI.
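On the Python side, the pattern is a catch-all around each embedded function that maps success to `0` and any exception to a nonzero code. A sketch under these assumptions (the decorator name is hypothetical; py2fgen's generated code may differ):
```python
import logging


def report_return_code(func):
    # Wrap an embedded function so Python exceptions never propagate into
    # Fortran; instead they are logged and turned into an int return code.
    def wrapper(*args, **kwargs):
        try:
            func(*args, **kwargs)
            return 0  # success
        except Exception:
            logging.exception("embedded function '%s' raised", func.__name__)
            return 1  # nonzero signals failure to the Fortran caller

    return wrapper


@report_return_code
def diffusion_wrapper(*args):
    ...  # convert the passed pointers to fields and run the granule
```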
#### Booleans in Fortran
We need to declare logical arrays in the Fortran interface as `logical(c_int)`, otherwise we get garbage values. In memory such an array is represented as integers `0`/`1` (or `0`/`-1`, depending on the compiler). We therefore need to convert these values to booleans when constructing the NumPy array, since the GT4Py interface requires dtype `bool` for these fields.
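A one-line conversion suffices on the Python side (function name hypothetical): any nonzero integer is treated as `.true.`, which covers both the `1` and `-1` conventions.
```python
import numpy as np


def int_array_to_bool(arr: np.ndarray) -> np.ndarray:
    # Fortran logicals of kind c_int arrive as 32-bit integers:
    # 0 for .false., 1 or -1 for .true. (compiler-dependent).
    return arr != 0
```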
### Preliminary Performance Numbers
We should not call `.with_backend(...)(...)` in a loop in the diffusion granule: rebinding the backend on every call can invalidate the cache and lead to slowdowns. Instead, we bind the backend once with `.with_backend(...)` and call the resulting program instance inside the loop (see the sketch after the timings below).
For a 6-timestep ICON mch_r04b09 run we observed a 3x speedup from doing this, going from 200 seconds to 60 seconds:
- 60 seconds when building the stencils once: `/scratch/mch/skellerh/icon-exclaim/build/run/LOG.exp.mch_ch_r04b09_chia`
- 34.859 seconds using already built stencils: `/scratch/mch/skellerh/icon-exclaim/build/run/LOG.exp.mch_ch_r04b09_dsl.run.2051747.o`
- 7 seconds using the ICON CPU build: `/scratch/mch/skellerh/icon-cpu-build/run/LOG.exp.mch_ch_r04b09_dsl.run.2051742.o`
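A schematic of the pattern, with `run_program`, `backend`, `fields`, and `offsets` as placeholders for an actual GT4Py program, backend, field arguments, and offset providers:
```python
def run_timesteps(run_program, backend, fields, offsets, n_timesteps):
    # Bind the backend once. Calling run_program.with_backend(backend)(...)
    # inside the loop instead would rebind on every timestep and can
    # invalidate the cache.
    run_bound = run_program.with_backend(backend)
    for _ in range(n_timesteps):
        run_bound(*fields, offset_provider=offsets)
```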
### Fortran -> Python on GPU
First, all arrays have to be created on the device using `$acc enter data create(...)` statements, and the data then copied to the device using `$acc data copyin(...)`. Then we can execute the C function bound to the Fortran interface, which calls the Python function via CFFI. On the Python side, in order to create CuPy arrays from the passed pointer, the pointer must be a GPU device pointer and not a host pointer on the CPU. The pointer can then be used to create a CuPy array as follows:
```python
import cupy as cp

# ffi, ptr (a CFFI cdata device pointer), total_size (in bytes), device_id,
# sizes and dtype are assumed to be in scope.
ptr_val = int(ffi.cast("uintptr_t", ptr))
mem = cp.cuda.UnownedMemory(ptr_val, total_size, owner=ptr, device_id=device_id)
memptr = cp.cuda.MemoryPointer(mem, 0)
arr = cp.ndarray(shape=sizes, dtype=dtype, memptr=memptr, order="F")
```
#### Creating device pointers
##### Approach 1: C function to get `device_ptr`
Within the Fortran interface we can call a C function which returns the device pointer using the OpenACC C API.
```C
#include <openacc.h>
#include <stdio.h>
#include <stdint.h>

uintptr_t get_device_ptr_from_fortran_array(double *host_array) {
    void *device_ptr = acc_deviceptr(host_array);
    printf("Host pointer = %p\n", (void *)host_array);
    printf("Device pointer = %p\n", device_ptr);
    return (uintptr_t)device_ptr;
}
```
The C header then just takes a generic C pointer (`void *arg`) for each argument.
The Fortran interface then looks as follows:
```Fortran
module identity_plugin
   use, intrinsic :: iso_c_binding
   implicit none

   public :: identity

   interface
      function get_device_ptr_from_fortran_array(host_array) &
            bind(C, name="get_device_ptr_from_fortran_array") result(ptr)
         use, intrinsic :: iso_c_binding
         ! Assumes a contiguous block of memory seen as a 1D array from Fortran's perspective
         real(c_double), dimension(*), intent(in), target :: host_array
         type(c_ptr) :: ptr
      end function get_device_ptr_from_fortran_array

      function identity_wrapper(device_ptr, n_Cell, n_K) &
            bind(c, name="identity_wrapper") result(rc)
         import :: c_int, c_ptr
         type(c_ptr), value :: device_ptr
         integer(c_int), value :: n_Cell
         integer(c_int), value :: n_K
         integer(c_int) :: rc ! Stores the return code
      end function identity_wrapper
   end interface

contains

   subroutine identity(inp, rc)
      real(c_double), dimension(:, :), target :: inp
      integer(c_int) :: rc ! Stores the return code
      type(c_ptr) :: device_ptr
      integer(c_int) :: n_Cell
      integer(c_int) :: n_K

      n_Cell = SIZE(inp, 1)
      n_K = SIZE(inp, 2)

      ! Obtain the device pointer for the inp array
      device_ptr = get_device_ptr_from_fortran_array(inp)

      ! Pass the device pointer to the C function
      rc = identity_wrapper(device_ptr, n_Cell, n_K)
   end subroutine identity

end module identity_plugin
```
##### Approach 2: Use OpenACC `$acc host_data use_device(...)`
This directs the compiler to use the device address of each variable in the list, for instance when passing it to a procedure. This should let us circumvent the convoluted logic of [[#Approach 1 C function to get `device_ptr`]].
### Environment Variables
##### GPU
```
PYTHONUNBUFFERED=1;GT4PY_GPU=1;GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT;CUDACXX=/usr/local/cuda-12.3/bin/nvcc;GT4PY_BUILD_CACHE_DIR=/tmp/gt4py_cache
```
### Basic Timings
###### gtfn_cached_cpu
```
PYTHONUNBUFFERED=1;GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT;GT4PY_BUILD_CACHE_DIR=/tmp/gt4py_stencils_cpu
```
Without precompiled stencils:
```
Initialising diffusion...
Done running initialising diffusion.
Diffusion initialisation time: 18.66 seconds
Running diffusion...
Done running diffusion.
Diffusion run time: 58.12 seconds
```
With precompiled stencils:
```
Initialising diffusion...
Done running initialising diffusion.
Diffusion initialisation time: 3.02 seconds
Running diffusion...
Done running diffusion.
Diffusion run time: 35.01 seconds
```
###### gtfn_cached_gpu
```
PYTHONUNBUFFERED=1;GT4PY_GPU=1;GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT;CUDACXX=/usr/local/cuda-11.8/bin/nvcc;GT4PY_BUILD_CACHE_DIR=/tmp/gt4py_stencils_gpu/
```
Without precompiled stencils:
```
Initialising diffusion...
Done running initialising diffusion.
Diffusion initialisation time: 25.33 seconds
Running diffusion...
Done running diffusion.
Diffusion run time: 108.58 seconds
```
With precompiled stencils:
```
Initialising diffusion...
Done running initialising diffusion.
Diffusion initialisation time: 3.02 seconds
Running diffusion...
Done running diffusion.
Diffusion run time: 36.40 seconds
```
### Profiling
Using `nsys`:
```bash
/usr/local/bin/nsys profile -o profileOutput env CUDACXX=/usr/local/cuda-11.8/bin/nvcc GT4PY_BUILD_CACHE_DIR=/tmp/gt4py_stencils_gpu/ GT4PY_GPU=1 GT4PY_BUILD_CACHE_LIFETIME=PERSISTENT PYTHONUNBUFFERED=1 ICON_GRID_LOC=/home/sk/Dev/icon4py/testdata python /home/sk/Dev/icon4py/tools/tests/py2fgen/test_diffusion_wrapper.py
```
# Diffusion Granule Integration Steps
https://hackmd.io/lnbpB4CZSyemf5rQwdOYgw