# Fortran Serialization

###### tags: `cycle 14`

- Developers:
- Appetite: full cycle

## Goal

Provide a flexible framework for serializing Fortran execution to improve testing and performance optimizations. Currently, we see two cases where serialization would be useful:

- at the granule level (defined as standalone-usable Fortran components of ICON), which have a well-defined interface, but no serialization directives
- at the level where GT4Py stencils are integrated into ICON (via Liskov).

## Design

Introduce a shared representation from which serialization statements can be generated for different kinds of serialization libraries.

![](https://i.imgur.com/cW8apjG.png)

The shared representation can support different backends, starting with `Serialbox` and later others such as `NetCDF`.

```mermaid
flowchart LR
    subgraph Frontend
    A[serialise.py] -->|f2py + regex| C{{Serialisation Interface}}
    B[Liskov] -->|Start/EndStencil Directives| C
    end
    subgraph Backend
    C --> D[SerialisationGenerator]
    D --> E[CodegenWriter]
    E --> F[F90File]
    end
```

To be able to target the same interface we effectively need **two** additional **deserialisers**: in Liskov a `DeserialisedDirectives2SI` deserialiser, and in the granule CLI a `ParsedGranule2SI` deserialiser. Both produce a `SerialisationInterface` object, which is passed to a shared `SerialisationGenerator`. This produces a list of `GeneratedCode` objects, which is passed to the `SerialisationCodegenWriter`, which writes it to the F90 file.

## Steps

1. `pp_ser` directives
    - Refactor Christopher's script and add it to `pyutils`. Investigate use of `f2py` for parsing.
    - Define a serialisation interface to target generation of `pp_ser` directives.
    - Liskov already has all the information needed to support generation of serialisation code, except metadata for a savepoint (e.g. timestep). In a first iteration this can be ignored (limiting to only one savepoint).
    - Figure out a non-intrusive way to provide metadata for each stencil.
    - Write jinja templates for `pp_ser` directive generation as part of a new subpackage in `Liskov`.
    - Test build integration.
2. `serialbox` library calls
    - Investigate generation of serialbox library calls.
    - If feasible, add a new `serialbox` codegen backend to `Liskov`.
    - Test build integration.
3. `NetCDF` serialisation
    - Investigate the NetCDF data structure required to replicate the Serialbox `savepoint`.
    - If feasible, add a new `netcdf` codegen backend to `Liskov`.

## Frontends

In both cases we need to collect the information into the **Serialisation Interface** and determine the locations where the serialization statements should be generated.

### Liskov directives

Stencil information is generated from the Liskov directives plus GT4Py stencil information. Liskov directives in the code denote the start and end of a stencil.

- Line numbers of each StartStencil and EndStencil directive.
- The stencil name is used to import information about stencil input/output fields and dimensions.
- What key/value pairs should be included in the savepoint information? Potentially use a timestep counter in the backend?

### Granule interface

Parses Fortran code to extract subroutine names, field intent etc.

- Needs to also record the line numbers of where to generate code. Could use regex to find the end of the intent declarations and the end of the subroutine (see the sketch below).
- Could f2py be used to parse the Fortran code?
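A minimal sketch of how such a regex-based search could look; the helper below is hypothetical and not part of the existing tooling:

```python
# Hypothetical helper: locate where the input- and output-field serialisation
# statements could be inserted in a granule subroutine, i.e. after the last
# declaration and just before "end subroutine".
import re

DECL_RE = re.compile(r"^\s*\w+.*::", re.IGNORECASE)        # e.g. "real, intent(in) :: a"
END_SUB_RE = re.compile(r"^\s*end\s+subroutine\b", re.IGNORECASE)


def find_insertion_points(lines: list[str]) -> tuple[int, int]:
    """Return (index after the last declaration, index of 'end subroutine')."""
    last_decl, end_sub = 0, len(lines) - 1
    for i, line in enumerate(lines):
        if DECL_RE.match(line):
            last_decl = i
        elif END_SUB_RE.match(line):
            end_sub = i
            break
    return last_decl + 1, end_sub
```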
## Christopher/Will script

This Python (pre)serialization script, given an input Fortran 90 source file, produces a new, modified version of the file (leaving the original unchanged), with Serialbox serialization directives in the needed places.

The entire serialization procedure is based on two steps:

1. The pre-serialization, performed with the script discussed here, which adds the Serialbox serialization directives to the provided input source file. The output file is the input file for step 2. The script can be executed as follows: `./add_pp_ser_directives.py source.f90 -o pre_serialized_source.f90`
2. The generation of the Fortran source file making use of the Serialbox API, to be compiled and linked against the Serialbox library. This second step is based on the usage of `pp_ser.py`, which takes as input the file produced in step 1.

During the parsing of the input file, all the function/subroutine parameters are stored together with their intent (in, out and inout, plus the function name in case it corresponds to the returned parameter). The intents are used to place the serialization directives in the right place: right before any computation for intent(in) parameters, right before any return statement for intent(out) parameters, and in both places for intent(inout) parameters. Specific serialization names are used to identify at which point of the execution flow the serialization is performed. Special treatment is reserved for the `jg` input parameter: if present, it is considered to be a grid identifier and as such is used as part of the output folder name.

Open issues:

1. How to deal with time loops?
2. Composed types: Serialbox is currently not able to identify and serialize composed types (e.g. `t_cartesian_coordinate`). As a temporary solution, when such variables are found we decompose them by creating new basic variables, which are added to the Fortran source code and then serialized.
3. Arrays are serialized as scalar variables since Serialbox should be able to deal with them. We need to find a way to load them with the right indexing scheme. Could we perhaps serialize the indexing as metadata as well?

## f2py parsing example

Example subroutine:

```fortran!
subroutine mysubroutine(a, b, c)
    implicit none
    real, intent(in) :: a
    real, intent(inout) :: b
    real, intent(out) :: c
    c = a + b
    b = 2.0 * b
end subroutine mysubroutine
```

Parse using f2py:

```python
from numpy.f2py.crackfortran import crackfortran

parsed = crackfortran('example.f90')
```

yields:

```python
[
    {
        "block": "subroutine",
        "name": "mysubroutine",
        "from": "example.f90",
        "args": ["a", "b", "c"],
        "body": [],
        "externals": [],
        "interfaced": [],
        "vars": {
            "a": {"typespec": "real", "attrspec": [], "intent": ["in"]},
            "b": {"typespec": "real", "attrspec": [], "intent": ["inout"]},
            "c": {"typespec": "real", "attrspec": [], "intent": ["out"]},
        },
        "entry": {},
        "implicit": None,
        "sortvars": ["a", "b", "c"],
    }
]
```

## Serialization Information

Interface for `pp_ser` code generation.

```python!
class Metadata:
    key: str
    value: str

class SerialisationMetaInfo:
    directory_path: str

class SavepointInformation:
    ln: int
    name: str
    fields: list[FieldSerializationInformation]
    metadata: Optional[list[Metadata]]

class FieldSerializationInformation:
    var_name: str
    var_association: str
```

Generation targets (`pp_ser`):

```bash
!$ser init directory=<path> prefix=<prefix>
!$ser mode write
!$ser savepoint <name> <**metadata>
!$ser data <var_name>=<var_association>
```
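As a minimal sketch (assuming jinja2; the template and the rendering call below are illustrative, not the final Liskov/f2ser implementation), a `SavepointInformation`-like object could be rendered into these directives as follows:

```python
from jinja2 import Template

# Hypothetical template mirroring the generation targets above.
SAVEPOINT_TEMPLATE = Template(
    "!$ser savepoint {{ name }}"
    "{% for m in metadata %} {{ m.key }}={{ m.value }}{% endfor %}\n"
    "!$ser mode write\n"
    "{% for f in fields %}!$ser data {{ f.var_name }}={{ f.var_association }}\n{% endfor %}"
)

print(
    SAVEPOINT_TEMPLATE.render(
        name="in_fields",
        metadata=[{"key": "jg", "value": "jg"}],
        fields=[
            {"var_name": "vct_a", "var_association": "vct_a(:)"},
            {"var_name": "nrdmax", "var_association": "nrdmax(jg)"},
        ],
    )
)
# Prints:
# !$ser savepoint in_fields jg=jg
# !$ser mode write
# !$ser data vct_a=vct_a(:)
# !$ser data nrdmax=nrdmax(jg)
```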
Parsing and object construction pseudocode:

```python!
inp = SavepointInformation(name="in_fields", metadata=[Metadata(key="jg", value="jg")])
out = SavepointInformation(name="out_fields", metadata=[Metadata(key="jg", value="jg")])

for line in lines:
    if is_param(line):
        if is_intent_in_or_inout(line):
            inp.fields.append(get_info(line))
        if is_intent_out_or_inout(line):
            out.fields.append(get_info(line))
```

Code generation context:

```fortran!
subroutine mysubroutine(a, b, c)
    implicit none
    real, intent(in) :: a
    real, intent(inout) :: b
    real, intent(out) :: c
    <serialise_input_fields_here>
    c = a + b
    b = 2.0 * b
    <serialise_output_fields_here>
end subroutine mysubroutine
```

## Backend

Both frontends should target the same code generation backend.

- Make the inputs required for codegen use a common serialisation interface (`CodeGenInput`).
- Inputs are passed to a `SerialisationGenerator`, which generates the serialisation code.
- The generated code is passed to a `SerialisationWriter`, which writes it to the output file.

### pp_ser

The first goal is to support generating `pp_ser` directives to serialise fields. Special care needs to be taken in the build integration so that Liskov runs before `pp_ser`.

#### Directive examples

Serialbox provides `!$ser` preprocessor directives which, when parsed by [`pp_ser.py`](https://github.com/GridTools/serialbox/blob/master/src/serialbox-python/pp_ser/pp_ser.py), generate calls to the Serialbox Fortran interface, which in turn serialises data to disk. Required data points include:

- Field name
- Fortran field to serialise (e.g. `vct_a(:)`)
- Serialisation directory
- Prefix
- Savepoint metadata

**Serialisation examples**

Example 1:

```fortran!
!$ser verbatim CALL datetimeToString(eventStartDate, dt_string)
!$ser init directory='./ser_data' prefix='icon_pydycore' mpi_rank=get_my_mpi_work_id()
!$ser verbatim DO jg=1, n_dom
!$ser savepoint icon-grid nproma=nproma nlev=p_patch(jg)%nlev id=jg date=dstring nsteps=nsteps dtime=dtime n_dom=n_dom iforcing=iforcing lvert_nest=lvert_nest limited_area=l_limited_area
!$ser mode write
!$ser data vct_a=vct_a(:)
!$ser data vct_b=vct_b(:)
!$ser data nrdmax=nrdmax(jg)
```

Example 2:

```fortran!
MODULE m_ser
    IMPLICIT NONE

CONTAINS

    SUBROUTINE serialize(a)
        IMPLICIT NONE
        REAL(KIND=8), DIMENSION(:,:,:) :: a
        !$ser init directory='.' prefix='SerialboxTest'
        !$ser savepoint sp1
        !$ser mode write
        !$ser data ser_a=a
    END SUBROUTINE serialize

    SUBROUTINE deserialize(a)
        IMPLICIT NONE
        REAL(KIND=8), DIMENSION(:,:,:) :: a
        !$ser init directory='.' prefix='SerialboxTest-output' prefix_ref='SerialboxTest'
        !$ser savepoint sp1
        !$ser mode read
        !$ser data ser_a=a
        !$ser mode write
        !$ser data ser_a=a
    END SUBROUTINE deserialize

    SUBROUTINE deserialize_with_perturb(a)
        IMPLICIT NONE
        REAL(KIND=8), DIMENSION(:,:,:) :: a
        REAL(KIND=8) :: rprecision
        rprecision = 10.0**(-PRECISION(1.0))
        !$ser init directory='.' prefix='SerialboxTest-output' prefix_ref='SerialboxTest' rprecision=rprecision rperturb=1.0e-5_8
        !$ser savepoint sp1
        !$ser mode read-perturb
        !$ser data ser_a=a
    END SUBROUTINE deserialize_with_perturb

END MODULE m_ser
```

More examples can be seen at: https://github.com/C2SM/icon-exclaim/blob/strip_down_mo_nh_diffusion/src/atm_dyn_iconam/mo_nh_diffusion.f90

### optional: direct Serialbox library calls

The long-term solution is to generate Serialbox library calls directly, thereby circumventing `pp_ser`.
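Independently of whether the Fortran side uses `pp_ser` directives or direct library calls, the serialised data can be read back through Serialbox's Python bindings, e.g. for verification against GT4Py/icon4py results. A minimal sketch, where the directory, prefix and field name are assumptions matching the example above:

```python
import serialbox as ser

# Open the serialised data written by the generated directives (read-only).
serializer = ser.Serializer(ser.OpenModeKind.Read, "./ser_data", "icon_pydycore")

# Pick the first recorded savepoint and read a field as a NumPy array.
savepoint = serializer.savepoint_list()[0]
vct_a = serializer.read("vct_a", savepoint)
print(savepoint, vct_a.shape)
```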
### optional: research if NetCDF can be used for this use-case

To support NetCDF in the future, it must be investigated whether additional metadata can be stored in NetCDF files, effectively replicating the `Savepoint` functionality of Serialbox, which is used to discriminate fields at different points in time. See https://github.com/GridTools/serialbox/blob/master/src/serialbox-python/serialbox/savepoint.py for more information.

Pseudocode representation of a savepoint is as follows:

```python
{
    ("stencil1", "savepoint1"): {"vn": vn, "foo": bar},
    ("stencil1", "savepoint2"): {"vn": vn, "foo": bar},
}
```

#### Simple NetCDF Serialisation Example

For serialisation to NetCDF the following information is needed:

- Array dimensions (a name for each dimension)
- Name of the array to serialise

```fortran!
program serialize_to_netcdf
    use netcdf
    implicit none

    integer, parameter :: nx = 10, ny = 5
    integer :: status, ncid, varid
    integer :: x_dimid, y_dimid
    integer :: data(nx,ny)
    integer :: i, j

    ! initialize data
    do i = 1, nx
        do j = 1, ny
            data(i,j) = (i-1)*ny + j
        end do
    end do

    ! create NetCDF file
    status = nf90_create("data.nc", NF90_CLOBBER, ncid)
    if (status /= NF90_NOERR) then
        write(*,*) "Error creating NetCDF file"
        stop
    end if

    ! define dimensions
    status = nf90_def_dim(ncid, "x", nx, x_dimid)
    if (status /= NF90_NOERR) then
        write(*,*) "Error defining dimension 'x'"
        stop
    end if
    status = nf90_def_dim(ncid, "y", ny, y_dimid)
    if (status /= NF90_NOERR) then
        write(*,*) "Error defining dimension 'y'"
        stop
    end if

    ! define variable
    status = nf90_def_var(ncid, "data", NF90_INT, (/x_dimid, y_dimid/), varid)
    if (status /= NF90_NOERR) then
        write(*,*) "Error defining variable 'data'"
        stop
    end if

    ! end definition mode
    status = nf90_enddef(ncid)
    if (status /= NF90_NOERR) then
        write(*,*) "Error ending definition mode"
        stop
    end if

    ! write data
    status = nf90_put_var(ncid, varid, data)
    if (status /= NF90_NOERR) then
        write(*,*) "Error writing variable 'data'"
        stop
    end if

    ! close file
    status = nf90_close(ncid)
    if (status /= NF90_NOERR) then
        write(*,*) "Error closing NetCDF file"
        stop
    end if

    write(*,*) "Successfully serialized Fortran array to NetCDF file 'data.nc'"

end program serialize_to_netcdf
```

## Rabbit holes

- For serialization Liskov can use GT4Py stencil information; no additional directives will be added (except potentially for metadata).

## No-gos

- Do not extend Liskov's parser to also parse Fortran code, as this would further increase the complexity (and reduce the maintainability) of Liskov, which is meant to be a simple code generation tool.

## Current Status

- `f2ser` command line script done. Parses an F90 granule and generates the corresponding `pp_ser` statements.
- Serialisation statement generation is now an option in `icon_liskov`.
- Can serialise `mo_velocity_advection`, `mo_nh_diffusion`.
- Tested integration into the ICON build system.

## Todo

- Support serialisation for `mo_solve_nonhydro` and other stencils defined in subroutines such as `mo_int_rbf`.
- Code cleanup.
- Unit/integration testing.