# ICON4Py - IO system
Notes: Magdalena
## Requirements
| nr | title | priority | description |
| --- |:------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| #1 | CF conventions | 1 | All output follows the [CF conventions](#CF) (where applicable conventions exist). Together with CF-compliant data files we provide a UGRID-convention-compliant grid file when the output is unstructured. |
| #2 | Output Format | 1 | In the first version, output is written as NetCDF files. |
| #3 | Remapping | 3 | Output might be remapped to a structured lat/lon grid. *(Feedback (LK, BG): for some analyses it is essential to have the unstructured grid as well as the original model levels.)* |
| #4 | Resampling | 3 | Output might be resampled to a lower resolution grid. |
| #5 | Parallel writes | 2 | Parallel writing should be possible (netcdf4-python provides this functionality when built against an MPI-enabled netCDF library, together with mpi4py). |
| #6 | Configuration | 1 | Output is configurable: filename, time resolution, variable names, remapping, regridding, vertical level type. |
| #7 | Configuration per field or group of fields | 1 | Which data fields are output is configurable. The user specifies: field name (following CF conventions), output interval (in terms of n [min, hrs, days]), vertical level type (pressure or model level). |
| #8 | Height levels, 3D fields | 3 | 3D fields can be output on configurable pressure levels instead of model height levels. *(Feedback CF, BG: can be done in post-processing; for certain analyses it is essential to have the model levels, pressure levels are mostly used for model intercomparison.)* |
| #9 | User-defined fields | 3 | Users can specify their own ad hoc fields (diagnostics). |
| #10 | IO nodes | 2 | Spawn I/O into its own process in order not to block the model run. |
| #11 | Derived output, means | 2 | Some processed variables are better computed during the model run, e.g. mean values over time steps. |
| #12 | Standard deviation | 2 | Together with mean values it would be valuable to also get the standard deviation or confidence intervals. |
#### priorities
1: for prototype,
2: necessary enhancement,
3: optional (for example, can be done in the post-processing chain or during analysis).
## Further explanations
### 3: Remapping, regridding
Many analysis algorithms are defined on a structured grid: it is easier for users to formulate and apply them if the data is structured (cf. CR). However, there might be applications that require the global unstructured grid.
- If working with unstructured data becomes as easy as working with structured data (e.g. through `uxarray`), this feature may no longer be needed.
- **Should** remapping be a feature of the analysis workflow?
- When outputting unstructured data, provide a [UGRID-conventions grid file](#UGRID) with it.
- According to users (LK, BG): on the other hand, there are certain analyses that cannot be done on the remapped grid.

Regridding/remapping can always be done during analysis.
### 6/7: Configuration
Configuration should be per field, or per group of fields:
- time interval
- vertical level types (pressure or model level)
- filename
Such a configuration can apply to a:
- list of variables
- preconfigured variable group
- combination of the above
```yaml
fields: [var1, var2, var_group_x]
```
will result in output that contains the set `{x | x ∈ var_group_x ∪ {var1, var2}}`.
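These merge semantics can be sketched as follows (the group name and the `VARIABLE_GROUPS` registry are hypothetical, purely for illustration):

```python
# Sketch of resolving a configured `fields` entry into the set of output
# variables. The registry below is illustrative, not a real icon4py API.
VARIABLE_GROUPS = {
    "var_group_x": {"var2", "var3", "var4"},
}


def resolve_fields(entries: list[str]) -> set[str]:
    """Expand group names and merge them with plain variable names into one set."""
    resolved: set[str] = set()
    for entry in entries:
        # A group name expands to its members; a plain name stands for itself.
        resolved |= VARIABLE_GROUPS.get(entry, {entry})
    return resolved


resolve_fields(["var1", "var2", "var_group_x"])
# → {"var1", "var2", "var3", "var4"}
```

Duplicates between the explicit list and the group members collapse naturally because the result is a set.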
#### preconfigured variables groups
There should be pre-configured variable groups (for example, all variables needed for precipitation analysis). One important group is: all variables needed to initialize a different model run
(this is unlike restart files, which need the same configuration and resolution).
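A per-group configuration combining requirements #6 and #7 could look like the following sketch (all keys, group names, and filename patterns are assumptions, not a fixed schema):

```yaml
output:
  - filename: "precip_{time}.nc"          # hypothetical filename pattern
    output_interval: "30 min"
    levels: model                          # or: pressure
    fields: [precipitation_flux, var_group_precip]
  - filename: "initial_state_{time}.nc"
    output_interval: "6 hrs"
    levels: model
    fields: [var_group_init]               # group: all fields needed to initialize another run
```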
### 8: Height levels, 3d fields
Output on k-levels (vertical model levels in `m`) is used and essential for certain kinds of analysis. Transformation to pressure levels is mostly used for model intercomparison and can be done in post-processing.
### 10: I/O nodes
Does it make sense to run I/O on dedicated MPI nodes? Doesn't this create a lot of unnecessary communication? Would it not make more sense to do I/O on the same compute node in a separate process (on the CPU)? The I/O component would need to get the data at the appropriate time snapshot and then do the writing and transforming in a process of its own.
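A minimal sketch of this hand-off, assuming a POSIX platform (fork start method) and using a plain text file as a stand-in for a real NetCDF writer:

```python
import multiprocessing as mp


def io_worker(queue, path: str) -> None:
    # Consume (time, state) snapshots until a None sentinel arrives.
    # A real implementation would write NetCDF instead of plain text.
    with open(path, "w") as f:
        while True:
            item = queue.get()
            if item is None:
                break
            time, state = item
            f.write(f"{time}: {sorted(state)}\n")


def run_model_with_io(path: str, n_steps: int = 3) -> None:
    ctx = mp.get_context("fork")  # assumption: POSIX platform
    queue = ctx.Queue()
    worker = ctx.Process(target=io_worker, args=(queue, path))
    worker.start()
    for step in range(n_steps):
        # The model loop hands off a snapshot and continues computing;
        # the writing happens concurrently in the worker process.
        queue.put((f"step-{step}", {"ta": step, "pr": 0.1 * step}))
    queue.put(None)  # sentinel: no more data
    worker.join()


run_model_with_io("io_demo.txt")
```

The queue decouples the model loop from the (potentially slow) write; only the cost of serializing the snapshot is paid on the model side.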
## Prototype:
The prototype should roughly implement the requirements with priority 1.
## Architecture:
Follow general structure of [component interfaces](#icon4py-arch).
```mermaid
classDiagram
class Monitor
<<Abstract>> Monitor
Monitor : + store(model_state, model_time)
class IOConfig
class FieldGroupIOConfig
class NETCDFWriter
class IOMonitor{
+ __init__(IOConfig config)
+ register(FieldGroupMonitor monitor)
- List~FieldGroupMonitor~ monitors
}
class FieldGroupMonitor{
+ __init__(FieldGroupIOConfig config)
+ store(model_state, model_time)
- NETCDFWriter writer
}
IOMonitor *-- IOConfig
    IOMonitor "1" *-- "n" FieldGroupMonitor
FieldGroupMonitor *-- FieldGroupIOConfig
FieldGroupMonitor *-- NETCDFWriter
Monitor <|-- IOMonitor
Monitor <|-- FieldGroupMonitor
    note for Monitor "Component that takes the model state and `saves` it \n to whatever external resource it uses. \n The name is chosen following sympl."
```
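The class diagram above could be sketched in Python as follows (constructor signatures, config fields, and the in-memory stand-in for the `NETCDFWriter` are placeholders, not the actual interface):

```python
import abc
from dataclasses import dataclass


class Monitor(abc.ABC):
    """Component that takes the model state and saves it to an external resource."""

    @abc.abstractmethod
    def store(self, model_state: dict, model_time: str) -> None:
        ...


@dataclass
class IOConfig:
    output_path: str = "."  # placeholder field


@dataclass
class FieldGroupIOConfig:
    filename: str
    fields: list[str]


class FieldGroupMonitor(Monitor):
    def __init__(self, config: FieldGroupIOConfig):
        self._config = config
        self._records: list[tuple[str, dict]] = []  # stand-in for a NETCDFWriter

    def store(self, model_state: dict, model_time: str) -> None:
        # Pick out only the configured fields and hand them to the writer.
        selected = {k: v for k, v in model_state.items() if k in self._config.fields}
        self._records.append((model_time, selected))


class IOMonitor(Monitor):
    def __init__(self, config: IOConfig):
        self._config = config
        self._monitors: list[FieldGroupMonitor] = []

    def register(self, monitor: FieldGroupMonitor) -> None:
        self._monitors.append(monitor)

    def store(self, model_state: dict, model_time: str) -> None:
        # Fan the state out to all registered field-group monitors.
        for monitor in self._monitors:
            monitor.store(model_state, model_time)
```

The composition follows the diagram: `IOMonitor` owns one `IOConfig` and n `FieldGroupMonitor`s, each of which owns its own `FieldGroupIOConfig` and writer.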
## Questions:
### restart files?
- are they written per node?
- fixed set of data from which everything can be reconstructed? What data are those?
### Output Formats assessments
| | NetCDF | Grib2 |
|:-------------- |:-------------------------------------- |:------------------------------------------------------------------------------ |
| usage | climate, research community | weather |
| size | | higher compression, smaller size |
| speed | allows parallel writing | |
| python support | direct (netcdf4-python), used by xarray under the hood | |
| adaptability   | | datasets can be concatenated together; this is what `fdb` does under the hood |
## Further user inputs
### problems with output in ICON
- different versions of ICON create different output: some variables are named differently between ICON-NWP and the MPI version of ICON
- it is not always clear whether the metadata contained in the model output files is sufficient
- It is common that a variable is forgotten in the namelist configuration and the run needs to be redone in order to add it. How can we improve here? Some models output everything and you remove what you do not want; maybe there should be a preconfigured set to which you add specific variables.
- the list of possible output variables is not always up to date. There are variables that are specified but just output 0 (might be a GPU port problem?)
- ICON model tuning is mostly done via **monitoring variables**; these parameters do not work in the GPU / icon-dsl version, e.g. `g_mean_tas`(?). They involve some global means. (cf. LK)
- it would be useful to compute not only the mean of variables but also the standard deviation.
- it should be possible to add custom variables to the output: users' own diagnostics, tendencies (see req. [9](#9))
- There is also **meteogram output**, which has a very different structure from regular ICON NetCDF output files. It is mostly used for the local model or high-resolution runs (BG uses it for vertical profiles at a station, for example).
## resources
- [#CDO](https://code.mpimet.mpg.de/projects/cdo/wiki/Cdo#Documentation)
- [#CF Conventions](https://cfconventions.org/)
- [#UGRID conventions](http://ugrid-conventions.github.io/ugrid-conventions/)
- [#netcdf](https://docs.unidata.ucar.edu/netcdf-c/current/accessing_subsets.html)
- [#netcdf4-python](https://unidata.github.io/netcdf4-python/#writing-data-to-and-retrieving-data-from-a-netcdf-variable)
- [#xarray](https://docs.xarray.dev/en/latest/index.html)
- [#uxarray](https://uxarray.readthedocs.io/en/latest/)
- [#iris](https://scitools-iris.readthedocs.io/en/latest/index.html)
- [#cftime](https://unidata.github.io/cftime/api.html#cftime.datetime)
- [#icon4py-arch](https://hackmd.io/TR5MM8n3TQqbGtyBmQ9tFQ)