XORCA / NEMO CF Design

# XORCA / NEMO CF Design ## What is this about? Nemo output is coming with all kinds of traps like coordinate and dimension names that have different meanings depending on the context / grid they are used on etc. Here, we try to define a way of making typical NEMO output self-describing. If done right, this will allow for using [Xarray](http://xarray.pydata.org/) for tracking complete model runs as one dataset and enable automatic analysis of the [C-grid structure](https://en.wikipedia.org/wiki/Arakawa_grids) with [XGCM](http://xgcm.readthedocs.io/). ## Things we need to clarify - With partial cells, `depth{t,u,v,w}` are not really indicative of the true depth of the cell or face centres. To be fully consistent, we'd have to go for a 3d (maybe 4d) variable `gdep{t,u,v,w}` which contains the actual depth and use `z_{l,c}=0,1,2,...` as dimensional coord similar to `{x,y}_{c,r}`. The following does, however, use one-dimensional `depth_{l,c}` for now. - `newmask_glo` probably is not a community wide standard. - What do time-dependent grids look like? Are all the coord also time dependent? How does Xarray handle coords that contain as much info as the actual data variables? - Collect (small!) example data sets? ## Target structure ### Naming of the dims and grid arrangement The grid is arranged as follows. We use - subscript `_c` for center positions (associated with the tracer or `T` grid) - subscript `_r` for right (shifted in positive direction) positions - subscript `_l` for left (shifted in negative direction) positions ``` | | | j+1 y_r - F - V - F - V - F - - W - depth_l k-1 | | | j+1 y_c U T U T U T depth_c k-1 | | | j y_r - F - V - F - V - F - - W - depth_l k | | | j y_c U T U T U T depth_c k | | | j-1 y_r - F - V - F - V - F - - W - depth_l k+1 | | | j-1 y_c U T U T U T depth_c k+1 | | | x_r x_c x_r x_c x_r i-1 i i i+1 i+1 ``` ### Final datasets We want the final datasets to be structured as follows. Important features: - Dimensional coordinates have speaking names (`time`, `depth`, ...) and subscripts indicating their position on the C grid. - There is dimensions `x_{c,r}`, and `y_{c,r}` which do not have coordinates associated. - Aux variables (coordinates in Xarray's data model) are everything that describes the grid, the prescribed parts of the model configuration (topography, masks, grid constants, ...). - Data variables are associated with the correct coordinates because of matching dimensions in a way the all dimensions of a coordinate need to be dimensions of a variable in order for the coordinate to belong to the variable. (Note that without distinguishing the dims by subscripts, we would not have a way of making sure which coord belongs / does not belong to which data variable.) - The direction of the z axis is positive downward. This is a strict requirement (which is easy to test / ensure automatically) introduced by the configuration and naming of the vertical dims indicated above. - We do not retain any of the `nav_{lev,lon,lat}` coordinates. This is because in order for the data to be fully self describing, we cannot have any ambiguous names. In NEMO output data, however, `nav_lon` and `nav_lat` mean different things on different grids. Instead, we use `g{phi,lam}{t,u,v,f}`. - We also do not retain any of the `depth{u,v,w,t}`, because this would hide the fact, that all of `depth{t,u,v}` are identical and the only levels that differ are the `depthw`. The following shows how <https://doi.org/10.5281/zenodo.3634490> would look: ``` <xarray.Dataset> Dimensions: (time: 72, depth_c: 31, depth_l: y_c: 22, y_r: 22, 31 x_c: 32, x_r: 32) Coordinates: * time (time) object ... * depth_l (depth_l) float32 ... * depth_c (depth_c) float32 ... glamt (y_c, x_c) float32 ... gphit (y_c, x_c) float32 ... glamu (y_c, x_r) float32 ... gphiu (y_c, x_r) float32 ... glamv (y_r, x_c) float32 ... gphiv (y_r, x_c) float32 ... glamf (y_r, x_r) float32 ... gphif (y_r, x_r) float32 ... tmask (depth_c, y_c, x_c) bool ... umask (depth_c, y_c, x_r) bool ... vmask (depth_c, y_r, x_c) bool ... fmask (depth_c, y_r, x_r) bool ... tmaskutil (y_c, x_c) bool ... umaskutil (y_c, x_r) bool ... vmaskutil (y_r, x_c) bool ... fmaskutil (y_r, x_r) bool ... e1t (y_c, x_c) float64 ... e1u (y_c, x_r) float64 ... e1v (y_r, x_c) float64 ... e1f (y_r, x_r) float64 ... e2t (y_c, x_c) float64 ... e2u (y_c, x_r) float64 ... e2v (y_r, x_c) float64 ... e2f (y_r, x_r) float64 ... ff (y_r, x_r) float64 ... mbathy (y_c, x_c) int16 ... misf (y_c, x_c) int16 ... isfdraft (y_c, x_c) float32 ... e3t_1d (depth_c) float64 ... e3w_1d (depth_l) float64 ... Dimensions without coordinates: x_c, x_r, y_c, y_r Data variables: votemper (time, depth_c, y_c, x_c) float32 ... vosaline (time, depth_c, y_c, x_c) float32 ... sosstsst (time, y_c, x_c) float32 ... sosaline (time, y_c, x_c) float32 ... sossheig (time, y_c, x_c) float32 ... sowaflup (time, y_c, x_c) float32 ... sorunoff (time, y_c, x_c) float32 ... sosfldow (time, y_c, x_c) float32 ... sosst_cd (time, y_c, x_c) float32 ... sosss_cd (time, y_c, x_c) float32 ... sohefldo (time, y_c, x_c) float32 ... soshfldo (time, y_c, x_c) float32 ... somixhgt (time, y_c, x_c) float32 ... somxl010 (time, y_c, x_c) float32 ... soicecov (time, y_c, x_c) float32 ... sowindsp (time, y_c, x_c) float32 ... sohefldp (time, y_c, x_c) float32 ... sowafldp (time, y_c, x_c) float32 ... sosafldp (time, y_c, x_c) float32 ... sobowlin (y_c, x_c) float32 ... vozocrtx (time, depth_c, y_c, x_r) float32 ... sozotaux (time, y_c, x_r) float32 ... vomecrty (time, depth_c, y_r, x_c) float32 ... sometauy (time, y_r, x_c) float32 ... vovecrtz (time, depth_l, y_c, x_c) float32 ... votkeavt (time, depth_l, y_c, x_c) float32 ... votkeavm (time, depth_l, y_c, x_c) float32 ... ``` `depth_l` and `depth_c` are always positive downward. ### Attributes The CF conventions require all data variables and all coordinates to have all attributes necessary to render their description fully CF compliant. We need to prescribe all attributes for the coordinates, because these are not given in the typical `mesh_mask.nc` files. Examples: ```python # coriolis paramter ff.attrs = { "long_name": "Coriolis parameter on F-grid", "units": "s-1", "coordinates": "glamf gphif", "nemo_hgrid": "t", } # longitude fields glamt.attrs = { "long_name": f"longitude of T-grid points", "units": "degrees_east", "standard_name": "longitude", "nemo_hgrid": "t", } # [...] ``` ### Trimming and folding Typical global NEMO output contains 1 or 2 rows of redundant information in the North, and 2 columns of redundant information in the East. From a usability point of view, we want to remove the redundant data so that, e.g., global avarages can be calculated without accidentially including some information twice. NEMO brings "utility" masks `{t,u,v,w}maskutil` that null all redundant data. Using these in a `where` method which drops nulled data can, however, lead to large volumes loaded into memory for some cases and some version of Xarray. On the global tri-polar ORCA grid, there is different conventions on how to fold the northernmost (in `(i,j)` space) line. To my knowledge, there is no mechanism built into NEMO or NEMO data to indicate these conventions in a machine-readable way. I'm not sure how we can solve trimming and folding / topology without the need of users explicitly stating what they know about redundant data. We could also load small chunks of data along the edges and try to detect the different conventions to trigger index-based trimming which would not result in the described memory problems. ## Strategy for enforcing these conventions ### Things to keep in mind - _**We do not want to hard code more than abosolutely necessary.**_ As the `mesh_mask.nc` comes completely un-annotated, we need to have full specification of attrs etc. for all the aux variables / coordinates. But as there is many different conventions about naming the physical or biological fields and as it is easy to add more fields, a design that relies on all properties of the physical fields being hard coded would be hard to maintain, difficult to pick up for new users and error prone. - We need to retain full lazyness and other Dask features in everything we do. This is a hard problem that can only be solved (approximately) by testing for lazyness with many different test cases. ### Strategy for constructing CF compliant datasets of NEMO data _(**NOTE** that the following shows that it's difficult to clearly separate where CF conformity starts and where XGCM / Comodo compatibility ends.)_ - Open un-decoded (!) datasets. - Completely annotate aux dataset (`mesh_mask`, ... and, if present, `newmask_glo` or other mask files) - CF attributes (`"standard name"`, `"long_name"`, `"unit"`, ...) - `"coordinates"` attribute (e.g., `"gdept_1d gphit glamt"`) - `"nemo_hgrid"` (one of `["t", "u", "v", "f"]`) and `"nemo_vgrid"` (one of `["t", "w"]`) - If there's a singleton time dim / coord in the aux datasets, completely remove them (`.squeeze(drop=True)`). - Detect horizontal and vertical grids for all data variables of the other fields - based on dims and by comparing the `"depth?"` and `"nav_lat"`, `"nav_lon"` fields in the same dataset to the fields from the aux dataset - store in attrs `"nemo_hgrid"` and `"nemo_vgrid"` - Drop coordinates `"nav_lat"`, `"nav_lon"` etc. from the data variables. - Rename dims (per variable) to `"x_t"`, ..., `"y_t"`, ..., `"z_t"`, ... and consolidate dim names to `"x_c"`, `"x_r"`, `"y_c"`, ... - Fix calendars? (There is, e.g., some `calender="360d"` which should be called `"360day"`.) - Merge everything together. - Decode `xr.decode_cf` - Rename dims / coords - gdep?_1d --> depth_c, depth_l - time_counter --> time - Apply masks (using the ?maskutil fields) and drop columns. (Not sure this is possible for general grids? At least it would need complete loading of the surface slice of each mask.) - Generate xgcm grid