---
title: 'Esmlab Design Document'
---
## Table of Contents
[TOC]
## Goal
Generic tools for working with gridded earth-system products using xarray and dask.
## Introduction
esm-lab could provide a suite of tools for analysing gridded earth-system products that do not depend on the provenance of said products. Anderson has already implemented some useful data-informed tools for computing various statistics (weighted averages, dealing with time appropriately...). Dougie/Thomas have been compiling generic codes for computing various atmospheric and ocean "diagnostics" (fields or indices that require further postprocessing of earth-system products). I (Dougie) personally think these codebases go hand in hand, since a typical scientist's workflow is to take some data, compute some diagnostics, and then run some statistics on those diagnostics. Dougie would happily dedicate time to refactoring doppyo code into esm-lab if that is decided to be the way forward. However, there are some design decisions to be made before this can begin.
### Related Work
- [climpred](https://github.com/bradyrx/climpred): an xarray wrapper for analysis of ensemble forecast models for climate prediction.
- [xskillscore](https://github.com/raybellwaves/xskillscore): provides verification metrics of deterministic (and probabilistic from properscoring) forecasts with xarray.
- [doppyo](https://github.com/csiro-dcfp/doppyo): a diagnostics/verification package for climate forecast systems.
- [Atmospheric Community Toolkit](https://github.com/ANL-DIGR/ACT): Python toolkit for working with atmospheric time-series datasets of varying dimensions. The toolkit is meant to have functions for every part of the scientific process; discovery, IO, quality control, corrections, retrievals, visualization, and analysis.
- [esmtools](https://github.com/bradyrx/esmtools/): a toolbox for Earth System Model analysis.
- [xclim](https://github.com/Ouranosinc/xclim): a library of functions to compute climate indices.
## Design Approach
### Core Functionality
#### Time-aware operations
*Note: This section is work in progress*
Currently, esmlab relies on a time boundary variable (`time_bounds`, `time_bnds`, etc.) for operations such as `compute_annual_mean()` and `compute_mon_mean()`. This variable is used to generate the weights applied during these operations.
For example, if we had the following CESM POP timeseries file with monthly data:
```python
In [22]: import esmlab
In [23]: ds = esmlab.datasets.open_dataset('cesm_pop_monthly', decode_times=False)
In [24]: keep_vars = ["NO3", "time_bound", "TAREA", "REGION_MASK", "TLAT", "TLONG"]
In [25]: ds = ds.drop([v for v in ds.variables if v not in keep_vars and v not in ds.dims])
In [26]: ds.attrs = {}
In [27]: ds
Out[27]:
<xarray.Dataset>
Dimensions: (d2: 2, lat_aux_grid: 395, moc_z: 61, nlat: 6, nlon: 6, time: 24, z_t: 6, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
* z_t (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0
* z_t_150m (z_t_150m) float32 500.0 1500.0 2500.0 ... 13500.0 14500.0
* z_w (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_top (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_bot (z_w_bot) float32 1000.0 2000.0 3000.0 ... 525000.94 549999.06
* lat_aux_grid (lat_aux_grid) float32 -79.48815 -78.952896 ... 89.47441 90.0
* moc_z (moc_z) float32 0.0 1000.0 2000.0 ... 525000.94 549999.06
TLONG (nlat, nlon) float64 ...
TLAT (nlat, nlon) float64 ...
* time (time) float64 1.825e+05 1.826e+05 ... 1.832e+05 1.832e+05
Dimensions without coordinates: d2, nlat, nlon
Data variables:
REGION_MASK (nlat, nlon) float64 ...
TAREA (nlat, nlon) float64 ...
time_bound (time, d2) float64 ...
NO3 (time, z_t, nlat, nlon) float32 ...
```
```python
In [28]: ds.time_bound
Out[28]:
<xarray.DataArray 'time_bound' (time: 24, d2: 2)>
array([[182500., 182531.],
[182531., 182559.],
[182559., 182590.],
[182590., 182620.],
[182620., 182651.],
[182651., 182681.],
[182681., 182712.],
[182712., 182743.],
[182743., 182773.],
[182773., 182804.],
[182804., 182834.],
[182834., 182865.],
[182865., 182896.],
[182896., 182924.],
[182924., 182955.],
[182955., 182985.],
[182985., 183016.],
[183016., 183046.],
[183046., 183077.],
[183077., 183108.],
[183108., 183138.],
[183138., 183169.],
[183169., 183199.],
[183199., 183230.]])
```
By computing the discrete difference of the `time_bound` variable along the `d2` dimension, we get the number of days in each month. These day counts can then be used to generate the proper weights for time-weighted operations, for example computing annual means from monthly data:
```python
In [33]: print(ds.time_bound.diff(dim="d2")[:, 0])
<xarray.DataArray 'time_bound' (time: 24)>
array([31., 28., 31., 30., 31., 30., 31., 31., 30., 31., 30., 31., 31.,
28., 31., 30., 31., 30., 31., 31., 30., 31., 30., 31.])
Coordinates:
* time (time) float64 1.825e+05 1.826e+05 ... 1.832e+05 1.832e+05
Attributes:
long_name: boundaries for time-averaging interval
units: days since 0000-01-01 00:00:00
```
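To make this concrete, here is a minimal sketch (not esmlab's actual implementation) of how those day counts can weight an annual mean. It assumes `ds` is the monthly dataset opened above with `decode_times=False`, so `time_bound` is a plain float array of days:
```python
import numpy as np
import xarray as xr

# Days in each month, derived from the time bounds as above.
dt = ds.time_bound.diff(dim="d2").squeeze()

# Label each of the 24 monthly records with a year index (24 months -> 2 years).
year = xr.DataArray(
    np.repeat(np.arange(ds.dims["time"] // 12), 12), dims="time", name="year"
)

# Weighted annual mean: sum(x * dt) / sum(dt) within each year.
# (Handling of missing values is ignored here for brevity.)
no3_ann = (ds.NO3 * dt).groupby(year).sum("time") / dt.groupby(year).sum("time")
```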
#### Resampling
For most of the time-aware operations, esmlab relies on xarray's `groupby()` in conjunction with `resample()` objects to change the frequency of the input time series. This approach has proven hard to work with for the following reasons:
- Because xarray is not aware of the time boundary variable, translating data between temporal intervals requires **hacky** workarounds in esmlab to keep track of the time boundary variable at the new intervals (a sketch of such a workaround follows this list).
For instance, notice how xarray discards the `time_bound` variable when we downsample our time series from monthly to yearly:
```python
In [38]: dset = xr.decode_cf(ds)
In [39]: dset
Out[39]:
<xarray.Dataset>
Dimensions: (d2: 2, lat_aux_grid: 395, moc_z: 61, nlat: 6, nlon: 6, time: 24, z_t: 6, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
* z_t (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0
* z_t_150m (z_t_150m) float32 500.0 1500.0 2500.0 ... 13500.0 14500.0
* z_w (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_top (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_bot (z_w_bot) float32 1000.0 2000.0 3000.0 ... 525000.94 549999.06
* lat_aux_grid (lat_aux_grid) float32 -79.48815 -78.952896 ... 89.47441 90.0
* moc_z (moc_z) float32 0.0 1000.0 2000.0 ... 525000.94 549999.06
TLONG (nlat, nlon) float64 ...
TLAT (nlat, nlon) float64 ...
* time (time) object 0500-02-01 00:00:00 ... 0502-01-01 00:00:00
Dimensions without coordinates: d2, nlat, nlon
Data variables:
REGION_MASK (nlat, nlon) float64 ...
TAREA (nlat, nlon) float64 ...
time_bound (time, d2) object ...
NO3 (time, z_t, nlat, nlon) float32 ...
In [40]: dset.resample(time="A").mean()  # Incorrect results: the months are not weighted by their lengths.
/Users/abanihi/devel/pangeo/xarray/xarray/core/nanops.py:140: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis=axis, dtype=dtype)
Out[40]:
<xarray.Dataset>
Dimensions: (lat_aux_grid: 395, moc_z: 61, nlat: 6, nlon: 6, time: 3, z_t: 6, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
* time (time) object 0500-12-31 00:00:00 ... 0502-12-31 00:00:00
TLONG (nlat, nlon) float64 320.6 321.7 322.8 ... 323.9 325.1 326.2
TLAT (nlat, nlon) float64 -79.22 -79.22 -79.22 ... -76.55 -76.55
* z_w_top (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* lat_aux_grid (lat_aux_grid) float32 -79.48815 -78.952896 ... 89.47441 90.0
* z_w (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_bot (z_w_bot) float32 1000.0 2000.0 3000.0 ... 525000.94 549999.06
* z_t (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0
* z_t_150m (z_t_150m) float32 500.0 1500.0 2500.0 ... 13500.0 14500.0
* moc_z (moc_z) float32 0.0 1000.0 2000.0 ... 525000.94 549999.06
Dimensions without coordinates: nlat, nlon
Data variables:
REGION_MASK (time, nlat, nlon) float64 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
TAREA (time, nlat, nlon) float64 1.125e+13 1.125e+13 ... 1.728e+13
NO3 (time, z_t, nlat, nlon) float32 nan nan ... 22.182762
```
In addition to losing the `time_bound` variable, we may get wrong results whenever the contributions to the new time coordinate need to be weighted properly.
Ideally, xarray would keep the `time_bound` variable and generate new values for it accordingly. We could then take advantage of the bounds to compute weights for weighted operations along the time axis.
- Groupby can be computationally expensive: groupby uses a split-apply-combine paradigm to transform the data. The apply and combine steps can be lazy, but the split step cannot: xarray uses the group variable to determine how to index the array, i.e. which items belong in which group.
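As a concrete illustration of the first point, the workaround esmlab has to resort to looks roughly like the following: since `resample()` drops `time_bound`, the bounds are rebuilt by hand at the new yearly intervals. This is illustrative only, not esmlab's exact code; `dset` is the decoded dataset from the example above.
```python
import xarray as xr

# Rebuild time_bound at yearly intervals: take the first lower bound and the
# last upper bound within each year.
lower = dset.time_bound.isel(d2=0).groupby("time.year").first()
upper = dset.time_bound.isel(d2=1).groupby("time.year").last()
annual_bounds = xr.concat([lower, upper], dim="d2").transpose("year", "d2")
```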
### TimeAxis: General Resample Utility
*Note: This section is still work in progress*
In [NCAR/esmlab#142](https://github.com/NCAR/esmlab/issues/142), Matt Long proposed a more general resample capability that would enable fluid translation between data at different temporal intervals. Moe has started working on this utility in https://github.com/coderepocenter/AxisUtilities. The fundamental concept here is a `TimeAxis` object. `TimeAxis`'s role includes:
- **Generating** a new `to` time coordinate when given information of the `from` time coordinate.
- **Remapping** data from the `from` time coordinate to the `to` time coordinate.
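The remapping role can be pictured as a conservative, overlap-weighted average from one set of time intervals onto another. The sketch below is purely conceptual; AxisUtilities' actual API may differ:
```python
import numpy as np

def remap_mean(data, from_bounds, to_bounds):
    """Overlap-weighted mean; data: (n_from,), *_bounds: (n, 2) edges in days."""
    out = np.empty(len(to_bounds))
    for i, (t0, t1) in enumerate(to_bounds):
        # Length of overlap between each source interval and the target interval.
        overlap = np.clip(
            np.minimum(from_bounds[:, 1], t1) - np.maximum(from_bounds[:, 0], t0),
            0.0,
            None,
        )
        out[i] = (data * overlap).sum() / overlap.sum()
    return out

# Two months (31 and 28 days) remapped onto one 59-day interval:
remap_mean(np.array([1.0, 2.0]), np.array([[0, 31], [31, 59]]), np.array([[0, 59]]))
# -> array([1.47457627])  (= (1*31 + 2*28) / 59)
```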
*Note*: This `TimeAxis` object is analogous to ESMF's regridder. Once `TimeAxis` is implemented, esmlab will interface with it via a new object, `TimeConverter()` or `TimeTransformer()` (what this object will be called is not decided yet):
```python
# Hypothetical interface -- neither the name nor the API is settled yet.
from esmlab import TimeConverter
import xarray as xr

ds = xr.open_dataset(fname)  # any dataset carrying a time-bounds variable
tc = TimeConverter(ds)
```
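One possible shape for the conversion step, purely as a strawman (the method name `convert` and the `freq` argument are placeholders, not a committed API):
```python
# Strawman only: `convert` and `freq` are placeholder names.
ds_annual = tc.convert(freq="A")  # monthly -> yearly, weighted by the time bounds
ds_annual.time_bound              # bounds would be regenerated at the new intervals
```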
_Anderson, can you please write a short overview of the design of esm-lab as it currently sits?_ (will do very soon!)
Some technical questions we have had:
- should esm-lab define a new class type, or simply expose methods that ingest and egest xarray objects that are expected to adhere to some minimum requirements? (Both options are sketched below.)
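For illustration, the two options might look like this (the `anomaly` computation is a hypothetical placeholder; `xr.register_dataset_accessor` is xarray's standard extension mechanism):
```python
import xarray as xr

# Option 1: plain functions that ingest and egest xarray objects.
def anomaly(ds, dim="time"):
    """Hypothetical example: deviation from the mean along `dim`."""
    return ds - ds.mean(dim)

# Option 2: a registered accessor, so methods hang off xarray objects directly.
@xr.register_dataset_accessor("esmlab")
class EsmlabAccessor:
    def __init__(self, ds):
        self._ds = ds

    def anomaly(self, dim="time"):
        return self._ds - self._ds.mean(dim)

# Usage would then be `anomaly(ds)` vs. `ds.esmlab.anomaly()`.
```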
## Application Programming Interface (API)
## Appendix and FAQ
:::info
**TODO**!
:::
###### tags: `ncar` `pangeo`