# Xarray-datatree backend design doc
**Prototype Implementation Branch**: https://github.com/jthielen/xarray/tree/datatree-backend
## Goals
Xarray will soon include the `DataTree` class, a new data structure that will represent hierarchical data as a series of nested "groups". This data structure maps closely to HDF5, netCDF4, Zarr, and other data formats. Xarray should include an `open_datatree` function that efficiently creates `DataTree` objects.
This design doc lays out the basic design for integrating `open_datatree` with Xarray's backends.
## Non-goals
- Update all backends to support datatree (this work should enable that, but individual backends will be updated one-by-one later)
## Design
### Backend API (General)
**Guiding Principle:** Add an `open_datatree` method to `xarray.backends.BackendEntrypoint` that instantiates and returns a `DataTree`
1. Take the existing `BackendEntrypoint.open_dataset` method as a starting point (with similar [prototype implementation details as in the existing docs](https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html#open-dataset))
1. Rely on the backend to handle the following (through its `open_datatree` method):
    - opening the file
    - decoding variables
    - assembling a `DataTree` out of the contained variables
    - setting a close method
    - returning the `DataTree`
1. Implement the same args & kwargs as `open_dataset` (`filename_or_obj`, `drop_variables` (Q: how to handle variables of the same name in different groups?), `mask_and_scale`, `decode_times`, `use_cftime`, `concat_characters`, `decode_coords`)
1. Backend implementations should expose variables with data as `numpy.ndarray` or a lazy-loading `BackendArray` subclass
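The open question about `drop_variables` above could be resolved by accepting path-qualified names. A minimal sketch of that idea (the `filter_group_variables` helper and the `{group_path: {var_name: variable}}` layout are assumptions for illustration, not part of the proposal):

```python
# Hypothetical sketch: interpret ``drop_variables`` entries as
# path-qualified names ("group1/temp") so that variables with the same
# name in different groups can be targeted independently; a bare name
# ("temp") drops the variable from every group.

def filter_group_variables(group_vars, drop_variables=None):
    """Filter a ``{group_path: {var_name: variable}}`` mapping."""
    drop = set(drop_variables or [])
    result = {}
    for path, variables in group_vars.items():
        kept = {}
        for name, value in variables.items():
            # Qualify the name with its group path ("" means the root group)
            qualified = f"{path}/{name}" if path else name
            if name in drop or qualified in drop:
                continue
            kept[name] = value
        result[path] = kept
    return result
```

Whether `drop_variables` should match bare names everywhere, qualified names only, or both is itself part of the open question.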
The following (reimplemented from the existing `open_dataset` example) illustrates the high-level processing steps:
```python
def open_datatree(
    self,
    filename_or_obj,
    *,
    drop_variables=None,
    decode_times=True,
    decode_timedelta=True,
    decode_coords=True,
    my_backend_option=None,
):
    import datatree

    vars, attrs, coords = my_reader(
        filename_or_obj,
        drop_variables=drop_variables,
        my_backend_option=my_backend_option,
    )
    vars, attrs, coords = my_decode_variables(
        vars, attrs, decode_times, decode_timedelta, decode_coords
    )  # see also conventions.decode_cf_variables
    datasets = my_assemble_groups(vars, attrs, coords)
    dt = datatree.DataTree.from_dict(datasets)
    dt.set_close(my_close_method)
    return dt
```
*(Note that this is only one approach a backend could take; alternatively, the children/parent nodes could be constructed iteratively through `DataTree.__init__`)*
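The iterative alternative can be sketched with a minimal stand-in node class (`Node` here is a hypothetical placeholder to show the construction order; the real `DataTree.__init__` signature may differ):

```python
# Minimal stand-in for DataTree, used only to illustrate building the
# hierarchy node-by-node instead of via ``DataTree.from_dict``.
class Node:
    def __init__(self, name, data=None, parent=None):
        self.name = name
        self.data = data
        self.children = {}
        if parent is not None:
            parent.children[name] = self


def build_tree_iteratively(datasets):
    """Build a tree from a ``{"/path/to/group": data}`` mapping."""
    root = Node("", data=datasets.get("/"))
    for path, data in datasets.items():
        if path == "/":
            continue
        node = root
        for part in path.strip("/").split("/"):
            # Create intermediate nodes on demand, then descend
            if part not in node.children:
                Node(part, parent=node)
            node = node.children[part]
        node.data = data
    return root
```

Because intermediate nodes are created on demand, the mapping can list group paths in any order.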
**Points of uncertainty:**
1. For backends that choose not to (or cannot) support `DataTree`, how should this be signalled (e.g., by raising `NotImplementedError`)?
2. `BackendEntrypoint` presently has `open_dataset_parameters`. Should this apply to `DataTree` identically, or should there be a separate `open_datatree_parameters` method?
- more generally, what should (or should not) be handled in common in the Backend API between Dataset- and DataTree-based functionality?
- Tom: If it always makes sense to think of opening a tree as opening many datasets, then I think this is fine...
3. Should we always open the full hierarchy, or should there be some limit (e.g. `depth=3`)?
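For point 1, one option (an assumption here, not a settled design) is for the base class to provide a default `open_datatree` that raises `NotImplementedError`, which callers could catch to produce a clearer error:

```python
# Hypothetical sketch of the base-class default; backends that do not
# override ``open_datatree`` would raise with an informative message.
class BackendEntrypointSketch:
    def open_dataset(self, filename_or_obj, **kwargs):
        raise NotImplementedError

    def open_datatree(self, filename_or_obj, **kwargs):
        raise NotImplementedError(
            f"{type(self).__name__} does not support open_datatree"
        )
```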
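For point 3, a depth limit could be applied when assembling the `{path: dataset}` mapping, before it is handed to `DataTree.from_dict`. A sketch (the `depth` keyword and `prune_groups` helper are hypothetical, not an agreed API):

```python
# Hypothetical ``depth`` handling: keep only groups at most ``depth``
# levels below the root (depth=None means open the full hierarchy).
def prune_groups(datasets, depth=None):
    if depth is None:
        return dict(datasets)
    return {
        path: ds
        for path, ds in datasets.items()
        if len([p for p in path.strip("/").split("/") if p]) <= depth
    }
```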
### Core Backends (Specific, e.g., NetCDF, Zarr)
...
### External Backends (Specific, e.g., cfgrib, xradar)
Tom: Let's try and make a start on the cfgrib backend too?