Xarray-datatree backend design doc

Prototype Implementation Branch: https://github.com/jthielen/xarray/tree/datatree-backend

Goals

Xarray will soon include the DataTree class, a new data structure that will represent hierarchical data as a series of nested "groups". This data structure maps closely to HDF5, netCDF4, Zarr, and other data formats. Xarray should include an open_datatree function that efficiently creates DataTree objects.

This design doc lays out the basic design for the integration of open_datatree with Xarray's backends.

Non-goals

  • Update all backends to support datatree (this should be enabled by this work, but then implemented one-by-one later)

Design

Backend API (General)

Guiding Principle: Add an open_datatree method to xarray.backends.BackendEntrypoint that instantiates and returns a DataTree

  1. Take the existing BackendEntrypoint.open_dataset method as the starting point (with prototype implementation details similar to those in the existing docs)
  2. Rely on the backend (through its open_datatree method) to handle:
    • opening the file
    • decoding variables
    • assembling the DataTree out of the contained variables
    • setting a close method
    • returning the DataTree
  3. Implement the same args & kwargs as open_dataset (filename_or_obj, drop_variables, mask_and_scale, decode_times, use_cftime, concat_characters, decode_coords). Q: how should drop_variables handle variables of the same name in different groups?
  4. Backend implementations should provide variables whose data is either a numpy.ndarray or a lazy-loading BackendArray subclass
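
One possible answer to the drop_variables question in item 3 can be sketched as follows. This is only an illustration, not a decided design: the helper name should_drop and the convention itself (an entry starting with "/" names one variable in one group, while a bare name matches in every group) are assumptions introduced here.

```python
def should_drop(group_path, var_name, drop_variables):
    """Decide whether a variable is dropped (hypothetical convention).

    An entry starting with "/" is treated as a full path and matches a single
    variable in a single group. Any other entry is a bare name and matches
    that name in every group.
    """
    if not drop_variables:
        return False
    if isinstance(drop_variables, str):
        # Accept a single name as well as an iterable of names.
        drop_variables = [drop_variables]

    # "/" + "temp" -> "/temp"; "/a/b" + "temp" -> "/a/b/temp"
    full_path = f"{group_path.rstrip('/')}/{var_name}"
    for entry in drop_variables:
        if entry.startswith("/"):
            if entry == full_path:
                return True
        elif entry == var_name:
            return True
    return False


# A bare name drops "temp" everywhere; a path drops it in one group only.
print(should_drop("/a", "temp", ["temp"]))
print(should_drop("/b", "temp", ["/a/temp"]))
```

Under this convention existing open_dataset calls keep their meaning, since bare names behave as they do today, and path-qualified names are only needed when groups collide.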

The following (adapted from the existing open_dataset example) illustrates the high-level processing steps:

def open_datatree(
    self,
    filename_or_obj,
    *,
    drop_variables=None,
    decode_times=True,
    decode_timedelta=True,
    decode_coords=True,
    my_backend_option=None,
):
    import datatree

    # Read the raw variables, attributes, and coordinates from the file
    variables, attrs, coords = my_reader(
        filename_or_obj,
        drop_variables=drop_variables,
        my_backend_option=my_backend_option,
    )
    # Decode them (see also conventions.decode_cf_variables)
    variables, attrs, coords = my_decode_variables(
        variables, attrs, decode_times, decode_timedelta, decode_coords
    )
    # Group the decoded contents into datasets, then build the tree
    datasets = my_assemble_groups(variables, attrs, coords)
    dt = datatree.DataTree.from_dict(datasets)
    dt.set_close(my_close_method)

    return dt

(Note that this is only one approach a backend could take; alternatively, the child/parent nodes could be constructed iteratively through DataTree.__init__)
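
The my_assemble_groups step in the example is left to each backend. A minimal sketch of the grouping logic, assuming the reader yields variables keyed by full path (for instance "/model/temp"), might look like the code below. The function name assemble_groups and the path-keyed input are assumptions for illustration, and plain dicts stand in for the Dataset objects a real backend would build before calling DataTree.from_dict.

```python
from collections import defaultdict


def assemble_groups(variables):
    """Group flat variable paths by their parent group path.

    `variables` maps full paths such as "/model/temp" to variable objects.
    Returns a dict mapping group path -> {variable name: variable}, i.e. the
    path-keyed layout that a from_dict style constructor would consume.
    """
    groups = defaultdict(dict)
    for full_path, var in variables.items():
        # Split "/model/temp" into parent "/model" and name "temp".
        parent, _, name = full_path.rpartition("/")
        # Variables directly under the root have an empty parent.
        groups[parent or "/"][name] = var
    return dict(groups)


flat = {
    "/lat": [10.0, 20.0],
    "/model/temp": [280.0, 285.0],
    "/model/run1/temp": [281.0, 286.0],
}
groups = assemble_groups(flat)
print(sorted(groups))
```

Keying by path keeps two variables named temp in different groups distinct, which a flat name-keyed mapping could not do.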

Points of uncertainty:

  1. For backends that choose not to (or cannot) support DataTree, how to handle NotImplemented?
  2. BackendEntrypoint presently has open_dataset_parameters. Should this apply to DataTree identically, or should there be a separate open_datatree_parameters method?
    • more generally, what should (or should not) be handled in common in the Backend API between Dataset- and DataTree-based functionality?
    • Tom: If it always makes sense to think of opening a tree as opening many datasets, then I think this is fine
  3. Should we always open the full hierarchy, or should there be some limit (e.g. depth=3)?
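
To make points 1 and 3 more concrete, here is a rough sketch of one option for each. None of it is settled: BackendEntrypointSketch is a stand-in class rather than xarray's actual BackendEntrypoint, and supports_datatree, within_depth, and the max_depth keyword are hypothetical names. The sketch assumes that an unsupported backend simply inherits a default open_datatree that raises NotImplementedError, and that a depth limit counts group levels below the root.

```python
class BackendEntrypointSketch:
    """Stand-in for a backend entrypoint base class (not xarray's own class)."""

    def open_dataset(self, filename_or_obj, **kwargs):
        raise NotImplementedError

    def open_datatree(self, filename_or_obj, **kwargs):
        # Default: a backend that does not override this has no tree support.
        raise NotImplementedError


def supports_datatree(backend):
    """True if the backend's class overrides the default open_datatree."""
    return type(backend).open_datatree is not BackendEntrypointSketch.open_datatree


def within_depth(group_path, max_depth=None):
    """True if group_path lies at most max_depth levels below the root."""
    if max_depth is None:
        return True
    depth = len([part for part in group_path.split("/") if part])
    return depth <= max_depth


class DatasetOnlyBackend(BackendEntrypointSketch):
    def open_dataset(self, filename_or_obj, **kwargs):
        return {}


class TreeBackend(BackendEntrypointSketch):
    def open_datatree(self, filename_or_obj, *, max_depth=None, **kwargs):
        # Pretend the file contains these groups; keep only shallow ones.
        paths = ["/", "/a", "/a/b", "/a/b/c"]
        return [p for p in paths if within_depth(p, max_depth)]


print(supports_datatree(DatasetOnlyBackend()))
print(TreeBackend().open_datatree("example.nc", max_depth=1))
```

Checking for an override up front lets a caller report a clear error, or skip the backend, before any file is opened.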

Core Backends (Specific, e.g., NetCDF, Zarr)

External Backends (Specific, e.g., cfgrib, xradar)

Tom: Let's try and make a start on the cfgrib backend too?
