Xarray-datatree backend design doc

Prototype Implementation Branch: https://github.com/jthielen/xarray/tree/datatree-backend

Goals

Xarray will soon include the DataTree class, a new data structure that will represent hierarchical data as a series of nested "groups". This data structure maps closely to HDF5, netCDF4, Zarr, and other data formats. Xarray should include an open_datatree function that efficiently creates DataTree objects.

This design doc lays out the basic design for the integration of open_datatree with Xarray's backends.

Non-goals

Updating every backend to support DataTree. This work should enable that, but each backend will be updated individually later.

Design

Backend API (General)

Guiding Principle: Add an open_datatree method to xarray.backends.BackendEntrypoint that instantiates and returns a DataTree

Take the existing BackendEntrypoint.open_dataset method as the starting point (with prototype implementation details similar to those in the existing docs)
Rely on the backend to handle, through its open_datatree method:
- opening the file
- decoding variables
- assembling the DataTree from the contained variables
- setting a close method
- returning the DataTree
Implement the same args and kwargs as open_dataset (filename_or_obj, drop_variables (Q: how should variables with the same name in different groups be handled?), mask_and_scale, decode_times, use_cftime, concat_characters, decode_coords)
Backend implementations should provide variables whose data is a numpy.ndarray or a lazy-loading BackendArray subclass
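
One possible answer to the drop_variables question above is sketched below. It assumes a convention that this doc has not decided on: an entry containing "/" is treated as a full path and matches a single variable in a single group, while a bare name matches that variable in every group. The helper name should_drop is hypothetical and is not part of xarray.

```python
def should_drop(group_path, var_name, drop_variables):
    """Return True if the variable var_name in group_path should be dropped.

    Assumed convention (not decided in this doc): an entry containing "/"
    is a full path to one variable; a bare name applies to all groups.
    """
    if not drop_variables:
        return False
    # Build the variable's full path, e.g. "/forecast/time" or "/time".
    full_path = f"{group_path.rstrip('/')}/{var_name}"
    for entry in drop_variables:
        if "/" in entry:
            # Path-qualified entry: must match this exact variable.
            if "/" + entry.strip("/") == full_path:
                return True
        elif entry == var_name:
            # Bare name: matches the variable in any group.
            return True
    return False


# Usage: drop "time" only in the /forecast group.
print(should_drop("/forecast", "time", ["/forecast/time"]))
print(should_drop("/analysis", "time", ["/forecast/time"]))
```

With this convention, existing calls that pass bare names keep their current meaning, and path-qualified entries give a way to target one group.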

The following example, adapted from the existing open_dataset example, shows the high-level processing steps:

def open_datatree(
    self,
    filename_or_obj,
    *,
    drop_variables=None,
    decode_times=True,
    decode_timedelta=True,
    decode_coords=True,
    my_backend_option=None,
):
    import datatree

    vars, attrs, coords = my_reader(
        filename_or_obj,
        drop_variables=drop_variables,
        my_backend_option=my_backend_option,
    )
    vars, attrs, coords = my_decode_variables(
        vars, attrs, decode_times, decode_timedelta, decode_coords
    )  #  see also conventions.decode_cf_variables
    datasets = my_assemble_groups(vars, attrs, coords)
    dt = datatree.DataTree.from_dict(datasets)
    dt.set_close(my_close_method)

    return dt

(Note that this is only one approach a backend could take; alternatively, the parent and child nodes could be constructed iteratively through DataTree.__init__)
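
As a rough illustration of the my_assemble_groups step in the example above, the sketch below groups a flat mapping of variable paths into one dictionary per group. It uses plain dicts only. The flat "/group/variable" path layout and the function name assemble_groups are assumptions made for illustration. In a real backend, each inner dict would presumably become a Dataset before being passed to DataTree.from_dict.

```python
import posixpath


def assemble_groups(flat_vars):
    """Group variables by their parent path.

    flat_vars maps a full path such as "/forecast/temperature" to a
    variable object. Returns a dict mapping each group path to a dict of
    {variable name: variable}. The root group "/" is always present.
    """
    groups = {"/": {}}
    for full_path, var in flat_vars.items():
        # Normalize so every path has exactly one leading slash.
        normalized = "/" + full_path.strip("/")
        group_path, name = posixpath.split(normalized)
        groups.setdefault(group_path, {})[name] = var
    return groups


# Usage with placeholder values standing in for decoded variables.
flat = {
    "/time": "root time",
    "/forecast/temperature": "forecast temperature",
    "/forecast/time": "forecast time",
}
for group_path, members in assemble_groups(flat).items():
    print(group_path, sorted(members))
```

Keying the result by group path keeps variables with the same name in different groups separate, since the name only needs to be unique within its group.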

Points of uncertainty:

For backends that choose not to (or cannot) support DataTree, how should the NotImplemented case be handled?
BackendEntrypoint presently has open_dataset_parameters. Should this apply to DataTree identically, or should there be a separate open_datatree_parameters method?
- more generally, what should (or should not) be handled in common in the Backend API between Dataset- and DataTree-based functionality?
- Tom: If it always makes sense to think of opening a tree as opening many datasets, then I think this is fine…
Should we always open the full hierarchy, or should there be some limit (e.g. depth=3)?
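
To make the first and last questions above more concrete, here is a minimal sketch of one option for each. It is not a proposal for the final API. SketchBackendEntrypoint is a hypothetical stand-in for xarray.backends.BackendEntrypoint, and supports_datatree, limit_depth, and the max_depth parameter are assumed names. The sketch assumes that the base class raises NotImplementedError by default and that the depth limit is applied to group paths before any data is read.

```python
class SketchBackendEntrypoint:
    """Hypothetical stand-in for xarray.backends.BackendEntrypoint."""

    def open_datatree(self, filename_or_obj, **kwargs):
        # Default behavior: backends that do not override this method
        # are treated as not supporting DataTree.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement open_datatree"
        )


class FlatOnlyBackend(SketchBackendEntrypoint):
    """A backend that does not support trees (inherits the default)."""


class TreeAwareBackend(SketchBackendEntrypoint):
    """A backend that overrides open_datatree (placeholder return value)."""

    def open_datatree(self, filename_or_obj, **kwargs):
        return {"/": {}}


def supports_datatree(backend):
    """True if the backend's class overrides the default open_datatree."""
    default = SketchBackendEntrypoint.open_datatree
    return type(backend).open_datatree is not default


def group_depth(path):
    """Depth of a group path: "/" is 0, "/a" is 1, "/a/b" is 2."""
    return len([part for part in path.split("/") if part])


def limit_depth(group_paths, max_depth=None):
    """Keep only groups at or above max_depth; None keeps everything."""
    if max_depth is None:
        return list(group_paths)
    return [p for p in group_paths if group_depth(p) <= max_depth]


print(supports_datatree(FlatOnlyBackend()))
print(supports_datatree(TreeAwareBackend()))
print(limit_depth(["/", "/a", "/a/b", "/a/b/c"], max_depth=1))
```

Checking for an override up front would let open_datatree report an unsupported backend before any file is opened, at the cost of a second code path alongside the NotImplementedError itself.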

Core Backends (Specific, e.g., NetCDF, Zarr)

…

External Backends (Specific, e.g., cfgrib, xradar)

Tom: Let's try and make a start on the cfgrib backend too?
