# Weekly Xarray-DataTree design meeting
[Zoom link](https://us02web.zoom.us/j/87503265754?pwd=cEFJMzFqdTFaS3BMdkx4UkNZRk1QZz09)
[Meetings issue (#8747)](https://github.com/pydata/xarray/issues/8747) - includes list of design questions
[Tracking issue (#8572)](https://github.com/pydata/xarray/issues/8572) - includes checklist of what's been done so far
## Jul 30, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Tom Nicholas / @TomNicholas
- Stephan Hoyer / @shoyer
- Eni Awowale / @eni-awowale
- Owen Littlejohns / @owenlittlejohns
### 60 Second Updates.
- Matt: Almost completed the update for [Doc PR](https://github.com/pydata/xarray/pull/9033)
- Tom:
- Looked at fixing several bugs
- https://github.com/pydata/xarray/issues/9285
- https://github.com/pydata/xarray/issues/9196
- https://github.com/pydata/xarray/pull/9292
- Eni:
- PR [#9243](https://github.com/pydata/xarray/pull/9243)
### Agenda
## Jul 23, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Stephan Hoyer / @shoyer
- Eni Awowale / @eni-awowale
- Alfonso Ladino Rincon / @aladinor
- Etienne Schalk / @eschalkargans
- Tom Nicholas / @TomNicholas
### 60 Second Updates.
- Tom:
- Was at SciPy then PTO
- Matt: Still nothing. Looking at Eni's draft [PR #9243](https://github.com/pydata/xarray/pull/9243/files)
- Etienne: convert datatree to dict [PR #9080](https://github.com/pydata/xarray/pull/9080) (note: with coordinate inheritance, inherited coords are duplicated; disadvantage: denormalization of data; advantage: self-sufficient leaf groups)
- Eni: Back from SciPy and PTO; working on draft PR #9243
- Will add tests to new file
### Agenda
- SciPy report
- We should move old issues
- Best to do manually as then a human will check
- Eni has issue with openDAP for trees
- Latest [tasks](https://github.com/pydata/xarray/issues/8572#issuecomment-2218020742) to get datatree released and original set [#8572](https://github.com/pydata/xarray/issues/8572)
## Jul 16, 2024
### Attendees
- Matt Savoie / @flamingbear
- Stephan Hoyer / @shoyer
- Justus Magin / @keewis
- Alfonso Ladino / @aladinor
### 60 Second Updates.
- Matt has barely even been following issues.
### Agenda
- Not much but Alfonso had two PRs to discuss
- Options for credentials for S3 when opening Zarr stores: https://github.com/pydata/xarray/pull/9198/files
- Restores backend kwargs that were removed (addresses [#9135](https://github.com/pydata/xarray/issues/9135)): https://github.com/pydata/xarray/pull/9199/files
- Early adjournment
## Jul 9, 2024
### Attendees
- Justus Magin / @keewis
- Stephan Hoyer
- Tom Nicholas / @TomNicholas
### Agenda
- checklist for releasing datatree
- https://github.com/pydata/xarray/issues/8572#issuecomment-2218020742
## Jul 2, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Owen Littlejohns / @owenlittlejohns
- Stephan Hoyer
- Alfonso Ladino / @aladinor
### 60 second updates
- Tom
- Reviewed coordinate inheritance PR properly
- Matt
- Also reviewed the inheritance PR; understood most of it.
- Owen
- Also reviewed PR 9063 (inheritance)
- Stephan
- Inheritance PR
- Alfonso
- Got both PRs (keywords and benchmarks) ready.
- https://github.com/pydata/xarray/pull/9158
- https://github.com/pydata/xarray/pull/9199
### Agenda
- Are we happy to merge Stephan's PR?
- Outstanding Q's?
- A couple of other things to merge
- Constructor parent not mutating
- What does that unblock?
- Release schedule
- release
- what's required
- docs PR
- open_as_dict_of_datasets
- blog
## Jun 25, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Owen Littlejohns / @owenlittlejohns
- Stephan Hoyer
### 60 second updates
- Matt: Reviewed / following the inherited coordinate PR [#9063](https://github.com/pydata/xarray/pull/9063/files)
- Tom: Also reviewed the PR
- Owen: Also partially reviewed Stephan's PR [#9063](https://github.com/pydata/xarray/pull/9063/files)
### Agenda
- Benchmark for open_datatree: https://github.com/pydata/xarray/pull/9158
- Probably should close the files
- DataTree should be a context manager (like how you can already do `with open_dataset(path) as ds:`)
- raise an issue for this!
- Backend kwargs are not forwarded: https://github.com/pydata/xarray/issues/9135
- Review of coordinate inheritance PR [#9063](https://github.com/pydata/xarray/pull/9063/files)
- Tom: Main question is what should the internal structure be?
- DataTree repr: https://github.com/pydata/xarray/pull/9064
- SciPy talk
- Practice talk for NASA 2nd July 12pm EDT
- Everyone welcome on teams (https://teams.microsoft.com/l/meetup-join/19%3ameeting_NDc3ZWRiOGUtOTdhNS00ZDkyLWI2ZGQ[…]2c%22Oid%22%3a%2275a4b9ac-327c-4e32-9aeb-1eab36528186%22%7d)
- Tom and Eni will give half of talk each
- Tom on general datatree idea, Eni on NASA's use case
- Top-level functions like `xr.concat` accepting DataTree objects?
- https://github.com/pydata/xarray/issues/9106
## Jun 18, 2024
### Attendees
- Matt Savoie / @flamingbear
- Tom Nicholas
- Eni Awowale / @eni-awowale
- Owen Littlejohns / @owenlittlejohns
- Alfonso Ladino Rincon
### 60 second updates
- Trying hard to wrap my head around the current discussion [#9077](https://github.com/pydata/xarray/issues/9077)
### Agenda
- Inherited coordinates -- allow overrides or not?
- The case for forbidding overrides
- If non-alignment is allowed, we would need a way to tell update/setitem methods whether or not we want them to check alignment in this particular case
- Alignment will have to be checked between variables on the same node anyway
- Discuss #9077 some more?
- Particularly this `open_as_dict_of_datasets` idea
- Could even point to this function from within the alignment failure in `open_datatree`
- Is the value in having `open_datatree` work on everything or having some xarray function work on everything?
- Optional vs forbidden overriding of dimensions in child nodes
- How much feedback do we actually need from the community?
- Mapping top-level functions like concat over trees https://github.com/pydata/xarray/issues/9106
- Eni's SciPy talk?
## Jun 11, 2024
### Attendees
- Matt Savoie / @flamingbear
- Eni Awowale / @eni-awowale
- Owen Littlejohns / @owenlittlejohns
- Tom Nicholas
- Justus Magin / @keewis
### 60 second updates
- Matt - Following discussions at most.
- Tom - Mostly just following other people's issues / PRs
- Justus - nothing datatree-related, but I'll try releasing numpy 2 later today
- Eni - dropped a bug report #9093 about segmentation faults with `open_datatree()`
### Agenda
- Let's merge some things?
- open_datatree speedup PR
- Matt will add commits to remove unneeded kwargs, then we can merge
- Tom reply to Etienne's PR about to_dict
- Owen self-merge common.py PR
- Coordinate inheritance issue
- Stephan summarized it nicely
- We should use his description to ask around
- Pangeo discourse
- Twitter
- ESDIS metadata manager people?
- Point out on issue
- that one can still open invalid files using group/root kwarg
- becomes hard to list the groups in a file
- New function?:
- `list_groups`
- `open_datasets_dict`
- Numpy release status?
- basically done, one PR missing
- will release today or tomorrow morning
## Jun 4, 2024
### Attendees
- Matt Savoie / @flamingbear
- Owen Littlejohns / @owenlittlejohns
- Justus Magin / @keewis
- Eni Awowale / @eni-awowale
- Tom Nicholas
### 60 second updates
- Matt - have only read [proposal](https://github.com/pydata/xarray/issues/9056#) and PRs.
- Owen - have open PRs for migration https://github.com/pydata/xarray/issues/9011, https://github.com/pydata/xarray/issues/9033 (latter probably needs to wait for numpy 2.0 support)
- Stephan - sketch of hierarchical coordinates: https://github.com/pydata/xarray/pull/9063
- Tom
- Also messed with hierarchical coordinates: https://github.com/pydata/xarray/pull/9065/files
### Agenda
- Owen's TreeAttrAccessMixin PR
- Decision to not worry about slots/dict stuff too much and move forward
- Alfonso's [open_datatree PR](https://github.com/pydata/xarray/pull/9014)
- Review
- Stephan's hierarchical coordinates PR
## May 28, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Eni Awowale / @eni-awowale
- Tom Nicholas
- Stephan Hoyer
### 60 second updates
- Matt - still nothing.
### Agenda
- decision on variable inheritance:
- should we change behavior now? Or should we have a separate API instead?
- Way to defer the decision?
- Proposal
- Keep `.ds`, `__getitem__` as-is
- Define "compatible variables" for inheritance
- Same-named dimensions have to be the same
- Alignable
- (Compare with what it says in the CF conventions)
- Additional API which allows access to inherited variables
- dt.ds will never give access to inherited vars
- But dt.inherited.ds would allow `__getitem__` access to inherited vars
- `dt.inherited[...].ds`?
- `dt.inherited.to_dataset()` -> xr.Dataset containing inherited vars
- Don't change `map_over_subtree` (again for backwards compatibility)
- `map_over_inherited_subtree` isolates the concept of mapping over a tree with inherited variables
- issues: e.g. map over and see the same variable multiple times (in its "local" group and in all its child groups)
- Explicit API for propagating / shallow-copying variables to child nodes?
- dt.inherit()? -> DataTree
- Either way: this will be a new feature, to be done in a separate release (i.e. no blocker right now)
## May 21, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Owen Littlejohns / @owenlittlejohns
- Eni Awowale / @eni-awowale
- Tom Nicholas
### 60 sec updates.
- Matt: Reviewed Alfonso's open_datatree PR. No ticket work.
- Owen: Submitted PR for documentation and exposing DataTree in public API (https://github.com/pydata/xarray/pull/9033)
### Agenda
- Announcements
- Write a blog post
- Doesn't need to be long
- https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114
## May 14, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Tom Nicholas
- Alfonso Ladino
- Owen Littlejohns / @owenlittlejohns
- Stephan Hoyer
- Eni Awowale
### 60 sec updates.
- Matt slacking on other work and time off.
- Owen responding to feedback for [PR](https://github.com/pydata/xarray/pull/9011) migrating `io.py` and `common.py`
- Tom prepping for virtualizarr talk tomorrow
### Agenda
- Alfonso's `open_datatree` performance PR
- https://github.com/pydata/xarray/pull/9014
- Coordinate inheritance discussion
- Implementation isn't that hard, difficulty is clear model and behaviour, especially wrt mapping
- Need to keep the Dataset invariant that all shared dims on one group have the same length
- Option (1): Explicit API separation of group with inherited variables
- e.g. dt.inherited.ds
- The check:
`xarray.align(*[node.ds, node.parent.ds, node.parent.parent.ds, ...], join='exact')`
- Tom to make an issue to write out thoughts/options
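
A minimal sketch of the idea behind the check above, in plain Python rather than xarray. The `Node` class and the `check_inherited_dims` helper are hypothetical names used only for illustration. The real check would go through `xarray.align(..., join='exact')`, which also compares index values, whereas this sketch only compares dimension lengths.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a tree node: only dimension sizes are tracked."""

    name: str
    dims: Dict[str, int] = field(default_factory=dict)
    parent: Optional["Node"] = None


def check_inherited_dims(node: Node) -> None:
    """Raise ValueError if a same-named dimension differs in length on any ancestor."""
    ancestor = node.parent
    while ancestor is not None:
        for dim, size in node.dims.items():
            if dim in ancestor.dims and ancestor.dims[dim] != size:
                raise ValueError(
                    f"dimension {dim!r} has length {size} on {node.name!r} "
                    f"but {ancestor.dims[dim]} on ancestor {ancestor.name!r}"
                )
        ancestor = ancestor.parent


root = Node("root", {"time": 10})
child = Node("child", {"time": 10, "x": 3}, parent=root)
bad = Node("bad", {"time": 5}, parent=child)

check_inherited_dims(child)  # passes: "time" has the same length on the root

try:
    check_inherited_dims(bad)  # "time" conflicts with both ancestors
    conflict_detected = False
except ValueError:
    conflict_detected = True
```

Walking up the parent chain mirrors the `node.ds, node.parent.ds, node.parent.parent.ds, ...` list in the check.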
## May 7, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Tom Nicholas
- Alfonso Ladino
- Owen Littlejohns / @owenlittlejohns
### 60 sec updates.
- Owen: [PR migrating last pieces of datatree code into xarray.core](https://github.com/pydata/xarray/pull/9011)
### Agenda
- Alfonso showed us his work on opening stuff efficiently
- 1-2 orders of magnitude speedup with <= 1000 groups on netcdf4!
- Separate PRs would be great
- important things left in the merge
- docs
- formalize the backend
- moving to_netcdf and AttrAccessMixin
- issue with slots
- split up into 2 PRs to separate out the potential rabbit hole
## Apr 30th, 2024
### Attendees
- Matt Savoie / @flamingbear
- Tom Nicholas
- Eni
- Ty
- Justus
### 60 sec updates.
- Matt: PR for [ops.py](https://github.com/pydata/xarray/pull/8976)
### Agenda
- Progress / priorities
- Good progress on merging core modules
- Still also need docs, expose API, backends optimization
- Should docs be added on same release as API is made public?
- Each docs page is intended to be merged into the existing xarray docs page of the same name
- With the exception of "Hierarchical Data", which is its own new page in the user guide
- inherited variables:
- maybe have a separate namespace (for example, `dt.cf["/path/to/inherited/variable"]` does inherited access as defined by the CF conventions)
- or `dt.ia[]` for inherited access.
- the advantage would be that we would be able to release, then add this feature later
## Apr 23rd, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Tom Nicholas
- Eni Awowale / @eni-awowale
- Owen Littlejohns / @owenlittlejohns
### 60 sec updates.
- Matt: I'm just returning my attention. ops.py.
- Owen: Working on migrating most of remaining modules.
### Agenda
- Merge tarball PR (merged)
- SciPy talk?
- Ideally be able to say DataTree is in xarray main by then (July)
- Integrating backends
- https://github.com/xarray-contrib/datatree/issues/330
- Currently we create a new `CachingFileManager` for each group
- Want to only create one per file
- two options:
- Modify netcdfdatastore object to iterate over groups
- allow creating the datastore given a file manager object
- How do we test the performance of this?
- Benchmark
- Create datatree object with many nodes (but doesn't need actual data)
- Write to disk, then benchmark opening it up.
- Action items
- Tom: Dedicated issue for this? (on xarray)
- Write that benchmark first (goes with the other airspeed velocity tests)
- Modify netcdfdatastore to only create one FileManager
- Publicly expose the top-level `open_datatree` function (plus docs on datatree backends)
- Tom: Ask Kai and Max etc. if they are actually planning to do this
- Quick questions on xarray.core.common.py and testing.py.
- `from_root` kwargs to `assert_equal` → add `**options` to `assert_*`
## Apr 16th, 2024
### Attendees
- Matt Savoie / @flamingbear
- Tom Nicholas
- Stephan Hoyer
- Owen Littlejohns / @owenlittlejohns
- Eni Awowale
### 60 sec updates.
- Matt: working other side.
- Owen: looking at `mapping.py`
- Eni: HTML repr
- https://github.com/pydata/xarray/pull/8930
### Agenda
- Justus (can't join but would like to bring this up):
- type checking of xarray apparently fails because of the typing import of `DataTree`: https://github.com/pydata/xarray/issues/8768
- should we remove that for now / replace with `"DataTree"` (not sure if that works)?
- action: Matt will change tarball to stop stripping out datatree
## Apr 9th, 2024
### Attendees
- Tom Nicholas
- Matt Savoie / @flamingbear
- Ty Schlichenmeyer
### Agenda
- Discussed the original Xarray [Tracking issue (#8572)](https://github.com/pydata/xarray/issues/8572). Tom will update where we are.
- Matt will see if we can plan work to get another pair of eyes on the documentation before the merge, and to write a short (no pressure) blog post for both NASA and Xarray to celebrate :tada: completion.
- Talked through the depth first (PreOrderIter) and breadth first (LevelOrderIter) and discussed if there was any benefit to having both in the code base. We are going to try to replace and simplify by using LevelOrderIter only. We could not determine a performance reason for having depth first considering all of the intermediate nodes have to be created.
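
To make the two traversal orders concrete, here is a minimal stand-alone sketch. It does not use the `PreOrderIter` / `LevelOrderIter` classes themselves; the `Node` class and both generator functions are hypothetical illustrations written for this note.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical tree node with a name and an ordered list of children."""

    name: str
    children: List["Node"] = field(default_factory=list)


def pre_order(root: Node) -> Iterator[Node]:
    """Depth first: visit a node, then each child's whole subtree in turn."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Reverse so the first child is popped (and visited) first.
        stack.extend(reversed(node.children))


def level_order(root: Node) -> Iterator[Node]:
    """Breadth first: visit all nodes at one depth before going deeper."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


tree = Node("root", [Node("a", [Node("c")]), Node("b")])
pre_order_names = [n.name for n in pre_order(tree)]
level_order_names = [n.name for n in level_order(tree)]
```

Both orders visit every node exactly once and always reach a parent before its children, which fits the observation that intermediate nodes have to be created either way.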
## Apr 2nd, 2024
### Attendees
- Tom Nicholas
- Justus Magin / @keewis
- Eni Awowale / @eni-awowale
## Mar 26th, 2024
### Attendees
- Matt Savoie / @flamingbear
- Tom Nicholas
- Owen Littlejohns / @owenlittlejohns
- Stephan Hoyer
### 60 Second updates
- Matt: Looking at mapping.py
- Owen: Resolved last few mypy issues with datatree.py PR (thanks to Matt for help there). PR is pretty much ready to go.
### Agenda
- Current [datatree.py PR](https://github.com/pydata/xarray/pull/8789). [Should we pull everything that is imported from `datatree_` out of this one?](https://github.com/pydata/xarray/pull/8789#discussion_r1538584210)
- `ops.py` should go into xarray's `generate_aggregations`? [no for now, can be cleaned up later, add an issue?]
- Priorities?
- `from xarray import datatree`
## Mar 19th, 2024 (special time)
### Attendees
- Matt Savoie / @flamingbear
- Tom Nicholas
- Owen Littlejohns / @owenlittlejohns
- Justus Magin / @keewis
### Agenda
Discussed "DataTree handles Hashables"
- The use cases seemed very infrequent.
- zarr groups are limited to strings. The Netcdf4 doesn't have types but you can't create a group from an int `TypeError: expected str, bytes or os.PathLike object, not int`
- To move forward, allow the getter to have a Hashable type, but be clear that we only use str and raise errors on non-str in DataTrees. Hopefully this solves problems with traversing and finding data, while avoiding terrible typing conflicts between Dataset, DataArray and DataTree.
Discussed issues with wrapping a Dataset in a "FrozenDataset" as a replacement for DatasetView which problematically inherits from Dataset.
- First suggested solution for FrozenDataset was failing because special methods aren't caught by `__getattr__`.
- Owen was looking into a metaclass solution that seemed really complicated.
- Tom, Matt and Owen decided that we should move on if Owen's next stab also failed (using a mixin).
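
The `__getattr__` limitation mentioned above is general Python behaviour and can be reproduced without xarray. In this sketch `Wrapper` is a hypothetical stand-in for the proposed FrozenDataset, and a plain list stands in for the wrapped Dataset.

```python
class Wrapper:
    """Hypothetical wrapper that forwards attribute access to a wrapped object."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Only called when normal attribute lookup on the instance fails.
        return getattr(self._wrapped, name)


w = Wrapper([1, 2, 3])

forwarded_count = w.count(2)  # ordinary method: forwarded via __getattr__
explicit_len = w.__len__()  # explicit attribute access: also forwarded

try:
    len(w)  # implicit special-method lookup goes to type(w), skipping __getattr__
    implicit_len_works = True
except TypeError:
    implicit_len_works = False
```

Because implicit lookups such as `len(obj)` or `obj[key]` are resolved on the type rather than the instance, a forwarding wrapper has to define each special method explicitly on the class, for example by hand, through a mixin, or through a metaclass.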
Tom showed Matt the metaprogramming in [generate_aggregations.py](https://github.com/pydata/xarray/blob/main/xarray/util/generate_aggregations.py) and the resulting [_aggregations.py](https://github.com/pydata/xarray/blob/main/xarray/core/_aggregations.py), and it sounded like he convinced himself that we might use that instead of the code currently in ops.py to apply the map_over_subtree decorator. This solution wasn't available before, as the datatree repo was separate from xarray when implemented. This would also allow us to fix up some of the documentation for datatree that is "good enough". Probably a good thing for Tom and Stephan to discuss before we migrate that code.
## Mar 12th, 2024
### Attendees
- Matt Savoie / @flamingbear
- Tom Nicholas
- Owen Littlejohns / @owenlittlejohns
- Eni Awowale / @eni-awowale
- Justus Magin / @keewis
- Stephan
### 60 second updates
- Matt: No progress last week.
- Have PR up for datatree.py migration. Working on FrozenDataset.
### Agenda
- Slow week with not much to report.
- Some discussion about missing API pieces to Datatree. For merging or filtering in particular.
- It was mostly agreed that maybe an advanced usage documentation with recipes for how to do common operations could be useful, but keep an eye open for opportunities to improve if obvious, repeating use cases appear.
## Mar 5th, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Tom Nicholas
- Stephan Hoyer
- Eni Awowale / @eni-awowale
### 60 second updates
- Matt
- Struggling to rectify the mypy errors in [#8789](https://github.com/pydata/xarray/pull/8789). Looking for advice on which way to proceed.
- Same story for implementing Hashable for Datatree.
### Agenda
- Continue Discussion around Datatree following CF model for [scoping variables](https://cfconventions.org/cf-conventions/cf-conventions.html#_scope).
+ Justus would like a flag for behavior switching, Tom thinks that would overcomplicate things including docs and support.
+ Tom will go back to thinking and see if he can prototype something.
- Questions for implementing Hashable for Datatree led to discussion
+ Should backslash "\", slash "/", dot "." and dotdot ".." be allowed in variable names (I think this was the discussion).
+ Seemed like Hashable should work except for the Paths. Maybe it was a bad idea in Xarray? Don't think we had a decision on how to move here, but Matt will continue to think about it. Overall, generally inconsequential.
- Matt will replace DatasetView with a Frozen style wrapper to Dataset.
## Feb 27th, 2024
### Attendees
- Matt Savoie / @flamingbear
- Stephan Hoyer
- Tom Nicholas
- Eni Awowale / @eni-awowale
- Etienne Schalk / @etienneschalk
### 60 second updates
- Tom
- Not much - at conference
- Matt
- Waiting on first PR, have a few others behind. https://github.com/pydata/xarray/pull/8757
### Agenda
- Recap of previous meeting
- Updates / Q's
- Deep dive?
- Data model for inherited nodes
- e.g.,
- Entirely independent?
- Shared coordinates from parent nodes?
- CF conventions: https://cfconventions.org/cf-conventions/cf-conventions.html#groups
- Key clause: "If any dimension of an out-of-group variable has the same name as a dimension of the referring variable, the two must be the same dimension (i.e. they must have the same netCDF dimension ID)."
- design questions:
- Should we be able to open any netCDF file?
- Dict contents are ambiguous when there is fallback look-up
- Could maybe use ChainMap for inheritance
- Example in h5netcdf https://github.com/h5netcdf/h5netcdf/blob/b19d4a03a4bb553312d77135c23f3eedba243899/h5netcdf/core.py#L697
- are we excluding any use-cases by adopting a netCDF data model?
- do we allow conflicts in inherited variables?
- CF conventions do not allow conflicting dimensions
- Do we want to allow conflicting coordinates/data variables?
- EDIT: Tom commented a summary of this https://github.com/xarray-contrib/datatree/issues/297#issuecomment-1967328385
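
A minimal sketch of the `ChainMap` idea from the design questions above, using only the standard library. The variable names and string values are made up for illustration; a real node would hold xarray variables rather than strings.

```python
from collections import ChainMap

# Hypothetical contents of a parent group and one of its child groups.
root_vars = {"time": "time coordinate (root)", "lat": "latitude (root)"}
child_vars = {"temperature": "temperature (child)", "lat": "latitude (child)"}

# Lookups try the child's own variables first, then fall back to the parent.
inherited_view = ChainMap(child_vars, root_vars)

local_lookup = inherited_view["temperature"]  # found on the child itself
fallback_lookup = inherited_view["time"]  # found only on the parent
shadowed_lookup = inherited_view["lat"]  # the child's value wins over the parent's

# The view lists more names than the child stores locally.
visible_names = sorted(inherited_view)
local_names = sorted(child_vars)
```

The gap between `visible_names` and `local_names` is one way to see why dict contents become ambiguous once there is fallback look-up. The `lat` entry shows the kind of conflict that the question about allowing conflicting inherited variables is asking about.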
## Feb 20th, 2024
### Attendees
- Matt Savoie / @flamingbear
- Justus Magin / @keewis
- Owen Littlejohns / @owenlittlejohns
- Stephan Hoyer
### 60 second updates
- datatree tests are not skipped in the new release
### Agenda
- Intro to the purpose of these meetings
- Update from Matt?
- High-level explanation of datatree's overall design from Tom
- One group, one `Dataset`
- Nested dictionary
- Independent nodes
- Store `Variable` objects instead of `Dataset`s
- Map API downwards
- Deep-dive into one decision / part of code (if time)
- pathlib: non-pure paths on datatree?
### Actions
- [X] Track down reason for exploding Dataset into pieces in datatree in issues.
https://github.com/pydata/xarray/issues/8747#issuecomment-1955051183
- [X] Make migrations flat, i.e. no datatree subdir in xarray.
### Ideas
- Ideas from Stephan:
- Switched OrderedDict -> dict
- Move Dataset-like hidden properties onto a dedicated object?
- idea: subtree mapping: returns the full tree with just the specified nodes (and maybe their children)
```python
dt.subtree(["/a", "/b/c"]).isel(...)
```