Weekly Xarray Flexible Indexes Refactor Meeting Notes

Jan 11 Stephan: do we have a meeting today?
Deepak: I guess not?
Stephan: Yes, I guess not. I emailed Benoit.

December 7, 2021

Refactor concat (_dataset_concat internals):

  • _calc_concat_dim_coord
    • maybe return a PandasIndex instance too.
  • _calc_concat_over
    • If a meta-index is involved in concat_over, append to the latter all coordinate names related to the index (even coordinates that won't be concatenated). There's no need to merge coordinate variables that could possibly be created from the concatenated meta-index (i.e., returned by Index.concat()).
  • Merge unconcatenated variables
    • Merge unconcateneted indexes too
    • Use collect_variables_and_indexes and merged_collected directly for convenience (not sure about the impact on performance)
    • Need to add an argument in collect_variables_and_indexes so that we can skip concat_over
  • Stack-up variables
    • Check first for the presence of indexes, maybe concatenate the indexes (if Index.concat() is implemented) and create new coordinate variables from the resulting index (if Index.create_variables() is implemented)
    • Fallback to concat_vars()

November 23, 2021

Useful identity: ds == concat([ds.isel(x=slice(5)), ds.isel(x=slice(5, None))], dim='x')

Do we need to refactor concat now? (IMO: yes)

  • Currently working, even for multi-indexes it yields consistent results (multiple coordinates with the same index object), although indexes wrapped in coordinate variables are not the same objects than indexes returned by ds.indexes.
  • However, it likely won't work when we'll drop implicit creation of indexes from coordinates in Dataset.__init__()
  • Performance issue: the same multi-indexes are concatenated together as many times as their levels (since concat works with one variable at a time)
  • Not compatible with "meta-indexes"

Refactor concat:

  • Add Index.concat abstract class method, which basically does the same internally than IndexVariable.concat. The latter won't be needed anymore.
  • If Index.concat is not implemented, index is dropped in the resulting Dataset or DataArray and fallback to coordinate variable(s) concat.
  • Like in the merge implementation, still loop trough variables and cache concat of multi-coordinate indexes
  • Xarray objects are already aligned internally in concat, so extracting the indexes to concat together should be relatively straightforward

November 09, 2021

Status: https://github.com/pydata/xarray/pull/5692

TODO:

  • refactor concat (+ dependents like groupby (combine) and to_stacked_array)
  • Fix tests
  • Clean-up

October 26, 2021

October 19, 2021

https://github.com/pydata/xarray/issues/5647

October 12, 2021

https://github.com/pydata/xarray/issues/5647#issuecomment-937865982

October 5, 2021

September 28, 2021

Status: https://github.com/pydata/xarray/pull/5692

TODO:

  • Remaining issues (70+ failing tests):

    • test_dataarray (swap_dims, update, etc.)
    • test_dataset (swap_dims, to_stacked_array, polyfill, isel, etc.)
    • test_plot (interval index + implicit coord creation)
    • test_backends (49 test failing locally but not on CI?)
    • test_dask (did not investigate yet)
    • test_groupby (did not investigate yet)
    • test_units (did not investivate yet)
  • Implicit Creation of coordinates from multi-index:

    • Struggling to find a robust approach: currently check for conflicts between levels and other coordinates, but "false negative" when passing all multi-index coordinates from an existing DataArray to a new DataArray
  • Should Xarray indexes provide optional implementation of isel?

    • Currently both indexes and variables are indexed separately, which has some possible downsides:
      • the same index may be indexed more than once (multi-index)
      • possible unsync (different underlying objects) between variables and indexes
    • Opportunity to implement custom behavior (e.g., staggered grid meta-index -> consistent selection of cell centers vs. cell edges)
    • Indexes without any implementation will be dropped.
    • Rename Index.query to Index.sel? And add Index.isel?
  • Assign / update coordinates: maybe drop (multi-coordinate) indexes

    • Complex logic: clean-up -> align -> merge
    • Any pointer or hint on what would be the best place to check this would be greatly appreciated!!
    • -> don't allow update/remove a coordinate that is part of a multi-coordinate index
  • Allow non-1D variables with a name that matches one of their dimensions.

  • Remove .level_coords

    • And refactor everything that relies on it

September 14, 2021

Status: https://github.com/pydata/xarray/pull/5692

  • refactor label-based selection
  • refactor rename_*
  • refactored set_index, reset_index, reorder_levels

(notebook showcase)

Small changes in behavior (no API change), mostly to get rid of "hacky" workarounds due to the limitations of the "index/dimension" coordinates concept (we don't need it anymore with the new data model). Is it ok to introduce those changes now? Or should we go through some smooth transition?

See https://github.com/pydata/xarray/issues/4825#issuecomment-916974087

September 7, 2021

Index.query API

https://github.com/pydata/xarray/pull/5692#issuecomment-914207743

@dataclass
class QueryOptions:
    """Base class used by Xarray indexes to define query options."""
    ...


@dataclass
class QueryResult:
    """Index query results.

    Attributes
    ----------
    dim_indexers: dict
        A dictionary where keys are array dimensions and values are
        location-based indexers.
    indexes: dict, optional
        New indexes to replace in the resulting DataArray or Dataset.
    index_vars : dict, optional
        New indexed variables to replace in the resulting DataArray or Dataset.
    drop_coords : list, optional
        Coordinate(s) to drop in the resulting DataArray or Dataset.
    rename_dims : dict, optional
        A dictionnary in the form ``{old_dim: new_dim}`` for dimension(s) to
        rename in the resulting DataArray or Dataset.

    """
    dim_indexers: Mapping[Any, Any]
    indexes: Dict[Hashable, "Index"] = field(default_factory=dict)
    index_vars: "IndexVars" = field(default_factory=dict)
    drop_coords: List[Hashable] = field(default_factory=list)
    rename_dims: Mapping[Any, Hashable] = field(default_factory=dict)


class Index:

    query_options: Type[QueryOptions] = QueryOptions

    def query(labels: Dict[Any: Any], options: QueryOptions) -> QueryResult:
        raise NotImplementedError()
@dataclass
class PandasIndexQueryOptions(QueryOptions):
    method: Optional[str] = None
    tolerance: Optional[float] = None

    def __post_init__(self):
        if self.method is not None and not isinstance(self.method, str):
            raise TypeError("``method`` must be a string")



def group_indexers_by_index(
    obj: Union["DataArray", "Dataset"],
    indexers: Mapping[Any, Any],
    options: Mapping[str, Any],
) -> Tuple[Dict[int, "Index"], Dict[Union[int, None], Dict]]:
    """Returns a dictionary of unique index items and another dictionary of label indexers
    grouped by index (both using the same index ids as keys).

    """
    unique_indexes = {}
    grouped_indexers: Mapping[Union[int, None], Dict] = defaultdict(dict)

    for key, label in indexers.items():
        index: "Index" = obj.xindexes.get(key, None)

        if index is not None:
            index_id = id(index)
            unique_indexes[index_id] = index
            grouped_indexers[index_id][key] = label

            invalid_options = set(options) - {f.name for f in fields(index.query_options)}
            if invalid_options:
                raise ValueError(
                    f"Invalid selection options for {type(index)!r}:"
                    + ",".join(invalid_options)
                )

        elif key in obj.coords:
            raise KeyError(f"no index found for coordinate {key}")
        elif key not in obj.dims:
            raise KeyError(f"{key} is not a valid dimension or coordinate")
        elif len(options):
            raise ValueError(
                f"cannot supply selection options {options!r} for dimension {key!r}"
                "that has no associated coordinate or index"
            )
        else:
            # key is a dimension without coordinate
            # failback to location-based selection
            grouped_indexers[None][key] = label

    return unique_indexes, dict(grouped_indexers)

August 31, 2021

Explicit indexes

https://github.com/pydata/xarray/pull/5692

Normalize label-based indexers before passing them to Index.query?

https://github.com/pydata/xarray/issues/5697

(Benoît): probably better to let indexes take care of normalization.

  • pandas indexes: lots of edge cases, many types (slices, single items, sequences, etc.)
  • xoak: currently only accepts xarray objects (Variable or DataArray)

A lot of flexibility but also a lot of responsibility for xarray Index

If repetitive patterns emerge in custom indexes implementations, we could eventually implement some convenient layer via Query and QueryResult classes used as interface to Index.query.

August 24, 2021

Explicit indexes, the big PR: https://github.com/pydata/xarray/pull/5692

  • create default indexes using PandasIndex and PandasMultiIndex classes
  • get rid of multi-index virtual coordinates (mostly)
  • updated reprs (some remaining tweaks to do)

This refactoring has lots of impacts, fix the broken tests will probably require update and/or refactor things like:

  • rename vars / dims
  • set_index / reset_index
  • stack / unstack
  • sel

At this point, I'm wondering whether we should depreciate a couple of things now or later:

  • Passing a multi-index directly as a coordinate in Dataset / DataArray constructors

Currently, level coordinates are implicitly created from the multi-index. Suggested behavior: treat the index as an array-like (single coordinate) by default and encourage passing it more explicitly, e.g., something like

midx = pd.MultiIndex.from_arrays([[...], [...]], names=("foo", "bar"))
coords, index = xr.PandasMultiIndex.from_pandas_index(midx, dim=“x”)
ds = xr.Dataset(coords=coords, indexes={("foo", "bar"): index})

Maybe OK?

foo = np.array(...)
bar = np.array(...)
midx = pd.MultiIndex.from_arrays([foo, bar], names=("foo", "bar"))
ds = xr.Dataset(coords=coords, indexes={"foo": midx, "bar": midx})

Existing code (not ideal):

ds = xr.Dataset()
ds.coords['foo_bar'] = pd.MultiIndex.from_arrays(
    [[...], [...]],
    names=("foo", "bar"),
)
# currently: MultiIndex "foo_bar" with levels "foo" and "bar"
# very soon: issue FutureWarning. Both array of tuples
#   "foo_bar" and arrays "foo" and "bar". 
# in a few releases: raise an error (we don't want the
#   array "foo_bar")
# much later: Array of tuples, not a MultiIndex.
  • IndexVariable.level_names and IndexVariable.get_level_variable

    • remove as soon as possible (these can dissappear with the deprecated "foo_bar" Variables of tuples)
  • Maybe #5732 too?

August 10, 2021

https://discourse.pangeo.io/t/handling-slicing-with-circular-longitude-coordinates-in-xarray/1608

Possible "explosion" of custom indexes? Very complex "meta-indexes"? Index overlapping goals & features. How best to avoid the development of a messy ecosystem built on top of xarray.Index?

An example of geo-aware indexes:

Alternatives? Some sort of protocol? Data model allowing coordinates with multiple indexes?

August 3, 2021

Xarray index vs. variables API

Option A

Every operation that returns a new xarray.Index (e.g., set a new index from variables, selection, join) may (or should?) also return new IndexVariable object(s). If no variable is returned, the input variables or the original index variables will be reused (maybe converted into IndexVariable objects).

IndexVars = Dict[Hashable, "IndexVariable"] IndexAndVars = Tuple["Index", Optional[IndexVars]] class Index: @classmethod def from_variables( cls, variables: Dict[Hashable, "Variable"], **kwargs ) -> IndexAndVars: ... def query( self, labels: Dict[Hashable, Any] ) -> Tuple[Any, Optional[IndexAndVars]]: ... def union(self, other) -> IndexAndVars: ... def intersection(self, other) -> IndexAndVars: ... def rename(self, var_names) -> ???: ... def copy(self, deep: bool = True) -> ???: ... def __getitem__(self, indexer: Any) -> ???: ...
  • More complex return types and potentially more arguments to pass on to other functions.
  • What about copy and __getitem__? Seems not conventional to return anything other than a new Index object. But if we copy indexes and variables independently, we could unsync them, i.e.,
# Is it an issue? new_xr_var._data is new_xr_index.index # might be False new_xr_var._data == new_xr_index.index # still true
  • What about rename? If we don't rename the underlying index in-place, e.g.,
    pd.Index.rename(inplace=False), we may need to return new variables as well.

Option B

Alternatively, use an xarray.Index.coords property? But API more confusing, separation of concerns is less clear, how to check that new variables are returned, etc.

July 27, 2021

https://github.com/pydata/xarray/pull/5636

July 20, 2021

https://github.com/pydata/xarray/issues/5553

June 29, 2021

  • Decouple indexes and coordinates
    • (not only) get rid of multi-index virtual coordinates, (also) set the path toward a flexible data model
    • Loose vs. tight relationship xarray.Index <-> a set of coordinates
      1. may an xarray.Index instance be shared among different xarray DataArray / Dataset objects?
      2. an xarray.Index internally needs information like the coordinate name(s), dimension(s), shape(s), etc. (e.g., xoak use case: back and forth transformation of n-d coordinate arrays from/to a 2-d array of shape (npoints, ncoordinates))
      3. so maybe (1) is not possible and an xarray.Index should be tightly coupled to a DataArray / Dataset ?
    • do not wrap an xarray.Index anymore into an xarray.IndexVariable
      • instead expose xarray.Index's .from_variables and .to_variables API?
      • how to reuse an existing index when building a new one (e.g., set_index(..., append=True))? Using a .from_variables class method may not be the best option Stick with __init__?

Option A: from_variables / to_variables API

class Index: """Base class inherited by all xarray-compatible indexes.""" # what would be self.coords? # where setting self.coords? `__init__` or `from_variables`? __slots__ = ("coords",) @classmethod def from_variables(cls, variables: Dict[Hashable, "Variable"], **kwargs): raise NotImplementedError() def to_variables(self) -> Dict[Hashable, "Variable"]: raise NotImplementedError()

Option B: constructor signature + coords attribute

class Index: """Base class inherited by all xarray-compatible indexes.""" __slots__ = ("coords",) def __init__( self, coords: Dict[Hashable, "Variable"], # could be more flexible? index: Optional[Any] = None, **kwargs ): # it may be more complex than this, e.g., # create the underlying index then # create new xarray variables from LazilyIndexedArray objects self.coords = coords

June 22, 2021

  • API for setting custom indexes
    • best strategy for updating set_index API? See notes
      • Deepak: could haveassign_indexes that looks like assign_coords (accepts a mapping)
    • or maybe another method: set_xindex? (like .indexes vs .xindexes properties)
      • Justus: use a namespace / accessor
    • Deepak: or maybe .xindexes[("x", "y")] = CRSIndex(...)
      • Benoit: requires duplication of variable names in some cases
      • Benoit: would be in-place operation
    • Dataset / DataArray constructors:
      • depreciate passing indexes through coords?

June 8, 2021

  • Current status: all index-based operations go through xarray.Index

    • pandas.Index -> xarray.PandasIndex and pandas.MultiIndex -> xarray.PandasMultiIndex
    • xarray.Index's equal, union and intersection methods for alignment
    • xarray.Index.query method for label-based selection
    • xarray.Index.to_pandas_index() for all other operations in Xarray that still rely exclusively on pandas.Index
  • Feedback on https://github.com/pydata/xarray/pull/5322 ?

  • Next step: update the data model (index <1:many> coordinates and/or dimensions)

    • Go step by step (separate PRs) but what's the best strategy?
    • Still focus on PandasIndex and PandasMultiIndex for now
    • Keep IndexVariable for now
    • Allow multiple indexes for one dimension? (i.e., two coordinates with each a pandas.Index)
      • What API? update set_index?
      • How to combine results from equal, union, intersection, query?
      • Many places still expect Dataset.xindexes (or Dataset.indexes) to return one index per dimension
      • Multiple multi-indexes for one dimension? Sounds crazy.
    • Multi-index virtual coordinates -> real coordinates
      • How best to avoid data copy/duplication?
Select a repo