# Some notes for Xarray's SciPy 2022 sprint on flexible indexes
## GH links
- discussion about the sprint: https://github.com/pydata/xarray/discussions/6783
- next development steps: https://github.com/pydata/xarray/issues/6293
- project: https://github.com/pydata/xarray/projects/1
## The Xarray `Index` base class
Every Xarray index should inherit from the `xarray.core.indexes.Index` base class (TODO: should we make it available under the main namespace `xarray.Index`?).
The `Index` API closely follows the `Dataset` / `DataArray` API, e.g., for a custom index to support `.sel()` it needs to implement `Index.sel()`, to support `.stack()` / `.unstack()` it needs to implement `Index.stack()` / `Index.unstack()`, etc.
The base class is defined [here](https://github.com/pydata/xarray/blob/main/xarray/core/indexes.py#L34-L124). For convenience, it is shown below:
```python=
class Index:
    """Base class inherited by all xarray-compatible indexes."""

    @classmethod
    def from_variables(cls, variables: Mapping[Any, Variable]) -> Index:
        raise NotImplementedError()

    @classmethod
    def concat(
        cls: type[T_Index],
        indexes: Sequence[T_Index],
        dim: Hashable,
        positions: Iterable[Iterable[int]] = None,
    ) -> T_Index:
        raise NotImplementedError()

    @classmethod
    def stack(cls, variables: Mapping[Any, Variable], dim: Hashable) -> Index:
        raise NotImplementedError(
            f"{cls!r} cannot be used for creating an index of stacked coordinates"
        )

    def unstack(self) -> tuple[dict[Hashable, Index], pd.MultiIndex]:
        raise NotImplementedError()

    def create_variables(
        self, variables: Mapping[Any, Variable] | None = None
    ) -> IndexVars:
        if variables is not None:
            # pass through
            return dict(**variables)
        else:
            return {}

    def to_pandas_index(self) -> pd.Index:
        """Cast this xarray index to a pandas.Index object or raise a TypeError
        if this is not supported.

        This method is used by all xarray operations that expect/require a
        pandas.Index object.
        """
        raise TypeError(f"{self!r} cannot be cast to a pandas.Index object")

    def isel(
        self, indexers: Mapping[Any, int | slice | np.ndarray | Variable]
    ) -> Index | None:
        return None

    def sel(self, labels: dict[Any, Any]) -> IndexSelResult:
        raise NotImplementedError(f"{self!r} doesn't support label-based selection")

    def join(self: T_Index, other: T_Index, how: str = "inner") -> T_Index:
        raise NotImplementedError(
            f"{self!r} doesn't support alignment with inner/outer join method"
        )

    def reindex_like(self: T_Index, other: T_Index) -> dict[Hashable, Any]:
        raise NotImplementedError(f"{self!r} doesn't support re-indexing labels")

    def equals(self, other):  # pragma: no cover
        raise NotImplementedError()

    def roll(self, shifts: Mapping[Any, int]) -> Index | None:
        return None

    def rename(
        self, name_dict: Mapping[Any, Hashable], dims_dict: Mapping[Any, Hashable]
    ) -> Index:
        return self

    def __copy__(self) -> Index:
        return self.copy(deep=False)

    def __deepcopy__(self, memo=None) -> Index:
        # memo does nothing but is required for compatibility with
        # copy.deepcopy
        return self.copy(deep=True)

    def copy(self, deep: bool = True) -> Index:
        cls = self.__class__
        copied = cls.__new__(cls)
        if deep:
            for k, v in self.__dict__.items():
                setattr(copied, k, copy.deepcopy(v))
        else:
            copied.__dict__.update(self.__dict__)
        return copied

    def __getitem__(self, indexer: Any):
        raise NotImplementedError()
```
### Minimal requirements
Every index should at least implement the `Index.from_variables()` class method, which is used to build a new index instance from one or more existing coordinate(s).
It is the responsibility of the custom index to check the consistency of the given coordinates. For example, `PandasIndex` accepts only one coordinate, and `PandasMultiIndex` accepts one or more 1-dimensional coordinates that must all share the same dimension. Other custom indexes may have different requirements, e.g.,
- a georeferenced raster index, which takes two 1-d coordinates, each with its own distinct dimension.
- a staggered grid index, which takes coordinates with different dimension name suffixes (e.g., `_c` and `_l` for center and left).
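The kind of consistency check described above for a raster index could look roughly like the sketch below. It does not use Xarray: `FakeVariable` and `check_raster_coords` are hypothetical names, and `FakeVariable` only models the `dims` attribute of `xarray.Variable`.

```python
from dataclasses import dataclass


@dataclass
class FakeVariable:
    """Hypothetical stand-in for xarray.Variable: only `dims` is modelled."""

    dims: tuple


def check_raster_coords(variables):
    """Validate the coordinates given to a hypothetical raster index.

    Expects exactly two 1-d coordinates, each along its own dimension.
    Returns the mapping {coordinate name: dimension name}.
    """
    if len(variables) != 2:
        raise ValueError(f"expected 2 coordinates, got {len(variables)}")
    dims = {}
    for name, var in variables.items():
        if len(var.dims) != 1:
            raise ValueError(f"coordinate {name!r} must be 1-dimensional")
        dims[name] = var.dims[0]
    if len(set(dims.values())) != 2:
        raise ValueError("the two coordinates must have distinct dimensions")
    return dims


coords = {"x": FakeVariable(("x",)), "y": FakeVariable(("y",))}
raster_dims = check_raster_coords(coords)
```

In a real index, a check like this would presumably live at the top of `from_variables()`, before any underlying structure is built.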
### Optional requirements
Pretty much everything else is optional. Depending on the method, the default `Index` implementation either raises an error (operation not supported) or does nothing specific.
For example, just skip implementing `Index.rename()` in an index subclass if there's no internal structure to rename. In the case of `PandasIndex`, we rename the underlying `pandas.Index` object and/or update the `PandasIndex.dim` attribute.
### Wrapping index data as coordinate variables
In some cases it is possible to reuse the index's underlying object or structure as coordinate variable data and hence avoid data duplication.
This is the case for `PandasIndex` and `PandasMultiIndex`, where we can leverage the fact that `pandas.Index` objects (partially) behave as arrays. In Xarray we use some wrappers around those underlying objects as a thin compatibility layer to, e.g., preserve dtypes, handle explicit and n-dimensional indexing, etc. The wrappers are implemented [here](https://github.com/pydata/xarray/blob/5678b758bff24db28b53217c70da00c2fc0340a3/xarray/core/indexing.py#L1367-L1538).
If (maybe wrapped) index data can be reused as coordinate variable data, the xarray index subclass should implement the `Index.create_variables()` method. This method accepts a dictionary of `xarray.Variable` objects as input, which is used for propagating variable metadata (attrs, encoding). The method should return a dictionary of new `xarray.IndexVariable` objects. See for example [PandasIndex.create_variables()](https://github.com/pydata/xarray/blob/5678b758bff24db28b53217c70da00c2fc0340a3/xarray/core/indexes.py#L322-L341).
### Selection
For a custom index to support (label-based) selection, it needs at least to implement `Index.sel()`. This method accepts a dictionary of labels where the keys are coordinate names (already filtered for the current index) and the values can be anything (e.g., a slice, a tuple, a list, a numpy array, a `xarray.Variable` object or a `xarray.DataArray`). It is the responsibility of the index to properly handle those input labels.
The `Index.sel()` method must return an instance of `IndexSelResult` (defined [here](https://github.com/pydata/xarray/blob/5678b758bff24db28b53217c70da00c2fc0340a3/xarray/core/indexing.py#L36-L79)). The latter is a small class that stores positional indexers (indices) and that can also store new variables, new indexes, names of variables or indexes to drop, names of dimensions to rename, etc. This is useful in the case of `PandasMultiIndex`, as it allows converting it into a single `PandasIndex` when only one level remains after the selection.
The `IndexSelResult` class is also used to merge results from label-based selection performed by different indexes (e.g., it is now possible to have two distinct indexes for two 1-d coordinates sharing the same dimension, but it is not currently possible to use those two indexes in the same call to, e.g., `Dataset.sel()`).
Optionally, the custom index may also implement `Index.isel()`. In the case of `PandasIndex`, we use it to create a new index object by just indexing the underlying `pandas.Index` object. In other cases this may not be possible (e.g., a `kd-tree` index object may not be easily indexed). If `Index.isel()` is not implemented, the index is just dropped in the DataArray or Dataset resulting from the selection.
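To make the selection flow more concrete, here is a minimal sketch that does not depend on Xarray. `SelResult`, `SortedLabelIndex` and `merge_results` are hypothetical names: `SelResult` is a heavily reduced stand-in for `IndexSelResult` that keeps only a mapping from dimension names to positional indexers, and the toy index assumes sorted, unique labels and a coordinate named after its dimension.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class SelResult:
    """Hypothetical, reduced stand-in for IndexSelResult."""

    dim_indexers: dict = field(default_factory=dict)


class SortedLabelIndex:
    """Toy index over sorted, unique labels along one dimension."""

    def __init__(self, labels, dim):
        self.labels = list(labels)
        self.dim = dim

    def _position(self, label):
        # Binary search for an exact match; unknown labels raise KeyError.
        pos = bisect_left(self.labels, label)
        if pos == len(self.labels) or self.labels[pos] != label:
            raise KeyError(label)
        return pos

    def sel(self, labels):
        # `labels` maps coordinate names to the requested labels.
        label = labels[self.dim]
        if isinstance(label, slice):
            # Label slices include both endpoints, as in pandas.
            start = 0 if label.start is None else self._position(label.start)
            if label.stop is None:
                stop = len(self.labels)
            else:
                stop = self._position(label.stop) + 1
            return SelResult({self.dim: slice(start, stop)})
        if isinstance(label, (list, tuple)):
            return SelResult({self.dim: [self._position(v) for v in label]})
        return SelResult({self.dim: self._position(label)})


def merge_results(results):
    """Combine results from several indexes, refusing conflicting dimensions."""
    merged = {}
    for res in results:
        overlap = merged.keys() & res.dim_indexers.keys()
        if overlap:
            raise ValueError(f"conflicting indexers for dimensions {sorted(overlap)}")
        merged.update(res.dim_indexers)
    return SelResult(merged)


x_index = SortedLabelIndex([10, 20, 30, 40], dim="x")
y_index = SortedLabelIndex(["a", "b", "c"], dim="y")
merged = merge_results([x_index.sel({"x": slice(20, 30)}), y_index.sel({"y": "c"})])
```

The conflict check in `merge_results` mirrors the restriction mentioned above: two indexes sharing a dimension cannot both contribute indexers in the same selection call.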
### Alignment
For a custom index to support alignment, it needs to implement `Index.equals()`, `Index.join()` (and possibly `Index.reindex_like()`).
`Index.equals()` takes another index object (of the same type) and returns either `True` or `False`.
`Index.join()` takes another index object (of the same type) and returns a new index object. The "how" parameter accepts the same values as the "join" parameter in `xarray.align()`.
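As a rough illustration of these three methods, the sketch below implements them for a toy index of unique labels. It is not Xarray code: `LabelIndex` is a hypothetical class, only the "inner" and "outer" join methods are handled, and the choice of `-1` to mark a missing label in `reindex_like()` is an assumption of this sketch.

```python
class LabelIndex:
    """Toy index holding unique labels along one dimension (not an xarray class)."""

    def __init__(self, labels, dim):
        self.labels = list(labels)
        self.dim = dim

    def equals(self, other):
        # Two indexes are equal when they hold the same labels in the same order.
        return self.dim == other.dim and self.labels == other.labels

    def join(self, other, how="inner"):
        if how == "inner":
            # Keep labels present in both indexes, in the order of `self`.
            keep = set(other.labels)
            labels = [v for v in self.labels if v in keep]
        elif how == "outer":
            # Keep labels present in either index, sorted.
            labels = sorted(set(self.labels) | set(other.labels))
        else:
            raise ValueError(f"unsupported join method in this sketch: {how!r}")
        return type(self)(labels, self.dim)

    def reindex_like(self, other):
        # For each label of `other`, give its position in `self` (-1 if absent).
        lookup = {v: i for i, v in enumerate(self.labels)}
        return {self.dim: [lookup.get(v, -1) for v in other.labels]}


a = LabelIndex([1, 2, 3], dim="x")
b = LabelIndex([2, 3, 4], dim="x")
inner = a.join(b, how="inner")
outer = a.join(b, how="outer")
positions = a.reindex_like(outer)
```

The joined index describes the target labels, while `reindex_like()` tells the caller which positions of the original data to pick (or fill) to conform to them.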
### Meta-indexes
Many potential use cases for Xarray custom indexes would consist of adding some extra functionality on top of pandas indexes. We call those kinds of indexes "meta-indexes".
For those cases, it is possible (and likely recommended) to encapsulate `PandasIndex` in custom `Index` subclasses.
A small incomplete (and untested) example for a raster (meta-)index:
```python=
from xarray.core.indexes import Index, PandasIndex
from xarray.core.indexing import merge_sel_results


class RasterIndex(Index):
    def __init__(self, xy_indexes):
        assert len(xy_indexes) == 2

        # must have two distinct dimensions
        dim = [idx.dim for idx in xy_indexes.values()]
        assert dim[0] != dim[1]

        self._xy_indexes = xy_indexes

    @classmethod
    def from_variables(cls, variables):
        assert len(variables) == 2

        # wrap each 1-d coordinate in its own PandasIndex
        xy_indexes = {
            k: PandasIndex.from_variables({k: v})
            for k, v in variables.items()
        }

        return cls(xy_indexes)

    def create_variables(self, variables=None):
        # default of None matches the base class signature
        idx_variables = {}

        for index in self._xy_indexes.values():
            idx_variables.update(index.create_variables(variables))

        return idx_variables

    def sel(self, labels):
        results = []

        # delegate to the wrapped index of each coordinate present in labels
        for k, index in self._xy_indexes.items():
            if k in labels:
                results.append(index.sel({k: labels[k]}))

        return merge_sel_results(results)
```
Note: a lot of the boilerplate code in the example above could probably be abstracted away in a helper class built into Xarray for anyone who wants to add such a meta-index.
## The `xindexes` property
Dataset and DataArray both provide a `.xindexes` property that returns an `Indexes` object (defined [here](https://github.com/pydata/xarray/blob/5678b758bff24db28b53217c70da00c2fc0340a3/xarray/core/indexes.py#L1008-L1222) with docstrings for the public API).
`Indexes` is a dict-like object of Dataset/DataArray indexes plus some extra, convenient API to deal with multi-coordinate indexes (e.g., get unique index objects, get all coordinates that are mapped to the same index, etc.).
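The two convenience operations mentioned above can be sketched on a plain dictionary that maps coordinate names to index objects. This is not the `Indexes` class itself: `unique_indexes` and `coords_sharing_index` are hypothetical helper names, and bare `object()` instances stand in for real indexes.

```python
def unique_indexes(indexes):
    """Return each distinct index object once, in first-seen order."""
    seen = set()
    unique = []
    for index in indexes.values():
        # Compare by identity: several coordinates may share one index object.
        if id(index) not in seen:
            seen.add(id(index))
            unique.append(index)
    return unique


def coords_sharing_index(indexes, name):
    """Return the names of all coordinates mapped to the same index as `name`."""
    target = indexes[name]
    return [k for k, index in indexes.items() if index is target]


raster = object()  # stands for one multi-coordinate index covering x and y
time_index = object()  # stands for a single-coordinate index
indexes = {"x": raster, "y": raster, "time": time_index}

distinct = unique_indexes(indexes)
raster_coords = coords_sharing_index(indexes, "y")
```

Grouping by identity matters because a multi-coordinate index appears once per coordinate in the mapping, while operations such as alignment should presumably process it only once.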
## Use cases and examples so far (please add your own!)
(See also the "Would allow this" column of [this project](https://github.com/pydata/xarray/projects/1) for a live list of relevant issues)
### Periodic index
A periodic index wraps around instead of raising an out-of-bounds error.
This is useful for longitude, for example.
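A minimal sketch of the wrap-around idea, independent of Xarray: labels are reduced modulo the period before lookup. `PeriodicLabels` is a hypothetical class; it only supports exact matches on scalar labels, and a real index would also need slices, nearest-neighbour lookup and floating-point tolerance.

```python
class PeriodicLabels:
    """Toy periodic lookup: labels are compared modulo `period`."""

    def __init__(self, labels, period):
        self.period = period
        # Map each wrapped label to its position along the dimension.
        self._lookup = {label % period: pos for pos, label in enumerate(labels)}

    def sel(self, label):
        # Wrap the requested label into [0, period) instead of failing
        # when it falls outside the stored range.
        wrapped = label % self.period
        if wrapped not in self._lookup:
            raise KeyError(label)
        return self._lookup[wrapped]


lon = PeriodicLabels([0, 90, 180, 270], period=360)
east = lon.sel(450)  # 450 wraps to 90
west = lon.sel(-90)  # -90 wraps to 270
```

Integer labels are used here to sidestep floating-point equality issues.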
### KDTree index
* Be able to select a set of gates (points) from a radar dataset with dimensions azimuth (angle around the radar) and range (distance from the radar)
* Latitude and longitude are fields within the dataset, with dimensions azimuth and range
* Adding one for the altitude (height above ground) would be great too
* We implemented our own custom function here - https://github.com/ARM-DOE/pyart/blob/main/pyart/util/columnsect.py
* Would also allow for the selection of data using lon/lat for datasets that don't give projected (native, 1D) coordinates
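The lookup that such an index would perform can be sketched with a brute-force nearest-neighbour search, used here as a simple stand-in for a kd-tree query (a kd-tree would return the same answer faster on large grids). `nearest_gate` is a hypothetical function, the latitude/longitude values are made up, and plain Euclidean distance in degrees is an assumption that ignores the Earth's curvature.

```python
import math


def nearest_gate(lat2d, lon2d, lat, lon):
    """Return (azimuth_pos, range_pos) of the gate closest to (lat, lon).

    `lat2d` and `lon2d` are nested lists indexed as [azimuth][range].
    """
    best = None
    best_dist = math.inf
    for i, (lat_row, lon_row) in enumerate(zip(lat2d, lon2d)):
        for j, (gate_lat, gate_lon) in enumerate(zip(lat_row, lon_row)):
            dist = math.hypot(gate_lat - lat, gate_lon - lon)
            if dist < best_dist:
                best, best_dist = (i, j), dist
    return best


# Two azimuths x two ranges of illustrative gate coordinates.
lat2d = [[36.0, 36.1], [36.0, 35.9]]
lon2d = [[-97.0, -97.0], [-96.9, -97.1]]
gate = nearest_gate(lat2d, lon2d, lat=36.09, lon=-97.01)
```

The returned pair is a positional indexer along the azimuth and range dimensions, which is the kind of result an index's `sel()` is expected to hand back.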
### Unit-aware index (via pint)
### Out-of-core index (i.e. dask)
### Staggered grid index