Jan 11 – Stephan: do we have a meeting today?
Deepak: I guess not?
Stephan: Yes, I guess not. I emailed Benoit.
Refactor concat (_dataset_concat internals):
- _calc_concat_dim_coord: return a PandasIndex instance too.
- _calc_concat_over / concat_over: append to the latter all coordinate names related to the index (even coordinates that won't be concatenated). There's no need to merge coordinate variables that could possibly be created from the concatenated meta-index (i.e., returned by Index.concat()).
- Use collect_variables_and_indexes and merge_collected directly for convenience (not sure about the impact on performance…)
- Update collect_variables_and_indexes so that we can skip concat_over variables, concatenate the indexes (if Index.concat() is implemented) and create new coordinate variables from the resulting index (if Index.create_variables() is implemented)
- concat_vars()
- Useful identity: ds == concat([ds.isel(x=slice(5)), ds.isel(x=slice(5, None))], dim='x')
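For example, a quick check of that identity on a toy dataset (using .identical() to get a single boolean, since == returns an element-wise comparison):

# Round-trip sketch: splitting along "x" and concatenating again should
# reproduce the original dataset.
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(10))}, coords={"x": np.arange(10)})
parts = [ds.isel(x=slice(5)), ds.isel(x=slice(5, None))]
assert ds.identical(xr.concat(parts, dim="x"))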
Do we need to refactor concat now? (IMO: yes)
- ds.indexes
- Dataset.__init__()
- (concat works with one variable at a time)

Refactor concat:
- Index.concat abstract class method, which basically does the same internally as IndexVariable.concat. The latter won't be needed anymore (a rough sketch follows below).
- If Index.concat is not implemented, the index is dropped in the resulting Dataset or DataArray and we fall back to coordinate variable(s) concat.
- Like the merge implementation, still loop through variables and cache the concat of multi-coordinate indexes.
- …concat, so extracting the indexes to concat together should be relatively straightforward.

Status: https://github.com/pydata/xarray/pull/5692
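A rough, hypothetical sketch of what an Index.concat classmethod could look like for a pandas-backed index (the class name, signature and behavior below are assumptions for illustration, not the final xarray API):

# Hypothetical sketch only: concatenate by appending the wrapped pandas indexes.
import pandas as pd


class ToyPandasBackedIndex:
    def __init__(self, index: pd.Index, dim: str):
        self.index = index
        self.dim = dim

    @classmethod
    def concat(cls, indexes, dim):
        # append the underlying pandas.Index objects along ``dim``
        first, *rest = indexes
        return cls(first.index.append([idx.index for idx in rest]), dim)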
TODO:
- concat (+ dependents like groupby (combine) and to_stacked_array)
- merge_core: check for potential index conflicts when prioritized is not None?
- create_index option for .stack()
  - https://github.com/pydata/xarray/issues/5647
  - https://github.com/pydata/xarray/issues/5647#issuecomment-937865982

Status: https://github.com/pydata/xarray/pull/5692
TODO:
Remaining issues (70+ failing tests):
- … (swap_dims, update, etc.)
- … (swap_dims, to_stacked_array, polyfit, isel, etc.)
- Implicit creation of coordinates from multi-index:
- Should Xarray indexes provide an optional implementation of isel?
- Rename Index.query to Index.sel? And add Index.isel?
- Assign / update coordinates: maybe drop (multi-coordinate) indexes
- Allow non-1D variables with a name that matches one of their dimensions.
- Remove .level_coords

Status: https://github.com/pydata/xarray/pull/5692
- rename_*
- set_index, reset_index, reorder_levels (notebook showcase)
Small changes in behavior (no API change), mostly to get rid of "hacky" workarounds due to the limitations of the "index/dimension" coordinates concept (we don't need it anymore with the new data model). Is it ok to introduce those changes now? Or should we go through some smooth transition?
See https://github.com/pydata/xarray/issues/4825#issuecomment-916974087
https://github.com/pydata/xarray/pull/5692#issuecomment-914207743
# note: "Index", "IndexVars", "DataArray" and "Dataset" below are forward
# references to xarray types
from collections import defaultdict
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Hashable, List, Mapping, Optional, Tuple, Type, Union


@dataclass
class QueryOptions:
    """Base class used by Xarray indexes to define query options."""
    ...


@dataclass
class QueryResult:
    """Index query results.

    Attributes
    ----------
    dim_indexers : dict
        A dictionary where keys are array dimensions and values are
        location-based indexers.
    indexes : dict, optional
        New indexes to replace in the resulting DataArray or Dataset.
    index_vars : dict, optional
        New indexed variables to replace in the resulting DataArray or Dataset.
    drop_coords : list, optional
        Coordinate(s) to drop in the resulting DataArray or Dataset.
    rename_dims : dict, optional
        A dictionary in the form ``{old_dim: new_dim}`` for dimension(s) to
        rename in the resulting DataArray or Dataset.

    """
    dim_indexers: Mapping[Any, Any]
    indexes: Dict[Hashable, "Index"] = field(default_factory=dict)
    index_vars: "IndexVars" = field(default_factory=dict)
    drop_coords: List[Hashable] = field(default_factory=list)
    rename_dims: Mapping[Any, Hashable] = field(default_factory=dict)


class Index:
    query_options: Type[QueryOptions] = QueryOptions

    def query(self, labels: Dict[Any, Any], options: QueryOptions) -> QueryResult:
        raise NotImplementedError()


@dataclass
class PandasIndexQueryOptions(QueryOptions):
    method: Optional[str] = None
    tolerance: Optional[float] = None

    def __post_init__(self):
        if self.method is not None and not isinstance(self.method, str):
            raise TypeError("``method`` must be a string")


def group_indexers_by_index(
    obj: Union["DataArray", "Dataset"],
    indexers: Mapping[Any, Any],
    options: Mapping[str, Any],
) -> Tuple[Dict[int, "Index"], Dict[Union[int, None], Dict]]:
    """Returns a dictionary of unique index items and another dictionary of label
    indexers grouped by index (both using the same index ids as keys).
    """
    unique_indexes = {}
    grouped_indexers: Mapping[Union[int, None], Dict] = defaultdict(dict)

    for key, label in indexers.items():
        index: "Index" = obj.xindexes.get(key, None)

        if index is not None:
            index_id = id(index)
            unique_indexes[index_id] = index
            grouped_indexers[index_id][key] = label

            invalid_options = set(options) - {f.name for f in fields(index.query_options)}
            if invalid_options:
                raise ValueError(
                    f"Invalid selection options for {type(index)!r}: "
                    + ", ".join(invalid_options)
                )

        elif key in obj.coords:
            raise KeyError(f"no index found for coordinate {key}")
        elif key not in obj.dims:
            raise KeyError(f"{key} is not a valid dimension or coordinate")
        elif len(options):
            raise ValueError(
                f"cannot supply selection options {options!r} for dimension {key!r} "
                "that has no associated coordinate or index"
            )
        else:
            # key is a dimension without a coordinate:
            # fall back to location-based selection
            grouped_indexers[None][key] = label

    return unique_indexes, dict(grouped_indexers)
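A quick, purely illustrative usage of the options sketch above: valid options pass the __post_init__ check, and dataclasses.fields() exposes the option names that group_indexers_by_index validates against.

# Illustrative usage of the sketch above (not a real xarray API call).
opts = PandasIndexQueryOptions(method="nearest", tolerance=0.5)
print([f.name for f in fields(PandasIndexQueryOptions)])  # ['method', 'tolerance']

try:
    PandasIndexQueryOptions(method=2)  # not a string
except TypeError as err:
    print(err)  # ``method`` must be a string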
https://github.com/pydata/xarray/pull/5692
- Index.query? https://github.com/pydata/xarray/issues/5697
  - (Benoît): probably better to let indexes take care of normalization.
- … (Variable or DataArray)
- A lot of flexibility but also a lot of responsibility for xarray Index.
- If repetitive patterns emerge in custom index implementations, we could eventually implement some convenience layer via Query and QueryResult classes used as an interface to Index.query.
Explicit indexes, the big PR: https://github.com/pydata/xarray/pull/5692
- PandasIndex and PandasMultiIndex classes
- This refactoring has lots of impacts; fixing the broken tests will probably require updating and/or refactoring things like:
  - rename vars / dims
  - set_index / reset_index
  - stack / unstack
  - sel

At this point, I'm wondering whether we should deprecate a couple of things now or later:
Currently, level coordinates are implicitly created from the multi-index. Suggested behavior: treat the index as an array-like (single coordinate) by default and encourage passing it more explicitly, e.g., something like
midx = pd.MultiIndex.from_arrays([[...], [...]], names=("foo", "bar"))
coords, index = xr.PandasMultiIndex.from_pandas_index(midx, dim="x")
ds = xr.Dataset(coords=coords, indexes={("foo", "bar"): index})
Maybe OK?
foo = np.array(...)
bar = np.array(...)
midx = pd.MultiIndex.from_arrays([foo, bar], names=("foo", "bar"))
ds = xr.Dataset(coords={"foo": foo, "bar": bar}, indexes={"foo": midx, "bar": midx})
Existing code (not ideal):
ds = xr.Dataset()
ds.coords['foo_bar'] = pd.MultiIndex.from_arrays(
    [[...], [...]],
    names=("foo", "bar"),
)
# currently: MultiIndex "foo_bar" with levels "foo" and "bar"
# very soon: issue FutureWarning. Both array of tuples
#   "foo_bar" and arrays "foo" and "bar".
# in a few releases: raise an error (we don't want the
#   array "foo_bar")
# much later: array of tuples, not a MultiIndex.
- IndexVariable.level_names and IndexVariable.get_level_variable
- Maybe #5732 too?
https://discourse.pangeo.io/t/handling-slicing-with-circular-longitude-coordinates-in-xarray/1608
Possible "explosion" of custom indexes? Very complex "meta-indexes"? Index overlapping goals & features. How best to avoid the development of a messy ecosystem built on top of xarray.Index?
An example of geo-aware indexes:
Alternatives? Some sort of protocol? Data model allowing coordinates with multiple indexes?
Option A
Every operation that returns a new xarray.Index (e.g., set a new index from variables, selection, join) may (or should?) also return new IndexVariable object(s). If no variable is returned, the input variables or the original index variables will be reused (maybe converted into IndexVariable objects).
from typing import Any, Dict, Hashable, Optional, Tuple

IndexVars = Dict[Hashable, "IndexVariable"]
IndexAndVars = Tuple["Index", Optional[IndexVars]]


class Index:
    @classmethod
    def from_variables(
        cls, variables: Dict[Hashable, "Variable"], **kwargs
    ) -> IndexAndVars:
        ...

    def query(
        self, labels: Dict[Hashable, Any]
    ) -> Tuple[Any, Optional[IndexAndVars]]:
        ...

    def union(self, other) -> IndexAndVars:
        ...

    def intersection(self, other) -> IndexAndVars:
        ...

    def rename(self, var_names):  # return type: ??? (open question)
        ...

    def copy(self, deep: bool = True):  # return type: ??? (open question)
        ...

    def __getitem__(self, indexer: Any):  # return type: ??? (open question)
        ...
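A hypothetical illustration of Option A (the class below and its behavior are made up; only from_variables and the IndexAndVars return shape come from the sketch above):

# Hypothetical illustration of Option A: from_variables returns both the new
# index and the coordinate variable(s) it manages.
import pandas as pd
import xarray as xr


class ToyIndex(Index):
    """Toy index wrapping a pandas.Index for a single 1-d coordinate."""

    def __init__(self, index: pd.Index, dim: Hashable):
        self.pd_index = index
        self.dim = dim

    @classmethod
    def from_variables(
        cls, variables: Dict[Hashable, "Variable"], **kwargs
    ) -> IndexAndVars:
        # expects a single 1-d variable; wrap its values in a pandas.Index
        ((name, var),) = variables.items()
        index = cls(pd.Index(var.values, name=name), var.dims[0])
        index_vars = {name: xr.IndexVariable(var.dims, index.pd_index)}
        return index, index_vars

For example, ToyIndex.from_variables({"x": xr.Variable("x", [10, 20, 30])}) would return the new index together with a ready-to-use IndexVariable for coordinate "x", matching the IndexAndVars tuple above.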
What about copy and __getitem__? It seems unconventional to return anything other than a new Index object. But if we copy indexes and variables independently, we could unsync them, i.e.,

# Is it an issue?
new_xr_var._data is new_xr_index.index  # might be False
new_xr_var._data == new_xr_index.index  # still true
What about rename? If we don't rename the underlying index in-place, e.g., …

Option B
Alternatively, use an xarray.Index.coords property? But the API is more confusing, the separation of concerns is less clear, how would we check that new variables are returned, etc.
https://github.com/pydata/xarray/pull/5636
https://github.com/pydata/xarray/issues/5553
xarray.Index <-> a set of coordinates
- Can an xarray.Index instance be shared among different xarray DataArray / Dataset objects?
- xarray.Index internally needs information like the coordinate name(s), dimension(s), shape(s), etc. (e.g., xoak use case: back and forth transformation of n-d coordinate arrays from/to a 2-d array of shape (npoints, ncoordinates))
- Should xarray.Index be tightly coupled to a DataArray / Dataset?
- No need to wrap an xarray.Index into an xarray.IndexVariable anymore
- xarray.Index's .from_variables and .to_variables API?
- … (set_index(..., append=True))? Using a .from_variables class method may not be the best option… Stick with __init__?

Option A: from_variables / to_variables API
class Index:
    """Base class inherited by all xarray-compatible indexes."""

    # what would be self.coords?
    # where to set self.coords? `__init__` or `from_variables`?
    __slots__ = ("coords",)

    @classmethod
    def from_variables(cls, variables: Dict[Hashable, "Variable"], **kwargs):
        raise NotImplementedError()

    def to_variables(self) -> Dict[Hashable, "Variable"]:
        raise NotImplementedError()
Option B: constructor signature + coords attribute

class Index:
    """Base class inherited by all xarray-compatible indexes."""

    __slots__ = ("coords",)

    def __init__(
        self,
        coords: Dict[Hashable, "Variable"],  # could be more flexible?
        index: Optional[Any] = None,
        **kwargs
    ):
        # it may be more complex than this, e.g., create the underlying index
        # then create new xarray variables from LazilyIndexedArray objects
        self.coords = coords
set_index API? See notes
- assign_indexes that looks like assign_coords (accepts a mapping)
- set_xindex? (like .indexes vs .xindexes properties)
- .xindexes[("x", "y")] = CRSIndex(...)
- Dataset / DataArray constructors: coords?

Current status: all index-based operations go through xarray.Index
- pandas.Index -> xarray.PandasIndex and pandas.MultiIndex -> xarray.PandasMultiIndex
- xarray.Index's equal, union and intersection methods for alignment
- xarray.Index.query method for label-based selection
- xarray.Index.to_pandas_index() for all other operations in Xarray that still rely exclusively on pandas.Index
Feedback on https://github.com/pydata/xarray/pull/5322?
Next step: update the data model (index <1:many> coordinates and/or dimensions)
- PandasIndex and PandasMultiIndex for now
- IndexVariable for now (pandas.Index)
- set_index?
- equal, union, intersection, query?
- Dataset.xindexes (or Dataset.indexes) to return one index per dimension
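For reference, a minimal snippet showing the current behavior being discussed: Dataset.indexes maps each indexed dimension to a pandas.Index (one index per dimension).

# Minimal check of the current (pre-refactor) behavior: one pandas.Index
# per indexed dimension.
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={"x": np.arange(3), "y": list("abc")})
print(dict(ds.indexes))  # {'x': <pandas.Index>, 'y': <pandas.Index>}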