# Draft design document for dim support in pytensor

## What do we mean by dim support?

- PyMC has had support for named dimensions of random variables for some time.
- This information is currently not represented in the pytensor backend, so we cannot take advantage of it to check shapes or to simplify indexing operations.
- The dimensions should be first-class objects. It does not make sense to add a vector with dim `country` to a vector with dim `experiment`, even if both happen to have the same length.
- Broadcasting with named dimensions works differently than broadcasting with shapes: with shapes, only the order of the axes determines how arrays broadcast, while with dims it is the identity of the dimensions that matters.

I think we probably want to achieve an API that looks something like this:

```python
country = pt.Dim("country")
age = pt.Dim("age")

country_effect = pt.named.dvector(dims=country)
age_effect = pt.named.dvector(dims=age)

effect = country_effect + age_effect  # dims=(country, age)

# Go from the new api to the old
effect_old_api = effect.to_shaped([country, age])

effect.ravel()  # dims=(ProductDim((country, age)),)
effect.rename_dim(country) + effect  # dims=(RenameDim(country), country, age)

# Go from the old api to the new
second_dim = pt.Dim()  # Should names be required for a dim?
effects_imported = pt.named.from_shaped(effect_old_api, dims=[country, second_dim])
effects_imported.sum(second_dim)

effect.sel(country=["USA", "Italy"])  # dims=(SelDim(country, ["USA", "Italy"]), age)
```

## Naming?

Do we want to call this "named tensor", "dim tensor", or something else? I'm kind of used to "dim" right now, but I actually think "named-" sounds better?

## How do we represent a dimension?

```python
from dataclasses import dataclass
from typing import Optional, Type

import xarray as xr

import pytensor.scalar as ps


@dataclass(frozen=True, eq=False)
class Dim:
    name: str
    size: ps.ScalarVariable
    size_hint: int
    index_type: Optional[Type[xr.IndexVariable]]
    unique_index: bool

    def __init__(self, name, *, size_hint=10, size=None, index_type=None, unique_index=False):
        # The dataclass is frozen, so we have to assign through object.__setattr__
        object.__setattr__(self, "name", name)
        if isinstance(size, int):
            size_hint = size
        object.__setattr__(self, "size_hint", size_hint)
        if size is None:
            # A free scalar variable that stands in for the unknown length
            size = ps.ScalarType("uint64")(name=name)
        object.__setattr__(self, "size", ps.cast(ps.as_scalar(size), "uint64"))
        object.__setattr__(self, "index_type", index_type)
        object.__setattr__(self, "unique_index", unique_index)

    # Dims compare by identity: two dims with the same name and size
    # are still different dimensions.
    def __eq__(self, other):
        return self is other

    def __hash__(self):
        return id(self)
```
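To make the identity semantics concrete, here is a small usage sketch of the draft `Dim` class above (hypothetical, since none of this exists in pytensor yet): two dims constructed with the same name are still distinct dimensions, so tensors indexed by them would not broadcast together.

```python
# Hypothetical usage of the draft Dim class above.
country = Dim("country", size=20)
country_copy = Dim("country", size=20)

assert country == country        # equality is identity
assert country != country_copy   # same name and size, still a different dim
assert country.size_hint == 20   # an integer size also sets the size hint
```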
For many Ops we will also need derived dimensions, such as:

```python
from dataclasses import dataclass
from typing import Any, Tuple

from pytensor.graph.basic import Variable

# (The dataclasses aren't quite accurate yet I think. Maybe we
# want an abstract dim class, and the other dims inherit from
# that instead of from the Dim above?)


# The resulting dimension of a `ravel` or some reshape operations.
@dataclass(frozen=True)
class ProductDim(Dim):
    items: Tuple[Dim, ...]


# The dimension that represents a slice of some other dimension,
# where the slice refers to indices (i.e. xarray isel with a slice).
@dataclass(frozen=True)
class SliceDim(Dim):
    base: Dim
    slice: slice


# The dimension that represents a slice of some other dimension,
# based on labels (i.e. xarray sel with a slice).
@dataclass(frozen=True)
class LabelSliceDim(Dim):
    base: Dim
    slice: slice


# The result of concatenating different dimensions.
@dataclass(frozen=True)
class ConcatDim(Dim):
    items: Tuple[Dim, ...]


# The dimension produced by index-based pointwise selection
# (xarray isel with a list of indices).
@dataclass(frozen=True)
class ISelDim(Dim):
    base: Dim
    isel: Tuple[int, ...]


# The dimension produced by label-based pointwise selection
# (xarray sel with a list of labels).
@dataclass(frozen=True)
class SelDim(Dim):
    base: Dim
    sel: Tuple[Any, ...]


# Basically the same dimension, but with a different identity.
# We need this so that we can work with objects like covariance
# matrices that contain the same dimension more than once, which
# I think we should not allow directly.
@dataclass(frozen=True)
class RenameDim(Dim):
    orig: Dim


# Maybe we need something like this? A dim that doesn't just
# depend on other dims but directly on the shape of some
# non-named tensor? I think something along those lines is
# needed to represent things like `normal(size=poisson())`.
@dataclass(frozen=True)
class DynamicDim(Dim):
    size_of: Variable
```

## What other new types do we need?

- `Variable`: I think we need a new type `NamedTensorVariable`. I don't think this can reasonably be a subtype of our current `TensorVariable`, because values of those types behave differently (for instance when multiplied). But maybe we could refactor a bit so that there is a common super type (called `TensorVariable`) and two inheriting classes `NamedTensorVariable` and `UnnamedTensorVariable`, where the second is the current `TensorVariable`.
- `TensorType`: Does the same hold for `TensorType`? I'm actually not sure, maybe we could get away with just adding a `dims` attribute that can be `None` for `UnnamedTensorVariable`s?

I guess most Ops can be generic and accept either (all?) `NamedTensorVariable`s or `UnnamedTensorVariable`s, but this probably involves changing all `make_node` methods. There would probably also need to be a couple of new functions in a namespace like `pytensor.named_tensor` that only make sense for named tensors, and conversely some functions in `pytensor.tensor` might want to check that the input is an `UnnamedTensorVariable`.

## What do we need to change about Ops?

We need some additional information from each Op that is currently unavailable. If we compute a matrix-vector product using a `Dot` op, for instance, we need to know which dimensions the output will have. Or for a `reshape` op, we need to know which input dimensions end up in which `ProductDim` of the output. I think the correct place for this information is the `make_node` method, or some similar mechanism that replaces it:

- We have previously discussed changing this to a signature object and using minikanren to figure out type parameters of the Op.
- Alternatively, we could simply incorporate it into the existing `make_node` functions: they could check whether the input variables are named or unnamed, raise an error if this is inconsistent, and choose output variables with the correct dimensions in their types.
- We could also come up with a dispatch mechanism similar to the backend implementations for the current `make_node` method, which also dispatches on the type of the input variables.

In general, I think these changes overlap a lot with shape inference and with Op signatures using minikanren.

## Rewrites

I haven't really thought about the consequences this might have for rewrites. Would some of them have to be changed so that they can work with `NamedTensorVariable`s? Maybe we could also just start with a rewrite that replaces `NamedTensorVariable`s with corresponding `UnnamedTensorVariable`s and keep all later rewrites exactly as they are? But could dimension information maybe sometimes be useful for rewrites?
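As a starting point for that last idea, here is a rough sketch of what such a lowering rewrite could look like. Everything in it is hypothetical: `NamedTensorVariable`, `to_shaped`, and `from_shaped` are the draft names from this document and do not exist in pytensor, and the registration machinery is omitted.

```python
# Hypothetical sketch of a lowering rewrite, using the draft names from
# this document (NamedTensorVariable, to_shaped, from_shaped).
def lower_named_to_unnamed(fgraph, node):
    """Replace a node on named tensors by the same op on unnamed tensors."""
    if not any(isinstance(out, NamedTensorVariable) for out in node.outputs):
        return None
    # Strip the dims from every input, so the op is applied to
    # plain (unnamed) tensors...
    unnamed_inputs = [inp.to_shaped(inp.type.dims) for inp in node.inputs]
    unnamed_outputs = node.op.make_node(*unnamed_inputs).outputs
    # ...and re-attach the original dims to the outputs, so the
    # replacement is type-consistent with the rest of the graph.
    return [
        from_shaped(new, dims=old.type.dims)
        for new, old in zip(unnamed_outputs, node.outputs)
    ]
```

If all named variables were lowered like this early in the rewrite pipeline, the existing rewrites could stay untouched, at the cost of discarding the dimension information they might otherwise exploit.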