# Draft design document for dim support in pytensor
## What do we mean by dim support?
- PyMC has had support for named dimensions on random variables for some time.
- This information is currently not represented in the pytensor backend, so we
cannot take advantage of it for shape checking or for indexing operations.
- The dimensions should be first-class objects. It does not make sense to add a
vector with dim `country` to a vector with dim `experiment`, even if both
happen to have the same length.
- Broadcasting with named dimensions differs from broadcasting with shapes:
with shapes, only the order of the axes determines how values are broadcast,
while with dims it is the identity of the dimensions that matters.
I think we probably want to achieve an API that looks something like this:
```python
country = pt.Dim("country")
age = pt.Dim("age")
country_effect = pt.named.dvector(dims=country)
age_effect = pt.named.dvector(dims=age)
effect = country_effect + age_effect  # dims=(country, age)

# Go from the new API to the old
effect_old_api = effect.to_shaped([country, age])

effect.ravel()  # dims=(ProductDim((country, age)),)
effect.rename_dim(country) + effect  # dims=(RenameDim(country), country, age)

# Go from the old API to the new
second_dim = pt.Dim()  # Should names be required for a dim?
effects_imported = pt.named.from_shaped(effect_old_api, dims=[country, second_dim])
effects_imported.sum(second_dim)

effect.sel(country=["USA", "Italy"])  # dims=(SelDim(country, ["USA", "Italy"]), age)
```
## Naming?
Do we want to call this "named tensor", "dim tensor", or something else?
I'm kind of used to "dim" right now, but I actually think "named-" sounds
better.
## How do we represent a dimension?
```python
from dataclasses import dataclass
from typing import Optional, Type

import pytensor.scalar
import xarray as xr


@dataclass(frozen=True, eq=False)
class Dim:
    name: str
    size: pytensor.scalar.ScalarVariable  # a uint64 scalar in the graph
    size_hint: int
    index_type: Optional[Type[xr.IndexVariable]]
    unique_index: bool

    def __init__(self, name, *, size_hint=10, size=None, index_type=None, unique_index=False):
        # The dataclass is frozen, so assignments go through object.__setattr__
        object.__setattr__(self, "name", name)
        if isinstance(size, int):
            size_hint = size
        object.__setattr__(self, "size_hint", size_hint)
        if size is None:
            # A fresh symbolic scalar standing for the unknown length of this dim
            size = pytensor.scalar.ScalarType("uint64")(name)
        object.__setattr__(self, "size", pytensor.scalar.as_scalar(size))
        object.__setattr__(self, "index_type", index_type)
        object.__setattr__(self, "unique_index", unique_index)

    def __eq__(self, other):
        # Dims compare by identity: equal names do not make two dims the same dim
        return self is other

    def __hash__(self):
        return id(self)
```
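As a quick illustration of the identity semantics (hypothetical usage of the sketch above):
```python
country = Dim("country", size=10)
also_country = Dim("country", size=10)

assert country == country                  # every dim equals itself
assert country != also_country             # same name and size, but a distinct identity
assert len({country, also_country}) == 2   # dims stay hashable, e.g. as dict keys
assert also_country.size_hint == 10        # an integer `size` doubles as the size hint
```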
For many Ops we will also need derived dimensions, such as:
```python
from dataclasses import dataclass
from typing import Any, Tuple

from pytensor.graph.basic import Variable

# (The dataclasses aren't quite accurate yet I think. Maybe we
# want an abstract dim class, and the other dims inherit from that
# instead of from the Dim above?)

# The resulting dimension of a `ravel` or some reshape operations.
@dataclass(frozen=True)
class ProductDim(Dim):
    items: Tuple[Dim, ...]

# The dimension that represents a slice of some other dimension,
# where the slice refers to indices (ie xarray's isel with a slice).
@dataclass(frozen=True)
class SliceDim(Dim):
    base: Dim
    slice: slice

# The dimension that represents a slice of some other dimension,
# based on labels (ie xarray's sel with a slice).
@dataclass(frozen=True)
class SelSliceDim(Dim):
    base: Dim
    slice: slice

# The result of concatenating different dimensions.
@dataclass(frozen=True)
class ConcatDim(Dim):
    items: Tuple[Dim, ...]

# Selection by explicit integer indices (ie xarray's isel with a list).
@dataclass(frozen=True)
class ISelDim(Dim):
    base: Dim
    isel: Tuple[int, ...]

# Selection by explicit labels (ie xarray's sel with a list).
@dataclass(frozen=True)
class SelDim(Dim):
    base: Dim
    sel: Tuple[Any, ...]

# Basically the same dimension, but with a different identity. We need
# this so that we can work with objects like covariance matrices
# that contain the same dimension more than once, which I think
# we should not allow directly.
@dataclass(frozen=True)
class RenameDim(Dim):
    orig: Dim

# Maybe we need something like this? A dim that doesn't
# just depend on other dims but directly on the shape of some
# non-named tensor? I think something along those lines
# is needed to represent things like `normal(size=poisson())`.
@dataclass(frozen=True)
class DynamicDim(Dim):
    size_of: Variable
```
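Code that consumes dims will often need to recover the plain dims a derived dim was built from. A small helper along these lines might be needed (a sketch against the classes above, not existing API):
```python
def base_dims(dim: Dim) -> set:
    """Collect the plain (non-derived) dims that `dim` ultimately depends on."""
    if isinstance(dim, (ProductDim, ConcatDim)):
        return set().union(*(base_dims(item) for item in dim.items))
    if isinstance(dim, (SliceDim, SelSliceDim, ISelDim, SelDim)):
        return base_dims(dim.base)
    if isinstance(dim, RenameDim):
        # A rename has a new identity, but its length comes from the original.
        return base_dims(dim.orig)
    if isinstance(dim, DynamicDim):
        # Depends on the runtime shape of a tensor, not on other dims.
        return set()
    return {dim}
```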
## What other new types do we need?
- `Variables`: I think we need a new type `NamedTensorVariable`. I don't think
this can reasonably be a subtype of our current `TensorVariable`, because
values of those types behave differently (for instance, they behave
differently when multiplied). But maybe we could refactor a bit so that there
is a common supertype (called `TensorVariable`) and two inheriting classes,
`NamedTensorVariable` and `UnnamedTensorVariable`, where the second is the
current `TensorVariable` (see the sketch after this list).
- `TensorType`: Does the same hold for `TensorType`s? I'm actually not sure;
maybe we could get away with just adding a `dims` attribute that is `None`
for `UnnamedTensorVariable`s?
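A minimal sketch of that refactoring, using the names above (none of these classes exist in pytensor today; the `dims` attribute on the type is the one discussed in the second bullet):
```python
from pytensor.graph.basic import Variable


class TensorVariable(Variable):
    """The common supertype: everything shared between named and unnamed tensors."""


class UnnamedTensorVariable(TensorVariable):
    """What is currently called `TensorVariable`: broadcasting by axis position."""


class NamedTensorVariable(TensorVariable):
    """A tensor whose type carries `dims`: broadcasting by dim identity."""

    @property
    def dims(self):
        return self.type.dims
```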
I guess most (all?) Ops could be generic and accept either
`NamedTensorVariable`s or `UnnamedTensorVariable`s. But this probably involves
changing all `make_node` methods?
There would probably also need to be a couple of new functions in a namespace
like `pytensor.named_tensor` that only make sense for named tensors, and
conversely some functions in `pytensor.tensor` might want to check that their
input is an `UnnamedTensorVariable`.
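For the latter, the check could be as simple as this hypothetical guard (the helper name and error text are made up):
```python
def check_unnamed(x):
    """Reject named tensors in APIs that only define axis-position broadcasting."""
    if isinstance(x, NamedTensorVariable):
        raise TypeError(
            f"{x} carries named dims; use the pytensor.named_tensor API, "
            "or convert explicitly, e.g. with to_shaped(...)."
        )
    return x
```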
## What do we need to change about Ops?
We need some additional information from each Op that is currently
unavailable.
If we compute a matrix-vector product using a `Dot` Op, for instance, we need
to know which dimensions the output will have. Or for a `Reshape` Op, we need
to know which dimensions of the output correspond to which `ProductDim`.
I think the correct place for this information is the `make_node` method, or
some similar mechanism that replaces it:
- We have previously discussed changing this to a signature object, and using
minikanren to figure out type parameters of the Op.
- Alternatively we could simply incorporate it into the existing `make_node`
functions: they could check whether the input variables are named or
unnamed, raise an error if this is inconsistent, and choose output variables
with the correct dimensions in their types (as sketched after this list).
- We could also come up with a dispatch mechanism for `make_node`, similar to
the backend implementations, that dispatches on the types of the input
variables.
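To illustrate the second option, here is roughly what the named path of a `make_node` for an elementwise Op could look like. This is a sketch only; `NamedTensorVariable`, `NamedTensorType`, and `_make_node_unnamed` are assumed names from this document, not existing pytensor API:
```python
from pytensor.graph.basic import Apply
from pytensor.scalar import upcast


def make_node(self, *inputs):
    named = [isinstance(x, NamedTensorVariable) for x in inputs]
    if any(named) and not all(named):
        raise TypeError("Cannot mix named and unnamed tensor inputs")
    if not any(named):
        # Unnamed inputs: fall back to today's code path.
        return self._make_node_unnamed(*inputs)
    # Named inputs: the output dims are the union of the input dims.
    # Identity (not axis order or length) decides which dims line up.
    out_dims = []
    for x in inputs:
        for dim in x.type.dims:
            if dim not in out_dims:
                out_dims.append(dim)
    dtype = upcast(*(x.type.dtype for x in inputs))  # dtype promotion as today
    out_type = NamedTensorType(dtype=dtype, dims=tuple(out_dims))
    return Apply(self, list(inputs), [out_type()])
```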
In general, I think there is a lot of overlap between these changes, shape
inference, and Op signatures using minikanren.
## Rewrites
I haven't really thought about the consequences this might have for rewrites.
Would some of them have to be changed so that they can work with
`NamedTensorVariable`s?
Maybe we could also just start with a rewrite that replaces
`NamedTensorVariable`s with corresponding `UnnamedTensorVariable`s and keep all
later rewrites exactly as they are (see the sketch below)? But could dimension
information sometimes be useful for rewrites?
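To make the first idea concrete, such a pass could rebuild the graph bottom-up, swapping each named tensor for an unnamed one; how an Op's dim semantics map to axis positions would have to come from the Op itself. In the sketch below, `unnamed_input_like` and `Op.lower_named` are assumed helpers that do not exist:
```python
def strip_dims(var, memo=None):
    """Rebuild `var`'s graph with unnamed tensors (sketch only)."""
    memo = {} if memo is None else memo
    if not isinstance(var, NamedTensorVariable):
        return var
    if var in memo:
        return memo[var]
    if var.owner is None:
        # A root input: create an unnamed input with one axis per dim.
        result = unnamed_input_like(var)
    else:
        # Each Op decides how its dim-based semantics translate into
        # axis positions once its inputs are unnamed.
        inputs = [strip_dims(inp, memo) for inp in var.owner.inputs]
        result = var.owner.op.lower_named(inputs, var)
    memo[var] = result
    return result
```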