---
tags: NumPy, dtype
---

# Another Dtype document

Inspired by https://hackmd.io/ok21UoAQQmOtSVk6keaJhw?view

Some terminology notes:

* _"`dtype`"_ refers to the `PyTypeObject *` with value `PyArrayDescr_Type`
* _"A `dtype`"_ refers to a `PyArray_Descr *` with either a static or heap-allocated value, and is an instance of _"`dtype`"_
* _"A numpy scalar type"_ refers to a `PyTypeObject *` with value `Py@NAME@ArrType_Type`, that is a subclass of `PyGenericArrType_Type`
* _"A numpy scalar"_ refers to a `Py@NAME@ScalarObject *`, which is an instance of _"A numpy scalar type"_

Two orthogonal goals:

## Introducing subclasses of `dtype` to aid with representation of parameterized types (builtin and third-party)

In some ways, `np.dtype` is becoming a union of all possible data types. For example, it contains the fields (shown as `.python`, `->c`):

* `.fields` (`->fields`), `.names` (`->names`), `.isalignedstruct` - only applicable to structured arrays
* `.subdtype` (`->subarray->base`), `.shape` (`->subarray->shape`) - only applicable to subarrays
* `->c_metadata` - only applicable to datetime types, and perhaps third party types
* `.metadata` - not used internally, but accessible to python

Each of these assumes some default value in the cases where it does not apply. This approach does not scale well.

Many third party types may want to store their own metadata alongside a dtype (and make it accessible from C), such as a quantity type wanting to store a `PyObject *unit`. Our `c_metadata` slot allows them to do this, but it makes them second-class citizens. It would be better to allow custom dtypes to freely add their own information at the end of this struct by subclassing.

To dogfood this approach ourselves, we could try to extract structured, subarray, and datetime dtypes into their own dtype subclasses. Expressed in python, this would look like:

```python
import warnings
from typing import Tuple

import numpy as np


class dtype(object):
    # all of the existing slots
    kind: str
    ...

    # deprecate access to properties that are part of subclasses, continue to
    # return the defaults
    @property
    def fields(self):
        warnings.warn(
            "Use `isinstance(dt, structured_dtype)` to detect structured types",
            DeprecationWarning,
        )
        return None
    # repeat for names, subdtype, etc

    def __new__(cls, *args, **kw):
        # as before, but:
        # `np.dtype([('a', int)])` defers to `structured_dtype([('a', int)])`
        # `np.dtype((int, 3))` defers to `subarray_dtype((int, 3))`
        # `np.dtype('m')` defers to `timelike64_dtype(...)`
        return super().__new__(cls)


class structured_dtype(dtype):
    fields: dict
    names: Tuple[str, ...]
    # etc

    def __new__(cls, *args, **kwargs):
        # just the structured dtype construction part of the current `np.dtype(...)`
        self = object.__new__(cls)
        self.type = np.void
        return self


class subarray_dtype(dtype):
    base: dtype
    shape: Tuple[int, ...]
    # etc

    def __new__(cls, *args, **kwargs):
        # just the subarray dtype construction part of the current `np.dtype(...)`
        self = object.__new__(cls)
        self.type = np.void
        return self


class timelike64_dtype(dtype):
    unit: str
```

This sets out primarily to help users (and numpy itself) write _parameterizable_ dtypes, including:

* Quantity types (`unit: Unit`)
* Categorical types (`values: Tuple[Any]`)
* Text types (`encoding: str`)

Open questions:

1. Should this approach be used even for unparameterized dtypes, like:
    * `quaternion`
      > [mattip] Maybe mention this holds 4 doubles and overrides all the ufunc methods?
    * `int256`

    In favor:
    * Extends nicely to types that gain parameters later

    Against:
    * Makes user types (each a different subclass) asymmetric from builtin types (all the same class)
    * Muddies the water over dtype instances vs instances of dtype subclasses
2. Do we need to maintain C ABI compatibility?
3. Do we need to maintain C API compatibility?
    * Best achieved via NumPy 1.7-style API hiding, by making `PyArray_Descr` an opaque struct, and requiring users to use `PyArrayDescr_KIND(desc)` instead of `desc->kind`. After a couple of releases of this, we could remove the old API, and start to change the underlying struct layouts.
4. Do we need to maintain python API compatibility?
    * Easy to address as shown above, by introducing deprecated `@property` wrappers in the base class.

Counter-argument:

* What makes a `dtype` different to a `type` (comparing dtype methods to classmethods, dtype attributes to class attributes)? `type` has few subclasses, instead opting for customization via `tp_*` slots. Why can't we do the same with dtype?

## Making `dtype` a metaclass, so that `np.dtype('i')` is itself a type

This would:

* Allow the merging of `np.int64`, a scalar type, and `np.dtype(np.int64)`, its dtype.
* Allow `dtype.type` to be deprecated, and just return `self`
* Make it easier to create an instance from a dtype, using `my_dtype(0)` instead of `my_dtype.type(0)`. Although it opens

TODO: more justification needed here.

## How these goals interact

TODO
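
As a starting point for this discussion, here is one possible reading of how the two goals could combine, sketched in plain Python rather than against NumPy's actual C implementation. It assumes that `dtype` becomes a metaclass (second goal) and that a parameterized dtype is a subclass of that metaclass (first goal). The names `quantity_dtype`, `float64` and `meters`, and the keyword arguments `kind`, `itemsize` and `unit`, are hypothetical and chosen only for illustration.

```python
class dtype(type):
    """Toy metaclass standing in for the proposed `np.dtype` (not NumPy's real dtype)."""

    def __new__(mcls, name, bases, namespace, *, kind, itemsize):
        cls = super().__new__(mcls, name, bases, namespace)
        # In this toy version the dtype-level data is stored on the class itself.
        cls.kind = kind
        cls.itemsize = itemsize
        return cls

    def __init__(cls, name, bases, namespace, **params):
        # Swallow the dtype keywords so that `type.__init__` never sees them.
        super().__init__(name, bases, namespace)

    @property
    def type(cls):
        # Under the proposal, `dtype.type` would simply return `self`.
        return cls


class quantity_dtype(dtype):
    """Hypothetical parameterized dtype subclass that carries a `unit`."""

    def __new__(mcls, name, bases, namespace, *, unit, **params):
        cls = super().__new__(mcls, name, bases, namespace, **params)
        cls.unit = unit
        return cls


# Each dtype is a class, so it doubles as its own scalar type.
class float64(float, metaclass=dtype, kind="f", itemsize=8):
    pass


class meters(float, metaclass=quantity_dtype, kind="f", itemsize=8, unit="m"):
    pass


x = float64(0)       # `my_dtype(0)` instead of `my_dtype.type(0)`
d = meters(3.0)      # a scalar whose type is a parameterized dtype
```

In this sketch `isinstance(meters, quantity_dtype)` plays the role that `isinstance(dt, structured_dtype)` plays in the first section, while `meters(3.0)` creates a scalar directly from the dtype. Whether the real design should store parameters on the class like this is left open here.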