---
tags: Dtype, NumPy
---
# Dtype Thoughts
## Goals:
### Clarify the meaning of `np.dtype`
It is [documented](http://www.numpy.org/devdocs/reference/generated/numpy.dtype.html?highlight=dtype#numpy.dtype) as "Create a data type object", but it is a class. When I call `int(100)`, it creates an instance of the `int` class with the value 100. However, `np.dtype('int8')` returns an instance of the `np.dtype` class whose methods and attributes are totally different from those of `np.dtype([('x', 'int'), ('y', int)])`. So what is `np.dtype`? A class factory? A class `__new__` method?
- It currently calls `PyArrayDescr_Type.tp_new`, which is `arraydescr_new`, a class `__new__`. But the instance it returns depends on the arguments: it may return a `uint8` or a `void`.
- The instances all have separate methods and attributes, so they should be considered instances of separate subclasses.
- The `PyArray_Descr` type is more of a container than a type with methods. For instance, casting relies on a global table (tree? list of lists?) to resolve casting rules rather than calling a method on the dtype instance. The same goes for result-type resolution for ufuncs.
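
As a minimal sketch of the points above, the following compares the two dtypes mentioned earlier and shows that casting and result-type questions are answered by module-level functions rather than by methods on the instance. It only uses public NumPy calls (`np.can_cast`, `np.result_type`); the variable names are illustrative and not part of any proposal.

```python
import numpy as np

scalar = np.dtype('int8')
struct = np.dtype([('x', 'int'), ('y', int)])

# Both are np.dtype instances, yet they expose very different state.
print(isinstance(scalar, np.dtype), isinstance(struct, np.dtype))  # True True
print(scalar.kind, scalar.names)  # i None
print(struct.kind, struct.names)  # V ('x', 'y')

# Casting and promotion are resolved by free functions that take the
# dtypes as arguments, not by methods on the dtype instances.
can = np.can_cast(scalar, np.dtype('int16'))
promoted = np.result_type(scalar, np.dtype('float32'))
print(can, promoted)  # True float32
```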
### Propose a mechanism to subclass existing dtypes
- Refactor along the lines of NEP proposals in [PR #12660](https://github.com/numpy/numpy/pull/12660) or [PR #12636](https://github.com/numpy/numpy/pull/12636) to allow subclassing.
This is actually easier than thinking about what it all means, and has been implemented in [PR #12585](https://github.com/numpy/numpy/pull/12585).
- Requires rethinking `PyArray_Descr` to be more like a `tp_as_number` that is inherited in subclasses (filled in from the class hierarchy when calling `PyType_Ready`).
- Requires rethinking the ufunc lookup rules. For instance, what would happen if we create an int-overflow class that checks int operations for overflow? How would we then look up the ufunc to use? Rather than having a global table, we should have a more Pythonic protocol.
- Propose a mechanism to create dtypes in Python; this would need to do something like `np.frompyfunc`.
You cannot override `tp_as_number` slots on instances, so we would do the same: once you call `PyType_Ready` on your class, the slots are set.
```
class A(int):
    def __add__(self, other):
        return 'in A.__add__'

def adder(self, other):
    return 'in adder'

a = A(10)
print(a + 20)  # prints "in A.__add__"
a.__add__ = adder  # sets an instance attribute; the type's slot is unchanged
print(a + 20)  # still prints "in A.__add__"
```
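
To make the "more Pythonic protocol" idea concrete, here is a pure-Python toy, not NumPy code. Everything in it is hypothetical: the `DType`, `Int8` and `CheckedInt8` classes, the `get_loop` method and the `lookup` function are invented for illustration. It assumes a design where each dtype class answers the lookup itself, and where a subclass is asked first, as Python does for reflected operators.

```python
class DType:
    """Hypothetical base class; not NumPy's PyArray_Descr."""

    def get_loop(self, op, other):
        # Return a callable implementing `op`, or NotImplemented.
        return NotImplemented


class Int8(DType):
    lo, hi = -128, 127

    def get_loop(self, op, other):
        if op == "add" and isinstance(other, Int8):
            return self.add
        return NotImplemented

    def add(self, a, b):
        # Wrap around silently, like two's-complement int8 arithmetic.
        return (a + b - self.lo) % 256 + self.lo


class CheckedInt8(Int8):
    def add(self, a, b):
        # Same operation, but refuse to overflow.
        result = a + b
        if not self.lo <= result <= self.hi:
            raise OverflowError(f"{a} + {b} does not fit in int8")
        return result


def lookup(op, left, right):
    """Ask the operand dtypes for a loop instead of a global table."""
    order = [(left, right), (right, left)]
    # Like Python's reflected operators, a subclass gets the first say.
    if type(right) is not type(left) and isinstance(right, type(left)):
        order.reverse()
    for dtype, other in order:
        loop = dtype.get_loop(op, other)
        if loop is not NotImplemented:
            return loop
    raise TypeError(f"no {op!r} loop for these dtypes")


print(lookup("add", Int8(), Int8())(100, 100))  # -56
try:
    lookup("add", Int8(), CheckedInt8())(100, 100)
except OverflowError as exc:
    print("overflow:", exc)
```

The subclass only overrides `add`; the lookup logic is inherited, which is the property a slot table filled in from the class hierarchy would give.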
## Discussion on specifying dtypes at array creation
- Specifying dtypes. [This PR](https://github.com/numpy/numpy/pull/5634) suggests extending `a = np.array(b, min_dtype=np.float32)` but then moved on to more exotic dtype specification syntax:
- `np.array(b, dtype=np.blasable)` for `(s, d, c, z)`
- `np.array(b, dtype=np.floating)`
and there are also proposals around to make the default never convert to an object array to avoid the `np.array([[1], [2, 2]])` problems. I used `np.array` but the same holds for `np.asarray` or `np.asanyarray` or ... (Matti)
- Hameer suggests ordering the dtypes and adding casting rules. Would need a flag for abstract?
- Can we do this as a table? A tree? A method on a dtype subclass?
- Use cases: `np.blasable`, `np.complex`, `np.no-object`, `np.inexact` (float, complex)
- Julia casting tables for type promotion
- Maybe use context managers to modify the casting rules?
- Think about this in the context of overflow
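
Some of these ideas can be approximated with what NumPy already exposes. The sketch below is only an illustration: `array_min_dtype`, `BLASABLE` and `is_blasable` are hypothetical names (there is no `min_dtype` argument or `np.blasable` in NumPy), while `np.result_type`, `np.issubdtype`, `np.floating` and `np.inexact` are existing public API.

```python
import numpy as np


def array_min_dtype(obj, min_dtype):
    """Hypothetical helper emulating np.array(obj, min_dtype=...)."""
    arr = np.asarray(obj)
    # Promote the inferred dtype together with the requested floor.
    target = np.result_type(arr.dtype, min_dtype)
    return arr.astype(target, copy=False)


# Hypothetical "blasable" category: the (s, d, c, z) BLAS types.
BLASABLE = (np.float32, np.float64, np.complex64, np.complex128)


def is_blasable(dtype):
    return np.dtype(dtype).type in BLASABLE


small = array_min_dtype(np.array([1, 2], dtype=np.int8), np.float32)
wide = array_min_dtype(np.array([1.0, 2.0]), np.float32)
print(small.dtype, wide.dtype)  # float32 float64

# Abstract categories that already exist in NumPy's scalar type hierarchy.
print(np.issubdtype(small.dtype, np.floating))  # True
print(np.issubdtype(np.complex64, np.inexact))  # True
print(np.issubdtype(np.int32, np.inexact))      # False
print(is_blasable(small.dtype), is_blasable(np.float16))  # True False
```

A flat tuple works for a fixed category like this one, but it does not say how a new user-defined dtype would join it, which is where the table, tree, or method question comes back.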
## Appendix
Other documents
- SciPy 2018 [brainstorming session](https://github.com/numpy/numpy/wiki/Dtype-Brainstorming)
- [PR to refactor dtypes](https://github.com/numpy/numpy/pull/12585)
- [NEP design PR](https://github.com/numpy/numpy/pull/12630) (one of many such PRs since the subject got complicated)
### Quaternions
Separate [repo](https://github.com/moble/quaternion) as one C file that
- Calls `PyObject_New(PyArray_Descr, &PyArrayDescr_Type)`
- Assigns all the special `PyArray_Descr` fields including `f` (with a `_PyQuaternion_ArrFuncs` like `setitem`, `getitem`)
- Registers the dtype via `PyArray_RegisterDataType` which gives it a `type_num` (python level `dtype.num`)
- Defines all the needed ufuncs and registers them via `PyUFunc_FromFuncAndData`, `PyUFunc_RegisterLoopForType`, which requires adding them to `np.add`, `np.subtract`, ... This is done for `quat, quat -> quat`, `quat, double -> quat`, `double, quat -> quat`
- Defines extra ufuncs specifically for quaternions and exports them as `np.norm`, `np.normalized`, `np.*parity*`, `np.rot*`
- Registers all the casting functions via `PyArray_RegisterCastFunc`, `PyArray_RegisterCanCast`
- Adds `_eps` and `quaternion` to the top-level `np` namespace
Whew!!
### Rational numbers
Part of the NumPy repo, `src/umath/_rational_tests.c.src`
- Defines a `PyArray_ArrFuncs` struct, `npyrational_arrfuncs`, and fills fields like `setitem`, `getitem`
- Defines a `PyArray_Descr` statically rather than calling `PyObject_New`
```
PyArray_Descr npyrational_descr = {
    PyObject_HEAD_INIT(0)
    ...
    &npyrational_arrfuncs,  /* f */
};
```
- Registers the dtype via `PyArray_RegisterDataType` which gives it a `type_num` (python level `dtype.num`)
- Defines all the needed ufuncs and registers them via `PyUFunc_FromFuncAndData`, `PyUFunc_RegisterLoopForType`, which requires adding them to `np.add`, `np.subtract`, ... This is done for `rational, rational -> rational`
- Registers all the casting functions via `PyArray_RegisterCastFunc`, `PyArray_RegisterCanCast`
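
The `type_num` mentioned above is visible from Python as `dtype.num`. As a small illustration for the built-in types only (no user dtype is registered here), the sketch below prints a few of them. The constant `NPY_USERDEF = 256` is copied by hand from the C headers as an assumption; dtypes registered through `PyArray_RegisterDataType` are expected to get numbers from that value upward.

```python
import numpy as np

# Assumption: mirrors the C macro NPY_USERDEF; not exposed by the Python API.
NPY_USERDEF = 256

nums = {}
for name in ("bool", "int8", "float64", "complex128"):
    dt = np.dtype(name)
    nums[name] = dt.num
    # Built-in dtypes sit below the range reserved for user dtypes.
    print(name, dt.num, dt.num < NPY_USERDEF)
```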
## User Stories
- units
- [unyt](https://github.com/yt-project/unyt) (ndarray subclass)
- [Pint](https://github.com/hgrecco/pint) (wrapper via [`__array_prepare__` and `__array_wrap__`](https://github.com/hgrecco/pint/blob/master/pint/unit.py#L237), see [comment](https://pint.readthedocs.io/en/latest/numpy.html#comments))
- astropy.units (ndarray subclass)
- several others, see https://www.youtube.com/watch?v=N-edLdxiM40
- enumerations / categorical
- text
- encoded fixed width text (utf8, latin1, ...)
- variable width
- datetime
- 360 day calendar
- Ora: https://github.com/alexhsamuel/ora
- shapely/GEOS geometries
- does this include jagged arrays of polygons?
- Numerics
- Novel floating point formats
- Decimal (arbitrary precision)
- Big int
- finite fields
- Rationals https://github.com/numpy/numpy-dtypes
- float16?
- Missing values
- sentinels
- bitmask
- record-like array
- optional
- quaternion
- https://github.com/martinling/numpy_quaternion (outdated)
- https://github.com/moble/quaternion (maintained)
- a general pointer dtype that does memory management
- xdress (https://github.com/xdress/xdress)
- ndtypes (https://ndtypes.readthedocs.io/)