Another Dtype document

Inspired by https://hackmd.io/ok21UoAQQmOtSVk6keaJhw?view

Some terminology notes:

"dtype" refers to the PyTypeObject * with value PyArrayDescr_Type
"A dtype" refers to a PyArray_Descr * with either a static or heap-allocated value, and is an instance of "dtype"
"A numpy scalar type" refers to a PyTypeObject * with value Py@NAME@ArrType_Type, that is a subclass of PyGenericArrType_Type
"A numpy scalar" refers to a Py@NAME@ScalarObject *, which is an instance of "A numpy scalar type"

Two orthogonal goals:

Introducing subclasses of `dtype` to aid with representation of parameterized types (builtin and third-party)

In some ways, np.dtype is becoming a union of all possible data types. For example, it contains the fields (shown as .python, ->c):

.fields (->fields), .names (->names), .isalignedstruct - only applicable to structured arrays
.subdtype (->subarray->base), .shape (->subarray->shape) - only applicable to subarrays
->c_metadata - only applicable to datetime types, and perhaps third party types
.metadata - not used internally, but accessible to python

Each of these assumes some default value in the cases where it doesn't apply.
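These defaults can be observed directly from Python. The snippet below is only an illustration; it uses attributes that exist on `np.dtype` today, and the variable names are made up for the example.

```python
import numpy as np

plain = np.dtype(np.int64)               # neither structured nor subarray
struct_dt = np.dtype([("a", np.int64)])  # structured dtype
sub_dt = np.dtype((np.int32, 3))         # subarray dtype

# Attributes that do not apply to a plain dtype fall back to a default.
print(plain.fields, plain.names, plain.subdtype, plain.shape)
print(plain.isalignedstruct, plain.metadata)

# The same attributes only carry information on the matching kind of dtype.
print(struct_dt.names, struct_dt.subdtype)
print(sub_dt.subdtype, sub_dt.shape, sub_dt.names)
```

Every consumer of a dtype therefore has to know which sentinel (`None`, `()`, `False`) means "not applicable" for each attribute.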

This type of approach does not scale well. Many third party types may want to store their own metadata alongside a dtype (and make it accessible from C), such as a quantity type wanting to store a PyObject *unit. Our c_metadata slot allows them to do this, but it makes them second-class citizens.

It would be better to allow custom dtypes to freely add their own information at the end of this struct by subclassing. To dogfood ourselves on this approach, we could try to extract structured, subarray, and datetime dtypes into their own dtype subclasses. Expressed in python, this would look like:

import warnings
from typing import Tuple

import numpy as np


class dtype(object):
    # all of the existing slots
    kind: str
    ...
    
    # deprecate access to properties that are part of subclasses, continue to
    # return the defaults
    @property
    def fields(self):
        warnings.warn(
            "Use `isinstance(dt, structured_dtype)` to detect structured types",
            DeprecationWarning
        )
        return None
    # repeat for names, subdtype, etc
    
    
    def __new__(cls, *args, **kw):
        # as before, but:
        #   `np.dtype([('a', int)])` defers to `structured_dtype([('a', int)])`
        #   `np.dtype((int, 3))` defers to `subarray_dtype((int, 3))`
        #   `np.dtype('m')` defers to `timelike64_dtype(...)`
        ...


class structured_dtype(dtype):
    fields: dict
    names: Tuple[str]
    # etc
    
    def __new__(cls, *args, **kwargs):
        # just the structured dtype construction part of the current `np.dtype(...)`
        self = object.__new__(cls)
        self.type = np.void
        return self


class subarray_dtype(dtype):
    base: dtype
    shape: Tuple[int]
    # etc
    
    def __new__(cls, *args, **kwargs):
        # just the subarray dtype construction part of the current `np.dtype(...)`
        self = object.__new__(cls)
        self.type = np.void
        return self

class timelike64_dtype(dtype):
    unit: str

This primarily sets out to help users, and numpy itself, write parameterizable dtypes, including:

Quantity types (unit: Unit)
Categorical types (values: Tuple[Any])
Text types (encoding: str)
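As a rough illustration of what such parameterized subclasses buy us, here is a pure-Python toy model. None of these classes exist in numpy: `DType`, `QuantityDType`, `CategoricalDType` and `TextDType` are hypothetical names, and the unit is a plain `str` rather than a `Unit` object so that the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class DType:
    """Toy stand-in for the shared base class; NOT numpy's dtype."""
    kind: str
    itemsize: int


@dataclass(frozen=True)
class QuantityDType(DType):
    unit: str = "dimensionless"  # assumption: a real design would store a Unit


@dataclass(frozen=True)
class CategoricalDType(DType):
    values: Tuple[Any, ...] = ()


@dataclass(frozen=True)
class TextDType(DType):
    encoding: str = "utf-8"


metres = QuantityDType(kind="f", itemsize=8, unit="m")
seconds = QuantityDType(kind="f", itemsize=8, unit="s")
colours = CategoricalDType(kind="i", itemsize=1, values=("red", "green"))

# The parameter takes part in equality and hashing.
print(metres == seconds)
print(metres == QuantityDType("f", 8, "m"))

# Detection becomes a type check instead of probing for a non-default value.
print(isinstance(colours, CategoricalDType), isinstance(colours, QuantityDType))
```

Each subclass carries only the parameters that make sense for it, so nothing needs a "not applicable" default.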

Open questions:

Should this approach be used even for unparameterized dtypes, like:
- quaternion

[mattip]Maybe mention this holds 4 doubles and overrides all the ufunc methods?

int256

In favor:

Extends nicely to types that gain parameters later

Against:

Makes user types (each a different subclass) asymmetric from builtin types (all the same class)
Muddies the water over dtype instances vs instances of dtype subclasses

Do we need to maintain C ABI compatibility?
Do we need to maintain C API compatibility?
- Best achieved via NumPy 1.7-style API hiding, by making PyArray_Descr an opaque struct, and requiring users to use PyArrayDescr_KIND(desc) instead of desc->kind. After a couple of releases of this, we could remove the old API, and start to change the underlying struct layouts.
Do we need to maintain python API compatibility?
- Easy to address as shown above, by introducing deprecated @property wrappers in the base class.
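A minimal runnable sketch of that deprecation path, using toy classes rather than numpy's real `dtype` (the names `DType`, `StructuredDType` and `fields_and_warnings` are invented for this example):

```python
import warnings


class DType:
    """Toy base class standing in for `dtype`."""

    @property
    def fields(self):
        warnings.warn(
            "Use `isinstance(dt, StructuredDType)` to detect structured types",
            DeprecationWarning,
            stacklevel=2,
        )
        return None  # the historical default for non-structured dtypes


class StructuredDType(DType):
    def __init__(self, fields):
        self._fields = dict(fields)

    @property
    def fields(self):
        # The subclass owns this attribute, so no warning is emitted.
        return self._fields


def fields_and_warnings(dt):
    """Return (dt.fields, [categories of warnings raised while reading it])."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        value = dt.fields
    return value, [w.category for w in caught]


print(fields_and_warnings(DType()))
print(fields_and_warnings(StructuredDType({"a": int})))
```

Existing code that reads `dt.fields` on a non-structured dtype keeps getting `None`, but is nudged towards the `isinstance` check.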

Counter-argument:

What makes a dtype different to a type (comparing dtype methods to classmethod, dtype attributes to class attributes)? type has few subclasses, instead opting for customization via tp_* slots. Why can't we do the same with dtype?
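To make the counter-argument concrete, here is a toy model of slot-based customization in pure Python. It is only a sketch: `SlotDType` and its `getitem` / `setitem` slots are hypothetical names chosen for this example, and the `struct` module stands in for the C-level element access.

```python
import struct
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class SlotDType:
    """A single concrete class; per-type behaviour lives in callable slots."""
    name: str
    itemsize: int
    getitem: Callable[[bytes], Any]  # raw bytes -> Python object
    setitem: Callable[[Any], bytes]  # Python object -> raw bytes


float64 = SlotDType(
    name="float64",
    itemsize=8,
    getitem=lambda raw: struct.unpack("<d", raw)[0],
    setitem=lambda value: struct.pack("<d", value),
)

int32 = SlotDType(
    name="int32",
    itemsize=4,
    getitem=lambda raw: struct.unpack("<i", raw)[0],
    setitem=lambda value: struct.pack("<i", value),
)

# Both dtypes are plain instances of the same class; nothing is subclassed.
for dt in (float64, int32):
    raw = dt.setitem(3)
    print(dt.name, len(raw) == dt.itemsize, dt.getitem(raw))
```

The trade-off is that a slot table has a fixed shape, so a third-party type still needs somewhere to put extra state such as a unit.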

Making `dtype` a metaclass, so that `np.dtype('i')` is itself a type

This would:

Allow the merging of np.int64, a scalar type, and np.dtype(np.int64), its dtype.
Allow dtype.type to be deprecated, and just return self
Make it easier to create an instance from a dtype, using my_dtype(0) instead of my_dtype.type(0).
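The following pure-Python toy shows the shape of this idea. It is a sketch under assumptions: `DTypeMeta` and this `int64` are hypothetical and unrelated to numpy's real objects, and the `kind` / `itemsize` class keywords are just one possible way to pass the parameters.

```python
class DTypeMeta(type):
    """Toy metaclass: each instance is both a dtype and a scalar type."""

    def __new__(mcls, name, bases, namespace, *, kind, itemsize):
        cls = super().__new__(mcls, name, bases, namespace)
        cls.kind = kind
        cls.itemsize = itemsize
        return cls

    def __init__(cls, name, bases, namespace, **kwargs):
        # Swallow the class keywords; type.__init__ does not accept them.
        super().__init__(name, bases, namespace)

    @property
    def type(cls):
        # With scalar type and dtype merged, `.type` can just return the dtype.
        return cls


class int64(int, metaclass=DTypeMeta, kind="i", itemsize=8):
    """Hypothetical merged scalar type / dtype; not numpy's int64."""


x = int64(0)                         # my_dtype(0) instead of my_dtype.type(0)
print(isinstance(int64, DTypeMeta))  # the dtype is an instance of the metaclass
print(isinstance(x, int64))          # the scalar is an instance of the dtype
print(int64.type is int64, int64.kind, int64.itemsize)
```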

Although it opens

TODO: more justification needed here.

How these goals interact

TODO

Saul Shanabrook

2019/02/28 19:58:09

Is this counter argument suggesting `dtype` be a metaclass or am I understanding this wrong?

Travis E. Oliphant

2019/03/02 21:00:28

I believe this is the approach that should be taken. This will simplify the conceptual understanding. About four years ago, I realized this is what dtypes should have been all along. What is needed is something like Python's "type" but with a few additional tp_* style function pointers to handle the specifics of copying and accessing the exact data inside the type.

2019/03/02 21:01:28

I'm strongly in favor of this approach. This is what we are experimenting with in "mtype" as part of the "plures" project.

2019/03/02 21:02:26

This, in essence, does what Guido did in Python 2.0 and continued in Python 3.0, making user-defined classes first-class citizens in the "type" hierarchy of Python.

2019/03/02 21:03:28

We need to do the same with NumPy to bring dtypes from being analogous to Python 1.X style classes (where every class was an instance of a single built-in type) to Python 2.X "new-style" classes where everything that inherited from "object" was an instance of the meta-class "type".

2019/03/02 21:04:13

This means making the PyArray_Descr object a sub-struct of PyHeapTypeObject (in C).
