
Another Dtype document

Inspired by https://hackmd.io/ok21UoAQQmOtSVk6keaJhw?view

Some terminology notes:

  • "dtype" refers to the PyTypeObject * with value PyArrayDescr_Type
  • "A dtype" refers to a PyArray_Descr * with either a static or heap-allocated value, and is an instance of "dtype"
  • "A numpy scalar type" refers to a PyTypeObject * with value Py@NAME@ArrType_Type, that is a subclass of PyGenericArrType_Type
  • "A numpy scalar" refers to a Py@NAME@ScalarObject *, which is an instance of "A numpy scalar type"

Two orthogonal goals:

Introducing subclasses of dtype to aid with representation of parameterized types (builtin and third-party)

In some ways, np.dtype is becoming a union of all possible data types. For example, it contains the fields (shown as .python, ->c):

  • .fields (->fields), .names (->names), .isalignedstruct - only applicable to structured arrays
  • .subdtype (->subarray->base), .shape (->subarray->shape) - only applicable to subarrays
  • ->c_metadata - only applicable to datetime types, and perhaps third party types
  • .metadata - not used internally, but accessible to python

Each of these assumes some default value in the cases where it doesn't apply.

This type of approach does not scale well. Many third party types may want to store their own metadata alongside a dtype (and make it accessible from C), such as a quantity type wanting to store a PyObject *unit. Our c_metadata slot allows them to do this, but it makes them second-class citizens.

It would be better to allow custom dtypes to freely add their own information at the end of this struct by subclassing. To dogfood ourselves on this approach, we could try to extract structured, subarray, and datetime dtypes into their own dtype subclasses. Expressed in python, this would look like:

class dtype(object):
    # all of the existing slots
    kind: str
    ...
    
    # deprecate access to properties that are part of subclasses, continue to
    # return the defaults
    @property
    def fields(self):
        warnings.warn(
            "Use `isinstance(dt, structured_dtype)` to detect structured types",
            DeprecationWarning
        )
        return None
    # repeat for names, subdtype, etc
    
    
    def __new__(self, *args, **kw):
        # as before, but:
        #   `np.dtype([('a', int)])` defers to `structured_dtype([('a', int)])`
        #   `np.dtype((int, 3))` defers to `subarray_dtype((int, 3))`
        #   `np.dtype('m')` defers to `timelike64_dtype('m')`


class structured_dtype(dtype):
    fields: dict
    names: Tuple[str, ...]
    # etc
    
    def __new__(self, *args, **kwargs):
        self.type = np.void
        # just the structured dtype construction part of the current `np.dtype(...)`


class subarray_dtype(dtype):
    base: dtype
    shape: Tuple[int, ...]
    # etc
    
    def __new__(self, *args, **kwargs):
        self.type = np.void
        # just the subarray dtype construction part of the current `np.dtype(...)`
        # 

class timelike64_dtype(dtype):
    unit: str
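
The dispatch and deprecation behavior sketched above can be tried out in pure Python. The following is a minimal toy model, not NumPy's implementation: the classes only mimic the proposed names, `spec` is a hypothetical attribute, and the rule "a list means structured, a tuple means subarray" is a simplifying assumption.

```python
import warnings


class dtype:
    """Toy stand-in for np.dtype; not NumPy's real implementation."""

    def __new__(cls, spec):
        # Only the base class dispatches; subclasses construct themselves.
        if cls is dtype:
            if isinstance(spec, list):
                return structured_dtype(spec)
            if isinstance(spec, tuple):
                return subarray_dtype(spec)
        self = super().__new__(cls)
        self.spec = spec  # hypothetical attribute, for illustration only
        return self

    @property
    def fields(self):
        # Deprecated on the base class; keeps returning the old default.
        warnings.warn(
            "Use `isinstance(dt, structured_dtype)` to detect structured types",
            DeprecationWarning,
            stacklevel=2,
        )
        return None


class structured_dtype(dtype):
    def __new__(cls, spec):
        self = super().__new__(cls, spec)
        self._fields = {name: dtype(sub) for name, sub in spec}
        return self

    @property
    def fields(self):
        # The subclass owns the real data, so no warning here.
        return self._fields

    @property
    def names(self):
        return tuple(self._fields)


class subarray_dtype(dtype):
    def __new__(cls, spec):
        self = super().__new__(cls, spec)
        base, shape = spec
        self.base = dtype(base)
        self.shape = (shape,) if isinstance(shape, int) else tuple(shape)
        return self
```

With this model, `dtype([('a', int)])` returns a `structured_dtype`, `dtype((int, 3))` returns a `subarray_dtype`, and reading `.fields` on a plain `dtype(int)` emits a `DeprecationWarning` while still returning `None`.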
    

This primarily sets out to help users, and numpy itself, write parameterizable dtypes, including:

  • Quantity types (unit: Unit)
  • Categorical types (values: Tuple[Any])
  • Text types (encoding: str)
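
As a rough illustration of the first item, a third-party quantity type could carry its unit as an ordinary attribute of a dtype subclass. This is a pure-Python sketch under stated assumptions: `dtype` here is a toy stand-in rather than `np.dtype`, and `quantity_dtype`, `_key`, and the use of a string for the unit are all hypothetical.

```python
class dtype:
    """Toy stand-in for np.dtype (an assumption, not the real class)."""

    def __init__(self, kind, itemsize):
        self.kind = kind
        self.itemsize = itemsize

    def _key(self):
        # Everything that distinguishes one dtype from another.
        return (type(self), self.kind, self.itemsize)

    def __eq__(self, other):
        return isinstance(other, dtype) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


class quantity_dtype(dtype):
    """Hypothetical third-party dtype: a float64 value with a unit."""

    def __init__(self, unit):
        super().__init__(kind="f", itemsize=8)
        self.unit = unit  # first-class attribute, not tucked into metadata

    def _key(self):
        # The parameter takes part in equality and hashing.
        return super()._key() + (self.unit,)

    def __repr__(self):
        return f"quantity_dtype(unit={self.unit!r})"
```

Here `quantity_dtype("m") == quantity_dtype("m")` holds, while `quantity_dtype("m")` differs from both `quantity_dtype("s")` and the plain `dtype("f", 8)`, because the subclass and its parameter are part of the identity.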

Open questions:

  1. Should this approach be used even for unparameterized dtypes, like:
    • quaternion

[mattip]Maybe mention this holds 4 doubles and overrides all the ufunc methods?

  • int256

In favor:

  • Extends nicely to types that gain parameters later

Against:

  • Makes user types (each a different subclass) asymmetric from builtin types (all the same class)
  • Muddies the water over dtype instances vs instances of dtype subclasses
  2. Do we need to maintain C ABI compatibility?
  3. Do we need to maintain C API compatibility?
    • Best achieved via a Numpy 1.7-style API hiding, by making PyArray_Descr an opaque struct, and requiring users to use PyArrayDescr_KIND(desc) instead of desc->kind. After a couple of releases of this, we could remove the old API, and start to change the underlying struct layouts.
  4. Do we need to maintain python API compatibility?
    • Easy to address as shown above, by introducing deprecated @property wrappers in the base class.

Counter-argument:

  • What makes a dtype different to a type (comparing dtype methods to classmethod, dtype attributes to class attributes)? type has few subclasses, instead opting for customization via tp_* slots. Why can't we do the same with dtype?
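
For comparison, the slot-based alternative raised in the counter-argument might look roughly like this in Python. It is only a sketch: the slot names `dt_repr` and `dt_cast` and the `params` dict are invented for illustration and correspond to nothing in NumPy's C API.

```python
class dtype:
    """Toy single-class dtype; behavior is customized via slots, not subclasses."""

    def __init__(self, name, params=None, dt_repr=None, dt_cast=None):
        self.name = name
        self.params = dict(params or {})
        # Slots play the role of tp_* function pointers; None means "default".
        self.dt_repr = dt_repr
        self.dt_cast = dt_cast

    def __repr__(self):
        if self.dt_repr is not None:
            return self.dt_repr(self)
        return f"dtype({self.name!r})"

    def cast(self, value):
        if self.dt_cast is not None:
            return self.dt_cast(self, value)
        return value


int64 = dtype("int64", dt_cast=lambda dt, v: int(v))
metres = dtype(
    "quantity",
    params={"unit": "m"},
    dt_repr=lambda dt: f"dtype('quantity[{dt.params['unit']}]')",
    dt_cast=lambda dt, v: float(v),
)
```

Both `int64` and `metres` are instances of the same class, which keeps builtin and user types symmetric. The cost in this sketch is that the unit lives in an untyped `params` dict, much like the c_metadata slot discussed earlier.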

Making dtype a metaclass, so that np.dtype('i') is itself a type

This would:

  • Allow the merging of np.int64, a scalar type, and np.dtype(np.int64), its dtype.
  • Allow dtype.type to be deprecated, and just return self
  • Make it easier to create an instance from a dtype, using my_dtype(0) instead of my_dtype.type(0).
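
The three points above can be illustrated with a small pure-Python metaclass. This is a toy model rather than a proposal for the actual implementation: the constructor signature `dtype(name, pytype, itemsize)` is an assumption, and the scalar type simply subclasses Python's `int`.

```python
class dtype(type):
    """Toy metaclass: every instance of `dtype` is itself a scalar type."""

    def __new__(mcls, name, pytype, itemsize):
        # Build a new class whose instances are the scalars of this dtype.
        cls = super().__new__(mcls, name, (pytype,), {})
        cls.itemsize = itemsize
        return cls

    def __init__(cls, name, pytype, itemsize):
        super().__init__(name, (pytype,), {})

    @property
    def type(cls):
        # With dtype and scalar type merged, `.type` can just return itself.
        return cls


# One object now serves as both the dtype and the scalar type.
int64 = dtype("int64", int, 8)
```

In this model `isinstance(int64, dtype)` is true, `int64.type is int64`, and `int64(0)` directly creates a scalar that is an instance of `int64`.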

Although it opens

TODO: more justification needed here.

How these goals interact

TODO