# NEP: high level data types and universal functions
Author: Stephan Hoyer
Status: Informational
## Introduction

This document outlines what NumPy needs to make it possible to write fully-featured data types from Python. We focus on universal functions, since these are NumPy's main abstraction for data type dependent functionality.
This arose out of discussion at the NumPy developer meeting in Berkeley, California on November 30, 2018. Thanks to Eric Wieser, Matti Picus, Charles Harris, and Travis Oliphant for sharing their thoughts on parts of this proposal (did I miss anyone?).
## The dtype abstraction

Data types ("dtypes") in NumPy identify the type of the data stored in arrays (e.g., float vs. integer). The power of dtypes is that they allow for a clean separation (at least in principle) between shape dependent and data dependent functionality:

- Shape dependent functions: `concatenate`, `reshape`, indexing.
- Data dependent functions: `add`, `sin`, `sum`.

NumPy up to 1.16 has poor support for custom dtypes. Defining dtype-dependent functionality requires writing inner loops in C, and even the C API is insufficiently flexible for many use cases (e.g., there is no support for functions that depend on metadata stored on dtypes).
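This separation is already visible with an existing parametric dtype: shape functions like `np.concatenate` behave identically for every dtype, while data functions like `np.add` must consult it. A small illustration using `datetime64`:

```python
import numpy as np

dates = np.array(['2018-11-30', '2018-12-01'], dtype='datetime64[D]')

# Shape dependent: concatenate works the same way for every dtype.
both = np.concatenate([dates, dates])
print(both.shape)  # (4,)

# Data dependent: add must know the dtype (datetime64[D] + int means "add days").
tomorrow = np.add(dates, 1)
print(tomorrow[0])  # 2018-12-01
```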
## Types of dtypes

NumPy comes with many builtin dtypes, and there are also many, many use cases for user-defined dtypes. These fall into several categories:

These are all valuable, but here we want to focus on category (3). We think this sort of dtype has the largest need, and would have the largest beneficial impact on the broader NumPy ecosystem.

This NEP intentionally does not address questions of how NumPy's low-level dtype machinery will need to be updated to make it possible to write high-level logical dtypes. These will be the subject of another NEP.
## Why aren't duck arrays or subclasses enough?

"Duck arrays" are great and we should support them better in NumPy. See NEP-18 for an overview and discussion. Most functionality envisioned for dtypes could also be implemented by duck arrays, but duck arrays require a lot more work: you need to redefine all of the shape functions as well as the dtype functions.

Subclasses of NumPy arrays are widely used and, at least in principle, could satisfy the use case of extending dtype-specific functions without rewriting shape-dependent functions. Unfortunately, subclasses are a little too flexible: it's easy to write a subclass that violates the Liskov substitution principle. NumPy itself includes several examples of `ndarray` subclasses that violate substitutability, either by changing shape semantics (for `np.matrix`) or the behavior of methods (for `np.ma.MaskedArray`).

Dtypes provide just the right amount of flexibility to define a new data type, while allowing them to be reliably used from third-party libraries. You need to define new functionality related to the dtype, but shape functions are guaranteed to work exactly like they do on standard NumPy arrays.
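The substitutability problem with subclasses is easy to demonstrate with `np.matrix`, which changes indexing semantics relative to `ndarray`:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)
mat = np.asmatrix(arr)

# Indexing a row of an ndarray drops a dimension...
print(arr[0].shape)  # (3,)

# ...but np.matrix is always 2-D, so code written against ndarray
# semantics can silently misbehave when handed a matrix instead.
print(mat[0].shape)  # (1, 3)
```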
## Defining a dtype from Python

### Dtype customized functions should be ufuncs

The core idea of universal functions is that they are data dependent but shape agnostic. Their shared function signature allows them to be processed in standard ways, e.g., with `__array_ufunc__`.

We should double down on ufuncs, by making all data-dependent behavior in NumPy use ufuncs. See issue 12514 for a list of things that need to change. This isn't to say that every data-dependent function in NumPy needs to be a ufunc: in many cases, functions have esoteric enough signatures (e.g., for shape handling) that they can't fit into the ufunc model. But we should write ufuncs for all the data-dependent functionality within NumPy, and implementing ufuncs for a new dtype should be NumPy's extension mechanism for dtype-specific functionality.

### A proposed interface
Note: This NEP presumes that dtypes will be rewritten using a metaclass, so NumPy scalars will become instances of dtypes. This will be the subject of another NEP; prototyping has started in pull requests #12467 and #12462.
Writing a high-level Python dtype should simply entail writing a new Python type that implements a standard interface. This interface should specify:

- `itemsize`: how many bytes does each dtype element consume?
- `alignment` (should be `itemsize` or smaller).
- How to convert elements to and from Python objects (like `dtype=object`, which uses `NPY_USE_GETITEM`).
- What should be returned by `ndarray.item()`?
- `__array_ufunc__` to override existing ufuncs (see `__dtype_ufunc__` below).
- A `__repr__` method.

By design, it is impossible to override arbitrary NumPy functions that act on shapes, e.g., `np.concatenate`. These should not vary in a dtype dependent fashion.

### Dtype specific ufunc overrides
We propose a more restricted variant of `__array_ufunc__` (only for high level Python dtypes) that does not handle duck arrays, which we'll tentatively call `__dtype_ufunc__`.

Unlike `__array_ufunc__`, calling `__dtype_ufunc__` overrides should happen at a lower level in the ufunc machinery. However, `__dtype_ufunc__` overrides happen at a higher level than NumPy's existing ufunc implementations.

Drawbacks:

- Yet another override protocol to reason about (`__array_ufunc__`, `__dtype_ufunc__`, and NumPy's internal dispatch).

### Example usage
Consider datetime and timedelta dtypes like NumPy's datetime64/timedelta64. Most operations could be implemented simply by casting to int64 and calling another ufunc on the int64 data, e.g., for `np.subtract`:
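A minimal sketch of what that delegation could look like. The `__dtype_ufunc__` signature is assumed to mirror `__array_ufunc__`; the `TimedeltaDtype` class and the manual dispatch call below are hypothetical stand-ins for the machinery this NEP proposes:

```python
import numpy as np

class TimedeltaDtype:
    """Hypothetical high-level dtype storing timedeltas as int64 counts."""
    itemsize = 8

    def __dtype_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if ufunc is np.subtract and method == '__call__':
            # View the underlying data as int64 and delegate to the existing
            # int64 inner loop; a full implementation would re-wrap the
            # result as a timedelta array.
            raw = [np.asarray(x).view(np.int64) for x in inputs]
            return np.subtract(*raw, **kwargs)
        return NotImplemented  # let NumPy try other implementations

# Simulating the dispatch NumPy would perform for a - b:
dtype = TimedeltaDtype()
a = np.array([10, 20, 30], dtype=np.int64)  # stand-ins for timedelta data
b = np.array([1, 2, 3], dtype=np.int64)
result = dtype.__dtype_ufunc__(np.subtract, '__call__', a, b)
print(result)  # [ 9 18 27]
```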
### How NumPy calls `__dtype_ufunc__`

NumPy should check for `__dtype_ufunc__` attributes after looking for `__array_ufunc__` overrides, but before builtin ufunc implementations.

As part of calling `__dtype_ufunc__` overrides, NumPy should verify that the custom ufunc implementation honors appropriate invariants (e.g., that outputs have the expected shapes and dtypes).

## Defining new universal functions from Python
Most dtypes need new functions, beyond those that already exist as ufuncs in NumPy. For example, our new datetime type should have functions for doing custom datetime conversions.
Logically, almost all of these operations are element-wise, so they are a good fit for NumPy's ufunc model. But right now it's hard to write ufuncs: you need to define the inner loops at a C level, and sometimes even write or modify NumPy's internal "type resolver" function that determines the proper output type and inner loop function to use given certain input types (e.g., NumPy has hard-coded support for `datetime64` in the type resolver for `np.add`). For user-defined dtypes written in Python to be usable, it should be possible to write user-defined ufuncs in Python, too.

### Use cases
There are at least three use cases for writing ufuncs in Python:

1. Writing inner loops in Python, like `np.vectorize` but actually creating a ufunc. This will not be terribly useful, because it is painfully slow to do inner loops in Python.
2. Wrapping existing performant vectorized functions as proper ufuncs, so they can be overridden with `__dtype_ufunc__` or `__array_ufunc__`. This provides useful introspection options for third-party libraries to build upon, e.g., `dask.array` can automatically determine how to parallelize such a ufunc.
3. Recasting NumPy's own non-ufunc functions as ufuncs.

For usable user-defined ufuncs, case (2) is probably most important. There are lots of examples of performant vectorized functions in user code, but with the exception of trivial cases where non-generalized NumPy ufuncs are wrapped, most of these don't handle the full generality of NumPy's ufuncs.
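As a concrete instance of case (2): the function below (an illustrative example, not from NumPy) is vectorized and fast, but because it is a plain Python function rather than a ufunc, it has no `reduce`/`accumulate` methods, no introspectable loop types, and cannot be intercepted via `__array_ufunc__` dispatch:

```python
import numpy as np

def sigmoid(x):
    # Fully vectorized, but not a ufunc: no .types metadata, no
    # reduce/accumulate, and no override hooks for duck arrays.
    x = np.asarray(x)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))                   # 0.5
print(isinstance(sigmoid, np.ufunc))  # False
print(isinstance(np.exp, np.ufunc))   # True: a real ufunc, by contrast
```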
For NumPy itself, case (3) could be valuable: we have lots of non-ufuncs that could logically fit into the ufunc model, e.g., `argmin`, `median`, `sort`, `where`, etc.

Note: `numba.vectorize` does not produce a ufunc currently, but it should.

### Proposed interfaces
A ufunc decorator should check arguments, and do broadcasting/reshaping such that the ufunc implementation only needs to handle arrays with one more dimension than the number of "core dimensions" in the gufunc signature. For example:
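A sketch of the intended calling convention, using `np.vectorize` with a gufunc signature as a stand-in (it resolves core dimensions similarly, though it loops in Python, does not create a true ufunc, and hands the implementation exactly the core dimensions rather than one extra loop dimension):

```python
import numpy as np

def pairwise_inner(a, b):
    # With signature '(n),(n)->()', np.vectorize hands the implementation
    # just the core dimension n; broadcasting over any leading loop
    # dimensions is handled entirely by the wrapping machinery.
    return sum(x * y for x, y in zip(a, b))

inner = np.vectorize(pairwise_inner, signature='(n),(n)->()')
result = inner(np.arange(6).reshape(2, 3), np.ones(3))
print(result)  # [ 3. 12.]
```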
or perhaps supporting multiple loops
This is doing three things:
### Why is `@ufunc` different from `vectorize`?

- It creates a true ufunc, which can be overridden with `__array_ufunc__` or `__dtype_ufunc__`.
- It supports standard ufunc arguments (e.g., `where`, and `axis` for gufuncs).

## Changes within NumPy
TODO: finish cleaning this up
### NumPy's low-level ufunc machinery

For each ufunc, we currently have:

This results in hard-wired cases for new dtypes (e.g., `np.datetime64`) inside type resolver functions, which is not very extensible.

Instead, we might want:

- A low-level protocol (like `__dtype_ufunc__`, but without the overhead of Python calls) that finds a dtype that implements the ufunc for all the given argument dtypes.

We will want to default to using NumPy's current type resolving protocol for current builtin dtypes/ufuncs, i.e., by writing a generic version of `__low_level_dtype_ufunc__` to set on builtin dtypes.

### Rewriting existing NumPy functions
Some existing functions will need rethinking to fit the ufunc model:

- Functions with extra parameters (e.g., `ddof` on `np.std`): these could be written in terms of other ufuncs like `np.mean`, or perhaps we could define these as ufuncs that have a `reduce` method but no `__call__` method.
- Some functions (e.g., `np.linalg.solve`) have their own strange casting rules. If we want to support these, we will need some dtype equivalent version of `__array_function__`.
- Datetime-specific functions (e.g., `np.busday_count`).
- `sort`, `mean`, `median`, etc.; `mean` will need new axis rules.
- Functions like `np.where` are vectorized like ufuncs, but can use a generic (non dtype-dependent) inner loop. These could expose the underlying ufunc (e.g., as `np.where.ufunc`).

## Appendix

### References

- The current interface of dtypes in NumPy
- Proposed ufunc dispatching design
- Example: Implementing multiplication for a unit dtype
- Example: flexible dtype bytes addition