# Array API Discussions
Some topics to talk about on Monday morning:
* missing functions in the array API standard
* for example weighted average is used in scipy stats and scikit-learn, but it is not in the array API standard. Right now everyone makes their own implementation
* possible solutions:
* each library implements their own versions
* create a library that contains these things that libraries can vendor/depend on
* wait for a bit to gather information on what/how much is missing to make a decision?
* maybe this isn't actually that big of a deal
* Xarray discussion of missing functions needed for adoption: https://github.com/pydata/xarray/issues/7848 (mainly need https://github.com/data-apis/array-api/issues/669)
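To make the weighted-average example concrete, here is a minimal sketch of the kind of helper each library currently writes for itself. The name `weighted_average` and the `xp` keyword are hypothetical, NumPy only stands in for an arbitrary array API namespace, and the sketch assumes floating-point inputs.

```python
import numpy as np  # stand-in for any array API compatible namespace


def weighted_average(x, weights, *, axis=None, xp=np):
    """Weighted mean of `x` along `axis` (hypothetical helper).

    Uses only functions that exist in the array API standard:
    asarray, broadcast_to, sum, and elementwise arithmetic.
    Assumes floating-point inputs.
    """
    x = xp.asarray(x)
    weights = xp.asarray(weights, dtype=x.dtype)
    # Give the weights the full shape of `x` so both sums reduce the same axis.
    weights = xp.broadcast_to(weights, x.shape)
    return xp.sum(x * weights, axis=axis) / xp.sum(weights, axis=axis)


x = np.asarray([[1.0, 2.0], [3.0, 4.0]])
w = np.asarray([1.0, 3.0])
print(weighted_average(x, w, axis=1))
```

A shared library of such helpers would let this function be written once instead of once per project.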
* Promotion of array object types?
* Mixing dask and NumPy is not supported right now. Should it be?
```python=
import dask.array as da
import numpy as np

x = da.random.normal(0, 1, size=(10_000, 3), chunks=10)

style = "dask"
if style == "numpy":
    y = np.diag([3, 2, 1])
elif style == "dask-broken":
    y = da.diag([3, 2, 1])  # TypeError
elif style == "dask":
    y = da.diag(np.array([3, 2, 1]))

x @ y  # "Array API operation"
```
* Old issue: https://github.com/data-apis/array-api/issues/586
* Array API with different array libraries
* calling an array API function with two arguments, where each argument is from a different array library (e.g. dask and numpy, numpy and cupy, etc.)
* how inconvenient is it for a user to have to make the conversion themselves?
* Dask might have a lot to do if all tiny array ops are part of the graph.
* casting/conversion for inputs?
* e.g. if the user inputs `ints`, should we convert to floating point? should we require the user to convert?
* in scikit-learn we have wondered about converting to float64 if the input is float32, but only if float64 is available
* not all libraries/devices have the same floating point types
* `xp.asarray([1.]).dtype` to find out the default floating dtype is the current solution, which feels clunky
* in the future we can use array API inspection to find out the default floating point type
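The two ways of finding the default floating dtype mentioned above can be combined in one helper. This is a sketch, not an agreed-upon API: the name `default_float_dtype` is hypothetical, and it assumes that namespaces implementing the inspection API expose `__array_namespace_info__()` with a `default_dtypes()` method, as described in the 2023.12 revision of the standard.

```python
import numpy as np


def default_float_dtype(xp):
    """Return the default floating-point dtype of namespace `xp` (hypothetical helper)."""
    info = getattr(xp, "__array_namespace_info__", None)
    if info is not None:
        # Inspection API, available in newer namespaces.
        return info().default_dtypes()["real floating"]
    # Fallback for older namespaces: create a float array and read its dtype.
    return xp.asarray([1.0]).dtype


print(default_float_dtype(np))
```

The fallback branch keeps the helper usable with namespaces that predate the inspection API.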
* array-api-compat versioning? what to do if users have an old version?
* GitHub GPU CI/testing of array API code
* eScience has experience running the GitHub GPU runners
* Connects to Azure subscription
* dispatching
* Networkx dispatching:
* Based on the inputs, ask each type which backend it belongs to
* Plugins are registered through Python entry points
* Backend can support a subset
* Support backend specific keywords
* Has `should_run` and `can_run`. Some problems are not big enough that they should run, but they could run (e.g. conversion cost).
* Introspection and documentation for which plugins are available. "Blessed ones" also show up in the online docs.
* When a library is built on top of another library and both libraries can dispatch, what to do?
* Let the user configure both libraries
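The `can_run` / `should_run` split can be illustrated with a toy dispatcher. This is a sketch of the idea only, not NetworkX's actual implementation: the backend classes, the `BACKENDS` list, and `dispatch_total` are all hypothetical, and a real setup would discover backends through entry points rather than a hard-coded list.

```python
class DefaultBackend:
    """Hypothetical fallback backend: handles every input."""
    name = "default"

    def can_run(self, values):
        return True

    def should_run(self, values):
        return True

    def total(self, values):
        return sum(values)


class AcceleratedBackend:
    """Hypothetical backend: supports only ints, worthwhile only for large inputs."""
    name = "accelerated"
    min_size = 1_000  # below this, conversion cost is assumed to outweigh the gain

    def can_run(self, values):
        return all(isinstance(v, int) for v in values)

    def should_run(self, values):
        return len(values) >= self.min_size

    def total(self, values):
        return sum(values)  # placeholder for a faster implementation


# Priority order; the default backend comes last.
BACKENDS = [AcceleratedBackend(), DefaultBackend()]


def dispatch_total(values):
    """Return (backend name, result) from the first backend that can and should run."""
    for backend in BACKENDS:
        if backend.can_run(values) and backend.should_run(values):
            return backend.name, backend.total(values)
    raise RuntimeError("no backend accepted the input")


print(dispatch_total([1, 2, 3]))
print(dispatch_total(list(range(1_000))))
```

Small inputs stay on the default backend even though the accelerated one could handle them, which is the case `should_run` exists for.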
* sklearn alternatives:
* each plugin provides one or more "engines", e.g. `KMeansEngine` which is used to accelerate `KMeans` estimator
* each engine has a well defined API, representing parts of the whole algorithm
* up for discussion, should an engine be able to replace all of `fit`, `predict`, etc?
* engine selection performed by querying each engine in turn "do you want to handle this input + hyper-parameters?"
* fall back to default engine
* PR state: the job is chunked up into tasks, and backends implement those "tasks/kernels" (i.e. "subsections"):
* A plugin has engines. An engine is a class, e.g. for KMeans:
* `kmeans_engine.single_lloyd()`, other methods... (including "do you want to do this")
* Estimator uses the methods (which means it can implement the outer iteration for example).
* Plugin has to replicate everything from scratch.
* is there middle ground (e.g. only input validation is sklearn)?
* Could also be solved by helper APIs? (discussed in networkx)
* Can have default implementations which call other methods, but still allow overriding.
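The engine idea discussed above could look roughly like the following. This is a sketch under stated assumptions, not the API from the scikit-learn PR: `DefaultKMeansEngine`, `Float32OnlyEngine`, `accepts`, `select_engine`, and `fit_kmeans` are hypothetical names, and only `single_lloyd` is taken from the notes.

```python
import numpy as np


class DefaultKMeansEngine:
    """Hypothetical default engine; accepts every input."""

    def accepts(self, X, n_clusters):
        # "Do you want to handle this input + hyper-parameters?"
        return True

    def single_lloyd(self, X, centers):
        # One Lloyd iteration: assign each point to its nearest center,
        # then recompute each center as the mean of its points.
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        new_centers = np.stack([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(centers.shape[0])
        ])
        return labels, new_centers


class Float32OnlyEngine(DefaultKMeansEngine):
    """Hypothetical plugin engine that only wants float32 input."""

    def accepts(self, X, n_clusters):
        return X.dtype == np.float32


def select_engine(engines, X, n_clusters):
    # Query each engine in turn; fall back to the default engine.
    for engine in engines:
        if engine.accepts(X, n_clusters):
            return engine
    return DefaultKMeansEngine()


def fit_kmeans(X, n_clusters, engines=(), max_iter=20):
    # The estimator owns the outer iteration; the engine supplies the kernel.
    engine = select_engine(engines, X, n_clusters)
    centers = X[:n_clusters].copy()  # naive initialization, for illustration only
    for _ in range(max_iter):
        labels, new_centers = engine.single_lloyd(X, centers)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


X = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]])
labels, centers = fit_kmeans(X, n_clusters=2, engines=[Float32OnlyEngine()])
print(labels, centers)
```

Because the plugin engine subclasses the default one, it inherits `single_lloyd` and overrides only what it needs, which is one way to read the "default implementation, still allow overriding" point.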
* [Matthew] Adoption questions outside of "core Scientific Python" (mainly [Scikit-HEP](https://scikit-hep.org/))
- `ragged`: https://github.com/scikit-hep/ragged/discussions/6
- `pyhf`: https://github.com/scikit-hep/pyhf/issues/2253
- This is specifically about [`array-api-compat`](https://github.com/data-apis/array-api-compat)
- The suggested approach for supporting multiple backends with `array-api-compat` is to drop it in as much as possible and, where things fall outside the standard, to modify the backend to support them. This might be less efficient than the standard computational backends that `pyhf` has.
---
Further discussion on dispatching (seberg hoping to fill in a bit here): https://hackmd.io/yI1iAqekQIq0a4jLS9WPyw