# scikit-learn computational plugins (draft)
## Plugin registration
scikit-learn would provide a context manager to make it possible to activate a plugin. For instance:
```python
from sklearn import computational_engine
from sklearn.neighbors import KNeighborsClassifier

with computational_engine("sklearn_dppy"):
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
```
Some plugins could be maintained as extension packages under the
https://github.com/scikit-learn organization (e.g. https://github.com/scikit-learn/sklearn_dppy), with community governance. Others could be maintained by third parties under their own governance.
It would also be possible for users to declare that they accept to use
several computational engines in priority order, automatically falling back from one to the next:
```python
from sklearn import computational_engine
from sklearn.neighbors import KNeighborsClassifier

with computational_engine("sklearn_dppy", "sklearn_numba"):
    for metric in ["euclidean", "manhattan"]:
        clf = KNeighborsClassifier(n_neighbors=5, metric=metric).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
```
In the example above, one could imagine that `sklearn_dppy` would be highly specialized for the `metric="euclidean"` case, while `sklearn_numba` could provide routines for any supported metric. Since `sklearn_dppy` is declared first, it would be used for `metric="euclidean"`, while scikit-learn would select `sklearn_numba` for the `metric="manhattan"` case.
The list of installed computational engines could be discovered via dedicated [importlib.metadata.entry_points](https://docs.python.org/3/library/importlib.metadata.html#entry-points).
## Input-validation and data containers
Plugins that register a specific computational routine should be able to declare that they accept:
- specific data container types (e.g. numpy arrays, or CuPy arrays with device-allocated memory)
- specific data types (e.g. only float32 or float64) for a given container
- specific hyper-parameter values for the routine, while excluding others.
If a plugin accepts a data container, it should also be able to handle its input validation: for instance, detecting the presence of NaN or inf values can be done more efficiently in parallel on device, without conversion to an intermediate host array.
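The kind of declaration described above could take the shape of an `accepts` check on the plugin. The class and method names below are assumptions for illustration, not a proposed API:

```python
# Hypothetical sketch: an engine declares which containers, dtypes and
# hyper-parameter values it supports via an `accepts` predicate.
import numpy as np


class DppyKNeighborsEngine:
    """Toy engine that only accepts dense float32/float64 numpy arrays
    and the euclidean metric; everything else should trigger a fallback."""

    supported_dtypes = (np.float32, np.float64)

    def accepts(self, X, **params):
        # Reject anything that is not a dense numpy array (e.g. sparse).
        if not isinstance(X, np.ndarray):
            return False
        # Reject dtypes this engine has no specialized kernels for.
        if X.dtype.type not in self.supported_dtypes:
            return False
        # Reject hyper-parameter values this engine does not implement.
        if params.get("metric", "euclidean") != "euclidean":
            return False
        return True
```

A real plugin would pair such a check with its own device-side input validation (e.g. the NaN/inf scan mentioned above) so that no intermediate host copy is needed.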
## Computational routines
The target for a first pluggable computational routine (or pair of routines) could be brute-force k-nearest neighbors computation, as done as part of [#20254](https://github.com/scikit-learn/scikit-learn/pull/20254).
This routine could improve the computational efficiency of k-NN classification and regression, but also of other neighbors-based estimators such as DBSCAN and t-SNE.
The pairwise-distance plugin could also be reused for kernel computation in SVMs, Gaussian Processes, Nystroem kernel expansion, KernelPCA...
Other potential computational routines include:
- k-means on dense data (assuming the pairwise-distance + argkmin routines cannot handle it on their own)
- the computation of the sum of sample-wise gradients for logistic regression: in this case the L-BFGS solver would stay unchanged, but the embarrassingly parallel computation of the gradient update could be delegated to the plugin.
- score metrics, which are typically parallel arithmetic operations on 2 or 3 arrays (`y_true`, `y_pred`, `sample_weight`) followed by a reduction (e.g. a sum or a mean).
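To illustrate why score metrics fit this model, here is a weighted mean squared error written in plain numpy (the function name and signature are illustrative, not an existing scikit-learn API): the whole computation is elementwise arithmetic followed by a single reduction, which is exactly the pattern a device plugin can accelerate.

```python
# Illustrative only: shows the "elementwise ops + reduction" structure
# of a typical score metric.
import numpy as np


def weighted_mse(y_true, y_pred, sample_weight=None):
    # Elementwise arithmetic on the input arrays...
    squared_errors = (y_true - y_pred) ** 2
    # ...followed by a single reduction (mean or weighted average).
    if sample_weight is None:
        return squared_errors.mean()
    return np.average(squared_errors, weights=sample_weight)
```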
## Automatic Fallback mechanism
If the active plugin does not accept a given input data structure (e.g. a sparse matrix) or a given hyper-parameter setting, scikit-learn should automatically fall back to the default built-in implementation (typically written in Cython+OpenMP or numpy/scipy).
This probably means that scikit-learn needs to define an API to ask the active plugin whether it accepts to perform the computation.
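A minimal sketch of such a dispatch loop, assuming each engine exposes an `accepts`/`run` pair (both names are hypothetical): engines are tried in the priority order declared in the context manager, and the built-in implementation is the final fallback.

```python
# Hypothetical sketch of the fallback mechanism; `accepts` and `run`
# are assumed method names, not an agreed-upon plugin API.
def dispatch(engines, X, default_impl, **params):
    """Run the routine with the first engine that accepts the call,
    falling back to the built-in implementation otherwise."""
    for engine in engines:  # engines are listed in priority order
        if engine.accepts(X, **params):
            return engine.run(X, **params)
    return default_impl(X, **params)


class RejectingEngine:
    """Toy engine that declines every call, to exercise the fallback."""

    def accepts(self, X, **params):
        return False

    def run(self, X, **params):
        return "plugin result"


result = dispatch([RejectingEngine()], X=None,
                  default_impl=lambda X, **params: "builtin result")
```

Because the toy engine rejects the call, `result` comes from the built-in implementation, which is the behavior the fallback mechanism is meant to guarantee.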
