# scikit-learn computational plugins (draft)
## Plugin registration
scikit-learn would provide a context manager to make it possible to activate a plugin. For instance:
```python
from sklearn import computational_engine
from sklearn.neighbors import KNeighborsClassifier
with computational_engine("sklearn_dppy"):
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
```
Some plugins could be maintained as extension packages under the
https://github.com/scikit-learn organization (e.g. https://github.com/scikit-learn/sklearn_dppy), with a community governance. Others could be maintained by third parties under their own governance.
It would also be possible for a user to declare that they accept
several computational engines in priority order, with automatic fallback from one to the next:
```python
from sklearn import computational_engine
from sklearn.neighbors import KNeighborsClassifier
with computational_engine("sklearn_dppy", "sklearn_numba"):
    for metric in ["euclidean", "manhattan"]:
        clf = KNeighborsClassifier(n_neighbors=5, metric=metric).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
```
In the example above, one could imagine that `sklearn_dppy` is highly specialized for the `metric="euclidean"` case, while `sklearn_numba` provides routines for any supported metric. Since `"sklearn_dppy"` is declared first, it would be used for `metric="euclidean"`, while scikit-learn would select `sklearn_numba` for the `metric="manhattan"` case.
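The priority-based selection described above could be sketched as follows. This is purely illustrative: the `Engine` class, its `accepts` method, and `select_engine` are hypothetical names, not an existing scikit-learn API.

```python
class Engine:
    """Hypothetical wrapper around a registered computational engine."""

    def __init__(self, name, supported_metrics):
        self.name = name
        self.supported_metrics = supported_metrics

    def accepts(self, metric):
        # A plugin declares which hyper-parameter values it can handle.
        return metric in self.supported_metrics


def select_engine(engines, metric):
    """Return the first engine (in priority order) accepting `metric`."""
    for engine in engines:
        if engine.accepts(metric):
            return engine
    return None  # scikit-learn would fall back to its built-in code


engines = [
    Engine("sklearn_dppy", {"euclidean"}),                # specialized
    Engine("sklearn_numba", {"euclidean", "manhattan"}),  # generic
]
```

With this sketch, `"euclidean"` is served by `sklearn_dppy` (declared first) and `"manhattan"` by `sklearn_numba`, matching the scenario above.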
The list of installed computational engines can be discovered via a dedicated [importlib.metadata.entry_points](https://docs.python.org/3/library/importlib.metadata.html#entry-points) group.
## Input-validation and data containers
Plugins that register a specific computational routine should be able to declare that they accept specific:
- data container types (e.g. NumPy arrays or CuPy arrays with device-allocated memory)
- data-types (e.g. only float32 or float64) for a given container
- hyper-parameter values for the routine, while excluding others.
If a plugin accepts a data container, it should also be able to handle its input validation: for instance, detecting the presence of NaN or inf values can be done more efficiently in parallel on the device, without conversion to an intermediate host array.
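As a minimal sketch of such device-friendly validation: since CuPy mirrors the NumPy API, the same finiteness check can run wherever the data lives by parameterizing the array namespace. The `xp` convention below is an assumption for illustration.

```python
import numpy as np


def assert_all_finite(X, xp=np):
    # `xp` can be numpy or, inside a plugin, cupy: both expose
    # `isfinite`, so the reduction runs on the device holding `X`
    # and only a single boolean crosses back to the host.
    if not bool(xp.isfinite(X).all()):
        raise ValueError("Input contains NaN or infinity.")
```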
## Computational routines
The target for a first pluggable computational routine (or pair of routines) could be brute-force k-nearest neighbors computation as done as part of [#20254](https://github.com/scikit-learn/scikit-learn/pull/20254).
This routine could improve the computational efficiency of k-NN classification and regression, but also of other neighbors-based estimators such as DBSCAN and TSNE.
The pairwise-distance plugin could also be reused for kernel computation in SVMs, Gaussian Processes, Nystroem kernel expansion, KernelPCA...
Other potential computational routines include:
- k-means on dense data (assuming the pairwise-distance + argkmin routines cannot handle it on their own)
- the computation of the sum of sample-wise gradients for logistic regression: in this case the L-BFGS solver would stay unchanged but the embarrassingly parallel computation of the gradient update could be delegated to the plugin.
- score metrics, which are typically parallel arithmetic operations on 2 or 3 arrays (`y_true`, `y_pred`, `sample_weight`) followed by a reduction (e.g. a sum or a mean).
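The score-metric pattern from the last bullet can be made concrete with a small example, here a weighted mean squared error; the function is an illustration of the element-wise-plus-reduction shape, not a proposed plugin API:

```python
import numpy as np


def weighted_mse(y_true, y_pred, sample_weight=None):
    # Element-wise arithmetic on 2-3 arrays followed by a reduction:
    # exactly the pattern a plugin could execute on device memory
    # without round-tripping intermediate arrays through the host.
    sq = (y_true - y_pred) ** 2
    if sample_weight is None:
        return sq.mean()
    return np.average(sq, weights=sample_weight)
```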
## Automatic Fallback mechanism
If the active plugin does not accept a given input data structure (e.g. a sparse matrix) or a given hyper-parameter setting, scikit-learn should automatically fall back to the default built-in implementation (typically written in Cython+OpenMP or numpy/scipy).
This probably means that scikit-learn needs to define an API to ask the active plugin whether it accepts to perform the computation.
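Such an ask-then-fall-back protocol could be sketched as follows; `DummyPlugin`, its `accepts`/`compute` methods, and `dispatch` are hypothetical names for illustration only.

```python
import numpy as np
from scipy import sparse


class DummyPlugin:
    """Illustrative plugin that only handles dense input."""

    def accepts(self, X):
        # A real plugin would also inspect dtypes and hyper-parameters.
        return not sparse.issparse(X)

    def compute(self, X):
        return ("plugin", X.sum())


def builtin_compute(X):
    # Stand-in for the default Cython/NumPy implementation.
    return ("builtin", X.sum())


def dispatch(plugin, X):
    # Ask the active plugin whether it accepts the computation;
    # fall back to the built-in implementation otherwise.
    if plugin is not None and plugin.accepts(X):
        return plugin.compute(X)
    return builtin_compute(X)
```

Dense input would be routed to the plugin, while a sparse matrix would transparently hit the built-in path, which matches the fallback behavior described above.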