# scikit-learn computational plugins (draft)
## Plugin registration
scikit-learn would provide a context manager to make it possible to activate a plugin. For instance:
```python
from sklearn import computational_engine
from sklearn.neighbors import KNeighborsClassifier
with computational_engine("sklearn_dppy"):
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
```
Some plugins could be maintained as extension packages under the
https://github.com/scikit-learn organization (e.g. https://github.com/scikit-learn/sklearn_dppy), with a community governance. Others could be maintained by third parties under their own governance.
It would also be possible for a user to declare that they accept
several computational engines in priority order, with automatic fallback from one to the next:
```python
from sklearn import computational_engine
from sklearn.neighbors import KNeighborsClassifier
with computational_engine("sklearn_dppy", "sklearn_numba"):
    for metric in ["euclidean", "manhattan"]:
        clf = KNeighborsClassifier(n_neighbors=5, metric=metric).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
```
In the example above, one could imagine that `sklearn_dppy` is highly specialized for the `metric="euclidean"` case, while `sklearn_numba` provides routines for any supported metric. Since `"sklearn_dppy"` is declared first, it would be used for `metric="euclidean"`, while scikit-learn would select `sklearn_numba` for the `metric="manhattan"` case.
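The priority-based selection described above could be sketched as follows. This is purely illustrative: the `Engine` class, its `accepts` method, and `select_engine` are hypothetical names, not an existing scikit-learn API.

```python
class Engine:
    """Hypothetical wrapper around a registered computational engine."""

    def __init__(self, name, supported_metrics):
        self.name = name
        self.supported_metrics = supported_metrics

    def accepts(self, metric):
        # A plugin declares which hyper-parameter values it can handle.
        return metric in self.supported_metrics


def select_engine(engines, metric):
    """Return the first engine (in priority order) accepting `metric`."""
    for engine in engines:
        if engine.accepts(metric):
            return engine
    return None  # scikit-learn would fall back to its built-in code


engines = [
    Engine("sklearn_dppy", {"euclidean"}),                # specialized
    Engine("sklearn_numba", {"euclidean", "manhattan"}),  # generic
]
```

With this sketch, `"euclidean"` is served by `sklearn_dppy` (declared first) and `"manhattan"` by `sklearn_numba`, matching the scenario above.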
The list of installed computational engines can be discovered via a dedicated [importlib.metadata.entry_points](https://docs.python.org/3/library/importlib.metadata.html#entry-points) group.
## Input-validation and data containers
Plugins that register a specific computational routine should be able to declare that they accept specific:
- data container types (e.g. NumPy arrays or CuPy arrays with device-allocated memory)
- data-types (e.g. only float32 or float64) for a given container
- hyper-parameter values for the routine, while excluding others.
If a plugin accepts a data container, it should also be able to handle its input validation: for instance, detecting the presence of NaN or inf values can be done more efficiently in parallel on the device, without conversion to an intermediate host array.
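As a minimal sketch of such device-friendly validation: since CuPy mirrors the NumPy API, the same finiteness check can run wherever the data lives by parameterizing the array namespace. The `xp` convention below is an assumption for illustration.

```python
import numpy as np


def assert_all_finite(X, xp=np):
    # `xp` can be numpy or, inside a plugin, cupy: both expose
    # `isfinite`, so the reduction runs on the device holding `X`
    # and only a single boolean crosses back to the host.
    if not bool(xp.isfinite(X).all()):
        raise ValueError("Input contains NaN or infinity.")
```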
## Computational routines
The target for a first pluggable computational routine (or pair of routines) could be brute-force k-nearest neighbors computation as done as part of [#20254](https://github.com/scikit-learn/scikit-learn/pull/20254).
This routine could improve the computational efficiency of k-NN classification and regression, but also of other neighbors-based estimators such as DBSCAN and TSNE.
The pairwise-distance plugin could also be reused for kernel computation in SVMs, Gaussian Processes, Nystroem kernel expansion, KernelPCA...
Other potential computational routines include:
- k-means on dense data (assuming the pairwise-distance + argkmin routines cannot handle it on their own)
- the computation of the sum of sample-wise gradients for logistic regression: in this case the L-BFGS solver would stay unchanged but the embarrassingly parallel computation of the gradient update could be delegated to the plugin.
- score metrics, which are typically parallel arithmetic operations on 2 or 3 arrays (`y_true`, `y_pred`, `sample_weight`) followed by a reduction (e.g. a sum or a mean).
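The score-metric pattern from the last bullet can be made concrete with a small example, here a weighted mean squared error; the function is an illustration of the element-wise-plus-reduction shape, not a proposed plugin API:

```python
import numpy as np


def weighted_mse(y_true, y_pred, sample_weight=None):
    # Element-wise arithmetic on 2-3 arrays followed by a reduction:
    # exactly the pattern a plugin could execute on device memory
    # without round-tripping intermediate arrays through the host.
    sq = (y_true - y_pred) ** 2
    if sample_weight is None:
        return sq.mean()
    return np.average(sq, weights=sample_weight)
```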
## Automatic Fallback mechanism
If the active plugin does not accept a given input data structure (e.g. a sparse matrix) or a given hyper-parameter setting, scikit-learn should automatically fall back to the default built-in implementation (typically written in Cython+OpenMP or numpy/scipy).
This probably means that scikit-learn needs to define an API to ask the active plugin whether it accepts to perform the computation.
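Such an ask-then-fall-back protocol could be sketched as follows; `DummyPlugin`, its `accepts`/`compute` methods, and `dispatch` are hypothetical names for illustration only.

```python
import numpy as np
from scipy import sparse


class DummyPlugin:
    """Illustrative plugin that only handles dense input."""

    def accepts(self, X):
        # A real plugin would also inspect dtypes and hyper-parameters.
        return not sparse.issparse(X)

    def compute(self, X):
        return ("plugin", X.sum())


def builtin_compute(X):
    # Stand-in for the default Cython/NumPy implementation.
    return ("builtin", X.sum())


def dispatch(plugin, X):
    # Ask the active plugin whether it accepts the computation;
    # fall back to the built-in implementation otherwise.
    if plugin is not None and plugin.accepts(X):
        return plugin.compute(X)
    return builtin_compute(X)
```

Dense input would be routed to the plugin, while a sparse matrix would transparently hit the built-in path, which matches the fallback behavior described above.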