# Scikit-learn bi-weekly progress status (even weeks)

**Goal**: internal communication on recent work and short term planning

**Who**: people working on the maintenance of scikit-learn and related projects, in particular at probabl and Inria and maybe others

**Frequency**: every other Monday at 15:00 CET/CEST, unless it happens on the same day as the Monthly meeting.

**Where**: https://meet.google.com/xdm-ozyn-pgj

**Meeting notes**: to be archived on the scikit-learn org repo

## Next meeting templates

**Rules of the game:**

- No question during the progress reports
- Add those questions to the discussion points

### Progress reports

- [name=Someone]
    - item
    - item

### Discussion points

- ...

## 2025-03-17

### Progress reports

- [name=Arturo]
    - DOC Rework voting classifier example [#30985](https://github.com/scikit-learn/scikit-learn/pull/30985)
- [name=Shruti]
    - Working on test using scores for sample weight invariance in non-deterministic estimators PR [#18](https://github.com/snath-xoc/sample-weight-audit-nondet/pull/18).
    - Working on [#53593](https://github.com/scikit-learn/scikit-learn/pull/53593) (Binmapper in HGBT) PR, similar to `KBinsDiscretizer` but a bit complicated since it breaks several tests
    - Started mini comparison of scikit-learn Gaussian Processes vs. GPytorch, maybe something to try out skore with?
- [name=Stefanie]
    - fix testing and routing for dynamic method selection in PR [FEA Add metadata routing through predict methods of BaggingClassifier and BaggingRegressor](https://github.com/scikit-learn/scikit-learn/pull/30833)
    - fairlearn PR [MNT Refactor _validate_and_reformat_input](https://github.com/fairlearn/fairlearn/pull/1527)
    - fairlearn PR [MNT Narwhalify reductions.ErrorRate](https://github.com/fairlearn/fairlearn/pull/1526)
    - a lot of doc reviews
- [name=Olivier]
    - First pass of review on default policy `sample_weight` meta-data routing
        - https://github.com/scikit-learn/scikit-learn/pull/30946
        - is `sample_weight` special or should we always request all metadata?
            - if yes, this could be a justification to release scikit-learn 2.0 if this causes breaking changes?
            - would require studying the impact on downstream libraries to make a decision
    - Investigated with Antoine why decision trees do not pass the `sample_weight` repetition equivalence tests
        - rounding errors cause different choices of splits with near-tied impurity improvements when running the stochastic tests
        - rounding errors cause small but non-uniform offsets in leaf values in shallow trees (10 to 100x machine epsilon)
        - need to follow up by opening issues to document the bias induced by the current code when the data leads to (exact or near) tied splits
            - current handling of tied splits leads to a hard-to-understand inductive bias when inspecting the prediction function of learned decision trees
            - potential solution would involve randomized near-tie breaking in trees, and then our stochastic `sample_weight` equivalence tests should pass
            - alternatively we would need to change the way we write the `sample_weight` equivalence tests.
    - A few reviews to get loky 3.5.0 out.
    - Science: I read the [TabPFNv2 paper](https://www.nature.com/articles/s41586-024-08328-6) and now reading the [TabICL preprint](https://arxiv.org/abs/2502.05564).
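The rounding effect on leaf values noted in Olivier's report can be illustrated with a small pure-Python sketch. It assumes that a leaf value is a weighted mean of the targets and that integer weights act as repetition counts; the helper names `weighted_mean` and `repeated_mean` are illustrative only and are not scikit-learn code.

```python
import math
import random


def weighted_mean(values, weights):
    """Leaf value computed directly from (value, weight) pairs."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def repeated_mean(values, weights):
    """Leaf value computed after repeating each value `weight` times."""
    repeated = [v for v, w in zip(values, weights) for _ in range(w)]
    return sum(repeated) / len(repeated)


rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
w = [rng.randint(1, 5) for _ in range(50)]  # integer weights = repetition counts

weighted = weighted_mean(y, w)
repeated = repeated_mean(y, w)

# Both quantities are the same in exact arithmetic. In floating point they
# agree up to rounding, but are not guaranteed to be bit-for-bit identical.
print(weighted, repeated, abs(weighted - repeated))
assert math.isclose(weighted, repeated, rel_tol=1e-9, abs_tol=1e-12)
```

A test that compares the two fits with strict equality could therefore fail even when the weighting logic is correct, which is one reason a tolerance or a statistical comparison may be needed.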
- [name=Loïc]
    - last week: open-source in academia conference + scikit-learn triage
    - scikit-learn
        - bump to Python 3.10 https://github.com/scikit-learn/scikit-learn/pull/30895
        - WIP GaussianMixture array API support with Stefanie: https://github.com/scikit-learn/scikit-learn/pull/30777
        - fix CircleCI doc lock-files update with `PYTHONNOUSERSITE` https://github.com/scikit-learn/scikit-learn/pull/31006
    - scikit-learn MOOC: Fix Binder link with js on page load (merged): https://github.com/INRIA/scikit-learn-mooc/pull/807
- [name=Antoine]
    - investigated why `GradientBoosting` fails the `sample_weight` equivalence check
        - root cause is tied splits in `DecisionTree` investigated with Olivier
    - meeting with Olivier, Shruti and Jérémie on adapting the statistical test
    - this week: metadata routing and multiclass brier score
- [name=Guillaume]
    - last week:
        - open-source in academia conference
        - some discussions with Dea and Lucy
    - next week:
        - PyData Milan meetup
- [name=Adrin]
    - https://github.com/scikit-learn/scikit-learn/pull/30859 for `LogisticRegressionCV.score`
    - https://github.com/scikit-learn/scikit-learn/pull/30946 for default routing
    - https://github.com/scikit-learn/scikit-learn/pull/30990 for a better `refit=callable` example and plotting
    - Triage this week
- [name=Jérémie]
    - Set up automated release with github actions for threadpoolctl and loky
        - publish to test pypi on push to main for `.dev0` versions
        - publish to pypi when a tag is pushed for actual releases
        - still need to manually release on conda-forge
    - released threadpoolctl 3.6.0
    - released loky 3.5.0
- [name=Gael Varoquaux]
    - Went over scikit-learn + ColumnTransformer / pandas questions on stackoverflow
    - PRs/contributions to list skrub in various parts of the ecosystem (pandas, polars, scikit-learn)

### Discussion points

- Bump to Python 3.10: opinions are roughly split between plan 1 (oldest minor `X.Y` with Python 3.10 wheels), plan 2 (oldest bugfix `X.Y.Z` with Python 3.10 wheels) and "both are fine". https://github.com/scikit-learn/scikit-learn/pull/30895
- Moving meta-data routing (sample weights) to more mandatory
    - The challenge is when we add metadata routing to something, e.g. a scoring, in which case it leads to a change in statistical behavior
- [name=Guillaume] `SearchCV`:
    - I'm under the impression that we should improve the validation curve
    - Still thinking about the parallel coordinate plot (`skrub` does some stuff in this area (wip))
- [name=Adrin] Copilot context hacks
- [name=Adrin] SBOMs, GH's action, minimal starting point

## 2025-03-03

### Progress reports

- [name=Guillaume]
    - Working upstream (`joblib`) in order to propagate configuration from driver (main process) to workers: https://github.com/joblib/joblib/pull/1668
    - Hitting issue with backward compatibility for people that already implement the trick, e.g. scikit-learn
- [name=Arturo]
    - [Update wikipedia article for scikit-learn](https://github.com/scikit-learn/scikit-learn/issues/30907)
- [name=Olivier]
    - Design session on default metadata routing for `sample_weight` with Antoine, Adrin, Stefanie and Jeremie
    - joblib / sprint
        - refactored the loky process spawning to avoid raising the `DeprecationWarning` on manual `os.fork` and fix a crash on macOS: https://github.com/joblib/loky/pull/429
        - revived enabling faulthandler for loky workers by default https://github.com/joblib/loky/pull/419
        - we should be able to release loky soon
        - several reviews/merges and more to come
    - a bit of follow-up on:
        - `sample_weight` entry in glossary: https://github.com/scikit-learn/scikit-learn/pull/30564
        - issue with setting the number of threads in gradient boosting https://github.com/scikit-learn/scikit-learn/issues/30662
- [name=Stefanie]
    - scikit-learn
        - swapped test data for tests in PR [MNT _weighted_percentile supports np.nan values](https://github.com/scikit-learn/scikit-learn/pull/29034)
    - joblib
        - PR [ENH add auto .gitignore in the Memory cache folder](https://github.com/joblib/joblib/pull/1674)
        - PR [DOC update README with up-to-date installation instruction](https://github.com/joblib/joblib/pull/1663)
        - PR [DOC include cpu_count in built docs](https://github.com/joblib/joblib/pull/1664)
        - PR [DOC add info on pre-commit to README](https://github.com/joblib/joblib/pull/1678)
        - issue [Path to cache in Memory should not depend on input type of location param](https://github.com/joblib/joblib/issues/1671)
    - reviewed
        - https://github.com/scikit-learn/scikit-learn/pull/30876
        - https://github.com/scikit-learn/scikit-learn/pull/30882
- [name=jeremie] (off)
    - joblib sprint -> loky sprint
        - migrated CI from azure pipelines to github actions. faster than expected
        - dropped support for PyPy, Python 3.7 and Python 3.8.
        - fixed `cpu_count` on recent windows versions and "exotic" Linux systems.
    - triage
    - attended drafting sample_weight × metadata routing
- [name=Adrin]
    - A bit of `joblib` sprint work
    - numpy.unique merged: https://github.com/numpy/numpy/pull/26018
    - work on metadata routing in scikit-learn
        - default routing
        - GS routing of sample weight
    - https://github.com/adrinjalali/agents-to-block to collect accounts people want to block (LLM spam)
- [name=Antoine]
    - [FIX Forward sample weight to the scorer in grid search](https://github.com/scikit-learn/scikit-learn/pull/30743) ready for review/merge
- [name=Gaël]
    - [skrub] [improved memory and time of skrub.StringEncoder](https://github.com/skrub-data/skrub/pull/1248) 2x memory, 1.5x time
    - [skrub] [gave a vscode only presentation on skrub](https://github.com/GaelVaroquaux/skrub_presentation_2025), including the latest features (Recipe / Expressions)
- [name=Vincent]
    - [skrub] Extensively test the expression API/Recipe to catch some gotchas and bugs
    - Iterate on the documentation of the Recipe
    - [hazardous] iteration on the documentation of the C-index PR

### Discussion points

- [name=Olivier] Wikipedia editing
- [name=Adrin] balancing priorities, comms with the outside, expectations from other maintainers
- [name=Olivier] inconsistent handling of location in `joblib.Memory`
- [name=Gael] what can we learn / borrow from the numpy review process? How do we change?
    - One good reviewer sufficient for merge
    - Suggestion: define what needs two reviews (e.g. no changelog => 1 review sufficient; new API => 2 reviews needed). Need more trust of reviewer
    - Maybe introducing a time boundary after which only one maintainer is allowed to merge.
    - PR triaging, close more PRs
    - Revived triaging meeting to triage PRs and close some to help focus attention

## 2025-02-17

### Progress reports

- [name=Jérémie]
    - sample weight debugging
        - For MiniBatchKMeans. Several issues in different parts of the estimator (fit, convergence checking, ...) https://github.com/scikit-learn/scikit-learn/pull/30751 Raises questions regarding the equivalence properties we want.
        - For the statistical test. Spurious pvalue constant to 1 https://github.com/snath-xoc/sample-weight-audit-nondet/issues/14
- [name=Stefanie]
    - PR [FEA Add metadata routing through predict methods of BaggingClassifier and BaggingRegressor](https://github.com/scikit-learn/scikit-learn/pull/30833)
    - fairlearn: PR [DOC add example for using ErrorRate](https://github.com/fairlearn/fairlearn/pull/1502)
    - reviewed several DOC PRs
- [name=Shruti]
    - Fully back from teaching in Cape Town (put out a good work)
    - Working on expanding sample weight testing to clustering algorithms and using a score based equivalence test: https://github.com/snath-xoc/sample-weight-audit-nondet
    - Implemented fix for spurious 1 values found due to construction of `np.random.choice` (thank you Jeremie)
    - PR https://github.com/scikit-learn/scikit-learn/pull/30751, need to further discuss issues with scaling of weights, sometimes optimisation problem is not well-defined
- [name=Loïc]
    - Use OpenML download URL from OpenML metadata: https://github.com/scikit-learn/scikit-learn/pull/30708
    - Download parquet file from OpenML in example: https://github.com/scikit-learn/scikit-learn/pull/30824
    - inputs on [public website analysis](https://views.scientific-python.org/scikit-learn.org) https://github.com/scikit-learn/scikit-learn/issues/30815 ?
    - WIP GaussianMixture Array API with Stefanie still ongoing: https://github.com/scikit-learn/scikit-learn/pull/30777
    - JupyterLite:
        + issue with polars and parquet file: https://github.com/pola-rs/polars/issues/20876
        + issue with CSV and polars.read_csv https://github.com/jupyterlite/jupyterlite/issues/1576
    - sphinx-gallery API doc duplicated links to example: https://github.com/scikit-learn/scikit-learn/pull/30822 and https://github.com/skrub-data/skrub/pull/1239
    - Github actions for arm64 CI (not using CirrusCI anymore): https://github.com/scikit-learn/scikit-learn/pull/30797
    - joblib triage in preparation of the sprint 26-27 @ Inria Paris
    - mybinder.org may become more stable in the future: https://github.com/scikit-learn/scikit-learn/pull/30697#issuecomment-2659881848
- [name=Dea]
    - Comments welcome here (get_params() html): PR https://github.com/scikit-learn/scikit-learn/pull/30763
    - Working on https://github.com/scikit-learn/scikit-learn/pull/30846
- [name=Antoine]
    - still investigating sample_weight and metadata routing
    - found two issues 30818 and 30817
- [name=Arturo]
    - Experimented a bit with stratify on X: [issue #26821](https://github.com/scikit-learn/scikit-learn/issues/26821)
- [name=Vincent]
    - [skrub] iter on Recipe

### Discussion points

-

## 2025-02-03

### Progress reports

- [name=Olivier] & [name=Shruti]
    - Progress on comprehensive deterministic and stochastic estimator testing for correct use of `sample_weight`:
        - https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb
        - Still need proper way to test clustering algorithms and simpler handling of transformers
    - Summary:
        ```
        ✅ 19 passed the deterministic test
        ❌ 4 failed the deterministic test
        ✅ 14 passed the statistical test
        ❌ 17 failed the statistical test
        ❌ 5 other errors
        ⚠ 112 lack sample_weight support
        ```
    - Next: plan to give feedback on stratification, array API, display PRs issue/PR:
        - calibration binning / uncertainty https://github.com/scikit-learn/scikit-learn/issues/30664
    - Will come to Paris soon (Wednesday and Thursday)
- [name=Loïc]
    - will be in Paris Tuesday - Thursday
    - Remove 10 year old tutorial links (1 more approval needed) https://github.com/scikit-learn/scikit-learn/pull/30724
    - Use OpenML dataset description for download URL: https://github.com/scikit-learn/scikit-learn/pull/30708
    - metrics always return Python floats Jérémie's PR (merged): https://github.com/scikit-learn/scikit-learn/pull/30575
    - end of the OpenML saga (still using scikit-learn/examples for one parquet file)
    - indexing of older versions doc by search engines was due to switch to sphinx-pydata-theme. It seems to have been fixed by using canonical link (Tim's PRs)? Double-check with your favourite search engine!
    - social media links have been updated
    - gave opinions on scikit-image JupyterLite interactive doc https://github.com/scikit-image/scikit-image/pull/7644 and moving sphinx-gallery JupyterLite functionality to jupyterlite-sphinx https://github.com/sphinx-gallery/sphinx-gallery/issues/1427
    - joblib security reports on Huntr opened by the same person. 3 marked as spam (not by me), replied to the last one
    - joblib sprint @ Inria Paris wed. 26/ thu. 27 February. Started to collect good issues in a [Github project](https://github.com/orgs/joblib/projects/1?query=sort%3Aupdated-desc+is%3Aopen), feel free to add some!
- [name=Stefanie]
    - [ENH Array API support for confusion_matrix converting to numpy array](https://github.com/scikit-learn/scikit-learn/pull/30562)
        - second suggestion of how to solve pandas extension dtype failure
    - some doc reviews
    - [issue](https://github.com/orgs/community/discussions/150314) at Github about inconsistent search functionality
    - continued with Guillaume's Traces course (trees and bagging)
- [name=Arturo]
    - Helped contributor of [#30740 DOC Add drawings to demonstrate Pipeline, ColumnTransformer, and FeatureUnion](https://github.com/scikit-learn/scikit-learn/pull/30740) with her setup
    - [Jupyterlab kernel crash after page refresh](https://github.com/jupyterlab/jupyterlab/issues/16059)
- [name=Antoine]
    - fix sample weight in GridSearch
        - draft PR forward sample weight to the scorer https://github.com/scikit-learn/scikit-learn/pull/30743
        - need to investigate when metadata is enabled
    - reviews hazardous
- [name=Guillaume]
    - mainly work on `skore` library with brainstorming with [name=Adrin]
    - attended FOSDEM
- [name=Vincent]
    - [skrub]
        - Released 0.5.1 (adding StringEncoder and fixes for the datasets fetcher)
        - P16 conference in Paris to collect feedback about TableReport, tabular_learner and the recipe
        - Testing the recipe on a few examples, we want to release this thing soon
    - [hazardous] metrics PRs are moving forward thanks to @Antoine
        - Slight revamp of the C-index metric
        - Enhance the accuracy in time

### Discussion points

- [name=Guillaume] I confirm that the Kagi search engine seems to have the same behaviour as Google and points to 1.6.1
- [name=Loïc] Stefanie's Github search issue: probably an alternative way to do what you want. Likely due to us switching to "new-style issues" (or whatever it is called with sub-issues)
- [name=Loïc] JupyterLab/JupyterLite issue, do you have a way to reproduce?
- [name=Arturo] JupyterLite crash on the scikit-learn.org/stable examples.
    - error in the JS console of Firefox when running the first cell of an example with import statements `TypeError: _query_package() got multiple values for argument 'index_urls'`
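
For reference, this class of `TypeError` is raised by Python whenever the same parameter receives a value both positionally and by keyword. The sketch below reproduces the message with a hypothetical `query_package` function; it is not the real `_query_package` signature, and the notes do not establish what actually triggers the error in JupyterLite.

```python
def query_package(name, index_urls=None):
    """Hypothetical stand-in, NOT the real `_query_package` signature."""
    return name, index_urls


try:
    # The second positional argument already binds to `index_urls`, so
    # passing `index_urls` again by keyword is rejected by Python.
    query_package("some-package", ["https://example.invalid/simple"], index_urls=None)
except TypeError as exc:
    message = str(exc)
    print(message)
```

A mismatch between a caller and a callee whose signature changed between versions is one common way to end up in this situation, so comparing the versions of the packages involved may be a useful first step when trying to reproduce.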