# Scikit-learn bi-weekly progress status (even weeks)
**Goal**: internal communication on recent work and short term planning
**Who**: people working on the maintenance of scikit-learn and related projects, in particular at probabl and Inria, and possibly others
**Frequency**: every other Monday at 15:00 CET/CEST, unless it happens on the same day as the Monthly meeting.
**Where**: https://meet.google.com/xdm-ozyn-pgj
**Meeting notes**: to be archived on the scikit-learn org repo
## Next meeting templates
**Rules of the game:**
- No questions during the progress reports
- Add those questions to the discussion points instead
### Progress reports
- [name=Someone]
- item
- item
### Discussion points
- ...
## 2025-03-17
### Progress reports
- [name=Arturo]
- DOC Rework voting classifier example [#30985](https://github.com/scikit-learn/scikit-learn/pull/30985)
- [name=Shruti]
- Working on test using scores for sample weight invariance in non-deterministic estimators PR [#18](https://github.com/snath-xoc/sample-weight-audit-nondet/pull/18).
- Working on the Binmapper in HGBT PR [#53593](https://github.com/scikit-learn/scikit-learn/pull/53593), similar to `KBinsDiscretizer` but a bit more complicated since it breaks several tests
- Started mini comparison of scikit-learn Gaussian Processes vs. GPytorch, maybe something to try out skore with?
- [name=Stefanie]
- fix testing and routing for dynamic method selection in PR [FEA Add metadata routing through predict methods of BaggingClassifier and BaggingRegressor](https://github.com/scikit-learn/scikit-learn/pull/30833)
- fairlearn PR [MNT Refactor _validate_and_reformat_input](https://github.com/fairlearn/fairlearn/pull/1527)
- fairlearn PR [MNT Narwhalify reductions.ErrorRate](https://github.com/fairlearn/fairlearn/pull/1526)
- a lot of doc reviews
- [name=Olivier]
- First pass of review on default policy `sample_weight` meta-data routing
- https://github.com/scikit-learn/scikit-learn/pull/30946
- is `sample_weight` special or should we always request all metadata?
- if yes, this could be a justification to release scikit-learn 2.0 if this causes breaking changes?
- would require studying the impact on downstream libraries to make a decision
- Investigated with Antoine why decision trees do not pass the `sample_weight` repetition equivalence tests
- rounding errors cause different choices of splits with near-tied impurity improvements when running the stochastic tests
- rounding errors cause small but non-uniform offsets in leaf values in shallow trees (10 to 100x machine epsilon)
- need to follow up by opening issues to document the bias induced by the current code when the data leads to (exact or near) tied splits
- current handling of tied splits leads to a hard-to-understand inductive bias when inspecting the prediction function of learned decision trees
- a potential solution would involve randomized near-tie breaking in trees, after which our stochastic `sample_weight` equivalence tests should pass
- alternatively we would need to change the way we write the `sample_weight` equivalence tests.
- A few reviews to get loky 3.5.0 out.
- Science: I read the [TabPFNv2 paper](https://www.nature.com/articles/s41586-024-08328-6) and now reading the [TabICL preprint](https://arxiv.org/abs/2502.05564).
- [name=Loïc]
- last week: open-source in academia conference + scikit-learn triage
- scikit-learn
- bump to Python 3.10 https://github.com/scikit-learn/scikit-learn/pull/30895
- WIP GaussianMixture array API support with Stefanie: https://github.com/scikit-learn/scikit-learn/pull/30777
- fix CircleCI doc lock-files update with `PYTHONNOUSERSITE` https://github.com/scikit-learn/scikit-learn/pull/31006
- scikit-learn MOOC: Fix Binder link with js on page load (merged): https://github.com/INRIA/scikit-learn-mooc/pull/807
- [name=Antoine]
- investigated why `GradientBoosting` fails the `sample_weight` equivalence check
- root cause is tied splits in `DecisionTree` investigated with Olivier
- meeting with Olivier, Shruti and Jérémie on adapting the statistical test
- this week: metadata routing and multiclass brier score
- [name=Guillaume]
- last week:
- open-source in academia conference
- some discussions with Dea and Lucy
- next week:
- PyData Milan meetup
- [name=Adrin]
- https://github.com/scikit-learn/scikit-learn/pull/30859 for `LogisticRegressionCV.score`
- https://github.com/scikit-learn/scikit-learn/pull/30946 for default routing
- https://github.com/scikit-learn/scikit-learn/pull/30990 for a better `refit=callable` example and plotting
- Triage this week
- [name=Jérémie]
- Set up automated release with github actions for threadpoolctl and loky
- publish to test pypi on push to main for `.dev0` versions
- publish to pypi when a tag is pushed for actual releases
- still need to manually release on conda-forge
- released threadpoolctl 3.6.0
- released loky 3.5.0
- [name=Gael Varoquaux]
- Went over scikit-learn + ColumnTransformer / pandas questions on stackoverflow
- PRs/contributions to list skrub in various parts of the ecosystem (pandas, polars, scikit-learn)
### Discussion points
- Bump to Python 3.10: opinions are roughly split between plan 1 (oldest minor `X.Y` with Python 3.10 wheels), plan 2 (oldest bugfix `X.Y.Z` with Python 3.10 wheels) and "both are fine". https://github.com/scikit-learn/scikit-learn/pull/30895
- Moving meta-data routing (sample weights) to more mandatory
- The challenge is when we add metadata routing to something, e.g. a scorer, in which case it leads to a change in statistical behavior
- [name=Guillaume] `SearchCV`:
- I'm under the impression that we should improve the validation curve
- Still thinking about the parallel coordinate plot (`skrub` does some stuff in this area (wip))
- [name=Adrin] Copilot context hacks
- [name=Adrin] SBOMs, GH's action, minimal starting point
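
As a side note on the tied-split investigation reported by Olivier and Antoine above, here is a minimal pure-Python sketch of the two ideas involved. It is a toy illustration under stated assumptions, not scikit-learn's tree code: `weighted_sse`, `split_gain` and `pick_split` are hypothetical helpers, the data is made up, and the `rel_tol` threshold is an arbitrary choice. The first part checks that the gain of a split computed with integer weights agrees with the gain computed on repeated rows only up to rounding. The second part shows what randomized near-tie breaking could look like.

```python
import math
import random


def weighted_sse(y, w):
    """Weighted sum of squared errors around the weighted mean."""
    total = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))


def split_gain(x, y, w, threshold):
    """Decrease in weighted SSE obtained by splitting at x <= threshold."""
    left = [(yi, wi) for xi, yi, wi in zip(x, y, w) if xi <= threshold]
    right = [(yi, wi) for xi, yi, wi in zip(x, y, w) if xi > threshold]
    children = sum(weighted_sse(*zip(*side)) for side in (left, right))
    return weighted_sse(y, w) - children


def pick_split(gains, rng, rel_tol=1e-9):
    """Pick uniformly at random among thresholds whose gain is within
    rel_tol of the best gain (hypothetical near-tie breaking rule)."""
    best = max(gains.values())
    near_ties = sorted(
        t for t, g in gains.items() if math.isclose(g, best, rel_tol=rel_tol)
    )
    return rng.choice(near_ties)


# Made-up toy data with integer sample weights.
x = [0.1, 0.4, 0.5, 0.9, 1.3]
y = [1.0, 1.2, 0.7, 2.1, 1.9]
w = [3, 1, 2, 1, 4]

# Same data where each row is repeated w[i] times with unit weight.
x_rep = [xi for xi, wi in zip(x, w) for _ in range(wi)]
y_rep = [yi for yi, wi in zip(y, w) for _ in range(wi)]
w_rep = [1] * len(x_rep)

gain_weighted = split_gain(x, y, w, threshold=0.5)
gain_repeated = split_gain(x_rep, y_rep, w_rep, threshold=0.5)

# Equal in exact arithmetic, but only guaranteed close in floating point.
print(gain_weighted, gain_repeated)
print(math.isclose(gain_weighted, gain_repeated, rel_tol=1e-9))

# Two candidate thresholds whose gains differ only by rounding noise:
# a plain argmax would always return 0.9, the rule below picks either.
gains = {0.1: 1.0, 0.5: 2.0, 0.9: 2.0 + 1e-15}
print(pick_split(gains, random.Random(0)))
```

With a plain argmax, a difference of a few machine epsilons decides the split deterministically, so the weighted and repeated fits can diverge. Drawing among near-ties makes the choice a random variable with the same distribution in both settings, which is what a stochastic equivalence test compares.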
## 2025-03-03
### Progress reports
- [name=Guillaume]
- Working upstream (`joblib`) in order to propagate configuration from driver (main process) to workers: https://github.com/joblib/joblib/pull/1668
- Hitting issue with backward compatibility for people that already implement the trick, e.g. scikit-learn
- [name=Arturo]
- [Update wikipedia article for scikit-learn](https://github.com/scikit-learn/scikit-learn/issues/30907)
- [name=Olivier]
- Design session on default metadata routing for `sample_weight` with Antoine, Adrin, Stefanie and Jeremie
- joblib / sprint
- refactored the loky process spawning to avoid raising the `DeprecationWarning` on manual `os.fork` and fix a crash on macOS: https://github.com/joblib/loky/pull/429
- revived enabling `faulthandler` for loky workers by default https://github.com/joblib/loky/pull/419
- we should be able to release loky soon
- several reviews/merges and more to come
- a bit of follow-up on:
- `sample_weight` entry in glossary: https://github.com/scikit-learn/scikit-learn/pull/30564
- issue with setting the number of threads in gradient boosting https://github.com/scikit-learn/scikit-learn/issues/30662
- [name=Stefanie]
- scikit-learn
- swapped test data for tests in PR [MNT _weighted_percentile supports np.nan values](https://github.com/scikit-learn/scikit-learn/pull/29034)
- joblib
- PR [ENH add auto .gitignore in the Memory cache folder](https://github.com/joblib/joblib/pull/1674)
- PR [DOC update README with up-to-date installation instruction](https://github.com/joblib/joblib/pull/1663)
- PR [DOC include cpu_count in built docs](https://github.com/joblib/joblib/pull/1664)
- PR [DOC add info on pre-commit to README](https://github.com/joblib/joblib/pull/1678)
- issue [Path to cache in Memory should not depend on input type of location param](https://github.com/joblib/joblib/issues/1671)
- reviewed
- https://github.com/scikit-learn/scikit-learn/pull/30876
- https://github.com/scikit-learn/scikit-learn/pull/30882
- [name=jeremie] (off)
- joblib sprint -> loky sprint
- migrated CI from azure pipelines to github actions. faster than expected
- dropped support for PyPy, Python 3.7 and Python 3.8.
- fixed `cpu_count` on recent windows versions and "exotic" Linux systems.
- triage
- attended drafting sample_weight × metadata routing
- [name=Adrin]
- A bit of `joblib` sprint work
- numpy.unique merged: https://github.com/numpy/numpy/pull/26018
- work on metadata routing in scikit-learn
- default routing
- GS routing of sample weight
- https://github.com/adrinjalali/agents-to-block to collect accounts people want to block (LLM spam)
- [name=Antoine]
- [FIX Forward sample weight to the scorer in grid search](https://github.com/scikit-learn/scikit-learn/pull/30743) ready for review/merge
- [name=Gaël]
- [skrub] [improved memory and time of skrub.StringEncoder](https://github.com/skrub-data/skrub/pull/1248) 2x memory, 1.5x time
- [skrub] [gave a vscode only presentation on skrub](https://github.com/GaelVaroquaux/skrub_presentation_2025), including the latest features (Recipe / Expressions)
- [name=Vincent]
- [skrub] Extensively test the expression API/Recipe to catch some gotchas and bugs
- Iterate on the documentation of the Recipe
- [hazardous] iteration on the documentation of the C-index PR
### Discussion points
- [name=Olivier] Wikipedia editing
- [name=Adrin] balancing priorities, comms with the outside, expectations from other maintainers
- [name=Olivier] inconsistent handling of location in `joblib.Memory`
- [name=Gael] what can we learn / borrow from the numpy review process? How do we change?
- One good reviewer sufficient for merge
- Suggestion: define what needs two reviews (e.g. no changelog => 1 review sufficient; new API => 2 reviews needed). Requires more trust in reviewers
- Maybe introduce a time limit after which a single maintainer's approval is enough to merge.
- PR triaging, close more PRs
- Revived triaging meeting to triage PRs and close some to help focus attention
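
As a side note on Guillaume's progress report above about propagating configuration from the driver to the workers, the "trick" can be sketched with the standard library only. This is a simplified illustration under stated assumptions, not joblib's or scikit-learn's actual implementation: `get_config`, `config_context` and `with_driver_config` are hypothetical helpers, `assume_finite` is only used as an example key, and threads stand in for worker processes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

_default_config = {"assume_finite": False}
_local = threading.local()  # each thread sees its own configuration


def get_config():
    """Return a copy of the configuration visible in the current thread."""
    return dict(getattr(_local, "config", _default_config))


@contextmanager
def config_context(**new_config):
    """Temporarily override the configuration in the current thread."""
    old = get_config()
    _local.config = {**old, **new_config}
    try:
        yield
    finally:
        _local.config = old


def with_driver_config(func):
    """Capture the driver's config now and re-apply it where func runs."""
    captured = get_config()

    def wrapper(*args, **kwargs):
        with config_context(**captured):
            return func(*args, **kwargs)

    return wrapper


def task():
    """Report the value of the example key as seen by the worker."""
    return get_config()["assume_finite"]


with config_context(assume_finite=True):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Without wrapping, the worker only sees the default configuration.
        naive = pool.submit(task).result()
        # With wrapping, the driver's configuration travels with the task.
        propagated = pool.submit(with_driver_config(task)).result()

print(naive, propagated)
```

The backward compatibility concern mentioned in the notes arises when a library already wraps its tasks this way and the parallel backend starts propagating configuration itself: the two mechanisms then need to agree on which one wins.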
## 2025-02-17
### Progress reports
- [name=Jérémie]
- sample weight debugging
- For MiniBatchKMeans. Several issues in different parts of the estimator (fit, convergence checking, ...)
https://github.com/scikit-learn/scikit-learn/pull/30751
Raises questions regarding the equivalence properties we want.
- For the statistical test. Spurious p-value constantly equal to 1
https://github.com/snath-xoc/sample-weight-audit-nondet/issues/14
- [name=Stefanie]
- PR [FEA Add metadata routing through predict methods of BaggingClassifier and BaggingRegressor](https://github.com/scikit-learn/scikit-learn/pull/30833)
- fairlearn: PR [DOC add example for using ErrorRate](https://github.com/fairlearn/fairlearn/pull/1502)
- reviewed several DOC PRs
- [name=Shruti]
- Fully back from teaching in Cape Town (put out a good work)
- Working on expanding sample weight testing to clustering algorithms and using a score based equivalence test: https://github.com/snath-xoc/sample-weight-audit-nondet
- Implemented fix for spurious 1 values found due to construction of `np.random.choice` (thank you Jeremie)
- PR https://github.com/scikit-learn/scikit-learn/pull/30751, need to further discuss issues with scaling of weights, sometimes optimisation problem is not well-defined
- [name=Loïc]
- Use OpenML download URL from OpenML metadata: https://github.com/scikit-learn/scikit-learn/pull/30708
- Download parquet file from OpenML in example: https://github.com/scikit-learn/scikit-learn/pull/30824
- inputs on [public website analysis](https://views.scientific-python.org/scikit-learn.org) https://github.com/scikit-learn/scikit-learn/issues/30815 ?
- WIP GaussianMixture Array API with Stefanie still ongoing: https://github.com/scikit-learn/scikit-learn/pull/30777
- JupyterLite:
+ issue with polars and parquet file: https://github.com/pola-rs/polars/issues/20876
+ issue with CSV and polars.read_csv https://github.com/jupyterlite/jupyterlite/issues/1576
- sphinx-gallery API doc duplicated links to example:
https://github.com/scikit-learn/scikit-learn/pull/30822 and
https://github.com/skrub-data/skrub/pull/1239
- Github actions for arm64 CI (not using CirrusCI anymore): https://github.com/scikit-learn/scikit-learn/pull/30797
- joblib triage in preparation of the sprint 26-27 @ Inria Paris
- mybinder.org may become more stable in the future: https://github.com/scikit-learn/scikit-learn/pull/30697#issuecomment-2659881848
- [name=Dea]
- Comments welcome here (get_params() html): PR https://github.com/scikit-learn/scikit-learn/pull/30763
- Working on https://github.com/scikit-learn/scikit-learn/pull/30846
- [name=Antoine]
- still investigating sample_weight and metadata routing
- found two issues 30818 and 30817
- [name=Arturo]
- Experimented a bit with stratify on X: [issue #26821](https://github.com/scikit-learn/scikit-learn/issues/26821)
- [name=Vincent]
- [skrub] iter on Recipe
### Discussion points
-
## 2025-02-03
### Progress reports
- [name=Olivier] & [name=Shruti]
- Progress on comprehensive deterministic and stochastic estimator testing for correct use of `sample_weight`:
- https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb
- Still need proper way to test clustering algorithms and simpler handling of transformers
- Summary:
```
✅ 19 passed the deterministic test
❌ 4 failed the deterministic test
✅ 14 passed the statistical test
❌ 17 failed the statistical test
❌ 5 other errors
⚠ 112 lack sample_weight support
```
- Next: plan to give feedback on stratification, array API and display issues/PRs:
- calibration binning / uncertainty https://github.com/scikit-learn/scikit-learn/issues/30664
- Will come to Paris soon (Wednesday and Thursday)
- [name=Loïc]
- will be in Paris Tuesday - Thursday
- Remove 10 year old tutorial links (1 more approval needed) https://github.com/scikit-learn/scikit-learn/pull/30724
- Use OpenML dataset description for download URL: https://github.com/scikit-learn/scikit-learn/pull/30708
- metrics always return Python floats Jérémie's PR (merged): https://github.com/scikit-learn/scikit-learn/pull/30575
- end of the OpenML saga (still using scikit-learn/examples for one parquet file)
- indexing of older versions doc by search engines was due to switch to sphinx-pydata-theme. It seems to have been fixed by using canonical link (Tim's PRs)? Double-check with your favourite search engine!
- social media links have been updated
- gave opinions on scikit-image JupyterLite interactive doc https://github.com/scikit-image/scikit-image/pull/7644 and moving sphinx-gallery JupyterLite functionality to jupyterlite-sphinx https://github.com/sphinx-gallery/sphinx-gallery/issues/1427
- joblib security reports on Huntr opened by the same person. 3 marked as spam (not by me), replied to the last one
- joblib sprint @ Inria Paris wed. 26/ thu. 27 February. Started to collect good issues in a [Github project](https://github.com/orgs/joblib/projects/1?query=sort%3Aupdated-desc+is%3Aopen), feel free to add some!
- [name=Stefanie]
- [ENH Array API support for confusion_matrix converting to numpy array](https://github.com/scikit-learn/scikit-learn/pull/30562)
- second suggestion of how to solve pandas extension dtype failure
- some doc reviews
- [issue](https://github.com/orgs/community/discussions/150314) at Github about inconsistent search functionality
- continued with Guillaume's Traces course (trees and bagging)
- [name=Arturo]
- Helped contributor of [#30740 DOC Add drawings to demonstrate Pipeline, ColumnTransformer, and FeatureUnion](https://github.com/scikit-learn/scikit-learn/pull/30740) with her setup
- [Jupyterlab kernel crash after page refresh](https://github.com/jupyterlab/jupyterlab/issues/16059)
- [name=Antoine]
- fix sample weight in GridSearch
- draft PR forward sample weight to the scorer https://github.com/scikit-learn/scikit-learn/pull/30743
- need to investigate when metadata is enabled
- reviews hazardous
- [name=Guillaume]
- mainly work on `skore` library with brainstorming with [name=Adrin]
- attended FOSDEM
- [name=Vincent]
- [skrub]
- Released 0.5.1 (adding StringEncoder and fixes for the datasets fetcher)
- P16 conference in Paris to collect feedback about TableReport, tabular_learner and the recipe
- Testing the recipe on a few examples, we want to release this thing soon
- [hazardous] metrics PRs are moving forward thanks to @Antoine
- Slight revamp of the C-index metric
- Enhance the accuracy in time
### Discussion points
- [name=Guillaume] I confirm that the Kagi search engine seems to have the same behaviour as Google and points to 1.6.1
- [name=Loïc] Stefanie's Github search issue: there is probably an alternative way to do what you want. Likely due to us switching to "new-style issues" (or whatever it is called with sub-issues)
- [name=Loïc] JupyterLab/JupyterLite issue, do you have a way to reproduce?
- [name=Arturo] JupyterLite crash on the scikit-learn.org/stable examples.
- error in the JS console of firefox when running the first cell of an example with import statements `TypeError: _query_package() got multiple values for argument 'index_urls'`