# Scikit-learn bi-weekly progress status (even weeks)

**Goal**: internal communication on recent work and short-term planning

**Who**: people working on the maintenance of scikit-learn and related projects, in particular at probabl and Inria, and maybe others

**Frequency**: every other Monday at 15:00 CET/CEST, unless it happens on the same day as the Monthly meeting.

**Where**: https://meet.google.com/xdm-ozyn-pgj

**Meeting notes**: to be archived on the scikit-learn org repo

## Next meeting template

**Rules of the game:**

- No questions during the progress reports
- Add those questions to the discussion points

### Progress reports

- ...

### Discussion points

- ...

## 2024-05-13

### Progress report

- [name=Jérémie]
  - released 1.5.0rc1
  - fix wheel builder windows https://github.com/scikit-learn/scikit-learn/pull/29006
  - release highlights 1.5 https://github.com/scikit-learn/scikit-learn/pull/29007
    Please add sections for the features you want.
    - TunedThresholdClassifier :+1: [name=Gael] ?
    - metadata routing progress ?
    - PCA speed improvements for `n_samples >> n_features` :+1: [name=Gael] ?
- [name=Adrin]
  - Reviews (a course has students sending PRs)
  - numpy hash based unique => speed in types of target, scores, confusion matrices
  - Pipeline's `transform_input`: [#28901](https://github.com/scikit-learn/scikit-learn/pull/28901)
  - Tags: starting the work with [#28927](https://github.com/scikit-learn/scikit-learn/pull/28927)
    - Goal: make them public mid-term
  - Persistence doc revamp: [#28889](https://github.com/scikit-learn/scikit-learn/pull/28889)
  - HalvingSearchCV: [#28320](https://github.com/scikit-learn/scikit-learn/pull/28320)
    - Edge it toward leaving experimental status
    - We still need to work on the API side for usability
- [name=Stefanie]
  - metadata routing for Stacking* [#28701](https://github.com/scikit-learn/scikit-learn/pull/28701) finished
  - Metadata routing for learning_curve [#28975](https://github.com/scikit-learn/scikit-learn/pull/28975) started
  - nans in SplineTransformer [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043)
    - reviews with Olivier
    - especially defining test cases
  - check_scoring() has raise_exc [#28992](https://github.com/scikit-learn/scikit-learn/pull/28992)
    - `raise_exc=False` for multimetric scoring
    - add test
- [name=Olivier]
  - Reviews on nan handling for SplineTransformer [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043)
  - Discussed `RANSACRegressor`'s `sample_weight` handling with Shruti
  - Some testing of conda-forge's 1.5.0rc1 on macOS m1
  - Follow-up / reviews on Array API PRs
  - Some scipy-dev CI maintenance
- [name=Shruti]
  - Opened pull request to add sample weights to RANSAC estimator [#3](https://github.com/probabl-ai/scikit-learn-exercise-snath-xoc/pull/3)
  - Follow-up on ransac specific tests (e.g.
    score, default residual threshold and exceeding max skips tests)
- [name=Arturo]
  - Some [mooc PRs](https://github.com/INRIA/scikit-learn-mooc/pulls)
- [name=Jérôme]
  - was away for most of April
  - mostly working on skrub: added selectors #895, more tests now run with polars #896 #903 PRs by Theo, refactoring/fixing the [TableVectorizer](https://github.com/skrub-data/skrub/pull/902)
  - addressed most of the review of the LabelBinarizer array API [PR](https://github.com/scikit-learn/scikit-learn/pull/28626), now updating the RidgeCV PR
  - a bit more on the skrub Recipe/PipelineBuilder/... and [skrubview](https://github.com/skrub-data/skrubview)
- [name=Loïc]
  + Complete Meson entry for scikit-learn 1.5 changelog https://github.com/scikit-learn/scikit-learn/pull/29008. PR just opened, feedback and discussions welcome! Current plan: drop setuptools support in scikit-learn 1.6 (but maybe rename `setup.py` -> `_setup.py` like Scipy did, without testing it, and do the full drop in scikit-learn 1.7; this was useful, for example, for the Pyodide scipy package).
  + some free-threading (aka nogil) work is happening in the Scientific Python ecosystem, quite excited about this! For example https://github.com/scikit-learn/scikit-learn/issues/28978. I added a free-threaded label to help track this.
- [name=Gael]
  + Moving contributors to emeritus: answers from all but one
  + More research than software, but the paper behind hazardous (survival models) is moving beautifully forward

### Discussion points

- re conda-forge build @ogrisel: https://github.com/conda-forge/scikit-learn-feedstock/pull/258
- re: array API: label binarizer's diff
- dropping setuptools support in scikit-learn 1.6 (or 1.7 with `_setup.py` approach): any quick opinion?

## 2024-04-29

No bi-weekly progress status in favor of the monthly meeting.

## 2024-04-15

**Rules of the game:**

- No questions during the progress reports
- Add those questions to the discussion points

### Progress reports

- [name=Jérémie]
  - released 1.4.2 (numpy 2 support)
  - clean-up deprecations for 1.5
  - fix for ColumnTransformer in parallel (https://github.com/scikit-learn/scikit-learn/pull/28822)
    But there's a bigger issue (https://github.com/scikit-learn/scikit-learn/issues/28824)
    - currently: bump threadpoolctl min version but issues with the benchmarks and profilers.
- [name=Stefanie]
  - discussing RecursionError bug with Adrin: https://github.com/scikit-learn/scikit-learn/pull/28712
  - metadata routing for predict in StackingClassifier and StackingRegressor: https://github.com/scikit-learn/scikit-learn/pull/28701
    - issue with `_records` attribute on the final estimator
  - setting up new laptop
- [name=Adrin]
  - triage this week
  - working on transforming metadata in pipeline (early stopping related)
- [name=Guillaume]
  - CZI EOSS6 submission
  - Review of a couple of PRs
  - Couple of discussions related to the release
- [name=Loïc]
  + More Meson follow-up
    - Fix more build dependencies https://github.com/scikit-learn/scikit-learn/pull/28821 (aka no-OpenMP build-time failures)
    - reported OpenMP detection on Apple Clang: https://github.com/mesonbuild/meson/issues/7435#issuecomment-2047585466 linked to misleading warning https://github.com/scikit-learn/scikit-learn/issues/28710
    - Weird recompilation issue: https://github.com/scikit-learn/scikit-learn/issues/28837
    - adapted sdist check workflow: https://github.com/scikit-learn/scikit-learn/pull/28757
  + Triage last week
- [name=Arturo]
  - Some reviews (I still have to look at [#27357](https://github.com/scikit-learn/scikit-learn/pull/27357));
  - accept `d2_absolute_error_score` as named scorer [#28836](https://github.com/scikit-learn/scikit-learn/pull/28836).

### Discussion points

- X_val
- What should be fixed regarding Meson for building the wheel for 1.5.
  See https://github.com/scikit-learn/scikit-learn/pull/28757#issuecomment-2034638954 and https://github.com/scikit-learn/scikit-learn/pull/28757#issuecomment-2037832722 where I did the comparison between sdist

## 2024-03-18

### Progress reports

Rules of the game:

- No questions during this part
- If you have questions / comments, write them in the second part

Progress:

- [name=Loïc]
  + meson as main build backend [#28506](https://github.com/scikit-learn/scikit-learn/pull/28506). pyproject.toml changes needed for the wheels need to be reviewed. Tested conda-forge in https://github.com/conda-forge/scikit-learn-feedstock/pull/250.
  + conda-lock update: https://github.com/scikit-learn/scikit-learn/pull/28653
- [name=Olivier]
  - Triaging duty this week.
  - Array API reviews:
    - array-api-strict [#28555](https://github.com/scikit-learn/scikit-learn/pull/28555) - ready for second review
    - train_test_split [#28407](https://github.com/scikit-learn/scikit-learn/pull/28407) - ready for second review
    - ridge [#27800](https://github.com/scikit-learn/scikit-learn/pull/27800) - first pass
    - label binarizer [#28626](https://github.com/scikit-learn/scikit-learn/pull/28626) - WIP:
      - needed for RidgeClassifierCV [#27961](https://github.com/scikit-learn/scikit-learn/pull/27961)
      - strategy: accept the Array API namespace but continue building a scipy sparse data structure with numpy components before the final conversion to an Array API container, to keep the code minimally changed
      - alternative would be to implement the code directly for a dense allocated structure using Array API directly, but not sure it's worth the extra code branch
  - Had a look at Thomas' prototype for a cirun config to run CUDA tests in [#24491](https://github.com/scikit-learn/scikit-learn/issues/24491#issuecomment-1999663173)
    - Security concerns should be cleared w.r.t. VM isolation
    - Still questions about the proper dispatch mechanism to use to allow maintainers to trigger the CI on external contributors' PRs.
  - Started to have a look at scipy's BSpline NaN handling to simplify the code changes needed to add missing value support to `SplineTransformer` [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043)
    - need to write a minimal reproducer to discuss the issue with upstream (but paused because of triaging duty)
    - we can proceed with more complex code in scikit-learn otherwise.
- [name=Stefanie]
  - nan support for SplineTransformer [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043): restructure (missingness indicator computed lazily) and computing bsplines of sparse arrays with workaround
  - UnsetMetadataPassedError [#28517](https://github.com/scikit-learn/scikit-learn/pull/28517): parent param doesn't need to be passed
  - learning:
    - 1) concurrent and parallel programming
    - 2) serialization
- [name=Jérémie] (was off last week)
  - reviewed and finalized some old but almost ready PRs.
    - currently on this one (allow string input in `pairwise_distances`). Discussed with Guillaume for the final details but almost ready. https://github.com/scikit-learn/scikit-learn/pull/27456
  - still cleaning up `utils.__init__`. Only a few utils left.
  - Read about averaging ROC curves for Guillaume's PR https://github.com/scikit-learn/scikit-learn/pull/25939 (ROC curve display from cv results) "On Averaging ROC Curves", Jack Hogan, Niall M.
    Adams, https://openreview.net/forum?id=FByH3qL87G https://github.com/scikit-learn/scikit-learn/pull/25939
- [name=Arturo]
  - Description of `l2_regularization` in docstrings and user guide [#28652](https://github.com/scikit-learn/scikit-learn/pull/28652)
  - Mooc reviews
- [name=Guillaume]
  - Triaging issues and PRs
    - fixing CIs: [#28636](https://github.com/scikit-learn/scikit-learn/pull/28636)
    - closing a couple of old PRs
    - giving feedback on newly opened ones
  - Discussion regarding `skrub` design
- [name=Jérôme]
  - mostly skrub
    - pipeline (recipe) builder with convenient way of specifying param grid
    - now starting to split out PR for column selectors
  - array api: updated ridgecv pr after r2 score was merged; opened LabelBinarizer pr

### Discussion points

- meson main build backend: do we want to merge soonish or do we want to wait, e.g. for
  + meson-python 0.16 release (not yet out) with some quality of life enhancements, https://github.com/mesonbuild/meson-python/pull/594 and https://github.com/mesonbuild/meson-python/pull/569 (need Pytest<8 for now if you use `pytest --pyargs sklearn`)
  + Numpy 2 release (and scikit-learn 1.4.2 for Numpy 2 compatibility), to avoid having too many impactful changes at the same time ("what could possibly go wrong?")
  + other things?
  - [name=Guillaume] I think we should wait because we did not take some potential changes in 1.4.1 so it would be easier when branching 1.5
  - Conclusion: Merge the PR (and check the pyproject.toml) to have it in main, but do not backport it for the 1.4.2 release.
- Average ROC: https://openreview.net/forum?id=FByH3qL87G
  - PR https://github.com/scikit-learn/scikit-learn/pull/25939
  - we need to make an editorial choice regarding the method

## 2024-03-04

### General topics

### Progress reports

- [name=Stefanie]
  - preparing talk for PyLadies meetup: „How to start contributing to open source?“
  - [PR UnsetMetadataPassedError](https://github.com/scikit-learn/scikit-learn/pull/28517)
    - display which method metadata request is not set and from where it was called
    - correct error message for composite methods
    - no default value for parent param (little trick when parent is a function)
  - [PR metadata routing for FeatureUnion](https://github.com/scikit-learn/scikit-learn/pull/28205) continued
    - keep old behaviour: no routing to transform, and some little things
  - [OrthogonalMatchingPursuit](https://github.com/scikit-learn/scikit-learn/pull/28557#issuecomment-1973000754) (DocInspector): documentation and code clarity (discussion with LucyLeeow)
- [name=Olivier]
  - Array API: r2_score [#27904](https://github.com/scikit-learn/scikit-learn/pull/27904): upcasts are probably no longer needed:
    - backed by empirical study (linked in the discussion above)
    - backed by the fact that np.sum was made more stable right after the upcasts were added to scikit-learn 10 years ago
    - I plan to take over to implement this simplification and hopefully unblock:
      - [array-api-strict](https://github.com/scikit-learn/scikit-learn/pull/28555)
      - [train_test_split](https://github.com/scikit-learn/scikit-learn/pull/28407)
      - [RidgeCV](https://github.com/scikit-learn/scikit-learn/pull/27961)
      - and probably more
  - Worked on a motivating example to demonstrate the usefulness of a built-in handling of missing values in SplineTransformer:
    - https://github.com/scikit-learn/scikit-learn/issues/26793
    - first pass of review on [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043)
  - TODO: follow-up on [a study about the causes of miscalibration for logistic regression
    models](https://gist.github.com/ogrisel/8502eb455cd38d41e92fee31863ffea7)
    - next step: open meta issue to discuss the big picture about scores, tuning, documentation and small PRs on focused points.
  - TODO: review the callback SLEP, and PRs waiting for reviews and linked here.
- [name=Guillaume]
  - Review and merge the backlog of PRs related to adding examples in docstrings of public functions
    - closing https://github.com/scikit-learn/scikit-learn/issues/27982
  - TODO:
    - `TunedThresholdClassifier`: https://github.com/scikit-learn/scikit-learn/pull/26120
- [name=Adrin]
  - Triage this week!
- [name=Jérémie]
  - cleaning up `utils.__init__`. Several PRs merged, current PR moving out indexing related stuff https://github.com/scikit-learn/scikit-learn/pull/28546
  - SLEP for the callback API https://github.com/scikit-learn/enhancement_proposals/pull/90. Feedback welcome
  - merged last param validation for public function PRs
  - merged array-like / sparse matrix distinction for param validation
- [name=Loïc]
  + back from one week of holiday, some CI work/clean-ups before
  + meson as main build backend: https://github.com/scikit-learn/scikit-learn/pull/28506. Tested with `cd build`, joblib does not seem to be in the wheel dependencies, it may be good to fix this in the same PR.
- [name=Arturo]
  - iter on outlier detection [#28550](https://github.com/scikit-learn/scikit-learn/pull/28550)
  - some mooc PRs
    - [remove `head`](https://github.com/INRIA/scikit-learn-mooc/pull/766)
    - [nested-cross validation figure](https://github.com/INRIA/scikit-learn-mooc/pull/765)
    - [feedback from Camille](https://github.com/INRIA/scikit-learn-mooc/pull/764)

### Discussion

Add discussion points here, possibly live during the presentation of progress reports:

- [name=Gael] (if we have time): what do people think is the best tool to draft design documents and meeting notes / priorities (asking for skrub, where we need more note taking, and community building)?
  - Conclusion: hackmd with a note on top saying that people interested in commenting should ask for permission
- [name=Guillaume] Any thoughts on helping the SLEP on callback to go forward?
  - https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep000/proposal.html
  - Plan the next drafting meeting on callbacks.
- [name=Guillaume] Any thoughts on going forward regarding the user survey (helping Francois and Cailean)
  - First round of review in probabl.
  - Once done, check with core team.

## 2024-02-19

### General topics

### Progress reports

- General news
  - scikit-learn 1.4.1(.post1) is out!
  - Yao Xiao granted rights as a new core contributor
- [name=Olivier]:
  - Array API with devices that do not support `float64` data
    - https://github.com/scikit-learn/scikit-learn/pull/27904
  - Investigating pytest 8.0.1 (`setup_module` is not run)
  - pydata-sphinx-theme reviews to a preview branch
- [name=Arturo]
  - a bit of mooc forum
  - some reviews (mostly dropdown stuff)
  - Gael's [mediation project](https://notes.inria.fr/Uz9HaPyJQ1yDY_LUuT2f-A#) (first contact with Samuel Girard)
- [name=Adrin]
  - Documentation PRs with @Charlie-Xiao
  - SLEP6 work, and Stefanie's PRs
  - Validation set discussions with Christian
    - https://github.com/scikit-learn/scikit-learn/pull/28440
  - Fairlearn's maintenance against sklearn and pandas
- [name=Jérémie]
  - mostly threadpoolctl (released 3.3.0)
  - working on callbacks (https://github.com/scikit-learn/scikit-learn/pull/27663)
- [name=Loïc]
  + a bit of help on scikit-learn 1.4.1.post1
  + reviewed/merged some CI security PRs from Thomas.
    Context: running CI for GPU https://github.com/scikit-learn/scikit-learn/issues/24491
- [name=Gael] (reports on skrub)
  - dataframe API: giving up on that because upstream gives up -> moving to using dispatch mechanism
  - I'm figuring out how to coordinate Inria and probabl efforts on data wrangling and DB connections
- [name=Stefanie]
  - Mainly metadata routing PRs
  - For feature-union: found a gap in the tests
- [name=Guillaume]
  - Release 1.4.1.post1
  - Reviewed backlog from accumulated notifications
- [name=Jérôme]
  - skrub [#888](https://github.com/skrub-data/skrub/pull/888) (dispatch on dataframe type + fixtures for testing on both pandas & polars) can be reviewed
  - reviewed [#876](https://github.com/skrub-data/skrub/pull/876) (agg joiner improvements + addition of multiaggjoiner)
  - worked a bit more on [#877](https://github.com/skrub-data/skrub/pull/877) (addition of column-wise transformations, selectors, refactoring/fixing TableVectorizer) & prototyping transformers that rely on expressions for lazy frames; discussions with Guillaume & Olivier
  - profiling of logistic regression: it seems array api may be worthwhile for lbfgs & newton-cg

### Discussion

- `r2_score` casting problem, because some devices do not support float64:
  - We should not put this code in `r2_score` itself
  - We can have the notion of `strict` vs non `strict`
  - Maybe `r2_score` does not need float64
- validation set API:
  - internal https://github.com/scikit-learn/scikit-learn/pull/28440
  - next: open an alternative issue/PR.
  - pros:
    - easier to understand what a fixed split does
  - cons:
    - having a splitter is more convenient (better UX) to make it easy to enable early stopping.
    - data leakage only from early stopping cannot be so catastrophic(?)
- [name=Olivier] Google Meet configuration problem: I did not see notifications for external people asking to join and could not find how to configure this.
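
A minimal sketch of the "maybe `r2_score` does not need float64" point from the discussion above. It assumes plain NumPy and is not scikit-learn's implementation: `r2_in_dtype` is a hypothetical helper written for this note, and the synthetic data is an assumption too. It only compares the same R² formula evaluated fully in float32 and fully in float64.

```python
import numpy as np


def r2_in_dtype(y_true, y_pred, dtype):
    """Plain R^2 computed entirely in `dtype`, with no upcast to float64.

    Hypothetical helper for illustration; not the scikit-learn code path.
    """
    y_true = np.asarray(y_true, dtype=dtype)
    y_pred = np.asarray(y_pred, dtype=dtype)
    # Residual and total sums of squares, accumulated in the requested dtype.
    ss_res = np.sum((y_true - y_pred) ** 2, dtype=dtype)
    ss_tot = np.sum((y_true - np.mean(y_true, dtype=dtype)) ** 2, dtype=dtype)
    return float(1 - ss_res / ss_tot)


# Synthetic regression-like data (an assumption made for this sketch).
rng = np.random.default_rng(0)
y_true = rng.normal(size=100_000)
y_pred = y_true + rng.normal(scale=0.1, size=100_000)

r2_32 = r2_in_dtype(y_true, y_pred, np.float32)
r2_64 = r2_in_dtype(y_true, y_pred, np.float64)
print("float32:", r2_32, "float64:", r2_64, "abs diff:", abs(r2_32 - r2_64))
```

On well-scaled data like this the two values should agree closely; badly scaled targets or much larger arrays may behave differently, which is what the empirical study mentioned in these notes would need to cover.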

## 2024-02-06

### Gael (GaelVaroquaux)

- Still working on authorizations from Inria
- Discussions on skrub, dataframe API, and TableVectorizer with Jérôme and Guillaume

### Olivier (ogrisel)

- OS team priority planning at probabl (ongoing, needs to be made public)
- Helped with an ICML submission (strongly related to hazardous)
  - Survival model working with Gradient boosted trees
- PRs and reviews:
  - iterating on calibration curve example:
    - https://github.com/scikit-learn/scikit-learn/pull/28231
  - goal is to fix our wording on Logistic Regression calibration also in the user guide
    - https://github.com/scikit-learn/scikit-learn/pull/28171
  - GPU FAQ update:
    - https://github.com/scikit-learn/scikit-learn/pull/28328
  - rotated codecov token to fix the CI
    - https://github.com/scikit-learn/scikit-learn/pull/28361
  - tried to investigate slow tests on scipy-dev pandas update PR:
    - https://github.com/scikit-learn/scikit-learn/pull/28348
- Prototyping a fluent API to compose transformers for feature + target data engineering with iterative/interactive previews and builtin subsampling and train/test split / cross-validation handling.
  - Goal: experiment with a future vision for skrub (or maybe even scikit-learn in the longer term?)

### Guillaume

- OS team priority
- Dev:
  - reviewing/merging PR for scikit-learn 1.4.1
  - regression in tree criterion: https://github.com/scikit-learn/scikit-learn/pull/28327
  - discussion for design of `TableVectorizer` in skrub with Jérôme
- Accepted talk at PyConDE:
  - need to go to the Algolia side
  - need some work on ragger-duck

### Arturo

- a bit of mooc forum
- some reviews
- Gael's [mediation project](https://notes.inria.fr/Uz9HaPyJQ1yDY_LUuT2f-A#)
  - Goal: web UI to give intuitions on machine learning.
    Was asked a few months ago in the context of discussion with the broader society on the impact of ML on our lives
- planning to work on "advanced modules" for the mooc

### Loïc

- plenty of CI work (merged): [Pytest 8](https://github.com/scikit-learn/scikit-learn/pull/28318), [pandas 2.2 warnings](https://github.com/scikit-learn/scikit-learn/pull/28305), [pandas copy-on-write fixes](https://github.com/scikit-learn/scikit-learn/pull/28348)
- [code scanning workflow](https://github.com/scikit-learn/scikit-learn/pull/28312) (merged)
- meson editable install limitations are [getting fixed](https://github.com/mesonbuild/meson-python/pull/569) following my issues ([`__path__` empty](https://github.com/mesonbuild/meson-python/issues/568), [`pytest --pyargs` not collecting tests](https://github.com/mesonbuild/meson-python/issues/557)) and [Olivier's PR attempt](https://github.com/mesonbuild/meson-python/pull/562)
- [codecov uploader update](https://github.com/scikit-learn/scikit-learn/pull/28361) (merged)
- scipy-dev with Python 3.12 seems a lot slower (~50 minutes instead of ~25 minutes). Dataset downloads in part? Needs more investigation (seen in https://github.com/scikit-learn/scikit-learn/pull/28348)
- nogil build failing, needs Cython 3.0.8 for nogil Python: https://github.com/colesbury/nogil-wheels/pull/6
- Try out Meson `make dev-meson` and report issues, if you haven't already :wink:. See [doc](https://scikit-learn.org/dev/developers/advanced_installation.html#building-with-meson) for more details.

### Jérôme

- skrub tablevectorizer; discussions with Guillaume & Gaël
  - storing state to have consistent transformations
  - applying transformations to part of a dataframe
  - support for pandas & polars
- skrub aggjoiner pr help & start reviewing
- reading & feedback on Riccardo's data lake fishing paper, some of which may end up in skrub
  - Context: one big messy data lake (many tables, we don't quite know what's in there), the challenge is table discovery: finding the tables that might lead to joins on a target table to help predict a given target

### Jérémie

- administrative stuff
- extended flexiblas support for threadpoolctl

### Stefanie

- metadata routing FeatureUnion (https://github.com/scikit-learn/scikit-learn/pull/28205)
  - case where transformer doesn't have fit_transform: test with new mocking class
  - also added testcase for that into non-routing normal test
- metadata routing: simplify development process for routing (https://github.com/scikit-learn/scikit-learn/pull/28422)
  - MethodPair: order caller and callee
  - MethodMapping: less variation for how to define the mapping
- book: Mastering OOP
- maybe speaking at PyLadies Paris meetup

### TODO/next

- scikit-learn 1.4.1 (Guillaume)
  - calibration PR:
    - https://github.com/scikit-learn/scikit-learn/pull/28231
    - https://github.com/scikit-learn/scikit-learn/pull/28171
  - regression for decision tree: https://github.com/scikit-learn/scikit-learn/pull/28327
  - inappropriate casting for `SimpleImputer`: https://github.com/scikit-learn/scikit-learn/pull/28365
- Array API:
  - finalize `r2_score` PR:
    - https://github.com/scikit-learn/scikit-learn/pull/27904
  - Ridge, RidgeCV and maybe try Logistic Regression in parallel

### Topics/Questions to discuss at the end of the meeting

- META: how to organize this meeting:
  - sub-teams split when the OS team grows too large
  - move to a bi-weekly schedule for scikit-learn and other OS projects at probabl every other week.
  - stick to google meet