
# Scikit-learn bi-weekly progress status (even weeks)

**Goal**: internal communication on recent work and short term planning

**Who**: people working on the maintenance of scikit-learn and related projects, in particular at probabl and Inria and maybe others

**Frequency**: every other Monday at 15:00 CET/CEST, unless it happens on the same day as the Monthly meeting.

**Where**: https://meet.google.com/xdm-ozyn-pgj

**Meeting notes**: to be archived on the scikit-learn org repo

## Next meeting template

**Rules of the game:**
- No question during the progress reports
- Add those questions in the discussion reports

### Progress reports

- ...

### Discussion points

- ...

## 2024-07-22

**Rules of the game:**
- No question during the progress reports
- Add those questions in the discussion reports

### Progress reports

- [name=Loïc]
    - spin is available on `main`, please give it a spin 🥁 https://github.com/scikit-learn/scikit-learn/pull/29012:
        - `spin` shows you a nice menu when you don't remember what you can do
        - `spin install -v`
        - `spin docs --no-plot`
        - `spin test` (you can still use `pytest` if you want)
    - improve spin error message for `spin docs` when not built: https://github.com/scientific-python/spin/pull/224
    - make `spin install` verbose by default: https://github.com/scientific-python/spin/pull/225
    - reviewed and contributed test to `mean_absolute_percentage_error` array API fix for doc-min-dependencies https://github.com/scikit-learn/scikit-learn/pull/29490
    - add pandas and polars to our min dependency build (one more reviewer?): https://github.com/scikit-learn/scikit-learn/pull/29502
    - update lock-file as a comment, ongoing PR by Charlie-XIAO: https://github.com/scikit-learn/scikit-learn/pull/29505
    - run scipy tests as part of Pyodide CI https://github.com/pyodide/pyodide/pull/4935 instead of my own repo: https://github.com/lesteve/scipy-tests-pyodide
- [name=Shruti]
    - Waiting for review on Logistic Regression CV weighting (pull request #29419)
    - Testing sag and saga optimisers in Logistic Regression CV
    - Working on test sample weighting in bin selection for HistGradientBoosting tree
    - Reading up on proper scoring for concordance metric in survival analysis
- [name=Tamara]
    - fixing (all) the failing estimator checks for fairlearn's estimators, which includes a lot of refactoring and reading scikit-learn code
    - proposed to add `only_non_negative` as an option to `check_array` https://github.com/scikit-learn/scikit-learn/pull/29540
- [name=Emily]
    - PRs ready for review: [mean_poisson_deviance](https://github.com/scikit-learn/scikit-learn/pull/29227), [paired_euclidean_distances](https://github.com/scikit-learn/scikit-learn/pull/29389)
        - (Accidentally closed mean_poisson_deviance while solving conflict, now it has a codecov error)
    - Going through Stefanie's review comments on weighted percentile (https://github.com/scikit-learn/scikit-learn/pull/29431)
    - Interesting issues I found in the backlog and would like to take a look: [MIN_CAT_SUPPORT in HGBT](https://github.com/scikit-learn/scikit-learn/issues/19008), [Improve near constant feature detection in scalars](https://github.com/scikit-learn/scikit-learn/issues/19898)
- [name=Jérémie]
    - Several clean-up PRs in the tests (all merged). The end goal is to be able to turn all warnings into errors in the CI.
        - wip https://github.com/scikit-learn/scikit-learn/pull/29516
    - Rework of the maintainers page ready for a second review https://github.com/scikit-learn/scikit-learn/pull/29412
    - Quick fix to include license file in wheels https://github.com/scikit-learn/scikit-learn/pull/29522. Do we need/want to make a 1.5.1.post1?
    - triage duty this week
- [name=Arturo]
    - Thresholds in DET curve [#29151](https://github.com/scikit-learn/scikit-learn/pull/29151) and quantile loss in HGBT regression [#29063](https://github.com/scikit-learn/scikit-learn/pull/29063) still waiting for review
    - Function to cross-validate coverage fraction [#29499](https://github.com/scikit-learn/scikit-learn/pull/29499)
    - some reviews
- [name=Stefanie]
    - continue reviewing [FIX min_value and max_value not indexed when features are removed](https://github.com/scikit-learn/scikit-learn/pull/29451#) in IterativeImputer
    - knowledge exchange with Emily and Loic on CI and Array API
    - PR [ENH Array API for check_consistent_length](https://github.com/scikit-learn/scikit-learn/pull/29519)
    - reviewing PR [Add array API support for _weighted_percentile](https://github.com/scikit-learn/scikit-learn/pull/29431#)
    - continue linear algebra course; lots of gaussian elimination
- [name=Tim]
    - worked on GPU CI
    - reviewing array API PRs
    - "random things"
- [name=Jerome]
    - add skrub [TableReport](https://github.com/skrub-data/skrub/pull/984)
    - POC [online demo](https://jeromedockes.github.io/skrub-online-reports/)
    - reviewed misc small improvements to docs
- [name=Adrin]
    - Did triage (but still catching up since didn't have much time)
    - `linalg.eigh` returning (small) negative values [#29534](https://github.com/scikit-learn/scikit-learn/issues/29534)
- [name=Guillaume]
    - Just back from vacation and catching up with my mail box

### Discussion points

- [name=Adrin] `quicksort` (stable) vs `mergesort`
    - ask why the user cares, and close them all if it's not valid
- [name=Adrin] closed a bunch of AI generated looking issues
    - PRs are slightly more welcome than just issues
- [name=Jérémie] do a 1.5.1post1 to add COPYING file to wheels?
    - will be fixed in future releases
- [name=Adrin] array API is breaking existing code where array API is not enabled, since we don't have good coverage
    - we have had tests that do not use array API anywhere break as the result of array API changes.
    - this breakage goes unnoticed because we, for example, don't build all documentation examples
    - not a great look if we keep introducing regressions while adding array API support
    - potential fix: add a unit test that explicitly exercises the feature/use-case that leads to breakage that we discover "by chance"
        - for example a test in the array API test suite that uses polars as input to make sure polars + no array API works
- [name=Jérémie] negative eigenvalues

## 2024-07-08

**Rules of the game:**
- No question during the progress reports
- Add those questions in the discussion reports

### Progress reports

- [name=Adrin]
    - reviews on SLEP6 and a look at some array API work
    - Adam accepted to be a maintainer!
    - less time on sklearn, but will keep working with Stefanie (when needed) and Tamara
    - released skops=0.10.0 with numpy2 support (was rather painful)
    - need to release fairlearn with all the fixes Tamara has made
- [name=Olivier]
    - Investigated a `[scipy-dev]` regression detected by our CI. Temporary workaround:
        - https://github.com/scikit-learn/scikit-learn/pull/29432
    - hazardous sprint:
        - discussions on how to properly generalize the Concordance Index survival analysis metric to the competing risks setting
        - worked on the integration of the variant of the main estimator described in Julie Alberge's manuscript:
            - https://github.com/soda-inria/hazardous/pull/53
            - This is a multi-incidence variant (with builtin softmax normalization) of our previous one-risk-vs-others estimator.
        - Various PR reviews (improvements in docstrings, API fixes to use consistent shapes in returned arrays, documentation improvements...)
        - Still WIP:
            - more PRs to review
            - would like to refactor the competing risks IPCW estimation to use a generic wrapper for any survival analysis estimator
    - Reviewed and discussed Shruti's progress in fixing sample_weight support in linear models with internal CV:
        - https://github.com/scikit-learn/scikit-learn/pull/29308 (ElasticNetCV)
        - https://github.com/scikit-learn/scikit-learn/pull/29419 (LogisticRegressionCV)
    - Trying to catch up on array API and other reviews.
    - Addressed first review on `fetch_file`:
        - https://github.com/scikit-learn/scikit-learn/pull/29354
        - Need to address the second review.
    - FYI:
        - I will be off for 2 weeks starting next Friday (included)
            - Need to find a replacement for triaging
        - Need to organize travel to EuroScipy 2024
- [name=Arturo]
    - Thresholds in DET curve [#29151](https://github.com/scikit-learn/scikit-learn/pull/29151) and quantile loss in HGBT regression [#29063](https://github.com/scikit-learn/scikit-learn/pull/29063) still waiting for review
    - hazardous sprint
    - investigating `ConvergenceWarning` in `GaussianProcessRegressor` with `DotProduct` kernel [#29380](https://github.com/scikit-learn/scikit-learn/pull/29380)
    - some reviews
- [name=Jérôme]
    - skrub 0.2.0 release
    - [PR](https://github.com/skrub-data/skrub/pull/984) to add dataframe summary visualizations, see [example](https://output.circle-artifacts.com/output/job/58fe4d8e-4d1f-4035-b99a-234e3375bb9b/artifacts/0/doc/auto_examples/01_encodings.html#easy-learning-on-a-dataframe) (in CI artifact)
- [name=Emily]
    - Array API support for _weighted_percentile with Stefanie ([Draft PR](https://github.com/scikit-learn/scikit-learn/pull/29431))
    - Addressed review comments on:
        - [mean_poisson_deviance](https://github.com/scikit-learn/scikit-learn/pull/29227)
        - [cosine_distances](https://github.com/scikit-learn/scikit-learn/pull/29265)
        - [mean_absolute_percentage_error](https://github.com/scikit-learn/scikit-learn/pull/29300)
    - Created new PR: [paired_euclidean_distances](https://github.com/scikit-learn/scikit-learn/pull/29389)
    - To Do: euclidean_distances, rbf_kernel, Nystroem
- [name=Jérémie]
    - Released 1.5.1
    - follow-up experimenting OpenBLAS callback mechanism
        - https://github.com/scikit-learn/scikit-learn/pull/29403
    - Rename `force_all_finite` into `ensure_all_finite`: https://github.com/scikit-learn/scikit-learn/pull/29404
    - Started reviewing a rework of the maintainers doc: https://github.com/scikit-learn/scikit-learn/pull/29412
- [name=Stefanie]
    - discussing Array API on _weighted_percentile with Emily
    - PR [TransformedTargetRegressor warns when set_output expects dataframe](https://github.com/scikit-learn/scikit-learn/pull/29401) (merged)
    - issue [Array API tests fail on main](https://github.com/scikit-learn/scikit-learn/issues/29396)
    - review PR [MAINT move _estimator_has function to utils](https://github.com/scikit-learn/scikit-learn/pull/29319)
    - conversation about ch. 03 Fluent Python (sets and dicts)
    - continue learning linear algebra
        - good resource which starts with patterns and finishes with definitions: https://www.lem.ma/library
- [name=Loïc]
    - on triage last week, busy answering issues, fixing CI, small PR reviews
    - Fix issues when updating main-ci lock-files: https://github.com/scikit-learn/scikit-learn/pull/29388
    - Check build dependency version in meson.build (relying on `pyproject.toml` has limitations). Had a closer look at cross-compilation edge case and I think it should be fine: https://github.com/scikit-learn/scikit-learn/pull/28721
    - Remove support for setuptools build: https://github.com/scikit-learn/scikit-learn/pull/29400
    - `make clean` if you notice weird compilation issue (for example switching between Numpy<2 and Numpy>=2 in same environment): https://github.com/scikit-learn/scikit-learn/pull/29413
    - `/take` workflow was broken for 6 months. Fixed but do we think this is useful? https://github.com/scikit-learn/scikit-learn/pull/29408
    - meson-python quality of life improvement: add obvious error when rebuilding fails https://github.com/mesonbuild/meson-python/pull/648
    - meson-python Pytest assertion rewriting bug: https://github.com/mesonbuild/meson-python/issues/646
- [name=Shruti]
    - Working on pull request #29308 for fix_elastic_net_cv adding in test
    - working on Logistic RegressionCV pull request and integrating test into existing tests for sample weights #29419
    - Uploaded common tests to gist for sample_weight handling in regressors and classifiers: identified problematic regressors and classifiers (e.g., elasticnet and lasso cv, logisticregressorcv, histgradientboostingtrees)

### Discussion points

- [name=Olivier] About [GP ConvergenceWarning](https://github.com/scikit-learn/scikit-learn/pull/29380): did you try to create a local environment using the CI lock file to reproduce?
    - `-Werror::ConvergenceWarning` might give better insight into things.
- [name=Olivier] Taking over other people's pull requests.
- [name=Loïc] Update scikit-learn calendar when triaging changes: worth it or not?
    - [name=Adrin] Just did, forgot to do that. Definitely worth it. Also added an email notification for everyone, and made everyone able to edit the "event" to tune the notification to something else or none if they wish.
- [name=Loïc] quick opinion check `/take` workflow do we think it's useful?
- [name=Adrin] are we reusing the PR about sample_weight from way back?
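
The `-Werror::ConvergenceWarning` suggestion above can also be tried from inside a script. A minimal sketch using only the standard `warnings` module: the `ConvergenceWarning` class below is a local stand-in (the real one lives in `sklearn.exceptions`), and `fit_that_does_not_converge` and `run_strict` are hypothetical placeholders, not scikit-learn functions.

```python
import warnings


class ConvergenceWarning(UserWarning):
    """Stand-in for sklearn.exceptions.ConvergenceWarning (assumption)."""


def fit_that_does_not_converge():
    # Hypothetical placeholder for an optimizer that stops early.
    warnings.warn("optimizer stopped before converging", ConvergenceWarning)
    return "fitted"


def run_strict(func):
    """Call func with ConvergenceWarning escalated to an exception."""
    with warnings.catch_warnings():
        # Only this category becomes an error; other warnings are untouched.
        warnings.simplefilter("error", category=ConvergenceWarning)
        return func()


try:
    run_strict(fit_that_does_not_converge)
    escalated = False
except ConvergenceWarning as exc:
    # The traceback now points at the call that emitted the warning.
    escalated = True
    message = str(exc)
```

Scoping the filter with `catch_warnings()` keeps the escalation local, so the rest of the session keeps its usual warning behaviour.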

## 2024-06-10

**Rules of the game:**
- No question during the progress reports
- Add those questions in the discussion reports

### Progress reports

- [name=Adrin]
    - PR reviews (taking more time than I like, but gotta do)
    - Would like to get back to `numpy.unique`
    - non-coding work
    - work on fairlearn, review PRs there as well
    - skops: remove something persistence related due to a CVE
- [name=Guillaume]
    - scikit-learn
        - Triage of issues
        - Providing feedback on some PRs: getting my head around `info_gain` and linked with expected mutual information
    - skrub
        - CI improvements
        - PRs review for upcoming release: refactoring to solve bug in `TableVectorizer`
- [name=Loïc]
    - CPython 3.13 free-threaded build added to the CI: https://github.com/scikit-learn/scikit-learn/pull/29191. Next is uploading a nightly wheel.
    - free-threaded PR in progress for joblib: https://github.com/joblib/joblib/pull/1589
    - Dependabot to update our Github actions + follow-ups https://github.com/scikit-learn/scikit-learn/pull/29180
    - PyPy support officially dropped
- [name=Stefanie]
    - information gain closed [#28905](https://github.com/scikit-learn/scikit-learn/pull/28905)
    - zero_division param for cohen_kappa_score [#29210](https://github.com/scikit-learn/scikit-learn/pull/29210)
    - some learning (reading Fluent Python with Tamara)
- [name=Olivier]
    - Reviewed free-threading / dependabots PRs
    - Found and fixed a bug in adjustable thresholding experiment
        - https://github.com/scikit-learn/scikit-learn/pull/29150
    - Reviewed and beta-tested / refined the new CUDA GPU CI github workflow:
        - https://github.com/scikit-learn/scikit-learn/blob/main/.github/workflows/cuda-gpu-ci.yml
        - https://github.com/scikit-learn/scikit-learn/pull/29130
    - Getting started pair programming session with Emily Chen
    - Bunch of reviews on all open Array API PRs
    - Starting triage week (mostly updating/fixing Build/CI bot issues so far)
- [name=Arturo]
    - Thresholds in DET curve [#29151](https://github.com/scikit-learn/scikit-learn/pull/29151)
    - Quantile loss in user guide on HGBT regression [#29063](https://github.com/scikit-learn/scikit-learn/pull/29063)
    - A bit of mooc [#780](https://github.com/INRIA/scikit-learn-mooc/pull/780)
- [name=Jérôme]
    - skrub TableVectorizer PR merged
    - follow-ups: simplifying GapEncoder, MinHashEncoder
    - fixing missing value handling in GapEncoder & MinHashEncoder
    - adding a `make_tabular_pipeline` helper to easily build a simple but reasonable supervised estimator
    - RidgeCV array api: should be running but in small timing experiments it seems slow, need to investigate
- [name=Jérémie]
    - PR adding `writeable` param to check_array ready for review: https://github.com/scikit-learn/scikit-learn/pull/29018. Fixes a 1.5.0 regression so worth putting in 1.5.1 imo.
    - Callback base infra + ProgressBar based on discussions from drafting meeting ready for review: https://github.com/scikit-learn/scikit-learn/pull/28760

### Discussion points

- [name=Adrin]
    - Finding sponsors specific to GPU CI costs?
    - Did we backport fix on ML model map?
        - backport ASAP to avoid having more people open issues on this
- [name=Guillaume]
    - scikit-learn 1.5.1 timeline
        - blocker is probably: https://github.com/scikit-learn/scikit-learn/pull/29018
        - random dead-lock at import: https://github.com/scikit-learn/scikit-learn/issues/29145

## 2024-05-13

### Progress report

- [name=Jérémie]
    - released 1.5.0rc1
    - fix wheel builder windows https://github.com/scikit-learn/scikit-learn/pull/29006
    - release highlights 1.5 https://github.com/scikit-learn/scikit-learn/pull/29007 Please add sections for the features you want.
        - TunedThresholdClassifier :+1: [name=Gael] ?
        - metadata routing progress ?
        - PCA speed improvements for `n_samples >> n_features` :+1: [name=Gael] ?
- [name=Adrin]
    - Reviews (a course has students sending PRs)
    - numpy hash based unique => speed in types of target, scores, confusion matrices
    - Pipeline's `transform_input`: [#28901](https://github.com/scikit-learn/scikit-learn/pull/28901)
    - Tags: starting the work with [#28927](https://github.com/scikit-learn/scikit-learn/pull/28927)
        - Goal: make them public mid-term
    - Persistence doc revamp: [#28889](https://github.com/scikit-learn/scikit-learn/pull/28889)
    - HalvingSearchCV: [#28320](https://github.com/scikit-learn/scikit-learn/pull/28320)
        - Edge toward out of experimental
        - We still need to work on the API side for usability
- [name=Stefanie]
    - metadata routing for Stacking* [#28701](https://github.com/scikit-learn/scikit-learn/pull/28701) finished
    - Metadata routing for learning_curve [#28975](https://github.com/scikit-learn/scikit-learn/pull/28975) started
    - nans in SplineTransformer [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043)
        - reviews with Olivier
        - especially defining test cases
    - check_scoring() has raise_exc [#28992](https://github.com/scikit-learn/scikit-learn/pull/28992)
        - `raise_exc=False` for multimetric scoring
        - add test
- [name=Olivier]
    - Reviews on nan handling for SplineTransformer [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043)
    - Discussed `RANSACRegressor`'s `sample_weight` handling with Shruti
    - Some testing of conda-forge's 1.5.0rc1 on macOS m1
    - Follow-up / reviews on Array API PRs
    - Some scipy-dev CI maintenance
- [name=Shruti]
    - Opened pull request to add sample weights to RANSAC estimator [#3](https://github.com/probabl-ai/scikit-learn-exercise-snath-xoc/pull/3)
    - Follow-up on ransac specific tests (e.g. score, default residual threshold and exceeding max skips tests)
- [name=Arturo]
    - Some [mooc PRs](https://github.com/INRIA/scikit-learn-mooc/pulls)
- [name=Jérôme]
    - was away for most of April
    - mostly working on skrub: added selectors #895, more tests now run with polars #896 #903 PRs by Theo, refactoring/fixing the [TableVectorizer](https://github.com/skrub-data/skrub/pull/902)
    - addressed most of review of LabelBinarizer array API [PR](https://github.com/scikit-learn/scikit-learn/pull/28626), now updating the RidgeCV PR
    - a bit more on the skrub Recipe/PipelineBuilder/... and [skrubview](https://github.com/skrub-data/skrubview)
- [name=Loïc]
    + Complete Meson entry for scikit-learn 1.5 changelog https://github.com/scikit-learn/scikit-learn/pull/29008. PR just opened, feedback and discussions welcome! Current plan: drop setuptools support in scikit-learn 1.6 (but maybe rename `setup.py` -> `_setup.py` like Scipy did without testing it, and full drop in scikit-learn 1.7. This was useful for example for Pyodide scipy package).
    + some free-threading (aka nogil) work is happening in the Scientific Python ecosystem, quite excited about this! For example https://github.com/scikit-learn/scikit-learn/issues/28978. I added a free-threaded label to help track this.
- [name=Gael]
    + Moving contributors to emeritus: answers from all but one
    + More research than software, but paper behind hazardous (survival models) moving beautifully forward

### Discussion points

- re conda-forge build @ogrisel: https://github.com/conda-forge/scikit-learn-feedstock/pull/258
- re: array API: label binarizer's diff
- dropping setuptools support in scikit-learn 1.6 (or 1.7 with `_setup.py` approach) any quick opinion?

## 2024-04-29

No bi-weekly progress status in favor of the monthly meeting.

## 2024-04-15

**Rules of the game:**
- No question during the progress reports
- Add those questions in the discussion reports

### Progress reports

- [name=Jérémie]
    - released 1.4.2 (numpy 2 support)
    - clean-up deprecations for 1.5
    - fix for ColumnTransformer in parallel (https://github.com/scikit-learn/scikit-learn/pull/28822) But there's a bigger issue (https://github.com/scikit-learn/scikit-learn/issues/28824)
    - currently: bump threadpoolctl min version but issues with the benchmarks and profilers.
- [name=Stefanie]
    - discussing RecursionError bug with Adrin: https://github.com/scikit-learn/scikit-learn/pull/28712
    - metadata routing for predict in StackingClassifier and StackingRegressor: https://github.com/scikit-learn/scikit-learn/pull/28701
        - issue with _records attribute on the final estimator
    - setting up new laptop
- [name=Adrin]
    - triage this week
    - working on transforming metadata in pipeline (early stopping related)
- [name=Guillaume]
    - CZI EOSS6 submission
    - Review of a couple of PRs
    - Couple of discussions related to the release
- [name=Loïc]
    + More Meson follow-up
        - Fix more build dependencies https://github.com/scikit-learn/scikit-learn/pull/28821 (aka no-OpenMP build-time failures)
        - reported OpenMP detection on Apple Clang: https://github.com/mesonbuild/meson/issues/7435#issuecomment-2047585466 linked to misleading warning https://github.com/scikit-learn/scikit-learn/issues/28710
        - Weird recompilation issue: https://github.com/scikit-learn/scikit-learn/issues/28837
        - adapted sdist check workflow: https://github.com/scikit-learn/scikit-learn/pull/28757
    + Triage last week
- [name=Arturo]
    - Some reviews (I still have to look at [#27357](https://github.com/scikit-learn/scikit-learn/pull/27357));
    - accept `d2_absolute_error_score` as named scorer [#28836](https://github.com/scikit-learn/scikit-learn/pull/28836).

### Discussion points

- X_val
- What should be fixed regarding Meson for building the wheel for 1.5. See https://github.com/scikit-learn/scikit-learn/pull/28757#issuecomment-2034638954 and https://github.com/scikit-learn/scikit-learn/pull/28757#issuecomment-2037832722 where I did the comparison between sdist

## 2024-03-18

### Progress reports

Rules of the game:
- No questions during this part
- If you have questions / comments, write them in the second part

Progress:

- [name=Loïc]
    + meson as main build backend [#28506](https://github.com/scikit-learn/scikit-learn/pull/28506). pyproject.toml changes needed for the wheels need to be reviewed. Tested conda-forge in https://github.com/conda-forge/scikit-learn-feedstock/pull/250.
    + conda-lock update: https://github.com/scikit-learn/scikit-learn/pull/28653
- [name=Olivier]
    - Triaging duty this week.
    - Array API reviews:
        - array-api-strict [#28555](https://github.com/scikit-learn/scikit-learn/pull/28555)
            - ready for second review
        - train_test_split [#28407](https://github.com/scikit-learn/scikit-learn/pull/28407)
            - ready for second review
        - ridge [#27800](https://github.com/scikit-learn/scikit-learn/pull/27800)
            - first pass
        - label binarizer [#28626](https://github.com/scikit-learn/scikit-learn/pull/28626)
            - WIP:
                - needed for RidgeClassifierCV [#27961](https://github.com/scikit-learn/scikit-learn/pull/27961)
                - strategy: accept Array API namespace but continue building a scipy sparse data structure with numpy components before final conversion to Array API container to keep the code minimally changed
                - alternative would be to implement the code directly for dense allocated structure using Array API directly but not sure it's worth the extra code branch
    - Had a look at Thomas's prototype for a cirun config to run CUDA tests in [#24491](https://github.com/scikit-learn/scikit-learn/issues/24491#issuecomment-1999663173)
        - Security concerns should be cleared w.r.t. VM isolation
        - Still questions about the proper dispatch mechanism to use to allow maintainers to trigger the CI on external contributors' PRs.
    - Started to have a look at scipy's BSpline NaN handling to simplify the code to change to add missing value support to `SplineTransformer` [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043)
        - need to write a minimal reproducer to discuss the issue with upstream (but paused because of triaging duty)
        - we can proceed with more complex code in scikit-learn code otherwise.
- [name=Stefanie]
    - nan support for SplineTransformer [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043): restructure (missingness indicator computed lazily) and computing bsplines of sparse arrays with workaround
    - UnsetMetadataPassedError [#28517](https://github.com/scikit-learn/scikit-learn/pull/28517): parent param doesn't need to be passed
    - learning:
        - 1) concurrent and parallel programming
        - 2) serialization
- [name=Jérémie] (was off last week)
    - reviewed and finalized some old but almost ready PRs.
    - currently on this one (allow string input in `pairwise_distances`). Discussed with Guillaume for the final details but almost ready. https://github.com/scikit-learn/scikit-learn/pull/27456
    - still cleaning-up `utils.__init__`. Only a few utils left.
    - Read about averaging ROC curves for Guillaume's PR https://github.com/scikit-learn/scikit-learn/pull/25939 (ROC curve display from cv results): "On Averaging ROC Curves", Jack Hogan, Niall M. Adams, https://openreview.net/forum?id=FByH3qL87G
- [name=Arturo]
    - Description of `l2_regularization` in docstrings and user guide [#28652](https://github.com/scikit-learn/scikit-learn/pull/28652)
    - Mooc reviews
- [name=Guillaume]
    - Triaging issues and PRs
        - fixing CIs: [#28636](https://github.com/scikit-learn/scikit-learn/pull/28636)
        - closing a couple of old PRs
        - give feedback on newly opened ones
    - Discussion regarding `skrub` design
- [name=Jérôme]
    - mostly skrub
        - pipeline (recipe) builder with convenient way of specifying param grid
        - now starting to split out PR for column selectors
    - array api: updated ridgecv pr after r2 score was merged; opened LabelBinarizer pr

### Discussion points

- meson main build backend: do we want to merge soonish or do we want to wait, e.g. for
    + meson-python 0.16 release (not yet out) with some quality of life enhancements, https://github.com/mesonbuild/meson-python/pull/594 and https://github.com/mesonbuild/meson-python/pull/569 (need Pytest<8 for now if you use `pytest --pyargs sklearn`)
    + Numpy 2 release (and scikit-learn 1.4.2 for Numpy 2 compatibility), to avoid having too many impactful changes at the same time ("what could possibly go wrong?")
    + other things?
    - [name=Guillaume] I think we should wait because we did not take some potential changes in 1.4.1 so it would be easier when branching 1.5
    - Conclusion: Merge the PR (and check the pyproject.toml), have it in main, but do not backport it for the 1.4.2 release.
- Average ROC: https://openreview.net/forum?id=FByH3qL87G
    - PR https://github.com/scikit-learn/scikit-learn/pull/25939
    - we need to make an editorial choice regarding the method

## 2024-03-04

### General topics

### Progress reports

- [name=Stefanie]
    - preparing talk for PyLadies meetup: "How to start contributing to open source?"
    - [PR UnsetMetadataPassedError](https://github.com/scikit-learn/scikit-learn/pull/28517)
        - display which method metadata request is not set and from where it was called
        - correct error message for composite methods
        - no default value for parent param (little trick when parent is a function)
    - [PR metadata routing for FeatureUnion](https://github.com/scikit-learn/scikit-learn/pull/28205) continued
        - keep old behaviour: no routing to transform, and some little things
    - [OrthogonalMatchingPursuit](https://github.com/scikit-learn/scikit-learn/pull/28557#issuecomment-1973000754) (DocInspector): documentation and code clarity (discussion with LucyLeeow)
- [name=Olivier]
    - Array API: r2_score [#27904](https://github.com/scikit-learn/scikit-learn/pull/27904): upcasts are probably no longer needed:
        - backed by empirical study (linked in the discussion above)
        - backed by the fact that np.sum was made more stable right after the upcasts were added to scikit-learn 10 years ago
        - I plan to take over to implement this simplification and hopefully unblock:
            - [array-api-strict](https://github.com/scikit-learn/scikit-learn/pull/28555)
            - [train_test_split](https://github.com/scikit-learn/scikit-learn/pull/28407)
            - [RidgeCV](https://github.com/scikit-learn/scikit-learn/pull/27961)
            - and probably more
    - Worked on a motivating example to demonstrate the usefulness of a built-in handling of missing values in SplineTransformer:
        - https://github.com/scikit-learn/scikit-learn/issues/26793
        - first pass of review on [#28043](https://github.com/scikit-learn/scikit-learn/pull/28043)
    - TODO: follow-up on [a study about the causes of miscalibration for logistic regression models](https://gist.github.com/ogrisel/8502eb455cd38d41e92fee31863ffea7)
        - next step: open meta issue to discuss the big picture about scores, tuning, documentation and small PRs on focused points.
    - TODO: review the callback SLEP, and PRs waiting for reviews and linked here.
- [name=Guillaume]
    - Review and merge the backlogs of PRs related to adding example in docstring of public function
        - closing https://github.com/scikit-learn/scikit-learn/issues/27982
    - TODO:
        - `TunedThresholdClassifier`: https://github.com/scikit-learn/scikit-learn/pull/26120
- [name=Adrin]
    - Triage this week!
- [name=Jérémie]
    - cleaning up `utils.__init__`. Several PRs merged, current PR moving out indexing related stuff https://github.com/scikit-learn/scikit-learn/pull/28546
    - SLEP for the callback API https://github.com/scikit-learn/enhancement_proposals/pull/90. Feedback welcome
    - merged last param validation for public function PRs
    - merged array-like / sparse matrix distinction for param validation
- [name=Loïc]
    + back from one week of holiday, some CI work/clean-ups before
    + meson as main build backend: https://github.com/scikit-learn/scikit-learn/pull/28506. Tested with `cd build`, joblib does not seem to be in the wheel dependencies, it may be good to fix this in the same PR.
- [name=Arturo]
    - iter on outlier detection [#28550](https://github.com/scikit-learn/scikit-learn/pull/28550)
    - some mooc PRs
        - [remove `head`](https://github.com/INRIA/scikit-learn-mooc/pull/766)
        - [nested-cross validation figure](https://github.com/INRIA/scikit-learn-mooc/pull/765)
        - [feedback from Camille](https://github.com/INRIA/scikit-learn-mooc/pull/764)

### Discussion

Add discussion points here, possibly live during the presentation of progress reports:

- [name=Gael] (if we have time): what do people think is the best tool to draft design documents and meeting note / priorities (asking for skrub, where we need more note taking, and community building)?
- Conclusion: hackmd with a note on top saying that people interested in commenting should ask for permission - [name=Guillaume] Any thoughts on helping the SLEP on callback to go forward? - https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep000/proposal.html - Plan the next drafting meeting on callbacks. - [name=Guillaume] Any thoughts on going forward regarding the user survey (helping Francois and Cailean) - First round of review in probabl. - Once done, check with core team. ## 2024-02-19 ### General topics ### Progress reports - General news - scikit-learn 1.4.1(.post1) is out! - Yao Xiao granted rights as a new core contributor - [name=Olivier]: - Array API with devices that do not support `float64` data - https://github.com/scikit-learn/scikit-learn/pull/27904 - Investigating pytest 8.0.1 (`setup_module` is not run) - pydata-sphinx-theme reviews to a preview branch - [name=Arturo] - a bit of mooc forum - some reviews (mostly dropdown stuff) - Gael's [mediation project](https://notes.inria.fr/Uz9HaPyJQ1yDY_LUuT2f-A#) (first contact with Samuel Girard) - [name=Adrin] - Documentation PRs with @Charlie-Xiao - SLEP6 work, and Stefanie's PRs - Validation set discussions with Christian - https://github.com/scikit-learn/scikit-learn/pull/28440 - Fairlearn's maintainance against sklearn and pandas - [name=Jérémie] - mostly threadpoolctl (released 3.3.0) - working on callbacks (https://github.com/scikit-learn/scikit-learn/pull/27663) - [name=Loïc] + a bit of help on scikit-learn 1.4.1.post1 + reviewed/merged some CI security PRs from Thomas. 
    Context: running CI for GPU https://github.com/scikit-learn/scikit-learn/issues/24491
- [name=Gael] (reports on skrub)
  - dataframe API: giving up on that because upstream gives up -> moving to using dispatch mechanism
  - I'm figuring out how to coordinate Inria and probabl efforts on data wrangling and DB connections
- [name=Stefanie]
  - Mainly metadata routing PRs
  - For feature-union: found a gap in the tests
- [name=Guillaume]
  - Release 1.4.1.post1
  - Reviewed backlog of accumulated notifications
- [name=Jérôme]
  - skrub [#888](https://github.com/skrub-data/skrub/pull/888) (dispatch on dataframe type + fixtures for testing on both pandas & polars) can be reviewed
  - reviewed [#876](https://github.com/skrub-data/skrub/pull/876) (agg joiner improvements + addition of multiaggjoiner)
  - worked a bit more on [#877](https://github.com/skrub-data/skrub/pull/877) (addition of column-wise transformations, selectors, refactoring/fixing TableVectorizer) & prototyping transformers that rely on expressions for lazy frames; discussions with Guillaume & Olivier
  - profiling of logistic regression: it seems array api may be worthwhile for lbfgs & newton-cg

### Discussion

- `r2_score` casting problem, because some devices do not support float64:
  - We should not put this code in `r2_score`
  - We can have the notion of `strict` vs non-`strict`
  - Maybe `r2_score` does not need float64
- validation set API:
  - internal https://github.com/scikit-learn/scikit-learn/pull/28440
  - next: open an alternative issue/PR.
  - pros:
    - easier to understand what a fixed split does
  - cons:
    - having a splitter is more convenient (better UX) to make it easy to enable early stopping.
    - data leakage from early stopping alone is probably not catastrophic(?)
- [name=Olivier] Google Meeting configuration problem: I did not see notifications to external people to join and could not find how to configure this.
## 2024-02-06

### Gael (GaelVaroquaux)

- Still working on authorizations from Inria
- Discussions on skrub, dataframe API, and TableVectorizer with Jérôme and Guillaume

### Olivier (ogrisel)

- OS team priority planning at probabl (ongoing, needs to be made public)
- Helped with an ICML submission (strongly related to hazardous)
  - Survival model working with Gradient boosted trees
- PRs and reviews:
  - iterating on calibration curve example:
    - https://github.com/scikit-learn/scikit-learn/pull/28231
  - goal is to fix our wording on Logistic Regression calibration also in the user guide
    - https://github.com/scikit-learn/scikit-learn/pull/28171
  - GPU FAQ update:
    - https://github.com/scikit-learn/scikit-learn/pull/28328
  - rotated codecov token to fix the CI
    - https://github.com/scikit-learn/scikit-learn/pull/28361
  - tried to investigate slow tests on scipy-dev pandas update PR:
    - https://github.com/scikit-learn/scikit-learn/pull/28348
- Prototyping a fluent API to compose transformers for feature + target data engineering with iterative/interactive previews and builtin subsampling and train/test split / cross-validation handling.
  - Goal: experiment with a future vision for skrub (or maybe even scikit-learn in the longer term?)

### Guillaume

- OS team priority
- Dev:
  - reviewing/merging PRs for scikit-learn 1.4.1
    - regression in tree criterion: https://github.com/scikit-learn/scikit-learn/pull/28327
  - discussion for design of `TableVectorizer` in skrub with Jérôme
- Accepted talk at PyConDE:
  - need to go to the Algolia side
  - need some work on ragger-duck

### Arturo

- a bit of mooc forum
- some reviews
- Gael's [mediation project](https://notes.inria.fr/Uz9HaPyJQ1yDY_LUuT2f-A#)
  - Goal: web UI to give intuitions on machine learning.
    Was asked a few months ago in the context of discussions with the broader society on the impact of ML on our lives
- planning to work on "advanced modules" for the mooc

### Loïc

- plenty of CI work (merged): [Pytest 8](https://github.com/scikit-learn/scikit-learn/pull/28318), [pandas 2.2 warnings](https://github.com/scikit-learn/scikit-learn/pull/28305), [pandas copy-on-write fixes](https://github.com/scikit-learn/scikit-learn/pull/28348)
- [code scanning workflow](https://github.com/scikit-learn/scikit-learn/pull/28312) (merged)
- meson editable install limitations are [getting fixed](https://github.com/mesonbuild/meson-python/pull/569) following my issues ([`__path__` empty](https://github.com/mesonbuild/meson-python/issues/568), [`pytest --pyargs` not collecting tests](https://github.com/mesonbuild/meson-python/issues/557)) and [Olivier's PR attempt](https://github.com/mesonbuild/meson-python/pull/562)
- [codecov uploader update](https://github.com/scikit-learn/scikit-learn/pull/28361) (merged)
- scipy-dev with Python 3.12 seems a lot slower (~50 minutes instead of ~25 minutes). Dataset downloads in part? Needs more investigation (seen in https://github.com/scikit-learn/scikit-learn/pull/28348)
- nogil build failing, needs Cython 3.0.8 for nogil Python: https://github.com/colesbury/nogil-wheels/pull/6
- Try out Meson `make dev-meson` and report issues, if you haven't already :wink:. See [doc](https://scikit-learn.org/dev/developers/advanced_installation.html#building-with-meson) for more details.
### Jérôme

- skrub tablevectorizer; discussions with Guillaume & Gaël
  - storing state to have consistent transformations
  - applying transformations to part of a dataframe
  - support for pandas & polars
- skrub aggjoiner PR help & start reviewing
- reading & feedback on Riccardo's data lake fishing paper, some of which may end up in skrub
  - Context: one big messy data lake (many tables, we don't quite know what's in there), the challenge is table discovery: finding the tables that might lead to joins on a target table to help predict a given target

### Jérémie

- administrative stuff
- extended flexiblas support for threadpoolctl

### Stefanie

- metadata routing FeatureUnion (https://github.com/scikit-learn/scikit-learn/pull/28205)
  - case where transformer doesn't have fit_transform: test with new mocking class
  - also added a test case for that in the normal non-routing tests
- metadata routing: simplify development process for routing (https://github.com/scikit-learn/scikit-learn/pull/28422)
  - MethodPair: order caller and callee
  - MethodMapping: less variation in how to define the mapping
- book: Mastering OOP
- maybe speaking at PyLadies Paris meetup

### TODO/next

- scikit-learn 1.4.1 (Guillaume)
  - calibration PR:
    - https://github.com/scikit-learn/scikit-learn/pull/28231
    - https://github.com/scikit-learn/scikit-learn/pull/28171
  - regression for decision tree: https://github.com/scikit-learn/scikit-learn/pull/28327
  - inappropriate casting for `SimpleImputer`: https://github.com/scikit-learn/scikit-learn/pull/28365
- Array API:
  - finalize `r2_score` PR:
    - https://github.com/scikit-learn/scikit-learn/pull/27904
  - Ridge, RidgeCV and maybe try Logistic Regression in parallel

### Topics/Questions to discuss at the end of the meeting

- META: how to organize this meeting:
  - sub-teams split when the OS team grows too large
  - move to a bi-weekly schedule for scikit-learn and other OS projects at probabl every other week.
  - stick to google meet
