
Scikit-learn bi-weekly progress status (even weeks)

Goal: internal communication on recent work and short term planning
Who: people working on the maintenance of scikit-learn and related projects, in particular at probabl and Inria, and maybe others
Frequency: every other Monday at 15:00 CET/CEST, unless it happens on the same day as the Monthly meeting.
Where: https://meet.google.com/xdm-ozyn-pgj
Meeting notes: to be archived on the scikit-learn org repo

Next meeting templates

Rules of the game:

  • No questions during the progress reports
  • Add those questions to the discussion points

Progress reports

  • Someone
    • item
    • item

Discussion points

2025-07-07

Progress reports

Discussion points

  • GPC/GPR: approximate gradient computation: isn't it possible to have an explicit symbolic expression of the gradient?
  • Linting/comment bot: open an issue?
    • A private issue + PR to test the mechanism.
  • Feedback from the monthly meeting about requiring 2FA at the scikit-learn org level?
    • no objection, sklearn-ci bot needs 2FA set-up
  • 1.7.1
  • SkrubPipeline and TableVectorizer could be extended to act as meta-data routers if someone is looking for a way to contribute to skrub.
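To make the GPC/GPR gradient question above concrete, here is a toy sketch (not the code under discussion) comparing a closed-form derivative with a finite-difference approximation. It assumes a single RBF kernel entry parameterised by the log of its length-scale; the function names are hypothetical.

```python
import math


def rbf(d, length_scale):
    """RBF kernel value for two points at distance d."""
    return math.exp(-0.5 * (d / length_scale) ** 2)


def rbf_grad_log_length_scale(d, length_scale):
    """Closed-form derivative of rbf() with respect to log(length_scale)."""
    return rbf(d, length_scale) * (d / length_scale) ** 2


def finite_difference(d, length_scale, eps=1e-6):
    """Central finite-difference approximation of the same derivative."""
    log_l = math.log(length_scale)
    up = rbf(d, math.exp(log_l + eps))
    down = rbf(d, math.exp(log_l - eps))
    return (up - down) / (2 * eps)


d, length_scale = 1.3, 0.7
analytic = rbf_grad_log_length_scale(d, length_scale)
numeric = finite_difference(d, length_scale)
print(analytic, numeric)
```

The two values should agree to several digits. Whether such closed forms are available for every quantity involved in the GPC/GPR code is exactly the open question.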

2025-06-23

Progress reports

Discussion points

  • Guillaume The Windows CI node on Azure is bothering us. Any idea why?
    • It looks like any node started with a single physical core fails
  • Adrin ClassicalMDS: should it be its own class? https://github.com/scikit-learn/scikit-learn/pull/31322
  • Loïc require 2FA for scikit-learn organization members? Bots will need to have 2FA set up and added. Does anyone know what the sklearn-wheels bot was used for?
  • Guillaume document property in sphinx
    • I think that we can override the way Sphinx works there

2025-06-09

Progress reports

  • Olivier

    • Ongoing work on the clustering lesson for the MOOC:
    • Review / discussion about display classes for:
      • permutation importance
      • (unbiased) Gini importance for tree-based models under implementation in #31279
    • Investigating Bagging sample weight support and how it relates to the dependency of the regularization parameter of LogisticRegression and Ridge regression on the number of training data points. #31414
  • Antoine

  • Jérémie

    • 2nd round of interviews for the callback internship
      • We selected François Paugam
    • Release highlights for 1.7.0 then released 1.7.0
    • Was off Wednesday + Thursday
  • Shruti

    • Investigating sag, saga and liblinear (with dual=True) solver:
      • Realised liblinear with dual=True is in fact fine, it just converges more slowly
      • Read up about sag and saga to understand a bit more how the incremental aggregation of the algorithm works
      • [TO DO]: write a minimal reproducer to open an issue for sag; still not sure whether the test is good enough: https://github.com/snath-xoc/sample-weight-audit-nondet/issues/25
  • Arturo

    • Mooc (clustering)
    • scikit-learn v1.7 in jupyterlite?

Discussion points

2025-05-26

Progress reports

Discussion points

  • Olivier hidimstats discussion in Saclay next week?
    • Conditional Permutation Importance makes it possible to detect spurious features that are correlated with predictive features but do not bring anything new on top of the other features.
  • Guillaume I'm curious about potential takeaways from the max_features experiments.
  • Guillaume What would be the best representation of "chance level" on PR curve when it comes to cross-validation
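On the "chance level" question above: for a classifier that scores at random, the expected precision equals the prevalence of the positive class, and that prevalence can differ from one test fold to the next. The sketch below shows one possible summary (mean and spread of the per-fold prevalence). The fold labels are made up, and this is only an option to discuss, not a decision recorded in these notes.

```python
from statistics import mean, stdev


def prevalence(labels, positive=1):
    """Fraction of positive labels, i.e. the chance-level precision."""
    return sum(1 for y in labels if y == positive) / len(labels)


# Hypothetical test-fold labels from a 3-fold cross-validation.
folds = [
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
]

chance_levels = [prevalence(fold) for fold in folds]
chance_mean = mean(chance_levels)
chance_std = stdev(chance_levels)
print(chance_levels, chance_mean, chance_std)
```

A plot could then draw either one horizontal line per fold or a single line at the mean with a band for the spread.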

2025-05-12

Progress reports

Discussion points

2025-04-28

Progress reports

Discussion points

2025-04-14

Progress reports

Discussion points

  • Olivier can the loky problem be reproduced with the resource tracker implementation in the Python standard library?
    • Loïc I don't know how to start on this. Olivier said the resource tracker is used in concurrent.futures, not in multiprocessing.Pool

2025-03-31

Progress reports

Discussion points

  • Guillaume EuroSciPy proposal for scikit-learn?
    • Tutorial on imbalanced classification
    • skrub tutorial by Jerome
  • Olivier CFP PyData Paris
    • Olivier: probabilistic regression
  • Gael Should we add sphinx-lint to our tests? (not a required test, something to help contributors)
    • That is a possibility: either a pre-commit hook, an additional linter step like black/ruff/mypy, or both. Some content is generated (rst from .py examples, .rst.template), so this may catch only a subset of issues, but that is probably still good enough
      • +1 (Jérémie)
      • +1 for both (Olivier)
      • Suggestion to feed the output to the comment bot

2025-03-17

Progress reports

  • Arturo

    • DOC Rework voting classifier example #30985
  • Shruti

    • Working on test using scores for sample weight invariance in non-deterministic estimators PR #18.
    • Working on the #53593 (Binmapper in HGBT) PR, similar to KBinsDiscretizer but a bit more complicated since it breaks several tests
    • Started mini comparison of scikit-learn Gaussian Processes vs. GPytorch, maybe something to try out skore with?
  • Stefanie

  • Olivier

    • First pass of review on default policy sample_weight meta-data routing
      • https://github.com/scikit-learn/scikit-learn/pull/30946
      • are sample_weight special or should we always request all metadata?
        • if yes, this could be a justification to release scikit-learn 2.0 if this causes breaking changes?
        • would require to study impact on downstream libraries to make a decision
    • Investigated with Antoine why decision trees do not pass the sample_weight repetition equivalence tests
      • rounding errors cause different choices of splits with near-tied impurity improvements when running the stochastic tests
      • rounding errors cause small but non-uniform offsets in leaf values in shallow trees (10 to 100x machine epsilon)
      • need to follow-up with opening issues to document the bias induced by the current code when the data leads to (exact or near) tied splits
        • current handling of tied splits leads to hard to understand inductive bias when inspecting the prediction function of learned decision trees
      • a potential solution would involve randomized near-tie breaking in trees, after which our stochastic sample_weight equivalence tests should pass
      • alternatively we would need to change the way we write the sample_weight equivalence tests.
    • A few reviews to get loky 3.5.0 out.
    • Science: I read the TabPFNv2 paper and now reading the TabICL preprint.
  • Loïc

  • Antoine

    • investigated why GradientBoosting fails the sample_weight equivalence check
    • root cause is tied splits in DecisionTree investigated with Olivier
    • meeting with Olivier, Shruti and Jérémie on adapting the statistical test
    • this week: metadata routing and multiclass brier score
  • Guillaume

    • last week:
      • open-source in academia conference
      • some discussions with Dea and Lucy
    • next week:
      • PyData Milan meetup
  • Adrin

  • Jérémie

    • Set up automated release with github actions for threadpoolctl and loky
      • publish to test pypi on push to main for .dev0 versions
      • publish to pypi when a tag is pushed for actual releases
      • still need to manually release on conda-forge
    • released threadpoolctl 3.6.0
    • released loky 3.5.0
  • Gael Varoquaux

    • Went over scikit-learn + ColumnTransformer / pandas questions on stackoverflow
    • PRs/contributions to list skrub in various parts of the ecosystem (pandas, polars, scikit-learn)
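The tied-split issue from Olivier's and Antoine's reports above can be illustrated with a minimal pure-Python sketch. It assumes integer weights, shows that a weighted statistic and its repeated-sample counterpart agree only up to rounding, and sketches a randomized near-tie breaking rule. The helper names and the tolerance are hypothetical; this is not scikit-learn's tree code.

```python
import math
import random


def weighted_mean(values, weights):
    """Mean of values where each value counts weights[i] times."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def repeated_mean(values, counts):
    """Same statistic computed on explicitly repeated samples."""
    expanded = [v for v, c in zip(values, counts) for _ in range(c)]
    return sum(expanded) / len(expanded)


def pick_split(improvements, rng, rel_tol=1e-9):
    """Pick uniformly among candidates within rel_tol of the best one."""
    best = max(improvements)
    tied = [
        i for i, imp in enumerate(improvements)
        if math.isclose(imp, best, rel_tol=rel_tol)
    ]
    return rng.choice(tied)


values, counts = [0.1, 0.2, 0.3], [3, 1, 2]
m_weighted = weighted_mean(values, counts)
m_repeated = repeated_mean(values, counts)

# Candidates 1 and 2 differ only by a rounding-sized amount.
improvements = [0.25, 0.5, 0.5 * (1 + 1e-12), 0.1]
chosen = pick_split(improvements, random.Random(0))
print(m_weighted, m_repeated, chosen)
```

With a deterministic argmax, the choice between candidates 1 and 2 would depend on rounding noise. With randomized tie breaking, both are selected with equal probability regardless of how the weights are expressed.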

Discussion points

  • Bump to Python 3.10: opinions are roughly split between plan 1 (oldest minor X.Y with Python 3.10 wheels), plan 2 (oldest bugfix X.Y.Z with Python 3.10 wheels) and "both are fine". https://github.com/scikit-learn/scikit-learn/pull/30895

  • Moving meta-data routing (sample weights) to more mandatory

    • The challenge is when we add metadata routing to something, eg a scoring, in which case it leads to a change in statistical behavior
  • Guillaume SearchCV:

    • I'm under the impression that we should improve the validation curve
    • Still thinking about the parallel coordinate plot (skrub does some stuff in this area (wip))
  • Adrin Copilot context hacks

  • Adrin SBOMs, GH's action, minimal starting point

2025-03-03

Progress reports

Discussion points

  • Olivier Wikipedia editing
  • Adrin balancing priorities, comms with the outside, expectations from other maintainers
  • Olivier inconsistent handling of location in joblib.Memory
  • Gael what can we learn / borrow from the numpy review process? How do we change?
    • One good reviewer sufficient for merge
      • Suggestion: define what needs two reviews (e.g. no changelog => 1 review sufficient; new API => 2 reviews needed). This requires more trust in reviewers
    • Maybe introducing a time boundary after which only one maintainer is allowed to merge.
    • PR triaging, close more PR
      • Revived triaging meeting to triage PR and close some to help focus attention

2025-02-17

Progress reports

Discussion points

2025-02-03

Progress reports

Discussion points

  • Guillaume I confirm that the Kagi search engine seems to behave the same as Google and points to 1.6.1

  • Loïc Stefanie's Github search issue: probably an alternative way to do what you want. Likely due to us switching to "new-style issues" (or whatever it is called with sub-issues)

  • Loïc JupyterLab/JupyterLite issue, do you have a way to reproduce?

  • Arturo JupyterLite crash on the scikit-learn.org/stable examples.

    • Error in the Firefox JS console when running the first cell of an example with import statements: TypeError: _query_package() got multiple values for argument 'index_urls'
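For context on the error above: in Python, this kind of TypeError is raised when the same parameter is supplied both positionally and by keyword, which typically happens when caller and callee disagree on a signature. The function below is a hypothetical stand-in used only to reproduce the message; it is not the actual code from the traceback.

```python
def query_package(name, index_urls=None):
    """Hypothetical stand-in with an index_urls parameter."""
    return name, index_urls


try:
    # "mirror" fills index_urls positionally, then the keyword sets it again.
    query_package("pkg", "mirror", index_urls=["https://example.invalid"])
except TypeError as exc:
    message = str(exc)
    print(message)
```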