# 25th January 2021 -- Scikit-learn dev meeting
*Don't forget to sign in when editing and put your name in front of the entries you want to discuss*
### Need decision
- Christian: RFC [RFC ColumnTransformer input validation and requirements #14251](https://github.com/scikit-learn/scikit-learn/issues/14251)
Should ColumnTransformer enforce the order of the input columns?
Info: This was already merged in [#14544](https://github.com/scikit-learn/scikit-learn/pull/14544) but I could not find a consensus/decision to do so.
- Nicolas: regarding consensus: it was reached (between Adrin, Andy, Joel and me) on the original issue in August 2019. Divergent viewpoints were only voiced one year later, but [#14544](https://github.com/scikit-learn/scikit-learn/pull/14544) was already merged by then. Also, #14544 fixed buggy behaviour.
- Christian: My focus is *user-friendly end-to-end ML pipelines*.
- Thomas: I opened PR [#19263](https://github.com/scikit-learn/scikit-learn/pull/19263) with an implementation to resolve this issue if we decide to. This PR enables transform to only require non-dropped columns to exist in the input, X, regardless of the order. Also dropped columns are not required in transform.
- Andy: I think this is a good solution (I think the crux was that we previously stored indices, I assume you're now storing names). If someone uses boolean masks and positional indexing, there are still edge cases, right? (Thomas: I am storing the original column names and the indices to the original column names.)
- Summary from the meeting:
- The problem is that we currently need a repeatable column order. However, some data science pipelines have weak control over this (for instance, when querying from a DB).
- Adrin finds that this makes the code very complicated, and would like this to be done in a separate object
- Maybe add more tests to be sure that we do not allow silent bugs if the transform-time DF has new columns (with various configs of passthrough and drop)
- Christian (if time permits): How to specify a design matrix for linear models? (Compare R-formula and patsy, also [#10603](https://github.com/scikit-learn/scikit-learn/issues/10603) and [#15263](https://github.com/scikit-learn/scikit-learn/issues/15263)) Ideally:
- You can use feature names.
- Native support for categorical features
- Crux: Easy to specify individual interaction terms
- Andy: https://github.com/amueller/patsylearn (this is reaaally old)
- patsylearn is not very robust and not well maintained (neither is patsy :D ); however, there seems to be a consensus that it is a desired functionality. The question is whether to make it live in scikit-learn or outside.
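For the `ColumnTransformer` column-order question at the top of this section, the idea of storing column names rather than positions can be sketched in plain Python. This is only an illustration under stated assumptions: `NameBasedSelector` is a hypothetical class, not scikit-learn code and not the implementation in PR #19263, and rejecting columns unseen during fit is just one possible policy for avoiding silent bugs.

```python
# Hypothetical sketch: NOT scikit-learn's ColumnTransformer and NOT PR #19263.
class NameBasedSelector:
    """Remember fit-time column names and select by name at transform time."""

    def __init__(self, keep):
        # Columns that are actually used; every other fit-time column is dropped.
        self.keep = list(keep)

    def _check_present(self, columns):
        missing = [c for c in self.keep if c not in columns]
        if missing:
            raise ValueError(f"required columns are missing: {missing}")

    def fit(self, columns):
        self._check_present(columns)
        # Store names, not positions, so a later reordering is harmless.
        self.feature_names_in_ = list(columns)
        return self

    def transform(self, columns, rows):
        # Only the kept columns must exist; dropped columns may be absent.
        self._check_present(columns)
        # Assumed policy: reject columns that were never seen during fit.
        unexpected = [c for c in columns if c not in self.feature_names_in_]
        if unexpected:
            raise ValueError(f"columns not seen during fit: {unexpected}")
        positions = [columns.index(c) for c in self.keep]
        return [[row[p] for p in positions] for row in rows]


selector = NameBasedSelector(keep=["age", "income"]).fit(["age", "city", "income"])
# Transform-time input is reordered and the dropped "city" column is absent.
reordered = selector.transform(["income", "age"], [[50000, 31], [42000, 27]])
```

Here `reordered` keeps the fit-time order of the kept columns (`age`, then `income`) even though the input arrived in a different order.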
### Need attention (review)
- Guillaume:
- A series of example improvements: [PR #18835](https://github.com/scikit-learn/scikit-learn/pull/18835), [PR #18830](https://github.com/scikit-learn/scikit-learn/pull/18830), [PR #18821](https://github.com/scikit-learn/scikit-learn/pull/18821), [PR #18836](https://github.com/scikit-learn/scikit-learn/pull/18836)
- Code refactoring for methods/functions that need to get responses from an estimator's methods (`predict`/`predict_proba`/`decision_function`) [PR #18589](https://github.com/scikit-learn/scikit-learn/pull/18589)
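To make the refactoring item above concrete, here is a minimal sketch of a single helper that picks one of `predict_proba`, `decision_function` or `predict` on an estimator. The helper name `get_response`, its signature and its fallback order are assumptions made for illustration; they are not taken from PR #18589.

```python
# Hypothetical helper; name, signature and fallback order are illustrative only.
def get_response(estimator, X, response_method="auto"):
    """Return (values, method_name) from the first available response method."""
    if response_method == "auto":
        candidates = ["predict_proba", "decision_function", "predict"]
    else:
        candidates = [response_method]
    for name in candidates:
        method = getattr(estimator, name, None)
        if callable(method):
            return method(X), name
    raise AttributeError(
        f"{type(estimator).__name__} has none of the methods {candidates}"
    )


class ThresholdClassifier:
    """Toy estimator exposing only `decision_function` and `predict`."""

    def decision_function(self, X):
        return [x - 0.5 for x in X]

    def predict(self, X):
        return [int(score > 0) for score in self.decision_function(X)]


# No predict_proba here, so "auto" falls back to decision_function.
scores, used = get_response(ThresholdClassifier(), [0.25, 0.75])
labels, _ = get_response(ThresholdClassifier(), [0.25, 0.75], "predict")
```

Centralising the lookup in one place is the kind of duplication such a refactoring would remove from callers.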
### General topics
- Thomas: typing revisited: [PR #17799](https://github.com/scikit-learn/scikit-learn/pull/17799)
- Nicolas: Has there been any new development on this? From what I can tell:
- Type annotation adoption is still very much an [ongoing discussion](https://github.com/scikit-learn/scikit-learn/issues/16705#issuecomment-683477933) (I'm still not sold, personally)
- Joris [has strongly recommended against](https://github.com/scikit-learn/scikit-learn/issues/16705#issuecomment-683717061) using type annotations for checking docstrings
- Thomas: In the same comment Joris said: "... or to have validation of consistency between the two formats.", which I am leaning toward.
- PR [#17799](https://github.com/scikit-learn/scikit-learn/pull/17799) is narrow in the sense that it wants to only type `__init__` parameters.
- I began typing some of sklearn here: [sk-typing](https://github.com/thomasjpfan/sk-typing/tree/main/sk_typing) and found that most of our hyperparameters are pretty simple and not complicated Unions.
- I think it is a typing net-win since most types are simple.
- Comment from Alex: can we type part of the code (at least to bootstrap the process)
- Nicolas describes the difficulties of contributing to a big code base with typing
- Adrin suggests typing only the builtin types and only in inits, keeping away from advanced stuff
- This is useful for IDEs, in particular with simple builtins, because the IDE can suggest the type as we type
- Loïc: GitHub Discussions, any feedback so far? Do we want to announce it more widely at some point (mailing list, Twitter, others)?
- Small traffic, but things are going well.
- We should probably advertise it more to get the ball rolling
- Action: can people retweet [Andy's tweet](https://twitter.com/amuellerml/status/1347263788446146560)?
- Andreas: what's the status on feature names?
- Thomas has PRs on `n_features_in_` (need review)
- [PR#18741](https://github.com/scikit-learn/scikit-learn/pull/18741)
- [PR#18742](https://github.com/scikit-learn/scikit-learn/pull/18742)
- [PR#18744](https://github.com/scikit-learn/scikit-learn/pull/18744)
- Andreas: What's the status on fit_transform != fit.transform?
- Gael: I'm a bottleneck here. I'll be able to pick it up in a few weeks. If someone else wants to step into my shoes, I won't take it badly.
- Guillaume: Move items away from experimental
- Seems feasible for `HistGradientBoosting` / `fetch_openml`
- `HistGradientBoosting`: specifying categorical features is not easy and intuitive at the moment
- `SuccessiveHalving`? (maybe too new?)
- `IterativeImputer` -> we probably need to solve/find out the reason for `ConvergenceWarning`. Andreas: The reason is that MissForest has a weird definition of convergence, that's not convergence at all.
- `fetch_openml` (marked experimental in the doc but not with the explicit experimental import mechanisms)
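For the typing discussion earlier in this section (annotating only `__init__` parameters, mostly with builtin types), a minimal sketch could look as follows. `ToyEstimator` and its hyperparameters are hypothetical and do not correspond to any scikit-learn class or to the code in PR #17799.

```python
from typing import Optional, get_type_hints


class ToyEstimator:
    """Hypothetical estimator: only `__init__` is annotated."""

    def __init__(
        self,
        n_estimators: int = 100,
        learning_rate: float = 0.1,
        warm_start: bool = False,
        max_depth: Optional[int] = None,  # the only non-builtin annotation
    ):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.warm_start = warm_start
        self.max_depth = max_depth


# The annotations can be read back at runtime, e.g. by an IDE or by a
# consistency check against the docstring.
hints = get_type_hints(ToyEstimator.__init__)
```

Keeping annotations this simple is what makes them usable for IDE suggestions without pulling in complicated `Union` types.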
### Contributors
### Priorities
Until next dev meeting:
- Someone at Inria (to be decided) will invest in feature names, starting by reviewing the linked PRs on `n_features_in_` checks. Starting issue: [#18514](https://github.com/scikit-learn/scikit-learn/pull/18514), with follow-up PRs: [PR#18741](https://github.com/scikit-learn/scikit-learn/pull/18741), [PR#18742](https://github.com/scikit-learn/scikit-learn/pull/18742), [PR#18744](https://github.com/scikit-learn/scikit-learn/pull/18744)
- Passing categoricals to the HGBT [#18894](https://github.com/scikit-learn/scikit-learn/issues/18894)
### Next meeting
February 22nd, same time? To be confirmed; we forgot to confirm.