# 25th January 2021 -- Scikit-learn dev meeting

*Don't forget to sign in when editing and put your name in front of the entries you want to discuss*

### Need decision

- Christian: [RFC ColumnTransformer input validation and requirements #14251](https://github.com/scikit-learn/scikit-learn/issues/14251)
  Should ColumnTransformer enforce the order of the input columns?
  Info: This was already merged in [#14544](https://github.com/scikit-learn/scikit-learn/pull/14544) but I could not find a consensus/decision to do so.
    - Nicolas: regarding consensus: it was reached (between Adrin, Andy, Joel and me) on the original issue in August 2019. Divergent viewpoints were only voiced one year later, but [#14544](https://github.com/scikit-learn/scikit-learn/pull/14544) was already merged by then. Also, #14544 fixed buggy behaviour.
    - Christian: My focus is *user-friendly end-to-end ML pipelines*.
    - Thomas: I opened PR [#19263](https://github.com/scikit-learn/scikit-learn/pull/19263) with an implementation to resolve this issue if we decide to. This PR enables transform to only require non-dropped columns to exist in the input, X, regardless of the order. Also, dropped columns are not required in transform.
    - Andy: I think this is a good solution (I think the crux was that we previously stored indices, I assume you're now storing names). If someone uses boolean masks and positional indexing, there are still edge cases, right? (Thomas: I am storing the original column names and the indices to the original column names.)
    - Summary from the meeting:
        - The problem is that we currently need a repeatable column order. However, some data science pipelines have weak control over this (for instance when querying from a DB)
        - Adrin finds that this makes the code very complicated, and would like this to be done in a separate object
        - Maybe add more tests to be sure that we do not allow silent bugs if the transform-time DF has new columns (with various configs of passthrough and drop)
- Christian (if time permits): How to specify a design matrix for linear models? (Compare R-formula and patsy, also [#10603](https://github.com/scikit-learn/scikit-learn/issues/10603) and [#15263](https://github.com/scikit-learn/scikit-learn/issues/15263))
  Ideally:
    - You can use feature names.
    - Native support for categorical features
    - Crux: Easy to specify individual interaction terms
    - Andy: https://github.com/amueller/patsylearn (this is reaaally old)
    - patsylearn is not very robust and not well maintained (neither is patsy :D ); however, there seems to be a consensus that this is desired functionality. The question is whether it should live in scikit-learn or outside.

### Need attention (review)

- Guillaume:
    - A series of example improvements: [PR #18835](https://github.com/scikit-learn/scikit-learn/pull/18835), [PR #18830](https://github.com/scikit-learn/scikit-learn/pull/18830), [PR #18821](https://github.com/scikit-learn/scikit-learn/pull/18821), [PR #18836](https://github.com/scikit-learn/scikit-learn/pull/18836)
    - Code refactoring for methods/functions that need to get responses from an estimator's methods (`predict`/`predict_proba`/`decision_function`): [PR #18589](https://github.com/scikit-learn/scikit-learn/pull/18589)

### General topics

- Thomas: typing revisited: [PR #17799](https://github.com/scikit-learn/scikit-learn/pull/17799)
    - Nicolas: Has there been any new development on this? From what I can tell:
        - Type annotation adoption is still very much an [ongoing discussion](https://github.com/scikit-learn/scikit-learn/issues/16705#issuecomment-683477933) (I'm still not sold, personally)
        - Joris [has strongly recommended against](https://github.com/scikit-learn/scikit-learn/issues/16705#issuecomment-683717061) using type annotations for checking docstrings
    - Thomas: In the same comment Joris said: "... or to have validation of consistency between the two formats.", which I am leaning toward.
        - PR [#17799](https://github.com/scikit-learn/scikit-learn/pull/17799) is narrow in the sense that it wants to only type `__init__` parameters.
        - I began typing some of sklearn here: [sk-typing](https://github.com/thomasjpfan/sk-typing/tree/main/sk_typing) and found that most of our hyperparameters are pretty simple and not complicated Unions.
        - I think typing is a net win since most types are simple.
    - Comment from Alex: can we type part of the code (at least to bootstrap the process)?
    - Nicolas raises the difficulties of contributing to a big code base with typing
    - Adrin suggests typing only the builtin types and only in inits, keeping away from advanced stuff
    - This is useful for IDEs, in particular with simple builtins, because the IDE can suggest the type as we type
- Loïc: GitHub Discussions, any feedback so far? Do we want to announce it more widely at some point (mailing list, Twitter, others)?
    - Small traffic, but things are going well.
    - We should probably advertise it more to get the ball rolling
    - Action: can people retweet [Andy's tweet](https://twitter.com/amuellerml/status/1347263788446146560)
- Andreas: what's the status on feature names?
    - Thomas has PRs on `n_features_in_` (need review)
        - [PR#18741](https://github.com/scikit-learn/scikit-learn/pull/18741)
        - [PR#18742](https://github.com/scikit-learn/scikit-learn/pull/18742)
        - [PR#18744](https://github.com/scikit-learn/scikit-learn/pull/18744)
- Andreas: What's the status on fit_transform != fit.transform?
    - Gael: I'm a bottleneck here. I'll be able to pick it up in a few weeks. If someone else wants to fill my shoes, I won't take it badly
- Guillaume: Move items away from experimental
    - Seems feasible for `HistGradientBoosting` / `fetch_openml`
    - `HistGradientBoosting`: specifying categorical features is not easy or intuitive at the moment
    - `SuccessiveHalving`? (maybe too new?)
    - `IterativeImputer` -> we probably need to solve/find out the reason for the `ConvergenceWarning`. Andreas: The reason is that MissForest has a weird definition of convergence, that's not convergence at all.
    - `fetch_openml` (marked experimental in the doc but not with the explicit experimental import mechanisms)

### Contributors

### Priorities

Until next dev meeting:

- Someone at Inria (to be decided) will invest in feature names, starting by reviewing the linked PRs on `n_features_in_` checks. Starting issue: [#18514](https://github.com/scikit-learn/scikit-learn/pull/18514), with follow-up PRs: [PR#18741](https://github.com/scikit-learn/scikit-learn/pull/18741), [PR#18742](https://github.com/scikit-learn/scikit-learn/pull/18742), [PR#18744](https://github.com/scikit-learn/scikit-learn/pull/18744)
- Passing categoricals to the HGBT [#18894](https://github.com/scikit-learn/scikit-learn/issues/18894)

### Next meeting

February 22nd, same time? To confirm, we forgot to confirm.
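
Side note on the ColumnTransformer item under "Need decision": the difference between storing indices and storing names can be sketched in plain Python. This is only an illustration of the idea discussed, not scikit-learn's implementation and not the code in #19263; `NameBasedSelector`, `fit_header_` and `fit_positions_` are hypothetical names, and rows are plain lists rather than a DataFrame.

```python
from typing import Any, List, Sequence


class NameBasedSelector:
    """Hypothetical helper for illustration; not scikit-learn's ColumnTransformer."""

    def __init__(self, columns: Sequence[str]) -> None:
        self.columns = list(columns)

    def fit(self, header: Sequence[str]) -> "NameBasedSelector":
        header = list(header)
        missing = [name for name in self.columns if name not in header]
        if missing:
            raise ValueError(f"Columns missing at fit time: {missing}")
        # Keep the fit-time names and the fit-time positions of the
        # selected columns (assumption: a simplified "names plus indices").
        self.fit_header_ = header
        self.fit_positions_ = [header.index(name) for name in self.columns]
        return self

    def transform(
        self, header: Sequence[str], rows: Sequence[Sequence[Any]]
    ) -> List[List[Any]]:
        header = list(header)
        missing = [name for name in self.columns if name not in header]
        if missing:
            raise ValueError(f"Columns missing at transform time: {missing}")
        # Look positions up by name in the new header, so reordered,
        # extra, or absent unselected columns do not change the selection.
        positions = [header.index(name) for name in self.columns]
        return [[row[p] for p in positions] for row in rows]


selector = NameBasedSelector(["age", "income"]).fit(["age", "city", "income"])

# Transform-time data: different order, "city" (never selected) is absent,
# and an unseen "extra" column is present.
new_header = ["income", "extra", "age"]
new_rows = [[50000, "x", 31], [42000, "y", 27]]

result = selector.transform(new_header, new_rows)
print(result)  # [[31, 50000], [27, 42000]]

# Reusing the fit-time positions instead silently swaps the two columns.
positional = [[row[p] for p in selector.fit_positions_] for row in new_rows]
print(positional)  # [[50000, 31], [42000, 27]]
```

The positional variant runs without error but returns the values in the wrong columns, which is the kind of silent bug the summary suggests covering with more tests.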
