# January 4th 2021 -- Scikit-learn dev meeting
*Don't forget to sign in when editing and put your name in front of the entries you want to discuss*
### Needs to be said:
**Happy new year 2021!**
### Need decision
- Adrin, Alex, Christian, Joel: SLEP 006 Sample Props
Next steps towards voting.
- [PR 16079](https://github.com/scikit-learn/scikit-learn/pull/16079)
- https://hackmd.io/Psfd7aB1Qt6_fwv-EBVXqQ
- After the conversations we converged on option 4 (yaay)
- Alex thinks the solution is not too intrusive to the API
- **Next step:** consolidate changes to option 4, and work on the PR
### Need attention (review)
- Christian: SplineTransformer [PR 18368](https://github.com/scikit-learn/scikit-learn/pull/18368)
Reviews are very welcome. All linear models would benefit: it gives smooth interpolation on continuous features, as opposed to tree-based methods, which have discontinuities all over the place (see the usage sketch below).
- Thomas and Guillaume would be happy to review
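A minimal usage sketch, assuming the API proposed in the PR (`SplineTransformer` with `n_knots` and `degree`; names and defaults may change before merge):

```python
# Illustrative only: assumes the SplineTransformer API proposed in PR 18368.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# B-spline basis expansion + linear model: a smooth fit on a continuous
# feature, unlike the piecewise-constant predictions of tree-based models.
model = make_pipeline(SplineTransformer(n_knots=10, degree=3), Ridge(alpha=1e-3))
model.fit(X, y)
smooth_predictions = model.predict(X)
```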
- Christian (if nobody else wants...): Common private loss module
- [PR 19088](https://github.com/scikit-learn/scikit-learn/pull/19088) `sklearn._loss` module itself (touches nothing else in sklearn)
- [PR 19089](https://github.com/scikit-learn/scikit-learn/pull/19089) replacement in linear logistic regression and HGBT (so that one can test and benchmark)
- Will give quantile loss in HGBT for free (almost); see the pinball-loss sketch below
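For context, a NumPy sketch of the quantile (pinball) loss mentioned above; this is illustrative, not the PR's implementation:

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantile=0.5):
    # Asymmetric absolute error: its minimizer is the requested conditional
    # quantile of y_true; quantile=0.5 recovers half the absolute error
    # (median regression).
    diff = y_true - y_pred
    return np.mean(np.maximum(quantile * diff, (quantile - 1) * diff))
```

Having such a loss available in the shared private module is what would make a quantile option in HGBT cheap to add.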
- Urgentish: post 0.24.0 fixes (released Dec 22, 2020):
- [#19063](https://github.com/scikit-learn/scikit-learn/issues/19063) binary compat issue with old-ish macOS (10.13) for vendored libomp
- [#19097](https://github.com/scikit-learn/scikit-learn/issues/19097) build scikit-learn from source on Apple Silicon
- Shall we do a fast 0.24.1 with this bugfix soon (e.g. next week) and maybe another 0.24.2 later? Any other important bugs/regressions to fix?
- Option: fold the fix into the next planned release, or tag a dedicated bugfix release now.
- Release 1.0 (renaming of 0.25): target April or May 2021, after the next Python release
- We should remove non-ARM builds from Travis and limit the frequency of ARM builds there; then we'll probably be fine
- Drop Python 3.6 support (SciPy has already done it)
### FYI (coming up SLEP)
- Gael: `fit_transform` == `fit().transform()` (full discussion on https://docs.google.com/document/d/1DvoqY00ov7f87VutqGBP0ad0s-1a4PLH2FzuC4RVz4A/edit)
- **Proposal**: The question at hand is whether to allow `fit().transform()` to differ from `fit_transform()` beyond numerical error, i.e., to implement a significantly different strategy
- **Benefits**: It would enable us to implement new learning approaches that we currently don't support (stacking as a transformer, target encoding with internal CV, a k-nearest-neighbors transformer without the pain). The overall concept is that of a train-time transform that differs from the test-time transform. TargetEncoding is the current main motivation (see the sketch after this list).
- **Plan B**: Workarounds that would allow rejecting this proposal: bagging can be used instead of `cross_val_predict` to implement TargetEncoder (and stacking); alternatively, add a keyword argument to enable the unsafe behavior
- **Drawbacks**: It could be surprising to users (a more complicated story to explain) and, importantly, it would open the door to nasty bugs, for instance for users not using a pipeline and manually calling `fit().transform()` on the training data
- Mitigation of this drawback: hashing and weak-refs could be used to detect that the same data is passed to fit and transform on those transformers and issue a warning. It would probably detect only a fraction of the cases, but would be useful to educate users about the problem.
- **Next step**: draft a SLEP and vote on it (even if to reject it, so as to document the decision process)
- **Comments**:
- We don't strictly follow this contract in scikit-learn as it is today
- (Adrin): limitations of the workaround
- (Adrin): impact on third parties -> what would this API allow other people to do, and how would they implement their workarounds (point: third parties already violate the contract)
- Clarify how much difference is acceptable between `fit().transform()` and `fit_transform()`; NMF is an example of a challenging case
- Data augmentation is a good example (e.g., in bias-mitigation strategies)
- Adrin to add references and/or sample code
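A minimal illustrative sketch of the train-time vs. test-time asymmetry discussed above, using a hypothetical `CVTargetEncoder` (not an actual scikit-learn API). It also sketches the hashing mitigation: `fit` stores a fingerprint of the data so `transform` can warn when it sees the training data again:

```python
# Hypothetical CVTargetEncoder: fit_transform() deliberately differs from
# fit().transform() -- illustrative only, not the actual proposal.
import hashlib
import warnings

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import KFold


class CVTargetEncoder(BaseEstimator, TransformerMixin):
    """Encode one categorical column by the per-category mean of y."""

    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    def fit(self, X, y):
        X = np.asarray(X).ravel()
        y = np.asarray(y, dtype=float)
        self.global_mean_ = y.mean()
        self.mapping_ = {c: y[X == c].mean() for c in np.unique(X)}
        # Fingerprint the training data to later detect fit(X).transform(X).
        self._fit_hash_ = hashlib.sha256(X.astype(str).tobytes()).hexdigest()
        return self

    def transform(self, X):
        X = np.asarray(X).ravel()
        if hashlib.sha256(X.astype(str).tobytes()).hexdigest() == self._fit_hash_:
            warnings.warn(
                "transform() called on the training data; use fit_transform() "
                "to get leak-free out-of-fold encodings."
            )
        return np.array(
            [self.mapping_.get(c, self.global_mean_) for c in X]
        ).reshape(-1, 1)

    def fit_transform(self, X, y):
        # Train-time path: each row is encoded with a mapping learned on the
        # *other* folds (cross-fitting), so no row sees its own target.
        self.fit(X, y)  # keep the full-data mapping for test time
        X = np.asarray(X).ravel()
        y = np.asarray(y, dtype=float)
        out = np.empty(len(X))
        cv = KFold(n_splits=self.n_splits, shuffle=True, random_state=0)
        for train, test in cv.split(X):
            fold_map = {
                c: y[train][X[train] == c].mean() for c in np.unique(X[train])
            }
            fold_mean = y[train].mean()
            out[test] = [fold_map.get(c, fold_mean) for c in X[test]]
        return out.reshape(-1, 1)
```

In a pipeline, `fit_transform` gives the downstream model leak-free encodings at train time while `transform` applies the full-data mapping at test time; the fingerprint warning catches only exact re-use of the training array, matching the "fraction of the cases" caveat above.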
### General topics
- Loïc: enable GitHub Discussions? One hope is to help get people outside the core developers involved. [GitHub Discussions doc](https://docs.github.com/en/free-pro-team@latest/discussions)
- Reshama: how many people are on the mailing list? over 2k
- Also, Discussions would allow for better code formatting
- We should be able to enforce our CoC
- We would need to extend the triage team, who can mark accepted answers and make sure the answers are indeed correct
- We can try it and see how it goes, and roll back if we see we can't handle it
- Make a PR to update the issue template to point to GitHub Discussions for user questions
- AFME Feb 2021 Sprint (https://afme2021.dataumbrella.org)
- Andy, Adrin, and Olivier are giving a hand
### Contributors
### Next Meeting
25.1.2021, same time