- One alternative is to have estimators receive everything, but ignore what they don't use by default.
- Requiring explicit requests and validating the given props against them gives good error reporting.
- The API should be able to handle typos in the given props.
- If an estimator requests `my_weight` instead of `sample_weight`, is the alias resolved at the meta-estimator level or by the individual estimators?
- Meta-estimators either pass everything along, or only the props each estimator requests.
- Does the pipeline/estimator fail if an estimator requests something but doesn't get it?
- At the moment, there are estimators, splitters, and scorers:
  - Estimators have to request props explicitly, e.g. `sample_weight`.
  - `groups` is passed to splitters, but somewhat inconsistently.
  - Scorers are difficult: you can pass them as a string, and then there is (almost) no way (currently) to make them request `sample_weight`.
## Main Open Questions
### 1. Routing
Does the pipeline pass everything around, or only the requested props?
- Only pass what's requested, and only under its explicit name.
- The "pass everything" proposal makes documenting the accepted props harder.
### 2. Validation
Do estimators validate the given props against the requested ones?
- Estimators should keep doing what they do now.
Do meta-estimators validate whether the given props were actually requested?
- Meta-estimators check that the given props == the requested props.
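A minimal sketch of that validation step (the helper name and error message are made up, not part of any scikit-learn release): the meta-estimator compares the user-supplied props against the union of its children's requests and raises on anything unrequested, which is also what catches typos like `sample_eight`.

``` python
def route_props(given_props, requested_props):
    """Pass on only the requested props; raise on unrequested (e.g. typo'd) keys."""
    unrequested = set(given_props) - set(requested_props)
    if unrequested:
        raise TypeError(
            f"Props {sorted(unrequested)} were passed but not requested by "
            f"any child; requested props are {sorted(requested_props)}."
        )
    return {k: v for k, v in given_props.items() if k in requested_props}
```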
### 3. Parameters
Do estimators accept only `sample_weight`, or also a different name that the user specifies, like `my_weight`?
- Alex prefers explicit names.
### 4. List of params or dict?
So far we've been passing parameters as keyword arguments (`**fit_params`).
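A rough illustration of the trade-off (both functions are made-up stand-ins, not scikit-learn API): with `**fit_params` every key must be a valid Python identifier, whereas a single `props` dict keeps keys as plain strings, which makes user-chosen aliases (as in Case D further down) straightforward.

``` python
def fit_kwargs_style(X, y, **fit_params):
    # keys must be valid Python identifiers
    return sorted(fit_params)


def fit_dict_style(X, y, props=None):
    # keys are arbitrary strings, so aliases like "fitting_weight" are easy
    return sorted(props or {})
```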
### 5. Default requested props
- Estimators request nothing by default.
- Splitters request `groups` by default.
  - This could be avoided if we had a `KFold` splitter which optionally accepts `groups`. We should only have a default prop request where the prop is _required_ by the object and does not switch behaviour.
- Scorers should explicitly request `sample_weight`.
- Estimators pass `sample_weight` to their scorer by default.
- We could introduce weighted string names for the most used scorers to be passed to meta-estimators such as `GridSearchCV`.
- Meta-estimators' `score` method should do the same routing as meta-estimators do elsewhere.
- AG: we also need a way to explicitly declare that an available prop should *not* be accepted.
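The default rules above could be sketched as follows; the class names are invented and the `get_metadata_request` return shape only loosely follows the output format shown further down, so treat this as illustration rather than real scikit-learn code.

``` python
class SketchEstimator:
    """Estimators request nothing by default."""

    def get_metadata_request(self):
        # sample_weight must be requested explicitly,
        # e.g. via request_sample_weight(fit=True)
        return {"fit": {}, "score": {}}


class SketchGroupSplitter:
    """Group* splitters request `groups` by default."""

    def get_metadata_request(self):
        # `groups` is required for the split to be meaningful,
        # so it is the one justified default request
        return {"split": {"groups": "groups"}}
```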
## Notes
Make sure objects that don't accept a prop also don't have the method allowing users to request that prop (e.g. `ElasticNet`).

What to do with meta-estimators which are also consumers?
`GridSearchCV(LogisticRegression()).fit(X, y, sample_weight)`:
- If neither `fit` nor `score` of `LogisticRegression` requests `sample_weight`, then it fails.
- But `GridSearchCV(LogisticRegression(), scorer='sample_weighted_accuracy').fit(X, y, sample_weight)` would do weighted scoring and an unweighted `fit`.
- One option, when the user passes `sample_weight` but `LogisticRegression` has neither explicitly requested nor declined it, is to raise. In other words, by default we **don't know** what to do, so we raise.
Can it be experimental?
- I really don't think so (Adrin).

The current proposal makes existing code break loudly rather than change behaviour silently.
We also need to add code snippets to the SLEP, explain what's expected there, and expand on the user experience of the API.
## Takeaway Message
"If requested, you shall pass."
## Code snippets
``` python
def test_slep_caseA():
    # Case A: weighted scoring and fitting
    # Here we presume that GroupKFold requests `groups` by default.
    # We need to explicitly request weights in make_scorer and for
    # LogisticRegressionCV. Both of these consumers understand the meaning
    # of the key "sample_weight".
    weighted_acc = make_scorer(
        accuracy_score, request_metadata=["sample_weight"]
    )
    lr = LogisticRegressionCV(
        cv=GroupKFold(), scoring=weighted_acc,
    ).request_sample_weight(fit=True)
    cross_validate(
        lr,
        X,
        y,
        cv=GroupKFold(),  # Group* splitters request `groups` by default
        props={"sample_weight": my_weights, "groups": my_groups},
        scoring=weighted_acc,
    )
    # Error handling: if props={'sample_eight': my_weights, ...} was passed,
    # cross_validate would raise an error, since 'sample_eight' was not
    # requested by any of its children.


def test_slep_caseB():
    # Case B: weighted scoring and unweighted fitting
    # Since LogisticRegressionCV requires that weights be requested
    # explicitly, setting the request to False means the fitting is
    # unweighted.
    weighted_acc = make_scorer(
        accuracy_score, request_metadata=["sample_weight"]
    )
    lr = LogisticRegressionCV(
        cv=GroupKFold(), scoring=weighted_acc,
    ).request_sample_weight(fit=False)
    # Without explicitly setting fit=False (or fit=True) above,
    # cross_validate would fail: the prop is available but its
    # routing is unspecified.
    cross_validate(
        lr,
        X,
        y,
        cv=GroupKFold(),  # Group* splitters request `groups` by default
        props={"sample_weight": my_weights, "groups": my_groups},
        scoring=weighted_acc,
    )


def test_slep_caseC():
    # Case C: unweighted feature selection
    # Like LogisticRegressionCV, SelectKBest would need to request weights
    # explicitly. Here it does not request them.
    weighted_acc = make_scorer(
        accuracy_score, request_metadata=["sample_weight"]
    )
    lr = LogisticRegressionCV(
        cv=GroupKFold(), scoring=weighted_acc,
    ).request_sample_weight(fit=True)
    sel = SelectKBest(k=2)
    pipe = make_pipeline(sel, lr)
    cross_validate(
        pipe,
        X,
        y,
        cv=GroupKFold(),
        props={"sample_weight": my_weights, "groups": my_groups},
        scoring=weighted_acc,
    )


def test_slep_caseD():
    # Case D: different scoring and fitting weights
    # Despite make_scorer and LogisticRegressionCV both expecting a key
    # sample_weight, we can use aliases to pass different weights to
    # different consumers.
    weighted_acc = make_scorer(
        accuracy_score,
        request_metadata={"scoring_weight": "sample_weight"},
    )
    lr = LogisticRegressionCV(
        cv=GroupKFold(), scoring=weighted_acc,
    ).request_sample_weight(fit="fitting_weight")
    cross_validate(
        lr,
        X,
        y,
        cv=GroupKFold(),  # Group* splitters request `groups` by default
        props={
            "scoring_weight": my_weights,
            "fitting_weight": my_other_weights,
            "groups": my_groups,
        },
        scoring=weighted_acc,
    )
```
Examples of behavior
``` python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

clf = LogisticRegression()
print(clf)
print(clf.get_metadata_request())

clf = LogisticRegression().request_sample_weight(fit=True)
print(clf)
print(clf.get_metadata_request())

clf = make_pipeline(StandardScaler(), LogisticRegression())
print(clf)
print(clf.get_metadata_request())

clf = make_pipeline(
    StandardScaler().request_sample_weight(fit=False),
    LogisticRegression().request_sample_weight(fit=False),
)
print(clf)
print(clf.get_metadata_request())

clf = make_pipeline(
    StandardScaler().request_sample_weight(fit='myweights'),
    LogisticRegression().request_sample_weight(fit=False),
)
print(clf)
print(clf.get_metadata_request())

clf = make_pipeline(
    StandardScaler().request_sample_weight(fit=False),
    LogisticRegression().request_sample_weight(
        fit=True).request_sample_weight(score=True),
)
print(clf)
print(clf.get_metadata_request())
```
gives
``` python
In [6]: %run play
LogisticRegression()
{'fit': {}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
LogisticRegression()
{'fit': {'sample_weight': {'sample_weight'}}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
{'fit': {}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
{'fit': {}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
{'fit': {'myweights': 'myweights'}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
{'fit': {'sample_weight': 'sample_weight'}, 'predict': {}, 'transform': {}, 'score': {'sample_weight': 'sample_weight'}, 'split': {}, 'inverse_transform': {}}
```
``` python
# This would raise a deprecation warning that the provided metadata
# is not requested:
GridSearchCV(LogisticRegression()).fit(X, y, sample_weight=sw)

# This would work with no warnings:
GridSearchCV(
    LogisticRegression().request_sample_weight(fit=True)
).fit(X, y, sample_weight=sw)

# This would raise an error: LogisticRegression could accept
# `sample_weight`, but the user has not specified whether it should:
GridSearchCV(
    LogisticRegression(),
    scoring=make_scorer(accuracy_score,
                        request_metadata=['sample_weight'])
).fit(X, y, sample_weight=sw)
```