- One alternative is to have estimators receive everything, but ignore what they don't use by default.
- Requiring explicit requests and validating the given props against them gives good error reporting.
- The API should be able to handle typos in the given props.
- If an estimator requests `my_weight` instead of `sample_weight`, is the alias resolved at the meta-estimator level or by the individual estimators?
- Meta-estimators either pass everything along, or only the props each estimator requests.
- Does the pipeline/estimator fail if an estimator requests something but doesn't get it?
- At the moment, there are estimators, splitters, and scorers:
  - Estimators have to request props explicitly, e.g. `sample_weight`.
  - `groups` is passed to splitters, but somewhat inconsistently.
  - Scorers are difficult: you can pass them as a string, and then there is (almost) no way (currently) to make them request `sample_weight`.
## Main Open Questions
### 1. Routing
Does the pipeline pass everything around, or only the requested props?
- Only pass what's requested, and only under its explicit name.
- The "pass everything" proposal makes documenting the accepted props harder.
### 2. Validation
Do estimators validate the given props against the requested ones?
- Estimators should keep doing what they do now.
Do meta-estimators validate whether the given props were actually requested?
- Meta-estimators check that the given props == the requested props.
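A minimal sketch of that validation step (the helper name and error message are made up, not part of any scikit-learn release): the meta-estimator compares the user-supplied props against the union of its children's requests and raises on anything unrequested, which is also what catches typos like `sample_eight`.

``` python
def route_props(given_props, requested_props):
    """Pass on only the requested props; raise on unrequested (e.g. typo'd) keys."""
    unrequested = set(given_props) - set(requested_props)
    if unrequested:
        raise TypeError(
            f"Props {sorted(unrequested)} were passed but not requested by "
            f"any child; requested props are {sorted(requested_props)}."
        )
    return {k: v for k, v in given_props.items() if k in requested_props}
```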
### 3. Parameters
Do estimators accept only `sample_weight`, or also a different name that the user specifies, like `my_weight`?
- Alex prefers explicit names.
### 4. List of params or dict?
So far we've been passing parameters as keyword arguments (`**fit_params`).
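A rough illustration of the trade-off (both functions are made-up stand-ins, not scikit-learn API): with `**fit_params` every key must be a valid Python identifier, whereas a single `props` dict keeps keys as plain strings, which makes user-chosen aliases (as in Case D further down) straightforward.

``` python
def fit_kwargs_style(X, y, **fit_params):
    # keys must be valid Python identifiers
    return sorted(fit_params)


def fit_dict_style(X, y, props=None):
    # keys are arbitrary strings, so aliases like "fitting_weight" are easy
    return sorted(props or {})
```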
### 5. Default requested props
- Estimators request nothing by default.
- Splitters request `groups` by default.
  - This could be avoided if we had a `KFold` splitter which optionally accepts `groups`. We should only have a default prop request where the prop is _required_ by the object and does not switch behaviour.
- Scorers should explicitly request `sample_weight`.
- Estimators pass `sample_weight` to their scorer by default.
- We could introduce weighted string names for the most used scorers to be passed to meta-estimators such as `GridSearchCV`.
- Meta-estimators' `score` method should do the same routing as meta-estimators do elsewhere.
- AG: we also need a way to explicitly declare that an available prop should *not* be accepted.
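The default rules above could be sketched as follows; the class names are invented and the `get_metadata_request` return shape only loosely follows the output format shown further down, so treat this as illustration rather than real scikit-learn code.

``` python
class SketchEstimator:
    """Estimators request nothing by default."""

    def get_metadata_request(self):
        # sample_weight must be requested explicitly,
        # e.g. via request_sample_weight(fit=True)
        return {"fit": {}, "score": {}}


class SketchGroupSplitter:
    """Group* splitters request `groups` by default."""

    def get_metadata_request(self):
        # `groups` is required for the split to be meaningful,
        # so it is the one justified default request
        return {"split": {"groups": "groups"}}
```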
## Notes
Make sure objects that don't accept a prop also don't have the method allowing users to request that prop (e.g. `ElasticNet`).

What to do with meta-estimators which are also consumers?
`GridSearchCV(LogisticRegression()).fit(X, y, sample_weight)`:
- If neither `fit` nor `score` of `LogisticRegression` requests `sample_weight`, then it fails.
- But `GridSearchCV(LogisticRegression(), scorer='sample_weighted_accuracy').fit(X, y, sample_weight)` would do weighted scoring and an unweighted `fit`.
- One option, when the user passes `sample_weight` but `LogisticRegression` has neither explicitly requested nor declined it, is to raise. In other words, by default we **don't know** what to do, so we raise.
Can it be experimental?
- I really don't think so (Adrin).

The current proposal makes existing code break loudly rather than change behaviour silently.
We also need to add code snippets to the SLEP, explain what's expected there, and expand on the user experience of the API.
## Takeaway Message
"If requested, you shall pass."
## Code snippets
``` python
def test_slep_caseA():
    # Case A: weighted scoring and fitting
    # Here we presume that GroupKFold requests `groups` by default.
    # We need to explicitly request weights in make_scorer and for
    # LogisticRegressionCV. Both of these consumers understand the meaning
    # of the key "sample_weight".
    weighted_acc = make_scorer(
        accuracy_score, request_metadata=["sample_weight"]
    )
    lr = LogisticRegressionCV(
        cv=GroupKFold(), scoring=weighted_acc,
    ).request_sample_weight(fit=True)
    cross_validate(
        lr,
        X,
        y,
        cv=GroupKFold(),  # Group* splitters request `groups` by default
        props={"sample_weight": my_weights, "groups": my_groups},
        scoring=weighted_acc,
    )
    # Error handling: if props={'sample_eight': my_weights, ...} was passed,
    # cross_validate would raise an error, since 'sample_eight' was not
    # requested by any of its children.


def test_slep_caseB():
    # Case B: weighted scoring and unweighted fitting
    # Since LogisticRegressionCV requires that weights be requested
    # explicitly, setting the request to False means the fitting is
    # unweighted.
    weighted_acc = make_scorer(
        accuracy_score, request_metadata=["sample_weight"]
    )
    lr = LogisticRegressionCV(
        cv=GroupKFold(), scoring=weighted_acc,
    ).request_sample_weight(fit=False)
    # Without explicitly setting fit=False (or fit=True) above,
    # cross_validate would fail: the prop is available but its
    # routing is unspecified.
    cross_validate(
        lr,
        X,
        y,
        cv=GroupKFold(),  # Group* splitters request `groups` by default
        props={"sample_weight": my_weights, "groups": my_groups},
        scoring=weighted_acc,
    )


def test_slep_caseC():
    # Case C: unweighted feature selection
    # Like LogisticRegressionCV, SelectKBest would need to request weights
    # explicitly. Here it does not request them.
    weighted_acc = make_scorer(
        accuracy_score, request_metadata=["sample_weight"]
    )
    lr = LogisticRegressionCV(
        cv=GroupKFold(), scoring=weighted_acc,
    ).request_sample_weight(fit=True)
    sel = SelectKBest(k=2)
    pipe = make_pipeline(sel, lr)
    cross_validate(
        pipe,
        X,
        y,
        cv=GroupKFold(),
        props={"sample_weight": my_weights, "groups": my_groups},
        scoring=weighted_acc,
    )


def test_slep_caseD():
    # Case D: different scoring and fitting weights
    # Despite make_scorer and LogisticRegressionCV both expecting a key
    # sample_weight, we can use aliases to pass different weights to
    # different consumers.
    weighted_acc = make_scorer(
        accuracy_score,
        request_metadata={"scoring_weight": "sample_weight"},
    )
    lr = LogisticRegressionCV(
        cv=GroupKFold(), scoring=weighted_acc,
    ).request_sample_weight(fit="fitting_weight")
    cross_validate(
        lr,
        X,
        y,
        cv=GroupKFold(),  # Group* splitters request `groups` by default
        props={
            "scoring_weight": my_weights,
            "fitting_weight": my_other_weights,
            "groups": my_groups,
        },
        scoring=weighted_acc,
    )
```
Examples of behavior
``` python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

clf = LogisticRegression()
print(clf)
print(clf.get_metadata_request())

clf = LogisticRegression().request_sample_weight(fit=True)
print(clf)
print(clf.get_metadata_request())

clf = make_pipeline(StandardScaler(), LogisticRegression())
print(clf)
print(clf.get_metadata_request())

clf = make_pipeline(
    StandardScaler().request_sample_weight(fit=False),
    LogisticRegression().request_sample_weight(fit=False),
)
print(clf)
print(clf.get_metadata_request())

clf = make_pipeline(
    StandardScaler().request_sample_weight(fit='myweights'),
    LogisticRegression().request_sample_weight(fit=False),
)
print(clf)
print(clf.get_metadata_request())

clf = make_pipeline(
    StandardScaler().request_sample_weight(fit=False),
    LogisticRegression().request_sample_weight(
        fit=True).request_sample_weight(score=True),
)
print(clf)
print(clf.get_metadata_request())
```
gives
``` python
In [6]: %run play
LogisticRegression()
{'fit': {}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
LogisticRegression()
{'fit': {'sample_weight': {'sample_weight'}}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
{'fit': {}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
{'fit': {}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
{'fit': {'myweights': 'myweights'}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
{'fit': {'sample_weight': 'sample_weight'}, 'predict': {}, 'transform': {}, 'score': {'sample_weight': 'sample_weight'}, 'split': {}, 'inverse_transform': {}}
```
``` python
# This would raise a deprecation warning that the provided metadata
# is not requested:
GridSearchCV(LogisticRegression()).fit(X, y, sample_weight=sw)

# This would work with no warnings:
GridSearchCV(
    LogisticRegression().request_sample_weight(fit=True)
).fit(X, y, sample_weight=sw)

# This would raise an error: LogisticRegression could accept
# `sample_weight`, but the user has not specified whether it should:
GridSearchCV(
    LogisticRegression(),
    scoring=make_scorer(accuracy_score,
                        request_metadata=['sample_weight'])
).fit(X, y, sample_weight=sw)
```