# Return type
How should we handle unnamed columns provided as sensitive features or control features? By unnamed columns I mean any `ndarray`, `list` or an unnamed `pandas.Series`.
Besides sensible printouts, we would like to enable expressions like these seamlessly:
```python
mf = MetricFrame(...)
print(mf.by_group - mf.overall)
print(mf.by_group.min(level=mf.control_levels))
```
## Pandas approach
When concatenating, pandas creates a combined column index in which unnamed columns receive automatically generated names: integers starting at 0 (counting only the unnamed columns). For example:
```python
import numpy as np
import pandas as pd

n = 7
array = np.random.choice(['gray', 'pink'], n)
series_strname = pd.Series(array, name='f')
series_intname = pd.Series(array, name=100)
series_noname = pd.Series(array)
df = pd.concat([
    series_noname,
    series_strname,
    series_strname,
    series_intname,
    series_noname], axis=1)
print(df)
print("Columns: " + str(df.columns))
```
This returns:
```
      0     f     f   100     1
0  gray  gray  gray  gray  gray
1  pink  pink  pink  pink  pink
2  pink  pink  pink  pink  pink
3  pink  pink  pink  pink  pink
4  pink  pink  pink  pink  pink
5  gray  gray  gray  gray  gray
6  pink  pink  pink  pink  pink
Columns: Index([0, 'f', 'f', 100, 1], dtype='object')
```
## What should Fairlearn do?
I think that we have the following options (Option 4 below is a restricted version of Option 2).
### Option 1: We only allow named sensitive and control features.
:::info
_MD_: I think that this is the minimal option, and I'd be okay with it. However, it will still run into strange behavior if there are repeated feature names, especially repeated **int** names. So I would suggest that we require that
* all control features and sensitive features have distinct names, all of which are strings.
:::
```python
# ALLOWED

# Named pandas.Series
mf = MetricFrame(...,
                 sensitive_features=series_strname)

# Dictionary of pandas.Series, ndarrays, and lists
mf = MetricFrame(...,
                 sensitive_features={
                     'f1': series_intname,
                     'f2': series_noname,
                     'f3': array})

# Any pandas.DataFrame whose columns are strings
df = pd.DataFrame(array2d)
df.columns = df.columns.astype(str)
mf = MetricFrame(...,
                 sensitive_features=df)

# NOT ALLOWED

# Unnamed or int-named pandas.Series
mf = MetricFrame(...,
                 sensitive_features=series_noname)
mf = MetricFrame(...,
                 sensitive_features=series_intname)

# pandas.DataFrame with ints or None as columns
mf = MetricFrame(...,
                 sensitive_features=pd.DataFrame(array2d))

# 1D or 2D ndarray
mf = MetricFrame(...,
                 sensitive_features=array)
mf = MetricFrame(...,
                 sensitive_features=array2d)
```
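For illustration, here is a minimal sketch of the validation Option 1 implies; `_check_feature_names` is a hypothetical helper, not part of the proposal or of the Fairlearn API:
```python
import pandas as pd

def _check_feature_names(features):
    # Hypothetical Option 1 check: every sensitive/control feature must
    # carry a distinct string name.
    if isinstance(features, pd.DataFrame):
        names = list(features.columns)
    elif isinstance(features, pd.Series):
        names = [features.name]
    elif isinstance(features, dict):
        names = list(features.keys())
    else:
        # ndarray, list, ... carry no names at all
        raise ValueError(
            "Unnamed features are not allowed; pass a named Series, a "
            "DataFrame with string columns, or a dict.")
    if not all(isinstance(n, str) for n in names):
        raise ValueError("All feature names must be strings.")
    if len(set(names)) != len(names):
        raise ValueError("Feature names must be distinct.")
    return names
```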
### Option 2: Allow unnamed features and impute as in pandas
:::info
_MD_: Frankly, after spelling out this approach below, I don't think it is a good idea: covering all the cases below would be confusing. However, covering only **2a** and **2b** would probably suffice and would, I think, remove the confusion (that is my preferred proposal, **Option 4**, see below).
:::
Example 2a: Two unnamed sensitive features provided in an `ndarray`
```python
mf = MetricFrame(...,
                 sensitive_features=array2d)
print(mf.by_group)
```
```
0     1
gray  gray    0.375970
      pink    0.368810
pink  gray    0.555287
      pink    0.390717
```
Example 2b: Two unnamed sensitive features, one unnamed control feature
```python
mf = MetricFrame(...,
                 sensitive_features=array2d,
                 control_features=array1d)
print(mf.by_group)
```
```
0     1     2
high  gray  gray    0.482279
            pink    0.629078
      pink  gray    0.280982
            pink    0.549721
low   gray  gray    0.593529
            pink    0.382138
      pink  gray    0.473462
            pink    0.373155
```
Example 2c: Two unnamed sensitive features, one named control feature
```python
mf = MetricFrame(...,
                 sensitive_features=array2d,
                 control_features=pd.Series(array1d, name='control'))
print(mf.by_group)
```
```
control  0     1
high     gray  gray    0.432310
               pink    0.437083
         pink  gray    0.504155
               pink    0.491229
low      gray  gray    0.600662
               pink    0.566908
         pink  gray    0.559627
               pink    0.529099
```
Example 2d: Two named sensitive features, one unnamed control feature
```python
mf = MetricFrame(...,
                 sensitive_features=data_frame,
                 control_features=array1d)
print(mf.by_group)
```
```
0     Feature 1  Feature 2
high  gray       gray         0.553912
                 pink         0.696952
      pink       gray         0.369010
                 pink         0.492070
low   gray       gray         0.511260
                 pink         0.614432
      pink       gray         0.440655
                 pink         0.567196
```
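For illustration, a minimal sketch of how this imputation might work internally; `_combine_features` is a hypothetical helper that simply wraps every feature as a `pandas.Series` and lets `pd.concat` assign integer names to the unnamed columns, exactly as in the pandas example above:
```python
import numpy as np
import pandas as pd

def _combine_features(control_features, sensitive_features):
    # Hypothetical: build the index columns of by_group, letting pandas
    # number the unnamed columns 0, 1, ... as in the example above.
    def as_series_list(features):
        if features is None:
            return []
        if isinstance(features, pd.DataFrame):
            return [features[c] for c in features.columns]
        if isinstance(features, pd.Series):
            return [features]
        arr = np.asarray(features)
        if arr.ndim == 1:
            arr = arr.reshape(-1, 1)
        return [pd.Series(arr[:, i]) for i in range(arr.shape[1])]

    # control features come first, so they form the outer index levels
    columns = as_series_list(control_features) + as_series_list(sensitive_features)
    return pd.concat(columns, axis=1)
```
Under these assumptions, this reproduces the headers of Examples 2a–2d: all-unnamed inputs yield `0, 1, 2`, while named inputs keep their names.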
### Option 3: Allow unnamed features and impute with strings
:::info
_MD_: This is probably preferred over Option 2, but I like Option 4 below even better.
:::
Example 3a: Two unnamed sensitive features provided in an `ndarray`
```python
mf = MetricFrame(...,
                 sensitive_features=array2d)
print(mf.by_group)
```
```
sensitive_level_0  sensitive_level_1
gray               gray                 0.545032
                   pink                 0.470399
pink               gray                 0.570158
                   pink                 0.302527
```
Example 3b: Two unnamed sensitive features, one unnamed control feature
```python
mf = MetricFrame(...,
                 sensitive_features=array2d,
                 control_features=array1d)
print(mf.by_group)
```
```
cf0   sf0   sf1
high  gray  gray    0.545842
            pink    0.544927
      pink  gray    0.516627
            pink    0.569869
low   gray  gray    0.586118
            pink    0.458823
      pink  gray    0.431459
            pink    0.551652
```
Example 3c: Two unnamed sensitive features, one named control feature
```python
mf = MetricFrame(...,
                 sensitive_features=array2d,
                 control_features=pd.Series(array1d, name='control'))
print(mf.by_group)
```
```
control  sf0   sf1
high     gray  gray    0.432310
               pink    0.437083
         pink  gray    0.504155
               pink    0.491229
low      gray  gray    0.600662
               pink    0.566908
         pink  gray    0.559627
               pink    0.529099
```
Example 3d: Two named sensitive features, one unnamed control feature
```python
mf = MetricFrame(...,
                 sensitive_features=data_frame,
                 control_features=array1d)
print(mf.by_group)
```
```
cf0   Feature 1  Feature 2
high  gray       gray         0.553912
                 pink         0.696952
      pink       gray         0.369010
                 pink         0.492070
low   gray       gray         0.511260
                 pink         0.614432
      pink       gray         0.440655
                 pink         0.567196
```
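For illustration, a minimal sketch of the string imputation; `_impute_string_names` is hypothetical, and the `sf{i}`/`cf{i}` scheme follows Examples 3b–3d (Example 3a shows an alternative `sensitive_level_{i}` scheme):
```python
def _impute_string_names(columns, prefix):
    # Hypothetical: rename any column (a pandas.Series) that lacks a string
    # name, using prefix 'sf' for sensitive and 'cf' for control features.
    renamed = []
    for i, col in enumerate(columns):
        name = col.name if isinstance(col.name, str) else f"{prefix}{i}"
        renamed.append(col.rename(name))
    return renamed

# e.g. two unnamed sensitive columns become 'sf0', 'sf1';
# an unnamed control column becomes 'cf0'; named columns keep their names
```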
### Option 4 = Option 2 limited to:
* all features are distinctly named, or
* all features are unnamed
:::info
_MD_: This is my currently preferred choice.
:::
Example 4a: Two unnamed sensitive features provided in an `ndarray`
```python
mf = MetricFrame(...,
                 sensitive_features=array2d)
print(mf.by_group)
```
```
0     1
gray  gray    0.375970
      pink    0.368810
pink  gray    0.555287
      pink    0.390717
```
Example 4b: Two unnamed sensitive features, one unnamed control feature
```python
mf = MetricFrame(...,
                 sensitive_features=array2d,
                 control_features=array1d)
print(mf.by_group)
```
```
0     1     2
high  gray  gray    0.482279
            pink    0.629078
      pink  gray    0.280982
            pink    0.549721
low   gray  gray    0.593529
            pink    0.382138
      pink  gray    0.473462
            pink    0.373155
```
~~Example 4c: Two unnamed sensitive features, one named control feature~~
~~Example 4d: Two named sensitive features, one unnamed control feature~~
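For illustration, a minimal sketch of the Option 4 rule; `_check_option4` is hypothetical and assumes `names` has been collected from all control and sensitive features, with `None` standing for an unnamed feature:
```python
def _check_option4(names):
    # Hypothetical: either every feature has a distinct string name, or no
    # feature is named at all (integer names 0, 1, ... are then imputed).
    all_unnamed = all(n is None for n in names)
    all_named = (all(isinstance(n, str) for n in names)
                 and len(set(names)) == len(names))
    if not (all_unnamed or all_named):
        raise ValueError(
            "Either give all sensitive and control features distinct string "
            "names, or leave all of them unnamed.")
```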
# Two API variants [old stuff; disregard!!!]
### Variant 1 (final proposal)
```python
class MetricFrame:
    def __init__(self, metric,
                 y_true, y_pred, *,
                 sensitive_features,
                 control_features=None,
                 sample_params=None):
        ...

    @property
    def overall(self):
        ...

    @property
    def by_group(self):
        ...

    def group_max(self):
        ...

    def group_min(self):
        ...

    def difference(self, method='between_groups'):
        # method can also be 'to_overall'
        ...

    def ratio(self, method='between_groups'):
        # method can also be 'to_overall'
        ...


def make_derived_metric(base_metric, derivation_type, *,
                        sample_param_names=['sample_weight']):
    # derivation_type can be:
    #   'group_min', 'group_max', 'difference', 'ratio'
    #
    # Parameters of the returned callable are treated as
    # static (i.e., not to be sliced) unless their name is
    # in sample_param_names.
    ...


### Examples
# (assumes `import sklearn.metrics as skm` and `from functools import partial`)

# examples of predefined metrics
recall_score_difference = make_derived_metric(
    skm.recall_score, 'difference')
recall_score_group_min = make_derived_metric(
    skm.recall_score, 'group_min')

# get values using predefined metrics
val1 = recall_score_difference(
    y_true, y_pred, sensitive_features=sf,
    pos_label=2, sample_weight=sw, method='to_overall')
val2 = recall_score_group_min(
    y_true, y_pred, sensitive_features=sf,
    pos_label=2, sample_weight=sw)

# get the same values using MetricFrame
mf = MetricFrame(
    partial(skm.recall_score, pos_label=2),
    y_true, y_pred, sensitive_features=sf,
    sample_params={'sample_weight': sw})
val1 = mf.difference(method='to_overall')
val2 = mf.group_min()
```
### Variant 2 (simplified `make_derived_metric`)
```python
class GroupedMetric:
    # the same as Variant 1
    ...


def make_derived_metric(metric_type, base_metric):
    # metric_type can be:
    #   'group_min', 'group_max', 'difference', 'ratio'
    #
    # Parameters of the returned callable are all treated
    # as sample parameters.
    ...


### Examples
# no predefined metrics, or only predefined metrics
# without static parameters

# custom derived metrics for recall_score with pos_label=2
recall_label2 = partial(skm.recall_score, pos_label=2)
recall_label2_difference = make_derived_metric(
    'difference', recall_label2)
recall_label2_group_min = make_derived_metric(
    'group_min', recall_label2)

val1 = recall_label2_difference(
    y_true, y_pred, sensitive_features=sf,
    sample_weight=sw, method='to_overall')
val2 = recall_label2_group_min(
    y_true, y_pred, sensitive_features=sf,
    sample_weight=sw)

# GroupedMetric example as in Variant 1
```
## TASKS
### TASK 1: report one disaggregated metric
```python
# STATUS QUO
bunch = group_summary(
    accuracy_score, y_true, y_pred, sensitive_features=sf)
frame = pd.Series(bunch.by_group)
frame_o = pd.Series({**bunch.by_group, 'overall': bunch.overall})

# IDEA 1A
grouped = GroupSummary(
    accuracy_score, y_true, y_pred, sensitive_features=sf)
frame = grouped.by_group
frame_o = grouped.by_group.append(grouped.overall)
# or
frame_o = pd.concat([grouped.by_group, grouped.overall])
```
### TASK 2: report multiple disaggregated metrics
```python
# STATUS QUO
bunch1 = group_summary(
    accuracy_score, y_true, y_pred, sensitive_features=sf)
bunch2 = group_summary(
    f1_score, y_true, y_pred, sensitive_features=sf)
frame = pd.DataFrame({
    'accuracy': bunch1.by_group,
    'f1': bunch2.by_group})
frame_o = pd.DataFrame({
    'accuracy': {**bunch1.by_group, 'overall': bunch1.overall},
    'f1': {**bunch2.by_group, 'overall': bunch2.overall}})

# IDEA 2A
grouped = GroupSummary(
    {'accuracy': accuracy_score, 'f1': f1_score},
    y_true, y_pred, sensitive_features=sf)
frame = grouped.by_group
frame_o = grouped.by_group.append(grouped.overall)
# or
frame_o = pd.concat([grouped.by_group, grouped.overall])
```
### TASK 3: Report several performance and fairness metrics of several models in a data frame.
```python
# STATUS QUO

# handling of metric parameters using functools
fhalf_score = functools.partial(fbeta_score, beta=0.5)

# standard transformations provided by fairlearn
custom_difference1 = make_derived_metric(
    difference_from_summary,
    make_metric_group_summary(fhalf_score))

# non-standard transformation
def custom_difference2(y_true, y_pred, *, sensitive_features):
    bunch = group_summary(
        fbeta_score, y_true, y_pred,
        sensitive_features=sensitive_features, beta=0.5)
    frame = pd.Series(bunch.by_group)
    return (frame - frame['White']).min()

# Below is mostly boilerplate code whose simplification is beyond
# the scope of the current proposal, but it is in some ways
# reminiscent of sklearn.model_selection.cross_validate
fairness_metrics = {
    'Custom difference 1': custom_difference1,
    'Custom difference 2': custom_difference2,
    'Demographic parity difference': demographic_parity_difference,
    'Worst-case balanced accuracy': balanced_accuracy_score_group_min}
performance_metrics = {
    'FPR': false_positive_rate,
    'FNR': false_negative_rate}
predictions_by_estimator = {
    'logreg': y_pred_lr,
    'svm': y_pred_svm}
df = pd.DataFrame()
for pred_key, y_pred in predictions_by_estimator.items():
    for fairm_key, fairm in fairness_metrics.items():
        df.loc[fairm_key, pred_key] = fairm(
            y_true, y_pred, sensitive_features=sf)
    for perfm_key, perfm in performance_metrics.items():
        df.loc[perfm_key, pred_key] = perfm(y_true, y_pred)

# IDEA 3A - simpler creation of standard transformations
custom_difference1 = make_derived_metric(
    'difference', fbeta_score, beta=0.5)

# IDEA 3B - variant of 3A
custom_difference1 = make_derived_metric(
    'difference', fbeta_score, params={'beta': 0.5})

# IDEA 3C - leveraging a more powerful differences() method
def custom_difference2(y_true, y_pred, *, sensitive_features):
    grouped = GroupedMetric(
        fbeta_score, y_true, y_pred,
        sensitive_features=sensitive_features,
        params={'beta': 0.5})
    return grouped.differences(
        relative_to='group', group='White', aggregate='min')

# IDEA 3D - without the differences() method
def custom_difference2(y_true, y_pred, *, sensitive_features):
    grouped = GroupedMetric(
        fbeta_score, y_true, y_pred,
        sensitive_features=sensitive_features,
        params={'beta': 0.5})
    return (grouped.by_group - grouped.by_group['White']).min()

# the remainder as before
```
MD: Issues with the above pattern (both status quo and proposed): it doesn't work so well with multiple metrics if some metrics need scores (i.e., `score()` or `predict_proba()`) and some need raw predictions (i.e., `predict()`).
AM: sklearn has the scorer interface to deal with the different requirements, and to ensure that multiple metrics don't call the same method multiple times, we have a private [_MultimetricScorer](https://github.com/scikit-learn/scikit-learn/pull/14593) that implements some caching.
RGE: I don't quite understand what the above means
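To illustrate AM's point with a small sketch (not part of the proposal): sklearn's `make_scorer` lets each metric declare which prediction method it needs, so a dict of scorers can mix probability-based and label-based metrics. Here `estimator`, `X_test`, and `y_test` are hypothetical names for a fitted model and held-out data.
```python
from sklearn.metrics import accuracy_score, make_scorer, roc_auc_score

# A scorer bundles a metric with the prediction method it requires and is
# called as scorer(estimator, X, y).
scorers = {
    # calls estimator.predict()
    'accuracy': make_scorer(accuracy_score),
    # calls estimator.decision_function() or predict_proba()
    # (newer scikit-learn versions spell this via response_method instead)
    'auc': make_scorer(roc_auc_score, needs_threshold=True),
}

results = {name: scorer(estimator, X_test, y_test)
           for name, scorer in scorers.items()}
```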
### TASK 4: Report several performance and fairness metrics as well as some disaggregated metrics of several models in a data frame.
Skip for now
### TASK 5: Create a fairness-performance raster plot of several models.
```python
# Current
my_fairness_metric = custom_difference1
my_performance_metric = false_positive_rate
xs = [my_performance_metric(Y_test, y_pred)
      for y_pred in predictions_by_estimator.values()]
ys = [my_fairness_metric(Y_test, y_pred, sensitive_features=A_test['Race'])
      for y_pred in predictions_by_estimator.values()]
plt.scatter(xs, ys)
plt.xlabel('False positive rate')
plt.ylabel('Custom difference 1')
plt.show()

# Proposed
# The same, but with the new definition of custom_difference1
```
### TASK 6: Run sklearn.model_selection.cross_validate
Use demographic parity and precision score as the metrics
```python
# Current
precision_scorer = make_scorer(precision_score)
y_t = pd.Series(Y_test)

def dpd_wrapper(y_t, y_p, sensitive_features):
    # We need to slice up the sensitive feature to match y_t and y_p
    # See Adrin's reply to:
    # https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function
    sf_slice = sensitive_features.loc[y_t.index.values].values.reshape(-1)
    return demographic_parity_difference(
        y_t, y_p, sensitive_features=sf_slice)

dp_scorer = make_scorer(dpd_wrapper, sensitive_features=A_test['Race'])
scoring = {'prec': precision_scorer, 'dp': dp_scorer}
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X_test, y_t, scoring=scoring)
scores

# Proposed
# Unchanged until scikit-learn supports the slicing of sensitive_features
```
### TASK 7: Run GridSearchCV
Use demographic parity and the accuracy score; the goal is to find the lowest-error model whose demographic parity difference is at most 0.05.
```python
# Current
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
scoring = {'prec': precision_scorer, 'dp': dp_scorer}
clf = svm.SVC(kernel='linear', C=1, random_state=0)

# selection_function would implement the best-estimator
# selection strategy
gscv = GridSearchCV(clf, param_grid=param_grid, scoring=scoring,
                    refit=selection_function, verbose=1)
gscv.fit(X_test, y_t)

print("Best parameters set found on development set:")
print(gscv.best_params_)
print("Best score:", gscv.best_score_)
print()
print("Overall results")
print(gscv.cv_results_)
```
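For concreteness, here is one possible sketch of such a `selection_function`. It is hypothetical: it assumes the `scoring` dict also registers an accuracy scorer under the key `'acc'` to match the stated goal (the block above reuses the precision scorer from TASK 6), so that `cv_results_` contains `mean_test_acc` alongside `mean_test_dp`. With a callable `refit`, `GridSearchCV` expects the callable to take `cv_results_` and return the index of the selected candidate; note that `best_score_` is not populated in that case, so scores should be read from `cv_results_` directly.
```python
import numpy as np

def selection_function(cv_results):
    # Hypothetical refit callable: among candidates whose mean demographic
    # parity difference is at most 0.05, pick the one with the highest mean
    # accuracy; returns an integer index into cv_results.
    dp = np.asarray(cv_results['mean_test_dp'])
    acc = np.asarray(cv_results['mean_test_acc'])
    feasible = dp <= 0.05
    if not feasible.any():
        # no candidate satisfies the constraint: fall back to the least
        # unfair one
        return int(np.argmin(dp))
    # ignore infeasible candidates, then take the best accuracy
    return int(np.argmax(np.where(feasible, acc, -np.inf)))
```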