Return type

How should we handle unnamed columns provided as sensitive features or control features? By unnamed columns I mean any ndarray, list, or unnamed pandas.Series.

Besides sensible printouts, we would like to enable expressions like these seamlessly:

mf = MetricFrame(...)
print(mf.by_group - mf.overall)
print(mf.by_group.min(level=mf.control_levels))

Pandas approach

When concatenating, pandas builds the column index from the series names; unnamed columns receive automatically generated integer names (0, 1, ...), while named columns keep their names. For example:

import numpy as np
import pandas as pd

n = 7
array = np.random.choice(['gray', 'pink'], n)
series_strname = pd.Series(array, name='f')
series_intname = pd.Series(array, name=100)
series_noname = pd.Series(array)

df = pd.concat([
    series_noname,
    series_strname,
    series_strname,
    series_intname,
    series_noname], axis=1)
print(df)
print("Columns: " + str(df.columns))

This prints:

      0     f     f   100     1
0  gray  gray  gray  gray  gray
1  pink  pink  pink  pink  pink
2  pink  pink  pink  pink  pink
3  pink  pink  pink  pink  pink
4  pink  pink  pink  pink  pink
5  gray  gray  gray  gray  gray
6  pink  pink  pink  pink  pink

Columns: Index([0, 'f', 'f', 100, 1], dtype='object')

What should Fairlearn do?

I think that we have three basic options, plus a fourth (Option 4 below) that restricts Option 2.

Option 1: We only allow named sensitive and control features.

MD: I think that this is the minimal option, and I'd be okay with it. However, it will still run into strange behavior if some feature names are repeated, and especially if there are repeated int names. So I would suggest that we require that

  • all control features and sensitive features have distinct names, which are all strings.

(A validation sketch of this requirement follows the examples below.)

# ALLOWED
# Named pandas.Series
mf = MetricFrame(...,
    sensitive_features=series_strname)

# Dictionary of pandas.Series, ndarrays, and lists
mf = MetricFrame(...,
    sensitive_features={
        'f1': series_intname,
        'f2': series_noname,
        'f3': array})

# Any pandas.DataFrame whose columns are strings
df = pd.DataFrame(array2d)
df.columns = df.columns.astype(str)
mf = MetricFrame(...,
    sensitive_features=df)

# NOT ALLOWED
# Unnamed or int-named pandas.Series
mf = MetricFrame(...,
    sensitive_features=series_noname)

mf = MetricFrame(...,
    sensitive_features=series_intname)

# pandas.DataFrame with ints or None as columns
mf = MetricFrame(...,
    sensitive_features=pd.DataFrame(array2d))

# 1D or 2D ndarray
mf = MetricFrame(...,
    sensitive_features=array)
    
mf = MetricFrame(...,
    sensitive_features=array2d)
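
A minimal sketch (my own, not part of the proposal) of the validation that Option 1 implies, operating on the collected names of all control and sensitive features:

def _check_feature_names(names):
    # names: list of the names of all control and sensitive features
    if not all(isinstance(name, str) for name in names):
        raise ValueError("All sensitive and control feature names must be strings.")
    if len(set(names)) != len(names):
        raise ValueError("Sensitive and control feature names must be distinct.")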

Option 2: Allow unnamed features and impute as in pandas

MD: Frankly, after spelling out this approach below, I don't think it is a good idea: covering all the cases below would be confusing. However, covering only cases 2a and 2b would probably suffice and would remove the confusion; that is my most preferred proposal, Option 4, see below.

Example 2a: Two unnamed sensitive features provided in an ndarray

mf = MetricFrame(...,
    sensitive_features=array2d)
print(mf.by_group)
0     1   
gray  gray    0.375970
      pink    0.368810
pink  gray    0.555287
      pink    0.390717

Example 2b: Two unnamed sensitive features, one unnamed control feature

mf = MetricFrame(...,
    sensitive_features=array2d,
    control_features=array1d)
print(mf.by_group)
0     1     2   
high  gray  gray    0.482279
            pink    0.629078
      pink  gray    0.280982
            pink    0.549721
low   gray  gray    0.593529
            pink    0.382138
      pink  gray    0.473462
            pink    0.373155

Example 2c: Two unnamed sensitive features, one named control feature

mf = MetricFrame(...,
    sensitive_features=array2d,
    control_features=pd.Series(array1d, name='control'))
print(mf.by_group)
control  0     1   
high     gray  gray    0.432310
               pink    0.437083
         pink  gray    0.504155
               pink    0.491229
low      gray  gray    0.600662
               pink    0.566908
         pink  gray    0.559627
               pink    0.529099

Example 2d: Two named sensitive features, one unnamed control feature

mf = MetricFrame(...,
    sensitive_features=data_frame,
    control_features=array1d)
print(mf.by_group)
0     Feature 1  Feature 2
high  gray       gray         0.553912
                 pink         0.696952
      pink       gray         0.369010
                 pink         0.492070
low   gray       gray         0.511260
                 pink         0.614432
      pink       gray         0.440655
                 pink         0.567196
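
For concreteness, here is a sketch of the pandas-style imputation Option 2 describes; the helper is my own assumption, but the numbering (consecutive integers starting at 0 for the unnamed features, control features first, named features untouched) matches Examples 2a-2d above.

import pandas as pd

def _impute_integer_names(features):
    # features: list of pandas.Series, control features first, then sensitive ones
    imputed, next_int = [], 0
    for f in features:
        if f.name is None:
            f = f.rename(next_int)
            next_int += 1
        imputed.append(f)
    return imputed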

Option 3: Allow unnamed features and impute with strings

MD: This is probably preferred over Option 2, but I like Option 4 below even better.

Example 3a: Two unnamed sensitive features provided in an ndarray

mf = MetricFrame(...,
    sensitive_features=array2d)
print(mf.by_group)
sensitive_level_0   sensitive_level_1 
gray                gray                0.545032
                    pink                0.470399
pink                gray                0.570158
                    pink                0.302527

Example 3b: Two unnamed sensitive features, one unnamed control feature

mf = MetricFrame(...,
    sensitive_features=array2d,
    control_features=array1d)
print(mf.by_group)
cf0   sf0   sf1 
high  gray  gray    0.545842
            pink    0.544927
      pink  gray    0.516627
            pink    0.569869
low   gray  gray    0.586118
            pink    0.458823
      pink  gray    0.431459
            pink    0.551652

Example 3c: Two unnamed sensitive features, one named control feature

mf = MetricFrame(...,
    sensitive_features=array2d,
    control_features=pd.Series(array1d, name='control'))
print(mf.by_group)
control  sf0   sf1   
high     gray  gray    0.432310
               pink    0.437083
         pink  gray    0.504155
               pink    0.491229
low      gray  gray    0.600662
               pink    0.566908
         pink  gray    0.559627
               pink    0.529099

Example 3d: Two named sensitive features, one unnamed control feature

mf = MetricFrame(...,
    sensitive_features=data_frame,
    control_features=array1d)
print(mf.by_group)
cf0   Feature 1  Feature 2
high  gray       gray         0.553912
                 pink         0.696952
      pink       gray         0.369010
                 pink         0.492070
low   gray       gray         0.511260
                 pink         0.614432
      pink       gray         0.440655
                 pink         0.567196
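
The analogous sketch for Option 3 is below (again my own assumption); note that the exact string scheme is still open, since Example 3a uses sensitive_level_0 while Examples 3b-3d use sf0/cf0.

def _impute_string_names(features, prefix):
    # prefix: e.g. 'sf' for sensitive features, 'cf' for control features
    imputed, counter = [], 0
    for f in features:
        if f.name is None:
            f = f.rename(prefix + str(counter))
            counter += 1
        imputed.append(f)
    return imputed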

Option 4 = Option 2 limited to:

  • all features are distinctly named, or
  • all features are unnamed

MD: This is my currently preferred choice.

Example 4a: Two unnamed sensitive features provided in an ndarray

mf = MetricFrame(...,
    sensitive_features=array2d)
print(mf.by_group)
0     1   
gray  gray    0.375970
      pink    0.368810
pink  gray    0.555287
      pink    0.390717

Example 4b: Two unnamed sensitive features, one unnamed control feature

mf = MetricFrame(...,
    sensitive_features=array2d,
    control_features=array1d)
print(mf.by_group)
0     1     2   
high  gray  gray    0.482279
            pink    0.629078
      pink  gray    0.280982
            pink    0.549721
low   gray  gray    0.593529
            pink    0.382138
      pink  gray    0.473462
            pink    0.373155

Example 4c: Two unnamed sensitive features, one named control feature

Example 4d: Two named sensitive features, one unnamed control feature
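
A sketch (my own reading, not part of the proposal) of the naming rule Option 4 implies, applied to the combined list of control and sensitive features; under this reading, the mixed cases 4c and 4d, left open above, would be rejected.

def _resolve_names(features):
    # features: list of pandas.Series, control features first, then sensitive ones
    names = [f.name for f in features]
    if all(name is None for name in names):
        # all unnamed: impute consecutive integers as in Option 2 / pandas
        return list(range(len(features)))
    if all(name is not None for name in names) and len(set(names)) == len(names):
        # all named with distinct names: keep them
        return names
    raise ValueError(
        "Features must either all be unnamed or all carry distinct names.")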

Two API variants [old stuff; disregard!!!]

Variant 1 (final proposal)

class MetricFrame:
    def __init__(self, metric,
                 y_true, y_pred, *,
                 sensitive_features,
                 control_features=None,
                 sample_params=None):
    @property
    def overall(self):

    @property
    def by_group(self):

    def group_max(self):

    def group_min(self):

    def difference(self, method='between_groups'):
    # method can also be 'to_overall'

    def ratio(self, method='between_groups'):
    # method can also be 'to_overall'


def make_derived_metric(base_metric, derivation_type, *,
                        sample_param_names=['sample_weight']):
    # derivation_type can be:
    #     'group_min', 'group_max', 'difference', 'ratio'
    #     
    # Parameters of the returned callable are treated as
    # static (i.e., not to be sliced) unless their name is
    # in sample_param_names.

### Examples

# examples of predefined metrics
recall_score_difference = make_derived_metric(
    skm.recall_score, 'difference')

recall_score_group_min = make_derived_metric(
    skm.recall_score, 'group_min')

# get values using predefined metrics
val1 = recall_score_difference(
    y_true, y_pred, sensitive_features=sf,
    pos_label=2, sample_weight=w, method='to_overall')

val2 = recall_score_group_min(
    y_true, y_pred, sensitive_features=sf,
    pos_label=2, sample_weight=w)

# get the same values using MetricFrame
mf = MetricFrame(
    partial(skm.recall_score, pos_label=2),
    y_true, y_pred, sensitive_features=sf,
    sample_params={'sample_weight': w})

val1 = mf.difference(method='to_overall')

val2 = mf.group_min()
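
To make the relationship between the derived metrics and MetricFrame concrete, here is a rough sketch (not part of the proposal; the keyword plumbing and error handling are my own assumptions) of how make_derived_metric could be implemented on top of the Variant 1 MetricFrame:

from functools import partial

def make_derived_metric(base_metric, derivation_type, *,
                        sample_param_names=('sample_weight',)):
    def derived(y_true, y_pred, *, sensitive_features,
                control_features=None, method='between_groups', **kwargs):
        # split kwargs into per-sample parameters (sliced per group)
        # and static parameters (bound to the base metric)
        sample_params = {k: v for k, v in kwargs.items()
                         if k in sample_param_names}
        static_params = {k: v for k, v in kwargs.items()
                         if k not in sample_param_names}
        mf = MetricFrame(partial(base_metric, **static_params),
                         y_true, y_pred,
                         sensitive_features=sensitive_features,
                         control_features=control_features,
                         sample_params=sample_params)
        if derivation_type == 'group_min':
            return mf.group_min()
        if derivation_type == 'group_max':
            return mf.group_max()
        if derivation_type == 'difference':
            return mf.difference(method=method)
        if derivation_type == 'ratio':
            return mf.ratio(method=method)
        raise ValueError("Unknown derivation_type: " + str(derivation_type))
    return derived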

Variant 2 (simplified make_derived_metric)

class GroupedMetric:
    # the same as Variant 1


def make_derived_metric(metric_type, base_metric):
    # metric_type can be:
    #     'group_min', 'group_max', 'difference', 'ratio'
    #     
    # Parameters of the returned callable are all treated
    # as sample parameters.

### Examples

# no predefined metrics, or only predefined metrics
# without static parameters.

# custom derived metrics for recall_score with pos_label=2
recall_label2 = partial(skm.recall_score, pos_label=2) 

recall_label2_difference = make_derived_metric(
    'difference', recall_label2)

recall_label2_group_min = make_derived_metric(
    'group_min', recall_label2)

val1 = recall_label2_difference(
    y_true, y_pred, sensitive_features=sf,
    sample_weight=w, method='to_overall')

val2 = recall_label2_group_min(
    y_true, y_pred, sensitive_features=sf,
    sample_weight=w)

# GroupedMetric example as in Variant 1

TASKS

TASK 1: report one disaggregated metric

# STATUS QUO
bunch = group_summary(
    accuracy_score, y_true, y_pred, sensitive_features=sf)
frame = pd.Series(bunch.by_group)
frame_o = pd.Series({**bunch.by_group, 'overall': bunch.overall})

# IDEA 1A
grouped = GroupSummary(
    accuracy_score, y_true, y_pred, sensitive_features=sf)
frame = grouped.by_group
frame_o = grouped.by_group.append(pd.Series({'overall': grouped.overall}))
# or
frame_o = pd.concat([grouped.by_group, pd.Series({'overall': grouped.overall})])

TASK 2: report multiple disaggregated metrics

# STATUS QUO
bunch1 = group_summary(
    accuracy_score, y_true, y_pred, sensitive_features=sf)
bunch2 = group_summary(
    f1_score, y_true, y_pred, sensitive_features=sf)
frame = pd.DataFrame({
   'accuracy': bunch1.by_group,
   'f1': bunch2.by_group})
frame_o = pd.DataFrame({
   'accuracy': {**bunch1.by_group, 'overall': bunch1.overall},
   'f1': {**bunch2.by_group, 'overall': bunch2.overall}})

# IDEA 2A
grouped = GroupSummary(
   {'accuracy': accuracy_score, 'f1': f1_score},
   y_true, y_pred, sensitive_features=sf)
frame = grouped.by_group
# overall is a Series indexed by metric name; add it as an 'overall' row
frame_o = grouped.by_group.append(grouped.overall.rename('overall'))
# or
frame_o = pd.concat([grouped.by_group, grouped.overall.rename('overall').to_frame().T])

TASK 3: Report several performance and fairness metrics of several models in a data frame.

# STATUS QUO

# handling of metric parameters using functools
fhalf_score = functools.partial(fbeta_score, beta=0.5)

# standard transformations provided by fairlearn
custom_difference1 = make_derived_metric(
    difference_from_summary,
    make_metric_group_summary(fhalf_score))

# non-standard transformation
def custom_difference2(y_true, y_pred, *, sensitive_features):
    bunch = group_summary(
        fbeta_score, y_true, y_pred, sensitive_features=sensitive_features, beta=0.5)
    frame = pd.Series(bunch.by_group)
    return (frame-frame['White']).min()

# The code below is mostly boilerplate; simplifying it is beyond the
# scope of the current proposal, but it is in some ways reminiscent
# of sklearn.model_selection.cross_validate
fairness_metrics = {
    'Custom difference 1': custom_difference1,
    'Custom difference 2': custom_difference2,
    'Demographic parity difference': demographic_parity_difference,
    'Worst-case balanced accuracy': balanced_accuracy_score_group_min}
performance_metrics = {
    'FPR': false_positive_rate,
    'FNR': false_negative_rate}
predictions_by_estimator = {
    'logreg': y_pred_lr,
    'svm': y_pred_svm}

df = pd.DataFrame()
for pred_key, y_pred in predictions_by_estimator.items():
    for fairm_key, fairm in fairness_metrics.items():
        df.loc[fairm_key, pred_key] = fairm(y_true, y_pred, sensitive_features=sf)
    for perfm_key, perfm in performance_metrics.items():
        df.loc[perfm_key, pred_key] = perfm(y_true, y_pred)
    
# IDEA 3A - simpler creation of standard transformations
custom_difference1 = make_derived_metric(
    'difference', fbeta_score, beta=0.5)

# IDEA 3B - variant of 3A
custom_difference1 = make_derived_metric(
    'difference', fbeta_score, params={'beta': 0.5})

# IDEA 3C - leveraging a more powerful differences() method
def custom_difference2(y_true, y_pred, *, sensitive_features):
    grouped = GroupedMetric(
        fbeta_score, y_true, y_pred, sensitive_features=sensitive_features,
        params={'beta': 0.5})
    return grouped.differences(
        relative_to='group', group='White', aggregate='min')

# IDEA 3D - without the differences() method
def custom_difference2(y_true, y_pred, *, sensitive_features):
    grouped = GroupedMetric(
        fbeta_score, y_true, y_pred, sensitive_features=sensitive_features,
        params={'beta': 0.5})
    return (grouped.by_group - grouped.by_group['White']).min()

# the remainder as before

MD: An issue with the above pattern (both status quo and proposed): it doesn't work so well with multiple metrics if some metrics need scores, i.e., score() or predict_proba(), while others need raw predictions, i.e., predict().
AM: sklearn has the scorer interface to deal with the different requirements, and to ensure multiple metrics don't call the same method multiple times, we have a private _MultimetricScorer that implements some caching.

RGE: I don't quite understand what the above means
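
To illustrate AM's point with scikit-learn's public API (a sketch, not part of the proposal): a scorer bundles the metric with the estimator method it needs, so a scoring dict can mix label-based and score-based metrics without the caller branching on predict() vs predict_proba().

from sklearn.metrics import accuracy_score, make_scorer, roc_auc_score

scoring = {
    # called as scorer(estimator, X, y); uses estimator.predict()
    'accuracy': make_scorer(accuracy_score),
    # uses decision_function() or predict_proba() under the hood
    'roc_auc': make_scorer(roc_auc_score, needs_threshold=True),
}
# The private _MultimetricScorer additionally caches predictions so that
# metrics sharing the same estimator method do not recompute them.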

TASK 4: Report several performance and fairness metrics as well as some disaggregated metrics of several models in a data frame.

Skip for now

TASK 5: Create a fairness-performance raster plot of several models.

# Current
my_fairness_metric = custom_difference1
my_performance_metric = false_positive_rate

xs = [my_performance_metric(Y_test, y_pred)
      for y_pred in predictions_by_estimator.values()]
ys = [my_fairness_metric(Y_test, y_pred, sensitive_features=A_test['Race'])
      for y_pred in predictions_by_estimator.values()]

plt.scatter(xs,ys)
plt.xlabel('False positive rate')
plt.ylabel('Custom difference 1')
plt.show()

# Proposed
# The same, but with new definition of custom_difference1

TASK 6: Run sklearn.model_selection.cross_validate

Use demographic parity and precision score as the metrics

# Current
precision_scorer = make_scorer(precision_score)

y_t = pd.Series(Y_test)
def dpd_wrapper(y_t, y_p, sensitive_features):
    # We need to slice up the sensitive feature to match y_t and y_p
    # See Adrin's reply to:
    # https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function
    sf_slice = sensitive_features.loc[y_t.index.values].values.reshape(-1)
    return demographic_parity_difference(y_t, y_p, sensitive_features=sf_slice)
dp_scorer = make_scorer(dpd_wrapper, sensitive_features=A_test['Race'])

scoring = {'prec':precision_scorer, 'dp':dp_scorer}
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X_test, y_t, scoring=scoring)
scores

# Proposed
# Unchanged until scikit-learn supports the slicing of sensitive_features

TASK 7: Run GridSearchCV

With demographic parity and accuracy score, where the goal is to find the lowest-error model whose demographic parity is <= 0.05.

# Current
from sklearn.model_selection import GridSearchCV

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]
accuracy_scorer = make_scorer(accuracy_score)
scoring = {'acc': accuracy_scorer, 'dp': dp_scorer}

clf = svm.SVC(kernel='linear', C=1, random_state=0)

# selection_function implements the best-estimator selection strategy;
# a sketch of it is given after this code block
gscv = GridSearchCV(clf, param_grid=param_grid, scoring=scoring,
                    refit=selection_function, verbose=1)
gscv.fit(X_test, y_t)

print("Best parameters set found on development set:")  
print(gscv.best_params_)
print("Best score:", gscv.best_score_)
print()
print("Overall results")
print(gscv.cv_results_)
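
For reference, GridSearchCV accepts a callable for refit that receives cv_results_ and must return the index of the selected candidate. A sketch of selection_function under that interface follows (the fallback behavior is my own assumption; the 0.05 threshold comes from the task statement). Note that mean_test_dp holds the raw demographic parity difference because dp_scorer was created with make_scorer's default greater_is_better=True.

import numpy as np

def selection_function(cv_results_):
    # Among candidates whose mean demographic parity difference is <= 0.05,
    # pick the one with the highest mean accuracy; if none qualifies,
    # fall back to the overall most accurate candidate.
    dp = np.asarray(cv_results_['mean_test_dp'])
    acc = np.asarray(cv_results_['mean_test_acc'])
    feasible = dp <= 0.05
    if not feasible.any():
        return int(np.argmax(acc))
    return int(np.argmax(np.where(feasible, acc, -np.inf)))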