We thank the reviewers for their detailed and thoughtful feedback, and for noting that the work is "thorough on its theoretical guarantees of performance" and that "the results are backed up by experiments". Below, we address their questions and show that the raised concerns can be remedied through clarifications in the paper. In particular:
1) Our proposed EOR criterion is fundamentally different from FA*IR.
2) We clarify the connection to relevant prior work based on reviewers' suggestions.
### Difference between FA*IR and EOR
1. Unlike EOR, FA*IR is oblivious to the relevance distribution and thus cannot take disparate uncertainty into account.
2. FA*IR ensures a notion of proportional representation and thus falls into the same category as demographic parity or proportional Rooney Rule. We already extensively compare against these methods but also report additional results for FA*IR below.
3. FA*IR requires the normative designation of a disadvantaged group, while EOR criterion does not require this.
The FA*IR criterion is anchored on the principle that a top-k ranking is fair if the proportion of disadvantaged candidates in the ranking does not fall far below a required minimum proportion $p$. This is formalized via a Binomial significance test at confidence level $1-\alpha$.
Binomial(p=0.5, n) corresponds to a ranking where, at each position, a candidate from either group is selected at random. This is a "softened" version of demographic parity (DP) and is thus fundamentally different from Axiom 1 and Definition 4.1, which are derived from the fairness of the uniform lottery: unlike DP, the uniform lottery is anchored on selecting an equal fraction of **relevance** from each group.
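For concreteness, here is a minimal sketch of the basic (unadjusted) FA*IR prefix constraint, assuming the standard Binomial test at significance level $\alpha$; the full FA*IR algorithm additionally adjusts $\alpha$ for multiple testing across prefixes.

```python
# Minimal sketch (our own illustration, not the fairsearchcore implementation):
# the unadjusted FA*IR test requires that, for every prefix length k, the number of
# disadvantaged-group candidates is at least the smallest m whose Binomial(k, p) CDF
# exceeds alpha.
from scipy.stats import binom

def min_protected(k: int, p: float = 0.5, alpha: float = 0.1) -> int:
    """Smallest count m such that P[Binomial(k, p) <= m] > alpha."""
    m = 0
    while binom.cdf(m, k, p) <= alpha:
        m += 1
    return m

# For k = 4, p = 0.5, alpha = 0.1 this yields 1, i.e., at least one candidate from
# the disadvantaged group must appear in the top-4 (matching the example below).
print([min_protected(k) for k in range(1, 5)])  # [0, 0, 0, 1]
```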
Consider the following example with top-k = 4:
group A = [0.7, 0.7, 0.7, 0.7, 0.1, 0.1], group size = 6, expected relevant candidates = 3.0
group B = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5], group size = 6, expected relevant candidates = 3.0
The EOR Ranking for top-4 is [0.5, 0.7, 0.5, 0.7] with 2 candidates from group A, and 2 from group B. $\delta(\sigma_4)=0.13$
The FA*IR Algorithm with Binomial(p=0.5, n=12), k=4 and $\alpha=0.1$ requires that at least 1 candidate be selected from the disadvantaged group while maximizing the utility to the principal.
The FA*IR Ranking **w. group B as the disadvantaged group** is [0.7, 0.7, 0.7, 0.5]. It selects 3 candidates from group A, and 1 from group B, resulting in $\delta(\sigma_4)=0.53$.
If instead **group A is designated as the disadvantaged group**, the FA*IR Ranking is [0.7, 0.7, 0.7, 0.7]. Now all candidates are from group A, and none from group B, resulting in $\delta(\sigma_4)=0.93$.
Note that for both FA*IR rankings, far fewer relevant candidates are chosen from group B, even though both groups have an equal number of relevant candidates in expectation.
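To make the $\delta(\sigma_k)$ values above easy to verify, here is a minimal sketch, assuming $\delta(\sigma_k)$ denotes the gap between the fractions of expected relevant candidates selected from each group (Definition 4.1):

```python
# Minimal sketch: delta(sigma_k) is assumed to be |(selected expected relevance of A)
# / (total expected relevance of A) - (the same quantity for B)|, as in Definition 4.1.
def delta(selected_a, selected_b, total_a=3.0, total_b=3.0):
    return abs(sum(selected_a) / total_a - sum(selected_b) / total_b)

# EOR top-4 ranking [0.5, 0.7, 0.5, 0.7]: two candidates from each group.
print(round(delta([0.7, 0.7], [0.5, 0.5]), 2))        # 0.13
# FA*IR ranking with group B disadvantaged: [0.7, 0.7, 0.7, 0.5].
print(round(delta([0.7, 0.7, 0.7], [0.5]), 2))        # 0.53
# FA*IR ranking with group A disadvantaged: [0.7, 0.7, 0.7, 0.7].
print(round(delta([0.7, 0.7, 0.7, 0.7], []), 2))      # 0.93
```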
The FA*IR paper recommends and uses $\alpha=0.1$ in all experiments. Note, however, that FA*IR converges to DP for extreme values of $\alpha$, and we have already shown that DP is generally not EOR fair.
We attach a plot for this example to illustrate the difference in EOR and costs at all positions $k$ of the ranking (with FS denoting the FA*IR policy, shown in brown), using group B as the disadvantaged group. Unsurprisingly, the behavior of FA*IR lies between PRP and DP.
Finally, we extend Table 1 (Section 7) with FA*IR as yet another demographic-parity type method, using the recommended parameters and the implementation from the Python fairsearchcore library.
| | Unfairness $\downarrow$ | | | Effectiveness $\uparrow$ | | |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Disp. Uncertainty | High | Medium | Low | High | Medium | Low |
| EOR | 1.07 $\pm 0.01$ | 1.02 $\pm 0.00$ | 1.02 $\pm 0.00$ | 10.44 $\pm 0.15$ | 11.89 $\pm 0.04$ | 14.58 $\pm 0.10$ |
| FA*IR | 13.33 $\pm 0.70$ | 7.04 $\pm 0.16$ | 2.95 $\pm 0.17$ | 11.98 $\pm 0.20$ | 12.00 $\pm 0.02$ | 14.62 $\pm 0.09$ |
The results are as expected: FA*IR incurs substantially higher unfairness than EOR, especially under high disparate uncertainty, while the effectiveness of the two methods is comparable.
### Novelty of our work
Our key contributions include
1. We formalize for the first time that differential uncertainty between groups can cause unfairness in rankings because it induces an uneven cost burden on the groups. (Section 3)
2. We propose the EOR fairness criterion for ranking by deriving it from the axiom of fairness of the uniform lottery. We show how this is related to selecting an equal fraction of **relevant** candidates from each group. (Section 4.2)
3. We propose an efficient algorithm -- the EOR Algorithm -- which selects candidates based on the EOR fairness criterion (Section 5); a simplified illustrative sketch is given after this list.
4. We theoretically show that the EOR Algorithm is near cost-optimal (Theorems 4.1 and 4.2) and compare it against the Uniform policy (Proposition 5.1) and PRP (Proposition 5.2). We also extend the analysis from two to $g$ groups (Theorem 6.1).
5. We empirically evaluate the EOR Algorithm with numerous baselines including demographic parity, PRP, exposure-based baselines, Thompson Sampling, and Uniform policy. We will add FA*IR as another baseline. (Section 7)
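To give a concrete flavor of contribution 3, below is a simplified greedy sketch (our own illustration under stated assumptions, not the exact EOR Algorithm of Section 5 or its guarantees): at every position it appends the top remaining candidate from whichever group keeps the gap between the selected shares of expected relevance smallest. On the example from the FA*IR comparison above, it recovers the EOR top-4 ranking [0.5, 0.7, 0.5, 0.7].

```python
# Simplified greedy sketch (illustration only; the actual EOR Algorithm and its
# near cost-optimality guarantees are defined in Section 5 of the paper).
def greedy_eor(group_a, group_b, k):
    a = sorted(group_a, reverse=True)          # p(r_i) per candidate, highest first
    b = sorted(group_b, reverse=True)
    tot_a, tot_b = sum(group_a), sum(group_b)  # expected relevant candidates per group
    sel_a = sel_b = 0.0
    ranking = []
    for _ in range(k):
        # Consider the best remaining candidate from each group; pick the one
        # that leaves the smaller gap between the groups' selected relevance shares.
        gap_a = abs((sel_a + a[0]) / tot_a - sel_b / tot_b) if a else float("inf")
        gap_b = abs(sel_a / tot_a - (sel_b + b[0]) / tot_b) if b else float("inf")
        if gap_a <= gap_b:
            sel_a += a[0]; ranking.append(("A", a.pop(0)))
        else:
            sel_b += b[0]; ranking.append(("B", b.pop(0)))
    return ranking

# Reproduces the EOR top-4 ranking [0.5, 0.7, 0.5, 0.7] on the example above.
print(greedy_eor([0.7] * 4 + [0.1] * 2, [0.5] * 6, k=4))
```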
We do not claim that equations (1) through (4) in Section 4.1 are novel, and as mentioned on line ~282, the cost to the principal is related to the conventional metric of Recall@k. However, we derive it from first principles to draw the connection between the EOR criterion and ensuring an even group cost burden (introduced in the following Section 4.2). We claim that the definition of the EOR ranking criterion is a novel contribution, and we will further clarify this in the paper, following the reviewers' suggestion.
### EOR is fundamentally different from existing notions of fairness in rankings
There are primarily two forms of fairness notions (and their variations) for rankings in the literature:
1. Proportional representation by size - These include a) diversity constraints like demographic parity or affirmative action such as the Rooney Rule and guarantee a minimum proportion by group size; b) threshold-based formulations that ensure a minimum number of candidates to be selected from the designated disadvantaged group (e.g., FA*IR).
2. Exposure Based - Exposure quantifies the amount of attention allocated to candidates individually or from a particular group. Exposure-based fairness notions guarantee an equitable allocation of attention.
Since existing notions are motivated by representation by group size, by the normative designation of a disadvantaged group, or by the allocation of attention, they do not balance the uneven cost burden that disparate uncertainty between the groups induces. This is demonstrated extensively in all our experiments.
### Reviewer 149A
Regarding: "... calibration is necessary to achieve group fairness (line ~215)...":
While the connection between group-wise calibration and fairness is well established in the literature [1], we clarify that group-wise calibration is required for EOR ranking to ensure that the expected number of relevant candidates computed from the probabilities is meaningful and comparable across groups. We will clarify this in the main text.
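As a concrete illustration of what we mean by group-wise calibration, here is a minimal sketch with hypothetical, synthetic data (the helper name and binning scheme are our own): within each group, predicted $p(r_i)$ should match the empirical relevance rate.

```python
import numpy as np

def groupwise_calibration(p_pred, relevant, groups, n_bins=5):
    """For each group, compare the mean predicted p(r_i) with the empirical
    relevance rate inside equal-width probability bins (a simple reliability check)."""
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        p, r = p_pred[mask], relevant[mask]
        bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
        report[g] = [(p[bins == b].mean(), r[bins == b].mean())
                     for b in range(n_bins) if np.any(bins == b)]
    return report

# Synthetic check: predictions that are calibrated within each group.
rng = np.random.default_rng(0)
p_pred = rng.uniform(0, 1, 10_000)
relevant = (rng.random(10_000) < p_pred).astype(int)
groups = rng.integers(0, 2, 10_000)
print(groupwise_calibration(p_pred, relevant, groups))
```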
### Reviewer 149B
We will include a discussion on the connection between false positives and the EOR criterion.
### Reviewer 149C
1) Regarding "if there is one highly relevant candidate in a group A, ... the new criterion might be satisfied.": We would like to clarify that a group B in which all candidates have $p(r_i)=0.1$ is not deterministically less relevant than a group A with a few candidates at $p(r_i)=0.9$, since every candidate in group B has a 10% probability of being relevant. With many such candidates in group B, the expected number of relevant candidates can even exceed that of group A (e.g., ten candidates at 0.1 contribute 1.0 expected relevant candidates, more than a single candidate at 0.9) and needs to be accounted for. It is just that, while it is easy to identify the relevant candidates in group A, it is much less clear who the relevant candidates are in group B.
3) Regarding "this new criterion is not the implementation ... directly take the distribution into account.": We would like to clarify that our setup is Bayes optimal, as illustrated in Section 3.2. For the Bernoulli relevance variables $r_i$, the expected costs and the EOR criterion are fully summarized by the posterior predictive distribution, which is again Bernoulli with parameter $p(r_i)$. Deciding based on $p(r_i)$ is thus Bayes optimal.
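As a small numerical illustration of this point (a sketch with a hypothetical Beta posterior): when the uncertainty over a candidate's relevance parameter is integrated out, the posterior predictive of the Bernoulli relevance is itself Bernoulli with parameter equal to the posterior mean, so quantities linear in $r_i$ depend on the posterior only through $p(r_i)$.

```python
# Minimal sketch: with a Beta(a, b) posterior over a candidate's relevance parameter,
# the marginal (posterior predictive) probability of relevance equals the posterior
# mean a / (a + b), so expected costs depend on the posterior only through p(r_i).
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 7.0
theta = rng.beta(a, b, size=200_000)    # posterior samples of the relevance parameter
r = rng.random(theta.shape) < theta     # relevance drawn under the full posterior
print(r.mean(), a / (a + b))            # both approximately 0.30
```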
4) Regarding "The concept of considering the uncertainty of estimation ... An idea of Thompson sampling is in particular similar...": we extensively compare with Thompson Sampling for fairness in rankings [2] (the TS policy) and demonstrate that it is not EOR optimal.
[1] Pleiss et al., On Fairness and Calibration, 2017
[2] Singh et al., Fairness in Ranking under Uncertainty, 2021