# ICML 2024 ALFA Rebuttal (Revised)
# Reviewer 1 (Qyhc)
### W1: Improving Readability
Thank you for your feedback. In response, we will carefully revise the introduction and the visual aids, and provide more accessible materials, such as animations explaining the concept, on our GitHub page if the paper is published.
### W2: Rotation and translation in Decision Boundary
Thank you for pointing this out. Our proposed method indeed not only rotates the decision boundary but also translates it. Since the last layer is trained on both the original and the perturbed features, both of its parameters, the weight and the bias, are updated. The change in the weight corresponds to a rotation of the decision boundary, while the change in the bias corresponds to a translation.
We will emphasize that the last layer is fine-tuned, resulting in both rotation and translation.
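For intuition, below is a minimal, self-contained sketch (using random stand-in features and perturbations rather than our actual attack) showing that fine-tuning a linear last layer on the original plus perturbed features updates both the weight (rotation) and the bias (translation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
last_layer = nn.Linear(2, 1)                        # last layer on top of a frozen encoder
w0, b0 = last_layer.weight.detach().clone(), last_layer.bias.detach().clone()

# Stand-in latent features and labels; z_adv plays the role of the perturbed features.
z = torch.randn(200, 2)
y = (z.sum(dim=1, keepdim=True) > 0).float()
z_adv = z + 0.3 * torch.randn_like(z)
feats, labels = torch.cat([z, z_adv]), torch.cat([y, y])

opt = torch.optim.SGD(last_layer.parameters(), lr=0.1)
for _ in range(200):                                # fine-tune weight AND bias
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(last_layer(feats), labels)
    loss.backward()
    opt.step()

print("weight change (rotation):  ", (last_layer.weight - w0).norm().item())
print("bias change (translation):", (last_layer.bias - b0).norm().item())
```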
# Reviewer 2 (ncUi)
### W1: Clarification of the Concept
Our paper does not suggest that the decision boundary itself covers an area; rather, the data augmentation covers the unfair region, leading the newly trained decision boundary to separate the latent space in a fairer manner.
As the reviewer mentioned, the pre-trained decision boundary separates the latent space. Sometimes it may produce unfair predictions, such as a higher false positive rate for the privileged group and a higher false negative rate for the unprivileged group. We refer to these subgroups with higher misclassification rates as the "unfair region," defined in Line 107 of the paper:
>"This region is characterized by disproportionate misclassification rates between privileged and underprivileged groups. Figure 1(a) illustrates this concept, highlighting areas where biased predictions are most prevalent".
Moreover, the caption in Figure 1 explains,
>"The misclassification rates of subgroup {A =1, Y = 0} and {A = 0, Y = 1} are disproportionately high, indicated as the unfair region in the left figure."
As emphasized in the paper, our proposed method is a data augmentation approach in the latent space. The augmented features have the same class labels and sensitive attributes as the samples in the unfair region and are located in that region, covering its area. Ultimately, the decision boundary newly trained on the augmented features separates the latent space in a fairer manner.
### W2: Regarding Adversarial Learning References
Compared to the given references, our proposed method is still distinct and novel. While our method employs 'adversarial training,' it specifically attacks fairness, not accuracy. Consequently, we discuss 'fairness attacks' in our literature review from lines 130 to 140.
Here, we summarize how the proposed method differs from the five references raised by the reviewer.
- Reference [2] adopts a counterfactual data augmentation strategy by blinding the identity terms in the text. This approach does not involve adversarial training or fairness attacks, whereas our method explicitly identifies which group of samples causes fairness issues.
- References [3] to [5] aim to ensure robust classification within the $\ell_p$-ball of a target instance by utilizing adversarial training, which is a min-max optimization in terms of accuracy. As a result, the prediction becomes more stable and less prone to dynamic changes but doesn't improve group fairness or detect unfair regions. Although these studies include adversarial training, their goals and methodologies differ from ours.
- Reference [1] includes a classifier, named the adversary, that aims to predict the sensitive attribute, while the encoder tries to deceive it. This is also a form of adversarial training, but it differs from ours in that [1] does not involve perturbation, data augmentation, or a fairness attack.
- Despite the difference in methodology, we recognize that Reference [1] is one of the methods to achieve group fairness. We thank the reviewer for mentioning this related work and have conducted experiments for comparison. We note that their method, named LAFTR, is only applicable to MLPs and reports experimental results solely on the Adult dataset.
- Below, we show that the performance of LAFTR [1] is not consistent across datasets, and that our method, ALFA, significantly outperforms LAFTR on various datasets.
| Adult | Accuracy |$\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | -------------- | ------------- |
| Baseline | 0.8525±0.0010 | 0.1824±0.0114 | 0.1768±0.0411 |
| ALFA (Ours) | 0.8380±0.0045 | 0.1642±0.0261 | **0.0971±0.0098** |
| LAFTR (Madras et al.) | 0.8470±0.0020 | **0.1497±0.0191** | 0.1117±0.0443 |
| COMPAS | Accuracy |$\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | ------------- | ------------- |
| Baseline | 0.6711±0.0049 | 0.2059±0.0277 | 0.3699±0.0597 |
| ALFA (Ours) | 0.6701±0.0020 | **0.0207±0.0142** | **0.0793±0.0418** |
| LAFTR (Madras et al.) | 0.6397±0.0284 | 0.1164±0.0183 | 0.2089±0.0252 |
| German | Accuracy |$\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | ------------- | ------------- |
| Baseline | 0.7800±0.0150 | 0.0454±0.0282 | 0.2096±0.0924 |
| ALFA (Ours) | 0.7570±0.0024 | **0.0053±0.0064** | **0.0813±0.0110** |
| LAFTR (Madras et al.) | 0.7308±0.0270 | 0.0419±0.0410 | 0.1677±0.1433 |
| Drug | Accuracy |$\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | ------------- | ------------- |
| Baseline | 0.6674±0.0096 | 0.2760±0.0415 | 0.4718±0.0838 |
| ALFA (Ours) | 0.6382±0.0061 | **0.0820±0.0259** | **0.1068±0.0476** |
| LAFTR (Madras et al.) | 0.6195±0.0352 | 0.1848±0.1035 | 0.3235±0.1715 |
[1] Madras et al., Learning Adversarially Fair and Transferable Representations, 2018
[2] Garg et al., Counterfactual Fairness in Text Classification through Robustness, 2019
[3] Yurochkin et al., Training individually fair ML models with Sensitive Subspace Robustness, 2020
[4] Ruoss et al., Learning Certified Individually Fair Representations, 2020
[5] Peychev et al., Latent Space Smoothing for Individually Fair Representations, 2022
### W3: Consistency of Experiments, Accuracy-Fairness Trade-off
Thanks for pointing out a significant insight. We observe that the accuracy-fairness trade-off does not always occur; sometimes both can be improved simultaneously.
- First, the Pareto frontier in our results represents the optimal trade-off line. Below this line, it is possible for both accuracy and fairness to improve simultaneously.
- Moreover, [6] shows that there could be an ideal distribution where accuracy and fairness are in accord, which supports our observation.
[6] Dutta, S., Wei, D., Yueksel, H., Chen, P. Y., Liu, S., & Varshney, K. (2020, November). Is there a trade-off between fairness and accuracy? a perspective using mismatched hypothesis testing. In International conference on machine learning (pp. 2803-2813). PMLR.
### Q1: Considerations on Post-Processing for Fairness
Thanks for raising a valuable question.
Post-processing does not rotate the decision boundary, which is determined by the prediction model's weight. Instead, it adjusts the threshold or the bias of the last layer.
As an example, we implemented a fair post-processing method [7] on the synthetic dataset. The image below illustrates that while it can translate the decision boundaries by adjusting the threshold for each demographic group, the weight of the linear classifier remains unchanged, meaning the boundary is not rotated.
Consequently, compared with post-processing methods, our proposed adversarial augmentation method is more effective because it both rotates and translates the decision boundary (see our response to Reviewer Qyhc, W2) and thus can achieve a better accuracy-fairness trade-off.

[7] Jang, T., Shi, P., & Wang, X. (2022, June). Group-aware threshold adaptation for fair classification. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 6, pp. 6988-6995).
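For concreteness, here is a schematic sketch (with made-up scores and thresholds, not the actual procedure of [7]) of what group-aware thresholding does geometrically: only the effective bias changes per group, while the weight, and hence the boundary's orientation, stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([1.0, -0.5]), 0.2            # frozen linear classifier: score = w.z + b
z = rng.normal(size=(500, 2))                # latent features
a = rng.integers(0, 2, size=500)             # sensitive attribute (0/1)

scores = z @ w + b
thresholds = {0: 0.10, 1: -0.05}             # hypothetical group-aware thresholds
t = np.vectorize(thresholds.get)(a)
y_hat = (scores > t).astype(int)

# Equivalent view: y_hat = 1[w.z + (b - t_a) > 0]. Only the bias term shifts per
# group, so each group's boundary is a translated copy of the original; w is never rotated.
```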
### Q2 \& Q5: Interpretability of the Augmented Feature and Input Perturbation
##### Interpretability
Thanks for pointing out the issue of interpretability. In this work, we can consider interpretability from two aspects:
1) Interpretability on decision boundary (latent space)
2) Interpretability on input feature (input space)
While we have focused on the first aspect, we argue that the proposed method can cover the second aspect as well, addressing the reviewer's concern. Below, we break down the two-fold interpretability and explain how our method applies in both cases.
* First, we focus on the interpretability of decision boundaries, a common approach to understanding a classifier's behavior [8,9]. By manipulating features in the latent space via the fairness attack, we can interpret the decision boundary by discovering the unfair region and adjusting the boundary accordingly. In this case, it is true that we cannot analyze how changes in the input features affect the decision boundary.
* On the other hand, interpretability in the input space makes it possible to analyze how the fairness attack perturbs the input data. However, it sacrifices the interpretability of the decision boundary, such as discovering unfair regions and understanding the last layer's behavior.
##### Fairness attack in input space, and fine-tuning entire network
Fortunately, our framework is also applicable in the input space by deploying the fairness attack and perturbation on the input features. In this case, the entire model is fine-tuned, offering input-level interpretability. We conducted additional experiments with an MLP to validate our framework in the input space, shown in the table below.
Consequently, our method can offer interpretability in either the latent space or the input space. In both cases, the accuracy level is maintained while the fairness issue is mitigated. We opt to freeze the pretrained encoder and deploy perturbations in the latent space, as this generally yields greater fairness improvements than input-space perturbation across various datasets.
| Adult | Accuracy | $\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | -------------- | ------------- |
| Baseline | 0.8525±0.0010 | 0.1824±0.0114 | 0.1768±0.0411 |
| Latent perturbation | 0.8380±0.0045 | 0.1642±0.0261 | **0.0971±0.0098** |
| Input perturbation | 0.8473±0.0016 | **0.1588±0.0135** | 0.1016±0.0394 |
| COMPAS | Accuracy | $\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | ------------- | ------------- |
| Baseline | 0.6711±0.0049 | 0.2059±0.0277 | 0.3699±0.0597 |
| Latent perturbation | 0.6701±0.0020 | **0.0207±0.0142** | **0.0793±0.0418** |
| Input perturbation | 0.6629±0.0051 | 0.0610±0.0389 | 0.1086±0.0649 |
| German | Accuracy | $\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | ------------- | ------------- |
| Baseline | 0.7800±0.0150 | 0.0454±0.0282 | 0.2096±0.0924 |
| Latent perturbation | 0.7570±0.0024 | **0.0053±0.0064** | **0.0813±0.0110** |
| Input perturbation | 0.7465±0.0067 | 0.0188±0.0106 | 0.1700±0.0400 |
| Drug | Accuracy | $\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | ------------- | ------------- |
| Baseline | 0.6674±0.0096 | 0.2760±0.0415 | 0.4718±0.0838 |
| Latent perturbation | 0.6382±0.0061 | 0.0820±0.0259 | **0.1068±0.0476** |
| Input perturbation | 0.6188±0.0146 | **0.0571±0.0365** | 0.1893±0.0809 |
[8] Guidotti, R., Monreale, A., Matwin, S., & Pedreschi, D. (2020). Black box explanation by learning image exemplars in the latent feature space. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I (pp. 189-205). Springer International Publishing.
[9] Bodria, F., Guidotti, R., Giannotti, F., & Pedreschi, D. (2022, October). Interpretable latent space to enable counterfactual explanations. In International Conference on Discovery Science (pp. 525-540). Cham: Springer Nature Switzerland.
### Q3. Analysis on Synthetic Data
Thank you for highlighting this detail. We provide the details of the synthetic data, illustrating the concept of the unfair region and how the decision boundary is rotated. We simplify the binary classification task with a 2D Gaussian mixture model, as assumed in [10], consisting of two classes $y \in \{0, 1\}$ and two sensitive attributes $A \in \{0, 1\}$ (indicating unprivileged and privileged groups).
\begin{align}
x \sim \begin{cases}
\text{group 1: } \mathcal{N}\!\left(\begin{bmatrix}\mu \\ \mu\end{bmatrix}, \sigma^2\right) & \text{if } y=1, a=1 \\
\text{group 2: } \mathcal{N}\!\left(\begin{bmatrix}\mu \\ \mu^\prime\end{bmatrix}, \sigma^2\right) & \text{if } y=0, a=1 \\
\text{group 3: } \mathcal{N}\!\left(\begin{bmatrix}0 \\ \mu\end{bmatrix}, (K\sigma)^2\right) & \text{if } y=1, a=0 \\
\text{group 4: } \mathcal{N}\!\left(\begin{bmatrix}0 \\ 0\end{bmatrix}, (K\sigma)^2\right) & \text{if } y=0, a=0
\end{cases}
\end{align}
where $\mu^\prime = r\mu$ with $0<r<1$ and $K>1$, and the numbers of samples in the four groups are $N_1 : N_2 : N_3 : N_4$. We arbitrarily set $K=3$, $r=0.7$, $\mu = 1$, $N_1 = N_2 = 100$, and $N_3=N_4=400$.
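For reference, a minimal sketch of how this synthetic dataset can be generated ($\sigma$ is not fixed above, so the value used here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, r, K = 1.0, 0.7, 3.0
sigma = 0.3                                        # illustrative; sigma is not specified above
N1 = N2 = 100
N3 = N4 = 400

def sample(center, scale, n, y, a):
    """Draw n points from an isotropic 2D Gaussian for subgroup (y, a)."""
    x = rng.normal(loc=center, scale=scale, size=(n, 2))
    return x, np.full(n, y), np.full(n, a)

groups = [
    sample([mu, mu],     sigma,     N1, y=1, a=1),   # group 1
    sample([mu, r * mu], sigma,     N2, y=0, a=1),   # group 2
    sample([0.0, mu],    K * sigma, N3, y=1, a=0),   # group 3
    sample([0.0, 0.0],   K * sigma, N4, y=0, a=0),   # group 4
]
X = np.concatenate([g[0] for g in groups])
y = np.concatenate([g[1] for g in groups])
A = np.concatenate([g[2] for g in groups])
```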
From the synthetic data, we observe a decision boundary similar to Figure 2(a) in the paper. Due to the dataset imbalance, the subgroup $\{a=1, y=0\}$ tends to be over-predicted as label $y=1$, and the subgroup $\{a=0, y=1\}$ tends to be under-predicted as label $y=0$. The disparity in misclassification rates is depicted in Figure 2(c). We define these regions with disproportionately high misclassification rates as 'unfair regions.'
Regarding $\delta$, it is a trainable parameter in our framework rather than a hyperparameter. However, in Figure 2(c) only, we manually vary the amount of perturbation $\delta$ from 0 to 0.2 to show its impact by demonstrating how the misclassification rate of each group changes accordingly.
As the fairness evaluation metric $\Delta EOd$ is defined as the sum of the True Positive Rate (TPR) gap and the False Positive Rate (FPR) gap between demographic groups, we plot the TPR gap, the FPR gap, $\Delta EOd$, and the overall misclassification rate. Figure 2(c) shows that both the TPR gap and the FPR gap decrease significantly, indicating a small $\Delta EOd$, with only a minor increase in the overall misclassification rate.
[10] Xu, H., Liu, X., Li, Y., Jain, A., & Tang, J. (2021, July). To be robust or to be fair: Towards fairness in adversarial training. In International conference on machine learning (pp. 11492-11501). PMLR.
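For completeness, a minimal sketch of how the metrics discussed above can be computed from binary predictions ($\Delta DP$ as the gap in positive prediction rates, $\Delta EOd$ as the sum of the TPR and FPR gaps); this is illustrative code, not our exact evaluation script:

```python
import numpy as np

def dp_gap(y_pred, a):
    """Delta DP: gap in positive prediction rates between the two groups."""
    return abs(y_pred[a == 1].mean() - y_pred[a == 0].mean())

def eod_gap(y_true, y_pred, a):
    """Delta EOd: |TPR gap| + |FPR gap| between the two groups
    (assumes each group contains both positive and negative labels)."""
    def rates(group):
        yt, yp = y_true[a == group], y_pred[a == group]
        return yp[yt == 1].mean(), yp[yt == 0].mean()   # TPR, FPR
    tpr0, fpr0 = rates(0)
    tpr1, fpr1 = rates(1)
    return abs(tpr1 - tpr0) + abs(fpr1 - fpr0)
```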
### Q4-1: Goal of Fairness Attack
The goal of the attack is clearly stated in Section 3, from lines 164 to 172:
>"The proposed method aims to automatically discover unfair regions and generate perturbed samples that directly cover these regions with over/underestimated demographic groups for each label, by attacking the fairness constraint. Training on the perturbed latent features results in a rotated decision boundary that reduces the misclassification rate of biased subgroups."
This is closely related to the choice of loss function, as discussed in Section 3.1.
### Q4-2: Why Covariance-Attack?
Indeed, any type of fairness constraint can be used in the attacking step.
For example, we can adopt a convex fairness constraint [11]. We elaborate on the details of the convex fairness constraint in Appendix J and report experimental results in the table below, comparing the baseline, the covariance-based fairness attack (proposed in the paper), and the convex fairness attack. The experiment shows that our method can adopt either type of fairness constraint during the attack step, with both variants improving fairness.
While our framework is broadly adaptable in the choice of fairness constraint for the attack, we chose the covariance constraint over the convex one because it does not depend on the empirical outputs and admits the clear proofs given in Proposition 3.1 and Theorem 3.2.
| Adult | Accuracy | $\Delta DP$ | $\Delta EOd$ |
| --- | --- | --- | --- |
| Logistic (Baseline) |0.8470±0.0007|0.1829±0.0020|0.1982±0.0077|
| Logistic + ALFA (covariance) |0.8464±0.0004|0.1555±0.0013|**0.0616±0.0022**|
| Logistic + ALFA (convex) |0.8227±0.0026|**0.0852±0.0078**|0.1547±0.0133|
| MLP (Baseline) |0.8525±0.0010|0.1824±0.0114|0.1768±0.0411|
| MLP + ALFA (covariance) |0.8380±0.0045|0.1642±0.0261|0.0971±0.0098|
| MLP + ALFA (convex) |0.8324±0.0031|**0.1400±0.0166**|**0.0904±0.0184**|
| COMPAS | Accuracy | $\Delta DP$ | $\Delta EOd$ |
| --- | --- | --- | --- |
| Logistic (Baseline) |0.6578±0.0034|0.2732±0.0129|0.5319±0.0245|
| Logistic + ALFA (covariance) |0.6682±0.0040|**0.0210±0.0167**|**0.0931±0.0323**|
| Logistic + ALFA (convex) |0.6740±0.0034|0.0470±0.0180|0.1444±0.0379|
| MLP (Baseline) |0.6711±0.0049|0.2059±0.0277|0.3699±0.0597|
| MLP + ALFA (covariance) |0.6701±0.0020|0.0207±0.0142|0.0793±0.0418|
| MLP + ALFA (convex) |0.6624±0.0010|**0.0130±0.0075**|**0.0738±0.0150**|
| German | Accuracy | $\Delta DP$ | $\Delta EOd$ |
| --- | --- | --- | --- |
| Logistic (Baseline) |0.7220±0.0131|0.1186±0.0642|0.3382±0.1268|
| Logistic + ALFA (covariance) |0.7660±0.0189|0.0397±0.0261|0.1596±0.0354|
| Logistic + ALFA (convex) |0.7410±0.0130|**0.0240±0.0179**|**0.1030±0.0360**|
| MLP (Baseline) |0.7800±0.0150|0.0454±0.0282|0.2096±0.0924|
| MLP + ALFA (covariance) |0.7570±0.0024|**0.0053±0.0064**|**0.0813±0.0110**|
| MLP + ALFA (convex) |0.7575±0.0087|0.0181±0.0120|0.1960±0.0079|
| Drug | Accuracy | $\Delta DP$ | $\Delta EOd$ |
| --- | --- | --- | --- |
| Logistic (Baseline) |0.6626±0.0135|0.2938±0.0761|0.5064±0.1616|
| Logistic + ALFA (covariance) |0.6554±0.0067|0.0909±0.0261|**0.1170±0.0255**|
| Logistic + ALFA (convex) |0.6509±0.0072|**0.0596±0.0198**|0.1284±0.0286|
| MLP (Baseline) |0.6674±0.0096|0.2760±0.0415|0.4718±0.0838|
| MLP + ALFA (covariance) |0.6382±0.0104|**0.0820±0.0259**|**0.1068±0.0476**|
| MLP + ALFA (convex) |0.6329±0.0173|0.1002±0.0826|0.1955±0.0956|
[11] Wu, Yongkai, Lu Zhang, and Xintao Wu. "On convexity and bounds of fairness-aware classification." The World Wide Web Conference. 2019.
### Q6-1: Impact of hyperparameters
Eq.(5) and Eq.(6) reflect our intention to retain accuracy while ensuring fairness: the Sinkhorn distance maintains the semantic meaning of the perturbed samples, and training on the perturbed and original samples together also maintains the accuracy level.
We vary $\alpha$, the weight of the Sinkhorn distance, to construct the Pareto frontier, as stated in line 356. Nevertheless, we agree that the impact of $\alpha$ and of using equal weights in Eq.(6) is insufficiently discussed in our paper.
Here, we report a detailed analysis of how each component in Eq.(5) and Eq.(6) affects the fairness-accuracy trade-off by showing 1) the result without the Sinkhorn distance ($\alpha=0$) in Eq.(5), and 2) the result without the original features in Eq.(6).
1) As shown in the table below, maximizing only $L_{fair}$ during the fairness attack (i.e., $\alpha=0$) also improves fairness, but it compromises accuracy. As intended in Section 3.2, the Sinkhorn distance maintains the semantic meaning of the perturbed samples and thus retains accuracy.
| Drug | Accuracy | $\Delta DP$ | $\Delta EOd$ |
| --- | --- | --- | --- |
| Logistic (Baseline) |0.6626±0.0135|0.2938±0.0761|0.5064±0.1616|
| Logistic + ALFA ($\alpha=0$) |0.6395±0.0067|0.0325±0.0244|0.1638±0.0593|
| Logistic + ALFA ($\alpha=10$) |0.6554±0.0067|0.0909±0.0261|0.1170±0.0255|
| MLP (Baseline) |0.6674±0.0096|0.2760±0.0415|0.4718±0.0838|
| MLP + ALFA ($\alpha=0$) |0.6276±0.0092|0.0393±0.0407|0.0691±0.0518|
| MLP + ALFA ($\alpha=10$) |0.6382±0.0104|0.0820±0.0259|0.1068±0.0476|
2) Similar to the Sinkhorn distance, we believe that re-training the classifier solely on perturbed features may negatively impact accuracy. In line with our intuition, training exclusively on perturbed features results in slightly lower accuracy, but it can effectively achieve fairness.
| Drug | Accuracy | $\Delta DP$ | $\Delta EOd$ |
| --- | --- | --- | --- |
| Logistic (Baseline) |0.6626±0.0135|0.2938±0.0761|0.5064±0.1616|
| Logistic + ALFA (only perturbed) |0.6515±0.0070|0.0829±0.0249|0.1237±0.0275|
| Logistic + ALFA (original+perturbed) |0.6554±0.0067|0.0909±0.0261|0.1170±0.0255|
| MLP (Baseline) |0.6674±0.0096|0.2760±0.0415|0.4718±0.0838|
| MLP + ALFA (only perturbed) |0.6340±0.0050|0.0533±0.0313|0.0762±0.0530|
| MLP + ALFA (original+perturbed) |0.6382±0.0104|0.0820±0.0259|0.1068±0.0476|
Consequently, our proposed method is effectively designed to retain accuracy while ensuring fairness.
### Q6-2: Minimizing Fairness Constraint
Minimizing the fairness constraint [12] together with our framework makes it challenging to verify the effectiveness of the proposed data augmentation. Therefore, we don't consider minimizing $L_{fair}$ during the training.
However, we report additional experiments that minimize $L_{fair}$ on top of our framework. As shown in the table below, combining the two methods does not further improve fairness.
| COMPAS | Accuracy | $\Delta DP$ | $\Delta EOd$ |
| --- | --- | --- | --- |
| Logistic (Baseline) |0.6578±0.0034|0.2732±0.0129|0.5319±0.0245|
| Logistic + ALFA |0.6682±0.0040|0.0210±0.0167|0.0931±0.0323|
| Logistic + ALFA + minimizing $L_{fair}$ |0.6701±0.0037|0.0481±0.00431|0.1291±0.0732|
| MLP (Baseline) |0.6711±0.0049|0.2059±0.0277|0.3699±0.0597|
| MLP + ALFA (covariance) |0.6701±0.0020|0.0207±0.0142|0.0793±0.0418|
| MLP + ALFA + minimizing $L_{fair}$ |0.6632±0.0513|0.1242±0.0087|0.0422±0.0474|
Furthermore, the baseline that uses only the fairness constraint is depicted on the Pareto frontier as 'covariance loss' (brown color), and it shows a less significant improvement in fairness compared to ours.
[12] Zafar, Muhammad Bilal, et al. "Fairness constraints: Mechanisms for fair classification." Artificial intelligence and statistics. PMLR, 2017.
# Reviewer 3 (dbYU)
### W1: The Analysis and Justification of Eq.(5) and (6)
Thanks for pointing out a core principle of our paper. Here we provide evidence for the validity of each formulation.
First, the validity of Eq.(5) is justified by Theorem 3.2. Eq.(5) is designed to maximize the fairness constraint, intentionally generating biased features that cover the unfair region. Theorem 3.2 shows that the perturbations that maximize the fairness constraint also increase the gaps in Demographic Parity (DP) and Equalized Odds (EOd). To enlarge the EOd gap, the perturbed features must move in a specific direction, resulting in a rotated decision boundary when the classifier is trained on the perturbed features.
Second, Eq.(6) represents the empirical risk minimization (ERM) applied to both the original and augmented features. While this may be a heuristic approach, it is widely used in the data augmentation literature, such as [1] and [2], for enhancing model performance.
[1] Hsu, C. Y., Chen, P. Y., Lu, S., Liu, S., & Yu, C. M. (2022, June). Adversarial examples can be effective data augmentation for unsupervised machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 6, pp. 6926-6934).
[2] Zhao, L., Liu, T., Peng, X., & Metaxas, D. (2020). Maximum-entropy adversarial data augmentation for improved generalization and robustness. Advances in Neural Information Processing Systems, 33, 14435-14447.
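To make the roles of the two equations concrete, below is a hedged sketch of the two stages in the frozen-encoder setting. The covariance-style term is only a stand-in for the paper's $L_{fair}$, the `geomloss` library is just one possible Sinkhorn implementation, and all names and hyperparameters are illustrative rather than the exact ones used in our experiments.

```python
import torch
import torch.nn as nn
from geomloss import SamplesLoss                  # one possible Sinkhorn implementation

sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

def fairness_attack(z, a, last_layer, alpha=10.0, lr=0.01, steps=10):
    """Eq.(5) flavor: learn a perturbation delta that maximizes a covariance-style
    fairness term while a Sinkhorn penalty (weighted by alpha) keeps z + delta
    close to the original latent features z. `a` is the sensitive attribute (float 0/1)."""
    for p in last_layer.parameters():
        p.requires_grad_(False)                   # the classifier is frozen during the attack
    delta = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        z_adv = z + delta
        scores = last_layer(z_adv).squeeze(-1)
        l_fair = torch.abs(((a - a.mean()) * scores).mean())  # covariance-style stand-in
        loss = -l_fair + alpha * sinkhorn(z_adv, z)            # maximize L_fair, stay close to z
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (z + delta).detach()

def retrain_last_layer(z, z_adv, y, last_layer, lr=1e-3, epochs=100):
    """Eq.(6) flavor: empirical risk minimization on original + perturbed features.
    `y` are float binary labels of shape (N,)."""
    for p in last_layer.parameters():
        p.requires_grad_(True)
    feats, labels = torch.cat([z, z_adv]), torch.cat([y, y])
    opt = torch.optim.Adam(last_layer.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(
            last_layer(feats).squeeze(-1), labels)
        loss.backward()
        opt.step()
```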
### W2: Contribution of each factor to find $\delta$
Thank you for raising this point about the clarity of our framework. To derive the perturbation $\delta$ in Eq.(5) for the fairness attack, we aim to maximize $L_{fair}$ while minimizing the Sinkhorn distance; thus, the impact of the Sinkhorn distance should be analyzed.
We dissect the impact of the Sinkhorn distance on obtaining the perturbation $\delta$ in Section W2-1. Furthermore, in response to the reviewer's inquiry, we elaborate on the distinction between our adversarial augmentation with $L_{fair}$ and the direct minimization of $L_{fair}$ in Section W2-2.
#### W2-1: Contribution of Sinkhorn distance
Indeed, $L_{fair}$ alone during the fairness attack is sufficient to make the decision boundary fairer. However, to prevent the new boundary from hurting accuracy, we utilize the Sinkhorn distance to maintain the semantic meaning of the perturbed samples.
Here, we present a comparison with and without the Sinkhorn distance, reporting the resulting accuracy and fairness rather than the $\delta$ values themselves. Without the Sinkhorn distance, $L_{fair}$ alone still improves fairness, but it compromises accuracy. Therefore, the Sinkhorn distance is not a component that enhances fairness, but it is crucial for retaining accuracy.
| Drug | Accuracy | $\Delta DP$ | $\Delta EOd$ |
| --- | --- | --- | --- |
| Logistic (Baseline) |0.6626±0.0135|0.2938±0.0761|0.5064±0.1616|
| Logistic + ALFA (w/o Sinkhorn) |0.6395±0.0067|0.0325±0.0244|0.1638±0.0593|
| Logistic + ALFA (w/ Sinkhorn) |0.6554±0.0067|0.0909±0.0261|0.1170±0.0255|
| MLP (Baseline) |0.6674±0.0096|0.2760±0.0415|0.4718±0.0838|
| MLP + ALFA (w/o Sinkhorn) |0.6276±0.0092|0.0393±0.0407|0.0691±0.0518|
| MLP + ALFA (w/ Sinkhorn) |0.6382±0.0104|0.0820±0.0259|0.1068±0.0476|
#### W2-2: Minimizing Fairness Constraint
Minimizing the fairness constraint within our framework would make it challenging to verify the effectiveness of our proposed method, so we evaluate our method separately. A baseline that uses only the fairness constraint [3] is depicted on the Pareto frontier as 'covariance loss' (brown color); minimizing only the fairness constraint shows a less significant improvement in fairness compared to ours.
[3] Zafar, Muhammad Bilal, et al. "Fairness constraints: Mechanisms for fair classification." Artificial intelligence and statistics. PMLR, 2017.
### Q1: Iterative Adversarial Training
Thank you for highlighting this. As the reviewer mentioned, Eq.(5) is trained iteratively. This is indirectly mentioned in line 328 of Section 4.2, where we discuss the learning rate of the adversarial stage. However, we did not specify the number of epochs. In reality, it is iteratively optimized for 10 epochs. We will revise the experimental setup section to clarify this.
### Q2: Details about Synthetic Dataset
Thank you for highlighting this detail. We provide the details of the synthetic data, illustrating the concept of the unfair region and how the decision boundary is rotated. We simplify the binary classification task with a 2D Gaussian mixture model, as assumed in [4], consisting of two classes $y \in \{0, 1\}$ and two sensitive attributes $A \in \{0, 1\}$ (indicating unprivileged and privileged groups).
\begin{align}
x \sim \begin{cases}
\text{group 1: } \mathcal{N}\!\left(\begin{bmatrix}\mu \\ \mu\end{bmatrix}, \sigma^2\right) & \text{if } y=1, a=1 \\
\text{group 2: } \mathcal{N}\!\left(\begin{bmatrix}\mu \\ \mu^\prime\end{bmatrix}, \sigma^2\right) & \text{if } y=0, a=1 \\
\text{group 3: } \mathcal{N}\!\left(\begin{bmatrix}0 \\ \mu\end{bmatrix}, (K\sigma)^2\right) & \text{if } y=1, a=0 \\
\text{group 4: } \mathcal{N}\!\left(\begin{bmatrix}0 \\ 0\end{bmatrix}, (K\sigma)^2\right) & \text{if } y=0, a=0
\end{cases}
\end{align}
where $\mu^\prime = r\mu$ with $0<r<1$ and $K>1$, and the numbers of samples in the four groups are $N_1 : N_2 : N_3 : N_4$. We arbitrarily set $K=3$, $r=0.7$, $\mu = 1$, $N_1 = N_2 = 100$, and $N_3=N_4=400$.
We will revise the appendix to provide details of the synthetic data.
[4] Xu, H., Liu, X., Li, Y., Jain, A., & Tang, J. (2021, July). To be robust or to be fair: Towards fairness in adversarial training. In International conference on machine learning (pp. 11492-11501). PMLR.
### Q3: Rationalizing the Piecewise Linear Approximation
In our implementation, the inverse sigmoid function can be approximated by a piecewise linear function, eliminating any issues related to infinity.
It is true that the logit function can produce infinite values. Indeed, this is why we employ a piecewise linear approximation. Without this approximation, highly confident samples (e.g., $p(y) \approx 1$) could disproportionately influence the overall loss value during the adversarial attack, potentially leading to suboptimal optimization. By approximating the logit function as piecewise linear, we can achieve more stable optimization of the fairness attack. This is done by adjusting the sigmoid output range from (0,1) to $(\beta, 1-\beta)$, where $\beta$ is a very small value ($10^{-7}$). As defined in line 238, this adjustment results in a logit value range of (-16.1181, 16.1181), preventing extreme values from dominating the loss calculation.
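A minimal sketch of the clipping described above (our own illustrative code, not the exact implementation):

```python
import numpy as np

BETA = 1e-7                                      # beta = 10^-7, as stated above

def clipped_logit(p):
    """Logit with the sigmoid output restricted to (beta, 1 - beta), so the
    result stays within roughly (-16.1181, 16.1181) instead of diverging."""
    p = np.clip(p, BETA, 1.0 - BETA)
    return np.log(p / (1.0 - p))

print(clipped_logit(np.array([0.0, 0.5, 1.0])))  # -> [-16.118...,  0.0,  16.118...]
```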

### Limitation: Adaptability to Large-Scale Datasets
Our framework can be applied to large-scale datasets. To illustrate this adaptability, we employed the Wikipedia toxicity classification dataset, an NLP dataset consisting of over 100,000 comments from the English Wikipedia, as introduced in Appendix I.4. We highlight the results in the table below:

# Reviewer 4 (CNDR)
### W1: Improving Readability of Comparison Methods
Thank you for highlighting the issue of readability regarding the comparison methods. While these methods are detailed in the appendix, we will revise the paper to provide a brief introduction to them earlier, incorporating this information into Section 2 for better clarity.
### W2: Enhancing Clarity in Result Analysis
We will revise the text in Sections 4.4.1 and 4.4.2 to better demonstrate the effectiveness of our method and provide more details about the comparison methods. Additionally, we will update the captions of Figures 3, 4, and 5 to highlight the consistent improvements achieved through our approach.
### Q1: Addressing the Trade-off in the Adult Dataset with MLP
An extension of our framework, which employs the same strategy in the input space, can address the issue of correctly predicted samples lying in unfair regions. Here, we elaborate on the issue and the solution.
- In Figures 3 and 4, the Adult dataset with MLP is the only setting in which our proposed method (ALFA) does not achieve the best results. We suspect that the MLP encoder may already extract a mixed representation of misclassified privileged and correctly classified unprivileged samples, which makes it challenging to define the unfair region. In this case, relying solely on latent perturbation, an accuracy-fairness trade-off is likely to occur, since our method cannot enhance the encoder's ability to distinguish between these two sets of samples.
- However, the adaptability of our method, through a fairness attack and perturbation in the input space, offers an alternative way to mitigate this trade-off. In this case, the perturbation is deployed on the input features, and the entire network is fine-tuned with the perturbed data. Specifically, on the Adult dataset with MLP, input perturbation exhibits a better trade-off than latent perturbation, as shown below. Therefore, this modification could potentially resolve the issue.
| Adult | Accuracy | $\Delta DP$ |$\Delta EOd$ |
| ------------------- | ------------- | -------------- | ------------- |
| Baseline | 0.8525±0.0010 | 0.1824±0.0114 | 0.1768±0.0411 |
| Latent perturbation | 0.8380±0.0045 | 0.1642±0.0261 | **0.0971±0.0098** |
| Input perturbation |**0.8473±0.0016** | **0.1588±0.0135** | 0.1016±0.0394 |
### Limitation: Mitigating Data Imbalance in Multi-Class Classification
In response to the reviewer's comment, we recognize that the number of classes in multi-class classification may lead to data imbalance in the one-to-all strategy discussed in Appendix A. To address this concern, we employ an upsampling strategy to equalize the number of samples in each subgroup, as outlined in line 205. By ensuring that each class is equally represented, this effectively mitigates the data imbalance issue, enhances fairness, and improves performance in multi-class classification scenarios. We believe this strategy provides a practical solution to the data imbalance problem in the one-to-all setting.
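As an illustration only, a hypothetical sketch of subgroup upsampling in this spirit (the helper name and exact resampling rule below are our own, not the paper's implementation):

```python
import numpy as np

def upsample_subgroups(X, y, a, seed=0):
    """Resample each (label, sensitive-attribute) subgroup with replacement up to
    the size of the largest subgroup, so every subgroup is equally represented."""
    rng = np.random.default_rng(seed)
    subgroups = {(yi, ai) for yi, ai in zip(y.tolist(), a.tolist())}
    idx = {k: np.where((y == k[0]) & (a == k[1]))[0] for k in subgroups}
    target = max(len(v) for v in idx.values())
    take = np.concatenate([rng.choice(v, size=target, replace=True) for v in idx.values()])
    return X[take], y[take], a[take]
```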