# <Reviewer 1>
We appreciate your thoughtful review and the time and effort you have dedicated to evaluating our work. We have carefully considered your comments and criticisms. Please refer to our itemized responses addressing each of your concerns:
# W1: Technical novelties of our fairness loss
> (W1) While I found the paper to be technically solid, it appears to employ only existing methods simultaneously to address the issue of user-side fairness in dynamic recommendation.
Note that we address three challenges in the context of ensuring user-side fairness in dynamic recommendation scenarios: (C1) distribution shifts, (C2) frequent model updates, and (C3) the non-differentiability of ranking metrics, compounded by the time-inefficiency and gradient-vanishing issues of existing soft ranking methods such as NeuralNDCG. We tackle them through both our fairness loss and our fine-tuning model update strategy: the fairness loss addresses (C2) and (C3), and fine-tuning addresses (C1) and (C2).
We would like to clarify that our fairness loss is not just a combination of existing methods; it introduces two novel aspects, namely a **fast and effective soft ranking metric** and a **fairness loss without the absolute value**:
1. We propose a new soft ranking metric, differentiable hit, which is not only effective but also lightweight compared to metrics like NeuralNDCG (a minimal illustrative sketch is given after this list). This characteristic makes it particularly suitable for our dynamic setting where the model requires frequent updates over time (C2 & C3). For example, Figure 11 shows the effectiveness of differentiable hit in reducing PD, and its runtime (3.05s) is significantly shorter than that of NeuralNDCG (12.15s).
2. We propose the fairness loss without the absolute value on DPD (i.e., $L_\text{fair}$), which overcomes the instability of the naive fairness loss with absolute DPD (i.e., $L_\text{fair-abs}$) by leveraging the competing nature of the recommendation and fairness losses (Proposition 3.3). Theoretically, the gradient of $L_\text{fair-abs}$ equals $\text{sign}(\text{DPD})$ multiplied by the gradient of $L_\text{fair}$ (Eq. (69)). Thus, the factor $(1-2\sigma(\text{DPD}(W_t)))$ in Eq. (73) changes into $(1-2\,\text{sign}(\text{DPD}(W_t))\,\sigma(\text{DPD}(W_t)))$ when using $L_\text{fair-abs}$. In other words, due to the absolute value operation, the gradient of $L_\text{fair-abs}$ is not symmetric around DPD = 0 and favors the advantaged group more, whereas the gradient of $L_\text{fair}$ is symmetric around DPD = 0, which is better for achieving fairness. Empirically, we observed that FADE-Abs, which uses $L_\text{fair-abs}$, is sub-optimal.
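For intuition, here is a minimal, hypothetical sketch of a sigmoid-based soft Hit@K in the spirit of point 1; the function name, signature, and temperature parameter are placeholders of ours, and the exact formulation of differentiable hit is the one given in the paper.

```python
import torch

def soft_hit_at_k(pos_score, neg_scores, k, tau=1.0):
    """Hypothetical sigmoid-based relaxation of Hit@K (illustration only).

    pos_score:  0-dim tensor, model score of the user's held-out positive item.
    neg_scores: 1-dim tensor, model scores of the sampled candidate items.
    Returns a value in (0, 1) that is differentiable w.r.t. the scores and
    approaches the hard Hit@K indicator as tau -> 0.
    """
    # Soft rank of the positive item: 1 + soft count of candidates scoring higher.
    soft_rank = 1.0 + torch.sigmoid((neg_scores - pos_score) / tau).sum()
    # Soft indicator of "rank <= k".
    return torch.sigmoid((k - soft_rank) / tau)
```

A relaxation of this form needs only one sigmoid per candidate item and no soft sorting or permutation matrices (as NeuralNDCG requires), which is consistent with the runtime gap reported above (3.05s vs. 12.15s).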
# W2: Further clarification of our theory
> (W2) It is commendable that the authors provide a generalization error analysis in a complex setting. However, my understanding is that the paper deals with a dynamic but not "sequential" context.
> Also, I'm not quite clear why the authors evaluate only the error bound up to t_te-1 instead of the regret bound.
## Regarding sequential context of our theory
Regarding “sequential” vs. “repeated learning”, we would like to clarify that Assumption 1 only assumes the data tuples in $D_t$ are independent; because of Assumption 4 (proximal fine-tuning), our Theorem 3.1 has a sequential nature. Specifically, in Theorem 3.1, the factor $\gamma$ precisely characterizes the **forgetting** phenomenon in **sequential** fine-tuning, while our Theorem 3.2 is more akin to “repeated learning” because there is no forgetting in the setting it describes.
In fact, how to properly characterize forgetting is one of the **unique** technical challenges in analyzing sequential fine-tuning. Please see our response below for elaboration.
Specifically, the discussion below covers (1) the nontriviality of our conclusions and (2) our error bound vs. the regret bound.
## Key contributions of our theory
Thank you for acknowledging our theory. Our theory exhibits several novel and nontrivial aspects. We will add this discussion in the revised version.
1. **Characterizing forgetting via proximity:** Properly characterizing forgetting in sequential fine-tuning without sacrificing generality is a nontrivial task. Existing works on forgetting (e.g., [1,2]) impose strong structural assumptions on the model architecture, the loss function, and the learning algorithm, rendering their results inapplicable to our setting. Meanwhile, proximal fine-tuning is a classic assumption for one-stage fine-tuning, but no previous work has recognized its connection to forgetting. We successfully established this connection through a fine-grained analysis of the influence of proximity over time, ultimately yielding a more general theory without relying on strong structural assumptions.
2. **A tighter and more general generalization bound:** The classic generalization bound [3] w.r.t. discrepancy distance includes a term dependent on the VC-dimension or the Rademacher complexity, which are defined only for classification or regression models, not for ranking models. This dependence arises because their approach relies on proving uniform stability. In contrast, our analysis leverages the sub-Gaussian property, allowing us to sharpen the analysis and eliminate the dependence on VC-dimension or Rademacher complexity. This refinement results in a tighter generalization bound that is also general enough to handle ranking models.
# W3: Conference selection rationale
> (W3) The analysis in Section 3.1 seems more applicable to a general machine-learning setting than the specific setting of this paper. Therefore, if the paper claims novelty or nontriviality of this theory, I believe it should be judged in an ML-focused conference (e.g., NeurIPS, ICML, ICLR).
We chose to submit to WebConf because our primary objective is to address the realistic problem of dynamic user-side fairness and design a fair dynamic recommendation system, which aligns more closely with the themes addressed by WebConf than with those of general machine learning conferences. Specifically, we tackle the novel problem where user-side unfairness tends to persist or worsen over time. Additionally, we introduce a novel method, FADE, an end-to-end framework employing incremental fine-tuning and our proposed fairness loss. While the theoretical analyses play a crucial role in justifying our choice of fine-tuning, our contributions extend beyond the theory.
We would like to note that our track is titled **"user modeling and recommendation"** and our main topic is **"fairness-aware retrieval and ranking"**, which is the relevant sub-topic stated on the WebConf webpage.
Furthermore, several relevant WebConf papers [4,5,6,7] are cited in our submitted paper, including the one selected as our competitor [4].
# Q1
> (Q1) Is there a comment about the above weaknesses from the authors' perspective?
We have diligently addressed each of your concerns. Please refer to our responses above.
## References
[1] Doan et al. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. AISTATS, 2021.
[2] Lin et al. Theory of forgetting and generalization of continual learning. arXiv:2302.05836, 2023.
[3] Mansour et al. Domain adaptation: learning bounds and algorithms. arXiv:0902.3430, 2009.
[4] Islam et al. Debiasing career recommendations with neural fair collaborative filtering. WebConf, 2021.
[5] Li et al. User-oriented fairness in recommendation. WebConf, 2021.
[6] Wu et al. Learning fair representations for recommendation: A graph-based perspective. WebConf, 2021.
[7] Zheng et al. Disentangling user interest and conformity for recommendation with causal embedding. WebConf, 2021.
# <Reviewer 2>
We appreciate your thoughtful review and the time and effort you have dedicated to evaluating our work. We have carefully considered your comments and criticisms. Please refer to our itemized responses addressing each of your concerns.
# Con 1: Results with different recommendation list lengths
> 1) Only F1/NDCG@20 results are reported, more results with different lengths should be included. Also, only one fairness metric has been reported.
We have conducted new experiments and report the overall performance and PD w.r.t. NDCG@50 and F1@50 of all compared methods, using the Matrix Factorization (MF) backbone recommendation model in Task-R on the MovieLens dataset. The results show a similar trend to those w.r.t. NDCG@20 and F1@20 (reported in the leftmost panel of Figure 2): FADE leads to a substantial reduction in PD with only a modest impact on overall performance.
| | NDCG@50 | PD w.r.t. NDCG@50 | F1@50 | PD w.r.t. F1@50 |
| ------------- | ------- | ----------------- | ----- | --------------- |
| Adver | 0.833 | 0.017 | 0.425 | 0.019 |
| Rerank | 0.779 | 0.021 | 0.357 | 0.016 |
| Pretrain | 0.738 | 0.008 | 0.366 | 0.011 |
| Retrain | 0.818 | 0.017 | 0.420 | 0.025 |
| Finetune | 0.834 | 0.014 | 0.421 | 0.021 |
| Pretrain-Fair | 0.746 | 0.006 | 0.365 | 0.005 |
| Retrain-Fair | 0.824 | 0.006 | 0.417 | 0.004 |
| FADE-Abs | 0.837 | 0.008 | 0.421 | 0.017 |
| FADE (Ours) | 0.837 | 0.006 | 0.414 | 0.005 |
Regarding other fairness metrics such as individual fairness or counterfactual fairness, building a model that simultaneously ensures different types of fairness is challenging and constitutes a separate line of fairness research; our current work does not specifically focus on this direction.
# Con 2: Results of competitors at each time period
> 2) Clarification is needed regarding whether the baselines also periodically update their model parameters. If they do, it would be helpful to see these results in a graphical format. If not, the basis for comparison may not be equitable.
In lines 594-596, we mentioned that our fairness-aware competitors, Adver and Re-rank, are implemented with a **fine-tuning** strategy for a fair comparison, even though they were not originally designed for dynamic scenarios. To better answer the reviewer's question, we report their performance and PD w.r.t. NDCG@20 and F1@20 **at each time period**, using the same setting as Figure 2: the MF backbone recommendation model, tested in Task-R on the MovieLens dataset.
The results show that FADE consistently outperforms the other methods in terms of reducing PD at nearly every time period. In terms of overall performance, FADE is slightly better than or comparable to Adver and significantly outperforms Re-rank.
| PD w.r.t. NDCG@20 | t=1 | t=2 | t=3 | t=4 | t=5 | t=6 |
| ----------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| Rerank | 0.010 | 0.019 | 0.008 | 0.033 | 0.031 | 0.013 |
| Adver | 0.007 | 0.021 | 0.011 | 0.024 | 0.017 | 0.020 |
| FADE | 0.002 | 0.013 | 0.000 | 0.004 | 0.002 | 0.004 |
| PD w.r.t. F1@20 | t=1 | t=2 | t=3 | t=4 | t=5 | t=6 |
| --------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| Rerank | 0.021 | 0.018 | 0.012 | 0.020 | 0.015 | 0.002 |
| Adver | 0.006 | 0.012 | 0.014 | 0.019 | 0.021 | 0.024 |
| FADE | 0.001 | 0.002 | 0.000 | 0.003 | 0.003 | 0.003 |
| NDCG@20 | t=1 | t=2 | t=3 | t=4 | t=5 | t=6 |
| ------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| Rerank | 0.851 | 0.841 | 0.825 | 0.806 | 0.703 | 0.620 |
| Adver | 0.851 | 0.843 | 0.843 | 0.841 | 0.833 | 0.824 |
| FADE | 0.856 | 0.859 | 0.855 | 0.854 | 0.833 | 0.820 |
| F1@20 | t=1 | t=2 | t=3 | t=4 | t=5 | t=6 |
| --------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| Rerank | 0.339 | 0.335 | 0.329 | 0.316 | 0.278 | 0.243 |
| Adver | 0.344 | 0.341 | 0.333 | 0.333 | 0.333 | 0.328 |
| FADE | 0.338 | 0.340 | 0.332 | 0.330 | 0.326 | 0.316 |
# Con 3: Fairness loss with (FADE-Abs) vs. without (FADE) the absolute value on DPD
> 3) Can you explain the reason that fade-abs performs much worse than fade?
Theoretically, the gradient of $L_\text{fair-abs}$ equals $\text{sign}(\text{DPD})$ multiplied by the gradient of $L_\text{fair}$ (Eq. (69)). Thus, the factor $(1-2\sigma(\text{DPD}(W_t)))$ in Eq. (73) changes into $(1-2\,\text{sign}(\text{DPD}(W_t))\,\sigma(\text{DPD}(W_t)))$ when using $L_\text{fair-abs}$. In other words, due to the absolute value operation, the gradient of $L_\text{fair-abs}$ is not symmetric around DPD = 0 and favors the advantaged group more, whereas the gradient of $L_\text{fair}$ is symmetric around DPD = 0, which is better for achieving fairness. Empirically, we observed that FADE-Abs, which uses $L_\text{fair-abs}$, is sub-optimal.
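For readability, the argument above can be summarized schematically (using the paper's notation; the precise statements are Eqs. (69) and (73)):

$$
\nabla_{W_t} L_{\text{fair-abs}}(W_t) = \text{sign}\big(\text{DPD}(W_t)\big)\, \nabla_{W_t} L_{\text{fair}}(W_t),
\quad\text{so}\quad
\big(1 - 2\sigma(\text{DPD}(W_t))\big) \;\longrightarrow\; \big(1 - 2\,\text{sign}(\text{DPD}(W_t))\,\sigma(\text{DPD}(W_t))\big).
$$

In particular, the magnitude of the fairness update is the same at DPD $=+x$ and DPD $=-x$ under $L_\text{fair}$, but not under $L_\text{fair-abs}$, which is the asymmetry referred to above.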
# Questions
> Line 361, “We will define Lfair in §4.3.” should be “We will define Lfair in §3.4.”
Thank you for pointing out the incorrect reference in line 361; it should indeed be §3.4. We will correct it in the revised version.
# <Reviewer 3>
We appreciate your thoughtful review and the time and effort you have dedicated to evaluating our work. Also, thank you for recommending relevant papers to strengthen our work. [1] has an interesting idea for improving item-side group fairness via adaptive adjustment of the group-level negative sampling distribution. [2], though focused on individual fairness in graph learning (i.e., a different focus from our group fairness), is similar to our work in that it addresses the problem from a ranking perspective. We will add these references in the revised version.
# Questions
> I am not convinced by the reproducibility or resource availability of this work, so if the authors can provide more guidance on that, that would be appreciated.
Regarding reproducibility, we are sharing the code for FADE via the following link:
https://anonymous.4open.science/r/fade-14BE.
Reproducing our results is straightforward using the provided scripts; detailed instructions are available in the README file. Additionally, both datasets used in the paper are accessible through the link specified in our paper. We will include the link to the code in the revised version.
## References
[1] Chen et al. Fairly adaptive negative sampling for recommendations. WebConf, 2023.
[2] Dong et al. Individual fairness for graph neural networks: A ranking based approach. KDD, 2021.
# <Reviewer 4>
We appreciate your thoughtful review and the time and effort you have dedicated to evaluating our work. We have carefully considered your comments and criticisms. Please refer to our itemized responses addressing each of your concerns:
# Q1: Comparison with relevant literature
> (1) It appears that the initial work may be making an overclaim in its attempt to address dynamic user fairness [1,2,3].
> While some of the existing research may not explicitly focus on user fairness, it is imperative to recognize that the differential privacy (DP) metric and feature shift, though seemingly distinct, share similarities and warrant a comparative analysis in evaluating the overall fairness implications of a model.
We appreciate your guidance on relevant and recent papers that can make our paper more solid. Our work addresses user-side fairness (i.e., performance parity) in dynamic/online scenarios where new interactions are continually generated/collected and the recommendation model needs to be periodically updated with those new data. While the points you raised are all reasonable and insightful, we’d like to note that those papers have distinct differences from our work in terms of the type of dynamic setting and/or the type of fairness they deal with. We plan to clarify the relationship and difference between our work and these references in the revised version. In particular,
* [1] does consider changes in item attributes or group labels due to newly observed user feedback over time (i.e., changing item popularity and thus the popularity group to which an item belongs). However, they focus on ensuring item-side exposure fairness between item groups with different levels of popularity, so their method is not directly comparable with ours.
* [2] does focus on the long-term effects of utility-fairness interventions between different demographic groups in a social network. However, the differences between our study and theirs are fundamental:
  1. They study the impact of adding fairness interventions on the state of a social network (i.e., the difference in average network sizes between user groups), unlike our focus on performance disparity (PD).
  2. Their recommendation problem is link recommendation in a *unipartite* social network (e.g., friend recommendation on Facebook), while our focus is on item recommendation in a user-item *bipartite* graph.
  3. Their relevance scores between users are not based on learned model parameters but rely on a linear combination of several *unipartite network properties*, such as the size of a user's current network and users' common neighbors (i.e., triadic closure), which cannot be inferred from bipartite networks. For these reasons, this work is not directly comparable with ours.
* [3] is the closest literature to our work, as they impose user-side performance parity in the training phase and use IPS-based ideas to ensure that the learned fairness remains consistent over all items in the test phase. However, their model is designed for static recommendation settings and evaluated on a single random training/test split. In other words, they do not consider the incremental update setting.
To better answer the reviewer's question, we have conducted additional experiments by adapting the method in [3] to our setting, i.e., periodically retraining the model at each time period. Both this method and FADE use the Matrix Factorization backbone recommendation model and are tested in Task-R on the MovieLens and ModCloth datasets.
|                       | FADE (MovieLens) | [3] (MovieLens) | FADE (ModCloth) | [3] (ModCloth) |
| --------------------- | ---------------- | --------------- | --------------- | -------------- |
| NDCG                  | 0.846            | 0.844           | 0.270           | 0.265          |
| Disparity w.r.t. NDCG | 0.004            | 0.005           | 0.047           | 0.087          |
| F1                    | 0.330            | 0.341           | 0.087           | 0.086          |
| Disparity w.r.t. F1   | 0.002            | 0.006           | 0.020           | 0.032          |
The results show that FADE is more effective in reducing PD in all cases, while the overall performance of the two methods is quite comparable. We suspect the sub-optimal PD reduction of the baseline is due to its naive loss design based on absolute DPD and its retraining-based model update strategy.
Moreover, since their method is based on retraining over the full history, it requires significantly more runtime **(FADE: 4.08s vs. [3]: 7232.23s)**, measured in the same setting as Table 2 in the submitted paper. Their per-epoch computation also takes more time than ours due to the additional IPS-related computation, even when compared under the same retraining update setting **(Retrain-Fair: 1401.18s vs. [3]: 7232.23s)**. In addition, they have significantly more learnable parameters: extra user/item embeddings for IPS and two neural networks for the two user groups.
Thank you once more for your guidance regarding relevant and recent papers. We will incorporate this discussion into the revised version.
## Regarding differential privacy (DP) vs. feature shifts
Thank you for highlighting the connection between DP and feature shift. DP settings consider two neighboring datasets D1 and D2 that differ slightly, which can be viewed as a small distribution shift between D1 and D2. We believe DP+fairness is an open problem being actively studied in the literature [4,5], including applications to graph learning [5]. We acknowledge this open problem and leave DP+fairness in dynamic recommendation settings for future work.
# Q2: Novelty and generality of our proposed theory
> (2) I carefully examine the Proof of Theorem 3.1 and 3.2, however, I cannot find the relationship between fairness and the theorems. If I miss the correlation, please point out which lines in the proof support the correlation
We introduce a new theory because existing theories [6,7] are not general enough to handle the recommendation and fairness losses. We would like to clarify that our theorems are not specifically tailored to the recommendation and fairness losses but are general enough to encompass them (the analysis concerns the combined recommendation+fairness loss, as indicated in lines 266-267 and in Eq. (7), $L = L_\text{rec} + \lambda L_\text{fair}$).
Classic theories on distribution shift often impose overly strong assumptions on the model class. For example, [6] and [7] rely on the VC-dimension/Rademacher complexity, which are defined only for classification/regression models where the loss function is a sum over i.i.d. samples. However, our losses are based on user-item interactions, where each interaction depends on both the user and the item, and the fairness loss is a single term w.r.t. DPD rather than a sum over samples. These characteristics render classic theories inapplicable to our setting. One of our key novelties is that our theory removes the dependence on VC-dimension/Rademacher complexity by carefully exploiting the sub-Gaussian property. In this way, we generalize the theory to a broader class of models that includes both the recommendation and fairness losses, and our new theory is thus able to guide the design of our method.
# Q3: Novelties of our fairness loss and incremental fine-tuning
> (3) If the problem (2) has a correlation, how is the method of BPR-related loss and differentiable hit well to solve the unfairness of the fine-tuning phase? It seems they are an independent part. How does your method well address dynamic user-fairness uniquely?
We address three challenges ((C1) distribution shifts, (C2) frequent model updates, and (C3) the non-differentiability of ranking metrics, compounded by the time-inefficiency and gradient-vanishing issues of existing soft ranking methods such as NeuralNDCG) in the context of ensuring user-side fairness in dynamic recommendation scenarios, through both our fairness loss and our fine-tuning model update strategy. Specifically, our fairness loss addresses (C2) and (C3), and fine-tuning addresses (C1) and (C2), as detailed below. Note that our fairness loss, by itself, does not address distribution shifts over time, as the reviewer correctly pointed out; however, when it is trained under the fine-tuning strategy, the model can dynamically impose fairness regularization over time without being influenced by distribution shifts as significantly as retraining is.
For the fairness loss, we introduce two novel aspects:
1. We propose a new soft ranking metric, differentiable hit, which is not only effective but also lightweight. This characteristic makes it particularly suitable for our dynamic setting where the model requires frequent updates over time (C2 & C3). For example, Figure 11 shows the effectiveness of differentiable hit in reducing PD, and its runtime (3.05s) is significantly shorter than that of NeuralNDCG (12.15s).
2. We propose the fairness loss without the absolute value on DPD (i.e., $L_\text{fair}$), which overcomes the instability of the naive fairness loss with absolute DPD (i.e., $L_\text{fair-abs}$) by leveraging the competing nature of the recommendation and fairness losses (Proposition 3.3). Theoretically, the gradient of $L_\text{fair-abs}$ equals $\text{sign}(\text{DPD})$ multiplied by the gradient of $L_\text{fair}$ (Eq. (69)). Thus, the factor $(1-2\sigma(\text{DPD}(W_t)))$ in Eq. (73) changes into $(1-2\,\text{sign}(\text{DPD}(W_t))\,\sigma(\text{DPD}(W_t)))$ when using $L_\text{fair-abs}$. In other words, due to the absolute value operation, the gradient of $L_\text{fair-abs}$ is not symmetric around DPD = 0 and favors the advantaged group more, whereas the gradient of $L_\text{fair}$ is symmetric around DPD = 0, which is better for achieving fairness.
For fine-tuning, we provide theoretical analyses of the generalization error in Theorems 3.1 & 3.2, demonstrating that fine-tuning can exponentially reduce the influence of distribution shift, whereas retraining is more susceptible to such shifts. Guided by this, we can expect fine-tuning to effectively address (C1) while imposing fairness regularization at each time period. Moreover, our fine-tuning naturally addresses (C2) because it trains the model using only the data observed at the current time period, rather than all historical data (as evidenced by the runtime difference in Table 2); a minimal sketch of this update loop is given below.
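To make the update strategy concrete, below is a minimal sketch of one incremental update, assuming a PyTorch-style model; the function names, data format, and hyperparameter values are placeholders for illustration, not our exact implementation.

```python
import torch

def incremental_update(model, data_loader_t, rec_loss_fn, fair_loss_fn,
                       lam=0.1, lr=1e-3, n_epochs=1):
    """Illustrative sketch of the fine-tuning step at time period t.

    The model arrives with the parameters learned up to period t-1 (W_{t-1})
    and is updated only on D_t, the interactions observed in the current
    period, using the combined objective L = L_rec + lambda * L_fair (Eq. (7)).
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        for batch in data_loader_t:          # D_t only; no historical replay
            loss = rec_loss_fn(model, batch) + lam * fair_loss_fn(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model                              # serves as W_t for period t+1
```

Because each update touches only $D_t$, the per-period training cost stays roughly constant over time, unlike retraining, whose cost grows with the accumulated history (cf. Table 2).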
# Q4: Further comparison with other baselines
> (4) As you mentioned, baselines Adver's primary focus is not on reducing the performance disparity among different user groups. More baselines for addressing feature shift should be added such as the IPS[3] method or some user-fairness-aware methods.
Following the reviewer's suggestion, we have conducted additional experiments comparing FADE with the IPS-based method [3]. Please refer to our response to Q1 above.
# Q5: Reason for using DPD in training and PD in the test phase
> (5) Why change the metric from DPD to PD in the experiment?
Our ultimate goal is to reduce PD. However, PD is not differentiable due to the non-differentiability of the ranking/sorting operations in recommendation metrics. Thus, we propose to optimize DPD, a differentiable surrogate, during training, while PD remains the evaluation metric at test time.
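Schematically (group notation is ours for illustration; the precise definitions are in the paper):

$$
\text{PD}(W_t) = \big|\, \mathrm{M}_{G_1}(W_t) - \mathrm{M}_{G_0}(W_t) \,\big|,
\qquad
\text{DPD}(W_t) = \widetilde{\mathrm{M}}_{G_1}(W_t) - \widetilde{\mathrm{M}}_{G_0}(W_t),
$$

where $\mathrm{M}_{G}$ denotes a hard top-$K$ metric (e.g., NDCG@20) averaged over user group $G$, and $\widetilde{\mathrm{M}}_{G}$ is its differentiable surrogate built on differentiable hit. The hard metric involves sorting and is therefore used only for evaluation, while the surrogate is what the gradient flows through during training.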
# Q6: Further discussion of relevant works
> (6) In lines 891-892, there is also some work to regard the DP of utilities metric as item fairness. It is better to further address the fairness concept.
[8] aims to balance the trade-off between item-side exposure fairness and user-side individual utility fairness, which differs from our user-side group fairness (i.e., performance disparity). Specifically, while ensuring exposure fairness, they require the resulting reduction in recommendation utility to be allocated equally across users.
We appreciate you pointing out this work, and we will add this discussion in the revised version.
## References
[1] Ge et al. Towards long-term fairness in recommendation. WSDM, 2021.
[2] Akpinar et al. Long-term dynamics of fairness intervention in connection recommender systems. AIES, 2022.
[3] Tang et al. When fairness meets bias: A debiased framework for fairness-aware top-N recommendation. RecSys, 2023.
[4] Fioretto et al. Differential privacy and fairness in decisions and learning tasks: A survey. IJCAI, 2022.
[5] Dai and Wang. Learning fair graph neural networks with limited and private sensitive attribute information. IEEE TKDE, 2022.
[6] Ben-David et al. Analysis of representations for domain adaptation. NIPS, 2006.
[7] Mansour et al. Domain adaptation: learning bounds and algorithms. arXiv:0902.3430, 2009.
[8] Wu et al. TFROM: A two-sided fairness-aware recommendation model for both customers and providers. SIGIR, 2021.