# General Response:
We thank all reviewers for their valuable feedback and insightful questions. We are particularly encouraged that they consider the proposed method novel and interesting for FL (SFXA, JNzR, iEGY), the empirical evaluation solid and extensive (z1xW, J3o4), the performance state-of-the-art (9hAu, NfEV), and the paper well-written (9hAu, z1xW, NfEV, J3o4).
We address individual questions of reviewers in separate responses. We have also uploaded a modified version of our paper based on the reviewers' suggestions. The major changes are highlighted in blue in the paper. In particular, we added more explanation of the convex relaxation (9hAu, J3o4), a description of the computational cost in continuous action spaces (NfEV), and a comparison to prior work (9hAu, J3o4) in Section 4. We added Appendix E, which discusses the limitations and social impacts of this work in detail (z1xW, J3o4).
We greatly appreciate all reviewers' time and effort in reviewing our paper. We hope that our paper updates and responses have addressed all questions and concerns. Please let us know if there are further questions.
Paper ABCD Authors
# Reply to Reviewer SFXA (rating 6, confidence 3):
**General Response:** We thank the reviewer for appreciating the contributions and recommending acceptance. The detailed replies are as follows.
> Weakness: Although I do appreciate the solid amount of experiments done by the authors, it seems that only experiments on small-scale datasets (e.g. MNIST/CIFAR10) are done. The proposed method's ability to scale to larger datasets are yet to be shown. I encourage the authors to try some larger scale datasets such as TinyImageNet, etc (optional).
**Reply to Weakness:** Given limited computation time and resources, our experiments are restricted to smaller datasets, which is comparable to recent work in the area. We agree that evaluation on a larger dataset would be more representative of the broader impact of our research; we leave this to future work.
>**Question 1:** In Fig. 1(b), it seems that when $\gamma$ is in $[0,0.05]$, there is no trade-off between local accuracy and global accuracy. Can the authors please provide some intuitive explanation on this phenomenon?
**Reply to Question 1:** Observe that when $\gamma_i$ is very small, consensus is enforced exactly. This means individual agents have no ability to exploit aspects of their data that are distinctly local. Experimentally, we see that when $\gamma_i$ is very small, the local models are almost identical. Localized performance only increases after $\gamma_i$ is increased sufficiently, which is when data heterogeneity can be properly exploited.
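For concreteness, a schematic of the mechanism, assuming for illustration that constraint (7) takes a squared-distance proximity form (the exact form is given in the paper):

```latex
% Schematic only: assumed squared-distance form of constraint (7).
\[
  \min_{\{w_i\},\, w} \; \sum_{i} f_i(w_i)
  \quad \text{s.t.} \quad \|w_i - w\|^2 \le \gamma_i \quad \forall i,
\]
% so gamma_i = 0 forces w_i = w (exact consensus), and local models can
% deviate, i.e., personalize, only once gamma_i is sufficiently large.
```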
>**Question 2:** In the dual update step of the algorithm, it seems that the update is done within only one single step. Therefore, I wonder would the results being very sensitive to the choice of stepsize $\alpha$. In addition, Figure 2(b) seems to corroborate with my concern as the update of $\lambda$ seems highly unstable.
**Reply to Question 2:** Thank you for this point. We note that in primal-dual algorithms, only the average of the dual iterates is guaranteed to converge [A], which implies convergence in terms of the constraint violation but not last-iterate convergence of the dual variable (hence the unstable behavior in Figure 2(b)). We will plot the running average of $\lambda$ in the revised paper to support this; a minimal sketch follows the reference below.
[A] A. Nedić et al., "Approximate primal solutions and rate analysis for dual subgradient methods." SIAM Journal on Optimization, 2009.
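As an illustration of the plot we will add, a minimal Python sketch; the constraint-violation signal, step-size, and clipping box below are illustrative stand-ins, not the paper's exact quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, lam_min, lam_max = 0.1, 0.0, 10.0   # assumed dual step-size and box
lam, lam_avg = 0.0, 0.0

for t in range(1, 201):
    # Stand-in for the constraint violation g_t: noisy, decaying over rounds.
    g = 1.0 / t + 0.5 * rng.standard_normal()
    lam = float(np.clip(lam + alpha * g, lam_min, lam_max))  # dual ascent step
    lam_avg += (lam - lam_avg) / t  # running average of dual iterates, per [A]

print(f"last dual iterate: {lam:.3f}, running average: {lam_avg:.3f}")
```

The last iterate oscillates with the noise, while the running average settles, which is the behavior the averaged-iterate guarantee in [A] predicts.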
# Reply to Reviewer JNzR (rating 5, confidence 4):
We thank the reviewer for appreciating the contributions and recommending acceptance. The detailed replies are as follows.
> The most important weakness of the paper is the limited novelty of the framework in comparison to FedProx. The only difference between the targeted problems in FedProx and FedBC is the constraints in (7) substituting the penalty term in FedProx. In addition to the similar problem formulation, the optimization approach in FedBC makes it even more similar to FedProx. Algorithm 1 uses a primal-dual approach where the dual variables $\lambda_i$ are varying over a bounded range $[\lambda_{\min},\lambda_{\max}]$. Since $\lambda_{\min}-\lambda_{\max}$ appears in the convergence bound (parameter $\delta$ in Theorem 4.5) their difference needs to be bounded for proper convergence and we already know that for small $\lambda_{\min}-\lambda_{\max}$, FedBC becomes nearly identical to FedProx. Therefore, I cannot say FedBC possesses sufficient novelty over FedProx.
We would like to respectfully point out that FedProx and its variant based on Moreau envelopes [B] are instantiations of the penalty method. As such, in the convex case, they can only achieve constraint satisfaction when the penalty coefficient is sent to infinity -- see [C].
- Specifically, in FedProx, achieving consensus (constraint satisfaction) requires sending the step-size to zero, because the penalty coefficient is inversely proportional to the step-size. By contrast, FedBC achieves constraint satisfaction under a constant step-size selection, as the schematic after this list illustrates. This is a fundamentally distinct capability from either prior reference. Therefore, we respectfully point out that it is incorrect to say FedBC is a minor modification of FedProx or its Moreau envelope variant.
- Moreover, it is worth noting that in FedProx the model aggregation is weighted either by the number of samples or by random averaging, while in FedBC the model updates are weighted by the dual variables, which account for constraint violations and the number of samples, among other factors.
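To make the contrast concrete, a hedged schematic of the two local objectives (generic forms for discussion, not the paper's exact equations):

```latex
% Penalty method (FedProx-style): consensus only as mu -> infinity.
\[
  \min_{w_i} \; f_i(w_i) + \frac{\mu}{2}\,\|w_i - w\|^2
\]
% Lagrangian relaxation (FedBC-style): the learned dual lambda_i enforces
% the constraint at a finite value, under a constant step-size.
\[
  \min_{w_i} \max_{\lambda_i \ge 0} \; f_i(w_i)
  + \lambda_i \big(\|w_i - w\|^2 - \gamma_i\big)
\]
```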
[B] C. T. Dinh et al., "Personalized federated learning with Moreau envelopes." Advances in Neural Information Processing Systems, 2020.
[C] W. Yin, "Optimization Algorithms for Constrained Optimization." Department of Mathematics, UCLA, 2015.
>**Weakness 2:** Assumption 4.3 seems too strong to me. The fact that $a,b$ can be chosen differently over $\Lambda$ requires a small diameter for $\Lambda$ and then FedBC would become equivalent to FedProx.
**Reply to Weakness 2:** This is a good point, but we would like to remark that Assumption 4.3 is an artifact of the analysis and is not that restrictive in practice, as we observed in our experiments. The hyperparameter selection procedure for FedProx and FedBC is provided in detail in Appendix E.2.
In FedBC, we obtain different $\lambda_i$'s for different clients even though they are all initialized at $0$. In contrast, FedProx uses a single $\lambda$ (called $\mu$ in the FedProx paper) that is fixed for all clients a priori. As we obtain a distribution of $\lambda_i$'s at the end of FedBC training, it is challenging to make a direct comparison with FedProx in terms of magnitudes. However, we do observe experimentally that FedBC consistently outperforms FedProx, given the advantage that clients can have different $\lambda_i$'s. We also remark that the magnitude of $\lambda_i$ depends not only on its learning rate but also on the tolerance (personalization) parameter $\gamma_i$ (unique to FedBC). By allowing each client to have a different $\lambda_i$, FedBC can control the difference between the local and global models, and hence achieve better personalized or global performance if either extreme exhibits practical advantages.
Also, we remark that to make FedBC equivalent to FedProx, we would need two changes: (1) remove the $\lambda$ update and fix $\lambda$ to a constant, and (2) set $\gamma_i$ to $0$ (see the sketch below). Therefore, even if $\Lambda$ has a small diameter, the presence of a non-zero $\gamma_i$ keeps FedBC distinct from FedProx.
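A minimal Python sketch of these two changes (function and variable names are illustrative, and the squared-distance coupling is an assumed form):

```python
import numpy as np

def norm_sq(v):
    return float(np.dot(v, v))

def fedbc_local_objective(f_i, w_i, w, lam_i, gamma_i):
    # FedBC: lam_i is a learned per-client dual, gamma_i a tolerance.
    return f_i(w_i) + lam_i * (norm_sq(w_i - w) - gamma_i)

def fedprox_local_objective(f_i, w_i, w, mu):
    # FedProx recovered by (1) freezing the dual to the constant mu/2
    # and (2) setting gamma_i = 0.
    return f_i(w_i) + (mu / 2.0) * norm_sq(w_i - w)

# Tiny usage example with a quadratic local loss.
f = lambda v: norm_sq(v - 1.0)
w_i, w = np.array([0.5, 0.5]), np.array([0.0, 0.0])
print(fedbc_local_objective(f, w_i, w, lam_i=0.3, gamma_i=0.1))
print(fedprox_local_objective(f, w_i, w, mu=0.6))
```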
# Reply to Reviewer 5TUP (rating 3, confidence 4):
>**Weakness 1:** The novelty is limited. Several works already investigated a similar framework by adding the penalty [6, 16, 25]. This paper extends the optimization problem to a constrained form and considers different constraint coefficients $\gamma_i$.
**Reply to Weakness 1:** We respectfully disagree with this comment. Our work, though it may look similar to [6,16,25], differs in several aspects. For instance, we do not focus only on the performance of the local model as in [6,25], or only on the global model as in [16]. We are interested in precisely characterizing the tradeoff between the two, via the results in Theorem 4.5 and Corollary 4.6.
Apart from this, the existing methods mostly use penalty methods to solve the FL problem, which is fundamentally different from the Lagrangian approach we propose. As such, in the convex case, they can only achieve constraint satisfaction when the penalty coefficient is sent to infinity -- see [C]. Specifically, in FedProx [16], achieving consensus (constraint satisfaction) requires sending the step-size to zero, because the penalty coefficient is inversely proportional to the step-size. By contrast, FedBC achieves constraint satisfaction under a constant step-size selection. This is a fundamentally distinct capability from either prior reference. Therefore, we respectfully point out that it is incorrect to say FedBC is a minor modification of FedProx [16] or its Moreau envelope variant [25].
[C] W. Yin, "Optimization Algorithms for Constrained Optimization." Department of Mathematics, UCLA, 2015.
>**Weakness 2:** The theoretical results (Theorem 4.5) can not show the linear speedup w.r.t. computation rounds, which are obtained in Theorem 2(a) of [25] (see the third term). Also, the assumption of heterogeneity (Assumption 4.3) is also stronger than [25]. The results also rely on the bounded domain assumption, which means $||\nabla f(x)||$ is bounded by $L$ smoothness. And it's very strong for the non-convex problem.
**Reply to Weakness 2:** In our analysis, we provide the solution in terms of the local accuracy parameter $\epsilon$, which is more general than the setting considered in [25]. We note that in [25], the analysis and results are specific to the setting where stochastic gradient descent is used locally (step 8 of Algorithm 1 in [25]). In contrast, our analysis and final results are independent of the algorithm used at the local client. For instance, if a local algorithm has a sample complexity of $\mathcal{O}\left(\frac{1}{\epsilon^2}\right)$, then we can replace $\epsilon$ by $\mathcal{O}\left(\frac{1}{\sqrt{R}}\right)$ in Theorem 4.5 and achieve a sublinear speedup with respect to the number of local computation rounds $R$; see the schematic below.
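Schematically, assuming the local-accuracy term in Theorem 4.5 scales linearly in $\epsilon$:

```latex
\[
  \epsilon = \mathcal{O}\!\left(\frac{1}{\sqrt{R}}\right)
  \;\Longrightarrow\;
  \underbrace{\mathcal{O}(\epsilon)}_{\text{local-accuracy term in Thm.~4.5}}
  = \mathcal{O}\!\left(\frac{1}{\sqrt{R}}\right),
\]
% i.e., a sublinear speedup in the number of local computation rounds R.
```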
We would also like to note that the speedup in [25] is sublinear as well, not linear; a linear rate would require $\mathcal{O}(\rho^{R})$ for some $0<\rho<1$.
Additionally, we agree with the fair point regarding the bounded domain assumption, but we would like to emphasize that we provide guarantees not only for the global model but also for the local models, and at a fine-grained level of constraint violations, which, to our knowledge, is not present in related work on federated learning.
>**Question 1:** Can we achieve the main purpose of this paper by considering different $\mu_i$ in Eq. (5) of pFedMe [25]? What difference will it lead to the convergence results in [25]?
**Reply to Question 1:** No. The approach in [25] is equivalent to a penalty method (similar to FedProx), which is quite different from our primal-dual approach (as detailed in the reply to Weakness 1). In addition, we have another parameter $\gamma_i$ for each client $i$, which drives the tradeoff between local and global performance in FedBC.
>**Question 2:** This paper considers device sampling, but the additional error term doesn't appear in results in Theorem 4.5.
**Reply to Question 2:** We account for device sampling in the derivation of the upper bound in Lemma B.2 in the Appendix (Eqs. (45)-(53)). An explicit dependence on the number of devices sampled in each iteration (denoted by $S$) is visible in Eq. (90). We will make it explicit in the revised version of our paper.
# Reply to Reviewer iEGY (rating 5, confidence 3):
We thank the reviewer for appreciating the contributions and recommending acceptance. The detailed replies are as follows.
>**Weakness 1:** Even though the theory makes the constraints on local model learnable, which achieves the goal of automation, it does not directly answer the more important question of what tradeoff between global objective and local objective is achieved.
**Reply to Weakness 1:** The tradeoff we achieve is the precise connection, via $\gamma_i$, between the performance of the global and local models. We note that a larger $\gamma_i$ (which allows the local model to deviate from the global model) degrades the global performance (last term in Theorem 4.5). On the other hand, a larger $\gamma_i$ reduces the last term on the right-hand side of Corollary 4.6, and hence improves the local performance. We believe this calibration via $\gamma_i$ is, to our knowledge, unique.
>**Weakness 2:** The major concern is on the experiments of personalization. The algorithm does not seem to help achieve better personalization. This echoes the first point: what tradeoff and balance are achieved through the automation of local constraints?
**Reply to Weakness 2:** This is an interesting point. We note that in Table 2(b), FedBC achieves better personalized performance than pFedMe (which also cares about both local and global models). FedBC does not beat Per-FedAvg because the goal of Per-FedAvg is solely to improve local performance, without regard for global performance (as evident from Per-FedAvg's global performance in Table 2(a)). Therefore, FedBC achieves a more balanced performance than algorithms (such as Per-FedAvg) that focus on local performance only.
Furthermore, we show that if we augment FedBC with MAML-based approaches, then Per-FedBC (Algorithm 3 in Appendix E.4) could beat all the algorithms in personalized performance as well.
>**Weakness 3:** While the theory writing is clear, the experiments are less organized. The additional discovery of the algorithm more equitably weighing the importance of local model does not explain the success of the algorithm on its own. The important discussion between the tolerance variables $\gamma$ and the dual variables $\lambda$ is deferred to the appendix.
**Reply to Weakness 3:** We apologize for this. We will address it in the revised final version of our paper and make sure everything is clear. For instance, we will reorganize the results by data domain, algorithm comparison, and the degree of data heterogeneity present in the data.
>**Question 1:** Could the authors comment on the tradeoff achieved by this algorithm? Does it allow better personalization, better learning of the global model, or a balance in-between? Why is this better than manually setting a tolerance variable, as in FedProx.
**Reply to Question 1:** For the first part of this question, please refer to the replies to Weaknesses 1 & 2 above. Regarding the comparison to FedProx, we remark that FedProx is a penalty-based method, which is fundamentally different from the Lagrangian approach we propose. As such, in the convex case, penalty methods can only achieve constraint satisfaction when the penalty coefficient is sent to infinity -- see [C]. Specifically, in FedProx, achieving consensus (constraint satisfaction) requires sending the step-size to zero, because the penalty coefficient is inversely proportional to the step-size. By contrast, FedBC achieves constraint satisfaction under a constant step-size selection. This is a fundamentally distinct capability. Therefore, we respectfully point out that it is incorrect to view FedBC as a minor modification of FedProx.
[C] W. Yin, "Optimization Algorithms for Constrained Optimization." Department of Mathematics, UCLA, 2015.
>**Question 2:** Could the authors comment (quantitatively if possible) on the change of the tolerance variables throughout training and how this impacts personalization?
**Reply to Question 2:** We discuss the procedure to update the $\gamma_i$'s in detail in Appendix E.1 (Evaluations). In general, we agree that it is difficult to know the optimal $\gamma_i$ a priori due to the heterogeneity present in the system. Hence, we initialize it to $0$ for each client and update it based on the derivative of the local Lagrangian with respect to it (cf. Eq. (8)), which yields an online, gradient-based update (a sketch follows below). This choice of $\gamma_i$ update works well in practice, as we show in the experiments. The personalization of each client's model increases as the magnitude of $\gamma_i$ grows during training (see Figure 1(b)).
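A hedged Python sketch of this update, assuming the local Lagrangian contains the term $\lambda_i(\|w_i - w\|^2 - \gamma_i)$ so that its derivative with respect to $\gamma_i$ is $-\lambda_i$ (names and step-size are illustrative):

```python
def update_gamma(gamma_i, lam_i, eta=0.01):
    # Descent step on the local Lagrangian w.r.t. gamma_i: the gradient of
    # lam_i * (g_i - gamma_i) in gamma_i is -lam_i, so gamma_i grows with
    # the dual pressure; projection keeps the tolerance non-negative.
    grad = -lam_i
    return max(0.0, gamma_i - eta * grad)

gamma_i = 0.0                        # initialized at 0 for every client
for lam_i in [0.2, 0.5, 0.8, 1.0]:   # stand-in dual values over rounds
    gamma_i = update_gamma(gamma_i, lam_i)
print(f"gamma_i after a few rounds: {gamma_i:.4f}")
```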
# Confidential Response to AE
We would like to thank the AE and the reviewers for the time invested in reviewing our work; however, there are some issues we would like to respectfully report.
**Issues with the review by Reviewer UXJD**
We would like to respectfully point out that, contrary to the reviewer's claim, **there are no technical flaws** in the paper. There were some typos and missing detailed derivations, which may have led to the reviewer's comments. We have provided a detailed response to each of them and have carefully checked that there are no technical flaws in the proofs.
**Issues with the review by Reviewer 5TUP**
The reviewer has taken issue with the similarity between this work and FedProx and pFedMe. However, these methods are based on the penalty method and therefore cannot achieve zero constraint violation with a constant step-size. By contrast, our work develops a primal-dual methodology based on the constraint violation, and hence achieves this capability. This is a fundamental point of contrast with these references.
Furthermore, Reviewer 5TUP appears to have misunderstood the contribution of our work. For instance, the reviewer commented that *"This paper proposes a new personalized FL framework by solving the Lagrangian relaxation of a constrained program."* This is not completely accurate. Our goal is not just to focus on personalized performance (as clearly stated in the introduction and related-work discussion on page 2 of our submission). Our goal is to calibrate the tradeoff between global and local performance, which we achieve by introducing a novel constraint (cf. (7)). This is correctly acknowledged by all the other reviewers.
Also, the weakness statement by the reviewer, *"The theoretical results (Theorem 4.5) can not show the linear speedup w.r.t. computation rounds, which are obtained in Theorem 2(a) of [25]"*, is not accurate. We note that the rates in [25] are sublinear, not linear.
Therefore, we kindly request that the scores of these reviewers be de-emphasized in the final decision on our paper.
Yours Sincerely,
Authors