Aymeric Dieuleveut
# Rebuttal: FedMM

## General comment

We thank all the reviewers for their careful assessment of our paper, for the time they spent, and for their constructive comments. Most reviewers found the paper sound (3x “good”, 1x “fair”), well presented (3x “good”, 1x “fair”), and the contribution of interest (3x “good”, 1x “fair”). We are happy that most reviewers gave a positive review, and we hope that our answers and the complementary content we provide will help the reviewers reinforce their support of the paper.

The main contribution is to provide a general framework for federated MM that encompasses several problems, including EM and dictionary learning. Reviewers underlined the relevance and importance of this setting:
- **92uF**: “using MM is elegant”;
- **vyew**: “new federated learning algorithms possess several benefits: combine stochastic surrogate MM with client sampling and quantization”;
- **zqiR**: “the abstract framework is quite general and covers several existing work in the literature as special cases”;
- **FJnh**: “provides a detailed theoretical analysis of MM in a federated setting, tackling a problem that has received relatively little attention from mainstream Federated Learning.”

Reviewers raised questions on some technical aspects of the paper, which we answer below. Overall, the main concerns were (1) the scale and variety of the experiments, (2) the importance of variance reduction (VR), and (3) the lack of empirical comparison to other algorithms. We have added novel experiments demonstrating the robustness of our approach in a case with 1000 workers, a novel extension incorporating VR with a corresponding experiment, and several experiments exploring the impact of various hyper-parameters.
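For concreteness, the quantization mentioned throughout (e.g., in Reviewer vyew's summary) can be thought of as a standard unbiased stochastic quantizer. The following is a minimal, hypothetical sketch (QSGD-style; not our exact compression operator):

```python
import numpy as np

def stochastic_quantize(x, s=4, rng=None):
    """Toy QSGD-style unbiased stochastic quantizer (hypothetical sketch,
    not the paper's exact compression operator).

    Each coordinate of |x| / ||x|| is rounded to one of s + 1 grid levels,
    randomly up or down so that E[Q(x)] = x; the relative variance is
    controlled by the number of levels s (more levels -> smaller omega).
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * s          # position on the grid {0, ..., s}
    lower = np.floor(scaled)
    # round up with probability equal to the fractional part (unbiasedness)
    levels = lower + (rng.random(x.shape) < scaled - lower)
    return norm * np.sign(x) * levels / s
```

Unbiasedness ($\mathbb{E}[Q(x)] = x$) is what allows the compression noise to be treated as an additional variance term controlled by the constant $\omega$ in the analysis.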
Regarding the comparison (both theoretical and empirical) to other methods, we underline the many differences of context between our approach and previously published articles (typically, many methods [12, 43, 65] are designed for a deterministic setting, without an explicit map T, in the smooth regime, and without compression or with full participation), which make a comparison nearly impossible. Providing a thorough comparison of all methods would be very difficult and would be a major contribution on its own, especially considering (1) that the various codes are not unified, (2) the difficulty of tuning hyper-parameters, and (3) that many questions would require subjective decisions: shall we add compression or partial device participation to methods designed without them? Shall we make our method deterministic, or the competitors stochastic? If a competing method designed for the deterministic case fails in the stochastic case, we may risk drawing false insights, etc. We have run experiments that show the superiority of our method in particular contexts, but fear the set of experiments is not complete enough to bring definitive answers, and believe a complete comparison is out of the scope of our paper.

We believe our contribution, which (a) introduces a novel framework and algorithm, (b) sheds light on why the natural way to build an MM algorithm is to average on the surrogate space and on how some algorithms of the literature are related, (c) provides a theoretical analysis, and (d) highlights the open bottlenecks from a theoretical standpoint, already constitutes a significant work. We hope the reviewers consider that the insights brought, and the exciting discussions they may result in, are of sufficient interest for the community.

## Response to Review of Paper6628 by Reviewer 92uF

Thank you for your comments and questions, and for underlining the relevance of the problem we address to the ML community and the clarity of the paper.
We provide a point-by-point answer hereafter.

- *On client drift*: The “client drift” problem occurs in the presence of heterogeneity, when several iterations are performed locally [27, 37]. However, a similar problem (degraded convergence of the algorithm) also occurs when compression is used, or in the partial participation (PP) case, when only a fraction of the clients participate at each iteration [46, I]. Intuitively, compression creates noise that increases with the size of the agents' individual contributions and does not decrease to zero; recovering (almost) the same convergence as in the homogeneous case requires a modification of the algorithm. While we focus on the second case (compression or PP), and our algorithm is adapted to be provably robust to heterogeneity, we also introduce Alg. 5, Fed-Scaf-StoSur-MM, which performs several MM iterations and incorporates control variates to avoid client drift, in the same way as Scaffold [27] does. Providing an analysis of Algorithm 5 is an important open direction left out of the scope of the paper. Nevertheless, we believe that being robust to heterogeneity in the compressed/partial participation case is already a significant challenge.
- *References on federated composite optimization*: Thank you for pointing out those references. The related-work section on federated composite optimization will be completed accordingly. Note that these methods differ from our surrogate approach, and that the last one was released *after* the NeurIPS submission.
- *Technicalities and differences to [12]*: Our approach is indeed built on [12]; yet, there are several major differences. First and most obviously, [12] was focused only on the EM case, while one of our main points is to propose a unified framework that tackles all of Examples 2.1 to 2.3.
From a methodological standpoint, the importance of performing aggregation on the surrogate space (and the distinction between [12] and [43]) was not present in [12]. Regarding technicalities and the analysis: though the analyses are related (as for any stochastic approximation scheme that relies on similar Lyapunov-style arguments), there are some major distinctions:
- In [12], the analysis was carried out under stronger assumptions; in particular, the problem was smooth, that is, the function g was null, which is a major difference.
- The projection step $\Pi$ was not considered in [12]. Managing the projection when deriving the Lyapunov inequality (i.e., a contraction inequality, which is the cornerstone of the proof) is very difficult, especially in the *non-convex* optimization setting and for a method that is not necessarily the gradient algorithm. To the best of our knowledge, we are the first to propose a theoretical analysis in these settings.
- As a consequence of these stronger assumptions, the convergence rate obtained in [12] was faster.

We tried our best in Section 4 to underline the many challenges remaining in order to provide a full picture of federated surrogate algorithms, and our work already tackles several aspects beyond [12].

- *On convergence guarantees*: We will change the wording to highlight the fact that this corresponds to a worst-case guarantee on a particular metric. However, we stress that controlling the norm of the gradient (the increment of the quantity of interest) is very classical in non-convex optimization, as it is often impossible (or NP-hard) to find the minimizer of the loss (and thus to provide a guarantee on the excess loss) unless additional conditions, such as PL functions, are assumed. Similarly, controlling this quantity at a random stopping time is a classical technique in studying non-convex optimization (see, e.g., [20] and its many follow-up references).
We are unsure what kind of comprehensive analysis you had in mind: the type of result we provide seems to be standard. We would be happy to discuss this further if you wish.

- *On “online learning”*: Thank you for raising this point. The “online” term has been used in the literature both for the adversarial context (regret minimization) and for the expectation minimization problem with an infinite stream of i.i.d. data (data arriving “online”). See for example https://en.wikipedia.org/wiki/Online_machine_learning#Statistical_view_of_online_learning or [III] (“we focus on online algorithms that process samples sequentially as they become available.”).
- *Discussion on theoretical results and constants*: Remarks 4.2 to 4.4 provide elements of interpretation of the theorem. We will expand that discussion to highlight the impact of the constants $p$, $\omega$, $\gamma_\max$, and $\mathcal G_0$. First, observe that when control variates are used, heterogeneity does not impact convergence. Moreover, partial participation and compression both affect the level of noise and the convergence (this is visible, for example, in Fig. 4, App. A5). The constant $\omega_p$ aggregates the impact of the quantization and the PP: if $p=1$, we recover $\omega_p=\omega$. The maximal learning rate decreases proportionally to $1/\omega_p^{3/2}$, a dependency that is typical in non-convex optimization with control variates [III, 12]. The constants $L$, $v_\max$ and $v_\min$ are very problem-dependent and can be seen as conditioning. The constant $\mathcal G_0$ corresponds to an “initial distance” on the control variates; it can be made 0 at the cost of a more expensive initialization, and can also be made arbitrarily small.
- *On A6, A7*: Assumption A6 has been used in [12, 17]. Assumption A7(ii) is also classical [e.g., A4 in 17], while A7(i) is more tailored to our analysis. We provide in App. C a detailed discussion of Assumption A7 on some examples to ensure its validity. We will clarify those points.
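To make the role of these constants concrete, here is a minimal numeric sketch; the helper names are hypothetical, and the expression $\omega_p = \omega + (1+\omega)(1-p)/p$ and the $1/\omega_p^{3/2}$ scaling are the ones discussed in our complexity analysis (all problem-dependent constants are omitted):

```python
def omega_p(omega: float, p: float) -> float:
    """Aggregate noise constant combining quantization (omega) and partial
    participation (ratio p): omega_p = omega + (1 + omega) * (1 - p) / p.
    With full participation (p = 1) it reduces to omega."""
    return omega + (1 + omega) * (1 - p) / p

def gamma_max_scaling(omega: float, p: float) -> float:
    """Illustrative scaling of the maximal learning rate, proportional to
    1 / omega_p^{3/2} (constants omitted; hypothetical helper)."""
    return omega_p(omega, p) ** -1.5
```

For instance, stronger compression (larger $\omega$) or weaker participation (smaller $p$) both increase $\omega_p$ and thus shrink the admissible learning rate.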
Thank you again for your review. We hope our rebuttal answers your questions and addresses your concerns. We would be happy to provide more details or clarifications if any are needed, and hope our answers will reinforce your support of the paper.

References [12, 27, 37, 43, 46] are from the paper; references [I]-[III] are below.
[I] C. Philippenko and A. Dieuleveut. Artemis: tight convergence guarantees for bidirectional compression in federated learning.
[II] L. Xiao. Dual Averaging Method for Regularized Stochastic Learning and Online Optimization. NeurIPS 2009.
[III] S. Horvath, D. Kovalev, K. Mishchenko, S. Stich, and P. Richtarik. Stochastic distributed learning with gradient quantization and variance reduction.

## Response to Review of Paper6628 by Reviewer vyew

Thank you for your comments and questions, and for underlining the relevance of the problem we address to the ML community. We provide a point-by-point answer hereafter.

- *Discussion on convergence results*: This is a good point. Unfortunately, not many papers tackle a similar framework, especially a non-convex, non-smooth, stochastic, generic MM algorithm with quantization and partial participation. For example, the closest references consider simpler settings: [12] has g=0, [43] is also in the smooth case and considers inexact minimization (the T map), [65] focuses on the deterministic and smooth case, etc. Overall, for the EM and dictionary learning cases, we are not aware of convergence rates under comparable assumptions. We have tried our best to highlight the differences of context that make the results impossible to compare in Table 1 and Appendix A.4 (A.5 in the revised version).
- *Compactness of B*: This assumption on the compactness of the set B is indeed slightly restrictive, though technical. Due to the projection step onto B, we need B to be convex and closed. The boundedness assumption could be removed and replaced with the assumption that some quantities are bounded (see, e.g., Sec. E.3, where $\bar C$ is defined). We will add a comment. Remark that this is different from the Scaffold case, as our assumption is imposed on the surrogate space. Cases where Assumption A7 is satisfied are given in Appendix C.
- *Choice of $\gamma_\max$*: The maximal learning rate indeed has a complicated dependency, but the upper bound in the theorem is valid for any $\gamma$ smaller than this $\gamma_\max$. It is often the case that the maximal step size in optimization depends on unknown constants (e.g., strong convexity and smoothness parameters). In practice, using a decreasing sequence of steps would ensure that we do not exceed this maximal step size asymptotically. Moreover, in Appendix C, we give an extended discussion on the validity of Assumption A7, including explicit formulas for $B(s)$ that could be used to estimate $v_\max$.
- *“The experiments only focus on showing that Fed-Quant-StoSur-MM can work. There is no comparison with other related FL methods.”*: This is a good remark, and we refer to the general discussion above. We believe that the context we tackle is somewhat specific (quantization, partial participation, stochastic, non-smooth) and makes comparison very difficult without enforcing many modifications to competitors to fit our framework. We fear that an incomplete comparison would risk carrying false or partial insights, and we thus choose to focus on demonstrating the quality of our method.
- *Definition of $J_\Phi$*: This is the Jacobian of the operator $\Phi$, defined at line 64.
- *Incorporation of VR*: You are right. Remark that our formulation of the algorithms allows for any oracle and thus encompasses variance-reduced oracles (see Remark 2.6). Yet, our analysis is indeed only performed under Assumption A4 (bounded variance), but it could be extended to a time-varying upper bound as in VR analyses. As the analysis is already very involved and we wanted to maintain an assumption as general as possible on the oracle (e.g., to encompass the streaming setting), we consider an analysis with VR as future work. The case where we have a finite sample on each client and the batch size is equal to the number of samples results in $\sigma^2 = 0$ and is included in our analysis. Moreover, note that the impact of VR in non-convex cases is unclear: models obtained with lower noise (VR, or deterministic GD) may underperform w.r.t. models obtained with more noise (bad/good local minima). This is the reason why we did not focus too much on this aspect. For completeness, we have added an experiment with VR in Appendix A.4 of the paper, and encourage you to check it out (algorithm design and convergence results in Fig. 3).

Thank you again for your review. We hope our rebuttal answers your questions and addresses your concerns. We would be happy to provide more details or clarifications if any are needed, and hope our answers will reinforce your support of the paper.

## Response to Review of Paper6628 by Reviewer zqiR

Thank you for your constructive comments, and for underlining the generality of our approach.

- *Regarding the experimental evaluation of our algorithms and a complete comparison to competing algorithms*: we refer to the general discussion above: we believe that the context we tackle is somewhat specific (quantization, partial participation, stochastic, non-smooth) and makes comparison very difficult without enforcing many modifications to competitors to fit our framework. We fear that an incomplete comparison would risk carrying false or partial insights, and we thus choose to focus on demonstrating the quality of our method. However, we have added more experiments to show the robustness of our method to a large number of workers, and the impact of the batch size, the partial participation level, the quantization level, and variance reduction (those experiments are reported in App. A4-5 in the revised appendix).
- *Incorporating VR*: We can also incorporate a VR scheme on the surrogate space.
We have written down a concrete example for Example 2.1 + SVRG, and we have added an experiment with VR in Appendix A.4 of the paper; we encourage you to check it out (algorithm design and convergence results in Fig. 3). However, the impact of VR in non-convex cases is unclear: models obtained with lower noise (VR, or deterministic GD) may underperform w.r.t. models obtained with more noise (bad/good local minima). This is why we restrict our analysis to the general bounded-variance assumption.

Thank you for the other suggestions; they have already been corrected in the revised document. We appreciate that you liked the soundness and presentation of the paper. We hope the clarification on VR and the new experiments (with VR, larger scale, impact of the batch size) address your concerns. We would be happy to provide more details if any are needed.

## Response to Review of Paper6628 by Reviewer FJnh

Thank you for your feedback and positive comments, and for underlining the relevance of the problem. We provide some answers below.

- *Experiments do not explore the theoretical results*: We have added experiments (with VR, larger scale, varying batch size, varying PP, varying quantization level). Those experiments are available in the revised version of the appendix (App. A4-5, pages 21 and following); we hope you can check them out. The experiment with a large number of workers (n=1000) shows that the method is still applicable in that case. The experiment with varying batch size exactly illustrates our theorem, showing that the limit value of $\mathbb{E}\left[\overline{\mathcal{E}}_{\tau}\right]$ decreases proportionally to the batch size and to the number of active workers, and varies with the quantization level (Figure 4: left, middle, right respectively). Those experiments will be extended to more cases in the final version if the paper is accepted.
- *Insight as to what this analysis actually means in practice*: Our first goal is to highlight that the natural way to build an MM algorithm is to average on the surrogate space. Further, our analysis provides insights on how compression, partial participation, batch size, etc. influence convergence: e.g., we can observe that with control variates, from a theoretical standpoint, (1) heterogeneity does not impact convergence; (2) partial participation and compression result in an increase of the variance factor and reduce the maximal learning rate, which decreases proportionally to $1/\omega_p^{3/2}$, a dependency that is typical in non-convex optimization with control variates [I, 12]. From an experimental standpoint, our experiments illustrate the impact of PP, the batch size, the control variates, and the quantization level. Figure 1 shows the importance of the control variates, and that a lower participation results in slower convergence initially but may ultimately converge to a better minimum (which is reasonable in a non-convex regime).
- *“Are there any examples where there is some interest in solving a problem that can be represented using MM in a federated setting ‘in the wild’?”*: Examples 2.1 to 2.3 provide several cases in which our MM algorithm could be used in practice. Many practical federated applications would require EM or surrogate optimization: e.g., federated data imputation (when data is corrupted by missing entries) can be tackled with federated EM and is very helpful in the medical domain (different hospitals measuring different variables); federated matrix factorization can be used in federated recommendation systems, in which data is kept on the devices for privacy reasons; etc.
- *Scale of the number of workers*: We have added experiments demonstrating the robustness of our method to a larger number of workers ($n=1000$); this experiment is reported in App. A4 in the revised document.
- *“The scale of the number of workers (4, 8, 20) seems particularly concerning: many clean [...] scale in my experience.”*: There is no expected bad dependency w.r.t. the number of workers as long as the participation level remains constant; on the contrary, the variance level decreases proportionally to the number of active clients, as visible in Fig. 4 (middle). One difficulty, from the computational point of view, is that the algorithm requires as many control variates as there are clients, but this is a common difficulty for all algorithms with control variates [27, 46].

We hope our answers address your concerns. We would be happy to provide more details or clarifications if any are needed, and hope our answers will reinforce your support of the paper.

References [12, 27, 46] are from the paper; [I] is below:
[I] S. Horvath, D. Kovalev, K. Mishchenko, S. Stich, and P. Richtarik. Stochastic distributed learning with gradient quantization and variance reduction.

# To preserve

[17] G. Fort, P. Gach, and E. Moulines. Fast incremental expectation maximization for finite-sum optimization: nonasymptotic convergence. Statistics and Computing, 31(4), 2021.
[20] S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming.
[27] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh. SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated Learning.
[37] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang. On the Convergence of FedAvg on Non-IID Data.
[46] K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed Learning with Compressed Gradient Differences.
[12] A. Dieuleveut, G. Fort, E. Moulines, and G. Robin. Federated-EM with heterogeneity mitigation and variance reduction.
[43] O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal. Federated multi-task learning under a mixture of distributions.
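The per-client control variates discussed above can be illustrated with a toy Scaffold-like round under partial participation. This is a hypothetical sketch with simplified gradient oracles, not the paper's surrogate-space algorithm (which also handles compression); the function and variable names are illustrative:

```python
import numpy as np

def fed_round(theta, grads, controls, gamma, rng, p=0.5):
    """One schematic round with per-client control variates under partial
    participation (hypothetical Scaffold-like sketch, not the paper's
    exact algorithm).

    Each client i keeps a control variate controls[i] (its last reported
    direction). Only a Bernoulli(p) subset of clients participates; the
    correction term is scaled so that, in expectation, the update equals
    the mean of all clients' directions despite partial participation."""
    n = len(grads)
    active = rng.random(n) < p                      # Bernoulli client sampling
    direction = controls.mean(axis=0)
    for i in np.flatnonzero(active):
        g_i = grads[i](theta)
        direction += (g_i - controls[i]) / (n * p)  # unbiased correction
        controls[i] = g_i                           # refresh control variate
    return theta - gamma * direction, controls
```

On a toy heterogeneous problem (each client holding a quadratic loss with a different optimum), iterating this round drives the iterate toward the global optimum even though only half of the clients participate per round, which mirrors the robustness to heterogeneity discussed above. It also makes the memory cost explicit: one control variate per client.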
------------------------------

# Second answer to referee 92uF

> On client drift & references on federated composite optimization

Thank you for your suggestions. We will revise the paper accordingly.

> I do not see much difficulty in the extension of [12] to the composite setting

The difficulties come from **two** sources: (i) the composite optimization setting and (ii) the projection step $\Pi$. Let us explain in detail. Denote by $(S_k, k \geq 0)$ the sequence of iterates produced by the algorithm. Both in [12] and in the current paper under review, the proof relies on a Lyapunov inequality whose derivation follows three steps:
- find a Lyapunov function $W$ and show that, conditionally on the past of the algorithm, the expected value of $W(S_{k+1}) - W(S_k)$ is the sum of two terms $D$ and $R$;
- show that the first term $D$ is negative; it is called the "drift" term;
- show that the second term $R$, which may be positive, is negligible (in some sense) w.r.t. the first term $D$.

We agree that in both papers the first step is almost identical. Nevertheless, the second and third steps differ, especially the second step, which is known to be the crucial step when proving a Lyapunov inequality. Let us comment on the differences in this second step.
- Let us start with the (toy) case of a gradient algorithm when $W$ is smooth enough. The drift term is the scalar product $\left(\mathsf{Grad}W(S_{k}) \right)' \, \mathbb{E} \left[S_{k+1} - S_k \vert \mathrm{past}_k \right]$, where $\mathsf{Grad}W$ denotes the gradient of $W$. By definition of the update mechanism of the iterates, we have $\mathbb{E} \left[S_{k+1} - S_k \vert \mathrm{past}_k \right] = - \gamma_{k+1} \mathsf{Grad}W(S_k)$. We obtain that the drift term is $- \gamma_{k+1}\| \mathsf{Grad}W(S_k)\|^2$.
- In [12], there is also a drift term equal to the scalar product $\left(\mathsf{Grad}W(S_{k}) \right)' \, \mathbb{E} \left[S_{k+1} - S_k \vert \mathrm{past}_k \right]$.
Nevertheless, EM is not a gradient algorithm, and we have (at least in simple cases, when the stochastic approximation of the mean field is unbiased): $\mathbb{E} \left[S_{k+1} - S_k \vert \mathrm{past}_k \right] = - \gamma_{k+1} B(S_k)\mathsf{Grad}W(S_k)$, where $B(s)$ is a positive definite matrix. This implies that the drift term is equal to $- \gamma_{k+1} \| \mathsf{Grad}W(S_k) \|^2_{B(S_k)}$, where $\| \cdot \|_U$ is the norm induced by the positive definite matrix $U$. EM lies in the so-called *variable metric* setting, with a scale matrix $B(S_k)$ that depends on the iterate $S_k$. In [12], this adapted variable metric is managed by assuming that the spectrum of $B(s)$ is lower bounded uniformly in $s$. This yields a drift term upper bounded by $- \gamma_{k+1} v_{\min} \| \mathsf{Grad}W(S_k)\|^2$.
- In the paper under review: (i) we are in the composite optimization setting; (ii) the fluctuation of the Lyapunov function between two successive iterates does not rely on a Taylor expansion at order 2 but uses MM-1, MM-2, A6, and A7. A first novelty w.r.t. [12] is the inequality provided by **Proposition E.3 in the supplementary material**. The drift term is a scalar product that looks like the one in [12], **but** unfortunately, the conditional expectation of $S_{k+1}-S_k$ is not easily related to $-\gamma_{k+1} B(S_k) \, \mathsf{Grad} W(S_k)$. Such a fundamental difference is a consequence of (i) the non-smooth function $g$ and (ii) the fact that the expectation of a projection operator evaluated at a random variable $Z$ **is not** the projection operator evaluated at the expectation of $Z$, and there are no general results relating these two quantities.
Consequently, we adopt a totally different approach in order to obtain the drift term, an approach summarized in *the first 13 lines of the proof of Theorem E.11*. Compared to [12] and to what is usually done in the literature: (i) the non-smooth function $g$ contributes to the scalar product in such a way that the conditional expectation of this scalar product is **no longer** the scalar product of the conditional expectations, as it is in [12]; therefore, (ii) we have an original control of the scalar product (see line 1307 in the proof of Theorem E.11); and (iii) we have an original approach to exhibit a negative term $D$: this drift term is **no longer inherited from the first-order term when majorizing the smooth part** of the objective function, and the scalar product is no longer the drift term but is part of the remainder term $R$.

As a consequence, the remainder terms $R$ are not the same in [12] and in the current paper. Their control differs, also because we have to prove that $R$ is negligible w.r.t. $D$, and these drift terms $D$ are not the same in the two papers. But, not surprisingly, the control of the term $R$ necessitates an upper bound on the quadratic error of the stochastic field $H_{k+1}$ for both papers (see Steps 4 to 6 in Section E.4).

> On convergence guarantees, theoretical results, and constants

It can be derived from Theorem 4.1 that the number of iterations $k_{\max}$ required to obtain an $({\tt A} \sigma^2/n + \epsilon)$-stationary solution is upper bounded by ${\tt B}/\epsilon$ for some constants ${\tt A}, {\tt B}$. Note that there are two optimization steps run by the central server at each iteration (when computing ${\tt T}$ and for the projection $\Pi$), so that $k_{\max}$ is also proportional to the number of optimization steps in the algorithm. Let us discuss how ${\tt A}$ and ${\tt B}$ depend on the compression parameter $\omega$, the total number of agents $n$, and the partial participation ratio $p$.
- First, consider the case where there is no compression ($\omega = 0$) and all workers are active ($p=1$). Then $\gamma$ is chosen independently of $n$ (see Theorem E.11 in the supplementary material). This yields ${\tt A} = O(1)$ and ${\tt B} = O(1)$. Now, consider $\omega > 0$ and fix $\gamma$ and $\alpha$ so that they scale as their optimal values:
$$ \gamma= O\left( \sqrt{n} \frac{p}{1+\omega} \sqrt{\frac{p}{p\omega + (1+\omega)(1-p)}}\right), \qquad \frac{\gamma}{\alpha} = O\left( \sqrt{n}\sqrt{\frac{p}{p\omega + (1+\omega)(1-p)}}\right). $$
- The first comment is that compression impacts the maximal value of the learning rate by a factor $\sqrt{n} (p/\omega)^{3/2}$ when $\omega/p \gg 1/p$. Similar constraints were observed in [Ref 1, Ref 2].
- Let us derive the scaling of ${\tt A}$ and ${\tt B}$. We have
$$ {\tt A} = O\left(\omega + \frac{1+\omega}{p}(1-p) \right), \qquad {\tt B} =O\left( \frac{1}{\sqrt{n}} \frac{1+\omega}{p} \sqrt{\omega + \frac{1+\omega}{p}(1-p)} \right). $$
In the full participation regime ($p=1$) with a large compression effect ($\omega \gg 1$), we get ${\tt A} = O(\omega)$ and ${\tt B} = O(\omega^{3/2}/\sqrt{n})$. Otherwise, when $p<1$ and $\omega$ is large, so that $\omega + (1+\omega)(1-p)/p \sim \omega (1-p)/p$, then
$$ {\tt A}= O\left( \frac{\omega}{p}(1-p) \right), \qquad {\tt B} = O\left( \frac{\omega^{3/2}}{\sqrt{n}} \frac{\sqrt{1-p}}{p^{3/2}}\right). $$
This discussion shows that both the maximal learning rate and the total number of iterations $k_{\max}$ are affected by the compression in the regime $(\omega/p)^{3/2}/\sqrt{n} \gg 1$. This result is consistent with [12, Section 2]. Of course, the cost of the compression on $k_{\max}$ has to be balanced against the reduction in communication cost induced by the compression. The number of active workers per iteration is $pn$ in expectation.
If each worker uses the same number of samples at each iteration when computing an oracle (see line 5 of Algorithm 4), then $k_{\max}$ is also proportional to the number of processed samples. The number of samples needed thus scales as $n p k_{\max}$, i.e., $$ O\left(\frac{1}{\epsilon} \sqrt{n} \, \omega^{3/2} \sqrt{\frac{1-p}{p}} \right). $$ Decreasing $p$ increases the iteration complexity. We will provide further details in the future revision, such as the best strategy for the choice of $\alpha \in (0, p/(1+\omega))$.

Ref 1: Horvath, S., Kovalev, D., Mishchenko, K., Stich, S., and Richtarik, P. Stochastic distributed learning with gradient quantization and variance reduction. arXiv, 2019.

Ref 2: Philippenko, C. and Dieuleveut, A. Preserved central model for faster bidirectional compression in distributed settings. NeurIPS, 2021.

---

### Reviewer vyew

> I appreciate the authors providing detailed responses, which mostly address my concerns.
> After reading the responses, I still feel positive about the results of the paper and have decided to keep my score. Regarding the comparison with existing works, I agree there might not be other work that considers the combination of quantization + nonconvex + partial participation + composite objective in FL. How about relaxing the quantization? I believe there are other works that deal with composite nonconvex problems in FL with partial participation. Besides, I think a discussion on when the proposed method is preferable to existing FL methods would strengthen the quality of the paper.

Thank you. For FL algorithms with nonconvex + partial participation + composite objectives, we note that there are recent works such as [Yuan et al., Federated Composite Optimization, ICML 2021]. However, these prior works still take a gradient-based approach (such as mirror descent), which differs from our surrogate-based method. Regarding convergence rates, there are two major differences:
- First, FedDualAvg (Alg. 3 in [Yuan et al., 2021]) is not robust to heterogeneity: ensuring convergence (Th. 4.3) in the heterogeneous case requires bounded heterogeneity, and the rate is degraded by the heterogeneity factor $\zeta$. On the contrary, our approach performs well independently of the heterogeneity level, thanks to the use of the control variates $V_{k,i}$.
- Second, convergence rates in the heterogeneous case are only provided for a quadratic function $F$.

Furthermore, FL for nonconvex + composite objectives is only one application of our framework [Example 2.1]; there are other applications as well, e.g., Examples 2.2 & 2.3. We will revise the paper to include discussions w.r.t. more existing FL methods. Many thanks again for your questions and for your support. We hope that these clarifications on the convergence rates strengthen the paper, and that you may consider increasing your score if our answers have addressed your doubts.
We would be happy to continue the discussion if you have any other questions!
