# ICLR Rebuttal

**Reviewer 1**

**Summary Of The Paper:** This paper proposes a decentralized adaptive method for distributed deep learning, termed DAG-Adam. Convergence results are provided for smooth non-convex objectives under a bounded gradient assumption. Numerical experiments are conducted on image classification (CIFAR10, ImageNet-1k) and language modelling (fine-tuning pre-trained BERT models on SQuAD).

**Main Review:**

Strengths: I found this paper very interesting, and have long awaited progress on the topic of adaptive decentralized optimization for deep learning. Traditional gossip-based methods performing consensus on the parameters cannot be naively applied to adaptive gradient methods such as Adam because of the non-linearity in the gradient update (e.g., dividing by the square root of the second-moment buffer), which breaks traditional proofs and significantly degrades performance in practice compared to centralized learning (e.g., see [a]). This paper is very important and contributes to an emerging literature. In particular, to realize the benefits of decentralized (gossip-based) optimization for deep learning, one must develop strategies suited to adaptive gradient methods, especially since many popular deep learning tasks seem to work only with adaptive methods, such as language models and the new swath of image-based methods relying on Vision Transformer architectures. This paper also shows very good familiarity with related work. Despite my current rating, I strongly support this paper, and thus will focus on the main issues I observed below so that the authors can address them, and hopefully find their paper accepted at the end of this process.

Thanks for the valuable comments! Below we address all of the reviewer's questions, and we will be glad to clarify any further comments. We also note that we have updated the draft accordingly.
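To make the non-linearity obstacle concrete, below is a minimal NumPy sketch (our own illustration, not taken from the paper; the per-worker gradients and second-moment values are hypothetical) showing that averaging Adam-style preconditioned updates across two workers is generally not the same as preconditioning the averaged gradient, since $g \mapsto g/(\sqrt{v}+\epsilon)$ is non-linear:

```python
import numpy as np

eps = 1e-8
# Hypothetical per-worker gradients and second-moment estimates.
g1, g2 = np.array([1.0, 4.0]), np.array([3.0, 0.5])
v1, v2 = g1**2, g2**2  # toy stand-ins for Adam's second-moment buffers

def precondition(g, v):
    """Adam-style update direction: g / (sqrt(v) + eps)."""
    return g / (np.sqrt(v) + eps)

# Decentralized view: each worker preconditions locally, updates are averaged.
avg_of_updates = 0.5 * (precondition(g1, v1) + precondition(g2, v2))

# Centralized view: the averaged gradient is preconditioned once.
g_bar, v_bar = 0.5 * (g1 + g2), 0.5 * (v1 + v2)
update_of_avg = precondition(g_bar, v_bar)

# The two directions disagree, which is why naive gossip consensus on
# parameters does not commute with the adaptive update.
print(avg_of_updates, update_of_avg)
```

Here `avg_of_updates` is (up to `eps`) the all-ones vector, while `update_of_avg` is not, so the decentralized and centralized update directions genuinely differ.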
**Weaknesses:**

* Proposition 1 is somewhat meaningless… what is relevant here is a lower bound on the squared norm of the sub-optimality of the iterate x_i^t as t goes to infinity. Similarly for Proposition 2.

* Many thanks for bringing this issue to our attention. To address the reviewer's concern, we have refined the upper bound in Proposition 1 and established a lower bound for $\sum_{i=1}^n \|x_i^{(t)} - x^\star\|^2$ as $t\to \infty$. For the reviewer's convenience, we restate the revised Proposition 1 below (the reviewer can also check the revised proposition in the paper):

  Proposition 1. Under Assumptions A.1, A.3 and A.4, if each $f_i(x)$ is further assumed to be $\mu$-strongly convex, and the full-batch gradient $\nabla f_i(x)$ is accessed per iteration, then DAdam (Algorithm 2) cannot converge to $x^\star$, i.e., the global solution to problem (1). In particular, the distance between the DAdam limiting point and $x^\star$ can be characterized as follows:
  \begin{align} \lim_{t\to \infty} \sum_{i=1}^n \|x_i^{(t)} - x^\star\|^2 = O\Big( \frac{\gamma^2 G^2}{(1-\rho)^2\epsilon^2} + B^2\Big) \end{align}
  where $B := \|\frac{1}{n}\sum_{i=1}^n (H_i + \epsilon I)^{-1} \nabla f_i(x_i^\infty) - \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_i^\infty)\| \neq 0$ is a constant bias term. Furthermore, the $B^2$ term in the above bound is necessary and cannot be removed by an improved analysis; there exist strongly convex and smooth functions $f_i(x)$ such that $\lim_{t\to \infty} \sum_{i=1}^n \|x_i^{(t)} - x^\star\|^2 \ge B^2$.

  With the newly established lower bound in Proposition 1, it is observed that DAdam cannot converge to the optimal solution $x^\star$ even if a decaying learning rate $\gamma \to 0$ is utilized. We have also revised Proposition 2 for QG-DAdam accordingly; please check the details of the revised Proposition 2 in the paper.
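The non-zero limiting bias can already be seen in one dimension. The sketch below is our own illustrative example (not from the paper; the coefficients `a`, `b` and the per-node preconditioners `h` are hypothetical), using strongly convex quadratics $f_i(x) = \frac{a_i}{2}(x-b_i)^2$. The point where the *preconditioned* average gradient $\frac{1}{n}\sum_i (h_i+\epsilon)^{-1}\nabla f_i(x)$ vanishes differs from the true minimizer $x^\star$ of $\frac{1}{n}\sum_i f_i$ whenever the preconditioners are heterogeneous:

```python
import numpy as np

# Hypothetical 1-D quadratics f_i(x) = 0.5 * a_i * (x - b_i)^2.
a = np.array([1.0, 4.0])   # curvatures
b = np.array([0.0, 1.0])   # per-node minimizers
h = np.array([0.5, 3.0])   # heterogeneous per-node preconditioners (stand-ins for H_i)
eps = 1e-3

# True minimizer of (1/n) * sum_i f_i: solves sum_i a_i (x - b_i) = 0.
x_star = np.sum(a * b) / np.sum(a)

# Fixed point where the preconditioned average gradient vanishes:
# sum_i a_i (x - b_i) / (h_i + eps) = 0.
x_inf = np.sum(a * b / (h + eps)) / np.sum(a / (h + eps))

# x_inf != x_star: the limiting point carries a constant bias,
# matching the B^2 term in Proposition 1.
print(x_star, x_inf)
```

Because the gap between `x_inf` and `x_star` does not depend on the step size, shrinking $\gamma$ cannot remove it, consistent with the lower-bound statement above.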
* Can you clarify the following: Theorem 1 assumes that not just the gradient is bounded, but that the gradient with an additional consensus penalty is bounded?

* Yes, we need both the original gradient and the augmented gradient (i.e., the gradient with the consensus penalty) to be bounded.

* In Theorem 1, the language indicates that x⋆ is the optimal solution, but note that in the non-convex setting there does not necessarily exist a unique global optimizer.

* Many thanks for pointing out the typo. We have changed $\mathcal{L}(\mathbf{x}^\star)$ to $\mathcal{L}^\star$, which is defined as $\mathcal{L}^\star = \min_{x}\mathcal{L}(x)$. Please check Theorem 1 in the revised paper.

* Would be interested in seeing the statistical significance of results (across a few seeds) in Table 1 on CIFAR10.

* We have revised Table 1 to report the statistical significance across three different seeds. Please check Table 1 in the revised paper.

* Why does heterogeneity of data distributions have such a large impact on the final validation performance in Table 1? I would expect the heterogeneity to affect the convergence rate, but not the final validation performance.

* Many thanks for raising this useful question. First, we agree that data heterogeneity slows down the convergence rate of decentralized algorithms, as the reviewer expects. In the revised paper, we have added a new figure illustrating how different Adam-based algorithms converge on the CIFAR-10 training dataset; see Fig.~[XXX]. It is observed that both PAdam and DAG-Adam converge faster than the others due to their insensitivity to data heterogeneity: they reach a much smaller training loss after the same number of training epochs. This experiment corroborates the reviewer's comment. Second, we would like to clarify that a slow convergence rate does have a negative effect on the final validation (or test) performance.
Note that all results listed in Table 1 are obtained with the same resource budget; in other words, they are obtained on the same hardware and with the same number of training epochs (or iterations). If an algorithm converges slowly due to the influence of data heterogeneity, the model may not have been sufficiently trained after the predefined number of training epochs. As a result, the algorithm cannot reach a good solution with small training loss, and will suffer from poor validation performance. Fig.~[XXX] in the revised paper illustrates how different algorithms perform on the validation dataset. It is observed that the smaller the training loss an algorithm reaches (i.e., the faster it converges), the better its final validation performance.