# Response to Reviewer 1Mgi
We thank the reviewer for their valuable feedback and the time they have dedicated to the community. Below, we address each of the reviewer's questions and clarify the misunderstandings.
## Algorithm 1 Asynchronous Misunderstanding
>Algorithm 1 is not wait-free according to Line 11 and 12. As far as I can see, worker i has to wait for the models from all of its neighbors before averaging. If any update of its neighbor is delayed, which could be an unbounded delay, then worker i is blocked. Then in the future communication rounds, all neighbors of i are blocked as they have to wait for worker i for updates. I didn't find how the asynchrony is actually addressed in the algorithm.
**Response:**
SWIFT is a wait-free algorithm due to its non-blocking, asynchronous model communication. At the beginning of every local update, each client asynchronously broadcasts its local model to its neighbors (Line 7 of Algorithm 1). Each client stores a copy of its neighbors' models. Before performing model averaging, a client checks for any new model updates from its neighbors (Line 11). **If a client has not received a message from a neighbor, then its stored model for that neighbor is still that neighbor's most up-to-date model**.
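As an illustration only (the class and method names below are hypothetical, not taken from the paper's implementation), the non-blocking fetch in Line 11 can be sketched in Python: a client drains its message inbox without ever waiting, and any neighbor that has not sent a new model simply keeps its cached entry.

```python
from queue import Queue, Empty

class Client:
    """Hypothetical sketch of SWIFT's wait-free neighbor-model cache."""

    def __init__(self, neighbor_ids, init_model):
        # Cached copy of each neighbor's most recently received model.
        self.cached = {j: init_model for j in neighbor_ids}
        # Inbox of (neighbor_id, model) messages, filled asynchronously.
        self.inbox = Queue()

    def fetch_neighbor_models(self):
        """Drain any newly arrived models WITHOUT blocking (cf. Line 11)."""
        while True:
            try:
                j, model = self.inbox.get_nowait()  # never waits
            except Empty:
                break
            self.cached[j] = model
        # Neighbors with no new message keep their last-received model.
        return self.cached
```

The key point is that `get_nowait` returns immediately, so a slow or delayed neighbor can never block the caller; averaging simply proceeds with the stale cached model.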
Thus, the model fetching and averaging process within Lines 11 and 12 **does *not* require** a client to wait for all of its neighbors' models. This ensures that no client ever waits on slower clients, and thus there are no parallelization delays. Specifically, the asynchronous averaging process is detailed mathematically in Equation (4), given as:
$$
X^{t+1} = X^t W^t_{i_t} - \gamma G(x^t_{i_t}, \xi^{t}_{i_t}).
$$
Within the equation above, the gradient matrix $G$ is zero everywhere except for the column corresponding to the gradient update of the active client $i_t$. The client-communication matrix $W_{i_t}$ is neither doubly-stochastic nor symmetric, allowing clients to average without having to wait for their neighbors. $W_{i_t}$ is defined in Equation (5) as:
$$
W^t_{i_t} := I_n + (w_{i_t}^t - e_{i_t}) e_{i_t}^\intercal, \;
w_{i_t}^t := [w_{1, i_t}^t, \ldots, w_{n, i_t}^t]^\intercal \in \mathbb{R}^n, \; \sum_{j=1}^n w_{j,i} = 1, \; w_{i,i} \geq 1/n \; \forall i.
$$
If the asynchronous averaging process defined above were synchronous, then $G$ would be a fully dense matrix containing the gradient updates of *all* clients at once.
Furthermore, the client-communication matrix $W_{i_t}$ would be symmetric and doubly-stochastic, forcing all model averaging to occur together in a synchronized fashion.
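As a quick numerical illustration (not the paper's code; the weight vector below is a hypothetical choice satisfying the constraints in Equation (5)), one can verify that $W_{i_t}$ is column-stochastic but generally neither symmetric nor doubly-stochastic:

```python
import numpy as np

n, i_t = 4, 1
# Hypothetical weights: sum to 1, with w[i_t] >= 1/n as Eq. (5) requires.
w = np.array([0.2, 0.5, 0.3, 0.0])
e = np.zeros(n); e[i_t] = 1.0

# Eq. (5): W = I + (w - e_{i_t}) e_{i_t}^T, i.e. only column i_t is replaced by w.
W = np.eye(n) + np.outer(w - e, e)

assert np.allclose(W.sum(axis=0), 1.0)      # column sums are 1 (column-stochastic)
assert not np.allclose(W, W.T)              # not symmetric
assert not np.allclose(W.sum(axis=1), 1.0)  # row sums != 1: not doubly-stochastic
```

Only the active client's column changes, which is exactly why the update touches no other client's state and requires no synchronization.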
## Theoretical Benefit (Linear Speed-up)
>There is no theoretical benefit to use SWIFT. The convergence rate is which does not scale with the number of workers. It means SWIFT is not benefiting from the collaborative training and only have the same rate as training alone.
**Response:** We emphasize that SWIFT *does* obtain a linear speed-up with respect to the number of clients. We obtain this speed-up in the same way as AD-PSGD [R1]. The proof of this is summarized in Remark 4 of [R1], and we expand upon it further below:
An algorithm achieves an $\epsilon$-approximate solution if
$$
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}||\nabla f (\bar{x}^t)||^2 \leq \epsilon.
$$
SWIFT's convergence rate in Theorem 1 (like AD-PSGD's) matches that of SGD and is dominated by the $1/\sqrt{T}$ term. Therefore, the total computational complexity required to achieve an $\epsilon$-approximate solution is bounded by $\mathcal{O}(\frac{1}{\epsilon^2})$. Since the number of clients does not affect this total complexity, each of the $n$ clients bears a computational complexity of only $\mathcal{O}(\frac{1}{n\epsilon^2})$. Therefore, ***SWIFT achieves a linear speed-up with respect to the number of clients in terms of per-client computational complexity***. We will highlight this in the final version of our manuscript.
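Concretely, the argument can be sketched in two steps:

$$
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}||\nabla f (\bar{x}^t)||^2 = \mathcal{O}\Big(\frac{1}{\sqrt{T}}\Big) \leq \epsilon \;\; \Longrightarrow \;\; T = \mathcal{O}\Big(\frac{1}{\epsilon^2}\Big).
$$

Since these $T$ total gradient updates are shared evenly among $n$ clients working in parallel, each client performs only $T/n = \mathcal{O}(\frac{1}{n\epsilon^2})$ updates, which is the claimed linear speed-up.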
### Rebuttal Summary
We again thank the reviewer for their time and effort in the reviewing process. We hope to have resolved the misunderstanding surrounding the asynchronous nature of SWIFT: clients store the most recent models received from their neighbors and communicate in a non-blocking, wait-free manner. Finally, we have clarified the linear speed-up that SWIFT (in line with other federated algorithms) obtains with respect to the number of clients: with more clients, the total number of gradient updates required for convergence is shared among more workers, so an $\epsilon$-approximate solution is reached faster.
We believe that we have addressed all of the reviewer's concerns, and would be happy to discuss any further questions that remain.
### Citations
[R1] Lian, Xiangru, et al. "Asynchronous decentralized parallel stochastic gradient descent." International Conference on Machine Learning. PMLR, 2018.