# Decentralized FL

## Notes

* There have been many recent works that propose bypassing a central orchestrator for training (PyGrid in our case). One technique to achieve this is the gossip protocol, where workers communicate among themselves to converge to a common model.

# Papers

## Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

* Considers decentralized stochastic optimization with the objective function distributed over n machines that can only communicate with their neighbors on a fixed communication graph
* No multiple local updates on each node; instead, nodes perform shared/collaborative model updates
* To overcome communication bottlenecks, nodes compress (e.g. quantize or sparsify) their model updates
* Covers both unbiased and biased compression operators
* Compression quality parameter $\delta \le 1$ ($\delta = 1$ corresponds to no compression)

### Contributions

1. CHOCO-SGD: a novel gossip-based stochastic gradient descent algorithm
   * Can reduce communication by at least two orders of magnitude
2. CHOCO-GOSSIP: a novel gossip algorithm for the distributed average consensus problem
   * 1st linearly converging gossip algorithm with communication compression
   * 1st gossip algorithm that supports arbitrarily compressed messages for any $\delta > 0$ and still exhibits linear convergence

### Introduction

* **Decentralized Communication**
  * The network topology is modeled as a graph where edges represent the communication links along which messages (e.g. model updates) can be exchanged
  * Such topologies avoid the bottleneck of a central node and thereby offer hugely improved potential in scalability
  * The maximal degree of the network is often constant (e.g. ring or torus) or a slowly growing function in n (e.g. scale-free networks)
* **Decentralized Optimization** (**Gossip Algorithms**)
  * SGD is an inherently serial algorithm that does not take the distributed setting into account
  * Mini-batch SGD is the natural parallelization of SGD
  * Popular gossip algorithms:
    * Subgradient descent
    * ADMM: alternating direction method of multipliers
    * Dual averaging
  * The network topology only affects higher-order terms of the convergence rate of decentralized optimization algorithms on convex problems (see the convergence rate formula in the paper)
  * CHOCO-SGD is as efficient in terms of iterations as centralized mini-batch SGD but avoids the communication bottleneck that centralized algorithms suffer from
  * CHOCO-SGD is free of i.i.d. assumptions on the data distribution across workers
* **Communication Compression**
  * Proposes the first method that supports arbitrarily low accuracy and even biased compression operators
  * A compressed vector $Q(g)$ can be represented more efficiently, for instance by using a limited-bit representation (quantization) or by enforcing sparsity (both referred to as compression operators in this paper)
  * Quantization operators are based on random dithering
  * Much sparser vectors can be obtained by random sparsification techniques
  * Techniques that do not directly quantize gradients, but instead maintain additional state, are known to perform better in theory and practice (a minimal sketch of such compression operators follows right after this list)
* **Distributed Average Consensus**
  * Consists in finding the average vector $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ of the n local vectors
  * Important sub-routine of many decentralized algorithms
  * For consensus algorithms with compressed communication it has been observed that the standard gossip algorithm does not converge to the correct solution
    * Sometimes it only converges to a neighborhood of the solution
  * The method presented here converges linearly, even for arbitrarily compressed communication, without requiring adaptive accuracy
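To make the compression operators concrete, here is a minimal NumPy sketch of the three operator families used in the paper's experiments ($rand_k$, $top_k$, and QSGD-style random dithering). The exact scaling conventions and the $\delta$ values noted in the comments reflect my own reading of the standard definitions, not the paper's reference implementation.

```python
import numpy as np


def rand_k(x, k, rng=None):
    """Random sparsification: keep k coordinates chosen uniformly at random.

    Without rescaling this is a contraction with E||Q(x) - x||^2 = (1 - k/d)||x||^2,
    i.e. compression quality delta = k/d.
    """
    if rng is None:
        rng = np.random.default_rng()
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out


def top_k(x, k):
    """Biased sparsification: keep the k largest-magnitude coordinates.

    Satisfies ||Q(x) - x||^2 <= (1 - k/d)||x||^2, so again delta >= k/d.
    """
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out


def qsgd(x, s, rng=None):
    """QSGD-style quantization with s levels via random dithering.

    Each coordinate |x_i| / ||x|| is rounded randomly (unbiasedly) to one of
    s uniformly spaced levels, keeping the sign and the norm.
    """
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    level = np.abs(x) / norm * s                  # value in [0, s]
    low = np.floor(level)
    prob_up = level - low                         # probability of rounding up
    rounded = low + (rng.random(x.size) < prob_up)  # random dithering
    return np.sign(x) * rounded * norm / s
```

With k set to 1% of the coordinates (as in the experiments), both sparsifiers transmit roughly 100× fewer values per message, which is consistent with the communication savings reported below.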
### Average Consensus with Communication Compression

* See Algorithm 1
* The proposed algorithm later serves as a crucial primitive in the optimization algorithm for the general optimization problem
* The convergence rate of the scheme (Formula 3) crucially depends on the connectivity matrix $W \in \mathbb{R}^{n \times n}$ of the network, defined by $(W)_{ij} = w_{ij}$, also called the interaction or gossip matrix
* CHOCO-GOSSIP supports a much larger class of compression operators
* Proposed scheme (see formula CHOCO-G)
  * $\hat{x}_i^{(t)} \in \mathbb{R}^d$ denotes an additional variable that is stored by all neighbors j of node i, $\{i, j\} \in E$, as well as on node i itself
  * This seems to require each machine to store $\deg(i) + 2$ vectors
  * It can be re-written so that every node stores only three vectors: $x_i$, $\hat{x}_i$ and $\sum_{j:\{i,j\} \in E}{w_{ij}\hat{x}_j}$
    * See Appendix E
  * Preserves the average of the iterates
  * The noise introduced by the compression operator vanishes as $t \to \infty$

### Decentralized Stochastic Optimization

* Leverages Algorithm 1 to build a communication-compressed decentralized SGD -> Algorithm 2, consisting of 4 parts (a single-process simulation sketch of these steps is given at the end of these notes):
  * Stochastic gradient step
  * Iterate update
  * Application of the compression operator
  * CHOCO-G local communication
* It is proven that this setting obtains an n× speedup compared to the serial implementation of SGD on only one worker
* CHOCO-SGD converges under arbitrarily high compression
* **It carefully compensates for quantization errors instead of ignoring them**
  * What happens in other approaches:
    * The local variable on each worker is updated using only the copies on the neighbors, which do not carry information about the true uncompressed values
    * Errors made previously are lost and cannot be corrected in later iterations
  * In CHOCO-SGD the shared copy $\hat{x}_i$ is in general different from the private copy $x_i$, allowing the true values to be carried over to the next iterations

### Experiments

* Compression operators
  * $rand_k$ (sparsification), $top_k$ (sparsification) and $qsgd_s$ (quantization) compression operators
  * k -> 1% of all coordinates
  * $s \in \{2^4, 2^8\}$
* CHOCO-GOSSIP vs baselines
  * (qsgd256) converges at the same rate as the exact-communication baseline (E-G) but requires much less data
  * (rand1%) sparsification is the most data-efficient method in the experiments
* CHOCO-SGD vs state-of-the-art decentralized stochastic optimization schemes
  * Impact of topology
    * The ring topology is the most affected by increasing the number of workers
  * CHOCO-SGD consistently outperforms DCD-SGD in all settings
  * CHOCO-SGD reaches almost the same performance as exact communication, while significantly reducing communication cost
    * Performs almost as well as the exact Alg. 3, but uses 100× less communication with (rand1%) sparsification
    * The (top1%) variant performs slightly better than (rand1%)
    * Approximately 13× less communication with (qsgd16) quantization
* Future work
  * Analysis of CHOCO-SGD on non-convex functions

### Implementation in PyTorch

* https://github.com/epfml/ChocoSGD
* Community implementations
  * https://github.com/Adirlou/OptML_Project
  * https://github.com/JYWa/MATCHA
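As a companion to the notes above, here is a single-process NumPy simulation sketch of CHOCO-GOSSIP (average consensus with compressed messages) and of CHOCO-SGD built on top of it. It follows my reading of Algorithms 1 and 2: compress the difference between the private iterate $x_i$ and the public replica $\hat{x}_i$, update the replicas with the received messages, then take a gossip step weighted by $W$. The helper names, the uniform ring weights, and all step sizes are illustrative assumptions, not taken from the reference implementation.

```python
import numpy as np


def ring_gossip_matrix(n):
    """Symmetric, doubly stochastic gossip matrix W for a ring of n >= 3 nodes
    (uniform 1/3 weights; an illustrative choice, not the paper's tuned weights)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W


def choco_gossip(x0, W, Q, gamma=0.1, T=500):
    """CHOCO-GOSSIP sketch: every row of x0 should approach the row average.

    x0 : (n, d) array of local vectors, one row per node
    W  : (n, n) gossip matrix,  Q : compression operator on a single vector
    """
    x = np.array(x0, dtype=float)
    x_hat = np.zeros_like(x)                     # public replicas \hat{x}_i
    for _ in range(T):
        # each node compresses the difference between x_i and its replica
        q = np.stack([Q(xi - xhi) for xi, xhi in zip(x, x_hat)])
        x_hat += q                               # neighbors apply received q_j
        x += gamma * (W @ x_hat - x_hat)         # x_i += gamma * sum_j w_ij (x_hat_j - x_hat_i)
    return x


def choco_sgd(grad_fns, x0, W, Q, eta=0.05, gamma=0.1, T=1000, rng=None):
    """CHOCO-SGD sketch: a local SGD step followed by one CHOCO-G round per iteration.

    grad_fns[i](x, rng) should return a stochastic gradient of f_i at x.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    x_hat = np.zeros_like(x)
    for _ in range(T):
        for i in range(len(grad_fns)):           # 1) stochastic gradient step
            x[i] -= eta * grad_fns[i](x[i], rng)
        q = np.stack([Q(xi - xhi)                # 2) compress x_i - \hat{x}_i
                      for xi, xhi in zip(x, x_hat)])
        x_hat += q                               # 3) exchange q and update replicas
        x += gamma * (W @ x_hat - x_hat)         # 4) CHOCO-G local gossip step
    return x
```

For example, with `W = ring_gossip_matrix(n)` and `Q = lambda v: top_k(v, max(1, v.size // 100))` (the $top_k$ operator sketched earlier), the rows of `choco_gossip(x0, W, Q)` should approach the row average of `x0` while only about 1% of the coordinates are exchanged per round; step sizes `gamma` and `eta` would need tuning as in the paper's experiments.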