# Decentralized FL
## Notes
* There have been many recent works that propose bypassing a central orchestrator for training (PyGrid in our case). One technique to achieve this is the gossip protocol, where workers communicate directly among themselves and converge to a common model (toy sketch below).
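A toy illustration of the gossip idea (my own sketch, not tied to any specific protocol): each worker repeatedly averages its model with a randomly chosen peer, and without any central orchestrator all workers drift toward a common model.

```python
import random
import numpy as np

# Toy gossip consensus: n workers each hold a model vector; in every round a
# random pair of workers averages their models. Pairwise averaging preserves
# the global mean, so all workers converge toward the same (average) model.
n, d = 10, 4
models = [np.random.randn(d) for _ in range(n)]

for _ in range(500):
    i, j = random.sample(range(n), 2)
    avg = (models[i] + models[j]) / 2
    models[i], models[j] = avg, avg.copy()

# Spread across workers should now be close to zero.
print(np.max(np.std(np.stack(models), axis=0)))
```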
# Papers
## Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
* Considers decentralized stochastic optimization where the objective function is distributed over n machines that can only communicate with their neighbors on a fixed communication graph
* No multiple local updates on each node; instead, nodes perform shared/collaborative model updates
* To overcome communication bottlenecks, nodes compress (e.g. quantize or sparsify) their model updates
* Covers both unbiased and biased compression operators
* Compression quality $\delta \le 1$ ($\delta = 1$ means no compression); see the definition sketched below
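* As I read it, this is the usual compression-quality condition: an operator $Q$ has quality $\delta \in (0, 1]$ if, in expectation over its own randomness,

$$\mathbb{E}_Q \| Q(x) - x \|^2 \le (1 - \delta) \| x \|^2 \quad \text{for all } x \in R^d,$$

so $\delta = 1$ recovers exact (uncompressed) communication and smaller $\delta$ means more aggressive compression.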
### Contributions
1. CHOCO-SGD: a novel gossip-based stochastic gradient descent algorithm
* Can reduce communication by at least two orders of magnitude
2. CHOCO-GOSSIP: a novel gossip algorithm for the distributed average consensus problem
* 1st linearly-converging gossip algorithm with communication compression
* 1st gossip algorithm that supports arbitrary compressed messages for $\delta > 0$ and still exhibits linear convergence
### Introduction
* **Decentralized Communication**
* network topology as a graph where edges represent the communication links along which messages (e.g. model updates) can be exchanged.
* Such topologies avoid the communication bottleneck of a central coordinator and thereby offer hugely improved potential in scalability
* The maximal degree of the network is often constant (e.g. ring or torus) or a slowly growing function in n (e.g. scale-free networks); a toy ring gossip matrix is sketched below
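A minimal sketch (my own illustration, not from the paper) of a symmetric, doubly stochastic gossip matrix $W$ for a ring of n nodes, where each node communicates only with its two neighbors and the maximal degree stays constant as n grows:

```python
import numpy as np

def ring_gossip_matrix(n: int) -> np.ndarray:
    """Symmetric, doubly stochastic mixing matrix for a ring of n nodes.

    Each node keeps weight 1/3 for itself and gives 1/3 to each of its two
    neighbors, so every row and every column sums to 1.
    """
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3
        W[i, (i + 1) % n] = 1 / 3
    return W

W = ring_gossip_matrix(8)
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)
```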
* **Decentralized Optimization** (or **Gossip Algorithms**)
* SGD is an inherently serial algorithm that does not take the distributed setting into account
* Mini-batch SGD is the natural parallelization of SGD
* Popular gossip algos:
* Subgradient descent
* ADMM: alternating direction method of multipliers
* Dual Avg
* Network topology only affects higher-order terms of the convergence rate of decentralized optimization algorithms on convex problems (see convergence rate formula)
* CHOCO-SGD is as efficient in terms of iterations as centralized mini-batch SGD but avoids the communication bottleneck that centralized algorithms suffer from
* CHOCO-SGD is free of i.i.d. assumptions on how the data is distributed across workers
* **Communication Compression**
* Proposes the first method that supports arbitrarily low compression accuracy and even biased compression operators
* A compressed vector $Q(g)$ can be represented more efficiently, for instance by using a limited-bit representation (quantization) or by enforcing sparsity (both are referred to as compression operators in this paper)
* quantization operators are based on random dithering
* Much sparser vectors can be obtained by random sparsification techniques
* Techniques that do not directly quantize gradients, but instead maintain additional state, are known to perform better in theory and practice (the basic operators are sketched below)
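Minimal NumPy sketches of the compression operators discussed here and used later in the experiments ($rand_k$ and $top_k$ sparsification, QSGD-style random dithering). These are my own illustrative versions, not the authors' implementations, and they omit the encoding that makes the messages actually cheaper to transmit:

```python
import numpy as np

def rand_k(x: np.ndarray, k: int) -> np.ndarray:
    """Sparsification: keep k uniformly random coordinates of the 1-D vector x.

    Without rescaling, this satisfies the quality bound with delta = k/d.
    """
    out = np.zeros_like(x)
    idx = np.random.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Biased sparsification: keep the k largest-magnitude coordinates of x."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def qsgd(x: np.ndarray, s: int) -> np.ndarray:
    """Random-dithering quantization to s levels (QSGD-style, unbiased)."""
    norm = np.linalg.norm(x)
    if norm == 0:
        return np.zeros_like(x)
    level = np.abs(x) / norm * s                      # real-valued level in [0, s]
    lower = np.floor(level)
    # round up with probability (level - lower), down otherwise
    rounded = lower + (np.random.rand(*x.shape) < (level - lower))
    return np.sign(x) * rounded * (norm / s)
```

In the experiments below, $k$ is set to 1% of the coordinates and $s \in \{2^4, 2^8\}$, i.e. the (qsgd16) and (qsgd256) variants.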
* **Distributed Average Consensus**
* Consists of finding the average $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ of the n local vectors $x_i$
* Important sub-routine of many decentralized algorithms
* For consensus algorithms with compressed communication it has been remarked that the standard gossip algorithm does not converge to the correct solution
* Some schemes only converge to a neighborhood of the solution
* The method presented here converges linearly, even for arbitrarily compressed communication, without requiring adaptive accuracy (standard uncompressed gossip averaging is sketched below for contrast)
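For contrast, a minimal simulation (my own) of standard uncompressed gossip averaging, $x_i \leftarrow x_i + \gamma \sum_j w_{ij}(x_j - x_i)$, which for $\gamma = 1$ and a row-stochastic $W$ reduces to $X \leftarrow W X$:

```python
import numpy as np

n, d = 8, 3
# Ring mixing matrix: each node averages itself and its two neighbors (weight 1/3 each).
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3

def gossip_average(X: np.ndarray, W: np.ndarray, steps: int = 100) -> np.ndarray:
    """Standard gossip averaging: X <- W X each round, i.e. every node replaces
    its vector by the W-weighted average of its neighbors'. Converges linearly
    to the global mean, at a rate governed by the spectral gap of W."""
    for _ in range(steps):
        X = W @ X
    return X

X = np.random.randn(n, d)                                 # one row per node
X_avg = gossip_average(X, W)
assert np.allclose(X_avg, X.mean(axis=0), atol=1e-2)      # every node is near the global mean
```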
### Average Consensus with Communication Compression
* See Algorithm 1
* The proposed algorithm will later serve as a crucial primitive in the optimization algorithm for the general optimization problem
* The convergence rate of scheme (Formula 3) crucially depends on the connectivity matrix $W \in R^{n \times n}$ of the network, defined as $(W)_{ij} = w_{ij}$, also called the interaction or gossip matrix
* CHOCO-GOSSIP supports a much larger class of compression operators
* Proposed scheme (see formula CHOCO-G; a simulation sketch follows below)
* $\hat{x}_i^{(t)} \in R^d$ denotes an additional variable that is stored by all neighbors $j$ of node $i$, $\{i, j\} \in E$, as well as by node $i$ itself
* This seems to require each machine to store deg(i) + 2 vectors
* Could be re-written in a way that every node stores only three vectors: $x$, $\hat{x}$ and $\sum_{j:\{i,j\} \in E}{w_{ij}\hat{x}_j}$
* See Appendix E
* Preserves the average of iterates
* the noise introduced by the compression operator vanishes as t → ∞
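A schematic simulation of the CHOCO-GOSSIP update as I read it: compress the difference between the private iterate $x_i$ and its public copy $\hat{x}_i$, exchange only the compressed difference $q_i$, update the public copies everywhere, then take a gossip step on the public copies. The stepsize and compression choices below are arbitrary placeholders:

```python
import numpy as np

def rand_k(x, k):
    """Keep k random coordinates (redefined here so the sketch is self-contained)."""
    out = np.zeros_like(x)
    idx = np.random.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

def choco_gossip(X, W, gamma=0.1, k=1, steps=3000):
    """Schematic CHOCO-GOSSIP: nodes never exchange full vectors, only the
    compressed differences q_i = Q(x_i - x_hat_i)."""
    X = X.copy()
    X_hat = np.zeros_like(X)                       # public copies, known to the neighbors
    for _ in range(steps):
        Q = np.stack([rand_k(x - xh, k) for x, xh in zip(X, X_hat)])
        X_hat = X_hat + Q                          # everyone applies the received q_j
        X = X + gamma * (W @ X_hat - X_hat)        # gossip step on the public copies
    return X

n, d = 8, 5
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3    # ring topology
X0 = np.random.randn(n, d)
X_out = choco_gossip(X0, W)
# The average of the iterates is preserved and, for a suitably small gamma,
# every node approaches it; the printed deviation should be small.
print(np.abs(X_out - X0.mean(axis=0)).max())
```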
### Decentralized Stochastic Optimization
* Leverage Algorithm 1 to build a Communication-Compressed Decentralized SGD -> Algorithm 2, consisting of 4 parts:
* Stochastic gradient step
* Iterate update
* Application of Compression Operator
* CHOCO-G local communication
* It is proven that this setting obtains an n× speed-up compared to the serial implementation of SGD on only one worker
* CHOCO-SGD converges under arbitrary high compression
* **It tries to carefully compensate quantization errors instead of ignoring them**
* What happens in other approaches:
* the local variable on each worker is updated using only copies on the neighbors which do not carry information about the true uncompressed values
* Errors made previously are lost and cannot be corrected in later iterations
* In CHOCO-SGD the shared copy $\hat{x}_i$ is in general different from the private copy $x_i$, allowing the true values to be carried over to the next iterations (see the round sketch below)
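A schematic version of one CHOCO-SGD round as I read Algorithm 2 (the four parts listed above: stochastic gradient step, iterate update, compression, CHOCO-G communication). `grad_fn`, the stepsizes, and the compression operator are placeholders of my own:

```python
import numpy as np

def choco_sgd_round(X, X_hat, W, grad_fn, compress, eta=0.05, gamma=0.1):
    """One schematic CHOCO-SGD round for all n nodes at once.

    X        : (n, d) private iterates x_i
    X_hat    : (n, d) public copies x_hat_i known to the neighbors
    grad_fn  : (node index, x_i) -> stochastic gradient at x_i
    compress : compression operator Q, e.g. rand_k / top_k / qsgd from above
    """
    n = X.shape[0]
    # 1) + 2) local stochastic gradient step / iterate update
    G = np.stack([grad_fn(i, X[i]) for i in range(n)])
    X = X - eta * G
    # 3) compress only the difference between private iterate and public copy
    Q = np.stack([compress(X[i] - X_hat[i]) for i in range(n)])
    # 4) CHOCO-G communication: neighbors apply q_j, then gossip on the public copies
    X_hat = X_hat + Q
    X = X + gamma * (W @ X_hat - X_hat)
    return X, X_hat
```

The error-compensation effect comes from always compressing $x_i - \hat{x}_i$ rather than the raw update, so information lost to compression in one round can still be transmitted in later rounds.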
### Experiments
* Compression operators
* $rand_k$ (sparsification), $top_k$ (sparsification) and $qsgd_s$ (quantization) compression operators
* k -> 1% of all coordinates
* $s \in \{2^4, 2^8\}$
* CHOCO-GOSSIP vs Baselines
* (qsgd256) converges at the same rate as (E-G), which uses exact communication, but requires much less data
* (rand1%) sparsification is the most data-efficient method in the experiments
* CHOCO-SGD vs SoA Decentralized Stochastic Optimization schemes
* Impact on Topology
* The ring topology is the most affected by increasing the number of workers
* CHOCO-SGD consistently outperforms DCD-SGD in all settings
* CHOCO-SGD reaches almost the same performance as exact communication, while significantly reducing communication cost.
* performs almost as well as the exact Alg. 3, but uses 100× less communication with (rand1%) sparsification
* (top1%) variant performs slightly better than (rand1%)
* approximately 13× less communication for (qsgd16) quantization
* Future work
* Analysis of CHOCO-SGD on non-convex functions
### Implementation in PyTorch
* https://github.com/epfml/ChocoSGD
* Community implementations:
* https://github.com/Adirlou/OptML_Project
* https://github.com/JYWa/MATCHA