# [Federated Learning on Riemannian Manifolds](https://arxiv.org/pdf/2206.05668.pdf)
### Li and Ma, June 2022.
- Algorithms for FL with nonconvex constraints.
- Federated PCA and federated kPCA.
- New algorithm: RFedSVRG.
## Introduction
- Classical **Federated learning (FL)** aims at solving the following problem:
$$
\min_{x\in \mathbb{R}^{d}} f(x) \coloneqq \dfrac{1}{n} \sum_{i=1}^{n} f_{i}(x),
$$
where each loss function $f_{i}\colon \mathbb{R}^{d}\to \mathbb{R}$ is held by a different **client/agent**; the agents may sit in different physical locations and run on different hardware. A **central server** collects information from the agents and outputs a **consensus** point that minimizes the sum of the loss functions $f_{i}(x)$ over all clients.
- **Aim of FL**: use the *computational resources* of the different agents while *maintaining data privacy* by not sharing data among the local agents.
- **FL problem over a Riemannian manifold**:
$$
\min_{\color{red}{x\in \mathcal{M}}} f(x) \coloneqq \dfrac{1}{n} \sum_{i=1}^{n} f_{i}(x),
$$
where $f_{i}\colon \mathcal{M}\to \mathbb{R}$. **Main difficulty**: aggregating points over a nonconvex set.
- *Motivating application*: **federated kPCA problem**:
$$
\min_{\color{red}{X\in \mathrm{St}(d,r)}} f(X) \coloneqq \dfrac{1}{n} \sum_{i=1}^{n} f_{i}(X), \quad \text{where} \ f_{i}(X) = - \dfrac{1}{2} \, \mathrm{tr}(X^{\top} A_{i} X),
$$
where $\mathrm{St}(d,r) = \lbrace X \in \mathbb{R}^{d \times r} \vert X^{\top} X = I_{r} \rbrace$ is the **Stiefel manifold**, and $A_{i}$ is the **covariance matrix** of the data stored in the $i$th local agent.
- When $r=1$, we get the **classical PCA**:
$$
\min_{\| x \|_{2} = 1} f(x) \coloneqq \dfrac{1}{n} \sum_{i=1}^{n} f_{i}(x), \quad \text{where} \ f_{i}(x) = - \dfrac{1}{2}\,x^{\top} A_{i} x.
$$
Observe that here the optimization variable is constrained to the **unit sphere** (a numerical sketch of this $r=1$ case is given after this list).
- Problem: existing FL algorithms are not applicable to the last two problems, due to the difficulty in aggregating points on nonconvex sets.
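A minimal numerical sketch (my illustration, not code from the paper) of the $r=1$ case: each client holds a local covariance matrix $A_i$, and the minimizer of the averaged objective over the unit sphere is the leading eigenvector of the averaged covariance matrix. All names and data below are made up.

```python
import numpy as np

# Illustrative setup: n clients, each holding a private data matrix X_i of
# shape (m_i, d) and the corresponding local covariance matrix A_i.
rng = np.random.default_rng(0)
d, n = 5, 3
local_data = [rng.standard_normal((20, d)) for _ in range(n)]
A = [X.T @ X / X.shape[0] for X in local_data]

def f(x):
    """Global objective f(x) = (1/n) sum_i f_i(x) with f_i(x) = -1/2 x^T A_i x."""
    return np.mean([-0.5 * x @ A_i @ x for A_i in A])

# For r = 1 the minimizer over the unit sphere is the leading eigenvector of
# the averaged covariance matrix (classical PCA).
A_bar = sum(A) / n
_, eigvecs = np.linalg.eigh(A_bar)
x_star = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

v = rng.standard_normal(d)
v /= np.linalg.norm(v)                       # an arbitrary point on the sphere
print("f at a random unit vector:", f(v))
print("f at leading eigenvector :", f(x_star))   # smaller (better) value
```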
## Contributions
- Riemannian federated SVRG algorithm (RFedSVRG), with convergence rate $\mathcal{O}(1/\epsilon^2)$ for obtaining an $\epsilon$-stationary point.
- Consensus step on the tangent space to the manifold, instead of the widely used “Karcher mean” approach (the Riemannian center of mass).
- Numerical results show that their RFedSVRG outperforms the Riemannian counterparts of two widely used FL algorithms: FedAvg and FedProx.
## Riemannian optimization
- An important notion that is needed here is the **parallel transport**, which is used to define the Lipschitz condition for the Riemannian gradients and to prove convergence of the method.
- Given a Riemannian manifold $(\mathcal{M},g)$ and two points $x,y\in\mathcal{M}$, the **parallel transport** $P_{x\to y}\colon T_{x}\mathcal{M} \to T_{y}\mathcal{M}$ is a **linear operator that preserves the inner product**: for all $\xi, \zeta \in T_{x}\mathcal{M}$, we have $\langle P_{x\to y}\xi, P_{x\to y}\zeta \rangle_{y}=\langle \xi, \zeta \rangle_{x}$.
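As a concrete instance that is easy to check numerically, here is a sketch of parallel transport on the unit sphere $S^{d-1}$ along the minimal geodesic from $x$ to $y$, using the standard closed form; this example and its function names are mine, not the paper's.

```python
import numpy as np

def transport_sphere(x, y, v):
    """Parallel transport of v in T_x S^{d-1} to T_y S^{d-1} along the minimal
    geodesic: P_{x->y}(v) = v - <y, v> / (1 + <x, y>) * (x + y).
    Assumes x, y are unit vectors with y != -x and <x, v> = 0."""
    return v - (y @ v) / (1.0 + x @ y) * (x + y)

# Sanity check: the result is tangent at y and inner products are preserved.
rng = np.random.default_rng(1)
d = 4
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)
proj = lambda p, u: u - (p @ u) * p          # orthogonal projection onto T_p S^{d-1}
xi, zeta = proj(x, rng.standard_normal(d)), proj(x, rng.standard_normal(d))
P_xi, P_zeta = transport_sphere(x, y, xi), transport_sphere(x, y, zeta)
print(np.isclose(y @ P_xi, 0.0))             # True: P_xi lies in T_y S^{d-1}
print(np.isclose(xi @ zeta, P_xi @ P_zeta))  # True: the inner product is preserved
```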
## RFedSVRG
How should the **consensus step** be performed, i.e., how should the central server aggregate the points received from the clients? There are two options:
1. **Riemannian center of mass** of the points (improperly called “Karcher mean”):
$$
x_{t+1} \leftarrow \argmin_{x\in\mathcal{M}} \dfrac{1}{k}\sum_{i\in S_{t}} d^{2}(x,x^{(i)}).
$$
Here, $S_{t} \subset [n]$ is a subset of client indices with cardinality $k=\lvert S_{t}\rvert$, $x^{(i)}$ is the local iterate returned by client $i$, $d(\cdot,\cdot)$ is the Riemannian distance, and $x_{t+1}$ is the next iterate on the central server.
2. **Tangent space consensus step** (the one used in this paper):
$$
x_{t+1} \leftarrow \mathrm{Exp}_{x_{t}}\left( \dfrac{1}{k} \sum_{i\in S_{t}} \mathrm{Exp}_{x_{t}}^{-1}(x^{(i)})\right),
$$
where we “lift” each of the $x^{(i)}$ to the tangent space $T_{x_{t}}\mathcal{M}$ and then take their average on the tangent space.
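A sketch of the tangent-space consensus step (option 2) for the unit sphere, where $\mathrm{Exp}$ and its inverse have closed forms, together with a simple fixed-point iteration for the Riemannian center of mass (option 1) for comparison; the function names are mine and the example only illustrates the aggregation, it does not reproduce the paper's implementation.

```python
import numpy as np

def exp_sphere(x, v):
    """Exponential map on the unit sphere: Exp_x(v) for v in T_x S^{d-1}."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def log_sphere(x, y):
    """Inverse exponential map Exp_x^{-1}(y) (the Riemannian logarithm)."""
    u = y - (x @ y) * x                              # project y onto T_x S^{d-1}
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return np.zeros_like(x)
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))     # geodesic distance d(x, y)
    return theta * (u / nu)

def tangent_space_consensus(x_t, client_points):
    """Option 2: lift each x^(i) to T_{x_t}M with Exp^{-1}, average on the
    tangent space, then map back with Exp_{x_t}."""
    v_bar = np.mean([log_sphere(x_t, x_i) for x_i in client_points], axis=0)
    return exp_sphere(x_t, v_bar)

def karcher_mean(points, x0, steps=50):
    """Option 1: Riemannian center of mass via the standard fixed-point
    iteration x <- Exp_x(mean_i Log_x(x^(i)))."""
    x = x0
    for _ in range(steps):
        x = exp_sphere(x, np.mean([log_sphere(x, p) for p in points], axis=0))
    return x

# Tiny usage example with three hypothetical client iterates near x_t.
rng = np.random.default_rng(2)
d = 4
x_t = rng.standard_normal(d); x_t /= np.linalg.norm(x_t)
proj = lambda p, u: u - (p @ u) * p                  # projection onto T_p S^{d-1}
clients = [exp_sphere(x_t, 0.1 * proj(x_t, rng.standard_normal(d))) for _ in range(3)]
x_next = tangent_space_consensus(x_t, clients)
print(np.linalg.norm(x_next))                        # ~1.0: x_next stays on the sphere
print(np.linalg.norm(x_next - karcher_mean(clients, x_t)))  # the two options are close here
```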
### Algorithm 1: RFedSVRG
$T$ is the number of communication rounds, and $\tau_{i}$ denotes the number of local gradient steps performed by client $i$ in the inner loop.
- **Local gradient step**:
$$
x_{\ell + 1}^{(i)} \leftarrow \mathrm{Exp}_{x_{\ell}^{(i)} } \left[ -\eta^{(i)} \left(\mathrm{grad}f_{i} (x_{\ell}^{(i)}) - P_{x_{t}\to x_{\ell}^{(i)}}(\mathrm{grad}f_{i}(x_{t}) - \mathrm{grad} f(x_{t})) \right) \right]
$$
Parallel transport is used to bring all the gradient vectors into the same tangent space, namely the tangent space at $x_{\ell}^{(i)}$, so that they can be added and subtracted (a sketch of this step for the sphere/PCA case follows this list).
- The algorithm uses the gradient information at the server iterate, $\mathrm{grad} f(x_{t})$, as a variance-reduction correction; this mitigates the *client-drift effect* and yields convergence to stationary points of the global objective.
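A sketch of this local update for the sphere/PCA case, reusing `exp_sphere` and `transport_sphere` from the sketches above and the list `A` of local covariance matrices from the federated-PCA sketch; the function names and the step size are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Assumes exp_sphere, transport_sphere (defined in earlier sketches) and A,
# a list of local covariance matrices, are available in scope.

def rgrad_fi(A_i, x):
    """Riemannian gradient of f_i(x) = -1/2 x^T A_i x on the unit sphere:
    the Euclidean gradient -A_i x projected onto the tangent space T_x."""
    g = -A_i @ x
    return g - (x @ g) * x

def rgrad_f(A, x):
    """Riemannian gradient of the global objective f = (1/n) sum_i f_i."""
    return np.mean([rgrad_fi(A_i, x) for A_i in A], axis=0)

def local_svrg_step(A, i, x_t, x_l, eta=0.1):
    """One local step of client i: the variance-reduced direction combines the
    local gradient at x_l with the correction grad f_i(x_t) - grad f(x_t),
    parallel-transported from T_{x_t} to T_{x_l} before the subtraction.
    (In the actual algorithm grad f(x_t) is aggregated once per round by the
    server; here it is recomputed directly for simplicity.)"""
    correction = transport_sphere(x_t, x_l, rgrad_fi(A[i], x_t) - rgrad_f(A, x_t))
    direction = rgrad_fi(A[i], x_l) - correction
    return exp_sphere(x_l, -eta * direction)
```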
## Convergence
They establish convergence of RFedSVRG under standard assumptions for optimization on manifolds, such as:
1. *Lipschitz smoothness* of the Riemannian gradients, stated via parallel transport (see the display at the end of this section);
2. The manifold is *complete*, and there exists a compact set $\mathcal{D}\subset\mathcal{M}$ such that all the iterates generated by the RFedSVRG algorithm are contained in $\mathcal{D}$. The *sectional curvature* is *bounded*.
Convergence rate results for $\tau_{i}=1$ (Theorem 7), $\tau_{i}>1$ (Theorem 8), and for a *geodesically convex objective function* (Theorem 9).
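The Lipschitz-smoothness assumption on a manifold is typically phrased through parallel transport; a standard formulation (paraphrased here, not quoted from the paper) requires, for each $f_{i}$ and some constant $L>0$,
$$
\| \mathrm{grad} f_{i}(y) - P_{x\to y}\,\mathrm{grad} f_{i}(x) \|_{y} \le L \, d(x,y) \quad \forall x,y \in \mathcal{M},
$$
which reduces to the usual Lipschitz-gradient condition when $\mathcal{M}=\mathbb{R}^{d}$ and $P_{x\to y}$ is the identity.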