# [Federated Learning on Riemannian Manifolds](https://arxiv.org/pdf/2206.05668.pdf)
### Li and Ma, June 2022.
- Algorithms for FL with nonconvex constraints.
- Federated PCA and federated kPCA.
- New algorithm: RFedSVRG.
## Introduction
- Classical **Federated learning (FL)** aims at solving the following problem:
$$
\min_{x\in \mathbb{R}^{d}} f(x) \coloneqq \dfrac{1}{n} \sum_{i=1}^{n} f_{i}(x),
$$
where each loss function $f_{i}\colon \mathbb{R}^{d}\to \mathbb{R}$ is held by a different **client/agent**; the agents may sit in different physical locations and run on different hardware. A **central server** collects information from the agents and outputs a **consensus** point that minimizes the sum of the loss functions $f_{i}(x)$ over all clients.
- **Aim of FL**: use the *computational resources* of the different agents while *maintaining data privacy* by not sharing data among the local agents.
- **FL problem over a Riemannian manifold**:
$$
\min_{\color{red}{x\in \mathcal{M}}} f(x) \coloneqq \dfrac{1}{n} \sum_{i=1}^{n} f_{i}(x),
$$
where $f_{i}\colon \mathcal{M}\to \mathbb{R}$. **Main difficulty**: aggregating points over a nonconvex set.
- *Motivating application*: **federated kPCA problem**:
$$
\min_{\color{red}{X\in \mathrm{St}(d,r)}} f(X) \coloneqq \dfrac{1}{n} \sum_{i=1}^{n} f_{i}(X), \quad \text{where} \ f_{i}(X) = - \dfrac{1}{2} \, \mathrm{tr}(X^{\top} A_{i} X),
$$
where $\mathrm{St}(d,r) = \lbrace X \in \mathbb{R}^{d \times r} \vert X^{\top} X = I_{r} \rbrace$ is the **Stiefel manifold**, and $A_{i}$ is the **covariance matrix** of the data stored in the $i$th local agent.
- When $r=1$, we get the **classical PCA**:
$$
\min_{\| x \|_{2} = 1} f(x) \coloneqq \dfrac{1}{n} \sum_{i=1}^{n} f_{i}(x), \quad \text{where} \ f_{i}(x) = - \dfrac{1}{2}\,x^{\top} A_{i} x.
$$
Observe that here the optimization variable is constrained to the **unit sphere** (a numerical sketch of this $r=1$ case is given after this list).
- Problem: existing FL algorithms are not applicable to the last two problems, due to the difficulty in aggregating points on nonconvex sets.
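A minimal numerical sketch (my illustration, not code from the paper) of the $r=1$ case: each client holds a local covariance matrix $A_i$, and the minimizer of the averaged objective over the unit sphere is the leading eigenvector of the averaged covariance matrix. All names and data below are made up.

```python
import numpy as np

# Illustrative setup: n clients, each holding a private data matrix X_i of
# shape (m_i, d) and the corresponding local covariance matrix A_i.
rng = np.random.default_rng(0)
d, n = 5, 3
local_data = [rng.standard_normal((20, d)) for _ in range(n)]
A = [X.T @ X / X.shape[0] for X in local_data]

def f(x):
    """Global objective f(x) = (1/n) sum_i f_i(x) with f_i(x) = -1/2 x^T A_i x."""
    return np.mean([-0.5 * x @ A_i @ x for A_i in A])

# For r = 1 the minimizer over the unit sphere is the leading eigenvector of
# the averaged covariance matrix (classical PCA).
A_bar = sum(A) / n
_, eigvecs = np.linalg.eigh(A_bar)
x_star = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

v = rng.standard_normal(d)
v /= np.linalg.norm(v)                       # an arbitrary point on the sphere
print("f at a random unit vector:", f(v))
print("f at leading eigenvector :", f(x_star))   # smaller (better) value
```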
## Contributions
- Riemannian federated SVRG algorithm (RFedSVRG), with convergence rate $\mathcal{O}(1/\epsilon^2)$ for obtaining an $\epsilon$-stationary point.
- Consensus step on the tangent space to the manifold, instead of the widely used “Karcher mean” approach (the Riemannian center of mass).
- Numerical results show that their RFedSVRG outperforms the Riemannian counterparts of two widely used FL algorithms: FedAvg and FedProx.
## Riemannian optimization
- An important notion that is needed here is the **parallel transport**, which is used to define the Lipschitz condition for the Riemannian gradients and to prove convergence of the method.
- Given a Riemannian manifold $(\mathcal{M},g)$ and two points $x,y\in\mathcal{M}$, the **parallel transport** $P_{x\to y}\colon T_{x}\mathcal{M} \to T_{y}\mathcal{M}$ is a **linear operator that preserves the inner product**: for all $\xi, \zeta \in T_{x}\mathcal{M}$, we have $\langle P_{x\to y}\xi, P_{x\to y}\zeta \rangle_{y}=\langle \xi, \zeta \rangle_{x}$.
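As a concrete instance that is easy to check numerically, here is a sketch of parallel transport on the unit sphere $S^{d-1}$ along the minimal geodesic from $x$ to $y$, using the standard closed form; this example and its function names are mine, not the paper's.

```python
import numpy as np

def transport_sphere(x, y, v):
    """Parallel transport of v in T_x S^{d-1} to T_y S^{d-1} along the minimal
    geodesic: P_{x->y}(v) = v - <y, v> / (1 + <x, y>) * (x + y).
    Assumes x, y are unit vectors with y != -x and <x, v> = 0."""
    return v - (y @ v) / (1.0 + x @ y) * (x + y)

# Sanity check: the result is tangent at y and inner products are preserved.
rng = np.random.default_rng(1)
d = 4
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)
proj = lambda p, u: u - (p @ u) * p          # orthogonal projection onto T_p S^{d-1}
xi, zeta = proj(x, rng.standard_normal(d)), proj(x, rng.standard_normal(d))
P_xi, P_zeta = transport_sphere(x, y, xi), transport_sphere(x, y, zeta)
print(np.isclose(y @ P_xi, 0.0))             # True: P_xi lies in T_y S^{d-1}
print(np.isclose(xi @ zeta, P_xi @ P_zeta))  # True: the inner product is preserved
```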
## RFedSVRG
How should the **consensus step** be performed, i.e., how should the central server aggregate the points received from the clients? There are two options:
1. **Riemannian center of mass** of the points (improperly called “Karcher mean”):
$$
x_{t+1} \leftarrow \argmin_{x\in\mathcal{M}} \dfrac{1}{k}\sum_{i\in S_{t}} d^{2}(x,x^{(i)}).
$$
Here, $S_{t} \subset [n]$ is a subset of client indices with cardinality $k=\lvert S_{t}\rvert$, $x^{(i)}$ is the local iterate returned by client $i$, $d(\cdot,\cdot)$ is the Riemannian distance, and $x_{t+1}$ is the next iterate on the central server.
2. **Tangent space consensus step** (the one used in this paper):
$$
x_{t+1} \leftarrow \mathrm{Exp}_{x_{t}}\left( \dfrac{1}{k} \sum_{i\in S_{t}} \mathrm{Exp}_{x_{t}}^{-1}(x^{(i)})\right),
$$
where we “lift” each of the $x^{(i)}$ to the tangent space $T_{x_{t}}\mathcal{M}$ and then take their average on the tangent space.
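A sketch of the tangent-space consensus step (option 2) for the unit sphere, where $\mathrm{Exp}$ and its inverse have closed forms, together with a simple fixed-point iteration for the Riemannian center of mass (option 1) for comparison; the function names are mine and the example only illustrates the aggregation, it does not reproduce the paper's implementation.

```python
import numpy as np

def exp_sphere(x, v):
    """Exponential map on the unit sphere: Exp_x(v) for v in T_x S^{d-1}."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def log_sphere(x, y):
    """Inverse exponential map Exp_x^{-1}(y) (the Riemannian logarithm)."""
    u = y - (x @ y) * x                              # project y onto T_x S^{d-1}
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return np.zeros_like(x)
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))     # geodesic distance d(x, y)
    return theta * (u / nu)

def tangent_space_consensus(x_t, client_points):
    """Option 2: lift each x^(i) to T_{x_t}M with Exp^{-1}, average on the
    tangent space, then map back with Exp_{x_t}."""
    v_bar = np.mean([log_sphere(x_t, x_i) for x_i in client_points], axis=0)
    return exp_sphere(x_t, v_bar)

def karcher_mean(points, x0, steps=50):
    """Option 1: Riemannian center of mass via the standard fixed-point
    iteration x <- Exp_x(mean_i Log_x(x^(i)))."""
    x = x0
    for _ in range(steps):
        x = exp_sphere(x, np.mean([log_sphere(x, p) for p in points], axis=0))
    return x

# Tiny usage example with three hypothetical client iterates near x_t.
rng = np.random.default_rng(2)
d = 4
x_t = rng.standard_normal(d); x_t /= np.linalg.norm(x_t)
proj = lambda p, u: u - (p @ u) * p                  # projection onto T_p S^{d-1}
clients = [exp_sphere(x_t, 0.1 * proj(x_t, rng.standard_normal(d))) for _ in range(3)]
x_next = tangent_space_consensus(x_t, clients)
print(np.linalg.norm(x_next))                        # ~1.0: x_next stays on the sphere
print(np.linalg.norm(x_next - karcher_mean(clients, x_t)))  # the two options are close here
```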
### Algorithm 1: RFedSVRG
$T$ is the number of communication rounds, and $\tau_{i}$ denotes the number of local gradient steps performed by client $i$ in the inner loop.
- **Local gradient step**:
$$
x_{\ell + 1}^{(i)} \leftarrow \mathrm{Exp}_{x_{\ell}^{(i)} } \left[ -\eta^{(i)} \left(\mathrm{grad}f_{i} (x_{\ell}^{(i)}) - P_{x_{t}\to x_{\ell}^{(i)}}(\mathrm{grad}f_{i}(x_{t}) - \mathrm{grad} f(x_{t})) \right) \right]
$$
Parallel transport is used to bring all the gradient vectors into the same tangent space, namely the tangent space at $x_{\ell}^{(i)}$, so that they can be added and subtracted (a sketch of this step for the sphere/PCA case follows this list).
- The algorithm uses the gradient information at the server iterate, $\mathrm{grad} f(x_{t})$, as a variance-reduction correction; this mitigates the *client-drift effect* and yields convergence to stationary points of the global objective.
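A sketch of this local update for the sphere/PCA case, reusing `exp_sphere` and `transport_sphere` from the sketches above and the list `A` of local covariance matrices from the federated-PCA sketch; the function names and the step size are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Assumes exp_sphere, transport_sphere (defined in earlier sketches) and A,
# a list of local covariance matrices, are available in scope.

def rgrad_fi(A_i, x):
    """Riemannian gradient of f_i(x) = -1/2 x^T A_i x on the unit sphere:
    the Euclidean gradient -A_i x projected onto the tangent space T_x."""
    g = -A_i @ x
    return g - (x @ g) * x

def rgrad_f(A, x):
    """Riemannian gradient of the global objective f = (1/n) sum_i f_i."""
    return np.mean([rgrad_fi(A_i, x) for A_i in A], axis=0)

def local_svrg_step(A, i, x_t, x_l, eta=0.1):
    """One local step of client i: the variance-reduced direction combines the
    local gradient at x_l with the correction grad f_i(x_t) - grad f(x_t),
    parallel-transported from T_{x_t} to T_{x_l} before the subtraction.
    (In the actual algorithm grad f(x_t) is aggregated once per round by the
    server; here it is recomputed directly for simplicity.)"""
    correction = transport_sphere(x_t, x_l, rgrad_fi(A[i], x_t) - rgrad_f(A, x_t))
    direction = rgrad_fi(A[i], x_l) - correction
    return exp_sphere(x_l, -eta * direction)
```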
## Convergence
They establish convergence of RFedSVRG under standard assumptions for optimization on manifolds, such as:
1. *Lipschitz smoothness* of the Riemannian gradients, stated via parallel transport (see the display at the end of this section);
2. The manifold is *complete*, and there exists a compact set $\mathcal{D}\subset\mathcal{M}$ such that all the iterates generated by the RFedSVRG algorithm are contained in $\mathcal{D}$. The *sectional curvature* is *bounded*.
Convergence rate results for $\tau_{i}=1$ (Theorem 7), $\tau_{i}>1$ (Theorem 8), and for a *geodesically convex objective function* (Theorem 9).
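The Lipschitz-smoothness assumption on a manifold is typically phrased through parallel transport; a standard formulation (paraphrased here, not quoted from the paper) requires, for each $f_{i}$ and some constant $L>0$,
$$
\| \mathrm{grad} f_{i}(y) - P_{x\to y}\,\mathrm{grad} f_{i}(x) \|_{y} \le L \, d(x,y) \quad \forall x,y \in \mathcal{M},
$$
which reduces to the usual Lipschitz-gradient condition when $\mathcal{M}=\mathbb{R}^{d}$ and $P_{x\to y}$ is the identity.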