# Multi-Agent Reinforcement Learning (1/3): Settling the Variance of Multi-Agent Policy Gradient

This blog is based on the NeurIPS 2021 paper *"Settling the Variance of Multi-Agent Policy Gradients"* by Kuba et al., available at https://arxiv.org/pdf/2108.08612.pdf.

**Reinforcement Learning (RL)** algorithms enable machines to learn to solve complicated tasks. The performance they achieve gives hope for automating many of the chores that humans must do daily, thus increasing our efficiency and safety. However, traditional single-agent RL has its limitation: it deals with problems involving only one intelligent unit. In many real-world scenarios, multiple learners interact (and these tasks are probably the most difficult, and the most desirable, to automate). Consider the top part of the illustration below: even if we let an autonomous vehicle with great RL-based software out into the city, it may fail to predict a human driver's behavior, which can end fatally. Furthermore, even if the car avoids crashes and arrives at its destination safely and quickly (thanks to its great RL algorithm), one can still ask: what about the safety and travel times of the other cars? Indeed, an RL car does not care about the fate of others, and can possibly harm them in multiple ways. **Multi-Agent Reinforcement Learning (MARL)** is a new generation of RL which aims to solve these problems. Its ultimate goal is to enable machines to learn in a way that stays considerate of one another, as well as of humans (see the bottom part of the figure below).

![](https://i.imgur.com/Kvyp0X3.jpg)
A single-agent self-driving car may not predict unusual human driving behavior.

![](https://i.imgur.com/TzBS3Zb.jpg)
Multi-agent self-driving cars cooperate for everyone's efficiency.

---

## Multi-Agent Policy Gradient (MAPG)

The interaction of a multi-agent system (MAS) with the environment is obviously different from the single-agent case. In MARL, at time step $t$, every agent $i$ takes an action $a^i_t \in \mathcal{A}^i$ from its own action space (these can differ across agents), sampled from its own policy $\pi^i_{\boldsymbol{\theta}}$. Together with the other agents' actions, this gives a joint action $\boldsymbol{a}_t = (a^1_t, \dots, a^n_t) \in \boldsymbol{\mathcal{A}}$. The product of all agents' policies is the joint policy $\boldsymbol{\pi_\theta}(\cdot|s_t) = \prod_{i=1}^{n}\pi^i_{\boldsymbol{\theta}}(\cdot^i|s_t)$. In practical settings, the policies are parameterized by neural networks. In RL, we train them with gradient-ascent methods derived from the Policy Gradient Theorem. In MARL, the theorem has its counterpart, the Multi-Agent Policy Gradient (MAPG) Theorem

$\quad \quad \quad \quad \nabla_{\theta^i}\mathcal{J}(\theta) = \mathbb{E}_{s_{0:\infty}\sim d^{0:\infty}_{\boldsymbol{\theta}}, \boldsymbol{a}_{0:\infty}\sim \boldsymbol{\pi_\theta}}\Big[ \sum\limits_{t=0}^{\infty}\gamma^t Q_{\boldsymbol{\theta}}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)\Big]$.
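To make the factorization concrete, here is a minimal PyTorch sketch (all class and variable names are hypothetical) of a joint policy built from independent per-agent categorical policies; its log-probability is just the sum of the agents' log-probabilities, which is what the score term $\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)$ in the theorem differentiates.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class AgentPolicy(nn.Module):
    """A hypothetical per-agent policy pi^i_theta(a^i | s) over discrete actions."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def dist(self, state: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(state))

# The joint policy is the product of the per-agent policies,
# so its log-probability is the sum of the per-agent log-probabilities.
n_agents, state_dim, n_actions = 3, 8, 5
policies = [AgentPolicy(state_dim, n_actions) for _ in range(n_agents)]

state = torch.randn(state_dim)
dists = [pi.dist(state) for pi in policies]
joint_action = [d.sample() for d in dists]                        # a_t = (a^1_t, ..., a^n_t)
joint_log_prob = sum(d.log_prob(a) for d, a in zip(dists, joint_action))
```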
Once we are equipped with state-action value critics, we can derive two MAPG estimators: one for centralized training (assuming a critic of all agents' joint action)

$\quad \quad \quad \quad \quad \quad \quad \quad \quad g^i_{\text{C}} = \sum\limits_{t=0}^{\infty}\gamma^t \hat{Q}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)$

and a decentralized one (assuming a critic of an agent's local action only)

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad g^i_{\text{D}} = \sum\limits_{t=0}^{\infty}\gamma^t \hat{Q}^i(s_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)$.

The training framework that has proved to deliver effective policies is the centralized one. That is because critics that give feedback about the joint action stimulate collaboration: while an action may not seem good for one agent on its own, combined with specific actions of the others, it becomes effective. Moreover, training such joint-action critics is easier. Imagine that you fit them by averaging the returns received after each state-action pair. The estimate for a state-(local)action pair will differ whenever any other agent behaves differently. Thus, to achieve a reliable estimate, you must sample many state-(local)action pairs. This problem disappears for state-(joint)action values.

![](https://i.imgur.com/z18G2mP.jpg)
The famous COMA with a joint critic plays StarCraft.

---

## The issues with MAPG

Although training with a joint critic is what makes MARL work at all, it still comes with massive difficulties, the major one being the problem of credit assignment. The intuition behind it is that if all agents receive joint feedback, an individual agent cannot tell what its contribution to the team's result was. Consider a natural multi-agent scenario of a football/soccer game, where two (multi-agent) teams aim to score a goal. We've seen plenty of situations where a single player dribbled past his opponents and scored. In this case, all his teammates, including those who decided to just stand still, are rewarded. But is standing still optimal during a game? Probably not.

![](https://i.imgur.com/BDfHyJn.jpg)
Lionel Messi dribbling past opponents by himself.

The presence of this problem may cause confusion: we follow the MAPG theorem, and there is still something wrong? Well, yes. We can fix it, but first we have to understand the problem mathematically. Let's begin by taking a step back. The MAPG theorem tells us how to obtain an unbiased estimator of the MAPG. But bias is not the only potential problem of an estimator; there is also variance. Intuitively, the large variance of our MAPG estimator comes from the joint-action input in $\hat{Q}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)$. While agent $i$ is only interested in figuring out the utility of its own action, the estimate will be different whenever the other agents alter their actions. This phenomenon induces extra variance. The questions that arise are:
1. How large is this variance?
2. How can we modify MAPG estimation to reduce it?

## Settling the Variance

To answer the first question, we must first understand something that single-agent RL never has to think about: how different agents contribute to the joint return.
For this purpose, we introduce the multi-agent state-action value function, defined by

$\quad \quad \quad \quad \quad \quad \quad Q^{i_{1:m}}_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{i_{1:m}} ) = \mathbb{E}_{\boldsymbol{a}^{-i_{1:m}} \sim \boldsymbol{\pi_\theta}^{-i_{1:m}} }\big[Q_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{i_{1:m}}, \boldsymbol{a}^{-i_{1:m}} )\big]$.

This function gives the expected return once a subset of agents has taken their actions. Crucially, on top of it, we define the multi-agent advantage function

$\quad \quad \quad \quad \quad \quad A^{i_{1:m}}_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}} ) = Q^{j_{1:k}, i_{1:m}}_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}} ) - Q^{j_{1:k}}_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{j_{1:k}} )$.

This function allows a subset $i_1, \dots, i_m$ of agents to evaluate their actions in the scenario where agents $j_1, \dots, j_k$ have already taken actions $a^{j_1}, \dots, a^{j_k}$. It turns out that this function satisfies the identity known as the *multi-agent advantage decomposition lemma*

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad A_{\boldsymbol{\theta}}^{i_{1:m}}(s, \boldsymbol{a}^{i_{1:m}}) = \sum\limits_{j=1}^{m}A_{\boldsymbol{\theta}}^{i_j}(s, \boldsymbol{a}^{i_{1:j-1}}, a^{i_j})$.

The lemma reveals that, with reference to a state, the agents contribute additively to the return. Moreover, this additivity is somewhat special --- sequential: the multi-agent advantages appearing in the sum unfold the joint action agent by agent. Recall that, in statistics, when such an unfolding of random variables happens, we can decompose their variance with the law of total variance. In maths, for random variables $X$ and $Y$, it reads

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \mathbf{Var}[X] = \mathbf{Var}\big[ \mathbb{E}[X|Y] \big] + \mathbb{E}\big[ \mathbf{Var}[X|Y] \big]$.

If we repeatedly substitute $X=\boldsymbol{a}^{1:i}$ and $Y=\boldsymbol{a}^{1:i-1}$, the combination of the multi-agent advantage decomposition lemma and the law of total variance lets us decompose the joint advantage variance as

$\quad \quad \quad \quad \quad \mathbf{Var}_{\boldsymbol{a}\sim\boldsymbol{\pi_\theta}}[ A_{\boldsymbol{\theta}}] = \sum\limits_{i=1}^{n}\mathbb{E}_{\boldsymbol{a}^{1:i-1}\sim\boldsymbol{\pi_\theta}^{1:i-1}}\big[ \mathbf{Var}_{a^i\sim\pi^i_\theta}[A^i_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{1:i-1}, a^i)]\big]$.

Recall that the return contributes to the variance of the MAPG estimator through the joint action in the state-action value function. How exactly this happens involves a bit of mathematical magic, but the state-action value luckily simplifies to the advantage. This lets one leverage the above decomposition to prove that the agents contribute additively to the variance, as follows:

$\quad \quad \quad \quad \quad \mathbf{Var}_{s_{0:\infty}\sim d^{0:\infty}_{\boldsymbol{\theta}}, \boldsymbol{a}_{0:\infty}\sim\boldsymbol{\pi_\theta}}[ g^i_{\text{C}} ] - \mathbf{Var}_{s_{0:\infty}\sim d^{0:\infty}_{\boldsymbol{\theta}}, \boldsymbol{a}_{0:\infty}\sim\boldsymbol{\pi_\theta}}[ g^i_{\text{D}} ] = \mathcal{O}\Big(\sum\limits_{j\neq i}\epsilon_j^2\Big)$.

In this theorem, the variance of the decentralized MAPG estimator can be interpreted as the variance in a single-agent PG algorithm, because the same estimator arises when the agents of the MAS try to learn independently. Notably, the excess variance is of the order of the sum of the squared local advantages of the other agents, where each $\epsilon_j$ measures the magnitude of agent $j$'s local advantage.
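To make the decomposition tangible, here is a small NumPy sanity check of the advantage decomposition lemma and the resulting variance identity on a hypothetical single-state, two-agent game (the payoff matrix and policies below are made up purely for illustration).

```python
import numpy as np

# Hypothetical single-state, two-agent game: Q(s, a1, a2) is just a payoff matrix.
Q = np.array([[1.0, 4.0, 0.5],
              [2.0, 0.0, 3.0]])          # shape: (|A^1|, |A^2|)
pi1 = np.array([0.6, 0.4])               # agent 1's policy
pi2 = np.array([0.2, 0.5, 0.3])          # agent 2's policy

V = pi1 @ Q @ pi2                        # state value
Q1 = Q @ pi2                             # Q^{1}(s, a1) = E_{a2}[Q(s, a1, a2)]
A1 = Q1 - V                              # A^{1}(s, a1)
A2 = Q - Q1[:, None]                     # A^{2}(s, a1, a2) = Q(s, a1, a2) - Q^{1}(s, a1)
A12 = Q - V                              # joint advantage A^{1,2}(s, a1, a2)

# Multi-agent advantage decomposition lemma: A^{1,2} = A^{1} + A^{2}.
assert np.allclose(A12, A1[:, None] + A2)

# Variance decomposition: Var[A^{1,2}] = Var_{a1}[A^1] + E_{a1}[Var_{a2}[A^2]].
joint = np.outer(pi1, pi2)
var_joint = np.sum(joint * A12**2)                     # advantages have zero mean
var_a1 = pi1 @ (A1**2)                                 # variance over agent 1's action
exp_var_a2 = pi1 @ np.sum(pi2 * A2**2, axis=1)         # E_{a1}[Var_{a2}[A^2]]
assert np.isclose(var_joint, var_a1 + exp_var_a2)
```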
Hence, as the number of agents increases, the variance grows, impeding training. On the other hand, the presence of the other agents' local advantages in the bound hints that an effective variance-reduction technique should extract their contributions to the joint return out of the MAPG estimator.

---

## Optimal Baseline (OB)

A variance-reduction technique well known to RL practitioners is the baseline trick. An agent $i$ can adapt it to MARL by subtracting from the critic a function that is independent of its own action (but may depend on the other agents' actions), which gives a new MAPG estimator

$\quad \quad \quad \quad \quad \quad g^i_{\text{C}} = \sum\limits_{t=0}^{\infty}\gamma^t \big[\hat{Q}(s_t, \boldsymbol{a}^{-i}_t, a^i_t) - b(s_t, \boldsymbol{a}^{-i}_t)\big]\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)$.

Such a modification does not introduce bias (the expected value of the estimator still matches the MAPG) because of the following identity

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \mathbb{E}_{a^i_t\sim\pi^i_{\boldsymbol{\theta}}}\big[ b(s_t, \boldsymbol{a}^{-i}_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)\big] = \boldsymbol{0}$.

However, an appropriately chosen baseline may reduce variance. Naturally, one can ask whether there exists an optimal baseline, which achieves the minimal possible variance, and if so, what it is. To answer this question, we must first understand how the baseline influences the MAPG estimator's variance. We can do it at every step $t$ by first defining

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad g^i_{\text{C}, t} = \hat{Q}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)$.

Again, by repeatedly applying the law of total variance, we can decompose the variance as follows

![](https://i.imgur.com/cq4WuEq.png)

Notably, since all MAPG estimators with a baseline have the same expectation, only the variance term coming from agent $i$'s own action depends on the baseline. Hence, we wish to minimize

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}[g^i_{\text{C},t}]$.

Now we can answer the question about the existence of an optimal baseline affirmatively; the optimal baseline is given by

$\quad \quad \quad \quad \quad \quad \quad b^{\text{optimal}}(s, \boldsymbol{a}^{-i}) = \frac{\mathbb{E}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[ \hat{Q}(s, \boldsymbol{a}^{-i}, a^i)||\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i|s)||^2\big]}{\mathbb{E}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[ ||\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i|s)||^2\big]}$.

Great! So now we just have to use this formula at every learning step, compute the optimal baseline, and the variance will be minimized 😁… But then, these expectations over gradients of neural networks don't have closed forms 🤔… So we would have to iterate over all available actions (discrete case), or even sample (continuous case), and run backpropagation for every sample? This would slow down training dramatically at best, and in the continuous case, the estimation would actually induce much more variance due to the size of the gradient.

## Surrogate variance for a neural network policy

The major issue with the optimal baseline is the presence of the massive neural-network gradient. We would, therefore, like to "remove" it somehow.
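To see concretely why this is costly, here is a rough sketch (hypothetical names, discrete actions) of what evaluating $b^{\text{optimal}}$ naively would involve: one backward pass through the policy network per action, just to obtain the squared gradient norms.

```python
import torch

def naive_optimal_baseline(policy, state, q_values):
    """Naive b^optimal for a discrete policy: needs one backward pass per action.

    policy:   a hypothetical nn.Module mapping a state to action logits
    state:    tensor of shape [state_dim]
    q_values: tensor of shape [n_actions] with Q(s, a^{-i}, .) for agent i,
              the other agents' actions held fixed
    """
    log_probs = torch.log_softmax(policy(state), dim=-1)
    probs = log_probs.exp().detach()

    grad_sq_norms = []
    for a in range(log_probs.shape[0]):
        # retain_graph so we can differentiate the same graph once per action
        grads = torch.autograd.grad(log_probs[a], list(policy.parameters()),
                                    retain_graph=True)
        grad_sq_norms.append(sum(g.pow(2).sum() for g in grads))
    grad_sq_norms = torch.stack(grad_sq_norms).detach()

    # b^optimal = E[Q * ||grad log pi||^2] / E[||grad log pi||^2]
    return (probs * q_values * grad_sq_norms).sum() / (probs * grad_sq_norms).sum()
```

This per-action backpropagation is exactly the overhead that the surrogate construction below avoids.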
Luckily, an NN-based policy can be decomposed as

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \pi^i_{\boldsymbol{\theta}}(a^i|s) = \pi^i(a^i|\psi^i_{\boldsymbol{\theta}}(s))$,

where $\psi^i_{\boldsymbol{\theta}}(s)$ is the output of the network. This can be the layer of logits (from which the policy is computed with *softmax*) for discrete policies, or the mean and variance of a Gaussian distribution for continuous policies. This enables us to decompose the variance of the MAPG estimator with the chain rule as follows

$\mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[ \nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i|s)\big( \hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b(s, \boldsymbol{a}^{-i})\big)\big] \\ = \mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[ \nabla_{\theta^i}\psi^i_{\boldsymbol{\theta}}(s)\nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))\big( \hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b(s, \boldsymbol{a}^{-i})\big)\big]\\ =\nabla_{\theta^i}\psi^i_{\boldsymbol{\theta}}(s)\mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[ \nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))\big( \hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b(s, \boldsymbol{a}^{-i})\big)\big]\nabla_{\theta^i}\psi^i_{\boldsymbol{\theta}}(s)^T$

Hence, the variance we wish to minimize is strongly related to the surrogate variance

$\quad \quad \quad \quad \quad \mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[ \nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))\big( \hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b(s, \boldsymbol{a}^{-i})\big)\big]$,

which does not depend on the parameters of our neural network, but only on its output. We can prove that the optimal baseline (OB) for this surrogate variance is

$\quad \quad \quad \quad \quad \quad \quad b^*(s, \boldsymbol{a}^{-i}) = \frac{\mathbb{E}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[ \hat{Q}(s, \boldsymbol{a}^{-i}, a^i)||\nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))||^2\big]}{\mathbb{E}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[ ||\nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))||^2 \big]}$.

And this is a piece of good news! The gradients with respect to $\psi$ usually have closed forms (for example, we know the gradient of a Gaussian log-probability with respect to its mean). In the case of softmax policies, we can show that for the probability distribution

$\quad \quad \quad \quad \quad \quad \quad \quad \quad x^i_{\psi^i_{\boldsymbol{\theta}}}(a^i|s) = \frac{ \pi^i_{\boldsymbol{\theta}}(a^i|s)\big(1 + ||\pi^i_{\boldsymbol{\theta}}(s)||^2 - 2\pi^i_{\boldsymbol{\theta}}(a^i|s)\big) }{ 1- ||\pi^i_{\boldsymbol{\theta}}(s)||^2 }$,

OB is just the expected value of $\hat{Q}$ under it. So all we need to do is compute the expected value of $\hat{Q}(s, \boldsymbol{a}^{-i}, a^i)$ with the probabilities given by $x^i_{\psi^i_{\boldsymbol{\theta}}}(a^i|s)$, and this is very easy! 🙌 Once we have done it, we use a new state-action value signal

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \hat{X}^i(s, \boldsymbol{a}^{-i}, a^i) \triangleq \hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b^*(s, \boldsymbol{a}^{-i})$.
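Since, for softmax policies, OB reduces to an expectation under $x^i$, here is a minimal PyTorch sketch of computing $b^*$ and the $\hat{X}^i$ values for a single agent and state (my own illustrative code with hypothetical names, not the authors' reference implementation):

```python
import torch

def softmax_ob(pi: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Optimal baseline b*(s, a^{-i}) for a softmax policy.

    pi: [n_actions] probabilities pi^i(. | s) of agent i
    q:  [n_actions] critic values Q(s, a^{-i}, .) with the other agents'
        actions held fixed
    """
    sq_norm = pi.pow(2).sum()
    # Distribution x^i(.|s); assumes a non-deterministic policy, so 1 - ||pi||^2 > 0.
    x = pi * (1.0 + sq_norm - 2.0 * pi) / (1.0 - sq_norm)
    return (x * q).sum()                                    # b* = E_{a^i ~ x^i}[Q]

# Usage: the new learning signal X^i(s, a^{-i}, .) = Q(s, a^{-i}, .) - b*.
pi = torch.tensor([0.2, 0.5, 0.3])
q = torch.tensor([1.0, -0.5, 2.0])
x_values = q - softmax_ob(pi, q)
```

The weights $x^i$ are non-negative and sum to one (their numerators add up to $1 - ||\pi^i_{\boldsymbol{\theta}}(s)||^2$), so $b^*$ is a proper expectation over agent $i$'s actions.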
We then use this signal to construct a MAPG estimator with OB

$\quad \quad \quad \quad \quad \quad \quad \quad \quad g^i_{\text{X}} = \sum\limits_{t=0}^{\infty}\gamma^t \hat{X}^{i}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)$.

As the derivation of OB is a bit of mathematical black magic, it may be hard to believe that it decreases the variance. But don't worry, we can verify it on a simple toy game. Namely, let's suppose that agent $i$ wants to estimate its MAPG while all other agents have already taken their actions. The agent can choose one of three actions, and it will do so according to its parameterised policy. The table illustrates the quantities the agent is given or can compute.

![](https://i.imgur.com/t8qbD1p.png)

The agent can use one of three types of estimators: the vanilla MAPG estimator (using $\hat{Q}$), the COMA estimator (using $\hat{A}^i$), and the OB estimator (using $\hat{X}^i$). As it turns out, the variance in each of these cases is

![](https://i.imgur.com/l2RdqjA.png)

which means that OB reduces the variance of vanilla MAPG by as much as 50%, and that of COMA by over 30%. The latter may come as a surprise, because COMA was derived to solve the credit assignment problem (i.e., reduce variance)… Well, OB is optimal after all 😁.

Indeed, OB is very easy to use in practice and can be applied to any stochastic policy-gradient algorithm. An example of such an algorithm is the recently popular MAPPO, which at every step optimizes the objective

$\quad \quad \quad \sum\limits_{i=1}^{n}\mathbb{E}_{s\sim d_{\theta_{\text{old}}}, \boldsymbol{a}\sim\boldsymbol{\pi}_{\theta_{\text{old}}}}\Big[ \text{min}\Big(\frac{\pi_\theta (a^i|s) }{\pi_{\theta_{\text{old}}} (a^i|s)} \hat{A}(s, \boldsymbol{a}), \text{clip}\big( \frac{\pi_\theta (a^i|s) }{\pi_{\theta_{\text{old}}} (a^i|s)}, 1\pm \epsilon\big)\hat{A}(s, \boldsymbol{a})\Big)\Big]$.

Recall that the advantage function is obtained by subtracting the state-value baseline from the joint Q-critic

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \hat{A}(s, \boldsymbol{a}) = \hat{Q}(s, \boldsymbol{a}) - \hat{V}(s)$.

We can replace this baseline with the optimal baseline OB, and obtain the following X value

$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \hat{X}^i(s, \boldsymbol{a}) = \hat{Q}(s, \boldsymbol{a}) - b^*(s, \boldsymbol{a}^{-i})$,

which then enables us to modify the MAPPO objective into *MAPPO with OB*

$\quad \quad \quad \sum\limits_{i=1}^{n}\mathbb{E}_{s\sim d_{\theta_{\text{old}}}, \boldsymbol{a}\sim\boldsymbol{\pi}_{\theta_{\text{old}}}}\Big[ \text{min}\Big(\frac{\pi_\theta (a^i|s) }{\pi_{\theta_{\text{old}}} (a^i|s)} \hat{X}^i(s, \boldsymbol{a}), \text{clip}\big( \frac{\pi_\theta (a^i|s) }{\pi_{\theta_{\text{old}}} (a^i|s)}, 1\pm \epsilon\big)\hat{X}^i(s, \boldsymbol{a})\Big)\Big]$.

OB was tested on the challenging StarCraft II (discrete actions) and Multi-Agent MuJoCo (continuous actions) benchmarks. The empirical results confirm the theory: MAPPO with OB, using more accurate (thanks to lower variance) MAPG estimators, learns faster and converges to better policies.

![](https://i.imgur.com/9rIkr76.png)

These are just a few examples of when OB significantly improves the performance of a MAPG method. In their paper, Kuba et al. actually found a case where OB is necessary to make MAPPO work!

---

## Implementation of OB

Now that you are excited about OB and want to quickly incorporate it into your code, we will help you implement it in PyTorch.
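As a rough reference for where OB plugs into MAPPO, here is a hedged sketch of the per-agent clipped loss with $\hat{X}^i$ in place of the advantage, assuming the X values have already been computed (all function and variable names are hypothetical):

```python
import torch

def mappo_ob_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  x_values: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss for one agent, with X^i in place of the advantage.

    log_probs_new: [batch] log pi_theta(a^i | s) under the current policy
    log_probs_old: [batch] log pi_theta_old(a^i | s) under the behaviour policy
    x_values:      [batch] X^i(s, a) = Q(s, a) - b*(s, a^{-i}), treated as constants
    """
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    x = x_values.detach()
    unclipped = ratio * x
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * x
    # Maximizing the objective is the same as minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```

Summing this loss over agents recovers the objective above; the rest of a standard MAPPO training loop stays unchanged.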
To compute OB itself, first implement a function that normalizes a positive-valued vector into a probability distribution, and name it `normalize`. Then, you can implement **OB for discrete actions** with the following method:

![](https://i.imgur.com/ImKXfaI.png)

and for continuous actions with Gaussian policies, you can estimate it as follows:

![](https://i.imgur.com/XSo0TQt.png)

Now, you can go and make your bots beat your friends in StarCraft, and tell us how it went! But if you want to make sure you use the finest possible algorithm, you can stay with us and go to the next article in the series 😁.

---

Thank you for reading this blog. I hope you found it useful and interesting. I also want to thank my co-authors of the paper "Settling the Variance of Multi-Agent Policy Gradients": Muning Wen, Yaodong Yang, Linghui Meng, Shangding Gu, Haifeng Zhang, David Mguni, and Jun Wang. Without them, OB would never have seen the light of day!