# Multi-Agent Reinforcement Learning (1/3): Settling the Variance of Multi-Agent Policy Gradients

This blog is based on the NeurIPS 2021 paper *"Settling the Variance of Multi-Agent Policy Gradients"* by Kuba et al., available at https://arxiv.org/pdf/2108.08612.pdf.

**Reinforcement Learning (RL)** algorithms enable machines to learn to solve complicated tasks. The performance they achieve gives hope for the automation of many manual chores that humans must do daily, thus increasing our efficiency and safety. However, traditional single-agent RL has a limitation: it deals with problems with only one intelligent unit. In many real-world scenarios, multiple learners interact (and these tasks are probably the most difficult, and the most desirable, to automate). Consider the top part of the illustration below: even if we let an autonomous vehicle with great RL-based software out into the city, it may fail to predict a human driver's behavior, which can end fatally. Furthermore, even if the car avoids crashes and arrives at its destination safely and quickly (thanks to its great RL algorithm), one can still ask: what about the safety and travel time of the other cars? Indeed, an RL car does not care about the fate of others, and can possibly harm them in multiple ways. **Multi-Agent Reinforcement Learning (MARL)** is a new generation of RL which aims to solve these problems. Its ultimate goal is to enable machines to learn in a way that stays considerate of one another, as well as of humans (see the bottom part of the figure below).

![](https://i.imgur.com/Kvyp0X3.jpg)
A single-agent self-driving car may not predict unusual human driving behavior.

![](https://i.imgur.com/TzBS3Zb.jpg)
Multi-agent self-driving cars (right) cooperate for everyone's efficiency.

---

## Multi-Agent Policy Gradient (MAPG)

The interaction of a multi-agent system (MAS) with the environment is obviously different from the single-agent case. In MARL, at time step $t$, every agent $i$ draws an action $a^i_t \in \mathcal{A}^i$ from its own policy $\pi^i_{\boldsymbol{\theta}}$; the action spaces may differ across agents. Together with the other agents' actions, this gives a joint action $\boldsymbol{a}_t = (a^1_t, \dots, a^n_t) \in \boldsymbol{\mathcal{A}}$. The product of all agents' policies is the joint policy $\boldsymbol{\pi_\theta}(\cdot|s_t) = \prod_{i=1}^{n}\pi^i_{\boldsymbol{\theta}}(\cdot^i|s_t)$. In practical settings, the policies are parameterized by neural networks. In RL, we train them with gradient-ascent methods derived from the Policy Gradient Theorem. In MARL, the theorem has its counterpart, the Multi-Agent Policy Gradient (MAPG) Theorem:

$$\nabla_{\theta^i}\mathcal{J}(\boldsymbol{\theta}) = \mathbb{E}_{s_{0:\infty}\sim d^{0:\infty}_{\boldsymbol{\theta}},\, \boldsymbol{a}_{0:\infty}\sim \boldsymbol{\pi_\theta}}\Big[ \sum\limits_{t=0}^{\infty}\gamma^t Q_{\boldsymbol{\theta}}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)\Big].$$

Once we are equipped with state-action value critics, we can derive two MAPG estimators: one for centralized training (assuming a critic of all agents' joint action),

$$g^i_{\text{C}} = \sum\limits_{t=0}^{\infty}\gamma^t \hat{Q}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t),$$

and a decentralized one (assuming a critic of an agent's local action only),

$$g^i_{\text{D}} = \sum\limits_{t=0}^{\infty}\gamma^t \hat{Q}^i(s_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t).$$
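To make the two estimators concrete, here is a minimal PyTorch sketch of their per-step terms for a single agent, written as scalar surrogates whose gradients give the estimator terms. This is our own illustration, with hypothetical toy policy and critic networks, not the paper's code.

```python
# Minimal sketch (our own illustration, not the authors' code): the per-step
# terms of the centralized and decentralized MAPG estimators for one agent i.
# `joint_critic` and `local_critic` are hypothetical stand-ins for learned critics.
import torch
from torch.distributions import Categorical

state_dim, n_actions, n_agents, i = 8, 4, 3, 0
policy_i = torch.nn.Linear(state_dim, n_actions)         # pi^i_theta(.|s)
joint_critic = torch.nn.Linear(state_dim + n_agents, 1)  # Q_hat(s, a^{-i}, a^i)
local_critic = torch.nn.Linear(state_dim + 1, 1)          # Q_hat^i(s, a^i)

state = torch.randn(state_dim)
dist = Categorical(logits=policy_i(state))
a_i = dist.sample()                                                  # agent i's action
other_actions = torch.randint(n_actions, (n_agents - 1,)).float()    # a^{-i}, sampled by the others
joint_action = torch.cat([a_i.float().unsqueeze(0), other_actions])

gamma, t = 0.99, 5
log_prob_i = dist.log_prob(a_i)

# Scalar surrogates: backpropagating each one yields the corresponding
# per-step estimator term gamma^t * Q_hat * grad_{theta^i} log pi^i(a^i|s).
q_joint = joint_critic(torch.cat([state, joint_action])).squeeze().detach()
q_local = local_critic(torch.cat([state, a_i.float().unsqueeze(0)])).squeeze().detach()
g_C_term = (gamma ** t) * q_joint * log_prob_i   # centralized: critic sees the joint action
g_D_term = (gamma ** t) * q_local * log_prob_i   # decentralized: critic sees only a^i
g_C_term.backward()   # policy_i's .grad fields now hold the g^i_C per-step term
```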
The training framework that has proved to deliver effective policies is the centralized one. That is because critics that give feedback about the joint action stimulate collaboration: while an action may not seem good for one agent on its own, combined with specific actions of the others it becomes effective. Moreover, training such joint-action critics is easier. Imagine that you fit them by averaging the returns received after each state-action pair. The estimate for a state-(local)action pair will differ whenever any other agent behaves differently, so to achieve a reliable estimate you must sample many state-(local)action pairs. This problem is gone with state-(joint)action values.

![](https://i.imgur.com/z18G2mP.jpg)
The famous COMA, with a joint critic, plays StarCraft.

---

## The issues with MAPG

Although training with a joint critic is what makes MARL work at all, it still comes with massive difficulties, the major one being the problem of credit assignment. The intuition behind it is that if all agents receive joint feedback, an individual agent does not know its own contribution to the team's result. Consider a natural multi-agent scenario of a football/soccer game, where two (multi-agent) teams aim to score goals. We have seen plenty of situations where a single player dribbled past his opponents and scored. In this case, all of his teammates, including those who decided to just stand still, are rewarded. But is standing still optimal during a game? Probably not.

![](https://i.imgur.com/BDfHyJn.jpg)
Lionel Messi dribbling past opponents by himself.

The above problem may cause confusion: we follow the MAPG theorem, so how can something be wrong? Well, it can. We can fix it, but first we have to understand the problem mathematically. Let's begin by taking a step back. The MAPG theorem tells us how to obtain an unbiased estimator of the MAPG. But bias is not the only potential problem of an estimator; there is also variance. Intuitively, the large variance of our MAPG estimator comes from the joint-action input in $\hat{Q}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)$: while agent $i$ is only interested in figuring out the utility of its own action, the estimate changes whenever the other agents alter their actions. This phenomenon induces extra variance. The questions that arise are:

1. How large is this variance?
2. How can we modify MAPG estimation to reduce it?
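Before diving into the maths, here is a small Monte Carlo sketch of this intuition (our own toy construction, not an experiment from the paper): return fluctuations caused by other agents' actions leave the mean of the gradient estimate unchanged but inflate its variance.

```python
# Toy Monte Carlo illustration (our own sketch): the score-function gradient
# estimate for agent i has the same mean, but a larger variance, when the
# critic value also fluctuates with other agents' (random) actions.
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.tensor([0.2, -0.1, 0.3], requires_grad=True)   # hypothetical policy of agent i
dist = Categorical(logits=logits)

def one_gradient_sample(joint_critic):
    a_i = dist.sample()
    q_own = a_i.float()                                        # part of the return explained by agent i
    q_others = 3.0 * torch.randn(()) if joint_critic else torch.zeros(())  # other agents' influence
    grad, = torch.autograd.grad((q_own + q_others) * dist.log_prob(a_i),
                                logits, retain_graph=True)
    return grad

for joint_critic in (False, True):
    grads = torch.stack([one_gradient_sample(joint_critic) for _ in range(5000)])
    print("joint critic" if joint_critic else "local critic",
          "| mean:", grads.mean(0).tolist(),
          "| total variance:", grads.var(0).sum().item())
```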
## Settling the Variance

To answer the first question, we must first understand something that single-agent RL does not have to think about: how different agents contribute to the joint return.

For this purpose, we introduce the multi-agent state-action value function, defined by

$$Q^{i_{1:m}}_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{i_{1:m}}) = \mathbb{E}_{\boldsymbol{a}^{-i_{1:m}} \sim \boldsymbol{\pi}_{\boldsymbol{\theta}}^{-i_{1:m}}}\big[Q_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{i_{1:m}}, \boldsymbol{a}^{-i_{1:m}})\big].$$

This function gives the expected return once a subset of agents has taken its actions. Crucially, on top of it, we define the multi-agent advantage function

$$A^{i_{1:m}}_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}}) = Q^{j_{1:k}, i_{1:m}}_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}}) - Q^{j_{1:k}}_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{j_{1:k}}).$$

This function allows a subset $i_1, \dots, i_m$ of agents to evaluate their actions in the scenario where agents $j_1, \dots, j_k$ have already taken the actions $a^{j_1}, \dots, a^{j_k}$. It turns out that it satisfies an identity known as the *multi-agent advantage decomposition lemma*:

$$A_{\boldsymbol{\theta}}^{i_{1:m}}(s, \boldsymbol{a}^{i_{1:m}}) = \sum\limits_{j=1}^{m}A_{\boldsymbol{\theta}}^{i_j}(s, \boldsymbol{a}^{i_{1:j-1}}, a^{i_j}).$$

The lemma reveals that, for a given state, the agents contribute to the return additively. Moreover, this additivity is special: it is sequential; the multi-agent advantages that appear in the sum unfold the joint action agent by agent. Recall that, in statistics, when such an unfolding of random variables happens, we can decompose their variance with the law of total variance. In symbols, for random variables $X$ and $Y$,

$$\mathbf{Var}[X] = \mathbf{Var}\big[\mathbb{E}[X|Y]\big] + \mathbb{E}\big[\mathbf{Var}[X|Y]\big].$$

By repeatedly substituting $X=\boldsymbol{a}^{1:i}$ and $Y=\boldsymbol{a}^{1:i-1}$, the combination of the multi-agent advantage decomposition lemma and the law of total variance lets us decompose the joint-advantage variance as

$$\mathbf{Var}_{\boldsymbol{a}\sim\boldsymbol{\pi_\theta}}[A_{\boldsymbol{\theta}}] = \sum\limits_{i=1}^{n}\mathbb{E}_{\boldsymbol{a}^{1:i-1}\sim\boldsymbol{\pi_\theta}^{1:i-1}}\big[\mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}[A^i_{\boldsymbol{\theta}}(s, \boldsymbol{a}^{1:i-1}, a^i)]\big].$$

Recall that the return contributes to the variance of the MAPG estimator through the joint action in the state-action value function. How exactly this happens is a bit of mathematical magic, but luckily this state-action value simplifies to the advantage. This lets one leverage the above decomposition to prove that the agents contribute additively to the variance, as follows:

$$\mathbf{Var}_{s_{0:\infty}\sim d^{0:\infty}_{\boldsymbol{\theta}},\, \boldsymbol{a}_{0:\infty}\sim\boldsymbol{\pi_\theta}}[g^i_{\text{C}}] - \mathbf{Var}_{s_{0:\infty}\sim d^{0:\infty}_{\boldsymbol{\theta}},\, \boldsymbol{a}_{0:\infty}\sim\boldsymbol{\pi_\theta}}[g^i_{\text{D}}] = \mathcal{O}\Big(\sum\limits_{j\neq i}\epsilon_j^2\Big).$$

In this theorem, the variance of the decentralized MAPG estimator is interpreted as the variance of a single-agent PG algorithm, because the same estimator arises when the agents of the MAS learn independently. Notably, the excess variance is of the order of the sum of the squared local advantages of the other agents. Hence, as the number of agents increases, the variance grows, impeding training. On the other hand, the presence of the local advantages in the bound hints that an effective variance-reduction technique should extract their contributions to the joint return out of the MAPG estimator.
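To build intuition for the decomposition lemma that drives this result, here is a tiny numeric check on a random toy game. The sizes, the joint Q-table, and the policies are arbitrary assumptions of ours.

```python
# Tiny numeric check (our own toy example) of the multi-agent advantage
# decomposition lemma for n = 3 agents with 2 actions each, using a random
# joint Q-table and independent random policies at a fixed state s.
import numpy as np

rng = np.random.default_rng(0)
n, k = 3, 2                                   # agents, actions per agent
Q = rng.normal(size=(k,) * n)                 # Q_theta(s, a^1, a^2, a^3)
pi = rng.dirichlet(np.ones(k), size=n)        # pi^i_theta(.|s), one row per agent

def Q_sub(actions):
    """Q^{1:m}(s, a^{1:m}): average out the remaining agents under their policies."""
    out = Q
    for j in range(n - 1, len(actions) - 1, -1):   # marginalise agents m+1..n (last axis first)
        out = np.tensordot(out, pi[j], axes=([j], [0]))
    return out[tuple(actions)]

a = [1, 0, 1]                                 # an arbitrary joint action
joint_adv = Q_sub(a) - Q_sub([])              # A^{1:n}(s, a) = Q(s, a) - V(s)
sum_of_local = sum(Q_sub(a[:i + 1]) - Q_sub(a[:i]) for i in range(n))
print(np.isclose(joint_adv, sum_of_local))    # True: the lemma holds on this toy game
```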
---

## Optimal Baseline (OB)

A variance-reduction technique well known to RL practitioners is the baseline trick. Agent $i$ can adapt it to MARL by subtracting from the critic a function that is independent of its own action (but may depend on the other agents' actions), which gives a new MAPG estimator

$$g^i_{\text{C}} = \sum\limits_{t=0}^{\infty}\gamma^t \big[\hat{Q}(s_t, \boldsymbol{a}^{-i}_t, a^i_t) - b(s_t, \boldsymbol{a}^{-i}_t)\big]\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t).$$

Such a modification does not introduce bias (the expected value of the estimator still matches the MAPG) because of the following identity:

$$\mathbb{E}_{a^i_t\sim\pi^i_{\boldsymbol{\theta}}}\big[b(s_t, \boldsymbol{a}^{-i}_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t)\big] = \boldsymbol{0}.$$

However, an appropriately chosen baseline may reduce variance. Naturally, one can ask whether there exists an optimal baseline, which achieves the minimal possible variance, and if so, what it is. To answer this question, we must first understand how the baseline influences the MAPG estimator's variance. We can do this at every step $t$ by first defining

$$g^i_{\text{C}, t} = \hat{Q}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t).$$

Again, by repeatedly applying the law of total variance, we can decompose the variance as follows:

![](https://i.imgur.com/cq4WuEq.png)

Notably, as all MAPG estimators with a baseline have the same expectation, only the term of variance coming from agent $i$'s action depends on the baseline. Hence, we wish to minimize

$$\mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}[g^i_{\text{C},t}].$$

Now we can answer the question about the existence of an optimal baseline affirmatively: the optimal baseline is given by

$$b^{\text{optimal}}(s, \boldsymbol{a}^{-i}) = \frac{\mathbb{E}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[\hat{Q}(s, \boldsymbol{a}^{-i}, a^i)\,||\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i|s)||^2\big]}{\mathbb{E}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[||\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i|s)||^2\big]}.$$

Great! So now we just have to compute the optimal baseline with this formula at every learning step, and the variance will be minimized 😁… But these expectations over gradients of neural networks do not have closed forms 🤔… So we would have to iterate over all available actions (discrete case), or even sample them (continuous case), doing backpropagation for every action. At best this would slow training down enormously, and in the continuous case the estimation itself would induce much more variance because of the gradient's size.
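To see this cost concretely, here is a toy sketch (our own illustration, with a hypothetical policy network and critic values) of the exact optimal baseline for a small discrete policy. Note the one backward pass per action.

```python
# Toy sketch (our own illustration): the parameter-space optimal baseline for a
# tiny discrete policy. The one-backward-pass-per-action loop is exactly what
# makes this form impractical for large networks.
import torch

state_dim, n_actions = 8, 4
policy = torch.nn.Linear(state_dim, n_actions)   # hypothetical toy policy network
state = torch.randn(state_dim)
q_hat = torch.randn(n_actions)                   # hypothetical Q_hat(s, a^{-i}, .) per local action

log_probs = torch.log_softmax(policy(state), dim=-1)

grad_sq_norms = []
for a in range(n_actions):
    # One backward pass per action to get ||grad_theta log pi(a|s)||^2.
    grads = torch.autograd.grad(log_probs[a], policy.parameters(), retain_graph=True)
    grad_sq_norms.append(sum(g.pow(2).sum() for g in grads))
grad_sq_norms = torch.stack(grad_sq_norms)

probs = log_probs.exp().detach()
# b_optimal = E_a[ Q_hat * ||grad log pi||^2 ] / E_a[ ||grad log pi||^2 ]
b_optimal = (probs * q_hat * grad_sq_norms).sum() / (probs * grad_sq_norms).sum()
print(b_optimal.item())
```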
## Surrogate variance for neural network policy

The major issue with the optimal baseline is the presence of the massive neural-network gradient. We would therefore like to "remove" it somehow. Luckily, an NN-based policy can be decomposed as

$$\pi^i_{\boldsymbol{\theta}}(a^i|s) = \pi^i(a^i|\psi^i_{\boldsymbol{\theta}}(s)),$$

where $\psi^i_{\boldsymbol{\theta}}(s)$ is the output of the network. This can be the layer of logits (from which the policy is computed with *softmax*) for discrete policies, or the mean and variance of a Gaussian distribution for continuous policies. This enables us to decompose the variance of the MAPG estimator with the chain rule as follows:

$$\begin{aligned}
&\mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i|s)\big(\hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b(s, \boldsymbol{a}^{-i})\big)\big] \\
&\quad= \mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[\nabla_{\theta^i}\psi^i_{\boldsymbol{\theta}}(s)\nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))\big(\hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b(s, \boldsymbol{a}^{-i})\big)\big] \\
&\quad= \nabla_{\theta^i}\psi^i_{\boldsymbol{\theta}}(s)\,\mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[\nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))\big(\hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b(s, \boldsymbol{a}^{-i})\big)\big]\nabla_{\theta^i}\psi^i_{\boldsymbol{\theta}}(s)^{\mathsf{T}}.
\end{aligned}$$

Hence, the variance we wish to minimize is strongly related to the surrogate variance

$$\mathbf{Var}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[\nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))\big(\hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b(s, \boldsymbol{a}^{-i})\big)\big],$$

which does not depend on the parameters of our neural network, but only on its output. We can prove that the optimal baseline (OB) for this variance is

$$b^*(s, \boldsymbol{a}^{-i}) = \frac{\mathbb{E}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[\hat{Q}(s, \boldsymbol{a}^{-i}, a^i)\,||\nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))||^2\big]}{\mathbb{E}_{a^i\sim\pi^i_{\boldsymbol{\theta}}}\big[||\nabla_{\psi^i_{\boldsymbol{\theta}}(s)}\log\pi^i_{\boldsymbol{\theta}}(a^i|\psi^i_{\boldsymbol{\theta}}(s))||^2\big]}.$$

And this is good news! The gradients with respect to $\psi$ usually have closed forms (for instance, we know the gradient of a Gaussian log-probability with respect to its mean). In the case of softmax policies, we can show that OB is simply the expected value of $\hat{Q}$ under the probability distribution

$$x^i_{\psi^i_{\boldsymbol{\theta}}}(a^i|s) = \frac{\pi^i_{\boldsymbol{\theta}}(a^i|s)\big(1 + ||\pi^i_{\boldsymbol{\theta}}(s)||^2 - 2\pi^i_{\boldsymbol{\theta}}(a^i|s)\big)}{1 - ||\pi^i_{\boldsymbol{\theta}}(s)||^2}.$$

So all we need to do is compute the expected value of $\hat{Q}(s, \boldsymbol{a}^{-i}, a^i)$ with the probabilities given by $x^i_{\psi^i_{\boldsymbol{\theta}}}(a^i|s)$, and this is very easy! 🙌
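As a preview of the implementation section below, here is a minimal PyTorch sketch of this computation for a single agent; the function and variable names are our own, and the blog's own helpers appear in the screenshots later.

```python
# Sketch (our own names) of OB for a softmax policy: build the distribution x
# from the action probabilities, then take the expectation of Q_hat under x.
import torch

def optimal_baseline_discrete(pi, q_hat):
    """pi: action probabilities pi^i_theta(.|s), shape (n_actions,);
    q_hat: Q_hat(s, a^{-i}, .) for every local action of agent i, shape (n_actions,)."""
    pi_sq_norm = (pi ** 2).sum()
    # The distribution x from the formula above; it sums to 1 by construction.
    x = pi * (1.0 + pi_sq_norm - 2.0 * pi) / (1.0 - pi_sq_norm)
    # OB is the expectation of Q_hat under x.
    return (x * q_hat).sum()

pi = torch.softmax(torch.randn(4), dim=-1)    # dummy policy output for one state
q_hat = torch.randn(4)                         # dummy joint-critic values over agent i's actions
b_star = optimal_baseline_discrete(pi, q_hat)
x_signal = q_hat - b_star                      # X_hat^i(s, a^{-i}, .) for each local action
```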
Once we have done it, we use a new state-action value signal,

$$\hat{X}^i(s, \boldsymbol{a}^{-i}, a^i) \triangleq \hat{Q}(s, \boldsymbol{a}^{-i}, a^i) - b^*(s, \boldsymbol{a}^{-i}),$$

which we then use to construct a MAPG estimator with OB:

$$g^i_{\text{X}} = \sum\limits_{t=0}^{\infty}\gamma^t \hat{X}^{i}(s_t, \boldsymbol{a}^{-i}_t, a^i_t)\nabla_{\theta^i}\log\pi^i_{\boldsymbol{\theta}}(a^i_t|s_t).$$

As the derivation of OB is a bit of mathematical black magic, it may be hard to believe that it decreases the variance. But don't worry, we can verify it on a simple toy game. Namely, suppose that agent $i$ wants to estimate its MAPG while all the other agents have already taken their actions. The agent can choose one of three actions, and it will do so according to its parameterised policy. The table below illustrates the quantities the agent is given or can compute.

![](https://i.imgur.com/t8qbD1p.png)

The agent can use one of three types of estimators: the vanilla MAPG estimator (using $\hat{Q}$), the COMA estimator (using $\hat{A}^i$), and the OB estimator (using $\hat{X}^i$). As it turns out, the variance in each of these cases is

![](https://i.imgur.com/l2RdqjA.png)

which means that OB reduces the variance of vanilla MAPG by as much as 50%, and of COMA by over 30%. The latter may come as a surprise because COMA was derived to solve the credit assignment problem (i.e., reduce variance)… Well, OB is optimal after all 😁.

Being very easy to use in practice, OB can be applied to any stochastic policy gradient algorithm. An example of such an algorithm is the recently popular MAPPO, which at every step optimizes the objective

$$\sum\limits_{i=1}^{n}\mathbb{E}_{s\sim d_{\boldsymbol{\theta}_{\text{old}}},\, \boldsymbol{a}\sim\boldsymbol{\pi}_{\boldsymbol{\theta}_{\text{old}}}}\Big[\min\Big(\frac{\pi_{\boldsymbol{\theta}}(a^i|s)}{\pi_{\boldsymbol{\theta}_{\text{old}}}(a^i|s)}\hat{A}(s, \boldsymbol{a}),\ \text{clip}\Big(\frac{\pi_{\boldsymbol{\theta}}(a^i|s)}{\pi_{\boldsymbol{\theta}_{\text{old}}}(a^i|s)},\ 1\pm\epsilon\Big)\hat{A}(s, \boldsymbol{a})\Big)\Big].$$

Recall that the advantage function is obtained by subtracting the state-value baseline from the joint Q-critic:

$$\hat{A}(s, \boldsymbol{a}) = \hat{Q}(s, \boldsymbol{a}) - \hat{V}(s).$$

We can replace this baseline with OB and obtain the following X value,

$$\hat{X}^i(s, \boldsymbol{a}) = \hat{Q}(s, \boldsymbol{a}) - b^*(s, \boldsymbol{a}^{-i}),$$

which then lets us modify the MAPPO objective into *MAPPO with OB*:

$$\sum\limits_{i=1}^{n}\mathbb{E}_{s\sim d_{\boldsymbol{\theta}_{\text{old}}},\, \boldsymbol{a}\sim\boldsymbol{\pi}_{\boldsymbol{\theta}_{\text{old}}}}\Big[\min\Big(\frac{\pi_{\boldsymbol{\theta}}(a^i|s)}{\pi_{\boldsymbol{\theta}_{\text{old}}}(a^i|s)}\hat{X}^i(s, \boldsymbol{a}),\ \text{clip}\Big(\frac{\pi_{\boldsymbol{\theta}}(a^i|s)}{\pi_{\boldsymbol{\theta}_{\text{old}}}(a^i|s)},\ 1\pm\epsilon\Big)\hat{X}^i(s, \boldsymbol{a})\Big)\Big].$$

OB was evaluated on the challenging StarCraft II (discrete actions) and Multi-Agent MuJoCo (continuous actions) benchmarks. The empirical results confirm the theory: MAPPO with OB, using more accurate (thanks to lower variance) MAPG estimators, learns faster and converges to better policies.

![](https://i.imgur.com/9rIkr76.png)

These are just a few examples of OB significantly improving the performance of a MAPG method. In their paper, Kuba et al. actually found a case where OB is necessary to make MAPPO work!
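A minimal sketch of this per-agent clipped objective could look as follows; the function and names are our own, and the X-values are assumed to be precomputed and detached.

```python
# Minimal sketch (our own names, not the reference implementation) of the
# clipped MAPPO-style objective for one agent i, with the X-signal
# X_hat^i(s, a) = Q_hat(s, a) - b*(s, a^{-i}) in place of the usual advantage.
import torch

def mappo_ob_loss(log_probs_new, log_probs_old, x_values, clip_eps=0.2):
    """log_probs_new: log pi_theta(a^i|s) over a sampled batch (with grad);
    log_probs_old: log pi_theta_old(a^i|s) for the same batch;
    x_values: precomputed, detached X_hat^i(s, a) values."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # We maximise the clipped surrogate, i.e. minimise its negation.
    return -torch.min(ratio * x_values, clipped_ratio * x_values).mean()

# Hypothetical usage with dummy batch tensors:
log_probs_new = torch.randn(32, requires_grad=True)
log_probs_old = log_probs_new.detach() + 0.1 * torch.randn(32)
x_values = torch.randn(32)
loss = mappo_ob_loss(log_probs_new, log_probs_old, x_values)
loss.backward()
```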
---

## Implementation of OB

Now that you are excited about OB and want to quickly incorporate it into your code, we will help you implement it in PyTorch.

First, implement a function that normalizes a positive-valued vector so that it yields a probability distribution, and name it `normalize`. Then you can implement **OB for discrete actions** with a method:

![](https://i.imgur.com/ImKXfaI.png)

and for continuous actions with Gaussian policies, you can estimate it with:

![](https://i.imgur.com/XSo0TQt.png)

Now you can go and make your bots beat your friends in StarCraft, and tell us how it went! But if you want to make sure you use the finest possible algorithm, you can stay with us and move on to the next article in the series 😁.

---

Thank you for reading this blog. I hope you found it useful and interesting. I also want to thank my co-authors of the paper "Settling the Variance of Multi-Agent Policy Gradients": Muning Wen, Yaodong Yang, Linghui Meng, Shangding Gu, Haifeng Zhang, David Mguni, and Jun Wang. Without them, OB would never have seen the daylight!
