# Multi-Agent Reinforcement Learning (2/3): Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
This blog is based on the paper *"Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning"*, by Kuba et al., available at https://arxiv.org/pdf/2109.11251.pdf. It is the second blog in the series "Multi-Agent Reinforcement Learning". You can read the prequel of it at https://hackmd.io/rkNojzNzQzWXlU0HoaPOrg.
## The Limitations of Multi-Agent Policy Gradients
We are already familiar with **MARL**, in which independent agents want to optimize their parameterized policies $\pi^i$ to maximize
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \mathcal{J}(\boldsymbol{\pi}) = \mathbb{E}_{s\sim \rho^{0:\infty}_{\boldsymbol{\pi}}, \boldsymbol{a_{0:\infty}}\sim \boldsymbol{\pi} }\big[ \sum_{t=0}^{\infty}\gamma^t r(s_t, \boldsymbol{a}_t)\big]$.
Following the traditional deep RL approach, we would model every agent's policy with a neural network with parameters $\theta^i$, written $\pi^i_{\theta^i}$, and learn it with stochastic gradient ascent
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \theta^i \gets \theta^i + \alpha \nabla_{\theta^i}\mathcal{J}(\boldsymbol{\pi}_{\boldsymbol{\theta}})$.
Such a policy-gradient (PG) based approach is known to be problematic already in single-agent RL (and Machine Learning in general), because the step taken by the gradient update can be too large and harm performance. When it comes to multi-agent PG (MAPG), this problem is even more severe because the update directions given by the agents' gradients can conflict... What?! The gradient should always lead up the reward, right?! Not exactly. Bear in mind that in MARL, every agent follows its OWN gradient, which tells what that agent alone can do to improve the joint performance. When all agents act on this at the same time, the joint update may be a disaster.
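To make the conflict concrete, here is a minimal sketch on a made-up two-agent objective (the function `joint_return`, the step size `alpha = 0.6`, and the starting point are all illustrative assumptions, not taken from the paper). Each agent controls one parameter, and the team does best when the two parameters sum to one.

```python
import numpy as np

def joint_return(theta):
    """Toy team objective (hypothetical): best when theta[0] + theta[1] == 1."""
    return -(theta[0] + theta[1] - 1.0) ** 2

def grad(theta):
    """Gradient of joint_return; by symmetry both components are equal."""
    g = -2.0 * (theta[0] + theta[1] - 1.0)
    return np.array([g, g])

theta = np.array([0.0, 0.0])
alpha = 0.6  # illustrative step size
g = grad(theta)

# Each agent moves along its OWN gradient component.
only_agent_1 = theta + alpha * np.array([g[0], 0.0])
only_agent_2 = theta + alpha * np.array([0.0, g[1]])
both_agents = theta + alpha * g

j_start = joint_return(theta)
j_only_1 = joint_return(only_agent_1)
j_only_2 = joint_return(only_agent_2)
j_both = joint_return(both_agents)

print("start:", j_start)
print("only agent 1 updates:", j_only_1)
print("only agent 2 updates:", j_only_2)
print("both update at once:", j_both)
```

In this toy case either agent's step is an improvement on its own, but applying both steps simultaneously overshoots and leaves the team worse off than where it started.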
![](https://i.imgur.com/SNckxAw.jpg)
It's as if you were driving a car with your friend, and at some point the road forked, with a tree in the middle. Assuming you will do nothing, your friend pulls the wheel right to avoid a crash. However, being mindful, you want to avoid the tree by turning left. You can't, however, because the force applied by your friend is stopping you. The wheel remains still...
![](https://i.imgur.com/ZIXFRg8.jpg)
Therefore, in order to make decisions that are beneficial for the whole team, the agents must always ***collaborate***. Unfortunately, name a method, be it MADDPG, IPPO, or MAPPO: all of them make the agents mind only themselves and follow their own gradients. Hence, we still have no clue how to guarantee performance improvement in MARL... Until now 😁
## Multi-Agent Trust Region Learning
In single-agent RL, trust region learning enables stability of updates and policy improvement; at every iteration $k$, the new policy $\pi_{k+1}$ increases the return: $\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \mathcal{J}(\pi_{k+1}) \geq \mathcal{J}(\pi_k)$.
For the reasons described above, simply applying trust region learning to MARL fails: even if a trust-region update guarantees improvement for one agent, the agents' updates taken together can be damaging for the whole team. Today, however, you will see the new *multi-agent trust region learning*, which implements cooperation and leads to joint policy improvement 🎉. Its key ingredients are novel multi-agent functions, which describe the contribution of subsets of agents to the joint return.
### Multi-Agent Advantage Decomposition
First, the *multi-agent state-action value function* for an arbitrary ordered agent subset $i_{1:m} = \{i_1, \dots, i_m\}$ is defined as
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad Q^{i_{1:m}}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:m}}) \triangleq \mathbb{E}_{\boldsymbol{a}^{-i_{1:m}}\sim\boldsymbol{\pi}^{-i_{1:m}}}\big[Q_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:m}}, \boldsymbol{a}^{-i_{1:m}} ) \big]$.
Simply speaking, this function gives the average return if agents $i_{1:m}$ take the joint action $\boldsymbol{a}^{i_{1:m}}$ at state $s$, while the remaining agents follow their policies. On top of it we can define *the multi-agent advantage function*
$\quad \quad \quad \quad \quad \quad \quad \quad A_{\boldsymbol{\pi}}^{i_{1:m}}(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}}) = Q^{j_{1:k}, i_{1:m}}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}}) - Q^{j_{1:k}}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{j_{1:k}})$.
This function measures how much better than average the joint action $\boldsymbol{a}^{i_{1:m}}$ of agents $i_{1:m}$ is, given that agents $j_{1:k}$ have taken the joint action $\boldsymbol{a}^{j_{1:k}}$. Just think how useful it would be if $i_{1:m}$ could know $\boldsymbol{a}^{j_{1:k}}$ and the multi-agent advantage. They could then "react" cleverly by choosing a joint action $\boldsymbol{a}^{i_{1:m}}$ with large multi-agent advantage... Actually, there is a lemma which describes the awesome consequence of such a scenario, known as the ***Multi-Agent Advantage Decomposition Lemma***: for any ordered subset $i_{1:m}$ of agents
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad A^{i_{1:m}}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:m}}) = \sum\limits_{j=1}^{m}A^{i_j}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:j-1}}, a^{i_j})$.
Isn't it amazing 😀?! If every agent $i_j$ knew what agents $i_{1:j-1}$ do, then it could react with $a^{i_j}_*$ to maximize its own multi-agent advantage (whose maximum is always non-negative, because the advantage averages to zero under the agent's own policy).
![](https://i.imgur.com/Ue5arsO.png)
Then, by setting $m=n$, the lemma assures that the joint advantage will be non-negative! And if the number of agents is large, it can actually be "very" positive! Let's go and see how to contrive a learning algorithm with this idea.
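If you would like to see the lemma in action, here is a small numerical sketch for three agents at a single state. The Q-table and the policies are random placeholders (assumptions made purely for illustration); the code builds the multi-agent Q-functions by averaging out the other agents, checks that the per-agent advantages add up to the joint advantage, and then lets the agents react greedily one after another.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3

# Hypothetical joint Q-values Q(s, a1, a2, a3) at one fixed state s.
q_joint = rng.normal(size=(n_actions, n_actions, n_actions))

def random_policy():
    """Random probability vector over actions (illustrative only)."""
    p = rng.random(n_actions)
    return p / p.sum()

pi1, pi2, pi3 = random_policy(), random_policy(), random_policy()

# Multi-agent Q-functions: average out the agents outside the subset.
q_12 = np.einsum("abc,c->ab", q_joint, pi3)   # Q^{1,2}(s, a1, a2)
q_1 = q_12 @ pi2                              # Q^{1}(s, a1)
v = q_1 @ pi1                                 # V(s)

# Multi-agent advantages along the order (1, 2, 3).
adv_1 = q_1 - v                               # A^1(s, a1)
adv_2 = q_12 - q_1[:, None]                   # A^2(s, a1, a2)
adv_3 = q_joint - q_12[:, :, None]            # A^3(s, (a1, a2), a3)

# Lemma: the joint advantage equals the sum of the sequential advantages.
adv_joint = q_joint - v
adv_sum = adv_1[:, None, None] + adv_2[:, :, None] + adv_3
lemma_holds = np.allclose(adv_joint, adv_sum)

# Each agent reacts greedily to the actions chosen before it.
a1 = int(np.argmax(adv_1))
a2 = int(np.argmax(adv_2[a1]))
a3 = int(np.argmax(adv_3[a1, a2]))
greedy_joint_advantage = adv_joint[a1, a2, a3]

print("lemma holds:", lemma_holds)
print("joint advantage of sequential greedy reaction:", greedy_joint_advantage)
```

Every term in the sum is a maximum of something that averages to zero, so the joint advantage of the sequential greedy reaction cannot be negative.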
### Monotonic Improvement
To learn joint policies which perform well throughout the whole game, we must look at a bigger picture than a single state and action: we consider their marginal distribution. Let's suppose that our agents follow a joint policy $\boldsymbol{\pi}=(\pi^1, \dots, \pi^n)$. They decide to learn according to some order $i_{1:n}$. Suppose that $i_1, \dots, i_{m-1}$ have already made their updates to new policies $(\bar{\pi}^{i_1}, \dots, \bar{\pi}^{i_{m-1}}) = \boldsymbol{\bar{\pi}}^{i_{1:m-1}}$. Then, for any candidate policy $\hat{\pi}^{i_m}$ we define the surrogate return
$\quad \quad \quad \quad \quad L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \hat{\pi}^{i_m}) = \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}}, \boldsymbol{a}^{i_{1:m-1}}\sim \boldsymbol{\bar{\pi}}^{i_{1:m-1}}, a^{i_m}\sim \hat{\pi}^{i_m}}\big[ A_{\boldsymbol{\pi}}^{i_m}(s, \boldsymbol{a}^{i_{1:m-1}}, a^{i_m}) \big]$.
This definition is just a slight step forward from the definition of multi-agent advantage: here, the agent $i_m$ wouldn't react to others with a specific action, but rather with a specific policy $\hat{\pi}^{i_m}$. Fortunately, in this "bigger" setting, another decomposition lemma holds: define $C = \frac{4\gamma \max_{s, \boldsymbol{a}} |A_{\boldsymbol{\pi}}(s, \boldsymbol{a})|}{(1-\gamma)^2}$. Then
$\quad \quad \quad \quad \quad \quad \mathcal{J}(\boldsymbol{\bar{\pi}}) \geq \mathcal{J}(\boldsymbol{\pi}) + \sum\limits_{m=1}^{n}\big[ L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \bar{\pi}^{i_m}) - CD_{\text{KL}}^{\text{max}}(\pi^{i_m}, \bar{\pi}^{i_m})\big]$.
This lemma provides a lower bound on the performance of the agents' new joint policy; a lower bound that is decomposed among agents. Such a decomposition allows the agents to improve, one by one, the guarantee on the performance of the next joint policy. It is tailor-made for the *sequential update scheme*: the agents update their policies to solve
$\quad \quad \quad \quad \quad \quad \quad \bar{\pi}^{i_m} = \text{argmax}_{\hat{\pi}^{i_m}} \ L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \hat{\pi}^{i_m}) - CD_{\text{KL}}^{\text{max}}(\pi^{i_m}, \hat{\pi}^{i_m})$.
As the maximization update is at least as good as no update (for which the above objective is zero), every agent guarantees that
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \bar{\pi}^{i_m}) - CD_{\text{KL}}^{\text{max}}(\pi^{i_m}, \bar{\pi}^{i_m}) \geq 0$.
Together with the above decomposition inequality, the agents following this protocol achieve the **monotonic improvement property**:
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \mathcal{J}(\boldsymbol{\bar{\pi}}) \geq \mathcal{J}(\boldsymbol{\pi})$.
Hurrah!!! 🧨🎆 We've done it! We figured out how to make the agents improve the joint return; all with a small update guarantee due to the $D_{\text{KL}}^{\text{max}}(\pi^{i_m}, \bar{\pi}^{i_m})$ penalty.
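Here is a toy sketch of the sequential update scheme, under strong simplifying assumptions: a single state, two agents, and $\gamma = 0$, so that $C = 0$ and the KL penalty drops out of the objective. The reward table is random, and the exponentiated-advantage update in `soft_update` is a stand-in chosen for illustration, not the update used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 4
reward = rng.normal(size=(n_actions, n_actions))  # hypothetical r(a1, a2)

pi1 = np.full(n_actions, 1.0 / n_actions)  # old policy of agent 1
pi2 = np.full(n_actions, 1.0 / n_actions)  # old policy of agent 2

def expected_return(p1, p2):
    """J(pi) in the one-state, gamma = 0 setting."""
    return p1 @ reward @ p2

def soft_update(old, scores, step=1.0):
    """Illustrative update: reweight the old policy by exp(step * scores)."""
    w = old * np.exp(step * scores)
    return w / w.sum()

j_old = expected_return(pi1, pi2)

# Agent 1 goes first: A^1(a1) = Q^1(a1) - V.
adv_1 = reward @ pi2 - j_old
new_pi1 = soft_update(pi1, adv_1)
l1 = new_pi1 @ adv_1                       # surrogate of agent 1

# Agent 2 reacts to agent 1's NEW policy: A^2(a1, a2) = Q(a1, a2) - Q^1(a1).
adv_2 = reward - (reward @ pi2)[:, None]
scores_2 = new_pi1 @ adv_2                 # expected advantage of each a2
new_pi2 = soft_update(pi2, scores_2)
l2 = scores_2 @ new_pi2                    # surrogate of agent 2

j_new = expected_return(new_pi1, new_pi2)
no_update_l1 = pi1 @ adv_1                 # surrogate if agent 1 does nothing

print("J(old):", j_old, "J(new):", j_new)
print("sum of surrogates:", l1 + l2)
```

In this degenerate setting the bound is tight: the improvement in return is exactly the sum of the two surrogates, each of which is non-negative, and leaving a policy unchanged gives a surrogate of zero.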
You may wonder: *if any order of updates in the sequential update scheme guarantees monotonic improvement, which order should we use?* Indeed, it's a good point. Let's ask ourselves a question: *what would the behavior of agents be like at convergence to the optimal joint policy?*
Unfortunately, just like in the case of trust region learning in RL, we cannot implement this protocol in games with large state spaces. But don't worry! We can easily approximate it and boost it with neural networks, as we describe below.
## Deep Algorithms
Now you will learn about and get excited by the new state-of-the-art deep MARL algorithms: *Heterogeneous-Agent Trust Region Policy Optimization* (**HATRPO**) and *Heterogeneous-Agent Proximal Policy Optimization* (**HAPPO**).
### HATRPO
So what we'd like to do is to make every agent $i_m$ solve
$\quad \quad \quad \quad \quad \quad \quad \quad \bar{\pi}^{i_m} = \text{argmax}_{\hat{\pi}^{i_m}} \ L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \hat{\pi}^{i_m}) - CD_{\text{KL}}^{\text{max}}(\pi^{i_m}, \hat{\pi}^{i_m})$
one after another. This is very similar to the objective of single-agent trust region learning: $L_{\pi}(\hat{\pi}) - CD_{\text{KL}}^{\text{max}}(\pi, \hat{\pi})$. In the case of neural network policies, this is solved approximately by the constrained objective of TRPO
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \bar{\pi} = \text{argmax}_{\hat{\pi}} \ L_{\pi}(\hat{\pi}), \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi, \hat{\pi}) \leq \delta\\
\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad =\text{argmax}_{\hat{\pi}} \ \mathbb{E}_{s\sim\rho_{\pi}, a\sim\pi}\Big[ \frac{\hat{\pi}(a|s)}{\pi(a|s)} A_{\pi}(s, a)\Big], \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi, \hat{\pi}) \leq \delta,$
where $\delta$ is a small constraint parameter controlling the size of the update. The key to making this objective solvable is that the expectation is taken entirely over data from the "old" policy $\pi$, so we can estimate it from samples and differentiate it automatically. Unfortunately, our approximate multi-agent trust-region objective (written out)
$\quad \quad \quad \quad \quad \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}}, \boldsymbol{a}^{i_{1:m-1}}\sim\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, a^{i_m}\sim\hat{\pi}^{i_m}}\big[ A^{i_m}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:m-1}}, a^{i_m}) \big], \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi^{i_m}, \hat{\pi}^{i_m}) \leq \delta$
not only involves the old and the candidate policies $\pi^{i_m}$ and $\hat{\pi}^{i_m}$, but also the joint just-updated policy of agents $i_{1:m-1}$, 😭 so hard!!!
Not really! We can use importance sampling to make the estimation of this objective feasible. Recall that the agents $i_{1:m-1}$ have already made their updates. Then, $i_m$ is free to compute the ratio
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)} \cdot \frac{ \boldsymbol{\bar{\pi}}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) }{ \boldsymbol{\pi}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) }$.
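So let's have the agent compute it, and plug it into what it has, which is the data coming from the old joint policy $\boldsymbol{\pi}$. It turns out that, up to a term that does not depend on the candidate policy $\hat{\pi}^{i_m}$ (and so does not change the maximizer), the multi-agent trust-region objective can be equivalently written as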
$\quad \quad \quad \quad \quad \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}}, \boldsymbol{a}\sim\boldsymbol{\pi} }\Big[ \frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)} \cdot \frac{ \boldsymbol{\bar{\pi}}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) }{ \boldsymbol{\pi}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) } A_{\boldsymbol{\pi}}(s, \boldsymbol{a}) \Big], \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi^{i_m}, \hat{\pi}^{i_m}) \leq \delta.$
Yes, $A_{\boldsymbol{\pi}}(s, \boldsymbol{a})$ is the joint advantage function. The agents don't have to train special critics to compute the multi-agent advantage; all they have to do is to maintain a joint advantage estimator, like GAE. Furthermore, for simplicity, we can write
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad M_{\boldsymbol{\pi}}(s, \boldsymbol{a}) = \frac{ \boldsymbol{\bar{\pi}}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) }{ \boldsymbol{\pi}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) } A_{\boldsymbol{\pi}}(s, \boldsymbol{a})$,
which transforms our problem to the well-known TRPO objective! 🎆🎇🎈
$\quad \quad \quad \quad \quad \quad \quad \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}}, \boldsymbol{a}\sim\boldsymbol{\pi} }\Big[ \frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)} M_{\boldsymbol{\pi}}(s, \boldsymbol{a}) \Big], \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi^{i_m}, \hat{\pi}^{i_m}) \leq \delta$.
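The importance-sampling rewrite can be checked exactly on a small tabular example. The sketch below assumes three agents at a single state, with agent 1 already updated, agent 2 as the current learner $i_m$, and agent 3 not yet updated; all tables and policies are random placeholders. It compares the expectation under the old joint policy of the ratio times $M_{\boldsymbol{\pi}}$ against the original objective, plus a term that involves only agent 1 and therefore does not depend on agent 2's candidate policy.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3  # actions per agent

def random_policy():
    """Random probability vector over actions (illustrative only)."""
    p = rng.random(k)
    return p / p.sum()

# Hypothetical joint Q-values and old policies at one state.
q = rng.normal(size=(k, k, k))
pi1, pi2, pi3 = random_policy(), random_policy(), random_policy()
v = np.einsum("abc,a,b,c->", q, pi1, pi2, pi3)
adv = q - v                                   # joint advantage A(s, a)

bar_pi1 = random_policy()   # agent 1 has already updated
hat_pi2 = random_policy()   # candidate policy of the current agent

# M(s, a): joint advantage reweighted by the ratio of the updated agents.
m_values = (bar_pi1 / pi1)[:, None, None] * adv
ratio_2 = (hat_pi2 / pi2)[None, :, None]

# Expectation over actions drawn from the OLD joint policy.
old_joint = pi1[:, None, None] * pi2[None, :, None] * pi3[None, None, :]
is_objective = np.sum(old_joint * ratio_2 * m_values)

# Original objective: E[A^2(s, a1, a2)] with a1 ~ bar_pi1, a2 ~ hat_pi2.
q_12 = np.einsum("abc,c->ab", q, pi3)
q_1 = q_12 @ pi2
adv_2 = q_12 - q_1[:, None]
direct_objective = bar_pi1 @ adv_2 @ hat_pi2

# Leftover term: depends on agent 1 only, not on hat_pi2.
constant = bar_pi1 @ (q_1 - v)

print(is_objective, direct_objective + constant)
```

Since the leftover term is constant in the candidate policy, maximizing either expression over $\hat{\pi}^{i_m}$ gives the same answer, and only the joint advantage is needed to evaluate it.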
Having turned the multi-agent trust-region problem into the above objective, the agent $i_m$ (with a neural network policy parameterized by $\theta^{i_m}$) can maximize it by performing a step of TRPO
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \theta^{i_m}_{\text{new}} = \theta^{i_m}_{\text{old}} + \alpha^j \sqrt{ \frac{2\delta}{(\boldsymbol{g}^{i_m})^{\top}(\boldsymbol{H}^{i_m})^{-1} \boldsymbol{g}^{i_m}}} (\boldsymbol{H}^{i_m})^{-1} \boldsymbol{g}^{i_m}$.
Here, just like in TRPO, $\boldsymbol{g}^{i_m}$ is the gradient of ${i_m}$'s objective, and the matrix $\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \boldsymbol{H}^{i_m} = \nabla^{2}_{\theta^{i_m}}\overline{D}_{\text{KL}}\big( \pi^{i_m}_{\theta^{i_m}_{\text{old}}}, \pi^{i_m}_{\theta^{i_m}}\big)$
is the Hessian of the average KL-divergence from the old policy. As you may know from TRPO, $\alpha^j$ is a backtracking coefficient $\alpha \in (0, 1)$ raised to the power $j$. We choose $j$ to be the smallest $j\in\mathbb{N}$ for which the update improves the objective estimated from the data. In this way, we make the update as good as we can get from the available data 🤩.
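The update rule above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the helper `trpo_step` is hypothetical, the Hessian is given explicitly and assumed symmetric positive definite, and the objective is a toy quadratic rather than a sampled surrogate. Practical TRPO-style implementations usually avoid forming $\boldsymbol{H}$ and use conjugate gradient instead of a direct solve.

```python
import numpy as np

def trpo_step(theta_old, g, H, objective, delta=0.01, alpha=0.5, max_backtracks=10):
    """One trust-region step with backtracking (illustrative sketch).

    g: gradient of the objective at theta_old.
    H: Hessian of the average KL at theta_old (assumed symmetric positive definite).
    Returns the new parameters and the backtracking exponent j that was accepted.
    """
    x = np.linalg.solve(H, g)                       # H^{-1} g
    full_step = np.sqrt(2.0 * delta / (g @ x)) * x  # scaled natural-gradient step
    baseline = objective(theta_old)
    for j in range(max_backtracks):
        theta_new = theta_old + (alpha ** j) * full_step
        if objective(theta_new) > baseline:         # smallest j that improves
            return theta_new, j
    return theta_old, None                          # no improving step found

# Toy problem (hypothetical): maximize -||theta - target||^2.
target = np.array([1.0, 2.0])
objective = lambda th: -np.sum((th - target) ** 2)
theta_old = np.zeros(2)
g = 2.0 * (target - theta_old)
H = np.eye(2)

theta_new, j_used = trpo_step(theta_old, g, H, objective)
step = theta_new - theta_old
quadratic_kl = 0.5 * step @ H @ step                # second-order KL estimate

print("accepted exponent j:", j_used)
print("quadratic KL estimate:", quadratic_kl)
```

With $j = 0$ the second-order estimate of the KL-divergence equals $\delta$, which is why the square-root factor appears in the step; larger $j$ only shrinks the step further.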
### HAPPO
You may not like the second-order differentiation that HATRPO uses, ok. It is harder to code up, and also is more computationally expensive. Sometimes we want to quickly implement and execute an algorithm. Because of such concerns, we also developed an implementation of multi-agent trust-region learning through proximal policy optimization (PPO). As the constrained HATRPO objective has the same algebraic form as TRPO, it can be implemented with the *clip objective*.
$\quad \quad \quad \quad \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}}, \boldsymbol{a}\sim\boldsymbol{\pi} }\Big[ \text{min}\Big( \frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)} M_{\boldsymbol{\pi}}(s, \boldsymbol{a}),
\text{clip}\big(\frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)}, 1\pm \epsilon \big) M_{\boldsymbol{\pi}}(s, \boldsymbol{a}) \Big)\Big]$
The clip operator replaces the policy ratio with $1-\epsilon$ or $1+\epsilon$ when the ratio leaves the interval $1\pm\epsilon$ from below or from above, respectively. Otherwise, the ratio remains unchanged. So for example, $\text{clip}(1.2, 1\pm 0.1) = 1.1$, and $\text{clip}(0.8, 1\pm 0.1) = 0.9$. This makes sure that large policy updates are discouraged. The clip objective is differentiable with respect to the policy parameters, so all we have to do is to initialize $\theta^{i_m} = \theta^{i_m}_{\text{old}}$, and repeat this a few times
$\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \boldsymbol{g}^{i_m}_{\text{HAPPO}} \gets \nabla_{\theta^{i_m}}L^{\text{HAPPO}}(\theta^{i_m})\\
\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \theta^{i_m} \gets \theta^{i_m} + \alpha \boldsymbol{g}^{i_m}_{\text{HAPPO}}.$
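As a rough sketch, the clip objective for one agent can be evaluated from a batch of samples like this. The function name `happo_clip_objective` and the toy numbers are made up for illustration, and in practice the gradient would come from an automatic-differentiation framework rather than plain NumPy.

```python
import numpy as np

def happo_clip_objective(logp_new, logp_old, m_values, eps=0.1):
    """Sample estimate of the clip objective for one agent (illustrative).

    logp_new, logp_old: log-probabilities of the agent's sampled actions
    under the candidate and the old policy.
    m_values: M(s, a) for each sample, i.e. the joint advantage times the
    ratio of the agents that have already updated.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * m_values, clipped * m_values))

# The two clip examples from the text.
clip_examples = np.clip(np.array([1.2, 0.8]), 0.9, 1.1)

# Toy batch: ratios 1.2, 0.8, 1.0 with M = +1, +1, -1.
ratios = np.array([1.2, 0.8, 1.0])
logp_old = np.log(np.array([0.5, 0.5, 0.5]))
logp_new = logp_old + np.log(ratios)
m_values = np.array([1.0, 1.0, -1.0])

value = happo_clip_objective(logp_new, logp_old, m_values)
print("clip examples:", clip_examples)
print("objective estimate:", value)
```

Note how the first sample contributes $1.1$ rather than $1.2$: once the ratio passes $1+\epsilon$ with a positive $M$, the objective stops rewarding further movement.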
So quick and convenient! 😍
### Domination (verified empirically)
And how do these algorithms perform? Well, HAPPO itself outperforms SOTA methods like MAPPO, IPPO, and MADDPG. HATRPO, however, completely dominates all of them (including HAPPO) and establishes the new SOTA! 💥
Here are some plots from Multi-Agent MuJoCo, one of the most challenging MARL benchmarks.
![](https://i.imgur.com/Uo32wJN.png)
So yeah, the key take-away of multi-agent trust-region learning is that the large number of agents does not have to imply conflicts in learning. Quite the opposite, a large team of learners willing to cooperate can get very far 🗻!
Are you wondering how they can do it safely? Read our next article!
Thanks for reading this article! I (Kuba) am really happy about your interest in MARL, and so are my co-authors: Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang.