# MAT Rebuttal

## Response to Reviewer YpeW

> **Q1** Why have the authors not included a *related works* section and only reviewed two trust-region methods? *Related works* serves an important role of positioning the paper in the MARL literature.

**A1** Section 1 was intended to blend the discussion of related work with the motivation, whereas Section 2 continued the literature review mathematically by covering the related work most closely tied to the advantage decomposition theorem (namely the trust-region methods MAPPO and HAPPO). That said, thanks to the reviewer's comment, we have better positioned our work in the general area of MARL in the updated version by:

1. adding a discussion of two more classic MARL methods, i.e., QMIX (value decomposition) and MADDPG (deterministic policy gradient), as well as a Transformer-based MARL method, i.e., UPDeT, to Sections 1 and 2;
2. running additional baseline experiments for MADDPG and UPDeT, whose currently completed results are shown in this anonymous [link](https://anonymous.4open.science/r/MAT_Rebuttal-0ABA/baseline.pdf) (QMIX's performance is already covered in Tab. 1).

> **Q2** There seems to be a mismatch between the MAPPO performance from the original paper [3] and the one reported in this work. For example, on MMM2, MAPPO is known to achieve a 90.6 win rate, while here the authors report 81.8. Did the authors implement their own MAPPO version? If so, why?

**A2** We used the original MAPPO implementation from the MAPPO codebase and kept its hyperparameters unchanged. The likely reason for the mismatch is that we ran MAPPO with seeds generated by us at random (the MAPPO paper does not list its seeds), so the random seeds differ between our paper and theirs. The gap mainly stems from MAPPO performing particularly poorly on one of the seeds, which is reflected in the higher variance in Tab. 1 (10.1 versus 2.8). For a clearer comparison, we provide the plot with this outlier removed under the following anonymized [link](https://anonymous.4open.science/r/MAT_Rebuttal-0ABA/outlier.pdf), where MAPPO achieves a 90.2 win rate. Moreover, MAT achieves a 93.8 win rate on MMM2, which is still better than MAPPO's reported win rate.

> **Q3** Some parts of the explanation of the few-shot experiments (Section 5.2) are confusing. For instance, what exactly was done in the last meta-column (MAT-from scratch)? Also, to help with the flow, I would suggest moving the paragraph starting at line 265 to the beginning of Section 5.2.

**A3** We apologize for the confusion. The difference between MAT and "MAT-from scratch" in Section 5.2 is that the MAT model is pre-trained on different tasks, whereas the "MAT-from scratch" model is randomly initialized. "MAT-from scratch" therefore serves as a control group that demonstrates the effectiveness of MAT's pre-training. Following the reviewer's suggestion, we have also moved the mentioned paragraph to the beginning of Section 5.2 and clarified "MAT-from scratch" in our revision, since this indeed makes the paper easier to follow.

> **Q4** This work seems relevant and should probably be addressed and possibly compared to UPDeT [1].

**A4** We agree that the Policy Decoupling in UPDeT is relevant to our work.
UPDeT focuses on the observations of each individual agent: it handles varying observation sizes by decoupling observations into a set of observation-entities according to their physical meanings, matching them with different action-groups, and modeling the relationships between the matched observation-entities with a self-attention mechanism for better representation learning [1]. In contrast, MAT focuses on the observations and actions of different agents, modeling the interrelationships among agents with a Transformer architecture and promoting the monotonic improvement of joint policies without requiring divisible observations. We have introduced this work in Section 2 of the revision and will compare against it in the final version (currently completed results are shown in this anonymous [link](https://anonymous.4open.science/r/MAT_Rebuttal-0ABA/baseline.pdf)).

> **Q5** What would happen to MAT's performance if we replaced the transformer by another language model (e.g., GRU, LSTM)?

**A5** Theoretically, the multi-agent sequential decision paradigm could be implemented with different types of sequence models, including RNNs such as GRUs and LSTMs. However, compared with the Transformer, RNN-based implementations pass information between sequential tokens through recurrence and therefore suffer from information loss that impedes building long-distance dependencies [2]. In contrast, the attention mechanism allows the Transformer to model inter-token dependencies regardless of their distance in the input or output sequences. Empirically, we conducted an ablation experiment that replaces the encoder and decoder in Figure 2 with GRUs, whose results verify the advantage of the Transformer architecture (see the anonymized [link](https://anonymous.4open.science/r/MAT_Rebuttal-0ABA/structure.pdf)). We found this question very insightful and thank the reviewer for it.

**Reference:**

[1] Siyi Hu, Fengda Zhu, Xiaojun Chang, and Xiaodan Liang. "UPDeT: Universal multi-agent reinforcement learning via policy decoupling with transformers." arXiv preprint arXiv:2101.08001 (2021).

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pages 5998–6008 (2017).

[3] Chao Yu, et al. "The surprising effectiveness of PPO in cooperative, multi-agent games." arXiv preprint arXiv:2103.01955 (2021).

---

## Reviewer MEiB

> **Q1** The work lacks a *related works* section, which makes it difficult to position the used techniques (advantage decomposition, transformers) in the MARL literature. The UPDeT algorithm, which also uses transformers, is not compared against.

**A1** We appreciate the reviewer pointing out this flaw. The Policy Decoupling in UPDeT is indeed relevant to our work. UPDeT focuses on the observations of each individual agent: it handles varying observation sizes by decoupling observations into a set of observation-entities according to their physical meanings, matching them with different action-groups, and modeling the relationships between the matched observation-entities with a self-attention mechanism for better representation learning [1]. In contrast, MAT focuses on the observations and actions of different agents, modeling the interrelationships among agents with a Transformer architecture and promoting the monotonic improvement of joint policies without requiring divisible observations.
Besides, Section 1 was intended to blend the discussion of related work with the motivation, whereas Section 2 continued the literature review mathematically by covering the related work most closely tied to the advantage decomposition theorem (namely the trust-region methods MAPPO and HAPPO). That said, thanks to the reviewer's comment, we have better positioned our work in the general area of MARL in the updated version by:

1. adding a discussion of two more classic MARL methods, i.e., QMIX (value decomposition) and MADDPG (deterministic policy gradient), as well as a Transformer-based MARL method, i.e., UPDeT, to Sections 1 and 2;
2. running additional baseline experiments for MADDPG and UPDeT, whose currently completed results are shown in this anonymous [link](https://anonymous.4open.science/r/MAT_Rebuttal-0ABA/baseline.pdf) (QMIX's performance is already covered in Tab. 1).

> **Q2** You mention that "the transformer architecture allows sequential policies to be trained in parallel" but it is not clear what this means. If you refer to parallelism in the agents' updates, then it comes as counterintuitive. After all, you need to run the decoder on $\hat{a}_t^{i_{1:m-1}}$ to generate the action distribution for agent $i_m$.

**A2** We apologize for the confusion. What we meant is the following: "In order to guarantee the monotonic improvement of joint policies, HAPPO has to update each policy one by one during training, leveraging the previous update results of $\pi^{i_1},\dots,\pi^{i_{m-1}}$ to improve $\pi^{i_m}$, which becomes a bottleneck for a large number of agents. By contrast, the attention mechanism of the Transformer architecture (especially the decoder) allows batching the ground-truth actions $a_t^{i_0},\dots,a_t^{i_{n-1}}$ from the buffer to predict $a_t^{i_1},\dots,a_t^{i_{n}}$ and to update all policies simultaneously, which significantly improves the training speed and makes training feasible for a large number of agents." A minimal sketch of this parallel prediction is given after A3 below. We have added this clarification to the last paragraph of Section 3 in our revision.

> **Q3** Ablation studies could be helpful to understand the effect of different components. For example, since agents are shuffled at every iteration, the position encoding might not be very relevant and can be discarded. Another example is to use the decoder where observations and actions for all agents $i_{1:m-1}$ are provided as input to the decoder to generate the value and action distribution for agent $i_m$.

**A3** We are very grateful for the insightful suggestion and have conducted further ablation experiments on different encoding approaches and model architectures. Exactly as the reviewer mentioned, we found in early experiments that the positional encoding of the vanilla Transformer [2] is not very relevant for MARL tasks, and it is already discarded in our implementation. Instead, we bind each observation with the corresponding one-hot agent id, which is also applied to the baseline methods for fairness. The ablation results under this anonymous [link](https://anonymous.4open.science/r/MAT_Rebuttal-0ABA/encoding.pdf) emphasize the importance of the agent-id encoding in heterogeneous settings. With the suggestion in mind, we also compared implementations with different model architectures, i.e., encoder-decoder, encoder only, decoder only, and other sequence models (GRU). The corresponding ablation results under this anonymous [link](https://anonymous.4open.science/r/MAT_Rebuttal-0ABA/structure.pdf) confirm the advantage of the Transformer and the necessity of the encoder-decoder architecture.
**Reference:**

[1] Siyi Hu, Fengda Zhu, Xiaojun Chang, and Xiaodan Liang. "UPDeT: Universal multi-agent reinforcement learning via policy decoupling with transformers." arXiv preprint arXiv:2101.08001 (2021).

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pages 5998–6008 (2017).

---

## Reviewer rnUZ

> **Q1** The problem formulation is that of Markov games [1] rather than the stated Dec-POMDPs. The HAPPO paper uses the *Markov game* formulation as well, thus the deployed theoretical results are sound in this setting only. This should be corrected.

**A1** The reviewer is right in pointing this out, and we apologize for this error. We have changed *decentralized Partially Observable Markov Decision Processes (Dec-POMDPs)* to *Markov Games (MGs)* in the revision while keeping the rest of the stated formulation.

> **Q2** MAT achieves a computational-efficiency advantage w.r.t. HAPPO. It would be nice to see the exact performance comparison w.r.t. wall-clock time.

**A2** Thanks so much for the suggestion. The wall-clock time comparison between MAT, MAPPO, and HAPPO over 10M environment steps is provided under this anonymized [link](https://anonymous.4open.science/r/MAT_Rebuttal-0ABA/walltime.pdf). We have also added this comparison to the revised Appendix to better demonstrate the advantage.

> **Q3** Figures 4 & 5, as well as Tables 2 & 3, are illegible due to size.

**A3** This is valuable feedback. With the suggestion in mind, we will readjust the style of all figures and tables to make them more readable in the final version.

**Reference:**

[1] Michael L. Littman. "Markov games as a framework for multi-agent reinforcement learning." ICML 1994.