# Rebuttal of Unified Diversity

### We thank Reviewer qQqw for their three hours of effort in offering constructive comments that will surely help improve our paper.

1. > **Reviewer**: There are parts that lack more detail. For example, the setup of the AlphaStar game is not clear to me at all. Another example of this is the matrix for the non-transitive mixture game should be described a bit more in the main text, to at least get an intuition about it's structure beyond having been "delicately designed"

   * **Response**: We apologize for the lack of clarity. Regarding the AlphaStar game, your understanding is correct. We do not train AlphaStar from scratch; instead, we test our algorithm on the **meta-game** induced by the 888 policies (i.e., agents) generated during the training process of solving AlphaStar, which is provided by [3] (a minimal sketch of how such a meta-game can be evaluated is given after this list of responses). Regarding the non-transitive mixture game, the explicit construction of the "delicately designed" payoff $\mathbf{S}$ is given in Appendix D.1. The intuition behind the construction is that whenever the opponent plays a pure-strategy best response, which corresponds to reaching the center of one of the Gaussian humps, the best response against it is to choose among the remaining Gaussian humps in other directions. As a result, this game involves both strong transitive and non-transitive structures. To achieve low exploitability, an effective population has to exhibit diverse explorative trajectories that cover all directions (see Figure 2).

2. > **Reviewer**: I am not sure about the results on the non-transitive mixture game. Would the modes of the Gaussians not be where we would want the trajectories to end? If this is the case why are none of the algorithms (e.g. PSRO, PSRO-rN) reaching them? I am also surprised there is no cycling in the trajectories, given the cyclic nature of the game.

   * **Response**: The reviewer is correct in the sense that the players must learn to stay close to the Gaussian centroids whilst also exploring all seven Gaussians to avoid being exploited. The reason PSRO and PSRO-rN do not reach the centroids is that they adopt an **approximate** best response at each iteration, computed via gradient descent; it is therefore expected that players approach, but do not fully reach, the exact best response. PSRO fails on this task because it does not use the pipeline trick; the same failure is reported in Figure 3 of [5]. As for PSRO-rN, it is unsurprising that it fails on such tasks, which is studied theoretically in Proposition 3.1 of [4] and empirically in [6]. Regarding the absence of cycling in the trajectories despite the cyclic nature of the game: approaching the center of a Gaussian is driven by the transitive component of the game, whereas the **cyclic** component is revealed by the policies exploring different directions.

3. > **Reviewer**: The first sentence of the paper "zero-sum games involve non-transitivity" is not correct.

   * **Response**: We agree with the Reviewer that not all zero-sum games have non-transitive components. We accept your suggestion and will correct the sentence to "many zero-sum games have a strong non-transitive **component**". This is supported by [2], which proves that a game can generally be decomposed into a transitive component and a non-transitive component. We understand that the non-transitive component can often be zero, as you have pointed out.

4. > **Reviewer**: Related research that might be of interest in this area is a paper on diversity of populations [1].

   * **Response**: Thank you for pointing out this paper; we appreciate it. At a high level, the idea of [1] lies in the realm of Response Diversity in our paper: it works towards using interaction graphs as a more general objective to replace Nash or rectified Nash. We will include this reference in our manuscript.
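As mentioned in our response to point 1, below is a minimal sketch of how a population can be evaluated on such a meta-game, assuming access to the empirical (antisymmetric) payoff matrix of the 888 AlphaStar agents from [3]. The file name, sub-population, and helper usage are hypothetical; this only illustrates the evaluation procedure, not our exact implementation.

```python
import numpy as np
from scipy.optimize import linprog

def meta_nash(payoff):
    """Nash equilibrium of a symmetric zero-sum meta-game via linear programming.
    payoff[i, j] = expected payoff of strategy i against strategy j (antisymmetric)."""
    n = payoff.shape[0]
    # Decision variables: mixture p (n entries) and game value v; minimise -v.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # For every opponent pure strategy j:  v - sum_i p_i * payoff[i, j] <= 0
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]   # sum_i p_i = 1
    b_eq = [1.0]
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

def exploitability(payoff, population, sigma):
    """Gain of the best pure response in the *full* strategy set against the
    meta-Nash mixture `sigma` supported on `population`.  Because the payoff is
    antisymmetric, the Nash value is 0, so this gain is the exploitability."""
    mixture = np.zeros(payoff.shape[0])
    mixture[list(population)] = sigma
    return float(np.max(payoff @ mixture))

# Hypothetical usage on an AlphaStar-style meta-game (888 x 888 payoff from [3]):
# payoff = np.load("alphastar_meta_payoff.npy")    # assumed file name
# population = list(range(50))                     # strategies discovered so far
# sigma = meta_nash(payoff[np.ix_(population, population)])
# print(exploitability(payoff, population, sigma))
```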
Finally, we appreciate your other careful and helpful suggestions (e.g., on typos and writing) and will further revise our manuscript based on them.

[1] Garnelo, Marta, et al. "Pick Your Battles: Interaction Graphs as Population-Level Objectives for Strategic Diversity." Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 2021.
[2] Balduzzi, David, et al. "Re-evaluating evaluation." arXiv preprint arXiv:1806.02643 (2018).
[3] Czarnecki, Wojciech Marian, et al. "Real world games look like spinning tops." arXiv preprint arXiv:2004.09468 (2020).
[4] McAleer, S., Lanier, J., Fox, R., & Baldi, P. "Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games." NeurIPS 2020.
[5] Feng, Xidong, Oliver Slumbers, Yaodong Yang, Ziyu Wan, Bo Liu, Stephen McAleer, Ying Wen, and Jun Wang. "Discovering Multi-Agent Auto-Curricula in Two-Player Zero-Sum Games." arXiv preprint arXiv:2106.02745 (2021).
[6] Perez-Nieves, N., Yang, Y., Slumbers, O., Mguni, D. H., Wen, Y., & Wang, J. "Modelling Behavioural Diversity for Learning in Open-Ended Games." In International Conference on Machine Learning (pp. 8514-8524). PMLR, 2021.

### We thank Reviewer Rpgw for their four hours of effort and the associated constructive comments that will surely help improve our paper.

1. > **Reviewer**: How does this actually differ from RED (or RND even) in practice? Furthermore, once the network is learnt on the dataset, does it remain fixed throughout the duration of the experiment, or does it get periodically re-learnt?

   * **Response**: Both our method and RED are inspired by RND in the way the prediction error is constructed. However, RED is an imitation learning approach for single-agent RL, which means RED uses the prediction error as the only reward signal. In contrast, we model diversity in the regime of population training in multi-agent RL (i.e., the PSRO process), where at each iteration we aim to discover a new, diverse, and effective agent that improves the performance of the whole population. The network learnt on the dataset is **not** fixed throughout the experiment: once a new policy is added, the aggregated Nash policy changes, so the network is re-learnt at the beginning of each PSRO iteration.

2. > **Reviewer**: Shouldn't $-i$ be assignable to a single opponent player? What value does it take in say a game with 3 or 4 players?

   * **Response**: As is common practice in game theory, $i$ denotes a single player and $-i$ denotes the remaining players collectively. Thus, in a game with 3 players, $i$ is one specific player and $-i$ refers to the other two as a whole.

3. > **Reviewer**: Strictly speaking, I don't think the manuscript provides a good overview of the limitations of the proposed framework, nor of any particular weakness in the experimental setting used.

   * **Response**: We apologize for the insufficient discussion of limitations, and we include more discussion here.
     * Regarding limitations of the proposed algorithm, one limitation is that the diversity weights $\lambda_1$ and $\lambda_2$ in our paper are manually tuned. Our methods target the non-transitivity in zero-sum games, so the weights for a game should reflect how strong its non-transitive component is. In future work, we will work towards automatically quantifying the non-transitive component of a given game and determining the diversity weights accordingly. Similar ideas have been tested in single-agent RL for learning the discount rate [2]. In addition, our methods inherit the limitations of the PSRO framework. The advantage of PSRO-based methods also depends on the amount of non-transitivity in the environment [1]. Specifically, if the environment involves few cyclic policies (such as Rock -> Paper -> Scissors), the newest model generated by the naive self-play training paradigm could well be the strongest one (a Nash policy), which implies that naive self-play would suffice to solve the problem. PSRO methods then lose their advantage under such circumstances, since they need extra computational resources to maintain meta-payoffs.
     * Regarding limitations from the perspective of real-world applications, the game dynamics can be complex and a lot of randomness can be involved in real-world games, so the approximate best response trained by reinforcement learning against a given policy can be inaccurate. In the Google Research Football experiment, we empirically save a checkpoint when the model's win-rate is stable (i.e., the change in win-rate is less than 0.05 across two checks, with a check frequency of 1000 model steps) or when training reaches an upper bound of 50000 model steps. These criteria could potentially be improved or studied further for all PSRO-based methods.

4. > **Reviewer**: I would personally enjoy a discussion at the very least on what other types of losses could be potentially steam from utilising the BD / RD decompositions.

   * **Response**: We offer here more details about what other types of losses can stem from BD and RD (two short sketches illustrating the RD and BD objectives follow after this list of responses).
     * Regarding RD, we demonstrate that many current approaches can be unified as convex hull enlargement:
       * PSRO$_{rN}$ [3]: Rectified Nash modifies the original Nash objective, and [3] shows, via the Rock-Paper-Scissors example, that the new objective enlarges the convex hull more efficiently. It also shows empirically that rectified Nash leads to the largest convex hull area in the 2D embedding space.
       * DPP-PSRO [4]: Proposition 9 of [4] proves that the new policy maximizing the diversity-regularized best response strictly enlarges the convex hull; in other words, it can be formulated as a convex hull enlargement problem.

       To unify these, we propose the direct objective of enlarging the convex hull and define **Response Diversity** as a policy's contribution to the convex hull enlargement.
     * Regarding BD, we show that the recently proposed notion of trajectory diversity [5] can be derived from our formulation of BD as an occupancy measure discrepancy. For a policy $\pi_i$, denote the trajectory distribution induced by $\pi_i$ as $q_{\pi_{i}}$ and the occupancy measure as $\rho_{\pi_{i}}$. The trajectory diversity of $(\pi_1, \dots, \pi_n)$ is defined as the generalized JS divergence over the trajectory distributions:
       $$Diversity(\pi_1, \dots, \pi_n)=\text{JSD}(q_{\pi_1}, \dots, q_{\pi_n}).$$
       Considering that
       $$\text{JSD}(q_{\pi_1}, \dots, q_{\pi_n})=\frac{1}{n}\sum_{i=1}^{n}D_{KL}(q_{\pi_i}||q_{\hat{\pi}}), \quad \text{where } q_{\hat{\pi}}=\frac{1}{n}\sum_{i=1}^{n}q_{\pi_i},$$
       and following Theorem 1 of [6], we get
       $$D_{KL}(q_{\pi_i}||q_{\hat{\pi}})\ge D_{KL}(\rho_{\pi_i}||\rho_{\hat{\pi}}).$$
       We can now conclude the following lower bound on trajectory diversity:
       $$\text{JSD}(q_{\pi_1}, \dots, q_{\pi_n})\ge\text{JSD}(\rho_{\pi_1}, \dots, \rho_{\pi_n}).$$
       Hence trajectory diversity is lower bounded by our occupancy-measure-level behavioral diversity, and maximizing trajectory diversity as in [5] can be replaced by maximizing its lower bound, BD.
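As a concrete illustration of the RD view in point 4, the sketch below measures a candidate policy's contribution to convex hull enlargement on 2D-embedded empirical payoff vectors using `scipy.spatial.ConvexHull`. The toy payoff vectors and the 2D embedding are hypothetical; the snippet illustrates the objective rather than our exact implementation.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_area_2d(payoff_vectors):
    """Area of the convex hull of 2D-embedded empirical payoff vectors
    (for 2D inputs, scipy's `ConvexHull.volume` is the enclosed area)."""
    pts = np.asarray(payoff_vectors, dtype=float)
    if len(pts) < 3:
        return 0.0
    return ConvexHull(pts).volume

def response_diversity_gain(population_payoffs, candidate_payoff):
    """Contribution of a candidate policy to convex hull enlargement:
    the increase in hull area after adding its embedded payoff vector."""
    before = hull_area_2d(population_payoffs)
    after = hull_area_2d(np.vstack([population_payoffs, candidate_payoff]))
    return after - before

# Toy usage with hypothetical 2D-embedded payoff vectors:
population = np.array([[0.0, 0.0], [1.0, 0.2], [0.3, 1.0]])
candidate = np.array([1.2, 1.1])
print(response_diversity_gain(population, candidate))   # > 0 means the hull is enlarged
```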
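For the BD side of point 4, the occupancy-measure-level behavioral diversity that serves as the lower bound above can be computed directly whenever the state-action space is finite. A minimal numerical sketch with hypothetical toy occupancy measures:

```python
import numpy as np

def generalized_jsd(distributions, eps=1e-12):
    """Generalized Jensen-Shannon divergence (1/n) * sum_i KL(p_i || p_bar)
    over discrete distributions, e.g. occupancy measures on a finite
    state-action space (each row sums to 1)."""
    p = np.asarray(distributions, dtype=float) + eps
    p = p / p.sum(axis=1, keepdims=True)
    p_bar = p.mean(axis=0)
    kls = np.sum(p * np.log(p / p_bar), axis=1)
    return float(kls.mean())

# Toy usage: three policies' occupancy measures over four state-action pairs.
rho = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
print(generalized_jsd(rho))   # larger value => more behaviorally diverse population
```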
Finally, we appreciate your other careful and helpful suggestions (e.g., on typos and writing) and will further revise our manuscript based on them.

[1] Czarnecki, Wojciech Marian, et al. "Real world games look like spinning tops." arXiv preprint arXiv:2004.09468 (2020).
[2] Xu, Zhongwen, Hado van Hasselt, and David Silver. "Meta-gradient reinforcement learning." arXiv preprint arXiv:1805.09801 (2018).
[3] Balduzzi, David, Marta Garnelo, Yoram Bachrach, Wojciech Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. "Open-ended learning in symmetric zero-sum games." In International Conference on Machine Learning (pp. 434-443). PMLR, 2019.
[4] Perez-Nieves, N., Yang, Y., Slumbers, O., Mguni, D. H., Wen, Y., & Wang, J. "Modelling Behavioural Diversity for Learning in Open-Ended Games." In International Conference on Machine Learning (pp. 8514-8524). PMLR, 2021.
[5] Lupu, A., Cui, B., Hu, H., & Foerster, J. "Trajectory diversity for zero-shot coordination." In International Conference on Machine Learning (pp. 7204-7213). PMLR, 2021.

### We thank Reviewer QZCS for their six hours of effort and the associated constructive comments that will surely help improve our paper.

1. > **Reviewer**: The paper fails to deliver on its core promise, to quote the abstract: "work towards offering a unified measure of diversity".

   * **Response**: Firstly, we admit that our methods do not completely unify BD and RD in one principled and fundamental objective, and the word "unify" risks overclaiming. To correct this, we adopt your suggestion and will modify the title to "Towards Unifying ..." to highlight that our contribution lies in laying the groundwork for the new objective to be discovered. Secondly, we provide some theoretical intuition about the equivalence and difference between BD and RD. BD is defined as the occupancy measure discrepancy, while RD concerns diversity in long-term expected returns. We therefore try to relate the difference between two occupancy measures to the difference in the corresponding long-term expected returns. Consider two policies $\pi_1$, $\pi_2$ with associated occupancy measures $\rho_{\pi_1}$, $\rho_{\pi_2}$. We quantify the occupancy measure discrepancy using the integral probability metric (IPM) [4]:
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})=\sup_{f\in\mathcal{F}}\left|E_{(s,a)\sim \rho_{\pi_1}}[f(s, a)]-E_{(s,a)\sim \rho_{\pi_2}}[f(s, a)]\right|.$$
     If we regard $f(s,a)$ as a reward function of the underlying MDP, [5] tells us that
     $$E_{(s,a)\sim \rho_\pi}[f(s, a)]=\sum_{s,a}\rho_\pi(s,a)f(s, a)=\eta_f(\pi),$$
     where $\eta_f(\pi)$ is the **time-average** expected return of $\pi$ under the reward function $f$. We can then conclude
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})=\sup_{f\in\mathcal{F}}\left|\eta_f(\pi_1)-\eta_f(\pi_2)\right|,$$
     where the left-hand-side difference in occupancy measures is associated with the right-hand-side difference in expected returns. However, it is hard to attain the $\sup$ exactly. An alternative approximation is
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})\approx\max_{f\in\{f_1, \cdots,f_n\}}\left|\eta_f(\pi_1)-\eta_f(\pi_2)\right|,$$
     so if we have a diverse reward function set $\{f_1, \cdots,f_n\}$, the approximation can be very accurate (a minimal sketch of this finite-set approximation is included after this list of responses). Note that this analysis is based on the single-agent setting. Returning to our multi-agent problem, a diverse reward function set can be obtained from a diverse opponent set, since the marginal reward function of the player depends on the opponent policy as
     $$f_i(\mathbf{s}, a_i)=\sum_{a_{-i}}r_i(\mathbf{s}, a_{i}, a_{-i})\pi_{-i}(a_{-i}|\mathbf{s}),$$
     where the left-hand side is the marginalized reward function of player $i$ with the fixed opponent $\pi_{-i}$. Based on this, we argue that over the iterations of our methods the population becomes more diverse, and the gap between the difference in occupancy measures and the difference in expected returns shrinks. Therefore, the effects of BD and RD become increasingly equivalent over iterations.

2. > **Reviewer**: It seems that DPP-PSRO which uses Response Diversity, has a very similar profile to P-PSRO with Behavioral Diversity, and looks very different to P-PSRO with Response Diversity. Is DPP-PSRO classified correctly in Table 1?

   * **Response**: Yes, DPP-PSRO is classified correctly in Table 1. The core idea of DPP-PSRO is to construct a Determinantal Point Process (DPP) from the empirical payoff vectors, and its regularized objective encourages the new policy to increase the expected cardinality of the DPP, which lies in the domain of Response Diversity. We hypothesize that DPP-PSRO looks different from P-PSRO with RD because the DPP objective has much weaker exploratory effects than our RD objective: it does not have a direct relationship with the empirical gamescape and largely overlaps with the ordinary best-response objective in this game. Regarding P-PSRO with BD, the BD objective simply pushes the new policy to be as far as possible from a specific fixed point (the Nash-aggregated policy) in this relatively simple game. The BD objective, built on the occupancy measure discrepancy, is thus less informative here, since there is no complex interaction between the policy and the environment dynamics (which is what the occupancy measure captures), so it does not induce many exploratory effects under this setting either. That is why DPP-PSRO and P-PSRO with BD look similar, and both look like ordinary P-PSRO, given that we use the approximate best response during iterations via gradient descent, which also introduces a few exploratory behaviors for ordinary P-PSRO in Figure 2.

3. > **Reviewer**: In some cases the authors make things a little convoluted, and could explain the theorems in words, as they are often very straightforward observations (e.g. Theorem 1).

   * **Response**: We apologize for the lack of straightforward, intuitive explanations of some statements. The intuition behind Theorem 1 is that, since the game is one-step, the policy and the transition dynamics can be easily decoupled. Therefore, the divergence between occupancy measures simplifies to the divergence between policies, given that the transition dynamics are the same for the two policies (a one-line derivation is sketched after this list of responses). We will add more intuition for the other propositions and further revise our manuscript.

4. > **Reviewer**: The population effectivity metric looks like a robustness objective, but [1] is not mentioned.

   * **Response**: We greatly appreciate the missed reference you mentioned and will include it in our manuscript. We agree that the minimax objective is a common formulation of robustness, and we were indeed inspired by it. We also understand there is extensive literature on the relationship between diversity and robustness; for example, [6] shows the equivalence between solving the minimax problem and promoting diversity via regularization. Regarding the relationship between our metric and the specific objective in [1]: in the minimax objective of [1], the inner minimum is taken over different environment rewards, thus seeking a performance guarantee under the worst environment. In our objective, the minimum is taken over the opponent $\pi_{-i}$, thus seeking a performance guarantee against the strongest opponent, which generalizes along a different degree of freedom than [1].

5. > **Reviewer**: Limitations of the method do not seem to be thoroughly discussed anywhere. The checklist says yes, but I couldn't find it.

   * **Response**: We apologize for the insufficient discussion of limitations, and we include more discussion here.
     * Regarding limitations of the proposed algorithm, one limitation is that the diversity weights $\lambda_1$ and $\lambda_2$ in our paper are manually tuned. Our methods target the non-transitivity in zero-sum games, so the weights for a game should reflect how strong its non-transitive component is. In future work, we will work towards automatically quantifying the non-transitive component of a given game and determining the diversity weights accordingly. Similar ideas have been tested in single-agent RL for learning the discount rate [3]. In addition, our methods inherit the limitations of the PSRO framework. The advantage of PSRO-based methods also depends on the amount of non-transitivity in the environment [2]. Specifically, if the environment involves few cyclic policies (such as Rock -> Paper -> Scissors), the newest model generated by the naive self-play training paradigm could well be the strongest one (a Nash policy), which implies that naive self-play would suffice to solve the problem. PSRO methods then lose their advantage under such circumstances, since they need extra computational resources to maintain meta-payoffs.
     * Regarding limitations from the perspective of real-world applications, the game dynamics can be complex and a lot of randomness can be involved in real-world games, so the approximate best response trained by reinforcement learning against a given policy can be inaccurate. In the Google Research Football experiment, we empirically save a checkpoint when the model's win-rate is stable (i.e., the change in win-rate is less than 0.05 across two checks, with a check frequency of 1000 model steps) or when training reaches an upper bound of 50000 model steps (a short sketch of this stopping criterion is also given after this list of responses). These criteria could potentially be improved or studied further for all PSRO-based methods.
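Returning to point 1, the finite-set approximation of the IPM can be evaluated directly once occupancy measures and reward functions are represented as vectors over a finite state-action space. A minimal sketch with hypothetical toy data (the occupancy measures and reward functions below are made up; in our setting each $f$ would be a reward function marginalized over a fixed opponent $\pi_{-i}$):

```python
import numpy as np

def eta(rho, f):
    """Time-average expected return of a policy with occupancy measure `rho`
    under reward function `f`; both are vectors over state-action pairs."""
    return float(np.dot(rho, f))

def ipm_finite_approx(rho_1, rho_2, reward_fns):
    """Finite-set approximation of the IPM between two occupancy measures:
    max over the given reward functions of |eta_f(pi_1) - eta_f(pi_2)|."""
    return max(abs(eta(rho_1, f) - eta(rho_2, f)) for f in reward_fns)

# Toy usage: 4 state-action pairs and 3 hypothetical marginalized reward functions.
rho_1 = np.array([0.4, 0.3, 0.2, 0.1])
rho_2 = np.array([0.1, 0.2, 0.3, 0.4])
reward_fns = [np.array([1.0, 0.0, 0.0, 0.0]),
              np.array([0.0, 0.5, 0.5, 0.0]),
              np.array([0.2, 0.1, 0.4, 0.9])]
print(ipm_finite_approx(rho_1, rho_2, reward_fns))   # the more diverse the set, the tighter the approximation
```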
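Regarding point 3, the intuition behind Theorem 1 can also be written out in one line. As a sketch (assuming, purely for illustration, a shared initial state distribution $d_0$ and the KL divergence as the discrepancy), in a one-step game the occupancy measure factorizes as $\rho_{\pi}(s,a)=d_0(s)\pi(a|s)$, so
$$D_{KL}(\rho_{\pi_1}||\rho_{\pi_2})=\sum_{s,a}d_0(s)\pi_1(a|s)\log\frac{d_0(s)\pi_1(a|s)}{d_0(s)\pi_2(a|s)}=E_{s\sim d_0}\left[D_{KL}(\pi_1(\cdot|s)||\pi_2(\cdot|s))\right],$$
i.e., in the one-step case the divergence between occupancy measures reduces to the (state-averaged) divergence between the policies themselves.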
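As mentioned in point 5, the checkpointing rule used in the Google Research Football experiment can be summarized in a few lines. A minimal sketch (function and helper names are hypothetical; only the 0.05 win-rate tolerance, the 1000-step check frequency, and the 50000-step cap come from the description above):

```python
def should_save_checkpoint(win_rate_history, step,
                           tol=0.05, check_every=1000, max_steps=50000):
    """Checkpoint when the win-rate is stable, i.e. it changes by less than
    `tol` across two consecutive checks (one check every `check_every` model
    steps), or when training reaches `max_steps` model steps."""
    if step >= max_steps:
        return True
    if step % check_every != 0 or len(win_rate_history) < 2:
        return False
    return abs(win_rate_history[-1] - win_rate_history[-2]) < tol

# Hypothetical usage inside the best-response training loop:
# win_rates = []
# for step in range(1, 50001):
#     ...train the approximate best response for one model step...
#     if step % 1000 == 0:
#         win_rates.append(evaluate_win_rate())   # assumed evaluation helper
#         if should_save_checkpoint(win_rates, step):
#             save_checkpoint()                    # assumed helper
#             break
```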
We appreciate your other careful and helpful suggestions (e.g., on typos and writing) and will further revise our manuscript based on them.

[1] Zahavy, Tom, et al. "Discovering a set of policies for the worst case reward." ICLR 2021.
[2] Czarnecki, Wojciech Marian, et al. "Real world games look like spinning tops." arXiv preprint arXiv:2004.09468 (2020).
[3] Xu, Zhongwen, Hado van Hasselt, and David Silver. "Meta-gradient reinforcement learning." arXiv preprint arXiv:1805.09801 (2018).
[4] Müller, Alfred. "Integral probability metrics and their generating classes of functions." Advances in Applied Probability, 29(2):429-443, 1997.
[5] Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. "Policy gradient methods for reinforcement learning with function approximation." In Advances in Neural Information Processing Systems (pp. 1057-1063), 2000.
[6] Xu, H., & Mannor, S. "Robustness and generalization." Machine Learning, 86(3):391-423, 2012.

### We thank Reviewer 1idw for their two hours of effort and the associated constructive comments that will surely help improve our paper.

1. > **Reviewer**: Many parts of the paper refer to unifying measures of diversity, or providing a unified diversity measure, but my understanding of the word "unifying" seems to be different from what's going on in this paper.

   * **Response**: Firstly, we admit that our methods do not completely unify BD and RD in one principled and fundamental objective, and the word "unify" risks overclaiming. To correct this, we decide to modify the title to "Towards Unifying ..." to highlight that our contribution lies in laying the groundwork for the new objective to be discovered. Secondly, we provide some theoretical intuition about the equivalence and difference between BD and RD. BD is defined as the occupancy measure discrepancy, while RD concerns diversity in long-term expected returns. We therefore try to relate the difference between two occupancy measures to the difference in the corresponding long-term expected returns. Consider two policies $\pi_1$, $\pi_2$ with associated occupancy measures $\rho_{\pi_1}$, $\rho_{\pi_2}$. We quantify the occupancy measure discrepancy using the integral probability metric (IPM) [4]:
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})=\sup_{f\in\mathcal{F}}\left|E_{(s,a)\sim \rho_{\pi_1}}[f(s, a)]-E_{(s,a)\sim \rho_{\pi_2}}[f(s, a)]\right|.$$
     If we regard $f(s,a)$ as a reward function of the underlying MDP, [1] tells us that
     $$E_{(s,a)\sim \rho_\pi}[f(s, a)]=\sum_{s,a}\rho_\pi(s,a)f(s, a)=\eta_f(\pi),$$
     where $\eta_f(\pi)$ is the **time-average** expected return of $\pi$ under the reward function $f$. We can then conclude
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})=\sup_{f\in\mathcal{F}}\left|\eta_f(\pi_1)-\eta_f(\pi_2)\right|,$$
     where the left-hand-side difference in occupancy measures is associated with the right-hand-side difference in expected returns. However, it is hard to attain the $\sup$ exactly. An alternative approximation is
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})\approx\max_{f\in\{f_1, \cdots,f_n\}}\left|\eta_f(\pi_1)-\eta_f(\pi_2)\right|,$$
     so if we have a diverse reward function set $\{f_1, \cdots,f_n\}$, the approximation can be very accurate. Note that this analysis is based on the single-agent setting. Returning to our multi-agent problem, a diverse reward function set can be obtained from a diverse opponent set, since the marginal reward function of the player depends on the opponent policy as
     $$f_i(\mathbf{s}, a_i)=\sum_{a_{-i}}r_i(\mathbf{s}, a_{i}, a_{-i})\pi_{-i}(a_{-i}|\mathbf{s}),$$
     where the left-hand side is the marginalized reward function of player $i$ for the fixed opponent $\pi_{-i}$. Based on this, we argue that over the iterations of our methods the population becomes more diverse, and the gap between the difference in occupancy measures and the difference in expected returns shrinks. Therefore, the effects of BD and RD become increasingly equivalent over iterations.

2. > **Reviewer**: In tables of results, such as Table 2, when some results are presented in boldface this is often interpreted as those results being the best of their row with statistical significance. But in Table 2, it appears that all the best results of every row are presented in boldface, even if their confidence intervals (or are they standard errors?) are overlapping.

   * **Response**: We admit that the performance of different algorithms may overlap in this game. We will follow your advice regarding how the prominent results are highlighted.

Finally, we appreciate your other careful and helpful suggestions (e.g., on typos and writing) and will further revise our manuscript based on them.

[1] Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. "Policy gradient methods for reinforcement learning with function approximation." In Advances in Neural Information Processing Systems (pp. 1057-1063), 2000.
### Response to Reviewer qQqw

Thank you for your great suggestions; we have added clarifications of the missing points in the revision of our manuscript.

### Response to Reviewer Rpgw

Thank you for your great suggestions; we have added clarifications of the missing points in the revision of our manuscript.

### Response to Reviewer QZCS

We now fully acknowledge that your points regarding the minimax objective are correct: the environment and the opponent can indeed be equivalent under certain settings. We will also try to include more clarifications in the revision.
