# Rebuttal of Unified Diversity

### We thank Reviewer qQqw for their three hours of effort in offering constructive comments that will surely help improve our paper.

1. > **Reviewer**: There are parts that lack more detail. For example, the setup of the AlphaStar game is not clear to me at all. Another example of this is the matrix for the non-transitive mixture game should be described a bit more in the main text, to at least get an intuition about it's structure beyond having been "delicately designed"

   * **Response**: We apologize for the lack of clarity. Regarding the AlphaStar game, your understanding is correct. We do not train AlphaStar from scratch; instead, we test our algorithm on the **meta-game** induced by the 888 policies (i.e., agents) generated during the training process of solving AlphaStar, which is provided by [3] (a minimal sketch of how such a meta-game can be evaluated is given after this list of responses). Regarding the non-transitive mixture game, the explicit construction of the "delicately designed" payoff $\mathbf{S}$ is given in Appendix D.1. The intuition behind the construction is that whenever the opponent plays a pure-strategy best response, which corresponds to reaching the center of one of the Gaussian humps, the best response against it is to choose among the remaining Gaussian humps in other directions. As a result, this game involves both strong transitive and non-transitive structures. To achieve low exploitability, an effective population has to exhibit diverse explorative trajectories that cover all directions (see Figure 2).

2. > **Reviewer**: I am not sure about the results on the non-transitive mixture game. Would the modes of the Gaussians not be where we would want the trajectories to end? If this is the case why are none of the algorithms (e.g. PSRO, PSRO-rN) reaching them? I am also surprised there is no cycling in the trajectories, given the cyclic nature of the game.

   * **Response**: The reviewer is correct in the sense that the players must learn to stay close to the Gaussian centroids whilst also exploring all seven Gaussians to avoid being exploited. The reason PSRO and PSRO-rN do not reach the centroids is that they adopt an **approximate** best response at each iteration, computed via gradient descent; it is therefore expected that players approach, but do not fully reach, the exact best response. PSRO fails on this task because it does not use the pipeline trick; the same failure is reported in Figure 3 of [5]. As for PSRO-rN, it is unsurprising that it fails on such tasks, which is studied theoretically in Proposition 3.1 of [4] and empirically in [6]. Regarding the absence of cycling in the trajectories despite the cyclic nature of the game: approaching the center of a Gaussian is driven by the transitive component of the game, whereas the **cyclic** component is revealed by the policies exploring different directions.

3. > **Reviewer**: The first sentence of the paper "zero-sum games involve non-transitivity" is not correct.

   * **Response**: We agree with the Reviewer that not all zero-sum games have non-transitive components. We accept your suggestion and will correct the sentence to "many zero-sum games have a strong non-transitive **component**". This is supported by [2], which proves that a game can generally be decomposed into a transitive component and a non-transitive component. We understand that the non-transitive component can often be zero, as you have pointed out.

4. > **Reviewer**: Related research that might be of interest in this area is a paper on diversity of populations [1].

   * **Response**: Thank you for pointing out this paper; we appreciate it. At a high level, the idea of [1] lies in the realm of Response Diversity in our paper: it works towards using interaction graphs as a more general objective to replace Nash or rectified Nash. We will include this reference in our manuscript.
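As mentioned in our response to point 1, below is a minimal sketch of how a population can be evaluated on such a meta-game, assuming access to the empirical (antisymmetric) payoff matrix of the 888 AlphaStar agents from [3]. The file name, sub-population, and helper usage are hypothetical; this only illustrates the evaluation procedure, not our exact implementation.

```python
import numpy as np
from scipy.optimize import linprog

def meta_nash(payoff):
    """Nash equilibrium of a symmetric zero-sum meta-game via linear programming.
    payoff[i, j] = expected payoff of strategy i against strategy j (antisymmetric)."""
    n = payoff.shape[0]
    # Decision variables: mixture p (n entries) and game value v; minimise -v.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # For every opponent pure strategy j:  v - sum_i p_i * payoff[i, j] <= 0
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]   # sum_i p_i = 1
    b_eq = [1.0]
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

def exploitability(payoff, population, sigma):
    """Gain of the best pure response in the *full* strategy set against the
    meta-Nash mixture `sigma` supported on `population`.  Because the payoff is
    antisymmetric, the Nash value is 0, so this gain is the exploitability."""
    mixture = np.zeros(payoff.shape[0])
    mixture[list(population)] = sigma
    return float(np.max(payoff @ mixture))

# Hypothetical usage on an AlphaStar-style meta-game (888 x 888 payoff from [3]):
# payoff = np.load("alphastar_meta_payoff.npy")    # assumed file name
# population = list(range(50))                     # strategies discovered so far
# sigma = meta_nash(payoff[np.ix_(population, population)])
# print(exploitability(payoff, population, sigma))
```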
Finally, we appreciate your other careful and helpful suggestions (e.g., on typos and writing) and will further revise our manuscript based on them.

[1] Garnelo, Marta, et al. "Pick Your Battles: Interaction Graphs as Population-Level Objectives for Strategic Diversity." Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 2021.
[2] Balduzzi, David, et al. "Re-evaluating evaluation." arXiv preprint arXiv:1806.02643 (2018).
[3] Czarnecki, Wojciech Marian, et al. "Real world games look like spinning tops." arXiv preprint arXiv:2004.09468 (2020).
[4] McAleer, S., Lanier, J., Fox, R., & Baldi, P. "Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games." NeurIPS 2020.
[5] Feng, Xidong, Oliver Slumbers, Yaodong Yang, Ziyu Wan, Bo Liu, Stephen McAleer, Ying Wen, and Jun Wang. "Discovering Multi-Agent Auto-Curricula in Two-Player Zero-Sum Games." arXiv preprint arXiv:2106.02745 (2021).
[6] Perez-Nieves, N., Yang, Y., Slumbers, O., Mguni, D. H., Wen, Y., & Wang, J. "Modelling Behavioural Diversity for Learning in Open-Ended Games." In International Conference on Machine Learning (pp. 8514-8524). PMLR, 2021.

### We thank Reviewer Rpgw for their four hours of effort and the associated constructive comments that will surely help improve our paper.

1. > **Reviewer**: How does this actually differ from RED (or RND even) in practice? Furthermore, once the network is learnt on the dataset, does it remain fixed throughout the duration of the experiment, or does it get periodically re-learnt?

   * **Response**: Both our method and RED are inspired by RND in the way the prediction error is constructed. However, RED is an imitation learning approach for single-agent RL, which means RED uses the prediction error as the only reward signal. In contrast, we model diversity in the regime of population training in multi-agent RL (i.e., the PSRO process), where at each iteration we aim to discover a new, diverse, and effective agent that improves the performance of the whole population. The network learnt on the dataset is **not** fixed throughout the experiment: once a new policy is added, the aggregated Nash policy changes, so the network is re-learnt at the beginning of each PSRO iteration.

2. > **Reviewer**: Shouldn't $-i$ be assignable to a single opponent player? What value does it take in say a game with 3 or 4 players?

   * **Response**: As is common practice in game theory, $i$ denotes a single player and $-i$ denotes the remaining players collectively. Thus, in a game with 3 players, $i$ is one specific player and $-i$ refers to the other two as a whole.

3. > **Reviewer**: Strictly speaking, I don't think the manuscript provides a good overview of the limitations of the proposed framework, nor of any particular weakness in the experimental setting used.

   * **Response**: We apologize for the insufficient discussion of limitations, and we include more discussion here.
     * Regarding limitations of the proposed algorithm, one limitation is that the diversity weights $\lambda_1$ and $\lambda_2$ in our paper are manually tuned. Our methods target the non-transitivity in zero-sum games, so the weights for a game should reflect how strong its non-transitive component is. In future work, we will work towards automatically quantifying the non-transitive component of a given game and determining the diversity weights accordingly. Similar ideas have been tested in single-agent RL for learning the discount rate [2]. In addition, our methods inherit the limitations of the PSRO framework. The advantage of PSRO-based methods also depends on the amount of non-transitivity in the environment [1]. Specifically, if the environment involves few cyclic policies (such as Rock -> Paper -> Scissors), the newest model generated by the naive self-play training paradigm could well be the strongest one (a Nash policy), which implies that naive self-play would suffice to solve the problem. PSRO methods then lose their advantage under such circumstances, since they need extra computational resources to maintain meta-payoffs.
     * Regarding limitations from the perspective of real-world applications, the game dynamics can be complex and a lot of randomness can be involved in real-world games, so the approximate best response trained by reinforcement learning against a given policy can be inaccurate. In the Google Research Football experiment, we empirically save a checkpoint when the model's win-rate is stable (i.e., the change in win-rate is less than 0.05 across two checks, with a check frequency of 1000 model steps) or when training reaches an upper bound of 50000 model steps. These criteria could potentially be improved or studied further for all PSRO-based methods.

4. > **Reviewer**: I would personally enjoy a discussion at the very least on what other types of losses could be potentially steam from utilising the BD / RD decompositions.

   * **Response**: We offer here more details about what other types of losses can stem from BD and RD (two short sketches illustrating the RD and BD objectives follow after this list of responses).
     * Regarding RD, we demonstrate that many current approaches can be unified as convex hull enlargement:
       * PSRO$_{rN}$ [3]: Rectified Nash modifies the original Nash objective, and [3] shows, via the Rock-Paper-Scissors example, that the new objective enlarges the convex hull more efficiently. It also shows empirically that rectified Nash leads to the largest convex hull area in the 2D embedding space.
       * DPP-PSRO [4]: Proposition 9 of [4] proves that the new policy maximizing the diversity-regularized best response strictly enlarges the convex hull; in other words, it can be formulated as a convex hull enlargement problem.

       To unify these, we propose the direct objective of enlarging the convex hull and define **Response Diversity** as a policy's contribution to the convex hull enlargement.
     * Regarding BD, we show that the recently proposed notion of trajectory diversity [5] can be derived from our formulation of BD as an occupancy measure discrepancy. For a policy $\pi_i$, denote the trajectory distribution induced by $\pi_i$ as $q_{\pi_{i}}$ and the occupancy measure as $\rho_{\pi_{i}}$. The trajectory diversity of $(\pi_1, \dots, \pi_n)$ is defined as the generalized JS divergence over the trajectory distributions:
       $$Diversity(\pi_1, \dots, \pi_n)=\text{JSD}(q_{\pi_1}, \dots, q_{\pi_n}).$$
       Considering that
       $$\text{JSD}(q_{\pi_1}, \dots, q_{\pi_n})=\frac{1}{n}\sum_{i=1}^{n}D_{KL}(q_{\pi_i}||q_{\hat{\pi}}), \quad \text{where } q_{\hat{\pi}}=\frac{1}{n}\sum_{i=1}^{n}q_{\pi_i},$$
       and following Theorem 1 of [6], we get
       $$D_{KL}(q_{\pi_i}||q_{\hat{\pi}})\ge D_{KL}(\rho_{\pi_i}||\rho_{\hat{\pi}}).$$
       We can now conclude the following lower bound on trajectory diversity:
       $$\text{JSD}(q_{\pi_1}, \dots, q_{\pi_n})\ge\text{JSD}(\rho_{\pi_1}, \dots, \rho_{\pi_n}).$$
       Hence trajectory diversity is lower bounded by our occupancy-measure-level behavioral diversity, and maximizing trajectory diversity as in [5] can be replaced by maximizing its lower bound, BD.
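As a concrete illustration of the RD view in point 4, the sketch below measures a candidate policy's contribution to convex hull enlargement on 2D-embedded empirical payoff vectors using `scipy.spatial.ConvexHull`. The toy payoff vectors and the 2D embedding are hypothetical; the snippet illustrates the objective rather than our exact implementation.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_area_2d(payoff_vectors):
    """Area of the convex hull of 2D-embedded empirical payoff vectors
    (for 2D inputs, scipy's `ConvexHull.volume` is the enclosed area)."""
    pts = np.asarray(payoff_vectors, dtype=float)
    if len(pts) < 3:
        return 0.0
    return ConvexHull(pts).volume

def response_diversity_gain(population_payoffs, candidate_payoff):
    """Contribution of a candidate policy to convex hull enlargement:
    the increase in hull area after adding its embedded payoff vector."""
    before = hull_area_2d(population_payoffs)
    after = hull_area_2d(np.vstack([population_payoffs, candidate_payoff]))
    return after - before

# Toy usage with hypothetical 2D-embedded payoff vectors:
population = np.array([[0.0, 0.0], [1.0, 0.2], [0.3, 1.0]])
candidate = np.array([1.2, 1.1])
print(response_diversity_gain(population, candidate))   # > 0 means the hull is enlarged
```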
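For the BD side of point 4, the occupancy-measure-level behavioral diversity that serves as the lower bound above can be computed directly whenever the state-action space is finite. A minimal numerical sketch with hypothetical toy occupancy measures:

```python
import numpy as np

def generalized_jsd(distributions, eps=1e-12):
    """Generalized Jensen-Shannon divergence (1/n) * sum_i KL(p_i || p_bar)
    over discrete distributions, e.g. occupancy measures on a finite
    state-action space (each row sums to 1)."""
    p = np.asarray(distributions, dtype=float) + eps
    p = p / p.sum(axis=1, keepdims=True)
    p_bar = p.mean(axis=0)
    kls = np.sum(p * np.log(p / p_bar), axis=1)
    return float(kls.mean())

# Toy usage: three policies' occupancy measures over four state-action pairs.
rho = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
print(generalized_jsd(rho))   # larger value => more behaviorally diverse population
```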
Finally, we appreciate your other careful and helpful suggestions (e.g., on typos and writing) and will further revise our manuscript based on them.

[1] Czarnecki, Wojciech Marian, et al. "Real world games look like spinning tops." arXiv preprint arXiv:2004.09468 (2020).
[2] Xu, Zhongwen, Hado van Hasselt, and David Silver. "Meta-gradient reinforcement learning." arXiv preprint arXiv:1805.09801 (2018).
[3] Balduzzi, David, Marta Garnelo, Yoram Bachrach, Wojciech Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. "Open-ended learning in symmetric zero-sum games." In International Conference on Machine Learning (pp. 434-443). PMLR, 2019.
[4] Perez-Nieves, N., Yang, Y., Slumbers, O., Mguni, D. H., Wen, Y., & Wang, J. "Modelling Behavioural Diversity for Learning in Open-Ended Games." In International Conference on Machine Learning (pp. 8514-8524). PMLR, 2021.
[5] Lupu, A., Cui, B., Hu, H., & Foerster, J. "Trajectory diversity for zero-shot coordination." In International Conference on Machine Learning (pp. 7204-7213). PMLR, 2021.

### We thank Reviewer QZCS for their six hours of effort and the associated constructive comments that will surely help improve our paper.

1. > **Reviewer**: The paper fails to deliver on its core promise, to quote the abstract: "work towards offering a unified measure of diversity".

   * **Response**: Firstly, we admit that our methods do not completely unify BD and RD in one principled and fundamental objective, and the word "unify" risks overclaiming. To correct this, we adopt your suggestion and will modify the title to "Towards Unifying ..." to highlight that our contribution lies in laying the groundwork for the new objective to be discovered. Secondly, we provide some theoretical intuition about the equivalence and difference between BD and RD. BD is defined as the occupancy measure discrepancy, while RD concerns diversity in long-term expected returns. We therefore try to relate the difference between two occupancy measures to the difference in the corresponding long-term expected returns. Consider two policies $\pi_1$, $\pi_2$ with associated occupancy measures $\rho_{\pi_1}$, $\rho_{\pi_2}$. We quantify the occupancy measure discrepancy using the integral probability metric (IPM) [4]:
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})=\sup_{f\in\mathcal{F}}\left|E_{(s,a)\sim \rho_{\pi_1}}[f(s, a)]-E_{(s,a)\sim \rho_{\pi_2}}[f(s, a)]\right|.$$
     If we regard $f(s,a)$ as a reward function of the underlying MDP, [5] tells us that
     $$E_{(s,a)\sim \rho_\pi}[f(s, a)]=\sum_{s,a}\rho_\pi(s,a)f(s, a)=\eta_f(\pi),$$
     where $\eta_f(\pi)$ is the **time-average** expected return of $\pi$ under the reward function $f$. We can then conclude
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})=\sup_{f\in\mathcal{F}}\left|\eta_f(\pi_1)-\eta_f(\pi_2)\right|,$$
     where the left-hand-side difference in occupancy measures is associated with the right-hand-side difference in expected returns. However, it is hard to attain the $\sup$ exactly. An alternative approximation is
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})\approx\max_{f\in\{f_1, \cdots,f_n\}}\left|\eta_f(\pi_1)-\eta_f(\pi_2)\right|,$$
     so if we have a diverse reward function set $\{f_1, \cdots,f_n\}$, the approximation can be very accurate (a minimal sketch of this finite-set approximation is included after this list of responses). Note that this analysis is based on the single-agent setting. Returning to our multi-agent problem, a diverse reward function set can be obtained from a diverse opponent set, since the marginal reward function of the player depends on the opponent policy as
     $$f_i(\mathbf{s}, a_i)=\sum_{a_{-i}}r_i(\mathbf{s}, a_{i}, a_{-i})\pi_{-i}(a_{-i}|\mathbf{s}),$$
     where the left-hand side is the marginalized reward function of player $i$ with the fixed opponent $\pi_{-i}$. Based on this, we argue that over the iterations of our methods the population becomes more diverse, and the gap between the difference in occupancy measures and the difference in expected returns shrinks. Therefore, the effects of BD and RD become increasingly equivalent over iterations.

2. > **Reviewer**: It seems that DPP-PSRO which uses Response Diversity, has a very similar profile to P-PSRO with Behavioral Diversity, and looks very different to P-PSRO with Response Diversity. Is DPP-PSRO classified correctly in Table 1?

   * **Response**: Yes, DPP-PSRO is classified correctly in Table 1. The core idea of DPP-PSRO is to construct a Determinantal Point Process (DPP) from the empirical payoff vectors, and its regularized objective encourages the new policy to increase the expected cardinality of the DPP, which lies in the domain of Response Diversity. We hypothesize that DPP-PSRO looks different from P-PSRO with RD because the DPP objective has much weaker exploratory effects than our RD objective: it does not have a direct relationship with the empirical gamescape and largely overlaps with the ordinary best-response objective in this game. Regarding P-PSRO with BD, the BD objective simply pushes the new policy to be as far as possible from a specific fixed point (the Nash-aggregated policy) in this relatively simple game. The BD objective, built on the occupancy measure discrepancy, is thus less informative here, since there is no complex interaction between the policy and the environment dynamics (which is what the occupancy measure captures), so it does not induce many exploratory effects under this setting either. That is why DPP-PSRO and P-PSRO with BD look similar, and both look like ordinary P-PSRO, given that we use the approximate best response during iterations via gradient descent, which also introduces a few exploratory behaviors for ordinary P-PSRO in Figure 2.

3. > **Reviewer**: In some cases the authors make things a little convoluted, and could explain the theorems in words, as they are often very straightforward observations (e.g. Theorem 1).

   * **Response**: We apologize for the lack of straightforward, intuitive explanations of some statements. The intuition behind Theorem 1 is that, since the game is one-step, the policy and the transition dynamics can be easily decoupled. Therefore, the divergence between occupancy measures simplifies to the divergence between policies, given that the transition dynamics are the same for the two policies (a one-line derivation is sketched after this list of responses). We will add more intuition for the other propositions and further revise our manuscript.

4. > **Reviewer**: The population effectivity metric looks like a robustness objective, but [1] is not mentioned.

   * **Response**: We greatly appreciate the missed reference you mentioned and will include it in our manuscript. We agree that the minimax objective is a common formulation of robustness, and we were indeed inspired by it. We also understand there is extensive literature on the relationship between diversity and robustness; for example, [6] shows the equivalence between solving the minimax problem and promoting diversity via regularization. Regarding the relationship between our metric and the specific objective in [1]: in the minimax objective of [1], the inner minimum is taken over different environment rewards, thus seeking a performance guarantee under the worst environment. In our objective, the minimum is taken over the opponent $\pi_{-i}$, thus seeking a performance guarantee against the strongest opponent, which generalizes along a different degree of freedom than [1].

5. > **Reviewer**: Limitations of the method do not seem to be thoroughly discussed anywhere. The checklist says yes, but I couldn't find it.

   * **Response**: We apologize for the insufficient discussion of limitations, and we include more discussion here.
     * Regarding limitations of the proposed algorithm, one limitation is that the diversity weights $\lambda_1$ and $\lambda_2$ in our paper are manually tuned. Our methods target the non-transitivity in zero-sum games, so the weights for a game should reflect how strong its non-transitive component is. In future work, we will work towards automatically quantifying the non-transitive component of a given game and determining the diversity weights accordingly. Similar ideas have been tested in single-agent RL for learning the discount rate [3]. In addition, our methods inherit the limitations of the PSRO framework. The advantage of PSRO-based methods also depends on the amount of non-transitivity in the environment [2]. Specifically, if the environment involves few cyclic policies (such as Rock -> Paper -> Scissors), the newest model generated by the naive self-play training paradigm could well be the strongest one (a Nash policy), which implies that naive self-play would suffice to solve the problem. PSRO methods then lose their advantage under such circumstances, since they need extra computational resources to maintain meta-payoffs.
     * Regarding limitations from the perspective of real-world applications, the game dynamics can be complex and a lot of randomness can be involved in real-world games, so the approximate best response trained by reinforcement learning against a given policy can be inaccurate. In the Google Research Football experiment, we empirically save a checkpoint when the model's win-rate is stable (i.e., the change in win-rate is less than 0.05 across two checks, with a check frequency of 1000 model steps) or when training reaches an upper bound of 50000 model steps (a short sketch of this stopping criterion is also given after this list of responses). These criteria could potentially be improved or studied further for all PSRO-based methods.
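Returning to point 1, the finite-set approximation of the IPM can be evaluated directly once occupancy measures and reward functions are represented as vectors over a finite state-action space. A minimal sketch with hypothetical toy data (the occupancy measures and reward functions below are made up; in our setting each $f$ would be a reward function marginalized over a fixed opponent $\pi_{-i}$):

```python
import numpy as np

def eta(rho, f):
    """Time-average expected return of a policy with occupancy measure `rho`
    under reward function `f`; both are vectors over state-action pairs."""
    return float(np.dot(rho, f))

def ipm_finite_approx(rho_1, rho_2, reward_fns):
    """Finite-set approximation of the IPM between two occupancy measures:
    max over the given reward functions of |eta_f(pi_1) - eta_f(pi_2)|."""
    return max(abs(eta(rho_1, f) - eta(rho_2, f)) for f in reward_fns)

# Toy usage: 4 state-action pairs and 3 hypothetical marginalized reward functions.
rho_1 = np.array([0.4, 0.3, 0.2, 0.1])
rho_2 = np.array([0.1, 0.2, 0.3, 0.4])
reward_fns = [np.array([1.0, 0.0, 0.0, 0.0]),
              np.array([0.0, 0.5, 0.5, 0.0]),
              np.array([0.2, 0.1, 0.4, 0.9])]
print(ipm_finite_approx(rho_1, rho_2, reward_fns))   # the more diverse the set, the tighter the approximation
```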
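Regarding point 3, the intuition behind Theorem 1 can also be written out in one line. As a sketch (assuming, purely for illustration, a shared initial state distribution $d_0$ and the KL divergence as the discrepancy), in a one-step game the occupancy measure factorizes as $\rho_{\pi}(s,a)=d_0(s)\pi(a|s)$, so
$$D_{KL}(\rho_{\pi_1}||\rho_{\pi_2})=\sum_{s,a}d_0(s)\pi_1(a|s)\log\frac{d_0(s)\pi_1(a|s)}{d_0(s)\pi_2(a|s)}=E_{s\sim d_0}\left[D_{KL}(\pi_1(\cdot|s)||\pi_2(\cdot|s))\right],$$
i.e., in the one-step case the divergence between occupancy measures reduces to the (state-averaged) divergence between the policies themselves.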
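As mentioned in point 5, the checkpointing rule used in the Google Research Football experiment can be summarized in a few lines. A minimal sketch (function and helper names are hypothetical; only the 0.05 win-rate tolerance, the 1000-step check frequency, and the 50000-step cap come from the description above):

```python
def should_save_checkpoint(win_rate_history, step,
                           tol=0.05, check_every=1000, max_steps=50000):
    """Checkpoint when the win-rate is stable, i.e. it changes by less than
    `tol` across two consecutive checks (one check every `check_every` model
    steps), or when training reaches `max_steps` model steps."""
    if step >= max_steps:
        return True
    if step % check_every != 0 or len(win_rate_history) < 2:
        return False
    return abs(win_rate_history[-1] - win_rate_history[-2]) < tol

# Hypothetical usage inside the best-response training loop:
# win_rates = []
# for step in range(1, 50001):
#     ...train the approximate best response for one model step...
#     if step % 1000 == 0:
#         win_rates.append(evaluate_win_rate())   # assumed evaluation helper
#         if should_save_checkpoint(win_rates, step):
#             save_checkpoint()                    # assumed helper
#             break
```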
We appreciate your other careful and helpful suggestions (e.g., on typos and writing) and will further revise our manuscript based on them.

[1] Zahavy, Tom, et al. "Discovering a set of policies for the worst case reward." ICLR 2021.
[2] Czarnecki, Wojciech Marian, et al. "Real world games look like spinning tops." arXiv preprint arXiv:2004.09468 (2020).
[3] Xu, Zhongwen, Hado van Hasselt, and David Silver. "Meta-gradient reinforcement learning." arXiv preprint arXiv:1805.09801 (2018).
[4] Müller, Alfred. "Integral probability metrics and their generating classes of functions." Advances in Applied Probability, 29(2):429-443, 1997.
[5] Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. "Policy gradient methods for reinforcement learning with function approximation." In Advances in Neural Information Processing Systems (pp. 1057-1063), 2000.
[6] Xu, H., & Mannor, S. "Robustness and generalization." Machine Learning, 86(3):391-423, 2012.

### We thank Reviewer 1idw for their two hours of effort and the associated constructive comments that will surely help improve our paper.

1. > **Reviewer**: Many parts of the paper refer to unifying measures of diversity, or providing a unified diversity measure, but my understanding of the word "unifying" seems to be different from what's going on in this paper.

   * **Response**: Firstly, we admit that our methods do not completely unify BD and RD in one principled and fundamental objective, and the word "unify" risks overclaiming. To correct this, we decide to modify the title to "Towards Unifying ..." to highlight that our contribution lies in laying the groundwork for the new objective to be discovered. Secondly, we provide some theoretical intuition about the equivalence and difference between BD and RD. BD is defined as the occupancy measure discrepancy, while RD concerns diversity in long-term expected returns. We therefore try to relate the difference between two occupancy measures to the difference in the corresponding long-term expected returns. Consider two policies $\pi_1$, $\pi_2$ with associated occupancy measures $\rho_{\pi_1}$, $\rho_{\pi_2}$. We quantify the occupancy measure discrepancy using the integral probability metric (IPM) [4]:
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})=\sup_{f\in\mathcal{F}}\left|E_{(s,a)\sim \rho_{\pi_1}}[f(s, a)]-E_{(s,a)\sim \rho_{\pi_2}}[f(s, a)]\right|.$$
     If we regard $f(s,a)$ as a reward function of the underlying MDP, [1] tells us that
     $$E_{(s,a)\sim \rho_\pi}[f(s, a)]=\sum_{s,a}\rho_\pi(s,a)f(s, a)=\eta_f(\pi),$$
     where $\eta_f(\pi)$ is the **time-average** expected return of $\pi$ under the reward function $f$. We can then conclude
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})=\sup_{f\in\mathcal{F}}\left|\eta_f(\pi_1)-\eta_f(\pi_2)\right|,$$
     where the left-hand-side difference in occupancy measures is associated with the right-hand-side difference in expected returns. However, it is hard to attain the $\sup$ exactly. An alternative approximation is
     $$d_{\mathcal{F}}(\rho_{\pi_1}, \rho_{\pi_2})\approx\max_{f\in\{f_1, \cdots,f_n\}}\left|\eta_f(\pi_1)-\eta_f(\pi_2)\right|,$$
     so if we have a diverse reward function set $\{f_1, \cdots,f_n\}$, the approximation can be very accurate. Note that this analysis is based on the single-agent setting. Returning to our multi-agent problem, a diverse reward function set can be obtained from a diverse opponent set, since the marginal reward function of the player depends on the opponent policy as
     $$f_i(\mathbf{s}, a_i)=\sum_{a_{-i}}r_i(\mathbf{s}, a_{i}, a_{-i})\pi_{-i}(a_{-i}|\mathbf{s}),$$
     where the left-hand side is the marginalized reward function of player $i$ for the fixed opponent $\pi_{-i}$. Based on this, we argue that over the iterations of our methods the population becomes more diverse, and the gap between the difference in occupancy measures and the difference in expected returns shrinks. Therefore, the effects of BD and RD become increasingly equivalent over iterations.

2. > **Reviewer**: In tables of results, such as Table 2, when some results are presented in boldface this is often interpreted as those results being the best of their row with statistical significance. But in Table 2, it appears that all the best results of every row are presented in boldface, even if their confidence intervals (or are they standard errors?) are overlapping.

   * **Response**: We admit that the performance of different algorithms may overlap in this game. We will follow your advice regarding how the prominent results are highlighted.

Finally, we appreciate your other careful and helpful suggestions (e.g., on typos and writing) and will further revise our manuscript based on them.

[1] Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. "Policy gradient methods for reinforcement learning with function approximation." In Advances in Neural Information Processing Systems (pp. 1057-1063), 2000.
### Response to Reviewer qQqw

Thank you for your great suggestions; we have added clarifications of the missing points in the revision of our manuscript.

### Response to Reviewer Rpgw

Thank you for your great suggestions; we have added clarifications of the missing points in the revision of our manuscript.

### Response to Reviewer QZCS

We now fully acknowledge that your points regarding the minimax objective are correct: the environment and the opponent can indeed be equivalent under certain settings. We will also try to include more clarifications in the revision.
