# Rebuttal for ICLR 8714: Generative Adversarial Inverse Multiagent Learning

# Common Answer

<!-- > From Alec: Reviewer ****, who was most negative regarding the submission, did not raise many technical points of concern. We would like to comment that this is potentially evidence that this individual may not have evaluated the manuscript thoroughly. Therefore, we respectfully request that their review be down-weighted in the final adjudication. -->

<span style="color:blue"> **Summary of contributions**: We provide a computationally efficient (i.e., polynomial-time) min-max optimization characterization of a large class of inverse equilibrium problems in inverse game theory and inverse multiagent learning. Our formulation yields a simple solution to a large number of problems, one of which dates back to the 1940s (namely, the revealed preference problem [1]). We also generalize the apprenticeship learning paradigm to multiagent settings, characterize the solutions to multiagent apprenticeship learning, introduce a novel method to compute such solutions, and demonstrate the effectiveness of our approach at predicting prices in Spanish electricity markets.</span>

<span style="color:blue"> **Common Reviewer Concern**: All reviewers expressed similar confusion regarding the whereabouts of the proofs of our theorems. We apologize for the confusion, which seems to have arisen from the fact that the detailed theorem statements in the appendix were renumbered and, as a result, incorrectly referenced. Specifically, Theorems 3.2, 4.1, and 5.2 were restated in detail as Theorems 6.1, 6.2, and 6.3, respectively. We are uploading together with this rebuttal an updated appendix with corrected references and slightly expanded proofs (e.g., specifying omitted constants). We expect that this corrected appendix will address the reviewers' concerns.</span>

**References**

[1] Paul A. Samuelson. Consumption theory in terms of revealed preference. Economica, 15(60):243–253, 1948.

## Response to Reviewer WrBZ

### Weaknesses

> **W1)** $\bullet$ It would be helpful to expand on the proofs of theorems 6.1, 6.2, and 6.3 in the supplementary material. I know that a reference has been provided, but a slight explanation of the cited result and how it relates to the theorem in question would be nice.

**Response to W1)**: <span style="color:blue"> We refer you to our common answer.</span>

<!-- > Alec says: we should promise to add a proof sketch, or otherwise illuminate some technical points of differentiation from prior analyses that are similar? -->

<!-- > Amy: **DELETE; redundant**
> **Response to W1)**: <span style="color:blue"> We are uploading together with this rebuttal an updated appendix, and more detailed proofs of the results in our paper **NOT OKAY!!**. We have also addressed some theorem referencing issues, which we hope will address your concerns on clarity and correctness</span>. -->

> **W2)** $\bullet$ Although a comparison of the method has been shown with the ARIMA model on the spanish electricity market data, it would be beneficial to have a comparison with prior methods in inverse multi-agent reinforcement learning. Especially in terms of efficiency since it's one of the main points of the paper. The abstract says that the method outperforms other widely-used methods (plural), and we only get to see it being compared with one other model which is specific to time-series data.

<!-- > Alec says: this is a fair point.
We should expand upon what would be reasonable sanity checks, classical IRL approaches, and how ours is distinct? -->

**Response to W2)**: <span style="color:blue"> To our knowledge, all existing methods for inverse multiagent reinforcement learning apply only to finite state and action Markov games, while our electricity market model is a continuous state and action Markov game. As a result, comparing to these methods would require potentially non-trivial extensions. Instead, we chose to compare our method only to a statistical baseline, namely ARIMA. We will adjust the language in the abstract to correct the plural.</span>

> **W3)** $\bullet$ Some comparison/contextualization with prior work in multiagent inverse reinforcement learning would also be helpful.

<!-- > Alec says: do we have an extended related works section in the appendix? If not, we need to promise to add this to the final upload and paste it into openreview -->

**Response to W3)**: <span style="color:blue"> A more extensive related work section is included in the third section of the appendix. There, we summarize related work not only in inverse multiagent reinforcement learning, but also in microeconomics, econometrics, and algorithmic game theory. Thank you for pointing out that a reference to this more detailed related work section is missing; we will correct this oversight in the camera-ready version. </span>

### Questions

> **Q1)** $\bullet$ What does the term $\psi(\pi,\rho;\theta)$ in the "Multiagent Apprenticeship Learning" section expand to? Cannot seem to find a definition anywhere.

<!-- > Alec says: I also couldn't find this... maybe we should add a table to the supplement where we can have a list of all terms and key quantities, just to disambiguate? -->

**Response to Q1)**: <span style="color:blue">$\psi$ is the cumulative regret, defined as $\psi(\mathbf{\pi}, \mathbf{\rho}; \mathbf{\theta}) \doteq \sum_{i \in [n]} u_i\left(\mathbf{\rho}_i, \mathbf{\pi}_{-i}; \mathbf{\theta}\right) - u_i(\mathbf{\pi}; \mathbf{\theta})$. (See page 3, at the end of the paragraph starting with "One-shot games".) Note that this definition extends immediately to Markov games, since any Markov game can be seen as a one-shot game in which the space of actions is taken to be the space of policies. This point is explained further in the sentences preceding Corollary 1.</span>

> **Q2)** $\bullet$ Assuming that Algorithm 3 was used on the spanish electricity market data, how was the observation distribution specified?

**Response to Q2)**: <span style="color:blue"> In the electricity market experiments, the Markov game consists of an electricity seller who sets prices in the day-ahead and spot markets, and $n$ buyers who demand electricity. We use prices for the day-ahead market, prices for the spot market, and aggregate demand (i.e., the sum of the demand across all buyers) of electricity for every hour from 2015 to 2019 as our observations. The observation distribution is then the history distribution associated with this Markov game, pushed forward through a function that outputs sampled price trajectories and the sum of each buyer's individual demand. </span>

## Response to Reviewer Wrco

### Weaknesses

> **W1)** The proofs are not completed, e.g., I cannot find the proofs for Theorem 4.1 and Theorem 5.2.

<!-- > Alec says: is this contained in the supplementary material, especially section 6.2? I think we can refer him to the appropriate subsection, but maybe some sections are missing?
> Amy says: **DELETE; redundant** -->

**Response to W1)**: <span style="color:blue"> We refer you to our common answer.</span>

> **W2)** The presentation can be further improved, e.g., more intuitions about the assumptions and theorems.

<!-- > Alec says: seems like an easy comment to address. Especially some of the discussion is already at the bottom of page 6, so we can expand this -->

**Response to W2)**: <span style="color:blue"> We will try to add additional intuition to the main paper, beyond what already appears at the bottom of page 6 and the top of page 7 (for example), while respecting the space constraints. You can also already find additional discussion in the appendix (see page 16).</span>

### Questions

> **Q2)** Can you further polish the paper? Some typos: for example, in the fifth line of the abstract, should it be "to solve them"?

**Response to Q2)**: <span style="color:blue"> We will do our best to thoroughly proofread the paper again and correct all grammatical, semantic, and syntactic errors in the camera-ready version. </span>

## Response to Reviewer F6LX

### Weaknesses

> **W1)** As an easily rectified issue, Figure 1 could have been better represented by plotting residuals over time or, by subsampling the data, plotting mean residuals with error bars.

<!-- > Alec says: we promise to include error bars associated with 1 standard deviation over a number of trials? Is this possible? Wouldn't this require randomizing over the initial seed or otherwise adding random perturbations to the data? -->

**Response to W1)**: <span style="color:blue"> This is a great suggestion, thank you. We will indeed plot residuals over time, with error bars over the five seeds for which we ran our experiments.</span>

> **W2)** As a minor complaint, I do not prefer the language of "generative-adversarial" (especially not in terms of a "discriminator"), even if this is the closest analogy familiar to machine learning practitioners: This is a standard min-max optimization problem that need not be wed to the ML setting.

**Response to W2)**: <span style="color:blue"> Thank you for this feedback. The "generative" language is meant to allude to the generative model fitting problem within the larger technical development of our approach. That said, we agree that the discriminator language is not quite right. We will give further thought to this concern, perhaps by replacing "generative-adversarial" simply with "min-max". </span>

<!-- > [name=Denizalp Goktas] Not sure of the above, let me know what you think!
>
> [name=Alec Koppel] Seems unnecessary to promise to change it. We can just clarify the meaning of generative-adversarial as minimax, and provide some additional discussion of minimax problems arising in game theory, as well as underscore our motivation for the generative-adversarial language, as it provides enhanced interpretation of the generative model fitting problem within the larger technical development of our algorithm. -->

### Questions

> **Q1**: Remark 1 is indeed interesting, but it is not obvious. Did I miss an associated proof or example?

**Response to Q1**: <span style="color:blue"> This remark follows from a folklore theorem, which states that the set of solutions to a convex-concave min-max optimization problem (i.e., a min-max problem in which the objective is convex-concave and the constraints are non-empty, compact, and convex) is convex. This fact can also be seen as a corollary of the convexity of the set of Nash equilibria in zero-sum, potential, and monotone games.
You can find a proof in Nau et al. [1]; for completeness, we also provide a proof here. </span>

<span style="color:blue"> Consider a convex-concave min-max optimization problem $\min_{\mathbf{x} \in \mathcal{X}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x},\mathbf{y})$, where $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is convex-concave and $\mathcal{X}, \mathcal{Y}$ are non-empty, compact, and convex sets. Let $V(\mathbf{x}) \doteq \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x},\mathbf{y})$. The value function $V$ is convex, since it is the pointwise maximum of a family of convex functions (see, e.g., Danskin [2]). Hence, by Theorem 2.6 of Rockafellar and Wets [3], the set of solutions $\arg \min_{\mathbf{x} \in \mathcal{X}} V(\mathbf{x}) = \arg\min_{\mathbf{x} \in \mathcal{X}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x},\mathbf{y})$ is convex. </span>

<!-- > Alec says (re Response to Q1): this is the second reviewer complaining about lack of proofs... I think the supplementary material may be missing some things by accident? -->

> **Q2**: Why was the proof of Theorem 3.2 omitted? Was it rephrased to appear as Theorem 6.1 in the supplementary material? Establishing the convergence rates of various algorithms is not my expertise, but the results seem reasonable, especially given assumptions of convexity and Lipshitz smoothness.

**Response to Q2**: <span style="color:blue"> We refer you to our common answer.</span>

<!-- > [name=Alec Koppel] include specific page numbers where the requested information can be found in the appendices. -->

**References**

[1] Nau, Robert, Sabrina Gomez Canovas, and Pierre Hansen. "On the geometry of Nash equilibria and correlated equilibria." International Journal of Game Theory 32 (2004): 443–453.

[2] Danskin, John M. "The theory of max-min, with applications." SIAM Journal on Applied Mathematics 14.4 (1966): 641–664.

[3] Rockafellar, R. Tyrrell, and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

---------------------------------

## Response to Reviewer zp62

### Weaknesses

> **W1)** The readability can be improved. I think Section 3 never mentions that it is for the one-shot game setting. There is no methodological contributions.

<!-- > Alec says: promise we will clarify the one-shot setting early on in the manuscript for the final upload -->

**Response to W1)**: <span style="color:blue"> In the second paragraph of Section 3, we define an inverse game as a tuple comprising a one-shot parametric game whose parameter values are missing, together with a Nash equilibrium action profile. We will nonetheless state the one-shot setting more prominently at the beginning of Section 3 in the camera-ready version.</span>

> **W2)** All presented algorithms are simple extension of known algorithms. (On the other hand we should not invent/propose algorithms just for the sake of proposing them).

**Response to W2)**: <span style="color:blue"> As you mention, we have not developed any groundbreaking methodologies, but have instead introduced a novel mathematical characterization of a longstanding problem. Perhaps the strength of our paper lies in its simplicity. </span>

### Questions

> (Q1) Minor remark/question: what is the meaning of that weird symbol in Theorems 3.2, 4.1, 5.2? I assume it means of the same order as (but I don't think this is standard notation.)

**Response to Q1)**: <span style="color:blue"> We believe you are referring to the notation $\eta_{\boldsymbol{y}}^{(t)} \asymp \varepsilon^4$. This notation is equivalent to big-Theta notation, i.e., $\eta_{\boldsymbol{y}}^{(t)} \in \Theta(\varepsilon^4)$: there exist constants $c_1, c_2 > 0$, independent of $\varepsilon$, such that $c_1 \varepsilon^4 \leq \eta_{\boldsymbol{y}}^{(t)} \leq c_2 \varepsilon^4$. An example of the use of this notation can be found in [1].
Regardless, we will add a footnote to clarify. </span>

> (Q2) In Theorem 3.2, it is a little bit surprising that the optimal is obtained by averaging prior solutions instead of the last one. What is the intuition behind averaging (which includes initial solutions that can be of very low quality)? I assume the proof is correct.

<!-- > Alec says: averaging is a standard tool in the analysis of non-strongly convex optimization to obtain bounds on the sub-optimality of the objective function evaluation [cite Bertsekas nonlinear programming, e.g.]. One could get around the use of averaging if one instead rephrased performance guarantees in terms of time-accumulation of sub-optimality, but this is not always a natural performance criterion, and in our setting, it is relatively less intuitive than the performance of the averaged iterates. -->

**Response to Q2)**: <span style="color:blue"> Simple gradient descent ascent methods are not guaranteed to converge in last iterates in min-max optimization problems, and can even diverge (see, for instance, Mertikopoulos et al. [2]). A common remedy is to average the iterates: although the early, low-quality iterates do enter the average, each iterate receives weight only $1/T$, so their effect vanishes as the number of iterations grows and does not affect the convergence guarantee. As we mention in the sentence preceding Theorem 3.2, we can instead obtain last-iterate convergence using extragradient descent ascent or optimistic gradient descent ascent. A minimal numerical sketch illustrating the benefit of averaging is included below, after the references. </span>

**References**

[1] Daskalakis, Constantinos, Dylan J. Foster, and Noah Golowich. "Independent policy gradient methods for competitive reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 5527–5540.

[2] Mertikopoulos, Panayotis, Christos Papadimitriou, and Georgios Piliouras. "Cycles in adversarial regularized learning." In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2703–2717, 2018.
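<span style="color:blue"> For concreteness, here is the numerical sketch referenced in our response to (Q2) above. It is not taken from our paper's code; the bilinear game, step size, and horizon are illustrative choices. It runs projected gradient descent ascent on $f(x, y) = xy$ over the box $[-1, 1]^2$, whose unique saddle point is $(0, 0)$: the last iterate cycles around the saddle point, while the running average of the iterates approaches it.</span>

```python
import numpy as np

# Projected gradient descent ascent (GDA) on the bilinear game f(x, y) = x * y
# over the box [-1, 1]^2, whose unique saddle point is (0, 0). The last iterate
# cycles around the saddle point, while the running average of the iterates
# converges toward it. (Illustrative step size and horizon; not from the paper.)

eta = 0.05           # step size
T = 20_000           # number of iterations
x, y = 0.5, 0.5      # initial point
avg_x, avg_y = 0.0, 0.0

for t in range(1, T + 1):
    gx, gy = y, x                                  # grad_x f = y, grad_y f = x (simultaneous updates)
    x = float(np.clip(x - eta * gx, -1.0, 1.0))    # descent step on x, projected onto [-1, 1]
    y = float(np.clip(y + eta * gy, -1.0, 1.0))    # ascent step on y, projected onto [-1, 1]
    avg_x += (x - avg_x) / t                       # running average of the x-iterates
    avg_y += (y - avg_y) / t                       # running average of the y-iterates

print(f"last iterate:     ({x:+.3f}, {y:+.3f})")          # cycles; stays far from (0, 0)
print(f"averaged iterate: ({avg_x:+.3f}, {avg_y:+.3f})")  # close to (0, 0)
```

<span style="color:blue"> With extragradient or optimistic updates on the same game, the last iterate itself converges to the saddle point, which is why Theorem 3.2 can alternatively be stated for those methods without averaging.</span>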