# Rebuttal for ICLR 8714: Generative Adversarial Inverse Multiagent Learning

# Common Answer

<!-- > From Alec: Reviewer ****, who was most negative regarding the submission, did not raise many technical points of concern. We would like to comment that this is potentially evidence that this individual may not have evaluated the manuscript thoroughly. Therefore, we respectfully request that their review be down-weighted in the final adjudication. -->

<span style="color:blue"> **Summary of contributions**: We provide a computationally efficient (i.e., polynomial-time) min-max optimization characterization of a large class of inverse equilibrium problems in inverse game theory and inverse multiagent learning. Our formulation yields a simple solution to a large number of problems, one of which dates back to the 1940s (namely, the revealed preference problem [1]). We also generalize the apprenticeship learning paradigm to multiagent settings, characterize the solutions to multiagent apprenticeship learning, introduce a novel method to compute such solutions, and demonstrate the effectiveness of our approach at predicting prices in Spanish electricity markets.</span>

<span style="color:blue"> **Common Reviewer Concern**: All reviewers expressed similar confusion regarding the whereabouts of the proofs of our theorems. We apologize for the confusion, which seems to have arisen from the fact that the detailed theorem statements in the appendix were renumbered and, as a result, incorrectly referenced. Specifically, Theorems 3.2, 4.1, and 5.2 were restated in detail as Theorems 6.1, 6.2, and 6.3, respectively. We are uploading together with this rebuttal an updated appendix with corrected references and slightly expanded proofs (e.g., specifying omitted constants). We expect that this corrected appendix will address the reviewers' concerns.</span>

**References**

[1] Paul A. Samuelson. Consumption theory in terms of revealed preference. Economica, 15(60):243–253, 1948.

## Response to Reviewer WrBZ

### Weaknesses

> **W1)** $\bullet$ It would be helpful to expand on the proofs of theorems 6.1, 6.2, and 6.3 in the supplementary material. I know that a reference has been provided, but a slight explanation of the cited result and how it relates to the theorem in question would be nice.

**Response to W1)**: <span style="color:blue"> We refer you to our common answer.</span>

<!-- > Alec says: we should promise to add a proof sketch, or otherwise illuminate some technical points of differentiation from prior analyses that are similar? -->

<!-- > Amy: **DELETE; redundant**
> **Response to W1)**: <span style="color:blue"> We are uploading together with this rebuttal an updated appendix, and more detailed proofs of the results in our paper **NOT OKAY!!**. We have also addressed some theorem referencing issues, which we hope will address your concerns on clarity and correctness</span>. -->

> **W2)** $\bullet$ Although a comparison of the method has been shown with the ARIMA model on the spanish electricity market data, it would be beneficial to have a comparison with prior methods in inverse multi-agent reinforcement learning. Especially in terms of efficiency since it's one of the main points of the paper. The abstract says that the method outperforms other widely-used methods (plural), and we only get to see it being compared with one other model which is specific to time-series data.

<!-- > Alec says: this is a fair point.
We should expand upon what would be reasonable sanity checks, classical IRL approaches, and how ours is distinct? -->

**Response to W2)**: <span style="color:blue"> To our knowledge, all existing methods for inverse multiagent reinforcement learning apply only to finite state and action Markov games, while our electricity market model is a continuous state and action Markov game. As a result, comparing to these methods would require potentially non-trivial extensions. Instead, we chose to compare our method only to a statistical baseline, namely ARIMA. We will adjust the language in the abstract to correct the plural.</span>

> **W3)** $\bullet$ Some comparison/contextualization with prior work in multiagent inverse reinforcement learning would also be helpful.

<!-- > Alec says: do we have an extended related works section in the appendix? If not, we need to promise to add this to the final upload and paste it into openreview -->

**Response to W3)**: <span style="color:blue"> A more extensive related work section is included in the third section of the appendix. There, we summarize related work not only in inverse multiagent reinforcement learning, but also in microeconomics, econometrics, and algorithmic game theory. Thank you for pointing out that a reference to this more detailed related work section is missing; we will correct this oversight in the camera-ready version. </span>

### Questions

> **Q1)** $\bullet$ What does the term $\psi(\pi,\rho;\theta)$ in the "Multiagent Apprenticeship Learning" section expand to? Cannot seem to find a definition anywhere.

<!-- > Alec says: I also couldn't find this... maybe we should add a table to the supplement where we can have a list of all terms and key quantities, just to disambiguate? -->

**Response to Q1)**: <span style="color:blue">$\psi$ is the cumulative regret, defined as $\psi(\mathbf{\pi}, \mathbf{\rho}; \mathbf{\theta}) \doteq \sum_{i \in [n]} u_i\left(\mathbf{\rho}_i, \mathbf{\pi}_{-i}; \mathbf{\theta}\right) - u_i(\mathbf{\pi}; \mathbf{\theta})$. (See page 3, at the end of the paragraph starting with "One-shot games".) Note that this definition extends immediately to Markov games, since any Markov game can be seen as a one-shot game in which the space of actions is taken to be the space of policies. This point is explained further in the sentences preceding Corollary 1.</span>

> **Q2)** $\bullet$ Assuming that Algorithm 3 was used on the spanish electricity market data, how was the observation distribution specified?

**Response to Q2)**: <span style="color:blue"> In the electricity market experiments, the Markov game consists of an electricity seller who sets prices in the day-ahead and spot markets, and $n$ buyers who demand electricity. We use prices for the day-ahead market, prices for the spot market, and aggregate demand (i.e., the sum of the demand across all buyers) of electricity for every hour from 2015 to 2019 as our observations. The observation distribution is then the history distribution associated with this Markov game, pushed forward through a function that outputs sampled price trajectories and the sum of each buyer's individual demand. </span>

## Response to Reviewer Wrco

### Weaknesses

> **W1)** The proofs are not completed, e.g., I cannot find the proofs for Theorem 4.1 and Theorem 5.2.

<!-- > Alec says: is this contained in the supplementary material, especially section 6.2? I think we can refer him to the appropriate subsection, but maybe some sections are missing?
> Amy says: **DELETE; redundant** -->

**Response to W1)**: <span style="color:blue"> We refer you to our common answer.</span>

> **W2)** The presentation can be further improved, e.g., more intuitions about the assumptions and theorems.

<!-- > Alec says: seems like an easy comment to address. Especially some of the discussion is already at the bottom of page 6, so we can expand this -->

**Response to W2)**: <span style="color:blue"> We will try to add additional intuition to the main paper, beyond what already appears at the bottom of page 6 and the top of page 7 (for example), while respecting the space constraints. You can also already find additional discussion in the appendix (see page 16).</span>

### Questions

> **Q2)** Can you further polish the paper? Some typos: for example, in the fifth line of the abstract, should it be "to solve them"?

**Response to Q2)**: <span style="color:blue"> We will do our best to thoroughly proofread the paper again and correct all grammatical, semantic, and syntactic errors in the camera-ready version. </span>

## Response to Reviewer F6LX

### Weaknesses

> **W1)** As an easily rectified issue, Figure 1 could have been better represented by plotting residuals over time or, by subsampling the data, plotting mean residuals with error bars.

<!-- > Alec says: we promise to include error bars associated with 1 standard deviation over a number of trials? Is this possible? Wouldn't this require randomizing over the initial seed or otherwise adding random perturbations to the data? -->

**Response to W1)**: <span style="color:blue"> This is a great suggestion, thank you. We will indeed plot residuals over time, with error bars over the five seeds for which we ran our experiments.</span>

> **W2)** As a minor complaint, I do not prefer the language of "generative-adversarial" (especially not in terms of a "discriminator"), even if this is the closest analogy familiar to machine learning practitioners: This is a standard min-max optimization problem that need not be wed to the ML setting.

**Response to W2)**: <span style="color:blue"> Thank you for this feedback. The "generative" language is meant to allude to the generative model fitting problem within the larger technical development of our approach. That said, we agree that the discriminator language is not quite right. We will give further thought to this concern, perhaps by replacing "generative-adversarial" simply with "min-max". </span>

<!-- > [name=Denizalp Goktas] Not sure of the above, let me know what you think!
>
> [name=Alec Koppel] Seems unnecessary to promise to change it. We can just clarify the meaning of generative-adversarial as minimax, and provide some additional discussion of minimax problems arising in game theory, as well as underscore our motivation for the generative-adversarial language, as it provides enhanced interpretation of the generative model fitting problem within the larger technical development of our algorithm. -->

### Questions

> **Q1**: Remark 1 is indeed interesting, but it is not obvious. Did I miss an associated proof or example?

**Response to Q1**: <span style="color:blue"> This remark follows from a folklore theorem, which states that the set of solutions to a convex-concave min-max optimization problem (i.e., a min-max problem in which the objective is convex-concave and the constraints are non-empty, compact, and convex) is convex. This fact can also be seen as a corollary of the convexity of the set of Nash equilibria in zero-sum, potential, and monotone games.
You can find a proof in Nau et al. [1]; for completeness, we also provide a proof here. </span>

<span style="color:blue"> Consider a convex-concave min-max optimization problem $\min_{\mathbf{x} \in \mathcal{X}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x},\mathbf{y})$, where $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is convex-concave and $\mathcal{X}, \mathcal{Y}$ are non-empty, compact, and convex sets. Let $V(\mathbf{x}) \doteq \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x},\mathbf{y})$. The value function $V$ is convex, since it is the pointwise maximum of a family of convex functions (see, e.g., Danskin [2]). Hence, by Theorem 2.6 of Rockafellar and Wets [3], the set of solutions $\arg \min_{\mathbf{x} \in \mathcal{X}} V(\mathbf{x}) = \arg\min_{\mathbf{x} \in \mathcal{X}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x},\mathbf{y})$ is convex. </span>

<!-- > Alec says (re Response to Q1): this is the second reviewer complaining about lack of proofs... I think the supplementary material may be missing some things by accident? -->

> **Q2**: Why was the proof of Theorem 3.2 omitted? Was it rephrased to appear as Theorem 6.1 in the supplementary material? Establishing the convergence rates of various algorithms is not my expertise, but the results seem reasonable, especially given assumptions of convexity and Lipshitz smoothness.

**Response to Q2**: <span style="color:blue"> We refer you to our common answer.</span>

<!-- > [name=Alec Koppel] include specific page numbers where the requested information can be found in the appendices. -->

**References**

[1] Nau, Robert, Sabrina Gomez Canovas, and Pierre Hansen. "On the geometry of Nash equilibria and correlated equilibria." International Journal of Game Theory 32 (2004): 443–453.

[2] Danskin, John M. "The theory of max-min, with applications." SIAM Journal on Applied Mathematics 14.4 (1966): 641–664.

[3] Rockafellar, R. Tyrrell, and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

---------------------------------

## Response to Reviewer zp62

### Weaknesses

> **W1)** The readability can be improved. I think Section 3 never mentions that it is for the one-shot game setting. There is no methodological contributions.

<!-- > Alec says: promise we will clarify the one-shot setting early on in the manuscript for the final upload -->

**Response to W1)**: <span style="color:blue"> In the second paragraph of Section 3, we define an inverse game as a tuple comprising a one-shot parametric game whose parameter values are missing, together with a Nash equilibrium action profile. We will nonetheless state the one-shot setting more prominently at the beginning of Section 3 in the camera-ready version.</span>

> **W2)** All presented algorithms are simple extension of known algorithms. (On the other hand we should not invent/propose algorithms just for the sake of proposing them).

**Response to W2)**: <span style="color:blue"> As you mention, we have not developed any groundbreaking methodologies, but have instead introduced a novel mathematical characterization of a longstanding problem. Perhaps the strength of our paper lies in its simplicity. </span>

### Questions

> (Q1) Minor remark/question: what is the meaning of that weird symbol in Theorems 3.2, 4.1, 5.2? I assume it means of the same order as (but I don't think this is standard notation.)

**Response to Q1)**: <span style="color:blue"> We believe you are referring to the notation $\eta_{\boldsymbol{y}}^{(t)} \asymp \varepsilon^4$. This notation is equivalent to big-Theta notation, i.e., $\eta_{\boldsymbol{y}}^{(t)} \in \Theta(\varepsilon^4)$: there exist constants $c_1, c_2 > 0$, independent of $\varepsilon$, such that $c_1 \varepsilon^4 \leq \eta_{\boldsymbol{y}}^{(t)} \leq c_2 \varepsilon^4$. An example of the use of this notation can be found in [1].
Regardless, we will add a footnote to clarify. </span>

> (Q2) In Theorem 3.2, it is a little bit surprising that the optimal is obtained by averaging prior solutions instead of the last one. What is the intuition behind averaging (which includes initial solutions that can be of very low quality)? I assume the proof is correct.

<!-- > Alec says: averaging is a standard tool in the analysis of non-strongly convex optimization to obtain bounds on the sub-optimality of the objective function evaluation [cite Bertsekas nonlinear programming, e.g.]. One could get around the use of averaging if one instead rephrased performance guarantees in terms of time-accumulation of sub-optimality, but this is not always a natural performance criterion, and in our setting, it is relatively less intuitive than the performance of the averaged iterates. -->

**Response to Q2)**: <span style="color:blue"> Simple gradient descent ascent methods are not guaranteed to converge in last iterates in min-max optimization problems, and can even diverge (see, for instance, Mertikopoulos et al. [2]). A common remedy is to average the iterates: although the early, low-quality iterates do enter the average, each iterate receives weight only $1/T$, so their effect vanishes as the number of iterations grows and does not affect the convergence guarantee. As we mention in the sentence preceding Theorem 3.2, we can instead obtain last-iterate convergence using extragradient descent ascent or optimistic gradient descent ascent. A minimal numerical sketch illustrating the benefit of averaging is included below, after the references. </span>

**References**

[1] Daskalakis, Constantinos, Dylan J. Foster, and Noah Golowich. "Independent policy gradient methods for competitive reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 5527–5540.

[2] Mertikopoulos, Panayotis, Christos Papadimitriou, and Georgios Piliouras. "Cycles in adversarial regularized learning." In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2703–2717, 2018.
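<span style="color:blue"> For concreteness, here is the numerical sketch referenced in our response to (Q2) above. It is not taken from our paper's code; the bilinear game, step size, and horizon are illustrative choices. It runs projected gradient descent ascent on $f(x, y) = xy$ over the box $[-1, 1]^2$, whose unique saddle point is $(0, 0)$: the last iterate cycles around the saddle point, while the running average of the iterates approaches it.</span>

```python
import numpy as np

# Projected gradient descent ascent (GDA) on the bilinear game f(x, y) = x * y
# over the box [-1, 1]^2, whose unique saddle point is (0, 0). The last iterate
# cycles around the saddle point, while the running average of the iterates
# converges toward it. (Illustrative step size and horizon; not from the paper.)

eta = 0.05           # step size
T = 20_000           # number of iterations
x, y = 0.5, 0.5      # initial point
avg_x, avg_y = 0.0, 0.0

for t in range(1, T + 1):
    gx, gy = y, x                                  # grad_x f = y, grad_y f = x (simultaneous updates)
    x = float(np.clip(x - eta * gx, -1.0, 1.0))    # descent step on x, projected onto [-1, 1]
    y = float(np.clip(y + eta * gy, -1.0, 1.0))    # ascent step on y, projected onto [-1, 1]
    avg_x += (x - avg_x) / t                       # running average of the x-iterates
    avg_y += (y - avg_y) / t                       # running average of the y-iterates

print(f"last iterate:     ({x:+.3f}, {y:+.3f})")          # cycles; stays far from (0, 0)
print(f"averaged iterate: ({avg_x:+.3f}, {avg_y:+.3f})")  # close to (0, 0)
```

<span style="color:blue"> With extragradient or optimistic updates on the same game, the last iterate itself converges to the saddle point, which is why Theorem 3.2 can alternatively be stated for those methods without averaging.</span>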