We thank Reviewer 4Uw3 for the detailed comments and insightful feedback. We are encouraged that the reviewer finds our paper "studies a very valid scenario" and that the "derived bounds are insightful". We address Reviewer 4Uw3's concerns and questions below:
### [1/3] Response to weakness
> Q1: the experiments on robustification of the victim policy are limited to the very small Kuhn Poker environment...
We respectfully point out that our robustification experiments extend **beyond** Kuhn Poker; we have also tested extensively on the Robosumo environment, a complex setting with over $100$ observation dimensions.
- The learning process of our algorithm is illustrated in Figures 4(a) and 4(b).
- Figure 4(c) demonstrates the enhanced robustness of the robustified victim under attack compared to the baselines.
> Q2: the experiments are rather limited, common benchmark environments like the gym environment...
Our experiments do include complex benchmarks chosen specifically to test scalability:
- Gym environments such as cartpole, pendulum, walker, halfcheetah, and hopper are single-agent, whereas our setting requires at least $2$ agents, making them unsuitable for evaluating our algorithms.
- The Robosumo environment is more complex than typical gym settings. For instance, while cartpole, pendulum, walker, halfcheetah, and hopper have observation dimensions of 4, 3, 17, 17, and 11 respectively, Robosumo's Ant, Bug, and Spider have observation dimensions of 120, 164, and 208.
> Q3: the paper is not very well written...
We apologize for the lack of clarity. We will revise the paper to improve the writing and will clarify any questions regarding the figures.
> Q4: section 3c is out of context and not really relevant to the paper
We'd like to clarify the significance of Section 3c:
- **We justify our attack budget**. Gleave et al. focus on unconstrained attacks. Our approach regulates attacker performance (Section 3a), ensures stealth (Section 3b), and extends to single-agent adversarial RL (Section 3c). Its success in the single-agent setting justifies exploring the multi-agent setting.
- **Our setup presents unique defense challenges**. While there are parallels to single-agent action-adversarial RL, adapting its defenses to our multi-agent context is not straightforward. Section 3c highlights these challenges and underscores the need for provable defenses.
### [2/3] Response to questions
> Q5: how do the proposed methods scale to mid-size environments such as cartpole or pendulum (or even hopper, walker and halfcheetah?)
As noted in our responses to Q1 and Q2, the Robosumo environment is notably more complex than typical gym environments. Importantly, our setting requires at least **$2$ agents**.
> Q6: why were the environments chosen as done for the paper?
Our paper introduces a rigorous theoretical framework intended to scale to complex applications. Accordingly, our benchmarks were chosen both to validate the theory and to probe its scalability:
- Kuhn Poker validates our theory. Because it admits an efficient best-response oracle, we can track the victim's exploitability during training (Figures 2(a) and 2(b)), demonstrating the value of timescale separation.
- The Robosumo environment assesses scalability in complex scenarios. With more than $100$ observation dimensions, it allows us to test the robustified victim's performance under attack, demonstrating that our approach scales while remaining robust.
> Q7: in section 7 figure 3: how is it possible that the same victim reward is achieved for all three attack budgets?...
- We do not claim that different attack budgets yield the same rewards or win rates. Appendix E4 shows a trade-off: a smaller budget improves stealth but may limit the achievable reward.
- For a fair comparison, we compared the stealth of an unconstrained adversarial policy and a constrained one at equal win rates. These policies need not correspond to the algorithms' final iterates, since different budgets can lead to different final win rates. In essence, when an attacker targets a specific win rate and prioritizes stealth, our constrained attack is stealthier.
> Q8: can the authors give some intuition (some approximations) of what the bounds derived in Theorem 5.3...
- Our bound on the iteration complexity is $T = \operatorname{poly}\left(\frac{1}{\delta}, C_{\mathcal{G}}^{\epsilon_\pi},|\mathcal{S}|,|\mathcal{A}|,|\mathcal{B}|, \frac{1}{1-\gamma}\right)$. Hence, environments with larger state spaces ($|\mathcal{S}|$) and action spaces ($|\mathcal{A}|, |\mathcal{B}|$), longer effective horizons ($\frac{1}{1-\gamma}$), or greater exploration difficulty (captured by $C_{\mathcal{G}}^{\epsilon_\pi}$) require more iterations; see the illustrative sketch after this list. This matches intuitive expectations about how the algorithm scales across environments.
- Our algorithm's scalability is supported both by our theorem and by empirical results in the high-dimensional Robosumo environment. While the theorem establishes fast convergence to the optimal solution for reasonably sized problems, our experiments confirm that the algorithm scales effectively to more complex, high-dimensional settings.
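To make the scaling intuition concrete, the sketch below gives a purely illustrative reading of the bound. It assumes a generic polynomial form with placeholder exponents $a,\dots,f$; these exponents are hypothetical and are not the constants established in Theorem 5.3.

```latex
% Hypothetical generic form of the iteration bound, for intuition only.
% The exponents a, b, c, d, e, f are placeholders, not the constants of Theorem 5.3.
T \;\lesssim\;
  \left(\tfrac{1}{\delta}\right)^{a}
  \left(C_{\mathcal{G}}^{\epsilon_\pi}\right)^{b}
  |\mathcal{S}|^{c}\,
  |\mathcal{A}|^{d}\,
  |\mathcal{B}|^{e}
  \left(\tfrac{1}{1-\gamma}\right)^{f}
```

Under such a form, doubling the effective horizon $\frac{1}{1-\gamma}$ while holding the other factors fixed would increase the iteration count by at most a factor of $2^{f}$, and analogously for the other parameters.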
### [3/3] Response to limitations
> Q9: Limitations are not addressed
We have addressed the limitations in Sections 5 and 8. Specifically, we discuss issues such as multiple independent attackers with self-interested objectives and the challenge of guaranteeing exact convergence under complex function approximation.
> Q10: Potential negative societal impacts of studying more powerful adversarial attacks are not addressed
In the revision, we will discuss potential societal impacts in more depth. By highlighting these attacks, we underscore the pressing need for safety in deep RL and aim to spur the development of robust defenses.
---
We greatly appreciate Reviewer 4Uw3's valuable feedback and constructive suggestions. We are happy to answer any further questions.
Paper8031 Authors