## General response
We thank all reviewers for their time and valuable feedback, and appreciate that reviewers found our work novel and relevant.
While the three scores of 5 might *suggest* consistent reviews, we find that each reviewer raised their own distinct concerns.
Some of these concerns were simply misunderstandings.
Others were due to issues with the quality of the writing, which we have now improved.
We address each of the remaining individual concerns below, and we hope that the reviewers will take another open-minded look at the paper.
### Clarifications
We wish to highlight three central clarifications:
(1) In contrast to other works, we study **undetectable** attacks on **sequences** of observations.
(2) Not all presented methods are perfectly undetectable. For example, W-illusory attacks involve a trade-off between undetectability, the adversarial objective, and computational efficiency.
(3) Reality feedback alters the problem statement and is effective against all adversarial attacks.
### Additional experiments
As suggested by the reviewers, we are **currently running experiments in MuJoCo**, and will add **more results over the weekend**.
**Initial results for the Hopper and HalfCheetah** domains can be retrieved fully anonymously here:
https://drive.google.com/file/d/1ffNtT3RtvEQRAX3D9A2mezdappCVq-Ku/view?usp=sharing
Note that we ran the SA-MDP baseline with a perturbation budget slightly smaller than that in the original paper (Zhang et al., 2021).
Furthermore, we will add results for W-illusory attacks over the weekend.
### Revised paper
We have updated our paper to address the concerns raised. We highlighted changes in red (old) and blue (new).
We are more than happy to answer any follow-up questions.
Thank you.
The authors.
## Response to Reviewer 1
> Summary Of The Paper:
> The paper studies adversarial attacks on the state observation (sensor inputs) channel of a RL agent. Different from previous work (in particular [Zhang et al., 2020]), the authors consider the stealthiness of the adversarial attacks. The contribution of the paper is that it defines the concept of detectability for adversarial attacks on state observation and proposes an algorithm that can compute and carry out such undetectable attacks.
>Strength And Weaknesses:
>The problem studied in this paper novel and meaningful. That is to consider the stealthiness of adversarial attacks and the trade-off between stealthiness and effectiveness of the attacks. Stealthiness and detection are the important twins in security problems, especially security problems for sequential decision-making systems where attacks are launched not just one time but sequentially. But the definition of indistinguishability and the detection mechanisms proposed throughout the paper are questionable. The assumptions made in Section 4 are contradicting with the RL algorithm that the author later choose as the focus of the study. The definition of statistical indistinguishability is a strong one. If statistical indistinguishability holds, no detection algorithms can detect such attacks. But the attacks that can avoid being detected by some detection mechanisms don't necessary need to be statistically indistinguishable. The use of statistical indistinguishability as a condition to craft adversarial attacks can lead to no attacking strategy satisfying the condition.
A1.1: While the reviewer’s statements are correct, we find that in many environments there are indeed perfectly indistinguishable attacks, which makes this a useful concept to introduce.
Furthermore, we present full statistical indistinguishability as the extreme limit of our investigation, not as the central result. For example, we study W-illusory attacks, which are not necessarily fully statistically indistinguishable. In other words, statistical indistinguishability is a sufficient but not always necessary condition for illusory attacks.
>Consider an MDP with continuous state space. Can the authors give an example of an MDP and an attacking strategy [...] such that $\mathbb{P}_{\pi,\mathcal{E}}(\tau) = \mathbb{P}_{\pi,\mathcal{E}'}(\tau),\ \forall \tau \in (\mathcal{S}\times\mathcal{A})^T$? To make the example simple, we can assume T=2. Such examples can help the readers better understand how strong the definition is.
A1.2: Thank you for this remark. In Section 5.3 (Perfect illusory attacks), we present results for the Pendulum and CartPole environments, both of which have continuous state spaces. We will also add a 2-step example to the appendix.
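To give a flavour of the kind of 2-step example we have in mind (an illustrative sketch with hypothetical dynamics and reward, not necessarily the exact example that will appear in the appendix): consider a continuous-state MDP with $\mathcal{S} = \mathbb{R}$, horizon $T=2$, action-independent dynamics, and a reward that measures how well the action tracks the state,
$$ s_1 \sim \mathcal{N}(0,1), \qquad s_2 \sim \mathcal{N}(s_1, 1), \qquad r(s_t, a_t) = -(a_t - s_t)^2 .$$
The attacker samples an independent trajectory $\tilde s_1 \sim \mathcal{N}(0,1)$, $\tilde s_2 \sim \mathcal{N}(\tilde s_1, 1)$ and presents it to the victim in place of the true states. Since the presented observations follow exactly the same law as the true ones, and actions do not influence the dynamics, the distribution over observed state-action trajectories is identical to that of the unattacked environment for any victim policy $\pi$, i.e. the attack is statistically indistinguishable. Yet a victim that tracks the observed state (the optimal policy in the unattacked environment, with return $0$) now tracks $\tilde s_t$ rather than $s_t$, yielding expected return $-\sum_t \mathbb{E}[(\tilde s_t - s_t)^2] < 0$.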
> I agree with the authors that the assumption of the victim knowing a world model mv is not unrealistic. "victims can learn accurate world models from unperturbed train-time samples". However, if the victim knows the world model, what is the point of interacting with the world to observe the state? The victim can leverage a more efficient algorithm (value iteration, policy iteration + function approximation if the state space is large) instead of the algorithms discussed in the paper.
A1.3: This is a good point. The concepts in our paper apply to both model-based approaches (i.e., planning methods) and the model-free approaches we explore in the paper; combining them with planning is something to explore in future work.
To be clear, having a world model that is used for planning does not circumvent the issue that the victim still has to observe the current state (either as an input to the planning algorithm or to the amortised policy), and this observation can therefore be attacked.
>The detection mechanism used in the experiment is included in the paragraph starting with "Using a dynamics model to detect adversarial attacks." Please highlight the detection mechanism since it is an important factor for the experiments. If I understand it correctly, the detection mechanism considers observations in a single step and an attack is detected if the predicted observation and the actual observation is larger than a threshold c.
>Does this detection technique come from a reference? I have three concerns regarding the mechanism. 1. It does not consider the whole trajectory history to detect 2. how do you measure distance for discrete state space (how to set threshold c)? 3. if c is small, even if there is not attack, the actual observation can be different from the predicted observation given the stochasticity of the model.
A1.4: We agree that the detection mechanism used in our work does not guarantee statistical indistinguishability. However, despite its simplicity, it is highly effective at detecting state-of-the-art adversarial attacks. We similarly agree that tuning c, especially in stochastic environments, is essential to the detection mechanism. We tuned c such that no unattacked trajectories would be classified as attacked (please see section 7.3 for more details). In deterministic environments, other notions of state similarity can be adopted, such as nearest neighbour approaches.
Furthermore, state-of-the-art methods to detect long-term statistical correlations are very sample-inefficient (Shi et al., 2020), requiring impracticably large numbers of test-time samples for statistically significant detection. In contrast, our simple detector provides a trade-off between detection accuracy and sample efficiency.
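For concreteness, a minimal sketch of the kind of one-step detector described above (illustrative only; `dynamics_model`, its `predict` interface, and the tuned threshold `c` are placeholders; please see Section 7.3 for the actual implementation details):

```python
import numpy as np

def detect_attack(observations, actions, dynamics_model, c):
    """Flag a trajectory as attacked if any one-step prediction error exceeds c.

    observations:   array of shape (T+1, obs_dim), observed states o_0 ... o_T
    actions:        array of shape (T, act_dim), actions a_0 ... a_{T-1}
    dynamics_model: learned one-step model predicting o_{t+1} from (o_t, a_t)
    c:              threshold, tuned so that no unattacked trajectory is flagged
    """
    for t in range(len(actions)):
        predicted_next = dynamics_model.predict(observations[t], actions[t])
        error = np.linalg.norm(observations[t + 1] - predicted_next)
        if error > c:
            return True  # prediction error above threshold: flag as attacked
    return False  # all one-step prediction errors within threshold
```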
> Other minor comments:
>I came across a paper that studies adversarial attacks on rewards with a definition of stealthiness in it (C1). As a reader, I am curious about the difference between undetectability in attacks in reward and attacks in state observations?
A1.5: Thank you for pointing us to this related work in cybersecurity; we have added it to our related work section. The fundamental difference is that C1 assumes direct adversarial manipulation of cost/reward signals. We, in contrast, assume that the agent is not supplied with a reward signal at test time, as is usually the case during deployment.
>In optimal control or MDP or POMDP, adversarial attacks have been investigated by many researchers (C2, C3). These attacks are sometimes called false data injection attacks or sensor attacks. The detector of attacks is usually based on statistical evidence instead of single observations.
A1.6: Thank you for pointing out these works; we have added them to the related work section. C2 and C3 both study hard-coded sensor attacks on linear control systems, and the detection mechanisms proposed are only applicable to such systems. In contrast, we present a novel reinforcement learning framework for end-to-end learnt attacks on high-dimensional non-linear systems.
>Avoid using "may" in your statement. For example, in definition 4.1, a sampling policy that may conidtion on the whole history. The authors can say "a sampling policy where T can be infinity" or "a sampling policy that can condition on the whole action-observation history". In section 5 "may be unable to detect W-illusory attacks". Instead of using "may", scientific writing should specify under what conditions, humans are able to detect W-illusory attacks and under what conditions, human are not able to do so. Similar examples of using "may" can be found throughout the paper.
A1.7: We have improved our writing accordingly.
## Response to Reviewer 2
>Summary Of The Paper:
>Summary: This paper emphasizes developing statistically undetectable attacks for autonomous decision-making agents. The authors develop a novel class of illusory attacks that are consistent with environment dynamics. Their results show that illusory attacks can easily fool humans, unlike the previous attacks in the literature. They compare illusory attacks with other attacks and show their performance under different defense techniques.
>Questions and comments:
>In Table 2, what is the perturbation budget used for the attacks?
A2.1: The budget used here is 0.2, which is comparable to the budget used in the SA-MDP paper ([Zhang et al., 2020]).
> Based on Section 3, I think in definition 4.1 and the rest of the text, it should be P_{\varepsilon, \pi} and not P_{\pi, \varepsilon} for consistency.
>In Section 3, \gamma is not defined.
>In definition 4.3, it says the highest expected return but shows minimization in equation (2). Is this correct? If yes, why? Also, why is "optimal illusory attack" defined? I do not see it being used in any part of the text.
>I would suggest using \tau \sim P_{\varepsilon, \pi} instead of \tau \sim (\varepsilon, \pi}) in equation (3).
>In equations (3, 7), should the summand not be till T-1 and not T?
>In equation (7), it should be \lambda instead of \gamma, right?
>On page 8, it should be Table 1 and not Table 5.3 for consistency.
A2.2: Thank you for these remarks; we have made the corresponding fixes and added clarifications in the updated version of the paper.
>I think it's a better idea to replace Table 1 with Table 7.5.1 in the main text since the absolute reward values are easier to compare.
A2.3: We discussed this option, but find the current presentation more interpretable. We are open to discussing this change further if it is seen as critical.
> Strength And Weaknesses:
> Strengths:
>Focuses on developing attacks that are consistent with the dynamics of the environment.
>Develops a novel class of illusory attacks that are consistent with the dynamics. Introduces the concept of statistical indistinguishability for stochastic control processes.
>Well-structured sections. The supplementary material contains all the relevant videos related to human study.
>Weaknesses:
>I think that the human study setup is biased. The samples given in the supplementary show that it's easy to detect attacked vs. unattacked videos. All the unattacked videos labeled in the study balance the pendulum or cart pole perfectly, while all the attacked videos do not do that. Hence, it should be easy to detect attacked vs. unattacked videos for these simple environments.
>I think illusory vs. unattacked videos can also be figured out without much difficulty. For example, only in illusory attacks the cart pole moves very quickly. In the rest of the attacks/ unattacked videos, the cart poles do not move as fast as in illusory attack videos.
A2.4: Thank you for this remark. Please note, however, that human participants were specifically asked to discriminate between the unattacked and attacked *environments*, not between different policies. We chose this question because a single environment often permits diverse (near-optimal) policies, so distinguishing between policies is not generally a good predictor of adversarial attacks. Our study is *fair* in that we asked participants the same question for our method and the baseline attacks, and still observed significant differences.
Further, note that none of the participants were experts in the field; they were left to their best judgement. A study with participants who are experts in RL constitutes an interesting direction for future work.
>Given the above weaknesses, I am not sure why the detection accuracies are low for the illusory attacks in Table 1.
A2.5: Please note that our detector does not test for statistical indistinguishability directly, as state-of-the-art methods to detect longer-term statistical correlations are very sample-inefficient (Shi et al., 2020), requiring impracticably large numbers of test-time samples for statistically significant detection. In contrast to these works, our detector trades off detection accuracy against sample efficiency. Please see section 7.3 for implementation details of the detector used to generate the results in Table 1.
>What is the naive detection algorithm that is used for Table 1? Also, W-illusory attacks are not "statistically indistinguishable", unlike perfect illusory attacks, right?
A2.6: This is correct. W-illusory attacks are a practical relaxation of illusory attacks. As stated in definition 4.5, “a W-illusory attack is an adversarial attack that is consistent with the victim’s model of the observation-transition probabilities mv”. […] “So importantly, W-illusory attacks can in general change the distribution of trajectories”.
>I am not convinced if good W-illusory attacks should always exist. Pendulum and cart pole are very simple environments. I would suggest adding experiments with more complex environments. Not sure about transferability to other tasks.
A2.7: Please note that Definition 4.7 provides a practical implementation of W-illusory attacks. We are currently running W-illusory attack experiments in MuJoCo and will add results to our paper before the end of the weekend. Initial results can be retrieved fully anonymously here:
https://drive.google.com/file/d/1ffNtT3RtvEQRAX3D9A2mezdappCVq-Ku/view?usp=sharing
>Reality feedback is a practical and obvious way to defend against illusory attacks. Hence, the attack proposed in this paper does not appear to be strong.
A2.8: Given suitable reality feedback, *any* observation-space adversarial attack can be mitigated, including MNP attacks. However, we stress that, in the case of perfect illusory attacks, reality feedback can be the only possible defence. We demonstrate how reality feedback can be used to robustify test-time policies, but we would like to reiterate that reality feedback is not always feasible.
This is akin to arguing that the answer to cybersecurity threats is simply to use perfectly secure channels.
## Response to Reviewer 3
>Summary Of The Paper:
>This paper studies test-time attacks on reinforcement learning agents. It focuses on attacks that are statistically undetectable, and proposes novel attack models that aim to preserve consistency of trajectories with the environment dynamics. The paper develops a new optimization framework for generating such attacks and experimentally validates the effectiveness of these attacks, called illusory attacks. The experiments test the efficacy of the proposed approach in terms of: i) detectability of adversarial attacks via statistical consistency checks, ii) detectability of adversarial attacks via visual inspection (i.e., human-subject studies), and iii) susceptibility of robustly trained RL agents to adversarial attacks. The experimental results indicate that the proposed attack approach yields lower detectability rates compared to prior works.
>Strength And Weaknesses:
>Strengths of the paper:
>To my knowledge, the attack models considered in this work has not been studied in the literature on test-time attacks against RL agents. The attack models are well motivated and complement those from prior work. Instead of focusing on LP norm-based attack models, the paper advocates models that are statistically undetectable.
>The paper introduces a formal framework for studying these attack models, as well as an optimization problem for finding an optimal illusory attacks. The optimization problem aims to minimized the victim's return, while minimizing a distance function that measures the inconsistency between generated trajectories and the environment dynamics.
>The paper also conducts a human-subject experiment to test the detectability of this attack model via visual inspection. This validation techniques appears to be novel when it comes to adversarial attacks on RL agents.
>Weaknesses of the paper:
>The experiments are primarily based on two simple environments, Pendulum and CartPole. In contrast, prior work, e.g., Zhang et al. 2021, has studies more complex environments, such as MuJoCo. More experiments would be useful in order to understand the scalability of the approach.
A3.1: Following the reviewer's request, we present additional experiments on two MuJoCo environments (anonymous link to initial results: https://drive.google.com/file/d/1ffNtT3RtvEQRAX3D9A2mezdappCVq-Ku/view?usp=sharing).
Our experiments show that perfect illusory attacks exist for both environments, and our results suggest that the associated SA-MDP attacks are easily detectable.
We will add further results, including for W-illusory attacks, over the course of the coming weekend.
>Also, given that one the experiments involves human-subjects, IRB may be required; the paper doesn't seem to report if the study has an IRB approval.
A3.2: Thank you for pointing this out. We will attach the IRB approval to the camera-ready version.
>In general, it is not clear what are the computational properties of the proposed optimization framework. The I-MDP model does not scale well with the time horizon, and given that the experiments are only based on two simple environments, it is not clear how practical this approach is.
A3.3: We are not entirely sure we understand the reviewer's question correctly; we assume that the reviewer refers to the scalability of the I-MDP model with respect to the state-action history. We expect that solving I-MDPs scales computationally similarly to solving POMDPs. Note that recurrent policies have been successful at solving large POMDPs [1,2,3].
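To illustrate what we mean by a recurrent policy that conditions on the full action-observation history (a minimal sketch, not our actual architecture; layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """GRU policy conditioning on the full action-observation history,
    as is standard for POMDPs and, analogously, I-MDPs."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim + act_dim, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, history):
        # history: (batch, T, obs_dim + act_dim), concatenated observations and actions
        x = torch.relu(self.encoder(history))
        hidden_states, _ = self.gru(x)
        # the final hidden state summarises the entire history
        return self.head(hidden_states[:, -1])
```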
>Additionally, the attack optimization problem (7) seems to require the environment/world model. Some discussion on the practicality of the approach would be useful to have.
A3.4: We assume that the reviewer is referring to the practicality of the attacker having access to a world model. We here adopt the common assumption [Zhang et al., 2021] that the attacker has access to the environment (in order to implement the attack), which allows it to estimate a world model. In general, the accuracy of the attacker’s world model required for a successful illusory attack depends on the accuracy of the victim’s world model.
>Some parts of the formal framework are not entirely clear and may contain typos. Firstly, I don't fully understand why rewards are not included in trajectory \tau when one measure consistency with the true environment, nor why \tilde S does not include rewards. It's not immediately clear to me that the victim cannot detect this attack by inspecting received rewards. Some discussion on this would be useful.
A3.5: Thank you for this remark, we have updated the notation accordingly. We assume a test-time adversarial attack on the victim agent, hence the victim agent does not observe the reward signal at test-time (Zhang et al., 2021, Kumar et al., 2021). Imagine a robot trained in simulation, which would likewise not observe the reward during deployment in a real-world scenario.
>Secondly, definition 4.1 uses E but does not specify it. Thirdly, Eq. (3), (5), (6) and (7) may not be precise. E.g., why do we minimize [...]
A3.6: Thank you for these remarks. We have addressed the issues in the updated version of the paper.
>Clarity, Quality, Novelty And Reproducibility:
>Please see my detailed comments above. Below I summarize some of the points related to quality, clarity and originality.
>Quality: I believe that the attack model studied in this work is interesting, but given that the evaluation is primarily based on experiments, more environments could be added to the experiments test-bed.
A3.7: Please see answer A3.1 regarding additional experiments.
>Regarding the reproducibility, the simulation-based results are well documented. On the other hand, it would be good to extend the description of the human-subject experiment, e.g., by adding the recruitment protocol, and indicate whether the study had received an IRB approval.
A3.8: Thank you for pointing this out. We will add the IRB approval and recruitment protocol to the camera-ready version.
>Clarity: The paper is overall clearly written. That said, some part of the paper are not entirely clear as I indicated above. If these are typos, unfortunately, there seem to be quite a few of them and they significantly impact the correctness of the results...
>Originality: The attack model studies in the paper seems quite novel. I also like the fact that the paper utilizes human-subject experiment to test the detectability of some of adversarial attacks, which is rather novel, or at least not that common in this line of work.
[1] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michael Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[2] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
[3] Bowen Baker et al. Video PreTraining (VPT): Learning to act by watching unlabeled online videos. arXiv preprint arXiv:2206.11795, 2022.
## Response to Reviewer 4
>Summary Of The Paper:
>This paper studies adversarial attacks on sequential decision-making policies, with a focus on statistically undetectable attacks. The authors assume exact knowledge of the world model, and introduce a novel class of adversarial attacks called illusory attacks, which are consistant with the world dynamics and thus more stealthy. This paper formulates the illusory attacks, and propose a feasible learning algorithm, W-illusory attacks, to generate illusory attacks. Experiments on simple control tasks show that the proposed attack and less detectable to humans and AI agents than state-of-the-art attacks.
>Strength And Weaknesses:
>Strengths:
>The idea of statistically consistant attacks is interesting, and to my knowledge, novel in the literature.
>The paper provides theoretical justifications that perfect illusory attacks exist for some but not all policy-environment pairs.
>Provided human study are useful for understanding the detectability of adversarial attacks.
>The proposed algorithm makes intuitive sense, and the empirical results on Cartpole and Pendulum do show the effectiveness of the algorithm.
>Weaknesses:
>I do not agree with some claims made by the paper. In particular, most existing adversarial attacks are imperceptible to humans [1,2], which is an important motivation of adversarial attacks.
A4.1: Prior attacks are imperceptible to humans because of the budget constraint, e.g., low-amplitude noise added to an image. This previous work [1,2] focuses on observations at a single time step, such as a single image. In contrast, our work considers adversarial attacks on environment interactions that generate *sequences* of observations. This setting requires different considerations regarding perceptibility (by humans or learning systems) than single-image attacks. Specifically, we do not ask whether an attacked observation can be distinguished from the original one, but whether the distribution of trajectories matches between the attacked and original environment.
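To make this distinction explicit (a schematic contrast, with notation loosely following the paper rather than a formal statement): norm-bounded attacks impose only a per-step constraint, whereas illusory attacks require equality of trajectory distributions,
$$ \|\tilde o_t - o_t\|_p \le \epsilon \;\; \forall t \qquad \text{vs.} \qquad \mathbb{P}_{\pi,\mathcal{E}}(\tau) = \mathbb{P}_{\pi,\mathcal{E}'}(\tau) \;\; \forall \tau \in (\mathcal{S}\times\mathcal{A})^T .$$
A per-step norm bound does not prevent the perturbed trajectory from being inconsistent with the environment dynamics, whereas the latter condition rules out any statistical test on trajectories.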
>Most literature of adversarial RL studies the adversarial perturbations on observation, which lies in a high-dimensional space (e.g. images [2,3,4]). These perturbations are usually undetectable by humans, unless special care is taken.
>The assumption of an exact world model is not very realistic, especially in the senarions where adversarial attacks are concerning. In simulators, we may have an exact world model. However, it is less possible and also less dangerous that a local simulator is attacked. Adversarial examples are more critical during the interaction with real-world environments where observation may be noisy and exposed to outside attackers. Under these real-world scenarios, the access to a world model is unrealistic. Even learning a good model can be challenging in these environments.
A4.2: Thank you for this remark. As we stated in section 4.2, it is not required that the victim has access to a perfect world model. For example, the victim may only have access to an abstract model of the environment dynamics that, by itself, would not be sufficient for planning, but may nevertheless allow the victim to perform effective checks on trajectory consistency. In general, the required accuracy of the attacker’s world model depends on the accuracy of the victim’s world model.
>The experiments are on simple environments. Can the authors also provide results on larger scale environments like Atari games, or at least MuJoCo environments?
A4.3: As suggested by the reviewers, we will add additional experiments on MuJoCo by the end of the weekend. Initial results can be retrieved fully anonymously here:
https://drive.google.com/file/d/1ffNtT3RtvEQRAX3D9A2mezdappCVq-Ku/view?usp=sharing
>Some related works are missing. For example, [4] proposes a stronger adversarial attack that SA-MDP [3]. Can the authors compare the proposed attack with [4]?
A4.4: Thank you for this remark; we have added [4] to the related work. [4] investigates adversarial attacks on high-dimensional observation spaces by improving the scalability of the method proposed in [3]. In contrast to our work, neither [3] nor [4] considers statistical indistinguishability.