# ICML Rebuttal illusory attacks

## Response 3

As pointed out by reviewer fAk8 themselves (first paragraph), we consider adversarial attacks on observations, whereas [5] considers attacks on actions. As pointed out by reviewer fAk8 themselves (second paragraph), this difference, i.e. observations vs. actions, introduces partial observability (fAk8 quote: "In your problem you obtain a partially observable system."). This distinction already makes our work fundamentally different from [5], as it results in fundamentally different technical challenges (fAk8 quote: "It is known that it is highly difficult to perform sequential hypotheses tests on partially observable systems"). (Note that above, we point out three additional fundamental differences.)

We agree with reviewer fAk8's assessment that optimal change point detectors in partially observable MDPs are a highly relevant direction for future research. In fact, we state this in our "Conclusion and future work" section: "[...] To this end, more sample-efficient statistical detectors will need to be developed [...]". Such efforts could include scaling [1,2] to high-dimensional state spaces - we now explicitly point this out in our "Conclusion and future work" section. Note that while our detector is highly effective in practice, we never claim nor suggest that it is optimal. Hence, our contributions do not rely on the detector's optimality.

## fAk8 Response

We would like to point out that all of reviewer fAk8's additional concerns have, in fact, already been responded to. For completeness, we repeat:

Point 1: As mentioned in Comment 1, [5] treats a fundamentally different problem setting than ours. Hence, the first comment does not apply. Whether "the technique used in [5] [can] be extendable to non-stationary attack policies" is an interesting question, but orthogonal to our contributions.

Point 2: As stated in our rebuttal above, [5] considers statistical distinguishability between two MDPs, whilst we consider statistical distinguishability between an MDP and a general stochastic control process (which need not be an MDP).

Point 3: Again, [5] treats a fundamentally different problem setting than ours (as explained in the rebuttal). Hence, this finding does not apply to our work.

Lastly, we reiterate our rebuttal statement (see both responses to reviewers jwN6 and fAk8) that our paper fundamentally differs from [5] in the following ways:

1. Different threat model: We consider the scenario where the adversary attacks the victim’s observation, while [5] considers the case where the adversary attacks the victim’s action.
2. More general class of attacks: We consider adversaries that condition on the full history, while [5] considers adversaries that condition on the current state only. Hence, in contrast to [5], we study statistical indistinguishability in more general settings (which can be non-Markovian).
3. Applicability to large continuous state spaces: We show that our methods scale to large state spaces, while [5] relies on methods applied to environments with very small state spaces.
4. Evaluation with human subjects: We additionally evaluate detectability of adversarial attacks through humans inspecting high-dimensional observation spaces.

[5] Russo et al. 2022. Balancing detectability and performance of attacks on the control channel of Markov Decision Processes. ACC 2022.

### Response to in-thread comment

We find that cross-posting comments to other reviewer threads makes the discussion very hard to follow.
To simplify things, we respond to fAk8 in reviewer jwN6's thread.

### Message to AC

We would like to flag to the AC three aspects of reviewer fAk8's response as not following the guidelines for reviewing discussions:

1. We find that reviewer fAk8's arguments do not have the rigour expected of a scientific review, with too many informal claims. This includes speculating over theoretical results that *may* be attainable and using phrases like "seems to be applicable" or "I suspect that policies are quite suboptimal", without supporting evidence.
2. We also observe that reviewer fAk8 did not engage with our rebuttal response, repeating arguments as if we had not already responded to them.
3. We find that fAk8's cross-posting of comments to other reviewer threads unnecessarily makes the discussion very hard to follow.

Overall, we suspect that reviewer fAk8's review may not be sufficiently impartial and that some of the claims and demands should not be taken at face value.

## General Response

Dear reviewers, we would like to thank you for your time and the valuable feedback on our paper. We appreciate that reviewers found our work novel and relevant. Below, we address each reviewer's questions individually. Unfortunately, ICML does not allow us to edit the manuscript at this time, which is why we have tried to be as clear as possible about the changes made to the paper. We are more than happy to answer any additional questions.

Best wishes, the authors

---

## TODO

- ask reviewer 3 and 4 to increase their scores

## Review 1

**Summary:** This paper considers adversarial attacks against deep reinforcement learning agents which work by directly modifying the observations input to RL policies. The authors point out a flaw of previous work in this area: the observations which are produced by attacks may be very unrealistic and thus detectable by the victim, which could lead to mitigation of the attack's effects. They aim to remedy this by constraining attacks to be illusory, a set of properties they introduce which mean an attack is in some way hard to detect. The stronger form, a perfectly illusory attack, is statistically undetectable, while the weaker form, an E-illusory attack, can only be detected by looking for correlations between state transitions over multiple timesteps. The authors implement both types of attacks and show that they are much more difficult to detect, both for a learned detector and for humans. Meanwhile, they are still strong when compared to previously proposed attacks which do not try to evade detection. Additional experiments explore the effect of various defenses against illusory and previously proposed attacks.

**Strengths And Weaknesses:** Overall, I think this is a strong paper for a number of reasons:

- The idea of undetectable attacks in the MDP setting is really nice and, to my knowledge, novel. I appreciate the separation of perfectly illusory attacks from the illusory attacks; both definitions seem very natural and have interesting properties.
- The experiments seem quite convincing and I think the addition of the human detection study is a nice touch.
- The authors are thoughtful in their motivation and conclusion, discussing how the idea of illusory attacks should inform adversarial defenses.

I think the weaknesses of the paper mainly concern the presentation. While it's mostly good, there are some things that could improve the clarity:

- In general, I found Section 4, the main theoretical contributions, to be a bit hard to follow at times.
  It may help to have a running example or something that can help the reader understand all the definitions (of which there are many). In particular, it took me some time to understand the difference between perfectly illusory and E-illusory, and in particular the idea of looking for correlations across time to detect an attack. Some simple example could go a long way here. For instance, with CartPole, it's easy to understand that if perturbations applied to the pole are correlated over time, then (a) it's easier to detect that an attack is happening, but also (b) the attack is stronger since it will keep pushing the pole towards one side. On the other hand, if they're uncorrelated, it makes it hard to detect but also harder to generate a strong attack.
- Another point of confusion was the disconnect between some of the theory in Section 4 and the experiments in Section 5. In Section 4, illusory attacks were defined in terms of distributions over next states and whether those distributions were similar under attack and not. However, in Section 5, you switch to looking at Lp distances. I guess this could be a proxy for some distance between distributions but it might help to make this connection clear. Another thing that confused me initially was the idea of an "illusory reward." It might be helpful when you introduce this to include a reference to equation (4) which I believe is where that idea comes from.
- One possible issue with the results is that they could conflate two ideas you introduce: (i) the idea of making the attacked observations agree with the transition function, and (ii) penalizing this through the *reward* instead of through absolute bounds. I still believe that the lower detectability you find with the illusory attacks is due to (i), but it could be good to have some additional experiment that explores using reward penalties for the other attacks instead of or in addition to absolute bounds. Or, feel free to explain to me why that doesn't make sense!

Typos/small notes:

- Line 315 on the right: seems like the notation should say $\|\hat{t}(s_t, a_t) - s_{t+1}\| \geq c$ instead of using a semicolon.
- Line 420 on the right: you refer to Section 6, but I think you mean Section 4.2 or something like that?

**Questions:** I don't have any particular questions but I would like the authors to address the weaknesses I've mentioned above.

**Limitations:** The authors partially discuss the societal impact of their work; however, it would be good to mention that, as their paper describes an improved method for attacking machine learning systems, it could be used nefariously. I do think that the benefits for defenders of considering illusory attacks, which are listed in the conclusion, outweigh this potential negative impact, but it would be good to state that explicitly.

**Ethics Flag:** No
**Soundness:** 4 excellent
**Presentation:** 3 good
**Contribution:** 4 excellent
**Rating:** 7: Accept: Technically solid paper, with high impact on at least one sub-area, or moderate-to-high impact on more than one areas, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.
**Confidence:** 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
**Code Of Conduct:** Yes

---

### Response to Reviewer 1

**Q1: Found Section 4, the main theoretical contributions, to be a bit hard to follow**: Thank you for this feedback. We have now added a more illustrative example, similar to the one described by the reviewer, to the example given in Appendix A.2, and reference it in the main paper.

**Q2: Disconnect between some of the theory in Section 4 and the experiments in Section 5**: Thank you for this remark; your assumption is correct. We have now clarified this in Section 5.2 under E-illusory attacks.

**Q3: Conflating two ideas you introduce**: From our understanding, the reviewer is wondering whether we could also apply the idea of a reward penalty to the bounded-budget implementations of the baseline methods. In fact, our implemented method similarly uses a bounded budget. When removing the illusory reward, the implementation of our method is very similar to that of the SA-MDP baseline. We have now made this clearer in the notation of Algorithm 1.

**Error in line 315**: Correct. We have addressed this.

**Error in line 420**: Correct, addressed, thank you.

**Limitations**: We agree with the reviewer's assessment and have added the potential negative effects of our work to the conclusion. We point out that we expect the potential positive effects of allowing adversarial defense systems to harden against illusory attacks to outweigh the potential negative effects.

---

## Review 2

**Summary:** The paper proposes a detection method for adversarial attacks on reinforcement learning systems. Unlike prior works, the authors consider a constraint on the detectability of the attack by formally defining 'illusory' attacks on the state observations of an agent. The general idea is that action-observation sequences look statistically very different in non-adversarial vs. adversarial settings. Using this definition, the authors devise a practical attack and evaluate said attack using several RL benchmark datasets. Through quantitative results as well as a human study, the authors demonstrate that their attack is harder to detect than sota attacks.

**Strengths And Weaknesses:**

**Strengths**

1. The paper is well written and does a good job at conveying the important concepts to a reader with limited expertise on the subject matter.
2. The proposed attack takes into consideration a very important aspect of an attack, i.e., detectability.
3. The core idea of using statistical measures instead of ℓp norms to measure detectability is not exactly novel (explored for image classifiers); however, the way the authors define it for their use case is novel and intuitive.
4. Their approach to experimentation is adequately principled.

**Weaknesses**

1. **Issue with choice of benchmark datasets:** My first concern is with regards to the choice of benchmark datasets used in the experiments section. The authors use 4 relatively simple RL environments whereas prior works also use more complicated environments (like Atari games). Because of this, it is unclear how effective the proposed attack will be on real-world RL systems.
2. **Issue with human study:** The authors report that they do the human study using 2 of the 4 environments (pendulum and cartpole). They do this to be more faithful to real-world conditions wherein humans perform sporadic inspection. This reasoning justifies using short episode lengths but doesn't justify using simple environments. Real-world RL systems will likely be operating in much more complex environments than pendulum and cartpole.
   Therefore, the results of the human study are not entirely convincing.
3. **Computational Overhead:** It is unclear how the proposed attack compares to prior attacks in terms of computational efficiency. If enforcing detectability introduces a significant computational overhead, the contribution gets diluted to a certain degree.

**Miscellaneous comments**

1. 1st page, 2nd last para, "Previous frameworks in ...": which frameworks? Citations are missing.
2. L#354, 2nd column, typo: "epsides" => "episodes"
3. Fix the bibliography to have a consistent format and reflect the correct publication venues. For example, "Explaining and harnessing adversarial examples." is an ICLR paper, not an arxiv paper.

**Questions:**

1. What went behind deciding the 4 benchmark datasets for experimentation?
2. Can you better justify using only pendulum and cartpole for the human study? The current reasoning is not convincing.

**Limitations:**

1. The paper does not discuss limitations of the proposed method.
2. The paper proposes an attack on RL systems that is hard to detect. Therefore, there are ethical concerns attached to this work. However, the authors do not explicitly discuss these concerns.

**Ethics Flag:** No
**Soundness:** 3 good
**Presentation:** 3 good
**Contribution:** 3 good
**Rating:** 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
**Confidence:** 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
**Code Of Conduct:** Yes

---

### Response to Reviewer 2

**Q0: Computational overhead**: There is no computational overhead of our method at test-time. We found in our experiments that the computational overhead during training of the adversarial attack scaled with the quality of the learned attack. In general, we found that the training wall-clock time for the E-illusory attack results presented in Table 1 was about twice that of the SA-MDP attack (note that MNP attacks and perfect illusory attacks do not require training).

**Q1: What went behind deciding the 4 benchmark datasets for experimentation**: We chose the CartPole and Pendulum environments as these can be easily understood by humans unfamiliar with simulated benchmark environments. Hopper and HalfCheetah were chosen as they represent more complex control problems, having larger action and observation spaces, thereby allowing us to show that our method scales to more complex domains.

**Q2: Can you better justify using only pendulum and cartpole for the human study**: We agree that the current reasoning in the paper is not entirely clear and have updated it. We found that human participants unfamiliar with simulated benchmark environments had a hard time understanding the transition dynamics of both the Hopper and HalfCheetah environments. In contrast, participants immediately understood Pendulum and CartPole. As we had to limit the time demand on participants, and as Hopper and HalfCheetah would have required additional time for more detailed explanations (on top of the extra time needed to perform the study for two more environments), we excluded those. Similarly, including Hopper and HalfCheetah carried the risk of confounding study results due to participants not adequately understanding the environment dynamics.
We however plan to conduct a more extensive study with human subjects in future work.

**The paper does not discuss limitations of the proposed method**: Thank you for this remark. We have now added a limitations section to our conclusion, where we mention the limitations of the human study, as well as the restriction of our experiments to simulated benchmark environments.

**The paper proposes an attack on RL systems that is hard to detect. Therefore, there are ethical concerns attached to this work. However, the authors do not explicitly discuss these concerns**: We have now added a statement to our conclusion. We state the potential negative effects of our work on stronger adversarial attacks, but point out that we expect the potential positive effects of allowing adversarial defence systems to harden against illusory attacks to outweigh the negative effects.

**Miscellaneous comments**: Thank you for these remarks; we have addressed the issues and updated the bibliography.

---

## Review 3

**Summary:** In this work the authors study the problem of devising stealthy attacks on the observations of an RL agent. On the basis of an equivalence principle, they propose illusory attacks that constrain the sequence of observations to be as similar as possible to trajectories of the unpoisoned environment. They investigate their method on different environments, as well as on humans, and show the efficiency of their method using a simple detector that detects whether a given transition is malicious or not.

**Strengths And Weaknesses:**

**Strengths**

The topic is interesting, and the problem of attack detection in RL problems has not been fully investigated. The attack strategy seems interesting and easily implementable. The authors also propose a simple detection strategy. Finally, the authors show numerical results of their method to support their claims, together with experiments on humans.

**Weaknesses**

1. The authors claim to introduce `perfect illusory attacks, a novel form of adversarial attack on sequential decision-makers that is both effective and provably statistically undetectable.` However, statistical detectability has been discussed, for example in [A], using methods from statistical change detection. Unfortunately there is a lack of comparison w.r.t. [A] (not mentioned in the related work).
2. Furthermore, the arguments presented in this draft do not seem to make use of statistical arguments. The authors use the argument from Shi et al. (2020) (i.e., testing the Markov property) to motivate stealthy attacks. However, there may be attacks where the attack policy may seem to induce a Markov process. Furthermore, this is not really a detection argument (there is no discussion around detection rules), but more the test of a property.
3. The authors do not seem to address the problem of making the attack stealthy w.r.t. the rewards. It would be simple to come up with a detector that can detect a change in the rewards sequence.
4. The authors test using a detector with no internal state, which considers only the current transition. They do not make use of a CUSUM-like algorithm to test the attacks, which is known to be optimal.
5. Some mathematical notation is a bit unclear. For example, for the constraint in (3) it is still unclear what the authors mean (is it an expectation over (s,a)? what is the difference between the two?). In Algorithm 1 there is a typo in the adversarial reward.
6. Overall, few theoretical arguments are presented to support their claims.
[A] Russo, Alessio, and Alexandre Proutiere. "Balancing detectability and performance of attacks on the control channel of Markov Decision Processes." 2022 American Control Conference (ACC). IEEE, 2022.

**Questions:** See above

**Limitations:** Authors do not seem to discuss limitations.

**Ethics Flag:** No
**Soundness:** 1 poor
**Presentation:** 2 fair
**Contribution:** 2 fair
**Rating:** 4: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.
**Confidence:** 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
**Code Of Conduct:** Yes

---

### Response to Reviewer 3

**Q1: Lack of comparison w.r.t. [A]**: We agree that [A] is relevant prior work that introduces attacks based on statistical indistinguishability between attacked and unattacked MDPs, albeit in a significantly different setting. To our understanding, we are still the first to consider statistically indistinguishable **observation-space** attacks, and to evaluate their effectiveness in attacking human-AI systems. We were not aware of [A] and have now contrasted it in our related work section (see below). We now outline the main differences of our work. Our work significantly differs from [A] in the following ways:

- Different threat model: We consider the scenario where the adversary attacks the *victim’s observation*, while [A] considers the case where the adversary attacks the *victim’s action*.
- More general class of attacks: We consider adversaries that condition on the full history, while [A] considers adversaries that condition on the current state only. Hence, in contrast to [A], we study statistical indistinguishability in more general settings (which can be non-Markovian).
- Applicability to large continuous state spaces: We show that our methods scale to large state spaces, while [A] relies on methods applied to environments with very small state spaces.
- Evaluation with human subjects: We additionally evaluate detectability of adversarial attacks through humans inspecting high-dimensional observation spaces.

We have updated our writing to reflect the named differences. Importantly, we also changed line 95 to “a novel form of *observation-space* adversarial attack on sequential decision-makers […].”

**Q2.1: The authors use the argument from Shi et al. (2020) (i.e., testing the Markov property) to motivate stealthy attacks. However, there may be attacks where the attack policy may seem to induce a Markov process.** We assume that there is a slight misunderstanding here. As we point out in Theorem 4.2, fulfilling the Markov property is a necessary but not sufficient condition for a stealthy attack.

**Q2.2: Furthermore, this is not really a detection argument (there is no discussion around detection rules), but more the test of a property.** We agree that our writing was unclear here. We now state in Section 4.1 that the criterion of statistical indistinguishability can be decomposed into two conditions (instead of “we propose to decompose the detection process into two steps”).

**Q3: Do not seem to address the problem of making the attack stealthy w.r.t. the rewards**: We assume the standard case where the agent observes the reward during training, but not at test-time [1].
A scenario where the agent also observes the reward at test-time constitutes an interesting direction for future work.

[1] Optimal Control of Markov Processes with Incomplete State Information, Åström, Karl Johan, 1965
[2] Planning and acting in partially observable stochastic domains, Kaelbling et al., 1998

**Q4: Do not make use of a CUSUM-like algorithm**: Thank you for this remark. We have investigated the CUSUM method and found that it is not directly applicable to our scenario, as in our setting the test-time distribution is unknown and the state space is continuous. We agree that extending CUSUM-based approaches to high-dimensional domains constitutes an interesting direction for future work.

**Q4.1: Constraint in (3) is unclear**: The difference is that one term originates from the attacked process, the other from the unattacked process. We have now made this clearer in the text.

**Q4.2: Typo in Algorithm 1**: Thank you, we have addressed this.

**Q5: Little theoretical arguments**: We would like to politely ask for a clarification here.

**Limitations**: Thank you for this remark. We have now added a statement to our conclusion. We state the potential negative effects of our work, but point out that we expect the potential positive effects of allowing adversarial defence systems to harden against illusory attacks to outweigh these.

**Additional clarifications**: Hoping that the clarifications provided are insightful, we were wondering if you could consider updating your score.

---

## Review 4

**Summary:** This paper reveals that existing adversarial attacks in reinforcement learning can be detectable by learned detectors or human inspection. Based on this motivation, the authors propose a new type of illusory attacks that aim to render statistically indistinguishable state perturbations. Two versions of the attack are proposed: one is more theoretically optimal but may be unrealistic, while another is less optimal but more practical. Experiments on several control tasks show that the proposed method is less detectable than prior methods.

**Strengths And Weaknesses:**

### **Strengths**

1. The paper studies a relatively novel problem, and reveals the detectability problem of existing attacks in RL.
2. The paper is in general well-written.
3. The proposed method is intuitive, and the empirical results are also interesting.
4. A human study is conducted to demonstrate the detectability.

### **Weaknesses**

1. The experiments are conducted on a limited number of environments, while higher-dimensional environments like Atari games, which are the main focus of a lot of recent literature [1-4], are not discussed. It will be interesting to see whether the proposed attack can work in visual control environments, and whether existing attacks are also detectable in these domains.
2. The paper does not discuss another very relevant work [5] which also studies detectability of adversarial attacks in RL.
3. I am not sure whether some claims are necessarily correct. Please see my questions below for more details. I hope the authors can provide more explanations regarding them.

My current opinion is leaning towards a borderline rating.

[1] Huang et al. Adversarial Attacks on Neural Network Policies. ICLR Workshop 2017.
[2] Zhang et al. Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations. NeurIPS 2020.
[3] Oikarinen et al. Robust Deep Reinforcement Learning through Adversarial Loss. NeurIPS 2021.
[4] Sun et al. Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL. ICLR 2022.
[5] Russo et al. 2022. Balancing detectability and performance of attacks on the control channel of Markov Decision Processes. ACC 2022.

**Questions:**

1. I am not convinced by the claim that state-action consistency does not result in long-term consistency (Thm 4.2 and the remark under Def 4.5). More specifically, in the example in A.2, if the two 'D's on the right are identical in their state representations but not identical in their transitions, then it is not an MDP. In other words, the right MDP indeed has 7 states, instead of 6, and is different from the MDP on the left. Of course, if the states are partially observable, then the claim makes more sense. But the paper does not make it very clear.
2. What is the detailed algorithm for the perfect illusory attack? I found the description in 5.2 and A.4 not concrete and sufficient enough to understand the implementation. A pseudocode will be helpful here.
3. It is interesting that in Table 1, defense methods can only slightly improve the robustness against baseline attacks, but can greatly improve the robustness against the perfect illusory attacks. Are there any insights behind this?

The above questions mainly focus on the necessity and realism of the perfect illusory attacks. So I hope the authors could provide more details on why the perfect illusory attacks are important.

**Limitations:** The paper does not explicitly discuss the limitation of the current work.

**Ethics Flag:** No
**Soundness:** 2 fair
**Presentation:** 3 good
**Contribution:** 2 fair
**Rating:** 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
**Confidence:** 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
**Code Of Conduct:** Yes

---

### Response to Review 4

**Q1: Higher-dimensional environments like Atari games are not discussed, which are the main focus of a lot of recent literature [1-4].** We would like to point out that none of the works [1-4] evaluate **learned** adversarial perturbations on image-observation-space environments such as Atari. [1-4] all only evaluate white-box gradient-based attacks (such as FGSM or PGD attacks) on Atari ([3] presents a hybrid solution where an RL agent determines the direction for the white-box gradient-based PGD attack). Please note that out of these papers, [2] is the only work that learns adversarial perturbations, and it uses evaluation environments similar to ours. We however agree with the reviewer that learned image-observation-space adversarial attacks are an interesting direction for future research, and even more so the application of illusory attacks to image-observation-space environments such as Atari.

**Q2: Paper does not discuss another very relevant work [5] which also studies detectability of adversarial attacks in RL**: We agree that [5] is relevant prior work that introduces attacks based on statistical indistinguishability between attacked and unattacked MDPs, albeit in a significantly different setting.
To our understanding, we are still the first to consider statistically indistinguishable **observation-space** attacks, and to evaluate their effectiveness in attacking human-AI systems. We were not aware of [5] and have now contrasted it in our related work section (see below). We now outline the main differences of our work. Our work significantly differs from [5] in the following ways:

- Different threat model: We consider the scenario where the adversary attacks the *victim’s observation*, while [5] considers the case where the adversary attacks the *victim’s action*.
- More general class of attacks: We consider adversaries that condition on the full history, while [5] considers adversaries that condition on the current state only. Hence, in contrast to [5], we study statistical indistinguishability in more general settings (which can be non-Markovian).
- Applicability to large continuous state spaces: We show that our methods scale to large state spaces, while [5] relies on methods applied to environments with very small state spaces.
- Evaluation with human subjects: We additionally evaluate detectability of adversarial attacks through humans inspecting high-dimensional observation spaces.

We have updated our writing to reflect the named differences. Importantly, we also changed line 95 to “a novel form of *observation-space* adversarial attack on sequential decision-makers […].”

**Q3: Not convinced by the claim that state-action consistency does not result in long-term consistency**: Thank you for this remark. As pointed out in the caption of Figure 7, the right process is a "decision process" with long-term correlations, i.e. not an MDP.

**Q4: What is the detailed algorithm for the perfect illusory attack?** Thank you for this suggestion. We have now added pseudocode for perfect illusory attacks to the paper in Appendix A. This pseudocode is also given below. Note that if the initial state distribution $t(\cdot|\emptyset)$ is symmetric with respect to the origin, then the first attacked observation is $\tilde{s}_0 = -s_0$, i.e. the negative of the initial environment state. All subsequent attacked observations are then computed according to the environment transition function $t$. (A runnable sketch of this procedure is given at the end of this response.)

**Algorithm 2 - Perfect illusory attacks**

**Input**: environment $env$, environment transition function $t$ whose initial state distribution $t(\cdot|\emptyset)$ is symmetric with respect to the point $p_{symmetry}$ in $\mathcal{S}$, victim policy $\pi_v$

$k = 0$
$s_0 = env.\text{reset}()$
$\tilde{s}_0 = -(s_0 - p_{symmetry}) + p_{symmetry}$
$a_0 = \pi_v(\tilde{s}_0)$
$\_\,, done = env.\text{step}(a_0)$
**while not** $done$ **do**:
  $k = k+1$
  $\tilde{s}_k \sim t(\cdot \mid \tilde{s}_{k-1}, a_{k-1})$
  $a_k = \pi_v(\tilde{s}_k)$
  $\_\,, done = env.\text{step}(a_k)$
**end while**

**Q5: Insights behind Table 1**: Thank you for this remark. There was in fact a typo in Table 1, which we have now corrected. The bottom-right score is 73 (it was erroneously listed as 23). In fact, the effect pointed out by the reviewer does not exist.

**Limitations**: Thank you for this remark. We have now added a statement to our conclusion. We state the potential negative effects of our work, but point out that we expect the potential positive effects of allowing adversarial defence systems to harden against illusory attacks to outweigh these.

**Additional clarifications**: Hoping that the clarifications provided are insightful, we were wondering if you could consider updating your score.
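For concreteness, below is a minimal Python sketch of Algorithm 2 above. It assumes an old-Gym-style environment interface (`reset()` returning a raw state, `step()` returning a 4-tuple); the names `victim_policy`, `transition_sample`, and `p_symmetry` are illustrative placeholders rather than identifiers from our codebase, and `transition_sample(s, a)` stands in for drawing a next state from the environment transition function $t(\cdot \mid s, a)$.

```python
import numpy as np


def perfect_illusory_attack(env, victim_policy, transition_sample, p_symmetry):
    """Sketch of Algorithm 2 (perfect illusory attack).

    Assumes the environment's initial state distribution is symmetric about
    `p_symmetry`, that `env` follows a Gym-style reset()/step() interface
    returning raw states, and that `transition_sample(s, a)` samples a next
    state from the environment transition function t(. | s, a).
    """
    s0 = np.asarray(env.reset())
    # Mirror the initial state about the point of symmetry to obtain the first
    # attacked observation (this reduces to -s0 when p_symmetry is the origin).
    s_tilde = -(s0 - p_symmetry) + p_symmetry

    attacked_observations = [s_tilde]
    done = False
    while not done:
        # The victim acts on the attacked observation ...
        a = victim_policy(s_tilde)
        # ... while the real environment evolves under that action.
        _, _, done, _ = env.step(a)
        if done:
            break
        # All subsequent attacked observations follow the true transition
        # function, rolled out from the previous attacked observation.
        s_tilde = transition_sample(s_tilde, a)
        attacked_observations.append(s_tilde)
    return attacked_observations
```

As in Algorithm 2, the sketch never shows the victim the true environment states: the victim only ever observes the mirrored initial state and its on-model rollout, which is why the attacked observation sequence is distributed exactly like an unattacked trajectory.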