# attack go nips rebuttal
## KataGo david
Thank you for your insightful comments. We agree that it would be even more interesting if we could attack without leveraging
In this paper, we first give the definition of adversarial perturbations for Go. We restrict the adversary to perturbing the state (game board) by adding one or two “meaningless” stones to unimportant locations of the board that do not affect the winrate or the best action of the game. We say the perturbed state is “semantically similar” to the original one. An adversarial example is defined as a perturbed state that leads the Go agent to an undoubtedly inferior action, one that is obvious even to amateur human players. Then, we propose a novel method to systematically find adversarial states on which AZ agents perform much worse than usual. We carefully design the constraints on perturbed states during the search so that they are semantically similar to the original states and are also easy enough for human players to verify the correct move. If a Go agent cannot find the correct move for such a perturbed state, we say we have found an adversarial example. Finally, we test AZ-based agents on thousands of these perturbed states, and we surprisingly find that the agents do make trivial mistakes on these adversarial examples, even when MCTS is used.
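For concreteness, the conditions above can be summarized in shorthand as follows (here $V$ denotes the examiner's win-rate estimate, $d_{L_0}$ the trajectory edit distance, and $\epsilon$ a small tolerance; these symbols are our informal shorthand, not necessarily the exact notation of the paper):

$$
\underbrace{d_{L_0}(s, s') \le 2}_{\text{one or two added stones}}
\qquad\text{and}\qquad
\underbrace{|V(s) - V(s')| \le \epsilon}_{\text{the added stones are meaningless}},
$$

while the agent's chosen move on $s'$ is clearly inferior, i.e., verifiable as a blunder even by amateur human players.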
## Message to AC
Dear Area Chair,
Since our work is the **first demonstrated attack on Go AI**, a setting where adversarial examples **were never defined nor demonstrated**, Reviewer nfAu had some confusion regarding the definition of adversarial examples. We design our adversarial examples to be semantically similar to the original state in a game (the added stones have no impact on the game outcome), which is the same in nature as NLP adversarial examples (sentences that are semantically similar but wrongly classified). We hope the Area Chair can understand that many properties of classical adversarial examples are not directly applicable here (e.g., noise being imperceptible) and that many classical intuitions about adversarial examples (e.g., the performance vs. robustness trade-off) are not always valid here.
Reviewer nfAu also asked about applying existing attacks on RL to our setting; however, many existing attacks operate in continuous spaces like MuJoCo or pixel spaces like Atari, which are quite different from Go. We carefully studied the related works the reviewer pointed out, and none of them are actually applicable. Indeed, ours is the first attempt to attack a Go agent, and we hope it can serve as a baseline for future research.
In addition, the main purpose of our work is to **show the surprising errors a super-human Go AI can make, not to demonstrate a practically feasible attack** (e.g., a black-box attack) - this mirrors the situation when adversarial examples were initially discovered in computer vision. Our discovery can be used to find the weaknesses of these agents and to debug them to further improve their performance.
Reviewers argued that the adversarial examples we found are not "natural"; however, we believe typical adversarial examples in computer vision are not natural either (pixel-wise adversaries are never encountered in real data). In fact, **since Go has a finite number of states, all states are valid**, and humans can start playing from any opening. It is surprising that **a super-human Go AI can make obvious mistakes in certain states**, which means it does not fully generalize. We believe this discovery is valuable.
We hope the AC can take these points into consideration, especially since our work is the **first to demonstrate surprising errors in a superhuman AI**, which is quite innovative compared to most adversarial attacks in computer vision, where humans serve as the oracle.
Sincerely,
Authors
## Follow up of Reviewer W67q
Thank you for your suggestion. We agree that without knowledge of Go, the adversarial examples are much less impressive. We will add more background knowledge to the supplementary materials. Note that some information can already be found in Appendix E (Territory). Thank you again for helping improve our paper. We hope all your concerns have been addressed, and that you will positively support our paper during the final discussion with the AC.
## Third Follow up of Reviewer 9kDD
Again, thank you for your insightful question on the practicality of our setting. We give some examples where a Go AI is used with "unnatural" states, and in addition we argue that the main focus of our paper is to discover the weaknesses of the agent and debug it, rather than aiming for a practical attack. We discuss this in detail below:
> "should we train AZ agents robust against this type of perturbations that cannot occur in practice?"
Besides "playing" starting from the empty board, AZ agents have other applications that will encounter all kinds of legal states. Therefore, making AZ agents more robust is also important.
- Large-scale MCTS search: When we execute MCTS with millions of simulations, the search will start exploring states that would never be encountered by the agent during normal play. If the PV-NN is not general enough to evaluate those states, the search might become weaker than MCTS with fewer simulations. Therefore, if we can improve the PV-NN's ability to generalize to states close to normal states, we can also improve the search result of MCTS. Our work provides a baseline to evaluate such generalization.
- Proving: Unlike "playing", which only needs to provide a non-perfect, best-effort attempt to achieve the game's goals, "proving" needs to provide follow-up strategies for *all* the actions of the opponent, even those that look unreasonable. Proving is an old but important topic in the game community, and many algorithms have been proposed, such as alpha-beta search and proof number search. Currently, people are using AZ agents as a heuristic to make the proving faster [1]. Furthermore, it is common to introduce some "unnatural" states when proving. For example, to solve life-and-death problems (i.e., tsumego) with AI, [1] fills the rest of the board with irregular stones to make the AI focus on the local battle ([Fig. 10](https://i.imgur.com/FlZ8NNa.png) of [1]).
In general, attacking is not the only reason for generating adversarial examples. An important motivation is debugging --- finding "counterexamples", i.e., inputs on which the agent makes a wrong prediction, in order to debug and improve the agent. More specifically, we are trying to find states on which the agent makes obvious mistakes, which shows the agent is not "perfect" and sheds light on how to improve it. For example, we have shown in our paper that the counterexamples obtained by our attack reveal that AZ agents are sensitive to which stone was played last, even given the same board, and thus we can try to fix this problem by permuting the sequence before passing it to the network, as shown in our paper (Line 380). We believe our tool, which reveals the "weaknesses" of AZ agents, can be useful for debugging the agents.
[1] Shih, Chung-Chin, et al. “A Novel Approach to Solving Goal-Achieving Problems for Board Games.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 9. 2022.
<!-- > The authors mentioned that the setting of slightly perturbing system states is valid in the prior work on RL adversarial attacks. However, most of the prior work including [1] focuses on robotic tasks, where modeling uncertainty and variation is natural. For example, when a robot is deployed in the real world, the joint position of the robot can be changed by an extra force such as wind.
> I suggest that the authors propose a more practical scenario as adversarial literature in other domains.
- Thank you for the suggestion. Like many preliminary studies of adversarial attacks in different fields, our work can be improved so that it can be used in a more practical scenario. However, we believe that the weaknesses of an ML model will be revealed step by step, starting from knowing that bugs do exist. -->
## Second Follow up of Reviewer 9kDD
Thank you so much for being willing to discuss further with us!
> Who is the adversary and who is the defender in this adversarial threat model?
- One way to view our setting is that we have three actors involved: the Go AI, the opponent player, and the attacking adversary. We allow the attacking adversary to slightly perturb one of the states in a given game in a restricted way, such that only 1 or 2 stones are added and these stones are meaningless for the state. Then the Go AI and its opponent continue to play from the perturbed state, starting with the Go AI. The goal of the attacking adversary is to find an example $s'$ on which the Go AI will play an extremely bad move and consequently lose the game against the opponent. Furthermore, the extremely bad move is so unreasonable that even human players can easily tell it is incorrect; hence, we can ignore the opponent and already tell that the AI will lose. (An illustrative sketch of this search is given after this answer.)
- The defender is the Go AI agent. Although we are not focusing on defense, we show a strategy to improve the robustness of the Go AI agent based on the vulnerability identified by our attack (line 380).
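For illustration only, the search implied by this threat model can be sketched as below. This is a simplified sketch, not the exact algorithm in the paper; the helper names (`win_rate`, `legal_meaningless_moves`, `apply_moves`, `target_agent_move`) and the thresholds `eps_nat` and `eps_adv` are hypothetical placeholders standing in for the examiner MCTS, the meaningless-move filter, the target agent, and the paper's actual constraints:

```python
# Illustrative sketch only (not the paper's actual algorithm): search for an
# adversarial state s' by adding one or two "meaningless" stones to state s.
# All helper functions and thresholds are hypothetical placeholders.
from itertools import combinations

def find_adversarial_state(s, eps_nat=0.05, eps_adv=0.5):
    # win_rate(state): examiner's (MCTS) win-rate estimate, always expressed
    # from the perspective of the player to move in the original state s.
    v_s = win_rate(s)
    candidates = legal_meaningless_moves(s)      # stones believed to have no effect
    for k in (1, 2):                             # add one or two stones
        for extra in combinations(candidates, k):
            s_prime = apply_moves(s, extra)
            # The perturbation must preserve the semantics of the position.
            if abs(win_rate(s_prime) - v_s) > eps_nat:
                continue
            # The target agent must now blunder: its chosen move causes a
            # dramatic win-rate drop according to the examiner.
            a = target_agent_move(s_prime)
            if v_s - win_rate(apply_moves(s_prime, [a])) > eps_adv:
                return s_prime, a                # adversarial example found
    return None
```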
> Your proposed adversarial threat model requires both players to make adversarial moves. Should both the players be considered adversaries?
- As we explained above, the adversary is a third party trying to perturb the state, while the Go AI and its opponent continue to play normally. The setting of slightly perturbing system states is valid and has been demonstrated in other RL adversarial attacks, such as [1, 2], where such perturbations are modeled as an adversary in simple MuJoCo and Atari environments. The main contribution of our work is to demonstrate that such weaknesses also exist in state-of-the-art, superhuman Go AIs in a 2-player game setting with discrete states, discrete actions, and MCTS-based policies. We will add this discussion to our paper.
[1] Pinto, Lerrel, et al. "Robust adversarial reinforcement learning." International Conference on Machine Learning. PMLR, 2017.
[2] Lin, Yen-Chen, et al. "Tactics of adversarial attack on deep reinforcement learning agents." arXiv preprint arXiv:1703.06748 (2017).
- Note that, for most AZ agents, we can only input a state to the program as a sequence of legal actions (like the trajectory shown in Line 162). Hence, unlike [1, 2], which can modify the state directly, the attacking adversary can only use legal actions as the perturbation to create adversarial examples. More importantly, all these RL attacks work in continuous spaces while the state of Go is discrete, so a new attack is needed here.
Thank you again for the very insightful questions, which greatly help us improve our paper. We hope all your concerns have been addressed now; please kindly let us know if you have any further questions.
<!-- - One way to view our setting is that we have two actors involved: a Go AI as the defender and an attacking adversary. The mission of the Go AI is to provide a good action for any given state, no matter who the turn player of that state is. On the other hand, the attacking adversary is allowed to select and slightly perturb one of the states in a given game under several restrictions (C1, C2). The attacking adversary is considered successful if the defender makes a low-level mistake (C3) on the state perturbed by the attacking adversary. -->
## Follow up of Reviewer nfAu
We thank you for reading our comments carefully and for the constructive reply. However, we feel there are still certain misunderstandings we would like to clarify further.
> This is because it is almost the only way to generate adversarial examples in the real-world setting, which makes the topic valuable as the hacker always has no access to the intermediate results or states of any deployed DNN model.
- We agree with the reviewer that attacking a deployed DNN model without access to the model (i.e., a black-box attack) is an important topic. However, most adversarial attack research starts with white-box attacks where the attacker can access the model [1, 2, 3], since it is important to understand the limitations of any model we are using. Our attack is the first attack on Go AI, so we follow the white-box setting.
[1] Madry, Aleksander, et al. "Towards deep learning models resistant to adversarial attacks." arXiv preprint arXiv:1706.06083 (2017).
[2] Carlini, Nicholas, and David Wagner. "Towards evaluating the robustness of neural networks." 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.
[3] Ebrahimi, Javid, et al. "Hotflip: White-box adversarial examples for text classification." arXiv preprint arXiv:1712.06751 (2017).
> Back to the scope of this paper, the authors focus on attacking the special intermediate states during playing Go, which is almost impossible for humans to fabricate. So I really do not think it is a practical or meaningful setting for the adversarial community.
- First, we argue that although the adversarial examples we found for Go are not "natural" states, when adversarial examples were initially discovered for DNNs, they were not natural either - the imperceptible adversarial noise almost never occurs in natural data. So **this criticism actually applies to most papers studying adversarial examples**. Moreover, existing works on attacking RL also alter agent states [4, 5] in ways that are impossible in real scenarios.
[4] Lin, Yen-Chen, et al. "Tactics of adversarial attack on deep reinforcement learning agents." arXiv preprint arXiv:1703.06748 (2017).
[5] Huang, Sandy, et al. "Adversarial attacks on neural network policies." arXiv preprint arXiv:1702.02284 (2017).
- Additionally, we believe that showing how badly an agent can behave even in an unnatural ("almost impossible for humans to fabricate") and adversarial setting is important, and this is a central focus of the adversarial community. With our adversarial examples, people can understand that a strong Go AI sometimes still lacks basic Go knowledge that an amateur human player has, which is really surprising for human Go players (just like adversarial examples were surprising for image classifiers).
> As the authors said, lots of players use AZ-like agents to enhance their skills, so they more care about performance of the agents under regular conditions.
- In reality, Go players probe the knowledge of the AI to learn how it solves a particular problem by creating artificial states. For example, to solve life-and-death problems (i.e., [tsumego](https://senseis.xmp.net/?Tsumego)) with AI, a previous work [6] [fills the rest of the board with irregular stones to make the AI focus on the local battle](https://i.imgur.com/LicjU9I.png). Our results show that such states may mislead the agent, which is not surprising given that we have shown in our paper that the agent is vulnerable even on easier states (where a novice player often knows the right move) and under more regular conditions (only two steps away from a natural state).
[6] Shih, Chung-Chin, et al. "A Novel Approach to Solving Goal-Achieving Problems for Board Games." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 9. 2022.
> Sacrificing performance of clean accuracy for a situation that is unlikely to happen makes this issue even more meaningless from the perspective of practicality.
- First of all, the purpose of this work is to show the limitations of Go AI, and we do not claim to show any clean-performance trade-off in this work, although we empirically found that a more robust agent may have better performance. For example, the DNNs trained with the stronger AZ algorithm are both stronger and more robust than those that are not.
> Besides, though the authors claim it is the first time to explore adversarial examples in AlphaZero-like Agents, I still think it is the responsibility of the authors to provide a strong and reliable baseline method from previous RL methods because this can more intuitively and accurately evaluate the effectiveness of the method in this paper.
- As we discussed in the previous comment, none of the baselines can be applied to our setting with discrete states, discrete actions, and a non-differentiable MCTS procedure in the agent policy, and some of the papers also have quite different focuses (e.g., altering rewards or observations, which is not our setting). We can certainly list these baselines and report them as inapplicable if the reviewer thinks that is helpful. We believe our method is intuitive and can serve as a baseline for future attacks.
Since our work is the **first demonstrated attack on Go AI**, a setting where adversarial examples **were never defined nor demonstrated**, we hope the reviewer can understand that **many properties of classical adversarial examples are not directly applicable here** (e.g., noise being imperceptible) and that many classical intuitions about adversarial examples (e.g., the performance vs. robustness trade-off) are not always valid here. Our paper studies a novel problem setting, proposes non-trivial algorithms, and demonstrates surprising results, which we believe does not correspond to a rating of 3: `a paper with technical flaws, weak evaluation, inadequate reproducibility and incompletely addressed ethical considerations.` Hopefully, most of your concerns have been addressed now, and we hope you can kindly reevaluate our paper based on our discussions. Thank you.
## Follow up of Reviewer 9kDD
Thank you for reading our comments carefully. We would still like to clarify the concern about realism.
> However, I still have a concern that the adversarial scenario that the authors proposed is not realistic.
- In reality, Go players probe the knowledge of the AI to learn how it solves a particular problem by creating artificial states. For example, to solve life-and-death problems (i.e., [tsumego](https://senseis.xmp.net/?Tsumego)) with AI, a previous work [1] [fills the rest of the board with irregular stones to make the AI focus on the local battle](https://i.imgur.com/LicjU9I.png). Our results show that such states may mislead the agent, which is not surprising given that we have shown in our paper that the agent is vulnerable even on easier states (where a novice player often knows the right move) and under more regular conditions (only two steps away from a natural state).
[1] Shih, Chung-Chin, et al. "A Novel Approach to Solving Goal-Achieving Problems for Board Games." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 9. 2022.
- In addition, we argue that although the adversarial examples we found for Go are not "natural" states, when adversarial examples were initially discovered for DNNs, they were not natural either - the imperceptible adversarial noise almost never occurs in natural data. Moreover, most existing works on attacking RL [2, 3] also alter agent states in ways that are impossible in real scenarios. So this criticism applies to many works on adversarial examples.
[2] Huang, Sandy, et al. "Adversarial attacks on neural network policies." arXiv preprint arXiv:1702.02284 (2017).
[3] Lin, Yen-Chen, et al. "Tactics of adversarial attack on deep reinforcement learning agents." arXiv preprint arXiv:1703.06748 (2017).
- Finally, we believe that showing how badly an agent can behave even in an unrealistic and adversarial setting is important, and this is a central focus of the adversarial community. With our adversarial examples, people can understand that a strong Go AI sometimes still lacks basic Go knowledge that an amateur human player has, which is surprising (just like adversarial examples were surprising for image classifiers).
## Reviewer W67q rate 5
> Clarity 1: "However, since this study would involve real humans to verify the results, it can be hard to scale this research definition of the state space of the game."
- Our method itself doesn't rely on human evaluation. We define our attack in such a way that the obtained solution should automatically be a human-verifiable, semantically equivalent example, similar to adversarial attacks in computer vision and NLP. No human evaluation is used for constructing the attack examples. Human evaluation is only used to verify the quality of the adversarial examples obtained by our algorithm. This kind of human evaluation is also used in other fields, such as NLP attacks, where previous works usually use human evaluation to verify whether the adversarial text created by the attack is natural and semantics-preserving [1, 2].
[1] Alzantot, Moustafa Farid et al. "Generating Natural Language Adversarial Examples." EMNLP (2018).
[2] Jin, Di et al. "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment." AAAI (2020).
- Further, since the definition of semantically invariant perturbations mainly involves the value function and MCTS, our attack can be easily generalized to other games, such as NoGo.
> Clarity 2: the level of these humans in terms of their expertise level of the game Go
- The levels of these amateur human players are 2k, 3d, and 5d (Line 364).
> Question 1: what is exactly the state space? and how do you calculate the L0 distance between states?
- States of Go agents are normally represented by their trajectories (the actions played by both players, starting from the empty board) (Line 162). Therefore, the L0 distance is calculated as the edit distance between two trajectories, which equals the number of actions we added (Line 161).
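As a minimal illustration (not the paper's code): since the attack only inserts extra actions, the original trajectory is a subsequence of the perturbed one, and the L0/edit distance reduces to counting the insertions. The move coordinates below are made up purely for illustration:

```python
# Minimal illustration: the L0 / edit distance between two trajectories when
# the perturbed trajectory only *inserts* extra actions into the original one.
def l0_distance(original, perturbed):
    """Return the number of inserted actions, assuming `original` is a
    subsequence of `perturbed` (which holds for our perturbations)."""
    it = iter(perturbed)
    assert all(move in it for move in original), "original must be a subsequence"
    return len(perturbed) - len(original)

# Moves written as (column, row) pairs purely for illustration.
original  = [(3, 3), (15, 15), (3, 15)]
perturbed = [(3, 3), (15, 15), (0, 0), (3, 15)]   # one extra "meaningless" stone
print(l0_distance(original, perturbed))            # -> 1
```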
> Question 2: intuitively, how to verify that two states s and s' are semantically equivalent? does verifying this require human expertise in the game Go?
- As mentioned in our response to Clarity 1, we search for adversarial examples without a human in the loop. Specifically, we require the two states $s$ and $s'$ to have similar win rates (see Eq. (1)) and the additional moves (from $s$ to $s'$) to be "meaningless". Human evaluation is only used to evaluate whether the adversarial examples obtained by our algorithm are really semantically equivalent to the original states (Line 290, 465).
> Question 3: condition C3 seems quite strong, requiring the verifier to be able to identify a wrong state. How hard is this for a human player to do?
- C3 is one of our goals: we hope the adversarial examples found by our method satisfy it. For the value attack (Line 215), verification is easy since humans can see that $v(s)$ and $v(s')$ differ; hence they can tell that at least one of them is wrong. For the policy attack (Line 217), we designed several constraints to achieve C3, such as Eq. 3 and requiring the best action $a^*_s$ of $s$ to also be the best action of $s'$ (Line 232). These constraints do not require human verifiers. We conduct human evaluations only to check whether the final states $s'$ returned by our algorithm are verifiable as we expected. The results (Line 370) show that the human verifiers can verify C3 on most examples quite easily, like the examples shown in Figure 3. However, there are still examples that are hard for the human verifiers, like the examples shown in Figure 4, which required about half an hour of discussion to realize that they are adversarial examples. Usually, with a larger $\eta_{adv}$, the examples are easier for humans to verify.
> In line 235, the authors mentioned "since it only takes one mistake for an agent to lose a game", why is this true?
- We define an action $a$ as a "mistake" at state $s$ if, after playing action $a$, the win rate (according to the examiner MCTS) drops dramatically. When playing against an opponent of a similar level, making such a mistake will typically cost the agent the game.
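In the notation used elsewhere in this response ($V$ for the examiner's value and $T(s,a)$ for the state reached after playing $a$ in $s$), this can be written as follows, with $\eta$ as a placeholder threshold rather than the paper's exact notation:

$$
a \text{ is a mistake at } s \iff V(s) - \bigl(1 - V(T(s, a))\bigr) > \eta,
$$

i.e., the win rate of the player who plays $a$ (estimated by the examiner MCTS before and after the move, with the perspective flip $1 - V(\cdot)$) drops by more than $\eta$.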
> Is there a possibility to completely get rid of human participation in training the attacker?
- We do not need human participation in our attack. See our response to Clarity 1.
## Reviewer 6PMq rate 6
> Question 1: Why it's easier for a human verifier to verify the 1STEP attack than 2STEP attacks.
- In the 1STEP attack, the human verifiers only need to check whether the added action helps the player who played it gain any benefit (such as 1 unit of territory). On the other hand, in a general 2STEP attack, both actions could help their players gain benefits (say one action gains ten points of territory and the other gains nine). The human verifier would then need to tell that the benefits gained by both actions are equal, which is much more complicated (see the discussion in Line 192). Hence, we force all the extra actions to be meaningless, so that the human verifier only needs to check that both actions gain zero territory, without needing to compare their values, which is much easier.
> Question 2: line 197: The implication here seems to be that not both 1STEP and 2STEP moves are constrained to be meaningless.
- No, the additional actions of both the 1STEP and 2STEP attacks are all meaningless (as defined in Line 197). For the 1STEP attack, the added action is automatically meaningless once Eq. 1 is satisfied. Therefore, we only need to additionally ensure that the actions of the 2STEP attack are meaningless.
> Question 3: line 245: Why do the authors assume that there are N ≈ 150 (the average number of actions for a state) meaningless moves for each state? Is it not likely to be less than the average number of moves? At very least, perhaps N-1 is more justified?
- $N$ is the average number of actions for a state, not the number of meaningless moves per state. In line 245, we use $\bar N$ to denote the number of meaningless moves for each state, which is usually much smaller than 150.
> However, it is unclear how the author's method compares with prior approaches, in particular Gleave et al. (2019).
- Gleave et al. (2019) fool the opponent by only controlling their own agent to play meaningless actions. Although conceptually similar, they consider a setting quite different from ours. First, they consider finding actions in a continuous space (e.g., MuJoCo environments), whereas in Go we have a discrete action space. Second, the actions in our setting have a quite different nature. Compared to the environments in Gleave et al. (2019), each step in an intense game such as Go is critically important, and playing a meaningless action quickly leads to losing the game. In Gleave et al. (2019), each step is not that important, since it might be effective for only 0.1 seconds; even if a player plays a meaningless action to fool the opponent, it can often be undone in the next time step without leading to a failure of the agent. However, in Go, the player needs to gain as much benefit as possible at every step, and it is almost impossible to recover from a meaningless action. Finally, state-of-the-art Go agents have an MCTS component, which is not considered in Gleave et al. (2019). Thus, we cannot directly apply their approach to our setting. We will add these discussions in the revision.
> Why the authors chose 'semantically invariant/meaningless' moves rather than adversarial moves in general
- Let $s$ denote the original state; our goal is to find adversarial moves that perturb it into $s'$ such that $s$ and $s'$ are semantically equivalent. However, even if $s$ and $s'$ are semantically equivalent to the examiner, this might not be obvious to humans. Therefore, we additionally force each added action to be meaningless.
> Since the authors constrain their method to legal board states, they should relate their work to "Natural Adversarial Examples" (Hendrycks et al. 2021).
- We will cite this paper and discuss it in our revision. That paper studies a set of natural (semantically preserved) perturbations of image data. In the first half of our paper, we design semantically preserved perturbations of Go states, which shares the same goal but requires a totally different design due to the discrete space and the game-playing environment.
> How easy is it to apply this method to games such as chess or shogi?
- Like NoGo (end of Section 4), our method can be similarly applied to games like chess and shogi since they also satisfy the assumptions we state in Appendix A. To the best of our knowledge, our method is the only one that can find adversarial examples of search-based agents on discrete turn-based games.
> What is the use of this method? Is it applicable for adversarial training to improve the robustness of Go-playing agents?
- Go-playing agents are already much stronger than human players, and our work is the first to demonstrate that even those agents can make mistakes that are easily verified by humans. In terms of applications, our attack algorithm can be used to find "bugs" in search-based agents and shed light on debugging them. For instance, in Lines 382--392, our attacker reveals that existing agents are sensitive to the ordering of actions in the trajectory, and we propose a simple approach that significantly improves the robustness of the KataGo agent.
> Other minor comments:
- Thank you for pointing out those issues. We have incorporated them in the revision.
## Reviewer 9kDD
> Some condition for defining the similarity between two states seems artificial, like $\max(V(s'), 1-V(T(s', a^*_s)))$
- The reviewer is correct that when the examiner makes a mistake on state $s'$, the target agent will also be fooled, and this $s'$ will be an adversarial example for both the examiner and the target agent. However, in this case our algorithm would conclude that $s$ and $s'$ are not semantically equivalent and would miss this adversarial example. Therefore, we add the second term $1-V(T(s', a^*_s))$ to make the examiner stronger. Here $a^*_s$ is the examiner's (MCTS) best action at state $s$, and we use $a^*_s$ as a hint for the examiner. Intuitively, if $s$ and $s'$ are really semantically equivalent, $a^*_s$ will be a good action for $s'$ too. Therefore, using $1-V(T(s', a^*_s))$ allows the MCTS to search from a better initialization and thus obtain a more accurate value. Since this is equivalent to conducting a one-step minimax search (see the short sketch after this answer), it is not artificial, and it can be generalized to a larger minimax tree search when more information about $s$ is hinted.
- In practice, we found that $\max(V(s'), 1-V(T(s', a^*_s)))$ is more useful when the target agent is an MCTS agent, since in that case the target agent and the examiner are more likely to be fooled together. Without this additional term, the attack success rate drops by more than 15% when the simulation count is 50.
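As a minimal sketch of this one-step minimax evaluation (the function names mirror the notation above; this is illustrative shorthand, not the paper's implementation):

```python
# Minimal sketch of the "hinted" examiner value discussed above:
# evaluate s' directly, and also after playing the original best action a*_s,
# then keep the larger of the two (a one-step minimax using the hint).
def hinted_examiner_value(s_prime, a_star_s, V, T):
    direct = V(s_prime)                         # examiner MCTS value of s'
    after_hint = 1.0 - V(T(s_prime, a_star_s))  # value after the hinted move,
                                                # flipped back to the same player's view
    return max(direct, after_hint)
```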
> The proposed adversarial scenario seems unrealistic.
- Like adversarial image attacks, we demonstrate that even AZ agents can behave badly on states that are close to natural states. This may be even more surprising than fooling image classifiers, since the AZ agents we used in this work already significantly surpass human ability, and it is important to know that those agents also make simple mistakes that can be easily verified by humans.
- We frame this work as finding "bugs" of AZ agents rather than trying to develop a realistic attack to fool them. Finding bugs of machine learning models is important for improving existing models and is one of the main use cases of adversarial examples. For instance, in Lines 382--392, our attacker reveals that existing agents are sensitive to the ordering of actions in the trajectory, and we propose a simple approach that significantly improves the robustness of the KataGo agent.
- The main application of AZ-based Go agents is to teach humans the best move in different board states. Even professional players often query the Go agents with states to learn the best move, and these queries may not be natural states. Our work highlights that very strong Go agents like AZ may not be trustworthy on such artificial queries, uncovering a potential weakness and limitation of AI.
> The paper does not provide any quantitative results showing that the proposed attack method degrades the performance of the target agent, though the paper says that the attacked agent will make mistakes way below their level (line 19).
- A quantitative measure of agent mistakes is defined by the level of the human verifier: when a mistake can be verified by a lower-ranked player, the mistake is more serious, i.e., a lower-level mistake. Note that, although a professional human player will also make mistakes, such mistakes cannot be verified by ordinary human players (level <= 4D). Therefore, it is surprising that AZ agents, which are stronger than professional human players, make these low-level mistakes.
> Grammatical errors:
- Thank you for pointing this out. We have fixed the errors in the revision.
## Reviewer nfAu
We thank the reviewer for raising the insightful questions that we did not emphasize in the paper. The main concern of the reviewer comes from a classical definition of adversarial example in computer vision. The contribution of our paper becomes immediately clear once this definition is generalized to discrete domains such as NLP and games, as we will discuss in detail below.
> These actions do not conform to the traditional sense of imperceptibility.
- We agree that "human imperceptibility" is one important aspect of adversarial examples, but it usually holds only in continuous domains (e.g., vision, speech), not in discrete domains (e.g., NLP, board games). In general, given a natural example $s$, we can define an adversarial example $s'$ as another example that is semantically equivalent to $s$ but predicted differently by the model. In continuous domains, "semantically equivalent" can be defined by human imperceptibility. However, in discrete domains, essentially any change is perceptible. For example, in NLP, given a natural sentence "This movie had terrible acting," an adversarial example can be "This movie had awful acting," where we change only one word in the sentence to a synonym. The change is **not imperceptible to humans** (a word has obviously been changed), but humans can easily verify that the two sentences are semantically equivalent, so if the perturbation changes the model's prediction, we know there is an error in the model. There are tens of papers every year studying this kind of word-substitution-based adversarial example in NLP; see [1, 2, 3, 4, 5, 6] and the benchmarking GitHub repository that collects existing text attacks: https://github.com/QData/TextAttack. Our definition of adversarial examples is almost identical to this definition in NLP: we add only 1 or 2 stones and require the perturbed state to be semantically equivalent to the original state (i.e., the added stones do not give any advantage to either player).
[1] Alzantot, Moustafa Farid et al. "Generating Natural Language Adversarial Examples." EMNLP (2018).
[2] Jin, Di et al. "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment." AAAI (2020).
[3] Ren, Shuhuai et al. "Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency." ACL (2019).
[4] Li, Linyang et al. "BERT-ATTACK: Adversarial Attack against BERT Using BERT." ArXiv abs/2004.09984 (2020): n. pag.
[5] Jia, Robin et al. "Certified Robustness to Adversarial Word Substitutions." EMNLP (2019).
[6] Li, Zongyi et al. "Searching for an Effective Defender: Benchmarking Defense against Adversarial Word Substitution." EMNLP (2021).
> Why do we want to study adversarial examples for AZ agents?
- The adversarial examples can be viewed as "bugs" of AZ agents. Finding bugs of machine learning models is important for improving existing models and is one of the main use cases of adversarial examples. For instance, in Lines 382--392, our attacker reveals that existing agents are sensitive to the ordering of actions in the trajectory, and we propose a simple approach that significantly improves the robustness of the KataGo agent.
- Additionally, our work highlights that even very strong Go agents like AZ may not be trustworthy, uncovering the potential weaknesses and limitations of AI. With MCTS, AZ is both more "robust" and more "accurate" than humans, unlike most ML models in other fields. Many professional players use AZ agents every day to improve their skills, so it is important to fully understand the limitations of these agents.
> I tend to believe the lack of appropriate training strategy is the main reason for the misprediction given by the agent, rather than the so-called "adversarial example".
- Although we do agree that a more "appropriate training strategy" may eliminate some adversarial examples (similar to adversarial training for classification models), the AZ algorithm we evaluated is currently the state-of-the-art training strategy, and we showcased the surprising mistakes it can make, i.e., the weaknesses of this state-of-the-art strategy. We hope our discovery can inspire researchers to propose a more "appropriate training strategy", similar to adversarial training.
- We believe that the mistakes we found can be called "adversarial examples", since the agents are correct on the original state but extremely wrong on a perturbed state very similar to the original one; see also our answer to the first question, where we discuss the definition of "imperceptibility" in discrete spaces.
> The most surprising point of adversarial examples at present is that there seemingly exists an inevitable phenomenon between the robustness and accuracy of the ML model; that is, adversarial training will reduce the performance of any model in identifying clean samples.
- As mentioned above, our goal is to find "bugs" in AZ agents, which will shed light on how to improve existing methods. Currently, we do not know whether there is a trade-off between robustness and accuracy in the case of AZ, but we agree with the reviewer that fixing those errors may sacrifice clean performance. The focus of this work is to first discover the adversarial examples in Go agents; we leave the study of the trade-off as future work.
> For the evaluation experiments, I am a little confused why these works [1][2][3] cannot be good baselines for evaluating the performance of the proposed method to attack AZ agents.
- Thank you for pointing out these papers. Since our paper focuses on finding adversarial perturbations in a discrete turn-based game against AZ (MCTS search-based) agents, none of these methods, which are designed for continuous domains without search, apply. Additionally, gradient-based techniques do not work since our target agent's policy is search-based and there is no easy definition of a gradient. Moreover, the purposes of the three papers are quite different from ours, as explained below:
- [1] Zhang, Xuezhou, et al. "Adaptive reward-poisoning attacks against reinforcement learning." ICML, 2020.
- This paper perturbs a small part of the *rewards* during training to make the agent unable to learn properly. Our work only tries to evaluate the robustness of existing state-of-the-art agents during testing, and we do not perturb the rewards during training.
- [2] Zhang, Huan, et al. "Robust deep reinforcement learning against adversarial perturbations on state observations." NeurIPS, 2020.
- This paper focuses on adding perturbation to the *observations* of the agent, not the actual states. We perturb the actual state instead of observation, and we focus on discrete-space agents with MCTS, which is not discussed in [2].
- [3] Gleave, Adam, et al. "Adversarial policies: Attacking deep reinforcement learning." ICLR, 2020.
- This paper tries to fool the opponent by leading the agent to a state far from its training distribution. In comparison, our adversarial state is only two steps away from a normal state. We additionally point out three major differences between Gleave et al. and our work in our response to [Reviewer 6PMq](https://openreview.net/forum?id=yZ_JlZaOCzv&noteId=7fGJTstolV4).