## Response to Reviewer 4oxd

We thank Reviewer 4oxd for the valuable comments and suggestions. We answer the questions and concerns one by one. For each question/concern, we first put the quote from the original review for reference, and then present our answer.

### Question 1

> Review quote: The VQA adversarial vulnerability was previously studied in a similar fashion
>
> The task of VQA adversarial vulnerability was studied before. In [1] the focus was on studying the vulnerability to language variations and [2] studied the image vulnerability of both captioning and VQA models using an adaptation to C&W using the harder setup of targeted attacks. This paper mentions that attacking V+L transformers models is the main contribution, but this has limited novelty. Instead, it should also evaluate what aspects of attacks are amplified/reduced when using these recent models. For example, [2] mentioned one of the reasons that attacks fail is the language prior. Evaluation such aspects in recent transformers and focusing on the language-image interplay (e.g., using targeted attacks as well) can be interesting and provide new insights.
>
> [1] Jia-Hong Huang et al. "A novel framework for robustness analysis of visual qa models." AAAI’19.
> [2] Xiaojun Xu et al. "Fooling vision and language models despite localization and attention mechanism." CVPR’18.

Answer: We thank the reviewer for the related works and comments. The differences from the two papers are as follows:

- [1] focuses on the language modality of VQA systems only, while we focus on large-scale V+L models, which are widely used currently, so the robustness analysis of such models is of great importance. In addition, we evaluate different ways to explore the robustness of V+L models, including causality, consistency regularization, and adversarial training, which we believe will provide insights for future research to further improve the robustness of such models.
- [2] studies adversarial attacks on classical VQA models, while we focus on exploring defense approaches (e.g., causality, consistency regularization, and adversarial training) for V+L models, and we provide a benchmark dataset by attacking SOTA transformer-based models, which will make related evaluations more convenient.

We will add these related discussions to the related work section in our revision to make this more clear.

### Question 2

> Review quote: No comparison to previous causal VQA methods
>
> Also, regarding novelty, [3] studied the task of causal VQA by constructing counterfactuals. This paper provides no discussion of how the used causal loss approach is more beneficial, especially since it focuses on the object detection component only, while [3] directly focused on the VQA task itself.
>
> [3] Vedika Agarwal et al. "Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing." CVPR’20.

Answer: Thanks for providing additional references. Indeed, [3] explores causal VQA, but it does not focus on the adversarial robustness of models. The “robustness” discussed in [3] concerns model performance on **normal** data samples with spurious correlations and biases, **not adversarial** examples that are designed to fool the model. Also, the VQA models studied in [3] are more classic ones, with no overlap with the SOTA models we study in our paper. It would be interesting future work to evaluate whether the method in [3] also works as an adversarial defense for the transformer-based models discussed in our paper.
We will include the related work and discussions in our revision.

### Question 3

> Review quote: The attacks and most of the defenses do not consider the language modality and its interaction with the image
>
> Overall, only in consistency loss does the system being a VQA is really getting explored. Else the attack is similar to attacking a vision transformer.

Answer: Thanks for the suggestion.

- For the attack: We only allow the image modality to be perturbed, since we find that this is already powerful enough to attack the V+L models, which is an interesting observation in itself. We agree that jointly attacking the image and language modalities is an interesting direction for future work.
- For the defense: In our exploration of defense strategies, we evaluate image-modality adversarial training as a baseline, causal analysis, and consistency across modalities. These strategies address the single (image) modality and multi-modal interactions from different perspectives, so that we can cover different types of defense strategies and, by comparison, draw interesting conclusions such as that leveraging multiple modalities can further improve model robustness.

We will make our defense strategy selection and the related discussions more clear in our revision.

### Question 4

> Review quote: The paper mentions blackbox adversarial attacks and transferability, but all results were performed via whitebox
>
> The paper claims to construct an adversarial dataset, VAVQA (Visually Adversarial VQA), as a benchmark to evaluate the robustness of VQA models to blackbox adversarial attacks. However, all attacks performed in the paper are whitebox. Evaluating the transferability across models (e.g., with different architectures, such as the end-to-end models vs. object-detector-based models) should also be included in the paper to support this claim.

Answer: We are sorry for the confusion, and we have updated our description to make it more clear. In this paper, we aim to explore the most powerful attacks, i.e., whitebox attacks, and evaluate different defense strategies against them in order to draw conclusions about the potential and pros/cons of each defense strategy. We evaluate attack transferability to characterize the effectiveness and properties of the generated attacks in our benchmark dataset, so as to provide more information for future evaluations.

### Question 5

> Review quote: Regarding transferability, a novel evaluation and attack setup would be to evaluate the robustness when the perturbed image is paired with other questions (different from the one used to construct the attack). Can we have a unified per-image attack that transfer to different semantically-different questions?

Answer: We thank the reviewer for this comment; it is indeed an interesting setting. We added an experiment to test the transferability of the attack generated on each image-question pair to the same image paired with **other** questions, as well as the defense performance against this transfer attack. As the table below shows, the attack transfers to other, non-attacked questions and degrades the scores by around 20 points in the H case. For ViLT, the proposed defense restores the performance to a value close to the benign score (66 vs. 70). Consistent with the conclusions in the paper, the defense yields a smaller robustness improvement for UNITER because of the vulnerability of its object detection (OD) module.
We will add related discussions in our revision.

|        | Benign | E Attack / Defense | M Attack / Defense | H Attack / Defense |
| ------ | ------ | ------------------ | ------------------ | ------------------ |
| ViLT   | 70.84  | 61.79 / 67.52      | 57.23 / 66.97      | 51.69 / 66.38      |
| UNITER | 69.73  | 57.71 / 58.42      | 52.98 / 54.41      | 50.11 / 51.61      |

### Question 6

> Review quote: I find the naming notion “E/M/H: Easy/Medium/Hard” to be misleading, since these setups use different budgets and it is obvious that increasing the perturbation would increases the attack success rate.

Answer: We thank the reviewer for the valuable comment. Indeed, we use different adversarial budgets to indicate the hardness levels of the attacks. On page 4 of the paper, we note that “The names indicate the difficulty levels of the potential defense, as it is more difficult to defend when the attack gets stronger with a larger perturbation magnitude.” We will add more explanation and give the attacks more intuitive names, such as "L/M/H budgets". We would also like to take suggestions from the reviewer!
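To illustrate the budget-based naming and the whitebox, image-only attack setting discussed above, below is a minimal PyTorch sketch of an L-infinity PGD-style attack on the image modality, where the proposed L/M/H names map to increasing perturbation budgets. The `BUDGETS` values, the `pgd_image_attack` helper, and the `model(image, question)` interface are illustrative assumptions for this response only, not the exact attack settings or model APIs used in the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical mapping from the proposed L/M/H budget names to L-infinity
# perturbation magnitudes; these epsilon values are illustrative only and
# are not the paper's actual settings.
BUDGETS = {"L": 1 / 255, "M": 4 / 255, "H": 8 / 255}


def pgd_image_attack(model, image, question, label, eps, steps=10):
    """Whitebox PGD-style attack that perturbs only the image modality.

    `model(image, question)` is assumed to return answer logits; this is a
    generic sketch of the attack setting, not the interface of ViLT/UNITER
    or the paper's exact attack implementation.
    """
    alpha = eps / 4  # step size per PGD iteration
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv, question), label)
        grad = torch.autograd.grad(loss, adv)[0]
        # Ascend the loss, then project back into the eps-ball around the clean image.
        adv = adv.detach() + alpha * grad.sign()
        adv = image + torch.clamp(adv - image, -eps, eps)
        adv = adv.clamp(0.0, 1.0)
    return adv.detach()


# Example usage: generate an adversarial image at each budget level for one
# (image, question, label) triple.
# adv_images = {name: pgd_image_attack(model, image, question, label, eps)
#               for name, eps in BUDGETS.items()}
```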