NeurIPS dataset paper

**Reviewer BWo2**

**Q1. Psychophysical validation of the dataset on humans is done on only 2 observers, which is too small a sample size to conclude anything meaningful.**

Ans: Thanks for the suggestion. We have performed the psychophysical test with seven human observers, and the details are provided in Sec. 3.2 of the updated version of the paper.

**Q2. The psychophysical validation strategy (comparison of brightness) is valid only in cases where the illusory region is physically present. Hence, the authors employ it only for a subset of the types of brightness illusions and do not validate the remaining.**

Ans: Please note that the psychophysical validation has been done by selecting random patches in the illusions. Therefore, the psychophysical validation covers all types of illusions present in the dataset. More details are available in Sec. 3.2 of the updated paper.

**Q3. In my opinion, the model experiments do not serve the motivation of the paper. In Section 1, the authors motivate that developing a large dataset for illusions can aid in our understanding of biological perception. However, their statement is not well-evidenced by any of the experiments presented in the paper. All models are trained on the illusion dataset and show high accuracy and F1-scores on the test set. This does not tell us anything about how these models can now be compared to human perception. I would appreciate further clarification on this.**

Ans: We would like to draw the Reviewer's attention especially to Sec. 5.4, which demonstrates an inversion in the perceived brightness of the same test patch in the Shifted White effect without any fundamental change in the background (Fig. 11 in the updated paper), except for the spatial frequency, which is varied through aspect-ratio manipulation. The network has never been trained on any such stimulus, yet it is able to replicate the biological perception.

**Q4. The paper is written well, with the exception of a few grammatical errors and typos. A round of proofreading should be enough to resolve them. Figures are very small, making it very hard to read axis labels and tick markers.**

Ans: Thanks for the suggestion. We have corrected the grammatical errors and typos and improved the alignment and readability of the figures.

**Reviewer ctZU**

**Q1. At a high level, illusory stimuli are used to understand the functioning of the human visual system. I would expect to see the performance of the models trained on natural images on these stimuli and vice-versa.**

Ans: Thanks for the suggestion. Since the training tasks are quite different for natural images and illusion images, we design two experiments to compare performance on natural vs. illusion images.

1. Classification task: If natural images are considered non-illusions, can a model trained to classify illusion vs. non-illusion successfully identify natural images as non-illusions? We test illusion vs. natural image (from the Caltech101 dataset [8]) classification and achieve a high test accuracy of 99.98% within 5 epochs (a minimal sketch of this evaluation is given below).
2. Localization task: We formulate the localization task as segmenting the region of perceived darkness in an image. When we test the model on natural images, it locates darker patches in those images, as shown in Fig. 14 in the updated paper.
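The following is a minimal sketch of how such an illusion vs. natural-image classification test could be run. The ResNet-18 backbone, folder layout, file names, and checkpoint are illustrative assumptions, not necessarily the exact setup used in the paper.

```python
# Hedged sketch of the illusion vs. natural-image classification test described above.
# Paths, the ResNet-18 backbone, and the checkpoint name are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms
from torch.utils.data import DataLoader

tfm = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # brightness-illusion stimuli are grayscale
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical folder layout: data/illusion_vs_natural/test/{illusion,natural}/*.png
test_set = datasets.ImageFolder("data/illusion_vs_natural/test", transform=tfm)
loader = DataLoader(test_set, batch_size=32, shuffle=False)

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)                 # binary head: illusion / non-illusion
model.load_state_dict(torch.load("illusion_classifier.pt"))   # assumed trained checkpoint
model.eval()

correct, total = 0, 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"Test accuracy: {100.0 * correct / total:.2f}%")
```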
**Q2. Generalization to novel types of illusions in the identification task is not addressed.**

Ans: Thanks for the suggestion. Illusion identification is a classification problem; therefore, classifying a completely new type of illusion, e.g., a geometric illusion such as the Müller-Lyer illusion, may not be appropriate. Instead, we test generalization with respect to illusion localization. Note that our trained segmentation model, whose task is to identify illusory patches, generalizes to novel types of illusions. Please refer to the results shown in Sec. 5.4, involving perception inversion in patches without any significant change in the background.

**Q3. Why is the pixel-level ground truth limited to regions perceived darker? Why not consider a dense relative brightness map?**

Ans: Good suggestion! We have used the darker patches as the illusory segments in all our experiments. A dense relative brightness map could serve as an alternative. It is a matter of choosing one convention between two options, much as a positive direction is chosen for ray diagrams in geometric optics.
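To make the distinction between the two conventions concrete, here is a toy illustration with made-up values; the "perceived" array is purely hypothetical and not derived from any data in the paper.

```python
import numpy as np

# Toy illustration of the two ground-truth conventions discussed above.
# 'physical' holds actual stimulus intensities; 'perceived' holds hypothetical
# perceived-brightness values for the same pixels.
physical  = np.array([[120.0, 120.0, 120.0],
                      [120.0, 120.0, 120.0]])
perceived = np.array([[120.0,  95.0, 120.0],
                      [120.0, 150.0, 120.0]])

# Dense relative-brightness map: signed perceived-minus-physical difference everywhere.
relative_map = perceived - physical

# Convention adopted in the paper: a binary mask marking only regions perceived darker.
darker_mask = (relative_map < 0).astype(np.uint8)

print(relative_map)
print(darker_mask)
```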
**Q4. If the models trained on this task show strong differences in the features when compared to models trained on natural images, generalization could become an issue.**

Ans: A natural image, unlike these illusions, contains many other features such as color, texture, etc. Hence, it may not be appropriate to compare the two situations directly. However, generalization is possible with respect to features such as brightness and contrast. We have shown the response on natural images of the network trained on illusion images in Fig. 14 of the updated paper. The model can, to some extent, identify such illusory patches in a natural image.

**Q5. Illusions are an emergent phenomenon. Our visual system is trained on natural images, and illusions are a result of the priors or inference strategies adopted by our visual system. Datasets that could allow us to train using natural images and test on the illusory stimuli could be valuable.**

Ans: Thanks for the suggestion. As in our response to Q1 above, we design two experiments to compare natural vs. illusion images.

1. Classification task: If natural images are considered non-illusions, can a model trained to classify illusion vs. non-illusion successfully identify natural images as non-illusions? We test illusion vs. natural image (from the Caltech101 dataset [8]) classification and achieve a high test accuracy of 99.98% within 5 epochs.
2. Localization task: We formulate the localization task as segmenting the region of perceived darkness in an image. When we test the model on natural images, it locates darker patches in those images, as shown in Fig. 14 in the updated paper.

**Reviewer Mh3B**

**Q1. Quality: Some of the evidence for the claims of generalizability to novel illusions seems to be anecdotal, which means examples could have easily been cherry-picked. Moreover, performance of previous baselines is inexplicably missing.**

Ans: For generalizability, we test on different kinds of novel illusions that are not used for training. We are not aware of a data-driven approach that could be considered a baseline. We mention the limitations of the filter-based approaches in the prior-work section.

**Q2. Clarity: Some of the figures seem to have been made in a rush, as they are misaligned, default Python and MATLAB settings not changed, etc. While this is not a major issue, redoing (some of) the figures would be greatly appreciated, I believe, by every reader.**

Ans: Thanks for the suggestion. We have updated the figures in the revised draft.

**Q3. Significance: Results seem to be too good given that only simple baselines are used. While that might be the purpose of the dataset, i.e., to be easily learned by some model so that we can study illusions, it may indicate biases in the data that might render it less useful.**

Ans: Please note that illusion classification is the easier task. Illusion localization, however, is more challenging and also crucial for understanding the formation of the illusions.

**Q4. First and foremost, it seems that only illusory images are validated. How are non-illusory stimuli validated? How are the data augmentations that are carried out validated (e.g., those mentioned in line 208)?**

Ans: Please note that the non-illusory images are created using specific perturbations that are known to be effective from related work [2, 3, 7, 10]. As per the Reviewer's suggestion, we have performed psychophysical experiments with seven subjects on the illusory vs. non-illusory images. We found that the subjects perceived reduced illusory effects in the non-illusory images.

**Q5. On that note, I would like to mention that I am perplexed by the "weak" Hermann grids (e.g., Figure 2). I still see the illusion in Figure 2c for example, but my understanding is that the authors do not count them as illusions (line 150). More details are required to clear up the confusion that arises from the authors' choice.**

Ans: In Fig. 2, the non-illusion variants are established in prior work [7]. Therefore, following these works, we have considered those to be the non-illusion variants. However, we understand that in some cases illusions can be subjective.

**Q6. Moreover, where generalization to an illusion outside the five kinds of illusions in the dataset is concerned, e.g., Mach band, Dungeon, Howe, only a handful of anecdotes are presented.**

Ans: Thanks for the suggestion. We will take this into account and rewrite the relevant sections accordingly.

**Q6. In terms of the correctness of some minor claims in the paper, lines 30-31, the claim that deep learning is inspired by the VISUAL cortical model is unsubstantiated in the paper. In lines 36-37, you speak of a well-studied phenomenon but do not cite any studies.**

Ans: Thanks for the suggestion. We will cite relevant papers as evidence for this claim.

**Q7. Lastly, me and two impartial, non-expert (but not independent, as they are my relatives) observers disagree on the claims regarding the illusion discussed in lines 308-317, specifically the claim in 310-311. I understand the subjective nature of illusions, so this may not be an issue per se, but if the authors' data support the claims they make and they did not make a mistake in the phrasing, that could speak to the need for more than two individuals validating the dataset.**

Ans: Thanks for the suggestion. As suggested, we have validated the dataset with more human observers. Regarding the claim in lines 310-311, the transition of brightness in the Shifted White illusion under aspect-ratio variation is well known and established [45]. We have verified that the trained model is able to predict this transition in brightness successfully, even though it has never seen this form of illusion before.

**Q8. My main issue was with the Figures. Specifically, figures 4, 5, 7, 8, 9, 10, 11 and 13 seem to be misaligned (either spatially or even the coordinates on the axes), using the default settings of the libraries used (although that is not a problem, it is jarring).**

Ans: Thanks for the suggestion. We have updated the figures and aligned them. We have also added some non-illusions as suggested.
**Q9. There are some grammatical/syntactical errors in lines 111, 117 that were noticeable, plus a closing parenthesis is missing in line 231.**

Ans: Thanks for pointing this out. We have corrected these in the updated version.

**Q10. The authors provide somewhat detailed instructions on how to replicate the dataset, but do not, as of now, provide the dataset or discuss its future maintenance. Somewhat paradoxically, reviewers are provided a Python notebook that loads images of the dataset from a local path.**

Ans: Thanks for the suggestion. We have provided a better interface for loading the dataset and running the experiments.

**Q11. Couldn't illusions be used for subliminal messaging and such? While simple grayscale brightness illusions might be far removed from these more complex "illusions", perhaps it is worth bringing up. Moreover, one should be careful when generalizing from a sample of 2 validators, as I mentioned above.**

Ans: Thanks for the suggestion. We have performed the experimental validation with seven human observers, and the results are provided in Sec. 3.2 of the updated paper.

**Q12. First, did the authors consider including both illusory and "non-illusory" parts within the same image (by non-illusory here I am referring to surfaces that could potentially be considered illusory)? Perhaps that could serve to increase the difficulty of the dataset.**

Ans: Thanks for the suggestion. We have not considered both illusory and non-illusory parts within the same image. Currently, we consider only the illusory parts in an image.

**Q13. Second, the operationalization of illusion localization as detecting the part that appears darker than it is does not allow the model to even begin to detect the Hermann grid illusion when the background is dark, e.g., Figure 8. Are there any ideas on how to extend this operationalization? Did the authors try any other ground truth for the models to learn?**

Ans: Good suggestion! In the case of a darker background, which is complementary to our setting, the illusory patch would appear brighter instead of darker. However, the overall effects and the operationalization would remain the same in that case.

**Q14. Third, the authors should probably provide the used splits to allow replication and comparison, and talk about how they tune their hyperparameters.**

Ans: Thanks for the suggestion. We have provided the splits in the documentation of the dataset.

**Q15. The term "psychophysical" remains opaque the first few times it is mentioned, perhaps consider briefly clarifying the term. Another term that was confusing was "brightness induction" (line 41). Please clarify that term.**

Ans: Thanks for the suggestion. We will clarify these terms.

**Q16. I also found weird the way the authors are referring to the usage of existing models such as ResNet and UNet, e.g., lines 126-127: "We propose an approach ...".**

Ans: We would like to clarify that we are benchmarking the dataset using existing models, e.g., ResNet and UNet. The main contribution lies in the problem formulation and in creating the dataset.

**Q17. Also, the authors refer to the final loss as a "weighted combination", but they in effect use specifically convex combinations of them.**

Ans: We will change "weighted combination" to "convex combination" in the final version.
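For clarity, a convex combination of two loss terms has the form sketched below; the symbols $\mathcal{L}_1$, $\mathcal{L}_2$, and $\lambda$ are placeholders and not necessarily the specific losses or weight used in the paper:

$$\mathcal{L}_{\text{total}} = \lambda\,\mathcal{L}_{1} + (1-\lambda)\,\mathcal{L}_{2}, \qquad \lambda \in [0, 1].$$

The defining property is that the weights are non-negative and sum to one, which is what distinguishes a convex combination from an arbitrary weighted combination.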
**Q18. One final thing would be that I do not see how experiment 6 (lines 322-334) adds anything really to the conversation, plus it also consists of an anecdote. Perhaps the authors can consider performing RSA on a large number of stimuli or some similar method to quantify the divergence in representations.**

Ans: In Experiment 6, we visualize whether there is a difference between the layer activations of a network trained on natural images and one trained on illusion images. Interestingly, we found a visible and notable difference in the lower layers between the two networks.
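Following the Reviewer's suggestion, a minimal sketch of how representational similarity analysis (RSA) could quantify the divergence between the two networks is given below. The model names, layer choice, and stimulus tensor are placeholders; this is an illustrative sketch under those assumptions, not the experiment reported in the paper.

```python
# Illustrative RSA sketch: compare representations of a network trained on natural
# images (`net_nat`) with one trained on illusion images (`net_ill`) on a common
# stimulus batch `stimuli` of shape [N, 3, 224, 224]. All of these are assumed inputs.
import numpy as np
import torch
from scipy.stats import spearmanr

def layer_activations(model, layer, stimuli):
    """Collect flattened activations of `layer` for every stimulus in the batch."""
    feats = []
    handle = layer.register_forward_hook(
        lambda m, inp, out: feats.append(out.detach().flatten(1).cpu().numpy()))
    with torch.no_grad():
        model(stimuli)
    handle.remove()
    return np.concatenate(feats, axis=0)   # [N, D]

def rdm(acts):
    """Representational dissimilarity matrix: 1 - Pearson correlation across stimuli."""
    return 1.0 - np.corrcoef(acts)          # [N, N]

def rsa_score(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    rho, _ = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho

# Hypothetical usage, assuming ResNet-style models with a `layer1` attribute:
# acts_nat = layer_activations(net_nat, net_nat.layer1, stimuli)
# acts_ill = layer_activations(net_ill, net_ill.layer1, stimuli)
# print("RSA (layer1):", rsa_score(rdm(acts_nat), rdm(acts_ill)))
```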
**Reviewer Fxaj**

**Q1. The tasks presented are quite trivial.**

Ans: We respectfully differ from the Reviewer on this point. We agree that illusion identification is an easier task, but illusion localization is not, as shown in Table 4.

**Q2. Illusion identification is a really easy task for CNNs. Benchmarking this task is quite useless since any network (also very simple ones, with just a few conv layers) would reach above 99.9% accuracy. Illusion localization is slightly more difficult, but a state-of-the-art semantic segmentation network should be capable of obtaining really high scores.**

Ans: We agree that illusion identification is an easier problem compared to illusion localization. However, the novelty lies in the problem formulation, dataset generation, psychophysical validation, and benchmarking.

**Q4. I think what is missing is the link between the tasks and the main objective of the dataset: 'understand and interpret visual illusions'. Human vision is not trained on these illusory images, but rather on natural images. Conversely, a CNN that is trained on these images in a supervised way is going to learn specific basic filters (e.g., Sobel-like filters) or other kernels that are slightly more complex. Therefore, I think the paper lacks a proper connection that justifies the choice of such tasks and, hence, ground-truth annotations.**

Ans: A Sobel filter, by detecting edges or outlines in an image, helps to demarcate an object from its background, which is also true for the illusory patches. If, by learning to classify illusions from non-illusions, the CNNs do learn such basic filters, then such learning may also be effective for natural images in tasks such as ascertaining the presence of a target object in an image. This may be explored as future work.

**Q5. The psychophysical experiment that validates the dataset, described in section 3.2, is performed on only 2 individuals. For a large-scale dataset with 20k+ images, maybe it would be better to increase the number of participants to average the subjective perception differences out.**

Ans: Thanks for the suggestion. We have performed the psychophysical experiments with seven human observers and included the results in Sec. 3.2 of the updated version of the paper.

**Q6. There is no information about the human experimentation conducted (checklist answers are N/A).**

Ans: We have provided more information on the human experimentation in Sec. 3.2 of the updated version.

**Q7. It is not clear why the 'standard' patch is a bar with 11 segments, and which one is the actual standard (lines 182-188).**

Ans: The standard patch is a bar with 11 segments in order to quantize the grayscale range (0-255). We have followed the standard 2AFC experimental setup to validate the generated grayscale brightness illusions.

**Q8. Usually, a detection mechanism based on neural networks is evaluated with a ROC analysis, or, in the case of unbalanced datasets (as in this case), with the area under the precision-recall curve.**

Ans: For the segmentation task, we have used the F1-score and mIoU as metrics.

**Q9. In section 5.1 there is no information about the threshold used (I imagine 0.5).**

Ans: For segmentation, we use a threshold of 0.25.
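As an illustration of the evaluation mentioned above, the sketch below shows how the F1-score and IoU could be computed for a single predicted mask binarized at a threshold of 0.25; mIoU would then be the average of this IoU over the test set. The array names and shapes are placeholders, and the paper's released evaluation code may differ.

```python
import numpy as np

def f1_and_iou(prob_map, gt_mask, threshold=0.25):
    """F1-score and IoU for one predicted probability map vs. a binary ground-truth mask."""
    pred = (prob_map >= threshold).astype(np.uint8)
    gt = gt_mask.astype(np.uint8)

    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()

    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)
    return f1, iou

# Toy usage with random data, purely for illustration:
rng = np.random.default_rng(0)
prob = rng.random((128, 128))          # hypothetical network output
gt = rng.random((128, 128)) > 0.7      # hypothetical ground-truth darker-patch mask
print(f1_and_iou(prob, gt))
```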
**Q10. The dataset includes no documentation on the design and the generation processes. For instance, why are there 10x SBC induced grating illusions with respect to White illusions? There is no ablation on the actual number of illusory samples required for a CNN to learn to identify the illusions to justify this choice.**

Ans: Thanks for the suggestion. We have provided the generation process in the documentation.

**Q11. Since there is no summary of the dataset construction procedure, there is no way to tell whether the dataset is sound or not. The MATLAB generation code is included in the dataset link, but section 2 of the supplementary material should be enriched with additional details.**

Ans: We have provided both the dataset and a Google Colab notebook to reproduce the results, which supports reproducibility. We are also planning to make the dataset public through a public Google Drive link upon acceptance of the paper.

**Q12. The paper is understandable, overall. However, there are a few unclear sentences (e.g., lines 1-6, line 33, line 62, 70-73, 185-188, 221-223) and quite a few English mistakes (e.g., lines 29, 45, 55, 70, 76, 92, 94, 110, 213, 226, 255, 337). Also, section 4 can directly be included in section 5 or vice-versa, since there are a few repetitions. Figure 4 could be improved for better understanding. The quality of figure 5 could be improved by proper export in MATLAB. Figure 6 looks like the illusion localization considers the detection output as input, while it processes the image directly.**

Ans: Thanks for pointing out these issues. We have rectified them in the updated version.

**Q13. The dataset is made available through the apposite link, together with the MATLAB code for dataset generation and Python code for the CNN models. From the answers in the checklist, it is clear the code will be made public in the future. Hosting, licensing and maintenance will be released with the dataset. However, in my opinion the submission lacks a summary of the process of dataset generation (including the choice of the number of images), and a discussion on the dataset folder organization.**

Ans: Thanks for the question. We have provided more comprehensive details of the dataset generation and of how to reproduce the results in the updated version.

**Q14. I think the major issue, as I pointed out above, is the choice of the tasks to benchmark. My opinion is that there is no discussion on why benchmarking these tasks improves the understanding of the brightness illusion phenomenon in human perception. If the objective is to see what kind of filters a CNN learns to identify an illusion, maybe it could be worth it to release the dataset together with a set of kernel/activation visualization functions to delve into the low-level mechanisms of vision, without "caring" for performance on these tasks.**

Ans: Thanks for the suggestion. We have already provided some of the kernel/activation visualizations for the dataset, both in the main paper and in the supplementary material. Further analysis of the dataset could provide more insight into how similar or dissimilar the learned illusory perception is compared to human perception.

**Reviewer w4aU**

**Q1. The 2 Alternate Forced Choice experiment used to validate the dataset involved only two human subjects, which is minimal for a dataset that is, in fine, meant to make machines able to behave like the general human visual system (not a single human visual system in particular).**

Ans: Thanks for the suggestion. We have performed the 2AFC experiment with seven subjects to validate the dataset. The details are provided in Sec. 3.2 of the updated version of the paper.

**Q2. I have some doubts about the illusion localization mask. For example, in figure 1.a the illusion is marked on the right side but, from my point of view, the gray square on the left side could also be considered an illusion region. Similar issues are present for the images presented in figures 1.b, d and e.**

Ans: Thanks for the comment. We would like to mention that in this paper we focus on illusory darkness, i.e., we try to segment out the darker illusory patch in the image. For example, in Fig. 1.a, both the left and right patches have exactly the same intensity value, but the right patch appears darker. Therefore, we segment out the right patch as the region of perceived darkness. Similarly, for the other illusions we also segment out the darker patch.

**Q3. The authors claimed they validated their dataset via a 2 Alternate Forced Choice experiment, but the experiment used only 2 human subjects. In the case of an illusion with two possible illusory regions, two human subjects would select the same region 50% of the time and both would select a specific region 25% of the time. I would also assume that the subject responses would be highly correlated across multiple images.**

Ans: We have performed the psychophysical validation with seven subjects to statistically validate the dataset. The results are provided in Sec. 3.2 of the updated version of the paper.

**Q4. In the end, I am not convinced of the correctness of the ground truth provided for this dataset.**

Ans: We have verified the ground truth of the dataset using psychophysical validation; hence, it can be considered verified.

**Q5. There are major issues with the dataset documentation and hosting plan at this point.**

Ans: We have documented the dataset and are currently planning to host it on a JHU university drive and release it upon acceptance.

**Q6. The dataset is provided, without license or documentation, on Google Drive. I think we can agree that cloud platforms are not ideal for long-term storage of datasets, as they can change the conditions for file sharing without prior notice (this happened to me with some Dropbox data I was using on a personal website), and they can also delete someone's account if they deem it inactive.**

Ans: Thanks for the comment. We have made the dataset public through the link: https://www.cis.jhu.edu/~aroy/Supplementary_BRI3L.zip. A comprehensive flow of our method has been provided in the Google Colab notebook: https://colab.research.google.com/drive/1g4Ov5Cbx4nIzd-QxabmtuFC9A-rMdrO0#scrollTo=WUlmxxDmtHqd

**Q7. I would expect a higher standard for the hosting and licensing plan to match the expectations described in the call for papers of this track.**

Ans: Thanks for the suggestion. We are planning to host the dataset with a license that will allow general users to use and modify the dataset while citing this paper.