# Rebuttal for (2023-11-10)

[toc]

## To ACs (2023-12-04)

Dear Area Chairs,

We hope this message finds you well. We are writing to provide an update on the revisions made to our paper on the GNSVR model, in response to the valuable feedback received from the reviewers. We have made significant efforts to address the concerns raised and enhance the quality of our work.

**1. Acknowledgment of Reviewer Contributions.** Firstly, we would like to express our deep gratitude to the reviewers for their insightful and constructive feedback. Their suggestions have been instrumental in refining our model and its presentation.

**2. Enhancements in the GNSVR Model.** The core innovation of our GNSVR model, as rightly identified by the reviewers, lies in its novel and general framework for efficiently **generating** and **reusing** modules in visual reasoning tasks. We have made advancements in both module generation and reusage, ensuring our model's adaptability and intelligence in a range of tasks.

**3. Revised Experiments.** To further substantiate our claims and demonstrate the effectiveness of GNSVR, we have conducted additional experiments, including:
1. Detailed ablation studies on various components.
2. Implementing and analyzing performance on the I-RAVEN dataset.
3. Comparative analysis with VisProg and ViperGPT, and a computational efficiency comparison with ViperGPT.

**4. Paper Revision and Comparison with Related Work.** In addition to incorporating these experimental results, we have also thoroughly revised our manuscript (highlighted in blue) to include a discussion on related works and refined technical descriptions. We believe these revisions provide a clearer understanding of our model's unique contributions and its distinction from existing works.

**5. Closing Remarks and Request for Reevaluation.** We have managed to address most reviewer concerns and have turned a slightly negative review into a positive one. We have also put much effort into addressing reviewer **QVVg**'s concerns, including 1). conducting new detailed ablation studies; 2). implementing additional experiments with the I-RAVEN dataset; 3). discussing and comparing with related works. Despite our efforts to address these points promptly, we have not yet received feedback. We sincerely hope the ACs and reviewers will reevaluate our contribution based on the broader significance of our work.

Thank you for your time and consideration.

Warm regards,
Authors

---

**G1. Contribution Recognition.** We extend our sincere gratitude to the reviewers for their time and effort in reviewing our paper. We are pleased to note that the reviewers have generally acknowledged the following contributions of GNSVR:
* **the idea of growing new modules is promising.** Using LLMs to augment these APIs in a task-specific manner is an interesting idea **(vHz5)**; the idea of continuously growing this library is innovative, and the method simple and elegant **(JpcA)**.
* **the idea of reusing modules is appealing.** The prohibitive cost of using SOTA LLMs makes reuse of code appealing **(CDRY)**.
The empirical performance of GNSVR for transfer tasks, as well as the examples of the modules generated is quite impressive **(QVVg)**; the claim is compelling that the modules that GNSVR finds from the GQA and RefCOCO tasks can "transfer" for related tasks of image editing and knowledge tagging **(vHZ5)**; I am appreciative of the idea of generating modular modules to be used in a library of skills for future tasks **(JpcA)**. * **the proposed GNSVR framework has diverse applications.** The fact that they can show it's use on several domains and types of tasks is also appealing **(CDRY)**. GNSVR utilize new modules successfully for a variety of vision language reasoning tasks **(QVVg)**. It has many potential avenues for future exploration **(vHZ5)**. **G2. Our core novelty.** As recognized in **G1** of the general response, GNSVR's core innovation lies in develop a **novel** and **general** framework to efficiently **generate** and **reuse** modules for visual reasoning tasks, even with limited training data. This capability is grounded in two key processes: 1. **Module Generation**: - ***Assessment of Reusability***: Initially, for a given visual reasoning task, GNSVR evaluates the applicability of existing modules to the task at hand. - ***Initiation and Testing of New Modules***: If existing modules are deemed inadequate, GNSVR initiates the creation of a new module with LLMs. This process involves transforming training instances into "test cases" to evaluate module performance. - ***LLM-Driven Code Generation***: Utilizing LLMs, GNSVR then **generates** code snippets for the new module. These snippets are specifically tailored to pass the defined "test cases", ensuring functionality and task alignment. 2. **Module Reusage**: - ***Modularized Program Transformation***: When faced with a new query of a visual reasoning task, GNSVR translates this query into a structured, step-by-step modularized program. - ***Execution with Established Modules***: The program is executed by **reusing** previously established modules, showcasing the model's ability to apply existing knowledge to new scenarios. - ***Flexibility and Adaptability***: The approach facilitates 1) the handling of different instances within the same visual reasoning task (VQA, referring expression comprehension), 2) the application to new reasoning tasks (image editing and knowledge tagging), and 3) rapid adaptation to entirely new tasks with minimal training examples (RAVEN and MEWL). GNSVR is a novel and general framework that not only shows efficiency and effectiveness in tackling visual reasoning tasks but also embodies a significant leap towards more **adaptable** and **intelligent** AI systems. **The model's proficiency in generating and reusing modules offers a robust framework for continuous learning and adaptation, mirroring human cognitive processes in problem-solving and knowledge application.** **G3. Experiments in the revision.** To address the reviewers’ questions and support our responses, we conduct the following experiments to support our claims and show the effectiveness of our GNSVR model. - Further ablation study on different components including (1). *good initialization*, (2). *input and output format*, (3). *prompt without existing modules*, (4). *prompt without creating new modules*, (5). *sampling strategy*, (6). *training number*, (7). *without module learning*, (8). *without debug mechanism* **(QVVg,vHZ5,JpcA)** - (9). Experiments on the new I-RAVEN dataset **(QVVg)** - (10). 
Variants of VisProg and ViperGPT on RAVEN and MEWL **(JpcA)** - (11). Computational efficiency comparison of our modularized GNSVR model and ViperGPT **(CDRY)**.

**G4. Paper Revision.** Besides the experimental results, we have also revised the paper correspondingly (highlighted in blue):
- include the experimental results in the revised version of the paper;
- discuss the related works [A,B,C,D,E] and their differences from our GNSVR model;
- revise the text descriptions of technical details.

As a result, we have turned one slightly negative reviewer to the positive side. We have also put much effort into addressing reviewer **QVVg**'s concerns, including 1). conducting new detailed ablation studies; 2). implementing additional experiments with the I-RAVEN dataset; 3). discussing and comparing with related works. Despite our efforts to address these points promptly, we have not yet received feedback. We sincerely hope the AC and reviewers will reevaluate our contribution based on the broader significance of our work.

[A]. Iterative Disambiguation: Towards LLM-Supported Programming and System Design. Pereira and Hartmann.
[B]. Self-planning Code Generation with Large Language Models. Jiang *et al*.
[C]. Dynamic inference with neural interpreters. NeurIPS. 2021. Rahaman *et al*.
[D]. Dataset interfaces: Diagnosing model failures using controllable counterfactual generation. 2023. arXiv. Vendrow *et al*.
[E]. Adaptive testing of computer vision models. 2023. ICCV. Gao *et al*.

## To ACs

Dear ACs,

I hope this message finds you in good health. I am writing with regard to the revised version of our paper, submitted in response to the insightful suggestions from reviewer **#QVVg**. Despite our efforts to address these points promptly, we have not yet received feedback. In our revision, submitted five days ago, we aimed to thoroughly address the reviewer's comments through:
1. Conducting new detailed ablation studies on a) prompt design, b) sampling strategy, c) the number of samples, and d) the capabilities of the Large Language Model (LLM).
2. Implementing additional experiments with the I-RAVEN dataset, as specifically recommended.
3. Providing an in-depth comparison and discussion of our work in relation to other significant studies in our field.
4. Updating our manuscript to reflect these new experiments and discussions comprehensively.

We understand that the review process is complex and can be time-consuming, and we appreciate the dedication and effort of all parties involved. We kindly request your assistance in reaching out to reviewer **#QVVg** for any additional comments or feedback they may have during the **Reviewer/AC Discussion period**, and we would appreciate your guidance on how to proceed under these circumstances. Your prompt attention to this matter would be greatly appreciated, as it will help us navigate this time-sensitive phase of our research journey.

Thank you for your continued support and understanding.

Warm regards,
Authors

## response QVVg

Dear Reviewer #QVVg,

We would like to thank you for your helpful feedback, which has helped us improve the paper. We addressed the reviewers' concerns in the author responses posted on the 17th of Nov 2023. We would be delighted if you could please take a look at our detailed responses so that we can address any remaining concerns before the end of the discussion phase.
Sincerely,
Authors of Submission 6843

## Email to Program Chairs

Subject: Assistance Requested for Prompt Reviewer Feedback on ICLR 2024 Submission ID 6843

Dear Program Chairs,

I hope this message finds you in good health and spirits. We are the authors of the ICLR 2024 submission titled "Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules," with the submission ID 6843. We write to express our concern regarding the absence of feedback from reviewer #QVVg, following our response submitted over six days ago. We have tried contacting the ACs from the portal, but we have still received no feedback from either the ACs or the reviewer. As the discussion deadline is less than a day away, we kindly request your assistance in contacting the assigned ACs to remind reviewer #QVVg of our pending query. We highly value the reviewer's insights and are keen to incorporate their feedback into our work. We appreciate your support and understanding in this matter and look forward to your prompt assistance.

Warm regards,
Zhenfang Chen, Authors of ICLR submission (ID: 6843)

Dear Program Chairs,

I hope this message finds you well. We are the authors of the ICLR 2024 submission *"Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules"* (ID: **6843**). We are writing to follow up on our response to reviewer **#QVVg**, which was submitted over six days ago. As the discussion deadline is rapidly approaching in less than a day, we are becoming increasingly concerned about the lack of feedback from the reviewer. We have been asking the ACs for assistance in reminding the reviewer to respond, but we have received no response from either the ACs or reviewer #QVVg. Understanding the time-sensitive nature of this matter, we kindly request your assistance in asking the ACs to remind reviewer **#QVVg** about our pending response. We believe their insights will be invaluable in advancing our work, and we are eager to engage in further discussion based on their feedback. We greatly appreciate your attention to this matter and your continued support throughout this process. Thank you very much for your efforts.

Warm regards,
Authors

## Response to #QVVg

Request assistance in contacting the reviewer

Dear ACs,

I trust this message finds you in good health. I am writing to inquire about our recently submitted response to reviewer **#QVVg**, which was sent five days ago. In line with the reviewer's suggestions, we have diligently revised our paper, including:
1. Comprehensive ablation studies focusing on (1). prompt design, (2). sampling strategy, (3). the number of samples, and (4). the capabilities of the Large Language Model (LLM).
2. Implementation of new experiments using the I-RAVEN dataset, as recommended by the reviewer.
3. A thorough comparison and discussion of our work in relation to other relevant research.
4. Corresponding revisions to our paper to encompass these new experiments and discussions.

As the discussion deadline is less than a day away, we are concerned about not having received feedback from reviewer **#QVVg**. We respectfully request your assistance in contacting the reviewer to expedite their response, given the time-sensitive nature of our research. Thank you for your continued support and attention to this matter.

Warm regards,
Authors

---

Dear AC,

I hope this message finds you well.
We are writing to follow up on our response to reviewer #QVVg, which was submitted over five days ago. Following their comments, we have made the following revisions:
(1). More ablation studies on 1). prompt design, 2). sampling strategy, 3). the number of sampling examples, and 4). the LLM's capability.
(2). New experiments on the I-RAVEN dataset suggested by the reviewer.
(3). A comparison and discussion of our work with related research.
(4). Corresponding revisions to the paper to include the above experiments and discussion.

As the discussion deadline is rapidly approaching in less than one day, we are becoming increasingly concerned about the lack of feedback from the reviewer. Understanding the time-sensitive nature of this matter, we kindly request your assistance in reminding reviewer #QVVg about our pending response. We believe their insights will be invaluable in advancing our work, and we are eager to engage in further discussion based on their feedback. We greatly appreciate your attention to this matter and your continued support throughout this process. Thank you very much for your efforts.

Warm regards,
Authors

## Response to #QVVg

Dear Reviewer QVVg,

We sincerely thank you for your valuable feedback on our manuscript. Following your suggestions, we have enriched our paper with additional experiments and a discussion of related work, which are now included in the revised manuscript. As the rebuttal period concludes, we hope our efforts align with your expectations. If you find our response satisfactory, we would be grateful if you could consider revising your score. Thank you once again for your insightful guidance.

Warm regards,
Authors

## Response to #CDRY

Dear Reviewer CDRY,

Thank you again for acknowledging the novelty of our GNSVR model in **growing** and **reusing** modules for LLM-based programming for reasoning. We appreciate your recognition of the importance of modular programming in the era of LLMs. If we have convinced you of the claim that **"the paper is fundamentally using LLM's in a novel way"**, could you kindly increase your rating score for the paper according to your original comment that *"If I was given evidence against either of the two above claims - that all of the work (particularly the "visual reasoning" work) is being done by existing tools, or that the paper is not fundamentally using LLM's in a novel way - I would be happy to increase my score."*?

Regards,
Authors

## Response to AC (2023-11-21)

Dear AC,

I hope this message finds you well. We are writing to follow up on our response to reviewer **#QVVg**, which was submitted over four days ago. As the discussion deadline is rapidly approaching in less than two days, we are becoming increasingly concerned about the lack of feedback from the reviewer. Understanding the time-sensitive nature of this matter, we kindly request your assistance in reminding reviewer **#QVVg** about our pending response. We believe their insights will be invaluable in advancing our work, and we are eager to engage in further discussion based on their feedback. We greatly appreciate your attention to this matter and your continued support throughout this process. Thank you very much for your efforts.

Warm regards,
Authors

## Response to #QVVg

**Looking forward to your post-rebuttal feedback**

Thanks again for your constructive suggestions and comments. As the deadline for discussion is approaching, we are glad to provide any additional clarifications that you may need.
In our previous response, we carefully studied your comments and added many more experiments and analyses to address your suggestions. We hope that the new experiments and additional explanations have convinced you of the merits of our work. Please do not hesitate to contact us if there are other clarifications or experiments we can offer.

## Response to #JpcA (2023-11-20)

1. Variance of GQA is smaller when sampling by types;

> Thank you for the thoughtful response! Can you please elaborate on how the training examples are "randomly sampled by question categories"? I believe this relates to my concerns on overfitting, and how strategies are required (that may use additional information, like knowledge of question categories in a dataset) to ensure a broad coverage of modules are proposed.

Thanks again for your response on the overfitting concern of the GNSVR model. We further address your concerns as follows. In GQA, we sampled training examples by question category. This strategy does require the use of question categories in the training dataset. In RefCOCO, RAVEN, and MEWL, we sample examples randomly since there are no sub-categories by type. To further investigate our model's robustness against overfitting, we have conducted the new experiments in Table A below. We randomly sample different training examples from the whole training set 3 times with different random seeds, which we denote **Random Sampling**. We also perform another sampling strategy that randomly samples training examples by type, which we call **Random Sampling by types**. Based on the results in Table A, we can see that our model works with both **Random Sampling** and **Random Sampling by types**, achieving reasonable performance, although we do observe that the **Random Sampling** strategy has a larger variance on GQA. We will add this analysis to the revised paper.

Table A: Ablation on GQA for sampling strategies.

| Method GNSVR | Accuracy |
| -------- | -------- |
| Random Sampling | 44.8 ± 0.41 |
| Random Sampling by types | 45.9 ± 0.14 |

Another way to show our GNSVR model's robustness against **overfitting** is that, as shown in Table 2, Fig. 4, and Fig. 7 of the paper, **the modules learned from GQA and RefCOCO can be generalized to new tasks like image editing and knowledge tagging.** Note that the language instructions and images from image editing and knowledge tagging are quite different from those of GQA and RefCOCO.

Table B: Ablation on RefCOCO.

| Method GNSVR | Accuracy |
| -------- | -------- |
| Random Sampling | |

Table C: Ablation on RAVEN.

| # of samples | Center | L-R | U-D |
| ------------ | ------ | ---- | ---- |
| Random | | | |
| 10 Random | | | |
| 20 Random | | | |

Table D: Ablation on MEWL.

| # of samples | shape | color | material |
| ------------ | -------- | -------- | --- |
| 5 Random | | | |
| 10 | | | |
| 20 | | | |

## General response (update): thanks for all your comments and we look forward to post-rebuttal feedback!

Thanks again for all of your constructive suggestions, which have helped us improve the quality and clarity of the paper! It has been over three days, and we have not heard any post-rebuttal response yet. Please don’t hesitate to let us know if there are any additional clarifications or experiments that we can offer, as we would love to convince you of the merits of the paper. We appreciate your suggestions. Thanks!

## General Response: contributions, novelty, new experiments, and paper revision.

**G1.
Contribution Recognition.** We extend our sincere gratitude to the reviewers for their time and effort in reviewing our paper. We are pleased to note that the reviewers have generally acknowledged the GNSVR's following contributions: * **the idea of growing new modules is promising.** Using LLMs to augment these APIs in a task-specific manner is an interesting idea **(vHz5)**; the idea of continuously growing this library is innovative, and the method simple and elegant **(JpcA)**. * **the idea of reusing modules is appealing.** The prohibitive cost of using SOTA LLMs makes reuse of code appealing **(CDRY)**. The empirical performance of GNSVR for transfer tasks, as well as the examples of the modules generated is quite impressive **(QVVg)**; the claim is compelling that the modules that GNSVR finds from the GQA and RefCOCO tasks can "transfer" for related tasks of image editing and knowledge tagging **(vHZ5)**; I am appreciative of the idea of generating modular modules to be used in a library of skills for future tasks **(JpcA)**. * **the proposed GNSVR framework has diverse applications.** The fact that they can show it's use on several domains and types of tasks is also appealing **(CDRY)**. GNSVR utilize new modules successfully for a variety of vision language reasoning tasks **(QVVg)**. It has many potential avenues for future exploration **(vHZ5)**. **G2. Our core novelty.** As recognized in **G1** of the general response, GNSVR's core innovation lies in develop a **novel** and **general** framework to efficiently **generate** and **reuse** modules for visual reasoning tasks, even with limited training data. This capability is grounded in two key processes: 1. **Module Generation**: - ***Assessment of Reusability***: Initially, for a given visual reasoning task, GNSVR evaluates the applicability of existing modules to the task at hand. - ***Initiation and Testing of New Modules***: If existing modules are deemed inadequate, GNSVR initiates the creation of a new module with LLMs. This process involves transforming training instances into "test cases" to evaluate module performance. - ***LLM-Driven Code Generation***: Utilizing LLMs, GNSVR then **generates** code snippets for the new module. These snippets are specifically tailored to pass the defined "test cases", ensuring functionality and task alignment. 2. **Module Reusage**: - ***Modularized Program Transformation***: When faced with a new query of a visual reasoning task, GNSVR translates this query into a structured, step-by-step modularized program. - ***Execution with Established Modules***: The program is executed by **reusing** previously established modules, showcasing the model's ability to apply existing knowledge to new scenarios. - ***Flexibility and Adaptability***: The approach facilitates 1) the handling of different instances within the same visual reasoning task (VQA, referring expression comprehension), 2) the application to new reasoning tasks (image editing and knowledge tagging), and 3) rapid adaptation to entirely new tasks with minimal training examples (RAVEN and MEWL). GNSVR is a novel and general framework that not only shows efficiency and effectiveness in tackling visual reasoning tasks but also embodies a significant leap towards more **adaptable** and **intelligent** AI systems. **The model's proficiency in generating and reusing modules offers a robust framework for continuous learning and adaptation, mirroring human cognitive processes in problem-solving and knowledge application.** **G3. 
Experiments in the revision.** To address the reviewers’ questions and support our responses, we conduct the following experiments to support our claims and show the effectiveness of our GNSVR model. - Further ablation study on different components including (1). *good initialization*, (2). *input and output format*, (3). *prompt without existing modules*, (4). *prompt without creating new modules*, (5). *sampling strategy*, (6). *training number*, (7). *without module learning*, (8). *without debug mechanism* **(QVVg,vHZ5,JpcA)** - (9). Experiments on the new I-RAVEN dataset **(QVVg)** - (10). Variants of VisProg and ViperGPT on RAVEN and MEWL **(JpcA)** - (11). Computational efficiency comparison of our modularized GNSVR model and ViperGPT **(CDRY)**. **G4. Paper Revision.** Besides the experimental results, we have also revised the paper correspondingly: - include experimental results into the revised version of the paper; - discuss the related work[A,B,C,D,E] and their difference with our GNSVR model; - revise text description for technique details. [A]. Iterative Disambiguation: Towards LLM-Supported Programming and System Design". Pereira and Hartmann [B]. Self-planning Code Generation with Large Language Models. Jiang *et al*. [C]. Dynamic inference with neural interpreters. NeurIPS. 2021. Rahaman *et al*. [D]. Dataset interfaces: Diagnosing model failures using controllable counterfactual generation. 2023. Arxiv. *Vendrow et al*. [E]. Adaptive testing of computer vision models. 2023. ICCV. Gao *et al*. ## Response to Reviewer #CDRY We appreciate the reviewer for the detailed comments and insightful suggestions. >Q1. The prohibitive cost of using SOTA large language models makes reuse of code appealing (although they explicitly don't use ChatGPT4 "due to the prohibitive cost", so maybe this is less of an argument than it would be otherwise). **Q1. Computational Efficiency.** Thanks for mentioning the compute efficiency of our modularized design and module reusage. We have calculated the average token number of our GNSVR model and the ViperGPT that has no module reusage mechanism when calling the LLMs. The averaged generated token number is shown in Table 1. It can be seen that our GNSVR's solutions are shorter and more efficient. This result is updated in the revised paper. This approach is particularly advantageous when calling expensive APIs from the OpenAI GPT family. Table 1: Average token number of generated solutions. | Methods | GQA | RefCOCO | | -------- | -------- | -------- | | ViperGPT-Instruct | 153.7 | 109.1 | | Ours-Instruct | 62.3 | 54.4 | >Q2. My main concern is that, based on the presentation, it seems that the authors took a lot of highly intricate API's for LLM's that large teams may have worked on and cobbled them together to solve a new task. I refer to this section: "The success of our GNSVR relies on a set of pre-defined modules and APIs as the starting point. We utilize handcrafted modules from VisProg (Gupta & Kembhavi, 2022) as our initial components. Additionally, we incorporate several new APIs from ViperGPT to enhance module creation. We also include some new APIs from ViperGPT (Sur´ıs et al., 2023) for making new modules." I appreciate their novelty in how they use these API's, but the ratio of insights of these authors vs of the authors of the API's appears insignificant. **Q2. About Reliance of Existing Modules.** We agree that our GNSVR relies on some pre-defined modules as the initial start point to learn to generate new modules. 
However, we want to highlight that the novelty of GNSVR is **not** *"taking a lot of highly intricate API's for LLMs to work on and cobble them together to solve a new task"*. Instead, our novelty lies in **growing** and **reusing** modules that are learned from the training set. The modules learned by GNSVR can be applied to different domains, including 1). other instances of the same visual reasoning task; 2). instances of a new reasoning task; and 3). adaptation to new reasoning tasks by observing only a few training examples. Please refer to **G2** of the **general response** for further explanation.

>Q3. I'm also not entirely convinced of the novelty of this paper. I refer to "Iterative Disambiguation: Towards LLM-Supported Programming and System Design" (Pereira and Hartmann) and "Self-planning Code Generation with Large Language Models" (Jiang et al). I don't think the fact that this is in the visual domain is enough to call it "novel", because there is virtually no engagement with visual modalities by the authors - as stated above, according to my understanding, they are using predefined modules which handle the interface between vision and language.

**Q3. About the paper's novelty compared with existing works.** Thanks for pointing us to the related works [1,2]. We have added a discussion of them to the related work section of the revised paper. In [1], Pereira and Hartmann used LLMs to progressively enhance and specify system subcomponents, empowering users to develop versatile programs through a systematic iterative disambiguation method. In [2], Jiang *et al*. generate code with LLMs through self-planning, which involves a planning phase for outlining solution steps and an implementation phase for generating code. Besides the **dense engagement with the visual modalities** as input, our GNSVR differs from them in the **modularization** of code snippets for better **module expansion** and **module reusage**. These differences give our GNSVR model new capabilities, such as **growing** new modules to handle other instances of visual reasoning tasks (VQA, image grounding, RAVEN, and MEWL) and **reusing** these new modules in new tasks like image editing and knowledge tagging. Such modularization also offers better computational efficiency (see **Q1**) and model transparency with high-level module abstraction (see the *Generated Program* in Fig. 3, 5, and 6 for an example). Please refer to **G2** of the **general response** for further explanation.

[1]. Iterative Disambiguation: Towards LLM-Supported Programming and System Design (Pereira and Hartmann).
[2]. Self-planning Code Generation with Large Language Models (Jiang et al).

>Q4. If I was given evidence against either of the two above claims - that all of the work (particularly the "visual reasoning" work) is being done by existing tools, or that the paper is not fundamentally using LLM's in a novel way - I would be happy to increase my score.

**Q4-1. Evidence that all the work is NOT being done by existing tools.** We argue that our GNSVR model does **not** simply adopt all the existing tools for a visual reasoning task. Instead, it learns to make new tools (**grow modules**) and to reuse these new tools together with existing tools to handle a new task (**reuse modules**). Please refer to **Q2** and **Q3** for a detailed explanation.

**Q4-2. Evidence that the paper is fundamentally using LLMs in a novel way.** Our use of LLMs to **grow** and **reuse** new modules is fundamentally novel. While existing works use LLMs to **solve each task instance independently without reusage**, our modularization design offers several benefits, including 1). better performance by examining the new modules against the given training examples; 2). efficient module reusage and transfer to other tasks; 3). better computational efficiency. We also show how we introduce a general and novel framework to use LLMs in **G2** of the **General Response**.
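To make the grow-and-reuse loop concrete, we summarize it in the Python-style pseudocode sketch below; all helper names (e.g., `llm_write_module`, `llm_generate_program`) are illustrative placeholders rather than our exact implementation.

```python
# Illustrative sketch only: all helper functions are hypothetical placeholders.
class GNSVRSketch:
    def __init__(self, predefined_modules):
        # the library starts from pre-defined VisProg/ViperGPT-style modules
        self.module_library = dict(predefined_modules)

    def grow(self, training_examples):
        """Training stage: propose a new module when existing ones are insufficient,
        and keep it only if it passes the training-derived test cases."""
        for signature, test_cases in propose_signatures(training_examples, self.module_library):
            code = llm_write_module(signature, self.module_library)
            if passes_enough_test_cases(code, test_cases):
                self.module_library[signature.name] = compile_module(code)

    def reuse(self, query, image):
        """Test stage: translate the query into a modularized program and execute it
        with the accumulated modules; no new code generation is needed."""
        program = llm_generate_program(query, available=self.module_library.keys())
        return execute_program(program, image, self.module_library)
```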
>Q5. Please clarify exactly where you created novel algorithms or ideas. If any of those ideas require more than iterative prompting, please state so explicitly.

**Q5. Clarification of Novelty.** As explained in detail in **G2** of the **general response**, GNSVR's core innovation lies in developing a **novel** and **general** framework to efficiently **generate** and **reuse** modules for visual reasoning tasks, even with limited training data. As shown in **G1** of the **general response**, the novelty of **growing** and **reusing** new modules is also well recognized by the other reviewers **(QVVg, vHZ5, JpcA)**.

## Response to Reviewer #QVVg

Thank you for the constructive comments.

>Q1. While I think the framework overall is useful, the components are not all equally important to solve the reasoning problem and hence it is important to understand for future research on modular reasoning to understand what works and what doesn't, and if so why not. How big of a role does "good initialization" of the neural module operators plays? How important is defining the correct input and output format for a new module? How important is the selection of the few shot samples to evaluate a new module? (the authors say "We extracted 300 examples from GQA, 100 from RefCOCO, 10 from Raven, and 10 from MEWL based on experimental experience." - is the experience here just cherry picking for results or something else?) How important is the LLM capability to learn new modules? What role does the prompt play for the LLM in evaluating existing modules and creating new ones? Without detailed analysis to support answers to all these questions, the paper is limited in terms of explaining the method beyond just presenting a new method and showing empirical results.

**Q1. Ablation on each component.** Thanks for raising these insightful questions, which help in understanding the importance of the different components in GNSVR. We randomly select 800 samples from the GQA test-dev split to further investigate the effectiveness of the different components of GNSVR. To better present all experimental results, the ablation studies are organized into three sections below.

**Q1-1: Ablation on Prompt Design.**

Table 1: Ablation study of Prompt Design on GQA.

| Method GNSVR | GQA |
| -------- | -------- |
| Baseline | 45.9 |
| w/o input and output format | 43.2 |
| w/o good initialization | 41.8 |
| w/o existing modules in prompt for module making | 45.0 |
| w/o creating new modules | 44.7 |

We conducted a series of experiments to observe the impact of prompt design on the overall performance of GNSVR. Firstly, we removed the descriptions of input and output formats from the prompt. After removing these descriptions, the performance of GNSVR dropped by 2.7%. This is because, without clear guidance on input and output formats, the modules might output in the wrong format, leading to errors in the subsequent parsing of the results. Furthermore, on top of removing the input and output format, we also removed some of the in-context examples and descriptions of module signatures from the prompt.
The performance further declined. Our method consists of three stages: module initialization, module generation, and module execution, where module initialization is the first step of our method. Without adequate module initialization as a foundation, the subsequent stages are largely impacted. Therefore, we can see that without good initialization, our performance drops by 4.1%. Regarding the use of existing modules and the creation of new ones, from the table above we can observe that not using the predefined modules from VisProg results in a 0.9% decrease in our performance. This demonstrates the robust module generation capability of GNSVR: even without a series of predefined modules, our method can still build modules from scratch and solve problems, and the performance does not drop significantly. If we do not create new modules, then we are merely using the predefined modules. We can see that the result is 44.7%, which is 1.2% lower than our result of 45.9%. This performance gap highlights the effectiveness of the newly generated modules. By generating and using new modules, we can achieve better results.

**Q1-2: Ablation on Sampling.** In this section, we first introduce our sampling strategy. Then, we conduct an experiment to show how the sampling method impacts GNSVR's performance. Subsequently, we investigate how the number of training samples affects our results on different tasks.

**Sampling Strategy**

Table 2: Ablation study of Sampling Strategy on GQA.

| Method GNSVR | GQA |
| -------- | -------- |
| Baseline | 45.9 |
| Random Sampling | 44.3 |

Our sampling strategy: the GQA dataset contains five structural types: choose, logical, compare, verify, and query. These structural types inspired the idea of generating our new modules. Taking COMPARE_COLOR as an example, this new module is generated to address questions related to color within the compare structural type. From the visualization of GQA, it is apparent that the query type can be addressed using the existing VQA module from VisProg, and problems of the logical type can be decomposed into sub-problems of the choose, compare, and verify types. Therefore, when selecting training samples, we randomly chose 100 samples each from the choose, compare, and verify types. Altogether, these three types comprise 300 samples, all sourced from the GQA train split. Hence, we are not cherry-picking our training samples; rather, we are selecting training samples based on the structural types of GQA. To explore the impact of sampling strategies on our experiment, we conducted an additional experiment with a random sampling of 300 samples, beyond our initial sampling strategy. In this setting, we randomly sampled 300 examples from the GQA train split. The performance was observed to be 44.3%, a decrease of 1.6% compared to 45.9%. This result suggests that a strategic sampling method can more effectively guide the LLM in generating more efficient modules for a given task, while our method remains relatively robust to the choice of sampling strategy.
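For concreteness, the sampling procedure described above can be summarized by the short illustrative sketch below; the field name `structural_type` is a placeholder for how the GQA annotations are accessed, not our exact data-loading code.

```python
import random

def sample_gqa_training_examples(gqa_train, per_type=100, seed=0):
    """Draw `per_type` questions uniformly at random from each structural type
    (choose, compare, verify), i.e., stratified sampling rather than cherry-picking."""
    rng = random.Random(seed)
    samples = []
    for qtype in ("choose", "compare", "verify"):
        pool = [ex for ex in gqa_train if ex["structural_type"] == qtype]
        samples.extend(rng.sample(pool, per_type))
    return samples  # 3 x 100 = 300 training examples in total
```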
**Number of Sampling Examples**

Table 3: Number of Sampling Examples on GQA.

| # of samples | GQA |
| ------- | -------- |
| 60 | 44.5 |
| 120 | 45.3 |
| 300 | 45.9 |

Table 4: Number of Sampling Examples on RefCOCO.

| # of samples | RefCOCO |
| ------- | -------- |
| 10 | 49.4 |
| 50 | 67.0 |
| 100 | 67.1 |

Table 5: Number of Sampling Examples on RAVEN.

| # of samples | Center | L-R | U-D |
| ------------ | ------ | ---- | ---- |
| 5 | 46.5 | 37.2 | 39.8 |
| 10 | 80.1 | 67.6 | 69.1 |
| 20 | 80.1 | 67.6 | 69.1 |

Table 6: Number of Sampling Examples on MEWL.

| # of samples | shape | color | material |
| ------------ | -------- | -------- | --- |
| 5 | 38.9 | 39.6 | 37.9 |
| 10 | 43.7 | 45.3 | 41.0 |
| 20 | 43.7 | 45.3 | 41.0 |

We conduct a series of experiments to illustrate how the number of training samples influences the performance. In the GQA and RefCOCO datasets, if a small number of training samples is used, it is possible for the generated modules to overfit to certain samples, thereby reducing the generalization capability of the newly generated modules. Such overfitting in new modules can negatively impact the final results. Therefore, we can observe that when the number of samples is small, the performance of GNSVR is poorer. As the number of samples increases, the effectiveness of GNSVR improves. However, with a further increase in the number of samples, the performance gains of GNSVR tend to saturate. Regarding RAVEN and MEWL, since their patterns of change are limited, the number of few-shot samples selected is sufficient if it already covers all the variation patterns in RAVEN and MEWL. In other words, if the number of samples exceeds this threshold, there won't be any further improvement in the results; if it is below this threshold, the performance will decline. We selected 10 few-shot samples each for RAVEN and MEWL. As can be seen from the results in the tables above, if the number of samples is equal to 5, there is a noticeable decrease in performance. This is because 5 few-shot samples are not enough to cover all the variation patterns of RAVEN or MEWL. If the number of samples is equal to 10 or 20, the few-shot samples are sufficient to encompass all possible variations, and the same results are obtained.

**Q1-3: Ablation on LLM's capability.**

Table 7: Ablation on LLM's capability.

| GPT | GQA |
| -------- | -------- |
| gpt-3.5-turbo-instruct | 45.9 |
| gpt-3.5-turbo | 44.3 |

By using a better LLM, our prompts can be better understood, and the LLM will generate higher-quality modules. In this experiment, we compared the results of using gpt-3.5-turbo-instruct and gpt-3.5-turbo. Our experimental results show that better outcomes are achieved when using the more effective gpt-3.5-turbo-instruct. It is evident that the capabilities of the LLM influence the performance of GNSVR. As the abilities of LLMs continue to improve, so will the performance of GNSVR. Thanks to the flexibility of GNSVR, once a better LLM is available, we can easily switch to the latest LLM to achieve better results.

>Q2. As shown independently in [1] and [2] the dataset contains flaws in choice design which enables models to learn shortcuts to solve the RPM reasoning task. I would recommend the authors use the i-RAVEN dataset introduced in [1] instead.

**Q2. Experiments on the new I-RAVEN dataset.** Thanks for the reminder about the potential shortcut issue of the RAVEN dataset proposed by Zhang *et al*. We have conducted further experiments on the I-RAVEN dataset proposed in [1]. Following our setting on RAVEN (Zhang *et al*.), we use 10 randomly selected samples for learning new modules to handle the task and test the model on the testing set. As shown in Table 8, our GNSVR is still able to handle the abstract reasoning task with high accuracy and data efficiency.

Table 8: Performance on the new I-RAVEN dataset.

| Method | Center | L-R | U-D |
| ------- | -------- | -------- | --- |
| LEN | 56.4 | 44.2 | 44.2 |
| CoPINet | 54.4 | 51.9 | 52.5 |
| SRAN | 78.2 | 70.1 | 70.3 |
| GNSVR (Ours) | 85.2 | 74.6 | 75.4 |
>Q3. Could the authors comment on why they chose to go with a neuro-symbolic approach versus a purely neural approach for defining the neural module scripts? Is it only for controllability, and if so can they comment on how a neural script (e.g. an LLM that abstracts the problem of generating the script and executing it) would compare? There have been approaches towards learning neural scripts and executing them dynamically at inference time for reasoning tasks in prior work e.g. [1]

**Q3. Neuro-symbolic approach vs. a purely neural approach.** We appreciate the suggestion to discuss both neuro-symbolic and purely neural approaches for neural modules. We posit that both purely neural approaches, such as [1], and neuro-symbolic models like our GNSVR represent valuable explorations in enabling AI systems to abstract and solve reasoning problems through script generation and execution. In contrast to neuro-symbolic methods, purely neural approaches like [1] are end-to-end learnable solely from data, without dependence on pre-defined models or APIs. Conversely, our neuro-symbolic method, GNSVR, 1) offers enhanced model transparency through explicit, modularized Python code snippets; 2) facilitates the use of pre-defined models (e.g., perception models for classification) through explicit function calls; and 3) demonstrates superior data efficiency in adapting to new tasks, as evidenced by using only 10 RAVEN examples to learn modules for the reasoning task. We have included this discussion in the related work section of the revised paper.

[1] Rahaman, N., Gondal, M.W., Joshi, S., Gehler, P., Bengio, Y., Locatello, F. and Schölkopf, B., 2021. Dynamic inference with neural interpreters. Advances in Neural Information Processing Systems, 34, pp.10985-10998.

>Q4. I think the module initialization and evaluation during generation in GNSVR is closely related to the research on automatic group discovery and using it for model design paradigm introduced in computer vision previously e.g. [2, 3]. It would make the related works section more comprehensive to include discussion on how GNSVR relates to this growing subfield of automatic evaluation and model design.

**Q4. More Discussion on Related Work.** Thanks for mentioning the related works [2,3]. While [2] and [3] focus on improving the performance of purely neural models and belong to a different research area that automatically discovers groups and uses them for model design, we share a similar interest in using LLMs and few-shot examples to improve AI models' performance. We have added this discussion to the related work section of the revised paper.

## Response to Reviewer #vHZ5

Thank you for the constructive comments and insightful suggestions.

>Q1. From my perspective, the biggest current weakness of the paper is that from the experimental design its hard to parse out exactly how the discovered modules affect the system's performance. Ostensibly, this can be gleaned from comparisons between GNSVR and VisProg/ViperGPT, but there are more differences between these systems beyond merging in discovered modules. Specifically, GNSVR uses a "base API" that is a combination of VisProg and ViperGPT, so the "fair" comparison would be against an ablated version of GNSVR that removes steps 1 and 2, and just tries to solve test-problems with the original API functions.
> This condition is considered in the ablation experiment (GNSVR w/o ML), but only a subset of the RefCOCO test-set. To solidify the claim that the improvement GNSVR observes stems from its discovered modules, this base condition should be added to all of the experimental set-ups (tables 1-5), for example, from Table 2 its unclear how much of the delta improvement between VisProg and GNSVR can be attributed to improvements in the base API versus improvements to the API from the new modules.

**Q1. More ablation studies on performance.** Thanks for the insightful suggestions. We add more baseline experiments for comparison. For the GQA and RefCOCO baseline comparison, please refer to **Q2** below. In Table 2 of the main paper, we showcase GNSVR's transfer-learning capability. That is to say, the new modules learned from GQA and RefCOCO can be applied to the image editing and knowledge tagging tasks. Therefore, for the two tasks in Table 2, we did not learn any new modules but instead transferred and used the new modules learned from other tasks. In Table 2, we can see that we surpassed VisProg on all metrics. This performance improvement stems from the new modules generated from GQA and RefCOCO. It is these new modules that enabled functions that the inherent modules of VisProg could not achieve, thus leading to the improved performance of our system.

As for RAVEN and MEWL, we have implemented the ViperGPT and VisProg baseline experiments in the following way. VisProg requires a manual implementation of all modules using the provided APIs. Thus, to enable VisProg to handle the RAVEN and MEWL tasks, we manually implement and debug new hand-crafted modules for VisProg to recognize and discover the patterns needed to handle the task. We call this baseline the **VisProg variant**. We also put the training examples from GNSVR's stage 1 into the prompt of the **VisProg variant** for better performance. ViperGPT has no manual modules and asks the LLM to make use of the APIs to handle each instance. Thus, we manually write solutions for the training examples into the prompt of ViperGPT to teach ViperGPT to handle the task. We call this approach the **ViperGPT variant**. We have added this analysis to the revised paper. VisProg by itself needs a handcrafted solver module to find the target solution, and it would be extremely difficult for ViperGPT to generate a solver from scratch. Thus, we add the solver module learned by our GNSVR model to the pre-defined API pool of VisProg and ViperGPT. As shown in Tables 1 and 2, our GNSVR model achieves better performance than these two baselines, showing the great value of module learning for handling new tasks from only a few examples.

Table 1: Comparison of our GNSVR model with the VisProg and ViperGPT baselines on RAVEN.

| Methods | Center | L-R | U-D |
| ------- | ------ | ---- | ---- |
| VisProg variant | 36.8 | 26.1 | 27.8 |
| ViperGPT variant | 40.6 | 30.7 | 32.4 |
| Ours | 80.1 | 67.6 | 69.1 |

Table 2: Comparison of our GNSVR model with the VisProg and ViperGPT baselines on MEWL.

| Methods | shape | color | material |
| ------- | -------- | -------- | --- |
| VisProg variant | 35.2 | 35.9 | 34.9 |
| ViperGPT variant | 37.8 | 38.2 | 36.7 |
| Ours | 43.7 | 45.3 | 41.0 |
>Q2. Beyond this, I'm also slightly concerned about the design of the GNSVR w/o ML baseline. At inference time, is this baseline allowed to invoke arbitrary python logic in the style of ViperGPT (e.g. standard control flow constructs) or is it restricted to only using API function calls in the style of VisProg. I would imagine that the first condition would be more fair to evaluate GNSVR. Solving some tasks might require simple logic that the LLM knows how to express in python, but might not be directly expressible with a series of API calls. In GNSVR, this logic is incorporated into modules, but in the baseline the LLM should still have the opportunity to invoke similar logic in its test-time solutions (otherwise its impossible to properly evaluate the usefulness of the discovered modules). Please clarify which of these modes the baseline is operating in.

**Q2. Detailed implementation of the GNSVR w/o ML baseline.** Thanks for raising the concern about the baseline's implementation. Our current implementation of the GNSVR w/o ML baseline is restricted to only using API function calls in the style of VisProg. To make it fair, we have wrapped the APIs in ViperGPT into **the same style** as VisProg and our GNSVR. We highly value the reviewer's opinion and have developed a new baseline variant, **GNSVR w/o ML v2**, where the LLM is allowed to invoke arbitrary Python logic. As shown in Table 3, our GNSVR performs better than both baselines across tasks, showing the effectiveness of module learning.

Table 3: Comparison of GNSVR and baselines on RefCOCO and GQA.

| Model | GQA | RefCOCO |
| -------- | -------- | -------- |
| **GNSVR w/o ML** | 43.3 | 62.3 |
| **GNSVR w/o ML v2** | 40.9 | 65.5 |
| **GNSVR** | 45.9 | 67.1 |

>Q3. Compared to VisProg and ViperGPT this system seems to require more training data, as the I/O pairs are not only used to populate in-context examples, but also impact (i) what module concepts are proposed and (ii) how the correctness of each module concept is evaluated. This point about reliance on training data is touched on in the ablation section, but it would be good to make this distinction explicit when comparing the pros/cons of the proposed system against past work.

**Q3. Discussion of the pros/cons of the proposed system against past work.** Thanks for the suggestion to make the reliance on training examples distinct and explicit. Compared with existing methods, our GNSVR framework can leverage LLMs to create new neural models or general code snippets for specific functions. Moreover, these newly generated modules in GNSVR can cooperate and be reused across various tasks, enhancing overall performance and adaptability. However, compared with the existing frameworks VisProg and ViperGPT, we also need a few training examples to serve as test cases to learn new modules. We have added the pros/cons of the proposed system to the related work section of the revised version, highlighted in blue.

> Q4. (a) Can new modules be hierarchical (e.g. can one proposed module referenced a previously proposed module), or can they only call out to the original API functions?

**Q4. About hierarchical module generation.** We do **not** explicitly restrict the LLM to only use the original API functions. However, when we examined the generated modules one by one, we found that all of them use the original API functions. The reason is that the examples we provide in the prompt build new modules from the original APIs. That said, we believe hierarchical module generation is an interesting direction for future work.

> Q5. (b) The error-correction and pass-rate logic for the module generation step are not clear. What is the "pass-rate" at which a function is accepted as a new module, from the listed examples it seems like a single input/output pair is used, so is the pass rate 1?

**Q5. More Details on the Pass Rate Definition.** We use multiple training examples for module initialization; thus, each initialized new module is paired with multiple test cases (*i.e.*, multiple instances of the visual reasoning task), and we only accept modules whose pass rate is larger than a threshold (0.5 in our implementation). We have added this explanation of the pass rate to the revised paper.
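To make the acceptance criterion concrete, a minimal illustrative sketch of the pass-rate check is given below; `execute_program` and the test-case fields are placeholders, not our exact code.

```python
def accept_module(candidate_module, test_cases, threshold=0.5):
    """Each test case is a training instance whose query has been parsed into a
    high-level program that calls the candidate module; the module is accepted
    only if its pass rate over these cases exceeds the threshold."""
    passed = 0
    for case in test_cases:
        prediction = execute_program(case["program"], case["image"],
                                     extra_modules={candidate_module.name: candidate_module})
        passed += int(prediction == case["answer"])
    return passed / len(test_cases) > threshold  # 0.5 in our implementation
```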
> Q6. (c) What exactly is the error-correcting mechanism when one of the authored functions produces an error -- is it sent back through the LLM with a modified prompt? How many times? What constitutes an error? This seems like a potentially important part of the contribution, so I would encourage the authors to even consider adding an ablation condition demonstrating this is helpful for the performance of the system.

**Q6. Ablation on the Debug Mechanism.**

Table 4: Ablation on the debug mechanism.

| Method GNSVR | GQA |
| -------- | -------- |
| Baseline | 45.9 |
| w/o debug | 44.9 |

The error-correction prompt contains the error message from the Python interpreter and the incorrect code snippet. We prompt the LLM to correct the code based on this error message. We heuristically set the maximal number of debug iterations to 5: if the code can be corrected within 5 iterations, we keep it; otherwise, it is abandoned (details can be found in the Module Generation section of Fig. 2). The errors mainly stem from two sources: one is basic syntax errors in the Python code, such as indentation and variable-name errors; the other is fundamental logical errors, such as mistakes in variable types (e.g., treating a variable that should be of the bool type as a string). From the table above, we can conclude that the debug process helps GNSVR generate more useful modules, elevating performance and preventing elementary programming mistakes.
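For clarity, the debug loop can be summarized by the simplified sketch below; `llm_fix` and `run_tests` are illustrative placeholders for the LLM call and the test-case runner, not our exact implementation.

```python
MAX_DEBUG_ITERS = 5  # heuristic maximum number of debug iterations

def debug_module(code, test_cases, llm_fix, run_tests):
    """llm_fix(code, error_msg) stands for the LLM call that rewrites the faulty
    code given the interpreter's error message; run_tests(code, test_cases) raises
    an exception (syntax error, wrong variable type, ...) if execution fails."""
    for _ in range(MAX_DEBUG_ITERS):
        try:
            run_tests(code, test_cases)
            return code                      # corrected within 5 iterations: keep it
        except Exception as err:
            code = llm_fix(code, str(err))   # feed the error message back to the LLM
    return None                              # still failing after 5 iterations: abandon
```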
> Q7. It would be informative to provide additional details on the generated modules for each task: how many are created for each task? How often does each one get used in solving test-cases? Do they each capture distinct concepts, or are some modules duplicates?

**Q7. How many modules are generated for each task.**
- **GQA** (13 modules): VERIFY_COLOR, VERIFY_ACTION, VERIFY_ATTRIBUTE, VERIFY_MATERIAL, VERIFY_OBJECT, COMPARE_ATTRIBUTE, COMPARE_COLOR, COMPARE_MATERIAL, CHOOSE_ATTRIBUTE, CHOOSE_COLOR, CHOOSE_MATERIAL, CHOOSE_DEPTH, CHOOSE_ACTION.
- **RefCOCO** (5 modules): FILTER_COLOR, FILTER_SHAPE, VERIFY_RELATION, FILTER_RELATION, SORT_SPATIAL.
- **RAVEN** (4 modules): DETECT_COLOR, DETECT_SIZE, DETECT_SHAPE, SOLVER.
- **MEWL** (4 modules): DETECT_COLOR, DETECT_MATERIAL, DETECT_SHAPE, SOLVER.

We further take GQA as an example to exhibit the percentage of the top-5 most-used new modules in Table 5.

Table 5: Percentage of the top-5 most-used new modules in GQA.

| Task | VERIFY_ATTRIBUTE | CHOOSE_ATTRIBUTE | VERIFY_COLOR | COMPARE_ATTRIBUTE | VERIFY_MATERIAL |
| ------- | ------- | -------- | -------- | --- | --- |
| GQA | 14.1 | 10.8 | 6.9 | 5.9 | 3.6 |

The data in the table above show the proportion of the five most common new modules appearing in the generated high-level programs. Overall, 38.7% of all generated high-level programs use the newly generated modules (this 38.7% calculation includes other less common modules and excludes duplicate counting, such as a single high-level program containing multiple new modules). From these results, it can be seen that these newly learned modules can be widely applied to GQA, thereby helping GNSVR achieve good results on GQA.

> Q8. (a) From the example prompts, it looks like a single function is given as an in-context example for step 1 and step 2, are these always COMPARE_SIZE and LOC, as shown in figures 15 and 16?

**Q8. Details on in-context examples.** In fact, more examples are provided in the prompts of step 1 and step 2. We only show the first, representative ones in Figures 15 and 16.

> Q9. (b) For the inference in-context examples are these chosen randomly from the training tasks that found a successful program?

**Q9. Inference in-context examples.** Yes. They are randomly chosen from training samples that generated successful programs. These examples utilize various newly generated modules, and each new module is assigned at least three successful training examples.

>Q10. (a) For the Raven task, my understanding is that the input is just a series of images. If this understanding is correct, how do you turn these images into a question? I am also quite curious to see the internals of the SOLVER generated module, is this module shared between Raven and MEWL, or does it employ distinct logic?

**Q10. Details on RAVEN and MEWL.** Note that a visual reasoning task does **not** necessarily use language as input. All we need to do is prompt the LLM to generate modules that recognize the patterns and solve the problem. In RAVEN, by prompting the LLM, we can obtain DETECT_COLOR, DETECT_SHAPE, and DETECT_SIZE. The image is fed into these modules, and the output is the color, shape, and size of the image. In this way, the input image is converted into a (color, shape, size) triplet. We provide the LLM with ten examples from the RAVEN train split to demonstrate how to deduce the pattern of these triplets. By observing the few-shot demonstrations, the LLM generates the SOLVER() module, which detects the pattern of the input triplets from the Problem Matrix and chooses the most appropriate answer from the Answer Set. Therefore, the internal logic of the SOLVER() module is primarily judgment-based: it identifies the patterns of the input triplets in the Problem Matrix and thereby finds the answer in the Answer Set. The workflow for RAVEN is shown in Fig. 5. As for MEWL, we employ a similar approach to handle it; one example is provided in Fig. 6. Since MEWL and RAVEN have different patterns, the SOLVER() module is not shared between RAVEN and MEWL; it utilizes distinct logic.
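To give an intuition for the internals of SOLVER(), we include below a heavily simplified, illustrative version that handles only a single "constant" rule over the detected triplets; the actual module is generated by the LLM from the few-shot demonstrations and covers more patterns.

```python
def toy_solver(problem_matrix, answer_set):
    """problem_matrix: 3x3 grid of (color, shape, size) triplets produced by
    DETECT_COLOR / DETECT_SHAPE / DETECT_SIZE, with the bottom-right cell missing.
    answer_set: list of candidate triplets. Pick the candidate that best matches
    the attributes predicted for the missing cell."""
    last_row = problem_matrix[2]
    predicted = []
    for attr in range(3):                        # 0: color, 1: shape, 2: size
        a, b = last_row[0][attr], last_row[1][attr]
        predicted.append(a if a == b else None)  # only the "constant" rule is shown here

    def score(candidate):
        return sum(candidate[i] == p for i, p in enumerate(predicted) if p is not None)

    return max(range(len(answer_set)), key=lambda i: score(answer_set[i]))
```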
## Response to Reviewer #JpcA

Thank you for the positive comments and insightful suggestions.

> Q1. Is there anything to prevent overfitting to the small train set of a given task, and only creating specific modules tailored for that domain? I can imagine that these overfitted modules may not be very useful in new tasks.

> Q2. Isn’t it possible that the method creates a broad function signature, but during stage 2 verification with the small set of train examples, it overfits to a bad implementation that only works for those examples, and therefore actually harms performance from there on out? I’m mainly concerned about the above two overfitting challenges.

**Q1-Q2. Overfitting vs. the number of training examples.** Thank you for raising the concern about overfitting to the training examples. Our GNSVR pipeline does show an overfitting issue when the number of training examples is very small (*e.g.*, 10 for RefCOCO), as reflected in the ablation study in Table 5 of the revised paper. However, as the number of training examples grows (50 and 100), the performance becomes stable, so it is unlikely that GNSVR creates broad function signatures that overfit the training examples. We believe one reason for GNSVR's strong few-shot performance is that all our pre-defined APIs are general APIs that work well for vision tasks in the wild. We also report performance as a function of the number of training examples for the different tasks in the tables below: once the training set exceeds a small size, our GNSVR framework becomes effective and the overfitting issue is relieved.

**Number of Sampling Examples**

Table 1: Number of sampling examples on GQA.

| # of samples | GQA |
| ------- | -------- |
| 60 | 44.5 |
| 120 | 45.3 |
| 300 | 45.9 |

Table 2: Number of sampling examples on RefCOCO.

| # of samples | RefCOCO |
| ------- | -------- |
| 10 | 49.4 |
| 50 | 67.0 |
| 100 | 67.1 |

Table 3: Number of sampling examples on RAVEN.

| # of samples | Center | L-R | U-D |
| ------------ | ------ | ---- | ---- |
| 5 | 46.5 | 37.2 | 39.8 |
| 10 | 80.1 | 67.6 | 69.1 |
| 20 | 80.1 | 67.6 | 69.1 |

Table 4: Number of sampling examples on MEWL.

| # of samples | shape | color | material |
| ------------ | -------- | -------- | --- |
| 5 | 38.9 | 39.6 | 37.9 |
| 10 | 43.7 | 45.3 | 41.0 |
| 20 | 43.7 | 45.3 | 41.0 |

> Q3. There seems to be an assumption that the queries in the train set actually all require similar modules, for example, to verify the new modules, the selected set of test cases from the train set must actually use those modules. I’m not sure if this is a reasonable assumption, and also, how is this sampling of train examples (both in stage 1 and stage 2) done?

**Q3. Sampling of training examples in stage 1 and stage 2.** The training samples in stage 1 and stage 2 are the same samples, randomly sampled by question category. Specifically, we first sample N examples and run stage 1 to obtain the new module signatures and a set of corresponding test cases. A module signature specifies the standard input and output formats of the new module. A test case is generated by parsing the corresponding sample query into a high-level program, and a newly generated module passes a test case if executing that high-level program yields the correct answer. We group the training examples that share the same new module signature and use each group to build the set of test cases that checks the correctness of that new module, as sketched below.
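
The sketch below illustrates the grouping step described in Q3, under the assumption that each stage-1 example carries `signature`, `program`, and `answer` fields; these names are hypothetical and only meant to show how test cases could be grouped per new module, not the exact GNSVR data structures.

```python
from collections import defaultdict
from typing import Dict, List

def build_test_cases(stage1_examples: List[dict]) -> Dict[str, List[dict]]:
    """Group stage-1 training examples by the signature of the new module they
    require; each group becomes the test-case set for that module."""
    test_cases: Dict[str, List[dict]] = defaultdict(list)
    for ex in stage1_examples:
        # `program` is the high-level program parsed from the query; the test
        # case is "executing this program must reproduce the ground-truth answer".
        test_cases[ex["signature"]].append(
            {"program": ex["program"], "answer": ex["answer"]}
        )
    return dict(test_cases)

# Hypothetical usage:
examples = [
    {"signature": "VERIFY_COLOR(object, color) -> bool",
     "query": "Is the car red?", "program": "...", "answer": "yes"},
    {"signature": "VERIFY_COLOR(object, color) -> bool",
     "query": "Is the bag blue?", "program": "...", "answer": "no"},
]
print(build_test_cases(examples).keys())
```
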
> Q4 (W4). In the experiments, are VisProg/ViperGPT also given the same few-shot training set? For example, I believe you can use the same method of error correction and correct both VisProg/ViperGPT when it is incorrect. In this way, you can include VisProg/ViperGPT comparison in Tables 3 and 4. Would be great to better disentangle what drives this improvement of performance -- modularity in the proposed modules, or the training examples given in the loop, etc.

**Q4. Variants of VisProg and ViperGPT.** Thank you for the suggestion to implement variants of VisProg and ViperGPT for RAVEN and MEWL. Note that neither VisProg nor ViperGPT has a training phase that **generates** new modules for **reuse** at test time, so there is no straightforward way for them to perform error correction with test cases.

During the rebuttal, we implemented new variants that make use of the training examples as follows. VisProg requires a manual implementation of all modules on top of the provided APIs, so to enable it to handle RAVEN and MEWL we manually implement and debug new hand-crafted modules that recognize and discover the task patterns; we call this baseline the **VisProg variant**. We also place the training examples from GNSVR's stage 1 into the prompt of the **VisProg variant** for better performance. ViperGPT has no manual modules and instead asks the LLM to compose the APIs to handle each instance, so we manually write solutions for the training examples into ViperGPT's prompt to teach it the task; we call this the **ViperGPT variant**. We have added this analysis to the revised paper.

VisProg by itself needs a hand-crafted solver module to find the target solution, and it would be extremely difficult for ViperGPT to generate a solver from scratch. We therefore add the solver module learnt by our GNSVR model to the pre-defined API pool of both VisProg and ViperGPT. As shown in Tables 5 and 6 below, our GNSVR model still outperforms both baselines, showing the value of module learning for handling new tasks from only a few examples.

Table 5: Comparison of our GNSVR model with the VisProg and ViperGPT baselines on RAVEN.

| Methods | Center | L-R | U-D |
| ------- | ------ | ---- | ---- |
| VisProg variant | 36.8 | 26.1 | 27.8 |
| ViperGPT variant | 40.6 | 30.7 | 32.4 |
| Ours | 80.1 | 67.6 | 69.1 |

Table 6: Comparison of our GNSVR model with the VisProg and ViperGPT baselines on MEWL.

| Methods | shape | color | material |
| ------- | -------- | -------- | --- |
| VisProg variant | 35.2 | 35.9 | 34.9 |
| ViperGPT variant | 37.8 | 38.2 | 36.7 |
| Ours | 43.7 | 45.3 | 41.0 |

## Experiments Plan

1. Ablate different components of GNSVR (ongoing)

We use an experimental setup similar to the ablation study section of the paper (as used for RefCOCO): we randomly select 800 samples from the GQA test-dev split.

<font color="#E24A0F">

* Ablation in Table 1 on modules without learning new modules on GQA (same as the table above; Wenjun can do it).
* Only using the transferred modules without learning new modules in Table 2 (maybe Wenjun can do it since it is knowledge tagging and image editing).
* Using existing modules for pattern recognition rather than learning it from examples in Tables 3 and 4. (Rui: I think this is the same as the ViperGPT and VisProg ablation on RAVEN and MEWL. If so, it has been done.)

</font>

<font color="#E24A0F">

(1). Ablation on good initialization. (done)
(2). Ablation on defining the correct input and output format for a new module. (done)
(3). We extracted 300 examples from GQA, 100 from RefCOCO, 10 from RAVEN, and 10 from MEWL based on experimental experience. (Rui: Not cherry picking and I will explain it.) (ongoing)
(4). Ablation on the LLM's capability to learn new modules. (done)
(5). Ablation for prompts on using existing modules and creating new modules;
(6). Ablation on sampling strategy. (done)

</font>
GQA:

(a) Ablate prompt design (done)

| Method GNSVR | GQA |
| -------- | -------- |
| Baseline | 45.9 |
| w/o Good Initialization | 41.8 |
| w/o Input and Output Format | 43.2 |

(b) Ablate sampling strategy (done)

| Method GNSVR | GQA |
| -------- | -------- |
| Baseline | 45.9 |
| Random Sampling | 44.3 |

(c) Ablate the LLM's capability to learn new modules (done)

| GPT | GQA |
| -------- | -------- |
| gpt-3.5-turbo-instruct | 45.9 |
| gpt-3.5-turbo | 44.3 |

(Rui: 1. Refer to Q2: I think we already re-implemented the baseline in Table 1, and I considered the fair comparison when I ran the experiments before, so no additional experiments look necessary. 2. I think Wenjun can do it. 3. This setting is quite similar to R4-Q4; these two can be done together.)

2. GNSVR on I-RAVEN (done) (dataset has been generated; center: 85.2%, left-right: 74.6%, up-down: 75.4%)

| Dataset | Center | L-R | U-D |
| ------- | -------- | -------- | --- |
| RAVEN | 80.1 | 67.6 | 69.1 |
| I-RAVEN | 85.2 | 74.6 | 75.4 |

3. ViperGPT and VisProg baselines on RAVEN and MEWL (done)

RAVEN (done)

| Methods | Center | L-R | U-D |
| ------- | ------ | ---- | ---- |
| VisProg | 36.8 | 26.1 | 27.8 |
| ViperGPT | 40.6 | 30.7 | 32.4 |
| Ours | 80.1 | 67.6 | 69.1 |

MEWL (done)

| Methods | shape | color | material |
| ------- | -------- | -------- | --- |
| VisProg | 35.2 | 35.9 | 34.9 |
| ViperGPT | 37.8 | 38.2 | 36.7 |
| Ours | 43.7 | 45.3 | 41.0 |

4. Showcase the generated modules and calculate the usage ratio (done; the counting procedure is sketched after this item)

* GQA: generated modules (13 modules): VERIFY_COLOR, VERIFY_ACTION, VERIFY_ATTRIBUTE, VERIFY_MATERIAL, VERIFY_OBJECT, COMPARE_ATTRIBUTE, COMPARE_COLOR, COMPARE_MATERIAL, CHOOSE_ATTRIBUTE, CHOOSE_COLOR, CHOOSE_MATERIAL, CHOOSE_DEPTH, CHOOSE_ACTION.

Here is the table showing the percentage of the top-5 most-used new modules in GQA.

|Dataset|VERIFY_ATTRIBUTE |CHOOSE_ATTRIBUTE|VERIFY_COLOR|COMPARE_ATTRIBUTE|VERIFY_MATERIAL|
| ------- | ------- | -------- | -------- | --- | --- |
| GQA | 14.1 | 10.8 | 6.9 | 5.9 | 3.6 |

The table reports the proportion of generated high-level programs that call each of the five most common new modules. Overall, 38.7% of all generated high-level programs use newly generated modules (this figure includes the less common modules and counts each program once, even if it contains multiple new modules). These newly learned modules are therefore widely applicable to GQA and help GNSVR achieve good results on this task.

* RAVEN: generated modules (4 modules): DETECT_COLOR, DETECT_SIZE, DETECT_SHAPE, SOLVER.
* MEWL: generated modules (4 modules): DETECT_COLOR, DETECT_MATERIAL, DETECT_SHAPE, SOLVER.

For RAVEN and MEWL, all samples use newly generated modules to produce high-level programs; GNSVR solves these two tasks by executing those programs.
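
The usage ratio in item 4 can be computed by simple counting over the generated high-level programs. The sketch below shows one way to do it, assuming each program is represented as a list of module names; this representation and the example module names are assumptions for illustration, not the exact GNSVR data structures.

```python
from collections import Counter
from typing import Dict, List, Tuple

def usage_stats(programs: List[List[str]], new_modules: set) -> Tuple[float, Dict[str, float]]:
    """Return (share of programs using >= 1 new module, per-module usage share)."""
    n = len(programs)
    programs_with_new = sum(1 for p in programs if any(m in new_modules for m in p))
    per_module = Counter()
    for p in programs:
        for m in set(p) & new_modules:  # count each program once per module
            per_module[m] += 1
    overall = programs_with_new / n if n else 0.0
    per_module_share = {m: c / n for m, c in per_module.items()}
    return overall, per_module_share

# Hypothetical example: 3 programs, 2 of which call newly generated modules.
progs = [["LOC", "VERIFY_COLOR"], ["LOC", "COUNT"], ["CHOOSE_ATTRIBUTE", "VERIFY_ATTRIBUTE"]]
overall, per_module = usage_stats(progs, {"VERIFY_COLOR", "CHOOSE_ATTRIBUTE", "VERIFY_ATTRIBUTE"})
print(overall, per_module)
```
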
5. Debug ablation (done; addresses the reviewer's questions (b) and (c) on the pass rate and the error-correction mechanism, see Q5-Q6 above)

| Method GNSVR | GQA |
| -------- | -------- |
| Baseline | 45.9 |
| w/o debug | 44.9 |

6. Number of training samples

GQA (done)

| # of samples | GQA |
| ------- | -------- |
| 60 | 44.5 |
| 120 | 45.3 |
| 300 | 45.9 |

RAVEN (done)

| # of samples | Center | L-R | U-D |
| ------------ | ------ | ---- | ---- |
| 5 | 46.5 | 37.2 | 39.8 |
| 10 | 80.1 | 67.6 | 69.1 |
| 20 | 80.1 | 67.6 | 69.1 |

MEWL (done)

| # of samples | shape | color | material |
| ------------ | -------- | -------- | --- |
| 5 | 38.9 | 39.6 | 37.9 |
| 10 | 43.7 | 45.3 | 41.0 |
| 20 | 43.7 | 45.3 | 41.0 |

7. Ablation for prompts on using existing modules and creating new modules (done)

| Method GNSVR | GQA |
| -------- | -------- |
| Baseline | 45.9 |
| w/o using existing modules | 45.0 |
| w/o creating new modules | 44.7 |

8. About reliance on existing modules (done)

| Method GNSVR | GQA |
| -------- | -------- |
| Baseline | 45.9 |
| w/o Reliance of Existing Modules | 45.0 |