# ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

[toc]

[Backup Text 12-27 13:00 PM](https://hackmd.io/@ocPaB5y2SmudmPxvrwIQ1g/Bkj3JRb1C/edit)
[back up V2, 12-28](https://hackmd.io/@ocPaB5y2SmudmPxvrwIQ1g/HJYFXnXyC)

---

## Camera Ready Revision TODO

1. [~~Baselines on NEWTON~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q2-New-baseline-NEWTON-for-ContPhy)
2. [~~Baselines on PhyDNet & PIP~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q3-More-Physical-Baselines)
3. [~~Add limitations~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q4-Limitation-Section-of-the-paper)
4. [~~Add logical steps (in appendix)~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q1-About-Logical-Steps-to-Infer-Answers)
5. [~~Performance and statistics of different prompts (in Appendix)~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q2-About-Template-Question-Design)
6. [~~Add point-cloud performance (in Appendix)~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q3-Baselines-with-Multi-Modalities)
7. [Statistics (same as 5?)](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q1-Question-Statistics)
8. [~~Violet Baseline~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q2-More-Baselines)
9. [~~Baselines on different prompts (in Appendix)~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q3-MLLMs-Prompting-Details)
10. [~~Add citation~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q4-More-Related-Work)
11. [~~Limitations (same as 3 but more)~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q5-Limitations-of-the-work)
12. [Add details about ContPRO (in Appendix)](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q1-More-Details-on-Particle-Based-Dynamics-Learner)
13. [~~Update Fig-2~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q2-Confusion-on-Fig2)
14. [~~Update link to sections~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q3-Adding-links-to-the-supplementary-section)
15. [~~Baselines about few-shot~~](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q4-Gathering-Explanations-from-humans)
16. [Add this Sim2Real Discussion??](https://hackmd.io/71N2pmQFR_O6OBN7oz5Shg#Q6-Performance-on-models-ability-to-infer-physical-properties)
17. Discussion about the new baseline table in Experiments

---

2024.4.2 23:00 ET

>Reviewer aBrg: Thanks for the response.
>
>Some of my concerns are addressed. I will give the final rating considering the other reviewers' discussion.

Dear Reviewer aBrg:

Thank you very much for the time and effort you have dedicated to reviewing our work, as well as for your subsequent attention and feedback. Should you have any **unresolved concerns**, please feel free to raise them. Since the deadline is approaching, we remain ready to respond to any additional questions or comments you might have regarding our work. We have also engaged in **discussions with other reviewers**, hoping to facilitate your evaluation and review process.

Best regards,
Authors

2024.4.2 12:07 ET

Dear reviewer 7fPw,

Thank you for recognizing the comprehensive and extensive studies in our work. Based on your comment, we would like to further explain **the significance of our work** below.

**Existing MLLMs Overlook Physics.** While current Multimodal Large Language Models (MLLMs) like Gemini and GPT-4V demonstrate impressive capabilities in general-purpose vision and language processing, they have a critical blind spot, physical reasoning, as shown in our experiments.
These models struggle with tasks that require understanding the physical properties of objects, how objects interact with their environment, and the underlying laws of physics that govern their behavior. **The proposed ContPhy provides a rigorous platform to evaluate and improve the physical reasoning abilities of MLLMs.**

**Significance of Physical Reasoning Capabilities.** Physical reasoning is a cornerstone of Artificial General Intelligence (AGI)[**C**]. An AGI needs not only to understand the physical world through language and vision but also to interact with it effectively and to imagine how objects' dynamics would change if their physical properties changed. Physical reasoning also has strong applications in robotic manipulation[**A,B**], virtual reality[**D**], and animation development[**E**].

**Evaluation of World Simulation.** The recent state-of-the-art video generation model, Sora[**F**], shows strong capabilities in generating long, high-fidelity videos. It also provides a promising path towards building general-purpose simulators of the physical world. However, it still has difficulty understanding physical commonsense and generating physically plausible videos (*e.g.* the glass shattering in the ***discussion*** section of [**F**]). Our benchmark can evaluate such world-simulation models' physical reasoning capabilities in predicting physical dynamics according to language prompts (*e.g.* counterfactual questions).

**Contribution of Our Work.** In this work, we present a novel and challenging physical reasoning benchmark, ContPhy, that pushes the boundaries of current AI models' understanding of the physical world. Specifically, the benchmark encompasses diverse scenarios involving liquids, deformable materials, and complex mechanics. Unlike existing datasets, it focuses on inferring physical properties (mass, density, elasticity, etc.) and how they influence dynamics, particularly the interaction between rigid and soft bodies. We also introduce an oracle model (ContPRO) that marries particle-based physical dynamics models with recent large language models, enjoying the advantages of both: precise dynamic prediction and interpretable reasoning. We hope that our work can spur progress in perception and reasoning within diverse physical settings, narrowing the divide between human and machine intelligence in understanding the physical world.

**Policy on Adding an Anonymous Link.** Thanks for the kind reminder. We have checked the ICML 2024 Author Instructions[**G**], which state "... some additional details about ***rebuttals: ... Links are allowed, but the link must be anonymous to preserve double-blind review***...". We strictly comply with the requirements and oversight of anonymity.

[**A**]. Ramos, Fabio, et al. BayesSim: Adaptive Domain Randomization Via Probabilistic Inference for Robotics Simulators. RSS, 2019.
[**B**]. Chebotar, Yevgen, et al. Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience. ICRA, 2019.
[**C**]. Melnik, A., Schiewer, R., Lange, M., Muresanu, A. I., Garg, A., & Ritter, H. (2023). Benchmarks for Physical Reasoning AI. Transactions on Machine Learning Research.
[**D**]. Giakoni-Ramírez, F., Godoy-Cumillaf, A., Espoz-Lazo, S., Duclos-Bastias, D., & Martín, P. D. V. (2023). Physical Activity in Immersive Virtual Reality: A Scoping Review. Healthcare (Basel, Switzerland).
[**E**]. Funge, J. Cognitive Modeling for Games and Animation. Communications of the ACM, 2000, 43(7): 40-48.
[**F**].
https://openai.com/research/video-generation-models-as-world-simulators
[**G**]. https://icml.cc/Conferences/2024/AuthorInstructions#:~:text=Authors%20can%20submit,track%20IP%20information).

2024.4.2 9:23 ET

**Response to Reviewer 7fPw:**

We would like to wholeheartedly thank you for further clarifying your views on this work and for reminding us to re-check the rebuttal rules.

1. Your feedback on the limitations of our paper is well taken. We acknowledge that, despite our efforts to curate as diverse a dataset as possible to foster models capable of physical commonsense reasoning and dynamic inference, the field of physical reasoning in AI remains in its infancy and our work might only scratch the surface. Physical reasoning represents an emerging area of research aimed at enhancing the ability of AI agents to perceive and understand the physical world, a critical step towards achieving human-like intelligence in machines. This endeavor holds great promise for applications in interpretable robotic task learning and vision-language models, positioning our work as a foundational effort. Although the immediate potential of our research may not be fully apparent, we are confident in its significance as a stepping stone towards interpretably interactive machines that can navigate and understand the physical realm.

2. Thanks for the kind reminder; we have checked the rebuttal rules in the official rebuttal notification, which says, "... some additional details about **rebuttals: ... Links are allowed, but the link must be anonymous to preserve double-blind review**..." We have double-checked the linked webpage and the repository to avoid identifying information. We strictly comply with the requirements and oversight of anonymity.

Warm regards,
Authors

**Response to Reviewer UVKV:**

We are deeply grateful for your positive feedback on our additional experiments and firmly believe that your insightful suggestions have significantly elevated the quality of our paper. Should you have any further questions or require additional clarification, please know we are fully prepared to respond.

Best regards,
Authors

---

2024.4.1

Response to 7fPw:

Thank you for your positive and constructive feedback. We value your suggestions on enhancing dataset diversity, incorporating more baselines, and providing detailed explanations within our paper. We are grateful for your positive assessment and are dedicated to making any possible improvements. We wonder whether there is any additional refinement we could make that might raise your evaluation of the paper.

Warm regards,
Authors

Dear reviewer,

Thank you again for your comments and suggestions on our paper. We hope that our responses and new results have addressed your questions and concerns. We still have a few days left in the discussion period. If you have any further questions, please don't hesitate to let us know and we'll be happy to address them. Thank you!

Best,
Authors

Your guidance is greatly appreciated as we strive to align our work more closely with ICML's quality standards.

Dear reviewer,

We wanted to express our sincere appreciation for your thoughtful review of our paper. Your comments and suggestions were very valuable, and we hope that our revisions have addressed the questions and concerns you raised. We understand there are still a few days remaining in the discussion period. If you have any further questions or require additional clarification on the revisions we've made, please don't hesitate to reach out.
We'd be happy to discuss them with you in more detail. Thank you again for your time and consideration.

Sincerely,
The Authors

Dear Reviewer,

Thank you again for your insightful review of our paper. Your feedback was highly valuable, and we hope that the revisions we've made address the points you raised. We noticed there are still a few days left in the discussion period. Should you have any further questions about the revisions, or if you require any additional clarification on how we addressed your comments, please don't hesitate to let us know. We'd be happy to discuss them with you in more detail. Thank you once more for your time and valuable contribution.

Sincerely,
The Authors

----

## General Response to All.

**G1. Contribution Recognition.** We sincerely thank the reviewers and ACs for their time and effort in reviewing the paper. We are glad that the reviewers recognized the following contributions.

* **Task**. The task is **novel**.
*"The paper introduces a novel benchmark encompassing various scenarios using a 3D physics simulator and a well-structured question set to probe AI models to understand physical properties and dynamics."* (**UVKV**) *"The dataset based on Unity can be a good data source for many tasks, and so do the questions."* (**aBrg**) *"The proposed dataset, ContPhy, is quite interesting and dives reasonably deep into the realm of uncovering the fine-grained physical properties of video captured objects."* (**7fPw**) *"a novel benchmark to assessing machine physical commonsense by encompassing the inference of diverse physical properties."*(**xL5z**) * **Experiments**. **Comprehensive** experiments are conducted.*"The paper also performs a comprehensive set of experiments with traditional visual models, VLMs and also with humans"*. (**UVKV**) *"The experiments contain efforts of many methods and MLLMs."* (**aBrg**) * **Model**. The proposed ContPRO model is **effective**.*"It also show that ContPRO outperforms humans in some tasks and outperforms other approach in most tasks."* (**UVKV**) *"The design of the oracle model, ContPro, is comprehensive, and seems perform well."* (**7fPw**) **G2. Experiments During Rebuttal.** To address the reviewers’ questions and support our responses, we conduct the following experiments to support our claims and show ContPhy's value. For extensive experimental analysis, we choose Gemini-Pro-Vision rather than GPT-4V on experiments that require multimodal large language models, since Gemini provides free APIs for research. For better presentation, besides the pointwise response, we also summarized new experimental results in the following link: [https://physical-reasoning-project.github.io/rebuttal.html](https://physical-reasoning-project.github.io/rebuttal.html) 1. New blind-LLM-based baseline similar to NEWTON. (To **UVKV**) 2. Specialized models (PIP and PhyDNet) trained for physical reasoning. (To **UVKV**) 3. Comparision between template questions and LLM-paraphrased questions (To **aBrg** and **7fPw**); 4. Baselines with Multi-Modalities. (To **aBrg**) 5. Recent transformer-based video-QA model, Violet. (To **7fPw**) 6. Few-shot Prompting MLLM baseline. (To **7fPw**) 7. Chain-of-thought baseline explanations. (To **xL5z**) We hope our responses address all reviewers' concerns. We thank all reviewers' and AC's time and efforts again! ## Response to Reviewer **UVKV** We appreciate the reviewer for the detailed comments and insightful suggestions. ### Q1. Results about Baselines. <!-- **Q1. Results about Baselines.** --> > The QA dataset outputs two or three answers, but some results fall below the random baseline. What could explain this? **About Baseline Performance.** We thank the reviewer for the concern that some baseline models fall below the random baseline on some metrics. For example, in the fluid scenario, **C-LSTM** performs worse than the Blind random baseline (**RND**) on predictive question per option (**P-Opt.**) and goal-driven question per option (**G-Opt.**). We believe the reason is that models like **C-LSTM**, which are originally designed for static vision-language tasks have difficulties to understand the dynamics and physics common sense in the predictive or goal-driven scenarios of fluids. Thus, they only achieved comparable performance to the **RND** baseline. This shows our dataset' challenges for traditional vision-question answering models. **About Option Number of Each Question.** In Fig. 
In Fig. 2, we only show examples with two or three answer options due to page limitations. In fact, there are often more than 3 options for each multiple-choice question; the average number of options per multiple-choice question in our dataset is 3.4.

### Q2. New baseline NEWTON for ContPhy.

> Has the author considered applying this to LLMs, akin to the approach in NEWTON's study on physical reasoning in language models?

We thank the reviewer for suggesting a blind-model evaluation on LLMs similar to the approach in NEWTON. For a thorough evaluation, we have added a new baseline that first transforms the visual input into a text description of the scene objects and their dynamics and then feeds the description to the LLM. Specifically, we insert the following information into the prompt: 1) the 3D location coordinates, Euler rotation angles, and local scales of each rigid object, and 2) the centroid locations of each soft body or fluid together with a few (0-3) sampled particles, across 6 uniformly subsampled sequential frames. Additionally, in the rope scenario, we also provide the list of loads and pulleys linked to each rope. The questions are appended below the text description of the scene (see the illustrative sketch after Table (R1-1)). We feed the prompt into **Gemini Pro Vision**. We choose Gemini rather than GPT-4V since Gemini provides free APIs for extensive experimental analysis. The results are shown in **Table (R1-1)** below.

In **Table (R1-1)**, the text description of the **rope** scene benefits the MLLM's understanding of the physical events happening in the video, while we do not observe a significant performance increase over the question-only blind model in the **other three scenarios**. We suspect this is because the rope scenario differs from the other three in how its physical dynamics are described. The rope scenario is relatively easy to predict and reason about once the model knows the connection lists of the ropes and the motion patterns (rotation and motion directions); the causal relations acquired from the text are relatively obvious for pretrained language models. This reasoning chain does not involve much physical dynamic prediction, which improves performance on the rope scenario. For the fluid, cloth, and ball scenarios, in contrast, solving the questions heavily exercises the language model's ability to perform physical dynamic reasoning. Besides, due to the token limit of the Gemini API, we can only provide a small number of particles in sparsely sampled frames, which may also constrain the LLM's understanding of these temporal physical scenarios. This indicates that current language models such as Gemini-Pro are not competent at continuum physical reasoning and dynamic inference, regardless of whether the input modality is images or text descriptions. We will add this analysis and discuss the related work NEWTON in the later version.

**Table (R1-1).** Performance of the blind Gemini-based LLM model w/ or w/o NEWTON's approach.

| **Model: Gemini-Pro-Vision** | **b) Question Only (Text Only)** | **a) Question Only (Visual Input)** | **i) NEWTON Approach (Text Only)** |
|------------------------------|----------------------------------|-------------------------------------|------------------------------------|
| **Average Rope** | 28.6 | 31.5 | **32.3** |
| **Average Fluid** | **31.0** | 25.2 | 23.9 |
| **Average Cloth** | **53.4** | 45.0 | 48.9 |
| **Average Ball** | **44.3** | 43.0 | 43.2 |
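To make the format of this text-only baseline concrete, the snippet below sketches how such a scene description could be serialized and packed with the question. It is a minimal, hypothetical sketch: the field names (`position`, `euler_angles`, `scale`, `centroid`, `particles`) and the `scene_frames` structure are illustrative placeholders, not our actual annotation schema.

```python
# Hypothetical sketch of the NEWTON-style text-only baseline prompt.
# Field names below are illustrative placeholders, not the real annotation schema.

def describe_frame(frame_idx, frame):
    lines = [f"Frame {frame_idx}:"]
    for obj in frame["rigid_objects"]:
        lines.append(
            f"  rigid {obj['name']}: position={obj['position']}, "
            f"euler_angles={obj['euler_angles']}, scale={obj['scale']}"
        )
    for body in frame["soft_bodies"]:  # soft bodies and fluids
        lines.append(
            f"  {body['name']}: centroid={body['centroid']}, "
            f"sampled_particles={body['particles'][:3]}"  # 0-3 sampled particles
        )
    return "\n".join(lines)

def build_text_prompt(scene_frames, question, rope_links=None, num_frames=6):
    """Uniformly subsample frames, describe them as text, and append the question."""
    step = max(1, len(scene_frames) // num_frames)
    picked = scene_frames[::step][:num_frames]
    parts = [describe_frame(i * step, f) for i, f in enumerate(picked)]
    if rope_links is not None:  # rope scenario only: loads/pulleys linked to each rope
        parts.append(f"Rope connections: {rope_links}")
    parts.append(f"Question: {question}")
    return "\n".join(parts)
```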
### Q3. More Physical Baselines.

> Has the author explored specialized models trained for physical reasoning, such as PIP (PIP: Physical Interaction Prediction via Mental Simulation with Span Selection), interpretable intuitive physics models, or PhyDNet?

Thanks for suggesting new baselines for physical reasoning. During this rebuttal, we have implemented **PIP** and **PhyDNet** for the proposed ContPhy dataset. Based on the original implementations released by the authors, we first generate object masks based on each question and feed them into the models together with the video features. For the open-ended questions, we add a fully-connected layer to predict the answer labels with a cross-entropy loss. Results are shown in **Table (R1-2)**. PIP and PhyDNet achieve competitive overall performance and even surpass the previous best vision models in some settings (*e.g.* counterfactual questions in the fluid scenario). However, these specialized models also have limitations in scenarios like Rope and Cloth. We hypothesize that this is because these models are mainly designed for physical reasoning tasks with simple visual primitives, such as sphere collision and movement, whereas our dataset focuses on continuum objects in diverse environments and different question types, which makes it difficult for these models to grasp the physical rules behind the scenarios. We will add the new baselines to the later version of the paper.

**Table (R1-2)**. New physical baselines for the ContPhy dataset.

| | Previous Vision Best | PIP | PhyDNet |
| :----------: | :------------------: | :------: | :------: |
| **Rope Avg.** | **50.1** | 41.6 | 48.9 |
| **Fluid Avg.** | **42.1** | 35.8 | 39.1 |
| **Cloth Avg.** | **61.8** | 54.0 | 56.5 |
| **Ball Avg.** | **53.4** | 41.2 | 46.7 |
| **Rope P** | **60.7** | 31.5 | 59 |
| **Rope CO** | 76.2 | 75.2 | **77.7** |
| **Rope CQ** | **50.7** | 48.3 | 47.9 |
| **Rope GO** | **56** | 50.6 | 54.4 |
| **Rope GQ** | **6.7** | 2.2 | 5.6 |
| **Fluid P** | **54** | 37 | 51.3 |
| **Fluid CO** | 56.8 | 49.1 | **59.5** |
| **Fluid CQ** | 8.6 | 6 | **10.3** |
| **Fluid GO** | 67.7 | **67.7** | 55.9 |
| **Fluid GQ** | **41.3** | **41.3** | 40 |
| **Fluid PO** | **53.8** | 45.5 | 51.7 |
| **Fluid PQ** | **12.7** | 3.8 | 4.8 |
| **Cloth P** | **59.3** | 54 | 58.7 |
| **Cloth PO** | **68.8** | 61.6 | 63.5 |
| **Cloth PQ** | **57.3** | 46.3 | 47.3 |
| **Ball P** | **54.7** | 54 | 52.7 |
| **Ball CO** | 66.1 | 63.7 | **67.2** |
| **Ball CQ** | 41.8 | 24.6 | **44.3** |
| **Ball GO** | **58.1** | 54.1 | 57.4 |
| **Ball GQ** | **38.9** | 22.2 | 21.1 |
| **Ball PO** | **67.4** | 62.9 | **67.4** |
| **Ball PQ** | **46.6** | 6.8 | 17 |

### Q4. Limitation Section of the paper.

We acknowledge the reviewer's suggestion to incorporate a limitations section. Here, we outline the identified limitations, which will be addressed in a future version of the paper.
Our proposed benchmark, ContPhy, aims to complement existing physical reasoning benchmarks by encompassing diverse physical property inference (e.g., mass, density) across various scenarios and predicting the corresponding dynamics. However, ContPhy still has limitations.

**Limitation 1: Language Diversity.** While the synthesized questions generated by the question engine can effectively test AI models' physical reasoning capabilities across diverse scenarios (future prediction, counterfactual prediction, goal-driven prediction) involving different objects (solids, soft objects, fluids), the language diversity remains limited. The current set of questions relies on a predefined vocabulary, resulting in a gap compared to natural language.

**Limitation 2: Scenario Complexity.** We have carefully designed four distinct scenarios featuring various objects (solids, ropes, clothes, fluids). However, real-world physical interactions can be considerably more complex, involving additional objects and physical factors not currently included in the dataset (e.g., air, wind, fire).

Based on these limitations, we propose the following future research directions for the ContPhy dataset. First, as suggested by the reviewer, we can utilize large language models (LLMs) or other NLP techniques to paraphrase questions and answer options, thereby increasing language variation. Moreover, we can design additional scenarios incorporating a wider range of objects and physical parameters to better reflect real-world complexity.

## Response to Reviewer **aBrg**

Thank you for the constructive comments.

### Q1. About Logical Steps to Infer Answers.

> Logically, can we know the logical steps to infer the answers, in a common human way? For example, the mean of steps to infer the physical property questions in Fig 2?

Thanks for the suggestion on logical steps to infer the answer. As with the synthesized questions in previous research [A, B], we can obtain the logical steps (reasoning operators) that lead to each answer. To provide more information about the benchmark, we show example logical steps for each question type and calculate their statistics at the following link: [https://physical-reasoning-project.github.io/rebuttal.html](https://physical-reasoning-project.github.io/rebuttal.html). From the table, we can see that most question types have two or three logical steps, which involve diverse capabilities for querying objects' visual attributes, physical properties, and property-dependent dynamics for solid objects, soft objects, and liquids.

**A.** CLEVRER: CoLlision Events for Video REpresentation and Reasoning. Kexin Yi*, Chuang Gan*, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum. ICLR, 2020.
**B.** Patel, M., Gokhale, T., Baral, C., and Yang, Y. CRIPP-VQA: Counterfactual reasoning about implicit physical properties via video question answering. EMNLP, 2022.

### Q2. About Template Question Design.

> How do the template question design? How many humans were involved? Do the templates affect the performance a lot, especially for the prompts for MLLM?

Thanks for raising these concerns about templated questions! We designed the question templates through brainstorming; about 10 people were involved in proposing, implementing, and modifying the templates.
As shown in Tables 3-7 and Figure 5 of our paper, these questions are designed to test AI models' capabilities along different dimensions, including static visual attribute recognition, physical property inference, dynamic prediction, and counterfactual imagination for solid objects, soft objects, and liquids. Here, we list more linguistic statistics of the QA dataset.

Considering your concern about the effect of template-based questions, we also utilize LLMs to paraphrase the questions for better diversity. The rewording prompt we used can be found at [the rebuttal webpage](https://physical-reasoning-project.github.io/rebuttal.html). Question statistics and the MLLM's performance on template-based questions and LLM-paraphrased questions are compared in **Table (R2-1)** and **Table (R2-2)**. We also provide statistics of the previous works ComPhy and CLEVRER.

Based on the statistics in **Table (R2-1)**, we have the following observations. Paraphrasing helps increase the diversity of the questions under standard criteria like lexical diversity (TTR) and word distribution (a minimal sketch of the corpus-level TTR computation is given after Table (R2-2)). Also, the diversity of our dataset is much higher than that of ComPhy and CLEVRER. Based on **Table (R2-2)**, we observe that while adding paraphrases of the template questions to the dataset increases question diversity, the experimental results of the baselines are similar. We believe the reason is that the primary challenges of the proposed ContPhy benchmark come from understanding the physical properties and the corresponding dynamics of different objects like solids, soft objects, and liquids; the questions in ContPhy mainly serve as flexible prompts to test AI models' physical reasoning capabilities along different dimensions. We will add this analysis and the new results in the later version.

**Table (R2-1).** Statistics of the templated questions and paraphrased questions.

| **Question Generation Method** | **Template-Based** | **LLM-Paraphrased** | **Template-Based** | **LLM-Paraphrased** | **Template-Based** | **LLM-Paraphrased** | **Template-Based** | **LLM-Paraphrased** | ComPhy | CLEVRER |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| **Scenario** | Fluid | Fluid | Rope | Rope | Cloth | Cloth | Ball | Ball | - | - |
| **Lexical Diversity: TTR** | 0.0096 | **0.052** | 0.0096 | **0.053** | 0.0089 | **0.068** | 0.0066 | **0.049** | 0.0005 | 0.00008 |
| **Lexical Diversity: Word Distribution** | Figures at [Webpage](https://physical-reasoning-project.github.io/rebuttal.html#:~:text=Word%20Distribution%20of%20Templated%20and%20Paraphrased%20Questions) | | | | | | | | | |
| **QA Diversity: Question Types** | 7 Types (Figures at [Webpage](https://physical-reasoning-project.github.io/rebuttal.html#:~:text=5.3-,Question%20Type%20Distribution,-Question%20distribution%20statistics)) | 7 Types | 8 Types | 8 Types | 6 Types | 6 Types | 5 Types | 5 Types | 14 Types | 8 Types |
| **Syntactic Diversity: Sentence Length Average / Variance** | 13.1/3.9 | 13.6/10.7 | 13.0/6.9 | 13.0/11.3 | 12.2/8.5 | 11.7/11.4 | 15.2/10.2 | 15.6/19.2 | 12.0/8.7 | 12.2/12.6 |
| **Readability Scores: Flesch-Kincaid Grade Level** | 4.4 | 4.5 | 3.1 | 3.1 | 4.1 | 3.9 | 4.0 | 4.1 | 4.0 | 5.3 |

**Table (R2-2).** Average performance on the templated questions and paraphrased questions.

| **Model: Gemini-Pro-Vision** | **a) Templated Question (Visual Input)** | **b) Templated Question (Text Only)** | **g) LLM-Paraphrased Question (Visual Input)** | **h) LLM-Paraphrased Question (Text Only)** |
|------------------------------|------------------------------------------|---------------------------------------|------------------------------------------------|---------------------------------------------|
| **Average Rope** | **31.5** | 28.6 | 27.9 | 26.1 |
| **Average Fluid** | 25.2 | 31.0 | 23.4 | **32.9** |
| **Average Cloth** | 45.0 | **53.4** | 43.5 | 51.6 |
| **Average Ball** | 43.0 | 44.3 | 43.4 | **46.2** |
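For reference, the snippet below shows one standard way to compute the corpus-level type-token ratio reported in Table (R2-1); because TTR is computed over the entire question corpus, large question sets naturally yield very small values. The regex-based tokenization here is an assumption for illustration and may differ from the exact procedure we used.

```python
import re

def type_token_ratio(questions):
    """Corpus-level TTR: number of unique word types divided by total word tokens."""
    tokens = []
    for q in questions:
        tokens.extend(re.findall(r"[a-z']+", q.lower()))
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Toy usage on two example questions (illustrative only).
print(type_token_ratio([
    "Is the density of light blue fluid equal to that of green fluid?",
    "Will the orange ball finally drop into the left pit?",
]))
```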
### Q3. Baselines with Multi-Modalities.

> If inputting different modalities of the scene, e.g., multi-view, point clouds, mesh, etc, how well do the models perform?

As suggested by the reviewer, we experiment with point clouds and add these features to the CNN-LSTM and MAC baselines. We first utilize ULIP-2 [A] pre-trained models with Point-BERT [B] backbones to extract features for all object point clouds in the scenarios. These features are concatenated with the visual input and fed into the vision baselines (a minimal sketch of this fusion is given after Table (R2-3)). Results are shown in **Table (R2-3)**. With the help of point clouds, the vision models show large improvements in almost all settings. We attribute this to the point cloud features providing additional information, such as object locations and spatial relationships, which is important for predicting objects' dynamics. We will add this analysis to the later version.

**A**. Xue, Le, et al. "ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding." arXiv (2023).
**B**. Yu, Xumin, et al. "Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling." CVPR (2022).

**Table (R2-3).** Performance of point cloud input based on MAC and CNN-LSTM.

| | CNN-LSTM | CNN-LSTM+PC | MAC | MAC+PC |
| ------------ | :------: | :---------: | :--: | :----: |
| **Rope Avg.** | 45.9 | **48.0** | 44.9 | 47.3 |
| **Fluid Avg.** | 37.3 | **38.5** | 32.6 | 37.8 |
| **Cloth Avg.** | 57.2 | **59.1** | 56.0 | 57.8 |
| **Ball Avg.** | 49.7 | **55.9** | 43.6 | 52.1 |
| **Rope P** | 52.7 | 55 | 53.3 | 57.7 |
| **Rope CO** | 74 | 75.4 | 74.2 | 76 |
| **Rope CQ** | 45 | 45.5 | 39.8 | 45.5 |
| **Rope GO** | 51.2 | 53.8 | 50.3 | 51.7 |
| **Rope GQ** | 6.7 | 10.1 | 6.7 | 5.6 |
| **Fluid P** | 54 | 55.3 | 30 | 50.7 |
| **Fluid CO** | 55 | 55.4 | 56.5 | 57.4 |
| **Fluid CQ** | 8.6 | 9.5 | 6.9 | 7.8 |
| **Fluid GO** | 57.3 | 58.1 | 51.2 | 58.5 |
| **Fluid GQ** | 22.5 | 27.5 | 17.5 | 25 |
| **Fluid PO** | 51.4 | 53.2 | 53.5 | 51.9 |
| **Fluid PQ** | 12.5 | 10.6 | 12.5 | 13.5 |
| **Cloth P** | 46.7 | 47.3 | 59.3 | 59.3 |
| **Cloth PO** | 67.5 | 68.3 | 57.9 | 60.8 |
| **Cloth PQ** | 57.3 | 61.7 | 50.7 | 53.3 |
| **Ball P** | 54.7 | 55.3 | 48 | 52.7 |
| **Ball CO** | 64.2 | 66.9 | 66.1 | 66.4 |
| **Ball CQ** | 41.8 | 47.5 | 3.3 | 45.9 |
| **Ball GO** | 54.1 | 60.4 | 58.1 | 52.6 |
| **Ball GQ** | 20 | 36.7 | 18.9 | 21.1 |
| **Ball PO** | 67.4 | 71.2 | 64.4 | 70.5 |
| **Ball PQ** | 45.5 | 53.4 | 46.6 | 55.7 |
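To make the fusion strategy concrete, here is a minimal sketch of concatenating per-frame point-cloud embeddings with the visual features before the recurrent reasoning stage. The feature dimensions, the answer-vocabulary size, and the assumption that the ULIP-2/Point-BERT embeddings have already been pooled per frame are hypothetical; the question encoding is omitted for brevity, so this is not our exact implementation.

```python
import torch
import torch.nn as nn

class FusedCNNLSTM(nn.Module):
    """CNN-LSTM-style baseline with point-cloud features concatenated per frame."""

    def __init__(self, vis_dim=512, pc_dim=256, hidden=256, num_answers=32):
        super().__init__()
        self.lstm = nn.LSTM(vis_dim + pc_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_answers)

    def forward(self, vis_feats, pc_feats):
        # vis_feats: (B, T, vis_dim) per-frame features from the CNN backbone
        # pc_feats:  (B, T, pc_dim) pooled ULIP-2/Point-BERT object embeddings
        fused = torch.cat([vis_feats, pc_feats], dim=-1)
        _, (h, _) = self.lstm(fused)
        return self.head(h[-1])            # answer logits

# Toy forward pass with random features (shapes are illustrative).
model = FusedCNNLSTM()
logits = model(torch.randn(2, 11, 512), torch.randn(2, 11, 256))
print(logits.shape)  # torch.Size([2, 32])
```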
**Q4. Paper Writing.**

>**Q4-1.** L78, fig 2 is too far away from its 2st ref.
>**Q4-2.** Lacking enough details of the model design of the ContPRO and its implementations. If possible, please add them in the suppl.

Thanks for the advice on paper revision. We will make the following revisions in the later version:

* 1). Move Fig. 2 to a page closer to its references;
* 2). Besides the implementation details in Section **4** and Section **9.2**, include more details on the design and implementation of ContPRO in the supplementary material, and release our source code for easy reproduction.

We provide more model details below.

**Model Design.** The proposed oracle model, ContPRO, is a neuro-symbolic framework that uses LLMs to parse the query question into executable Python programs. As shown in Listings 1-4 of the appendix, the parsed program calls APIs and standard Python logic to handle the task. Typical APIs include those of the perception model (e.g. lines 4-5 of Listing 1 of the paper) and those of the physical simulation model (e.g. lines 8-9 of Listing 2 of the paper). The prompt is carefully designed to expose the APIs that have been implemented and can be called in the Python code; all APIs and prompts are listed in **Listing 5** of our paper. To solve physical reasoning questions, we implement visual perception modules and physical simulation modules together with their corresponding APIs. The perception module (**Mask-RCNN**) takes images as input and locates objects in the videos. The simulation modules (**MPM** and **DPI-Net**) take point clouds as input and simulate point dynamics in counterfactual or predictive scenarios.
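To give a flavor of the execution flow, below is a hypothetical parsed program in the spirit of Listings 1-4. The API names (`locate_objects`, `estimate_density`) and the stub modules are invented for illustration and do not reproduce the actual interface defined in Listing 5.

```python
from dataclasses import dataclass

# Stub modules standing in for the real perception / simulation APIs (illustrative only).
@dataclass
class Obj:
    color: str
    kind: str

class Perception:                      # real version: Mask R-CNN detections
    def locate_objects(self, video):
        return [Obj("light blue", "fluid"), Obj("green", "fluid")]

class Simulator:                       # real version: differentiable-MPM identification
    def estimate_density(self, video, obj):
        return {"light blue": 1.2, "green": 0.8}[obj.color]

perception, simulator = Perception(), Simulator()

# A program the LLM might emit for:
# "Is the density of the light blue fluid greater than that of the green fluid?"
def answer_question(video):
    objects = perception.locate_objects(video)
    blue = next(o for o in objects if o.color == "light blue" and o.kind == "fluid")
    green = next(o for o in objects if o.color == "green" and o.kind == "fluid")
    rho_blue = simulator.estimate_density(video, blue)
    rho_green = simulator.estimate_density(video, green)
    return "yes" if rho_blue > rho_green else "no"

print(answer_question(video=None))  # -> "yes" with the stub values above
```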
**Mask-RCNN.** We use Mask R-CNN (He et al., 2017) to serve as the video perception module, achieving dense object localization within each frame along with the extraction of associated static attributes such as color and material. A ResNet-50 (He et al., 2016) serves as the network backbone, which is fine-tuned on data from the combined training set across all four scenarios until convergence.

**Material Point Method (MPM)** is a continuum mechanics algorithm designed for simulating complex physical interactions among multiple materials, such as fluids, soft bodies, and rigid structures. We define the state of the simulation at time $t$ as $S_t=\{s_{t,m}\}_{m=1}^M$, where $M$ is the total number of material points in the simulation. The state of the $m$-th material point is given by $s_{t,m}=\{x_{t,m}, v_{t,m}, F_{t,m}, V_{t,m}, m_{t,m}\}$, where $x_{t,m}$ represents the position, $v_{t,m}$ the velocity, $F_{t,m}$ the deformation gradient, $V_{t,m}$ the volume, and $m_{t,m}$ the mass of the material point. The simulation advances by computing the forces acting on each material point based on its interactions with other points and external forces, and updating its state accordingly. This involves calculating the stress tensor for each point, derived from the material's physical properties such as density, Young's modulus, and yield stress, collectively denoted as $\theta(m)$. The update equations are typically discretized using a background grid, to which material points transfer their mass and momentum, facilitating the computation of gradients and hence the forces. The differentiable MPM module in our work is powered by the implementation in **[A]**. We further extend the framework with a material-point property optimization capability, allowing the system to identify multiple material properties and disentangle complicated material interactions given point trajectories. Specifically, by integrating the differentiable Taichi language, gradients of simulation variables with respect to physical parameters are computed at each step. This is crucial for **optimizing the estimated physical parameters** of the simulated materials, including fluid density and the plastoelastic properties of bodies. We use the Chamfer distance between the ground-truth and predicted point trajectories as the loss, formulated as $\mathcal{L}_{\text{Chamfer}}(P, Q) = \frac{1}{N} \sum_{p_i \in P} \min_{q_j \in Q} \|p_i - q_j\|^2 + \frac{1}{M} \sum_{q_j \in Q} \min_{p_i \in P} \|q_j - p_i\|^2$, where $P$, $Q$ are two point sets with $N$ and $M$ points, respectively. If there are materials $\{a, b, c\}$, the Chamfer loss at time $t$ is defined as $\mathcal{L}(X_t, \hat X_t) = \sum_{n \in \{a,b,c\}} \mathcal{L}_{\text{Chamfer}}(X_{t, n}, \hat X_{t, n})$, where $X_t$ is the ground-truth point set $\{m_i\}_t$ at time $t$ from the ContPhy dataset and $\hat X_t$ is the set of points simulated by MPM at time $t$. For fluid density identification, we also add the average y-height distance to the loss. At any timestep $t$, for any material point $m$, the gradient of the loss $\mathcal{L}$ with respect to a physical parameter $\theta(m)$, $\nabla_{\theta(m)} \mathcal{L}_{t}$, can be obtained and is leveraged to update $\hat\theta(m)$. Finally, the pipeline outputs the estimated parameters for downstream symbolic processing. For example, if the module receives a request for the fluids' densities, the predicted densities $\hat\rho_a, \hat\rho_b, \hat\rho_c$ of fluids $a$, $b$, $c$ are estimated, making an if-else judgment such as $\hat\rho_a > \hat\rho_c$, and hence question answering, possible. Likewise, predictions for dynamic questions (predictive and counterfactual) are also available from this simulation pipeline (a minimal sketch of this parameter-fitting loop is given after the references below).

**DPI-Net** is a graph neural network-based physical dynamics prediction model that predicts objects' future locations from particle-based history observations. Specifically, at a sampled frame $t$, we represent the particle-based observation state as $S_t=\{s_{t,n}\}_{n=1}^N$, where $N$ is the total number of particles in the scene. We represent the state of the $n$-th particle as $s_{t,n}=\{l_{t,n,x}, l_{t,n,y}, l_{t,n,z}, v_{t,n,x}, v_{t,n,y}, v_{t,n,z}, p_n\}$, where $\{l_{t,n,x}, l_{t,n,y}, l_{t,n,z}\}$ and $\{v_{t,n,x}, v_{t,n,y}, v_{t,n,z}\}$ denote the 3D location and velocity of the $n$-th particle, respectively, and $p_n$ denotes the value of the physical parameters, including mass. We use a multi-layer graph neural network [B, C] to pass messages among particles and predict their locations $\{l_{t+1,n,x}, l_{t+1,n,y}, l_{t+1,n,z}\}$ at frame $t+1$.

**A.** Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. DiffTaichi: Differentiable Programming for Physical Simulation. ICLR 2020.
**B.** Yunzhu Li, Jiajun Wu, Jun-Yan Zhu, Joshua B. Tenenbaum, Antonio Torralba, and Russ Tedrake. Propagation Networks for Model-Based Control Under Partial Observation. ICRA 2019.
**C.** Kipf T, Fetaya E, Wang K C, et al. Neural relational inference for interacting systems. ICML 2018.
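The snippet below is a framework-agnostic sketch of the parameter-fitting loop described above, using PyTorch autograd as a stand-in for the differentiable Taichi/MPM simulator. The `simulate` callable and the toy example are placeholders; only the structure (Chamfer loss, gradient with respect to the physical parameter, iterative update of $\hat\theta$) mirrors the actual pipeline.

```python
import torch

def chamfer(P, Q):
    """Symmetric Chamfer distance between point sets P (N,3) and Q (M,3)."""
    d = torch.cdist(P, Q) ** 2                       # pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def fit_parameter(observed_traj, simulate, theta0=1.0, lr=1e-2, steps=200):
    """Estimate a physical parameter (e.g. fluid density) by gradient descent on the
    Chamfer loss between simulated and observed point trajectories."""
    theta = torch.tensor(theta0, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred_traj = simulate(theta)                  # differentiable rollout
        loss = sum(chamfer(p, x) for p, x in zip(pred_traj, observed_traj))
        loss.backward()                              # gradient of loss w.r.t. theta
        opt.step()
    return theta.detach()

# Toy example: recover the scale of a drifting point cloud (stands in for density).
torch.manual_seed(0)
base = torch.randn(64, 3)
observed = [base * 1.5 * 0.1 * (t + 1) for t in range(5)]
simulate = lambda th: [base * th * 0.1 * (t + 1) for t in range(5)]
print(fit_parameter(observed, simulate))             # moves toward ~1.5
```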
## Response to Reviewer **7fPw**

Thank you for the constructive comments and insightful suggestions.

### Q1. Question Statistics.

>It is unclear to me whether the question is indeed diverse enough, in the sense that no explicit statistics such as type-token ratio, word distributions, and other relevant quantities were clearly reported. For the first point of weakness, I would actually suggest doing a LLM paraphrasing first and then see if that complicates the QA sets, with the statistics mentioned of course.

Thanks for the advice on reporting more statistics of the QA sets. Based on your advice, we report statistics of the current question version and of its paraphrased version. Besides the recommended lexical diversity metrics such as TTR and word distribution, we also report syntactic diversity (sentence-length mean and variance), question type diversity (number of question types), and readability scores (Flesch-Kincaid Grade Level) for reference. The statistics are given in **Table (R3-1)**. Figures of the word distribution and QA type distribution are uploaded to [the rebuttal webpage](https://physical-reasoning-project.github.io/rebuttal.html).

The advice on paraphrasing questions is quite inspirational and can contribute to the diversity of the QA dataset. We have utilized Gemini-Pro to reword the questions; the prompts we use can be found on [our rebuttal webpage](https://physical-reasoning-project.github.io/rebuttal.html). We instruct the LLM to reword the given questions as diversely as possible while keeping the original meaning strictly unchanged and the content readable for a general audience. We provide some generated examples here:

| | **Templated** | **Paraphrased** |
|---------------|-------------------------------------------------------------------------|-----------------------------------------------------------------|
| **Example 1** | Is the density of light blue fluid equal to that of green fluid? | Are the light blue liquid and green liquid just as heavy? |
| **Example 2** | Which phrase below can best describe the final pose of the green plate? | What does the final position of the green plate best resemble? |
| | A. Standing upright. | A. Standing straight up. |
| | B. Leaning. | B. Tilted. |
| | C. Lying horizontally. | C. Lying flat. |
| **Example 3** | Will the orange ball finally drop into the left pit? | Is the orange ball expected to fall into the pit on the left? |

The statistics for both paraphrased and template-based questions are listed in **Table (R3-1)**. As shown in **Table (R3-1)**, LLM paraphrasing improves the question-answer pairs in terms of lexical diversity and word distribution while keeping the semantics of the QA pairs unchanged. We also compare the diversity of our generated questions with the previous works ComPhy and CLEVRER; based on the TTR metric, our dataset is much more diverse than previous works. We will add these new experiments and analyses to the later version.

Besides the word statistics, we also test the MLLM's (Gemini's) performance on template-based and LLM-paraphrased questions. Averaged results are listed in **Table (R3-2)** ([full table at the rebuttal webpage](https://physical-reasoning-project.github.io/rebuttal.html)). The model performs similarly on the templated and paraphrased datasets, which may indicate that the choice between templated and LLM-generated questions does not substantially affect models' performance on these questions.

**Table (R3-1).** Statistics of templated questions and paraphrased questions.
| **Question Generation Method** | **Template-Based** | **LLM-Paraphrased** | **Template-Based** | **LLM-Paraphrased** | **Template-Based** | **LLM-Paraphrased** | **Template-Based** | **LLM-Paraphrased** |ComPhy |CLEVRER| |:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:| | **Scenario** | Fluid | Fluid | Rope | Rope | Cloth | Cloth | Ball | Ball | - | - | | **Lexical Diversity: TTR**|0.0096 | **0.052** |0.0096 | **0.053** |0.0089 | **0.068** |0.0066 | **0.049** |0.0005 |0.00008| | **Lexical Diversity: Word Distribution** | Figures at [Webpage](https://physical-reasoning-project.github.io/rebuttal.html#:~:text=Word%20Distribution%20of%20Templated%20and%20Paraphrased%20Questions) | | | | | | | | | | | **QA Diversity: Question Types** | 7 Types (Figures at [Webpage](https://physical-reasoning-project.github.io/rebuttal.html#:~:text=5.3-,Question%20Type%20Distribution,-Question%20distribution%20statistics)) | 7 Types | 8 Types | 8 Types | 6 Types | 6 Types | 5 Types | 5 Types | 14 Types | 8 Types | | **Syntactic Diversity: Sentence Length Average / Variance** | 13.1/3.9| 13.6/10.7 | 13.0/6.9| 13.0/11.3 | 12.2/8.5| 11.7/11.4 | 15.2/10.2 | 15.6/19.2 | 12.0/8.7| 12.2/12.6 | | **Readability Scores: Flesch-Kincaid Grade Level** | 4.4 | 4.5 | 3.1 | 3.1 | 4.1 | 3.9 | 4.0 | 4.1 | 4.0 | 5.3 | **Table R3-2.** Average performance comparison of Gemini on templated questions and paraphrased questions. | **Model: Gemini-Pro-Vision** | **a) Templated Question (Visual Input)** | **b) Templated Question (Text Only)** | **g) LLM-Paraphrased Question (Visual Input)** | **h) LLM-Paraphrased Question (Text Only)** | |------------------------------|------------------------------------------|---------------------------------------|------------------------------------------------|---------------------------------------------| | **Average Rope** | 31.5 | 28.6 | 27.9 | 26.1 | | **Average Fluid** | 25.2 | 31.0 | 23.4 | 32.9 | | **Average Cloth** | 45.0 | 53.4 | 43.5 | 51.6 | | **Average Ball** | 43.0 | 44.3 | 43.4 | 46.2 | ### Q2. More Baselines. <!-- **Q2. More Baselines.** --> >For the baseline models, why not consider a few recent transformer-based video-QA models that can be finetuned on your dataset to complement the zero-shot large models such as GPT-4v, such as [1] and [2]? [1] Fu, Tsu-Jui, et al. "Violet: End-to-end video-language transformers with masked visual-token modeling." arXiv preprint 2021. [2] Sung, Yi-Lin, Jaemin Cho, and Mohit Bansal. "Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks." CVPR 2022. Thanks for the suggestion to add more finetuned baselines. We added a new transformer-based video-QA baseline, Violet, which takes videos as input, can be directly fine-tuned on our dataset. We showed the performance of Violet in **Table (R3-3)** and compared it with other fine-tuned vision baselines and zero-shot video baselines (MLLMs). The results of Violet are fine-tuned based on the pre-trained checkpoint on YT180, WebVid2.5M and CC3M. Based on the results shown in **Table (R3-3)**, we can find that the fine-tuned Violet exhibits excellent overall performance. Violet achieves competitive results with the best performance of all previous vision models in some settings (such as **Rope CO** and **Fluid GO**), and even surpasses the best in **Fluid CO** and **Ball CQ**. Compared with zero-shot video-language models, which are GPT-4V and Gemini, Violet excels in most of the settings, showing the advantages of fine-tuning large pre-trained models on ContPhy's videos. 
However, the performance of Violet is still far from human performance, which shows the challenge posed by our ContPhy dataset. Due to the constrained time of the rebuttal period, we have not set up the evaluation pipeline for VL-Adapter, since the environment dependencies of the [feature extraction stage](https://github.com/linjieli222/HERO_Video_Feature_Extractor?tab=readme-ov-file) of VL-Adapter are quite complex. We will add this baseline in the later version.

**Table R3-3.** Performance comparison of Violet and other baselines.

| | Previous Vision Best | Gemini | GPT-4v | Violet |
| :----------: | :------------------: | :----: | :----: | :----: |
| **Rope Avg.** | **50.1** | 31.5 | 34.1 | 45.4 |
| **Fluid Avg.** | **42.1** | 25.2 | 29.7 | 39.8 |
| **Cloth Avg.** | **61.8** | 45.0 | 49.8 | 59.6 |
| **Ball Avg.** | **53.4** | 43.0 | 41.5 | 42.9 |
| **Rope P** | 60.7 | 35.5 | 48.0 | 51.7 |
| **Rope CO** | 76.2 | 48.2 | 42.0 | 76.0 |
| **Rope CQ** | 50.7 | 12.0 | 11.3 | 43.1 |
| **Rope GO** | 56.0 | 51.6 | 57.0 | 55.2 |
| **Rope GQ** | 6.7 | 10.3 | 12.1 | 1.1 |
| **Fluid P** | 54.0 | 10.0 | 25.0 | 50.9 |
| **Fluid CO** | 56.8 | 47.3 | 53.3 | **60.4** |
| **Fluid CQ** | 8.6 | 5.1 | 5.1 | 1.7 |
| **Fluid GO** | 67.7 | 44.4 | 53.8 | 67.3 |
| **Fluid GQ** | 41.3 | 11.3 | 7.5 | 41.2 |
| **Fluid PO** | 53.8 | 52.4 | 50.0 | 53.2 |
| **Fluid PQ** | 12.7 | 5.8 | 13.0 | 3.8 |
| **Cloth P** | 59.3 | 42.0 | 49.0 | 55.0 |
| **Cloth PO** | 68.8 | 50.1 | 53.0 | 68.2 |
| **Cloth PQ** | 57.3 | 43.0 | 47.5 | 55.7 |
| **Ball P** | 54.7 | 54.0 | 45.0 | 48.0 |
| **Ball CO** | 66.1 | 60.9 | 66.7 | 65.6 |
| **Ball CQ** | 41.8 | 29.6 | 46.9 | **41.8** |
| **Ball GO** | 58.1 | 54.1 | 51.4 | 57.4 |
| **Ball GQ** | 38.9 | 24.6 | 18.0 | 21.1 |
| **Ball PO** | 67.4 | 51.7 | 45.4 | 64.4 |
| **Ball PQ** | 46.6 | 25.9 | 17.2 | 2.3 |

### Q3. MLLMs Prompting Details

> It is unclear how GPT-4v and Gemini are prompted, i.e., did you use an in-context examples, what are the subsampling rates of the videos, and also what would be the instructions/guidelines to these large models?

We thank the reviewer for the question and provide more prompting details below. For the original experiments presented in the paper, we only combine a general instruction, the question, and subsampled frames into a packed prompt. We uniformly subsample 11 frames and resize images to 480x270 in these initial experiments. The original general instruction and per-question instructions are listed in Table 8 of the paper. The experiments were conducted in a zero-shot setting, and we did not include any scenario-specific guidelines or in-context QA examples in the prompt.

During the rebuttal, we carefully considered your comment and experimented with self-designed scenario-specific guidelines, in-context examples, as well as elaborate human explanations for example questions. We also tested full-size (1920x1080) images and raised the number of subsampled frames to 16, the upper limit accepted by Gemini-Pro-Vision's visual input. The averaged results are listed below; full results and prompt examples can be found at [the rebuttal webpage](https://physical-reasoning-project.github.io/rebuttal.html). Packing the prompt with more scenario instructions, few-shot QA examples, and even detailed human explanations can benefit the MLLM's understanding of the physical scene in various ways. However, we did not observe any benefit from upsampling the videos, probably because the MLLM's (Gemini's) visual understanding is constrained by the limits of its visual API. A minimal sketch of the frame subsampling and prompt packing is given after Table R3-4.

**Table R3-4.** Average performance comparison of Gemini-Pro-Vision with different prompting techniques. More details on [the rebuttal webpage](https://physical-reasoning-project.github.io/rebuttal.html).

| **Model: Gemini-Pro-Vision** | **a) Question Only** | **c) Scenario-Specific Guideline** | **d) In-Context QA Examples** | **e) Human Explained Examples** | **f) Upsampled Video (11→16 Frames, Higher Resolution)** |
|------------------------------|----------------------|------------------------------------|-------------------------------|---------------------------------|----------------------------------------------------------|
| **Average Rope** | 31.5 | 34.1 | 33.5 | **36.2** | 29.9 |
| **Average Fluid** | 25.2 | 25.1 | **26.4** | 23.1 | 23.6 |
| **Average Cloth** | 45.0 | 46.4 | 45.6 | **48.8** | 44.4 |
| **Average Ball** | 43.0 | **43.8** | 35.4 | 32.1 | 43.5 |
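The snippet below sketches the uniform frame subsampling and prompt packing described above (11 frames at 480x270 in the paper's experiments, 16 full-size frames in the rebuttal run). It is a simplified illustration: the actual call to the Gemini/GPT-4V API is omitted, and the argument names are placeholders.

```python
from PIL import Image

def subsample_frames(frame_paths, num_frames=11, size=(480, 270)):
    """Uniformly subsample video frames and optionally resize them for the MLLM prompt."""
    step = max(1, len(frame_paths) // num_frames)
    picked = frame_paths[::step][:num_frames]
    frames = [Image.open(p) for p in picked]
    return frames if size is None else [f.resize(size) for f in frames]

def pack_prompt(general_instruction, question, frames, extra_context=None):
    """Interleave text and images into one multimodal prompt list.
    `extra_context` can hold scenario guidelines, few-shot QA examples,
    or human-written explanations."""
    parts = [general_instruction]
    if extra_context:
        parts.extend(extra_context)
    parts.extend(frames)                 # images are passed alongside the text parts
    parts.append(question)
    return parts                         # this list is sent to the MLLM API
```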
### Q4. More Related Work.

>Another video based QA work that talks about counterfactual reasoning is [3]. While the work is not directly discussing physical properties at the granularity of this work and it serves as a more general event-rich video QA work, it is still quite relevant (its physical dimension) to the direction of this work. Consider citing and discussing it. [3] Wu, Te-Lin, et al. "ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos." EMNLP 2023.

Thanks for the advice on the related work about counterfactual reasoning [3]. We will cite and discuss it in the later version.

### Q5. Limitations of the work.

> There were no, at least to me not clear, limitations of this work addressed. The authors should address the limitations of this work in the following aspects: (potential) dataset curation artifacts, the diversity justification and the limitation of the nature of the proposed tasks, and finally failure modes (in more details) of the proposed oracle model.

We thank the reviewer for suggesting a more detailed discussion of the paper's limitations. Here, we outline the identified limitations, which will be addressed in a future version of the paper. Our proposed benchmark, ContPhy, aims to complement existing physical reasoning benchmarks by encompassing diverse physical property inference (e.g., mass, density) across various scenarios and predicting the corresponding dynamics. However, ContPhy still has limitations.

**Limitation 1: Language Diversity.** While the synthesized questions generated by the question engine can effectively test AI models' physical reasoning capabilities across diverse scenarios (future prediction, counterfactual prediction, goal-driven prediction) involving different objects (solids, soft objects, fluids), the language diversity remains limited. The current set of questions relies on a predefined vocabulary, resulting in a gap compared to natural language.

**Limitation 2: Scenario Complexity.** We have carefully designed four distinct scenarios featuring various objects (solids, ropes, clothes, fluids). However, real-world physical interactions can be considerably more complex, involving additional objects and physical factors not currently included in the dataset (e.g., air, wind, fire).

Based on these limitations, we propose the following future research directions for the ContPhy dataset.
First, as suggested by the reviewer, we can utilize large language models (LLMs) or other NLP techniques to paraphrase questions and answer options, thereby increasing language variation. Moreover, we can design additional scenarios incorporating a wider range of objects and physical parameters to better reflect real-world complexity.

**Failure Modes of the Oracle Model.** The proposed oracle model, ContPRO, is a neuro-symbolic framework that uses LLMs to parse the query question into executable Python programs. As shown in **Listings 1-4** of the appendix, the parsed program calls APIs and standard Python logic to handle the task. Typical APIs include those of the perception model (*e.g.* lines 4-5 of Listing 1) and those of the physical simulation model (*e.g.* lines 8-9 of Listing 2). According to our observations, the typical errors mainly come from the dynamic predictions of the physical simulation models (MPM or DPI-Net). This is also reflected in the low per-question accuracies of different models on multiple-choice questions, which require long-term prediction of object dynamics. Note that long-term physical dynamics prediction has always been a challenge, as found in our work and previous research [A, B, C].

**A.** CLEVRER: CoLlision Events for Video REpresentation and Reasoning. Kexin Yi*, Chuang Gan*, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum. ICLR, 2020.
**B.** Yunzhu Li, Jiajun Wu, Russ Tedrake, Joshua B. Tenenbaum, and Antonio Torralba. Learning Particle Dynamics for Manipulating Rigid Bodies, Deformable Objects, and Fluids. ICLR, 2019.
**C.** Xie, Hanchen, et al. A critical view of vision-based long-term dynamics prediction under environment misalignment. ICML, 2023.

## Response to Reviewer **xL5z**

We deeply appreciate the time and effort the reviewer has dedicated to evaluating our work and providing beneficial suggestions for improvement.

### Q1. More Details on Particle-Based Dynamics Learner

> I did not quite understand how the Particle-Based Dynamics Learner works. How exactly are MPM and DPI applied? Please explain in more detail their working principles within the model.

We thank the reviewer for the suggestion to provide more details on the Particle-Based Dynamics Learner. Besides the descriptions in **Section 4 ContPRO** and **Section 9.2 Oracle Model ContPRO Details**, we provide more details below. Given the characteristics of the two system identification methods, we use (differentiable) MPM for the fluid and ball scenarios, and DPI for the cloth and rope scenarios.

**Material Point Method (MPM)** is a continuum mechanics algorithm designed for simulating complex physical interactions among multiple materials, such as fluids, soft bodies, and rigid structures. We define the state of the simulation at time $t$ as $S_t=\{s_{t,m}\}_{m=1}^M$, where $M$ is the total number of material points in the simulation. The state of the $m$-th material point is given by $s_{t,m}=\{x_{t,m}, v_{t,m}, F_{t,m}, V_{t,m}, m_{t,m}\}$, where $x_{t,m}$ represents the position, $v_{t,m}$ the velocity, $F_{t,m}$ the deformation gradient, $V_{t,m}$ the volume, and $m_{t,m}$ the mass of the material point. The simulation advances by computing the forces acting on each material point based on its interactions with other points and external forces, and updating its state accordingly.
The simulation advances by computing the forces acting on each material point based on its interactions with other points and on external forces, and by updating the point's state accordingly. This involves calculating the stress tensor for each point, derived from the material's physical properties such as density, Young's modulus, and yield stress, jointly denoted as $\theta(m)$. The update equations are typically discretized on a background grid, to which material points transfer their mass and momentum, facilitating the computation of gradients and hence the forces. The differentiable MPM module in our work is powered by the implementation in **[A]**. We further extend the framework with the ability to optimize material-point properties, allowing the system to identify multiple material properties and disentangle complicated material interactions when given point trajectories. Specifically, by integrating the differentiable Taichi language, gradients of simulation variables with respect to physical parameters are computed at each step. This is crucial for **optimizing the estimated physical parameters** of the simulated materials, including fluid density and the elastoplastic properties of bodies.

We use the Chamfer distance between the ground-truth and predicted point trajectories as the loss, formulated as $\mathcal{L}_{\text{Chamfer}}(P, Q) = \frac{1}{|P|} \sum_{p_i \in P} \min_{q_j \in Q} \|p_i - q_j\|^2 + \frac{1}{|Q|} \sum_{q_j \in Q} \min_{p_i \in P} \|q_j - p_i\|^2$, where $P$ and $Q$ are two point sets. If there are materials $\{a, b, c\}$, the loss at time $t$ is defined as $\mathcal{L}(X_t, \hat X_t) = \sum_{n \in \{a,b,c\}} \mathcal{L}_{\text{Chamfer}}(X_{t, n}, \hat X_{t, n})$, where $X_t$ is the ground-truth point set at time $t$ from the ContPhy dataset and $\hat X_t$ is the set of points simulated by MPM at time $t$. For fluid density identification, we also add the average y-height distance to the loss. At any timestep $t$ and for any material point $m$, the gradient of the loss with respect to a physical parameter $\theta(m)$, $\nabla_{\theta(m)} \mathcal{L}_{t}$, can be obtained and is used to update the estimate $\hat\theta(m)$. Finally, the pipeline outputs the estimated parameters for downstream symbolic processing. For example, if the module receives a request for fluid densities, the predicted densities $\hat\rho_a, \hat\rho_b, \hat\rho_c$ of fluids $a$, $b$, $c$ are estimated, making comparisons such as $\hat\rho_a > \hat\rho_c$ and the subsequent question answering possible. Likewise, predictions for dynamic questions (predictive and counterfactual) are also obtained through this simulation pipeline.
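To make the loss concrete, here is a minimal NumPy sketch of the symmetric Chamfer term defined above (illustrative only; in our pipeline the equivalent computation runs inside the differentiable Taichi program so that gradients can flow back to the physical parameters $\theta(m)$):

```python
import numpy as np

def chamfer_distance(P: np.ndarray, Q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets P of shape (|P|, 3) and Q of shape (|Q|, 3)."""
    # Pairwise squared distances, shape (|P|, |Q|).
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)
    # Nearest neighbor from P to Q, plus nearest neighbor from Q to P.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def per_timestep_loss(X_t: dict, X_hat_t: dict) -> float:
    """Sum of per-material Chamfer terms at one timestep, e.g. for materials {'a', 'b', 'c'}."""
    return sum(chamfer_distance(X_t[n], X_hat_t[n]) for n in X_t)
```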
**DPI-Net** is a graph-neural-network-based physical dynamics prediction model, which predicts objects' future locations from a particle-based history of observations. Specifically, at a sampled frame $t$, we represent the particle-based observation state as $S_t=\{s_{t,n}\}_{n=1}^N$, where $N$ is the total number of particles in the scene. We represent the state of the $n$-th particle as $s_{t,n}=\{l_{t,n,x}, l_{t,n,y}, l_{t,n,z}, v_{t,n,x}, v_{t,n,y}, v_{t,n,z}, p_n\}$, where $\{l_{t,n,x}, l_{t,n,y}, l_{t,n,z}\}$ and $\{v_{t,n,x}, v_{t,n,y}, v_{t,n,z}\}$ denote the 3D location and velocity of the $n$-th particle, respectively, and $p_n$ denotes its physical property values, including mass. We use a multi-layer graph neural network **[B, C]** to pass messages among particles and predict their locations $\{l_{t+1,n,x}, l_{t+1,n,y}, l_{t+1,n,z}\}$ at frame $t+1$.

For predictive questions, we feed the state of the last observed frame of the video as the starting point and then iteratively predict future frames by feeding the predictions back into the model. For counterfactual questions, we modify the initial state $S_t=\{s_{t,n}\}_{n=1}^N$, predict the corresponding dynamics in the counterfactual situation, and check whether the event in each option happens. Similarly, for goal-driven questions, we change the conditions to the setting described in each option and check whether the event described in the question happens. We will include these details in the revised version.

**A.** Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. DiffTaichi: Differentiable Programming for Physical Simulation. ICLR, 2020.

**B.** Yunzhu Li, Jiajun Wu, Jun-Yan Zhu, Joshua B. Tenenbaum, Antonio Torralba, and Russ Tedrake. Propagation Networks for Model-Based Control Under Partial Observation. ICRA, 2019.

**C.** Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, et al. Neural Relational Inference for Interacting Systems. ICML, 2018.

### Q2. Confusion on Fig2.

> Figure 2's Example A from (a) to (f) can easily cause confusion. While other examples describe the same object, Example A's first three images do not describe the same object as the last three images, which I think might confuse others.

Thanks for the suggestion on Fig. 2 (a); we will update the figure and its caption to make it clearer and avoid confusion.

### Q3. Adding links to the supplementary section.

> Please add related links in the supplementary material section. For instance, in section 4 on Program Execution, it is mentioned, "We provide more model details in the supplementary material. Please link specifically to section 9.2"

We will revise the paper accordingly and add reference links to the corresponding sections of the supplementary material for easier reading.

### Q4. Gathering Explanations from humans.

> It is suggested that the authors consider gathering explanations from human subjects. This approach would enable them to collect data on the theories that humans naturally employ and assess whether providing suggestions, prompts, or guidance to utilize the appropriate theory enhances performance. Such an approach could significantly contribute to rationalizing expectations and informing the design of machine systems of this nature.

Thanks for this insightful advice! We gathered human explanations for a set of carefully selected QA examples (mainly designed by ourselves) and tested their effectiveness by adding them to the prompts of multimodal large language models. The detailed prompt examples and experimental results can be found on [the rebuttal webpage](https://physical-reasoning-project.github.io/rebuttal.html).
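As an illustration, the sketch below shows one way a human-explained in-context example could be prepended to a test question before being sent, together with the sampled video frames, to Gemini-Pro-Vision. The example question, explanation, and helper names are hypothetical placeholders; the exact prompts we used are listed on the rebuttal webpage.

```python
# Hypothetical prompt assembly for setting e) "Human Explained Examples".
IN_CONTEXT_EXAMPLE = (
    "Example question: If the red rope were cut, would the attached ball fall?\n"
    "Answer: Yes.\n"
    "Human-written explanation: The ball is supported only by the tension in the "
    "red rope; once the rope is cut, nothing balances gravity, so the ball falls.\n"
)

def build_prompt(test_question: str) -> str:
    """Prepend one human-explained example to a test question (placeholder wording)."""
    return (
        "You will watch a physics video and answer a question about it.\n\n"
        + IN_CONTEXT_EXAMPLE
        + "\nNow answer the following question about the given video frames:\n"
        + test_question
    )

# The resulting string is sent to the MLLM together with the sampled video frames.
print(build_prompt("Will the blue fluid end up in the left container?"))
```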
As shown in the following table, we observe an increase in average performance in most scenarios (rope, fluid, and cloth), which might imply that prompting with in-context examples provides models with more information about the scene as well as some evidence to support reasoning about the events. When we add detailed human explanations, the performance improves further for the rope and cloth scenarios.

**Table (R4-1).** Performance comparison of prompting with or without in-context examples and human explanations.

| **Model: Gemini-Pro-Vision** | **a) Question Only** | **d) In-Context QA Examples** | **e) Human Explained Examples** |
|:----------------------------:|:--------------------:|:-----------------------------:|:-------------------------------:|
| **Average Rope** | 31.5 | 33.5 | **36.2** |
| **Average Fluid** | 25.2 | **26.4** | 23.1 |
| **Average Cloth** | 45.0 | 45.6 | **48.8** |
| **Average Ball** | **43.0** | 35.4 | 32.1 |

### Q5. Generalization to the Real World.

> It is unclear if predictions from a 3d simulated model for this task will generalize to the real world. It depends on the quality of the renders and the physics simulation of the 3d engine.

**Benefits of the Simulation Pipeline.** We thank the reviewer for the question about generalization to the real world. We chose a simulation engine and graphics renderers to build our dataset, rather than real-world videos, because the synthesized nature provides several benefits. First, it makes it easy to obtain dense annotations (*e.g.*, segmentation masks, point clouds, and meshes) for every frame, which enables us to generate diverse questions that test physical reasoning capabilities along different dimensions. Moreover, it allows us to scale up the size of the dataset easily.

**Sim2Real Research.** Previous studies [A] and [B] have also shown that models learned from synthesized and simulated data can generalize to real-world applications such as robotic planning and manipulation.

[A]. Veerapaneni, Rishi, et al. "Entity Abstraction in Visual Model-Based Reinforcement Learning." CoRL, 2019.

[B]. Janner, Michael, et al. "Reasoning about Physical Interactions with Object-Oriented Prediction and Planning." ICLR, 2019.

### Q6. Performance on model's ability to infer physical properties.

> The authors have raised doubts about the existing models' ability to infer physical properties on a continuum. However, there are no experiments to compare and demonstrate their actual performance.

As shown in **Table 2** of the paper, existing visual models such as MAC and HCRN, as well as multimodal large language models such as Gemini and GPT-4V, show inferior performance on ContPhy when inferring objects' physical properties (**Prop**), predicting objects' dynamics (**P-Opt.**), and performing counterfactual imagination (**C-Opt.** and **G-Opt.**), which we believe has already "demonstrated the actual performance" of existing models in inferring physical properties. We will add this discussion to the experimental section of the revised version.