# (Backup) ContPhy: Continuum Physical Concept Learning and Reasoning from Videos
[toc]
---
## General Response to All.
**G1. Contribution Recognition.**
We sincerely thank the reviewers and ACs for their time and effort in reviewing the paper. We are glad that the reviewers recognized the following contributions.
* **Task**. The task is **novel**. *"The paper introduces a novel benchmark encompassing various scenarios using a 3D physics simulator and a well-structured question set to probe AI models to understand physical properties and dynamics."* (**UVKV**) *"The dataset based on Unity can be a good data source for many tasks, and so do the questions."* (**aBrg**) *"The proposed dataset, ContPhy, is quite interesting and dives reasonably deep into the realm of uncovering the fine-grained physical properties of video captured objects."* (**7fPw**) *"a novel benchmark to assessing machine physical commonsense by encompassing the inference of diverse physical properties."* (**xL5z**)
* **Experiments**. **Comprehensive** experiments are conducted. *"The paper also performs a comprehensive set of experiments with traditional visual models, VLMs and also with humans"*. (**UVKV**) *"The experiments contain efforts of many methods and MLLMs."* (**aBrg**)
* **Model**. The proposed ContPRO model is **effective**. *"It also show that ContPRO outperforms humans in some tasks and outperforms other approach in most tasks."* (**UVKV**) *"The design of the oracle model, ContPro, is comprehensive, and seems perform well."* (**7fPw**)
**G2. Experiments During Rebuttal.**
To address the reviewers' questions, we conducted the following experiments to support our claims and demonstrate ContPhy's value.
## Response to Reviewer **UVKV**
**Q1. Results about Baselines.**
> The QA dataset outputs two or three answers, but some results fall below the random baseline. What could explain this?
**About Option Number of Each Question.** In Fig. 2, we only show examples with two or three options due to page limitations. In fact, the average number of options per multiple-choice question is <font color="#E24A0F">**</font> in our dataset.
**About Baseline Performance.** We thank the reviewer for raising the concern that some baseline models fall below the random baseline on certain metrics. For example, **C-LSTM** performs worse than the blind random baseline (**RND**) on predictive questions per option (**P-Opt.**) and goal-driven questions per option (**G-Opt.**). We believe the reason is that models like **C-LSTM**, which were originally designed for static vision-language tasks, have difficulty understanding the dynamics and physical common sense required in the predictive and goal-driven scenarios. Thus, they only achieve performance comparable to, or even below, the **RND** baseline. This demonstrates the challenges our dataset poses for traditional visual question answering models.
<font color="#E24A0F">1. Do we have more than 3 options for each question type? 2. Get the statistics</font>
**Q2. New baseline NEWTON for ContPhy.**
> Has the author considered applying this to LLMs, akin to the approach in NEWTON's study on physical reasoning in language models?
We thank the reviewer for suggesting a blind-model evaluation on LLMs similar to the approach in NEWTON. For a thorough evaluation, we have added a new baseline that first transforms the visual input into a text description of the scene objects and their dynamics, and then feeds the description to the LLM. Specifically, we insert the following information into the prompt: 1) the 3D location coordinates, Euler rotation angles, and local scales of each rigid object, and 2) the locations of each soft body's or fluid's centroid and some sampled particles, over <font color="#E24A0F">add frame number</font> subsampled sequential frames. Additionally, in the rope scenario, we also provide the list of loads and pulleys linked to each rope. The question is appended after the text description of the scene. We feed the prompt into **Gemini Pro Vision**. We choose Gemini rather than GPT-4V since Gemini provides free APIs for extensive experimental analysis. The results are shown in **Table (A)** below. In Table (A), we do not observe a significant performance increase compared with the question-only blind model in Table 2 of the main paper. This indicates that there are no shortcuts in ContPhy that allow LLMs to guess the correct answer from purely textual information. We will add this analysis and discuss the related work NEWTON in the revised version.
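For clarity, a minimal sketch of the scene-to-text serialization described above; the field names (`rigid`, `particles`, `position`, `euler`, `scale`) are illustrative placeholders rather than our exact export format.

```python
# Sketch: serialize subsampled simulator frames into the textual scene description
# that is prepended to the question in the NEWTON-style blind baseline.
import json

def describe_scene(frames, num_samples=8):
    description = {}
    for t, frame in enumerate(frames):
        entry = {}
        # Rigid objects: 3D location, Euler rotation angles, and local scale.
        for name, state in frame["rigid"].items():
            entry[name] = {
                "position": state["position"],
                "euler_rotation": state["euler"],
                "local_scale": state["scale"],
            }
        # Soft bodies / fluids: centroid plus a few sampled particle coordinates.
        for name, particles in frame["particles"].items():
            centroid = [sum(coord) / len(particles) for coord in zip(*particles)]
            step = max(1, len(particles) // num_samples)
            entry[name] = {"centroid": centroid, "sampled_particles": particles[::step]}
        description[f"frame_{t}"] = entry
    return json.dumps(description, indent=2)
```

The question text (and, in the rope scenario, the load/pulley link list) is then appended after this description to form the full prompt.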
<font color="#E24A0F">Maybe add the experiments of Gemini with visual information for better analysis. Also, hard to compare the performance by scanning the table. Maybe use color or calculate the average performance. </font>
**Table (A).** Performance of the blind Gemini-based LLM model on ContPhy.
| **Model: Gemini-Pro-Vision** | **(N) Question Only (Text Only)** | **(O) Question + Subsampled Frames** | **(G) NEWTON Approach (Subsampled Frame Description; Text Only)** |
|:----------------------------:|:---------------------------------:|:------------------------------------:|:-----------------------------------------------------------------:|
| **Rope P** | 34.0 | 35.5 | 50.0 |
| **Rope CO** | 44.4 | 48.2 | 47.8 |
| **Rope CQ** | 5.6 | 12.0 | 2.1 |
| **Rope GO** | 48.9 | 51.6 | 54.7 |
| **Rope GQ** | 10.3 | 10.3 | 6.9 |
| **Fluid P** | 28.0 | 10.0 | 4.0 |
| **Fluid CO** | 48.0 | 47.3 | 56.0 |
| **Fluid CQ** | 6.4 | 5.1 | 2.6 |
| **Fluid GO** | 63.3 | 44.4 | 42.0 |
| **Fluid GQ** | 11.3 | 11.3 | 5.7 |
| **Fluid PO** | 51.2 | 52.4 | 57.1 |
| **Fluid PQ** | 8.7 | 5.8 | 0.0 |
| **Cloth P** | 54.0 | 42.0 | 57.0 |
| **Cloth PO** | 56.1 | 50.1 | 49.2 |
| **Cloth PQ** | 50.0 | 43.0 | 40.5 |
| **Ball P** | 54.0 | 54.0 | 47.0 |
| **Ball CO** | 60.1 | 60.9 | 58.4 |
| **Ball CQ** | 37.0 | 29.6 | 37.0 |
| **Ball GO** | 60.1 | 54.1 | 55.2 |
| **Ball GQ** | 34.4 | 24.6 | 26.2 |
| **Ball PO** | 47.1 | 51.7 | 52.9 |
| **Ball PQ** | 17.2 | 25.9 | 25.9 |
**Q3. More Physical Baselines.**
> Has the author explored specialized models trained for physical reasoning, such as PIP (PIP: Physical Interaction Prediction via Mental Simulation with Span Selection), interpretable intuitive physics models, or PhyDNet?
Thanks for suggesting new baselines for physical reasoning. During the rebuttal period, we have implemented the suggested baselines, including <font color="#E24A0F">zf: "Need more experimental results."</font>
## Response to Reviewer **aBrg**
**Q1. About Logical Steps to Infer Answers.**
> Logically, can we know the logical steps to infer the answers, in a common human way? For example, the mean of steps to infer the physical property questions in Fig 2?
<font color="#E24A0F">zf: "Do we still have how many operation steps for each question? Need some statistics"</font>.
**Q2. About Template Question Design.**
> How do the template question design? How many humans were involved? Do the templates affect the performance a lot, especially for the prompts for MLLM?
<font color="#0099FF">Thanks for your concerns on templated questions! We design the question templates by brain-storming. About 10 people are involved to propose, implement and modify templates. As shown in <font color="#0088FF">our paper Table 3-7 and Figure 5</font>, these templated questions are proposed to test AI models' capabilities in different dimensions, including static visual attribute recognition, physical property inference, dynamic prediction, and counterfactual imagination. Here we list some linguistic statistics of the QA dataset. Considering your concern on the effect of template-based questions, we also utilize LLMs to paraphrase the questions for better diversity. Question statistics and MLLM's performances on template-based questions and LLM-paraphrased questions are compared in the <font color="#E24A0F">following tables. (Some analysis here)... </font>
| Linguistic Statistics | Template-Based Fluid | Template-Based Rope | Template-Based Cloth | Template-Based Ball | LLM Paraphrased Fluid | LLM Paraphrased Rope | LLM Paraphrased Cloth | LLM Paraphrased Ball | Typical Values in Natural Language Corpus |
|-----------------------|----------------------|---------------------|----------------------|---------------------|-----------------------|----------------------|-----------------------|----------------------|-------------------------------------------|
| Lexical Diversity: TTR | 0.0096 | 0.0096 | 0.0089 | 0.0066 | 0.052 | 0.053 | 0.068 | 0.049 | 0.02 to 0.5 |
| Lexical Diversity: Word Distr | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | N/A |
| QA Diversity: Question Type Number | 7 Types, See Figure | 8 Types, See Figure | 6 Types, See Figure | 5 Types, See Figure | 7 Types, See Figure | 8 Types, See Figure | 6 Types, See Figure | 5 Types, See Figure | N/A |
| Syntactic Diversity: Sentence Length Average / Variance | 13.1/3.9 | 13.0/6.9 | 12.2/8.5 | 15.2/10.2 | 13.6/10.7 | 13.0/11.3 | 11.7/11.4 | 15.6/19.2 | 15 to 20 |
| Readability Scores: Flesch-Kincaid Grade Level | 4.4 | 3.1 | 4.1 | 4.0 | 4.5 | 3.1 | 3.9 | 4.1 | A score of 5.0 to 8.0 is considered easily understandable by the average 5th to 8th grader. |
| **Model: Gemini-Pro-Vision** | **(O) Template-Based Question** | **(N) Template-Based Question (Text Only)** | **(D) LLM-Paraphrased Questions** | **(F) LLM-Paraphrased Questions (Text Only)** |
|:----------------------------:|:-------------------------------:|:-------------------------------------------:|:---------------------------------:|:---------------------------------------------:|
| **Rope P** | 35.5 | 34.0 | 30.5 | 32.0 |
| **Rope CO** | 48.2 | 44.4 | 43.8 | 44.0 |
| **Rope CQ** | 12.0 | 5.6 | 9.2 | 5.6 |
| **Rope GO** | 51.6 | 48.9 | 49.3 | 47.1 |
| **Rope GQ** | 10.3 | 10.3 | 6.9 | 1.7 |
| **Fluid P** | 10.0 | 28.0 | 11.0 | 22.0 |
| **Fluid CO** | 47.3 | 48.0 | 45.0 | 55.3 |
| **Fluid CQ** | 5.1 | 6.4 | 2.6 | 15.4 |
| **Fluid GO** | 44.4 | 63.3 | 43.2 | 60.4 |
| **Fluid GQ** | 11.3 | 11.3 | 5.7 | 11.3 |
| **Fluid PO** | 52.4 | 51.2 | 52.0 | 54.3 |
| **Fluid PQ** | 5.8 | 8.7 | 4.3 | 11.6 |
| **Cloth P** | 42.0 | 54.0 | 39.0 | 48.0 |
| **Cloth PO** | 50.1 | 56.1 | 50.1 | 55.7 |
| **Cloth PQ** | 43.0 | 50.0 | 41.5 | 51.0 |
| **Ball P** | 54.0 | 54.0 | 58.0 | 61.0 |
| **Ball CO** | 60.9 | 60.1 | 57.2 | 57.6 |
| **Ball CQ** | 29.6 | 37.0 | 27.2 | 28.4 |
| **Ball GO** | 54.1 | 60.1 | 53.9 | 55.6 |
| **Ball GQ** | 24.6 | 34.4 | 27.9 | 32.8 |
| **Ball PO** | 51.7 | 47.1 | 52.3 | 56.9 |
| **Ball PQ** | 25.9 | 17.2 | 27.6 | 31.0 |
</font>
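As referenced above, a minimal sketch of how a template is instantiated into a concrete question; the template string, field names, and helper structure are illustrative (modeled on the counterfactual fluid example in our paraphrasing prompt), not our exact generation code.

```python
# Sketch: fill a question template with scene attributes and build the option list.
import random

COUNTERFACTUAL_TEMPLATE = (
    "If the {removed_color} stick were removed, which stick would {fluid_color} fluid pass?"
)

def instantiate_counterfactual(scene):
    # `scene` is assumed to hold object colors and the re-simulated (counterfactual) outcome.
    question = COUNTERFACTUAL_TEMPLATE.format(
        removed_color=scene["removed_stick_color"],
        fluid_color=scene["fluid_color"],
    )
    # The correct option comes from the counterfactual simulation; distractors are
    # other sticks present in the scene.
    options = [scene["counterfactual_passed_stick"]] + scene["distractor_sticks"]
    random.shuffle(options)
    answer_index = options.index(scene["counterfactual_passed_stick"])
    return {"question": question, "options": options, "answer": answer_index}
```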
**Q3. Baselines with Multi-Modalities.**
> If inputting different modalities of the scene, e.g., multi-view, point clouds, mesh, etc, how well do the models perform?
Thanks for the suggestion to study inputs of different modalities. As shown in <font color="#E24A0F">"Need experimental results."</font>
**Q4. Paper Writing.**
>**Q4-1.** L78, fig 2 is too far away from its 2st ref.
>**Q4-2.** Lacking enough details of the model design of the ContPRO and its implementations. If possible, please add them in the suppl.
Thanks for the advice on paper revision. We will make the following revisions in the later version:
* 1). Move Fig. 2 closer to its references;
* 2). Besides the implementation details in Section **4** and Section **9.2**, we will include more details on the design and implementation of ContPRO. We will also release our source code for easy reproduction.
## Response to Reviewer **7fPw**
**Q1. Question Statistics.**
>It is unclear to me whether the question is indeed diverse enough, in the sense that no explicit statistics such as type-token ratio, word distributions, and other relevant quantities were clearly reported. For the first point of weakness, I would actually suggest doing a LLM paraphrasing first and then see if that complicates the QA sets, with the statistics mentioned of course.
<font color="#0099FF">Thanks for the advice on more statistics on the synethsized questions. Based on your advice, we have reported the statistics of the current question version and its paraphrased version. Besides recommended lexical diversity metrics such as TTR and word distribution, we also reported syntactic diversity (sentences length mean and variance), question type diversity (question type number), and readability scores (Flesch-Kincaid Grade Level) for reference. Statistics can be checked here ___.
Thanks for the inspirational advice on paraphrasing questions. We have utilized Gemini-Pro to reword the questions. The prompt is as follows.
"I am looking for assistance in paraphrasing this question. "
"My primary goal is to ensure that the essence and meaning of the question, along with the content "
"of each option, remain unchanged. It is crucial that the sequence of the options is preserved so "
"that the correct answer corresponds directly with the original question. Below, I will provide the "
"question with its options. Please rephrase it as diversely as possible, maintaining strict adherence "
"to their original meaning. Make question readable and understandable for common people as well. "
"Please only return paraphrased question (with its paraphrased options if it has). Do not add any other text. "
"Please keep the color name and the object name unchanged. "
"Please do not change the word \"elastic\"/\"plastic\" or \"elasticity\"/\"plasticity\". "
"If the object name has \"the other\" description, let this description stay unchanged. "
"If you think the option is too hard to rephrase, you can keep it unchanged. "
"Also, keep the option format unchanged."
"For example, if I give you the following question:\n\n"
"If the gray stick were removed, which stick would orange fluid pass?\nA. Pink stick\nB. Brown stick\nC. Cyan stick\n\n"
"You may response:\n\n"
"If the gray stick were not there, which stick would orange liquid flow through?\nA. Pink stick\nB. Cyan stick\nC. Cyan stick\n\n"
"PLEASE STRICTLY FOLLOW above response format. Otherwise we could not use program to process your response. "
"OK, here is the original question you will paraphrase.\n\n")
f"{question list to paraphrase}\n\n"
"Thank you for your assistance!"
We instruct the LLM to reword the given questions as diversely as possible while keeping the original meaning strictly unchanged and the content readable for non-experts. We provide some generated examples here: 1) "Is the density of light blue fluid equal to that of green fluid?" --> "Are the light blue liquid and green liquid just as heavy?" 2) "Which phrase below can best describe the final pose of the green plate? A. Standing upright. B. Leaning. C. Lying horizontally." --> "What does the final position of the green plate best resemble? A. Standing straight up. B. Tilted. C. Lying flat." 3) "Will the orange ball finally drop into the left pit? A. Yes B. No C. Can not answer" --> "Is the orange ball expected to fall into the pit on the left? A. Yes B. No C. Can not answer"
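A minimal sketch of the paraphrasing loop is shown below; it assumes the `google-generativeai` Python SDK with the `gemini-pro` model, and the newline-count check is a simplified stand-in for our actual response-format validation.

```python
# Sketch: paraphrase one templated question with Gemini and fall back to the
# original if the response breaks the expected "question\nA. ...\nB. ..." format.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro")

def paraphrase(instruction, question_block):
    response = model.generate_content(instruction + question_block + "\n\n")
    text = response.text.strip()
    if text.count("\n") != question_block.count("\n"):
        return question_block  # keep options aligned with the original answer key
    return text
```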
The statistics for both paraphrased and template-based questions are listed in the table.
| Linguistic Statistics | Template-Based Fluid | Template-Based Rope | Template-Based Cloth | Template-Based Ball | LLM Paraphrased Fluid | LLM Paraphrased Rope | LLM Paraphrased Cloth | LLM Paraphrased Ball | Typical Values in Natural Language Corpus |
|-----------------------|----------------------|---------------------|----------------------|---------------------|-----------------------|----------------------|-----------------------|----------------------|-------------------------------------------|
| Lexical Diversity: TTR | 0.0096 | 0.0096 | 0.0089 | 0.0066 | 0.052 | 0.053 | 0.068 | 0.049 | 0.02 to 0.5 |
| Lexical Diversity: Word Distr | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | N/A |
| QA Diversity: Question Type Number | 7 Types, See Figure | 8 Types, See Figure | 6 Types, See Figure | 5 Types, See Figure | 7 Types, See Figure | 8 Types, See Figure | 6 Types, See Figure | 5 Types, See Figure | N/A |
| Syntactic Diversity: Sentence Length Average / Variance | 13.1/3.9 | 13.0/6.9 | 12.2/8.5 | 15.2/10.2 | 13.6/10.7 | 13.0/11.3 | 11.7/11.4 | 15.6/19.2 | 15 to 20 |
| Readability Scores: Flesch-Kincaid Grade Level | 4.4 | 3.1 | 4.1 | 4.0 | 4.5 | 3.1 | 3.9 | 4.1 | A score of 5.0 to 8.0 is considered easily understandable by the average 5th to 8th grader. |
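As mentioned above, here is a minimal sketch of how the reported metrics can be computed. This is our own illustrative re-implementation (each question is treated as one sentence, and syllables are counted with a simple vowel-group heuristic), not the exact evaluation script.

```python
# Sketch: TTR, sentence-length mean/variance, and Flesch-Kincaid grade level.
import re
import statistics

def count_syllables(word):
    # Heuristic: count vowel groups; every word has at least one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def linguistic_stats(questions):
    all_words = [w.lower() for q in questions for w in re.findall(r"[A-Za-z']+", q)]
    lengths = [len(re.findall(r"[A-Za-z']+", q)) for q in questions]
    words_per_sentence = statistics.mean(lengths)
    syllables_per_word = statistics.mean(count_syllables(w) for w in all_words)
    return {
        "TTR": len(set(all_words)) / len(all_words),
        "sentence_len_mean": words_per_sentence,
        "sentence_len_var": statistics.variance(lengths),
        # Flesch-Kincaid grade level formula.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word - 15.59,
    }
```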
We tested the MLLM's performance on template-based and LLM-paraphrased questions as well. Results are listed here.
| **Model: Gemini-Pro-Vision** | **(O) Template-Based Question** | **(N) Template-Based Question (Text Only)** | **(D) LLM-Paraphrased Questions** | **(F) LLM-Paraphrased Questions (Text Only)** |
|:----------------------------:|:-------------------------------:|:-------------------------------------------:|:---------------------------------:|:---------------------------------------------:|
| **Rope P** | 35.5 | 34.0 | 30.5 | 32.0 |
| **Rope CO** | 48.2 | 44.4 | 43.8 | 44.0 |
| **Rope CQ** | 12.0 | 5.6 | 9.2 | 5.6 |
| **Rope GO** | 51.6 | 48.9 | 49.3 | 47.1 |
| **Rope GQ** | 10.3 | 10.3 | 6.9 | 1.7 |
| **Fluid P** | 10.0 | 28.0 | 11.0 | 22.0 |
| **Fluid CO** | 47.3 | 48.0 | 45.0 | 55.3 |
| **Fluid CQ** | 5.1 | 6.4 | 2.6 | 15.4 |
| **Fluid GO** | 44.4 | 63.3 | 43.2 | 60.4 |
| **Fluid GQ** | 11.3 | 11.3 | 5.7 | 11.3 |
| **Fluid PO** | 52.4 | 51.2 | 52.0 | 54.3 |
| **Fluid PQ** | 5.8 | 8.7 | 4.3 | 11.6 |
| **Cloth P** | 42.0 | 54.0 | 39.0 | 48.0 |
| **Cloth PO** | 50.1 | 56.1 | 50.1 | 55.7 |
| **Cloth PQ** | 43.0 | 50.0 | 41.5 | 51.0 |
| **Ball P** | 54.0 | 54.0 | 58.0 | 61.0 |
| **Ball CO** | 60.9 | 60.1 | 57.2 | 57.6 |
| **Ball CQ** | 29.6 | 37.0 | 27.2 | 28.4 |
| **Ball GO** | 54.1 | 60.1 | 53.9 | 55.6 |
| **Ball GQ** | 24.6 | 34.4 | 27.9 | 32.8 |
| **Ball PO** | 51.7 | 47.1 | 52.3 | 56.9 |
| **Ball PQ** | 25.9 | 17.2 | 27.6 | 31.0 |
</font>
**Q2. More Implementation Details.**
>For the baseline models, why not consider a few recent transformer-based video-QA models that can be finetuned on your dataset to complement the zero-shot large models such as GPT-4v, such as [1] and [2]?
[1] Fu, Tsu-Jui, et al. "Violet: End-to-end video-language transformers with masked visual-token modeling." arXiv preprint 2021.
[2] Sung, Yi-Lin, Jaemin Cho, and Mohit Bansal. "Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks." CVPR 2022.
Thanks for the suggestion to add more finetuned baselines. <font color="#E24A0F">"More experimental analysis by Xin Yan. Maybe we can add another baseline like BLIP-2 or something easier."</font>
**Q3. MLLMs Prompting Details**
> It is unclear how GPT-4v and Gemini are prompted, i.e., did you use an in-context examples, what are the subsampling rates of the videos, and also what would be the instructions/guidelines to these large models?
<font color="#0099FF"> Grateful for pointing out some unspecified prompting details! For original in-paper experiments we only combine general instruction, questions, and subsampled frames in a packed prompt. The original general instruction and per-question instructions are listed in our paper Table 8. We did not include any scenario-specific guidelines or any in-context QA examples in prompt. During rebuttal, we carefully considered your opinion and experimented with our self-designed scenario-specific guidelines, in-context examples, as well as elaborate human explanations for example questions. Besides, we uniformly subsampled 11 frames and resize images from 1920x1080 to 480x270 in initial experiments. During rebuttal stage, We tested full-size(1920x1080) images and raised subsampling frame number to 16 frames, which is the acceptable upper limit of Gemini-Pro-Vision's visual input. The detailed results are listed below.
| **Model: Gemini-Pro-Vision** | **(O) Question Only** | **(A) Scenario-Specific Guideline** | **(B) In-Context Examples** | **\(C) In-Context Examples + Human Explanations** | **(E) Upsampled Video (11→16 Frames, Raised Resolution)** |
|:----------------------------:|:---------------------:|:-----------------------------------:|:---------------------------:|:------------------------------------------------:|:---------------------------------------------------------:|
| **Rope P** | 35.5 | 33.5 | 34.5 | 39.0 | 34.0 |
| **Rope CO** | 48.2 | 46.6 | 51.2 | 53.4 | 46.6 |
| **Rope CQ** | 12.0 | 14.8 | 13.4 | 12.0 | 11.3 |
| **Rope GO** | 51.6 | 54.7 | 56.1 | 57.4 | 48.9 |
| **Rope GQ** | 10.3 | 20.7 | 12.1 | 19.0 | 8.6 |
| **Fluid P** | 10.0 | 22.0 | 24.0 | 21.0 | 19.0 |
| **Fluid CO** | 47.3 | 45.7 | 48.3 | 46.0 | 46.3 |
| **Fluid CQ** | 5.1 | 2.6 | 5.1 | 2.6 | 0.0 |
| **Fluid GO** | 44.4 | 40.8 | 42.6 | 36.7 | 40.8 |
| **Fluid GQ** | 11.3 | 5.7 | 7.5 | 3.8 | 5.7 |
| **Fluid PO** | 52.4 | 53.1 | 51.6 | 48.8 | 49.2 |
| **Fluid PQ** | 5.8 | 5.8 | 5.8 | 2.9 | 4.3 |
| **Cloth P** | 42.0 | 46.0 | 54.0 | 54.0 | 47.0 |
| **Cloth PO** | 50.1 | 50.1 | 45.9 | 50.3 | 47.7 |
| **Cloth PQ** | 43.0 | 43.0 | 37.0 | 42.0 | 38.5 |
| **Ball P** | 54.0 | 52.0 | 53.0 | 46.0 | 56.0 |
| **Ball CO** | 60.9 | 56.4 | 47.3 | 43.2 | 60.5 |
| **Ball CQ** | 29.6 | 28.4 | 13.6 | 7.4 | 34.6 |
| **Ball GO** | 54.1 | 57.9 | 55.2 | 57.4 | 54.1 |
| **Ball GQ** | 24.6 | 31.1 | 6.6 | 11.5 | 23.0 |
| **Ball PO** | 51.7 | 52.9 | 51.1 | 43.7 | 50.6 |
| **Ball PQ** | 25.9 | 27.6 | 20.7 | 15.5 | 25.9 |
</font>
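For reference, a minimal sketch of the frame preprocessing described above, using OpenCV; the function name and file layout are illustrative.

```python
# Sketch: uniformly subsample N frames from a video and downscale them before
# packing them into the MLLM prompt.
import cv2
import numpy as np

def subsample_frames(video_path, num_frames=11, size=(480, 270)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return frames

# num_frames=16 and size=(1920, 1080) correspond to configuration (E) in the table above.
```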
**More Related Work.**
>Another video based QA work that talks about counterfactual reasoning is [3]. While the work is not directly discussing physical properties at the granularity of this work and it serves as a more general event-rich video QA work, it is still quite relevant (its physical dimension) to the direction of this work. Consider citing and discussing it.
[3] Wu, Te-Lin, et al. "ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos." EMNLP 2023.
Thanks for the advice on the related work about counterfactual reasoning [3]. We will cite and discuss it in the later version.
## Response to Reviewer **xL5z**
**Q1. More Details on Particle-Based Dynamics Learner**
> I did not quite understand how the Particle-Based Dynamics Learner works. How exactly are MPM and DPI applied? Please explain in more detail their working principles within the model.
**Q2. Confusion on Fig2.**
> Figure 2's Example A from (a) to (f) can easily cause confusion. While other examples describe the same object, Example A's first three images do not describe the same object as the last three images, which I think might confuse others.
Thanks for pointing out the potential confusion in Example A of Fig. 2. We will update the figure and its caption to make it clearer and avoid confusion.
**Q3. Adding links to the supplementary section.**
> Please add related links in the supplementary material section. For instance, in section 4 on Program Execution, it is mentioned, "We provide more model details in the supplementary material. Please link specifically to section 9.2"
We will revise the paper accordingly and add reference links to the sections in the supplementary for easy reading.
**Q4. Gathering Explanations from humans.**
> It is suggested that the authors consider gathering explanations from human subjects. This approach would enable them to collect data on the theories that humans naturally employ and assess whether providing suggestions, prompts, or guidance to utilize the appropriate theory enhances performance. Such an approach could significantly contribute to rationalizing expectations and informing the design of machine systems of this nature.
<font color="#E24A0F">"More experiments need to be added."</font>
<font color="#0099FF">
| **Model: Gemini-Pro-Vision** | **(O) Question Only** | **(B) In-Context Examples** | **\(C) In-Context Examples + Human Explanations** |
|:----------------------------:|:---------------------:|:---------------------------:|:-------------------------------------------------:|
| **Rope P** | 35.5 | 34.5 | 39.0 |
| **Rope CO** | 48.2 | 51.2 | 53.4 |
| **Rope CQ** | 12.0 | 13.4 | 12.0 |
| **Rope GO** | 51.6 | 56.1 | 57.4 |
| **Rope GQ** | 10.3 | 12.1 | 19.0 |
| **Fluid P** | 10.0 | 24.0 | 21.0 |
| **Fluid CO** | 47.3 | 48.3 | 46.0 |
| **Fluid CQ** | 5.1 | 5.1 | 2.6 |
| **Fluid GO** | 44.4 | 42.6 | 36.7 |
| **Fluid GQ** | 11.3 | 7.5 | 3.8 |
| **Fluid PO** | 52.4 | 51.6 | 48.8 |
| **Fluid PQ** | 5.8 | 5.8 | 2.9 |
| **Cloth P** | 42.0 | 54.0 | 54.0 |
| **Cloth PO** | 50.1 | 45.9 | 50.3 |
| **Cloth PQ** | 43.0 | 37.0 | 42.0 |
| **Ball P** | 54.0 | 53.0 | 46.0 |
| **Ball CO** | 60.9 | 47.3 | 43.2 |
| **Ball CQ** | 29.6 | 13.6 | 7.4 |
| **Ball GO** | 54.1 | 55.2 | 57.4 |
| **Ball GQ** | 24.6 | 6.6 | 11.5 |
| **Ball PO** | 51.7 | 51.1 | 43.7 |
| **Ball PQ** | 25.9 | 20.7 | 15.5 |
</font>
**Q5. Sim2real Transfer**
> It is unclear if predictions from a 3d simulated model for this task will generalize to the real world. It depends on the quality of the renders and the physics simulation of the 3d engine.
**Q6. Performance on model's ability to infer physical properties.**
> The authors have raised doubts about the existing models' ability to infer physical properties on a continuum. However, there are no experiments to compare and demonstrate their actual performance.
<font color="#E24A0F">"We need some numbers for experiments."</font>
## Experimental Plan and Assignment.
**E1**. Model based on Newton's paper (https://arxiv.org/pdf/2310.07018.pdf)
**E2**. The focus on general-purpose visual and language representations in most visual models raises the question: has the author explored specialized models trained for physical reasoning, such as PIP (PIP: Physical Interaction Prediction via Mental Simulation with Span Selection), interpretable intuitive physics models, or PhyDNet? (Yanxin)
**E3**. How do the template question design? How many humans were involved? Do the templates affect the performance a lot, especially for the prompts for MLLM? (Zhicheng)[Done]
**E4**. If inputting different modalities of the scene, e.g., multi-view, point clouds, mesh, etc, how well do the models perform? (Yanxin) [Done 1]
**E5**. For the first point of weakness, I would actually suggest doing a LLM paraphrasing first and then see if that complicates the QA sets, with the statistics mentioned of course. (Zhicheng) [Done]
**E6**. For the baseline models, why not consider a few recent transformer-based video-QA models that can be finetuned on your dataset to complement the zero-shot large models such as GPT-4v, such as [1] and [2]? (Yanxin)
**E7**. It is unclear how GPT-4v and Gemini are prompted, i.e., did you use an in-context examples, what are the subsampling rates of the videos, and also what would be the instructions/guidelines to these large models? (Zhicheng) [Done]
**E8**. It is suggested that the authors consider gathering explanations from human subjects. This approach would enable them to collect data on the theories that humans naturally employ and assess whether providing suggestions, prompts, or guidance to utilize the appropriate theory enhances performance. Such an approach could significantly contribute to rationalizing expectations and informing the design of machine systems of this nature. (Zhicheng) [Done]