# (Backup) ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

[toc]

---

## General Response to All

**G1. Contribution Recognition.** We sincerely thank the reviewers and ACs for their time and effort in reviewing the paper. We are glad that the reviewers recognized the following contributions.

* **Task.** The task is **novel**. *"The paper introduces a novel benchmark encompassing various scenarios using a 3D physics simulator and a well-structured question set to probe AI models to understand physical properties and dynamics."* (**UVKV**) *"The dataset based on Unity can be a good data source for many tasks, and so do the questions."* (**aBrg**) *"The proposed dataset, ContPhy, is quite interesting and dives reasonably deep into the realm of uncovering the fine-grained physical properties of video captured objects."* (**7fPw**) *"a novel benchmark to assessing machine physical commonsense by encompassing the inference of diverse physical properties."* (**xL5z**)
* **Experiments.** **Comprehensive** experiments are conducted. *"The paper also performs a comprehensive set of experiments with traditional visual models, VLMs and also with humans."* (**UVKV**) *"The experiments contain efforts of many methods and MLLMs."* (**aBrg**)
* **Model.** The proposed ContPRO model is **effective**. *"It also show that ContPRO outperforms humans in some tasks and outperforms other approach in most tasks."* (**UVKV**) *"The design of the oracle model, ContPro, is comprehensive, and seems perform well."* (**7fPw**)

**G2. Experiments During Rebuttal.** To address the reviewers' questions, we conducted the following experiments to support our claims and demonstrate ContPhy's value.

## Response to Reviewer **UVKV**

**Q1. Results about Baselines.**

> The QA dataset outputs two or three answers, but some results fall below the random baseline. What could explain this?

**About the Option Number of Each Question.** In Fig. 2, we only show examples with two or three options due to page limitations. In fact, the average number of options per multiple-choice question in our dataset is <font color="#E24A0F">**</font>.

**About Baseline Performance.** We thank the reviewer for raising the concern that some baseline models fall below the random baseline on some metrics. For example, **C-LSTM** performs worse than the blind random baseline (**RND**) on predictive questions per option (**P-Opt.**) and goal-driven questions per option (**G-Opt.**). We believe the reason is that models like **C-LSTM**, which were originally designed for static vision-language tasks, have difficulty understanding the dynamics and physical common sense required in the predictive and goal-driven scenarios; thus they only achieve performance comparable to the **RND** baseline. This highlights the challenge our dataset poses for traditional visual question answering models.

<font color="#E24A0F">1. Do we have more than 3 options for each question type? 2. Get the statistics</font>
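A minimal sketch of how the missing statistic could be computed from the QA annotations; the file name `questions.json` and the `options` field are assumed placeholders, not the actual ContPhy annotation schema.

```python
import json
from collections import Counter

# Hypothetical annotation file; the real ContPhy QA schema may differ.
with open("questions.json") as f:
    questions = json.load(f)

# Assume each multiple-choice record stores its candidate answers under an
# "options" list; open-ended questions without options are skipped.
option_counts = [len(q["options"]) for q in questions if q.get("options")]

print("multiple-choice questions:", len(option_counts))
print("average options per question:",
      round(sum(option_counts) / len(option_counts), 2))
print("option-count distribution:", Counter(option_counts))
```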
**Q2. New Baseline NEWTON for ContPhy.**

> Has the author considered applying this to LLMs, akin to the approach in NEWTON's study on physical reasoning in language models?

We thank the reviewer for suggesting a blind evaluation of LLMs similar to the approach in NEWTON. For a thorough evaluation, we have added a new baseline that first transforms the visual input into a text description of the scene objects and their dynamics, and then feeds the description to the LLM. Specifically, we insert the following information into the prompt: 1) the 3D location coordinates, Euler rotation angles, and local scales of each rigid object, and 2) the locations of each soft body's or fluid's centroid and of some sampled particles, over <font color="#E24A0F">add frame number</font> subsampled sequential frames. Additionally, in the rope scenario, we also provide the list of loads and pulleys linked to each rope. The questions are appended after the text description of the scene. We feed the prompt into **Gemini Pro Vision**; we choose Gemini rather than GPT-4V because Gemini provides free API access for extensive experimental analysis. The results are shown in **Table (A)** below. We do not observe a significant increase in performance compared with the question-only blind model in Table 2 of the main paper. This indicates that there are no shortcuts in ContPhy that allow LLMs to guess the correct answer from purely textual information. We will add this analysis and discuss the related work NEWTON in the revised version. <font color="#E24A0F">Maybe add the experiments of Gemini with visual information for better analysis. Also, hard to compare the performance by scanning the table. Maybe use color or calculate the average performance.</font>
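To make the construction concrete, below is a minimal sketch of how such a scene description and prompt could be assembled; the state keys (`position`, `rotation`, `scale`, `centroid`, `particles`) and the data layout are illustrative assumptions, not the exact format used in our pipeline.

```python
import json

def describe_scene(rigid_objects, soft_bodies, frame_ids, rope_links=None):
    """Serialize per-frame object states into a textual scene description.

    `rigid_objects` and `soft_bodies` are assumed to map object names to
    per-frame state dicts; the keys below are illustrative placeholders.
    """
    lines = []
    for frame in frame_ids:
        lines.append(f"Frame {frame}:")
        for name, states in rigid_objects.items():
            s = states[frame]
            lines.append(
                f"  {name}: position={s['position']}, "
                f"euler_rotation={s['rotation']}, scale={s['scale']}"
            )
        for name, states in soft_bodies.items():
            s = states[frame]
            lines.append(
                f"  {name}: centroid={s['centroid']}, "
                f"sampled_particles={s['particles']}"
            )
    if rope_links:  # rope scenario only: loads/pulleys attached to each rope
        lines.append("Rope connections: " + json.dumps(rope_links))
    return "\n".join(lines)

def build_prompt(scene_text, question, options):
    """Append the question and lettered options after the scene description."""
    option_text = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"{scene_text}\n\nQuestion: {question}\n{option_text}\nAnswer:"
```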
**Table (A).** Performance of the blind Gemini-based LLM baselines on ContPhy.

| **Model: Gemini-Pro-Vision** | **(N) Question Only (Text Only)** | **(O) Question + Subsampled Frames** | **(G) NEWTON Approach (Subsampled Frame Description; Text Only)** |
|:----------------------------:|:---------------------------------:|:------------------------------------:|:-----------------------------------------------------------------:|
| **Rope P** | 34.0 | 35.5 | 50.0 |
| **Rope CO** | 44.4 | 48.2 | 47.8 |
| **Rope CQ** | 5.6 | 12.0 | 2.1 |
| **Rope GO** | 48.9 | 51.6 | 54.7 |
| **Rope GQ** | 10.3 | 10.3 | 6.9 |
| **Fluid P** | 28.0 | 10.0 | 4.0 |
| **Fluid CO** | 48.0 | 47.3 | 56.0 |
| **Fluid CQ** | 6.4 | 5.1 | 2.6 |
| **Fluid GO** | 63.3 | 44.4 | 42.0 |
| **Fluid GQ** | 11.3 | 11.3 | 5.7 |
| **Fluid PO** | 51.2 | 52.4 | 57.1 |
| **Fluid PQ** | 8.7 | 5.8 | 0.0 |
| **Cloth P** | 54.0 | 42.0 | 57.0 |
| **Cloth PO** | 56.1 | 50.1 | 49.2 |
| **Cloth PQ** | 50.0 | 43.0 | 40.5 |
| **Ball P** | 54.0 | 54.0 | 47.0 |
| **Ball CO** | 60.1 | 60.9 | 58.4 |
| **Ball CQ** | 37.0 | 29.6 | 37.0 |
| **Ball GO** | 60.1 | 54.1 | 55.2 |
| **Ball GQ** | 34.4 | 24.6 | 26.2 |
| **Ball PO** | 47.1 | 51.7 | 52.9 |
| **Ball PQ** | 17.2 | 25.9 | 25.9 |
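To ease comparison across the three settings (per the internal note above about averaging the performance), a short sketch that computes the per-column mean of Table (A); the values are copied verbatim from the table.

```python
from statistics import mean

# Values copied from Table (A); columns are
# N = question only (text), O = question + subsampled frames,
# G = NEWTON-style frame description (text only).
table_a = {
    "Rope P": (34.0, 35.5, 50.0),  "Rope CO": (44.4, 48.2, 47.8),
    "Rope CQ": (5.6, 12.0, 2.1),   "Rope GO": (48.9, 51.6, 54.7),
    "Rope GQ": (10.3, 10.3, 6.9),  "Fluid P": (28.0, 10.0, 4.0),
    "Fluid CO": (48.0, 47.3, 56.0), "Fluid CQ": (6.4, 5.1, 2.6),
    "Fluid GO": (63.3, 44.4, 42.0), "Fluid GQ": (11.3, 11.3, 5.7),
    "Fluid PO": (51.2, 52.4, 57.1), "Fluid PQ": (8.7, 5.8, 0.0),
    "Cloth P": (54.0, 42.0, 57.0),  "Cloth PO": (56.1, 50.1, 49.2),
    "Cloth PQ": (50.0, 43.0, 40.5), "Ball P": (54.0, 54.0, 47.0),
    "Ball CO": (60.1, 60.9, 58.4),  "Ball CQ": (37.0, 29.6, 37.0),
    "Ball GO": (60.1, 54.1, 55.2),  "Ball GQ": (34.4, 24.6, 26.2),
    "Ball PO": (47.1, 51.7, 52.9),  "Ball PQ": (17.2, 25.9, 25.9),
}

# zip(*values) regroups the per-question tuples into per-column sequences.
for name, column in zip(("N", "O", "G"), zip(*table_a.values())):
    print(f"column {name}: mean accuracy = {mean(column):.1f}")
```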
**Q3. More Physical Baselines.**

> Has the author explored specialized models trained for physical reasoning, such as PIP (PIP: Physical Interaction Prediction via Mental Simulation with Span Selection), interpretable intuitive physics models, or PhyDNet?

Thank you for suggesting new baselines for physical reasoning. During this rebuttal, we have implemented all the suggested baselines, including <font color="#E24A0F">zf: "Need more experimental results."</font>

## Response to Reviewer **aBrg**

**Q1. About Logical Steps to Infer Answers.**

> Logically, can we know the logical steps to infer the answers, in a common human way? For example, the mean of steps to infer the physical property questions in Fig 2?

<font color="#E24A0F">zf: "Do we still have how many operation steps for each question? Need some statistics"</font>

**Q2. About Template Question Design.**

> How do the template question design? How many humans were involved? Do the templates affect the performance a lot, especially for the prompts for MLLM?

<font color="#0099FF">Thank you for your questions about the templated questions! We designed the question templates through brainstorming; about 10 people were involved in proposing, implementing, and refining them. As shown in <font color="#0088FF">Tables 3-7 and Figure 5 of our paper</font>, these templated questions test AI models' capabilities along different dimensions, including static visual attribute recognition, physical property inference, dynamic prediction, and counterfactual imagination. Below we list some linguistic statistics of the QA dataset. Considering your concern about the effect of template-based questions, we also use an LLM to paraphrase the questions for greater diversity. Question statistics and the MLLM's performance on template-based and LLM-paraphrased questions are compared in the following tables. <font color="#E24A0F">(Some analysis here)...</font>

| Linguistic Statistics | Template-Based Fluid | Template-Based Rope | Template-Based Cloth | Template-Based Ball | LLM-Paraphrased Fluid | LLM-Paraphrased Rope | LLM-Paraphrased Cloth | LLM-Paraphrased Ball | Typical Values in Natural Language Corpus |
|-----------------------|----------------------|---------------------|----------------------|---------------------|-----------------------|----------------------|-----------------------|----------------------|-------------------------------------------|
| Lexical Diversity: TTR | 0.0096 | 0.0096 | 0.0089 | 0.0066 | 0.052 | 0.053 | 0.068 | 0.049 | 0.02 to 0.5 |
| Lexical Diversity: Word Distribution | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | N/A |
| QA Diversity: Number of Question Types | 7 Types, See Figure | 8 Types, See Figure | 6 Types, See Figure | 5 Types, See Figure | 7 Types, See Figure | 8 Types, See Figure | 6 Types, See Figure | 5 Types, See Figure | N/A |
| Syntactic Diversity: Sentence Length Mean / Variance | 13.1 / 3.9 | 13.0 / 6.9 | 12.2 / 8.5 | 15.2 / 10.2 | 13.6 / 10.7 | 13.0 / 11.3 | 11.7 / 11.4 | 15.6 / 19.2 | 15 to 20 |
| Readability: Flesch-Kincaid Grade Level | 4.4 | 3.1 | 4.1 | 4.0 | 4.5 | 3.1 | 3.9 | 4.1 | A score of 5.0 to 8.0 is considered easily understandable by the average 5th to 8th grader. |

| **Model: Gemini-Pro-Vision** | **(O) Template-Based Question** | **(N) Template-Based Question (Text Only)** | **(D) LLM-Paraphrased Questions** | **(F) LLM-Paraphrased Questions (Text Only)** |
|:----------------------------:|:-------------------------------:|:-------------------------------------------:|:---------------------------------:|:---------------------------------------------:|
| **Rope P** | 35.5 | 34.0 | 30.5 | 32.0 |
| **Rope CO** | 48.2 | 44.4 | 43.8 | 44.0 |
| **Rope CQ** | 12.0 | 5.6 | 9.2 | 5.6 |
| **Rope GO** | 51.6 | 48.9 | 49.3 | 47.1 |
| **Rope GQ** | 10.3 | 10.3 | 6.9 | 1.7 |
| **Fluid P** | 10.0 | 28.0 | 11.0 | 22.0 |
| **Fluid CO** | 47.3 | 48.0 | 45.0 | 55.3 |
| **Fluid CQ** | 5.1 | 6.4 | 2.6 | 15.4 |
| **Fluid GO** | 44.4 | 63.3 | 43.2 | 60.4 |
| **Fluid GQ** | 11.3 | 11.3 | 5.7 | 11.3 |
| **Fluid PO** | 52.4 | 51.2 | 52.0 | 54.3 |
| **Fluid PQ** | 5.8 | 8.7 | 4.3 | 11.6 |
| **Cloth P** | 42.0 | 54.0 | 39.0 | 48.0 |
| **Cloth PO** | 50.1 | 56.1 | 50.1 | 55.7 |
| **Cloth PQ** | 43.0 | 50.0 | 41.5 | 51.0 |
| **Ball P** | 54.0 | 54.0 | 58.0 | 61.0 |
| **Ball CO** | 60.9 | 60.1 | 57.2 | 57.6 |
| **Ball CQ** | 29.6 | 37.0 | 27.2 | 28.4 |
| **Ball GO** | 54.1 | 60.1 | 53.9 | 55.6 |
| **Ball GQ** | 24.6 | 34.4 | 27.9 | 32.8 |
| **Ball PO** | 51.7 | 47.1 | 52.3 | 56.9 |
| **Ball PQ** | 25.9 | 17.2 | 27.6 | 31.0 |

</font>
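For reference, a rough sketch of how the lexical and readability statistics above could be reproduced from a list of question strings; it assumes the `textstat` package for the Flesch-Kincaid grade and uses a simple regex tokenizer, both of which are approximations rather than the exact tooling we used.

```python
import re
from statistics import mean, pvariance

import textstat  # assumed available: pip install textstat

def question_statistics(questions):
    """Compute TTR, sentence-length statistics, and the Flesch-Kincaid
    grade level for a list of question strings."""
    tokens = [t.lower() for q in questions for t in re.findall(r"[A-Za-z']+", q)]
    ttr = len(set(tokens)) / len(tokens)  # type-token ratio over the whole set

    lengths = [len(re.findall(r"[A-Za-z']+", q)) for q in questions]
    fk_grade = textstat.flesch_kincaid_grade(" ".join(questions))

    return {
        "TTR": round(ttr, 4),
        "sentence_length_mean": round(mean(lengths), 1),
        "sentence_length_variance": round(pvariance(lengths), 1),
        "flesch_kincaid_grade": round(fk_grade, 1),
    }

# Example usage with two toy questions (not drawn from the dataset):
print(question_statistics([
    "Will the orange ball finally drop into the left pit?",
    "Is the density of light blue fluid equal to that of green fluid?",
]))
```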
**Q3. Baselines with Multi-Modalities.**

> If inputting different modalities of the scene, e.g., multi-view, point clouds, mesh, etc, how well do the models perform?

Thank you for the suggestion to study inputs of different modalities. As shown in <font color="#E24A0F">"Need experimental results."</font>

**Q4. Paper Writing.**

> **Q4-1.** L78, fig 2 is too far away from its 2st ref.
> **Q4-2.** Lacking enough details of the model design of the ContPRO and its implementations. If possible, please add them in the suppl.

Thank you for the advice on the paper's presentation. We will make the following revisions in the next version:

1. Move Fig. 2 closer to its references.
2. In addition to the implementation details in Section **4** and Section **9.2**, include more details on the design and implementation of ContPRO.

We will also release our source code for easy reproduction.

## Response to Reviewer **7fPw**

**Q1. Question Statistics.**

> It is unclear to me whether the question is indeed diverse enough, in the sense that no explicit statistics such as type-token ratio, word distributions, and other relevant quantities were clearly reported. For the first point of weakness, I would actually suggest doing a LLM paraphrasing first and then see if that complicates the QA sets, with the statistics mentioned of course.

<font color="#0099FF">Thank you for the advice on reporting more statistics for the synthesized questions. Following your advice, we report statistics for both the current question set and its paraphrased version. Besides the recommended lexical diversity metrics such as TTR and word distribution, we also report syntactic diversity (sentence length mean and variance), question-type diversity (number of question types), and readability scores (Flesch-Kincaid Grade Level) for reference. The statistics can be checked here ___.

Thank you also for the inspiring suggestion to paraphrase the questions. We have used Gemini-Pro to reword the questions with the following prompt:

> I am looking for assistance in paraphrasing this question. My primary goal is to ensure that the essence and meaning of the question, along with the content of each option, remain unchanged. It is crucial that the sequence of the options is preserved so that the correct answer corresponds directly with the original question. Below, I will provide the question with its options. Please rephrase it as diversely as possible, maintaining strict adherence to their original meaning. Make question readable and understandable for common people as well. Please only return paraphrased question (with its paraphrased options if it has). Do not add any other text. Please keep the color name and the object name unchanged. Please do not change the word "elastic"/"plastic" or "elasticity"/"plasticity". If the object name has "the other" description, let this description stay unchanged. If you think the option is too hard to rephrase, you can keep it unchanged. Also, keep the option format unchanged. For example, if I give you the following question:
>
> If the gray stick were removed, which stick would orange fluid pass?
> A. Pink stick
> B. Brown stick
> C. Cyan stick
>
> You may response:
>
> If the gray stick were not there, which stick would orange liquid flow through?
> A. Pink stick
> B. Cyan stick
> C. Cyan stick
>
> PLEASE STRICTLY FOLLOW above response format. Otherwise we could not use program to process your response. OK, here is the original question you will paraphrase.
>
> {question list to paraphrase}
>
> Thank you for your assistance!
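A minimal sketch of how this paraphrasing prompt could be wrapped in an API call with a sanity check that the option labels are preserved; `call_gemini` is a placeholder for the actual Gemini-Pro client, and the parsing assumes the "A./B./C." option format requested above.

```python
import re

PARAPHRASE_PROMPT = "..."  # the instruction text quoted above

def call_gemini(prompt: str) -> str:
    """Placeholder for the actual Gemini-Pro API call."""
    raise NotImplementedError

def paraphrase(question_block: str) -> str:
    """Paraphrase one question (with its options) and verify that the
    paraphrase keeps the same option labels in the same order."""
    response = call_gemini(
        f"{PARAPHRASE_PROMPT}\n\n{question_block}\n\nThank you for your assistance!"
    )

    original_labels = re.findall(r"^[A-Z]\.", question_block, flags=re.M)
    new_labels = re.findall(r"^[A-Z]\.", response, flags=re.M)
    if original_labels != new_labels:
        # Fall back to the template question if the option labels changed,
        # so the correct-answer index still matches the original annotation.
        return question_block
    return response.strip()
```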
We instruct the LLM to reword the given questions as diversely as possible while keeping the original meaning strictly unchanged and the content readable for common people. Some generated examples:

1. "Is the density of light blue fluid equal to that of green fluid?" --> "Are the light blue liquid and green liquid just as heavy?"
2. "Which phrase below can best describe the final pose of the green plate?\nA. Standing upright.\nB. Leaning.\nC. Lying horizontally." --> "What does the final position of the green plate best resemble?\nA. Standing straight up.\nB. Tilted.\nC. Lying flat."
3. "Will the orange ball finally drop into the left pit?\nA. Yes\nB. No\nC. Can not answer" --> "Is the orange ball expected to fall into the pit on the left?\nA. Yes\nB. No\nC. Can not answer"

The statistics for both the paraphrased and the template-based questions are listed in the table below.

| Linguistic Statistics | Template-Based Fluid | Template-Based Rope | Template-Based Cloth | Template-Based Ball | LLM-Paraphrased Fluid | LLM-Paraphrased Rope | LLM-Paraphrased Cloth | LLM-Paraphrased Ball | Typical Values in Natural Language Corpus |
|-----------------------|----------------------|---------------------|----------------------|---------------------|-----------------------|----------------------|-----------------------|----------------------|-------------------------------------------|
| Lexical Diversity: TTR | 0.0096 | 0.0096 | 0.0089 | 0.0066 | 0.052 | 0.053 | 0.068 | 0.049 | 0.02 to 0.5 |
| Lexical Diversity: Word Distribution | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | See Figure | N/A |
| QA Diversity: Number of Question Types | 7 Types, See Figure | 8 Types, See Figure | 6 Types, See Figure | 5 Types, See Figure | 7 Types, See Figure | 8 Types, See Figure | 6 Types, See Figure | 5 Types, See Figure | N/A |
| Syntactic Diversity: Sentence Length Mean / Variance | 13.1 / 3.9 | 13.0 / 6.9 | 12.2 / 8.5 | 15.2 / 10.2 | 13.6 / 10.7 | 13.0 / 11.3 | 11.7 / 11.4 | 15.6 / 19.2 | 15 to 20 |
| Readability: Flesch-Kincaid Grade Level | 4.4 | 3.1 | 4.1 | 4.0 | 4.5 | 3.1 | 3.9 | 4.1 | A score of 5.0 to 8.0 is considered easily understandable by the average 5th to 8th grader. |

We also tested the MLLM's performance on both the template-based and the LLM-paraphrased questions. The results are listed below.

| **Model: Gemini-Pro-Vision** | **(O) Template-Based Question** | **(N) Template-Based Question (Text Only)** | **(D) LLM-Paraphrased Questions** | **(F) LLM-Paraphrased Questions (Text Only)** |
|:----------------------------:|:-------------------------------:|:-------------------------------------------:|:---------------------------------:|:---------------------------------------------:|
| **Rope P** | 35.5 | 34.0 | 30.5 | 32.0 |
| **Rope CO** | 48.2 | 44.4 | 43.8 | 44.0 |
| **Rope CQ** | 12.0 | 5.6 | 9.2 | 5.6 |
| **Rope GO** | 51.6 | 48.9 | 49.3 | 47.1 |
| **Rope GQ** | 10.3 | 10.3 | 6.9 | 1.7 |
| **Fluid P** | 10.0 | 28.0 | 11.0 | 22.0 |
| **Fluid CO** | 47.3 | 48.0 | 45.0 | 55.3 |
| **Fluid CQ** | 5.1 | 6.4 | 2.6 | 15.4 |
| **Fluid GO** | 44.4 | 63.3 | 43.2 | 60.4 |
| **Fluid GQ** | 11.3 | 11.3 | 5.7 | 11.3 |
| **Fluid PO** | 52.4 | 51.2 | 52.0 | 54.3 |
| **Fluid PQ** | 5.8 | 8.7 | 4.3 | 11.6 |
| **Cloth P** | 42.0 | 54.0 | 39.0 | 48.0 |
| **Cloth PO** | 50.1 | 56.1 | 50.1 | 55.7 |
| **Cloth PQ** | 43.0 | 50.0 | 41.5 | 51.0 |
| **Ball P** | 54.0 | 54.0 | 58.0 | 61.0 |
| **Ball CO** | 60.9 | 60.1 | 57.2 | 57.6 |
| **Ball CQ** | 29.6 | 37.0 | 27.2 | 28.4 |
| **Ball GO** | 54.1 | 60.1 | 53.9 | 55.6 |
| **Ball GQ** | 24.6 | 34.4 | 27.9 | 32.8 |
| **Ball PO** | 51.7 | 47.1 | 52.3 | 56.9 |
| **Ball PQ** | 25.9 | 17.2 | 27.6 | 31.0 |

</font>

**Q2. More Implementation Details.**

> For the baseline models, why not consider a few recent transformer-based video-QA models that can be finetuned on your dataset to complement the zero-shot large models such as GPT-4v, such as [1] and [2]?
> [1] Fu, Tsu-Jui, et al. "Violet: End-to-end video-language transformers with masked visual-token modeling." arXiv preprint 2021.
> [2] Sung, Yi-Lin, Jaemin Cho, and Mohit Bansal. "VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks." CVPR 2022.

Thank you for the suggestion to add more fine-tuned baselines. <font color="#E24A0F">"More experimental analysis by Xin Yan. Maybe we can add another baseline like BLIP2 or something easier."</font>

**Q3. MLLM Prompting Details.**

> It is unclear how GPT-4v and Gemini are prompted, i.e., did you use an in-context examples, what are the subsampling rates of the videos, and also what would be the instructions/guidelines to these large models?

<font color="#0099FF">Thank you for pointing out the unspecified prompting details! For the original in-paper experiments, we only combine a general instruction, the questions, and subsampled frames into a single packed prompt. The general instruction and the per-question instructions are listed in Table 8 of our paper; we did not include any scenario-specific guidelines or in-context QA examples in the prompt. During the rebuttal, we carefully considered your suggestion and experimented with self-designed scenario-specific guidelines, in-context examples, and elaborate human explanations for example questions. In the initial experiments, we uniformly subsampled 11 frames and resized the images from 1920x1080 to 480x270. During the rebuttal stage, we also tested full-size (1920x1080) images and raised the number of subsampled frames to 16, which is the upper limit of Gemini-Pro-Vision's visual input. The detailed results are listed below.

| **Model: Gemini-Pro-Vision** | **(O) Question Only** | **(A) Scenario-Specific Guideline** | **(B) In-Context Examples** | **\(C) In-Context Examples + Human Explanations** | **(E) Upsampled Video (11→16 Frames, Raised Resolution)** |
|:----------------------------:|:---------------------:|:-----------------------------------:|:---------------------------:|:------------------------------------------------:|:---------------------------------------------------------:|
| **Rope P** | 35.5 | 33.5 | 34.5 | 39.0 | 34.0 |
| **Rope CO** | 48.2 | 46.6 | 51.2 | 53.4 | 46.6 |
| **Rope CQ** | 12.0 | 14.8 | 13.4 | 12.0 | 11.3 |
| **Rope GO** | 51.6 | 54.7 | 56.1 | 57.4 | 48.9 |
| **Rope GQ** | 10.3 | 20.7 | 12.1 | 19.0 | 8.6 |
| **Fluid P** | 10.0 | 22.0 | 24.0 | 21.0 | 19.0 |
| **Fluid CO** | 47.3 | 45.7 | 48.3 | 46.0 | 46.3 |
| **Fluid CQ** | 5.1 | 2.6 | 5.1 | 2.6 | 0.0 |
| **Fluid GO** | 44.4 | 40.8 | 42.6 | 36.7 | 40.8 |
| **Fluid GQ** | 11.3 | 5.7 | 7.5 | 3.8 | 5.7 |
| **Fluid PO** | 52.4 | 53.1 | 51.6 | 48.8 | 49.2 |
| **Fluid PQ** | 5.8 | 5.8 | 5.8 | 2.9 | 4.3 |
| **Cloth P** | 42.0 | 46.0 | 54.0 | 54.0 | 47.0 |
| **Cloth PO** | 50.1 | 50.1 | 45.9 | 50.3 | 47.7 |
| **Cloth PQ** | 43.0 | 43.0 | 37.0 | 42.0 | 38.5 |
| **Ball P** | 54.0 | 52.0 | 53.0 | 46.0 | 56.0 |
| **Ball CO** | 60.9 | 56.4 | 47.3 | 43.2 | 60.5 |
| **Ball CQ** | 29.6 | 28.4 | 13.6 | 7.4 | 34.6 |
| **Ball GO** | 54.1 | 57.9 | 55.2 | 57.4 | 54.1 |
| **Ball GQ** | 24.6 | 31.1 | 6.6 | 11.5 | 23.0 |
| **Ball PO** | 51.7 | 52.9 | 51.1 | 43.7 | 50.6 |
| **Ball PQ** | 25.9 | 27.6 | 20.7 | 15.5 | 25.9 |

</font>
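For reproducibility, a short sketch of the uniform frame subsampling and resizing described above (11 downsized frames by default, or 16 full-resolution frames in setting (E)); the use of OpenCV here is an assumption for illustration, and `video.mp4` is a placeholder path.

```python
import cv2
import numpy as np

def subsample_frames(video_path, num_frames=11, size=(480, 270)):
    """Uniformly pick `num_frames` frames from the video and resize them."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over the whole clip.
    indices = np.linspace(0, total - 1, num_frames).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return frames

# Default setting: 11 downsized frames.
frames_default = subsample_frames("video.mp4", num_frames=11, size=(480, 270))
# Setting (E): 16 full-resolution frames.
# frames_e = subsample_frames("video.mp4", num_frames=16, size=(1920, 1080))
```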
**More Related Work.**

> Another video based QA work that talks about counterfactual reasoning is [3]. While the work is not directly discussing physical properties at the granularity of this work and it serves as a more general event-rich video QA work, it is still quite relevant (its physical dimension) to the direction of this work. Consider citing and discussing it.
> [3] Wu, Te-Lin, et al. "ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos." EMNLP 2023.

Thank you for pointing us to the related work on counterfactual reasoning [3]. We will cite and discuss it in the revised version.

## Response to Reviewer **xL5z**

**Q1. More Details on the Particle-Based Dynamics Learner.**

> I did not quite understand how the Particle-Based Dynamics Learner works. How exactly are MPM and DPI applied? Please explain in more detail their working principles within the model.

**Q2. Confusion about Fig. 2.**

> Figure 2's Example A from (a) to (f) can easily cause confusion. While other examples describe the same object, Example A's first three images do not describe the same object as the last three images, which I think might confuse others.

Thank you for the suggestion regarding Fig. 2 (a). We will update the figure and its caption to make it clearer and avoid confusion.

**Q3. Adding Links to the Supplementary Section.**

> Please add related links in the supplementary material section. For instance, in section 4 on Program Execution, it is mentioned, "We provide more model details in the supplementary material. Please link specifically to section 9.2"

We will revise the paper accordingly and add reference links to the corresponding sections of the supplementary material for easier reading.

**Q4. Gathering Explanations from Humans.**

> It is suggested that the authors consider gathering explanations from human subjects. This approach would enable them to collect data on the theories that humans naturally employ and assess whether providing suggestions, prompts, or guidance to utilize the appropriate theory enhances performance. Such an approach could significantly contribute to rationalizing expectations and informing the design of machine systems of this nature.

<font color="#E24A0F">"More experiments need to be added."</font>

<font color="#0099FF">

| **Model: Gemini-Pro-Vision** | **(O) Question Only** | **(B) In-Context Examples** | **\(C) In-Context Examples + Human Explanations** |
|:----------------------------:|:---------------------:|:---------------------------:|:-------------------------------------------------:|
| **Rope P** | 35.5 | 34.5 | 39.0 |
| **Rope CO** | 48.2 | 51.2 | 53.4 |
| **Rope CQ** | 12.0 | 13.4 | 12.0 |
| **Rope GO** | 51.6 | 56.1 | 57.4 |
| **Rope GQ** | 10.3 | 12.1 | 19.0 |
| **Fluid P** | 10.0 | 24.0 | 21.0 |
| **Fluid CO** | 47.3 | 48.3 | 46.0 |
| **Fluid CQ** | 5.1 | 5.1 | 2.6 |
| **Fluid GO** | 44.4 | 42.6 | 36.7 |
| **Fluid GQ** | 11.3 | 7.5 | 3.8 |
| **Fluid PO** | 52.4 | 51.6 | 48.8 |
| **Fluid PQ** | 5.8 | 5.8 | 2.9 |
| **Cloth P** | 42.0 | 54.0 | 54.0 |
| **Cloth PO** | 50.1 | 45.9 | 50.3 |
| **Cloth PQ** | 43.0 | 37.0 | 42.0 |
| **Ball P** | 54.0 | 53.0 | 46.0 |
| **Ball CO** | 60.9 | 47.3 | 43.2 |
| **Ball CQ** | 29.6 | 13.6 | 7.4 |
| **Ball GO** | 54.1 | 55.2 | 57.4 |
| **Ball GQ** | 24.6 | 6.6 | 11.5 |
| **Ball PO** | 51.7 | 51.1 | 43.7 |
| **Ball PQ** | 25.9 | 20.7 | 15.5 |

</font>

**Q5. Sim2real Transfer.**

> It is unclear if predictions from a 3d simulated model for this task will generalize to the real world. It depends on the quality of the renders and the physics simulation of the 3d engine.

**Q6. Performance on Models' Ability to Infer Physical Properties.**

> The authors have raised doubts about the existing models' ability to infer physical properties on a continuum. However, there are no experiments to compare and demonstrate their actual performance.

<font color="#E24A0F">"We need some numbers for experiments."</font>
## Experimental Plan and Assignment

**E1.** Model based on NEWTON's paper (https://arxiv.org/pdf/2310.07018.pdf).

**E2.** The focus on general-purpose visual and language representations in most visual models raises the question: has the author explored specialized models trained for physical reasoning, such as PIP (PIP: Physical Interaction Prediction via Mental Simulation with Span Selection), interpretable intuitive physics models, or PhyDNet? (Yanxin)

**E3.** How do the template question design? How many humans were involved? Do the templates affect the performance a lot, especially for the prompts for MLLM? (Zhicheng) [Done]

**E4.** If inputting different modalities of the scene, e.g., multi-view, point clouds, mesh, etc, how well do the models perform? (Yanxin) [Done 1]

**E5.** For the first point of weakness, I would actually suggest doing a LLM paraphrasing first and then see if that complicates the QA sets, with the statistics mentioned of course. (Zhicheng) [Done]

**E6.** For the baseline models, why not consider a few recent transformer-based video-QA models that can be finetuned on your dataset to complement the zero-shot large models such as GPT-4v, such as [1] and [2]? (Yanxin)

**E7.** It is unclear how GPT-4v and Gemini are prompted, i.e., did you use an in-context examples, what are the subsampling rates of the videos, and also what would be the instructions/guidelines to these large models? (Zhicheng) [Done]

**E8.** It is suggested that the authors consider gathering explanations from human subjects. This approach would enable them to collect data on the theories that humans naturally employ and assess whether providing suggestions, prompts, or guidance to utilize the appropriate theory enhances performance. Such an approach could significantly contribute to rationalizing expectations and informing the design of machine systems of this nature. (Zhicheng) [Done]
