[Pre-revision] General Response to all reviewers
We thank all reviewers for their valuable comments and insightful feedback. We are glad that the reviewers recognized the following contributions:
* The idea of decorrelating physical properties and visual appearance is appealing and interesting [Reviewer EAkh, Reviewer YnEj, Reviewer 9tjv];
* It is novel and challenging to learn physical properties from the interaction and motion of a few video examples [Reviewer EAkh, Reviewer 9tjv];
* The paper is nicely written and easy to follow [Reviewer YnEj, Reviewer 9tjv].
Besides the specific response to each reviewer below, we summarize the planned changes in the revision.
* We plan to add more baselines, such as MAC-REF and CNN-LSTM-REF, that use reference videos in the few-shot learning setting;
* We will add a new NS-DR variant baseline that uses extra ground-truth property information and adopts PropNet as the dynamics learner [A];
* We will revise the paper carefully, adding more details for question templates, user study, and analysis between physical properties and question-answering accuracies.
[A]. Li Y, Wu J, Zhu J Y, et al. Propagation networks for model-based control under partial observation. ICRA. 2019.
**To R1**
Thank you for your constructive comments.
**1. About the baselines being too weak**
It may **not** be true that the baselines in Table 1 are too weak. We have tried the strongest existing models. Based on the best existing model on CLEVRER (ALOE[13]), we implemented a baseline model, ALOE-REF, which concatenates the visual features of the target video and the reference videos as visual input; it achieves only limited gains on ComPhy, as shown in Table 2. This is not surprising, since it remains unclear how to adapt existing state-of-the-art video reasoning models to infer physics from a few examples. We have also developed a new oracle neural-symbolic model (CPL) trained with substantial supervision. However, none of these models works well enough on ComPhy.
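The ALOE-REF input construction described above can be sketched as follows. This is an illustrative, hypothetical fragment (function names and feature shapes are our own, not ALOE's actual code): the per-frame features of the target video and of each reference video are simply concatenated into one token sequence before being fed to the reasoner.

```python
# Hypothetical sketch of the ALOE-REF input construction: target-video
# features followed by reference-video features, concatenated along the
# token dimension. Names and shapes are illustrative only.

def concat_target_and_reference_features(target_feats, reference_feats_list):
    """target_feats: list of per-frame feature vectors of the target video.
    reference_feats_list: one such list per reference video."""
    tokens = list(target_feats)           # target-video tokens come first
    for ref_feats in reference_feats_list:
        tokens.extend(ref_feats)          # then append each reference video
    return tokens

target = [[0.1, 0.2], [0.3, 0.4]]         # 2 frames, toy 2-d features
refs = [[[0.5, 0.6]], [[0.7, 0.8]]]       # two reference videos, 1 frame each
combined = concat_target_and_reference_features(target, refs)
print(len(combined))  # 4 tokens in total
```

Such plain concatenation gives the model access to the reference videos but no explicit mechanism for extracting physical properties from them, which is consistent with the limited gains we observe.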
ComPhy introduces a new test setting that requires models to learn physical properties from only a few video examples. Existing baselines like CNN-LSTM, HCRN, and MAC rely on massive numbers of training videos and question-answer pairs. They are thus difficult to adapt to the new scenario in ComPhy, which requires learning compositional visible and hidden physical properties from only a few examples.
To provide a more thorough analysis, we will implement more baseline variants, such as MAC-REF and CNN-LSTM-REF, that use reference videos in the revision. **We are also open to new and strong baselines if the reviewer could suggest any.**
**2. About important baseline models reported on CLEVRER being missing**
We do **not** think IEP (2017), T-VQA+ (2018), and TbDNet (2018) are important baselines. It has been shown on the CLEVRER[42] dataset that baselines like ALOE[13] (2021) achieve much better performance than IEP (2017), T-VQA+ (2018), and TbDNet (2018). As reported in ALOE[13] and CLEVRER[42], ALOE achieves an accuracy of 0.756 on counterfactual questions, while the best of IEP, T-VQA+, and TbDNet achieves only 0.044.
We have implemented and analyzed modern state-of-the-art models on CLEVRER, such as ALOE/ALOE-REF (2021), HCRN (2020), and MAC (2018), on ComPhy. Although the state-of-the-art model ALOE shows excellent performance on CLEVRER, it achieves unsatisfactory performance on ComPhy.
The neural-symbolic model NS-DR by nature cannot infer physical properties and thus can**not** answer physics-based questions in ComPhy. For example, given the factual question "Is the purple sphere heavier than the brown cube?", NS-DR has no sub-module to parse the mass value of the "brown cube" and thus fails to execute the program and answer the question. For a better analysis, we plan to implement a variant of NS-DR with extra ground-truth physical property information and PropNet [A] for dynamic predictions in the revision.
**3. About language in ComPhy**
Of course, if we merely focused on inferring physical properties from a few video examples, we could simply ask whether a given object's physical property is the same as that of the corresponding object in the reference videos.
However, in ComPhy, we group segments into objects, infer their physical properties from their motion, and use concepts and natural language to explain what has happened, infer what is about to happen, and imagine what would happen in counterfactual scenes. Adding language to ComPhy makes it possible to automatically learn concepts from natural language and ground them in physical dynamic scenes, which is essential for human intelligence. We also believe that adding language to physical reasoning can attract broader research attention across the computer vision and natural language processing communities, like the previous benchmark CLEVRER[42].
**4. About missing details**
We will provide more details for question templates and human study in the revised version.
[A]. Li Y, Wu J, Zhu J Y, et al. Propagation networks for model-based control under partial observation. ICRA. 2019.
**To R2**
Thanks for your detailed comments.
**1. About the contribution of ComPhy and its difference with CLEVRER**
We **never** claim that the question types are ComPhy's main contribution. As summarized at the end of the Introduction, our contributions are 1) a new physical reasoning benchmark, ComPhy, with physical properties (mass and charge), physical events (attraction, repulsion), and their composition with visual appearance and motions; 2) a few-shot reasoning setting, requiring models to infer hidden physical properties from only a few examples and make corresponding dynamic predictions to answer questions; and 3) a new oracle neural-symbolic framework to handle the ComPhy tasks.
The main differences between ComPhy and CLEVRER are: 1) ComPhy requires models to identify **intrinsic physical properties** of objects from **only a few video examples**; 2) ComPhy requires models to make **physical property-based dynamic predictions** for the target video, e.g., *"Which event would happen if the purple object were heavier?"*. Note that objects in CLEVRER are designed with **the same mass**, and CLEVRER mainly focuses on visible properties and dynamics **without physical property variance**. As shown in Figure 1 of the main paper, the variance in physical properties (mass and charge) plays an important role in predicting objects' dynamics.
The previous model ALOE achieved excellent accuracy (0.875 on predictive questions) on CLEVRER **without understanding physical properties**. However, ALOE-REF achieves much lower accuracy on ComPhy (0.371 on predictive questions). This shows the difference between CLEVRER and ComPhy and the importance of modeling physical properties on ComPhy.
**2. About only two physical properties being considered in ComPhy**
ComPhy supports reasoning over compositional visible and hidden physical properties. Theoretically, we could add more physical properties, like bounciness coefficients and friction, to the benchmark and make the values of these physical properties continuous. However, such a design would make the dataset too complicated. We want to keep the dataset simple enough for people to infer the physical properties from a few observations while still challenging for current AI models. As shown in Table 3, the oracle model CPL still achieves limited performance (56.4 on predictive questions and 29.1 on counterfactual questions) on ComPhy, even though it uses substantial supervision during training. We can add other physical properties to ComPhy once AI models achieve satisfactory performance on the current benchmark.
**3. About the evaluation set up and identifying all objects' intrinsic properties**
The reviewer may have misunderstood our few-shot learning setting, which requires models to infer objects' **physical properties** from **only a few videos** in which the objects move and interact under different initial conditions. In the example in Figure 2, if we **only look at the target video**, we cannot compare the mass values of *"the purple sphere"* and *"the brown cube"* because there is no interaction between them in the target video. However, in **reference video 2** of Figure 2, after the collision between *"the purple sphere"* and *"the brown cube"*, both objects reverse their moving directions, indicating that neither of them has a larger mass value than the other. Thus, the fact that the mass relation between *"the purple sphere"* and *"the brown cube"* cannot be inferred **from the target video alone** is not a bug but a designed feature of ComPhy, which requires models to gather information from **both the target video and the reference videos** to infer objects' physical properties.
We make sure that each reference video contains at least one interaction (collision, attraction, or repulsion) among objects to provide enough information for physical property inference, and that each object appears at least once in the reference videos. Moreover, when generating questions that compare mass values or identify charge relations between two objects, **we systematically ensure that the two objects have at least one interaction (collision, attraction, or repulsion) in one of the provided video examples. We verify that the few video examples are informative enough to answer the questions based on the questions' programs and the video examples' property and event annotations.** As shown in Figure 2, we only ask models to compare the mass of *"the purple sphere"* and *"the brown cube"* when there is a collision between them. We will provide a more detailed analysis in the revised version to make this clearer.
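The mass-inference rule illustrated by reference video 2 above can be sketched in a few lines. This is a simplified, hypothetical fragment (the function name and the 1-d velocity encoding are our own, not the benchmark's code): after a head-on collision, if both objects reverse direction, their masses are comparable; if only one object keeps its direction, it is the heavier one.

```python
# Illustrative sketch (not the paper's implementation) of inferring a
# mass relation from a single observed collision, using 1-d velocities
# before and after the collision.

def infer_mass_relation(v_a_before, v_a_after, v_b_before, v_b_after):
    """Return the mass relation between objects a and b implied by one
    collision; 'unknown' if the outcome is uninformative."""
    a_reversed = v_a_before * v_a_after < 0   # sign flip = direction reversed
    b_reversed = v_b_before * v_b_after < 0
    if a_reversed and b_reversed:
        return "comparable"        # both bounce back: similar masses
    if b_reversed and not a_reversed:
        return "a_heavier"         # a keeps going, b bounces back
    if a_reversed and not b_reversed:
        return "b_heavier"
    return "unknown"

# Both objects reverse direction after the collision -> comparable mass,
# as in reference video 2 of Figure 2.
print(infer_mass_relation(1.0, -0.5, -1.0, 0.5))  # comparable
```

This also illustrates why the target video alone can be uninformative: with no collision between the two objects, no such rule can fire, and the reference videos must supply the missing interaction.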
**4. The necessity of the few-shot learning setting**
We emphasize that ComPhy is **not** simply a few-shot version of CLEVRER. We summarize the differences between ComPhy and CLEVRER in the previous paragraph "*1. About the contribution of ComPhy and its difference with CLEVRER*".
ComPhy aims to study objects' hidden physical properties (mass and charge) and their composition with visible properties. To this end, it introduces a few-shot reasoning setting for physical property learning: ComPhy provides only a few video examples for models to identify objects' physical properties and then asks questions about those properties and the resulting dynamics. **Such a setting is natural, since people also often infer objects' physical properties from only a few observations. The few-shot setting also decorrelates physical properties from visual appearance, which effectively avoids shortcut learning. Previous state-of-the-art models on CLEVRER like MAC and ALOE-REF have no modules designed for hidden physical properties and achieve limited performance on ComPhy.**
If we did not adopt such a few-shot learning setting, a straightforward alternative would be to follow the setting in previous benchmarks like CLEVRER, requiring the model to watch a video and then answer questions about physical properties. However, physical properties are complicated and often cannot be fully unraveled in a single video. Another solution would be to correlate object appearance with physical properties, e.g., making all red spheres heavy, and then ask questions about their dynamics. However, such a correlated setting may create shortcuts, letting models memorize appearance priors rather than understand the coupled physical properties.
**5. About result analysis based on physical properties**
We will provide more analysis between physical properties and the question-answering performance in the revised version.
**6. About details of human study**
We will provide more details of the human study in the revised version.
**7. About the contributions of the proposed oracle model**
We propose an oracle model, the compositional physical property learner (CPL), for reasoning on ComPhy. The proposed CPL is **not** *"just the inclusion of the Physical Property Learner"* into the NS-DR model from CLEVRER. We highlight the following differences between CPL and NS-DR.
**1). Goal**. CPL aims to learn physical properties from only a few given video examples, while NS-DR by nature cannot infer objects' physical properties from their dynamics in videos;
**2). Inputs**. CPL targets ComPhy, whose input includes both target and reference videos, while NS-DR only focuses on one given video and has no module to gather information from all given target and reference videos;
**3). Physical Property Learning**. CPL contains a graph neural network-based physical property learner (PPL) that infers objects' physical properties from their motion and interaction in the given target and reference videos, while NS-DR has no such design and thus cannot answer the corresponding questions in ComPhy. PPL is effective at inferring physical properties from objects' motions and interactions, achieving accuracies of 90.4% and 90.8% for mass and charge label prediction, respectively, as described in Line 36-Line 317;
**4). Physical Property-based Dynamic Prediction**. The dynamics predictor in NS-DR has no mechanism to model the physical properties (mass and charge) of objects in ComPhy, which leads to inferior dynamic prediction performance. Instead, **CPL explicitly models mass and charge on the nodes and edges of its graph neural network** and achieves better performance.
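The design choice in 4) can be illustrated with a minimal sketch. This is a hypothetical, heavily simplified fragment (function name, 1-d positions, and the Coulomb-like update rule are our own, not CPL's actual architecture): mass sits on the nodes and scales each object's response, while the pairwise charge term sits on the edges and determines attraction versus repulsion.

```python
# Minimal, hypothetical sketch of conditioning dynamics on physical
# properties: per-object mass on the nodes, pairwise charge interaction
# on the edges, aggregated into one acceleration per object.

def predict_accelerations(positions, masses, charges, k=1.0):
    """One message-passing-style step over a fully connected graph of
    objects with 1-d positions; returns one acceleration per object."""
    n = len(positions)
    accs = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = positions[j] - positions[i]
            if r == 0:
                continue
            # edge term: like charges (q_i*q_j > 0) repel,
            # opposite charges attract; magnitude falls off as 1/r^2
            force = -k * charges[i] * charges[j] / (r * abs(r))
            accs[i] += force / masses[i]   # node term: heavier -> less acceleration
    return accs

# Two like-charged objects: both are pushed apart, and the heavier
# object (mass 2.0) accelerates less.
accs = predict_accelerations([0.0, 1.0], masses=[1.0, 2.0], charges=[1.0, 1.0])
```

A learned dynamics model would replace the hand-written force with learned edge and node functions, but the point stands: without mass on the nodes and charge on the edges, as in NS-DR's predictor, these outcomes cannot be distinguished.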
To make the distinction between CPL and NS-DR easier to see, we plan to include an NS-DR variant baseline in the revision. To enable NS-DR to tackle the more challenging physical reasoning task in ComPhy, we provide it with extra ground-truth physical property information and a PropNet [A] for dynamic predictions.
[A]. Li Y, Wu J, Zhu J Y, et al. Propagation networks for model-based control under partial observation. ICRA. 2019.
**To R3**
Thanks for your positive comments.
**1. About the website presentation**
We will take additional dissemination actions (tutorials, slides, etc.) in a later version.