**General Response: Revision Updated**

We would like to thank the reviewers for their thoughtful, constructive, and detailed comments, which helped us improve the revision. We have revised our manuscript accordingly and highlighted the changes in blue. The revision includes the following changes:

* We have added two stronger baselines, MAC-REF and CNN-LSTM-REF, which use the reference videos in the few-shot learning setting (Lines 252-259 and Table 2 of the main paper) (Reviewer EAkh);
* We have added a new NS-DR variant baseline with extra ground-truth property information for the objects in the target and reference videos, which adopts PropNet [A] as its dynamics learner (Reviewer EAkh and Reviewer YnEj);
* We have revised the paper carefully, adding more details on the question templates (Lines 34-41 and Table 1 of the supplementary material), the user study (Lines 260-262 of the main paper), and the analysis of the relationship between physical properties and QA accuracy (Lines 342-348 of the main paper) (Reviewer EAkh and Reviewer YnEj).

**Design philosophy and perspective.** We would also like to summarize our design philosophy and perspective on physical reasoning. From our perspective, the core intelligence of physical reasoning can be divided into several stages: (1) physics prediction without logical reasoning; (2) physical reasoning with hand-specified logic programs; (3) physical reasoning with templated language and a program parser to learn physical concepts and dynamics; (4) physical reasoning with templated language alone; (5) physical reasoning with natural human language. Based on the goals of these stages, we hold the following perspectives on building physical reasoning benchmarks.

1) We believe that a good scientific benchmark should serve general and universal AI purposes and keep track of the progress the community makes in machine intelligence. The ComPhy dataset can be used to evaluate physical reasoning ability from level (1) to level (4), which is clearly better than building a narrow dataset for level (1) alone.

2) As in CLEVR, the merit of synthetic language is bias control and diagnostic ability. Human-written language often contains noise and question-conditioned bias; with human-written language, it is hard to diagnose which part of a model goes wrong, and models tend to exploit dataset biases to achieve good performance. Synthetic data allows us to control the distribution of scenes, questions, and answers. Moreover, we can diagnose models with ground-truth physical states and programs (see the illustrative program sketch at the end of this response).

3) We agree that human language is challenging and is of great importance for model evaluation. However, as shown in Tables 2 and 3 of the main paper, current models still struggle even with templated language. Given the current performance, it might be too early for the whole community to work directly on level (5).

4) More generally, our dataset represents a bet that synthetic, diagnostic datasets will enable progress in machine reasoning (e.g., CLEVR [21] and CLEVRER [41]). It is hard to know for sure how well the resulting systems will eventually transfer to the real world, but it is a bet that many in the community see as worth making!

[A] Li Y., Wu J., Zhu J.-Y., et al. Propagation networks for model-based control under partial observation. ICRA 2019.
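To make the diagnostic merit of synthetic programs concrete (point 2 above), here is a minimal, hypothetical sketch of a templated question paired with a ground-truth functional program. The question and the `Counterfactual_mass_light` operator are quoted from our response to Reviewer EAkh below; the remaining operator names are illustrative placeholders, not ComPhy's exact program vocabulary.

```python
# A hypothetical templated question with its ground-truth functional
# program. Only "Counterfactual_mass_light" is quoted from our response;
# the other operator names are illustrative placeholders.
question = "If the sphere were lighter, which event would happen?"

program = [
    "Objects",                    # return all objects in the target video
    "Filter_shape[sphere]",       # select the sphere referred to in the question
    "Counterfactual_mass_light",  # re-simulate dynamics with a lighter mass
    "Query_events",               # read out events from the counterfactual rollout
]
```

Because every reasoning step is an explicit, inspectable operator with a ground-truth label, a model's failure can be localized to language parsing, property inference, or dynamics prediction, rather than being hidden inside a single end-to-end score.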
**To Reviewer 1**

Dear reviewer EAkh,

Thanks for your thoughtful response; we address your remaining concerns below.

**[New baseline models using reference videos]** Thanks for your comments. Following your suggestions, we made a considerable effort during the rebuttal to implement stronger baselines that use the reference videos. Besides the **ALOE (Ref)** model, we have implemented two further baselines using reference videos, **MAC (Ref)** and **CNN-LSTM (Ref)**. As shown in Table 2 of the main paper, these models achieve only comparable or even slightly worse performance compared with their original counterparts. This is not very surprising: these models have no explicit physical reasoning ability, which makes it extremely hard for them to infer the composition of physical and visible properties.

We have also developed a variant of **NS-DR** that uses ground-truth physical properties from the reference videos for comparison. We would like to emphasize that the original NS-DR can predict objects' dynamics from their trajectory histories, but it has no capability of predicting physical property labels (e.g., mass, charge). The original model therefore cannot answer physical property-related questions in a symbolic way. For example, to run the counterfactual operator "*Counterfactual_mass_light*" in the question "If the sphere were lighter, which event would happen?", the symbolic reasoning module requires the mass value of the target "*sphere*". To address your concern, we nevertheless provided NS-DR with the ground-truth physical property labels of both the target video and the reference videos, enabling it to execute symbolic programs over ground-truth property labels and predicted motion trajectories. However, even when directly using all of this state information from the reference videos, the NS-DR model is still worse than our model. The core reason is that the PropNet dynamics learner in NS-DR does not model mass and charge on the nodes and edges of its graph neural network, leading to inferior dynamics prediction. Please refer to Sections 4.2 and 5.2 for a more detailed analysis.
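To illustrate this point, below is a minimal, hypothetical sketch of one property-aware message-passing step in the spirit of our dynamics learner. All module names and dimensions are ours for illustration and do not reproduce PropNet's or CPL's actual implementation; dropping `props` from the concatenations recovers a property-blind step of the kind NS-DR relies on.

```python
import torch
import torch.nn as nn

class PropertyAwareStep(nn.Module):
    """One GNN message-passing step whose node and edge features are
    conditioned on per-object properties (e.g., mass and charge).
    Dimensions and names are illustrative, not the paper's exact ones."""

    def __init__(self, state_dim=4, prop_dim=2, hidden=64):
        super().__init__()
        # Edge model sees both endpoints' states *and* properties, so
        # charge-dependent attraction/repulsion is representable.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * (state_dim + prop_dim), hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        # Node model sees its own state, its properties (mass scales the
        # effect of incoming forces), and the aggregated messages.
        self.node_mlp = nn.Sequential(
            nn.Linear(state_dim + prop_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, states, props, edges):
        # states: (N, state_dim) object positions/velocities
        # props:  (N, prop_dim)  per-object [mass, charge] values
        # edges:  (E, 2) long tensor of (sender, receiver) indices
        x = torch.cat([states, props], dim=-1)
        send, recv = edges[:, 0], edges[:, 1]
        msg = self.edge_mlp(torch.cat([x[send], x[recv]], dim=-1))
        agg = msg.new_zeros(states.size(0), msg.size(-1))
        agg.index_add_(0, recv, msg)  # sum incoming messages per object
        return states + self.node_mlp(torch.cat([x, agg], dim=-1))
```

A property-blind variant (the NS-DR setting) corresponds to `prop_dim=0`: charged-object interactions then become unidentifiable from short trajectory histories, matching the failure mode described above.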
**[About language in ComPhy]** From our perspective, the core intelligence of physical reasoning can be divided into several stages: (1) physics prediction without logical reasoning; (2) physical reasoning with hand-specified logic programs; (3) physical reasoning with templated language and a program parser to learn physical concepts and dynamics; (4) physical reasoning with templated language alone; (5) physical reasoning with natural human language. The reviewer suggests that we should do either (1) or (5). We respectfully push back on this suggestion for the following reasons.

1) We believe that a good scientific benchmark should serve general and universal AI purposes and keep track of the progress the community makes in machine intelligence. The ComPhy dataset can be used to evaluate physical reasoning ability from level (1) to level (4), which is clearly better than building a narrow dataset for level (1) alone.

2) As in CLEVR, the merit of synthetic language is bias control and diagnostic ability. Human-written language often contains noise and question-conditioned bias; with human-written language, it is hard to diagnose which part of a model goes wrong, and models tend to exploit dataset biases to achieve good performance. Synthetic data allows us to control the distribution of scenes, questions, and answers. Moreover, we can diagnose models with ground-truth physical states and programs.

3) We do agree that human language is more challenging and is of great importance for model evaluation. However, as shown in Tables 2 and 3 of the main paper, current models still struggle with templated language. Given the current performance, it might be too early for the whole community to work directly on level (5).

4) More generally, our dataset represents a bet that synthetic, diagnostic datasets will enable progress in machine reasoning (e.g., CLEVR [21] and CLEVRER [41]). It is hard to know for sure how well the resulting systems will eventually transfer to the real world, but it is a bet that many in the community see as worth making!

**[The bottleneck of physical reasoning]** This question is related to our dataset and model design principles. Since our dataset comes with full annotations of physical states and logic traces, we can easily diagnose the bottleneck of existing AI models, and we demonstrate this by developing the novel **CPL** model. **CPL** is a neuro-symbolic framework that enables a step-by-step evaluation of the whole physical reasoning process (language parsing, physical property identification, and physical property-based dynamics prediction); a schematic of this pipeline is sketched below. We find that the language parser in **CPL**, trained with program labels, successfully parses questions into executable programs and achieves nearly perfect program parsing accuracy. As analyzed in Section 5.2, the physical property learner in **CPL** achieves high accuracy for mass and charge label prediction. Thus, the bottleneck on ComPhy, for now, is improving physical property-based trajectory prediction. Future directions include jointly parsing templated language without program annotations [level (4)], which is a foundation for physical reasoning from natural language [level (5)].
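For clarity, here is a minimal, runnable sketch of this three-stage decomposition. Every function is a stub standing in for a learned module; the names, signatures, and return values are ours for illustration and are not CPL's actual API.

```python
from typing import Any, Dict, List

def parse_to_program(question: str) -> List[str]:
    # Stage 1: language parsing into an executable program
    # (nearly perfect accuracy when trained with program labels).
    return ["Objects", "Filter_shape[sphere]",
            "Counterfactual_mass_light", "Query_events"]

def infer_properties(target: Any, references: List[Any]) -> Dict[int, Dict[str, str]]:
    # Stage 2: physical property identification from the few
    # reference videos (high mass/charge accuracy; Section 5.2).
    return {0: {"mass": "light", "charge": "neutral"}}

def predict_dynamics(target: Any, props: Dict[int, Dict[str, str]]) -> List[Any]:
    # Stage 3: property-conditioned trajectory prediction;
    # the current bottleneck on ComPhy.
    return []

def execute(program: List[str], rollout: List[Any],
            props: Dict[int, Dict[str, str]]) -> str:
    # Symbolic execution of the program over the predicted
    # states and properties; trivially stubbed here.
    return "collision"

def answer(question: str, target: Any, references: List[Any]) -> str:
    program = parse_to_program(question)
    props = infer_properties(target, references)
    rollout = predict_dynamics(target, props)
    return execute(program, rollout, props)
```

Because each stage exposes an intermediate output for which ground-truth supervision is available, the first two stages can be verified to be accurate, which isolates stage 3 as the remaining bottleneck.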
**[Details of templated language]** We have added the question templates and examples of ComPhy to the revised paper (Lines 34-41 and Table 1 of the supplementary material). Samples of the questions and corresponding videos can be found in both the supplementary material and on the project page (https://comphyneurips.github.io).

We hope that our response has addressed your concerns and will turn your assessment to the positive side. If you have any further questions, please feel free to let us know during the rebuttal window. We appreciate your suggestions and comments. Thank you!

[A] Li Y., Wu J., Zhu J.-Y., et al. Propagation networks for model-based control under partial observation. ICRA 2019.

**To Reviewer 2** [Revision Updated]

Dear reviewer YnEj,

Thanks again for your constructive comments; we look forward to your feedback. We have made the following changes in the revision according to your review: 1) we added a variant of NS-DR in Section 5.2 to help distinguish the proposed CPL from NS-DR and analyzed their performance; 2) we provided more analysis of physical property-based dynamics in Section 5.2; 3) we proofread the paper again and fixed the typos. We have also summarized our perspectives on building physical reasoning benchmarks in **[General Response: Revision Updated]**; we hope they further clarify our design philosophy and address your concerns.

We hope that our response has addressed your concerns and will turn your assessment to the positive side. We are glad to answer any further questions during the rebuttal window. We appreciate your suggestions and comments. Thank you!

**To Reviewer 3**

Thanks again for your positive and encouraging comments. We have summarized our perspectives on building physical reasoning benchmarks in **[General Response: Revision Updated]**; we hope they further clarify our design philosophy. If you have any further advice or questions, we are happy to hear and answer them. We appreciate your comments and advice. Thank you!