## Comment to AC

Dear Area Chair,

This paper proposes the unsupervised discovery of Object-Centric neural Fields (uOCF), which learns to infer 3D object-centric scene representations from a single image. Our main technical contribution is the disentanglement of object intrinsic properties from their 3D locations and the subsequent application of object-centric sampling and prior learning. We collect three complex real-world datasets and demonstrate that our method allows scene manipulation and zero-shot generalization to unseen object types.

During the discussion period, we addressed all reviewers' concerns and provided additional experiments and analysis. The following is a summary of our responses to the reviewers' comments, and an overview video is attached to the supplement for your convenience.

1. Despite acknowledging our method's potential, reviewer TQPk raised concerns about our experimental setup. Specifically, they argue that the identical and fixed number of objects in the synthetic and real datasets undermines the validity of our experiments in supporting our method's generalization ability. To address this concern, we have conducted additional experiments where we randomize the number of objects in the scene and set the number of slots ($K$) above the maximum number of objects in the scene. The results (Appendix E.5) justify that our method neither requires the two datasets to have the same number of objects nor $K$ to equal the number of objects in the scene, effectively addressing the reviewer's concern.
2. Reviewers are also interested in the necessity of pre-training on synthetic datasets. To address this, we have provided additional analysis and conducted ablation studies to show that omitting the pre-training stages leads to a significant performance drop. By utilizing alternative synthetic datasets (CLEVR shapes and chairs), we have also demonstrated that the learned object priors are transferable and agnostic to the real-world datasets' object categories.
3. Reviewers also raised concerns about over/under-segmentation issues. In fact, our method does incorporate specific designs to address these issues. Firstly, the slot attention mechanism enforces areas of similar features to bind to the same slot. Moreover, we add slot-specific positional encoding to the keys and values to let the slot latents emphasize local information, thus preventing parts of different instances from binding together (see the illustrative sketch after this letter). As for empirical results, Figures 12 and 23 show that only our model can overcome this drawback, whereas previous methods fail to segment all the chair details.

In summary, we believe our method's novelty, extensive experiments, and real-world applications warrant its acceptance. We hope you will consider our rebuttal and the revised manuscript for publication.

Sincerely,
Authors
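To make the mechanism in point 3 above concrete, here is a minimal, illustrative sketch (not the paper's actual module) of a slot-attention update in which a slot-specific positional encoding, computed from the offset between each feature location and the slot's estimated image-plane position, is added to the keys and values. The class name, `pos_mlp`, `slot_pos`, and all shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SlotAttentionWithSlotPE(nn.Module):
    """Illustrative slot-attention step with a slot-specific positional bias on keys/values.

    Sketch only: each slot carries an estimated 2D position, and the keys/values of every
    feature location are shifted by an encoding of the offset between that location and the
    slot position, so each slot's attention favors features near the slot.
    """
    def __init__(self, dim, n_iters=3):
        super().__init__()
        self.n_iters = n_iters
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.pos_mlp = nn.Linear(2, dim)      # encodes (x, y) offsets
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, feats, grid, slots, slot_pos):
        # feats: (B, N, D) flattened feature map; grid: (B, N, 2) coordinates in [-1, 1]
        # slots: (B, K, D); slot_pos: (B, K, 2) estimated slot positions on the image plane
        B, N, D = feats.shape
        k, v = self.to_k(feats), self.to_v(feats)               # (B, N, D)
        for _ in range(self.n_iters):
            offsets = grid[:, None] - slot_pos[:, :, None]      # (B, K, N, 2)
            pe = self.pos_mlp(offsets)                          # slot-specific positional encoding
            k_i = k[:, None] + pe                               # keys as seen by slot i
            v_i = v[:, None] + pe                               # values as seen by slot i
            q = self.to_q(slots)                                # (B, K, D)
            attn = torch.einsum('bkd,bknd->bkn', q, k_i) * self.scale
            attn = attn.softmax(dim=1)                          # slots compete for each location
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
            updates = torch.einsum('bkn,bknd->bkd', attn, v_i)  # weighted mean of values
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).reshape(B, -1, D)
        return slots
```

Because the positional term differs per slot, each slot's attention is biased toward features near its own position estimate, which is the intuition behind preventing parts of different instances from binding to the same slot.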
---

Dear reviewer TQPk,

Thanks for your valuable insights on the experimental design. We address your concerns below with additional experiments. Due to the limited time before the discussion period ends, the number of training epochs for the following experiments is fewer than standard. The additional discussions and results have been added to **Appendix E.5**. If accepted, we will refine our experiments section in the main text in the camera-ready version.

---

Experiment configuration: We render a new synthetic dataset with each scene containing 2-4 chair instances. Real scenes still include exactly 4 objects. We test both $K=4$ and $K=5$. The results are shown in Tables 8, 9 (also attached below), and Figure 18. Please note that after using a training set consisting of a randomized number of objects, empty slots now appear (please see Figure 18), overcoming the over-segmentation problem.

Quantitative Results on Room-Texture (15 epochs, 75000 iterations)

| Method | ARI$\uparrow$ | FG-ARI$\uparrow$ | NV-ARI$\uparrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| K=4, n_obj=4 | 0.819 | 0.643 | 0.743 | 28.68 | 0.803 | 0.139 |
| K=4, n_obj$\in\{2,3,4\}$ | 0.828 | 0.743 | 0.756 | 30.11 | 0.831 | 0.112 |
| K=5, n_obj$\in\{2,3,4\}$ | 0.819 | 0.559 | 0.769 | 28.72 | 0.805 | 0.132 |

Quantitative Results on Kitchen-Hard (150 epochs, 48600 iterations)

| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: |
| K=4, n_obj=4 | 27.36 | 0.820 | 0.140 |
| K=4, n_obj$\in\{2,3,4\}$ | 27.10 | 0.825 | 0.131 |
| K=5, n_obj$\in\{2,3,4\}$ | 28.26 | 0.837 | 0.120 |

As shown in these tables, uOCF performs reasonably well when trained on scenes with $\leq K$ objects, justifying that the outstanding performance of uOCF in real-world scenes is not due to an identical number of objects between stage 2 and stage 3. In other words, applying the object prior learned from synthetic scenes to real-world scenes neither requires the two datasets to have the same number of objects nor $K$ to equal the number of objects in the scene. We can still assume a shared maximum number of objects for the two datasets, similar to previous unsupervised object discovery literature.

**Regarding under-segmentation.** In the additional experiments above, we see that our method is able to overcome over-segmentation. Regarding under-segmentation, in Figure 12 we observe that only our model (both the previously trained model and the newly trained model) can overcome this drawback, whereas previous methods fail to segment all the chair details.

**On the generality of the object prior.** We agree with the possible explanation you mentioned for the generalization across large domain gaps, i.e., that the learned object priors might be "compact shapes on a flat surface." We note that what defines an object is not entirely settled in cognitive science. In the core knowledge theory [Spelke 1990], fundamental principles of objects include "physical cohesion" (i.e., objects should move cohesively), which indicates that an object should be a cohesive compact entity, and "support" (i.e., objects do not float in mid-air without support). Therefore, we argue that "compact shapes on a flat surface" is a valid category-agnostic object prior. This argument justifies our stage 2 in applying the learned object prior: since the learned object priors are general, they enable handling the domain gap between synthetic and real data, regardless of the number of objects in these scenes.

[1] Spelke. "Principles of object perception." Cognitive Science, 1990.

---

Finally, we wish to reiterate that the main contribution of this work is to show the generalization power of object-centric modeling in unsupervised 3D object discovery, which allows real-world 3D scene editing from a single image, as well as zero-shot generalization to unseen real scenes. These have never been shown in previous object-centric learning methods.

We appreciate your efforts and hope our experiments and analysis can address your concerns. Your suggestions indeed help us improve our work. We are more than happy to discuss if you have further concerns.

Sincerely,
Authors
<!--
**Q1. What happens when K exceeds the number of objects in the scene?**

Experiment configuration: We set $K=5$ in stages 2/3 while keeping the training data unchanged (4 objects per scene). As shown in Tables 8, 9 (also attached below), and Figure 18, setting $K=5$ does not degrade the model's performance.

Quantitative Results on Room-Texture (15 epochs, 75000 iterations)

| Method | ARI$\uparrow$ | FG-ARI$\uparrow$ | NV-ARI$\uparrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| K=4 | 0.819 | 0.643 | 0.743 | 28.68 | 0.803 | 0.139 |
| K=5 | 0.809 | 0.595 | 0.750 | 28.70 | 0.810 | 0.131 |

Quantitative Results on Kitchen-Hard (150 epochs, 48600 iterations)

| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: |
| K=4 | 27.36 | 0.820 | 0.140 |
| K=5 | 27.29 | 0.830 | 0.128 |

We observe that setting $K>4$ during training still leads to satisfactory results regarding scene segmentation and does not incur over/under-segmentation issues (visual results are not good enough due to the short training time). Besides, we note that our method does incorporate a specific design to prevent over-segmentation. Firstly, the slot attention mechanism enforces areas of similar features to bind to the same slot. Moreover, we add slot-specific positional encoding to the keys and values to let the slot latents emphasize local information, thus preventing parts of different instances from binding together, which is a common problem in previous methods (Figure 23; see also ObSuRF in Figure 4 of [Sajjadi et al.]).

[Sajjadi et al.] M. Sajjadi et al. *Object Scene Representation Transformer*. NeurIPS'22.

---
-->

<!-- **Q2. What happens when the number of objects during training is not fixed?** -->

# Summary of Paper Updates (New!)

**Effect of Number of Slots ($K$) and the Number of Objects during Training.** We further consider two more challenging scenarios in Appendix E.5: (1) the number of slots exceeds the number of objects in the scene, and (2) the number of objects differs across training scenes. As shown in Tables 8, 9, and Figure 18, our method maintains its effectiveness in these scenarios.

# Summary of Paper Updates

We thank all reviewers for their insightful feedback and valuable suggestions. In response, we have comprehensively updated the paper and the appendix. Below is a summary of the changes:

---

# New Experimental Results

**[All reviewers] Ablation Studies on Encoder Design.** We have introduced thorough ablation studies in Appendix E.1, elaborated in Table 4 and Figure 12. These studies reveal that substituting a shallow U-Net with DINO is insufficient for separating foreground objects and disentangling foreground from background. Importantly, our design of maintaining a U-Net route alongside DINO features is crucial for enhanced performance and visual fidelity.

**[All reviewers] Ablation Studies on Training Pipeline.** Additional ablation studies in Appendix E.1, detailed in Table 5 and Figure 13, emphasize the significance of the dual-stage training on synthetic datasets for robust generalization to complex real-world scenes. More discussions on the effect of object-prior learning and sampling techniques are also included.
**[jMSr] Additional Zero-Shot Generalizability Analysis.** Our method's zero-shot generalizability is now tested on more challenging datasets, with findings detailed in Appendix E.2 and illustrated in Figure 15. Our approach can accurately identify and segment the chair instances in the scene and deliver plausible reconstruction results.

**[TQPk] Extended Results on Room-Diverse Dataset.** We have included additional experiments on the Room-Diverse dataset in Appendix E.3, with comparative results presented in Table 6. Our method demonstrably surpasses all baseline methods in quantitative performance and visual quality.

**[jMSr] Additional Evaluation Metrics.** Our scene segmentation evaluation has been expanded to include the Average Precision (AP) metric [1] in Appendix E.4, with results detailed in Table 7.

**[TQPk,jMSr,ZWx8] Handling Scenes with Fewer Objects.** We clarify the datasets' composition: all scenes contain exactly $K=4$ objects. However, our method performs effectively in scenes with fewer than $4$ objects. Additional discussions and visualizations are provided in Appendix E.5 and Figure 17.

**[TQPk,jMSr] Introduction of New Real-World Dataset.** The newly curated *Planters* dataset is introduced in Appendix E.6. Sample images and comprehensive evaluation results are available in Figure 19, Table 10, and Figure 20.

**[ZWx8] Experiments with Transparent Objects.** New experimental results for scenes containing transparent objects are presented in Appendix E.7, with visual demonstrations in Figure 21. Our model ignores transparency due to the absence of transparent objects in its training dataset. However, it still demonstrates reasonable object segmentation and reconstruction capabilities.

---

# Paper Writing Updates

**[jMSr] Enhanced Paper Contributions.** The contributions section has been restructured for clarity and brevity, emphasizing our novel three-stage training pipeline. See the revised Section 1 for details.

**[jMSr] Expanded Relevant Paper Discussions.** Discussions about the ONeRF paper have been added to Section 2, enriching our literature review and contextual understanding.

**[ZWx8] Reproducibility Statement.** A new section on reproducibility has been added. We pledge to release all codes, datasets, models, and detailed instructions upon acceptance of the paper, ensuring replicability and transparency.

**[TQPk,ZWx8] Limitation Analysis.** We have added discussions on the limitations of our method in Appendix D, addressing the reviewers' concerns.

---

### References

[1] Yang and Yang. "Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images." In NeurIPS, 2022.

# Response to Reviewer TQPk

Thank you for recognizing the empirical success of our work. We have addressed your main concerns as follows:

---

**Q1. Utilization of DINO encoder**

We note that simply substituting DINO for the shallow U-Net encoder in existing models does not inherently overcome their limitations. To see this, we have conducted comprehensive ablation studies detailed in Appendix E.1. Specifically, Table 4 (also attached below) illustrates how uORF-DINO and uORF-DR (uORF with our dual-route encoder) improve upon the standard uORF, yet still fail to correctly disentangle foreground and background elements. Note that uORF-DR binds all foreground objects to the background, leading to an ARI score of zero. Besides, Figure 12 provides visual comparisons to demonstrate these distinctions in performance.
| Method | ARI$\uparrow$ | FG-ARI$\uparrow$ | NV-ARI$\uparrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| uORF | 0.649 | 0.108 | 0.587 | 24.37 | 0.688 | 0.251 |
| uORF-DINO | 0.688 | 0.677 | 0.656 | 25.33 | 0.739 | 0.240 |
| uORF-DR | 0 | 0 | 0 | 25.38 | 0.698 | 0.322 |
| uOCF | **0.802** | **0.785** | **0.747** | **28.96** | **0.803** | **0.121** |

---

**Q2. Utilization of single-object datasets**

We clarify that in the first stage of training, our model is trained on a single-object synthetic dataset (which can be easily synthesized) to learn general object priors **agnostic to** the real-world dataset's object categories. Therefore, it is **not** akin to the manual process. For instance, we trained our model on synthetic chairs and successfully applied the learned priors to real-world kitchen datasets. We further validate this generalizability using a very simple CLEVR dataset (colored primitives like cubes and spheres) instead of the chair dataset for object prior learning. We added the results in Table 5 and Figure 12 in Appendix E.1. We also attach the table below.

| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: |
| uOCF (adapt from CLEVR) | 27.32 | 0.833 | 0.092 |
| uOCF (adapt from chairs) | 28.29 | 0.842 | 0.069 |

---

**Q3. The necessity of pre-training on synthetic datasets**

The pre-training stages on synthetic datasets are critical to our method's success. This three-stage training pipeline begins with teaching the model the basics of object-centric NeRF, including essential aspects like physical coherence [1], which is crucial to unsupervised segmentation [2]. The subsequent stage then enhances the model's ability to predict object positions and segregate them into individual slots, thereby equipping it for handling complex real-world scenes. Our comparative analysis, detailed in Appendix E.1 and quantitatively supported by the results on the Kitchen-Hard dataset in Table 5 (also attached below), clearly shows the significant performance drop when omitting these initial synthetic dataset training stages. We show qualitative results in Figure 13 in the updated paper.

| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: |
| uORF | 19.23 | 0.602 | 0.336 |
| Omit stage 1&2 | 19.67 | 0.565 | 0.568 |
| Omit stage 2 | 24.52 | 0.769 | 0.157 |
| Omit object-centric sampling | 27.89 | **0.843** | 0.083 |
| uOCF | **28.29** | 0.842 | **0.069** |

---

**Q4. Limitation analysis**

We have added discussions on our method's limitations in Appendix D. The current constraints include limited diversity in object appearance and background complexity in our datasets and challenges in reconstructing foreground objects with complex textures. These limitations, shared with other generalizable NeRF methods, are areas we aim to improve in future work.

---

**Q5. Fixed number of objects in the scene**

While the scenes in our datasets have $K=4$ objects, our method can also reconstruct and discover all objects in scenes with fewer objects. Further details and visual demonstrations are available in Appendix E.6 and Figure 19.

---

**Q6. Why is our background reconstruction quality much higher than uORF's?**

A key strength of our work lies in its superior reconstruction quality. uORF's failure in differentiating between background and foreground elements, particularly in complex datasets like *Kitchen-Hard* and *Room-Texture*, prevents its background NeRF from focusing on background reconstruction. Our approach introduces a critical improvement: the disentanglement of object positions. This not only enhances systematic generalization but also facilitates object-centric prior learning and object-centric sampling. Specifically, object-centric prior learning allows better disentanglement of the background, and object-centric sampling allows for $4\times$ samples with the same amount of computation. This results in a significant improvement in reconstruction quality for both the background environment and foreground objects, setting our method apart from uORF's limitations.
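For intuition on the "$4\times$ samples with the same amount of computation" point, below is a toy sketch of object-centric sampling under the assumption that the samples for an object's radiance field are restricted to a window around the object's estimated 3D position; the function name, the $\pm$`radius` window, and the scene bounds are illustrative placeholders, not the paper's exact procedure.

```python
import torch

def object_centric_ray_samples(rays_o, rays_d, obj_pos, radius, n_samples,
                               near=0.1, far=6.0):
    """Toy sketch: concentrate a fixed sample budget on the interval where each ray
    passes near an object's estimated position, instead of spreading it over [near, far].
    rays_o, rays_d: (R, 3) ray origins and unit directions; obj_pos: (3,) object position.
    """
    # distance along each ray of the point closest to the object center
    oc = obj_pos[None, :] - rays_o                       # (R, 3)
    t_mid = (oc * rays_d).sum(-1)                        # (R,)
    # a +-radius window around that point, clamped to the scene bounds
    t_near = (t_mid - radius).clamp(min=near)
    t_far = (t_mid + radius).clamp(max=far)
    t_far = torch.maximum(t_far, t_near + 1e-3)          # keep the interval non-empty
    # evenly spaced samples inside the per-ray window
    steps = torch.linspace(0.0, 1.0, n_samples, device=rays_o.device)
    z = t_near[:, None] + (t_far - t_near)[:, None] * steps[None, :]   # (R, n_samples)
    pts = rays_o[:, None, :] + z[..., None] * rays_d[:, None, :]       # (R, n_samples, 3)
    return pts, z
```

With the sample budget confined to a small interval around each object, the effective sampling density on the object is several times higher than uniform full-range sampling at the same cost, which is the gist of the efficiency argument above.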
---

**Q7. Results on Room-Diverse dataset**

We have included new experiments on the Room-Diverse dataset in Appendix E.3. The results in Table 6 (also attached below) and Figure 16 show our method's superior performance against all baseline comparisons. Additionally, we have explored another newly curated real-world *Planters* dataset in Appendix E.5, providing a comprehensive analysis with sample images and results in Figures 17, 18, and Table 8.

| Method | ARI$\uparrow$ | FG-ARI$\uparrow$ | NV-ARI$\uparrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| uORF | 0.638 | 0.705 | 0.494 | 25.11 | 0.683 | 0.266 |
| uORF-DINO | 0.692 | 0.555 | 0.633 | 25.50 | 0.698 | 0.239 |
| uORF-DR | 0.742 | 0.653 | 0.680 | 26.00 | 0.707 | 0.209 |
| QBO | 0.724 | 0.716 | 0.618 | 24.49 | 0.680 | 0.182 |
| uOCF | **0.769** | **0.828** | **0.688** | **27.31** | **0.751** | **0.141** |

---

**Q8. Novelty concerns**

We argue that our core contribution is proposing object-centric modeling, which addresses a fundamental bottleneck in generalizing 3D object discovery to real scenes that are more complex than synthetic ones. In particular, existing 3D object discovery methods represent objects in the viewer's coordinate frame, entangling camera extrinsics with object intrinsics. This entanglement prevents them from effectively learning object priors and generalizing to complex scenes. Our object-centric modeling allows object-centric prior learning, instantiated by our 3-stage training pipeline, which shows promising results in generalization, even in a zero-shot setting. Therefore, our innovation mainly lies in object-centric modeling, which further enables prior learning and object-centric sampling.

Regarding the integration of DINO, our new ablation studies (Appendix E.1, Table 4) show that a mere substitution of DINO in existing frameworks cannot overcome their limitations. The three-stage training pipeline, particularly the initial stages on synthetic datasets, is essential for our model's generalization, as evidenced by the substantial performance drop when omitted (Appendix E.1, Table 5). Moreover, the object priors obtained on synthetic datasets are agnostic to the real-world dataset's object categories, and thus can even be learned from simple CLEVR shapes. These results underscore the effectiveness of the proposed object-centric modeling.

---

### References

[1] Spelke. "Principles of object perception." Cognitive Science, 1990.
[2] Chen et al. "Learning to Infer 3D Object Models From Images." In ECCV, 2022.

# Response to Reviewer Qciw

Thank you for recognizing the empirical success and wide applicability of our method. We have addressed your main concerns as follows:

---

**Q1. Ablation studies on the training pipeline.**

We appreciate your suggestion regarding the inclusion of additional ablation studies. In Appendix E.1, we have expanded our analysis to emphasize the critical role of our three-stage training pipeline in achieving robust generalization to complex real-world scenes. The ablation results in Table 5 (also attached below) and Figure 13 demonstrate how omitting the initial synthetic dataset training stages significantly compromises the model's capability to distinguish foreground from background elements. These results confirm the indispensability of each stage in our pipeline for achieving the model's full potential.

| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: |
| uORF | 19.23 | 0.602 | 0.336 |
| Omit stage 1&2 | 19.67 | 0.565 | 0.568 |
| Omit stage 2 | 24.52 | 0.769 | 0.157 |
| Omit object-centric sampling | 27.89 | **0.843** | 0.083 |
| uOCF | **28.29** | 0.842 | **0.069** |

---

**Q2. Ablation studies on encoder design.**

Following your suggestions, we conducted ablation studies on the encoder design in Appendix E.1. The results in Table 4 and Figure 12 indicate that while integrating DINO features enhances performance, the parallel U-Net route remains crucial for the best results. The comparative results on the Room-Texture dataset illustrate this point:

| Method | ARI$\uparrow$ | FG-ARI$\uparrow$ | NV-ARI$\uparrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| uORF | 0.649 | 0.108 | 0.587 | 24.37 | 0.688 | 0.251 |
| uORF-DINO | 0.688 | 0.677 | 0.656 | 25.33 | 0.739 | 0.240 |
| uOCF-DINO | 0.652 | 0.346 | 0.621 | 26.06 | 0.723 | 0.251 |
| uOCF-IM | **0.806** | 0.749 | **0.752** | 27.77 | 0.753 | 0.182 |
| uOCF | 0.802 | **0.785** | 0.747 | **28.96** | **0.803** | **0.121** |

Specifically, while uORF-DINO improves over standard uORF by replacing the shallow U-Net encoder with DINO, it still fails to correctly disentangle foreground and background elements. Besides, uOCF-DINO, which drops the U-Net route and utilizes the DINO encoder only, and uOCF-IM, which substitutes DINO's intermediate layer features for the shallow encoder in uOCF, both achieve inferior performance to uOCF. This reaffirms the superiority of our encoder design.
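To illustrate what a dual-route encoder of this kind could look like, here is a hedged sketch under the assumption that frozen DINO patch features and a shallow, trainable convolutional route are simply resized to a common resolution and concatenated; the class name, channel sizes, and fusion layer are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualRouteEncoder(nn.Module):
    """Illustrative dual-route encoder sketch (assumed structure, for intuition only).

    One route is a frozen pre-trained DINO backbone providing semantic patch features
    (passed in here as `dino_feats`); the other is a shallow convolutional route trained
    from scratch that preserves fine spatial detail. The two feature maps are aligned in
    resolution and fused by concatenation.
    """
    def __init__(self, dino_dim=384, shallow_dim=64, out_dim=256):
        super().__init__()
        self.shallow = nn.Sequential(   # shallow, U-Net-like route (only the downsampling half here)
            nn.Conv2d(3, shallow_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(shallow_dim, shallow_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(dino_dim + shallow_dim, out_dim, 1)

    def forward(self, image, dino_feats):
        # image: (B, 3, H, W); dino_feats: (B, dino_dim, h, w) patch features from a frozen DINO
        local = self.shallow(image)                                   # (B, shallow_dim, H/4, W/4)
        semantic = F.interpolate(dino_feats, size=local.shape[-2:],   # align resolutions
                                 mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([semantic, local], dim=1))         # fused feature map
```

The two ablated variants mentioned above then correspond to dropping one of the routes: uOCF-DINO would keep only `dino_feats`, while the shallow route alone would miss the semantic grouping cues that DINO provides.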
---

**Q3. Handling the pose ambiguity.**

We appreciate your focus on the challenge of pose ambiguity, especially regarding symmetric objects. As detailed in Section 3, our method primarily targets the inference of object-centric NeRFs and object positions in 3D scenes from single images, excluding the disentanglement of object orientation. This choice is based on (1) the absence of a universal canonical orientation for diverse object categories, particularly the symmetric objects you have mentioned; (2) the technical challenges in accurate orientation estimation from varied real-world images; and (3) empirical evidence (as detailed in Appendix B) suggesting our model's proficient learning of object scale and orientation after position disentanglement. This approach enables our method to effectively handle real-world scene complexities without delving into the currently challenging problem of pose estimation.

# Response to Reviewer jMSr

Thank you for recognizing the empirical success and wide applicability of our method. We have addressed your main concerns as follows:

---

**Q1. Concerns on motivation**

A key limitation of current object representation methods is their dependence on the viewer's frame, where even minor positional shifts or camera movement can significantly alter an object's latent representation. This entanglement of intrinsic attributes with extrinsic properties limits generalization in complex real-world scenes (Figure 1). Furthermore, as highlighted in recent 2D object-centric learning literature [1,2] and the success of convolutional networks, position-invariance is essential for systematic generalization. This point is also emphasized in recent 3D supervised learning literature [3]. Therefore, we propose to disentangle object position and introduce the concept of position-invariant object-centric NeRFs. This disentanglement not only allows learning generalizable object priors from category-agnostic synthetic data (object-centric prior learning) but also improves sample efficiency (object-centric sampling). Extensive experiments in Section 4 and Appendix E validate the superiority of our method on our newly curated datasets.

---

**Q2. How to integrate local and global features?**

In Section 3.4, we detail our approach to integrating local and global features. We utilize $f_g$ as the "queries" for the latent extraction module and subsequently concatenate its outputs with an attention-weighted mean of $f_l$. This results in the final slot latents $\{\mathbf{z}_i\}_{i=0}^K$, where the attention weights are also derived from the latent extraction module. This design efficiently balances global and local information, maximizing the utility of pre-trained models.
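A minimal sketch of the combination step described above, assuming the latent extraction module has already produced per-slot outputs and attention maps; the function name and tensor shapes are assumptions for illustration.

```python
import torch

def build_slot_latents(slot_out, attn, f_l):
    """Sketch of the described combination (assumed shapes, not the exact module).

    slot_out: (B, K, D_g)  outputs of the latent extraction module, queried by f_g
    attn:     (B, K, N)    attention weights over the N feature locations, per slot
    f_l:      (B, N, D_l)  local (per-location) features
    Returns the final slot latents z: (B, K, D_g + D_l).
    """
    w = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)      # normalize the weights per slot
    local_summary = torch.einsum('bkn,bnd->bkd', w, f_l)    # attention-weighted mean of f_l
    return torch.cat([slot_out, local_summary], dim=-1)     # concatenate -> z_i
```

Concatenation keeps the globally queried summary and the locally pooled descriptor in separate channels of each $\mathbf{z}_i$, rather than mixing them by summation.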
---

**Q3. Inferring the object's position in world coordinates**

Instead of learning the object position in world coordinates $p_i^{wd}$, our approach first estimates an object's position on the image plane $p_i^{img}$ by an attention-weighted mean over a spatial grid, which is then converted to world coordinates using the intersection of the ray with the ground plane. A bias term is added to $p_i^{img}$ to adjust for discrepancies in this conversion. Since the position estimation happens in image space instead of 3D space, our approach does not need to learn separate coordinate systems for each image.
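The two-step estimate can be sketched as follows, under assumed conventions (a ground plane at $z=0$ and a hypothetical `ray_dirs_fn` that maps image-plane coordinates to world-space ray directions); this is meant to illustrate the idea, not to reproduce the paper's implementation.

```python
import torch

def estimate_object_positions(attn, grid, bias, cam_pos, ray_dirs_fn):
    """Illustrative sketch of image-plane position estimation + ray/ground intersection.

    attn:        (K, N) attention weights of each slot over N image locations
    grid:        (N, 2) normalized image-plane coordinates of those locations
    bias:        (K, 2) learned correction added to the image-plane estimate
    cam_pos:     (3,)   camera origin in world coordinates
    ray_dirs_fn: hypothetical helper mapping image-plane coords (K, 2) -> ray directions (K, 3)
    Returns world-space positions (K, 3), assuming a ground plane at z = 0.
    """
    w = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
    p_img = w @ grid + bias                      # attention-weighted mean + bias, (K, 2)
    d = ray_dirs_fn(p_img)                       # rays through the estimated image points
    t = -cam_pos[2] / d[:, 2]                    # solve (cam_pos + t * d).z = 0
    return cam_pos[None, :] + t[:, None] * d     # (K, 3) positions on the ground plane
```

Because $p_i^{img}$ lives in image space, the same estimator applies to any image without learning a per-image coordinate system, which is the point made above.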
---

**Q4. How to tackle under/over-segmentation? How to handle objects visible in multi-view images but invisible in the input image, and vice versa?**

Under/over-segmentation is a common challenge in existing approaches, often resulting in binding multiple objects into a single foreground slot (missing objects) or all objects into the background slot (blurry object reconstruction), as visualized in Figures 5 and 7. The technical contribution of our work is exactly to help address this challenge. In particular, we introduce the generalizable object-centric prior learning pipeline. This pipeline begins with learning object priors from a synthetic dataset. The learned object priors might include generalizable visual cues such as physical coherence [4], which is critical to unsupervised segmentation [5]. The subsequent stage refines the model's ability to predict object positions and segregate them into individual slots. As shown in Figure 7, our method can effectively separate and reconstruct individual objects in real scenes, even when the object is largely occluded.

However, even though our method can tackle object occlusion, it cannot handle objects that are invisible in the input image, as the model cannot discover them. Conversely, when the object is visible in the input image but invisible in the multi-view images, our model can still discover and reconstruct it. This property is naturally inherited from the volumetric rendering technique utilized in NeRF.
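For reference, the standard NeRF compositing rule along a ray $\mathbf{r}$ with samples $i = 1, \dots, S$, densities $\sigma_i$, colors $\mathbf{c}_i$, and spacings $\delta_i$ is

$$
\hat{C}(\mathbf{r}) = \sum_{i=1}^{S} T_i\,\bigl(1 - \exp(-\sigma_i \delta_i)\bigr)\,\mathbf{c}_i,
\qquad
T_i = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr).
$$

Any region assigned non-zero density from the input view contributes to this sum for every rendered viewpoint, which is why an object discovered in the input image can still be rendered in novel views where it happens to be unobserved.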
---

**Q5. Effect of object-centric prior learning and sampling**

We have discussed the effect of object-centric prior learning in previous responses. As for empirical results, ablation studies (Appendix E.1) show that omitting the synthetic dataset training stages drastically reduces performance. Meanwhile, our object-centric sampling technique enables $4\times$ samples within the same computational budget, significantly enhancing background reconstruction quality. Key results are summarized below, with full details in Tables 4, 5, and Figures 12, 13.

| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: |
| uORF | 19.23 | 0.602 | 0.336 |
| Omit stage 1&2 | 19.67 | 0.565 | 0.568 |
| Omit stage 2 | 24.52 | 0.769 | 0.157 |
| Omit object-centric sampling | 27.89 | **0.843** | 0.083 |
| uOCF | **28.29** | 0.842 | **0.069** |

---

**Q6. Ablation studies on other model components**

Additional ablation studies (Appendix E.1) reveal that while substituting DINO for a shallow U-Net encoder improves performance, it does not inherently overcome its limitations. Specifically, Table 4 (also attached below) illustrates how uORF-DINO and uORF-DR (uORF with our dual-route encoder) improve upon the standard uORF, yet still fail to correctly disentangle foreground and background elements. Note that uORF-DR binds all foreground objects to the background, leading to an ARI score of zero. Besides, uOCF-DINO, which drops the U-Net route and utilizes the DINO encoder only, and uOCF-IM, which substitutes DINO's intermediate layer features for the shallow encoder in uOCF, both achieve inferior performance to uOCF. This reaffirms the superiority of our encoder design.

| Method | ARI$\uparrow$ | FG-ARI$\uparrow$ | NV-ARI$\uparrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| uORF | 0.649 | 0.108 | 0.587 | 24.37 | 0.688 | 0.251 |
| uORF-DINO | 0.688 | 0.677 | 0.656 | 25.33 | 0.739 | 0.240 |
| uORF-DR | 0 | 0 | 0 | 25.38 | 0.698 | 0.322 |
| uOCF-DINO | 0.652 | 0.346 | 0.621 | 26.06 | 0.723 | 0.251 |
| uOCF-IM | **0.806** | 0.749 | **0.752** | 27.77 | 0.753 | 0.182 |
| uOCF | 0.802 | **0.785** | 0.747 | **28.96** | **0.803** | **0.121** |

---

**Q7. Number of objects in the scene.**

While the scenes in our datasets have $K=4$ objects, our method can also reconstruct and discover all objects in scenes with fewer objects. Further details and visual demonstrations are available in Appendix E.6 and Figure 19.

---

**Q8. Evaluation on more challenging datasets**

Given that ScanNet does not offer textured meshes or calibrated images, we use the HM3D dataset [6], a dataset focused on building-scale indoor 3D reconstruction, for evaluating more complex real-world scenes. We conducted our evaluation using zero-shot inference, with detailed results presented in Appendix E.2 and illustrated in Figure 15. Our model, pre-trained on the Room-Texture dataset, demonstrates robustness in discovering and segmenting chair instances even in unseen complex scenes.

---

**Q9. Additional evaluation metrics**

We have adopted the official AP computation code from [7] and included the AP evaluation metric in Appendix E.4, with detailed results in Table 7 (also attached below). Specifically, we consider two kinds of APs: Input view-AP for the input view, and Novel view-AP for novel views. Our method's high AP scores demonstrate its superior scene segmentation capabilities.

| Metric | uORF | QBO | COLF | uOCF |
| :--- | :---: | :---: | :---: | :---: |
| Input view-AP | 0.005 | 0.359 | 0.315 | **0.782** |
| Novel view-AP | 0.001 | 0.195 | 0.015 | **0.770** |

---

**Q10. Performance analysis with comparison methods**

A key strength of our work lies in its superior reconstruction quality. In contrast, uORF struggles to distinguish between background and foreground elements, particularly in complex datasets such as *Kitchen-Hard* and *Room-Texture*, resulting in low segmentation scores. This difficulty extends to uORF's background NeRF, leading to blurry reconstructions due to its inability to effectively focus on background elements. Our approach builds upon uORF but introduces a critical enhancement: the disentanglement of object positions. This advancement not only enhances systematic generalization but also facilitates the application of techniques like object-centric prior learning and sampling. Specifically, object-centric prior learning allows better disentanglement of the background, and object-centric sampling allows for $4\times$ samples with the same amount of computation. This results in a significant improvement in reconstruction quality for both the background environment and foreground objects, setting our method apart from uORF's limitations.

---

**Q11. Minor issues**

Following your advice, we have revised the contributions for clarity and conciseness, updated the mathematical notations for positional encodings, and added discussions about the ONeRF paper.

---

### References

[1] Biza et al. "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames." In ICML, 2023.
[2] Traub et al. "Learning What and Where: Disentangling Location and Identity Tracking Without Supervision." In ICLR, 2023.
[3] Fuchs et al. "SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks." In NeurIPS, 2020.
[4] Spelke. "Principles of object perception." Cognitive Science, 1990.
[5] Chen et al. "Learning to Infer 3D Object Models From Images." In ECCV, 2022.
[6] Ramakrishnan et al. "Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI." arXiv, 2021.
[7] Yang and Yang. "Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images." In NeurIPS, 2022.

# Response to Reviewer ZWx8

Thank you for recognizing the empirical success, wide generalizability, and broad applicability of our method. We have addressed your main concerns as follows:

---

**Q1. Ablation studies on encoder design, training pipeline, and object-centric sampling.**

We provide additional ablation studies in Appendix E.1 following your suggestions. These studies justify the necessity of our three-stage training pipeline, showing that omitting the synthetic dataset training stages significantly impairs the model's performance. Additionally, our findings illustrate that the integration of DINO features, while beneficial, requires the concurrent use of a U-Net route for better performance. Ablation studies on object-centric sampling also confirm its role in enhancing background reconstruction. We provide a snapshot of the results below, and the full details are available in Tables 4 and 5 and Figures 12 and 13.
**Ablation Studies on Encoder Design**

| Method | ARI$\uparrow$ | FG-ARI$\uparrow$ | NV-ARI$\uparrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| uORF | 0.649 | 0.108 | 0.587 | 24.37 | 0.688 | 0.251 |
| uORF-DINO | 0.688 | 0.677 | 0.656 | 25.33 | 0.739 | 0.240 |
| uOCF-DINO | 0.652 | 0.346 | 0.621 | 26.06 | 0.723 | 0.251 |
| uOCF | **0.802** | **0.785** | **0.747** | **28.96** | **0.803** | **0.121** |

**Ablation Studies on Training Pipeline**

| Method | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ |
| :--- | :---: | :---: | :---: |
| uORF | 19.23 | 0.602 | 0.336 |
| Omit stage 1&2 | 19.67 | 0.565 | 0.568 |
| Omit stage 2 | 24.52 | 0.769 | 0.157 |
| Omit object-centric sampling | 27.89 | **0.843** | 0.083 |
| uOCF | **28.29** | 0.842 | **0.069** |

As for the most crucial component, we emphasize the integration of position disentanglement and a three-stage training pipeline. The former enables the model to learn object-centric NeRFs and object positions in 3D scenes from single images, while the latter enables generalization to complex real-world scenes. These two components are complementary and indispensable for our method's success.

---

**Q2. Limitation analysis**

We have added discussions on our method's limitations in Appendix D. The current constraints include limited diversity in object appearance and background complexity in our datasets and challenges in reconstructing foreground objects with complex textures. These limitations, shared with other generalizable NeRF methods, are areas we aim to improve in future work.

---

**Q3. Explanation of training configurations**

Details on datasets and training configurations are provided in Appendix C. Additionally, to facilitate reproducibility, we have dedicated a section in our paper and committed to releasing all codes, datasets, models, and detailed instructions upon paper acceptance.

---

**Q4. Hyperparameter $K$ (number of objects)**

While the scenes in our datasets have $K=4$ objects, our method can also reconstruct and discover all objects in scenes with fewer objects. Further details and visual demonstrations are available in Appendix E.6 and Figure 19.

---

**Q5. Results on transparent objects**

As for transparent objects, we have included new experiments in Appendix E.7, with visual results in Figure 20. Our model ignores transparency due to the absence of transparent objects in its training dataset. However, it still demonstrates reasonable object segmentation and reconstruction capabilities.

---

**Q6. Geometric quality of learned object representations**

Recovering object geometric shapes from NeRFs is indeed a challenging problem. NeRF primarily represents objects using density values for rendering, which does not explicitly encode object geometry. This can lead to inaccuracies, especially in complex geometries or scenarios with limited view coverage. However, significant advancements have been made in this area. Recent research efforts have focused on enhancing NeRF's capability to reconstruct object geometry more accurately. These include improved sampling strategies [1], integration of geometric constraints [2], and hybrid approaches that blend traditional 3D representations with NeRF [3]. Integrating these advancements into our method is a promising direction for future work.
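As one concrete example of bridging density fields and explicit geometry (a generic post-processing step, not part of uOCF), a learned density field can be sampled on a regular grid and converted to a mesh with marching cubes; `density_fn` and the iso-level below are hypothetical.

```python
import numpy as np
from skimage import measure

def extract_mesh(density_fn, bound=1.0, res=128, level=10.0):
    """Generic sketch: turn a learned density field into an explicit mesh.

    density_fn is a hypothetical callable mapping (N, 3) points to (N,) densities;
    `level` is the density iso-value treated as the surface, chosen empirically.
    """
    xs = np.linspace(-bound, bound, res)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing='ij'), axis=-1)   # (res, res, res, 3)
    sigma = density_fn(grid.reshape(-1, 3)).reshape(res, res, res)
    verts, faces, _, _ = measure.marching_cubes(sigma, level=level)
    verts = verts / (res - 1) * (2 * bound) - bound                    # index -> world coords
    return verts, faces
```

The quality of such a mesh inherits the limitations noted above (density is optimized for rendering, not surfaces), which is why the geometry-oriented methods cited as [1]-[3] are the more principled routes for future work.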
"ShaRF: Shape-conditioned Radiance Fields from a Single View". In ICML, 2021. [3] Yariv et al, "Volume Rendering of Neural Implicit Surfaces". In NeurIPS, 2021.
