# ICLR22 rebuttal
### Summary of Revisions
We thank the reviewers again for their constructive reviews and helpful suggestions. We have incorporated their feedback as revisions to our paper (highlighted in blue in the updated submission). We summarize the revisions below:
### Experiments
- [BYk3,tkSe] In Appendix D, we demonstrate our model's inference on a single real photo. The results suggest that our model can reasonably generalize to real images.
- [BYk3] In Appendix D, we analyze the sensitivity to slot initializations. We observe that our model can obtain good results across different slot initializations.
- [tkSe] In Appendix D, we discuss testing results on unseen shapes, indicating that our model generalizes well to unseen object shapes.
- [pN74] In Appendix D, we perform an ablation study on the foreground locality box. A visual comparison demonstrates that the foreground box prevents foreground object slots from reconstructing background segments.
### Writing
- [pN74] In the experiment section, we include a discussion of how our improved segmentation results stem from the disjoint background/foreground modeling.
- [pN74] In the method section, we introduce why and how we set the foreground locality box.
In individual responses, we also answered questions from each reviewer and made additional clarifications. We hope that our responses and revisions are helpful. We sincerely appreciate your time and efforts in helping us to improve our paper. Please don't hesitate to let us know if you have any additional questions!
### Reviewer BYk3
Thank you for your constructive review and helpful suggestions!
**Q1**: Relation to GIRAFFE.
**A1**: We would like to clarify that although our work is related to GIRAFFE, it addresses a **fundamentally different** problem than GIRAFFE, as detailed in Appendix E:
- GIRAFFE targets unconditional generation; it is not designed for, and therefore **cannot** tackle, the problem of inferring object representations from (or conditioned on) a single image or multiple images. In contrast, we focus on such inference from a single image of a multi-object scene.
- Technically, our model preserves the multi-view consistency of 3D object representations, and our progressive training mechanism further helps to produce output at a higher resolution. In contrast, GIRAFFE's feature-field technique trades multi-view consistency for fine details; their generated objects may look different across views, which is why GIRAFFE cannot be used for novel view synthesis (see Figure 25).
In Appendix E, we have also shown comparisons with GIRAFFE to demonstrate these fundamental differences.
- In Appendix E.1, we show that the single-image inference problem is highly non-trivial and GIRAFFE cannot do it. The observation is that GIRAFFE cannot even do input-view reconstruction, let alone novel view synthesis or segmentation in 3D.
- In Appendix E.2, we show that GIRAFFE cannot do wide-baseline novel view synthesis even using author-provided pretrained models. This is because their neural renderer sacrifices 3D consistency for higher-resolution images.
- In Appendix E.3, we analyze why GIRAFFE cannot do inference and novel view synthesis.
If possible, we would sincerely appreciate your time in reviewing Appendix E of our submission. We hope that it will resolve your concern. We also understand that this clarification might be critical, and we are happy to move the discussion to the main paper if needed.
**Q2**: Scene complexity and the quality of segmentation and re-arrangement results.
**A2**: Regarding applicability to real scenes, we show a demonstration in Figure 10 in our updated submission. As we can see, our model obtains reasonable inference results even on real images, including sensible segmentations. Note that GIRAFFE is not able to perform such inference, even on synthetic images (see our answer above and Appendix E). Other previous methods cannot infer unsupervised factorized 3D scene representations from a single synthetic image, either. Thus, we wish to clarify that our proposed method is indeed making progress on a challenging task. We certainly agree that the segmentation and re-arrangement quality can always be further improved, and we look forward to continuing to push this frontier.
**Q3**: Is the method robust to different initial centers?
**A3**: Yes! Following your suggestion, we have tested the robustness of our model on the Room-Chair dataset. For each test scene, we now use 5 different random seeds for sampling the initial centers. We compute the mean $\mu$ and std $\sigma$ of ARI (segmentation metric) over the 5 seeds, and average them over 500 test scenes. The averaged mean $\bar\mu$ of ARI is $78.2\%$ and $\bar\sigma$ is $1.7\%$. The mean ARI indicates good segmentation results (similar to the $78.8\%$ reported in Table 2 of our main paper), and $\bar\sigma=1.7\%$ indicates that our method achieves consistent results across different initial centers. We have added this analysis to a section of Appendix D in our updated submission (highlighted in blue).
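For reference, below is a simplified sketch of this evaluation protocol. The `run_inference` and `compute_ari` callables are hypothetical placeholders for our model's inference routine and the ARI metric; the actual implementation in our code differs in its details.
```python
import numpy as np

def seed_robustness(test_scenes, run_inference, compute_ari, num_seeds=5):
    """For each scene, run inference with `num_seeds` random seeds for the
    initial slot centers, compute ARI per seed, and take the per-scene mean
    and std; finally, average both statistics over all test scenes.
    `run_inference` and `compute_ari` are hypothetical placeholders."""
    per_scene_mean, per_scene_std = [], []
    for scene in test_scenes:
        aris = np.array([compute_ari(run_inference(scene, seed=s), scene)
                         for s in range(num_seeds)])
        per_scene_mean.append(aris.mean())
        per_scene_std.append(aris.std())
    # Averaged mean (mu-bar) and std (sigma-bar) of ARI over the testset.
    return float(np.mean(per_scene_mean)), float(np.mean(per_scene_std))
```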
Thanks again for your constructive suggestions. We hope that our responses are helpful in addressing your concerns. Please don't hesitate to let us know if there are any additional questions.
### Reviewer tkSe
Thank you for your constructive review and helpful suggestions!
**Q1**: Testing on unseen shapes would be informative.
**A1**: Thank you for this suggestion! We followed it and constructed another testset for Room-Diverse. All test objects are drawn from a pool of 500 ShapeNet chairs that is completely disjoint from the training objects. All other settings are the same as in the original testset. The results are:
Setting|ARI$\uparrow$ (segmentation metric)|LPIPS$\downarrow$ (novel view synthesis metric)
:-|:-:|:-:
Ours, tested on the seen-shape testset| 65.61 | **0.1729**
Ours, tested on the unseen-shape testset| **66.10**| 0.1771
As we can see, the results on seen and unseen shapes are very close, suggesting that our model generalizes well to unseen object shapes. We have added these results to Appendix D and highlighted the new content in blue.
**Q2**: A demonstration on real-world scenes would be a strong result.
**A2**: Thank you for this suggestion! We took photos with a cellphone and tested our model (trained on the Room-Diverse dataset) on these real photos. We show the results in Appendix D, Figure 10, in our updated submission (new content is highlighted in blue). As we can see, our model obtains reasonable results even on real images.
**Q3**: How is the number of foreground objects decided?
**A3**: Based on Slot Attention [Locatello, NeurIPS'20], we actually do not need to decide the number of foreground objects. The model learns a prior distribution and samples $K$ slot initializations from it; the grouping process then allocates them to individual objects. Empty slots are allowed, so $K$ can be seen as the maximum number of objects in the scene. Further, this design makes it possible to set a different $K$ during training and testing.
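For concreteness, below is a simplified sketch of this Slot-Attention-style initialization; the module and parameter names are illustrative and not our exact implementation.
```python
import torch
import torch.nn as nn

class SlotInit(nn.Module):
    """Learned Gaussian prior over foreground slots, following Slot Attention.
    K only controls how many samples are drawn, so it can differ between
    training and testing (e.g., K=7 during training, K=11 at inference)."""
    def __init__(self, slot_dim=64):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(1, 1, slot_dim))
        self.log_sigma = nn.Parameter(torch.zeros(1, 1, slot_dim))

    def forward(self, batch_size, num_slots):
        # Sample K slot vectors per scene; empty slots are allowed, so
        # num_slots acts as an upper bound on the number of objects.
        noise = torch.randn(batch_size, num_slots, self.mu.shape[-1],
                            device=self.mu.device)
        return self.mu + self.log_sigma.exp() * noise
```
In this sketch, `SlotInit(slot_dim=64)(batch_size=1, num_slots=11)` would draw 11 slot initializations at test time even if training used 7.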
We have demonstrated such generalization through experiments. On CLEVR-567 (each scene has 5--7 objects), we set the number of slots $K$ to 7 throughout training. Then, we tested our model (trained on CLEVR-567 with $K=7$ slots) on scenes with 11 packed objects by setting $K$ to 11 during inference. Our model performs reasonably well under such generalization, as shown in Table 5 in the main paper and Figure 19 in the updated Appendix (previously Figure 17 in the original Appendix).
Thanks again for your constructive suggestions. We hope that our responses are helpful in answering your questions. Please don't hesitate to let us know if there are any additional questions. We also understand that some of the results currently in the appendix may be critical, and we are happy to move them to the main paper if needed.
### Reviewer pN74
Thank you for your constructive review and helpful suggestions!
**Q1**: Discussion of Stelzner et al. (2021)
**A1**: Thank you for bringing this up. We fully agree with you on the importance of discussing the concurrent work by Stelzner et al. (2021), which is why we clarified the relevance and differences in the related work section, even though it has not been published. We also wished to include a comparison, but could not do so for the exact reason you mentioned: the code/data for their approach is not publicly available, making comparisons extremely difficult. We appreciate your understanding. Should the code/data of Stelzner et al. (2021) be made available in the future, we will be more than happy to include such a comparison.
**Q2**: How many samples per ray are used?
**A2**: We use 64 samples per ray. This detail is given in Appendix B.4, and we are happy to move it to the main paper.
**Q3**: Do you still use two networks for ray evaluation, as in NeRF?
**A3**: No, we use a single network for better efficiency: our model requires only one forward pass through a single network to render a pixel.
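To illustrate, here is a simplified sketch of single-network rendering; the function names and compositing details are illustrative, not our exact code.
```python
import torch

def render_pixel(model, ray_pts, deltas):
    """Single-network rendering: one forward pass per ray, with no separate
    coarse/fine networks as in the original NeRF."""
    density, color = model(ray_pts)                     # (S,), (S, 3) for S samples
    alpha = 1.0 - torch.exp(-density * deltas)          # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])      # accumulated transmittance
    weights = alpha * trans
    return (weights.unsqueeze(-1) * color).sum(dim=0)   # composited RGB
```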
**Q4**: Would GRAF's sampling method work for your method?
**A4**: Thanks for the suggestion! We had actually considered and tried this idea, and it turns out that GRAF's sampling is suboptimal in our setup. Specifically, if we use their sampling with a stride of $1$, it is effectively the same as our patch-based fine training; with a stride $>1$, we have observed that their sampling pattern (as shown in Fig. 3 of the GRAF paper) introduces aliasing in the rendered outputs. This is not an issue in GRAF, because their GAN-style setup trains a discriminator together with the renderer. In contrast, we use a perceptual loss computed by a network pretrained on ImageNet; because this network has not seen such aliasing, the perceptual loss becomes less effective and leads to suboptimal reconstructions.
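For clarity, the stride pattern we refer to can be sketched as follows; GRAF samples the patch scale and position continuously, so the integer stride here is purely for illustration.
```python
import torch

def strided_patch_coords(img_h, img_w, patch_size=32, stride=2):
    """A patch_size x patch_size grid of rays whose pixel spacing is `stride`.
    With stride=1 this reduces to a contiguous patch (as in our patch-based
    fine training); with stride>1 the image is subsampled, which is where we
    observed aliasing in the rendered patches."""
    max_u = img_w - (patch_size - 1) * stride - 1
    max_v = img_h - (patch_size - 1) * stride - 1
    u0 = torch.randint(0, max_u + 1, (1,)).item()
    v0 = torch.randint(0, max_v + 1, (1,)).item()
    us = u0 + stride * torch.arange(patch_size)
    vs = v0 + stride * torch.arange(patch_size)
    return torch.stack(torch.meshgrid(vs, us, indexing="ij"), dim=-1)  # (P, P, 2)
```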
**Q5**: Are the better segmentation results due to two disjoint latent spaces for foreground and background?
**A5**: Yes, this aligns with our ablation studies. From the qualitative comparison in Figure 4, we see that Slot Attention and "ours (w/o background modeling)" exhibit similar artifacts. Furthermore, Table 2 shows that "ours (w/o background modeling)" performs at a similarly low level ($11.7\%$ ARI on CLEVR-567) as Slot Attention ($3.5\%$ ARI), whereas "ours" ($83.7\%$) is much better than both. This observation is consistent across all datasets. We have added this analysis to the experiment section in our updated submission (highlighted in blue).
**Q6**: How do you modify foreground object pose/appearance in scene editing?
**A6**: We have shown two types of modification in the submission: changing object position (translation) and adding/removing objects. Changing an object's position, as shown in Figure 6, is done by imposing a displacement field on the object to be moved. Specifically, to move an object by a displacement of $[dx,dy,dz]=[1,2,3]$, we inversely displace all query-point coordinates fed to the object's radiance field by $[-1,-2,-3]$. Adding/removing objects is done by adding/removing the latent embedding of a foreground object, as shown in Figure 1. It is also possible to rotate an object's radiance field. Directly rotating the field would make the object rotate around the scene center in the world coordinate frame. To make the object rotate in its own coordinate frame, we can first compute the barycenter of the density function learned by the radiance field to identify the object's center, and then apply the rotation about that center.
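Below is a minimal sketch of the translation edit and of computing the object center for rotation; the `radiance_field` interface returning `(density, color)` is an assumption for illustration.
```python
import torch

def translated_object_query(radiance_field, query_pts, displacement):
    """Move an object by `displacement` in world space by evaluating its
    radiance field at inversely displaced query points, e.g.,
    displacement = [1., 2., 3.] shifts the queries by [-1., -2., -3.]."""
    shifted = query_pts - torch.as_tensor(displacement, dtype=query_pts.dtype)
    return radiance_field(shifted)

def object_center(radiance_field, grid_pts):
    """Barycenter of the learned density over a dense sampling grid; rotating
    about this point makes the object spin in its own frame rather than
    around the scene center."""
    density, _ = radiance_field(grid_pts)          # density: (N,), grid_pts: (N, 3)
    weights = density / density.sum().clamp(min=1e-8)
    return (weights.unsqueeze(-1) * grid_pts).sum(dim=0)
```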
**Q7**: How does the foreground box affect the final results? This should be explained in the main text.
**A7**: Thanks for your suggestion! We set the foreground box in early training to prevent foreground slots from representing the background. Without the foreground box, the foreground NeRF can attach a piece of the background to an object slot. We have added a visual comparison demonstrating this effect in Figure 8 in our updated submission, and we have also included the explanation in the main text (highlighted in blue).
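A simplified sketch of this masking is shown below; the zeroing-out form and the box parameters are illustrative, and the exact formulation in our implementation may differ.
```python
import torch

def apply_locality_box(density, query_pts, box_min, box_max):
    """Zero out the density a foreground slot predicts for points outside an
    axis-aligned box around the foreground region, so that during early
    training foreground slots cannot reconstruct background segments."""
    box_min = torch.as_tensor(box_min, dtype=query_pts.dtype)
    box_max = torch.as_tensor(box_max, dtype=query_pts.dtype)
    inside = ((query_pts >= box_min) & (query_pts <= box_max)).all(dim=-1)
    return density * inside.to(density.dtype)
```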
Thanks again for your constructive comments. We hope that our responses are helpful in answering your questions. Please don't hesitate to let us know if there are any additional questions.