## Global response (common questions) to all reviewers
We would like to thank the reviewers, who put considerable time and thought into helping us improve our paper.
We are pleased that the reviewers find our paper "very well-written and easy to follow" (R-sM4b, R-mHBD, R-DWmW); our method being described as "novel" (R-DWmW), "fully automated" (R-B4nT), "impressive, well-thought-of and designed" (R-sM4b), and "systematic and comprehensive" (R-mHBD), with a "solid implementation" (R-7hAU); our experiments were commended as "extensive" (R-DWmW), "solid" (R-B4nT), backed by "supportive ablations" (R-7hAU, R-mHBD), demonstrating "effectiveness" (R-sM4b), achieving "state-of-the-art performance" (R-sM4b, R-mHBD), and notably improving "monocular 3D detection by a large margin" (R-DWmW).
Please find below our summary of major changes and response to some common questions. We will incorporate these major changes to our revised paper.
**Summary of major changes**:
1. [7hAU, DWmW, mHBD] We add new experiments on the ScanNet [1] dataset to show that 3D Copy-Paste also improves monocular 3D object detection performance on other datasets.
2. [7hAU, B4nT] We evaluate our monocular 3D object detection performance on SUN RGB-D with mAP_0.15 and show consistent improvements.
3. [sM4b] We add detailed results of each individual object category on the SUN RGB-D dataset.
**Experiments on ScanNet dataset**:
To show generalization to other datasets, we apply our 3D Copy-Paste to ScanNet [1] and conduct monocular 3D object detection with ImVoxelNet. ScanNet is a large-scale RGB-D video dataset that is not specifically tailored for monocular 3D object detection. Here are the detailed experiments and results (we use ScanNet v2):
[*Adapt to monocular setting*] ScanNet contains 1,201 videos (scenes) in the training set and 312 videos (scenes) in the validation set. To adapt it for monocular 3D object detection, we use one RGB-D image per video, i.e., 1,201 RGB-D images for training and 312 for validation. We derive the ground-truth 3D bounding box labels for each selected view from the provided video (scene)-level labels, because some objects in the scene may not be visible in a given monocular view (a minimal sketch of this step follows the training details below).
[*Training and test*] For the baseline, we train an ImVoxelNet monocular 3D object detection model on the training set and test on the validation set. For our method, 8 of the 18 ScanNet classes (sofa, bookshelf, chair, table, bed, desk, toilet, bathtub) overlap with our collected Objaverse data (main paper Table 1). We use our 3D Copy-Paste to augment the training set and train an ImVoxelNet. All training parameters are the same as for training on the SUN RGB-D dataset. We will release all the code.
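To make the per-view label derivation concrete, below is a minimal sketch (function and variable names are hypothetical, and the visibility criterion is simplified relative to our actual preprocessing): we keep a scene-level box if enough of its corners project into the selected frame, with an optional depth-based occlusion check.

```python
import numpy as np

def boxes_visible_in_view(scene_boxes, box_corners, world_to_cam, K, depth, min_visible=0.3):
    """Keep scene-level 3D boxes whose corners are sufficiently visible in one view.

    scene_boxes : list of per-scene 3D box annotations (any format).
    box_corners : (N, 8, 3) box corner coordinates in the world frame.
    world_to_cam: (4, 4) extrinsic matrix of the selected RGB-D frame.
    K           : (3, 3) camera intrinsics.
    depth       : (H, W) depth map of the frame, in meters.
    min_visible : fraction of corners that must be visible to keep a box.
    """
    H, W = depth.shape
    kept = []
    for box, corners in zip(scene_boxes, box_corners):
        # Transform corners to the camera frame.
        hom = np.concatenate([corners, np.ones((8, 1))], axis=1)   # (8, 4)
        cam = (world_to_cam @ hom.T).T[:, :3]                      # (8, 3)
        in_front = cam[:, 2] > 0.1
        # Project to pixel coordinates.
        uv = (K @ cam.T).T
        uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
        in_image = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        visible = in_front & in_image
        # Optional occlusion check: a corner far behind the sensor depth is occluded.
        for i in np.where(visible)[0]:
            u, v = int(uv[i, 0]), int(uv[i, 1])
            if depth[v, u] > 0 and cam[i, 2] > depth[v, u] + 0.5:
                visible[i] = False
        if visible.mean() >= min_visible:
            kept.append(box)
    return kept
```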
We report AP_0.25 for the 8 overlapping classes, and their mean, in the table below. Our 3D Copy-Paste improves ImVoxelNet by 2.8 mAP points.
|ScanNet AP_0.25 | Average (mAP)|bed |chair|sofa|table|bookshelf|desk|bathtub|toilet|
|-----------------|------|------|----|----|----|----|----|----|----|
| ImVoxelNet |14.1 |25.7|7.9 |**13.2**|7.8 |4.2 |20.5|22.1 |**11.5** |
| ImVoxelNet+3D Copy-Paste|**16.9** |**27.7**|**12.7** |10.0|**10.8** |**9.2** |**26.2**|**29.2** |9.0 |
**Experiments on other 3D detection method**:
To show the generalization of our method to other downstream monocular 3D object detection methods, in supplementary Table S1 we conducted additional experiments with another monocular 3D object detection model, Implicit3DUnderstanding (Im3D [2]). Our method (Im3D + 3D Copy-Paste) improves mAP_0.25 from 42.13 (Im3D) to 43.34.
**Performance details of each category in SUN RGB-D dataset**:
In the table below, we break down the SUN RGB-D monocular 3D object detection results with ImVoxelNet from main paper Table 2 by individual object category:
|SUN RGB-D AP_0.25 | Average (mAP) |bed |chair|sofa|table|book shelf|desk|bathtub|toilet|dresser|night stand|
|-----------------|------|------|----|----|----|----|----|----|----|----|----|
| ImVoxelNet |40.96 |72.0|55.6 |53.0|41.1 |**7.6** |21.5|29.6 |76.7 |19.0|33.4|
| ImVoxelNet+3D Copy-Paste|**43.79** |**72.6**|**57.1** |**55.1**|**41.8** |7.1 |**24.1**|**40.2** |**80.7** |**22.3**|**36.9**|
**Reference**
[1] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T. and Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828-5839).
[2] Zhang, C., Cui, Z., Zhang, Y., Zeng, B., Pollefeys, M. and Liu, S., 2021. Holistic 3d scene understanding from a single image with implicit representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8833-8842).
## R1 [7hAU] (reject)
Thank you for your time and comments! Please see our response below.
**Experiments on ScanNet**:
We use the SUN RGB-D dataset because it offers 10,000+ monocular RGB-D images (scenes) and is densely annotated with 146,617 2D polygons and 58,657 3D bounding boxes. Many 3D object detection papers [1,2] use SUN RGB-D performance as their main result. ScanNet is a large-scale RGB-D video dataset, but it is not specifically tailored for monocular 3D object detection. We appreciate the reviewer's suggestion to conduct new experiments on ScanNet.
Experimental settings are described in detail in the Global Response to all reviewers. We summarize it here:
[*Adapt to monocular setting*] We use one RGB-D image per video: 1,201 for training and 312 for validation. We derive the ground-truth 3D bounding box labels for each selected view from the provided video (scene)-level labels.
[*Training and test*] There are 8 ScanNet categories that overlap with our collected Objaverse data (main paper Table 1). We use our 3D Copy-Paste to augment the training set and train an ImVoxelNet.
We report AP_0.25 for the 8 overlapping classes, and their mean, in the table below. Our 3D Copy-Paste improves ImVoxelNet by 2.8 mAP points.
|ScanNet AP_0.25 | Average (mAP)|bed |chair|sofa|table|bookshelf|desk|bathtub|toilet|
|-----------------|------|------|----|----|----|----|----|----|----|
| ImVoxelNet |14.1 |25.7|7.9 |**13.2**|7.8 |4.2 |20.5|22.1 |**11.5** |
| ImVoxelNet+3D Copy-Paste|**16.9** |**27.7**|**12.7** |10.0|**10.8** |**9.2** |**26.2**|**29.2** |9.0 |
**Improvement significance and implementation easiness**:
Monocular 3D object detection is a challenging task that requires inferring 3D information from a single RGB image, and improving from 40.96 to 43.79 can be considered significant, as Reviewer DWmW also pointed out. Looking at the performance leaderboard, it has been hard to push mAP much beyond 40 (e.g., ImVoxelNet remains the current SOTA on Papers-With-Code for SUN RGB-D even though it is now 2 years old). With a 2.83-point mAP improvement over that, our method sets a new state of the art for indoor-scene monocular 3D object detection.
Beyond the numerical improvement in mAP, our work provides several conceptual advances that we believe are important as well. Firstly, monocular 3D object detection is notably data-intensive, and annotating 3D labels is both time-consuming and costly. Our approach addresses this challenge through data augmentation, introducing an automatic pipeline that remains model-agnostic. Since data augmentation is a one-off effort, it can potentially benefit a variety of models. To allow easy usage, we will release our code, models, and generated data.
Secondly, through comprehensive experiments, we demonstrate the efficacy of our method across diverse models, such as ImVoxelNet and Implicit3D (detailed in the supplementary), and on different datasets, including SUN RGB-D and ScanNet. Importantly, our findings highlight the potential of 3D data augmentation in improving the performance of 3D perception tasks, opening up new avenues for research and practical applications.
**mAP 0.25 evaluation**:
This has to do with the fact that official results are only available for a few combinations of dataset/object classes/IoU threshold: the official ImVoxelNet GitHub code, when using SUN RGB-D in the "10 classes from VoteNet" setting (same as ours), also uses the 0.25 IoU threshold, with performance 40.7 (even though it uses 0.15 for other datasets/classes).
The authors provide an implementation in the MMDetection3D GitHub repository [3], which also uses mAP_0.25 for the 10-classes-from-VoteNet setting (performance 40.96). Since we use the official MMDetection3D code, we also used the 0.25 IoU threshold. Per your suggestion, we additionally report our results with mAP_0.15 on the SUN RGB-D dataset below (the IoU threshold only changes the matching step of the evaluation, as sketched after the table). We will incorporate this in our revised paper.
|SUN RGB-D |mAP_0.15|
|-|-|
|ImVoxelNet |48.45|
|ImVoxelNet + 3D Copy-Paste|**51.16**|
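For reference, the sketch below illustrates generically (this is not the MMDetection3D implementation) how the IoU threshold enters the evaluation: it only changes which prediction-GT pairs count as true positives during score-ordered matching. Helper names such as `iou_fn` are placeholders.

```python
import numpy as np

def match_detections(pred_boxes, pred_scores, gt_boxes, iou_fn, iou_thr=0.25):
    """Greedy matching of predictions to ground truth at a configurable IoU threshold.

    iou_fn(a, b) is assumed to return the 3D IoU between two boxes; switching the
    threshold (e.g., 0.25 vs. 0.15) only changes which matches become true positives.
    """
    order = np.argsort(-np.asarray(pred_scores))          # highest score first
    matched_gt = set()
    tp = np.zeros(len(pred_boxes), dtype=bool)            # true-positive flags, score order
    for rank, i in enumerate(order):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            iou = iou_fn(pred_boxes[i], gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:
            tp[rank] = True
            matched_gt.add(best_j)
    return tp  # feed into the usual precision-recall / AP computation
```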
**Illumination estimation and reproducibility**:
Illumination estimation is an important and challenging task; Sec. 2.3 of the main paper (L108~L119) lists representative works in this domain. We will release all our code and data upon acceptance for reproducibility.
**Semantic plausibility influence**:
For this experiment, to make the insertion more globally semantically plausible (e.g., to avoid inserting a toilet into rooms other than bathrooms), we only insert object categories that already exist in the current room. For instance, if the room contains a table, chair, and sofa, we only consider inserting new objects belonging to these 3 categories.
The results (Table 4) show that considering the global semantics (43.75) is on par with the random category selection setting (43.79). One potential reason is that downstream CNN-based detectors may rely more on local information, so they are not sensitive to global semantics. Unlike point cloud-based 3D detection, where context is important because RGB information is often discarded, monocular 3D object detection takes an RGB image as input, so appearance per se may be the most important cue.
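For clarity, the context-aware variant can be implemented as a simple restriction of the category pool before sampling. The sketch below is illustrative (function and argument names are hypothetical), not our exact implementation.

```python
import random

def sample_insert_category(scene_object_labels, insertable_categories, context_aware=False):
    """Pick the category of the object to insert into a scene.

    scene_object_labels  : labels already annotated in the current scene.
    insertable_categories: categories for which we have Objaverse assets.
    context_aware=True mimics the "global semantics" ablation: only categories
    already present in the room are eligible; otherwise we sample uniformly.
    """
    pool = list(insertable_categories)
    if context_aware:
        pool = [c for c in pool if c in set(scene_object_labels)]
        if not pool:              # the room contains none of our categories
            return None           # skip insertion for this scene
    return random.choice(pool)
```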
**Requirement of depth, scale, and gravity**:
The metrically scaled object, depth, and gravity direction can either be provided by the dataset or estimated with off-the-shelf methods, e.g., metric depth estimation from ZoeDepth [4], and gravity direction estimation from ground segmentation followed by plane fitting. We will add this discussion to the limitations section (currently, the limitations are in the supplementary).
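As an illustration of the gravity case, the sketch below fits a plane to segmented floor points and takes its normal as the up direction. It is a hypothetical helper, assuming floor points are already available (e.g., from a segmentation model lifted to 3D with a metric depth estimate).

```python
import numpy as np

def gravity_from_floor_points(floor_points):
    """Estimate a gravity (up) direction by fitting a plane to segmented floor points.

    floor_points: (N, 3) 3D points labeled as floor/ground. Returns a unit vector.
    """
    centered = floor_points - floor_points.mean(axis=0)
    # The plane normal is the singular vector with the smallest singular value.
    _, _, vh = np.linalg.svd(centered, full_matrices=False)
    normal = vh[-1]
    # Orient the normal to point "up", assuming a camera frame whose y-axis points roughly down.
    if normal[1] > 0:
        normal = -normal
    return normal / np.linalg.norm(normal)
```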
**Reference**
[1] Huang et al. "Perspectivenet: 3d object detection from a single rgb image via perspective points." NeurIPS 2019.
[2] Zhang et al. "Holistic 3d scene understanding from a single image with implicit representation." CVPR 2021.
[3] Contributors, M. (2020). MMDetection3D: OpenMMLab next-generation platform for general 3D object detection.
[4] Bhat et al. "Zoedepth: Zero-shot transfer by combining relative and metric depth." arXiv 2023.
## R2 [sM4b] (Weak accept)
Thank you for your time and comments! Please see our response below.
**Comparison to Common 3D Corruption**:
3D Common Corruptions (3DCC) uses 3D information to generate real-world corruptions, which can be used to evaluate model robustness and as data augmentation for model training. Our contribution is largely orthogonal to 3DCC: 3DCC performs scene-level global augmentation and does not introduce new object content. Combining our method with 3DCC may achieve better performance. We will cite this paper and add comparisons in the related work section.
**Evaluation on other tasks**:
That is a very good point. We focus on monocular 3D object detection because it is a challenging, representative, and data-intensive task that involves both 3D geometry and semantic understanding, and where our method can help. We also conducted experiments showing that the inserted position, size, pose, and lighting do influence downstream model performance. We extend our method to another 3D detection dataset (ScanNet [1], below) and another model (Implicit3DUnderstanding [2], in the supplementary), and we will treat extending to other tasks as future work.
While we agree with the reviewer that extending to other downstream tasks is desirable, this will take more time than available during the rebuttal period; in the meantime, we note that our title and overall positioning of the paper are not overselling our results, i.e., it is clearly stated that our method is useful for monocular 3D object detection. Likewise, we will clearly note in the manuscript that this is important but considered for future work beyond this paper.
Here are the detailed experiments and results on ScanNet:
[*Adapt to monocular setting*] ScanNet contains 1,201 videos (scenes) in the training set and 312 videos (scenes) in the validation set. To adapt it for monocular 3D object detection, we use one RGB-D image per video, i.e., 1,201 RGB-D images for training and 312 for validation. We derive the ground-truth 3D bounding box labels for each selected view from the provided video (scene)-level labels, because some objects in the scene may not be visible in a given monocular view.
[*Training and test*] For the baseline, we train an ImVoxelNet monocular 3D object detection model on the training set and test on the validation set. For our method, 8 of the 18 ScanNet classes (sofa, bookshelf, chair, table, bed, desk, toilet, bathtub) overlap with our collected Objaverse data (main paper Table 1). We use our 3D Copy-Paste to augment the training set and train an ImVoxelNet. All training parameters are the same as for training on the SUN RGB-D dataset. We will release all the code.
We report AP_0.25 for the 8 overlapping classes, and their mean, in the table below. Our 3D Copy-Paste improves ImVoxelNet by 2.8 mAP points.
|ScanNet AP_0.25 | Average (mAP)|bed |chair|sofa|table|bookshelf|desk|bathtub|toilet|
|-----------------|------|------|----|----|----|----|----|----|----|
| ImVoxelNet |14.1 |25.7|7.9 |**13.2**|7.8 |4.2 |20.5|22.1 |**11.5** |
| ImVoxelNet+3D Copy-Paste|**16.9** |**27.7**|**12.7** |10.0|**10.8** |**9.2** |**26.2**|**29.2** |9.0 |
For 2D recognition tasks, related works [3,4] showed that simple 2D copy-paste is already sufficient to improve performance. Copy-paste in 3D is much harder, which is one of the motivations of our work. We believe our method should also help on 2D tasks and will treat this as future work.
**2D copy-paste baseline and individual object performance**:
For 2D insertion, it is hard to obtain the 3D bounding box, which is required for downstream monocular 3D object detection.
The random insertion in the main paper comparison is 3D insertion, where the insertion position, size, pose, and illumination are not physically plausible, which causes a significant performance drop.
In the table below, we break down the SUN RGB-D monocular 3D object detection results with ImVoxelNet from main paper Table 2 by individual object category:
|SUN RGB-D AP_0.25 | Average (mAP) |bed |chair|sofa|table|book shelf|desk|bathtub|toilet|dresser|night stand|
|-----------------|------|------|----|----|----|----|----|----|----|----|----|
| ImVoxelNet |40.96 |72.0|55.6 |53.0|41.1 |**7.6** |21.5|29.6 |76.7 |19.0|33.4|
| ImVoxelNet+3D Copy-Paste|**43.79** |**72.6**|**57.1** |**55.1**|**41.8** |7.1 |**24.1**|**40.2** |**80.7** |**22.3**|**36.9**|
**Include more datasets**:
Yes, we posit that a richer collection of objects to insert would be beneficial. However, we need full 3D models that we can insert in any pose (thus, the NYU dataset may not work as it does not provide full 3D object models). But other 3D object datasets (e.g., OmniObject3D) could be included in future work.
**Time cost to render one object**:
Overall, it takes around 5-10 seconds per object. Specifically, searching for the insertion position, pose, and size takes less than 0.5 seconds with 1,000 iterations; plane detection, lighting estimation, and rendering take most of the time.
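For intuition on why the search itself is cheap, below is a minimal sketch of a rejection-sampling loop with an axis-aligned footprint collision check. It is a simplified stand-in for our constrained insertion parameter search (Algorithm 1), with hypothetical names and without the height and plane-boundary handling of the full method.

```python
import numpy as np

def search_insertion(floor_plane_xy, existing_boxes, size_range, max_iters=1000, rng=None):
    """Rejection-sampling search for a collision-free insertion (a minimal sketch).

    floor_plane_xy: (xmin, xmax, ymin, ymax) extent of a detected floor plane.
    existing_boxes: list of (cx, cy, w, l) footprints of objects already in the scene.
    size_range    : (smin, smax) scale range for the inserted object footprint.
    Returns (x, y, yaw, size) or None if no valid placement is found.
    """
    rng = rng or np.random.default_rng()
    xmin, xmax, ymin, ymax = floor_plane_xy

    def overlaps(cx, cy, s):
        # Cheap axis-aligned footprint check; a full system would also test heights.
        for ox, oy, w, l in existing_boxes:
            if abs(cx - ox) < (s + w) / 2 and abs(cy - oy) < (s + l) / 2:
                return True
        return False

    for _ in range(max_iters):
        x = rng.uniform(xmin, xmax)
        y = rng.uniform(ymin, ymax)
        yaw = rng.uniform(0.0, 2 * np.pi)
        size = rng.uniform(*size_range)
        if not overlaps(x, y, size):
            return x, y, yaw, size
    return None  # give up; the caller may skip augmentation for this scene
```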
**Insert shiny or specular objects**:
Changing the reflection property only influences the final rendering process; our physically plausible position, pose, size, and illumination are agnostic to the object's surface property. Specular objects will show reflections of other objects or lights during the rendering process.
**Reference**
[1] Dai, Angela, et al. "Scannet: Richly-annotated 3d reconstructions of indoor scenes." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[2] Zhang, Cheng, et al. "Holistic 3d scene understanding from a single image with implicit representation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[3] Dwibedi, Debidatta, Ishan Misra, and Martial Hebert. "Cut, paste and learn: Surprisingly easy synthesis for instance detection." Proceedings of the IEEE international conference on computer vision. 2017.
[4] Ghiasi, Golnaz, et al. "Simple copy-paste is a strong data augmentation method for instance segmentation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
## R3 [DWmW] (Borderline accept)
Thank you for your time and comments! Please see our response below.
**Methodology contribution**:
To conduct 3D object insertion for data augmentation, traditional methods [1] require manually locating support planes, designing poses, estimating lighting, etc., which is hard to scale up. Our method is the first automated 3D insertion pipeline for complex indoor scenes, enabling large-scale 3D insertion for data augmentation.
To allow a fully automated pipeline, we make the following technical contributions: (1) To avoid collisions, after plane detection, we propose a constrained insertion parameter search method (Algorithm 1) to guarantee a physically plausible inserted position, pose, and size, together with an efficient collision check that keeps the search fast. (2) We conduct environment map transformation and refinement to provide more accurate illumination on inserted objects.
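As an illustration of contribution (2), the sketch below shows one generic way to resample an equirectangular environment map under a rotation (e.g., into the insertion frame). It is a simplified, nearest-neighbor illustration, not our exact transformation and refinement procedure.

```python
import numpy as np

def rotate_envmap(envmap, R):
    """Resample an equirectangular environment map under a 3x3 rotation R.

    envmap: (H, W, 3) HDR image. Returns an (H, W, 3) map whose pixel at
    direction d shows the input map's value at direction R @ d.
    """
    H, W, _ = envmap.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lon = (u / W - 0.5) * 2 * np.pi              # azimuth of each output pixel
    lat = (0.5 - v / H) * np.pi                  # elevation of each output pixel
    d = np.stack([np.cos(lat) * np.sin(lon),     # unit direction per output pixel
                  np.sin(lat),
                  np.cos(lat) * np.cos(lon)], axis=-1)       # (H, W, 3)
    d = d @ R.T                                  # rotate every sampling direction
    lon_s = np.arctan2(d[..., 0], d[..., 2])
    lat_s = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))
    us = ((lon_s / (2 * np.pi) + 0.5) * W).astype(int) % W
    vs = np.clip(((0.5 - lat_s / np.pi) * H).astype(int), 0, H - 1)
    return envmap[vs, us]                        # nearest-neighbor lookup
```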
To the best of our knowledge, our work is the first to show that physically plausible 3D object insertion can serve as an effective generative data augmentation technique for indoor scenes, leading to state-of-the-art performance in discriminative downstream tasks such as monocular 3D object detection and opening up new avenues for research and practical applications.
**Experiments on other datasets**:
Thank you for your suggestion. We conduct new experiments on the ScanNet [2] dataset. Here are the detailed settings and results:
[*Adapt to monocular setting*] ScanNet contains 1,201 videos (scenes) in the training set and 312 videos (scenes) in the validation set. For monocular 3D object detection, we use one RGB-D image per video, i.e., 1,201 RGB-D images for training and 312 for validation (test). We derive the ground-truth 3D bounding box labels for each selected view from the provided video (scene)-level labels, because some objects in the scene may not be visible in a given monocular view.
[*Training and test*] For the baseline, we train an ImVoxelNet monocular 3D object detection model on the training set and test on the validation set. For our method, 8 of the 18 ScanNet classes (sofa, bookshelf, chair, table, bed, desk, toilet, bathtub) overlap with our collected Objaverse data (main paper Table 1). We use our 3D Copy-Paste to augment the training set and train an ImVoxelNet. All training parameters are the same as for training on the SUN RGB-D dataset. We will release all the code.
We report AP_0.25 for the 8 overlapping classes, and their mean, in the table below. Our 3D Copy-Paste improves ImVoxelNet by 2.8 mAP points.
|ScanNet AP_0.25 | Average (mAP)|bed |chair|sofa|table|bookshelf|desk|bathtub|toilet|
|-----------------|------|------|----|----|----|----|----|----|----|
| ImVoxelNet |14.1 |25.7|7.9 |**13.2**|7.8 |4.2 |20.5|22.1 |**11.5** |
| ImVoxelNet+3D Copy-Paste|**16.9** |**27.7**|**12.7** |10.0|**10.8** |**9.2** |**26.2**|**29.2** |9.0 |
**Inserted object class selection**:
Good point! The main results in Tables 2 and 3 use a random sample from the object class set, taken uniformly and without consideration of context. However, we also explore the influence of global context on detection performance in main paper Table 4. For this experiment, we only insert object categories already existing in the current room to make the insertion more globally semantically plausible (e.g., to avoid inserting toilets into rooms other than bathrooms). For instance, if the room contains a table, chair, and sofa, we only consider inserting new objects belonging to these 3 categories.
The results (Table 4) show that considering the global semantics (43.75) is on par with the random category selection setting (43.79). One potential reason is that downstream CNN-based detectors may rely more on local information, so they are not sensitive to global semantics. Unlike point cloud-based detection, appearance may be the most important cue in monocular detection. Here we do not observe a significant improvement from considering the global semantics, and this may also depend on the dataset.
**References**
[1] Debevec, P., 2008. Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In Acm siggraph 2008 classes (pp. 1-10).
[2] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T. and Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828-5839).
## R4 [B4nT] (Weak Accept)
Thank you for your time and comments! Please see our response below.
**Indoor/outdoor difference and novelty**:
Different from outdoor scenes, the challenges in indoor scenes include (1) complex spatial layouts (in particular, cluttered background and limited object-placeable space) that necessitate a carefully-designed method to allow automated object placement (physically-plausible position, size, pose), and (2) complex lighting effects such as soft shadow, inter-reflection and long-range light source dependency that demand dealing with lighting for harmonious object insertion.
To deal with the above challenges, we make the following key technical contributions: (1) To avoid collisions, after plane detection, we propose a constrained insertion parameter search (Algorithm 1) to guarantee a physically plausible inserted position, pose, and size, together with an efficient collision check that keeps the search fast. (2) We conduct environment map transformation and refinement to provide more accurate illumination on inserted objects.
To the best of our knowledge, ours is the first automated 3D insertion pipeline for complex indoor scenes, enabling large-scale 3D insertion for data augmentation. It also shows that physically plausible 3D object insertion can serve as an effective generative data augmentation technique for indoor scenes, leading to state-of-the-art performance in discriminative downstream tasks such as indoor monocular 3D object detection and opening up new avenues for research and practical applications.
Thank you for your suggestion; we will revise our manuscript as follows:
- We already cite your ref [3] Neural Light Field Estimation and will add the other outdoor citations.
- We will tone down the claim that indoor scenes are more challenging, since it may indeed be hard to defend.
- We will extend our discussion of how our work compares to previous art on outdoor scenes.
**AP_0.25 and AP_0.15**:
This has to do with the fact that official results are only available for a few combinations of dataset setting, object classes, and IoU threshold: the official ImVoxelNet GitHub code, when using SUN RGB-D in the "10 classes from VoteNet" setting (same as ours), also uses the 0.25 IoU threshold, with performance 40.7 (even though it uses 0.15 for other datasets/classes).
The authors provide an implementation in the MMDetection3D [1] GitHub repository, which also uses mAP_0.25 for the 10-classes-from-VoteNet setting (performance 40.96). Since we use the official MMDetection3D code, we also used the 0.25 IoU threshold. Per your suggestion, we also report our results with mAP_0.15 on the SUN RGB-D dataset (10 classes from VoteNet) here:
|SUN RGB-D | mAP_0.15|
|-----------------|------|
| ImVoxelNet |48.45 |
| ImVoxelNet + 3D Copy-Paste|**51.16** |
**More experiments on outdoor dataset**:
Thanks! In this paper, we focus on indoor scene insertion; we will treat extension to outdoor scenes as future work. In the meantime, we note that our title and overall positioning of the paper make it very clear that it is currently focused on indoor scenes (i.e., we are not overselling the work).
Nevertheless, we added experiments on another dataset (ScanNet [2]) and another model (Implicit3DUnderstanding [3], in the supplementary) to show the generalization of our method. Here are the detailed settings and results on ScanNet:
[*Adapt to monocular setting*] ScanNet contains 1,201 videos (scenes) in the training set and 312 videos (scenes) in the validation set. For monocular 3D object detection, we use one RGB-D image per video, i.e., 1,201 RGB-D images for training and 312 for validation (test). We derive the ground-truth 3D bounding box labels for each selected view from the provided video (scene)-level labels, because some objects in the scene may not be visible in a given monocular view.
[*Training and test*] For the baseline, we train an ImVoxelNet monocular 3D object detection model on the training set and test on the validation set. For our method, 8 of the 18 ScanNet classes (sofa, bookshelf, chair, table, bed, desk, toilet, bathtub) overlap with our collected Objaverse data (main paper Table 1). We use our 3D Copy-Paste to augment the training set and train an ImVoxelNet. All training parameters are the same as for training on the SUN RGB-D dataset. We will release all the code and training data.
We report AP_0.25 for the 8 overlapping classes, and their mean, in the table below. Our 3D Copy-Paste improves ImVoxelNet by 2.8 mAP points.
|ScanNet AP_0.25 | Average (mAP)|bed |chair|sofa|table|bookshelf|desk|bathtub|toilet|
|-----------------|------|------|----|----|----|----|----|----|----|
| ImVoxelNet |14.1 |25.7|7.9 |**13.2**|7.8 |4.2 |20.5|22.1 |**11.5** |
| ImVoxelNet+3D Copy-Paste|**16.9** |**27.7**|**12.7** |10.0|**10.8** |**9.2** |**26.2**|**29.2** |9.0 |
**Paper checklist**:
Thanks for the reminder! The limitations and broader impact discussions are in the supplementary material, along with additional experimental details. We will double-check the paper checklist and make sure everything is covered.
**References**
[1] Contributors, M. (2020). MMDetection3D: OpenMMLab next-generation platform for general 3D object detection.
[2] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T. and Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828-5839).
[3] Zhang, C., Cui, Z., Zhang, Y., Zeng, B., Pollefeys, M. and Liu, S., 2021. Holistic 3d scene understanding from a single image with implicit representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8833-8842).
## R5 [mHBD] (Borderline accept)
Thank you for your time and comments! Please see our response below.
**Related work**:
Thanks for your suggestions and comments! We will revise the statement and cite all the suggested related papers.
**Methodology**:
To conduct 3D object insertion for data augmentation, traditional methods [1] require manually locating support planes, designing poses, estimating lighting, etc., which is hard to scale up. Our method is the first automated 3D insertion pipeline for complex indoor scenes, enabling large-scale 3D insertion for data augmentation.
To allow a fully automated pipeline, we make the following technical contributions: (1) To avoid collisions, after plane detection, we propose a constrained insertion parameter search method (Algorithm 1) to guarantee a physically plausible inserted position, pose, and size, together with an efficient collision check that keeps the search fast. (2) We conduct environment map transformation and refinement to provide more accurate illumination on inserted objects.
To the best of our knowledge, our work is the first to show that physically plausible 3D object insertion can serve as an effective generative data augmentation technique for indoor scenes, leading to state-of-the-art performance in discriminative downstream tasks such as monocular 3D object detection and opening up new avenues for research and practical applications.
**More experiments on other models and datasets**:
- **For more monocular 3D object detection methods**, we conducted experiments with Implicit3DUnderstanding (Im3D [2]) in the supplementary material; the results are in supplementary Table S1.
- **For more datasets**, we extend our method to the ScanNet [3] dataset. Here are the detailed settings and results on ScanNet:
[*Adapt to monocular setting*] ScanNet contains 1,201 videos (scenes) in the training set and 312 videos (scenes) in the validation set. For monocular 3D object detection, we use one RGB-D image per video, i.e., 1,201 RGB-D images for training and 312 for validation (test). We derive the ground-truth 3D bounding box labels for each selected view from the provided video (scene)-level labels, because some objects in the scene may not be visible in a given monocular view.
[*Training and test*] For the baseline, we train an ImVoxelNet monocular 3D object detection model on the training set and test on the validation set. For our method, 8 of the 18 ScanNet classes (sofa, bookshelf, chair, table, bed, desk, toilet, bathtub) overlap with our collected Objaverse data (main paper Table 1). We use our 3D Copy-Paste to augment the training set and train an ImVoxelNet. All training parameters are the same as for training on the SUN RGB-D dataset. We will release all the code.
We report AP_0.25 for the 8 overlapping classes, and their mean, in the table below. Our 3D Copy-Paste improves ImVoxelNet by 2.8 mAP points.
|ScanNet AP_0.25 | Average (mAP)|bed |chair|sofa|table|bookshelf|desk|bathtub|toilet|
|-----------------|------|------|----|----|----|----|----|----|----|
| ImVoxelNet |14.1 |25.7|7.9 |**13.2**|7.8 |4.2 |20.5|22.1 |**11.5** |
| ImVoxelNet+3D Copy-Paste|**16.9** |**27.7**|**12.7** |10.0|**10.8** |**9.2** |**26.2**|**29.2** |9.0 |
**References**
[1] Debevec, P., 2008. Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In Acm siggraph 2008 classes (pp. 1-10).
[2] Zhang, C., Cui, Z., Zhang, Y., Zeng, B., Pollefeys, M. and Liu, S., 2021. Holistic 3d scene understanding from a single image with implicit representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8833-8842).
[3] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T. and Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828-5839).