**General Response**:
We thank the reviewers for their thoughtful feedback. It is encouraging that the reviewers found learning representations for scene images to be important and relevant to the community (R-moVN, R-ANCp, R-Rh3t), and our idea of using a new image sampling procedure and localization loss to be novel (R-moVN, R-Rh3t). We also thank R-Rh3t and R-ANCp for appreciating the OHMS dataset as a good testbed for understanding the performance of SSL algorithms in the multi-object scenario. We are very glad that all reviewers found our writing to be clear and easy to follow (R-moVN, R-ANCp, R-Rh3t).
**Reviewer moVN**:
**report the results of MoCoL with the additional 2 loss terms without OAC.**
Below we report an ablation that adds only the object-aware crop (OAC), without the localization and rotation losses, as well as an additional ablation that removes OAC and uses only the rotation and localization losses.
Adding the object crop alone already improves performance by a fair margin (+1.5 mAP); the rotation and localization losses improve it further.
| COCO detection | $\text{AP}$ | $\text{AP}_{50}$ | $\text{AP}_{75}$ | COCO segmentation | $\text{AP}$ | $\text{AP}_{50}$ | $\text{AP}_{75}$ |
|-----------------|------|------|------|-----------------|------|------|------|
| MoCo-v2| 38.2 | 58.9| 41.6 | MoCo-v2 | 34.8 | 55.3 | 37.8 |
| MoCo-v2 + OAC | 39.7 | 60.1 | 43.4 | MoCo-v2 + OAC | 36.0 | 57.3 | 38.8 |
| MoCo-v2 + OAC (Using all losses) | 40.7 | 60.9 | 43.9 | MoCo-v2 + OAC (Using all losses) | 36.9 | 58.3 | 39.6 |
| Dense-CL | 39.6 | 59.3| 43.3 | Dense-CL | 35.7 | 56.5 | 38.4 |
| Dense-CL + OAC | 40.4 | 60.4 | 44.0 | Dense-CL + OAC | 36.6 | 57.9 | 39.5 |
| Dense-CL + OAC (Using all losses) | 41.4 | 61.5 | 44.7 | Dense-CL + OAC (Using all losses) | 37.5 | 59.5 | 40.4 |
The first row of the table below reports results when using only the rotation and localization losses. Using these two losses alone does not work as well as using all three losses together.
| Model | mAP |
|-----------------|------|
| Using only rotation and localization loss| 44.3 |
| MoCo-v2| 48.7 |
| MoCo-v2 + OAC| 58.6 |
| MoCo-v2 + OAC + Rotation + Localization| 59.7 |
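For clarity, here is a sketch of how the three objectives could be combined during pre-training; the weighting terms $\lambda_{\text{rot}}$ and $\lambda_{\text{loc}}$ are illustrative placeholders rather than values taken from the paper:

$$\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{contrastive}} \;+\; \lambda_{\text{rot}}\,\mathcal{L}_{\text{rotation}} \;+\; \lambda_{\text{loc}}\,\mathcal{L}_{\text{localization}}$$

The ablation rows above can be read as keeping or dropping individual terms of this sum (e.g., the first row of the table above drops $\mathcal{L}_{\text{contrastive}}$ entirely).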
**Reviewer ANCp**
**The problem seems to be explored by several previous works that are not discussed here**: While both [1] and [2] address issues with random cropping, they do not report results from pre-training on multi-object datasets such as COCO and OpenImages, which is the main focus of this paper. Secondly, their improvement on detection & segmentation is only +0.3 mAP, while our improvement is +1.5 mAP from just changing the cropping strategy. Also, it is not clear how to extend [1] & [2], since they rely on an SSL model pre-trained on ImageNet to generate better positives; such pre-trained models, which work well on ImageNet, may not extend well to OpenImages.
**It is a bit unclear why we need two projection heads here. This makes it hard to tell how the proposed cropping method improves over the baselines. For instance, one can also augment MoCo-v2 with the rotation task and object localization loss to (potentially) improve the performance. Ideally, there should be more results that exclude all the additional losses and modifications, but simply changes vanilla cropping method to LOD.**
Here is the ablation study that changes only the object crop, without the localization & rotation losses. By only changing the object crop, the performance of the model improves by a fair margin (+1.5 mAP); the rotation and localization losses further improve performance.
| COCO detection | $\text{AP}$ | $\text{AP}_{50}$ | $\text{AP}_{75}$ | COCO segmentation | $\text{AP}$ | $\text{AP}_{50}$ | $\text{AP}_{75}$ |
|-----------------|------|------|------|-----------------|------|------|------|
| MoCo-v2| 38.2 | 58.9| 41.6 | MoCo-v2 | 34.8 | 55.3 | 37.8 |
| MoCo-v2 + OAC | 39.7 | 60.1 | 43.4 | MoCo-v2 + OAC | 36.0 | 57.3 | 38.8 |
| MoCo-v2 + OAC (Using all losses) | 40.7 | 60.9 | 43.9 | MoCo-v2 + OAC (Using all losses) | 36.9 | 58.3 | 39.6 |
| Dense-CL | 39.6 | 59.3| 43.3 | Dense-CL | 35.7 | 56.5 | 38.4 |
| Dense-CL + OAC | 40.4 | 60.4 | 44.0 | Dense-CL + OAC | 36.6 | 57.9 | 39.5 |
| Dense-CL + OAC (Using all losses) | 41.4 | 61.5 | 44.7 | Dense-CL + OAC (Using all losses) | 37.5 | 59.5 | 40.4 |
The first row of the table below reports results when using only the rotation and localization losses. Using these two losses alone does not work as well as using all three losses together.
| Model | mAP |
|-----------------|------|
| Using only rotation and localization loss| 44.3 |
| MoCo-v2| 48.7 |
| MoCo-v2 + OAC| 58.6 |
| MoCo-v2 + OAC + Rotation + Localization| 59.7 |
**Pre-training on ImageNet**: Table 8 in the supplementary material shows results of pre-training on ImageNet with OAC. By just changing the crops to OAC, our performance improves by +0.8 mAP on VOC. Due to resource constraints, we have not yet added results with the localization and rotation losses; we will do so in subsequent versions.
**It is a bit odd to see the related works at the end of the paper. Similar to the second point, there are several related works that need to be discussed and compared. It makes sense to present an overview before the method section to give readers a glimpse about how similar problems are addressed.**
Thanks for the suggestion; we will move the related work section to after the introduction in the updated version.
**In the appendix, what is the role of new p’(x) in (5), as p’ is not showed in the equation. If one use p’ to replace a part of p in (5), it is no longer a variational lower bound of the MI between X and C. Does this replacement create a new random variable? The theoretical analysis does not make the motivation clear.** p’(x) results from the new sampling scheme defined by our proposed cropping methods, so each of our cropping schemes can be viewed as enforcing a different probability distribution over x in (1). This is our interpretation rather than an explicit formulation from which the method is derived. We will update the theoretical analysis in the revised version.
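As an illustrative sketch (using a generic InfoNCE-style form rather than the exact equation from the appendix), the variational lower bound on the mutual information between a crop $x$ and its context $c$ can be written, loosely, with the data-sampling distribution made explicit:

$$I(X; C) \;\ge\; \mathbb{E}_{x \sim p(x)}\!\left[\log \frac{f(x, c)}{\tfrac{1}{N}\sum_{j=1}^{N} f(x_j, c)}\right],$$

where $f$ is the learned critic and the expectation over positives and negatives is abbreviated. Under our interpretation, the proposed cropping replaces $p(x)$ with $p'(x)$ in this outer expectation, i.e., it changes the data distribution the encoder is trained under rather than introducing a new random variable; whether the resulting objective remains a tight bound on $I(X;C)$ under the original $p(x)$ is exactly the caveat the reviewer raises, and we will make this explicit in the revised analysis.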
**Empirical experiments as stated in weakness sections. For instance, (1) Mocov2 + rotation + object localization (without LOD) (2) Empirical comparison with baselines such as [1] or [2] (3) Results on standard benchmarks such as ImageNet. By consolidating the experiments, I believe the contribution of the paper would be more convincing.**
**Comparison with [1], [2]**: While both [1] and [2] address issues with random cropping, they do not report results from pre-training on multi-object datasets such as COCO and OpenImages, which is the main focus of this paper. Secondly, their improvement on detection & segmentation is only +0.3 mAP, while our improvement is +1.5 mAP from just changing the cropping strategy. Also, it is not clear how to extend [1] & [2], since they rely on bootstrapped models to generate better positives, and as we saw on OpenImages, bootstrapped models that use random cropping do not perform well.
**Reviewer Rh3t**
**Results are limited to COCO and VOC. ADE20K would be a nice third dataset to test.** Here are the results on ADE20k. We follow the same protocol as [3].
| Model | mIoU |
|-----------------|------|
| MoCo-v2| 37.5 |
| MoCo-v2 + OAC| 39.2 |
**It would be really interesting to also compare the performance of the proposed approach and other SSL methods when trained on a random 200k-sized subset from all of Openimages in Table 2.** Here is the comparison when we use a random 212k-image subset drawn from all of OpenImages.
| Model | mAP |
|-----------------|------|
| MoCo-v2 (Baseline)| 52.3 |
| MoCo-v2 + OAC | 58.1 |
| Supervised| 60.1 |
**computational cost of running LOD**
Generating the proposals took around 4.5 hours for OHMS and roughly 2 hours for COCO.
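To give a sense of what this one-time preprocessing pass involves, below is a minimal illustrative sketch of offline proposal generation. It uses OpenCV's selective search purely as a stand-in for the actual LOD pipeline (which is different); the paths, output format, and `top_k` value are assumptions.

```python
# Illustrative offline proposal generation (one-time preprocessing cost).
# NOTE: uses OpenCV's selective search as a stand-in; the paper's LOD
# pipeline differs. Paths and the JSON output format are assumptions.
import json
from pathlib import Path

import cv2  # requires opencv-contrib-python for the ximgproc module


def generate_proposals(image_dir: str, out_file: str, top_k: int = 10) -> None:
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    proposals = {}
    for img_path in Path(image_dir).glob("*.jpg"):
        img = cv2.imread(str(img_path))
        if img is None:
            continue
        ss.setBaseImage(img)
        ss.switchToSelectiveSearchFast()   # fast mode trades recall for speed
        boxes = ss.process()[:top_k]       # (x, y, w, h) boxes, highest-ranked first
        proposals[img_path.name] = [list(map(int, b)) for b in boxes]
    with open(out_file, "w") as f:
        json.dump(proposals, f)


# Example (hypothetical paths):
# generate_proposals("openimages/train", "ohms_proposals.json")
```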
**At least one row for MoCo-v2 without OAC trained on the same OHMS dataset is needed as a fair baseline.** Thanks for the suggestion. We get 54.7 mAP by training the scene-scene crop baseline on the OHMS dataset; we will add this result in the updated version.
**Result on Full OpenImages**: As suggested, we train MoCo-v2 on the full dataset, i.e., 1.7 million images, for 100 epochs. For testing, we use the full test set of OpenImages. Here are the results after pre-training on the full dataset.
| Model | mAP |
|-----------------|------|
| MoCo-v2 (Baseline)| 50.5 |
| MoCo-v2 + OAC| 62.1 |
| Supervised| 74.0 |
**It would be nice to report more results for the " Obj-Obj+Dilate crop"**
| COCO detection | $\text{AP}$ | $\text{AP}_{50}$ | $\text{AP}_{75}$ | COCO segmentation | $\text{AP}$ | $\text{AP}_{50}$ | $\text{AP}_{75}$ |
|-----------------|------|------|------|-----------------|------|------|------|
| MoCo-v2| 38.2 | 58.9| 41.6 | MoCo-v2 | 34.8 | 55.3 | 37.8 |
| MoCo-v2 + OAC | 39.7 | 60.1 | 43.4 | MoCo-v2 + OAC | 36.0 | 57.3 | 38.8 |
| MoCo-v2 + OAC (Using all losses) | 40.7 | 60.9 | 43.9 | MoCo-v2 + OAC (Using all losses) | 36.9 | 58.3 | 39.6 |
| Dense-CL | 39.6 | 59.3| 43.3 | Dense-CL | 35.7 | 56.5 | 38.4 |
| Dense-CL + OAC | 40.4 | 60.4 | 44.0 | Dense-CL + OAC | 36.6 | 57.9 | 39.5 |
| Dense-CL + OAC (Using all losses) | 41.4 | 61.5 | 44.7 | Dense-CL + OAC (Using all losses) | 37.5 | 59.5 | 40.4 |
The first row of the table below reports results when using only the rotation and localization losses. Using these two losses alone does not work as well as using all three losses together.
| Model | mAP |
|-----------------|------|
| Using only rotation and localization loss| 44.3 |
| MoCo-v2| 48.7 |
| MoCo-v2 + OAC| 58.6 |
| MoCo-v2 + OAC + Rotation + Localization| 59.7 |
**Q1) What is the overlap between the classes of VOC and COCO and the 208 classes of the OHMS dataset? How about Imagenet and COCO/VOC?**: There is an 85% overlap between OpenImages and VOC, and a 95% overlap between COCO and VOC. Between ImageNet and COCO, the overlap is 58%; between ImageNet and VOC, the overlap is 100%.
**Q2) Sorry if I missed it but is there a train/val split for OHMS or are you performing evaluations and object detection on exactly the training images?**: To enable a fair comparison with SOTA methods [1, 2], object detection is always done on the COCO dataset. For classification results on OHMS, we create a test set where we only sample images that contain one of the 208 classes selected during training.
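A minimal sketch of this filtering step, assuming a simple mapping from image IDs to their class labels (the variable names and data format below are illustrative, not our actual OHMS construction code):

```python
# Illustrative test-set filtering: keep only images containing at least one
# of the 208 classes used during training. Names and formats are assumptions.
def build_test_split(annotations: dict[str, list[str]], train_classes: set[str]) -> list[str]:
    """annotations maps image_id -> list of class labels present in that image."""
    return [
        image_id
        for image_id, labels in annotations.items()
        if any(label in train_classes for label in labels)
    ]


# Example usage with toy data:
# train_classes = {"Dog", "Car", "Chair"}                 # in practice, 208 classes
# annotations = {"img1": ["Dog", "Tree"], "img2": ["Boat"]}
# build_test_split(annotations, train_classes)            # -> ["img1"]
```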
**Q3) is the last row of Table 2 using the added losses (rotation and localization tasks)? If yes, what are the equivalent results without these added losses?”**: No, it was without the additional losses. Adding the additional losses improves the mAP to 59.7.
**Q4) Can you clarify what you mean by "different projection heads for scene and objects" (end of Sec 3.1) for MoCo-v2 which is generally the base SSL algorithm for the proposed approach? isnt the scene MLP (which i assume is the teacher) side also an EMA of the student side?**
Thanks for this question. A few clarifications: 1) As mentioned in Section 3, we randomly choose either the scene or the object crop to pass through the query and key encoders; thus, either the “scene” or the “object” crop could act as the “teacher”. 2) Unlike the standard contrastive learning setup, we train two (query) MLPs, one each for the object and scene crops, and maintain EMA versions of both on the student (key) side. We will make this clear in the final draft.
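To make this concrete, here is a minimal PyTorch-style sketch of the idea (not our actual implementation): two query MLP heads, one for the scene crop and one for the object crop, with momentum (EMA) copies of the encoder and both heads on the key side. Feature dimensions, the momentum value, and the key-side head selection are illustrative assumptions, and the contrastive loss/queue is omitted.

```python
# Sketch of dual projection heads with EMA key-side copies (illustrative only).
import copy
import random

import torch
import torch.nn as nn


def mlp(dim_in: int = 2048, dim_out: int = 128) -> nn.Sequential:
    # 2048 assumes a ResNet-50 backbone; dimensions are illustrative.
    return nn.Sequential(nn.Linear(dim_in, dim_in), nn.ReLU(), nn.Linear(dim_in, dim_out))


class DualHeadMoCo(nn.Module):
    def __init__(self, encoder_q: nn.Module, momentum: float = 0.999):
        super().__init__()
        self.m = momentum
        self.encoder_q = encoder_q
        self.encoder_k = copy.deepcopy(encoder_q)                        # EMA (key) encoder
        self.head_q = nn.ModuleDict({"scene": mlp(), "object": mlp()})  # two query heads
        self.head_k = copy.deepcopy(self.head_q)                         # EMA (key) heads
        for p in list(self.encoder_k.parameters()) + list(self.head_k.parameters()):
            p.requires_grad = False

    @torch.no_grad()
    def _momentum_update(self):
        for q_mod, k_mod in [(self.encoder_q, self.encoder_k), (self.head_q, self.head_k)]:
            for pq, pk in zip(q_mod.parameters(), k_mod.parameters()):
                pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    def forward(self, scene_crop: torch.Tensor, object_crop: torch.Tensor):
        # Randomly decide which crop goes through the query branch; the other
        # crop goes through the momentum (key) branch.
        if random.random() < 0.5:
            q_crop, q_name, k_crop, k_name = scene_crop, "scene", object_crop, "object"
        else:
            q_crop, q_name, k_crop, k_name = object_crop, "object", scene_crop, "scene"
        q = self.head_q[q_name](self.encoder_q(q_crop))
        with torch.no_grad():
            self._momentum_update()
            k = self.head_k[k_name](self.encoder_k(k_crop))
        # The InfoNCE loss over (q, k) and a negative queue would follow here.
        return nn.functional.normalize(q, dim=1), nn.functional.normalize(k, dim=1)
```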
[1] Dense-CL: https://arxiv.org/abs/2011.09157
[2] MoCo-v2: https://arxiv.org/abs/2003.04297
[3] Rethinking Atrous Convolution for Semantic Image Segmentation: https://arxiv.org/abs/1706.05587