# ICASSP Rebuttal

Dear Area Chairs and Reviewers,

We thank the reviewers for their valuable comments and appreciate the recognition of our work. Overall, the reviewers deem our work "of sufficient interest" (2381, 4280, 4F36, 0E18), "moderately original" (2381, 4280, 4F36, 0E18), "technically correct" (4280, 4F36, 0E18), "convincing validation" (2381, 4280, 4F36, 0E18), and "references adequate" (2381, 4280, 4F36). Below, we respond point by point to the concerns raised by the reviewers:

1. "I do not see any theoretical analysis." [Justification of Technical Correctness Score 2381]
Response: Thank you for this question; we believe theoretical analysis can be left for future work. First, the 4-page limit leaves us no space for further theoretical analysis. Second, our design is motivated by qualitative analysis, which is also often the case for existing works on self-supervised vision representation learning. We hope our strong experimental results shed light on applying an extra instance-level optimization target in the SSL framework, as done in this paper.

2. "How do global-level and scene-level differences in this paper differ?" [Justification of Clarity of Presentation Score 2381]
Response: Global-level methods apply the objective to each image as a whole (one embedding per sample), whereas scene-level methods focus on contrasting at the object/region level within a scene.
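To make the distinction concrete, consider a generic image-level InfoNCE objective versus a region-level analogue (generic notation for illustration only; this is not the exact formulation in our paper):

$$\mathcal{L}_{\text{global}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_i')/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(z_i, z_j)/\tau)}, \qquad \mathcal{L}_{\text{scene}} = -\log \frac{\exp(\mathrm{sim}(r_k, r_k')/\tau)}{\sum_{l} \exp(\mathrm{sim}(r_k, r_l)/\tau)},$$

where $z_i, z_i'$ are whole-image embeddings of two augmented views of image $i$, and $r_k, r_k'$ are pooled embeddings of a matched object/region $k$ across the two views. Global-level methods average the left-hand loss over images, while scene-level methods average the right-hand loss over regions within each scene.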
3. "Some of the descriptions seem unclear and a bit hard to follow for me. (see 10. for more details)" [Justification of Clarity of Presentation Score 4280]

Response: Thank you for the suggestion; we will clarify these descriptions in the revised version.

4. [Additional comments to author, 4280]
Response:

5. [Additional comments to author, 4F36]
Response:

6. "The reference works are mostly SSL methods based on contrastive learning. Other SSL methods such as MAE or those based on pretext tasks are missing." [Justification of Reference to Prior Work Score 0E18]
Response: We do not compare our method with MAE because MIM-based methods require a ViT-style backbone; a standard CNN cannot directly perform masked image modeling.
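For intuition, here is a minimal sketch (illustrative only; `random_patch_masking` is a hypothetical helper, not code from our paper or from MAE) of the patch-token masking step that MIM presupposes. The input must already be a sequence of patch tokens, which a ViT provides but a CNN's dense feature maps do not:

```python
import torch

def random_patch_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    # patch_tokens: (batch, num_patches, dim); returns only the visible tokens.
    b, n, d = patch_tokens.shape
    num_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n)                       # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]  # indices of patches to keep
    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d)
    )
    return visible, keep_idx  # visible tokens are fed to the ViT encoder
```

Because masking is implemented as dropping tokens before the encoder, it requires a tokenized input; a CNN's sliding-window convolutions have no counterpart for this operation and would still mix information from masked regions.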