# NeurIPS Follow-up Feedback
## Reviewer zcQX
***Q3: Why and how to choose the weak and strong data augmentations?***
**A3**: For weak augmentation, we use random-resize-crop, horizontal flip, and normalization; for strong augmentation, we additionally use color jitter, Gaussian blur, and random grayscale transformations (lines 183-186 in the main paper).
To answer why we chose weak and strong augmentation, we provided ablations in the main paper (lines 289-297). We infer that *weakly-augmented images provide better pseudo-labels* from the teacher network, so that the student network can be optimized to make its outputs on strongly-augmented images consistent with the teacher's outputs on weakly-augmented images.
***Feedback to A3***:
I still have concerns here. Maybe I asked the question in a wrong way. How did the authors build up the sets of weak/strong augmentation? Have you ever tried different components for the two sets of data augmentation?
***Final Response***:
We thank the reviewer for the feedback and apologize for the delay in our follow-up response. Below we provide more details about the augmentations, and we will add a discussion on this in our final version.
For weak augmentation, we adopt the standard augmentation used in ResNet training [2], i.e., random-resize-crop and horizontal flip. For strong augmentation, we follow the augmentations used in [1, 5], i.e., we additionally apply color jitter, Gaussian blur, and random grayscale on top of random-resize-crop and horizontal flip.
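To make the two pipelines concrete, below is a minimal torchvision sketch of how they could be composed. The specific jitter strengths, blur kernel, and application probabilities follow common settings from [1, 5] and are illustrative placeholders rather than the exact values in our code.
```python
import torchvision.transforms as T

normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])  # standard ImageNet statistics

# Weak augmentation: standard ResNet-style training transforms [2].
weak_aug = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    normalize,
])

# Strong augmentation: the same base transforms plus color jitter, random
# grayscale, and Gaussian blur, following [1, 5]; parameters are illustrative.
strong_aug = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    T.ToTensor(),
    normalize,
])
```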
<!-- The weak augmentation is the augmentation used in ResNet training [2]. **[RP: One sentence on what exactly we used for weak and strong augmentation would be better]** In page-8, section "Effect of data augmentation", we analyze the effect of different augmentations and show that applying weak augmentation in one of the image pairs improves the performance **[RP: I feel this sentence is not needed and actually breaks the flow.]**. -->
We have also tried other augmentations, but did not find them to perform well. Specifically, we tried 'multi-crop' [3] and 'rand-aug' [4], both of which can be considered stronger than the transformations used in our *strong* augmentation. The results below show that overly strong augmentation does not help much for cross-domain transfer, and its effect also depends on the target domain. For instance, 'multi-crop' works slightly better on the CropDisease dataset but performs poorly on ISIC. The default augmentations that we used in the main paper generally perform well across domains.
| Model | EuroSAT | CropDisease | ISIC | ChestX |
| :-------------- | :------ | :---------- | :---- | :----- |
| Ours (multi-crop) | 88.25 | **95.98** | 41.83 | 25.42 |
| Ours (rand-aug) | 84.58 | 92.91 | 47.41 | 25.03 |
| Ours | **89.07** | 95.54 | **49.36** | **28.31** |
[1] Chen, Ting, et al. "Big Self-Supervised Models are Strong Semi-Supervised Learners."
[2] He, Kaiming, et al. "Deep Residual Learning for Image Recognition."
[3] Caron, Mathilde, et al. "Unsupervised learning of visual features by contrasting cluster assignments."
[4] Cubuk, Ekin D., et al. "Randaugment: Practical automated data augmentation with a reduced search space."
[5] Chen, Xinlei, et al. "Improved baselines with momentum contrastive learning."
> [name=Rameswar Panda] If this response is already finished, we should post it on OpenReview asap. Further delay might give the AC a negative impression that the paper is not ready and needs more work.
## Reviewer pL1o
---
***Q1: My main concern to this paper is the motivation. Using student-teacher knowledge distillation is not a new idea in training deep neural networks. It remains unclear to me why such method is strong on cross-domain few-shot learning without seeing the experimental results. I expect the authors to provide more discussion on the advantages of the proposed methods, e.g., what specific natures of cross-domain few-shot learning is exploited? The current discussion shows only 'another way of using unlabeled data' in cross domain few-shot learning, yet this is something can be exploited in many tasks like standard few-shot learning, domain generalizations, etc..***
**A1**: We thank the reviewer for the detailed comments. We provide several points below:
1. Although student-teacher knowledge distillation has been exploited in several computer vision problems, to the best of our knowledge, we are the first to apply this in the cross-domain few-shot learning problem.
2. We hypothesize that using both labeled base data and unlabeled target data during training provides a common embedding for the base and target domains. A natural question is why not use the unlabeled target data only, since it might provide a more target-specific representation. One issue with this approach is that self-supervised learning generally requires a large amount of unlabeled data to work. Secondly, it has been shown that combining supervised and unsupervised learning during training provides more transferable representations (*Islam et al., A Broad Study on the Transferability of Visual Representations with Contrastive Learning*). We argue that a similar conclusion holds for cross-domain few-shot learning, i.e., combining supervised and unsupervised losses provides a better representation for the downstream task.
3. An important aspect of our method is sharing the same head for the supervised loss and the distillation loss. Our distillation loss is similar to non-contrastive self-supervised losses such as BYOL or DINO. However, non-contrastive methods like DINO require much more data and several training tricks to work. We argue that using the same head for the supervised and distillation losses resolves these issues. Please refer to lines 262-283 in the main paper.
4. As for why our proposed method works in the cross-domain setting, we also refer to the "Effect of dynamic distillation" section of the main paper (lines 242-261), where we show that our method creates a better grouping of the embeddings of the target datasets even though we do not use any labels from the target dataset during pretraining.
5. We agree with the reviewer that this approach can be exploited in the standard few-shot setting too, which we verified in the "Few-shot performance on similar domain" section (lines 228-240).
**Feedback to A1**:
In my opinion, being the first to apply a technique to a new task cannot be considered as motivation or novelty, especially when the technique is a relatively established one and has been shown to be a general way of improving NN training. So what I expect to see at this point is:
- Specific reasons that motivates you to apply this technique to this problem.
- Novel improvements you proposed that are tailored for this task based on your understandings and assumptions.
Here are my comments to each point you provide:
1. To me, applying knowledge distillation to a new task can hardly be considered as a contribution, as knowledge distillation has been proved as a strong method on improving representation learning. E.g., [1] has discussed the application of knowledge distillation + self-supervised learning on both standard and few-shot training.
2. What you provided can be part of the reason support the idea. However, I have following minor concerns:
- 'self-supervised learning generally requires a large amount of unlabeled data to work'. This argument needs further support. Please see [1,2,3] for a non-exhaustive list of applying self-supervised learning at few-shot scenarios.
- After reading the paper again, I have minor concern on the setting itself. As one of the main reasons we are studying cross-domain few-shot learning is that in practice, we have no idea where the model will be applied and how significant the domain discrepancy is. Assuming the availability of unlabeled target domain data can potentially restrict the applications to the known target domain only. In this case, a valid setting might be: training with domain A (labeled) and domain B (unlabled) and test on domain C (unavailable at training), which will have a more significant practical value. However, since this is an issue also shared by other work studying the same setting, failing to address this point will not effect the final score I recommend.
3. In paper, your discussion on this point is mainly based the results you got, but I'm asking motivation. If this improvement is really important, why it is not fully discussed in Section 1 and 3? And I'm having hard time understanding what the 'trivial solution' you were referring to in this discussion.
4. Again, what you provided is based on the results. And considering the general advantages of knowledge distillation on representation learning, I am not surprised by this result.
5. Please see my further comments on A4 below.
[1] Xu, Guodong, et al. "Knowledge distillation meets self-supervision." ECCV, 2020.
[2] Chen, Da, et al. "Self-supervised learning for few-shot image classification." ICASSP, 2021.
[3] Gidaris, Spyros, et al. "Boosting few-shot visual learning with self-supervision." ICCV, 2019.
<!-- **Final Response**: **[RP: I would suggest not to use the notations like A1-1 and A1-2-1. This seems to be confusing. We can instead write a short heading like the ones I wrote so that it will help the reviewer including the AC to quickly get an indea about our main response.]** -->
Thanks for the feedback! We apologize for the delay. Please find our responses below, and let us know if you have any further questions/concerns.
**A1 Motivation**
**[Knowledge Distillation]** We agree with the reviewer that knowledge distillation has been proven to be a strong method for improving representation learning. However, most of the findings apply only to *in-domain data*. For example, [1] shows that KD enhances in-domain features for both linear transfer and few-shot learning, and [4] uses KD for semi-supervised learning (where both labeled and unlabeled samples are from the same domain). Thus, it is still not clear whether the same conclusion holds for *cross-domain few-shot learning*. This is what mainly motivated us to exploit student-teacher knowledge distillation to tackle cross-domain few-shot learning, where there is a large shift between the base and target domains. Moreover, our method also differs from [4], which uses separate layers for the classifier head and the self-supervised head, whereas we use the same classifier head as the distillation head. Without performing actual experiments and analysis, one could argue that reprojecting the features of the target domain into a vastly different source domain (using the same head) might actually hurt performance on the target domain. We believe our approach and analysis can be an important contribution to the existing literature and expand the applicability of knowledge distillation.
[1] Xu, Guodong, et al. "Knowledge distillation meets self-supervision." ECCV, 2020.
[4] Sohn, Kihyuk, et al. "Fixmatch: Simplifying semi-supervised learning with consistency and confidence."
<!-- *A1-2-1* -->
**[Self-Supervised Learning with Less Data]** We will make this point clear in the final version. Specifically, what we refer to here is that self-supervised learning on the *unlabeled target data only*, without using any source data, is not optimal, as self-supervised learning often requires a large amount of data to learn good features. We agree that self-supervised learning on limited data could help, but more data helps in learning more transferable representations.
<!-- by 'self-supervised learning generally requires a large amount of unlabeled data to work' is that
Moreover, self-supervised methods fail in fine-grained domain [6] (also check our experiment on mini-ImageNet->CUB below).
[2] Chen, Da, et al. "Self-supervised learning for few-shot image classification." ICASSP, 2021.
[3] Gidaris, Spyros, et al. "Boosting few-shot visual learning with self-supervision." ICCV, 2019.
[6] Wallace, Bram, and Bharath Hariharan. "Extending and analyzing self-supervised learning across domains."
-->
<!-- > [name=Richard Chen] For A1-2-1, I recall we discuss that those methods are using the source dataset for the self-supervised learning, so they use whole source dataset and the amount of data is generally larger than the small amount of target data we are using. However we are actually using source + target for the pretraining. That is, we are using more data...
> [name=Ashraful Islam] Still not finding a more convincing answer here, or how to comment on those methods. Can you think anything?
> [name=Rameswar Panda] The reviewer thinks this as a minor point. So, we could either move this to the end of this response [after Single Projection Head and Trivial Solution] or keep it short within few lines.
> [name=Richard Chen] I agree with Rameswar, just keep it simple. E.g. We will make this point clear in the final version. We agree that self-supervised learning on limited data could help, but more data could help in learning more transferable representations. -->
<!-- *A1-2-2* -->
**[Problem Setup]** Thanks for the comment. We believe that the setup in our paper is practical. Imagine that a company collects a certain amount of unlabeled data and can only afford to label a few samples. In this case, unlabeled target data is available at training time, and the few labeled samples are used for few-shot learning. More specifically, in our setup we actually know where the model will be applied, but do not have enough labeled samples from the target domain to train the network in a supervised setting. The common paradigm in this setup is to further train a pretrained model with self-supervised learning on the unlabeled target samples. In this paper we propose a more effective method that uses both the unlabeled target samples and labeled data from a different domain.
The setup the reviewer suggested (i.e., training with domain A (labeled) and domain B (unlabeled) and testing on domain C (unavailable at training)) is also interesting, and we actually show a few results on it in Table 6. Specifically, we used domains A and B to pretrain and then tested on domain C. Nonetheless, the results suggest that directly using domains A and C leads to the best results when the final evaluation is on domain C. In other words, the best accuracy is achieved when the unlabeled pretraining data and the evaluation data come from the same domain.
<!-- First, we follow the setup from the community (could we have reference?, the setup is still slightly different from semi-supervised few-shot or unsupervised domain adapation) -->
<!-- (May need a better example or we do not need to give one.) -->
<!-- We think this setup is practical. Imaging that a company collects a certain amount of unlabeled data and then can only afford to label few of them. In this case, the accessibility of tartget unlabeled data is available in the training time. -->
<!-- The setup reviewers suggested is interested and we actually have results at Table 6. We used domains A and B to pretrain and then test on domain C. Nonetheless, the results suggest that directly using domain A and C leads the best results if the final evaluation is based on the domain C. -->
<!-- *A1-3 and A1-4* -->
<!-- Please refer to the reorganizing plan (discussed below) where we will present our plan to discuss the motivation and intuition of our method in section 1 and 3 in the main paper. **[RP: I think we don't need this sentence here.]** -->
**[Single Projection Head and Trivial Solution]** Sharing the same head for the supervised and distillation losses leads to better performance because both the source and target data are projected into the same embedding space (single projection head), and the source data effectively regularize the embedding of the target data. To discriminate among source and target data in this shared space, while also discriminating individual data points, the learned features must be more discriminative, which results in a much better feature representation. This is our main motivation for using a single head, and it indeed yields better results than separate projection heads. We will add this to Section 1 of the final version (see the revision plan described below).
By trivial solution, we refer to the collapsed solution in non-contrastive self-supervised learning, i.e., the network outputting the same vector for all images. Please see Section 5.3 in [4] and Sections 1 and 3 in [5], where the authors discuss why the collapsed solution is a major issue in non-contrastive self-supervised learning and how to avoid it with different tricks. We hypothesize that using the same projection head for both the supervised and self-supervised losses imposes a regularization effect that prevents collapse, without relying on the explicit (often complex) tricks mentioned in [4] and [5].
[4] Caron, Mathilde, et al. "Emerging properties in self-supervised vision transformers."
[5] Grill, Jean-Bastien, et al. "Bootstrap your own latent: A new approach to self-supervised learning."
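For the reviewer's convenience, the sketch below illustrates what we mean by a single shared head within one training step. It is hedged PyTorch pseudocode: the temperatures, loss weight, EMA momentum, number of base classes, and the use of torchvision's `resnet18` (our paper uses ResNet-10) are illustrative placeholders, not the exact values and modules of our implementation.
```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

NUM_CLASSES = 64  # e.g., number of labeled base classes; illustrative

# Student: a ResNet backbone plus ONE head used for both the supervised
# classification loss and the distillation loss (the shared-head design).
student_backbone = nn.Sequential(
    *list(torchvision.models.resnet18().children())[:-1], nn.Flatten())
student_head = nn.Linear(512, NUM_CLASSES)

# Teacher: momentum (EMA) copy of the student; never updated by gradients.
teacher_backbone = copy.deepcopy(student_backbone)
teacher_head = copy.deepcopy(student_head)
for p in [*teacher_backbone.parameters(), *teacher_head.parameters()]:
    p.requires_grad = False


def train_step(x_src, y_src, x_tgt_weak, x_tgt_strong,
               tau_t=0.04, tau_s=0.1, lam=1.0, m=0.999):
    """One illustrative step of supervised + dynamic-distillation training."""
    # Supervised cross-entropy on labeled source images, through the shared head.
    loss_sup = F.cross_entropy(student_head(student_backbone(x_src)), y_src)

    # Distillation: the teacher produces a soft pseudo-label from the weakly
    # augmented target view; the student matches it from the strongly
    # augmented view through the SAME head.
    with torch.no_grad():
        p_t = F.softmax(teacher_head(teacher_backbone(x_tgt_weak)) / tau_t, dim=-1)
    log_p_s = F.log_softmax(
        student_head(student_backbone(x_tgt_strong)) / tau_s, dim=-1)
    loss_distill = -(p_t * log_p_s).sum(dim=-1).mean()

    loss = loss_sup + lam * loss_distill
    loss.backward()  # optimizer.step() / zero_grad() omitted for brevity

    # Dynamic teacher: exponential moving average of the student's parameters
    # (BatchNorm buffers would also be synchronized in a full implementation).
    with torch.no_grad():
        for t_p, s_p in zip(
                [*teacher_backbone.parameters(), *teacher_head.parameters()],
                [*student_backbone.parameters(), *student_head.parameters()]):
            t_p.mul_(m).add_(s_p, alpha=1 - m)
    return loss.item()
```
Because the source classification loss and the target distillation loss flow through the same `student_head`, the head cannot collapse to a constant output without destroying the supervised accuracy, which is the regularization effect described above.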
---
***Q2: The paper is somehow poorly organized. It is surprising to see that the main discussion (methodology) is not even 1.5 page long with a Figure in it. I think this is also part of the reason that the motivation of this paper is not well presented.***
**A2**: Thanks for the suggestion. We will provide more details about the motivation of our proposed method in the final version.
**Feedback to A2**:
I'll reconsider the score I recommend if the authors can provide the plan on improving the presentation in more detail. Thanks.
**Final Response**:
**A2 Presentation**
**[Presentation]** Below we provide a detailed plan on how we will improve the presentation to make our motivation and setup clear in the final version.
- In the second paragraph of the Introduction, we will add a discussion (about 2-3 lines) on why our setup is practical and how it differs from the usual cross-domain few-shot setup (see our response above on Problem Setup).
- We will discuss the motivation of our approach in the third paragraph of Section 1 (before illustrating our approach). For motivation, we will first explain why using both labeled source data and unlabeled target data during pre-training can be helpful (see point 2 of our initial response to Q1). Then we will explain how using the same projection head leads to better performance (see our updated response above on Single Projection Head and Trivial Solution). We anticipate this will be around 5-6 lines.
- We will move some parts of the 4th paragraph of the original paper (similarities and differences with recent methods) to Section 2.
- In Section 4.3, we will add the additional experiments on miniImageNet->CUB as a separate subsection to demonstrate the effectiveness of our approach on a fine-grained domain where self-supervised methods underperform (see our response below).
We hope this reorganization plan will clarify our motivation earlier in the Introduction. Please let us know if you have any further suggestions; we would be more than happy to incorporate them in the final version.
<!-- The plan should be described in details to convince the reviewers. Instead of saying we will do that, we can use a list of working item to describe what will be changed. -->
---
***Q3. Some widely used cross-domain few-shot learning settings were set in [1], and are not included in the paper.***
**A3**: For cross-domain few-shot evaluation, [1] uses only mini-ImageNet->CUB, i.e., training on the mini-ImageNet dataset and evaluating on the CUB dataset. We argue that this setting is rather limited, as CUB contains only natural images like ImageNet. We adopt the BSCD-FSL benchmark [6], which has a broader distribution of downstream datasets, from natural to medical images.
[1] Chen, Wei-Yu, et al. "A closer look at few-shot classification." ICLR 2019.
[6] Guo, Yunhui, et al. "A broader study of cross-domain few-shot learning." ECCV 2020.
**Feedback to A3**:
I respectfully disagree. The setting BSCD-FSL focusing on is more about sample-wise domain shift. While as ImageNet is a general classification dataset, and CUB is a fine-grain dataset, the source and target domains have dramatically different levels of inter-class discrepancy. miniImageNet -> CUB provides another perspective of cross-domain robustness measurement, therefore serves as a convincing benchmark and is not necessarily easier.
Considering my late further comments, I will reconsider my score if the authors can provide additional results, but failing to provide them will not affect the score.
**Final Response**:
<!-- Show the mini-ImageNet -> CUB results. But we need to mention that we can not directly use the numbers from other papers since the test setup is different. -->
**A3 Other cross-domain few-shot setting**
**[Experiment on miniImageNet -> CUB]** Thanks for suggesting this experiment. We agree with the reviewer that mini-ImageNet->CUB is an interesting experiment to show the transferability of different models to a fine-grained dataset. We perform additional experiments for mini-ImageNet->CUB, and report the results below:
| Model | CUB |
| -------- | -------- |
| MatchingNet | 58.23 |
| ProtoNet | 63.19 |
| Transfer | 68.72 |
| SimCLR | 62.84 |
| Transfer+SimCLR | 67.82 |
| STARTUP | 66.10 |
| Ours | **69.50** |
Note that we did not directly cite the numbers in [1] as the test setup is different; all the results are reproduced by us. For CUB, we found that vanilla `Transfer` performs surprisingly well (also reported in [1]), and adding SimCLR to Transfer (`Transfer+SimCLR`) actually decreases the accuracy. Wallace et al. also experimented with different self-supervised methods on smaller domains and found that all of them underperform on fine-grained tasks [2]. However, *our method still performs the best, demonstrating the effectiveness of our approach on a fine-grained downstream task*. We will add this finding as a separate subsection in Section 4.3 of the main paper.
[1] Chen, Wei-Yu, et al. "A closer look at few-shot classification." ICLR 2019.
[2] Wallace, Bram, and Bharath Hariharan. "Extending and analyzing self-supervised learning across domains."
---
***Q4: The performance on standard few-shot classification datasets are actually not comparable to SOTA. E.g., according to the [leaderboard](https://few-shot.yyliu.net/miniimagenet.html), with the standard inductive setting, many methods can achieve over 54% with simple Conv-4 architecture on miniImageNet 5way 1shot. While in-domain few-shot classification is obviously less challenging, it is confusing to me why the proposed method performs poorly.***
**A4**: Thanks for the comment. In Table 3, we compare in-domain performance using the same training and test sets and the same evaluation protocol for the methods considered for cross-domain few-shot learning. First, we want to clarify that our goal is not meta-learning for in-domain few-shot evaluation. Our approach is about obtaining a stronger pretrained model when some unlabeled target-related data are available, which is not the evaluation protocol of the leaderboard. Moreover, as stated in lines 233-234, our method needs unlabeled data from the novel classes, which results in a different test set for evaluation than the one used by the leaderboard. Thus the results are not comparable.
**Feedback to A4**:
I agree the settings are different. But since unlabeled data is introduced in training, will it be easier? To me, this setting is already closed to the 'transductive' setting recently attracts wide attention and achieves much higher performance comparing standard 'inductive' setting. Although not 100% the same since the unlabeled data is introduced in the testing stage in 'transductive' setting, but introducing more data in training stage should be able to ease the testing and make the numbers comparable, am I right?
**Final Response**:
<!-- First, clarify that our setup is still under `inductive` setting (maybe giving the definition of inductive and transductive setting here again.) We only get a smaller amount (20% of total testing data?) of data before testing. Therefore, our tasks will be easier than `pure` inductive setting but not as easy as `transductive` setup.
Second, show the one result from the leaderboard that using our pretrain models.
-->
**A4 Standard setting**
**[Inductive vs Transductive Setup]** As rightly stated by the reviewer, our setting is not entirely the same as the *transductive setting*, since unlabeled data is not used in the testing stage of our framework. On the other hand, it is also different from the *inductive setting*, since our method needs unlabeled data from the novel classes, which results in a different test set for evaluation compared to the inductive methods in the leaderboard.
Despite the different settings, our results are very much in line with previous findings on inductive methods. For example, with the same ResNet-10 backbone, the `Baseline` method in [1] achieves accuracies of 74.69 and 52.37, while our `Transfer` baseline obtains 74.26 and 53.40 for 5-shot and 1-shot evaluation on miniImageNet, respectively. We compare the results of other methods with the `Transfer` baseline in Table 3 of the main paper, where we achieve the best performance for in-domain evaluation (`Ours`: 76.02 for 5-shot and 53.71 for 1-shot on miniImageNet). We also note that the improvement in in-domain evaluation on the tieredImageNet dataset is substantial over the `Transfer` baseline, and even comparable to the [leaderboard (tieredImageNet)](https://few-shot.yyliu.net/tieredimagenet.html) methods, implying that our approach can achieve better in-domain few-shot performance with more data. Overall, we infer that our approach provides a good backbone/feature extractor even for in-domain few-shot evaluation.
Moreover, most methods in the [leaderboard (miniImageNet)](https://few-shot.yyliu.net/miniimagenet.html) apply additional post-processing networks/tricks on top of the backbone. For example, LFT [3] integrates additional transformation layers into the encoder backbone, and FEAT [2] uses additional layers and tricks, including a set-to-set function and a bidirectional LSTM/Transformer layer, to achieve good results. We argue that our method is not directly comparable with these approaches, as we simply use a pretrained backbone as a feature extractor for few-shot evaluation without any post-processing layers or tricks. However, one can still use our trained backbone as an initialization for other methods. To illustrate this, we perform an experiment where we use a ResNet-10 backbone trained with the `Transfer` baseline and a ResNet-10 backbone trained with our method as two different initializations of FEAT. We train with FEAT's default hyperparameters for 200 epochs and evaluate on a subset of the mini-ImageNet test images that were not used in the pretraining of our approach. We show the 5-way 1-shot results below along with the original accuracies of the models used for initialization:
| Method | Accuracy (1-shot 5-way) |
| -------- | -------- |
| Transfer | 53.40 |
| Ours | 53.71 |
| FEAT (initialization with Transfer) | 54.66 |
| FEAT (initialization with Ours) | 56.70 |
Indeed, the backbone pretrained with our method serves as a better initialization for FEAT than the one from the `Transfer` baseline.
[1] Chen, Wei-Yu, et al. "A closer look at few-shot classification." ICLR 2019.
[2] Ye, Han-Jia, et al. "Few-shot learning via embedding adaptation with set-to-set functions."
[3] Tseng, Hung-Yu, et al. "Cross-domain few-shot classification via learned feature-wise transformation."
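As background for the table above and for Table 3, the sketch below illustrates the feature-extractor evaluation protocol we refer to: freeze the backbone, embed the support and query images of each sampled episode, and fit a simple linear classifier on the support features only. The function and variable names are ours for illustration, and the use of scikit-learn's `LogisticRegression` reflects common practice on this benchmark rather than a verbatim copy of our evaluation code.
```python
import torch
from sklearn.linear_model import LogisticRegression


@torch.no_grad()
def evaluate_episode(backbone, support_x, support_y, query_x, query_y, device="cuda"):
    """n-way k-shot evaluation using a frozen backbone as a feature extractor.

    support_x: (n_way * k_shot, C, H, W) images; support_y: labels in [0, n_way).
    query_x / query_y: query images and labels from the same episode.
    """
    backbone.eval().to(device)
    feat_s = backbone(support_x.to(device)).flatten(1).cpu().numpy()
    feat_q = backbone(query_x.to(device)).flatten(1).cpu().numpy()

    # Fit a linear classifier on only the few labeled support features,
    # then classify the query features; no extra layers or tricks.
    clf = LogisticRegression(max_iter=1000).fit(feat_s, support_y.numpy())
    return clf.score(feat_q, query_y.numpy())


# Reported numbers average episode accuracy over many sampled episodes
# (typically 600), with a 95% confidence interval.
```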
<!-- First, we would like to clarify whether our reproduced results are comparable with previous findings. Please check Table A5 in [1], where the `Baseline` method (similar to `Transfer` in our paper) achieves accuracy of 74.69 and 52.37 for 5-shot and 1-shot evauation on mini-ImageNet respectively. Our reproduced results are 74.26 and 53.40 respectively. Although, we evaluate on a slightly different test set, hence the results are not exactly comparable, it reveals that our reproduction is correct. We take `Transfer` as the baseline method in our paper, and compare the results of other methods with `Transfer` baseline. All methods in Table 3 in our paper use the same ResNet-10 backbone and same evaluation setup. In this setup, we achieve the best performance for in-domain evaluation. We infer that our approach provides a good backbone/feature-extractor even for in-domain few-shot evaluation. -->
<!--
> [name=Rameswar Panda] Did the reviewer ask anything about reproducibilty of results?
> [name=Ashraful Islam] No. We should rephrase the sentence probably. My intension is to say that our approach should be compared with similar methods like 'Baseline' in [1] with similar setup and backbone.
> [name=Rameswar Panda] Got it, Is [1] focuses on transductive setting?
> [name=Ashraful Islam] No. Inductive
> [name=Rameswar Panda] I rephrased the above paragraph. Please take a look at it.
> [name=Ashraful Islam] Great! I also added a line about in-domain tieredImageNet evaluation, which is actually comparable to the leaderboard.
> [name=Rameswar Panda] Thanks! looks good.
-->
<!-- > [name=Richard Chen] Does FEAT (Transfer) equal to our reimplementation of the original paper (test on slightly different test set)?
> [name=Ashraful Islam] FEAT does not have results on ResNet-10. Yes, both of them are tested on slightly different set.
> [name=Ashraful Islam] Chatting here is very interesting. :P
> [name=Richard Chen] Ha
> [name=Ashraful Islam] FEAT with ResNet-10 is not good. I wonder why we want to include this. FEAT uses different hyper-paramters for different methods, it could have some effect. But still, the results are way low. :/ (Let me try with ResNet-18, and 1-shot, the images size in FEAT is 84x84, it could also have effect. Need to work more. :/)
> [name=Rameswar Panda] Does FEAT contain all of these: Bidirectional LSTM, Graph Convolutional Networks, DeepSets, and Transformer layer?
> [name=Ashraful Islam] You are right. They didn't use all of them. WIll change the sentence.
> [name=Rameswar Panda] Good, please write one sentence at the end about the conclusion from the table, i.e., our method still outperforms bla bla...
> [name=Ashraful Islam]Still preparing the response. Should be completed by tonight or early tomorrow. -->
---
<!-- ***Q5: Can the authors provide more discussion on how the proposed method is 'dynamic'?***
**A5**: The term 'dynamic' refers to the momentum teacher, as the parameters of the teacher network are updated during training from the parameters of the student network. We provided ablation on the importance of the momentum update in Table 11 in the supplementary material, which shows that we get around 1.47\% average improvement over fixed teacher for 5-way 5-shot evaluation.
**Feedback to A5**:
Thanks for your clarification. However, as also pointed by other reviewers, this 'momentum' idea has already been extensively studied in previous work in representation learning.
**Final Response**:
> [name=Richard Chen] I actually can not recall what we discuss for this point. or do we discuss this?
> [name=Ashraful Islam] I think we can skip responding this, or just point out above responses.?
> [name=Rameswar Panda] I agree. We can skip this part.
> [name=Richard Chen] Agree, let us skip this part.
--- -->