# Rebuttal ICML v2
## Reviewer iHsY [Borderline reject]
> My major concern is the discussion of the degradation brought by Batch Normalization, which might be the most important conclusion of this paper (as Line 97 claimed). Some recent works have pointed out this phenomenon and suggest using Layer Normalization, e.g., [1] Towards Stable Test-time Adaptation in Dynamic Wild World, ICLR2023, Top 5%
> Could you please provide more discussions for Batch Normalization and Layer Normalization, regarding the recent work [2]?
> [2] Towards Stable Test-time Adaptation in Dynamic Wild World, ICLR2023, Top 5%
We confirm the brittle performance of BN in SF-UDA. In contrast to Niu et al. (2023), who analyze the degradation of BN layers in specifically designed *synthetic* settings (small batches, mixtures of shifts, label imbalance, distribution shift), our work complements theirs by **providing evidence on *real data* and validating these findings on multiple architectures**. We found the reference relevant and will include it in our related work section.
> My second concern is that the authors only use three source-free UDA methods to investigate the task, which might be not enough for pure empirical evaluation work.
We believe that SHOT and SCA are very **representative methods** of the current SF-UDA literature and they are **complementary**: SHOT adapts the feature extractor keeping the classifier fixed, while SCA aligns the classifier leveraging a frozen feature extractor. These methods have proved **foundational in SF-UDA research**, as evidenced by their use in recent publications (Taufique et al., 2021; Yang et al., 2021a,b; Qiu et al., 2021).
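To make the distinction concrete, here is a minimal toy sketch (our own illustration, not code from the official SHOT or SCA repositories; module names are placeholders) of the two complementary adaptation regimes: SHOT-style adaptation updates the feature extractor under a frozen classifier, while SCA-style adaptation keeps the feature extractor frozen and only re-aligns the classifier.

```python
import torch.nn as nn

# Placeholder backbone and classifier head, just to show which parameters
# each regime would update during the second (source-to-target) transfer.
backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
classifier = nn.Linear(256, 10)

def shot_style_params(backbone: nn.Module, classifier: nn.Module):
    # SHOT-style: adapt the feature extractor, keep the source classifier fixed.
    for p in classifier.parameters():
        p.requires_grad = False
    return [p for p in backbone.parameters() if p.requires_grad]

def sca_style_params(backbone: nn.Module, classifier: nn.Module):
    # SCA-style: keep the feature extractor frozen, align only the classifier.
    for p in backbone.parameters():
        p.requires_grad = False
    return [p for p in classifier.parameters() if p.requires_grad]
```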
We also included the results of the NRC method, but we strongly agree that validating with even more methods would further support our claims. Therefore, note that in the revised version of the paper **we will add the following new experimental results with the recent AAD method** (*Attracting and Dispersing: A Simple Approach for Source-free Domain Adaptation. NeurIPS 2022 Spotlight*) to Tab. 5, further corroborating our findings:
| Method | SWIN L DomainNet | SWIN L OfficeHome | ConvNext XL DomainNet | ConvNext XL OfficeHome |
| -------- | -------- | -------- | -------- | -------- |
| AAD | 47.2 | 87.8 | 46.3 | 88.0 |
We appreciate the reviewer's suggestion and plan to include even more methods in future work.
> I notice that the appendix provides self-supervised learning experiments for UDA. Could you please give more discussions with [1]? [1] have done similar work for UDA. Moreover, I'm also curious about the usage of MOCO V2 in Line 1484, "As it can be noticed, after performing a fine-tuning on the source domain (and eventually applying SCA) MOCO v2 performs very similarly (on average) ...". Could you please provide more explanation for the operation, especially for the words in the parentheses?
> [1] Connect, Not Collapse: Explaining Contrastive Learning for Unsupervised Domain Adaptation, ICML22
We agree that Shen et al. (2022) have done interesting work on UDA using contrastive learning strategies. However, in our study in App J, we are not using a contrastive approach on unlabelled source and target data to disentangle domain and classification features. Instead, we are evaluating the effectiveness of self-supervised backbone pre-training for SF-UDA. Specifically, we aim to **compare standard supervised pre-training against models pre-trained on ImageNet without labels**. In both cases, the pre-trained weights are then used as initialization for the subsequent operations.
At line 1484 we refer to the competitive performance of both FT and FT+SCA with respect to their corresponding supervised baselines, averaged over the considered datasets.
Thank you for bringing to our attention that our point was not sufficiently clear. We will provide additional clarifications on this matter in the final version of our paper.
> A suggestion that is not necessary. Could you provide more SF-UDA implementation in your experiments? I find only three SF-UDA implementations are used in this submission. I realized the importance of large-scale evaluations towards datasets, architecture, fine-tuning strategy, and normalization strategy. I also realized that a unified SF-UDA benchmark that includes implementations as many as possible is important to boost this field. Furthermore, I also encourage more evaluations for Semi-supervised situations [3,4] or other tasks like segmentations. Finally, I suggest the authors refer to the survey repo "https://github.com/YuejiangLIU/awesome-source-free-test-time-adaptation" if they want to enhance their evaluations further.
> [3] Context-guided entropy minimization for semi-supervised domain adaptation. NN2022
> [4] Learning Invariant Representation with Consistency and Diversity for Semi-supervised Source Hypothesis Transfer, Arxiv2021
We are glad to hear that the reviewer shares our view on the importance of large-scale evaluations for SF-UDA, and we agree with the suggestion to include more SF-UDA implementations in our experiments. We also thank the reviewer for pointing out that repository, which we found useful and very well organized. We are currently working to extend this work with more methods, tasks (like segmentation), and settings. We are not dealing with the semi-supervised setting right now, but it is extremely common in practical scenarios and it is a very good suggestion, so we will include it in future work.
**In the initial release of the code we will also add our optimized implementation of Yang et al. (2022c) and of at least two other methods.**
Moreover, we plan to keep the code repository up to date, continuously adding new methods. One aim of this project is to provide efficient and scalable implementations (multi-node, multi-GPU, automatic mixed precision, etc.), supporting any architecture from the `torchvision` and `timm` libraries and evaluation on all the datasets used in our paper.
- Niu, Shuaicheng, et al. "Towards stable test-time adaptation in dynamic wild world." ICLR (2023).
- Qiu, Zhen, et al. "Source-free domain adaptation via avatar prototype generation and adaptation." IJCAI (2021).
- Shen, Kendrick, et al. "Connect, not collapse: Explaining contrastive learning for unsupervised domain adaptation." ICML (2022).
- Taufique, Abu Md Niamul, Chowdhury Sadman Jahan, and Andreas Savakis. "Conda: Continual unsupervised domain adaptation." arXiv preprint arXiv:2103.11056 (2021).
- Yang, Shiqi, et al. "Exploiting the intrinsic neighborhood structure for source-free domain adaptation." NeurIPS (2021).
- Yang, Shiqi, et al. "Generalized source-free domain adaptation." ICCV (2021b).
- Yang, Shiqi, et al. "Attracting and dispersing: A simple approach for source-free domain adaptation." NeurIPS (2022).
## Reviewer K1Z8 [Borderline reject]
> The paper lacks novel intuitive or theoretical insights. The four summarized findings are quite trivial. The authors' key finding that pre-training large models with large-scale datasets can improve domain adaptation performance has been supported in previous works (Kim et al. 2022; Zhang and Davison, 2020).
> It is also unsurprising that SF-UDA methods are competitive with state-of-the-art UDA methods, as demonstrated by previous works such as NRC and SHOT.
Domain Adaptation methods are **typically tested on a limited set of benchmark datasets and with only a small number of backbone architectures**. However, when these methods are used in practice on different data and with new architectures, the **performance might change dramatically, as pointed out in Kim et al. (2022)** for the UDA problem. Contrary to the UDA setting analyzed in Kim et al. (2022) and Zhang and Davison (2020), SF-UDA is greatly understudied, and most of the **results obtained for other settings cannot be assumed to hold for SF-UDA without proper experimental verification**. Our extended experimental analysis aims at shedding light on these aspects. We study how different design choices impact the final SF-UDA result, validate our findings on a wide variety of datasets and backbone architectures, and identify guidelines for designing effective SF-UDA approaches.
> The paper overlooks other important choices in SF-UDA, such as whether to inverse source data or features, how to choose a simple and effective target learning strategy, and whether explicit domain alignment is necessary. These questions are fundamental to SF-UDA and should be addressed.
We appreciate the Reviewer's insightful feedback on our work and acknowledge that these aspects in SF-UDA warrant further exploration. We fully intend to address these in our future research. In this work, we focus on several key choices that we believe are particularly compelling. These include fine-tuning, adaptation of the feature extractor versus the classifier, pre-training (including self-supervised strategies), normalization choice, and the robustness and fair evaluation of SF-UDA methods.
We believe that our contributions, along with our code and implementations, will significantly benefit the research community and advance the study of the issues raised by the Reviewer, which will be of great interest in our future work.
> The experimental setting is unclear. SHOT's popular pipeline fine-tunes the pre-trained model in the source domain and then adapts it to the target domain. However, the paper mainly focuses on weak baselines like linear probing and SCA, which are confusing and unconvincing. Tables 3 to 5 show that SHOT consistently outperforms these baselines, which challenges why need to study linear probing and SCA for SF-UDA.
Thank you for raising this concern. We included SCA as a baseline in our study because it is **complementary to SHOT**. While SHOT adapts the feature extractor keeping the classifier fixed, SCA does the opposite. It is true that SHOT generally outperforms SCA, but we found that SCA is not a weak baseline. In fact, as shown in Table 5, **SCA performs on par with some state-of-the-art UDA methods** while being simpler and more efficient (as described in App B). Furthermore, **SCA is significantly faster** than SHOT (as shown in App H), and **the performance gap between SHOT and SCA is small when using state-of-the-art architectures** with Layer Normalization (as shown in App I). We appreciate your feedback and will better emphasize these points in the final version of our paper.
> Some key claims require further explanation and clarification, such as why LN is better than BN in cases of full fine-tuning. The authors provide no reasoning for this observation.
One possible reason is that architectures with LN, which are well-optimized on large datasets, may present a more robust inductive bias that leads to better performance on downstream tasks like domain adaptation: **BN is sensitive to distribution shift, especially in the low-data regime of the SF-UDA setting** where statistics are poorly estimated. We observed that replacing BN with LN in some architectures made it difficult or impossible to train them on ImageNet (we will report our experimental results on ResNet50 in App K of the final version of the paper). Additionally, LN has been shown to work well in unsupervised and generative models, which aim to learn representations capturing high-level structure and statistical regularities in data. This property of LN could be beneficial for domain adaptation, where the goal is to adapt to new domains with different statistical properties.
We will include a discussion of these points in the revised version of the paper to better justify our observations.
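To make this concrete, below is a minimal toy sketch (our own illustration, not the paper's code) of the mechanism we refer to: in eval mode, BN normalizes target features with running statistics estimated on the source, so a shifted target distribution breaks the normalization, whereas LN normalizes each sample independently and is unaffected by stored statistics or batch composition.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 8
source = torch.randn(256, C)              # toy "source" features
target = torch.randn(256, C) * 3.0 + 2.0  # shifted toy "target" features

bn = nn.BatchNorm1d(C)
ln = nn.LayerNorm(C)

# Estimate BN running statistics on the source domain (train mode).
bn.train()
_ = bn(source)

# At test time on the shifted target, BN in eval mode still uses the source
# statistics, so its outputs are far from zero mean / unit variance, while
# LN still normalizes each sample correctly.
bn.eval()
with torch.no_grad():
    out_bn = bn(target)
    out_ln = ln(target)

print("BN output mean/std:", out_bn.mean().item(), out_bn.std().item())
print("LN output mean/std:", out_ln.mean().item(), out_ln.std().item())
```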
> In addition to the raised issues, the paper should clarify whether SCA requires the retention of both the source model and source prototypes and, if so, how to prevent source privacy leakage through feature inversion methods. Additionally, there appear to be misuses of rows and columns in Section 4.3, which necessitates correction and clarification.
We agree with the reviewer that the privacy leakage through inversion methods is a general problem that requires in-depth study, especially since privacy is one of the key motivations for SF-UDA methods. However, to the best of our knowledge, methods in the SF literature typically retain both the source model and prototypes. For instance, in SHOT, the hypothesis on the source domain (i.e., the classifier, which is equivalent to retaining one prototype per class) is transferred to the target by adapting the feature extractor. In SCA, instead, the prototypes can be represented by the centroid of the source domain or by the weights of the last linear layer. In App B, we provided a detailed discussion of the prototypes of SCA, along with analysis and results. We will emphasize these aspects in the final version of the paper.
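For clarity, the toy sketch below (illustrative only, not SCA's actual algorithm; function names are ours) shows what we mean by retaining one prototype per class: class centroids computed on source features can be stored instead of the source data, and target features can then be assigned to the nearest prototype, e.g., by cosine similarity.

```python
import torch
import torch.nn.functional as F

def class_prototypes(features: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    # One centroid per class, computed from already-extracted source features.
    return torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])

def assign_by_prototype(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between each target feature and each class prototype.
    sims = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).T
    return sims.argmax(dim=1)
```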
Finally, we apologize for the misuse of rows and columns in Section 4.3 and will make corrections in the final version of the paper.
## Reviewer rLLa [Reject]
> Presentation and writing: many places are described too vaguely, making it difficult to read. For example: What does each dot in Figure 2 and 3 mean?
**We will be happy to improve the current manuscript and clarify the specific doubts raised.** In Figures 2 and 3, each dot represents the LP accuracy of one of the considered architectures; in the caption we refer to them as "markers". We apologize for any confusion and will improve the caption for further clarification.
> This study only focuses on the training of classifier. However, some existing UDA methods train not only the classifier, but also the feature backbone, when adapting the source domain model. For example, the SHOT model uses the soft labels to train the feature extraction network and classifier at the same time. As the source domain data is labeled, one can perform supervised training. However, when the authors investigate LP and CP, only classifier training is considered. To be more comprehensive, it is required that the entire model (including both feature backbone and the classifier) is tuned with whatever label information available (as a strong case study at least).
While we did run some experiments training only the classifier, please note that **in most of our experiments we actually evaluated what the Reviewer is asking for**. Our work specifically focuses on:
1) The problems of **fine-tuning vs. no fine-tuning of the feature extractor** in the first transfer;
2) The **adaptation of the classifier (SCA) vs. the adaptation of the feature extractor (SHOT) in the second transfer**; see, for instance, Section 4.3 and Tables 3 and 4.
>Contributions: it seems no new findings from this paper for the readers. For instance, that big data brings good performance is nothing new in deep learning in general. Also, those about the shortcomings of batch normalisation (BN) and the advantages of layer normalisation (LN) are not novel findings (Refer to [A1] and [A2] for similar conclusions)
The general trend that larger data improves performance is **not guaranteed to hold in the context of SF-UDA.** Seemingly intuitive claims should be rigorously supported by empirical evidence: the presented results experimentally support this *hypothesis*.
Regarding Layer Norm, contrary to Chang et al. (2019), who propose a new UDA approach building on domain-specific BN parameters, and Vaswani et al. (2017), who merely adopt LN layers in their architecture, **we provide extensive and consistent empirical evidence of BN layer failures on real data and multiple architectures**, offering a fair comparison between LN and BN in SF-UDA.
> These following pieces are not clear and confusing: By my best guess, Equation (7) means that the final accuracy of a model is equal to a linear transformation of the top1 of its backbone. In the classification task, top1 refers to the correct rate itself.
> Equation (8) is hard to understand in the current form. My guess is that the authors want to use and to describe the accuracy improvement of using ImageNet21k backbones. If yes, this is a model parameter, the question is how to predict?
> From my read, Figure 1 is an illustration of Equation (7) and Equation (8). If this is the case, my current take is that the authors are doubting the research meaning of SF-UDA. Specifically, the model performance is mostly determined by the feature backbone, whilst all the other design choices just give the contributions at the level as the noises. From Figure 1, however, this effect is not big though. The authors need to explain more for sufficient clarity.
**Equations (7) and (8)** present two statistical models of our experimental observations and **show how relevant the choice of the backbone and of the pre-training data is for predicting the final SF-UDA performance** (Accuracy (B)). While (7) linearly predicts the final accuracy from the top-1 accuracy of the backbone on ImageNet-1K, **(8) leverages further information on the pre-training data** in a multi-linear model with interactions. We will clarify Section 4.2 in the final version of the paper to improve readability.
Figure 1 provides a visualization of the relationship between the top-1 accuracy of the backbone and the generalization performance on different tasks (*top*: LP Gen and LP DGen, *bottom*: SF-UDA). Lines represent the fitted bilinear model.
We stress that we are not questioning the research value of SF-UDA; rather, **we aim to rigorously investigate and empirically quantify the extent to which the backbone and pre-training data are critical in SF-UDA.**
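As a purely illustrative example (with placeholder numbers, not our experimental results, and assuming the multi-linear form with an interaction term described above), the sketch below shows how such models can be fitted with ordinary least squares:

```python
import numpy as np

# Placeholder values only, for illustration of the fitting procedure.
top1 = np.array([76.1, 81.3, 84.2, 85.8, 87.3])       # backbone top-1 on ImageNet-1K
in21k = np.array([0.0, 0.0, 1.0, 1.0, 1.0])           # 1 if pre-trained on IN21K
sfuda_acc = np.array([61.0, 66.5, 72.0, 74.1, 76.8])  # final SF-UDA accuracy

# Model (7): Accuracy(B) ~ a + b * top1(B)
X7 = np.column_stack([np.ones_like(top1), top1])
coef7, *_ = np.linalg.lstsq(X7, sfuda_acc, rcond=None)

# Model (8): add the pre-training data indicator and its interaction with top1
X8 = np.column_stack([np.ones_like(top1), top1, in21k, top1 * in21k])
coef8, *_ = np.linalg.lstsq(X8, sfuda_acc, rcond=None)

print("Model (7) coefficients:", coef7)
print("Model (8) coefficients:", coef8)
```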
- Chang et al. "Domain-specific batch normalization for unsupervised domain adaptation." CVPR (2019).
- Vaswani et al. "Attention is all you need." NeurIPS (2017).
## Reviewer brfk [Borderline accept]
We express our gratitude to the reviewer for their exceptional attention to detail.
> While the study is very thorough in terms of datasets considered, neural network architectures, and involved a very large number of experiments overall, one aspect where it is lacking is in terms of the SFDA methods used. It would be great to also consider an entropy minimization (without pseudolabeling) representative, like TENT (Wang et al) as well more recent/ SOTA methods like NRC, which the authors do mention but only appears in one table (that has a different evaluation protocol) as a reference point, instead of participating in the analysis performed in this study. Further, AdaBN is a great reference point too but unfortunately shows up very selectively (only in Figure 3 in the main paper). Finally, LAME (Boudiaf et al) is a method that doesn’t update the feature extractor during unsupervised transfer (like SCA) but is more sophisticated than SCA and thus would perhaps be a better reference point.
We fully agree with the reviewer's concern. **We decided to limit the number of evaluated methods to enable a more in-depth analysis across many datasets and architectures**, since we have observed that the success (and performance) of SF-UDA (and UDA) methods usually depends strongly on the dataset: it is easy to make a method work on a limited number of datasets, but very hard to make it work consistently across many.
Furthermore, **in the final version of the paper we will add the results of the recent state-of-the-art method AAD** (*Attracting and Dispersing: A Simple Approach for Source-free Domain Adaptation, NeurIPS 2022 Spotlight*), and in the release of our code we will include our preliminary experiments and implementations of AdaBN, LAME and possibly other methods that we are currently investigating (and designing), in order to boost future research on the topic. We also agree that LAME could be a very interesting baseline, but we chose SCA precisely for its simplicity and for the competitive results it obtains with respect to other state-of-the-art methods (see Table 5).
> The study also assumes the use of pretrained models throughout (and consider double transfer: first to the source domain, and then to the target domain) but it would be really useful to have, as a reference point, results that train from scratch on the source domain as a starting place (rather than transferring the pretrained model). This is because not all real-world problems are such that there exists a related-enough large-scale dataset to pretrain on.
We ran preliminary experiments assessing training from scratch and observed a large performance drop with respect to ImageNet pre-training. The results are consistent across all considered datasets (even for DomainNet, which provides a large number of images). We originally excluded those results from the paper, but since they may be of interest (thank you for the suggestion), we will add them to the Appendix of the final version.
> Some aspects of the methodology are unclear. Perhaps most importantly, it is unclear how model selection is done throughout the paper. This is an open problem in areas like domain generalization/adaptation (due to unavailability of validation set from target; or at least this is the strictest possible assumption which would hold in real-world problems) and thus it is very important to declare the design choices made. For instance, for the finding that finetuning the pretrained model on the source dataset degrades performance on the target, could it be that this is a symptom of the hyperparameters used during this finetuning phase? (see Questions section too)
> More generally, how was model selection done for each model / stage / experiment?
We reported all the details of the implementations and of the model/hyperparameter selection strategy in App D. To ensure reliable results, we adopted and strictly followed the following training/evaluation protocol. During the first transfer, the source dataset is split into training and validation sets, and the latter is used to select the number of training epochs, while the other hyperparameters are fixed. For the second transfer, the hyperparameters are fixed to constant values, since the unavailability of target labels makes validation impossible. We noticed that most works in the literature use different second-transfer hyperparameters to validate the algorithm on different datasets. **We dissociate ourselves from this practice** for the aforementioned reasons. In the presented work, hyperparameters are fixed for all datasets and architectures.
We thank the reviewer for highlighting these points, we will better clarify them in the manuscript.
Additionally, we observed in past experiments that varying the hyperparameters during the first transfer does not significantly affect the downstream performance and, in particular, does not change the performance degradation on the target caused by the fine-tuning. We will add these results to the Appendix in the final version of the paper.
> Terminology: ‘As opposed to fine-tuning, in which the pre-training and downstream tasks can be significantly different’. To me, fine-tuning is an approach, not a problem setting. Perhaps it would be clearer to say ‘As opposed to transfer learning in general, in which…’. That is, I think it’s clearer to consider that transfer learning is a general problem of adapting from source to target, DA and DG are special cases where the class labels match between source and target, and so on, as defined in the paper for the remaining distinctions (e.g. between DA and DG, and between DA and UDA, SFDA).
>In Equations 1 and 3, should it be argmin rather than min? I.e. showing that we want to find the arguments for theta and phi that would minimize that expression. This reads like a minimization problem that defines some training objective that isn’t differentiable (as it’s phrased w.r.t the accuracy on the target domain)
Thank you for pointing out these mistakes, we agree with the reviewer and we will correct the final version of the paper.
> Section 3.1, in Remark paragraph, clarify that the training set for the LP or CP is always coming from the source domain (maybe obvious to those working in the area, but useful for completeness and clarity)
Thanks for the suggestion. We will implement the clarification in the updated version of the manuscript.
> For models trained on IN21K, why is the design choice made of the triple transfer: IN21K -> IN -> source -> target, instead of the double transfer IN21K -> source -> target?
The motivations for this choice are twofold:
* It is common practice in the literature to fine-tune on IN models previously trained on IN21K. Thus, we believe it is of interest for the community to evaluate those models.
* Preliminary experiments showed that fine-tuning on IN a model previously trained on IN21K has no effect on the downstream SF-UDA accuracy.
We will add these observations and results in the final version of the paper.
>‘In SF-UDA, as it is common in the literature […], we consider the transductive setting’. My understanding was that SF-UDA was the non-transductive counterpart of test-time adaptation. Is this incorrect? Could you provide some reference from the SFDA literature that describes this transductive evaluation setting? It would also be great to have a discussion of the differences between these related problems in the paper (as methods developed for one can usually readily be applied to the other).
To the best of our knowledge, UDA and SF-UDA methods in the literature are evaluated in the transductive setting. We refer the Reviewer to some recent methods: SHOT (Liang et al., 2020), AAD (Yang et al., 2022), Confidence Score DA (Lee et al., 2022).
> Figure 1: which dataset(s) were used? Would be great to say in the caption.
We will add the information in the caption.
>Is there some hypothesis for why models with BN layers don’t benefit from the adaptation (finetuning) on the source?
>When saying that ADABN adapts the BN layers, clarify whether this refers to the parameters or statistics of the BN layers (or both).
>More generally, it is very important to say how BN is handled in each situation for each method in each of the 2 stages of the double transfer (i.e. train mode or eval mode?)
>For SCA, it would be nice to also show a variant that does adapt the BN statistics to the target domain (like AdaBN) even though no parameter updates to the feature extractor are made. This is potentially a more interesting reference point.
We hypothesize that BN layers, while aiding the optimization process, make the model sensitive to shifts in the input distribution. Fine-tuning on a particular dataset with a limited number of images exacerbates the problem.
We followed existing work on SF-UDA involving batch normalization layers. We use BN layers in train mode in every phase that updates the model weights with gradients, and in eval mode in all phases where only inference is needed, i.e., to compute the pseudo-labels and at test time. AdaBN adapts only the statistics of the BN layers.
We will make it clear in the final version of the manuscript, thank you for pointing this out.
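For reference, a minimal AdaBN-style sketch (a generic recipe written for illustration, not our exact implementation; the function name is ours) would re-estimate only the BN running statistics on unlabeled target data via forward passes in train mode, with no gradient updates to the model parameters:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn_statistics(model: nn.Module, target_loader) -> nn.Module:
    # Reset running statistics so they are re-estimated from target batches.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()
    model.train()  # BN updates its running statistics only in train mode
    for batch in target_loader:
        images = batch[0] if isinstance(batch, (list, tuple)) else batch
        model(images)  # forward pass only; no loss, no backward
    model.eval()       # inference now uses the target-domain statistics
    return model
```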
We agree with the Reviewer that further experiments combining AdaBN and SCA would be interesting. We are also currently evaluating the use of PCA before SCA and the use of Kernel K-Means instead of the classical spherical K-Means of SCA. These are very promising directions, since they seem to outperform SCA on many benchmarks, but we leave them as extensions of the current work. These methodologies and our preliminary results will be included in the first public release of the code.
>In Table 5, clarify exactly what are the differences in the evaluation protocol in this table compared to the rest of the paper. Also, to check: all of these methods use exactly the same pretrained model weights too (not just the same architecture), correct? Why is FT+NRC trained only on a subset of DomainNet?
Table 5 follows the exact same evaluation protocol as the rest of the paper. The only difference is that for this table we used the non-optimized official code of the methods, which runs on a single GPU (we only added mixed precision). We will correct and clarify this point; thank you for pointing it out.
We trained NRC with a subset of DomainNet because the algorithm itself is not designed to work with fairly large datasets. Although the modifications needed to extend (and optimize) it to large datasets are minimal, and we could simply implement them to improve the original algorithm, we considered this out of the scope of our work. For this reason we reported the results obtained using a subset of the dataset for training: NRC was still able to reach impressive results (despite the unfavorable setup), thus supporting our claims.
> Figure 3: could the degraded performance observed when finetuning be a consequence of the hyperparameter being chosen in a way that is agnostic to the target domain? How was model selection done?
Please see our answer to the model-selection question above: hyperparameters for the first transfer are selected on a source validation split, while second-transfer hyperparameters are kept fixed across all datasets and architectures, since target labels are unavailable for validation.
**Minor comments**
>Section 4.3 refers to a top-left part of Figure 2 which doesn’t exist (Figure 2 doesn’t have a left and right part), and to a ‘second column’ of Figure 2 (which, again, has only 1 column)
>In the paragraph above Remark 1: ‘Nevertheless, in the second column…’ - should be ‘second row’. And later, ‘Finally, in the second row’ - should be the ‘second column’.
>In Equation 6, S_{tgt} isn’t defined (unless I’m missing it?)
We thank the reviewer for spotting these oversights in the text. We will fix them in the manuscript.
**Limitations**
>As mentioned above, the study is limited for the most part to only 2 SFDA approaches, which are not representative of all categories of methods used in this area and aren’t very recent methods, and far from SOTA these days, and makes the assumption throughout that pre-trained models are utilized. Further, I wonder what is the limitation of considering computer vision distribution shifts only, though I recognize that this study is already very thorough and it is reasonable to consider that to be beyond the scope of this paper.
As discussed above, in the final version we will add results for the recent AAD method, and our code release will include implementations of AdaBN, LAME and additional methods, which we believe mitigates the concern about the limited set of SF-UDA approaches. Regarding distribution shifts beyond computer vision, we agree that this is an interesting direction; as the Reviewer notes, we consider it beyond the scope of this paper and leave it for future work.
- Liang et al. "Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation." ICML 2020.
- Yang et al. "Attracting and dispersing: A simple approach for source-free domain adaptation." NeurIPS 2022.
- Lee et al. "Confidence score for source-free unsupervised domain adaptation." ICML 2022.
## Area Chair Private Message [Optional]
Dear Area Chair,
We are writing to express some concerns we have about the feedback provided by Reviewer rLLa for our submission. While we appreciate the time and effort put into the review, we have reasons to believe that the reviewer may not have read our paper carefully, which has led to an inaccurate assessment of our work.
In particular, we noticed that the reviewer mentioned the absence of certain experiments and considerations in our paper, which were actually presented in the manuscript. We believe this may have contributed to the reviewer's negative assessment of our paper, which unfortunately may impact the final decision for our submission.
Given the above, we kindly request that you consider our concerns when making the final decision for our submission. We would be happy to provide any additional information or clarification that may be helpful in this regard.
Thank you for your attention to this matter.
Sincerely,
The authors
## Answer to official comment of Reviewer brfk
QUESTION:
Thank you authors for the thorough responses, additional experimental results and observations! I wanted to ask for some more clarifications based on these:
RE: model selection: the authors state that they use 'constant' hyperparameters for the second phase of transfer, which are fixed across datasets if I understood correctly? How were these constant values obtained? Do they differ from one method to another or is e.g. the learning rate chosen to be exactly the same across methods? I very much appreciate that the authors avoided explicitly tuning hyperparameters on the test set of the target domain (given the assumed unavailability of labeled validation data from the target), which I suspect is often what other works do, without explicitly declaring it. However, it is still very important to set the hyperparameters to 'reasonable' values, as otherwise we might be drastically underestimating the performance of one method relative to another. How was this addressed?
The authors might be interested in the following papers for doing model selection 'properly' in unsupervised domain adaptation:
Musgrave et al, Three New Validators and a Large-Scale Benchmark Ranking for Unsupervised Domain Adaptation
You et al, Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation
Saito et al, Tune it the Right Way: Unsupervised Validation of Domain Adaptation via Soft Neighborhood Density
RE: "We use BN layers in train mode in every phase that updates the model weights with gradients and in eval model in all phases where just the inference is needed, i.e., to evaluate the pseudo-labels and at test time." -- I wonder if the issues/brittleness associated with BN that the authors identified is (at least partly) due to the choice of using eval model at test time for BN. Using training mode even at test time might be a more appropriate choice, as statistics from the (target domain!) batch would be used which are likely more appropriate than using source domain statistics, especially in the event of a large distribution shift between source and target. (Of course, using train mode at test time makes the setup a bit transductive, as statistics are jointly used across examples that belong to the same batch, but this is probably okay as the evaluation is transductive anyway). What are the authors' thoughts about this? Were experiments considered along these lines?
ANSWER:
We would like to express our gratitude, again, to Reviewer brfk for taking the time to ask additional insightful questions, and for providing us with useful references.
Hyperparameter selection for the second transfer is an interesting topic, and we agree that some existing methods can be helpful. However, in our experience, in many practical scenarios the results are not consistent or convincing. While we have observed performance gains when using some hyperparameter-selection methods on certain datasets, the same methods applied to different architectures and/or datasets led to performance drops.
As stated in Appendix D, we used the "official" hyperparameters reported for the methods we studied in most of their experiments. SHOT, in particular, has been found to be highly robust to its hyperparameters, while NRC and AAD are more sensitive to them. Even so, we were able to achieve excellent results across different datasets using these hyperparameters without changing them.
We also included SCA in our discussion, as it is hyperparameter-free and very effective.
Concerning BN, we found that using BN in training mode at test time helps alleviate the "hard performance drops" caused by BN, as correctly pointed out by the Reviewer. However, we also found that in other scenarios this approach decreased performance significantly. We were not able to propose a general heuristic for deciding when to use training or evaluation mode, other than in synthetic-to-real scenarios, where using training mode is usually recommended.
In the manuscript, we presented the results of AdaBN (see Fig. 3), which we believe address the Reviewer's concern. After fine-tuning, we adapted the BN statistics to the target domain, similar to using BN in training mode on the target. The results were consistent with our previous findings, indicating that AdaBN can help alleviate performance drops in some cases, but can also cause degradation in others.
---
We did not consider some methods that were found to be very sensitive to hyperparameters and seed for the RNG, or whose implementation is particularly brittle with respect to the convolution algorithm and the CUDA version used. We excluded these methods as they "overfit the benchmarks" and were not stable with respect to hyperparameter selection and the other constraints we identified.
We stress that this is a general problem of Unsupervised Domain Adaptation and not only of the Source-Free setting.
Some examples of works that, despite our efforts, we were not able to fully reproduce (or that did not perform as well as expected, even with minor modifications of the implementation, as explained above) are [1], [2] for SF-UDA and [3], [4] for UDA.
* [1] Dong et al. "Confident anchor-induced multi-source free domain adaptation." Advances in Neural Information Processing Systems 34 (2021): 2848-2860.
* [2] Yang et al. "Generalized source-free domain adaptation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
* [3] Na et al. "Fixbi: Bridging domain spaces for unsupervised domain adaptation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
* [4] Zhang et al. "Bridging theory and algorithm for domain adaptation." In International conference on machine learning, pp. 7404-7413. PMLR, 2019.
---
Thank you for your feedback and suggestions. We agree that having a unified benchmark for Source Free Unsupervised Domain Adaptation would be valuable and essential for the community.
In regards to the experiments in this paper, we appreciate your interest in seeing more results, which confirms the usefulness and importance of the topic. We are currently working on extensions and performing additional experiments in many directions, which we believe will further contribute to the field.
While we acknowledge that there is always room for additional experiments and benchmarks, we are confident that the extensive set of experiments we conducted for this study, which required more than 360 GPU days (excluding the last set of experiments that we conducted, as requested, and also excluding any extension that we are currently working on), provides a strong foundation for our empirical findings.
We hope that our code and extensions will be a valuable resource for the community to investigate Source Free Unsupervised Domain Adaptation further and to build a unified benchmark.
Regarding recent Batch Normalization works for domain adaptation, we will certainly consider this in our future work.
Once again, we appreciate your valuable feedback and suggestions. Please let us know if you have any further comments or questions.