Dear AC,
As the discussion period is almost over, we would like to thank you for all the effort you have made. This year's submission has overall been a great experience: we really enjoyed the conversations with the reviewers, who provided insightful feedback that helped us improve our paper. That said, two major concerns remain, and we would like to ask for your advice.
1. We are glad that reviewer Zi5U responded to our rebuttal. His original recommendation was weak reject, mainly because he was worried about whether the additional AE module and the corresponding loss function are meaningful. Since his response came at the last minute, we could not conduct any further experiments. Fortunately, this point was already carefully analyzed in the main paper: results in Table 4 clearly show that both the auto-encoder and the loss function contribute significantly to the overall performance. After we presented these numbers, he agreed to raise his recommendation to 6 and now leans towards acceptance. Unfortunately, it seems that he forgot to update his score on the official review page. Although we tried to reach out to him, we have not received any further response, and we are worried that this may affect the final decision on our paper. Since the discussion period is almost finished, we would like to know whether there is anything else we can do.
2. We still have not received any feedback from reviewer iJo5. We were really hoping to learn whether our additional experiments have addressed his concerns, since one of the experiments was suggested by him to help justify where the performance gain of our method comes from.
Your help is greatly appreciated.
Dear reviewer Zi5U,
Glad to know that some of the concerns have been addressed! We really appreciate your decision to raise the score. As a kind reminder, we noticed that the score has not been updated on the official review page yet. Would you please modify it accordingly?
Thanks again for your thoughtful review and helpful suggestions for improving our paper!
Dear reviewer Zi5U,
Thank you for your reply! We respectfully disagree with the following comment that you made:
>*I still believe the overall novelty is limited since the performance gain from the additional AE module and the corresponding loss function are marginal*
Table 4 of the paper clearly shows that stage two of MPA (where we incorporate the AE module and the loss function) improves upon stage one by an average of 2.8%. We strongly believe that this gain is in fact significant. We have copied the table below for reference.
||A→C|A→P|A→R|C→A|C→P|C→R|P→A|P→C|P→R|R→A|R→C|R→P|Avg|
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Stage One| 52.2|83.5 |82.1 |72.8 |83.6 |82.8 |73.3 |52.7 |82.4 |72.0 |51.6 |82.1 |72.6 |
|Stage Two| 54.1|85.9 |85.2 |74.3 |86.0 |85.3 |74.6 |54.1 |85.3 |74.3 |54.2 |86.0 |***75.4*** |
Dear reviewers, we sincerely thank all of you again for your valuable feedback. As the end of the discussion period is approaching, we would like to know if there are any concerns remaining. Please do not hesitate to let us know.
| | Training Time (hours) | Test Time Per Image (seconds) | mAP |
|:------------|:-----:|:---:|:---:|
|RCNN |84 |47 |66.0 |
|Fast RCNN |***9.5***|***0.32*** |66.9 |
| | Test Time Per Image (seconds) | mAP |
|:------------|:-----:|:---:|
|RCNN |47 |66.0 |
|Fast RCNN | ***0.32*** |66.9 |
| | Test Time Per Image with Proposals (seconds) | mAP |
|:------------|:-----:|:---:|
|RCNN |50 |66.0 |
|Fast RCNN |2 |***66.9*** |
|Faster RCNN |***0.2*** |***66.9*** |
# ICLR2023-2nd Round Rebuttal
## Reviewer xddt
Dear reviewer xddt,
Thank you so much for your feedback on our rebuttal and we are delighted to discuss these remaining concerns.
---
>*For the “source combine” + “simple prompt learning” baseline: what are the implementation details, e.g., the design of prompts and how is the model trained?*
Sorry for not providing these details! We adopted the prompt design introduced in [1]. Compared with MPA, the difference is that the prompt now contains only class-specific tokens; in other words, here $M_1$ = 32, $M_2$ = 0, while in MPA $M_1$ = 16, $M_2$ = 16. For each target domain, we first randomly initialize one such prompt and train it on the source domains in a supervised manner using only the contrastive loss. The prompt is then expected to have learned representations of the class labels, and it is finally evaluated on the target domain via contrastive pairing.
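For concreteness, below is a minimal sketch of this baseline. The frozen CLIP towers are abstracted away (precomputed image features, a linear stub for the text encoder), and all names and hyperparameters other than $M_1$ = 32, $M_2$ = 0 are illustrative assumptions rather than our exact implementation.

```python
# Sketch of the "source combine" + "simple prompt learning" baseline described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, M1, token_dim, embed_dim = 65, 32, 512, 512   # M2 = 0: no domain-specific tokens

class SimplePrompt(nn.Module):
    """One learnable prompt per class, made of M1 class-specific context tokens."""
    def __init__(self):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_classes, M1, token_dim) * 0.02)

    def forward(self, text_encoder):
        # Pool the context tokens and map them to the joint embedding space.
        return text_encoder(self.ctx.mean(dim=1))          # (num_classes, embed_dim)

text_encoder = nn.Linear(token_dim, embed_dim).requires_grad_(False)  # stand-in for the frozen CLIP text tower
prompt = SimplePrompt()
optimizer = torch.optim.SGD(prompt.parameters(), lr=2e-3)

def contrastive_step(image_features, labels, temperature=0.01):
    """One supervised CLIP-style contrastive step on the combined source domains."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(prompt(text_encoder), dim=-1)
    logits = img @ txt.t() / temperature                   # (batch, num_classes)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random features standing in for frozen CLIP image features.
loss = contrastive_step(torch.randn(8, embed_dim), torch.randint(0, num_classes, (8,)))
```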
---
>*For "MFSAN+CLIP", how are the results on Office-Home, especially on Clipart as the CLIP baseline is much worse than MFSAN (as shown in Table 1)?*
As pointed out in the paper, we did our best to re-implement state-of-the-art methods using CLIP as the backbone network, yet most of the results are unsatisfactory. Unfortunately, MFSAN + CLIP seems to work only on the ImageCLEF dataset. We report its results on the Office-Home dataset in the table below:
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|CLIP |71.5 |50.5 |81.3 |82.4 | 71.4 |
|MFSAN |72.1 |62.0 |80.3 |81.8 | 74.1 |
|MFSAN+CLIP|58.9 |41.2 |72.0 |71.0 |60.8|
Clearly, MFSAN + CLIP fails on Office-Home, even though we tried every method and trick we could think of to make its performance more promising. We will also make the training code and logs publicly available.
---
>*For LST, how does auto-encoders affect results, as the authors mention that the purpose for auto-encoder is to help LST?*
Good point! To answer this question, we conducted additional experiments to test how the auto-encoder affects LST. Specifically, if no auto-encoder is used, we can only initialize a prompt directly in the full 512-dimensional space instead of initializing a vector in the learned embedding subspace. The remaining procedure is the same as LST with the auto-encoder: we train the prompt with pseudo-labels provided by CLIP using the contrastive loss. The results are shown in the following table.
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|LST without AE |73.5 |51.9 |84.1 |84.2 | 73.4 |
|LST with AE |72.9 |52.2 |84.9 |85.0 | 73.8 |
In addition to the performance gains suggested by these experiments, LST tunes far fewer parameters when the auto-encoder is used, which we believe is a strong merit. In this specific case, since $d_I=150$, the number of tunable parameters is reduced by a factor of $\frac{512}{150} \approx 3.4$.
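To make the parameter accounting concrete, the following sketch contrasts the two settings; the single linear decoder is only a stand-in for the AE decoder, not its exact architecture.

```python
# With the AE, LST tunes a d_I-dimensional latent code decoded by the frozen decoder;
# without it, the full 512-d prompt vector is tuned directly.
import torch
import torch.nn as nn

d_I, token_dim = 150, 512

# LST without AE: tune the prompt directly in the full 512-d token space.
prompt_full = nn.Parameter(torch.randn(token_dim))

# LST with AE: tune only the latent code; the frozen decoder maps it back.
decoder = nn.Linear(d_I, token_dim).requires_grad_(False)   # stand-in for the frozen AE decoder
latent = nn.Parameter(torch.randn(d_I))
prompt_reconstructed = decoder(latent)

print(prompt_full.numel(), latent.numel())   # 512 vs 150 tunable values: a 512/150 ≈ 3.4x reduction
```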
[1] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. In IJCV, 2022
# ICLR2023-Rebuttal
## Reviewer iJht
We thank reviewer iJht for the positive comments and for providing detailed and thoughtful feedback on our work. We address all of the reviewer's concerns with additional details below:
> *What is the difference between stage one and DAPL? Stage one (learning individual prompts) seems to be similar to DAPL.*
Good question! The main difference is that stage one of MPA uses ***static*** pseudo-labels generated by zero-shot CLIP while DAPL uses dynamically changing pseudo-labels. Consequently, stage one of MPA converges much faster than DAPL.
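For illustration, here is a minimal sketch of how such static pseudo-labels can be produced once with zero-shot CLIP before stage one. It uses the openai-clip package; the class names, prompt template and threshold handling are placeholders, not the exact choices in our implementation.

```python
# Static pseudo-labelling: run zero-shot CLIP once and keep the labels fixed thereafter.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

model, preprocess = clip.load("ViT-B/16", device="cpu")
classnames = ["alarm clock", "backpack", "bed"]  # placeholder subset of class names
text_tokens = clip.tokenize([f"a photo of a {c}" for c in classnames])

@torch.no_grad()
def static_pseudo_labels(images, tau=0.4):
    """images: a batch already transformed by `preprocess`.
    Returns fixed labels plus a mask of predictions with confidence >= tau."""
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    return labels, conf >= tau
```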
> *The effectiveness of L_CLS loss in stage 2. Table.5 (a) shows the effectiveness of objective function Equation 9. However, the results of L_CLS loss are missed.*
Thank you for pointing this out. We have conducted the experiment and reported the results in the table below:
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|Without $\mathcal{L}_{CLS}$ |72.2 |48.4 |84.9 |83.8 | 72.3 |
|With $\mathcal{L}_{CLS}$|74.8 |54.9 |86.2 |85.7 |75.4|
The reason we did not include this in Table 5(a) is that our method depends on the generated pseudo-labels, which become meaningless without $\mathcal{L}_{CLS}$. The results above show that performance indeed drops by a large margin when $\mathcal{L}_{CLS}$ is excluded from the objective function (especially on the Clipart domain).
> *What does “Zero” mean in Table.5 (b)? “Zero auto-encoders means we completely discarded the auto-encoder structure” is not clear to me. If the autoencoder structure is discarded, what structure is adopted?*
“Zero” means that we directly align the learned prompts without using the auto-encoder. This experiment tests whether the auto-encoder benefits the alignment process, and the results show that there are some gains. Beyond these performance gains, the main purpose of the auto-encoder is to derive the shared embedding space used for generalizing to unseen domains.
> *The similarity between the reconstructed prompts. In Table.4, the reconstructed prompts of different domains achieve almost the same results on the target domain. Does this mean that these reconstructed prompts are the same?*
Indeed, the reconstructed prompts achieve almost the same results on the target domain, and we believe this is due to the success of our alignment strategy. For example, with the $\mathcal{L}_1$ loss, all prompts are constrained to produce similar logits for the same target-domain image. To verify this, we conducted an additional experiment in which the $\mathcal{L}_1$ loss is excluded from the objective function and re-evaluated the reconstructed prompts. The results are reported in the table below:
||A→C|A→P|A→R|C→A|C→P|C→R|P→A|P→C|P→R|R→A|R→C|R→P|Avg|
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Without $\mathcal{L}_1$ loss| 53.4|85.4 |84.2 |73.5 |82.6 |84.6 |74.1 |52.6 |83.5 |74.3 |52.9 |83.9 |73.8 |
|With $\mathcal{L}_1$ loss| 54.1|85.9 |85.2 |74.3 |86.0 |85.3 |74.6 |54.1 |85.3 |74.3 |54.2 |86.0 |74.9 |
The results clearly show that without the $\mathcal{L}_1$ loss, the reconstructed prompts achieve different results on the target domain.
## Reviewer Zi5U
We thank reviewer Zi5U for the positive comments and for providing detailed and thoughtful feedback on our work. We address all of the reviewer's concerns with additional details below:
> *The idea of prompt learning for domain adaptation is not new, and the prompt design in this paper is highly based on the method proposed by Ge et al., 2022. So the main contribution is to extend the prompt learning for domain adaptation from the single source domain to multiple source domains. The novelty is limited.*
We agree that the prompt design is based on the method of Ge et al. Nevertheless, we respectfully disagree that the novelty is limited. We believe that novelty is not only about inventing new techniques, but also about pushing simple, existing ideas to new problems and making them work. The focus of Ge et al.'s work is single-source domain adaptation. While it sounds straightforward to directly extend it to the multi-source setting, simply doing so produces limited results. As shown in Tables 1 and 2, applying their method in the source-combined scenario yields 86.0%, 72.8% and 52.0% on the ImageCLEF, Office-Home and DomainNet datasets, which is on average **3.4%** lower than ours. This suggests that the auto-encoder structure and the design of our loss function are important, and we are the first to make this attempt, which shows the significance and novelty of our work. We believe our findings are insightful for future research.
---
> *The learning objective is changing in the two stages. Is it possible to combine the two stages into a single one?*
Great point! Combining the two stages was actually one of our early attempts at extending single-source prompt learning methods to the multi-source setting. However, experiments showed that directly combining them makes the training process unstable, so we decided to split the process into two stages: learning individual prompts in the first stage and aligning them in the second. In fact, we believe this is further strong evidence that it is difficult to directly apply prompt learning methods to the multi-source scenario. We will pursue this direction in future work.
---
> *Why is the auto-encoder necessary?*
Beyond offering performance gains, the auto-encoder is what makes it possible to derive the shared embedding space for generalizing to unseen domains. To clarify further, the embedding learned by the auto-encoder is expected to encode domain-invariant knowledge, and by traversing it we can adapt to domains that were not involved during the training stage (this is essentially LST in the paper). The process can be seen as a test-time adaptation strategy [1]. Additionally, we find that the auto-encoder structure also helps stabilize the training process.
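As a rough illustration (not our exact architecture), a prompt auto-encoder of this kind can be sketched as follows; the latent dimension and the single linear layers are assumptions for brevity.

```python
# Each domain's stage-one prompt is encoded into a shared d_I-dimensional latent space
# and reconstructed from it; this latent space is what LST later traverses.
import torch
import torch.nn as nn
import torch.nn.functional as F

token_dim, d_I, num_domains = 512, 150, 3

class PromptAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(token_dim, d_I)
        self.decoder = nn.Linear(d_I, token_dim)

    def forward(self, prompts):
        z = self.encoder(prompts)            # shared low-dimensional embedding
        return self.decoder(z), z

ae = PromptAE()
source_prompts = torch.randn(num_domains, token_dim)     # flattened stage-one prompts (placeholder)
reconstructed, z = ae(source_prompts)
loss_ae = F.mse_loss(reconstructed, source_prompts)      # reconstruction term used in stage two
```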
---
References:
[1] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021.
## Reviewer xddt
We thank reviewer xddt for the positive comments and for providing detailed and thoughtful feedback on our work. We address all of the reviewer's concerns with additional details below:
> *Many results in Table 1 and 2 (Zero-Shot) are already better than other existing DA methods and it is not clear whether the performance gain is from CLIP.*
While the results in Table 2 indeed show that zero-shot CLIP is better than other existing DA methods, many results in Table 1 show the opposite. On the Office-Home dataset, CLIP performs worse than all but one method, and the discrepancy is further magnified on the ImageCLEF dataset. However, regardless of CLIP's performance, our proposed MPA consistently outperforms other methods, especially on the ImageCLEF dataset, where MPA has a **6.2%** gain over CLIP. Whether the performance gain comes from CLIP is discussed in the following question. Furthermore, we would like to point out that one of the major emphases of our work is making the training process easier and more efficient (on Office-Home, MPA needs only about 2% of the trainable parameters of SOTA methods), rather than simply achieving better performance.
---
> *The proposed training scheme cannot validate MPA’s DA ability but only show the transfer learning ability.*
Thank you for your question! As a matter of fact, reviewer iJo5 raised a similar concern and suggested a simple baseline of “source combine” + “simple prompt learning”. We followed this suggestion, and the results below show that while the proposed baseline is on average 1% better than zero-shot CLIP, MPA still outperforms it by a significant margin; we believe this gain can only come from MPA's DA ability.
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|CLIP |71.5 |50.2 |81.3 |82.4 |71.4|
|Simple Prompt|70.7 |52.9 |82.9 |83.9 |72.4|
|MPA |74.8 |54.9 |86.2 |85.7 |75.4|
Furthermore, we tested MFSAN, the second best method in Table 1, with CLIP’s pretrained backbone on the ImageCLEF dataset and the results are reported in the table below. With the initializations from CLIP, the performance actually dropped by a small margin, demonstrating that a better backbone does not necessarily lead to a better performance.
| | C | I | P | Avg |
|:----------|:---:|:---:|:---:|:-----:|
|CLIP |95.1 |87.3 |74.0 |85.5 |
|MFSAN |95.4 |93.6 |79.1 |89.4 |
|MFSAN+CLIP |96.7 |93.0 |77.7 |89.1 |
|MPA |98.6 |96.2 |80.4 |91.7 |
Hopefully, these added experiments can justify that MPA’s performance gain is not from CLIP and validate our method's DA ability. We will include them in the revised version of our paper.
---
>*Extending the prompt design to multi-source is straightforward and the loss functions are also not new.*
While extending the ***prompt design*** to the multi-source scenario is easy, we would like to emphasize that extending the ***method*** is not. As is shown in Table 1 and 2, directly applying DAPL in the source combined scenario produces limited results.
As for the loss functions, we believe that each of them is crucial to the success of our algorithm. The contrastive loss and the AE loss enable the learning of the soft prompts and the auto-encoder, and the $\mathcal{L}_1$ loss yields a better alignment, as discussed in the ablation study (Table 5(a)). While these losses are not new, we believe it is also important to push simple, existing ideas to new problems and make them work, rather than only trying to invent new techniques. For example, before the advent of MAE [1], the idea of masked self-supervised training was already prevalent in NLP. Nevertheless, by successfully applying it to image recognition tasks, MAE is undeniably an influential work in the computer vision community.
---
>*For multi-prompt alignment, it's not clear how the autoencoder is needed to achieve alignment between prompts. In experimental results (Table 5 (a)), performance gain is also marginal.*
Indeed, the gain is marginal. However, the major purpose of the auto-encoder is not performance gain but deriving the embedding space for generalizing to unseen domains. To clarify further, the embedding learned by the auto-encoder is expected to encode domain-invariant knowledge, and by traversing it we can adapt to domains that were not involved during the training stage (this is essentially LST in the paper). The process can be seen as a test-time adaptation strategy [2]. Additionally, we find that the auto-encoder structure also helps stabilize the training process, which further implies that extending existing single-source methods to the multi-source scenario is not easy.
---
>*Performance of using prompt learning would highly depend on the CLIP model (Qdr results in Table 2).*
We agree that the performance of MPA is somewhat related to the CLIP model. However, MPA clearly outperforms CLIP because it is able to learn additional domain-invariant knowledge. For example, on domain I of ImageCLEF, Product of Office-Home and Painting of DomainNet, MPA exceeds CLIP by large margins of 8.9%, 4.9% and 5.9%, respectively. In fact, CLIP's overall performance on the ImageCLEF and Office-Home datasets is limited, as suggested by Table 1; nevertheless, MPA still yields promising results regardless of how CLIP behaves.
As for the Quickdraw domain, we have discussed in the paper that MPA's mediocre performance is most likely due to the large domain gap between Quickdraw and the other domains. A similar trend is also observed in [3].
---
>*For eq(8), it's not clear which prompts are aligned, e.g., between P_i and P_j*
We align ***all*** pairs of prompts: for each target-domain image, we penalize the absolute difference between the probabilistic outputs of every pair of prompts, since they are all expected to classify the same image into the same category.
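A hedged sketch of this pairwise alignment term is given below; the tensor shapes and the averaging over pairs are our assumptions for illustration, not necessarily the exact form of Eq. (8).

```python
# Pairwise L1 alignment of the prompts' class-probability outputs on target images.
import torch
from itertools import combinations

def pairwise_l1_alignment(probs):
    """probs: (num_prompts, batch, num_classes) softmax outputs on the same target images."""
    num_prompts = probs.shape[0]
    loss = probs.new_zeros(())
    for i, j in combinations(range(num_prompts), 2):
        loss = loss + (probs[i] - probs[j]).abs().mean()
    num_pairs = num_prompts * (num_prompts - 1) // 2
    return loss / max(1, num_pairs)

# Toy usage: 3 prompts, a batch of 8 target images, 65 classes.
loss_l1 = pairwise_l1_alignment(torch.softmax(torch.randn(3, 8, 65), dim=-1))
```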
---
>*d_I appears many times in the paper but the definitions are not clear, e.g., eq(6), dimensions in v_tune and d_tune.*
Sorry for the confusion. The $\mathbf{d}_I$ in eq(6) is in bold and denotes a randomly initialized vector in the embedding space, while the other occurrences are scalars denoting its dimension. This will be clarified in the revised version.
---
>*Prompts are designed with class-specific and domain-specific ones. However, there are experiments to validate whether it is necessary.*
Thank you for pointing this out! We have conducted the experiment for validating the prompt design and the results are presented in the table below:
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|Only class-specific tokens |74.2 |53.0 |85.2 |85.3 |74.4|
|Class-specific with domain-specific tokens (reported)|74.8 |54.9 |86.2 |85.7 |75.4|
The results show that the incorporation of domain-specific tokens is in fact crucial to MPA's performance. For a fair comparison, the number of class-specific tokens was raised to 32 (MPA uses 16 class-specific and 16 domain-specific tokens). We will add these results to the supplementary material in the revised version.
---
>*Lots of sensitivity experiments are not provided, e.g., length of prompts, thresholds for pseudo-labels, /alpha in eq(9)*
Sorry for not providing experiments on hyperparameter selection! We have now conducted them and report the results below; we will also add these to the supplementary material in the revised version.
For prompt length, we examined three different choices of $M_1$ and $M_2$ (for simplicity we are setting them to be equal):
- $M_1$ = $M_2$ = 8;
- $M_1$ = $M_2$ = 12;
- $M_1$ = $M_2$ = 20;
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|$M_1$ = $M_2$ = 8 |74.3 |54.9 |85.8 |85.3 |75.1|
|$M_1$ = $M_2$ = 12|74.4 |54.8 |86.1 |85.5 |75.2|
|$M_1$ = $M_2$ = 20|74.8 |55.2 |86.3 |86.0 |75.6|
|$M_1$ = $M_2$ = 16 (reported)|74.8 |54.9 |86.2 |85.7 |75.4|
The general trend here is that the longer the prompt, the better the performance, which is not that surprising. We chose 16 mainly to balance the trade-off between performance and efficiency.
For pseudo-label threshold, we also examined three different choices:
- $\tau$ = 0.3;
- $\tau$ = 0.6;
- $\tau$ = 0.8;
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|$\tau$ = 0.3 |74.5 |55.0 |86.1 |85.9 |75.4|
|$\tau$ = 0.6|74.6 |54.9 |85.9 |85.3 |75.2|
|$\tau$ = 0.8|74.0 |54.2 |85.2 |85.5 |74.7|
|$\tau$ = 0.4 (reported)|74.8 |54.9 |86.2 |85.7 |75.4|
As $\tau$ increases, the quality of the pseudo-labels gets higher, but fewer images are fed into the model, which hurts the overall performance. Consequently, 0.4 is a balanced choice, as the results suggest.
For $\alpha$ in eq(9), we examined four different choices:
- $\alpha$ = 1;
- $\alpha$ = 10;
- $\alpha$ = 100;
- $\alpha$ = 1000;
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|$\alpha$ = 1 |74.4 |53.7 |84.9 |85.6 |74.7|
|$\alpha$ = 10|74.5 |54.1 |85.7 |86.0 |75.1|
|$\alpha$ = 100|74.7 |54.5 |85.5 |85.6 |75.1|
|$\alpha$ = 1000|74.4 |55.0 |86.3 |86.0 |75.4|
|$\alpha$ = 500 (reported)|74.8 |54.9 |86.2 |85.7 |75.4|
The main reason we chose 500 for $\alpha$ is to bring all losses (in this case $\mathcal{L}_1$) to the same order of magnitude. Our experimental results also support this choice.
---
>*The explanation of using LST for experiments is not clear (paragraph above Table 3)*
Sorry for the confusion. LST can be seen as a test-time adaptation strategy that traverses the learned shared embedding space. The experiments simulate the test-time adaptation scenario: on the Office-Home dataset, for example, the embedding space is derived from the Art, Clipart and Product domains, and LST is then conducted on the Real World domain.
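For illustration, the sketch below mimics this simulation with stand-in modules: only a latent code in the learned subspace is optimized for the held-out domain using pseudo-labels, while the decoder (and, in practice, CLIP) stays frozen. The additive conditioning on the decoded prompt is a toy assumption, not our actual matching procedure.

```python
# Test-time adaptation on the held-out domain by tuning only a latent code.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_I, token_dim, num_classes = 150, 512, 65
decoder = nn.Linear(d_I, token_dim).requires_grad_(False)       # frozen AE decoder (stand-in)
head = nn.Linear(token_dim, num_classes).requires_grad_(False)  # stand-in for CLIP-based matching

latent = nn.Parameter(torch.randn(d_I))                         # the only tunable tensor
opt = torch.optim.SGD([latent], lr=5e-3)

def lst_step(target_features, pseudo_labels):
    """One adaptation step on the held-out (e.g. Real World) domain."""
    prompt = decoder(latent)                                     # traverse the learned subspace
    logits = head(target_features + prompt)                      # toy conditioning on the prompt
    loss = F.cross_entropy(logits, pseudo_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random target features and pseudo-labels.
loss = lst_step(torch.randn(8, token_dim), torch.randint(0, num_classes, (8,)))
```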
___
References:
[1] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[2] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021.
[3] Yunsheng Li, Lu Yuan, Yinpeng Chen, Pei Wang, and Nuno Vasconcelos. Dynamic transfer for multi-source domain adaptation. In CVPR, 2021.
## Reviewer rNJs
We thank reviewer rNJs for the positive comments and for providing detailed and thoughtful feedback on our work. We address all of the reviewer's concerns below:
> *While the paper includes ablation studies on other methods with image encoder, it is not clear (unfair comparison) if this enough to benchmark the proposed methods against the previous method.*
Great question! As a matter of fact, reviewer iJo5 raised a similar concern and suggested a simple baseline of “source combine” + “simple prompt learning”. We followed this suggestion, and the results below show that while the baseline is on average 1% better than zero-shot CLIP, MPA still outperforms it by a significant margin.
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|CLIP |71.5 |50.2 |81.3 |82.4 |71.4|
|Simple Prompt|70.7 |52.9 |82.9 |83.9 |72.4|
|MPA |74.8 |54.9 |86.2 |85.7 |75.4|
Furthermore, for a better comparison, we tested MFSAN, the second best method in Table 1, with CLIP’s pretrained backbone on the ImageCLEF dataset and the results are reported in the table below. With the initializations from CLIP, the performance actually dropped by a small margin, demonstrating that a better backbone does not necessarily lead to a better performance.
| | C | I | P | Avg |
|:----------|:---:|:---:|:---:|:-----:|
|CLIP |95.1 |87.3 |74.0 |85.5 |
|MFSAN |95.4 |93.6 |79.1 |89.4 |
|MFSAN+CLIP |96.7 |93.0 |77.7 |89.1 |
|MPA |98.6 |96.2 |80.4 |91.7 |
Hopefully the above added experiments can provide a better understanding of our method.
> *It is not clear how to go about deciding the hyperparams $M_1$ and $M_2$. 16 seems a little too big for the proposed method.*
To answer your concern, we examined three different choices of $M_1$ and $M_2$ (for simplicity we are setting them to be equal):
- $M_1$ = $M_2$ = 8;
- $M_1$ = $M_2$ = 12;
- $M_1$ = $M_2$ = 20;
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|$M_1$ = $M_2$ = 8 |74.3 |54.9 |85.8 |85.3 |75.1|
|$M_1$ = $M_2$ = 12|74.4 |54.8 |86.1 |85.5 |75.2|
|$M_1$ = $M_2$ = 20|74.8 |55.2 |86.3 |86.0 |75.6|
|$M_1$ = $M_2$ = 16 (reported)|74.8 |54.9 |86.2 |85.7 |75.4|
The general trend here is that the longer the prompt, the better the performance, which is not that surprising. We chose 16 mainly to balance the trade-off between performance and efficiency.
## Reviewer iJo5
We thank reviewer iJo5 for the positive comments and for providing detailed and thoughtful feedback on our work. We address all of the reviewer's concerns below:
> *It seems the threshold for pseudo-label selection is a very important hyper-parameter; the author should discuss how the value affects the performance.*
Indeed, the threshold is an important hyper-parameter, and we apologize for not providing enough analysis. We conducted additional experiments to address this issue. Here we present results for three additional choices of $\tau$:
- $\tau$ = 0.3;
- $\tau$ = 0.6;
- $\tau$ = 0.8;
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|$\tau$ = 0.3 |74.5 |55.0 |86.1 |85.9 |75.4|
|$\tau$ = 0.6|74.6 |54.9 |85.9 |85.3 |75.2|
|$\tau$ = 0.8|74.0 |54.2 |85.2 |85.5 |74.7|
|$\tau$ = 0.4 (reported)|74.8 |54.9 |86.2 |85.7 |75.4|
As the results suggest, as $\tau$ increases, the quality of the pseudo-labels gets higher, but fewer images are fed into the model, which hurts the overall performance. Consequently, 0.4 is a balanced choice.
We will be adding these to the supplemental material of the revised version.
---
> *How to justify the performance gain is not from a more powerful pre-trained model when compared with other multi-source UDA approaches? A simple baseline can be "source combine" + "simple prompt learning".*
Thank you for your question and suggestion! We conducted experiments on the suggested “source combine” + “simple prompt learning” baseline, adopting the prompting method from [1]. The results in the table below show that while the baseline is on average 1% better than zero-shot CLIP, MPA still outperforms it by a significant margin. We therefore believe that the performance gain comes not from the powerful pre-trained CLIP but from the domain adaptation ability of our method.
| | Art | Clipart | Product | Real World | Avg |
|:------------|:-----:|:---:|:---:|:-----:|--- |
|CLIP |71.5 |50.2 |81.3 |82.4 |71.4|
|Source Combined + Simple Prompt Learning|70.7 |52.9 |82.9 |83.9 |72.4|
|MPA |74.8 |54.9 |86.2 |85.7 |75.4|
To further answer the question, we also tested MFSAN, the second best method in Table 1, with CLIP’s pretrained backbone on the ImageCLEF dataset and the results are reported in the table below. With CLIP’s backbone, the performance actually dropped by a small margin, demonstrating that a better backbone does not necessarily lead to a better performance. We hope that this can further justify that MPA’s performance gain is not from CLIP. These experiments will be added in the revised version of our paper.
| | C | I | P | Avg |
|:----------|:---:|:---:|:---:|:-----:|
|CLIP |95.1 |87.3 |74.0 |85.5 |
|MFSAN |95.4 |93.6 |79.1 |89.4 |
|MFSAN+CLIP |96.7 |93.0 |77.7 |89.1 |
|MPA |98.6 |96.2 |80.4 |91.7 |
Finally, we would like to point out that one of the major emphases of our work is making the training process easier and more efficient (on Office-Home, MPA needs only about 2% of the trainable parameters of SOTA methods), rather than simply achieving better performance.
---
> *Can MPA solve domain generalization problems? Pseudo labels are required in Eq.(10) which means the "unseen" domain is actually seen during training, please clarify this.*
Great question! We actually tried to find a universal prompt in our derived subspace so that MPA could be extended to domain generalization problems. Unfortunately, we have not found a good solution yet, and this will be a future research direction. However, with the proposed LST, MPA can solve the test-time adaptation problem [2]. This is done by traversing the learned shared embedding, which is expected to be domain-invariant. Throughout this stage, only data from the domain of interest is needed (no source-domain data), and we do require pseudo-labels during this phase [3]. However, this domain is not involved in training the latent subspace, which is why we call it an “unseen domain”. We will clarify this in the revised version.
---
> *How it performs when compared with SOTA single source UDA approaches?*
Stage one of MPA can be considered a single-source UDA approach. We reported its performance in Table 4 and copy it below for reference.
||A→C|A→P|A→R|C→A|C→P|C→R|P→A|P→C|P→R|R→A|R→C|R→P|Avg|
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Stage One| 52.2|83.5 |82.1 |72.8 |83.6 |82.8 |73.3 |52.7 |82.4 |72.0 |51.6 |82.1 |72.6 |
---
References:
[1] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. In IJCV, 2022.
[2] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021.
[3] Sachin Goyal, Mingjie Sun, Aditi Raghunathan, and Zico Kolter. Test-time adaptation via conjugate pseudo-labels. arXiv preprint arXiv:2207.09640, 2022.