# Rebuttal to Common Response from Reviewer G8hr
Thanks for your feedback. We sincerely appreciate your thorough review and the time you took to consider our responses.
1. On the reliance on the class prior: As mentioned in the response to Reviewer nvqz, the label distribution estimation method is an orthogonal component of our method; many label distribution estimation algorithms exist (e.g., [1-5]), and any available method can be plugged into OTTER. For example, we used BBSE [2] as the label distribution estimation method in the linear probing setting.
2. On the inference-time cost: OT was applied to the whole inference set by accumulating batches of zero-shot model outputs. As shown in the time consumption table, the additional time consumed by OT is marginal compared to the embedding time, yielding a throughput similar to that of zero-shot models.
Additionally, we clarify that Figure 3 does **not** show that "a larger number of samples per class significantly enhances performance". Figure 3 shows the sensitivity of BBSE's label distribution estimation to the size of the labeled linear probing data, and consequently how OTTER's performance depends on the quality of BBSE's estimate. As shown in [2], BBSE yields better label distribution estimates with a larger number of samples, and Figure 3 shows that OTTER's accuracy improves when we have a better label distribution estimate.
3. On contributions: We appreciate the suggested references and plan to add a detailed discussion to the updated manuscript. We would like to highlight the differences from these papers. OTTER is proposed as an inference-time enhancement method for zero-shot models; while we also use optimal transport, we study it as an inference-time adaptation and investigate which factors affect its success and failure, with theoretical insights. Inference-time adaptation is a clearly different setup from semi-supervised learning [G8hr R1], universal domain adaptation [G8hr R3], partial-label learning [G8hr R4], and imbalanced clustering [G8hr R6]. While [nvqz R1] tries to improve zero-shot models in a similar setup, OTTER offers a lighter way to improve zero-shot classification, with theoretical guarantees.
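To make the plug-in nature of the label distribution estimator concrete, below is a minimal sketch of a BBSE-style estimator (illustrative only; function and variable names are ours, not from our implementation). It inverts the classifier's confusion matrix on held-out labeled data to recover the target label distribution from predictions on the unlabeled inference set:

```python
import numpy as np

def bbse_label_dist(val_preds, val_labels, target_preds, k):
    """BBSE-style estimate of the target label distribution.

    val_preds, val_labels: hard predictions and true labels on held-out source data.
    target_preds: hard predictions on the unlabeled target (inference) set.
    k: number of classes.
    """
    # Joint confusion matrix C[i, j] = P_source(yhat = i, y = j)
    C = np.zeros((k, k))
    for yhat, y in zip(val_preds, val_labels):
        C[yhat, y] += 1
    C /= len(val_labels)
    # Distribution of hard predictions on the target set
    mu = np.bincount(target_preds, minlength=k) / len(target_preds)
    # Solve C w = mu for the importance weights w = q(y) / p(y)
    w = np.linalg.lstsq(C, mu, rcond=None)[0]
    w = np.clip(w, 0, None)  # truncate negative estimates to zero
    p = C.sum(axis=0)        # source label marginal p(y)
    q = w * p                # estimated target label distribution q(y)
    return q / q.sum()
```

Any such estimate can then be passed to OTTER as the label distribution specification; the estimator itself is interchangeable.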
[1] Saerens, Marco, Patrice Latinne, and Christine Decaestecker. "Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure." Neural computation 14.1 (2002): 21-41.
[2] Lipton, Zachary, Yu-Xiang Wang, and Alexander Smola. "Detecting and correcting for label shift with black box predictors." International conference on machine learning. PMLR, 2018.
[3] Azizzadenesheli, Kamyar, et al. "Regularized Learning for Domain Adaptation under Label Shifts." International Conference on Learning Representations. 2018.
[4] Alexandari, Amr, Anshul Kundaje, and Avanti Shrikumar. "Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation." International Conference on Machine Learning. PMLR, 2020.
[5] Garg, Saurabh, et al. "A unified view of label shift estimation." Advances in Neural Information Processing Systems 33 (2020): 3290-3300.
[G8hr R1] Tai K S, Bailis P D, Valiant G. Sinkhorn label allocation: Semi-supervised classification via annealed self-training[C]//International conference on machine learning. PMLR, 2021: 10065-10075.
[G8hr R3] Chang W, Shi Y, Tuan H, et al. Unified optimal transport framework for universal domain adaptation[J]. Advances in Neural Information Processing Systems, 2022, 35: 29512-29524.
[G8hr R4] Wang H, Xia M, Li Y, et al. Solar: Sinkhorn label refinery for imbalanced partial-label learning[J]. Advances in neural information processing systems, 2022, 35: 8104-8117.
[G8hr R6] Zhang C, Ren H, He X. P$^2$OT: Progressive Partial Optimal Transport for Deep Imbalanced Clustering[J]. arXiv preprint arXiv:2401.09266, 2024.
[nvqz R1] Kahana, Jonathan, Niv Cohen, and Yedid Hoshen. "Improving zero-shot models with label distribution priors." arXiv preprint arXiv:2212.00784 (2022).
# Response to Reviewer G8hr
Thank you for the detailed feedback. We understand that there may be disagreement over the interpretation of prior works. However, we have already clarified our contributions above, and these disagreements over interpretation do not affect those contributions.
# Response to Reviewer nvqz
Thank you for the response.
* On unaddressed previous research: We agree these previous works are relevant; we have provided a detailed discussion explaining how our contributions are distinct from theirs, and we plan to add this discussion to the updated manuscript. As you mentioned, OTTER offers an efficient way to adapt to a new label distribution with theoretical guarantees.
* Few-shot experiments: We include the experimental results in the table below. As we mentioned, BBSE's label distribution estimate is worse than that implicit in CoOp's predictions, yielding worse accuracy for OTTER. However, when the true label distribution is given, OTTER yields an additional improvement over CoOp. As we mentioned, OTTER helps as long as the label distribution estimate is better than the label distribution of the raw predictions. A better few-shot learning method may induce a better implicit label distribution; even so, OTTER can still provide additional improvement whenever the estimated label distribution is better than that of the naive predictions.
| Dataset | CoOp | CoOp + OTTER (BBSE) | CoOp + OTTER (True) | TV($\nu^{CoOp}, \nu^{true}$) | TV($\nu^{BBSE}, \nu^{true}$) |
| --------------- | ----- | ------------------- | ------------------- | ------- | ------- |
| Caltech101 | 95.62 | 91.64 | 96.84 | 0.02 | 0.07 |
| DTD | 70.63 | 64.78 | 70.69 | 0.06 | 0.17 |
| EuroSAT | 85.42 | 77.93 | 88.26 | 0.06 | 0.15 |
| Food101 | 84.41 | 80.29 | 84.90 | 0.03 | 0.10 |
| Flowers102 | 96.79 | 93.02 | 97.52 | 0.02 | 0.05 |
| Oxford-IIIT-Pet | 92.29 | 86.29 | 93.10 | 0.04 | 0.09 |
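For reference, the TV columns in the table above report the total variation distance between the estimated and true label distributions; a minimal sketch of the computation:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two label distributions: 0.5 * L1 norm."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
```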
---------
# Common Response
We thank all of the reviewers for their kind comments and feedback. Reviewers recognized the strengths of our paper, which we briefly reiterate before we dive into in-depth responses.
* The proposed method is **simple but provides significant enhancement**. (Reviewers aD18, G8hr, nvqz)
* The paper offers *theoretical results* showing that (1) under mild conditions, the proposed method recovers the Bayes optimal classifier for the inference dataset, and (2) the final error can be understood through a decomposition into label estimation error and calibration error. (Reviewers G8hr, nvqz)
We address three common questions before proceeding to individual responses.
* **On needing the ground truth label distribution** (Reviewers aD18, G8hr): While the true label distribution enables the maximum improvement when using the proposed method, **it is not necessary**. Indeed, our algorithm can improve zero-shot classification with even a slightly better label distribution estimate than the one implicitly used in zero-shot prediction. To illustrate this claim, we interpolate the label distribution of zero-shot prediction $\nu^{zs}$ and the true label distribution $\nu^{true}$ such that $$\hat{\nu}_\alpha=(1-\alpha)\nu^{zs} + \alpha\nu^{true},$$ where $0 \leq \alpha \leq 1$. We use $\hat{\nu}_\alpha$ as the label distribution specification for OTTER and provide a graph (https://anonymous.4open.science/r/icml-otter-rebuttal-FD28/class_balance_interpolation.pdf) that illustrates the resulting accuracy changes. As expected, **as long as the specified label distribution is closer to the true distribution than the zero-shot one, our technique shows a performance improvement in all cases**.
* **On distinctions from previous works** (Reviewers G8hr, nvqz): **The key difference** from previous works using optimal transport to match label distributions is our method's primary focus on **inference-time adaptation, without any fine-tuning**. Prior works mainly focus on fine-tuning of zero-shot models to modify the class prior. **Our goal is to avoid fine-tuning, as it reduces the attractiveness of zero-shot models and requires additional hyperparameter tuning**. There are *additional differences*: to understand the potential and limitations of inference-time adaptation with OT, we obtained conditions under which our technique is successful (Section 4). As far as we are aware, this analysis is unique.
* **On online prediction**: While there is a requirement for batched unlabeled datasets, this is not overly restrictive. Any streaming predictions can be transformed into batched predictions by accumulating examples, and the computational cost for optimal transport remains minimal, given the availability of powerful solvers for linear programming. The table below presents the time consumption (in seconds) for the inference step in Table 1 from the paper, with pre-computed embeddings. The Time reduction column reports the time reduction rate of OTTER compared to PM. Measurements were taken on a machine equipped with an Intel® Core™ i7-11700K @ 3.60GHz processor, 64GB RAM, and an NVIDIA RTX 4090 GPU. For most cases (n < 30000), our method takes less than 1 second, while the prior matching baseline takes more than 3 seconds. It is worth noting that the time consumed computing embeddings is far more substantial; even 10 seconds is negligible compared to the embedding time (ranging from 5 to 30 minutes per evaluation set), which is common to all inference conditions.
| Dataset | n | ZS | PM | OTTER | Time reduction (%) |
| --------------- | ----- | ------ | ------- | ------- | ------------------ |
| CIFAR10 | 10000 | 0.0381 | 3.7127 | 0.0733 | 98.03 |
| CIFAR100 | 10000 | 0.0462 | 3.6296 | 0.1947 | 94.64 |
| Caltech101 | 7162 | 0.0298 | 3.6445 | 0.1188 | 96.74 |
| Caltech256 | 22897 | 0.2111 | 3.9597 | 0.8568 | 78.36 |
| Food101 | 25250 | 0.1186 | 3.6968 | 0.4969 | 86.56 |
| STL10 | 8000 | 0.0304 | 3.4877 | 0.0546 | 98.43 |
| SUN397 | 87004 | 1.1233 | 33.0386 | 10.5316 | 68.12 |
| Flowers102 | 6149 | 0.0280 | 3.7216 | 0.0959 | 97.42 |
| EUROSAT | 22000 | 0.0826 | 3.6655 | 0.3491 | 90.48 |
| OXFORDIIITPET | 3669 | 0.0137 | 3.3901 | 0.0233 | 99.31 |
| STANFORDCARS | 8041 | 0.0413 | 3.4910 | 0.1964 | 94.37 |
| Country211 | 21100 | 0.1285 | 3.7665 | 1.0537 | 72.02 |
| DTD | 1880 | 0.0070 | 3.4603 | 0.0156 | 99.55 |
| cub | 5794 | 0.0306 | 3.5583 | 0.1410 | 96.04 |
| imagenet | 40000 | 0.9954 | 37.6932 | 8.1003 | 78.51 |
| imagenet-r | 26000 | 0.1921 | 3.8331 | 0.9834 | 74.35 |
| imagenet-sketch | 40889 | 1.0189 | 38.4853 | 9.0579 | 76.46 |
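To make the batched adaptation step concrete, here is a minimal sketch (illustrative, not our implementation) that rebalances a batch of zero-shot log-probabilities toward a specified label distribution by solving the transport linear program directly with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def otter_predict(log_probs, label_dist):
    """Rebalance zero-shot predictions to match a target label distribution.

    log_probs: (n, k) zero-shot log-probabilities for n samples, k classes.
    label_dist: (k,) target label distribution (sums to 1).
    """
    n, k = log_probs.shape
    cost = (-log_probs).ravel()  # transport cost = negative log-probability
    # Row constraints: each sample distributes total mass 1/n over classes.
    A_rows = np.zeros((n, n * k))
    for i in range(n):
        A_rows[i, i * k:(i + 1) * k] = 1.0
    # Column constraints: class j receives total mass label_dist[j].
    A_cols = np.zeros((k, n * k))
    for j in range(k):
        A_cols[j, j::k] = 1.0
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([np.full(n, 1.0 / n), label_dist])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    plan = res.x.reshape(n, k)
    # Assign each sample to the class receiving most of its mass.
    return plan.argmax(axis=1)
```

Passing an interpolated specification such as $(1-\alpha)\nu^{zs} + \alpha\nu^{true}$ as `label_dist` reproduces the interpolation experiment described above. (Dedicated OT solvers, e.g. in the POT library, are substantially faster than this generic LP formulation; the sketch only illustrates the mechanics.)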
# Reviewer 1 (Reviewer aD18)
We are grateful to the reviewer for noting the strengths of our work and providing useful questions and comments.
* **On the label distribution**: As mentioned in the common response, **access to the ground truth label distribution is not necessary**. Our algorithm can improve zero-shot classification in any setting where we can estimate the label distribution better than the estimate implicit in zero-shot prediction. Linear probing experiments (Section 5.2.2) show that our method provides **additional accuracy improvements**, even with an imprecise label distribution specification due to the limited number of samples. Note that the label distribution in linear probing datasets is typically uniform, so it may differ from the label distribution of the evaluation set; the proposed combination offers additional adaptation for the unbalanced evaluation set.
* **On complete results for Table 3**: Appendix D.4, Table 12 includes the full results for Table 3. Surprisingly, adapting zero-shot models with our method **yields far better accuracy** than linear probing in the language zero-shot classification experiments.
* **On the streaming prediction**: Indeed, the proposed method requires a prediction set, but it is not a prohibitively expensive requirement, as we described in the common response.
# Reviewer 2 (Reviewer G8hr)
We appreciate the reviewer for the thoughtful comments and references. We will add the suggested papers in our updated draft.
* **On our contribution**: As outlined in the common response, prior works that use label distribution information with optimal transport primarily **involve fine-tuning the model to adapt to new label distributions**. In contrast, our method offers inference-time adaptation *without the need for fine-tuning*. Our method presents a practical solution for addressing continuously evolving label distributions while keeping model weights unchanged. Moreover, to gain deeper insights into the potential and limitations of this approach, we provide a novel theoretical analysis. It characterizes the gains our technique produces as a function of label distribution specification and calibration---two key factors. Our synthetic experiments validate our theoretical results, and our real-world experiments demonstrate that OTTER significantly enhances zero-shot classification accuracy without requiring any fine-tuning procedure or hyperparameter tuning. **Below, we provide detailed comparisons with each suggested reference, explicitly noting differences:**
- [1] proposed a self-training algorithm (Sinkhorn Label Allocation, SLA) that requires a partially labeled dataset. In comparison, OTTER does not need instance-level supervision, only group-level information in the form of the label distribution. Moreover, SLA involves training within its iterative labeling algorithm, while OTTER is an inference-time adaptation method that *needs no additional training or fine-tuning*.
- [2] introduced Relative Entropic Optimal Transport (RE-OT) to bridge classification problems and regularized optimal transport problems. However, it is a training method, while OTTER is a training-free adaptation method.
- [3] solved universal domain adaptation with optimal transport tools, and [4] addressed imbalanced partial-label learning with a Sinkhorn-based label refinery. Both require labeled data and training, while OTTER is both source-free and training-free.
- [5] used OT to reweight training samples in imbalanced datasets by matching imbalanced training set distribution with a balanced distribution over samples. OT in this work is applied to learn sample weights instead of class allocation as in OTTER.
- [6] used OT to learn representations from deep imbalanced data. OT is used to map the imbalanced prediction to a balanced distribution.
* **On the computational cost at inference time**: While our inference-time adaptation does require additional computation, the overhead is light. The linear programming form of optimal transport can be solved in $\tilde{O}(nk\sqrt{n+k})$ time via minimum-cost flow [7], where $n$ is the number of data points and $k$ is the number of classes. In practice, we observed that our method produces modified predictions within 0.05 ms per sample.
* **On baselines**: Prior matching is the most suitable baseline for comparison, as it has identical requirements: specification of the label distribution and the inference dataset. Indeed, one of the main advantages of our method is its compatibility with other inference-time enhancement techniques. For instance, enhancing CLIP prompts with language models (e.g., [8]) has emerged as one of the state-of-the-art methods in zero-shot classification. By combining OTTER with their method (Classification by Description, CD), we achieve further improvement over their approach. (Note that we only used datasets for which their prompts are publicly available.)
| Dataset | ZS | PM | OT | ZS + CD | PM + CD | OT + CD |
| -------- | ------ | ------ | ------ | -------- | -------- | -------- |
| EUROSAT | 32.90% | 11.36% | 42.03% | 53.62% | 11.37% | 57.15% |
| PET | 83.84% | 23.11% | 88.83% | 87.95% | 16.33% | 91.01% |
| DTD | 39.04% | 8.83% | 44.41% | 42.87% | 14.73% | 43.24% |
| Cub | 45.98% | 10.34% | 50.40% | 55.51% | 11.49% | 58.47% |
| ImageNet | 60.18% | 12.42% | 62.86% | 66.46% | 14.08% | 68.05% |
References:
[1] Tai K S, Bailis P D, Valiant G. "Sinkhorn label allocation: Semi-supervised classification via annealed self-training." International conference on machine learning. PMLR, 2021.
[2] Shi L, Zhen H, Zhang G, et al. "Relative Entropic Optimal Transport: a (Prior-aware) Matching Perspective to (Unbalanced) Classification." Advances in Neural Information Processing Systems, 2024.
[3] Chang W, Shi Y, Tuan H, et al. "Unified optimal transport framework for universal domain adaptation." Advances in Neural Information Processing Systems, 2022.
[4] Wang H, Xia M, Li Y, et al. "Solar: Sinkhorn label refinery for imbalanced partial-label learning." Advances in neural information processing systems, 2022.
[5] Guo D, Li Z, Zhao H, et al. "Learning to re-weight examples with optimal transport for imbalanced classification." Advances in Neural Information Processing Systems, 2022.
[6] Zhang C, Ren H, He X. "P$^2$OT: Progressive Partial Optimal Transport for Deep Imbalanced Clustering." arXiv:2401.09266, 2024.
[7] Lee, Yin Tat, and Aaron Sidford. "Path finding methods for linear programming: Solving linear programs in $\tilde{O}(\sqrt{rank})$ iterations and faster algorithms for maximum flow." 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE, 2014.
[8] Menon, Sachit, and Carl Vondrick. "Visual Classification via Description from Large Language Models." The Eleventh International Conference on Learning Representations. 2022.
# Reviewer 3 (Reviewer nvqz)
We are grateful to the reviewer for noting the strengths of our work and providing useful comments.
* **On the comparison with [1]**: Thank you for suggesting this reference. [1] has in common the idea that optimal transport is used for the adaptation to new label distributions. However, **their method requires both fine-tuning and hyperparameter tuning**---which we seek to avoid in the context of zero-shot classification. An additional difference is that we provide a novel theoretical analysis to characterize the potential and limitations of our technique.
* **On the combination with other few-shot learning methods ([2], [3])**: We reproduced CoOp [3] in the 16-shot learning setting, estimated the label distribution with BBSE, and then applied OTTER with the estimated label distribution. We found that the label estimate from BBSE is more imprecise than the label distribution implicit in CoOp's predictions. As a result, unsurprisingly, OTTER does not improve CoOp's predictions. However, when given a better estimate of the label distribution, OTTER is still able to improve accuracy. We note that the label distribution estimation method is an orthogonal component of our method; any available method can be plugged into OTTER.
* **On the prompts used for each dataset**: We used "a photo of a [CLASS]" for all datasets. This reduces the influence of prompt tuning as a factor in performance. It may yield worse performance than the original CLIP paper, which used prompts designed for each dataset.
[1] Kahana, Jonathan, Niv Cohen, and Yedid Hoshen. "Improving zero-shot models with label distribution priors." arXiv preprint arXiv:2212.00784 (2022).
[2] Zhang, Renrui, et al. "Tip-adapter: Training-free clip-adapter for better vision-language modeling." arXiv preprint arXiv:2111.03930 (2021).
[3] Zhou, Kaiyang, et al. "Learning to prompt for vision-language models." International Journal of Computer Vision 130.9 (2022): 2337-2348.