# General Response

We are very grateful to all the reviewers who took the time to read our manuscript and provided quality feedback.

### G1) Additional experiments for this rebuttal

* Other Vision-Language Model (VLM) backbones
  * goal: Extend the scope of validation to demonstrate the applicability of the proposed method.
  * setup: We validate two more VLM backbones (CLIP-ResNet50 and CLIP-ViT-L/14) on image classification under natural distribution shifts: (ID) IN-1K, (OOD) IN-V2/R/A/Sketch, ObjectNet.
* Other calibration methods (training-time and post-hoc)
  * goal: Demonstrate the effectiveness of the proposed calibrated robust fine-tuning (CaRot) method against methods designed for improving calibration on ID and OOD.
  * setup: We implement three calibration methods, Mixup (Zhang et al., 2017), Mixup Inference in Training (MIT; Wang et al., 2023), and Temperature Scaling for OOD (TS-P; Tomani et al., 2021), and evaluate them on models trained on IN-1K image classification under natural distribution shifts.
* Advanced calibration metrics beyond expected calibration error (ECE)
  * goal: Measure calibration error more rigorously, beyond the standard ECE metric.
  * setup: We implement two more calibration metrics (Adaptive ECE and Class-wise ECE; Mukhoti et al., 2020) and evaluate them on models trained on IN-1K image classification under natural distribution shifts.
* Further investigation on temperature scaling (TS)
  * goal: Justify the use of CaRot by conducting experiments in a challenging environment.
  * setup: Class-imbalanced IN-1K image classification (long-tailed recognition) under natural distribution shifts. Following Liu et al. (2019), we prune samples so that the number of samples per class ranges from about 1280 down to 5. The resulting long-tailed IN-1K has 116K training samples in total. The validation set is modified to the same class-wise sampling ratio, while the ID and OOD test sets remain intact (balanced).
* Comparison to previous work
  * goal: Explain the technical difference between previous work and our proposal.
  * setup: Compare our method to Cheng et al. (2021) on image classification under natural distribution shifts: (ID) IN-1K, (OOD) IN-V2/R/A/Sketch, ObjectNet.
* `(updated on 4/1, 3am CT)` **Parameter-efficient fine-tuning (PEFT)**
  * goal: Demonstrate the validity of CaRot under the PEFT setup, which considers parameter efficiency as well as classification accuracy.
  * setup: We adopt one of the state-of-the-art PEFT approaches, Multi-modal Prompt Learning (MaPLe; Khattak et al., 2023), as our baseline, and compare MaPLe with cross-entropy loss (default), MaPLe with contrastive loss, and MaPLe with the CaRot loss on ImageNet-1K natural distribution shifts. Following the standard experimental setup of PEFT (Zhou et al., 2022; Khattak et al., 2023), we compare the baselines and our proposal under a few-shot evaluation protocol (16 shots per class).

### G2) Novelty & Contribution

* In this paper, we reveal the miscalibration issue of fine-tuned VLMs under distribution shift and, for the first time, provide a simple robust fine-tuning method that improves OOD calibration and accuracy simultaneously.
* We provide the first theoretical analysis that considers both OOD calibration and classification error in a single framework.
* The proposed method has a unique design motivation derived from our theoretical analysis.
* Extensive experiments on three types of challenging distribution shifts demonstrate the effectiveness of the proposed method, showing improvements in OOD calibration and OOD classification accuracy while also achieving good ID calibration and ID classification accuracy.

### G3) Global reference for this rebuttal

1. mixup: Beyond Empirical Risk Minimization, Zhang et al., 2017
2. On the Pitfall of Mixup for Uncertainty Calibration, Wang et al., 2023
3. Post-hoc Uncertainty Calibration for Domain Drift Scenarios, Tomani et al., 2021
4. Calibrating Deep Neural Networks using Focal Loss, Mukhoti et al., 2020
5. Large-Scale Long-Tailed Recognition in an Open World, Liu et al., 2019
6. Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation, Cheng et al., 2021
7. Robust Fine-Tuning of Zero-Shot Models, Wortsman et al., 2022
8. Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution, Kumar et al., 2022
9. Surgical Fine-Tuning Improves Adaptation to Distribution Shifts, Lee et al., 2023
10. Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models, Goyal et al., 2023
11. First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains, Dong and Ma, 2022
12. Self-supervised Learning is More Robust to Dataset Imbalance, Liu et al., 2021
13. Revisiting the Calibration of Modern Neural Networks, Minderer et al., 2021
14. Learning to Prompt for Vision-Language Models, Zhou et al., 2022
15. AutoFT: Learning an Objective for Robust Fine-Tuning, Choi et al., 2024
16. MaPLe: Multi-modal Prompt Learning, Khattak et al., 2023

# R1-1. Response to reviewer nfuk (1/2)

We sincerely appreciate your constructive, in-depth feedback, and thank you for recognizing the strengths of our work! Below, we provide item-by-item responses to your concerns and questions.

> W1) Confining the scope of validation to image classification with CLIP ViT-B/16 only seems a bit too limiting, ..., I'd recommend adding **two more VLMs** to the evaluation.

To expand the scope of application, we additionally validate CaRot on CLIP RN50 and CLIP ViT-L/14 under the ImageNet natural distribution shifts setup.

| Backbone | Method | ID Acc | ID ECE | OOD Acc | OOD ECE |
|----------|--------|--------|--------|---------|---------|
| RN50 | ZS | 0.5983 | 0.0624 | 0.4252 | 0.0955 |
| RN50 | FT | 0.7621 | 0.0983 | 0.4197 | 0.2804 |
| RN50 | LP-FT | **0.7625** | 0.1042 | 0.4162 | 0.3274 |
| RN50 | FLYP | 0.7616 | 0.0516 | 0.4270 | 0.2127 |
| RN50 | CaRot | 0.7613 | **0.0491** | **0.4276** | **0.1792** |
| ViT-L/14 | ZS | 0.7555 | 0.0590 | 0.7093 | 0.0711 |
| ViT-L/14 | FT | 0.8526 | 0.0993 | 0.6598 | 0.2036 |
| ViT-L/14 | LP-FT | 0.8474 | 0.1056 | 0.6411 | 0.2521 |
| ViT-L/14 | FLYP | 0.8619 | 0.0729 | 0.7144 | 0.1470 |
| ViT-L/14 | CaRot | **0.8692** | **0.0357** | **0.7416** | **0.0686** |

Above, we see that CaRot remains effective in terms of OOD accuracy, OOD ECE, and ID ECE.

> W2) After temperature scaling, the improvements are very minor and in some cases other fine-tuning techniques seem to work better.

Temperature scaling (TS) divides the logit vector (the vector before applying the softmax function) by a scalar value ($\tau$). It addresses over- or under-confident predictions uniformly by smoothing ($\tau>1$) or sharpening ($\tau<1$) the predictive probability distributions, respectively.
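For clarity, here is a minimal sketch of this rescaling (the tensors and the value of $\tau$ are hypothetical; this is not our evaluation code):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 1000)  # hypothetical [batch, num_classes] pre-softmax scores
tau = 1.5                      # tau > 1 smooths (reduces confidence), tau < 1 sharpens

probs_raw = F.softmax(logits, dim=-1)
probs_ts = F.softmax(logits / tau, dim=-1)

# The argmax (predicted class) is unchanged, so TS is accuracy-preserving;
# only the confidence assigned to each prediction is rescaled.
assert torch.equal(probs_raw.argmax(dim=-1), probs_ts.argmax(dim=-1))
```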
Models fine-tuned with baseline methods, including standard fine-tuning (FT), LP-FT, and FLYP, are generally over-confident across data points, and applying TS with a temperature greater than 1 calibrates them well. On the other hand, CaRot is already well-calibrated during training thanks to self-distillation acting as an input-dependent regularizer. Therefore, the improvement obtained by applying TS is smaller for CaRot than for the other fine-tuning baselines.

Besides, TS becomes less effective at stably achieving good calibration when the distribution of the validation set used to search for the optimal temperature differs severely from the test-time data distribution. To show this, we design a long-tailed classification problem by modifying the number of samples per class in the train and validation sets while keeping the distribution of the OOD test set intact. Following the ImageNet-LT paper (Liu et al. 2019), we set the maximum and minimum class occurrences to about 1280 and 5, respectively. Results are below.

| Method | ID Acc | OOD Acc | ID ECE (w/o TS) | OOD ECE (w/o TS) | ID ECE (w/ TS) | OOD ECE (w/ TS) |
|--------|--------|---------|--------|---------|--------|---------|
| ZS | 0.6833 | 0.5839 | 0.0570 | 0.0736 | 0.0561 | 0.0752 |
| FT | 0.6856 | 0.5232 | 0.1369 | 0.2266 | 0.0652 | 0.1214 |
| LP-FT | 0.6992 | 0.4951 | 0.0985 | 0.1787 | 0.0602 | 0.1131 |
| FLYP | 0.8026 | 0.6047 | **0.0444** | 0.0929 | 0.0419 | 0.1044 |
| CaRot | **0.8038** | **0.6102** | 0.0454 | **0.0877** | **0.0416** | **0.0982** |

Although TS improves calibration on ID and OOD when the classifier is overconfident in general, it fails to further improve the calibration of models that already achieve good ID calibration without TS (such as ZS, FLYP, and CaRot) as the distribution of test samples becomes more discrepant from the train and validation samples.

# R1-2. Response to reviewer nfuk (2/2)

> W3, Q2) Given the similarity of the method to Cheng et al., the novelty of the method may be more minor, but it is certainly a new application of it and that is also valuable. Would it be possible to provide more details on differences wrt the Cheng et al. (2021) method?

Thanks for understanding our contribution in terms of a new application of contrastive learning and EMA self-distillation. We would like to emphasize that we derive CaRot from our theoretical analysis of the OOD calibration error and classification error (Theorem 3.2). To mitigate the OOD calibration and classification errors simultaneously during fine-tuning, we adopt the contrastive loss and the self-distillation technique to address the H-divergence and the ID calibration error, respectively. This design philosophy and our problem setup distinguish our work from the previous work, which focused on data efficiency during the pre-training phase of VLMs.

Furthermore, there is a technical difference with regard to the direction of the KL divergence (KLD). While Cheng et al. adopt the forward KLD, which places the target distribution $Q$ in the second argument and the prediction distribution $P$ in the first (i.e., $\mathrm{KL}(P\|Q) = \int_{x} P(x) \log \frac{P(x)}{Q(x)}\,dx$), we use the reverse KLD, $\mathrm{KL}(Q\|P) = \int_{x} Q(x) \log \frac{Q(x)}{P(x)}\,dx$. As a result, the objective of Cheng et al. performs entropy maximization on the prediction, while CaRot does not.
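A minimal numerical sketch of the two directions (the logits are hypothetical; $P$ is the prediction and $Q$ the target, as in the paragraph above):

```python
import torch
import torch.nn.functional as F

p = F.softmax(torch.randn(8, 10), dim=-1)  # prediction distribution P (hypothetical)
q = F.softmax(torch.randn(8, 10), dim=-1)  # target distribution Q (hypothetical)

# Forward KLD, KL(P || Q) = sum_x P log(P / Q) = -H(P) + H(P, Q):
# minimizing it w.r.t. P also rewards increasing the entropy H(P).
forward_kl = (p * (p.log() - q.log())).sum(dim=-1).mean()

# Reverse KLD, KL(Q || P) = sum_x Q log(Q / P) = -H(Q) + H(Q, P):
# w.r.t. P this reduces to the cross-entropy H(Q, P), with no entropy term on P.
reverse_kl = (q * (q.log() - p.log())).sum(dim=-1).mean()
```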
| Method | ID Acc | OOD Acc | ID ECE (w/o TS) | OOD ECE (w/o TS) | ID ECE (w/ TS) | OOD ECE (w/ TS) |
|--------------|--------|---------|--------|---------|--------|---------|
| FLYP | 0.8258 | 0.5946 | 0.0643 | 0.1831 | **0.0392** | 0.1217 |
| Cheng et al. | 0.8312 | 0.6099 | 0.0532 | 0.1558 | 0.0523 | 0.1531 |
| CaRot | **0.8337** | **0.6222** | **0.0408** | **0.0916** | 0.0397 | **0.0941** |

The results on the ImageNet natural distribution shifts setting show the superiority of our learning objective compared to Cheng et al.

> W4) Focusing on an image classification task that is well studied in terms of calibration may make the contribution appear less impactful. There are many calibration methods developed for such a setup, even if they do not use VLMs.

In this paper, we investigated the calibration error of robust fine-tuning methods, which has been overlooked in previous work (Wortsman et al. 2022, Kumar et al. 2022, Lee et al. 2023, Goyal et al. 2023). In this context, we followed the experimental setup of the existing literature by focusing on evaluating the CLIP ViT model on an image classification task. We leave further investigation of other vision-language tasks, such as retrieval and visual question answering, as future work.

> Q1) How exactly is the VLM model used here? (in terms of text and image inputs)

During fine-tuning, we pass a batch of images and corresponding texts (class names wrapped in prompts such as "a photo of a {classname}") to the image and text encoders to extract embeddings for the two input modalities, and compute the pair-wise similarity of these embeddings to calculate our proposed loss function. At inference time, we construct a classification head from the text encoder by computing the embeddings of text prompts for all class names, and obtain the prediction logit vector by matrix multiplication of an input image embedding with this classification head.
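A minimal sketch of this inference-time procedure (the encoder modules below are placeholder stand-ins for CLIP's encoders, not our actual code):

```python
import torch
import torch.nn.functional as F

# Placeholders standing in for CLIP's encoders (any callables returning embeddings).
image_encoder = torch.nn.Linear(3 * 224 * 224, 512)  # hypothetical stand-in
text_encoder = torch.nn.Embedding(1000, 512)         # hypothetical stand-in (one prompt per class)

# Build the classification head once: one normalized text embedding per class prompt.
with torch.no_grad():
    text_emb = F.normalize(text_encoder(torch.arange(1000)), dim=-1)  # [C, d]

# Classify an image by cosine similarity against every class embedding.
image = torch.randn(1, 3 * 224 * 224)
img_emb = F.normalize(image_encoder(image), dim=-1)  # [1, d]
logits = img_emb @ text_emb.T                        # [1, C]
pred = logits.argmax(dim=-1)
```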
> W5, W6, Q3) Various **metrics** exist that improve over ECE. It could be good to also consider those during evaluation. For OOD settings, it could be valuable to consider comparison with **TS that is designed specifically for OOD**. Could any **methods from 2.2. Confidence calibration** be adapted for the studied setup?

(`Calibration metrics`) As the reviewer pointed out, although ECE is the most widely used measure, relying solely on ECE cannot guarantee the most accurate picture of calibration error. To address this concern, we added two advanced calibration error metrics: 1) Adaptive ECE (ACE) and 2) Class-wise ECE (CCE).

(`Calibration methods`) We validate two more confidence calibration methods, Mixup and Mixup Inference in Training (MIT, Wang et al. 2023), which are train-time calibration regularization methods. Because these methods are devised for cross-entropy-based learning problems, we apply them to FT and LP-FT.

(`TS for OOD`) Standard TS tunes the temperature value using an ID validation set. To apply TS under domain shift, Tomani et al. (2021) suggested constructing a perturbed validation set and tuning the temperature on it. We refer to this method as TS-P and add comparisons with TS-P to the table below.

| Method | ID Acc | ID ECE | ID ACE | ID CCE | OOD Acc | OOD ECE | OOD ACE | OOD CCE |
|----------------|--------|--------|--------|--------|---------|---------|---------|---------|
| ZS | 0.6833 | 0.0570 | 0.0561 | 0.0020 | 0.5839 | 0.0736 | 0.0729 | 0.0115 |
| FT | 0.8153 | 0.0463 | 0.0430 | 0.0012 | 0.5750 | 0.1259 | 0.1232 | 0.0121 |
| FT w/ Mixup | 0.8252 | 0.0608 | 0.0586 | 0.0011 | 0.5400 | 0.1806 | 0.1788 | 0.0130 |
| FT w/ MIT | 0.8164 | 0.0849 | 0.0808 | 0.0011 | 0.5241 | 0.2130 | 0.2118 | 0.0131 |
| LP-FT | 0.8217 | 0.0695 | 0.0663 | 0.0011 | 0.5822 | 0.1935 | 0.1921 | 0.0114 |
| LP-FT w/ Mixup | 0.8269 | 0.0434 | 0.0414 | 0.0012 | 0.5566 | 0.1426 | 0.1409 | 0.0127 |
| LP-FT w/ MIT | 0.8209 | 0.0735 | 0.0697 | 0.0011 | 0.5317 | 0.2176 | 0.2170 | 0.0129 |
| FLYP | 0.8269 | 0.0635 | 0.0606 | 0.0011 | 0.5940 | 0.1836 | 0.1825 | 0.0110 |
| CaRot | **0.8337** | **0.0408** | **0.0388** | 0.0012 | **0.6223** | **0.0917** | **0.0904** | **0.0108** |
|----------------|--------|--------|--------|--------|---------|---------|---------|---------|
| ZS | - | 0.0561 | 0.0560 | 0.0003 | - | 0.0752 | 0.0744 | 0.0017 |
| FT | - | 0.0463 | 0.0430 | 0.0012 | - | 0.1259 | 0.1232 | 0.0121 |
| FT w/ Mixup | - | 0.0435 | 0.0442 | **0.0002** | - | 0.1452 | 0.1431 | 0.0021 |
| FT w/ MIT | - | 0.0439 | 0.0397 | **0.0002** | - | 0.1084 | 0.1063 | **0.0020** |
| LP-FT | - | **0.0382** | **0.0366** | 0.0012 | - | 0.0979 | 0.0962 | 0.0121 |
| LP-FT w/ Mixup | - | 0.0405 | 0.0390 | 0.0012 | - | 0.1221 | 0.1205 | 0.0126 |
| LP-FT w/ MIT | - | 0.0400 | 0.0384 | 0.0002 | - | 0.1073 | 0.1052 | 0.0022 |
| FLYP | - | 0.0392 | 0.0381 | 0.0012 | - | 0.1217 | 0.1204 | 0.0114 |
| CaRot | - | 0.0397 | 0.0382 | 0.0012 | - | **0.0941** | **0.0926** | 0.0108 |
|----------------|--------|--------|--------|--------|---------|---------|---------|---------|
| ZS | - | 0.1873 | 0.1865 | 0.0026 | - | 0.1444 | 0.1436 | 0.0733 |
| FT | - | 0.0965 | 0.0955 | 0.0015 | - | 0.0822 | 0.0825 | 0.0426 |
| FT w/ Mixup | - | 0.0642 | 0.0634 | **0.0002** | - | 0.1085 | 0.1083 | **0.0020** |
| FT w/ MIT | - | 0.1251 | 0.1254 | 0.0003 | - | 0.1043 | 0.1054 | **0.0020** |
| LP-FT | - | 0.1434 | 0.1430 | 0.0018 | - | 0.1060 | 0.1058 | 0.0570 |
| LP-FT w/ Mixup | - | 0.0861 | 0.0849 | 0.0003 | - | 0.0906 | 0.0905 | 0.0021 |
| LP-FT w/ MIT | - | 0.1419 | 0.1420 | 0.0004 | - | 0.0979 | 0.0990 | 0.0023 |
| FLYP | - | 0.0539 | **0.0500** | 0.0013 | - | 0.0866 | 0.0861 | 0.0337 |
| CaRot | - | **0.0521** | 0.0514 | 0.0013 | - | **0.0763** | **0.0758** | 0.0313 |

Here, the top, middle, and bottom splits of the table report performance without TS, with TS, and with TS-P, respectively. Although baselines such as LP-FT and FT w/ Mixup or MIT occasionally achieve good results in ID CCE and OOD CCE, CaRot consistently shows the best results in terms of OOD ECE, OOD ACE, and ID/OOD accuracy.

# R2. Response to reviewer Md1U

Thanks for your valuable comments. We are very glad that you found something interesting in our research!

> W2) More comparative analysis is needed. Besides fully fine-tuning CLIP, several parameter-efficient fine-tuning methods have emerged recently. It's important to investigate their performance on OOD.

The fundamental goal of this study is to investigate the calibration error of robust fine-tuning methods, which has been overlooked in previous work (Wortsman et al. 2022, Kumar et al. 2022, Lee et al. 2023, Goyal et al. 2023).
In this context, we followed the experimental setup of the existing literature by focusing on evaluation across full fine-tuning methods. Investigating the calibration of PEFT approaches would be a good future work direction, but is beyond the scope of the current study.

[04.01, update --] The fundamental goal of this study is to investigate the calibration error of robust fine-tuning methods, which has been overlooked in previous works. In this context, we followed the experimental setup of the existing literature (Wortsman et al. 2022, Kumar et al. 2022, Lee et al. 2023, Goyal et al. 2023, Choi et al. 2024) by focusing on evaluation across full fine-tuning methods without consideration of parameter efficiency. Investigating the calibration of PEFT approaches would be a good future work direction, but is beyond the scope of the current study.

(!update!) We validate CaRot on one of the state-of-the-art parameter-efficient fine-tuning (PEFT) methods, MaPLe (Khattak et al. 2023), which adapts continuous prompt tokens fed into CLIP's visual and text encoders. Following the standard PEFT setup, we compare the baselines and our proposal under the few-shot evaluation protocol (16 shots per class) and use the default hyperparameters (number of learnable tokens, depth, learning rate, and so on) proposed by the authors. Here, ID denotes ImageNet-1K, and OOD denotes the average performance on ImageNet-V2, ImageNet-A, ImageNet-R, and ImageNet-Sketch.

| Method | ID Acc | ID ECE | OOD Acc | OOD ECE |
|----------------|--------|--------|---------|---------|
| Zeroshot CLIP | 66.720 | 0.0599 | 57.200 | 0.0712 |
| MaPLe (Khattak et al. 2023) | 72.242 | 0.0513 | 57.455 | 0.0972 |
| MaPLe w/ Contrastive Loss | 72.390 | 0.0489 | 57.707 | 0.0951 |
| MaPLe w/ **CaRot** | **72.916** | **0.0481** | **58.726** | **0.0870** |

While the number of learnable parameters is significantly lower than in the full fine-tuning setup (2.85% of all parameters), MaPLe still largely increases OOD ECE; CaRot not only effectively mitigates this OOD miscalibration issue but also improves OOD generalization. This result further justifies the use of CaRot under the PEFT regime as well as the full fine-tuning setup.

> W3) The novelty of the approach is questionable. Theorem 3.2 appears commonplace in domain adaptation literature. It appears to be a mere substitution of source and target domains with in-distribution (id) and OOD domains. Regarding the method, as the authors themselves acknowledge, a similar approach has been introduced, with the distinction being its application in the fine-tuning phase.

While there is a previous work utilizing a similar learning objective, we would like to emphasize that the problem setup and motivation are completely different. We believe that the novelty of research is not limited to the technical design alone, but should also be judged in terms of the problem to be solved, the perspective from which it is viewed, and the theoretical motivation that led to the proposed method. In this paper, we investigated the confidence calibration of fine-tuned vision-language models under distribution shifts for the first time, and theoretically analyzed the OOD calibration and classification errors. We showed that the OOD calibration error and OOD classification error are both bounded from above by the ID calibration error and the H-divergence between the ID and OOD domains, and this is the first paper that analyzes OOD calibration error and OOD classification error in a unified theoretical framework.

We devise our CaRot objective based on this theoretical analysis, aiming at a new robust fine-tuning method that takes calibration into account as well. Specifically, to mitigate the OOD calibration and classification errors simultaneously during fine-tuning, we adopt the contrastive loss and the self-distillation technique to address the H-divergence and the ID calibration error, respectively.
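For reference, a minimal sketch of an EMA-teacher update of the kind used for self-distillation (the momentum value and the module are placeholders; this is not our exact implementation):

```python
import copy
import torch

student = torch.nn.Linear(512, 1000)  # stands in for the fine-tuned model
teacher = copy.deepcopy(student)
teacher.requires_grad_(False)         # the teacher is never updated by gradients

@torch.no_grad()
def ema_update(teacher, student, m=0.999):
    # teacher <- m * teacher + (1 - m) * student, applied after each optimizer step;
    # the teacher's (softened) predictions then serve as the self-distillation target.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

ema_update(teacher, student)
```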
This design philosophy and our problem setup distinguish our work from the previous work (Cheng et al. 2021), which focused on data efficiency during the pre-training phase of VLMs. Furthermore, there is a technical difference with regard to the direction of the KL divergence (KLD): while Cheng et al. adopt the forward KLD, which places the target distribution $Q$ in the second argument and the prediction distribution $P$ in the first (i.e., $\mathrm{KL}(P\|Q) = \int_{x} P(x) \log \frac{P(x)}{Q(x)}\,dx$), we use the reverse KLD, $\mathrm{KL}(Q\|P) = \int_{x} Q(x) \log \frac{Q(x)}{P(x)}\,dx$. This makes the objective of Cheng et al. perform entropy maximization, while CaRot does not.

| Method | ID Acc | OOD Acc | ID ECE (w/o TS) | OOD ECE (w/o TS) | ID ECE (w/ TS) | OOD ECE (w/ TS) |
|--------------|--------|---------|--------|---------|--------|---------|
| FLYP | 0.8258 | 0.5946 | 0.0643 | 0.1831 | **0.0392** | 0.1217 |
| Cheng et al. | 0.8312 | 0.6099 | 0.0532 | 0.1558 | 0.0523 | 0.1531 |
| CaRot | **0.8337** | **0.6222** | **0.0408** | **0.0916** | 0.0397 | **0.0941** |

The results on the ImageNet natural distribution shifts setting show the superiority of our learning objective compared to Cheng et al.

> W1) The relationship between the theoretical analysis and the method implementation needs clarification.

While the derived bound on the OOD calibration error and OOD classification error holds not only for CaRot but also for any other learned classifier, we would like to emphasize that we reflect each component of that bound in building the calibrated robust fine-tuning (CaRot) method, and this design motivation is supported by empirical evidence (Figure 3 of our manuscript).

# R3. Response to reviewer JEGK

Thank you for taking your precious time to review our paper and provide productive feedback!

> W1) The main components in CaRot are the contrastive learning and self-distillation. However, these two are well-studied and used in previous works, which is not novel at all. Based on that, the novelty of the work is too limited.

While there is a previous work utilizing a similar learning objective, we would like to emphasize that the problem setup and motivation are completely different. We believe that the novelty of research is not limited to the technical design alone, but should also be judged in terms of the problem to be solved, the perspective from which it is viewed, and the theoretical motivation that led to the proposed method. In this paper, we investigated the confidence calibration of fine-tuned vision-language models under distribution shifts for the first time, and theoretically analyzed the OOD calibration and classification errors. We showed that the OOD calibration error and OOD classification error are both bounded from above by the ID calibration error and the H-divergence between the ID and OOD domains, and this is the first paper that analyzes OOD calibration error and OOD classification error in a unified theoretical framework. We devise our CaRot objective based on this theoretical analysis, aiming at a new robust fine-tuning method that takes calibration into account as well.
Specifically, to mitigate the OOD calibration and classification errors simultaneously during fine-tuning, we adopt the contrastive loss and the self-distillation technique to address the H-divergence and the ID calibration error, respectively. This design philosophy and our problem setup distinguish our work from the previous work (Cheng et al. 2021), which focused on data efficiency during the pre-training phase of VLMs. Furthermore, there is a technical difference with regard to the direction of the KL divergence (KLD): while Cheng et al. adopt the forward KLD, which places the target distribution $Q$ in the second argument and the prediction distribution $P$ in the first (i.e., $\mathrm{KL}(P\|Q) = \int_{x} P(x) \log \frac{P(x)}{Q(x)}\,dx$), we use the reverse KLD, $\mathrm{KL}(Q\|P) = \int_{x} Q(x) \log \frac{Q(x)}{P(x)}\,dx$. This makes the objective of Cheng et al. perform entropy maximization, while CaRot does not.

| Method | ID Acc | OOD Acc | ID ECE (w/o TS) | OOD ECE (w/o TS) | ID ECE (w/ TS) | OOD ECE (w/ TS) |
|--------------|--------|---------|--------|---------|--------|---------|
| FLYP | 0.8258 | 0.5946 | 0.0643 | 0.1831 | **0.0392** | 0.1217 |
| Cheng et al. | 0.8312 | 0.6099 | 0.0532 | 0.1558 | 0.0523 | 0.1531 |
| CaRot | **0.8337** | **0.6222** | **0.0408** | **0.0916** | 0.0397 | **0.0941** |

The results on the ImageNet natural distribution shifts setting show the superiority of our learning objective compared to Cheng et al.

> W2) Some missed experiments are listed. (i) In experiments, some efficient tuning based methods should be compared, e.g., CoOp and CoCoOp. (ii) As mentioned in Limitations, it would be nice to include more VLMs, e.g., CLIP series or other VLMs.

(i) The fundamental goal of this study is to investigate the calibration error of robust fine-tuning methods, which has been overlooked in previous work (Wortsman et al. 2022, Kumar et al. 2022, Lee et al. 2023, Goyal et al. 2023). In this context, we followed the experimental setup of the existing literature by focusing on evaluation across full fine-tuning methods. Investigating the calibration of PEFT approaches will be a good future work direction.

[04.01, update ---] (i) The fundamental goal of this study is to investigate the calibration error of robust fine-tuning methods, which has been overlooked in previous works. In this context, we followed the experimental setup of the existing literature (Wortsman et al. 2022, Kumar et al. 2022, Lee et al. 2023, Goyal et al. 2023, Choi et al. 2024) by focusing on evaluation across full fine-tuning methods without consideration of parameter efficiency. Investigating the calibration of PEFT approaches will be a good future work direction, but is beyond the scope of the current study.

(!update!) We validate CaRot on one of the state-of-the-art parameter-efficient fine-tuning (PEFT) methods, MaPLe (Khattak et al. 2023), which adapts continuous prompt tokens fed into CLIP's visual and text encoders. Following the standard PEFT setup, we compare the baselines and our proposal under the few-shot evaluation protocol (16 shots per class) and use the default hyperparameters (number of learnable tokens, depth, learning rate, and so on) proposed by the authors. Here, ID denotes ImageNet-1K, and OOD denotes the average performance on ImageNet-V2, ImageNet-A, ImageNet-R, and ImageNet-Sketch.

| Method | ID Acc | ID ECE | OOD Acc | OOD ECE |
|----------------|--------|--------|---------|---------|
| Zeroshot CLIP | 66.720 | 0.0599 | 57.200 | 0.0712 |
| MaPLe (Khattak et al. 2023) | 72.242 | 0.0513 | 57.455 | 0.0972 |
| MaPLe w/ Contrastive Loss | 72.390 | 0.0489 | 57.707 | 0.0951 |
| MaPLe w/ **CaRot** | **72.916** | **0.0481** | **58.726** | **0.0870** |
While the number of learnable parameters is significantly lower than in the full fine-tuning setup (2.85% of all parameters), MaPLe still largely increases OOD ECE; CaRot not only effectively mitigates this OOD miscalibration issue but also improves OOD generalization. This result further justifies the use of CaRot under the PEFT regime as well as the full fine-tuning setup.

(ii) We include evaluations with two more VLMs (different CLIP visual encoder backbones), and the results are below.

| Backbone | Method | ID Acc | ID ECE | OOD Acc | OOD ECE |
|----------|--------|--------|--------|---------|---------|
| RN50 | ZS | 0.5983 | 0.0624 | 0.4252 | 0.0955 |
| RN50 | FT | 0.7621 | 0.0983 | 0.4197 | 0.2804 |
| RN50 | LP-FT | **0.7625** | 0.1042 | 0.4162 | 0.3274 |
| RN50 | FLYP | 0.7616 | 0.0516 | 0.4270 | 0.2127 |
| RN50 | CaRot | 0.7613 | **0.0491** | **0.4276** | **0.1792** |
| ViT-L/14 | ZS | 0.7555 | 0.0590 | 0.7093 | 0.0711 |
| ViT-L/14 | FT | 0.8526 | 0.0993 | 0.6598 | 0.2036 |
| ViT-L/14 | LP-FT | 0.8474 | 0.1056 | 0.6411 | 0.2521 |
| ViT-L/14 | FLYP | 0.8619 | 0.0729 | 0.7144 | 0.1470 |
| ViT-L/14 | CaRot | **0.8692** | **0.0357** | **0.7416** | **0.0686** |

We see the consistent effectiveness of CaRot compared to the baseline methods in terms of ID ECE, OOD Acc, and OOD ECE.

# R4. Response to reviewer PF86

> W1) Relatively trivial improvement in ID accuracy and ID calibration.

The main goal of our study is to address the miscalibration of fine-tuned VLMs under distribution shifts. To that end, we design our method by focusing on OOD calibration and OOD accuracy, guided by our theoretical analysis. Our method shows remarkably good performance in terms of OOD accuracy and OOD ECE, while still achieving good ID accuracy and ID ECE. Moreover, regarding ID calibration, CaRot achieves significantly better performance in some challenging environments, such as under a strong adversarial attack (right side of Table 3 in the manuscript).

> W2, Q1) No explanation about why the H-divergence between two domains is reduced. It is straightforward to understand why self-distillation (EMA teacher) works, because zero-shot CLIP generally has low calibration error. I can understand that the EMA teacher can reduce the ID calibration error to reduce the OOD calibration error. Which part of the proposed method, CaRot, reduces the divergence between two domains?

CaRot is equipped with contrastive learning and EMA self-distillation. We adopt contrastive learning to improve OOD robustness and EMA self-distillation for ID calibration. There is a theoretical connection between an H-divergence-related metric and the minimum singular value of the data representation (Dong and Ma, 2022), and there is also evidence that contrastive learning induces more diverse features (even ones not related to the downstream class label) than standard cross-entropy-based learning (Liu et al. 2021). Because the diversity of learned features in the representation space is related to the number of non-diminishing singular values of those representations, we speculate that contrastive learning reduces the H-divergence by, in general, keeping relatively large singular values in the representation covariance matrix.
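To make the quantity concrete, here is a minimal sketch of the singular value spectrum of a batch of representations (random tensors stand in for encoder outputs; this is illustrative only):

```python
import torch
import torch.nn.functional as F

# Z stands in for a batch of L2-normalized image representations, shape [N, d].
Z = F.normalize(torch.randn(512, 768), dim=-1)

# Covariance of the representations; the more non-diminishing singular values it has,
# the more directions of variation (i.e., feature diversity) the encoder retains.
Zc = Z - Z.mean(dim=0, keepdim=True)
cov = Zc.T @ Zc / (Z.shape[0] - 1)
svals = torch.linalg.svdvals(cov)
print(f"min singular value: {svals.min().item():.4e}, max: {svals.max().item():.4e}")
```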
> Q2) Considering the additional cost of the proposed method, I think temperature scaling still works as a more practical method. In Table 8, we can see that TS works well with vanilla fine-tuning and other methods to reduce the OOD calibration error. Previous work like Minderer et al. also gets the same conclusion. The OOD calibration error with TS is already very low in most cases and I am not sure whether people will opt for CaRot for further improvement. The improvement sometimes seems trivial.

First of all, we would like to mention that TS is an accuracy-preserving post-hoc method, while we propose a robust fine-tuning (training-based) method, CaRot. That is, our work is orthogonal to TS, and the two can be complementary. However, as the reviewer mentioned, we should justify the use of CaRot over vanilla fine-tuning with TS. We conducted an experiment for this, shown below.

TS may become less effective at stably achieving good calibration when the distribution of the validation set used to search for the optimal temperature differs severely from the test-time data distribution. To show this, we design a long-tailed classification problem by modifying the number of samples per class in the train and validation sets while keeping the distribution of the OOD test set intact. Following the ImageNet-LT paper (Liu et al. 2019), we set the maximum and minimum class occurrences to about 1280 and 5, respectively. Results are below.

| Method | ID Acc | OOD Acc | ID ECE (w/o TS) | OOD ECE (w/o TS) | ID ECE (w/ TS) | OOD ECE (w/ TS) |
|--------|--------|---------|--------|---------|--------|---------|
| ZS | 0.6833 | 0.5839 | 0.0570 | 0.0736 | 0.0561 | 0.0752 |
| FT | 0.6856 | 0.5232 | 0.1369 | 0.2266 | 0.0652 | 0.1214 |
| LP-FT | 0.6992 | 0.4951 | 0.0985 | 0.1787 | 0.0602 | 0.1131 |
| FLYP | 0.8026 | 0.6047 | **0.0444** | 0.0929 | 0.0419 | 0.1044 |
| CaRot | **0.8038** | **0.6102** | 0.0454 | **0.0877** | **0.0416** | **0.0982** |

Although TS improves calibration on ID and OOD when the classifier is overconfident in general, it fails to further improve the calibration of models that already achieve good ID calibration without TS (such as ZS, FLYP, and CaRot) as the distribution of test samples becomes more discrepant from the train and validation samples.

> Q3) There are some differences between Table 7 and Figure 6 in Minderer et al. about the OOD ECE on ImageNet-V2/R/A. The ECE does not coincide. Could the authors provide some explanations?

For the zero-shot CLIP evaluation, Minderer et al. adopt ViT-B/32 and RN50 and compute ECE with a bin size of 100 in Figure 6 of their paper, whereas we adopt ViT-B/16 and a bin size of 10 in our Table 7. Furthermore, it is unclear from their paper whether they adopt a prompt ensemble strategy, while we adopt it to evaluate zero-shot CLIP classification.
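For reference, a minimal sketch of how the number of bins enters the ECE computation (illustrative only; this is not our evaluation code, and the inputs are hypothetical arrays):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE; changing n_bins (e.g., 10 vs. 100) changes the reported value.
    `confidences` are max softmax probabilities, `correct` are 0/1 prediction outcomes."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # weighted gap between average accuracy and average confidence in this bin
            ece += (mask.sum() / n) * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# e.g., compare expected_calibration_error(conf, corr, n_bins=10) with n_bins=100
```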