$\newcommand{\ip}[2]{\left\langle#1, #2\right\rangle}$
### Common Response
We thank the reviewers for their kind comments and input. Before proceeding with in-depth responses, we highlight strengths of our work noted by reviewers.
* Our setting, **robustifying zero-shot models without extra labels or fine-tuning, is novel and challenging** (DxCE, Bg99),
* We offer an **interesting alternative** as fine-tuning LLMs becomes increasingly costly (E4A7),
* Our method is **simple** (DxCE) and **unique** (Kdvh, E4A7, Bg99).
Given the novel nature of our work, the reviewers had several questions and suggestions they wished to see addressed before recommending acceptance. We appreciate these and respond to all of them, often _producing further improved performance compared to the submitted version_. **We are confident that our work offers an exciting problem that invites further study, along with a very strong baseline approach that has substantial practical value.**
We respond to three common questions.
* **On directly prompting LLMs for text classification** (Rev. DxCE and E4A7). As suggested, we conduct experiments on text datasets and compare our method against directly prompting LLMs; in the following, we use ChatGPT and BART-MNLI. **RoboShot produces benefits here as well**, with worst-group improvements as large as 25.7% on CivilComments and 9% on HateXplain. Even in the Gender Bias experiment, RoboShot lifts the performance of weaker/older models to a level comparable to modern LLMs.
|Dataset+model|AVG|WG|
|-|:-:|:-:|
|CivilComments+ChatGPT|85.6|19.2|
|CivilComments+BART-MNLI|32.5|15.7|
|CivilComments+**Ours**(Ada)|56.6|**44.9**|
|CivilComments+**Ours**(BERT)|49.7|42.3|
||
|HateXplain+ChatGPT|55.4|12.2|
|HateXplain+BART-MNLI|61.2|5.3|
|HateXplain+**Ours**(Ada)|63.6|**21.2**|
|HateXplain+**Ours**(BERT)|57.3|14.0|
||
|Gender Bias+ChatGPT|90.1|**86.6**|
|Gender Bias+BART-MNLI|86.1|78.4|
|Gender Bias+**Ours**(Ada)|78.0|60.1|
|Gender Bias+**Ours**(BERT)|85.1|84.9|
* **On understanding improvements and limitations** (Rev. DxCE and Bg99). We agree! In fact, _capturing the conditions under which improvements are or are not possible is the motivation for our theoretical analysis_. At a high level, to be effective, we need that
* Harmful insights align well with harmful concepts, i.e., they have large coefficients in our model in Sec. 3.1,
* Helpful insights align well with helpful concepts, and the helpful concepts in the class embeddings have large coefficients,
* Insight noise should be sufficiently small.
In practice, these quantities are related to the quality of the embeddings and to the number and diversity of the language-model-derived insights. To study the sensitivity to these factors, we varied the embedding models and LLMs in our experiments. Our overall takeaway is that off-the-shelf modern pretrained models usually satisfy these conditions well enough to successfully robustify zero-shot prediction.
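To make these conditions measurable, below is a minimal sketch on synthetic embeddings. The concept directions `z_core`, `z_spurious`, and `z_benign` are hypothetical stand-ins (mirroring the synthetic setup described in our response to Reviewer Bg99); real concept directions are not observed, so this is an illustration rather than our implementation.

```python
import numpy as np

# Hypothetical orthonormal concept directions (as in our synthetic setup).
z_core = np.array([1.0, 0.0, 0.0])      # helpful concept
z_spurious = np.array([0.0, 1.0, 0.0])  # harmful concept
z_benign = np.array([0.0, 0.0, 1.0])    # benign concept

def alignment_coefficients(insight_emb):
    """Coefficients of an insight embedding along each concept direction.

    A harmful insight is 'well aligned' when its spurious coefficient is
    large and the remaining (noise) coefficients are small; the analogous
    statement holds for helpful insights and the core coefficient.
    """
    return {
        "core": float(insight_emb @ z_core),
        "spurious": float(insight_emb @ z_spurious),
        "benign": float(insight_emb @ z_benign),
    }

# A well-aligned harmful insight: large spurious coefficient, small noise.
harmful_insight = 0.1 * z_core + 1.0 * z_spurious + 0.05 * z_benign
print(alignment_coefficients(harmful_insight))
# {'core': 0.1, 'spurious': 1.0, 'benign': 0.05}
```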
* **On average accuracy** (Rev. DxCE and Bg99). A drop in average accuracy when worst-group accuracy improves is a standard phenomenon, even with fine-tuning; see, e.g., [1]. In that work, 2 out of 5 models show a drop in average accuracy even though worst-group accuracy improves significantly.
The underlying reason is as follows. Some inputs use non-causal (harmful) concepts as features, so when these are blocked, average accuracy may decrease. Concretely, consider the distribution of the margin $M:\mathcal{X}\rightarrow \mathbb{R}$ given by $M(x) := \ip{c^{+}}{x} - \ip{c^{-}}{x}$, where $c^+, c^-$ are the correct/incorrect class embeddings. Accuracy can then be expressed as $\mathbb{E}\left[1[M(x)\geq 0]\right]$. We show this margin distribution for the original image embeddings in Figures A.1(a) and A.2(a). Typically, inputs with spurious features ('waterbird'-'land') lie closer to the decision boundary ($M=0$). We expect the harmful-insight removal procedure to _increase_ the margin of $\mathcal{D}_{sp}$ but to _decrease_ the margin of inputs with non-spurious features $\mathcal{D}_{nsp}$ (e.g., 'waterbird'-'water'), as shown in Figures A.1(b) and A.2(b). Here the potential **tradeoff between the accuracy on $\mathcal{D}_{sp}$ and $\mathcal{D}_{nsp}$ appears**. If the gain on $\mathcal{D}_{sp}$ outweighs the loss on $\mathcal{D}_{nsp}$, the average accuracy increases, as in Figure A.1(b); if the gain on $\mathcal{D}_{sp}$ is smaller than the loss on $\mathcal{D}_{nsp}$, the average accuracy decreases. In either case, the model's performance on $\mathcal{D}_{sp}$ is improved by this procedure.
[1] Zhang, Michael, and Christopher Ré. "Contrastive adapters for foundation model group robustness.", NeurIPS 2022.
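For concreteness, here is a minimal sketch of this margin computation. The arrays are random placeholders standing in for image and class embeddings, not our actual data.

```python
import numpy as np

def margin(x, c_pos, c_neg):
    """M(x) = <c+, x> - <c-, x>; positive iff the correct class is predicted."""
    return x @ c_pos - x @ c_neg

def accuracy(X, c_pos, c_neg):
    """Empirical version of E[1[M(x) >= 0]] over a batch of embeddings X."""
    return float((X @ c_pos - X @ c_neg >= 0).mean())

# Placeholder batch: in practice, inputs with spurious features tend to
# have small margins (close to M = 0), so shifting their margins up can
# shift other inputs' margins down -- the tradeoff described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 512))
c_pos, c_neg = rng.normal(size=512), rng.normal(size=512)
print(accuracy(X, c_pos, c_neg))
```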
* **On designing LLM prompts to get textual insights** (Reviewers DxCE and E4A7). We **do not use any extensive prompt engineering or prompt-tuning methods** in our paper, to stay true to our zero-shot setting. Concretely, for ChatGPT, we directly ask it to list the biased/spurious features and the true visual features for the task, regardless of the task. We also add instructions on the answer format so that we can parse the answers automatically. For LLaMA, we use the same prompt without the answer-format instructions, and parse the answer by slicing the generated string at the index where the original prompt ends (so only the generated text is extracted as insights). We use non-finetuned versions of GPT2 and Flan-T5, so we adapt the prompts for models trained on the next-word prediction task and use several prompts that are paraphrases of each other to obtain a list of insights from these models. The complete prompt list is in Tables 6 and 7.
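As an illustration of the slicing step for the next-word-prediction models, a minimal sketch follows; the prompt and generation strings are placeholders, and the actual prompts are listed in Tables 6 and 7.

```python
def parse_insights(prompt: str, generation: str) -> list[str]:
    """Keep only the text generated after the prompt and split it into
    short insight strings (split on newlines and commas)."""
    answer = generation[len(prompt):]  # next-word models echo the prompt
    pieces = [piece.strip(" .-*")
              for line in answer.splitlines()
              for piece in line.split(",")]
    return [p for p in pieces if p]

# Placeholder example (not a verbatim model output).
prompt = "List spurious features to distinguish waterbirds and landbirds:"
generation = prompt + " water background, ocean, bamboo forest"
print(parse_insights(prompt, generation))
# ['water background', 'ocean', 'bamboo forest']
```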
## Response to Reviewer Kdvh
Thank you for noting the **novelty of our method**! We appreciate the valuable feedback.
* **On other approaches, e.g., linking CLIP's encoder to LLMs via adapters**. This is a great idea! The challenge in doing so is that using adapters typically requires training on a dataset, while our setting **seeks to improve robustness without any further training, tuning, or extra labels**. Connecting a generative model to the vision encoder would need extra training and data (as in the suggested Flamingo and LLaVA combinations).
We note, however, that despite our setting being much more difficult, we can sometimes match the performance gains of scenarios where additional training data is available. For example, the reviewer's suggestion of using adapters to tackle spurious features was pursued in [1]; although we face a more challenging setting, we _achieve similar or superior performance gains_ in several cases.
* **On novelty**. Indeed, as the reviewer notes, our novelty is not in the use of basic linear-algebraic operations (the procedure that removes harmful subspaces corresponding to spurious correlations and strengthens the effect of helpful subspaces; a minimal sketch of these operations appears after the list below). Instead, the novelty of this work lies in:
* Our overall setting,
* Where and how we obtain the signal that permits these operations (querying language models),
* Our novel model that divides embeddings into harmful, helpful, and benign components, thus enabling us to remove some components and boost others without any extra information or training.
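To make these operations concrete, here is a minimal sketch of one removal/boosting step written as standard vector rejection/addition along an insight direction. This is an illustrative approximation with placeholder inputs; the exact scaling used in our method is given in the paper.

```python
import numpy as np

def reject(x, v):
    """Remove the component of embedding x along insight direction v."""
    v = v / np.linalg.norm(v)
    return x - (x @ v) * v

def boost(x, v):
    """Strengthen the component of embedding x along insight direction v."""
    v = v / np.linalg.norm(v)
    return x + (x @ v) * v

def robustify(x, harmful_insights, helpful_insights):
    """Reject each harmful insight direction, then boost each helpful one."""
    for v in harmful_insights:
        x = reject(x, v)
    for v in helpful_insights:
        x = boost(x, v)
    return x

# Usage: embed the input and the LLM-provided insight texts with the same
# encoder, robustify the input embedding, and classify by comparing inner
# products with the class embeddings -- no labels or training involved.
```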
* **On using LLMs to propose insight descriptions**. We note that **we do not use any ground-truth insights**. Using ground-truth insights has been pursued in other work, e.g., [2]. Note that **our performance is superior to [2] on the common model used in both works (ViT-L/14), despite not accessing such ground-truth insights!** Instead, we use LLMs to obtain textual descriptions of the spurious features for the task and use these as insights; we believe this is exactly what is being suggested. At a high level, our approach is to query LLMs via prompting, e.g., “what are spurious/useful features to distinguish [class X] and [class Y]?”, and then use embeddings of the answers to robustify the base model embeddings.
[1] Zhang, Michael, and Christopher Ré. "Contrastive adapters for foundation model group robustness.", NeurIPS 2022.
[2] Chuang et al, "Debiasing Vision-Language Models via Biased Prompts", 2023.
## Response to Reviewer DxCE
Thank you for noting that our setting (model robustification without additional labels, training, and specification) is **challenging and interesting**! We also appreciate the helpful feedback.
* **On the use of manually designed helpful/harmful queries to the LLM**. Great question! The queries used to get the insights are based solely on the task description and **contain no manually tuned task-specific information**. We use prompts such as “list spurious/useful features to distinguish [classes]”. This is a standard procedure for obtaining answers from language models, akin to the practice of prepending text like “a photo of [label]” when prompting vision-language models for zero-shot predictions.
To stay true to our zero-shot robustification setting, we **do not use any prompt-tuning or extensive prompt-engineering methods to obtain better performance** (doing so would likely boost the performance of our method even further). As seen in Tables 6 and 7, for stronger LLMs (ChatGPT, LLaMA) we simply ask them to *list* the spurious/useful features given a task. For GPT2 and Flan-T5 (no instruction tuning), we use several prompts (paraphrases of each other, mainly consisting of the class label string) to obtain a list of insights.
* **On the impact of using different prompts to query textual insights**. We reiterate our earlier point that we do not use any extensive prompt tuning, as the goal of our work is to be truly zero-shot. This means we cannot tune prompts to obtain better performance. If we had access to a labeled dataset for tuning, we could indeed obtain even better performance. However, the key insight in our work is that **substantial robustness improvements are possible without any** labeled data or additional tuning.
* **On directly prompting LLM for classification**. This is also a great question! We address it in the common response. Briefly, RoboShot produces improvements in this comparison as well, by 25.7% in CivilComments and 9% in HateXplain. We have highlighted this result in the updated draft.
* **On calibration**. We agree with the suggestion! Indeed, **RoboShot further benefits from the calibration methods** pointed out by the reviewer. In fact, this further highlights the versatility of RoboShot: we can combine it with such methods with no additional work. To showcase this, we show additional results below from (1) applying the calibration method alone, (2) our method alone, and (3) the combination.
These new results show that **the best-performing method across the board is either ours or the combination**. The underlying reason is that the two methods are orthogonal, so adding calibration can further improve the results.
#### Model: BERT ####
|Dataset|AVG|WG|
|-|:-:|:-:|
|CivilComments + Calibration|51.0|37.3|
|CivilComments + **Ours**|49.7|**42.3**|
|CivilComments + Combination|53.4|36.9|
||
|HateXplain + Calibration|60.9|15.8|
|HateXplain + **Ours**|57.3|14.0|
|HateXplain + Combination|56.7|**22.8**|
||
|Gender Bias + Calibration|85.4|83.2|
|Gender Bias + **Ours**|85.1|**84.9**|
|Gender Bias + Combination|85.7|82.5|
||
|Amazon + Calibration|78.0|57.7|
|Amazon + **Ours**|82.9|**63.8**|
|Amazon + Combination|79.0|59.2|
#### Model: Ada ####
|Dataset|AVG|WG|
|-|:-:|:-:|
|CivilComments + Calibration|73.3|31.2|
|CivilComments + **Ours**|56.6|**44.9**|
|CivilComments + Combination|68.3|35.0|
||
|HateXplain + Calibration|61.9|31.6|
|HateXplain + **Ours**|63.6|21.2|
|HateXplain + Combination|59.6|**33.3**|
||
|Gender Bias + Calibration|84.2|77.8|
|Gender Bias + **Ours**|78.0|60.1|
|Gender Bias + Combination|84.2|**77.9**|
||
|Amazon + Calibration|71.2|50.5|
|Amazon + **Ours**|78.0|60.1|
|Amazon + Combination|83.2|**63.9**|
* **When is RoboShot applicable to different combinations of models and datasets.**
This is an important question, and we have updated our draft with the answer. Briefly, there are two approaches to determining where RoboShot is suitable.
* _Intrinsic evaluation_: Our paper contains a theoretical framework characterizing our approach. We restate the result here: RoboShot scales down harmful coefficients in the sample embedding at a rate inversely proportional to the harmful coefficients in the insight embedding. This bound provides a key insight: RoboShot works better when there is less noise in the insight embeddings. These quantities can then be directly measured on, e.g., synthetic data.
* _Extrinsic evaluation_: Existing pipelines built on zero-shot models are evaluated by backtesting or A/B testing. Since **our approach is a plug-in replacement for embeddings**, we can directly apply it within these extrinsic evaluation techniques and compare.
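As a concrete example of the extrinsic route, here is a minimal sketch of the AVG/WG metrics reported throughout our tables; an existing evaluation harness can compute them once with the original embeddings and once with the RoboShot-modified embeddings. Variable names are placeholders.

```python
import numpy as np

def avg_and_worst_group_accuracy(preds, labels, groups):
    """Average accuracy over all points and worst accuracy over groups."""
    preds, labels, groups = map(np.asarray, (preds, labels, groups))
    correct = preds == labels
    avg = float(correct.mean())
    wg = min(float(correct[groups == g].mean()) for g in np.unique(groups))
    return avg, wg

# A/B-style comparison on a held-out evaluation set with group labels:
# avg_zs, wg_zs = avg_and_worst_group_accuracy(zero_shot_preds, labels, groups)
# avg_rs, wg_rs = avg_and_worst_group_accuracy(roboshot_preds, labels, groups)
```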
* **On Figure 4**. We use the CelebA dataset (sampling 100 test points per class) and use the real LLM insights. Some examples of the LLM (ChatGPT) outputs follow. Examples of spurious insights: [“deep complexion”, “fair complexion”]. Examples of true insights: [“coarse hair texture”, “smooth hair texture”].
* **On extracting concise textual insights from LLM output**. We fully automate the process of translating LLM outputs to concise textual insights. For ChatGPT, we instruct it to output in the format we can parse. For instance: “List the biased/spurious differences between waterbirds and landbirds. Give short keyword for each answer. Answer in the following format: <Difference>: <waterbird characteristic> ; <landbird characteristic>”. We can then directly parse ChatGPT’s output. Other LLMs we use in the experiments are trained for the next word prediction task. So we take the outputs and slice the string by taking all characters after the prompt.
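A minimal sketch of parsing this structured format follows; the example string is illustrative, not a verbatim ChatGPT response.

```python
def parse_structured(output: str) -> list[tuple[str, str]]:
    """Parse lines of the form
    '<Difference>: <waterbird characteristic> ; <landbird characteristic>'
    into (waterbird, landbird) insight pairs."""
    pairs = []
    for line in output.splitlines():
        if ":" not in line or ";" not in line:
            continue  # skip lines that do not follow the requested format
        _, rest = line.split(":", 1)
        left, right = rest.split(";", 1)
        pairs.append((left.strip(), right.strip()))
    return pairs

example = ("Habitat: water background ; forest background\n"
           "Feet: webbed feet ; clawed feet")
print(parse_structured(example))
# [('water background', 'forest background'), ('webbed feet', 'clawed feet')]
```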
## Response to Reviewer E4A7 ##
The reviewer notes that **our finetuning-free robustification method is innovative**. Thank you! We also appreciate the helpful input.
* **On applicability to current research trends**. The **advantage of our approach is that it applies to and is capable of improving on prompt-based methods**! Indeed, just like prompting, it is fully zero-shot and does not require additional information. We need only access embeddings to turn a prompt-based method into one where our technique applies.
We validated this idea by following the reviewer's suggestion and conducting experiments with direct prompting of ChatGPT and BART-MNLI. The results of these experiments are in the common response. They show that **RoboShot improves on direct prompting** with ChatGPT and BART, especially on the toxicity classification tasks (by 25.7% worst-group accuracy on CivilComments and 9% on HateXplain).
* **On average performance**. Thank you for the comment! Indeed, the fact that average performance may decrease when improving worst-group accuracy is a known phenomenon, which we discuss in the common response; we answer in more detail here. When removing harmful insights, RoboShot tries to remove spurious features that can be predictive for some groups (in-distribution data, e.g., water in waterbird images) but not across all groups (out-of-distribution data, e.g., water in landbird images). Thus, *accuracy in groups where spurious features were informative may drop slightly, while accuracy in groups where spurious features have adverse effects typically increases*. However, whether this tradeoff appears depends on the task, the model, and the embedding quality; **in many cases, average accuracy can substantially increase**. We note, as well, that this effect arises only from removal: when boosting helpful insights, our approach benefits accuracy across all groups.
Thus, average accuracy fails to improve when 1) the removed harmful insights hurt some groups' accuracy in a way that outweighs the gains in other groups, and 2) boosting helpful insights does not improve overall accuracy enough, due to embedding quality or weak helpful insights. Note that this tradeoff appears for the same reason even in fine-tuning-based approaches, for example [1].
* **On prompting LLMs to get textual insights**. The prompts are designed **solely based on the task descriptions** for each dataset. We described prompting approaches in the common response and refer to Tables 6 and 7 for the list of all prompts we use.
[1] Zhang, Michael, and Christopher Ré. "Contrastive adapters for foundation model group robustness.", NeurIPS 2022.
## Response to Reviewer Bg99
Thank you for noting the advantages of our paper: **our novel setting and extensive experiments and analysis**! We also appreciate the helpful feedback.
* **On gender bias experiments**. Thank you for pointing this out! We have added the details of the genders considered in the dataset, as well as the remaining unresolved biases, to the Appendix of our updated draft. We have also added similar details for the toxicity classification datasets.
We also briefly describe these additions here: the gender bias dataset's labels are not an exhaustive list of all genders; only two genders, male and female, are included. Biases that might affect gender demographics not included in the dataset's labels remain unresolved.
* **On limitations**. As mentioned in the common response, our theoretical framework provides the conditions under which our method is effective. The bottom line, and key limitation, is that the **insight noise should be sufficiently small compared to the harmful/helpful coefficients**. To illustrate this point, we conducted a synthetic experiment varying the noise level in the insight vectors; the following table shows the results. We see that with noise up to 10-20% of the signal (harmful/helpful coefficients = 1), our algorithm works well, recovering worst-group accuracy and improving average accuracy.
| | AVG | WG |
| -------------------------------- | -------------- | -------------- |
| Zero-shot | 74.90 +- 0.32 | 49.38 +- 0.83 |
| RoboShot ($\sigma_{insight}=0.1$) | 96.80 +- 1.35 | 94.46 +- 0.29 |
| RoboShot ($\sigma_{insight}=0.2$) | 96.38 +- 1.52 | 93.41 +- 3.36 |
| RoboShot ($\sigma_{insight}=0.5$) | 89.5 +- 14.72 | 80.83 +- 21.45 |
| RoboShot ($\sigma_{insight}=1$) | 74.34 +- 24.51 | 55.83 +- 33.60 |
| RoboShot ($\sigma_{insight}=5$) | 49.56 +- 14.69 | 8.65 +- 12.61 |
For completeness, we include the full set of experimental details below, and have added these to our draft.
1. Basis: $z_{core}=(1, 0, 0), z_{spurious}=(0,1,0), z_{benign}=(0, 0, 1)$
2. Class embeddings:
* $c_{1}=z_{core}+z_{spurious}+ z_{benign}$
* $c_{0}=-z_{core}-z_{spurious}+ z_{benign}$
3. Input distribution (here $s$ denotes spurious feature group):
* $x|y=1, s=0 \sim \mathcal{N}([w_{core}, w_{spurious}, w_{benign}], \sigma_{input}I), n=2500$
* $x|y=1, s=1 \sim \mathcal{N}([w_{core}, -w_{spurious}, w_{benign}], \sigma_{input}I), n=2500$
* $x|y=0, s=0 \sim \mathcal{N}([-w_{core}, -w_{spurious}, w_{benign}], \sigma_{input}I), n=2500$
* $x|y=0, s=1 \sim \mathcal{N}([-w_{core}, w_{spurious}, w_{benign}], \sigma_{input}I), n=2500$
4. Insight vectors:
* $v_{helpful} = \gamma_{helpful}z_{core} + \gamma_{s}z_{spurious} + \gamma_{b}z_{benign}$, where $\gamma_{s} \sim \mathcal{N}(0, \sigma_{insight})$, $\gamma_{b} \sim \mathcal{N}(0, \sigma_{benign})$
* $v_{harmful} = \gamma_{c}z_{core} + \gamma_{harmful}z_{spurious} + \gamma_{b}z_{benign}$, where $\gamma_{c} \sim \mathcal{N}(0, \sigma_{insight})$, $\gamma_{b} \sim \mathcal{N}(0, \sigma_{benign})$
5. For the experiment reported in the table above, we used $w_{core}=1, w_{spurious}=1, w_{benign}=0.5, \gamma_{helpful}=1, \gamma_{harmful}=1$, $\sigma_{input}=0.5, \sigma_{benign}=0.01$, and 100 repetitions.
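For convenience, here is a minimal sketch generating one repetition of this synthetic setup with the parameters in item 5. It is written from the description above as a sketch, not our exact experiment script.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2500
w_core, w_spur, w_ben = 1.0, 1.0, 0.5
sigma_input, sigma_benign, sigma_insight = 0.5, 0.01, 0.1
gamma_helpful, gamma_harmful = 1.0, 1.0

# Item 1: basis.
z_core, z_spur, z_ben = np.eye(3)

# Item 2: class embeddings.
c1 = z_core + z_spur + z_ben
c0 = -z_core - z_spur + z_ben

# Item 3: input distribution, one block of n samples per (y, s) group.
means = np.array([[ w_core,  w_spur, w_ben],   # y=1, s=0
                  [ w_core, -w_spur, w_ben],   # y=1, s=1
                  [-w_core, -w_spur, w_ben],   # y=0, s=0
                  [-w_core,  w_spur, w_ben]])  # y=0, s=1
X = np.vstack([rng.normal(m, sigma_input, size=(n, 3)) for m in means])
y = np.repeat([1, 1, 0, 0], n)
s = np.repeat([0, 1, 0, 1], n)

# Item 4: noisy insight vectors.
v_helpful = (gamma_helpful * z_core
             + rng.normal(0, sigma_insight) * z_spur
             + rng.normal(0, sigma_benign) * z_ben)
v_harmful = (rng.normal(0, sigma_insight) * z_core
             + gamma_harmful * z_spur
             + rng.normal(0, sigma_benign) * z_ben)

# Zero-shot baseline: predict the class whose embedding has the larger
# inner product; applying the removal/boosting step to X before prediction
# reproduces the RoboShot rows of the table.
preds = (X @ c1 > X @ c0).astype(int)
```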