## Reviewer 1 - tMeB
- **Confidence:** 5
- **Soundness:** 2.5
- **Overall Assessment:** 2.5
### Strengths
- **S1:** The analysis of the phenomenon of seen term bias is interesting and important.
- **S2:** The improvement of term recall and task performance is significant.
- **S3:** The proposed approach is simple, straightforward and effective.
### Weaknesses
- **W1:** Some parts of the writing are confusing. For example: (1) the sentence at lines 48/49 contradicts the argument (should it be "in document" instead of "out-of-document"?); (2) lines 307-309 are confusing (is test data used for training?); (3) what are "answer query terms" at line 446? (4) why does a low unseen term ratio lead to a high improvement, as claimed at line 517?
- **W2:** The baseline of the out-of-the-box models without any finetuning on the generated queries should be included in all the tables.
- **W3:** Using InPars/GPL as model names is confusing, while Contriever and DRAGON are each both a method name and a model name.
- **W4:** Table 3 can be merged into Table 1. More results on more datasets should be included in Table 4 to support the claim.
- **W5:** Examples of the refined queries should be included.
- **W6:** The paper title should be adjusted: it is not zero-shot retrieval, as ICL demonstration is used. Better to frame it as domain adaptation.
- **C1:** The title word "Term-level" should be capitalized as "Term-Level".
- **C2:** The heading "Seen Term Bias in PQG-Based Approaches" could be changed to something like "Task-dedicated PQG".
### Our Comment
Thank you for your detailed review and insightful feedback on our paper. We appreciate your comments, particularly your recognition of the importance of seen term bias and of the significant improvements achieved by our method. We will incorporate the details you suggested into the camera-ready version to enhance the quality of the paper. The specifics are as follows:
- **W1:** (1) You are correct. Lines 48-49 will be revised to "in document." (2) Lines 307-309 describe the few-shot setting, matching the InPars baseline. We have also shown in Table 3 that tRAG yields gains even without these few-shot examples. (3) "Answer query term" refers to terms that appear in the actual test query; we measure this to show that tRAG generates document-relevant terms rather than arbitrary out-of-document terms. (4) Line 517 explains that FiQA is an exceptional case: despite having a low unseen term ratio due to its very short document length, FiQA shows a significant performance gain.
- **W2:** We compared methods finetuned on generated queries because our main goal is to address the seen term bias in query generation. However, we will add the out-of-the-box models (without finetuning) to the table for reference, as requested:
| Method | NFCorpus | SciFact | SciDocs | FiQA | Average |
| --- | --- | --- | --- | --- | --- |
| w/o finetuning | | | | | |
| BM25 | 32.5 | 66.5 | 15.8 | 23.6 | 34.6 |
| ColBERT | 30.5 | 67.1 | 14.5 | 31.7 | 36.0 |
| ColBERTv2 | 33.8 | 69.3 | 15.4 | 35.6 | 38.5 |
| w/ finetuning | | | | | |
| GPL | 34.2 | 66.4 | 16.1 | 32.8 | 37.4 |
| GPL + RAG | 34.5 | 66.9 | 16.3 | 34.0 | 37.9 |
| GPL + tRAG | 34.9 | 67.3 | 16.8 | 37.6 | 39.2 |
- **W3:** We will use the term "backbone method" instead of the model name to resolve this confusion.
- **W4:** We will merge Table 3 into Table 1 to better support our claim.
- **W5:** We will add examples for more detail. The table below shows some of them, with out-of-document terms highlighted in bold; a purely illustrative sketch of this refinement step is given after this list.
| Original query | Refined query |
| -------- | -------- |
| Breast cancer | How breast cancer cells feed on **cholesterol**? |
| Key factors of cancer | What role do **invadopodia** play in cancer? |
| RNA-binding during stress | What happens to RNA-binding **proteins** during stress? |
- **W6, C1 and C2:** Thank you for your detailed feedback on the terminology. We will use your suggestions to improve the clarity of our paper.
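Regarding W5, the following is a purely illustrative sketch of how such a refinement step might be prompted. The prompt wording, the corpus term-retrieval step, and all names below are assumptions made for illustration and do not reproduce the paper's actual implementation.

```python
# Hypothetical illustration only: rewrite a short query using terms retrieved from the
# target corpus, so the refined query can include relevant out-of-document terms.
from typing import List

REFINE_PROMPT = (
    "Document: {document}\n"
    "Related corpus terms: {terms}\n"
    "Original query: {query}\n"
    "Rewrite the query as a specific question, using the related corpus terms where relevant:"
)


def build_refinement_prompt(document: str, retrieved_terms: List[str], query: str) -> str:
    """Fill the template; the LLM call itself (any chat-completion API) is omitted here."""
    return REFINE_PROMPT.format(
        document=document, terms=", ".join(retrieved_terms), query=query
    )


print(build_refinement_prompt(
    document="Breast cancer cells show altered lipid metabolism ...",
    retrieved_terms=["cholesterol", "lipid rafts"],
    query="Breast cancer",
))
```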
### Our Comment 2
Thank you for replying to our response. We recognize that your remaining concerns also stem from our choice of terminology, and we will clarify it in the camera-ready version. The details are as follows:
> W1 (2): why did you not use train/dev queries? Using test queries for ICL is not a sound setting.
The "test query" mentioned here refers to the test queries from MS-MARCO dataset, which serve as the dev set you mentioned. Therefore, we are not training on queries used for evaluation. To avoid such concerns, we will revise the term use.
> W1 (3): it seems that it is not related to the word "answer". So why is the word included in the term?
The word "answer" was included in response to recent reviews to explain that the recall we measured refers to actual relevant terms. We can express this more directly by using "relevant term" instead.
## Reviewer 2 - gT7f
- **Confidence:** 3
- **Soundness:** 4
- **Overall Assessment:** 3.5
### Strengths
- **S1:** Describes a helpful method for adapting neural retrievers to obscure domains whose term distributions are very different from both typical corpora and the datasets available for training neural retrievers.
- **S2:** The approach is explained in detail, and the experiments are well-structured so that the reader can see the benefits of using the proposed approach compared to GPL or CSQE.
- **S3:** The authors included source code in this revision.
### Weaknesses
- **W1:** It seems to be a good extension of existing work in this sub-field. However, the overall impact on the community interested in the best possible retriever models might be relatively minor.
- **W2:** The authors included the code; however, running it would not be straightforward, as there is no README and the code seems to expect some dependencies (mainly datasets) to be in specific locations.
- **C1:** For the camera-ready version, please include instructions on how to run your code in the README.
### Our Comment
Thank you for your time in providing a thoughtful review of our paper. We sincerely appreciate your comments, particularly your recognition of our clear explanation of the method, which focuses on domain adaptation using term-RAG.
- **W1:** We want to emphasize that our improvement in nDCG@10 on the BEIR benchmark is notable, especially in zero-shot settings where such enhancements are challenging. For instance, CPT [1] reports a 0.1 improvement over the previous best method on the BEIR benchmark.
- **W2 and C1:** We will add a README to the camera-ready version to clarify how to run our code.
[1] Neelakantan, Arvind, et al. "Text and code embeddings by contrastive pre-training." arXiv preprint arXiv:2201.10005 (2022).
## Reviewer 3 - DDPF
- **Confidence:** 4
- **Soundness:** 3.5
- **Overall Assessment:** 3
### Strengths
- **S1:** Similar to my previous reviews, the paper is well-organised and well-motivated, making it easy to follow. Meanwhile, the idea of generating novel unseen terms is interesting.
- **S2:** I appreciate the efforts the authors made to address my concerns. The detailed discussion on how to measure the quantity of generated unseen terms and the additional results on ablating few-shot examples are convincing, prompting me to increase the soundness score from 3 to 3.5.
### Weaknesses
- **W1:** My concern is still similar to the one I stated in the last review: the minor improvements make the proposed method less valuable given the extra complexity it introduces.
- **W2:** The correlation between the unseen term ratio and the improvements is still not clear, e.g., the largest improvement is observed on FiQA, which has the lowest unseen term ratio.
### Our Comment
Thank you for your time in providing a thoughtful review of our paper. We sincerely appreciate your comments, particularly your recognition of our effort to provide a detailed discussion of the few-shot examples.
- **W1:** We want to emphasize that our improvement in nDCG@10 on the BEIR benchmark is notable, especially in zero-shot settings where such enhancements are challenging. For instance, CPT [1] reports a 0.1 improvement over the previous best method on the BEIR benchmark.
- **W2:** Lines 516-527 provide an analysis of the largest improvement, observed on FiQA. FiQA has the shortest documents among the datasets, which makes it easier to generate relevant keywords; this aligns with recent studies [2, 3]. We will clarify this further in the camera-ready version.
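For reference, below is a minimal sketch of how a per-dataset unseen term ratio of this kind could be computed. It assumes the ratio is measured over generated-query terms against the source document; the tokenization and names are illustrative, not the paper's exact definition.

```python
from typing import Iterable, Set, Tuple


def term_set(text: str) -> Set[str]:
    """Lowercased unique word tokens; the paper's tokenization may differ."""
    return set(text.lower().split())


def unseen_term_ratio(generated_query: str, source_document: str) -> float:
    """Fraction of generated-query terms that do not occur in the source document."""
    query_terms = term_set(generated_query)
    if not query_terms:
        return 0.0
    unseen = query_terms - term_set(source_document)
    return len(unseen) / len(query_terms)


def average_unseen_term_ratio(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average the ratio over (generated_query, source_document) pairs of one dataset."""
    ratios = [unseen_term_ratio(q, d) for q, d in pairs]
    return sum(ratios) / len(ratios) if ratios else 0.0
```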
[1] Neelakantan, Arvind, et al. "Text and code embeddings by contrastive pre-training." arXiv preprint arXiv:2201.10005 (2022).
[2] Tohalino, Jorge A. V., et al. "Using citation networks to evaluate the impact of text length on keyword extraction." PLOS ONE, 18(11):e0294500 (2023).
[3] Liu, Feng, et al. "Performance evaluation of keyword extraction methods and visualization for student online comments." Symmetry, 12(11):1923 (2020).