1.30 My feedback on the Supportive Reranking paper I am reviewing
---
W2: Limitations in the Supportive Re-ranking label strategy
I have checked Section 4.2 on the effects of feedback providers again, but there seem to be only ablation studies on the parameter size of Llama-2, and no comparison between models of different architectures, such as those used in other experiments (Qwen-7B, Mistral-7B, ...). While I believe this is a limitation of the work, I agree that it is not significant.
W3: Missing crucial reranking baselines
I understand that one difference between SRR and relevance-based re-ranking models is that SRR can be applied to any QA-format data without relevancy annotations, which is why I previously commented the following:
> A good example of a fair comparison would be: comparing SRR with a relevance-based reranker using a similar model structure (DeBERTaV3-large) and training dataset (using the same training dataset is best, but if there is no such training dataset, datasets in a similar domain (e.g., Wikipedia passages used by the supportive reranker, QA queries, ...) with relevance labels can be used).
Thank you for providing additional results on CoT-MAE, but I am still not convinced about the choice of the baseline reranking model (why not compare with DeBERTa-v3-large based rerankers? why not compare with MonoT5?). Yes, SRR and relevance-based re-ranking methods can be complementary, but in the end, the final objective of rerankers is to better rank passages.
The authors have commented as the following:
> One should choose the more appropriate re-ranking method for the specific use case (or even use them in combination), rather than relying solely on performance in certain downstream tasks.
If one is to make this argument, I think there should be a comparison between relevance-based and SRR rerankers and an analysis of the specific use cases in which SRR should be used over relevance-based rerankers.
Thank you for the comments on W4, about open-sourcing the code, about model size, and clarification about the token probability.
The major concern I had about this work is the lack of comparison with relevance-based rerankers that are trained on datasets similar to the proposed reranker OR based on DeBERTa-v3-large backbone models. It's not that the SRR model has to excel on every dataset or in every case. I'm not requiring that the method be compared to state-of-the-art reranking models.
What the paper needs is (1) a *comparison* with relevance-based rerankers in a similar setup (models built on [DeBERTa-v3-large](https://huggingface.co/cross-encoder/nli-deberta-v3-large), or standard rerankers such as [MonoT5-base](https://huggingface.co/castorini/monot5-base-msmarco-10k) or [RankGPT](https://github.com/sunnweiwei/RankGPT?tab=readme-ov-file) (RankGPT3.5 is fine enough; there is open-source code with [quickstart code](https://huggingface.co/cross-encoder/nli-deberta-v3-large) that you can try right away)), and (2) an *analysis*, such as analyzing when SRR is effective, when relevance-based rerankers are effective, whether combining the two increases effectiveness, and so on. Without proper comparison results and analysis against standard rerankers, the comments I made in the previous review remain unchanged.
2024.1.30 (R2 comments)
---
While I appreciate the authors' effort, I am affirmative that the reviews of the original draft are fair.
I'll just try to reply to the points that the authors believe are misunderstandings, but which I do not.
I am not reviewing other papers, so I don't think I should consider whether other papers are cherry-picking numbers.
I understand some other methods depend on initial ordering. My comment was "None of the other methods depend on initial ordering under such complexity analysis". Each iteration in the proposed algorithm considers ALL documents, so it is pretty much equivalent to one iteration of existing listwise methods - existing listwise methods depend on initial order due to context length limitations, which is in the outer loop. Thus the authors are missing the point.
The length of the replies themselves warrants a major round of revision, which the reviewer considers a major motivation of the ARR system. I don't fully trust experiments run in a few days, and the numbers need closer examination.
I personally believe fairness is the most important factor for scientific research and I am not comfortable with the current submission.
(Response to R2 comments)
---
**1. Algorithmic complexity and initial ordering**
> None of the other methods depend on initial ordering under such complexity analysis. Each iteration in the proposed algorithm considers ALL documents, so it is pretty much equivalent to one iteration of existing listwise methods - existing listwise methods depend on initial order due to context length limitations, which is in the outer loop. Thus the authors are missing the point.
We respectfully disagree with the reviewer's interpretation of our algorithm's dependency on initial ordering. Previous listwise reranking models using the sliding window approach depend on the initial ordering in the *inner loop*, meaning they are sensitive to ordering at every step of computation. This leads to the positional bias problem [1, 2], which is already discussed extensively in our main paper. We have emphasized thoroughly that our method effectively mitigates this problem, being more robust to positional bias and even partially parallelizable for computation at the same level (please see our recent comments for reviewer Lzgx).
[1] Liu et al., Lost in the Middle: How Language Models Use Long Contexts, 2023
[2] Tang et al., Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models, 2023
**2. Regarding the ARR discussion**
> The length of the replies themselves warrants a major round of revision.
We would like to clarify that the comprehensive nature of our responses is a testament to our commitment to addressing all concerns raised. Most of our responses were about clarifying the reviewer's misunderstandings (e.g., explaining that we did conduct a fair comparison, clarifying our motivations, ...). Minor additions answering the reviewer's concerns were about comparing to more extensive baselines such as GPT-4, pairwise methods on positional bias, and a comparison of GTR-based ListT5 models to RankT5. We also stress that the main results are already presented in the paper, and the conclusions and statements we made remain the same after comparison to the additional baselines.
**3. Fairness in Scientific Research**
> I am not reviewing other papers, so I don't need to consider the way of choosing parameters. I don't fully trust experiments run in a few days and the numbers need closer examination.
We believe we did our best to perform a fair evaluation with respect to the baselines by (1) choosing the baselines' best pretrained checkpoints for zero-shot performance, (2) performing a proper hyperparameter search similar to the baselines, (3) conducting extensive evaluation on the full 18 BEIR datasets, TREC-DL19, DL20, and MS MARCO without any left-out datasets, and (4) aligning with the same baseline setting as RankT5 (GTR).
> I personally believe fairness is the most important factor for scientific research and I am not comfortable with the current submission.
In this aspect, we fail to understand why our submission makes the reviewer uncomfortable. We share the same belief in fairness in scientific research and feel unfairly accused.
Response to the ACs
---
Dear Area Chairs,
We would like to express our gratitude to the reviewers and area chairs for your valuable effort. We are reaching out to kindly ask the ACs to give attention to our concerns regarding the review/discussion provided by reviewer **8G2W**. While we appreciate the time and effort invested by the reviewer, we believe the critique is biased and unfairly critical.
Our paper received 2 positive reviews and one negative review.
The positive reviewers agreed that our work contributes a novel application of the FiD architecture with tournament sort, yielding a computationally efficient, positionally robust, yet effective listwise reranker for zero-shot reranking.
However, reviewer **8G2W** raises concerns about **fairness in scientific research** regarding our submission, with which we and the other reviewers disagree:
> reviewer 8G2W: I personally believe fairness is the most important factor for scientific research and I am not comfortable with the current submission.
We share the same belief in the fairness of scientific research, and we are dedicated to ensuring our manuscript undergoes a fair and unbiased evaluation. Given the comments made by the other reviewers and our extensive efforts in the responses to rightfully address this issue, we feel unfairly accused by this claim.
> reviewer Lzgx: *... Experiments are extensive, including in-domain evaluation on MS MARCO passages, out-of-domain evaluation on BEIR full datasets, as well as ablation study to help readers better understand the design choices in ListT5.*
> reviewer 54fT: *...The authors have thoroughly explained the computational complexity and efficiency in FLOPs ...*
Also, we disagree with the comments regarding the **algorithmic complexity and initial ordering**:
> reviewer 8G2W: ... the proposed method is nowhere near being practical. None of the other methods depend on initial ordering under such complexity analysis ...
We feel that the reviewer is disregarding the **practicality and efficiency of the full line of work that depends on listwise reranking**, even after our efforts to clarify the misunderstandings. Previous listwise reranking models *do* depend on the initial ordering, and their computation is done sequentially. We have explained this in the main paper and in our responses, and reviewer Lzgx also agreed that our method can be partially parallelized for computation at the same level, improving efficiency compared to previous sliding-window based listwise rerankers. We disagree with the claim that our method is *"nowhere near being practical"*, and also feel unfairly accused.
Additionally, the reviewer suggests the length of our replies necessitates a major round of revision. The comprehensive nature of our responses is a testament to our commitment to addressing all concerns reviewer 8G2W raised. All the major claims remain unchanged, and only minor edits are needed to reflect what was discussed and improve the quality of our submission, which we believe is the motivation of the discussion phase. However, the outcome of all our efforts to clarify the misunderstandings has been dismissed with "*I don't fully trust experiments run in a few days*".
Thank you for your time and effort in organizing the discussion period, and we hope this letter raises attention to the ACs about the responses of reviewer **8G2W**.
Sincerely,
Authors of paper 645
2024.1.27 (R3 comments)
---
Thank you for the response. I would like to know more details about how the "rough comparison" is done. Like the hyperparameter for ListT5, RankVicuna also uses the number of passes for the accuracy-efficiency tradeoff. Therefore, it is necessary to specify the setting for each method compared. Moreover, I still think it is necessary to conduct a detailed comparison with the sliding window approach, since tournament sort is the major component of the paper and the sliding window is used in multiple works [1, 2, 3].
Overall, I think I'm convinced about the choice of retriever for negative examples and the novelty of ListT5 in the other comments, but for the efficiency pitch, I still think it is necessary to compare with the sliding window approach. I will be willing to change my scores if this baseline is provided.
[1] Sun, Weiwei, et al. "Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent." arXiv preprint arXiv:2304.09542 (2023).
[2] Ma, Xueguang, et al. "Zero-Shot Listwise Document Reranking with a Large Language Model." arXiv e-prints (2023): arXiv-2305.
[3] Pradeep, Ronak, Sahel Sharifymoghaddam, and Jimmy Lin. "RankVicuna: Zero-shot listwise document reranking with open-source large language models." arXiv preprint arXiv:2309.15088 (2023).
Thank you for the prompt reply! We named it a "rough comparison" since the choice of architecture (Llama-7b vs. T5-base) is very different, and thus the comparison between the sorting methods would not be very precise.
---
(Response to R3 comments)
---
At the reviewer's request, we have analyzed the efficiency of tournament sort vs. the sliding window approach in a more detailed setup. For a reliable and realistic comparison, we use the recently released [official repository for RankVicuna](https://github.com/castorini/rank_llm) for evaluating the FLOPs and latency of listwise + sliding window approaches.
**[Implementation Details]**
- We rerank top-10 passages from BM25 top100 candidate passages, for 43 queries of TREC-DL19.
- The latency and FLOPs for tournament sort variants are computed using our ListT5 evaluation code, and the efficiency for the sliding window approach is computed using the official repository for RankVicuna.
- FiD variants of the sliding window approach (idx 5,7 on Table 3) are computed using our ListT5 evaluation code.
- Building on the official code for the sliding window approach by RankVicuna, we attach the FlopsProfiler from DeepSpeed to measure FLOPs using the same library. For a fair comparison in the same setup as the ListT5-3b models, we change the baseline model from `castorini/rank_vicuna_7b_v1` to `lmsys/fastchat-t5-3b-v1.0` and use a window size of $w$=5 and a stride of $s$=2 / 3. We measure the FLOPs & latency (a minimal profiling sketch follows the command below).
- Commands: `CUDA_VISIBLE_DEVICES=0 python ./src/rank_llm/scripts/run_rank_llm.py --model_path=lmsys/fastchat-t5-3b-v1.0 --top_k_candidates=100 --dataset=dl19 --retrieval_method=bm25 --prompt_mode=rank_GPT --context_size=4096 --variable_passages --window_size 5 --step_size [2/3]`
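For reference, here is a minimal, hedged sketch of how DeepSpeed's FlopsProfiler can be wrapped around a single generation call; the exact instrumentation we added inside the RankVicuna reranking loop may differ.

```python
# Minimal sketch (not the exact instrumentation used): wrap one generate()
# call with DeepSpeed's FlopsProfiler and read back the accumulated FLOPs.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from deepspeed.profiling.flops_profiler import FlopsProfiler

model_name = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval().cuda()

prof = FlopsProfiler(model)
inputs = tokenizer("example listwise reranking prompt", return_tensors="pt").to("cuda")

prof.start_profile()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32)
prof.stop_profile()

flops = prof.get_total_flops()  # accumulate this per window / per query as needed
prof.end_profile()
print(f"FLOPs for this call: {flops:,}")
```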
**[Results]**
> Table 1. Latency Comparison between sliding window v.s. tournament sort (ListT5).
| Method | Model | Latency | Latency (w/ parallel.) | NDCG@10 |
| ---------- | --------------------- | ------- | -------------------- | ------- |
| tournament | ListT5-base (r=1) | 430.7 | 287.71 | 71.23 |
| tournament | ListT5-base (r=2) | 765.7 | 627.34 | 71.76 |
| tournament | ListT5-3b (r=1) | 1306.0 | 1024.16 | 71.75 |
| tournament | ListT5-3b (r=2) | 2320.8 | 2042.47 | 71.79 |
| sliding w. | T5-3b (no FiD) (stride 2) | 4085.0 | not parallelizable | - |
| sliding w. | T5-3b (no FiD) (stride 3) | 3136.0 | not parallelizable | - |
Latency is reported in seconds. ListT5, for both (r=1) and (r=2), is generally faster than the sliding window variants.
**Latency vs. Latency (w/ parallel.)** As we have also noted for reviewer 8G2W in w5, for tournament sort we can parallelize computation at the same level (e.g., the leaf nodes) within a query, while the sliding window approach must be computed sequentially (it cannot be parallelized). Latency (w/ parallel.) is measured by parallelizing the leaf-node computation across 20 threads, which lowers the overall time cost.
> Table 2. \# of forward passes to rerank top-k candidates per one query, on w=5
| Sorting Method | # required forward passes to rerank top 1 | # required forward passes to rerank top 10 |
| ---------------------- | ---------------------------------- | -------------------------------- |
| sliding window, stride=1 | [(100 - 5)/1] = 95 | at least 95 * [10/(5-1)] = 285 |
| sliding window, stride=2 | [(100 - 5)/2] = 48 | at least 48 * [10/(5-2)] = 192 |
| sliding window, stride=3 | [(100 - 5)/3] = 32 | at least 32 * [10/(5-3)] = 160 |
| sliding window, stride=4 | [(100 - 5)/4] = **24** | at least 24 * [10/(5-4)] = 240 |
| tournament sort, r=1 | (100/5) + (20/5) + 1 = **25** | 25 + 9 * (1 + 1 + 1) = **52** |
| tournament sort, r=2 | (100/5) + (40/5) + 2 + 1 = **31** | 31 + 9 * (1 + 1 + 1 + 1) = **67** |
We can see that, given a window size of 5, tournament sort is much more efficient for reranking the top-10 passages, requiring far fewer forward passes.
This is because, unlike tournament sort with output caching, the sliding window approach requires re-evaluation over the entire input sequence, depending on the window size, stride, and the number of top-k passages to rerank. A detailed explanation is below:
- After one pass of a sliding window of $w$=5 and stride of 4, we discard 4 passages and only carry (5-4)=1 previous passage to the next step as we move the window. Therefore, we would only be able to correctly order top-1 passages, since the top-2 passage cannot be moved.
- Therefore, to rank top 10 candidates, we have to iterate through at least [10/(5-4)] = 10 times, which would result in a total of 240 forward passes.
- The same reasoning applies to the other stride settings.
- In contrast, tournament sort uses output caching: after the initial computation (25 forward passes for r=1 and 31 for r=2), we only need to recompute *one* path from leaf to root, which costs only one additional forward pass per level of the tournament tree, i.e., 3 for (r=1) and 4 for (r=2).
- Therefore, by using tournament sort, we can efficiently reduce the number of forward passes needed to rank the top 10 candidates (see the sketch after this list).
- Please refer to Fig.8 and Fig.4 on our main paper for detailed illustrations.
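To make the counts above easy to check, here is a small Python sketch that reproduces the forward-pass numbers in Table 2. The formulas are our reading of the table (one sliding-window sweep costs ⌈(n-w)/s⌉ calls and fixes only the top (w-s) positions; in tournament sort, each leaf call keeps its top-r passages, upper-level calls keep their top-1, and every additional pop re-runs one call per tree level); it is an illustration, not code from the released repository.

```python
import math

def sliding_window_passes(n=100, w=5, s=2, k=10):
    """Lower bound on forward passes for sliding-window reranking (Table 2)."""
    per_sweep = math.ceil((n - w) / s)   # calls in one sweep over n candidates
    sweeps = math.ceil(k / (w - s))      # each sweep fixes only the top (w - s)
    return per_sweep * sweeps

def tournament_passes(n=100, w=5, r=1, k=10):
    """Forward passes for tournament sort with output caching (Table 2)."""
    calls = math.ceil(n / w)             # leaf level
    total, pool, levels = calls, calls * r, 1
    while pool > w:                      # upper levels keep top-1 per call
        calls = math.ceil(pool / w)
        total, pool, levels = total + calls, calls, levels + 1
    total += 1                           # root call yields the top-1 passage
    levels += 1
    return total + (k - 1) * levels      # one leaf-to-root path per extra pop

if __name__ == "__main__":
    for s in (1, 2, 3, 4):
        print(f"sliding window, stride={s}: {sliding_window_passes(s=s)} passes for top-10")
    for r in (1, 2):
        print(f"tournament sort, r={r}: {tournament_passes(r=r)} passes for top-10")
```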
> Table 3. FLOPs (in a multiple of FLOPs of MonoT5-base) on the choice of architecture and method, on TREC-DL19.
| idx | Base model | Method | Model | FLOPs (top1) | FLOPs (top10) |
| --- | ---------- | --------------------- | ---------------- | ------------ | ------------- |
| 0 | T5-base | pointwise | MonoT5-base | 1x | 1x |
| 1 | T5-base | tournament | ListT5-base(r=1) | 1.3x | 2.6x |
| 2 | T5-base | tournament | ListT5-base(r=2) | 1.8x | 4.7x |
| 3 | T5-3b | tournament | ListT5-3b (r=1) | 17.6x | 36.3x |
| 4 | T5-3b | tournament | ListT5-3b (r=2) | 24.6x | 66.0x |
| 5 | T5-3b | sliding w. (stride 2) | T5-3b (FiD) | 38.5x | 154x |
| 6 | T5-3b | sliding w. (stride 2) | T5-3b (no FiD) | 53.8x | 215.1x |
| 7 | T5-3b | sliding w. (stride 3) | T5-3b (FiD) | 25.6x | 128x |
| 8 | T5-3b | sliding w. (stride 3) | T5-3b (no FiD) | 35.1x | 175.6x |
For the computation of FLOPs (top10), we multiplied by 4 for the (stride=2) and 5 for the (stride=3) variants of the sliding window (the number of passes needed to rerank the top 10). Please refer to the previous table (Table 2) for the computation process.
**T5 (FiD) v.s. T5 (no FiD)** ListT5 uses the FiD architecture to effectively mitigate the positional bias problem and handle long inputs efficiently.
- (L.362) *...By using the FiD architecture, encoder self-attention is efficiently computed only within each pair of query and passage. ...*
Comparing idx 5 vs. 6 (38.5x vs. 53.8x) and idx 7 vs. 8 (25.6x vs. 35.1x), we can see that using FiD results in a lower total number of FLOPs.
**ListT5-tournament sort v.s. T5-3b (FiD)-sliding window** We can clearly see that the FLOPs to rerank top 10 candidates are much lower for both (r=1) and (r=2) variants of tournament sort (36.3x and 66.0x), compared with the FiD variant of sliding window, for both (stride=2) and (stride=3) (154x and 128x).
Thank you for requesting the comparison between the sliding window approach and tournament sort; we will add the results in our next revision. If you consider these experiments insufficient, please specify any other experiments you would like to see. Thank you!
---
## 1. Clarification of our efficiency contribution
First of all, we would like to clarify one point about our contribution of efficiency of ListT5:
> **ListT5 (FiD + tournament sort)** is much more efficient than corresponding **pairwise** methods (DuoT5), previous **LLM-based listwise** rerankers, and comparable with **pointwise** methods (MonoT5).
- We had no intention of directly comparing tournament sort vs. the sliding window (and arguing for its effectiveness);
- Rather, we have focused on the inefficiency coming from the large parametric size (see the introduction) of previous listwise ranking models, and claim the efficiency of ListT5 from the perspective of a **combination of (1) model parameter size and (2) sorting method**.
- In other words, we have consistently emphasized the comparison between **ListT5 + tournament sort vs. LLM (RankVicuna, GPT-3.5, GPT-4, ...) + sliding window.**
Below is a recap of the efficiency claims stated in our paper:
- *L.12- (abstract): ... has an efficiency comparable to pointwise ranking models and surpasses the efficiency of previous listwise ranking models, ...*
- *L.52- (introduction): ... On the other hand, Large Language Models (LLMs) that use listwise reranking (Pradeep et al., 2023; Sun et al., 2023b,a; Ma et al., 2023) sacrifice efficiency **in another aspect** due to large parametric model size, ...*
- *L.366- (5.1 performance and efficiency) By using the tournament tree structure with output caching, ListT5 is much more efficient over previous listwise ranking models and comparable with pointwise ranking models. ...*
From the main paper, it seems that the paragraph from L.366 may have caused confusion for the reviewer, suggesting that we directly argue for the effectiveness of tournament sort over the sliding window approach. The paragraph should be rewritten as follows: *By using FiD with tournament sort and output caching, we were able to successfully apply listwise reranking to even small-sized models such as T5-base, which is much more efficient than previous LLM-based listwise ranking models and comparable with pointwise ranking models ...*
- Thank you for pointing this out; we will add this clarification in our next revision.
## 2. Why not ListT5 + sliding window?
As the reviewer mentioned, the sliding window approach is commonly used in previous listwise ranking LLMs. However, we have chosen the (hierarchical) tournament sort approach over the sliding window approach for ListT5 due to (1) its adaptability to a window size of 5, and (2) its partial parallelizability, explained in detail below:
1. Direct application of the sliding window approach with $w$=5 was not well-defined.
- Through ablation experiments, we have found out (Appendix Table 6) that seeing 5 passages at once is better than seeing 10+ passages at once for ListT5.
- However, effectively applying the sliding window approach for listwise ranking models with a window size $w$ = 5 smaller than $k$ = 10 is not well-defined.
- Previous listwise ranking methods typically adopt a single setup: w=20 with stride=10. It was not well-defined what the best way is to apply the sliding window approach for ranking k=10 with w=5 (ListT5).
- To apply it, we would need multiple iterations (passes) to re-rank the top-10 passages with a window size smaller than 10.
- We didn't want to iterate multiple passes of the sliding window approach to re-rank 10 passages with ListT5, which sees 5 passages at once, since it would sacrifice efficiency.
- From the following experiment, we show that the **efficiency of tournament sort** is **comparable or better** than the efficiency of applying the **sliding window in multiple passes**.
- We didn't want the selection of window size and stride to be dependent on the number of top-$k$ passages we want to re-rank.
- Instead, we apply tournament sort, a sorting algorithm which can hierarchically rank any top-$k$ passages with any $w$.
2. Tournament sort is partially parallelizable.
- The sliding window approach is sequentially defined (the next input depends on the last prediction output). Therefore, for each query, reranking **cannot be computed in parallel**.
- In contrast, tournament sort can compute all inferences at the same level *at once*. For example, we can parallelize the computation at the leaf nodes, which accounts for the $O(n)$ portion of the $O(n + k\log n)$ complexity. We believe being able to compute in a batchwise manner is important, and this parallelization reduces the overall latency (see the sketch after this list).
- In the following experiment, we show that the latency of ListT5 can be further optimized when we compute the leaf nodes in a batch (size=20).
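Below is a rough, hypothetical sketch of this leaf-level parallelization; `score_windows` is a placeholder for a batched ListT5 forward pass over (query, 5-passage) windows and is not the released evaluation code.

```python
from typing import Callable, List, Sequence, Tuple

def rank_leaf_level(query: str,
                    passages: Sequence[str],
                    score_windows: Callable[[List[Tuple[str, List[str]]]], List[List[int]]],
                    w: int = 5,
                    r: int = 1,
                    batch_size: int = 20) -> List[str]:
    """Run all leaf-node windows of the tournament tree in batches.

    Unlike the sliding window approach, leaf windows do not depend on each
    other's outputs, so `batch_size` of them can share one forward pass.
    Returns the top-r winners of every leaf window (the pool for the next level).
    """
    windows = [list(passages[i:i + w]) for i in range(0, len(passages), w)]
    winners: List[str] = []
    for start in range(0, len(windows), batch_size):
        chunk = windows[start:start + batch_size]
        # one batched forward pass scores len(chunk) windows at once
        orderings = score_windows([(query, win) for win in chunk])
        for win, order in zip(chunk, orderings):
            winners.extend(win[j] for j in order[:r])  # keep top-r per window
    return winners
```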
## Efficiency results on TREC-DL19
On the reviewer's request, we provide the efficiency comparison between the sliding window approach and tournament sort.
We investigate the following methods:
- pointwise (MonoT5) / pairwise (DuoT5) / listwise + tournament sort (ListT5) / listwise + sliding window (T5-3b, RankVicuna)
For each method, we measure the following:
- performance / FLOPs / latency without parallelization / latency with parallelization (batch size of 20).
**Implementation Details**
For a fair and realistic comparison, we use the recently released [official repository for RankVicuna](https://github.com/castorini/rank_llm) for the evaluation of listwise + sliding window approaches (idx 7, 8). To the official code, we attach the FlopsProfiler from DeepSpeed to measure FLOPs using the same library. We use `castorini/rank_vicuna_7b_v1` for RankVicuna, and for a fair comparison with ListT5-3b, we replace the model with `lmsys/fastchat-t5-3b-v1.0`, run with the same setup ($w$=20, max_length 4096, stride=10, ...) as RankVicuna, and measure only the FLOPs & latency (idx 7).
Commands: `CUDA_VISIBLE_DEVICES=0 python ./src/rank_llm/scripts/run_rank_llm.py --model_path=castorini/rank_vicuna_7b_v1 --top_k_candidates=100 --dataset=dl19 --retrieval_method=bm25 --prompt_mode=rank_GPT --context_size=4096 --variable_passages`
The results are as follows:
> Efficiency results on reranking top-10 passages from bm25 top100 for 43 queries of TREC-DL19.
| idx | Base model | Method | Model | FLOPs | Latency (s) | Latency (s, w/ bsize=20) | NDCG@10 |
| --- | ---------- | ----------- | ------------------ | ----- | ----------- | ------------------------ | ------- |
| 1 | T5-base | pointwise | MonoT5-base | 1x | 206.5 | 32.5 | 71.48 |
| 2 | T5-base | pairwise | DuoT5-base(top50) | 24.5x | 2285.1 | 702.82 | 71.37 |
| 3 | T5-base | tournament | ListT5-base(r=1) | 2.6x | 430.7 | 287.71 | 71.23 |
| 4 | T5-base | tournament | ListT5-base(r=2) | 4.7x | 765.7 | 627.34 | 71.76 |
| 5 | T5-3b | tournament | ListT5-3b (r=1) | 36.3x | 1306.0 | 1024.16 | 71.75 |
| 6 | T5-3b | tournament | ListT5-3b (r=2) | 66.0x | 2320.8 | 2042.47 | 71.79 |
| 7 | T5-3b | sliding w. | T5-3b + slid. win. | 23.6x | 3544.0 | not applicable | - |
| 8 | Llama-7b | sliding w. | RankVicuna-7b | 55.5x | 2451 | not applicable | 71.40 |
1. (Main claim in our paper) ListT5-base models (idx 3, 4) are far more efficient than pairwise methods (idx 2) and previous LLM-based listwise rerankers (idx 8).
2. Given the same model structure of T5-3b, the efficiency of tournament sort (idx 5, 6) is comparable to or better than that of the sliding window approach (idx 7).
- Compared with the sliding window approach with window size = 5 and step size = 2 (idx 7), ListT5 (r=1) is more efficient than the sliding window approach run with the same window size.
- FLOPs of ListT5-3b (r=1) and (r=2) are 36.3x and 66.0x.
- FLOPs of the T5-3b + sliding window approach are 23.6x.
3. Generally, ListT5 + tournament sort has lower latency, and this can be further optimized by batchwise computation.
- Asymptotic complexity: according to Qin et al. [1], listwise reranking using the sliding window has a complexity of $O(n)$, while ListT5 has a time complexity of $O(n + k\log n)$.
Additionally, we think that the efficiency of ListT5 + tournament sort has not yet been explored heavily and can be greatly optimized (L.516, limitations section), by efficient batching (currently, batchwise computation is only implemented at the leaf level; parallel computation could be extended to every level) and by implementing early stopping for (r=2). We will include the above results in the next revision.
If you consider these experiments insufficient, please specify any other experiments you would like to see. Thank you!
[1] [Qin et al., Large language models are effective text rankers with pairwise ranking prompting, 2023.](https://arxiv.org/pdf/2306.17563.pdf)
---
Left out: latency comparison with APIs.
| | MonoT5 | DuoT5 (top 50) | ListT5 (r=1) | RankGPT(gpt3.5) | RankGPT(gpt4) | RankVicuna |
| -------------- | ------ | -------------- | ------------ | --------------- | ------------- | --- |
| Latency (sec.) | 769 | 21,344 | 2,673 | 7,128 (API) | 20,736 (API) | 29,607 |
| Multiple of: | 1x | 27.8x | 3.5x | - | - | - |
page.19, Appendix Section H
*... Since the experiments with ChatGPT and GPT-4 are conducted using the OpenAI API, the running time is contingent on the OpenAI service, e.g., API latency. Besides, the running time can also vary across different API versions and network environments. In our testing conditions, the average latency for API calls for gpt-3.5-turbo and gpt-4 was around 1.1 seconds and 3.2 seconds, respectively. Our proposed sliding window-based permutation generation approach requires 10 API calls per query to re-rank 100 passages. Consequently, the average running time per query is 11 seconds for gpt-3.5-turbo and 32 seconds for gpt-4.*
2024.1.26
---
---
R1
---
Thank you for your valuable comments to improve the paper. We have additionally compared with DuoT5 and conducted consistency experiments for LLMs on the positional bias experiments. We hope our response offers sufficient evidence for your questions, and that you will consider increasing the overall assessment of our paper!
### w1. Necessary for comparison with DuoT5 for positional bias
**We provide positional bias experiments on DuoT5 (without swapping orders) below and show that pairwise models also exhibit positional bias, even higher than that of ListT5.**
- For a general performance comparison between DuoT5 and ListT5, please refer to Table 3.
- Pairwise models also have positional bias problems (shown below by experiment).
- Thank you for your suggestion. We have measured the positional bias of DuoT5 (a pairwise model) on FiQA (the same data as in the paper) and TREC-COVID, using the same data as the original experiment.
- We split the original groups of [1 positive, 4 negatives] into 4 groups of [1 positive, 1 negative] and measure the accuracy and agreement ratio before and after swapping the order (a sketch of the metric computation follows this list).
- We found that pairwise models also exhibit positional bias, with a tendency to label the passage that comes first as positive.
- ListT5 exhibits less positional sensitivity than DuoT5. We will include this result in the next revision.
- Theoretically, we can remove positional sensitivity by evaluating all possible permutations of the input passages ([Wiki](https://en.wikipedia.org/wiki/U-statistic), [Murphy et al.](https://proceedings.mlr.press/v97/murphy19a.html), [Yarotsky](https://arxiv.org/abs/1804.10306)). DuoT5 and other pairwise ranking methods [1] already swap orders and aggregate scores from both orderings. To measure the positional bias of pairwise models, in this experiment we removed the averaging over swapped orders in DuoT5.
- However, the number of forward passes needed to mitigate positional bias in this way grows factorially, making it impractical to apply to listwise methods (we would need 5! = 120 forward passes to remove the positional bias over 5 passages).
- In contrast, we effectively mitigate the positional bias problem, without any additional forward passes, with the Fusion-in-Decoder architecture.
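As a reference for how the two metrics above (per-position accuracy with its standard deviation, and the agreement ratio) are computed, here is a small illustrative helper; the function and its input layout are our own sketch, not the released evaluation code.

```python
import numpy as np

def positional_bias_metrics(chosen, gold):
    """chosen[p][i]: id of the passage the model picks for sample i when the
    gold passage is placed at position p; gold[i]: id of the gold passage."""
    n_pos, n = len(chosen), len(gold)
    # (1) accuracy for each gold-passage position, and its standard deviation ("Std.")
    acc = [100 * np.mean([chosen[p][i] == gold[i] for i in range(n)])
           for p in range(n_pos)]
    std = float(np.std(acc))
    # (2) agreement ratio ("Points to same passage"): fraction of samples where
    # the model picks the same passage no matter where the gold one is placed
    agree = 100 * np.mean([len({chosen[p][i] for p in range(n_pos)}) == 1
                           for i in range(n)])
    return acc, std, agree
```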
> FiQA
>
| | GPT-3.5-turbo-0301 | DuoT5-base(pairwise) | ListT5-base |
| ---------------------- | ------------------ | -------------------- | ----------- |
| (1) Accuracy: | | | |
| 1 | 90.9 | 89.9 | 85.3 |
| 2 | 80.4 | 76.9 | 85.6 |
| 3 | 78.5 | | 82.2 |
| 4 | 63.8 | | 83.3 |
| 5 | 64.3 | | 82.6 |
| Std. | 10.3 | 6.5 | **1.4** |
| (2) Agreement ratio: | | | |
| Points to same passage | 60.5 | 78.1 | **90.4** |
| other | 39.5 | 21.9 | 9.5 |
> TREC-COVID
| | GPT-3.5-turbo-1106 | DuoT5-base(pairwise) | ListT5-base |
| ---------------------- | ------------------ | -------------------- | ----------- |
| (1) Accuracy: | | | |
| 1 | 81.6 | 91.3 | 93.9 |
| 2 | 63.3 | 76.0 | 87.8 |
| 3 | 75.5 | | 83.7 |
| 4 | 67.3 | | 85.7 |
| 5 | 61.2 | | 81.6 |
| Std. | 7.7 | 7.7 | **4.2** |
| (2) Agreement ratio: | | | |
| Points to same passage | 55.1 | 79.6 | **83.7** |
| other | 44.9 | 20.4 | 16.3 |
* GPT-3.5-turbo-0301 is no longer accessible, so we experimented with GPT-3.5-turbo-1106 for TREC-COVID.
[1] Qin et al., LARGE LANGUAGE MODELS ARE EFFECTIVE TEXT RANKERS WITH PAIRWISE RANKING PROMPTING, 2023.
---
### c1. Regarding position bias - have you encountered inconsistent results from API calls even when input prompts are the same?
**Repeated experiments for RankGPT-3.5 show that, as you expected, the numbers are not exactly the same across runs, but the difference is negligible and doesn't change the overall trend.**
- Thank you for the suggestion. We ran the positional bias experiment (on GPT-3.5) on TREC-COVID 3 times to investigate whether, given the same input, LLMs generate the same output across runs. We will include the result with code for replication in our next revision.
- The results from API calls were not always exactly the same, but the difference was negligible and does not change the main claims.
- For the case of ListT5, our method is deterministic (and thus fully replicable). On running the same experiments on FiQA twice, we validated that the output files are exactly the same.
> TREC-COVID
| | GPT-3.5-turbo-1106 (Trial 1) | (Trial 2) | (Trial 3) | (Average) | ListT5-base |
| ---------------------- | ------- | ------- | ------- | ------- | ----------- |
| (1) Accuracy: | | | | | |
| 1 | 81.6 | 79.6 | 81.6 | 81.0 | 93.9 |
| 2 | 63.3 | 63.3 | 61.2 | 62.6 | 87.8 |
| 3 | 75.5 | 75.5 | 75.5 | 75.5 | 83.7 |
| 4 | 67.3 | 63.3 | 66.0 | 66.0 | 85.7 |
| 5 | 61.2 | 63.3 | 63.3 | 63.3 | 81.6 |
| Std. | 7.7 | 7.1 | 7.4 | 4.2 | **4.2** |
| (2) Agreement ratio: | | | | | |
| Points to same passage | 55.1 | 55.1 | 55.1 | 55.1 | **83.7** |
| other | 44.9 | 44.9 | 44.9 | 44.9 | 16.3 |
---
R2
---
We thank the reviewer for the request to improve the fairness of the evaluation. We have additionally conducted various experiments to support our claims and added clarifications for any misunderstandings the reviewer has. In particular, we have provided **the results of ListT5-GTR** and show that ListT5-GTR still outperforms previous baselines, which was one of the major concerns in the comments. We hope our response offers sufficient evidence for your questions, and that you will consider increasing the overall assessment of our paper!
### w1. Evaluation of MonoT5 and RankT5 with ListT5 is not fair, since models are directly downloaded online
**It is conventional to evaluate models (especially those that don't have replicable training code) with pre-released checkpoints.**
- We have done our best to construct a fair setup with the baseline models.
- Replication of the baseline models was not possible due to the following reasons:
- MonoT5 has replicability issues [mentioned by the MonoT5 author](https://github.com/castorini/pygaggle/pull/308#issuecomment-1333502554), getting lower scores on BEIR with PyTorch.
- RankT5 has no open-sourced training code available from the authors, and important hyperparameter details are missing for replication. Please see the next paragraph.
### w2. Use of COCO-DR for negative data sampling, while RankT5 uses GTR.
**Unfortunately, exact application of the RankT5 setup to ListT5 was not possible for the main paper due to the following reasons.**
- No information about the parameter size of the GTR model
- Unclear whether the model they use is GTR or GTR-FT, and if it is GTR-FT, no open-sourced checkpoints are available
- Unclear whether they used only ‘train’ or both ‘train’ and ‘dev’ subset of MS MARCO for training RankT5.
- See the mention from the RankT5 paper:
- *…5.1 Data sets / MS MARCO … We use a dual-encoder retriever (Ni et al., 2022) fine-tuned on MS MARCO to retrieve the top-1000 passages for each query in both the “train” and “dev” partitions as the candidate documents. …*
- Different number of negatives - RankT5 samples 35 negative passages per query from the top-1000 retrieval results, while ListT5 only sees 4 negatives at once.
- We would like to clarify that we used COCO-DR only for negative sample selection from the MS MARCO training set (it is not used or trained anywhere with BEIR). Regarding the MS MARCO dev set performance, GTR-XL achieves an NDCG@10 of 43.9 while COCO-DR achieves 42.4; COCO-DR actually scores lower than GTR-XL on MS MARCO.
**Even though exact replication of GTR was not possible, we conducted an approximate but fair experiment training ListT5 on negatives sampled from `sentence-transformers/gtr-t5-xl`, and found that *ListT5 trained with GTR still outperforms RankT5* (the final conclusions are unchanged).**
- All other configurations stay the same except we change the COCO-DR-large model to GTR-XL.
- (L.267) We train T5-base model with learning rate of 1e-04, linear learning rate scheduler with 10% warmup, effective batch size 256, maximum input length 230, maximum output length 7.
- Although the original ListT5-base model was trained for 20k steps, we were only able to train this model for 10k steps due to the limited time of the discussion period. (This is the only hyperparameter that is changed.)
- From the experiments, we can see that currently, GTR does not provide as good an ordering of negatives as COCO-DR. However, we would like the reviewers to recognize that this experiment was done in one pass within a short time frame and we were unable to do any hyperparameter tuning. Due to limited time, we were also only able to train ListT5-GTR for 10k steps, which is half of what ListT5-COCO-DR was trained for. We think the results may improve with proper hyperparameter tuning (a rough sketch of the GTR negative mining is given after this list).
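For clarity, here is a rough sketch, under our assumptions, of how hard negatives can be mined with `sentence-transformers/gtr-t5-xl` for an MS MARCO training query; the actual sampling pipeline (candidate pool size, filtering, ordering) may differ.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/gtr-t5-xl")

def mine_gtr_negatives(query, candidate_passages, positive_ids, k=4):
    """Return indices of the k highest-scoring candidates that are not positives."""
    q_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    p_emb = encoder.encode(candidate_passages, convert_to_tensor=True,
                           normalize_embeddings=True)
    scores = util.dot_score(q_emb, p_emb)[0]   # GTR embeddings are compared by dot product
    ranked = scores.argsort(descending=True).tolist()
    return [i for i in ranked if i not in positive_ids][:k]
```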
> Ablation experiments - NDCG@10 on the full BEIR subset except CQADupStack (we could not evaluate CQADupStack in time). ListT5 models are (r=2) variants.
| Dataset | BM25 (init) | MonoT5 | RankT5 | ListT5 - COCO-DR | ListT5 - GTR |
|----------------|-------------|--------|--------|------------------|--------------|
| TREC-COVID | 59.5 | 78.3 | 77.7 | 78.3 | 78.6 |
| NFCorpus | 32.2 | 35.7 | 35.1 | 35.6 | 36.2 |
| BioASQ | 52.2 | 55.3 | 58.2 | 56.4 | 55.1 |
| NQ | 30.5 | 52.1 | 53.2 | 53.1 | 52.8 |
| HotpotQA | 63.3 | 71.2 | 72.8 | 72.6 | 71.9 |
| FiQA-2018 | 23.6 | 39.2 | 39.2 | 39.6 | 39.9 |
| Signal-1M (RT) | 33.0 | 32.0 | 30.8 | 33.5 | 32.9 |
| TREC-NEWS | 39.5 | 48.0 | 45.4 | 48.5 | 48.0 |
| Robust04 | 40.7 | 53.4 | 54.3 | 52.1 | 52.7 |
| Arguana | 40.8 | 34.4 | 35.5 | 48.9 | 43.6 |
| Touche-2020 | 44.2 | 29.6 | 37.1 | 33.4 | 32.7 |
| CQADupStack | 30.0 | 38.6 | 37.0 | 38.8 | - |
| Quora | 78.9 | 84.6 | 83.3 | 86.4 | 83.9 |
| DBPedia | 31.8 | 42.8 | 43.7 | 43.7 | 43.6 |
| SCIDOCS | 14.9 | 16.7 | 16.8 | 17.6 | 17.0 |
| FEVER | 65.2 | 78.4 | 77.6 | 79.8 | 79.4 |
| Climate-FEVER | 16.5 | 23.1 | 21.2 | 24.0 | 24.8 |
| SciFact | 67.9 | 73.1 | 73.5 | 74.1 | 73.1 |
| avg (w/o CQADupStack) | 43.2 | 49.9 | **50.3** | **51.6** | **51.0** |
Thank you for raising an important question and we will include this ablation experiment in the next revision.
---
### w3. Hyperparameter tuning is cherry picked on evaluation metrics
**We believe our investigated learning rate and number of steps are conventional and within an acceptable range for empirical ML research papers, and competitors (MonoT5 and RankT5) have also properly tuned their hyperparameters similar to ours.**
- MonoT5-10k is trained on the “seemingly arbitrary” first 640k lines from triples.train.small.tsv. ([reference](https://github.com/castorini/pygaggle/issues/222#issuecomment-926790242))
- According to the official guidelines, `castorini/monot5-base-msmarco-10k` is known to perform better in zero-shot settings than the `castorini/monot5-base-msmarco` model, so among the possible variants, we compared our model with the 10k version to make the comparison as fair as possible.
- According to the [checkpoint metadata](https://github.com/google-research/google-research/tree/master/rankt5), the step size of the released RankT5-base model is 1049900, large model 1050700, and 3B model 1050000.
- There is no mention of how these numbers were picked in the paper. We wonder if the reviewer also considers the hyperparameter setups of MonoT5 and RankT5 to be cherry-picked magic numbers.
- Specifically, we experimented via grid search over a few points, investigating learning rates of 1e-05 and 1e-04 and step sizes of 20000 and 25000 for T5-base and 3000 and 8000 for T5-3b. For T5-3b, we needed to reduce the input length due to resource limitations. (Limitations section: L.526)
- Lastly, we want to emphasize that **ListT5 was extensively evaluated on diverse datasets** and different first-stage retrieval methods (BM25/COCO-DR), including the **full** BEIR subset (18 datasets), MS MARCO-dev, and TREC-DL19/DL20, **to explicitly avoid being seen as cherry-picking** and to ensure the reliability of the overall effectiveness of ListT5 in zero-shot scenarios.
---
### w4. Motivation is unclear about its effectiveness in zero-shot scenarios. Why compare ListT5 with MonoT5 and RankT5?
**The motivation of our work is clearly stated in the Introduction of our main paper, L.37 - L.46 (see below), and supported by Ma et al., 2023 [1].**
- *... Recently, listwise reranking, which evaluates multiple passages together, is gaining attention for its effectiveness in zero-shot retrieval. Listwise rerankers can condition on and compare multiple passages to calibrate the relevance scores better, and can reduce the inaccuracy of predictions arising from domain shift, as theoretically supported by Xian et al. (2023), and empirically supported by a line of works such as Ma et al. (2023); Sun et al. (2023a). …*
**MonoT5, DuoT5, RankT5, and RankGPT are chosen as our baseline models since they are known to perform very competitively on BEIR, providing a fair comparison with ListT5 among the models we can replicate.**
- To ensure a fair comparison: MonoT5, DuoT5, and RankT5 are built on top of the same pretrained model architecture (T5).
- To focus on proving the effectiveness of ListT5 (listwise reranking using FiD) compared to pointwise, pairwise, and listwise (RankGPT) ranking baselines.
- MonoT5 and RankT5 show state-of-the-art performance on the BEIR benchmark, ranking #1 on the [official BEIR leaderboard](https://eval.ai/web/challenges/challenge-page/1897/leaderboard).
- They are also fully replicable with released checkpoints.
- The effectiveness of RankT5 to BEIR is clearly mentioned at the abstract of the RankT5 paper:
- *… the ranking model appears to have better zero-shot ranking performance on out-of-domain data sets compared to the model fine-tuned with classification losses. …*
- To the best of our knowledge, we are the first to successfully apply listwise reranking using FiD to small-sized models such as T5. We believe that showing performance gains with respect to RankT5 is sufficient to show the effectiveness of our method for zero-shot retrieval.
- We have also compared ListT5 with listwise rerankers (RankGPT, and RankVicuna for FLOPs) and shown that RankVicuna costs 107 times more FLOPs than ListT5-base (L.359), with lower performance on BEIR (RankGPT, see Table 3).
- Unfortunately, replicating and comparing with the Xian et al. (2023) paper [2] was not possible since it does not include any code or released checkpoints that we could use.
[1] Ma et al., Zero-Shot Listwise Document Reranking with a Large Language Model
[2] Xian et al., Learning List-Level Domain-Invariant Representations for Ranking
---
### w5. Claims of efficiency are not convincing
**We notice the reviewer's misunderstandings (M1 to M4) regarding the practicality and efficiency of the full line of work that depends on listwise reranking [1, 3, 4, 5, 6, 7, 8, 9]. In response, we aim to clarify and rectify each misconception.**
**M1: None of the other methods depend on initial ordering**
- The full line of work built on listwise reranking (the sliding window approach [1, 4, 5, 6, 7, 8, 9], pairwise ranking prompting [3]) ***depends on*** the initial ordering.
**M2: All other methods can be parallelized while the proposed method is iterative**
- All listwise ranking methods that use the sliding window approach [1, 3, 4, 5, 6, 7, 8] ***cannot*** be parallelized, because the input to the sliding window at a given time step depends on the results obtained from all previous time steps.
- Our approach is actually more parallelizable than the sliding window approach. This is because we ***can*** parallelize the computation at the leaf level of tournament sort, and critically, this accounts for the majority of the computation (the $O(n)$ portion of the $O(n + k\log n)$ complexity), which is already implemented in our code.
**M3: The proposed method is way less practical than other methods on latency**
- We provide the asymptotic complexity and FLOPs of ListT5, and show that ListT5 is ***much more practical*** than DuoT5 and similar to MonoT5.
- We additionally report the latency of evaluating FiQA-2018, which aligns with the real-time FLOPs comparison, in Fig.5.
> Latency comparison on bm25 top100 FiQA, fully batched on 1x A6000 48G.
| | MonoT5 | DuoT5 (top 50) | ListT5 (r=1) |
| -------------- | ------ | ------ | ------------ |
| Latency (sec.) | 769 | 21,344 | 2,673 |
| Multiple of: | 1x | 27.8x | 3.5x |
**M4: If initial ordering is given and an iterative method is allowed, there are way more efficient methods for, e.g., pairwise approaches.**
- We have already emphasized throughout our paper and in our comments (Table 1, Figure 5) that ***ListT5 is 10 times (25.5 / 2.57) more efficient*** than pairwise approaches (from the numbers in Fig.5).
[3] Qin et al., LARGE LANGUAGE MODELS ARE EFFECTIVE TEXT RANKERS WITH PAIRWISE RANKING PROMPTING, 2023
[4] Pradeep et al., RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!, 2023
[5] Sun et al., Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, 2023
[6] Pradeep et al., RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models, 2023
[7] Adeyemi et al., Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages, 2023
[8] Tang et al., Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models, 2023
[9] Sun et al., Instruction Distillation Makes Large Language Models Efficient Zero-shot Rankers, 2023
---
### w6. The GPT-4 based method outperforms the ChatGPT-based method by a large margin, whereas this submission only compares to ChatGPT, which is cherry-picking baselines.
**We think comparison with GPT-4 is not mandatory due to its high API cost and unknown parameter size, but we have included the results at the reviewer's request. The results show that ListT5 is more competitive than GPT-4 in both positional bias and overall accuracy.**
- For comparison with GPT-4, we think using a stronger first-stage retriever with ListT5-3b is acceptable. We have additionally evaluated reranking with the ListT5-3b model using SPLADE++ED (`naver/splade-cocondenser-ensembledistil`) top100 as the first-stage retriever, which is also the convention in the RankZephyr paper [3]. We found that our method performs **competitively with (better than) both RankZephyr and GPT-4** on the reported BEIR benchmark.
> NDCG@10 comparison with RankGPT (GPT-4), RankZephyr, and ListT5. Initial retrieval by SPLADE++ED top100 for RankZephyr and ListT5. We provide and compare against all reported baseline numbers from the RankGPT [4] and RankZephyr [3] papers.
| | RankGPT(GPT4) | ListT5-3b (r=2) | RankZephyr-7b (best reported) |
| ----------- | ------------- | --------------- | ----------------------------- |
| TREC-COVID | 85.51 | 85.76 | 85.66 |
| NFCorpus | 38.47 | 38.38 | |
| Signal-1M | 34.4 | 29.88 | |
| TREC-NEWS | 52.89 | 54.57 | 51.07 |
| Robust04 | 57.55 | 62.47 | |
| Touche-2020 | 38.57 | 31.17 | |
| DBPedia | 47.12 | 50.81 | |
| SciFact | 74.95 | 76.67 | |
| Average | 53.68 | **53.71** | |
---
### w7. Lack of baselines for the positional bias experiment.
**7-1. GPT-4?**
**We also report the positional bias of GPT-4 and DuoT5 and show that ListT5 is more robust to positional bias than GPT-4.**
- We have measured the positional bias of DuoT5 (a pairwise model) on FiQA (the same data as in the paper) and TREC-COVID, using the same data as the original experiment.
- We split the original groups of [1 positive, 4 negatives] into 4 groups of [1 positive, 1 negative] and measure the accuracy and agreement ratio before and after swapping the order.
- Also, note that the point of this experiment is to compare robustness to input ordering changes (i.e., to measure the standard deviation and agreement ratio); the individual accuracy (where GPT-4 is better than ListT5-base on FiQA and worse on TREC-COVID) is not directly related to the downstream reranking accuracy.
- It additionally cost $195.8 (a total of 6,523,910 input / 1,298 output tokens) to run the GPT-4 experiment on FiQA. We will include the following results in the next revision.
The results show that although GPT-4 is more robust to the positional bias problem than GPT-3.5, ListT5 is more robust to positional bias than GPT-4.
**7-2. Only FiQA ?**
**Additional experiments on TREC-COVID shows similar results with respect to the FiQA results.**
- FiQA was selected because it is one of the datasets with a sufficient number of queries (648), a small number of positive passages per query, and a passage length of 512, where 5 passages (512*5=2,560 tokens) fit into the context window of GPT-3.5-turbo. (Details in Appendix E)
- We believe that investigating all positive queries from the full FiQA dataset, consisting of 818 * 5 = 4090 samples with different orderings, was sufficient to show the positional bias of conventional listwise models versus ListT5.
- However, at the reviewer's request, we also provide additional experiments on TREC-COVID. TREC-COVID has an average of 493 positive passages per query, so we sample one positive passage for each query and conduct the positional bias experiments (all other configurations are the same as in the FiQA experiment setup).
> FiQA positional bias experiment
| | GPT-3.5-turbo-0301 | GPT-4-0613 | DuoT5-base | ListT5-base |
| ---------------------- | ------------------ | ---------- | ---------- | ----------- |
| (1) Accuracy: | | | | |
| 1 | 90.9 | 94.6 | 89.9 | 85.3 |
| 2 | 80.4 | 90.5 | 76.9 | 85.6 |
| 3 | 78.5 | 84.4 | | 82.2 |
| 4 | 63.8 | 86.8 | | 83.3 |
| 5 | 64.3 | 84.8 | | 82.6 |
| Std. | 10.3 | 3.9 | 6.5 | **1.4** |
| (2) Agreement ratio: | | | | |
| Points to same passage | 60.5 | 82.8 | 78.1 | **90.4** |
| other | 39.5 | 17.2 | 21.9 | 9.5 |
> TREC-COVID positional bias experiment. Std was calculated using numpy.std.
| | GPT-3.5-turbo-1106 | GPT-4-0613 | DuoT5-base | ListT5-base |
| ---------------------- | ------------------ | ---------- | ---------- | ----------- |
| (1) Accuracy: | | | | |
| 1 | 81.6 | 95.9 | 91.3 | 93.9 |
| 2 | 63.3 | 83.7 | 76.0 | 87.8 |
| 3 | 75.5 | 73.5 | | 83.7 |
| 4 | 67.3 | 77.6 | | 85.7 |
| 5 | 61.2 | 71.4 | | 81.6 |
| Std. | 7.7 | 8.8 | 7.7 | **4.2** |
| (2) Agreement ratio: | | | | |
| Points to same passage | 55.1 | 69.4 | 79.6 | **83.7** |
| other | 44.9 | 30.6 | 20.4 | 16.3 |
* GPT-3.5-turbo-0301 is no longer accessible, so we experimented with GPT-3.5-turbo-1106 for TREC-COVID.
**7-3. Only ListT5 (r=1)?**
**Only (r=1) can be defined for this positional bias experiment.**
- Section 4 - Experiment and Data Setup has the full explanation of the experimental scenario for the positional bias experiment.
- We defined positional bias inspired by Liu et al. [9], who measure answer accuracy with respect to the index of the relevant passage. Here, we only consider the top-1 prediction of each listwise reranking model, so the results are the same for (r=1) and (r=2).
- One clarification: the measurement of positional bias is applied to this basic operating unit, so the choice of (r=1) or (r=2) does not affect the metric (Fig.3).
- Since this can cause confusion, we will revise the explanations to be more clear in the next revision.
[9] Liu et al., Lost in the Middle: How Language Models Use Long Contexts
R3
---
We appreciate the reviewer taking the valuable time and effort to review our paper. We have provided additional explanations and experiments to further validate and clarify our arguments in the response, especially about **unifying the negative samples by GTR** for a fair comparison. We would be grateful if the reviewer takes time to view our responses, especially since the reviewer identified the difference in training data as the major concern.
We hope our response offers sufficient evidence for your questions, and that you will consider increasing the overall assessment of our paper!
### w1. RankT5 negatives are sampled from GTR, while ours are sampled from COCO-DR.
**Unfortunately, exact application of the RankT5 setup to ListT5 was not possible for the main paper due to the following reasons.**
- No information about the parameter size of the GTR model
- Unclear whether the model they use is GTR or GTR-FT, and if it is GTR-FT, no open-sourced checkpoints are available
- Unclear whether they used only ‘train’ or both ‘train’ and ‘dev’ subset of MS MARCO for training RankT5.
- See the mention from the RankT5 paper:
- *…5.1 Data sets / MS MARCO … We use a dual-encoder retriever (Ni et al., 2022) fine-tuned on MS MARCO to retrieve the top-1000 passages for each query in both the “train” and “dev” partitions as the candidate documents. …*
- Different number of negatives - RankT5 samples 35 negative passages per query from the top-1000 retrieval results, while ListT5 only sees 4 negatives at once.
- We would like to clarify that we used COCO-DR only for negative sample selection from the MS MARCO training set (it is not used or trained anywhere with BEIR). Regarding the MS MARCO dev set performance, GTR-XL achieves an NDCG@10 of 43.9 while COCO-DR achieves 42.4; COCO-DR actually scores lower than GTR-XL on MS MARCO.
**Even though exact replication of GTR was not possible, we conducted an approximate but fair experiment training ListT5 on negatives sampled from `sentence-transformers/gtr-t5-xl`, and found that *ListT5 trained with GTR still outperforms RankT5* (the final conclusions are unchanged).**
- All other configurations stay the same except we change the COCO-DR-large model to GTR-XL.
- (L.267) We train T5-base model with learning rate of 1e-04, linear learning rate scheduler with 10% warmup, effective batch size 256, maximum input length 230, maximum output length 7.
- Although the original ListT5-base model was trained for 20k steps, we were only able to train this model for 10k steps due to the limited time of the discussion period. (This is the only hyperparameter that is changed.)
- From the experiments, we can see that currently, GTR does not provide as good an ordering of negatives as COCO-DR. However, we would like the reviewers to recognize that this experiment was done in one pass within a short time frame and we were unable to do any hyperparameter tuning. Due to limited time, we were also only able to train ListT5-GTR for 10k steps, which is half of what ListT5-COCO-DR was trained for. We think the results may improve with proper hyperparameter tuning.
> Ablation experiments - NDCG@10 on full BEIR subset. ListT5 models are (r=2) variants.
| Dataset | BM25 (init) | MonoT5 | RankT5 | ListT5 - COCO-DR | ListT5 - GTR |
| -------------- | ----------- | ------ | -------- | ---------------- | ------------ |
| TREC-COVID | 59.5 | 78.3 | 77.7 | 78.3 | 78.6 |
| NFCorpus | 32.2 | 35.7 | 35.1 | 35.6 | 36.2 |
| BioASQ | 52.2 | 55.3 | 58.2 | 56.4 | 55.1 |
| NQ | 30.5 | 52.1 | 53.2 | 53.1 | 52.8 |
| HotpotQA | 63.3 | 71.2 | 72.8 | 72.6 | 71.9 |
| FiQA-2018 | 23.6 | 39.2 | 39.2 | 39.6 | 39.9 |
| Signal-1M (RT) | 33.0 | 32.0 | 30.8 | 33.5 | 32.9 |
| TREC-NEWS | 39.5 | 48.0 | 45.4 | 48.5 | 48.0 |
| Robust04 | 40.7 | 53.4 | 54.3 | 52.1 | 52.7 |
| Arguana | 40.8 | 34.4 | 35.5 | 48.9 | 43.6 |
| Touche-2020 | 44.2 | 29.6 | 37.1 | 33.4 | 32.7 |
| CQADupStack | 30.0 | 38.6 | 37.0 | 38.8 | 38.8 |
| Quora | 78.9 | 84.6 | 83.3 | 86.4 | 83.9 |
| DBPedia | 31.8 | 42.8 | 43.7 | 43.7 | 43.6 |
| SCIDOCS | 14.9 | 16.7 | 16.8 | 17.6 | 17.0 |
| FEVER | 65.2 | 78.4 | 77.6 | 79.8 | 79.4 |
| Climate-FEVER | 16.5 | 23.1 | 21.2 | 24.0 | 24.8 |
| SciFact | 67.9 | 73.1 | 73.5 | 74.1 | 73.1 |
| avg | 42.5 | 49.3 | **49.6** | **50.9** | **50.3** |
Thank you for raising an important question and we will include this ablation experiment in the next revision.
---
### w2. Need to compare FLOPs of sliding window with tournament sort
**We would like to clarify that the results at L.356 *are* obtained using the sliding window approach.**
- The information is also written at Lines 357-360: we used Vicuna-7b with a context size of 4096 and a sliding window of 20. RankVicuna resulted in FLOPs 42 times those of ListT5-base (r=1) to rank the top-10 passages from 100 candidate passages.
- Listwise reranking with a sliding window (of 20) necessitates models trained with large context sizes, which is only applicable to billion-parameter models (L.133).
- Thus, we naturally cannot compare these approaches in the same way we did with the T5-based models - MonoT5 (pointwise), RankT5, DuoT5 (pairwise), and ListT5.
- Though not a fair comparison, theoretically, if we reduce the RankVicuna-7b model to 220M (the size of T5-base, which is 31 times smaller) and leave everything else as is, the model with the sliding window approach would still be less efficient than ListT5 (r=1), with (42/31) ≈ 1.35 times the FLOPs. However, please note that this comparison is rough.
---
### w3. Novelty is limited, hierarchical approach is already investigated
To our understanding, the contribution of PARADE and related works is primarily about applying hierarchical scoring to pointwise reranking of documents, whereas the main focus and contribution of our work is applying listwise reranking of passages to T5-based models with the Fusion-in-Decoder approach, which is a parallel line of work to ours.
- Regarding the novelty concerns, we would like to restate our 3 major contributions (L.68-89):
- In this work, we introduce ListT5, a FiD-inspired listwise reranker that is:
- More computationally efficient than previous listwise LLMs or pairwise methods, while competitive with pointwise models.
- More robust to positional bias than previous listwise LLMs, such as GPT-3.5 or even GPT-4.
- Shows better zero-shot performance than previously well-established models.
Tournament sort is applied to achieve the computational efficiency of ListT5 for (1), and this approach is specifically designed in line with the nature of ListT5, such as the option of (r=1) and (r=2). We believe that the 3 major contributions of our paper, along with the novelty of applying FiD to reranking and the design of tournament sort for ListT5, can be seen as a sound contribution.
Also, we agree that FiD for reranking better distinguishes our work, and we will de-emphasize the current Pareto pitch as suggested. Thank you for your valuable comments to improve the paper.
---