# Letter to AC
Dear Area Chairs,
We would like to express our gratitude to the reviewers and area chairs for their valuable efforts. We are reaching out to kindly ask the ACs to pay attention to the review and discussion with reviewer x5n7.
In the review, the reviewer raised 2 requests for clarification (Q1 & Q5) and one possible minor limitation (Q2 - on the efficiency of tournament sort for m > k), along with 2 additional results to be discussed (Q3 & Q4). Through the author response period, we believe we have **resolved the reviewer's concerns**. Along with the responses to the clarifications (Q1 & Q5), we have shown in Q2 that tournament sort can still be effective in the case of m > k, and have provided additional results for Q3 and Q4, showing the robustness of ListT5 to hyperparameter changes (Q3) and the applicability of tournament sort to existing listwise reranking models (Q4).
While we had hoped the author response period would be interactive, unfortunately the reviewer did not respond to our comments within the response period. We therefore hope this letter brings the discussion with reviewer x5n7 to the ACs' attention.
Again, we appreciate your time and effort in organizing the discussion period.
Sincerely, Authors of paper 4227
# Responses to R1 (x5n7)
Thank you for providing valuable comments to improve our paper! We hope our responses sufficiently answer the questions raised and that you consider raising the overall assessment of our paper. We would greatly appreciate it if you could review our responses and let us know if any further discussion or clarification is required. :)
### 1. Limiting the number of passages to rank ($m$) to 5 seems somewhat restrictive.
- We would like to point out that we already report experiments with $m$ increased to 10 (Table 6 in Appendix B.1; also see Line 172).
- For the T5-base model, the $m$ = 5 model was selected as our main model simply because it was empirically more effective than $m$ = 10. Still, please note that our $m$ = 10 model with $r$ = 4 (average NDCG@10 of 50.0 on BEIR) outperforms RankT5 (average NDCG@10 of 49.6 on BEIR).
- *Why does a small m (5) work better than a larger m (10) in our models, while the original FiD uses a large m (e.g., 100)?* Compared to the original FiD for QA, where the inputs serve as a pool of relevant information, ListT5 needs to *order* the different inputs, a task that requires analyzing the relative relevance among all given inputs. We conjecture that this *sorting* task is more complex than a task that merely uses the inputs as a knowledge source, leading to increased effectiveness when $m$ is smaller.
- We believe $m$ could be increased when applied to larger models while remaining competitive (L.674-678).
### 2. Tournament sort is more efficient than the sliding window when m is less than k, but what if the model uses m=20 to get a top-10 ranking? Is tournament sort still faster?
- Even in the case of m=20, the number of forward passes needed to obtain the top-10 ranking with tournament sort is less than or equal to that of the sliding window variants.
> Number of forward passes to rerank the top-10, when m = 20 and n = 100 (written in the format: num_passes per iteration x number of iterations that assures the global ranking).
| | s = r = 10 | s = r = 5 | s = r = 1 |
| ---------- | ---------- | --------- | ------------- |
| Sliding | 9 x 1 iter. = 9 | 17 x 2 iter.= 34 | 81 x 10 iter. = 810 |
| Tournament | 9 x 1 iter. = 9 | 7 x 2 iter. = 14 | 6 x 10 iter. = 60 |
- For example, ranking the top-10 with m=20 requires 9 forward passes, the same number as the sliding window approach with the same hyperparameters (stride = 10, window size = 20). Tournament sort with r=10 can correctly rank the global top-10 candidates in one iteration, as illustrated below (a short code sketch of this counting is given at the end of this answer).
```
[9]
| \
| \
|10 \ 10
[6] [8]
| \ |10 \
| \ [7] \
|10 \10 |10 \10 \10
[1] [2] [3] [4] [5]
0-20 20-40 40-60 60-80 80-100
```
- An exhaustive exploration of parameters and models could not be done within the short rebuttal window, but we will make sure to include a detailed analysis of the efficiency and performance of the sliding window approach compared to tournament sort for the case of m > k.
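- To make the counting above concrete, below is a minimal Python sketch of the m=20, r=10 procedure, assuming a black-box `rerank(query, passages)` call that returns the passages in relevance order (each call counts as one forward pass). This is an illustration of the counting argument only, not our exact implementation; the merge order may differ from the tree drawn above, but the forward-pass count (2 x ceil(n/m) - 1 = 5 + 4 = 9 for n=100) is the same, matching the table.
```python
def tournament_top_k(query, passages, rerank, m=20, r=10, k=10):
    """Select the global top-k in one tournament pass (valid when k <= r and 2*r <= m).

    `rerank(query, passages)` is a black-box listwise reranker that returns the given
    passages sorted by relevance; each call counts as one forward pass.
    """
    assert k <= r and 2 * r <= m
    n_passes = 0

    # Leaf round: partition the n candidates into groups of m, keep the top-r of each.
    pools = []
    for i in range(0, len(passages), m):
        pools.append(rerank(query, passages[i:i + m])[:r])
        n_passes += 1

    # Merge rounds: repeatedly rerank two pools together (<= 2r <= m passages) and
    # keep the top-r; any global top-k passage always survives a top-r selection.
    while len(pools) > 1:
        merged = rerank(query, pools[0] + pools[1])[:r]
        n_passes += 1
        pools = [merged] + pools[2:]

    return pools[0][:k], n_passes
```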
### 3. Lack of discussion of how r=3,4 performs. Worth showing.
- We observed performance saturation for larger r in early experiments (L.168-170). At the reviewer's request, we provide full BEIR evaluation results for ListT5 with r=3 and r=4.
- We will include the results in the next revision.
> NDCG@10 on BEIR, for ListT5-base with different r.
| Dataset | r=1 | r=2 | r=3 | r=4 |
| -------------- | ---- | -------- | ---- | ---- |
| TREC-COVID | 76.7 | 78.3 | 78.1 | 78.3 |
| NFCorpus | 35.5 | 35.6 | 35.3 | 35.6 |
| BioASQ | 57.2 | 56.4 | 55.9 | 55.8 |
| NQ | 52.0 | 53.1 | 52.2 | 52.2 |
| HotpotQA | 72.1 | 72.6 | 71.6 | 71.6 |
| FiQA-2018 | 39.5 | 39.6 | 39.6 | 39.5 |
| Signal-1M (RT) | 33.3 | 33.5 | 33.3 | 33.2 |
| TREC-NEWS | 47.9 | 48.5 | 49.2 | 48.6 |
| Robust04 | 52.0 | 52.1 | 51.6 | 51.7 |
| Arguana | 49.7 | 48.9 | 49.4 | 49.9 |
| Touche-2020 | 34.2 | 33.4 | 33.8 | 34.2 |
| CQADupStack | 38.4 | 38.8 | 38.5 | 38.6 |
| Quora | 86.1 | 86.4 | 86.0 | 86.1 |
| DBPedia | 43.9 | 43.7 | 43.3 | 43.1 |
| SCIDOCS | 17.2 | 17.6 | 17.4 | 17.5 |
| FEVER | 77.8 | 79.8 | 78.6 | 78.4 |
| Climate-FEVER | 22.8 | 24.0 | 23.4 | 23.3 |
| SciFact | 74.1 | 74.1 | 73.8 | 74.3 |
| Average | 50.6 | **50.9** | 50.6 | 50.7 |
### 4. Verify the effectiveness of tournament sort using GPT3.5/4 to conduct ranking, in addition to T5.
- We have analyzed the performance of [RankGPT](https://github.com/sunnweiwei/RankGPT?tab=readme-ov-file) with tournament sort. The performance difference between tournament sort and the sliding window approach with w=20, s=10 was not significant, while the number of forward passes required to rank the top-10 passages is the same (9) for both variants.
> NDCG@10 on a selected subset of BEIR, for RankGPT-3.5 with different sorting methods. To compensate for API instability, all results are averaged over 3 runs. Except for trec-covid and touche, differences are statistically non-significant (p > 0.1).
| | dl19 | dl20 | trec-covid | news | touche |
| ----------------------- | ---------- | ---------- | -------------- | ---------- | -------------- |
| sliding (w=20, s=10) | 68.4 ± 0.4 | 64.9 ± 1.1 | 72.6 ± 1.4 | 46.5 ± 1.0 | **38.2 ± 0.5** |
| tournament (m=20, r=10) | 67.4 ± 0.9 | 65.8 ± 0.6 | **76.4 ± 0.4** | 45.5 ± 1.0 | 33.1 ± 1.7 |
- Due to the short rebuttal period and the cost of API calls, we were unable to experiment with different hyperparameters (e.g., different m, r). We will provide a more detailed analysis of the performance and efficiency of RankGPT-3.5 & 4 in the next revision of our paper.
### 5. Others
Table 11: 25 x [10/1] = 250, I think this is not exactly correct. For example, to get the 6th passage, the sliding window only needs to run 24 times. The first 5 are already ordered.
- The reviewer is correct in noting that, for ranking the top-6 with a sliding window of size 5 and stride 4, one forward pass can be saved in the 6th slide because the top-5 have already been ordered in the first 5 slides. For the top-10, while our calculation was 25 x [10/1] = 250, a more precise calculation gives 248.
- This correction marginally reduces the numbers in Table 11, while the trends remain the same. Thank you for pointing this out; we will correct the numbers.
> Corrected numbers for Table 11.
| sorting method | hyperparameter | \# forward pass to rerank top-10 |
| --------------- | -------------- | -------------------------------- |
| sliding window | s=1 | 288 -> 280 |
| | s=2 | 196 -> 191 |
| | s=3 | 165 -> 162 |
| | s=4 | 250 -> 248 |
| tournament sort | r=1 | 52 |
| | r=2 | 67 |
Some references better to include: Large Search Model: Redefining Search Stack in the Era of LLMs (regarding the creation of listwise relevance targets.)
- Thank you for the valuable suggestion. As the reviewer pointed out, ListT5 (listwise reranking that outputs permutations of 5) can be seen as an instance of framing reranking as a text generation task, which is in line with the referenced paper. We will include this related work, along with follow-up papers, in the next revision of our paper.
# Responses to R2 (qN1M)
We appreciate the reviewer's time and effort in reviewing our paper. We hope that the comments below offer sufficient evidence in response to your questions, and that you consider raising the overall assessment of our paper!
### 1. The partition and sort in tournament sorting are based on the initial order returned by the retriever. The analysis in Section 5.2 shows the method could still be affected by passage positions. Have the authors tried to shuffle the initial retrieval results from DE/BM25 and test if the top-k reranked passages remain the same, a property that holds for pointwise methods?
- While our approach gains a good amount of ordering robustness from FiD, as the reviewer pointed out, tournament sort itself depends on the initial ordering returned by the first-stage retriever (say BM25). As with sliding windows, this may leave room for ordering sensitivity.
- To investigate this, we ran an additional experiment on several datasets, controlling for the use of FiD and varying the sorting method (tournament vs. sliding).
- The results show that (1) ordering sensitivity is indeed prevalent in previous listwise reranking models (M3, M4, no FiD + sliding), (2) using FiD (M1, M2) greatly reduces the sensitivity in general (in line with our findings in the main text), and (3) when using FiD (M1 vs. M2), tournament sort is generally robust to ordering changes (-0.4 drop in average performance), especially compared to sliding windows (-1.2 drop).
- We appreciate the comment and will add this result in our next revision.
> NDCG@10 before/after randomly shuffling the initial top 100 ordering of BM25.
| Model | Sorting method | initial ordering | dl19 | dl20 | trec-covid | news | touche | average |
| ------------------ | -------------------- | ---------------- | ---- | ---- | ---------- | ---- | ------ | -------- |
| ListT5-base (M1) | tournament(r=2) | no shuffle | 71.8 | 68.1 | 78.3 | 48.5 | 33.4 | 60.0 |
| | | shuffle,avg | 71.2 | 68.1 | 77.2 | 48.9 | 32.8 | 59.6 |
| | | drop in perf. | | | | | | **-0.4** |
| ListT5-base (M2) | sliding(s=3, iter=4) | no shuffle | 71.8 | 67.7 | 77.5 | 50.0 | 33.1 | 60.0 |
| | | shuffle, avg | 69.5 | 65.5 | 77.7 | 49.2 | 32.1 | 58.8 |
| | | drop in perf. | | | | | | **-1.2** |
| RankVicuna-7b (M3) | sliding | no shuffle | 68.9 | 66.1 | 80.5 | 46.9 | 33.0 | 59.1 |
| | | shuffle, avg | 67.1 | 64.6 | 79.2 | 45.3 | 30.8 | 57.4 |
| | | drop in perf. | | | | | | **-1.7** |
| RankGPT-3.5 (M4) | sliding | no shuffle | 68.4 | 64.9 | 72.6 | 46.5 | 38.2 | 58.1 |
| | | shuffle, avg | 62.5 | 57.0 | 66.1 | 38.3 | 22.8 | 49.3 |
| | | drop in perf. | | | | | | **-8.8** |
- Due to the limited rebuttal time window, we evaluated on a selected subset of 5 datasets with relatively small sizes (around 50 queries each).
- Evaluation results for RankGPT-3.5 are averaged over 3 runs to compensate for the instability of the API output.
- For all other results with shuffled initial ordering, the ordering is shuffled with 3 different seeds (0, 1, 2) and we report the average score (avg).
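- For clarity, below is a minimal sketch of this per-query shuffling protocol; the `rerank` and `ndcg_at_10` callables are placeholders, not our actual code, and in practice the scores are aggregated over all queries of each dataset.
```python
import random
from statistics import mean

def shuffled_ndcg(query, bm25_top100, rerank, ndcg_at_10, seeds=(0, 1, 2)):
    """Shuffle the initial BM25 top-100 with fixed seeds, rerank each shuffled list,
    and average NDCG@10 over the seeds (illustrative; `rerank` and `ndcg_at_10`
    stand in for the actual reranker and metric implementation)."""
    scores = []
    for seed in seeds:
        candidates = list(bm25_top100)
        random.Random(seed).shuffle(candidates)   # destroy the first-stage ordering
        scores.append(ndcg_at_10(query, rerank(query, candidates)))
    return mean(scores)
```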
### 2. The time complexity is claimed to be reduced from O($n^2$) to O($n$ + $k$ log $n$). However, the proposed method can only reorder the top-k, with the rest unordered.
- (1) We can consider reordering all ($n$) passages by simply setting $k = n$, which gives O($n$ + $n$ log $n$) = O($n$ log $n$). This is still better than the O($n^2$) of pairwise models and, in fact, is the best possible worst-case asymptotic complexity for comparison-based sorting algorithms [[1]](https://en.wikipedia.org/wiki/Comparison_sort).
- We also stress that (2) in many problems, we are asked to rerank only the top-k. For example, NDCG@10, the official metric of BEIR, only requires the top-10 passages. Also, (3) the sliding window approach shares the same issue (L.187-194).
- We appreciate the comments and will add the above discussion to the revised draft.
[1] [https://en.wikipedia.org/wiki/Comparison_sort](https://en.wikipedia.org/wiki/Comparison_sort)
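- As a rough numerical illustration of the two growth terms above (constant factors ignored, so the absolute values are only indicative):
```python
import math

# Compare only the growth terms (constant factors ignored):
# pairwise O(n^2) vs. tournament sort O(n + k log n).
for n, k in [(100, 10), (1000, 10), (1000, 1000)]:
    pairwise = n * n
    tournament = n + k * math.log2(n)
    print(f"n={n:4d}, k={k:4d}:  n^2 = {pairwise:7d},  n + k*log2(n) ~= {tournament:8.0f}")
```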
### 3. How do you ensure the outputs are valid index? Do you use any constrained decoding or each digital number is represented by a special token?
- We do not use constrained decoding or a validation module to ensure the generation of valid permutations of 5. Each index is represented as a plain integer (the decoded output looks something like "1 2 5 4 3" or "2 3 1 5 4"), with an exception-catching module such that if parsing fails, we fall back to the original ordering. However, since ListT5 is fine-tuned to output valid indices for 20,000 steps, the model almost always outputs valid indices when the input follows the correct format. Nevertheless, we think that adding a constrained decoding module would be beneficial. Thank you for the suggestion; we will add the module to our open-source code.
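- To illustrate what this exception-catching logic looks like, here is a hypothetical minimal sketch (not our released code): it accepts outputs such as "1 2 5 4 3" and falls back to the input ordering whenever the output is not a valid permutation of 1..m.
```python
def parse_permutation(decoded: str, m: int = 5) -> list:
    """Parse a decoded string such as '1 2 5 4 3' into 0-based indices.
    Falls back to the original ordering [0, 1, ..., m-1] whenever the output
    is not a valid permutation of 1..m (illustrative sketch only)."""
    try:
        ranks = [int(tok) for tok in decoded.split()]
        if sorted(ranks) != list(range(1, m + 1)):
            raise ValueError("not a valid permutation")
        return [r - 1 for r in ranks]
    except ValueError:
        return list(range(m))   # fall back to the input ordering
```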
### 4. It seems that the random replacement will affect the final results. What if we choose a different passage for replacement? Why do you always use the passage with index+21 for replacement?
- The choice of index 21 was arbitrary; we fixed it to a single index to reduce inference randomness as much as possible, for reproducibility.
- Note that this replacement is never involved in training; it only affects the inference stage.
- We have analyzed the sensitivity to the choice of replacement index, and find no significant differences when changing it.
> NDCG@10 when varying the index of the passage used for replacement.
| Dataset | 21 (orig.) | 20 | 19 | 18 | 17 | 16 | 15 | 6 |
| ------------- | ---------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| signal | 33.5 | 33.3 | 33.2 | 33.1 | 33.3 | 33.3 | 33.3 | 33.5 |
| trec-covid | 78.3 | 78.2 | 78.2 | 78.4 | 78.1 | 78.1 | 78.3 | 78.3 |
| news | 48.5 | 48.4 | 48.4 | 48.4 | 48.3 | 48.4 | 48.4 | 48.4 |
| scifact | 74.1 | 74.0 | 74.0 | 74.0 | 74.0 | 74.1 | 74.2 | 74.0 |
| nfcorpus | 35.6 | 35.7 | 35.7 | 35.7 | 35.7 | 35.7 | 35.7 | 35.8 |
| fiqa | 39.6 | 39.6 | 39.7 | 39.7 | 39.6 | 39.6 | 39.6 | 39.7 |
| bioasq | 56.4 | 56.6 | 56.5 | 56.5 | 56.4 | 56.5 | 56.5 | 56.5 |
| touche | 33.4 | 33.7 | 33.4 | 33.5 | 33.6 | 33.8 | 33.8 | 33.6 |
| dbpedia | 43.7 | 43.6 | 43.8 | 43.7 | 43.8 | 43.7 | 43.7 | 43.8 |
| robust04 | 52.1 | 52.1 | 52.0 | 52.1 | 52.0 | 52.1 | 52.1 | 51.9 |
| scidocs | 17.6 | 17.6 | 17.6 | 17.5 | 17.6 | 17.5 | 17.6 | 17.5 |
| arguana | 48.9 | 49.0 | 48.8 | 48.8 | 48.9 | 49.0 | 48.9 | 48.9 |
| climate-fever | 24.0 | 24.0 | 24.0 | 24.0 | 24.0 | 24.0 | 24.0 | 24.0 |
| dl19 | 71.8 | 71.9 | 71.8 | 71.7 | 71.7 | 71.8 | 71.7 | 71.8 |
| dl20 | 68.1 | 68.2 | 68.1 | 68.1 | 68.1 | 68.1 | 68.3 | 68.5 |
| **Average** | **48.4** | **48.4** | **48.3** | **48.3** | **48.3** | **48.4** | **48.4** | **48.4** |
- For this experiment, we evaluated on 13 of the 18 BEIR datasets, along with DL19 and DL20. Due to the limited time window, 5 datasets (nq, hotpotqa, cqadupstack, quora, fever) with more than 3,000 queries were excluded, but we believe these results are sufficient to show robustness to the choice of replacement index.
# Responses to R3 (gPo5)
Thank you for taking the time and effort to review our paper. Overall, we will add the missing references and revise missing details and unclear notation in our next revision.
### 1. It seems necessary to include https://arxiv.org/pdf/1811.04415.pdf as it also does m-ary sort. There are also some additional papers referencing it that might be worthwhile to include.
- Thank you for the suggestion. It indeed looks related, in that both consider inter-document relationships to jointly score relevance. We will include this paper, along with follow-up papers, in the next revision of our paper.
### 2. I suppose that the MSMARCO results are MRR@10? NDCG@10?
- As stated in Section 4.5 (L.329-335), we mainly report NDCG@10 scores, since it is the official metric for BEIR. We also report MRR@10 scores in the appendix (Appendix Table 8).
### 3. Document retrieval task? Passage retrieval task?
- We report scores on the dev subset of the MS MARCO passage ranking task. Thank you for pointing this out; we will include this clarification in our next revision.
### 4. I find the paper relatively hard to follow compared to other ones, which is a pity considering the strong results of the paper. Unclear if e_i (Eq 1) is a single embedding or the full sequence of token embeddings. I assume it is the former, but "classical" FiD is the latter.
- By e_i, we are referring to the full sequence of token embeddings, identical to classical FiD (our implementation of the encoder embeddings and the FiD architecture is built on the [original FiD code](https://github.com/facebookresearch/FiD/blob/main/src/model.py#L146)). We will revise the notation of e_i (e.g., to h_i, with explanations) to better convey that it is the full sequence of token embeddings; a small illustrative sketch is given after this list. Thank you for pointing this out!
- About the writing - we are planning to do the following to improve the overall writing quality:
- Revise Appendix Sec. E (Details on the evaluation dataset) to include exact pointers to evaluation datasets and configurations we used.
- Revise the method section, especially on the notation, for better readability.
- Organize the related work section to better align with the introduction part.
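- To make the above concrete, here is a small illustrative sketch of FiD-style fusion with Hugging Face T5. The prompt template, m=5, and variable names are illustrative assumptions, not our exact implementation (which follows the linked FiD code).
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

# Illustrative FiD-style fusion: each (query, passage_i) pair is encoded independently,
# e_i is the FULL sequence of token embeddings (L x d), and the m sequences are
# concatenated so that the decoder cross-attends over all of them at once.
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tok = T5Tokenizer.from_pretrained("t5-base")

query = "example query"
passages = [f"passage {i} text ..." for i in range(1, 6)]            # m = 5 inputs
inputs = tok([f"Query: {query} Index: {i} Context: {p}"              # illustrative template only
              for i, p in enumerate(passages, 1)],
             padding=True, truncation=True, return_tensors="pt")

e = model.encoder(**inputs).last_hidden_state                        # (m, L, d): one e_i per input
fused = e.reshape(1, -1, e.size(-1))                                 # (1, m*L, d): concatenated sequences
fused_mask = inputs.attention_mask.reshape(1, -1)

dec_start = torch.tensor([[model.config.decoder_start_token_id]])
logits = model(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
               attention_mask=fused_mask,
               decoder_input_ids=dec_start).logits                   # decoder attends over all m inputs
```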
### 5. Thank you!
- Thank you for your comments, and please, feel free to ask additional questions if there are any unclear parts regarding the dataset and evaluation!