## **General Response**

We sincerely thank all reviewers for their valuable comments and constructive suggestions! The main contributions of our work and the important questions are highlighted and answered as follows:

- **Contributory work in an important area.** LooGLE sheds light on the future development of enhanced models toward "true long-context understanding". We examine "whether LLMs that allow extended context text input understand the long contexts" (Reviewer **fCa5**), which is much needed and contributes to the community. It shows that current LLMs targeting long-context understanding all fail to solve the true long-dependency problems, and that merely increasing the context window size might not help.
- **Novel dataset for long context.** We introduce "a novel long-context understanding benchmark for LLM evaluation" (Reviewer **QkhE**). The dataset has the remarkable advantages of "longer text lengths, more recent data, both short-dependency and long-dependency tasks and multiple genres with QA pairs cross-validated between the questioner and answerer after independent annotations" (Reviewer **fCa5**).
- **Thoughtfully designed, high-quality dataset.** For LooGLE, "the construction details including task definitions, data collection, and annotation are clearly illustrated and the dataset collection process also avoids data leakage" (Reviewer **QkhE**). Besides, "the task design is persuasive and robust", and we "spent many efforts to make sure the benchmark got diverse types of tasks" (Reviewer **iRYd**).
- **Comprehensive experiments.** Our work "evaluates various angles of long-context understanding covering 7 sub-tasks and offers an overview of LLMs performance evaluations" (Reviewers **QkhE, fCa5**). Experiments are well conducted and the experimental results are generally sound (Reviewer **iRYd**).

We have revised our manuscript per the reviewers' suggestions (highlighted in red in the uploaded revision PDF). We have carefully refined the figures of the paper and fixed the typo issues mentioned by the reviewers. Detailed responses to each reviewer's concerns are addressed point by point below.

## **To Reviewer fCa5**

> Q1: State early in the introduction the categories of tasks and subtasks involved: a) Long dependency: Summarization & QA (4 sub-tasks); b) short dependency: cloze and QA. Concrete task definitions and examples only occur in figures and appendices but are not introduced in main texts.

**A:** Thanks for pointing out your concern. We follow the task definitions of short QA, summarization, and cloze from current works, since they are well-acknowledged and commonly used NLP tasks. We introduce them in fewer words but with more examples due to the limited pages in the main text. In this paper, we pay more attention to long dependency QA, for which **we have already discussed the construction of the dataset in detail, from data collection in Section 3.1 to task definition and generation, especially the manual compilation in Section 3.2, which spans three pages.**

> Q2: ArXiv summarization as a long dependency task is debatable: the authors show that summarization results are better than ShortQA and Cloze in Fig 3, and Table 5 shows that longer contexts hurt summarization. Summarization can focus on introduction and conclusion paragraphs (Line 470) and does not require information from thousands of tokens.

**A:** Thanks for pointing out your concern. ArXiv paper summarization is well acknowledged and commonly used as one of the challenging tasks for long context. Here are related works for your reference:

[1] An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics
[2] Efficient Attentions for Long Document Summarization
[3] A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

The good performance on arXiv summarization relative to other tasks may be attributed to the inherent structure of arXiv papers, with the introduction and conclusion sections located at the beginning and the end respectively, which often contain the major sketch of the paper. However, many papers have a lengthy abstract/conclusion that already exceeds the context window of current long-context LLMs (like GPT4-32k), so the task remains challenging. Besides, we will involve more general and challenging summarization tasks in abundant fields for comprehensive evaluation in the future.
> Q3: It would be helpful to mention licenses of sources in Sec 3.1.

**A:** Thank you for your kind reminder. We use the MIT License for our dataset and have specified it in Section 3.

> Q4: In the exemplified QA tasks, pieces of information can be extracted from local sentences but aggregated across thousands of tokens. What differentiates the harder Timeline and Calculation from Multiple retrieval and Comprehension could be the deeper arithmetic chain of thoughts (CoT) involved.

**A:** We use **"long dependency"** in our paper to describe questions where various pieces of relevant information are scattered across multiple locations in a lengthy text, sometimes spanning the whole document. Instead of directly locating the evidence through keyword matching or semantic similarity, **it relies heavily on a profound comprehension of the question and its implied correlation with this evidence for extensive multi-source retrieval** from the texts. Based on that, for Timeline reorder, Calculation, as well as Comprehension and reasoning, **a few capabilities are essential, including mathematical computation, temporal awareness of the central storyline, intricate comprehensive understanding, and multi-hop reasoning**. As depicted in Fig. 5, chain of thoughts (CoT) offered only marginal improvements in the long context comprehension of our tasks.

> Q5: Annotation time: admittedly, it takes hours to read documents with 10K+ tokens. However, LongQA task (e.g., timeline reorder) annotations can be implemented by keyword searches, and relevant information is local (maximally sentences away). Requiring "a week" for ~10 questions on a document sounds exaggerated. In other words, these tasks do not necessarily require "memorization or comprehensive understanding of the central storyline of the document" (line 257) -- anyone who has taken a reading comprehension test has experience quickly extracting targeted information from thousands of words.

**A:** **It is a well-established fact that the annotation process for an article takes approximately one week to complete.** Several factors contribute to this duration.

Firstly, our annotation process strictly follows a **three-step approach: Question & Answer, Answer & Check, and Cross-validation & Revise.** Each step is carried out sequentially, and before proceeding to the next step, a simple verification is conducted by recruiters.

Secondly, our **annotation process adheres to rigorous standards**, encompassing long dependencies, diverse problem types, formulating clear and precise questions, and providing deterministic and objective answers. Participants are strictly prohibited from using large language models and tools like ChatGPT for tasks such as article reading, data generation, and annotation. It takes nearly 3 to 5 hours on average to read one long document, and even more effort to write questions and answers.

It is worth mentioning that **answering long dependency questions imposes higher demands on comprehensive abilities**. After a thorough reading of the entire article, instead of directly locating the evidence through keyword matching or semantic similarity, it relies heavily on a profound comprehension of the question and its implied correlation with the information for extensive multi-source retrieval from the texts. Since relevant information may be dispersed throughout the text, annotators must engage in repeated readings to ensure accurate and comprehensive answers. Additionally, apart from formulating QA pairs, annotators are also expected to provide precise sources of the answers within the original text as evidence.

In particular, **scripts pose more challenges in terms of readability**. Scripts are typically written in dialogue format, which makes them comparatively less readable. Extracting the events from the dialogue and subsequently formulating questions and finding answers based on those events is a complex endeavor. The information relevant to the answers is often subtly embedded within the dialogue and not explicitly presented.
> Q6: Inter-annotator agreement of 81.88% (line 339): how was the inter-annotator agreement (IAA) measured? At which stage (before or after revision)? On how many QA pairs? And on which measure(s), e.g., completely correct string match?

**A:** Thanks for pointing out your confusion. We compare the annotation results between the Question & Answer step and the Answer & Check step to determine the level of agreement. This comparative analysis is carried out manually by human annotators to evaluate both the dependency of the question and its answerability. To quantify the inter-annotator agreement, we employ Percentage Agreement as the measurement metric, calculated as: Percentage Agreement = (Number of agreements / Total number of QA pairs) * 100. In this study, we calculated the Percentage Agreement over all the QA pairs: the number of agreements is 931, while the total number of QA pairs is 1137.
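For concreteness, plugging the reported counts into the formula recovers the figure quoted in the question:

$$\text{Percentage Agreement} = \frac{931}{1137} \times 100\% \approx 81.88\%$$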
> Q7: On which tasks and how many did the paper inspect "the accuracy of the generated answers being consistent with the ground truth for some models/tasks"? How was the manual evaluation implemented, and why was "accuracy" the metric? How was the "human evaluation" involved in CoT? What exactly does the human evaluate? Why is the metric accuracy? Wouldn't P/R/F1 be better if compared with GPT4_score?

**A:** We included **human evaluations in 4 parts of the experiments**. Details of the tasks and the number of QAs inspected are as follows:

1. We manually evaluated the accuracy for the 4 long-dependency tasks under 3 settings (without CoT, few-shot CoT, zero-shot CoT) in Figure 5. (Total 343 * 3 = 1029 QAs)
2. We randomly selected over 400 questions from each task in long dependency QA and evaluated the accuracy from both GPT4's and the human perspective to see their agreement (%) in Table 7. (Total 425 QAs randomly sampled)
3. We provide probable explanations for long QA bad cases to offer insights and directions for model improvement in Table 10 in Appendix E. (Same QAs as experiment 1 above)
4. We captured discrepancies in the generated outputs of different models to tackle inherent preferences encountered in the long context in Section 4.2.3. (Total 300+ QAs randomly sampled)

To implement the manual evaluation, given the question, we manually read, checked, and compared the predicted outputs from different models with the ground truth in semantics. Predictions that are semantically the same as the answer to the question are deemed correct (sometimes they are similar statements in different expressions). We compute the proportion of correct answers as the reported "accuracy".

As for the "human evaluation" involved in CoT, we manually evaluate the accuracy of predictions directly in experiments 1 & 2 mentioned above. For experiment 3, we inspect the rationale behind the CoT of each QA pair for a better understanding of how models decompose and tackle challenges associated with long dependency QA.

We use accuracy instead of P/R/F1 firstly because humans can hardly give a continuous probability that an answer is correct, only a hard yes/no. Also, the answers do not involve true negatives or false positives (there is no ground-truth correct/incorrect label for each answer; the task is still generation instead of classification). The same reasoning applies to GPT4: the reported GPT4_score is also the proportion of cases in which GPT4 judges the generated answer to match the given one.
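To make the two quantities concrete, here is a minimal sketch of how the reported accuracy and the GPT4-human agreement in Table 7 could be computed from binary judgments (the judgment lists are toy placeholders, not the actual annotation data):

```python
# Minimal sketch: accuracy from binary human judgments, and
# agreement between human and GPT-4 judgments (as in Table 7).
# The judgment lists below are toy placeholders, not real annotations.

human_judgments = [1, 1, 0, 1, 0, 1]   # 1 = prediction judged correct by a human
gpt4_judgments  = [1, 1, 0, 0, 0, 1]   # 1 = prediction judged correct by GPT-4

# Reported "accuracy": proportion of predictions the human judges correct.
accuracy = 100 * sum(human_judgments) / len(human_judgments)

# Human-GPT4 agreement (%): proportion of QAs where the two judges coincide.
agreement = 100 * sum(h == g for h, g in zip(human_judgments, gpt4_judgments)) / len(human_judgments)

print(f"accuracy = {accuracy:.2f}%, agreement = {agreement:.2f}%")
```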
> Q8: GPT4_score strongly favors the GPT4-32k model in long dependency QA (Table 4, 54.09 >> 42.12/45.04). Since there is not enough evidence in the paper to suggest GPT4_score is a better metric (though the paper seems to assume so implicitly), it would be much more convincing to include a traditional metric for detailed analyses (e.g., Fig 2, Fig 3) rather than GPT4_score alone.

**A:** Thanks for your insightful advice. We have included both traditional metrics and GPT4_score in the great majority of the evaluations in Section 4.2 and Appendix E (covering all the main results of model comparisons on LooGLE). As for GPT4 as an evaluator in our paper, we have indeed considered this and spared no effort to make the evaluation as fair as possible, in the following ways:

1) **Many recent research studies have shown that the GPT4 evaluator exhibits high consistency with human evaluation and can serve as a reliable annotator** to some extent. Here are some related works for your reference:
[1] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
[2] Do large language models show decision heuristics similar to humans? A case study using GPT-3.5
[3] Calibrating LLM-based evaluator

2) We randomly selected over 400 questions from each task in long dependency QA and evaluated the accuracy from both GPT4's and the human perspective. The results are in Table 7. **It can be seen that GPT4 can make human-like judgments on our dataset.** To provide a more comprehensive assessment, we utilize GPT4 as an LLM evaluator to obtain reference results.

3) **We make our implementation reproducible and GPT4's judgment deterministic** by setting its temperature to 0 and top_p to 1, and prompting GPT4 to output True/False/exact-score only, instead of descriptive results. From our observations of the experiment results, the GPT4 evaluator shows no bias in itself when scoring.

4) **How we use GPT4-eval is also delicately designed to ensure fairness.** Given the question (QA tasks only), the ground truth, and the predicted outputs, we ask GPT4 to compare and score considering semantic matching, information completeness, consistency, fluency, and grammar. In this way, GPT4 can focus on the comparisons without bias or tendency for a better evaluation. The detailed prompt can be seen in Appendix F.

The automatic metrics measure the generated texts from different aspects and are easily influenced by the output length, the occurrence of keywords, and the format of expression, introducing bias. Therefore, we select GPT4_score for a brief visualization of results in Tables 3 and 4; more details can be found in the original tables.
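As a rough illustration of the deterministic evaluator setup described in point 3) above, here is a minimal sketch using the OpenAI chat completions client. The model name, prompt wording, and response parsing are simplified assumptions for illustration only; the actual prompt is the one given in Appendix F.

```python
# Minimal sketch of a deterministic GPT-4-based judge (assumed setup, not the
# exact implementation): temperature 0 and top_p 1, with the model instructed
# to answer "True" or "False" only. The prompt text here is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt4_judge(question: str, ground_truth: str, prediction: str) -> bool:
    prompt = (
        "Given the question, the ground-truth answer, and a model prediction, "
        "decide whether the prediction is semantically consistent with the "
        "ground truth. Answer with exactly one word: True or False.\n\n"
        f"Question: {question}\nGround truth: {ground_truth}\nPrediction: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",          # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # deterministic judgment
        top_p=1,
    )
    return resp.choices[0].message.content.strip().lower().startswith("true")
```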
> Q9: Table 7 in the Appendix could be included in the main text.

**A:** Thanks for the comment. We have included Table 7 in the main text on Page 6.

> Q10: Why present different questions & predictions for different LLMs in Appendix H? Wouldn't predictions on the same questions intrigue further comparisons?

**A:** Thanks for your comment. The overall performance of the various models has been thoroughly compared in Section 4 and Appendix E. We further present typical examples of diverse model predictions in Appendix H to show the huge generation discrepancy in content, length, and quality mentioned in the main text. The disparity can be attributed to the training strategies and techniques used to extend these models to longer contexts. It may lead to some degree of performance degradation and can shed light on enhancing the "true long-context understanding" ability of LLMs instead of merely enlarging their context window. Based on your advice, it is also feasible to add more examples that compare different models on the same questions, and we will work on it in the revised paper.

> Q11: Fig. 4 should explicitly state this is the result of the short-dependency cloze task.

**A:** Thanks for your advice; we have revised the title of Figure 4 to "Cloze performance with varying segments".

> Q12: Reorder Fig 3 so the short QA and cloze tasks are closer together and longQA and summarization together.

**A:** Thanks for the comment. We have revised Figure 3 to put tasks of the same category closer together.

> Q13: Inappropriate citations for IAA (line 339) and metrics such as BLEU, ROUGE, etc. (lines 374-376).

**A:** Thanks for your kind reminder. We have added appropriate citations for IAA and the metrics you mentioned in the revised paper.

> Q14: Typos.
> Line 51: extra space in the author name "s Koˇ ciský" (check bibtex)
> Line 116: redundant indentation before "sports"
> Line 214, 343: extra space before footnote "2"
> Line 369: use \citet{} for "Liu et al., 2023a"
> Line 386: use \citet{} for "Suri et al., 2023; Liu et al., 2023b; Zheng et al., 2023"
> Ensure consistent task naming: "calculation" vs. "computation" in Fig 1
> Lines 2104-2115: formatting issue
> Are questions in Appendix H.7 (page 27) intended?
> Is empty output intended in line 2241?

**A:** Thank you for kindly pointing out the typos. The formatting in lines 2104-2115, the questions in H.7, and the empty output in line 2241 are all intentional: they reproduce the original output generated by the LLM. The other typo issues have been fixed and highlighted in the revised paper.

## **To Reviewer QkhE**

We sincerely thank the reviewer for acknowledging our contribution in proposing a novel long-context benchmark with **clearly illustrated data definitions, collection, and careful annotation, as well as tasks covering various angles**. We would like to offer further discussion in the following sections.

> Q1: In Section 4.2.3, the paper only explains why a longer input window does not help summarization, but lacks an explanation for the long dependency QA results, where longer context input decreases model performance.

**A:** Thanks for your comments. For **results evaluated by GPT4**, it can be noticed that the longer context indeed improves performance. For **results evaluated by automatic metrics**, model performance decreases in some settings, possibly because models with a larger context window tend to generate longer outputs and are therefore penalized by these automatic metrics.

> Q2: Metric descriptions are confusing. Table 6 presents results for each task and Table 4 shows the aggregated results for long dependency QA tasks. I do not know which metric is used for Table 6 and how individual task scores are aggregated. Using GPT-4-32k as an example, individual task scores are (33, 26, 22, 44), while the aggregated score in Table 4 equals the average of the four tasks.

**A:** Thank you for pointing out your confusion. We have already introduced the evaluator and evaluation metrics for Table 6 in Section 4.3.2; see "we employed GPT4 as the evaluator, and the accuracy results are available in Table 6" in lines 482-483. Further, the GPT4_score in Table 4 is not directly aggregated from the individual task scores in Table 6:

1) GPT4_score in Table 4 is calculated as: (Number of correct QAs / Total number of long dependency QAs) * 100.
2) GPT4_score for each task in Table 6 is calculated as: (Number of correct QAs for one specific task / Total number of QAs in the same task category) * 100.

The former score in Table 4 is therefore not the sum of the latter scores, nor in general their simple average: it is a micro-average over all long dependency QAs, i.e., a combination of the per-task scores weighted by the number of QAs in each task. Please feel free to ask for further clarification if needed.
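A tiny sketch of the two aggregation rules above; the per-task counts are made-up placeholders, purely to show that the overall micro-averaged GPT4_score generally differs from the unweighted mean of per-task scores:

```python
# Toy illustration of the two GPT4_score formulas above.
# (correct, total) counts per long-dependency task -- placeholder numbers only.
per_task_counts = {
    "comprehension": (120, 300),
    "multi_retrieval": (90, 300),
    "timeline": (40, 250),
    "calculation": (70, 287),
}

# Table 6 style: per-task GPT4_score.
per_task_score = {t: 100 * c / n for t, (c, n) in per_task_counts.items()}

# Table 4 style: overall GPT4_score over all long-dependency QAs (micro-average).
total_correct = sum(c for c, _ in per_task_counts.values())
total_qas = sum(n for _, n in per_task_counts.values())
overall_score = 100 * total_correct / total_qas

print(per_task_score)
print(f"overall = {overall_score:.2f}")
print(f"unweighted mean of per-task scores = {sum(per_task_score.values()) / 4:.2f}")
# The two printed aggregates differ unless every task has the same number of QAs.
```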
> Q3: How are the few-shot examples selected when evaluating with CoT in Section 4.2.3?

**A:** We have a few considerations when selecting the few-shot examples for CoT:

1) **Diverse problem types**: examples should cover all 4 categories of long dependency QAs introduced in LooGLE for generalization.
2) **Long dependency**: examples whose evidence spans a wider portion of the document are preferred. Besides, the number of evidence pieces varies across questions to keep variety.
3) **Clear question with a precise answer**: representative examples with clearer questions and answers help the LLM learn from a limited number of shots for better performance.

Since few-shot CoT examples consume the limited context window of the LLM, examples that follow the above standards while using fewer words are preferred.
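Purely as an illustration of the three criteria above, here is a hypothetical filter over candidate examples; all field names, task labels, and thresholds are assumptions for exposition, not the actual selection procedure:

```python
# Hypothetical sketch of the selection criteria (not the actual LooGLE tooling).
from dataclasses import dataclass

@dataclass
class Candidate:
    task: str                          # one of the 4 long-dependency QA categories
    evidence_spans: list               # (start, end) character offsets of evidence
    n_words: int                       # length of the worked CoT example

def evidence_coverage(c: Candidate, doc_len: int) -> float:
    # Fraction of the document separating the first and last evidence span.
    lo = min(s for s, _ in c.evidence_spans)
    hi = max(e for _, e in c.evidence_spans)
    return (hi - lo) / doc_len

def pick_examples(candidates, doc_len, max_words=400):
    chosen = []
    # Criterion 1: cover every long-dependency category.
    for task in ("comprehension", "multi_retrieval", "timeline", "calculation"):
        # Criterion 3 proxy: keep examples short enough to fit the context window.
        pool = [c for c in candidates if c.task == task and c.n_words <= max_words]
        if pool:
            # Criterion 2: prefer the example whose evidence spans the widest portion of the text.
            chosen.append(max(pool, key=lambda c: evidence_coverage(c, doc_len)))
    return chosen
```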
> Q4: What's the language of the curated dataset? Please specify the data language in the paper.

**A:** Thank you for pointing this out. The current version of our dataset is English-only. We have also specified the data language in the revised paper on Page 3, Section 3.

## **To Reviewer iRYd**

We sincerely thank the reviewer for acknowledging our contribution in proposing a novel long-context benchmark with **persuasive and robustly designed tasks, and a huge effort to manually create long dependency QAs with comprehensive categories. The experiments are well conducted and the experimental results are generally sound.** We would like to offer further discussion in the following sections.

> Q1: A little overclaiming. The author mentioned that previous datasets suffer from "outdated documents" and highlighted the "relatively new" documents as a main contribution. However, their documents are actually: arXiv after Jan 22, Wikipedia after March 22, and TV scripts after 22, which doesn't seem very fresh. It's OK for the LLMs tested in the paper, but more recent LLMs such as GPT-4-turbo and Llama-2 got an updated knowledge cutoff of July and April 23. So, I prefer the authors not to use this as a selling point.

**A:** We sincerely thank the reviewer for pointing this out; it is a good suggestion.

1) We have revised the expressions in the introduction and reorganized the relevant content as part of the benchmark introduction in Section 3.
2) We truly spent much effort, from the beginning of multi-source data collection, through robust task design, to manual generation. The annotation underwent a meticulous three-step process to ensure rigorous cross-validation. It took nearly three months of continuous annotation and careful refinement to achieve the final high quality of the questions, answers, and supporting evidence, as well as a high degree of accuracy, precision, and relevance to the documents' content. The whole generation of LooGLE was a demanding and sustained process.
3) After continuously testing the latest long-context LLMs, such as LLaMA2-7B-32K, we were glad to find that the results lead to the same findings as the paper's current conclusions.

> Q2: The author seems to have little knowledge of existing long context benchmarks. The author only discusses two similar benchmarks that focus on long context, but there are actually plenty more. Like the one I know released by lmsys in July that solely focuses on information retrieval (where this draft did a better job in including more types of tasks). I think the paper will benefit from checking more recent references.

**A:** Thank you for your advice. We have indeed conducted a thorough survey and solid comparative work on relevant long-context datasets and models. As you mentioned, within the LongEval test suite [1], LMSYS has put forward two synthetic tasks that specifically focus on long context processing: coarse-grained topic retrieval and fine-grained line retrieval. The primary objective of these tasks is to assess how effectively models extract information from synthetic data when dealing with extended contexts. Furthermore, another research paper, Giraffe [2], introduced two additional datasets. One of them, known as LongChat-Lines, is similar to the task proposed by LMSYS and serves as a synthetic fine-grained key-value retrieval task. The second dataset is the Wiki QA dataset, which encompasses two distinct tasks. The first task, Free Form QA (FFQA), involves identifying single-word or short-sentence answers that are found verbatim in the input. The second task, Altered Numeric QA (AltQA), transforms the answer into a numeric representation to mitigate the potential risk of data leakage during the pretraining phase.

According to the definitions outlined in our paper, the tasks within these datasets primarily revolve around short dependency scenarios. Our dataset, in contrast, encompasses not only short dependency tasks but also long dependency tasks; this key distinction sets our dataset apart from others in the field.

[1] [How Long Can Open-Source LLMs Truly Promise on Context Length? | LMSYS Org](https://lmsys.org/blog/2023-06-29-longchat/)
[2] Giraffe: Adventures in Expanding Context Lengths in LLMs

We have added the above-mentioned works to the references and discussion. Besides, we will also add more related works and make sure to discuss all new long context benchmarks released recently.

> Q3: The author mentioned data leakage issues for previous datasets. Will LooGLE be affected by this problem? The paper will benefit from adding references about data leakage.

**A:** Thank you for your advice; we have added references about data leakage in the revised paper (see Page 1). We will continuously and periodically update the dataset with newer documents, diverse and challenging tasks, and assessments of the latest long-context LLMs, to alleviate the data leakage issue and keep up with ever-changing knowledge.
