# [Summarization is (Almost) Dead](https://arxiv.org/abs/2309.09558)

## 1. Introduction
* LLM-generated summaries are significantly preferred by human evaluators and also demonstrate higher factuality.
* After sampling and examining 100 summarization-related papers published in ACL, EMNLP, NAACL, and COLING in the past 3 years, we find that the main contribution of about 70% of the papers was to propose a summarization approach and validate its effectiveness on standard datasets.

:::success
We acknowledge existing challenges in the field of text summarization, such as:
1. The need for high-quality reference datasets
2. Application-oriented approaches
3. Improved evaluation methods
:::

## 2. Experimental Setting
This section covers the datasets and models used in the human evaluation of summaries, as well as the experimental procedure and details.

### 2.1 Datasets
:::info
To ensure that the large language models have not "seen" the data during training (i.e., a zero-shot setting), we use the latest data to build a dataset specifically for human evaluation in each summarization task. Each dataset consists of 50 samples.
:::
1. Single-news, multi-news, and dialogue summarization tasks
    - Dataset construction approach: follow the construction methodology of
        * CNN/DailyMail
        * Multi-News
        * MediaSum
2. Cross-lingual summarization task
    - Dataset construction approach:
        * Translate the reference summaries in our single-news dataset from English to Chinese using Google Translate, followed by a post-editing process.
3. Code summarization task
    - Dataset construction approach:
        * [Methodology](https://arxiv.org/pdf/2110.01710.pdf)

### 2.2 Models
1. Large Language Models (LLMs)
    * GPT-3
    * GPT-3.5
    * GPT-4
2. Fine-tuned models
    * Single-news task: BART, T5
    * Multi-news task: Pegasus, BART
    * Dialogue task: T5, BART
    * Cross-lingual task: mT5, mBART
    * Code task: CodeT5

### 2.3 Experimental process and details
* Human evaluation procedure:
    1. We hire two annotators.
    2. Each annotator is assigned to complete all 50 questions of a single task.
    3. For each question, they are presented with a source article and the summaries from all summarization systems selected for that task.
    4. They are then asked to compare the summaries pairwise.

## 3. Experiment Results
### Experiment 1: Comparing the overall quality of summaries
![image](https://hackmd.io/_uploads/ByDzPemHp.png)
:::info
"WinRate" is the proportion of times system M is preferred over system N by the human evaluators; it reflects the relative overall quality of the two systems.

Concretely, WinRate is defined by the following formula:
![image](https://hackmd.io/_uploads/SJho3-XSp.png)
WinRate of M over N = (# times M is selected) / (# times M is selected + # times N is selected)

Here, "selected" means the system whose summary the human evaluator prefers when comparing the two summaries produced by M and N.

For example, if system A is chosen 30 times and system B is chosen 20 times over 50 comparisons, then A's WinRate relative to B is 30 / (30 + 20) = 60%, indicating that A's overall quality is better than B's.

By computing and comparing WinRates between systems, the preferences of the human evaluators can be read off directly. The paper uses this metric to compare the LLMs, the human-written reference summaries, and the fine-tuned models (a small code sketch follows at the end of this subsection).
:::
* Summaries generated by the LLMs consistently outperform both the human-written summaries and the summaries generated by fine-tuned models across all tasks.
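
To make the WinRate definition above concrete, here is a minimal sketch of the computation. The data layout (one `(system M, system N, winner)` tuple per pairwise judgment) and the toy numbers are my own illustrative assumptions, not results from the paper.

```python
from collections import Counter

# Toy pairwise judgments: (system M, system N, winner chosen by the annotator).
# The system names and outcomes below are illustrative only.
judgments = [
    ("GPT-4", "BART", "GPT-4"),
    ("GPT-4", "Human", "GPT-4"),
    ("Human", "BART", "Human"),
    ("GPT-4", "BART", "BART"),
]

def win_rate(judgments, system_m, system_n):
    """WinRate of system_m over system_n:
    (# times M is selected) / (# times M is selected + # times N is selected)."""
    wins = Counter()
    for a, b, winner in judgments:
        if {a, b} == {system_m, system_n}:
            wins[winner] += 1
    total = wins[system_m] + wins[system_n]
    return wins[system_m] / total if total else float("nan")

print(f"WinRate(GPT-4 over BART) = {win_rate(judgments, 'GPT-4', 'BART'):.0%}")
```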
### Experiment 2: Comparing the factual consistency of summaries
* Annotators are asked to identify sentence-level hallucinations in the human-written and LLM-generated summaries, allowing us to compare their levels of factual consistency.

![Sentence-level hallucination counts](https://hackmd.io/_uploads/HyMshlmHT.png =70%x)
:::success
1. The number of sentence-level hallucinations found in GPT-4 and human-written summaries; figures that are significantly larger are highlighted.
2. Human-written reference summaries exhibit an equal or higher number of hallucinations compared to GPT-4 summaries.
:::

:::info
#### Types of Hallucination
#### 1. Intrinsic Hallucination
- Def: the factual information in the summary is inconsistent with the source text.
- Ex: Source: "Xiao Ming finished first in the race." Generated summary: "Xiao Ming finished last in the race."
#### 2. Extrinsic Hallucination
- Def: the summary includes factual information that is not present in the source text.
- Ex: Source: "Xiao Ming runs very fast." Generated summary: "Xiao Ming finished first in the race."
:::

![Proportion of extrinsic hallucinations](https://hackmd.io/_uploads/rJA00e7rp.png =70%x)
:::success
1. The proportion of extrinsic hallucinations in GPT-4 and human-written summaries.
2. Extrinsic hallucinations occur notably more often in the tasks where human-written summaries show poor factual consistency.
:::

### Comparative Analysis
1. Reference summaries vs. LLM summaries
    * Human-written reference summaries sometimes lack fluency.
    * Human-written reference summaries are sometimes flawed with incomplete information. (I take this to mean that the human-written reference summaries omit some key points.)
2. Summaries generated by fine-tuned models vs. LLM summaries
    * Fine-tuned models tend to produce summaries with a fixed and rigid length.
    * When the input contains multiple topics, the summaries generated by fine-tuned models cover these topics less well.

## 4. The Changing Landscape of Summarization: Seeking New Horizons
1. The quality of summaries generated by LLMs surpasses that of the reference summaries in many datasets.
2. Previous summarization methods were often tailored to specific categories, domains, or languages, resulting in limited generality, and their significance is gradually diminishing.
3. As mentioned in the introduction, nearly 70% of the research is no longer meaningful. However, we believe that the following directions are worth exploring:

### 4.1 Summarization Datasets
* The role of datasets shifts from model training to testing, which necessitates higher-quality reference summaries.
* Previously generated datasets will gradually be phased out, and future reference summaries will require annotation by human experts.

:::success
In order to thoroughly assess the summarization capabilities of LLMs, it becomes imperative to incorporate other diverse genres of data, as well as other languages, especially those that are low-resource in nature.

In other words, the quality of past automatically constructed summarization datasets is no longer sufficient; future datasets will need high-quality reference summaries annotated by human experts so that the summarization ability of models can be properly evaluated.
:::

### 4.2 Summarization Approaches
* Customized Summarization
* Real-time Summarization
* Interactive Summarization

### 4.3 Summarization Evaluation
* Future [automated evaluation techniques for summarization](https://arxiv.org/pdf/2302.14520.pdf) hold promise in their reliance on LLMs, as demonstrated by recent studies (see the sketch at the end of this note).
* Extrinsic Evaluation

## 5. Conclusion
* LLM summaries exhibit superior fluency, factuality, and flexibility, especially in specialized and uncommon summarization scenarios.
* We also offer an outlook on the tasks worth exploring in the field of text summarization in the future, focusing on three aspects: datasets, methods, and evaluation.

### 5.1 Limitations
1. We do not include other popular LLMs such as LLaMA and Vicuna,
:::warning
because these newer models do not disclose the cutoff date of their training data. This lack of information makes it challenging for us to create a novel dataset specifically tailored for evaluating the zero-shot generation of summaries by LLMs.
:::
2. Due to the high cost, we only conduct human experiments on five common text summarization tasks. (Summarization can be applied in many other settings as well, e.g., slide summarization.)

### 5.2 Ethics Statement
We obtain text from publicly available websites, and it is possible that some of the texts may contain biased, offensive, or violent elements.
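
### Appendix: Sketch of LLM-based pairwise evaluation (cf. Section 4.3)

To make the LLM-based evaluation direction mentioned in Section 4.3 a bit more concrete, here is a minimal sketch of pairwise summary comparison with an LLM judge. The prompt wording, the choice of `gpt-4`, and the function name `llm_prefer` are my own illustrative assumptions, not the protocol of the paper or of the cited evaluation work.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_prefer(source: str, summary_a: str, summary_b: str) -> str:
    """Ask an LLM which of two candidate summaries it prefers for the given source text."""
    prompt = (
        "Source text:\n" + source + "\n\n"
        "Summary A:\n" + summary_a + "\n\n"
        "Summary B:\n" + summary_b + "\n\n"
        "Which summary is better overall (faithfulness, coverage, fluency)? Answer 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Example usage (toy inputs):
# print(llm_prefer("Xiao Ming runs very fast.", "Xiao Ming is fast.", "Xiao Ming won first place."))
```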