<style> .red { color: red; } </style>

# [Element-aware Summarization with Large Language Model: Expert-aligned Evaluation and Chain-of-Thought Method](https://arxiv.org/abs/2305.13412)

:::danger
**Comments:** Accepted at ACL 2023
**Github:** https://github.com/Alsace08/SumCoT
:::

## 1. Introduction

- Existing studies commonly train or fine-tune language models on large-scale corpora.
- However, some standard datasets have been shown to be noise-enriched, mainly in terms of information redundancy and factual hallucination.
- Extensive experiments have shown that reference summaries in these standard datasets perform poorly on human assessment dimensions, especially coherence, consistency, and relevance.

:::info
To fill this gap, this work releases expert-written <span class="red">Element-aware summary test sets</span>. In professional news writing, core elements such as character, time, place, and event are indispensable.
:::

- Using the new test sets, we are surprised to find that the <span class="red">zero-shot performance of large language models (LLMs) is highly competitive with some strong fine-tuned pre-trained models (PLMs)</span>, while the performance of PLMs declines compared to the standard test sets.
- Inspired by the competitive zero-shot performance of LLMs and the chain-of-thought technique, we create <span class="red">Summary Chain-of-Thought (SumCoT)</span> to elicit LLMs to generate summaries step by step.

## 2. Element-aware Summary Test Set

### 2.1 Data Construction

1. We select two standard news summarization datasets (test sets) as document sources:
    - **CNN/DailyMail** provides a large-scale, multi-domain news collection and is representative of single-document datasets.
    - **BBC XSum** provides a highly abstractive news collection.
2. We ask three news experts to independently write professional summaries for 200 randomly sampled source documents according to a complete <span class="red">writing protocol</span>.
3. We require one of the experts to lead the writing, and the other two to judge the completed summary on the four dimensions from the protocol.
4. For **CNN/DailyMail**, a summary is written in **25-30 minutes** on average; for **BBC XSum**, in **15-20 minutes** on average.

### 2.2 Writing Protocol

We divide the protocol into **micro** demands and **macro** demands.

- **Micro** demands:
    - Emphasize our targets, namely element awareness.
    - All news summaries should contain four essential core elements: Entity, Date, Event, and Result.

![image](https://hackmd.io/_uploads/B18b_43Ha.png =70%x)

:::info
News elements are highlighted with different color shadows. It is clear that our element-aware summary covers more comprehensive elements, and the logical connection between the elements is smoother.
:::

- **Macro** demands:
    - Guarantee the professionalism and objectivity of the overall writing quality.
    - All news summaries should satisfy four dimensions:
        1. **Fluency**: No spelling, grammatical, or syntactic errors within sentences.
        2. **Coherence**: The summary should not be a heap of events; linguistic transitions must be smooth and logically correct.
        3. **Consistency**: No hallucinated facts; facts that do not appear in, or contradict, the source document are not allowed.
        4. **Relevance**: Adequately weigh the importance of multiple facts, and find the core concern of the text.

### 2.3 Overall Quality

#### 2.3.1

![image](https://hackmd.io/_uploads/B16Vt4nrT.png =60%x)

- The **average length of element-aware summaries** largely matches the length distribution of the dataset-specific summaries.
- In terms of abstractiveness, we report the **percentage of novel n-grams**, i.e., n-grams that appear in the summary but not in the source document.

#### 2.3.2

![image](https://hackmd.io/_uploads/HJCRjN2B6.png =60%x)

- We further hold a vote on two highly subjective dimensions, **logical coherence** and **factual importance**, which reflect the **professionalism** and the **information comprehensiveness** of the writing.
- Element-aware summaries are clearly preferred by the public on both subjective dimensions.

### 2.4 Element-aware Characteristic

![image](https://hackmd.io/_uploads/B16PlrnSa.png =55%x)

:::info
![image](https://hackmd.io/_uploads/HyCIp42Sp.png =50%x)

- **Precision**: Accuracy of the core elements embedded in the summary.
- **Recall**: Hit rate of the core elements in the source document.
    1. For the $i$-th annotator ($i = 1, 2, 3$) and the $j$-th element type in the writing protocol ($j = 1, 2, 3, 4$),
    2. we ask the annotator to release **two sets** that separately contain all $j$-th elements they consider important in the source document and all $j$-th elements appearing in the summary.
    3. The annotator-released sets for the source document and the summary are denoted as $A_i^j$ and $A_i'^j$, respectively.
- **F1 score**: The harmonic mean of Precision and Recall, measuring the overall level.
:::
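To make these definitions concrete, one plausible formalization consistent with the description above is the following (the exact aggregation over annotators $i$ and element types $j$ is an assumption here, not taken verbatim from the paper):

$$
\mathrm{Precision}_i^j = \frac{|A_i^j \cap A_i'^j|}{|A_i'^j|},
\qquad
\mathrm{Recall}_i^j = \frac{|A_i^j \cap A_i'^j|}{|A_i^j|},
\qquad
\mathrm{F1}_i^j = \frac{2 \cdot \mathrm{Precision}_i^j \cdot \mathrm{Recall}_i^j}{\mathrm{Precision}_i^j + \mathrm{Recall}_i^j}
$$

where the reported scores would then be averaged over the three annotators and the four element types.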
:::success
1. Our test sets have a significant advantage on the element-aware characteristic.
2. The dataset-specific test sets perform particularly poorly on the Recall score, meaning that they ignore many fine-grained details.
:::

## 3. Preliminary Comparison: Zero-shot LLMs Versus Fine-tuned PLMs

We preliminarily compare existing strong LLMs and PLMs on our element-aware test sets, aiming to analyze the general summarization capabilities of zero-shot LLMs and fine-tuned PLMs from a more fine-grained perspective.

### 3.1 Experimental Setup

#### Dataset

For each source document in both datasets, we compare model-generated summaries against the dataset-specific (original) and element-aware (ours) reference summaries.

#### Models

- PLMs: **BART** and **T5** (two strong generation-oriented PLMs), and **PEGASUS** (a summarization-customized PLM), fine-tuned on the two datasets separately as strong baselines.
- LLMs: **175B-parameter GPT-3**

#### Implementation

- PLMs: We use the official fine-tuned checkpoints released on Hugging Face.
- LLMs:
    - CNN/DailyMail: "Summarize the above article: " as the standard prompt
    - BBC XSum: "Summarize the above article in one sentence: "

:::info
All source documents are **truncated to 1024 tokens when using PLMs** and **2048 tokens when using LLMs**.
:::

#### Evaluation

- ROUGE-1/2/L
- BERTScore
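Below is a minimal sketch of this zero-shot setup, assuming the legacy OpenAI Python SDK (<1.0), a GPT-2 tokenizer for the 2048-token truncation, and the `rouge_score` and `bert_score` packages for evaluation. The engine name `text-davinci-002`, the decoding parameters, and the helper names are assumptions for illustration, not details stated in the note.

```python
import openai  # legacy SDK (<1.0); uses openai.Completion
from transformers import GPT2TokenizerFast
from rouge_score import rouge_scorer
from bert_score import score as bert_score

openai.api_key = "YOUR_API_KEY"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Standard prompts from the paper (dataset-dependent).
PROMPTS = {
    "cnndm": "Summarize the above article: ",
    "xsum": "Summarize the above article in one sentence: ",
}

def truncate(document: str, max_tokens: int = 2048) -> str:
    """Keep only the first `max_tokens` tokens of the source document."""
    ids = tokenizer(document)["input_ids"][:max_tokens]
    return tokenizer.decode(ids)

def zero_shot_summary(document: str, dataset: str = "cnndm") -> str:
    """Zero-shot GPT-3 summarization with the standard prompt."""
    prompt = truncate(document) + "\n\n" + PROMPTS[dataset]
    resp = openai.Completion.create(
        model="text-davinci-002",  # assumed engine; the note only says 175B GPT-3
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

def evaluate(candidate: str, reference: str) -> dict:
    """ROUGE-1/2/L F-measures and BERTScore F1 against a single reference."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {k: v.fmeasure for k, v in scorer.score(reference, candidate).items()}
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {**rouge, "bertscore_f1": float(f1.mean())}
```

For a single document, one would call, e.g., `evaluate(zero_shot_summary(doc, "xsum"), reference_summary)`, and average the scores over the test set.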
### 3.2 Main Result

![image](https://hackmd.io/_uploads/rydrHr3Sp.png)

:::info
- Longitudinal Comparison (Language Models): **compare the performance of different models on the same test set**.
- Horizontal Comparison (Test Sets): **compare the performance of the same model on different test sets**.
:::

### 3.3 Human Study

:::info
1. Human studies are conducted as **an overall quality assessment of human preferences**.
2. We use a 7-point Likert scale and ask annotators to evaluate four dimensions: Fluency, Coherence, Consistency, and Relevance.
3. We <span class="red">set the element-aware summaries as the baseline (score 0)</span> and <span class="red">set the scoring range to -3 to 3</span>.
:::

![image](https://hackmd.io/_uploads/S1n6FS2ra.png =60%x)

:::success
All of these results demonstrate that LLMs have great potential for summarization, and that a higher-quality dataset is key for evaluation.
:::

## 4. Towards Element-oriented Summary: Chain-of-Thought Method

- We see that GPT-3 performs surprisingly well on our element-aware test sets.
- GPT-3 has great potential for fine-grained zero-shot summary writing.
- We further enhance the summarization ability of LLMs with a CoT-based method (SumCoT).

:::success
SumCoT **elicits LLMs to focus on the news core elements**, thereby generating element-aware summaries step by step.
:::

### 4.1 Two-stage Pipeline

![image](https://hackmd.io/_uploads/H1Bs2S2B6.png)

:::info
- Stage 1: **Core element extraction**
    1. We create guiding-question prompts to elicit the LLMs to extract the four core elements: Entity, Date, Event, and Result.
    2. We then concatenate these questions into $Q = [q_1, q_2, q_3, q_4]$.
    3. Let the source document be $S$; the LLM input in this stage is formulated as $[S; Q]$.
- Stage 2: **Multiple information integration and summarization**
    1. Next, we integrate the extracted elements with more detailed information from the source document.
    2. We concatenate the source document, the questions, the answers $A$, and a simple prompt $p'$, "Let's integrate the above information and summarize the article:", to prompt the LLMs for summary generation.
    3. The input in this stage is formulated as $[S; Q; A; p']$, and the output is the final summary.
:::
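As a concrete illustration of this pipeline, here is a minimal sketch of the two-stage prompt construction. The guiding questions below are illustrative paraphrases, not the paper's exact wording, and `complete` stands in for any text-completion call such as the GPT-3 helper sketched in §3.1.

```python
# Hypothetical guiding questions for the four core elements (Entity, Date, Event, Result).
GUIDING_QUESTIONS = [
    "1. What are the important entities in this document?",
    "2. What are the important dates in this document?",
    "3. What events are happening in this document?",
    "4. What is the result of these events?",
]

INTEGRATION_PROMPT = "Let's integrate the above information and summarize the article:"

def sumcot(source_document: str, complete) -> str:
    """Two-stage SumCoT: (1) extract core elements, (2) integrate and summarize.

    `complete` is any text-completion function, e.g. the zero-shot GPT-3 helper from §3.1.
    """
    # Stage 1: core element extraction. Input is [S; Q].
    questions = "\n".join(GUIDING_QUESTIONS)
    stage1_input = f"{source_document}\n\n{questions}"
    answers = complete(stage1_input)  # A: the LLM's answers to q1..q4

    # Stage 2: information integration and summarization. Input is [S; Q; A; p'].
    stage2_input = f"{source_document}\n\n{questions}\n{answers}\n\n{INTEGRATION_PROMPT}"
    return complete(stage2_input)
```

Both stages share the same source document $S$ and question list $Q$; only the extracted answers $A$ and the final integration prompt are added in the second call.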
### 4.2 Comprehensive Evaluation

#### 4.2.1

![Screenshot 2023-12-05 151713](https://hackmd.io/_uploads/HJE5RrnS6.png)

:::info
Case comparison between GPT-3 zero-shot summaries before and after using SumCoT. Spans of Entity, Date, Event, and Result are highlighted in red, yellow, blue, and green, respectively.
:::

#### 4.2.2

![image](https://hackmd.io/_uploads/H1YYJU2S6.png)

:::success
GPT-3 successfully focuses on more core elements through SumCoT and better fits the element-aware writing pattern.
:::

#### 4.2.3

![image](https://hackmd.io/_uploads/rJzHxUhSa.png =70%x)

:::success
The SumCoT technique further improves the performance of the standard zero-shot paradigm in all dimensions, particularly coherence and relevance.
:::

### 4.3 Better Understanding SumCoT

#### 4.3.1 How does SumCoT affect summary writing?

1. **Final summaries are extremely faithful to the extracted elements**, particularly on CNN/DailyMail.
2. The coverage of each element is relatively lower on BBC XSum due to its one-sentence style.
3. **The coverage of Date is significantly low, probably due to extraction errors.**

#### 4.3.2 Is the element extraction accurate and comprehensive?

![image](https://hackmd.io/_uploads/ByEsMU2ST.png =75%x)

:::info
There is a strong correlation between **element extraction** and **summary generation**.
:::

![image](https://hackmd.io/_uploads/Byt_Q83Ba.png =60%x)

:::success
1. Results show that extraction performs strongly **except for Date**, because <span class="red">Date hallucination is particularly evident, with non-existent dates being extracted</span>.
2. **Precision is usually lower than Recall**, because <span class="red">element redundancy often occurs</span>.
:::

#### 4.3.3 Does the model size limit SumCoT?

![image](https://hackmd.io/_uploads/BJmwVInHa.png)

:::success
1. When the model size is small, element extraction is almost invalid.
2. As the model size increases, GPT-3 can extract every type of element one by one, but the extraction itself contains many errors or redundancies.
3. **Only at the largest model size is the element extraction human-approved.**
:::

## 5. Conclusion

Overall, our main contributions are three-fold:

1. We construct expert-written element-aware summary test sets to evaluate general summarization systems more objectively (§2).
2. We explore the zero-shot summarization ability of LLMs on the new test sets and demonstrate that their writing ability cannot be fully reflected by standard test sets (§3).
3. We propose a new CoT-based summarization technique, which allows LLMs to generate more fine-grained summaries step by step (§4).