<style>
.red {
color: red;
}
</style>
# [Element-aware Summarization with Large Language Model: Expert-aligned Evaluation and Chain-of-Thought Method](https://arxiv.org/abs/2305.13412)
:::danger
**Comments:** Accepted at ACL 2023
**Github:** https://github.com/Alsace08/SumCoT
:::
## 1. Introduction
- Existing studies commonly train or fine-tune language models on large-scale corpora.
- However, some standard datasets have been shown to be noisy, mainly in terms of information redundancy and factual hallucination.
- Extensive experiments have shown that reference summaries in these standard datasets perform poorly on human assessment dimensions, especially coherence, consistency, and relevance.
:::info
To fill this gap, this work releases expert-written <span class="red">Element-aware summary test sets</span>. In professional news writing, core elements such as character, time, place, event, etc., are indispensable.
:::
- Utilizing the new test sets, we are surprised to find that the <span class="red">zero-shot performance of large language models (LLMs) is highly competitive with some strong fine-tuned pre-trained language models (PLMs)</span>, while the performance of PLMs declines compared to the standard test sets.
- Inspired by the competitive zero-shot performance of LLMs and the chain-of-thought technique, we create <span class="red">Summary Chain-of-Thought (SumCoT)</span> to elicit LLMs to generate summaries step by step.
## 2. Element-aware Summary Test Set
### 2.1 Data Construction
1. We select two standard news summary datasets (test sets) as document sources:
- **CNN/DailyMail** provides a large-scale, multi-domain news collection and is representative of single-document datasets.
- **BBC XSum** provides a highly abstractive news collection.
2. We ask three news experts to independently write professional summaries for 200 randomly sampled source documents according to a complete <span class="red">writing protocol</span>.
3. We require one of the experts to lead the writing, and the other two to judge the completed summary in four dimensions from the protocol.
4. For **CNN/DailyMail**, a summary is written in **25-30 minutes** on average, and for **BBC XSum**, in **15-20 minutes** on average.
### 2.2 Writing Protocol
We divide the protocol into **micro** demands and **macro** demands.
- **Micro** Demands:
- Emphasize our target, namely element awareness.
- All news summaries should have four essential core elements — Entity, Date, Event, and Result.

:::info
News elements are highlighted with different colored shading. It is clear that our element-aware summary covers the elements more comprehensively, and the logical connections between the elements are smoother.
:::
- **Macro** Demands:
- Guarantee the professionalism and objectivity of the overall writing quality.
- All news summaries should focus on the four dimensions:
1. **Fluency**: No spelling, grammatical, or syntactic errors within sentences.
2. **Coherence**: The summary should not be a heap of events, and linguistic transition must be smooth and logically correct.
3. **Consistency**: No hallucinated facts; facts that do not appear in, or are contrary to, the source document are not allowed.
4. **Relevance**: Adequately weigh the importance of multiple facts, and find the core concern of the text.
### 2.3 Overall Quality
#### 2.3.1

- The **average length of element-aware summaries** largely matches the length distribution of the dataset-specific summaries.
- In terms of abstraction, we report the **percentage of novel n-grams** that are included in the summary but not in the source document.
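For concreteness, a minimal sketch of how the percentage of novel n-grams can be computed (plain whitespace tokenization is an assumption; the paper may tokenize differently):

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_percentage(summary: str, document: str, n: int = 2) -> float:
    """Percentage of summary n-grams that never appear in the source document."""
    summary_ngrams = ngrams(summary.lower().split(), n)
    document_ngrams = ngrams(document.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    return 100.0 * len(summary_ngrams - document_ngrams) / len(summary_ngrams)

# A fully extractive summary yields 0% novel n-grams; abstractive rewording raises the percentage.
```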
#### 2.3.2

- We further hold a vote on two highly subjective dimensions, **logical coherence** and **factual importance**, which reflect the **professionalism** and **information comprehensiveness** of the writing.
- It is clear that element-aware summaries are significantly more widely accepted by the public in both subjective dimensions.
### 2.4 Element-aware Characteristic

:::info
- **Precision**: Accuracy of the core elements embedded in the summary.
- **Recall**: Hit rate of the core elements in the source document.
- **F1 score**: The harmonic mean of Precision and Recall, measuring the overall level.

For the $i$-th annotator ($i = 1, 2, 3$) and the $j$-th element in the writing protocol ($j = 1, 2, 3, 4$):
1. We ask this annotator to release **two sets** that separately contain all $j$-th elements in the source document they consider important and all $j$-th elements appearing in the summary.
2. The annotator-released sets for the source document and the summary are denoted as $A_i^j$ and $A_i'^j$, respectively (a scoring sketch follows this block).
:::
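A minimal sketch of how these element-level scores could be computed from one annotator's released sets (the function and variable names are illustrative, not the authors' code):

```python
def element_scores(doc_elements: set, summary_elements: set):
    """Element-aware Precision, Recall, and F1 for one annotator and one element type.

    doc_elements:     j-th elements in the source document judged important (A_i^j)
    summary_elements: j-th elements appearing in the summary (A'_i^j)
    """
    overlap = doc_elements & summary_elements
    precision = len(overlap) / len(summary_elements) if summary_elements else 0.0
    recall = len(overlap) / len(doc_elements) if doc_elements else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: the document has three important entities; the summary covers two of them
# plus one entity judged unimportant, giving P = R = F1 = 2/3.
p, r, f1 = element_scores({"Obama", "White House", "Congress"},
                          {"Obama", "White House", "Senate"})
```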
:::success
1. Our test sets have a significant advantage in the element-aware characteristic.
2. The dataset-specific test sets perform poorly, particularly on the Recall score, meaning that they ignore many fine-grained details.
:::
## 3. Preliminary Comparison: Zero-shot LLMs Versus Fine-tuned PLMs
We preliminarily compare existing strong LLMs and PLMs on our element-aware test sets, aiming to analyze the general summarization capabilities of zero-shot LLMs and fine-tuned PLMs from a more fine-grained perspective.
### 3.1 Experimental Setup
#### Dataset
For each source document on both datasets, we compare summaries generated by models with dataset-specific (original) and element-aware (ours) reference summaries.
#### Models
- PLMs: **BART** and **T5** (two strong generation-oriented PLMs), and **PEGASUS** (a summarization-customized PLM), fine-tuned on the two datasets separately as strong baselines
- LLMs: **175B-parameter GPT-3**
#### Implementation
- PLMs: We follow the official fine-tuned models released on Hugging Face
- LLMs:
- CNN/DailyMail: "Summarize the above article: " as the standard prompt
- BBC XSum: "Summarize the above article in one sentence: "
:::info
All the source documents are **truncated to 1024 tokens when using PLMs** and **2048 tokens when using LLMs**.
:::
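A rough sketch of the zero-shot LLM setup under these settings. The model name, the whitespace-level truncation, and the legacy `openai.Completion` call are assumptions for illustration, not the authors' exact script:

```python
import openai  # legacy (<1.0) SDK; assumes OPENAI_API_KEY is set in the environment

STANDARD_PROMPTS = {
    "cnn_dailymail": "Summarize the above article:",
    "xsum": "Summarize the above article in one sentence:",
}

def zero_shot_summary(document: str, dataset: str, max_doc_tokens: int = 2048) -> str:
    # Rough whitespace-level truncation standing in for the 2048-token limit used for LLMs.
    truncated = " ".join(document.split()[:max_doc_tokens])
    prompt = f"{truncated}\n\n{STANDARD_PROMPTS[dataset]}"
    response = openai.Completion.create(
        model="text-davinci-002",  # assumed 175B GPT-3 variant
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()
```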
#### Evaluation
- ROUGE-1/2/L
- BERTSCORE
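Scores of this kind can be reproduced with the `rouge-score` and `bert-score` packages; a minimal sketch (not the authors' evaluation script, which may use different settings):

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(prediction: str, reference: str) -> dict:
    # ROUGE-1/2/L F1 against a single reference summary
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, prediction)
    # BERTScore (returns precision, recall, F1 tensors)
    _, _, f1 = bert_score([prediction], [reference], lang="en")
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore": f1.mean().item(),
    }
```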
### 3.2 Main Result

:::info
- Longitudinal Comparison (Language Models): **compare the performance of different models on the same test set**.
- Horizontal Comparison (Test Sets): **compare the performances of the same model on different test sets**.
:::
### 3.3 Human Study
:::info
1. Human studies are conducted as **an overall quality assessment of human preferences**.
2. We use a 7-point Likert scale to ask annotators to evaluate four dimensions: Fluency, Coherence, Consistency, and Relevance.
3. We <span class="red">set the element-aware summaries as the baseline (score 0)</span> and <span class="red">set the scoring range to -3~3</span>.
:::

:::success
All of these results can fully demonstrate that LLMs have great potential for summarization, and a higher-quality dataset is key for evaluation.
:::
## 4. Towards Element-oriented Summary: Chain-of-Thought Method
- We see that GPT-3 performs surprisingly well on our element-aware test sets.
- GPT-3 has great potential for fine-grained zero-shot summary writing.
- We further enhance the summarization ability of LLMs by leveraging a CoT-based method (SumCoT).
:::success
SumCoT **elicits LLMs to focus on news core elements**, thereby generating element-aware summaries step by step.
:::
### 4.1 Two-stage Pipeline

:::info
- Stage 1: **Core element extraction**
1. We create guiding-question prompts to elicit the LLMs to extract four core elements: Entity, Date, Event, and Result.
2. Then we concatenate these questions into $Q = [q_1, q_2, q_3, q_4]$.
3. Let the source document be $S$; the LLM input at this stage is formulated as $[S; Q]$.
- Stage 2: **Multiple information integration and summarization**
1. Next, we integrate the extracted elements with more detailed information from the source document.
2. We concatenate the source document, the questions, the answers, and a simple prompt "Let's integrate the above information and summarize the article:" to prompt the LLMs for summary generation (a sketch follows this block).
3. The input at this stage is formulated as $[S; Q; A; p']$, and the output is the final summary.
:::
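A minimal sketch of the two-stage input construction described above. The guiding-question wording and the `complete` helper (any text-in/text-out LLM call) are illustrative assumptions; see the official repo for the exact prompts:

```python
# Guiding questions for the four core elements (wording is illustrative, not the paper's exact prompts).
GUIDING_QUESTIONS = [
    "1. What is the main entity in this article?",            # Entity
    "2. When did the main event happen?",                     # Date
    "3. What is the main event described in this article?",   # Event
    "4. What is the result or consequence of the event?",     # Result
]
INTEGRATION_PROMPT = "Let's integrate the above information and summarize the article:"

def sumcot(document: str, complete) -> str:
    """Two-stage SumCoT pipeline; `complete` maps a prompt string to the LLM's text output."""
    questions = "\n".join(GUIDING_QUESTIONS)
    # Stage 1: core element extraction -- input [S; Q]
    answers = complete(f"{document}\n\n{questions}")
    # Stage 2: information integration and summarization -- input [S; Q; A; p']
    stage2_input = f"{document}\n\n{questions}\n{answers}\n\n{INTEGRATION_PROMPT}"
    return complete(stage2_input)
```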
### 4.2 Comprehensive Evaluation
#### 4.2.1

:::info
Case comparisons between GPT-3 zero-shot summaries before and after using SumCoT. Spans of Entity, Date, Event, and Result are separately highlighted in red, yellow, blue, and green.
:::
#### 4.2.2

:::success
These results demonstrate that GPT-3 successfully focuses on more core elements through SumCoT and further fits the element-aware writing pattern.
:::
#### 4.2.3

:::success
The results indicate that the SumCoT technique further improves the performance of the standard zero-shot paradigm in all dimensions, particularly coherence and relevance.
:::
### 4.3 Better Understanding SumCoT
#### 4.3.1 How does SumCoT affect summary writing?
1. **Final summaries are extremely faithful to the extracted elements**, particularly on CNN/DailyMail.
2. On BBC XSum, the coverage of each element is relatively lower due to its one-sentence style.
3. **The coverage of Date is significantly low, probably due to extraction errors** (a naive coverage sketch follows this list).
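Coverage here can be read as the fraction of Stage-1 extracted elements that reappear in the final summary; a naive string-match sketch of such a measure (an assumption about how it is computed, not the paper's exact metric):

```python
def element_coverage(extracted_elements, summary: str) -> float:
    """Fraction of extracted element strings that appear verbatim in the final summary.

    A naive approximation; the paper's measurement may differ (e.g., manual checking).
    """
    if not extracted_elements:
        return 0.0
    hits = sum(1 for element in extracted_elements if element.lower() in summary.lower())
    return hits / len(extracted_elements)
```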
#### 4.3.2 Is the element extraction accurate and comprehensive?

:::info
This demonstrates a strong correlation between **element extraction** and **summary generation**.
:::

:::success
1. Results show that the extraction performs well **except for Date**, because <span class="red">Date hallucination is particularly evident: the model tends to extract non-existent dates</span>.
2. **Precision is usually lower than Recall**, because <span class="red">element redundancy often occurs</span>.
:::
#### 4.3.3 Does the model size limit SumCoT?

:::success
1. When the model size is small, element extraction is almost invalid.
2. As the model size increases, GPT-3 can extract all types of elements one by one, but the extraction itself contains many errors or redundancies.
3. **Only at the largest model size is the element extraction human-approved.**
:::
## 5. Conclusion
Overall, our main contributions are three-fold:
1. We construct expert-written element-aware summary test sets to evaluate general summarization systems more objectively (§2).
2. We explore the zero-shot summarization ability of LLMs on the new test sets and demonstrate that their writing ability cannot be fully reflected by standard test sets (§3).
3. We propose a new CoT-based summarization technique, which allows the LLMs to generate more fine-grained summaries step by step (§4).