<style>
.red {
color: red;
}
.blue {
color: blue;
}
.green {
color: green;
}
</style>
# [Toward Unifying Text Segmentation and Long Document Summarization](https://arxiv.org/abs/2210.16422)
:::danger
**Comments:** Accepted at EMNLP 2022 (Long Paper)
**Github:** https://github.com/tencent-ailab/Lodoss
:::
## 1. Introduction
- One of the most effective ways to summarize a long document is to extract salient sentences.
- While **abstractive strategies** produce more condensed summaries, they pose a more difficult generation challenge and suffer from hallucinations and factual errors.
- **Extractive summaries** have the potential to be highlighted on their source materials to facilitate viewing; e.g., Google’s browser allows text extracts to be highlighted on a webpage via a shareable link.
- <span class='red'>As a document grows in length, it becomes crucial to bring structure to it.</span> Examples include chapters, sections, paragraphs, headings, and bulleted lists.
- <span class='red'>All of them allow readers to find salient content buried within the document.</span>
- Particularly, having *sections* is a differentiating factor between a long document (e.g., a research article) and a mid-length one (e.g., a news article).
- Writing a long document thus requires the author to meticulously organize the content into sections.
- They exploit document structure by <span class='red'> hierarchically building representations from words to sentences, then to larger sections and documents.</span>
## 2. Our Approach
- The task of long document summarization is significantly more challenging than other summarization tasks. It has a high compression rate, e.g., >85%, excluding most sentences from the summary and suggesting an extractive summarizer must be able to accurately identify summary-worthy sentences.

- <span class='red'>It learns robust sentence representations by performing both tasks simultaneously.</span>
- Further, it introduces <span class='red'>a new regularizer drawing on determinantal point processes</span> to measure the quality of all summary sentences collectively, <span class='red'>ensuring they are informative and have minimum redundancy.</span>
- We employ the **Longformer**
:::info
1. equipped with dilated window attention to produce contextualized token embeddings for an input document.
2. **Windowed attention** allows each token to attend only to its local window to reduce computation and memory usage.
3. It has the added benefit of easing section segmentation.
:::
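The windowed-attention idea can be illustrated with a toy attention mask; the sequence length and window size below are arbitrary, not Longformer's actual configuration:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean attention mask: token i may attend only to tokens within
    window // 2 positions on either side (its local window)."""
    idx = np.arange(seq_len)
    # |i - j| <= window // 2  →  token j lies inside token i's local window
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

mask = sliding_window_mask(seq_len=8, window=4)
# Each row allows at most window + 1 positions instead of seq_len,
# so attention cost grows linearly with sequence length.
print(mask.sum(axis=1))  # → [3 4 5 5 5 5 4 3]
```

Dilating the window across layers (as the model does, 32 at the bottom up to 512 at the top) widens this band, enlarging the receptive field without full quadratic attention.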
:::info
This points out an added benefit of the Longformer design: it eases section segmentation.
Specifically, Longformer uses dilated sliding-window attention, where each token attends only to a local window rather than the entire long document, reducing computation and memory usage.
This design also helps capture section-boundary information: because each token focuses only on nearby text, the model can more easily detect a section break, since the current window contains both new vocabulary and words partially shared with the previous section.
Longformer's design therefore not only handles long inputs but also naturally captures section-boundary information, benefiting segmentation and, in turn, tasks such as summary extraction.
:::
- The left and right context of a section break can be captured by the local window, which reveals any words they have in common and new words seen in this context.
:::info
What this sentence means:
The local window captures the context on both sides of a section break, revealing which words the two sides share and which words newly appear.
In more detail:
A long document typically contains multiple sections; the context around each boundary shares some vocabulary (since it still discusses the same general subject) but also introduces new words signaling a topic shift.
With dilated sliding-window attention, each layer attends only to a local window rather than the whole document. This reduces computation while still letting the model capture the local context on either side of a section break. By observing the vocabulary in this window, the model can spot the shared words and the newly appearing ones, and thus identify likely segmentation points.
Observing vocabulary shifts within the local window is therefore useful for judging the document's logical structure and section boundaries.
:::
- Our Longformer model utilizes a large position embeddings matrix, allowing it to process long documents up to 16K tokens.
- We use <span class='red'>dilation, changing window size across layers from 32 (bottom) to 512 (top) </span>to increase its receptive field.

- Our summarizer is built on top of Longformer by **stacking two layers of inter-sentence Transformers to it**.
- We append **[CLS]** to the beginning of each sentence and **[SEP]** to the end. <span class='red'>This modified sequence of tokens is sent to Longformer for token encoding.</span>
- We obtain the vector of the i-th [CLS] token as the representation for sentence si with rich contextual information. These vectors are added to sinusoidal position embeddings, then passed to two layers of inter-sentence Transformers to capture document-level context.
:::info
What this passage means:
The vector of the i-th [CLS] token is taken as the representation of sentence s_i, carrying rich contextual information. These vectors are added to sinusoidal position embeddings, then passed to two layers of inter-sentence Transformers to capture document-level context.
Step by step:
1. A [CLS] token is prepended and a [SEP] token appended to each sentence, delimiting the sentences.
2. The marked token sequence is fed to Longformer to obtain a vector representation for each token.
3. The vector of the i-th [CLS] token is taken as the representation of sentence s_i; it already encodes the sentence's context.
4. The sentence vectors are added to sinusoidal position embeddings, which encode each sentence's relative position in the document.
5. The summed vectors are fed into two layers of inter-sentence Transformers, which capture higher-level, document-wide relationships.
6. The output is the final sentence representations {h_1, h_2, ..., h_N}, encoding both sentence-level and document-level context.
The goal is to move from the sentence level up to the document level, fully exploiting inter-sentence and whole-document context to represent each sentence's importance effectively.
:::
- Such global context is especially important for identifying salient sentences, whereas sinusoidal position embeddings indicate the relative position of sentences.
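The pipeline described above (gather [CLS] vectors, add sinusoidal position embeddings, run two inter-sentence Transformer layers) can be sketched in PyTorch. The sizes `N`, `d`, and the random stand-in for Longformer's [CLS] outputs are illustrative placeholders, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

def sinusoidal_positions(n: int, d: int) -> torch.Tensor:
    """Standard sinusoidal position embeddings for n positions of size d."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / d)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Hypothetical shapes: N sentences, hidden size d (Longformer would supply these).
N, d = 12, 64
cls_vectors = torch.randn(N, d)          # stand-in for the per-sentence [CLS] vectors

# Add sentence-position information, then model document-level context
# with two inter-sentence Transformer layers.
x = cls_vectors + sinusoidal_positions(N, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)
h = encoder(x.unsqueeze(0)).squeeze(0)   # h_1..h_N: document-contextualized vectors
```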
### Summarization and Section Segmentation

:::success
Equations (1) and (2) define the computations for the two tasks, summarization and section segmentation:
Eq. (1):
ŷ_sum,i = σ(w_sum^T h_i + b_sum)
computes the probability that the i-th sentence is selected as a summary sentence.
- ŷ_sum,i predicts whether the i-th sentence is a summary sentence (1 = yes, 0 = no).
- σ is the sigmoid function, mapping the linear score into (0, 1) to yield a probability.
- w_sum is a weight vector and h_i is the hidden representation of the i-th sentence; their inner product plus the bias b_sum forms the sigmoid's input.
Eq. (2):
ŷ_seg,i = σ(w_seg^T h_i + b_seg)
computes the probability that the i-th sentence starts (or ends) a section.
- ŷ_seg,i predicts whether the i-th sentence is a section boundary (1 = yes, 0 = no).
- The computation mirrors Eq. (1), but uses separate parameters w_seg and b_seg.
Both equations share the same sentence representation h_i while learning different weights, so the model makes the summarization and segmentation predictions simultaneously. This multi-task strategy helps the model learn more robust sentence representations, improving long-document understanding and summary quality.
:::
- Particularly, y_sum,i = 1 indicates the i-th sentence is to be included in the summary.
- y_seg,i = 1 suggests the sentence starts (or ends) a section.
- <span class='red'>A section usually starts or concludes with summary-worthy sentences.</span>
- <span class='red'>Predicting section boundaries helps us effectively locate those sentences.</span>
- The **discourse cues** for identifying major section boundaries, e.g., “so next we need to...,” are portable across domains and genres.

:::success
Equations (3) and (4) define the loss functions for the summarization and segmentation tasks:
Eq. (3):
L_sum = -1/N * Σ_i [ y_sum,i * log(ŷ_sum,i) + (1 - y_sum,i) * log(1 - ŷ_sum,i) ]
is the binary cross-entropy loss used to train the summarization task.
- N is the total number of sentences.
- y_sum,i is the ground-truth label (0 or 1) of the i-th sentence.
- ŷ_sum,i is the probability predicted by Eq. (1).
- Each sentence contributes a negative log-likelihood term from its label and predicted probability.
- Averaging over all sentences gives the final L_sum.
Eq. (4):
L_seg = -1/N * Σ_i [ y_seg,i * log(ŷ_seg,i) + (1 - y_seg,i) * log(1 - ŷ_seg,i) ]
is likewise a binary cross-entropy loss, but for the section segmentation task.
- The symbols mirror Eq. (3).
- y_seg,i and ŷ_seg,i are the ground-truth label and predicted probability of the i-th sentence being a section boundary.
This loss pushes the model's predictions toward the section boundaries marked in the ground-truth data.
In short, Eqs. (3) and (4) define the summarization and segmentation losses; jointly optimizing both during training lets the model learn summary extraction and section segmentation at the same time.
:::
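Equations (1)–(4) can be sketched in a few lines of PyTorch. The sizes, random sentence representations, and random gold labels below are placeholders for illustration only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes: N sentences with hidden size d; h would come from the
# two inter-sentence Transformer layers described earlier.
N, d = 12, 64
h = torch.randn(N, d)                       # sentence representations h_1..h_N
y_sum = torch.randint(0, 2, (N,)).float()   # gold summary labels
y_seg = torch.randint(0, 2, (N,)).float()   # gold section-boundary labels

# Eq. (1) and (2): two sigmoid heads over the SAME representations h_i,
# with separate parameters (w_sum, b_sum) and (w_seg, b_seg).
sum_head = nn.Linear(d, 1)
seg_head = nn.Linear(d, 1)
y_sum_hat = torch.sigmoid(sum_head(h)).squeeze(-1)
y_seg_hat = torch.sigmoid(seg_head(h)).squeeze(-1)

# Eq. (3) and (4): per-sentence binary cross-entropy, averaged over N.
bce = nn.BCELoss()
L_sum = bce(y_sum_hat, y_sum)
L_seg = bce(y_seg_hat, y_seg)
L_joint = L_sum + L_seg                     # Lodoss-joint: L(Θ) = L_sum + L_seg
```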
- Our **base model, “Lodoss-base,”** minimizes the per-sentence empirical cross-entropy of the model w.r.t. gold-standard summary labels (Eq. (3)).
:::info
What this sentence means:
Our base model "Lodoss-base" optimizes its parameters by minimizing the per-sentence empirical cross-entropy loss, pushing the model's outputs toward the gold-standard summary labels.
In more detail:
1. "Lodoss-base" is the base summarization model proposed in the paper.
2. Minimizing the "per-sentence empirical cross-entropy" means minimizing the average cross-entropy loss over all training sentences.
3. For each sentence, the prediction ŷ_sum,i is the model's estimate of whether the sentence belongs in the summary (1 = include, 0 = exclude).
4. y_sum,i is the gold standard, determined by the human-written reference summary.
5. The cross-entropy loss measures the gap between the prediction ŷ_sum,i and the label y_sum,i.
6. Minimizing the average cross-entropy over all training sentences (Eq. (3)) gradually adjusts the parameters so predictions approach the labels.
7. This sentence-level training helps the model learn which sentences are important enough to include in the summary.
Thus Lodoss-base trains reliable sentence representations by closing the gap to the gold answers, enabling accurate extraction of important sentences from long documents.
:::
- It learns to identify salient sentences even though content salience may vary across datasets.
- Our **joint model, “Lodoss-joint,”** optimizes both tasks through multi-task learning: *L*(Θ) = *L*sum + *L*seg.
- It adds to the robustness of derived sentence representations, because the acquired knowledge for section segmentation is more transferable across domains.
:::info
What this sentence means:
Training summarization and section segmentation jointly strengthens the robustness of the learned sentence representations, because the knowledge acquired for segmentation transfers better across domains.
In more detail:
1. Training summary extraction and text segmentation together is called "multi-task learning."
2. Besides labeling important sentences for the summary, the model also predicts whether each sentence starts a new section.
3. To segment sections, the model must learn cues such as topic shifts, inter-sentence coherence, and changes in key vocabulary.
4. The knowledge gained from recognizing section structure is more general than merely learning which sentences are important.
5. Authors across different domains and genres tend to use similar writing techniques and rhetorical cues when organizing sections and expressing logical flow.
6. Segmentation knowledge and skills learned in one domain therefore apply and transfer more readily to other domains.
7. This more transferable knowledge makes the final sentence representations more robust than those trained on summarization alone, so the model performs well across domains.
In short, the segmentation task gives the model more general-purpose knowledge, improving the generalization and adaptability of the learned sentence representations.
:::
- Here, ŷ_sum,i and ŷ_seg,i are the predicted scores for summarization and segmentation; y_sum,i and y_seg,i are the ground-truth sentence labels.
### A DPP Regularizer
- It is especially important to **collectively** measure the quality of a set of extracted sentences, instead of handling sentences individually.
- We introduce a new regularizer leveraging the determinantal point processes to encourage a set of salient and diverse sentences to be selected for the summary.
- With the DPP regularizer, a ground-truth summary Y ′ is expected to achieve the highest probability score compared to alternatives.
- It provides a summary-level training objective that complements the learning signals of our Lodoss-joint summarizer.
:::info
What this passage means:
Compared to alternative summary candidates, we want the human-annotated reference (ground-truth) summary Y′ to receive the highest probability score. The DPP regularizer provides a summary-level training objective that complements the sentence-level learning signals of the Lodoss-joint summarizer.
In more detail:
1. A DPP (determinantal point process) is a probabilistic model for selecting a subset of elements that is both representative and diverse.
2. The authors apply DPPs to summary sentence selection: an ideal summary Y′ consists of a set of important, non-redundant sentences.
3. They optimize the DPP score function so that the reference summary Y′ receives a higher probability than alternative candidate subsets.
4. This probability score measures the collective quality of a set of summary sentences, accounting for both sentence importance and diversity.
5. Adding the DPP regularizer to the training objective supplements the sentence-level signal (cross-entropy loss) with a summary-level one.
6. Sentence-level learning helps the model judge the importance of individual sentences, while the DPP regularizer provides a complementary global objective, ensuring the selected summary sentences are collectively important and non-redundant.
7. Jointly optimizing both objectives teaches the model to produce high-quality summaries whose sentences are important and complementary.
The DPP regularizer thus adds a global training objective that helps the model select the most ideal overall set of summary sentences.
:::

- <span class='red'>DPP defines a probabilistic measure for scoring a subset of sentences.</span>
- Let *Y* = {1, 2, ..., N} be the ground set containing N sentences. The probability of a subset Y ⊆ *Y*, corresponding to an extractive summary, is given by *P*(Y) = det(*L*_Y) / det(*L* + *I*).
:::info
This sentence introduces the following concepts:
*Y* = {1, 2, ..., N} denotes the ground set, containing all N sentences of the document.
P(Y) is the probability of selecting a subset Y from these N sentences. The subset Y corresponds to an extractive summary, i.e., a set of sentences drawn from the full text to form the summary.
In other words, the ground set *Y* contains all N sentences of the document, and the probability P(Y) of a subset Y ⊆ *Y* serves as a probabilistic model over extractive summaries: the selected subset Y is the content of the resulting summary.
:::
- where det(·) is the determinant of a matrix; L ∈ RN×N is a **positive semi-definite matrix**.
- Lij indicates the similarity between sentences i and j.
- LY is a principal minor of L indexed by elements in Y.
- I is an identity matrix of the same dimension as L.
:::info
This passage explains that L_Y is a principal minor (submatrix) of L.
In detail:
L is an N × N positive semi-definite matrix whose entry L_ij represents the similarity between sentences i and j.
When a subset Y ⊆ *Y* is selected as the summary, L_Y is the |Y| × |Y| submatrix of L formed by the rows and columns indexed by the elements of Y, where |Y| is the size of the subset.
For example, with five sentences {1, 2, 3, 4, 5} and the selected subset Y = {2, 4}, L_Y is the 2 × 2 minor:
L_Y = [[L_22, L_24], [L_42, L_44]]
That is, L_Y is indexed by the elements of Y and contains only the entries of L relevant to the subset.
:::
- We make use of the quality-diversity decomposition for constructing <span class='red'>**L: L = diag(q) · S · diag(q)**</span>
- where q ∈ RN represents the quality of sentences.
- S ∈ RN×N captures the similarity of sentence pairs.
- In our model, the sentence quality score qi is given by ŷsum,i (Eq. (1)), indicating its importance to the summary.
- The sentence similarity score is defined by:

- We employ batch matrix multiplication (BMM) to efficiently perform batch matrix-matrix products.
:::info
This describes one way to construct the matrix L, known as the quality-diversity decomposition:
L = diag(q) ⋅ S ⋅ diag(q)
- diag(q) is the diagonal matrix whose diagonal entries are the vector q, the per-sentence quality scores.
- S is an N × N matrix capturing the pairwise similarities between sentences.
- ⋅ denotes matrix multiplication.
The decomposition thus combines sentence quality and pairwise similarity when constructing L. Concretely:
1) The diagonal matrix diag(q) places each sentence's quality score on the diagonal.
2) It is multiplied by the pairwise similarity matrix S.
3) Multiplying by diag(q) once more folds the quality scores into the final result.
Through this quality-diversity decomposition, the resulting L captures both each sentence's importance (via the quality scores q) and inter-sentence diversity (via the similarity matrix S), and can then be used for tasks such as summary sentence selection.
:::
- A summary containing two sentences i and j has a high probability score *P*(Y = {i, j}) if the sentences are of high quality and dissimilar from each other.
- Conversely, if two identical sentences are included in the summary, the determinant det(*LY*) is zero.
- Modeling pairwise repulsiveness helps increase the diversity of the selected sentences and eliminate redundancy.
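The redundancy effect can be checked numerically: a DPP assigns probability det(L_Y) / det(L + I) to a subset Y, and duplicating a sentence drives det(L_Y) to zero. A toy numpy sketch, with random embeddings standing in for sentence vectors:

```python
import numpy as np

np.random.seed(0)
# Toy kernel over 4 sentences; sentences 0 and 1 get identical embeddings
# to demonstrate the redundancy effect described above.
v = np.random.randn(4, 8)
v[1] = v[0]
L = v @ v.T          # Gram matrix → positive semi-definite kernel

def dpp_prob(L: np.ndarray, Y: list) -> float:
    """P(Y) = det(L_Y) / det(L + I); L_Y is the principal minor indexed by Y."""
    L_Y = L[np.ix_(Y, Y)]
    return float(np.linalg.det(L_Y) / np.linalg.det(L + np.eye(len(L))))

print(dpp_prob(L, [0, 2]))   # distinct sentences → positive probability
print(dpp_prob(L, [0, 1]))   # identical sentences → det(L_Y) = 0, probability ≈ 0
```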

- As illustrated in Eq. (6), our DPP regularizer is defined as the negative log-probability of the ground-truth extractive summary Y′.
- It has the practical effect of promoting selection of the ground-truth summary while down-weighting alternatives.
:::info
This explains the practical effect of the paper's proposed DPP regularization term:
It raises the probability of selecting the ground-truth summary while down-weighting alternative summaries.
In more detail:
1. The ground-truth summary is the true reference summary annotated from the source document.
2. Alternative summaries are the other combinations of summary sentences the system might generate.
3. The paper proposes a DPP-based regularization term that measures the collective quality of a set of summary sentences.
4. The regularizer is designed so that, during training, the model assigns higher probability to the true ground-truth summary and lower weight to unreasonable alternatives.
5. In other words, the DPP term acts as a guide, steering the model toward sentences that are genuinely important and diverse in content, and away from redundant or implausible combinations.
Overall, the regularizer's practical effect is to raise the probability of selecting the ground-truth summary during training and lower the weight of alternatives, effectively guiding the model toward a more sensible summarization strategy.
:::

- Our **final model, “Lodoss-full,”** is shown below.

- It adds the DPP regularizer to the joint model (Eq. (7)).
- β is a coefficient that balances sentence-level cross-entropy losses and summary-level DPP regularization.
- Θ are all of our model parameters.
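Putting Eqs. (6) and (7) together, the DPP regularizer and the full objective can be sketched as follows. The identity similarity matrix, random quality scores, and gold index set are toy placeholders, and the cross-entropy terms L_sum and L_seg from Eqs. (3)–(4) are omitted:

```python
import torch

def dpp_regularizer(q: torch.Tensor, S: torch.Tensor, gold: list) -> torch.Tensor:
    """Negative log-probability of the ground-truth summary Y' (Eq. 6) under a
    DPP with quality-diversity kernel L = diag(q) · S · diag(q)."""
    L = torch.diag(q) @ S @ torch.diag(q)
    idx = torch.tensor(gold)
    L_gold = L[idx][:, idx]                        # principal minor L_{Y'}
    # -log P(Y') = -log det(L_{Y'}) + log det(L + I)
    return -torch.logdet(L_gold) + torch.logdet(L + torch.eye(len(q)))

# Hypothetical inputs: q_i = ŷ_sum,i from Eq. (1); S would be a sentence-similarity
# matrix (an identity matrix stands in here).
q = torch.rand(6) * 0.8 + 0.1
S = torch.eye(6)
L_dpp = dpp_regularizer(q, S, gold=[1, 4])

beta = 0.1
# Lodoss-full objective (Eq. 7): L(Θ) = (L_sum + L_seg) + β · L_DPP,
# with L_sum and L_seg the per-sentence cross-entropy losses.
```

This is also why the paper trains the full model in FP32: `logdet` relies on a decomposition of L that is numerically fragile in half precision.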
## 3. Experiments
- Our datasets include collections of scientific articles and lecture transcripts, their associated summaries and section boundaries.
- We contrast our approach with strong summarization baseline systems and report results on three standard benchmarks.
### 3.1 Datasets
- For <span class='red'>written documents</span>, we choose to experiment with **scientific articles** as they follow a logical document structure.
- They come with human summaries and sections, delimited by top-level headings.
- The scientific articles are gathered from two open-access repositories: arXiv.org and PubMed.com.
- These datasets contain 148K and 216K instances, respectively.
- For <span class='red'>transcript summarization</span>, we utilize **lectures** gathered from VideoLectures.NET.
- **The transcripts are time aligned with lecture slides.**
- All utterances aligned to a single slide are grouped into a cluster; together they form a transcript section.
- Text extracted from the slides is used as the ground-truth summary.
:::info
This passage describes how the paper processes lecture-video transcripts and derives the corresponding summaries and topic segments.
In detail:
1. The data comes from VideoLectures.NET and includes lecture videos with their automatic speech-to-text transcripts.
2. Each video's speech segments are time-aligned with the lecture slides.
3. All speech segments aligned to the same slide are treated as a cluster, constituting one transcript section.
4. The text on each slide is extracted and used as that section's ground-truth summary.
To summarize:
The content shown on each slide is treated as the summary of the segment it accompanies.
All transcribed speech aligned to that slide is treated as the detailed discussion of that summary, forming an independent topic section.
In this way, the paper uses slide text to automatically generate weakly supervised reference summaries and topic-segmentation labels for the whole transcript, which are used to guide model training.
:::
- The dataset contains a total of 9,616 videos. Each video contains about 33 slides.
- It helps us lay the groundwork for unifying summarization and segmentation on spoken documents, using lecture slides to provide **weak supervision** for both tasks.

### Ground-Truth Labels
- A label ysum,i is assigned to each sentence of the document: 1 indicates the sentence belongs to the ORACLE summary, 0 otherwise.
- An ORACLE is created by adding one sentence at a time incrementally to the summary, so that it improves the average of ROUGE-1 and -2 F-scores.
:::info
This passage describes how the ORACLE summary is constructed incrementally.
The ORACLE summary is a reference standard, representing the best extractive summary obtainable by selecting sentences from the source document. Concretely:
1. Initially, the ORACLE summary is an empty set containing no sentences.
2. The document's sentences are then considered one by one.
3. Each candidate sentence is tentatively added to the current ORACLE set.
4. The evaluation score (the average of ROUGE-1 and -2 F-scores) of the tentative set is computed.
5. If the score improves, the sentence is kept in the ORACLE set; if it drops, the sentence is removed.
6. After all sentences have been considered, the remaining set is the ORACLE summary.
This incremental, greedy procedure, adding one sentence at a time whenever it improves the score, constructs an ORACLE set that (approximately) maximizes the evaluation score.
The ORACLE summary represents the best quality an extractive method can achieve on the document, and serves as the reference standard for training extractive summarizers.
:::
- <span class='red'>ORACLE summaries give the ceiling performance of an extractive summarizer.</span>
- ORACLE summaries for scientific papers are created by us; those for lecture transcripts are provided by Lv et al. (2021) generated by aligning transcript utterances with lecture slides.
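The greedy ORACLE construction can be sketched as follows. The `ngram_f1` function is a simplified stand-in for the ROUGE-1/-2 F-scores used in the paper, and the sentences are toy data:

```python
from collections import Counter

def ngram_f1(candidate: list, reference: list, n: int = 1) -> float:
    """Simple n-gram F1, a lightweight stand-in for the ROUGE-n F-score."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    c, r = ngrams(candidate), ngrams(reference)
    overlap = sum((c & r).values())
    if not c or not r or overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def greedy_oracle(sentences: list, reference: list) -> list:
    """Greedily add the sentence that most improves the average of unigram
    and bigram F1; stop when no remaining sentence improves the score."""
    selected, best = [], 0.0
    while True:
        gains = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            cand = [t for j in selected + [i] for t in sentences[j]]
            score = (ngram_f1(cand, reference, 1) + ngram_f1(cand, reference, 2)) / 2
            gains.append((score, i))
        if not gains:
            break
        score, i = max(gains)
        if score <= best:
            break
        selected.append(i)
        best = score
    return sorted(selected)

sents = [["the", "model", "works"], ["we", "eat", "cake"], ["results", "are", "strong"]]
ref = ["the", "model", "works", "and", "results", "are", "strong"]
print(greedy_oracle(sents, ref))  # → [0, 2]
```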
- <span class='red'>Scientific papers</span> come with sections: we specify yseg,i = 1 if it is the first (or last) sentence of a section, 0 otherwise.
- Both the first and last sentences of a section could contain discourse connectives indicating a topic shift.
- Clear document structure depends on establishing where a section ends and the next one begins.
- For <span class='red'>lectures</span>, all transcript utterances are time aligned with lecture slides, creating mini-sections.
### System Predictions
- At inference time, our system extracts a fixed number of sentences (*K*) from an input document.
- <span class='red'>These sentences have the highest probability scores according to Eq. (1).</span>
- K is chosen to be close to the average number of sentences per reference summary.
- We set **K=7 and 5 for the PubMed and arXiv datasets**, respectively. We use **K=3 for lectures**.
- Section predictions are given by Eq. (2).
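Inference-time selection is then just a top-K over the Eq. (1) scores. A minimal sketch, with made-up per-sentence probabilities:

```python
import numpy as np

def select_top_k(scores: np.ndarray, k: int) -> list:
    """Pick the K sentences with the highest ŷ_sum scores (Eq. 1),
    returned in document order so the summary reads naturally."""
    top = np.argsort(scores)[::-1][:k]
    return sorted(top.tolist())

# Hypothetical probabilities for a 10-sentence document; K=3 as for lectures.
scores = np.array([0.1, 0.8, 0.3, 0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.5])
print(select_top_k(scores, k=3))  # → [1, 3, 5]
```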
### 3.2 Experimental Settings
- We use the **Adam optimizer**. Its initial learning rate is set to be 3e−5. The learning rate is linearly warmed up for 10% of the total training steps.
- The training was performed on **8 NVIDIA Tesla P40 GPUs**.
- The models were trained on each dataset for **20 epochs**, using a **batch size of 8 with gradient accumulation every 4 steps**.
- We run hyperparameter search trials on the validation set, with β ∈ {1, 0.1, 0.01, 0.001}.
- We adopt half-precision (FP16) to speed up training for all models, with the exception of the full model, where full-precision (FP32) is used to ensure a stable performance of eigenvalue decomposition required by the DPP regularizer.
### 3.3 Summarization Results
### Baseline Systems
*We compare our system with strong summarization baselines:*
1. **SumBasic** is an extractive approach that adds sentences to the summary if they contain frequently occurring words.
2. **LexRank** measures sentence salience based on <span class='blue'>eigenvector centrality</span>.
3. **ExtSum-LG** leverages local and global context to extract salient sentences.
4. **+RdLoss** further adds a redundancy loss term to the learning objective to help the model eliminate redundancy in long document summarization.
5. **Sent-PTR** uses a hierarchical seq2seq sentence pointer model for sentence extraction.
*Our **abstractive baselines** include the following:*
1. **Discourse** utilizes a hierarchical encoder to model the document structure and an attentive decoder to generate the summary.
2. **TLM-I+E** generates a paper abstract using the Transformer language model, where the introduction section and extracted sentences are provided as context.
3. **BigBird** and **LED** use sparse attention and windowed attention to process long input sequences.
4. **HAT** adds hierarchical attention layers to an encoder-decoder model to summarize long documents.
### Results on Scientific Papers
- We compare three of our model variants, listed below.
1. **Lodoss-base**, using *L*sum
2. **Lodoss-joint**, using *L*sum + *L*seg
3. **Lodoss-full**, using (*L*sum + *L*seg) + β*L*DPP
- Standard evaluation metrics, including R-1, R-2 and R-L, are used to measure the quality of system summaries. (It allows our model to be directly compared to previous approaches.)

:::success
1. Our models strongly outperform both extractive and abstractive baselines, suggesting the effectiveness of unifying section segmentation with summarization.
2. The LEAD baseline, however, does not perform as well on long documents as it does on news articles.
3. It is interesting to note that **our models are trained with indirect signals**, i.e., binary sentence labels derived from reference summaries, and they remain quite effective at capturing salient content on long documents.
:::
:::info
This points out an interesting phenomenon:
Although the proposed models are trained with indirect signals, binary sentence labels derived from reference summaries, they remain effective at capturing salient content in long documents.
In detail:
1. A direct training signal would mean humans annotating, for each sentence, whether it should appear in the summary.
2. The paper instead uses indirect signals: binary labels (0/1) produced automatically by comparison against the reference summaries. The model never sees direct human importance annotations; it can only infer each sentence's importance from the existing reference summaries.
3. Even so, the model learns effectively from this indirect supervision and ultimately identifies the important content of the source document.
4. For long documents, accurately capturing the key information is a challenging task, yet the models achieve strong results despite the limited training signal.
5. The authors attribute this ability to the model's architecture and training objectives: jointly training section segmentation and summary sentence extraction helps the model learn the structure and content salience of long documents more effectively.
Overall, the models extract key content from long documents effectively under relatively indirect supervision, which is a phenomenon worth noting.
:::
:::success
4. We conduct <span class='blue'>significance tests(假設檢定)</span> using the approximate randomization method.
5. With a confidence level of 99%, all of our Lodoss models are significantly better than BigBird-base and LED-4K.
:::
:::success
*The differences between our model variants are also significant:*
1. Our results indicate that <span class='red'>incorporating section segmentation and a summary-level DPP regularizer can help the model better locate salient sentences.</span>
2. The large encoder (‘-LG’) results in improvements on both datasets.
:::
### Results on Lecture Transcripts
- We could train our model from scratch using lecture transcripts, or pretrain the model on either arXiv or PubMed, then fine-tune it on transcripts.
:::info
This sentence says the authors explored two ways to train the model:
1. Training from scratch on lecture transcripts.
2. First pretraining the model on the arXiv or PubMed article collections, then fine-tuning its parameters on the transcripts.
In other words, the model is either trained directly on lecture transcripts from zero, or first pretrained on written-document datasets to acquire base knowledge and then fine-tuned to better adapt to spoken text. By comparing the two training regimes, the authors analyze the model's adaptability and transferability across different text types.
:::

:::success
1. We observe that models pretrained on written documents perform substantially better compared to training a model from scratch.
2. Pretraining on PubMed consistently outperforms pretraining on arXiv, except for Lo-full-grp.
3. <span class='red'>It suggests that knowledge gained from summarizing written documents could be transferred to summarization of spoken transcripts.</span>
4. The Lo-joint-* model consistently outperforms the model trained from scratch regardless of different segmentation labels.
5. Note that F-scores are not necessarily aligned with the Rouge scores as the system can predict sentences with similar context that are not labeled as summaries.
:::
:::info
This passage says that, regardless of which segmentation-label definition is used, the pretrained Lo-joint-* models consistently achieve better results than models trained from scratch.
Specifically:
- The Lo-joint-* models are first pretrained on written-article datasets, learning to recognize section structure, and then fine-tuned for transcript summarization.
- Different segmentation-label definitions change how the transcripts are split into sections, but under every segmentation scheme the Lo-joint-* models outperform training from scratch.
- The authors caution that F-scores and ROUGE scores may not fully align, because the system can predict sentences with content similar to the summary that are nonetheless not labeled as summary sentences.
In short, the pretrained Lo-joint-* models adapt better to different section-partitioning schemes and produce higher-quality summaries, demonstrating stronger generalization, though the evaluation metrics may diverge somewhat due to labeling differences.
:::
*We explore alternative definitions of a transcript section*:
**all utterances aligned to a single slide are considered a section** vs. **using six major sections per transcript**
:::success
1. The former leads to about 33 sections per transcript.
2. The latter is achieved by finding 6 slides that are discussed the most, and using the first utterance of each slide as the start of a new section.
3. Because scientific papers on PubMed and arXiv contain 6.06 and 5.68 sections on average, this definition allows our model to be pretrained and fine-tuned under similar conditions.
4. We find that <span class='red'>using six major sections per transcript improves summarization performance.</span>
:::
## 4. Ablations and Analyses
### Effect of Summary Length
- We vary the length of output summaries to contain 5-7 sentences and report summarization results on PubMed and arXiv.

:::success
1. Our model Lodoss-full consistently outperforms other variants across all lengths and evaluation metrics.
2. The highest scores are obtained for PubMed with 7 output sentences, whereas 5 sentences work best for arXiv, as it gives a good tradeoff between recall and precision.
:::
### Effect of Source Sequence Length

:::success
1. We observe that our model performs increasingly better when longer source sequences are used.
2. This is expected, as important information will be left out if source sequences must be truncated to a certain length. For example, with 4K tokens, we have to truncate about 50% of arXiv inputs.
:::
### Model’s Performance on Section Segmentation

:::info
Section segmentation results evaluated by F1 (higher is better) and WinDiff (lower is better). Results are reported for PubMed and arXiv. Best performance is achieved with our Lodoss-full model. ‘-LG’ means a Longformer-large model is used to encode the input document.
:::
- Our goal is to predict the first sentence of a section.
- **WindowDiff** is a lenient measure for segmentation results.
- It uses a sliding window to scan through the input document. At each step, it examines whether the predicted boundaries are correct within the local window.
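The WindowDiff idea can be implemented in a few lines; the window size `k` and the boundary encodings below are illustrative:

```python
def window_diff(ref: list, hyp: list, k: int) -> float:
    """WindowDiff: slide a window of size k over the document and count the
    positions where the number of section boundaries inside the window differs
    between reference and hypothesis. Lower is better. `ref` and `hyp` are
    binary lists, with 1 marking a boundary position."""
    n = len(ref)
    errors = sum(
        sum(ref[i:i + k]) != sum(hyp[i:i + k])
        for i in range(n - k + 1)
    )
    return errors / (n - k + 1)

ref = [0, 0, 1, 0, 0, 1, 0, 0]   # toy reference with two section boundaries
hyp = [0, 0, 1, 0, 1, 0, 0, 0]   # second boundary predicted one position early
print(window_diff(ref, hyp, k=2))  # → 0.2857... (2 of 7 windows disagree)
```

Because a near-miss boundary still matches within most windows, the metric penalizes it less than exact-match F1 would, which is why it is described as lenient.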
:::success
1. Both our full model and large pretrained models help the system to better predict section boundaries.
:::

:::success
1. Predicting the first sentence of a section is easier than predicting the last sentence.
2. This gives 4% and 6% gain, respectively, on PubMed and arXiv.
:::
### Effect of Our DPP Regularizer

- The accompanying table shows the average number of words per summary, where summaries are produced by different model variants.
:::success
1. We find that summaries produced by Lodoss-full tend to be shorter compared to other summaries.
2. Lodoss-full remains the best performing model.
3. It suggests that the DPP regularizer favors a diverse set of sentences to be included in the summary.
4. The selected sentences are not necessarily long as they may contain redundant content.
:::
### Why Section Segmentation is Necessary

:::info
How often summary sentences are found near section boundaries (PubMed). “1” indicates a summary sentence is the first sentence of a section, whereas “-1” indicates it is the last sentence of a section. Both the first and last sentences of a section are likely to be selected for inclusion in the summary.
:::

:::info
How often summary sentences are found near section boundaries (arXiv).
:::
- We investigate how often summary sentences are found near section boundaries.
- “1” indicates a summary sentence is the first sentence of a section.
- Whereas “-1” indicates it is the last sentence of a section.
:::success
1. Overall, both the first and last sentences of a section are likely to be selected for inclusion in the summary.
2. The effect is stronger for PubMed and less so for arXiv. We conjecture that because arXiv papers are twice as long as PubMed papers, summary sentences may not always occur near section boundaries.
3. In both cases, our models are able to leverage this characteristic to simultaneously identify summary sentences and perform section segmentation.
:::
### Human Assessment of System Summaries
- We focus on evaluating **informativeness** and **diversity** of summary sentences.
- Other criteria are not considered because extractive summaries can be highlighted on their source materials, allowing them to be understood in context.
- <span class='red'>Our evaluation focuses on the **Lodoss-joint** and **Lodoss-full** models.</span>
:::info
As a toy example, let S1={1, 3, 7, 12} and S2={2, 3, 7, 9} be summaries produced by two models, respectively.
The numbers are sentence indices.
For **informativeness**:
1. we <span class='red'>take the union of summary sentences</span> {1, 2, 3, 7, 9, 12}
2. And ask human evaluators to judge the relevance of each sentence against the ground-truth summary on a scale of 1 (worst) to 5 (best).
3. The informativeness score of a summary is the average of its sentence scores.
For **diversity**:
1. we obtain the <span class='red'>symmetric difference </span>of two summaries {1, 2, 9, 12}
2. And ask humans to judge if each sentence has offered new content different from those of the common sentences {3, 7}.
3. A good summary should contain diverse sentences that are dissimilar from each other.
:::
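The set operations in the toy example above are straightforward to reproduce; the ratings dictionary below is hypothetical, standing in for the 1–5 relevance judgments collected from evaluators:

```python
# Toy evaluation setup: S1 and S2 are summaries (sentence indices)
# produced by two models.
S1 = {1, 3, 7, 12}
S2 = {2, 3, 7, 9}

union = sorted(S1 | S2)        # judged for informativeness
common = sorted(S1 & S2)       # shared sentences
sym_diff = sorted(S1 ^ S2)     # judged for diversity vs. the common sentences

print(union)     # → [1, 2, 3, 7, 9, 12]
print(common)    # → [3, 7]
print(sym_diff)  # → [1, 2, 9, 12]

# Informativeness: average the 1-5 relevance ratings of a summary's sentences.
# Hypothetical ratings keyed by sentence index:
ratings = {1: 4, 2: 3, 3: 5, 7: 5, 9: 2, 12: 4}
informativeness_S1 = sum(ratings[i] for i in S1) / len(S1)
print(informativeness_S1)  # → 4.5
```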

:::info
Table 6 reports the human evaluation of the informativeness and diversity of the proposed models' summaries.
4/5, 3, and 1/2 are rating bands:
- 4/5 means a rating of 4 or 5: the summary sentence is highly relevant to the reference summary.
- 3 means a rating of 3: the sentence is moderately relevant.
- 1/2 means a rating of 1 or 2: the sentence is of low relevance.
Thus 4/5 indicates sentences that cover the source content well, while 1/2 indicates sentences that miss much of the important information.
For informativeness, higher ratings (a larger share of sentences in the 4/5 band) are better. For diversity, sentences judged to offer new content indicate less redundancy among the summary sentences, so higher diversity judgments are likewise better.
:::
- We performed evaluation using **Amazon Mechanical Turk** on 100 randomly selected summarization instances from arXiv.
- We recruited Masters turkers to work on our task. They must have completed at least 100 HITs and with ≥90% HIT approval rate.
- Each summary sentence was judged by 3 turkers.
:::success
1. Lodoss-full receives better relevancy and diversity ratings than Lodoss-joint.
2. A substantial portion of the sentences receive a score of 1 or 2, suggesting the extracted sentences can lack informativeness even though the DPP regularizer is effective at increasing the diversity of selected sentences and eliminating redundancy.
:::
:::success
Figure 3 shows the proposed models' performance on the section segmentation task on the PubMed and arXiv datasets.
Two evaluation metrics are used:
1. F1 score: a standard metric that combines the model's precision and recall on section-boundary prediction. A higher F1 means the model predicts the document's section boundaries better.
2. WinDiff (WindowDiff): a lenient metric designed for segmentation. It slides a window over the input document and, at each step, checks whether the predicted boundaries within the local window are correct. A lower WinDiff means better boundary prediction.
The figure shows that the "Lodoss-full" model and the "-LG" variants (using the larger pretrained encoder) achieve better F1 and WinDiff scores than the base "Lodoss-base" on both datasets. This indicates that jointly training section segmentation with summary sentence extraction, together with a larger pretrained model, helps the model learn more robust sentence representations and thus better predict the document's section structure.
:::
## 5. Conclusion
- We investigate a new architecture for extractive long document summarization that has demonstrated a reasonable degree of transferability from written documents to spoken transcripts.
- Our model learns effective sentence representations by performing section segmentation and summarization in one fell swoop, enhanced by an optimization-based framework that utilizes the determinantal point process to select salient and diverse summary sentences.
- The model achieves state-of-the-art performance on publicly available summarization benchmarks. Further, we conduct a series of analyses to examine why segmentation aids extractive summarization of long documents.
- Our future work includes:
1. exploration of various text segmentation techniques to improve our understanding of the latent document structure.
2. Another direction would be to extend our study to the realm of neural abstractive summarization with the help of learned document structure.
## 6. Limitations
1. The proposed summarization models are trained on scientific articles that are segmented into multiple sections by authors. Those section boundaries are utilized by the model to learn robust sentence representations and estimate sentence salience given their proximity to section boundaries. When section boundaries are unavailable, the model may not work as intended.
2. Trained models may carry inductive biases rooted in the data they are pretrained on.
3. Finetuning on target datasets helps mitigate the issue as the model has been shown to demonstrate a reasonable degree of transferability from written documents to other genres.
:::info
This passage makes two main points:
1. Pretrained models may carry inductive biases rooted in their original pretraining data.
That is, the datasets used during pretraining may instill fixed patterns or preferences during learning, biasing the model's understanding or judgments on certain kinds of data. These latent biases can hurt the model's ability to generalize to new data.
2. Fine-tuning on the target dataset helps mitigate this problem.
Experimental results show that, after fine-tuning, the model demonstrates a reasonable degree of transferability, sensibly applying knowledge learned from written-document datasets to other text types such as spoken transcripts.
In short, fine-tuning lets the model adapt to and learn the particular patterns of new data, reducing biases inherited from the original pretraining data and improving generalization across domains and text types.
:::