Text Summarization with Pretrained Encoders

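# Text Summarization with Pretrained Encoders

[1908.08345](https://arxiv.org/pdf/1908.08345.pdf)

## Abstract

* Covers both extractive and abstractive models.
* Summaries are produced with Transformers built on top of BERT
    * extractive: inter-sentence Transformer layers
    * abstractive: encoder-decoder architecture with a randomly initialized Transformer decoder
* A new fine-tuning schedule evens out the mismatch between the pretrained BERT encoder and the untrained decoder
    * separates the optimizers of the encoder and the decoder
    * two-stage approach: the encoder is fine-tuned twice, first on extractive and then on abstractive summarization
* Best results for both kinds of summarization on 3 datasets with different writing conventions (where the important information sits, how short the summaries are, etc.)
* 3 contributions
    * Highlights that document encoding matters: a minimal model beats models that rely on far more elaborate techniques
    * Shows how to use a pretrained model for summarization
    * Serves as a stepping stone and baseline for future models

---

## Background

### Pretrained Language Models

* Pretrained models have become the mainstream approach for most NLP tasks
* The original BERT architecture is as follows
    * [CLS] carries information about the whole sequence; placed at the start
    * [SEP] marks a sentence boundary; placed at the end of each sentence

![](https://i.imgur.com/9IWVIFE.png)

* Each token is represented by 3 embeddings, summed into one vector $x_i$ that is fed to a multi-layer bidirectional Transformer
    * token embeddings: the meaning of each token
    * segmentation embeddings: distinguish the 2 sentences
    * position embeddings: the position of each token in the original sequence
* Each Transformer layer computes:
    * $\widetilde{h}^l = \text{LN}(h^{l-1} + \text{MHAtt}(h^{l-1}))$
    * $h^l = \text{LN}(\widetilde{h}^l + \text{FFN}(\widetilde{h}^l))$
    * $h^0 = \text{PosEmb}(T)$
    * $T$: feature vectors of $sent_i$
    * PosEmb: encodes the position of each sentence ($E_P$)
    * LN: layer normalization
    * MHAtt: multi-head attention
    * superscript $l$: depth of the stacked layer
* At the top layer, BERT outputs a vector $t_i$ for each token that is rich in contextual information
* Because BERT is fine-tuned together with the downstream task, it is currently more widely used than ELMo

### Extractive Summarization

* A neural encoder builds sentence representations (understands the sentences)
* A classifier picks which sentences go into the summary
* Examples:
    * SUMMARUNNER: one of the earliest neural summarizers; RNN-based
    * REFRESH: reinforcement learning that optimizes ROUGE directly
    * LATENT: optimizes directly against the human-written summaries
    * SUMO: represents the document as a multi-root dependency tree and predicts the output summary
    * NEUSUM: scores and selects sentences jointly; the state of the art in extractive summarization

### Abstractive Summarization

* Neural approaches treat this as a seq2seq problem
    * the encoder maps the input sequence $x$ to a sequence of continuous representations $z$
    * the decoder generates the summary $y$ token by token, auto-regressively
    * this models the conditional probability $P(y|x)$
* Examples
    * Rush et al. and Nallapati et al. were among the first to apply the encoder-decoder architecture to summarization
    * See et al. strengthen that model with a pointer-generator network (PTGEN) and use a coverage mechanism (COV) to keep track of words that have already been summarized
    * Celikyilmaz et al. propose Deep Communicating Agents (DCA): an abstractive system in which multiple encoders represent the document and decoding uses hierarchical attention; trained end-to-end with reinforcement learning
    * Paulus et al. propose a deep reinforced model (DRM) that handles the coverage problem with intra-attention, where the decoder attends over previously generated words
    * Gehrmann et al. take a bottom-up approach: first decide which phrases from the source should appear in the summary, then restrict the copy mechanism to those preselected phrases during decoding
    * Narayan et al. propose a CNN-based abstractive model conditioned on topic distributions, particularly suited to extreme (single-sentence) summarization

---

## Fine-tuning BERT for Summarization

### Summarization Encoder

* BERT's output is too fine-grained: its vectors are grounded to tokens rather than to whole sentences (or several sentences), so the input and the embeddings are modified for summarization
    * [CLS] is inserted before each sentence and [SEP] after it
    * interval segment embeddings $E_A$, $E_B$ distinguish odd- and even-numbered sentences
    * $T_i$ is the feature vector of $sent_i$

![](https://i.imgur.com/hoIvRJh.png)

* Document representations are learned hierarchically: lower Transformer layers represent adjacent sentences, while higher layers, combined with self-attention, represent multi-sentence discourse
* The original BERT has only 512 position embeddings; more are added here, randomly initialized and trained together with the rest of the model

### Extractive Summarization

* $sent_i$ is the $i$-th sentence of the document
* If $sent_i$ is included in the final summary, its label is $y_i = 1$
    * i.e. sentence $i$ is part of the summary
* BERTSUMEXT:
    * Classifying with Transformer layers works better than the previous approach
    * $\widetilde{h}^l = \text{LN}(h^{l-1} + \text{MHAtt}(h^{l-1}))$
    * $h^l = \text{LN}(\widetilde{h}^l + \text{FFN}(\widetilde{h}^l))$
    * $h^0 = \text{PosEmb}(T)$
    * $T$: feature vectors of $sent_i$
    * PosEmb: encodes the position of each sentence ($E_P$)
    * LN: layer normalization
    * MHAtt: multi-head attention
    * superscript $l$: depth of the stacked layer
    * The output layer is still a sigmoid
    * Experiments show that $L = 2$ works best (among $L = 1$ to $3$)
    * $\hat{y}_i = \sigma(W_o h^L_i + b_o)$
        * $h^L_i$: the vector for $sent_i$ output by the $L$-th Transformer layer
* The loss is the binary classification entropy (binary cross-entropy) between $\hat{y}_i$ and $y_i$
* These extra layers are fine-tuned jointly with BERT
* Fine-tuning uses Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$.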
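* The learning rate follows a schedule with warm-up over the first 10,000 steps: $lr = 2 \times 10^{-3} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup}^{-1.5})$

### Abstractive Summarization

* Standard encoder-decoder framework (See et al.)
    * encoder: the pretrained BERTSUM
    * decoder: 6-layer Transformer, randomly initialized
* Two separate Adam optimizers, both with $\beta_1 = 0.9$, $\beta_2 = 0.999$, but with different warm-up steps and learning rates
    * $lr_\mathcal{E} = \widetilde{lr}_\mathcal{E} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup}_\mathcal{E}^{-1.5})$
        * $\widetilde{lr}_\mathcal{E} = 2 \times 10^{-3}$; $\text{warmup}_\mathcal{E} = 20{,}000$
    * $lr_\mathcal{D} = \widetilde{lr}_\mathcal{D} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup}_\mathcal{D}^{-1.5})$
        * $\widetilde{lr}_\mathcal{D} = 0.1$; $\text{warmup}_\mathcal{D} = 10{,}000$
    * The pretrained encoder is fine-tuned with a smaller learning rate and smoother decay (so that it can be trained with more accurate gradients once the decoder becomes stable)
* Two-stage fine-tuning: the encoder is fine-tuned first on the extractive task, then on the abstractive task
    * exploits information shared between the two tasks without fundamentally changing the architecture
* The model above is called BERTSUMABS; the two-stage fine-tuned one is called BERTSUMEXTABS

---

## Experimental Setup

### Summarization Datasets

* CNN/DailyMail news highlights dataset
    * news articles with associated highlights
    * non-anonymized version
    * training/validation/test split follows Hermann et al.
    * sentence splitting with CoreNLP; preprocessing follows See et al.
    * Input truncated to 512 tokens
* New York Times Annotated Corpus (NYT)
    * abstractive summaries
    * test set split off by publication date (9/1); 4% (4,000 examples) used for validation
    * documents with summaries shorter than 50 words are removed
    * sentence splitting with CoreNLP; preprocessing follows Durrett et al.
    * Input truncated to 800 tokens
* XSum
    * one-sentence summaries
    * training/validation/test split and preprocessing follow Narayan et al.
    * Input truncated to 512 tokens.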
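* The first two datasets lean extractive; the last one is abstractive
    * results are expected to reflect each dataset's characteristics
* Rightmost column: proportion of novel bi-grams in the gold summaries

![](https://i.imgur.com/c6Co7Vx.png)

### Implementation Details

* PyTorch, OpenNMT, BERT (bert-base-uncased)
* Both source and target are tokenized with BERT's subword tokenizer
* Testing uses the 3 checkpoints with the lowest loss

#### Extractive Summarization

* Trained for 50,000 steps
* Batch size of about 36
* Model checkpoints are saved and evaluated on the validation set every 1,000 steps
* An oracle algorithm similar to that of Nallapati et al. generates an oracle summary for each document, which is used to train the model
    * selects the sentences that maximize the ROUGE-2 score as the oracle sentences
* Redundancy handling is similar to Maximal Marginal Relevance (MMR) but much simpler
    * checks the word overlap between a candidate sentence $c$ and the summary $S$ built so far ($c$ is skipped if it shares a trigram with $S$)

#### Abstractive Summarization

* All abstractive models apply dropout (p = 0.1) before every linear layer and use label smoothing (factor = 0.1)
* Transformer hidden units = 768
* Hidden size of all feed-forward layers = 2,048
* Trained for 200,000 steps
* Model checkpoints are saved and evaluated on the validation set every 2,500 steps
* During decoding
    * beam search (size = 5)
    * length penalty $\alpha$ between 0.6 and 1
    * trigram blocking
    * decoding continues until an end-of-sequence token is emitted
* Neither a copy nor a coverage mechanism is used, despite their popularity
    * because the aim is a minimum-requirements model
    * with the subword tokenizer, out-of-vocabulary words rarely appear
    * trigram blocking also effectively reduces repetition in the summaries

---

## Results

### Automatic Evaluation

* ROUGE-1 and ROUGE-2 assess informativeness
* ROUGE-L assesses fluency
* ORACLE (highest ROUGE-2 score) serves as the upper bound
* LEAD-3 (just the first 3 sentences) serves as a baseline
* TransformerEXT: baseline; a BERTSUMEXT without pretraining, with fewer parameters and random initialization
    * 6 layers; hidden size 512; feed-forward filter size 2,048
    * trained with the same settings as Vaswani et al.
* TransformerABS: baseline; uses the same decoder as BERTSUMABS
    * encoder has 6 layers; hidden size 768; feed-forward filter size 2,048

#### CNN/DailyMail

* BERT-based models are second only to ORACLE
* BERTSUMEXT performs best because the dataset itself leans extractive; even the abstractive BERT models tend to copy from the article on this dataset
* A larger version of BERT improves performance noticeably
    * interval embeddings bring only a slight gain

![](https://i.imgur.com/lMmYWzS.png)

#### NYT

* Evaluated following Durrett et al.
    * limited-length ROUGE Recall
    * predicted summaries are truncated to the length of the gold summaries
* BERT-based models again beat all previous systems
* Abstractive BERTSUM models nearly catch up with ORACLE

![](https://i.imgur.com/2CxFsIi.png)

#### XSum

* LEAD and ORACLE show that extractive models perform poorly here
    * because XSum summaries are a single sentence
    * so no comparison with extractive models is reported
* BERT-based models again beat previous systems

![](https://i.imgur.com/TgkfNcN.png)

### Model Analysis

#### Learning Rates

* For abstractive summarization, the table below shows that $\widetilde{lr}_\mathcal{E} = 2 \times 10^{-3}$; $\widetilde{lr}_\mathcal{D} = 0.1$ performs best

![](https://i.imgur.com/bFe03Ii.png)

#### Position of Extracted Sentences

* For extractive summarization, further analysis of where the selected sentences sit in the source document
    * ORACLE sentences are spread fairly smoothly across the document
    * TransformEXT-style selection is skewed: TransformerEXT favors the beginning of the article
    * BERTSUMEXT is very close to ORACLE, which suggests the pretrained model understands the article more deeply instead of judging importance by position alone

![](https://i.imgur.com/YbqmgxI.png)

#### Novel N-grams

* For abstractive summarization, further analysis of the proportion of novel n-grams
    * on CNN/DailyMail the proportion in generated summaries is much lower than in the reference summaries
    * BERTSUMEXTABS has the lowest proportion, because it was first trained as an extractive model
    * on XSum the proportion is much closer to the reference

![](https://i.imgur.com/EP8pAQ0.png)

### Human Evaluation

* A QA paradigm quantifies how much key information from the document the model retains
    * a question set is created from the gold summary
    * on the assumption that the gold summary highlights the most important content
    * participants answer the questions after reading only the generated summary; the more questions answered correctly, the better the model
* Because abstractive output can be incoherent or ungrammatical
    * the Best-Worst Scaling method is also used: participants see the output of 2 models (plus the original document)
    * and judge which is better on Informativeness, Fluency, and Succinctness
* Both evaluations above were run on Amazon Mechanical Turk (crowdsourcing platform)
    * for CNN/DailyMail and NYT, the same 20 documents and questions as in previously published work are used
    * for XSum, 20 documents and their questions are randomly selected from Narayan et al.
* QA scoring:
    * fully correct 1; partially correct 0.5; wrong 0
* Quality rating:
    * percentage of times a system is chosen as better minus the percentage of times it is chosen as worse
    * range -1 to 1
* The best-performing BERTSUM model is compared against a range of recent systems
    * LEAD as baseline; GOLD as upper bound
    * in the QA evaluation, all differences are statistically significant (p < 0.05) except against TCONVS2S

![](https://i.imgur.com/0lpxPfm.png)
![](https://i.imgur.com/cpgLH8P.png)

---

## Conclusions

* On all 3 datasets the model achieves state-of-the-art results under both automatic and human evaluation

---

###### tags: `Paper`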
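
The two warm-up schedules from the Abstractive Summarization section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' training code: the names `warmup_lr`, `encoder_lr`, and `decoder_lr` are hypothetical, and only the formula $\widetilde{lr} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup}^{-1.5})$ and the constants (2e-3 with 20,000 warm-up steps for the encoder, 0.1 with 10,000 for the decoder) come from the notes.

```python
def warmup_lr(step: int, base_lr: float, warmup: int) -> float:
    """Learning rate at `step` (1-based) for the schedule in the notes:

        lr = base_lr * min(step^-0.5, step * warmup^-1.5)

    The rate rises linearly until `step == warmup`, then decays as step^-0.5.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)


# Constants taken from the notes; the dictionary names are illustrative.
ENCODER_SCHEDULE = {"base_lr": 2e-3, "warmup": 20_000}  # pretrained BERT encoder
DECODER_SCHEDULE = {"base_lr": 0.1, "warmup": 10_000}   # randomly initialized decoder


def encoder_lr(step: int) -> float:
    """Encoder schedule: small base rate, long warm-up."""
    return warmup_lr(step, **ENCODER_SCHEDULE)


def decoder_lr(step: int) -> float:
    """Decoder schedule: large base rate, short warm-up."""
    return warmup_lr(step, **DECODER_SCHEDULE)


if __name__ == "__main__":
    # Print both rates at a few points of the 200,000-step training run.
    for step in (1_000, 10_000, 20_000, 200_000):
        print(f"step {step:>7}: encoder {encoder_lr(step):.2e}, decoder {decoder_lr(step):.2e}")
```

Under this formula the decoder rate peaks at step 10,000 with a value of about 1e-3, while the encoder keeps warming up until step 20,000 and never exceeds roughly 1.4e-5. That matches the intent described in the notes: the encoder takes small, smoothly changing steps while the decoder is still stabilizing.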