<style> .red { color: red; } .blue{ color: blue; } .green{ color: green; } </style>

# [Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization](https://arxiv.org/abs/2209.00930)

:::danger
**Comments:** Accepted at COLING 2022
**Github:** https://github.com/SeungoneKim/SICK_Summarization
:::

## 1. Introduction

- Dialogue-to-document summarization suffers from the discrepancy between input and output forms, which makes learning their mapping patterns more challenging.
- Two key challenges make summarizing dialogues harder than summarizing documents:
    1. Detecting unspoken intention is crucial for understanding an utterance.
    2. There exists information that can only be understood when its hidden meaning is revealed.

:::info
Detecting unspoken intention is key to understanding an utterance: beyond what the speaker says aloud, we must grasp their hidden intentions and goals. For example, when someone in a conversation says "That performance was terrible", on the surface they are commenting on a show, but the hidden intent may be to mock the performer. Recognizing such unspoken intentions is essential for accurately understanding the overall meaning of a dialogue.

In addition, some information can only be understood once its implicit meaning is revealed. Some content is never stated directly and must be inferred from context and the relationship between the speakers. For instance, "This laptop is really ancient" may implicitly mean the laptop is too old and should be replaced, without saying so outright. Discovering and understanding these implied meanings is also necessary for fully grasping the dialogue.
:::

![截圖 2024-03-13 21.28.46](https://hackmd.io/_uploads/BkMHqmkAT.png =60%x)

- We argue that the aforementioned issues can be mitigated by using commonsense knowledge models to fill in the gaps in a dialogue.

:::info
Event-centered commonsense models and social-interaction commonsense models produce commonsense inferences about events and interpersonal interactions, respectively.

**Event-centered commonsense models**: These models infer likely causes, effects, and needs from a given event or scenario. For example, given the event "someone wakes up in the early morning", such a model can infer "xReason: wants to go to work", "xNeed: wants a cup of coffee", "xEffect: feels tired", and other commonsense inferences related to the event. Event-centered commonsense helps explain the causes and consequences surrounding an event.

**Social-interaction commonsense models**: These models infer intentions, emotional reactions, and expectations in interpersonal scenarios. For example, given the exchange "Ming asks his mother: May I go out and play? Mother replies: Finish your homework first.", such a model can infer "Ming's xWant: to go play", "Mother's xReaction: refusal", "Mother's xIntent: have Ming finish his homework first", and other implicit social-interaction commonsense. This helps capture the mental and emotional states of both parties.

Both kinds of models aim to mine the commonsense knowledge people use to understand the world and society, compensating for purely text-based models' lack of commonsense reasoning. Combining them with dialogue summarization helps the model better understand the context and implicit meaning of a dialogue.
:::

- We argue that a naïve adoption of commonsense only hurts summarization performance, because:
    1. Expanding the source content is a counter-intuitive approach to the goal of condensation.
    2. Simply adding extra inputs to pre-trained language models does not lead to robust inferences.
- Our framework addresses this by (a) <span class='red'>filtering</span> and (b) <span class='red'>robust training</span>.
- In SICK++, we also design a new auxiliary task named commonsense supervision. Using commonsense knowledge generated from gold summaries as additional supervision, the goal of the task is to generate the target commonsense.
- Then, the <span class='red'>dialogue summarization and commonsense generation tasks are jointly learned in a multi-task learning setting to effectively inject commonsense knowledge into the shared encoder</span>.

## 2. Related Work

### 2.1 Abstractive Dialogue Summarization

- Benefiting from the advance of large-scale pre-trained language models, encoder-decoder models have achieved substantial improvements in document summarization.
- It is harder to capture the key points in dialogues than in documents, because people do not state the obvious and conversations have a more interactive flow of information between speakers.
- Instead of **organizing the given dialogue** for better understanding, our method <span class='red'>adds additional knowledge</span> to fill in the missing cues between utterances.

### 2.2 Commonsense Knowledge Models

- Recent research has pursued commonsense knowledge acquisition along two lines:
    1. **Commonsense knowledge graphs**: entities are represented as nodes and the relations between them as edges.
    2. **Commonsense knowledge models**: shown to generate implicit commonsense inferences along several dimensions, depending on which knowledge graphs they are trained on.
- Commonsense knowledge models can be used to anticipate and reason about unobserved causes and effects in relation to the observed event.
- On the dialogue summarization task, commonsense has seen limited use directly as additional context.
- Instead of retrieving from a static knowledge graph, our method deploys on-the-fly commonsense knowledge models to acquire a rich set of commonsense inferences dynamically.

## 3. Proposed Framework

*We introduce our new framework, SICK (**S**ummarizing with **I**njected **C**ommonsense **K**nowledge), and its extension SICK++.*

![截圖 2024-03-13 21.30.49](https://hackmd.io/_uploads/Sksc5X1RT.png)

### 3.1 Task Description

- Our task definition follows a sequence-to-sequence learning problem setting.
- Based on pre-trained generative language models, our goal is to learn a mapping function M : D → Y, where D = {u1, u2, ..., un} is a dialogue with n utterances and Y = {y1, y2, ..., ym} is a corresponding summary of m sentences.

:::info
"Our task definition follows a sequence-to-sequence learning problem setting" means the task maps an input sequence to a corresponding output sequence, a common formulation in machine learning.

In this dialogue summarization task:
- The input sequence is the dialogue's utterances, D = {u1, u2, ..., un}, where n is the number of utterances.
- The target output sequence is the corresponding summary Y = {y1, y2, ..., ym}, where m is the number of summary sentences.

The goal of a sequence-to-sequence model is to learn a mapping function M : D → Y that converts the input dialogue D into the summary Y. This is typically done with neural architectures such as RNNs or Transformers, optimized on large amounts of training data to fit the input-output sequence correspondence.

This seq2seq formulation applies not only to dialogue summarization but also to other NLP tasks such as machine translation, text summarization, and dialogue systems — any problem that can be cast as converting one sequence into another. <span class='red'>Compared with tasks that predict a single token, sequence-level learning better captures long-range dependencies between sequences</span> and is often more expressive.
:::

- We further extend the task with two modifications:
    1. First, we generate and filter to acquire a set of commonsense knowledge C = {c1, c2, ..., cn} based on D.
    2. Then, we adjust the mapping function to <span class='red'>M : X → Y</span>, where X is a cross concatenation of D and C.
    3. We add an auxiliary task, commonsense supervision, <span class='red'>M∗ : X → Z</span>, where the target commonsense Z = {z1, z2, ..., zm} is acquired based on Y.
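To make the cross-concatenated input X concrete, here is a minimal sketch of building X from D and C, assuming plain-string utterances; the `<I>`/`</I>` separator tokens and turn-by-turn interleaving follow the description in §3.3, and the example dialogue is invented:

```python
def cross_concatenate(dialogue, commonsense):
    """Interleave each utterance u_i with its filtered commonsense inference c_i.

    Produces X = u_1 <I> c_1 </I> u_2 <I> c_2 </I> ... u_n <I> c_n </I>.
    Tokenizer-level details from the paper are omitted in this sketch.
    """
    assert len(dialogue) == len(commonsense)
    parts = []
    for u, c in zip(dialogue, commonsense):
        parts.append(u)
        parts.append(f"<I> {c} </I>")  # special tokens separate the two modalities
    return " ".join(parts)

# Invented toy example: two utterances with one selected inference each.
D = ["Amy: I'm late, the bus broke down.", "Ben: Don't worry, I'll cover for you."]
C = ["Amy feels anxious.", "Ben intends to help Amy."]
X = cross_concatenate(D, C)
```

Interleaving (rather than appending all of C after D) keeps each inference next to the utterance it was derived from, which matters for the locality-of-reference argument made later.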
### 3.2 Commonsense Knowledge Generation

- We adopt an external commonsense knowledge model that generates a diverse and abundant set of commonsense inferences **in natural language**.
- Given a text x and a relation type r, the commonsense knowledge model gives an output c grounded to the relation type, i.e., f : (x, r) → c.

:::info
In a commonsense model, the relation type r is a label describing a particular kind of commonsense relationship. Different relation types represent different semantic links in commonsense knowledge.

Taking COMET as an example, it defines 23 relation types, including:
1. xNeed — a precondition or requirement of an event or action, e.g., "eating xNeed having food"
2. xWant — a person's desire or goal, e.g., "he xWant to find a job"
3. xReact — a person's reaction to an event, e.g., "she hears a noise xReact feels angry"
4. xEffect — a likely effect or consequence of an event or action, e.g., "raining xEffect the ground gets wet"
5. xAttr — an attribute of a person or thing, e.g., "bitter melon xAttr tastes bitter"
6. xIntent — the intention behind someone's words or actions, e.g., "he lies xIntent wants to hide the truth"

These relation types capture the causes, conditions, and effects of events in fine detail, which is useful for building a complete commonsense knowledge graph. When generating commonsense inferences, the model produces descriptions grounded to the given relation type; different relation types carry different semantics and capture diverse facets of commonsense knowledge.
:::

- Specifically, we use <span class='red'>COMET</span>, a widely-used generative commonsense model, as our external model.
- Among the 23 possible candidate relation types, we **choose 5 unique relations that help understand the speakers' intentions** and find out the missing information.
- COMET **generates 5 commonsense inferences per relation type, resulting in 25 per input**.

:::info
For each input, COMET generates 5 commonsense inferences for each of the 5 selected relation types, so every input yields 5 × 5 = 25 commonsense inferences in total.

For example, given the input sentence "It rained yesterday", COMET would generate 5 inferences per selected relation type:

Under xNeed, possible inferences include:
1. There need to be clouds
2. There needs to be water vapor
...

Under xEffect, possible inferences include:
1. The road gets wet
2. People open umbrellas
...

Generating 5 inferences for each of the 5 selected relation types yields 25 commonsense inference sentences related to the input. By producing rich commonsense inferences, COMET tries to mimic how humans make associations and inferences from background knowledge. These inferences can compensate for what the input text leaves unsaid, providing downstream tasks with additional background information and cues, and generating 25 inferences per input also enlarges the data available for the model to learn from.
:::

- Also, to attend to the previous utterances when generating commonsense inferences, we further explore <span class='red'>a discourse-aware model, PARA-COMET</span>.
- <span class='red'>COMET generates a set of commonsense inferences considering only one sentence at a time.</span>
- PARA-COMET adopts an internal memory module to consider the previous dialogue history when generating an output.

![image](https://hackmd.io/_uploads/r1ZLy_JRa.png =60%x)

### 3.3 Summarizing with Injecting Commonsense Knowledge (SICK)

### Filtering

- The amount of data provided as the input should be mapped into the output in a concise form. Therefore, <span class='red'>simply providing extra input (i.e., commonsense knowledge) may confuse the model when generating a summary</span>.
- It is infeasible to add every possible commonsense inference to the dialogue due to the <span class='red'>limited input sequence length of transformer-based models</span>.
- To address this issue, we propose to select the most favorable commonsense inference for each utterance.
- For the 25 candidates, we measure the semantic relevance between the utterance and each candidate commonsense inference.
- One could imagine that filtering might choose <span class='red'>only very similar "safe" examples that might not be as valuable/interesting in practice</span> (i.e., diversity vs. quality).
- We employ **SBERT** to compute the similarity score between utterance and commonsense pairs.
- We select the one commonsense inference ci with the highest score for each utterance ui among the candidate relations R.
- As a result, we obtain the input commonsense C = {ci}$^n_{i=1}$ aligned with the dialogue D.
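The filtering step scores each candidate against its utterance and keeps the argmax. A minimal sketch, with a bag-of-words cosine as a stand-in for SBERT embeddings (the paper encodes both texts with SBERT instead); all example strings are invented:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in bag-of-words vector; the paper uses SBERT sentence embeddings here.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_commonsense(utterance, candidates):
    """Select the candidate inference most semantically similar to the utterance."""
    u = embed(utterance)
    return max(candidates, key=lambda c: cosine(u, embed(c)))

# Invented toy example: 3 of the 25 candidates for one utterance.
utterance = "the bus broke down this morning"
candidates = [
    "PersonX wants to eat pizza",  # irrelevant
    "the bus broke down",          # most relevant
    "PersonX feels happy",
]
best = filter_commonsense(utterance, candidates)  # → "the bus broke down"
```

In the real pipeline this runs over all 25 candidates per utterance, yielding exactly one ci per ui.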
![image](https://hackmd.io/_uploads/Bk3NzO1Aa.png =55%x)

### Cross Concatenation

- After obtaining the input commonsense for the dialogue, we concatenate the dialogue and its corresponding set of commonsense inferences.
- To encode the information that ci is derived from ui, we encourage each inference to attend to its neighboring tokens. Instead of appending C after D as a whole, <span class='red'>we concatenate turn by turn, considering **locality of reference**</span>, where tokens tend to attend to their neighboring tokens.
- In order to **separate the modalities between dialogues and commonsense inferences**, we add special tokens `<I>`, `</I>` before and after each commonsense inference ci. Thus the input sequence X is formulated as:

![image](https://hackmd.io/_uploads/Syr6X_JRa.png =55%x)

### Training

- SICK is built upon a transformer-based encoder-decoder architecture.
- The encoder fuses the information from two different modalities (i.e., dialogue and commonsense inference).
- In the decoder stack, the encoder output is used for cross-attention when generating the summary.
- The training objective, a negative log-likelihood parameterized by θds, can be formulated as:

![image](https://hackmd.io/_uploads/B1XsEOyC6.png =55%x)

where wi,j is the j-th token of the i-th sentence yi in the target summary Y.
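As a reading aid, the objective shown in the figure can be written out as follows (a transcription matching the notation in the surrounding text, not copied from the paper's source):

```latex
\mathcal{L}_{ds} = -\sum_{i=1}^{|Y|} \sum_{j=1}^{|y_i|} \log P\!\left(w_{i,j} \mid w_{i,<j},\, X;\; \theta_{ds}\right)
```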
:::info
The formula above is the training objective of the dialogue summarization decoder, a negative log-likelihood loss, where:

- |Y| is the number of sentences in the target summary Y
- |yi| is the number of tokens in the i-th summary sentence
- wij is the j-th token of the i-th summary sentence
- wi<j denotes all tokens before wij
- X is the input sequence, containing the dialogue and the injected commonsense knowledge
- θds are the model parameters for the dialogue summarization task

log P(wij | wi<j, X; θds) is the log-probability of generating the correct target token wij given the preceding tokens wi<j and the input X. Summing the log-probabilities over all target tokens gives the negative log-likelihood L_ds of generating the whole summary sequence Y. Training minimizes this loss so that the generated summary matches the ground-truth summary Y as closely as possible.

The negative log-likelihood loss is standard in sequence generation: it captures the gap between the target and generated sequences, and backpropagation adjusts the parameters θds to minimize it. By minimizing this loss, the model learns to mine and integrate useful information from the input dialogue and commonsense knowledge, generating high-quality, information-rich summaries.
:::

### 3.4 SICK++

![image](https://hackmd.io/_uploads/rk0y3OkCT.png)

### Commonsense Supervision

- It is well known that models do not consider the actual input as a whole, but only look at certain parts of it, thereby performing not the underlying task but some derivative of it.

:::info
This points to a well-known phenomenon: when processing the input, a model often does not treat it as a whole and attends only to certain parts of it. As a result, the model does not actually perform the intended underlying task, but some derivative of it.

This arises from the model's inductive biases and the limitations of the training data:

1. **Inductive biases.** To simplify the problem, neural networks make assumptions and impose constraints on the input, so they attend to only some aspects of it and overlook others (e.g., assuming i.i.d. inputs, or being limited by receptive-field size).
2. **Limitations of training data.** The model learns from finite data, so it may latch onto particular statistical regularities in the training samples while ignoring other important but less salient signals. This bias is then amplified in the model's inference behavior.
3. **Lack of explicit guidance.** In many tasks, the loss function does not steer the model toward the true underlying task; instead it lets the model focus on shortcuts or surface regularities, making it hard for the model to grasp the deeper semantics of the whole input.

So even while apparently performing a task, the model may only attend to partial features of the input and miss much of its semantics, effectively solving a different, derived task. This phenomenon is known as "shortcut learning" or "degeneration" and deserves particular attention. Improving model biases, the training-data distribution, and the loss design can mitigate the problem and push the model to attend to the full input semantics.
:::

- To overcome this problem, we propose an auxiliary task named <span class='red'>commonsense supervision</span>.
- We also leverage commonsense knowledge as an additional target variable, which prevents the model from disregarding commonsense and enforces it to actually utilize commonsense.
- <span class='red'>Generating both the summary and the target commonsense has the effect of emphasizing that the input commonsense inferences are important</span>.
- We generate a set of target commonsense inferences Z from the summary Y using an external knowledge model f.
- Then we filter and select the most plausible target commonsense.

![image](https://hackmd.io/_uploads/HJXbod10a.png =60%x)

- To adopt <span class='red'>commonsense knowledge as additional supervision</span>, we further include a commonsense summarization decoder Dcs, which learns to generate the target commonsense Z.

### Training

- With the target commonsense Z, we train the commonsense summarization decoder Dcs to minimize a negative log-likelihood loss function:

![image](https://hackmd.io/_uploads/HyGU3dJRT.png =60%x)

where wij is the j-th word token of the i-th sentence in the target commonsense Z.

- We linearly combine the two loss functions:

![image](https://hackmd.io/_uploads/H1PTh_kC6.png =50%x)

1. where Lds and Lcs denote the loss functions for the dialogue summarization decoder Dds and the commonsense summarization decoder Dcs, respectively.
2. λ is a predefined hyperparameter to adjust the scale of each loss. In our setting, we set λ = 0.66.

### Inference

- Note that while we train the model in a dual-decoder setting, we **only use the dialogue summarization decoder Dds and discard the commonsense prediction decoder Dcs at inference time**.

## 4. Experimental Setup

### 4.1 Datasets and Baselines

- We perform experiments on two datasets:
    1. **SAMSum**: the most widely used resource for the abstractive dialogue summarization task. It consists of natural messenger-like conversations in English created by linguists, with manually annotated summaries.
    2. **DialogSum**: a recently released dataset for a more challenging task with a lower compression ratio. It contains multi-turn dialogues of real-life scenarios collected from three dialogue corpora.

![image](https://hackmd.io/_uploads/rJ6g4WxRT.png =60%x)

- We adopt three different types of baselines:
    1. **Generative language models**
    2. **Pre-trained language models**
    3. **Dialogue summarization models**

### 4.2 Implementation Details

- We employ two automatic evaluation metrics:
    1. **ROUGE scores**: including ROUGE-1, ROUGE-2, and ROUGE-L, which measure word-level unigram overlap, bigram overlap, and the longest common subsequence overlap with the gold summary, respectively.
    2. **BERTScore**: computes the contextual similarity score between generated and reference summaries.
- We report F1 scores for both metrics.
- Our implementation is <span class='red'>based on the Huggingface implementation of the BART language model</span>.
- Specifically, we use the weight checkpoint of BART-xsum.
- We use a <span class='red'>maximum input length of 1024 tokens and output length of 100 tokens</span>.
- The input is either padded or truncated after each utterance and its corresponding commonsense are concatenated during preprocessing.
- We use a <span class='red'>learning rate of 3e-6</span> and a <span class='red'>batch size of 32</span> when fine-tuning our model on both benchmarks.
- We use linear warm-up over the first 600 steps, <span class='red'>apply linear decay, and use the Adam optimizer</span>.

:::info
This describes the optimization strategy used during training:

1. **Linear warm-up.** For the first 600 steps (iterations), instead of using the full configured learning rate, training starts from a small learning rate that increases linearly up to the configured value. <span class='red'>Warm-up gives the model a gentle optimization start</span>, avoiding the instability or divergence that a large initial learning rate can cause.
2. **Linear decay.** After warm-up, the learning rate is no longer fixed but decays linearly: as training steps increase, the learning rate gradually decreases. <span class='red'>In the later stages of training, a smaller learning rate helps the model converge to better final performance</span>. Linear decay shifts the learning process from large early steps to small late-stage fine adjustments.
3. **Adam optimizer.** Adam is a widely used optimization algorithm that can replace plain stochastic gradient descent (SGD). By tracking exponential moving averages of the gradients, it adapts the update step size per parameter, typically converging faster and more reliably.

The overall schedule is therefore:
1) linear warm-up from a small learning rate for the first 600 steps,
2) linear decay of the learning rate afterward,
3) <span class='red'>Adam parameter updates throughout both phases</span>.

This combination of warm-up, decay, and Adam is widely used in neural network training and typically yields good convergence speed and final performance; it is also the standard recipe for fine-tuning large pre-trained language models.
:::

- We use <span class='red'>beam search with a beam size of 20</span>.
- We fine-tune our model on SAMSum for 20 epochs and on DialogSum for 25 epochs.
- All experiments are run on one NVIDIA A100 GPU.

## 5. Experimental Results

### 5.1 Automatic Evaluation

![image](https://hackmd.io/_uploads/HyzJt-lAa.png)

### Performance

- SICK++ outperforms all baselines on ROUGE-1, ROUGE-2, and BERTScore by a consistent margin on both datasets.

### Comparison with State-of-the-Art

- We find that pre-trained language models (e.g., DialoGPT, UniLM, PEGASUS, BART-xsum) outperform models that are not pre-trained (e.g., PointerGenerator, DynamicConv, Transformer), <span class='red'>confirming the impact of pre-training on abstractive dialogue summarization</span>.
- Among the pre-trained generative language models examined, PEGASUS and BART-xsum are the most competitive, with ROUGE-1 higher than 50.
- SICK++ improves on all metrics compared to BART-xsum (i.e., without additional input or commonsense supervision) on both benchmarks, showing that our method can be applied in different settings.

:::info
The phrase "without additional input, commonsense supervision" marks the contrast: SICK++ improves on the base BART-xsum model by adding two main components, injected commonsense input (additional input) and commonsense supervision.

1. **Without additional input.** Plain seq2seq models such as the original BART-xsum take only the dialogue as input, without injecting any external knowledge. Relying purely on the dialogue content makes it hard to capture implicit intentions, causes, and effects.
2. **Without commonsense supervision.** Traditional seq2seq training uses only the dialogue summary as the supervision signal, asking the model to generate the summary directly from the input dialogue. This supervision is indirect and sparse, so the model easily ignores the commonsense implicit in the dialogue.

SICK++ improves both aspects:
- **Additional input:** an external commonsense model (e.g., COMET) generates commonsense inferences for each utterance, which are merged into the dialogue input, enriching the information available.
- **Commonsense supervision:** besides generating the summary, SICK++ introduces target commonsense inferences generated from the summary as an auxiliary supervision signal, forcing the model to combine dialogue understanding with commonsense reasoning.

Through these two improvements, SICK++ both supplies richer input and strengthens the supervised fusion of commonsense knowledge, ultimately producing more accurate, reasonable, commonsense-rich summaries, whereas the original BART-xsum, lacking both, tends to produce summaries with omissions and implausible content.
:::

- Among methods that **alter the input to seek additional useful information in a dialogue setting** (e.g., D-HGN, SBART, and CODS), CODS achieves better performance than the other baselines on SAMSum. However, on DialogSum, a more challenging setting due to its higher abstractiveness, CODS does not gain as much performance compared to the other baselines.

### Commonsense Models

- <span class='red'>While SICK++ shows better performance regardless of which commonsense generation model is used, the best choice differs depending on the dataset.</span>
- On SAMSum, SICK++ performs better with PARA-COMET than with COMET; on DialogSum the result is the opposite.
- We conjecture this is <span class='red'>due to the characteristics of the datasets and the commonsense models</span>.

:::info
Regardless of which commonsense generation model is used, SICK++ performs well, but the best commonsense model differs across datasets:

1. On SAMSum, SICK++ performs best with PARA-COMET as the commonsense generator.
2. On DialogSum, SICK++ works better with COMET.

The authors attribute this difference to the characteristics of the datasets. SAMSum dialogues are usually shorter, with less dialogue history; here PARA-COMET's parametric memory module, which captures contextual information, has a clearer advantage and helps SICK++ summarize better. DialogSum dialogues are longer with richer context, and there COMET, which generates commonsense from the current utterance alone, turns out to be more effective.

Overall, the SICK++ framework is general and compatible with different commonsense generation models, but choosing the commonsense model best suited to a dataset's characteristics can further improve performance — a reflection of how differences in data distribution affect model choice. The authors suggest that commonsense models which better retain long dialogue histories could help SICK++ excel on more datasets, which calls for continued improvement of commonsense generation models.
:::

- PARA-COMET has the advantage of using parametric memory to consider previous sentences, which may be sensitive to length.
- Since SAMSum has shorter dialogues than DialogSum, the recurrent memory component of PARA-COMET is less likely to forget the previous sentences.

:::success
We expect to get better performance with the help of commonsense models that <span class='red'>maintain longer memories of sentences/dialogues</span>, and leave this as future research.
:::

### 5.2 Human Evaluation

*We conduct human evaluation to verify the quality of the generated summaries.*

- We randomly sample 50 dialogues from the test sets of SAMSum and DialogSum, respectively.
- Annotators were asked to score the quality of a set of summaries from **BART-xsum**, **SICK++**, and the **ground truth** using a Likert scale from 1 (worst) to 5 (best) in terms of <span class='red'>informativeness (i.e., covers adequate information)</span> and <span class='red'>factual consistency (i.e., consistent with the original input)</span>.
- Each summary was evaluated by three different annotators.
- Also, the win-loss ratio, which is not biased by subjectivity, is 51.33 (informativeness) and 54.16 (factual consistency), which is consistent with the observations made from the absolute scores.

:::info
This means that the win rate, an indicator unaffected by subjectivity, is also consistent with the absolute-score observations: 51.33% for informativeness and 54.16% for factual consistency.

The win rate is a metric unaffected by subjective scoring scales: it measures the proportion of cases where the generated summary beats the baseline summary. A win rate above 50% indicates that the generated summaries are better than the baseline.

Here, the win rates under both criteria (51.33% and 54.16%) are above 50%, matching the human absolute-score results: the generated summaries cover more information and are more consistent with the dialogue than the baseline. Even though the win rate is robust to subjectivity, it still agrees with the human ratings, further validating the advantage of the proposed model.
:::

![image](https://hackmd.io/_uploads/SJmfWfgRa.png =50%x)

- SICK++ gets better scores than BART-xsum for informativeness, which matches the ROUGE results.
- SICK++ also produces more consistent summaries, even though factual consistency is not explicitly modeled.

:::success
We assume that incorporating commonsense knowledge helps the model recognize hidden meanings and better understand the dialogue, resulting in fewer factual errors caused by improper reasoning over the conversational flow.
:::

## 6. Analysis

*To evaluate the effectiveness of our method, we address the following research questions to guide our experiments:*

1. RQ1: Does commonsense help summarizing dialogues?
2. RQ2: Is our method worth using in terms of efficiency despite the extra effort?
3. RQ3: Does commonsense supervision lead SICK++ to inject commonsense knowledge?

### 6.1 RQ1: Commonsense Applicability

- We experiment in a zero-shot setting to examine how commonsense knowledge alone affects dialogue summarization.
:::info
In this experiment, the authors adopt a zero-shot setting in order to observe purely how commonsense knowledge affects dialogue summarization.

A zero-shot setting means applying the model directly to the test set without any task-specific training. This avoids the influence of other training-time factors (such as hyperparameter settings) on model performance, making the effect of the provided commonsense knowledge clearer.

Concretely, the authors evaluate BART-xsum and their SICK model (with only the additional commonsense input, without commonsense supervision) on the SAMSum and DialogSum test sets. Since the only difference between SICK and BART-xsum is the commonsense knowledge appended to SICK's input, the performance gap between them directly reflects the contribution of commonsense knowledge to the dialogue summarization task.

Through this zero-shot comparison, the authors show that additional commonsense knowledge does help the model generate more accurate and semantically richer dialogue summaries, supporting their claim that commonsense is needed to bridge the gap between dialogues and documents.
:::

- While many factors besides commonsense could affect performance during training (e.g., hyperparameter configurations), in a zero-shot setting we can directly compare the cases with and without commonsense.

![截圖 2024-03-14 16.02.46](https://hackmd.io/_uploads/HkRN1VgRT.png =90%x)

- We find that SICK outperforms BART-xsum, where the performance gain comes from the additional commonsense.
- This also supports the idea that commonsense is essential in resolving the discrepancy between dialogues and documents.

### 6.2 RQ2: Data Efficiency

- **Our approach has limitations in terms of time efficiency.** However, we find that our method is helpful in situations where data is insufficient, meaning there is a trade-off (<span class='red'>time vs. data efficiency</span>).

![截圖 2024-03-14 16.09.07](https://hackmd.io/_uploads/rkFaxExAa.png)

- We hypothesize that, thanks to the additional knowledge and commonsense supervision, <span class='red'>SICK++ can show comparable performance even if only a small amount of training data is available (i.e., data efficiency)</span>.

:::info
The authors hypothesize that because SICK++ provides extra commonsense knowledge and is trained with commonsense supervision, it can match the performance of models trained on much more data even when only a small amount of training data is available (i.e., high data efficiency).

Specifically:
1. **Additional commonsense knowledge.** When a dialogue is fed in, SICK++ also appends the related commonsense knowledge generated from it as input. This extra knowledge helps the model better understand the dialogue context and the speakers' implicit intentions, compensating for what the dialogue content itself leaves unsaid.
2. **Commonsense supervision.** During training, besides the main dialogue summarization task, SICK++ adds the auxiliary task of generating target commonsense knowledge. This multi-task training forces the model to attend to and exploit the injected commonsense knowledge.

Through these two designs, SICK++ learns during training to use commonsense knowledge effectively, compensating for scarce training data. The authors therefore hypothesize that SICK++ can reach respectable performance even with little training data, demonstrating high data efficiency — a hope that the SICK++ framework offers a useful solution when training data is insufficient.
:::

- With only 30% of the training data, SICK++ outperforms BART-xsum trained with 70% of the training data.
- SICK++ consistently outperforms BART-xsum regardless of training data size, demonstrating the robustness of SICK++.
- The performance gap between SICK++ and BART-xsum can be attributed to the leveraged commonsense, given that BART-xsum is the base architecture of SICK++.

### 6.3 RQ3: Effect of Commonsense Supervision on Injecting Commonsense Knowledge

![截圖 2024-03-14 16.21.12](https://hackmd.io/_uploads/B1sKmNlRa.png =65%x)

:::success
Attention visualization of SICK/SICK++. Each point of the line corresponds to the average attention a particular SICK encoder attention head puts towards commonsense inferences.
:::

- Attention weights can be viewed as governing how "important" every other token is when producing the next representation for the current token.

:::info
This states that attention weights can be seen as determining the "importance" of every other token when generating the next representation of the current token.

In more detail: in attention-based models such as transformers, when producing a token's representation the model looks at all other tokens in the input sequence and assigns each an attention weight according to its relevance to the current token.

The size of the attention weight reflects how important that token is for computing the current token's representation. The larger the weight, the more that token is consulted and attended to, i.e., the larger its contribution to the representation; tokens with small weights have little influence.

Attention weights therefore decide how much the model draws on each other token's information when representing the current token. By analyzing the attention-weight distribution, one can probe the model's internal focus and reasoning process.
:::

- We conduct an experiment measuring the averaged attention value on the commonsense inferences compared to the utterances, using the validation set of DialogSum, which is more abstractive (i.e., more challenging to comprehend) than SAMSum.

:::info
In this experiment, the authors use the DialogSum validation set to measure how much attention the model pays, on average, to the commonsense inferences versus the dialogue content, and compare the two. DialogSum is more abstractive than SAMSum (i.e., more challenging, requiring deeper understanding).

The steps are:
1. **Dataset choice.** The DialogSum validation set is used because its dialogues are more abstractive and harder to understand, which better tests the model's ability to exploit commonsense inference.
2. **Computing attention weights.** The trained model is fed DialogSum validation dialogues; while generating the summaries, the model's average attention weights on the commonsense inferences and on the original dialogue content are recorded.
3. **Comparison.** The averaged attention weights on commonsense inferences and on dialogue content are compared; a higher value means the model attends more to that content during generation.

This design lets the authors observe how much the model relies on the externally injected commonsense inferences relative to the dialogue itself when handling more abstractive dialogues. The results shed light on how commonsense is used internally under different conditions, validating the effectiveness of the proposed model.
:::

- <span class='red'>The final layers of language models are the most task-specific, and we observe that SICK++ has marginally higher attention values there.</span>
- We conjecture this is due to the supervision provided by generating Z instead of relying on distant supervision, meaning that our goal of enforcing the model to use commonsense inferences is successful.

:::info
This sentence contrasts two forms of supervision: generating the target commonsense knowledge Z as supervision, versus relying on distant supervision.

Under the conventional (distant) setup, the supervision signal comes only from labeled data: the model learns the dialogue-to-summary mapping from annotated pairs, and any supervision over commonsense usage is indirect.

The authors' novel supervision is based on generating the target commonsense knowledge *Z*:
1. An external commonsense knowledge model (e.g., COMET) takes the reference summary as input and produces a set of related commonsense knowledge *Z*.
2. Generating *Z* is added as an auxiliary task, trained jointly with the main summarization task in a multi-task setting.
3. During training, the model must generate both the dialogue summary and the corresponding commonsense knowledge *Z*, and generating Z directly supervises the model's commonsense understanding.

Through this supervision, the model is forced to attend to and exploit the injected commonsense inferences rather than merely learning a data-driven mapping, achieving the goal of injecting commonsense. Compared with distant supervision, generating Z constrains the model to use the external knowledge more directly and effectively, enabling deeper reasoning for dialogue understanding; the experimental results confirm that this supervision helps the model fuse commonsense knowledge and improve summary quality.
:::

- Meanwhile, in the lower and middle layers, SICK++'s attention values tend to be lower than SICK's.
- One possible reason is that lower layers tend to look at syntactic and word-level information, whereas the commonsense inferences generated by COMET or PARA-COMET are only meaningful when understood conceptually.
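A toy sketch of the quantity plotted in the attention-visualization figure: for one encoder head, average the attention mass each token places on the commonsense-inference positions (the spans wrapped in `<I>`/`</I>`). The matrix values and position indices below are invented:

```python
def avg_commonsense_attention(attn, cs_positions):
    """Average attention that query tokens put on commonsense-inference key positions.

    attn: one head's attention matrix as a list of rows; attn[q][k] is the
          attention from query token q to key token k, each row summing to 1.
          (Invented toy values; real weights would come from the encoder.)
    cs_positions: key indices belonging to commonsense-inference tokens.
    """
    per_query = [sum(row[k] for k in cs_positions) for row in attn]
    return sum(per_query) / len(per_query)

# Toy 3-token sequence where token 2 is a commonsense-inference token.
attn = [
    [0.50, 0.25, 0.25],
    [0.20, 0.40, 0.40],
    [0.10, 0.30, 0.60],
]
score = avg_commonsense_attention(attn, cs_positions=[2])  # (0.25 + 0.40 + 0.60) / 3
```

Repeating this per layer and per head, for SICK and SICK++, yields the per-layer curves compared in the figure.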
:::info
This offers a possible explanation for why the lower transformer layers attend less to the commonsense inferences.

The lower layers of a transformer tend to capture syntactic and word-level information such as word order, morphology, and part of speech. The commonsense inferences generated by COMET or PARA-COMET, however, carry meaning at the conceptual level: their significance goes beyond the semantics of individual words and only emerges through higher-level understanding and reasoning.

In other words, commonsense inferences encode abstract, conceptual knowledge. Staying at the lexical and syntactic level cannot fully capture them; the word-level information must be fused and lifted to the conceptual level, forming the corresponding knowledge representations, before the meaning a commonsense inference conveys can be truly understood.

The lower transformer layers mainly handle basic processing such as word matching and syntax, analogous to the early stages of human language understanding, while conceptual commonsense information tends to form its representations in the higher layers, corresponding to higher-level reasoning. It is therefore reasonable that attention to commonsense inferences is low in the lower layers. Layer by layer, the higher layers fuse the lexical and syntactic information obtained below into representations of conceptual knowledge, which explains why the model's attention to commonsense inferences increases noticeably in the higher layers.
:::

## 7. Conclusion

- In this work, we propose the SICK and SICK++ frameworks to resolve two key challenges:
    1. **Filling in the gap in dialogues**
    2. **Injecting commonsense knowledge into a model**