# DA paper 1 -- Speaker-Change Aware CRF for Dialogue Act Classification (May 2020)

## Abstract

DA classification -> use a neural network (NN) + a conditional random field (CRF)

CRF: a discriminative probabilistic model, commonly used for labeling or analyzing sequence data

utterance: smallest unit of speech

---

In this paper: **make the CRF layer take speaker change into account.**

Conclusion: the CRF layer learns meaningfully; **it can capture sophisticated transition patterns between DA label pairs.**

---

Source code: https://bitbucket.org/guokan_shang/da-classification/src/master/

## 1. Introduction

DA: notion of illocutionary force (the speaker's intention in delivering an utterance)

Task of DA classification: assign a DA label to each utterance, representing its function in the communication.

---

Applications of DA:

* Utterance Clustering
* Real-Time Information Retrieval
* Conversational Agents
* Summarization

---

conversation: a sequence of utterances

It is difficult to predict the DA of a single utterance without access to the other utterances in its context, since the same words may carry different meanings in different situations.

**A DA classification model must therefore capture dependencies both at the utterance level and at the label level.**

---

Model used in this paper: **bidirectional RNN with LSTM cells + linear-chain CRF**

**1. BiLSTM**
Captures the dependencies among consecutive utterances.

**2. CRF**
Captures the dependencies among consecutive DA labels.

## 2. Motivation

Most NLP tasks involve only two sequences, input and target. DA classification has an additional input sequence: the speaker identifiers.

In dialogue, participants do not start or stop speaking arbitrarily; they follow an underlying turn-taking system.
-> The sequences of DAs and speakers are tightly interconnected.

## 3. Related Work

This section mentions many architectures used for DA classification in prior work, but I can't really follow it, e.g. Naive Bayes, MaxEnt, HMM, BiLSTM...
Even for RNN, which I have at least heard of, I don't know the exact architecture being referred to.

My question: what should I do in this situation? Roughly look up the usual architecture behind each of those names, and then try to work out why those architectures were chosen?

## 4. Model

BiLSTM-CRF model for DA classification:

![](https://i.imgur.com/rBaOVTq.jpg)

Fig. 1: { u } represents the utterance embeddings used as input; Q, A, S, S (DA labels) are the target. The 3 possible labels {Q, A, S} stand for Question, Answer, and Statement.

---

* **Notation**
{ ( x^t, y^t ) } ( t = 1~T ) : a conversation of length T
X = { x^t } ( t = 1~T ) : the sequence of utterances
x^t = { x^t_n } ( n = 1~N ) : a sequence of words of length N
Y = { y^t } ( t = 1~T ) : the target sequence
y^t : the label of utterance x^t

* **Utterance Encoder**
Each utterance is separately encoded by a shared forward RNN with LSTM cells.
Only the last annotation u^t_N is retained -> I don't quite understand why. (Presumably because the last hidden state has read the whole utterance, so it serves as a fixed-size summary of it.)
We are left with a sequence of utterance embeddings { u^t } ( t = 1~T ).

* **BiLSTM Layer**
Pass { u^t } into a bidirectional LSTM, which returns { v^t } ( t = 1~T ) : conversation-level utterance representations.

* **CRF Layer**
The v^t produced by the BiLSTM could already be used to assign DA labels, but the result would be produced utterance by utterance, not from the perspective of the DA sequence (the DAs are interdependent).
The CRF models the conditional probability P(Y|X), which guarantees a global solution (i.e., a result produced at the sequence level).

---

Honestly, in this whole section I only understand the notation, the figure, and why the CRF layer is needed. The detailed math and the listed equations are all cited from the reference papers, and I get little out of just reading them.
Should I go read those reference papers, or look at how they are implemented in the source code?

## 5. Our Contribution

1. Add the speaker identifiers S = { s^t } ( t = 1~T ).
2. By comparing adjacent entries of S, derive the sequence of speaker changes Z = { z^(t,t+1) } ( t = 1~T-1 ), where
z = 0 : the speaker does not change
z = 1 : the speaker changes
3. As mentioned above, the CRF layer models the conditional probability P(Y|X). The authors modify the CRF layer so that it models P(Y|X, Z) instead
-> predict the DA labels from the utterance sequence and the speaker-change sequence jointly.
4. Finally, the CRF layer's transition score becomes
![](https://i.imgur.com/QwIvTCX.jpg)
G0 : the label transition matrix when the speaker does not change
G1 : the label transition matrix when the speaker changes
(A minimal sketch of this scoring follows this list.)
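To make the speaker-change-aware scoring concrete, here is a minimal NumPy sketch of how the unnormalized score of one label sequence could be computed with the two transition matrices G0 and G1. The function names and toy data are mine, not from the paper's source code; this is only a reconstruction of the idea under standard linear-chain CRF scoring.

```python
import numpy as np

def speaker_changes(speakers):
    """Derive the speaker-change sequence Z from the speaker identifiers S:
    z = 0 if the next utterance has the same speaker, 1 otherwise."""
    return [0 if a == b else 1 for a, b in zip(speakers[:-1], speakers[1:])]

def sequence_score(emissions, labels, z, G0, G1):
    """Unnormalized CRF score of one candidate label sequence Y.

    emissions : (T, L) per-utterance label scores (v^t projected onto labels)
    labels    : T integer label indices (the candidate sequence Y)
    z         : speaker-change sequence of length T-1
    G0, G1    : (L, L) transition matrices for unchanged / changed speaker
    """
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        G = G1 if z[t - 1] else G0  # pick the transition matrix by speaker change
        score += G[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return score

# Toy usage: 3 labels {Q, A, S}, 4 utterances, 2 speakers.
rng = np.random.default_rng(0)
emissions = rng.normal(size=(4, 3))
G0, G1 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
z = speaker_changes(["A", "A", "B", "B"])  # -> [0, 1, 0]
print(sequence_score(emissions, [0, 2, 1, 2], z, G0, G1))
```

At decoding time, Viterbi would maximize this score over all label sequences; the only change from a standard linear-chain CRF is that the transition matrix is indexed by z at each step.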
## 6. Experimental Setup

* **Dataset**
The SwDA dataset: https://github.com/cgpotts/swda
Annotated with 42 mutually exclusive labels.

* **The '+' Tag**

![](https://i.imgur.com/c4yG85E.jpg)

In the whole of SwDA, 8.1% of the DA labels are the '+' tag, which indicates that the listener interrupted. Its purpose is to mark every utterance with which a speaker, after being interrupted, finishes what they were saying; it is not the kind of DA label this work actually studies, i.e., one representing a function in the conversation.

Therefore, in this paper the authors concatenate every '+' utterance with the previous utterance of the same speaker, restoring the complete utterance, and then assign a single DA label to the whole.

The reason: having the model predict '+' is meaningless to us, and predicting the DA label of the incomplete utterance before a '+' is hard. Since the same words mean different things on different occasions, the model may be unable to judge accurately from a mere fragment of an utterance.

The authors' example:
1. "A: so, (Wh-Question)"
2. "B: <throat_clearing> (Non-verbal)"
3. "A: what's your name? (+)"

Looking at the first utterance alone, the word "so" has far too many possible meanings. After the authors' fix:
1. "A: so, what's your name (Wh-Question)"
2. "B: <throat_clearing> (Non-verbal)"

This avoids both problems at once: the model no longer has to interpret fragments or predict '+'.

---

After reading this part, I think we could apply this way of handling interruptions to our own dataset, if needed. A small preprocessing sketch follows below.
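Here is a minimal sketch of that merging step, assuming a conversation is given as (speaker, text, label) triples. It is my reconstruction of the procedure described above, not the authors' actual preprocessing code; a real pass over SwDA would also need a policy for a '+' whose same-speaker antecedent is missing.

```python
def merge_plus_utterances(conversation):
    """Concatenate each '+' utterance onto the most recent utterance by the
    same speaker, keeping that earlier utterance's DA label.

    conversation: list of (speaker, text, label) triples.
    """
    merged = []
    for speaker, text, label in conversation:
        if label == "+":
            # Append this continuation to the speaker's last utterance.
            for prev in reversed(merged):
                if prev[0] == speaker:
                    prev[1] = prev[1] + " " + text
                    break
            # If no antecedent exists, the fragment is silently dropped here.
        else:
            merged.append([speaker, text, label])
    return [tuple(u) for u in merged]

example = [
    ("A", "so,", "qw"),               # Wh-Question
    ("B", "<throat_clearing>", "x"),  # Non-verbal
    ("A", "what's your name?", "+"),  # continuation of A's interrupted question
]
print(merge_plus_utterances(example))
# [('A', "so, what's your name?", 'qw'), ('B', '<throat_clearing>', 'x')]
```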
* **Implementation and Training Details**
1. All characters are converted to lowercase.
2. Dropout of 0.2 is applied to both the utterance embeddings ( u^t ) and the conversation-level utterance representations ( v^t ).
3. The LSTM layers have 300 hidden units.
4. The embedding layer uses 300-dim word vectors pretrained with the gensim implementation of word2vec.
5. Vocabulary size ~ 21K; unrecognized words are all mapped to a special tag [UNK].
6. The model is trained with the Adam optimizer.

## 7. Quantitative Results

* **Performance Comparison**

1. Overall accuracy on predicting the 42 DA labels: the authors' modified CRF outperforms others' previous work by 1%.

![](https://i.imgur.com/Njy84pD.jpg)

---

2. Overall accuracy on the 10 most frequent DA labels (91% of all labels):

![](https://i.imgur.com/qz4XmCm.jpg)

According to the diagonals, the authors' model better predicts 6 labels out of 10.
For **sv** (statement-opinion), the gap reaches 5.9%.
For **x** (non-verbal), the two models tie.

The authors' model also clearly reduces misclassification. For example, the rate of **sv** misclassified as **sd** drops by up to 5.8%, and the rate of **aa** misclassified as **b** drops by up to 3.9%. This is worth noting, as these two are among the most frequently confused label pairs (Kumar et al., 2018; Chen et al., 2018).

3. The authors' model also brings improvement where it is most needed:

![](https://i.imgur.com/g0WPXGd.jpg)

For the most difficult and rare DAs, the authors' model decreases the miss rate by 4.87%.

---

* **The benefits of considering speaker information vary across DA labels**

Some labels contain clear lexical cues that map directly to the corresponding DA labels; DAs of this kind do not require access to speaker information (or speaker-change awareness). Four such DAs are mentioned:
1. "<laughter>" -> **x** : non-verbal
2. "Bye-bye" -> **fc** : conventional-closing
3. "That's great" -> **ba** : appreciation
4. "Do you ... ?" -> **qy** : yes-no-question

---

* **Ensembling and Joint Training**

Since the modified CRF layer and the original CRF layer each have their own strengths and weaknesses, the authors try combining the two to improve performance.

**1. Ensembling**
In short, the transition scores of the modified CRF and the original CRF are averaged. The transition score introduced earlier becomes:

![](https://i.imgur.com/5LjsFlm.jpg)

A G_basis term is added and applied at every time step (whether or not there is a speaker change).
According to the results in Table 2, this is 0.2% more accurate than using only the modified CRF, and 1.2% more accurate than using only the original CRF.

**2. Joint Training**

![](https://i.imgur.com/g79BkJo.jpg)

In both the (G_basis + G0) and the (G_basis + G1) case, adding G_basis blurs the label transition patterns. (It amounts to superimposing two transition matrices and averaging the result.)

---

* **Ablation Studies**

This part investigates whether feeding the speaker-change information into the CRF is an effective way of using it.

![](https://i.imgur.com/pUoAWwT.jpg)

1. From b1 and b2 we can see that adding either the speaker identifiers or the speaker changes improves the base model, but neither gains as much as adding them to the CRF (case a).
2. Adding the speaker information to both the BiLSTM and the CRF brings no further accuracy gain. This means considering it twice is useless, or at least not in the form shown in the table.

---

* **BiLSTM-CRF vs. BiLSTM-Softmax**

This part has little technical content; it states that the authors' novel CRF works better for DA classification than others' CRF or softmax layers, but that on other NLP tasks their model differs little from others'.

## 8. Qualitative Results

**Visualization of transition matrices**

![](https://i.imgur.com/qpLCHPc.jpg)

Left (**G0**) : modified CRF, speaker unchanged
Middle (**G1**) : modified CRF, speaker changed
Right (**G**) : original CRF

(A minimal plotting sketch for this kind of figure is appended at the end of these notes.)

**Observations:**
1. **G0** has a clearly visible dark diagonal: when the speaker does not change, most DA labels tend to continue into the next utterance. **G1** shows no such pattern: when the speaker changes, the DA very often switches to another label.
2. If the speaker changes, **qw** is often followed by statements (**sd** and **sv**) or other answers (**no**). If the speaker does not change, **qw** is followed by **qy** (yes-no-question), **qrr** (or-clause), or acknowledgements (**bk** and **b**). This means the speaker is restating the same question, or asking and answering it themselves.
3. If the speaker changes, **sv** is often followed by **aa** (agree/accept) or **ar** (reject). **sd** shows no comparably clear pattern.
4. If the speaker changes, **qy** (yes-no-question) is often followed by **ny** (yes), **nn** (no), or **no** (other answers). If the speaker does not change, **qy** is followed by other question-type DAs; as in 2., this means the speaker is repeating the question.
5. If the speaker changes, answer-type DAs (**ny**, **nn**, **no**) are often followed by acknowledgements (**bk** and **b**). If the speaker does not change, **ny**, **nn**, and **no** are followed by statements (**sd** and **sv**).
6. None of this can be read off **G** the way it can off the other two transition matrices; one only finds that it blends the results of **G0** and **G1**. (In other words, the original CRF tries to capture both situations at once.)
-> Using two matrices (conditioned on whether the speaker changes) lets us read out far more DA transition patterns.

---

The remaining sections have no technical content; they are all a recapitulation of the paper, so I will not write them up.
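As a footnote to the Section 8 discussion, here is a minimal matplotlib sketch of how such transition-matrix heatmaps could be drawn side by side. The matrices and the label subset are random placeholders standing in for the learned G0 / G1 / G; only the plotting layout is illustrated.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: a subset of SwDA tags and random "transition scores".
labels = ["sd", "sv", "qw", "qy", "ny", "nn", "no", "aa", "b", "bk"]
L = len(labels)
rng = np.random.default_rng(0)
matrices = {
    "G0 (speaker unchanged)": rng.normal(size=(L, L)),
    "G1 (speaker changed)": rng.normal(size=(L, L)),
    "G (original CRF)": rng.normal(size=(L, L)),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (title, G) in zip(axes, matrices.items()):
    im = ax.imshow(G, cmap="Greys")  # darker cell = higher transition score
    ax.set_title(title)
    ax.set_xticks(range(L))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(L))
    ax.set_yticklabels(labels)
    ax.set_xlabel("next DA label")
    ax.set_ylabel("previous DA label")
fig.colorbar(im, ax=axes, shrink=0.8)
plt.show()
```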