# DA paper 1 -- Speaker-Change Aware CRF for Dialogue Act Classification (May 2020)
## Abstract
DA Classification -> use a neural network model (NN) + conditional random field (CRF)
CRF : conditional random field
A discriminative probabilistic model, commonly used for labeling or analyzing sequential data
utterance : smallest unit of speech
---
In this paper :
**Make CRF layer takes speaker-change into account.**
Conclusion :
CRF layer learns meaningfully, **it can tell sophisticated transition patterns between DA label pairs.**
---
Source code:
https://bitbucket.org/guokan_shang/da-classification/src/master/
## 1. Introduction
DA : notion of illocutionary force (speaker's intention in delivering an utterance)
Task of DA Classification :
Assign DA label to each utterance to represent its functionality in communication.
---
Applications of DA:
* Utterance Clustering
* Real-Time Information Retrieval
* Conversational Agents
* Summarization
---
conversation : sequence of utterances
Difficult to predict DA of single utterance without having access to other utterances in the context.
-> Since the same words may carry different meanings in different situations
**It is necessary for a DA classification model to capture dependencies both at the utterance level and at the label level**
---
Model Used in this paper: **bidirectional RNN with LSTM cells + linear chain CRF**
**1. BiLSTM**
Capture the dependencies among consecutive utterances
**2. CRF**
Capture the dependencies among consecutive DA-labels
## 2. Motivation
Most NLP tasks : involve only two sequences, input and target
DA Classification : additional input sequence - speaker identifiers
In dialogue, participants do not start or stop speaking arbitrarily; they follow an underlying turn-taking system.
-> The sequences of DAs and speakers are tightly interconnected.
## 3. Related Work
This section mentions many architectures used in prior DA classification work, but I don't really understand them,
e.g., Naive Bayes, Maxent, HMM, BiLSTM...
Even for RNN, which I have heard of, I don't know the exact architecture the paper refers to.
What can I do when I run into this situation?
Roughly look up the common architectures behind those terms, then try to reason about why they were used?
## 4. Model
BiLSTM-CRF Model for DA classification:

Fig.1 :
{u} represents utterance embeddings as input.
Q, A, S, S (DA labels) as target.
3 possible labels {Q, A, S} stand for Question, Answer, and Statement.
---
* **Notation**
{ ( x^t, y^t ) } ( t = 1~T ) : a conversation of length T.
X = { x^t } ( t = 1~T ) : sequence of utterances.
x^t = { x^t_n } ( n = 1~N ) : a sequence of words of length N
Y = { y^t } ( t = 1~T ) : the target sequence
y^t : the label
* **Utterance Encoder**
Each utterance is separately encoded by a shared forward RNN with LSTM cells.
Only the last annotation u^t_N is retained -> I don't fully understand why; presumably because the final hidden state has processed every word, so it serves as a fixed-size summary of the whole utterance.
We are left with a sequence of utterance embeddings { u^t } ( t = 1~T )
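As a rough sketch of this step, the following uses a plain tanh RNN as a stand-in for the paper's LSTM cell (the recurrence and the dimensions are illustrative assumptions, not the authors' setup); the point is that only the final hidden state is kept as the utterance embedding u^t:

```python
import numpy as np

def encode_utterance(word_vecs, W_x, W_h, b):
    """Run a simple tanh RNN (stand-in for the paper's LSTM) over one
    utterance and keep only the last hidden state as its embedding."""
    h = np.zeros(W_h.shape[0])
    for x in word_vecs:          # x^t_1 ... x^t_N
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h                     # u^t = last annotation u^t_N

rng = np.random.default_rng(0)
d_word, d_hid = 4, 3
W_x = rng.normal(size=(d_hid, d_word))
W_h = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

utterance = rng.normal(size=(5, d_word))   # 5 words, 4-dim embeddings
u = encode_utterance(utterance, W_x, W_h, b)
print(u.shape)   # (3,) -- one fixed-size vector per utterance
```

Running the shared encoder over every utterance of the conversation then yields the sequence { u^t }.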
* **BiLSTM Layer**
pass in { u^t } to a bidirectional LSTM,
return { v^t } ( t = 1~T ) : conversation-level utterance representations
* **CRF Layer**
The v^t produced by the BiLSTM could already be used to assign DA labels, but that would produce a result per single utterance, not from the viewpoint of the DA sequence (the DAs are related to one another).
The CRF models the conditional probability P(Y|X), which guarantees a global solution (i.e., a result produced from the sequence viewpoint).
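A minimal sketch of how a linear-chain CRF turns per-utterance scores plus a transition matrix into one globally best label sequence, here via Viterbi decoding; the scores `em` and the matrix `G` are toy values of my own, not learned parameters:

```python
import numpy as np

def viterbi(emissions, G):
    """Most likely label sequence under a linear-chain CRF.
    emissions: (T, K) per-utterance label scores (e.g. from the BiLSTM),
    G:         (K, K) transition scores G[i, j] = score(y^t=i -> y^{t+1}=j)."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + G + emissions[t][None, :]  # (K, K) candidates
        back[t] = cand.argmax(axis=0)   # best previous label for each current label
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):       # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# toy example: 3 labels {Q, A, S}, transitions favoring Q -> A
em = np.array([[2., 0., 0.], [0.5, 0.6, 0.], [0., 0., 1.]])
G = np.array([[-1., 2., 0.], [0., -1., 1.], [0., 0., 0.]])
print(viterbi(em, G))   # [0, 1, 2] -> Q, A, S
```

Note how the decoded sequence depends on the transitions, not just on each utterance's own scores: this is the "global" view the CRF adds.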
---
Honestly, in this whole section I only really understand the notation, the figure,
and why the CRF layer is needed.
The detailed math and the listed equations are all cited from the referenced papers,
so just reading over them gives me little idea of what is going on.
I'd like to know whether I should go read those referenced papers,
or look at how the authors implement them in their source code?
## 5. Our Contribution
1. Add the speaker-identifier sequence S = { s^t } ( t = 1~T )
2. By comparing adjacent entries of S, derive the sequence of speaker changes Z = { z^(t,t+1) } ( t = 1~T-1 )
z = 0 : speaker does not change
z = 1 : speaker changed
3. As mentioned in the previous section, the CRF layer models the conditional probability P(Y|X)
The authors modify the CRF layer so that it models P(Y|X, Z)
-> predict DA labels using both the utterance sequence and the speaker-change sequence
4. Finally, the CRF layer's transition score from y^t to y^(t+1) becomes G_(z^(t,t+1))[ y^t, y^(t+1) ], where
G0 : label transition matrix when the speaker is unchanged
G1 : label transition matrix when the speaker changed
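The two pieces above can be sketched as follows; the matrices and labels are toy values, and `transition_score` only illustrates how G0/G1 are selected per step (it is not the authors' full CRF):

```python
import numpy as np

def speaker_changes(speakers):
    """Z = { z^(t,t+1) }: 1 if the speaker changes between consecutive
    utterances, else 0."""
    return [int(a != b) for a, b in zip(speakers, speakers[1:])]

def transition_score(labels, Z, G0, G1):
    """Sum of CRF transition scores, picking G0 when the speaker is
    unchanged and G1 when it changed."""
    total = 0.0
    for t, z in enumerate(Z):
        G = G1 if z else G0
        total += G[labels[t], labels[t + 1]]
    return total

S = ["A", "A", "B", "B", "A"]
print(speaker_changes(S))   # [0, 1, 0, 1]

K = 3
G0 = np.eye(K)                    # toy: same speaker favors label continuation
G1 = np.ones((K, K)) - np.eye(K)  # toy: speaker change favors switching labels
y = [0, 0, 1, 1, 2]
print(transition_score(y, speaker_changes(S), G0, G1))   # 4.0
```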
## 6. Experimental Setup
* **Dataset**
The SwDA dataset : https://github.com/cgpotts/swda
Annotated with 42 mutually exclusive labels.
* **The '+' Tag**

In the whole SwDA, 8.1% of the DA labels are the '+' tag, which indicates that the listener interrupted. It marks every utterance in which a speaker, after being interrupted, resumes and finishes what they were saying; it is not the kind of DA label this work sets out to study, i.e., one that represents a communicative function.
Therefore, in this paper the authors concatenate every '+' utterance with the same speaker's previous utterance, restoring the complete utterance, and then assign the whole utterance a single DA label.
They do this because having the model predict '+' is meaningless to us, and because predicting the DA label of the incomplete utterance preceding a '+' is difficult: since the same words can mean different things in different situations, the model may not judge accurately from a sentence fragment alone.
The authors' example:
1."A: so, (Wh-Question)"
2."B: <throat_cleaning> (Non-verbal)"
3."A: what's your name? (+)"
Looking at utterance 1 alone, the word "so" has far too many possible meanings.
After applying the authors' fix:
1."A: so, what's your name (Wh-Question)"
2."B: <throat_cleaning> (Non-verbal)"
This avoids both problems at once: fragments the model cannot interpret, and having to predict '+'.
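The merging step described above can be sketched as a small preprocessing function; the triple representation and the helper name are my own assumptions, not the authors' code:

```python
def merge_continuations(utterances):
    """Merge each '+'-tagged utterance into the most recent utterance
    by the same speaker, as in the paper's preprocessing.
    utterances: list of (speaker, text, label) triples."""
    merged = []
    for speaker, text, label in utterances:
        if label == "+":
            # find the most recent utterance by the same speaker
            for i in range(len(merged) - 1, -1, -1):
                if merged[i][0] == speaker:
                    merged[i] = (speaker, merged[i][1] + " " + text, merged[i][2])
                    break
        else:
            merged.append((speaker, text, label))
    return merged

dialogue = [
    ("A", "so,", "qw"),                  # Wh-Question
    ("B", "<throat_cleaning>", "x"),     # Non-verbal
    ("A", "what's your name?", "+"),     # continuation of A's question
]
print(merge_continuations(dialogue))
# [('A', "so, what's your name?", 'qw'), ('B', '<throat_cleaning>', 'x')]
```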
---
After reading this part, I think their way of handling interruptions could be applied to our dataset,
if we ever need it.
* **Implementation and Training Details**
1. All characters are converted to lowercase.
2. Dropout of 0.2 is applied to both the utterance embeddings ( u^t ) and the conversation-level utterance representations ( v^t ).
3. LSTM layers have 300 hidden units.
4. The embedding layer uses 300-dim word vectors pretrained with the gensim implementation of word2vec.
5. Vocabulary size ~ 21K; unrecognized words are all mapped to a special token [UNK].
6. The model is trained with the Adam optimizer.
## 7. Quantitative Results
* **Performance Comparison**
1. Overall accuracy on predicting 42 DA labels :
The authors' modified CRF outperforms others' previous work by 1%.

---
2. Overall accuracy on the 10 most frequent DA labels (91% of all labels) :

According to the diagonals, the authors' model better predicts 6 labels out of 10.
In the case of **sv** (statement-opinion), the gap reaches 5.9%.
In the case of **x** (non-verbal), the two models tie.
The authors' model also clearly reduces misclassification: the rate of **sv** misclassified as **sd** drops by up to 5.8%, and of **aa** misclassified as **b** by up to 3.9%. This is notable, as these two are among the most frequently confused label pairs (Kumar et al., 2018; Chen et al., 2018).
3. Author's model also brings improvement where it is most necessary:

For the most difficult and rare DAs, the authors' model decreases the miss rate by 4.87%.
---
* **The benefits of considering speaker information vary across DA labels**
Some labels contain clear lexical cues that map directly to the corresponding DA labels; these kinds of DAs do not require access to speaker information (or speaker-change awareness).
There are four DAs mentioned to be in this situation:
1. "<laughter>" -> **x** : non-verbal
2. "Bye-bye" -> **fc** : conventional-closing
3. "That's great" -> **ba** : appreciation
4. "Do you ... ?" -> **qy** : yes-no-question
---
* **Ensembling and Joint Training**
Here the authors try one more thing: since the modified CRF layer and the original CRF layer each have their own strengths and weaknesses, they try combining the two to improve performance.
**1. Ensembling**
In short, the transition scores of the modified and the original CRF are averaged.
The transition score mentioned earlier becomes ( G_basis + G_(z^(t,t+1)) )[ y^t, y^(t+1) ]:
a G_basis term is added and applied at every time step (whether or not there is a speaker change).
The results in Table 2 show 0.2% higher accuracy than using the modified CRF alone, and 1.2% higher than using the original CRF alone.
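A sketch of what the ensembled transition might look like at one time step, assuming the averaging interpretation described above (G_basis shared across all steps, G0/G1 selected by the speaker-change flag z; all matrices are toy values):

```python
import numpy as np

def ensembled_transition(z, G_basis, G0, G1):
    """Transition matrix used at one step of the ensembled CRF:
    the shared G_basis is applied regardless of speaker change,
    averaged with the G0 / G1 matrix selected by z."""
    return (G_basis + (G1 if z else G0)) / 2.0

K = 3
G_basis = np.eye(K)        # toy shared matrix
G0 = 2 * np.eye(K)         # toy speaker-unchanged matrix
G1 = np.ones((K, K))       # toy speaker-changed matrix
print(ensembled_transition(0, G_basis, G0, G1))   # 1.5 on the diagonal
```

Since CRF decoding is invariant to a constant scaling of the scores, summing instead of averaging would pick the same label sequence.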
**2. Joint Training**

In both the (G_basis + G0) and the (G_basis + G1) configuration, adding G_basis blurs the label transition patterns (equivalent to overlaying the two transition matrices and averaging them).
---
* **Ablation Studies**
This part examines whether feeding speaker-change information into the CRF is the most effective way to use it.

1. From b1 and b2 we can see that adding either the speaker identifier or the speaker change improves the base model, but neither improves it as much as adding the information to the CRF (case a).
2. Adding speaker information to both the BiLSTM and the CRF also brings no accuracy gain. This suggests that considering it twice is not useful, or at least not in the way shown in the table.
---
* **BiLSTM-CRF vs. BiLSTM-Softmax**
This part contains little technical content; it states that the authors' speaker-change aware CRF works better for DA classification than others' CRF or Softmax layers.
On other NLP tasks, however, their model differs little from the alternatives.
## 8. Qualitative Results
**Visualization of transition matrices**

Left (**G0**) : modified CRF, speaker unchanged
Middle (**G1**) : modified CRF, speaker changed
Right (**G**) : original CRF
**Observations:**
1. **G0** has a clearly dark diagonal, meaning that when the speaker stays the same, most DA labels tend to continue into the next utterance.
**G1** shows no such pattern: when the speaker changes, the DA also often switches to a different label.
2. If the speaker changes, **qw** is often followed by statements (**sd** and **sv**) or other answers (**no**).
If the speaker stays the same, **qw** is instead followed by **qy** (yes-no-question) and **qrr** (or-clause), or by acknowledgements (**bk** and **b**). This means the speaker is restating the same question, or answering their own question.
3. If the speaker changes, **sv** is often followed by **aa** (agreement) or **ar** (rejection). **sd** shows no such clear pattern.
4. If the speaker changes, **qy** (yes-no-question) is often followed by **ny** (yes), **nn** (no), or **no** (other answers).
If the speaker stays the same, **qy** is followed by other question-type DAs. As in 2, this means the speaker is repeating the question.
5. If the speaker changes, answer-type DAs (**ny**, **nn**, **no**) are often followed by acknowledgements (**bk** and **b**).
If the speaker stays the same, **ny**, **nn**, and **no** are followed by statements (**sd** and **sv**).
6. **G** reveals none of this information; it merely blends the results of **G0** and **G1** (in other words, the original CRF tries to capture both situations at once).
-> Only with two matrices (conditioned on speaker change) can these richer DA transition patterns be read off.
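To make the row-wise reading of these matrices concrete, here is a toy matrix in the spirit of G1 (speaker changed), with the row-wise argmax giving the most likely next label after each DA; the numbers are invented for illustration, not the learned matrix from the paper:

```python
import numpy as np

labels = ["sd", "sv", "qy", "ny", "b"]
# toy transition matrix: entry [i, j] = score of label i being
# followed by label j when the speaker changes
G1 = np.array([
    [0.1, 0.2, 0.1, 0.0, 0.6],
    [0.1, 0.1, 0.1, 0.0, 0.7],
    [0.1, 0.1, 0.0, 0.7, 0.1],
    [0.2, 0.1, 0.1, 0.0, 0.6],
    [0.5, 0.3, 0.1, 0.1, 0.0],
])

# most likely next label after each DA, i.e. the row-wise argmax
for i, row_label in enumerate(labels):
    print(row_label, "->", labels[int(G1[i].argmax())])
# e.g. prints "qy -> ny": after a yes-no-question, "yes" is most likely
```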
---
The remaining sections contain no technical content; they only summarize the paper, so I will not take notes on them.