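# Team14 Final Project Report

## Q1. Model design & Concept

### **Rule-based Method**

* **Design rationale for the pun location selection mechanism:**

  The [Cambridge Dictionary](https://dictionary.cambridge.org/zht/%E8%A9%9E%E5%85%B8/%E8%8B%B1%E8%AA%9E/pun) defines a pun as "a humorous use of a word or phrase that has several meanings or that sounds like another word". It follows that some pun words carry several meanings within one sentence, so the mechanism first removes words that do not have multiple meanings from the candidate list.

  Next, we follow Huang et al., who in "[Identification of Homographic Pun Location for Pun Understanding](https://dl.acm.org/doi/10.1145/3041021.3054257)" propose using position information as a framework for identifying puns. On the test data we computed pun position / sentence length, and the average is about 0.8. We therefore adopt this idea and remove the words in the front part of the sentence from the candidate list.

  Finally, we look at how the pun affects the meaning of the sentence. For each candidate word, we replace it with its synonyms and compare how much the meaning of the whole sentence changes. If the substituted sentence differs in meaning from the original, we consider the candidate likely to be the real pun. Conversely, if the two sentences do not differ semantically, the candidate is less likely to be chosen as the final answer.

* **Pun location selection mechanism:**
  1. From the sentence, filter out a list of candidate words most likely to be the correct answer:
     * a. Exclude punctuation and stop words.
     * b. Exclude words with fewer than two synonyms (looked up in WordNet).
     * c. Exclude words that appear before a given position.
  2. Once the candidate list is built, find the synonym set of each word in it. Substitute each word's synonyms into the sentence, compute the meaning of the substituted sentence (the average of the vectors of all words in the sentence), and then compute the cosine similarity between the substituted sentence and the original one. The candidate with the lowest similarity is selected as the answer.
  3. Worked example:

     ```
     Original sentence (text_id=hom_625): OLD PROGRAMMERS never die they just go to bits .
     Candidate list: ['go', 'bits']

     Synonym set of go: ['blend_in', 'go_bad', 'hold_up', 'live_on', 'fling', … , 'move', … , 'exit', 'go', 'whirl', 'function', 'sound', 'pass', 'endure', 'decease', 'Adam', 'run', 'offer']
     Sentence after synonym substitution: OLD PROGRAMMERS never die they just move to bits .
     Similarity between the original and the substituted sentence: 0.973536915106364

     Synonym set of bits: ['morsel', 'moment', 'second', 'snatch', 'flake', 'routine', 'prick', 'mo', 'turn', 'sting', … , 'number']
     Sentence after synonym substitution: OLD PROGRAMMERS never die they just go to number .
     Similarity between the original and the substituted sentence: 0.929340444103404

     -> The word with the lower similarity (the one whose replacement changes the sentence meaning), bits, is selected as the pun.
     ```

### **Deep Learning Approach**

* Architecture:

  ![](https://i.imgur.com/LzEBUJB.png)

  1. This architecture was designed by [Zou and Lu, 2019](https://arxiv.org/abs/1909.00175).
  2. The main idea is to embed characters and words separately, label the tokens with BPA tags (Before, Pun, After), feed everything into a bidirectional LSTM, and then predict the tag sequence with a CRF layer.

## Q2. What kind of word sense representation used and experimented in your model

### **Rule-based Method**

* Word Sense Representation: the GoogleNews-vectors-negative300.bin word vectors provided by Google.
* Experiment to find the best cutoff position:

  | **Cutoff position** | **Share of cases with the correct answer in the candidate list** | **Candidate words / total words** |
  | -------- | -------- | -------- |
  | 0.1 (keep the last 90% of the words) | 0.9556 | 0.3885 |
  | 0.3 (keep the last 70% of the words) | 0.9393 | 0.2896 |
  | 0.35 (keep the last 65% of the words) | 0.93 | 0.2696 |
  | 0.5 (keep the last 50% of the words) | 0.8818 | 0.1962 |
  | 0.7 (keep the last 30% of the words) | 0.8017 | 0.1323 |

  We want the share of cases with the correct answer in the candidate list to be as high as possible, and the ratio of candidate words to total words to be as low as possible (meaning more options are excluded). The data shows that the earlier the cutoff, the more words are kept and the more often the correct answer is in the candidate list. In return, fewer words are excluded, which increases the chance of picking a wrong word.

  -> Weighing coverage of the correct answer against the number of excluded words, we chose 0.3 as an acceptable cutoff position.

* **Supplement (another rule-based experiment)**

  This experiment also uses the GoogleNews-vectors-negative300.bin word vectors. After removing stop words and words missing from the word vector vocabulary, we compute the similarity between every pair of words in the original sentence and make the final choice from those scores. No synonym substitution is used here.

### **Deep Learning Approach**

According to Zou's description, this architecture does not take word sense into account. Instead, it uses character embeddings, pre-trained word embeddings, and position indicators as the model input, and uses an LSTM for sequence labeling.

## Q3. What problem did you face during the homework and how you solved

### **Deep Learning Approach**

1. The first problem was library versions. The paper is already three years old and its source code still targets PyTorch 0.4, so updating all of the syntax to the current version took a considerable amount of time.
2. At inference time we found that the authors do not provide an API, and the model itself is not easy to use directly for prediction. We therefore modified the model and built an API that fits our needs.

## Q4. Error Analysis and Discussion

### **Deep Learning Approach**

The K used in KFold validation has a very large effect on the results: choosing the right K can change performance by more than 20%. The choice of optimizer also matters. The original authors recommend SGD, but in our tests SGD with K = 5 and Epoch = 50 reached an F1 score of only about 65%, while the Adam optimizer with the same settings reached about 75%.

## Q5. Compare and implement unsupervised method and supervised method

### **Rule-based Method**

This method finally scored a Mean F1-Score of 0.56875 on Kaggle, a considerable gap from the deep learning method's score (0.75625). This suggests that many other factors affect the final answer and that this selection mechanism alone does not capture them. Related studies propose other pun location frameworks worth consulting, and future work could combine different ideas to test whether better results are possible.

(The experiment without synonym substitution did even worse, with a final score of about 0.42, so the substitution step makes a sizeable difference.)

* **Possible improvement**

  It may be possible to find the different senses each word can represent and then compare those senses.

### **Deep Learning Approach**

This is a supervised learning method: the model needs ground truth to produce results. Compared with the unsupervised method, it is less convenient in terms of input, but it delivers better performance. When ground truth is available, we recommend supervised learning.

## References

1. Yu-Hsiang Huang, Hen-Hsen Huang, Hsin-Hsi Chen. "Identification of Homographic Pun Location for Pun Understanding"
2. 刁宇峰, 杨亮, 林鸿飞, 吴迪, 樊小超, 徐博, 许侃. "基于潜在语义特性的语义双关语检测及双关词定位"
3. Yanyan Zou, Wei Lu. "Joint Detection and Location of English Puns"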
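
As a supplement to the rule-based method in Q1, the selection mechanism (candidate filtering, synonym substitution, averaged sentence vectors, lowest cosine similarity) can be sketched roughly as follows. This is a minimal illustration and not the team's actual code. The WordNet lookups and GoogleNews vectors are replaced by small hand-made dictionaries with made-up 2-dimensional vectors, all function and variable names are hypothetical, and averaging the similarities over a word's synonyms is an assumption, since the report does not say how several synonyms are combined.

```python
import math
import string


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sentence_vector(tokens, vectors):
    """Average the vectors of every token that has an embedding."""
    known = [vectors[t.lower()] for t in tokens if t.lower() in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def candidate_indices(tokens, synonyms, stop_words, cutoff=0.3):
    """Drop early words, punctuation, stop words, and words with < 2 synonyms."""
    start = int(len(tokens) * cutoff)
    keep = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if i < start:
            continue
        if all(ch in string.punctuation for ch in tok) or word in stop_words:
            continue
        if len(synonyms.get(word, [])) < 2:
            continue
        keep.append(i)
    return keep


def locate_pun(tokens, synonyms, vectors, stop_words, cutoff=0.3):
    """Return the index of the candidate whose substitutions change the sentence most."""
    original = sentence_vector(tokens, vectors)
    if original is None:
        return None
    best_index, best_score = None, float("inf")
    for i in candidate_indices(tokens, synonyms, stop_words, cutoff):
        sims = []
        for syn in synonyms[tokens[i].lower()]:
            if syn not in vectors:
                continue  # skip synonyms without an embedding
            replaced = tokens[:i] + [syn] + tokens[i + 1:]
            sims.append(cosine(original, sentence_vector(replaced, vectors)))
        # Assumption: average over synonyms; the report does not specify this.
        if sims and sum(sims) / len(sims) < best_score:
            best_index, best_score = i, sum(sims) / len(sims)
    return best_index


# Toy stand-ins for WordNet and the word vectors (values are invented).
VECTORS = {w: [1.0, 0.0] for w in
           ["old", "programmers", "never", "die", "they", "just", "go", "to", "bits"]}
VECTORS.update({"move": [0.9, 0.1], "exit": [0.8, 0.2],
                "number": [0.1, 0.9], "morsel": [0.2, 0.8]})
SYNONYMS = {"die": ["decease"], "go": ["move", "exit"], "bits": ["number", "morsel"]}
STOP_WORDS = {"never", "they", "just", "to"}

tokens = "OLD PROGRAMMERS never die they just go to bits .".split()
pun_index = locate_pun(tokens, SYNONYMS, VECTORS, STOP_WORDS)
print(tokens[pun_index])  # bits
```

With these toy inputs the candidate list is `go` and `bits`, matching the worked example in Q1, and `bits` wins because its substitutes pull the averaged sentence vector further from the original. Passing the synonym and vector tables in as plain dictionaries keeps the sketch independent of any particular WordNet or word-vector library.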
