📃 Template-free Prompt Tuning for Few-shot NER

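---
tags: paper-reading
---

# 📃 Template-free Prompt Tuning for Few-shot NER

Ruotian Ma et al., Fudan University (Shanghai), NAACL 2022
https://github.com/rtmaww/EntLM/

## 0. How's it related to our project ChemiCloud

(1) Few-shot setting: supervised learning with very little data. The official ChemiCloud team may not have the capacity to annotate a large amount of data, and we may even have to annotate it ourselves (around a hundred sentences). If we do not want to set up an annotation platform, or if a large set of official annotations is not available, a solid few-shot setup matters a lot.

(2) This appears to be the first paper to study how to efficiently reformulate and train the NER task with prompt tuning / large LMs.

[2023.2.2 ChemiCloud presentation slides (four pages introduce this paper)](https://docs.google.com/presentation/d/1QwNY-2hr9QDfduiMASTltBkd_L-YcdEfDL1mmnxosWI/edit?usp=sharing)

### Reading notes

- Moving from templateNER to a template-free NER with efficient decoding is a nice step. The advantages the paper lists in its own Conclusion are:
    1. Strong few-shot performance
    2. An entity-oriented LM task that keeps the pre-training and fine-tuning objectives consistent
    3. Fast decoding
- Mining label words takes extra time and is a bit of a hassle, and all three methods (Data & LM + Virtual) have to be combined to get a significant gain. Table 2 presumably reports the best configuration (all three used). Read together with the label word selection ablation in Table 3, using only some of the selection strategies, especially without Virtual, lands at roughly the score of StructShot alone.
- Label word statistics clearly still rely on labeled data or a lexicon of decent quality. For something like ChemiCloud, where we have to define the named entities ourselves, a lexicon will be even harder to come by.
- Code:
    - It seems reasonably written. It is packaged as plain `.py` and `.sh` files, and the dataset uses the usual BIO-schema NER annotation format (https://github.com/rtmaww/EntLM/tree/main/dataset/conll/10shot).
    - The label word selection [code](https://github.com/rtmaww/EntLM/blob/main/language_model.py) is a bit complicated.

## 1. Intro

### Why has prompt-based learning become popular for few-shot classification?

(1) Re-using the masked LM objective (pre-training and fine-tuning share the same objective function).

(2) The sophisticated template and label word design helps LMs better fit the task-specific answer distributions.

A good template and carefully chosen label words help a lot when training for a task. Take sentiment classification as an example: given the prompt "I love this milk. It was [MASK].", the model predicts the probability of "great" or "terrible" at [MASK]. Here the template is "It was [MASK]." and the label words are "great" and "terrible".

### NER with prompt-based learning

The difficulty is that the search space for appropriate templates grows quickly (span-level). NER data is usually scarce because it is hard to annotate, learning from little data is already prone to overfitting, and having to try out various templates on top of that makes training hard. See Fig. 1 below.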
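To make the span-level growth concrete, here is a minimal sketch (not taken from the paper's code) that enumerates every contiguous span a template-based method would have to query. The function name `enumerate_spans` and the whitespace tokenization are assumptions made for illustration only.

```python
def enumerate_spans(tokens):
    """Return every contiguous span as (start, end, text), end exclusive."""
    spans = []
    for length in range(1, len(tokens) + 1):           # span length 1..n
        for start in range(len(tokens) - length + 1):  # sliding window
            end = start + length
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans


# Whitespace tokenization is an assumption; a real model uses subword tokens.
tokens = "Steve Jobs was born in America".split()
spans = enumerate_spans(tokens)
n = len(tokens)

# A sentence of n tokens yields n * (n + 1) / 2 spans, one template query each.
assert len(spans) == n * (n + 1) // 2
print(len(spans))  # prints 21
```

The count grows quadratically with sentence length, which is why decoding cost climbs so quickly for template-based NER.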
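The Fig. 1 example needs 21 queries. Spans of every size have to be slid over the whole sentence as a sliding window, and a sentence of $n$ tokens has $n(n+1)/2$ candidate spans, so 21 queries corresponds to $n = 6$ (6 spans of length 1, 5 spans of length 2, and so on). The longer the sentence, the longer NER training takes with this template-based prompt method. Some tricks can avoid overly long spans, but they treat the symptom rather than the cause. This method is implemented as a baseline (templateNER) in the later experiments, where the "Steve Jobs" slot has to enumerate every possible $x_{i:j}$. It does not cost too much time in the few-shot setting, but it is still fundamentally flawed and cannot be called a good method.

![](https://i.imgur.com/zLMPTyR.png)

![](https://i.imgur.com/ZaPQwVw.png =500x)

## 2. Problem Setup

Follows the few-shot setting. The training data is cut so that the number of examples per class is balanced.

## 3. Approach

The numbering below follows the paper. Unimportant sections are skipped.

### 3.2 Ent-Oriented LM fine-tuning

Design a label word mapping, for example:

```
{
    'PER': 'John',
    'LOC': 'Australia',
    'Date': 'Sunday',
    ...
}
```

Given the input sentence "Steve Jobs_PER was born in America_LOC", replace the labeled tokens through the label word mapping to get "John John_PER was born in Australia_LOC" (Steve Jobs has two tokens, so there are two Johns after conversion). Train with the former as the input $X$ and the latter as the target output $X^{Ent}$. The goal is for the model to correctly convert each entity token into the representative word of its NER class.

![](https://i.imgur.com/972kmjV.png)

$W_{lm}$ denotes the parameters of the pretrained LM, not newly introduced parameters. This training method therefore adds no new parameters, can fine-tune an already trained LM directly, and has the benefit that pre-training and fine-tuning share the same objective.

### 3.3 Label Word Engineering

How do we find a good label word mapping? How do we know that the 'PER' class should pick 'John' as its representative word?

![](https://i.imgur.com/uYMB4V1.png)

### 3.3.1 Low-resource label word selection

This step uses an additional knowledge base (KB). More specifically, it uses the KB-matching method from *BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision (2020)* for lexicon-based annotation (automatic, brute-force dictionary matching) to augment the small amount of human annotation. The best representative words of each NER class are then searched with the following (four) methods. We call the augmented annotated data $\mathcal{D}$.

#### 1) Data Search:

Pick the most frequent word of each class in $\mathcal{D}$. At use time, that word's embedding serves as the class representation. The same holds for the next two methods.

![](https://i.imgur.com/qgY4w8v.png)

#### 2) LM Search:

Feed all samples in $\mathcal{D}$ into the LM to get probability distributions. For each class, pick the word that occurs most frequently among the top k predictions.

![](https://i.imgur.com/XNGgSbR.png)

#### 3) Data & LM Search:

Multiply each word's scores from (1) and (2), then take the argmax.

![](https://i.imgur.com/rl8yMk7.png)

#### Virtual Label Words:

Any of the methods above can produce the top k label words of each NER class. Virtual label words take the element-wise average of the embeddings of these top k words and use the averaged embedding as the class's representative vector. Because it cannot be mapped back to an actual word, it is called a virtual label word. $\mathcal{V}_c$ is the representative label word set of class $c$, with size $k$. $f_{\phi}(\cdot)$ is the embedding function of the pretrained LM.

![](https://i.imgur.com/hEsorbe.png)

#### Removing conflicting label words

Some words represent many classes at once (several NER classes are represented by the same word), similar to the idea of stopwords. A threshold $Th$ is therefore set, and only the other, highly representative words are kept. The $w$ below denotes a word that is kept, i.e. one that is representative and distinctive enough for class $C$.

![](https://i.imgur.com/tGLO0i0.png)

## 4. Experiments

![](https://i.imgur.com/eSYwQxf.png)

### Settings

K-shot experiments with K = 5, 10, 20, 50. For each K, 3 training sets are sampled, and each (K, training set) pair is run with 4 different initializations to get 4 statistics, so **each K gets 12 statistics.** A special sampling algorithm (Appendix A.2) ensures that exactly $K$ entities per class are sampled for the K-shot setting (instead of K sentences).

### Results

Methods: EntLM, EntLM + Struct

:::success
The Struct algorithm: Yi Yang and Arzoo Katiyar. 2020. *Simple and effective few-shot named entity recognition with structured nearest neighbor learning.* EMNLP.
:::

![](https://i.imgur.com/2unNclN.png)

1. Performance is especially strong in 5-shot learning: EntLM + Struct beats every other method by a wide margin, with EntLM close behind.
2. Besides its inherent weakness of high decoding time, templateNER also cannot compete with EntLM on the scores.
3. Efficiency-wise, the measurements use a **Titan XP GPU with batch_size = 8 (8 different sentences)**. TemplateNER uses batch_size = 1, i.e. one sentence of 9 tokens, which already amounts to 45 sequences because all $x_{i:j}$ have to be enumerated. As the table below shows, TemplateNER inference is extremely slow.

    ![](https://i.imgur.com/JAVMmgi.png =500x)

4. For label word selection, virtual word selection (averaging real words' embeddings) almost always performs better than picking a single word as the representative. Using both Data and LM is the best selection method, Data alone comes second, and LM alone is the worst (especially when K is small).

    ![](https://i.imgur.com/q59t8Zn.png)

5. Impact of Lexicon Quality on Label Word Selection: varying the size of the lexicon (dataset) in turn affects which label words are selected. The numbers mainly show how EntLM and the other baselines behave under this change. EntLM (+ Struct) with Data&LM+Virtual turns out to stay quite resilient. The figure below shows the results on OntoNotes. However, K is not stated, so the figure cannot be compared with Table 3 above. (If 100% is the normal, untampered lexicon, so that the numbers should match Table 3, then K = 5 seems likely. Yet the 46.x% score of EntLM + Struct + Data&LM+Virtual at 100% in the figure is higher than the averaged number in Table 3?)

    ![](https://i.imgur.com/bYYfk86.png =400x)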
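Since this experiment varies the lexicon that feeds label word selection, it may help to see that pipeline (Section 3.3) as code. The following is a minimal sketch, not the repository's implementation: it combines Data search and LM search by multiplying frequencies, keeps the top k words per class, and averages their embeddings into a virtual label word. All data, counts, embeddings, and function names (`data_counts`, `top_k_label_words`, `virtual_label_word`) are made up for illustration, the LM-search counts are assumed to be precomputed, and the conflicting-word filter with threshold $Th$ is omitted.

```python
from collections import Counter, defaultdict

# Toy lexicon-annotated data D as (word, class) pairs. Illustrative values only.
annotated = [
    ("John", "PER"), ("John", "PER"), ("Mary", "PER"),
    ("Paris", "LOC"), ("Paris", "LOC"), ("Australia", "LOC"), ("John", "LOC"),
]

# Hypothetical LM-search result: how often each word appeared in the LM's
# top-k predictions at positions labeled with the given class.
lm_counts = {
    "PER": Counter({"John": 5, "Mary": 4, "he": 6}),
    "LOC": Counter({"Paris": 3, "Australia": 5, "John": 1}),
}

# Toy 2-dimensional embeddings standing in for the pretrained LM's embeddings.
embed = {
    "John": [1.0, 0.0], "Mary": [0.0, 1.0],
    "Paris": [2.0, 2.0], "Australia": [4.0, 0.0],
}


def data_counts(pairs):
    """Data search: per-class word frequencies in the annotated data."""
    counts = defaultdict(Counter)
    for word, label in pairs:
        counts[label][word] += 1
    return counts


def top_k_label_words(data, lm, k):
    """Data & LM search: rank words by data frequency times LM frequency."""
    result = {}
    for label in data:
        scores = {w: data[label][w] * lm[label][w] for w in data[label]}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[label] = ranked[:k]
    return result


def virtual_label_word(words, embeddings):
    """Element-wise average of the embeddings of the chosen label words."""
    vectors = [embeddings[w] for w in words]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


top = top_k_label_words(data_counts(annotated), lm_counts, k=2)
virtual = {label: virtual_label_word(words, embed) for label, words in top.items()}
print(top)
print(virtual)
```

Shrinking `annotated` (a smaller lexicon) changes the frequency counts and can change which words make the top k, which is the effect this experiment probes.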