MELM: Data Augmentation with Masked Entity Language Modeling for Low-Resource NER

###### tags: `nlp`

# MELM: Data Augmentation with Masked Entity Language Modeling for Low-Resource NER

[ACL 2022](https://aclanthology.org/2022.acl-long.160/)

+ Annotating NER data is costly, so a growing number of papers address low-resource NER.
+ Typical data augmentation methods are word-level modification and back-translation, but each runs into its own difficulty on NER tasks:
    + Word-level modification replaces an entity with an alternative, but the replacement may no longer match the original label.
    + Back-translation hinges on external word alignment tools for propagating the labels from the original input to the augmented text (alignment is required).
+ [An Analysis of Simple Data Augmentation for Named Entity Recognition](https://aclanthology.org/2020.coling-main.343/)
    + Proposes entity replacement, but this approach does not increase entity diversity.
+ [A Rigorous Study on Named Entity Recognition: Can Fine-tuning Pretrained Model Lead to the Promised Land?](https://aclanthology.org/2020.emnlp-main.592/)
    + Finds that augmentation on **context** gave marginal improvement on pretrained-LM-based NER models.
+ Preliminary study in this paper, Figure 1 (I don't really understand how this experiment was run):
    + **Diversifying entities** in the training data is more effective than introducing more context patterns.

> ![](https://i.imgur.com/h6dMZVK.png)

# Method

The authors want to replace entities with a language model's predictions in order to increase entity diversity.

1. Fine-tune an MLM on the task text by masking entities and having the model predict them.
    + This has a problem: as in Figure 2b, the replaced entity may fit the context, yet the label is no longer correct.
    + The paper adds a labeled sequence **linearization** strategy: label tokens of the form `<{B, I}-{ORG, LOC, ...}>` are inserted as new tokens before and after each entity, and the MLM is then trained on the result.
2. Have the MELM model predict on the linearized training samples to generate new training data, as shown in Figure 2c.

    ![](https://i.imgur.com/jTV3Bc8.png)
3. To avoid generating the same entity as the original, take the LM's top-k predictions as candidates and pick one at random.
4. Post-processing: to remove low-quality data, an NER model is trained to check the augmented data; any sample whose predictions do not match the original labels is removed.

    ![](https://i.imgur.com/AgwDnmZ.png)

## Extending to Multilingual Scenarios

+
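Going back to the Method steps above (linearization, entity masking, and top-k candidate selection), here is a minimal Python sketch of the data handling only. It is not the authors' implementation. The function names, the `<mask>` token, the `mask_prob` parameter, and the choice to use the same marker on both sides of an entity token are all assumptions made for illustration, and the candidate scores would in practice come from the fine-tuned MLM rather than a hand-written list.

```python
import random
from typing import List, Sequence, Tuple

MASK = "<mask>"  # assumed mask token; the real one depends on the MLM tokenizer


def linearize(tokens: Sequence[str], tags: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Wrap every entity token in label markers such as <B-ORG>.

    Returns the linearized sequence and the positions of the entity
    tokens inside it. Using the same marker before and after the token
    is an assumption of this sketch.
    """
    out: List[str] = []
    entity_positions: List[int] = []
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            out.append(tok)
        else:
            marker = f"<{tag}>"
            out.append(marker)
            entity_positions.append(len(out))  # index the entity token will get
            out.append(tok)
            out.append(marker)
    return out, entity_positions


def mask_entities(linearized: Sequence[str], entity_positions: Sequence[int],
                  mask_prob: float, rng: random.Random) -> List[str]:
    """Replace entity tokens with MASK; label markers and context stay visible."""
    out = list(linearized)
    for pos in entity_positions:
        if rng.random() < mask_prob:
            out[pos] = MASK
    return out


def pick_replacement(scored: Sequence[Tuple[str, float]], k: int,
                     rng: random.Random) -> str:
    """Keep the k highest-scoring candidates and pick one uniformly at random."""
    top_k = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
    return rng.choice(top_k)[0]


rng = random.Random(0)
tokens = ["EU", "rejects", "German", "call"]
tags = ["B-ORG", "O", "B-MISC", "O"]
linearized, positions = linearize(tokens, tags)
masked = mask_entities(linearized, positions, mask_prob=1.0, rng=rng)
print(" ".join(masked))
```

Because the label markers remain in the input while the entity itself is masked, the model sees the intended label when it fills the mask, which is the point of the linearization step. Picking among the top-k candidates instead of always taking the best one is what keeps the output from collapsing back to the original entity.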