Automated Concatenation of Embeddings for Structured Prediction

ACL 2021
https://aclanthology.org/2021.acl-long.206/

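Concatenates a set of embeddings and feeds them into an LSTM to classify NER labels, achieving SOTA results.
However, it relies on embeddings from a large number of models, so it feels like it mainly exists to climb the leaderboard.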

Introduction

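Word embeddings strongly affect downstream task performance: ELMo, Flair, BERT, and XLM-R have each achieved SOTA results, and concatenating embeddings can yield even better results.
Neural architecture search (NAS) is a method for automatically finding the best model architecture (by searching over combinations), and it has been applied to image classification, semantic segmentation, and object detection.
- Introduction to neural architecture search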


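Combining these ideas, the paper proposes Automated Concatenation of Embeddings (ACE) to find the best way to concatenate word embeddings. The strategy is to optimize the controller in a reinforcement learning framework, and it achieves SOTA results on 6 tasks and 21 datasets.
At each step, the controller picks a concatenation of word embeddings according to its belief model. The task model is trained with that selection, its accuracy is used as the reward, and the belief model is updated.
- Focuses on selecting which embeddings to concatenate
- Novel controller design and reward function
- Efficient and practical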

Task model


The task is: given an input $x$, output the probability of each class $y$ (essentially softmax classification).


There are two kinds of task model: one sequence-based and one graph-based.

Sequence

BiLSTM-CRF model
Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF

Graph

BiLSTM-Biaffine
Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing

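Given an input of $n$ words, the sequence of word vectors obtained after embedding with $L$ models is $V$.
$d$ is the sum of the hidden sizes of all models.
$v_{i}^{l}$ is the embedding of the $i$-th word produced by the $l$-th model, and $v_{i}$ is the concatenation of the $i$-th word's embeddings from all models.

$V = [v_{1}; \dots; v_{n}], \quad V \in \mathbb{R}^{d \times n}$
$v_{i}^{l} = \text{embed}_{i}^{l}(x); \quad v_{i} = [v_{i}^{1}; v_{i}^{2}; \dots; v_{i}^{L}]$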
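The concatenation above can be sketched with NumPy. This is only an illustration: the hidden sizes and the random matrices standing in for $\text{embed}^{l}(x)$ are assumptions, not values from the paper.

```python
import numpy as np

n = 4                      # number of tokens in the sentence
hidden_sizes = [3, 5, 2]   # hypothetical hidden sizes of L = 3 embedding models

rng = np.random.default_rng(0)
# Stand-in for embed^l(x): one (n, d_l) matrix per model, random here.
per_model = [rng.normal(size=(n, d_l)) for d_l in hidden_sizes]

# v_i = [v_i^1; ...; v_i^L]: concatenate along the feature axis for every token,
# then transpose so that V has shape (d, n), matching V in R^{d x n}.
V = np.concatenate(per_model, axis=1).T

d = sum(hidden_sizes)      # d is the sum of all hidden sizes
print(V.shape)
```

Column $i$ of `V` is $v_{i}$, and its first `hidden_sizes[0]` entries come from the first model.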

The probability of producing $y$ given the input $x$ is

$P^{seq}(y | x) = \text{BiLSTM-CRF}(V, y)$
$P^{graph}(y | x) = \text{BiLSTM-Biaffine}(V, y)$

Search Space Design

$L$ models can form $2^{L}$ possible embedding combinations.
A binary variable $a_{l}$ controls whether the $l$-th embedding is used, similar to a mask.

$v_{i} = [v_{i}^{1} a_{1}; \dots; v_{i}^{l} a_{l}; \dots; v_{i}^{L} a_{L}]$
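A minimal sketch of the masking idea, assuming all-ones matrices as placeholder embeddings; `apply_mask` is a hypothetical helper name, not something from the paper's code. Note that a disabled embedding is zeroed out rather than dropped, so the concatenated dimension stays the same.

```python
import itertools
import numpy as np

def apply_mask(per_model, a):
    """Scale each model's (n, d_l) block by a_l (0 or 1), then concatenate."""
    return np.concatenate([v * a_l for v, a_l in zip(per_model, a)], axis=1)

n, hidden_sizes = 2, [2, 3, 1]            # hypothetical sizes for L = 3 models
per_model = [np.ones((n, d_l)) for d_l in hidden_sizes]

# Keep the first and third embeddings, mask out the second.
masked = apply_mask(per_model, a=[1, 0, 1])

# Every binary vector of length L is a candidate, so there are 2^L of them.
search_space = list(itertools.product([0, 1], repeat=len(hidden_sizes)))
```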

Searching in the Space

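We use the parameters $\theta = [\theta_{1}; \theta_{2}; \dots; \theta_{L}]$ as the controller's variables.
$P^{ctrl}(a; \theta)$ is the probability of generating the selection $a$ given $\theta$.
The probability distribution over concatenations $a$ factorizes over the embeddings ($\prod$ denotes a product):

$P^{ctrl}(a; \theta) = \prod_{l = 1}^{L} P_{l}^{ctrl}(a_{l}; \theta_{l})$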

$P_{l}^{ctrl}$ is a Bernoulli distribution, where $\sigma$ is the sigmoid function:

$P_{l}^{ctrl}(a_{l}; \theta_{l}) = \sigma(\theta_{l})$ if $a_{l} = 1$, and $1 - \sigma(\theta_{l})$ if $a_{l} = 0$

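Intuition: suppose we use ELMo and BERT, with $\theta_{1}$ controlling the probability of selecting ELMo and $\theta_{2}$ controlling the probability of selecting BERT. The probability of enabling the ELMo embedding ($a_{1} = 1$) is $\sigma(\theta_{1})$, and the probability of not enabling it is $1 - \sigma(\theta_{1})$.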
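The two-embedding example can be written as a small sketch. The function names `selection_prob` and `sample_selection` are illustrative assumptions, and $\theta$ is initialised to zero here purely so that every embedding starts with probability 0.5.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selection_prob(a, theta):
    """Product over l of sigma(theta_l) if a_l = 1, else 1 - sigma(theta_l)."""
    p = sigmoid(theta)
    return float(np.prod(np.where(a == 1, p, 1.0 - p)))

def sample_selection(theta, rng):
    """Draw each a_l independently from Bernoulli(sigma(theta_l))."""
    return (rng.random(theta.shape) < sigmoid(theta)).astype(int)

theta = np.zeros(2)              # theta_1 for ELMo, theta_2 for BERT
rng = np.random.default_rng(0)
a = sample_selection(theta, rng)

# Probability of "ELMo on, BERT off" under theta = 0.
p_elmo_only = selection_prob(np.array([1, 0]), theta)
```

Because the factors are independent, the probabilities of all four selections sum to 1.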

Train loop

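During training, a mask is selected first, the task model is trained with that mask, and the accuracy $R$ measured on the development set is used as the reward.

Because $R$ is not differentiable, no gradient can be computed through it directly. Following the policy gradient method, the controller's objective is to maximize

$J(\theta) = E_{P^{ctrl}(a; \theta)}[R]$

For training efficiency, only one selection $a$ is sampled at each step to estimate the gradient, instead of trying every possible selection.

$b$ is a baseline function (usually the highest accuracy seen so far).

$\nabla_{\theta} J(\theta) \approx \sum_{l = 1}^{L} \nabla_{\theta} \log P_{l}^{ctrl}(a_{l}; \theta_{l}) (R - b)$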
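For a Bernoulli controller, the derivative of $\log P_{l}^{ctrl}$ with respect to $\theta_{l}$ works out to $a_{l} - \sigma(\theta_{l})$, so the one-sample estimate is easy to sketch. The accuracies and the step size below are made-up numbers for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy_gradient(a, theta, R, b):
    """One-sample estimate; component l is (a_l - sigma(theta_l)) * (R - b)."""
    return (a - sigmoid(theta)) * (R - b)

theta = np.zeros(3)
a = np.array([1, 0, 1])                            # sampled selection
grad = policy_gradient(a, theta, R=0.92, b=0.90)   # hypothetical accuracies
theta = theta + 0.5 * grad                         # gradient ascent, assumed step size
```

Since $R > b$ in this example, the update raises $\theta_{l}$ for the embeddings that were enabled and lowers it for the one that was masked out.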

Reward
The binary vector $|a^{t} - a^{i}|$ represents the change in embedding selection between two steps.

$a^{t}$ is the embedding selection at time step $t$.
$a^{i}$ is the embedding selection at time step $i$.
$r^{t}$ is a vector of length $L$ that holds the reward assigned to each embedding.

$r^{t} = \sum_{i = 1}^{t - 1} (R_{t} - R_{i}) |a^{t} - a^{i}|$

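Building on this, a past selection that is farther from the current one should have less influence. The formula measures that distance as the Hamming distance between the two selections, and a discount factor $\gamma \in (0, 1)$ is added:

$r^{t} = \sum_{i = 1}^{t - 1} (R_{t} - R_{i}) \gamma^{Hamm(a^{t}, a^{i}) - 1} |a^{t} - a^{i}|$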
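A small sketch of the discounted reward, under the assumption that the history is a list of `(selection, accuracy)` pairs; `embedding_rewards` is a hypothetical name and the accuracies are invented.

```python
import numpy as np

def embedding_rewards(history, a_t, R_t, gamma=0.5):
    """Sum over past steps of (R_t - R_i) * gamma^(Hamm - 1) * |a^t - a^i|."""
    r = np.zeros(len(a_t))
    for a_i, R_i in history:
        diff = np.abs(a_t - a_i)       # 1 where the two selections differ
        hamm = int(diff.sum())         # Hamming distance between them
        if hamm == 0:                  # identical selection contributes nothing
            continue
        r += (R_t - R_i) * gamma ** (hamm - 1) * diff
    return r

history = [
    (np.array([1, 1, 1]), 0.90),       # differs from a_t in one position
    (np.array([0, 1, 1]), 0.88),       # differs from a_t in two positions
]
r_t = embedding_rewards(history, a_t=np.array([1, 1, 0]), R_t=0.91)
```

The second history entry differs in two positions, so its accuracy gain of 0.03 is halved before being credited to those two embeddings.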

The gradient is then rewritten as

$\nabla_{\theta} J_{t}(\theta) \approx \sum_{l = 1}^{L} \nabla_{\theta} \log P_{l}^{ctrl}(a_{l}^{t}; \theta_{l}) r_{l}^{t}$

Training

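A dictionary $D$ stores the mapping from each concatenation to its validation score.

At $t = 1$, all embeddings are enabled.

For $t \geq 2$:

- Sample $a^{t}$
- Train the task model according to $a^{t}$ and obtain the accuracy $R_{t}$
- Add $a^{t}$ and $R_{t}$ to $D$, then set $t = t + 1$

When sampling $a^{t}$, avoid selecting $a^{t - 1}$ or the all-zero vector.
If $a^{t}$ already exists in the dictionary, keep the higher $R$.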
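The loop above can be sketched end to end. Everything concrete here is an assumption made for illustration: `toy_accuracy` replaces real task-model training with a made-up linear score, the reward is summed over the entries of $D$, and the learning rate, $\gamma$, and step count are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def toy_accuracy(a):
    """Hypothetical stand-in for training the task model and scoring on dev."""
    weights = np.array([0.03, -0.01, 0.02])   # made-up usefulness per embedding
    return 0.85 + float(weights @ a)

def search(L=3, steps=20, gamma=0.5, lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(L)
    D = {}                                    # selection (tuple) -> best accuracy
    a_prev = np.ones(L, dtype=int)            # t = 1: all embeddings enabled
    D[tuple(a_prev)] = toy_accuracy(a_prev)
    for _ in range(2, steps + 1):
        # Resample until a^t is neither all-zero nor equal to a^{t-1}.
        while True:
            a_t = (rng.random(L) < sigmoid(theta)).astype(int)
            if a_t.any() and not np.array_equal(a_t, a_prev):
                break
        R_t = toy_accuracy(a_t)
        # Per-embedding reward r^t, computed against the selections stored in D.
        r_t = np.zeros(L)
        for a_i, R_i in D.items():
            diff = np.abs(a_t - np.array(a_i))
            hamm = int(diff.sum())
            if hamm > 0:
                r_t += (R_t - R_i) * gamma ** (hamm - 1) * diff
        theta += lr * (a_t - sigmoid(theta)) * r_t             # ascent on J_t
        D[tuple(a_t)] = max(R_t, D.get(tuple(a_t), -np.inf))   # keep the higher R
        a_prev = a_t
    return max(D, key=D.get), D

best, D = search()
```

`best` is the selection with the highest stored accuracy. With a real task model the score for the same selection can vary between runs, which is why the dictionary keeps the higher value.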

Embeddings

The paper uses a very large set of embeddings:

- ELMo
- Flair
- base BERT
- GloVe word embeddings
- fastText word embeddings
- non-contextual character embeddings (Lample et al., 2016)
- multilingual Flair (M-Flair)
- M-BERT
- XLM-R embeddings

Result
