DeBERTa (Decoding-enhanced BERT with disentangled attention)

###### tags: `Paper Notes`

* Paper: [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654)
* Affiliation: Microsoft Dynamics 365 AI, Microsoft Research
* Year: 2020

### Introduction
* As the name suggests, DeBERTa (Decoding-enhanced BERT with Disentangled Attention) sets out to improve the BERT architecture. It rests on two key ideas:
    * disentangled attention
    * enhanced mask decoder
* Besides reaching SOTA on a range of NLP tasks, DeBERTa-1.5B also surpasses the human baseline on SuperGLUE.

### Disentangled Attention
* Unlike BERT, DeBERTa represents every token in the sequence with two vectors, $H_i$ and $\{P_{i|j}\}$.
    * $i$: position of the token in the sequence
    * $H_i$: content
    * $P_{i|j}$: relative position between token $i$ and token $j$
* The attention score between tokens $i$ and $j$ can therefore be decomposed into four terms (content-to-content, content-to-position, position-to-content, position-to-position):
$$
\begin{aligned}
A_{i, j} &= \{ H_i, P_{i|j} \} \times \{ H_j, P_{j|i} \}^{T} \\
&= H_i H_{j}^{T} + H_i P_{j|i}^{T} + P_{i|j} H_{j}^{T} + P_{i|j} P_{j|i}^{T}
\end{aligned}
$$
* The attention weight between two tokens depends on both their content and their relative position, so neither the content-to-position term nor the position-to-content term can be dropped. The position-to-position term ($P_{i|j} P_{j|i}^{T}$), on the other hand, adds little extra information when relative positions are used, so it is omitted.
* For reference, standard self-attention can be written as:
$$
\begin{aligned}
Q &= H W_q,\ K = H W_k,\ V = H W_v,\ A = \frac{QK^T}{\sqrt{d}} \\
H_o &= \mathrm{softmax}(A)V
\end{aligned}
$$
    * $H, H_o \in R^{N \times d}$
    * $W_q, W_k, W_v \in R^{d \times d}$
    * $N$: input sequence length
    * $d$: dimension of hidden state
* Let $k$ be the maximum relative distance. The relative distance matrix $\delta$ is defined as:
$$
\delta(i, j) =
\begin{cases}
0 & \text{for } i - j \le -k \\
2k - 1 & \text{for } i - j \ge k \\
i - j + k & \text{otherwise}
\end{cases}
$$
    * $\delta(i, j) \in [0, 2k)$
* Putting it all together, attention in DeBERTa can be written as:
$$
\begin{aligned}
Q_c &= HW_{q, c},\ K_c = HW_{k, c},\ V_c = HW_{v, c} \\
Q_r &= PW_{q, r},\ K_r = PW_{k, r} \\
\tilde{A}_{i, j} &= Q_i^c {K_j^c}^{T} + Q_i^c {K_{\delta(i,j)}^r}^{T} + K_j^c {Q_{\delta(j,i)}^r}^T \\
H_o &= \mathrm{softmax}\left(\frac{\tilde{A}}{\sqrt{3d}}\right)V_c
\end{aligned}
$$
    * $P \in R^{2k \times d}$: relative position embeddings, one row per value of $\delta$
    * In short, the score combines three kinds of information: (a) content-to-content, (b) content-to-position, and (c) position-to-content.
    * Note that the last term of $\tilde{A}_{i, j}$ uses $\delta(j, i)$ rather than $\delta(i, j)$, because it scores the key content at $j$ against the query position at $i$.

### Enhanced Mask Decoder
* In masked language modeling, consider the sentence "A new **store** opened near the new **mall**." with "store" and "mall" masked. Both masked words follow "new" at the same relative position, so relative position alone is not enough to tell them apart. Absolute position information therefore has to be added as well.
* The authors inject absolute position information at the end of the decoder, right before the softmax layer.
* For natural language generation (NLG) tasks, the authors mask out the upper triangular part of the self-attention matrix, that is, set it to $-\infty$, so that the model is auto-regressive.

### Experiments & Results
* DeBERTa uses the same parameter configuration as BERT. Comparisons with other SOTA models are shown in Table 1 through Table 4.

<center><img src ="https://i.imgur.com/hqRzdQY.png" width=600></center>
<center><img src ="https://i.imgur.com/Q9pio9J.png" width=600></center>
<center><img src ="https://i.imgur.com/y1NRoov.png" width=600></center>
<center><img src ="https://i.imgur.com/VqHrvbx.png" width=600></center>

* The authors also ran an ablation study; the results are shown in Table 5.

<center><img src ="https://i.imgur.com/byKaUfY.png" width=600></center>

* Beyond the base and large models, the authors also trained a version with 1.5B parameters. On SuperGLUE it beats T5, which has 11B parameters, and is the first model to exceed the human baseline (89.9 vs. 89.8), as shown in Table 6.

<center><img src ="https://i.imgur.com/ln9twPh.png" width=600></center>
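
To make the relative distance matrix $\delta$ from the Disentangled Attention section concrete, here is a minimal Python sketch. It assumes the middle case is $i - j + k$, which is the only offset that keeps every value inside $[0, 2k)$. The function name `relative_distance` and the toy sizes are illustrative choices, not taken from the paper's code.

```python
def relative_distance(i: int, j: int, k: int) -> int:
    """Bucketed relative distance delta(i, j); always falls in [0, 2k)."""
    diff = i - j
    if diff <= -k:          # token j is far to the right of token i
        return 0
    if diff >= k:           # token j is far to the left of token i
        return 2 * k - 1
    return diff + k         # nearby tokens keep their exact signed offset


# Toy example (assumed sizes): sequence length 5, maximum relative distance k = 2.
k = 2
matrix = [[relative_distance(i, j, k) for j in range(5)] for i in range(5)]
for row in matrix:
    print(row)
```

Each row of `matrix` holds the indices that a query at position $i$ would use to look up relative position embeddings, so distant tokens on either side share the two boundary buckets $0$ and $2k - 1$.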
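
The disentangled attention formula above can be sketched in NumPy as follows. This is a single-head illustration under stated assumptions, not the authors' implementation: the weight dictionary `W`, its key names, the random initialization, and the toy sizes are all hypothetical, and multi-head splitting, dropout, and padding masks are left out.

```python
import numpy as np


def relative_index(n: int, k: int) -> np.ndarray:
    """delta[i, j] = clip(i - j + k, 0, 2k - 1), shape (n, n)."""
    pos = np.arange(n)
    return np.clip(pos[:, None] - pos[None, :] + k, 0, 2 * k - 1)


def disentangled_scores(H, P, W, k):
    """Unnormalized scores: content-to-content + content-to-position + position-to-content."""
    Qc, Kc = H @ W["q_c"], H @ W["k_c"]      # content projections, shape (N, d)
    Qr, Kr = P @ W["q_r"], P @ W["k_r"]      # relative position projections, shape (2k, d)
    delta = relative_index(H.shape[0], k)
    c2c = Qc @ Kc.T
    # c2p[i, j] = Qc[i] . Kr[delta(i, j)]
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)
    # p2c[i, j] = Kc[j] . Qr[delta(j, i)]; gather along rows of j, then transpose
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T
    return c2c + c2p + p2c


def disentangled_attention(H, P, W, k):
    """Scale by sqrt(3d), apply a row-wise softmax, and mix the content values."""
    d = H.shape[1]
    A = disentangled_scores(H, P, W, k) / np.sqrt(3 * d)
    A = A - A.max(axis=1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(A)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ (H @ W["v_c"])


# Toy inputs (assumed sizes): N = 6 tokens, hidden size d = 8, max distance k = 3.
rng = np.random.default_rng(0)
N, d, k = 6, 8, 3
H = rng.normal(size=(N, d))
P = rng.normal(size=(2 * k, d))
names = ("q_c", "k_c", "v_c", "q_r", "k_r")
W = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in names}
H_o = disentangled_attention(H, P, W, k)
print(H_o.shape)
```

The transpose in the `p2c` line is where the $\delta(j, i)$ index shows up: the gather is done per key position $j$, and the result is flipped so that rows again correspond to query positions $i$.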
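
The upper triangular masking described for NLG in the Enhanced Mask Decoder section can be illustrated with a short NumPy sketch. The helper name `causal_softmax` and the all-zero toy scores are assumptions made for the example; the sketch only shows the masking step, not the rest of the decoder.

```python
import numpy as np


def causal_softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax after setting the strict upper triangle to -inf."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)  # the diagonal is never masked
    weights = np.exp(masked)                             # exp(-inf) evaluates to 0
    return weights / weights.sum(axis=1, keepdims=True)


# Toy example (assumed input): uniform scores over 4 positions.
scores = np.zeros((4, 4))
weights = causal_softmax(scores)
print(weights)
```

With uniform scores, row $i$ spreads its weight evenly over positions $0$ to $i$ and gives zero weight to every later position, which is the auto-regressive constraint.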