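###### tags: `Paper Notes`

# Transformer

* Paper: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
* Affiliations: Google Brain, Google Research, University of Toronto
* Year: 2017

### Introduction

* Most current sequence transduction models are based on RNNs or CNNs. RNNs have two drawbacks: (1) they cannot be parallelized across time steps, and (2) they handle long input sequences poorly. Both follow from the fact that the hidden state at each time step, $h_t$, is computed from the previous hidden state, $h_{t-1}$.
* The Transformer relies entirely on attention and contains no RNN or CNN components.

### Model Architecture

<center><img src="https://i.imgur.com/jXS0DgX.png" width=350></center>

<center><img src="https://i.imgur.com/9ZFSGOw.png" width=600></center>

* As Figure 1 shows, the Transformer has two parts: an encoder and a decoder.
* The encoder and decoder inputs first pass through a linear transformation (the input embedding and the output embedding) before entering the model.
    * The input embedding, the output embedding, and the final linear layer before the softmax are the same linear transformation. That is, all three share one weight matrix.
* Both the encoder and the decoder are built from multi-head attention, feed-forward layers, residual connections (Add), and layer normalization (Norm).
* Hyperparameters:
    * The encoder and the decoder each stack $N = 6$ layers.
    * There are $h = 8$ heads in total.
    * The input and output dimension of every encoder and decoder layer is $d_{model} = 512$.
* **encoder** (Figure 1, left):
    * Each input embedding is added to its positional encoding to give an input vector $x \in \mathbb{R}^{d_{model}}$.
    * In self-attention, the queries, keys, and values ($Q$, $K$, $V$) all come from the same input vectors.
    * For each $head_i$, $Q$, $K$, and $V$ are linearly projected by $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, where $d_k = d_v = \frac{d_{model}}{h} = 64$. Scaled dot-product attention is then applied to the projections.

    $$
    \begin{aligned}
    head_i &= Attention(Q W_i^Q, K W_i^K, V W_i^V) \\
    Attention(Q, K, V) &= softmax\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
    \end{aligned}
    $$

    * Scaled dot-product attention is illustrated in Figure 2 (left).
    * Finally, all $head_i$ are concatenated and passed through one more linear transformation. The result is the output of multi-head attention.

    $$
    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    $$

    * The remaining steps are, in order: residual connection, layer normalization, feed forward, residual connection, layer normalization. The result is the encoder output.
    * The feed-forward block consists of two linear transformations with a ReLU activation in between.

    $$
    FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
    $$

    * The first layer maps $x$ to 2048 dimensions.
    * The second layer maps it back to 512 dimensions.
* **decoder** (Figure 1, right):
    * The decoder architecture is similar to the encoder's.
    * The decoder's own input is the output embedding (the target sequence shifted right). The output of the encoder stack enters through the decoder's middle attention sub-layer.
    * In that middle multi-head attention sub-layer, the queries come from the previous decoder sub-layer, while the keys and values come from the encoder output.
    * To stay auto-regressive (AR), the decoder's self-attention masks out every input vector at a position greater than the current one, so that its attention weight is 0. This is done by setting the corresponding scores to $-\infty$ before the softmax, as shown in Figure 2 (left).
* **positional encoding**:

    $$
    \begin{aligned}
    PE_{(pos, 2i)} &= sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \\
    PE_{(pos, 2i + 1)} &= cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
    \end{aligned}
    $$

    * $pos$ is the position in the sequence.
    * $i$ indexes a pair of dimensions ($0 \le i < \frac{d_{model}}{2}$). Dimensions $2i$ and $2i + 1$ share the same frequency.
    * The input / output embeddings are added to the positional encoding before being fed into the model.
    * Instead of using the closed-form formula, the positional encoding can also be learned. The authors report nearly identical results for the two, and chose the sinusoidal version because it may extrapolate to sequence lengths longer than those seen during training.

### References

* [Hung-yi Lee (李宏毅) - Transformer](https://www.youtube.com/watch?v=ugWDIIOHtPA&t=1525s&ab_channel=Hung-yiLee)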
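
### Code Sketch

The positional encoding and scaled dot-product attention formulas in the Model Architecture section can be sketched in plain Python. This is only a minimal illustration, not the paper's implementation: it uses a single head, omits the learned projections $W_i^Q$, $W_i^K$, $W_i^V$, and represents matrices as lists of row vectors. The function names (`positional_encoding`, `softmax`, `attention`) and the `causal` flag are assumptions made for this example.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal table: row `pos` is the encoding vector for that position."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i + 1 share the frequency 1 / 10000^(2i / d_model).
            two_i = k - (k % 2)
            angle = pos / (10000 ** (two_i / d_model))
            # Even dimensions use sin, odd dimensions use cos.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list; -inf scores get weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors. With causal=True, the query at
    position t only attends to keys at positions <= t (decoder masking).
    Returns the output rows and the attention weights.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for t, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > t:
                scores.append(float("-inf"))  # masked before the softmax
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output row is the weighted sum of the value rows.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy usage: 4 positions with d_model = 8, self-attention over the encodings.
x = positional_encoding(4, 8)
out, weights = attention(x, x, x, causal=True)
```

With `causal=True`, every weight above the diagonal of `weights` is zero, which is the masking behaviour described for the decoder. Multi-head attention would run `attention` once per head on separately projected inputs and concatenate the results.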