{%hackmd SybccZ6XD %}

# Attention Is All You Need (Transformer)

###### tags: `paper`

## Abstract

- Before: complex recurrent or convolutional neural networks
- Proposed architecture: the Transformer, based solely on **attention mechanisms**

## Introduction

### Before (sequence models)

- Recurrent neural networks
  ![](https://i.imgur.com/R30KRRp.png)
  source: https://gotensor.com/2019/02/28/recurrent-neural-networks-remembering-whats-important/
- Long short-term memory
  ![](https://i.imgur.com/cIKJNE4.png)
  source: https://towardsdatascience.com/lstm-recurrent-neural-networks-how-to-teach-a-network-to-remember-the-past-55e54c2ff22e
- Gated recurrent units
  ![](https://i.imgur.com/d5MkjnG.png)
  source: https://towardsdatascience.com/gru-recurrent-neural-networks-a-smart-way-to-predict-sequences-in-python-80864e4fe9f6

### Computation problem

- Factorization tricks [21]: computational efficiency
- Conditional computation [32]: computational efficiency and model performance

### Attention mechanisms

- Model dependencies without regard to their distance in the input or output sequences
- Usually used in conjunction with a recurrent network

:::warning
Supplement (attention)
https://blog.csdn.net/Enjoy_endless/article/details/88679989
- Human attention: we focus on the things we consider important
  ![](https://i.imgur.com/EjT2rZQ.png)
- In an arbitrary formulation, giving each output its own appropriate context vector $C_i$ makes the model more attentive
  ![](https://i.imgur.com/LEpTSYQ.png)
  $y_1 = f_1(C_1)$
  $y_2 = f_1(C_2, y_1)$
  $y_3 = f_1(C_3, y_2)$
:::

:::warning
Supplement (self-attention)
Viewed on images, the receptive field is wider than a CNN's: a CNN only looks at one kernel at a time, whereas after the encoder we can tell which parts are important.
![](https://i.imgur.com/18Bucep.png)
:::

### Transformer

- Eschews recurrence
- Relies entirely on an attention mechanism to draw global dependencies between input and output

## Background

- Before (Extended Neural GPU [16], ByteNet [18] and ConvS2S [9]): the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions
  - Disadvantage: it is difficult to learn dependencies between distant positions
- Transformer: a constant number of operations
- The Transformer does not use any convolution or RNN

## Model Architecture

- Encoder input: $x = (x_1,...,x_n)$
- Encoder output: $z = (z_1,...,z_n)$
- Decoder output: $y = (y_1,...,y_m)$

### Encoder and Decoder Stacks

:::warning
Supplement (residual connection)
![](https://i.imgur.com/yoU98W1.png)
source: arXiv:1512.03385
:::

:::warning
Supplement (batch normalization)
![](https://i.imgur.com/aL7LAhu.png)
![](https://i.imgur.com/GmzV9zn.png)
:::

==Encoder==: N = 6, produces outputs of dimension $d_{model} = 512$
- Add: residual connection
- Norm: layer normalization

![](https://i.imgur.com/gdx02XH.png)

Another interpretation:
![](https://i.imgur.com/vf2SMpO.png)
source: https://jalammar.github.io/illustrated-transformer/

My interpretation:
paper: In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
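A minimal numpy sketch of that cross-attention flow, with random weights and illustrative names (`enc_output`, `dec_hidden` are not from the paper; $d_k = 64$ is); it only shows where Q, K and V come from and how the shapes combine:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_k = 512, 64
n_enc, n_dec = 10, 7                          # source / target lengths (arbitrary)
enc_output = np.random.randn(n_enc, d_model)  # z_1..z_n from the encoder stack
dec_hidden = np.random.randn(n_dec, d_model)  # output of the previous decoder sub-layer

W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))  # learned projections (random here)

Q = dec_hidden @ W_q          # queries: from the decoder
K = enc_output @ W_k          # keys:    from the encoder output
V = enc_output @ W_v          # values:  from the encoder output

scores = Q @ K.T / np.sqrt(d_k)   # (n_dec, n_enc) attention scores
out = softmax(scores) @ V         # (n_dec, d_k): each target position mixes encoder positions
```

Note that the weight shapes depend only on $d_{model}$ and $d_k$, never on the sequence lengths.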
![](https://i.imgur.com/ndpIzUs.png)

==Decoder==: N = 6
![](https://i.imgur.com/ZSHLmkc.png)

### Attention

#### Scaled Dot-Product Attention

- Two kinds of attention functions
  - Additive attention
  - Dot-product attention: faster and more space-efficient
- Why we need $\frac{1}{\sqrt{d_k}}$
  - For large values of $d_k$, the dot products grow large in magnitude and push the softmax into regions with extremely small gradients, hurting performance
- $Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$

![](https://i.imgur.com/eQyE4Hd.png)

#### Multi-Head Attention

$h = 8, d_k = d_v = d_{model}/h = 64$
- It is beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions
- Concat: the $h$ heads are concatenated back into a single matrix
- $MultiHead(Q, K, V) = Concat(head_1,...,head_h)W^O$
  - where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$

![](https://i.imgur.com/LKdbhE4.png)

#### Applications of Attention in our Model

- In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
  ![](https://i.imgur.com/xRFuZ8R.png)
- Each position in the encoder can attend to all positions in the previous layer of the encoder.
  ![](https://i.imgur.com/yVbgWGZ.png)
- Each position in the decoder can attend to all positions in the decoder up to and including that position.
  ![](https://i.imgur.com/2aJVl9M.png)

### Position-wise Feed-Forward Networks

`Dense(dff, activation='relu')`

Fully connected: $FFN(x) = max(0, xW_1+b_1)W_2+b_2 = ReLU(xW_1+b_1)W_2+b_2$

### Embeddings and Softmax

### Positional Encoding

![](https://i.imgur.com/toEQiNX.png)

Because using binary values would waste space, we can use their continuous float counterparts:
- Sinusoidal functions
  ![](https://i.imgur.com/1zNa21g.png)
  source: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

## Why Self-Attention

- Total computational complexity per layer
- The amount of computation that can be parallelized
- Path length between long-range dependencies

## Training

## Results

## Question

- How does the Transformer handle variable-length sequences? The sequence length is independent of the weight dimensions.
  The figure below only tracks how the dimensions change; softmax does not change dimensions.
  ![](https://i.imgur.com/I8bFJto.png)
- How are the decoder and the encoder connected?
  ![](https://i.imgur.com/wbjm7hE.png)
  ![](https://i.imgur.com/vf2SMpO.png)
  source: https://jalammar.github.io/illustrated-transformer/
  paper: In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
  ![](https://i.imgur.com/ndpIzUs.png)
  paper: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers.
  [Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding](https://arxiv.org/abs/2005.08081)
  ![](https://i.imgur.com/AIWlz6l.png)
  ![](https://i.imgur.com/nv0uzR0.png)
- "We share the same weight matrix between the two embedding layers and the pre-softmax linear transformation." Why can they be shared? It is related to NLP subword units.
  ![](https://i.imgur.com/gy2vRkI.png)
- Why do we need the FFN when self-attention already ends with a linear transformation? The ReLU nonlinearity improves performance (see the sketch after this list).
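As a follow-up to the last question, here is a minimal numpy sketch of the position-wise FFN (random weights for illustration; $d_{ff} = 2048$ is the inner dimension from the paper). The ReLU between the two linear layers supplies the nonlinearity that self-attention's linear projections lack, and the weight shapes are independent of the sequence length:

```python
import numpy as np

d_model, d_ff = 512, 2048                          # inner dimension d_ff = 2048 from the paper

W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def ffn(x):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2    # max(0, .) is the ReLU

out = ffn(np.random.randn(9, d_model))             # any seq_len works -> shape (9, d_model)
```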