# Transformer
## I. Introduction
**Transformer**, first introduced in the paper [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf), is a network architecture based solely on attention mechanisms. Since its publication, the idea has been widely applied to various sequence tasks and has surpassed old-fashioned RNN models. Thanks to its success in NLP, the transformer was also applied to vision tasks ([Vision Transformer](https://arxiv.org/abs/2010.11929)) and achieved results comparable to traditional CNN models.
In this blog, I will summarize the transformer and its mechanisms.
## II. Attention mechanisms
The idea behind attention is to leverage the most relevant parts of the input sequence through a weighted combination of all of the input vectors. In an attention layer, we normally have three components as inputs:
1. Query vector: **q** (shape: $D_{Q}$) is a feature vector that tells us which parts we should pay attention to.
2. Input vectors: **X** (shape: $N_{X} \times D_{X}$)
3. Similarity function: $f_{att}$, which might simply be the *dot product*.

<center>Fig 1. The encoder-decoder model with additive attention mechanism</center>
<center>src: lilianweng.github.io</center>
In [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf), the authors used a **feed-forward network** with a single hidden layer as the similarity function $f_{att}$. However, it is much more efficient, and works just as well, to replace $f_{att}$ with a simple dot product. Moreover, the **scaled dot product** is commonly used to compute the similarity score, dividing the score by $\sqrt{D_{Q}}$. The similarity score then becomes:
$e_{i} = q \cdot X_{i}/\sqrt{D_{Q}}$
The reason we do this is that the similarity scores are passed through a softmax to calculate the attention weights, and large similarity scores would cause the softmax to saturate and give vanishing gradients (you can refer to this [article](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)).
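To see the effect of scaling, here is a quick, self-contained illustration (the numbers are made up for demonstration and are not from the paper):

```python
import torch

# Unscaled dot products grow with the dimension D_Q, so the softmax saturates:
# almost all of the probability mass lands on one entry and gradients vanish.
scores = torch.tensor([10.0, 20.0, 30.0])           # large, unscaled similarities
print(torch.softmax(scores, dim=0))                  # ~[0.0000, 0.0000, 1.0000]

d_q = 64
print(torch.softmax(scores / d_q ** 0.5, dim=0))     # ~[0.060, 0.209, 0.731], much softer
```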
Now, we want to generalize to multiple query vectors so that everything can be computed at once. The inputs become:
1. Query vectors: $Q$ (shape $N_{Q} \times D_{Q}$), a set of feature vectors that tell us which parts we should pay attention to.
2. Input vectors: $X$ (shape $N_{X} \times D_{X}$)
Our computation would be:
1. Similarities: $E = QX^{T}/\sqrt{D_{Q}}$ (shape $N_{Q} \times N_{X}$), $E_{i,j} = Q_{i} \cdot X_{j}/\sqrt{D_{Q}}$
2. Attention weights: $A = softmax(E, dim=1)$ (shape $N_{Q} \times N_{X}$)
3. Output vectors: $Y = AX$ (shape $N_{Q} \times D_{X}$), $Y_{i} = \sum_{j}A_{i,j}X_{j}$
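As a sanity check on the shapes, here is a minimal sketch of these three steps. Note that in this simplified version $D_{X}$ must equal $D_{Q}$, since we take dot products between $Q$ and $X$ directly:

```python
import torch
import torch.nn.functional as F

# A minimal sketch of the computation above (no learned weights yet):
# the queries attend directly over the raw input vectors X.
N_Q, N_X, D_Q = 3, 5, 8                  # here D_X = D_Q so that Q @ X^T is defined
Q = torch.randn(N_Q, D_Q)
X = torch.randn(N_X, D_Q)

E = Q @ X.T / D_Q ** 0.5                 # similarities, shape (N_Q, N_X)
A = F.softmax(E, dim=1)                  # attention weights, each row sums to 1
Y = A @ X                                # outputs, shape (N_Q, D_X)
print(Y.shape)                           # torch.Size([3, 8])
```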
Up to this point, we have used the input $X$ in two different ways. First, we used it to compute the similarity scores $E$. Second, we used it again to compute the output vectors $Y$ after applying the softmax to $E$, and that might limit the model's ability to learn. To give the model more flexibility in how it attends, we can split $X$ into two different matrices. How do we do that? We transform $X$ by multiplying it by two different learnable matrices and use the results separately. Now, the inputs change to:
1. Query vectors: $Q$ (shape $N_{Q} \times D_{Q}$), a set of feature vectors that tell us which parts we should pay attention to.
2. Input vectors: $X$ (shape $N_{X} \times D_{X}$)
3. Key matrix: $W_{K}$ (shape $D_{X} \times D_{Q}$)
4. Value matrix: $W_{V}$ (shape $D_{X} \times D_{V}$)
And the computation would be:
1. Key vectors: $K = XW_{K}$ (shape $N_{X} \times D_{Q}$)
2. Value vectors: $V = XW_{V}$ (shape $N_{X} \times D_{V}$)
3. Similarities: $E = QK^{T}/\sqrt{D_{Q}}$ (shape $N_{Q} \times N_{X}$), $E_{i,j} = Q_{i} \cdot K_{j}/\sqrt{D_{Q}}$
4. Attention weights: $A = softmax(E, dim=1)$ (shape $N_{Q} \times N_{X}$)
5. Output vectors: $Y = AV$ (shape $N_{Q} \times D_{V}$) $Y_{i} = \sum_{j}A_{i,j}V_{j}$
The computation is represented in this graph:

<center>Fig 2. Attention mechanisms and how it works. Taken from this lecture https://web.eecs.umich.edu/~justincj/slides/eecs498/498_FA2019_lecture13.pdf</center>
This is a very general attention layer that can be inserted anywhere in your model. Any time you have two sets of data, one treated as queries and one treated as inputs, you can apply attention as a layer in your neural network. You can also change how the similarity function is computed (refer to [Attention? Attention!](https://lilianweng.github.io/posts/2018-06-24-attention/)).
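Putting the five steps together, a minimal PyTorch sketch of this general attention layer might look like this (the class and argument names are my own, not from the paper):

```python
import torch
from torch import nn

class AttentionLayer(nn.Module):
    """Sketch of the general attention layer described above.

    The queries Q come from somewhere else in the network; the keys and
    values are computed from the inputs X with learnable matrices W_K, W_V.
    """
    def __init__(self, d_x: int, d_q: int, d_v: int):
        super().__init__()
        self.w_k = nn.Linear(d_x, d_q, bias=False)    # W_K: (D_X, D_Q)
        self.w_v = nn.Linear(d_x, d_v, bias=False)    # W_V: (D_X, D_V)
        self.scale = d_q ** 0.5

    def forward(self, q: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        k = self.w_k(x)                               # (N_X, D_Q)
        v = self.w_v(x)                               # (N_X, D_V)
        e = q @ k.transpose(-2, -1) / self.scale      # similarities, (N_Q, N_X)
        a = e.softmax(dim=-1)                         # attention weights
        return a @ v                                  # outputs, (N_Q, D_V)

# Example: 4 queries attending over 6 input vectors.
layer = AttentionLayer(d_x=32, d_q=16, d_v=24)
y = layer(torch.randn(4, 16), torch.randn(6, 32))
print(y.shape)                                        # torch.Size([4, 24])
```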
## III. Self-attention layer and Transformer blocks
### 1. Self-attention
Moving forward, one special case of attention is the **self-attention layer**, which is the main ingredient of the transformer. The self-attention layer slightly modifies the general attention mechanism we discussed above. The idea is very simple: instead of taking the query vectors $Q$ from the hidden states of an RNN, we transform the input vectors $X$ by multiplying them by a learnable matrix, exactly as we did for the key vectors $K$ and value vectors $V$. In effect, each input vector is compared to itself and to the other vectors around it. The computation graph is then:

<center>Fig 3. Self-attention layer.</center>
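In code, the only change compared with the general layer above is a third learnable matrix that produces the queries from $X$ itself. A minimal single-head sketch (class and variable names are my own):

```python
import torch
from torch import nn

class SelfAttention(nn.Module):
    """Sketch of a single-head self-attention layer: Q, K and V are all
    computed from the same inputs X with learnable matrices."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        a = (q @ k.transpose(-2, -1) / self.scale).softmax(dim=-1)
        return a @ v                                   # one output per input vector

x = torch.randn(5, 64)                                 # 5 input vectors of dimension 64
print(SelfAttention(64)(x).shape)                      # torch.Size([5, 64])
```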
Now, an interesting property of the self-attention layer is that it is **permutation equivariant**. This means that when you permute the input vectors, the output vectors stay the same, just permuted accordingly. In other words, the layer does not care about the order of the inputs; it simply operates on a set of vectors.

<center>Fig 4. Permutation equivariance. When you change the order of the inputs X, the values of Q, K, V and the outputs Y are permuted accordingly.</center>
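This property is easy to verify empirically. A small check with PyTorch's built-in attention module (my own example, not from the paper):

```python
import torch
from torch import nn

# Quick empirical check of permutation equivariance of self-attention
# (no positional information is involved here).
torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=1, batch_first=True).eval()

x = torch.randn(1, 5, 16)                 # a set of 5 input vectors
perm = torch.randperm(5)

y, _ = attn(x, x, x)                      # self-attention: query = key = value = x
y_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

print(torch.allclose(y[:, perm], y_perm, atol=1e-5))  # True: outputs are permuted too
```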
However, sometimes we do want to let the model know the order of the input vectors. Therefore, the authors add position information to the input. One option is to use a **positional encoding** (refer to the original [paper](https://arxiv.org/pdf/1706.03762.pdf)).
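For reference, a minimal sketch of the fixed sinusoidal positional encoding used in the original paper (assuming an even embedding dimension; the function name is my own):

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/dim)), PE(pos, 2i+1) = cos(...).
    Assumes dim is even."""
    pos = torch.arange(seq_len).unsqueeze(1).float()       # (seq_len, 1)
    two_i = torch.arange(0, dim, 2).float()                # even dimension indices
    angles = pos / (10000 ** (two_i / dim))                # (seq_len, dim/2)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

x = torch.randn(10, 64)                                    # 10 input vectors of dimension 64
x = x + sinusoidal_positional_encoding(10, 64)             # inject order information
```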
### 2. Multihead self-attention
Another common variant of the self-attention layer is **multihead self-attention**. In multihead self-attention, we split the input into equal chunks and feed them into independent self-attention layers (heads). In the end, we concatenate the heads and combine them with a final weight matrix. We can express it with this formula:
$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_{1}, \mathrm{head}_{2}, ..., \mathrm{head}_{h})W^{O}$
where $\mathrm{head}_{i} = \mathrm{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})$
Now, each head is responsible for one chunk of the input, so the *input embedding dimension* must be divisible by the *number of heads*.

<center>Fig 5. Multihead self-attention</center>
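In practice you rarely implement this by hand; for example, PyTorch's `nn.MultiheadAttention` handles the splitting, the per-head attention, the concatenation, and the final projection $W^{O}$ for you (sizes below are arbitrary):

```python
import torch
from torch import nn

embed_dim, num_heads = 64, 8           # embed_dim must be divisible by num_heads
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)      # (batch, sequence length, embedding dim)
y, weights = attn(x, x, x)             # self-attention: query = key = value = x
print(y.shape)                         # torch.Size([2, 10, 64])

# nn.MultiheadAttention(64, 7) would raise an error: 64 is not divisible by 7.
```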
### 3. Transformer
Next, we are going to talk about the whole transformer block, now that we have the multihead self-attention layer. In the original [paper](https://arxiv.org/pdf/1706.03762.pdf), the transformer was used for sequence-to-sequence tasks, which is why the architecture contains two parts: an encoder and a decoder. However, a transformer can be applied anywhere in your network by stacking multiple transformer blocks (similar to the encoder part only).

<center>Fig 6. Transformer block.</center>
The transformer block takes a set of vectors $x$ as input and outputs a set of vectors $y$. The only interaction between vectors happens inside the self-attention layer, because the layer normalization and the MLP operate independently on each vector. Therefore, the block is highly scalable and highly parallelizable. Now, we can stack $N$ identical transformer blocks to deepen our network.
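A minimal sketch of such a block, roughly following the layout of the original paper (residual connections and layer normalization around self-attention and a per-vector MLP); the class name and hyperparameters are my own:

```python
import torch
from torch import nn

class TransformerBlock(nn.Module):
    """Sketch of one encoder-style transformer block."""
    def __init__(self, dim: int, num_heads: int, mlp_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU(), nn.Linear(mlp_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.attn(x, x, x)[0])   # self-attention + residual + LayerNorm
        x = self.norm2(x + self.mlp(x))             # per-vector MLP + residual + LayerNorm
        return x

# Stack N identical blocks to deepen the network.
blocks = nn.Sequential(*[TransformerBlock(dim=64, num_heads=8, mlp_dim=256) for _ in range(6)])
y = blocks(torch.randn(2, 10, 64))                  # (batch, sequence, dim)
print(y.shape)                                      # torch.Size([2, 10, 64])
```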
## IV. Vision transformer
The transformer is extremely successful in NLP (surpassing traditional RNN models), but its application in computer vision was still limited. In 2020, [Alexey Dosovitskiy et al.](https://openreview.net/pdf?id=YicbFdNTTy) showed that a transformer applied directly to sequences of image patches can perform very well on image classification tasks without the help of a CNN. Since then, the Vision Transformer (ViT) has emerged as a new solution for vision tasks beside traditional CNNs.

<center>Fig 7. Vision transformer overview.</center>
The idea is very simple: the authors split the image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence to a standard transformer encoder. They prepend an extra learnable 'classification token' to the sequence to perform classification. The detailed techniques are exactly the same as in the standard transformer discussed above.
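A minimal sketch of that input pipeline (sizes follow the common ViT-Base/16 configuration; the variable names and the patchify code are my own, not from the paper):

```python
import torch
from torch import nn

# Sketch of the ViT input pipeline: split the image into patches, linearly
# embed them, prepend a learnable [class] token and add position embeddings.
# Sizes are illustrative: a 224x224 image, 16x16 patches, embedding dim 768.
img = torch.randn(1, 3, 224, 224)
patch_size, dim = 16, 768
num_patches = (224 // patch_size) ** 2                          # 196 patches

# Patchify and flatten: (1, 3, 224, 224) -> (1, 196, 3*16*16)
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)

embed = nn.Linear(3 * patch_size * patch_size, dim)             # linear patch embedding
cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable position embeddings

tokens = torch.cat([cls_token, embed(patches)], dim=1) + pos_embed
print(tokens.shape)                                             # torch.Size([1, 197, 768])
# `tokens` is then fed to a standard transformer encoder; the final [class]
# token representation is used for classification.
```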
## V. References
1. [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)
2. [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://openreview.net/pdf?id=YicbFdNTTy)
3. [Lecture 13: Attention by Justin Johnson](https://web.eecs.umich.edu/~justincj/slides/eecs498/498_FA2019_lecture13.pdf)
4. [Tutorial 6: Transformers and Multi-Head Attention](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html)
5. [Tutorial 15: Vision Transformers](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial15/Vision_Transformer.html)
6. [Attention? Attention!](https://lilianweng.github.io/posts/2018-06-24-attention/)
7. [CS480/680 Lecture 19: Attention and Transformer Networks](https://www.youtube.com/watch?v=OyFJWRnt_AY&t=95s)