# Notes on "Attention is All You Need" #### Author: [Sharath Chandra](https://sharathraparthy.github.io/) ## [Paper link](https://arxiv.org/pdf/1706.03762.pdf) This paper proposes a new architecture which leverages the successes of attention mechanism (self-attention) and completely removed the necessity of recurrence. The proposed model allows data parallelization unlike the RNNs where the inputs are expected in sequential manner. The model architecture heavily relies on self-attention block which computes/learns how much attention score should be given to each word relative to other words in a particular sequence. ## Model Architecture The architecture is shown below. ![](https://i.imgur.com/1AXyUDf.png) If we consider the entire transformer as a black box, the on the high level this model takes, for example, an input sentence and translates it. Now, if we unpack this black box, we will see two important blocks: 1. Encoder block 2. Decoder block These encoder and decoder blocks takes in vectors as an input and spits out the output (in terms of the probabilities). Note that the sentences are not fed into these block as is and these are pre-processed and converted into a vector representation. Since there is not recurrence in the whole architecture, this input embedding is passed through a special type of encoding called positional encoding which accounts for the order of the words in the sentence. Now the output of this positional encoding is what we use as an input to the encoder/decoder. ### Encoder block: In the paper, the author uses six encoder blocks which are stacked upon top of each other. The input to the encoder goes through a series of computations through all these blocks. Each block share common structure in terms of computations but there is no weight sharing. Inside each encoder block, there is a multi-headed attention block and a position-wise fully connected feed forward network. #### Attention block ![](https://i.imgur.com/ufz8tnD.png) As mentioned earlier, transformers architecture uses self-attention mechanism which allows to each of the input word to look at the other positions of the input sequence and form a better encoding of the word. In RNNs we update the hidden state at each timestep which carries the information of the past sequence. Here, we are relying upon the attention mechanism to do that. ##### Self attention using keys, values and queries. The formula for calculating the attention score is the following: $$ Attention(Q, K, V) = Softmax(\frac{QK^T}{\sqrt{d_k}})V $$ where $Q, K, V, d_k$ are queries, keys, values and the length of any vector $q/k/v$respectively. The factor $QK^T$ is scaled down by the factor of $\sqrt{d_k}$ to ensure the training stability. The way the query, key and value vectors are calculated is by employing corresponding learnable weight matrices $W^q, W^k$ and $^v$. These take the encoders input ($e_i$) and are linearly transform the vector to obtain corresponding vectors. $$ Q = W^q e_i\\ K = W^k e_i\\ V = W^v e_i\\ $$ So, for a stack of six encoders, we will have $6 \times 3 = 18$ learnable weight matrices. The paper employs a multi-headed attention rather than just using one head. This expands the representation capacity and the models ability to focus on different positions in the sentence. The attention is computed in parallel which outputs $d_v$ dimensional vector per head. All the heads are concatenated and linearly projected resulting in final values. 
After this step, the output is passed through a feed-forward network with two layers and a ReLU activation. In addition, the authors employ a residual connection around each sub-layer, followed by layer normalization.

### Decoder block:

The decoder block has a similar architecture with one extra sub-block: the encoder-decoder attention block. It takes the keys and values from the output of the encoder stack and uses them, together with queries from the decoder, to compute attention. This is the crucial block that helps the decoder focus or "attend" on the appropriate places in the input sequence. One more difference is in the multi-headed self-attention block: the decoder is only allowed to attend to the words it has previously seen, not to all the words in the output sentence. This is done by masking out the positions that come after the current word (a small sketch of this masking is included at the end of these notes). The output of the decoder, after all the computations, is passed through a linear layer followed by a softmax, which gives the output probabilities.

This is the overall architecture of the transformer model. The training details and the results are skipped here; please refer to the paper for more details.
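As promised above, here is a small sketch of the causal ("look-ahead") masking used in the decoder's self-attention. The function name and toy shapes are illustrative choices on my part; the idea is simply that scores for future positions are set to $-\infty$ before the softmax, so they receive zero attention weight.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    # Decoder-style self-attention: position i may only attend to positions <= i.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Strictly upper-triangular mask: future positions get -inf,
    # so the softmax assigns them zero weight.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy check: with 3 positions, the first output row depends only on position 0,
# the second row on positions 0-1, and the third row on all three positions.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(masked_self_attention(Q, K, V))
```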