# **Attention**

In this note, I will describe the concept of *attention* in deep learning. Attention is a technique that helps a model focus on specific parts of its input data during processing. It is particularly useful when the input is a long sequence, such as a sentence or a document. Attention has been applied in many fields, including natural language processing, image recognition, and machine translation. I will first describe how attention works and then walk through an example.

# Main idea

The core idea behind attention is that the model weights some parts of the input sequence more heavily than others when making a prediction, rather than treating the whole sequence uniformly. In other words, it pays attention to the input words that matter most for the current output. In an encoder-decoder model, attention acts as a connection between the encoder and the decoder, giving the decoder access to information from every encoder hidden state. This lets the model selectively focus on the relevant parts of the input and learn the associations between input and output, which is especially helpful for long input sentences.

![](https://i.imgur.com/5DdDMJu.png)

source: https://lilianweng.github.io/posts/2018-06-24-attention/

# Example

In general, attention can be thought of as a weighted sum of the input features, where the weights reflect the importance of each feature. These weights are computed by the model and can be produced in a variety of ways. One common formulation is *dot-product attention* (additive attention follows the same overall pattern, but scores the query-key match with a small feed-forward network instead of a dot product). It calculates the weights as follows:

For each input feature, a "key" and a "value" are computed. The keys are used to determine the importance of each feature, while the values are used to form the weighted sum. A "query" vector is also computed; it represents the current state of the model and indicates which parts of the input should be focused on. The query is then compared against each key using a similarity function, such as the dot product or cosine similarity, producing a set of "attention scores" that measure how well each key matches the query. The attention scores are normalized with a softmax function so that the resulting weights sum to 1. Finally, the weighted sum is obtained by multiplying each value by its weight and adding the results together. This weighted sum is the output of the attention mechanism.
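To make these steps concrete, here is a minimal sketch of single-query dot-product attention in NumPy. The function name `dot_product_attention` and the toy shapes are my own choices for illustration; in practice the keys, values, and query would come from learned projections of the encoder hidden states and the decoder state.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(query, keys, values):
    """Single-query dot-product attention.

    query:  (d,)      current decoder state
    keys:   (n, d)    one key per input feature
    values: (n, d_v)  one value per input feature
    """
    # Similarity between the query and every key.
    scores = keys @ query                # shape (n,)
    # Normalize the scores into weights that sum to 1.
    weights = softmax(scores)            # shape (n,)
    # Weighted sum of the values is the attention output.
    output = weights @ values            # shape (d_v,)
    return output, weights

# Toy example: 4 input positions, 8-dimensional keys, queries, and values.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
values = rng.normal(size=(4, 8))
query = rng.normal(size=(8,))

output, weights = dot_product_attention(query, keys, values)
print("attention weights:", weights)     # 4 weights summing to 1
print("output shape:", output.shape)     # (8,)
```

As a side note, Transformer-style scaled dot-product attention additionally divides the scores by the square root of the key dimension before the softmax, which keeps the scores in a reasonable range when the dimension is large.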