# What is "attention" and how does it work? Attention is a mechanism in machine learning model that is useful in tasks such as translation and summarization, where the input sequences can be long and complex. Imagine you're playing a game where you have to find hidden objects in a room. You can only look at one thing at a time, so you have to decide which thing to look at first. That's kind of like what a computer does when it's trying to understand something we've written or said. It looks at one word at a time, and it has to decide which word to pay attention to first. Sometimes, there are lots of words, and it can be hard to know which one to look at first. That's where attention comes in. Attention helps the computer figure out which words are most important, so it knows which ones to look at first. It does this by giving each word a score, kind of like a number between 0 and 1. The higher the score, the more important the word is. These scores are called weights. To calculate the attention weights, the model often uses a dot product attention mechanism. In this mechanism, the attention weights are calculated as the dot product between the query vector and the key vector for each element in the input sequence. The query vector is typically a weighted sum of the hidden states produced by the model's encoder, while the key vector is a weighted sum of the hidden states produced by the model's decoder.The dot products between the query and key vectors are then normalized using a softmax function, which converts them into probabilities that sum to 1. These probabilities represent the attention weights, which indicate how much attention the model should pay to each element in the input sequence. The computer, therefore, looks at the words with the highest scores first, and then works its way down to the words with lower scores. There are three commonly used attention classes: ## 1 - Luong Dot Attention Class It is a way for the computer to decide which words to pay attention to when trying to understand something we've written or said. It works by comparing each word to a special word that the computer is interested in. The more similar the word is to the special word, the more attention it gets. ## 2 - Bahdanau Attention Class It is another way for the computer to decide which words to pay attention to. It works by looking at the words one at a time, and deciding which ones are most important based on what the computer already knows about the thing we've written or said. ## 3 - Luong Multiplicative Attention Class It is a combination of the Luong Dot and Bahdanau attention classes. It uses both of these methods to decide which words to pay attention to, which can make it even better at understanding what we've written or said.