# Attention Mechanics - An Overview

#### By Peter Brosten

## Addressing a lack of focus

**Attention mechanics** were first introduced to solve a particular problem that arose in **Sequence-to-Sequence** models. These models are designed to take in one sequence and return another; _generative translation_ models are one example. **Seq2Seq** models function in two main steps.

1. _The encoding step_, in which an **encoder** takes the input sequence and constructs a new representation of it that captures the relevant information. This new vector representation is called the **context**.
2. _The decoding step_, in which a **decoder** takes the context vector and generates an output sequence based on the encoded information.

We have now described the procedure of Sequence-to-Sequence models but not the need for Attention. The need stems from the fact that the **context vector** must be of fixed size. This makes it a **bottleneck** that prevents the model from generating high-quality outputs for long sequential inputs. This is the problem Attention models sought to address.

## How to pay Attention?

As humans, when we "pay attention" to something, say a falling apple, we filter out other, less important inputs in favor of increased concentration on the falling apple and the factors most important to following its descent. This is the driving force behind the Attention mechanics introduced by Bahdanau et al. (2014) and Luong et al. (2015). They wanted an architecture that could focus on the most important or relevant features.

This was done by allowing the **encoder** to pass more than just the final hidden state to the decoder. In fact, **they allow the transfer of ALL the hidden states**. The **decoder** then takes the set of **hidden states** and computes what is called an **Attention score** for each of them.

The **Luong dot** function is the simplest of the score functions. Notice that it has no trainable parameters!

$$ \operatorname{score_{LuongDot}}(h_{t},\bar{h}_{s}) = h_{t}^{T}\bar{h}_{s} $$

The **Luong multiplicative** and **Bahdanau** functions require more computation. Specifically, Luong multiplicative requires one weight matrix $W$, and Bahdanau requires two weight matrices $W_{1}, W_{2}$ and a learned vector $v_{a}$. Their explicit formulations are as follows.

$$ \operatorname{score_{LuongMult}}(h_{t},\bar{h}_{s}) = h_{t}^{T}W\bar{h}_{s}$$

$$ \operatorname{score_{Bahd}}(h_{t},\bar{h}_{s}) = v_{a}^{T}\tanh(W_{1}h_{t} + W_{2}\bar{h}_{s})$$

Once a scoring function is chosen and applied, the scores are used to suppress the unimportant hidden states and amplify the important ones. This is done by multiplying each of the original hidden states by the $\operatorname{softmax}$ of its corresponding score, where the softmax is taken over all source positions $s$. These normalized coefficients are called **Attention weights**.

$$ \alpha_{ts} = \operatorname{softmax}(\operatorname{score}(h_{t},\bar{h}_{s})) = \frac{\exp(\operatorname{score}(h_{t},\bar{h}_{s}))}{\sum_{s'} \exp(\operatorname{score}(h_{t},\bar{h}_{s'}))}$$

All the Attention weights are finally used to construct a new **context vector**, which acts as a representation of the weighted encoder hidden states.

$$ c_{t} = \sum_{s} \alpha_{ts}\bar{h}_{s}$$
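
To make the recipe concrete, here is a minimal NumPy sketch of the Luong dot variant: it scores every encoder hidden state against the current decoder hidden state, normalizes the scores with a softmax, and forms the context vector as the weighted sum. The `luong_dot_attention` helper, the toy shapes, and the random inputs are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def luong_dot_attention(decoder_state, encoder_states):
    """Luong dot attention (illustrative sketch).

    decoder_state:  shape (d,)   -- current decoder hidden state h_t
    encoder_states: shape (S, d) -- all encoder hidden states h_bar_s

    Returns the context vector c_t and the attention weights alpha_ts.
    """
    # score_s = h_t^T h_bar_s for every source position s
    scores = encoder_states @ decoder_state      # shape (S,)
    # alpha_ts = softmax over all source positions
    weights = softmax(scores)                    # shape (S,)
    # c_t = sum_s alpha_ts * h_bar_s
    context = weights @ encoder_states           # shape (d,)
    return context, weights

# Toy example: 4 source positions, hidden size 3 (random values for illustration)
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 3))
decoder_state = rng.normal(size=(3,))
context, weights = luong_dot_attention(decoder_state, encoder_states)
print("attention weights:", weights)   # non-negative, sums to 1
print("context vector:", context)
```

Swapping in the multiplicative or Bahdanau score would only change the `scores = ...` line; the softmax normalization and the weighted sum stay the same.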