# Seq2Seq with Attention
## Brief Outline
* Previously, we took the final encoder state of the entire input sentence and fed that same vector to the decoder at every time step.
* However, the decoder doesn't need the entire encoder state at every time step, since the word produced at a given step typically depends on only a part of the input sentence.
* Forcing a single fixed-length vector to carry the whole sentence would also overload the decoder.
* Can we instead feed the decoder a weighted sum of the encoder states at each time step, where the weights tell us which encoder states are important?
* The answer is attention.
## Attention
* To enable attention, we define a function
$$e_{jt} = f_{ATT}(s_{t-1}, h_j)$$
* This quantity captures the importance of the $j^{th}$ input word for decoding the $t^{th}$ output word.
* Since the weights assigned to the input words must sum to one, we normalize the scores $e_{jt}$ with the softmax function (over an input sentence of length $M$).
$$\alpha_{jt} = \frac{\exp(e_{jt})}{\sum_{k=1}^M \exp(e_{kt})}$$
* One of many possible choices of $f_{ATT}$ is
$$f_{ATT}(s_{t-1}, h_j) = V^T_{att} \tanh(U_{att}s_{t-1} + W_{att} h_j)$$
* Where
* $h_j \in ℝ^{d_1×1}$
* $s_t \in ℝ^{d_2×1}$
* And
* $V_{att} \in ℝ^{d_1×1}$
* $U_{att} \in ℝ^{d_1×d_2}$
* $W_{att} \in ℝ^{d_1×d_1}$
* Clearly, $e_{jt}$ (and hence $\alpha_{jt}$) is a scalar: one importance score per input word. A small sketch follows this list.
* These parameters will be learned along with the other parameters of the encoder and decoder.
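Below is a minimal NumPy sketch of this scoring function, using the dimensions above. The parameter values ($V_{att}$, $U_{att}$, $W_{att}$) and states are random stand-ins, not learned weights; names and sizes are illustrative only.

```python
import numpy as np

d1, d2, M = 4, 3, 5                    # encoder dim, decoder dim, input length

rng = np.random.default_rng(0)
V_att = rng.normal(size=(d1, 1))       # d1 x 1
U_att = rng.normal(size=(d1, d2))      # d1 x d2
W_att = rng.normal(size=(d1, d1))      # d1 x d1

s_prev = rng.normal(size=(d2, 1))      # s_{t-1}, previous decoder state
H = rng.normal(size=(M, d1, 1))        # h_1 ... h_M, encoder states

# e_{jt} = V^T tanh(U s_{t-1} + W h_j): one scalar score per input word
e = np.array([(V_att.T @ np.tanh(U_att @ s_prev + W_att @ H[j])).item()
              for j in range(M)])

# softmax over j so the weights alpha_{jt} sum to one
alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()
print(alpha, alpha.sum())              # the M weights sum to 1.0
```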
## Architecture
![](https://i.imgur.com/HLWgUrE.png =300x450)
## Forward propagation
### Encoder
* $x_j = \text{embedding of the } j^{th} \text{ input word} \in ℝ^{e_1×1}$
* $h_j = RNN(h_{j-1}, x_j) \in ℝ^{d_1×1}$
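
As a concrete illustration, here is a sketch of one encoder step, assuming a plain tanh RNN cell; `W_x`, `W_h`, and `b` are hypothetical parameter names with random stand-in values.

```python
import numpy as np

e1, d1 = 6, 4                          # embedding dim, encoder hidden dim
rng = np.random.default_rng(1)
W_x = rng.normal(size=(d1, e1))        # input-to-hidden weights (stand-in)
W_h = rng.normal(size=(d1, d1))        # hidden-to-hidden weights (stand-in)
b = np.zeros((d1, 1))

def rnn_step(h_prev, x_j):
    """h_j = RNN(h_{j-1}, x_j): one encoder step with a tanh cell."""
    return np.tanh(W_x @ x_j + W_h @ h_prev + b)

x_j = rng.normal(size=(e1, 1))         # word embedding, e1 x 1
h_prev = np.zeros((d1, 1))             # initial hidden state
h_j = rnn_step(h_prev, x_j)            # d1 x 1
```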
### Attention
* $e_{jt} = V^T_{att} \text{ tanh}(U_{att}s_{t-1} + W_{att} h_j)$
* $\alpha_{jt} = softmax(e_{jt})$, normalized over the $M$ input words
* $c_t = \sum_{j=1}^M \alpha_{jt} h_j$
* This context vector $c_t$ is the attention-weighted summary of the encoder states; it is passed to the decoder at each time step $t$ to compute the decoder hidden state $s_t$ (see the sketch below)
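
A short sketch of the context-vector computation, with stand-in weights and encoder states (any weights that sum to one would do in place of the Dirichlet sample):

```python
import numpy as np

d1, M = 4, 5
rng = np.random.default_rng(2)
alpha = rng.dirichlet(np.ones(M))      # stand-in attention weights, sum to 1
H = rng.normal(size=(M, d1, 1))        # encoder states h_1 ... h_M

# c_t = sum_j alpha_{jt} h_j : collapse the M encoder states into one vector
c_t = np.tensordot(alpha, H, axes=(0, 0))   # d1 x 1
assert c_t.shape == (d1, 1)
```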
### Decoder
* $s_t = RNN(s_{t-1}, c_t)$
* $l_t = softmax(Vs_t + b)$
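
Finally, a sketch of one decoder step under this section's formulation $s_t = RNN(s_{t-1}, c_t)$, again assuming a tanh cell; `W_s`, `W_c`, `V`, and `b_out` are hypothetical stand-in parameters, and `vocab` is an assumed output vocabulary size.

```python
import numpy as np

d1, d2, vocab = 4, 3, 10
rng = np.random.default_rng(3)
W_s = rng.normal(size=(d2, d2))        # maps s_{t-1} into the cell
W_c = rng.normal(size=(d2, d1))        # maps the context c_t into the cell
V = rng.normal(size=(vocab, d2))       # output projection
b_out = np.zeros((vocab, 1))

s_prev = np.zeros((d2, 1))             # s_{t-1}
c_t = rng.normal(size=(d1, 1))         # context from the attention step

s_t = np.tanh(W_s @ s_prev + W_c @ c_t)        # s_t = RNN(s_{t-1}, c_t)
logits = V @ s_t + b_out
l_t = np.exp(logits) / np.exp(logits).sum()    # softmax over the vocabulary
print(l_t.sum())                               # probabilities sum to 1.0
```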