# Seq2Seq with Attention

## Brief Outline

* Previously, we encoded the entire input sentence into a single encoder state and fed that same state to the decoder at every time step.
* However, the decoder does not need the entire encoder state at every time step, since the output word at a given step typically depends on only a few input words.
* Cramming the whole sentence into one fixed vector also overloads the decoder.
* Can we instead feed the decoder a weighted sum of the encoder states at each time step, where the weights tell us which encoder states are important?
* The answer is attention.

## Attention

* To enable attention, we define a function $$e_{jt} = f_{ATT}(s_{t-1}, h_j)$$
* This quantity captures the importance of the $j^{th}$ input word for decoding the $t^{th}$ output word.
* Since we want these importance weights to sum to one, we normalize them with the softmax function. $$\alpha_{jt} = \frac{\exp(e_{jt})}{\sum_{k=1}^M \exp(e_{kt})}$$
* One of many possible choices for $f_{ATT}$ is $$f_{ATT}(s_{t-1}, h_j) = V^T_{att} \tanh(U_{att}s_{t-1} + W_{att} h_j)$$
* Where
    * $h_j \in ℝ^{d_1×1}$
    * $s_t \in ℝ^{d_2×1}$
* And
    * $V_{att} \in ℝ^{d_1×1}$
    * $U_{att} \in ℝ^{d_1×d_2}$
    * $W_{att} \in ℝ^{d_1×d_1}$
* Clearly, $e_{jt}$ (and hence $\alpha_{jt}$) is a scalar.
* These parameters are learned along with the other parameters of the encoder and decoder.

## Architecture

![](https://i.imgur.com/HLWgUrE.png =300x450)

## Forward propagation

### Encoder

* $x_j = \text{word embedding of the } j^{th} \text{ input word} \in ℝ^{e_1×1}$
* $h_j = RNN(h_{j-1}, x_j) \in ℝ^{d_1×1}$

### Attention

* $e_{jt} = V^T_{att} \tanh(U_{att}s_{t-1} + W_{att} h_j)$
* $\alpha_{jt} = \text{softmax}_j(e_{jt})$
* $c_t = \sum_{j=1}^M \alpha_{jt} h_j$
* This $c_t$ is the context vector (a weighted sum of the encoder states) that is passed to the decoder at time step $t$ to compute the decoder hidden state $s_t$ (see the code sketch after the decoder equations below).

### Decoder

* $s_t = RNN(s_{t-1}, c_t)$
* $l_t = softmax(Vs_t + b)$
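
The notes do not fix a particular RNN cell or parameterization, so the following is only a minimal NumPy sketch of the forward pass above, assuming a vanilla $\tanh$ RNN for both encoder and decoder, random stand-in word embeddings, and illustrative sizes ($M = 5$ input words, a 20-word output vocabulary). All variable names (`W_xh`, `W_cs`, `V_out`, etc.) are assumptions made for the example, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
e1, d1, d2, M, vocab = 8, 16, 16, 5, 20   # embedding, encoder, decoder dims; input length; vocab size

# Encoder RNN parameters: h_j = tanh(W_xh x_j + W_hh h_{j-1})
W_xh = rng.normal(scale=0.1, size=(d1, e1))
W_hh = rng.normal(scale=0.1, size=(d1, d1))

# Attention parameters: e_jt = V_att^T tanh(U_att s_{t-1} + W_att h_j)
U_att = rng.normal(scale=0.1, size=(d1, d2))
W_att = rng.normal(scale=0.1, size=(d1, d1))
V_att = rng.normal(scale=0.1, size=(d1, 1))

# Decoder RNN parameters: s_t = tanh(W_cs c_t + W_ss s_{t-1})
W_cs = rng.normal(scale=0.1, size=(d2, d1))
W_ss = rng.normal(scale=0.1, size=(d2, d2))

# Output layer: l_t = softmax(V s_t + b)
V_out = rng.normal(scale=0.1, size=(vocab, d2))
b_out = np.zeros((vocab, 1))

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

# ----- Encoder: run the RNN over M stand-in word embeddings -----
x = rng.normal(size=(M, e1, 1))
h = np.zeros((M, d1, 1))
h_prev = np.zeros((d1, 1))
for j in range(M):
    h[j] = np.tanh(W_xh @ x[j] + W_hh @ h_prev)
    h_prev = h[j]

# ----- Decoder with attention: a few time steps -----
s_prev = np.zeros((d2, 1))
for t in range(3):
    # e_jt for every input position j  (shape: M x 1)
    e = np.array([(V_att.T @ np.tanh(U_att @ s_prev + W_att @ h[j])).item()
                  for j in range(M)]).reshape(M, 1)
    alpha = softmax(e)                            # attention weights, sum to 1 over j
    c_t = np.sum(alpha[:, :, None] * h, axis=0)   # context vector, shape (d1, 1)

    s_t = np.tanh(W_cs @ c_t + W_ss @ s_prev)     # decoder hidden state
    l_t = softmax(V_out @ s_t + b_out)            # distribution over output vocabulary
    print(f"t={t}: attention weights = {alpha.ravel().round(3)}")
    s_prev = s_t
```

Note that, to mirror the equations above, the decoder here conditions only on $c_t$ and $s_{t-1}$; in a full system the decoder input would usually also include the embedding of the previously generated word.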