# Seq2Seq with Attention

## Brief Outline

* Previously, we encoded the entire input sentence into a single encoder state and fed that same state to the decoder at every time step.
* However, the decoder does not need the entire encoder state at every time step, since the output word at a given step typically depends on only a few input words.
* Cramming the whole sentence into one fixed vector also overloads the decoder.
* Can we instead feed the decoder a weighted sum of the encoder states at each time step, where the weights tell us which encoder states are important?
* The answer is attention.

## Attention

* To enable attention, we define a function $$e_{jt} = f_{ATT}(s_{t-1}, h_j)$$
* This quantity captures the importance of the $j^{th}$ input word for decoding the $t^{th}$ output word.
* Since we want these importance weights to sum to one, we normalize them with the softmax function. $$\alpha_{jt} = \frac{\exp(e_{jt})}{\sum_{k=1}^M \exp(e_{kt})}$$
* One of many possible choices for $f_{ATT}$ is $$f_{ATT}(s_{t-1}, h_j) = V^T_{att} \tanh(U_{att}s_{t-1} + W_{att} h_j)$$
* Where
    * $h_j \in ℝ^{d_1×1}$
    * $s_t \in ℝ^{d_2×1}$
* And
    * $V_{att} \in ℝ^{d_1×1}$
    * $U_{att} \in ℝ^{d_1×d_2}$
    * $W_{att} \in ℝ^{d_1×d_1}$
* Clearly, $e_{jt}$ (and hence $\alpha_{jt}$) is a scalar.
* These parameters are learned along with the other parameters of the encoder and decoder.

## Architecture

![](https://i.imgur.com/HLWgUrE.png =300x450)

## Forward propagation

### Encoder

* $x_j = \text{word embedding of the } j^{th} \text{ input word} \in ℝ^{e_1×1}$
* $h_j = RNN(h_{j-1}, x_j) \in ℝ^{d_1×1}$

### Attention

* $e_{jt} = V^T_{att} \tanh(U_{att}s_{t-1} + W_{att} h_j)$
* $\alpha_{jt} = \text{softmax}_j(e_{jt})$
* $c_t = \sum_{j=1}^M \alpha_{jt} h_j$
* This $c_t$ is the context vector (a weighted sum of the encoder states) that is passed to the decoder at time step $t$ to compute the decoder hidden state $s_t$ (see the code sketch after the decoder equations below).

### Decoder

* $s_t = RNN(s_{t-1}, c_t)$
* $l_t = softmax(Vs_t + b)$
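
The notes do not fix a particular RNN cell or parameterization, so the following is only a minimal NumPy sketch of the forward pass above, assuming a vanilla $\tanh$ RNN for both encoder and decoder, random stand-in word embeddings, and illustrative sizes ($M = 5$ input words, a 20-word output vocabulary). All variable names (`W_xh`, `W_cs`, `V_out`, etc.) are assumptions made for the example, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
e1, d1, d2, M, vocab = 8, 16, 16, 5, 20   # embedding, encoder, decoder dims; input length; vocab size

# Encoder RNN parameters: h_j = tanh(W_xh x_j + W_hh h_{j-1})
W_xh = rng.normal(scale=0.1, size=(d1, e1))
W_hh = rng.normal(scale=0.1, size=(d1, d1))

# Attention parameters: e_jt = V_att^T tanh(U_att s_{t-1} + W_att h_j)
U_att = rng.normal(scale=0.1, size=(d1, d2))
W_att = rng.normal(scale=0.1, size=(d1, d1))
V_att = rng.normal(scale=0.1, size=(d1, 1))

# Decoder RNN parameters: s_t = tanh(W_cs c_t + W_ss s_{t-1})
W_cs = rng.normal(scale=0.1, size=(d2, d1))
W_ss = rng.normal(scale=0.1, size=(d2, d2))

# Output layer: l_t = softmax(V s_t + b)
V_out = rng.normal(scale=0.1, size=(vocab, d2))
b_out = np.zeros((vocab, 1))

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

# ----- Encoder: run the RNN over M stand-in word embeddings -----
x = rng.normal(size=(M, e1, 1))
h = np.zeros((M, d1, 1))
h_prev = np.zeros((d1, 1))
for j in range(M):
    h[j] = np.tanh(W_xh @ x[j] + W_hh @ h_prev)
    h_prev = h[j]

# ----- Decoder with attention: a few time steps -----
s_prev = np.zeros((d2, 1))
for t in range(3):
    # e_jt for every input position j  (shape: M x 1)
    e = np.array([(V_att.T @ np.tanh(U_att @ s_prev + W_att @ h[j])).item()
                  for j in range(M)]).reshape(M, 1)
    alpha = softmax(e)                            # attention weights, sum to 1 over j
    c_t = np.sum(alpha[:, :, None] * h, axis=0)   # context vector, shape (d1, 1)

    s_t = np.tanh(W_cs @ c_t + W_ss @ s_prev)     # decoder hidden state
    l_t = softmax(V_out @ s_t + b_out)            # distribution over output vocabulary
    print(f"t={t}: attention weights = {alpha.ravel().round(3)}")
    s_prev = s_t
```

Note that, to mirror the equations above, the decoder here conditions only on $c_t$ and $s_{t-1}$; in a full system the decoder input would usually also include the embedding of the previously generated word.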