LSTM and GRU

Brief Outline

The state of an RNN records information from all previous time steps.
At each new time step, the old information gets morphed by the current input.
One could imagine that after many time steps, the information stored at an earlier time step has been morphed so much that the original information can no longer be extracted.

To get more flexibility and a better modelling choice, we start from the plain RNN state update and make it selective:

$c_{t} = σ (W c_{t - 1} + U x_{t} + b)$
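
The effect described above can be checked numerically. The following is a minimal sketch, assuming NumPy and hand-picked illustrative weights (the sizes `n`, `d` and the choice `W = 0.5 * I` are assumptions, not part of the notes). Two very different starting states are pushed through the plain update with the same inputs, and the difference between them is measured after 50 steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(c_prev, x, W, U, b):
    """Plain recurrent update: c_t = sigmoid(W c_{t-1} + U x_t + b)."""
    return sigmoid(W @ c_prev + U @ x + b)

rng = np.random.default_rng(0)
n, d = 4, 3                      # state size and input size (arbitrary)
W = 0.5 * np.eye(n)              # illustrative weights, chosen by hand
U = rng.normal(size=(n, d))
b = np.zeros(n)

# Two very different starting states, fed exactly the same inputs.
c_a = np.zeros(n)
c_b = np.ones(n)
for _ in range(50):
    x = rng.normal(size=d)
    c_a = rnn_step(c_a, x, W, U, b)
    c_b = rnn_step(c_b, x, W, U, b)

# How much of the initial difference is still visible in the state.
gap = np.max(np.abs(c_a - c_b))
print(gap)
```

With these weights each step shrinks the difference by at least a factor of 8 (the sigmoid's slope is at most 0.25 and the recurrent weight is 0.5), so after 50 steps the two states are indistinguishable up to floating-point noise: the starting state can no longer be recovered.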

We don't want to write the whole of $c_{t - 1}$ into $c_{t}$; we just want to write selected portions of it.
We introduce a vector $o_{t - 1}$ which decides what fraction of each element should be passed, i.e. Selective Write = $c_{t - 1} ⊙ o_{t - 1}$.
But how does the RNN know what the values of $o_{t - 1}$ should be? We introduce parameters $W_{o}$, $U_{o}$ and $b_{o}$, which are learned like the other parameters.
We compute $o_{t - 1}$ and $h_{t - 1}$ as
- $o_{t - 1} = σ (W_{o} h_{t - 2} + U_{o} x_{t - 1} + b_{o})$
- $h_{t - 1} = c_{t - 1} ⊙ o_{t - 1}$

$o_{t}$ is known as the output gate.
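
As an illustration of the selective write, here is a minimal NumPy sketch of the two equations above. The function name `selective_write`, the sizes, and the random parameter values are assumptions made for the example; only the two formulas come from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_write(c_prev, h_prev2, x_prev, W_o, U_o, b_o):
    """Return (o_{t-1}, h_{t-1}) for state c_{t-1}, given h_{t-2} and x_{t-1}."""
    o = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)  # gate values lie in (0, 1)
    h = c_prev * o                                   # elementwise product
    return o, h

rng = np.random.default_rng(0)
n, d = 4, 3                                          # arbitrary sizes
W_o = rng.normal(size=(n, n))
U_o = rng.normal(size=(n, d))
b_o = np.zeros(n)

c_prev = rng.uniform(size=n)                         # stands in for c_{t-1}
h_prev2 = rng.uniform(size=n)                        # stands in for h_{t-2}
x_prev = rng.normal(size=d)                          # stands in for x_{t-1}

o, h = selective_write(c_prev, h_prev2, x_prev, W_o, U_o, b_o)
```

Because every entry of `o` is strictly between 0 and 1, each entry of `h` is a scaled-down copy of the matching entry of `c_prev`: the gate decides how much of each element is passed on.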

We now have $h_{t - 1}$, which contains only selectively written values from $c_{t - 1}$.
We may not want to pass this along with $x_{t}$ directly to $c_{t}$, as $x_{t}$ may also contain irrelevant information. Hence, we define an intermediate step ${\tilde{c}}_{t}$:
- ${\tilde{c}}_{t} = σ (W h_{t - 1} + U x_{t} + b)$

Then Selective Read = ${\tilde{c}}_{t} ⊙ i_{t}$, where $i_{t}$ (the input gate) is defined as
- $i_{t} = σ (W_{i} h_{t - 1} + U_{i} x_{t} + b_{i})$
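
The selective read can be sketched the same way. This is a minimal NumPy example under assumed sizes and random parameters; the helper name `selective_read` is hypothetical and only the two formulas above are taken from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_read(h_prev, x, W, U, b, W_i, U_i, b_i):
    """Return the candidate state, the gate i_t, and their elementwise product."""
    c_tilde = sigmoid(W @ h_prev + U @ x + b)        # intermediate (candidate) state
    i = sigmoid(W_i @ h_prev + U_i @ x + b_i)        # fraction of each entry to read
    return c_tilde, i, c_tilde * i

rng = np.random.default_rng(1)
n, d = 4, 3                                          # arbitrary sizes
W, W_i = rng.normal(size=(n, n)), rng.normal(size=(n, n))
U, U_i = rng.normal(size=(n, d)), rng.normal(size=(n, d))
b, b_i = np.zeros(n), np.zeros(n)

h_prev = rng.uniform(size=n)                         # stands in for h_{t-1}
x = rng.normal(size=d)                               # stands in for x_t

c_tilde, i, read = selective_read(h_prev, x, W, U, b, W_i, U_i, b_i)
```

Note that the candidate and the gate see the same inputs but use separate parameters, so the network can learn what to propose and how much of the proposal to accept independently.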

We now have $c_{t - 1}$ and ${\tilde{c}}_{t} ⊙ i_{t}$, and have to combine these to get $c_{t}$, i.e. the new state.
One simple way of doing this is to add the two terms. However, we may want to forget some parts of $c_{t - 1}$ instead of passing it on directly. We introduce another gate, the forget gate:
- $f_{t} = σ (W_{f} h_{t - 1} + U_{f} x_{t} + b_{f})$

Finally, after combining the three gates above, we get the new state as
- $c_{t} = {\tilde{c}}_{t} ⊙ i_{t} + c_{t - 1} ⊙ f_{t}$
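
Putting the pieces together, one full step might be sketched as follows. This is a rough NumPy illustration, not a reference implementation: the parameter dictionary layout, the helper names `init_params` and `lstm_step`, and the sizes are assumptions. The output gate for step $t$ is obtained by shifting the index in the output-gate equation by one, and all nonlinearities are sigmoids as in the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n, d, rng):
    """Small random parameters; the dict layout is a hypothetical convention."""
    p = {}
    for name in ("c", "i", "f", "o"):
        p["W_" + name] = rng.normal(scale=0.1, size=(n, n))
        p["U_" + name] = rng.normal(scale=0.1, size=(n, d))
        p["b_" + name] = np.zeros(n)
    return p

def lstm_step(c_prev, h_prev, x, p):
    """One step: selective read, selective forget, then selective write."""
    c_tilde = sigmoid(p["W_c"] @ h_prev + p["U_c"] @ x + p["b_c"])
    i = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x + p["b_i"])   # input gate
    f = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x + p["b_f"])   # forget gate
    c = c_tilde * i + c_prev * f                               # new state c_t
    o = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x + p["b_o"])   # output gate
    h = c * o                                                  # h_t
    return c, h

rng = np.random.default_rng(0)
n, d = 4, 3
p = init_params(n, d, rng)

# Push the gates to "remember everything, read nothing": f close to 1, i close to 0.
p["b_f"] = np.full(n, 20.0)
p["b_i"] = np.full(n, -20.0)

c0 = rng.uniform(size=n)
c, h = c0.copy(), np.zeros(n)
for _ in range(50):
    c, h = lstm_step(c, h, rng.normal(size=d), p)
```

With the biases set this way, `c` after 50 steps is still essentially equal to `c0`, which the plain update cannot achieve: the gates let the network choose to carry information forward unchanged. In a trained network the biases would be learned rather than set by hand.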
