---
disqus: ierosodin
---
# L10-RecurrentNeuralNetworks
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/rnn.html
* A family of neural networks specialized for processing sequential data $x_1, x_2, x_3, ..., x_n$, with hidden units $h_t$ forming a dynamical system
* $h_t = f_\theta (h_{t-1}, x_t)$
* **unfolded**
* $\begin{split}h_t &= f_\theta (f_\theta(...f_\theta(h_0, x_1), ..., x_{t-1}), x_t) \\ &= g(h_0, x_1, x_2, ..., x_t)\end{split}$
* The RNN learns a single shared model $f_\theta$ instead of a separate model $g^{(t)}$ for each time step.
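To make the parameter sharing concrete, here is a minimal NumPy sketch (the helper names `f_theta` and `unfold` are illustrative, not from the book): the same parameters are reused at every time step, rather than a separate $g^{(t)}$ per step.

```python
import numpy as np

def f_theta(h_prev, x_t, W, U, b):
    """One step of the shared transition h_t = f_theta(h_{t-1}, x_t) (tanh cell as an example)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

def unfold(h0, xs, W, U, b):
    """Unfold the recurrence: h_t = g(h_0, x_1, ..., x_t), reusing the same W, U, b each step."""
    h, states = h0, []
    for x_t in xs:
        h = f_theta(h, x_t, W, U, b)   # identical parameters at every time step
        states.append(h)
    return states
```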
* Circuit diagrams
* a succinct description of computations with nodes representing components that might exist in physical implementations
* Unfolded computational graphs
* an explicit description of computations with nodes indicating variables at each time step
* Design Pattern I
* **RNN Back Propagation Through Time**
* Map sequence input to sequence output: $x_1, x_2, ..., x_T$ to $y_1, y_2, ..., y_T$
* Forward propagation can be written as below:
$a_{t} = b + Wh_{t−1} + Ux_{t}$, $h_{t} = tanh(a_{t})$, $o_{t} = c + Vh_{t}$
$\hat y_t$ is generated from $o_t$, where $\hat y_{t, i} = softmax(o_{t})_i = \frac{e^{o_t}_i}{\sum_k e^{o_t}_k}$
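A minimal NumPy sketch of these forward equations (a sanity check only; function and variable names are assumptions, not the book's code):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())              # shift for numerical stability
    return e / e.sum()

def rnn_forward(xs, h0, U, W, V, b, c):
    """Vanilla RNN forward pass: a_t = b + W h_{t-1} + U x_t, h_t = tanh(a_t), o_t = c + V h_t."""
    h, hs, y_hats = h0, [], []
    for x in xs:
        a = b + W @ h + U @ x
        h = np.tanh(a)
        o = c + V @ h
        hs.append(h)
        y_hats.append(softmax(o))        # y_hat_t = softmax(o_t)
    return hs, y_hats
```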
* For the loss function, we use $-log\,p_{model}(y_t|x_1, x_2, ..., x_t)$, and the total loss is the sum of losses over all time steps.
$L = \sum_{t=1}^T{L_t} = \sum_{t=1}^T{-y_tlog\hat {y_t}}$
* First, we calculate the derivative with respect to $o_t$:
$\because\frac{\partial L}{\partial L_t} = 1, \therefore \frac{\partial L}{\partial o_t} = \frac{\partial L_t}{\partial o_t} = \frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial o_t}$
* Assume that $y_t$ is a $D$-dimensional one-hot vector, and at time step $t$ the true answer is in slot $k$.
$\frac{\partial L_t}{\partial o_t} = \frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial o_t} = \frac{\partial L_t}{\partial \hat y_{t}}\frac{\partial \hat y_{t}}{\partial e^{o_t}}\frac{\partial e^{o_t}}{\partial o_{t}}$
* And because $L_t = -y_tlog\hat {y_t} = -log\hat y_{t, k}$, for dimension $k$:
$\frac{\partial L_t}{\partial o_{t, k}} = \frac{\partial L_t}{\partial \hat y_{t, k}}\frac{\partial \hat y_{t, k}}{\partial e^{o_t}_k}e^{o_t}_k = -\frac{1}{\hat y_{t, k}}\frac{\sum_{d=1}^De^{o_t}_d - e^{o_t}_k}{(\sum_{d=1}^De^{o_t}_d)^2}e^{o_t}_k = -\frac{\sum_{d=1}^De^{o_t}_d}{e^{o_t}_k}\frac{\sum_{d=1}^De^{o_t}_d - e^{o_t}_k}{(\sum_{d=1}^De^{o_t}_d)^2}e^{o_t}_k = \hat y_{t, k} - y_{t, k}$
* And for dimension $i \neq k$ (note that $L_t$ still depends only on $\hat y_{t, k}$):
$\frac{\partial L_t}{\partial o_{t, i}} = \frac{\partial L_t}{\partial \hat y_{t, k}}\frac{\partial \hat y_{t, k}}{\partial e^{o_t}_i}e^{o_t}_i = -\frac{1}{\hat y_{t, k}}\frac{0 - e^{o_t}_k}{(\sum_{d=1}^De^{o_t}_d)^2}e^{o_t}_i = -\frac{\sum_{d=1}^De^{o_t}_d}{e^{o_t}_k}\frac{0 - e^{o_t}_k}{(\sum_{d=1}^De^{o_t}_d)^2}e^{o_t}_i = \frac{e^{o_t}_i}{\sum_{d=1}^De^{o_t}_d} = \hat y_{t, i} - y_{t, i}$
* Therefore, $\frac{\partial L_t}{\partial o_t} = \hat y_{t} - y_{t}$
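This result is easy to verify numerically; a small self-contained check (illustrative values, central finite differences) comparing $\hat y_t - y_t$ against a numerical gradient of $L_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 5, 2                                   # D classes, true class in slot k
o = rng.normal(size=D)                        # logits o_t
y = np.zeros(D)
y[k] = 1.0                                    # one-hot target y_t

def loss(o):
    y_hat = np.exp(o) / np.exp(o).sum()       # softmax
    return -np.log(y_hat[k])                  # L_t = -log y_hat_{t,k}

analytic = np.exp(o) / np.exp(o).sum() - y    # the derived gradient: y_hat_t - y_t
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(D)[i]) - loss(o - eps * np.eye(D)[i])) / (2 * eps)
    for i in range(D)
])
print(np.allclose(analytic, numeric))         # True
```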
* And then we can derive: $\frac{\partial L}{\partial V} = \sum_{t=1}^T\frac{\partial L_t}{\partial V} = \sum_{t=1}^T\frac{\partial L_t}{\partial o_t}\frac{\partial o_t}{\partial V} = \sum_{t=1}^T\frac{\partial L_t}{\partial o_t}h_t^T = \sum_{t=1}^T(\hat y_{t} - y_{t})h_t^T$
* For $\frac{\partial L}{\partial h}$, we start from the last time step.
* When $t = T$, $\frac{\partial L}{\partial h_T} = (\frac{\partial o_T}{\partial h_T})^T\frac{\partial L_T}{\partial o_T} = V^T\frac{\partial L_T}{\partial o_T}$
* However, when $t < T$, $h_t$ also affects the loss through $h_{t+1}$: $\begin{split}\frac{\partial L}{\partial h_t} &= (\frac{\partial o_t}{\partial h_t})^T\frac{\partial L_t}{\partial o_t} + (\frac{\partial h_{t+1}}{\partial h_t})^T\frac{\partial L}{\partial h_{t+1}} \\
&= V^T\frac{\partial L_t}{\partial o_t} + (\frac{\partial h_{t+1}}{\partial a_{t+1}}\frac{\partial a_{t+1}}{\partial h_t})^T\frac{\partial L}{\partial h_{t+1}} = V^T\frac{\partial L_t}{\partial o_t} + W^Tdiag(1 - h_{t+1}^2)\frac{\partial L}{\partial h_{t+1}}\end{split}$
* Therefore, using the total gradient $\frac{\partial L}{\partial h_t}$ from the recursion above,
$\frac{\partial L}{\partial W} = \sum_t\frac{\partial L}{\partial h_t}\frac{\partial h_t}{\partial W} = \sum_t diag(1-h_t^2)\frac{\partial L}{\partial h_t}h_{t-1}^T$
$\frac{\partial L}{\partial U} = \sum_t\frac{\partial L}{\partial h_t}\frac{\partial h_t}{\partial U} = \sum_t diag(1-h_t^2)\frac{\partial L}{\partial h_t}x_t^T$
* The last step is finding the gradients of the bias terms.
$\frac{\partial L}{\partial c} = \sum_t\frac{\partial L_t}{\partial o_t}\frac{\partial o_t}{\partial c} = \sum_t\frac{\partial L_t}{\partial o_t}$
$\frac{\partial L}{\partial b} = \sum_t\frac{\partial L}{\partial h_t}\frac{\partial h_t}{\partial b} = \sum_t diag(1-h_t^2)\frac{\partial L}{\partial h_t}$
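Putting the whole derivation together, a compact NumPy sketch of BPTT for this vanilla RNN (shapes and names are assumptions: vectors are 1-D arrays, `W` is hidden×hidden, `U` is hidden×input, `V` is output×hidden):

```python
import numpy as np

def rnn_bptt(xs, ys, h0, U, W, V, b, c):
    """Gradients of the total cross-entropy loss w.r.t. U, W, V, b, c (illustrative sketch)."""
    # forward pass, caching hidden states and predictions
    hs, y_hats, h = [h0], [], h0
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)
        hs.append(h)
        o = c + V @ h
        e = np.exp(o - o.max())
        y_hats.append(e / e.sum())
    # backward pass, from the last time step to the first
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros_like(h0)                # contribution of later time steps to dL/dh_t
    for t in reversed(range(len(xs))):
        do = y_hats[t] - ys[t]                 # dL_t/do_t = y_hat_t - y_t
        dV += np.outer(do, hs[t + 1])          # (y_hat_t - y_t) h_t^T
        dc += do
        dh = V.T @ do + dh_next                # dL/dh_t = V^T dL_t/do_t + term from t+1
        da = (1.0 - hs[t + 1] ** 2) * dh       # diag(1 - h_t^2) dL/dh_t
        dW += np.outer(da, hs[t])              # ... h_{t-1}^T
        dU += np.outer(da, xs[t])              # ... x_t^T
        db += da
        dh_next = W.T @ da                     # W^T diag(1 - h_t^2) dL/dh_t, passed to step t-1
    return dU, dW, dV, db, dc
```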
* Design Pattern II
* Networks with hidden unit connections that produce a single output for an entire sequence
* **Vanishing and Exploding Gradients**
* Gradients propagated over many stages tend to either vanish or explode
* $\begin{split}\nabla_{h^{(t)}}L &= W^TH^{(t+1)}(\nabla_{h^{(t+1)}}L) \\
&= (W^TH^{(t+1)})^{(\tau - t)}(\nabla_{h^{(\tau)}}L) \\
&= Q\Lambda^{(\tau - t)} Q^T(\nabla_{h^{(\tau)}}L)\end{split}$
* **The term $\Lambda^{(\tau - t)}$ causes the eigenvalues with magnitude smaller than one to decay to zero and eigenvalues with magnitude greater than one to explode (when $t << \tau$ )**
* When gradients vanish quickly, it becomes difficult to learn long-term dependencies
* **The reason for choosing tanh as the activation has to do with $H$, whose diagonal entries are the derivatives of the activation function; gradients may vanish much more quickly if a sigmoid is used, since its derivative is at most 0.25 while tanh's is at most 1.**
* When gradients explode, the gradient-based training could throw the parameters far into a region where the objective becomes larger
* **One simple solution is to clip the gradient before the parameter update when a threshold on its norm is exceeded.**
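A minimal sketch of norm-based gradient clipping (framework-agnostic; most deep learning libraries provide an equivalent, e.g. `torch.nn.utils.clip_grad_norm_` in PyTorch):

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients together when their global L2 norm exceeds `threshold`."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```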
* Learning Long-Term Dependencies
* Echo state networks
* Set recurrent weights W to ensure spectral radius of Jacobian ≈ 1
* Learn output weights V only
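A sketch of the echo-state initialization (the helper name is an assumption; only the scaling idea is shown): rescale a random recurrent matrix to the desired spectral radius, keep it fixed, and train only the readout weights.

```python
import numpy as np

def scale_to_spectral_radius(W, rho=0.99):
    """Rescale W so that its largest eigenvalue magnitude (spectral radius) is about rho."""
    radius = np.abs(np.linalg.eigvals(W)).max()
    return W * (rho / radius)

rng = np.random.default_rng(0)
W = scale_to_spectral_radius(rng.normal(size=(200, 200)) * 0.1)   # fixed reservoir weights
# Only the output weights V are learned on top of the reservoir states.
```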
* Skip connections
* connections from variables in the distant past
* **Leaky units**
* introducing an integrator in the hidden unit
* $h^{(t)} = (1 - \frac{1}{\alpha})\odot h^{(t - 1)} + \frac{1}{\alpha}\odot tanh (Wh^{(t - 1)} + Ux^{(t)}), 1 \le \alpha_i < \infty$
* $\alpha$ determines the time scale of integration ($\alpha$ ↑, scale↑)
* $\alpha = 1$ : Ordinary RNN
* $\alpha > 1$ : the influence of past $x^{(t)}$'s persists longer; gradients propagate more easily
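A one-step sketch of the leaky-unit update above (illustrative names; `alpha` is a per-unit vector with entries ≥ 1):

```python
import numpy as np

def leaky_unit_step(h_prev, x_t, W, U, alpha):
    """h_t = (1 - 1/alpha) * h_{t-1} + (1/alpha) * tanh(W h_{t-1} + U x_t)."""
    candidate = np.tanh(W @ h_prev + U @ x_t)
    return (1.0 - 1.0 / alpha) * h_prev + (1.0 / alpha) * candidate

# alpha = 1 recovers the ordinary RNN update; larger alpha_i makes unit i change
# more slowly, so information and gradients persist over longer time scales.
```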
* Long Short-Term Memory (LSTM)
* To change the time scale of integration dynamically by introducing programmable gates $(g^{(t)}, f^{(t)}, q^{(t)})$ that are conditioned on context
* **The gating context at time $t$ refers collectively to $x^{(t)},h^{(t−1)}$ and may include other inputs**
* Memory state: $s^{(t)}$
* Input gate: $g^{(t)} = \sigma(U^gx^{(t)} + W^gh^{(t-1)})$
* Output gate: $q^{(t)} = \sigma(U^ox^{(t)} + W^oh^{(t-1)})$
* Forget gate: $f^{(t)} = \sigma(U^fx^{(t)} + W^fh^{(t-1)})$
* New content: $a^{(t)} = Ux^{(t)} + Wh^{(t-1)}$
* Memory update: $s^{(t)} = f^{(t)} \odot s^{(t-1)} + g^{(t)} \odot tanh(a^{(t)})$
* Hidden unit update: $h^{(t)} = q^{(t)} \odot tanh(s^{(t)})$
* Output unit update: $o^{(t)} = Vh^{(t)}$
* The design of LSTM is justified by the fact that there could be subsequences of varying statistics in a main sequence
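The equations above translate directly into a single-step function; a minimal NumPy sketch (the keys of the `params` dict are assumptions, biases omitted as in the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, params):
    """One LSTM step following the update equations above (illustrative sketch)."""
    g = sigmoid(params["Ug"] @ x_t + params["Wg"] @ h_prev)   # input gate  g_t
    q = sigmoid(params["Uo"] @ x_t + params["Wo"] @ h_prev)   # output gate q_t
    f = sigmoid(params["Uf"] @ x_t + params["Wf"] @ h_prev)   # forget gate f_t
    a = params["U"] @ x_t + params["W"] @ h_prev               # new content a_t
    s = f * s_prev + g * np.tanh(a)                            # memory update s_t
    h = q * np.tanh(s)                                         # hidden update h_t
    o = params["V"] @ h                                        # output o_t
    return h, s, o
```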
* Design Pattern III
* Networks with only output recurrence
* **Such networks are less powerful as the outputs $o^{(t)}$’s are forced to approximate the targets $y^{(t)}$’s while having to convey a good summary about the past**
* One effective way of training this type of network is **teacher forcing**, which feeds the correct target $y^{(t)}$ into $h^{(t+1)}$ during training, allowing the gradient for each time step to be computed in isolation.
* Open-loop issue: inputs seen at **training and test time are different**
* To resolve this, it is common to train with **both teacher-forced inputs and free-running inputs** (generated by the output-to-input paths), or randomly choose between them.
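A tiny sketch of that mixed scheme (hypothetical helper; `teacher_forcing_ratio` controls how often the ground truth is fed back during training):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_decoder_input(y_true_prev, y_pred_prev, teacher_forcing_ratio=0.5):
    """Choose between the previous ground-truth target (teacher forcing) and the
    model's own previous prediction (free running) as the next input."""
    if rng.random() < teacher_forcing_ratio:
        return y_true_prev          # closed-loop signal used by teacher forcing
    return y_pred_prev              # matches what the network sees at test time
```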
* Design Pattern IV
* Networks with both output recurrence and hidden unit connections
* The training objective becomes to maximize
* $\begin{split}&p_{model}(y^{(1)}, y^{(2)}, ..., y^{(\tau)} | x^{(1)}, x^{(2)}, ..., x^{(\tau)}) \\
&= \prod p_{model}(y^{(t)} | x^{(1)}, x^{(2)}, ..., x^{(\tau)}, y^{(1)}, y^{(2)}, ..., y^{(t - 1)})\end{split}$
* where $y^{(t)}$’s are modeled to be conditionally dependent
* This type of networks allows modeling an arbitrary distribution over the $y$ sequence given the $x$ sequence of the same length, **whether $y^{(t)}$’s are conditionally independent or dependent**
* One can also turn the network into a generative model for unsupervised learning by omitting all the $x^{(t)}$’s
* Bidirectional RNN
* In many applications, it may be necessary to output a prediction $y^{(t)}$ that depends on the entire input sequence
* Bidirectional RNN combines two RNNs, one moving forward in time and the other moving backward, to capture information from both the past and future
* **requires the entire sequence be buffered**
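A minimal sketch of the bidirectional combination (illustrative; `step_fwd` and `step_bwd` are two single-step RNN cells with separate parameters). Note that the whole sequence is needed up front, which is exactly the buffering requirement above.

```python
import numpy as np

def bidirectional_states(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """Run one RNN left-to-right and another right-to-left, then concatenate the
    states so each position sees both past and future context."""
    forward, h = [], h0_fwd
    for x in xs:
        h = step_fwd(h, x)
        forward.append(h)
    backward, h = [], h0_bwd
    for x in reversed(xs):
        h = step_bwd(h, x)
        backward.append(h)
    backward.reverse()                         # re-align with the original time order
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```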
* **Sequence-to-Sequence Networks**
* **The encoder network encodes the input sequence and emits a context C for the decoder network to generate the output sequence**
* The context C can be a function of the last hidden unit or a summary of different hidden units by introducing an attention mechanism
* Attention Mechanisms
* A weighted average of information with weights $\sigma^{(t)}$’s (gate signals) produced by the model itself
* The information to be averaged could be hidden units of a network or raw input data
* The weights are usually produced by applying a softmax function to some relevance scores emitted somewhere in the model
* The soft weighting is more expensive than direct indexing but is differentiable, making it trainable with gradient-based algorithms.
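A minimal sketch of soft attention as described above (the scoring function is an assumption; a dot product is used only as an example):

```python
import numpy as np

def soft_attention(query, values, score_fn):
    """Softmax over relevance scores gives differentiable weights; the output is the
    weighted average of `values` (e.g. encoder hidden states)."""
    scores = np.array([score_fn(query, v) for v in values])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax weights
    return weights @ np.stack(values), weights     # weighted average, plus the weights

# Example: dot-product relevance scores.
rng = np.random.default_rng(0)
context, weights = soft_attention(rng.normal(size=4),
                                  [rng.normal(size=4) for _ in range(3)],
                                  score_fn=lambda q, v: q @ v)
```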
* Other models