---
disqus: ierosodin
---
# L10-RecurrentNeuralNetworks
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/rnn.html
* A family of neural networks specialized for processing sequential data $x_1, x_2, x_3, ..., x_n$, with hidden units $h_t$ forming a dynamical system
* $h_t = f_\theta (h_{t-1}, x_t)$
* **unfolded**
* $\begin{split}h_t &= f_\theta (f_\theta(...f_\theta(h_0, x_1), ..., x_{t-1}), x_t) \\ &= g(h_0, x_1, x_2, ..., x_t)\end{split}$
* The RNN learns a single shared model $f_\theta$ instead of a separate model $g^{(t)}$ for each time step.
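To make the parameter sharing concrete, here is a minimal NumPy sketch (the helper names `f_theta` and `unfold` are illustrative, not from the book): the same parameters are reused at every time step, rather than a separate $g^{(t)}$ per step.

```python
import numpy as np

def f_theta(h_prev, x_t, W, U, b):
    """One step of the shared transition h_t = f_theta(h_{t-1}, x_t) (tanh cell as an example)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

def unfold(h0, xs, W, U, b):
    """Unfold the recurrence: h_t = g(h_0, x_1, ..., x_t), reusing the same W, U, b each step."""
    h, states = h0, []
    for x_t in xs:
        h = f_theta(h, x_t, W, U, b)   # identical parameters at every time step
        states.append(h)
    return states
```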
* Circuit diagrams
* a succinct description of computations with nodes representing components that might exist in physical implementations
* Unfolded computational graphs
* an explicit description of computations with nodes indicating variables at each time step
* Design Pattern I
* **RNN Back Propagation Through Time**
* Map sequence input to sequence output: $x_1, x_2, ..., x_T$ to $y_1, y_2, ..., y_T$
* Forward propagation can be written as below:
$a_{t} = b + Wh_{t−1} + Ux_{t}$, $h_{t} = tanh(a_{t})$, $o_{t} = c + Vh_{t}$
$\hat y_t$ is generated from $o_t$, where $\hat y_{t, i} = softmax(o_{t})_i = \frac{e^{o_t}_i}{\sum_k e^{o_t}_k}$
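A minimal NumPy sketch of these forward equations (a sanity check only; function and variable names are assumptions, not the book's code):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())              # shift for numerical stability
    return e / e.sum()

def rnn_forward(xs, h0, U, W, V, b, c):
    """Vanilla RNN forward pass: a_t = b + W h_{t-1} + U x_t, h_t = tanh(a_t), o_t = c + V h_t."""
    h, hs, y_hats = h0, [], []
    for x in xs:
        a = b + W @ h + U @ x
        h = np.tanh(a)
        o = c + V @ h
        hs.append(h)
        y_hats.append(softmax(o))        # y_hat_t = softmax(o_t)
    return hs, y_hats
```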
* For the loss function, we use $-log\,p_{model}(y_t|x_1, x_2, ..., x_t)$, and the total loss is the sum of losses over all time steps.
$L = \sum_{t=1}^T{L_t} = \sum_{t=1}^T{-y_tlog\hat {y_t}}$
* First, we calculate the derivative with respect to $o_t$:
$\because\frac{\partial L}{\partial L_t} = 1, \therefore \frac{\partial L}{\partial o_t} = \frac{\partial L_t}{\partial o_t} = \frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial o_t}$
* Assume that $y_t$ is a $D$-dimensional one-hot vector, and at time step $t$ the true answer is in slot $k$.
$\frac{\partial L_t}{\partial o_t} = \frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial o_t} = \frac{\partial L_t}{\partial \hat y_{t}}\frac{\partial \hat y_{t}}{\partial e^{o_t}}\frac{\partial e^{o_t}}{\partial o_{t}}$
* And because $L_t = -y_tlog\hat {y_t} = -log\hat y_{t, k}$, for dimension $k$:
$\frac{\partial L_t}{\partial o_{t, k}} = \frac{\partial L_t}{\partial \hat y_{t, k}}\frac{\partial \hat y_{t, k}}{\partial e^{o_t}_k}e^{o_t}_k = -\frac{1}{\hat y_{t, k}}\frac{\sum_{d=1}^De^{o_t}_d - e^{o_t}_k}{(\sum_{d=1}^De^{o_t}_d)^2}e^{o_t}_k = -\frac{\sum_{d=1}^De^{o_t}_d}{e^{o_t}_k}\frac{\sum_{d=1}^De^{o_t}_d - e^{o_t}_k}{(\sum_{d=1}^De^{o_t}_d)^2}e^{o_t}_k = \hat y_{t, k} - y_{t, k}$
* And for dimension $i \neq k$ (note that $L_t$ still depends only on $\hat y_{t, k}$):
$\frac{\partial L_t}{\partial o_{t, i}} = \frac{\partial L_t}{\partial \hat y_{t, k}}\frac{\partial \hat y_{t, k}}{\partial e^{o_t}_i}e^{o_t}_i = -\frac{1}{\hat y_{t, k}}\frac{0 - e^{o_t}_k}{(\sum_{d=1}^De^{o_t}_d)^2}e^{o_t}_i = -\frac{\sum_{d=1}^De^{o_t}_d}{e^{o_t}_k}\frac{0 - e^{o_t}_k}{(\sum_{d=1}^De^{o_t}_d)^2}e^{o_t}_i = \frac{e^{o_t}_i}{\sum_{d=1}^De^{o_t}_d} = \hat y_{t, i} - y_{t, i}$
* Therefore, $\frac{\partial L_t}{\partial o_t} = \hat y_{t} - y_{t}$
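This result is easy to verify numerically; a small self-contained check (illustrative values, central finite differences) comparing $\hat y_t - y_t$ against a numerical gradient of $L_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 5, 2                                   # D classes, true class in slot k
o = rng.normal(size=D)                        # logits o_t
y = np.zeros(D)
y[k] = 1.0                                    # one-hot target y_t

def loss(o):
    y_hat = np.exp(o) / np.exp(o).sum()       # softmax
    return -np.log(y_hat[k])                  # L_t = -log y_hat_{t,k}

analytic = np.exp(o) / np.exp(o).sum() - y    # the derived gradient: y_hat_t - y_t
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(D)[i]) - loss(o - eps * np.eye(D)[i])) / (2 * eps)
    for i in range(D)
])
print(np.allclose(analytic, numeric))         # True
```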
* And then we can derive: $\frac{\partial L}{\partial V} = \sum_{t=1}^T\frac{\partial L_t}{\partial V} = \sum_{t=1}^T\frac{\partial L_t}{\partial o_t}\frac{\partial o_t}{\partial V} = \sum_{t=1}^T\frac{\partial L_t}{\partial o_t}h_t^T = \sum_{t=1}^T(\hat y_{t} - y_{t})h_t^T$
* For $\frac{\partial L}{\partial h}$, we start from the last time step.
* When $t = T$, $\frac{\partial L}{\partial h_T} = (\frac{\partial o_T}{\partial h_T})^T\frac{\partial L_T}{\partial o_T} = V^T\frac{\partial L_T}{\partial o_T}$
* However, when $t < T$, $h_t$ also affects the loss through $h_{t+1}$: $\begin{split}\frac{\partial L}{\partial h_t} &= (\frac{\partial o_t}{\partial h_t})^T\frac{\partial L_t}{\partial o_t} + (\frac{\partial h_{t+1}}{\partial h_t})^T\frac{\partial L}{\partial h_{t+1}} \\
&= V^T\frac{\partial L_t}{\partial o_t} + (\frac{\partial h_{t+1}}{\partial a_{t+1}}\frac{\partial a_{t+1}}{\partial h_t})^T\frac{\partial L}{\partial h_{t+1}} = V^T\frac{\partial L_t}{\partial o_t} + W^Tdiag(1 - h_{t+1}^2)\frac{\partial L}{\partial h_{t+1}}\end{split}$
* Therefore, using the total gradient $\frac{\partial L}{\partial h_t}$ from the recursion above,
$\frac{\partial L}{\partial W} = \sum_t\frac{\partial L}{\partial h_t}\frac{\partial h_t}{\partial W} = \sum_t diag(1-h_t^2)\frac{\partial L}{\partial h_t}h_{t-1}^T$
$\frac{\partial L}{\partial U} = \sum_t\frac{\partial L}{\partial h_t}\frac{\partial h_t}{\partial U} = \sum_t diag(1-h_t^2)\frac{\partial L}{\partial h_t}x_t^T$
* The last step is finding the gradients of the bias terms.
$\frac{\partial L}{\partial c} = \sum_t\frac{\partial L_t}{\partial o_t}\frac{\partial o_t}{\partial c} = \sum_t\frac{\partial L_t}{\partial o_t}$
$\frac{\partial L}{\partial b} = \sum_t\frac{\partial L}{\partial h_t}\frac{\partial h_t}{\partial b} = \sum_t diag(1-h_t^2)\frac{\partial L}{\partial h_t}$
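Putting the whole derivation together, a compact NumPy sketch of BPTT for this vanilla RNN (shapes and names are assumptions: vectors are 1-D arrays, `W` is hidden×hidden, `U` is hidden×input, `V` is output×hidden):

```python
import numpy as np

def rnn_bptt(xs, ys, h0, U, W, V, b, c):
    """Gradients of the total cross-entropy loss w.r.t. U, W, V, b, c (illustrative sketch)."""
    # forward pass, caching hidden states and predictions
    hs, y_hats, h = [h0], [], h0
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)
        hs.append(h)
        o = c + V @ h
        e = np.exp(o - o.max())
        y_hats.append(e / e.sum())
    # backward pass, from the last time step to the first
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros_like(h0)                # contribution of later time steps to dL/dh_t
    for t in reversed(range(len(xs))):
        do = y_hats[t] - ys[t]                 # dL_t/do_t = y_hat_t - y_t
        dV += np.outer(do, hs[t + 1])          # (y_hat_t - y_t) h_t^T
        dc += do
        dh = V.T @ do + dh_next                # dL/dh_t = V^T dL_t/do_t + term from t+1
        da = (1.0 - hs[t + 1] ** 2) * dh       # diag(1 - h_t^2) dL/dh_t
        dW += np.outer(da, hs[t])              # ... h_{t-1}^T
        dU += np.outer(da, xs[t])              # ... x_t^T
        db += da
        dh_next = W.T @ da                     # W^T diag(1 - h_t^2) dL/dh_t, passed to step t-1
    return dU, dW, dV, db, dc
```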
* Design Pattern II
* Networks with hidden unit connections that produce a single output for an entire sequence
* **Vanishing and Exploding Gradients**
* Gradients propagated over many stages tend to either vanish or explode
* $\begin{split}\nabla_{h^{(t)}}L &= W^TH^{(t+1)}(\nabla_{h^{(t+1)}}L) \\
&= (W^TH^{(t+1)})^{(\tau - t)}(\nabla_{h^{(\tau)}}L) \\
&= Q\Lambda^{(\tau - t)} Q^T(\nabla_{h^{(\tau)}}L)\end{split}$
* **The term $\Lambda^{(\tau - t)}$ causes the eigenvalues with magnitude smaller than one to decay to zero and eigenvalues with magnitude greater than one to explode (when $t << \tau$ )**
* When gradients vanish quickly, it becomes difficult to learn long-term dependencies
* **The reason for choosing tanh as the activation has to do with $H$, whose diagonal entries are the derivatives of the activation function; gradients may vanish much more quickly if a sigmoid is used, since its derivative is at most 0.25 while tanh's is at most 1.**
* When gradients explode, the gradient-based training could throw the parameters far into a region where the objective becomes larger
* **One simple solution is to clip the gradient before the parameter update when a threshold on its norm is exceeded.**
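A minimal sketch of norm-based gradient clipping (framework-agnostic; most deep learning libraries provide an equivalent, e.g. `torch.nn.utils.clip_grad_norm_` in PyTorch):

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients together when their global L2 norm exceeds `threshold`."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```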
* Learning Long-Term Dependencies
* Echo state networks
* Set recurrent weights W to ensure spectral radius of Jacobian ≈ 1
* Learn output weights V only
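A sketch of the echo-state initialization (the helper name is an assumption; only the scaling idea is shown): rescale a random recurrent matrix to the desired spectral radius, keep it fixed, and train only the readout weights.

```python
import numpy as np

def scale_to_spectral_radius(W, rho=0.99):
    """Rescale W so that its largest eigenvalue magnitude (spectral radius) is about rho."""
    radius = np.abs(np.linalg.eigvals(W)).max()
    return W * (rho / radius)

rng = np.random.default_rng(0)
W = scale_to_spectral_radius(rng.normal(size=(200, 200)) * 0.1)   # fixed reservoir weights
# Only the output weights V are learned on top of the reservoir states.
```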
* Skip connections
* connections from variables in the distant past
* **Leaky units**
* introducing an integrator in the hidden unit
* $h^{(t)} = (1 - \frac{1}{\alpha})\odot h^{(t - 1)} + \frac{1}{\alpha}\odot tanh (Wh^{(t - 1)} + Ux^{(t)}), 1 \le \alpha_i < \infty$
* $\alpha$ determines the time scale of integration ($\alpha$ ↑, scale↑)
* $\alpha = 1$ : Ordinary RNN
* $\alpha > 1$ : the influence of past $x^{(t)}$'s persists longer; gradients propagate more easily
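A one-step sketch of the leaky-unit update above (illustrative names; `alpha` is a per-unit vector with entries ≥ 1):

```python
import numpy as np

def leaky_unit_step(h_prev, x_t, W, U, alpha):
    """h_t = (1 - 1/alpha) * h_{t-1} + (1/alpha) * tanh(W h_{t-1} + U x_t)."""
    candidate = np.tanh(W @ h_prev + U @ x_t)
    return (1.0 - 1.0 / alpha) * h_prev + (1.0 / alpha) * candidate

# alpha = 1 recovers the ordinary RNN update; larger alpha_i makes unit i change
# more slowly, so information and gradients persist over longer time scales.
```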
* Long Short-Term Memory (LSTM)
* To change the time scale of integration dynamically by introducing programmable gates $(g^{(t)}, f^{(t)}, q^{(t)})$ that are conditioned on context
* **The gating context at time $t$ refers collectively to $x^{(t)},h^{(t−1)}$ and may include other inputs**
* Memory state: $s^{(t)}$
* Input gate: $g^{(t)} = \sigma(U^gx^{(t)} + W^gh^{(t-1)})$
* Output gate: $q^{(t)} = \sigma(U^ox^{(t)} + W^oh^{(t-1)})$
* Forget gate: $f^{(t)} = \sigma(U^fx^{(t)} + W^fh^{(t-1)})$
* New content: $a^{(t)} = Ux^{(t)} + Wh^{(t-1)}$
* Memory update: $s^{(t)} = f^{(t)} \odot s^{(t-1)} + g^{(t)} \odot tanh(a^{(t)})$
* Hidden unit update: $h^{(t)} = q^{(t)} \odot tanh(s^{(t)})$
* Output unit update: $o^{(t)} = Vh^{(t)}$
* The design of LSTM is justified by the fact that there could be subsequences of varying statistics in a main sequence
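The equations above translate directly into a single-step function; a minimal NumPy sketch (the keys of the `params` dict are assumptions, biases omitted as in the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, params):
    """One LSTM step following the update equations above (illustrative sketch)."""
    g = sigmoid(params["Ug"] @ x_t + params["Wg"] @ h_prev)   # input gate  g_t
    q = sigmoid(params["Uo"] @ x_t + params["Wo"] @ h_prev)   # output gate q_t
    f = sigmoid(params["Uf"] @ x_t + params["Wf"] @ h_prev)   # forget gate f_t
    a = params["U"] @ x_t + params["W"] @ h_prev               # new content a_t
    s = f * s_prev + g * np.tanh(a)                            # memory update s_t
    h = q * np.tanh(s)                                         # hidden update h_t
    o = params["V"] @ h                                        # output o_t
    return h, s, o
```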
* Design Pattern III
* Networks with only output recurrence
* **Such networks are less powerful as the outputs $o^{(t)}$’s are forced to approximate the targets $y^{(t)}$’s while having to convey a good summary about the past**
* One effective way of training this type of network is **teacher forcing**, which feeds the correct target $y^{(t)}$ into $h^{(t+1)}$ during training, allowing the gradient for each time step to be computed in isolation.
* Open-loop issue: inputs seen at **training and test time are different**
* To resolve this, it is common to train with **both teacher-forced inputs and free-running inputs** (generated by the output-to-input paths), or randomly choose between them.
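A tiny sketch of that mixed scheme (hypothetical helper; `teacher_forcing_ratio` controls how often the ground truth is fed back during training):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_decoder_input(y_true_prev, y_pred_prev, teacher_forcing_ratio=0.5):
    """Choose between the previous ground-truth target (teacher forcing) and the
    model's own previous prediction (free running) as the next input."""
    if rng.random() < teacher_forcing_ratio:
        return y_true_prev          # closed-loop signal used by teacher forcing
    return y_pred_prev              # matches what the network sees at test time
```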
* Design Pattern IV
* Networks with both output recurrence and hidden unit connections
* The training objective becomes to maximize
* $\begin{split}&p_{model}(y^{(1)}, y^{(2)}, ..., y^{(\tau)} | x^{(1)}, x^{(2)}, ..., x^{(\tau)}) \\
&= \prod p_{model}(y^{(t)} | x^{(1)}, x^{(2)}, ..., x^{(\tau)}, y^{(1)}, y^{(2)}, ..., y^{(t - 1)})\end{split}$
* where $y^{(t)}$’s are modeled to be conditionally dependent
* This type of networks allows modeling an arbitrary distribution over the $y$ sequence given the $x$ sequence of the same length, **whether $y^{(t)}$’s are conditionally independent or dependent**
* One can also turn the network into a generative model for unsupervised learning by omitting all the $x^{(t)}$’s
* Bidirectional RNN
* In many applications, it may be necessary to output a prediction $y^{(t)}$ that depends on the entire input sequence
* Bidirectional RNN combines two RNNs, one moving forward in time and the other moving backward, to capture information from both the past and future
* **requires the entire sequence be buffered**
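A minimal sketch of the bidirectional combination (illustrative; `step_fwd` and `step_bwd` are two single-step RNN cells with separate parameters). Note that the whole sequence is needed up front, which is exactly the buffering requirement above.

```python
import numpy as np

def bidirectional_states(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """Run one RNN left-to-right and another right-to-left, then concatenate the
    states so each position sees both past and future context."""
    forward, h = [], h0_fwd
    for x in xs:
        h = step_fwd(h, x)
        forward.append(h)
    backward, h = [], h0_bwd
    for x in reversed(xs):
        h = step_bwd(h, x)
        backward.append(h)
    backward.reverse()                         # re-align with the original time order
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```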
* **Sequence-to-Sequence Networks**
* **The encoder network encodes the input sequence and emits a context C for the decoder network to generate the output sequence**
* The context C can be a function of the last hidden unit or a summary of different hidden units by introducing an attention mechanism
* Attention Mechanisms
* A weighted average of information with weights $\sigma^{(t)}$’s (gate signals) produced by the model itself
* The information to be averaged could be hidden units of a network or raw input data
* The weights are usually produced by applying a softmax function to some relevance scores emitted somewhere in the model
* The soft weighting is more expensive than direct indexing but is differentiable, making it trainable with gradient-based algorithms.
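A minimal sketch of soft attention as described above (the scoring function is an assumption; a dot product is used only as an example):

```python
import numpy as np

def soft_attention(query, values, score_fn):
    """Softmax over relevance scores gives differentiable weights; the output is the
    weighted average of `values` (e.g. encoder hidden states)."""
    scores = np.array([score_fn(query, v) for v in values])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax weights
    return weights @ np.stack(values), weights     # weighted average, plus the weights

# Example: dot-product relevance scores.
rng = np.random.default_rng(0)
context, weights = soft_attention(rng.normal(size=4),
                                  [rng.normal(size=4) for _ in range(3)],
                                  score_fn=lambda q, v: q @ v)
```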
* Other models