# Deep Learning Lab 6: RNN

## Overview (0:00 - 14:00) - Zhengyuan Ding

RNN is one type of architecture that we can use to deal with sequences of data. What is a sequence? From the CNN lesson we saw that a signal can be 1D, 2D or 3D depending on the domain, the domain being what you map from and go to. Handling sequential data is basically dealing with 1D data, since the domain is just the temporal ordering of the inputs $x$. Nevertheless, you can also use RNNs to deal with 2D data, where you have two directions.

#### 1. Vanilla vs. Recurrent NN

![Vanilla](https://i.imgur.com/cY4AbHu.png)

On the left is a typical vanilla neural network diagram with three layers ("vanilla" is an American term meaning plain). The pink bubble is the input vector $x$, in the center is the hidden layer in green, and the final blue layer is the output. Using an analogy from digital electronics, this is like combinational logic, where the current output depends only on the current input.

![RNN](https://i.imgur.com/R0AjfK1.png)

In contrast to a vanilla neural network, in a recurrent neural network the current output depends not only on the current input but also on the state of the system, shown on the left of the image above. This is like sequential logic in digital electronics, where the output also depends on a flip-flop (a basic memory unit in digital electronics). The main difference, therefore, is that the output of a vanilla neural network depends only on the current input, while that of an RNN depends also on the state of the system.

![mapping1](https://i.imgur.com/l8R2YA9.png)

Yann's diagram adds these shapes between neurons to represent the mapping between one tensor and another (one vector to another). For example, in the diagram above the input vector $x$ maps through this additional item to the hidden representation $h$. This item is actually an affine transformation, i.e. rotation plus distortion. Then through another transformation we get from the hidden layer to the final output. Similarly, in the RNN diagram you can have the same additional items between neurons.

![](https://i.imgur.com/bmfmU4K.png)

#### 2. Four Types of RNN Architectures and Examples

![](https://i.imgur.com/7zMks9G.png)

The first case is vector to sequence, as shown in the diagram above. The input is one bubble, and then there are evolutions of the internal state of the system, annotated as the green bubbles. As the state of the system evolves, at every time step there is one specific output.

An example of this type of architecture is image captioning, shown below. The input is an image, while the output is a sequence of words giving an English description of the input image. In terms of the diagram above, each blue bubble is an index into a dictionary that tells which word to pick. For instance, if the output is the sentence "This is a yellow school bus", you first get an index and look it up to be "a", then get another one to pick the word "yellow", and so on.

Some results of this network are shown below. For example, in the first column the description of the last picture is "A herd of elephants walking across a dry grass field.", which is very accurate. In the second column the first image outputs "Two dogs play in the grass.", while there are actually three dogs. The last column shows examples that are more clearly wrong, such as "A yellow school bus parked in a parking lot." In general, these results show that this network can fail or perform well, depending on the case.

This is the case that goes from one input vector, which is the representation of an image, to a sequence of symbols, which are for example the characters or words making up an English sentence. This kind of architecture is called an autoregressive network: a network that produces an output while being fed the previous output as input.

![](https://i.imgur.com/T3hecD4.jpg)
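As an illustration of the vector-to-sequence (autoregressive) idea, here is a minimal PyTorch sketch of a decoder that unrolls a single image feature vector into a sequence of word indices. The 512-dimensional feature, the vocabulary size and all names are illustrative assumptions, not the captioning model from the figures.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Vector-to-sequence: unroll an RNN from a single image feature vector."""
    def __init__(self, feat_dim=512, hidden_dim=256, vocab_size=1000, embed_dim=128):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.GRUCell(embed_dim, hidden_dim)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feat, max_len=20, bos_idx=0):
        h = torch.tanh(self.init_h(feat))               # condition the state on the image
        token = torch.full((feat.size(0),), bos_idx, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            h = self.cell(self.embed(token), h)         # feed back the previous output
            logits = self.to_vocab(h)
            token = logits.argmax(dim=-1)               # greedy choice of the next word index
            outputs.append(token)
        return torch.stack(outputs, dim=1)              # (batch, max_len) word indices

decoder = CaptionDecoder()
caption_ids = decoder(torch.randn(4, 512))              # 4 images -> 4 index sequences
```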
![](https://i.imgur.com/YKWqSqn.png)

The second usage is sequence to a final vector. This network keeps being fed a sequence of symbols, and only at the end does it give a final output. An application of this is using the network to interpret Python. For example, the input is this Python program:

```python
j = 8584
for x in range(8):
    j += 920
b = (1500 + j)
print((b + 7567))
```

The network is then able to output the correct result of this program. Another, more complicated program:

```python
i = 8827
c = (i - 5347)
print((c + 8704) if 2641 < 8500 else 5308)
```

Here the output should be 12184. These two examples show that you can train a neural network to do this kind of operation: we just feed a sequence of symbols and force the final output to be a specific value.

![](https://i.imgur.com/bIqsWek.png)

Next is sequence to vector to sequence. This architecture used to be the standard way of performing language translation. You start with a sequence of symbols, shown here in pink. Everything gets condensed into the final $h$, which represents a concept. For instance, we can have a sentence as input and squeeze it temporally into a vector representing the meaning and message to send across. After getting this meaning in whatever representation, the network unrolls it back into a different language. For example, "Today I'm very happy", as a sequence of words in English, can be translated into Italian or Chinese. In general, the network gets some kind of encoding as input and turns it into a compressed representation; finally, it decodes from that same compressed version. In recent times we have seen networks like Transformers, which we will cover in the next lesson; this type of architecture used to be the state of the art two years ago.

If you do a PCA over the latent space, you will see the words grouped by semantics, as shown in this graph.

![](https://i.imgur.com/CYytabM.jpg)

If we zoom in, we see that in the same location are all the months, like January and November.

![](https://i.imgur.com/B6jRSib.png)

If you focus on a different region, you get phrases like "a few days ago", "the next few months", etc.

![](https://i.imgur.com/LVqS9xi.png)

From these examples, we see that different locations carry some specific common meaning. The graph below shows how training this kind of network picks up on semantic features. For example, there is a vector connecting man to woman and another between king and queen, which means woman minus man is equal to queen minus king: you get the same distance in this embedding space for male-female pairs. Another example is walking to walked and swimming to swam. You can always apply this kind of specific linear transformation to go from one word to another, or from country to capital.

![](https://i.imgur.com/ZmTZETy.png) ![](https://i.imgur.com/XaWz5HW.png)
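To make the sequence-to-vector-to-sequence pattern concrete, below is a minimal encoder-decoder sketch in PyTorch. The vocabulary sizes, the use of GRUs and teacher forcing are illustrative assumptions; this is a sketch of the pattern, not the translation system discussed above.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Sequence-to-vector-to-sequence: encode, compress into h, then decode."""
    def __init__(self, src_vocab=5000, tgt_vocab=5000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder squeezes the whole source sentence into its final hidden state h.
        _, h = self.encoder(self.src_embed(src_ids))
        # The decoder unrolls that compressed representation into the target language
        # (teacher forcing: the ground-truth previous word is fed at each step).
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h)
        return self.out(dec_out)                # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq()
logits = model(torch.randint(0, 5000, (8, 12)), torch.randint(0, 5000, (8, 10)))
```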
The last case is sequence to sequence: as you start feeding in input, the network starts producing output. An example of this type of architecture is T9: if you remember using a Nokia phone, you got text suggestions as you were typing. Another example is speech to captions. One cool example is this RNN-writer: when you start typing "the rings of saturn glittered while", it suggests "two men looked at each other". This network was trained on some sci-fi novels, so you can just type something and let it make suggestions to help you write a book. One more example is shown below: you input the top prompt, and the network tries to complete the rest.

![](https://i.imgur.com/OinKjjV.jpg)

## Back Propagation through Time (14:00 - 23:50) - Biao Huang

#### 1. Model Architecture

In order to train an RNN, back-propagation through time (BPTT) must be used. The model architecture of an RNN is given in the figure below: the left design uses a loop representation, while the right one unfolds the loop into a row over time.

![](https://i.imgur.com/BYmRGQv.jpg)

Hidden representations are stated as

$$
\begin{aligned}
\begin{cases}
h[t] &= g\Big(W_{h}\begin{bmatrix} x[t] \\ h[t-1] \end{bmatrix} + b_h\Big) \\
h[0] &\dot= \boldsymbol{0},\quad W_h \dot= \left[W_{hx}\ W_{hh}\right] \\
\hat{y}[t] &= g(W_y h[t] + b_y)
\end{cases}
\end{aligned}
$$

The first equation applies a non-linear function to a rotation of a stacked version of the input, to which the previous configuration of the hidden layer is appended. At the beginning, $h[0]$ is set to $\boldsymbol{0}$. To simplify the equation, $W_h$ can be written as two separate matrices, $\left[W_{hx}\ W_{hh}\right]$, so the transformation can also be stated as

$$W_{hx}\cdot x[t] + W_{hh}\cdot h[t-1]$$

which corresponds to the stacked representation of the input. $\hat{y}[t]$ is calculated at the final rotation, and then we can use the chain rule to backpropagate the error to previous time steps.

#### 2. Batch-Ification in Language Modeling

When dealing with a sequence of symbols, we can batchify the text into different sizes. For example, when dealing with the sequence shown in the following figure, batch-ification can be applied first, where the time domain is preserved vertically. In this case, the batch size is set to 4.

![](https://i.imgur.com/eWCvimr.png)

If the BPTT period $T$ is set to 3, the first input $x[1:T]$ and output $y[1:T]$ for the RNN are

$$
\begin{aligned}
x[1:T] &= \begin{bmatrix}
a & g & m & s \\
b & h & n & t \\
c & i & o & u
\end{bmatrix} \\
y[1:T] &= \begin{bmatrix}
b & h & n & t \\
c & i & o & u \\
d & j & p & v
\end{bmatrix}
\end{aligned}
$$

When running the RNN on the first batch, we first feed $x[1] = [a\ g\ m\ s]$ into the RNN and force the output to be $y[1] = [b\ h\ n\ t]$. The hidden representation $h[1]$ is sent forward into the next time step to help the RNN predict $y[2]$ from $x[2]$. After sending $h[T-1]$ to the final pair $x[T]$, $y[T]$, we cut the gradient propagation for both $h[T]$ and $h[0]$ so that gradients do not propagate infinitely (`.detach()` in PyTorch). The whole process is shown in the figure below; see the code sketch that follows.

![](https://i.imgur.com/BrC10nt.png)
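A minimal sketch of this batch-ification plus truncated BPTT, assuming a toy character-level language model; the sizes mirror the example above (batch size 4, $T = 3$), but the model and data are illustrative stand-ins, not the course notebook.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, batch_size, T = 27, 16, 32, 4, 3

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim)                   # expects (T, batch, embed_dim)
head = nn.Linear(hidden_dim, vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(list(embed.parameters()) + list(rnn.parameters())
                            + list(head.parameters()), lr=0.1)

data = torch.randint(0, vocab_size, (100, batch_size))   # columns preserve the time domain
h = torch.zeros(1, batch_size, hidden_dim)                # h[0] = 0

for t in range(0, data.size(0) - T, T):
    x, y = data[t:t+T], data[t+1:t+T+1]                   # x[1:T] and its shifted target y[1:T]
    h = h.detach()                                        # cut the graph: gradients stop at h[0]
    out, h = rnn(embed(x), h)
    loss = criterion(head(out).reshape(-1, vocab_size), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```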
## Vanishing and Exploding Gradient (23:50 - 31:00) - Biao Huang

#### 1. Problem

![](https://i.imgur.com/BlOXIIz.jpg)

The figure above is a typical RNN architecture. In order to perform the rotation over previous steps in an RNN, we use matrices, which can be regarded as the horizontal arrows in the model above. Since these matrices can change the size of the outputs, if the largest eigenvalue of the matrix we select is larger than 1, the gradients will inflate through time and cause gradient explosion. Conversely, if the eigenvalue is small and close to 0, the propagation process shrinks the gradients and eventually leads to gradient vanishing. In a typical RNN, gradients are propagated through all the possible arrows, which gives them a large chance to vanish or explode. For example, the gradient at time 1 is large, indicated by the bright color; after going through one rotation it shrinks a lot, and by time 3 it is killed.

#### 2. Solution

One idea to prevent gradients from exploding or vanishing is to use skip connections. To achieve this, multiplicative networks can be used.

![](https://i.imgur.com/E4Bsv3A.jpg)

In the case above, we split the original network into 4 networks. Take the first network, for instance: it takes in a value from the input at time 1 and sends its output to the first intermediate state in the hidden layer. The state has 3 other networks, where the $\circ$s allow the gradients to pass while the $-$s block propagation. Such a technique is called a gated recurrent network. LSTM is one prevalent gated RNN and is introduced in detail in the following sections.

## Long Short-Term Memory (31:00 - 39:00) - Lin Jiang

#### 1. Model Architecture

![](https://i.imgur.com/Ttnvp1g.png)

Above are the equations for a recurrent neural network. Below are the equations for an LSTM. The input gate is highlighted by the yellow boxes and is an affine transformation; this input transformation multiplies $\tilde{c}[t]$, our candidate gate.

![](https://i.imgur.com/ALBd3cb.png)

The don't-forget gate multiplies the previous value of the cell memory $c[t-1]$. The total cell value $c[t]$ is the don't-forget term plus the input term. The final hidden representation is an element-wise multiplication between the output gate $o[t]$ and the hyperbolic tangent of the cell $c[t]$, so that things are bounded. Finally, the candidate gate $\tilde{c}[t]$ is simply a recurrent net. So we have $o[t]$ to modulate the output, $f[t]$ to modulate the don't-forget gate, and $i[t]$ to modulate the input gate. All these interactions between memory and gates are multiplicative. $i[t]$, $f[t]$ and $o[t]$ are all sigmoids, going from zero to one: multiplying by zero closes the gate, multiplying by one opens it.

How do we turn off the output? Say we have a purple internal representation $th$ and put a zero in the output gate: the output is zero multiplied by something, which gives zero. If we put a one in the output gate, we get the same value as the purple representation.

![](https://i.imgur.com/f45C6b9.png) ![](https://i.imgur.com/Z5ALXFm.png) ![](https://i.imgur.com/293aMCN.png)

Similarly, we can control the memory. For example, we can reset it by setting both $f[t]$ and $i[t]$ to zero: after the multiplication and summation, we have a zero inside the memory. Otherwise, we can keep the memory by still zeroing out the internal representation $th$ but keeping a one in $f[t]$: the sum gets $c[t-1]$ and keeps sending it out. Finally, we can write by putting a one in the input gate, so the multiplication gets purple, and a zero in the don't-forget gate so the cell actually forgets. A code sketch of these gate equations follows the figures below.

![](https://i.imgur.com/0uRxbP5.png) ![](https://i.imgur.com/3Trja5G.png) ![](https://i.imgur.com/8Yh6sIT.png)
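Below is a from-scratch sketch of the LSTM cell equations just described (input, don't-forget, output and candidate gates), written for clarity under the assumption of a single cell step; in practice one would use `torch.nn.LSTM` or `torch.nn.LSTMCell` rather than this illustrative class.

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """Illustrative re-implementation of the LSTM gate equations above."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # One affine transformation per gate, each acting on [x[t], h[t-1]].
        self.W_i = nn.Linear(input_dim + hidden_dim, hidden_dim)  # input gate
        self.W_f = nn.Linear(input_dim + hidden_dim, hidden_dim)  # don't-forget gate
        self.W_o = nn.Linear(input_dim + hidden_dim, hidden_dim)  # output gate
        self.W_c = nn.Linear(input_dim + hidden_dim, hidden_dim)  # candidate gate

    def forward(self, x, h_prev, c_prev):
        z = torch.cat([x, h_prev], dim=-1)
        i = torch.sigmoid(self.W_i(z))          # i[t]: how much new input to let in
        f = torch.sigmoid(self.W_f(z))          # f[t]: how much of c[t-1] to keep
        o = torch.sigmoid(self.W_o(z))          # o[t]: how much of the cell to expose
        c_tilde = torch.tanh(self.W_c(z))       # candidate: a plain recurrent update
        c = f * c_prev + i * c_tilde            # don't-forget term plus input term
        h = o * torch.tanh(c)                   # bounded hidden representation
        return h, c

cell = LSTMCellSketch(8, 16)
h = c = torch.zeros(4, 16)
h, c = cell(torch.randn(4, 8), h, c)
```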
#### 2. Step by Step Walk Through

## Notebook Examples (39:00 - End) - Lin Jiang

**Sequence Classification**

The goal is to classify sequences. Elements and targets are represented locally (input vectors with only one non-zero bit). The sequence starts with a B, ends with an E (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set `{a, b, c, d}`, except for two elements at positions t1 and t2 that are either X or Y. For the `DifficultyLevel.HARD` case, the sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, and t2 is randomly chosen between 50 and 60. There are 4 sequence classes Q, R, S and U, which depend on the temporal order of X and Y. The rules are: `X, X -> Q`; `X, Y -> R`; `Y, X -> S`; `Y, Y -> U`.

1) Dataset Exploration

The return type from the data generator is a tuple of length 2. The first item in the tuple is the batch of sequences, with shape `(32, 9, 8)`. This is the data that will be fed into the network. There are eight different symbols in each row (X, Y, a, b, c, d, B, E), and each row is a one-hot vector. A sequence of rows represents a sequence of symbols. The first all-zero row is padding; we use padding when the length of a sequence is shorter than the maximum length in the batch.

![](https://i.imgur.com/YcJrvHi.png)

The second item in the tuple is the corresponding batch of class labels, with shape `(32, 4)`, since we have 4 classes (Q, R, S, and U). The first sequence is BbXcXcbE, and its decoded class label is $[1, 0, 0, 0]$, corresponding to Q.

2) Defining the Model and Training

Let's create a simple recurrent network and an LSTM, and train each for 10 epochs. In the training loop, we always look for five steps: 1) perform the forward pass of the model, 2) compute the loss, 3) zero the gradient cache, 4) backpropagate to compute the partial derivatives of the loss with respect to the parameters, 5) step in the opposite direction of the gradient. A minimal sketch of such a loop is given at the end of this example.

![](https://i.imgur.com/urr6vPv.png) ![](https://i.imgur.com/qvHeR7H.png)

At the easy level of difficulty, the RNN gets 50% accuracy while the LSTM gets 100% after 10 epochs. But the LSTM has four times more weights than the RNN and has two hidden layers, so it is not a fair comparison. After 100 epochs, the RNN also gets 100% accuracy, but it takes longer to train than the LSTM.

![](https://i.imgur.com/QeYKP7i.png) ![](https://i.imgur.com/Fr5ZNQf.png)

If we increase the difficulty of the training part (using longer sequences), we see that the RNN fails while the LSTM keeps working.

![](https://i.imgur.com/tfAZiF3.png) ![](https://i.imgur.com/P9kOE61.jpg)

The above visualization shows the value of the hidden state over time in the LSTM. We send the inputs through a hyperbolic tangent, so that an input below $-2.5$ is mapped to $-1$ and an input above $2.5$ is mapped to $1$. In this case, we can see that a specific hidden unit picked up on the X (fifth row in the picture) and then stayed red until we got the other X. So the fifth hidden unit of the cell is triggered by observing the X and goes quiet after seeing the other X, which allows the network to recognize the class of the sequence.
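Here is a minimal sketch of the five training-loop steps for this classification task, assuming an illustrative one-layer LSTM classifier and stand-in tensors in place of the data generator (the notebook compares an RNN and an LSTM over real generated batches); names and sizes are not the notebook's exact code.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, input_dim=8, hidden_dim=64, num_classes=4):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1])              # classify from the last hidden state

model = SequenceClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

x = torch.randn(32, 9, 8)                       # stand-in for a batch of one-hot sequences
y = torch.randint(0, 4, (32,))                  # class indices for Q, R, S, U

for epoch in range(10):
    logits = model(x)                           # 1) forward pass
    loss = criterion(logits, y)                 # 2) compute the loss
    optimizer.zero_grad()                       # 3) zero the gradient cache
    loss.backward()                             # 4) backpropagate
    optimizer.step()                            # 5) step opposite to the gradient
```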
**Signal Echoing**

Echoing a signal n steps back is an example of a synchronized many-to-many task. For instance, the 1st input sequence is "1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 ...", and the 1st target sequence is "0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 ...". In this case, the output is the input delayed by three steps, so we need a short-term working memory to keep the information. A language model, by contrast, says something that hasn't already been said.

Before, we sent the whole sequence to the network and forced the final target to be something. Here, we need to cut the long sequence into little chunks. While feeding a new chunk, we keep track of the hidden state and send it as input to the internal state when adding the next chunk. In an LSTM you can keep the memory for a long time, as long as you have enough capacity; in an RNN, after you reach a certain length, it starts to forget what happened in the past.
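Finally, a small sketch of the echo task under the assumptions above: the target is the input delayed by $n = 3$ steps, the long bit stream is cut into chunks, and the hidden state is carried from one chunk to the next and detached so gradients do not flow through the whole history. The model and sizes are illustrative, not the notebook's code.

```python
import torch
import torch.nn as nn

n, chunk_len, hidden_dim = 3, 20, 16
stream = torch.randint(0, 2, (1, 1000)).float()                   # one long random bit sequence
target = torch.cat([torch.zeros(1, n), stream[:, :-n]], dim=1)    # same signal, 3 steps later

rnn = nn.RNN(1, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, 1)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-2)
criterion = nn.BCEWithLogitsLoss()

h = torch.zeros(1, 1, hidden_dim)
for start in range(0, stream.size(1), chunk_len):
    x = stream[:, start:start+chunk_len].unsqueeze(-1)            # (1, chunk_len, 1)
    y = target[:, start:start+chunk_len].unsqueeze(-1)
    h = h.detach()                                                # keep the state, cut the graph
    out, h = rnn(x, h)
    loss = criterion(head(out), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```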
