# Recurrent Neural Network

###### tags: `Deep Learning for Computer Vision`

## What Are the Limitations of CNNs?

* Cannot easily model sequential data
* Both input and output might be sequential data (scalars or vectors)
* Only simple feed-forward processing
* Cannot memorize: no long-term feedback

## More Applications in Vision

### Image Captioning

![](https://i.imgur.com/Zn0KZCw.jpg)

### Visual Question Answering (VQA)

![](https://i.imgur.com/3ydXLfv.jpg)

## How to Model Sequential Data?

* Deep learning for sequential data
  * 3-dimensional convolutional neural networks

    ![](https://i.imgur.com/UWosZLL.png)
  * Recurrent neural networks (RNNs)

    ![](https://i.imgur.com/wYnt077.png)

## Recurrent Neural Networks

* Parameter sharing + unrolling
  * Keeps the number of parameters fixed (the same $A$ at every step)
  * Allows sequential data of varying lengths $x_1, ..., x_t$
* Memory ability
  * Captures and preserves extracted information in a hidden state vector

![](https://i.imgur.com/DQGA8TU.png)

### Recurrence Formula

The same function and parameters are used at every time step:

![](https://i.imgur.com/WTZ1YMA.jpg)

### Multiple Recurrent Layers

![](https://i.imgur.com/GExd9M0.png)

![](https://i.imgur.com/zL1miDG.png)

#### Example: Image Captioning

![](https://i.imgur.com/2WDhEFv.jpg)

![](https://i.imgur.com/f0wRVIh.jpg)

### Training RNNs

#### Backpropagation Through Time

![](https://i.imgur.com/aZ8StNr.jpg)

![](https://i.imgur.com/vi5yWzj.png)

#### Gradient Vanishing & Exploding

* Computing the gradient involves many repeated factors of $W$
* Exploding gradients: largest singular value > 1
* Vanishing gradients: largest singular value < 1

#### Solutions

* Gradient clipping: rescale gradients if they grow too large

  ![](https://i.imgur.com/nlpG3sV.png)
* How about vanishing gradients?
  * Change the RNN architecture!
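The recurrence and the gradient-clipping fix above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training-ready implementation; all sizes and weight scales are toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative): 3-dim inputs, 4-dim hidden state, 5 time steps.
input_dim, hidden_dim, seq_len = 3, 4, 5

# Shared parameters: the SAME weights are reused at every time step,
# so the parameter count does not grow with sequence length.
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

def rnn_forward(xs):
    """Unroll the recurrence h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b)."""
    h = np.zeros(hidden_dim)
    hs = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b)
        hs.append(h)
    return hs

def clip_gradient(g, max_norm=5.0):
    """Gradient clipping: rescale g if its norm exceeds max_norm."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

xs = rng.normal(size=(seq_len, input_dim))
hs = rnn_forward(xs)                               # one hidden state per step
g = clip_gradient(rng.normal(size=hidden_dim) * 100.0)
```

Note that clipping only caps exploding gradients; it does nothing for vanishing ones, which is why the notes point to architectural changes instead.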
## Sequence-to-Sequence Modeling

* Setting
  * An input sequence $X_1, ..., X_N$
  * An output sequence $Y_1, ..., Y_M$
  * Generally $N \neq M$, i.e., **no synchrony** between $X$ and $Y$
* Examples
  * Speech recognition: speech goes in, a word sequence comes out
  * Machine translation: a word sequence goes in, another comes out
  * Video captioning: video frames go in, a word sequence comes out

### What's the Potential Problem?

* Each hidden state vector extracts/carries information across time steps (some of it may be diluted downstream).
* However, the information of the entire input sequence is embedded into a **single hidden state vector**.
* The performance of an RNN therefore degrades on long input sequences.

![](https://i.imgur.com/aHUoGis.png)

* Connecting every hidden state between encoder and decoder?

  ![](https://i.imgur.com/xAv5OSK.png)
  * Infeasible!
    * Both inputs and outputs have varying lengths.
    * Overparameterized

### Solution Ver. 1: Attention Model

* What should the attention model be?
  * A neural network whose inputs are $z$ and $h$ and whose output is a scalar indicating the **similarity** between $z$ and $h$.
  * Most attention models are jointly learned/trained with the other parts of the network (e.g., recognition, etc.).

![](https://i.imgur.com/yykEstm.jpg)

![](https://i.imgur.com/YPOP0WI.png)

![](https://i.imgur.com/nYvxiZ0.jpg)

![](https://i.imgur.com/vHIFneS.png)

![](https://i.imgur.com/WY7f7UE.png)

### Selected Attention Models for Image-Based Applications

#### Image Captioning with Attention

The RNN focuses visual attention on different spatial locations when generating the corresponding words during captioning.

![](https://i.imgur.com/zm4CK53.png)

* Example Results

  ![](https://i.imgur.com/Rnp9LxB.jpg)

  ![](https://i.imgur.com/Si8gDTS.png)

Each attention weight $a_i$ corresponds to a feature $f_i$; a large $a_i$ means that region is important.
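The attention step above can be sketched in NumPy. Here the "similarity network" between the decoder state $z$ and each encoder state $h_i$ is simplified to a plain dot product (the notes allow any learned scoring network); all sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_steps = 4, 6                       # toy dimensions (illustrative)
hs = rng.normal(size=(n_steps, d))      # encoder hidden states h_1..h_N
z = rng.normal(size=d)                  # current decoder state

def softmax(s):
    e = np.exp(s - s.max())             # subtract max for numerical stability
    return e / e.sum()

# Score every h_i against z (dot product stands in for the attention
# network), normalize the scores into attention weights a_i, then take
# the weighted sum of hidden states as the context vector.
scores = hs @ z                         # one scalar similarity per h_i
a = softmax(scores)                     # attention weights, nonnegative, sum to 1
context = a @ hs                        # context fed to the decoder step
```

Because the weights are recomputed at every decoder step, the model can look at a different part of the input each time, instead of squeezing the whole sequence into one fixed vector.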
![](https://i.imgur.com/wLvtIkC.jpg)

#### Visual Question Answering

![](https://i.imgur.com/E9l2UDE.jpg)

* Examples of multiple-choice QA & pointing QA

  ![](https://i.imgur.com/ZylguF0.jpg)

  ![](https://i.imgur.com/0fUmTYd.jpg)

#### Image Classification

![](https://i.imgur.com/uBV2cia.jpg)

* Example Results

  ![](https://i.imgur.com/QShVCcP.png)

### Solution Ver. 2: Transformer

![](https://i.imgur.com/jHV6hdc.png)

## Recent Advances of GAN-Based Models for Video-Based Applications

### Video Generation

* Learning a latent space to describe image/video data
  * Input: latent representation
  * Output: sequence of images/frames (i.e., a video)

#### Latent Space for Content-Motion Decomposition

![](https://i.imgur.com/3KHnUAs.png)

### MoCoGAN: Decomposing Motion and Content for Video Generation

![](https://i.imgur.com/ziR7oKE.jpg)

## Recent Advances of Attention Models in Video-Based Applications

### Video Prediction

* Input: a few known frames
* Output: unknown future frames

### Unsupervised Learning of Video Representations using LSTMs

![](https://i.imgur.com/0AsNSu6.jpg)

### Learning to Generate Long-Term Future via Hierarchical Prediction
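The content-motion decomposition behind MoCoGAN above can be sketched as one content code held fixed across the clip plus a per-frame motion code produced by a simple recurrence (a stand-in for the motion RNN in the actual model). Every size, weight, and the tanh recurrence here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

d_content, d_motion, n_frames = 8, 4, 5   # toy sizes (illustrative)

# One content code, sampled once and shared by every frame of the clip.
z_content = rng.normal(size=d_content)

# A simple recurrence stands in for the motion network: each frame's
# motion code depends on the previous one plus fresh noise.
W = rng.normal(scale=0.5, size=(d_motion, d_motion))

def sample_latents():
    """Return one latent per frame: concat([z_content, z_motion_t])."""
    z_m = np.zeros(d_motion)
    latents = []
    for _ in range(n_frames):
        z_m = np.tanh(W @ z_m + rng.normal(size=d_motion))
        latents.append(np.concatenate([z_content, z_m]))
    return np.stack(latents)

latents = sample_latents()   # shape: (n_frames, d_content + d_motion)
```

A frame generator would map each row of `latents` to an image; because the content half never changes, the generated frames share appearance while the motion half drives the change over time.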