# Recurrent Neural Network
###### tags: `Deep Learning for Computer Vision`
## What Are the Limitations of CNNs?
* Cannot easily model sequential data
    * Both input and output might be sequential data (scalars or vectors)
* Only feed-forward processing
    * Cannot memorize; no long-term feedback
## More Applications in Vision
### Image Captioning

### Visual Question Answering (VQA)

## How to Model Sequential Data?
* Deep learning for sequential data
    * 3-dimensional convolutional neural networks (3D CNNs)
    * Recurrent neural networks (RNNs)
## Recurrent Neural Networks
* Parameter sharing + unrolling
    * Keeps the number of parameters fixed (the same weights $A$ are reused at every time step)
    * Allows sequential data with varying lengths $x_1, \dots, x_t$
* Memory ability
    * Captures and preserves information (a hidden state vector) extracted from earlier time steps

### Recurrence Formula
The same function and the same set of parameters are used at every time step:

$$h_t = f_W(h_{t-1}, x_t)$$

For a vanilla RNN, $h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t)$ with output $y_t = W_{hy}h_t$.
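A minimal NumPy sketch of this recurrence (dimensions, weight scales, and variable names are illustrative, not from the lecture):

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the slides)
D_x, D_h, D_y = 8, 16, 4      # input, hidden, and output sizes
T = 5                         # sequence length

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(D_h, D_x))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(D_h, D_h))   # hidden-to-hidden weights (shared across time)
W_hy = rng.normal(scale=0.1, size=(D_y, D_h))   # hidden-to-output weights

def rnn_step(h_prev, x_t):
    """One application of h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

# Unrolling: the same weights are reused at every time step, so the
# parameter count stays fixed no matter how long the sequence is.
h = np.zeros(D_h)
for x_t in (rng.normal(size=D_x) for _ in range(T)):
    h, y = rnn_step(h, x_t)
```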
### Multiple Recurrent Layers


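A hedged sketch of stacking recurrent layers with PyTorch's `nn.RNN`, whose `num_layers` argument feeds each layer's hidden-state sequence to the next layer as input (shapes are illustrative):

```python
import torch
import torch.nn as nn

# Two stacked recurrent layers: the hidden-state sequence of layer 1
# becomes the input sequence of layer 2.
rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

x = torch.randn(4, 5, 8)   # (batch, time, features)
out, h_n = rnn(x)          # out: (4, 5, 16) hidden states of the top layer
                           # h_n: (2, 4, 16) final hidden state of each layer
```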
#### Example: Image Captioning


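A toy sketch of the usual setup, assuming a pretrained CNN feature initializes the RNN hidden state and the RNN then emits one word per time step (module names and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CaptionRNN(nn.Module):
    """Toy captioning decoder: a CNN image feature initializes the hidden
    state, then the RNN scores one word of the caption per time step."""
    def __init__(self, feat_dim=512, embed_dim=64, hid_dim=64, vocab=1000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hid_dim)    # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab, embed_dim)
        self.rnn = nn.GRU(embed_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, img_feat, prev_words):
        # img_feat: (batch, feat_dim) from a pretrained CNN; prev_words: (batch, T) word indices
        h0 = torch.tanh(self.init_h(img_feat)).unsqueeze(0)
        out, _ = self.rnn(self.embed(prev_words), h0)
        return self.out(out)                          # (batch, T, vocab) word scores per step

model = CaptionRNN()
scores = model(torch.randn(2, 512), torch.randint(0, 1000, (2, 6)))
```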
### Training RNNs
#### Back Propagation Through Time


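A minimal illustration of the idea in PyTorch, assuming a many-to-one setup: the loss at the end of the unrolled sequence is backpropagated through every time step, so the shared weights accumulate gradient contributions from all steps (model and data are placeholders):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 4)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(4, 20, 8)              # (batch, time, features), illustrative data
target = torch.randint(0, 4, (4,))     # illustrative labels

out, _ = rnn(x)                        # unroll over all 20 time steps
loss = nn.functional.cross_entropy(head(out[:, -1]), target)

opt.zero_grad()
loss.backward()                        # gradients flow back through every time step
opt.step()
```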
#### Gradient Vanishing & Exploding
* Computing the gradient involves many repeated factors of the recurrent weight matrix $W$ (see the check below)
* Exploding gradients: largest singular value > 1
* Vanishing gradients: largest singular value < 1
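A small NumPy check of the claim above, using illustrative diagonal matrices: repeated multiplication by $W$ blows up when the largest singular value exceeds 1 and decays toward zero when it is below 1:

```python
import numpy as np

def repeated_norm(W, steps=50):
    """Norm of a vector after multiplying by W `steps` times,
    mimicking the repeated W factors in the RNN gradient."""
    v = np.ones(W.shape[0])
    for _ in range(steps):
        v = W @ v
    return np.linalg.norm(v)

W_explode = np.diag([1.1, 0.9])   # largest singular value 1.1 > 1
W_vanish  = np.diag([0.9, 0.5])   # largest singular value 0.9 < 1

print(repeated_norm(W_explode))   # grows exponentially (exploding gradients)
print(repeated_norm(W_vanish))    # decays toward zero (vanishing gradients)
```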
#### Solutions
* Gradient clipping: rescale gradients if their norm is too large (see the sketch below)

* How about vanishing gradients?
    * Change the RNN architecture!
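A minimal sketch of gradient clipping with PyTorch's built-in `clip_grad_norm_` (the model, data, loss, and threshold are placeholders):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)   # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 20, 8)
out, _ = model(x)
loss = out.pow(2).mean()          # dummy loss, only to produce gradients

opt.zero_grad()
loss.backward()
# Rescale all gradients so their global norm is at most 5.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
opt.step()
```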
## Sequence-to-Sequence Modeling
* Setting
* An input sequence $X_1, ..., X_N$
* An output sequence $Y_1, ..., Y_M$
* Generally $N \neq M$, i.e., **no synchrony** between $X$ and $Y$ (a minimal encoder/decoder sketch follows the examples below)
* Examples
* Speech recognition: speech goes in, and a word sequence comes out
* Machine translation: a word sequence goes in, and another comes out
* Video captioning: video frames go in, and a word sequence comes out
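A hedged skeleton of one common way to implement this with two LSTMs in PyTorch: the encoder's final state initializes the decoder, and the input length $N$ and output length $M$ may differ (names and dimensions are assumptions, not from the lecture):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder RNN: the input sequence (length N) and the
    output sequence (length M) may have different lengths."""
    def __init__(self, in_dim=8, hid_dim=32, out_dim=10):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(out_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, out_dim)

    def forward(self, x, y_in):
        # x: (batch, N, in_dim); y_in: (batch, M, out_dim) previous outputs (teacher forcing)
        _, state = self.encoder(x)          # compress the whole input into one final state
        dec_out, _ = self.decoder(y_in, state)
        return self.proj(dec_out)           # (batch, M, out_dim)

model = Seq2Seq()
pred = model(torch.randn(2, 7, 8),          # N = 7 input steps
             torch.randn(2, 5, 10))         # M = 5 output steps (N != M)
```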
### What’s the Potential Problem?
* Each hidden state vector extracts/carries information across time steps (some might be diluted downstream).
* However, information of the entire input sequence is embedded into a **single hidden state vector**.
* The performance of the RNN degrades when the input is a long sequence

* Connecting every hidden state between encoder and decoder?

* Infeasible!
    * Both inputs and outputs have varying lengths
    * The model would be overparameterized
### Solution Ver. 1: Attention Model
* What should the attention model be?
    * A neural network whose inputs are $z$ and $h$ and whose output is a scalar indicating the **similarity** between $z$ and $h$ (see the sketch below)
    * Most attention models are learned/trained jointly with the other parts of the network (e.g., recognition)
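A minimal sketch of such an attention module in PyTorch, assuming a decoder state $z$ and encoder hidden states $h_1, \dots, h_N$: a small MLP scores each $(z, h_i)$ pair, a softmax turns the scores into weights, and the weighted sum of hidden states gives a context vector (names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Scores the similarity between a decoder state z and each encoder hidden state h_i."""
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1)
        )

    def forward(self, z, h):
        # z: (batch, dim); h: (batch, N, dim)
        z_rep = z.unsqueeze(1).expand(-1, h.size(1), -1)     # repeat z for every h_i
        scores = self.score(torch.cat([z_rep, h], dim=-1))   # (batch, N, 1) similarity scores
        alpha = torch.softmax(scores, dim=1)                 # attention weights, sum to 1 over N
        context = (alpha * h).sum(dim=1)                     # weighted sum of hidden states
        return context, alpha.squeeze(-1)

attn = Attention()
z = torch.randn(2, 32)       # current decoder state
h = torch.randn(2, 7, 32)    # encoder hidden states for a length-7 input
context, alpha = attn(z, h)  # attention is trained jointly with the rest of the network
```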





### Selected Attention Models for Image-Based Applications
#### Image Captioning with Attention
The RNN focuses visual attention on different spatial locations when generating the corresponding words during captioning.

* Example Results


Each attention weight $a_i$ corresponds to a spatial feature $f_i$; a large $a_i$ means that image region is important for generating the current word (see the sketch below).
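A short sketch of how these spatial weights could be applied, assuming CNN features reshaped into $L$ regions $f_1, \dots, f_L$ (shapes are illustrative; in practice the weights $a_i$ come from the attention model conditioned on the RNN state rather than random noise):

```python
import torch

feat_map = torch.randn(1, 512, 7, 7)              # CNN feature map (batch, D, H, W)
f = feat_map.flatten(2).transpose(1, 2)           # (1, L=49, 512): one f_i per spatial region
a = torch.softmax(torch.randn(1, 49), dim=1)      # attention weights a_i (placeholder values)
context = (a.unsqueeze(-1) * f).sum(dim=1)        # (1, 512): weighted sum used for the next word
```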

#### Visual Question Answering

* Examples of multiple-choice QA & pointing QA


#### Image Classification

* Example Results

### Solution Ver. 2: Transformer

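As a bare-bones sketch, the core operation of the Transformer is scaled dot-product self-attention; the version below omits the learned query/key/value projections and multi-head structure of the full model:

```python
import torch

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x: (batch, N, d).
    Every position attends to every other position in parallel (no recurrence)."""
    d = x.size(-1)
    scores = x @ x.transpose(1, 2) / d ** 0.5   # (batch, N, N) pairwise similarities
    weights = torch.softmax(scores, dim=-1)
    return weights @ x                          # (batch, N, d) attended features

out = self_attention(torch.randn(2, 7, 32))
```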
## Recent Advances of GAN-based Models for Video-Based Applications
### Video Generation
* Learning a latent space to describe image/video data
* Input: latent representation
* Output: sequence of images/frames (i.e., video)
#### Latent space for content-motion decomposition

### MoCoGAN: Decomposing Motion and Content for Video Generation

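A simplified sketch of the content-motion decomposition idea (not the actual MoCoGAN architecture): one content code is shared by every frame of a clip, while a recurrent network turns per-frame noise into a motion code trajectory:

```python
import torch
import torch.nn as nn

class ContentMotionGenerator(nn.Module):
    """Simplified content-motion decomposition: a single content code shared by
    all frames, plus a per-frame motion code produced by a recurrent network."""
    def __init__(self, content_dim=64, motion_dim=16, frame_dim=32 * 32):
        super().__init__()
        self.motion_rnn = nn.GRU(motion_dim, motion_dim, batch_first=True)
        self.frame_gen = nn.Sequential(
            nn.Linear(content_dim + motion_dim, 256), nn.ReLU(),
            nn.Linear(256, frame_dim), nn.Tanh(),
        )

    def forward(self, z_content, z_motion_noise):
        # z_content: (batch, content_dim), fixed across the whole clip
        # z_motion_noise: (batch, T, motion_dim), per-frame noise
        z_motion, _ = self.motion_rnn(z_motion_noise)       # trajectory in motion space
        z_c = z_content.unsqueeze(1).expand(-1, z_motion.size(1), -1)
        frames = self.frame_gen(torch.cat([z_c, z_motion], dim=-1))
        return frames                                       # (batch, T, frame_dim) flattened frames

gen = ContentMotionGenerator()
video = gen(torch.randn(2, 64), torch.randn(2, 8, 16))      # 8 generated frames per clip
```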
## Recent Advances of Attention Models in Video-Based Applications
### Video Prediction
* Input: A few known frames
* Output: Unknown future frames (toy sketch below)
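A toy sketch of this setting (not any specific paper's architecture): an LSTM encodes the known frames and is then unrolled, feeding each predicted frame back in, to emit the unknown future frames:

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Toy frame predictor: encode the known frames with an LSTM,
    then unroll the LSTM to emit the unknown future frames."""
    def __init__(self, frame_dim=64, hid_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, hid_dim, batch_first=True)
        self.to_frame = nn.Linear(hid_dim, frame_dim)

    def forward(self, known_frames, n_future):
        # known_frames: (batch, T_known, frame_dim), frames flattened to vectors for simplicity
        out, state = self.lstm(known_frames)
        frame = self.to_frame(out[:, -1:])           # first predicted frame
        future = [frame]
        for _ in range(n_future - 1):
            out, state = self.lstm(frame, state)     # feed the prediction back in
            frame = self.to_frame(out)
            future.append(frame)
        return torch.cat(future, dim=1)              # (batch, n_future, frame_dim)

model = FramePredictor()
pred = model(torch.randn(2, 5, 64), n_future=3)      # 5 known frames -> 3 predicted frames
```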
### Unsupervised Learning of Video Representations using LSTMs

### Learning to generate long-term future via hierarchical prediction