# Video prediction approaches
### [Unsupervised Learning of Disentangled Representations from Video](https://arxiv.org/pdf/1705.10915.pdf) (Denton et al., 2017)

#### General approach:
Train two separate encoders to disentangle content from pose.
#### Key insights:
1. Introduce an adversarial loss on the pose vectors to prevent them from being discriminable between videos.
Let $x_i = \{x_i^1, ..., x_i^T \}$ denote a sequence of $T$ images from video $i$. Encode the pose representation as $h_p^t = E_p(x^t)$ and the content representation as $h_c^t = E_c(x^t)$.
Then $C$ is a discriminator which decides whether two pose vectors $(h_1, h_2)$ came from the same video or not.
2. The loss function has three components: a reconstruction loss, a similarity loss, and an adversarial loss (a code sketch combining them follows this list).
Reconstruction loss (decode frame $t+k$ from the content of frame $t$ and the pose of frame $t+k$):
$$
\mathcal{L}_{\text{reconstruction}}= ||D(E_c(x^t), E_p(x^{t+k})) - x^{t+k}||^2_2
$$
Similarity loss: make sure the content representation of nearby frames ($t$ and $t+k$) doesn't change dramatically.
$$
\mathcal{L}_{\text{similarity}} = || E_c(x^t) - E_c(x^{t+k}) ||_2^2
$$
Adversarial losses:
***First loss:***
Trains the discriminator $C$ to answer the question: "did these two poses come from the same video or not?"
The positive pair $P^+ = (E_p(x_i^t), E_p(x_i^{t+k}))$ is a pair of pose vectors from two frames of the same video $(i, i)$.
The negative pair $P^- = (E_p(x_i^t), E_p(x_j^{t+k}))$ is a pair of pose vectors from frames of different videos $(i, j)$.
Using binary cross entropy (minimized w.r.t. $C$):
$$
\mathcal{L}_{\text{pose adversarial}}(C) = -\log(C(P^+)) - \log(1 - C(P^-))
$$
***Second loss:***
Trains the pose encoder $E_p$ to fool the discriminator so that, for a pair of poses from the same clip, $C$ cannot tell whether they came from the same clip or not (its output is pushed toward $1/2$). This is what forces the pose vectors to carry no content information:
$$
\mathcal{L}_{\text{pose adversarial}}(E_p) = -\tfrac{1}{2}\log(C(P^+)) - \tfrac{1}{2}\log(1 - C(P^+))
$$
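A minimal PyTorch-style sketch of how these terms could be wired together for one batch; `E_c`, `E_p`, the decoder `D`, and the scene discriminator `C` are placeholder modules (the paper's architectures are not reproduced), `C` is assumed to take a pair of pose vectors and return a probability in $(0,1)$, and `beta` is an illustrative weight:
```python
import torch
import torch.nn.functional as F

def drnet_losses(E_c, E_p, D, C, x_i, x_j, t, k, beta=0.1):
    """Sketch of the DRNET-style loss terms for one batch.

    x_i, x_j: frames from two different videos, shape (B, T, C, H, W);
    t: frame index, k: frame offset, beta: illustrative adversarial weight.
    """
    h_c_t  = E_c(x_i[:, t])        # content of frame t    (video i)
    h_p_t  = E_p(x_i[:, t])        # pose of frame t       (video i)
    h_p_tk = E_p(x_i[:, t + k])    # pose of frame t+k     (video i)
    h_p_j  = E_p(x_j[:, t + k])    # pose of frame t+k     (video j)

    # Reconstruction: content of frame t + pose of frame t+k -> frame t+k.
    recon = F.mse_loss(D(h_c_t, h_p_tk), x_i[:, t + k])

    # Similarity: content should change slowly between nearby frames.
    similarity = F.mse_loss(E_c(x_i[:, t + k]), h_c_t)

    # First adversarial loss: train C to separate same-video (P+) from
    # different-video (P-) pose pairs; pose embeddings are detached so
    # only C receives gradients from this term.
    same = C(h_p_t.detach(), h_p_tk.detach())
    diff = C(h_p_t.detach(), h_p_j.detach())
    loss_C = -(torch.log(same) + torch.log(1 - diff)).mean()

    # Second adversarial loss: train E_p so that C's output on a
    # same-video pair is pushed toward 1/2 (maximum uncertainty).
    same_adv = C(h_p_t, h_p_tk)
    loss_Ep = -(0.5 * torch.log(same_adv) + 0.5 * torch.log(1 - same_adv)).mean()

    loss_encoders = recon + similarity + beta * loss_Ep
    return loss_encoders, loss_C
```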
#### Decoding

To decode, we take the first frame $x^t$ and extract its content and pose vectors:
$$
h_c^t = E_c(x^t), \qquad h_p^t = E_p(x^t)
$$
Then feed both to an RNN to get the next pose vector:
$$
\hat{h}_p^{t+1} = \text{RNN}(h_c^t, h_p^t)
$$
And re-feed the predicted pose into the RNN to continue generating, decoding each predicted frame as $D(h_c^t, \hat{h}_p^{t+k})$.
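A rough sketch of this rollout under the same assumptions; `rnn` is treated here as a module that takes the content and current pose vectors, keeps its own hidden state, and returns the predicted next pose (illustrative, not the paper's code):
```python
import torch

@torch.no_grad()
def generate(E_c, E_p, D, rnn, x_t, n_future):
    """Roll the pose RNN forward from a single conditioning frame x_t."""
    h_c = E_c(x_t)   # content vector, held fixed for the whole rollout
    h_p = E_p(x_t)   # pose vector of the conditioning frame
    frames = []
    for _ in range(n_future):
        h_p = rnn(h_c, h_p)          # predict the next pose vector
        frames.append(D(h_c, h_p))   # render a frame from (content, predicted pose)
    return torch.stack(frames, dim=1)  # (B, n_future, C, H, W)
```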
---
### [Stochastic Video Generation with a Learned Prior](https://arxiv.org/pdf/1802.07687v2.pdf) (Denton et al., 2018)

Uses a convolutional LSTM, conditioned on a latent variable sampled at each time step, to generate predictions. The information the latent variables can carry is limited by forcing their inferred distribution to stay close to a prior (via KL divergence).
#### Method overview
System has 3 components:
***Prediction model***
Generates next frame $\hat{x}_t$ given the previous frames $x_{1:t-1}$ and a latent variable $z_{t}$:
$$
\hat{x}_t = p_{\theta}(x_{1:t-1}, z_{t})
$$
***Latent variable model***
A prior distribution $p(z)$ which allows sampling $z$ variables for each time step:
$$
z_t \sim p(z)
$$
The prior can be fixed or learned.
***Inference model***
This model takes as input the target of the prediction model $x_t$, and previous frames $x_{1:t-1}$. It computes a distribution $q_{\phi}(z_t | x_{1:t})$ from which we sample $z_t$:
$$
z_t \sim q_{\phi}(z_t | x_{1:t})
$$
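Where $z_t$ is sampled from differs between training and generation. A minimal sketch, assuming both the inference model and the prior output the mean and log-variance of a diagonal Gaussian (the zero-filled tensors below stand in for those network outputs) and using the reparameterization trick:
```python
import torch

def sample_z(mu, logvar):
    """Reparameterized sample from a diagonal Gaussian N(mu, exp(logvar))."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# Illustrative shapes: batch of 4, 10-dimensional latent.
mu_q, logvar_q = torch.zeros(4, 10), torch.zeros(4, 10)  # from q_phi(z_t | x_{1:t})
mu_p, logvar_p = torch.zeros(4, 10), torch.zeros(4, 10)  # from the prior p(z) (fixed or learned)

z_train = sample_z(mu_q, logvar_q)  # training: z_t drawn from the posterior (sees x_t)
z_gen   = sample_z(mu_p, logvar_p)  # generation: z_t drawn from the prior (x_t unknown)
```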
Because the inference model sees $x_t$, $z_t$ could simply copy it. To prevent this, $q_{\phi}$ is forced to stay close to the prior via a KL divergence penalty:
$$
D_{KL}(q_{\phi}(z_t \mid x_{1:t}) \,\|\, p(z))
$$
This constrains the information $z_t$ can carry: if the posterior matched the prior exactly, $z_t$ would convey nothing about $x_t$, so the KL penalty acts as a bottleneck on how much about $x_t$ (beyond what the previous frames $x_{1:t-1}$ already explain) can pass through $z_t$.
#### Reconstruction error
The other term in the training objective is a reconstruction error between the predicted frame $\hat{x}_t$ and the ground-truth frame $x_t$. Together with the KL term this gives a variational lower bound that is optimized jointly over $\theta$ and $\phi$.
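A sketch of what the resulting per-timestep objective could look like, assuming both $q_\phi$ and the prior are diagonal Gaussians and using an illustrative weight `beta` on the KL term:
```python
import torch
import torch.nn.functional as F

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over latent dims."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1)
    return kl.sum(dim=-1)

def per_step_loss(x_hat_t, x_t, mu_q, logvar_q, mu_p, logvar_p, beta=1e-4):
    recon = F.mse_loss(x_hat_t, x_t)                                # reconstruction term
    kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p).mean()   # information bottleneck on z_t
    return recon + beta * kl
```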
---
### [Improved Conditional VRNNs for Video Prediction](https://arxiv.org/pdf/1904.12165v1.pdf) (Castrejon et al., 2019)

#### Key Insight
Hierarchical levels of $z$ latent variables (i.e., each residual block gets its own level of latent variables), rather than a single latent per time step.

---
### [VideoFlow: A Flow-Based Generative Model for Video](https://arxiv.org/pdf/1903.01434v2.pdf)
---
### [Stochastic Adversarial Video Prediction](https://arxiv.org/pdf/1804.01523v1.pdf)
---
### [Stochastic Variational Video Prediction](https://arxiv.org/pdf/1710.11252.pdf)
---
### [Point-to-Point Video Generation](https://arxiv.org/pdf/1904.02912v2.pdf)
---
### [Learning to Decompose and Disentangle Representations for Video Prediction](https://arxiv.org/pdf/1806.04166v2.pdf)
---
### [Simple Video Generation using Neural ODEs](https://voletiv.github.io/docs/publications/2019e_NeurIPSW_EncODEDec.pdf)