# Video prediction approaches

### [Unsupervised Learning of Disentangled Representations from Video](https://arxiv.org/pdf/1705.10915.pdf) (Denton et al., 2017)

![](https://i.imgur.com/a0eXIJI.png)

#### General approach:
Train two separate encoders to disentangle content from pose.

#### Key insights:
1. Introduce an adversarial loss on the pose vectors so that they cannot be used to tell which video they came from. Let $x_i = \{x_i^1, ..., x_i^T \}$ denote a sequence of $T$ images from video $i$. Encode the pose representation as $h_p^t = E_p(x^t)$ and the content representation as $h_c^t = E_c(x^t)$. Then $C$ is a discriminator which decides whether two pose vectors $(h_1, h_2)$ came from the same scene or not.
2. The loss function has 3 components: a reconstruction loss, a similarity loss, and an adversarial loss.

Reconstruction loss (the content of frame $t$ combined with the pose of a later frame $t+k$ must reconstruct that later frame):
$$
\mathcal{L}_{\text{reconstruction}} = ||D(E_c(x^t), E_p(x^{t+k})) - x^{t+k}||^2_2
$$

Similarity loss: make sure the content of nearby frames doesn't change dramatically.
$$
\mathcal{L}_{\text{similarity}} = || E_c(x^t) - E_c(x^{t+k}) ||_2^2
$$

Adversarial losses:

***First loss:*** trains the discriminator $C$ to answer "did these two poses come from the same video or not?". The positive pair $P^+ = (E_p(x_i^t), E_p(x_i^{t+k}))$ is two frames from the same video $(i, i)$. The negative pair $P^- = (E_p(x_i^t), E_p(x_j^{t+k}))$ is two frames from different videos $(i, j)$. Using a cross-entropy loss (optimized with respect to $C$):
$$
\mathcal{L}_{\text{pose adversarial}}(C) = -\log(C(P^+)) - \log(1 - C(P^-))
$$

***Second loss:*** trains the pose encoder $E_p$ to fool the discriminator, so that $C$ cannot tell that two frames from the same clip came from the same clip. This is a cross-entropy against a uniform (1/2) target on same-video pairs, and it is what forces the pose vectors to carry no content information:
$$
\mathcal{L}_{\text{pose adversarial}}(E_p) = -\tfrac{1}{2}\log(C(P^+)) - \tfrac{1}{2}\log(1 - C(P^+))
$$

#### Decoding

![](https://i.imgur.com/zjGEr2X.png)

To decode, we take the first frame $x^t$ and extract the pose and content vectors:
$$
h_c^t = E_c(x^t) \qquad h_p^t = E_p(x^t)
$$
Then feed both to an RNN to get the next pose vector:
$$
\hat{h}_p^{t+1} = \text{RNN}(h_c^t, h_p^t)
$$
The decoder $D$ renders the predicted frame from the fixed content vector and the predicted pose, $\hat{x}^{t+1} = D(h_c^t, \hat{h}_p^{t+1})$, and the predicted pose is fed back into the RNN to continue generating.

---

### [Stochastic Video Generation with a Learned Prior](https://arxiv.org/pdf/1802.07687v2.pdf) (Denton et al., 2018)

![](https://i.imgur.com/lqo9oBY.png)

Uses a convolutional LSTM conditioned on a latent variable to generate predictions. The latent variables are kept low-information by forcing their inference distribution to stay close to a prior (via a KL divergence).

#### Method overview

The system has 3 components:

***Prediction model***

Generates the next frame $\hat{x}_t$ given the previous frames $x_{1:t-1}$ and a latent variable $z_{t}$:
$$
\hat{x}_t = p_{\theta}(x_{1:t-1}, z_{t})
$$

***Latent variable model***

A prior distribution $p(z)$ from which a latent $z_t$ is sampled at each time step:
$$
z_t \sim p(z)
$$
The prior can be fixed or learned.

***Inference model***

This model takes as input the target of the prediction model, $x_t$, together with the previous frames $x_{1:t-1}$. It computes a distribution $q_{\phi}(z_t | x_{1:t})$ from which we sample $z_t$:
$$
z_t \sim q_{\phi}(z_t | x_{1:t})
$$
To prevent $z_t$ from simply copying $x_t$ (the inference model sees $x_t$), we force $q_{\phi}$ to be close to the prior via a KL divergence:
$$
D_{KL}(q_{\phi}(z_t | x_{1:t}) \,\|\, p(z))
$$
This constrains the information $z_t$ can carry: every bit of $x_t$ that the inference model encodes into $z_t$ pushes $q_{\phi}$ further from the prior and is paid for in the KL term, so the model only puts into $z_t$ what the prediction model cannot already infer from $x_{1:t-1}$ (the same information bottleneck as in a VAE). A minimal sketch of the resulting per-timestep objective is given below.

![](https://i.imgur.com/qf5aqaa.png)

(Remember that $z_t$ is inferred from all the previous $x$ frames.)
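A minimal PyTorch-style sketch of this per-timestep objective, i.e. reconstruction plus a weighted KL term (the reconstruction term is described in the next subsection). This is not the authors' code: `frame_predictor`, `posterior_net`, `prior_net`, and `beta` are placeholder names, and the two nets are assumed to return the mean and log-std of a diagonal Gaussian.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def svg_step(frame_predictor, posterior_net, prior_net, x_prev, x_target, beta=1e-4):
    # Inference model sees the target frame, so z_t could in principle just copy x_t ...
    mu_q, logsigma_q = posterior_net(x_prev, x_target)
    q = Normal(mu_q, logsigma_q.exp())

    # ... but the KL term charges a cost (in nats) for every deviation of q_phi from
    # the prior, so z_t only keeps information the predictor cannot get from x_{1:t-1}.
    mu_p, logsigma_p = prior_net(x_prev)
    p = Normal(mu_p, logsigma_p.exp())
    kl = kl_divergence(q, p).sum(dim=-1).mean()

    # Sample z_t with the reparameterisation trick and predict the next frame.
    z_t = q.rsample()
    x_hat = frame_predictor(x_prev, z_t)
    recon = F.mse_loss(x_hat, x_target)

    return recon + beta * kl
```

With a fixed prior, `prior_net` would simply return zeros for the mean and log-std; with a learned prior it is an LSTM over the previous frames.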
#### Reconstruction error

The final term in the loss is a reconstruction error between $\hat{x}_t$ and $x_t$.

---

### [Improved Conditional VRNNs for Video Prediction](https://arxiv.org/pdf/1904.12165v1.pdf) (Castrejon et al., 2019)

![](https://i.imgur.com/7WTVjjT.jpg)

#### Key Insight
Hierarchical levels of $z$ latent variables (i.e. each residual block gets its own set of latent variables). The latent variables are integrated as follows (a small sketch of this idea is at the end of these notes):

![](https://i.imgur.com/0HZfmlE.png)

---

### [VideoFlow: A Flow-Based Generative Model for Video](https://arxiv.org/pdf/1903.01434v2.pdf)

---

### [Stochastic Adversarial Video Prediction](https://arxiv.org/pdf/1804.01523v1.pdf)

---

### [Stochastic Variational Video Prediction](https://arxiv.org/pdf/1710.11252.pdf)

---

### [Point-to-Point Video Generation](https://arxiv.org/pdf/1904.02912v2.pdf)

---

### [Learning to Decompose and Disentangle Representations for Video Prediction](https://arxiv.org/pdf/1806.04166v2.pdf)

---

### [Simple Video Generation using Neural ODEs](https://voletiv.github.io/docs/publications/2019e_NeurIPSW_EncODEDec.pdf)
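---

#### Sketch: hierarchical latent variables (Castrejon et al., 2019)

As referenced in the Improved Conditional VRNNs section above, here is a minimal PyTorch-style sketch of the hierarchical-latent idea: each decoder level predicts Gaussian parameters for its own $z_l$ from its feature map, samples it, and injects it back into the features before upsampling. This is an illustrative sketch, not the paper's architecture; all names (`LatentBlock`, `HierarchicalDecoder`, `channels`, `z_dim`, `levels`) are placeholders.

```python
import torch
import torch.nn as nn

class LatentBlock(nn.Module):
    """One decoder level with its own latent variable (sketch, not the paper's architecture)."""
    def __init__(self, channels, z_dim):
        super().__init__()
        self.to_stats = nn.Conv2d(channels, 2 * z_dim, kernel_size=3, padding=1)
        self.from_z = nn.Conv2d(z_dim, channels, kernel_size=3, padding=1)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, h):
        # Predict Gaussian parameters for this level's latent from the incoming features.
        mu, log_sigma = self.to_stats(h).chunk(2, dim=1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterised sample of z_l
        h = h + self.from_z(z)                           # inject this level's latent
        return self.up(h), (mu, log_sigma)               # stats are kept for the KL terms

class HierarchicalDecoder(nn.Module):
    """Stack of levels: coarse latents first, finer latents at higher resolutions."""
    def __init__(self, channels=64, z_dim=16, levels=3):
        super().__init__()
        self.levels = nn.ModuleList(LatentBlock(channels, z_dim) for _ in range(levels))
        self.to_frame = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, h):
        stats = []
        for level in self.levels:
            h, s = level(h)
            stats.append(s)
        return torch.sigmoid(self.to_frame(h)), stats
```

During training, the $(\mu, \log\sigma)$ statistics at each level would each contribute a KL term against that level's prior, analogous to the single-level KL in the SVG sketch above.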