Video Diffusion Models: Survey
===

Taxonomy
---
![image](https://hackmd.io/_uploads/Hkq1OdSHR.png)

Architecture
---
- UNet
- Vision Transformer

Temporal dynamics
---
- Spatio-temporal attention
- Temporal upsampling:
    1. Generate key frames
    2. Fill in the intermediate frames, either by interpolation or by another diffusion pass conditioned on two key frames
- Structure preservation:
    - Replace the initial noise with (the latent of) the input video frames
    - Condition on depth, pose, ...

Training
---
- Datasets

    ![image](https://hackmd.io/_uploads/SJWxTurrC.png)
- Evaluation metrics
    - Human evaluation
    - FID (computed on individual frames)
    - FVD
    - Kernel Video Distance (alternative to FVD)
    - [Fréchet Video Motion Distance (FVMD)](): focuses on temporal consistency
    - [Content-Debiased FVD](https://github.com/songweige/content-debiased-fvd)
    - Unary:
        - Inception Score
        - CLIP similarity
    - [VBench](https://vchitect.github.io/VBench-project/)

Image-conditioned Generation
---
- Image conditioning can be achieved by injecting embeddings (CLIP, VAE latents, extra input channels)
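
As a rough illustration of the "extra input channels" option for image conditioning, the sketch below appends the latent of a conditioning image to every frame of the noisy video latents along the channel axis. The function name `concat_image_condition`, the `(T, C, H, W)` layout, and the toy shapes are assumptions made for this example; they are not taken from any particular model.

```python
import numpy as np

def concat_image_condition(noisy_latents, image_latent):
    """Append a conditioning image latent to every frame as extra channels.

    noisy_latents: (T, C, H, W) noisy video latents at the current step.
    image_latent:  (C_img, H, W) latent of the conditioning image.
    Returns an array of shape (T, C + C_img, H, W).
    """
    t, _, h, w = noisy_latents.shape
    if image_latent.shape[1:] != (h, w):
        raise ValueError("image latent must match the video's spatial size")
    # Repeat the image latent along the time axis (a read-only view, no copy).
    repeated = np.broadcast_to(image_latent, (t,) + image_latent.shape)
    # Stack along the channel axis so the denoiser sees both inputs.
    return np.concatenate([noisy_latents, repeated], axis=1)

# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4, 16, 16))   # 8 frames, 4 latent channels
image = rng.standard_normal((4, 16, 16))      # latent of one conditioning image
denoiser_input = concat_image_condition(video, image)
print(denoiser_input.shape)  # (8, 8, 16, 16)
```

With this kind of conditioning, the first layer of the denoiser has to accept `C + C_img` input channels instead of `C`. CLIP embeddings, by contrast, are typically injected through cross-attention rather than through the input tensor.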