Video Diffusion Models: Survey
===
Taxonomy
---

Architecture
---
- UNet
- Vision Transformer

Temporal dynamics
---
- Spatio-temporal attention
- Temporal upsampling:
  1. Generate key frames
  2. Fill in the frames between them, either by interpolation or by another diffusion pass conditioned on two key frames
- Structure preservation:
  - Replace the initial noise with the (latent of the) input video frames
  - Condition on depth, pose, ...
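
The temporal upsampling and structure preservation items above can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions, not any specific model's method: latents are plain Python lists, linear blending stands in for a learned interpolator or a second diffusion pass, and `interpolate_key_frames`, `noise_input_latent`, and `alpha_bar_t` are hypothetical names. The noising step uses the standard DDPM forward process, which is one common reading of "replace the initial noise with the input latent": sampling starts from a partially noised input video instead of pure noise.

```python
import math
import random


def interpolate_key_frames(key_a, key_b, n_between):
    """Linearly blend two key-frame latents into n_between intermediate frames."""
    frames = []
    for i in range(1, n_between + 1):
        w = i / (n_between + 1)  # blend weight, strictly between 0 and 1
        frames.append([(1 - w) * a + w * b for a, b in zip(key_a, key_b)])
    return frames


def noise_input_latent(latent, alpha_bar_t, rng):
    """DDPM forward step: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    keep = math.sqrt(alpha_bar_t)
    spread = math.sqrt(1.0 - alpha_bar_t)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in latent]


# Toy latents with three values each (assumption: real latents are large tensors).
key_a = [0.0, 0.0, 1.0]
key_b = [1.0, 2.0, 1.0]
middle = interpolate_key_frames(key_a, key_b, n_between=3)

# Start sampling from a noised copy of an input frame instead of pure noise.
rng = random.Random(0)
start = noise_input_latent(key_b, alpha_bar_t=0.25, rng=rng)
```

A smaller `alpha_bar_t` adds more noise, so less of the input structure survives; `alpha_bar_t = 1.0` returns the input latent unchanged.
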

Training
---
- Datasets

- Evaluation metrics
  - Human evaluation
  - FID (computed on individual frames)
  - FVD
  - Kernel Video Distance (KVD, an alternative to FVD)
  - [Fréchet Video Motion Distance (FVMD)](): focuses on temporal consistency
  - [Content-Debiased FVD](https://github.com/songweige/content-debiased-fvd)
  - Unary:
    - Inception Score (IS)
    - CLIP similarity
  - [VBench](https://vchitect.github.io/VBench-project/)
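
FID and FVD are both Fréchet distances between Gaussians fitted to feature embeddings of real and generated samples (image features for FID, video features for FVD). The sketch below shows only the distance itself, and it assumes diagonal covariances to stay dependency-free. Actual FID/FVD implementations use full covariance matrices and a matrix square root, and they extract features with a pretrained network. The function names and toy feature vectors are hypothetical.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dims = range(len(features[0]))
    mean = [sum(f[d] for f in features) / n for d in dims]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in dims]
    return mean, var


def frechet_distance_diag(real_feats, fake_feats):
    """Squared Frechet distance between two Gaussians with DIAGONAL covariance.

    Full form: |mu_r - mu_g|^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    With diagonal S, the trace reduces to a per-dimension sum (simplification).
    """
    mu_r, var_r = mean_and_var(real_feats)
    mu_g, var_g = mean_and_var(fake_feats)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + cov_term


# Toy 2-D "features"; in practice these come from a pretrained feature extractor.
real = [[0.0, 0.0], [2.0, 2.0]]
fake = [[3.0, 0.0], [5.0, 2.0]]
score = frechet_distance_diag(real, fake)
```

Here the two sets have identical variances and differ only by a shift of 3 in the first dimension, so the score is driven entirely by the mean term. Lower is better, and comparing a set with itself gives 0.
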

Image-conditioned Generation
---
- Image conditioning can be achieved by injecting the image into the denoiser as CLIP embeddings, as VAE latents, or through extra input channels
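
As a rough illustration of the "extra input channels" option, the sketch below appends the conditioning image's latent channels to every noisy video frame before it would be passed to the denoiser. It assumes each frame is a list of channels and each channel a flat list of values. `add_image_channels` and the shapes are hypothetical, and the denoiser's first layer would have to accept the enlarged channel count.

```python
def add_image_channels(noisy_frames, image_latent):
    """Append the conditioning image's channels to every frame's channel list."""
    return [frame + image_latent for frame in noisy_frames]


# 3 frames, 4 latent channels each, 2 spatial positions per channel (toy sizes).
noisy_frames = [[[0.1 * t] * 2 for _ in range(4)] for t in range(3)]
image_latent = [[1.0, 1.0] for _ in range(4)]

# Each frame now carries 4 noisy channels followed by 4 conditioning channels.
model_input = add_image_channels(noisy_frames, image_latent)
```

The same image latent is repeated for every frame, so the conditioning signal is available at each time step without changing the number of frames.
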