Video Diffusion Models: Survey
===

Taxonomy
---

Architecture
---
- UNet
- Vision Transformer

Temporal dynamics
---
- Spatio-temporal attention
- Temporal upsampling:
    1. Generate key frames
    2. Interpolate between them, or run another diffusion pass conditioned on two key frames
- Structure preservation:
    - Replace the initial noise with (the latent of) the input video frames
    - Condition on depth, pose, ...

Training
---
- Datasets
- Evaluation metrics
    - Human
    - FID (individual frames)
    - FVD
    - Kernel Video Distance (FVD alternative)
    - [Fréchet Video Motion Distance (FVMD)](): focuses on temporal consistency
    - [Content-Debiased FVD](https://github.com/songweige/content-debiased-fvd)
    - Unary:
        - Inception Score
        - CLIP similarity
    - [VBench](https://vchitect.github.io/VBench-project/)

Image-conditioned Generation
---
- Image conditioning can be achieved by injecting embeddings (CLIP, VAE latents, extra input channels)
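
As a rough illustration of the "extra input channels" option, the sketch below repeats an image latent across every frame and concatenates it to the noisy video latent along the channel axis. The function name `concat_image_condition`, the `(T, C, H, W)` layout, and the tensor sizes are assumptions made for this example, not taken from any particular model.

```python
import numpy as np


def concat_image_condition(noisy_video, image_latent):
    """Attach an image latent to every frame as extra input channels.

    noisy_video:  array of shape (T, C, H, W)      (assumed layout)
    image_latent: array of shape (C_img, H, W)
    returns:      array of shape (T, C + C_img, H, W)
    """
    if noisy_video.shape[2:] != image_latent.shape[1:]:
        raise ValueError("spatial sizes of video and image latents must match")
    num_frames = noisy_video.shape[0]
    # Repeat the same conditioning latent for every frame.
    cond = np.broadcast_to(image_latent[None], (num_frames,) + image_latent.shape)
    # Channel-wise concatenation: the denoiser sees C + C_img input channels.
    return np.concatenate([noisy_video, cond], axis=1)


rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4, 16, 16))   # 8 frames, 4 latent channels
image = rng.standard_normal((4, 16, 16))      # latent of the conditioning image
x = concat_image_condition(video, image)
print(x.shape)  # (8, 8, 16, 16)
```

With this approach the first layer of the denoiser must accept the widened channel count, whereas CLIP embeddings are more commonly injected through cross-attention and leave the input shape unchanged.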