# Text2Video-Zero

## Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

* **Link:** https://arxiv.org/pdf/2303.13439
* **Conference / Journal:** ICCV 2023 Oral
* **Authors:** Picsart AI Research
* **Comments:** https://github.com/Picsart-AI-Research/Text2Video-Zero

## Introduction

> Zero-shot, “training-free” text-to-video synthesis: the task of generating videos from textual prompts **without requiring any optimization or fine-tuning**.

**Contributions**
- Enrich the latent codes with motion dynamics to inject global motion into the generated frames.
- Reprogram self-attention into cross-frame attention on the first frame to preserve the identity, context, and appearance of objects across the video.

## Method

![image](https://hackmd.io/_uploads/SJSY_SrBC.png)

**Zero-shot T2V Problem**
- Given a text description $\tau$ and a positive integer $m$ (the number of frames),
- the goal is a function $F$ that outputs a video $V \in \mathbb{R}^{m \times H \times W \times 3}$ consistent with $\tau$. $F$ must require no training or fine-tuning.

**Motion dynamics in latent codes**

![image](https://hackmd.io/_uploads/HyIfurHrA.png)

**Reprogramming Cross-Frame Attention**

![image](https://hackmd.io/_uploads/SyppuHBSR.png)

## Experiments

### Results

**Comparison with baselines**
- CogVideo

  ![image](https://hackmd.io/_uploads/SJkBKSrrC.png)

- Tune-A-Video

  ![image](https://hackmd.io/_uploads/BykvFHSrR.png)

## Misc
- Implementation of the warping function
- Background smoothing
- Limitations:
  - Generated videos are not temporally smooth
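The cross-frame attention idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention, hypothetical tensor shapes `(m, tokens, dim)`, and the hypothetical helper names `softmax` and `cross_frame_attention`. The key point it demonstrates is that every frame's queries attend to the keys/values of the *first* frame only, which anchors the appearance of later frames to frame one.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q, k, v):
    """Sketch of cross-frame attention (hypothetical shapes).

    q, k, v: per-frame projections of shape (m, tokens, dim).
    Instead of frame i attending to its own k[i], v[i] (self-attention),
    all frames reuse the FIRST frame's keys and values.
    """
    m, t, d = q.shape
    k0, v0 = k[0], v[0]  # keys/values of the first frame only
    out = np.empty_like(q)
    for i in range(m):
        attn = softmax(q[i] @ k0.T / np.sqrt(d))  # (tokens, tokens)
        out[i] = attn @ v0
    return out
```

Because `k0`/`v0` are shared, two frames with identical queries produce identical outputs, which is exactly the appearance-preserving behavior the reprogrammed attention is meant to give.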
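The "motion dynamics in latent codes" step can also be sketched in a few lines. The toy function below is an assumption-laden stand-in: it uses `np.roll` as a placeholder for the paper's warping function, a fixed per-frame translation `delta` as the global motion direction, and the hypothetical name `motion_latents`. It only shows the structural idea: frame $k$'s starting latent is the first frame's latent warped by a translation that grows linearly with $k$.

```python
import numpy as np

def motion_latents(x1, m, delta=(1, 1)):
    """Toy motion prior for latent codes (illustrative only).

    x1:    first-frame latent of shape (C, H, W)
    m:     number of frames
    delta: per-frame translation (dy, dx); frame k is shifted by k*delta

    np.roll is a crude stand-in for the paper's warping function;
    returns latents of shape (m, C, H, W).
    """
    frames = []
    for k in range(m):
        dy, dx = k * delta[0], k * delta[1]
        frames.append(np.roll(x1, shift=(dy, dx), axis=(1, 2)))
    return np.stack(frames)
```

In the actual method these motion-enriched latents are then denoised by the frozen text-to-image diffusion model, so the translation shows up as coherent global motion rather than per-frame noise.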