# Text2Video-Zero
## Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
* **Link:** https://arxiv.org/pdf/2303.13439
* **Conference / Journal:** ICCV 2023 Oral.
* **Authors:** Picsart AI Research.
* **Code:** https://github.com/Picsart-AI-Research/Text2Video-Zero
## Introduction
> Zero-shot, “training-free” text-to-video synthesis, which is the task of generating videos from textual prompts **without requiring any optimization or finetuning**
**Contributions**
- Enrich the latent codes of the generated frames with motion dynamics, keeping the global scene and background time-consistent.
- Replace self-attention with cross-frame attention of every frame on the first frame, to preserve the identity, context, and appearance of the foreground object.
## Method

**Zero-shot T2V Problem**
- Given a text prompt $\tau$ and a positive integer $m$ (the number of frames).
- Goal: a function $F$ that outputs a video $V \in \mathbb{R}^{m \times H \times W \times 3}$, where $F$ requires no training or fine-tuning (interface sketched below).
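
A minimal interface sketch of this requirement; the `ZeroShotT2V` name and the tensor-typed signature are illustrative, not from the paper:

```python
from typing import Protocol

import torch

class ZeroShotT2V(Protocol):
    """F: (prompt tau, frame count m) -> video V of shape [m, H, W, 3].

    No training or fine-tuning allowed: only a frozen, pretrained
    text-to-image diffusion model may be used internally.
    """
    def __call__(self, tau: str, m: int) -> torch.Tensor: ...
```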
**Motion dynamics in latent codes**
- Sample only the first-frame latent $x_T^1 \sim \mathcal{N}(0, I)$ and run $\Delta t$ DDIM backward steps to obtain $x_{T'}^1$, where $T' = T - \Delta t$.
- For each frame $k$, define a global translation vector $\delta^k = \lambda (k-1) \delta$ ($\delta$ sets the motion direction, $\lambda$ its magnitude) and warp the first-frame latent: $x_{T'}^k = W_k(x_{T'}^1)$.
- Apply a DDPM forward pass to each $x_{T'}^k$ to return to noise level $T$, then denoise all $m$ latents with the frozen T2I model.
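
A compact sketch of this latent pipeline. The `ddim_backward` / `ddpm_forward` callables are assumed placeholders for the corresponding diffusion steps, the hyperparameter defaults are illustrative, and the integer `torch.roll` stands in for the paper's warping function $W_k$:

```python
import torch

def motion_latents(x1: torch.Tensor, ddim_backward, ddpm_forward,
                   m: int, lam: float = 1.0, direction=(1, 1)) -> torch.Tensor:
    """Latent motion dynamics, as a sketch. x1 is the first-frame latent
    x_T^1 of shape [C, h, w]; ddim_backward / ddpm_forward are assumed
    callables running Delta-t DDIM backward steps (x_T -> x_{T'}) and the
    DDPM forward (re-noising) pass (x_{T'} -> x_T)."""
    x_tp1 = ddim_backward(x1)                        # x_{T'}^1
    frames = []
    for k in range(1, m + 1):
        # Global translation delta^k = lam * (k - 1) * direction, applied
        # as an integer roll for simplicity (the paper warps bilinearly).
        dy = round(lam * (k - 1) * direction[0])
        dx = round(lam * (k - 1) * direction[1])
        x_tpk = torch.roll(x_tp1, shifts=(dy, dx), dims=(-2, -1))  # W_k
        frames.append(ddpm_forward(x_tpk))           # x_T^k
    return torch.stack(frames)                       # [m, C, h, w]
```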

**Reprogramming Cross-Frame Attention**
- Each self-attention layer of the Stable Diffusion UNet is replaced by cross-frame attention, where keys and values are taken from the *first* frame: $\text{CF-Attn}(Q^k, K^{1:m}, V^{1:m}) = \text{Softmax}\big(Q^k (K^1)^\top / \sqrt{c}\big)\, V^1$.
- The pretrained projections $W^Q$, $W^K$, $W^V$ are reused unchanged, so nothing is trained; anchoring every frame on frame 1 preserves the identity, context, and appearance of the foreground object.
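
A minimal PyTorch sketch of the mechanism; the tensor layout and function name are my own, while the real implementation patches the UNet's attention processors:

```python
import torch

def cross_frame_attention(q: torch.Tensor, k: torch.Tensor,
                          v: torch.Tensor) -> torch.Tensor:
    """q, k, v: per-frame projections of shape [m, heads, tokens, dim]
    from the frozen T2I UNet. Every frame's queries attend to the
    FIRST frame's keys/values (batch broadcasting expands k[:1], v[:1]
    across all m frames)."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k[:1].transpose(-2, -1) * scale, dim=-1)
    return attn @ v[:1]  # [m, heads, tokens, dim]
```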

## Experiments
### Results
**Comparison with baselines**
- CogVideo: a large-scale T2V model trained on video data. Despite using no video data at all, Text2Video-Zero achieves a slightly higher CLIP score (text-video alignment) and generates videos that respect the prompt at least as well.
- Tune-A-Video: requires one-shot fine-tuning on a reference video, whereas Text2Video-Zero needs no tuning at all and still preserves the identity and appearance of the depicted objects at least comparably.

## Misc
**Implementation of the warping function.** $W_k$ translates the latent grid by $\delta^k$ and resamples it; one plausible implementation is sketched below.
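
A minimal sketch, assuming an affine-grid translation with bilinear `grid_sample` and border padding; the official repo's exact interpolation and padding choices may differ:

```python
import torch
import torch.nn.functional as F

def warp(latent: torch.Tensor, dx: float, dy: float) -> torch.Tensor:
    """Translate a latent [B, C, h, w] by (dx, dy) pixels.

    Builds an affine sampling grid and resamples bilinearly; border
    padding stands in for whatever the official repo uses."""
    b, c, h, w = latent.shape
    # grid_sample coordinates live in [-1, 1], so a shift of dx pixels
    # corresponds to 2 * dx / w in normalized units (negated: the grid
    # says where each output pixel reads FROM).
    theta = torch.tensor([[1.0, 0.0, -2.0 * dx / w],
                          [0.0, 1.0, -2.0 * dy / h]],
                         dtype=latent.dtype, device=latent.device)
    theta = theta.unsqueeze(0).repeat(b, 1, 1)
    grid = F.affine_grid(theta, list(latent.shape), align_corners=False)
    return F.grid_sample(latent, grid, padding_mode="border",
                         align_corners=False)
```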
**Background smoothing.** A salient-object mask $M^k$ keeps the foreground untouched while the background latent is pulled toward the warped first-frame latent via a convex combination: $\bar{x}_t^k = M^k \odot x_t^k + (1 - M^k) \odot \big(\alpha \hat{x}_t^k + (1-\alpha)\, x_t^k\big)$.
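
The combination step as code; the `alpha=0.6` default is my reading of the paper's setting, so treat it as an assumption:

```python
import torch

def smooth_background(x_t: torch.Tensor, x_warped: torch.Tensor,
                      mask: torch.Tensor, alpha: float = 0.6) -> torch.Tensor:
    """Convex combination on background latents.

    x_t:      current latent x_t^k
    x_warped: first-frame latent warped to frame k (hat{x}_t^k)
    mask:     salient-object mask M^k (1 = foreground, 0 = background)
    alpha:    smoothing strength; 0.6 is assumed here
    """
    background = alpha * x_warped + (1.0 - alpha) * x_t
    return mask * x_t + (1.0 - mask) * background
```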
**Limitations**
- Generated videos are still not perfectly smooth over time; visible flicker can remain, since no temporal training or fine-tuning is performed.