# Table of contents
[TOC]
## ModelScope

- VQGAN and Denoising UNet initialized from Stable Diffusion
- Temporal layers initialized so that their output is 0, so the network initially behaves like the pretrained image model
- Training data: LAION2B-en, WebVid-10M (336 $\times$ 596), MSR-VTT for validation
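
The zero-output temporal initialization noted above can be illustrated with a small framework-free sketch. It assumes the temporal layer sits on a residual branch. The class name `TemporalResidualBlock`, the `out_scale` parameter, and the 3-frame mixing kernel are all hypothetical and chosen only for illustration; this is not ModelScope's actual implementation.

```python
import random


class TemporalResidualBlock:
    """Toy temporal layer: mixes each frame with its neighbors on a residual branch."""

    def __init__(self, kernel_size=3, zero_init=True, seed=0):
        rng = random.Random(seed)
        # Temporal mixing weights, randomly initialized like any new layer.
        self.kernel = [rng.uniform(-1.0, 1.0) for _ in range(kernel_size)]
        # Output scale (hypothetical): zero at init so the branch contributes nothing.
        self.out_scale = 0.0 if zero_init else 1.0

    def __call__(self, frames):
        # frames: list of T frames, each a flat list of feature values.
        num_frames = len(frames)
        half = len(self.kernel) // 2
        out = []
        for t in range(num_frames):
            mixed = [0.0] * len(frames[t])
            for k, w in enumerate(self.kernel):
                # Clamp the neighbor index at the clip boundaries.
                s = min(max(t + k - half, 0), num_frames - 1)
                for i, v in enumerate(frames[s]):
                    mixed[i] += w * v
            # Residual connection: input plus the scaled temporal branch.
            out.append([x + self.out_scale * m for x, m in zip(frames[t], mixed)])
        return out


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
block = TemporalResidualBlock(zero_init=True)
print(block(frames) == frames)  # the zero-initialized block acts as an identity
```

With `zero_init=True` the block returns its input unchanged, so each frame is processed exactly as the pretrained spatial layers would process a single image. Training can then move `out_scale` (in a real model, the last projection's weights) away from zero gradually.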
## I2VGen-XL

**Challenges:**
* Scarcity of well-aligned text-video data
* Complex inherent structure of videos

**Approach:** decouple semantic coherence from visual quality with a two-stage cascade
* Base stage: preserves content from the input image by using two hierarchical encoders
* Refinement stage: scales up the resolution and incorporates text