# Table of contents

[TOC]

## ModelScope

![image](https://hackmd.io/_uploads/ryOK9d0B0.png)

- VQGAN and denoising UNet are initialized from Stable Diffusion
- Temporal layers are initialized so that their output is 0
- Training data: LAION2B-en, WebVid-10M (336 $\times$ 596), MSR-VTT for validation

## I2VGen-XL

![image](https://hackmd.io/_uploads/S1Uzx5Cr0.png)

**Challenges:**

* Scarcity of well-aligned text-video data
* Complex inherent structure of videos
* Decoupling these two factors
* Base stage: preserves content from the input image by using two hierarchical encoders
* Refinement stage: scales up the resolution and incorporates text
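
**Sketch: zero-output temporal init (ModelScope).** The ModelScope note above says the temporal part starts with an output of 0. One common way to get this, assumed here and not confirmed by these notes, is to add the temporal branch as a residual and zero-initialize its last projection, so the pretrained spatial features pass through unchanged at the start of training. The class name `TemporalResidualBlock`, the kernel size, and the shapes are all hypothetical; NumPy stands in for a real deep-learning framework.

```python
import numpy as np


class TemporalResidualBlock:
    """Hypothetical residual temporal block over a (frames, channels) array."""

    def __init__(self, channels, kernel_size=3, seed=0):
        rng = np.random.default_rng(seed)
        self.kernel_size = kernel_size
        # Temporal 1D convolution weights: one (channels x channels) matrix per tap.
        self.conv_w = rng.normal(0.0, 0.02, size=(kernel_size, channels, channels))
        # Assumption: the final projection is zero-initialized, so the
        # temporal branch contributes exactly 0 before any training step.
        self.proj_w = np.zeros((channels, channels))
        self.proj_b = np.zeros(channels)

    def __call__(self, x):
        frames = x.shape[0]
        pad = self.kernel_size // 2
        # Zero-pad along the frame axis so the output keeps the same length.
        padded = np.pad(x, ((pad, pad), (0, 0)))
        h = np.zeros(x.shape, dtype=float)
        for k in range(self.kernel_size):
            h += padded[k:k + frames] @ self.conv_w[k]
        h = np.maximum(h, 0.0)  # ReLU
        # Residual connection: spatial features plus the temporal branch.
        return x + h @ self.proj_w + self.proj_b


x = np.random.default_rng(1).normal(size=(8, 4))  # 8 frames, 4 channels
block = TemporalResidualBlock(channels=4)
y = block(x)
print(np.allclose(y, x))
```

Because the temporal branch contributes nothing at initialization, the video model initially behaves like the image model applied frame by frame; the temporal weights only start to matter once training moves `proj_w` away from zero.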