# DynamiCrafter
* **Link:** [pdf](https://arxiv.org/pdf/2310.12190).
* **Conference / Journal:**
* **Authors:** CUHK + Tencent AI Lab.
* **Comments:** [project page](https://doubiiu.github.io/projects/DynamiCrafter/#motion_control_container).
## Introduction
> Govern the video generation process of T2V diffusion models by incorporating a conditional image.
* Previous works fall short on image animation because their image injection mechanisms are not comprehensive enough
--> **Propose a dual-stream image injection paradigm**
## Method
**NOTE:** The conditional image can correspond to an arbitrary frame of the video.

* **Image Dynamics from Video Diffusion Priors**
* Text-aligned context representation: project the image into a text-aligned embedding space via the CLIP image encoder followed by a learnable query transformer
* In each layer, cross-attend to both the text embedding and the image context embedding, fused with a learnable $\lambda$ coefficient (see the sketch below)

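A minimal sketch of this context stream, assuming the query-transformer output `img_ctx` is available: inside each cross-attention layer the UNet features attend to the text tokens and the image context tokens in parallel, and the two results are fused with a learnable $\lambda$. Module names, the separate key/value projections, and the `tanh` gating with zero init of $\lambda$ are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Sketch of dual-stream cross-attention: queries attend to text tokens and
    to the text-aligned image context tokens separately, and the two outputs are
    fused with a learnable scalar lambda (parameterization assumed, not official)."""

    def __init__(self, query_dim, ctx_dim, heads=8, head_dim=64):
        super().__init__()
        inner = heads * head_dim
        self.heads, self.scale = heads, head_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner, bias=False)
        self.to_kv_txt = nn.Linear(ctx_dim, inner * 2, bias=False)
        self.to_kv_img = nn.Linear(ctx_dim, inner * 2, bias=False)
        self.to_out = nn.Linear(inner, query_dim)
        self.lam = nn.Parameter(torch.zeros(1))  # learnable mixing coefficient (zero init assumed)

    def attend(self, q, kv):
        k, v = kv.chunk(2, dim=-1)
        b, n, _ = q.shape
        # split heads: (batch, seq, inner) -> (batch, heads, seq, head_dim)
        q, k, v = (t.reshape(b, -1, self.heads, t.shape[-1] // self.heads).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, n, -1)

    def forward(self, x, txt_ctx, img_ctx):
        q = self.to_q(x)
        out_txt = self.attend(q, self.to_kv_txt(txt_ctx))   # text stream
        out_img = self.attend(q, self.to_kv_img(img_ctx))   # image context stream
        return self.to_out(out_txt + torch.tanh(self.lam) * out_img)
```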
* Visual detail guidance (VDG): the CLIP image encoder cannot fully preserve the input image's details --> concatenate the conditional image (its latent) with the per-frame initial noise
* Two guidance scales: classifier-free guidance is applied separately over the image and text conditions (see the sketch below)
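The two guidance scales correspond to classifier-free guidance over the image and text conditions. A hedged sketch of one common two-scale formulation (the exact nesting of conditions and the default scale values should be checked against the paper/code; `denoiser` is a placeholder for the conditioned noise predictor):

```python
import torch

def dual_cfg(denoiser, x_t, t, txt_emb, img_ctx, null_txt, null_img,
             s_img=7.5, s_txt=7.5):
    """Two-scale classifier-free guidance sketch: guide first toward the image
    condition, then additionally toward the text condition."""
    eps_uncond = denoiser(x_t, t, null_txt, null_img)   # no image, no text
    eps_img    = denoiser(x_t, t, null_txt, img_ctx)    # image only
    eps_full   = denoiser(x_t, t, txt_emb, img_ctx)     # image + text
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))
```

At inference, `s_img` and `s_txt` can be tuned independently to trade faithfulness to the conditioning image against adherence to the text prompt.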
## Experiments
* Based on T2V VideoCrafter (256 $\times$ 256 resolution) and T2I Stable-Diffusion-v2.1
### Datasets
* WebVid-10M dataset: sample 16 frames per clip at 256 $\times$ 256
### Results

## Misc
---
# VideoCrafter
* **Link:** [pdf](https://arxiv.org/pdf/2310.19512)
* **Conference / Journal:**
* **Authors:** CUHK, Tencent AI Lab
* **Comments:**
## Introduction
> Open-source T2V and I2V models at 1024 $\times$ 576 resolution
## Method
- **Overview**

Consists of a video VAE and a video latent diffusion model
- **Denoising 3D UNet:**
- Each block contains conv, spatial transformers (ST), and temporal transformers (TT)

- FPS and the diffusion timestep are each projected into an embedding vector using a sinusoidal embedding (see the sketch after this list).
- Image conditional branch (for the I2V model): the CLIP image embedding is injected via cross-attention alongside the text embedding.

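A quick sketch of the sinusoidal projection mentioned in the list above, following the standard diffusion-timestep embedding; applying the same scheme to the FPS value and the MLP widths are assumptions here:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim=256, max_period=10000):
    """Standard transformer-style sinusoidal embedding for a batch of scalars
    (diffusion timestep or FPS); dim and max_period are illustrative defaults."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float().unsqueeze(-1) * freqs                           # (batch, half)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)     # (batch, dim)

# The sinusoidal vector is typically mapped by a small MLP before being added to
# the ResBlock features (the MLP widths here are an assumption).
fps_mlp = nn.Sequential(nn.Linear(256, 1024), nn.SiLU(), nn.Linear(1024, 1024))
emb = fps_mlp(sinusoidal_embedding(torch.tensor([8.0, 16.0, 24.0])))  # e.g. FPS values
```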
## Experiments
### Datasets
- **Training**: LAION-COCO 600M (images) + WebVid-10M (videos)
### Results

## Misc
- Implementation of temporal attention: fold the spatial dimensions into the batch and run self-attention over the frame axis (see the sketch below)
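A sketch of how temporal attention is typically implemented in these video UNets: $H \times W$ is folded into the batch so each spatial location attends only across frames. The learned positional embedding, normalization placement, and sizes are illustrative rather than the exact VideoCrafter code.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention along the frame axis only; spatial positions are folded
    into the batch dimension (names and sizes are illustrative)."""

    def __init__(self, dim, heads=8, max_frames=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.pos = nn.Parameter(torch.zeros(max_frames, dim))  # learned temporal pos. embedding

    def forward(self, x):
        # x: (B, C, T, H, W) feature map from the preceding spatial layers
        b, c, t, h, w = x.shape
        # fold H*W into the batch: each spatial location attends over time
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        y = self.norm(y) + self.pos[:t]
        y, _ = self.attn(y, y, y)
        y = y.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        return x + y  # residual connection around the temporal transformer
```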
---
# VideoCrafter2
* **Link:** [pdf](https://arxiv.org/pdf/2401.09047)
* **Conference / Journal:**
* **Authors:** Tencent AI Lab
* **Comments:**
## Introduction
> We explore the **training scheme** of video models extended from Stable Diffusion and investigate the feasibility of **leveraging low-quality videos and synthesized high-quality images** to obtain a high-quality video model
> We observe that **full training of all modules** results in a stronger coupling between spatial and temporal modules than only training temporal modules
## Method
* **Spatial-temporal Connection Analyses**
* Spatial Perturbation
* Temporal Perturbation
* --> Perturbing one type of module and inspecting the output reveals the coupling strength between spatial and temporal modules
* Propose to disentangle motion (temporal) from appearance (spatial) at the data level.
* Fully train the video model on low-quality videos first, then directly fine-tune only the spatial modules on synthesized high-quality images (see the sketch below).
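A minimal sketch of that second stage, assuming the temporal modules can be identified by a name substring; the `"temporal"` match, the optimizer, and the learning rate are illustrative, not the paper's exact recipe.

```python
import torch

def build_spatial_finetune_optimizer(unet, lr=1e-5):
    """Freeze the temporal modules of the fully trained video UNet and return an
    optimizer over the spatial modules only, for fine-tuning on high-quality images."""
    spatial_params = []
    for name, p in unet.named_parameters():
        if "temporal" in name.lower():
            p.requires_grad_(False)      # keep the learned motion prior intact
        else:
            p.requires_grad_(True)
            spatial_params.append(p)
    return torch.optim.AdamW(spatial_params, lr=lr)
```

Training then proceeds on image batches treated as single-frame clips, leaving the motion learned in the first (low-quality video) stage untouched.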
## Experiments
### Datasets
### Results

## Misc