# DynamiCrafter
* **Link:** [pdf](https://arxiv.org/pdf/2310.12190).
* **Conference / Journal:**
* **Authors:** CUHK + Tencent AI Lab.
* **Comments:** [project page](https://doubiiu.github.io/projects/DynamiCrafter/#motion_control_container).
## Introduction
> Govern the video generation process of T2V diffusion models by incorporating a conditional image.
* Previous works fall short on image animation because their image injection mechanisms are not comprehensive enough
--> **Propose a dual-stream image injection paradigm**
## Method
**NOTE:** The conditional image can correspond to an arbitrary frame of the video.

* **Image Dynamics from Video Diffusion Priors**
* Text-aligned context representation: project the image into a text-aligned embedding space via the CLIP image encoder followed by a learnable query transformer
* In each layer, cross-attend to both the text embedding and the image context embedding, fused with a learnable $\lambda$ coefficient (see the sketch below)

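A minimal sketch of this context stream, assuming the query-transformer output `img_ctx` is available: inside each cross-attention layer the UNet features attend to the text tokens and the image context tokens in parallel, and the two results are fused with a learnable $\lambda$. Module names, the separate key/value projections, and the `tanh` gating with zero init of $\lambda$ are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Sketch of dual-stream cross-attention: queries attend to text tokens and
    to the text-aligned image context tokens separately, and the two outputs are
    fused with a learnable scalar lambda (parameterization assumed, not official)."""

    def __init__(self, query_dim, ctx_dim, heads=8, head_dim=64):
        super().__init__()
        inner = heads * head_dim
        self.heads, self.scale = heads, head_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner, bias=False)
        self.to_kv_txt = nn.Linear(ctx_dim, inner * 2, bias=False)
        self.to_kv_img = nn.Linear(ctx_dim, inner * 2, bias=False)
        self.to_out = nn.Linear(inner, query_dim)
        self.lam = nn.Parameter(torch.zeros(1))  # learnable mixing coefficient (zero init assumed)

    def attend(self, q, kv):
        k, v = kv.chunk(2, dim=-1)
        b, n, _ = q.shape
        # split heads: (batch, seq, inner) -> (batch, heads, seq, head_dim)
        q, k, v = (t.reshape(b, -1, self.heads, t.shape[-1] // self.heads).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, n, -1)

    def forward(self, x, txt_ctx, img_ctx):
        q = self.to_q(x)
        out_txt = self.attend(q, self.to_kv_txt(txt_ctx))   # text stream
        out_img = self.attend(q, self.to_kv_img(img_ctx))   # image context stream
        return self.to_out(out_txt + torch.tanh(self.lam) * out_img)
```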
* Visual detail guidance (VDG): the CLIP image encoder cannot fully preserve the input image's details --> concatenate the conditional image (its latent) with the per-frame initial noise
* Two guidance scales: classifier-free guidance is applied separately over the image and text conditions (see the sketch below)
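The two guidance scales correspond to classifier-free guidance over the image and text conditions. A hedged sketch of one common two-scale formulation (the exact nesting of conditions and the default scale values should be checked against the paper/code; `denoiser` is a placeholder for the conditioned noise predictor):

```python
import torch

def dual_cfg(denoiser, x_t, t, txt_emb, img_ctx, null_txt, null_img,
             s_img=7.5, s_txt=7.5):
    """Two-scale classifier-free guidance sketch: guide first toward the image
    condition, then additionally toward the text condition."""
    eps_uncond = denoiser(x_t, t, null_txt, null_img)   # no image, no text
    eps_img    = denoiser(x_t, t, null_txt, img_ctx)    # image only
    eps_full   = denoiser(x_t, t, txt_emb, img_ctx)     # image + text
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))
```

At inference, `s_img` and `s_txt` can be tuned independently to trade faithfulness to the conditioning image against adherence to the text prompt.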
## Experiments
* Based on T2V VideoCrafter (256 $\times$ 256 resolution) and T2I Stable-Diffusion-v2.1
### Datasets
* WebVid-10M dataset: sample 16 frames per clip at 256 $\times$ 256
### Results

## Misc
---
# VideoCrafter
* **Link:** [pdf](https://arxiv.org/pdf/2310.19512)
* **Conference / Journal:**
* **Authors:** CUHK, Tencent AI Lab
* **Comments:**
## Introduction
> Open-source T2V and I2V models at 1024 $\times$ 576 resolution
## Method
- **Overview**

Consists of a video VAE and a video latent diffusion model
- **Denoising 3D UNet:**
- Each block contains conv, spatial transformers (ST), and temporal transformers (TT)

- FPS and the diffusion timestep are each projected into an embedding vector using a sinusoidal embedding (see the sketch after this list).
- Image conditional branch (for the I2V model): the CLIP image embedding is injected via cross-attention alongside the text embedding.

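A quick sketch of the sinusoidal projection mentioned in the list above, following the standard diffusion-timestep embedding; applying the same scheme to the FPS value and the MLP widths are assumptions here:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim=256, max_period=10000):
    """Standard transformer-style sinusoidal embedding for a batch of scalars
    (diffusion timestep or FPS); dim and max_period are illustrative defaults."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float().unsqueeze(-1) * freqs                           # (batch, half)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)     # (batch, dim)

# The sinusoidal vector is typically mapped by a small MLP before being added to
# the ResBlock features (the MLP widths here are an assumption).
fps_mlp = nn.Sequential(nn.Linear(256, 1024), nn.SiLU(), nn.Linear(1024, 1024))
emb = fps_mlp(sinusoidal_embedding(torch.tensor([8.0, 16.0, 24.0])))  # e.g. FPS values
```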
## Experiments
### Datasets
- **Training**: LAION-COCO 600M (images) + WebVid-10M (videos)
### Results

## Misc
- Implementation of temporal attention: fold the spatial dimensions into the batch and run self-attention over the frame axis (see the sketch below)
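A sketch of how temporal attention is typically implemented in these video UNets: $H \times W$ is folded into the batch so each spatial location attends only across frames. The learned positional embedding, normalization placement, and sizes are illustrative rather than the exact VideoCrafter code.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention along the frame axis only; spatial positions are folded
    into the batch dimension (names and sizes are illustrative)."""

    def __init__(self, dim, heads=8, max_frames=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.pos = nn.Parameter(torch.zeros(max_frames, dim))  # learned temporal pos. embedding

    def forward(self, x):
        # x: (B, C, T, H, W) feature map from the preceding spatial layers
        b, c, t, h, w = x.shape
        # fold H*W into the batch: each spatial location attends over time
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        y = self.norm(y) + self.pos[:t]
        y, _ = self.attn(y, y, y)
        y = y.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        return x + y  # residual connection around the temporal transformer
```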
---
# VideoCrafter2
* **Link:** [pdf](https://arxiv.org/pdf/2401.09047)
* **Conference / Journal:**
* **Authors:** Tencent AI Lab
* **Comments:**
## Introduction
> We explore the **training scheme** of video models extended from Stable Diffusion and investigate the feasibility of **leveraging low-quality videos and synthesized high-quality images** to obtain a high-quality video model
> We observe that **full training of all modules** results in a stronger coupling between spatial and temporal modules than only training temporal modules
## Method
* **Spatial-temporal Connection Analyses**
* Spatial Perturbation
* Temporal Perturbation
* --> Perturbing one type of module and inspecting the output reveals the coupling strength between spatial and temporal modules
* Propose to disentangle motion (temporal) from appearance (spatial) at the data level.
* Fully train the video model on low-quality videos first, then directly fine-tune only the spatial modules on synthesized high-quality images (see the sketch below).
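A minimal sketch of that second stage, assuming the temporal modules can be identified by a name substring; the `"temporal"` match, the optimizer, and the learning rate are illustrative, not the paper's exact recipe.

```python
import torch

def build_spatial_finetune_optimizer(unet, lr=1e-5):
    """Freeze the temporal modules of the fully trained video UNet and return an
    optimizer over the spatial modules only, for fine-tuning on high-quality images."""
    spatial_params = []
    for name, p in unet.named_parameters():
        if "temporal" in name.lower():
            p.requires_grad_(False)      # keep the learned motion prior intact
        else:
            p.requires_grad_(True)
            spatial_params.append(p)
    return torch.optim.AdamW(spatial_params, lr=lr)
```

Training then proceeds on image batches treated as single-frame clips, leaving the motion learned in the first (low-quality video) stage untouched.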
## Experiments
### Datasets
### Results

## Misc