---
# System prepended metadata

title: Tune A Video

---

# Tune A Video

* **Link:** [[pdf]](https://arxiv.org/pdf/2212.11565)
* **Authors:** Jay Zhangjie Wu et al.: NUS + Tencent
* **Comments:** ICCV 2023 [[project page]](https://tuneavideo.github.io/)

## Introduction

> New T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented.

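- Built on a SoTA T2I diffusion model pretrained on large-scale image data
- Observations
    - T2I models can generate still images that represent verb terms (i.e., actions)
    - Extending T2I models to generate multiple images concurrently shows surprisingly good content consistency across them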


## Method
**Pipeline**
![image](https://hackmd.io/_uploads/SkTPXU4H0.png)


- **Model fine-tuning**:
    - Fine-tune on the given input video.
    - ST-Attn (spatio-temporal attention): fix $W^K$ and $W^V$ and update only $W^Q$, since the goal is to query relevant positions in previous frames.
![image](https://hackmd.io/_uploads/SyOQ4UEB0.png)
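
The ST-Attn idea above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration and not the authors' implementation: the function name `st_attention`, the toy shapes, and the choice to let the first frame use itself as its "previous" frame are all assumptions. Following the paper, each frame forms queries from its own tokens and keys/values from the first frame and the immediately preceding frame.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def st_attention(frames, w_q, w_k, w_v):
    """Hypothetical single-head sketch of ST-Attn.

    frames: (F, N, D) array of F frames with N latent tokens each.
    Returns an (F, N, D_head) array.
    """
    outputs = []
    for i in range(frames.shape[0]):
        # Assumption: frame 0 has no predecessor, so it reuses itself.
        prev = frames[max(i - 1, 0)]
        # Keys/values come from the first frame and the previous frame.
        context = np.concatenate([frames[0], prev], axis=0)  # (2N, D)
        q = frames[i] @ w_q   # W^Q: the only projection updated in fine-tuning
        k = context @ w_k     # W^K: kept frozen
        v = context @ w_v     # W^V: kept frozen
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # (N, 2N)
        outputs.append(attn @ v)
    return np.stack(outputs)

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 6, 8))  # 4 frames, 6 tokens, 8 channels (toy sizes)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = st_attention(frames, w_q, w_k, w_v)
print(out.shape)
```

Because each frame only looks at the first and the previous frame, the cost per frame stays constant as the clip grows, and later frames cannot influence earlier ones.
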
- **Inference**:
    - Incorporate structure guidance from the source video (latent noise obtained via DDIM inversion).

- **Application**:
    - Object editing: change the object in the text prompt.
    - Style transfer: add a style description to the prompt.
    - Personalized and controllable generation: use a DreamBooth or ControlNet T2I model.
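
The structure guidance used at inference comes from DDIM inversion of the source video latents: the deterministic DDIM update is run in the noising direction, and the resulting noise is used as the starting point for sampling with the edited prompt. The sketch below is a toy illustration under stated assumptions, not the paper's code: `ddim_invert`, `ddim_sample`, the linear `alphas_cumprod` schedule, and the constant noise predictor `toy_eps_fn` (standing in for the fine-tuned denoising network) are all hypothetical.

```python
import numpy as np

def ddim_invert(x0, eps_fn, alphas_cumprod):
    """Run deterministic DDIM steps from clean latents toward noise."""
    x = x0
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_fn(x, t)
        x0_pred = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1 - a_next) * eps
    return x

def ddim_sample(x_T, eps_fn, alphas_cumprod):
    """Run deterministic DDIM steps from noise back toward clean latents."""
    x = x_T
    for t in range(len(alphas_cumprod) - 1, 0, -1):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = eps_fn(x, t)
        x0_pred = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev) * eps
    return x

rng = np.random.default_rng(0)
source_latents = rng.normal(size=(4, 6, 8))    # toy (frames, tokens, channels)
alphas_cumprod = np.linspace(0.999, 0.01, 50)  # assumed schedule, index 0 = least noisy

fixed_eps = rng.normal(size=source_latents.shape)

def toy_eps_fn(x, t):
    # Stand-in for the denoising network; constant so the round trip is exact.
    return fixed_eps

noisy = ddim_invert(source_latents, toy_eps_fn, alphas_cumprod)
recon = ddim_sample(noisy, toy_eps_fn, alphas_cumprod)
print(np.allclose(recon, source_latents))
```

With a real network the round trip is only approximate, and sampling is conditioned on the edited prompt rather than the source prompt, so the inverted noise preserves the source layout and motion while the content follows the new text.
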

## Misc

Implementation idea with egocentric video data:
- Given a video (or a set of videos) of an action, fine-tune on it.
- Run inference with and without structure guidance.