# SEER
- Inflate Stable Diffusion along temporal axis.
- Egocentric datasets benchmark.
## Architecture
This paper uses techniques similar to prior work for inflating the 2D Stable Diffusion U-Net along the temporal axis:
- 2D conv (3x3) --> 3D conv (1x3x3), as in prior work (e.g. VDM)
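A minimal sketch of this inflation (the helper name `inflate_conv2d` is mine, not from the paper): with a (1, 3, 3) kernel, the pretrained 2D weights are reused unchanged by unsqueezing a temporal dimension, so the inflated conv initially behaves identically per frame.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Lift a pretrained 3x3 Conv2d to a 1x3x3 Conv3d (VDM-style inflation)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(1, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(0, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, 1, kH, kW): same output on each frame
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2))
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

Because the temporal kernel size is 1, no cross-frame mixing happens in the convs; temporal modeling is left to the attention layers below.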
## Temporal attention: Use window temporal Attention
Compared with related work:
- VDM: attention across the temporal dimension (spatial axes treated as batch)
- Tune-A-Video: self-temporal attention similar to VDM; adds ST-Attn, where the current frame attends to the first and previous frames
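The VDM-style baseline above can be sketched as follows (my own illustrative module, not the paper's code): fold the spatial axes into the batch and run attention over the frame axis only.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """VDM-style temporal attention: spatial locations become batch items,
    attention runs across frames only."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # (B, C, T, H, W) -> (B*H*W, T, C): each pixel attends over its frames
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
```

A windowed variant, as used here, would restrict each frame's attention to a local temporal window instead of all frames.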
## Frame Sequential Text Decomposer
- The authors argue that a single global text guidance is too strong for frame-wise prediction.
- Propose the Frame Sequential Text (FSText) Decomposer: decompose the text into sub-instructions, one per frame.

- Text Seq-Attn: captures global dependencies across text tokens (BERT-like)
- Cross-Attn: ensures semantic inheritance (from CLIP)
- Temporal-Attn: enforces temporal consistency across frames
The FSText decomposer is first initialized (optimized) as in (a), so its output stays identical to CLIP's, then trained with the diffusion loss as in (b).
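The three attention components above might be stacked as in the following sketch; this is an assumption-laden illustration (the class name, learned frame queries, and layer ordering are mine), not the paper's implementation.

```python
import torch
import torch.nn as nn

class FSTextDecomposer(nn.Module):
    """Hypothetical sketch of the FSText decomposer: map a global CLIP text
    token sequence to per-frame sub-instruction embeddings."""
    def __init__(self, dim: int, num_frames: int, heads: int = 4):
        super().__init__()
        # One learned query per output frame (illustrative design choice)
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim))
        # Text Seq-Attn: global dependencies across text tokens (BERT-like)
        self.seq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-Attn: frame queries inherit semantics from the CLIP tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal attention over the frame axis for consistency
        self.temp_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, L, D) CLIP token embeddings
        b = text_tokens.shape[0]
        ctx, _ = self.seq_attn(text_tokens, text_tokens, text_tokens)
        q = self.frame_queries.unsqueeze(0).expand(b, -1, -1)
        frames, _ = self.cross_attn(q, ctx, ctx)
        frames, _ = self.temp_attn(frames, frames, frames)
        return frames  # (B, T, D): one sub-instruction embedding per frame
```

Under the two-stage schedule in the notes, such a module would first be optimized to reproduce the CLIP embedding before the diffusion loss is applied.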