# SEER

- Inflates Stable Diffusion along the temporal axis.
- Benchmarked on egocentric datasets.

## Architecture

This paper uses techniques similar to prior work for lifting Stable Diffusion from 2D to 3D:

- 2D conv (3x3) --> 3D conv (1x3x3), similar to prior work, e.g. VDM.

## Temporal attention

Uses windowed temporal attention.

Compared with related work:

- VDM: attention across the temporal dimension (spatial axes treated as batch).
- Tune-A-Video: self temporal attention similar to VDM; adds ST-Attn, where the current frame attends to the first and previous frames.

## Frame Sequential Text Decomposer

- The authors argue that a single global text guidance is too strong.
- They propose the Frame Sequential Text (FSText) Decomposer: decompose the text into sub-instructions, one per frame.

![image](https://hackmd.io/_uploads/BJ-jUzqKR.png)

- Text Seq-Attn: captures global dependencies (BERT-like).
- Cross-Attn: ensures semantic inheritance (from CLIP).
- Temporal attention: enforces temporal consistency.

The FSText Decomposer is first initialized (optimized) as in (a) so that its output stays identical to the CLIP embedding, then trained with the diffusion loss as in (b).
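
The 2D-to-3D conv lifting mentioned above can be sketched as follows. This is a minimal illustration (not the paper's code): a pretrained 3x3 2D conv is inflated into a (1,3,3) 3D conv by inserting a singleton temporal dimension into the kernel, so each frame initially gets the same output as the original 2D layer.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Inflate a 2D conv (3x3) into a 3D conv (1x3x3), preserving per-frame output."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(1, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(0, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kh, kw) -> (out, in, 1, kh, kw)
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2))
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv2d = nn.Conv2d(4, 8, kernel_size=3, padding=1)
conv3d = inflate_conv2d(conv2d)

x = torch.randn(2, 4, 5, 16, 16)   # (batch, channels, frames, H, W)
y3d = conv3d(x)                    # (2, 8, 5, 16, 16)
y2d = conv2d(x[:, :, 0])           # 2D conv applied to frame 0 alone
assert torch.allclose(y3d[:, :, 0], y2d, atol=1e-5)
```

Because the temporal kernel size is 1, the inflated network reproduces the pretrained image model frame by frame at initialization; temporal mixing is then left to the attention layers.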
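
The windowed temporal attention can be sketched as below. This is an illustrative implementation, not the paper's: spatial positions are folded into the batch dimension (as in VDM), and an attention mask restricts each frame to frames within a local window of size `window` (a name chosen here for illustration).

```python
import torch
import torch.nn as nn

class WindowTemporalAttention(nn.Module):
    """Temporal self-attention with spatial axes as batch and a local frame window."""
    def __init__(self, dim: int, heads: int = 4, window: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # fold spatial positions into the batch; time becomes the sequence axis
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        # boolean mask: True = blocked; block frames farther than `window` steps
        idx = torch.arange(t)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(seq, seq, seq, attn_mask=mask)
        return out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)

attn = WindowTemporalAttention(dim=8, heads=2, window=1)
x = torch.randn(1, 8, 6, 4, 4)
y = attn(x)
assert y.shape == x.shape
```

Setting `window` larger than the number of frames recovers full VDM-style temporal attention; a small window trades global temporal context for lower cost.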
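
The FSText Decomposer idea — expanding one global text embedding into per-frame sub-instructions — could be sketched as follows. The module names, the learned frame queries, and the exact attention stack here are assumptions for illustration, not the paper's implementation: learned per-frame queries cross-attend to the CLIP token sequence (semantic inheritance), then attend to each other over time (temporal consistency).

```python
import torch
import torch.nn as nn

class FSTextDecomposer(nn.Module):
    """Illustrative sketch: one global text embedding -> per-frame sub-instructions."""
    def __init__(self, dim: int, num_frames: int, heads: int = 4):
        super().__init__()
        # one learned query per frame (hypothetical design choice)
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim))
        # Cross-Attn: frame queries attend to the CLIP token sequence
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal attention: sub-instructions attend to each other
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, num_tokens, dim) from a CLIP-like text encoder
        b = text_tokens.shape[0]
        q = self.frame_queries.unsqueeze(0).expand(b, -1, -1)
        sub, _ = self.cross_attn(q, text_tokens, text_tokens)
        sub, _ = self.temporal_attn(sub, sub, sub)
        return sub  # (batch, num_frames, dim): one conditioning vector per frame

dec = FSTextDecomposer(dim=16, num_frames=8, heads=4)
tokens = torch.randn(2, 10, 16)
per_frame = dec(tokens)
assert per_frame.shape == (2, 8, 16)
```

In the two-stage scheme from the note, such a module would first be optimized so its outputs match the CLIP embedding (stage (a)), then fine-tuned end-to-end under the diffusion loss (stage (b)).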