# SORA
## Justin Lu
---
# Contents
- Introduction to SORA
- Architecture of SORA
- Model training
---
# What is SORA?
----
- Made by OpenAI
- text-to-video model
- can generate videos up to 1 minute long
----
# What can SORA do?
----
- text-to-video
- image-to-video
- video extension
- video style transformation / transition
----
## Example (From OpenAI)
{%youtube HK6y8DAPN_0%}
(1:56~2:06)
---
# Architecture
----
**SORA = VAE encoder + DiT + VAE decoder + CLIP**

----
## Spacetime latent patches
- **"Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens."**
- the visual counterpart of text tokens
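
A minimal NumPy sketch (hypothetical patch sizes) of turning a compressed latent video into spacetime-patch tokens:

```python
import numpy as np

def extract_spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Split a compressed latent video (T, H, W, C) into spacetime patches,
    flattening each patch into one transformer token (patch sizes are made up)."""
    T, H, W, C = latent.shape
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # group the patch dims together
    return x.reshape(-1, pt * ph * pw * C)      # (num_tokens, token_dim)

# e.g. a 16-frame 32x32 latent with 4 channels -> 2048 tokens of dimension 32
tokens = extract_spacetime_patches(np.zeros((16, 32, 32, 4)))
print(tokens.shape)   # (2048, 32)
```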

----
## Latent Space Representation

----
#### Example : 3D data positioned at 2D latent space

- A position within the latent space can be viewed as being defined by a set of latent variables that emerge from the resemblances among the objects.
----
### Embedding
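
A rough illustration with a toy vocabulary: an embedding maps each discrete token to a learned vector, i.e. a point in the latent space:

```python
import numpy as np

vocab_size, dim = 10, 4                              # toy sizes, illustration only
embedding_table = np.random.randn(vocab_size, dim)   # one learned vector per token

token_ids = np.array([3, 7, 3])                      # a tiny sequence of token ids
vectors = embedding_table[token_ids]                 # lookup -> (3, 4) array
print(vectors.shape)
```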


----
## VAE(Variational Autoencoder)
- Encoding/Decoding of spacetime latent patches
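
A toy sketch of the VAE idea (random weights, no training): encode to a mean and log-variance, sample a latent with the reparameterization trick, decode back:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    return x @ W_mu, x @ W_logvar                 # mean and log-variance of q(z|x)

def reparameterize(mu, logvar):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps        # z = mu + sigma * eps

def decode(z, W_dec):
    return z @ W_dec                              # reconstruction of the input

x = rng.standard_normal((1, 8))                   # a toy 8-dim "patch"
W_mu, W_logvar = rng.standard_normal((8, 2)), rng.standard_normal((8, 2))
W_dec = rng.standard_normal((2, 8))

mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar)                    # compressed 2-dim latent
x_hat = decode(z, W_dec)                          # back to the original dimension
```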

----
## CLIP
- Contrastive Language-Image Pre-training
- learns the relationship between images and natural language
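
A minimal sketch of the contrastive idea behind CLIP, using random toy embeddings; after training, the matching image-text pairs (the diagonal) get the highest similarity:

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# toy embeddings for 3 image-text pairs (embedding dim 4), random for illustration
img = l2_normalize(np.random.randn(3, 4))
txt = l2_normalize(np.random.randn(3, 4))

logits = img @ txt.T / 0.07                       # cosine similarity / temperature
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.round(2))   # after training, row i should peak at column i
```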

----
## DiT(Diffusion Transformer)
- Diffusion Model + Transformer
- the core network of the Diffusion Model is replaced with a Transformer
----
- DiT follows ViT: the compressed latent features are converted into a sequence of tokens

----
### ViT:Vision Transformer
- applies the Transformer architecture to images
- captures long-range, pixel-level interactions

----

- Patchify : converts the spatial input into a sequence of tokens.
- Layer norm : normalization (speeds up the training process).
- Transformer decoder : decodes the sequence of image tokens into an output noise prediction.
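
A very small NumPy sketch of one DiT-style block over patch tokens (single head, toy MLP, made-up dimensions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x):
    # single head with identity Q/K/V projections, purely for illustration
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return weights @ x

def dit_block(tokens):
    tokens = tokens + self_attention(layer_norm(tokens))   # attention + residual
    tokens = tokens + np.tanh(layer_norm(tokens))          # toy MLP + residual
    return tokens

tokens = np.random.randn(64, 16)     # 64 noisy latent patch tokens, hidden size 16
for _ in range(4):                   # a stack of blocks
    tokens = dit_block(tokens)
noise_pred = tokens                  # a final linear head would map back to patches
```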
---
## Diffusion Model
----

#### Forward process: add noise
#### Reverse process: denoise
----
## Algorithm of Diffusion Model

- Training : train a noise predictor that can correctly predict the sampled noise
- Sampling : start from a noisy image and progressively denoise it into a clear one
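
A compact DDPM-style sketch of both parts, with a placeholder noise predictor standing in for the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)                 # noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def noise_predictor(x_t, t):
    return np.zeros_like(x_t)                      # placeholder for eps_theta(x_t, t)

def training_step(x0):
    # pick a random timestep, corrupt x0 with known noise, regress that noise
    t = rng.integers(T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
    return np.mean((eps - noise_predictor(x_t, t)) ** 2)

def sample(shape):
    # start from pure noise and remove the predicted noise step by step
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = noise_predictor(x, t)
        x = (x - betas[t] / np.sqrt(1 - alphas_bar[t]) * eps_hat) / np.sqrt(1 - betas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```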
----
### Maximum Likelihood Estimation
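
- In general, MLE picks the parameters that make the observed data most probable:

$$\hat{\theta} = \arg\max_{\theta} \sum_{i} \log p_{\theta}(x_i)$$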

----
### Forward process
$q(x_t|x_{t-1})$ : given $x_{t-1}$, add noise to it to get $x_t$
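
- in the usual DDPM notation, each forward step adds Gaussian noise with schedule $\beta_t$ :

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)$$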

----
### Step by step?

----

- $x_t$ can be obtained directly from $x_0$, without going step by step
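- with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, the standard closed form is:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)$$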
----
### Reverse process (denoise)
- the model is trained by maximizing a lower bound on the log-likelihood
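- the standard variational lower bound being maximized:

$$\log p_\theta(x_0) \ \ge\ \mathbb{E}_{q(x_{1:T}\mid x_0)}\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)}\right]$$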

----
### Minimize KL divergence

- minimize the KL divergence between $q(x_{t-1}|x_t,x_0)$ and the model's $p_\theta(x_{t-1}|x_t)$
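- in DDPM, this per-step KL term reduces to a simple noise-prediction loss:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$$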

----

## Gradient Descent
- for finding a local minimum of a differentiable function

$\eta$ : learning rate
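
- the standard update rule:

$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L(\theta_t)$$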
----

### AdaGrad
- "Adaptive" version of Gradient Descent
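- the standard per-parameter update, with $g_t = \nabla_\theta L(\theta_t)$ :

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\sum_{i=1}^{t} g_i^2} + \epsilon}\; g_t$$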

---
## Transformer

----
## Encoder

----

- Multi-Head Attention : the self-attention block
- Add & Norm : residual connection + layer normalization
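
- the standard scaled dot-product attention used inside each head:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$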
----
## Decoder

- outputs the token with the maximum probability at each step.
----
- Masked Self-attention : each position can only attend to earlier positions
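
A tiny NumPy sketch of causal masking (toy sizes): positions above the diagonal are blocked so a token cannot attend to the future:

```python
import numpy as np

x = np.random.randn(5, 8)                          # 5 tokens, hidden size 8 (toy)
scores = x @ x.T / np.sqrt(x.shape[-1])
mask = np.triu(np.ones_like(scores), k=1)          # 1s above the diagonal = future
scores = np.where(mask == 1, -1e9, scores)         # block attention to the future
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = weights @ x                                  # each row mixes only past tokens
```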

---
## Better data for training
----
### Longer, more detailed captions
- an LLM generates detailed prompts
- re-captioning technique from DALL-E 3
----

- SSC (short synthetic captions)
- DSC (descriptive synthetic captions)
----
### Native aspect ratios
- uses videos at their original size and aspect ratio as input
- unifies all images and videos into token sequences (patches)

---
### For image-to-video/video extension
- text-to-video : an LLM/CLIP produces text embeddings that serve as the condition for the DiT.
- with a proper embedding, any type of data can serve as the condition for the DiT.
{"contributors":"[{\"id\":\"4cfe44cd-c690-4918-adee-3ac14d148d7b\",\"add\":6524,\"del\":940}]","title":"生成式AI-簡報","description":"Introduction to SORA"}