# SORA

## Justin Lu

---

# Contents

- Introduction to SORA
- Architecture of SORA
- Model training

---

# What is SORA?

----

- Made by OpenAI
- A text-to-video model
- Can generate videos up to one minute long

----

# What can SORA do?

----

- Text-to-video
- Image-to-video
- Video extension
- Video style transformation / transition

----

## Example (from OpenAI)

{%youtube HK6y8DAPN_0%}

(1:56~2:06)

---

# Architecture

----

**SORA = VAE encoder + DiT + VAE decoder + CLIP**

![IMG_0932](https://hackmd.io/_uploads/SJEI5ZmlA.jpg =100%x)

----

## Spacetime latent patches

- **"Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens."**
- The visual counterpart of text tokens

![Screenshot 2024-04-10 030058](https://hackmd.io/_uploads/BkOilfmx0.png)

----

## Latent Space Representation

![0_kHJ_LsPi-jz_CreZ](https://hackmd.io/_uploads/ByXaTmug0.png)

----

#### Example: 3D data positioned in a 2D latent space

![bAMl5](https://hackmd.io/_uploads/HJLGyNuxA.png =50%x)![L5uRO](https://hackmd.io/_uploads/By6f1V_gA.png =50%x)

- A position within the latent space can be viewed as being defined by a set of latent variables that emerge from the resemblances between the objects.

----

### Embedding

![Screenshot 2024-04-13 234705](https://hackmd.io/_uploads/ByVQKX_lC.png)

![pasted image 0 (1)](https://hackmd.io/_uploads/Hy21W4_gR.png)

----

## VAE (Variational Autoencoder)

- Encoding/decoding of spacetime latent patches

![Screenshot 2024-04-14 005347](https://hackmd.io/_uploads/r1yxt4_eC.png =100%x)

----

## CLIP

- Contrastive Language-Image Pre-training
- Learns the relation between images and natural language

![image](https://hackmd.io/_uploads/r1CTWxjlA.png =130%x)

----

## DiT (Diffusion Transformer)

- Diffusion model + Transformer
- Replaces the core network of the diffusion model with a Transformer

----

- DiT follows ViT and converts the compressed latent features into tokens

![Screenshot 2024-04-10 015622](https://hackmd.io/_uploads/BytK-ZQgR.png =70%x)

----

### ViT: Vision Transformer

- Applies the Transformer architecture to images
- Captures long-range, pixel-level interactions

![IMG_0934](https://hackmd.io/_uploads/B1lK2bXe0.jpg =85%x)

----

![Screenshot 2024-04-10 014838](https://hackmd.io/_uploads/ryS9JWQlR.png =80%x)

----

- Patchify: converts the spatial input into a sequence of tokens.
- Layer norm: normalization that stabilizes and speeds up training.

![image](https://hackmd.io/_uploads/rJBUfr3eA.png)

- Transformer decoder: decodes the sequence of image tokens into an output noise prediction.

---

## Diffusion Model

----

![image](https://hackmd.io/_uploads/ryceTi9gA.png)

#### Forward process: add noise
#### Reverse process: denoise

----

## Algorithm of the Diffusion Model

![Screenshot 2024-04-10 010324](https://hackmd.io/_uploads/SkL7HgmlC.png)

Training: train a noise predictor that can correctly predict the sampled noise.

Sampling: start from a noisy image and progressively denoise it into a clear one.

----

### Maximum Likelihood Estimation

![image](https://hackmd.io/_uploads/H1J4Lb6xR.png =80%x)![image](https://hackmd.io/_uploads/SkOURg6eR.png =60%x)

----

### Forward process

$q(x_t|x_{t-1})$ = given $x_{t-1}$, add noise to it to obtain $x_t$ (see the code sketch on the next slide)

![image](https://hackmd.io/_uploads/SkxLk4al0.png)
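----

### Forward process in code

A minimal sketch of the step-by-step forward process above, assuming a linear beta schedule and toy tensor shapes; the names (`forward_step`, `betas`) are illustrative, not taken from any official implementation.

```python
import torch

# Linear beta schedule: beta_t controls how much noise is injected at step t
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """One step of q(x_t | x_{t-1}): scale the previous sample and add Gaussian noise."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

x = torch.randn(1, 3, 64, 64)    # stand-in for a clean image x_0
for t in range(T):
    x = forward_step(x, t)       # after many steps x is close to pure Gaussian noise
```

Running all $T$ steps one by one is exactly what the next slide questions: the closed form lets us jump from $x_0$ to $x_t$ in a single step.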
----

### Step by step?

![image](https://hackmd.io/_uploads/SJmgeNTx0.png)

----

![image](https://hackmd.io/_uploads/HyY4l46gC.png)

----

![IMG_1032 (1) (2)](https://hackmd.io/_uploads/rJZv6i2gA.png =80%x)

- Get $x_t$ directly from $x_0$

----

### Reverse process (denoise)

- The lower bound that the model needs to maximize

![image](https://hackmd.io/_uploads/SJvFW-plC.png)

----

### Minimize KL divergence

![image](https://hackmd.io/_uploads/rkMA7V6lA.png)

- $q(x_{t-1}|x_t,x_0)$

![image](https://hackmd.io/_uploads/H17lBNTxC.png =100%x)

----

![IMG_1015 (1)](https://hackmd.io/_uploads/SyVJ2s2gC.png)

----

![IMG_1035](https://hackmd.io/_uploads/H1nGZ32gC.png)

----

## Gradient Descent

- For finding a local minimum of a differentiable function

![image](https://hackmd.io/_uploads/rJGiHk3l0.png =100%x)

$\eta$ : learning rate

----

![gradient_descent_small_steps](https://hackmd.io/_uploads/BJlSnN3eC.png)

----

![image](https://hackmd.io/_uploads/SyruHJ3eC.png)

----

### AdaGrad

- "Adaptive" version of gradient descent

![image](https://hackmd.io/_uploads/r1lRmMhgC.png)

---

## Transformer

![0QSIATv](https://hackmd.io/_uploads/HJtQBZ3gC.png)

----

## Encoder

![IFJLy5G](https://hackmd.io/_uploads/Bk04L-neA.png =47%x)![rvUzp4R](https://hackmd.io/_uploads/S1M48-3lA.png =48%x)

----

![mHkgERd](https://hackmd.io/_uploads/ByU_Db3lA.png =80%x)

- Multi-Head Attention: self-attention block
- Add & Norm: residual connection + layer normalization

----

## Decoder

![File_000 (1)](https://hackmd.io/_uploads/rJrwRznxA.png =70%x)

- Outputs the token with the highest probability.

----

- Masked self-attention

![File_000](https://hackmd.io/_uploads/H15daM3gA.png)

---

## Better data for training

----

### Longer, more detailed captions

- An LLM generates detailed prompts
- Re-captioning technique from DALL-E

----

![Screenshot 2024-04-10 031205](https://hackmd.io/_uploads/HkTQ7M7xA.png =80%x)

- SSC (short synthetic captions)
- DSC (descriptive synthetic captions)

----

### Native aspect ratios

- Uses the original video size and aspect ratio as input
- Unifies all images and videos into token sequences (patches) (a patchify sketch appears at the end of the deck)

![image](https://hackmd.io/_uploads/SJAulG2gC.png)

---

### For image-to-video / video extension

- Text-to-video: uses an LLM/CLIP to turn text into the condition for DiT.
- Any type of data can serve as the condition for DiT through a proper embedding (sketched on the next slide).
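----

### Conditioning in code

A minimal sketch of the idea on the previous slide, in the style of DiT's adaLN conditioning: any modality is first mapped to an embedding vector, and that vector modulates the Transformer block. The class name and dimensions are hypothetical simplifications, not the actual SORA architecture.

```python
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """One Transformer block modulated by a condition embedding
    (e.g. a CLIP text embedding, an image embedding, or frames to extend)."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)  # condition -> modulation

    def forward(self, tokens, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        h, _ = self.attn(h, h, h)      # self-attention over spacetime patches
        return tokens + h              # residual connection

tokens = torch.randn(2, 1536, 512)     # (batch, spacetime patches, model dim)
text_cond = torch.randn(2, 768)        # e.g. a CLIP text embedding
block = ConditionedDiTBlock(dim=512, cond_dim=768)
print(block(tokens, text_cond).shape)  # torch.Size([2, 1536, 512])
```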
{"contributors":"[{\"id\":\"4cfe44cd-c690-4918-adee-3ac14d148d7b\",\"add\":6524,\"del\":940}]","title":"生成式AI-簡報","description":"Introduction to SORA"}
    89 views