# Text to Image Project
Objective:
Gain deep insight into language models and generative models, and understand diffusion-based and transformer-based models from both theoretical and practical perspectives.
# Abstract
- [ ] Why are we doing this?
- [ ] How are we doing this?
# Related Work
# Background Theory
## Vision Modality
### Diffusion
* What is diffusion?
* Theory of diffusion (latent variable models)
* Forward process
* Backward process
* Training
* Inference
* Problems related to improvements (speed up, stable training, etc.)
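The forward and backward processes above can be sketched concretely. Below is a minimal numpy sketch of the DDPM-style closed-form forward process and the simplified noise-prediction training objective; the schedule values, toy batch, and the zero "predictor" standing in for the network eps_theta are all illustrative assumptions, not a real implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (DDPM-style); T and the beta range are illustrative.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product, shrinks toward 0

def forward_diffuse(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

# One "training" step: a real network eps_theta(x_t, t) would be trained to
# predict eps; here a dummy zero predictor just shows the loss shape.
x0 = rng.standard_normal((4, 8))       # toy batch of flattened "images"
t = int(rng.integers(0, T))            # random timestep
xt, eps = forward_diffuse(x0, t, rng)
eps_pred = np.zeros_like(eps)          # placeholder for eps_theta(x_t, t)
loss = np.mean((eps - eps_pred) ** 2)  # simplified DDPM objective
```

Sampling (the backward process) would iterate this in reverse, denoising x_T step by step with the trained eps_theta.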
## Language Modality
### Word Embedding
* Word2Vec
* Training
* Inference
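As a concrete reference for the training/inference bullets, here is a toy numpy sketch of skip-gram Word2Vec with a single negative sample per step. The corpus, embedding size, learning rate, and window size are arbitrary assumptions, and the negative sampler is deliberately naive (it may even draw the true context word); a real implementation would use frequency-based sampling and subword handling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus and vocabulary; everything here is illustrative.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16                      # vocab size, embedding dim

W_in = 0.1 * rng.standard_normal((V, D))   # "input" (target) embeddings
W_out = 0.1 * rng.standard_normal((V, D))  # "output" (context) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negative, lr=0.05):
    """One SGD step of skip-gram with one (naive) negative sample."""
    v = W_in[center].copy()
    for ctx, label in ((context, 1.0), (negative, 0.0)):
        u = W_out[ctx].copy()
        grad = sigmoid(v @ u) - label   # d(logistic loss)/d(score)
        W_out[ctx] -= lr * grad * v
        W_in[center] -= lr * grad * u

# Training: slide a window of size 1 over the corpus.
for _ in range(200):
    for i in range(1, len(corpus) - 1):
        c = word2id[corpus[i]]
        for j in (i - 1, i + 1):
            sgns_step(c, word2id[corpus[j]], int(rng.integers(V)))

# "Inference": nearest neighbours by cosine similarity of the embeddings.
def most_similar(word, k=3):
    v = W_in[word2id[word]]
    sims = W_in @ v / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(v))
    return [vocab[i] for i in np.argsort(-sims) if vocab[i] != word][:k]
```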
### Deep learning models used in NLP
* Recurrent neural networks (RNNs)
* Long short-term memory (LSTM) / gated recurrent unit (GRU): address the limitations of RNNs (vanishing/exploding gradients)
* Attention
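The attention bullet can be made concrete with scaled dot-product attention, the building block reused by the transformer below. This is a minimal numpy sketch (shapes and random inputs are illustrative): each query attends over all keys, and the softmax weights form a distribution over the values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 queries, d_k = 4
K = rng.standard_normal((5, 4))   # 5 keys
V = rng.standard_normal((5, 4))   # 5 values
out, w = scaled_dot_product_attention(Q, K, V)
```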
### Transformer
* Architecture (encoder/decoder)
* Training
* Inference
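One transformer-specific detail worth sketching for the architecture bullet is the sinusoidal positional encoding, since attention alone is permutation-invariant. A small numpy sketch (sequence length and model dimension are arbitrary; assumes an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
```

The encoding is added to the token embeddings before the first encoder/decoder layer, giving each position a unique, smoothly varying signature.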
## Vision-Language Model
How to simultaneously process two input modalities (images and text), i.e. conditional generation.
Some useful references
* CLIP (OpenAI)
* Imagen
* DALL-E 2
* GLIDE
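A key mechanism in GLIDE and Imagen for steering the diffusion with text is classifier-free guidance: the model predicts noise both with and without the text condition, and the two predictions are combined. A minimal numpy sketch (the random arrays stand in for the two network outputs, and the guidance weight is an illustrative choice):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine unconditional and text-conditional noise predictions:
    eps = eps_uncond + w * (eps_cond - eps_uncond).
    w = 0 is purely unconditional, w = 1 purely conditional, and
    w > 1 amplifies the text conditioning."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((4, 8))   # stands in for eps_theta(x_t, ∅)
eps_c = rng.standard_normal((4, 8))   # stands in for eps_theta(x_t, text)
eps = classifier_free_guidance(eps_u, eps_c, w=3.0)
```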
### Encode Text (formulate 'label' to guide diffusion)
### Generate Image
### Evaluation
Automatic: FID, IS (are the generated images realistic?)
R-precision (do the images match the text description?)
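FID itself is simple to compute once real and generated images have been mapped to feature vectors (Inception features in practice; random toy features below): fit a Gaussian to each set and compare. This pure-numpy sketch uses the identity Tr((S1 S2)^{1/2}) = Tr((S1^{1/2} S2 S1^{1/2})^{1/2}) so only symmetric PSD matrix square roots are needed; the toy data and shift are illustrative.

```python
import numpy as np

def sqrtm_psd(A):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)   # guard tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    diff = mu1 - mu2
    s1_half = sqrtm_psd(sigma1)
    covmean = sqrtm_psd(s1_half @ sigma2 @ s1_half)  # symmetric PSD form
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

def stats(x):
    return x.mean(axis=0), np.cov(x, rowvar=False)

# Toy stand-ins for Inception features of real vs generated images.
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))
fake = rng.standard_normal((500, 8)) + 0.5   # shifted distribution
score = fid(*stats(real), *stats(fake))
```

Identical distributions give a score near zero; the mean shift above makes it clearly positive.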
Overleaf: