# Text to Image Project

Objective: Gain deep insight into language models as well as generative models, and understand diffusion-based and transformer-based models from both theoretical and practical perspectives.

## Abstract

- [ ] Why do we do this?
- [ ] How do we do this?

# Related Work

# Background Theory

## Vision Modality

### Diffusion

* What is diffusion?
* Theory of diffusion (latent variable models)
* Forward process
* Backward process
* Training
* Inference
* Open problems and improvements (sampling speed-up, training stability, etc.)

## Language Modality

### Word Embedding

* Word2Vec
* Training
* Inference

### Deep learning models used in NLP

* Recurrent neural network (RNN)
* Long Short-Term Memory (LSTM) / Gated Recurrent Unit (GRU): address the limitations of RNNs (vanishing/exploding gradients)
* Attention

### Transformer

* Architecture (encoder/decoder)
* Training
* Inference

## Vision-Language Model

How to process the two input modalities (images and text) simultaneously (conditional generation).

Some useful references:

* CLIP (OpenAI)
* Imagen
* DALL-E 2
* GLIDE

### Encode Text (formulate a 'label' to guide diffusion)

### Generate Image

### Evaluation

Automatic metrics:

* FID, IS (are the images good or not?)
* R2-Score (do the images match the description or not?)

### Overleaf:
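The forward process listed under Diffusion above has a well-known closed form (the DDPM identity): given a noise schedule of betas, `x_t` can be sampled directly from `x_0` without simulating every intermediate step. A minimal numpy sketch, assuming a linear schedule; the function name and schedule values are illustrative, not from the source:

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) in one shot using the DDPM identity:
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of (1 - beta_s)."""
    rng = rng or np.random.default_rng()
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]        # \bar{alpha}_t for this timestep
    eps = rng.standard_normal(x0.shape)      # eps ~ N(0, I)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps                           # eps is the denoiser's training target

# Example: a linear noise schedule over T = 1000 steps (hypothetical values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
x0 = np.random.default_rng(0).standard_normal((3, 32, 32))  # toy "image"
xt, eps = forward_diffusion(x0, t=500, betas=betas)
```

During training, the model is shown `xt` and `t` and regresses the returned `eps`; this is why the closed form matters, as it lets each minibatch pick timesteps at random.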
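The attention bullet under "Deep learning models used in NLP" (and the core of the Transformer) is scaled dot-product attention: `softmax(QK^T / sqrt(d_k)) V`. A minimal single-head numpy sketch; the dimensions below are arbitrary toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k) similarity matrix
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted average of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))    # 4 queries, key dim d_k = 8
K = rng.standard_normal((6, 8))    # 6 keys
V = rng.standard_normal((6, 16))   # matching values, value dim d_v = 16
out = attention(Q, K, V)           # shape (4, 16)
```

The scaling by `sqrt(d_k)` keeps the logits from growing with dimension, which would otherwise saturate the softmax; multi-head attention simply runs several such maps in parallel on projected inputs.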
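On "Encode Text (formulate a 'label' to guide diffusion)": systems such as GLIDE and Imagen strengthen text conditioning at sampling time with classifier-free guidance, mixing conditional and unconditional noise predictions. A minimal sketch of just the mixing step, assuming `eps_cond` and `eps_uncond` come from the same denoiser run with and without the text embedding (the function name is illustrative):

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance:
        eps = eps_uncond + w * (eps_cond - eps_uncond)
    w = 0 ignores the text, w = 1 is the plain conditional model,
    and w > 1 pushes the sample further toward the text condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy check with random stand-ins for the two model outputs.
rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((3, 32, 32))
eps_uncond = rng.standard_normal((3, 32, 32))
guided = cfg_eps(eps_cond, eps_uncond, w=3.0)
```

Training simply drops the text embedding with some probability so one network learns both the conditional and unconditional predictions.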