<h1> Stable Diffusion - Image to Prompt </h1>

---

<h2> Problem Statement </h2>

The popularity of text-to-image models has spawned an entire new field of prompt engineering. Part art and part unsettled science, ML practitioners and researchers are rapidly grappling with the relationships between prompts and the images they generate. In this project we aim to reverse the typical direction of a generative text-to-image model: instead of generating an image from a text prompt, can you build a model that predicts the text prompt given a generated image? You will make predictions on a dataset containing a wide variety of (prompt, image) pairs generated by Stable Diffusion 2.0, in order to understand how reversible the latent relationship is.

<h2> Solution </h2>

The above problem statement can be broken down into two sub-problems:

1. Extract features from the image.
2. Use these features to generate the prompt embedding.

**Task 1**

Task 1 involves building an encoder model for your image in order to get a structure- and context-aware feature representation of the image. Various pretrained models can be used for this; another approach is to explore new architectures from recently published papers that involve a similar feature-extraction step in their pipeline. A minimal encoder sketch is given in the Example Sketches section at the end of this document.

**Task 2**

Task 2 aims to build a decoder architecture that maps the image features to a prompt embedding. In order to calculate prompt similarity in a robust way (meaning that "epic cat" is scored as similar to "majestic kitten" in spite of character-level differences), you will use embeddings of your predicted prompts for training. Whether you model the embeddings directly, or first predict prompts and then convert them to embeddings, is up to you! A sketch of a simple embedding-regression decoder is also given at the end of this document.

**Note**: The dataset contains target prompts as text, so they need to be converted to embedding vectors (see the prompt-embedding sketch at the end of this document). It is completely up to you to develop your own strategy for creating a training set of images, using pre-trained models, etc.

<h2> Timeline </h2>

**Week 1:** `May 22 2023 - May 28 2023`

* Get comfortable with the PyTorch framework.
* Understand concepts related to the problem statement:
  * **Basic CNN-based architectures:** VGG19, EfficientNet, ResNet, etc.
  * **Basic NLP techniques:** pre-processing, tokenisation, word-embedding generation, etc.
  * **Transformers:** the vanilla attention model and Vision Transformers.

**Week 2:** `May 29 2023 - June 5 2023`

* Download the dataset and prepare it for training.
* Figure out your approach to convert the prompts in the train set to embedding vectors.
* Implement a pipeline consisting of both the image feature extractor and the prompt-generator model, and get some baseline results.
* Explore recent research related to the problem statement that you can use for image feature extraction or prompt generation.

**Week 3:** `June 6 2023 - June 12 2023`

* Experiment with different models and tune their hyperparameters.
* Analyse the accuracy metrics and write a conclusion reporting qualitative and quantitative results.

This is just a tentative timeline; you are free to move at your own pace.

<h2> Resources </h2>

[Dataset](https://www.kaggle.com/competitions/stable-diffusion-image-to-prompts/data?select=prompts.csv)

[CNN Architectures](https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5)

[Vision Transformers](https://www.v7labs.com/blog/vision-transformer-guide)

[NLP Techniques](https://www.deeplearning.ai/resources/natural-language-processing/)
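<h2> Example Sketches </h2>

The sketches below are illustrative starting points, not definitive implementations. First, a minimal Task 1 encoder, assuming torchvision's pretrained ViT-B/16 as the backbone; any pretrained CNN or Vision Transformer could be swapped in.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a pretrained ViT-B/16 and drop its classification head so the model
# returns the 768-d [CLS] feature instead of ImageNet logits.
weights = ViT_B_16_Weights.DEFAULT
encoder = vit_b_16(weights=weights)
encoder.heads = nn.Identity()
encoder.eval()

# The weights object carries the exact resize/crop/normalise pipeline
# the backbone was trained with.
preprocess = weights.transforms()

@torch.no_grad()
def extract_features(pil_images):
    """Map a list of PIL images to an (N, 768) feature tensor."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return encoder(batch)
```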
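For converting target prompts to embedding vectors, one option is the sentence-transformers library; the all-MiniLM-L6-v2 model used below (384-d output) is an assumption on our part, so check the competition's evaluation page for the exact embedding model it scores against.

```python
from sentence_transformers import SentenceTransformer

# Assumed embedding model; swap in whatever the competition actually scores.
st_model = SentenceTransformer("all-MiniLM-L6-v2")

# Normalised embeddings make cosine similarity a plain dot product.
prompts = ["epic cat", "majestic kitten"]
emb = st_model.encode(prompts, convert_to_tensor=True, normalize_embeddings=True)

# The two strings share few characters, yet their embeddings are close,
# which is exactly the robustness the Task 2 description asks for.
print(float(emb[0] @ emb[1]))
```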
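Finally, a Task 2 sketch that regresses prompt embeddings directly from image features, assuming the 768-d encoder features and 384-d prompt embeddings from the sketches above; the MLP head, hidden size, and cosine loss are all illustrative choices, not the required architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptEmbeddingDecoder(nn.Module):
    """Small MLP mapping image features into the prompt-embedding space."""
    def __init__(self, in_dim=768, out_dim=384, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, image_features):
        # Normalise so predictions live on the unit sphere, like the targets.
        return F.normalize(self.net(image_features), dim=-1)

def cosine_loss(pred, target):
    # Minimising (1 - cosine similarity) maximises alignment with the target.
    return (1 - F.cosine_similarity(pred, target, dim=-1)).mean()

decoder = PromptEmbeddingDecoder()
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

# One hypothetical training step with random stand-ins for real data:
image_features = torch.randn(8, 768)                # from the Task 1 encoder
targets = F.normalize(torch.randn(8, 384), dim=-1)  # from the prompt embedder
loss = cosine_loss(decoder(image_features), targets)
loss.backward()
optimizer.step()
```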