# Deep Learning Project Proposal
## The Team
* **Evgenia Kivotova**
* **Daria Miklashevskaya**
* **Yuriy Sukhorukov**
## Problem formulation
We aim to create a reliable, high-performance platform for criminal face reconstruction from natural-language descriptions, achieving at least 70% resemblance to the reference images from the dataset, within the period of the current study module. The primary users are the police department and its visitors, who describe the images that need to be generated to find missing relatives or offenders.
Image generation is not only a popular and entertaining topic but also one with many practical applications. For example, people often want to visualise their thoughts and share the resulting image with others. In this project, we want to build a system that generates an image of a person from a free-text description in English.
## Who is the user of the future product?
We think that our application will target the following user groups:
- **Police officers.** A text-to-face application may be useful for quick identikit generation; these users need generated images of missing people or offenders.
- **Department visitors.** These users provide natural-language descriptions for the images to be generated.
Besides, the functionality allows us to further extend the target group to:
- **Book readers and publishers.** It is always interesting what novel characters look like, especially when a book has no illustrations. Curious readers may use our application to satisfy their curiosity, while publishers may quickly populate a book with illustrations and make it more appealing to customers.
- **Artists.** When artists want to create a specific character but cannot find a suitable reference, they may describe what they want to our application and get a set of candidate pictures to start from.
## How should we frame this problem?
It is clear that our target system will be a collaboration between
- some NLP Seq2Vec model (**Encoder**)
- Image Generator (**Decoder**).
We are still researching suitable models, but will definitely use PyTorch for the implementation.
Clearly, training will be supervised.
* We prompt users to give feedback about the quality of the model's output and store the queries, feedback, and generated images in a database. This data will be included in the health-check report and will help us to further tune the model.
The source code will be stored in the [GitLab repository](https://gitlab.com/k0t1k/dl-project).
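To make the encoder-decoder split concrete, here is a minimal dependency-free sketch of the intended pipeline interface. The class names (`TextEncoder`, `FaceDecoder`) and the toy word-hashing "encoder" are placeholders of our own invention, not the final PyTorch modules; a real Seq2Vec encoder would produce a learned dense embedding, and the decoder would be a generative network.

```python
from dataclasses import dataclass

# Hypothetical interface sketch: these names are placeholders,
# not the final PyTorch modules.

@dataclass
class TextEncoder:
    dim: int = 4  # size of the latent text embedding

    def encode(self, description: str) -> list[float]:
        # Toy stand-in: a real Seq2Vec model would map tokens to a
        # learned dense vector; here we just hash words into buckets.
        vec = [0.0] * self.dim
        for word in description.lower().split():
            vec[hash(word) % self.dim] += 1.0
        return vec

@dataclass
class FaceDecoder:
    size: int = 8  # side length of the generated (grayscale) image

    def decode(self, latent: list[float]) -> list[list[float]]:
        # Toy stand-in for a generator network: tile the latent
        # vector into a size x size pixel grid.
        return [[latent[(r + c) % len(latent)] for c in range(self.size)]
                for r in range(self.size)]

def text_to_face(description: str) -> list[list[float]]:
    """End-to-end pipeline: description -> latent vector -> image."""
    latent = TextEncoder().encode(description)
    return FaceDecoder().decode(latent)
```

The point of the sketch is only the data flow: text goes in, a fixed-size latent vector connects the two halves, and an image array comes out.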
## What data can we use?
We found the [Multi-Modal-CelebA-HQ](https://github.com/weihaox/Multi-Modal-CelebA-HQ-Dataset) dataset, containing 30k high-resolution images of celebrities with several short English appearance descriptions attached to each image. The descriptions may contain the same information in different forms, which we hope will help the model attend to concrete words instead of the grammatical structure of the sentences.
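The several-captions-per-image property can be exploited during training by sampling one of the alternative descriptions for each image. The sketch below illustrates the idea with toy data; the caption texts and the mapping layout are made up for illustration and do not reflect the dataset's actual file format.

```python
import random

# Toy pairing: each image id maps to several alternative captions,
# mimicking the multiple-descriptions-per-image structure of
# Multi-Modal-CelebA-HQ (the actual file layout is not shown here).
captions = {
    "00001": ["She has black hair and a pointy nose.",
              "A woman with dark hair and an oval face."],
    "00002": ["He wears glasses and has a beard."],
}

def sample_pair(caption_map, rng=random):
    """Pick a random (image_id, caption) training pair, sampling one
    of the alternative descriptions so the model sees varied phrasings
    of the same facial attributes."""
    image_id = rng.choice(sorted(caption_map))
    return image_id, rng.choice(caption_map[image_id])
```

Randomising over paraphrases at training time is what should push the model to rely on concrete attribute words rather than sentence structure.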
## Performance measure
Performance in this context consists of two different things:
* System performance - how fast our system as a whole (website, API endpoint, the neural net itself) performs certain actions, like generating an image of interest or loading website pages.
* Neural net performance - a measure of the **model error**; in other words, how good our model is at generating images given a textual description.
### How should we measure the performance?
First of all, the neural net performance may be evaluated using a similarity function between the generated image and the target face.
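One simple candidate for such a similarity function, shown here as a sketch, is cosine similarity between flattened image vectors; in practice a perceptual metric (e.g., comparing embeddings from a face-recognition network) would likely be a better fit, so treat the pixel-level version below purely as an illustration of the 70% threshold check.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two flattened image vectors.
    For non-negative pixel values the result lies in [0, 1];
    1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def resemblance_ok(generated, target, threshold=0.7):
    # Our 70% resemblance target, expressed as a similarity cutoff.
    return cosine_similarity(generated, target) >= threshold
```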
To measure load times, we can use logging and tools like [Awesome Prometheus](https://github.com/roaldnefs/awesome-prometheus) to monitor the system's performance and health.
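As a first step before wiring up a full monitoring stack, per-call latency can be captured with a small logging decorator like the sketch below (the function names and the `time.sleep` stand-in for inference are illustrative, not part of the actual system).

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("perf")

def timed(fn):
    """Log the wall-clock duration of each call; the recorded numbers
    could later feed a Prometheus histogram."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        log.info("%s took %.3f s", fn.__name__, elapsed)
        wrapper.last_elapsed = elapsed  # exposed for tests/monitoring
        return result
    return wrapper

@timed
def generate_image(description: str) -> str:
    time.sleep(0.01)  # stand-in for model inference
    return f"image for: {description}"
```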
### Is the performance measure aligned with the business objective?
Yes. Reliability and stability will be provided by the health-check tools, which notify users about downtime and handle logging. The similarity function allows us to measure the model's performance, defined as similarity to the target image (not less than 70%).
### What would be the minimum performance needed to reach the business objective?
* We want to keep the page loading time within 2 seconds, the tolerable page loading time according to this [research](https://www.researchgate.net/publication/220893869_A_Study_on_Tolerable_Waiting_Time_How_Long_Are_Web_Users_Willing_to_Wait).
* For the image generation time we want to stay under 10 seconds, which should be doable given a powerful GPU (RTX 2070 Super).
* For image quality, we want the results to be at least recognisable as human faces and to match the textual description, with at least 70% resemblance to the target pictures from the dataset.
### Is human expertise available?
Yes, almost every person is able to evaluate a generated image given its description.
## Existing solutions and research
There exist several studies that implement the same idea for the English language:
- [Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement](https://arxiv.org/pdf/2006.07606.pdf): this GAN-based model generates several different faces in response to a single description to cover all unspecified features.
- [A Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks](https://www.researchgate.net/publication/343565403_A_Realistic_Image_Generation_of_Face_From_Text_Description_Using_the_Fully_Trained_Generative_Adversarial_Networks): here, a fully trained GAN is proposed which beats the previous benchmarks.
- [Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions](https://arxiv.org/abs/1911.11378): using conditional distribution of faces in the same latent space and a DC-GAN with GAN-CLS loss for learning conditional multi-modality, they show good results and argue against validity of inception score metric for evaluation.
- [TediGAN: Text-Guided Diverse Image Generation and Manipulation](https://arxiv.org/abs/2012.03308): they utilize StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization to beat the previous benchmarks used for this task.
## How would you solve the problem manually?
It is possible to read the descriptions and draw faces with tools like a pencil, a pen, or a paint application.
## Assumptions
For our work we assume that
* The model will be fast enough even when executed on a CPU.
* The desired form of input is text, not something else (voice input, JSON/CSV with person attributes, an image).
* The model is accurate enough that when an image is generated, there will be no need to fine-tune it with queries like "change the eye color".
* Users know English.
### Assumption verification
* The first three assumptions are not easily verifiable because we do not have a working model yet, but there is some evidence that the model will run fast enough on a decent CPU (we checked the performance of other relatively heavy networks).
* As for the last one, we can consider incorporating support for the Russian language.