Survey: Text-Based Image Synthesis
Note: these are summaries of the papers; some text and images are taken directly from the respective papers because of the clarity of their explanations.
Metrics for performance measurement
Peak Signal to Noise Ratio (PSNR)
It measures the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the signal quality.
PSNR is commonly used to quantify the quality of image and video reconstruction.
It is measured on a dB (decibel) scale.
Assume we have a monochromatic, noise-free image I of size m×n, along with an image K which is a noisy version of I. The PSNR between them is computed as follows:
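For reference, the standard definitions (reconstructed here since the original equations were not preserved in these notes):
MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2
PSNR = 10 \log_{10} \left( \frac{MAX_I^2}{MSE} \right) = 20 \log_{10}(MAX_I) - 10 \log_{10}(MSE)
where MAX_I is the maximum possible pixel value (255 for 8-bit images).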
The higher the PSNR value, the better the result.
Structural Similarity for Image Quality (SSIM) [2003]
It measures how close two images are.
Two windows are slid over the images x and y, pixel by pixel, to collect the following statistics: the patch means, variances, and covariance.
The luminance, contrast, and structure comparisons are then given by:
Finally SSIM is calculated as:
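The standard formulation from the SSIM paper (reconstructed here from the surrounding definitions):
l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}
With C_3 = C_2 / 2 these combine into:
SSIM(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
where \mu, \sigma^2, and \sigma_{xy} are the patch means, variances, and covariance, and C_1, C_2 are small constants that stabilize the division.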
SSIM satisfies the following properties:
Symmetry: SSIM(x,y) = SSIM(y,x)
Boundedness: SSIM(x,y) <= 1
Unique maximum: SSIM(x, y) = 1 if and only if x = y
The motivation for this measure is that two images with quite different content can still have a very low mean squared error; SSIM captures perceived image quality better than MSE in such cases.
Inception Score (IS) [2016]
It measures the quality of images generated by a GAN.
A pre-trained InceptionV2 model is used to extract the class probability distribution for each generated image.
The two-fold applications of this metric are:
It measures how good the generated images are.
It also measures how diverse the generated images are.
Diversity is captured by the marginal distribution, obtained by averaging the predicted probability distributions over all generated images.
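The standard definition (reconstructed here):
IS(G) = \exp \left( \mathbb{E}_{x \sim p_g} \, D_{KL} \left( p(y \mid x) \,\|\, p(y) \right) \right)
where p(y|x) is the class distribution predicted by the Inception network for a generated image x and p(y) = \mathbb{E}_x[p(y|x)] is the marginal. A sharp p(y|x) indicates high image quality, and a high-entropy marginal p(y) indicates high diversity; a higher IS is better.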
Frechet Inception Distance (FID) [2017]
The Frechet distance is used to measure the distance between two multivariate normal distributions.
For univariate normal distributions, the Frechet distance is given as:
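d^2\left( (\mu_1, \sigma_1), (\mu_2, \sigma_2) \right) = (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2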
To calculate FID, the activation statistics of a pre-trained InceptionV3 model are used. Global average pooling is applied to the final activation layer to obtain a 2048-dimensional vector.
FID for multivariate distributions is calculated as:
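FID = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right)
where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the mean and covariance of the 2048-dimensional activations for real and generated images respectively. Lower FID is better.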
Conditional Image Synthesis with Auxiliary Classifier GANs (AC-GAN) [2016]
This paper proposed a new training strategy that adds an auxiliary classification loss to the discriminator of the GAN framework.
It also shows that high-quality image generation is not just upsampling of low-resolution images, and studies how downsampling a high-resolution image affects classification accuracy.
Text Conditioned Auxiliary Classifier Generative Adversarial Network (TAC-GAN) [2017]
This paper uses AC-GAN as its base and proposes a new framework for text-based image generation, inspired by the work of Reed et al.
The authors add a text-embedding network to AC-GAN and, similar to Reed et al., train their discriminator to be aware of whether the correct image is matched with the correct text (a matching-aware discriminator).
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks [2017]
In this paper the authors propose an attentional generative network together with an attention-based image-text similarity network used to compute a matching loss; combining these two parts generates fine-grained images that correlate highly with the text description.
The attentional generative network generates the image by paying attention to each word of the given text description.
The proposed method performs better for images containing complex information, e.g. when more than one object is present in the image and its corresponding text description (COCO dataset).
The intuition behind the attentional GAN is to let the generator focus on the words that are most relevant to the part of the image being generated.
The network extracts a global (sentence-level) embedding as well as word-level embeddings from the given text description.
The first sub-network of the generative network uses the global embedding for low-resolution image generation. The later parts of the network use the image features and the word vectors to generate a fine-grained encoding. The generator uses this encoding to generate a high-quality image that is semantically aligned with the given text.
To generate the fine-grained encoding, an attention mechanism is used:
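A reconstruction of this step following the AttnGAN paper's notation (the word features e_i are first projected into the image feature space as e'_i):
s_{j,i} = h_j^T e'_i, \quad \beta_{j,i} = \frac{\exp(s_{j,i})}{\sum_{k=0}^{T-1} \exp(s_{j,k})}, \quad c_j = \sum_{i=0}^{T-1} \beta_{j,i} \, e'_i
where h_j is the feature of the j-th image sub-region and c_j is the resulting word-context vector for that sub-region.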
The generated attention context vectors are then passed to the generator for synthesizing the higher-resolution image.
Three different generators are used in the complete network, producing 64×64, 128×128, and 256×256 images respectively.
Generator loss:
Discriminator loss:
Total loss:
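A hedged reconstruction of these objectives, following the AttnGAN paper (each stage i has an unconditional and a text-conditional adversarial term, and the DAMSM loss described below is added to the final objective):
L_{G_i} = -\tfrac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}} [\log D_i(\hat{x}_i)] - \tfrac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}} [\log D_i(\hat{x}_i, \bar{e})]
L_{D_i} = -\tfrac{1}{2} \mathbb{E}_{x_i \sim p_{data}} [\log D_i(x_i)] - \tfrac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}} [\log (1 - D_i(\hat{x}_i))] - \tfrac{1}{2} \mathbb{E}_{x_i \sim p_{data}} [\log D_i(x_i, \bar{e})] - \tfrac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}} [\log (1 - D_i(\hat{x}_i, \bar{e}))]
L = L_G + \lambda L_{DAMSM}, \quad L_G = \sum_i L_{G_i}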
Deep Attentional Multimodal Similarity Model (DAMSM)
The authors propose this network to enforce consistency between the text and the image.
The network is pre-trained on image-text pairs and later used for computing the DAMSM loss (L_DAMSM).
The intuition behind this network is to map image-text pairs into the same embedding space.
For generating the text encoding, a bi-directional LSTM network is used, where e ∈ R^{D×T} represents the local word-level embeddings and ē ∈ R^D represents the global encoding of the text; D is the text embedding size and T is the number of words in the given sentence. The text encoder network is trained from scratch.
To generate the encoding for the images, the authors use a pre-trained InceptionV3 network. Images are resized to 299×299 before being given as input to the network.
Local image features are extracted from the "mixed_6e" layer of the InceptionV3 network: a feature map with 289 (17×17) sub-regions, each with 768 features, which is reshaped into a single 768×289 matrix. The global feature, a 2048-dimensional vector, is extracted from the last layer after applying global average pooling.
Both the local and global image features are transformed into the common semantic space of the text by passing them through a perceptron layer that is trained from scratch; the InceptionV3 weights are frozen beforehand.
A similarity matrix is calculated between the image sub-regions and the words:
The calculated similarity scores are then normalized:
Region-context vector: it captures how much each word in the text is related to an image sub-region.
Relevance between the image sub-regions and the i-th word:
Attention-driven image-text matching score:
DAMSM Loss:
where γ1, γ2, and γ3 are smoothing factors determined experimentally.
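A hedged reconstruction of the quantities above, following the AttnGAN paper's notation (e: word features, v: sub-region features, Q: image, D: description, M: batch size):
s = e^T v, \quad \bar{s}_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=0}^{T-1} \exp(s_{k,j})}
c_i = \sum_{j=0}^{288} \alpha_j v_j, \quad \alpha_j = \frac{\exp(\gamma_1 \bar{s}_{i,j})}{\sum_{k=0}^{288} \exp(\gamma_1 \bar{s}_{i,k})}
R(c_i, e_i) = \frac{c_i^T e_i}{\|c_i\| \, \|e_i\|}, \quad R(Q, D) = \log \left( \sum_{i=1}^{T-1} \exp(\gamma_2 R(c_i, e_i)) \right)^{1/\gamma_2}
P(D_i \mid Q_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{M} \exp(\gamma_3 R(Q_i, D_j))}, \quad L_1^w = -\sum_{i=1}^{M} \log P(D_i \mid Q_i)
and L_{DAMSM} = L_1^w + L_2^w + L_1^s + L_2^s, combining word-level and sentence-level losses in both the image-to-text and text-to-image directions.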
Finally, the complete network is trained using the total loss:
The authors run experiments to verify the improvement contributed by each component of the network, by removing components one by one or adding them to the initial network one by one.
Semi-supervised FusedGAN for Conditional Image Generation [2018]
This paper proposes a method for generating images with controlled information. The generated images exhibit both fidelity and diversity.
Controlled information includes posture, style, background, and fine-grained details.
The other aim of the paper is to show that the proposed method is able to learn a disentangled representation.
As per the paper, disentanglement is achieved by cascading different generator models.
One generator-discriminator pair is responsible for unsupervised image generation and is trained as a standard GAN. The other pair is trained to generate conditional images.
The generation stage is divided into two parts. The first part is common to the next-stage generators: it generates an overall structure of the shape under consideration and acts like a sketch that encodes the posture. The generated structure is then passed on to two different generators in the next stage: one is an unconditional generator and the other is a conditional generator conditioned on the text.
The text embedding is also passed to the discriminator of the conditional generator, but it is not trained as a matching-aware discriminator.
The complete network is trained using GAN-based losses in an alternating manner: the parameters are first updated for the first pair and then for the second pair. A minimal sketch of the cascaded generator is given below.
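A minimal PyTorch-style sketch of the cascaded-generator idea (module names, layer sizes, and resolutions are illustrative assumptions, not the paper's exact architecture):

import torch
import torch.nn as nn

class FusedGANSketch(nn.Module):
    """Illustrative cascade: a shared stage-1 generator feeds two stage-2 branches."""
    def __init__(self, z_dim=100, text_dim=128, feat_ch=64):
        super().__init__()
        # Stage 1: noise -> coarse structure feature map (the shared "sketch")
        self.stage1 = nn.Sequential(
            nn.ConvTranspose2d(z_dim, feat_ch, 4, 1, 0), nn.ReLU(),
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, 2, 1), nn.ReLU(),
        )
        # Stage 2a: unconditional generator (structure -> image)
        self.uncond = nn.Sequential(nn.ConvTranspose2d(feat_ch, 3, 4, 2, 1), nn.Tanh())
        # Stage 2b: conditional generator (structure + text embedding -> image)
        self.cond = nn.Sequential(nn.ConvTranspose2d(feat_ch + text_dim, 3, 4, 2, 1), nn.Tanh())

    def forward(self, z, text_emb):
        structure = self.stage1(z.view(z.size(0), -1, 1, 1))   # shared posture/structure
        img_uncond = self.uncond(structure)                    # unsupervised branch
        t = text_emb.view(text_emb.size(0), -1, 1, 1)
        t = t.expand(-1, -1, structure.size(2), structure.size(3))
        img_cond = self.cond(torch.cat([structure, t], dim=1)) # text-conditioned branch
        return img_uncond, img_cond

# Usage: img_u, img_c = FusedGANSketch()(torch.randn(4, 100), torch.randn(4, 128))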
The authors propose several experiments that give empirical evidence of how disentanglement arises from cascading the generators for text-to-image synthesis.
Fixed posture with varying styles.
This experiment shows that after generating a structure from the first generator, we can vary the text caption to generate images of different styles while keeping the posture constant.
Fixed posture with varying details.
This experiment shows how conditional augmentation (CA) of the text brings disentanglement control. The authors keep the same structure and draw different CA samples for the same text, showing that this creates different textures for birds of the same species.
Interpolation with same posture but varying styles.
This is a simple interpolation in the latent space of the text while keeping two different bird structures constant.
The Inception Score is used for the quantitative analysis.
Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network (HDGAN) [2018]
The authors propose a novel framework for generating high-quality text-conditioned images with fidelity and diversity.
Along with the novel framework, they also propose an evaluation metric known as the Visual-Semantic Similarity (VS) score, which helps evaluate how well a generated image follows the given text description.
The important part, as per the paper, is the generation of the lowest-resolution image, which acts as the base structure for the further levels of the framework.
To train the discriminators, the authors use a matching-aware training strategy similar to StackGAN.
The HDGAN framework consists of K levels of generators, where K indexes the resolution of the generated image; as per the paper, images are generated at resolutions 64, 128, 256, and 512. The output of each level is passed to the discriminator for that level, so that the discriminators at the different levels force fine-grained information to be encoded into the image. The discriminators act as regularizers here.
Using the discriminators, the GAN framework maintains global consistency, but to ensure local patch-based consistency the authors branch the discriminator output into two parts. The first part is a single real/fake score, and the second part is a probabilistic map giving, for each spatial location, the probability of being real or fake, similar to CycleGAN and Pix2Pix. Computing this loss is similar to computing the real/fake loss: the output probability map is compared with an all-ones map for real images and an all-zeros map for fake images.
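A minimal sketch of this patch-level (local) adversarial loss, assuming a discriminator branch D_local that outputs a per-location real/fake probability map (the function and tensor names are illustrative):

import torch
import torch.nn.functional as F

def local_adversarial_loss(prob_map_real, prob_map_fake):
    # Patch-level loss: the map for real images is pushed towards all ones,
    # the map for generated images towards all zeros.
    # Both maps are assumed to be probabilities in [0, 1] (e.g. after a sigmoid).
    loss_real = F.binary_cross_entropy(prob_map_real, torch.ones_like(prob_map_real))
    loss_fake = F.binary_cross_entropy(prob_map_fake, torch.zeros_like(prob_map_fake))
    return loss_real + loss_fake

# Usage with maps of shape (B, 1, H, W) from the discriminator's local branch:
# d_loss_local = local_adversarial_loss(D_local(real_img), D_local(fake_img.detach()))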
Similar to StackGAN, the authors use Conditional Augmentation (CA) to ensure consistency in the text embedding.
Visual-Semantic Similarity score:
To use this metric, a model has to be trained to encode images and text into a common 512-dimensional embedding space using the following loss function:
In this visual-semantic similarity, a triplet (ranking) loss is used to learn the parameters of the network.
After learning the parameters, the following function is used to measure the similarity:
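A hedged reconstruction, assuming image and text encoders f_img and f_text with a cosine-style scoring function and a bidirectional margin ranking loss:
c(x, y) = \frac{f_{img}(x) \cdot f_{text}(y)}{\|f_{img}(x)\| \, \|f_{text}(y)\|}
L = \sum_{(x, y)} \left[ \max\left(0, \delta - c(x, y) + c(x, y^-)\right) + \max\left(0, \delta - c(x, y) + c(x^-, y)\right) \right]
where (x, y) is a matched image-text pair, x^- and y^- are mismatched samples, and δ is the margin. After training, c between a generated image and its input text is reported as the VS score.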
The authors perform multiple experiments to support their hypothesis:
Hierarchically-nested adversarial training:
With this experiment the authors explain the importance of having a discriminator at each level. They show that removing the discriminators at the lower levels degrades the generated results.
The local image loss:
Similar experiments were performed to show the effectiveness of the local image loss for pixel-level consistency in images.
Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis (PPAN) [AAAI 2019]
This paper presents a novel framework consisting of one generator and multi-level discriminators to synthesize realistic images based on a text description. Each level of discriminator forces the generator to generate fine-grained images.
The authors use a perceptual loss instead of a pixel-based loss for generating realistic images. The multi-purpose discriminators encourage semantic consistency, image fidelity, and class invariance.
The PPAN framework uses a conditional augmentation (CA) network, inspired by StackGAN, to prevent discontinuity in the higher-order text embedding space. A text embedding is passed through the CA network to sample a lower-dimensional embedding vector, which PPAN concatenates with a noise vector and passes to the generator network. The generator up-samples the given input to produce 64×64, 128×128, and 256×256 images at different stages. This single generator is connected to three different discriminators D1, D2, and D3.
The authors use four different types of loss function at different stages of the network:
The matching-aware loss, inspired by GAN-INT-CLS (Reed et al.), encourages the discriminator to learn contextual information. It takes three pairs as input: {real image, matching text}, {real image, mismatching text}, and {fake image, matching text}.
The local image loss encourages the discriminator to penalize the generator for not generating diverse and locally consistent images, similar to HDGAN.
The class-information loss is inspired by TAC-GAN. Its purpose is to encourage the discriminator to understand the differences between image classes and to penalize the generator for not generating images of the correct class.
The perceptual loss is used to encourage the generator to produce spatially correlated images rather than random pixels. This loss is inspired by the work of Johnson et al. on style transfer.
The discriminators at the different levels combine these losses. As per the paper, learning the matching-aware loss at the lower levels helps in generating semantically meaningful images, while the class-information loss is applied to the highest-resolution images because they contain more discriminative features, which forces the generator to generate semantically coherent pixels.
A 1×1 convolution is used to merge the textual information into the image features in the discriminator.
The authors use the following evaluation metrics:
Inception score (IS) [High score is better.]
Visual-semantic Similarity (VS) [HDGAN] [High score is better.]
The authors verify the importance of each component by first removing all three discriminators and then adding them back one by one, checking the improvement in the IS and VS scores.
MirrorGAN: Learning Text-to-Image Generation by Redescription [2019]
In this paper the authors propose a mirrored text-to-image generation strategy. An image is generated from the text description and a noise vector; the generated image is then passed through an image-captioning network to regenerate the caption, which is supposed to be similar to the text description that was used to generate the image in the first place.
The authors propose three modules for transforming text into an image:
STEM (semantic text embedding module): this module generates the local (word-level) and global (sentence-level) embeddings for the text.
GLAM (global-local collaborative attentive module): this module is similar to the AttnGAN attention module, but the authors also add global sentence-level attention along with the local attention between words and image features.
STREAM (semantic text regeneration and alignment module): this is an image-captioning module which takes the generated image as input and produces a caption as output, later used to estimate the inconsistency between the input caption and the regenerated caption.
The reason the authors add global attention along with word-level attention is that word-level attention alone does not ensure global semantic consistency, due to the diversity of the text and image modalities; e.g. an image can have five or more captions (as in the COCO dataset), and captions have different ways of conveying the same attributes of an image.
To train the model end to end, the following losses are used:
Visual realism adversarial loss: whether the generated image is real or fake.
Text-image paired semantic consistency loss: whether the generated image is consistent with the text or not.
Text-semantic reconstruction loss: a cross-entropy loss that makes the text-to-image and image-to-text tasks consistent and guides the generator to generate semantically consistent images.
Because text is diverse, two texts that differ by a small perturbation can have similar meanings. To make the embedding consistent, the authors use Conditional Augmentation (CA), inspired by StackGAN.
Local and Global attention mechanism (GLAM):
Incremental generator network is used.
Similar to AttnGAN:
Attention mechanism:
First, the word-level attentive context features are generated:
After the word-level attention, sentence-level attention is applied:
Image to text (STREAM):
The authors use a pre-trained CNN model to extract image features, so that the image can be given as input to the LSTM network.
The LSTM network is trained from scratch. The authors train this network separately and then use it, with frozen weights, for the loss calculation.
Loss functions:
Visual and Text semantic loss:
Image to caption loss:
Total loss:
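A hedged reconstruction of these objectives following the MirrorGAN paper: the generator at each stage i has an unconditional (visual realism) and a text-conditional (semantic consistency) adversarial term, and the STREAM branch adds a cross-entropy loss over the regenerated caption:
L_{G_i} = -\tfrac{1}{2} \mathbb{E} [\log D_i(\hat{x}_i)] - \tfrac{1}{2} \mathbb{E} [\log D_i(\hat{x}_i, \bar{e})]
L_{stream} = -\sum_{t=1}^{L} \log p(T_t \mid T_{0:t-1}, \hat{x})
L_G = \sum_i L_{G_i} + \lambda L_{stream}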
Ablation studies:
MirrorGAN component
MirrorGAN cascade generator
DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis [2019]
In this paper the authors propose a dynamic-memory-based method for generating fine-grained images that follow their textual descriptions.
The motivation is that all previous methods depend on the initially generated image for further text-based alignment and refinement, which causes problems if the initial image is of low quality, i.e. contains very little information or noisy information unrelated to the text.
The Dynamic Memory method consists of two parts:
Gated Memory Writing: its purpose is to focus on the relevant words to refine the initial image. For example, if the initial image contains a bird with black stripes but the text says white, it is the responsibility of the gated memory writing network to identify this and write it to memory.
Gated Response: it fuses the relevant information from the text into the image by using the memory.
First, the text is passed through a pre-trained bi-directional LSTM model, producing a global sentence embedding s and word-level embeddings W.
The sentence embedding s is passed through the conditional augmentation (CA) network, which gives a new embedding that is concatenated with a noise vector and passed through the initial image generator G0, which generates an image of size 64×64.
The feature maps from the initial network G0 are passed to the dynamic-memory-based image refinement network.
The refinement network utilizes the word-level embeddings to refine the initially generated image using the memory writing gate and the response gate.
Notations:
Dynamic memory consists of four different steps (intuitions):
Memory writing (for a single word):
Key addressing: it calculates a similarity score between the memory features and the image features.
Value reading: it calculates how important each word is for the j-th image feature.
Response: its purpose is to use the memory features to refine the image.
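A hedged reconstruction of these steps following the DM-GAN paper (m_i: memory slot written from word w_i, r_j: the j-th image feature, M(·), φ_K(·), φ_V(·): learned 1×1 convolution / embedding maps):
m_i = M(w_i)
\alpha_{i,j} = \frac{\exp\left( \phi_K(m_i)^T r_j \right)}{\sum_{l} \exp\left( \phi_K(m_l)^T r_j \right)} \quad \text{(key addressing)}
o_j = \sum_i \alpha_{i,j} \, \phi_V(m_i) \quad \text{(value reading)}
r_j^{new} = [o_j, r_j] \quad \text{(response: concatenate the memory output with the image feature)}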
Extra:
Gated Memory Writing:
For the intuition we assumed only a single word feature for image refinement, but the proposed network utilizes all the words to identify which are relevant for refining the image.
Gated Response:
For the intuition we assumed a simple concatenation operation for the new image feature, but this network is proposed to fuse the important word memories into the image features.
Generator loss:
Discriminator loss:
Loss for the network:
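A hedged reconstruction of the full objective following the DM-GAN paper: each stage uses an unconditional plus a text-conditional adversarial term, and the conditional augmentation (KL) loss and the DAMSM loss from AttnGAN are added:
L_{G_i} = -\tfrac{1}{2} \mathbb{E} [\log D_i(\hat{x}_i)] - \tfrac{1}{2} \mathbb{E} [\log D_i(\hat{x}_i, s)]
L_{CA} = D_{KL}\left( \mathcal{N}(\mu(s), \Sigma(s)) \,\|\, \mathcal{N}(0, I) \right)
L = \sum_i L_{G_i} + \lambda_1 L_{CA} + \lambda_2 L_{DAMSM}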
The authors perform an ablation study by dropping each proposed component from the base network and show the improvement obtained by adding each component back one by one.
Semantics Disentangling for Text-to-Image Generation (SD-GAN) [2019]
In this paper the authors propose a method that learns disentangled representations for text-to-image synthesis. Both high-level semantic consistency and low-level semantic diversity are preserved by the proposed method.
Controllable Text-to-Image Generation [2019]
In this paper the authors propose a method for generating images from text as well as manipulating the generated images in a controlled manner; small changes to the sentence do not affect the parts of the image unrelated to the text.
The authors also propose a word-level discriminator that gives fine-grained feedback to the generator for better learning.
There are three novel components in the architecture:
Spatial and channel-wise attention in the generator network for generating better images. The proposed generator follows a multi-stage architecture similar to AttnGAN.
A word-level discriminator is proposed to exploit the correlation between image regions and words, in order to disentangle the different attributes.
A perceptual loss is used to better guide the generator towards realistic images.
The authors use AttnGAN as their base work. To generate feature vectors for the text, a pre-trained bi-directional LSTM model is used.
Channel-Wise Attention
Channel-wise attention is proposed to help the network learn the relation between words and feature channels. As per the paper, the authors found that spatial attention focuses mostly on the colour descriptions.
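A minimal sketch of a channel-wise attention step under my own assumptions (a generic formulation, not necessarily ControlGAN's exact equations): word features attend over the channels of the image feature map.

import torch
import torch.nn.functional as F

def channel_wise_attention(img_feat, word_feat, proj):
    # img_feat: (B, C, H, W) image features
    # word_feat: (B, T, D) word embeddings
    # proj: torch.nn.Linear(D, H*W) projecting words into the spatial space (assumed)
    B, C, H, W = img_feat.shape
    v = img_feat.view(B, C, H * W)             # (B, C, HW)
    w = proj(word_feat)                        # (B, T, HW)
    attn = torch.bmm(v, w.transpose(1, 2))     # (B, C, T) channel-word correlation
    attn = F.softmax(attn, dim=2)              # normalize over the words
    context = torch.bmm(attn, w)               # (B, C, HW) word-aware channel context
    return context.view(B, C, H, W)

# Usage (shapes only):
# proj = torch.nn.Linear(256, 16 * 16)
# out = channel_wise_attention(torch.randn(2, 64, 16, 16), torch.randn(2, 12, 256), proj)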
The intuition behind the word-level discriminator is to provide the generator with fine-grained feedback, so that it can generate the image by focusing more on the regions related to the words in the text description. This idea is inspired by the text-adaptive discriminator.
Learning Transferable Visual Models From Natural Language Supervision (CLIP) [2021] [Youtube(Yannic)]
The authors propose a method for learning discriminative features that work in a zero-shot setting. The proposed method adapts to a large variety of datasets.
The model predicts a matching sentence instead of just single word label.
The idea is to learn good features using the vast amount of text available; learning features from a labeled dataset with a fixed label set restricts the model's generality.
Intuition:
Ask the model how likely a given text goes with an image. For example, given the texts cat, dog, or mouse, how likely does each label go with the image; the model can output a probability distribution over them.
We can make the model more robust by rephrasing the labels into sentences, and the sentences can be of different types, e.g. "a photo of a dog" or "a photo of a cat". The sentences are constructed from the labels of the dataset using prompts.
The CLIP model is trained using a contrastive learning approach.
The method uses batches of paired text-image data.
An image encoder extracts features for the images and a text encoder extracts features for the texts; the inner product is used to measure the relation between every image-text pair within the batch, in both directions (image-to-text and text-to-image). The aim is to produce contrastive encodings in which the diagonal entries (true pairs) have high values, close to 1, while the off-diagonal entries, which in general correspond to unrelated image-text pairs, are close to 0. Reading the score matrix in the horizontal direction gives the image-classification view and in the vertical direction the text-classification view: the same classification problem viewed from two different angles.
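This objective matches the pseudocode in the CLIP paper; below is a runnable PyTorch rendering of it (the encoders and the learned temperature are stand-ins):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, log_temperature):
    # image_emb, text_emb: (N, D) embeddings for N paired images and texts
    image_emb = F.normalize(image_emb, dim=-1)   # L2-normalize so the inner
    text_emb = F.normalize(text_emb, dim=-1)     # product is a cosine similarity
    logits = image_emb @ text_emb.t() * log_temperature.exp()   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))                      # diagonal pairs are the matches
    loss_i2t = F.cross_entropy(logits, targets)                 # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)             # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Usage: loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512),
#                                     torch.tensor(1 / 0.07).log())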
For inference:
Extract the features of the image using the trained image encoder.
For the text, we have a set of labels; we generate a text prompt for each label and pass all the prompts through the previously trained text encoder to extract a feature vector for each.
To assign a class to an image, we take the label whose prompt feature has the highest inner product with the image feature.
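A sketch of this zero-shot inference procedure (the encoders, tokenizer, and prompt template are illustrative stand-ins):

import torch
import torch.nn.functional as F

def zero_shot_classify(image, labels, image_encoder, text_encoder, tokenizer):
    # Return the index of the label whose prompt best matches the image.
    prompts = [f"a photo of a {label}" for label in labels]            # prompt engineering
    text_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)   # (L, D)
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # (1, D)
    scores = img_emb @ text_emb.t()                                    # (1, L) cosine similarities
    return scores.argmax(dim=-1).item()

# Usage: predicted = zero_shot_classify(img_tensor, ["cat", "dog", "pug"],
#                                       image_encoder, text_encoder, tokenizer)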
Due to the contrastive training strategy, the CLIP model generates features that are robust for both inter- and intra-class distinctions; the captions used for training capture these minute details. Example captions: "this dog is a pug", "the aussie pup", etc.
For the text encoder the authors use a transformer model of approximately 63M parameters, and for the image encoder they use ResNet and ViT models.
The authors compare the performance of zero-shot classification and linear probing. An important part of the proposed method is how the prompt is prepared, i.e. the sentence template in which the label is embedded.
Zero-Shot Text-to-Image Generation (DALL-E) [2021]
The proposed method uses a pre-trained discrete VAE (dVAE) codebook; the DALL-E transformer has to predict the image tokens.
The authors train a transformer model on an autoregressive task, where the text and image tokens are given as a single stream of data.
To deal with high-resolution images, the authors propose two stages:
Stage 1: they train a discrete variational autoencoder (dVAE) with a codebook of size 8192. Images of size 256×256 are reduced to a 32×32 grid of image tokens.
Stage 2: they concatenate up to 256 BPE text tokens with the 1024 image tokens and train an autoregressive transformer model to predict the image tokens. The idea is to learn the joint distribution over text and images.
First, the dVAE is learned on the image reconstruction task; after learning, its weights are fixed and the transformer model is trained on the autoregressive task over the concatenated text and image tokens.
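A minimal sketch of how the stage-2 token stream can be assembled and trained (the token counts follow the paper; the transformer itself is a stand-in):

import torch
import torch.nn.functional as F

def build_token_stream(text_tokens, image_tokens):
    # text_tokens:  (B, 256) BPE-encoded caption tokens
    # image_tokens: (B, 1024) dVAE codebook indices for the 32x32 image grid
    return torch.cat([text_tokens, image_tokens], dim=1)   # (B, 1280) joint stream

def autoregressive_loss(transformer, stream):
    # Next-token prediction over the concatenated text+image stream.
    logits = transformer(stream[:, :-1])                    # (B, 1279, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           stream[:, 1:].reshape(-1))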
Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [2021]
This paper proposes a method for both text-to-image generation and text-based image manipulation.
Method:
A GAN is trained without conditioning, aiming to generate images with high diversity and quality, i.e. a high Inception Score.
After training the GAN, an inversion model is learned to map real images into the latent space, using a cycle-consistency loss to learn more robust and consistent inverted codes.
Later, a text-encoding model is learned to explore the semantic space of the learned GAN.
Previous methods are restricted by paired text-image data, which limits the generator's capability to produce diverse images for the text-to-image task. It is also hard to preserve non-text-related semantics while making small changes to the image, and for text-based image manipulation a new model has to be trained.
The authors use StyleGAN for a disentangled representation of images. StyleGAN maps a noise vector z to a more disentangled latent space W.
Inspired by the idea of GAN inversion, the authors propose a method for text-based image synthesis.
The authors train a GAN inversion encoder that maps a real image to the latent space. To make the inverted latent code w' consistent with the latent codes w of the trained StyleGAN, a cycle-consistency loss is used.
For the text-based image generation task, a similarity model is learned between the text and the inverted latent code, such that the inverted code can be optimized to have the desired semantic attributes. For text-based image editing, the same method is used together with a perceptual loss that constrains the reconstruction between the original and the generated image.
The authors divide CI-GAN into three different stages:
STAGE 1: a StyleGAN model is trained to generate high-quality and diverse images without any condition, i.e. learning the mapping z → w, where w lies in the more disentangled latent space W.
STAGE 2: a GAN inversion encoder is trained using the cycle-consistency loss.
STAGE 3: a text encoder model is trained to align the text encoding with the inverted latent code w'.
GAN inversion encoder:
GAN inversion studies how to map a generated image back to its latent code.
An encoder-based model is used, which takes an image as input and outputs a latent code w' that should be similar to w.
A pixel loss (L_pix) and a perceptual loss (L_percep) are used to make the image generated from w' consistent with the original image.
Cycle consistent constraint:
Using the perceptual and pixel losses, the model focuses mainly on the image domain, but we also have to make the (re-constructed) latent code similar to the original latent code w. To do so, cycle-consistency training is used.
An adversarial loss is applied to both the inverted code w' and the image generated from w'.
The adversarial loss on the image generated from the inverted latent code w' forces the encoder to produce codes that yield realistic images and aligns w' with the generator's semantics. A new discriminator is used here.
The adversarial loss on the inverted latent code forces the model to align w' with w, so that it follows the same distribution as the StyleGAN latent space.
The complete objective of the GAN inversion encoder is:
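A hedged reconstruction that combines the terms discussed above (the λ weights are hyper-parameters; the exact weighting in the paper may differ):
L_E = L_{pix} + \lambda_1 L_{percep} + \lambda_2 L_{adv}^{x} + \lambda_3 L_{adv}^{w} + \lambda_4 L_{cyc}
with L_{pix} = \| x - G(E(x)) \|_2^2, L_{percep} = \| \phi(x) - \phi(G(E(x))) \|_2^2 for a feature extractor \phi, adversarial terms on the generated image and on the inverted code, and a cycle-consistency term L_{cyc} = \| w - E(G(w)) \|_2^2.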