# <center>Survey: Text-Based Image Synthesis</center>

> **Note**: Summaries of the papers; some text and images are taken directly from the respective papers for clarity of explanation.

### Metrics for performance measurement

#### Peak Signal to Noise Ratio (PSNR)

* It measures the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects its quality.
* PSNR is commonly used to quantify the quality of image and video reconstruction.
* It is measured on the decibel (dB) scale.
* Assume we have a noise-free monochromatic image $I$ of size $m \times n$ and a noisy approximation $K$ of it. PSNR between them is computed as:
$$
MSE = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}[I(i,j) - K(i,j)]^{2} \\
PSNR = 10 \cdot log_{10}\bigg(\frac{MAX_{I}^{2}}{MSE}\bigg) \\
PSNR = 20 \cdot log_{10}\bigg(\frac{MAX_{I}}{\sqrt{MSE}}\bigg)
$$
* The higher the PSNR value, the better the reconstruction.

#### Structural Similarity for Image Quality (SSIM) [2003]

* It measures how structurally close two images are.
* A window is slid over images $x$ and $y$, pixel by pixel, to collect the following statistics:
$$
\mu_{x} \text{: Luminance of image x patch (mean)}\\
\sigma_{x}^{2} \text{: Contrast of image x patch (variance)}\\
\sigma_{xy} \text{: Covariance between patches of image x and y}
$$
* The luminance, contrast and structure comparison terms are then given by:
$$
l(x, y) = \frac{2 \mu_{x}\mu_{y} + C_{1}}{\mu_{x}^{2} + \mu_{y}^{2} + C_{1}} \\
c(x, y) = \frac{2 \sigma_{x}\sigma_{y} + C_{2}}{\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2}} \\
s(x, y) = \frac{\sigma_{xy} + C_{3}}{\sigma_{x} \sigma_{y} + C_{3}} \\
\text{Where, } C_{1} = (K_{1}L)^{2}, C_{2} = (K_{2}L)^{2} \text{ and } C_{3} = C_{2}/2 \\
K_{1} = 0.01 \text{ and } K_{2} = 0.03 \text{ (as per the paper)}
$$
* Finally SSIM is calculated as:
$$
SSIM = l(x,y)^{\alpha} \cdot c(x,y)^{\beta} \cdot s(x,y)^{\gamma} \\
\text{Where, } \alpha=\beta=\gamma=1
$$
* SSIM satisfies the following properties:
    * Symmetry: SSIM(x, y) = SSIM(y, x)
    * Boundedness: SSIM(x, y) <= 1
    * Unique maximum: SSIM(x, y) = 1 if and only if x = y
* The motivation is that distorted images with the same mean squared error with respect to a reference can have very different perceptual quality; SSIM captures this structural quality better than MSE.

#### Inception Score (IS) [2016]

* It measures the quality of images generated by a GAN.
* A pre-trained Inception network is used to obtain the class probability distribution $p(y|x)$ for each generated image.
* The metric captures two aspects:
    * how recognizable each generated image is (each $p(y|x)$ should be sharply peaked);
    * how diverse the set of generated images is (the marginal $p(y)$ should be spread out).
* Diversity is captured by the marginal distribution $p(y)$, obtained by averaging the predicted distributions over all generated images.
$$
IS = exp\big(\mathbb{E}_{x}[D_{KL}(p(y | x) \, || \, p(y))]\big) \\
p(y) = \int p(y|x=G(z)) \, dz
$$
* [Link](https://medium.com/octavian-ai/a-simple-explanation-of-the-inception-score-372dff6a8c7a)

#### Frechet Inception Distance (FID) [2017]

* The Frechet distance measures the distance between two multivariate normal distributions.
* For univariate normal distributions it is given by:
$$
d(X, Y) = (\mu_{X}-\mu_{Y})^{2} + (\sigma_{X}-\sigma_{Y})^{2} \\
\text{Where, } X \text{ and } Y \text{ are two normal distributions.}
$$
* To calculate FID, activation statistics of a pre-trained InceptionV3 model are used. Global average pooling is applied to the final activation layer to obtain a 2048-dimensional feature vector per image.
* FID for the multivariate case is calculated as:
$$
FID(X, Y) = ||\mu_{X} - \mu_{Y}||^{2} + Tr\big(\Sigma_{X} + \Sigma_{Y} - 2 (\Sigma_{X} \Sigma_{Y})^{1/2}\big) \\
\text{Where, } X \text{ and } Y \text{ are the real and generated embeddings,} \\
\mu_{X} \text{ and } \mu_{Y} \text{ are the means of embedding vectors } X \text{ and } Y, \\
\Sigma_{X} \text{ and } \Sigma_{Y} \text{ are the covariance matrices of embedding vectors } X \text{ and } Y, \\
Tr \text{ is the trace of the matrix, i.e. the sum of its diagonal elements.}
$$
* Lower FID indicates that the generated distribution is closer to the real one.
* [Link](https://wandb.ai/ayush-thakur/gan-evaluation/reports/How-to-Evaluate-GANs-using-Frechet-Inception-Distance-FID---Vmlldzo0MTAxOTI)
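A minimal NumPy/SciPy sketch of the FID computation above, assuming the 2048-dimensional Inception features have already been extracted for the real and generated images:

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    """FID between two sets of Inception features (arrays of shape (N, 2048))."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_f = np.cov(fake_feats, rowvar=False)

    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real

    diff = mu_r - mu_f
    return diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean)
```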
### Generative Adversarial Text to Image Synthesis (GAN-CLS) [2016] [First paper]

* [Link](https://arxiv.org/abs/1605.05396)

### Conditional Image Synthesis with Auxiliary Classifier GANs (AC-GAN) [2016]

* This paper proposes a training strategy in which an auxiliary classification head is added to the ***discriminator***; both the discriminator and the generator objectives then include a classification term.
* It also shows that high-quality image generation is more than simple upsampling of low-resolution images, by measuring how downsampling high-resolution images affects classification accuracy.
$$
\mathcal{L}_{I} = \mathbb{E}_{x \sim p(x)}[log(D_{\theta}(x))] + \mathbb{E}_{z \sim \mathcal{N}(0, I)}[log(1-D_{\theta}(G_{\phi}(z)))] \\
\mathcal{L}_{C} = \mathbb{E}[log(P(y_{real}|x_{real}))] + \mathbb{E}[log(P(y_{fake}|x_{fake}))] \\
\mathcal{L}_{D} = \mathcal{L}_{I} + \mathcal{L}_{C} \text{ (maximized by the discriminator)} \\
\mathcal{L}_{G} = - \mathbb{E}_{z \sim \mathcal{N}(0, I)}[log(D_{\theta}(G_{\phi}(z)))] - \mathbb{E}[log(P(y_{fake}|x_{fake}))] \text{ (minimized by the generator)}
$$
<br/>
<table> <tr> <td align="center"> <img src="https://i.imgur.com/lKu1gRd.png" style="width:60%;"/> </td> </tr> <tr> <td align="center"> <b>AC-GAN</b> </td> </tr> </table>

### Text Conditioned Auxiliary Classifier Generative Adversarial Network (TAC-GAN) [2017]

* This paper builds on AC-GAN and proposes a framework for text-based image generation inspired by the work of Reed et al.
* The authors add a text-embedding network to AC-GAN and, similar to Reed et al., train a matching-aware discriminator that learns whether an image is paired with the correct text.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/ScHJoxE.png" style="width:70%;"/> </td> </tr> <tr> <td align="center"> <b>TAC-GAN</b> </td> </tr> </table>
* Input to the discriminator:
$$
\mathcal{A}_{D} = \{ (I_{f}, C_{f}, l_{f}), (I_{r}, C_{r}, l_{r}), (I_{w}, C_{w}, l_{r}) \} \\
\text{Where, } I \text{ is the image, } C \text{ is the class label and } l \text{ is the text} \\
\text{embedding obtained by passing } \psi \text{ through a fully connected layer;} \\
r \text{ denotes real, } f \text{ fake and } w \text{ wrong.}
$$
* Input to the generator:
$$
\mathcal{A}_{G} = \{ (I_{f}, C_{r}, l_{r})\} \\
\text{Where, } I \text{ is the image, } C \text{ is the class label and } l \text{ is the text} \\
\text{embedding obtained by passing } \psi \text{ through a fully connected layer;} \\
f \text{ denotes fake.}
$$
* Training objective for the discriminator:
$$
L_{D_{S}} = H(D_{S}(I_{r}, l_{r}), 1) + H(D_{S}(I_{f}, l_{r}), 0) + H(D_{S}(I_{w}, l_{r}), 0) \\
L_{D_{C}} = H(D_{C}(I_{r}, l_{r}), C_{r}) + H(D_{C}(I_{f}, l_{r}), C_{r}) + H(D_{C}(I_{w}, l_{r}), C_{w}) \\
\text{Here, } H \text{ denotes binary cross entropy.}
$$
* Training objective for the generator:
$$
L_{G_{S}} = H(D_{S}(I_{f}, l_{r}), 1) \\
L_{G_{C}} = H(D_{C}(I_{f}, l_{r}), C_{r}) \\
\text{Here, } H \text{ denotes binary cross entropy.}
$$
* The overall loss thus combines the AC-GAN classification loss with the matching-aware discriminator loss (see the sketch below).
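A minimal PyTorch sketch of the matching-aware adversarial objective described above; `D` is a hypothetical discriminator head that returns the probability that an image is real *and* matches the given text embedding (the auxiliary class-prediction terms of TAC-GAN would be added in the same way with cross-entropy on the class head):

```python
import torch
import torch.nn.functional as F

def matching_aware_d_loss(D, img_real, img_fake, img_wrong, txt_emb):
    """Discriminator sees: (real image, matching text), (fake image, matching text),
    (real but mismatched image, matching text)."""
    ones = torch.ones(img_real.size(0), 1)
    zeros = torch.zeros(img_real.size(0), 1)
    loss_real = F.binary_cross_entropy(D(img_real, txt_emb), ones)
    loss_fake = F.binary_cross_entropy(D(img_fake, txt_emb), zeros)
    loss_wrong = F.binary_cross_entropy(D(img_wrong, txt_emb), zeros)
    return loss_real + loss_fake + loss_wrong

def matching_aware_g_loss(D, img_fake, txt_emb):
    """Generator tries to make its fake image count as real for the matching text."""
    ones = torch.ones(img_fake.size(0), 1)
    return F.binary_cross_entropy(D(img_fake, txt_emb), ones)
```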
### StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks [2017] (high-quality image generation)

* [Link](https://arxiv.org/abs/1612.03242)

### AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks [2017]

* In this paper the authors propose an attentional generative network together with an attention-based image-text matching network used to compute a similarity loss; combined, the two parts generate fine-grained images that correlate strongly with their text descriptions.
* The attentional generative network synthesizes different sub-regions of the image by paying attention to the relevant words of the given text description.
* The proposed method performs better on images containing complex content, e.g. more than one object in the image and a correspondingly rich text description (COCO dataset).
<table> <tr> <td align="center"> <img src="https://i.imgur.com/Wng8yMT.png" style="width:100%;"/> </td> </tr> <tr> <td align="center"> <b>AttnGAN</b> </td> </tr> </table>
* Attentional generative network:
    * The intuition is to let the generator focus, for each sub-region of the image, on the words that are most relevant to it.
    * The network extracts a global sentence embedding as well as word-level embeddings from the given text description.
    * The first sub-network of the generative network uses the global embedding to generate a low-resolution image. Later stages use the image feature matrix $\mathbb{R}^{\hat{D} \times N}$ and the word feature matrix $\mathbb{R}^{D \times T}$ to generate a fine-grained encoding, which the generator uses to synthesize a higher-quality image semantically aligned with the given text.
    * To generate the fine-grained encoding, the following attention mechanism is used (a sketch follows this section):
$$
\text{Word feature matrix (} R^{D \times T} \text{): } e_{0}, e_{1}, e_{2}, ..., e_{T-1} \\
\text{Hidden image feature matrix (} R^{\hat{D} \times N} \text{) generated by network } F \text{: } h_{0}, h_{1}, h_{2}, ..., h_{N-1} \\
\text{Word features are converted to the common semantic space of the image by passing} \\
\text{them through a perceptron layer } U \in R^{\hat{D} \times D} \text{: } \\
e' = Ue \in R^{\hat{D} \times T} \\
\text{The semantic score between each pair of vectors is calculated as:} \\
s' = h^{T}e' \in R^{N \times T} \\
\text{The semantic score is normalized over the words as follows:} \\
\beta_{j,i} = \frac{e^{s'_{j,i}}}{\sum_{k=0}^{T-1} e^{s'_{j,k}}} \\
\text{The final context vector for sub-region } j \text{ is:} \\
c_{j} = \sum_{i=0}^{T-1} \beta_{j,i} e'_{i} \\
\text{The output of the attention network is:} \\
F^{attn}(e, h) = (c_{0}, c_{1}, c_{2}, ..., c_{N-1}) \in R^{\hat{D} \times N}
$$
    * The generated attention context vectors are then passed to the generator for synthesizing the higher-resolution image.
    * Three generators are used in the complete network: $G_{0}$ generates a $64 \times 64$ image, $G_{1}$ a $128 \times 128$ image and $G_{2}$ a $256 \times 256$ image.
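A minimal PyTorch sketch of the word-to-sub-region attention $F^{attn}$ above; tensors are laid out row-major, i.e. transposed with respect to the column notation in the equations:

```python
import torch
import torch.nn.functional as F

def word_attention(h, e, U):
    """Sketch of AttnGAN-style F^attn.

    h : (N, D_hat)  hidden image features, one row per sub-region
    e : (T, D)      word features
    U : (D_hat, D)  perceptron mapping words into the image feature space
    Returns the context matrix c of shape (N, D_hat).
    """
    e_prime = e @ U.t()              # (T, D_hat): words in the image feature space
    scores = h @ e_prime.t()         # (N, T): s'_{j,i} = <h_j, e'_i>
    beta = F.softmax(scores, dim=1)  # normalize over words for each sub-region
    c = beta @ e_prime               # (N, D_hat): c_j = sum_i beta_{j,i} e'_i
    return c
```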
* Generator loss:
$$
\mathcal{L}_{G_{i}} = \text{Unconditional loss} + \text{Conditional loss} \\
\mathcal{L}_{G_{i}} = -\frac{1}{2}\mathbb{E}_{\hat{x}_{i} \sim P_{G_{i}}}[log(D_{i}(\hat{x}_{i}))] - \frac{1}{2}\mathbb{E}_{\hat{x}_{i} \sim P_{G_{i}}}[log(D_{i}(\hat{x}_{i}, \hat{e}))]
$$
* Discriminator loss:
$$
\mathcal{L}_{D_{i}} = -\frac{1}{2}\mathbb{E}_{x_{i} \sim P_{data_{i}}}[log(D_{i}(x_{i}))] - \frac{1}{2}\mathbb{E}_{\hat{x}_{i} \sim P_{G_{i}}}[log(1-D_{i}(\hat{x}_{i}))] \\
-\frac{1}{2}\mathbb{E}_{x_{i} \sim P_{data_{i}}}[log(D_{i}(x_{i}, \hat{e}))] - \frac{1}{2}\mathbb{E}_{\hat{x}_{i} \sim P_{G_{i}}}[log(1-D_{i}(\hat{x}_{i}, \hat{e}))]
$$
* Total loss: $\mathcal{L}_{G} = \sum_{i=0}^{m-1} \mathcal{L}_{G_{i}}$
* Deep Attentional Multimodal Similarity Model (DAMSM)
    * The authors propose this network to enforce consistency between text and image.
    * The network is pre-trained on image-text pairs and later used, with frozen weights, for computing the loss $\mathcal{L}_{DAMSM}$.
    * The intuition behind this network is to map image-text pairs into the same embedding space.
    * To encode the text, a bi-directional LSTM is used, where $e \in \mathbb{R}^{D \times T}$ are the local word-level embeddings and $\bar{e} \in \mathbb{R}^{D}$ is the global sentence embedding. $D$ is the text embedding size and $T$ is the number of words in the sentence. The text encoder is trained from scratch.
    * To encode the images, the authors use a pre-trained InceptionV3 network; images are resized to $299 \times 299$ before being given as input to the network.
    * Local image features are extracted from the "mixed_6e" layer of InceptionV3. They are represented as $f \in \mathbb{R}^{768 \times 17 \times 17}$, i.e. $17 \times 17$ sub-regions each with a 768-dimensional feature, and reshaped into a matrix $f \in \mathbb{R}^{768 \times 289}$. The global feature is extracted from the last layer after global average pooling and is represented as $\bar{f} \in \mathbb{R}^{2048}$.
    * Both local and global image features are transformed into the common semantic space of the text by passing them through perceptron layers trained from scratch; the InceptionV3 weights are kept frozen.
$$
\text{Local image feature transformation: } v = Wf \\
\text{Global image feature transformation: } \bar{v} = \bar{W} \bar{f} \\
\text{Where,} \\
W \in \mathbb{R}^{D \times 768} \text{ (fully connected layer)} \\
v \in \mathbb{R}^{D \times 289} \text{ (transformed local image feature)} \\
\bar{W} \in \mathbb{R}^{D \times 2048} \text{ (fully connected layer)} \\
\bar{v} \in \mathbb{R}^{D} \text{ (transformed global image feature)}
$$
    * A similarity matrix is calculated between image sub-regions and words:
$$
s = e^{T} v \text{ (local word feature matrix with local image feature matrix)} \\
\text{Where,} \\
e^{T} \in \mathbb{R}^{T \times D} \\
v \in \mathbb{R}^{D \times 289} \\
s \in \mathbb{R}^{T \times 289} \\
s_{(i,j)} \text{ is the dot product of the } i\text{th word feature and the } j\text{th image region,} \\
\text{which gives a similarity score indicating how related they are.}
$$
    * The calculated similarity score is then normalized over the words:
$$
\bar{s}_{i,j} = \frac{e^{s_{i,j}}}{\sum_{k=0}^{T-1} e^{s_{k,j}}}
$$
    * Region-context vector: it captures how much each image sub-region is related to the $i$th word.
$$
\alpha_{j} = \frac{e^{\gamma_{1} \bar{s}_{i,j}}}{\sum_{k=0}^{288} e^{\gamma_{1} \bar{s}_{i,k}}} \\
c_{i} = \sum_{j=0}^{288}\alpha_{j}v_{j} \\
\text{Where,} \\
\gamma_{1} \text{ is a factor that determines how much attention is paid to the relevant sub-regions.}
$$
* Relevance between the region-context vector and the $i$th word:
$$
R(c_{i}, e_{i}) = \frac{c_{i}^{T}e_{i}}{||c_{i}||\,||e_{i}||}
$$
* Attention-driven image-text matching score:
$$
R(Q, D) = log\bigg(\sum_{i=1}^{T-1} e^{\gamma_{2} R(c_{i}, e_{i})}\bigg)^{\frac{1}{\gamma_{2}}} \\
\text{Where, } \gamma_2 \text{ is a factor that determines how much to magnify the importance of} \\
\text{the most relevant word-to-region-context pair.}
$$
* DAMSM loss:
$$
\{(Q_{i}, D_{i})\}_{i=1}^{M} \\
\text{Where, } (Q, D) \text{ is an image-text pair.} \\
\text{Matching between image } Q_{i} \text{ and text } D_{i} \text{ is computed as:} \\
P(D_{i}|Q_{i}) = \frac{e^{\gamma_{3}R(Q_{i}, D_{i})}}{\sum_{j=1}^{M}e^{\gamma_{3}R(Q_{i}, D_{j})}} \\
\text{Where, } \gamma_{3} \text{ is a smoothing factor; all descriptions except the matching one are treated as negatives.} \\
\text{Matching between text } D_{i} \text{ and image } Q_{i} \text{ is computed as:} \\
P(Q_{i}|D_{i}) = \frac{e^{\gamma_{3}R(Q_{i}, D_{i})}}{\sum_{j=1}^{M}e^{\gamma_{3}R(Q_{j}, D_{i})}} \\
\text{All images except the matching one are treated as negatives.} \\
\mathcal{L}^{w}_{1} = - \sum_{i=1}^{M} log(P(D_{i} | Q_{i})) \\
\mathcal{L}^{w}_{2} = - \sum_{i=1}^{M} log(P(Q_{i} | D_{i})) \\
\text{The same losses are computed for the global embeddings of images and sentences, where} \\
R(Q, D) = \frac{\bar{v}^{T} \bar{e}}{||\bar{v}||\,||\bar{e}||} \\
\bf \mathcal{L}_{DAMSM} = \mathcal{L}^{w}_{1} + \mathcal{L}^{w}_{2} + \mathcal{L}^{s}_{1} + \mathcal{L}^{s}_{2}
$$
* $\gamma_{1} = 5$, $\gamma_{2} = 5$, $\gamma_{3} = 10$ and $M = 50$
* Finally the complete network is trained using the total loss: $\mathcal{L_{total}} = L_{G} + \lambda L_{DAMSM}$
* The authors run ablation experiments, removing components one by one or adding them back to the initial network, to verify the improvement contributed by each.

### Semi-supervised FusedGAN for Conditional Image Generation [2018]

* This paper proposes a method for generating images with controllable information while maintaining fidelity and diversity.
* The controllable information includes posture, style, background and fine-grained details.
* Another aim of the paper is to show that the proposed method learns a ***disentangled*** representation.
* As per the paper, disentanglement is achieved by cascading different generator models.
* One generator-discriminator pair is responsible for unsupervised image generation and is trained as a standard GAN; another pair is trained to generate images conditioned on the text.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/BZup70k.png" style="width:70%;"/> </td> </tr> <tr> <td align="center"> <b>FusedGAN</b> </td> </tr> </table>
* Generation is divided into two stages. The first stage is shared by the second-stage generators: it produces a rough structure of the object under consideration, acting like a sketch that encodes posture. The generated structure is then passed to two different second-stage generators, one unconditional and one conditioned on the text (see the equations below and the sketch that follows).
$$
z \sim \mathcal{N}(0, I) \\
M_{s} = G(z) \\
I_{unconditioned\_fake} = G_{u}(M_{s}) \\
\psi_{t} = E(y) \\
I_{conditioned\_fake} = G_{c}(M_{s}, \psi_{t}) \\
\text{Here,} \\
z \text{ is the sampled noise vector,} \\
\psi_{t} \text{ is the text representation obtained by passing the text embedding through the CA network,} \\
y \text{ is the embedding of the text.}
$$
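A minimal PyTorch sketch of FusedGAN's cascaded generation as summarized by the equations above; the module names and the CA interface are assumptions for illustration:

```python
import torch.nn as nn

class FusedGenerator(nn.Module):
    """Sketch of FusedGAN's cascaded generators (module names are illustrative)."""

    def __init__(self, G_shared, G_u, G_c, text_ca):
        super().__init__()
        self.G_shared, self.G_u, self.G_c, self.text_ca = G_shared, G_u, G_c, text_ca

    def forward(self, z, text_emb):
        m_s = self.G_shared(z)           # shared structure prior M_s (the "sketch"/posture)
        img_uncond = self.G_u(m_s)       # unconditional branch
        psi_t = self.text_ca(text_emb)   # conditioning augmentation of the text embedding
        img_cond = self.G_c(m_s, psi_t)  # text-conditional branch
        return img_uncond, img_cond
```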
* The text embedding is also passed to the discriminator of the conditional branch, but it is not trained as a matching-aware discriminator.
* The complete network is trained with GAN losses in an alternating manner: the parameters are first updated for the unconditional pair and then for the conditional pair.
* The authors present several experiments as empirical evidence that cascading the generators brings disentanglement to text-to-image synthesis:
    * Fixed posture with varying styles.
        * This experiment shows that after generating a structure from the first-stage generator, the text caption can be varied to generate images in different styles while keeping the posture constant.
    * Fixed posture with varying details.
        * This experiment shows how conditioning augmentation (CA) of the text provides control over disentanglement. The authors keep the same structure and draw different CA samples for the same text, showing that this creates different textures for birds of the same species.
    * Interpolation with the same posture but varying styles.
        * This is a simple interpolation in the text latent space while keeping the bird structures fixed.
* The Inception Score is used for quantitative analysis.

### Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network (HDGAN) [2018]

* The authors propose a novel framework for generating high-quality, text-conditioned images with fidelity and diversity.
* Along with the framework they also propose an evaluation metric, the Visual-Semantic Similarity (VS) score, which evaluates how well a generated image follows the given text description.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/DApkkUz.png" style="width:100%;"/> </td> <td align="center"> <img src="https://i.imgur.com/nZGJCAo.png" style="width:100%;"/> </td> </tr> <tr> <td align="center"> <b>HDGAN</b> </td> <td align="center"> <b>Probabilistic map output at different levels</b> </td> </tr> </table>
* According to the paper, the crucial part is the generation of the $64\times64$ image, which acts as the base structure for the higher levels of the framework.
* To train the discriminators the authors use a matching-aware training strategy similar to StackGAN.
* The HDGAN framework produces outputs at multiple levels, where level $K$ corresponds to a resolution of $2^{K}$; as per the paper the highest level is $K=9$, i.e. $512 \times 512$ images are generated (with side outputs at 64, 128, 256 and 512). The output at each level is passed to a discriminator for that level; in this way the discriminators at different levels push fine-grained information into the image and act as regularizers.
* The discriminators maintain global consistency, but to ensure local patch-based consistency the authors branch the discriminator output into two parts. The first part is a single real/fake score; the second is a probabilistic map in which each entry indicates whether the corresponding local patch is real or fake, similar to CycleGAN and Pix2Pix. The loss for this map is computed analogously to the real/fake loss: the output probability map is compared against an all-ones map for real images and an all-zeros map for fake images (see the sketch below).
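A minimal PyTorch sketch of the local (patch-level) adversarial loss just described; `D_local` is a hypothetical discriminator branch that returns one real/fake probability per local patch, e.g. a map of shape (B, 1, H', W'):

```python
import torch
import torch.nn.functional as F

def local_patch_d_loss(D_local, real_imgs, fake_imgs, txt_emb):
    """Compare the per-patch probability maps against all-ones (real) / all-zeros (fake)."""
    real_map = D_local(real_imgs, txt_emb)
    fake_map = D_local(fake_imgs, txt_emb)
    loss_real = F.binary_cross_entropy(real_map, torch.ones_like(real_map))
    loss_fake = F.binary_cross_entropy(fake_map, torch.zeros_like(fake_map))
    return loss_real + loss_fake
```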
* Similar to StackGAN, the authors use Conditioning Augmentation (CA) to ensure smoothness and consistency in the text embedding space.
* **Visual-Semantic Similarity score**:
    * To use this metric, an image encoder and a text encoder are first trained with the following loss function:
$$
L_{total} = \sum_{v}\sum_{t_{\hat{v}}} max(0, c(f_{v}(v), f_{t}(t_{\hat{v}})) - c(f_{v}(v), f_{t}(t_{v})) + \delta) + \\
\sum_{t}\sum_{v_{\hat{t}}} max(0, c(f_{t}(t), f_{v}(v_{\hat{t}})) - c(f_{t}(t), f_{v}(v_{t})) + \delta) \\
\text{Where,} \\
v \text{ denotes the image feature vector extracted with a pre-trained Inception model,} \\
f_{v}, f_{t} \text{ are mapping functions that map the image-text pair to a common space } \mathbb{R^{512}}, \\
\delta \text{ is the margin, set to 0.2,} \\
\{v, t\} \text{ is a ground-truth image-text pair,} \\
\{v, t_{\hat{v}}\} \text{ is a mismatched image-text pair,} \\
\{v_{\hat{t}}, t\} \text{ is a mismatched text-image pair.}
$$
    * This is a triplet (ranking) loss used to learn the parameters of the two encoders; a sketch is given after this section.
    * After learning the parameters, the following function is used to measure the similarity:
$$
c(x, y) = \frac{x \cdot y}{||x||_{2} \, ||y||_{2}} \\
\text{Where, } x \text{ and } y \text{ are the image and text embeddings obtained from } f_{v} \text{ and } f_{t}.
$$
* The authors perform multiple experiments to support their hypotheses:
    * Hierarchically-nested adversarial training:
        * This experiment shows the importance of having a discriminator at each level; removing the discriminators at the lower levels degrades the generated results.
    * The local image loss:
        * Similar experiments show the effectiveness of the local image loss for pixel-level consistency in the images.
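A minimal PyTorch sketch of the visual-semantic ranking loss and cosine similarity above, treating every non-matching element of a batch of matched pairs as a mismatched pair:

```python
import torch
import torch.nn.functional as F

def cosine_sim(a, b):
    """c(x, y) above: cosine similarity along the last dimension."""
    return F.cosine_similarity(a, b, dim=-1)

def vs_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb : (B, 512) outputs of f_v and f_t for B matched pairs."""
    sims = cosine_sim(img_emb.unsqueeze(1), txt_emb.unsqueeze(0))  # (B, B): sims[i, j] = c(v_i, t_j)
    pos = sims.diag().unsqueeze(1)                                 # matched-pair scores c(v_i, t_i)

    # Hinge: a mismatched pair should score lower than the matched pair by `margin`.
    cost_txt = (margin + sims - pos).clamp(min=0)        # wrong texts for each image
    cost_img = (margin + sims - pos.t()).clamp(min=0)    # wrong images for each text
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_txt.sum() + cost_img.sum()
```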
### Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis (PPAN) [AAAI 2019]

* This paper presents a novel framework consisting of one generator and multi-level discriminators that synthesizes realistic images from text descriptions. Each discriminator level forces the generator to produce finer-grained images.
* The authors use a perceptual loss instead of a pixel-based loss for generating realistic images. The multi-purpose discriminators encourage semantic consistency, image fidelity and class invariance.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/xaRHM6h.png" style="width:100%;"/> </td> </tr> <tr> <td align="center"> <b>PPAN</b> </td> </tr> </table>
* The PPAN framework uses a conditioning augmentation (CA) network, inspired by StackGAN, to prevent discontinuity in the high-dimensional text embedding space. A text embedding $\phi_{t}$ is passed through the CA network to sample a $\mathbb{R}^{128}$ conditioning vector, which is concatenated with a noise vector $z$ and passed to the generator. The generator up-samples this input to produce $64 \times 64$, $128 \times 128$ and $256 \times 256$ images at different stages. This single generator is connected to three different discriminators $D_{1}, D_{2}$ and $D_{3}$.
* The authors use four different types of loss at different stages of the network:
    * $L_{1}$ is a matching-aware loss inspired by GAN-INT-CLS (Reed et al.), which encourages the discriminator to learn contextual information. It takes three pairs as input, i.e. $(I_{r}, \phi_{t}^{r})$, $(I_{f}, \phi_{t}^{r})$ and $(I_{r}, \phi_{t}^{f})$.
    * $L_{2}$ is a local image loss that encourages the discriminator to penalize the generator for not producing diverse and locally stable images, similar to HDGAN.
    * $L_{3}$ is a class-information loss inspired by TAC-GAN. Its purpose is to encourage the discriminator to understand the differences between image classes and to penalize the generator for not generating images of the correct class.
    * $L_{4}$ is a perceptual loss used to encourage the generator to produce spatially correlated images rather than arbitrary pixels. It is inspired by the work of Johnson et al. on style transfer.
* Discriminators $D_{1}$ and $D_{2}$ use the $L_{1}$ and $L_{2}$ losses; as per the paper, learning at the lower levels helps generate semantically meaningful images. $D_{3}$ uses the $L_{3}$ loss along with $L_{1}$ and $L_{2}$. The $L_{4}$ loss is applied to the $256 \times 256$ images, because high-resolution images contain more discriminative features, which forces the generator to produce semantically coherent pixels.
* A $1 \times 1$ convolution is used to merge the textual information with the image features inside the discriminators.
* The authors use the following evaluation metrics:
    * Inception Score (IS) [higher is better]
    * Visual-Semantic Similarity (VS) [from HDGAN; higher is better]
* The authors verify the importance of each component by first removing all three discriminators and then adding them back one by one, checking the improvement in the IS and VS scores.

### MirrorGAN: Learning Text-to-Image Generation by Redescription [2019]

* In this paper the authors propose a mirrored text-to-image generation strategy. An image is generated from the text description and a noise vector; the generated image is then passed through an image-captioning network to recover a caption, which is supposed to be similar to the text description used to generate the image in the first place.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/4Z7OgIS.png" style="width:80%;"/> </td> </tr> </table>
* The authors propose three modules for transforming text into an image:
    * **STEM** (semantic text embedding module): generates the local (word-level) and global (sentence-level) embeddings for the text.
    * **GLAM** (global-local collaborative attentive module): similar to the AttnGAN attention module, but with global sentence attention added alongside the local attention between words and image features.
    * **STREAM** (semantic text regeneration and alignment module): an image-captioning module that takes the generated image as input and produces a caption, which is later used to estimate the inconsistency between the input caption and the regenerated caption.
* The reason for adding global attention alongside word-level attention is that word-level attention alone does not ensure global semantic consistency, because the text and image modalities are diverse: an image can have 5 or more captions (as in the COCO dataset), and captions can convey the same image content in different ways.
* To train the model end to end, the following losses are used:
    * Visual realism adversarial loss: whether the generated image is real or fake.
    * Text-image paired semantic consistency loss: whether the generated image is consistent with the text.
    * Text-semantic reconstruction loss: a cross-entropy loss that makes the text-to-image and image-to-text tasks consistent and guides the generator to produce semantically consistent images.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/or4A91J.png" style="width:100%;"/> </td> </tr> <tr> <td align="center"> <b>MirrorGAN Architecture</b> </td> </tr> </table>
* Text embedding module (STEM):
$$
w, s = RNN(T) \\
\text{Where,} \\
w \in \mathbb{R}^{D \times L} \text{ (embedding for each word)} \\
s \in \mathbb{R}^{D} \text{ (global sentence embedding)} \\
T: \text{ the text description} \\
D: \text{ word embedding size} \\
L: \text{ number of words in the text description}
$$
* Because text is diverse, small perturbations of a sentence can carry the same meaning. To make the embedding space smooth and consistent, the authors use Conditioning Augmentation (CA), inspired by StackGAN.
$$
s_{CA} = F_{CA}(s) \\
\text{Where,} \\
s_{CA} \in \mathbb{R}^{D'} \\
D': \text{ the dimension of the conditioning augmentation output}
$$
* Local and global attention mechanism (GLAM):
    * An incremental (multi-stage) generator network is used.
    * Similar to AttnGAN:
$$
f_{0} = F_{0}(z, s_{CA}) \\
f_{i} = F_{i}(f_{i-1}, F_{att_{i}}(f_{i-1}, w, s_{CA})), \; i \in \{1, 2, ..., m-1\} \\
F_{att_{i}}(f_{i-1}, w, s_{CA}) = concat(F_{att_{i}}^{w}, F_{att_{i}}^{s}) \\
I_{i} = G_{i}(f_{i}), \; i \in \{0, 1, 2, ..., m-1\} \\
\text{Where,} \\
\{F_{0}, F_{1}, ..., F_{m-1}\}: \text{ the } m \text{ visual feature transformation networks,} \\
\{G_{0}, G_{1}, ..., G_{m-1}\}: \text{ the } m \text{ generator networks,} \\
z \sim \mathcal{N}(0, I) \text{ (random noise)} \\
F_{att_{i}}: \text{ the attention module consisting of both local } (F_{att_{i}}^{w}) \text{ and global } (F_{att_{i}}^{s}) \text{ attention} \\
f_{i} \in \mathbb{R}^{M_{i} \times N_{i}} \\
I_{i} \in \mathbb{R}^{q_{i} \times q_{i}} \\
q_{i} \in \{64, 128, 256\}
$$
    * Attention mechanism:
        * First the word-level attentive context features are generated ($F_{att_{i}}^{w}(w, f_{i-1})$):
$$
\text{The word embedding } w \text{ is converted into the common semantic space of the visual features} \\
w' = U_{i-1}w \text{ (using a perceptron layer).} \\
\text{The common semantic features are then combined with the visual features to get the attention output:} \\
Att_{i-1}^{w} = \sum_{l=0}^{L-1} (w')^{l} \cdot softmax(f_{i-1}^{T}(w')^{l}) \\
\text{Where,} \\
U_{i-1} \in \mathbb{R}^{M_{i-1} \times D} \\
w' \in \mathbb{R}^{M_{i-1} \times L} \\
Att_{i-1}^{w} \in \mathbb{R}^{M_{i-1} \times N_{i-1}}
$$
        * After the word-level attention, sentence-level attention is applied ($F_{att_{i}}^{s}(s_{CA}, f_{i-1})$):
$$
\text{Similar to the word attention, } s_{CA} \text{ is converted into the common semantic space of the visual features} \\
s' = V_{i-1}s_{CA} \text{ (using a perceptron layer).} \\
\text{This common semantic feature is then multiplied with the visual features to get the attention output:} \\
Att_{i-1}^{s} = s' \cdot softmax(f_{i-1}^{T}s') \\
\text{Where,} \\
V_{i-1} \in \mathbb{R}^{M_{i-1} \times D'} \\
Att_{i-1}^{s} \in \mathbb{R}^{M_{i-1} \times N_{i-1}}
$$
* Image to text (STREAM):
    * The authors use a pre-trained CNN to extract image features, so that the image can be given as input to the LSTM network.
    * The LSTM is trained from scratch. The authors train this captioning network separately and then freeze its weights when using it for loss computation (a sketch follows the equations below).
$$
x_{-1} = CNN(I_{m-1}) \text{ (visual feature)} \\
x_{t} = W_{e}T_{t}, \; t \in \{0, 1, 2, ..., L-1\} \\
p_{t+1} = RNN(x_{t}), \; t \in \{0, 1, 2, ..., L-1\} \\
\text{Where,} \\
x_{-1} \in \mathbb{R}^{M_{m-1}} \text{ (visual feature used as input at the first step)} \\
W_{e} \in \mathbb{R}^{M_{m-1} \times D} \text{ (a word embedding matrix that maps word features to the visual feature space)} \\
p_{t+1}: \text{ predicted probability distribution over words}
$$
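A minimal PyTorch sketch of the STREAM captioning loss under teacher forcing, following the equations above; the `cnn`, `word_embed` and `rnn` interfaces are hypothetical stand-ins for the pre-trained CNN, the word embedding $W_e$ and the LSTM:

```python
import torch
import torch.nn.functional as F

def stream_caption_loss(cnn, rnn, word_embed, image, caption_ids):
    """STREAM loss under teacher forcing.

    caption_ids : (L,) ground-truth token ids T_0 ... T_{L-1}
    """
    x_start = cnn(image).unsqueeze(0)              # x_{-1}: visual feature fed at the first step
    x_words = word_embed(caption_ids[:-1])         # ground-truth words fed at steps 0 .. L-2
    inputs = torch.cat([x_start, x_words], dim=0)  # (L, feature_dim)

    logits = rnn(inputs)                           # (L, vocab_size): p_0 ... p_{L-1}
    # -sum_t log p_t(T_t), averaged over the caption length
    return F.cross_entropy(logits, caption_ids)
```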
* Loss functions:
    * Visual and text-semantic adversarial loss:
$$
\mathcal{L}_{G_{i}} = -\frac{1}{2} \mathbb{E}_{I_{i} \sim p_{I_{i}}}[log(D_{i}(I_{i}))] -\frac{1}{2} \mathbb{E}_{I_{i} \sim p_{I_{i}}}[log(D_{i}(I_{i}, s))]
$$
    * Image-to-caption loss:
$$
\mathcal{L}_{stream} = - \sum_{t=0}^{L-1}log(p_{t}(T_{t}))
$$
    * Total loss: $\mathcal{L}_{G} = \sum_{i=0}^{m-1} \mathcal{L}_{G_{i}} + \lambda \mathcal{L}_{stream}$
* Ablation studies:
    * Contribution of each MirrorGAN component.
    * Effect of the cascaded MirrorGAN generator.

### DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis [2019]

* In this paper the authors propose a dynamic-memory-based method for generating fine-grained images that follow their textual descriptions.
* The motivation is that all previous multi-stage methods depend on the initially generated image for further text-based alignment and refinement, which is a problem if the initial image is of low quality, i.e. contains little or noisy information unrelated to the text.
* The dynamic memory method consists of two parts:
    * Gated Memory Writing: its purpose is to focus on the relevant words for refining the initial image. For example, if the initial image contains a bird with black stripes but the text says white, it is the responsibility of the gated memory writing network to identify this and write it to memory.
    * Gated Response: it fuses the relevant information from the text into the image by reading from the memory.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/OWJjd6N.png" style="width:100%;"/> </td> </tr> <tr> <td align="center"> <b>DM-GAN Architecture</b> </td> </tr> </table>
* The proposed approach is:
    * First the text is passed through a pre-trained bi-directional LSTM, where $s$ is the global embedding of the text and $W$ are the word-level embeddings.
    * $s$ is passed through the conditioning augmentation (CA) network, which yields a new embedding; this is concatenated with the noise vector $z$ and passed through the initial image generator ($G_{0}$), producing an image $x_{0} \in \mathbb{R^{64 \times 64}}$.
    * The feature map from the initial network ($R_{0} = G_{0}(z, s)$) is passed to the dynamic-memory-based image refinement network $G_{i}$.
    * $x_{i} = G_{i}(R_{i-1}, W)$ uses the word-level embeddings to refine the initially generated image via the **memory writing gate** and the **response gate**.
* Notation:
$$
W = \{w_{1}, w_{2}, w_{3}, ..., w_{T}\}, \; w_{i} \in \mathbb{R}^{N_w} \\
R = \{r_{1}, r_{2}, r_{3}, ..., r_{N}\}, \; r_{i} \in \mathbb{R}^{N_r} \\
\text{Where,} \\
T \text{ is the number of words,} \\
N \text{ is the number of pixels in the image,} \\
N_{w} \text{ is the dimension of a word feature,} \\
N_{r} \text{ is the dimension of a pixel feature.}
$$
* The **dynamic memory** consists of four steps (presented first in their basic, ungated form):
    * Memory writing (for a single word):
$$
m_{i} = M(w_{i}), \; m_{i} \in \mathbb{R}^{N_{m}} \\
\text{Where,} \\
M(\cdot) \text{ is a } 1 \times 1 \text{ convolution that encodes word features into the memory feature space,} \\
N_{m} \text{ is the size of the memory feature space.}
$$
    * Key addressing: calculates a similarity score between memory feature $m_{i}$ and image feature $r_{j}$:
$$
\alpha_{i,j} = \frac{e^{\phi_{K}(m_{i})^{T} r_{j}}}{\sum_{l=1}^{T} e^{\phi_{K}(m_{l})^{T} r_{j}}} \\
\text{Where,} \\
\phi_{K}(\cdot) \text{ is a } 1 \times 1 \text{ convolution that maps memory features to the image feature domain, i.e. } \mathbb{R}^{N_{m}} \text{ to } \mathbb{R}^{N_{r}}, \\
\alpha_{i,j} \text{ indicates how much the } i\text{th word is related to the } j\text{th image feature.}
$$
    * Value reading: aggregates the memories, weighting each word by its importance for the $j$th image feature:
$$
o_{j} = \sum_{i=1}^{T} \alpha_{i,j} \, \phi_{v}(m_{i}) \\
\text{Where,} \\
\phi_{v}(\cdot) \text{ is a } 1 \times 1 \text{ convolution that maps memory features to the image feature domain, i.e. } \mathbb{R}^{N_{m}} \text{ to } \mathbb{R}^{N_{r}}.
$$
    * Response: uses the memory output to refine the image features:
$$
r_{i}^{new} = [o_{i}, r_{i}] \\
\text{Where } r_{i}^{new} \text{ is the new image feature after refinement and} \\
[\cdot,\cdot] \text{ denotes the concatenation operation.}
$$
    * Hyperparameters:
$$
N_{w} = 256, \; N_{r} = 64, \; N_{m} = 128
$$
* **Gated Memory Writing**:
    * For intuition we assumed above that a single word feature refines the image; the proposed gated network instead uses all the words and the current image to identify which words are relevant for refinement (see the sketch after this section).
$$
\text{The following gate computes the importance of each word } w_{i} \text{ for the image } R: \\
g_{i}^{w}(R, w_{i}) = \sigma{\bigg(A w_{i} + B \frac{1}{N} \sum_{i=1}^{N}r_{i}\bigg)} \\
\text{The memory slot is then written by combining the image and word features:} \\
m_{i} = M_{w}(w_{i}) \, g_{i}^{w} + M_{r}\bigg(\frac{1}{N} \sum_{i=1}^{N} r_{i}\bigg) (1 - g_{i}^{w}) \\
\text{Where,} \\
M_w(\cdot) \text{ and } M_{r}(\cdot) \text{ embed the word and image features into the memory feature space } \mathbb{R}^{N_m}, \\
A \in \mathbb{R}^{1 \times N_{w}}, \; B \in \mathbb{R}^{1 \times N_{r}}, \; m_{i} \in \mathbb{R}^{N_{m}}, \\
g_{i}^{w} \text{ is the memory writing gate.}
$$
* **Gated Response**:
    * For intuition we assumed a simple concatenation for the new image feature; this network is proposed instead to fuse the important word memory into the image feature:
$$
g_{i}^{r} = \sigma(W[o_{i}, r_{i}] + b) \\
r_{i}^{new} = o_{i} \, g_{i}^{r} + r_{i} \, (1 - g_{i}^{r}) \\
\text{Where,} \\
W \text{ and } b \text{ are a parameter matrix and a bias term,} \\
g_{i}^{r} \text{ is the response gate.}
$$
* Generator loss:
$$
\mathcal{L}_{G_{i}} = \text{Unconditional loss} + \text{Conditional loss} \\
\mathcal{L}_{G_{i}} = -\frac{1}{2}\mathbb{E}_{\hat{x}_{i} \sim P_{G_{i}}}[log(D_{i}(\hat{x}_{i}))] - \frac{1}{2}\mathbb{E}_{\hat{x}_{i} \sim P_{G_{i}}}[log(D_{i}(\hat{x}_{i}, s))]
$$
* Discriminator loss:
$$
\mathcal{L}_{D_{i}} = -\frac{1}{2}\mathbb{E}_{x_{i} \sim P_{data_{i}}}[log(D_{i}(x_{i}))] - \frac{1}{2}\mathbb{E}_{\hat{x}_{i} \sim P_{G_{i}}}[log(1-D_{i}(\hat{x}_{i}))] \\
-\frac{1}{2}\mathbb{E}_{x_{i} \sim P_{data_{i}}}[log(D_{i}(x_{i}, s))] - \frac{1}{2}\mathbb{E}_{\hat{x}_{i} \sim P_{G_{i}}}[log(1-D_{i}(\hat{x}_{i}, s))]
$$
* Loss for the whole network:
$$
\mathcal{L} = \sum_{i} \mathcal{L}_{G_{i}} + \lambda_{1} \mathcal{L}_{CA} + \lambda_{2} \mathcal{L}_{DAMSM}
$$
* The authors perform an ablation study by dropping each proposed component from the base network and show the improvement gained by adding each component back one by one.
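A minimal PyTorch sketch of the gated memory writing and gated response steps described above; the shapes and the `M_w`/`M_r` module interfaces are assumptions that follow the notation in the equations:

```python
import torch

def gated_memory_writing(w, r, M_w, M_r, A, B):
    """w : (T, N_w) word features; r : (N, N_r) image (pixel) features.
    M_w, M_r map into the memory space R^{N_m}; A : (1, N_w), B : (1, N_r)."""
    r_mean = r.mean(dim=0)                                   # (N_r,) average image feature
    g_w = torch.sigmoid(w @ A.t() + r_mean @ B.t())          # (T, 1) importance of each word
    m = M_w(w) * g_w + M_r(r_mean).unsqueeze(0) * (1 - g_w)  # (T, N_m) memory slots
    return m

def gated_response(o, r, W, b):
    """Fuse memory output o with image features r (both (N, N_r));
    W : (1, 2*N_r), b : scalar, giving one gate value per pixel."""
    g_r = torch.sigmoid(torch.cat([o, r], dim=-1) @ W.t() + b)  # (N, 1) response gate
    return o * g_r + r * (1 - g_r)
```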
### Semantics Disentangling for Text-to-Image Generation (SD-GAN) [2019]

* In this paper the authors propose a method that learns disentangled representations for text-to-image synthesis; both high-level semantic consistency and low-level semantic diversity are preserved by the proposed method.

### Controllable Text-to-Image Generation [2019]

* In this paper the authors propose a method for generating images from text as well as manipulating the generated images in a controlled manner: small changes to the sentence should not affect parts of the image unrelated to the text.
* The authors also propose a word-level discriminator that gives fine-grained feedback to the generator for better learning.
* There are three novel components in the architecture:
    * Spatial and channel-wise attention in the generator network for generating better images; the proposed generator follows a multi-stage architecture similar to AttnGAN.
    * A word-level discriminator that exploits the correlation between image regions and words in order to disentangle different attributes.
    * A perceptual loss used to better guide the generator towards realistic images.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/JS2nCxC.png" style="width:100%;"/> </td> </tr> <tr> <td align="center"> <b>ControlGAN</b> </td> </tr> </table>
* The authors use AttnGAN as the base work. To generate the feature vectors for the text, a pre-trained bi-directional LSTM is used.
* **Channel-Wise Attention**
    * Channel-wise attention is proposed to help the network learn the relation between words and feature channels; as per the paper, the spatial attention focuses mostly on colour descriptions. A sketch of this attention is given at the end of this section.
$$
\text{At the } k^{th} \text{ stage:} \\
D: \text{ dimension of the word embedding} \\
L: \text{ number of words in the sentence} \\
v_{k} \in \mathbb{R}^{C \times (H_{k} \times W_{k})} \text{ [visual feature]} \\
w \in \mathbb{R}^{D \times L} \text{ [word feature]} \\
F_{k} \in \mathbb{R}^{ (H_{k} \times W_{k}) \times D } \text{ [perceptron layer]} \\
\hat{w}_{k} \in \mathbb{R}^{ (H_{k} \times W_{k}) \times L } \text{ [transformed word feature]} \\
m^{k} \in \mathbb{R}^{ C \times L } \text{ [channel-wise attention matrix]} \\
\text{The word features are mapped into the semantic space of the visual features:} \\
\hat{w}_{k} = F_{k} w \\
\text{The channel-wise attention matrix is calculated as:} \\
m^{k} = v_{k} \hat{w}_{k} \text{ [correlation between channels and words across all spatial locations]} \\
\text{Normalized attention map: } \alpha_{i,j} = \frac{e^{m^{k}_{i,j}}}{\sum_{l=0}^{L-1}e^{m^{k}_{i,l}}} \\
\alpha_{i,j} \text{ represents the correlation between the } i^{th} \text{ channel of the visual feature } v_{k} \\
\text{and the } j^{th} \text{ word in the sentence } S. \\
\text{Final channel-wise feature: } f_{k}^{\alpha} = \alpha^{k} (\hat{w}_{k})^{T} \\
f_{k}^{\alpha} \text{ is the word-weighted feature for each channel.}
$$
<table> <tr> <td align="center"> <img src="https://i.imgur.com/lZSWg0p.png" style="width:50%;"/> </td> </tr> </table>
* **Word-Level Discriminator**
    * The intuition behind the word-level discriminator is to provide the generator with fine-grained feedback, so that it focuses on the image regions that correspond to the words in the text description. This idea is inspired by the text-adaptive discriminator.
    * $\mathcal{L}_{corre}$ is the correlation loss.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/8Y5j0Qn.png" style="width:80%;"/> </td> </tr> </table>
* Along with the conditional and unconditional generator and discriminator losses, the DAMSM loss is also used.
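A minimal PyTorch sketch of ControlGAN's channel-wise attention described above, following the shapes given in the equations:

```python
import torch
import torch.nn.functional as F

def channel_wise_attention(v, w, F_k):
    """Channel-wise attention at stage k.

    v   : (C, H*W)   visual features
    w   : (D, L)     word features
    F_k : (H*W, D)   perceptron mapping words into the spatial feature space
    Returns f_alpha of shape (C, H*W): word-weighted channel features.
    """
    w_hat = F_k @ w              # (H*W, L): words in the visual (spatial) space
    m = v @ w_hat                # (C, L): channel-word correlation
    alpha = F.softmax(m, dim=1)  # normalize over words for each channel
    f_alpha = alpha @ w_hat.t()  # (C, H*W)
    return f_alpha
```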
### Zero-Shot Text-to-Image Generation (DALL-E) [2021, OpenAI]

#### Learning Transferable Visual Models From Natural Language Supervision (CLIP) [2021] [[Youtube](https://www.youtube.com/watch?v=T9XSU0pKX2E&ab_channel=YannicKilcher)(Yannic)]

* The authors propose a method for learning transferable, discriminative visual features that work in a zero-shot setting; the proposed method adapts to a large variety of datasets.
* The model predicts which sentence matches an image instead of just a single-word label.
* The idea is to learn good features from the vast amount of natural-language text available; learning features from a fixed labelled dataset limits the model's generality.
* Intuition:
    * Ask the model how likely a given text goes with an image. For example, given the labels cat, dog or mouse, how likely does each go with the image? The model outputs a probability distribution over them.
    * The model can be made more robust by rephrasing each label as a sentence, and the sentences can take different forms, e.g. "a photo of a dog" or "a photo of a cat". Sentences are constructed from the dataset labels using prompts.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/NTIF9k0.png" style="width:100%;"/> </td> </tr> </table>
* The CLIP model is trained using a contrastive learning approach (see the sketch at the end of this subsection):
    * A batch of paired text-image data is formed.
    * An image encoder extracts features for the images and a text encoder extracts features for the texts; inner products measure the relation between every image-text pair in the batch, in both directions (image to text and text to image). The aim is a contrastive encoding in which the diagonal entries (matched pairs) are high, i.e. close to 1, and the off-diagonal entries (unrelated image-text pairs) are close to 0. Along one axis of the similarity matrix this is a classification of images over texts, and along the other a classification of texts over images: the same problem viewed from two directions.
* For inference:
    * Extract the image feature using the trained image encoder.
    * For the text, generate a prompt for each label in the label set and pass all the prompts through the trained text encoder to extract a feature for each.
    * To assign a class to the image, pick the prompt whose text feature has the highest inner product with the image feature.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/2T951JR.png" style="width:70%;"/> </td> </tr> </table>
* Due to the contrastive training strategy, the CLIP model generates features that are robust for both inter- and intra-class distinctions; the training captions capture such fine details (example captions: "this dog is a pug", "the aussie pup", etc.).
* For the text encoder the authors use a Transformer of roughly 63M parameters, and for the image encoder they use ResNet and ViT models.
* The authors compare zero-shot classification against linear probing. An important part of the proposed method is how the prompt is prepared, i.e. the sentence template into which the label is embedded.
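A minimal PyTorch sketch of CLIP-style symmetric contrastive training on a batch of matched image-text pairs; the fixed temperature here is an illustrative value (CLIP learns it as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats : (B, d) outputs of the two encoders for B matched pairs."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    logits = image_feats @ text_feats.t() / temperature  # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0))               # matched pair i <-> i on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)          # classify the right text for each image
    loss_t2i = F.cross_entropy(logits.t(), targets)      # classify the right image for each text
    return (loss_i2t + loss_t2i) / 2
```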
#### DALL-E [[Youtube](https://www.youtube.com/watch?v=j4xgkjWlfL4&ab_channel=YannicKilcher)(Yannic)]

* The proposed method uses a learned discrete VAE codebook; the DALL-E transformer then has to predict image tokens.
* The authors train a transformer autoregressively, where the text and image tokens are given as a single stream of data.
* To deal with high-resolution images, the authors propose two stages:
    * **Stage 1**: They train a discrete variational autoencoder (dVAE) with a codebook of size $K = 8192$. Images of size $256 \times 256$ are compressed to a $32 \times 32$ grid of image tokens.
    * **Stage 2**: They concatenate up to 256 BPE text tokens with the $32 \times 32 \rightarrow 1024$ image tokens and train an autoregressive transformer to model the joint distribution over text and image tokens.
* First the dVAE is learned for the image reconstruction task; after learning it, its weights are fixed and the transformer is trained autoregressively on the concatenated text and image tokens.

### Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [2021]

* This paper proposes a method for both the text-to-image generation task and the text-based image manipulation task.
* Method:
    * A GAN is trained unconditionally, aiming at images with high diversity and quality, i.e. a high Inception Score.
    * After training the GAN, an inversion model is learned that maps real images to the latent space, using a cycle-consistency loss to learn more robust and consistent inverted codes.
    * Finally, a text encoder is learned to explore the semantic space of the trained GAN.
* Previous methods are restricted by paired text-image data, which limits the generator's capability to produce diverse images for the text-to-image task. It is also hard to preserve non-text-related semantics while making small changes to an image, and a new model has to be trained for text-based image manipulation.
* The authors use StyleGAN for a disentangled representation of images: StyleGAN maps a noise vector $z$ to a disentangled latent space $\mathcal{W}$.
* Inspired by the idea of GAN inversion, the authors propose a method for text-based image synthesis.
* The authors train a GAN inversion encoder that maps a real image to the latent space. To make the inverted latent code consistent with the trained StyleGAN latent space $\mathcal{W}$, a cycle-consistency loss is used.
* For the text-to-image generation task, a similarity model is learned between the text and the inverted latent codes, so that an inverted code can be optimized to have the desired semantic attributes. For the text-based image editing task the same method is used, with an additional perceptual loss to keep the reconstruction close to the original image.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/FtEy5Es.png" style="width:100%;"/> </td> </tr> </table>
* The authors divide CI-GAN into three different stages:
    * **STAGE 1**: A StyleGAN model is trained unconditionally to generate high-quality and diverse images, i.e. the mapping $\mathcal{Z} \rightarrow \mathcal{W}$, where $\mathcal{W}$ is the more disentangled representation.
    * **STAGE 2**: A GAN inversion encoder is trained using a cycle-consistency loss.
    * **STAGE 3**: A text encoder is trained to align the text encoding with the inverted latent code $w'$.
* GAN inversion encoder:
    * GAN inversion studies how to map an image back to a latent code.
    * An encoder-based model $E(\cdot)$ is used, which takes an image as input and outputs $w'$, which should be close to a code $w \in \mathcal{W}$.
    * A pixel loss ($\mathcal{L}_{pix}$) and a perceptual loss ($\mathcal{L}_{vgg}$) are used to make $w'$ reconstruct the original image (a sketch follows below):
$$
x' = G(E(x)) \\
\mathcal{L}_{pix} = ||x-x'||_{2} \\
\mathcal{L}_{vgg} = ||F(x) - F(x')||_{2} \\
\text{Where, } F \text{ is a pre-trained VGG network used for feature extraction.}
$$
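A minimal PyTorch sketch of the pixel and perceptual reconstruction losses above; `E`, `G` and `vgg_features` are hypothetical stand-ins for the inversion encoder, the frozen StyleGAN generator and the VGG feature extractor $F$:

```python
import torch

def inversion_reconstruction_loss(E, G, vgg_features, x):
    """Returns the pixel and perceptual losses for reconstructing x through G(E(x))."""
    w_prime = E(x)
    x_rec = G(w_prime)

    l_pix = torch.norm(x - x_rec, p=2)                              # ||x - x'||_2
    l_vgg = torch.norm(vgg_features(x) - vgg_features(x_rec), p=2)  # ||F(x) - F(x')||_2
    return l_pix, l_vgg, w_prime
```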
* Cycle-consistency constraint:
    * With only the pixel and perceptual losses the model focuses on the image domain; the inverted latent code $w'$ must also be made similar to the original latent codes $w$. Cycle-consistency training is used for this.
    * An adversarial loss is applied both to the inverted code $w'$ and to the image generated from $w'$.
    * The adversarial loss on the image generated from the inverted latent code $w'$ forces the encoder to produce codes that yield realistic images and that are aligned with the generator's semantics. A separate discriminator is used here.
$$
\mathcal{L}_{adv}^{x} = \mathbb{E}_{x \sim P_{data}}[D(x')]
$$
    * The adversarial loss on the inverted latent code forces the model to align $w'$ with $w$, so that it follows the same distribution as $w$.
$$
\mathcal{L}_{adv}^{w'} = \mathbb{E}_{w \sim P_{w}}[D_{w}(w')]
$$
    * The complete objective of the GAN inversion encoder is:
$$
min_{\theta_{E}} \mathcal{L}_{E} = \mathcal{L}_{pix} + \lambda_{vgg}\mathcal{L}_{vgg} + \lambda_{w}\mathcal{L}_{w} - (\lambda_{adv}^{x} \mathcal{L}_{adv}^{x} + \lambda_{adv}^{w'} \mathcal{L}_{adv}^{w'} )
$$
<table> <tr> <td align="center"> <img src="https://i.imgur.com/V9CUoxM.png" style="width:70%;"/> </td> </tr> </table>
* Text encoder by latent space alignment:
    * The authors train a simple LSTM model to discover the relation between the explicit text semantic space and the implicit image encoding space $\mathcal{W}$.
    * The encoding dimension produced by the LSTM is the same as that of $w'$.
    * Aligning $t$ and $w'$ makes the text encoder better capture the useful semantics in the text description.
    * The authors use a modified version of the InfoNCE loss, based on the $l_{2}$ distance, to learn the latent relation:
$$
\mathcal{L}_{sim} = -\mathbb{E}_{T}\bigg[log\frac{e^{||t_{(i+k)}-w'_{i}||_{2}}}{\sum_{t_{j}\in T} e^{||t_{j}-w'_{i}||_{2}}}\bigg] \\
\text{Where,} \\
T \text{ represents all the text representations in the mini-batch,} \\
i \text{ is one sample index in the mini-batch,} \\
t_{(i+k)} \text{ represents samples unpaired with } w'_{i}. \\
\text{The aim is to minimize the } l_{2} \text{ distance between paired } t \text{ and } w' \text{ while maximizing} \\
\text{the distance between unpaired instances.}
$$
* Text-guided image manipulation:
    * $w'$ is pushed in the direction of the text code $t$ (a sketch of this optimization follows at the end of this section):
$$
w'_{opt} = argmin_{w'} ||t-w'||_{2} + ||F(x)-F(G(w'))||_{2}
$$
    * For the image manipulation task, the perceptual feature extractor $F$ in the second term keeps the edited image close to the original.
* The authors perform ablation studies to show the effectiveness of cycle-consistency training.
* The latent space of the StyleGAN used in the paper is not fully disentangled, which can cause inconsistency in the text-based image editing task.
<table> <tr> <td align="center"> <img src="https://i.imgur.com/OQiNML7.png" style="width:70%;"/> </td> </tr> </table>
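A minimal PyTorch sketch of the latent-code optimization for text-guided manipulation above; the step count, learning rate and the weight `lam` on the perceptual term are illustrative choices, not values from the paper:

```python
import torch

def optimize_latent_for_text(w_init, t, G, vgg_features, x_orig,
                             steps=100, lr=0.01, lam=1.0):
    """Pull the inverted code w' toward the text code t while a perceptual term
    keeps the generated image close to the original image x_orig."""
    w = w_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    feats_orig = vgg_features(x_orig).detach()

    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm(t - w, p=2) + lam * torch.norm(feats_orig - vgg_features(G(w)), p=2)
        loss.backward()
        opt.step()
    return w.detach()
```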