
Survey: Text-Based Image Synthesis

Note: these are summaries of the papers; some text and images are taken directly from the respective papers for clarity of explanation.

Metrics for performance measurement

Peak Signal to Noise Ratio (PSNR)

  • It measures the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the signal quality.

  • PSNR is commonly used to quantify the quality of image and video reconstruction.

  • It is measured on the dB (decibel) scale.

  • Assume we have a noise-free monochromatic image I of size m×n, along with an image K that is a noisy version of I. The PSNR between them is computed as follows:

    $$\mathrm{MSE}=\frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\big[I(i,j)-K(i,j)\big]^2$$

    $$\mathrm{PSNR}=10\log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)=20\log_{10}\!\left(\frac{\mathrm{MAX}_I}{\sqrt{\mathrm{MSE}}}\right)$$

  • The higher the PSNR value, the better the result.
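
A minimal NumPy sketch of the formulas above (not from any of the surveyed papers); it assumes 8-bit images, so MAX_I = 255:

```python
import numpy as np

def psnr(reference: np.ndarray, noisy: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR (in dB) between a clean reference image I and its noisy version K."""
    mse = np.mean((reference.astype(np.float64) - noisy.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# toy usage: an 8-bit grayscale image plus Gaussian noise
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
noisy = np.clip(clean + rng.normal(0, 5, size=clean.shape), 0, 255)
print(f"PSNR: {psnr(clean, noisy):.2f} dB")
```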

Structural Similarity for Image Quality (SSIM) [2003]

  • It measures how close two images are.

  • A patch window is slid over the images x and y, pixel by pixel, to collect the following statistics:

    $$
    \begin{aligned}
    &\mu_x:\ \text{luminance of the image } x \text{ patch (mean)}\\
    &\sigma_x^2:\ \text{contrast of the image } x \text{ patch (variance)}\\
    &\sigma_{xy}:\ \text{covariance between the patches of images } x \text{ and } y
    \end{aligned}
    $$

  • The luminance, contrast and structure comparisons are then given by:

    $$l(x,y)=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1},\qquad c(x,y)=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2},\qquad s(x,y)=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}$$

    where $C_1=(K_1L)^2$, $C_2=(K_2L)^2$ and $C_3=C_2/2$, with $K_1=0.01$ and $K_2=0.03$ (as per the paper).

  • Finally SSIM is calculated as:

    $$\mathrm{SSIM}(x,y)=l(x,y)^{\alpha}\cdot c(x,y)^{\beta}\cdot s(x,y)^{\gamma},\qquad \alpha=\beta=\gamma=1$$

  • SSIM satisfies the following properties:

    • Symmetry: SSIM(x,y) = SSIM(y,x)
    • Boundedness: SSIM(x,y) <= 1
    • Unique maximum: SSIM(x, y) = 1 if and only if x = y
  • The motivation is that two images whose content differs perceptually can still have a very low mean squared error between them; SSIM measures their quality difference better than MSE in such cases.
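
A small illustrative sketch of the SSIM components for a single pair of aligned patches (constants and exponents follow the values quoted above; in the full metric the window slides over the images and the per-window values are averaged):

```python
import numpy as np

def ssim_patch(x: np.ndarray, y: np.ndarray, L: float = 255.0,
               K1: float = 0.01, K2: float = 0.03) -> float:
    """SSIM between two aligned patches x and y (one window, alpha = beta = gamma = 1)."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    contrast = (2 * np.sqrt(var_x) * np.sqrt(var_y) + C2) / (var_x + var_y + C2)
    structure = (cov_xy + C3) / (np.sqrt(var_x) * np.sqrt(var_y) + C3)
    return float(luminance * contrast * structure)

# the full metric slides this window over both images pixel by pixel
# and averages the per-window values (mean SSIM).
```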

Inception Score (IS) [2016]

  • It measures the quality of the images generated by a GAN.
  • A pre-trained InceptionV2 model is used to extract the class probability distribution for the generated data.
  • The metric captures two things:
    • how good the generated images are (image quality).
    • how diverse the generated images are.
      • Diversity is captured via the marginal distribution p(y), obtained by averaging the predicted probability distributions over all generated images.

$$\mathrm{IS}=\exp\Big(\mathbb{E}_{x_i}\big[D_{KL}\big(p(y|x_i)\,\|\,p(y)\big)\big]\Big),\qquad p(y)=\int_z p\big(y\,|\,x=G(z)\big)\,dz$$
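
A small sketch of the score given a matrix of predicted class probabilities (the classifier itself, e.g. a pre-trained Inception model, is assumed to be available separately):

```python
import numpy as np

def inception_score(p_yx: np.ndarray, eps: float = 1e-12) -> float:
    """IS from an (N, num_classes) matrix of class probabilities p(y|x_i)
    predicted for N generated images."""
    p_y = p_yx.mean(axis=0, keepdims=True)                # marginal p(y) over the generated set
    kl = p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))  # per-image KL integrand
    return float(np.exp(kl.sum(axis=1).mean()))           # exp(E_x[KL(p(y|x) || p(y))])
```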

Frechet Inception Distance (FID) [2017]

  • The Fréchet distance is used to measure the distance between two multivariate normal distributions.

  • For univariate normal distributions the Fréchet distance is given as:

    $$d(X,Y)^2=(\mu_X-\mu_Y)^2+(\sigma_X-\sigma_Y)^2$$

    where X and Y are two normal distributions.

  • To calculate FID, the activation statistics of a pre-trained InceptionV3 model are used. Global average pooling is applied to the final activation layer to obtain a 2048-dimensional vector.

  • FID for multivariate distributions is calculated as:

    $$\mathrm{FID}(X,Y)=\|\mu_X-\mu_Y\|^2+\mathrm{Tr}\Big(\Sigma_X+\Sigma_Y-2\big(\Sigma_X\Sigma_Y\big)^{1/2}\Big)$$

    where X and Y are the real and fake (generated) embeddings, $\mu_X$ and $\mu_Y$ are the means of the embedding vectors, $\Sigma_X$ and $\Sigma_Y$ are the covariance matrices of the embeddings, and $\mathrm{Tr}$ is the trace of a matrix (the sum of its diagonal elements).

  • Link
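
A minimal NumPy/SciPy sketch of the FID formula above, assuming the 2048-dimensional Inception activations for the real and generated sets have already been extracted:

```python
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """FID between two sets of (N, 2048) Inception activations."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)   # matrix square root of the product
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean.real))
```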

Generative Adversarial Text to Image Synthesis (GAN-CLS) [2016] [First paper]

Conditional Image Synthesis with Auxiliary Classifier GANs (AC-GAN) [2016]

  • This paper proposes a new training strategy that adds a classification head (and loss) to the discriminator of the GAN framework.
  • It also shows that high-quality image generation is not just upsampling of low-resolution images, and how downsampling a high-resolution image affects classification accuracy.
    $$
    \begin{aligned}
    &L_I=\mathbb{E}_{x\sim p(x)}\big[\log D_\theta(x)\big]+\mathbb{E}_{z\sim\mathcal{N}(0,I)}\big[\log\big(1-D_\theta(G_\phi(z))\big)\big]\\
    &L_C=\mathbb{E}\big[\log P(y_{real}\,|\,x_{real})\big]+\mathbb{E}\big[\log P(y_{fake}\,|\,x_{fake})\big]\\
    &L_D=L_I+L_C\\
    &L_G=\mathbb{E}_{z\sim\mathcal{N}(0,I)}\big[\log D_\theta(G_\phi(z))\big]+\mathbb{E}\big[\log P(y_{fake}\,|\,x_{fake})\big]
    \end{aligned}
    $$


*Figure: AC-GAN*

Text Conditioned Auxiliary Classifier Generative Adversarial Network (TAC-GAN) [2017]

  • This paper builds on AC-GAN and proposes a new framework for text-based image generation, inspired by the work of Scott Reed et al.
  • The authors add a text-embedding network to AC-GAN and, similar to Reed et al., train the discriminator to be aware of whether the correct image is matched to the correct text (matching-aware discriminator).
*Figure: TAC-GAN*
  • Input to the discriminator:

    $$A_D=\{(I_f,C_f,l_f),\ (I_r,C_r,l_r),\ (I_w,C_w,l_r)\}$$

    where $I$ is an image, $C$ is a class label, $l$ is the text embedding obtained by passing $\psi$ through a fully connected layer, and the subscripts $r$, $f$, $w$ denote real, fake and wrong.

  • Input to the generator:

    $$A_G=\{(I_f,C_r,l_r)\}$$

    where $I$ is an image, $C$ is a class label, $l$ is the text embedding obtained by passing $\psi$ through a fully connected layer, and the subscripts $f$ and $r$ denote fake and real.

  • Training objective for discriminator:

    $$
    \begin{aligned}
    &L_{D_S}=H\big(D_S(I_r,l_r),1\big)+H\big(D_S(I_f,l_r),0\big)+H\big(D_S(I_w,l_r),0\big)\\
    &L_{D_C}=H\big(D_C(I_r,l_r),C_r\big)+H\big(D_C(I_f,l_r),C_r\big)+H\big(D_C(I_w,l_r),C_w\big)
    \end{aligned}
    $$

    Here $H$ denotes binary cross entropy.

  • Training objective for generator:

    $$L_{G_S}=H\big(D_S(I_f,l_r),1\big),\qquad L_{G_C}=H\big(D_C(I_f,l_r),C_r\big)$$

    Here $H$ denotes binary cross entropy.

  • The loss function combines the AC-GAN loss with the matching-aware discriminator loss (a small sketch follows).
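
A hedged sketch of how these objectives could be assembled in PyTorch; `d_s` and `d_c` are assumed stand-ins for the source (real/fake) and class heads of the discriminator, and the class terms are written with softmax cross-entropy for brevity rather than the per-class binary cross entropy H used in the paper:

```python
import torch
import torch.nn.functional as F

def tacgan_losses(d_s, d_c, I_r, I_f, I_w, l_r, C_r, C_w):
    """d_s(image, text) -> real/fake probability, d_c(image, text) -> class logits.
    I_r / I_f / I_w: real / fake / wrong images, l_r: real text embedding,
    C_r / C_w: class labels of the real and the wrong image."""
    ones = torch.ones(I_r.size(0), 1)
    zeros = torch.zeros(I_r.size(0), 1)

    # discriminator: source (real/fake) loss + class loss over the three input pairs
    L_DS = (F.binary_cross_entropy(d_s(I_r, l_r), ones)
            + F.binary_cross_entropy(d_s(I_f.detach(), l_r), zeros)
            + F.binary_cross_entropy(d_s(I_w, l_r), zeros))
    L_DC = (F.cross_entropy(d_c(I_r, l_r), C_r)
            + F.cross_entropy(d_c(I_f.detach(), l_r), C_r)
            + F.cross_entropy(d_c(I_w, l_r), C_w))

    # generator: fool the source head and match the real class
    L_GS = F.binary_cross_entropy(d_s(I_f, l_r), ones)
    L_GC = F.cross_entropy(d_c(I_f, l_r), C_r)
    return L_DS + L_DC, L_GS + L_GC
```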

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks [2017] (High quality image generation)

AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks [2017]

  • In this paper the authors propose an attentional generative network together with an attention-based network for measuring fine-grained text-image similarity, which is used to compute a loss; combining the two parts generates fine-grained images that correlate strongly with the text description.
  • The attentional generative network generates the image by paying attention to each word of the given text description.
  • The proposed method performs better for images containing complex information, e.g. when more than one object is present in the image and its text description (COCO dataset).
*Figure: AttnGAN*
  • Attention GAN working:

    • The intuition behind the attentional GAN is to let the generator focus on each word when generating the part of the image most relevant to it.
    • The network extracts global as well as word-level embeddings from the given text description.
    • The first sub-network of the generative network uses the global sentence embedding for low-resolution image generation. The later sub-networks use the image feature matrix ($\mathbb{R}^{\hat{D}\times N}$) and the word feature matrix ($\mathbb{R}^{D\times T}$) to produce a fine-grained encoding, which the generator uses to synthesize a high-quality image that is semantically aligned with the given text.
    • To generate the fine-grained encoding, the following attention mechanism is used:

      $$
      \begin{aligned}
      &\text{Word feature vectors } (\mathbb{R}^{D\times T}):\ e_0,e_1,\dots,e_{T-1}\\
      &\text{Hidden image feature vectors } (\mathbb{R}^{\hat{D}\times N}) \text{ generated by the network } F:\ h_0,h_1,\dots,h_{N-1}\\
      &\text{Word features are mapped into the common semantic space of the image by a perceptron layer } U\in\mathbb{R}^{\hat{D}\times D}:\quad e'=Ue\in\mathbb{R}^{\hat{D}\times T}\\
      &\text{Semantic scores between regions and words: }\quad s=h^{T}e'\in\mathbb{R}^{N\times T}\\
      &\text{Normalized scores: }\quad \beta_{j,i}=\frac{\exp(s_{j,i})}{\sum_{k=0}^{T-1}\exp(s_{j,k})}\\
      &\text{Context vector per region: }\quad c_j=\sum_{i=0}^{T-1}\beta_{j,i}\,e'_i\\
      &\text{Output of the attention network: }\quad F^{attn}(e,h)=(c_0,c_1,\dots,c_{N-1})\in\mathbb{R}^{\hat{D}\times N}
      \end{aligned}
      $$
  • The generated attention context vectors are then passed to the generator for synthesizing the high-resolution image (a small sketch of the word attention follows).
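
A minimal PyTorch sketch of the word-attention computation described above (single sample, no batch dimension; shapes are illustrative):

```python
import torch

def word_attention(e: torch.Tensor, h: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """e: (D, T) word features, h: (D_hat, N) hidden image features,
    U: (D_hat, D) perceptron mapping words into the image feature space.
    Returns the context matrix (D_hat, N), one context vector per sub-region."""
    e_prime = U @ e                    # (D_hat, T): words in the image feature space
    s = h.t() @ e_prime                # (N, T): score of region j against word i
    beta = torch.softmax(s, dim=1)     # normalize over words for each region
    return e_prime @ beta.t()          # (D_hat, N): weighted word context per region

# toy shapes: D=256 word dim, T=12 words, D_hat=48 channels, N=64 sub-regions
e, h, U = torch.randn(256, 12), torch.randn(48, 64), torch.randn(48, 256)
context = word_attention(e, h, U)      # -> (48, 64)
```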

  • Three different generators are used in the complete network: $G_0$ generates a $64\times64$ image, $G_1$ a $128\times128$ image, and $G_2$ a $256\times256$ image.

  • Generator loss:

    $$L_{G_i}=\underbrace{-\tfrac{1}{2}\mathbb{E}_{\hat{x}_i\sim P_{G_i}}\big[\log D_i(\hat{x}_i)\big]}_{\text{unconditional loss}}\ \underbrace{-\tfrac{1}{2}\mathbb{E}_{\hat{x}_i\sim P_{G_i}}\big[\log D_i(\hat{x}_i,\bar{e})\big]}_{\text{conditional loss}}$$

  • Discriminator loss:

    $$L_{D_i}=-\tfrac{1}{2}\mathbb{E}_{x_i\sim P_{data_i}}\big[\log D_i(x_i)\big]-\tfrac{1}{2}\mathbb{E}_{\hat{x}_i\sim P_{G_i}}\big[\log\big(1-D_i(\hat{x}_i)\big)\big]-\tfrac{1}{2}\mathbb{E}_{x_i\sim P_{data_i}}\big[\log D_i(x_i,\bar{e})\big]-\tfrac{1}{2}\mathbb{E}_{\hat{x}_i\sim P_{G_i}}\big[\log\big(1-D_i(\hat{x}_i,\bar{e})\big)\big]$$

  • Total loss:

    $$L_G=\sum_{i=0}^{m-1}L_{G_i}$$

  • Deep Attentional Multimodal Similarity Model (DAMSM)

    • Authors proposed this network to provide consistency between text and image.

    • The network is pre-trained on image-text pairs, and the pre-trained network is later used for computing the loss ($L_{DAMSM}$).

    • The intuition behind this network is to map image-text pairs into the same embedding space.

    • For encoding the text, a bi-directional LSTM network is used, where $e\in\mathbb{R}^{D\times T}$ represents the local word-level embeddings and $\bar{e}\in\mathbb{R}^{D}$ represents the global encoding of the text. $D$ is the text embedding size and $T$ is the number of words in the given sentence. The text encoder network is trained from scratch.

    • To encode the images, the authors use a pre-trained InceptionV3 network. Images are resized to $299\times299$ before being given as input to the network.

    • Local image features are extracted from the "mixed_6e" layer of the InceptionV3 network. The extracted local feature is represented as $f\in\mathbb{R}^{768\times17\times17}$, i.e. there are $17\times17$ sub-regions in the image, each with 768 features. The local feature is reshaped into a matrix $f\in\mathbb{R}^{768\times289}$. The global feature is extracted from the last layer after global average pooling and is represented as $\bar{f}\in\mathbb{R}^{2048}$.

    • Both the local and the global image features are transformed into the common semantic space of the text by passing them through a perceptron layer trained from scratch; the weights of the InceptionV3 network are kept frozen.

      $$
      \begin{aligned}
      &\text{Local image feature transformation: } v=Wf\\
      &\text{Global image feature transformation: } \bar{v}=\bar{W}\bar{f}
      \end{aligned}
      $$

      where $W\in\mathbb{R}^{D\times768}$ (fully connected layer), $v\in\mathbb{R}^{D\times289}$ (transformed local image feature), $\bar{W}\in\mathbb{R}^{D\times2048}$ (fully connected layer) and $\bar{v}\in\mathbb{R}^{D}$ (transformed global image feature).

    • A similarity matrix is calculated between the image sub-regions and the words:

      $$s=e^{T}v$$

      (local word feature matrix with local image feature matrix), where $e^{T}\in\mathbb{R}^{T\times D}$, $v\in\mathbb{R}^{D\times289}$ and $s\in\mathbb{R}^{T\times289}$; $s(i,j)$ is the dot product of the $i$-th word feature and the $j$-th image region, giving a similarity score of how related they are.

    • The calculated similarity score is then normalized:

      $$\bar{s}_{i,j}=\frac{\exp(s_{i,j})}{\sum_{k=0}^{T-1}\exp(s_{k,j})}$$

    • Region context vector: for the $i$-th word it gives a dynamic representation of the image sub-regions related to that word.

      $$\alpha_j=\frac{\exp(\gamma_1\bar{s}_{i,j})}{\sum_{k=0}^{288}\exp(\gamma_1\bar{s}_{i,k})},\qquad c_i=\sum_{j=0}^{288}\alpha_j v_j$$

      where $\gamma_1$ is a factor that determines how much attention is paid to the relevant sub-regions.

  • Relevance between the region-context vector and the $i$-th word:

    $$R(c_i,e_i)=\frac{c_i^{T}e_i}{\|c_i\|\,\|e_i\|}$$

  • Attention-driven image-text matching score:

    $$R(Q,D)=\log\Big(\sum_{i=1}^{T-1}\exp\big(\gamma_2 R(c_i,e_i)\big)\Big)^{\frac{1}{\gamma_2}}$$

    where $\gamma_2$ is a factor that determines how much the importance of the most relevant word-to-region-context pair is magnified.

  • DAMSM Loss:

    $$
    \begin{aligned}
    &\text{Given a batch of image-text pairs } \{(Q_i,D_i)\}_{i=1}^{M}:\\
    &P(D_i|Q_i)=\frac{\exp\big(\gamma_3 R(Q_i,D_i)\big)}{\sum_{j=1}^{M}\exp\big(\gamma_3 R(Q_i,D_j)\big)}\quad\text{(all texts except the matching one are treated as negatives)}\\
    &P(Q_i|D_i)=\frac{\exp\big(\gamma_3 R(Q_i,D_i)\big)}{\sum_{j=1}^{M}\exp\big(\gamma_3 R(Q_j,D_i)\big)}\quad\text{(all images except the matching one are treated as negatives)}\\
    &L_1^{w}=-\sum_{i=1}^{M}\log P(D_i|Q_i),\qquad L_2^{w}=-\sum_{i=1}^{M}\log P(Q_i|D_i)\\
    &L_{DAMSM}=L_1^{w}+L_2^{w}+L_1^{s}+L_2^{s}
    \end{aligned}
    $$

    where $\gamma_3$ is a smoothing factor, and $L_1^{s}$, $L_2^{s}$ are the same losses computed for the global image and sentence embeddings with $R(Q,D)=\frac{\bar{v}^{T}\bar{e}}{\|\bar{v}\|\,\|\bar{e}\|}$.
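
A small sketch of the DAMSM matching loss, assuming the $M\times M$ matrix of scores $R(Q_i,D_j)$ has already been computed; cross-entropy with the diagonal as target reproduces the two negative log-likelihood sums:

```python
import torch
import torch.nn.functional as F

def damsm_loss(R: torch.Tensor, gamma3: float = 10.0) -> torch.Tensor:
    """R: (M, M) matrix with R[i, j] = R(Q_i, D_j) for images Q_i and texts D_j.
    Matched pairs sit on the diagonal; every other row/column entry is a negative."""
    logits = gamma3 * R
    labels = torch.arange(R.size(0))
    l1 = F.cross_entropy(logits, labels, reduction="sum")      # -sum_i log P(D_i | Q_i)
    l2 = F.cross_entropy(logits.t(), labels, reduction="sum")  # -sum_i log P(Q_i | D_i)
    return l1 + l2  # word-level term; the sentence-level term is computed the same way
```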

  • Hyperparameters as per the paper: $\gamma_1=5$, $\gamma_2=5$, $\gamma_3=10$ and batch size $M=50$.

  • Finally, the complete network is trained using the total loss:

    $$L_{total}=L_G+\lambda L_{DAMSM}$$

  • The authors run experiments to verify the improvement due to the different components of the network, by removing them one by one or by adding components to the initial network one by one.

Semi-supervised FusedGAN for Conditional Image Generation [2018]

  • This paper proposes a method for generating images with controlled information; the generated images exhibit both fidelity and diversity.

  • Controlled information includes posture, style, background, and fine-grained details.

  • Another aim of the paper is to show that the proposed method is able to learn disentangled representations.

  • As per the paper, disentanglement is achieved by cascading different generator models.

  • One generator pair is responsible for unsupervised image generation and is trained as a standard GAN; the other pair is trained to generate conditional images.

*Figure: FusedGAN*
  • The generation stage is divided into two parts. The first part is shared by the next-stage generators: it generates an overall structure of the shape under consideration and acts like a sketch that captures the posture. The generated structure is then passed to two different generators in the next stage: one is an unconditional generator and the other is a conditional generator conditioned on the text (see the toy sketch after the equations below).

    $$
    \begin{aligned}
    &z\sim\mathcal{N}(0,I)\\
    &M_s=G(z)\\
    &I_{unconditioned\_fake}=G_u(M_s)\\
    &\psi_t=E(y)\\
    &I_{conditioned\_fake}=G_c(M_s,\psi_t)
    \end{aligned}
    $$

    Here $z$ is the sampled noise vector, $y$ is the embedding of the text, and $\psi_t$ is the representation generated for the text after passing the text embedding through the CA network.
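
A toy sketch of the cascaded-generator idea (fully connected stand-ins for the real convolutional generators; all layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class FusedGANSketch(nn.Module):
    """Shared first-stage generator feeding an unconditional and a conditional branch."""
    def __init__(self, z_dim=100, text_dim=128, feat=64):
        super().__init__()
        self.first_stage = nn.Sequential(nn.Linear(z_dim, feat * 8 * 8), nn.ReLU())  # G: z -> structure M_s
        self.uncond_head = nn.Linear(feat * 8 * 8, 3 * 8 * 8)                        # G_u(M_s)
        self.cond_head = nn.Linear(feat * 8 * 8 + text_dim, 3 * 8 * 8)               # G_c(M_s, psi_t)

    def forward(self, z, psi_t):
        m_s = self.first_stage(z)                                  # shared structure / "sketch"
        img_uncond = self.uncond_head(m_s)
        img_cond = self.cond_head(torch.cat([m_s, psi_t], dim=1))  # condition only the second branch
        return img_uncond.view(-1, 3, 8, 8), img_cond.view(-1, 3, 8, 8)

z, psi = torch.randn(4, 100), torch.randn(4, 128)
img_u, img_c = FusedGANSketch()(z, psi)
```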

  • The text embedding is also passed to the discriminator of the conditional branch, but the discriminator is not trained as a matching-aware discriminator.

  • The complete network is trained with GAN losses in an alternating manner: the parameters are first updated for the first pair and then for the second pair.

  • The authors propose several experiments to give empirical evidence of how disentanglement arises from cascading the generators in text-to-image synthesis.

    • Fixed posture with varying styles.
      • This experiment shows that after generating a structure from the first generator, we can vary the text caption in order to generate images of different styles while keeping the posture constant.
    • Fixed posture with varying details.
      • This experiment shows how conditional augmentation (CA) of the text brings disentanglement control. The authors keep the same structure, draw different CA samples for the same text, and show that this creates different textures for birds of the same species.
    • Interpolation with same posture but varying styles.
      • This is a simple interpolation in the text latent space while keeping two different bird structures constant.
  • The Inception score is used for quantitative analysis.

Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network (HDGAN) [2018]

  • The authors propose a novel framework for generating high-quality images from text with both fidelity and diversity.
  • Along with the novel framework they also propose an evaluation metric known as the Visual-Semantic Similarity score, which helps evaluate how well a generated image follows the given text description.
*Figure: HDGAN architecture and probabilistic map outputs at different levels*
  • The important part, as per the paper, is the generation of the $64\times64$ image, which acts as the base structure for the further levels of the framework.
  • To train the discriminators the authors use a matching-aware training strategy similar to StackGAN.
  • The HDGAN framework consists of generators at $K$ levels, where a level corresponds to an image resolution of $2^K$; as per the paper $K=9$, i.e. images up to $512\times512$ are generated (with side outputs at 64, 128, 256 and 512). The output of each level is passed to a discriminator for that level; in this way the discriminators at the different levels push the generator to encode fine-grained information into the image. The discriminators act as regularizers here.
  • Using the discriminators, the GAN framework maintains global consistency; to also ensure local patch-based consistency, the authors branch the discriminator output into two parts. The first part produces a real/fake score, and the second part produces a probabilistic map of each pixel being real or fake, similar to CycleGAN and Pix2Pix. The loss for this map is computed in the same way as the real/fake loss: the output probability map is compared with an all-ones map for real images and an all-zeros map for fake images.
  • Similar to StackGAN, the authors use Conditional Augmentation (CA) to ensure consistency in the text embedding.
  • Visual-Semantic Similarity score:
    • To use this metric, a model for encoding images and encoding text is first trained using the following loss function:

      $$
      L_{total}=\sum_{v,\hat{t}_v}\max\big(0,\ c(f_v(v),f_t(\hat{t}_v))-c(f_v(v),f_t(t_v))+\delta\big)+\sum_{t,\hat{v}_t}\max\big(0,\ c(f_t(t),f_v(\hat{v}_t))-c(f_t(t),f_v(v_t))+\delta\big)
      $$

      where $v$ denotes the image feature vector extracted using a pre-trained Inception model, $f_v$ and $f_t$ are mapping functions that map the image-text pair into a common space $\mathbb{R}^{512}$, $\delta$ is a margin set to 0.2, $\{v,t_v\}$ is a ground-truth image-text pair, $\{v,\hat{t}_v\}$ is a mismatched image-text pair and $\{\hat{v}_t,t\}$ is a mismatched text-image pair.
    • This visual-semantic similarity model is trained with a triplet (ranking) loss.
    • After learning the parameters, the following function is used to measure similarity (a small code sketch follows this list):

      $$c(x,y)=\frac{x\cdot y}{\|x\|_2\,\|y\|_2}$$

      where $x$ and $y$ are the image and text embeddings obtained from $f_v$ and $f_t$.
  • The authors perform multiple experiments to support their hypothesis:
    • Hierarchically-nested adversarial training:
      • This experiment shows the importance of having a discriminator at each level: removing the lower-level discriminators degrades the generated results.
    • The local image loss:
      • Similar experiments have been performed to show the effectiveness of local image loss for pixel consistency in images.

Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis (PPAN) [AAAI 2019]

  • This paper presents a novel framework consisting of one generator and multi-level discriminators to synthesize realistic images from text descriptions. Each discriminator level forces the generator to produce finer-grained images.
  • The authors use a perceptual loss instead of a pixel-based loss for generating realistic images. The multi-purpose discriminators encourage semantic consistency, image fidelity and class invariance.
*Figure: PPAN*
  • The PPAN framework includes a conditional augmentation (CA) network inspired by StackGAN, to prevent discontinuities in the high-dimensional text embedding. A text embedding $\phi_t$ is passed through the CA network to sample a 128-dimensional embedding vector ($\mathbb{R}^{128}$). The sampled embedding is concatenated with the noise vector $z$ and passed to the generator network, which upsamples the input to produce $64\times64$, $128\times128$ and $256\times256$ images at different stages. This single generator is connected to three different discriminators $D_1$, $D_2$ and $D_3$.

  • The authors use four different types of loss functions at different stages of the network:

    • $L_1$ is the matching-aware loss inspired by GAN-INT-CLS (Reed et al.), which encourages the discriminator to learn contextual information. It takes three pairs as input, i.e. $(I_r,\phi_{t_r})$, $(I_f,\phi_{t_r})$ and $(I_r,\phi_{t_f})$.
    • $L_2$ is the local image loss, which encourages the discriminator to penalize the generator for not generating diverse and locally consistent images, similar to HDGAN.
    • $L_3$ is the class-information loss, inspired by TAC-GAN. Its purpose is to encourage the discriminator to distinguish between the different image classes and to penalize the generator for not generating images of the correct class.
    • $L_4$ is the perceptual loss, used to encourage the generator to produce spatially correlated images rather than random pixels. It is inspired by the work of Johnson et al. on style transfer (a minimal sketch follows this list).
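
A hedged sketch of a Johnson-style perceptual loss as described for $L_4$, using torchvision's VGG16 as an assumed feature extractor (the specific layer choice is an assumption, not taken from the paper):

```python
import torch
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(torch.nn.Module):
    """L2 distance between feature maps of a fixed, pre-trained network
    for the generated and the real image (inputs: ImageNet-normalized RGB)."""
    def __init__(self, layer_idx: int = 16):  # up to relu3_3 of VGG16 (an assumed choice)
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:layer_idx]
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.features = vgg.eval()

    def forward(self, generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.features(generated), self.features(target))
```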
  • Discriminators $D_1$ and $D_2$ use the $L_1$ and $L_2$ losses; as per the paper, learning at the lower levels helps in generating semantically meaningful images. $D_3$ uses the $L_3$ loss along with $L_1$ and $L_2$. The $L_4$ loss is applied to the $256\times256$ images, because high-resolution images contain more discriminative features, which forces the generator to generate semantically coherent pixels.

  • A $1\times1$ convolution is used to merge the textual information into the image features in the discriminator.

  • The authors use the following evaluation metrics:

    • Inception score (IS) [higher is better]
    • Visual-Semantic similarity (VS) [from HDGAN; higher is better]
  • The authors verify the importance of each component by first removing all three discriminators and then adding them back one by one to check the improvement in IS and VS scores.

MirrorGAN: Learning Text-to-Image Generation by Redescription [2019]

  • In this paper the authors propose a mirrored text-to-image generation strategy. An image is generated from the text description and a noise vector; the generated image is then passed through an image captioning network to recover a caption, which is supposed to be similar to the text description used to generate the image in the first place.
  • Authors have proposed three modules for transforming text to an image:

    • STEM (semantic text embedding module): generates the local (word-level) and global (sentence-level) embeddings for the text.
    • GLAM (global-local collaborative attentive module): similar to the AttnGAN attention module, but the authors also add global sentence-level attention alongside the local attention between words and image features.
    • STREAM (semantic text regeneration and alignment module): an image captioning module that takes the generated image as input and outputs a caption, which is later used to estimate the inconsistency between the input caption and the regenerated caption.
  • The reason the authors add global attention along with word-level attention is that word-level attention alone does not ensure global semantic consistency, due to the diversity of the text and image modalities; e.g. an image can have 5 or more captions in the COCO dataset, and the captions convey the same features of the image in different ways.

  • To train the model end to end, the following losses are used:

    • Visual realism adversarial loss: whether the generated image looks real or fake.
    • Text-image paired semantic consistency loss: whether the generated image is consistent with the text.
    • Text-semantic reconstruction loss: a cross-entropy loss that makes the text-to-image and image-to-text tasks consistent and guides the generator towards semantically consistent images.
*Figure: MirrorGAN architecture*
  • Text embedding module (STEM):

    $$w,\,s=\mathrm{RNN}(T)$$

    where $w\in\mathbb{R}^{D\times L}$ is the embedding of each word, $s\in\mathbb{R}^{D}$ is the global sentence embedding, $T$ is the text description, $D$ is the word embedding size and $L$ is the number of words in the text description.

    • Due to the diversity of text, two slightly different descriptions can have a similar meaning. To make the embedding consistent, the authors use Conditional Augmentation (CA), inspired by StackGAN:

      $$s_{CA}=F_{CA}(s),\qquad s_{CA}\in\mathbb{R}^{D'}$$

      where $D'$ is the dimension of the conditional augmentation embedding.
  • Local and Global attention mechanism (GLAM):

    • An incremental (multi-stage) generator network is used.
    • Similar to AttnGAN:

      $$
      \begin{aligned}
      &f_0=F_0(z,s_{CA})\\
      &f_i=F_i\big(f_{i-1},\,F_{att_i}(f_{i-1},w,s_{CA})\big),\quad i\in\{1,2,\dots,m-1\}\\
      &F_{att_i}(f_{i-1},w,s_{CA})=\mathrm{concat}\big(F^{w}_{att_i},\,F^{s}_{att_i}\big)\\
      &I_i=G_i(f_i),\quad i\in\{0,1,\dots,m-1\}
      \end{aligned}
      $$

      where $\{F_0,F_1,\dots,F_{m-1}\}$ are $m$ visual feature transformation networks, $\{G_0,G_1,\dots,G_{m-1}\}$ are $m$ generator networks, $z\sim\mathcal{N}(0,I)$ is random noise, $F_{att_i}$ is the attention module consisting of both local ($F^{w}_{att_i}$) and global ($F^{s}_{att_i}$) attention, $f_i\in\mathbb{R}^{M_i\times N_i}$, $I_i\in\mathbb{R}^{q_i\times q_i}$, and the generated image sizes are $q_i\in\{64,128,256\}$.
    • Attention mechanism:
      • First, the word-level attentive context feature $F^{w}_{att_i}(w,f_{i-1})$ is generated:

        $$
        \begin{aligned}
        &\text{The word embedding } w \text{ is mapped into the common semantic space of the visual features (perceptron layer): } w'=U_{i-1}\,w\\
        &\text{The common semantic feature is then combined with the visual feature to get the attentive feature: } Att^{w}_{i-1}=\sum_{l=0}^{L-1}(w')_l\cdot\mathrm{softmax}\big(f_{i-1}^{T}(w')_l\big)\\
        &\text{where } U_{i-1}\in\mathbb{R}^{M_{i-1}\times D},\ w'\in\mathbb{R}^{M_{i-1}\times L},\ Att^{w}_{i-1}\in\mathbb{R}^{M_{i-1}\times N_{i-1}}
        \end{aligned}
        $$
      • After the word-level attention, sentence-level attention is applied, $F^{s}_{att_i}(s_{CA},f_{i-1})$: analogously to the word-level case, $s_{CA}$ is mapped into the common semantic space of the visual features through a perceptron layer $V_{i-1}$, $s'=V_{i-1}\,s_{CA}$, and the attentive sentence-context feature $Att^{s}_{i-1}\in\mathbb{R}^{M_{i-1}\times N_{i-1}}$ is obtained by weighting $s'$ with $\mathrm{softmax}\big(f_{i-1}^{T}s'\big)$.
  • Image to text (STREAM):

    • The authors use a pre-trained CNN model to extract image features, so that the image can be given as input to the LSTM network.
    • The LSTM network is trained from scratch; the authors train this captioning network separately and then freeze its weights when using it for loss calculation.

      $$
      \begin{aligned}
      &x_{-1}=\mathrm{CNN}(I_{m-1})\quad\text{(visual feature)}\\
      &x_t=W_e T_t,\quad t\in\{0,1,\dots,L-1\}\\
      &p_{t+1}=\mathrm{RNN}(x_t),\quad t\in\{0,1,\dots,L-1\}
      \end{aligned}
      $$

      where $x_{-1}\in\mathbb{R}^{M_{m-1}}$ is the visual feature used as input at the beginning, $W_e\in\mathbb{R}^{M_{m-1}\times D}$ is a word embedding matrix that maps word features to the visual feature space, and $p_{t+1}$ is the predicted probability distribution over words.
  • Loss functions:

    • Visual and Text semantic loss:
      $$L_{G_i}=-\tfrac{1}{2}\mathbb{E}_{I_i\sim p_{I_i}}\big[\log D_i(I_i)\big]-\tfrac{1}{2}\mathbb{E}_{I_i\sim p_{I_i}}\big[\log D_i(I_i,s)\big]$$
    • Image to caption loss:
      $$L_{stream}=-\sum_{t=0}^{L-1}\log p_t(T_t)$$
    • Total loss:
      $$L_G=\sum_{i=0}^{m-1}L_{G_i}+\lambda L_{stream}$$
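
A minimal sketch of the STREAM cross-entropy term and the combined generator objective; the value of $\lambda$ and the logits-based interface are assumptions:

```python
import torch
import torch.nn.functional as F

def stream_loss(word_logits: torch.Tensor, caption_tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the caption predicted from the generated image and the input caption.
    word_logits: (L, vocab_size) unnormalized scores per step, caption_tokens: (L,) word ids."""
    return F.cross_entropy(word_logits, caption_tokens, reduction="sum")

def total_generator_loss(per_stage_losses, word_logits, caption_tokens, lam: float = 20.0):
    # L_G = sum_i L_Gi + lambda * L_stream  (the lambda value here is an assumption)
    return sum(per_stage_losses) + lam * stream_loss(word_logits, caption_tokens)
```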
  • Ablation studies:

    • MirrorGAN component
    • MirrorGAN cascade generator

DM-GAN: Dynamic Memory Generative Adverserial Networks for Text-to-Image Synthesis [2019]

  • In this paper the authors propose a dynamic-memory-based method for generating fine-grained images that follow the textual description.
  • The motivation is that all previous methods depend on the initially generated image for further text-based alignment and refinement, which causes problems if the initial image is of low quality, i.e. contains very little information or noisy information unrelated to the text.
  • The Dynamic Memory method consist of two parts:
    • Gated Memory Writing: its purpose is to focus on the relevant words to refine the initial image. For example, if the initial image contains a bird with black stripes but the text says white, it is the responsibility of the gated memory writing network to identify this and write it to memory.
    • Gated Response: it fuses the relevant information from the text into the image by using the memory.
*Figure: DM-GAN architecture*
  • Proposed approach is:

    • First, the text is passed through a pre-trained bi-directional LSTM model, where $s$ represents the global embedding of the text and $W$ represents the word-level embeddings.

    • $s$ first passes through the conditional augmentation (CA) network, which gives a new embedding that is concatenated with the noise vector $z$ and passed through the initial image generator ($G_0$), which generates an image of size $64\times64$ ($x_0$).

    • The feature map from the initial network ($R_0=G_0(z,s)$) is passed to the dynamic-memory-based image refinement network $G_i$.

    • Here $x_i=G_i(R_{i-1},W)$ utilises the word-level embeddings to refine the initially generated image using the memory writing gate and the response gate.

    • Notations:

      $$W=\{w_1,w_2,\dots,w_T\},\ w_i\in\mathbb{R}^{N_w};\qquad R=\{r_1,r_2,\dots,r_N\},\ r_i\in\mathbb{R}^{N_r}$$

      where $T$ is the number of words, $N$ is the number of pixels in the image, $N_w$ is the dimension of a word feature and $N_r$ is the dimension of a pixel feature.

    • The dynamic memory consists of four different steps (intuitions):

      • Memory writing (for a single word):

        $$m_i=M(w_i),\qquad m_i\in\mathbb{R}^{N_m}$$

        where $M(\cdot)$ is a network consisting of a $1\times1$ convolution that encodes word features into the memory feature space, and $N_m$ is the size of the memory feature space.

      • Key addressing: it calculates a similarity score between the memory feature $m_i$ and the image feature $r_j$:

        $$\alpha_{i,j}=\frac{\exp\big(\phi_K(m_i)^{T}r_j\big)}{\sum_{l=1}^{T}\exp\big(\phi_K(m_l)^{T}r_j\big)}$$

        where $\phi_K(\cdot)$ is a network consisting of a $1\times1$ convolution that maps memory features to the image feature domain (i.e. $\mathbb{R}^{N_m}$ to $\mathbb{R}^{N_r}$), and $\alpha_{i,j}$ shows how much the $i$-th word is related to the $j$-th image feature.

      • Value reading: it aggregates the memory according to how important each word is for the $j$-th image feature:

        $$o_j=\sum_{i=1}^{T}\alpha_{i,j}\,\phi_V(m_i)$$

        where $\phi_V(\cdot)$ is a network consisting of a $1\times1$ convolution that maps memory features to the image feature domain (i.e. $\mathbb{R}^{N_m}$ to $\mathbb{R}^{N_r}$).

      • Response: its purpose is to use the read memory feature to refine the image feature:

        $$r_i^{new}=[o_i,r_i]$$

        where $r_i^{new}$ is the new image feature after refinement and $[\cdot,\cdot]$ denotes concatenation.

    • Additional settings: $N_w=256$, $N_r=64$, $N_m=128$.

    • Gated Memory Writing:

      • For intuition we assumed a single word feature for image refinement above; the proposed network instead uses all the words to identify the relevant ones for refining the image.

        $$
        \begin{aligned}
        &\text{Importance of each word } w_i \text{ for image } R:\quad g_i^{w}(R,w_i)=\sigma\Big(A\,w_i+B\,\tfrac{1}{N}\textstyle\sum_{j=1}^{N}r_j\Big)\\
        &\text{Memory slot combining word and image features:}\quad m_i=M_w(w_i)\,g_i^{w}+M_r\Big(\tfrac{1}{N}\textstyle\sum_{j=1}^{N}r_j\Big)\,\big(1-g_i^{w}\big)
        \end{aligned}
        $$

        where $M_w(\cdot)$ and $M_r(\cdot)$ embed the word and image features into the memory feature space $\mathbb{R}^{N_m}$, $A\in\mathbb{R}^{1\times N_w}$, $B\in\mathbb{R}^{1\times N_r}$, $m_i\in\mathbb{R}^{N_m}$, and $g_i^{w}$ is the memory writing gate.
    • Gated Response:

      • For intuition we assumed a simple concatenation for the new image feature above; the proposed gate instead fuses the important word memory into the image feature (a small code sketch of both gates follows this list).

        $$g_i^{r}=\sigma\big(W[o_i,r_i]+b\big),\qquad r_i^{new}=o_i\,g_i^{r}+r_i\,\big(1-g_i^{r}\big)$$

        where $W$ and $b$ are a parameter matrix and a bias term, and $g_i^{r}$ is the memory response gate.
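
A hedged PyTorch sketch of the two gates; the paper uses $1\times1$ convolutions, which are replaced here by linear layers over flattened word/pixel features, and the gate dimensionalities are assumptions:

```python
import torch
import torch.nn as nn

class GatedMemoryWriting(nn.Module):
    """Decide, per word, how much of the word vs. the (averaged) image feature
    to write into each memory slot."""
    def __init__(self, n_w=256, n_r=64, n_m=128):
        super().__init__()
        self.A = nn.Linear(n_w, 1, bias=False)   # acts on word features
        self.B = nn.Linear(n_r, 1, bias=False)   # acts on the mean image feature
        self.M_w = nn.Linear(n_w, n_m)           # word  -> memory space
        self.M_r = nn.Linear(n_r, n_m)           # image -> memory space

    def forward(self, words, pixels):            # words: (T, n_w), pixels: (N, n_r)
        r_mean = pixels.mean(dim=0, keepdim=True)                        # (1, n_r)
        gate = torch.sigmoid(self.A(words) + self.B(r_mean))             # (T, 1) writing gate g_i^w
        return self.M_w(words) * gate + self.M_r(r_mean) * (1 - gate)    # (T, n_m)

class GatedResponse(nn.Module):
    """Fuse the read memory o_i with the image feature r_i."""
    def __init__(self, n_r=64):
        super().__init__()
        self.W = nn.Linear(2 * n_r, n_r)

    def forward(self, o, r):                     # o, r: (N, n_r)
        gate = torch.sigmoid(self.W(torch.cat([o, r], dim=-1)))          # response gate g_i^r
        return o * gate + r * (1 - gate)
```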
  • Generator loss:

    $$L_{G_i}=\underbrace{-\tfrac{1}{2}\mathbb{E}_{\hat{x}_i\sim P_{G_i}}\big[\log D_i(\hat{x}_i)\big]}_{\text{unconditional loss}}\ \underbrace{-\tfrac{1}{2}\mathbb{E}_{\hat{x}_i\sim P_{G_i}}\big[\log D_i(\hat{x}_i,s)\big]}_{\text{conditional loss}}$$

  • Discriminator loss:

    $$L_{D_i}=-\tfrac{1}{2}\mathbb{E}_{x_i\sim P_{data_i}}\big[\log D_i(x_i)\big]-\tfrac{1}{2}\mathbb{E}_{\hat{x}_i\sim P_{G_i}}\big[\log\big(1-D_i(\hat{x}_i)\big)\big]-\tfrac{1}{2}\mathbb{E}_{x_i\sim P_{data_i}}\big[\log D_i(x_i,s)\big]-\tfrac{1}{2}\mathbb{E}_{\hat{x}_i\sim P_{G_i}}\big[\log\big(1-D_i(\hat{x}_i,s)\big)\big]$$

  • Loss for the network:

    $$L=\sum_i L_{G_i}+\lambda_1 L_{CA}+\lambda_2 L_{DAMSM}$$

  • The authors perform an ablation study by dropping each proposed component from the base network and show the improvement obtained by adding each component back one by one.

Semantics Disentangling for Text-to-Image Generation (SD-GAN) [2019]

  • In this paper the authors propose a method that learns disentangled representations for text-to-image synthesis. Both high-level semantic consistency and low-level semantic diversity are preserved by the proposed method.

Controllable Text-to-Image Generation [2019]

  • In this paper the authors propose a method for generating images from text as well as manipulating the generated images in a controlled manner; small changes to the sentence do not affect image parts unrelated to the text.
  • The authors also propose a word-level discriminator that gives fine-grained feedback to the generator for better learning.
  • There are three novel components in the architecture:
    • spatial and channel-wise attention in the generator network for generating better images; the proposed generator follows a multi-stage architecture similar to AttnGAN.
    • a word-level discriminator that exploits the correlation between image regions and words to disentangle different attributes.
    • a perceptual loss used to better guide the generator towards realistic images.
*Figure: ControlGAN*
  • The authors use AttnGAN as the base work. To generate the text feature vectors, a pre-trained bi-directional LSTM model is used.

  • Channel-Wise Attention

    • Channel-wise attention is proposed to help the network learn the relation between words and channels. As per the paper, the authors found that spatial attention focuses mostly on colour descriptions.
      $$
      \begin{aligned}
      &\text{At the } k\text{-th stage:}\\
      &D:\text{ dimension of the word embedding},\quad L:\text{ number of words in the sentence}\\
      &v_k\in\mathbb{R}^{C\times(H_k\times W_k)}\ \text{[visual feature]},\quad w\in\mathbb{R}^{D\times L}\ \text{[word feature]}\\
      &F_k\in\mathbb{R}^{(H_k\times W_k)\times D}\ \text{[perception layer]},\quad \hat{w}_k\in\mathbb{R}^{(H_k\times W_k)\times L}\ \text{[transformed feature]},\quad m_k\in\mathbb{R}^{C\times L}\ \text{[channel-wise attention matrix]}\\
      &\text{Word features mapped to the semantic space of the visual features: }\ \hat{w}_k=F_k w\\
      &\text{Channel-wise attention matrix: }\ m_k=v_k\hat{w}_k\ \text{[correlation between channels and words across all spatial locations]}\\
      &\text{Normalized attention map: }\ \alpha_{i,j}=\frac{\exp(m^{k}_{i,j})}{\sum_{l=0}^{L-1}\exp(m^{k}_{i,l})}\\
      &\alpha_{i,j}\text{ is the correlation between the } i\text{-th channel of the visual feature } v_k \text{ and the } j\text{-th word of the sentence } S\\
      &\text{Final channel-wise feature: }\ f_k^{\alpha}=\alpha_k(\hat{w}_k)^{T},\ \text{the weighted correlation between words and the corresponding channels}
      \end{aligned}
      $$
  • Word-Level Discriminator

    • The intuition behind the word-level discriminator is to provide the generator with fine-grained (per-word) feedback, so that it can focus on the image regions related to the words in the text description. This idea is inspired by the text-adaptive discriminator.
    • $\mathcal{L}_{corre}$ is the correlation loss used by the word-level discriminator.
  • Along with the conditional and unconditional generator and discriminator losses, the DAMSM loss is also used.

Zero-Shot Text-to-Image Generation (DALL-E) [2021, OpenAI]

Learning Transferable Visual Models From Natural Language Supervision (CLIP) [2021] [Youtube(Yannic)]

  • The authors propose a method for learning discriminative features that transfer to a zero-shot setting. The proposed method adapts to a large variety of datasets.
  • The model predicts a matching sentence instead of just a single-word label.
  • The idea is to learn good features using the vast amount of text available. Learning features from a labelled dataset with a fixed label set restricts the model's generality.
  • Intuition:
    • Ask the model how likely a given text goes with an image; e.g. given the texts cat, dog or mouse, how likely each label goes with the image, so the model can output a probability distribution.
    • We can make the model more robust by rephrasing the label into a sentence, and the sentences can be of different types, e.g. "a photo of a dog" or "a photo of a cat". The sentences are constructed from the dataset labels using prompts.
  • The CLIP model is trained using a contrastive learning approach.

    • A batch of paired text-image data is formed.
    • An image encoder extracts features for the images and a text encoder extracts features for the texts; the inner product between every image feature and every text feature in the batch measures how related each image-text pair is, in both directions (image-to-text and text-to-image). The aim is to produce contrastive encodings in which the diagonal entries (matched pairs) have high values, close to 1, while the off-diagonal entries (unrelated image-text pairs) are close to 0. Read along the rows this is an image-to-text classification problem, and along the columns a text-to-image classification problem: the same problem viewed from two different angles.
  • For inference:

    • Extract the image feature using the trained image encoder.
    • For the text side, a prompt is generated from each label of the dataset and passed through the previously trained text encoder to extract a feature for each prompt.
    • To assign a class to an image, we take the prompt whose text feature has the highest inner product with the image feature (a short sketch follows).
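
A minimal sketch of CLIP-style zero-shot classification, assuming the image and prompt embeddings have already been produced by the trained encoders (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_features: torch.Tensor, prompt_features: torch.Tensor) -> torch.Tensor:
    """image_features: (B, d) from the trained image encoder.
    prompt_features: (num_classes, d) embeddings of prompts like 'a photo of a {label}'.
    Returns class probabilities per image."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(prompt_features, dim=-1)
    logits = 100.0 * img @ txt.t()     # scaled cosine similarities (the scale is an assumption)
    return logits.softmax(dim=-1)

# during training, the (B, B) similarity matrix of a batch of matched pairs is pushed
# towards a high diagonal via symmetric cross-entropy over its rows and columns.
```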
  • Due to the contrastive training strategy, the CLIP model generates robust features for inter- and intra-class classification; the captions provided during training capture these minute details. Example captions: "this dog is a pug", "the aussie pup", etc.
  • For the text encoder the authors use a Transformer model of approximately 63M parameters, and for the image encoder they use ResNet and ViT models.
  • The authors compare the performance of zero-shot classification and linear probing. An important part of the proposed method is how the prompt is prepared, i.e. the sentence template into which the label is embedded.

DALL-E [Youtube(Yannic)]

  • The proposed method uses a pre-trained discrete (VQ-VAE-style) codebook; the DALL-E model then has to predict tokens.
  • The authors train a Transformer model autoregressively, where the text and image tokens are given as a single stream of data.
  • To deal with high-resolution images, the authors propose two stages:
    • Stage 1: They train a discrete variational autoencoder (VQ-VAE-style) with codebook size $K=8192$. Images of size $256\times256$ are reduced to a $32\times32$ grid of image tokens.
    • Stage 2: They concatenate the 256 BPE text tokens with the $32\times32=1024$ image tokens and train an autoregressive Transformer model to predict the remaining image tokens. The idea is to learn a joint distribution over text and images.
  • First, the dVAE is learned for the image task; after learning it, its weights are fixed and the Transformer is trained autoregressively on the concatenated text and image tokens.
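
A toy sketch of the token stream fed to the autoregressive Transformer; the text vocabulary size and the shared-vocabulary offset are assumptions, only the 256 text / 1024 image token counts and the codebook size K = 8192 come from the summary above:

```python
import torch

# 256 BPE text tokens and 32 x 32 = 1024 dVAE image tokens per sample
text_tokens = torch.randint(0, 16384, (1, 256))    # text vocabulary size is an assumption
image_tokens = torch.randint(0, 8192, (1, 1024))   # dVAE codebook size K = 8192

# one training example for the autoregressive transformer: text tokens followed by image
# tokens, modelled left-to-right so a joint distribution over text and image is learned.
# image ids are shifted into a shared vocabulary (the offset scheme is an assumption).
stream = torch.cat([text_tokens, image_tokens + 16384], dim=1)
print(stream.shape)  # torch.Size([1, 1280])
```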

Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [2021]

  • This paper proposes a method for both the text-to-image generation and the text-based image manipulation task.
  • Method:
    • A GAN is trained unconditionally, aiming to generate images with high diversity and quality, i.e. a high Inception score.
    • After training the GAN, an inversion model is learned to map real images to the latent space, using a cycle-consistency loss to learn more robust and consistent inverted codes.
    • Later, a text encoder is learned to explore the semantic space of the learned GAN.
  • Previous methods are restricted to paired text-image data, which limits the generator's capability to generate diverse images for the text-to-image task. It is also hard to preserve non-text-related semantics while making small changes to the image, and for text-based image manipulation a new model needs to be trained.
  • The authors use StyleGAN for a disentangled representation of images: StyleGAN maps a noise vector $z$ to a disentangled latent space $W$.
  • Inspired by the idea of GAN inversion, the authors propose a method for text-based image synthesis.
  • The authors train a GAN inversion encoder that maps a real image to the latent space. To make the inverted latent code consistent with the trained StyleGAN latent space $W$, a cycle-consistency loss is used.
  • For the text-to-image generation task, a similarity model is learned between the text and the inverted latent code, such that the inverted code is optimized to have the desired semantic attributes. For text-based image editing the same method is used, with a perceptual loss added to optimize the reconstruction between the original and the generated image.
  • The authors divide CI-GAN into three different stages:

    • STAGE 1: A StyleGAN model is trained to generate high-quality and diverse images without any condition, i.e. a mapping $Z\rightarrow W$, where $W$ is the more disentangled representation.
    • STAGE 2: A GAN inversion encoder is trained using a cycle-consistency loss.
    • STAGE 3: A text encoder model is trained to align the text encoding with the inverted latent code $w$.
  • GAN inversion encoder:

    • GAN inversion studies the mapping of a generated image back to its latent code.
    • An encoder-based model $E(\cdot)$ is used, which takes an image as input and outputs a code $w'$ that should be similar to $w\in W$.
    • A pixel loss ($L_{pix}$) and a perceptual loss ($L_{vgg}$) are used to make the image reconstructed from $w'$ consistent with the original image:

      $$
      \begin{aligned}
      &x'=G(E(x))\\
      &L_{pix}=\|x-x'\|_2\\
      &L_{vgg}=\|F(x)-F(x')\|_2
      \end{aligned}
      $$

      where $F$ is a pre-trained VGG network used for feature extraction.
  • Cycle-consistency constraint:

    • Using only the perceptual and pixel losses, the model focuses on the image domain; we also have to make the reconstructed latent code $w'$ similar to the original latent code $w$. To do so, cycle-consistency training is used.
    • An adversarial loss is applied both to the inverted code $w'$ and to the image generated from $w'$.
    • The adversarial loss on the image generated from the inverted latent code $w'$ forces the encoder to produce realistic images and to align $w'$ with the generator's semantics; a new image discriminator is used here:

      $$L_{adv}^{x}=\mathbb{E}_{x\sim P_{data}}\big[D(x')\big]$$

    • The adversarial loss on the inverted latent code forces the model to align $w'$ with $w$, so that it follows the same distribution as $w$:

      $$L_{adv}^{w}=\mathbb{E}_{w'\sim P_{w'}}\big[D_w(w')\big]$$
  • The complete objective of the GAN inversion encoder is:

    $$\min_{\theta_E} L_E=L_{pix}+\lambda_{vgg}L_{vgg}+\lambda_{w}L_{w}-\big(\lambda_{adv}^{x}L_{adv}^{x}+\lambda_{adv}^{w}L_{adv}^{w}\big)$$

  • Text encoder by latent space alignment

    • The authors train a simple LSTM model to discover the relation between the explicit text semantic space and the implicit image encoding space $W$.
    • The encoding dimension obtained from the LSTM is the same as that of $w$.
    • Aligning $t$ and $w$ makes the text encoder better capture the useful semantics in the text description.
    • The authors use a modified version of the InfoNCE loss, based on the $l_2$ distance, to learn this latent relation (a small sketch follows this list):

      $$L_{sim}=-\,\mathbb{E}_{T}\Bigg[\log\frac{e^{-\|t_i-w_i\|_2}}{\sum_{t_j\in T}e^{-\|t_j-w_i\|_2}}\Bigg]$$

      where $T$ represents all the text representations in the mini-batch, $i$ is one sample index in the mini-batch, and $t_{i+k}$ ($k\neq 0$) denotes the samples unpaired with $w_i$. The aim is to minimize the $l_2$ distance between the paired $t$ and $w$ while maximizing the distance to the unpaired instances.
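
A small sketch of an $l_2$-based InfoNCE-style alignment loss in the spirit of $L_{sim}$; the temperature $\gamma$ and the cross-entropy formulation are assumptions:

```python
import torch
import torch.nn.functional as F

def latent_alignment_loss(t: torch.Tensor, w: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    """t, w: (M, d) text encodings and inverted latent codes for one mini-batch.
    For each w_i the paired t_i should be closer (in l2) than all unpaired t_j."""
    dist = torch.cdist(w, t, p=2)      # (M, M): dist[i, j] = ||w_i - t_j||_2
    logits = -gamma * dist             # smaller distance -> larger logit (gamma is an assumption)
    labels = torch.arange(w.size(0))
    return F.cross_entropy(logits, labels)
```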
  • Text guided image manipulation

    • $w$ is pushed towards the direction of $t$:

      $$w_{opt}=\arg\min_{w}\ \|t-w\|_2+\|F(x)-F(G(w))\|_2$$

    • For the image manipulation task the perceptual loss (with feature extractor $F$) is used to keep the edited image close to the original.
  • The authors perform ablation studies to show the effectiveness of cycle-consistency training.

  • The latent space of the StyleGAN used in the paper is not very disentangled, which can cause inconsistency in the text-based image editing task.
