# [Disentangled Inference for GANs with Latently Invertible Autoencoder](https://arxiv.org/pdf/1906.08090.pdf) Notes

Author : [Vignesh](https://github.com/vignesh-creator)

## Abstract

* The main disadvantage of GANs is the absence of a capability for encoding real-world samples (more precisely, GANs cannot produce a disentangled latent vector for the data when the data is encoded).
* The conventional way of addressing this issue is to learn an encoder for the GAN via a Variational Auto-Encoder (VAE).
* This paper argues that the entanglement of the latent space in the VAE-GAN framework poses the main challenge for encoder learning.
* To address this entanglement issue and enable inference in GANs, the paper proposes a novel algorithm named **Latently Invertible Autoencoder (LIA)**.

## Introduction

* GANs are well suited to generating high-quality images, but they lack an encoder for carrying out inference on real samples: given a sample x, the GAN architecture alone cannot provide the corresponding latent vector z.
* Such an encoder is highly desirable because access to a good latent vector lets us manipulate images by operating on it (editing, domain adaptation, etc.).
* There are some common approaches to derive the encoded vector from an image:
    * GAN inversion, which optimizes the Mean Squared Error (MSE) between the generated sample and the associated real sample. Drawbacks: sensitivity to the initialization of z and time-consuming computation.
    * Adversarial inference, which uses another GAN to infer z as latent variables in a dual-GAN framework.
    * VAE-GAN with a shared decoder/generator, which is an elegant solution to GAN inference. Drawbacks: reconstruction precision is worse than that of a VAE and generation quality is worse than that of a GAN.
* ![](https://i.imgur.com/VuEZ6f0.png)
* The disentanglement of the latent space is the decisive factor in learning a high-quality encoder for VAE-GAN.
* Based on this disentanglement argument, the Latently Invertible Autoencoder (LIA) is developed.
* The paper explains:
    1. The effect of entanglement in the latent space (z-space) and of disentanglement in the intermediate latent space (y-space).
    2. The benefits of the LIA network: the prior distribution can be exactly fitted from a disentangled and unfolded feature space, which significantly eases the inference problem.
    3. Since the latent space is detached when training the encoder, the encoder can be trained without variational inference, as opposed to a VAE.
    4. The two-stage adversarial learning decomposes the LIA framework into a vanilla GAN with an invertible network and a standard autoencoder without stochastic variables.

## Entanglement in GANs

* To visualize the entanglement in GANs, the authors randomly sample z_i and z_j and linearly interpolate between them. The face identities change significantly during the deformation from face x_i to face x_j, meaning that the faces in the x-space do not change continuously or linearly with the z-space (x = g(z), where g is the generator).
* As a comparison, the authors establish the mapping via an intermediate latent variable y = φ(z) and x = g(y), where φ is a multilayer perceptron producing an output vector with the same dimension as z.
* From the results, it is clear that the face path associated with y-interpolation is significantly closer to a straight line than the one associated with z-interpolation, and the corresponding generated faces vary very smoothly (a minimal sketch of this comparison follows below).
* We call the folded z-space (with respect to the generated faces) the entangled latent space, and the intermediate latent space the disentangled one.
* The entanglement of z arises from GAN training with random sampling, because there is no geometric constraint to guarantee spatial-adjacency correspondence between the z-space and the x-space. A consequence is that z_i and z_j are not necessarily spatially adjacent even if x_i and x_j are perceptually similar.
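The comparison above can be sketched in a few lines of PyTorch. This is only an illustrative toy, not the authors' code: `mapping` (an MLP standing in for φ) and `generator` (standing in for g) are hypothetical placeholders for a pretrained model, and the path-length proxy is a crude stand-in for the path-length comparison the authors report.

```python
# Toy sketch (not the authors' code): compare linear interpolation done in the
# entangled z-space against interpolation done in the intermediate y-space.
# `mapping` and `generator` are hypothetical stand-ins for a pretrained GAN.
import torch
import torch.nn as nn

latent_dim = 512

# Stand-in mapping network phi: z -> y (same dimensionality as z).
mapping = nn.Sequential(
    nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),
    nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),
    nn.Linear(latent_dim, latent_dim),
)
# Stand-in partial generator g: y -> image (flattened 64x64 RGB).
generator = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())

z_i, z_j = torch.randn(latent_dim), torch.randn(latent_dim)
alphas = torch.linspace(0.0, 1.0, steps=10)

# Path 1: interpolate in z, then map and generate, x_t = g(phi(z_t)).
faces_from_z = [generator(mapping((1 - a) * z_i + a * z_j)) for a in alphas]

# Path 2: map the endpoints first, then interpolate in y, x_t = g(y_t).
y_i, y_j = mapping(z_i), mapping(z_j)
faces_from_y = [generator((1 - a) * y_i + a * y_j) for a in alphas]

def path_length(frames):
    """Crude path-length proxy: summed distances between consecutive frames."""
    return sum((b - a).norm().item() for a, b in zip(frames, frames[1:]))

print("path length via z:", path_length(faces_from_z))
print("path length via y:", path_length(faces_from_y))
```

With a trained GAN the y-path comes out shorter and the frames change smoothly, as reported in the paper; with these random stand-in networks the printed numbers are meaningless, and the snippet only shows the procedure.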
### Inference Difficulty from Entanglement

* Let ID(x) denote the identity of face x (or the category of an object). Along a path in the latent space from an initial code z_o to a final code z_T, entanglement occurs if the identity ID(g(z_t)) of an intermediate code z_t is semantically distinct from both ID(g(z_o)) and ID(g(z_T)).
* The entanglement of the z-space is the crux of why VAE/GAN and MSE-based optimization are incapable of performing inference for GANs well. Evidence from practice is that the entanglement causes sharp gradient volatility during optimization, hindering the convergence of the algorithm.

### Disentanglement and Lipschitz Continuity

![](https://i.imgur.com/PddP5vA.png)
![](https://i.imgur.com/jQfnxNj.png)

* The authors compare the path lengths of faces generated from randomly sampled z and y, respectively. Table 2 shows that the path length from y is much shorter than that from z, verifying that the y-space comes closer to Lipschitz continuity and is thus more disentangled.
* Lipschitz continuity can guarantee spatial consistency between the latent space and the generated sample space, which is more favorable for the inference task.
* Therefore, the authors devise their GAN-inference algorithm based on the disentangled y-space instead of the z-space.

## Latently Invertible Autoencoder

* To acquire the disentangled latent space, we need to embed a mapping network in the architecture of the vanilla GAN: φ(z) = y; g(y) = x.
* At the same time, the encoder f has to directly infer y (not z) to preserve the disentanglement.
* So we also need to establish a reverse mapping to obtain z from y, i.e. z = φ⁻¹(y).
* This implies that φ must be invertible; an invertible neural network can establish this bijection between z and y.
* The LIA framework is designed according to this rule.

### Neural Architecture of LIA

![](https://i.imgur.com/rnxhCVQ.png)
![](https://i.imgur.com/xM4aYrq.png)

* The role of f(x) in LIA can be regarded as unfolding the underlying data manifold. Therefore, Euclidean operations such as linear interpolation and vector arithmetic are more reliable and continuous in this disentangled feature space.
* We then establish an invertible mapping between the feature y and the latent variable z, as opposed to VAEs, which map the original data directly to latent variables.
* The feature y can be exactly recovered from z via the invertibility of φ, which is the advantage of using invertible networks. The recovered feature y is then fed into the partial decoder g(y) to generate the corresponding data x̃ (a sketch of such an invertible mapping is given below).
* If we remove φ and φ⁻¹, LIA is just a standard autoencoder.
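To illustrate the invertibility requirement, here is a minimal sketch of an invertible mapping built from additive coupling layers. This block design is an assumption for illustration, not necessarily the paper's exact invertible network, but the property it demonstrates (exact recovery of y from z) is the one LIA relies on.

```python
# Minimal sketch (assumed block design, not the paper's exact architecture):
# an invertible network made of additive coupling layers, mapping y <-> z exactly.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Splits the vector in half and shifts one half by a function of the other.
    The transform is exactly invertible regardless of what `net` computes."""
    def __init__(self, dim, hidden=256, flip=False):
        super().__init__()
        self.flip = flip
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim // 2),
        )

    def forward(self, v):            # forward direction
        a, b = v.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        b = b + self.net(a)
        out = (b, a) if self.flip else (a, b)
        return torch.cat(out, dim=-1)

    def inverse(self, v):            # exact inverse
        a, b = v.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        b = b - self.net(a)
        out = (b, a) if self.flip else (a, b)
        return torch.cat(out, dim=-1)

class InvertibleMapping(nn.Module):
    """phi: z -> y, with phi.inverse: y -> z."""
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AdditiveCoupling(dim, flip=(i % 2 == 1)) for i in range(n_layers)]
        )

    def forward(self, z):
        y = z
        for layer in self.layers:
            y = layer(y)
        return y

    def inverse(self, y):
        z = y
        for layer in reversed(self.layers):
            z = layer.inverse(z)
        return z

phi = InvertibleMapping(dim=512)
y = torch.randn(8, 512)              # features produced by the encoder f(x)
z = phi.inverse(y)                   # map features to the prior space
y_recovered = phi(z)                 # recovered exactly, up to float precision
print(torch.allclose(y, y_recovered, atol=1e-5))
```

Because each coupling layer only adds a function of one half of the vector to the other half, inversion is exact no matter what the inner MLP computes, and no Jacobian or density terms are needed. This matches the note above that LIA only uses the transformation invertibility, not a flow-style probability constraint.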
### Reconstruction Loss and Adversarial Learning

* Let E denote a feature extractor (e.g. VGG). Then we can write the loss
![](https://i.imgur.com/Lj5ZI4D.png)
* ||x - x̃|| is the reconstruction term (pixel loss) and the second term is the perceptual loss, where β1 is the hyper-parameter balancing the two. The feasibility of this type of mixed reconstruction loss is evident in diverse image-to-image translation tasks.
* It suffices to emphasize that the functionality of E here is essentially to produce representations of the input x and the output x̃.
* Adversarial training is also used, because purely norm-based reconstructions tend to be blurry. A discriminator c is employed to balance the loss of the comparison between x and x̃. (The authors use the Wasserstein GAN loss rather than the vanilla GAN loss.)
![](https://i.imgur.com/iTv8sS6.png)
* where p_x and p_x̃ denote the probability distributions of the real data and the generated data, respectively, and γ (gamma) is the hyper-parameter of the regularization.

## Two-Stage Training

* The authors propose a two-stage training scheme, which decomposes the framework into two parts:
    1. First, the decoder of LIA is trained as a GAN model with an invertible network.
    2. Second, the invertible network that connects the feature space and the latent space is detached from the architecture, reducing the framework to a standard autoencoder without variational inference. This two-stage design prevents the entanglement issue.

### Decoder Training

* GAN models are able to recover a precise x̃ if we can find the latent variable z for the given x.
* The associated GAN model is trained separately within the LIA framework: a standard GAN model is singled out for the first-stage training, as displayed in Figure 2(b) (the second image in "Neural Architecture of LIA").
![](https://i.imgur.com/gnbZlA5.png)
* It is worth noting that only the transformation invertibility of the invertible network matters here; no constraints are posed on the probability densities of z and y, in contrast to normalizing flows. The strategy of attaching an invertible network in front of the generator can potentially be applied to any GAN model.

### Encoder Training

* In the network architecture, the invertible network is embedded in the latent space in a symmetrical fashion. This unique characteristic allows us to detach the invertible network φ from the LIA framework, leaving a conventional autoencoder without stochastic variables.
![](https://i.imgur.com/NC6g8M1.png)
* After the first-stage GAN training, the parameters of f are learned as
![](https://i.imgur.com/IZyyqJA.png)
* where β2 is the hyper-parameter and c∗∗ denotes the fine-tuned parameters of the discriminator, meaning that the discriminator is fine-tuned along with the training of the encoder while the generator is frozen (see the sketch of this stage below).
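The second stage can be summarized with a short training-step sketch. All component names (`f`, `g`, `c`, `vgg_features`) and hyper-parameter choices are assumptions for illustration, not the official implementation; the norm used for the pixel term and the γ-weighted gradient regularization are simplified.

```python
# Sketch of one second-stage step, assuming pre-built modules:
#   f: encoder (trainable)            g: generator (frozen, from stage 1)
#   c: discriminator (fine-tuned)     vgg_features: fixed feature extractor E
# Assumes g's parameters have requires_grad=False; only f and c are updated.
import torch
import torch.nn.functional as F

def encoder_stage_step(f, g, c, vgg_features, x, opt_f, opt_c,
                       beta1=1.0, beta2=0.1):
    # --- fine-tune the discriminator while the generator stays frozen ---
    with torch.no_grad():
        x_rec_detached = g(f(x))
    d_loss = -c(x).mean() + c(x_rec_detached).mean()  # WGAN critic loss (gradient penalty omitted)
    opt_c.zero_grad(); d_loss.backward(); opt_c.step()

    # --- update the encoder through the frozen generator ---
    x_rec = g(f(x))
    pixel_loss = F.l1_loss(x_rec, x)                                   # ||x - x~|| (norm choice assumed)
    perceptual_loss = F.l1_loss(vgg_features(x_rec), vgg_features(x))  # E-based term
    adv_loss = -c(x_rec).mean()                                        # fool the fine-tuned critic
    enc_loss = pixel_loss + beta1 * perceptual_loss + beta2 * adv_loss
    opt_f.zero_grad(); enc_loss.backward(); opt_f.step()
    return enc_loss.item()
```

In the paper, the discriminator objective also carries the γ-weighted regularization term shown in the loss figure above; only the structure of the two alternating updates (fine-tune c, then update f through the frozen g) is illustrated here.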
## Related Work

Works that address the inference problem in VAEs/GANs:

* [VAE-GAN (Larsen et al., 2016)](https://arxiv.org/pdf/1512.09300.pdf)
* [Adversarial Autoencoder](https://arxiv.org/pdf/1511.05644.pdf): an intriguing attempt at training a VAE in an adversarial manner
* [Deep Image Prior](https://arxiv.org/pdf/1711.10925.pdf)
* [Pioneer Networks: Progressively Growing Generative Autoencoders](https://arxiv.org/pdf/1807.03026.pdf)

Works relevant to the LIA network architecture:

* [Improved Variational Inference with Inverse Autoregressive Flow](https://arxiv.org/pdf/1606.04934.pdf)
* [Normalizing Flows: An Introduction and Review of Current Methods](https://arxiv.org/pdf/1908.09257.pdf)
* [f-VAEs: Improve VAEs with Conditional Flows](https://arxiv.org/pdf/1809.05861.pdf)

## Conclusion

* A new generative model, the Latently Invertible Autoencoder (LIA), is proposed for generating image samples from a probability prior while simultaneously inferring an accurate latent code for a given sample.
* The core idea of LIA is to symmetrically embed an invertible network in an autoencoder.
* The neural architecture is then trained with adversarial learning as two decomposed modules.
* With the two-stage training design, the decoder can be replaced with any GAN generator for high-resolution image generation.
* The role of the invertible network is to remove any probability optimization and to bridge the prior with the unfolded feature vectors.