# **Progress Report 03/19/2019**

## Learning the theory of Generative Models

### From GAN to WGAN (Review)

* **Generative Adversarial Networks (GANs)**

  **A GAN consists of two models:**

  ![](https://i.imgur.com/Mhv7X6p.jpg)

  **Loss function:**

  ![](https://i.imgur.com/k8i0V9H.png)

* **What is the optimal value for D?**

  For G fixed, the optimal discriminator D is

  ![](https://i.imgur.com/Io7fLvF.jpg)

* **What is the global optimum?**

  When both G and D are at their optimal values, we have pg = pr and D∗(x) = 1/2, and the loss function becomes:

  ![](https://i.imgur.com/J2ujw4K.jpg)

* **What does the loss function represent?**

  When the discriminator is optimal, the GAN loss function quantifies the similarity between the generated data distribution pg and the real sample distribution pr by the JS divergence. The best G∗, which replicates the real data distribution, leads to the minimum L(G∗, D∗) = −2 log 2, consistent with the equations below.

  ![](https://i.imgur.com/9rCaOL6.jpg)

* **Problems in GANs**

  Although GANs have shown great success in realistic image generation, training them is not easy.

* **Hard to achieve Nash equilibrium**

  Salimans et al. (2016) [2] discussed the problem with GAN's gradient-descent-based training procedure. The two models are trained simultaneously to find a Nash equilibrium of a two-player non-cooperative game. However, each model updates its cost independently, with no regard to the other player in the game. **Updating the gradients of both models concurrently cannot guarantee convergence.**

  Let's check out a simple example to better understand why it is difficult to find a Nash equilibrium in a non-cooperative game. Suppose one player takes control of x to minimize f1(x) = xy, while at the same time the other player constantly updates y to minimize f2(y) = −xy. Because ∂f1/∂x = y and ∂f2/∂y = −x, in one iteration we simultaneously update x to x − η⋅y and y to y + η⋅x, where η is the learning rate.
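This toy game can be run directly. A minimal sketch: the learning rate η = 0.1 and the starting point (1, 1) are arbitrary illustrative choices, not values from the paper.

```python
# Simultaneous gradient descent on f1(x) = x*y (player 1 controls x)
# and f2(y) = -x*y (player 2 controls y).
eta = 0.1           # learning rate (illustrative choice)
x, y = 1.0, 1.0     # arbitrary starting point

norms = []
for _ in range(100):
    # Each player updates using only its own gradient, evaluated at the
    # *current* values of (x, y) -- a simultaneous update.
    x, y = x - eta * y, y + eta * x
    norms.append(x * x + y * y)

# Each step multiplies the squared norm by (1 + eta^2), so the iterates
# spiral outward instead of converging to the Nash equilibrium at (0, 0).
print(norms[0], norms[-1])
```

Each simultaneous update is a rotation combined with an expansion by a factor of √(1 + η²), so the trajectory oscillates around the equilibrium at the origin while steadily moving away from it.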
  **Once x and y have different signs**, every subsequent gradient update causes heavy oscillation, and the instability gets worse over time.

  ![](https://i.imgur.com/j8hqFnI.png)

* **Vanishing gradient**

  When the discriminator is perfect, we are guaranteed that D(x) = 1, ∀x ∈ pr and D(x) = 0, ∀x ∈ pg. The loss function L then falls to zero, and we end up with no gradient to update the models during learning iterations. Fig. 5 below demonstrates that as the discriminator gets better, the gradient vanishes fast.

  Arjovsky and Bottou (2017) [3] explored the gradient vanishing problem through experiments. First, a DCGAN is trained for 1, 10, and 25 epochs. Then, with the generator fixed, a discriminator is trained from scratch and the gradients are measured with the original cost function. We see the gradient norms decay quickly (in log scale), in the best case by 5 orders of magnitude after 4000 discriminator iterations.

  ![](https://i.imgur.com/mSHA3vU.png)

  As a result, training a GAN faces a dilemma:

  * If the discriminator behaves badly, the generator does not get accurate feedback and the loss function cannot represent reality.
  * If the discriminator does a great job, the gradient of the loss function drops close to zero and learning becomes very slow or even stalls.

  This dilemma can make GAN training very tough.

* **Mode collapse**

  During training, the generator may collapse to a setting where it always produces the same outputs. This is a common failure case for GANs, commonly referred to as mode collapse. Even though the generator might be able to trick the corresponding discriminator, it fails to learn to represent the complex real-world data distribution and gets stuck in a small space with extremely low variety.
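The vanishing-gradient effect described above can be reproduced in one dimension. A minimal sketch, under stated assumptions: a logistic discriminator D(x) = σ(a·x) whose sharpness a grows as it trains, and a fake sample at x = −1; both are illustrative choices, not part of the experiments in [3].

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Toy discriminator D(x) = sigmoid(a * x); larger a = more confident D.
# Original GAN generator loss at a fake sample: L_G(x) = log(1 - D(x)).
# Its gradient w.r.t. the sample is dL/dx = -a * sigmoid(a * x).
x_fake = -1.0  # a sample the discriminator already rejects
grads = []
for a in [1, 5, 10, 20]:
    grads.append(abs(-a * sigmoid(a * x_fake)))

# As the discriminator sharpens (a grows), the gradient signal reaching
# the generator shrinks toward zero instead of growing.
print(grads)
```

The gradient magnitude a·σ(a·x) decays roughly like a·e^(a·x) for x < 0, so a more confident discriminator hands the generator an exponentially weaker learning signal.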
  (Image source: WGAN [4])

  ![](https://i.imgur.com/KBE4jB9.png)

* **Different techniques to solve problems in GANs**

  * **Improved GAN Training [2]**

    * **Feature Matching (stabilizes GAN training)**

      Feature matching addresses the instability of GANs by specifying a new objective for the generator that prevents it from overtraining on the current discriminator. It suggests optimizing the discriminator to inspect whether the generator's output matches the expected statistics of the real samples. The new loss function is defined as:

      ![](https://i.imgur.com/IpNHaTs.jpg)

      where f(x) can be any computation of feature statistics, such as the mean or median.

    * **Minibatch Discrimination (addresses mode collapse)**

      The concept of minibatch discrimination is quite general: any discriminator model that looks at multiple examples in combination, rather than in isolation, could potentially help avoid collapse of the generator.

* **Wasserstein GAN (stabilizes GAN training) [4]**

  WGAN introduces a new loss function based on the Wasserstein metric.

  **Why is Wasserstein better than JS or KL divergence?** Even when two distributions lie on lower-dimensional manifolds without overlap, the Wasserstein distance still provides a meaningful and smooth measure of the distance between them.

  **Wasserstein loss function:**

  ![](https://i.imgur.com/CcnHgEW.jpg)

  The function f comes from a family of K-Lipschitz continuous functions, {fw}, w ∈ W, parameterized by w. In the modified Wasserstein-GAN, the "discriminator" model is used to learn w to find a good fw, and the loss function is configured to measure the Wasserstein distance between pr and pg.

  In WGAN, the "discriminator" is no longer a direct critic that tells fake samples apart from real ones. Instead, it is trained to learn a K-Lipschitz continuous function that helps compute the Wasserstein distance.
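The critic objective and the weight clipping described here can be sketched in a toy setting. The clipping window c = 0.01 matches the paper's example; the linear critic f_w(x) = w·x, the Gaussian data, and the learning rate are illustrative assumptions (a linear critic with |w| ≤ c is trivially c-Lipschitz).

```python
import random

random.seed(0)

c = 0.01     # clipping window: w is clamped to [-c, c] after each update
lr = 0.05    # learning rate (illustrative choice)
w = 0.0      # weight of a toy linear critic f_w(x) = w * x

real = [random.gauss(2.0, 0.5) for _ in range(256)]  # samples ~ p_r
fake = [random.gauss(0.0, 0.5) for _ in range(256)]  # samples ~ p_g

for _ in range(50):
    # The critic maximizes E_{p_r}[f_w(x)] - E_{p_g}[f_w(x)],
    # so take a gradient-ascent step on that objective.
    grad = sum(real) / len(real) - sum(fake) / len(fake)
    w += lr * grad
    # Weight clipping: keep w in a compact set so f_w stays Lipschitz.
    w = max(-c, min(c, w))

# The critic's estimate of the (scaled) Wasserstein distance:
estimate = w * (sum(real) / len(real) - sum(fake) / len(fake))
print(w, estimate)
```

Because the real and fake means differ, the gradient is always positive here and the weight simply saturates at the clip boundary, which also illustrates why clipping is a blunt instrument: the critic's capacity is capped by the window size.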
  As the loss function decreases during training, the Wasserstein distance gets smaller and the generator's output grows closer to the real data distribution.

  One big problem is maintaining the K-Lipschitz continuity of fw during training to make everything work out. The paper presents a simple but very practical trick: after every gradient update, clamp the weights w to a small window, such as [−0.01, 0.01]. This results in a compact parameter space W, so fw obtains lower and upper bounds that preserve the Lipschitz continuity.

## Generalized Adversarial Learning across a Variety of Domains

* ### Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow [8]

* **Contribution:**

  This paper introduced an adaptive stochastic regularization method for adversarial learning that substantially improves performance across a range of different application domains:

  * Generating high-quality images (GANs)
  * Learning reward functions in the framework of inverse reinforcement learning (inverse RL) [7]
  * Directly imitating demonstrations or motions of an object [6]

* **Architecture:**

  ![](https://i.imgur.com/ge6KoRQ.jpg)

* **Modified GAN loss functions introduced by this paper**

  * **Discriminator:**

    ![](https://i.imgur.com/2XSQtFR.jpg)
    ![](https://i.imgur.com/l80w2m4.jpg)

  * **Generator:**

    ![](https://i.imgur.com/IWOJMCX.jpg)

* **One of the tasks solved by this paper is imitation learning:**

  * The goal of the motion imitation tasks is to train a simulated character to mimic demonstrations provided by mocap (motion capture) clips recorded from human actors.
  * Method used: Variational Adversarial Imitation Learning

    ![](https://i.imgur.com/sFby8BE.jpg)
    ![](https://i.imgur.com/1ctbJwn.jpg)

  * While this method has produced promising results for video imitation, the results have primarily been on videos of synthetic scenes.
* Other tasks solved by this paper:

  * Learning a transferable reward function
  * Image generation

* **Future research directions based on this paper**

  * Extending the technique to imitating real-world videos
  * Theoretical analysis of the method: deriving convergence and stability results, and the conditions under which they hold
  * Incorporating the Variational Discriminator Bottleneck (VDB) into conditional GANs

## References

1. Ian Goodfellow, et al. "Generative adversarial nets." NIPS, 2014.
2. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. "Improved techniques for training GANs." NIPS, 2016.
3. Martin Arjovsky and Léon Bottou. "Towards principled methods for training generative adversarial networks." arXiv preprint arXiv:1701.04862, 2017.
4. Martin Arjovsky, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875, 2017.
5. Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. "Deep variational information bottleneck." arXiv preprint arXiv:1612.00410, 2016. http://arxiv.org/abs/1612.00410
6. Jonathan Ho and Stefano Ermon. "Generative adversarial imitation learning." NIPS, 2016. https://papers.nips.cc/paper/6391-generative-adversarial-imitation-learning.pdf
7. https://arxiv.org/abs/1710.11248
8. [Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow](https://openreview.net/forum?id=HyxPx3R9tm)