# Notes on "[GANs: What it can generate and What it cannot?](https://arxiv.org/abs/1804.00140)"
###### tags: `review` `adversarial` `supervised`
## Introduction
This is a review paper that focuses on the theoretical aspects of Generative Adversarial Networks (GANs). It also identifies questions (issues) in GANs and summarizes approaches to these issues.
## Primary GAN Setup
### Notation

### Useful Definitions
#### KL Divergence
* Given two probability distributions $p_d(x), p_g(x)$ that are positive $\forall x$, the KL divergence is given by
$$
KL(p_d || p_g) = \int_{-\infty}^{\infty} p_d(x)\log\frac{p_d(x)}{p_g(x)}dx
\tag{1}
$$
* Note that here the difference between the data distribution $p_d$ and the model distribution $p_g$ is weighted by $p_d$. The KL divergence is not symmetric, so we also define the reverse KL divergence.
#### Reverse KL Divergence
* Given two probability distributions $p_d(x), p_g(x)$ that are positive $\forall x$ (i.e. have full [support](https://en.wikipedia.org/wiki/Support_(mathematics))), the reverse KL divergence is given by
$$
KL(p_g || p_d) = \int_{-\infty}^{\infty} p_g(x)\log\frac{p_g(x)}{p_d(x)}dx
\tag{2}
$$
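* As a quick numerical sketch of the asymmetry (Python/NumPy, with arbitrarily chosen toy discrete distributions), the forward and reverse KL divergences generally give different values:
```python
import numpy as np

# Toy discrete distributions on a common support (values chosen arbitrarily).
p_d = np.array([0.7, 0.2, 0.1])  # "data" distribution
p_g = np.array([0.3, 0.3, 0.4])  # "model" distribution

def kl(p, q):
    """KL(p || q) for discrete distributions with full support, as in eq. (1)."""
    return np.sum(p * np.log(p / q))

print(kl(p_d, p_g))  # forward KL, eq. (1): difference weighted by p_d (~0.373)
print(kl(p_g, p_d))  # reverse KL, eq. (2): difference weighted by p_g (~0.422)
```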
#### Jensen Shannon Divergence (JSD)
* According to the original GAN paper ([Goodfellow et al. 2014](https://arxiv.org/abs/1406.2661)), this divergence is minimized during GAN training.
* The JSD is a symmetric divergence (its square root is a metric) between the 2 distributions $p_d(x), p_g(x)$, given by
$$
\text{JSD}(p_d || p_g) = \frac{1}{2}KL(p_d || \frac{p_d + p_g}{2}) + \frac{1}{2}KL(p_g || \frac{p_d + p_g}{2})
\tag{3}
$$
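* A small sketch (Python/NumPy, reusing the toy distributions above) verifying the symmetry of the JSD:
```python
import numpy as np

p_d = np.array([0.7, 0.2, 0.1])
p_g = np.array([0.3, 0.3, 0.4])

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)                  # the mixture (p + q) / 2 from eq. (3)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd(p_d, p_g), jsd(p_g, p_d))    # same value either way: JSD is symmetric
```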
#### $L_p$ distance (p-norm)
* The $L_p$ distance between any 2 $n$-dimensional vectors $x$ and $y$ for $1 \leq p \leq \infty$ is given by
$$
||x - y||_p = (\sum_{i = 1}^n |x_i - y_i|^p)^{\frac{1}{p}}
\tag{4}
$$
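* A minimal sketch (Python/NumPy, arbitrary vectors) comparing the common cases $p = 1, 2, \infty$; for $p = \infty$ the norm reduces to the largest coordinate-wise gap:
```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.0, 1.0, 1.0])   # difference: [1, -3, 2]

for p in (1, 2, np.inf):
    # ord=p computes the p-norm of the difference, as in eq. (4)
    print(p, np.linalg.norm(x - y, ord=p))   # 6.0, ~3.742, 3.0
```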
#### Binary Cross Entropy
* Given a classification task with 2 classes, the loss between the predicted probability $p$ and the target probability $y$ is given by
$$
\text{BCE}(p, y) = -y\log p - (1-y)\log(1 - p)
\tag{5}
$$
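* A direct translation of equation $5$ into code (Python/NumPy, with made-up probabilities), showing that confident wrong predictions are penalized heavily:
```python
import numpy as np

def bce(p, y):
    """Eq. (5): loss between predicted probability p and target y in {0, 1}."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(bce(0.9, 1))  # confident and correct -> small loss (~0.105)
print(bce(0.9, 0))  # confident and wrong   -> large loss (~2.303)
```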
#### Integral Probability Metric
* Let $\mathscr{F}$ be a class of measurable, bounded, real-valued functions on $X$ that is symmetric (i.e. $f \in \mathscr{F} \implies -f \in \mathscr{F}$). Given $\mathbb{P}, \mathbb{Q} \in \mathscr{P}(X)$, the integral probability metric is
$$
d_{\mathscr{F}}(\mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathscr{F}}\{\mathop{\mathbb{E}}_{x\sim \mathbb{P}}f(x) - \mathop{\mathbb{E}}_{x\sim \mathbb{Q}}f(x) \}
\tag{6}
$$
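* As a rough illustration (mine, not from the paper), equation $6$ can be approximated from samples by taking the supremum over a small, hand-picked function class; real IPMs (e.g. Wasserstein-1, MMD) use much richer classes $\mathscr{F}$:
```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, size=10_000)   # samples from P
Q = rng.normal(0.5, 1.0, size=10_000)   # samples from Q

# A tiny function class, closed under negation (so the sup is non-negative).
F = [np.sin, np.cos, np.tanh,
     lambda x: -np.sin(x), lambda x: -np.cos(x), lambda x: -np.tanh(x)]

# Eq. (6) with expectations replaced by sample means, sup over the finite class.
d = max(f(P).mean() - f(Q).mean() for f in F)
print(d)
```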
### Architecture
* The **generator** $G$ takes noise $z \in Z$ as input and transforms it into a vector $\hat{x} = G(z)$, whose distribution is denoted $p_g$. The generated $\hat{x}$ has the same dimensions as a sample from the real data, $x \sim p_d$.
* The **discriminator** $D$ takes as input a vector with the same dimensions as a data sample, i.e. $x$ or $\hat{x}$. It outputs a scalar representing its confidence that the input is a real sample $x$ rather than a fake sample $\hat{x}$.
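* A minimal sketch of the two networks and the shapes involved (PyTorch; the layer sizes and data dimension are hypothetical, since the paper does not prescribe an architecture):
```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784   # hypothetical: e.g. flattened 28x28 images

G = nn.Sequential(               # generator: noise z -> fake sample x_hat
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(               # discriminator: sample -> confidence it is real
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(16, latent_dim)      # a batch of noise vectors
x_hat = G(z)                         # fake samples, same dimensions as real data
print(x_hat.shape, D(x_hat).shape)   # torch.Size([16, 784]) torch.Size([16, 1])
```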
### Loss Function
* The loss function is designed to pit $D$ and $G$ against each other.
* In a given iteration, $D$ tries to get better at classifying $x$ versus $\hat{x}$. Its parameters are denoted by $\theta$ and are trained to maximize the objective, which rewards assigning high probability of being real to $x$ and low probability to $\hat{x}$.
* In the same iteration, the parameters of $G$ (denoted by $\phi$) are optimized so that the discriminator cannot distinguish between $x$ and $\hat{x}$ (i.e. so that $G$ generates more realistic samples). Ideally, $\phi$ is trained to minimize the same loss that $\theta$ maximizes.
* Formally, the loss function is
$$
\min_{\phi} \max_{\theta}V(D_\theta, G_\phi) = \mathbb{E}_{x\sim p_d(x)}[\log D_\theta(x)] + \mathbb{E}_{z\sim p_z(z)}[\log (1 - D_\theta(G_\phi(z)))]
\tag{7}
$$
* This is (up to sign) the same as [BCE](https://hackmd.io/fWTf_RCBTOKf5IkV6A7Fag?both#Binary-Cross-Entropy): the first term considers only real samples (where $y = 1$, so only the $-y\log p$ term survives), while the second considers only fake samples (where $y = 0$, so only the $-(1-y)\log(1-p)$ term survives). A short numerical check appears after equation $8$ below.
* Early in the training, $D$ is very powerful and able to easily distinguish between real and fake (since generated samples are far from realistic). Thus, gradients w.r.t. $\phi$ are very small, so there is no strong signal for $G$ to improve.
* [Goodfellow et al. 2014](https://arxiv.org/abs/1406.2661) propose the following alternative objective for $G$, which has stronger gradients when $D(G(z))$ takes low values.
$$
\max_\phi \mathbb{E}_{z\sim p_z(z)}[\log D_\theta(G_\phi(z))]
\tag{8}
$$
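* A quick numerical check (PyTorch, with made-up discriminator outputs) that the value in equation $7$ is exactly the negated BCE with targets $y = 1$ for real and $y = 0$ for fake samples:
```python
import torch
import torch.nn.functional as F

D_real = torch.tensor([0.8, 0.9])   # made-up values of D(x) on real samples
D_fake = torch.tensor([0.3, 0.1])   # made-up values of D(G(z)) on fake samples

# Eq. (7): the value the discriminator maximizes.
V = torch.log(D_real).mean() + torch.log(1 - D_fake).mean()

# BCE with y = 1 on real samples plus BCE with y = 0 on fake samples.
bce = F.binary_cross_entropy(D_real, torch.ones(2)) \
    + F.binary_cross_entropy(D_fake, torch.zeros(2))

print(V.item(), -bce.item())        # identical: maximizing V == minimizing BCE
```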
### Optimizer
* The main goal is to ensure $p_g = p_d$, i.e. the generated distribution has learnt the data distribution.
* [Goodfellow et al. 2014](https://arxiv.org/abs/1406.2661) propose *Simultaneous Gradient Descent* as the optimizer: first fix $\phi$ and optimize over $\theta$ by maximizing equation $7$; then fix $\theta$ and maximize equation $8$ over $\phi$. A sketch of one such alternating update follows.
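* A sketch of one alternating update (PyTorch; the architecture, batch size, learning rates, and the choice of Adam are all placeholder assumptions of mine, not prescriptions of the paper):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, data_dim, batch = 64, 784, 32   # hypothetical sizes

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

def train_step(x):
    # Step 1: fix phi, ascend eq. (7) in theta (equivalently, descend BCE).
    x_hat = G(torch.randn(batch, latent_dim)).detach()   # no grads into G
    loss_D = F.binary_cross_entropy(D(x), torch.ones(batch, 1)) + \
             F.binary_cross_entropy(D(x_hat), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: fix theta, ascend eq. (8) in phi (the non-saturating G loss).
    loss_G = -torch.log(D(G(torch.randn(batch, latent_dim)))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

train_step(torch.randn(batch, data_dim))    # stand-in for a real data batch
```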
#### Existence and Convergence to Saddle Point
* TODO (will be done later when I have free time).
## Primary Challenges in GAN Setup
* The typical approach is to set up a generalized framework for the min-max objective given by equation $7$ and then analyze convergence by borrowing tools from that framework.
* [Nowozin et al. 2016](https://arxiv.org/abs/1606.00709) view the GAN objective as *Divergence Minimization* and prove convergence accordingly.
* Other papers model it as a *two-player zero-sum game* and borrow the concept of Nash equilibrium to derive the optimal behavior.
* There is also a line of work that views it as a *Regret Minimization* problem.
* To date, there is no satisfactory explanation of why the model fails, or of why it works when it does.
* There are 3 main challenges encountered while implementing the model:
    * Mode Collapse
        * The most significant and widely discussed problem.
        * The data distribution $p_d$ is highly complex and is spread over many modes (modes represent variation within the data). Ideally, at convergence, the generated distribution should capture an equivalent number of modes.
        * Under mode collapse, however, generated samples lack variation: the same image, or the same type of image, is produced over and over. The generated images may be sharp and realistic, yet show little diversity.
    * Non-convergence and Instability
        * In a GAN, $G$ tries to minimize the loss that $D$ maximizes. Hence, convergence is not evident from the training loss curves.
        * Even when the $G$ and $D$ losses converge to particular values, this does not imply $p_g = p_d$; the generator may simply be producing a few samples repeatedly (mode collapse).
        * Conversely, it is also observed that the losses do not converge to any value, yet the generator produces realistic samples (the losses are usually not smooth, but show damped oscillations). These oscillations indicate the instability of training.
        * The hyper-parameter sensitivity typical of deep learning models is exaggerated in GANs.
    * Evaluation of Generative Models
        * In a generative model like a GAN, measuring performance is not an obvious task (unlike in classification, regression, etc.).
        * Since the goal is to generate realistic images similar to the data without reproducing the training data itself, MSE or any pixelwise distance between generated and real images is not an appropriate measure.
        * [Borji, 2018](https://arxiv.org/abs/1802.03446) gives a good review of various GAN evaluation measures.
        * Ultimately, it is desirable to have a metric that evaluates both diversity and visual fidelity simultaneously.
* We can design a GAN setup without mode collapse and unstable training only if we find positive answers to the following:
    1. Does the model have enough capacity to capture the complex data distribution $p_d$?
    2. Given enough capacity, will minimizing the given objective function guarantee $p_g = p_d$ at optimality, with reasonable generalization bounds?
    3. Assuming there exists a global optimum corresponding to $p_g = p_d$, do we have an optimization algorithm that reaches it in finitely many iterations?
## Review of Approaches to Challenges
TODO (hopefully, I will someday be in a good position to write about the rest of paper which has a lot of intensive maths).