# Notes on "[Generative Adversarial Nets](https://arxiv.org/pdf/1406.2661.pdf)"

###### tags: `GANs`

## Introduction

They simultaneously train two models:
1) A generative model G that captures the data distribution and generates new samples from it.
2) A discriminative model D that estimates the probability that a sample came from the true data rather than from G.

Training is carried out so that both models improve at their respective tasks until, ideally, the generated data is indistinguishable from the original training data.

## Summary of pre-reqs

### Information theory

**Information theory** formalizes the notion of information using the following intuitions:
1) Events that are likely to occur should carry little information.
2) Events that are unlikely to occur should carry more information.
3) Independent events should have additive information.

Satisfying all three, the self-information of an event x is defined as
$I(x) = -\log P(x)$

We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy
$H(x) = E_{x \sim P}[I(x)]$

If we have two separate probability distributions over the same random variable x, we can measure how different these two distributions are using the **Kullback-Leibler (KL) divergence**
$D_{KL}(P||Q) = E_{x \sim P}[\log P(x) - \log Q(x)]$

**Cross entropy** is closely related to the KL divergence: it drops the $E_{x \sim P}[\log P(x)]$ term, which does not depend on $Q$.
$H(P,Q) = -E_{x \sim P}[\log Q(x)]$
Note that minimizing either objective with respect to $Q$ yields the same solution, since the two differ only by a term independent of $Q$.

### Maximum Likelihood Estimation

Maximum likelihood estimation fits a model distribution to the training data distribution, in the hope that the model will then perform well on the true data distribution. Maximizing the likelihood is equivalent to minimizing the KL divergence between the empirical data distribution and the model distribution, which connects it to the quantities above. The model can be any parametric distribution over the same space.

## Adversarial Nets

$P_g$: distribution of the generator over data $x$.
$P_z$: prior on the input noise variables; the generator uses samples from it as seeds for data generation.
$G(z; \theta_g)$: generator network parameterized by $\theta_g$.
$D(x; \theta_d)$: discriminator network parameterized by $\theta_d$; it outputs the probability that $x$ came from the data rather than from $P_g$.

G is trained to maximize $D(G(z))$, i.e. to minimize $\log(1-D(G(z)))$.
D is trained like an ordinary binary classifier, maximizing the likelihood of assigning the correct label to both training samples and samples from G, i.e. maximizing $\log D(x) + \log(1-D(G(z)))$.
All criteria are minimized/maximized as expectations over the data and noise distributions.

The authors frame this as a two-player minimax game with value function
$$\min_G \max_D V(D,G) = E_{x \sim P_{data}(x)}[\log D(x)] + E_{z \sim P_z(z)}[\log(1-D(G(z)))]$$
![](https://i.imgur.com/98zgtN3.png)

## Training GANs

![](https://i.imgur.com/qipKb20.png)

The authors provide theoretical arguments that, given enough representational capacity in both models, the algorithm converges to $P_g = P_{data}$. At this optimum the discriminator can do no better than chance, i.e. $D(x) = 1/2$ everywhere.
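
Below is a minimal sketch of the alternating training procedure described above, assuming PyTorch. The toy 1-D Gaussian "data" distribution, network sizes, and hyperparameters are illustrative choices, not taken from the paper.

```python
# Minimal GAN training sketch (assumes PyTorch).
# D is updated to maximize log D(x) + log(1 - D(G(z))),
# G is updated via the non-saturating variant that maximizes log D(G(z))
# (the paper notes this has the same fixed point as minimizing log(1 - D(G(z)))).
import torch
import torch.nn as nn

torch.manual_seed(0)

noise_dim, data_dim, hidden = 8, 1, 32

# G(z; theta_g): maps noise z to a fake sample.
G = nn.Sequential(nn.Linear(noise_dim, hidden), nn.ReLU(), nn.Linear(hidden, data_dim))
# D(x; theta_d): probability that x came from the data rather than from G.
D = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()  # binary cross-entropy implements the log-likelihood terms

def sample_data(n):
    # Toy "true" data distribution: N(3, 0.5^2), standing in for P_data.
    return 3.0 + 0.5 * torch.randn(n, data_dim)

batch = 64
for step in range(2000):
    # --- Discriminator step: push D(x_real) toward 1 and D(G(z)) toward 0 ---
    x_real = sample_data(batch)
    z = torch.randn(batch, noise_dim)      # z ~ P_z
    x_fake = G(z).detach()                 # stop gradients from flowing into G
    d_loss = bce(D(x_real), torch.ones(batch, 1)) + bce(D(x_fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- Generator step: push D(G(z)) toward 1 ---
    z = torch.randn(batch, noise_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    if step % 500 == 0:
        print(f"step {step}: d_loss={d_loss.item():.3f}, g_loss={g_loss.item():.3f}")
```

As the loop progresses, the generator's samples should drift toward the toy data distribution and the discriminator's outputs toward 1/2, matching the claimed optimum $P_g = P_{data}$.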