# Wasserstein GAN
## NIPS2014 @U of Montreal
## PMLR2017 @Facebook
---
# Outline
- GAN (Generative Adversarial Nets)
- What is GAN?
- Problems of GAN.
- WGAN (Wasserstein GAN)
- What is WGAN?
- What issues does WGAN solve?
- Experiments.
- Conclusion
---
# GAN
- Task
- Problems
----
# Goal
- Obtain a generative model.
----
### Generative Adversarial Nets
- Train two models simultaneously.
- The two models are like two players in a minimax game.
----
### Generative Model
- Input:
- Noise z.
- Output:
- A fake data sample.
- Learns from the training data and generates fake samples to confuse the discriminator D.
----
### Discriminative Model
- Input:
- Data.
- Output:
- The probability that the data comes from the data distribution.
- Distinguishes whether the input comes from the training data distribution or from the generator's distribution.
----
### Some Notations
- Discriminative Model: $D$
- $\theta_d$ := the parameters of the model D.
- $D(x, \theta_d)$ := the probability that $x$ comes from the training data set.
- Generative Model: $G$
- $\theta_g$ := the parameters of the model G.
- $G(z, \theta_g)$ := a sample data generated from G.
- $z$ := prior input noise sampled from $p_z$ (Gaussian or Uniform distribution).
----
### Loss Function
- Discriminative Model
- Maximize $\frac{1}{m}\sum_{i=1}^{m}\left [ log(D(x^i)) + log(1 - D(G(z^i))) \right ]$
- Generative Model
- Minimize $\frac{1}{m}\sum_{i=1}^{m}\left [ log(1 - D(G(z^i))) \right ]$
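As a sanity check, both objectives can be evaluated on a minibatch with plain NumPy. The discriminator outputs below are made-up numbers purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4  # minibatch size

# Made-up discriminator outputs for one minibatch:
# D(x^i) for real samples, D(G(z^i)) for generated ones.
D_real = rng.uniform(0.6, 0.9, size=m)
D_fake = rng.uniform(0.1, 0.4, size=m)

# D maximizes (1/m) sum_i [log D(x^i) + log(1 - D(G(z^i)))]
objective_D = np.mean(np.log(D_real) + np.log(1 - D_fake))

# G minimizes (1/m) sum_i log(1 - D(G(z^i)))
loss_G = np.mean(np.log(1 - D_fake))
```

Both quantities are averages of log-probabilities, so they are always negative; D wants its objective as close to zero as possible, while G pushes its loss toward $-\infty$.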
----
### Algorithm Overview

----
### Problem
- In the early stages of training, when G is poor, the gradient for G is small.
- G being poor means $D(G(z))$ is close to $0$.
- G's loss function is to minimize $\frac{1}{m}\sum_{i=1}^{m}\left [ log(1 - D(G(z^i))) \right ]$
- $F(x) = log(1 - x)$
- 
- Near $x = 0$, the gradient of $F$ is too small to train G effectively.
----
### Solution in the GAN paper
- Alternative loss for G: minimize $\frac{1}{m}\sum_{i=1}^{m}\left [ -log(D(G(z^i))) \right ]$
- $F(x) = -log(x)$
- 
- The gradient is larger than the original loss's gradient when the input $x$ is close to zero.
- When we train the model D, we can't train it too well.
- The term "too well" is ambiguous.
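A quick numerical comparison makes the point concrete. The value of $x = D(G(z))$ below is a made-up scalar standing in for an untrained generator:

```python
# Made-up value of x = D(G(z)) for an untrained generator.
x = 1e-4

# Original G loss F(x) = log(1 - x): gradient is -1/(1 - x), about -1 near x = 0.
grad_original = -1.0 / (1.0 - x)

# Alternative G loss F(x) = -log(x): gradient is -1/x, very large near x = 0.
grad_alternative = -1.0 / x
```

When D confidently rejects the fake sample, the alternative loss delivers a gradient several orders of magnitude larger, which is exactly what the struggling generator needs.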
----
# Minimax

----
### Global Optimality of $p_g = p_{data}$
- Review the loss function for model D
- Maximize $V(D) = p_{data}(x)log D(x) + p_g(x)log(1 - D(x))$
- When D is optimal, $V(D)$ attains its maximum.
- $\frac{dV(D)}{dD} = \frac{1}{D(x)} \times p_{data}(x) - \frac{1}{1 - D(x)} \times p_g(x) = 0$
- $p_{data}(x) \times (1 - D(x)) = p_g(x) \times D(x)$
- $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$
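The closed form for $D^*$ can be checked numerically by a grid search over $D(x)$ at a single point $x$; the density values below are made up for illustration:

```python
import numpy as np

# Made-up density values at a single point x.
p_data, p_g = 0.8, 0.2
D_star = p_data / (p_data + p_g)  # closed-form optimum: 0.8

# D* should maximize p_data*log(D) + p_g*log(1 - D); check on a fine grid.
D = np.linspace(0.001, 0.999, 9981)
V = p_data * np.log(D) + p_g * np.log(1 - D)
best = D[np.argmax(V)]
```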
----
### Global Optimality of $p_g = p_{data}$
#### Training Criterion
$$
\begin{align}
V(G, D) &= \int_x p_{data}(x)log(D(x))dx\ + \int_zp_z(z)log(1-D(g(z)))dz \\
&= \int_x p_{data}(x)log(D(x))\ + p_g(x) log (1 - D(x))dx
\end{align}
$$
----
### Global Optimality of $p_g = p_{data}$
#### Minimax Game
$$
\begin{align}
C(G) & = \underset{D}{\max}\ V(G,D)\\
& = \mathbb{E}_{x\sim p_{data}}[log D_G^*(x)] + \mathbb{E}_{z \sim p_z}[log(1 - D_G^*(G(z)))] \\
& = \mathbb{E}_{x\sim p_{data}}[log D_G^*(x)] + \mathbb{E}_{x \sim p_g}[log(1 - D_G^*(x))] \\
& = \mathbb{E}_{x\sim p_{data}} \left[log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right] + \mathbb{E}_{x\sim p_g} \left[log \frac{p_g(x)}{p_{data}(x)+p_g(x)}\right]
\end{align}
$$
----
### Global Optimality of $p_g = p_{data}$
$$
\begin{align}
C(G) = &-log(4) \\
&+ KL\left( p_{data} \parallel \frac{p_{data}+p_g}{2} \right) \\
&+ KL\left( p_g \parallel \frac{p_{data}+p_g}{2} \right)
\end{align}
$$
$$
C(G) = -log(4) + 2 \cdot JSD(p_{data} \parallel p_g)
$$
[*proof*](/FdZwlUwhQFmLVcQVVKiABQ)
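The identity $C(G) = -log(4) + 2 \cdot JSD(p_{data} \parallel p_g)$ can be verified numerically on small discrete distributions (the two distributions below are arbitrary examples):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions p and q."""
    return float(np.sum(p * np.log(p / q)))

# Arbitrary example distributions over three outcomes.
p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.5, 0.3])
mid = (p_data + p_g) / 2

jsd = 0.5 * kl(p_data, mid) + 0.5 * kl(p_g, mid)
C_G = -np.log(4) + 2 * jsd

# Direct evaluation of C(G) as the sum of the two expectations:
direct = float(np.sum(p_data * np.log(p_data / (p_data + p_g)))
             + np.sum(p_g * np.log(p_g / (p_data + p_g))))
```

The two computations agree, and since $JSD \ge 0$ the global minimum $C(G) = -log(4)$ is reached exactly when $p_g = p_{data}$.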
---
# WGAN
- Introduction
- Problems of GAN.
- Pros of EM distance
- Experiments.
----
### Introduction
----
### Problems of GAN
- Diminished Gradient.
----
### EM Distance
$$
W\left(\mathbb{P}_{r}, \mathbb{P}_{g}\right)=\inf _{\gamma \in \Pi\left(\mathbb{P}_{r}, \mathbb{P}_{g}\right)} \mathbb{E}_{(x, y) \sim \gamma}[\|x-y\|]
$$
The EM distance is the “cost” of the optimal transport plan from $P_r$ to $P_g$.
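In one dimension the infimum has a simple form: the optimal transport plan pairs samples in sorted order. A NumPy sketch on two made-up Gaussian sample sets:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)  # samples from P_r
y = rng.normal(3.0, 1.0, 1000)  # samples from P_g

# In 1-D the optimal plan matches order statistics,
# so W1 is the mean distance between sorted samples.
w1 = np.mean(np.abs(np.sort(x) - np.sort(y)))
```

Here the two distributions differ only by a shift of 3, so the estimate lands close to 3: the EM distance directly reflects how far mass must be moved.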
----
### Example
Let $z \sim U[0, 1]$ and
$$
g_{\theta}(z)=(\theta, z)
$$
Then $\mathbb{P}_0$ is uniform on the segment $\{0\} \times [0, 1]$ and $\mathbb{P}_\theta$ is uniform on $\{\theta\} \times [0, 1]$. Here $W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|$, while $JS(\mathbb{P}_0, \mathbb{P}_\theta) = log\,2$ for every $\theta \neq 0$, so only the EM distance gives a gradient that pulls $\theta$ toward $0$.

----
### Various Distances
- The EM distance varies smoothly, and a sequence of distributions can converge under it, even when the two distributions do not overlap; in that regime JS stays constant and KL is infinite.

----
### Theorem 1
Of the JS, KL, and Wasserstein distances, only the Wasserstein distance is guaranteed to be **continuous** in $\theta$ and **differentiable** almost everywhere.
----
### Theorem 2
Every sequence of distributions that converges under the KL, reverse-KL, TV, or JS divergence also converges under the Wasserstein distance.
----
### Wasserstein GAN
Wasserstein Distance
$W(\mathbb{P}_r, \mathbb{P}_\theta) = \underset{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_\theta)}{\inf} \mathbb{E}_{(x,y)\sim\gamma}[ \parallel x - y \parallel ]$
From the *Kantorovich-Rubinstein duality*,
it can be transformed to
$W(\mathbb{P}_r, \mathbb{P}_\theta) = \underset{\parallel f \parallel_L \le 1}{\sup} \mathbb{E}_{x\sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x\sim \mathbb{P}_\theta}[f(x)]$
----
### Wasserstein GAN
#### K-Lipschitz Functions
$$
|f(x_1) - f(x_2)| \le K|x_1 - x_2|
$$
The rate of change of $f$ is bounded by $K$, so its gradient norm is restricted.
#### Clipping
Let $c$ be the clipping parameter.
The weights are constrained to lie within $[-c, c]$.
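Weight clipping is a simple projection applied after every gradient step. A minimal NumPy sketch (the weight values are made up; $c = 0.01$ is the default used in the WGAN paper):

```python
import numpy as np

c = 0.01  # clipping parameter

# Made-up critic weights after a gradient step.
w = np.array([-0.5, -0.005, 0.0, 0.007, 0.3])

# Project every weight back into [-c, c]; values already
# inside the interval are left untouched.
w = np.clip(w, -c, c)
```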
----
### Wasserstein GAN
To approximate $W(\mathbb{P}_r, \mathbb{P}_g)$,
we could consider solving
$\underset{w \in \mathcal{W}}{\max}\ \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))]$
For a function $f_w$, we have
$\nabla_\theta W(\mathbb{P}_r, \mathbb{P}_\theta) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f_w(g_\theta(z))]$
to update $\theta$.
----
### Training Process
- Approximate $W(\mathbb{P}_r, \mathbb{P}_\theta)$ by training $f_w$ to convergence. ($f_w$ is the discriminator.)
- After finding the optimal $f_w$, compute the gradient with respect to $\theta$, $-\mathbb{E}_{z\sim p(z)}[\nabla_\theta f_w(g_\theta(z))]$, by sampling $z \sim p(z)$.
- Update $\theta$ and repeat the process.
----
### Algorithm

----
### Gradients
- Discriminator
$g_{w} \leftarrow \nabla_{w}\left[\frac{1}{m} \sum_{i=1}^{m} f_{w}\left(x^{(i)}\right)-\frac{1}{m} \sum_{i=1}^{m} f_{w}\left(g_{\theta}\left(z^{(i)}\right)\right)\right]$
- Generator
$g_{\theta} \leftarrow-\nabla_{\theta} \frac{1}{m} \sum_{i=1}^{m} f_{w}\left(g_{\theta}\left(z^{(i)}\right)\right)$
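The two gradient estimates above can be combined into a toy training loop. The sketch below uses an assumed 1-D setup (linear critic $f_w(x) = wx$, shift generator $g_\theta(z) = z + \theta$, made-up hyperparameters) rather than real neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)
c, lr, n_critic = 0.1, 0.05, 5  # clip range, learning rate, critic steps
theta, w = 0.0, 0.0             # generator and critic parameters

for step in range(1000):
    # Critic: gradient ascent on E[f_w(x)] - E[f_w(g_theta(z))].
    # With f_w(x) = w*x the gradient w.r.t. w is mean(x) - mean(fake).
    for _ in range(n_critic):
        x = rng.normal(3.0, 1.0, 64)             # real data ~ N(3, 1)
        fake = rng.normal(0.0, 1.0, 64) + theta  # g_theta(z), z ~ N(0, 1)
        w += lr * (x.mean() - fake.mean())
        w = float(np.clip(w, -c, c))             # enforce the Lipschitz constraint
    # Generator: descend -E[f_w(g_theta(z))]; its theta-gradient here is -w.
    theta -= lr * (-w)

# theta should drift toward the real mean, 3.0
```

Even with this crude linear critic, the clipped weight acts as a bounded "direction" signal and the generator's shift parameter moves toward the real distribution's mean.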
----
### GAN Discriminator vs. WGAN Critic

----
## Experiment
----
### EM Distance Plot

----
### JS Divergence Plot

---
# Conclusion
----
## Modification of W-GAN
- Remove the last sigmoid layer from the discriminator.
- Drop the log from the loss function.
- Clip the weights of the discriminator to a small range of values.
- Avoid momentum-based optimizers; use RMSProp or SGD instead.
----
## Wasserstein Distance Is Crucial
- Gradients do not vanish.
- Meaningful loss metric.
- Stable gradients.