# Wasserstein GAN
## NIPS2014 @U of Montreal
## PMLR2017 @Facebook
---
# Outline
- GAN (Generative Adversarial Nets)
- What is GAN?
- Problems of GAN.
- WGAN (Wasserstein GAN)
- What is WGAN?
- What issues does WGAN solve?
- Experiments.
- Conclusion
---
# GAN
- Task
- Problems
----
# Goal
- Obtain a generative model.
----
### Generative Adversarial Nets
- Train two models simultaneously.
- The two models are like two players in a minimax game.
----
### Generative Model
- Input:
- Noise z.
- Output:
- A fake data sample.
- Learns from the training data and generates fake samples to confuse the discriminator D.
----
### Discriminative Model
- Input:
- Data.
- Output:
- The probability that the data comes from the data distribution.
- Distinguishes whether the input comes from the training data distribution or from the generator's distribution.
----
### Some Notations
- Discriminative Model: $D$
- $\theta_d$ := the parameters of the model D.
- $D(x, \theta_d)$ := the probability that $x$ comes from the training data set.
- Generative Model: $G$
- $\theta_g$ := the parameters of the model G.
- $G(z, \theta_g)$ := a sample data generated from G.
- $z$ := prior input noise sampled from $p_z$ (Gaussian or Uniform distribution).
----
### Loss Function
- Discriminative Model
- Maximize $\frac{1}{m}\sum_{i=1}^{m}\left [ log(D(x^i)) + log(1 - D(G(z^i))) \right ]$
- Generative Model
- Minimize $\frac{1}{m}\sum_{i=1}^{m}\left [ log(1 - D(G(z^i))) \right ]$
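As a sanity check, both objectives can be evaluated on a minibatch with plain NumPy. The discriminator outputs below are made-up numbers purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4  # minibatch size

# Made-up discriminator outputs for one minibatch:
# D(x^i) for real samples, D(G(z^i)) for generated ones.
D_real = rng.uniform(0.6, 0.9, size=m)
D_fake = rng.uniform(0.1, 0.4, size=m)

# D maximizes (1/m) sum_i [log D(x^i) + log(1 - D(G(z^i)))]
objective_D = np.mean(np.log(D_real) + np.log(1 - D_fake))

# G minimizes (1/m) sum_i log(1 - D(G(z^i)))
loss_G = np.mean(np.log(1 - D_fake))
```

Both quantities are averages of log-probabilities, so they are always negative; D wants its objective as close to zero as possible, while G pushes its loss toward $-\infty$.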
----
### Algorithm Overview

----
### Problem
- In the early stages of training, when G is poor, the gradient for G is small.
- G being poor means $D(G(z))$ is close to $0$.
- G's loss function is to minimize $\frac{1}{m}\sum_{i=1}^{m}\left [ log(1 - D(G(z^i))) \right ]$
- $F(x) = log(1 - x)$
- 
- Near $x = 0$, the gradient of $F$ is too small to train G effectively.
----
### Solution in the GAN paper
- Alternative loss for G: minimize $\frac{1}{m}\sum_{i=1}^{m}\left [ -log(D(G(z^i))) \right ]$
- $F(x) = -log(x)$
- 
- The gradient is larger than the original loss's gradient when the input $x$ is close to zero.
- When we train the model D, we can't train it too well.
- The term "too well" is ambiguous.
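A quick numerical comparison makes the point concrete. The value of $x = D(G(z))$ below is a made-up scalar standing in for an untrained generator:

```python
# Made-up value of x = D(G(z)) for an untrained generator.
x = 1e-4

# Original G loss F(x) = log(1 - x): gradient is -1/(1 - x), about -1 near x = 0.
grad_original = -1.0 / (1.0 - x)

# Alternative G loss F(x) = -log(x): gradient is -1/x, very large near x = 0.
grad_alternative = -1.0 / x
```

When D confidently rejects the fake sample, the alternative loss delivers a gradient several orders of magnitude larger, which is exactly what the struggling generator needs.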
----
# Minimax

----
### Global Optimality of $p_g = p_{data}$
- Review the loss function for model D
- Maximize $V(D) = p_{data}(x)log D(x) + p_g(x)log(1 - D(x))$
- When D is optimal, $V(D)$ attains its maximum.
- $\frac{dV(D)}{dD} = \frac{1}{D(x)} \times p_{data}(x) - \frac{1}{1 - D(x)} \times p_g(x) = 0$
- $p_{data}(x) \times (1 - D(x)) = p_g(x) \times D(x)$
- $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$
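The closed form for $D^*$ can be checked numerically by a grid search over $D(x)$ at a single point $x$; the density values below are made up for illustration:

```python
import numpy as np

# Made-up density values at a single point x.
p_data, p_g = 0.8, 0.2
D_star = p_data / (p_data + p_g)  # closed-form optimum: 0.8

# D* should maximize p_data*log(D) + p_g*log(1 - D); check on a fine grid.
D = np.linspace(0.001, 0.999, 9981)
V = p_data * np.log(D) + p_g * np.log(1 - D)
best = D[np.argmax(V)]
```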
----
### Global Optimality of $p_g = p_{data}$
#### Training Criterion
$$
\begin{align}
V(G, D) &= \int_x p_{data}(x)log(D(x))dx\ + \int_zp_z(z)log(1-D(g(z)))dz \\
&= \int_x p_{data}(x)log(D(x))\ + p_g(x) log (1 - D(x))dx
\end{align}
$$
----
### Global Optimality of $p_g = p_{data}$
#### Minimax Game
$$
\begin{align}
C(G) & = \underset{D}{\max}\ V(G,D)\\
& = \mathbb{E}_{x\sim p_{data}}[log D_G^*(x)] + \mathbb{E}_{z \sim p_z}[log(1 - D_G^*(G(z)))] \\
& = \mathbb{E}_{x\sim p_{data}}[log D_G^*(x)] + \mathbb{E}_{x \sim p_g}[log(1 - D_G^*(x))] \\
& = \mathbb{E}_{x\sim p_{data}} \left[log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right] + \mathbb{E}_{x\sim p_g} \left[log \frac{p_g(x)}{p_{data}(x)+p_g(x)}\right]
\end{align}
$$
----
### Global Optimality of $p_g = p_{data}$
$$
\begin{align}
C(G) = &-log(4) \\
&+ KL\left( p_{data} \parallel \frac{p_{data}+p_g}{2} \right) \\
&+ KL\left( p_g \parallel \frac{p_{data}+p_g}{2} \right)
\end{align}
$$
$$
C(G) = -log(4) + 2 \cdot JSD(p_{data} \parallel p_g)
$$
[*proof*](/FdZwlUwhQFmLVcQVVKiABQ)
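The identity $C(G) = -log(4) + 2 \cdot JSD(p_{data} \parallel p_g)$ can be verified numerically on small discrete distributions (the two distributions below are arbitrary examples):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions p and q."""
    return float(np.sum(p * np.log(p / q)))

# Arbitrary example distributions over three outcomes.
p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.5, 0.3])
mid = (p_data + p_g) / 2

jsd = 0.5 * kl(p_data, mid) + 0.5 * kl(p_g, mid)
C_G = -np.log(4) + 2 * jsd

# Direct evaluation of C(G) as the sum of the two expectations:
direct = float(np.sum(p_data * np.log(p_data / (p_data + p_g)))
             + np.sum(p_g * np.log(p_g / (p_data + p_g))))
```

The two computations agree, and since $JSD \ge 0$ the global minimum $C(G) = -log(4)$ is reached exactly when $p_g = p_{data}$.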
---
# WGAN
- Introduction
- Problems of GAN.
- Pros of EM distance
- Experiments.
----
### Introduction
----
### Problems of GAN
- Diminished Gradient.
----
### EM Distance
$$
W\left(\mathbb{P}_{r}, \mathbb{P}_{g}\right)=\inf _{\gamma \in \Pi\left(\mathbb{P}_{r}, \mathbb{P}_{g}\right)} \mathbb{E}_{(x, y) \sim \gamma}[\|x-y\|]
$$
The EM distance is the “cost” of the optimal transport plan from $P_r$ to $P_g$.
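In one dimension the infimum has a simple form: the optimal transport plan pairs samples in sorted order. A NumPy sketch on two made-up Gaussian sample sets:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)  # samples from P_r
y = rng.normal(3.0, 1.0, 1000)  # samples from P_g

# In 1-D the optimal plan matches order statistics,
# so W1 is the mean distance between sorted samples.
w1 = np.mean(np.abs(np.sort(x) - np.sort(y)))
```

Here the two distributions differ only by a shift of 3, so the estimate lands close to 3: the EM distance directly reflects how far mass must be moved.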
----
### Example
Let $z \sim U[0, 1]$ and
$$
g_{\theta}(z)=(\theta, z)
$$
Then $\mathbb{P}_0$ is uniform on the segment $\{0\} \times [0, 1]$ and $\mathbb{P}_\theta$ is uniform on $\{\theta\} \times [0, 1]$. Here $W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|$, while $JS(\mathbb{P}_0, \mathbb{P}_\theta) = log\,2$ for every $\theta \neq 0$, so only the EM distance gives a gradient that pulls $\theta$ toward $0$.

----
### Various Distances
- The EM distance varies smoothly, and a sequence of distributions can converge under it, even when the two distributions do not overlap; in that regime JS stays constant and KL is infinite.

----
### Theorem 1
Of the JS, KL, and Wasserstein distances, only the Wasserstein distance is guaranteed to be **continuous** in $\theta$ and **differentiable** almost everywhere.
----
### Theorem 2
Every sequence of distributions that converges under the KL, reverse-KL, TV, or JS divergence also converges under the Wasserstein distance.
----
### Wasserstein GAN
Wasserstein Distance
$W(\mathbb{P}_r, \mathbb{P}_\theta) = \underset{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_\theta)}{\inf} \mathbb{E}_{(x,y)\sim\gamma}[ \parallel x - y \parallel ]$
From the *Kantorovich-Rubinstein duality*,
it can be transformed to
$W(\mathbb{P}_r, \mathbb{P}_\theta) = \underset{\parallel f \parallel_L \le 1}{\sup} \mathbb{E}_{x\sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x\sim \mathbb{P}_\theta}[f(x)]$
----
### Wasserstein GAN
#### K-Lipschitz Functions
$$
|f(x_1) - f(x_2)| \le K|x_1 - x_2|
$$
The rate of change of $f$ is bounded by $K$, so its gradient norm is restricted.
#### Clipping
Let $c$ be the clipping parameter.
The weights are constrained to lie within $[-c, c]$.
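Weight clipping is a simple projection applied after every gradient step. A minimal NumPy sketch (the weight values are made up; $c = 0.01$ is the default used in the WGAN paper):

```python
import numpy as np

c = 0.01  # clipping parameter

# Made-up critic weights after a gradient step.
w = np.array([-0.5, -0.005, 0.0, 0.007, 0.3])

# Project every weight back into [-c, c]; values already
# inside the interval are left untouched.
w = np.clip(w, -c, c)
```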
----
### Wasserstein GAN
To approximate $W(\mathbb{P}_r, \mathbb{P}_g)$,
we could consider solving
$\underset{w \in \mathcal{W}}{\max}\ \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))]$
For a function $f_w$, we have
$\nabla_\theta W(\mathbb{P}_r, \mathbb{P}_\theta) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f_w(g_\theta(z))]$
to update $\theta$.
----
### Training Process
- Approximate $W(\mathbb{P}_r, \mathbb{P}_\theta)$ by training $f_w$ to convergence. ($f_w$ is the discriminator.)
- After finding the optimal $f_w$, compute the gradient with respect to $\theta$, $-\mathbb{E}_{z\sim p(z)}[\nabla_\theta f_w(g_\theta(z))]$, by sampling $z \sim p(z)$.
- Update $\theta$ and repeat the process.
----
### Algorithm

----
### Gradients
- Discriminator
$g_{w} \leftarrow \nabla_{w}\left[\frac{1}{m} \sum_{i=1}^{m} f_{w}\left(x^{(i)}\right)-\frac{1}{m} \sum_{i=1}^{m} f_{w}\left(g_{\theta}\left(z^{(i)}\right)\right)\right]$
- Generator
$g_{\theta} \leftarrow-\nabla_{\theta} \frac{1}{m} \sum_{i=1}^{m} f_{w}\left(g_{\theta}\left(z^{(i)}\right)\right)$
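The two gradient estimates above can be combined into a toy training loop. The sketch below uses an assumed 1-D setup (linear critic $f_w(x) = wx$, shift generator $g_\theta(z) = z + \theta$, made-up hyperparameters) rather than real neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)
c, lr, n_critic = 0.1, 0.05, 5  # clip range, learning rate, critic steps
theta, w = 0.0, 0.0             # generator and critic parameters

for step in range(1000):
    # Critic: gradient ascent on E[f_w(x)] - E[f_w(g_theta(z))].
    # With f_w(x) = w*x the gradient w.r.t. w is mean(x) - mean(fake).
    for _ in range(n_critic):
        x = rng.normal(3.0, 1.0, 64)             # real data ~ N(3, 1)
        fake = rng.normal(0.0, 1.0, 64) + theta  # g_theta(z), z ~ N(0, 1)
        w += lr * (x.mean() - fake.mean())
        w = float(np.clip(w, -c, c))             # enforce the Lipschitz constraint
    # Generator: descend -E[f_w(g_theta(z))]; its theta-gradient here is -w.
    theta -= lr * (-w)

# theta should drift toward the real mean, 3.0
```

Even with this crude linear critic, the clipped weight acts as a bounded "direction" signal and the generator's shift parameter moves toward the real distribution's mean.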
----
### GAN Discriminator vs. WGAN Critic

----
## Experiment
----
### EM Distance Plot

----
### JS Divergence Plot

---
# Conclusion
----
## Modification of W-GAN
- Remove the last sigmoid layer from the discriminator.
- Drop the log from the loss function.
- Clip the weights of the discriminator to a small range of values.
- Avoid momentum-based optimizers; use RMSProp or SGD instead.
----
## Wasserstein Distance Is Crucial
- Gradients do not vanish.
- Meaningful loss metric.
- Stable gradients.