<style>
.reveal .slides {
font-size: 30px;
text-align: left;
}
</style>
# Paper Report
2020/02/27 林隼興
$\newcommand{\prob}{\mathcal{P}}
\newcommand{\expt}{\mathbb{E}}
\newcommand{\mb}[1]{\mathbf{#1}}$
---
## Lipschitz Generative Adversarial Nets
ICML 2019 Oral
[arxiv link](https://arxiv.org/abs/1902.05687)


----
### Previous Submission
- [Understanding the Effectiveness of Lipschitz-Continuity in Generative Adversarial Nets](https://openreview.net/forum?id=r1zOg309tX&noteId=SkxDk51iCX)
Rejected by ICLR 2019 with scores 6, 4, 5.

- [Towards Efficient and Unbiased Implementation of Lipschitz Continuity in GANs](https://arxiv.org/abs/1904.01184)
----
## Contributions
1. ~~WGAN~~ Lipschitz GAN!
2. Deep analysis of discriminator objectives, with empirical support:
**convex** instead of **linear** (used in WGAN).
3. Implementation of the Lipschitz constraint: MaxGP.
---
## Background
----
### GAN's objective
- Minimax game
$\displaystyle{
\begin{align}
&\min_\theta \max_\phi \mathcal{V}(D_\phi,
G_\theta) \\
= &\min_\theta \max_\phi\expt_{x \sim \prob_r} [\log D_\phi (x)] + \expt_{x \sim \prob_{G_\theta}}[\log(1 - D_\phi(x))].
\end{align}
}$
- Optimal discriminator and Jensen-Shannon divergence
$\displaystyle{
D^*(x) = \frac{p_r(x)}{p_r(x) + p_{G_\theta}(x)}, \\
\mathcal{V}(D^*, G_\theta) = 2JS(\prob_r || \prob_{G_\theta}) - \log4.
}$
----
### Problems of Jensen-Shannon Divergence
- Gradient vanishing
$\nabla\log(1 - D_\phi(x))$ is small when $D_\phi(x) \to 0$
- Constant ($\log 2$) when the supports of $P$ and $Q$ are disjoint
(Figure below taken from Prof. 李宏毅's slides)
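A quick numeric check (my own NumPy sketch, not from the paper): when $P$ and $Q$ have disjoint supports, JS stays at $\log 2$ no matter how far apart they are, so it gives the generator no signal to move closer.

```python
import numpy as np

def kl(a, b):
    """KL divergence for discrete distributions (restricted to where a > 0)."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def js(p, q):
    """Jensen-Shannon divergence JS(P || Q)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# P is fixed at bin 0; Q sits on a single disjoint bin, moved further and further away.
p = np.zeros(10); p[0] = 1.0
for shift in (1, 5, 9):
    q = np.zeros(10); q[shift] = 1.0
    print(shift, js(p, q), np.log(2))   # JS stays at log 2 regardless of the distance
```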

----
### Alternative Losses
- Non-saturating loss
$\displaystyle{
\min_\theta \expt_{x \sim \prob_{G_\theta}}[-\log D_\phi(x)]
}$
Equivalent to $\displaystyle{
\min_\theta KL(\prob_{G_\theta} || \prob_r) - 2JS(\prob_{G_\theta} || \prob_r).
}$ ([Arjovsky and Bottou, 2017](https://arxiv.org/abs/1701.04862))
- Linear / Least Square / Hinge ....
----
### Wasserstein Distance
$\displaystyle{
W_1(P, Q) = \underset{\pi \in \Pi(P, Q)}{\inf} \expt_{(x,y) \sim \pi} [d(x, y)].
}$
(Figure below taken from Prof. 李宏毅's slides)
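For contrast, a small sketch (mine, not the paper's) using `scipy.stats.wasserstein_distance`: in 1-D, $W_1$ between empirical samples keeps shrinking as the two distributions approach, even while their supports barely overlap.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=10_000)

# W1 keeps decreasing as the fake distribution moves toward the real one,
# even while the two supports barely overlap.
for shift in (10.0, 5.0, 1.0, 0.0):
    fake = rng.normal(loc=shift, scale=1.0, size=10_000)
    print(shift, wasserstein_distance(real, fake))
```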

----
### Dual Form & Lipschitz Constraint
By the Kantorovich-Rubinstein (KR) duality ([Villani, 2008](https://ljk.imag.fr/membres/Emmanuel.Maitre/lib/exe/fetch.php?media=b07.stflour.pdf))
$\displaystyle{
W_1(P, Q) = \underset{f \in \mathcal{F}}{\sup} \expt_{x \sim P}[f(x)] - \expt_{x \sim Q}[f(x)], \\
\mathcal{F} = \{f: \mathbb{R}^n \to \mathbb{R}, f(x) - f(y) \le d(x, y), \forall x, \forall y \}.
}$
----
### Wasserstein GAN
$\displaystyle{
\min_\theta \max_\phi ~\expt_{x \sim \prob_r}[f_\phi(x)] - \expt_{x \sim \prob_{G_\theta}}[f_\phi(x)] = \min_\theta W_1(\prob_r, \prob_{G_\theta}), \\
\text{s.t. }\Vert f_\phi \Vert_{\text{Lip}} \le 1.
}$
This problem, however, is intractable to solve directly. Thus, it is crucial to show that
$\displaystyle{ \\
\nabla_\theta W_1(\prob_r, \prob_{G_\theta})
= - \nabla_\theta \expt_{x \sim \prob_{G_\theta}}[f^*(x)]
= - \expt_{z \sim \prob_z}[\nabla_\theta f^*(g_\theta(z))].
}$
([Milgrom and Segal, 2002](https://web.stanford.edu/~isegal/envelope.pdf))
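This identity is what licenses the usual WGAN generator update: hold the (approximately optimal) critic fixed and backpropagate $-\expt_z[f(g_\theta(z))]$ into $\theta$. A minimal PyTorch sketch (my own illustration; `generator`, `critic`, and `opt_g` are hypothetical):

```python
import torch

def generator_step(generator, critic, opt_g, z_dim=128, batch=64, device="cpu"):
    """One WGAN generator update: minimize -E_z[f(g_theta(z))] with the critic f held fixed."""
    z = torch.randn(batch, z_dim, device=device)
    fake = generator(z)
    loss_g = -critic(fake).mean()   # gradient flows through f into theta only
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item()
```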
---
## Analysis
----
### Discriminator Objectives
$\displaystyle{
\min_{f \in \mathcal{F}} \expt_{z \sim \prob_z} [\phi(f(g(z)))] + \expt_{x \sim \prob_r} [\varphi(f(x))].
}$
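Concretely, familiar GAN losses are particular choices of $(\phi, \varphi)$. A small summary sketch (mine, not the paper's; signs follow the minimization form above, with $f$ the critic's raw output):

```python
import numpy as np

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return np.maximum(t, 0.0) + np.log1p(np.exp(-np.abs(t)))

# (phi, varphi): phi is applied to f on fake samples, varphi to f on real samples.
objectives = {
    "linear (WGAN)":      (lambda t: t,                    lambda t: -t),
    "logistic (vanilla)": (softplus,                       lambda t: softplus(-t)),
    "hinge":              (lambda t: np.maximum(0, 1 + t), lambda t: np.maximum(0, 1 - t)),
}
```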

----
### Generator Objective
$\displaystyle{
J_G := \expt_{x \sim \prob_G}[\psi(f(x))], \\
\nabla_x \psi(f(x)) = \nabla_f \psi(f) \cdot \nabla_x f(x).
}$
- Gradient vanishing (numeric sketch below)
$\nabla_f \psi(f) \to 0$.
- Gradient uninformativeness
The direction of $\nabla_x f(x)$ is meaningless without an appropriate constraint.
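A small numeric illustration of the vanishing factor (my own sketch): $\nabla_f \psi(f)$ saturates for both log-based generator losses but is constant for the linear loss; the direction $\nabla_x f(x)$ is a separate problem that no choice of $\psi$ can repair.

```python
import numpy as np

sigmoid = lambda f: 1.0 / (1.0 + np.exp(-f))

# d psi / d f for three generator losses, with f the critic's logit on a fake sample.
grad_saturating     = lambda f: -sigmoid(f)        # psi(f) =  log(1 - sigmoid(f))
grad_non_saturating = lambda f: sigmoid(f) - 1.0   # psi(f) = -log(sigmoid(f))
grad_linear         = lambda f: -np.ones_like(f)   # psi(f) = -f

f = np.array([-10.0, 0.0, 10.0])   # D rejects the fake / is unsure / is fooled
print(grad_saturating(f))       # ~[-5e-05, -0.5, -1.0]: vanishes when D rejects fakes
print(grad_non_saturating(f))   # ~[-1.0, -0.5, -5e-05]: vanishes when D is fooled
print(grad_linear(f))           # [-1. -1. -1.]: never vanishes
```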
----
### Cause of Gradient Uninformativeness
$f^*$ without any constraint is **undefined** outside $\mathcal{S}_r$ and $\mathcal{S}_{G_\theta}$:

---
## Proposed Methods
----
### Formulation
- Discriminator
$\displaystyle{
\min_{f \in \mathcal{F}} \expt_{z \sim \mathcal{P}_z}[\phi(f(g(z)))] + \expt_{x \sim \mathcal{P}_r}[\phi(-f(x))] + \lambda k(f)^2. \\
\phi'(x) > 0, \phi''(x) \ge 0, \\
k(f) \equiv \Vert f\Vert_{\text{Lip}}.
}$
- Generator
$\displaystyle{
\min_\theta -\expt_{z \sim \mathcal{P}_z} [f(g_\theta(z))].
}$
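A minimal PyTorch sketch of this formulation (my own reading; $\phi = \operatorname{softplus}$ is one admissible convex choice, and `lipschitz_estimate` is a hypothetical helper standing in for $k(f)$; the paper's MaxGP estimator is sketched on the MaxGP slide):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(f, real, fake, lipschitz_estimate, lam=10.0):
    """min_f E[phi(f(fake))] + E[phi(-f(real))] + lam * k(f)^2, here with phi = softplus."""
    base = F.softplus(f(fake)).mean() + F.softplus(-f(real)).mean()
    k_hat = lipschitz_estimate(f, real, fake)   # hypothetical estimate of ||f||_Lip
    return base + lam * k_hat ** 2

def generator_loss(f, fake):
    """min_theta -E_z[f(g_theta(z))]."""
    return -f(fake).mean()
```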
----
### Theorems
- Existence & Uniqueness (due to 0-Lipschitz)

- Bounding relationship

- Optimal transport $\to$ gradient informativeness

----
### Gradient Informativeness
$\nabla_{x_t}f(x_t) = k \cdot \dfrac{y - x}{\Vert y-x \Vert},~\forall x_t = tx + (1 - t)y,~0 \le t \le 1.$

----
### Max Gradient Penalty
Given that (by the mean value theorem)
$\displaystyle{
k(f) = \max_x \Vert \nabla_x f(x) \Vert,
}$
**Proposed MaxGP**:
$\displaystyle{
J_{\text{maxgp}} = \lambda \max_{x_t \sim B(\mathcal{S}_r,\,\mathcal{S}_{G_\theta})} \big[ \Vert \nabla_{x_t} f(x_t) \Vert^2 \big], \\
B(\mathcal{S}_r, \mathcal{S}_{G_\theta}) = \{tx + (1 - t)y~|~0 \le t \le 1,~\forall x \in \mathcal{S}_r,~\forall y \in \mathcal{S}_{G_\theta}\}.
}$
**WGAN-GP** ([Gulrajani et al. 2017](https://arxiv.org/pdf/1704.00028.pdf)):
$\displaystyle{
J_{\text{gp}} = \lambda~~\expt_{x_t \sim B(S_r,S_{G_\theta})}\big[ (\Vert\nabla_{x_t} f(x_t) \Vert - 1)^2 \big].
}$
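A hedged PyTorch sketch of the two penalties (my own reading of the formulas above, not the authors' code), approximating the max / expectation over $B$ with a batch of random interpolates:

```python
import torch

def interp_grad_norms(f, real, fake):
    """Gradient norms of f at random interpolates between real and fake batches."""
    t = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_t = (t * real + (1 - t) * fake).requires_grad_(True)
    grads = torch.autograd.grad(f(x_t).sum(), x_t, create_graph=True)[0]
    return grads.reshape(grads.size(0), -1).norm(dim=1)

def maxgp_penalty(f, real, fake, lam=10.0):
    # MaxGP: penalize only the largest gradient norm in the batch.
    return lam * interp_grad_norms(f, real, fake).max() ** 2

def wgangp_penalty(f, real, fake, lam=10.0):
    # WGAN-GP: pull every sampled gradient norm toward 1.
    return lam * ((interp_grad_norms(f, real, fake) - 1.0) ** 2).mean()
```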
---
## Results
----
### Comparison of Discriminator Objectives

----
### Comparison of Discriminator Objectives
----
### Gradient Uninformativeness of Unrestricted GAN

----
### Gradient Informativeness of Lipschitz GAN

----
### Importance of Uniqueness

---
## Conclusions
1. ~~Wasserstein distance~~ Lipschitz constraint is the key.
2. Linear loss might not be the best choice.
3. MaxGP yields results comparable to GP.
---
## Comments on ICLR 2019 Rebuttal
----
### Wasserstein Distance and its dual forms
$\displaystyle{
W_1(\prob_r, \prob_{G_\theta}) = \underset{f \in \mathcal{F}}{\sup} \expt_{x \sim \prob_r}[f(x)] - \expt_{x \sim \prob_{G_\theta}}[f(x)],
}$
- KR duality (used in many previous works, e.g. Gulrajani et al., 2017; Petzka et al., 2017; Miyato et al., 2018)
constraint on the **entire space** (independent of $\theta$)
$\displaystyle{
\mathcal{F}_{KR} = \{f: \mathbb{R}^n \to \mathbb{R}, f(x) - f(y) \le d(x, y), \forall x, \forall y \},
}$
- Compact dual form (mentioned in this work)
constraint on the **supports** (dependent on $\theta$)
$\displaystyle{
\mathcal{F}_{LL} = \{f: \mathbb{R}^n \to \mathbb{R}, f(x) - f(y) \le d(x, y), \forall x \in \mathcal{S}_r, \forall y \in \mathcal{S}_{G_\theta}\}.
}$
----
### Failure of compact dual form
The gradient is again uninformative.

----
### Envelope Theorem
$\displaystyle{
V(\theta) = \max_\phi f(\theta, \phi) = f(\theta, \phi^*(\theta)) = f^*(\theta), \\
\to \frac{dV}{d\theta} = \frac{\partial f(\theta, \phi)}{\partial \theta} \bigg|_{\phi = \phi^*} + \frac{\partial f(\theta, \phi)}{\partial \phi} \bigg|_{\phi = \phi^*} \frac{d\phi^*(\theta)}{d\theta} \\
(\text{Since } \phi^* \text{ maximizes } f \Rightarrow \frac{\partial f(\theta, \phi)}{\partial \phi} \bigg|_{\phi = \phi^*} = 0) \\
= \frac{\partial f(\theta, \phi)}{\partial \theta} \bigg|_{\phi = \phi^*} + 0 = \frac{df^*}{d\theta}.
}$
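A toy numeric check (my own sketch): take $f(\theta, \phi) = \theta\phi - \tfrac{1}{2}\phi^2$, so $\phi^*(\theta) = \theta$ and $V(\theta) = \tfrac{1}{2}\theta^2$; the total derivative of $V$ matches the partial derivative of $f$ at the fixed $\phi^*$.

```python
f = lambda theta, phi: theta * phi - 0.5 * phi ** 2
phi_star = lambda theta: theta                  # argmax_phi f(theta, phi)
V = lambda theta: f(theta, phi_star(theta))     # V(theta) = theta^2 / 2

theta, eps = 1.3, 1e-5
dV = (V(theta + eps) - V(theta - eps)) / (2 * eps)            # total derivative of V
df = (f(theta + eps, phi_star(theta)) -
      f(theta - eps, phi_star(theta))) / (2 * eps)            # partial of f at fixed phi*
print(dV, df)   # both ~1.3 = theta
```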
----
### Failure of compact dual form
- The envelope theorem extends to **differentiable** constraints $g(\theta, \phi) = 0$ by replacing $f$ with the Lagrangian:
$f(\theta, \phi) \to f(\theta, \phi) + \lambda(\phi)\, g(\theta, \phi).$
- Biased estimator
Since $g$ is not differentiable under the compact dual form, even though
$\displaystyle{
W_1(\prob_r, \prob_\theta) = \underset{f \in \mathcal{F}_{LL}}{\sup} \expt_{x \sim \prob_r}[f(x)] - \expt_{x \sim \prob_{G_\theta}}[f(x)], \\
\nabla_\theta W_1 \ne - \nabla_\theta \expt_{x \sim \prob_{G_\theta}}[f^*(x)] \text{ for the } f^* \text{ that maximizes the above.}
}$
----
### TL;DR

----
### TL;DR

----


----
(Adapted from the manga *Bleach*, vol. 18, p. 118)

----
(Adapted from the manga *Bleach*, vol. 71, p. 163)

----
### Lessons learned
1. **VERIFY THE ESTIMATOR** before claiming you're minimizing some intractable statistical quantity...
2. **THINK TWICE** before saying something is wrong...
3. Only put your **MAIN CONTRIBUTIONS** in the paper, remove other bullshit.
4. The power of **WORDING** ...
---
## Q & A
---
## 03/04 Errata
----
### Convexity and Uniqueness (question from 劉浩然)
- Uniqueness of minimizer $f^*$
- Appendix A.1 Lemma 1

- Appendix A.1 Lemma 6

- Uniqueness of Nash equilibrium
It seems this does not require $f^*$ to be unique.
----
### Optimal Transport of WGAN (question from Prof. 李宏毅)

----
### Optimal Transport of WGAN (question from Prof. 李宏毅)
[Improved Training of Wasserstein GANs](https://arxiv.org/abs/1704.00028)

[25] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. (could not follow this one)
----
### Overcoming Mode Collapse (question from Prof. 李宏毅)

---
## Q & A
---
{"metaMigratedAt":"2023-06-15T04:40:40.616Z","metaMigratedFrom":"YAML","title":"Paper Report 02/27 林隼興","breaks":true,"slideOptions":"{\"theme\":\"serif\",\"transition\":\"slide\"}","contributors":"[{\"id\":\"fa032ae2-ac9e-4754-9d6f-947b0ea9cb6e\",\"add\":13592,\"del\":3728}]"}