<style>
.reveal .slides { font-size: 30px; text-align: left; }
</style>

# Paper Report

2020/02/27 林隼興

$\newcommand{\prob}{\mathcal{P}} \newcommand{\expt}{\mathbb{E}} \newcommand{\mb}[1]{\mathbf{#1}}$

---

## Lipschitz Generative Adversarial Nets

ICML 2019 Oral [arxiv link](https://arxiv.org/abs/1902.05687)

![](https://i.imgur.com/bre3gwD.png)
![](https://i.imgur.com/s8cLBnB.png)

----

### Previous Submission

- [Understanding the Effectiveness of Lipschitz-Continuity in Generative Adversarial Nets](https://openreview.net/forum?id=r1zOg309tX&noteId=SkxDk51iCX)
  Rejected by ICLR 2019 with scores 6, 4, 5.
  ![](https://i.imgur.com/HbEjXa3.png)
- [Towards Efficient and Unbiased Implementation of Lipschitz Continuity in GANs](https://arxiv.org/abs/1904.01184)

----

## Contributions

1. ~~WGAN~~ Lipschitz GAN!
2. A deep analysis of discriminator objectives, with empirical support: **convex** objectives instead of the **linear** one used in WGAN.
3. An implementation of the Lipschitz constraint: MaxGP.

---

## Background

----

### GAN's objective

- Minimax game
$\displaystyle{ \begin{align} &\min_\theta \max_\phi \mathcal{V}(D_\phi, G_\theta) \\ = &\min_\theta \max_\phi\expt_{x \sim \prob_r} [\log D_\phi (x)] + \expt_{x \sim \prob_{G_\theta}}[\log(1 - D_\phi(x))]. \end{align} }$
- Optimal discriminator and Jensen-Shannon divergence
$\displaystyle{ D^*(x) = \frac{p_r(x)}{p_r(x) + p_{G_\theta}(x)}, \\ \mathcal{V}(D^*, G_\theta) = 2JS(\prob_r || \prob_{G_\theta}) - \log 4. }$

----

### Problems of the Jensen-Shannon Divergence

- Gradient vanishing: $\nabla\log(1 - D_\phi(x))$ is small when $D_\phi(x) \to 0$
- Constant when $P, Q$ are disjoint

(Figure below taken from Prof. 李宏毅's slides.)
![](https://i.imgur.com/3tuci2M.png)

----

### Alternative Losses

- Non-saturating loss
$\displaystyle{ \min_\theta \expt_{x \sim \prob_{G_\theta}}[-\log D_\phi(x)] }$
Equivalent to
$\displaystyle{ \min_\theta KL(\prob_{G_\theta} || \prob_r) - 2JS(\prob_{G_\theta} || \prob_r). }$
([Arjovsky and Bottou, 2017](https://arxiv.org/abs/1701.04862))
- Linear / Least Squares / Hinge ...

----

### Wasserstein Distance

$\displaystyle{ W_1(P, Q) = \underset{\pi \in \Pi(P, Q)}{\inf} \expt_{(x,y) \sim \pi} [d(x, y)]. }$

(Figure below taken from Prof. 李宏毅's slides.)
![](https://i.imgur.com/UjBmnCO.png)

----

### Dual Form & Lipschitz Constraint

By the Kantorovich-Rubinstein (KR) duality ([Villani, 2008](https://ljk.imag.fr/membres/Emmanuel.Maitre/lib/exe/fetch.php?media=b07.stflour.pdf)),

$\displaystyle{ W_1(P, Q) = \underset{f \in \mathcal{F}}{\sup} \expt_{x \sim P}[f(x)] - \expt_{x \sim Q}[f(x)], \\ \mathcal{F} = \{f: \mathbb{R}^n \to \mathbb{R}, f(x) - f(y) \le d(x, y), \forall x, \forall y \}. }$

----

### Wasserstein GAN

$\displaystyle{ \min_\theta \max_\phi ~\expt_{x \sim \prob_r}[f_\phi(x)] - \expt_{x \sim \prob_{G_\theta}}[f_\phi(x)] = \min_\theta W_1(\prob_r, \prob_{G_\theta}), \\ \text{s.t. }\Vert f_\phi \Vert_{\text{Lip}} \le 1, }$

which, however, is intractable. Thus it is $\textit{very crucial}$ to show

$\displaystyle{ \nabla_\theta W_1(\prob_r, \prob_{G_\theta}) = - \nabla_\theta \expt_{x \sim \prob_{G_\theta}}[f^*(x)] = - \expt_{z \sim \prob_z}[\nabla_\theta f^*(g_\theta(z))]. }$

([Milgrom and Segal, 2002](https://web.stanford.edu/~isegal/envelope.pdf))
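----

### Aside: $W_1$ on 1-D Samples

To make the primal definition concrete, a minimal sketch (mine, not from the paper): for two equal-size 1-D samples, the optimal coupling pairs points in sorted order, so $W_1$ reduces to the mean absolute difference of the sorted samples.

```python
# Exact W1 between two equal-weight empirical 1-D distributions:
# sort both samples, then average the pointwise distances.
import numpy as np

def w1_1d(x, y):
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, size=10_000)  # samples from P
q = rng.normal(2.0, 1.0, size=10_000)  # samples from Q
print(w1_1d(p, q))  # ~2.0: W1 between N(0,1) and N(2,1) is the mean shift
```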
---

## Analysis

----

### Discriminator Objectives

$\displaystyle{ \min_{f \in \mathcal{F}} \expt_{z \sim \prob_z} [\phi(f(g(z)))] + \expt_{x \sim \prob_r} [\varphi(f(x))]. }$

![](https://i.imgur.com/PDfsvvM.png)

----

### Generator Objective

$\displaystyle{ J_G := \expt_{x \sim \prob_G}[\psi(f(x))], \\ \nabla_x \psi(f(x)) = \psi'(f(x)) \cdot \nabla_x f(x). }$

- Gradient vanishing: $\psi'(f) \to 0$.
- Gradient uninformativeness: the direction of $\nabla_x f(x)$ is meaningless without an appropriate constraint.

----

### Cause of Gradient Uninformativeness

$f^*$ without constraint is **undefined** outside $S_r, S_{G_\theta}$:

![](https://i.imgur.com/Df9VMgJ.png)

---

## Proposed Methods

----

### Formulation

- Discriminator
$\displaystyle{ \min_{f \in \mathcal{F}} \expt_{z \sim \mathcal{P}_z}[\phi(f(g(z)))] + \expt_{x \sim \mathcal{P}_r}[\phi(-f(x))] + \lambda k(f)^2, \\ \phi'(x) > 0, \phi''(x) \ge 0, \\ k(f) \equiv \Vert f\Vert_{\text{Lip}}. }$
- Generator
$\displaystyle{ \min_\theta -\expt_{z \sim \mathcal{P}_z} [f(g_\theta(z))]. }$

----

### Theorems

- Existence & uniqueness (up to a 0-Lipschitz, i.e., constant, term)
  ![](https://i.imgur.com/8HMq8cP.png)
- Bounding relationship
  ![](https://i.imgur.com/D5STUpp.png)
- Optimal transport $\to$ gradient informativeness
  ![](https://i.imgur.com/mG4xGdW.png)

----

### Gradient Informativeness

$\nabla_{x_t}f(x_t) = k \cdot \dfrac{y - x}{\Vert y-x \Vert},~\forall x_t = tx + (1 - t)y,~0 \le t \le 1.$

![](https://i.imgur.com/XuNyTuh.png)

----

### Max Gradient Penalty

Given that

$\displaystyle{ k(f) = \max_x \Vert \nabla_x f(x) \Vert \quad \text{(by the mean value theorem)}, }$

**Proposed MaxGP**:

$\displaystyle{ J_{\text{maxgp}} = \lambda \max_{x_t \in B(S_r,S_{G_\theta})} \big[ \Vert \nabla_{x_t} f(x_t) \Vert^2 \big], \\ B(S_r,S_{G_\theta}) = \{tx + (1 - t)y~|~0 \le t \le 1, \forall x \in \mathcal{S}_r, \forall y \in \mathcal{S}_{G_\theta}\}. }$

**WGAN-GP** ([Gulrajani et al., 2017](https://arxiv.org/pdf/1704.00028.pdf)):

$\displaystyle{ J_{\text{gp}} = \lambda~\expt_{x_t \sim B(S_r,S_{G_\theta})}\big[ (\Vert\nabla_{x_t} f(x_t) \Vert - 1)^2 \big]. }$
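----

### Sketch: MaxGP vs. GP in Code

A minimal PyTorch sketch of how the two penalties differ (my own rendering, not the authors' code; `critic`, `real`, `fake`, and `lam` are assumed names). MaxGP penalizes only the largest gradient norm over the sampled interpolates, a batch-level surrogate for the max over $B$; WGAN-GP pushes every norm toward 1.

```python
import torch

def grad_norms(critic, real, fake):
    # Sample points on segments between real and fake batches (the set B).
    t = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_t = (t * real + (1 - t) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_t).sum(), x_t, create_graph=True)[0]
    return grads.reshape(grads.size(0), -1).norm(dim=1)

def maxgp(critic, real, fake, lam=10.0):
    # MaxGP: penalize only the max gradient norm in the batch.
    return lam * grad_norms(critic, real, fake).max() ** 2

def wgan_gp(critic, real, fake, lam=10.0):
    # WGAN-GP (Gulrajani et al., 2017): drive every norm toward 1.
    return lam * ((grad_norms(critic, real, fake) - 1) ** 2).mean()
```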
---

## Results

----

### Comparison of Discriminator Objectives

![](https://i.imgur.com/GS3CVSq.png)

----

### Comparison of Discriminator Objectives

![](https://i.imgur.com/5KAb8rv.png) | ![](https://i.imgur.com/NVwpiXc.png)

----

### Gradient Uninformativeness of Unrestricted GAN

![](https://i.imgur.com/EHJLbxl.png)

----

### Gradient Informativeness of Lipschitz GAN

![](https://i.imgur.com/gQ8vyEe.png)

----

### Importance of Uniqueness

![](https://i.imgur.com/Mt1IVXU.png)

---

## Conclusions

1. The ~~Wasserstein distance~~ Lipschitz constraint is the key.
2. The linear loss might not be the best choice.
3. MaxGP yields results comparable to GP.

---

## Comments on ICLR 2019 Rebuttal

----

### Wasserstein Distance and its dual forms

$\displaystyle{ W_1(\prob_r, \prob_{G_\theta}) = \underset{f \in \mathcal{F}}{\sup} \expt_{x \sim \prob_r}[f(x)] - \expt_{x \sim \prob_{G_\theta}}[f(x)]. }$

- KR duality (used in many previous works, e.g., Gulrajani et al., 2017; Petzka et al., 2017; Miyato et al., 2018): the constraint is over the **entire space** (independent of $\theta$).
$\displaystyle{ \mathcal{F}_{KR} = \{f: \mathbb{R}^n \to \mathbb{R}, f(x) - f(y) \le d(x, y), \forall x, \forall y \}, }$
- Compact dual form (discussed in this work): the constraint is over the **supports** (dependent on $\theta$).
$\displaystyle{ \mathcal{F}_{LL} = \{f: \mathbb{R}^n \to \mathbb{R}, f(x) - f(y) \le d(x, y), \forall x \in \mathcal{S}_r, \forall y \in \mathcal{S}_{G_\theta}\}. }$

----

### Failure of the compact dual form

The gradient is uninformative again.

![](https://i.imgur.com/NPa43Mz.png)

----

### Envelope Theorem

$\displaystyle{ V(\theta) = \max_\phi f(\theta, \phi) = f(\theta, \phi^*(\theta)) = f^*(\theta) \\ \Rightarrow \frac{dV}{d\theta} = \frac{\partial f(\theta, \phi)}{\partial \theta} \bigg|_{\phi = \phi^*} + \frac{\partial f(\theta, \phi)}{\partial \phi} \bigg|_{\phi = \phi^*} \frac{d\phi^*(\theta)}{d\theta} \\ \left(\text{since } \phi^* \text{ maximizes } f \Rightarrow \frac{\partial f(\theta, \phi)}{\partial \phi} \bigg|_{\phi = \phi^*} = 0\right) \\ = \frac{\partial f(\theta, \phi)}{\partial \theta} \bigg|_{\phi = \phi^*}. }$

----

### Failure of the compact dual form

- The envelope theorem handles **differentiable** constraints $g(\theta, \phi) = 0$ by applying it to
$f(\theta, \phi) + \lambda(\phi) g(\theta, \phi).$
- Biased estimator: since $g$ is not differentiable in the compact dual form, even though
$\displaystyle{ W_1(\prob_r, \prob_{G_\theta}) = \underset{f \in \mathcal{F}_{LL}}{\sup} \expt_{x \sim \prob_r}[f(x)] - \expt_{x \sim \prob_{G_\theta}}[f(x)], }$
we have
$\displaystyle{ \nabla_\theta W_1 \ne - \nabla_\theta \expt_{x \sim \prob_{G_\theta}}[f^*(x)] \text{ for the } f^* \text{ that maximizes the above.} }$

----

### TL;DR

![](https://i.imgur.com/fvfg1Gl.png)

----

### TL;DR

![](https://i.imgur.com/b9pGAFn.png)

----

![](https://i.imgur.com/HbEjXa3.png)
![](https://i.imgur.com/UqUdTvv.png)

----

(Adapted from *Bleach 死神*, vol. 18, p. 118.)
![](https://i.imgur.com/NLyfXlL.png)

----

(Adapted from *Bleach 死神*, vol. 71, p. 163.)
![](https://i.imgur.com/sfKUPif.png)

----

### Lessons learned

1. **VERIFY THE ESTIMATOR** before claiming you're minimizing some intractable statistical quantity...
2. **THINK TWICE** before saying something is wrong...
3. Only put your **MAIN CONTRIBUTIONS** in the paper; remove the other bullshit.
4. The power of **WORDING**...

---

## Q & A

---

## 03/04 Errata

----

### Convexity and Uniqueness (question from 劉浩然)

- Uniqueness of the minimizer $f^*$
  - Appendix A.1, Lemma 1
    ![](https://i.imgur.com/lyKhZVy.png)
  - Appendix A.1, Lemma 6
    ![](https://i.imgur.com/QsSKZft.png)
- Uniqueness of the Nash equilibrium
  It seems this does not require $f^*$ to be unique.

----

### Optimal Transport of WGAN (question from Prof. 李宏毅)

![](https://i.imgur.com/xqigWNk.png)

----

### Optimal Transport of WGAN (question from Prof. 李宏毅)

[Improved Training of Wasserstein GANs](https://arxiv.org/abs/1704.00028)
![](https://i.imgur.com/ScclEaG.png)

[25] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. (Couldn't make sense of it.)

----

### Overcoming Mode Collapse (question from Prof. 李宏毅)

![](https://i.imgur.com/WW5BjCI.png)

---

## Q & A

---
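## Backup: Envelope Theorem Sanity Check

A small numerical sketch (mine, not from the paper): check $\frac{dV}{d\theta} = \frac{\partial f}{\partial \theta}\big|_{\phi = \phi^*}$ on $f(\theta, \phi) = -\phi^2 + 2\theta\phi$, where $\phi^*(\theta) = \theta$ and $V(\theta) = \theta^2$.

```python
import torch

def f(theta, phi):
    return -phi ** 2 + 2 * theta * phi

theta = torch.tensor(1.5, requires_grad=True)
phi_star = theta.detach().clone()  # inner maximizer phi* = theta, held fixed

# Partial derivative of f w.r.t. theta with phi frozen at phi*.
(df_dtheta,) = torch.autograd.grad(f(theta, phi_star), theta)

print(df_dtheta.item())  # 2 * phi* = 3.0
print(2 * theta.item())  # dV/dtheta = 2 * theta = 3.0 -- they agree
```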
{"metaMigratedAt":"2023-06-15T04:40:40.616Z","metaMigratedFrom":"YAML","title":"Paper Report 02/27 林隼興","breaks":true,"slideOptions":"{\"theme\":\"serif\",\"transition\":\"slide\"}","contributors":"[{\"id\":\"fa032ae2-ac9e-4754-9d6f-947b0ea9cb6e\",\"add\":13592,\"del\":3728}]"}