# Improved Training of Wasserstein GANs_Paper(翻譯)
###### tags: `Improved WGAN` `wgan` `gan` `論文翻譯` `deeplearning` `對抗式生成網路`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/1704.00028.pdf)
:::
## Abstract
:::info
Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only poor samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous generators.
We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.
:::
:::success
生成對抗式網路(GANs)是非常強大的生成模型,但受訓練不穩定所苦。最近提出的Wasserstein GAN (WGAN)在GANs的穩定訓練上取得進展,但有時仍然只能生成品質不佳的樣本,或無法收斂。我們發現這些問題通常來自WGAN中使用weight clipping來強制critic滿足Lipschitz約束,這可能導致非預期的行為。我們提出weight clipping的替代方案:懲罰critic對其輸入之梯度的範數。我們提出的方法表現優於標準WGAN,而且能在各種GAN架構上幾乎不需調整超參數即可穩定訓練,包含101層的ResNets與具連續generator的語言模型。
我們也在CIFAR-10與LSUN bedrooms資料集上有高品質的影像生成。
:::
## 1 Introduction
:::info
Generative Adversarial Networks (GANs) [9] are a powerful class of generative models that cast generative modeling as a game between two networks: a generator network produces synthetic data given some noise source and a discriminator network discriminates between the generator’s output and true data. GANs can produce very visually appealing samples, but are often hard to train, and much of the recent work on the subject [23, 19, 2, 21] has been devoted to finding ways of stabilizing training. Despite this, consistently stable training of GANs remains an open problem.
In particular, [1] provides an analysis of the convergence properties of the value function being optimized by GANs. Their proposed alternative, named Wasserstein GAN (WGAN) [2], leverages the Wasserstein distance to produce a value function which has better theoretical properties than the original. WGAN requires that the discriminator (called the critic in that work) must lie within the space of 1-Lipschitz functions, which the authors enforce through weight clipping.
:::
:::success
生成對抗式網路(GANs)[9]是一類強大的生成模型,它將生成建模視為兩個網路之間的賽局:generator在給定噪聲來源之後產生合成資料,discriminator則區分generator的輸出與真實資料。GANs可以產生非常具視覺吸引力的樣本,但通常難以訓練,近期許多關於此主題的研究[23, 19, 2, 21]都致力於尋找穩定訓練的方法。儘管如此,GANs持續穩定的訓練仍然是一個開放問題。
特別是,[1]提供了一個由GANs最佳化的值函數收斂屬性分析。他們提出的替代方案,名稱為Wasserstein GAN (WGAN)[2],利用Wasserstein distance來產生值函數,它有比原始方案更好的理論性質。WGAN要求discriminator(WGAN中稱為critic)必須位於1-Lipschitz functions空間中,作者利用weight clipping強制達成這個約束式。
:::
:::info
Our contributions are as follows:
1. On toy datasets, we demonstrate how critic weight clipping can lead to undesired behavior.
2. We propose gradient penalty (WGAN-GP), which does not suffer from the same problems.
3. We demonstrate stable training of varied GAN architectures, performance improvements over weight clipping, high-quality image generation, and a character-level GAN language model without any discrete sampling.
[SourceCode_Git](https://github.com/igul222/improved_wgan_training/blob/master/gan_toy.py)
:::
:::success
我們的貢獻如下:
1. 在簡單資料集上,我們展示critic的weight clipping如何導致非預期行為。
2. 我們提出gradient penalty (WGAN-GP),它並不會引起相同的問題。
3. 我們展示了各種GAN架構的穩定訓練、相較於weight clipping的效能提升、高品質的影像生成,以及一個不需任何離散採樣的character-level GAN語言模型。
:::
## 2 Background
### 2.1 Generative adversarial networks
:::info
The GAN training strategy is to define a game between two competing networks. The generator network maps a source of noise to the input space. The discriminator network receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator.
Formally, the game between the generator G and the discriminator D is the minimax objective:
$$\operatorname*{min}_{G} \operatorname*{max}_{D} \operatorname*{\mathbb{E}}_{x\sim\mathbb{P}_r} \left[log(D(x))\right] + \operatorname*{\mathbb{E}}_{\widetilde{x}\sim\mathbb{P}_g} \left[log(1-D(\widetilde{x}))\right] \qquad (1)$$
where $\mathbb{P}_r$ is the data distribution and $\mathbb{P}_g$ is the model distribution implicitly defined by $\widetilde{x}=G(z),z\sim p(z)$ (the input $z$ to the generator is sampled from some simple noise distribution $p$, such as the uniform distribution or a spherical Gaussian distribution).
If the discriminator is trained to optimality before each generator parameter update, then minimizing the value function amounts to minimizing the Jensen-Shannon divergence between $\mathbb{P}_r$ and $\mathbb{P}_g$ [9], but doing so often leads to vanishing gradients as the discriminator saturates. In practice, [9] advocates that the generator be instead trained to maximize $\mathbb{E}_{\widetilde{x}\sim\mathbb{P}_g}\left[\log D(\widetilde{x})\right]$, which goes some way to circumvent this difficulty. However, even this modified loss function can misbehave in the presence of a good discriminator [1].
:::
:::success
GAN的訓練策略是在兩個競爭網路之間定義一個賽局。Generator將噪聲來源映射到輸入空間。Discriminator接收生成樣本或真實資料樣本,並且必須區分兩者。Generator的訓練目標是欺騙Discriminator。
形式上,generator-G與discriminator-D之間的遊戲就是最小極大目標:
$$\operatorname*{min}_{G} \operatorname*{max}_{D} \operatorname*{\mathbb{E}}_{x\sim\mathbb{P}_r} \left[log(D(x))\right] + \operatorname*{\mathbb{E}}_{\widetilde{x}\sim\mathbb{P}_g} \left[log(1-D(\widetilde{x}))\right] \qquad (1)$$
其中$\mathbb{P}_r$是真實資料分佈,而$\mathbb{P}_g$是由$\widetilde{x}=G(z),z\sim p(z)$隱式定義的模型分佈(generator的input-$z$是從某個簡單的[雜訊分布](http://terms.naer.edu.tw/detail/4669081/)-$p$中採樣而來,例如[均勻分佈](http://terms.naer.edu.tw/detail/2126976/)或球面高斯分佈(spherical Gaussian distribution))。
如果discriminator在每次generator參數更新之前都訓練至最佳,那麼最小化值函數等價於最小化$\mathbb{P}_r$與$\mathbb{P}_g$之間的Jensen-Shannon divergence[9],但這麼做通常導致discriminator飽和而梯度消失。在實踐上,[9]主張改為訓練generator最大化$\mathbb{E}_{\widetilde{x}\sim\mathbb{P}_g}\left[\log D(\widetilde{x})\right]$,這在某種程度上可以迴避這個困難。
然而,即使是這個修正後的損失函數,在面對一個好的discriminator時仍可能表現失常[1]。
:::
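下面以numpy做一個極簡的數值示意(非論文或其程式碼的內容,純屬說明用途):當discriminator接近最佳、對假樣本輸出$D(\widetilde{x})\to 0$時,原始損失$\log(1-D(\widetilde{x}))$對$D$的導數幾乎是常數(飽和),而[9]建議的$-\log D(\widetilde{x})$的導數則被放大,generator因此仍能得到訓練訊號。

```python
import numpy as np

def saturating_grad(d):
    # d/dD [log(1 - D)] = -1 / (1 - D):當 D(x~) -> 0 時梯度趨近常數 -1(飽和)
    return -1.0 / (1.0 - d)

def nonsaturating_grad(d):
    # d/dD [-log D] = -1 / D:當 D(x~) -> 0 時梯度被放大,generator 仍可學習
    return -1.0 / d

d = 1e-4  # 假設一個很好的 discriminator 對假樣本的輸出
print(abs(saturating_grad(d)))     # 約 1,訊號微弱
print(abs(nonsaturating_grad(d)))  # 約 10000,訊號強
```

這正是L56所述「修正後的損失函數」背後的直覺。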
### 2.2 Wasserstein GANs
:::info
[2] argues that the divergences which GANs typically minimize are potentially not continuous with respect to the generator’s parameters, leading to training difficulty. They propose instead using the Earth-Mover (also called Wasserstein-1) distance $W(q, p)$, which is informally defined as the minimum cost of transporting mass in order to transform the distribution $q$ into the distribution $p$ (where the cost is mass times transport distance). Under mild assumptions, $W(q, p)$ is continuous everywhere and differentiable almost everywhere.
:::
:::success
[2]認為,GANs一般所最小化的divergences相對於generator的參數可能不連續,導致訓練困難。他們提出改用Earth-Mover distance(也稱為Wasserstein-1)$W(q,p)$,其非正式定義為:為了將分佈$q$轉換為分佈$p$所需的最小運輸質量成本(成本為質量乘以運輸距離)。在溫和的假設下,$W(q, p)$處處連續且幾乎處處可微。
:::
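一維的情況下,Earth-Mover距離有一個簡單的封閉形式:兩組等權重樣本排序後對應點距離的平均。以下是一個假設性的小例子(非論文內容),示意「搬運質量的最小成本」這個定義:

```python
import numpy as np

def w1_1d(xs, ys):
    # 一維時,Earth-Mover(Wasserstein-1)距離等於排序後對應樣本點距離的平均
    xs, ys = np.sort(xs), np.sort(ys)
    return np.mean(np.abs(xs - ys))

# 例:分佈 q 的每個點整體平移 2,最小搬運成本(質量 × 距離)即為 2
q = np.array([0.0, 1.0, 2.0])
p = q + 2.0
print(w1_1d(q, p))  # 2.0
```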
:::info
The WGAN value function is constructed using the Kantorovich-Rubinstein duality [25] to obtain
$$\operatorname*{min}_{G} \operatorname*{max}_{D\in\mathcal{D}} \operatorname*{\mathbb{E}}_{x\sim\mathbb{P}_r} \left[D(x)\right] - \operatorname*{\mathbb{E}}_{\widetilde{x}\sim\mathbb{P}_g} \left[D(\widetilde{x})\right] \qquad (2)$$
where $\mathcal{D}$ is the set of 1-Lipschitz functions and $\mathbb{P}_g$ is once again the model distribution implicitly defined by $\widetilde{x}=G(z),z \sim p(z)$. In that case, under an optimal discriminator (called a critic in the paper, since it’s not trained to classify), minimizing the value function with respect to the generator parameters minimizes $W(\mathbb{P}_r,\mathbb{P}_g)$.
The WGAN value function results in a critic function whose gradient with respect to its input is better behaved than its GAN counterpart, making optimization of the generator easier. Empirically, it was also observed that the WGAN value function appears to correlate with sample quality, which is not the case for GANs [2].
To enforce the Lipschitz constraint on the critic, [2] propose to clip the weights of the critic to lie within a compact space $[−c, c]$. The set of functions satisfying this constraint is a subset of the $k$-Lipschitz functions for some $k$ which depends on $c$ and the critic architecture. In the following sections, we demonstrate some of the issues with this approach and propose an alternative.
:::
:::success
使用Kantorovich-Rubinstein duality來建構WGAN的值函數以便得到
$$\operatorname*{min}_{G} \operatorname*{max}_{D\in\mathcal{D}} \operatorname*{\mathbb{E}}_{x\sim\mathbb{P}_r} \left[D(x)\right] - \operatorname*{\mathbb{E}}_{\widetilde{x}\sim\mathbb{P}_g} \left[D(\widetilde{x})\right] \qquad (2)$$
,其中$\mathcal{D}$是1-Lipschitz functions的集合,而$\mathbb{P}_g$同樣是由$\widetilde{x}=G(z),z \sim p(z)$隱式定義的模型分佈。在這個情況下,在最佳的discriminator(論文稱為critic,因為它並非訓練來做分類)之下,對generator參數最小化值函數等同於最小化$W(\mathbb{P}_r,\mathbb{P}_g)$。
WGAN的值函數所產生的critic function,其對輸入的梯度比GAN的對應版本表現更好,讓generator的最佳化更容易。經驗上也觀察到,WGAN的值函數似乎與樣本品質相關,而GANs則不然[2]。
為了強制critic滿足Lipschitz約束,[2]提出將critic的權重裁剪到一個[緊緻空間](http://terms.naer.edu.tw/detail/2350200/)$[−c, c]$內。滿足這個約束的函數集合,是某個$k$之$k$-Lipschitz functions的子集合,其中$k$取決於$c$與critic的架構。在接下來的章節中,我們將展示這個方法的一些問題並提出替代方案。
:::
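[2]的weight clipping在實作上只是critic每次參數更新後的一個逐元素裁剪步驟。以下以numpy做一個極簡草稿(其中$c=0.01$是[2]所用的預設值;實際實作請參考上方連結的原始碼):

```python
import numpy as np

def clip_weights(params, c=0.01):
    # [2] 的做法:每次 critic 更新後,把每個權重硬裁剪到 [-c, c]
    return [np.clip(w, -c, c) for w in params]

w = [np.array([-0.5, 0.005, 0.2])]
print(clip_weights(w, c=0.01)[0])  # [-0.01   0.005  0.01]
```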
### 2.3 Properties of the optimal WGAN critic
:::info
In order to understand why weight clipping is problematic in a WGAN critic, as well as to motivate our approach, we highlight some properties of the optimal critic in the WGAN framework. We prove these in the Appendix.
Proposition 1. Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions in $\mathcal{X}$, a compact metric space. Then, there is a 1-Lipschitz function $f^*$ which is the optimal solution of $max_{\Vert f \Vert \leq 1} \mathbb{E}_{y \sim \mathbb{P}_r} \left[f(y)\right] - \mathbb{E}_{x \sim \mathbb{P}_g} \left[f(x)\right]$. Let $\pi$ be the optimal coupling between $\mathbb{P}_r$ and $\mathbb{P}_g$, defined as the minimizer of: $W(\mathbb{P}_r, \mathbb{P}_g)=inf_{\pi\in\prod(\mathbb{P}_r,\mathbb{P}_g)}\mathbb{E}_{(x,y)\sim\pi}\left[\Vert x-y \Vert\right]$ where $\prod(\mathbb{P}_r,\mathbb{P}_g)$ is the set of joint distributions $\pi(x, y)$ whose marginals are $\mathbb{P}_r$ and $\mathbb{P}_g$, respectively. Then, if $f^*$ is differentiable^‡^, $\pi(x=y)=0$^§^ and $x_t=tx+(1-t)y$ with $0\leq t \leq 1$ it holds that $\mathbb{P}_{(x,y)\sim\pi}\left[\nabla f^*(x_t)=\dfrac{y-x_t}{\Vert y-x_t \Vert}\right]=1$
Corollary 1. $f^∗$ has gradient norm 1 almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$.
:::
:::success
為了瞭解為何weight clipping在WGAN critic上是有問題的,並激發我們的想法,我們強調在WGAN框架中最佳critic的一些屬性。我們在附錄中證明這些。
Proposition 1. 令$\mathbb{P}_r$與$\mathbb{P}_g$為緊緻[賦距空間](http://terms.naer.edu.tw/detail/2119684/)$\mathcal{X}$上的兩個分佈。那麼,存在一個1-Lipschitz function $f^*$,為$max_{\Vert f \Vert \leq 1} \mathbb{E}_{y \sim \mathbb{P}_r} \left[f(y)\right] - \mathbb{E}_{x \sim \mathbb{P}_g} \left[f(x)\right]$的最佳解。令$\pi$為$\mathbb{P}_r$與$\mathbb{P}_g$之間的最佳[耦合](http://terms.naer.edu.tw/detail/2113909/),定義為下式的最小解:$W(\mathbb{P}_r, \mathbb{P}_g)=inf_{\pi\in\prod(\mathbb{P}_r,\mathbb{P}_g)}\mathbb{E}_{(x,y)\sim\pi}\left[\Vert x-y \Vert\right]$,其中$\prod(\mathbb{P}_r,\mathbb{P}_g)$為邊際分別是$\mathbb{P}_r$與$\mathbb{P}_g$的[聯合分佈](http://terms.naer.edu.tw/detail/2118619/)$\pi(x, y)$的集合。那麼,如果$f^*$是可微的^‡^、$\pi(x=y)=0$^§^,且$x_t=tx+(1-t)y$,其中$0\leq t \leq 1$,則有$\mathbb{P}_{(x,y)\sim\pi}\left[\nabla f^*(x_t)=\dfrac{y-x_t}{\Vert y-x_t \Vert}\right]=1$
Corollary 1. $f^*$在$\mathbb{P}_r$與$\mathbb{P}_g$下幾乎處處的梯度範數皆為1。
:::
## 3 Difficulties with weight constraints
:::info
We find that weight clipping in WGAN leads to optimization difficulties, and that even when optimization succeeds the resulting critic can have a pathological value surface. We explain these problems below and demonstrate their effects; however we do not claim that each one always occurs in practice, nor that they are the only such mechanisms.
Our experiments use the specific form of weight constraint from [2] (hard clipping of the magnitude of each weight), but we also tried other weight constraints (L2 norm clipping, weight normalization), as well as soft constraints (L1 and L2 weight decay) and found that they exhibit similar problems.
To some extent these problems can be mitigated with batch normalization in the critic, which [2] use in all of their experiments. However even with batch normalization, we observe that very deep WGAN critics often fail to converge.
:::
:::success
我們發現,WGAN中的weight clipping會導致最佳化困難,而且即使最佳化成功,所得到的critic也可能具有病態的值表面(pathological value surface)。我們在下面說明這些問題並展示其影響;然而,我們並非主張每個問題在實務上必然發生,也不是說它們是僅有的此類機制。
我們的實驗使用[2]中特定形式的權重約束(對每個權重的大小做硬性裁剪),但我們也嘗試了其它權重約束(L2範數裁剪、weight normalization),以及非硬性約束(L1與L2權重衰減),並且發現它們表現出類似的問題。
某種程度上,這些問題可以透過在critic中使用batch normalization來緩解,[2]的所有實驗都這麼做。然而,即使使用batch normalization,我們觀察到非常深的WGAN critic通常仍無法收斂。
:::
:::info

(a) Value surfaces of WGAN critics trained to optimality on toy datasets using (top) weight clipping and (bottom) gradient penalty. Critics trained with weight clipping fail to capture higher moments of the data distribution. The 'generator' is held fixed at the real data plus Gaussian noise.
(b) (left) Gradient norms of deep WGAN critics during training on the Swiss Roll dataset either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).
Figure 1: Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping
:::
:::success
(a) WGAN critics在簡單資料集上訓練至最佳所得的值表面,上排為使用weight clipping,下排為使用gradient penalty。使用weight clipping訓練的critic無法捕捉資料分佈的[高階動差](http://terms.naer.edu.tw/detail/368378/)。「Generator」固定為實際資料加上[高斯雜訊](http://terms.naer.edu.tw/detail/2373282/)。
(b) (左圖)深層WGAN critics在Swiss Roll資料集上訓練期間的梯度範數,在使用weight clipping時不是爆炸就是消失,但使用gradient penalty時不會。(右圖)Weight clipping(上圖)將權重推向兩個值(裁剪範圍的極值),gradient penalty(下圖)則不會。
Figure 1: Gradient penalty在WGANs中不會像weight clipping那樣出現非預期的行為
:::
### 3.1 Capacity underuse
:::info
Implementing a $k$-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in Corollary 1, the optimal WGAN critic has unit gradient norm almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm $k$ end up learning extremely simple functions.
To demonstrate this, we train WGAN critics with weight clipping to optimality on several toy distributions, holding the generator distribution $\mathbb{P}_g$ fixed at the real distribution plus unit-variance Gaussian noise. We plot value surfaces of the critics in Figure 1a. We omit batch normalization in the critic. In each case, the critic trained with weight clipping ignores higher moments of the data distribution and instead models very simple approximations to the optimal functions. In contrast, our approach does not suffer from this behavior.
:::
:::success
透過weight clipping來實作$k$-Lipschitz約束,會使critic偏向學出過於簡單的函數。如同先前Corollary 1所述,最佳WGAN critic在$\mathbb{P}_r$與$\mathbb{P}_g$下幾乎處處具有單位梯度範數;在weight clipping約束之下,我們觀察到我們的神經網路架構在試圖達到最大梯度範數$k$時,最終學到極其簡單的函數。
為了證明這點,我們以weight clipping訓練WGAN critics,在多個簡單的分佈上訓練至最佳,並將generator分佈$\mathbb{P}_g$固定為實際分佈加上單位變異數[高斯雜訊](http://terms.naer.edu.tw/detail/2373282/)。我們在Figure 1a繪製critics的值表面,其中critic省略batch normalization。在每種情況下,使用weight clipping訓練的critic都忽略資料分佈的[高階動差](http://terms.naer.edu.tw/detail/368378/),而只對最佳函數建出非常簡單的近似。相反地,我們的方法不受這種行為影響。
:::
:::info
^‡^We can actually assume much less, and talk only about directional derivatives on the direction of the line; which we show in the proof always exist. This would imply that in every point where $f^∗$ is differentiable (and thus we can take gradients in a neural network setting) the statement holds.
^§^This assumption is in order to exclude the case when the matching point of sample $x$ is $x$ itself. It is satisfied in the case that $\mathbb{P}_r$ and $\mathbb{P}_g$ have supports that intersect in a set of measure 0, such as when they are supported by two low dimensional manifolds that don’t perfectly align [1].
:::
:::success
^‡^實際上我們可以假設得更少,只討論沿該直線方向的[方向導數](http://terms.naer.edu.tw/detail/2114822/);我們在證明中說明它總是存在。這意味著在$f^∗$可微的每一個點上(因此在神經網路設定中我們可以取梯度),該論述都成立。
^§^這個假設是為了排除樣本$x$的匹配點就是$x$本身的情況。當$\mathbb{P}_r$與$\mathbb{P}_g$的支撐集相交於一個測度為0的集合時,此假設成立,例如當它們由兩個不完全對齊的低維流形支撐時[1]。
:::
:::success
**Algorithm 1** WGAN with gradient penalty。我們使用預設參數,$\lambda=10$,$n_{critic}=5$,$\alpha=0.0001$,$\beta_1=0$,$\beta_2=0.9$
**Require:** gradient penalty coefficient(梯度懲罰係數)$\lambda$,每次generator迭代的critic迭代次數$n_{critic}$,batch size $m$,Adam超參數$\alpha,\beta_1,\beta_2$
**Require:** 初始化critic參數-$w_0$,初始化generator參數-$\theta_0$
1: **while** $\theta$ has not converged **do**
2: $\qquad$**for** $t=1,...,n_{critic}$ **do**
3: $\qquad \qquad$ **for** $i=1,...,m$ **do**
4: $\qquad \qquad \qquad$ Sample real data $x\sim\mathbb{P}_r$, latent variable $z\sim p(z)$, a random number $\epsilon\sim U\left[0,1\right]$
5: $\qquad \qquad \qquad$ $\widetilde{x} \leftarrow G_\theta(z)$
6: $\qquad \qquad \qquad$ $\hat{x} \leftarrow \epsilon x + (1 - \epsilon)\widetilde{x}$
7: $\qquad \qquad \qquad$ $L^{(i)} \leftarrow D_w(\widetilde{x})-D_w(x)+\lambda(\Vert \nabla_\hat{x}D_w(\hat{x}) \Vert_2-1)^2$
8: $\qquad \qquad$ **end for**
9: $\qquad \qquad$ $w \leftarrow Adam(\nabla_w\dfrac{1}{m}\sum^m_{i=1}L^{(i)},w,\alpha,\beta_1,\beta_2)$
10: $\qquad$ **end for**
11: $\qquad$ Sample a batch of latent variables $\left\{z^{(i)}\right\}^m_{i=1}\sim p(z)$
12: $\qquad$ $\theta \leftarrow Adam(\nabla_\theta\dfrac{1}{m}\sum^m_{i=1}-D_w(G_\theta(z^{(i)})),\theta,\alpha,\beta_1,\beta_2)$
13: **end while**
:::
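Algorithm 1的第4–7行(插值與懲罰項)可以用一個假設性的線性critic $D_w(x)=w\cdot x$做極簡示意:線性情況下對輸入的梯度$\nabla_{\hat{x}}D_w(\hat{x})=w$可解析求得,不需autodiff;實際實作(如上方連結的原始碼)則需以自動微分計算這個梯度。以下numpy草稿僅為說明,非論文程式碼:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, m, dim = 10.0, 4, 3            # λ=10 為論文預設;m、dim 為示意用的小數值

w = rng.normal(size=dim)            # 假設性的線性 critic:D_w(x) = w · x
x = rng.normal(size=(m, dim))       # 真實樣本 x ~ P_r(第4行)
x_tilde = rng.normal(size=(m, dim)) # 生成樣本 x~ = G_θ(z)(第5行)

eps = rng.uniform(size=(m, 1))      # ε ~ U[0, 1](第4行)
x_hat = eps * x + (1 - eps) * x_tilde  # 直線插值 x^ = εx + (1-ε)x~(第6行)

# 線性 critic 對輸入的梯度處處等於 w,故每個 x^ 的梯度範數都是 ||w||
grad = np.broadcast_to(w, x_hat.shape)
grad_norm = np.linalg.norm(grad, axis=1)

# 第7行:L^(i) = D_w(x~) - D_w(x) + λ(||∇_x^ D_w(x^)||_2 - 1)^2
L = (x_tilde @ w) - (x @ w) + lam * (grad_norm - 1.0) ** 2
print(L.mean())
```

之後的第9、12行只是把這個損失的平均丟給Adam分別更新$w$與$\theta$。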
### 3.2 Exploding and vanishing gradients
:::info
We observe that the WGAN optimization process is difficult because of interactions between the weight constraint and the cost function, which result in either vanishing or exploding gradients without careful tuning of the clipping threshold $c$.
To demonstrate this, we train WGAN on the Swiss Roll toy dataset, varying the clipping threshold $c$ in [10^−1^ , 10^−2^ , 10^−3^], and plot the norm of the gradient of the critic loss with respect to successive layers of activations. Both generator and critic are 12-layer ReLU MLPs without batch normalization. Figure 1b shows that for each of these values, the gradient either grows or decays exponentially as we move farther back in the network. We find our method results in more stable gradients that neither vanish nor explode, allowing training of more complicated networks.
:::
:::success
我們觀察到,由於權重約束與成本函數之間的交互作用,WGAN的最佳化過程是困難的:若沒有仔細調校裁剪閾值$c$,便會導致梯度消失或爆炸。
為了證明這點,我們在Swiss Roll簡單資料集上訓練WGAN,將裁剪閾值$c$在[10^−1^, 10^−2^, 10^−3^]之間變化,並繪製critic loss對各層啟動值的梯度範數。generator與critic皆為12層、不含batch normalization的ReLU MLP。Figure 1b顯示,對其中每一個值,隨著我們在網路中往回傳遞,梯度都呈指數型增長或衰減。我們發現我們的方法產生更穩定的梯度,既不會消失也不會爆炸,因而能訓練更複雜的網路。
:::
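梯度範數為何對$c$如此敏感,可以用一個風格化的小實驗體會(這是假設性的簡化示意,以線性層代替論文的ReLU MLP,並非論文的實驗設定):反向傳播時梯度範數大約每層乘上一個與權重尺度成正比的因子,12層下來便是該因子的12次方,因此裁剪過緊就指數衰減、過鬆就指數增長。

```python
import numpy as np

rng = np.random.default_rng(1)
depth, dim = 12, 64  # 論文實驗為 12 層 MLP;此處以線性層簡化示意

def backprop_norm(c, std=0.2):
    # 權重先裁剪到 [-c, c],再觀察單位梯度向量反向傳播 depth 層後的範數
    g = np.ones(dim) / np.sqrt(dim)
    for _ in range(depth):
        W = np.clip(rng.normal(0.0, std, size=(dim, dim)), -c, c)
        g = W.T @ g
    return np.linalg.norm(g)

vanish = backprop_norm(0.01)  # 裁剪過緊:每層尺度遠小於 1,梯度指數衰減
explode = backprop_norm(0.5)  # 裁剪寬鬆且權重尺度大:每層尺度大於 1,梯度指數增長
print(vanish, explode)
```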
## 4 Gradient penalty
:::info
We now propose an alternative way to enforce the Lipschitz constraint. A differentiable function is 1-Lipschitz if and only if it has gradients with norm at most 1 everywhere, so we consider directly constraining the gradient norm of the critic’s output with respect to its input. To circumvent tractability issues, we enforce a soft version of the constraint with a penalty on the gradient norm for random samples $\hat{x} \sim \mathbb{P}_\hat{x}$. Our new objective is
$$L=\underbrace{\operatorname*{\mathbb{E}}_{\widetilde{x}\sim\mathbb{P}_g}\left[D(\widetilde{x})\right]-\operatorname*{\mathbb{E}}_{x\sim\mathbb{P}_r}\left[D(x)\right]}_{Original \space critic \space loss} + \underbrace{\lambda \operatorname*{\mathbb{E}}_{\hat{x}\sim\mathbb{P}_\hat{x}}\left[(\Vert \nabla_\hat{x}D(\hat{x}) \Vert_2 -1)^2\right]}_{Our \space gradient \space penalty}\qquad (3)$$
**Sampling distribution** We implicitly define $\mathbb{P}_\hat{x}$ sampling uniformly along straight lines between pairs of points sampled from the data distribution $\mathbb{P}_r$ and the generator distribution $\mathbb{P}_g$. This is motivated by the fact that the optimal critic contains straight lines with gradient norm 1 connecting coupled points from $\mathbb{P}_r$ and $\mathbb{P}_g$ (see Proposition 1). Given that enforcing the unit gradient norm constraint everywhere is intractable, enforcing it only along these straight lines seems sufficient and experimentally results in good performance.
:::
:::success
我們現在提出一種強制Lipschitz約束的替代方法。一個可微函數是1-Lipschitz,若且唯若它處處的梯度範數至多為1,因此我們考慮直接約束critic的輸出對其輸入的梯度範數。為了避免可處理性(tractability)的問題,我們改採非硬性的約束:對隨機樣本$\hat{x} \sim \mathbb{P}_\hat{x}$的梯度範數加上懲罰項。我們新的目標函數是
$$L=\underbrace{\operatorname*{\mathbb{E}}_{\widetilde{x}\sim\mathbb{P}_g}\left[D(\widetilde{x})\right]-\operatorname*{\mathbb{E}}_{x\sim\mathbb{P}_r}\left[D(x)\right]}_{Original \space critic \space loss} + \underbrace{\lambda \operatorname*{\mathbb{E}}_{\hat{x}\sim\mathbb{P}_\hat{x}}\left[(\Vert \nabla_\hat{x}D(\hat{x}) \Vert_2 -1)^2\right]}_{Our \space gradient \space penalty}\qquad (3)$$
**抽樣分佈** 我們隱式地定義$\mathbb{P}_\hat{x}$:在來自資料分佈$\mathbb{P}_r$與generator分佈$\mathbb{P}_g$的成對樣本點之間的直線上做均勻抽樣。這麼做的動機是:最佳的critic在連接來自$\mathbb{P}_r$與$\mathbb{P}_g$之耦合點的直線上,具有梯度範數1(見Proposition 1)。有鑒於處處強制單位梯度範數是難以處理的,只沿這些直線強制看起來已經足夠,而且實驗結果效果不錯。
:::
:::info
**Penalty coefficient** All experiments in this paper use $\lambda = 10$, which we found to work well across a variety of architectures and datasets ranging from toy tasks to large ImageNet CNNs.
**No critic batch normalization** Most prior GAN implementations [22, 23, 2] use batch normalization in both the generator and the discriminator to help stabilize training, but batch normalization changes the form of the discriminator’s problem from mapping a single input to a single output to mapping from an entire batch of inputs to a batch of outputs [23]. Our penalized training objective is no longer valid in this setting, since we penalize the norm of the critic’s gradient with respect to each input independently, and not the entire batch. To resolve this, we simply omit batch normalization in the critic in our models, finding that they perform well without it. Our method works with normalization schemes which don’t introduce correlations between examples. In particular, we recommend layer normalization [3] as a drop-in replacement for batch normalization.
**Two-sided penalty** We encourage the norm of the gradient to go towards 1 (two-sided penalty) instead of just staying below 1 (one-sided penalty). Empirically this seems not to constrain the critic too much, likely because the optimal WGAN critic anyway has gradients with norm 1 almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$ and in large portions of the region in between (see subsection 2.3). In our early observations we found this to perform slightly better, but we don’t investigate this fully. We describe experiments on the one-sided penalty in the appendix.
:::
:::success
**懲罰係數** 論文中的所有實驗都使用$\lambda = 10$,我們發現這個值在各種架構與資料集上,從簡單任務到大型的ImageNet CNNs,都有很好的效果。
**critic不使用batch normalization** 多數先前的GAN實作[22, 23, 2]都會在generator與discriminator上使用batch normalization來穩定訓練,但batch normalization改變了discriminator問題的形式,從「單一輸入映射到單一輸出」變為「整個批次的輸入映射到批次的輸出」[23]。我們的懲罰訓練目標在這種設定下不再有效,因為我們是針對每個輸入單獨懲罰critic的梯度範數,而不是整個批次。為了解決這個問題,我們直接在模型的critic中省略batch normalization,並發現沒有它效果一樣好。我們的方法適用於不會引入樣本間相關性的正規化方案。特別是,我們建議以layer normalization[3]直接替換batch normalization。
**雙邊懲罰** 我們鼓勵梯度範數趨向1(雙邊懲罰),而不只是保持在1以下(單邊懲罰)。經驗上,這似乎不會對critic造成過多限制,可能是因為最佳WGAN critic在$\mathbb{P}_r$與$\mathbb{P}_g$下,以及兩者之間的大部分區域,幾乎處處的梯度範數都是1(見subsection 2.3)。在我們早期的觀察中,我們發現雙邊懲罰的表現略好,但並沒有完整研究這點。我們在附錄中描述單邊懲罰的實驗。
:::
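「batch normalization會引入樣本間相關性,layer normalization不會」這點可以直接用numpy驗證(假設性的小例子,非論文程式碼;此處省略可學習的縮放/平移參數):只改動批次中的一個樣本,觀察其它樣本的正規化輸出是否改變。

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # 沿 batch 維度(axis 0)正規化:每個輸出都依賴整個批次的統計量
    return (x - x.mean(0)) / np.sqrt(x.var(0) + eps)

def layer_norm(x, eps=1e-5):
    # 沿特徵維度(axis 1)逐樣本正規化:樣本之間互不影響
    return (x - x.mean(1, keepdims=True)) / np.sqrt(x.var(1, keepdims=True) + eps)

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 8))
y = x.copy()
y[0] += 100.0  # 只改動第 0 個樣本

# 其它樣本(索引 1 之後)的輸出是否受影響?
bn_changed = not np.allclose(batch_norm(x)[1:], batch_norm(y)[1:])
ln_changed = not np.allclose(layer_norm(x)[1:], layer_norm(y)[1:])
print(bn_changed, ln_changed)  # True False
```

這說明為何逐樣本的梯度懲罰與batch normalization不相容,而layer normalization可以直接替換。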
## 5 Experiments
### 5.1 Training random architectures within a set
:::info
We experimentally demonstrate our model’s ability to train a large number of architectures which we think are useful to be able to train. Starting from the DCGAN architecture, we define a set of architecture variants by changing model settings to random corresponding values in Table 1. We believe that reliable training of many of the architectures in this set is a useful goal, but we do not claim that our set is an unbiased or representative sample of the whole space of useful architectures: it is designed to demonstrate a successful regime of our method, and readers should evaluate whether it contains architectures similar to their intended application.
:::
:::success
我們通過實驗證明我們的模型能夠訓練大量我們認為值得能被訓練的架構。從DCGAN架構開始,我們將模型設定改為Table 1中隨機對應的值,以定義一系列架構變體。我們相信,能可靠地訓練這一系列中的許多架構是一個有用的目標,但我們並不主張這個集合是整個有用架構空間中無偏或具代表性的樣本:它旨在展示我們方法能成功運作的範圍,讀者應自行評估其中是否包含與其預期應用相似的架構。
:::
:::info
Table 1: We evaluate WGAN-GP’s ability to train the architectures in this set.
Table 1: 我們評估WGAN-GP在這系列中訓練架構的能力

:::
:::info
From this set, we sample 200 architectures and train each on 32×32 ImageNet with both WGAN-GP and the standard GAN objectives. Table 2 lists the number of instances where either: only the standard GAN succeeded, only WGAN-GP succeeded, both succeeded, or both failed, where success is defined as inception score > min score. For most choices of score threshold, WGAN-GP successfully trains many architectures from this set which we were unable to train with the standard GAN objective. We give more experimental details in the appendix.
:::
:::success
從這個集合中,我們採樣200個架構,並分別以WGAN-GP與標準GAN目標函數在32×32的ImageNet上訓練每一個架構。Table 2列出四種情況的實例數:只有標準GAN成功、只有WGAN-GP成功、兩者皆成功、兩者皆失敗,其中成功定義為inception score > min score。對大多數評分閾值的選擇,WGAN-GP都成功訓練了這個集合中許多用標準GAN目標函數無法訓練的架構。我們在附錄中給出更多實驗細節。
:::
:::info
Table 2: Outcomes of training 200 random architectures, for different success thresholds. For comparison, our standard DCGAN scored 7.24.
Table 2: 訓練200個隨機架構在不同成功閾值下的結果。相比之下,我們的標準DCGAN得分為7.24。

:::
:::info

Figure 2: Different GAN architectures trained with different methods. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP.
Figure 2: 不同方法訓練不同的GAN架構,我們只使用WGAN-GP成功地使用一組共享的超參數來訓練每個架構。
:::
### 5.2 Training varied architectures on LSUN bedrooms
:::info
To demonstrate our model’s ability to train many architectures with its default settings, we train six different GAN architectures on the LSUN bedrooms dataset [31]. In addition to the baseline DCGAN architecture from [22], we choose six architectures whose successful training we demonstrate: (1) no BN and a constant number of filters in the generator, as in [2], (2) 4-layer 512-dim ReLU MLP generator, as in [2], (3) no normalization in either the discriminator or generator (4) gated multiplicative nonlinearities, as in [24], (5) tanh nonlinearities, and (6) 101-layer ResNet generator and discriminator.
Although we do not claim it is impossible without our method, to the best of our knowledge this is the first time very deep residual networks were successfully trained in a GAN setting. For each architecture, we train models using four different GAN methods: WGAN-GP, WGAN with weight clipping, DCGAN [22], and Least-Squares GAN [18]. For each objective, we used the default set of optimizer hyperparameters recommended in that work (except LSGAN, where we searched over learning rates).
For WGAN-GP, we replace any batch normalization in the discriminator with layer normalization (see section 4). We train each model for 200K iterations and present samples in Figure 2. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP. For every other training method, some of these architectures were unstable or suffered from mode collapse.
:::
:::success
為了證明我們的模型能以預設設定訓練許多架構,我們在LSUN bedrooms資料集[31]上訓練六種不同的GAN架構。除了[22]的基線DCGAN架構之外,我們選擇六種架構來展示其成功訓練:(1) generator中不使用BN且filter數固定,如[2];(2) 4層512維的ReLU MLP generator,如[2];(3) discriminator與generator皆不使用正規化;(4) 門控乘法非線性(gated multiplicative nonlinearities),如[24];(5) tanh非線性;(6) 101層的ResNet generator與discriminator。
雖然我們並沒有主張沒有我們的方法就不可能辦到,但據我們所知,這是第一次在GAN設定中成功訓練非常深的殘差網路。針對每一個架構,我們使用四種不同的GAN方法來訓練模型:WGAN-GP、WGAN with weight clipping、DCGAN [22]與Least-Squares GAN [18]。對每一個目標函數,我們使用該工作建議的預設最佳化器超參數(LSGAN除外,我們對其搜尋學習率)。
對於WGAN-GP,我們將discriminator中的所有batch normalization替換為layer normalization(見section 4)。每一個模型都訓練200K次迭代,並在Figure 2中展示樣本。我們只有使用WGAN-GP時,能以一組共享的超參數成功訓練所有架構。對於其它每一種訓練方法,其中一些架構不是不穩定就是出現mode collapse。
:::
### 5.3 Improved performance over weight clipping
:::info
One advantage of our method over weight clipping is improved training speed and sample quality. To demonstrate this, we train WGANs with weight clipping and our gradient penalty on CIFAR10 [13] and plot Inception scores [23] over the course of training in Figure 3. For WGAN-GP, we train one model with the same optimizer (RMSProp) and learning rate as WGAN with weight clipping, and another model with Adam and a higher learning rate. Even with the same optimizer, our method converges faster and to a better score than weight clipping. Using Adam further improves performance. We also plot the performance of DCGAN [22] and find that our method converges more slowly (in wall-clock time) than DCGAN, but its score is more stable at convergence.
:::
:::success
我們的方法優於weight clipping之處,在於提升了訓練速度與樣本品質。為了證明這點,我們在CIFAR-10[13]上分別以weight clipping與我們的gradient penalty訓練WGANs,並在Figure 3中繪製訓練過程的Inception scores[23]。對WGAN-GP,我們訓練一個模型,使用與weight clipping版WGAN相同的最佳化器(RMSProp)與學習率;另一個模型則使用Adam與較高的學習率。即使使用相同的最佳化器,我們的方法仍比weight clipping收斂得更快、得分更好。使用Adam可進一步提升效能。我們也繪製DCGAN[22]的效能,並發現我們的方法收斂得比DCGAN慢(以實際時間計),但收斂時的得分更穩定。
:::
:::info

Figure 3: CIFAR-10 Inception score over generator iterations (left) or wall-clock time (right) for four models: WGAN with weight clipping, WGAN-GP with RMSProp and Adam (to control for the optimizer), and DCGAN. WGAN-GP significantly outperforms weight clipping and performs comparably to DCGAN.
Figure 3: 四個模型在generator iterations(左)與wall-clock time(右)下的CIFAR-10 Inception score:WGAN with weight clipping、WGAN-GP with RMSProp與Adam(以控制最佳化器的影響),以及DCGAN。WGAN-GP明顯優於weight clipping,且效能與DCGAN相當。
:::
### 5.4 Sample quality on CIFAR-10 and LSUN bedrooms
:::info
For equivalent architectures, our method achieves comparable sample quality to the standard GAN objective. However the increased stability allows us to improve sample quality by exploring a wider range of architectures. To demonstrate this, we find an architecture which establishes a new state of the art Inception score on unsupervised CIFAR-10 (Table 3). When we add label information (using the method in [20]), the same architecture outperforms all other published models except for SGAN.
We also train a deep ResNet on 128 × 128 LSUN bedrooms and show samples in Figure 4. We believe these samples are at least competitive with the best reported so far on any resolution for this dataset.
:::
:::success
對於等效的架構,我們的方法可以達到與標準GAN目標函數相當的樣本品質。然而,穩定性的提升讓我們可以透過探索更廣泛的架構來提升樣本品質。為了證明這點,我們找到一個架構,在非監督式CIFAR-10上建立新的最先進Inception score(Table 3)。當我們加入標籤資訊(使用[20]中的方法),相同的架構優於除SGAN以外所有已發表的模型。
我們還在128×128的LSUN bedrooms上訓練一個深度ResNet,並在Figure 4中展示樣本。我們相信這些樣本至少可與目前在這個資料集任何解析度上的最佳結果相比。
:::
:::info

Table 3: Inception scores on CIFAR-10. Our unsupervised model achieves state-of-the-art performance, and our conditional model outperforms all others except SGAN.
Table 3: Inception scores on CIFAR-10。我們的非監督式模型達到最先進的效能,而且我們的條件式模型優於SGAN以外的所有模型。
:::
### 5.5 Modeling discrete data with a continuous generator
:::info
To demonstrate our method’s ability to model degenerate distributions, we consider the problem of modeling a complex discrete distribution with a GAN whose generator is defined over a continuous space. As an instance of this problem, we train a character-level GAN language model on the Google Billion Word dataset [6]. Our generator is a simple 1D CNN which deterministically transforms a latent vector into a sequence of 32 one-hot character vectors through 1D convolutions. We apply a softmax nonlinearity at the output, but use no sampling step: during training, the softmax output is passed directly into the critic (which, likewise, is a simple 1D CNN). When decoding samples, we just take the argmax of each output vector.
We present samples from the model in Table 4. Our model makes frequent spelling errors (likely because it has to output each character independently) but nonetheless manages to learn quite a lot about the statistics of language. We were unable to produce comparable results with the standard GAN objective, though we do not claim that doing so is impossible.
:::
:::success
為了證明我們的方法能夠對[退化分佈](http://terms.naer.edu.tw/detail/3647917/)建模,我們考慮用generator定義在連續空間上的GAN,對一個複雜離散分佈建模的問題。做為這個問題的案例,我們在Google Billion Word資料集[6]上訓練一個字符等級的GAN語言模型(character-level GAN language model)。我們的generator是一個簡單的1D-CNN,它透過1D卷積確定性地將潛在向量轉換為32個獨熱字符向量的序列。我們在輸出的地方套用softmax非線性,但是沒有採樣步驟:訓練期間,softmax的輸出直接傳遞給critic(critic也是一個簡單的1D-CNN)。解碼樣本的時候,我們只取每一個輸出向量的argmax。
我們在Table 4提供模型的樣本。我們的模型經常拼寫錯誤(可能是因為它必須獨立地輸出每個字符),儘管如此,還是學習到相當多的語言統計知識。我們無法用標準GAN目標函數產生相當的結果,雖然我們並沒有主張這麼做是不可能的。
:::
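上段「訓練時將softmax輸出直接餵給critic、解碼時取argmax」的流程,可以用numpy做最小示意如下(序列長度32為論文設定;字彙表大小26為假設值,generator的輸出以隨機logits代替實際的1D-CNN):

```python
import numpy as np

def softmax(logits, axis=-1):
    # 數值穩定的softmax:先減去最大值再取指數
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, seq_len = 26, 32                     # 假設:26個字符、序列長度32
logits = rng.normal(size=(seq_len, vocab_size))  # 假想的generator輸出

probs = softmax(logits)            # 訓練時:把這個連續分佈直接傳給critic
decoded = probs.argmax(axis=1)     # 解碼時:每個輸出向量取argmax
print(decoded.shape)               # 一串長度32的字符索引
```

重點在於整個訓練過程都停留在連續空間(softmax輸出),argmax只在最後解碼展示樣本時使用。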
:::info

Figure 4: Samples of 128×128 LSUN bedrooms. We believe these samples are at least comparable to the best published results so far.
Figure 4: 128×128 LSUN bedrooms的樣本。我們相信這些樣本至少可與目前為止已發佈的最佳結果相比。
:::
:::info

Table 4: Samples from a WGAN-GP character-level language model trained on sentences from the Billion Word dataset, truncated to 32 characters. The model learns to directly output one-hot character embeddings from a latent vector without any discrete sampling step. We were unable to achieve comparable results with the standard GAN objective and a continuous generator.
Table 4: 來自WGAN-GP字符等級語言模型(character-level language model)的樣本,這模型以Billion Word資料集的語句訓練,截斷為32個字符。模型學習由潛在向量直接輸出獨熱字符嵌入,不經過任何離散採樣步驟。我們無法用標準GAN目標函數與連續的generator取得相當的結果。
:::
:::info
The difference in performance between WGAN and other GANs can be explained as follows. Consider the simplex $\Delta_n=\left\{p\in\mathbb{R}^n:p_i\geq 0,\sum_ip_i=1\right\}$, and the set of vertices on the simplex (or one-hot vectors) $V_n=\left\{p\in\mathbb{R}^n:p_i\in\left\{0,1\right\},\sum_ip_i=1\right\}\subseteq\Delta_n$. If we have a vocabulary of size $n$ and we have a distribution $\mathbb{P}_r$ over sequences of size $T$, we have that $\mathbb{P}_r$ is a distribution on $V^T_n=V_n\times\cdot\cdot\cdot\times V_n$. Since $V^T_n$ is a subset of $\Delta^T_n$, we can also treat $\mathbb{P}_r$ as a distribution on $\Delta^T_n$(by assigning zero probability mass to all points not in $V^T_n$).
$\mathbb{P}_r$ is discrete (or supported on a finite number of elements, namely $V^T_n$) on $\Delta^T_n$, but $\mathbb{P}_g$ can easily be a continuous distribution over $\Delta^T_n$. The KL divergences between two such distributions are infinite, and so the JS divergence is saturated. Although GANs do not literally minimize these divergences [16], in practice this means a discriminator might quickly learn to reject all samples that don’t lie on $V^T_n$ (sequences of one-hot vectors) and give meaningless gradients to the generator. However, it is easily seen that the conditions of Theorem 1 and Corollary 1 of [2] are satisfied even on this non-standard learning scenario with $\mathcal{X}=\Delta^T_n$. This means that $W(\mathbb{P}_r, \mathbb{P}_g)$ is still well defined, continuous everywhere and differentiable almost everywhere, and we can optimize it just like in any other continuous variable setting. The way this manifests is that in WGANs, the Lipschitz constraint forces the critic to provide a linear gradient from all $\Delta^T_n$ towards the real points in $V^T_n$.
Other attempts at language modeling with GANs [32, 14, 30, 5, 15, 10] typically use discrete models and gradient estimators [28, 12, 17]. Our approach is simpler to implement, though whether it scales beyond a toy language model is unclear.
:::
:::success
WGAN與其它GANs之間的效能差異可以解釋如下。考慮[單體](http://terms.naer.edu.tw/detail/2124716/)$\Delta_n=\left\{p\in\mathbb{R}^n:p_i\geq 0,\sum_ip_i=1\right\}$,與[單體](http://terms.naer.edu.tw/detail/2124716/)上的頂點集(或獨熱向量)$V_n=\left\{p\in\mathbb{R}^n:p_i\in\left\{0,1\right\},\sum_ip_i=1\right\}\subseteq\Delta_n$。如果我們有一個大小為$n$的詞彙表,並且在大小為$T$的序列上有一個分佈$\mathbb{P}_r$, 那$\mathbb{P}_r$就是$V^T_n=V_n\times\cdot\cdot\cdot\times V_n$上的一個分佈。因為$V^T_n$是$\Delta^T_n$的子集,我們還可以將$\mathbb{P}_r$視為$\Delta^T_n$上的分佈(通過分配零機率質量到所有不存在於$V^T_n$的點)。
$\mathbb{P}_r$在$\Delta^T_n$上是離散的(或說支撐在有限個元素上,即$V^T_n$),但$\mathbb{P}_g$可以很容易地是$\Delta^T_n$上的連續分佈。這兩種分佈之間的KL divergences是無限的,因此JS divergence是飽和的。雖然GANs並沒有真的最小化這些divergences[16],實際上這意味著discriminator也許很快就學會拒絕所有不在$V^T_n$上的樣本(獨熱向量的序列),並給generator沒有意義的梯度。然而,容易看出,即使在$\mathcal{X}=\Delta^T_n$這種非標準學習場景中,[2]的Theorem 1與Corollary 1的條件依然滿足。這代表$W(\mathbb{P}_r, \mathbb{P}_g)$依然定義良好,處處連續,而且幾乎處處可微,我們可以像在其它連續變數設定下一樣對它做最佳化。這表現為:在WGANs中,Lipschitz約束式強制critic提供從整個$\Delta^T_n$指向$V^T_n$中真實點的線性梯度。
其它用GANs做語言建模的嘗試[32, 14, 30, 5, 15, 10]通常使用離散模型與梯度估計器[28, 12, 17]。我們的方法比較容易實現,儘管還不清楚它是否能擴展到玩具級語言模型(toy language model)之外。
:::
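上段的單體$\Delta_n$與頂點集$V_n$可以用簡單的數值檢查來說明:獨熱向量是$\Delta_n$的頂點(屬於$V_n$),而generator的softmax輸出則是$\Delta_n$內部的連續點(以下以$n=4$的小例子示意):

```python
import numpy as np

def on_simplex(p, tol=1e-8):
    # 落在單體Δ_n上:各分量非負,且總和為1
    return bool((p >= -tol).all() and abs(p.sum() - 1.0) < tol)

def is_vertex(p, tol=1e-8):
    # V_n的元素:在Δ_n上且每個分量是0或1(即獨熱向量)
    return on_simplex(p, tol) and bool(np.isclose(p, np.round(p), atol=tol).all())

one_hot = np.array([0., 0., 1., 0.])   # V_4的元素(Δ_4的頂點)
soft = np.exp([0.2, 1.0, -0.5, 0.0])
soft = soft / soft.sum()               # softmax輸出:在Δ_4內部但不是頂點

print(is_vertex(one_hot), is_vertex(soft))
```

這也呼應了上段的論點:$\mathbb{P}_r$的質量全部集中在像`one_hot`這樣的頂點上,而$\mathbb{P}_g$的樣本(如`soft`)幾乎不會剛好落在頂點上。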
:::info

Figure 5: (a) The negative critic loss of our model on LSUN bedrooms converges toward a minimum as the network trains. (b) WGAN training and validation losses on a random 1000-digit subset of MNIST show overfitting when using either our method (left) or weight clipping (right). In particular, with our method, the critic overfits faster than the generator, causing the training loss to increase gradually over time even as the validation loss drops.
Figure 5: (a) 我們的模型在LSUN bedrooms上的negative critic loss隨著網路訓練收斂至最小值。(b) WGAN在MNIST隨機1000張影像的子集上,不論使用我們的方法(左)或weight clipping(右),其training與validation losses都顯示過擬合。特別是,用我們的方法,critic過擬合得比generator快,導致training loss隨著時間逐漸上升,即使validation loss在下降。
:::
### 5.6 Meaningful loss curves and detecting overfitting
:::info
An important benefit of weight-clipped WGANs is that their loss correlates with sample quality and converges toward a minimum. To show that our method preserves this property, we train a WGAN-GP on the LSUN bedrooms dataset [31] and plot the negative of the critic’s loss in Figure 5a. We see that the loss converges as the generator minimizes $W(\mathbb{P}_r, \mathbb{P}_g)$.
Given enough capacity and too little training data, GANs will overfit. To explore the loss curve’s behavior when the network overfits, we train large unregularized WGANs on a random 1000-image subset of MNIST and plot the negative critic loss on both the training and validation sets in Figure 5b. In both WGAN and WGAN-GP, the two losses diverge, suggesting that the critic overfits and provides an inaccurate estimate of $W(\mathbb{P}_r, \mathbb{P}_g)$, at which point all bets are off regarding correlation with sample quality. However in WGAN-GP, the training loss gradually increases even while the validation loss drops.
[29] also measure overfitting in GANs by estimating the generator’s log-likelihood. Compared to that work, our method detects overfitting in the critic (rather than the generator) and measures overfitting against the same loss that the network minimizes.
:::
:::success
weight-clipped WGANs的一個重要好處在於,它們的loss與樣本品質相關,並且收斂至最小值。為了說明我們的方法保留這個特性,我們在LSUN bedrooms資料集[31]上訓練WGAN-GP,並在Figure 5a繪製critic loss的負值。我們看到,loss隨著generator最小化$W(\mathbb{P}_r, \mathbb{P}_g)$而收斂。
如果給予足夠的容量而訓練資料太少,GANs就會過擬合。為了探索網路過擬合時loss curve的行為,我們在MNIST隨機1000張影像的子集上訓練大型且未正規化(unregularized)的WGANs,並在Figure 5b繪製訓練集與驗證集上的negative critic loss。在WGAN與WGAN-GP中,兩條loss曲線出現分歧,這說明critic過擬合,並且提供了對$W(\mathbb{P}_r, \mathbb{P}_g)$不準確的估計;此時,loss與樣本品質之間的關聯便不再可靠。然而在WGAN-GP中,即使validation loss下降,training loss仍逐漸上升。
[29]也透過估計generator的log-likelihood來量測GANs的過擬合。相較於該研究,我們的方法檢測的是critic的過擬合(而非generator),並以網路所最小化的同一個loss來量測過擬合。
:::
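上述「在訓練集與驗證集上用同一個loss監控critic過擬合」的做法可以示意如下。這裡critic以假想的線性函數代替、資料為隨機生成,僅示意量的計算方式,非論文的實際實驗:

```python
import numpy as np

rng = np.random.default_rng(1)

def critic(x, w):
    return x @ w                       # 假想的線性critic,僅作示意

def neg_critic_loss(real, fake, w):
    # negative critic loss:E[critic(real)] - E[critic(fake)]
    return critic(real, w).mean() - critic(fake, w).mean()

w = rng.normal(size=8)
train_real = rng.normal(loc=1.0, size=(1000, 8))   # 假想的訓練集
val_real   = rng.normal(loc=1.0, size=(1000, 8))   # 假想的驗證集
fake       = rng.normal(loc=0.0, size=(1000, 8))   # 假想的生成樣本

# 訓練過程中同時追蹤這兩個量:若兩條曲線分歧
# (training loss持續上升、validation loss下降),即為critic過擬合的徵兆
train_loss = neg_critic_loss(train_real, fake, w)
val_loss   = neg_critic_loss(val_real, fake, w)
print(train_loss, val_loss)
```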
## 6 Conclusion
:::info
In this work, we demonstrated problems with weight clipping in WGAN and introduced an alternative in the form of a penalty term in the critic loss which does not exhibit the same problems. Using our method, we demonstrated strong modeling performance and stability across a variety of architectures. Now that we have a more stable algorithm for training GANs, we hope our work opens the path for stronger modeling performance on large-scale image datasets and language. Another interesting direction is adapting our penalty term to the standard GAN objective function, where it might stabilize training by encouraging the discriminator to learn smoother decision boundaries.
:::
:::success
在這項研究中,我們展示了WGAN中weight clipping的問題,並引入一個替代方案:在critic loss中加入penalty term(懲罰項),它不會表現出相同的問題。使用我們的方法,我們在各種架構上展示了強大的建模效能與穩定性。現在,對於訓練GANs我們有了更穩定的演算法,我們希望這項研究為在[大尺度](http://terms.naer.edu.tw/detail/366863/)影像資料集與語言上取得更強的建模效能開闢道路。另一個有趣的方向是將我們的penalty term調整到標準GAN的目標函數上,它也許能藉由鼓勵discriminator學習更平滑的決策邊界來穩定訓練。
:::
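結論中提到的penalty term,核心是在真實與生成樣本連線上的內插點懲罰critic梯度範數偏離1的程度。以下用numpy做最小示意:critic取為假想的線性函數$f(x)=x\cdot w$(如此梯度可解析寫出,無需自動微分);λ=10為論文建議值,資料為隨機生成:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 10.0                              # 論文建議的penalty係數λ

w = rng.normal(size=8)                  # 假想線性critic f(x)=x·w 的參數
real = rng.normal(size=(64, 8))         # 以隨機資料代替真實樣本
fake = rng.normal(size=(64, 8))         # 以隨機資料代替生成樣本

eps = rng.uniform(size=(64, 1))
x_hat = eps * real + (1 - eps) * fake   # 真實與生成樣本連線上的隨機內插點

# 線性critic的梯度∇f(x̂)恆等於w,因此每個內插點的梯度範數都是||w||
grad_norms = np.full(len(x_hat), np.linalg.norm(w))
penalty = lam * ((grad_norms - 1.0) ** 2).mean()

# WGAN-GP的critic loss:E[f(fake)] - E[f(real)] + penalty
critic_loss = (fake @ w).mean() - (real @ w).mean() + penalty
print(critic_loss)
```

實際的深度critic需要以自動微分框架計算$\nabla_{\hat{x}}f(\hat{x})$;這裡的線性critic只是讓penalty的構造一目了然。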
## Acknowledgements
:::info
We would like to thank Mohamed Ishmael Belghazi, Leon Bottou, Zihang Dai, Stefan Doerr, Ian Goodfellow, Kyle Kastner, Kundan Kumar, Luke Metz, Alec Radford, Colin Raffel, Sai Rajeshwar, Aditya Ramesh, Tom Sercu, Zain Shah and Jake Zhao for insightful comments.
:::
## E Hyperparameters used for LSUN robustness experiments
:::info
For each method we used the hyperparameters recommended in that method’s paper. For LSGAN, we additionally searched over learning rate (because the paper did not make a specific recommendation).
* WGAN with gradient penalty: Adam (α = .0001, β1 = .5, β2 = .9)
* WGAN with weight clipping: RMSProp (α = .00005)
* DCGAN: Adam (α = .0002, β1 = .5)
* LSGAN: RMSProp (α = .0001) [chosen by search over α = .001, .0002, .0001]
:::
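附帶一提,上面WGAN-GP所用的Adam設定(α = .0001, β1 = .5, β2 = .9)可以用標準Adam更新公式示意如下(以最小化單一純量參數的$f(\theta)=\theta^2$為例;此為一般Adam公式的實作,與特定框架無關):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-4, beta1=0.5, beta2=0.9, eps=1e-8):
    # 標準Adam更新:一階/二階動量估計,加上偏差校正
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2 * theta                 # f(θ)=θ²的梯度
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                         # 參數朝0移動
```

較低的β1(.5而非預設的.9)降低動量,是GAN訓練中常見的設定。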