# WGAN_Paper(翻譯) ###### tags: `wgan` `gan` `論文翻譯` `deeplearning` `對抗式生成網路` [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/abs/1701.07875) * [Kantorovich-Rubinstein duality](https://vincentherrmann.github.io/blog/wasserstein/) * [知乎_郑华滨_令人拍案叫绝的Wasserstein GAN](https://zhuanlan.zhihu.com/p/25071913) * [李宏毅老師WGAN課程筆記](https://hackmd.io/@shaoeChen/HJ3JycWeB) * WGANs論文中的critic相對於原始GANs的discriminator * mode collapse指GAN可能會集中到某一個distribution,因此造成生成照片多數雷同 ::: ## 1 Introduction :::info The problem this paper is concerned with is that of unsupervised learning. Mainly, what does it mean to learn a probability distribution? The classical answer to this is to learn a probability density. This is often done by defining a parametric family of densities$(P_\theta)_{\theta\in\mathbb{R}^d}$ and finding the one that maximized the likelihood on our data: if we have real data examples $\lbrace{x^{(i)}\rbrace}^m_{i=1}$, we would solve the problem ::: :::info $$\max_{\theta\in\mathbb{R}^d}\dfrac{1}{m}\sum_{i=1}^mlog P_\theta(x^{(i)})$$ ::: :::success 此論文所關注的問題是非監督式學習。主要的,學習機率分佈代表什麼?這問題的經典回答是,學習一個機率密度。這通常透過定義參數密度函數族$(P_\theta)_{\theta\in\mathbb{R}^d}$,然後在我們的資料上尋找最大似然概率:如果我們有真實的資料範例$\lbrace{x^{(i)}\rbrace}^m_{i=1}$,我們可以解決這個問題。 ::: :::info If the real data distribution $\mathbb{P}_r$ admits a density and $\mathbb{P}_\theta$ is the distribution of the parametrized density $P_\theta$, then, asymptotically, this amounts to minimizing the Kullback-Leibler divergence $KL(\mathbb{P}_r\|\mathbb{P}_\theta)$. ::: :::success 如果實際的資料分佈$\mathbb{P}_r$提供機率密度,而$\mathbb{P}_\theta$是參數化的機率密度$P_\theta$,然後,兩個資料分佈漸近,這相當於最小化`Kullback-Leibler divergence`<sub>(KL-Divergence)</sub>$KL(\mathbb{P}_r\|\mathbb{P}_\theta)$ ::: :::info For this to make sense, we need the model density $P_\theta$ to exist. This is not the case in the rather common situation where we are dealing with distributions supported by low dimensional manifolds. It is then unlikely that the model manifold and the true distribution’s support have a non-negligible intersection (see [1]), and this means that the KL distance is not defined (or simply infinite) ::: :::success 為了讓它說的通,我們需要模型密度$P_\theta$存在。在我們處理低維流形所支撐的分佈相當普遍情況下,事實並非如此。模型流程與實際分佈的支撐不可能存在一個不可忽視的交集(see \[1\]),這意味著KL distance並沒有被定義(或者根本無限大) ::: :::info The typical remedy is to add a noise term to the model distribution. This is why virtually all generative models described in the classical machine learning literature include a noise component. In the simplest case, one assumes a Gaussian noise with relatively high bandwidth in order to cover all the examples. It is well known, for instance, that in the case of image generation models, this noise degrades the quality of the samples and makes them blurry. For example, we can see in the recent paper [23] that the optimal standard deviation of the noise added to the model when maximizing likelihood is around 0.1 to each pixel in a generated image, when the pixels were already normalized to be in the range [0, 1]. This is a very high amount of noise, so much that when papers report the samples of their models, they don’t add the noise term on which they report likelihood numbers. In other words, the added noise term is clearly incorrect for the problem, but is needed to make the maximum likelihood approach work. 
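譯註:以下用一個小小的 NumPy 範例示意本節開頭的最大概似目標式 $\max_\theta \frac{1}{m}\sum_i \log P_\theta(x^{(i)})$──以單變量高斯作為參數化密度族 $(P_\theta)$,直接對平均對數概似做梯度上升。這只是譯者補充的示意,資料與參數設定皆為假設,並非原論文的內容:

```python
# 譯註示意:以 1 維高斯為參數族 P_theta,對平均對數概似做梯度上升(假設性小例子)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)    # 真實資料樣本 x^(i)

mu, log_sigma = 0.0, 0.0                         # theta = (mu, log_sigma)
lr = 0.05
for _ in range(2000):
    sigma = np.exp(log_sigma)
    # (1/m) * sum log P_theta(x^(i)) 對參數的梯度
    grad_mu = np.mean((x - mu) / sigma**2)
    grad_log_sigma = np.mean((x - mu)**2 / sigma**2 - 1.0)
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))                     # 應接近 2.0 與 1.5
```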
::: :::success 典型的補救方式就是在模型分佈中加入噪點項目。這就是為什麼在經典機器學習文獻上幾乎所有生成模型描述都包含噪點成份的因為。最簡單的情況,假設一個帶有相對高寬的高斯噪點要覆蓋所有的範例。眾所皆知,例如,在影像生成模型情況中,噪點會降低樣本的品質並且造成模糊。 舉例來說,我們可以在最近的論文\[23\]看到,當每一個像素已經被正規化至\[0, 1\]之間,當最大似然在生成照片每個像素的時候,噪點的最佳標準大約為0.1。這是非常大的噪點,以至於當論文報告他們的模型樣本時,他們並沒有報告增加噪點項目的可能數值。換句話說,增加噪點在這問題上是不正確的,但需要讓最大似然起作用。 ::: :::info Rather than estimating the density of $\mathbb{P}_r$ which may not exist, we can define a random variable $Z$ with a fixed distribution $p(z)$ and pass it through a parametric function $g_\theta: Z \to X$ (typically a neural network of some kind) that directly generates samples following a certain distribution $\mathbb{P}_\theta$. By varying $\theta$, we can change this distribution and make it close to the real data distribution $\mathbb{P}_r$. This is useful in two ways. First of all, unlike densities, this approach can represent distributions confined to a low dimensional manifold. Second, the ability to easily generate samples is often more useful than knowing the numerical value of the density (for example in image superresolution or semantic segmentation when considering the conditional distribution of the output image given the input image). In general, it is computationally difficult to generate samples given an arbitrary high dimensional density [16] ::: :::success 我們可以定義一個擁有固定分佈$p(z)$的隨機變數$Z$,而不是評估也許不存在的$\mathbb{P}_r$的密度,然後透過參數化函數$g_\theta: Z \to X$來傳遞它(通常是某種神經網路),直接從一個特定分佈$\mathbb{P}_\theta$中生成樣本。通過改變$\theta$,我們可以改變分佈,並使它接近實際資料的分佈$\mathbb{P}_r$。這有兩個用途。首先,不同於密度,這種方法可以表示分佈侷限於低維流形。其次,容易生成樣本的能力通常比瞭解密度數值還要來的有用(舉例,當考慮給定輸入影像的輸出影像的條件分佈時,超高影像解度或語義分割)。一般來說,在任意的高維密度下是非常難以生成樣本的。 ::: :::info Variational Auto-Encoders (VAEs) [9] and Generative Adversarial Networks (GANs) [4] are well known examples of this approach. Because VAEs focus on the approximate likelihood of the examples, they share the limitation of the standard models and need to fiddle with additional noise terms. GANs offer much more fexibility in the definition of the objective function, including Jensen-Shannon [4], and all f-divergences [17] as well as some exotic combinations [6]. On the other hand, training GANs is well known for being delicate and unstable, for reasons theoretically investigated in [1]. ::: :::success Variational Auto-Encoders (VAEs) [9] and Generative Adversarial Networks (GANs) [4]是這種方法的著名範例。因為VAEs專注在樣本的近似可能性,所以它們共享標準模型的侷限性,並需要加入額外的噪點項目。GANs在目標函數上提供了更多靈活性,包含Jensen-Shannon [4],與所有的f-divergences [17]以及一些奇異的組合 [6]。另一方面,由於 [1]的理論研究原因,訓練GANs是以容易壞掉與不穩定聞名。 ::: :::info In this paper, we direct our attention on the various ways to measure how close the model distribution and the real distribution are, or equivalently, on the various ways to define a distance or divergence $\rho(\mathbb{P}_\theta, \mathbb{P}_r)$. The most fundamental difference between such distances is their impact on the convergence of sequences of probability distributions. A sequence of distributions $(\mathbb{P}_t)_{t\in \mathbb{N}}$ converges if and only if there is a distribution $\mathbb{P}_\infty$ such that $\rho(\mathbb{P}_t, \mathbb{P}_\infty)$ tends to zero, something that depends on how exactly the distance $\rho$ is defined. Informally, a distance $\rho$ induces a weaker topology when it makes it easier for a sequence of distribution to converge.<sub>(見註1)</sub> Section 2 clarifies how popular probability distances differ in that respect. 註1:More exactly, the topology induced by $\rho$ is weaker than that induced by $\rho'$ when the set of convergent sequences under $\rho$ is a superset of that under $\rho'$. 
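譯註:呼應前面「定義固定分佈的隨機變數 $Z$,再通過參數化函數 $g_\theta$ 直接產生樣本」的段落,下面用幾行 NumPy 做個極簡的示意;這裡的 $g_\theta$ 只是一個假設的仿射映射(實務上通常是某種神經網路),純屬譯者補充,非原論文程式:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_theta(z, theta):
    """極簡的參數化函數 g_theta: Z -> X(此處僅為示意的仿射映射)。"""
    W, b = theta
    return z @ W + b

# 固定的先驗分佈 p(z):標準高斯
z = rng.normal(size=(5, 2))

# 任意初始化的參數 theta;改變 theta 就改變了生成分佈 P_theta
theta = (rng.normal(size=(2, 3)), np.zeros(3))

samples = g_theta(z, theta)   # 直接得到 P_theta 的樣本,完全不需要知道其密度
print(samples.shape)          # (5, 3)
```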
::: :::success 論文中,我們將注意力放在各種量測模型分佈與實際分佈有多接近的方法,或者可以說是各種不同方法定義距離或離散程度$\rho(\mathbb{P}_\theta, \mathbb{P}_r)$。這種距離之間最根本的區別是它們對機率分佈序列收斂性的影響。分佈序列$(\mathbb{P}_t)_{t\in \mathbb{N}}$收斂的充份且必要條件是有一個分佈$\mathbb{P}_\infty$使得$\rho(\mathbb{P}_t, \mathbb{P}_\infty)$趨近於零,有些時候這取決於定義的距離$\rho$有多麼精確。簡略的說,當距離$\rho$讓分佈序列更容易收斂時,它會導致較弱的拓撲。<sub>(見註1)</sub>Section-2闡明了概率距離在這方面的差異。 註1:更確切地說,$\rho$下的收斂序列集是$\rho$下的收斂序列集的超集時,$\rho$誘導的拓撲比$\rho'$誘導的拓撲弱。 ::: :::info In order to optimize the parameter $\theta$, it is of course desirable to define our model distribution $(\mathbb{P}_\theta)$ in a manner that makes the mapping $\theta \mapsto \mathbb{P}_\theta$ continuous. Continuity means that when a sequence of parameters $\theta_t$ converges to $\theta$, the distributions $\mathbb{P}_{\theta_t}$ also converge to $(\mathbb{P}_\theta)$. However, it is essential to remember that the notion of the convergence of the distributions $\mathbb{P}_{\theta_t}$ depends on the way we compute the distance between distributions. The weaker this distance, the easier it is to define a continuous mapping from $\theta$-space to $(\mathbb{P}_\theta)$-space, since it's easier for the distributions to converge. The main reason we care about the mapping $\theta \mapsto \mathbb{P}_\theta$ to be continuous is as follows. If $\rho$ is our notion of distance between two distributions, we would like to have a loss function $\theta \mapsto \rho(\mathbb{P}_\theta, \mathbb{P}_r)$ that is continuous, and this is equivalent to having the mapping $\theta \mapsto \mathbb{P}_\theta$ be continuous when using the distance between distributions $\rho$. ::: :::success 為了最佳化參數$\theta$,當然需求以某種方式定義我們的模型分佈讓$\theta \mapsto \mathbb{P}_\theta$的映射是連續的。連續性意味著,當一系列的參數$\theta_t$收斂為$\theta$時,其分佈$\mathbb{P}_{\theta_t}$也收斂為$(\mathbb{P}_\theta)$。然而,必需記得,資料分佈$\mathbb{P}_{\theta_t}$收斂的概念取決於我們計算資料分佈之間距離的方式。距離愈短就愈容易定義一個由$\theta$-space至$(\mathbb{P}_\theta)$-space的連續映射,因為分佈更容易收斂。我們關心映射$\theta \mapsto \mathbb{P}_\theta$是連續的主要原因如下。假如$\rho$是兩個資料分佈之間距離的概念,我們可能希望有一個連續的損失函數(loss function)$\theta \mapsto \rho(\mathbb{P}_\theta, \mathbb{P}_r)$,這相當於當使用分佈之間的距離$\rho$的時候,所有的映射$\theta \mapsto \mathbb{P}_\theta$是連續的。 ::: :::info The contributions of this paper are: * In Section 2, we provide a comprehensive theoretical analysis of how the Earth Mover (EM) distance behaves in comparison to popular probability distances and divergences used in the context of learning distributions. * In Section 3, we define a form of GAN called Wasserstein-GAN that minimizes a reasonable and efficient approximation of the EM distance, and we theoretically show that the corresponding optimization problem is sound. * In Section 4, we empirically show that WGANs cure the main training problems of GANs. In particular, training WGANs does not require maintaining a careful balance in training of the discriminator and the generator, and does not require a careful design of the network architecture either. The mode dropping phenomenon that is typical in GANs is also drastically reduced. One of the most compelling practical benefits of WGANs is the ability to continuously estimate the EM distance by training the discriminator to optimality. Plotting these learning curves is not only useful for debugging and hyperparameter searches, but also correlate remarkably well with the observed sample quality. 
:::

:::success
這篇論文的貢獻如下:
* 在Section-2中,我們提供了一個全面的理論分析,比較在學習分佈的情境下,Earth Mover (EM) distance與常用的機率距離及離散程度相比表現如何。
* 在Section-3中,我們定義了一種GAN的形式,稱之為Wasserstein-GAN,它最小化一個合理並且有效近似的Earth Mover distance,並且在理論上證明相對應的最佳化問題是合理的。
* 在Section-4中,我們以實驗顯示,WGANs可以解決GANs訓練遇到的主要問題。特別是,訓練WGANs不需要在訓練Discriminator與Generator的過程中小心維持平衡,也不需要特別設計網路架構。GANs中典型的mode dropping現象也大幅減少。WGANs最引人注目的一個實際優點在於,它可以透過將Discriminator訓練至最佳,來持續估計Earth Mover Distance。繪製這些學習曲線(learning curves)不只對除錯與超參數搜尋有幫助,也與觀察到的樣本品質有非常好的相關性。
:::

## 2 Different Distances
:::info
We now introduce our notation. Let $\mathcal{X}$ be a compact metric set (such as the space of images $\left[0; 1\right]^d$) and let $\Sigma$ denote the set of all the Borel subsets of $\mathcal{X}$. Let Prob($\mathcal{X}$) denote the space of probability measures defined on $\mathcal{X}$. We can now define elementary distances and divergences between two distributions $\mathbb{P}_r, \mathbb{P}_g \in$ Prob($\mathcal{X}$):

* The Total Variation (TV) distance
$$\delta(\mathbb{P}_r,\mathbb{P}_g)=\operatorname*{sup}_{A\in\sum}\vert\mathbb{P}_r(A)-\mathbb{P}_g(A)\vert$$

* The Kullback-Leibler (KL) divergence
$$KL(\mathbb{P}_r\Vert \mathbb{P}_g)=\int log\left(\dfrac{P_r(x)}{P_g(x)}\right)P_r(x)d\mu(x)$$
where both $\mathbb{P}_r$ and $\mathbb{P}_g$ are assumed to be absolutely continuous, and therefore admit densities, with respect to a same measure $\mu$ defined on $\mathcal{X}$.^2^ The KL divergence is famously asymmetric and possibly infinite when there are points such that $P_g(x)$ = 0 and $P_r(x)$ > 0.

* The Jensen-Shannon (JS) divergence
$$JS(\mathbb{P}_r, \mathbb{P}_g)=KL(\mathbb{P}_r\Vert \mathbb{P}_m)+KL(\mathbb{P}_g\Vert \mathbb{P}_m)$$
where $\mathbb{P}_m$ is the mixture $(\mathbb{P}_r+\mathbb{P}_g)/2$. This divergence is symmetrical and always defined because we can choose $\mu=\mathbb{P}_m$.

* The Earth-Mover (EM) distance or Wasserstein-1
$$W(\mathbb{P}_r, \mathbb{P}_g)=\operatorname*{inf}_{\gamma\in\prod(\mathbb{P}_r, \mathbb{P}_g)}\mathbb{E}_{(x,y)\sim\gamma}\left[\Vert x-y\Vert\right]$$
where $\prod(\mathbb{P}_r,\mathbb{P}_g)$ denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are respectively $\mathbb{P}_r$ and $\mathbb{P}_g$. Intuitively, $\gamma(x,y)$ indicates how much "mass" must be transported from $x$ to $y$ in order to transform the distributions $\mathbb{P}_r$ into the distribution $\mathbb{P}_g$. The EM distance then is the "cost" of the optimal transport plan.
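譯註:對於支撐在同一組離散點上的兩個分佈,上面這幾個量都可以直接算出來。以下是譯者補充的 NumPy 小示意(1 維的 Wasserstein-1 借用 `scipy.stats.wasserstein_distance` 計算;JS 依論文中的寫法,未除以 2),並非原論文的程式:

```python
import numpy as np
from scipy.stats import wasserstein_distance

xs = np.arange(5)                        # 共同的支撐點
Pr = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
Pg = np.array([0.3, 0.3, 0.2, 0.1, 0.1])

tv = 0.5 * np.abs(Pr - Pg).sum()         # 離散情形下 TV = (1/2) * L1 距離
kl = np.sum(Pr * np.log(Pr / Pg))        # KL(Pr || Pg),若 Pg 有 0 而 Pr > 0 會發散
Pm = (Pr + Pg) / 2
js = np.sum(Pr * np.log(Pr / Pm)) + np.sum(Pg * np.log(Pg / Pm))   # 論文中(未除以 2)的 JS
w1 = wasserstein_distance(xs, xs, Pr, Pg)   # 1 維的 Earth-Mover / Wasserstein-1

print(tv, kl, js, w1)
```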
::: :::success 我們現在介紹我們的符號。$\mathcal{X}$為`compact metric set`(像是影像空間$\left[0; 1\right]^d$),$\Sigma$表示$\mathcal{X}$的所有Borel子集的集合。Prob($\mathcal{X}$)表示在$\mathcal{X}$上定義的機率度量空間。我們現在可以定義兩個分佈$\mathbb{P}_r, \mathbb{P}_g \in$ Prob($\mathcal{X}$)間的基本距離與離散程度: * The Total Variation (TV) distance $$\delta(\mathbb{P}_r,\mathbb{P}_g)=\operatorname*{sup}_{A\in\sum}\vert\mathbb{P}_r(A)-\mathbb{P}_g(A)\vert$$ * The Kullback-Leibler (KL) divergence $$KL(\mathbb{P}_r\Vert \mathbb{P}_g)=\int log\left(\dfrac{P_r(x)}{P_g(x)}\right)P_r(x)d\mu(x)$$ 兩個分佈$\mathbb{P}_r$ and $\mathbb{P}_g$都假設它們是絕對連續的,因此`admit densities`,相對於在$\mathcal{X}$上定義的相同度量$\mu$。當存在$P_g(x)$ = 0 與 $P_r(x)$ > 0的點時,KL divergence是非對稱並且可能無限的。 * The Jensen-Shannon (JS) divergence $$JS(\mathbb{P}_r, \mathbb{P}_g)=KL(\mathbb{P}_r\Vert \mathbb{P}_m)+KL(\mathbb{P}_g\Vert \mathbb{P}_m)$$ $\mathbb{P}_m$為$(\mathbb{P}_r+\mathbb{P}_g)/2$。JS divergence是對稱並且總是可以被定義的,因為我們可以選擇$\mu=\mathbb{P}_m$。 * The Earth-Mover (EM) distance or Wasserstein-1 $$W(\mathbb{P}_r, \mathbb{P}_g)=\operatorname*{inf}_{\gamma\in\prod(\mathbb{P}_r, \mathbb{P}_g)}\mathbb{E}_{(x,y)\sim\gamma}\left[\Vert x-y\Vert\right]$$ 其中$\prod(\mathbb{P}_r,\mathbb{P}_g)$是聯合分佈$\gamma(x,y)$的集合,它的邊界分別為$\mathbb{P}_r$ and $\mathbb{P}_g$。直觀來看,$\gamma(x,y)$說明了為了轉移分佈,將$\mathbb{P}_r$轉為$\mathbb{P}_g$,有多少"mass"<sub>(土堆)</sub>必須從$x$移到$y$。那,EM distance即為最佳運輸計畫的"cost"。 ::: :::info The following example illustrates how apparently simple sequences of probability distributions converge under the EM distance but do not converge under the other distances and divergences defined above. ::: :::success 下面的範例說明機率分佈的簡單序列如何明顯的在EM distance下收斂,但卻不會在上述其它distance與divergences下收斂。 ::: :::info Example 1 (Learning parallel lines). Let $Z\sim U\left[0,1\right]$ the uniform distribution on the unit interval. Let $\mathbb{P}_0$ be the distribution of $(0,Z) \in \mathbb{R}^2$ (a 0 on the x-axis and the random variable $Z$ on the y-axis), uniform on a straight vertical line passing through the origin. Now let $g_\theta(z)=(\theta,z)$ with $\theta$ a single real parameter. It is easy to see that in this case, * $W(\mathbb{P}_0,\mathbb{P}_\theta)=\vert\theta\vert$ * $JS(\mathbb{P}_0,\mathbb{P}_\theta)= \begin{cases} log2& \text{if }\theta\neq 0,\\ 0& \text{if }\theta= 0 \end{cases}$ * $KL(\mathbb{P}_\theta\Vert\mathbb{P}_0)=KL(\mathbb{P}_0\Vert\mathbb{P}_\theta)= \begin{cases} +\infty& \text{if }\theta\neq 0,\\ 0& \text{if }\theta= 0 \end{cases}$ * and $\delta(\mathbb{P}_0,\mathbb{P}_\theta)= \begin{cases} 1& \text{if }\theta\neq 0,\\ 0& \text{if }\theta= 0 \end{cases}$ When $\theta \rightarrow 0$, the sequence $(\mathbb{P}_{\theta_t})_{t\in\mathbb{N}}$ converges to $\mathbb{P}_0$ under the EM distance, but does not converge at all under either the JS, KL, reverse KL, or TV divergences. Figure 1 illustrates this for the case of the EM and JS distances. 
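譯註:Example 1 的結論可以直接用數值驗證──兩條平行的垂直線之間,把 $(0,z)$ 搬到 $(\theta,z)$ 的耦合即為最優,期望成本恰為 $|\theta|$;而 JS 在 $\theta\neq 0$ 時恆為 $\log 2$,對 $\theta$ 沒有可用的梯度。以下為譯者補充的簡單示意(非原論文程式):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(size=100000)

def w_parallel_lines(theta):
    # 耦合 gamma 取 (0,z) <-> (theta,z)(在此例中即為最優),期望搬運成本 = |theta|
    p0 = np.stack([np.zeros_like(z), z], axis=1)
    pt = np.stack([np.full_like(z, theta), z], axis=1)
    return np.linalg.norm(p0 - pt, axis=1).mean()

for theta in [1.0, 0.5, 0.1, 0.0]:
    js = np.log(2.0) if theta != 0 else 0.0    # Example 1 給出的封閉解
    print(theta, w_parallel_lines(theta), js)  # W 隨 theta 線性下降,JS 則卡在 log 2
```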
::: :::success 範例-1(Example 1 (Learning parallel lines))。假設$Z\sim U\left[0,1\right]$為單位間隔上的平均分佈<sub>(uniform distribution)</sub>。假設$\mathbb{P}_0$的分佈是$(0,Z) \in \mathbb{R}^2$(a 0 on the x-axis and the random variable $Z$ on the y-axis),均勻分佈在穿過原點的垂直直線上。現在假設$g_\theta(z)=(\theta,z)$,$\theta$為單精實數參數。這很容易在這種情況下看到, * $W(\mathbb{P}_0,\mathbb{P}_\theta)=\vert\theta\vert$ * $JS(\mathbb{P}_0,\mathbb{P}_\theta)= \begin{cases} log2& \text{if }\theta\neq 0,\\ 0& \text{if }\theta= 0 \end{cases}$ * $KL(\mathbb{P}_\theta\Vert\mathbb{P}_0)=KL(\mathbb{P}_0\Vert\mathbb{P}_\theta)= \begin{cases} +\infty& \text{if }\theta\neq 0,\\ 0& \text{if }\theta= 0 \end{cases}$ * and $\delta(\mathbb{P}_0,\mathbb{P}_\theta)= \begin{cases} 1& \text{if }\theta\neq 0,\\ 0& \text{if }\theta= 0 \end{cases}$ 當$\theta \rightarrow 0$的時候,序列$(\mathbb{P}_{\theta_t})_{t\in\mathbb{N}}$在使用EM distance的情況下會收斂到$\mathbb{P}_0$,但在JS,KL,reverse KL或TV divergences完全不會收斂。Figure 1說明EM與JS distances的情況。 ::: <img src='https://i.imgur.com/x0aZNLi.png' width="300"/><img src='https://i.imgur.com/gcZGLJP.png' width="300"/> :::info Figure 1: These plots show $\rho(\mathbb{P}_\theta, \mathbb{P}_0)$ as a function of $\theta$ when $\rho$ is the EM distance (left plot) or the JS divergence (right plot). The EM plot is continuous and provides a usable gradient everywhere. The JS plot is not continuous and does not provide a usable gradient. ::: :::success Figure 1: 這圖說明$\rho(\mathbb{P}_\theta, \mathbb{P}_0)$是$\theta$的function,左圖是EM distance,右圖是JS divergence。EM的部份是連續,而且任何地方都提供了有效梯度。而JS的部份了不連續,也無法提供有效的梯度。 ::: :::info Example 1 gives us a case where we can learn a probability distribution over a low dimensional manifold by doing gradient descent on the EM distance. This cannot be done with the other distances and divergences because the resulting loss function is not even continuous. Although this simple example features distributions with disjoint supports, the same conclusion holds when the supports have a non empty intersection contained in a set of measure zero. This happens to be the case when two low dimensional manifolds intersect in general position [1]. ::: :::success Example 1 給了我們一個範例,我們可以通過對EM distance上做梯度下階來學習到低維流形(low dimensional manifold)的機率分佈。這在其它的distance與divergence上是做不到的,因為它們的loss function是不連續的。雖然這個簡單的範例是基於不相交支撐集(disjoint supports)的分佈,但是,當支撐集(supports)擁有有一個包含在零測度(measure zero)的非空交集時,結論同樣成立。當兩個低維流形(low dimensional manifolds)交集在一般位置(general position)的時候,這種情況就會發生[1]。 ::: :::info Since the Wasserstein distance is much weaker than the JS distance[3], we can now ask whether $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is a continuous loss function on $\theta$ under mild assumptions. This, and more, is true, as we now state and prove. ::: :::success 由於Wasserstein distance(EM distance)遠小於JS distance[3],我們現在可以問,在溫和假設情況下,$W(\mathbb{P}_r, \mathbb{P}_\theta)$是否為function-$\theta$上的continuous loss function。正如我們現在所陳述與證明,是的,那是真的。 ::: :::info ^3^ The argument for why this happens, and indeed how we arrived to the idea that Wasserstein is what we should really be optimizing is displayed in Appendix A. We strongly encourage the interested reader who is not afraid of the mathematics to go through it. ::: :::success ^3^ 關於發生這種情況的論點,以及我們是如何得出Wasserstein是我們真正應該優化的這個觀點,都在附錄A說明。我們強烈的鼓勵那些不害怕數學並感興趣的讀者去讀它。 ::: :::info **Theorem 1.** Let $\mathbb{P}_r$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable (e.g Gaussian) over another space $\mathcal{Z}$. 
Let $g:\mathcal{Z} \times \mathbb{R}^d \rightarrow \mathcal{X}$ be a function, that will be denoted $g_\theta(z)$ with $z$ the first coordinate and $\theta$ the second. Let $\mathbb{P}_\theta$ denote the distribution of $g_\theta(Z)$. Then,
1. If $g$ is continuous in $\theta$, so is $W(\mathbb{P}_r, \mathbb{P}_\theta)$.
2. If $g$ is locally Lipschitz and satisfies regularity assumption [1], then $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous everywhere, and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence $JS(\mathbb{P}_r, \mathbb{P}_\theta)$ and all the KLs.

Proof. See Appendix C
:::

:::success
**Theorem 1.** 假設$\mathbb{P}_r$是$\mathcal{X}$上的固定分佈。假設$Z$是另一個空間$\mathcal{Z}$上的一個隨機變數(e.g. Gaussian)。假設$g:\mathcal{Z} \times \mathbb{R}^d \rightarrow \mathcal{X}$是一個function,表示為$g_\theta(z)$,$z$為第一個座標,$\theta$為第二個。假設$\mathbb{P}_\theta$表示$g_\theta(Z)$的分佈。那麼,
1. 如果$g$在$\theta$上是連續的,那麼$W(\mathbb{P}_r, \mathbb{P}_\theta)$也是連續的。
2. 如果$g$為locally Lipschitz並且滿足正則性假設[1],那麼$W(\mathbb{P}_r, \mathbb{P}_\theta)$處處連續<sub>(continuous everywhere)</sub>,並且幾乎處處可微。
3. 對Jensen-Shannon divergence $JS(\mathbb{P}_r, \mathbb{P}_\theta)$與所有的KL而言,敘述1-2都不成立。

證明部份請參閱附錄C
:::

:::info
The following corollary tells us that learning by minimizing the EM distance makes sense (at least in theory) with neural networks.

**Corollary 1.** Let $g_\theta$ be any feedforward neural network[4] parameterized by $\theta$, and $p(z)$ a prior over $z$ such that $\mathbb{E}_{z\sim p(z)}\left[\Vert Z \Vert \right] < \infty$ (e.g. Gaussian, uniform, etc.). Then assumption [1] is satisfied and therefore $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous everywhere and differentiable almost everywhere.

Proof. See Appendix C

All this shows that EM is a much more sensible cost function for our problem than at least the Jensen-Shannon divergence. The following theorem describes the relative strength of the topologies induced by these distances and divergences, with KL the strongest, followed by JS and TV, and EM the weakest.
:::

:::success
下面的推論告訴我們,透過最小化EM distance來學習,對神經網路而言是有意義的(至少在理論上)。

**Corollary 1.** 假設$g_\theta$是任意的feedforward neural network[4],參數為$\theta$,而$p(z)$是$z$上的先驗分佈,使得$\mathbb{E}_{z\sim p(z)}\left[\Vert Z \Vert \right] < \infty$(e.g. Gaussian, uniform, etc.)。那麼假設[1]便成立,因此$W(\mathbb{P}_r, \mathbb{P}_\theta)$處處連續並且幾乎處處可微。

證明部份請參閱附錄C

這一切都表明,對我們的問題而言,EM至少比Jensen-Shannon divergence是更合理的cost function。下面定理描述由這些距離與離散程度所誘導出的拓撲的相對強度:KL最強,接著是JS與TV,EM最弱。
:::

:::info
^4^ By a feedforward neural network we mean a function composed by affine transformations and pointwise nonlinearities which are smooth Lipschitz functions (such as the sigmoid, tanh, elu, softplus, etc). **Note:** the statement is also true for rectifier nonlinearities but the proof is more technical (even though very similar) so we omit it.
:::

:::success
^4^ 所謂前饋神經網路,我們指的是由仿射變換(affine transformations)與點態非線性(pointwise nonlinearities)所組成的函數,其中的非線性是平滑的Lipschitz函數(像是sigmoid、tanh、elu、softplus等等)。**注意:**這個敘述對整流非線性函數(rectifier nonlinearities)而言也成立,但證明較為技術性(即使非常類似),因此我們省略它。
:::

:::info
**Theorem 2.** Let $\mathbb{P}$ be a distribution on a compact space $\mathcal{X}$ and $(\mathbb{P}_n)_{n\in\mathbb{N}}$ be a sequence of distributions on $\mathcal{X}$. Then, considering all limits as $n\rightarrow \infty$,
1. The following statements are equivalent
    * $\delta(\mathbb{P}_n, \mathbb{P}) \rightarrow 0$ with $\delta$ the total variation distance.
    * $JS(\mathbb{P}_n, \mathbb{P}) \rightarrow 0$ with JS the Jensen-Shannon divergence.
2. The following statements are equivalent
    * $W(\mathbb{P}_n, \mathbb{P}) \rightarrow 0$
    * $\mathbb{P}_n\phantom{2}\underrightarrow{D}\phantom{2}\mathbb{P}$ where $\underrightarrow{D}$ represents convergence in distribution for random variables.
3. $KL(\mathbb{P}_n\Vert\mathbb{P}) \rightarrow 0$ or $KL(\mathbb{P}\Vert\mathbb{P}_n) \rightarrow 0$ imply the statements in (1).
4. The statements in (1) imply the statements in (2).

Proof. See Appendix C

This highlights the fact that the KL, JS, and TV distances are not sensible cost functions when learning distributions supported by low dimensional manifolds. However the EM distance is sensible in that setup. This obviously leads us to the next section where we introduce a practical approximation of optimizing the EM distance.
:::

:::success
**Theorem 2.** 假設$\mathbb{P}$是緊緻空間<sub>(compact space)</sub>$\mathcal{X}$上的一個分佈,$(\mathbb{P}_n)_{n\in\mathbb{N}}$是$\mathcal{X}$上的一個分佈序列。然後,考慮所有$n\rightarrow \infty$時的極限,
1. 以下陳述是等價的
    * $\delta(\mathbb{P}_n, \mathbb{P}) \rightarrow 0$ with $\delta$ the total variation distance.
    * $JS(\mathbb{P}_n, \mathbb{P}) \rightarrow 0$ with JS the Jensen-Shannon divergence.
2. 以下陳述是等價的
    * $W(\mathbb{P}_n, \mathbb{P}) \rightarrow 0$
    * $\mathbb{P}_n\phantom{2}\underrightarrow{D}\phantom{2}\mathbb{P}$ where $\underrightarrow{D}$ 表示隨機變數的分佈收斂
3. $KL(\mathbb{P}_n\Vert\mathbb{P}) \rightarrow 0$ 或 $KL(\mathbb{P}\Vert\mathbb{P}_n) \rightarrow 0$ 蘊涵(1)中的陳述。
4. (1)中的陳述蘊涵(2)中的陳述。

證明部份請參閱附錄C

這突顯了一個事實:在學習由低維流形支撐的分佈時,KL、JS與TV distance都不是合理的cost function。然而,EM distance在這種設置下是合理的。這很顯然地引導我們進入下一個章節,我們將在那裡介紹一個最佳化EM distance的實際近似方法。
:::

## 3 Wasserstein GAN
:::info
Again, Theorem 2 points to the fact that $W(\mathbb{P}_r, \mathbb{P}_\theta)$ might have nicer properties when optimized than $JS(\mathbb{P}_r, \mathbb{P}_\theta)$. However, the infimum in (1) is highly intractable. On the other hand, the Kantorovich-Rubinstein duality [22] tells us that
$$W(\mathbb{P}_r, \mathbb{P}_\theta)=\operatorname*{sup}_{\Vert f \Vert_L\phantom{1}\leq \phantom{1}1}\mathbb{E}_{x\sim\mathbb{P}_r}\left[f(x)\right]-\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[f(x)\right] \qquad (2)$$
where the supremum is over all the 1-Lipschitz functions $f:\mathcal{X}\rightarrow\mathbb{R}$. Note that if we replace $\Vert f \Vert_L \leq 1$ for $\Vert f \Vert_L \leq K$ (consider K-Lipschitz for some constant K), then we end up with $K\cdot W(\mathbb{P}_r,\mathbb{P}_g)$. Therefore, if we have a parameterized family of functions $\left\{f_w\right\}_{w\in\mathcal{W}}$ that are all K-Lipschitz for some K, we could consider solving the problem
$$\operatorname*{max}_{w\in\mathcal{W}}\mathbb{E}_{x\sim\mathbb{P}_r}\left[f_w(x)\right]-\mathbb{E}_{z\sim p(z)}\left[f_w(g_\theta(z))\right] \qquad (3)$$
and if the supremum in (2) is attained for some $w\in\mathcal{W}$ (a pretty strong assumption akin to what's assumed when proving consistency of an estimator), this process would yield a calculation of $W(\mathbb{P}_r,\mathbb{P}_\theta)$ up to a multiplicative constant. Furthermore, we could consider differentiating $W(\mathbb{P}_r,\mathbb{P}_\theta)$ (again, up to a constant) by back-proping through equation (2) via estimating $\mathbb{E}_{z\sim p(z)}\left[\nabla_\theta f_w(g_\theta(z))\right]$. While this is all intuition, we now prove that this process is principled under the optimality assumption.
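譯註:補充上一段「把 $\Vert f\Vert_L\leq 1$ 換成 $\Vert f\Vert_L\leq K$ 會得到 $K\cdot W$」的一行推導──令 $f=Kg$,則 $\Vert f\Vert_L\leq K$ 等價於 $\Vert g\Vert_L\leq 1$,因此
$$\operatorname*{sup}_{\Vert f \Vert_L\leq K}\left(\mathbb{E}_{x\sim\mathbb{P}_r}\left[f(x)\right]-\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[f(x)\right]\right)=K\operatorname*{sup}_{\Vert g \Vert_L\leq 1}\left(\mathbb{E}_{x\sim\mathbb{P}_r}\left[g(x)\right]-\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[g(x)\right]\right)=K\cdot W(\mathbb{P}_r,\mathbb{P}_\theta)$$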
::: :::success 再次的,Theorem [2]指出一個事實,$W(\mathbb{P}_r, \mathbb{P}_\theta)$在最佳化的時候,相比$JS(\mathbb{P}_r, \mathbb{P}_\theta)$也許有較好的特性。然後,在\[1\]中的最大下界(infimum)是非常難以處理的。另一方面,Kantorovich-Rubinstein duality [22]告訴我們, $$W(\mathbb{P}_r, \mathbb{P}_\theta)=\operatorname*{sup}_{\Vert f \Vert_L\phantom{1}\leq \phantom{1}1}\mathbb{E}_{x\sim\mathbb{P}_r}\left[f(x)\right]-\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[f(x)\right] \qquad (2)$$ 最小上界(supremum)覆蓋所有1-Lipschitz functions $f:\mathcal{X}\rightarrow\mathbb{R}$。注意,如果我們將$\Vert f \Vert_L \leq 1$替換為$\Vert f \Vert_L \leq K$(考慮K-Lipschitz為某一個常數K),那我們最終得到$K\cdot W(\mathbb{P}_r,\mathbb{P}_g)$。因此,如果我們有一個參數化的函數族(family of functions)$\left\{f_w\right\}_{w\in\mathcal{W}}$,某些K的K-Lipschitz,那我們可以考慮解決這個問題。 $$\operatorname*{max}_{w\in\mathcal{W}}\mathbb{E}_{x\sim\mathbb{P}_r}\left[f_w(x)\right]-\mathbb{E}_{z\sim p(z)}\left[f_w(g_\theta(z))\right] \qquad (3)$$ 如果某個$w\in\mathcal{W}$達到最小上界(supremum)在(2)(一個非常強烈的假設,類似於證明估計量一致性時的假設),這個過程會產生一個計算$W(\mathbb{P}_r,\mathbb{P}_\theta)$,直到乘常數(multiplicative constant)。更進一步,我們可以考慮透過估計$\mathbb{E}_{z\sim p(z)}\left[\nabla_\theta f_w(g_\theta(z))\right]$通過方程式(2)對$W(\mathbb{P}_r,\mathbb{P}_\theta)$做反向傳播求導(再次的,直到常數。雖然這很直覺,但我們現在證明這個過程是最佳性假設下的原則。 ::: :::danger 取自[國家教育研究院](http://terms.naer.edu.tw/detail/1298487/) 名詞解釋-乘常數: 視距測量時,望遠鏡物鏡焦距與視距絲間隔之比值,即視距計算公式之,以此值乘視距絲在標尺上所截夾距而得儀器至標尺之距離,故稱乘常數,亦稱視距間隔因數,通常以K示之,其值一般于儀器設計製造時均定為100。見視距測量。 ::: :::info **Theorem 3.** Let $\mathbb{P}_r$ be any distribution. Let $\mathbb{P}_\theta$ be the distribution of $g_\theta(Z)$ with $Z$ a random variable with density $p$ and $g_\theta$ a function satisfying assumption \left[1\right]. Then, there is a solution $f:\mathcal{X}\rightarrow\mathbb{R}$ to the problem $$\operatorname*{max}_{\Vert f \Vert_L\phantom{1}\leq \phantom{1}1}\mathbb{E}_{x\sim\mathbb{P}_r}\left[f(x)\right]-\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[f(x)\right]$$ and we have $$\nabla_\theta W(\mathbb{P}_r,\mathbb{P}_\theta)=-\mathbb{E}_{z\sim p(z)}\left[\nabla_\theta f(g_\theta(z))\right]$$ when both terms are well-defined. Proof. See Appendix C ::: :::success **Theorem 3.** 假設$\mathbb{P}_r$是任意的分佈。假設$\mathbb{P}_\theta$是$g_\theta(Z)$的分佈,$Z$為帶有密度$p$的隨機變數,而$g_\theta$是滿足假設\left[1\right]的函數。然後,有一個解決這問題的方法$f:\mathcal{X}\rightarrow\mathbb{R}$。 $$\operatorname*{max}_{\Vert f \Vert_L\phantom{1}\leq \phantom{1}1}\mathbb{E}_{x\sim\mathbb{P}_r}\left[f(x)\right]-\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[f(x)\right]$$ 當兩個項目定義都明確的時候我們就有 $$\nabla_\theta W(\mathbb{P}_r,\mathbb{P}_\theta)=-\mathbb{E}_{z\sim p(z)}\left[\nabla_\theta f(g_\theta(z))\right]$$ 證明部份請參閱附錄C ::: :::info Now comes the question of finding the function $f$ that solves the maximization problem in equation (2). To roughly approximate this, something that we can do is train a neural network parameterized with weights $w$ lying in a compact space $\mathcal{W}$ and then backprop through $\mathbb{E}_{z\sim p(z)}\left[\nabla_\theta f_w(g_\theta(z))\right]$, as we would do with a typical GAN. Note that the fact that $\mathcal{W}$ is compact implies that all the functions $f_w$ will be $K$-Lipschitz for some $K$ that only depends on $\mathcal{W}$ and not the individual weights, therefore approximating (2) up to an irrelevant scaling factor and the capacity of the `critic' $f_w$. In order to have parameters $w$ lie in a compact space, something simple we can do is clamp the weights to a fixed box (say $\mathcal{W}=\left[-0.001,0.01\right]^l$) after each gradient update. The Wasserstein Generative Adversarial Network (WGAN) procedure is described in Algorithm 1. 
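譯註:下方 Algorithm 1 的流程,可以用 PyTorch 粗略地寫成如下的示意程式。這裡的網路架構與「真實資料」分佈都是譯者為了示範而假設的極簡版本(實際實驗用的是 DCGAN / MLP 與影像資料),僅供對照演算法的每個步驟,並非原論文的官方實作:

```python
import torch
import torch.nn as nn

# 假設性的極簡 critic f_w 與 generator g_theta(實際實驗中是 DCGAN / MLP)
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

alpha, c, m, n_critic = 5e-5, 0.01, 64, 5      # Algorithm 1 的預設超參數
opt_c = torch.optim.RMSprop(critic.parameters(), lr=alpha)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=alpha)

def sample_real(m):                            # 以 2 維高斯代替真實資料 P_r(示意用)
    return torch.randn(m, 2) + torch.tensor([2.0, 0.0])

for step in range(1000):                       # while theta 尚未收斂
    for _ in range(n_critic):                  # 先把 critic 訓練到接近最佳
        x = sample_real(m)
        z = torch.randn(m, 8)                  # z ~ p(z)
        # 最大化 E[f_w(x)] - E[f_w(g_theta(z))],等價於最小化其負值
        loss_c = -(critic(x).mean() - critic(generator(z).detach()).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():          # w <- clip(w, -c, c)
            p.data.clamp_(-c, c)
    z = torch.randn(m, 8)
    loss_g = -critic(generator(z)).mean()      # 更新 generator:最小化 -E[f_w(g_theta(z))]
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```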
::: :::success 現在的問題是找到解決方程式(2)中最大化問題的function-$f$。為了粗略的估計這點,我們能做的就是訓練一個神經網路,其權重-$w$位於一個緊密空間(compact space)-$\mathcal{W}$,然後通過$\mathbb{E}_{z\sim p(z)}\left[\nabla_\theta f_w(g_\theta(z))\right]$做反向傳播,就像是對典型的GAN做的事。注意,$\mathcal{W}$是緊密(compact)的這個事實意味著,對某些$K$而言,所有的functions-$f_w$都會是$K$-Lipschitz,它取決於$\mathcal{W}$,而非各別權重,因此近似於(2),直到一個不相關的[縮放因子](http://terms.naer.edu.tw/detail/3121571/)與`critic`-$f_w$的容量。為了讓參數$w$位於一個緊密空間(compact space)中,我們能做的事就是在每次梯度更新的時候將權重限制在某個固定區間(假設$\mathcal{W}=\left[-0.001,0.01\right]^l$)。Algorithm (1)描述了Wasserstein Generative Adversarial Network (WGAN)的過程。 ::: :::info Weight clipping is a clearly terrible way to enforce a Lipschitz constraint. If the clipping parameter is large, then it can take a long time for any weights to reach their limit, thereby making it harder to train the critic till optimality. If the clipping is small, this can easily lead to vanishing gradients when the number of layers is big, or batch normalization is not used (such as in RNNs). We experimented with simple variants (such as projecting the weights to a sphere) with little difference, and we stuck with weight clipping due to its simplicity and already good performance. However, we do leave the topic of enforcing Lipschitz constraints in a neural network setting for further investigation, and we actively encourage interested researchers to improve on this method. ::: :::success 用權重剪裁(clipping)來強制Lipschitz約束是一種明顯糟糕的方法。如果剪裁(clipping)參數過大,那們每一個權重都需要花費很長的時間才能到達它們的極值,從而造成訓練critic最佳化非常困難。如果剪裁(clipping)參數過小,那在神經網路很深或沒有使用batch normalization(像是RNNs)的情況下,它會很輕易的導致梯度消失(vanishing gradients)。我們嚐試簡單的變體(像是將權重<sub>(weights)</sub>投射到球面<sub>(sphere)</sub>),差異很小,並且由於它的簡單性和良好的效能,我們堅持使用權重剪裁(weight clipping)。然而,我們確實將強制Lipschitz約束的主題保留在神經網路設置中供進一步研究,我們積極鼓勵感興趣的研究人員來改進這個方法。 ::: :::info **Algorithm 1 WGAN**, our proposed algorithm. All experiments in the paper used the default values $\alpha$ = 0.00005, $c$ = 0.01, $m$ = 64, $n_{critic}$ = 5. **Require:** : $\alpha$, the learning rate. $c$, the clipping parameter. $m$, the batch size. $\qquad n_{critic}$, the number of iterations of the critic per generator iteration. **Require:** : $w_0$, initial critic parameters. $\theta_0$, initial generator's parameters. 1: **while** $\theta$ has not converged **do** 2: $\qquad$ **for** t = 0,...$n_{critic}$ **do** 3: $\qquad\qquad$ Sample $\left\{x^{(i)}\right\}^m_{i=1}\sim\mathbb{P}_r$ a batch from the real data. 4: $\qquad\qquad$ Sample $\left\{z^{(i)}\right\}^m_{i=1}\sim p(z)$ a batch of prior samples. 5: $\qquad\qquad$ $g_w\leftarrow\nabla_w\left[\dfrac{1}{m}\sum^m_{i=1}f_w(x^{(i)})-\dfrac{1}{m}\sum^m_{i=1}f_w(g_\theta(z^{(i)}))\right]$ 6: $\qquad\qquad$ $w\leftarrow w+\alpha\cdot \mathrm{RMSProp}(w, g_w)$ 7: $\qquad\qquad$ $w\leftarrow \mathrm{clip}(w, -c,c)$ 8: $\qquad$ **end for** 9: $\qquad$ Sample $\left\{z^{(i)}\right\}^m_{i=1}\sim p(z)$ a batch of prior samples. 10: $\qquad$ $g_\theta \leftarrow - \nabla_\theta\dfrac{1}{m}\sum^m_{i=1}f_w(g_\theta(z^{(i)}))$ 11: $\qquad$ $\theta \leftarrow \theta - \alpha \cdot \mathrm{RMSProp}(\theta,g_\theta)$ 12: **end while** ::: :::info The fact that the EM distance is continuous and differentiable a.e. means that we can (and should) train the critic till optimality. The argument is simple, the more we train the critic, the more reliable gradient of the Wasserstein we get, which is actually useful by the fact that Wasserstein is differentiable almost everywhere. 
For the JS, as the discriminator gets better the gradients get more reliable but the true gradient is 0 since the JS is locally saturated and we get vanishing gradients, as can be seen in Figure 1 of this paper and Theorem 2.4 of [1]. In Figure 2 we show a proof of concept of this, where we train a GAN discriminator and a WGAN critic till optimality. The discriminator learns very quickly to distinguish between fake and real, and as expected provides no reliable gradient information. The critic, however, can't saturate, and converges to a linear function that gives remarkably clean gradients everywhere. The fact that we constrain the weights limits the possible growth of the function to be at most linear in different parts of the space, forcing the optimal critic to have this behaviour. ::: :::success EM-distance是連續而且可微的事實意味著我們可以(應該)訓練critic一直到[最佳性](http://terms.naer.edu.tw/detail/2121025/)。這論點很簡單,訓練critic愈多,我們得到的Wasserstein梯度就愈可靠,事實上這非常有用,因為Wasserstein幾乎處處可微。對於JS-divergence,即使discriminator得到更好的梯度,也更可靠,但是險際上它的梯度是0,因為JS是局部飽合的(locally saturated),我們得到消失的梯度,可以參考此論文中的Figure 1以及Theorem 2.4 of [1]。在Figure 2,我們說明這個論述的證明,我們訓練一個GAN-discriminator與WGAN-critic直到最佳性。discriminator很快的學會辨識真假,也如預期般的提供不可靠的梯度資訊。然而,critic不能飽合,並且收斂為線性函數,在每一個地方都給出了非常完美的梯度。事實上,我們約束權重的這件事限制函數在空間不同部位可能增長為最多線性,迫使最佳critic有這種行為。 ::: :::info Perhaps more importantly, the fact that we can train the critic till optimality makes it impossible to collapse modes when we do. This is due to the fact that mode collapse comes from the fact that the optimal generator for a fixed discriminator is a sum of deltas on the points the discriminator assigns the highest values, as observed by [4] and highlighted in [11]. ::: :::success 也許更重要的是,當我們這樣做的時候,事實上我們可以訓練critic一直到最佳性,這確保訓練過程不會壞掉。壞掉的一個原因事實在於,固定discriminator的最佳generator是分配最高值的點上的增量總和,如[4]觀察與[11]的highlight。 ::: :::info In the following section we display the practical benefits of our new algorithm, and we provide an in-depth comparison of its behaviour and that of traditional GANs. ::: :::success 下一個章節中,我們會展示新演算法的實際優勢,並提供一個它與傳統GANs的深入比較。 ::: ![](https://i.imgur.com/kbiBeNR.png) :::info Figure 2: Optimal discriminator and critic when learning to differentiate two Gaussians. As we can see, the discriminator of a minimax GAN saturates and results in vanishing gradients. Our WGAN critic provides very clean gradients on all parts of the space. ::: :::success Figure 2: 當學習區分兩個Gaussians時,最佳的discriminator與critic。我們可以看到,minimax GAN的discriminator是飽合並且引發梯度消失。我們WGAN的critic在空間的所有地方都提供了非常完美的梯度。 ::: ## 4 Empirical Results :::info We run experiments on image generation using ourWasserstein-GAN algorithm and show that there are significant practical benefits to using it over the formulation used in standard GANs. We claim two main benefits: * a meaningful loss metric that correlates with the generator's convergence and sample quality * improved stability of the optimization process ::: :::success 我們使用我們的Wasserstein-Gan演算法進行了影像生成的實驗,結果表明,與標準GANs中使用的公式相比,使用該演算法具有顯著的實際效益。 我們說明兩個主要優點: * 一個有意義的損失評測,與generator的收斂性、採樣品質有關 * 提高優化過程的穩定性 ::: ### 4.1 Experimental Procedure :::info We run experiments on image generation. The target distribution to learn is the LSUN-Bedrooms dataset [24] - a collection of natural images of indoor bedrooms. Our baseline comparison is DCGAN [18], a GAN with a convolutional architecture trained with the standard GAN procedure using the -$logD$ trick [4]. The generated samples are 3-channel images of 64x64 pixels in size. 
We use the hyper-parameters specified in Algorithm 1 for all of our experiments. ::: :::success 我們進行影像生成的實驗。學習的目標分佈資料集是LSUN-Bedrooms[24]-一個收集室內臥房的自然影像集合。我們的比較基線是DCGAN[18],它是一個帶有卷積架構的標準GAN,使用$logD$[4]。生成的樣本是64x64x3的影像。我們所有實驗使用的超參數都如Algorithm 1所設定。 ::: :::info Figure 3: Training curves and samples at different stages of training. We can see a clear correlation between lower error and better sample quality. Upper left: the generator is an MLP with 4 hidden layers and 512 units at each layer. The loss decreases consistently as training progresses and sample quality increases. Upper right: the generator is a standard DCGAN. The loss decreases quickly and sample quality increases as well. In both upper plots the critic is a DCGAN without the sigmoid so losses can be subjected to comparison. Lower half: both the generator and the discriminator are MLPs with substantially high learning rates (so training failed). Loss is constant and samples are constant as well. The training curves were passed through a median filter for visualization purposes. ::: :::success Figure 3: 訓練曲線與不同訓練階段的樣本。我們可以看到在較低誤差與較佳取樣品質看到明顯相關性。 上排左:generator is an MLP with 4 hidden layers and 512 units at each layer。 $\qquad$隨著訓練的過程與取樣品質的提高,損失(loss)逐步下降。 上排右:the generator is a standard DCGAN。 $\qquad$損失(loss)快速下降,並且取樣品質也上升。 上排兩張圖的critic為DCGAN,沒有sigmoid,因此可以對損失(loss)進行比較。 下排中:generator與discriminator皆為MLPs,都有相當高的學習效率(learning rates-$\alpha$)(因此訓練失敗) $\qquad$損失(loss)與取樣(sample)都沒變化。 為了可視化,訓練曲線都通過中值濾波。 ::: ## 4.2 Meaningful loss metric :::info Because the WGAN algorithm attempts to train the critic $f$ (lines 2-8 in Algorithm 1) relatively well before each generator update (line 10 in Algorithm 1), the loss function at this point is an estimate of the EM distance, up to constant factors related to the way we constrain the Lipschitz constant of $f$. Our first experiment illustrates how this estimate correlates well with the quality of the generated samples. Besides the convolutional DCGAN architecture, we also ran experiments where we replace the generator or both the generator and the critic by 4-layer ReLU-MLP with 512 hidden units. Figure 3 plots the evolution of the WGAN estimate (3) of the EM distance during WGAN training for all three architectures. The plots clearly show that these curves correlate well with the visual quality of the generated samples. To our knowledge, this is the first time in GAN literature that such a property is shown, where the loss of the GAN shows properties of convergence. This property is extremely useful when doing research in adversarial networks as one does not need to stare at the generated samples to figure out failure modes and to gain information on which models are doing better over others. ::: :::success 因為WGAN演算法嚐試在每次更新generator前(line 10 in Algorithm 1)相對較好的訓練critic-$f$(lines 2-8 in Algorithm 1),因此在這時候的loss function是EM distance的估算值,直到與約束Lipschitz常數$f$的方法相關的常數因數(upper bound)。 我們第一個實驗說明這估計值與生成影像的品質有多麼相關。除了卷積架構-DCGAN,我們也執行實驗,我們用4-layer ReLU-MLP with 512 hidden units替換generator或同時替換generator與critic。 Figure 3繪製出三種架構在訓練期間的EM-distance的估計(3)變化。圖表清楚的顯示出,三條曲線與生成的取樣品質有很好的相關性。 據我們所知,這是GAN的文獻中第一次有這樣的屬性顯示,其中GAN的loss顯示出收斂性。這個屬性對於在研究adversarial networks非常有幫助,因為不需要盯著生成的樣本來找出故障模式並獲得那些模型做的比那些做的更好的資訊。 ::: <img src='https://i.imgur.com/dvtUzTU.png' width="300"/><img src='https://i.imgur.com/kXh5Cta.png' width="300"/> <img src='https://i.imgur.com/y8WZx24.png' width="300"/> :::info Figure 4: JS estimates for an MLP generator (upper left) and a DCGAN generator (upper right) trained with the standard GAN procedure. 
Both had a DCGAN discriminator. Both curves have increasing error. Samples get better for the DCGAN but the JS estimate increases or stays constant, pointing towards no significant correlation between sample quality and loss. Bottom: MLP with both generator and discriminator. The curve goes up and down regardless of sample quality. All training curves were passed through the same median filter as in Figure 3. ::: :::success Figure 4: JS估測以標準GAN流程訓練的MLP generator(上圖左)與DCGAN generator(上圖右)。兩者皆有DCGAN discriminator。兩個曲線的誤差都在增加。DCGAN的取樣變好,但是JS的估計增加或維持不變,這表明了取樣品質與loss沒有明顯的相關性。 下面:generator與discriminator皆為MLP。不管取樣品質好壞,曲線始終上下震盪。所有的訓練曲線皆與Figure 3相同,經過中值濾波處理。 ::: :::info However, we do not claim that this is a new method to quantitatively evaluate generative models yet. The constant scaling factor that depends on the critic's architecture means it's hard to compare models with different critics. Even more, in practice the fact that the critic doesn't have infinite capacity makes it hard to know just how close to the EM distance our estimate really is. This being said, we have succesfully used the loss metric to validate our experiments repeatedly and without failure, and we see this as a huge improvement in training GANs which previously had no such facility. ::: :::success 然而,我們並未聲明這是一種定量評估生成模型的新方法。固定縮放因數(constant scaling factor)取決於critic的架構,這意味著它是非常難將模型與不同critics做比較。實踐中,事實上,critic並沒有無限的容量,這造成它難以理解實際的EM-distance與我們估測的有多接近。儘管如此,我們還是成功的使用這個損失度量(loss metric)來反覆驗證我們的實驗而且沒有失敗,我們認為相較之前沒有這樣的作法,這是在訓練GANs上的一大進步 ::: :::info In contrast, Figure 4 plots the evolution of the GAN estimate of the JS distance during GAN training. More precisely, during GAN training, the discriminator is trained to maximize $$L(D,g_\theta)=\mathbb{E}_{x\sim\mathbb{P}_r}\left[logD(x)\right]+\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[log(1-D(x))\right]$$ which is is a lower bound of $2JS(\mathbb{P}_r,\mathbb{P}_\theta)-2log2$. In the figure, we plot the quantity $\dfrac{1}{2}L(D,g_\theta)+ log2$ , which is a lower bound of the JS distance. ::: :::success 相比之下,Figure 4繪製出GAN訓練過程中的JS distince的變化。更精確的說,在GAN訓練過程中,discriminator被訓練來最大化 $$L(D,g_\theta)=\mathbb{E}_{x\sim\mathbb{P}_r}\left[logD(x)\right]+\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[log(1-D(x))\right]$$ 這是$2JS(\mathbb{P}_r,\mathbb{P}_\theta)-2log2$的下限。圖表中我們繪製了$\dfrac{1}{2}L(D,g_\theta)+ log2$,它是JS distance的下限數值。 ::: :::info This quantity clearly correlates poorly the sample quality. Note also that the JS estimate usually stays constant or goes up instead of going down. In fact it often remains very close to log $2 \approx 0.69$ which is the highest value taken by the JS distance. In other words, the JS distance saturates, the discriminator has zero loss, and the generated samples are in some cases meaningful (DCGAN generator, top right plot) and in other cases collapse to a single nonsensical image [4]. This last phenomenon has been theoretically explained in [1] and highlighted in [11]. When using the $-logD$ trick [4], the discriminator loss and the generator loss are different. Figure 8 in Appendix E reports the same plots for GAN training, but using the generator loss instead of the discriminator loss. This does not change the conclusions. Finally, as a negative result, we report thatWGAN training becomes unstable at times when one uses a momentum based optimizer such as Adam [8] (with $\beta_1>0$) on the critic, or when one uses high learning rates. Since the loss for the critic is nonstationary, momentum based methods seemed to perform worse. 
We identified momentum as a potential cause because, as the loss blew up and samples got worse, the cosine between the Adam step and the gradient usually turned negative. The only places where this cosine was negative was in these situations of instability. We therefore switched to RMSProp [21] which is known to perform well even on very nonstationary problems [13].
:::

:::success
這個數值與取樣品質的相關性顯然很差。也注意到,JS估測值通常保持不變或上升,而不是下降。事實上,JS估測值通常維持在非常接近log $2 \approx 0.69$,這是JS distance能取到的最大值。換句話說,JS distance飽合了,discriminator的loss為零,某些情況下生成的樣本是有意義的(DCGAN generator, top right plot),在其它情況則坍縮(collapse)成單一個無意義的影像[4]。後面這種現象在[1]中有理論上的解釋,並在[11]中被強調。
當使用$-logD$ trick[4]的時候,discriminator的損失(loss)與generator的損失(loss)是不同的。附錄E的Figure 8報告了GAN訓練的相同圖,只是使用generator的損失(loss)而不是discriminator的損失(loss)。這並不會改變結論。
最後,作為一個負面結果,我們要報告:當使用基於momentum的最佳化器(像是Adam[8],且$\beta_1>0$)或較大的學習率(learning rate-$\alpha$)來訓練critic的時候,WGAN的訓練有時會變得不穩定。因為critic的損失(loss)並非平穩的,基於momentum的方法似乎表現更差。我們認定momentum是潛在的原因,因為當損失(loss)飆升而取樣變糟的時候,Adam的更新步與梯度之間的餘弦(cosine)通常會轉為負值。只有在這種不穩定的情況下,這個餘弦(cosine)才會是負值。因此我們改用RMSProp [21],已知它即使在非常不平穩的問題上[13]也能表現良好。
:::

## 4.3 Improved stability
:::info
One of the benefits of WGAN is that it allows us to train the critic till optimality. When the critic is trained to completion, it simply provides a loss to the generator that we can train as any other neural network. This tells us that we no longer need to balance generator and discriminator's capacity properly. The better the critic, the higher quality the gradients we use to train the generator.

We observe that WGANs are much more robust than GANs when one varies the architectural choices for the generator. We illustrate this by running experiments on three generator architectures: (1) a convolutional DCGAN generator, (2) a convolutional DCGAN generator without batch normalization and with a constant number of filters, and (3) a 4-layer ReLU-MLP with 512 hidden units. The last two are known to perform very poorly with GANs. We keep the convolutional DCGAN architecture for the WGAN critic or the GAN discriminator.

Figures 5, 6, and 7 show samples generated for these three architectures using both the WGAN and GAN algorithms. We refer the reader to Appendix F for full sheets of generated samples. Samples were not cherry-picked. In no experiment did we see evidence of mode collapse for the WGAN algorithm.
:::

:::success
WGAN的優勢之一就是它允許我們將critic訓練至最佳。當critic訓練完成的時候,它就只是提供一個loss給generator,讓我們可以像訓練其他神經網路一樣訓練它。這告訴我們,我們不再需要刻意去平衡generator與discriminator的容量。critic愈好,我們用來訓練generator的梯度品質就愈高。

我們觀察到,當改變generator架構的時候,WGANs比GANs更具有魯棒性(robust)。我們透過執行三種generator架構的實驗來說明:(1) a convolutional DCGAN generator,(2) a convolutional DCGAN generator without batch normalization and with a constant number of filters,與(3) a 4-layer ReLU-MLP with 512 hidden units。已知後面兩個架構在GANs上的效能非常差。我們為WGAN的critic或GAN的discriminator保留了卷積架構的DCGAN。

Figures 5, 6, and 7顯示這三種架構分別使用WGAN與GAN演算法所生成的樣本。請讀者參考附錄F以取得生成樣本全圖。樣本並不是只挑好的<sub>(cherry-picked採櫻桃,美式用語,意指只挑好的)</sub>。我們並沒有在任何實驗中看到WGAN發生mode collapse的證據。
:::

<img src='https://i.imgur.com/pWQzBLt.png' width=300/><img src='https://i.imgur.com/ehtjQb4.png' width=300/>

:::info
Figure 5: Algorithms trained with a DCGAN generator. Left: WGAN algorithm. Right: standard GAN formulation. Both algorithms produce high quality samples.
::: :::success Figure 5: DCGAN generator訓練的演算法。左邊:WGAN。右邊:標準GAN。兩個演算法都產出高品質的樣本。 ::: <img src='https://i.imgur.com/eFRMp40.png' width=300/><img src='https://i.imgur.com/0BAUE1V.png' width=300/> :::info Figure 6: Algorithms trained with a generator without batch normalization and constant number of filters at every layer (as opposed to duplicating them every time as in [18]). Aside from taking out batch normalization, the number of parameters is therefore reduced by a bit more than an order of magnitude. Left: WGAN algorithm. Right: standard GAN formulation. As we can see the standard GAN failed to learn while the WGAN still was able to produce samples. ::: :::success Figure 6: generator沒有batch normalization並維持每一層的filter數量(而不是每次都重複它們[18])。除了取出batch normalization,參數量也降低一個量級。左:WGAN。右:標準GAN。我們可以看到GAN已經失敗了但WGAN依然可以生成樣本。 ::: <img src='https://i.imgur.com/bwIqBFo.png' width=300/><img src='https://i.imgur.com/lHN4Bcx.png' width=300/> :::info Figure 7: Algorithms trained with an MLP generator with 4 layers and 512 units with ReLU nonlinearities. The number of parameters is similar to that of a DCGAN, but it lacks a strong inductive bias for image generation. Left: WGAN algorithm. Right: standard GAN formulation. The WGAN method still was able to produce samples, lower quality than the DCGAN, and of higher quality than the MLP of the standard GAN. Note the significant degree of mode collapse in the GAN MLP. ::: :::success Figure 7: MLP with 4 layers and 512 units with ReLU nonlinearities的Generator。參數量相似於DCGAN,但影像生成並沒有很強烈的[歸納偏差](https://zh.wikipedia.org/wiki/%E6%AD%B8%E7%B4%8D%E5%81%8F%E7%BD%AE)。左圖:WGAN。右圖:標準GAN。WGAN依然可以產出樣本,品質較DCGAN低,但高於MLP架構的標準GAN。注意到GAN-MLP中mode collapse的顯著程度 ::: ## 5 Related Work :::info There's been a number of works on the so called Integral Probability Metrics (IPMs)[15]. Given $\mathcal{F}$ a set of functions from $\mathcal{X}$ to $\mathbb{R}$, we can define $$d_\mathcal{F}(\mathbb{P}_r,\mathbb{P}_\theta)=\operatorname*{sup}_{f\in\mathcal{F}}\mathbb{E}_{x\sim\mathbb{P}_r}\left[f(x)\right]-\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[f(x)\right] \qquad (4)$$ as an integral probability metric associated with the function class $\mathcal{F}$. It is easily verified that if for every $f\in\mathcal{F}$ we have $-f\in\mathcal{F}$ (such as all examples we'll consider), then $d_\mathcal{F}$ is nonnegative, satisfies the triangular inequality, and is symmetric. Thus, $d_\mathcal{F}$ is a pseudometric over Prob($\mathcal{X}$). While IPMs might seem to share a similar formula, as we will see different classes of functions can yeald to radically different metrics. ::: :::success 關於Integral Probability Metrics (IPMs)[15]已經有很多的研究。給定$\mathcal{F}$由$\mathcal{X}$到$\mathbb{R}$的一組函數,我們可以定義 $$d_\mathcal{F}(\mathbb{P}_r,\mathbb{P}_\theta)=\operatorname*{sup}_{f\in\mathcal{F}}\mathbb{E}_{x\sim\mathbb{P}_r}\left[f(x)\right]-\mathbb{E}_{x\sim\mathbb{P}_\theta}\left[f(x)\right] \qquad (4)$$ 做為與函數類$\mathcal{F}$相關的integral probability metric。這很容易驗證,如果對於每一個$f\in\mathcal{F}$我們有$-f\in\mathcal{F}$(也許我們考慮所有資料集),然後$d_\mathcal{F}$為非負,滿足三角不等式,並且為對稱。因此,$d_\mathcal{F}$為Prob($\mathcal{X}$)的[擬度量](http://terms.naer.edu.tw/detail/2122633/)。雖然IPMs似乎可以共用類似的數學式,但我們將看到不同類別的函數可以採用完全不同的度量標準。 ::: :::info * By the Kantorovich-Rubinstein duality [22], we know that $W(\mathbb{P}_r, \mathbb{P}_\theta)=d_\mathcal{F}(\mathbb{P}_r, \mathbb{P}_\theta)$ when $\mathcal{F}$ is the set of 1-Lipschitz functions. 
Furthermore, if $\mathcal{F}$ is the set of $K$-Lipschitz functions, we get $K\cdot W(\mathbb{P}_r, \mathbb{P}_\theta)=d_\mathcal{F}(\mathbb{P}_r, \mathbb{P}_\theta)$. * When $\mathcal{F}$ is the set of all measurable functions bounded between -1 and 1 (or all continuous functions between -1 and 1), we retrieve $d_\mathcal{F}(\mathbb{P}_r, \mathbb{P}_\theta)=\delta(\mathbb{P}_r, \mathbb{P}_\theta)$ the total variation distance [15]. This already tells us that going from 1-Lipschitz to 1-Bounded functions drastically changes the topology of the space, and the regularity of $d_\mathcal{F}(\mathbb{P}_r, \mathbb{P}_\theta)$ as a loss function (as by Theorems 1 and 2). * Energy-based GANs (EBGANs) [25] can be thought of as the generative approach to the total variation distance. This connection is stated and proven in depth in Appendix D. At the core of the connection is that the discriminator will play the role of $f$ maximizing equation (4) while its only restriction is being between 0 and $m$ for some constant $m$. This will yeald the same behaviour as being restricted to be between -1 and 1 up to a constant scaling factor irrelevant to optimization. Thus, when the discriminator approaches optimality the cost for the generator will aproximate the total variation distance $\delta(\mathbb{P}_r, \mathbb{P}_\theta)$. Since the total variation distance displays the same regularity as the JS, it can be seen that EBGANs will suffer from the same problems of classical GANs regarding not being able to train the discriminator till optimality and thus limiting itself to very imperfect gradients. ::: :::success * 透過Kantorovich-Rubinstein duality [22],我們知道當$\mathcal{F}$是1-Lipschitz functions的集合時,$W(\mathbb{P}_r, \mathbb{P}_\theta)=d_\mathcal{F}(\mathbb{P}_r, \mathbb{P}_\theta)$,如果$\mathcal{F}$是$K$-Lipschitz functions的集合,我們得到$K\cdot W(\mathbb{P}_r, \mathbb{P}_\theta)=d_\mathcal{F}(\mathbb{P}_r, \mathbb{P}_\theta)$。 * 當$\mathcal{F}$是所有可測量函數的集合,界定於-1到1之間(或所有介於-1到1之間的連續函數),我們檢索 $d_\mathcal{F}(\mathbb{P}_r, \mathbb{P}_\theta)=\delta(\mathbb{P}_r, \mathbb{P}_\theta)$的總變異距離<sub>(total variation distance)</sub>[15]。這已經告訴我們,從1-Lipschitz到1-Bounded會大幅的改變空間的拓撲結構,以及$d_\mathcal{F}(\mathbb{P}_r, \mathbb{P}_\theta)$做為損失函數的規律性(如Theorems 1 and 2)。 * Energy-based GANs (EBGANs) [25] 可以被認為是總變異距離<sub>(total variation distance)</sub>的生成方法。這種連結在附錄D有深入說明與證明。連接的核心在於,discriminator會起到讓$f$最大化數學式(4)的作用,對某些常數$m$而言,它唯一限制就是要介於0與$m$之間。這跟限制介於-1與1之間的行為完全相同,直到與最佳化無關的固定縮放因數(constant scaling factor)。因此,當discriminator到達最佳性的時候,generator的cost會近似於總變異距離<sub>(total variation distance)</sub>$\delta(\mathbb{P}_r, \mathbb{P}_\theta)$。 由於總變異距離<sub>(total variation distance)</sub>顯示出與JS相同的規律性,可以看的出來,EBGANs將會遇到與典型GAN相同的問題,無法最佳化discriminator,從而將自身限制在非常不完美的梯度上。 ::: :::info * Maximum Mean Discrepancy (MMD) [5] is a specific case of integral probability metrics when $\mathcal{F}=\left\{f\in\mathcal{H}:\Vert f \Vert_\infty\right\}$ for $\mathcal{H}$ some Reproducing Kernel Hilbert Space (RKHS) associated with a given kernel $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$. As proved on [5] we know that MMD is a proper metric and not only a pseudometric when the kernel is universal. 
In the specific case where $\mathcal{H}=L^2(\mathcal{X},m)$ for $m$ the normalized Lebesgue measure on $\mathcal{X}$, we know that $\left\{f\in C_b(\mathcal{X},\Vert f \Vert_\infty\leq 1)\right\}$ will be contained in $\mathcal{F}$, and therefore $d_\mathcal{F}(\mathbb{P}_r,\mathbb{P}_\theta)\leq\delta(\mathbb{P}_r,\mathbb{P}_\theta)$ so the regularity of the MMD distance as a loss function will be at least as bad as the one of the total variation. Nevertheless this is a very extreme case, since we would need a very powerful kernel to approximate the whole $L^2$. However, even Gaussian kernels are able to detect tiny noise patterns as recently evidenced by [20]. This points to the fact that especially with low bandwidth kernels, the distance might be close to a saturating regime similar as with total variation or the JS. This obviously doesn't need to be the case for every kernel, and figuring out how and which different MMDs are closer to Wasserstein or total variation distances is an interesting topic of research. The great aspect of MMD is that via the kernel trick there is no need to train a separate network to maximize equation (4) for the ball of a RKHS. However, this has the disadvantage that evaluating the MMD distance has computational cost that grows quadratically with the amount of samples used to estimate the expectations in (4). This last point makes MMD have limited scalability, and is sometimes inapplicable to many real life applications because of it. There are estimates with linear computational cost for the MMD [5] which in a lot of cases makes MMD very useful, but they also have worse sample complexity. ::: :::success 當$\mathcal{F}=\left\{f\in\mathcal{H}:\Vert f \Vert_\infty\right\}$($\mathcal{H}$為某些Reproducing Kernel Hilbert Space (RKHS)與給定的kernel-$k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$相關聯)時,最大均值誤差<sub>(Maximum Mean Discrepancy)</sub>(MMD) [5]是integral probability metrics(IPMs)的特定情況。如[5]所證明,我們知道MMD是一個合適的度量,而不僅是kernel通用時的[擬度量](http://terms.naer.edu.tw/detail/2122633/)。在特定情況$\mathcal{H}=L^2(\mathcal{X},m)$,$m$為$\mathcal{X}$上標準化的[李貝克測度](http://terms.naer.edu.tw/detail/67551/),我們知道$\left\{f\in C_b(\mathcal{X},\Vert f \Vert_\infty\leq 1)\right\}$會包含在$\mathcal{F}$內,因此$d_\mathcal{F}(\mathbb{P}_r,\mathbb{P}_\theta)\leq\delta(\mathbb{P}_r,\mathbb{P}_\theta)$,所以,作為損失函數的MMD distance的規律性至少會與總變異(total variation)的規律性一樣差。不過,這是一個非常極端的情況,因為我們需要一個非常強大的kernel來接近(approximate)整個L2。然而,即使是高斯kernel也能檢測出微小的雜訊模式,正如[20]最近所證明的那樣。這表明特別是對於low bandwidth kernels,distance可能接近類似於JS或總變異的飽合狀態。很明顯的並不是每一個kernel都有這種情況,找出不同MMD如何以及哪些MMD更接近Wasserstein或總變異距離(total variation distances)是一個有趣的研究課題。 MMD重要方面是,通過kernel trick,無需訓練一個單獨的網路來最大化RKHS球的方程式(4)。然而,這是有缺點的,評估MMD distance的計算成本會隨著用於估計(4)中期望值的樣本數目有二次方的增長。最後一點使MMD具有有限的可擴展性,並且有時候因此而無法適用於許多實際的應用程式。MMD[5]有著線性計算成本的估測,很多情況下MMD非常有用,但它們的樣本複雜度也較差。 ::: :::info * Generative Moment Matching Networks (GMMNs) [10, 2] are the generative counterpart of MMD. By backproping through the kernelized formula for equation (4), they directly optimize $d_{MMD}((\mathbb{P}_r,\mathbb{P}_\theta))$ (the IPM when F is as in the previous item). As mentioned, this has the advantage of not requiring a separate network to approximately maximize equation (4). However, GMMNs have enjoyed limited applicability. Partial explanations for their unsuccess are the quadratic cost as a function of the number of samples and vanishing gradients for low-bandwidth kernels. 
Furthermore, it may be possible that some kernels used in practice are unsuitable for capturing very complex distances in high dimensional sample spaces such as natural images. This is properly justified by the fact that [19] shows that for the typical Gaussian MMD test to be reliable (as in it's power as a statistical test approaching 1), we need the number of samples to grow linearly with the number of dimensions. Since the MMD computational cost grows quadratically with the number of samples in the batch used to estimate equation (4), this makes the cost of having a reliable estimator grow quadratically with the number of dimensions, which makes it very inapplicable for high dimensional problems. Indeed, for something as standard as 64x64 images, we would need minibatches of size at least 4096 (without taking into account the constants in the bounds of [19] which would make this number substantially larger) and a total cost per iteration of 40962, over 5 orders of magnitude more than a GAN iteration when using the standard batch size of 64. That being said, these numbers can be a bit unfair to the MMD, in the sense that we are comparing empirical sample complexity of GANs with the theoretical sample complexity of MMDs, which tends to be worse. However, in the original GMMN paper [10] they indeed used a minibatch of size 1000, much larger than the standard 32 or 64 (even when this incurred in quadratic computational cost). While estimates that have linear computational cost as a function of the number of samples exist [5], they have worse sample complexity, and to the best of our knowledge they haven't been yet applied in a generative context such as in GMMNs. ::: :::success Generative Moment Matching Networks(GMMNs)[10, 2]是MMD的生成對應物(變體?)。通過對kernelized formula for equation (4)反向傳播,他們直接最佳化$d_{MMD}((\mathbb{P}_r,\mathbb{P}_\theta))$(當F與上一項一樣時的IPM)。如上所述,這樣做的優點是不需要單獨的網絡來近似最大化方程式(4)。然而,GMMNS的適用性有限。對於它們失敗的部分解釋是二次成本的樣本數和low-bandwidth kernels消失梯度的函數關係。此外,這也許是可能的,在實踐中的部份kernels可能不適合用來補捉高維度取樣空間中的非常複雜的distance,像是自然影像。事實表明這一點,如[19]說明,為了讓典型的Gaussian MMD測試是可靠的(正如統計測試接近1時的作用一樣),我們需要取樣數量隨著維度增加而線性增長。由於MMD的計算成本會隨著用來估計equation (4)的樣本數量二次方成長,這使得具有可靠估計器的成本也隨著維度數量呈二次方增長,這使得它非常不適合應用於高維度的問題。實際上,對於一些64x64的影像,我們需要minibatches of size最少4096(不考慮[19]範圍內的常數,這將使這個數目明顯大得多),並且每一次的迭代的total cost為40962,在使用標準batch size-64的時候,比GAN迭代還要超出五個[數量級](http://terms.naer.edu.tw/detail/1283638/)。 話說如此,但這數值對MMD有一些不公平,因為我們正在比較的是GAN的經驗樣本複雜性與MMD的理論樣本複雜性,後者往往更糟。然而,在原始GMMN論文[10]中,他們確認使用minibatch size-1000,比標準32或64都來的大(即使需要二次方的計算成本)。 ::: :::info On another great line of research, the recent work of [14] has explored the use of Wasserstein distances in the context of learning for Restricted Boltzmann Machines for discrete spaces. The motivations at a first glance might seem quite different, since the manifold setting is restricted to continuous spaces and in finite discrete spaces the weak and strong topologies (the ones of W and JS respectively) coincide. However, in the end there is more in commmon than not about our motivations. We both want to compare distributions in a way that leverages the geometry of the underlying space, and Wasserstein allows us to do exactly that. Finally, the work of [3] shows new algorithms for calculating Wasserstein distances between different distributions. We believe this direction is quite important, and perhaps could lead to new ways of evaluating generative models. 
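譯註:本節前面提到的 MMD,以 Gaussian kernel 寫出的(有偏)估計式只需要幾行 NumPy;也可以看出它需要成對計算 kernel,成本確實隨樣本數呈二次方成長。以下純屬譯者補充示意,非原論文程式:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)),成對計算 => O(n^2)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_biased(x, y, sigma=1.0):
    # MMD^2 的有偏估計:mean k(x,x') + mean k(y,y') - 2 mean k(x,y)
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))
y = rng.normal(0.5, 1.0, size=(500, 2))
print(mmd2_biased(x[:250], x[250:]), mmd2_biased(x, y))   # 同分佈時接近 0,不同分佈時明顯較大
```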
:::

:::success
在另一條重要的研究路線上,[14]最近的研究探討了在離散空間上的Restricted Boltzmann Machines(受限玻爾茲曼機)的學習情境中使用Wasserstein distances。乍看之下動機似乎相當不同,因為流形的設定侷限在連續空間,而在有限的離散空間中,弱拓撲與強拓撲(分別由W與JS誘導)是重合的。然而,最終我們的動機有很多共通之處。我們都希望以一種能利用底層空間幾何的方式比較分佈,而Wasserstein正允許我們這麼做。

最後,[3]的研究提出了計算不同分佈之間Wasserstein distances的新演算法。我們相信這個方向非常重要,或許能帶來評估生成模型的新方法。
:::

## 6 Conclusion
:::info
We introduced an algorithm that we deemed WGAN, an alternative to traditional GAN training. In this new model, we showed that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we showed that the corresponding optimization problem is sound, and provided extensive theoretical work highlighting the deep connections to other distances between distributions.
:::

:::success
我們引入了一種我們稱之為WGAN的演算法,作為傳統GAN訓練的替代方案。在這個新的模型中,我們展示了我們可以提高學習的穩定性,擺脫像mode collapse這類的問題,並提供有助於除錯與超參數搜尋的有意義的學習曲線。此外,我們說明了對應的最佳化問題是合理的,並提供大量理論工作,強調它與分佈之間其它距離的深度連結。
:::