# A Wasserstein GAN model with the total variational regularization(翻譯)

###### tags: `wgan` `gan` `論文翻譯` `deeplearning` `對抗式生成網路`

[TOC]

## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院

:::info
原文
:::

:::success
翻譯
:::

:::warning
任何的翻譯不通暢部份都請留言指導
:::

:::danger
* [paper hyperlink](https://arxiv.org/pdf/1812.00810.pdf)
* [Kantorovich-Rubinstein duality](https://vincentherrmann.github.io/blog/wasserstein/)
* [知乎_郑华滨_令人拍案叫绝的Wasserstein GAN](https://zhuanlan.zhihu.com/p/25071913)
* [李宏毅老師WGAN課程筆記](https://hackmd.io/@shaoeChen/HJ3JycWeB)
* mode collapse指GAN可能會集中到某一個distribution,因此造成生成照片多數雷同
:::

:::info
It is well known that the generative adversarial nets (GANs) are remarkably difficult to train. The recently proposed Wasserstein GAN (WGAN) creates principled research directions towards addressing these issues. But we found in practice that gradient penalty WGANs (GP-WGANs) still suffer from training instability. In this paper, we combine a Total Variational (TV) regularizing term into the WGAN formulation instead of weight clipping or gradient penalty, which implies that the Lipschitz constraint is enforced on the critic network. Our proposed method is more stable at training than GP-WGANs and works well across varied GAN architectures. We also present a method to control the trade-off between image diversity and visual quality. It does not bring any computation burden.
:::

:::success
大家都知道GAN真的很不好訓練。最近提出的Wasserstein GAN(WGAN)給出一個解決這些問題的研究方向。但我們發現實務上,gradient penalty WGANs(GP-WGANs)仍然受訓練不穩定所苦。這篇論文中,我們結合Total Variational(TV) regularizing項目到WGAN的公式,而不是使用weight clipping或是gradient penalty,這意謂著Lipschitz的約束式會被實施在critic network上。我們提出的方法在訓練過程上比GP-WGANs還要穩定,而且在各種不同GAN架構上都可以做的很好。我們還提出一個方法來控制影像多樣性與視覺品質之間的權衡。而且這並不需要額外的計算負載。
:::

## 1 Introduction

:::info
Though there has been an explosion of GANs in the last few years[1, 2, 3, 4, 5, 6], the training of some of these architectures were unstable or suffered from mode collapse. During training of a GAN, the generator G and the discriminator D keep racing against each other until they reach the Nash equilibrium, more generally, an optimum. In order to overcome the training difficulty, various hacks are applied depending on the nature of the problem. Salimans et al. [7] presents several techniques for encouraging convergence, such as feature matching, minibatch discrimination, historical averaging, one-sided label smoothing and virtual batch normalization. In [8], authors explain the unstable behaviour of GANs training in theory and in [9] the Wasserstein GANs (WGANs) with weight clipping was proposed. Because the original WGANs still sometimes generate only poor samples or fail to converge, its gradient penalty version (GP-WGANs) was proposed in [10] to address these issues. Since then several WGAN-based methods were proposed[11, 12, 13]. The Boundary equilibrium generative adversarial networks (BEGANs) was proposed in [14]. Its main idea is to have an auto-encoder as a discriminator, where the loss is derived from the Wasserstein distance between the reconstruction losses of real and generated images:
:::

:::success
雖然說這幾年GANs是爆炸性的增長[1, 2, 3, 4, 5, 6],但是某些架構的訓練還是不穩定,而且面對模式崩潰(mode collapse)的問題一籌莫展。GAN的訓練過程中,generator G與discriminator D維持著互相對抗,一直到兩者之間處於Nash equilibrium,更廣義地說,就是達到一個最佳解。為了能夠克服訓練上的困難,面對不同的問題就有了各種不同的技巧。Salimans et al.
[7]提出多種能夠促進收斂的技術,像是feature matching、minibatch discrimination、historical averaging、one-sided label smoothing與virtual batch normalization。在[8]中,作者從理論來說明GANs訓練的不穩定的行為,在[9]中,提出使用weight clipping的Wasserstein GANs(WGANs)。因為原始的WGANs有些時候還是會只能生成出不好的樣本或是無法收斂,所以[10]提出gradient penalty的WGANs來解決這些問題。然後就有多種基於WGAN的方法被提出[11, 12, 13]。[14]提出Boundary equilibrium generative adversarial networks (BEGANs)。BEGANs主要的想法就是有一個auto-encoder做為discriminator,其loss由實際照片與生成照片的重建損失之間的Wasserstein distance所得:
:::

:::info
WGAN requires that the discriminator (or called as critic) function must satisfy 1-Lipschitz condition, which is enforced through gradient penalty in [10]. The GP-WGANs are much more stable in training than the original weight clipping version. However, we found in practice that GP-WGANs still have such drawbacks: 1) Fail to converge with homogeneous network architectures. 2) Weights explode for a high learning rate. 3) Losses fluctuate drastically even after long term training. BEGANs improve the training stability significantly and are able to generate high quality images, but the auto-encoder is high resource consuming.
:::

:::success
WGAN要求discriminator function要滿足1-Lipschitz的條件(以[10]所提出的gradient penalty來實作)。GP-WGANs在訓練上比原始的weight clipping版本還要來的穩定。然而,我們發現在實務上,GP-WGANs仍然存在幾個缺點:1)homogeneous network architectures無法收斂。2)learning rate較大情況下會有權重爆炸(weight explode)的問題。3)即使已經長時間的訓練,其losses的波動依然很大。BEGANs明顯的提升訓練的穩定度,而且能夠生成出高品質的影像,但是auto-encoder需要較高的計算資源。
:::

:::info
In this paper, we conduct some experiments to show the above defects of GP-WGANs. Then with exploration about the theoretical property of the Wasserstein distance, we choose to add a TV regularization term rather than the gradient penalty term into the formulation of the objective. Compared with GP-WGANs and BEGANs, our approach is simple and effective, but it is much stable for training of varied GAN architectures. Additionally, we introduce a margin factor which is able to control the trade-off between image diversity and visual quality.
:::

:::success
這篇論文中,我們做了一些實驗來說明上面說的關於GP-WGANs的缺陷。然後研究關於Wasserstein distance的理論特性,我們選擇加入一個TV regularization而不是gradient penalty在目標函數中。與GP-WGANs與BEGANs相比,我們的方法既簡單又有效,而且在各種GAN架構的訓練上都更為穩定。此外,我們引入一個邊際因子(margin factor),這個邊際因子能夠控制影像多樣性與視覺品質之間的權衡。
:::

:::info
We make the following contributions: 1) An Wasserstein GAN model with TV regularization (TV-WGAN) is proposed. It is much simpler but is more stable than GP-WGANs. 2) The TV term implies that the 1-Lipschitz constraint is enforced on the discriminative function. We try to give a rough proof for it. 3) A margin factor is introduced to control the trade-off between generative diversity and visual quality.
:::

:::success
我們有著下面幾個貢獻:1)提出一個帶有TV regularization的Wasserstein GAN(TV-WGAN)。比GP-WGANs更簡單而且更穩定。2)TV regularization意謂著對discriminative function施加1-Lipschitz的約束式。我們會試著給出一個粗略的證明。3)引入一個邊際因子(margin factor)來控制生成多樣性與視覺品質之間的權衡。
:::

:::info
The remainder of this paper is organized as follows. In Section 2, we first introduce the preliminary theory about training of GANs. In Section 3, we demonstrate the defects of GP-WGANs by experiments. In Section 4, we describe the proposed TV regularized method in detail. The implementation and results of our approach are given in Section 5. It is summarized in Section 6.
:::

:::success
論文的其餘部份說明如下。在第2章,我們會先介紹關於GANs訓練的初步理論。第3章,我們會利用實驗來證明GP-WGANs的缺陷。第4章我們會說明所提出的TV regularized方法的細節。第5章會給出實作與方法的結果。第6章做總結。
:::

## 2. Background
### 2.1. The Training of GAN

:::info
In training of an original GAN, the generator G and the discriminator D play the following two-player minimax game[15]:

$$\min_G \max_D \left\{ \mathbb{E} \log D(x) + \mathbb{E} \log (1-D(G(z))) \right\} \tag{1}$$
:::

:::success
訓練原始GAN的時候,generator G與discriminator D玩著下面兩個玩家的minimax的賽局[15]:

$$\min_G \max_D \left\{ \mathbb{E} \log D(x) + \mathbb{E} \log (1-D(G(z))) \right\} \tag{1}$$
:::

:::info
This objective function of cross entropy is equivalent to the following Jensen-Shannon divergence form:

$$\min_G \max_D \text{JSD}(P_r \Vert P_g) \tag{2}$$

where the real data $x \sim P_r$, the generated data $G(z) \sim P_g$, and $\text{JSD}(\cdot)$ measures the similarity of the distribution $P_r$ and $P_g$. When $P_r=P_g$, their $\text{JSD}=0$, means that the generator perfectly replicating the real data generating process. However, it is always remarkably difficult to train GANs, which suffer from unstable and mode collapse.
:::

:::success
這個目標函數的交叉熵等價於下面的JS-divergence形式:

$$\min_G \max_D \text{JSD}(P_r \Vert P_g) \tag{2}$$

其中$x \sim P_r$是實際資料,$G(z) \sim P_g$是生成資料,而$\text{JSD}(\cdot)$是用來量測$P_r, P_g$兩個分佈之間的相似度。當$P_r=P_g$,那$\text{JSD}=0$,這意謂著generator完美的複製實際資料的生成過程。然而,訓練GAN明顯的非常困難(不穩定與模式崩潰的問題)。
:::

### 2.2. Wasserstein distance

:::info
As discussed in [8], the distribution $P_r$ and $P_g$ have disjoint supports because their supports lie on low dimensional manifolds of a high dimensional space. Therefore their JSD is always a constant $\log 2$, which means that the gradient will be zero almost everywhere. Besides gradient vanishing, GANs can also incur gradient instability and mode collapse problems, if the generator takes $-\log D(G(z))$ as its loss function. In order to get better theoretical properties than the original, WGANs leverages the Wasserstein distance between $P_r$ and $P_g$ to produce an objective function[9]:

$$W(P_r, P_g) = \inf_{\gamma\in(P_r,P_g)} \mathbb{E}_{(x,\tilde{x}) \sim \gamma} \Vert x - \tilde{x} \Vert \tag{3}$$

where $x \sim P_r$ and $\tilde{x} \sim P_g$ implicitly defined by $\tilde{x} = G(z)$.
:::

:::success
如[8]的討論,兩個分佈$P_r$與$P_g$之間有著不相交的支撐(disjoint supports),這是因為它們的支撐位於高維度空間中的低維流形上。因此,它們的JSD始終都是$\log 2$,這意謂著其梯度也會幾乎都是零。如果generator採用$-\log D(G(z))$做為loss function,那除了梯度消失之外,GANs還會受到梯度不穩定以及模式崩潰的問題所困擾。為了能夠得到比原始GAN擁有更好的理論特性,WGANs利用$P_r$與$P_g$之間的Wasserstein distance來產生一個目標函數[9]:

$$W(P_r, P_g) = \inf_{\gamma\in(P_r,P_g)} \mathbb{E}_{(x,\tilde{x}) \sim \gamma} \Vert x - \tilde{x} \Vert \tag{3}$$

其中$x \sim P_r$與$\tilde{x} \sim P_g$由$\tilde{x}=G(z)$隱含地定義。
:::

:::warning
這邊說明的是,只要兩個分佈之間沒有任何的交集,不管遠近,就算只差那麼一點的距離,得到的loss都是$\log 2$,這種情況下你得到loss是完全沒有意義的。
:::

:::info
Since the infimum is highly intractable, with Kantorovich-Rubinstein duality, it can be reformulated as[10]:

$$\min_G \max_{D\in\mathcal{D}}\mathbb{E}D(x) - \mathbb{E}D(G(z)) \tag{4}$$

where the $\mathcal{D}$ is the set of 1-Lipschitz functions. To enforce the Lipschitz constraint on the discriminator, [9] proposes to clip the weights of the discriminator to lie within a compact space $[-c, c]$. In [10], it is found that weight clipping will result in vanishing or exploding gradients, or weights will be pushed towards the extremes of the clipping range.
:::

:::success
由於[最大下界](http://terms.naer.edu.tw/detail/3184349/)非常難處理,利用Kantorovich-Rubinstein duality([對偶性](http://terms.naer.edu.tw/detail/2115200/)),我們可以把它重新寫為[10]:

$$\min_G \max_{D\in\mathcal{D}}\mathbb{E}D(x) - \mathbb{E}D(G(z)) \tag{4}$$

其中$\mathcal{D}$為1-Lipschitz function的集合。為了把Lipschitz約束式強制加在discriminator上,[9]提出一種權重剪裁的方式,讓權重限制在一個緊緻空間$[-c, c]$中。在[10]中,作者發現到權重剪裁將導致梯度的消失或爆炸,又或者權重會被推向剪裁範圍的極值(也就是$c, -c$)。
:::
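:::warning
以下為個人補充(非原論文內容):方程式(4)的critic與generator損失,若假設使用PyTorch,大致可以寫成下面這樣的最小示意。這裡的網路只是隨意假設的多層感知器,重點在於損失的形式;1-Lipschitz約束式仍需另外透過weight clipping、gradient penalty或本文的TV正規化來施加。

```python
import torch
import torch.nn as nn

# 假設性的critic與generator,僅為示意,並非論文中的網路架構
critic = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))

x_real = torch.randn(32, 784)   # 以隨機張量代替一個batch的實際資料
z = torch.randn(32, 64)         # latent vector
x_fake = generator(z)

# 方程式(4):critic希望E[D(x)] - E[D(G(z))]愈大愈好,取負號後做最小化
loss_D = -(critic(x_real).mean() - critic(x_fake.detach()).mean())
# generator則希望E[D(G(z))]愈大愈好
loss_G = -critic(x_fake).mean()
```
:::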
:::info
Hence the gradient penalty method is proposed:

$$L=\mathbb{E}D(x) - \mathbb{E}D(\tilde{x}) + \lambda \mathbb{E}[(\Vert \nabla_{\hat{x}}D(\hat{x})\Vert_2 - 1)^2] \tag{5}$$
:::

:::success
因此有了gradient penalty這個方法:

$$L=\mathbb{E}D(x) - \mathbb{E}D(\tilde{x}) + \lambda \mathbb{E}[(\Vert \nabla_{\hat{x}}D(\hat{x})\Vert_2 - 1)^2] \tag{5}$$
:::

## 3. Difficulties with gradient penalty

:::info
Though the GP-WGAN is theoretically elegant, usually the gradient operation is sensitive to noise. Probably due to this reason, training of GP-WGANs is still problematic. We illustrate this by running experiments on image generation using the GP-WGAN algorithm.
:::

:::success
儘管GP-WGAN在理論上還不賴,但是梯度的操作通常對於噪點是敏感的。可能是因為這個原因,GP-WGANs的訓練仍然存在著問題。我們用GP-WGAN演算法在影像生成的實驗上來說明這點。
:::

:::info
![](https://i.imgur.com/CIdPYI2.png)

Figure 1. A homogeneous network structure which is similar to BEGANs.

Figure 1. 類似BEGANs的homogeneous network架構。
:::

:::info
![](https://i.imgur.com/z0n0T2k.png)

Figure 2. Gradient explosion of GP-WGANs. This happens when 1) using a homogeneous network structure, or 2) the learning rate is high.

Figure 2. GP-WGANs的梯度爆炸。這會發生在1)使用homogeneous network架構,或2)學習效率設置太高。
:::

### 3.1. Unstable for the homogeneous network architecture

:::info
The training of GP-WGAN will be unstable for such a network architecture as shown in Figure 1. It is similar to that of the BEGAN[14], except that our discriminator does not have an auto-decoder inside it. This structure is homogeneous since it is composed of repeated convolutional layers. Usually, GANs with such homogeneous structure is hard to train due to its lack of either batch normalization[16] or dropout layers. Therefore this structure can be taken as a benchmark for evaluating the training stability of a GAN model. We trained the GP-WGAN with this network structure. As we observed in our experiments, it resulted in exploding gradients in GP-WGAN as shown in Figure 2.
:::

:::success
如Figure 1所示,GP-WGAN的訓練在這樣的架構(homogeneous network)下會是不穩定的。這跟BEGAN[14]是類似的,就只差在我們的discriminator沒有auto-decoder在裡面。這樣的架構稱為homogeneous,因為它由重覆的卷積層所組成。通常,擁有這種homogeneous架構的GANs比較難訓練是因為缺少batch normalization[16]或dropout layer。因此這樣的架構可以拿來做為評估GAN模型訓練穩定性的基準。我們實驗的GP-WGAN就是用這樣的架構訓練的。如我們實驗中所觀察到的,它導致了GP-WGANs內的梯度爆炸,見Figure 2說明。
:::

### 3.2. Drastic fluctuation in long term

:::info
We tested the GP-WGAN with a carefully designed structure as shown in figure 3. This network is similar to that of the standard DCGAN[1], where the batch normalization is applied in both the generator and the discriminator. It is much easier to train this network because its structure is carefully designed for GANs. Not surprisingly, the GP-WGAN with this structure is capable of being trained to generate good samples. However, the loss curve of the GP-WGAN keeps intense fluctuating even after a long term training, which indicates that the training of the GP-WGAN is potentially unstable. In our experiments, this phenomenon has recurred repeatedly for both CIFAR10 and CelebA datasets, as shown in Figure 4.
:::

:::success
我們用Figure 3所示的精心設計的架構來測試GP-WGAN。這個網路類似於標準的DCGAN[1],在generator與discriminator都用了batch normalization。因為這個架構是針對GANs所精心設計,所以訓練起來非常輕鬆。毫無懸念的,用這個架構的GP-WGAN訓練起來效果不錯,而且能夠生成出好的樣本。然而,即使經過長時間的訓練,GP-WGAN的損失曲線還是呈現出強烈的波動,這意味著GP-WGAN的訓練可能是不穩定的。在我們的實驗中,這個現象在CelebA與CIFAR10兩個資料集中反覆的出現,如Figure 4所示。
:::
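:::warning
個人補充(非原論文程式碼):本章所討論的梯度懲罰,其方程式(5)中懲罰項常見的計算方式大致如下(假設使用PyTorch,且$\hat{x}$依[10]的做法取實際樣本與生成樣本之間的線性內插)。可以看到每次更新critic都需要對$\hat{x}$額外做一次反向傳播來取得$\nabla_{\hat{x}}D(\hat{x})$,這也呼應論文所說GP-WGAN的梯度操作麻煩、且可能對噪點敏感的論點。

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp=10.0):
    """方程式(5)中梯度懲罰項的常見實作示意;假設輸入已攤平為(batch, features)。"""
    batch_size = x_real.size(0)
    eps = torch.rand(batch_size, 1, device=x_real.device)               # 內插係數
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)    # 實際與生成樣本間的內插點
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```
:::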
:::info
![](https://i.imgur.com/ie1W4xo.png)

Figure 3. A carefully designed network structure similar to DCGANs.

Figure 3. 類似DCGANs所精心設計的網路架構。
:::

:::info
![](https://i.imgur.com/KKdL1Kx.png)

Figure 4. Drastic fluctuation of GP-WGANs after a long term training with learning rate=1e-5.

Figure 4. 以學習效率=1e-5,經過長時間訓練的GP-WGANs的強烈波動。
:::

### 3.3. Sensitive to learning rate

:::info
The third drawback of GP-WGAN is sensitivity to the learning rate of training. In the last experiment, if we enlarge the learning rate from 1e-5 to 1e-4 which is suitable for most GAN models including BEGANs, the GP-WGAN becomes unstable and gradients explode rapidly. The loss curves are the same as in Figure 2, so we don't re-draw it here. Certainly the learning rate plays a significant role in most deep learning networks. A too high learning rate always makes them difficult to converge, but rarely leads to gradient explosion. This illustrates that GP-WGANs are potentially unstable.
:::

:::success
GP-WGAN的第三個缺陷就是對訓練時所設置的學習效率的敏感性。在最後一個實驗中,如果我們把學習效率從1e-5擴大到1e-4,這對多數的GANs模型(包括BEGANs)是合適的,但是GP-WGAN會變的不穩定,梯度很快的就爆掉。其損失曲線就跟Figure 2所示一樣,這邊就不再重覆提供。當然啦,學習效率在多數的深度學習網路中都扮演著非常重要的角色。太高的學習效率通常會造成不好收斂的問題,不過很少碰到梯度會爆掉的。這說明了GP-WGANs是潛在不穩定的。
:::

## 4. Proposed

:::info
In this section, we first present the TV-WGANs along with their benefits in section 4.1. A rough proof for the enforcement of Lipschitz constraint by the TV term is given in section 4.2. Then in section 4.3 we present the objective function with a margin factor which controls the trade-off between generative diversity and visual quality. Finally, the benefits of our approach are discussed in section 4.4.
:::

:::success
在這一章裡面,我們會先在4.1中介紹一下TV-WGANs以及它的效益。然後會在4.2的部份概略的證明一下利用TV項所強制施加的Lipschitz約束式。然後4.3的時候我們會介紹帶有邊際因子(margin factor)來控制生成多樣性與視覺品質之間的權衡的目標函數。最後,會在4.4的時候討論我們所提出的方法帶來什麼好處。
:::

### 4.1. Total variational WGAN

:::info
The WGAN objective Equation (4) could be explained in an intuitive way: The discriminator $D$ is trained to make its output as high as possible for real data $x$, and as low as possible for fake data $\tilde{x}$; The generator $G$ which are trained to produce fake images tries to make the discriminator to give a high output for them. Since the discriminator works as a real-fake critic in a WGAN model, it tries to make its output separated as far as possible for real data and fake data. As a counterpart in the adversarial game, the generator tries to make the discriminative output as close as possible to each other. Besides, Equation (4) implies that the discriminator must be a smooth 1-Lipschitz function. Without this constraint, the training will diverge continuously. As shown in Figure 5, the $D(x)$ and $D(\tilde{x})$ tends to increase and decrease infinitely until exceed the range of machine's floating point numbers.
:::

:::success
WGAN目標函數方程式(4)可以用一個直觀的方式來解釋:discriminator $D$的目標就是對實際的資料$x$的輸出愈高愈好,對假資料$\tilde{x}$的輸出則是愈低愈好。generator $G$則是試圖生成出假資料讓discriminator輸出高分。因為discriminator在WGAN模型中扮演著實際資料與假資料之間的評論者(critic,這也是為什麼WGAN中不稱discriminator而是critic),因此它會試著讓它的輸出盡可能的分離出實際資料與假資料。做為在對抗賽局中的一個對手,generator試圖讓區分(discriminative)輸出盡可能地彼此接近。此外,方程式(4)就暗示著,discriminator必需是平滑的1-Lipschitz function。沒有這個約束式,整個訓練過程就會不斷的發散。如Figure 5所示,$D(x)$與$D(\tilde{x})$整個趨勢是往無限增大、減小,一直到超過機器的浮點數的範圍。
:::

:::info
![](https://i.imgur.com/eDTTXyl.png)

Figure 5. Discriminator output in training for real data and fake data without 1-Lipschitz constraint.

Figure 5. discriminator在沒有1-Lipschitz約束式情況下對於實際資料與假資料的輸出狀況。
:::

:::info
To enforce 1-Lipschitz constraint on discriminative functions, the weight clipping and the gradient penalty methods are proposed. However, both these methods have their defects as demonstrated in section 3. In this section we introduced an alternative in the form of total variational regularization $\vert D(x) - D(\tilde{x}) - \delta \vert$, where $\delta$ is a wanted margin between discriminative output for real data and fake data. The meaning of the margin $\delta$ is shown in Figure 6. Hence the objective function is formulated as:

$$\min_{G} \max_{D} \mathbb{E}D(x) - \mathbb{E}D(\tilde{x}) + \lambda \mathbb{E} [\vert D(x) - D(\tilde{x}) - \delta \vert] \tag{6}$$

where $\lambda$ is the regularization factor which is set to 1 across this paper. We assume that it is equivalent to enforce an 1-Lipschitz constraint via this TV term. About this point, a simple proof will be given in the next section.
:::

:::success
為了能夠在discriminative function上施加1-Lipschitz約束式,就有人提出了權重剪裁(weight clipping)跟梯度懲罰(gradient penalty)兩種方法。然而,這兩種方法都有它們的缺陷,這部份已經在第3章說明過。這一節我們會提出一種以total variational regularization(總變分正規化) $\vert D(x) - D(\tilde{x}) - \delta \vert$替代的形式,其中$\delta$是實際資料與假資料的區別性輸出之間的期望邊際(wanted margin)。這個邊際$\delta$說明如Figure 6。因此,目標函數可以寫為:

$$\min_{G} \max_{D} \mathbb{E}D(x) - \mathbb{E}D(\tilde{x}) + \lambda \mathbb{E} [\vert D(x) - D(\tilde{x}) - \delta \vert] \tag{6}$$

其中$\lambda$是正規化的因子,在這篇論文中設置為1。我們假設透過這個TV項等價於施行1-Lipschitz約束式。關於這一點,下一節會給出一個簡單的證明。
:::

:::info
![](https://i.imgur.com/G9KHb2h.png)

Figure 6. The margin between discriminative outputs of real data and fake data.

Figure 6. 實際資料與假資料的區別性輸出之間的期望邊際(wanted margin)。
:::

### 4.2. The 1-Lipschitz constraint

:::info
Proposition 1. The marginal TV regularizing term $\vert D(x) - D(\tilde{x}) - \delta \vert$ in equation 6 enforces the 1-Lipschitz constraint on the discriminative function $D(x)$.
:::

:::success
Proposition 1. 方程式6中的邊際總變分正規項(TV regularizing term) $\vert D(x) - D(\tilde{x}) - \delta \vert$在discriminative function $D(x)$上施加1-Lipschitz約束式。
:::

:::info
Proof. Given a discriminative function $D(\theta)$, whose parameters $\theta$ are updated by the gradient descent method:

$$\theta_{n+1} = \theta_n + \eta \nabla \theta_n \tag{7}$$

where $\eta$ is the learning rate.
:::

:::success
Proof. 給定一個discriminative function $D(\theta)$,其參數$\theta$的更新是透過梯度下降方法:

$$\theta_{n+1} = \theta_n + \eta \nabla \theta_n \tag{7}$$

其中$\eta$是學習效率。
:::

:::info
Then by the first order Taylor expansion, the discriminative output $D_{n+1}(x) = D(\theta_n + \eta \nabla \theta_n, x)$ for real data can be approximated as:

$$D_{n+1}(x) \approx D(\theta_n, x) + \eta \nabla \theta_nD'(\theta_n) = D_n(x) + \eta\nabla_xD(x) \tag{8}$$
:::

:::success
然後利用first order Taylor expansion,實際資料的discriminative的輸出$D_{n+1}(x) = D(\theta_n + \eta \nabla \theta_n, x)$可以近似於:

$$D_{n+1}(x) \approx D(\theta_n, x) + \eta \nabla \theta_nD'(\theta_n) = D_n(x) + \eta\nabla_xD(x) \tag{8}$$
:::

:::info
On the other hand, its output for fake data is approximately:

$$D_{n+1}(\tilde{x}) = D_n(\tilde{x}) + \eta\nabla_\tilde{x}D(\tilde{x}) \tag{9}$$
:::

:::success
另一方面,假資料的輸出就近似於:

$$D_{n+1}(\tilde{x}) = D_n(\tilde{x}) + \eta\nabla_\tilde{x}D(\tilde{x}) \tag{9}$$
:::

:::info
Considering that the discriminator is trained to give a continuously increasing output for real data, and to give a continuously decreasing output for fake data, hence

$$ \begin{cases} \mathbb{E}[\nabla_xD(x)] > 0 \\[2ex] \mathbb{E}[\nabla_\tilde{x}D(\tilde{x})] < 0 \tag{10} \end{cases}$$
:::

:::success
考慮到discriminator是訓練來對實際資料給出不斷增加的輸出,對假資料則是給出不斷減少的輸出,因此

$$ \begin{cases} \mathbb{E}[\nabla_xD(x)] > 0 \\[2ex] \mathbb{E}[\nabla_\tilde{x}D(\tilde{x})] < 0 \tag{10} \end{cases}$$
:::

:::info
Meanwhile, the difference between $D_n(x)$ and $D_n(\tilde{x})$ is bounded by the marginal TV regularization, i.e., for a constant $\epsilon$, we have

$$\vert D_n(x) - D_n(\tilde{x}) - \delta \vert < \epsilon \tag{11}$$
:::

:::success
與此同時,$D_n(x)$與$D_n(\tilde{x})$之間的差異就由邊際總變分正規項來限制,即對某個常數$\epsilon$,我們得到

$$\vert D_n(x) - D_n(\tilde{x}) - \delta \vert < \epsilon \tag{11}$$
:::

:::info
So by subtracting Equation (8) from (9), we derive that the difference between $\nabla_xD(x)$ and $\nabla_\tilde{x}D(\tilde{x})$ is also bounded by

$$\vert \nabla_xD(x) - \nabla_\tilde{x}D(\tilde{x}) \vert < \dfrac{2\epsilon}{\eta} \tag{12}$$
:::

:::success
因此,我們將方程式(8)與方程式(9)相減,得到$\nabla_xD(x)$與$\nabla_\tilde{x}D(\tilde{x})$之間的差異也被下面方程式限制住

$$\vert \nabla_xD(x) - \nabla_\tilde{x}D(\tilde{x}) \vert < \dfrac{2\epsilon}{\eta} \tag{12}$$
:::

:::info
Considering Equation (10), we derive that $\nabla_xD(x)$ must also be bounded by $\epsilon/\eta$, which means that $D(x)$ must satisfy the $k$-Lipschitz constraint for $k=\epsilon/\eta$. If $\epsilon$ is small enough, which is controlled by the regularizing factor $\lambda$, then the 1-Lipschitz constraint is enforced.
:::

:::success
考慮到方程式(10),我們得到$\nabla_xD(x)$也必然被$\epsilon/\eta$給限制住,這意味著$D(x)$在$k=\epsilon/\eta$情況下一定滿足$k$-Lipschitz約束式。如果$\epsilon$夠小(由正規化因子$\lambda$控制),那就會施加1-Lipschitz約束式。
:::

:::info
In this section, we prove that the 1-Lipschitz constraint is enforced approximately by the TV regularization. Additionally, we draw the histogram of the discriminator's weights in Figure 7 after we train TV-WGAN using CIFAR10 dataset. As shown in the figure, the weights of the TV-WGAN remain uniform, unlike in the weight clipping WGAN model where weights are pushed towards two extremes of the clipping range.
:::

:::success
在這一節裡面,我們證明了1-Lipschitz約束式可以由TV regularization近似地施加。此外,在使用CIFAR10資料集訓練TV-WGAN之後,我們繪製了discriminator的權重長條圖(Figure 7)。如圖所示,TV-WGAN的權重維持著均勻,不像使用權重剪裁的WGAN一昧的將權重往剪裁區間的極值去推進($c, -c$)。
:::

:::info
![](https://i.imgur.com/UeZLYbf.png)

Figure 7. Weights histogram of discriminator in TV-WGAN.

Figure 7. TV-WGAN的discriminator的權重長條圖。
:::
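:::warning
個人補充(非原論文內容):4.2節的論點是TV正規項會「近似地」讓critic滿足k-Lipschitz。若想在自己的實驗中做經驗上的檢查(並非證明),可以像下面這樣粗估實際樣本與生成樣本之間的Lipschitz商$\vert D(x)-D(\tilde{x})\vert / \Vert x-\tilde{x}\Vert$;假設使用PyTorch,且輸入為攤平後的張量。

```python
import torch

def lipschitz_quotient(critic, x_real, x_fake, eps=1e-8):
    """粗估critic在每一對(x, x~)上的Lipschitz商;數值大多接近或小於1,表示接近1-Lipschitz。"""
    with torch.no_grad():
        d_real = critic(x_real).squeeze(-1)
        d_fake = critic(x_fake).squeeze(-1)
        dist = (x_real - x_fake).view(x_real.size(0), -1).norm(2, dim=1)
        return (d_real - d_fake).abs() / (dist + eps)

# 使用示意:
# q = lipschitz_quotient(critic, x_real, x_fake)
# print(q.mean().item(), q.max().item())
```
:::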
### 4.3. Objective and margin factor

:::info
In Equation 6, the purpose of the TV term is to make the discriminative output of real data and fake data to be separated from each other within a bound. All experiments in this paper use $\lambda=1$. The larger $\lambda$, the stronger the Lipschitz constraint is enforced. We rewrite the loss function of our model as:

$$ \begin{cases} L_D=-\mathbb{E}D(x)+\mathbb{E} D(G(z)) + \lambda\mathbb{E} \vert D(x) - D(G(z)) - \delta \vert \\[2ex] L_G=-\mathbb{E}D(G(z)) \tag{13} \end{cases}$$
:::

:::success
在方程式6中,總變分正規項的目的是讓discriminative對於實際資料與假資料的輸出能夠在一定的限制內彼此的分離。論文中所有的實驗都設置$\lambda=1$。$\lambda$愈大,就意味著施加的Lipschitz約束式愈強。我們將模型的損失函數重寫為:

$$ \begin{cases} L_D=-\mathbb{E}D(x)+\mathbb{E} D(G(z)) + \lambda\mathbb{E} \vert D(x) - D(G(z)) - \delta \vert \\[2ex] L_G=-\mathbb{E}D(G(z)) \tag{13} \end{cases}$$
:::

:::info
The margin factor $\delta$ is capable of controlling the trade-off between generative diversity and visual quality. Higher values of $\delta$ lead to higher visual quality because it helps distinguish real data and fake data, so that the generator has to output vivid images with more details. Lower values of $\delta$ lead to higher image diversity.
:::

:::success
邊際因子$\delta$能夠控制生成多樣性與視覺品質之間的權衡。$\delta$愈高會得到愈高的視覺品質,因為$\delta$有助於區分實際資料與假資料,也因此,generator就必須輸出帶有更多細節的生動影像。$\delta$愈低就會有愈高的影像多樣性。
:::

### 4.4. Benefits

:::info
The benefits of TV-WGANs can be derived from two aspects. First, unlike GP-WGANs which directly compute gradients as penalty, the TV-WGANs enforce the Lipschitz constraint via TV regularization which is much more mild than weight clipping and gradient penalty. This results in more stable gradients that neither vanish nor explode, allowing training of more complicated networks, as well as more homogeneous networks.
:::

:::success
TV-WGANs的好處可以從兩個方面來說。首先,不像GP-WGANs直接計算其梯度來做為懲罰項,TV-WGANs透過TV regularization來施加Lipschitz約束式,這比起權重剪裁跟梯度懲罰要來的溫和許多。這也帶來更穩定的梯度,既不會消失也不會爆炸,允許訓練更為複雜的網路,以及更為homogeneous的網路架構。
:::

:::info
Second, TV-WGANs do not add any computation burden. On the contrary, GP-WGANs have to implement the troublesome gradients operation. BEGANs do not solve gradients for Lipschitz constraint, but their discriminator is composed of an auto-encoder and an auto-decoder, which means that they require even more computation resources.
:::

:::success
第二,TV-WGANs並不會增加任何的計算負擔。相反的,GP-WGANs必須實作那麻煩的梯度操作。BEGANs雖然不需要為了Lipschitz約束式去求解梯度,但它們的discriminator是由auto-encoder與auto-decoder所組成,這意謂著它們需要更多的計算資源。
:::

## 5. Experiments

:::info
In this section, we run experiments on image generation using our TV-WGAN algorithm and show that there are significant practical benefits to using it over other Wasserstein GANs.
:::

:::success
在這一章裡面,我們使用我們提出的TV-WGAN演算法進行影像生成的實驗,並且證明比起其它類型的Wasserstein GANs,使用TV-WGAN有著明顯的實務上的好處。
:::

### 5.1. Set up

:::info
We conduct experiments on the CIFAR-10 and CelebA datasets whose resolutions are 32×32 and 128×96 respectively. As for the CelebA dataset, we scale each picture into 64×48 for training. We trained our model using the Adam optimizer with a constant learning rate of 1e-4 for CIFAR10 dataset and 1e-5 for CelebA dataset. Mode collapse will be observed if a high learning rate is used. However gradient explosion never happens in training of our model, unlike the GP-WGAN which is certain to suffer from gradient explosion if using high learning rate.
:::

:::success
我們在CIFAR-10(32x32)與CelebA(128x96)兩個資料集上做實驗。對CelebA資料集,我們把每一張照片都縮成64x48來訓練。我們用Adam來訓練模型,學習效率固定,不做衰減設置,CIFAR10使用1e-4,CelebA則使用1e-5。如果使用較高的學習效率,那就會觀察到模式崩潰的問題。然而,我們的模型訓練過程中從來沒有出現過梯度爆炸的問題,不像GP-WGAN,只要用比較高的學習效率就一定會遇到梯度爆炸的問題。
:::

### 5.2. Stability

:::info
We use CIFAR-10 dataset to train the TV-WGAN model with the DCGAN-like network structure which is depicted in section 3.2 and is shown in Figure 3. The CelebA dataset is used to train the BEGAN-like network which is depicted in section 3.1 and is shown in Figure 1. The resulted loss curves are drawn in Figure 8 and 9. Compared with Figure 4, we believe that our model is much more stable in training than the GP-WGAN which keeps drastic fluctuating even after very long term training.
:::

:::success
我們用CIFAR-10資料集來訓練一個類似於DCGAN網路架構的TV-WGAN模型,這在3.2節提過,如Figure 3所示。CelebA資料集則訓練一個類似於BEGAN網路架構的TV-WGAN模型,這在3.1節提過,如Figure 1所示。其訓練的損失曲線繪製於Figure 8與Figure 9。與Figure 4相比,我們確信我們的模型訓練過程比GP-WGAN還要穩定,GP-WGAN即使經過長時間的訓練,還是會激烈的波動。
:::

:::info
![](https://i.imgur.com/2kryez6.png)

Figure 8. Training TV-WGAN on CIFAR-10 dataset with learning rate=1e-4.
:::

:::info
![](https://i.imgur.com/BxkwaMJ.png)

Figure 9. Training TV-WGAN on CelebA dataset with learning rate=1e-5.
:::

:::info
Further experiments demonstrate that training the GP-WGAN always encounters gradient explosion when using BEGAN-like network structures or using high learning rate, as we have discussed in section 3.1 and 3.3. This indicates that the gradient penalty did not make the WGANs stable enough, probably because the derivative control is susceptible to noise. While our TV-WGAN model has never been observed for gradient explosion in all above situations.
:::

:::success
進一步的實驗證明,當我們使用類似於BEGAN的網路架構或較高的學習效率來訓練GP-WGAN的時候,總是會遇到梯度爆炸的問題,如我們在3.1、3.3節所討論。這意謂著,梯度懲罰並沒有讓WGANs的訓練過程足夠穩定,這有可能是因為梯度的控制容易受噪點的影響。而我們的TV-WGAN模型在上面所有情況(設定)下從來沒有遇過梯度爆炸的問題。
:::

:::info
We have not so far done more experiments with many other network architectures, but the two structures we used here are typical. The DCGAN-like structure is carefully designed for GANs, with batch normalization layers in both generator and discriminator, so it is actually easy to train. The BEGAN-like structure is too homogeneous and is hard to train due to lack of either batch normalization or dropout layers.
:::

:::success
目前為止,我們還沒有對其它的網路架構做更多的實驗,但我們這裡所使用的兩種架構是具有代表性的。類似於DCGAN的架構是為GAN精心設計過的,在generator與discriminator都有著batch normalization layers,因此這確實很容易訓練。類似於BEGAN架構則是太過[齊次](http://terms.naer.edu.tw/detail/2117274/)(homogeneous),而且因為沒有batch normalization或dropout layers導致它非常難訓練。
:::

:::info
Figure 10 shows the generated face pictures by the TV-WGAN with the homogeneous network. The generated images do not seem to be of high quality, because we only trained 20 epochs.
:::

:::success
Figure 10是利用有著齊次網路的TV-WGAN所生成的臉部影像。因為我們只訓練20個epochs,所以看起來品質沒有很好。
:::

:::info
![](https://i.imgur.com/Xjyecte.png)

Figure 10. The generated face pictures by the TV-WGAN with 20 epochs.
:::

### 5.3. Effect of margin factor

:::info
In section 4.3, we introduced a margin factor $\delta$. Figure 11 demonstrates its effect on the generative diversity and visual quality. The effect is obvious when training 50 epochs, that higher values of $\delta$ lead to higher visual quality.
:::

:::success
在章節4.3中,我們介紹一個邊際因子$\delta$。Figure 11表明該因子對生成的多樣性與視覺品質的影響。在訓練50個epochs的時候,影響很明顯,較高的$\delta$會帶來較高的視覺品質。
:::

:::info
![](https://i.imgur.com/YIWkwEY.png)

Figure 11. The effect of the margin factor. From left to right, $\delta = 0, 5, 10$.
Figure 11. 邊際因子的影響。從左到右為$\delta = 0, 5, 10$。
:::

:::info
When our model is trained for 100 epochs, the quality of the generated images is not very different from each other for various margin factors. It's hard to distinguish by naked eyes. In section 5.5, we will perform the numerical assessment by using inception scores. As shown in table 1, the quality of generated images improves with the increase of the margin factor.
:::

:::success
當我們的模型訓練到100個epochs的時候,對於各種不同的邊際因子所生成的影像品質沒有很大的差別。肉眼很難分辨。在5.5節時,我們會用inception scores來進行數值上的評估。如table 1所示,生成影像的品質隨著邊際因子的增加而提高。
:::

### 5.4. Comparison with the BEGAN model

:::info
The BEGAN is capable of producing high quality images. However, the BEGAN model is prone to mode collapse when it is over-trained, which is verified by our experiments. We trained a TV-WGAN model and a BEGAN model on the CIFAR-10 dataset respectively by using the learning rate of 1e-4. When they are trained for 20 epochs, generated images of our model are of low quality, while the BEGAN model can generate good images, though they are lack of diversity. When we train them for 50 epochs, both models produce good images and the BEGAN generated pictures are of higher quality. But we found that when they are trained for 100 epochs, our model can generate high quality pictures while the BEGAN falls into the mode collapse if it does not use the learning rate decaying. The results are shown in Figure 12.
:::

:::success
BEGAN能夠生成高品質的影像。但是,BEGAN在過度訓練的時候很容易發生模式崩潰,這在我們的實驗中得到驗證。我們用學習效率1e-4各別在CIFAR-10資料集上訓練TV-WGAN與BEGAN。在訓練20個epochs之後,我們的模型(TV-WGAN)所生成的影像品質不好,而BEGAN模型可以生成出好的影像,儘管它們缺乏多樣性。在兩個模型都訓練50個epochs之後,都能生成出好的影像,而且BEGAN能夠生成出較高品質的影像。但是我們發現到,在訓練100個epochs之後,我們的模型可以生成出高品質的影像,而BEGAN如果沒有使用學習效率衰減的話,就會落入模式崩潰的困境。結果如Figure 12所示。
:::

:::info
![](https://i.imgur.com/N4ur52l.jpg)

Figure 12. Generated images by the TV-WGAN (left) and the BEGAN model without learning rate decaying (right).

Figure 12. 使用TV-WGAN(左)生成的影像與BEGAN模型不使用學習效率衰減(右)生成的影像。
:::

### 5.5. Inception scores

:::info
The inception score was considered as a good assessment for sample quality from a dataset[7]. To measure quality and diversity numerically, we trained our model on CIFAR10 with various margin factor and computed the inception score of generated images. We also trained other models, such as DCGAN, GP-WGAN and BEGAN respectively, and computed their inception scores by ourselves. All of these models are trained by Adam optimizers. The learning rate is 1e-4 for all models except that the GP-WGAN uses 1e-5, since it is unstable with high learning rates.
:::

:::success
inception score被視為是對資料集樣本品質好壞的一種評估方式[7]。為了將品質與多樣性的量測數值化,我們用各種不同的邊際因子在CIFAR10上訓練我們的模型,然後計算其生成影像的inception score。我們還訓練其它的模型,像是DCGAN、GP-WGAN與BEGAN,然後計算它們各自的inception scores。所有的這些模型都是用Adam來最佳化。學習效率都是1e-4,除了GP-WGAN為1e-5,因為GP-WGAN用較大的學習效率的話會不穩定。
:::

:::info
The results are listed in table 1. It is shown that the inception scores of our model are higher than that of GP-WGANs, but is still less than DCGANs and BEGANs. The BEGAN model obtains the highest inception score in our experiments, which indicates that it is capable of producing high quality images.
:::

:::success
結果列在table 1。這說明了我們的模型對比GP-WGANs有著較高的inception scores,但仍然比DCGANs、BEGANs還要低。這次的實驗中,BEGAN獲得最高的inception score,這說明BEGAN能夠生成出高品質的影像。
:::

:::info
For different $\delta$ value, the TV-WGAN has different inception scores. With the increase of the $\delta$ value, the inception score of the TV-WGAN has also increased, while the standard variance of the score reduced.
This seems to indicate that the margin factor $\delta$ takes effect for controlling the visual quality of generated images.
:::

:::success
不同的$\delta$數值,TV-WGAN就會有不同的inception scores。隨著$\delta$的增加,TV-WGAN的inception score也跟著增加,其分數的標準差也隨之減少。這似乎說明著,邊際因子$\delta$對控制生成影像的視覺品質起了作用。
:::

## 6. Conclusion

:::info
In this work, we demonstrated problems with gradient penalty in WGAN and introduced an alternative in the form of a total variational regularization in the objective function, which enforces the 1-Lipschitz constraint on the discriminator implicitly. The new approach is much more stable in training. Additionally, we introduced a margin factor to control the trade-off between generative diversity and visual quality.
:::

:::success
這次的研究中,我們說明在WGAN使用gradient penalty的問題,然後介紹一個在目標函數中加入total variational regularization的替代形式,這個方法隱含著在discriminator上施加1-Lipschitz約束式的概念。這個新的方法在訓練過程中穩定許多。此外,我們介紹一個邊際因子來控制生成多樣性與視覺品質的權衡。
:::
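:::warning
個人補充(非原論文程式碼):若依方程式(13)以PyTorch風格實作TV-WGAN的一次參數更新,大致如下。其中$\lambda=1$、邊際因子$\delta$為超參數(論文5.3節顯示$\delta$愈大視覺品質愈高、$\delta$愈小多樣性愈高),期望值以batch內逐樣本配對的方式近似;網路架構與optimizer(論文5.1節為Adam,學習效率1e-4或1e-5)等細節都只是假設,僅供參考。

```python
import torch

def tv_wgan_step(critic, generator, opt_D, opt_G, x_real, z_dim, delta=5.0, lam=1.0):
    """依方程式(13)做一次TV-WGAN的critic與generator更新(示意用)。"""
    batch_size, device = x_real.size(0), x_real.device

    # ---- 更新critic:L_D = -E[D(x)] + E[D(G(z))] + λE|D(x) - D(G(z)) - δ| ----
    z = torch.randn(batch_size, z_dim, device=device)
    x_fake = generator(z).detach()
    d_real, d_fake = critic(x_real), critic(x_fake)
    tv_term = (d_real - d_fake - delta).abs().mean()     # TV正規項
    loss_D = -d_real.mean() + d_fake.mean() + lam * tv_term
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # ---- 更新generator:L_G = -E[D(G(z))] ----
    z = torch.randn(batch_size, z_dim, device=device)
    loss_G = -critic(generator(z)).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```
:::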