# Progressive Growing of GANs for Improved Quality, Stability, and Variation(翻譯)
###### tags: `pggan` `gan` `論文翻譯` `deeplearning` `對抗式生成網路`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/1710.10196.pdf)
* [PGGAN程式碼](https://github.com/shaoeChen/deeplearning/blob/master/tf2/Arch_PGGAN.ipynb)
:::
## ABSTRACT
:::info
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CELEBA images at 1024^2 . We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CELEBA dataset.
:::
:::success
我們說明一種新的用於生成對抗網路(GAN)的訓練方法。主要的想法是漸進地增長generator與discriminator:從一個[低解析度](https://terms.naer.edu.tw/detail/4cb1c60b73f4cec86de9b783f73ddea1/)開始,隨著訓練的進行,我們逐步加入能夠建模愈來愈精細細節的新layer。這樣的方式除了加快訓練速度,也大大地提高其穩定度,讓我們能生成前所未有的高品質影像,像是以1024x1024解析度生成CELEBA的照片。我們還提出一個簡單的方法來增加生成影像的變化性,並且在unsupervised CIFAR10中創下8.80的inception score記錄。此外,我們說明多個對於阻止generator與discriminator之間不良競爭很重要的實現細節。最後,我們提出一個新的評估GAN結果的指標,同時涵蓋影像的品質與變化性。額外的貢獻是,我們建構了更高品質版本的CELEBA dataset。
:::
## 1 INTRODUCTION
:::info
Generative methods that produce novel samples from high-dimensional data distributions, such as images, are finding widespread use, for example in speech synthesis (van den Oord et al., 2016a), image-to-image translation (Zhu et al., 2017; Liu et al., 2017; Wang et al., 2017), and image inpainting (Iizuka et al., 2017). Currently the most prominent approaches are autoregressive models (van den Oord et al., 2016b;c), variational autoencoders (VAE) (Kingma & Welling, 2014), and generative adversarial networks (GAN) (Goodfellow et al., 2014). Currently they all have significant strengths and weaknesses. Autoregressive models – such as PixelCNN – produce sharp images but are slow to evaluate and do not have a latent representation as they directly model the conditional distribution over pixels, potentially limiting their applicability. VAEs are easy to train but tend to produce blurry results due to restrictions in the model, although recent work is improving this (Kingma et al., 2016). GANs produce sharp images, albeit only in fairly small resolutions and with somewhat limited variation, and the training continues to be unstable despite recent progress (Salimans et al., 2016; Gulrajani et al., 2017; Berthelot et al., 2017; Kodali et al., 2017). Hybrid methods combine various strengths of the three, but so far lag behind GANs in image quality (Makhzani & Frey, 2017; Ulyanov et al., 2017; Dumoulin et al., 2016).
:::
:::success
從高維度資料分佈中生成新樣本(如影像)的生成方法,正被廣泛地使用中,舉例來說像是[語音合成](https://terms.naer.edu.tw/detail/ff48dcadc2301e70ff779ac768d44d47/)(van den Oord et al., 2016a),image-to-image translation(Zhu et al., 2017; Liu et al., 2017; Wang et al., 2017)與[影像修補](https://terms.naer.edu.tw/detail/ddce3493664e62f5827190b9b1ef3402/)(Iizuka et al., 2017)。目前來說最著名的方法就是autoregressive models (van den Oord et al., 2016b;c),variational autoencoders (VAE) (Kingma & Welling, 2014)與generative adversarial networks (GAN) (Goodfellow et al., 2014)。這些方法都有自己的優缺點。Autoregressive models,像是PixelCNN,可以生成[清晰影像](https://terms.naer.edu.tw/detail/388a44fc3fb38b044fb2b272afa766dd/),但評估速度很慢,而且沒有latent representation,因為這類方法直接以像素上的條件分佈來建模,這可能限制了它們的適用性。VAEs很容易訓練,但往往會因為模型的限制而產生模糊的結果,儘管最近的研究(Kingma et al., 2016)正在改進這個問題。GANs可以生成清晰影像,儘管解析度相當小,而且變化有限,即使最近有所進展(Salimans et al., 2016; Gulrajani et al., 2017; Berthelot et al., 2017; Kodali et al., 2017),其訓練仍然不穩定。混合方法結合了三者的各種優點,不過目前來看其影像品質還是落後於GAN(Makhzani & Frey, 2017; Ulyanov et al., 2017; Dumoulin et al., 2016)。
:::
:::info
Typically, a GAN consists of two networks: generator and discriminator (aka critic). The generator produces a sample, e.g., an image, from a latent code, and the distribution of these images should ideally be indistinguishable from the training distribution. Since it is generally infeasible to engineer a function that tells whether that is the case, a discriminator network is trained to do the assessment, and since networks are differentiable, we also get a gradient we can use to steer both networks to the right direction. Typically, the generator is of main interest – the discriminator is an adaptive loss function that gets discarded once the generator has been trained.
:::
:::success
通常,GAN的組成包含兩個network:generator與discriminator(aka critic)。generator從latent code生成樣本(像是影像),而且理想上這些生成影像的分佈應該要與訓練資料的分佈無法區分。由於一般來說不可能人工設計一個函數來判斷是否是這種情況(判斷兩個分佈是否一致),因此我們訓練一個discriminator network來做這個評估,而因為network是可微的,所以我們會得到一個梯度,可以用來把兩個網路帶往正確的方向。通常,generator是主要的關注點,discriminator則是一個自適應的損失函數(adaptive loss function),在generator訓練好之後就會被丟棄。
:::
:::info
There are multiple potential problems with this formulation. When we measure the distance between the training distribution and the generated distribution, the gradients can point to more or less random directions if the distributions do not have substantial overlap, i.e., are too easy to tell apart (Arjovsky & Bottou, 2017). Originally, Jensen-Shannon divergence was used as a distance metric (Goodfellow et al., 2014), and recently that formulation has been improved (Hjelm et al., 2017) and a number of more stable alternatives have been proposed, including least squares (Mao et al., 2016b), absolute deviation with margin (Zhao et al., 2017), and Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017). Our contributions are largely orthogonal to this ongoing discussion, and we primarily use the improved Wasserstein loss, but also experiment with least-squares loss.
:::
:::success
這個公式(GAN的公式)有幾個潛在的問題。當我們去量測訓練分佈與生成分佈之間的距離時,如果兩個分佈沒有一個比較[顯著](https://terms.naer.edu.tw/detail/3ff184a10680427abccf19026ea850d4/)的重疊的話,那梯度可能就會某種程度上的隨機指個方向,也就是很容易就能夠分辨出來(Arjovsky & Bottou, 2017)。起初,Jensen-Shannon divergence被拿來做為距離的指標(Goodfellow et al., 2014),最近那公式被改進了(Hjelm et al., 2017),也有多種更為穩定的替代方案被提出,包括least squares (Mao et al., 2016b),absolute deviation with margin (Zhao et al., 2017),與Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017)。我們的貢獻很大程度跟這些無關,我們主要使用的是improved Wasserstein loss,不過也有用least-squares loss來做一些實驗就是。
:::
:::info
The generation of high-resolution images is difficult because higher resolution makes it easier to tell the generated images apart from training images (Odena et al., 2017), thus drastically amplifying the gradient problem. Large resolutions also necessitate using smaller minibatches due to memory constraints, further compromising training stability. Our key insight is that we can grow both the generator and discriminator progressively, starting from easier low-resolution images, and add new layers that introduce higher-resolution details as the training progresses. This greatly speeds up training and improves stability in high resolutions, as we will discuss in Section 2.
:::
:::success
高解析度影像的生成是非常困難的,因為更高的解析度會讓生成影像更容易與訓練影像區分開來(Odena et al., 2017),從而徹底地放大梯度問題。大解析度也會因為記憶體的限制而必須使用較小的minibatches,這進一步地犧牲了訓練的穩定度。我們的主要見解在於可以讓generator與discriminator都漸進地成長:從比較簡單的低解析度影像開始,然後隨著訓練的進行一層一層地增加新層,以引入較高解析度的細節。這大大地加快了訓練速度,並改善了高解析度下的穩定性,這部份我們會在Section 2討論。
:::
:::info
The GAN formulation does not explicitly require the entire training data distribution to be represented by the resulting generative model. The conventional wisdom has been that there is a tradeoff between image quality and variation, but that view has been recently challenged (Odena et al., 2017). The degree of preserved variation is currently receiving attention and various methods have been suggested for measuring it, including inception score (Salimans et al., 2016), multi-scale structural similarity (MS-SSIM) (Odena et al., 2017; Wang et al., 2003), birthday paradox (Arora & Zhang, 2017), and explicit tests for the number of discrete modes discovered (Metz et al., 2016). We will describe our method for encouraging variation in Section 3, and propose a new metric for evaluating the quality and variation in Section 5.
:::
:::success
GAN的數學式並沒有明確要求最終的生成模型能表示整個訓練資料分佈。一般的常識認為在影像品質與變化之間存在著一個折衷,但這個觀點近來受到挑戰(Odena et al., 2017)。保留變化的程度目前正受到關注,而且也有各種方法被提出來量測這一點,包含inception score (Salimans et al., 2016),multi-scale structural similarity (MS-SSIM) (Odena et al., 2017; Wang et al., 2003),birthday paradox (Arora & Zhang, 2017),以及對已發現的離散模式數量的明確測試(Metz et al., 2016)。我們會在Section 3中說明我們用來鼓勵變化的方法,然後在Section 5中提出新的指標來評估品質與變化性。
:::
:::info
Section 4.1 discusses a subtle modification to the initialization of networks, leading to a more balanced learning speed for different layers. Furthermore, we observe that mode collapses traditionally plaguing GANs tend to happen very quickly, over the course of a dozen minibatches. Commonly they start when the discriminator overshoots, leading to exaggerated gradients, and an unhealthy competition follows where the signal magnitudes escalate in both networks. We propose a mechanism to stop the generator from participating in such escalation, overcoming the issue (Section 4.2).
:::
:::success
Section 4.1討論了對network初始化的微妙調整,這讓不同層(layers)的學習速度更加平衡。此外,我們觀察到傳統上困擾GANs的模式崩潰(mode collapses)往往發生得非常快,在十幾個minibatches的過程中就出現了。通常這是從discriminator反應過度(overshoot)開始,導致過於誇張的梯度,接著兩個network的信號幅度不斷升級,形成不健康的競爭。我們提出一個機制阻止generator參與這樣的升級,以此解決這個問題(Section 4.2)。
:::
:::info
We evaluate our contributions using the CELEBA, LSUN, CIFAR10 datasets. We improve the best published inception score for CIFAR10. Since the datasets commonly used in benchmarking generative methods are limited to a fairly low resolution, we have also created a higher quality version of the CELEBA dataset that allows experimentation with output resolutions up to 1024 × 1024 pixels. This dataset and our full implementation are available at https://github.com/tkarras/progressive_growing_of_gans, trained networks can be found at https://drive.google.com/open?id=0B4qLcYyJmiz0NHFULTdYc05lX0U along with result images, and a supplementary video illustrating the datasets, additional results, and latent space interpolations is at https://youtu.be/G06dEcZ-QTg.
:::
:::success
我們用CELEBA、LSUN、CIFAR10資料集來評估我們的貢獻。我們提高CIFAR10已發佈的最佳inception score。由於用於基準生成方法的常用資料集通常都是被限制在相當低的解析度,所以我們就建立一個高品質版本的CELEBA資料集,這讓實驗得以在高達1024x1024像素的解析度上進行。這個資料集跟我們完整的實現都在 https://github.com/tkarras/progressive_growing_of_gans ,訓練好的模型可以在 https://drive.google.com/open?id=0B4qLcYyJmiz0NHFULTdYc05lX0U 找到(連結同結果影像),資料集的說明、額外的結果以及latent space interpolations(潛在空間插值)的補充影片都放在 https://youtu.be/G06dEcZ-QTg 。
:::
## 2 PROGRESSIVE GROWING OF GANS
:::info
Our primary contribution is a training methodology for GANs where we start with low-resolution images, and then progressively increase the resolution by adding layers to the networks as visualized in Figure 1. This incremental nature allows the training to first discover large-scale structure of the image distribution and then shift attention to increasingly finer scale detail, instead of having to learn all scales simultaneously.
:::
:::success
我們的主要貢獻是一種用於GANs的訓練方法,我們從低解析度的影像開始,然後透過向網路增加層(layer)的方式漸進地提高解析度,如Figure 1所示。這種增量性質讓訓練首先發現影像分佈的大型結構,然後將注意力轉移到逐漸細膩的尺度細節上,而非同時學習所有的尺度。
:::
:::info

Figure 1: Our training starts with both the generator (G) and discriminator (D) having a low spatial resolution of 4×4 pixels. As the training advances, we incrementally add layers to G and D, thus increasing the spatial resolution of the generated images. All existing layers remain trainable throughout the process. Here $N \times N$ refers to convolutional layers operating on $N \times N$ spatial resolution. This allows stable synthesis in high resolutions and also speeds up training considerably. On the right we show six example images generated using progressive growing at 1024 × 1024.
Figure 1:我們的訓練是從低解析度(4x4 pixels)的generator(G)與discriminator(D)開始。隨著訓練的演進,我們逐漸地向G跟D增加網路層(layer),從而增加生成影像的空間解析度。所有既有的層都會在過程中保持可訓練狀態。這裡的$N \times N$指的是在$N \times N$空間解析度上操作的卷積層。這讓我們可以在高解析度下穩定地合成,也能夠大大地提升訓練速度。右邊我們呈現的六張範例影像就是以1024 x 1024漸進生成所產生的。
:::
:::info
We use generator and discriminator networks that are mirror images of each other and always grow in synchrony. All existing layers in both networks remain trainable throughout the training process. When new layers are added to the networks, we fade them in smoothly, as illustrated in Figure 2. This avoids sudden shocks to the already well-trained, smaller-resolution layers. Appendix A describes structure of the generator and discriminator in detail, along with other training parameters.
:::
:::success
我們使用的generator與discriminator networks是彼此的鏡像,而且始終同步成長。在整個訓練過程中,兩個網路的所有既有層都會維持可訓練狀態。當新的層被加入網路中時,我們會平滑地淡入它們(fade them in smoothly),如Figure 2所示。這避免對那些已經訓練良好的較小解析度層造成突然的衝擊。Appendix A詳述了generator與discriminator的架構以及其它的訓練參數。
:::
:::info

Figure 2: When doubling the resolution of the generator (G) and discriminator (D) we fade in the new layers smoothly. This example illustrates the transition from $16 \times 16$ images (a) to $32 \times 32$ images (c). During the transition (b) we treat the layers that operate on the higher resolution like a residual block, whose weight $\alpha$ increases linearly from 0 to 1. Here $2\times$ and $0.5\times$ refer to doubling and halving the image resolution using nearest neighbor filtering and average pooling, respectively. The $\text{toRGB}$ represents a layer that projects feature vectors to RGB colors and $\text{fromRGB}$ does the reverse; both use $1 \times 1$ convolutions. When training the discriminator, we feed in real images that are downscaled to match the current resolution of the network. During a resolution transition, we interpolate between two resolutions of the real images, similarly to how the generator output combines two resolutions.
Figure 2:當generator(G)與discriminator(D)的解析度加倍時,我們會平滑地淡入新的層。這個範例說明了從$16 \times 16$的影像(a)到$32 \times 32$的影像(c)之間的轉變。在轉換期間(b),我們把在較高解析度上操作的層視為一個殘差塊(residual block),其權重$\alpha$從0線性地增加到1。這邊的$2\times$與$0.5\times$分別表示使用nearest neighbor filtering與average pooling來加倍或減半影像解析度。$\text{toRGB}$表示把feature vectors投射到RGB顏色的層,而$\text{fromRGB}$則是反向操作;兩者都使用$1 \times 1$的卷積。當訓練discriminator的時候,我們會餵入已降採樣到符合網路當下解析度的實際影像。在解析度轉換的過程中,我們會在實際影像的兩個解析度之間做插值,這類似於generator的output結合兩個解析度的方式。
:::
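下面以Python(numpy)給出轉換期間(Figure 2(b))generator端淡入新層的一個簡化示意;這只是在假設之下的草稿,`old_layers`、`new_block`、`to_rgb_old`、`to_rgb_new`等可呼叫物件都是為了說明而假設的名稱,並非論文的官方實作。

```python
import numpy as np

def upsample_2x(x):
    # 最近鄰上採樣(nearest neighbor),x shape: [N, H, W, C]
    return x.repeat(2, axis=1).repeat(2, axis=2)

def generator_fade_in(latent, alpha, old_layers, new_block, to_rgb_old, to_rgb_new):
    """轉換期間的generator輸出:把新的高解析度block以類似殘差塊的方式淡入,
    權重alpha在轉換期間由0線性增加到1。"""
    x = old_layers(latent)                    # 既有較低解析度層的特徵
    skip_rgb = upsample_2x(to_rgb_old(x))     # 舊路徑:toRGB後直接2x上採樣
    new_rgb = to_rgb_new(new_block(x))        # 新路徑:先經過新block再toRGB
    return (1.0 - alpha) * skip_rgb + alpha * new_rgb
```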
:::info
We observe that the progressive training has several benefits. Early on, the generation of smaller images is substantially more stable because there is less class information and fewer modes (Odena et al., 2017). By increasing the resolution little by little we are continuously asking a much simpler question compared to the end goal of discovering a mapping from latent vectors to e.g. 10242 images. This approach has conceptual similarity to recent work by Chen & Koltun (2017). In practice it stabilizes the training sufficiently for us to reliably synthesize megapixel-scale images using WGAN-GP loss (Gulrajani et al., 2017) and even LSGAN loss (Mao et al., 2016b).
:::
:::success
我們觀察到漸進式的訓練有幾個好處。初期的時候,較小影像的生成實質上會更為穩定,因為類別信息跟模式都較少(Odena et al., 2017)。透過一點一點地增加解析度,相較於最終目標(也就是找出從潛在向量到例如1024x1024影像的映射),我們不斷提出的是更為簡單的問題。這方法在概念上類似於Chen & Koltun (2017)近期的研究。實際上,它充分地穩定了訓練,讓我們能夠使用WGAN-GP loss(Gulrajani et al., 2017),甚至是LSGAN loss (Mao et al., 2016b)來可靠地合成megapixel-scale(百萬像素)的影像。
:::
:::info
Another benefit is the reduced training time. With progressively growing GANs most of the iterations are done at lower resolutions, and comparable result quality is often obtained up to 2–6 times faster, depending on the final output resolution.
:::
:::success
另一個好處就是減少訓練時間。使用progressively growing GANs,多數的迭代都是在較低解析度下完成的,通常能以快上2-6倍的速度獲得可比較的結果品質,實際倍數取決於最終的輸出解析度。
:::
:::info
The idea of growing GANs progressively is related to the work of Wang et al. (2017), who use multiple discriminators that operate on different spatial resolutions. That work in turn is motivated by Durugkar et al. (2016) who use one generator and multiple discriminators concurrently, and Ghosh et al. (2017) who do the opposite with multiple generators and one discriminator. Hierarchical GANs (Denton et al., 2015; Huang et al., 2016; Zhang et al., 2017) define a generator and discriminator for each level of an image pyramid. These methods build on the same observation as our work – that the complex mapping from latents to high-resolution images is easier to learn in steps – but the crucial difference is that we have only a single GAN instead of a hierarchy of them. In contrast to early work on adaptively growing networks, e.g., growing neural gas (Fritzke, 1995) and neuro evolution of augmenting topologies (Stanley & Miikkulainen, 2002) that grow networks greedily, we simply defer the introduction of pre-configured layers. In that sense our approach resembles layer-wise training of autoencoders (Bengio et al., 2007).
:::
:::success
漸進式增長GANs的觀念跟Wang et al. (2017)的研究有關,他們使用了多個在不同空間解析度上操作的discriminators。該研究又是受到Durugkar et al. (2016)的啟發,後者同時使用一個generator與多個discriminators,而Ghosh et al. (2017)則反過來使用多個generators與一個discriminator。Hierarchical GANs (Denton et al., 2015; Huang et al., 2016; Zhang et al., 2017)為影像金字塔(image pyramid)的每個層級都定義一個generator與一個discriminator。這些方法跟我們的研究建立在相同的觀察上,也就是從潛在空間到高解析度影像的這種複雜映射,分步學習會比較容易,不過關鍵的差異在於我們只有單一個GAN,而非一個GAN的階層結構。對比自適應生長網路(adaptively growing networks)的早期研究,像是貪婪地生長網路的growing neural gas (Fritzke, 1995)與neuro evolution of augmenting topologies (Stanley & Miikkulainen, 2002),我們只是延遲引入預先配置好(pre-configured)的層。就這方面來說,我們的方法類似於autoencoders的分層訓練(layer-wise training)(Bengio et al., 2007)。
:::
## 3 INCREASING VARIATION USING MINIBATCH STANDARD DEVIATION
:::info
GANs have a tendency to capture only a subset of the variation found in training data, and Salimans et al. (2016) suggest “minibatch discrimination” as a solution. They compute feature statistics not only from individual images but also across the minibatch, thus encouraging the minibatches of generated and training images to show similar statistics. This is implemented by adding a minibatch layer towards the end of the discriminator, where the layer learns a large tensor that projects the input activation to an array of statistics. A separate set of statistics is produced for each example in a minibatch and it is concatenated to the layer’s output, so that the discriminator can use the statistics internally. We simplify this approach drastically while also improving the variation.
:::
:::success
GANs傾向於只捕捉訓練資料中變化的一個子集,Salimans et al. (2016)提出"minibatch discrimination"做為解決方案。他們不僅從各別影像中計算特徵統計量,也會跨整個minibatch計算,從而鼓勵生成影像的minibatch與訓練影像的minibatch展現出相似的統計量。這是透過在discriminator接近末端處加入一個minibatch layer來實現的,這個layer會學習一個很大的張量(tensor),把input activation投射到一組統計量。minibatch中的每個樣本都會產生一組各別的統計量,並且被連接(concatenate)到該層的輸出,這樣discriminator就可以在內部使用這些統計量。我們大幅簡化了這個方法,同時也提高其變化性。
:::
:::info
Our simplified solution has neither learnable parameters nor new hyperparameters. We first compute the standard deviation for each feature in each spatial location over the minibatch. We then average these estimates over all features and spatial locations to arrive at a single value. We replicate the value and concatenate it to all spatial locations and over the minibatch, yielding one additional (constant) feature map. This layer could be inserted anywhere in the discriminator, but we have found it best to insert it towards the end (see Appendix A.1 for details). We experimented with a richer set of statistics, but were not able to improve the variation further. In parallel work, Lin et al. (2017) provide theoretical insights about the benefits of showing multiple images to the discriminator
:::
:::success
我們簡化後的解決方案既沒有可學習參數(learnable parameters),也沒有新的超參數。我們首先計算整個minibatch上每個空間位置中每個特徵的標準差。然後對所有特徵與空間位置平均這些估測值,得到單一數值。我們複製這個值,並將之連接到所有空間位置與整個minibatch上,產生一張額外的(常數)特徵圖(feature map)。這一層可以插入discriminator的任何位置,但我們發現插在接近末端的位置效果最好(詳見Appendix A.1)。我們用一組更豐富的統計量來實驗,不過並沒有辦法進一步提升變化性。在平行的研究中,Lin et al. (2017)提出了關於向discriminator顯示多張影像的好處的理論見解。
:::
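依照上面的描述,minibatch standard deviation層可以用幾行程式碼表達。以下是一個numpy的簡化示意(假設channel-last的張量排列,`eps`為避免除零的小常數,並非論文原始碼):

```python
import numpy as np

def minibatch_stddev(x, eps=1e-8):
    """x: [N, H, W, C] 的activations(channel-last)。
    先對整個minibatch計算每個空間位置、每個特徵的標準差,
    再對所有特徵與空間位置取平均得到單一數值,
    最後複製成一張常數feature map並concat回每個樣本。"""
    std = np.sqrt(x.var(axis=0) + eps)                 # [H, W, C]
    avg = std.mean()                                   # 單一數值
    n, h, w, _ = x.shape
    extra = np.full((n, h, w, 1), avg, dtype=x.dtype)  # 常數feature map
    return np.concatenate([x, extra], axis=-1)         # [N, H, W, C+1]
```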
:::info
Alternative solutions to the variation problem include unrolling the discriminator (Metz et al., 2016) to regularize its updates, and a “repelling regularizer” (Zhao et al., 2017) that adds a new loss term to the generator, trying to encourage it to orthogonalize the feature vectors in a minibatch. The multiple generators of Ghosh et al. (2017) also serve a similar goal. We acknowledge that these solutions may increase the variation even more than our solution – or possibly be orthogonal to it – but leave a detailed comparison to a later time.
:::
:::success
對於變化性問題的替代解決方案包含展開(unrolling)discriminator(Metz et al., 2016)以正則化其更新,還有"repelling regularizer" (Zhao et al., 2017),這方法對generator增加一個新的損失項(loss term),試圖鼓勵它把minibatch裡面的特徵向量正交化。Ghosh et al. (2017)所提出的multiple generators也有類似的目標。我們承認,這幾個解決方案也許比我們的更能增加變化性,或者可能與之正交(互補),不過詳細的比較就留待後續了。
:::
## 4 NORMALIZATION IN GENERATOR AND DISCRIMINATOR
:::info
GANs are prone to the escalation of signal magnitudes as a result of unhealthy competition between the two networks. Most if not all earlier solutions discourage this by using a variant of batch normalization (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016) in the generator, and often also in the discriminator. These normalization methods were originally introduced to eliminate covariate shift. However, we have not observed that to be an issue in GANs, and thus believe that the actual need in GANs is constraining signal magnitudes and competition. We use a different approach that consists of two ingredients, neither of which include learnable parameters.
:::
:::success
GANs容易因為兩個網路之間不健康的競爭而導致信號幅度的升級。多數早期的解決方案(如果不是全部的話)都是在generator中使用batch normalization的變體(Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016)來阻止這種情況,而且通常discriminator中也會使用。這些正規化(normalization)方法最初是被引入來消除covariate shift(共變量偏移)。然而,我們並沒有觀察到這在GANs中是個問題,因此我們相信GANs實際需要的是限制信號幅度與競爭。我們使用一種由兩個成份所組成的不同方法,兩者都不包含可學習參數。
:::
### 4.1 EQUALIZED LEARNING RATE
:::info
We deviate from the current trend of careful weight initialization, and instead use a trivial $\mathcal{N}(0,1)$ initialization and then explicitly scale the weights at runtime. To be precise, we set $\hat{w}_i=w_i/c$, where wi are the weights and c is the per-layer normalization constant from He’s initializer (He et al., 2015). The benefit of doing this dynamically instead of during initialization is somewhat subtle, and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time. Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights. A similar reasoning was independently used by van Laarhoven (2017).
:::
:::success
我們偏離了目前謹慎初始化權重的趨勢,改用[平凡的](https://terms.naer.edu.tw/detail/893da2d80584b446a2588797196afc3c/)(trivial)$\mathcal{N}(0,1)$初始化方式,然後在執行時顯式地縮放權重。精確一點的說法是,我們設置$\hat{w}_i=w_i/c$,其中$w_i$是權重,而$c$則是每一層來自He(He et al., 2015)[初始器](https://terms.naer.edu.tw/detail/dbc0c91a5941084fc9cf445ea004ddc3/)的正規化常數。動態地這麼做而不是在初始化期間執行的好處有些微妙,這跟常用的自適應隨機梯度下降法(adaptive stochastic gradient descent methods),像是RMSProp (Tieleman & Hinton, 2012)與Adam (Kingma & Ba, 2015),的[尺度不變性](https://terms.naer.edu.tw/detail/e3f9817f33d7d4219494f1e9f5d9a6f8/)(scale-invariance)有關。這些方法會利用估測的標準差來正規化梯度更新,這讓更新與參數的尺度(scale)無關。因此,如果有一些參數的動態範圍比其它參數來得大,它們就需要比較長的時間來調整。這是現代初始器會導致的一種情況,因此learning rate可能同時過大又過小。我們的方法能夠確保所有權重的動態範圍(dynamic range)以及學習速度都相同。van Laarhoven (2017)也獨立地使用了類似的推論。
:::
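下面是equalized learning rate的一個簡化示意(以一個全連接層為例):權重以$\mathcal{N}(0,1)$初始化,執行時才乘上He initializer的每層常數。這裡把縮放寫成乘上 $\text{gain}/\sqrt{\text{fan\_in}}$(gain取$\sqrt{2}$為假設),與論文中$\hat{w}_i=w_i/c$的寫法等價,僅供說明之用。

```python
import numpy as np

def he_scale(fan_in, gain=np.sqrt(2.0)):
    # He initializer對應的每層縮放常數
    return gain / np.sqrt(fan_in)

def equalized_dense(x, weight, bias):
    """x: [N, fan_in],weight: [fan_in, fan_out](以N(0,1)初始化),bias: [fan_out]。
    在前向傳遞時才套用縮放,讓所有權重的有效動態範圍一致。"""
    c = he_scale(weight.shape[0])
    return x @ (weight * c) + bias

# 使用範例
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 256))   # 平凡的N(0,1)初始化
b = np.zeros(256)
out = equalized_dense(rng.standard_normal((16, 512)), w, b)
```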
### 4.2 PIXELWISE FEATURE VECTOR NORMALIZATION IN GENERATOR
:::info
To disallow the scenario where the magnitudes in the generator and discriminator spiral out of control as a result of competition, we normalize the feature vector in each pixel to unit length in the generator after each convolutional layer. We do this using a variant of “local response normalization” (Krizhevsky et al., 2012), configured as $b_{x, y} = a_{x, y} / \sqrt{\dfrac{1}{N}\sum_{j=0}^{N-1}(a^j_{x,y})^2 + \epsilon}$, where $\epsilon=10^{-8}$, $N$ is the number of feature maps, and $a_{x,y}$ and $b_{x, y}$ are the original and normalized feature vector in pixel($x,y$), respectively. We find it surprising that this heavy-handed constraint does not seem to harm the generator in any way, and indeed with most datasets it does not change the results much, but it prevents the escalation of signal magnitudes very effectively when needed.
:::
:::success
為了避免generator與discriminator中的幅度因競爭而螺旋式失控,我們在generator中的每個卷積層之後,把每個pixel的特徵向量正規化為單位長度(unit length)。我們採用的是"local response normalization"(Krizhevsky et al., 2012)的一種變體,設置為$b_{x, y} = a_{x, y} / \sqrt{\dfrac{1}{N}\sum_{j=0}^{N-1}(a^j_{x,y})^2 + \epsilon}$,其中$\epsilon=10^{-8}$,$N$為feature maps的數量,$a_{x,y}$與$b_{x, y}$分別是pixel($x,y$)中的原始特徵向量與正規化後的特徵向量。讓人驚訝的是,這麼強硬的限制看起來並不會對generator造成任何損害,事實上對於多數的資料集,它並沒有讓結果產生太大變化,不過在需要的時候,它可以非常有效地防止信號幅度的升級。
:::
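Section 4.2的公式可以直接寫成對channel維度的運算。以下是一個numpy的簡化示意,假設channel-last的排列:

```python
import numpy as np

def pixel_norm(x, eps=1e-8):
    """x: [N, H, W, C]。對每個像素(x, y)的特徵向量做正規化:
    b = a / sqrt(mean(a^2) + eps),其中平均是沿著C個feature maps計算。"""
    return x / np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
```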
## 5 MULTI-SCALE STATISTICAL SIMILARITY FOR ASSESSING GAN RESULTS
:::info
In order to compare the results of one GAN to another, one needs to investigate a large number of images, which can be tedious, difficult, and subjective. Thus it is desirable to rely on automated methods that compute some indicative metric from large image collections. We noticed that existing methods such as MS-SSIM (Odena et al., 2017) find large-scale mode collapses reliably but fail to react to smaller effects such as loss of variation in colors or textures, and they also do not directly assess image quality in terms of similarity to the training set.
:::
:::success
為了比較不同GAN之間的結果,我們需要檢視大量的影像,這可能很乏味、困難、而且主觀。也因此我們希望可以依賴一些自動化方法,從大量的影像集合中計算一些指示性指標。我們注意到,現有的方法像是MS-SSIM (Odena et al., 2017)可以可靠地找出大規模的mode collapses,不過對於較小的影響,像是顏色或紋理變化的喪失,就無法反應,而且它們也無法直接地根據與訓練集之間的相似度來評估影像品質。
:::
:::info
We build on the intuition that a successful generator will produce samples whose local image structure is similar to the training set over all scales. We propose to study this by considering the multiscale statistical similarity between distributions of local image patches drawn from Laplacian pyramid (Burt & Adelson, 1987) representations of generated and target images, starting at a low-pass resolution of 16 × 16 pixels. As per standard practice, the pyramid progressively doubles until the full resolution is reached, each successive level encoding the difference to an up-sampled version of the previous level.
:::
:::success
我們的出發點是這樣的直覺:一個成功的generator所生成的樣本,其局部影像結構在所有尺度(all scales)上都會跟訓練集類似。我們提出透過考慮從生成影像與目標影像的Laplacian pyramid (Burt & Adelson, 1987)表示中抽取的局部影像區塊分佈之間的多尺度統計相似性來研究這一點,從16x16 pixel的[低通](https://terms.naer.edu.tw/detail/2339437c7054ee8b1adac458369bb31d/)解析度開始。依照標準做法,金字塔(pyramid)逐級加倍解析度,一直到完整解析度為止,每個連續的級別都編碼該級與前一級上採樣版本之間的差異。
:::
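以下是Laplacian pyramid建構的一個簡化示意,用2x2 average pooling做下採樣、最近鄰做上採樣(論文實際採用的濾波器可能不同,這裡僅示意「每一級編碼與前一級上採樣版本之間的差異」的概念):

```python
import numpy as np

def downsample_2x(img):
    # 2x2 average pooling,img shape: [H, W, C],H、W需為偶數
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample_2x(img):
    # 最近鄰上採樣
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, lowest_res=16):
    """由高解析度往下建構:每一級保存該級影像與「下一級上採樣版本」的差異,
    最底層則保留lowest_res x lowest_res的低通影像。回傳由低到高排列的各級。"""
    levels = []
    current = img.astype(np.float64)
    while current.shape[0] > lowest_res:
        down = downsample_2x(current)
        levels.append(current - upsample_2x(down))  # band-pass差異
        current = down
    levels.append(current)                          # 低通層
    return levels[::-1]
```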
:::info
A single Laplacian pyramid level corresponds to a specific spatial frequency band. We randomly sample 16384 images and extract 128 descriptors from each level in the Laplacian pyramid, giving us $2^{21}$(2.1M) descriptors per level. Each descriptor is a 7 × 7 pixel neighborhood with 3 color channels, denoted by $x\in \mathbb{R}^{7\times 7\times 3}=\mathbb{R}^{147}$. We denote the patches from level $l$ of the training set and generated set as $\left\{x^l_i \right\}^{2^{21}}_{i=1}$ and $\left\{y^l_i \right\}^{2^{21}}_{i=1}$, respectively. We first normalize $\left\{x^l_i \right\}$ and $\left\{y^l_i \right\}$ w.r.t. the mean and standard deviation of each color channel, and then estimate the statistical similarity by computing their sliced Wasserstein distance SWD($\left\{x^l_i \right\}, \left\{y^l_i \right\}$), an efficiently computable randomized approximation to earthmovers distance, using 512 projections (Rabin et al., 2011).
:::
:::success
單一個Laplacian pyramid level對應一個特定的空間[頻帶](https://terms.naer.edu.tw/detail/69afb90be9fe95299992ae21ffed71ea/)(spatial frequency band)。我們隨機採樣16384張影像,並從Laplacian pyramid中的每一個級別(level)中各提取128個[描述符](https://terms.naer.edu.tw/detail/35f2883ce97fadd5740eb421520bc8c2/),每個級別共得到$2^{21}$(2.1M)個描述符。每一個描述符都是具有3個色通道(color channel)的7x7像素鄰域,以$x\in \mathbb{R}^{7\times 7\times 3}=\mathbb{R}^{147}$表示。我們把來自訓練集與生成集的level $l$的區塊分別表示為$\left\{x^l_i \right\}^{2^{21}}_{i=1}$與$\left\{y^l_i \right\}^{2^{21}}_{i=1}$。我們首先針對每個color channel的均值與標準差對$\left\{x^l_i \right\}$與$\left\{y^l_i \right\}$做正規化,然後透過計算它們的sliced Wasserstein distance SWD($\left\{x^l_i \right\}, \left\{y^l_i \right\}$)來估測統計相似性,這是對earth mover's distance的一種可高效計算的隨機近似,使用512個projections (Rabin et al., 2011)。
:::
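sliced Wasserstein distance的想法是:把高維描述符隨機投影到一維,在一維上Wasserstein距離可以用排序後的逐點差取得,再對多個隨機方向取平均。以下是一個簡化示意(假設x、y已做過per-channel正規化,且兩者數量相同):

```python
import numpy as np

def sliced_wasserstein_distance(x, y, n_projections=512, seed=0):
    """x, y: [M, D] 的描述符集合(例如D=147)。
    以n_projections個隨機方向近似earth mover's distance。"""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_projections):
        direction = rng.standard_normal(x.shape[1])
        direction /= np.linalg.norm(direction)
        px = np.sort(x @ direction)              # 投影到一維後排序
        py = np.sort(y @ direction)
        dists.append(np.mean(np.abs(px - py)))   # 一維的Wasserstein-1距離
    return float(np.mean(dists))
```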
:::info
Intuitively a small Wasserstein distance indicates that the distribution of the patches is similar, meaning that the training images and generator samples appear similar in both appearance and variation at this spatial resolution. In particular, the distance between the patch sets extracted from the lowest resolution 16 × 16 images indicate similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise.
:::
:::success
直觀來看,小的Wasserstein distance所說明的是區塊的分佈是相似的,這表示訓練影像與生成器樣本在該空間解析度上(spatial resolution)的外觀與變化上都是相似的。特別是,從最低解析度,16x16影像,中所提取的區塊集(patch sets)之間的距離就指出在大尺度影像結構中的相似性,而最精緻級別區塊則是編碼關於像素級別(pixel level)屬性的信息,像是邊緣的銳利度與噪點。
:::
## 6 EXPERIMENTS
:::info
In this section we discuss a set of experiments that we conducted to evaluate the quality of our results. Please refer to Appendix A for detailed description of our network structures and training configurations. We also invite the reader to consult the accompanying video (https://youtu.be/G06dEcZ-QTg) for additional result images and latent space interpolations. In this section we will distinguish between the network structure (e.g., convolutional layers, resizing), training configuration (various normalization layers, minibatch-related operations), and training loss (WGAN-GP, LSGAN).
:::
:::success
這一章節中我們討論一組為了評估結果品質而進行的實驗。網路結構與訓練配置的詳細資訊請參考Appendix A。我們也邀請讀者參考隨附的影片(https://youtu.be/G06dEcZ-QTg),其中有額外的結果影像與潛在空間插值。本節中,我們會區分網路結構(像是卷積層、調整大小)、訓練配置(各種正規化層、minibatch相關的操作),以及訓練損失(WGAN-GP,LSGAN)。
:::
### 6.1 IMPORTANCE OF INDIVIDUAL CONTRIBUTIONS IN TERMS OF STATISTICAL SIMILARITY
:::info
We will first use the sliced Wasserstein distance (SWD) and multi-scale structural similarity (MSSSIM) (Odena et al., 2017) to evaluate the importance our individual contributions, and also perceptually validate the metrics themselves. We will do this by building on top of a previous state-of-the-art loss function (WGAN-GP) and training configuration (Gulrajani et al., 2017) in an unsupervised setting using CELEBA (Liu et al., 2015) and LSUN BEDROOM (Yu et al., 2015) datasets in 128 resolution. CELEBA is particularly well suited for such comparison because the training images contain noticeable artifacts (aliasing, compression, blur) that are difficult for the generator to reproduce faithfully. In this test we amplify the differences between training configurations by choosing a relatively low-capacity network structure (Appendix A.2) and terminating the training once the discriminator has been shown a total of 10M real images. As such the results are not fully converged.
:::
:::success
我們首先會使用sliced Wasserstein distance (SWD)與multi-scale structural similarity (MSSSIM) (Odena et al., 2017)來評估我們各別貢獻的重要性,並從感知上驗證這些指標本身。我們會在非監督式設定下使用CELEBA (Liu et al., 2015)與LSUN BEDROOM (Yu et al., 2015)資料集,以128解析度,基於先前最佳的損失函數(WGAN-GP)與訓練配置(Gulrajani et al., 2017)來建構。CELEBA特別適合這種比較,因為訓練影像包含明顯的[假影](https://terms.naer.edu.tw/detail/ce1251741a9f7d67bcb4632c6ad886af/)([重疊](https://terms.naer.edu.tw/detail/8204551e5bd2e1587cdb9e8320b48843/)、壓縮、模糊),這些是generator很難忠實復現的。在這個測試中,我們透過選擇相對低容量的網路架構(Appendix A.2)來放大訓練配置之間的差異,並在discriminator總共看過10M張實際影像後終止訓練。因此,其結果並沒有完全收斂。
:::
:::info
Table 1 lists the numerical values for SWD and MS-SSIM in several training configurations, where our individual contributions are cumulatively enabled one by one on top of the baseline (Gulrajani et al., 2017). The MS-SSIM numbers were averaged from 10000 pairs of generated images, and SWD was calculated as described in Section 5. Generated CELEBA images from these configurations are shown in Figure 3. Due to space constraints, the figure shows only a small number of examples for each row of the table, but a significantly broader set is available in Appendix H. Intuitively, a good evaluation metric should reward plausible images that exhibit plenty of variation in colors, textures, and viewpoints. However, this is not captured by MS-SSIM: we can immediately see that configuration (h) generates significantly better images than configuration (a), but MS-SSIM remains approximately unchanged because it measures only the variation between outputs, not similarity to the training set. SWD, on the other hand, does indicate a clear improvement.
:::
:::success
Table 1列出幾種訓練配置下SWD與MS-SSIM的數值,其中我們的各別貢獻在基線(Gulrajani et al., 2017)的基礎上逐一累積啟用。MS-SSIM的數值是從10000對生成影像平均而來,而SWD的計算則如Section 5所述。從這些配置中所生成的CELEBA影像如Figure 3所示。由於空間的限制,該圖對表中的每一個row只呈現少量範例,不過Appendix H提供了範圍明顯更廣的一組範例。直觀來看,一個好的評估指標應該要獎勵在色彩、紋理與視角上展現出豐富變化的合理影像。不過,MS-SSIM並沒有捕捉到這些:我們馬上可以看到,配置(h)生成的影像明顯比配置(a)來得好,不過MS-SSIM大致沒有變化,因為它只量測輸出之間的變化,而不是與訓練集之間的相似度。另一方面,SWD確實顯示出明顯的改進。
:::
:::info

Table 1: Sliced Wasserstein distance (SWD) between the generated and training images (Section 5) and multi-scale structural similarity (MS-SSIM) among the generated images for several training setups at 128 × 128. For SWD, each column represents one level of the Laplacian pyramid, and the last one gives an average of the four distances.
Table 1:多個訓練設定於128x128解析度的生成與訓練影像(Section 5)之間的Sliced Wasserstein distance (SWD),與生成影像之間的multi-scale structural similarity (MS-SSIM)。SWD的每個column都表示Laplacian pyramid的一個層級,最後一個給出四個距離的平均值。
:::
:::info

Figure 3: (a) – (g) CELEBA examples corresponding to rows in Table 1. These are intentionally non-converged. (h) Our converged result. Notice that some images show aliasing and some are not sharp – this is a flaw of the dataset, which the model learns to replicate faithfully
Figure 3:(a) – (g)對應Table 1各rows的CELEBA樣本。這些是刻意未收斂的。(h)是我們收斂後的結果。注意有些影像出現重疊(aliasing)現象,有些則不夠銳利,這是資料集本身的缺陷,而模型也學會忠實地複製它。
:::
:::info
The first training configuration (a) corresponds to Gulrajani et al. (2017), featuring batch normalization in the generator, layer normalization in the discriminator, and minibatch size of 64. (b) enables progressive growing of the networks, which results in sharper and more believable output images. SWD correctly finds the distribution of generated images to be more similar to the training set.
:::
:::success
第一個訓練配置(a)對應的是Gulrajani et al. (2017),其特點是在generator中使用batch normalization,在discriminator中使用layer normalization,以及64的minibatch size。(b)啟用了網路的漸進式增長,從而產生更清晰、更可信的輸出影像。SWD正確地發現生成影像的分佈變得與訓練集更相似。
:::
:::info
Our primary goal is to enable high output resolutions, and this requires reducing the size of minibatches in order to stay within the available memory budget. We illustrate the ensuing challenges in (c) where we decrease the minibatch size from 64 to 16. The generated images are unnatural, which is clearly visible in both metrics. In (d), we stabilize the training process by adjusting the hyperparameters as well as by removing batch normalization and layer normalization (Appendix A.2). As an intermediate test (e∗), we enable minibatch discrimination (Salimans et al., 2016), which somewhat surprisingly fails to improve any of the metrics, including MS-SSIM that measures output variation. In contrast, our minibatch standard deviation (e) improves the average SWD scores and images. We then enable our remaining contributions in (f) and (g), leading to an overall improvement in SWD and subjective visual quality. Finally, in (h) we use a non-crippled network and longer training – we feel the quality of the generated images is at least comparable to the best published results so far.
:::
:::success
我們的主要目標是實現高輸出解析度,這需要減少minibatches的大小,才能維持在可用的記憶體預算內。我們在(c)中說明隨之而來的挑戰,這邊我們把minibatch size從64調降到16。生成的影像是不自然的,這在兩個指標中都清晰可見。在(d)中,我們透過調整超參數以及移除batch normalization與layer normalization(Appendix A.2)來穩定訓練過程。做為一個中間測試(e\*),我們啟用了minibatch discrimination (Salimans et al., 2016),讓人有點意外的是,這並不能改善任何一個指標,包含量測輸出變化的MS-SSIM也是。對比之下,我們的minibatch standard deviation(e)提升了平均的SWD分數與影像品質。然後我們在(f)與(g)啟用其餘的貢獻,全面地改進了SWD與主觀視覺品質。最終,在(h)我們使用一個非削減(non-crippled)的網路與更長的訓練,我們認為所生成的影像品質至少與迄今為止已發表的最佳結果相當。
:::
### 6.2 CONVERGENCE AND TRAINING SPEED
:::info
Figure 4 illustrates the effect of progressive growing in terms of the SWD metric and raw image throughput. The first two plots correspond to the training configuration of Gulrajani et al. (2017) without and with progressive growing. We observe that the progressive variant offers two main benefits: it converges to a considerably better optimum and also reduces the total training time by about a factor of two. The improved convergence is explained by an implicit form of curriculum learning that is imposed by the gradually increasing network capacity. Without progressive growing, all layers of the generator and discriminator are tasked with simultaneously finding succinct intermediate representations for both the large-scale variation and the small-scale detail. With progressive growing, however, the existing low-resolution layers are likely to have already converged early on, so the networks are only tasked with refining the representations by increasingly smaller-scale effects as new layers are introduced. Indeed, we see in Figure 4(b) that the largest-scale statistical similarity curve (16) reaches its optimal value very quickly and remains consistent throughout the rest of the training. The smaller-scale curves (32, 64, 128) level off one by one as the resolution is increased, but the convergence of each curve is equally consistent. With non-progressive training in Figure 4(a), each scale of the SWD metric converges roughly in unison, as could be expected.
:::
:::success
Figure 4說明了漸進式增長在SWD指標與原始影像吞吐量方面的影響。前兩張圖對應Gulrajani et al. (2017)的訓練配置,分別是沒有與有使用漸進式增長。我們觀察到,漸進式的變體提供兩個主要好處:它收斂到一個明顯更好的最佳值,而且將總訓練時間縮短約一半。收斂性的改善可以用一種隱含形式的curriculum learning來解釋,這是由逐步增加的網路容量所帶來的。在沒有漸進增長的情況下,generator與discriminator的所有層必須同時為大尺度的變化與小尺度的細節找出簡潔的中間表示。然而,使用漸進增長的情況下,既有的低解析度層(low-resolution layers)可能在早期就已經收斂了,因此,隨著新層的引入,網路只需要以愈來愈小尺度的影響來細化其表示。確實,我們在Figure 4(b)看得出來,最大尺度的統計相似性曲線(16)很快就達到它的最佳值,並且在剩餘的訓練過程中維持一致。隨著解析度的增加,較小尺度的曲線(32、64、128)一個接著一個趨於平穩,不過每條曲線的收斂性是同樣一致的。對比之下,Figure 4(a)的非漸進式訓練中,SWD指標的每一個尺度大約都同時收斂,這是可以預期的。
:::
:::info

Figure 4: Effect of progressive growing on training speed and convergence. The timings were measured on a single-GPU setup using NVIDIA Tesla P100. (a) Statistical similarity with respect to wall clock time for Gulrajani et al. (2017) using CELEBA at 128 × 128 resolution. Each graph represents sliced Wasserstein distance on one level of the Laplacian pyramid, and the vertical line indicates the point where we stop the training in Table 1. (b) Same graph with progressive growing enabled. The dashed vertical lines indicate points where we double the resolution of G and D. (c) Effect of progressive growing on the raw training speed in 1024 × 1024 resolution.
Figure 4:漸進式增長對於訓練速度與收斂性的影響。時間是在單一GPU設定上量測的,使用的是NVIDIA Tesla P100。(a)Gulrajani et al. (2017)使用CELEBA在128 x 128解析度下,統計相似性相對於[經過時間](https://terms.naer.edu.tw/detail/23e204a1eddec765331c37fe3f692054/)(wall clock time)的變化。每條曲線表示Laplacian pyramid其中一個級別上的sliced Wasserstein distance,垂直線指出我們在Table 1中停止訓練的點。(b)啟用漸進式增長後的相同圖形。垂直虛線表示我們把G與D的解析度加倍的點。(c)在1024 x 1024解析度下,漸進式增長對原始訓練速度的影響。
:::
:::info
The speedup from progressive growing increases as the output resolution grows. Figure 4(c) shows training progress, measured in number of real images shown to the discriminator, as a function of training time when the training progresses all the way to 1024^2^ resolution. We see that progressive growing gains a significant head start because the networks are shallow and quick to evaluate at the beginning. Once the full resolution is reached, the image throughput is equal between the two methods. The plot shows that the progressive variant reaches approximately 6.4 million images in 96 hours, whereas it can be extrapolated that the non-progressive variant would take about 520 hours to reach the same point. In this case, the progressive growing offers roughly a 5.4× speedup.
:::
:::success
隨著輸出解析度的增加,漸進式增長帶來的加速也隨之增加。Figure 4(c)顯示當訓練一路進行到1024^2^解析度時,以顯示給discriminator的實際影像數量來量測的訓練進度,作為訓練時間的函數。我們看到,漸進式增長取得明顯的領先,因為一開始網路較淺,評估起來很快。一旦達到全解析度(full resolution),兩種方法的影像吞吐量就會相等。圖中顯示,漸進式的變體在96小時內達到大約640萬張影像,而可以外推出,非漸進式的變體要花大約520小時才能達到相同的進度。這種情況下,漸進式增長提供了大約5.4倍的加速。
:::
### 6.3 HIGH-RESOLUTION IMAGE GENERATION USING CELEBA-HQ DATASET
:::info
To meaningfully demonstrate our results at high output resolutions, we need a sufficiently varied high-quality dataset. However, virtually all publicly available datasets previously used in GAN literature are limited to relatively low resolutions ranging from 32^2^ to 480^2^ . To this end, we created a high-quality version of the CELEBA dataset consisting of 30000 of the images at 1024 × 1024 resolution. We refer to Appendix C for further details about the generation of this dataset.
:::
:::success
為了能夠在高輸出解析度下有意義地展示我們的成果,我們需要一個變化夠豐富的高品質資料集。不過,幾乎所有先前用於GAN文獻的公開可用資料集都被限制在相對較低的解析度(32^2^~480^2^)。為此,我們建立了一個高品質版本的CELEBA資料集,包含30000張1024 x 1024解析度的影像。關於此資料集生成方式的更多細節請參閱Appendix C。
:::
:::info
Our contributions allow us to deal with high output resolutions in a robust and efficient fashion. Figure 5 shows selected 1024 × 1024 images produced by our network. While megapixel GAN results have been shown before in another dataset (Marchesi, 2017), our results are vastly more varied and of higher perceptual quality. Please refer to Appendix F for a larger set of result images as well as the nearest neighbors found from the training data. The accompanying video shows latent space interpolations and visualizes the progressive training. The interpolation works so that we first randomize a latent code for each frame (512 components sampled individually from $\mathcal{N}(0, 1)$), then blur the latents across time with a Gaussian ($\sigma=45$ frames \@ 60Hz), and finally normalize each vector to lie on a hypersphere.
:::
:::success
我們的貢獻讓我們能以穩健、高效的方式處理高輸出解析度。Figure 5展示了由我們的網路所生成的1024 x 1024影像中挑選出的範例。儘管先前已有在其它資料集上展示過megapixel等級的GAN結果(Marchesi, 2017),我們的結果變化更豐富,感知品質也更高。更大一組的結果影像以及從訓練資料中找出的最近鄰,請參考Appendix F。隨附的影片展示了潛在空間插值並可視化了漸進式訓練。插值的做法是:我們先為每個frame(幀)隨機採樣一個latent code(512個components各自從$\mathcal{N}(0, 1)$中採樣),然後用高斯($\sigma=45$ frames \@ 60Hz)沿時間方向模糊這些latent,最後把每個向量正規化到一個[超球面](https://terms.naer.edu.tw/detail/7e76efb067dcf95c325ffc6d555a4a32/)上。
:::
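影片中的latent space interpolation流程可以粗略寫成下面的示意:每幀獨立採樣、沿時間軸做高斯模糊、再投影回超球面。其中超球面半徑取$\sqrt{512}$是這裡的假設,並非論文明確給出的細節:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def interpolation_latents(n_frames, dim=512, sigma=45.0, seed=0):
    """為每一幀產生latent code:各幀獨立採樣自N(0,1),
    沿時間軸以sigma=45幀(60Hz)做高斯模糊,再把每個向量正規化到超球面上。"""
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal((n_frames, dim))
    latents = gaussian_filter1d(latents, sigma=sigma, axis=0)   # 沿時間軸模糊
    norms = np.linalg.norm(latents, axis=1, keepdims=True)
    return latents / norms * np.sqrt(dim)                       # 半徑取sqrt(dim)為假設
```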
:::info
We trained the network on 8 Tesla V100 GPUs for 4 days, after which we no longer observed qualitative differences between the results of consecutive training iterations. Our implementation used an adaptive minibatch size depending on the current output resolution so that the available memory budget was optimally utilized.
:::
:::success
我們花了4天用了8片Tesla V100 GPUs訓練模型,之後就沒有再觀察到連續的訓練迭代結果之間質量上的差異。我們的實作會根據當前的輸出解析度來採用自適應minibatch size,也因此可以最佳化可用的記憶體預算。
:::
:::info
In order to demonstrate that our contributions are largely orthogonal to the choice of a loss function, we also trained the same network using LSGAN loss instead of WGAN-GP loss. Figure 1 shows six examples of 1024^2^ images produced using our method using LSGAN. Further details of this setup are given in Appendix B.
:::
:::success
為了證明我們的貢獻很大程度上與損失函數的選擇正交,我們還使用LSGAN loss而非WGAN-GP loss訓練了相同的網路。Figure 1展示了使用我們的方法搭配LSGAN所生成的6張1024^2^影像。更多設置細節請參考Appendix B。
:::
:::info

Figure 1: Our training starts with both the generator (G) and discriminator (D) having a low spatial resolution of 4×4 pixels. As the training advances, we incrementally add layers to G and D, thus increasing the spatial resolution of the generated images. All existing layers remain trainable throughout the process. Here N × N refers to convolutional layers operating on N × N spatial resolution. This allows stable synthesis in high resolutions and also speeds up training considerably. On the right we show six example images generated using progressive growing at 1024 × 1024.
Figure 1:我們的訓練(generator (G)與discriminator (D))都是從低空間解析度4×4 pixels開始。隨著訓練的進行,我們會向G、D增加網路層,從而提高生成影像的空間解析度。所有既有的網路層在過程中都維持著可訓練的狀態。這邊$N \times N$指的是在$N \times N$空間解析度上操作的卷積層。這讓我們可以在高解析度情況下穩定合成,也大大地提高訓練速度。右邊我們展示了以1024x1024解析度漸進式成長所生成的6張範例影像。
:::
### 6.4 LSUN RESULTS
:::info
Figure 6 shows a purely visual comparison between our solution and earlier results in LSUN BEDROOM. Figure 7 gives selected examples from seven very different LSUN categories at 256^2^ . A larger, non-curated set of results from all 30 LSUN categories is available in Appendix G, and the video demonstrates interpolations. We are not aware of earlier results in most of these categories, and while some categories work better than others, we feel that the overall quality is high.
:::
:::success
Figure 6給出我們的解決方案與LSUN BEDROOM中早期結果之間的純視覺比較。Figure 7則展示從七個非常不同的LSUN類別中以256^2^解析度挑選出的範例。Appendix G提供了一組來自全部30個LSUN類別、更大且未經挑選的結果集,影片中也展示了插值。我們並未注意到這些類別中大多數有更早的結果,雖然有些類別的效果比其它類別來得好,但我們覺得整體品質是高的。
:::
:::info

Figure 6: Visual quality comparison in LSUN BEDROOM; pictures copied from the cited articles.
:::
:::info

Figure 7: Selection of 256 × 256 images generated from different LSUN categories.
:::
### 6.5 CIFAR10 INCEPTION SCORES
:::info
The best inception scores for CIFAR10 (10 categories of 32 × 32 RGB images) we are aware of are 7.90 for unsupervised and 8.87 for label conditioned setups (Grinblat et al., 2017). The large difference between the two numbers is primarily caused by “ghosts” that necessarily appear between classes in the unsupervised setting, while label conditioning can remove many such transitions.
:::
:::success
就我們所知,CIFAR10(10個類別,32x32 RGB影像)目前最佳的inception scores是非監督式設定下的7.90與標籤條件(label conditioned)設定下的8.87(Grinblat et al., 2017)。這兩個數字之間的巨大差異主要是因為在非監督式設定下類別之間必然出現的「鬼影(ghosts)」,而label conditioning可以移除許多這類的過渡。
:::
:::info
When all of our contributions are enabled, we get 8.80 in the unsupervised setting. Appendix D shows a representative set of generated images along with a more comprehensive list of results from earlier methods. The network and training setup were the same as for CELEBA, progression limited to 32 × 32 of course. The only customization was to the WGAN-GP’s regularization term $\mathbb{E}_{\hat{\mathbf{x}}\sim\mathbb{P}_{\hat{\mathbf{x}}}}[(\Vert \nabla_{\hat{\mathbf{x}}}D(\hat{\mathbf{x}})\Vert_2 - \gamma)^2/\gamma^2]$. Gulrajani et al. (2017) used $\gamma=1.0$, which corresponds to 1-Lipschitz, but we noticed that it is in fact significantly better to prefer fast transitions ($\gamma=750$) to minimize the ghosts. We have not tried this trick with other datasets.
:::
:::success
當我們功力全開(啟用所有貢獻)的時候,我們在非監督式設定中得到8.80的分數。Appendix D展示了一組具有代表性的生成影像,以及來自早期方法中更全面的結果列表。網路與訓練配置與CELEBA相同,當然,漸進增長的部份則是限定在32 x 32。唯一客製化的部份在於WGAN-GP的正規化項$\mathbb{E}_{\hat{\mathbf{x}}\sim\mathbb{P}_{\hat{\mathbf{x}}}}[(\Vert \nabla_{\hat{\mathbf{x}}}D(\hat{\mathbf{x}})\Vert_2 - \gamma)^2/\gamma^2]$。Gulrajani et al. (2017)使用$\gamma=1.0$,這對應於1-Lipschitz,不過我們注意到,事實上偏好較快的過渡($\gamma=750$)來減少鬼影會明顯更好。我們並沒有在其它的資料集上試過這個技巧。
:::
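客製化後的WGAN-GP正規化項只是把梯度範數的目標值從1換成$\gamma$,並以$\gamma^2$做尺度化。以下是一個簡化示意(假設對插值樣本$\hat{x}$的梯度範數已事先算好):

```python
import numpy as np

def gradient_penalty(grad_norms, gamma=750.0):
    """E[(||grad D(x_hat)||_2 - gamma)^2 / gamma^2]。
    gamma=1.0對應1-Lipschitz(Gulrajani et al., 2017);
    CIFAR10實驗中作者改用gamma=750以偏好較快的過渡。"""
    grad_norms = np.asarray(grad_norms, dtype=np.float64)
    return float(np.mean((grad_norms - gamma) ** 2 / gamma ** 2))
```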
## 7 DISCUSSION
:::info
While the quality of our results is generally high compared to earlier work on GANs, and the training is stable in large resolutions, there is a long way to true photorealism. Semantic sensibility and understanding dataset-dependent constraints, such as certain objects being straight rather than curved, leaves a lot to be desired. There is also room for improvement in the micro-structure of the images. That said, we feel that convincing realism may now be within reach, especially in CELEBA-HQ.
:::
:::success
儘管跟GANs的早期研究相比,我們的結果品質算是好的,而且在高解析度情況下訓練也是穩定的,不過那種影像的真實感還有很長一段路要走。語意敏感性與對資料集依賴性約束的瞭解(像是某些物件是直的而不是彎的),還有待加強。還有影像的[微結構](https://terms.naer.edu.tw/detail/e081180a0f79a0b87f49de0608c27a54/)也有改進的空間。話雖如此,我們認為令人信服的逼真程度現在可能是唾手可得的,尤其是在CELEBA-HQ。
:::
## 8 ACKNOWLEDGEMENTS
:::info
We would like to thank Mikael Honkavaara, Tero Kuosmanen, and Timi Hietanen for the compute infrastructure. Dmitry Korobchenko and Richard Calderwood for efforts related to the CELEBA-HQ dataset. Oskar Elek, Jacob Munkberg, and Jon Hasselgren for useful comments.
:::
:::success
我們要感謝Mikael Honkavaara、Tero Kuosmanen與Timi Hietanen提供的計算基礎設施,感謝Dmitry Korobchenko與Richard Calderwood在CELEBA-HQ資料集上的付出,也感謝Oskar Elek、Jacob Munkberg與Jon Hasselgren提供的實用意見。
:::
## APPENDIX
### A NETWORK STRUCTURE AND TRAINING CONFIGURATION
#### A.1 1024 × 1024 NETWORKS USED FOR CELEBA-HQ
:::info
Table 2 shows network architectures of the full-resolution generator and discriminator that we use with the CELEBA-HQ dataset. Both networks consist mainly of replicated 3-layer blocks that we introduce one by one during the course of the training. The last Conv 1 × 1 layer of the generator corresponds to the toRGB block in Figure 2, and the first Conv 1 × 1 layer of the discriminator similarly corresponds to fromRGB. We start with 4 × 4 resolution and train the networks until we have shown the discriminator 800k real images in total. We then alternate between two phases: fade in the first 3-layer block during the next 800k images, stabilize the networks for 800k images, fade in the next 3-layer block during 800k images, etc.
:::
:::success
Table 2說明了我們在CELEBA-HQ資料集中所使用的全解析度generator與discriminator的網路架構。兩個網路主要是由重複的3-layer blocks所組成,我們在訓練過程中逐一引入這些blocks。generator的最後一個Conv 1 × 1 layer對應Figure 2中的toRGB block,而discriminator的第一個Conv 1 × 1 layer類似地對應於fromRGB。我們從4 × 4解析度開始訓練網路,直到我們總共給discriminator看過800k張實際影像。接著我們在兩個階段之間交替:在接下來的800k張影像期間淡入(fade in)第一個3-layer block,再用800k張影像穩定網路,然後在800k張影像期間淡入下一個3-layer block,依此類推。
:::
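上述「淡入、再穩定」的交替排程可以寫成一個簡單的查詢函數。以下是一個假設性的示意(以看過的實際影像張數kimg為輸入,單位為千張;函數與欄位名稱皆為假設):

```python
def training_phase(kimg, images_per_phase=800):
    """回傳目前看過的實際影像數(單位:千張)對應的訓練階段。
    4x4先訓練800k張,之後以「淡入800k張、穩定800k張」交替。"""
    if kimg < images_per_phase:
        return {'resolution': 4, 'phase': 'stabilize', 'alpha': 1.0}
    step = (kimg - images_per_phase) // images_per_phase   # 之後每800k張為一步
    level = step // 2 + 1                                   # 第幾次解析度加倍
    resolution = 4 * (2 ** level)
    if step % 2 == 0:   # 淡入階段:alpha由0線性升到1
        alpha = ((kimg - images_per_phase) % images_per_phase) / images_per_phase
        return {'resolution': resolution, 'phase': 'fade_in', 'alpha': alpha}
    return {'resolution': resolution, 'phase': 'stabilize', 'alpha': 1.0}
```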
:::info

Table 2: Generator and discriminator that we use with CELEBA-HQ to generate 1024×1024 images.
:::
:::info
Our latent vectors correspond to random points on a 512-dimensional hypersphere, and we represent training and generated images in [-1,1]. We use leaky ReLU with leakiness 0.2 in all layers of both networks, except for the last layer that uses linear activation. We do not employ batch normalization, layer normalization, or weight normalization in either network, but we perform pixelwise normalization of the feature vectors after each Conv 3×3 layer in the generator as described in Section 4.2. We initialize all bias parameters to zero and all weights according to the normal distribution with unit variance. However, we scale the weights with a layer-specific constant at runtime as described in Section 4.1. We inject the across-minibatch standard deviation as an additional feature map at 4 × 4 resolution toward the end of the discriminator as described in Section 3. The upsampling and downsampling operations in Table 2 correspond to 2 × 2 element replication and average pooling, respectively.
:::
:::success
我們的latent vectors對應於512維超球面上的隨機點,訓練影像與生成影像則以[-1, 1]的範圍表示。兩個網路的所有層都使用leaky ReLU(leakiness 0.2),除了最後一層使用linear activation之外。我們在兩個網路中都沒有使用batch normalization、layer normalization或weight normalization,不過如Section 4.2所述,我們在generator中的每個Conv 3x3 layer之後都對特徵向量做pixelwise normalization。我們把所有的bias參數初始化為0,權重則根據具有[單位變異數](https://terms.naer.edu.tw/detail/baa4a55a1b6171ed7a28c18f03de7118/)的正態分佈來初始化。不過,如Section 4.1所述,我們在執行時會以layer-specific constant來縮放權重。如Section 3所述,我們在discriminator接近末端、4 x 4解析度處,把across-minibatch standard deviation做為額外的feature map注入。Table 2中的上採樣與下採樣操作分別對應2x2的元素複製與平均池化(average pooling)。
:::
:::info
We train the networks using Adam (Kingma & Ba, 2015) with $\alpha=0.001, \beta_1=0, \beta_2=0.99, \epsilon = 10^{-8}$. We do not use any learning rate decay or ramp down, but for visualizing generator output at any given point during the training, we use an exponential running average for the weights of the generator with decay 0.999. We use a minibatch size 16 for resolutions $4^2 - 128^2$ and then gradually decrease the size according to $256^2 \to 14, 512^2 \to 6, 1024^2 \to 3$ to avoid exceeding the available memory budget. We use the WGAN-GP loss, but unlike Gulrajani et al. (2017), we alternate between optimizing the generator and discriminator on a per-minibatch basis, i.e., we set $n_{critic}=1$. Additionally, we introduce a fourth term into the discriminator loss with an extremely small weight to keep the discriminator output from drifting too far away from zero. To be precise, we set $L'=L+\epsilon_{\text{drift}}\mathbb{E}_{x\in\mathbb{P}_{\gamma}}[D(x)^2]$, where $\epsilon_{\text{drift}}=0.001$.
:::
:::success
我們用Adam (Kingma & Ba, 2015)訓練網路($\alpha=0.001, \beta_1=0, \beta_2=0.99, \epsilon = 10^{-8}$)。我們沒有使用任何的學習率衰減(decay)或斜坡下降(ramp down),不過為了能在訓練期間的任意時間點可視化generator的輸出,我們對generator的權重使用衰減率0.999的指數移動平均(exponential running average)。解析度$4^2 - 128^2$之間我們採用16的minibatch size,然後隨著解析度的提高逐漸減小($256^2 \to 14, 512^2 \to 6, 1024^2 \to 3$),以避免超出可用的記憶體預算。我們使用WGAN-GP loss,但與Gulrajani et al. (2017)不同,我們在per-minibatch的基礎上交替最佳化generator與discriminator,也就是設置$n_{critic}=1$。此外,我們在discriminator loss中引入了權重極小的第四個項,讓discriminator的輸出不會偏離零太遠。精確地說,我們設置$L'=L+\epsilon_{\text{drift}}\mathbb{E}_{x\in\mathbb{P}_{\gamma}}[D(x)^2]$,其中$\epsilon_{\text{drift}}=0.001$。
:::
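附錄中提到的兩個訓練細節,drift正規化項與generator權重的指數移動平均,都可以用幾行程式碼示意如下(numpy示意,非官方實作;`wgan_gp_loss`與「權重以list of arrays表示」皆為假設):

```python
import numpy as np

def d_loss_with_drift(wgan_gp_loss, d_real_scores, eps_drift=0.001):
    """L' = L + eps_drift * E[D(x)^2],避免discriminator輸出漂離0太遠。"""
    return wgan_gp_loss + eps_drift * float(np.mean(np.square(d_real_scores)))

def update_g_ema(ema_weights, current_weights, decay=0.999):
    """可視化用的generator權重指數移動平均。"""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, current_weights)]
```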
#### A.2 OTHER NETWORKS
:::info
Whenever we need to operate on a spatial resolution lower than 1024 × 1024, we do that by leaving out an appropriate number of copies of the replicated 3-layer block in both networks.
:::
:::success
當我們需要在低於1024 x 1024的空間解析度上操作時,我們會在兩個網路中省略(leave out)適當數量的重複3-layer block。
:::
:::info
Furthermore, Section 6.1 uses a slightly lower-capacity version, where we halve the number of feature maps in Conv 3 × 3 layers at the 16 × 16 resolution, and divide by 4 in the subsequent resolutions. This leaves 32 feature maps to the last Conv 3 × 3 layers. In Table 1 and Figure 4 we train each resolution for a total 600k images instead of 800k, and also fade in new layers for the duration of 600k images.
:::
:::success
此外,Section 6.1使用一個容量稍低的版本,我們在16 x 16解析度處把Conv 3 × 3 layers的feature maps數量減半,並在後續的解析度再除以4。這使得最後的Conv 3 × 3 layers剩下32個feature maps。在Table 1與Figure 4中,每個解析度總共用600k張影像(而非800k)來訓練,並同樣在600k張影像的期間內淡入新的層。
:::
:::info
For the “Gulrajani et al. (2017)” case in Table 1, we follow their training configuration as closely as possible. In particular, we set $\alpha=0.001, \beta_2=0.9,n_{critic}=5, \epsilon_{\text{drift}}=0$, and minibatch size 64. We disable progressive resolution, minibatch stddev, as well as weight scaling at runtime, and initialize all weights using He’s initializer (He et al., 2015). Furthermore, we modify the generator by replacing LReLU with ReLU, linear activation with tanh in the last layer, and pixelwise normalization with batch normalization. In the discriminator, we add layer normalization to all Conv 3 × 3 and Conv 4 × 4 layers. For the latent vectors, we use 128 components sampled independently from the normal distribution.
:::
:::success
對於Table 1中Gulrajani et al. (2017)的案例,我們盡可能地貼近他們的訓練配置。具體來說,我們設置$\alpha=0.001, \beta_2=0.9,n_{critic}=5, \epsilon_{\text{drift}}=0$,且minibatch size為64。我們停用漸進式解析度、minibatch stddev、以及執行時的權重縮放,並使用He's initializer (He et al., 2015)來初始化所有權重。此外,我們調整generator:用ReLU取代LReLU、最後一層用tanh取代linear activation、並用batch normalization取代pixelwise normalization。discriminator的話,我們在所有Conv 3 × 3與Conv 4 × 4 layers中加入layer normalization。latent vectors的部份,我們使用128個從正態分佈中獨立採樣的components。
:::
### B LEAST-SQUARES GAN (LSGAN) AT 1024 × 1024
:::info
We find that LSGAN is generally a less stable loss function than WGAN-GP, and it also has a tendency to lose some of the variation towards the end of long runs. Thus we prefer WGAN-GP, but have also produced high-resolution images by building on top of LSGAN. For example, the 1024^2^ images in Figure 1 are LSGAN-based.
:::
:::success
我們發現LSGAN整體上是一個比WGAN-GP更不穩定的損失函數,而且在長時間訓練接近尾聲時,它有失去部份變化性的傾向。因此我們偏好WGAN-GP,不過也有在LSGAN的基礎上生成高解析度影像。舉例來說,Figure 1中的1024^2^影像就是基於LSGAN。
:::
:::info
On top of the techniques described in Sections 2–4, we need one additional hack with LSGAN that prevents the training from spiraling out of control when the dataset is too easy for the discriminator, and the discriminator gradients are at risk of becoming meaningless as a result. We adaptively increase the magnitude of multiplicative Gaussian noise in discriminator as a function of the discriminator’s output. The noise is applied to the input of each Conv 3 × 3 and Conv 4 × 4 layer. There is a long history of adding noise to the discriminator, and it is generally detrimental for the image quality (Arjovsky et al., 2017) and ideally one would never have to do that, which according to our tests is the case for WGAN-GP (Gulrajani et al., 2017). The magnitude of noise is determined as $0.2 \cdot \max(0, \hat{d}_t - 0.5)^2$, where $\hat{d}_t = 0.1d + 0.9\hat{d}_{t-1}$ is an exponential moving average of the discriminator output $d$. The motivation behind this hack is that LSGAN is seriously unstable when $d$ approaches (or exceeds) $1.0$.
:::
:::success
除了Section 2-4所描述的技術之外,使用LSGAN時我們還需要一個額外的處理(hack),用來避免當資料集對discriminator來說太過容易、使得discriminator的梯度(gradient)有變得沒有意義的風險時,訓練因此失控。我們依discriminator的輸出,自適應地調整discriminator中[乘性](https://terms.naer.edu.tw/detail/4064e695485cd9284c2c5a07c9aadd5f/)高斯噪點的[強度](https://terms.naer.edu.tw/detail/018e6be74d747f512d8d57bf47b74b43/)(magnitude)。噪點會加在每個Conv 3 × 3與Conv 4 × 4 layer的輸入上。在discriminator中加入噪點的做法由來已久,不過這通常不利於影像品質(Arjovsky et al., 2017),理想情況下根本不需要這麼做;根據我們的測試,WGAN-GP (Gulrajani et al., 2017)就是這種情況。噪點的強度定義為$0.2 \cdot \max(0, \hat{d}_t - 0.5)^2$,其中$\hat{d}_t = 0.1d + 0.9\hat{d}_{t-1}$是discriminator輸出$d$的指數移動平均。這個處理背後的動機是,當$d$接近(或超過)$1.0$時,LSGAN會非常不穩定。
:::
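下面用numpy把這個處理寫成一個小示意(假設性的程式):先以指數移動平均$\hat{d}_t = 0.1d + 0.9\hat{d}_{t-1}$追蹤discriminator的輸出,再算出噪點強度$0.2 \cdot \max(0, \hat{d}_t - 0.5)^2$,最後把乘性高斯噪點套到conv層的輸入上;其中「逐元素乘上$1 + \text{strength} \cdot N(0,1)$」這個噪點形式是假設,類別與變數名稱也都是示意用。

```python
import numpy as np

class AdaptiveNoise:
    """依 discriminator 輸出自適應調整乘性高斯噪點強度的示意。"""

    def __init__(self):
        self.d_ema = 0.0                      # \hat{d}_t 的初始值(假設從 0 開始)

    def update(self, d_out):
        """以指數移動平均追蹤 discriminator 輸出,回傳目前的噪點強度。"""
        self.d_ema = 0.1 * float(d_out) + 0.9 * self.d_ema
        return 0.2 * max(0.0, self.d_ema - 0.5) ** 2

    def apply(self, x, strength, rng=None):
        """把乘性高斯噪點套到 conv 層的輸入 x(逐元素乘上 1 + strength * N(0,1),此形式為假設)。"""
        rng = rng or np.random.default_rng()
        return x * (1.0 + strength * rng.standard_normal(x.shape))

# 用法示意:若 discriminator 輸出持續接近 1,移動平均會上升,噪點強度隨之增加
noise = AdaptiveNoise()
for _ in range(50):
    strength = noise.update(d_out=0.95)
conv_input = np.ones((1, 16, 16, 64), dtype=np.float32)
noisy_input = noise.apply(conv_input, strength)
```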
### C CELEBA-HQ DATASET
:::info
In this section we describe the process we used to create the high-quality version of the CELEBA dataset, consisting of 30000 images in 1024 × 1024 resolution. As a starting point, we took the collection of in-the-wild images included as a part of the original CELEBA dataset. These images are extremely varied in terms of resolution and visual quality, ranging all the way from 43 × 55 to 6732 × 8984. Some of them show crowds of several people whereas others focus on the face of a single person – often only a part of the face. Thus, we found it necessary to apply several image processing steps to ensure consistent quality and to center the images on the facial region.
:::
:::success
這個章節中,我們說明用來建立高品質版本CELEBA資料集的過程,這個資料集包含30000張1024 × 1024解析度的影像。做為起點,我們取用原始CELEBA資料集中所附帶的in-the-wild(野生)影像集合。這些影像在解析度與視覺品質上的差異非常大,從43 × 55一路到6732 × 8984都有。有些呈現的是好幾個人的人群,有些則聚焦在單一個人的臉上,而且通常只有臉的一部份。因此,我們認為有必要做幾個影像處理步驟,以確保一致的品質,並讓影像以臉部區域為中心。
:::
:::info
Our processing pipeline is illustrated in Figure 8. To improve the overall image quality, we preprocess each JPEG image using two pre-trained neural networks: a convolutional autoencoder trained to remove JPEG artifacts in natural images, similar in structure to the proposed by Mao et al. (2016a), and an adversarially-trained 4x super-resolution network (Korobchenko & Foco, 2017) similar to Ledig et al. (2016). To handle cases where the facial region extends outside the image, we employ padding and filtering to extend the dimensions of the image as illustrated in Fig.8(c–d). We then select an oriented crop rectangle based on the facial landmark annotations included in the original CELEBA dataset as follows:
$$
\begin{align}
x'&=e_1 - e_0 \\
y' &= \dfrac{1}{2}(e_0+e_1) - \dfrac{1}{2}(m_0+m_1) \\
c &= \dfrac{1}{2}(e_0+e_1) - 0.1 \cdot y' \\
s &= \max(4.0 \cdot \vert x' \vert, 3.6 \cdot \vert y' \vert) \\
x &= \text{Normalize}(x'-\text{Rotate90}(y')) \\
y &= \text{Rotate90}(x)
\end{align}
$$
$e_0$, $e_1$, $m_0$, and $m_1$ represent the 2D pixel locations of the two eye landmarks and two mouth landmarks, respectively, $c$ and $s$ indicate the center and size of the desired crop rectangle, and $x$ and $y$ indicate its orientation. We constructed the above formulas empirically to ensure that the crop rectangle stays consistent in cases where the face is viewed from different angles. Once we have calculated the crop rectangle, we transform the rectangle to 4096 × 4096 pixels using bilinear filtering, and then scale it to 1024 × 1024 resolution using a box filter.
:::
:::success
我們的processing pipeline就如同Figure 8所示。為了能夠提升整體影像的品質,我們使用兩個預訓練(pre-train)的神經網路來預處理(pre-process)每一張JPEG影像:一個是convolutional autoencoder,主要是訓練用來移除在自然影像中的JPEG失真(假影,artifacts),其網路架構與Mao et al. (2016a)類似,另一個則是adversarially-trained 4x super-resolution network (Korobchenko & Foco, 2017),這類似於Ledig et al. (2016)所提出的架構。為了處理臉部區域延伸到影像之外的情況,我們使用padding與filtering來擴展影像的維度,如Fig.8(c-d)所示。然後,我們根據原始CELEBA資料集中所包含的臉部標記註釋選擇了一個[定向](https://terms.naer.edu.tw/detail/00a7e5a6b0f4ef8c27249749395be2d5/)裁剪矩形,如下所示:
$$
\begin{align}
x'&=e_1 - e_0 \\
y' &= \dfrac{1}{2}(e_0+e_1) - \dfrac{1}{2}(m_0+m_1) \\
c &= \dfrac{1}{2}(e_0+e_1) - 0.1 \cdot y' \\
s &= \max(4.0 \cdot \vert x' \vert, 3.6 \cdot \vert y' \vert) \\
x &= \text{Normalize}(x'-\text{Rotate90}(y')) \\
y &= \text{Rotate90}(x)
\end{align}
$$
$e_0$、$e_1$、$m_0$與$m_1$分別表示兩個眼睛標記與兩個嘴部標記的2D像素位置,$c$與$s$表示想要的裁剪矩形的中心與大小,而$x$與$y$則表示其方向。我們是根據經驗建構上面的公式,以確保在臉部從不同角度拍攝的情況下,裁剪矩形仍能保持一致。一旦計算好裁剪矩形,我們會先用bilinear filtering將該矩形轉換為4096 × 4096像素,再用box filter將之縮放至1024 × 1024解析度。
:::
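下面用numpy把上面的公式照著算一遍(示意用):`rotate90`在這裡假設為逆時針旋轉90度,實際方向要視座標系與官方實作而定;範例中的地標座標是假造的。

```python
import numpy as np

def rotate90(v):
    """把 2D 向量旋轉 90 度(此處假設為 (x, y) -> (-y, x))。"""
    return np.array([-v[1], v[0]])

def normalize(v):
    return v / np.linalg.norm(v)

def crop_rectangle(e0, e1, m0, m1):
    """依內文公式計算裁剪矩形的中心 c、大小 s 與方向向量 x、y。"""
    e0, e1, m0, m1 = map(np.asarray, (e0, e1, m0, m1))
    xp = e1 - e0                                   # x' = e1 - e0
    yp = 0.5 * (e0 + e1) - 0.5 * (m0 + m1)         # y' = (e0+e1)/2 - (m0+m1)/2
    c = 0.5 * (e0 + e1) - 0.1 * yp                 # 中心
    s = max(4.0 * np.linalg.norm(xp),              # 大小
            3.6 * np.linalg.norm(yp))
    x = normalize(xp - rotate90(yp))               # 方向
    y = rotate90(x)
    return c, s, x, y

# 假造的地標座標(單位:像素),僅供示範
c, s, x, y = crop_rectangle(e0=(310, 400), e1=(410, 395), m0=(330, 520), m1=(400, 515))
print(c, s, x, y)
```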
:::info
We perform the above processing for all 202599 images in the dataset, analyze the resulting 1024 × 1024 images further to estimate the final image quality, sort the images accordingly, and discard all but the best 30000 images. We use a frequency-based quality metric that favors images whose power spectrum contains a broad range of frequencies and is approximately radially symmetric. This penalizes blurry images as well as images that have conspicuous directional features due to, e.g., visible halftoning patterns. We selected the cutoff point of 30000 images as a practical sweet spot between variation and image quality, because it appeared to yield the best results.
:::
:::success
我們對資料集內全部202599張影像執行上述處理,進一步分析所得到的1024 × 1024影像來估測最終影像品質,依此對影像排序,只保留最佳的30000張。我們使用一個基於頻率(frequency-based)的品質指標,偏好那些[功率譜](https://terms.naer.edu.tw/detail/84181966aad0d7b08c70c9204c0b6b3d/)涵蓋較寬頻率範圍且近似徑向對稱的影像。這會懲罰模糊的影像,以及那些因為例如可見的[半色調](https://terms.naer.edu.tw/detail/e90e5848758f014a71daf81a0c2957dc)模式而有明顯方向性特徵的影像。我們選擇30000張影像這個[截止點](https://terms.naer.edu.tw/detail/b1df5235b32f496d5a3851c3f03df03f/),做為變化性與影像品質之間的實際甜蜜點,因為這樣似乎能產生最好的結果。
:::
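內文沒有給出這個品質指標的確切定義,下面只是一個假設性的numpy示意:計算灰階影像功率譜的徑向平均,藉此粗略評估頻率範圍是否夠寬(徑向對稱性的檢查在此省略);實際使用的指標可能完全不同。

```python
import numpy as np

def radial_power_profile(gray_img, n_bins=64):
    """計算灰階影像功率譜的徑向平均(示意用)。"""
    f = np.fft.fftshift(np.fft.fft2(gray_img))
    power = np.abs(f) ** 2
    h, w = power.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2.0, xx - w / 2.0)
    bins = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(bins.ravel(), weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    return sums / np.maximum(counts, 1)    # 每個半徑區間的平均功率

# 用法示意:模糊影像的高頻(大半徑)平均功率會偏低,可據此對影像排序
img = np.random.rand(256, 256)             # 以隨機雜訊代替真實的灰階影像
profile = radial_power_profile(img)
high_freq_energy = profile[len(profile) // 2:].sum()
```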
:::info

Figure 8: Creating the CELEBA-HQ dataset. We start with a JPEG image (a) from the CelebA in-the-wild dataset. We improve the visual quality (b,top) through JPEG artifact removal (b,middle) and 4x super-resolution (b,bottom). We then extend the image through mirror padding (c) and Gaussian filtering (d) to produce a visually pleasing depth-of-field effect. Finally, we use the facial landmark locations to select an appropriate crop region (e) and perform high-quality resampling to obtain the final image at 1024 × 1024 resolution (f).
:::
:::success
Figure 8:建立CELEBA-HQ資料集。我們從CelebA in-the-wild資料集的一張JPEG影像(a)開始。透過移除JPEG失真(artifact)(b,中)與4倍super-resolution(b,下)來提升視覺品質(b,上)。接著我們透過mirror padding(c)與高斯濾波(d)擴展影像,產生視覺上令人愉悅的景深(depth-of-field)效果。最後,我們利用臉部標記的位置選擇合適的裁剪區域(e),並執行高品質的重新採樣(resampling),得到1024 × 1024解析度的最終影像(f)。
:::
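針對Figure 8(c–d)提到的mirror padding與高斯濾波,下面用numpy與scipy寫一個非常粗略的示意(假設性的做法:先以反射方式填充邊界,再只對填充出來的區域做高斯模糊;論文實際的處理流程可能更精細)。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extend_image(img, pad, sigma=16):
    """以 mirror padding 擴展影像,並把填充出來的區域模糊化(示意用)。
    img:H x W x C 的浮點影像;pad:四邊各填充的像素數。"""
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    blurred = gaussian_filter(padded, sigma=(sigma, sigma, 0))
    # 只有原圖以外的填充區域使用模糊後的結果,原圖區域保持不變
    mask = np.zeros(padded.shape[:2], dtype=bool)
    mask[pad:-pad, pad:-pad] = True
    return np.where(mask[..., None], padded, blurred)

# 用法示意
img = np.random.rand(256, 256, 3)
extended = extend_image(img, pad=64)   # 輸出為 384 x 384 x 3
```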
### D CIFAR10 RESULTS
:::info
Figure 9 shows non-curated images generated in the unsupervised setting, and Table 3 compares against prior art in terms of inception scores. We report our scores in two different ways: 1) the highest score observed during training runs (here ± refers to the standard deviation returned by the inception score calculator) and 2) the mean and standard deviation computed from the highest scores seen during training, starting from ten random initializations. Arguably the latter methodology is much more meaningful as one can be lucky with individual runs (as we were). We did not use any kind of augmentation with this dataset.
:::
:::success
Figure 9呈現的是在非監督設定下生成、未經挑選的影像,Table 3則是就inception score與先前的技術做比較。我們用兩種不同的方式報告分數:1)訓練過程中觀察到的最高分數(這裡的±指的是inception score calculator所回傳的標準差);2)從十個隨機初始化開始,各次訓練中所見最高分數的均值與標準差。可以說後者的方法更有意義,因為單次執行可能只是運氣好(我們就是如此)。我們沒有對這個資料集做任何形式的資料增強。
:::
:::info

Figure 9: CIFAR10 images generated using a network that was trained unsupervised (no label conditioning), and achieves a record 8.80 inception score.
:::
:::info

Table 3: CIFAR10 inception scores, higher is better.
:::
### E MNIST-1K DISCRETE MODE TEST WITH CRIPPLED DISCRIMINATOR
:::info
Metz et al. (2016) describe a setup where a generator synthesizes MNIST digits simultaneously to 3 color channels, the digits are classified using a pre-trained classifier (0.4% error rate in our case), and concatenated to form a number in [0, 999]. They generate a total of 25,600 images and count how many of the discrete modes are covered. They also compute KL divergence as KL(histogram || uniform). Modern GAN implementations can trivially cover all modes at very low divergence (0.05 in our case), and thus Metz et al. specify a fairly low-capacity generator and two severely crippled discriminators (“K/2” has ∼ 2000 params and “K/4” only about ∼ 500) to tease out differences between training methodologies. Both of these networks use batch normalization.
:::
:::success
Metz et al. (2016)描述了一種設置:generator同時把MNIST數字合成到3個顏色通道,這些數字用一個預訓練的分類器辨識(在我們的情況下其錯誤率為0.4%),再串接成[0, 999]之間的一個數字。他們總共生成25,600張影像,並計算其中覆蓋了多少個離散模式(discrete modes)。他們也把KL divergence計算為KL(histogram || uniform)。現代的GAN實作可以輕易地以非常低的散度(在我們的情況中為0.05)覆蓋所有模式,因此Metz et al.指定了一個容量相當低的generator與兩個被嚴重削弱(severely crippled)的discriminator("K/2"約有2000個參數,"K/4"只有大約500個參數),以凸顯不同訓練方法之間的差異。這兩個網路都有使用batch normalization。
:::
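下面是一個假設性的numpy示意,說明「覆蓋模式數」與KL(histogram || uniform)可以怎麼計算:假設已經把25,600張生成影像的三個通道各自辨識為0–9,並串成[0, 999]的整數。

```python
import numpy as np

def mode_coverage_and_kl(numbers, n_modes=1000):
    """numbers:長度 25,600 的整數陣列,值域 [0, 999]。
    回傳 (覆蓋的模式數, KL(histogram || uniform))。"""
    numbers = np.asarray(numbers)
    hist = np.bincount(numbers, minlength=n_modes).astype(np.float64)
    covered = int((hist > 0).sum())
    p = hist / hist.sum()                 # 經驗分佈
    q = 1.0 / n_modes                     # 均勻分佈
    nonzero = p > 0                       # 0 * log(0) 視為 0
    kl = float(np.sum(p[nonzero] * np.log(p[nonzero] / q)))
    return covered, kl

# 用法示意:用隨機數字代替真正由 GAN 生成、再經分類器辨識的結果
fake_numbers = np.random.randint(0, 1000, size=25_600)
covered, kl = mode_coverage_and_kl(fake_numbers)
print(covered, kl)
```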
:::info
As shown in Table 4, using WGAN-GP loss with the networks specified by Metz et al. covers much more modes than the original GAN loss, and even more than the unrolled original GAN with the smaller (K/4) discriminator. The KL divergence, which is arguably a more accurate metric than the raw count, acts even more favorably.
:::
:::success
如Table 4所示,在Metz et al.所指定的網路上使用WGAN-GP loss,其覆蓋的模式(mode)數遠多於原始GAN loss,甚至在較小的discriminator(K/4)下也多於unrolled original GAN。KL divergence可以說是比原始計數(raw count)更準確的指標,其結果對我們的方法甚至更為有利。
:::
:::info

Table 4: Results for MNIST discrete mode test using two tiny discriminators (K/4, K/2) defined by Metz et al. (2016). The number of covered modes (#) and KL divergence from a uniform distribution are given as an average ± standard deviation over 8 random initializations. Higher is better for the number of modes, and lower is better for KL divergence.
:::
:::info
Replacing batch normalization with our normalization (equalized learning rate, pixelwise normalization) improves the result considerably, while also removing a few trainable parameters from the discriminators. The addition of a minibatch stddev layer further improves the scores, while restoring the discriminator capacity to within 0.5% of the original. Progression does not help much with these tiny images, but it does not hurt either.
:::
:::success
改用我們的normalization(equalized learning rate、pixelwise normalization)取代batch normalization可以相當程度地提升結果,同時也從discriminator中移除一些可訓練參數。再加入一個minibatch stddev layer可以進一步提高分數,同時把discriminator的容量恢復到與原始相差0.5%以內。漸進式訓練對這些小影像沒有太多幫助,不過也沒有壞處。
:::
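針對上文提到的pixelwise normalization與minibatch stddev layer,下面用numpy寫兩個簡化示意(依論文前面章節的描述重寫,非官方程式碼;輸入張量假設為NHWC格式,且minibatch stddev不分組、整個batch一起計算)。

```python
import numpy as np

def pixelwise_norm(x, eps=1e-8):
    """Pixelwise feature vector normalization:
    每個像素位置的 feature 向量除以其均方根(NHWC,對 channel 軸取平均)。"""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def minibatch_stddev_layer(x, eps=1e-8):
    """Minibatch stddev layer 的簡化版:
    先對 batch 軸計算每個空間位置、每個 channel 的標準差,
    再對所有位置與 channel 取平均得到一個純量,
    最後把該值複製成一張額外的 feature map 串接到輸入後面。"""
    std = np.sqrt(np.var(x, axis=0) + eps)         # 形狀 [H, W, C]
    mean_std = std.mean()                          # 純量
    n, h, w, _ = x.shape
    extra = np.full((n, h, w, 1), mean_std, dtype=x.dtype)
    return np.concatenate([x, extra], axis=-1)

# 用法示意
feat = np.random.randn(8, 4, 4, 16).astype(np.float32)
out = minibatch_stddev_layer(pixelwise_norm(feat))   # 輸出 channel 數為 17
```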
### F ADDITIONAL CELEBA-HQ RESULTS
:::info
Figure 10 shows the nearest neighbors found for our generated images. Figure 11 gives additional generated examples from CELEBA-HQ. We enabled mirror augmentation for all tests using CELEBA and CELEBA-HQ. In addition to the sliced Wasserstein distance (SWD), we also quote the recently introduced Frechet Inception Distance (FID) (Heusel et al., 2017) computed from 50K images.
:::
:::success
Figure 10顯示的是為我們所生成的影像找到的最近鄰(nearest neighbors)。Figure 11給出由CELEBA-HQ生成的其他範例。所有使用CELEBA與CELEBA-HQ的測試都啟用了mirror augmentation。除了sliced Wasserstein distance (SWD)之外,我們也引用最近提出的Frechet Inception Distance (FID) (Heusel et al., 2017),由50K張影像計算而得。
:::
:::info

Figure 10: Top: Our CELEBA-HQ results. Next five rows: Nearest neighbors found from the training data, based on feature-space distance. We used activations from five VGG layers, as suggested by Chen & Koltun (2017). Only the crop highlighted in bottom right image was used for comparison in order to exclude image background and focus the search on matching facial features.
:::
:::info

Figure 11: Additional 1024×1024 images generated using the CELEBA-HQ dataset. Sliced Wasserstein Distance (SWD) ×10^3^ for levels 1024, ..., 16: 7.48, 7.24, 6.08, 3.51, 3.55, 3.02, 7.22, for which the average is 5.44. Frechet Inception Distance (FID) computed from 50K images was 7.30. See the video for latent space interpolations.
:::
### G LSUN RESULTS
:::info
Figures 12–17 show representative images generated for all 30 LSUN categories. A separate network was trained for each category using identical parameters. All categories were trained using 100k images, except for BEDROOM and DOG that used all the available data. Since 100k images is a very limited amount of training data for most categories, we enabled mirror augmentation in these tests (but not for BEDROOM or DOG).
:::
:::success
Figures 12–17呈現為全部30個LSUN類別所生成的代表性影像。每個類別都用相同的參數訓練各自的網路。除了BEDROOM與DOG使用所有可取得的資料之外,所有類別都以100k張影像訓練。由於對多數類別來說,100k張影像是非常有限的訓練資料量,因此我們在這些測試中啟用了mirror augmentation(但BEDROOM與DOG沒有)。
:::
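這裡的mirror augmentation指的就是水平鏡像翻轉;下面是一個假設性的TensorFlow小示意,示範如何在輸入pipeline中以50%機率翻轉影像。

```python
import tensorflow as tf

def mirror_augment(image):
    """以 50% 機率水平翻轉影像(mirror augmentation 的常見做法,示意用)。"""
    return tf.image.random_flip_left_right(image)

# 用法示意:套用在 tf.data pipeline 上
# dataset = dataset.map(mirror_augment)
```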
### H ADDITIONAL IMAGES FOR TABLE 1
:::info
Figure 18 shows larger collections of images corresponding to the non-converged setups in Table 1. The training time was intentionally limited to make the differences between various methods more visible.
:::