# A Style-Based Generator Architecture for Generative Adversarial Networks(翻譯)

###### tags: `pggan` `gan` `論文翻譯` `deeplearning` `對抗式生成網路`

[TOC]

## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院

:::info
原文
:::

:::success
翻譯
:::

:::warning
任何的翻譯不通暢部份都請留言指導
:::

:::danger
* [paper hyperlink](https://arxiv.org/abs/1812.04948.pdf)
* [NVIDIA GitHub Code](https://github.com/NVlabs/stylegan)
* [Keras Sample Code](https://keras.io/examples/generative/stylegan/)
:::

## Abstract

:::info
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
:::

:::success
我們為生成對抗網路提出一種替代的生成器架構,這借鏡了style transfer的文獻內容。這個新的架構能夠自動學習、以非監督的方式分離高階屬性(如訓練人臉時的姿勢與身份)與生成影像中的隨機變化(如雀斑、頭髮),並且能夠直觀地在合成過程中做特定尺度的控制。新的生成器在傳統的分佈品質指標上超越了目前最好的技術,明顯有更好的插值特性,並且更好地解開(disentangle)變化的潛在因子。為了量化插值的品質與[解開糾結](https://tw.dictionary.search.yahoo.com/search?p=disentanglement)(disentanglement)的程度,我們提出兩個能夠用於任意生成器架構的新的自動化方法。最終,我們引入新的、高變化性以及高品質的人臉資料集。
:::

## 1. Introduction

:::info
The resolution and quality of images produced by generative methods — especially generative adversarial networks (GAN) [22] — have seen rapid improvement recently [30, 45, 5]. Yet the generators continue to operate as black boxes, and despite recent efforts [3], the understanding of various aspects of the image synthesis process, e.g., the origin of stochastic features, is still lacking. The properties of the latent space are also poorly understood, and the commonly demonstrated latent space interpolations [13, 52, 37] provide no quantitative way to compare different generators against each other.
:::

:::success
由生成方法(尤其是生成對抗網路(GAN))所生成的照片近來在解析度跟品質上都有很大的提升。然而,生成器(generator)仍然像是黑盒子一般的運作,儘管最近投入了研究,對於影像合成過程的各個面向(像是隨機特徵(stochastic features)的來源)仍然是知之甚少。潛在空間(latent space)的特性也是嚴重理解不足,常見用來說明的潛在空間插值也沒有辦法提供量化的方式來比較不同的生成器彼此之間的差異。
:::

:::info
Motivated by style transfer literature [27], we re-design the generator architecture in a way that exposes novel ways to control the image synthesis process. Our generator starts from a learned constant input and adjusts the “style” of the image at each convolution layer based on the latent code, therefore directly controlling the strength of image features at different scales. Combined with noise injected directly into the network, this architectural change leads to automatic, unsupervised separation of high-level attributes (e.g., pose, identity) from stochastic variation (e.g., freckles, hair) in the generated images, and enables intuitive scale-specific mixing and interpolation operations. We do not modify the discriminator or the loss function in any way, and our work is thus orthogonal to the ongoing discussion about GAN loss functions, regularization, and hyperparameters [24, 45, 5, 40, 44, 36].
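:::

:::warning
譯者補充(非論文內容):下面用 NumPy 寫一個極簡的示意,大致走一遍上段所描述的流程:由映射網路把 $\mathbf{z}$ 轉成 $\mathbf{w}$,再由仿射轉換 A 得到 style $\mathbf{y}=(\mathbf{y}_s,\mathbf{y}_b)$ 來做 AdaIN(Section 2 的 Eq. 1),並在卷積輸出上加入逐像素噪點。這只是概念示意,省略了卷積與上採樣;`mapping_network`、`adain`、`add_noise` 等名稱與維度(只放 2 層 MLP、16 張 8x8 的 feature map)都是譯者為了示範而假設的,並非官方實作。

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def mapping_network(z, layers):
    """示意用的 MLP 映射網路 f: Z -> W(論文為 8 層,這裡只放 2 層)。"""
    w = z
    for W_mat, b in layers:
        w = leaky_relu(w @ W_mat + b)
    return w

def adain(x, y_s, y_b, eps=1e-8):
    """Eq. (1):每張 feature map 各自正規化後,用 style 的 y_s、y_b 做縮放與偏移。
    x 形狀為 (C, H, W);y_s、y_b 形狀為 (C,)。"""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return y_s[:, None, None] * (x - mu) / (sigma + eps) + y_b[:, None, None]

def add_noise(x, noise_scale, rng):
    """單通道高斯噪點廣播到所有 feature map,再乘上每個 channel 學到的縮放因子(圖中的 B)。"""
    noise = rng.standard_normal(size=(1,) + x.shape[1:])
    return x + noise_scale[:, None, None] * noise

rng = np.random.default_rng(0)
z = rng.standard_normal(512)
layers = [(rng.standard_normal((512, 512)) * 0.01, np.zeros(512)) for _ in range(2)]
w = mapping_network(z, layers)                     # w in W,維度 512

x = rng.standard_normal((16, 8, 8))                # 假設某一層卷積後的 feature maps
A = rng.standard_normal((512, 32)) * 0.01          # 假設的仿射轉換 A
y = w @ A                                          # y 的維度是該層 feature map 數量的兩倍
y_s, y_b = 1.0 + y[:16], y[16:]
x = add_noise(x, noise_scale=np.zeros(16), rng=rng)
x = adain(x, y_s, y_b)
print(x.shape)                                     # (16, 8, 8)
```
:::

:::info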
::: :::success 受到style transfer文獻的影響,我們用一種新穎的方法來控制影像合成的過程,以此重新設置生成器架構。我們的生成器從學習constant input開始,然後根據latent code在每一個卷積層調整照片的"style(風格)",用這種方式直接的控制不同尺度下照片特徵的強度。結合直接注入(injected)噪點到網路的方式,這個架構的改變讓模型可以從生成的照片中高階屬性(如姿勢、身份)與隨機變化(如雀斑、頭髮)能夠自動、非監督地分離,並且能夠直觀的針對特定尺度混合與插值的操作。我們並沒有用任何的方式調整discriminator或是loss function,因此,我們的研究跟進行中的GAN loss functions、regularization與hyperparameters的討論是互不影響的。 ::: :::info Our generator embeds the input latent code into an intermediate latent space, which has a profound effect on how the factors of variation are represented in the network. The input latent space must follow the probability density of the training data, and we argue that this leads to some degree of unavoidable entanglement. Our intermediate latent space is free from that restriction and is therefore allowed to be disentangled. As previous methods for estimating the degree of latent space disentanglement are not directly applicable in our case, we propose two new automated metrics — perceptual path length and linear separability — for quantifying these aspects of the generator. Using these metrics, we show that compared to a traditional generator architecture, our generator admits a more linear, less entangled representation of different factors of variation. ::: :::success 我們的生成器把input latent code嵌入到一個中間的潛在空間,那,這個中間的潛在空間會深刻的影響網路中變動因子的表示。輸入的潛在空間必需依循著訓練資料的機率密度(probability density),這也導致了某種程度上無可避免的[糾纏](https://terms.naer.edu.tw/detail/f4c3b24a65a1aa0938371dc2f2fa03fc/)。我們的中間潛在空間就不受此限制了,因此是可以解耦的。因為先前估測潛在空間解耦程度的方法在我們的案例中並不直接適用,所以我們提出兩種新的自動化指標,perceptual path length(感知路徑長度?)與linear separability([線性可分度?](https://zh-yue.wikipedia.org/wiki/%E7%B7%9A%E6%80%A7%E5%8F%AF%E5%88%86%E5%BA%A6)),來量化生成器的各方方面面。使用這些指標,我們說明了,對比傳統生成器架構,我們的生成器允許更線性,更少糾纏的表示不同的變動因子。 ::: :::info Finally, we present a new dataset of human faces (Flickr-Faces-HQ, FFHQ) that offers much higher quality and covers considerably wider variation than existing high-resolution datasets (Appendix A). We have made this dataset publicly available, along with our source code and pre-trained networks.1 The accompanying video can be found under the same link. ::: :::success 最後,我們提供一個新的人臉資料庫(fLICKR-fACES-hq, FFHQ),對比現在的高解析度資料集來說更為高品質,涵蓋更廣泛的變化(Appendix A)。我們已經公開這個資料集了,還有原始碼跟預訓練的網路。相關影片可以在同一連結找到。 連結:https://github.com/NVlabs/stylegan ::: ## 2. Style-based generator :::info Traditionally the latent code is provided to the generator through an input layer, i.e., the first layer of a feedforward network (Figure 1a). We depart from this design by omitting the input layer altogether and starting from a learned constant instead (Figure 1b, right). Given a latent code $\mathbf{z}$ in the input latent space $\mathcal{Z}$, a non-linear mapping network $f: \mathcal{Z} \to \mathcal{W}$ first produces $\mathbf{w}\in \mathcal{W}$ (Figure 1b, left). For simplicity, we set the dimensionality of both spaces to 512, and the mapping $f$ is implemented using an 8-layer MLP, a decision we will analyze in Section 4.1. Learned affine transformations then specialize $\mathbf{w}$ to styles $\mathbf{y}=(\mathbf{y}_s,\mathbf{y}_b)$ that control adaptive instance normalization (AdaIN) [27, 17, 21, 16] operations after each convolution layer of the synthesis network $g$. The AdaIN operation is defined as $$ \text{AdaIN}(\mathbf{x}_i,\mathbf{y})=\mathbf{y}_{s,i}\dfrac{\mathbf{x}_i-\mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)}+\mathbf{y}_{b,i}\tag{1} $$ where each feature map $\mathbf{x}_i$ is normalized separately, and then scaled and biased using the corresponding scalar components from style $\mathbf{y}$. 
Thus the dimensionality of $\mathbf{y}$ is twice the number of feature maps on that layer. ::: :::success 傳統上,latent code是透過input layer提供給生成器的,也就是前饋網路(feedforward network)的第一層(Figure 1a)。我們拋棄這種設計,忽略掉input layer,改成一個學習到的常數(learned constant)開始(Figure 1b, right)。給定一個input latent space $\mathcal{Z}$中的latent code $\mathbf{z}$,一個非線性的映射網路,$f: \mathcal{Z} \to \mathcal{W}$,首先生成$\mathbf{w}\in \mathcal{W}$(Figure 1b, left)。為求簡單,我們把兩個空間維度都設置為512,映射函數$f$則是使用8層的MLP來實現,我們會在Section 4.1說明這個決定。然後,學習到的[仿射](https://terms.naer.edu.tw/detail/c6a237084a99931ea9a50ac6a44cb1e4/)轉換(affine transformations)會把$\mathbf{w}$轉換成style $\mathbf{y}=(\mathbf{y}_s,\mathbf{y}_b)$,在合成網路(synthesis network)$g$的每一層卷積之後,控制著adaptive instance normalization(AdaIN)的計算。AdaIN的計算定義為 $$ \text{AdaIN}(\mathbf{x}_i,\mathbf{y})=\mathbf{y}_{s,i}\dfrac{\mathbf{x}_i-\mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)}+\mathbf{y}_{b,i}\tag{1} $$ 其中每一個feature map $\mathbf{x}_i$是各自正規化的,然後使用style $\mathbf{y}$中對應的[純量因子](https://terms.naer.edu.tw/detail/ddabb2dfc396ac27cf3d06e29fb2b68c/)(scalar components)來做縮放(scaled)與偏移(biased)。因此,$\mathbf{y}$的維度會是該層的feature maps的數量的兩倍。 ::: :::info ![image](https://hackmd.io/_uploads/BJD4azdAp.png) Figure 1. While a traditional generator [30] feeds the latent code though the input layer only, we first map the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here “A” stands for a learned affine transform, and “B” applies learned per-channel scaling factors to the noise input. The mapping network f consists of 8 layers and the synthesis network g consists of 18 layers— two for each resolution (4 2 − 10242 ). The output of the last layer is converted to RGB using a separate 1 × 1 convolution, similar to Karras et al. [30]. Our generator has a total of 26.2M trainable parameters, compared to 23.1M in the traditional generator. Figure 1. 傳統的生成器就只是單純的把laten code餵到input layer,我們的話會先把輸入的laten code映射到一個中間的潛在空間$\mathcal{W}$,然後在這邊的每一個卷積層都透過adaptive instance normalization (AdaIN)控制生成器。會在評估非線性之前,每個卷積之後加入高斯噪點。這邊的"A"表示學習到的仿射轉換,然後"B"表示把學習到的per-channel scaling factors應用到輸入的噪點。映射網路$f$包含8個網路層,合成網路$g$包含18個層路層,每個解析度($4^2 - 1024^2$)都有兩層。最後一層的輸出用1x1的卷積轉換成RBG,如Karras et al.所述。我們的生成器的總共有26.2M的可訓練參數,傳統生成器的話則是23.1M。 ::: :::info Comparing our approach to style transfer, we compute the spatially invariant style y from vector w instead of an example image. We choose to reuse the word “style” for y because similar network architectures are already used for feedforward style transfer [27], unsupervised image-toimage translation [28], and domain mixtures [23]. Compared to more general feature transforms [38, 57], AdaIN is particularly well suited for our purposes due to its efficiency and compact representation. ::: :::success 對比style transfer,我們計算空間不變風格(spatially invariant style) $\mathbf{y}$是從向量$\mathbf{w}$,而不是從樣本照片。我們選擇用style來表示$\mathbf{y}$是因為類似的網路架構已經用在feedforward style transfer、unsupervised image-toimage translation、與domain mixtures。對比更為通用性的特徵轉換,AdaIN因為它的高效與[緊密](https://terms.naer.edu.tw/detail/59d4f1d9f3112b7ee918e96e38b37f0f/)的表示(compact representation)特別適合我們的研究。 ::: :::info Finally, we provide our generator with a direct means to generate stochastic detail by introducing explicit noise inputs. These are single-channel images consisting of uncorrelated Gaussian noise, and we feed a dedicated noise image to each layer of the synthesis network. 
The noise image is broadcasted to all feature maps using learned perfeature scaling factors and then added to the output of the corresponding convolution, as illustrated in Figure 1b. The implications of adding the noise inputs are discussed in Sections 3.2 and 3.3. ::: :::success 最後,我們提供一個直接的方式來使用我們的生成器生成隨機細節(透過引入顯性的噪點輸入)。這些噪點是包含不相關的高斯噪點的single-channel(單通道)的影像,我們把這些專用噪點影像餵到合成網路的每一層。噪點影像會使用學習到的per-feature scaling factors擴散到所有的feature map,然後再被加到對應的卷積層的輸出,如Figure 1b所述。加入噪點的意義會在Section 3.2、.3.3討論。 ::: ### 2.1. Quality of generated images :::info Before studying the properties of our generator, we demonstrate experimentally that the redesign does not compromise image quality but, in fact, improves it considerably. Table 1 gives Frechet inception distances (FID) [25] for various generator architectures in CELEBA-HQ [30] and our new FFHQ dataset (Appendix A). Results for other datasets are given in Appendix E. Our baseline configuration (A) is the Progressive GAN setup of Karras et al. [30], from which we inherit the networks and all hyperparameters except where stated otherwise. We first switch to an improved baseline (B) by using bilinear up/downsampling operations [64], longer training, and tuned hyperparameters. A detailed description of training setups and hyperparameters is included in Appendix C. We then improve this new baseline further by adding the mapping network and AdaIN operations (C), and make a surprising observation that the network no longer benefits from feeding the latent code into the first convolution layer. We therefore simplify the architecture by removing the traditional input layer and starting the image synthesis from a learned 4 × 4 × 512 constant tensor (D). We find it quite remarkable that the synthesis network is able to produce meaningful results even though it receives input only through the styles that control the AdaIN operations. ::: :::success 在研究生成器的屬性之前我們先用實驗來證明一件事,那就是重新設計不但沒有影響照片的品質,相反的還明顯提高。Table 1給出在不同生成器架構在CELEBA-HQ跟我們新的FFHQ資料集上的Frechet inception distances (FID)。其它資料集的結果在Appendix E,有興趣的就看看。我們的基線設置(A)是Karras et al.的Progressive GAN,除非有另外說明,不然我們就是直接引用他們的網路跟所有的超參數。我們首先談談改進的基線(B)用bilinear up/downsampling operations、更長的訓練跟超參數的調整。訓練配置跟超參數得細節都在Appendix C。然後我們通過加入映射網路跟AdaIN的操作進一步的改進這個新的基線,這個架構是(C),這邊我們很驚訝發現一件事,把latent code餵給第一層卷積層已經沒有任何好處了。所以我們就簡化這個架構,把傳統的input layer移除,然後從一個學習到的4x4x512的constant tensor開始做照片的合成,這個架構是(D)。我們發現到,即使合成網路只能夠透過AdaIN控制的style來接受input,合成網路仍然能夠生成有意義結果,不得鳥。 ::: :::info ![image](https://hackmd.io/_uploads/r1SGeuFCa.png) Table 1. Frechet inception distance (FID) for various generator designs (lower is better). In this paper we calculate the FIDs using 50,000 images drawn randomly from the training set, and report the lowest distance encountered over the course of training. Table 1. 不同生成器架構的Frechet inception distance (FID)(愈小愈好)。論文中我們使用從訓練集隨機抽的50,000張照片,並說明訓練過程中遇到的最短距離。 ::: :::info Finally, we introduce the noise inputs (E) that improve the results further, as well as novel mixing regularization (F) that decorrelates neighboring styles and enables more finegrained control over the generated imagery (Section 3.1). ::: :::success 最終,我們引入噪點輸入(E),這可以進一步的提升成果還有新穎的mixing regularization(F),它能夠decorrelates(去掉相關性)相鄰的風格(style),並且能夠更細緻的控制照片的生成(Section 3.1)。 ::: :::info We evaluate our methods using two different loss functions: for CELEBA-HQ we rely on WGAN-GP [24], while FFHQ uses WGAN-GP for configuration A and nonsaturating loss [22] with $R_1$ regularization [44, 51, 14] for configurations B–F. We found these choices to give the best results. 
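:::

:::warning
譯者補充(非論文內容):配置 B–F 所用的 non-saturating loss 搭配 $R_1$ regularization,大致可以理解成下面這個 PyTorch 的示意寫法。官方實作是 TensorFlow,這裡只是譯者用 PyTorch 重寫的概念版;`gamma=10` 對應 Appendix C 的設定,`D` 則是隨便放的極簡判別器,僅供示範計算流程。

```python
import torch
import torch.nn.functional as F

def d_loss_nonsaturating_r1(discriminator, real, fake, gamma=10.0):
    """判別器:logistic(non-saturating)loss + R1(只對真實樣本做梯度懲罰)。"""
    real = real.detach().requires_grad_(True)
    real_logits = discriminator(real)
    fake_logits = discriminator(fake.detach())
    loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    grad, = torch.autograd.grad(real_logits.sum(), real, create_graph=True)
    r1 = grad.pow(2).reshape(grad.shape[0], -1).sum(1).mean()
    return loss + 0.5 * gamma * r1

def g_loss_nonsaturating(discriminator, fake):
    """生成器:non-saturating loss。"""
    return F.softplus(-discriminator(fake)).mean()

# 以一個極簡的線性判別器示範整個計算流程
D = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))
real, fake = torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)
print(d_loss_nonsaturating_r1(D, real, fake).item(), g_loss_nonsaturating(D, fake).item())
```
:::

:::info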
Our contributions do not modify the loss function. ::: :::success 我們用兩個不同的loss function來評估我們提出的方法:對CELEBA-HQ的部份使用WGAN-GP,FFHQ的話配置A使用WGAN-GP,配B-F的話使用nonsaturating loss with $R_1$ regularization。我們發現到這些選擇可以得到最佳的成果。我們的貢獻並沒有改變這些損失函數。 ::: :::info We observe that the style-based generator (E) improves FIDs quite significantly over the traditional generator (B), almost 20%, corroborating the large-scale ImageNet measurements made in parallel work [6, 5]. Figure 2 shows an uncurated set of novel images generated from the FFHQ dataset using our generator. As confirmed by the FIDs, the average quality is high, and even accessories such as eyeglasses and hats get successfully synthesized. For this figure, we avoided sampling from the extreme regions of $\mathcal{W}$ using the so-called truncation trick [42, 5, 34] — Appendix B details how the trick can be performed in $\mathcal{W}$ instead of $\mathcal{Z}$. Note that our generator allows applying the truncation selectively to low resolutions only, so that highresolution details are not affected. ::: :::success 我們觀察到,style-based的生成器(E)比起傳統生成器(B)在FIDs上有著明顯的提升,幾乎有20%,這證實了兩個研究中ImageNet上大規模的量測數據。Figure 2說明的是使用我們的生成器從FFHQ資料集中生成出從未見過的新照片。如FIDs所確定,平均的品質的高的,即使是飾品,像是眼鏡跟帽子,也都可以很成功的合成。這張照片的部份,我們使用所謂的truncation trick,避免從$\mathcal{W}$的極端區域採樣,有興趣的話Append B會說明為什麼是$\mathcal{W}$而不是$\mathcal{Z}$。請注意,我們的生成器只允許選擇性地將截斷應用於低解析度,這樣高解析度的細節就不會受到影響。 ::: :::info ![image](https://hackmd.io/_uploads/SyLhcdF0T.png) Figure 2. Uncurated set of images produced by our style-based generator (config F) with the FFHQ dataset. Here we used a variation of the truncation trick [42, 5, 34] with $\psi =0.7$ for resolutions $4^2 - 32^2$. Please see the accompanying video for more results. ::: :::info All FIDs in this paper are computed without the truncation trick, and we only use it for illustrative purposes in Figure 2 and the video. All images are generated in $1024^2$ resolution. ::: :::success 這篇論文中的所有FIDs計算都沒有使用truncation trick,只有為了說明Figure 2跟vidoe的時候才有這麼做。所有的照片都是以$1024^2$的解析度生成。 ::: ### 2.2. Prior art :::info Much of the work on GAN architectures has focused on improving the discriminator by, e.g., using multiple discriminators [18, 47, 11], multiresolution discrimination [60, 55], or self-attention [63]. The work on generator side has mostly focused on the exact distribution in the input latent space [5] or shaping the input latent space via Gaussian mixture models [4], clustering [48], or encouraging convexity [52]. ::: :::success GAN架構多數的研究都關注在提升discriminator,像是透過使用多個discriminators、多解析度的discriminators,或是self-attention。generator的部份則是主要集中在輸入潛在空間的精確分佈或是通過Gaussian mixture models、clustering、或是鼓勵[凸性](https://terms.naer.edu.tw/detail/9bf099455ee808a4a2b2d713c45ba57a/)的方式來塑造輸入的潛在空間。 ::: :::info Recent conditional generators feed the class identifier through a separate embedding network to a large number of layers in the generator [46], while the latent is still provided though the input layer. A few authors have considered feeding parts of the latent code to multiple generator layers [9, 5]. In parallel work, Chen et al. [6] “self modulate” the generator using AdaINs, similarly to our work, but do not consider an intermediate latent space or noise inputs. ::: :::success 近來的條件生成器把class identifier(類別[識別符](https://terms.naer.edu.tw/detail/033bec722b7450cd1c209fec7534b8d4/))通過一個獨立的嵌入網路餵給生成器中多個網路層,而潛在(latent)的部份仍然是由input layers來提供。部份的研究人員考慮把latent code的部份餵給多個生成器的網路層。同時,Chen等人使用AdaINS來"self modulate(自我調製)"生成器。這跟我們的研究很類似,不過他們並沒有考慮到中間的潛在空間或是噪點輸入。 ::: ## 3. 
Properties of the style-based generator :::info Our generator architecture makes it possible to control the image synthesis via scale-specific modifications to the styles. We can view the mapping network and affine transformations as a way to draw samples for each style from a learned distribution, and the synthesis network as a way to generate a novel image based on a collection of styles. The effects of each style are localized in the network, i.e., modifying a specific subset of the styles can be expected to affect only certain aspects of the image. ::: :::success 我們的生成器架構讓透過特定尺度(scale-specific)的調整來控制合成影像變成可能。我們可以將映射網路跟仿射轉換視為一種從學習到的分佈中為每個風格(style)採樣的方法,合成網路的話則是基於收集到的風格(styles)生成一張新穎照片的方法。網路中的每個風格的影響都是局部性的,也就是說,調整一個特定的風格子集可以預期只會影響照片中的某些特定部份。 ::: :::info To see the reason for this localization, let us consider how the AdaIN operation (Eq. 1) first normalizes each channel to zero mean and unit variance, and only then applies scales and biases based on the style. The new per-channel statistics, as dictated by the style, modify the relative importance of features for the subsequent convolution operation, but they do not depend on the original statistics because of the normalization. Thus each style controls only one convolution before being overridden by the next AdaIN operation. ::: :::success 為了瞭解這種局部性的原因,讓我們來瞭解AdaIN的運算是怎麼一回事(Eq.1),首先把每個channel正規化成均值為0,變異數為1,然後再根據style($\mathbf{y}$)計算縮放與偏差。由style($\mathbf{y}$)所決定的每一個新的通道的統計資訊(per-channel statistics)會修改特徵對後續卷積的計算的相對重要性,不過因為正規化的關係,這並不會依賴原始的統計資訊。因此,每個style只會控制一個卷積,然後在下一個AdaIN計算之前被覆蓋掉。 ::: :::warning $$ \text{AdaIN}(\mathbf{x}_i,\mathbf{y})=\mathbf{y}_{s,i}\dfrac{\mathbf{x}_i-\mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)}+\mathbf{y}_{b,i}\tag{1} $$ ::: ### 3.1. Style mixing :::info To further encourage the styles to localize, we employ mixing regularization, where a given percentage of images are generated using two random latent codes instead of one during training. When generating such an image, we simply switch from one latent code to another — an operation we refer to as style mixing— at a randomly selected point in the synthesis network. To be specific, we run two latent codes z1, z2 through the mapping network, and have the corresponding w1, w2 control the styles so that w1 applies before the crossover point and w2 after it. This regularization technique prevents the network from assuming that adjacent styles are correlated. ::: :::success 為了進一步的鼓勵styles能夠局部化,我們採用混合正規化(mixing regularization),在訓練過程中,一定百分比的照片使用兩個隨機的latent codes而不是一個(這邊的翻譯不是很確定)。生成這類照片的時候,我們會在合成網路中隨機選擇一個點,然後從一個latent code切換到另一個latent code,這樣的操作我們將之視為style mixing。具體來說,我們會透過映射網路執行兩個latent code,$\mathbf{z}_1, \mathbf{z}_2$,然後有兩個對應的$\mathbf{w_1},\mathbf{w_2}$控制著styles,這樣的話,在交叉點(crossover point)之前就用$\mathbf{w_1}$,之後則是使用$\mathbf{w_2}$。這個正規化技術可以預防網路認為鄰近的風格(style)是有相關性的。 ::: :::info Table 2 shows how enabling mixing regularization during training improves the localization considerably, indicated by improved FIDs in scenarios where multiple latents are mixed at test time. Figure 3 presents examples of images synthesized by mixing two latent codes at various scales. We can see that each subset of styles controls meaningful high-level attributes of the image. ::: :::success Table 2給出訓練期間啟用混合正規化大幅改善局部化,這可以從測試期間混合多個latent code的情況下,其FIDs的提升看的出來。Figure 3呈現以不同尺度混合兩個latent codes來合成出照片的範例。我們可以看到,每一個風格(styles)的子集都控制著照片中有意義的高階屬性。 ::: :::info ![image](https://hackmd.io/_uploads/BJJOBh00a.png) Figure 3. 
Two sets of images were generated from their respective latent codes (sources A and B); the rest of the images were generated by copying a specified subset of styles from source B and taking the rest from source A. Copying the styles corresponding to coarse spatial resolutions ($4^2-8^2$) brings high-level aspects such as pose, general hair style, face shape, and eyeglasses from source B, while all colors (eyes, hair, lighting) and finer facial features resemble A. If we instead copy the styles of middle resolutions ($16^2-32^2$) from B, we inherit smaller scale facial features, hair style, eyes open/closed from B, while the pose, general face shape, and eyeglasses from A are preserved. Finally, copying the fine styles ($64^2-1024^2$) from B brings mainly the color scheme and microstructure. Figure 3. 由它們各自的latent codes(來源A與B)所生成的兩個照片子集;其它的照片則是用從B複製指定的風格子集所生成的,剩下的就是從A來的。複製對應低空間解析度($4^2-8^2$)的風格能夠從B帶來高階特徵,像是姿勢、髮型、臉型、跟眼鏡,同時所有的顏色(眼鏡、頭髮、光)跟更細膩的臉部份特徵則是類似於A。如果我們是從B的中間牽析度($16^2-32^2$)來複製風格的話,那就會從B繼承較小尺度的臉部特徵、髮型、張眼/閉眼,同時從A來的姿勢、臉型與眼鏡則是會被保留。最後,從B複製細部風格(fine styles)($64^2-1024^2$)則是帶來主要的配色與微觀結構。 ::: :::info ![image](https://hackmd.io/_uploads/rJbl4nRAp.png) Table 2. FIDs in FFHQ for networks trained by enabling the mixing regularization for different percentage of training examples. Here we stress test the trained networks by randomizing 1 . . . 4 latents and the crossover points between them. Mixing regularization improves the tolerance to these adverse operations significantly. Labels E and F refer to the configurations in Table 1. ::: ### 3.2. Stochastic variation :::info There are many aspects in human portraits that can be regarded as stochastic, such as the exact placement of hairs, stubble, freckles, or skin pores. Any of these can be randomized without affecting our perception of the image as long as they follow the correct distribution. ::: :::success 人類的肖像中有語多方面可以被視為是隨機的,像是頭髮的確切位置、鬍渣、雀斑、或是毛孔。只要它們依循正確的分佈,這些都可以被隨機的變化而不會影響我們對照片的感知。 ::: :::info Let us consider how a traditional generator implements stochastic variation. Given that the only input to the network is through the input layer, the network needs to invent a way to generate spatially-varying pseudorandom numbers from earlier activations whenever they are needed. This consumes network capacity and hiding the periodicity of generated signal is difficult — and not always successful, as evidenced by commonly seen repetitive patterns in generated images. Our architecture sidesteps these issues altogether by adding per-pixel noise after each convolution. ::: :::success 讓我們考慮一下傳統的生成器是如何實現隨機變化。假設,丟東西到網路的唯一方法是透過input layer,那網路就必需要創造出一個方法在需要的時候從earlier activations中生成spatially-varying pseudorandom numbers(空間變化偽隨機數?)。這會消耗網路的容量(能力,capacity),並且很難去隱藏生成信號的週期性,這並不總是成功的,這從生成出來的照片中常常看到重複的模式是可以證明。我們的架構利用在每個捲積層之後加入per-pixel noise的方式,完全避開這些問題。 ::: :::info Figure 4 shows stochastic realizations of the same underlying image, produced using our generator with different noise realizations. We can see that the noise affects only the stochastic aspects, leaving the overall composition and high-level aspects such as identity intact. Figure 5 further illustrates the effect of applying stochastic variation to different subsets of layers. Since these effects are best seen in animation, please consult the accompanying video for a demonstration of how changing the noise input of one layer leads to stochastic variation at a matching scale. 
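:::

:::warning
譯者補充(非論文內容):Figure 4(c) 的「逐像素標準差」量測方式,概念上就是固定同一個 $\mathbf{w}$、只重抽每層的噪點,然後對多次生成結果逐像素取標準差。下面是一個 NumPy 的小示意,其中 `fake_synthesis` 只是譯者放的佔位函數(讓輸出同時依賴 w 與噪點),實際操作時要換成訓練好的合成網路 $g$。

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32

def fake_synthesis(w, noise):
    """佔位用的「合成網路」:整體結構由 w 決定,細節由噪點決定,僅供示意。"""
    base = np.tanh(w[:3, None, None] * np.ones((3, H, W)))   # 由 w 控制的全域部份
    detail = 0.1 * noise                                      # 由噪點控制的隨機細節
    return base + detail

w = rng.standard_normal(512)
# 同一個 w、100 組不同的噪點實現(對應 Figure 4(c) 的 100 different realizations)
images = np.stack([fake_synthesis(w, rng.standard_normal((3, H, W))) for _ in range(100)])
per_pixel_std = images.std(axis=0)       # 哪些像素受噪點影響,標準差就大
print(per_pixel_std.shape, per_pixel_std.mean())
```
:::

:::info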
::: :::success Figure 4說明了相同的基本照片的隨機實現,這是使用我們的生成器搭配不同的噪點實現所生成的。我們可以看到,噪點只會影響隨機部位,整體的結構與高階屬性方面則是完好無缺。Figure 5則是進一步的說明了將這種隨機變化應用到不同網路層的子集的效果。因為這些效果的最佳賞閱模式是看片片,所以就請觀閱所附的影片,以瞭解一個網路層的噪點輸入的變化如何的導致在匹配尺度的隨機變化。 ::: :::info ![image](https://hackmd.io/_uploads/HJjtJaAR6.png) Figure 4. Examples of stochastic variation. (a) Two generated images. (b) Zoom-in with different realizations of input noise. While the overall appearance is almost identical, individual hairs are placed very differently. (c) Standard deviation of each pixel over 100 different realizations, highlighting which parts of the images are affected by the noise. The main areas are the hair, silhouettes, and parts of background, but there is also interesting stochastic variation in the eye reflections. Global aspects such as identity and pose are unaffected by stochastic variation. Figure 4. 隨機變化的範例。(a)兩張生成的照片。(b)放大不同的輸入噪點的實現。儘管整體的外觀幾乎相同,但每個毛的位置是有很大的不同的。(c)在100個不同實現上的每個像素的標準差,突顯出那些部位受噪點影響。主要的區域在頭髮、輪廓、與背景的部份,不過在眼睛的反射的部份也有著有趣的隨機變化。全域特徵像是身份、姿勢這種的就不受隨機變化的影響。 ::: :::info ![image](https://hackmd.io/_uploads/ByokGpRAa.png) Figure 5. Effect of noise inputs at different layers of our generator. (a) Noise is applied to all layers. (b) No noise. (c) Noise in fine layers only ($64^2 – 1024^2$). (d) Noise in coarse layers only ($4^2 – 32^2$). We can see that the artificial omission of noise leads to featureless “painterly” look. Coarse noise causes large-scale curling of hair and appearance of larger background features, while the fine noise brings out the finer curls of hair, finer background detail, and skin pores. Figure 5. 輸入噪點在我們的生成器中不同網路層的影響。(a)每一層都用噪點。(b)沒有噪點。(c)只有在fine layers用噪點($64^2 – 1024^2$)。(d)只有在coarse layers用噪點($4^2 – 32^2$)。我們可以看到,人工刻意省掉噪點會導致很沒有特色"繪畫般"的外觀。Coarse noise會導致頭髮大尺度的捲動,也會出現較大背景的特徵,而fine noise則是會帶出更細緻的捲髮、背景細節與皮膚毛孔。 ::: :::warning 這邊的fine layers看起來應該是指解析度較高的網路層,主要處理細節。而coarse layers的話則是指解析度較低的網路層,主要處理整體。這從卷積的概念來看,feature map較大的時候,所detect到的也是細節,也許是這樣的道理? ::: :::info We find it interesting that the effect of noise appears tightly localized in the network. We hypothesize that at any point in the generator, there is pressure to introduce new content as soon as possible, and the easiest way for our network to create stochastic variation is to rely on the noise provided. A fresh set of noise is available for every layer, and thus there is no incentive to generate the stochastic effects from earlier activations, leading to a localized effect. ::: :::success 我們發現一件有趣的事,就是噪點的影響似乎在網路中被緊緊地局部化。我們假設,在生成器中的任一個點,都會有盡快引入新內容的壓力,對我們的網路而言,建立隨機變化最簡單的方法就是靠所提供的噪點。每一層都有新鮮的噪點可以使用,因此,沒有任何的誘因從earlier activations生成隨機影響,也就導致這種局部化的影響。 ::: ### 3.3. Separation of global effects from stochasticity :::info The previous sections as well as the accompanying video demonstrate that while changes to the style have global effects (changing pose, identity, etc.), the noise affects only inconsequential stochastic variation (differently combed hair, beard, etc.). This observation is in line with style transfer literature, where it has been established that spatially invariant statistics (Gram matrix, channel-wise mean, variance, etc.) reliably encode the style of an image [20, 39] while spatially varying features encode a specific instance. 
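:::

:::warning
譯者補充(非論文內容):Section 3.1 的 style mixing(也就是 Figure 3 的做法)換成程式大概是下面這個樣子:兩個 latent code 各自經過映射網路得到 $\mathbf{w}_1, \mathbf{w}_2$,在合成網路裡隨機選一個交叉點,之前的層用 $\mathbf{w}_1$、之後的層用 $\mathbf{w}_2$。這只是譯者的示意,`num_layers=18` 對應 Figure 1 的合成網路層數,`mapping` 用恆等函數當佔位。

```python
import numpy as np

def style_mixing_ws(w1, w2, num_layers, crossover):
    """回傳每一層要用的 w:交叉點之前用 w1,之後用 w2(每層再經仿射 A 轉成 style y)。"""
    return [w1 if i < crossover else w2 for i in range(num_layers)]

rng = np.random.default_rng(0)
mapping = lambda z: z                          # 佔位:實際上是 8 層 MLP f: Z -> W
z1, z2 = rng.standard_normal(512), rng.standard_normal(512)
w1, w2 = mapping(z1), mapping(z2)

crossover = rng.integers(1, 18)                # 隨機選交叉點
per_layer_w = style_mixing_ws(w1, w2, num_layers=18, crossover=crossover)
# 前面(低解析度層)跟隨 w1 的粗略屬性,後面(高解析度層)跟隨 w2 的細節與配色
print(crossover, sum(w is w1 for w in per_layer_w), sum(w is w2 for w in per_layer_w))
```
:::

:::info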
::: :::success 前面的章節跟片片演示說明,儘管對風格(style)的變化有著全域的影響(影響姿勢、身份等),但是噪點的影響就只會一些無關緊要的隨機變化(不同的髮型、鬍鬚等等)。這個觀察跟style transfer的文獻是一致的,其中已經確立空間不變統計(spatially invariant statistics)(Gram matrix, channel-wise mean, variance, etc.)可靠地編碼成照片的風格,而空間變化特徵則是編碼成一個特定實例。 ::: :::info In our style-based generator, the style affects the entire image because complete feature maps are scaled and biased with the same values. Therefore, global effects such as pose, lighting, or background style can be controlled coherently. Meanwhile, the noise is added independently to each pixel and is thus ideally suited for controlling stochastic variation. If the network tried to control, e.g., pose using the noise, that would lead to spatially inconsistent decisions that would then be penalized by the discriminator. Thus the network learns to use the global and local channels appropriately, without explicit guidance. ::: :::success 在我們的style-based generator中,風格(style)會影響整張照片,這是因為完整的feature maps是以相同的值來縮放(scaled)與偏移(biased)。因此,全域的影響,像是姿勢、光線或是背景風格就可以被連貫地控制。同時,噪點被各自地加到每個像素,因此非常適合拿來控制隨機變化。如果網路嚐試使用噪點控制姿勢,那就會導致空間不一致的決策,也就會受到discriminator的懲罰。因此,網路會學習在沒有明確的指導下適當地使用全域、局部通道(channel)。 ::: ## 4. Disentanglement studies :::info There are various definitions for disentanglement [54, 50, 2, 7, 19], but a common goal is a latent space that consists of linear subspaces, each of which controls one factor of variation. However, the sampling probability of each combination of factors in Z needs to match the corresponding density in the training data. As illustrated in Figure 6, this precludes the factors from being fully disentangled with typical datasets and input latent distributions.^2^ ::: :::success 關於解耦(disentanglement)的定義有很多,[54, 50, 2, 7, 19],不過一個常見的目標就是一個由linear subspaces(線性子空間)所組成的latent space,每一個線性子空間都控制一個變動的因子。然而,在$\mathcal{Z}$中的因子(factors)的每一個組合的採樣機率都必需跟訓練資料中的對應密度相匹配。如Figure 6所說明,這排除了這些因子跟典型的資料集與輸入潛在空間分佈的完全解耦問題。(這邊的翻譯不是很確定) ::: :::info ^2^ The few artificial datasets designed for disentanglement studies (e.g., [43, 19]) tabulate all combinations of predetermined factors of variation with uniform frequency, thus hiding the problem. ^2^ 那些為了解耦研究而設計的少數人造資料集,以均勻的頻率將預先定義的變化因子的所有組合製表,從而隱藏這個問題。 ::: :::info ![image](https://hackmd.io/_uploads/SksXTWx1R.png) Figure 6. Illustrative example with two factors of variation (image features, e.g., masculinity and hair length). (a) An example training set where some combination (e.g., long haired males) is missing. (b) This forces the mapping from Z to image features to become curved so that the forbidden combination disappears in Z to prevent the sampling of invalid combinations. (c) The learned mapping from Z to W is able to “undo” much of the warping. Figure 6. 兩個變化因子的描述範例(影像特徵,像是男子氣概跟頭髮長度)。(a)一個範例訓練集,其中某些組合像是長髮男性是有缺的。(b)這強制從$\mathcal{Z}$映射到影像特徵來變成曲線,從而禁止組合會從$\mathcal{Z}$消失,以此預防無效組合的採樣。(c)學習到從$\mathcal{Z}$映射到$\mathcal{W}$能夠"undo"大部份的扭曲。 ::: :::info A major benefit of our generator architecture is that the intermediate latent space $\mathcal{W}$ does not have to support sampling according to any fixed distribution; its sampling density is induced by the learned piecewise continuous mapping $f(\mathbf{z})$. This mapping can be adapted to “unwarp” $\mathcal{W}$ so that the factors of variation become more linear. We posit that there is pressure for the generator to do so, as it should be easier to generate realistic images based on a disentangled representation than based on an entangled representation. 
As such, we expect the training to yield a less entangled $\mathcal{W}$ in an unsupervised setting, i.e., when the factors of variation are not known in advance [10, 35, 49, 8, 26, 32, 7]. ::: :::success 我們生成器架構最主要的好處就是intermediate latent space $\mathcal{W}$並不需要根據任何固定的分佈來做採樣;其採樣密度是由學習到的piecewise continuous mapping $f(\mathbf{z})$所引導。這個映射可以被調整來"unwarp"$\mathcal{W}$,近而造成變化的因子變的更線性一些。我們假設,生成器有這麼做的壓力,因為基於disentangled representation來生成真實性的照片比起基於entangled representation還要來的簡單的多。因此啊,我們預期在非監督設置下訓練可以產生較少的entangled $\mathcal{W}$,也就是變動因子在事先是未知的情況。 ::: :::warning 忍不住的問了一下GPT問題: 在生成模型中,unwarp、disentangled、entangled 通常用來描述潛在空間$\mathcal{W}$的狀態: * Unwarp $\mathcal{W}$:指$\mathcal{W}$空間中的變因關係變得更加線性化,也就是更「解耦」 * Disentangled $\mathcal{W}$:指$\mathcal{W}$空間中的各個維度代表了相對獨立的變因,例如人臉圖像的年齡、性別、表情等 * Entangled$\mathcal{W}$:指$\mathcal{W}$空間中的各個維度相互糾纏在一起,難以分離 unwarp和disentangled在生成模型中很相似,都指代將潛在空間$\mathcal{W}$中的變因關係變得更加線性化、解耦的過程: * unwarp 是從$\mathcal{W}$空間的角度來描述的,而disentangled是從變因的角度來描述的。 * unwarp 強調的是$\mathcal{W}$空間的變形,而disentangled強調的是變因之間的關係。 * unwarp 可以通過調整$\mathcal{W}$空間的映射來實現,而disentangled可以通過訓練模型來實現。 ::: :::info Unfortunately the metrics recently proposed for quantifying disentanglement [26, 32, 7, 19] require an encoder network that maps input images to latent codes. These metrics are ill-suited for our purposes since our baseline GAN lacks such an encoder. While it is possible to add an extra network for this purpose [8, 12, 15], we want to avoid investing effort into a component that is not a part of the actual solution. To this end, we describe two new ways of quantifying disentanglement, neither of which requires an encoder or known factors of variation, and are therefore computable for any image dataset and generator. ::: :::success 不幸的是,近來提出用於量化解耦的指標都需要一個能將輸入的照片映射到latent code的encoder network。這些指標並不適合我們的情況,因為我們的baseline GAN缺乏這類的encoder。儘管為了這個去新增一個額外的網路是可能的,不過我們要避免投入太多力氣到這種不切實際的東西上。為此,我們提出兩種新的方法來量化解耦,這兩種方法既不需要一個encoder,也不需要變動因子是已知的狀態,因此對於任意的影像資料集與生成器都是可計算的。 ::: ### 4.1. Perceptual path length :::info As noted by Laine [37], interpolation of latent-space vectors may yield surprisingly non-linear changes in the image. For example, features that are absent in either endpoint may appear in the middle of a linear interpolation path. This is a sign that the latent space is entangled and the factors of variation are not properly separated. To quantify this effect, we can measure how drastic changes the image undergoes as we perform interpolation in the latent space. Intuitively, a less curved latent space should result in perceptually smoother transition than a highly curved latent space. ::: :::success 如Laine所指出,潛在空間向量的插值也許會在照片中產生讓人驚訝的非線性變化。舉例來說,兩端點都沒有特徵就可以出現在線性插值路徑的中間。這就說明了,潛在空間是糾纏在一起的,而且變動因子並沒有乖乖的分開。為了量化這個影呂,我們可以量測當我們在潛在空間中進行插值的時候,照片受到的變化有多麼的激烈。直觀來說就是,彎曲幅度較小的潛在空間相較彎曲幅度較高的潛在空間在感知上會有著較平滑的轉變。 ::: :::info As a basis for our metric, we use a perceptually-based pairwise image distance [65] that is calculated as a weighted difference between two VGG16 [58] embeddings, where the weights are fit so that the metric agrees with human perceptual similarity judgments. If we subdivide a latent space interpolation path into linear segments, we can define the total perceptual length of this segmented path as the sum of perceptual differences over each segment, as reported by the image distance metric. A natural definition for the perceptual path length would be the limit of this sum under infinitely fine subdivision, but in practice we approximate it using a small subdivision epsilon $\epsilon = 10^{−4}$ . 
The average perceptual path length in latent space $\mathcal{Z}$, over all possible endpoints, is therefore $$ l_\mathcal{Z}=\mathbb{E}\left[\dfrac{1}{\epsilon^2}d\left(G(\text{slerp}(\mathbf{z}_1, \mathbf{z}_2;t),G(\text{slerp}(\mathbf{z}_1,\mathbf{z}_2;t+\epsilon))) \right)\right]\tag{2} $$ where $\mathbf{z}_1,\mathbf{z}_2\sim P(\mathbf{z}),t\sim U(0,1)$,$G$ is the generator (i.e., $g$ of for style-based networks), and $d(\cdot,\cdot)$ evaluates the perceptual distance between the resulting images. Here $\text{slerp}$$\text{slerp}$ denotes spherical interpolation [56], which is the most appropriate way of interpolating in our normalized input latent space [61]. To concentrate on the facial features instead of background, we crop the generated images to contain only the face prior to evaluating the pairwise image metric. As the metric $d$ is quadratic [65], we divide by $\epsilon^2$. We compute the expectation by taking 100,000 samples. ::: :::success 做為我們指標的基準,我們使用perceptually-based pairwise image distance(基於感知的成對影像距離?),這是兩個VGG16 embeddings的加權差所計算而得,其中權重會被調整可以讓指標適合人類感知相似性的判斷。如果我們把潛在空間的插值路徑細分成linear segment(線性線段?),那我們就可以定義這個段落路徑(segemented path)的總的感知長度為每個線段(segment)上的感知差異的總和,如影像距離指標所指出。感知路徑長度的定義就會是在這個無限細分下的這個總和的極限值,不過實務上我們會用一個比較小的細分值$\epsilon = 10^{−4}$來近似它。所以,在潛在空間$\mathcal{Z}$中所有可能的endpoints上的平均感知路徑長度就是: $$ l_\mathcal{Z}=\mathbb{E}\left[\dfrac{1}{\epsilon^2}d\left(G(\text{slerp}(\mathbf{z}_1, \mathbf{z}_2;t),G(\text{slerp}(\mathbf{z}_1,\mathbf{z}_2;t+\epsilon))) \right)\right]\tag{2} $$ 其中$\mathbf{z}_1,\mathbf{z}_2\sim P(\mathbf{z}),t\sim U(0,1)$,$G$則是生成器(即style-based networks的$g$),然後$d(\cdot,\cdot)$為估測成果照片之間的感知距離的函數。這裡的$\text{slerp}$表示球面線性插值,這是在我們的正規化輸入潛在空間中做插值的最恰當的方式。為了能夠專注在臉部特徵而不是背景,我們在評估成對影像指標之前把生成的照片剪裁為單純的包含臉部的狀態。因為指標$d$是二次的,所以我們就除上$\epsilon^2$。我們通過採構100,000個樣本來計算期望值。 ::: :::info Computing the average perceptual path length in $\mathcal{W}$ is carried out in a similar fashion: $$ l_\mathcal{W}=\mathbb{E}\left[\dfrac{1}{\epsilon^2}d\left(g(\text{lerp}(\mathbf{z}_1, \mathbf{z}_2;t),g(\text{lerp}(\mathbf{z}_1,\mathbf{z}_2;t+\epsilon))) \right)\right]\tag{3} $$ where the only difference is that interpolation happens in $\mathcal{W}$ space. Because vectors in $\mathcal{W}$ are not normalized in any fashion, we use linear interpolation ($\text{lerp}$) ::: :::success 計算$\mathcal{W}$裡面的平均感知路徑長度用著類似的方式來處理: $$ l_\mathcal{W}=\mathbb{E}\left[\dfrac{1}{\epsilon^2}d\left(g(\text{lerp}(\mathbf{z}_1, \mathbf{z}_2;t),g(\text{lerp}(\mathbf{z}_1,\mathbf{z}_2;t+\epsilon))) \right)\right]\tag{3} $$ 唯一的不同就是插值是發生在$\mathcal{W}$空間中。因為$\mathcal{W}$中的向量並沒有以任何方式做過正規化,我們使用線性插值($\text{lerp}$)。 ::: :::info Table 3 shows that this full-path length is substantially shorter for our style-based generator with noise inputs, indicating that $\mathcal{W}$ is perceptually more linear than $\mathcal{Z}$. Yet, this measurement is in fact slightly biased in favor of the input latent space $\mathcal{Z}$. If $\mathcal{W}$ is indeed a disentangled and “flattened” mapping of $\mathcal{Z}$, it may contain regions that are not on the input manifold— and are thus badly reconstructed by the generator — even between points that are mapped from the input manifold, whereas the input latent space $\mathcal{Z}$ has no such regions by definition. It is therefore to be expected that if we restrict our measure to path endpoints, i.e., $t \in \left\{0, 1 \right\}$, we should obtain a smaller $l_\mathcal{W}$ while $l_\mathcal{Z}$ is not affected. 
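:::

:::warning
譯者補充(非論文內容):Eq. 2 / Eq. 3 的蒙地卡羅估計流程大致如下(NumPy 示意)。`fake_G` 與 `fake_d` 只是譯者放的佔位函數,實際上 $G$ 要換成訓練好的生成器、$d(\cdot,\cdot)$ 要換成論文所用的 VGG16 感知距離;在 $\mathcal{W}$ 空間則把 `slerp` 換成 `lerp`、把 $G$ 換成合成網路 $g$ 即可。

```python
import numpy as np

def slerp(a, b, t):
    """球面線性插值(Eq. 2,用於 Z 空間)。"""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

def lerp(a, b, t):
    """線性插值(Eq. 3,用於 W 空間)。"""
    return a + t * (b - a)

def perceptual_path_length_z(G, d, num_samples=100, eps=1e-4, dim=512, seed=0):
    """Eq. 2:在插值路徑上取 t 與 t+eps 兩點,量測感知距離再除以 eps^2,取平均。"""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        z1, z2 = rng.standard_normal(dim), rng.standard_normal(dim)
        t = rng.uniform(0.0, 1.0)
        img_a = G(slerp(z1, z2, t))
        img_b = G(slerp(z1, z2, t + eps))
        total += d(img_a, img_b) / (eps ** 2)
    return total / num_samples

fake_G = lambda z: np.tanh(z[:3])                       # 佔位用的生成器
fake_d = lambda a, b: float(np.mean((a - b) ** 2))      # 佔位用的距離函數
print(perceptual_path_length_z(fake_G, fake_d, num_samples=10))
```
:::

:::info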
This is indeed what we observe in Table 3 ::: :::success Table 3說明了,對於我們的style-based搭配噪點輸入的生成器,這種full-path length明顯較短,這說明著$\mathcal{W}$在感知上比$\mathcal{Z}$更為線性。不過這樣的測量實際上對於潛在空間$\mathcal{Z}$有一點點的偏頗。如果$\mathcal{W}$確實是$\mathcal{Z}$的一個解耦(disentangled)且"平坦(flattened)"的映射,那它也許包含不在輸入流形(input manifold)的區域上,也因此這些區域無法很好的被生成器重構,即使是來自input manifold映射的點之間也是如此,然而,輸入潛在空間$\mathcal{Z}$根據定義是沒有這樣的區域的。所以吼,可以預期,如果我們將量測限制在path endpoints,也就是$t \in \left\{0, 1 \right\}$,我們應該可以得到一個較小的$l_\mathcal{W}$,而$l_\mathcal{Z}$則是不受影響。這正是我們在Table 3中所觀察到的。 ::: :::info ![image](https://hackmd.io/_uploads/ryd6VhG1C.png) Table 3. Perceptual path lengths and separability scores for various generator architectures in FFHQ (lower is better). We perform the measurements in $\mathcal{Z}$ for the traditional network, and in $\mathcal{W}$for style-based ones. Making the network resistant to style mixing appears to distort the intermediate latent space $\mathcal{W}$ somewhat. We hypothesize that mixing makes it more difficult for $\mathcal{W}$ to efficiently encode factors of variation that span multiple scales. Table 3. FFHQ(愈低愈好)中各種生成器架構的感知路徑長度與[可分性](https://terms.naer.edu.tw/detail/04df1748cf957249d2ffe5fbdc1d0f5d/)分數。我們對傳統的網路以$\mathcal{Z}$做量測,對style-based的網路則是以$\mathcal{W}$。要讓網路能夠抵抗風格混合(style mixing)似乎會些許的扭曲中間潛在空間$\mathcal{W}$。我們假設,混合會讓$\mathcal{W}$更難以有效地編碼跨多個尺度的因子。 ::: :::info ![image](https://hackmd.io/_uploads/SJiRNnz1R.png) Table 4 shows how path lengths are affected by the mapping network. We see that both traditional and style-based generators benefit from having a mapping network, and additional depth generally improves the perceptual path length as well as FIDs. It is interesting that while $l_\mathcal{W}$ improves in the traditional generator, $l_\mathcal{Z}$ becomes considerably worse, illustrating our claim that the input latent space can indeed be arbitrarily entangled in GANs ::: :::success Table 4說明了路徑南度是如何的受到映射網路的影響。我們看到傳統生成器跟style-based的生成器都得益於擁有映射網路,並且增加網路的深度通常會改進感知路徑長度以及FIDs。有趣的是,儘管$l_\mathcal{W}$在傳統生成器中有了改善,不過$l_\mathcal{Z}$卻是明顯爛了,這說明了我們的論點,也就是GAN中的輸入潛在空間確實是任意糾的。 ::: :::info ![image](https://hackmd.io/_uploads/Sk2iAp7eA.png) Table 4. The effect of a mapping network in FFHQ. The number in method name indicates the depth of the mapping network. We see that FID, separability, and path length all benefit from having a mapping network, and this holds for both style-based and traditional generator architectures. Furthermore, a deeper mapping network generally performs better than a shallow one Table 4. 映射網路在FFHQ中的影響。方法名稱中的數值表示映射網路的深度。我們看到FID、可分性、以及路徑長度皆因使用映射網路而得益,這對style-based與傳統生成器架構都是成立的。此外,更深的映射網路通常效果會比淺的來的好。 ::: ### 4.2. Linear separability :::info If a latent space is sufficiently disentangled, it should be possible to find direction vectors that consistently correspond to individual factors of variation. We propose another metric that quantifies this effect by measuring how well the latent-space points can be separated into two distinct sets via a linear hyperplane, so that each set corresponds to a specific binary attribute of the image ::: :::success 如果一個潛在空間足夠解耦,那它應該能夠找到始終不變的對應的個別變動因子的direction vectors。我們提出另一種量化這種影響的指標,透過量測潛在空間中的點如何地能夠很好的通過線性超平面被分離成兩個不同集合,從而讓每個集合對應影像的特定二元屬性。 ::: :::info In order to label the generated images, we train auxiliary classification networks for a number of binary attributes, e.g., to distinguish male and female faces. 
In our tests, the classifiers had the same architecture as the discriminator we use (i.e., same as in [30]), and were trained using the CELEBA-HQ dataset that retains the 40 attributes available in the original CelebA dataset. To measure the separability of one attribute, we generate 200,000 images with $\mathbf{z}\sim P(\mathbf{z})$ and classify them using the auxiliary classification network. We then sort the samples according to classifier confidence and remove the least confident half, yielding 100,000 labeled latent-space vectors. ::: :::success 為了標記生成的照片,我們一系列的二元屬性訓練輔助分類網路,像是區分男生、女生的臉。在我們的測試中,分類器跟我們所使用的distriminator有著相同的架構,然後使用CELEBA-HQ dataset訓練,這個資料集保有原始CelebA dataset中的40種屬性。為了量測一種屬性的可分性,我們用$\mathbf{z}\sim P(\mathbf{z})$生成200,000張照片,然後用輔助分類網路來分類它們。然後,我們再根據分類器的置信度對樣本做排序,移除置信度最低的那一半,最終得到100,000個標記的潛在空間向量。 ::: :::info For each attribute, we fit a linear SVM to predict the label based on the latent-space point —$\mathbf{z}$ for traditional and $\mathbf{w}$ for style-based — and classify the points by this plane. We then compute the conditional entropy $H(Y\vert X)$ where $X$ are the classes predicted by the SVM and $Y$ are the classes determined by the pre-trained classifier. This tells how much additional information is required to determine the true class of a sample, given that we know on which side of the hyperplane it lies. A low value suggests consistent latent space directions for the corresponding factor(s) of variation. ::: :::success 對於每個屬性,我們擬合了一個線性SVM基於潛在空間的點(latent-space point)來預測標記,其中$\mathbf{z}$是用於傳統網路,$\mathbf{w}$是用於style-based的網路,然後利用這個平面來分類這些點。然後我們計算條件熵$H(Y\vert X)$,其中$X$是由SVM預測而得的類別,而$Y$則是由預訓練分類器所確定的類別。這告訴我們,在知道樣本位於超平面的那一邊的情況下,我們需要多少信息才能決定一個樣本的實際類別。較低的數值就說明著跟對應變動因子的潛在空間方向是一致的。 ::: :::info We calculate the final separability score as $\exp(\sum_iH(Y_i\vert X_i))$, where $i$ enumerates the 40 attributes. Similar to the inception score [53], the exponentiation brings the values from logarithmic to linear domain so that they are easier to compare. ::: :::success 我們計算最終的可分性分數為$\exp(\sum_iH(Y_i\vert X_i))$,其中$i$表示表示枚舉這40種屬性。類似於inception score,[指數化](https://terms.naer.edu.tw/detail/f5d03dc80f2cf2151e8094c96c8cfde9/)將值域從對數轉成線性域(linear domain),所以很容易可以比較。 ::: :::info Tables 3 and 4 show that $\mathcal{W}$ is consistently better separable than $\mathcal{Z}$, suggesting a less entangled representation. Furthermore, increasing the depth of the mapping network improves both image quality and separability in $\mathcal{W}$, which is in line with the hypothesis that the synthesis network inherently favors a disentangled input representation. Interestingly, adding a mapping network in front of a traditional generator results in severe loss of separability in $\mathcal{Z}$ but improves the situation in the intermediate latent space $\mathcal{W}$, and the FID improves as well. This shows that even the traditional generator architecture performs better when we introduce an intermediate latent space that does not have to follow the distribution of the training data. ::: :::success Tables 3、4說明了,$\mathcal{W}$始終都比$\mathcal{Z}$還要來的好分離,這意謂著更低的糾纏表示。此外,隨著映射網路深度的增加,照片的品質跟可分性都有著提升,這跟我們的假設是一致的,也就是合成網路本質上傾向於一個解耦的輸入表示。有趣的是,在傳統生成器的前面增加映射網路會導致$\mathcal{Z}$中的可分性嚴重受損,但是卻改善中間潛在空間$\mathcal{W}$中的情況,並且FID也同時得到改善。這說明著,即使是傳統生成器架構,當我們引入那個不需要依循訓練資料分佈的中間潛在空間時,也可以做的很好。 ::: ## 5. Conclusion :::info Based on both our results and parallel work by Chen et al. [6], it is becoming clear that the traditional GAN generator architecture is in every way inferior to a style-based design. 
This is true in terms of established quality metrics, and we further believe that our investigations to the separation of high-level attributes and stochastic effects, as well as the linearity of the intermediate latent space will prove fruitful in improving the understanding and controllability of GAN synthesis. ::: :::success 基於我們的成果以及Chen et al的相關研究,很明顯的,傳統GAN生成器架構在各方面都比不style-based的設計。從確定的品質指標來看這確實如此,我們進一步的相信,我們對於分離高階屬性與隨機效果的研究、以及中間潛在空間的線性都將進一步的推升對於GAN合成的理解與可控性。 ::: :::info We note that our average path length metric could easily be used as a regularizer during training, and perhaps some variant of the linear separability metric could act as one, too. In general, we expect that methods for directly shaping the intermediate latent space during training will provide interesting avenues for future work. ::: :::success 我們注意到,我們的平均路徑長度指標能夠輕易的被用來做為訓練期間的正規化(regularizer),並且或許線性可分性指標的變體也能夠起到相同的作用。一般來說,我們預期那些訓練期間直接[作用](https://terms.naer.edu.tw/detail/179b4f95070e5791c85ee0e20f8a78ca/)於中間潛在空間的方法將為我們未來的研究提供有趣的康莊大道。 ::: ## A. The FFHQ dataset :::info We have collected a new dataset of human faces, FlickrFaces-HQ (FFHQ), consisting of 70,000 high-quality images at 10242 resolution (Figure 7). The dataset includes vastly more variation than CELEBA-HQ [30] in terms of age, ethnicity and image background, and also has much better coverage of accessories such as eyeglasses, sunglasses, hats, etc. The images were crawled from Flickr (thus inheriting all the biases of that website) and automatically aligned [31] and cropped. Only images under permissive licenses were collected. Various automatic filters were used to prune the set, and finally Mechanical Turk allowed us to remove the occasional statues, paintings, or photos of photos. We have made the dataset publicly available at https://github.com/NVlabs/ffhq-dataset ::: :::success 我們收集一個新的人臉資料集,FlickrFaces-HQ (FFHQ),包含70,000張高品質照片(1024^2^)(Figure 7)。這個資料集在年齡、種族與影像背景的變化性都遠超過CELEBA-HQ,而且像是眼鏡、太陽眼鏡、帽子之類的配件的涵蓋範圍更高。照片是從Flickr爬下來的(因此也繼承該網站的所有偏見),然後自動對齊、剪裁。我們只收集授權允許的照片。我們用了多種自動化的過濾器來篩選這個資料集,最終Mechanical Turk被允許移除雕像、繪圖或是photos of photos。資料集放在https://github.com/NVlabs/ffhq-dataset。 ::: :::info ![image](https://hackmd.io/_uploads/HJxS0U7Sg0.png) Figure 7. The FFHQ dataset offers a lot of variety in terms of age, ethnicity, viewpoint, lighting, and image background. ::: ## B. Truncation trick in $\mathcal{W}$ :::info If we consider the distribution of training data, it is clear that areas of low density are poorly represented and thus likely to be difficult for the generator to learn. This is a significant open problem in all generative modeling techniques. However, it is known that drawing latent vectors from a truncated [42, 5] or otherwise shrunk [34] sampling space tends to improve average image quality, although some amount of variation is lost. ::: :::success 如果我們考慮訓練資料的分佈,很明顯的,低密度的區域的表現都很不好,因此對生成器來說可能很難學習。這是一個在所有生成式建模技術中很明顯大家都知道的問題。然而,眾所皆知,從一個截斷或是其它shrunk sampling space(縮小的樣本空間?)抽取潛在向量會傾個於提高平均影像品質,儘管會損失一咪咪的變化性。 ::: :::info We can follow a similar strategy. To begin, we compute the center of mass of $\mathcal{W}$ as $\bar{\mathbf{w}}=\mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}[f(\mathbf{z})]$. In case of FFHQ this point represents a sort of an average face (Figure 8, $\psi=0$). We can then scale the deviation of a given w from the center as $\mathbf{w}'=\bar{\mathbf{w}}+\psi(\mathbf{w}-\bar{\mathbf{w}})$, where $\psi < 1$. While Brock et al. 
[5] observe that only a subset of networks is amenable to such truncation even when orthogonal regularization is used, truncation in $\mathcal{W}$ space seems to work reliably even without changes to the loss function. ::: :::success 我們可以依循著類似的策略。最開始,我們計算$\mathcal{W}$的質心,其值為$\bar{\mathbf{w}}=\mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}[f(\mathbf{z})]$。在FFHQ的範例中,這個質心的表示就比較像是一個平均的臉部(average face)(Figure 8, $\psi=0$)。然後我們可以用$\mathbf{w}'=\bar{\mathbf{w}}+\psi(\mathbf{w}-\bar{\mathbf{w}})$來縮放從質心所給定的$\mathbf{w}$的偏差,其中$\psi < 1$。雖然Brock et al.觀察到,即使使用orthogonal regularization,也只會有一部份的網路能夠接受這類的截斷(truncation),不過在$\mathcal{W}$空間中的截斷似乎能夠可靠地運作,儘管loss function沒有改變。 ::: ## C. Hyperparameters and training details :::info We build upon the official TensorFlow [1] implementation of Progressive GANs by Karras et al. [30], from which we inherit most of the training details. This original setup corresponds to configuration A in Table 1. In particular, we use the same discriminator architecture, resolution-dependent minibatch sizes, Adam [33] hyperparameters, and exponential moving average of the generator. We enable mirror augmentation for CelebA-HQ and FFHQ, but disable it for LSUN. Our training time is approximately oneweek on an NVIDIA DGX-1 with 8 Tesla V100 GPUs. ::: :::success 我們建立在Karras等人於官方TensorFlow所實作的Progressive GANs基礎上,我們從中繼續大部份的訓練細節。原始的設定對應Table 1中的配置A。特別是,我們使用相同的discriminator架構,批次大小取決於解析度,Adam超參數,以及生成器的指數移動平均。我們為CelebA-HQ跟FFHQ啟用鏡像增強,不過LSUN就關掉了。訓練時間在NVIDIA DGX-1搭配8張Telsa V100 GPUs上大概花了一個禮拜。 ::: :::info https://github.com/tkarras/progressive-growing-of-gans ::: :::info For our improved baseline (B in Table 1), we make several modifications to improve the overall result quality. We replace the nearest-neighbor up/downsampling in both networks with bilinear sampling, which we implement by low-pass filtering the activations with a separable 2^nd^ order binomial filter after each upsampling layer and before each downsampling layer [64]. We implement progressive growing the same way as Karras et al. [30], but we start from 8^2^ images instead of 4^2^. For the FFHQ dataset, we switch from WGAN-GP to the non-saturating loss [22] with R1 regularization [44] using $\gamma=10$. With R1 we found that the FID scores keep decreasing for considerably longer than with WGAN-GP, and we thus increase the training time from 12M to 25M images. We use the same learning rates as Karras et al. [30] for FFHQ, but we found that setting the learning rate to 0.002 instead of 0.003 for 512^2^ and 1024^2^ leads to better stability with CelebA-HQ. ::: :::success 為了提升我們的基線(Table 1中的B),我們做了一系列的調整來改善整體成果品質。兩個網路中,我們用bilinear sampling來取代nearest-neighbor up/downsampling,我們通過在每一次upsampling之後跟每一次downsampling之前利用可分離的二階二項式濾波器(2^nd^ order binomial filter)對啟動函數的值做低通濾波處理(low-pass filtering)。我們用跟Karras等人相同的方式實現漸近增長,不過我們是從8^2^的照片大小開始,而不是4^2^。對於FFHQ dataset,我們採用non-saturating loss,而不是WGAN-GP,其中$R_1$正規化使用$\gamma=10$。使用$R_1$讓我們發現到,FID分數明顯下降的比WGAN-GP還要長,因此我們增加訓練時間,從12M一路增加到25M的照片量。對FFHQ資料集,我們使用跟Karras等人相同的學習效率,不過我們發現到,對於CeleA-HQ資料來說,在解析度512^2^跟1024^2^的時候,將學習效率調整為0.002(不要0.003)的訓練效果會更為穩定。 ::: :::info For our style-based generator (F in Table 1), we use leaky ReLU [41] with $\alpha=0.2$ and equalized learning rate [30] for all layers. We use the same feature map counts in our convolution layers as Karras et al. [30]. Our mapping network consists of 8 fully- connected layers, and the dimensionality of all input and output activations— including $\mathbf{z}$ and $\mathbf{w}$ — is 512. 
We found that increasing the depth of the mapping network tends to make the training unstable with high learning rates. We thus reduce the learning rate by two orders of magnitude for the mapping network, i.e., $\lambda'=0.01\cdot\lambda$. We initialize all weights of the convolutional, fully-connected, and affine transform layers using $\mathcal{N}(0,1)$. The constant input in synthesis network is initialized to one. The biases and noise scaling factors are initialized to zero, except for the biases associated with ys that we initialize to one. ::: :::success 對於style-based的生成器(Table 1中的F),我們使用leaky ReLU($\alpha=0.2$),並且所有網路 增使使用equalized learning rate。捲積層的feature map的數量跟Karras等人一樣。映射網路包含8個全連結層,所有的輸入與輸出啟動函數的維度,包含$\mathbf{z}$跟$\mathbf{w}$都是512。我們發現到,增加映射網路的深度會導致在高學習效率情況下的訓練不穩定。因此,我們將映射網路的學習效率降低兩個量級,也就是$\lambda'=0.01\cdot\lambda$。我們用$\mathcal{N}(0,1)$來初始化所有權重,包含卷積、全連接以及仿射層。合層網路的常數輸入初始化為1。除了跟$\mathbf{y}_s$相關的偏差值初始化為1之外,其它的偏差值(bias)跟噪點(noise)縮放因子初始化為零。 ::: :::info The classifiers used by our separability metric (Section 4.2) have the same architecture as our discriminator except that minibatch standard deviation [30] is disabled. We use the learning rate of 10^-3^, minibatch size of 8, Adam optimizer, and training length of 150,000 images. The classifiers are trained independently of generators, and the same 40 classifiers, one for each CelebA attribute, are used for measuring the separability metric for all generators. We will release the pre-trained classifier networks so that our measurements can be reproduced. ::: :::success 我們的可分性指標(Section 4.2)所用的分類器跟我們的discriminator有相同的架構,除了沒有用minibatch standard deviation之外。我們使用10^-3^的學習效率,minibatch size為8,Adam optimizer,150,000張訓練照片。分類器的部份跟生成器分開訓練,然後所有的生成器都採用相同的40個分類器,每個CelebA attribute都一個,以此量測可分性指標。我們會發佈預訓練的分類器網路,這樣我們的量測結果就可以被重現。 ::: :::info We do not use batch normalization [29], spectral normalization [45], attention mechanisms [63], dropout [59], or pixelwise feature vector normalization [30] in our networks. ::: :::success 我們的網路並沒有使用batch normalization、spectral normalization、attention mechanisms、dropout、或是pixelwise feature vector normalization。 ::: ## D. Training convergence :::info Figure 9 shows how the FID and perceptual path length metrics evolve during the training of our configurations B and F with the FFHQ dataset. With $R_1$ regularization active in both configurations, FID continues to slowly decrease as the training progresses, motivating our choice to increase the training time from 12M images to 25M images. Even when the training has reached the full 1024^2^ resolution, the slowly rising path lengths indicate that the improvements in FID come at the cost of a more entangled representation. Considering future work, it is an interesting question whether this is unavoidable, or if it were possible to encourage shorter path lengths without compromising the convergence of FID. ::: :::success Figure 9給出使用FFHQ資料集在配置B跟F的訓練期間其FID跟感知路徑長度指標是如何演化的。兩個配置採用$R_1$ regularization的時候,FID會隨著訓練過程緩速下降,這也激起我們將照片量從12M增加到25M。儘管訓練最終來到完整的1024^2^的解析度,其緩慢上升的路徑長度也說明了,FID的改善是用一個更糾纏的表示為代價換來的。考慮到未來的研究,一個有趣的問題就是,這是否可以避免,又或者可以在不影響FID收斂性的情況下促成更短的路徑。 ::: :::info ![image](https://hackmd.io/_uploads/rkH5iuLlA.png) Figure 9. FID and perceptual path length metrics over the course of training in our configurations B and F using the FFHQ dataset. Horizontal axis denotes the number of training images seen by the discriminator. The dashed vertical line at 8.4M images marks the point when training has progressed to full 1024^2^ resolution. 
On the right, we show only one curve for the traditional generator’s path length measurements, because there is no discernible difference between full-path and endpoint sampling in $\mathcal{Z}$. ::: ## E. Other datasets :::info Figures 10, 11, and 12 show an uncurated set of results for LSUN [62] BEDROOM, CARS, and CATS, respectively. In these images we used the truncation trick from Appendix Bwith $\psi=0.7$ for resolutions 4^2^ − 32^2^. The accompanying video provides results for style mixing and stochastic variation tests. As can be seen therein, in case of BEDROOM the coarse styles basically control the viewpoint of the camera, middle styles select the particular furniture, and fine styles deal with colors and smaller details of materials. In CARS the effects are roughly similar. Stochastic variation affects primarily the fabrics in BEDROOM, backgrounds and headlamps in CARS, and fur, background, and interestingly, the positioning of paws in CATS. Somewhat surprisingly the wheels of a car never seem to rotate based on stochastic inputs. ::: :::success Figures 10、11跟12給出一系列未經篩選的結果,分別為LUSN BEDROOM、CARS與CATS。這些照片中,我們使用附錄B中的截斷技巧,對解析度4^2^到32^2^使用$\psi=0.7$。搭配的影片進一步的說明風格混合與隨機變化測試的測試結果。可以看的到,在BEDROOM的案例中,coarse styles基本控制了相機的焦點,middle styles選擇了特定的家具,fine styles處理材質的顏色與小細節。在CARS中的影響的大致相同。隨機變化主要影響BEDROOM中的紡織品,CARS中的背景與車前燈,CATS中毛髮與背景,有趣的是,爪子的位置。讓人驚訝的是,車子的輪胎似乎沒有因為隨機輸入而有所轉動。 ::: :::info ![image](https://hackmd.io/_uploads/rklJh_Ix0.png) Figure 10. Uncurated set of images produced by our style-based generator (config F) with the LSUN BEDROOM dataset at 256^2^ .FID computed for 50K images was 2.65 ::: :::info ![image](https://hackmd.io/_uploads/rJWZ3OLgR.png) Figure 11. Uncurated set of images produced by our style-based generator (config F) with the LSUN CAR dataset at 512 × 384.FID computed for 50K images was 3.27. ::: :::info ![image](https://hackmd.io/_uploads/rkyrh_8x0.png) Figure 12. Uncurated set of images produced by our style-based generator (config F) with the LSUN CAT dataset at 256^2^ . FID computed for 50K images was 8.53. ::: :::info These datasets were trained using the same setup as FFHQ for the duration of 70M images for BEDROOM and CATS, and 46M for CARS. We suspect that the results for BEDROOM are starting to approach the limits of the training data, as in many images the most objectionable issues are the severe compression artifacts that have been inherited from the low-quality training data. CARS has much higher quality training data that also allows higher spatial resolution (512 × 384 instead of 256^2^), and CATS continues to be a difficult dataset due to the high intrinsic variation in poses, zoom levels, and backgrounds. ::: :::success 這些照片用著跟FFHQ一樣的設置訓練,BEDROOM跟CATS用了70M,而CARS用了46M的照片。我們推測對於BEDROOM的結果來說已經開始近似訓練資料的極限,因為在很多照片中,最讓人討厭的問題就是繼承自低品質訓練資料中的嚴重壓縮的[瑕疵](https://terms.naer.edu.tw/detail/ce1251741a9f7d67bcb4632c6ad886af/)。CARS有較高的訓練資料品質,所以就有著較高的空間解析度(512x384,而不是256^2^),而CATS則是因為高度的內在變化,像是姿勢、縮放程度以及背景,所以目前為止仍然是一個具挑戰性的資料集。 :::
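
:::warning
譯者補充(非論文內容):Appendix B 的 truncation trick($\mathbf{w}'=\bar{\mathbf{w}}+\psi(\mathbf{w}-\bar{\mathbf{w}})$,且只套用在低解析度的層)大致可以寫成下面這樣的 NumPy 示意。`mapping` 是佔位函數,實務上 $\bar{\mathbf{w}}$ 是對大量 $\mathbf{z}$ 取映射網路輸出的平均;`cutoff=8` 對應 $4^2$–$32^2$ 這四個解析度各兩層,是譯者依 Figure 2 設定所做的假設。

```python
import numpy as np

def truncate_w(w, w_avg, psi=0.7, num_layers=18, cutoff=8):
    """w' = w_avg + psi * (w - w_avg),只對前 cutoff 層(低解析度)做截斷。"""
    w_trunc = w_avg + psi * (w - w_avg)
    return [w_trunc if i < cutoff else w for i in range(num_layers)]

rng = np.random.default_rng(0)
mapping = lambda z: np.tanh(z)                 # 佔位:實際上是訓練好的映射網路 f
w_avg = np.mean([mapping(rng.standard_normal(512)) for _ in range(10000)], axis=0)  # 質心的估計
w = mapping(rng.standard_normal(512))

per_layer_w = truncate_w(w, w_avg, psi=0.7)
# 截斷後的 w 比原本的 w 更靠近質心(也就是更接近「平均臉」)
print(np.linalg.norm(per_layer_w[0] - w_avg) < np.linalg.norm(w - w_avg))
```
:::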