# Analyzing and Improving the Image Quality of StyleGAN ###### tags:`論文翻譯` `deeplearning` [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 個人註解,任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/pdf/1912.04958.pdf) ::: ## Abstract :::info The style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling. We expose and analyze several of its characteristic artifacts, and propose changes in both model architecture and training methods to address them. In particular, we redesign the generator normalization, revisit progressive growing, and regularize the generator to encourage good conditioning in the mapping from latent codes to images. In addition to improving image quality, this path length regularizer yields the additional benefit that the generator becomes significantly easier to invert. This makes it possible to reliably attribute a generated image to a particular network. We furthermore visualize how well the generator utilizes its output resolution, and identify a capacity problem, motivating us to train larger models for additional quality improvements. Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics as well as perceived image quality. ::: :::success style-based GAN architecture (StyleGAN)在資料驅動無條件生成影像模型中有著最棒棒的成果。我們揭露並分析其多種特有的[瑕疵](https://terms.naer.edu.tw/detail/a97d7b2e880be99e7c0fa594bf9ac21f/),也提出模型架構與訓練方法的調整來解決這些問題。特別是,我們重新設計了生成器的正規化,重新審視漸進式增長(progressive growing),並對生成器做正規化,鼓勵從latent codes到影像的映射有著良好的條件(conditioning)。除了提升照片品質之外,這種path length regularizer帶來額外的好處,也就是生成器明顯變的更容易反轉。這使得我們能夠可靠地將生成影像歸因於一個特定網路。此外,我們可視化了生成器如何的利用其輸出解析度,確定一個容量問題(capacity problem),促使我們為了得到更多品質的提升而訓練更大型的模型。總的來說,我們改進的模型重新定義在unconditional image modeling界的最新技術,無論是在已知的分佈品質指標還是感知影像品質上都是如此。 ::: ## 1. Introduction :::info The resolution and quality of images produced by generative methods, especially generative adversarial networks (GAN) [16], are improving rapidly [23, 31, 5]. The current state-of-the-art method for high-resolution image synthesis is StyleGAN [24], which has been shown to work reliably on a variety of datasets. Our work focuses on fixing its characteristic artifacts and improving the result quality further. ::: :::success 生成方法(特別是生成對抗網路(GAN))所產生的照片的解析度跟品質都急速的增長中。目前最棒棒的高解析度影像合成方法是StyleGAN,它已經被證明在各種資料集上都可以可靠地運行。我們的研究關注在修復它特有的瑕疵(characteristic artifacts),並進一步的提升其所產生的品質。 ::: :::info The distinguishing feature of StyleGAN [24] is its unconventional generator architecture. Instead of feeding the input latent code $\mathbf{z} \in \mathcal{Z}$ only to the beginning of the network, the mapping network $f$ first transforms it to an intermediate latent code $\mathbf{w} \in \mathcal{W}$. Affine transforms then produce styles that control the layers of the synthesis network $g$ via adaptive instance normalization (AdaIN) [21, 9, 13, 8]. Additionally, stochastic variation is facilitated by providing additional random noise maps to the synthesis network. It has been demonstrated [24, 38] that this design allows the intermediate latent space $\mathcal{W}$ to be much less entangled than the input latent space $\mathcal{Z}$. In this paper, we focus all analysis solely on $\mathcal{W}$, as it is the relevant latent space from the synthesis network’s point of view.
::: :::success StyleGAN的特別在於它的非典型生成器架構。不是單純的餵input latent code $\mathbf{z} \in \mathcal{Z}$做為神經網路的開始,而是首先由映射網路$f$將input latent code轉為intermediate latent code $\mathbf{w} \in \mathcal{W}$。然後,仿射轉換(affine transforms)產生風格,這些風格透過adaptive instance normalization (AdaIN)控制合併網路$g$的網路層。此外,透過對合併網路提供額外的隨機噪點來促進隨機變化。已經證明了,這樣的設計可以讓中間潛在空間$\mathcal{W}$比之輸入潛在空間$\mathcal{Z}$有著更少的糾纏。這個論文中,我們所有的分析都關注在$\mathcal{W}$,因為它從合成網路的視角來看是一個相關的潛在空間。 ::: :::info Many observers have noticed characteristic artifacts in images generated by StyleGAN [3]. We identify two causes for these artifacts, and describe changes in architecture and training methods that eliminate them. First, we investigate the origin of common blob-like artifacts, and find that the generator creates them to circumvent a design flaw in its architecture. In Section 2, we redesign the normalization used in the generator, which removes the artifacts. Second, we analyze artifacts related to progressive growing [23] that has been highly successful in stabilizing high-resolution GAN training. We propose an alternative design that achieves the same goal — training starts by focusing on low-resolution images and then progressively shifts focus to higher and higher resolutions — without changing the network topology during training. This new design also allows us to reason about the effective resolution of the generated images, which turns out to be lower than expected, motivating a capacity increase (Section 4). ::: :::success 許多觀察者都注意到StyleGAN所生成的照片中所特有的瑕疵。對於這些瑕疵我們確認兩個原因,並說明為了消除它在而在架構與訓練方法上的改變。首先,我們調查常見的斑點瑕疵的來源,然後發現到,生成器建立這些斑點是因為要規避其架構中的設計缺陷。在Section 2中,我們重新設計了生成器中所使用的正規化,這移除了這類瑕疵。再來就是,我們分析了與漸近式增長相關的瑕疵,其於穩定高解析度GAN訓練是相當成功的。我們提出一個替代設計來得到相同的目標,訓練開始的時候專注在低解析度照片上,然後慢慢的移動到愈來愈高的解析度上,並且在訓練期間不改變網路拓撲。這個新的設計仍然可以讓我們推論出生成照片的有效解析度,不過其結果比預期還要來的低,從而推動我們做容量的擴脴(Section 4)。 ::: :::info Quantitative analysis of the quality of images produced using generative methods continues to be a challenging topic. Frechet inception distance (FID) [20] measures differences in the density of two distributions in the highdimensional feature space of an InceptionV3 classifier [39]. Precision and Recall (P&R) [36, 27] provide additional visibility by explicitly quantifying the percentage of generated images that are similar to training data and the percentage of training data that can be generated, respectively. We use these metrics to quantify the improvements. ::: :::success 使用生成方法所生成的照片品質定量分析仍然是一個有挑戰性的議題。Frechet inception distance (FID) 量測於InceptionV3分類器的高維度特徵空間中兩個分佈的密度差異。Precision與Recal(P&R)分別通過顯示量化生成的圖像與訓練資料的相似百分比和可生成的訓練資料百分比來提供額外的可見性。我們使用這些指標來量化模型的提升。 ::: :::info Both FID and P&R are based on classifier networks that have recently been shown to focus on textures rather than shapes [12], and consequently, the metrics do not accurately capture all aspects of image quality. We observe that the perceptual path length (PPL) metric [24], originally introduced as a method for estimating the quality of latent space interpolations, correlates with consistency and stability of shapes. Based on this, we regularize the synthesis network to favor smooth mappings (Section 3) and achieve a clear improvement in quality. To counter its computational expense, we also propose executing all regularizations less frequently, observing that this can be done without compromising effectiveness. 
::: :::success FID跟P&R都是基於分類器網路,這些分類器網路近來已經被證明更關注於紋理而非形狀,因此,這些指標並不能準確地捕捉照片品質的所有方方面面。我們觀察到,perceptual path length (PPL)(感知路徑長度)指標最一開始是用來估測潛在空間插值的品質,與形狀的一致性與穩定性相關。基於這一點,我們對合成網路做正規化,使其傾向於平滑映射(Section 3),並且在品質上得到了明顯的提升。為了抵銷其計算成本,我們還建議以較低的頻率執行正規化,並且觀察到這是可以在不影響效果的情況下做到。 ::: :::info Finally, we find that projection of images to the latent space W works significantly better with the new, path-length regularized StyleGAN2 generator than with the original StyleGAN. This makes it easier to attribute a generated image to its source (Section 5). ::: :::success 最後,我們發現到使用新的方法,也就是有著路徑正規化的StyleGAN2生成器將照片投影到潛在空間$\mathcal{W}$的效果明顯比原始的StyleGAN還要來的好。這使得將生成的照片歸因於其來源變得更容易(Section 5)。 ::: :::info Our implementation and trained models are available at https://github.com/NVlabs/stylegan2 ::: :::success 大密寶就在https://github.com/NVlabs/stylegan2 ,去拿吧。 ::: ## 2. Removing normalization artifacts :::info We begin by observing that most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets. As shown in Figure 1, even when the droplet may not be obvious in the final image, it is present in the intermediate feature maps of the generator. The anomaly starts to appear around 64×64 resolution, is present in all feature maps, and becomes progressively stronger at higher resolutions. The existence of such a consistent artifact is puzzling, as the discriminator should be able to detect it. ::: :::success 我們首先觀察到,多數由StyleGAN所生成的照片都呈現出特有的像水滴那樣的斑點瑕疵。如Figure 1所示,儘管在最終的照片中並不明顯,它也會出現在生成器的中間特徵圖中(intermediate feature maps)。這種異常在解析度64x64左右開始出現,所有的feature maps都有,而且會在高解析度情況下逐漸增強。這種一致性的瑕疵讓人感到毛毛滴,因為discriminator應該要能夠偵測到它才對。 ::: :::info ![image](https://hackmd.io/_uploads/HJxRZsX-A.png) Figure 1. Instance normalization causes water droplet-like artifacts in StyleGAN images. These are not always obvious in the generated images, but if we look at the activations inside the generator network, the problem is always there, in all feature maps starting from the 64x64 resolution. It is a systemic problem that plagues all StyleGAN images. ::: :::info We pinpoint the problem to the AdaIN operation that normalizes the mean and variance of each feature map separately, thereby potentially destroying any information found in the magnitudes of the features relative to each other. We hypothesize that the droplet artifact is a result of the generator intentionally sneaking signal strength information past instance normalization: by creating a strong, localized spike that dominates the statistics, the generator can effectively scale the signal as it likes elsewhere. Our hypothesis is supported by the finding that when the normalization step is removed from the generator, as detailed below, the droplet artifacts disappear completely. ::: :::success 我們確定問題就是AdaIN operation,這操作對每個feature map各別地正規化其均值與變異數,因而可能破壞特徵彼此之間相對大小(magnitudes)所攜帶的任何信息。我們假設,這水滴瑕疵是生成器故意將信號強度信息偷渡繞過instance normalization的結果:透過建立強而局部化的spike([尖波](https://terms.naer.edu.tw/detail/8ee410da6a27c655e5d5e9f8782ced9e/))來主導統計量,生成器可以有效地按其所好縮放信號。我們的假設得到下面發現的支持,也就是,當我們從生成器中移除正規化步驟之後,如下詳述,這個水滴瑕疵就完全滴消除了。 ::: ### 2.1. Generator architecture revisited :::info We will first revise several details of the StyleGAN generator to better facilitate our redesigned normalization. These changes have either a neutral or small positive effect on their own in terms of quality metrics.
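:::warning
個人補充(編輯註):下面是一個以 NumPy 寫的極簡 AdaIN 示意(非論文官方程式碼),用來說明上文「對每個 feature map 各別正規化均值與變異數」為什麼會抹掉特徵彼此之間的相對大小;函數與張量形狀皆為假設,僅供理解用。
```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-8):
    """極簡 AdaIN 示意:x 形狀為 (N, C, H, W);
    style_scale、style_bias 形狀為 (N, C),假設由 w 經仿射變換而得。"""
    mu = x.mean(axis=(2, 3), keepdims=True)       # 每個樣本、每個 feature map 各自的均值
    sigma = x.std(axis=(2, 3), keepdims=True)     # 以及各自的標準差
    x_norm = (x - mu) / (sigma + eps)             # 正規化後,feature map 之間的相對大小資訊被抹掉
    return style_scale[:, :, None, None] * x_norm + style_bias[:, :, None, None]

# 簡單驗證:兩個 feature map,其中一個振幅大 10 倍,AdaIN 之後兩者尺度變得一樣
x = np.random.randn(1, 2, 4, 4)
x[:, 1] *= 10.0
out = adain(x, np.ones((1, 2)), np.zeros((1, 2)))
print(out[:, 0].std(), out[:, 1].std())
```
:::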
::: :::success 我們要先來調調StyleGAN generator幾個細節,這有利於我們更新設計的正規化。這些調整在其品質指標方面有著一咪咪的正向影響。 ::: :::info Figure 2a shows the original StyleGAN synthesis network $g$ [24], and in Figure 2b we expand the diagram to full detail by showing the weights and biases and breaking the AdaIN operation to its two constituent parts: normalization and modulation. This allows us to re-draw the conceptual gray boxes so that each box indicates the part of the network where one style is active (i.e., “style block”). Interestingly, the original StyleGAN applies bias and noise within the style block, causing their relative impact to be inversely proportional to the current style’s magnitudes. We observe that more predictable results are obtained by moving these operations outside the style block, where they operate on normalized data. Furthermore, we notice that after this change it is sufficient for the normalization and modulation to operate on the standard deviation alone (i.e., the mean is not needed). The application of bias, noise, and normalization to the constant input can also be safely removed without observable drawbacks. This variant is shown in Figure 2c, and serves as a starting point for our redesigned normalization. ::: :::success Figure 2a說明的是原始的StyleGAN合成網路$g$,Figure 2b的部份,我們透過呈現權重與偏差,以及將AdaIN operation拆解成兩個部份:正規化與[調變](https://terms.naer.edu.tw/detail/d82f7c8f00aec6922f4bbe1a1201d986/)的方式將整個細節展開。這讓我們可以重新繪製conceptual gray boxes(概念性的灰框?),這樣,每個框就能指出其一個風格處理有效狀態的網路部份(即"style block")。有趣的是,原始的StyleGAN在style block內使用偏差(biae)與噪點(noise),這導致了它們的相對影響與當下的風格大小成反比。我們觀察到了,透過將這些操作移到style block外(在正規化過的資料上操作),我們可以得到更加可預測的結果。此外,我們注意到,這樣調整之後,就只需要用標準差來做正規化與調變就夠了,不需要均值。對於constant input的偏差、噪點以及正規化的應用也能夠在未觀察到缺點的情況下放心的移除。這個變體如Figure 2c所示,而且這是我們重新設計正規化的開始。 ::: :::info ![image](https://hackmd.io/_uploads/SktO1gKZA.png) Figure 2. We redesign the architecture of the StyleGAN synthesis network. (a) The original StyleGAN, where A denotes a learned affine transform from $\mathcal{W}$ that produces a style and B is a noise broadcast operation. (b) The same diagram with full detail. Here we have broken the AdaIN to explicit normalization followed by modulation, both operating on the mean and standard deviation per feature map. We have also annotated the learned weights ($\mathcal{w}$), biases ($\mathcal{b}$), and constant input ($\mathcal{c}$), and redrawn the gray boxes so that one style is active per box. The activation function (leaky ReLU) is always applied right after adding the bias. (c) We make several changes to the original architecture that are justified in the main text. We remove some redundant operations at the beginning, move the addition of $\mathcal{b}$ and B to be outside active area of a style, and adjust only the standard deviation per feature map. (d) The revised architecture enables us to replace instance normalization with a “demodulation” operation, which we apply to the weights associated with each convolution layer. Figure 2. 
我們重新設計StyleGAN synthesis network的架構。(a)原始的StyleGAN,其中A表示從$\mathcal{W}$中學習到的仿射變換(affine transform),主要是產生風格,然後B則是噪點擴播(broadcast)的運算。(b)同一張,不過是有完成的細部說明。這邊我們把AdaIN拆解成明確的正規化,然後就是調變(modulation),這兩個計算都是基於每個feature map的均值與標準差,我們還標記了學習到的權重$\mathcal{w}$與偏差$\mathcal{b}$,以及constant input $\mathcal{c}$,並且重新繪製灰色的框框,這樣一個框框就對應一個風格(style)。啟動函數(leaky ReLU)都是在加入偏差之後執行。(c)我們對原始架構做了多種改變,這些在主文中都有說明。我們移除了一些開頭冗餘的操作,把與$\mathcal{b}$跟B的相加移到風格的活動區域外(灰色框框外),並且調整成對每個feature map單純以標準差計算正規化。(d)修訂之後的架構讓我們可以用"demodulation"([解調](https://terms.naer.edu.tw/detail/814a26b33d2cd185dc7a17e25ce61e51/))的操作來取代掉instance normalization,我們把這個計算應用到每個卷積層的相關權重上。 ::: ### 2.2. Instance normalization revisited :::info One of the main strengths of StyleGAN is the ability to control the generated images via style mixing, i.e., by feeding a different latent $\mathbf{w}$ to different layers at inference time. In practice, style modulation may amplify certain feature maps by an order of magnitude or more. For style mixing to work, we must explicitly counteract this amplification on a per-sample basis — otherwise the subsequent layers would not be able to operate on the data in a meaningful way. ::: :::success StyleGAN的一個主要強項就是能夠透過風格混合來控制生成的影像,也就是在推論時間餵給不同的網路層不同的latent $\mathbf{w}$。實務上,風格調變也許會放大某些featuer maps也許一個量級或是更大。為了讓風格混合能夠順利作業,我們必需明確地抵銷掉這種放大效果,而且是per-sample basis,不然吼,後續的網路層就沒有辦法以一種有意義的方式對資料操作。 ::: :::info If we were willing to sacrifice scale-specific controls (see video), we could simply remove the normalization, thus removing the artifacts and also improving FID slightly [27]. We will now propose a better alternative that removes the artifacts while retaining full controllability. The main idea is to base normalization on the expected statistics of the incoming feature maps, but without explicit forcing. ::: :::success 如果我們願意犧牲特定尺度的控制(見影片),那我們就可以很輕易的移除掉正規化,從而去除缺陷,還可以些許的提高FID。我們現在要來提出一個替代方案,這可以在移除缺陷的同時維持完整的可控制性。主要的概念就是在incoming feature maps的期望統計信息上做正規化,但沒有明確強制。 ::: :::info Recall that a style block in Figure 2c consists of modulation, convolution, and normalization. Let us start by considering the effect of a modulation followed by a convolution. The modulation scales each input feature map of the convolution based on the incoming style, which can alternatively be implemented by scaling the convolution weights: $$ \mathcal{w'}_{ijk} = s_i \cdot \mathcal{w}_{ijk} \tag{1} $$ where $\mathcal{w}$ and $\mathcal{w'}$ are the original and modulated weights, respectively, si is the scale corresponding to the $i$th input feature map, and $j$ and $k$ enumerate the output feature maps and spatial footprint of the convolution, respectively. ::: :::success 回想一下,在Figure 2c中的style block包含調變、卷積與正規化。讓我們考慮從調變然後是卷積的影響開始。調變縮放卷積的每個input feature map,這個縮放是基於傳入的風格(incoming style),這也可以用縮放卷積權重來替代: $$ \mathcal{w'}_{ijk} = s_i \cdot \mathcal{w}_{ijk} \tag{1} $$ 其中$\mathcal{w}$跟$\mathcal{w'}$代表原始權重與調變後的權重,$s_i$則是對應第$i$個input feature map的縮放,然後$j$跟$k$分別為枚舉output featuer maps與卷積的空間足跡(spatial footprint)。 ::: :::info Now, the purpose of instance normalization is to essentially remove the effect of s from the statistics of the convolution’s output feature maps. We observe that this goal can be achieved more directly. Let us assume that the input activations are i.i.d. random variables with unit standard deviation. After modulation and convolution, the output activations have standard deviation of ![image](https://hackmd.io/_uploads/SJwf32hWC.png) i.e., the outputs are scaled by the $L_2$ norm of the corresponding weights. 
The subsequent normalization aims to restore the outputs back to unit standard deviation. Based on Equation 2, this is achieved if we scale (“demodulate”) each output feature map $j$ by $1/\sigma_j$ . Alternatively, we can again bake this into the convolution weights: ![image](https://hackmd.io/_uploads/HJa1yTnZR.png) where $\epsilon$ is a small constant to avoid numerical issues. ::: :::success 現在,instance normalization的效果本質上就是從卷積的output feature maps的統計信息中去除$s$的影響。我們觀察到,這個目標可以更直接地實現。讓我們假設input activations為i.i.d的隨機變數,並且有單元標準差。在調變、卷積之後,其output activations有著標準差為 ![image](https://hackmd.io/_uploads/S1VMhhnZR.png) 也就是說,輸出(outputs)被對應權重的$L_2$正規化縮放著。接著的正規化的目標就是把output還原為單元標準差。根據方程式2,只要我們以$1/\sigma_j$縮放("解調")每個output feature map $j$,那就可以實現。或者,我們可以再一次的讓它跟卷積權重無法分離: ![image](https://hackmd.io/_uploads/HJa1yTnZR.png) 其中$\epsilon$是一個很小的常數,這是為了避免數值問題的常見手法。 ::: :::warning 這個方程式的平方位置實在不知道怎麼弄出來,如果有路過的前輩可以指導的話再請告知 ::: :::info We have now baked the entire style block to a single convolution layer whose weights are adjusted based on s using Equations 1 and 3 (Figure 2d). Compared to instance normalization, our demodulation technique is weaker because it is based on statistical assumptions about the signal instead of actual contents of the feature maps. Similar statistical analysis has been extensively used in modern network initializers [14, 19], but we are not aware of it being previously used as a replacement for data-dependent normalization. Our demodulation is also related to weight normalization [37] that performs the same calculation as a part of reparameterizing the weight tensor. Prior work has identified weight normalization as beneficial in the context of GAN training [43]. ::: :::success 我們現在已經將整個style block弄成一個卷積層,其權重的調整是基於$s$使用方程式1與3(Figure 2d)。對比instance normalization,我們的解調技術比較沒那麼厲害,因為它是基於關於信號的統計信息假設,而不是feature maps的實際內容。類似於現代網路初始化器(initializers)中已經被廣泛使用的統計信息分析,不過我們並不清楚其先前的應用是否是做為data-dependent normalization的替代就是。我們的解調也跟權重正規化相關,其執行跟重新參數化權重張量的一部份相同的計算。先前的研究已經確定權重正規化在GAN訓練的上下文中是有好處的。 ::: :::info Our new design removes the characteristic artifacts (Figure 3) while retaining full controllability, as demonstrated in the accompanying video. FID remains largely unaffected (Table 1, rows A, B), but there is a notable shift from precision to recall. We argue that this is generally desirable, since recall can be traded into precision via truncation, whereas the opposite is not true [27]. In practice our design can be implemented efficiently using grouped convolutions, as detailed in Appendix B. To avoid having to account for the activation function in Equation 3, we scale our activation functions so that they retain the expected signal variance. ::: :::success 我們的新設計移除了特有的缺陷(Figure 3),同時保留完整的可控性,相關說明可見隨書附上的片片。FID基本是不受影響的(Table 1、rows A、B),不過從精度到召回率有個顯著的變化。我們認為是合理的,因為召回率可以透過truncation轉為精度,反之則不然。實務上我們的設計可以用grouped convolutions高效地實現,相關細節見Appendix B。為了避免在方程式3中考慮activation function,我們縮放了activation function,以便可以保留期望的信號變異數(signal variance)。 ::: :::info ![image](https://hackmd.io/_uploads/BJAjH6hWC.png) Table 1. Main results. For each training run, we selected the training snapshot with the lowest FID. We computed each metric 10 times with different random seeds and report their average. Path length corresponds to the PPL metric, computed based on path endpoints in W [24], without the central crop used by Karras et al. [24]. The FFHQ dataset contains 70k images, and the discriminator saw 25M images during training. For LSUN CAR the numbers were 893k and 57M. 
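:::warning
個人補充(編輯註):上面的警告區塊提到不確定方程式的平方該如何排版,依論文與上下文的定義,方程式2與3應該可以寫成下面的 LaTeX;若有誤再請指正。
$$
\sigma_j = \sqrt{\sum_{i,k}\left(\mathcal{w'}_{ijk}\right)^2} \tag{2}
$$
$$
\mathcal{w''}_{ijk} = \frac{\mathcal{w'}_{ijk}}{\sqrt{\sum_{i,k}\left(\mathcal{w'}_{ijk}\right)^2 + \epsilon}} \tag{3}
$$
也就是說,方程式2是每個output feature map $j$ 所對應權重的$L_2$ norm,而方程式3則是把$1/\sigma_j$直接「烘進」卷積權重裡。
:::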
↑ indicates that higher is better, and ↓ that lower is better. Table 1. 主要的成果。對於每個訓練執行,我們選擇最低FID的訓練快照。我們以不同的隨機種子計算每個指標10次,然後求其平均。路徑長度對應於PPL指標,基於$\mathcal{W}$中的路徑端點(path endpoints)計算,並沒有使用Karras等人所用的中心剪裁(central crop)。FFHQ資料集包含70k的照片,discriminator在訓練期間見過25M的照片。LSUN Car則是893k的照片,discriminator在訓練期間見過57M的照片。↑代表愈高愈好,↓代表愈低愈好。 ::: :::info ![image](https://hackmd.io/_uploads/H1i6Xa3W0.png) Figure 3. Replacing normalization with demodulation removes the characteristic artifacts from images and activations. ::: ## 3. Image quality and generator smoothness :::info While GAN metrics such as FID or Precision and Recall (P&R) successfully capture many aspects of the generator, they continue to have somewhat of a blind spot for image quality. For an example, refer to Figures 13 and 14 that contrast generators with identical FID and P&R scores but markedly different overall quality.^2^ ::: :::success 雖然GAN的指標像是FID或是精度與召回率(P&R)成功地補捉到生成器的很多方方面面,不過它們在影像的品質上仍然有一些盲點。像是Figure 13與14,相同的FID與P&R分數,但總體品質仍然有明顯差異。 ::: :::info ^2^We believe that the key to the apparent inconsistency lies in the particular choice of feature space rather than the foundations of FID or P&R. It was recently discovered that classifiers trained using ImageNet [35] tend to base their decisions much more on texture than shape [12], while humans strongly focus on shape [28]. This is relevant in our context because FID and P&R use high-level features from InceptionV3 [39] and VGG-16 [39], respectively, which were trained in this way and are thus expected to be biased towards texture detection. As such, images with, e.g., strong cat textures may appear more similar to each other than a human observer would agree, thus partially compromising density-based metrics (FID) and manifold coverage metrics (P&R). ::: :::success 我們認為,這明顯的不一致的關鍵在於特徵空間(feature space)的特定選擇,而不在於FID與P&R的基礎。近來有發現到,使用ImageNet訓練出來的分類器在決策的時候會傾向於基於紋理更勝於形狀,儘管人類是更強烈的關注在形狀上面。這在我們的情況中是有相關的,因為FID跟P&R各自使用來自InceptionV3與VGG-16的高階特徵,這兩個模型就是以這樣的方式訓練的,因此可以預期是會往紋理偵測的方向偏過去的。所以吼,對於那些具有明顯貓貓紋理的圖像來說,它們彼此之間的相似度可能會被人工智慧模型高估,高於人類觀察者認為它們應該相似的程度,從而部份損壞了density-based metrics (FID)與manifold coverage metrics (P&R)。 ::: :::info ![image](https://hackmd.io/_uploads/HkJfjClf0.png) Figure 13. Uncurated examples from two generative models trained on LSUN CAT without truncation. FID, precision, and recall are similar for models 1 and 2, even though the latter produces cat-shaped objects more often. Perceptual path length (PPL) indicates a clear preference for model 2. Model 1 corresponds to configuration A in Table 3, and model 2 is an early training snapshot of configuration F. Figure 13. 來自於兩個訓練於LSUN CAT的生成模型(未使用truncation)亂挑的範例。models 1與2的FID、精度與召回率是相似的,儘管後者更常生成出貓貓形狀的物體。Perceptual path length (PPL)指出對model 2明顯的愛好。model 1對應Table 3中的配置A,model2則是配置F的early training snapshot。 ::: :::warning curated有精心策劃的意思,所以uncurated就是未經策劃,在下就直翻成亂挑 ::: :::info ![image](https://hackmd.io/_uploads/SJXrjClfA.png) Figure 14. Uncurated examples from two generative models trained on LSUN CAR without truncation. FID, precision, and recall are similar for models 1 and 2, even though the latter produces car-shaped objects more often. Perceptual path length (PPL) indicates a clear preference for model 2. Model 1 corresponds to configuration A in Table 3, and model 2 is an early training snapshot of configuration F. ::: :::info ![image](https://hackmd.io/_uploads/BJft9wvzC.png) Table 3. Improvement in LSUN datasets measured using FID and PPL. We trained CAR for 57M images, CAT for 88M, CHURCH for 48M, and HORSE for 100M images. 
::: :::info We observe a correlation between perceived image quality and perceptual path length (PPL) [24], a metric that was originally introduced for quantifying the smoothness of the mapping from a latent space to the output image by measuring average LPIPS distances [50] between generated images under small perturbations in latent space. Again consulting Figures 13 and 14, a smaller PPL (smoother generator mapping) appears to correlate with higher overall image quality, whereas other metrics are blind to the change. Figure 4 examines this correlation more closely through per-image PPL scores on LSUN CAT, computed by sampling the latent space around $\mathbf{w}\sim f(\mathbf{z})$. Low scores are indeed indicative of high-quality images, and vice versa. Figure 5a shows the corresponding histogram and reveals the long tail of the distribution. The overall PPL for the model is simply the expected value of these per-image PPL scores. We always compute PPL for the entire image, as opposed to Karras et al. [24] who use a smaller central crop. ::: :::success 我們觀察到感知到的影像品質(perceived image quality)與感知路徑長度(perceptual path length)之間的關聯性,PPL這個指標最一開始是為了量化從潛在空間映射到輸出影像的平滑度而引入的(計算在潛在空間中微小[擾動](https://terms.naer.edu.tw/detail/47a9c68d4cf1b130b97626023168d91f/)情況下所生成的照片之間的平均LPIPS距離)。再一次的參考Figure 13與Figure 14,一個比較小的PPL(smoother generator mapping)似乎跟更高的總體影像品質相關,而其它指標對這個變化視而不見。Figure 4利用LSUN CAT的每一張影像的PPL分數更仔細的檢查這種相關性,其計算是利用潛在空間周圍$\mathbf{w}\sim f(\mathbf{z})$採樣計算而得。較低的分數確實表明了照片是高品質的,反之亦然。Figure 5a給出相對應的直方圖,並且顯示出分佈的長尾。模型的整體PPL就只是這些每張影像的PPL分數的期望值。我們始終會對完整的影像計算PPL,跟Karras等人所使用的較小的中間剪裁是不一樣的。 ::: :::info ![image](https://hackmd.io/_uploads/SJBDJdDGC.png) Figure 4. Connection between perceptual path length and image quality using baseline StyleGAN (config A) with LSUN CAT. (a) Random examples with low PPL ($\leq 10$^th^ percentile). (b) Examples with high PPL ($\geq 90$^th^ percentile). There is a clear correlation between PPL scores and semantic consistency of the images. Figure 4. 感知路徑長度(PPL)與影像品質(使用基線StyleGAN(配置A)搭配LSUN CAT)之間的連結。(a)低PPL的隨機樣本($\leq 10$^th^ percentile)。(b)高PPL的隨機樣本($\geq 90$^th^ percentile)。PPL分數與影像的語意一致性之間有著很明顯的相關性。 ::: :::info ![image](https://hackmd.io/_uploads/rye8luwfA.png) Figure 5. (a) Distribution of PPL scores of individual images generated using baseline StyleGAN (config A) with LSUN CAT (FID = 8.53, PPL = 924). The percentile ranges corresponding to Figure 4 are highlighted in orange. (b) StyleGAN2 (config F) improves the PPL distribution considerably (showing a snapshot with the same FID = 8.53, PPL = 387). Figure 5. (a)使用基線StyleGAN(配置A)搭配LSUN CAT所生成的單一影像的PPL分數的分佈(FID=8.52,PPL=924)。對應Figure 4的百分位數範圍以橘色來突顯。(b)StyleGAN2(配置F)明顯改善PPL分佈(呈現出有著相同FID=8.53、PPL=387的快照)。 ::: :::info It is not immediately obvious why a low PPL should correlate with image quality. We hypothesize that during training, as the discriminator penalizes broken images, the most direct way for the generator to improve is to effectively stretch the region of latent space that yields good images. This would lead to the low-quality images being squeezed into small latent space regions of rapid change. While this improves the average output quality in the short term, the accumulating distortions impair the training dynamics and consequently the final image quality. 
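:::warning
個人補充(編輯註):為了幫助理解上面 per-image PPL 的計算流程,下面給一個 Python 簡化示意(非官方實作)。其中 mapping、synthesis、lpips_distance 都是假設已存在的函數;實際的 PPL 以 VGG 為基礎的 LPIPS 距離計算,官方實作中 $\epsilon$ 取 $10^{-4}$,且距離要除以 $\epsilon^2$。
```python
import numpy as np

def per_image_ppl(mapping, synthesis, lpips_distance, z_dim=512, epsilon=1e-4):
    """單一樣本的 PPL 估計:在 W 空間中沿線性插值方向做 epsilon 的微小擾動,
    量測兩張生成影像的 LPIPS 距離後除以 epsilon 平方。"""
    z0, z1 = np.random.randn(z_dim), np.random.randn(z_dim)
    w0, w1 = mapping(z0), mapping(z1)
    t = np.random.uniform(0.0, 1.0)
    w_a = w0 + (w1 - w0) * t              # W 空間用 lerp(Z 空間才需要 slerp)
    w_b = w0 + (w1 - w0) * (t + epsilon)
    img_a, img_b = synthesis(w_a), synthesis(w_b)
    return lpips_distance(img_a, img_b) / (epsilon ** 2)

# 模型整體的 PPL 即為大量樣本之 per-image PPL 的平均(期望值);
# 本文 Table 1 是以路徑端點(t 取 0 或 1)計算,且使用完整影像、不做中心剪裁。
```
:::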
::: :::success 關於為什麼低的PPL會跟照片品質有關這一點實在不是那麼明顯。我們假設訓練過程中,當discriminator懲罰損壞的照片的時候,對generator來說,要改善的最直接的方法就是有效地擴展潛在空間中能夠產生好的照片的區域。這將會導致低品質照片被擠壓到變化訊息的小的潛在空間區域中。儘管這在短期內能夠改善其平均輸出品質,不過累積的失真(distortions)已經損害到訓練動態,進而影響最終的照片品質。 ::: :::info Clearly, we cannot simply encourage minimal PPL since that would guide the generator toward a degenerate solution with zero recall. Instead, we will describe a new regularizer that aims for a smoother generator mapping without this drawback. As the resulting regularization term is somewhat expensive to compute, we first describe a general optimization that applies to any regularization technique. ::: :::success 很明顯的,我們不能簡單的鼓勵最小化PPL,這也許會引導generator走向零召回率的退化的解決方案。相反的,我們要來說明一個新的regularizer,主要是用來弄一個更平滑的生成器的映射,沒有這個退化問題的缺點。因為計算正規化項目的成本比較高的原因,我們首先說明一個可以應用於任意正規化技術的通用最佳化。 ::: ### 3.1. Lazy regularization :::info Typically the main loss function (e.g., logistic loss [16]) and regularization terms (e.g., $R_1$ [30]) are written as a single expression and are thus optimized simultaneously. We observe that the regularization terms can be computed less frequently than the main loss function, thus greatly diminishing their computational cost and the overall memory usage. Table 1, row C shows that no harm is caused when $R_1$ regularization is performed only once every 16 minibatches, and we adopt the same strategy for our new regularizer as well. Appendix B gives implementation details. ::: :::success 通常來說,主要的損失函數(如logistic loss)跟正規化項目(如$R_1$)都是被寫成一個單一表達式,然後同時最佳化。我們觀察到,正規化項目的計算頻率可以低於主要的損失函數,從而大幅減少其計算成本與整體的計憶體用量。Table 1,row C說明著,當每16個minibatches執行一次$R_1$正規化的時候,是沒有任何傷害的,我們新的正規化也採用相同的策略。Appendix B給出實作的細節。 ::: ### 3.2. Path length regularization :::info We would like to encourage that a fixed-size step in $\mathcal{W}$ results in a non-zero, fixed-magnitude change in the image. We can measure the deviation from this ideal empirically by stepping into random directions in the image space and observing the corresponding $\mathbf{w}$ gradients. These gradients should have close to an equal length regardless of $\mathbf{w}$ or the image-space direction, indicating that the mapping from the latent space to image space is well-conditioned [33]. ::: :::success 我們鼓勵在$\mathcal{W}$中採用以固定大小的步長,這可以在照片產生非零、固定幅度的變化。我們可以透過進入照片空間中的隨機方向,然後觀察相關的$\mathbf{w}$梯度,用經驗量測與理想情況之間的差異。這些梯度應該應該都有相近的長度(無論是$\mathbf{w}$或是影像空間方向),這說明著從潛在空間到影像空間的映射是[良置](https://ccjou.wordpress.com/2010/06/22/%E7%97%85%E6%85%8B%E7%B3%BB%E7%B5%B1/)的(well-conditioned)。 ::: :::info At a single $\mathbf{w}\in\mathcal{W}$, the local metric scaling properties of the generator mapping $g(\mathbf{w}): \mathcal{W} \mapsto \mathcal{Y}$ are captured by the Jacobian matrix $\mathbf{J_w} = \partial g(\mathbf{w}) / \partial{\mathbf{w}}$. Motivated by the desire to preserve the expected lengths of vectors regardless of the direction, we formulate our regularizer as $$ \mathbb{E}_{\mathbf{w},\mathbf{y}\sim\mathcal{N}(0,\mathbf{I})}\left(\Vert \mathbf{J}^T_{\mathbf{w}}\mathbf{y}\Vert_2-a\right)^2\tag{4} $$ where $\mathbf{y}$ are random images with normally distributed pixel intensities, and $\mathbf{w}\sim f(\mathbf{z})$, where $\mathbf{z}$ are normally distributed. We show in Appendix C that, in high dimensions, this prior is minimized when $\mathbf{J}_{\mathbf{w}}$ is orthogonal (up to a global scale) at any $\mathbf{w}$. An orthogonal matrix preserves lengths and introduces no squeezing along any dimension. 
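:::warning
個人補充(編輯註):下面以 PyTorch 給一個方程式4的簡化示意(非官方實作),其中用到下一段才會提到的恆等式 $\mathbf{J}^T_{\mathbf{w}}\mathbf{y}=\nabla_\mathbf{w}(g(\mathbf{w})\cdot\mathbf{y})$ 以及以指數移動平均追蹤的目標值 $a$;synthesis 為假設的可微分生成器,w 需 requires_grad,pl_weight 只是佔位的權重(實際的權重與執行頻率見 Appendix B)。
```python
import torch

def path_length_penalty(synthesis, w, pl_mean, pl_decay=0.01, pl_weight=2.0):
    """方程式4的簡化示意:w 形狀 (N, w_dim);pl_mean 為跨訓練步驟維持的 EMA 狀態(對應常數 a)。"""
    images = synthesis(w)                                   # g(w),形狀 (N, 3, H, W)
    y = torch.randn_like(images)                            # 像素強度為常態分佈的隨機影像 y
    # 恆等式:對 (g(w)·y) 的總和取 w 的梯度,即為 J_w^T y
    grads = torch.autograd.grad((images * y).sum(), w, create_graph=True)[0]
    lengths = grads.square().sum(dim=1).sqrt()              # ||J_w^T y||_2,形狀 (N,)
    pl_mean = pl_mean + pl_decay * (lengths.mean().detach() - pl_mean)   # 追蹤目標尺度 a
    penalty = (lengths - pl_mean).square().mean()           # E[(||J_w^T y||_2 - a)^2]
    return pl_weight * penalty, pl_mean
```
把回傳的 penalty 加到生成器的損失中(搭配 3.1 節的 lazy regularization,每若干步才計算一次)即可。
:::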
::: :::success 在單一個$\mathbf{w}\in\mathcal{W}$的時候,生成器映射$g(\mathbf{w}): \mathcal{W} \mapsto \mathcal{Y}$的局部度量縮放特性(local metric scaling properties)可以由Jacobian matrix $\mathbf{J_w} = \partial g(\mathbf{w}) / \partial{\mathbf{w}}$來捕獲。出於不管方向如何都要保持向量的期望長度的傾向,我們將正規化公式寫為: $$ \mathbb{E}_{\mathbf{w},\mathbf{y}\sim\mathcal{N}(0,\mathbf{I})}\left(\Vert \mathbf{J}^T_{\mathbf{w}}\mathbf{y}\Vert_2-a\right)^2\tag{4} $$ 其中$\mathbf{y}$是有著像素強度常態分佈的隨機影像,且$\mathbf{w}\sim f(\mathbf{z})$,其中$\mathbf{z}$是常態分佈。我們在Appendix C中說明,在高維度中,當 $\mathbf{J}_{\mathbf{w}}$在任意$\mathbf{w}$是orthogonal(至多差一個全域尺度)的時候,那這個prior(先驗)就會被最小化。orthogonal matrix會保持長度,而且不會沿任何維度引入擠壓(squeezing)。 ::: :::info To avoid explicit computation of the Jacobian matrix, we use the identity $\mathbf{J}^T_{\mathbf{w}}\mathbf{y}=\nabla_\mathbf{w}(g(\mathbf{w})\cdot\mathbf{y})$, which is efficiently computable using standard backpropagation [6]. The constant $a$ is set dynamically during optimization as the long-running exponential moving average of the lengths $\Vert \mathbf{J}^T_{\mathbf{w}}\mathbf{y} \Vert_2$, allowing the optimization to find a suitable global scale by itself. ::: :::success 為了避免顯式地計算Jacobian matrix,我們使用[恆等式](https://terms.naer.edu.tw/detail/2260de851a22768bbba2e8d4eb4d980b/)$\mathbf{J}^T_{\mathbf{w}}\mathbf{y}=\nabla_\mathbf{w}(g(\mathbf{w})\cdot\mathbf{y})$,這可以用標準的反向傳播高效地計算。常數$a$在最佳化期間是動態設置的,做為長度$\Vert \mathbf{J}^T_{\mathbf{w}}\mathbf{y} \Vert_2$的long-running exponential moving average(長期指數移動平均),這可以讓最佳化自己去找到最適合的全域尺度。 ::: :::info Our regularizer is closely related to the Jacobian clamping regularizer presented by Odena et al. [33]. Practical differences include that we compute the products $\mathbf{J}^T_{\mathbf{w}}\mathbf{y}$ analytically whereas they use finite differences for estimating $\mathbf{J}_\mathbf{w}\mathbf{\delta}$ with $\mathcal{Z} \ni \mathbf{\delta} \sim \mathcal{N}(0, \mathbf{I})$. It should be noted that spectral normalization [31] of the generator [46] only constrains the largest singular value, posing no constraints on the others and hence not necessarily leading to better conditioning. We find that enabling spectral normalization in addition to our contributions — or instead of them — invariably compromises FID, as detailed in Appendix E. ::: :::success 我們的regularizer與Odena等人提出的Jacobian clamping regularizer息息相關。實際上的差異在於,我們分析性地計算了$\mathbf{J}^T_{\mathbf{w}}\mathbf{y}$,而他們則使用[有限差分法](https://terms.naer.edu.tw/detail/ecf43e84d3f0f2d4d455931a00f3b38b/)來估計 $\mathbf{J}_\mathbf{w}\mathbf{\delta}$,其中 $\mathcal{Z} \ni \mathbf{\delta} \sim \mathcal{N}(0, \mathbf{I})$。應該要注意的是,生成器的spectral normalization只約束了最大的[奇異值](https://terms.naer.edu.tw/detail/aad71ddb7d8d0922fb95fe00f55a850b/),對其它奇異值並沒有約束,因此不一定能導致更好的條件。我們發現,無論是在我們的貢獻之外額外啟用spectral normalization,還是用它取代我們的貢獻,都會無一例外地損害FID,詳細情況見Appendix E。 ::: :::info In practice, we notice that path length regularization leads to more reliable and consistently behaving models, making architecture exploration easier. We also observe that the smoother generator is significantly easier to invert (Section 5). Figure 5b shows that path length regularization clearly tightens the distribution of per-image PPL scores, without pushing the mode to zero. However, Table 1, row D points toward a tradeoff between FID and PPL in datasets that are less structured than FFHQ. ::: :::success 實務上,我們有注意到,路徑長度正規化能讓模型表現的更可靠、一致,讓架構探索變的更容易。我們還觀察到,更平滑的生成器明顯的更容易反轉(invert)(Section 5)。Figure 5b說明著,路徑長度正規化明顯收緊了per-image PPL scores的分佈,但並未將模式推至零。然而,Table 1中的row D指出,在結構性不如FFHQ的資料集中,FID和PPL之間存在著權衡。 ::: ## 4.
Progressive growing revisited :::info Progressive growing [23] has been very successful in stabilizing high-resolution image synthesis, but it causes its own characteristic artifacts. The key issue is that the progressively grown generator appears to have a strong location preference for details; the accompanying video shows that when features like teeth or eyes should move smoothly over the image, they may instead remain stuck in place before jumping to the next preferred location. Figure 6 shows a related artifact. We believe the problem is that in progressive growing each resolution serves momentarily as the output resolution, forcing it to generate maximal frequency details, which then leads to the trained network to have excessively high frequencies in the intermediate layers, compromising shift invariance [49]. Appendix A shows an example. These issues prompt us to search for an alternative formulation that would retain the benefits of progressive growing without the drawbacks. ::: :::success 漸進增長在穩定高解析度影像合成上有著很大的成功,不過這也導致它的一些特有的缺陷。主要問題在於,漸進增長生成器在細節上有著強烈的位置偏好;隨文附上的片片說明著,當像是牙齒或是眼睛這類的特徵應該是要在照片上平滑的移動著,但是實際上在進到下一個更好的位置之前卻是會卡在原地。Figure 6說明著相關的瑕疵。我們確信這個問題在於,在漸進增長過程中,每個解析度都會暫時性的做為輸出解析度,迫使其生成最大頻率細節,這導致了訓練的網路在其中間網路層有著過高的頻率,從而損害了平移不變性。Appendix A給出一個範例。這些問題促使我們尋找一個替代的數學式,既能保留漸進增長的好處,又能避掉這個缺點。 ::: :::info ![image](https://hackmd.io/_uploads/HJA0BpZE0.png) Figure 6. Progressive growing leads to “phase” artifacts. In this example the teeth do not follow the pose but stay aligned to the camera, as indicated by the blue line. ::: ### 4.1. Alternative network architectures :::info While StyleGAN uses simple feedforward designs in the generator (synthesis network) and discriminator, there is a vast body of work dedicated to the study of better network architectures. Skip connections [34, 22], residual networks [18, 17, 31], and hierarchical methods [7, 47, 48] have proven highly successful also in the context of generative methods. As such, we decided to re-evaluate the network design of StyleGAN and search for an architecture that produces high-quality images without progressive growing. ::: :::success 儘管StyleGAN在generator (synthesis network)和discriminator中使用簡單的前饋設計,不過有大量研究專注於更好的網路架構。Skip connections、residual networks和hierarchical在生成方法的範疇中也證明了非常成功。因此,我們決定重新評估StyleGAN的網路設計,並尋找一種不需要漸進增長也可以生成高品質影像的架構。 ::: :::info Figure 7a shows MSG-GAN [22], which connects the matching resolutions of the generator and discriminator using multiple skip connections. The MSG-GAN generator is modified to output a mipmap [42] instead of an image, and a similar representation is computed for each real image as well. In Figure 7b we simplify this design by upsampling and summing the contributions of RGB outputs corresponding to different resolutions. In the discriminator, we similarly provide the downsampled image to each resolution block of the discriminator. We use bilinear filtering in all up and downsampling operations. In Figure 7c we further modify the design to use residual connections.3 This design is similar to LAPGAN [7] without the per-resolution discriminators employed by Denton et al. 
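:::warning
個人補充(編輯註):下面以 PyTorch 給一個 Figure 7b「output skips」生成器前向傳遞的簡化示意(非官方實作):每個解析度的 block 都輸出一張 RGB(tRGB),以雙線性上採樣後逐層累加成最終影像。blocks、trgb_layers、ws 皆為假設的物件。
```python
import torch.nn.functional as F

def skip_generator_forward(x, ws, blocks, trgb_layers):
    """x:4x4 常數輸入;blocks 依解析度由低到高排列,且假設除第一層外內部自行將特徵上採樣 2 倍。"""
    rgb = None
    for block, trgb, w in zip(blocks, trgb_layers, ws):
        x = block(x, w)            # 該解析度的 style block(s)
        y = trgb(x, w)             # 本解析度貢獻的 RGB
        if rgb is None:
            rgb = y
        else:
            # 先前累加的 RGB 以雙線性濾波上採樣 2 倍,再加上本層貢獻
            rgb = F.interpolate(rgb, scale_factor=2, mode="bilinear",
                                align_corners=False) + y
    return rgb
```
判別器端(residual)則是反過來:把輸入影像下採樣後提供給每個解析度區塊,並以殘差連接累加特徵。
:::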
::: :::success Figure 7a給出的是MSG-GAN,它使用多個skip connections連接generator和discriminator的匹配解析度。MSG-GAN生成器被修改為輸出一個mipmap而不是影像,對每個real image也計算類似的representation。在Figure 7b中,我們通過upsampling和匯總對應不同解析度的RGB輸出來簡化這個設計。在discriminator中,我們同樣將下採樣的影像提供給discriminator的每個解析度區塊(resolution block)。我們在所有上採樣和下採樣操作中使用[雙線性](https://terms.naer.edu.tw/detail/0b5342522b462cfc54fa4cc2e9d438e4/)濾波。在Figure 7c中,我們進一步修改設計來使用殘差連接。這個設計類似於LAPGAN,但沒有Denton等人所採用的per-resolution discriminators。 ::: :::info ![image](https://hackmd.io/_uploads/H1AboabER.png) Figure 7. Three generator (above the dashed line) and discriminator architectures. Up and Down denote bilinear up and downsampling, respectively. In residual networks these also include 1×1 convolutions to adjust the number of feature maps. tRGB and fRGB convert between RGB and high-dimensional per-pixel data. Architectures used in configs E and F are shown in green. ::: :::info Table 2 compares three generator and three discriminator architectures: original feedforward networks as used in StyleGAN, skip connections, and residual networks, all trained without progressive growing. FID and PPL are provided for each of the 9 combinations. We can see two broad trends: skip connections in the generator drastically improve PPL in all configurations, and a residual discriminator network is clearly beneficial for FID. The latter is perhaps not surprising since the structure of discriminator resembles classifiers where residual architectures are known to be helpful. However, a residual architecture was harmful in the generator — the lone exception was FID in LSUN CAR when both networks were residual. ::: :::success Table 2比較了三種generator和三種discriminator的架構:StyleGAN中原始的前饋網絡、skip connections和殘差網絡,這些都在沒有漸進增長的情況下進行訓練。這9種組合都提供FID和PPL。我們可以看到兩個大趨勢:在所有配置中,生成器中的skip connections大大地改善了PPL,而residual discriminator network明顯的對FID是有幫助的。後者可能並不令人驚訝,因為discriminator的結構類似於分類器,而殘差架構在分類器中已被證明是有幫助的。然而,殘差架構在生成器中是有害的,唯一的例外是當兩個網絡都是殘差時,LSUN CAR的FID表現是好的。 ::: :::info ![image](https://hackmd.io/_uploads/HyffWA-E0.png) Table 2. Comparison of generator and discriminator architectures without progressive growing. The combination of generator with output skips and residual discriminator corresponds to configuration E in the main result table. ::: :::info For the rest of the paper we use a skip generator and a residual discriminator, without progressive growing. This corresponds to configuration E in Table 1, and it significantly improves FID and PPL. ::: :::success 論文的剩餘部分,我們使用skip generator和residual discriminator,沒有使用漸進式增長。這對應於Table 1的配置E,這明顯提升FID與PPL。 ::: ### 4.2. Resolution usage :::info The key aspect of progressive growing, which we would like to preserve, is that the generator will initially focus on low-resolution features and then slowly shift its attention to finer details. The architectures in Figure 7 make it possible for the generator to first output low resolution images that are not affected by the higher-resolution layers in a significant way, and later shift the focus to the higher-resolution layers as the training proceeds. Since this is not enforced in any way, the generator will do it only if it is beneficial. To analyze the behavior in practice, we need to quantify how strongly the generator relies on particular resolutions over the course of training. 
::: :::success 漸進式增長的一個關鍵點在於,我們希望保留生成器最初專注於低解析度特徵,然後逐漸將注意力轉移到更細緻的細節。圖7中的架構使生成器可以首先輸出不受高解析度層顯著影響的低解析度圖像,隨著訓練的進行,再將重點轉移到高解析度層。由於這並不是強制性的,生成器只有在有利的情況下才會這樣做。為了實際分析這種行為,我們需要量化生成器在訓練過程中依賴特定解析度的程度。 ::: :::info Since the skip generator (Figure 7b) forms the image by explicitly summing RGB values from multiple resolutions, we can estimate the relative importance of the corresponding layers by measuring how much they contribute to the final image. In Figure 8a, we plot the standard deviation of the pixel values produced by each tRGB layer as a function of training time. We calculate the standard deviations over 1024 random samples of $\mathbf{w}$ and normalize the values so that they sum to 100%. ::: :::success 因為skip generator (Figure 7b)形成影像是透過顯式地加總從多個解析度來的RGB值,所以我們可以量測它們對最終影像的貢獻來估測對應網路層的相對重要性。在Figure 8a中,我們將每一個tRGB layer生成的像素值的標準差做為訓練時間的函數繪製。我們在1024個$\mathbf{w}$的隨機樣本上計算標準差並正規化其值,因此總和為100%。 ::: :::warning 不是很確定上面的翻譯是否恰當 ::: :::info ![image](https://hackmd.io/_uploads/rJRBGXmN0.png) Figure 8. Contribution of each resolution to the output of the generator as a function of training time. The vertical axis shows a breakdown of the relative standard deviations of different resolutions, and the horizontal axis corresponds to training progress, measured in millions of training images shown to the discriminator. We can see that in the beginning the network focuses on lowresolution images and progressively shifts its focus on larger resolutions as training progresses. In (a) the generator basically outputs a 512^2^ image with some minor sharpening for 1024^2^ , while in (b) the larger network focuses more on the high-resolution details. Figure 8. 每個解析度對生成器輸出貢獻的訓練時間函數。垂直軸說明不同解析度的相對標準差的區分,水平軸對應於訓練進度,以給discriminator看過數百萬張訓練影像的方式來衡量。我們可以看到在訓練初期,網路關注在低解析度影像上,隨著訓練的進展逐步轉向較大解析度上。在(a)中,生成器基本上輸出512^2^的影像,並對1024^2^做了一些微小的銳化,而在(b)中,較大的網路更多是關注在高解析度的細節上。 ::: :::info At the start of training, we can see that the new skip generator behaves similar to progressive growing — now achieved without changing the network topology. It would thus be reasonable to expect the highest resolution to dominate towards the end of the training. The plot, however, shows that this fails to happen in practice, which indicates that the generator may not be able to “fully utilize” the target resolution. To verify this, we inspected the generated images manually and noticed that they generally lack some of the pixel-level detail that is present in the training data — the images could be described as being sharpened versions of 512^2^ images instead of true 1024^2^ images. ::: :::success 在訓練開始時,我們可以看到新的skip generator的行為類似於漸進增長,現在實現這一點無需改變網路拓撲。因此,我們可以合理的預期最高解析度會在訓練結束的時候主導方向。然而,圖表說明了,關於這點事實上並沒有發生,這指出了生成器可能無法"充分利用"目標解析度。為了驗證這一點,我們手動檢查生成的照片,發現它們通常缺少訓練資料中存在的某些像素級別(pixel-level)的細節,這些影像可以被描述為512^2的銳化版本,而不是真正的1024^2。 ::: :::info This leads us to hypothesize that there is a capacity problem in our networks, which we test by doubling the number of feature maps in the highest-resolution layers of both networks. This brings the behavior more in line with expectations: Figure 8b shows a significant increase in the contribution of the highest-resolution layers, and Table 1, row F shows that FID and Recall improve markedly. The last row shows that baseline StyleGAN also benefits from additional capacity, but its quality remains far below StyleGAN2. 
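:::warning
個人補充(編輯註):對應上面 Figure 8 的量測方式,下面是一個 Python 簡化示意(非官方實作):對 1024 個隨機 $\mathbf{w}$ 收集各解析度 tRGB 輸出的像素值,算出每個解析度的標準差後正規化成總和 100%。sample_trgb_outputs、sample_w 為假設的函數,而且實際程式通常會分批處理以免吃光記憶體。
```python
import numpy as np

def resolution_contributions(sample_trgb_outputs, sample_w, num_samples=1024):
    """回傳 {解析度: 相對貢獻(%)};sample_trgb_outputs(w) 假設回傳 {解析度: tRGB 輸出像素陣列}。"""
    pixels_per_res = {}
    for _ in range(num_samples):
        w = sample_w()
        for res, pixels in sample_trgb_outputs(w).items():
            pixels_per_res.setdefault(res, []).append(np.asarray(pixels).ravel())
    std_per_res = {res: np.concatenate(chunks).std() for res, chunks in pixels_per_res.items()}
    total = sum(std_per_res.values())
    return {res: 100.0 * s / total for res, s in std_per_res.items()}
```
:::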
::: :::success 這讓我們假設在我們的網路中有[容量](https://terms.naer.edu.tw/detail/8265ddb06aa22361aa0b49bc59064577/)的問題,關於這點我們透過把兩個網路中最高解析度的網路層的feature map的數量加倍的方式來測試^4^。這讓整個行為更加符合預期:Figure 8b說明著最高解析度網路層的貢獻明顯增加,然後Table 1的row F則說明著FID跟召回率也都明顯改善。最後一個row則說明著,baseline StyleGAN也從增加容量的作法中獲得好處,不過還是遠不如StyleGAN2。 ::: :::info ^4^ We double the number of feature maps in resolutions 64^2^–1024^2^ while keeping other parts of the networks unchanged. This increases the total number of trainable parameters in the generator by 22% (25M → 30M) and in the discriminator by 21% (24M → 29M). ^4^ 我們把解析度64^2^–1024^2^的feature maps數量加倍,然後其它的部份沒有改變。這造成generator的訓練參數增加22%(25M → 30M),discriminator的部份則是21%(24M → 29M)。 ::: :::info Table 3 compares StyleGAN and StyleGAN2 in four LSUN categories, again showing clear improvements in FID and significant advances in PPL. It is possible that further increases in the size could provide additional benefits. ::: :::success Table 3在四個LSUN categories中比較了StyleGAN與StyleGAN2,再次的說明FID跟PPL都明顯的進步。進一步的增加大小可能會帶來額外的好處。 ::: ## 5. Projection of images to latent space :::info Inverting the synthesis network g is an interesting problem that has many applications. Manipulating a given image in the latent feature space requires finding a matching latent code $\mathbf{w}$ for it first. Previous research [1, 10] suggests that instead of finding a common latent code $\mathbf{w}$, the results improve if a separate $\mathbf{w}$ is chosen for each layer of the generator. The same approach was used in an early encoder implementation [32]. While extending the latent space in this fashion finds a closer match to a given image, it also enables projecting arbitrary images that should have no latent representation. Instead, we concentrate on finding latent codes in the original, unextended latent space, as these correspond to images that the generator could have produced. ::: :::success 反轉~~術式~~合成網路$g$是一個很有趣的問題,這有很多應用。要在潛在特徵空間中操作給定的影像,首先就是要找到一個匹配的latent code $\mathbf{w}$。早前的研究建議,不要去找通用的latent code $\mathbf{w}$,如果能夠幫generator的每個網路都各別選擇$\mathbf{w}$的話,結果會更好。相同的方法在早先的encoder實現中也有被使用過。雖然用這樣的方法來擴展潛在空間可以找到跟給定的影像更接近的匹配,不過它還能夠將不應有潛在表示(latent representation)的任意影像投影到潛在空間。相反的,我們集中火力在原始、未擴展的潛在空間尋找潛在編碼(latent codes),因為這些latent codes對應到生成器能夠生成的影像。 ::: :::info Our projection method differs from previous methods in two ways. First, we add ramped-down noise to the latent code during optimization in order to explore the latent space more comprehensively. Second, we also optimize the stochastic noise inputs of the StyleGAN generator, regularizing them to ensure they do not end up carrying coherent signal. The regularization is based on enforcing the autocorrelation coefficients of the noise maps to match those of unit Gaussian noise over multiple scales. Details of our projection method can be found in Appendix D. ::: :::success 我們的投影方法跟早前的方法有兩點不同。首先,我們在最佳化期間逐漸減少的噪點(ramped-down noise),用這樣的方式更全面性的探索潛在空間。再來就是,我們還最佳化了StyleGAN generator的隨機噪點輸入,對它們做正規化,確保它們不會帶有[耦合](https://terms.naer.edu.tw/detail/1d428594e0b5565653605b5472859162/)的信號。正規化是基於強制噪點的[自相關](https://terms.naer.edu.tw/detail/72278cf1d1f47a135b71d7435ac8d343/)系數在多個尺度上匹配單位高斯噪點的自相關系數。細節可參考Appendix D。 ::: ### 5.1. Attribution of generated images :::info Detection of manipulated or generated images is a very important task. At present, classifier-based methods can quite reliably detect generated images, regardless of their exact origin [29, 45, 40, 51, 41]. However, given the rapid pace of progress in generative methods, this may not be a lasting situation. 
Besides general detection of fake images, we may also consider a more limited form of the problem: being able to attribute a fake image to its specific source [2]. With StyleGAN, this amounts to checking if there exists a $\mathbf{w}\in\mathcal{W}$ that re-synthesis the image in question. ::: :::success 檢測被操作或是生成的影像是一件非常重要的任務。目前來說,基於分類器的方法能夠十分可靠地檢測出生成的影像,不論它們的具體來源為何。然而,考量到生成方法的快速演進,這可能不會是長久的情況。除了對假照片的通常檢測之外,我們還考慮到問題的更有限的一個形式:能夠將假照片歸屬於其特定來源。對於StyleGAN來說,這相當於檢查其是否存在於一個$\mathbf{w}\in\mathcal{W}$所能重新合成的問題。 ::: :::info We measure how well the projection succeeds by computing the LPIPS [50] distance between original and re-synthesized image as $D_{\text{LPIPS}}[x, g(\tilde{g}^{-1}(x))]$, where $x$ is the image being analyzed and $\tilde{g}^{-1}$ denotes the approximate projection operation. Figure 10 shows histograms of these distances for LSUN CAR and FFHQ datasets using the original StyleGAN and StyleGAN2, and Figure 9 shows example projections. The images generated using StyleGAN2 can be projected into $\mathcal{W}$ so well that they can be almost unambiguously attributed to the generating network. However, with the original StyleGAN, even though it should technically be possible to find a matching latent code, it appears that the mapping from $\mathcal{W}$ to images is too complex for this to succeed reliably in practice. We find it encouraging that StyleGAN2 makes source attribution easier even though the image quality has improved significantly. ::: :::success 我們透過計算原始影像與重新合成影像之間的LPIPS distance來量測投影有多麼的成功,這個距離為$D_{\text{LPIPS}}[x, g(\tilde{g}^{-1}(x))]$,其中$x$為正在分析的影像,而$\tilde{g}^{-1}$則表示近似投影的操作。Figure 10呈現出來的是LSUN CAR與FFHQ資料集使用StyleGAN與StyleGAN2所做的距離直方圖。Figure 9說明了樣本投影。使用StyleGAN2生成的影像可以很好的投影到$\mathcal{W}$,以致於這可以幾乎明確地歸因於生成網路。不過吼,使用原始的StyleGAN的話,即使技術上應該是可以找到匹配的latent code,不過要從$\mathcal{W}$映射到影像的這部份在實作上似乎太過複雜而難以成功。我們發現到令人鼓舞的是,儘管影像的品質明顯地提升,不過StyleGAN2讓來源的歸因變成更加容易。 ::: :::info ![image](https://hackmd.io/_uploads/Hyz8MpBER.png) Figure 9. Example images and their projected and re-synthesized counterparts. For each configuration, top row shows the target images and bottom row shows the synthesis of the corresponding projected latent vector and noise inputs. With the baseline StyleGAN, projection often finds a reasonably close match for generated images, but especially the backgrounds differ from the originals. The images generated using StyleGAN2 can be projected almost perfectly back into generator inputs, while projected real images (from the training set) show clear differences to the originals, as expected. All tests were done using the same projection method and hyperparameters ::: :::info ![image](https://hackmd.io/_uploads/SJswMTH4A.png) Figure 10. LPIPS distance histograms between original and projected images for generated (blue) and real images (orange). Despite the higher image quality of our improved generator, it is much easier to project the generated images into its latent space $\mathcal{W}$. The same projection method was used in all cases. ::: ## 6. Conclusions and future work :::info We have identified and fixed several image quality issues in StyleGAN, improving the quality further and considerably advancing the state of the art in several datasets. In some cases the improvements are more clearly seen in motion, as demonstrated in the accompanying video. Appendix A includes further examples of results obtainable using our method. Despite the improved quality, StyleGAN2 makes it easier to attribute a generated image to its source. 
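:::warning
個人補充(編輯註):呼應上面 Section 5.1 的作法,下面用幾行 Python 示意「投影再重新合成」的來源歸因流程(非官方實作)。project_to_w(近似的 $\tilde{g}^{-1}$)、synthesis($g$)與 lpips_distance 皆為假設的函數,threshold 需依 Figure 10 那樣的距離分佈自行選定,這裡的數值只是佔位。
```python
def is_likely_generated_by(x, project_to_w, synthesis, lpips_distance, threshold=0.1):
    """若 LPIPS 距離 D_LPIPS[x, g(g~^{-1}(x))] 夠小,影像 x 很可能出自這個生成器。"""
    w_hat = project_to_w(x)       # 在未擴展的 W 空間中尋找匹配的 latent code
    x_rec = synthesis(w_hat)      # 以找到的 w 重新合成
    return lpips_distance(x, x_rec) < threshold
```
:::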
::: :::success 我們已經確定並修復多個StyleGAN中的影像品質問題,進一步的提升品質,在多個資料集中也相當程度的推進。在某些情況中,這些提升更是清楚可見,如隨文附上的片片所說明。Appendix A包括使用我們的方法所能得到的結果的樣本。儘管品質有所提升,StyleGAN2使的將生成影像歸屬於其來源變得更加容易。 ::: :::info Training performance has also improved. At 1024^2^ resolution, the original StyleGAN (config A in Table 1) trains at 37 images per second on NVIDIA DGX-1 with 8 Tesla V100 GPUs, while our config E trains 40% faster at 61 img/s. Most of the speedup comes from simplified dataflow due to weight demodulation, lazy regularization, and code optimizations. StyleGAN2 (config F, larger networks) trains at 31 img/s, and is thus only slightly more expensive to train than original StyleGAN. Its total training time was 9 days for FFHQ and 13 days for LSUN CAR. ::: :::success 訓練的效能也是有所提升。在1024^2解析度中,原始的StyleGAN(Table 1的配置A)在NVIDIA DGX-1搭配8張Tesla V100 GPUs訓練,每秒可以處理37張照片,而我們的配置E的訓練速度快了40%,每秒61張照片。大多數的提升是來自於簡化的dataflow,主要有weight demodulation、lazy regularization與code optimizations。StyleGAN2(config F, larger networks)以每秒31張照片訓練,因此其訓練成本略高於原始的StyleGAN。總的訓練時間來說,FFHQ為9天,LSUN CAR為13天。 ::: :::info The entire project, including all exploration, consumed 132 MWh of electricity, of which 0.68 MWh went into training the final FFHQ model. In total, we used about 51 single-GPU years of computation (Volta class GPU). A more detailed discussion is available in Appendix F. ::: :::success 整個專案,包含所有的探索,共消耗了132MWh的電力,其中0.68MWh是用於訓練最終的FFHQ模型。總的來說,我們使用了大約51塊GPU的年度計算量。更詳細的討論請見Appendix F。 ::: :::info In the future, it could be fruitful to study further improvements to the path length regularization, e.g., by replacing the pixel-space L2 distance with a data-driven feature-space metric. Considering the practical deployment of GANs, we feel that it will be important to find new ways to reduce the training data requirements. This is especially crucial in applications where it is infeasible to acquire tens of thousands of training samples, and with datasets that include a lot of intrinsic variation. ::: :::success 未來,進一步的研究路徑長度正規化的改善可能是有成效的,舉例來說,將pixel-space L2 distance替換為data-driven feature-space metric。考量到GANs的實務佈署,我們認為找出一個減少訓練資料需求的新方法是重要的。這對那些無法獲取大量訓練樣本,以及那些存在大量內在變異的資料集來說是至關重要的。 ::: :::info Acknowledgements We thank Ming-Yu Liu for an early review, Timo Viitanen for help with the public release, David Luebke for in-depth discussions and helpful comments, and Tero Kuosmanen for technical support with the compute infrastructure. ::: :::success 謝天謝地。 ::: ## A. Image quality :::info We include several large images that illustrate various aspects related to image quality. Figure 11 shows hand-picked examples illustrating the quality and diversity achievable using our method in FFHQ, while Figure 12 shows uncurated results for all datasets mentioned in the paper. Figures 13 and 14 demonstrate cases where FID and P&R give non-intuitive results, but PPL seems to be more in line with human judgement. ::: :::success 我們包含幾個大型圖片說明著圖片品質相關的各方面。Figure 11為手工挑選的樣本,說明著使用我們的方法在FFHQ上能達到的品質與多樣性,而Figure12是論文中提過的所有資料集未經處理的結果。Figures 13、14給出FID與P&R非直觀的結果,不過PPL似乎更符合人類的判斷。 ::: :::info ![image](https://hackmd.io/_uploads/H16NzWo4C.png) Figure 11. Four hand-picked examples illustrating the image quality and diversity achievable using StylegGAN2 (config F). ::: :::info ![image](https://hackmd.io/_uploads/S11QM-iVR.png) ![image](https://hackmd.io/_uploads/rJhmzbiER.png) Figure 12. Uncurated results for each dataset used in Tables 1 and 3. The images correspond to random outputs produced by our generator (config F), with truncation applied at all resolutions using $\psi=0.5$ [24]. 
::: :::info ![image](https://hackmd.io/_uploads/S1BC-ZoNC.png) ![image](https://hackmd.io/_uploads/Skmpb-jVR.png) Figure 13. Uncurated examples from two generative models trained on LSUN CAT without truncation. FID, precision, and recall are similar for models 1 and 2, even though the latter produces cat-shaped objects more often. Perceptual path length (PPL) indicates a clear preference for model 2. Model 1 corresponds to configuration A in Table 3, and model 2 is an early training snapshot of configuration F. ::: :::info ![image](https://hackmd.io/_uploads/SyAjWZs4C.png) ![image](https://hackmd.io/_uploads/S1_qWWs40.png) Figure 14. Uncurated examples from two generative models trained on LSUN CAR without truncation. FID, precision, and recall are similar for models 1 and 2, even though the latter produces car-shaped objects more often. Perceptual path length (PPL) indicates a clear preference for model 2. Model 1 corresponds to configuration A in Table 3, and model 2 is an early training snapshot of configuration F. ::: :::info We also include images relating to StyleGAN artifacts. Figure 15 shows a rare case where the blob artifact fails to appear in StyleGAN activations, leading to a seriously broken image. Figure 16 visualizes the activations inside Table 1 configurations A and F. It is evident that progressive growing leads to higher-frequency content in the intermediate layers, compromising shift invariance of the network. We hypothesize that this causes the observed uneven location preference for details when progressive growing is used. ::: :::success 我們還包括與StyleGAN瑕疵相關的圖片。Figure 15給出一個罕見的案例,其中StyleGAN activations未能出現斑點瑕疵,這導致圖片嚴重損壞。Figure 16視覺化了Table 1 configurations A與F的activations。很明顯的,漸進增長在中間網路層中導致了更高頻率的內容,危及了網路的平移不變性。我們假設這會導致使用漸進增長時所觀察到的細節位置偏好不均勻。 ::: :::info ![image](https://hackmd.io/_uploads/BJk-EWiVR.png) Figure 15. An example of the importance of the droplet artifact in StyleGAN generator. We compare two generated images, one successful and one severely corrupted. The corresponding feature maps were normalized to the viewable dynamic range using instance normalization. For the top image, the droplet artifact starts forming in 64^2^ resolution, is clearly visible in 128^2^ , and increasingly dominates the feature maps in higher resolutions. For the bottom image, 64^2^ is qualitatively similar to the top row, but the droplet does not materialize in 128^2^ .Consequently, the facial features are stronger in the normalized feature map. This leads to an overshoot in 256^2^ , followed by multiple spurious droplets forming in subsequent resolutions. Based on our experience, it is rare that the droplet is missing from StyleGAN images, and indeed the generator fully relies on its existence. Figure 15. StyleGAN生成器中水滴瑕疵重要性的範例。我們比較了兩張生成的影像,一張成功的影像和一張嚴重損壞的影像。對應的feature maps使用instance normalization正規化到可視化動態範圍中。對於上排影像,水滴瑕疵在64^2^解析度開始形成,在128^2^解析度中清晰可見,並在更高解析度的feature map中越來越占主導地位。對於下排影像,64^2^解析度質量上類似於[頂列](https://terms.naer.edu.tw/detail/b1b50a95d2e8c0554b1e2b6d11e3c0fd/)(top row),但水滴在128^2^解析度中未能形成。因此,臉部特徵在正規化的feature map中更強烈。這導致在256^2^解析度中出現[過衝](https://terms.naer.edu.tw/detail/60aafddb0ab55990df9799852064111a/),隨後在更高的解析度中形成多個水滴瑕疵。根據我們的經驗,水滴在StyleGAN影像中缺失的情況很少見,實際上生成器完全依賴其存在。 ::: :::info ![image](https://hackmd.io/_uploads/rkxz4-jN0.png) Figure 16. Progressive growing leads to significantly higher frequency content in the intermediate layers. This compromises shift invariance of the network and makes it harder to localize features precisely in the higher-resolution layers. Figure 16. 
漸進增長導致中間網路層中higher frequency content非常明顯。這損害了網路的平移不變性,並且造成在較高解析度網路層中難以精確地定位特徵。 ::: ## B. Implementation details :::info We implemented our techniques on top of the official TensorFlow implementation of StyleGAN^5^ corresponding to configuration A in Table 1. We kept most of the details unchanged, including the dimensionality of $\mathcal{Z}$ and $\mathcal{W}(512)$, mapping network architecture (8 fully connected layers, $100\times$ lower learning rate), equalized learning rate for all trainable parameters [23], leaky ReLU activation with $\alpha=0.2$, bilinear filtering [49] in all up/downsampling layers [24], minibatch standard deviation layer at the end of the discriminator [23], exponential moving average of generator weights [23], style mixing regularization [24], nonsaturating logistic loss [16] with $R_1$ regularization [30], Adam optimizer [25] with the same hyperparameters ($\beta_1=0, \beta_2=0.99, \epsilon=10^{-8}$, minibatch = $32$), and training datasets [24, 44]. We performed all training runs on NVIDIA DGX-1 with 8 Tesla V100 GPUs using TensorFlow 1.14.0 and cuDNN 7.4.2. ::: :::success 我們使用官方的TensorFlow來實作StyleGAN^5^,對於Table 1的配置A。我們保持大多數細節不變,包含$\mathcal{Z}$與$\mathcal{W}$的維度$(512)$,映射網路的架構(8 fully connected layers、$100\times$ lower learning rate),所有可訓練參數都使用equalized learning rate、leaky ReLU activation with $\alpha=0.2$、所有up/downsampling網路層中的bilinear filtering、discriminator最終的minibatch standard deviation layer、生成器權重的指數移動平均,style mixing regularization、nonsaturating logistic loss with $R_1$ regularization、Adam optimizer with the same hyperparameters ($\beta_1=0, \beta_2=0.99, \epsilon=10^{-8}$, minibatch = $32$),以及訓練資料集。所有的訓練都是在NVIDIA DGX-1 with 8 Tesla V100 GPUs使用TensorFlow 1.14.0搭配cuDNN 7.4.2執行的。 ::: :::info ^5^ https://github.com/NVlabs/stylegan ::: :::info **Generator redesign** In configurations B–F we replace the original StyleGAN generator with our revised architecture. In addition to the changes highlighted in Section 2, we initialize components of the constant input $c_1$ using $\mathcal{N}(0, 1)$ and simplify the noise broadcast operations to use a single shared scaling factor for all feature maps. Similar to Karras et al. [24], we initialize all weights using $\mathcal{N}(0, 1)$ and all biases and noise scaling factors to zero, except for the biases of the affine transformation layers, which we initialize to one. We employ weight modulation and demodulation in all convolution layers, except for the output layers (tRGB in Figure 7) where we leave out the demodulation. With 1024^2^ output resolution, the generator contains a total of 18 affine transformation layers where the first one corresponds to 4^2^ resolution, the next two correspond to 8^2^ , and so forth. ::: :::success 配置B-F中,我們把原始的StyleGAN生成器替換成我們修訂過後的架構。除了Section 2中強調的變更之外,我們還使用$\mathcal{N}(0, 1)$來初始化常數輸入$c_1$的components,並簡化noise broadcast的操作,所有的feature maps使用一個共享的縮放因子。類似於Karras等人,我們使用$\mathcal{N}(0, 1)$初始化所有的權重,並且除了affine transformation layers的biase初始化為1之外,其它所有的biases與噪點縮放因子(noise scaling factors)都初始化為零。除了output layers(tRGB in Figure 7)省去解調(demodulation)之外,其餘所有的卷積層都採用權重的調變(modulation)、解調(demodulation)。使用1024^2^輸出解析度來看,生成器總共包含18 affine transformation layers,其中第一個網路層對應4^2^解析度,下一個對應8^2^,以此類推。 ::: :::info **Weight demodulation** Considering the practical implementation of Equations 1 and 3, it is important to note that the resulting set of weights will be different for each sample in a minibatch, which rules out direct implementation using standard convolution primitives. 
Instead, we choose to employ grouped convolutions [26] that were originally proposed as a way to reduce computational costs by dividing the input feature maps into multiple independent groups, each with their own dedicated set of weights. We implement Equations 1 and 3 by temporarily reshaping the weights and activations so that each convolution sees one sample with $N$ groups — instead of $N$ samples with one group. This approach is highly efficient because the reshaping operations do not actually modify the contents of the weight and activation tensors. ::: :::success **Weight demodulation** 考慮到方程式1、3實務上的實現,minibatch中每一個樣本所得到的權重集合都是不一樣的,這排除了使用標準卷積[基元](https://terms.naer.edu.tw/detail/def1d4459e4e9a2869a7bf4946c38d3e/)直接實現的可能。因此,我們選擇採用grouped convolutions,這種作法最初是為了降低計算成本而提出的,做法是將input feature maps切割為多個獨立群組,每一個群組都有自己專屬的權重。我們透過暫時性地[重塑](https://terms.naer.edu.tw/detail/64d0436f8f0a87d4990e299789c06eee/)權重與啟動值(activations)的方式來實作方程式1、3,因此每個卷積看到的是一個具有$N$個群組的樣本,而不是$N$個各為一個群組的樣本。這個方法非常有效率,因為重塑操作並不會真正更動權重與activation tensors的內容。 :::
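:::warning
A minimal NumPy sketch of how the per-sample modulated and demodulated weights could be formed before they are folded into a single grouped convolution. The index convention (styles scale the input feature maps, Eq. 1; demodulation divides by the per-output-channel norm, Eq. 3) follows Section 2 of the paper; the array shapes and the `eps` value are illustrative assumptions, not the official implementation.
```python
import numpy as np

def modulated_weights(weights, styles, eps=1e-8):
    """Per-sample weight modulation (Eq. 1) and demodulation (Eq. 3).

    weights: [out_ch, in_ch, kh, kw]  shared convolution kernel
    styles:  [batch, in_ch]           per-sample scales s_i from the affine layer
    returns: [batch, out_ch, in_ch, kh, kw] per-sample kernels
    """
    w = weights[None] * styles[:, None, :, None, None]                        # Eq. 1
    demod = 1.0 / np.sqrt((w ** 2).sum(axis=(2, 3, 4), keepdims=True) + eps)  # Eq. 3
    return w * demod

# In the training code the batch axis of these kernels is merged into the channel
# axis, so one grouped convolution (groups = batch size) applies a different
# kernel to every sample without copying any tensors.
rng = np.random.default_rng(0)
w = modulated_weights(rng.standard_normal((64, 32, 3, 3)), rng.standard_normal((4, 32)))
print(w.shape)  # (4, 64, 32, 3, 3)
```
:::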
:::info **Lazy regularization** In configurations C–F we employ lazy regularization (Section 3.1) by evaluating the regularization terms ($R_1$ and path length) in a separate regularization pass that we execute once every $k$ training iterations. We share the internal state of the Adam optimizer between the main loss and the regularization terms, so that the optimizer first sees gradients from the main loss for $k$ iterations, followed by gradients from the regularization terms for one iteration. To compensate for the fact that we now perform $k+1$ training iterations instead of $k$, we adjust the optimizer hyperparameters $\lambda'=c\cdot\lambda,\beta_1'=(\beta_1)^c$, and $\beta_2'=(\beta_2)^c$, where $c = k/(k + 1)$. We also multiply the regularization term by $k$ to balance the overall magnitude of its gradients. We use $k = 16$ for the discriminator and $k = 8$ for the generator. ::: :::success **Lazy regularization** 在配置C-F中,我們採用lazy regularization (Section 3.1),透過每$k$次訓練迭代執行一次單獨的regularization pass的方式來評估正規化項目($R_1$ and path length)。我們在主要的損失(loss)與正規化項目之間共享Adam optimizer的內部狀態,這樣optimizer會先看到$k$次迭代中來自主要loss的梯度,接著是一次迭代中來自正規化項目的梯度。為了彌補現在是執行$k+1$次訓練迭代而不是$k$次的這個事實,我們調整optimizer的超參數$\lambda'=c\cdot\lambda,\beta_1'=(\beta_1)^c$、$\beta_2'=(\beta_2)^c$,其中$c = k/(k + 1)$。我們還把正規化項目乘上$k$來平衡其梯度的整體[大小](https://terms.naer.edu.tw/detail/24690b19008bf9564d5efa32b3421972/)。discriminator採用$k = 16$,而generator則是$k = 8$。 ::: :::info **Path length regularization** Configurations D–F include our new path length regularizer (Section 3.2). We initialize the target scale $a$ to zero and track it on a per-GPU basis as the exponential moving average of $\Vert \mathbf{J}^T_\mathbf{w}\mathbf{y} \Vert_2$ using decay coefficient $\beta_{pl}=0.99$. We weight our regularization term by $$ \gamma_{pl}=\dfrac{\ln 2}{r^2(\ln r - \ln 2)} \tag{5} $$ where $r$ specifies the output resolution (e.g. $r = 1024$). We have found these parameter choices to work reliably across all configurations and datasets. To ensure that our regularizer interacts correctly with style mixing regularization, we compute it as an average of all individual layers of the synthesis network. Appendix C provides detailed analysis of the effects of our regularizer on the mapping between $\mathcal{W}$ and image space. ::: :::success **Path length regularization** 配置D-F包含我們新的path length regularizer (Section 3.2)。我們將目標尺度$a$初始化為0,並以每一GPU為基礎,用衰減係數$\beta_{pl}=0.99$將其追蹤為$\Vert \mathbf{J}^T_\mathbf{w}\mathbf{y} \Vert_2$的指數移動平均值。我們用下面公式加權正規化項: $$ \gamma_{pl}=\dfrac{\ln 2}{r^2(\ln r - \ln 2)} \tag{5} $$ 其中$r$指的是輸出解析度(如$r = 1024$)。我們發現到,這些參數的選擇可以在所有的配置與資料集上都可靠地執行。為了確保我們的regularizer跟style mixing regularization的交互正確,我們將之計算為合成網路的所有各別網路層的平均值。Appendix C詳細地分析regularizer對於$\mathcal{W}$與影像空間之間映射的影響。 ::: :::info **Progressive growing** In configurations A–D we use progressive growing with the same parameters as Karras et al. [24] (start at 8^2^ resolution and learning rate $\lambda=10^{-3}$, train for 600k images per resolution, fade in next resolution for 600k images, increase learning rate gradually by $3\times$). In configurations E–F we disable progressive growing and set the learning rate to a fixed value $\lambda=2\times 10^{-3}$, which we found to provide the best results. In addition, we use output skips in the generator and residual connections in the discriminator as detailed in Section 4.1. ::: :::success **Progressive growing** 在配置A-D之中,我們使用跟Karras等人相同參數的漸進增長(從解析度8^2^開始,學習效率為$\lambda=10^{-3}$,每個解析度以600k張照片訓練,以600k張照片漸入下一個解析度,學習效率逐步地增加$3\times$)。配置E-F中,我們取消漸進增長,並且將學習效率固定為$\lambda=2\times 10^{-3}$,我們發現到,這樣可以給出最好的結果。此外,我們在generator中使用output skips,然後在discriminator使用residual connections,詳見Section 4.1。 ::: :::info **Dataset-specific tuning** Similar to Karras et al. [24], we augment the FFHQ dataset with horizontal flips to effectively increase the number of training images from 70k to 140k, and we do not perform any augmentation for the LSUN datasets. We have found that the optimal choices for the training length and $R_1$ regularization weight $\gamma$ tend to vary considerably between datasets and configurations. We use $\gamma=10$ for all training runs except for configuration E in Table 1, as well as LSUN CHURCH and LSUN HORSE in Table 3, where we use $\gamma=100$. It is possible that further tuning of $\gamma$ could provide additional benefits. ::: :::success **Dataset-specific tuning** 類似於Karras等人的研究,我們對FFHQ資料集做水平翻轉的增強,用這樣的方式將訓練照片從70k增加到140k,LSUN就沒有做任何的資料增強。我們發現到,訓練長度與$R_1$正規化權重$\gamma$的最佳選擇在不同資料集與配置之間有著很大的差異。除了Table 1中的配置E以及Table 3中的LSUN CHURCH與LSUN HORSE使用$\gamma=100$之外,其它都使用$\gamma=10$。進一步微調$\gamma$有可能帶來額外的好處。 ::: :::info **Performance optimizations** We profiled our training runs extensively and found that — in our case — the default primitives for image filtering, up/downsampling, bias addition, and leaky ReLU had surprisingly high overheads in terms of training time and GPU memory footprint. This motivated us to optimize these operations using hand-written CUDA kernels. We implemented filtered up/downsampling as a single fused operation, and bias and activation as another one. In configuration E at 1024^2^ resolution, our optimizations improved the overall training time by about 30% and memory footprint by about 20%. ::: :::success **Performance optimizations** 我們大量地分析了訓練執行情況,發現到在我們的案例中,用於影像濾波、up/downsampling、bias addition與leaky ReLU的預設基元在訓練時間和GPU記憶體佔用方面存在意外的高開銷。這促使我們以手寫的CUDA kernels來最佳化這些操作。我們將filtered up/downsampling實作為一個融合操作,將bias和activation作為另一個操作。在解析度1024^2^的配置E中,我們的最佳化使得整體訓練時間縮短了約30%,記憶體佔用減少了約20%。 ::: ## C.
Effects of path length regularization :::info The path length regularizer described in Section 3.2 is of the form: $$ \mathcal{L}_{pl}=\mathbb{E}_{\mathbf{w}}\mathbb{E}_{\mathbf{y}}(\Vert \mathbf{J}^T_\mathbf{w}\mathbf{y} \Vert_2 - a)^2\tag{6} $$ here $\mathbf{y} \in \mathbb{R}^M$ is a unit normal distributed random variable in the space of generated images (of dimension $M = 3wh$, namely the RGB image dimensions), $\mathbf{J}_{\mathbf{w}} \in \mathbb{R}^{M\times L}$ is the Jacobian matrix of the generator function $g:\mathbb{R}^L \mapsto \mathbb{R}^M$ at a latent space point $\mathbf{w}\in\mathbb{R}^L$, and $a\in\mathbb{R}$ is a global value that expresses the desired scale of the gradient ::: :::success Section 3.2中所描述的path length regularizer的形式為: $$ \mathcal{L}_{pl}=\mathbb{E}_{\mathbf{w}}\mathbb{E}_{\mathbf{y}}(\Vert \mathbf{J}^T_\mathbf{w}\mathbf{y} \Vert_2 - a)^2\tag{6} $$ 這邊的$\mathbf{y} \in \mathbb{R}^M$是生成影像空間中(維度為$M = 3wh$,即RGB影像空間)的一個單位正態分佈的隨機變數,$\mathbf{J}_{\mathbf{w}} \in \mathbb{R}^{M\times L}$為生成器函數$g:\mathbb{R}^L \mapsto \mathbb{R}^M$於潛在空間點$\mathbf{w}\in\mathbb{R}^L$的Jacobian matrix,$a\in\mathbb{R}$為全域值,表示期望的梯度尺度。 ::: ### C.1. Effect on pointwise Jacobians :::info The value of this prior is minimized when the inner expectation over $\mathbf{y}$ is minimized at every latent space point $\mathbf{w}$ separately. In this subsection, we show that the inner expectation is (approximately) minimized when the Jacobian matrix $\mathbf{J}_{\mathbf{w}}$ is orthogonal, up to a global scaling factor. The general strategy is to use the well-known fact that, in high dimensions $L$, the density of a unit normal distribution is concentrated on a spherical shell of radius $\sqrt{L}$. The inner expectation is then minimized when the matrix $\mathbf{J}^T_{\mathbf{w}}$ scales the function under expectation to have its minima at this radius. This is achieved by any orthogonal matrix (with suitable global scale that is the same at every $\mathbf{w}$). ::: :::success 當$\mathbf{y}$上的內部期望值在每個潛在空間點$\mathbf{w}$都是最小的時候,那這個先驗的值就是最小的。在這個subsection中,我們說明了當Jacobian matrix $\mathbf{J}_{\mathbf{w}}$是正交的時候,那內部期望值就會(近似地)最小,一直達到全域縮放因子。一般策略是利用眾所周知的事實,也就是在高維度$L$中,單位正態分佈的密度集中在半徑為$\sqrt{L}$的球殼上。當矩陣$\mathbf{J}^T{\mathbf{w}}$將期望下的函數縮放到在這個半徑處有其最小值時,內部期望值就會被最小化。這是由任何[正交矩陣](https://terms.naer.edu.tw/detail/055df0d6ce2c61ce2f2e2c5b40827b72/)(具有適當的全域縮放,且於每個$\mathbf{w}$上相同)來實現的。 ::: :::info We begin by considering the inner expectation $$ \mathcal{L}_{\mathbf{w}} := \mathbb{E}_\mathbf{y}(\Vert \mathbf{J}^T_{\mathbf{w}}\mathbf{y} \Vert_2 - a)^2 $$ We first note that the radial symmetry of the distribution of \mathbf{y}, as well as of the l2 norm, allows us to focus on diagonal matrices only. This is seen using the Singular Value Decomposition $\mathbf{J}^T_{\mathbf{w}} = \mathbf{U}\tilde{\mathbf{\Sigma}}\mathbf{V}^T$ , where $\mathbf{U}\in\mathbb{R}^{L\times L}$ and $\mathbf{V}\in\mathbb{R}^{M\times M}$ are orthogonal matrices, and $\tilde{\mathbf{\Sigma}} = [\mathbf{\Sigma}\space 0]$ is a horizontal concatenation of a diagonal matrix $\mathbf{\Sigma}\in\mathbb{R}^{L\times L}$ and a zero matrix $\mathbf{0}\in\mathbb{R}^{L\times(M-L)}$ [15]. 
Because rotating a unit normal random variable by an orthogonal matrix leaves the distribution unchanged, and rotating a vector leaves its norm unchanged, the expression simplifies to $$ \begin{align} \mathcal{L}_{\mathbf{w}} &= \mathbb{E}_\mathbf{y}(\Vert \mathbf{U}\tilde{\mathbf{\Sigma}}\mathbf{V}^T\mathbf{y}\Vert_2 - a)^2 \\ &= \mathbb{E}_\mathbf{y}(\Vert \tilde{\mathbf{\Sigma}}\mathbf{y} \Vert_2 - a)^2 \end{align} $$ Furthermore, the zero matrix in $\tilde{\mathbf{\Sigma}}$ drops the dimensions of $\mathbf{y}$ beyond $L$, effectively marginalizing its distribution over those dimensions. The marginalized distribution is again a unit normal distribution over the remaining $L$ dimensions. We are then left to consider the minimization of the expression $$ \mathcal{L}_{\mathbf{w}} = \mathbb{E}_\tilde{\mathbf{y}}(\Vert \mathbf{\Sigma}\tilde{\mathbf{y}} \Vert_2 - a)^2 $$ over diagonal square matrices $\mathbf{\Sigma} \in \mathbb{R}^{L\times L}$, where $\tilde{\mathbf{y}}$ is unit normal distributed in dimension $L$. To summarize, all matrices $\mathbf{J}^T_{\mathbf{w}}$ that share the same singular values with $\mathbf{\Sigma}$ produce the same value for the original loss. ::: :::success 我們先來考慮內部期望值 $$ \mathcal{L}_{\mathbf{w}} := \mathbb{E}_\mathbf{y}(\Vert \mathbf{J}^T_{\mathbf{w}}\mathbf{y} \Vert_2 - a)^2 $$ 我們首先注意到$\mathbf{y}$的分佈以及l2 norm的[徑向對稱](https://terms.naer.edu.tw/detail/3e5ac8a2e8f625534232967fd43b62e4/),這讓我們可以單純的關注在對角矩陣上。我們可以用奇異值分解$\mathbf{J}^T_{\mathbf{w}} = \mathbf{U}\tilde{\mathbf{\Sigma}}\mathbf{V}^T$來看,其中$\mathbf{U}\in\mathbb{R}^{L\times L}$跟$\mathbf{V}\in\mathbb{R}^{M\times M}$是正交矩陣(orthogonal matrices),而$\tilde{\mathbf{\Sigma}} = [\mathbf{\Sigma}\space 0]$則是對角矩陣$\mathbf{\Sigma}\in\mathbb{R}^{L\times L}$與零矩陣$\mathbf{0}\in\mathbb{R}^{L\times(M-L)}$的水平串接。因為透過正交矩陣來旋轉單位正態隨機變數會讓分佈維持不變的,而旋轉向量則是維持其範數不變,因此表達式簡化為 $$ \begin{align} \mathcal{L}_{\mathbf{w}} &= \mathbb{E}_\mathbf{y}(\Vert \mathbf{U}\tilde{\mathbf{\Sigma}}\mathbf{V}^T\mathbf{y}\Vert_2 - a)^2 \\ &= \mathbb{E}_\mathbf{y}(\Vert \tilde{\mathbf{\Sigma}}\mathbf{y} \Vert_2 - a)^2 \end{align} $$ 另外,$\tilde{\mathbf{\Sigma}}$裡面的零矩陣讓$\mathbf{y}$的維度降到$L$之外,有效地邊緣化了其在這些維數上的分佈。邊緣化後的分佈在$L$的維度上仍然是單位正態分佈。然後讓我們考慮表達式的最小化 $$ \mathcal{L}_{\mathbf{w}} = \mathbb{E}_\tilde{\mathbf{y}}(\Vert \mathbf{\Sigma}\tilde{\mathbf{y}} \Vert_2 - a)^2 $$ 在對角方陣(diagonal square matrix)$\mathbf{\Sigma} \in \mathbb{R}^{L\times L}$上,其中$\tilde{\mathbf{y}}$在維度$L$中是單位正態分佈。總的來說,跟$\mathbf{\Sigma}$有相同奇異值的所有矩陣$\mathbf{J}^T_{\mathbf{w}}$對原始的損失都會產生相同的值。 ::: :::info Next, we show that this expression is minimized when the diagonal matrix $\mathbf{\Sigma}$ has a specific identical value at every diagonal entry, i.e., it is a constant multiple of an identity matrix. We first write the expectation as an integral over the probability density of $\tilde{\mathbf{y}}$: $$ \begin{align} \mathcal{L}_{\mathbf{w}} &= \int(\Vert \mathbf{\Sigma}\tilde{\mathbf{y}} \Vert_2-a)^2p_\tilde{\mathbf{y}}(\tilde{\mathbf{y}})d\tilde{\mathbf{y}} \\ &= (2\pi)^{-\frac{L}{2}} \int(\Vert \mathbf{\Sigma}\tilde{\mathbf{y}} \Vert_2-a)^2\exp(-\dfrac{\tilde{\mathbf{y}}^T\tilde{\mathbf{y}}}{2})d\tilde{\mathbf{y}} \end{align} $$ Observing the radially symmetric form of the density, we change into a polar coordinates $\tilde{\mathbf{y}}=r \phi$, where $r \in \Bbb R_+$ is the distance from origin, and $\phi \in \Bbb S^{L-1}$ is a unit vector, i.e., a point on the $L − 1$-dimensional unit sphere. 
This change of variables introduces a Jacobian factor $r^{L-1}$: $$ \tilde{\mathcal{L}}_{\mathbf{w}} = (2\pi)^{-\frac{L}{2}}\int_{\Bbb S}\int_0^\infty(r\Vert \Sigma\phi \Vert_2 - a)^2r^{L-1} \exp(-\dfrac{r^2}{2})\text{d}r \text{ d}\phi $$ ::: :::success 接下來,我們要來說明,當diagonal matrix$\mathbf{\Sigma}$的每個對角元素具有特定相同值的時候,也就是它是單位矩陣的常數倍時,該表達式被最小化。我們首先將期望值寫成 $\tilde{\mathbf{y}}$的機率密度的積分形式: $$ \begin{align} \mathcal{L}_{\mathbf{w}} &= \int(\Vert \mathbf{\Sigma}\tilde{\mathbf{y}} \Vert_2-a)^2p_\tilde{\mathbf{y}}(\tilde{\mathbf{y}})d\tilde{\mathbf{y}} \\ &= (2\pi)^{-\frac{L}{2}} \int(\Vert \mathbf{\Sigma}\tilde{\mathbf{y}} \Vert_2-a)^2\exp(-\dfrac{\tilde{\mathbf{y}}^T\tilde{\mathbf{y}}}{2})d\tilde{\mathbf{y}} \end{align} $$ 觀察到概率密度的徑向對稱(radially symmetric)形式,我們將 $\tilde{\mathbf{y}}$轉換為[極座標](https://terms.naer.edu.tw/detail/9ff78d5b85a9f2fbe6fda0913debf575/) $\tilde{\mathbf{y}}=r \phi$,其中 $r \in \Bbb R_+$ 是到原點的距離,$\phi \in \Bbb S^{L-1}$ 是一個單位向量,即在$L-1$維[單位球面](https://terms.naer.edu.tw/detail/07f3664a7f775b1350ff7cdaca5c0d7e/)上的一個點。這種變數的改變引入了一個Jacobian factor$r^{L-1}$: $$ \tilde{\mathcal{L}}_{\mathbf{w}} = (2\pi)^{-\frac{L}{2}}\int_{\Bbb S}\int_0^\infty(r\Vert \Sigma\phi \Vert_2 - a)^2r^{L-1} \exp(-\dfrac{r^2}{2})\text{d}r \text{ d}\phi $$ ::: :::info The probability density $(2\pi)^{L/2}r^{L-1}\exp(-\dfrac{r^2}{2})$ is then an $L$-dimensional unit normal density expressed in polar coordinates, dependent only on the radius and not on the angle. A standard argument by Taylor approximation shows that when $L$ is high, for any $\phi$ the density is well approximated by density $(2\pi e/L)^{-L/2}\exp(-\frac{1}{2}(r-\mu)^2/\sigma^2)$ which is a (unnormalized) one-dimensional normal density in $r$, centered at $\mu=\sqrt{L}$ of standard deviation $\sigma=1/\sqrt{2}$[4]. In other words, the density of the $L$-dimensional unit normal distribution is concentrated on a shell of radius $\sqrt{L}$. Substituting this density into the integral, the loss becomes approximately $$ \mathcal{L}_\mathbf{w} \approx (2\pi e/L)^{-L/2} \int_{\Bbb S}\int_0^\infty(r\Vert \Sigma \phi \Vert_2 - a)^2 \exp \bigg(-\dfrac{(r-\sqrt{L})^2}{2\sigma^2}\bigg) \text{d}r \text{ d}\phi \tag{7} $$ where the approximation becomes exact in the limit of infinite dimension $L$. ::: :::success 機率密度$(2\pi)^{L/2}r^{L-1}\exp(-\dfrac{r^2}{2})$在極座標中是一個$L$維的單位正態密度的表示,只取決於半徑而跟角度無關。透過Taylor approximation的標準論證說明著,當$L$很大的時候,對任意的$\phi$來說,其密度都可以很好的被密度$(2\pi e/L)^{-L/2}\exp(-\frac{1}{2}(r-\mu)^2/\sigma^2)$近似,這是一個在$r$中心位於 $\mu=\sqrt{L}$,且標準差為$\sigma=1/\sqrt{2}$的(未標準化的)一維正態密度。 ::: :::info To minimize this loss, we set $\Sigma$ such that the function $(r \Vert\Sigma\phi\Vert_2 - a)^2$ obtains minimal values on the spherical shell of radius $\sqrt{L}$. This is achieved by $\Sigma=\frac{a}{\sqrt{L}}\bf I$, whereby the function becomes constant in $\phi$ and the expression reduces to $$ \mathcal{L}_\mathbf{w} \approx (2\pi e/L)^{-L/2} \mathcal{A}(\Bbb S)a^2L^{-1}\int_0^\infty(r - \sqrt{L})^2 \exp \bigg(-\dfrac{(r-\sqrt{L})^2}{2\sigma^2}\bigg) \text{d}r, $$ where $\mathcal{A}(\Bbb S)$ is the surface area of the unit sphere (and like the other constant factors, irrelevant for minimization). Note that the zero of the parabola $(r-\sqrt{L})^2$ coincides with the maximum of the probability density, and therefore this choice of $\Sigma$ minimizes the inner integral in Eq.7 separately for every $\phi$. 
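:::
:::warning
A small numerical check of the two facts used above, under the assumption of a toy dimension $L=512$: the norm of a unit normal vector concentrates around $\sqrt{L}$ with standard deviation close to $1/\sqrt{2}$, and an equal spectrum $\Sigma=\frac{a}{\sqrt{L}}\mathbf{I}$ gives a smaller value of $\mathbb{E}(\Vert\Sigma\tilde{\mathbf{y}}\Vert_2-a)^2$ than an uneven spectrum of the same average scale. The sample count and the uneven spectrum are arbitrary choices for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
L, a, n = 512, 1.0, 10_000
y = rng.standard_normal((n, L))

norms = np.linalg.norm(y, axis=1)
print(norms.mean(), np.sqrt(L))   # ~22.6 vs 22.63: concentration on the sqrt(L) shell
print(norms.std())                # ~0.71, close to 1/sqrt(2)

def loss(sigma_diag):
    # Monte Carlo estimate of E(||diag(sigma) y|| - a)^2
    return ((np.linalg.norm(y * sigma_diag, axis=1) - a) ** 2).mean()

equal = np.full(L, a / np.sqrt(L))                # Sigma = (a / sqrt(L)) I
uneven = equal * rng.uniform(0.5, 1.5, size=L)    # same average scale, unequal entries
print(loss(equal), loss(uneven))                  # the equal spectrum gives the smaller loss
```
:::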
:::success 為了最小化loss,我們設置$\Sigma$讓函數$(r \Vert\Sigma\phi\Vert_2 - a)^2$在半徑為$\sqrt{L}$的球殼上得到最小值。這是透過設置$\Sigma=\frac{a}{\sqrt{L}}\bf I$實現的,這使得函數在$\phi$上變成常數,表達式簡化為 $$ \mathcal{L}_\mathbf{w} \approx (2\pi e/L)^{-L/2} \mathcal{A}(\Bbb S)a^2L^{-1}\int_0^\infty(r - \sqrt{L})^2 \exp \bigg(-\dfrac{(r-\sqrt{L})^2}{2\sigma^2}\bigg) \text{d}r, $$ 其中$\mathcal{A}(\Bbb S)$是單位球面的表面積(就像其它[常數因子](https://terms.naer.edu.tw/detail/5e3dcc4ddded3b3bc30d36864408ca04/)那樣,對最小化是無關緊要的)。注意到,拋物線$(r-\sqrt{L})^2$的零點跟機率密度的最大值是吻合的,所以這種$\Sigma$的選擇讓方程式7中的內層積分(inner integral)對每個$\phi$都是最小化的。 ::: :::info In summary, we have shown that — assuming a high dimensionality $L$ of the latent space — the value of the path length prior (Eq. 6) is minimized when all singular values of the Jacobian matrix of the generator are equal to a global constant, at every latent space point $\bf w$, i.e., they are orthogonal up to a globally constant scale. ::: :::success 總的來說,我們已經說明了,在假設潛在空間維度$L$很高的情況下,當生成器的Jacobian matrix在每個潛在空間點$\bf w$的所有奇異值都等於同一個全域常數(global constant)時,也就是它們在相差一個全域常數比例的意義下是正交的時候,path length prior(方程式6)的值就會是最小的。 ::: :::info While in theory $a$ merely scales the values of the mapping without changing its properties and could be set to a fixed value (e.g., 1), in practice it does affect the dynamics of the training. If the imposed scale does not match the scale induced by the random initialization of the network, the training spends its critical early steps in pushing the weights towards the required overall magnitudes, rather than enforcing the actual objective of interest. This may degrade the internal state of the network weights and lead to sub-optimal performance in later training. Empirically we find that setting a fixed scale reduces the consistency of the training results across training runs and datasets. Instead, we set $a$ dynamically based on a running average of the existing scale of the Jacobians, namely $a\approx\Bbb E_{\mathbf{w},\mathbf{y}}\Vert \mathbf{J}^T_\mathbf{w}\mathbf{y} \Vert_2$. With this choice the prior targets the scale of the local Jacobians towards whatever global average already exists, rather than forcing a specific global average. This also eliminates the need to measure the appropriate scale of the Jacobian explicitly, as is done by Odena et al. [33] who consider a related conditioning prior. ::: :::success 儘管理論上$a$只是縮放映射的值而不改變其屬性,並且可以設置為固定值(像是1),不過實務上它確實會影響訓練動態。如果施加的尺度跟網路隨機初始化所引起的尺度不匹配的話,那訓練的關鍵早期步驟會將權重推向所需的整體幅度,而不是強制執行實際感興趣的目標。這可能會傷害到網路權重的內部狀態,從而導致後期訓練得到次佳的效能。經驗上我們發現到,設置一個固定的尺度會降低不同訓練執行與資料集之間訓練結果的一致性。相反的,我們根據Jacobians現有尺度的[移動平均值](https://terms.naer.edu.tw/detail/8fb087cead4d3d9a5b624c6a0e0bb049/)來動態地設置$a$,也就是$a\approx\Bbb E_{\mathbf{w},\mathbf{y}}\Vert \mathbf{J}^T_\mathbf{w}\mathbf{y} \Vert_2$。有了這個選擇,prior會把local Jacobians的尺度目標帶往現有的全域平均,而不是強制使用一個特定的全域平均。這同時消除了明確量測Jacobian適當尺度的必要性,如Odena等人所考慮的相關條件先驗所做的那樣。 :::
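:::warning
A minimal NumPy sketch of the running-average target $a$ and the resulting penalty of Eq. 6, using a linear toy generator so that $\mathbf{J}^T_\mathbf{w}\mathbf{y}$ can be written down directly; with a real synthesis network this product is obtained by backpropagating the scalar inner product of $g(\mathbf{w})$ and $\mathbf{y}$ with respect to $\mathbf{w}$. The toy dimensions, the linear generator `A`, and the batch size are illustrative assumptions; the decay 0.99 and the weight $\gamma_{pl}$ of Eq. 5 follow Appendix B.
```python
import numpy as np

rng = np.random.default_rng(0)
r = 64                                   # toy output resolution
L, M = 512, 3 * r * r                    # latent dim, image dim
A = rng.standard_normal((M, L)).astype(np.float32) / np.sqrt(L)  # toy linear generator g(w) = A w

gamma_pl = np.log(2) / (r**2 * (np.log(r) - np.log(2)))   # Eq. 5
beta_pl, pl_mean = 0.99, 0.0                              # EMA decay and running target a

def path_length_penalty(w_batch):
    global pl_mean
    y = rng.standard_normal((w_batch.shape[0], M)).astype(np.float32)  # unit normal image-space noise
    jty = y @ A                                  # J_w^T y; for this linear toy, J_w = A for every w
    lengths = np.linalg.norm(jty, axis=1)        # ||J_w^T y||_2 per sample
    pl_mean = beta_pl * pl_mean + (1 - beta_pl) * lengths.mean()   # EMA target a
    return gamma_pl * ((lengths - pl_mean) ** 2).mean()            # Eq. 6 weighted by Eq. 5

print(path_length_penalty(rng.standard_normal((4, L))))
```
:::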
:::info Figure 17 shows empirically measured magnitudes of singular values of the Jacobian matrix for networks trained with and without path length regularization. While orthogonality is not reached, the eigenvalues of the regularized network are closer to one another, implying better conditioning, with the strength of the effect correlated with the PPL metric (Table 1). ::: :::success Figure 17說明了使用與不使用path length regularization訓練的網路的Jacobian matrix的奇異值的量測幅度。雖然未達到[正交](https://terms.naer.edu.tw/detail/0e0a25d2e7975c710aeebddfcbe4507a/),但正規化網路的[特徵值](https://terms.naer.edu.tw/detail/be6e931f514c6780a21c1d9ae5a11b00/)彼此更接近了,這意味著條件較佳,且這種效果的強度與PPL指標(Table 1)相關。 ::: :::info ![image](https://hackmd.io/_uploads/SkSxaHYU0.png) Figure 17. The mean and standard deviation of the magnitudes of sorted singular values of the Jacobian matrix evaluated at random latent space points $\bf w$, with largest eigenvalue normalized to 1. In both datasets, path length regularization (Config D) and novel architecture (Config F) exhibit better conditioning; notably, the effect is more pronounced in the Cars dataset that contains much more variability, and where path length regularization has a relatively stronger effect on the PPL metric (Table 1). ::: ### C.2. Effect on global properties of generator mapping :::info In the previous subsection, we found that the prior encourages the Jacobians of the generator mapping to be everywhere orthogonal. While Figure 17 shows that the mapping does not satisfy this constraint exactly in practice, it is instructive to consider what global properties the constraint implies for mappings that do. Without loss of generality, we assume unit global scale for the matrices to simplify the presentation. ::: :::success 在前一小節中,我們發現prior鼓勵生成器映射的Jacobians在每個地方都是正交的。儘管Figure 17說明著,這個映射實際上並不完全滿足這一個約束,但考慮該約束對滿足這一條件的映射所暗示的全域性質是有啟發意義的。在不失一般性的情況下,我們假設矩陣具有單位全域尺度以簡化表述。 ::: :::info The key property is that a mapping $g: \Bbb R^L \mapsto \Bbb R^M$ with everywhere orthogonal Jacobians preserves the lengths of curves. To see this, let $u:[t_0,t_1]\mapsto \Bbb R^L$ parametrize a curve in the latent space. Mapping the curve through the generator $g$, we obtain a curve $\tilde{u}=g \circ u$ in the space of images. Its arc length is $$ L=\int_{t_0}^{t_1}\vert \tilde{u}'(t) \vert \text{d}t \tag{8} $$ where prime denotes derivative with respect to $t$. By chain rule, this equals $$ L=\int_{t_0}^{t_1}\vert J_g(u(t))u'(t) \vert \text{d}t \tag{9} $$ where $J_g \in\Bbb R^{M\times L}$ is the Jacobian matrix of $g$ evaluated at $u(t)$. By our assumption, the Jacobian is orthogonal, and consequently it leaves the 2-norm of the vector $u'(t)$ unaffected: $$ L=\int_{t_0}^{t_1} \vert u'(t) \vert \text{d}t \tag{10} $$ This is the length of the curve $u$ in the latent space, prior to mapping with $g$. Hence, the lengths of $u$ and $\tilde{u}$ are equal, and so $g$ preserves the length of any curve. :::
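:::warning
A quick numerical illustration of the argument above: for a linear map whose Jacobian has orthonormal columns ($J^\top J=\mathbf{I}$), the discretized arc length of a latent-space curve is unchanged after mapping. The dimensions and the circular test curve are arbitrary choices for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
L, M = 8, 32
Q, _ = np.linalg.qr(rng.standard_normal((M, L)))   # J with orthonormal columns (J^T J = I)

t = np.linspace(0.0, 2 * np.pi, 2001)
u = np.zeros((t.size, L))
u[:, 0], u[:, 1] = np.cos(t), np.sin(t)            # a unit circle in latent space

def arc_length(curve):
    # discrete version of Eq. 8: sum of segment lengths
    return np.linalg.norm(np.diff(curve, axis=0), axis=1).sum()

print(arc_length(u), arc_length(u @ Q.T))          # both ~ 2*pi: length is preserved
```
:::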
:::success 關鍵的特性是:具有處處正交的Jacobians的映射$g: \Bbb R^L \mapsto \Bbb R^M$會保持曲線的長度。為了說明這一點,我們假設$u:[t_0,t_1]\mapsto \Bbb R^L$在潛在空間中參數化了一條曲線。透過生成器$g$映射這條曲線,我們就可以得到一個在影像空間中的曲線$\tilde{u}=g \circ u$。其弧長為 $$ L=\int_{t_0}^{t_1}\vert \tilde{u}'(t) \vert \text{d}t \tag{8} $$ 其中prime符號(′)表示對$t$的導數。根據鏈式法則,這等價於 $$ L=\int_{t_0}^{t_1}\vert J_g(u(t))u'(t) \vert \text{d}t \tag{9} $$ 其中$J_g \in\Bbb R^{M\times L}$就是$g$在$u(t)$處的Jacobian matrix。根據我們的假設,Jacobian是正交的,因此它保持向量$u'(t)$的2-norm不變: $$ L=\int_{t_0}^{t_1} \vert u'(t) \vert \text{d}t \tag{10} $$ 這是在以$g$映射之前,於潛在空間中曲線$u$的長度。因此,$u$的長度跟$\tilde{u}$的長度相等,這意味著$g$保持任何曲線的長度。 ::: :::info In the language of differential geometry, $g$ isometrically embeds the Euclidean latent space $\Bbb R^L$ into a submanifold $\mathcal{M}$ in $\Bbb R^M$ — e.g., the manifold of images representing faces, embedded within the space of all possible RGB images. A consequence of isometry is that straight line segments in the latent space are mapped to geodesics, or shortest paths, on the image manifold: a straight line $v$ that connects two latent space points cannot be made any shorter, so neither can there be a shorter on-manifold image-space path between the corresponding images than $g \circ u$. For example, a geodesic on the manifold of face images is a continuous morph between two faces that incurs the minimum total amount of change (as measured by $l_2$ difference in RGB space) when one sums up the image difference in each step of the morph. ::: :::success 在[微分幾何](https://terms.naer.edu.tw/detail/e4af21e19ad3c7747df0c3321d10d42d/)的語言中,$g$等距地將Euclidean latent space $\Bbb R^L$嵌入$\Bbb R^M$中的submanifold $\mathcal{M}$,舉例來說,表示臉的影像的流形,嵌入到所有可能的RGB影像空間中。等距的結果就是,潛在空間中的直線段(straight line segments)被映射到影像流形上的[測地線](https://terms.naer.edu.tw/detail/bbf3d32f7dd2fe07bc81fa0346f1681b/),也就是最短路徑:連接兩個潛在空間點的直線$v$不能被縮短,因此在對應的影像之間也不可能有比$g \circ u$更短的on-manifold image-space path。例如,臉部影像流形上的測地線是一個在兩張臉孔之間連續變形的過程,當我們在變形的每一步中累加影像差異(以RGB空間中的$l_2$差異來衡量)時,它引起的總變化量最小。 ::: :::info Isometry is not achieved in practice, as demonstrated in empirical experiments in the previous subsection. The full loss function of the training is a combination of potentially conflicting criteria, and it is not clear if a genuinely isometric mapping would be capable of expressing the image manifold of interest. Nevertheless, a pressure to make the mapping as isometric as possible has desirable consequences. In particular, it discourages unnecessary “detours”: in a nonconstrained generator mapping, a latent space interpolation between two similar images may pass through any number of distant images in RGB space. With regularization, the mapping is encouraged to place distant images in different regions of the latent space, so as to obtain short image paths between any two endpoints. ::: :::success 實務上並沒有辦法達到真正的等距,這在前一小節中的實驗已有證明。訓練的完整損失函數是多個潛在衝突標準(potentially conflicting criteria)的組合,而且還不清楚一個真正等距的映射是否能夠表達所感興趣的影像流形。儘管如此,盡可能地讓映射接近等距還是會帶來理想的結果。特別是,它抑制了非必要性的"detours":在沒有約束的生成器映射中,兩個相似影像之間的潛在空間插值也許會經過RGB空間中任意數量的遠距影像。使用正規化的話,映射就會被鼓勵將遠距的影像放置在潛在空間中的不同區域,以此在任意兩個端點之間獲得較短的影像路徑。 ::: ## D. Projection method details :::info Given a target image $x$, we seek to find the corresponding $\bf w \in \mathcal{W}$ and per-layer noise maps denoted $n_i\in\Bbb R^{r_i\times r_i}$ where $i$ is the layer index and $r_i$ denotes the resolution of the $i$th noise map.
The baseline StyleGAN generator in 1024×1024 resolution has 18 noise inputs, i.e., two for each resolution from 4×4 to 1024×1024 pixels. Our improved architecture has one fewer noise input because we do not add noise to the learned 4×4 constant (Figure 2). ::: :::success 給定目標照片$x$,我們尋求找到對應的$\bf w \in \mathcal{W}$,並且每一個網路層noise maps表示為$n_i\in\Bbb R^{r_i\times r_i}$,其中$i$是網路層第$i$層的索引,$r_i$則是表示第$i$層noise maps的解析度。做為基線的StyleGAN generator在1024x1024解析度中有18個噪點輸入,也就是從4x4到1024x1024的每個解析度都有2個。我們改進後的架構因為沒有將噪點加入4x4 constant(Figure 2)中,所以少了一個。 ::: :::info Before optimization, we compute $\bf \mu_{\bf w}$ by running 10000 random latent codes $\bf z$ through the mapping network $f$. We also approximate the scale of $\mathcal{W}$ by computing $\sigma^2_{\bf w} = \Bbb E_z\Vert f(\bf z)-\bf \mu_{\bf w}\Vert^2_2$, i.e., the average square Euclidean distance to the center. ::: :::success 在最佳化之前,我們透過以映射網路$f$執行10000個random latent codes $\bf z$的方式計算$\bf \mu_{\bf w}$。我們還透過計算$\sigma^2_{\bf w} = \Bbb E_z\Vert f(\bf z)-\bf \mu_{\bf w}\Vert^2_2$來近似$\mathcal{W}$的尺度,也就是到中心點的平均歐幾里德距離。 ::: :::info At the beginning of optimization, we initialize $\bf w = \bf \mu_{\bf w}$ and $\mathbf{n}_i = \mathcal{N}(\bf 0, \bf I)$ for all $i$. The trainable parameters are the components of $\bf w$ as well as all components in all noise maps $\mathbf{n}_i$. The optimization is run for 1000 iterations using Adam optimizer [25] with default parameters. Maximum learning rate is $\lambda_{max}=0.1$, and it is ramped up from zero linearly during the first 50 iterations and ramped down to zero using a cosine schedule during the last 250 iterations. In the first three quarters of the optimization we add Gaussian noise to $\bf w$ when evaluating the loss function as $\tilde{\bf w}=\mathbf{w} + \mathcal{N}(0, 0.05\sigma_{\bf w}t^2)$, where $t$ goes from one to zero during the first 750 iterations. This adds stochasticity to the optimization and stabilizes finding of the global optimum. ::: :::success 在最佳化的初期,我們初始化$\bf w = \bf \mu_{\bf w}$,$\mathbf{n}_i = \mathcal{N}(\bf 0, \bf I)$(所有的$i$)。可訓練參數就是$\bf w$的[分量](https://terms.naer.edu.tw/detail/ba2eed123e1838d213e261296217535f/),還有所有noise maps $\mathbf{n}_i$中的分量(components)。整個最佳化執行1000次迭代,使用Adam optimizer(採預設參數)。最大的學習效率為$\lambda_{max}=0.1$,並且在前50次迭代過程中線性地從零增加,然後在最後250次迭代過程中以cosine schedule的方式縮減至零。在最佳化過程的前四分之三階段中,我們在評估損失函數時在$\bf w$加入高斯噪點$\tilde{\bf w}=\mathbf{w} + \mathcal{N}(0, 0.05\sigma_{\bf w}t^2)$,其中$t$在前750個迭代過程中從一逐步歸零。這在最佳化過程中加入了隨機性,並能穩定的找到全域最佳。 ::: :::info Given that we are explicitly optimizing the noise maps, we must be careful to avoid the optimization from sneaking actual signal into them. Thus we include several noise map regularization terms in our loss function, in addition to an image quality term. The image quality term is the LPIPS [50] distance between target image $x$ and the synthesized image: $L_{image}= D_{\text{LPIPS}[\bf x, g(\tilde{\bf w}, \mathbf{n}_0, \mathbf{n}_1,...)]}$. For increased performance and stability, we downsample both images to 256×256 resolution before computing the LPIPS distance. Regularization of the noise maps is performed on multiple resolution scales. For this purpose, we form for each noise map greater than 8×8 in size a pyramid down to 8×8 resolution by averaging 2×2 pixel neighborhoods and multiplying by 2 at each step to retain the expected unit variance. These downsampled noise maps are used for regularization only and have no part in synthesis. 
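:::warning
A minimal NumPy sketch of the downsampling pyramid described above: each level averages 2×2 pixel neighborhoods and multiplies by 2, so noise that starts with unit variance keeps (expected) unit variance, and levels are produced down to 8×8. The starting resolution in the example is an arbitrary choice.
```python
import numpy as np

def noise_pyramid(n):
    """Return [n, downsampled versions of n down to 8x8] for regularization only."""
    levels = [n]
    while levels[-1].shape[0] > 8:
        x = levels[-1]
        h, w = x.shape
        # average 2x2 neighborhoods, then multiply by 2 to retain unit variance
        levels.append(x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)) * 2.0)
    return levels

rng = np.random.default_rng(0)
levels = noise_pyramid(rng.standard_normal((64, 64)))
print([lvl.shape for lvl in levels])                    # [(64, 64), (32, 32), (16, 16), (8, 8)]
print([round(float(lvl.var()), 2) for lvl in levels])   # each close to 1.0
```
:::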
::: :::success 因為我們是顯式的最佳化noise maps,所以要小心避免在最佳化過程中將實際信號引入noise maps。為此,除了影像項質項之外,我們還在損失函數中包含多個noise maps的正規項。影像項質項是目標影像與合成影像$L_{image}= D_{\text{LPIPS}[\bf x, g(\tilde{\bf w}, \mathbf{n}_0, \mathbf{n}_1,...)]}$之間的LPIPS distance。為了提高效能與穩定度,在計算LPIPS distance之前,兩張影像都downsample到解析度256x256。noise maps的正規項是在多個解析度尺度上執行的。為此,對於每個大於8×8的noise map,我們形成了一個直到8×8解析度的金字塔,這是透過在每個step對2x2像素鄰域做平均並乘2來維持預期的單位變異數。這些downsampled noise maps單純的用於正規化,沒合成的戲。 ::: :::info Let us denote the original noise maps by $\mathbf{n}_{i, 0}=\mathbf{n}_i$ and the downsampled versions by $\mathbf{n}_{i, j>0}$. Similarly, let $r_{i, j}$ be the resolution of an original $(j = 0)$ or downsampled $(j > 0)$ noise map so that $r_{i,j+1}=r_{i,j}/2$. The regularization term for noise map $\mathbf{n}_{i,j}$ is then $$ \begin{align} L_{ij} &= \Big(r_{i,j}^{\frac{1}{2}} \cdot \sum_{x,y}\mathbf{n}_{i,j}(x, y)\cdot \mathbf{n}_{i,j}(x-1, y) \Big)^2 \\ &+ \Big(r_{i,j}^{\frac{1}{2}} \cdot \sum_{x,y}\mathbf{n}_{i,j}(x, y)\cdot \mathbf{n}_{i,j}(x, y - 1) \Big)^2 \end{align} $$ , where the noise map is considered to wrap at the edges. The regularization term is thus sum of squares of the resolution normalized autocorrelation coefficients at one pixel shifts horizontally and vertically, which should be zero for a normally distributed signal. The overall loss term is then $L_{total}=L_{image}+\alpha\sum_{i,j}L_{i,j}$. In all our tests, we have used noise regularization weight $\alpha = 10^5$ . In addition, we renormalize all noise maps to zero mean and unit variance after each optimization step. Figure 18 illustrates the effect of noise regularization on the resulting noise maps. ::: :::success 讓我們用$\mathbf{n}_{i, 0}=\mathbf{n}_i$來表示原始的noise maps,然後$\mathbf{n}_{i, j>0}$表示downsampled versions。類似地,假設$r_{i, j}$是原始的$(j = 0)$或是downsampled $(j > 0)$ noise map的解析度,那就可以$r_{i,j+1}=r_{i,j}/2$。那麼,noise map $\mathbf{n}_{i,j}$的正規化項就會是 $$ \begin{align} L_{ij} &= \Big(r_{i,j}^{\frac{1}{2}} \cdot \sum_{x,y}\mathbf{n}_{i,j}(x, y)\cdot \mathbf{n}_{i,j}(x-1, y) \Big)^2 \\ &+ \Big(r_{i,j}^{\frac{1}{2}} \cdot \sum_{x,y}\mathbf{n}_{i,j}(x, y)\cdot \mathbf{n}_{i,j}(x, y - 1) \Big)^2 \end{align} $$ ,其中wrap被認為是在邊緣(edges)被纏繞著。因此,正規化項是水平、垂直偏移一個像素的解析度正規化[自相關系數](https://terms.naer.edu.tw/detail/351f43b36e420aadc516d329c5c9ab33/)的平方和,其正態分佈的信號應該為零。那麼,總的損失項就是$L_{total}=L_{image}+\alpha\sum_{i,j}L_{i,j}$。我們所有的測試中都使用noise regularization weight $\alpha = 10^5$。除此之外,我們在每次最佳化之後將所有的noise maps重新正規化為均值為零且為單位變異數。Figure 18描述著noise regularization在產生的noise maps的影響。 ::: :::info ![image](https://hackmd.io/_uploads/HyBBi0lv0.png) Figure 18. Effect of noise regularization in latent-space projection where we also optimize the contents of the noise inputs of the synthesis network. Top to bottom: target image, re-synthesized image, contents of two noise maps at different resolutions. When regularization is turned off in this test, we only normalize the noise maps to zero mean and unit variance, which leads the optimization to sneak signal into the noise maps. Enabling the noise regularization prevents this. The model used here corresponds to configuration F in Table 1. ::: ## E. Results with spectral normalization :::info Since spectral normalization (SN) is widely used in GANs [31], we investigated its effect on StyleGAN2. Table 4 gives the results for a variety of configurations where spectral normalization is enabled in addition to our techniques (weight demodulation, path length regularization) or instead of them. 
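:::warning
For reference, a standard way to obtain the scalar $1/\sigma(w)$ used by spectral normalization [31] is power iteration on the flattened weight; this is a generic sketch of that technique, not code from the paper. The kernel shape, iteration count, and epsilon are illustrative assumptions.
```python
import numpy as np

def spectral_sigma(w, n_iters=20, eps=1e-12):
    """Estimate the largest singular value of a weight tensor by power iteration.

    A conv kernel [out_ch, in_ch, kh, kw] is flattened to [out_ch, -1] first;
    spectral normalization then rescales the weight as w / sigma.
    """
    w2d = w.reshape(w.shape[0], -1)
    u = np.random.default_rng(0).standard_normal(w2d.shape[0])
    for _ in range(n_iters):
        v = w2d.T @ u
        v /= np.linalg.norm(v) + eps
        u = w2d @ v
        u /= np.linalg.norm(u) + eps
    return float(u @ w2d @ v)

w = np.random.default_rng(1).standard_normal((64, 32, 3, 3))
print(spectral_sigma(w), np.linalg.svd(w.reshape(64, -1), compute_uv=False)[0])  # nearly equal
```
:::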
::: :::success 由於spectral normalization (SN)廣泛的被應用在GANs,我們也研究了一下它在StyleGAN2的影響。Table 4給出多種配置的影響,除了我們的技術(weight demodulation, path length regularization)之外,還啟用spectral normalization或是取代它們。 ::: :::info ![image](https://hackmd.io/_uploads/H1M6fCgwC.png) Table 4. Effect of spectral normalization with FFHQ at 1024^2^.The first row corresponds to StyleGAN2, i.e., config F in Table 1. In the subsequent rows, we enable spectral normalization in the generator (SN-G) and in the discriminator (SN-D). We also test the training without weight demodulation (Demod) and path length regularization (P.reg). All of these configurations are highly detrimental to FID, as well as to Recall. ↑ indicates that higher is better, and ↓ that lower is better. ::: :::info Interestingly, adding spectral normalization to our generator is almost a no-op. On an implementation level, SN scales the weight tensor of each layer with a scalar value $1/\sigma(w)$. The effect of such scaling, however, is overridden by Equation 3 for the main convolutional layers as well as the affine transformation layers. Thus, the only thing that SN adds on top of weight demodulation is through its effect on the tRGB layers. ::: :::success 有趣的是,把spectral normalization加到我們的生成器幾乎是沒有影響的。在實作層面上,SN以scalar value $1/\sigma(w)$縮放每一層的權重張量。然而,這樣的縮放影響在主要的卷積層還有仿射變換層上會被方程式3給覆蓋掉。因此,SN在權重解調之後唯一作用就是對tRGB的影響。 ::: :::info When we enable spectral normalization in the discriminator, FID is slightly compromised. Enabling it in the generator as well leads to significantly worse results, even though its effect is isolated to the tRGB layers. Leaving SN enabled, but disabling a subset of our contributions does not improve the situation. Thus we conclude that StyleGAN2 gives better results without spectral normalization. ::: :::success 當我們在discriminator中啟用spectral normalization時,FID略受影響。在生成器中啟用spectral normalization,儘管它的影響就只有tRGB層,但還是導致糟糕的結果。保留SN啟用然後關閉我們的部份貢獻技術,這麼做也無法改善這種情況。因此,我們得到結論,StyleGAN2在沒有使用spectral normalization的情況下可以有更好的結果。 ::: ## F. Energy consumption :::info Computation is a core resource in any machine learning project: its availability and cost, as well as the associated energy consumption, are key factors in both choosing research directions and practical adoption. We provide a detailed breakdown for our entire project in Table 5 in terms of both GPU time and electricity consumption. ::: :::success 在任何的機器學習專安,計算始終是核心的資源:其可用性與成本,以及相關的能源損耗,都是選擇研究方向以及實際應用的關鍵因子。我們在Table 5中提供整個專案的GPU time與電力損耗的詳細拆解。 ::: :::info ![image](https://hackmd.io/_uploads/r1wPhpeDA.png) Table 5. Computational effort expenditure and electricity consumption data for this project. The unit for computation is GPU years on a single NVIDIA V100 GPU — it would have taken approximately 51 years to execute this project using a single GPU. See the text for additional details about the computation and energy consumption estimates. Initial exploration includes all training runs after the release of StyleGAN [24] that affected our decision to start this project. Paper exploration includes all training runs that were done specifically for this project, but were not intended to be used in the paper as-is. FFHQ config F refers to the training of the final network. This is approximately the cost of training the network for another dataset without hyperparameter tuning. Other runs in paper covers the training of all other networks shown in the paper. 
Backup runs left out includes the training of various networks that could potentially have been shown in the paper, but were ultimately left out to keep the exposition more focused. Video, figures, etc. includes computation that was spent on producing the images and graphs in the paper, as well as on the result video. Public release covers testing, benchmarking, and large-scale image dumps related to the public release. ::: :::info We report expended computational effort as single-GPU years (Volta class GPU). We used a varying number of NVIDIA DGX-1s for different stages of the project, and converted each run to single-GPU equivalents by simply scaling by the number of GPUs used. ::: :::success 我們將所需要的計算工作量以single-GPU years (Volta class GPU)來報告呈現。我們在專案的不同階段使用不同數量的NVIDIA DGX-1,並將每次的執行透過按所使用的GPU數量進行縮放來轉換為等價於single-GPU。 ::: :::warning 這邊的GPU years看起來是指單一指定型號的GPU需要幾年才能夠重現實驗結果? ::: :::info The entire project consumed approximately 131.61 megawatt hours (MWh) of electricity. We followed the Green500 power measurements guidelines [11] as follows. For each job, we logged the exact duration, number of GPUs used, and which of our two separate compute clusters the job was executed on. We then measured the actual power draw of an 8-GPU DGX-1 when it was training FFHQ config F. A separate estimate was obtained for the two clusters because they use different DGX-1 SKUs. The vast majority of our training runs used 8 GPUs, and for the rest we approximated the power draw by scaling linearly with $n/8$, where $n$ is the number of GPUs. ::: :::success 整個專案損耗了大約131.61 megawatt hours (MWh)的電力。我們依循著Green500 power measurements guidelines如下。對於每個工作(job),我們記錄了確切的區間,使用的GPU數量,以及工作執行中是使用兩個clusters中那一個cluster。然後我們測量訓練FFHQ配置F在8張GPU DGX-1的實際功耗。因為兩個cluser使用不同的DGX-1 SKUs,所以我們可以得到各別的估測值。絕大多數的訓練執行使用了8張GPU,剩下的部份我們就以$n/8$(其中$n$是GPU的數量)線性縮放的方式來近似功耗。 ::: :::info Approximately half of the total energy was spent on early exploration and forming ideas. Then subsequently a quarter was spent on refining those ideas in more targeted experiments, and finally a quarter on producing this paper and preparing the public release of code, trained models, and large sets of images. Training a single FFHQ network (config F) took approximately 0.68 MWh (0.5% of the total project expenditure). This is the cost that one would pay when training the network from scratch, possibly using a different dataset. In short, vast majority of the electricity used went into shaping the ideas, testing hypotheses, and hyperparameter tuning. We did not use automated tools for finding hyperparameters or optimizing network architectures. ::: :::success 大約一半的總能量花費是用在早期的探索和形成想法上。然後有四分之一的能量花費在更有目標的實驗重構上,最後四分之一的能量用於撰寫這篇論文以及準備程式碼、訓練模型和大量影像的釋出。訓練單個FFHQ network (config F)大約消耗了0.68 MWh(佔總專案支出的0.5%)。這是從頭開始訓練網路時,使用不同資料集可能所需的成本。簡單來說,大部分電力都是用在想法的構建、測試假設和超參數的調整。我們沒有使用自動化工具來尋找超參數或最佳化網絡架構。 :::
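:::warning
A tiny sketch of the bookkeeping described above, assuming a list of logged jobs with their duration and GPU count: single-GPU years are obtained by scaling each job by its number of GPUs, and energy is approximated by scaling the measured 8-GPU DGX-1 power draw linearly with $n/8$. The job list and the wattage value are made-up placeholders, not measurements from the paper.
```python
def single_gpu_years(jobs):
    """jobs: list of (duration_hours, num_gpus) tuples."""
    gpu_hours = sum(hours * gpus for hours, gpus in jobs)
    return gpu_hours / (24 * 365)

def energy_mwh(jobs, dgx1_8gpu_watts):
    """Approximate energy by scaling the measured 8-GPU power draw by n/8."""
    watt_hours = sum(hours * dgx1_8gpu_watts * (gpus / 8) for hours, gpus in jobs)
    return watt_hours / 1e6

jobs = [(9 * 24, 8), (13 * 24, 8)]   # e.g. the 9-day FFHQ and 13-day LSUN CAR config F runs
print(single_gpu_years(jobs), energy_mwh(jobs, dgx1_8gpu_watts=3000.0))
```
:::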