# Image-to-Image Translation with Conditional Adversarial Networks(翻譯) ###### tags:`論文翻譯` `deeplearning` [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 個人註解,任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/pdf/1611.07004) ::: ## Abstract :::info We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of the pix2pix software associated with this paper, a large number of internet users (many of them artists) have posted their own experiments with our system, further demonstrating its wide applicability and ease of adoption without the need for parameter tweaking. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either ::: :::success 我們探討條件式對抗網路(conditional adversarial networks)作為image-to-image translation的通用解決方案。這些網路不僅學習輸入影像到輸出影像的映射,也學習訓練這個映射的損失函數(loss function)。這讓將相同的方法用於傳統上需要非常不同的損失函數公式的問題變的可能。我們證明了這個方法在從標記地圖(label maps)中合成照片、從邊緣地圖重建物件和影像著色等任務中有效。確實,從本論文相關的pix2pix軟體釋出以來,大量的網路使用者(其中許多是藝術家)發布了他們使用我們的這個系統的實驗,進一步證明了其廣泛的適應性和無需參數調整的易用性。作為一個社群,我們不再手工設計映射函數,而這研究表明了,我們可以在沒有手工設計損失函數的情況下得到合理的結果。 ::: ## 1. Introduction :::info Many problems in image processing, computer graphics, and computer vision can be posed as “translating” an input image into a corresponding output image. Just as a concept may be expressed in either English or French, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, we define automatic image-to-image translation as the task of translating one possible representation of a scene into another, given sufficient training data (see Figure 1). Traditionally, each of these tasks has been tackled with separate, special-purpose machinery (e.g., [16, 25, 20, 9, 11, 53, 33, 39, 18, 58, 62]), despite the fact that the setting is always the same: predict pixels from pixels. Our goal in this paper is to develop a common framework for all these problems. ::: :::success 影像處理、[電腦圖學](https://terms.naer.edu.tw/detail/0998604313d93527dcfa63ce063713e3/)與電腦視覺的許多問題可以被視為將輸入影像"轉換"成相對應的輸出影像。就如同一個概念可以用英文或法文表述,一個場景可以被渲染成RGB影像、梯度場(gradient field)、邊緣圖、語意標記圖(semantic label map)等。類比成自動語言翻譯的話,我們定義自動image-to-image translation為在給定足夠訓練資料的情況下,將場景的一種可能的表示轉換成另一種表示的任務(Figure 1)。傳統上,這些任務已經各自使用不同的專用機制處理(例如[16, 25, 20, 9, 11, 53, 33, 39, 18, 58, 62]),儘管事實上設定總是相同:由像素預測像素。這篇論文的目標是開發一個適用於所有這些問題的通同框架。 ::: :::info ![image](https://hackmd.io/_uploads/rJo6eHGd0.png) Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show results of the method on several. 
In each case we use the same architecture and objective, and simply train on different data. ::: :::info The community has already taken significant steps in this direction, with convolutional neural nets (CNNs) becoming the common workhorse behind a wide variety of image prediction problems. CNNs learn to minimize a loss function – an objective that scores the quality of results – and although the learning process is automatic, a lot of manual effort still goes into designing effective losses. In other words, we still have to tell the CNN what we wish it to minimize. But, just like King Midas, we must be careful what we wish for! If we take a naive approach and ask the CNN to minimize the Euclidean distance between predicted and ground truth pixels, it will tend to produce blurry results [43, 62]. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring. Coming up with loss functions that force the CNN to do what we really want – e.g., output sharp, realistic images – is an open problem and generally requires expert knowledge. ::: :::success 社群已經在這方面採取了重大步驟,使用卷積神經網路(CNN)已經是各種影像預測問題的主力工具。CNN學習最小化損失函數,這是一個評估結果品質的目標,儘管學習過程是自動化的,不過還是需要大量人工投入設計有效的損失函數。換句話說,我們仍然需要告訴CNN我們希望它最小化什麼。但是,就像King Midas一樣,我們必須小心我們自己的願望!如果我們採用naive approach,並要求CNN最小化預測像素與真實像素之間的歐幾里得距離(Euclidean distance),它往往會產生模糊的結果。這是因為歐幾里得距離是通過平均所有可能的輸出值來最小化,這導致模糊(blurring)。強制CNN做我們真正想要的事情,像是輸出清晰且真實影像的損失函數是一個開放性問題,通常需要專家知識。 ::: :::info It would be highly desirable if we could instead specify only a high-level goal, like “make the output indistinguishable from reality”, and then automatically learn a loss function appropriate for satisfying this goal. Fortunately, this is exactly what is done by the recently proposed Generative Adversarial Networks (GANs) [24, 13, 44, 52, 63]. GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions. ::: :::success 如果我們能夠單純指定一個高階目標,像是"讓輸出與真實無法區分",然後自動學習一個適合作為滿足該目標的損失函數,這會非常理想。幸運的是,最近提出的生成對抗網路(GANs),就是這樣做的。GANs學習一個損失函數,嚐試分類輸出圖像是真的還是假的,同時訓練一個生成模型來最小化這個損失。模糊的圖像將不被容忍,因為它們看起來明顯是假的。由於GANs是學習適應資料的損失函數,所以它們可以被應用在傳統上需要非常不同類型的損失函數的許多任務上。 ::: :::info In this paper, we explore GANs in the conditional setting. Just as GANs learn a generative model of data, conditional GANs (cGANs) learn a conditional generative model [24]. This makes cGANs suitable for image-to-image translation tasks, where we condition on an input image and generate a corresponding output image. ::: :::success 在這篇論文中,我們要來探討條件設置下的GANs。如同GANs學習資料的生成模型,conditional GANs (cGANs)學習條件式生成模型。這使得cGANs適用於image-to-image translation的任務,其中我們以輸入影像為條件,然後生成對應的輸出影像。 ::: :::info GANs have been vigorously studied in the last two years and many of the techniques we explore in this paper have been previously proposed. Nonetheless, earlier papers have focused on specific applications, and it has remained unclear how effective image-conditional GANs can be as a general-purpose solution for image-toimage translation. Our primary contribution is to demonstrate that on a wide variety of problems, conditional GANs produce reasonable results. Our second contribution is to present a simple framework sufficient to achieve good results, and to analyze the effects of several important architectural choices. 
Code is available at https://github.com/phillipi/pix2pix.
:::
:::success
在過去兩年中GANs受到了廣泛的研究,這篇論文中所探討的許多技術先前都已經被提出了。儘管如此,早期的論文集中在特定的應用,對於image-conditional GANs作為image-to-image translation的通用解決方案的有效性仍然不是那麼清楚。我們的主要貢獻是證明conditional GANs在各種問題上都能產生合理的結果。我們的第二個貢獻是提出一個簡單的框架,足以獲得好的結果,並分析幾個重要架構選擇的影響。程式碼就放在https://github.com/phillipi/pix2pix 。
:::

## 2. Related work
:::info
**Structured losses for image modeling** Image-to-image translation problems are often formulated as per-pixel classification or regression (e.g., [39, 58, 28, 35, 62]). These formulations treat the output space as “unstructured” in the sense that each output pixel is considered conditionally independent from all others given the input image. Conditional GANs instead learn a structured loss. Structured losses penalize the joint configuration of the output. A large body of literature has considered losses of this kind, with methods including conditional random fields [10], the SSIM metric [56], feature matching [15], nonparametric losses [37], the convolutional pseudo-prior [57], and losses based on matching covariance statistics [30]. The conditional GAN is different in that the loss is learned, and can, in theory, penalize any possible structure that differs between output and target.
:::
:::success
**Structured losses for image modeling** Image-to-image translation的問題通常被表述為per-pixel的分類問題或是迴歸問題。這些表述將輸出空間視為"非結構化",就是說,給定輸入影像的情況下,每個輸出像素都被認為跟所有其它輸出像素條件獨立。Conditional GANs不是這樣想的,它學習的是一個結構化損失。結構化損失懲罰的是輸出的[接合組態](https://terms.naer.edu.tw/detail/387ddf4f79d7807e3278c7825b166b67/)。大量的文獻考慮了這類的損失,這些方法包括conditional random fields、SSIM metric、feature matching、nonparametric losses、convolutional pseudo-prior,以及基於匹配共變異數統計的損失。conditional GAN跟它們不同的地方在於,loss是學到的,而且理論上可以懲罰任何輸出與目標之間可能的結構差異。
:::
:::info
**Conditional GANs** We are not the first to apply GANs in the conditional setting. Prior and concurrent works have conditioned GANs on discrete labels [41, 23, 13], text [46], and, indeed, images. The image-conditional models have tackled image prediction from a normal map [55], future frame prediction [40], product photo generation [59], and image generation from sparse annotations [31, 48] (c.f. [47] for an autoregressive approach to the same problem). Several other papers have also used GANs for image-to-image mappings, but only applied the GAN unconditionally, relying on other terms (such as L2 regression) to force the output to be conditioned on the input. These papers have achieved impressive results on inpainting [43], future state prediction [64], image manipulation guided by user constraints [65], style transfer [38], and superresolution [36]. Each of the methods was tailored for a specific application. Our framework differs in that nothing is application-specific. This makes our setup considerably simpler than most others.
:::
:::success
**Conditional GANs** 我們並不是第一個以條件配置來應用GANs的人。先前與同期的研究已經將GANs條件化在discrete labels、text,甚至是images上。image-conditional models已經處理過由法線圖(normal map)預測影像、future frame prediction、product photo generation,以及從稀疏註解中生成影像。其它幾篇論文也使用GANs進行image-to-image mappings,不過只是以非條件(unconditional)的方式使用GAN,依賴其它的項目(像是L2 regression)來強制輸出以輸入做為條件。這些論文在修復(inpainting)、未來狀態預測(future state prediction)、使用者約束引導的影像操作、風格轉換(style transfer)和超解析度(superresolution)等方面取得了令人印象深刻的結果。每一種方法都是為特定應用量身定制的。我們的框架不同之處在於沒有任何應用特定的部分。這使得我們的設置比大多數其他方法簡單得多。
:::
:::info
Our method also differs from the prior works in several architectural choices for the generator and discriminator.
Unlike past work, for our generator we use a “U-Net”-based architecture [50], and for our discriminator we use a convolutional “PatchGAN” classifier, which only penalizes structure at the scale of image patches. A similar PatchGAN architecture was previously proposed in [38] to capture local style statistics. Here we show that this approach is effective on a wider range of problems, and we investigate the effect of changing the patch size. ::: :::success 我們的方法與先前的研究在generator和discriminator的架構選擇上存在著一些差異。與過去的研究不同的是,對於我們的generator,我們使用的是基於“U-Net”的架構,對於我們的discriminator,我們使用的是convolutional “PatchGAN” classifier,它單純的以image patches(影像塊)的尺度(scale)來懲罰結構。在[38]中已經有提出一種類似於PatchGAN架構來捕捉局部風格統計信息。這邊,我們要來說明這種方法在更廣泛的問題是有效的,而且我們研究了改變影像塊大小的影響。 ::: ## 3. Method :::info GANs are generative models that learn a mapping from random noise vector $z$ to output image $y, G : z \to y$ [24]. In contrast, conditional GANs learn a mapping from observed image $x$ and random noise vector $z$, to $y, G : {x, z} \to y$. The generator $G$ is trained to produce outputs that cannot be distinguished from “real” images by an adversarially trained discriminator, $D$, which is trained to do as well as possible at detecting the generator’s “fakes”. This training procedure is diagrammed in Figure 2. ::: :::success GANs是一種生成模型,其從隨機噪點向量$z$學習映射到輸出影像$y, G : z \to y$。相比之下,conditional GANs則是學習從觀察到的影像$x$與隨機噪點向量$z$映射到$y, G : {x, z} \to y$。生成器$G$則是訓練來產生一個連對手discriminator $D$都無法區分所生成的影像是真是假。訓練過程如Figure 2所示。 ::: :::info ![image](https://hackmd.io/_uploads/HyRQLbS_C.png) Figure 2: Training a conditional GAN to map edges $\to$ photo. The discriminator, $D$, learns to classify between fake (synthesized by the generator) and real {edge, photo} tuples. The generator, $G$, learns to fool the discriminator. Unlike an unconditional GAN, both the generator and discriminator observe the input edge map. ::: ### 3.1. Objective :::info The objective of a conditional GAN can be expressed as $$ \begin{align} \mathcal{L}_{cGAN} (G, D) &= \mathbb{E}_{x,y}[\log D(x, y)] + \\ & \mathbb{E}_{x,z}[\log(1 - D(x,G(x, z)))]\tag{1} \end{align} $$ where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, i.e. $G^∗ =\arg\min_G\max_D\mathcal{L}_{cGAN} (G, D)$. ::: :::success conditional GAN的目標函數可以表述為 $$ \begin{align} \mathcal{L}_{cGAN} (G, D) &= \mathbb{E}_{x,y}[\log D(x, y)] + \\ & \mathbb{E}_{x,z}[\log(1 - D(x,G(x, z)))]\tag{1} \end{align} $$ 其中$G$嚐試最小化這個目標函數,去對抗嚐試最大化它的adversarial $D$,也就是$G^∗ =\arg\min_G\max_D\mathcal{L}_{cGAN} (G, D)$。 ::: :::info To test the importance of conditioning the discriminator, we also compare to an unconditional variant in which the discriminator does not observe $x$: $$ \begin{align} \mathcal{L}_{GAN}(G,D)&=\mathbb{E}_y[\log D(y)] + \\ & \mathbb{E}_{x,z}[\log(1-D(G(x,z)))] \tag{2} \end{align} $$ Previous approaches have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance [43]. The discriminator’s job remains unchanged, but the generator is tasked to not only fool the discriminator but also to be near the ground truth output in an L2 sense. We also explore this option, using L1 distance rather than L2 as L1 encourages less blurring: $$ \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\Vert y − G(x, z)\Vert]. 
\tag{3} $$ Our final objective is $$ G^* = \arg\min_G\max_D\mathcal{L}_{cGAN}(G,D)+\lambda\mathcal{L}_{L1}(G)\tag{4} $$ ::: :::success 為了測試discriminator conditioning的重要性,我們還比較了一個unconditional的變體,也就是discriminator不觀察$x$: $$ \begin{align} \mathcal{L}_{GAN}(G,D)&=\mathbb{E}_y[\log D(y)] + \\ & \mathbb{E}_{x,z}[\log(1-D(G(x,z)))] \tag{2} \end{align} $$ 先前的方法已經有發現到,將GAN的目標函數跟一些傳統的loss,像是L2 distance,混合是有好處的。discriminator的工作仍然是不變的,不過generator被要求的就不再單純是欺騙discriminator,還要在L2的意義上更接近實際的輸出。我們還探索這個選項,用L1來試試,因為L1能夠鼓勵更少的模糊: $$ \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\Vert y − G(x, z)\Vert]. \tag{3} $$ 我們的最終目標函數就變成: $$ G^* = \arg\min_G\max_D\mathcal{L}_{cGAN}(G,D)+\lambda\mathcal{L}_{L1}(G)\tag{4} $$ ::: :::info Without $z$, the net could still learn a mapping from $x$ to $y$, but would produce deterministic outputs, and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise $z$ as an input to the generator, in addition to $x$ (e.g., [55]). In initial experiments, we did not find this strategy effective – the generator simply learned to ignore the noise – which is consistent with Mathieu et al. [40]. Instead, for our final models, we provide noise only in the form of dropout, applied on several layers of our generator at both training and test time. Despite the dropout noise, we observe only minor stochasticity in the output of our nets. Designing conditional GANs that produce highly stochastic output, and thereby capture the full entropy of the conditional distributions they model, is an important question left open by the present work. ::: :::success 沒有$z$的情況下,神經網路仍然可以學習到一個從$x$到$y$的映射,不過會產生確定性的輸出,因而無法匹配delta function以外的任何的分佈。過去的conditional GANs已經意識到這一點,所以除了$x$之外,還提供Gaussian noise $z$來做為generator的輸入。在一開始的實驗中,我們並沒有發現這個策略的有效性(就是無效),generator單純的學習到忽略這個噪點,這個Mathieu等人的研究是一樣的。反而是在我們的最終模型中,我們只有以dropout的形式來提供噪點,而且是在訓練與測試應用在我們生成器中的幾個網路層中。我們觀察到,儘管有著這些dropout noise,在我們的網路的輸出中的隨機性也是很小的。設計一個能產出高隨機性輸出的conditional GANs,從而補捉到它們建模的條件分佈的完整的熵(full entropy)是目前研究中未解的重要問題。 ::: ### 3.2. Network architectures :::info We adapt our generator and discriminator architectures from those in [44]. Both generator and discriminator use modules of the form convolution-BatchNorm-ReLu [29]. Details of the architecture are provided in the supplemental materials online, with key features discussed below. ::: :::success 我們採用[44]中的generator、discriminator的架構。generator跟discriminator都使用模組化的概念,convolution-BatchNorm-ReLu。架構的細節都在線上補充教材提供,下面討論關鍵特徵。 ::: #### 3.2.1 Generator with skips :::info A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid. In addition, for the problems we consider, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output. We design the generator architecture around these considerations. ::: :::success 影像轉換問題的定義特徵就是它們會把一個高解析度輸入網格映射到高解析度輸出網格。此外,對於我們所考慮的問題,輸入與輸出在表面外觀上有所不同,不過都是對相同底層結構渲染的。因此,在輸入中的結構就會大致的跟輸出中的結構大致對齊。我們在這些思慮下設置生成式的架構。 ::: :::info Many previous solutions [43, 55, 30, 64, 59] to problems in this area have used an encoder-decoder network [26]. In such a network, the input is passed through a series of layers that progressively downsample, until a bottleneck layer, at which point the process is reversed. Such a network requires that all information flow pass through all the layers, including the bottleneck. 
For many image translation problems, there is a great deal of low-level information shared between the input and output, and it would be desirable to shuttle this information directly across the net. For example, in the case of image colorization, the input and output share the location of prominent edges. ::: :::success 關於這領域中的問題的先前解決方案都使用了encoder-decoder network。在這類的神經網路中,輸入會經過一系列逐步地降採樣(樂詞網翻為[降低取樣](https://terms.naer.edu.tw/detail/44b16bf53d61d6109057b56e3dfad517/)),一直到瓶頸層(bottleneck layer),然後在這邊開始反過來處理(改上採樣)。這樣的網路需要所有的信息經過所有的網路層,包括瓶頸層。對於很多影像轉換問題有一個好的作法,那就是輸入與輸出之間會共享一些低階的信息,並希望能夠直接透過網路傳遞這些信息。舉例來說,影像著色的情況下,輸入與輸出會共享突出邊緣(prominent edges)的位置。 ::: :::info To give the generator a means to circumvent the bottleneck for information like this, we add skip connections, following the general shape of a “U-Net” [50]. Specifically, we add skip connections between each layer $i$ and layer $n − i$, where $n$ is the total number of layers. Each skip connection simply concatenates all channels at layer $i$ with those at layer $n − i$. ::: :::success 為了給生成器一個繞過這種信息瓶頸的方法,我們依著“U-Net”的一般形狀來增加skip connections。特別是,我們在每個網路層$i$與$n-i$之間加入skip connections,其中$n$是總的網路層數量。每個skip connections就只是將$i$層與$n-i$層的的所有通道連接起來。 ::: #### 3.2.2 Markovian discriminator (PatchGAN) :::info It is well known that the L2 loss – and L1, see Figure 4 – produces blurry results on image generation problems [34]. Although these losses fail to encourage high-frequency crispness, in many cases they nonetheless accurately capture the low frequencies. For problems where this is the case, we do not need an entirely new framework to enforce correctness at the low frequencies. L1 will already do. ::: :::success 眾所皆知,L2 loss跟L1 loss(見Figure 4)會在影像生成問題上產生模糊的結果。儘管這些loss無法鼓勵高頻[清昕度](https://dictionary.cambridge.org/zht/%E8%A9%9E%E5%85%B8/%E8%8B%B1%E8%AA%9E-%E6%BC%A2%E8%AA%9E-%E7%B9%81%E9%AB%94/crispness),不過在許多情況下仍然是準確地補捉到低頻的部份。對於這問題來說,我們並不需要一個完整全新的框架來強制在低頻的正確性。L1就夠看了。 ::: :::info ![image](https://hackmd.io/_uploads/ryP-P8UdA.png) Figure 4: Different losses induce different quality of results. Each column shows results trained under a different loss. Please see https://phillipi.github.io/pix2pix/ for additional examples. ::: :::info This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness (Eqn. 4). In order to model high-frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each $N \times N$ patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of $D$. ::: :::success 這促使限制GAN discriminator單能只能建構高頻的結構,然後靠著L1的項目來強制低頻的正確性(方程式4)。為了建構高頻的部份,把我們的注意力限制在局部的影像區塊的結構上就夠了。因此,我們設計一個discriminator的架構,我們稱之為PatchGAN,單純的懲罰區塊尺度的結構。這個discriminator會試著分辨影像中的每個$N \times N$的區塊是真的還是假的。我們用卷積的方式在影像執行這個discriminator,平均所有的響應,然後給出$D$的最終輸出。 ::: :::info In Section 4.4, we demonstrate that $N$ can be much smaller than the full size of the image and still produce high quality results. This is advantageous because a smaller PatchGAN has fewer parameters, runs faster, and can be applied to arbitrarily large images. 
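:::warning
(個人註解)下面用PyTorch寫一個70×70 PatchGAN discriminator的簡化示意,對應附錄6.1.2的`C64-C128-C256-C512`:discriminator以卷積方式在整張影像上滑動,輸出每個patch的真/假分數,再平均所有響應作為$D$的最終輸出。這只是依論文描述所做的草稿,細節(例如最後兩層的stride、是否接Sigmoid)可能與官方pix2pix實作略有出入,實際程式請以 https://github.com/phillipi/pix2pix 為準。

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN 示意:C64-C128-C256-C512,最後接一個輸出 1 個 channel 的卷積。
    輸入是條件影像 x 與(真實或生成的)輸出影像 y 沿 channel 維度串接後的張量。"""

    def __init__(self, in_channels=6, ndf=64):
        super().__init__()

        def block(cin, cout, stride=2, norm=True):
            layers = [nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.model = nn.Sequential(
            *block(in_channels, ndf, norm=False),   # C64,第一層不加 BatchNorm
            *block(ndf, ndf * 2),                    # C128
            *block(ndf * 2, ndf * 4),                # C256
            *block(ndf * 4, ndf * 8, stride=1),      # C512
            nn.Conv2d(ndf * 8, 1, kernel_size=4, stride=1, padding=1),  # 映射到 1 維輸出
            # 附錄說最後接 Sigmoid;這裡省略,以便搭配 BCEWithLogitsLoss 使用
        )

    def forward(self, x, y):
        # 以卷積方式在整張影像上滑動,得到每個 patch 的分數圖
        return self.model(torch.cat([x, y], dim=1))


D = PatchDiscriminator()
x = torch.randn(1, 3, 256, 256)   # 條件影像(例如 edge map)
y = torch.randn(1, 3, 256, 256)   # 真實或生成的照片
patch_scores = D(x, y)            # 形狀為 (1, 1, 30, 30)
print(patch_scores.mean())        # 平均所有響應作為 D 的最終輸出
```
:::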
:::
:::success
在Section 4.4中,我們會證明$N$是可以比影像的完整大小還要小很多,這種情況下仍然可以產生高品質的結果。這是有好處的,因為較小的PatchGAN有著較少的參數,跑得更快,而且可以用在任意大小的影像上。
:::
:::info
Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. This connection was previously explored in [38], and is also the common assumption in models of texture [17, 21] and style [16, 25, 22, 37]. Therefore, our PatchGAN can be understood as a form of texture/style loss.
:::
:::success
這樣的一個discriminator可以有效地把影像建構為[馬可夫隨機場](https://terms.naer.edu.tw/detail/2227bf2f12ffe9e259435c72b58bbe1d/),也就是假設間隔超過一個區塊直徑的像素彼此獨立。這種連結在[38]中有探討過,同時也是紋理與風格模型中常見的假設。因此,我們的PatchGAN可以理解成是紋理/風格損失的一種形式。
:::

### 3.3. Optimization and inference
:::info
To optimize our networks, we follow the standard approach from [24]: we alternate between one gradient descent step on $D$, then one step on $G$. As suggested in the original GAN paper, rather than training $G$ to minimize $\log(1 - D(x, G(x, z)))$, we instead train to maximize $\log D(x, G(x, z))$ [24]. In addition, we divide the objective by 2 while optimizing $D$, which slows down the rate at which $D$ learns relative to $G$. We use minibatch SGD and apply the Adam solver [32], with a learning rate of $0.0002$, and momentum parameters $\beta_1 = 0.5, \beta_2 = 0.999$.
:::
:::success
為了最佳化我們的網路,我們依著[24]裡面的標準作法:在$D$上做一次梯度下降,然後在$G$上做一次,交替進行。如同原始GAN論文中的建議,與其訓練$G$來最小化$\log(1 - D(x, G(x, z)))$,不如訓練它來最大化$\log D(x, G(x, z))$。此外,我們在最佳化$D$的時候把目標函數除以2,以此降低$D$相對於$G$的學習速度。我們使用minibatch SGD,並採用Adam solver,學習率(learning rate)為$0.0002$,momentum parameters為$\beta_1 = 0.5, \beta_2 = 0.999$。(下方4. Experiments的資料集清單之後,附上一段簡化的訓練步驟示意程式碼。)
:::
:::info
At inference time, we run the generator net in exactly the same manner as during the training phase. This differs from the usual protocol in that we apply dropout at test time, and we apply batch normalization [29] using the statistics of the test batch, rather than aggregated statistics of the training batch. This approach to batch normalization, when the batch size is set to 1, has been termed “instance normalization” and has been demonstrated to be effective at image generation tasks [54]. In our experiments, we use batch sizes between 1 and 10 depending on the experiment.
:::
:::success
在推理的時候,我們用著跟訓練階段完全相同的方式來執行generator。這跟一般作法不一樣的地方在於,我們在測試階段也使用dropout,而且我們是用測試批(test batch)的統計量來執行batch normalization,而不是訓練批的彙總統計量。當batch size設置為1的時候,這種batch normalization就稱為“instance normalization”,而且已經證明在影像生成任務上是有效的。在我們的實驗中,我們根據實驗需求使用1~10的batch size。
:::

## 4. Experiments
:::info
To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation:
* Semantic labels $\leftrightarrow$ photo, trained on the Cityscapes dataset [12].
* Architectural labels $\to$ photo, trained on CMP Facades [45].
* Map $\leftrightarrow$ aerial photo, trained on data scraped from Google Maps.
* BW $\to$ color photos, trained on [51].
* Edges $\to$ photo, trained on data from [65] and [60]; binary edges generated using the HED edge detector [58] plus postprocessing.
* Sketch $\to$ photo: tests edges $\to$ photo models on human-drawn sketches from [19].
* Day $\to$ night, trained on [33].
* Thermal $\to$ color photos, trained on data from [27].
* Photo with missing pixels $\to$ inpainted photo, trained on Paris StreetView from [14].
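:::warning
(個人註解)延續3.1的目標函數(方程式1~4)與3.3的最佳化設定,下面是一個簡化的單一訓練步驟示意(PyTorch):交替更新$D$與$G$各一步、$D$的loss除以2、Adam(lr=0.0002, $\beta_1=0.5$, $\beta_2=0.999$)、$\lambda=100$。其中的`G`、`D`假設已另外定義好(例如一個U-Net generator與一個PatchGAN discriminator),函式寫法僅為示意,並非論文官方實作。

```python
import torch
import torch.nn as nn


def build_optimizers(G, D):
    # 3.3 節:Adam,learning rate 0.0002,beta1 = 0.5, beta2 = 0.999
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
    return opt_G, opt_D


def train_step(G, D, opt_G, opt_D, x, y, lam=100.0):
    """x: 條件影像, y: 對應的真實輸出影像。對 D、G 交替各做一次梯度更新。
    假設 D(x, y) 回傳未經 sigmoid 的 patch 分數圖,G(x) 內部以 dropout 提供噪點 z。"""
    bce = nn.BCEWithLogitsLoss()   # 對應方程式 (1) 中的 log 項
    l1 = nn.L1Loss()               # 方程式 (3)

    # --- D step:最大化 log D(x,y) + log(1 - D(x,G(x,z))),並把 loss 除以 2 減慢 D 的學習 ---
    with torch.no_grad():
        y_fake = G(x)
    pred_real = D(x, y)
    pred_fake = D(x, y_fake)
    loss_D = 0.5 * (bce(pred_real, torch.ones_like(pred_real)) +
                    bce(pred_fake, torch.zeros_like(pred_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # --- G step:依原始 GAN 論文的建議,最大化 log D(x,G(x,z)),再加上 λ·L1(方程式 4)---
    y_fake = G(x)
    pred_fake = D(x, y_fake)
    loss_G = bce(pred_fake, torch.ones_like(pred_fake)) + lam * l1(y_fake, y)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```
:::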
::: :::success 為了探索conditional GANs的通用性,我們在各種任務上與資料集上測試這個方法,包括圖學任務,像是照片生成,與視覺任務,像是語意分割: * Semantic labels $\leftrightarrow$ photo, trained on the Cityscapes dataset [12]. * Architectural labels $\to$ photo, trained on CMP Facades [45]. * Map $\leftrightarrow$ aerial photo, trained on data scraped from Google Maps. * BW $\to$ color photos, trained on [51]. * Edges $\to$ photo, trained on data from [65] and [60]; binary edges generated using the HED edge detector [58] plus postprocessing. * Sketch $\to$ photo: tests edges $\to$ photo models on humandrawn sketches from [19]. * Day $\to$ night, trained on [33]. * Thermal $\to$ color photos, trained on data from [27]. * Photo with missing pixels $\to$ inpainted photo, trained on Paris StreetView from [14]. ::: :::info Details of training on each of these datasets are provided in the supplemental materials online. In all cases, the input and output are simply 1-3 channel images. Qualitative results are shown in Figures 8, 9, 11, 10, 13, 14, 15, 16, 17, 18, 19, 20. Several failure cases are highlighted in Figure 21. More comprehensive results are available at https://phillipi.github.io/pix2pix/. ::: :::success 細節都在線上補充教材中,有空有興趣就去喵喵看。在所有的情況下,輸入與輸出都只是1-3的channl image。定性結果如下面好多圖(Figure 8、9、11、10、13、14、15、16、17、18、19、20)。Figure 21突顯出幾個失敗案例。更全面的結果可以來網站看看:https://phillipi.github.io/pix2pix/ 。 ::: :::info ![image](https://hackmd.io/_uploads/H1xW2ov_C.png) Figure 8: Example results on Google Maps at 512x512 resolution (model was trained on images at 256 × 256 resolution, and run convolutionally on the larger images at test time). Contrast adjusted for clarity. ::: :::info ![image](https://hackmd.io/_uploads/B1ON3jvOA.png) Figure 9: Colorization results of conditional GANs versus the L2 regression from [62] and the full method (classification with rebalancing) from [64]. The cGANs can produce compelling colorizations (first two rows), but have a common failure mode of producing a grayscale or desaturated result (last row). ::: :::info ![image](https://hackmd.io/_uploads/BJDLhsvdC.png) Figure 10: Applying a conditional GAN to semantic segmentation. The cGAN produces sharp images that look at glance like the ground truth, but in fact include many small, hallucinated objects. ::: :::info ![image](https://hackmd.io/_uploads/HJkO3ovuA.png) Figure 11: Example applications developed by online community based on our pix2pix codebase: #edges2cats [3] by Christopher Hesse, Background removal [6] by Kaihu Chen, Palette generation [5] by Jack Qiao, Sketch → Portrait [7] by Mario Klingemann, Sketch→ Pokemon [1] by Bertrand Gondouin, “Do As I Do” pose transfer [2] by Brannon Dorsey, and #fotogenerator by Bosman et al. [4]. ::: :::info ![image](https://hackmd.io/_uploads/rJl93sP_C.png) Figure 13: Example results of our method on Cityscapes labels→photo, compared to ground truth. ::: :::info ![image](https://hackmd.io/_uploads/r1bshsvOC.png) Figure 14: Example results of our method on facades labels→photo, compared to ground truth. ::: :::info ![image](https://hackmd.io/_uploads/BkE33iPd0.png) Figure 15: Example results of our method on day→night, compared to ground truth. ::: :::info ![image](https://hackmd.io/_uploads/HJPp3iw_C.png) Figure 16: Example results of our method on automatically detected edges→handbags, compared to ground truth. ::: :::info ![image](https://hackmd.io/_uploads/SkHjTsDuR.png) Figure 17: Example results of our method on automatically detected edges→shoes, compared to ground truth. 
::: :::info ![image](https://hackmd.io/_uploads/Bk3IRiP_C.png) Figure 18: Additional results of the edges→photo models applied to human-drawn sketches from [19]. Note that the models were trained on automatically detected edges, but generalize to human drawings ::: :::info ![image](https://hackmd.io/_uploads/HJq_AjvOC.png) Figure 19: Example results on photo inpainting, compared to [43], on the Paris StreetView dataset [14]. This experiment demonstrates that the U-net architecture can be effective even when the predicted pixels are not geometrically aligned with the information in the input – the information used to fill in the central hole has to be found in the periphery of these photos. ::: :::info ![image](https://hackmd.io/_uploads/B1qi0jvdA.png) Figure 20: Example results on translating thermal images to RGB photos, on the dataset from [27]. ::: :::info ![image](https://hackmd.io/_uploads/S1Lh0swdA.png) Figure 21: Example failure cases. Each pair of images shows input on the left and output on the right. These examples are selected as some of the worst results on our tasks. Common failures include artifacts in regions where the input image is sparse, and difficulty in handling unusual inputs. Please see https://phillipi.github.io/pix2pix/ for more comprehensive results. ::: :::info **Data requirements and speed** We note that decent results can often be obtained even on small datasets. Our facade training set consists of just 400 images (see results in Figure 14), and the day to night training set consists of only 91 unique webcams (see results in Figure 15). On datasets of this size, training can be very fast: for example, the results shown in Figure 14 took less than two hours of training on a single Pascal Titan X GPU. At test time, all models run in well under a second on this GPU. ::: :::success **Data requirements and speed** 我們注意到,即使是在小資料集上還是可以得到一些不錯的結果。facade training set就單純的400張照片(見Figure 14的結果),然後day to night training set就只有包括91個獨特的webcams(見Figure 15的結果)。在這種大小的資料集上,訓練可以是非常快的:舉例來說,Figure 14中所示的結果就是在單一張Pascal Titan X GPU上訓練不到兩個小時就好的。在測試的時候,所有的模型在這GPU上也都不到一秒就秒推了。 ::: ## 4.1. Evaluation metrics :::info Evaluating the quality of synthesized images is an open and difficult problem [52]. Traditional metrics such as perpixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture ::: :::success 評估合成影像的品質是一個開放而且困難的問題。傳統的指標,像是per-pixel mean-squared error,並不會評估到結果的joint statistics(聯合統計分析),因此不會量測結構化損失所要補捉的結構。 ::: :::info To more holistically evaluate the visual quality of our results, we employ two tactics. First, we run “real vs. fake” perceptual studies on Amazon Mechanical Turk (AMT). For graphics problems like colorization and photo generation, plausibility to a human observer is often the ultimate goal. Therefore, we test our map generation, aerial photo generation, and image colorization using this approach. ::: :::success 為了更全面地評估我們成果的視覺品質,我們用了兩種策略。首先,我們在Amazon Mechanical Turk (AMT)上做了“real vs. fake”的感知研究。對於圖學問題,像是著色或是照片生成,對人類觀察者的合理性通常是最終目標。因此,我們用這種方法來測試地圖生成、航空圖生成與影像著色。 ::: :::info Second, we measure whether or not our synthesized cityscapes are realistic enough that off-the-shelf recognition system can recognize the objects in them. This metric is similar to the “inception score” from [52], the object detection evaluation in [55], and the “semantic interpretability” measures in [62] and [42]. 
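:::warning
(個人註解)針對這段提到的「用現成的辨識系統來評估合成影像」(也就是後面的FCN-score),下面是一個簡化的示意:用一個預先訓練好的語意分割模型對合成影像做預測,再跟合成這些影像所用的label map比較,計算per-pixel準確度。`seg_model`是假想的、由使用者自行提供的模型(論文用的是在Cityscapes上訓練的FCN-8s),並非論文附帶的程式。

```python
import torch


def fcn_score_pixel_acc(seg_model, synth_images, label_maps, ignore_index=255):
    """用一個預先訓練好的語意分割模型(例如論文中的 FCN-8s,這裡以假想的 seg_model 代替)
    對合成影像做分割,再跟「合成這些影像所用的 label map」比較,計算 per-pixel 準確度。
    假設 seg_model(img) 回傳形狀為 (N, C, H, W) 的 class logits。"""
    seg_model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for img, labels in zip(synth_images, label_maps):
            logits = seg_model(img.unsqueeze(0))      # (1, C, H, W)
            pred = logits.argmax(dim=1).squeeze(0)    # (H, W) 的預測類別
            valid = labels != ignore_index            # 忽略未標記像素
            correct += (pred[valid] == labels[valid]).sum().item()
            total += valid.sum().item()
    return correct / max(total, 1)
```
:::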
::: :::success 再來就是,我們量測我們的合成城市景觀是不是有足夠的真實性,夠真的話那現有的辨識系統就可以辨識其中的物體。這個指標類似於[52]的“inception score”、[55]的“object detection evaluation”、[62]與[42]的“semantic interpretability”。 ::: :::info **AMT perceptual studies** For our AMT experiments, we followed the protocol from [62]: Turkers were presented with a series of trials that pitted a “real” image against a “fake” image generated by our algorithm. On each trial, each image appeared for 1 second, after which the images disappeared and Turkers were given unlimited time to respond as to which was fake. The first 10 images of each session were practice and Turkers were given feedback. No feedback was provided on the 40 trials of the main experiment. Each session tested just one algorithm at a time, and Turkers were not allowed to complete more than one session. ∼ 50 Turkers evaluated each algorithm. Unlike [62], we did not include vigilance trials. For our colorization experiments, the real and fake images were generated from the same grayscale input. For map $\leftrightarrow$ aerial photo, the real and fake images were not generated from the same input, in order to make the task more difficult and avoid floor-level results. For map $\leftrightarrow$ aerial photo, we trained on 256×256 resolution images, but exploited fully-convolutional translation (described above) to test on 512 × 512 images, which were then downsampled and presented to Turkers at 256 × 256 resolution. For colorization, we trained and tested on 256 × 256 resolution images and presented the results to Turkers at this same resolution. ::: :::success 對於我們的AMT實驗,我們依著[62]的協議:向Turkers展示一系列的試驗,將"真實"的影像跟我們演算法生成的"假"的影像做比較。每次的試驗中,每一張影像都會出現一秒,然後消失,然後Turkers可以在不限制時間的情況下來判斷那一張是假的。每個session的前十張影像都是練習,Turkers會得到回饋。主要實驗的40次試驗是沒有提供回饋的。每個session就只測試一種演算法,並且Turkers不能夠完成超過一個session,會有50位Turkers來評估每一種演算法。跟[62]不一樣的是,我們並不包含vigilance trials(警覺試驗?)。對於我們的著色實驗,真實與假的影像都是由相同的灰階輸入所生成的。對於地圖$\leftrightarrow$空拍照片,其真實與假的影像就不是從相同的輸入所生成的,為了能夠讓任務更加地困難且避免最低水平的結果。對於地圖$\leftrightarrow$航空照,我們是在256x256解析度影像上訓練的,不過利用fully-convolutional translation(如上所述)在512x512的解析度影像上測試的,反正就是做降採樣然後以256x256解析度呈現給Turkers。著色任務的部份,我們的訓練與測試都是在256x256解析度影像上做的,然後以相同的解析度呈現給Turkers。 ::: :::info “**FCN-score**” While quantitative evaluation of generative models is known to be challenging, recent works [52, 55, 62, 42] have tried using pre-trained semantic classifiers to measure the discriminability of the generated stimuli as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized image correctly as well. To this end, we adopt the popular FCN-8s [39] architecture for semantic segmentation, and train it on the cityscapes dataset. We then score synthesized photos by the classification accuracy against the labels these photos were synthesized from. ::: :::success “**FCN-score**” 雖然大家都知道,生成式模型的定量評估深具挑戰性,近來的研究也嚐試著使用預訓練的語意分類器來量測生成的[刺激](https://terms.naer.edu.tw/detail/df7134a0ea91a86b1d766dbe78a01748/)(色質?)的[可辨別性](https://terms.naer.edu.tw/detail/95610ac4fdd6da1b7a4150af91f3c8d3/)做為[擬度量](https://terms.naer.edu.tw/detail/79235908962aa51f16295d424ac9814a/)(pseudo-metric)。直觀來看,如果生成的影像是真的,那麼在真實影像上訓練的類器也就能夠很好的分類這些合成的影像。為此,我們採用FCN-8s架構來做語意分割,在城市景觀資料集上訓練。然後,我們再根據合成這些照片的標記的分類準確度來對合成照片進行評分。 ::: ### 4.2. Analysis of the objective function :::info Which components of the objective in Eqn. 4 are important? We run ablation studies to isolate the effect of the L1 term, the GAN term, and to compare using a discriminator conditioned on the input (cGAN, Eqn. 
1) against using an unconditional discriminator (GAN, Eqn. 2). ::: :::success 方程式4中的那些目標函數的組成是重要的呢?我們消融研究來隔離L1、GAN的影響,然並且比較在輸入使用discriminator conditioned(cGAN,方程式1)與unconditional discriminator(GAN,方程式2)的差異。 ::: :::info Figure 4 shows the qualitative effects of these variations on two labels $\to$ photo problems. L1 alone leads to reasonable but blurry results. The cGAN alone (setting $\lambda=0$ in Eqn. 4) gives much sharper results but introduces visual artifacts on certain applications. Adding both terms together (with $\lambda=100$) reduces these artifacts. ::: :::success Figure 4說明了這些變化在兩個labels $\to$ photo問題上的定性影響。單獨的L1會產生合理但模糊的結果。單獨的cGAN的話(在方程式4中設置$\lambda=0$)會給出更清晰的結果,不過會在某些應用上引入視覺瑕庛。兩個項目加在一起的話($\lambda=100$)就可以降低這些視覺瑕庛。 ::: :::info We quantify these observations using the FCN-score on the cityscapes labels $\to$ photo task (Table 1): the GAN-based objectives achieve higher scores, indicating that the synthesized images include more recognizable structure. We also test the effect of removing conditioning from the discriminator (labeled as GAN). In this case, the loss does not penalize mismatch between the input and output; it only cares that the output look realistic. This variant results in poor performance; examining the results reveals that the generator collapsed into producing nearly the exact same output regardless of input photograph. Clearly, it is important, in this case, that the loss measure the quality of the match between input and output, and indeed cGAN performs much better than GAN. Note, however, that adding an L1 term also encourages that the output respect the input, since the L1 loss penalizes the distance between ground truth outputs, which correctly match the input, and synthesized outputs, which may not. Correspondingly, L1+GAN is also effective at creating realistic renderings that respect the input label maps. Combining all terms, L1+cGAN, performs similarly well. ::: :::success 我們在cityscapes labels $\to$ photo任務上使用FCN-score來量化這些觀察結果(Table 1):GAN-based的目標式得到較高的分數,這說明了合成影像包含了更多的可識別的結構。我們還測試了一下從discriminator中移除掉conditioning的影響(標籤為GAN的那個)。這種情況下,loss並不會懲罰輸入與輸出之間沒有匹配的部份;它就只關心輸出看起來是不是很真實。這種變體會導致效能不佳;實驗結果說明,生成器會落入不管給怎麼樣的輸入,它會產生近乎完全相同的輸出。很明顯的,在這種情況下,損失量測輸入與輸出之間的匹配品質是很重要的,確實,cGAN效能比GAN要好太多了。注意,加入一個L1項目也可以鼓勵輸出重視輸入,因為L1 loss會懲罰能夠正確匹配輸入的真實輸出與可能無法匹配輸入的合成輸出之間的距離。相對的,L1+GAN在建立真實性渲染(重視input label map方面)也是很有效的。結合所有的項目,L1+cGAN一樣好棒棒。 ::: :::info ![image](https://hackmd.io/_uploads/SkyQYHoOR.png) Table 1: FCN-scores for different losses, evaluated on Cityscapes labels $\leftrightarrow$ photos. ::: :::info **Colorfulness A** striking effect of conditional GANs is that they produce sharp images, hallucinating spatial structure even where it does not exist in the input label map. One might imagine cGANs have a similar effect on “sharpening” in the spectral dimension – i.e. making images more colorful. Just as L1 will incentivize a blur when it is uncertain where exactly to locate an edge, it will also incentivize an average, grayish color when it is uncertain which of several plausible color values a pixel should take on. Specially, L1 will be minimized by choosing the median of the conditional probability density function over possible colors. An adversarial loss, on the other hand, can in principle become aware that grayish outputs are unrealistic, and encourage matching the true color distribution [24]. In Figure 7, we investigate whether our cGANs actually achieve this effect on the Cityscapes dataset. 
The plots show the marginal distributions over output color values in Lab color space. The ground truth distributions are shown with a dotted line. It is apparent that L1 leads to a narrower distribution than the ground truth, confirming the hypothesis that L1 encourages average, grayish colors. Using a cGAN, on the other hand, pushes the output distribution closer to the ground truth. ::: :::success **Colorfulness A** conditional GANs的顯著效果在於它們能夠產生清晰的影像,即使input label map中不存在的空間結構,也能夠產生幻覺。人們可能會覺得cGANs在光譜維度中有著類似於"銳化"的效果,也就是讓影像更多彩多汁。正如L1在不確定邊緣的確切位置的時候會激勵模糊化那樣,當它不確定應該取用哪幾個看起來正確的顏色值來計算的時候,它也會激勵使用平均的灰色。特別是,L1會透過選擇所有可能的顏色的條件機率密度函數的中位數來最小化。另一方面,adversarial loss原則上會察覺到灰色的輸出是不切實際的,並鼓勵匹配實際的顏色分佈。在Figure 7中,我們研究我們的cGANs在Cityscapes dataset上是否會確實的受到影響。這圖說明的是在Lab色彩空間中輸出色彩值的邊際分佈。實際分佈的部份以虛線來呈現。很明顯的,L1導致了比實際分佈更狹窄的分佈,這證明了我們的假設,也就是l1會鼓勵平均、灰色值的輸出。另一方面,使用cGAN讓輸出分佈更加地接近真實狀況。 ::: :::info ![image](https://hackmd.io/_uploads/HJ7lRFi_A.png) Figure 7: Color distribution matching property of the cGAN, tested on Cityscapes. (c.f. Figure 1 of the original GAN paper [24]). Note that the histogram intersection scores are dominated by differences in the high probability region, which are imperceptible in the plots, which show log probability and therefore emphasize differences in the low probability regions ::: ### 4.3. Analysis of the generator architecture :::info A U-Net architecture allows low-level information to shortcut across the network. Does this lead to better results? Figure 5 and Table 2 compare the U-Net against an encoder-decoder on cityscape generation. The encoder-decoder is created simply by severing the skip connections in the UNet. The encoder-decoder is unable to learn to generate realistic images in our experiments. The advantages of the U-Net appear not to be specific to conditional GANs: when both U-Net and encoder-decoder are trained with an L1 loss, the U-Net again achieves the superior results. ::: :::success U-Net架構允許低階信息以短捷的方式在網路上傳輸。這能導致更好的結果嗎?Figure 5與Table 2針對U-Net跟encoder-decode在cityscape的生成上做了比較。只要截斷UNet裡面幾個skip connections就可以建立encoder-decoder。在我們的實驗中,encoder-decoder沒有辦法學習生成真實性的影像。U-Net的優點似乎不是conditional GANs特有的:當U-Net跟encoder-decoder都是以L1 loss訓練的時候,U-Net再次的取得優異的成果。 ::: :::info ![image](https://hackmd.io/_uploads/Hknd15iuA.png) Figure 5: Adding skip connections to an encoder-decoder to create a “U-Net” results in much higher quality results. ::: :::info ![image](https://hackmd.io/_uploads/rJ6xx9jOA.png) Table 2: FCN-scores for different generator architectures (and objectives), evaluated on Cityscapes labels↔photos. (U-net (L1-cGAN) scores differ from those reported in other tables since batch size was 10 for this experiment and 1 for other tables, and random variation between training runs.) ::: ### 4.4. From PixelGANs to PatchGANs to ImageGANs :::info We test the effect of varying the patch size $N$ of our discriminator receptive fields, from a 1 × 1 “PixelGAN” to a full 286 × 286 “ImageGAN”1 . Figure 6 shows qualitative results of this analysis and Table 3 quantifies the effects using the FCN-score. Note that elsewhere in this paper, unless specified, all experiments use 70 x 70 PatchGANs, and for this section all experiments use an L1+cGAN loss. ::: :::success 我們測試改變discriminator receptive fields的patch size $N$的影響,從1 × 1的“PixelGAN”到完整的286 × 286的“ImageGAN”。Figure 6說明了這個分析的定性結果,Table 3使用FCN-score量化這個影響。注意,論文的其它地方,除非另有說明,不然所有的實驗就是使用70x70的PatchGANs,並且這個session的所有實驗都是使用L1+cGAN loss。 ::: :::info ![image](https://hackmd.io/_uploads/SJkz-5oOC.png) Figure 6: Patch size variations. 
Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become blurry and desaturated under L1. The 1x1 PixelGAN encourages greater color diversity but has no effect on spatial statistics. The 16x16 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe. The 70×70 PatchGAN forces outputs that are sharp, even if incorrect, in both the spatial and spectral (colorfulness) dimensions. The full 286×286 ImageGAN produces results that are visually similar to the 70×70 PatchGAN, but somewhat lower quality according to our FCN-score metric (Table 3). Please see https://phillipi.github.io/pix2pix/ for additional examples.
:::
:::info
![image](https://hackmd.io/_uploads/S1oPZ9s_A.png)
Table 3: FCN-scores for different receptive field sizes of the discriminator, evaluated on Cityscapes labels→photos. Note that input images are 256 × 256 pixels and larger receptive fields are padded with zeros.
:::
:::info
The PixelGAN has no effect on spatial sharpness but does increase the colorfulness of the results (quantified in Figure 7). For example, the bus in Figure 6 is painted gray when the net is trained with an L1 loss, but becomes red with the PixelGAN loss. Color histogram matching is a common problem in image processing [49], and PixelGANs may be a promising lightweight solution.
:::
:::success
PixelGAN在空間清晰度上是沒有影響的,不過可以增加結果的[色彩度](https://terms.naer.edu.tw/detail/542625f54ef4e6f4dd5642e373aa4f4a/)(Figure 7中量化)。舉例來說,當網路的訓練是採用L1 loss的時候,Figure 6中的公車會被塗成灰色的,不過採用PixelGAN loss的話就會變成紅色的。色彩直方圖匹配是影像處理中的一個常見的問題,PixelGANs也許會是一種有前途的輕量級解決方案。
:::
:::info
Using a 16×16 PatchGAN is sufficient to promote sharp outputs, and achieves good FCN-scores, but also leads to tiling artifacts. The 70 × 70 PatchGAN alleviates these artifacts and achieves slightly better scores. Scaling beyond this, to the full 286 × 286 ImageGAN, does not appear to improve the visual quality of the results, and in fact gets a considerably lower FCN-score (Table 3). This may be because the ImageGAN has many more parameters and greater depth than the 70 × 70 PatchGAN, and may be harder to train.
:::
:::success
使用16×16 PatchGAN足以促進清晰的輸出,並得到不錯的FCN-scores,不過也會導致[拼貼類型](https://terms.naer.edu.tw/detail/e3238bab3d5b879a9f74006035be143c/)的瑕疵。70 × 70 PatchGAN可以減輕這些瑕疵,而且分數稍微好一點。再往上擴展到完整的286 × 286 ImageGAN,似乎並沒有提升結果的視覺品質,事實上還得到明顯較低的FCN-score(Table 3)。這有可能是因為ImageGAN有著比70 × 70 PatchGAN更多的參數、更深的網路,所以更難訓練。
:::
:::info
**Fully-convolutional translation** An advantage of the PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large images. We may also apply the generator convolutionally, on larger images than those on which it was trained. We test this on the map $\leftrightarrow$ aerial photo task. After training a generator on 256×256 images, we test it on 512×512 images. The results in Figure 8 demonstrate the effectiveness of this approach.
:::
:::success
**Fully-convolutional translation** PatchGAN的一個優點是,固定大小區塊(patch)的discriminator可以應用到任意大小的影像上。我們也可以以卷積的方式(convolutionally)將generator應用在比訓練時更大的影像上。我們在map $\leftrightarrow$ aerial photo任務上測試這一點。在256×256影像上訓練generator之後,我們將之測試於512×512影像上。Figure 8的結果說明了這個方法的有效性。
:::

### 4.5. Perceptual validation
:::info
We validate the perceptual realism of our results on the tasks of map $\leftrightarrow$ aerial photograph and grayscale $\to$ color. Results of our AMT experiment for map $\leftrightarrow$ photo are given in Table 4.
The aerial photos generated by our method fooled participants on 18.9% of trials, significantly above the L1 baseline, which produces blurry results and nearly never fooled participants. In contrast, in the photo $\to$ map direction our method only fooled participants on 6.1% of trials, and this was not significantly different than the performance of the L1 baseline (based on bootstrap test). This may be because minor structural errors are more visible in maps, which have rigid geometry, than in aerial photographs, which are more chaotic. ::: :::success 我們在map $\leftrightarrow$ aerial photograph與grayscale $\to$ color任務上驗證結果的感知真實性。map $\leftrightarrow$ photo的AMT實結果在Table 4中給出。我們的方法所生成的空拍照在18.9%的試驗中騙過了參與者,明顯高於L1 baseline(L1產生模糊化結果,幾乎沒騙到參與者)。相比之下,photo $\to$ map指出我們的方法就只有在6.1%的試驗中騙過參與者,這方法跟L1 baseline的表現並沒有明顯的差異。這可能是因為在具有剛性幾何結構的地圖中,細小的結構錯誤比在更混亂的空拍照片中更明顯。 ::: :::info ![image](https://hackmd.io/_uploads/BJ2x_cjdR.png) Table 4: AMT "real vs fake" test on maps $\leftrightarrow$ aerial photos. ::: :::info We trained colorization on ImageNet [51], and tested on the test split introduced by [62, 35]. Our method, with L1+cGAN loss, fooled participants on 22.5% of trials (Table 5). We also tested the results of [62] and a variant of their method that used an L2 loss (see [62] for details). The conditional GAN scored similarly to the L2 variant of [62] (difference insignificant by bootstrap test), but fell short of [62]’s full method, which fooled participants on 27.8% of trials in our experiment. We note that their method was specifically engineered to do well on colorization. ::: :::success 我們在ImageNet上訓練著色,然後在[62, 35]所引入的測試分割上測試。我們的方法,也就是使用L1+cGAN loss,在22.5%的試驗上騙過參與者(Table 5)。我們還測試了[62]的結果以及他們使用L2 loss變體的方法(細節見[62])。conditional GAN得到的分數類似於L2變體,不過比[62]的完整方法還來的低就是,該方法在我們的實驗中的27.8%的試驗中騙過參與者。我們注意到他們的方法是經過專門設計來可以應用在著色任務上的。 ::: :::info ![image](https://hackmd.io/_uploads/r17P99juR.png) Table 5: AMT “real vs fake” test on colorization. ::: ### 4.6. Semantic segmentation :::info Conditional GANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks. What about vision problems, like semantic segmentation, where the output is instead less complex than the input? ::: :::success Conditional GANs似乎在解決輸出高度細節或攝影上是有效的,這在影像處理跟圖形任務上是常見的。那Conditional GANs在視覺問題,像是語意分割,它的輸出比輸入更簡單,的表現如何呢? ::: :::info To begin to test this, we train a cGAN (with/without L1 loss) on cityscape photo $\to$ labels. Figure 10 shows qualitative results, and quantitative classification accuracies are reported in Table 6. Interestingly, cGANs, trained without the L1 loss, are able to solve this problem at a reasonable degree of accuracy. To our knowledge, this is the first demonstration of GANs successfully generating “labels”, which are nearly discrete, rather than “images”, with their continuousvalued variation2 . Although cGANs achieve some success, they are far from the best available method for solving this problem: simply using L1 regression gets better scores than using a cGAN, as shown in Table 6. We argue that for vision problems, the goal (i.e. predicting output close to the ground truth) may be less ambiguous than graphics tasks, and reconstruction losses like L1 are mostly sufficient. 
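:::warning
(個人註解)這一段提到AMT實驗的顯著性是用bootstrap test判斷的。下面是一個用numpy寫的最簡示意:比較兩種方法的fooling rate差異,並以bootstrap重抽樣估計95%信賴區間。`trials_a`、`trials_b`是假想的輸入(每筆試驗是否騙過參與者的0/1紀錄);論文並未公開其檢定程式,實際細節(單尾/雙尾、重抽樣方式)可能不同。

```python
import numpy as np


def bootstrap_fool_rate_diff(trials_a, trials_b, n_boot=10000, seed=0):
    """trials_a / trials_b:每次試驗是否騙過參與者(1=被騙, 0=沒被騙)的 0/1 陣列,
    分別對應兩種方法(例如 L1+cGAN 與 L1 baseline)。
    回傳兩者 fooling rate 的差異,以及該差異的 bootstrap 95% 信賴區間;
    若區間包含 0,就(粗略地)視為差異不顯著。"""
    rng = np.random.default_rng(seed)
    a = np.asarray(trials_a)
    b = np.asarray(trials_b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        ra = rng.choice(a, size=a.size, replace=True)   # 重抽樣
        rb = rng.choice(b, size=b.size, replace=True)
        diffs[i] = ra.mean() - rb.mean()
    low, high = np.percentile(diffs, [2.5, 97.5])
    return a.mean() - b.mean(), (low, high)

# 假想用法:
# diff, ci = bootstrap_fool_rate_diff(cgan_trials, l1_trials)
```
:::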
::: :::success 為了開始測試這一點,我們在cityscape photo $\to$ labels的任務上訓練cGAN(with/without L1 loss)。Figure 10說明著定性的結果,Table6則是給出定量分類準確度。有趣的是,在沒有L1 loss訓練的cGANs,能夠以合理的準確度解決問題。據我們所知,這是GANs首次成功地生成近乎離散的"lables",而不是連續值變化的"images"。雖然cGANs有一些些的成功,不過跟解決這個問題的最佳方法還是離很長一段距離:單純的使用L1 regression比使用cGAN得到更好的分數,如Table 6中所示。我們讀罔,對於視覺問題,其目標(即預測接近真實的輸出)也許比圖形任務更為明確,像l1這種reconstruction losses就很夠用了。 ::: :::info ![image](https://hackmd.io/_uploads/Bklwa5o_R.png) Figure 10: Applying a conditional GAN to semantic segmentation. The cGAN produces sharp images that look at glance like the ground truth, but in fact include many small, hallucinated objects. ::: :::info ![image](https://hackmd.io/_uploads/B1Atp5o_A.png) Table 6: Performance of photo $\to$ labels on cityscapes ::: ### 4.7. Community-driven Research :::info Since the initial release of the paper and our pix2pix codebase, the Twitter community, including computer vision and graphics practitioners as well as visual artists, have successfully applied our framework to a variety of novel image-to-image translation tasks, far beyond the scope of the original paper. Figure 11 and Figure 12 show just a few examples from the #pix2pix hashtag, including Background removal, Palette generation, Sketch → Portrait, Sketch→Pokemon, ”Do as I Do” pose transfer, Learning to see: Gloomy Sunday, as well as the bizarrely popular #edges2cats and #fotogenerator. Note that these applications are creative projects, were not obtained in controlled, scientific conditions, and may rely on some modifications to the pix2pix code we released. Nonetheless, they demonstrate the promise of our approach as a generic commodity tool for image-to-image translation problems. ::: :::success 這邊聊社群的應用,有興趣就自行喵喵看。 ::: ## 5. Conclusion :::info The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings. ::: :::success 本篇論文中的結果說明著,conditional adversarial networks對很多影像轉換的任務來說是一種有前途有未來有光明的方法,特別是那些涉及高度結構的圖形輸出。這些網路學習適應當前手邊有的任務與資料本身的loss,這讓它們在廣泛的應用中有好的可用性。 ::: :::info Acknowledgments: We thank Richard Zhang, Deepak Pathak, and Shubham Tulsiani for helpful discussions, Saining Xie for help with the HED edge detector, and the online community for exploring many applications and suggesting improvements. Thanks to Christopher Hesse, Memo Akten, Kaihu Chen, Jack Qiao, Mario Klingemann, Brannon Dorsey, Gerda Bosman, Ivy Tsai, and Yann LeCun for allowing the use of their creations in Figure 11 and Figure 12. This work was supported in part by NSF SMA-1514512, NGA NURI, IARPA via Air Force Research Laboratory, Intel Corp, Berkeley Deep Drive, and hardware donations by Nvidia. J.-Y.Z. is supported by the Facebook Graduate Fellowship. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL or the U.S. Government. ::: :::success 謝天謝地 ::: ## 6. Appendix ### 6.1. Network architectures :::info We adapt our network architectures from those in [44]. Code for the models is available at https://github.com/phillipi/pix2pix ::: :::success 程式碼可以看https://github.com/phillipi/pix2pix 。 ::: :::info Let `Ck` denote a Convolution-BatchNorm-ReLU layer with `k` filters. 
`CDk` denotes a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%. All convolutions are 4 × 4 spatial filters applied with stride 2. Convolutions in the encoder, and in the discriminator, downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2.
:::
:::success
令`Ck`表示一個有`k`個filter的Convolution-BatchNorm-ReLU layer,`CDk`表示dropout率為50%的Convolution-BatchNorm-Dropout-ReLU layer。所有的卷積都是4 × 4的spatial filters,stride為2。encoder與discriminator中的卷積做2倍降採樣,decoder中的卷積則做2倍上採樣。
:::
#### 6.1.1 Generator architectures
:::info
The encoder-decoder architecture consists of:
encoder: `C64-C128-C256-C512-C512-C512-C512-C512`
decoder: `CD512-CD512-CD512-C512-C256-C128-C64`
:::
:::info
After the last layer in the decoder, a convolution is applied to map to the number of output channels (3 in general, except in colorization, where it is 2), followed by a Tanh function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, while ReLUs in the decoder are not leaky.
:::
:::success
decoder的最後一層之後,會再接一個卷積,把通道數映射到輸出的通道數(一般是3,colorization則是2),後面再接一個Tanh function。作為上述表示法的例外,BatchNorm不套用在encoder的第一個`C64`層。encoder中的所有ReLUs都是leaky的,slope為0.2,而decoder中的ReLU則不是leaky的。
:::
:::info
The U-Net architecture is identical except with skip connections between each layer `i` in the encoder and layer `n−i` in the decoder, where `n` is the total number of layers. The skip connections concatenate activations from layer `i` to layer `n − i`. This changes the number of channels in the decoder:
**U-Net decoder:** `CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128`
:::
:::success
U-Net架構與上述相同,除了在encoder的每個layer $i$與decoder的layer $n-i$之間加入skip connections,其中$n$是總的網路層數。skip connections會把layer $i$的activation串接(concatenate)到layer $n-i$。這改變了decoder中的通道數量:
**U-Net decoder:** `CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128`
:::
#### 6.1.2 Discriminator architectures
:::info
The 70 × 70 discriminator architecture is:
`C64-C128-C256-C512`
:::
:::info
After the last layer, a convolution is applied to map to a 1-dimensional output, followed by a Sigmoid function. As an exception to the above notation, BatchNorm is not applied to the first `C64` layer. All ReLUs are leaky, with slope 0.2.
:::
:::success
最後一層之後會再接一個卷積,映射到1維的輸出,後面再接一個Sigmoid function。作為上述的例外,BatchNorm不套用在第一個`C64`層。所有的ReLUs都是leaky的,slope為0.2。
:::
:::info
All other discriminators follow the same basic architecture, with depth varied to modify the receptive field size:
1 × 1 **discriminator**: `C64-C128` (note, in this special case, all convolutions are 1 × 1 spatial filters)
16 × 16 **discriminator**: `C64-C128`
286 × 286 **discriminator**: `C64-C128-C256-C512-C512-C512`
:::
:::success
所有其它的discriminators都依著相同的基礎架構,只是以不同的深度來調整receptive field size:
1 × 1 **discriminator**: `C64-C128`(注意,在這個特例中,所有的卷積都是1 × 1的spatial filters)
16 × 16 **discriminator**: `C64-C128`
286 × 286 **discriminator**: `C64-C128-C256-C512-C512-C512`
:::
### 6.2. Training details
:::info
Random jitter was applied by resizing the 256×256 input images to 286 × 286 and then randomly cropping back to size 256 × 256.
:::
:::success
Random jitter的做法是把256×256的輸入影像放大成286×286,然後再隨機裁切回256×256。
:::
:::info
All networks were trained from scratch. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02.
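:::warning
(個人註解)依6.1的表示法與權重初始化的描述,下面用PyTorch寫出`Ck`、`CDk`模組與初始化的簡化示意(4×4卷積、stride 2、encoder用LeakyReLU 0.2、decoder以transposed convolution上採樣並用一般ReLU、dropout率50%、權重取自N(0, 0.02))。僅為草稿,模組內的運算順序等細節可能與官方實作不同。

```python
import torch.nn as nn


def Ck(cin, cout, down=True, norm=True):
    """Ck:Convolution-BatchNorm-ReLU,k 個 filter;4x4 卷積、stride 2。
    down=True 用於 encoder/discriminator(降採樣、LeakyReLU 0.2),
    down=False 用於 decoder(以 transposed convolution 做 2 倍上採樣、一般 ReLU)。"""
    conv_cls = nn.Conv2d if down else nn.ConvTranspose2d
    layers = [conv_cls(cin, cout, kernel_size=4, stride=2, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.LeakyReLU(0.2, inplace=True) if down else nn.ReLU(inplace=True))
    return nn.Sequential(*layers)


def CDk(cin, cout):
    """CDk:Convolution-BatchNorm-Dropout-ReLU,dropout 率 50%(用於 decoder 的前幾層)。"""
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(cout),
        nn.Dropout(0.5),
        nn.ReLU(inplace=True),
    )


def init_weights(module):
    """6.2:權重以 N(0, 0.02) 初始化(套用在卷積層上);用 model.apply(init_weights) 呼叫。"""
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


# 例如 encoder 的前兩層:C64(第一層不加 BatchNorm)-> C128
# enc1 = Ck(3, 64, norm=False)
# enc2 = Ck(64, 128)
```
:::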
::: :::success 所有的網路都是從頭訓練的。權重以高斯分佈(均直為0,標準差為0.02)初始化。 ::: :::info **Cityscapes labels→photo** 2975 training images from the Cityscapes training set [12], trained for 200 epochs, with random jitter and mirroring. We used the Cityscapes validation set for testing. To compare the U-net against an encoder-decoder, we used a batch size of 10, whereas for the objective function experiments we used batch size 1. We find that batch size 1 produces better results for the Unet, but is inappropriate for the encoder-decoder. This is because we apply batchnorm on all layers of our network, and for batch size 1 this operation zeros the activations on the bottleneck layer. The U-net can skip over the bottleneck, but the encoder-decoder cannot, and so the encoder-decoder requires a batch size greater than 1. Note, an alternative strategy is to remove batchnorm from the bottleneck layer. See errata for more details. ::: :::info **Architectural labels→photo** 400 training images from [45], trained for 200 epochs, batch size 1, with random jitter and mirroring. Data were split into train and test randomly. ::: :::info **Maps↔aerial photograph** 1096 training images scraped from Google Maps, trained for 200 epochs, batch size 1, with random jitter and mirroring. Images were sampled from in and around New York City. Data were then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set). ::: :::info **BW→color** 1.2 million training images (Imagenet training set [51]), trained for ∼ 6 epochs, batch size 4, with only mirroring, no random jitter. Tested on subset of Imagenet val set, following protocol of [62] and [35]. ::: :::info **Edges→shoes** 50k training images from UT Zappos 50K dataset [61] trained for 15 epochs, batch size 4. Data were split into train and test randomly. ::: :::info **Edges→Handbag** 137K Amazon Handbag images from [65], trained for 15 epochs, batch size 4. Data were split into train and test randomly. ::: :::info **Day→night** 17823 training images extracted from 91 webcams, from [33] trained for 17 epochs, batch size 4, with random jitter and mirroring. We use 91 webcams as training, and 10 webcams for test. ::: :::info **Thermal→color** photos 36609 training images from set 00–05 of [27], trained for 10 epochs, batch size 4. Images from set 06-11 are used for testing. ::: :::info **Photo with missing pixels→inpainted photo** 14900 training images from [14], trained for 25 epochs, batch size 4, and tested on 100 held out images following the split of [43]. :::
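:::warning
(個人註解)6.2提到的random jitter與mirroring,下面是一個用torchvision寫的簡化示意:把成對的輸入/目標影像放大到286×286、以相同的隨機參數裁切回256×256,並隨機水平翻轉。函式名稱與參數皆為假設,僅供參考。

```python
import random
import torchvision.transforms.functional as TF


def random_jitter_and_mirror(input_img, target_img, load_size=286, crop_size=256):
    """依 6.2 的描述:先把成對的 256x256 影像放大到 286x286,再隨機裁切回 256x256,
    並隨機水平翻轉;重點是 input 與 target 必須套用「相同」的隨機參數。"""
    input_img = TF.resize(input_img, [load_size, load_size])
    target_img = TF.resize(target_img, [load_size, load_size])

    top = random.randint(0, load_size - crop_size)
    left = random.randint(0, load_size - crop_size)
    input_img = TF.crop(input_img, top, left, crop_size, crop_size)
    target_img = TF.crop(target_img, top, left, crop_size, crop_size)

    if random.random() < 0.5:          # 隨機鏡像
        input_img = TF.hflip(input_img)
        target_img = TF.hflip(target_img)
    return input_img, target_img
```
:::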