# Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks(CycleGAN) ###### tags:`論文翻譯` `deeplearning` [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 個人註解,任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/pdf/1703.10593) * [Image-to-Image Translation with Conditional Adversarial Networks(翻譯)](https://hackmd.io/@shaoeChen/HJ-UN4fO0) * [Perceptual Losses for Real-Time Style Transfer and Super-Resolution(翻譯)](https://hackmd.io/@shaoeChen/r1fHVEzO0) ::: ## Abstract :::info Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain $X$ to a target domain $Y$ in the absence of paired examples. Our goal is to learn a mapping $G : X \to Y$ such that the distribution of images from $G(X)$ is indistinguishable from the distribution $Y$ using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping $F : Y \to X$ and introduce a cycle consistency loss to enforce $F(G(X)) \approx X$ (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach. ::: :::success Image-to-image translation是視覺與圖形一類的問題,其目標是使用對齊的成對影像(image pairs)的訓練集來學得輸入與輸出影像之間的映射。然而,對於許多任務來說,這種成對的訓練資料是沒有辦法用的。我們提出一個方法可以在沒有成對樣本的情況下學習將影像從source domain $X$轉移到target domain $Y$。我們的目標是學習一個映射$G : X \to Y$,讓來自 $G(X)$的影像分佈跟使用adversarial loss的分佈$Y$無法區分。因為這樣的映射高度受限,所以我們要把它跟逆映射$F : Y \to X$連結在一起,然後引入一個cycle consistency loss來強制$F(G(X)) \approx X$(反之亦然)。我們在多後沒有成對訓練資料的任務上給出[定性](https://terms.naer.edu.tw/detail/c278d58bee5ad582f2f16dc222151e0d/)結果,包括收集style transfer、object transfiguration、season transfer、photo enhancement等。與幾種先前的方法進行的定量比較說明了我們方法的優越性。 ::: ## 1. Introduction :::info What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette. ::: :::success 1873年一個美好的春天,Claude Monet把畫架放在Argenteuil附近的Seine,他看到了什麼(Figure 1, top-left)?如果彩色的相片已經被發明的話,可能就會記錄一個晴朗的藍色天空跟一條像鏡子一般反射它的河流。莫內透過他飄逸的筆尖與明亮的色彩來傳達他對這一場景的印象。 ::: :::info ![image](https://hackmd.io/_uploads/Hkc8FQBvA.png) Figure 1: Given any two unordered image collections $X$ and $Y$ , our algorithm learns to automatically “translate” an image from one into the other and vice versa: (left) Monet paintings and landscape photos from Flickr; (center) zebras and horses from ImageNet; (right) summer and winter Yosemite photos from Flickr. Example application (bottom): using a collection of paintings of famous artists, our method learns to render natural photographs into the respective styles. ::: :::info What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)? 
A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range. ::: :::success 如果莫內在一個涼爽的夏夜來到Cassis的小港口又當如何(Figure 1, bottom-left)?在莫內的畫廊裡短暫漫步,就可以想像他會如何渲染場景:也許是用柔和的色調,突然點上油漆,動態範圍有些許的平坦。 ::: :::info We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted. Instead, we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these two sets, and thereby imagine what a scene might look like if we were to “translate” it from one set into the other. ::: :::success 我們可以想像這一切,儘管從未見過莫內畫作與他所畫的場景照片並排的案例。相反的,我們擁有莫內畫作的集合以及風景照片的集合。我們可以推理這兩個集合之間的風格差異(stylistic differences),從而想像如果我們把一個場景"translate"到另一個場景會是怎麼樣子的。 ::: :::info In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples. ::: :::success 在這篇論文中,我們提出一種可以學習做出同樣事情的方法:補捉一個影像集合的特殊特徵,並搞清楚如何將這些特徵轉到另一個影像集合,所有的這一切都是在沒有任何成對的訓練樣本前提下做的。 ::: :::info This problem can be more broadly described as image-to-image translation [22], converting an image from one representation of a given scene, $x$, to another, $y$, e.g., grayscale to color, image to semantic labels, edge-map to photograph. Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs $\left\{x_i, y_i \right\}^N_{i=1}$ are available (Figure 2, left), e.g., [11, 19, 22, 23, 28, 33, 45, 56, 58, 62]. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration (e.g., zebra↔horse, Figure 1 top-middle), the desired output is not even well-defined. ::: :::success 這個問題可以更廣義的把它描述成一個image-to-image translation,把影像從給定場景的一個表示$x$轉換成另一種表示$y$,舉例來說,灰階到彩色,影像到語意標記、edge-map到相片。在影像視覺、影像處理、計算攝影學、以及圖學方面多年的研究中,已經在supervised setting中產生強而有力的轉換系統,其中。樣本影像對$\left\{x_i, y_i \right\}^N_{i=1}$是可用的(Figure 2, left),像是[11, 19, 22, 23, 28, 33, 45, 56, 58, 62]。然而,獲得成對的訓練資料可能是很困難,而且所資不斐。舉例來說,只會有幾個資料集是用於像是語意分割的任務,而且這是相對少數的。獲得圖形任務,像是artistic stylization(藝術風格化)的成對資料就更難了,因為所需要的輸出是非常困難的,通常需要藝術創作。對很多任務,像是物體變形(object transfiguration)(e.g., zebra↔horse, Figure 1 top-middle),所需要的輸出甚至很不好定義。 ::: :::info ::: :::warning translation systems,這邊似乎翻譯成轉換系統比較恰當? ::: :::info We therefore seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right). We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship. Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain $X$ and a different set in domain $Y$. 
We may train a mapping $G : X \to Y$ such that the output $\hat{y}=G(x)$, $x \in X$, is indistinguishable from images $y \in Y$ by an adversary trained to classify $\hat{y}$ apart from $y$. In theory, this objective can induce an output distribution over $\hat{y}$ that matches the empirical distribution $p_{data}(y)$ (in general, this requires $G$ to be stochastic) [16]. The optimal $G$ thereby translates the domain $X$ to a domain $\hat{Y}$ distributed identically to $Y$. However, such a translation does not guarantee that an individual input $x$ and output $y$ are paired up in a meaningful way – there are infinitely many mappings $G$ that will induce the same distribution over $\hat{y}$. Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the wellknown problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress [15]. ::: :::success 因此,我們在尋找一種演算法,一種可以在沒有成對輸出入樣本的情況下學習domains之間的轉換的演算法(Figure 2, right)。我們假設在domains之間有著一種潛在的關聯,舉例來說,它們是有著相同的基本景場的兩種不同的渲染,然後嚐試學習這一個關聯。雖然我們缺少這種成對樣本形式中的監督,不過我們可以利用集合(sets)層級的監督:我們考慮domain $X$中的一個集合以及domain $Y$中一個不同的集合。我們可以訓練一個映射$G : X \to Y$,這樣,其輸出 $\hat{y}=G(x)$,$x \in X$,對於一個訓練來辨別$\hat{y}$跟$y$有什麼不一樣的對手(adversary)來說就是無法區別來自$y \in Y$的影像。理論上,這個目標會導致$\hat{y}$上的輸出分佈匹配經驗分佈$p_{data}(y)$(一般來說,這需要$G$是隨機的)。因此,最佳的$G$會將domain $X$轉換為與$Y$分佈相同的domain $\hat{Y}$。然而,這樣的轉換並不能保證各別的輸入$x$與輸出$y$會以有意義的方式成對,有無限多個映射$G$會在$\hat{y}$上產生相同的分佈。而且吼,實務上來說,我們發現很難單獨的最佳化對抗性目標(adversarial objective):標準程序通常會導致眾所階知的問題,也就是mode collapse(模式崩潰),所有的輸入影像都映射到相同的輸出影像,並且最佳化無法取得進展。 ::: :::info ![image](https://hackmd.io/_uploads/SkWiuNHDR.png) Figure 2: Paired training data (left) consists of training examples $\left\{x_i, y_i \right\}^N_{i=1}$, where the correspondence between $x_i$ and $y_i$ exists [22]. We instead consider unpaired training data (right), consisting of a source set $\left\{x_i \right\}^N_{i=1}(x_i\in X)$ and a target set $\left\{y_j \right\}^M_{j=1}(y_j\in Y)$, with no information provided as to which $x_i$ matches which $y_j$. ::: :::info These issues call for adding more structure to our objective. Therefore, we exploit the property that translation should be “cycle consistent”, in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence [3]. Mathematically, if we have a translator $G : X \to Y$ and another translator $F : Y \to X$, then $G$ and $F$ should be inverses of each other, and both mappings should be bijections. We apply this structural assumption by training both the mapping $G$ and $F$ simultaneously, and adding a cycle consistency loss [64] that encourages $F(G(x)) \approx x$ and $G(F(y)) \approx y$. Combining this loss with adversarial losses on domains $X$ and $Y$ yields our full objective for unpaired image-to-image translation. 
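:::

:::warning
個人補充:下面用PyTorch寫一個極簡的示意(非論文官方程式碼,`G`、`F`這邊只用小卷積層代替真正的生成器),單純示範上面說的cycle consistency loss怎麼算:$F(G(x)) \approx x$與$G(F(y)) \approx y$各取L1距離之後相加。

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G, F, x, y):
    """L_cyc = E[||F(G(x)) - x||_1] + E[||G(F(y)) - y||_1] (sketch of the cycle loss)."""
    l1 = nn.L1Loss()
    forward_cycle = l1(F(G(x)), x)   # x -> G(x) -> F(G(x)) ~ x
    backward_cycle = l1(G(F(y)), y)  # y -> F(y) -> G(F(y)) ~ y
    return forward_cycle + backward_cycle

if __name__ == "__main__":
    # toy stand-ins for the two translators, NOT the paper's actual networks
    G = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # X -> Y
    F = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # Y -> X
    x = torch.randn(1, 3, 256, 256)  # fake image from domain X
    y = torch.randn(1, 3, 256, 256)  # fake image from domain Y
    print(cycle_consistency_loss(G, F, x, y).item())
```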
::: :::success 這些問題要求為我們的目標新增更多的結構。因此,我們利用這個特性,也就是轉換應該是“cycle consistent”(循環一致?),舉例來說,如果我們把一個句子從英語翻譯成法語,然後再把它從法語翻譯回英語,那我們應該回到原始的句子。數學上來說,如果我們有一個translator $G : X \to Y$跟另一個translator $F : Y \to X$,那麼$G$跟$F$應該是彼此的逆(inverse),兩個映射應該是[對射](https://terms.naer.edu.tw/detail/73af7a9467afde87224191576b72f972/)(bijections)的。我們透過同時訓練映射$G$跟$F$來實現這個結構假設,並加入一個 cycle consistency loss,以此鼓勵$F(G(x)) \approx x$ and $G(F(y)) \approx y$。結合這個loss跟domain $X$、$Y$上adversarial losses,產生unpaired image-to-image translation(不成對的影像到影像的轉換)完整的目標。 ::: :::info We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines. We provide both PyTorch and Torch implementations. Check out more results at our website. ::: :::success 我們把這個方法做了廣泛的應用,包括style transfer、object transfiguration、season transfer與photo enhancement。我們還跟之前那些依賴著手工定義構式與內容分解或是以shared embedding functions的方法做了比較,並說明我們的方法優化這些基線。我們提供PyTorch與Torch的實作。看一下我們的網站吧,那邊有大秘寶。 ::: ## 2. Related work :::info **Generative Adversarial Networks (GANs)** [16, 63] have achieved impressive results in image generation [6, 39], image editing [66], and representation learning [39, 43, 37]. Recent methods adopt the same idea for conditional image generation applications, such as text2image [41], image inpainting [38], and future prediction [36], as well as to other domains like videos [54] and 3D data [57]. The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos. This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated images cannot be distinguished from images in the target domain. ::: :::success **Generative Adversarial Networks (GANs)** 在影像生成、影像編輯、與表示學習(representation learning)中已經取得令人驚豔的成果。近來的方法對於條件影像生成方法採用相同的想法,像是text2image、image inpainting與uture prediction,以及其它領域像是videos與3D data。GAN成功的關鍵在於adversarial loss的觀念,這迫使生成的影像在原則上能夠跟實際的照片沒有區分。這種loss對於影像生成任務特別厲害,因為這正是[電腦圖學](https://terms.naer.edu.tw/detail/0998604313d93527dcfa63ce063713e3/)要最佳化的目標。我們採用adversarial loss來學習映射,這讓轉變後的影像跟target domain中的影像沒有區別。 ::: :::info **Image-to-Image Translation** The idea of image-to-image translation goes back at least to Hertzmann et al.’s Image Analogies [19], who employ a non-parametric texture model [10] on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs (e.g., [33]). Our approach builds on the “pix2pix” framework of Isola et al. [22], which uses a conditional generative adversarial network [16] to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches [44] or from attribute and semantic layouts [25]. However, unlike the above prior work, we learn the mapping without paired training examples. 
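:::

:::warning
個人補充:上面GANs那一段提到的adversarial loss(對應後面Section 3.1的式(1)),概念上就是discriminator做真/假的二元分類、generator想辦法騙過它。下面是我自己寫的極簡示意(非論文程式碼),用binary cross-entropy表達$\log D_Y(y)+\log(1-D_Y(G(x)))$的形式;generator這邊採用常見的non-saturating寫法,而論文實際訓練時改用least-squares loss(見Section 4)。

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # works on raw (pre-sigmoid) discriminator scores

def d_loss(D_Y, real_y, fake_y):
    """Discriminator side: classify real samples as 1 and generated samples as 0."""
    real_score = D_Y(real_y)
    fake_score = D_Y(fake_y.detach())  # don't backprop into the generator here
    return bce(real_score, torch.ones_like(real_score)) + \
           bce(fake_score, torch.zeros_like(fake_score))

def g_loss(D_Y, fake_y):
    """Generator side (non-saturating form): try to make D_Y call the fake 'real'."""
    fake_score = D_Y(fake_y)
    return bce(fake_score, torch.ones_like(fake_score))

if __name__ == "__main__":
    D_Y = nn.Conv2d(3, 1, kernel_size=4, stride=2, padding=1)  # stand-in patch discriminator
    real_y = torch.randn(2, 3, 64, 64)
    fake_y = torch.randn(2, 3, 64, 64)  # pretend this came from G(x)
    print(d_loss(D_Y, real_y, fake_y).item(), g_loss(D_Y, fake_y).item())
```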
::: :::success **Image-to-Image Translation** image-to-image translation的概念可以追溯到Hertzmann等人的Image Analogies,他們在單一輸入-輸出訓練image pair中採用了一種non-parametric texture model。近來的方法使用輸入-輸出樣本的資料集來學習參數轉換函數(使用CNNs)。我們的方法建立在Isola等人的“pix2pix”上,其使用一種條件生成對抗網路來學習輸入到輸出影像的映射。類似的想法已經被應用到多種的任務上,像是從草圖或是屬性與語意佈局來生成照片。然而,跟上面提到的先前的研究不同的是,我們並沒有使用成對的訓練樣本來學習映射。 ::: :::info **Unpaired Image-to-Image Translation** Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: $X$ and $Y$ . Rosales et al. [42] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images. More recently, CoGAN [32] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. [31] extends the above framework with a combination of variational autoencoders [27] and generative adversarial networks [16]. Another line of concurrentwork [46, 49, 2] encourages the input and output to share specific “content” features even though they may differ in “style“. These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [46], and image feature space [49]. ::: :::success **Unpaired Image-to-Image Translation** 其它多種方法也是可以處理不成對的設置,其目標就是去關聯兩個data domains:$X$與$Y$。Rosales等人提出一個貝葉斯框架(Bayesian framework),其包括從來源影像計算基於patch-based Markov random field的先驗(prior)以及從多個風格影像中獲得的似然項目。近來,CoGAN跟cross-modal scene networks使用一種權重共享的策略來學習跨領域的通用表示。跟我們的方法一起提出的,Liu等人以variational autoencoders與generative adversarial networks的組合延伸了上述的框架。另一條並行研究線,其鼓勵了輸入與輸出共享特定的"內容"特徵,儘管它們在"風格"上可能不是那麼一樣。這些方法仍然是採用對抗網路,加入額外的項目來強制輸出在預定義的指標空間中能夠更加的接近輸入,像是類別標記空間、影像像素空間以及影像特徵空間。 ::: :::info Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1. ::: :::success 與上述方法不同,我們的公式並不依賴輸入與輸出之間的特定任務、預定義相似性函數,也不假設輸入與輸出必需要位於相同的低維度嵌入空間中(low-dimensional embedding space)。這讓我們的方法成為許多視覺與圖形任務的通用解決方案。我們在Section 5.1直接跟幾個上古與當代方法比較。 ::: :::info **Cycle Consistency** The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [24, 48]. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators [3] (including, humorously, by Mark Twain [51]), as well as by machines [17]. More recently, higher-order cycle consistency has been used in structure from motion [61], 3D shape matching [21], cosegmentation [55], dense semantic alignment [65, 64], and depth estimation [14]. Of these, Zhou et al. [64] and Godard et al. [14] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we are introducing a similar loss to push $G$ and $F$ to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. 
[59] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [17]. ::: :::success **Cycle Consistency** 使用[遞移性](https://terms.naer.edu.tw/detail/2f4b9688012db1c159f93a6e3d5d2d6b/)做為正規化性結構性資料的想法已經有很長一段歷史了。在[視覺追蹤](https://terms.naer.edu.tw/detail/566c382ad4cd3e01ee5ba8e399c3dd7a/)中,這幾十年來,強制做簡單的前向-反向一致性一直是標準技巧。在語言領域中,透過"[回譯](https://terms.naer.edu.tw/detail/385129fc9070a84ec3a894d6bda4ab19/)與[調節](https://terms.naer.edu.tw/detail/27d283f4eb5051cc54593b7baf789870/)"來驗證與改進翻譯人類翻譯以及機器會使用到的技術。近年來,higher-order cycle consistency已經被用於動作、3D形狀匹配、cosegmentation(協同分割?)、dense semantic alignment(密集語意對齊?)與深度估計。其中,Zhou等人與Godard等人跟我們的研究最相似,因為他們使用cycle consistency loss做為supervise CNN訓練的一種方式。在這個研究中,我們引入一個類似的loss來推動讓$G$跟$F$能夠彼此一致。在我們研究的同時,在這些相同的程序中,Yi等人受到[機器翻譯](https://terms.naer.edu.tw/detail/8164232b1bd9d136c089bf91e90e07c4/)中的dual learning啟發,使用一種用於不成對的影像到影像轉換的目標。 ::: :::info **Neural Style Transfer** [13, 23, 52, 12] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher level appearance structures. Therefore, our method can be applied to other tasks, such as painting→ photo, object transfiguration, etc. where single sample transfer methods do not perform well. We compare these two methods in Section 5.2. ::: :::success **Neural Style Transfer** 是另一種執行image-to-image translation的方法,基於匹配預訓練的深度特徵的Gram matrix statistics,透過結合一張影像的內容與另一張影像的風格(通常是一幅畫)來合成為一張新的影像。另一方面,我們主要關注的是透過嚐試補捉更高層級的外觀結構之間的對應關係來學習兩個影像集合之間的映射,而不是兩張特定影像之間的映射。因此,我們的方法可以被應用到其它的任務,像是繪圖$\to$照片,物體變形(object transfiguration)等。我們在Section 5.2中比較這兩種方法。 ::: ## 3. Formulation :::info Our goal is to learn mapping functions between two domains $X$ and $Y$ given training samples $\left\{x_i\right\}^N_{i=1}$ where $x_i \in X$ and $\left\{y_j\right\}^M_{j=1}$ where $y_j \in Y$ . We denote the data distribution as $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$. As illustrated in Figure 3 (a), our model includes two mappings $G : X \to Y$ and $F : Y \to X$. In addition, we introduce two adversarial discriminators $D_X$ and $D_Y$ , where $D_X$ aims to distinguish between images $\left\{x\right\}$ and translated images $\left\{F(y)\right\}$; in the same way, $D_Y$ aims to discriminate between $\left\{y\right\}$ and $\left\{G(x)\right\}$. Our objective contains two types of terms: adversarial losses [16] for matching the distribution of generated images to the data distribution in the target domain; and cycle consistency losses to prevent the learned mappings $G$ and $F$ from contradicting each other. 
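:::

:::warning
個人補充:把上面「兩個映射$G$、$F$+兩個discriminator $D_X$、$D_Y$+兩種loss」的結構用程式碼骨架整理一下,純屬示意、不是官方實作:從生成器的角度把「兩個方向的adversarial loss + cycle consistency loss」加總,對應後面Section 3.3的式(3);這裡adversarial那一項先用least-squares的形式當佔位(對應Section 4的訓練細節),$\lambda$是式(3)的權重(論文實驗設10)。

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()
mse = nn.MSELoss()  # stand-in adversarial criterion (least-squares form, see Section 4)

def adversarial_term(D, fake):
    """Generator-side adversarial term: push D(fake) toward the 'real' label (=1)."""
    score = D(fake)
    return mse(score, torch.ones_like(score))

def full_objective(G, F, D_X, D_Y, x, y, lam=10.0):
    """Sketch of L_GAN(G, D_Y) + L_GAN(F, D_X) + lambda * L_cyc(G, F),
    written from the generators' point of view (discriminator updates not shown)."""
    fake_y, fake_x = G(x), F(y)
    loss_gan = adversarial_term(D_Y, fake_y) + adversarial_term(D_X, fake_x)
    loss_cyc = l1(F(fake_y), x) + l1(G(fake_x), y)
    return loss_gan + lam * loss_cyc

if __name__ == "__main__":
    G = nn.Conv2d(3, 3, 3, padding=1); F = nn.Conv2d(3, 3, 3, padding=1)
    D_X = nn.Conv2d(3, 1, 4, stride=2, padding=1); D_Y = nn.Conv2d(3, 1, 4, stride=2, padding=1)
    x, y = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
    print(full_objective(G, F, D_X, D_Y, x, y).item())
```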
::: :::success 我們的目標是在給定訓練樣本$\left\{x_i\right\}^N_{i=1}$、$\left\{y_j\right\}^M_{j=1}$的情況下學習介於domains $X$、$Y$之間的映射函數,其中$x_i \in X$、$y_j \in Y$。我們把資料分佈表示為$x \sim p_{data}(x)$和$y \sim p_{data}(y)$。如Figure 3(a)所描述,我們的模型包括兩個映射,$G : X \to Y$與$F : Y \to X$。此外,我們還引入兩個adversarial discriminators,$D_X$與$D_Y$,其中$D_X$旨在區別$\left\{x\right\}$與轉變後的$\left\{F(y)\right\}$;相同的,$D_Y$則是區別$\left\{y\right\}$與$\left\{G(x)\right\}$。我們的目標包含兩個類型的項目:adversarial losses,用於匹配生成影像的分佈到[目標定義域](https://terms.naer.edu.tw/detail/ca2345cd7633acdbee75c0abc816edb7/)(target domain)的資料分佈;與cycle consistency losses,用於預防學習到的映射$G$和$F$相互矛盾。 ::: :::info ![image](https://hackmd.io/_uploads/HkyssiiDA.png) Figure 3: (a) Our model contains two mapping functions $G : X \to Y$ and $F : Y \to X$, and associated adversarial discriminators $D_Y$ and $D_X$. $D_Y$ encourages $G$ to translate $X$ into outputs indistinguishable from domain $Y$ , and vice versa for $D_X$ and $F$. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: $x \to G(x) \to F(G(x)) \approx x$, and (c) backward cycle-consistency loss: $y \to F(y) to G(F(y)) \approx y$. ::: ### 3.1. Adversarial Loss :::info We apply adversarial losses [16] to both mapping functions. For the mapping function $G : X \to Y$ and its discriminator $D_Y$, we express the objective as: $$ \begin{align} \mathcal{L}_{GAN}(G, D_Y , X, Y ) &= \mathbb{E}_{y\sim p_{data}(y)} [\log D_Y (y)] \\ &+ \mathbb{E}_{x\sim p_{data}(x)} [\log(1 − D_Y (G(x))], \tag{1} \end{align} $$ where $G$ tries to generate images $G(x)$ that look similar to images from domain $Y$ , while $D_Y$ aims to distinguish between translated samples $G(x)$ and real samples $y$. $G$ aims to minimize this objective against an adversary $D$ that tries to maximize it, i.e., $\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y , X, Y )$. We introduce a similar adversarial loss for the mapping function $F : Y \to X$ and its discriminator $D_X$ as well: i.e., $\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, Y, X)$. ::: :::success 我們對兩個映射函數使用adversarial losses。對於映射函數$G : X \to Y$與其discriminator $D_Y$,我們將其目標式表示為: $$ \begin{align} \mathcal{L}_{GAN}(G, D_Y , X, Y ) &= \mathbb{E}_{y\sim p_{data}(y)} [\log D_Y (y)] \\ &+ \mathbb{E}_{x\sim p_{data}(x)} [\log(1 − D_Y (G(x))], \tag{1} \end{align} $$ 其中$G$嚐試生成跟domain $Y$的影像很相似的影像$G(x)$,而$D_Y$則是要區別轉變後的樣本$G(x)$與實際樣本$y$。$G$的目標是最小化這個目標式,而對手$D$則是要最大化這個目標,也就是$\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y , X, Y )$。我們為映射函數$F : Y \to X$與其discriminator $D_X$引入了類似的adversarial loss,也就是$\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, Y, X)$。 ::: ### 3.2. Cycle Consistency Loss :::info Adversarial training can, in theory, learn mappings $G$ and $F$ that produce outputs identically distributed as target domains $Y$ and $X$ respectively (strictly speaking, this requires $G$ and $F$ to be stochastic functions) [15]. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input $x_i$ to a desired output $y_i$. 
To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 3 (b), for each image $x$ from domain $X$, the image translation cycle should be able to bring $x$ back to the original image, i.e., $x \to G(x) \to F(G(x)) \approx x$. We call this forward cycle consistency. Similarly, as illustrated in Figure 3 (c), for each image $y$ from domain $Y$, $G$ and $F$ should also satisfy backward cycle consistency: $y \to F(y) \to G(F(y)) \approx y$. We incentivize this behavior using a cycle consistency loss:

$$
\begin{align}
\mathcal{L}_{cyc}(G,F) &= \mathbb{E}_{x\sim p_{data}(x)}[\Vert F(G(x)) - x \Vert_1] \\
&+ \mathbb{E}_{y\sim p_{data}(y)}[\Vert G(F(y)) - y \Vert_1] \tag{2}
\end{align}
$$

In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between $F(G(x))$ and $x$, and between $G(F(y))$ and $y$, but did not observe improved performance.
:::

:::success
理論上,對抗訓練可以學習到映射$G$和$F$,使它們分別產生的輸出與目標定義域$Y$和$X$的分佈完全相同(嚴格來說,這要求$G$和$F$是隨機函數)。然而,如果容量夠大,神經網路可以將同一組輸入影像映射到目標定義域中影像的任意隨機排列,其中,任何學習到的映射都可以導致匹配目標定義域的輸出分佈。因此,單純的adversarial losses並不能保證學習到的函數可以將單獨的輸入$x_i$映射到所期望的輸出$y_i$。為了進一步減少可能的映射函數的空間,我們認為,學習到的映射函數應該要是循環一致性的(cycle-consistent):如Figure 3(b)所示,對於每個來自domain $X$的影像$x$,其影像轉變循環應該要能夠將$x$轉回原本的影像,也就是$x \to G(x) \to F(G(x)) \approx x$。我們將之稱為forward cycle consistency(前向循環一致性)。類似地,如Figure 3(c)所示,對於每個來自於domain $Y$的影像$y$,$G$跟$F$也應該要能夠滿足backward cycle consistency(反向循環一致性):$y \to F(y) \to G(F(y)) \approx y$。我們使用cycle consistency loss來鼓勵這種行為:

$$
\begin{align}
\mathcal{L}_{cyc}(G,F) &= \mathbb{E}_{x\sim p_{data}(x)}[\Vert F(G(x)) - x \Vert_1] \\
&+ \mathbb{E}_{y\sim p_{data}(y)}[\Vert G(F(y)) - y \Vert_1] \tag{2}
\end{align}
$$

在初步實驗中,我們也有試著用$F(G(x))$與$x$之間、以及$G(F(y))$與$y$之間的adversarial loss來取代該loss中的L1 norm,不過這並沒有觀察到效能上的改善就是。
:::

:::info
The behavior induced by the cycle consistency loss can be observed in Figure 4: the reconstructed images $F(G(x))$ end up matching closely to the input images $x$.
:::

:::success
由cycle consistency loss所引起的行為可以在Figure 4中觀察到:重建的影像$F(G(x))$最終跟輸入影像$x$緊密匹配。
:::

:::info
![image](https://hackmd.io/_uploads/BkLa72jP0.png)
Figure 4: The input images $x$, output images $G(x)$ and the reconstructed images $F(G(x))$ from various experiments. From top to bottom: photo $\leftrightarrow$ Cezanne, horses $\leftrightarrow$ zebras, winter $\leftrightarrow$ summer Yosemite, aerial photos $\leftrightarrow$ Google maps.
:::

### 3.3. Full Objective
:::info
Our full objective is:

$$
\begin{align}
\mathcal{L}(G,F,D_X,D_Y) &=\mathcal{L}_{GAN}(G,D_Y,X,Y)\\
&+ \mathcal{L}_{GAN}(F,D_X,Y,X)\\
&+ \lambda\mathcal{L}_{cyc}(G,F) \tag{3}
\end{align}
$$

where $\lambda$ controls the relative importance of the two objectives. We aim to solve:

$$
G^*,F^* = \arg\min_{G,F}\max_{D_X,D_Y}\mathcal{L}(G,F,D_X,D_Y)\tag{4}
$$
:::

:::success
我們完整的目標式是:

$$
\begin{align}
\mathcal{L}(G,F,D_X,D_Y) &=\mathcal{L}_{GAN}(G,D_Y,X,Y)\\
&+ \mathcal{L}_{GAN}(F,D_X,Y,X)\\
&+ \lambda\mathcal{L}_{cyc}(G,F) \tag{3}
\end{align}
$$

其中$\lambda$控制著兩個目標的相對重要性,我們旨在解決:

$$
G^*,F^* = \arg\min_{G,F}\max_{D_X,D_Y}\mathcal{L}(G,F,D_X,D_Y)\tag{4}
$$
:::

:::warning
論文排版中,式(3)的第三項印作$\lambda_{cyc}(G,F)$、式(4)的$\max$下標印作$D_x$;依照Section 3對discriminator的定義,應為$\lambda\mathcal{L}_{cyc}(G,F)$與$D_X, D_Y$,上面已照定義寫法修正。
:::

:::info
Notice that our model can be viewed as training two “autoencoders” [20]: we learn one autoencoder $F \circ G : X \to X$ jointly with another $G \circ F : Y \to Y$. However, these autoencoders each have special internal structures: they map an image to itself via an intermediate representation that is a translation of the image into another domain. Such a setup can also be seen as a special case of “adversarial autoencoders” [34], which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution. In our case, the target distribution for the $X \to X$ autoencoder is that of the domain $Y$.
:::

:::success
注意到,我們的模型可以被視為訓練兩個"autoencoders":我們學習一個autoencoder $F \circ G : X \to X$跟另一個$G \circ F : Y \to Y$。然而,這些autoencoders都有特別的內部結構:它們透過一個將影像轉到另一個domain的中間表示將影像映射到自身。這樣的設定也可以被視為“adversarial autoencoders”的特例,它使用adversarial loss來訓練autoencoder的bottleneck layer來匹配任意的目標分佈。在我們的案例中,$X \to X$ autoencoder的目標分佈就是domain $Y$的分佈。
:::

:::info
In Section 5.1.4, we compare our method against ablations of the full objective, including the adversarial loss $\mathcal{L}_{GAN}$ alone and the cycle consistency loss $\mathcal{L}_{cyc}$ alone, and empirically show that both objectives play critical roles in arriving at high-quality results. We also evaluate our method with only cycle loss in one direction and show that a single cycle is not sufficient to regularize the training for this under-constrained problem.
:::

:::success
Section 5.1.4中,我們會比較我們的方法跟完整目標的消融,包括單獨的adversarial loss $\mathcal{L}_{GAN}$跟單獨的cycle consistency loss $\mathcal{L}_{cyc}$,並根據經驗來說明兩個目標對於獲得高品質結果都起了關鍵作用。我們還評估只在一個方向上單純使用cycle loss的方法,並說明對於這個under-constrained的問題,single cycle是不足以正規化訓練的。
:::

## 4. Implementation
:::info
**Network Architecture** We adopt the architecture for our generative networks from Johnson et al. [23] who have shown impressive results for neural style transfer and super-resolution. This network contains three convolutions, several residual blocks [18], two fractionally-strided convolutions with stride $\dfrac{1}{2}$, and one convolution that maps features to RGB. We use 6 blocks for $128 \times 128$ images and 9 blocks for $256 \times 256$ and higher-resolution training images. Similar to Johnson et al. [23], we use instance normalization [53]. For the discriminator networks we use $70 \times 70$ PatchGANs [22, 30, 29], which aim to classify whether $70 \times 70$ overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion [22].
:::

:::success
**Network Architecture** 我們採用Johnson等人的架構做為我們的生成網路,這架構在neural style transfer與super-resolution都有著讓人驚豔的成果。這網路包含三個卷積,多個residual blocks,兩個stride為$\dfrac{1}{2}$的fractionally-strided卷積,跟一個將特徵映射到RGB的卷積。對於$128 \times 128$的影像,我們使用6個block來處理,然後$256 \times 256$與更高解析度的訓練影像就用9個block。類似於Johnson等人,我們使用instance normalization。discriminator networks的部份,我們使用$70 \times 70$ PatchGANs,目的是分辨$70 \times 70$ overlapping image patches(重疊影像的區域)是真的還是假的。這種patch-level discriminator的架構比起full-image discriminator有著更少的參數,而且可以以fully convolutional的方式在任意大小的影像上執行。
:::

:::info
**Training details** We apply two techniques from recent works to stabilize our model training procedure. First, for $\mathcal{L}_{GAN}$ (Equation 1), we replace the negative log likelihood objective by a least-squares loss [35]. This loss is more stable during training and generates higher quality results. In particular, for a GAN loss $\mathcal{L}_{GAN}(G,D,X,Y)$, we train the $G$ to minimize $\mathbb{E}_{x\sim p_{data}(x)}[(D(G(x))-1)^2]$ and train the $D$ to minimize $\mathbb{E}_{y\sim p_{data}(y)}[(D(y)-1)^2] + \mathbb{E}_{x\sim p_{data}(x)}[D(G(x))^2]$.
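:::

:::warning
個人補充:把上面「用least-squares loss取代negative log likelihood」這個訓練細節寫成示意程式(非官方實作):$G$最小化$\mathbb{E}[(D(G(x))-1)^2]$、$D$最小化$\mathbb{E}[(D(y)-1)^2]+\mathbb{E}[D(G(x))^2]$。順便附上後面「減少模型振盪」那一段會提到的50張歷史生成影像緩衝區的簡化版;其中「一半機率換回舊圖」的取樣策略是我自己假設的簡化寫法,論文本文只說會保存50張先前生成的影像。

```python
import random

import torch
import torch.nn as nn

def lsgan_g_loss(D, fake):
    """Generator side: E[(D(G(x)) - 1)^2]."""
    score = D(fake)
    return ((score - 1.0) ** 2).mean()

def lsgan_d_loss(D, real, fake):
    """Discriminator side: E[(D(y) - 1)^2] + E[D(G(x))^2]."""
    return ((D(real) - 1.0) ** 2).mean() + (D(fake.detach()) ** 2).mean()

class ImagePool:
    """Simplified buffer of the 50 previously generated images used to update D;
    the 50/50 'hand back an old image' sampling below is my own simplification."""
    def __init__(self, size=50):
        self.size, self.images = size, []

    def query(self, image):
        if len(self.images) < self.size:
            self.images.append(image.detach())
            return image
        if random.random() < 0.5:                 # sometimes return an old image instead
            idx = random.randrange(self.size)
            old, self.images[idx] = self.images[idx], image.detach()
            return old
        return image                              # otherwise use the freshly generated one

if __name__ == "__main__":
    D = nn.Conv2d(3, 1, 4, stride=2, padding=1)   # stand-in discriminator
    fake = torch.randn(1, 3, 64, 64)              # pretend this is G(x)
    pool = ImagePool()
    print(lsgan_d_loss(D, torch.randn(1, 3, 64, 64), pool.query(fake)).item())
```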
::: :::success **Training details** 我們從近來的研究中採用兩種技術來穩定我們的模型訓練過程。首先,對於$\mathcal{L}_{GAN}$ (Equation 1),我們用least-squares loss來取代掉負的對數[似然度](https://terms.naer.edu.tw/detail/b33b53e444ad5a5186090790424d72a8/)目標(negative log likelihood objective)。這個loss在訓練過程中更為穩定,而且能產出更高品質的結果。特別是,對於GAN loss $\mathcal{L}_{GAN}(G,D,X,Y)$,我們訓練$G$來最小化$\mathbb{E}_{x\sim p_{data}(x)}[D(G(x))-1)^2]$,然後訓練$D$來最小化$\mathbb{E}_{y\sim p_{data}(y)}[D(y)-1)^2] + \mathbb{E}_{x\sim p_{data}(x)} [D(G(x))^2 ]$。 ::: :::info Second, to reduce model oscillation [15], we follow Shrivastava et al.’s strategy [46] and update the discrimi nators using a history of generated images rather than the ones produced by the latest generators. We keep an image buffer that stores the 50 previously created images. ::: :::success 再來就是,為了減少模型的振盪,我們依循著Shrivastava等人的策略,使用生成影像的歷史資料,而不是最新的生成器生成的影像更新discriminators。我們會維持影像緩衝區來保存50張先前建立的影像。 ::: :::info For all the experiments, we set $\lambda=10$ in Equation 3. We use the Adam solver [26] with a batch size of 1. All networks were trained from scratch with a learning rate of 0.0002. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Please see the appendix (Section 7) for more details about the datasets, architectures, and training procedures. ::: :::success 所有的實驗中,我們都把方程式3中的$\lambda$設置為$10$。使用Adam solver,搭配batch size為1的設置。所有的神經網路都是以0.0002的學習效率重頭開始訓練。在前100個epochs會保持相同的學習效率,然後接下來的100個epochs會將學習效率以線性方式一路減減減減到零。對於資料集、架構、訓練過程的更多細節可參考附錄。 ::: ## 5. Results :::info We first compare our approach against recent methods for unpaired image-to-image translation on paired datasets where ground truth input-output pairs are available for evaluation. We then study the importance of both the adversarial loss and the cycle consistency loss and compare our full method against several variants. Finally, we demonstrate the generality of our algorithm on a wide range of applications where paired data does not exist. For brevity, we refer to our method as CycleGAN. The PyTorch and Torch code, models, and full results can be found at our website. ::: :::success 我們首先在有成對的資料集上將我們的方法與近來非成對的圖像到圖像轉換方法做比較,這些資料集中有可用於評估的輸入、輸出配對資料。然後研究adversarial loss與cycle consistency loss的重要性,然後把我們的完整方法跟幾個變體做比較。最終,我們在不存在成對資瞪的廣泛應用中證明我們演算法的通用性。簡潔起見,我們將我們的方法稱之為CycleGAN。PyTorch與Torch的程式碼、模型以及完整的結果都可以在我們的網站上找到。 ::: ### 5.1. Evaluation :::info Using the same evaluation datasets and metrics as “pix2pix” [22], we compare our method against several baselines both qualitatively and quantitatively. The tasks include semantic labels↔photo on the Cityscapes dataset [4], and map↔aerial photo on data scraped from Google Maps. We also perform ablation study on the full loss function. ::: :::success 使用跟"pix2pix"相同的評估資料集與指標,把我們的方法跟幾個基線做定性與定量的比較。任務包含Cityscapes資料集上的semantic label $\leftrightarrow$ photo跟爬自Google Maps的map $\leftrightarrow$ aerial photo。我們也對完整的loss function做消融的研究。 ::: #### 5.1.1 Evaluation Metrics :::info **AMT perceptual studies** On the map↔aerial photo task, we run “real vs fake” perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of our outputs. We follow the same perceptual study protocol from Isola et al. [22], except we only gather data from 25 participants per algorithm we tested. Participants were shown a sequence of pairs of images, one a real photo or map and one fake (generated by our algorithm or a baseline), and asked to click on the image they thought was real. 
The first 10 trials of each session were practice and feedback was given as to whether the participant’s response was correct or incorrect. The remaining 40 trials were used to assess the rate at which each algorithm fooled participants. Each session only tested a single algorithm, and participants were only allowed to complete a single session. The numbers we report here are not directly comparable to those in [22] as our ground truth images were processed slightly differently and the participant pool we tested may be differently distributed from those tested in [22] (due to running the experiment at a different date and time). Therefore, our numbers should only be used to compare our current method against the baselines (which were run under identical conditions), rather than against [22]. ::: :::success **AMT perceptual studies** 在map $\leftrightarrow$ aerial photo的任務中,我們在Amazon Mechanical Turk (AMT)上執行了"真與假"的感知研究來評估我們輸出的真實性。我們依循著與Isola等人相同的感知研究準則,不過我們所測試的每個演算法就只收集25個參與者的資料。我們向參與者展示一系列的成對影像,一個是真的照片或是地圖,然後一個是假的(演算法或是基線方法生成的),然後要求參與者點擊他們覺得是真實的那一個。每個session的前十個試驗都是練習,不管參與者的回應是正確還是錯誤的都會給予反饋。其餘40個試驗就被用來評估每種演算法騙過參與者的比例。每個session只會測試一種演算法,並且參與者只被允許完成單一個session。我們這邊報告的數字並不能直接地跟[22]中的資料做比較,因為我們的實際影像的處理有些許的不同,而且我們測試的受試群體(participant pool)可能與之分佈不同(不同的日期、不同的時間)。因此,我們的數字就只能比較我們當前的方法跟基線方法,而不是跟[22]比較。 ::: :::info **FCN score** Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments. For this, we adopt the “FCN score” from [22], and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [33]). The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using standard semantic segmentation metrics described below. The intuition is that if we generate a photo from a label map of “car on the road”, then we have succeeded if the FCN applied to the generated photo detects “car on the road”. ::: :::success **FCN score** 雖然感知器的研究也許對於評估圖形真實性來說是黃金準則,不過我們仍然尋求一種不需要人類實驗的自動化定量測量。為此,我們採用[22]中的"FCN score",並用它來評估City scapes labels $\to$ photo的任務。FCN metric根據一種現成的語意分割演算法評估生成的照片的可解釋性(來自[33]的fully-convolutional network, FCN)。FCN為生成的照片預測一個label map(標記地圖?)。然後,這個label map可以使用下面會提到的標準的語意分割指標來跟實際的labels比較。直觀來看就是,如果我們從"car on the road"的一個label map中生成一張照片,然後FCN用在這張生成的照片還檢測到"car on the road",那就成功。 ::: :::info **Semantic segmentation metrics** To evaluate the performance of photo $\to$ labels, we use the standard metrics from the Cityscapes benchmark [4], including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU) [4]. ::: :::success **Semantic segmentation metrics** 為能評估photo $\to$ labels的效能,我們使用Cityscapes benchmark中的標準指標,包含per-pixel accuracy、per-class accuracy、與mean class Intersection-Over-Union (Class IOU) 。 ::: #### 5.1.2 Baselines :::info **CoGAN** [32] This method learns one GAN generator for domain $X$ and one for domain $Y$, with tied weights on the first few layers for shared latent representations. Translation from $X$ to $Y$ can be achieved by finding a latent representation that generates image $X$ and then rendering this latent representation into style $Y$. 
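:::

:::warning
個人補充:上面5.1.1提到的FCN score會用到的semantic segmentation metrics(per-pixel accuracy、per-class accuracy、mean class IoU),概念上都可以從混淆矩陣算出來。下面是我自己寫的簡化示意(不是論文或Cityscapes官方的evaluation code),假設`pred`、`gt`是同樣大小、值為類別編號的整數陣列。

```python
import numpy as np

def segmentation_scores(pred, gt, num_classes):
    """Per-pixel accuracy, per-class accuracy and mean class IoU from a confusion matrix.
    `pred` and `gt` are integer label maps of the same shape."""
    mask = (gt >= 0) & (gt < num_classes)
    hist = np.bincount(num_classes * gt[mask].astype(int) + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    per_pixel_acc = np.diag(hist).sum() / hist.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        per_class_acc = np.nanmean(np.diag(hist) / hist.sum(axis=1))
        iou = np.diag(hist) / (hist.sum(axis=1) + hist.sum(axis=0) - np.diag(hist))
    return per_pixel_acc, per_class_acc, np.nanmean(iou)

if __name__ == "__main__":
    gt = np.random.randint(0, 5, size=(128, 128))
    pred = np.random.randint(0, 5, size=(128, 128))
    print(segmentation_scores(pred, gt, num_classes=5))
```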
::: :::success **CoGAN** 這個方法學習一個用於domain $X$跟一個用於domain $Y$的GAN generator,前幾個網路層的權重是一起的,以此共享潛在的表示(latent representations)。從$X$到$Y$的轉換可以透過找出一個生成影像$x$的潛在表示來實現,然後再渲染這個潛在表示為style $Y$。 ::: :::info **SimGAN** [46] Like our method, Shrivastava et al.[46] uses an adversarial loss to train a translation from $X$ to $Y$.The regularization term $\Vert x - G(x) \Vert_1$ is used to penalize making large changes at pixel level. ::: :::success **SimGAN** 跟我們的方法很像,Shrivastava等人使用adversarial loss訓練一個從$X$到$Y$的轉換。正規化項目$\Vert x - G(x) \Vert_1$主要用來懲罰在pixel level上造成較大變化的部份。 ::: :::info **Feature loss + GAN** We also test a variant of SimGAN [46] where the L1 loss is computed over deep image features using a pretrained network (VGG-16 relu4_2 [47]), rather than over RGB pixel values. Computing distances in deep feature space, like this, is also sometimes referred to as using a “perceptual loss” [8, 23]. ::: :::success **Feature loss + GAN** 我們還測試了SimGAN的一種變體,其L1 loss是使用預訓練的神經網路(VGG-16 relu4_2)根據深度影像特徵計算而得,而不是根據RGB像素值。像這樣計算深度特徵空間中的距離有時候也被稱為使用“perceptual loss”。 ::: :::info **BiGAN/ALI** [9, 7] Unconditional GANs [16] learn a generator $G : Z \to X$, that maps a random noise $z$ to an image $x$. The BiGAN [9] and ALI [7] propose to also learn the inverse mapping function $F : X \to Z$. Though they were originally designed for mapping a latent vector $z$ to an image $x$, we implemented the same objective for mapping a source image $x$ to a target image $y$. ::: :::success **BiGAN/ALI** Unconditional GANs學習一個generator $G : Z \to X$,映射一個隨機噪點$z$到影像$x$。BiGAN跟ALI提出同樣學習逆映射函數$F : X \to Z$。雖然它們最一開始的設計用用來映射潛在向量$z$到影像$x$,不過我們現實相同的目標,也就是將來源影像$x$映射到目標影像$y$。 ::: :::info **pix2pix** [22] We also compare against pix2pix [22], which is trained on paired data, to see how close we can get to this “upper bound” without using any paired data. ::: :::success **pix2pix** 我們同時也跟pix2pix做了比較,這個方法是訓練在成對資料上,看看在沒有使用任何成對資料的情況下,我們能夠多接近這個"上限"。 ::: :::info For a fair comparison, we implement all the baselines using the same architecture and details as our method, except for CoGAN [32]. CoGAN builds on generators that produce images from a shared latent representation, which is incompatible with our image-to-image network. We use the public implementation of CoGAN instead. ::: :::success 為了公平比較,我們用跟我們的方法相同的架構、細節來實作所有的基線演算法,除了CoGAN。CoGAN是建立在從共享的潛在表示中生成影像的生成器,這跟我們的image-to-image network是不相容的。我們使用公開發行的CoGAN來替代。 ::: #### 5.1.3 Comparison against baselines :::info As can be seen in Figure 5 and Figure 6, we were unable to achieve compelling results with any of the baselines. Our method, on the other hand, can produce translations that are often of similar quality to the fully supervised pix2pix. ::: :::success 在Figure 5與Figure 6中可以看的到,我們無法使用任何基線方法取得引人注目的結果。另一方面,我們的方法能夠產生跟fully supervised pix2pix品質相似的轉換。 ::: :::info ![image](https://hackmd.io/_uploads/SydJEQTPR.png) Figure 5: Different methods for mapping labels↔photos trained on Cityscapes images. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth. ::: :::info ![image](https://hackmd.io/_uploads/HkwxV7TPR.png) Figure 6: Different methods for mapping aerial photos↔maps on Google Maps. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth. ::: :::info Table 1 reports performance regarding the AMT perceptual realism task. 
Here, we see that our method can fool participants on around a quarter of trials, in both the maps $\to$ aerial photos direction and the aerial photos $\to$ maps direction at 256 × 256 resolution. All the baselines almost never fooled participants. ::: :::success Table 1給出關於AMT perceptual realism task的效能。在這邊,我們可以看到,我們的方法可以欺騙大約四分之一的與參者,無法是在maps $\to$ aerial photos direction還是aerial photos $\to$ maps direction(解析度256x256)。所有的基線幾乎沒有騙到參與者過。 ::: :::info ![image](https://hackmd.io/_uploads/BJL24QTw0.png) Table 1: AMT “real vs fake” test on maps $\leftrightarrow$ aerial photos at 256 × 256 resolution. ::: :::info Table 2 assesses the performance of the labels $\to$ photo task on the Cityscapes and Table 3 evaluates the opposite mapping (photos $\to$ labels). In both cases, our method again outperforms the baselines. ::: :::success Table 2評估了在Cityscapes上的labels $\to$ photo task的效能,Table 3評估相反的映射(photos $\to$ labels)。在這兩種情況中,我們的方法再一次的優於基線方法。 ::: :::info ![image](https://hackmd.io/_uploads/HyfbBmpvA.png) Table 2: FCN-scores for different methods, evaluated on Cityscapes labels $\to$ photo ::: :::info ![image](https://hackmd.io/_uploads/r1vZIX6PC.png) Table 3: Classification performance of photo $\to$ labels for different methods on cityscapes ::: #### 5.1.4 Analysis of the loss function :::info In Table 4 and Table 5, we compare against ablations of our full loss. Removing the GAN loss substantially degrades results, as does removing the cycle-consistency loss. We therefore conclude that both terms are critical to our results. We also evaluate our method with the cycle loss in only one direction: GAN + forward cycle loss $\mathbb{E}_{x\sim p_{data}(x)}[\Vert F(G(x))-x\Vert_1]$, or GAN + backward cycle loss $\mathbb{E}_{y\sim p_{data}(y)}[\Vert G(F(y))-y\Vert_1]$ (Equation 2) and find that it often incurs training instability and causes mode collapse, especially for the direction of the mapping that was removed. Figure 7 shows several qualitative examples. ::: :::success 在Table 4與Table 5中,我們比較了所有loss的消融研究。移除掉GAN loss會降低結果,移除cycle-consistency loss也是一樣。因此,我們得到一個結論,這兩個項目對我們的結果都是非常重要的。我們還用只有一個方向的cycle loss來評估我們的方法:GAN + forward cycle loss $\mathbb{E}_{x\sim p_{data}(x)}[\Vert F(G(x))-x\Vert_1]$,或是GAN + backward cycle loss $\mathbb{E}_{y\sim p_{data}(y)}[\Vert G(F(y))-y\Vert_1]$(方程式2),然後發現它通常會導致訓練上的不穩定並導致模式崩潰的問題,特別是對於被移除掉的那個映射的方向。Figure 7說明了幾個定性的範例。 ::: :::info ![image](https://hackmd.io/_uploads/HJAPGlCv0.png) Figure 7: Different variants of our method for mapping labels↔photos trained on cityscapes. From left to right: input, cycleconsistency loss alone, adversarial loss alone, GAN + forward cycle-consistency loss $(F(G(x)) \approx x)$, GAN + backward cycle-consistency loss $(G(F(y)) \approx y)$, CycleGAN (our full method), and ground truth. Both Cycle alone and GAN + backward fail to produce images similar to the target domain. GAN alone and GAN + forward suffer from mode collapse, producing identical label maps regardless of the input photo. ::: :::info ![image](https://hackmd.io/_uploads/Skm3MgCvR.png) Table 4: Ablation study: FCN-scores for different variants of our method, evaluated on Cityscapes labels→photo. ::: :::info ![image](https://hackmd.io/_uploads/S1l6Mg0wR.png) Table 5: Ablation study: classification performance of photo $\to$ labels for different losses, evaluated on Cityscapes. ::: :::warning ablations,消融研究,是一種分析方法,用於評估模型或系統中各個組件的重要性。 ::: #### 5.1.5 Image reconstruction quality :::info In Figure 4, we show a few random samples of the reconstructed images $F(G(x))$. 
We observed that the reconstructed images were often close to the original inputs $x$, at both training and testing time, even in cases where one domain represents significantly more diverse information, such as map $\leftrightarrow$ aerial photos. ::: :::success 在Figure 4中,我們展示了一些重構影像$F(G(x))$的隨機樣本。我們觀察到,在訓練跟測試階段,重構影像通常會接近原始的輸入$x$,即使其中一個domain的表示(represents)有著更多樣化的信息也是一樣,像是map $\leftrightarrow$ aerial photos。 ::: #### 5.1.6 Additional results on paired datasets :::info Figure 8 shows some example results on other paired datasets used in “pix2pix” [22], such as architectural labels $\leftrightarrow$ photos from the CMP Facade Database [40], and edges $$\leftrightarrow$$ shoes from the UT Zappos50K dataset [60]. The image quality of our results is close to those produced by the fully supervised pix2pix while our method learns the mapping without paired supervision. ::: :::success Figure 8給出的是“pix2pix”中使用的一些其它的成對資料的樣本結果,像是來自CMP Facade資料集的architectural labels $\leftrightarrow$ photos,還有UT Zappos50K資料集的edges $$\leftrightarrow$$ shoes。我們所產出的影像品質非常接近這些fully supervised pix2pix的影像品質,而我們的模型是在沒有成對監督的情況下學習映射的。 ::: :::info ![image](https://hackmd.io/_uploads/rJhRQeRPR.png) Figure 8: Example results of CycleGAN on paired datasets used in “pix2pix” [22] such as architectural labels $\leftrightarrow$ photos and edges $\leftrightarrow$ shoes. ::: ### 5.2. Applications :::info We demonstrate our method on several applications where paired training data does not exist. Please refer to the appendix (Section 7) for more details about the datasets. We observe that translations on training data are often more appealing than those on test data, and full results of all applications on both training and test data can be viewed on our project website. ::: :::success 我們在不存在成對訓練資料的幾個應用程式上展式我們的方法。關於資料集的細節請參閱appendix (Section 7)。我們觀察到,訓練資料上的轉換通常比訓練資料上的轉換更有吸引力,完整的結果在我們的專案網站有所有應該程式的訓練、測試資料可以查看。 ::: :::info **Collection style transfer (Figure 10 and Figure 11)** We train the model on landscape photographs downloaded from Flickr and WikiArt. Unlike recent work on “neural style transfer” [13], our method learns to mimic the style of an entire collection of artworks, rather than transferring the style of a single selected piece of art. Therefore, we can learn to generate photos in the style of, e.g., Van Gogh, rather than just in the style of Starry Night. The size of the dataset for each artist/style was 526, 1073, 400, and 563 for Cezanne, Monet, Van Gogh, and Ukiyo-e. ::: :::success **Collection style transfer (Figure 10 and Figure 11)** 我們使用從Flickr與WikiArt下載的風景照來訓練模型。跟近來的研究“neural style transfer”不同,我們的方法是學習模仿整個藝術品收藏的風格,而不是transferring單一選定的風格。因此,我們可以學習到生成像是Van Gogh風格的照片,而不單單只有Starry Night的風格。對於Cezanne、Monet、 Van Gogh與Ukiyo-e的資料集大小分別為526、1073、400、563。 ::: :::info ![image](https://hackmd.io/_uploads/By5mPlCvR.png) Figure 10: Collection style transfer I: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, and Ukiyo-e. Please see our website for additional examples. ::: :::info ![image](https://hackmd.io/_uploads/rJh5wlADC.png) Figure 11: Collection style transfer II: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, Ukiyo-e. Please see our website for additional examples. ::: :::info **Object transfiguration (Figure 13)** The model is trained to translate one object class from ImageNet [5] to another (each class contains around 1000 training images). Turmukhambetov et al. 
[50] propose a subspace model to translate one object into another object of the same category, while our method focuses on object transfiguration between two visually similar categories.
:::

:::success
**Object transfiguration (Figure 13)** 這個模型是訓練來從ImageNet的一個物體類別轉換到另一個物體類別(每個類別大約包含1000張訓練影像)。Turmukhambetov等人提出一種子空間模型來將一個物體轉變到相同類別的另一個物體,而我們的方法則關注在兩個視覺上相似的類別之間的物體變形(object transfiguration)。
:::

:::info
![image](https://hackmd.io/_uploads/B1irdlCPA.png)
Figure 13: Our method applied to several translation problems. These images are selected as relatively successful results – please see our website for more comprehensive and random results. In the top two rows, we show results on object transfiguration between horses and zebras, trained on 939 images from the wild horse class and 1177 images from the zebra class in Imagenet [5]. Also check out the horse $\to$ zebra demo video. The middle two rows show results on season transfer, trained on winter and summer photos of Yosemite from Flickr. In the bottom two rows, we train our method on 996 apple images and 1020 navel orange images from ImageNet.
:::

:::info
**Season transfer (Figure 13)** The model is trained on 854 winter photos and 1273 summer photos of Yosemite downloaded from Flickr.
:::

:::success
**Season transfer (Figure 13)** 這模型使用從Flickr下載的854張Yosemite冬季照片與1273張夏季照片進行訓練。
:::

:::info
**Photo generation from paintings (Figure 12)** For painting $\to$ photo, we find that it is helpful to introduce an additional loss to encourage the mapping to preserve color composition between the input and output. In particular, we adopt the technique of Taigman et al. [49] and regularize the generator to be near an identity mapping when real samples of the target domain are provided as the input to the generator: i.e., $\mathcal{L}_{\text{identity}}(G,F)=\mathbb{E}_{y\sim p_{data}(y)}[\Vert G(y)-y\Vert_1] + \mathbb{E}_{x\sim p_{data}(x)}[\Vert F(x)-x\Vert_1].$
:::

:::success
**Photo generation from paintings (Figure 12)** 對於painting $\to$ photo的任務,我們發現到,引入一個額外的loss來鼓勵映射保留輸入與輸出之間的色彩成份是有幫助的。特別是,我們採用Taigman等人的技術,當提供target domain的真實樣本做為生成器的輸入的時候,正規化生成器能夠更接近恆等映射:也就是$\mathcal{L}_{\text{identity}}(G,F)=\mathbb{E}_{y\sim p_{data}(y)}[\Vert G(y)-y\Vert_1] + \mathbb{E}_{x\sim p_{data}(x)}[\Vert F(x)-x\Vert_1]$。
:::

:::info
![image](https://hackmd.io/_uploads/B1xq_g0w0.png)
Figure 12: Relatively successful results on mapping Monet's paintings to a photographic style. Please see our website for additional examples.
:::

:::info
Without $\mathcal{L}_{\text{identity}}$, the generator $G$ and $F$ are free to change the tint of input images when there is no need to. For example, when learning the mapping between Monet's paintings and Flickr photographs, the generator often maps paintings of daytime to photographs taken during sunset, because such a mapping may be equally valid under the adversarial loss and cycle consistency loss. The effect of this identity mapping loss is shown in Figure 9.
:::

:::success
如果沒有$\mathcal{L}_{\text{identity}}$的話,生成器$G$跟$F$就都可以在不需要的時候自由更改輸入影像的色調。舉例來說,當學習Monet's paintings與Flickr photographs之間的映射時,生成器通常將白天的畫作映射到日落拍攝的照片上,因為這樣的映射在adversarial loss與cycle consistency loss下可能同樣是有效的。這種identity mapping loss的影響如Figure 9所示。
:::

:::info
![image](https://hackmd.io/_uploads/SkqEtxAwC.png)
Figure 9: The effect of the identity mapping loss on Monet's painting $\to$ photos. From left to right: input paintings, CycleGAN without identity mapping loss, CycleGAN with identity mapping loss. The identity mapping loss helps preserve the color of the input paintings.
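:::

:::warning
個人補充:上面painting $\to$ photo用到的identity mapping loss,寫成示意程式大概像下面這樣(非官方程式碼,`G`、`F`一樣用小卷積層代替真正的生成器):把target domain的真實樣本直接餵給對應的生成器,要求輸出盡量不要改變輸入;附錄有提到它的權重取$0.5\lambda$。

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def identity_loss(G, F, x, y):
    """L_identity = E[||G(y) - y||_1] + E[||F(x) - x||_1]:
    feeding a real target-domain sample through the generator should barely change it."""
    return l1(G(y), y) + l1(F(x), x)

if __name__ == "__main__":
    G = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the X -> Y generator
    F = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the Y -> X generator
    x, y = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
    print(identity_loss(G, F, x, y).item())
```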
::: :::info In Figure 12, we show additional results translating Monet’s paintings to photographs. This figure and Figure 9 show results on paintings that were included in the training set, whereas for all other experiments in the paper, we only evaluate and show test set results. Because the training set does not include paired data, coming up with a plausible translation for a training set painting is a nontrivial task. Indeed, since Monet is no longer able to create new paintings, generalization to unseen, “test set”, paintings is not a pressing problem. ::: :::success 在Figure 12中,我們說明了將莫內畫作轉換到照片的其它結果。這圖跟Figure 9呈現了包含在訓練集中的畫作上的成果,而對於論文中的其它實驗,我們就只有評估,然後呈現測試集的結果。因為訓練集並不包含成對的資料,因此為訓練集繪製處理這種似真性的轉換是一項艱鉅的任務。確實,因為莫內已經沒有辦法再創作新的畫作了,對於沒看過的"測試集"畫作的泛化就不是一個迫切的問題了。 ::: :::info **Photo enhancement (Figure 14)** We show that our method can be used to generate photos with shallower depth of field. We train the model on flower photos downloaded from Flickr. The source domain consists of flower photos taken by smartphones, which usually have deep DoF due to a small aperture. The target contains photos captured by DSLRs with a larger aperture. Our model successfully generates photos with shallower depth of field from the photos taken by smartphones. ::: :::success **Photo enhancement (Figure 14)** 我們說明我們的方法可以用來生成景深較淺的照片。我們利用從Flickr下載的花的照片訓練模型。原始的domain是由手機拍的花花照片所組成,因為光圈較小,所以通常有較深的景深。目標包含有著大光圈的DSLRs所拍的照片。我們的模型成功地從手機所拍攝的照片產生景深較淺的照片。 ::: :::info ![image](https://hackmd.io/_uploads/ByZcKxRPC.png) Figure 14: Photo enhancement: mapping from a set of smartphone snaps to professional DSLR photographs, the system often learns to produce shallow focus. Here we show some of the most successful results in our test set – average performance is considerably worse. Please see our website for more comprehensive and random examples. ::: :::info **Comparison with Gatys et al.** [13] In Figure 15, we compare our results with neural style transfer [13] on photo stylization. For each row, we first use two representative artworks as the style images for [13]. Our method, on the other hand, can produce photos in the style of entire collection. To compare against neural style transfer of an entire collection, we compute the average Gram Matrix across the target domain and use this matrix to transfer the “average style” with Gatys et al [13]. ::: :::success **Comparison with Gatys et al.** 在Figure 15中,我們在照片風格上跟neural style transfer做比較。對於每個row,我們首先使用兩個代表性的藝術品做為[13]的風格照片。另一方面,我們的方法可以產生整個集合中的風格照片。為了跟整個集合的neural style transfer比較,我們計算target domain的average Gram Matrix,然後用這個矩陣跟Gatys等人一起transfer這個“average style”。 ::: :::info ![image](https://hackmd.io/_uploads/rkrTKgAP0.png) Figure 15: We compare our method with neural style transfer [13] on photo stylization. Left to right: input image, results from Gatys et al. [13] using two different representative artworks as style images, results from Gatys et al. [13] using the entire collection of the artist, and CycleGAN (ours). ::: :::info Figure 16 demonstrates similar comparisons for other translation tasks. We observe that Gatys et al. [13] requires finding target style images that closely match the desired output, but still often fails to produce photorealistic results, while our method succeeds to generate natural-looking results, similar to the target domain. 
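:::

:::warning
個人補充:上面跟Gatys et al.比較時提到的「對整個target domain計算average Gram Matrix來代表平均風格」,Gram matrix本身大致可以這樣算(純示意,不是[13]的官方實作);`feature_batches`假設是某個預訓練網路(例如VGG)某一層對一批批影像的輸出,正規化常數的取法各實作略有不同,這裡只是其中一種寫法。

```python
import torch

def gram_matrix(features):
    """Gram matrix of a (N, C, H, W) feature map: channel-by-channel inner products,
    normalized by the number of entries."""
    n, c, h, w = features.shape
    flat = features.view(n, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def average_gram(feature_batches):
    """Average the Gram matrices over a collection of images (the 'average style' idea)."""
    grams = [gram_matrix(f).mean(dim=0) for f in feature_batches]
    return torch.stack(grams).mean(dim=0)

if __name__ == "__main__":
    fake_features = [torch.randn(4, 64, 32, 32) for _ in range(3)]  # stand-in VGG features
    print(average_gram(fake_features).shape)  # (64, 64)
```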
::: :::success Figure 16展示了其它轉換任務的類似比較。我們觀察到Gatys等人的方法需要找到跟期望輸出緊密匹配的目標風格影像,不過常常會無法產生有真實感的結果,而我們的方法成功地生成跟target domain類似的混然天成的結果。 ::: :::info ![image](https://hackmd.io/_uploads/BkI-5eAwC.png) Figure 16: We compare our method with neural style transfer [13] on various applications. From top to bottom: apple $\to$ orange, horse $\to$ zebra, and Monet $\to$ photo. Left to right: input image, results from Gatys et al. [13] using two different images as style images, results from Gatys et al. [13] using all the images from the target domain, and CycleGAN (ours). ::: ## 6. Limitations and Discussion :::info Although our method can achieve compelling results in many cases, the results are far from uniformly positive. Figure 17 shows several typical failure cases. On translation tasks that involve color and texture changes, as many of those reported above, the method often succeeds. We have also explored tasks that require geometric changes, with little success. For example, on the task of dog $\to$ cat transfiguration, the learned translation degenerates into making minimal changes to the input (Figure 17). This failure might be caused by our generator architectures which are tailored for good performance on the appearance changes. Handling more varied and extreme transformations, especially geometric changes, is an important problem for future work. ::: :::success 雖然我們的方法可以在許多情況下取得令人驚豔的成果,不過也不是每天在過年的。Figure 17就給出幾個經典的失敗案例。在涉及色彩及紋理改變的轉換任務上,正如上面所說的那些,這方法通常是會成功的。我們還探索了需要幾何變化的任務,就一咪咪的成功就是。舉例來說,在dog $\to$ cat的轉換任務上,學習到的轉換退化到只對輸入做最小程度的改變(Figure 17)。這失敗也許是因為我們生成器架構所引起的,因為這個架構是為了外觀變化有好的效能所量身定做的。處理更多樣化和極端的變化,特別是幾何變化,是未來研究的重要議題。 ::: :::info ![image](https://hackmd.io/_uploads/r12QgXAv0.png) Figure 17: Typical failure cases of our method. Left: in the task of dog→cat transfiguration, CycleGAN can only make minimal changes to the input. Right: CycleGAN also fails in this horse $\to$ zebra example as our model has not seen images of horseback riding during training. Please see our website for more comprehensive results. ::: :::info Some failure cases are caused by the distribution characteristics of the training datasets. For example, our method has got confused in the horse $\to$ zebra example (Figure 17, right), because our model was trained on the wild horse and zebra synsets of ImageNet, which does not contain images of a person riding a horse or zebra. ::: :::success 一些失敗的案例是因為訓練資料集的分佈特徵所引起的。舉例來說,我們的方法對於horse $\to$ zebra的樣本感到困惑(Figure 17, right),因為我們的模型是訓練ImageNet的野馬與斑馬這種同義詞集上,並不包含騎著馬或是斑馬的人這類的影像。 ::: :::info We also observe a lingering gap between the results achievable with paired training data and those achieved by our unpaired method. In some cases, this gap may be very hard – or even impossible – to close: for example, our method sometimes permutes the labels for tree and building in the output of the photos $\to$ labels task. Resolving this ambiguity may require some form of weak semantic supervision. Integrating weak or semi-supervised data may lead to substantially more powerful translators, still at a fraction of the annotation cost of the fully-supervised systems. ::: :::success 我們還觀察到使用成對訓練資料可實現的結果與透過我們的未成對方法實現的結果之間存在揮之不去的差距。在某些情況下,這個差距可能是難以或者甚至不可能去接近的:舉例來說,我們的方法有時候在photos $\to$ labels任務的輸出中會交換樹跟建築物的標記。解決這種歧義可能需要某些形式的弱的語意監督。整合弱或是半監督資料也許可以得到更強大的轉換器,其標記資料的成本可能也只是完全監督系統的一小部份。 ::: :::info Nonetheless, in many cases completely unpaired data is plentifully available and should be made use of. This paper pushes the boundaries of what is possible in this “unsupervised” setting. 
:::info
Acknowledgments: We thank Aaron Hertzmann, Shiry Ginosar, Deepak Pathak, Bryan Russell, Eli Shechtman, Richard Zhang, and Tinghui Zhou for many helpful comments. This work was supported in part by NSF SMA1514512, NSF IIS-1633310, a Google Research Award, Intel Corp, and hardware donations from NVIDIA. JYZ is supported by the Facebook Graduate Fellowship and TP is supported by the Samsung Scholarship. The photographs used for style transfer were taken by AE, mostly in France.
:::

:::success
致謝:我們感謝Aaron Hertzmann、Shiry Ginosar、Deepak Pathak、Bryan Russell、Eli Shechtman、Richard Zhang與Tinghui Zhou提供的許多寶貴意見。這項工作部份由NSF SMA1514512、NSF IIS-1633310、Google Research Award、Intel Corp以及NVIDIA捐贈的硬體所支持。JYZ由Facebook Graduate Fellowship支持,TP由Samsung Scholarship支持。用於style transfer的照片由AE拍攝,大多攝於法國。
:::

## 7. Appendix

### 7.1. Training details

:::info
We train our networks from scratch, with a learning rate of 0.0002. In practice, we divide the objective by 2 while optimizing $D$, which slows down the rate at which $D$ learns, relative to the rate of $G$. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Weights are initialized from a Gaussian distribution $\mathcal{N}(0, 0.002)$.
:::

:::success
我們從頭開始訓練我們的神經網路,學習率為0.0002。實務上,我們在最佳化$D$的時候會把目標式除以2,這會使得$D$的學習速度相對於$G$變慢。前100個epochs維持相同的學習率,接下來的100個epochs再將學習率線性衰減到零。權重是從高斯分佈$\mathcal{N}(0, 0.002)$初始化。
:::

:::warning
下面資料集的介紹就單純整理不翻譯了。
:::

:::info
**Cityscapes label↔Photo** 2975 training images from the Cityscapes training set [4] with image size 128 × 128. We used the Cityscapes val set for testing.
:::

:::info
**Maps↔aerial photograph** 1096 training images were scraped from Google Maps [22] with image size 256×256. Images were sampled from in and around New York City. Data was then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set).
:::

:::info
**Architectural facades labels↔photo** 400 training images from the CMP Facade Database [40].
:::

:::info
**Edges→shoes** around 50,000 training images from UT Zappos50K dataset [60]. The model was trained for 5 epochs.
:::

:::info
**Horse↔Zebra and Apple↔Orange** We downloaded the images from ImageNet [5] using keywords wild horse, zebra, apple, and navel orange. The images were scaled to 256 × 256 pixels. The training set size of each class: 939 (horse), 1177 (zebra), 996 (apple), and 1020 (orange).
:::

:::info
**Summer↔Winter Yosemite** The images were downloaded using Flickr API with the tag yosemite and the datetaken field. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels. The training size of each class: 1273 (summer) and 854 (winter).
:::

:::info
**Photo↔Art for style transfer** The art images were downloaded from Wikiart.org. Some artworks that were sketches or too obscene were pruned by hand. The photos were downloaded from Flickr using the combination of tags landscape and landscapephotography. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels. The training set size of each class was 1074 (Monet), 584 (Cezanne), 401 (Van Gogh), 1433 (Ukiyo-e), and 6853 (Photographs). The Monet dataset was particularly pruned to include only landscape paintings, and the Van Gogh dataset included only his later works that represent his most recognizable artistic style.
:::

:::info
**Monet’s paintings→photos** To achieve high resolution while conserving memory, we used random square crops of the original images for training. To generate results, we passed images of width 512 pixels with correct aspect ratio to the generator network as input. The weight for the identity mapping loss was $0.5\lambda$ where $\lambda$ was the weight for cycle consistency loss. We set $\lambda=10$.
:::
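:::warning
依照7.1的訓練細節與上面提到的損失權重($\lambda=10$,identity mapping loss的權重為$0.5\lambda$),下面用PyTorch給一個極簡的示意(非官方實作):adversarial loss在這裡假設採用least-squares的形式,optimizer假設用Adam,這些細節請以論文與官方程式碼為準。

```python
import torch
import torch.nn as nn

# Loss weights as stated in the text: cycle consistency weight lambda = 10,
# identity mapping loss weight = 0.5 * lambda.
LAMBDA_CYC = 10.0
LAMBDA_IDT = 0.5 * LAMBDA_CYC

adv = nn.MSELoss()   # least-squares adversarial loss (an assumption in this sketch)
l1 = nn.L1Loss()

def generator_loss(G, F, D_X, D_Y, real_x, real_y):
    """Adversarial + cycle consistency + identity terms for both directions."""
    fake_y, fake_x = G(real_x), F(real_y)
    pred_y, pred_x = D_Y(fake_y), D_X(fake_x)
    loss_adv = adv(pred_y, torch.ones_like(pred_y)) + adv(pred_x, torch.ones_like(pred_x))
    loss_cyc = l1(F(fake_y), real_x) + l1(G(fake_x), real_y)   # F(G(x)) ~ x, G(F(y)) ~ y
    loss_idt = l1(G(real_y), real_y) + l1(F(real_x), real_x)   # identity mapping loss
    return loss_adv + LAMBDA_CYC * loss_cyc + LAMBDA_IDT * loss_idt

def discriminator_loss(D, real, fake):
    # Multiplied by 0.5, i.e. "divide the objective by 2 while optimizing D"
    pred_real, pred_fake = D(real), D(fake.detach())
    return 0.5 * (adv(pred_real, torch.ones_like(pred_real)) +
                  adv(pred_fake, torch.zeros_like(pred_fake)))

def init_weights(m):
    # Gaussian initialization per Section 7.1 (0.002 treated here as the standard deviation)
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.002)

def lr_lambda(epoch):
    # Constant learning rate for the first 100 epochs, then linear decay to zero
    # over the next 100 epochs.
    return 1.0 - max(0, epoch - 100) / 100.0

# Usage sketch (G, F, D_X, D_Y are networks such as those in Section 7.2):
# opt_G = torch.optim.Adam(list(G.parameters()) + list(F.parameters()), lr=2e-4)
# sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda)
```
:::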
:::info
**Flower photo enhancement** Flower images taken on smartphones were downloaded from Flickr by searching for the photos taken by Apple iPhone 5, 5s, or 6, with search text flower. DSLR images with shallow DoF were also downloaded from Flickr by search tag flower, dof. The images were scaled to 360 pixels by width. The identity mapping loss of weight $0.5\lambda$ was used. The training set size of the smartphone and DSLR dataset were 1813 and 3326, respectively. We set $\lambda=10$.
:::

### 7.2. Network architectures

:::info
We provide both PyTorch and Torch implementations.
:::

:::success
我們提供PyTorch跟Torch兩個版本的實作。
:::

:::info
**Generator architectures** We adopt our architectures from Johnson et al. [23]. We use 6 residual blocks for 128 × 128 training images, and 9 residual blocks for 256 × 256 or higher-resolution training images. Below, we follow the naming convention used in Johnson et al.’s Github repository.
:::

:::success
**Generator architectures** 我們採用Johnson等人[23]的架構。對於128 × 128的訓練影像使用6個residual blocks,256 × 256或更高解析度的訓練影像則使用9個residual blocks。下面我們遵循Johnson等人Github repository中使用的命名慣例。
:::

:::info
Let `c7s1-k` denote a 7×7 Convolution-InstanceNorm-ReLU layer with $k$ filters and stride 1. `dk` denotes a 3×3 Convolution-InstanceNorm-ReLU layer with $k$ filters and stride 2. Reflection padding was used to reduce artifacts. `Rk` denotes a residual block that contains two 3×3 convolutional layers with the same number of filters on both layers. `uk` denotes a 3×3 fractional-strided-Convolution-InstanceNorm-ReLU layer with $k$ filters and stride $\dfrac{1}{2}$.
:::

:::success
令`c7s1-k`表示一個7×7的Convolution-InstanceNorm-ReLU layer,有$k$個filters且stride為1。`dk`表示3×3的Convolution-InstanceNorm-ReLU layer,有$k$個filters且stride為2。Reflection padding用來減少瑕疵。`Rk`表示一個residual block,包含兩個3×3 convolutional layers,兩層有著相同數量的filters。`uk`表示3×3的fractional-strided-Convolution-InstanceNorm-ReLU layer,有$k$個filters且stride為$\dfrac{1}{2}$。
:::

:::info
The network with 6 residual blocks consists of:
`c7s1-64`,`d128`,`d256`,`R256`,`R256`,`R256`,`R256`,`R256`,`R256`,`u128`,`u64`,`c7s1-3`
:::

:::info
The network with 9 residual blocks consists of:
`c7s1-64`,`d128`,`d256`,`R256`,`R256`,`R256`,`R256`,`R256`,`R256`,`R256`,`R256`,`R256`,`u128`,`u64`,`c7s1-3`
:::

:::info
**Discriminator architectures** For discriminator networks, we use 70 × 70 PatchGAN [22]. Let `Ck` denote a 4 × 4 Convolution-InstanceNorm-LeakyReLU layer with $k$ filters and stride 2. After the last layer, we apply a convolution to produce a 1-dimensional output. We do not use InstanceNorm for the first `C64` layer. We use leaky ReLUs with a slope of 0.2. The discriminator architecture is: `C64-C128-C256-C512`
:::

:::success
**Discriminator architectures** discriminator的部份,我們使用70 × 70的PatchGAN [22]。令`Ck`表示一個4 × 4的Convolution-InstanceNorm-LeakyReLU layer,有$k$個filters且stride為2。在最後一層之後,我們再做一次卷積來產生1維的輸出。第一個`C64` layer並沒有使用InstanceNorm。使用的leaky ReLU其slope為0.2。discriminator的架構為:`C64-C128-C256-C512`。
:::
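:::warning
依照上面的命名慣例,用PyTorch簡單示意帶有9個residual blocks的generator與70 × 70 PatchGAN discriminator的疊法(非官方實作):各層padding的大小、generator輸出是否接Tanh、以及discriminator最後幾層的stride等細節都是假設,實際請以官方repository為準。

```python
import torch
import torch.nn as nn

def c7s1(in_ch, k, last=False):
    # c7s1-k: 7x7 Convolution-InstanceNorm-ReLU, stride 1, with reflection padding
    layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, k, kernel_size=7, stride=1)]
    layers += [nn.Tanh()] if last else [nn.InstanceNorm2d(k), nn.ReLU(inplace=True)]
    return layers

def d(in_ch, k):
    # dk: 3x3 Convolution-InstanceNorm-ReLU, stride 2
    return [nn.Conv2d(in_ch, k, 3, stride=2, padding=1),
            nn.InstanceNorm2d(k), nn.ReLU(inplace=True)]

def u(in_ch, k):
    # uk: 3x3 fractional-strided Convolution-InstanceNorm-ReLU, stride 1/2
    return [nn.ConvTranspose2d(in_ch, k, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(k), nn.ReLU(inplace=True)]

class R(nn.Module):
    # Rk: residual block with two 3x3 convolutions and the same number of filters on both layers
    def __init__(self, k):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(k, k, 3), nn.InstanceNorm2d(k), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(k, k, 3), nn.InstanceNorm2d(k))

    def forward(self, x):
        return x + self.block(x)

def generator(n_blocks=9):
    # c7s1-64, d128, d256, n x R256, u128, u64, c7s1-3
    layers = c7s1(3, 64) + d(64, 128) + d(128, 256)
    layers += [R(256) for _ in range(n_blocks)]
    layers += u(256, 128) + u(128, 64) + c7s1(64, 3, last=True)
    return nn.Sequential(*layers)

def C(in_ch, k, norm=True):
    # Ck: 4x4 Convolution-InstanceNorm-LeakyReLU, stride 2 (no InstanceNorm on the first C64)
    layers = [nn.Conv2d(in_ch, k, 4, stride=2, padding=1)]
    if norm:
        layers += [nn.InstanceNorm2d(k)]
    return layers + [nn.LeakyReLU(0.2, inplace=True)]

def discriminator():
    # C64-C128-C256-C512, followed by a convolution producing a 1-channel output map
    layers = C(3, 64, norm=False) + C(64, 128) + C(128, 256) + C(256, 512)
    layers += [nn.Conv2d(512, 1, 4, stride=1, padding=1)]
    return nn.Sequential(*layers)

# Quick shape check with a 256x256 input
x = torch.randn(1, 3, 256, 256)
print(generator()(x).shape)      # torch.Size([1, 3, 256, 256])
print(discriminator()(x).shape)  # a patch-level map of real/fake predictions
```
:::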