# Perceptual Losses for Real-Time Style Transfer and Super-Resolution(翻譯)
###### tags:`論文翻譯` `deeplearning`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
個人註解,任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/1603.08155)
:::
## Abstract
:::info
We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
:::
:::success
我們考慮影像轉換問題,也就是將輸入影像轉換為另一張輸出影像的問題。對於這類問題,近來的方法通常是使用輸出與真實影像之間的per-pixel loss來訓練一個前饋卷積神經網路(feed-forward convolutional neural networks)。同期的研究已經說明了,基於從預訓練網路中提取的高階特徵來定義並最佳化perceptual loss functions,是可以生成高品質影像的。我們結合兩種方法的優點,提出使用perceptual loss functions來訓練用於影像轉換任務的前饋網路。我們展示了在image style transfer上的結果,其中前饋網路被訓練來即時地解決Gatys等人所提出的最佳化問題。對比基於最佳化的方法,我們的網路給出類似的定性結果,不過速度快了三個數量級。我們還實驗了單張影像的超解析度,用perceptual loss取代per-pixel loss可以得到視覺上令人愉悅的結果。
:::
## 1 Introduction
:::info
Many classic problems can be framed as image transformation tasks, where a system receives some input image and transforms it into an output image. Examples from image processing include denoising, super-resolution, and colorization, where the input is a degraded image (noisy, low-resolution, or grayscale) and the output is a high-quality color image. Examples from computer vision include semantic segmentation and depth estimation, where the input is a color image and the output image encodes semantic or geometric information about the scene.
:::
:::success
許多經典的問題能夠被視為影像轉換任務,也就是系統接收某些輸入影像,然後將之轉換為輸出影像。影像處理的例子包括去噪、超解析度以及著色,其中輸入為劣化的影像(有雜訊、低解析度或是灰階),輸出則是高品質的彩色影像。電腦視覺的例子則包括語意分割與深度估計,其中輸入是彩色影像,輸出影像則是對場景的語意或幾何資訊做編碼。
:::
:::info
One approach for solving image transformation tasks is to train a feedforward convolutional neural network in a supervised manner, using a per-pixel loss function to measure the difference between output and ground-truth images. This approach has been used for example by Dong et al for super-resolution [1], by Cheng et al for colorization [2], by Long et al for segmentation [3], and by Eigen et al for depth and surface normal prediction [4,5]. Such approaches are efficient at test-time, requiring only a forward pass through the trained network.
:::
:::success
解決影像轉換任務的一種方法就是以監督式的方式來訓練一個前饋卷積神經網路,使用per-pixel loss function來量測輸出與真實影像之間的差異。這個方法已經有不少應用,像是Dong等人的超解析度[1]、Cheng等人的著色[2]、Long等人的分割[3]、Eigen等人的深度與表面法線預測[4,5]。這類方法在測試時非常高效,只需要對訓練好的網路做一次前向傳遞即可。
:::
:::info
However, the per-pixel losses used by these methods do not capture perceptual differences between output and ground-truth images. For example, consider two identical images offset from each other by one pixel; despite their perceptual similarity they would be very different as measured by per-pixel losses.
:::
:::success
然而,這些方法所使用的per-pixel losses並沒有捕捉到輸出與真實影像之間的感知差異。舉例來說,考慮兩張相同、但彼此偏移一個像素的影像;儘管它們在感知上是相似的,但以per-pixel losses來量測卻會有很大的差異。
:::
:::info
In parallel, recent work has shown that high-quality images can be generated using perceptual loss functions based not on differences between pixels but instead on differences between high-level image feature representations extracted from pretrained convolutional neural networks. Images are generated by minimizing a loss function. This strategy has been applied to feature inversion [6] by Mahendran et al, to feature visualization by Simonyan et al [7] and Yosinski et al [8], and to texture synthesis and style transfer by Gatys et al [9,10]. These approaches produce high-quality images, but are slow since inference requires solving an optimization problem.
:::
:::success
同時,近來的研究已經說明了,高品質的影像是可以使用perceptual loss functions生成,這並不是基於像素之間的差異,而是基於從預訓練的卷積神經網路中提取的高階影像特徵表示之間的差異。影像是透過最小化loss function所生成的。這個策略已經被應用在Mahendran等人的feature inversion、Simonyan等人與Yosinski等人的feature visualization、以及Gatys等人的紋理合成與風格轉換。這些方法產生高品質的影像,不過很慢,因為推理需要解決最佳化問題。
:::
:::info
In this paper we combine the benefits of these two approaches. We train feedforward transformation networks for image transformation tasks, but rather than using per-pixel loss functions depending only on low-level pixel information, we train our networks using perceptual loss functions that depend on high-level features from a pretrained loss network. During training, perceptual losses measure image similarities more robustly than per-pixel losses, and at test-time the transformation networks run in real-time.
:::
:::success
這篇論文中,我們結合了這兩種方法的優點。我們針對影像轉換任務訓練前饋轉換網路,但我們並不是使用僅依賴低階像素資訊的per-pixel loss functions,而是使用依賴於預訓練loss network高階特徵的perceptual loss functions來訓練網路。訓練過程中,perceptual losses量測影像相似性時比per-pixel losses更具魯棒性,而且在測試時,轉換網路可以即時地執行。
:::
:::info
We experiment on two tasks: style transfer and single-image super-resolution. Both are inherently ill-posed; for style transfer there is no single correct output, and for super-resolution there are many high-resolution images that could have generated the same low-resolution input. Success in either task requires semantic reasoning about the input image. For style transfer the output must be semantically similar to the input despite drastic changes in color and texture; for super-resolution fine details must be inferred from visually ambiguous low-resolution inputs. In principle a high-capacity neural network trained for either task could implicitly learn to reason about the relevant semantics; however in practice we need not learn from scratch: the use of perceptual loss functions allows the transfer of semantic knowledge from the loss network to the transformation network.
:::
:::success
我們在兩個任務上實驗,風格轉換跟單張影像的超解析度。這兩個任務本質上都是[非良置](https://terms.naer.edu.tw/detail/cf1bfee3086a03112bc78a669072b9dd/)的;對風格轉換來說,並沒有單一個正確的輸出,對超解析問題來說則是有很多高解析度影像可以產生相同的低解析度輸入。任一個任務的成功都需要對輸入影像做語意的推理。對風格轉換而言,輸出影像必需在語意上跟輸入相似,儘管其顏色與紋理有著巨大的變化;對超解析度來說,細小的細節必須要從視覺上模糊的低解析度輸入中推論出來。原則上,針對任一任務訓練的高容量神經網路都可以隱式地學習推理相關的語意;然而,實務上我們並不需要從頭開始學習:perceptual loss functions的使用允許將語意知識從loss network轉移到transformation network。
:::
:::info
For style transfer our feed-forward networks are trained to solve the optimization problem from [10]; our results are similar to [10] both qualitatively and as measured by objective function value, but are three orders of magnitude faster to generate. For super-resolution we show that replacing the per-pixel loss with a perceptual loss gives visually pleasing results for ×4 and ×8 super-resolution.
:::
:::success
風格轉移的部份,我們的前饋網路訓練來解決[10]中的最佳化問題;我們的結果在定性和目標函數值測量方面都與[10]相似,但生成速度快了三個數量級。超解析度的部份,我們說明著,用perceptual loss替換per-pixel loss可以為×4和×8超解析度提供視覺上令人愉悅的結果。
:::
## 2 Related Work
:::info
**Feed-forward image transformation.** In recent years, a wide variety of feedforward image transformation tasks have been solved by training deep convolutional neural networks with per-pixel loss functions.
:::
:::success
**Feed-forward image transformation.** 近年來,各種前饋影像轉換任務已經透過以per-pixel loss functions訓練深度卷積神經網路來解決。
:::
:::info
Semantic segmentation methods [3,5,12,13,14,15] produce dense scene labels by running a network in a fully-convolutional manner over an input image, training with a per-pixel classification loss. [15] moves beyond per-pixel losses by framing CRF inference as a recurrent layer trained jointly with the rest of the network. The architecture of our transformation networks are inspired by [3] and [14], which use in-network downsampling to reduce the spatial extent of feature maps followed by in-network upsampling to produce the final output image.
:::
:::success
語意分割方法[3,5,12,13,14,15]透過以fully-convolutional的方式在輸入影像上執行網路來生成密集的場景標記,並以per-pixel classification loss來訓練。[15]透過將CRF推論設計為跟網路其餘部份聯合訓練的recurrent layer,超越了per-pixel losses。我們的轉換網路架構受到[3]與[14]的啟發,其使用in-network downsampling來降低feature maps的空間範圍,然後再以in-network upsampling來產生最終的輸出影像。
:::
:::info
Recent methods for depth [5,4,16] and surface normal estimation [5,17] are similar in that they transform a color input image into a geometrically meaningful output image using a feed-forward convolutional network trained with per-pixel regression [4,5] or classification [17] losses. Some methods move beyond per-pixel losses by penalizing image gradients [5] or using a CRF loss layer [16] to enforce local consistency in the output image. In [2] a feed-forward model is trained using a per-pixel loss to transform grayscale images to color.
:::
:::success
近來的深度估測[5,4,16]與表面法線估測[5,17]方法的相似之處在於,它們使用以per-pixel regression [4,5]或classification [17] losses訓練的前饋卷積網路,將一張彩色輸入影像轉換為一張幾何上有意義的輸出影像。有些方法會透過懲罰影像梯度[5]或是使用CRF loss layer [16]來超越per-pixel losses,以強制輸出影像的局部一致性。在[2]中,一個前饋模型是以per-pixel loss訓練來將灰階影像轉為彩色。
:::
:::info
**Perceptual optimization.** A number of recent papers have used optimization to generate images where the objective is perceptual, depending on highlevel features extracted from a convolutional network. Images can be generated to maximize class prediction scores [7,8] or individual features [8] in order to understand the functions encoded in trained networks. Similar optimization techniques can also be used to generate high-confidence fooling images [18,19].
:::
:::success
**Perceptual optimization.** 近來不少論文採用最佳化來生成影像,其目標函數是感知性的(perceptual),取決於從卷積網路中擷取的高階特徵。為了理解訓練好的網路中所編碼的函數,可以生成影像來最大化類別預測分數[7,8]或是個別特徵[8]。類似的最佳化技術也可以用來生成高置信度的欺騙影像(fooling images)[18,19]。
:::
:::info
Mahendran and Vedaldi [6] invert features from convolutional networks by minimizing a feature reconstruction loss in order to understand the image information retained by different network layers; similar methods had previously been used to invert local binary descriptors [20] and HOG features [21].
:::
:::success
為了瞭解不同網路層所保留的影像資訊,Mahendran與Vedaldi [6]透過最小化特徵重構損失來反轉卷積網路中的特徵;類似的方法先前已經被用來反轉局部二進位描述子(local binary descriptors)[20]與HOG特徵[21]。
:::
:::info
The work of Dosovitskiy and Brox [22] is particularly relevant to ours, as they train a feed-forward neural network to invert convolutional features, quickly approximating a solution to the optimization problem posed by [6]. However, their feed-forward network is trained with a per-pixel reconstruction loss, while our networks directly optimize the feature reconstruction loss of [6].
:::
:::success
Dosovitskiy與Brox的研究跟我們特別相關,因為他們訓練一個前饋網路來反轉卷積特徵,快速逼近由[6]所提出的最佳化問題的解。然而,他們的前饋網路是用per-pixel reconstruction loss訓練的,而我們的網路是直接地最佳化[6]所提出的feature reconstruction loss。
:::
:::info
**Style Transfer.** Gatys et al [10] perform artistic style transfer, combining the content of one image with the style of another by jointly minimizing the feature reconstruction loss of [6] and a style reconstruction loss also based on features extracted from a pretrained convolutional network; a similar method had previously been used for texture synthesis [9]. Their method produces high-quality results, but is computationally expensive since each step of the optimization problem requires a forward and backward pass through the pretrained network. To overcome this computational burden, we train a feed-forward network to quickly approximate solutions to their optimization problem.
:::
:::success
**Style Transfer.** Gatys等人[10]透過聯合最小化[6]的特徵重構損失,以及同樣基於從預訓練卷積網路擷取之特徵的風格重構損失,將一張影像的內容與另一張影像的風格結合,來執行藝術風格轉換;類似的方法先前曾被用於紋理合成[9]。他們的方法能產生高品質的結果,不過計算成本很高,因為最佳化問題的每一步都需要對預訓練網路做一次前向與反向傳播。為了克服這種計算負擔,我們訓練一個前饋網路來快速逼近其最佳化問題的解。
:::
:::info
**Image super-resolution.** Image super-resolution is a classic problem for which a wide variety of techniques have been developed. Yang et al [23] provide an exhaustive evaluation of the prevailing techniques prior to the widespread adoption of convolutional neural networks. They group super-resolution techniques into prediction-based methods (bilinear, bicubic, Lanczos, [24]), edge-based methods [25,26], statistical methods [27,28,29], patch-based methods [25,30,31,32,33,34,35,36] and sparse dictionary methods [37,38]. Recently [1] achieved excellent performance on single-image super-resolution using a three-layer convolutional neural network trained with a per-pixel Euclidean loss. Other recent state-of-the-art methods include [39,40,41].
:::
:::success
**Image super-resolution.** 影像超解析度是一個經典問題,也已經發展出多種技術。Yang等人[23]對卷積神經網路被廣泛採用之前的主流技術做了詳盡的評估。他們把超解析度技術分類為prediction-based methods (bilinear, bicubic, Lanczos, [24])、edge-based methods [25,26]、statistical methods [27,28,29]、patch-based methods [25,30,31,32,33,34,35,36]、與sparse dictionary methods [37,38]。最近[1]使用以per-pixel Euclidean loss訓練的三層卷積神經網路,在單張影像超解析度上有著極佳的效能。其它近期最先進的方法包括[39,40,41]。
:::
## 3 Method
:::info
As shown in Figure 2, our system consists of two components: an image transformation network $f_W$ and a loss network $\phi$ that is used to define several loss functions $\mathscr{l}_1,...,\mathscr{l}_k$. The image transformation network is a deep residual convolutional neural network parameterized by weights $W$; it transforms input images $x$ into output images $\hat{y}$ via the mapping $\hat{y}=f_W(x)$. Each loss function computes a scalar value $\mathscr{l}_{i}(\hat{y}, y_i)$ measuring the difference between the output image $\hat{y}$ and a target image $y_i$ . The image transformation network is trained using stochastic gradient descent to minimize a weighted combination of loss functions:
$$
W^* = \arg\min_W\mathbf{E}_{x,\left\{y_i \right\}}\big[\sum_{i=1} \lambda_i\mathscr{l}_i(f_W(x),y_i)\big]\tag{1}
$$
:::
:::success
如Figure 2所示,我們的系統包含兩個組件:一個影像轉換網路$f_W$,以及一個用來定義多個損失函數$\mathscr{l}_1,...,\mathscr{l}_k$的損失網路(loss network)$\phi$。影像轉換網路是一個由權重$W$參數化的深度殘差卷積神經網路;它透過映射$\hat{y}=f_W(x)$將輸入影像$x$轉為輸出影像$\hat{y}$。每一個損失函數計算一個[純量值](https://terms.naer.edu.tw/detail/7106e3351a2931816d32809feeb6d9f1/)$\mathscr{l}_{i}(\hat{y}, y_i)$來量測輸出影像$\hat{y}$跟目標影像$y_i$之間的差異。影像轉換網路是使用隨機梯度下降訓練,以最小化損失函數的加權組合:
$$
W^* = \arg\min_W\mathbf{E}_{x,\left\{y_i \right\}}\big[\sum_{i=1} \lambda_i\mathscr{l}_i(f_W(x),y_i)\big]\tag{1}
$$
:::
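:::warning
* 個人註解:下面是方程式(1)訓練流程的簡化示意(非論文原始碼),假設使用PyTorch;`transform_net`、`loss_fns`、`lambdas`、`targets`等名稱皆為示意用的假設。
```python
import torch

def train_step(transform_net, loss_fns, lambdas, targets, optimizer, x):
    """單一訓練步驟:以隨機梯度下降最小化損失函數的加權組合(方程式1)。"""
    y_hat = transform_net(x)                         # y_hat = f_W(x)
    total_loss = sum(lam * loss_fn(y_hat, y_i)       # sum_i lambda_i * l_i(f_W(x), y_i)
                     for lam, loss_fn, y_i in zip(lambdas, loss_fns, targets))
    optimizer.zero_grad()
    total_loss.backward()                            # 只更新 f_W 的權重 W;loss network 保持固定
    optimizer.step()
    return total_loss.item()
```
:::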
:::info
![image](https://hackmd.io/_uploads/rkX3ztgFC.png)
Fig. 2. System overview. We train an image transformation network to transform input images into output images. We use a loss network pretrained for image classification to define perceptual loss functions that measure perceptual differences in content and style between images. The loss network remains fixed during the training process.
:::
:::info
To address the shortcomings of per-pixel losses and allow our loss functions to better measure perceptual and semantic differences between images, we draw inspiration from recent work that generates images via optimization [6,7,8,9,10]. The key insight of these methods is that convolutional neural networks pretrained for image classification have already learned to encode the perceptual and semantic information we would like to measure in our loss functions. We therefore make use of a network $\phi$ which has been pretrained for image classification as a fixed loss network in order to define our loss functions. Our deep convolutional transformation network is then trained using loss functions that are also deep convolutional networks.
:::
:::success
為了解決per-pixel losses的缺點,並且讓我們的loss functions可以更好地量測影像之間的感知與語意差異,我們從近來透過最佳化生成影像的研究[6,7,8,9,10]中汲取靈感。這些方法的關鍵見解在於,針對影像分類預訓練的卷積神經網路已經學會對我們希望在損失函數中量測的感知與語意資訊做編碼。因此,為了定義我們的損失函數,我們使用一個已經針對影像分類預訓練好的網路$\phi$來做為固定的loss network。然後,我們的深度卷積轉換網路就使用本身也是深度卷積網路的損失函數來訓練。
:::
:::info
The loss network $\phi$ is used to define a feature reconstruction loss $\mathscr{l}^\phi_{feat}$ and a style reconstruction loss $\mathscr{l}^\phi_{style}$ that measure differences in content and style between images. For each input image $x$ we have a content target $y_c$ and a style target $y_s$. For style transfer, the content target $y_c$ is the input image $x$ and the output image $\hat{y}$ should combine the content of $x = y_c$ with the style of $y_s$; we train one network per style target. For single-image super-resolution, the input image $x$ is a low-resolution input, the content target $y_c$ is the ground-truth high-resolution image, and the style reconstruction loss is not used; we train one network per super-resolution factor.
:::
:::success
loss network $\phi$是用來定義特徵重構損失$\mathscr{l}^\phi_{feat}$與風格重構損失$\mathscr{l}^\phi_{style}$,用來量測影像之間內容與風格的差異。對於每個輸入影像$x$,我們有一個內容目標$y_c$跟一個風格目標$y_s$。對風格轉換來說,內容目標$y_c$是輸入影像$x$,而輸出影像$\hat{y}$應該結合$x = y_c$的內容與$y_s$的風格;我們針對每個風格目標訓練一個網路。對於單張影像超解析度的部份,輸入影像$x$是一個低解析度的輸入,內容目標$y_c$則是真實的高解析度影像,並且不使用風格重構損失;我們針對每個超解析度倍數各自訓練一個網路。
:::
### 3.1 Image Transformation Networks
:::info
Our image transformation networks roughly follow the architectural guidelines set forth by Radford et al [42]. We do not use any pooling layers, instead using strided and fractionally strided convolutions for in-network downsampling and upsampling. Our network body consists of five residual blocks [43] using the architecture of [44]. All non-residual convolutional layers are followed by spatial batch normalization [45] and ReLU nonlinearities with the exception of the output layer, which instead uses a scaled tanh to ensure that the output image has pixels in the range [0, 255]. Other than the first and last layers which use $9 \times 9$ kernels, all convolutional layers use $3 \times 3$ kernels. The exact architectures of all our networks can be found in the supplementary material.
:::
:::success
我們的影像轉換網路大致上依循著Radford等人[42]的架構指南。我們並沒有使用任何的pooling layers,而是使用strided與fractionally strided convolutions(類似反卷積的概念)來做in-network的downsampling與upsampling。網路主體由五個residual blocks [43]所組成,用的是[44]的架構。除了輸出層之外,所有non-residual convolutional layers的後面都接spatial batch normalization跟ReLU nonlinearities,輸出層則用scaled tanh來確保輸出影像的像素值落在[0, 255]的範圍內。除了第一層跟最後一層使用$9 \times 9$ kernels,其它卷積層都是使用$3 \times 3$ kernels。所有網路的確切架構都可以在補充資料裡面找到。
:::
:::info
**Inputs and Outputs.** For style transfer the input and output are both color images of shape $3 \times 256 \times 256$. For super-resolution with an upsampling factor of $f$, the output is a high-resolution image patch of shape $3 \times 288 \times 288$ and the input is a low-resolution patch of shape $3 \times 288/f \times 288/f$. Since the image transformation networks are fully-convolutional, at test-time they can be applied to images of any resolution.
:::
:::success
**Inputs and Outputs.** 對於style transfer,輸入跟輸出都是形狀為$3 \times 256 \times 256$的彩色影像。對於上採樣因數(factor)為$f$的超解析度,其輸出是形狀為$3 \times 288 \times 288$的高解析度影像區塊,輸入是形狀為$3 \times 288/f \times 288/f$的低解析度區塊。因為影像轉換網路是fully-convolutional的,所以在測試時可以應用於任何解析度的影像。
:::
:::info
**Downsampling and Upsampling.** For super-resolution with an upsampling factor of $f$, we use several residual blocks followed by $\log_2 f$ convolutional layers with stride $1/2$. This is different from [1] who use bicubic interpolation to upsample the low-resolution input before passing it to the network. Rather than relying on a fixed upsampling function, fractionally-strided convolution allows the upsampling function to be learned jointly with the rest of the network.
:::
:::success
**Downsampling and Upsampling.** 對於上採樣因數(factor)為$f$的超解析度,我們使用多個殘差塊(residual blocks),後面接著$\log_2 f$個步幅為$1/2$的卷積層。這跟[1]不一樣,[1]是在把輸入丟到網路之前,先使用[雙三次內挿值](https://terms.naer.edu.tw/detail/620bf21b9342124acb5694f440e4a966/)來對低解析度的輸入做upsampling。與依賴固定的上採樣函數不同,fractionally-strided convolution允許上採樣函數可以跟網路的其它部份聯合學習。
:::
:::info
For style transfer our networks use two stride-2 convolutions to downsample the input followed by several residual blocks and then two convolutional layers with stride $1/2$ to upsample. Although the input and output have the same size, there are several benefits to networks that downsample and then upsample.
:::
:::success
在風格轉換的部份,我們的網路使用兩個stride-2 convolutions來對輸入做降採樣,後面接著幾個殘差塊,然後再用兩個步幅為$1/2$的卷積層來做上採樣。雖然輸入跟輸出有相同的大小,但是先降採樣再上採樣的網路會有幾個好處。
:::
:::info
The first is computational. With a naive implementation, a $3\times 3$ convolution with $C$ filters on an input of size $C \times H \times W$ requires $9HWC^2$ multiply-adds, which is the same cost as a $3 \times 3$ convolution with $DC$ filters on an input of shape $DC \times H/D \times W/D$. After downsampling, we can therefore use a larger network for the same computational cost.
:::
:::success
首先是計算成本。以簡單的實現來說,一個有著$C$個filters的$3\times 3$卷積作用在大小為$C \times H \times W$的輸入上需要$9HWC^2$次乘加運算,這跟一個有著$DC$個filters的$3 \times 3$卷積作用在形狀為$DC \times H/D \times W/D$的輸入上的成本是一樣的。因此,在降採樣之後,我們可以在相同計算成本的情況下使用更大型的網路。
:::
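:::warning
* 個人註解:快速驗證上面「乘加次數相同」的說法:
$$
9 \cdot \frac{H}{D} \cdot \frac{W}{D} \cdot (DC)^2 = 9 \cdot \frac{HW}{D^2} \cdot D^2C^2 = 9HWC^2
$$
:::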
:::info
The second benefit has to do with effective receptive field sizes. High-quality style transfer requires changing large parts of the image in a coherent way; therefore it is advantageous for each pixel in the output to have a large effective receptive field in the input. Without downsampling, each additional $3 \times 3$ convolutional layer increases the effective receptive field size by $2$. After downsampling by a factor of $D$, each $3 \times 3$ convolution instead increases effective receptive field size by $2D$, giving larger effective receptive fields with the same number of layers.
:::
:::success
第二個好處跟有效接受域(effective receptive field)的大小有關。高品質的風格轉換需要以連貫的方式改變影像的大部份區域;因此,輸出中的每個像素在輸入中具有較大的有效接受域是有好處的。在不做降採樣的情況下,每增加一個$3 \times 3$的卷積層,有效接受域的大小就會增加$2$。在做了$D$倍的降採樣之後,每一個$3 \times 3$的卷積會讓有效接受域增加$2D$,也就是在相同網路層數量的情況下,我們可以有更大的有效接受域。
:::
:::info
**Residual Connections.** He et al [43] use residual connections to train very deep networks for image classification. They argue that residual connections make it easy for the network to learn the identity function; this is an appealing property for image transformation networks, since in most cases the output image should share structure with the input image. The body of our network thus consists of several residual blocks, each of which contains two $3 \times 3$ convolutional layers. We use the residual block design of [44], shown in the supplementary material.
:::
:::success
**Residual Connections.** He等人[43]使用殘差連接來訓練用於影像分類的非常深的網路。他們認為,殘差連接讓網路更容易學習identity function;這對影像轉換網路來說是一個很有吸引力的特性,因為多數情況下,輸出影像應該跟輸入影像共享結構。因此,我們網路的主體部份由幾個殘差塊組成,每個殘差塊都包含兩個$3 \times 3$的卷積層。我們使用[44]的殘差塊設計,如補充資料所示。
:::
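:::warning
* 個人註解:下面是依照3.1節文字描述整理的影像轉換網路簡化示意(PyTorch),並非論文附錄的確切架構;通道數32/64/128與殘差塊細節皆為示意用的假設。
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # 兩個 3x3 卷積的殘差塊;[44] 的確切設計請見論文補充資料,此處為簡化版本
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)

def conv_bn_relu(in_ch, out_ch, k, stride=1, upsample=False):
    # 非殘差卷積層後接 spatial batch normalization 與 ReLU
    conv = (nn.ConvTranspose2d(in_ch, out_ch, k, stride, padding=k // 2, output_padding=1)
            if upsample else nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2))
    return nn.Sequential(conv, nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class TransformNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            conv_bn_relu(3, 32, 9),                              # 第一層 9x9 卷積
            conv_bn_relu(32, 64, 3, stride=2),                   # 兩個 stride-2 卷積做降採樣
            conv_bn_relu(64, 128, 3, stride=2),
            *[ResidualBlock(128) for _ in range(5)],             # 五個殘差塊
            conv_bn_relu(128, 64, 3, stride=2, upsample=True),   # 兩個步幅 1/2 的卷積做上採樣
            conv_bn_relu(64, 32, 3, stride=2, upsample=True),
            nn.Conv2d(32, 3, 9, padding=4),                      # 最後一層 9x9 卷積
            nn.Tanh(),                                           # scaled tanh
        )

    def forward(self, x):
        return (self.model(x) + 1.0) * 127.5                     # 將輸出映射到 [0, 255]
```
:::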
### 3.2 Perceptual Loss Functions
:::info
We define two perceptual loss functions that measure high-level perceptual and semantic differences between images. They make use of a loss network $\phi$ pretrained for image classification, meaning that these perceptual loss functions are themselves deep convolutional neural networks. In all our experiments $\phi$ is the 16-layer VGG network [46] pretrained on the ImageNet dataset [47].
:::
:::success
我們定義兩個perceptual loss functions來量測影像之間的高階感知與語意差異。它們使用一個針對影像分類預訓練的網路$\phi$做為loss network,也就是說,這些perceptual loss functions本身就是深度卷積神經網路。在我們的所有實驗中,$\phi$是在ImageNet資料集[47]上預訓練的16層VGG網路[46]。
:::
:::info
**Feature Reconstruction Loss.** Rather than encouraging the pixels of the output image $\hat{y}=f_W(x)$ to exactly match the pixels of the target image $y$, we instead encourage them to have similar feature representations as computed by the loss network $\phi$. Let $\phi_j(x)$ be the activations of the $j$th layer of the network $\phi$ when processing the image $x$; if $j$ is a convolutional layer then $\phi_j(x)$ will be a feature map of shape $C_j \times H_j \times W_j$ . The feature reconstruction loss is the (squared, normalized) Euclidean distance between feature representations:
$$
\mathscr{l}^{\phi,j}_{feat}(\hat{y}, y)=\dfrac{1}{C_jH_jW_j}\Vert \phi_j(\hat{y}) - \phi_j(y) \Vert_2^2 \tag{2}
$$
:::
:::success
與其要求輸出影像$\hat{y}=f_W(x)$的像素完全匹配目標影像$y$的像素,我們不如鼓勵它們在loss network $\phi$的計算下有類似的特徵表示。假設$\phi_j(x)$是網路$\phi$在處理影像$x$時第$j$層的啟動值(activations);如果$j$是卷積層,那$\phi_j(x)$就會是形狀為$C_j \times H_j \times W_j$的特徵圖(feature map)。特徵重構損失就是兩個特徵表示之間的(squared, normalized) Euclidean distance:
$$
\mathscr{l}^{\phi,j}_{feat}(\hat{y}, y)=\dfrac{1}{C_jH_jW_j}\Vert \phi_j(\hat{y}) - \phi_j(y) \Vert_2^2 \tag{2}
$$
:::
:::warning
* $\dfrac{1}{C_jH_jW_j}$:做為正規化使用,其中$C_j \times H_j \times W_j$為該feature map的channel數、高、寬
* $\Vert \phi_j(\hat{y}) - \phi_j(y) \Vert_2^2$:計算兩個feature map之間的歐幾里德距離平方
:::
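:::warning
* 個人註解:方程式(2)的簡化實作示意(PyTorch + torchvision),非論文原始碼;`features[:9]`對應`relu2_2`是依torchvision的VGG-16層排列所做的假設,`pretrained`參數名稱也可能因版本而異。
```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# 取出 VGG-16 到 relu2_2 為止的層做為 phi_j,並固定其權重
vgg_slice = vgg16(pretrained=True).features[:9].eval()
for p in vgg_slice.parameters():
    p.requires_grad_(False)

def feature_reconstruction_loss(y_hat, y):
    """方程式(2):特徵圖差異的平方歐氏距離,以 C_j*H_j*W_j 做正規化。
    若輸入為 batch,回傳的是 batch 內各影像損失的總和。"""
    phi_y_hat = vgg_slice(y_hat)
    with torch.no_grad():
        phi_y = vgg_slice(y)                     # 目標特徵不需要梯度
    c, h, w = phi_y.shape[1:]
    return F.mse_loss(phi_y_hat, phi_y, reduction="sum") / (c * h * w)
```
:::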
:::info
As demonstrated in [6] and reproduced in Figure 3, finding an image $\hat{y}$ that minimizes the feature reconstruction loss for early layers tends to produce images that are visually indistinguishable from $y$. As we reconstruct from higher layers, image content and overall spatial structure are preserved but color, texture, and exact shape are not. Using a feature reconstruction loss for training our image transformation networks encourages the output image $\hat{y}$ to be perceptually similar to the target image $y$, but does not force them to match exactly.
:::
:::success
如[6]所說明,在Figure 3中重現,尋找一張影像$\hat{y}$來最小化前面幾個網路層的特徵重構損失的作法會導致產出跟$y$在視覺上無法區分的影像。當我們從較高的網路層去重構的時候,影像的內容跟整體的空間結構會保留,不過色彩、紋理跟精確的形狀則不會被保留。使用特徵重構損失來訓練我們的影像轉換網路會鼓勵輸出影像$\hat{y}$在感知上類似於目標影像$y$,不過不會強制它們完全地匹配就是。
:::
:::info
![image](https://hackmd.io/_uploads/HkDbbvEFR.png)
Fig. 3. Similar to [6], we use optimization to find an image $\hat{y}$ that minimizes the feature reconstruction loss $\mathscr{l}_{feat}^{\phi,j}(\hat{y}, y)$ for several layers $j$ from the pretrained VGG-16 loss network $\phi$. As we reconstruct from higher layers, image content and overall spatial structure are preserved, but color, texture, and exact shape are not.
:::
:::info
**Style Reconstruction Loss.** The feature reconstruction loss penalizes the output image $\hat{y}$ when it deviates in content from the target $y$. We also wish to penalize differences in style: colors, textures, common patterns, etc. To achieve this effect, Gatys et al [9,10] propose the following style reconstruction loss.
:::
:::success
**Style Reconstruction Loss.** 當輸出影像$\hat{y}$在內容上偏離目標影像$y$的時候,特徵重構損失會對其做出懲罰。我們還希望懲罰風格上的差異:色彩、紋理、常見的圖案等等。為了達到這種效果,Gatys等人[9,10]提出了下面的風格重構損失。
:::
:::info
As above, let $\phi_j(x)$ be the activations at the $j$th layer of the network $\phi$ for the input $x$, which is a feature map of shape $C_j \times H_j \times W_j$. Define the Gram matrix $G_j^{\phi}(x)$ to be the $C_j \times C_j$ matrix whose elements are given by
$$
G^\phi_j(x)_{c, c'}=\dfrac{1}{C_jH_jW_j}\sum_{h=1}^{H_j}\sum_{w=1}^{W_j}\phi_j(x)_{h,w,c}\phi_j(x)_{h,w,c'}\tag{3}
$$
:::
:::success
如上所述,假設$\phi_j(x)$是網路$\phi$在第$j$層對輸入$x$的啟動值,這是一個形狀為$C_j \times H_j \times W_j$的feature map。定義Gram matrix $G_j^{\phi}(x)$為$C_j \times C_j$的矩陣,其元素由下式給出:
$$
G^\phi_j(x)_{c, c'}=\dfrac{1}{C_jH_jW_j}\sum_{h=1}^{H_j}\sum_{w=1}^{W_j}\phi_j(x)_{h,w,c}\phi_j(x)_{h,w,c'}\tag{3}
$$
:::
:::warning
* $G_j^{\phi}(x)$:Gram matrix
* $G^\phi_j(x)_{c, c'}$:Gram matrix中位於(row $c$, column $c'$)的元素
:::
:::info
If we interpret $\phi_j(x)$ as giving $C_j$-dimensional features for each point on a $H_j \times W_j$ grid, then $G^\phi_j(x)$ is proportional to the uncentered covariance of the $C_j$-dimensional features, treating each grid location as an independent sample. It thus captures information about which features tend to activate together. The Gram matrix can be computed efficiently by reshaping $\phi_j(x)$ into a matrix $\psi$ of shape $C_j \times H_jW_j$;then $G^\phi_j(x)=\psi\psi^T/C_jH_jW_j$.
:::
:::success
如果我們把$\phi_j(x)$解釋為,為$H_j \times W_j$網格上的每個點給出一個$C_j$維的特徵,那麼$G^\phi_j(x)$就跟$C_j$維特徵的uncentered covariance(未中心化的共變異數)成比例,也就是把每個網格位置都視為獨立樣本。因此,它捕捉到哪些特徵傾向於一起被啟動(activate)的資訊。Gram matrix可以透過將$\phi_j(x)$重塑為形狀為$C_j \times H_jW_j$的矩陣$\psi$來有效地計算;然後$G^\phi_j(x)=\psi\psi^T/C_jH_jW_j$。
:::
:::info
The style reconstruction loss is then the squared Frobenius norm of the difference between the Gram matrices of the output and target images:
$$
\mathscr{l}^{\phi,j}_{style}(\hat{y},y) = \Vert G^\phi_j(\hat{y}) - G^\phi_j(y) \Vert^2_F \tag{4}
$$
The style reconstruction loss is well-defined even when $\hat{y}$ and $y$ have different sizes, since their Gram matrices will both have the same shape.
:::
:::success
然後,風格重構損失是輸出與目標影像之間的Gram matrices的差異的[Frobenius範數](https://terms.naer.edu.tw/detail/e135e53f652b94dbbe345e1982fbe69f/)的平方:
$$
\mathscr{l}^{\phi,j}_{style}(\hat{y},y) = \Vert G^\phi_j(\hat{y}) - G^\phi_j(y) \Vert^2_F \tag{4}
$$
即使$\hat{y}$跟$y$有著不同的大小,風格重構損失仍然是[良適定義](https://terms.naer.edu.tw/detail/eda9042ddf91503af12545a9b467a4b9/)的,因為它們的Gram matrices都有著相同的形狀。
:::
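:::warning
* 個人註解:方程式(3)(4)的簡化實作示意(PyTorch),利用上述的reshape技巧計算Gram matrix;batch維度的處理方式為假設。
```python
import torch

def gram_matrix(phi_x):
    """方程式(3):phi_x 形狀為 (B, C, H, W),以 psi psi^T / (C*H*W) 計算 Gram matrix。"""
    b, c, h, w = phi_x.shape
    psi = phi_x.reshape(b, c, h * w)                    # 重塑為 C x (H*W)
    return psi.bmm(psi.transpose(1, 2)) / (c * h * w)   # 形狀為 (B, C, C)

def style_reconstruction_loss(phi_y_hat, phi_y):
    """方程式(4):兩個 Gram matrices 之差的 Frobenius 範數平方;
    若要對一組層 J 計算,把每一層的結果相加即可。"""
    diff = gram_matrix(phi_y_hat) - gram_matrix(phi_y)
    return (diff ** 2).sum(dim=(1, 2)).mean()           # 對 batch 取平均(假設)
```
:::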
:::info
As demonstrated in [10] and reproduced in Figure 5, generating an image $\hat{y}$ that minimizes the style reconstruction loss preserves stylistic features from the target image, but does not preserve its spatial structure. Reconstructing from higher layers transfers larger-scale structure from the target image.
:::
:::success
如[10]所說明,於Figure 5中重現,生成一張最小化風格重構損失的影像$\hat{y}$,從目標影像中保留了風格特徵,但是並沒有保留它的空間結構。可以從較高的網路層將目標影像中的大尺度結構轉移過來。
:::
:::info
![image](https://hackmd.io/_uploads/SyT3MvNYR.png)
Fig. 5. Our style transfer networks and [10] minimize the same objective. We compare their objective values on 50 images; dashed lines and error bars show standard deviations. Our networks are trained on 256 × 256 images but generalize to larger images.
:::
:::info
To perform style reconstruction from a set of layers $J$ rather than a single layer $j$, we define $\mathscr{l}^{\phi,J}_{style}(\hat{y},y)$ to be the sum of losses for each layer $j \in J$.
:::
:::success
為了從一組網路層$J$而不是單一網路層$j$做風格重構,我們定義$\mathscr{l}^{\phi,J}_{style}(\hat{y},y)$為每一層$j \in J$的損失總和。
:::
### 3.3 Simple Loss Functions
:::info
In addition to the perceptual losses defined above, we also define two simple loss functions that depend only on low-level pixel information.
:::
:::success
除了上述所定義的perceptual losses之外,我們還定義兩個simple loss functions,這單純的依賴低階的像素信息。
:::
:::info
**Pixel Loss.** The pixel loss is the (normalized) Euclidean distance between the output image $\hat{y}$ and the target $y$. If both have shape $C \times H \times W$, then the pixel loss is defined as $\mathscr{l}_{pixel}(\hat{y},y)=\Vert \hat{y}-y \Vert_2^2/CHW$. This can only be used when we have a ground-truth target $y$ that the network is expected to match.
:::
:::success
**Pixel Loss.** 像素損失是輸出影像$\hat{y}$與目標影像$y$之間(正規化後)的歐幾里德距離。如果兩者的形狀皆為$C \times H \times W$,那像素損失就定義為$\mathscr{l}_{pixel}(\hat{y},y)=\Vert \hat{y}-y \Vert_2^2/CHW$。這只有在我們擁有網路被預期要匹配的真實目標$y$時才能使用。
:::
:::info
**Total Variation Regularization.** To encourage spatial smoothness in the output image $\hat{y}$, we follow prior work on feature inversion [6,20] and super-resolution [48,49] and make use of total variation regularizer $\mathscr{l}_{TV}(\hat{y})$.
:::
:::success
**Total Variation Regularization.** 為了鼓勵輸出影像$\hat{y}$的空間平滑性,我們依循先前在feature inversion [6,20]與super-resolution [48,49]的研究,使用total variation regularizer $\mathscr{l}_{TV}(\hat{y})$。
:::
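:::warning
* 個人註解:兩個simple loss functions的簡化示意(PyTorch);論文沒有寫出$\mathscr{l}_{TV}$的確切公式,下面採用常見的相鄰像素差平方和形式,屬於假設。
```python
import torch

def pixel_loss(y_hat, y):
    """l_pixel:||y_hat - y||^2 / (C*H*W),即正規化後的平方歐氏距離。"""
    c, h, w = y.shape[-3:]
    return ((y_hat - y) ** 2).sum() / (c * h * w)

def total_variation(y_hat):
    """l_TV:鼓勵空間上的平滑(此處的公式為常見寫法,非論文明載)。"""
    dh = y_hat[..., 1:, :] - y_hat[..., :-1, :]    # 垂直方向相鄰像素差
    dw = y_hat[..., :, 1:] - y_hat[..., :, :-1]    # 水平方向相鄰像素差
    return (dh ** 2).sum() + (dw ** 2).sum()
```
:::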
## 4 Experiments
:::info
We perform experiments on two image transformation tasks: style transfer and single-image super-resolution. Prior work on style transfer has used optimization to generate images; our feed-forward networks give similar qualitative results but are up to three orders of magnitude faster. Prior work on single-image super-resolution with convolutional neural networks has used a per-pixel loss; we show encouraging qualitative results by using a perceptual loss instead.
:::
:::success
我們在兩個影像轉換任務上做了實驗:風格轉換與單張影像超解析度。先前的風格轉換研究使用最佳化來生成影像;我們的前饋網路給出類似的定性結果,不過速度快了最多三個數量級。先前使用卷積神經網路做單張影像超解析度的研究則是使用per-pixel loss;我們透過改用perceptual loss,展示出令人鼓舞的定性結果。
:::
### 4.1 Style Transfer
:::info
The goal of style transfer is to generate an image $\hat{y}$ that combines the content of a target content image $y_c$ with the the style of a target style image $y_s$. We train one image transformation network per style target for several hand-picked style targets and compare our results with the baseline approach of Gatys et al [10].
:::
:::success
風格轉換的目標是生成一張影像$\hat{y}$,其結合了目標內容影像$y_c$的內容與目標風格影像$y_s$的風格。我們針對幾個精心挑選的風格目標,為每一個風格目標訓練一個影像轉換網路,然後跟Gatys等人[10]的基線方法做比較。
:::
:::info
**Baseline.** As a baseline, we reimplement the method of Gatys et al [10]. Given style and content targets $y_s$ and $y_c$ and layers $j$ and $J$ at which to perform feature and style reconstruction, an image $\hat{y}$ is generated by solving the problem
$$
\hat{y}=\arg\min_y\lambda_c\mathscr{l}_{feat}^{\phi,j}(y,y_c) + \lambda_s\mathscr{l}_{style}^{\phi,J}(y,y_s)+\lambda_{TV}\mathscr{l}_{TV}(y)\tag{5}
$$
where $\lambda_c$, $\lambda_s$, and $\lambda_{TV}$ are scalars, $y$ is initialized with white noise, and optimization is performed using **L-BFGS**. We find that unconstrained optimization of Equation 5 typically results in images whose pixels fall outside the range [0, 255]. For a more fair comparison with our method whose output is constrained to this range, for the baseline we minimize Equation 5 using projected **L-BFGS** by clipping the image y to the range [0, 255] at each iteration. In most cases optimization converges to satisfactory results within 500 iterations. This method is slow because each **L-BFGS** iteration requires a forward and backward pass through the VGG-16 loss network $\phi$.
:::
:::success
做為基線,我們重現了Gatys等人[10]的方法。給定風格與內容目標$y_s$與$y_c$,以及用來執行特徵重構與風格重構的網路層$j$與$J$,我們透過解下面的問題來生成影像$\hat{y}$
$$
\hat{y}=\arg\min_y\lambda_c\mathscr{l}_{feat}^{\phi,j}(y,y_c) + \lambda_s\mathscr{l}_{style}^{\phi,J}(y,y_s)+\lambda_{TV}\mathscr{l}_{TV}(y)\tag{5}
$$
其中$\lambda_c$、$\lambda_s$與$\lambda_{TV}$是純量(scalars),$y$是以[白噪音](https://terms.naer.edu.tw/detail/bef29628fbf42081fae803307534a07a/)初始化,並使用**L-BFGS**來做最佳化。我們發現,方程式5的無約束最佳化通常會導致影像的像素值落在[0, 255]範圍之外。為了跟輸出被限制在這個範圍內的我們的方法做更公平的比較,對於基線,我們使用projected **L-BFGS**,在每次迭代時將影像$y$裁剪到[0, 255]的範圍,來最小化方程式5。多數情況下,最佳化會在500次迭代內收斂到令人滿意的結果。這個方法很慢,因為每次**L-BFGS**迭代都需要對VGG-16 loss network $\phi$做一次前向與反向傳播。
:::
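:::warning
* 個人註解:基線方法(方程式5 + projected L-BFGS)的簡化示意(PyTorch),非論文原始碼;`feat_loss`、`style_loss`、`tv_loss`假設為第3節定義的損失函數,權重與初始化尺度皆為示意用的假設值。
```python
import torch

def stylize_baseline(y_c, y_s, feat_loss, style_loss, tv_loss,
                     lam_c=1.0, lam_s=5.0, lam_tv=1e-6, iters=500):
    """以 projected L-BFGS 近似求解方程式(5)。"""
    # 白噪音初始化(尺度為假設),並裁剪到合法像素範圍
    y = (torch.randn_like(y_c) * 50 + 128).clamp(0, 255).requires_grad_(True)
    opt = torch.optim.LBFGS([y], max_iter=1)

    for _ in range(iters):
        def closure():
            opt.zero_grad()
            loss = (lam_c * feat_loss(y, y_c)
                    + lam_s * style_loss(y, y_s)
                    + lam_tv * tv_loss(y))
            loss.backward()
            return loss
        opt.step(closure)
        with torch.no_grad():
            y.clamp_(0, 255)          # 每次迭代都把 y 裁剪回 [0, 255](projection)
    return y.detach()
```
:::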
:::info
**Training Details.** Our style transfer networks are trained on the Microsoft COCO dataset [50]. We resize each of the 80k training images to $256 \times 256$ and train our networks with a batch size of 4 for 40,000 iterations, giving roughly two epochs over the training data. We use Adam [51] with a learning rate of $1 \times 10^{−3}$ . The output images are regularized with total variation regularization with a strength of between $1 \times 10^{−6}$ and $1 \times 10^{−4}$ , chosen via cross-validation per style target. We do not use weight decay or dropout, as the model does not overfit within two epochs. For all style transfer experiments we compute feature reconstruction loss at layer `relu2_2` and style reconstruction loss at layers `relu1_2`, `relu2_2`, `relu3_3`, and `relu4_3` of the VGG-16 loss network $\phi$. Our implementation uses Torch [52] and cuDNN [53]; training takes roughly 4 hours on a single GTX Titan X GPU.
:::
:::success
**Training Details.** 我們的風格轉換網路是在Microsoft COCO dataset [50]上訓練的。我們把80k張訓練影像調整為$256 \times 256$的大小,然後以batch size為4訓練40,000次迭代,大約相當於在訓練資料上跑2個epochs。我們使用Adam [51],學習率為$1 \times 10^{−3}$。輸出影像以total variation regularization做正規化,強度介於$1 \times 10^{−6}$與$1 \times 10^{−4}$之間,並針對每個風格目標透過交叉驗證來選擇。我們並沒有使用權重衰減或dropout,因為模型在兩個epochs內並不會過擬合。所有的風格轉換實驗中,我們在VGG-16 loss network $\phi$的網路層`relu2_2`計算特徵重構損失,在網路層`relu1_2`、`relu2_2`、`relu3_3`、`relu4_3`計算風格重構損失。我們使用Torch [52]與cuDNN [53]來實作;在單張GTX Titan X GPU上訓練大約需要四個小時。
:::
:::info
**Qualitative Results.** In Figure 6 we show qualitative examples comparing our results with those of the baseline method for a variety of style and content images. In all cases the hyperparameters $\lambda_c$, $\lambda_s$, and $\lambda_{TV}$ are exactly the same between the two methods; all content images are taken from the MS-COCO 2014 validation set. Overall our results are qualitatively similar to the baseline.
:::
:::success
**Qualitative Results.** 在Figure 6中,我們展示了在各種風格與內容影像上,我們的結果與基線方法比較的定性範例。在所有的情況中,兩個方法的超參數$\lambda_c$、$\lambda_s$與$\lambda_{TV}$完全相同;所有的內容影像皆取自MS-COCO 2014驗證集。總的來說,我們的結果在定性上跟基線方法是類似的。
:::
:::info
![image](https://hackmd.io/_uploads/SkPlGhHtR.png)
Fig. 6. Example results of style transfer using our image transformation networks. Our results are qualitatively similar to Gatys et al [10] but are much faster to generate (see Table 1). All generated images are $256 \times 256$ pixels.
:::
:::info
Although our models are trained with $256 \times 256$ images, they can be applied in a fully-convolutional manner to images of any size at test-time. In Figure 7 we show examples of style transfer using our models on $512 \times 512$ images.
:::
:::success
雖然我們的模型是以$256 \times 256$的影像訓練的,不過在測試時,它們可以以fully-convolutional的方式應用於任意大小的影像。在Figure 7中,我們給出使用我們的模型在$512 \times 512$影像上做風格轉換的範例。
:::
:::info
![image](https://hackmd.io/_uploads/S1-xbV8Y0.png)
Fig. 7. Example results for style transfer on $512 \times 512$ images. The model is applied in a fully-convolutional manner to high-resolution images at test-time. The style images are the same as Figure 6.
:::
:::info
In these results it is clear that the trained style transfer network is aware of the semantic content of images. For example in the beach image in Figure 7 the people are clearly recognizable in the transformed image but the background is warped beyond recognition; similarly in the cat image, the cat’s face is clear in the transformed image, but its body is not. One explanation is that the VGG-16 loss network has features which are selective for people and animals since these objects are present in the classification dataset on which it was trained. Our style transfer networks are trained to preserve VGG-16 features, and in doing so they learn to preserve people and animals more than background objects.
:::
:::success
從這些結果可以明顯看出,訓練好的風格轉換網路能夠察覺影像的語意內容。舉例來說,Figure 7中的海灘影像,轉換後影像中的人物清晰可辨,但背景已經扭曲到無法辨認;貓咪的影像也類似,轉換後影像中貓咪的臉很清楚,但身體則不然。一個解釋是,VGG-16 loss network具有對人與動物有選擇性的特徵,因為這些物件存在於它所訓練的分類資料集中。我們的風格轉換網路被訓練來保留VGG-16的特徵,在這麼做的過程中,它們學會了比起背景物件,更優先保留人與動物。
:::
:::info
**Quantitative Results.** The baseline and our method both minimize Equation 5. The baseline performs explicit optimization over the output image, while our method is trained to find a solution for any content image $y_c$ in a single forward pass. We may therefore quantitatively compare the two methods by measuring the degree to which they successfully minimize Equation 5.
:::
:::success
**Quantitative Results.** 基線與我們的方法都是最小化方程式5。基線方法對輸出影像做顯式的最佳化,而我們的方法則是被訓練成能在單次前向傳播中找出任意內容影像$y_c$的解。因此,我們可以透過量測兩者成功最小化方程式5的程度,來定量比較這兩種方法。
:::
:::info
We run our method and the baseline on 50 images from the MS-COCO validation set, using The Muse by Pablo Picasso as a style image. For the baseline we record the value of the objective function at each iteration of optimization, and for our method we record the value of Equation 5 for each image; we also compute the value of Equation 5 when $y$ is equal to the content image $y_c$. Results are shown in Figure 5. We see that the content image $y_c$ achieves a very high loss, and that our method achieves a loss comparable to between 50 and 100 iterations of explicit optimization.
:::
:::success
我們從MS-COCO驗證集中挑了50張影像來執行我們的方法與基線方法,並使用Pablo Picasso的The Muse做為風格影像。基線的部份,我們記錄每次最佳化迭代的目標函數值;我們的方法則是記錄每一張影像的方程式5的值;我們還計算當$y$等於內容影像$y_c$時方程式5的值。結果如Figure 5所示。我們看到,內容影像$y_c$有著非常高的loss,而我們的方法達到的loss跟顯式最佳化50到100次迭代的結果相當。
:::
:::info
Although our networks are trained to minimize Equation 5 for $256 \times 256$ images, they are also successful at minimizing the objective when applied to larger images. We repeat the same quantitative evaluation for 50 images at $512 \times 512$ and $1024 \times 1024$; results are shown in Figure 5. We see that even at higher resolutions our model achieves a loss comparable to 50 to 100 iterations of the baseline method.
:::
:::success
雖然我們的網路是訓練來對$256 \times 256$的影像最小化方程式5,但應用到更大的影像時,它們仍然能成功地最小化這個目標。我們對50張$512 \times 512$與$1024 \times 1024$的影像重複同樣的定量評估;結果如Figure 5所示。我們看到,即使在較高的解析度下,我們的模型也能達到跟基線方法50到100次迭代相當的loss。
:::
:::info
**Speed.** In Table 1 we compare the runtime of our method and the baseline for several image sizes; for the baseline we report times for varying numbers of optimization iterations. Across all image sizes, we see that the runtime of our method is approximately twice the speed of a single iteration of the baseline method. Compared to 500 iterations of the baseline method, our method is three orders of magnitude faster. Our method processes images of size $512 \times 512$ at 20 FPS, making it feasible to run style transfer in real-time or on video.
:::
:::success
**Speed.** 在Table 1中,我們比較我們的方法與基線方法在多種影像大小下的執行時間;基線方法的部份,我們記錄不同最佳化迭代次數所需的時間。在所有的影像大小中我們看到,我們的方法的速度大約是基線方法單次迭代的兩倍。跟基線方法的500次迭代相比,我們的方法快了三個數量級。我們的方法能以20 FPS處理大小為$512 \times 512$的影像,使得即時執行風格轉換或應用在視訊上變得可行。
:::
:::info
![image](https://hackmd.io/_uploads/r1ctTWPF0.png)
Table 1. Speed (in seconds) for our style transfer network vs the optimization-based baseline for varying numbers of iterations and image resolutions. Our method gives similar qualitative results (see Figure 6) but is faster than a single optimization step of the baseline method. Both methods are benchmarked on a GTX Titan X GPU.
:::
### 4.2 Single-Image Super-Resolution
:::info
In single-image super-resolution, the task is to generate a high-resolution output image from a low-resolution input. This is an inherently ill-posed problem, since for each low-resolution image there exist multiple high-resolution images that could have generated it. The ambiguity becomes more extreme as the super-resolution factor grows; for large factors ($\times 4$, $\times 8$), fine details of the high-resolution image may have little or no evidence in its low-resolution version.
:::
:::success
在單張影像超解析度任務中,目標是從一張低解析度輸入生成高解析度的輸出影像。這本質上是一個非良置的問題,因為對於每一張低解析度影像,都存在多張可能生成它的高解析度影像。隨著超解析度倍數的增長,其模糊性也變得更加極端;對於較大的倍數($\times 4$, $\times 8$),高解析度影像的細節在其低解析度版本中可能只有極少甚至完全沒有線索。
:::
:::info
To overcome this problem, we train super-resolution networks not with the per-pixel loss typically used [1] but instead with a feature reconstruction loss (see Section 3) to allow transfer of semantic knowledge from the pretrained loss network to the super-resolution network. We focus on $\times 4$ and $\times 8$ super-resolution since larger factors require more semantic reasoning about the input.
:::
:::success
為了克服這個問題,我們沒有採用常見的per-pixel loss,而是使用feature reconstruction loss(見Session 3)來訓練超解析度網路,這樣能將預訓練的loss network中的語義知識轉移到超解析網路中。我們關注在$\times 4$與$\times 8$,因為更大的倍數需要更多關於輸入的語意推理。
:::
:::info
The traditional metrics used to evaluate super-resolution are PSNR and SSIM [54], both of which have been found to correlate poorly with human assessment of visual quality [55,56,57,58,59]. PSNR and SSIM rely only on low-level differences between pixels and operate under the assumption of additive Gaussian noise, which may be invalid for super-resolution. In addition, PSNR is equivalent to the per-pixel loss $\mathscr{l}_{pixel}$, so as measured by PSNR a model trained to minimize per-pixel loss should always outperform a model trained to minimize feature reconstruction loss. We therefore emphasize that the goal of these experiments is not to achieve state-of-the-art PSNR or SSIM results, but instead to showcase the qualitative difference between models trained with per-pixel and feature reconstruction losses.
:::
:::success
用來評估超解析度的傳統指標是PSNR與SSIM [54],這兩個指標都已被發現跟人類對視覺品質的評估相關性很低[55,56,57,58,59]。PSNR與SSIM只依賴像素之間的低階差異,並且是建立在加成性高斯雜訊的假設之下運作,這對超解析度來說可能是無效的。此外,PSNR等價於per-pixel loss $\mathscr{l}_{pixel}$,所以以PSNR來量測的話,一個被訓練來最小化per-pixel loss的模型應該總是優於一個被訓練來最小化feature reconstruction loss的模型。因此,我們要強調,這些實驗的目標並不是要達到最先進的PSNR或SSIM結果,而是要展示以per-pixel loss與feature reconstruction loss訓練的模型之間的定性差異。
:::
:::info
**Model Details.** We train models to perform $\times 4$ and $\times 8$ super-resolution by minimizing feature reconstruction loss at layer `relu2_2` from the VGG-16 loss network $\phi$. We train with $288 \times 288$ patches from 10k images from the MS-COCO training set, and prepare low-resolution inputs by blurring with a Gaussian kernel of width $\sigma = 1.0$ and downsampling with bicubic interpolation. We train with a batch size of 4 for 200k iterations using Adam [51] with a learning rate of $1\times 10^{-3}$ without weight decay or dropout. As a post-processing step, we perform histogram matching between our network output and the low-resolution input.
:::
:::success
**Model Details.** 我們透過最小化VGG-16 loss network $\phi$的網路層`relu2_2`的特徵重構損失,來訓練模型執行$\times 4$與$\times 8$的超解析度。我們使用MS-COCO training set中10k張影像的$288 \times 288$區塊來訓練,並且以寬度$\sigma = 1.0$的Gaussian kernel做模糊處理,再做雙三次內挿的降採樣,用這樣的方式來準備低解析度的輸入。我們使用Adam [51],batch size為4,訓練200k次迭代,學習率為$1\times 10^{-3}$,不使用權重衰減或dropout。做為後處理步驟,我們在網路輸出與低解析度輸入之間執行直方圖匹配(histogram matching)。
:::
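:::warning
* 個人註解:低解析度輸入準備流程的簡化示意(Python + PIL),非論文原始碼;PIL的`GaussianBlur(radius=1.0)`在此被當作$\sigma = 1.0$使用,這是一個近似假設。
```python
from PIL import Image, ImageFilter

def make_lowres(hr_patch, factor):
    """從高解析度 patch 準備低解析度輸入:先做高斯模糊,再以雙三次內插降採樣。"""
    blurred = hr_patch.filter(ImageFilter.GaussianBlur(radius=1.0))
    w, h = blurred.size
    return blurred.resize((w // factor, h // factor), Image.BICUBIC)

# 用法示意(hr 為 288x288 的 PIL 影像,屬假設):lr = make_lowres(hr, 4)  # 得到 72x72 的輸入
```
:::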
:::info
**Baselines.** As a baseline model we use SRCNN [1] for its state-of-the-art performance. SRCNN is a three-layer convolutional network trained to minimize per-pixel loss on 33 × 33 patches from the ILSVRC 2013 detection dataset. SRCNN is not trained for $\times 8$ super-resolution, so we can only evaluate it on $\times 4$.
:::
:::success
**Baselines.** 做為基線模型,我們使用SRCNN [1],因為它有著最先進的效能。SRCNN是一個三層卷積網路,其訓練目標是最小化ILSVRC 2013 detection dataset中33 × 33 patches上的per-pixel loss。SRCNN並沒有針對$\times 8$超解析度訓練,所以我們只能在$\times 4$上評估它。
:::
:::info
SRCNN is trained for more than $10^9$ iterations, which is not computationally feasible for our models. To account for differences between SRCNN and our model in data, training, and architecture, we train image transformation networks for $\times 4$ and $\times 8$ super-resolution using $\mathscr{l}_{pixel}$; these networks use identical data, architecture, and training as the networks trained to minimize $\mathscr{l}_{feat}$.
:::
:::success
SRCNN的訓練超過$10^9$次迭代,這在計算上對我們的模型來說並不可行。為了考量SRCNN跟我們的模型在資料、訓練與架構上的差異,我們使用$\mathscr{l}_{pixel}$來訓練$\times 4$與$\times 8$超解析度的影像轉換網路;這些網路所使用的資料、架構與訓練方式,跟以最小化$\mathscr{l}_{feat}$為目標訓練的網路完全相同。
:::
:::info
**Evaluation**. We evaluate all models on the standard Set5 [60], Set14 [61], and BSD100 [41] datasets. We report PSNR and SSIM [54], computing both only on the $Y$ channel after converting to the YCbCr colorspace, following [1,39].
:::
:::success
**Evaluation**. 我們在標準的Set5 [60]、Set14 [61]與BSD100 [41]資料集上評估所有的模型。我們報告PSNR與SSIM [54],並依循[1,39],只在轉換為YCbCr色彩空間之後的$Y$通道上計算這兩個指標。
:::
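:::warning
* 個人註解:在$Y$通道上計算PSNR的簡化示意(NumPy);此處的RGB轉$Y$採用full-range BT.601權重(0.299, 0.587, 0.114),[1,39]實際使用的轉換可能略有不同,屬假設。
```python
import numpy as np

def psnr_y(img1, img2):
    """img1, img2:值域 [0, 255] 的 RGB 陣列,形狀 (H, W, 3)。回傳 Y 通道上的 PSNR (dB)。"""
    def to_y(img):
        img = img.astype(np.float64)
        return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
    mse = np.mean((to_y(img1) - to_y(img2)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```
:::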
:::info
**Results.** We show results for $\times 4$ super-resolution in Figure 8. Compared to the other methods, our model trained for feature reconstruction does a very good job at reconstructing sharp edges and fine details, such as the eyelashes in the first image and the individual elements of the hat in the second image. The feature reconstruction loss gives rise to a slight cross-hatch pattern visible under magnification, which harms its PSNR and SSIM compared to baseline methods.
:::
:::success
**Results.** 我們在Figure 8給出$\times 4$超解析度的結果。跟其它方法相比,我們以特徵重構訓練的模型在重構銳利邊緣與細節上表現非常好,例如第一張影像的睫毛跟第二張影像帽子上的各個元素。特徵重構損失會產生在放大下可見的輕微交叉網紋(cross-hatch pattern),這使得它的PSNR與SSIM相較於基線方法較差。
:::
:::info
![image](https://hackmd.io/_uploads/SJ9f4YwtR.png)
Fig. 8. Results for ×4 super-resolution on images from Set5 (top) and Set14 (bottom). We report PSNR / SSIM for each example and the mean for each dataset. More results are shown in the supplementary material.
:::
:::info
Results for $\times 8$ super-resolution are shown in Figure 9. Again we see that our $\mathscr{l}_{feat}$ model does a good job at edges and fine details compared to other models, such as the horse’s legs and hooves. The $\mathscr{l}_{feat}$ model does not sharpen edges indiscriminately; compared to the $\mathscr{l}_{pixel}$ model, the $\mathscr{l}_{feat}$ model sharpens the boundary edges of the horse and rider but the background trees remain diffuse, suggesting that the $\mathscr{l}_{feat}$ model may be more aware of image semantics.
:::
:::success
$\times 8$超解析度的結果如Figure 9所示。同樣地,我們看到$\mathscr{l}_{feat}$模型相較於其它模型,在邊緣與細節上表現得很好,像是馬的腿與蹄。$\mathscr{l}_{feat}$模型並不會不分情況地銳利化所有邊緣;相較於$\mathscr{l}_{pixel}$模型,$\mathscr{l}_{feat}$模型銳利化了馬與騎士的邊界邊緣,但背景的樹仍然是模糊的,這暗示著$\mathscr{l}_{feat}$模型可能更能察覺影像的語意。
:::
:::info
Since our $\mathscr{l}_{pixel}$ and our $\mathscr{l}_{feat}$ models share the same architecture, data, and training procedure, all differences between them are due to the difference between the $\mathscr{l}_{pixel}$ and $\mathscr{l}_{feat}$ losses. The $\mathscr{l}_{pixel}$ loss gives fewer visual artifacts and higher PSNR values but the $\mathscr{l}_{feat}$ loss does a better job at reconstructing fine details, leading to pleasing visual results.
:::
:::success
因為我們的$\mathscr{l}_{pixel}$跟$\mathscr{l}_{feat}$模型有著相同的架構、資料與訓練程序,它們之間的所有差異都來自於$\mathscr{l}_{pixel}$與$\mathscr{l}_{feat}$兩種loss的差異。$\mathscr{l}_{pixel}$ loss產生較少的視覺瑕疵與較高的PSNR值,但$\mathscr{l}_{feat}$ loss在重構細節上做得比較好,從而產生令人愉悅的視覺結果。
:::
:::info
![image](https://hackmd.io/_uploads/rk5dbUuFA.png)
Fig. 9. Super-resolution results with scale factor $\times 8$ on an image from the BSD100 dataset. We report PSNR / SSIM for the example image and the mean for each dataset. More results are shown in the supplementary material.
:::
## 5 Conclusion
:::info
In this paper we have combined the benefits of feed-forward image transformation tasks and optimization-based methods for image generation by training feed-forward transformation networks with perceptual loss functions. We have applied this method to style transfer where we achieve comparable performance and drastically improved speed compared to existing methods, and to singleimage super-resolution where we show that training with a perceptual loss allows the model to better reconstruct fine details and edges.
:::
:::success
這篇論文中,我們透過以perceptual loss functions訓練前饋轉換網路,結合了前饋影像轉換任務與基於最佳化的影像生成方法兩者的優點。我們將這個方法應用到風格轉換上,相較於現有方法,我們達到相當的效能,而且速度大幅提升;也應用到單張影像超解析度上,我們說明了使用perceptual loss訓練能讓模型更好地重構細節與邊緣。
:::
:::info
In future work we hope to explore the use of perceptual loss functions for other image transformation tasks, such as colorization and semantic segmentation. We also plan to investigate the use of different loss networks to see whether for example loss networks trained on different tasks or datasets can impart image transformation networks with different types of semantic knowledge.
:::
:::success
未來的研究,我們希望探索perceptual loss functions在其它影像轉換任務上的使用,像是著色與語意分割。我們還計劃研究使用不同的loss networks,看看例如在不同任務或資料集上訓練的loss networks,是否可以向影像轉換網路傳遞不同類型的語意知識。
:::