Paper Translation
deeplearning
Normalization
Blocks are classified as follows: original-text blocks are shown on a blue background and translation blocks on a green background. Some technical terms follow the translations of the National Academy for Educational Research.
Original
Translation
Personal annotations; please leave a comment on any part of the translation that reads awkwardly.
In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016). We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to swapping batch normalization with instance normalization, and to applying the latter both at training and testing times. The resulting method can be used to train high-performance architectures for real-time image generation. The code is available at https://github.com/DmitryUlyanov/texture_nets. The full paper can be found at https://arxiv.org/abs/1701.02096.
In this paper, we revisit the fast stylization method proposed by Ulyanov et al. We show how a small change in the stylization architecture leads to a significant qualitative improvement in the generated images. The change is simply to swap batch normalization for instance normalization, and to use instance normalization at both training and test time. The resulting method can be used to train high-performance architectures for real-time image generation. The code and the full paper are available at the links above.
The recent work of Gatys et al. (2016) introduced a method for transferring a style from an image onto another one, as demonstrated in fig. 1. The stylized image simultaneously matches selected statistics of the style image and of the content image. Both style and content statistics are obtained from a deep convolutional network pre-trained for image classification. The style statistics are extracted from shallower layers and averaged across spatial locations whereas the content statistics are extracted from deeper layers and preserve spatial information. In this manner, the style statistics capture the “texture” of the style image whereas the content statistics capture the “structure” of the content image.
A recent work by Gatys et al. proposed a method for transferring the style of one image onto another, as shown in fig. 1. The stylized image simultaneously matches selected statistics of the style image and of the content image. Both the style and content statistics are obtained from a deep convolutional network pre-trained for image classification. The style statistics are extracted from shallower layers and averaged across spatial locations, while the content statistics are extracted from deeper layers and preserve spatial information. In this way, the style statistics capture the "texture" of the style image, while the content statistics capture the "structure" of the content image.
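Annotation: as a concrete example of "style statistics averaged across spatial locations", the Gatys et al. formulation uses Gram matrices of CNN feature maps. Below is a minimal sketch; PyTorch and the toy feature shape are my own illustrative assumptions, not part of the paper.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Style statistics in the spirit of Gatys et al.: channel-to-channel
    correlations of a feature map, averaged over all spatial locations."""
    b, c, h, w = features.shape        # batch, channels, height, width
    f = features.reshape(b, c, h * w)  # flatten the spatial dimensions
    # Inner products between channels, normalized by the number of spatial
    # locations; the spatial layout is discarded, which is why these
    # statistics capture "texture" rather than "structure".
    return f @ f.transpose(1, 2) / (h * w)

# Toy usage with a fake feature map from some shallow CNN layer.
feats = torch.randn(1, 64, 32, 32)
print(gram_matrix(feats).shape)  # torch.Size([1, 64, 64])
```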
Although the method of Gatys et al. produces remarkably good results, it is computationally inefficient. The stylized image is, in fact, obtained by iterative optimization until it matches the desired statistics. In practice, it takes several minutes to stylize an image of size 512 × 512. Two recent works, Ulyanov et al. (2016) and Johnson et al. (2016), sought to address this problem by learning equivalent feed-forward generator networks that can generate the stylized image in a single pass. These two methods differ mainly by the details of the generator architecture and produce results of a comparable quality; however, neither achieved as good results as the slower optimization-based method of Gatys et al.
Although the method of Gatys et al. produces remarkably good results, it is computationally inefficient. In fact, the stylized image is obtained by iteratively optimizing it until it matches the desired statistics. In practice, stylizing a single 512×512 image takes several minutes. Two recent works, by Ulyanov et al. and by Johnson et al., sought to address this problem by learning an equivalent feed-forward generator network that produces the stylized image in a single pass. The two methods differ mainly in the details of the generator architecture and produce results of comparable quality; however, neither matches the quality of the slower optimization-based method of Gatys et al.
In this paper we revisit the method for feed-forward stylization of Ulyanov et al. (2016) and show that a small change in the generator architecture leads to much improved results. The results are in fact of comparable quality to the slow optimization-based method of Gatys et al., but can be obtained in real time on standard GPU hardware. The key idea (section 2) is to replace batch normalization layers in the generator architecture with instance normalization layers, and to keep them at test time (as opposed to freezing and simplifying them out, as is done for batch normalization). Intuitively, the normalization process makes it possible to remove instance-specific contrast information from the content image, which simplifies generation. In practice, this results in vastly improved images (section 3).
In this paper, we revisit the feed-forward stylization method of Ulyanov et al. and show that a small change in the generator architecture leads to greatly improved results. The results are in fact comparable in quality to the slow optimization-based method of Gatys et al., yet can be obtained in real time on standard GPU hardware. The key idea (section 2) is to replace the batch normalization layers in the generator architecture with instance normalization layers, and to keep them at test time (rather than freezing them, as is done with batch normalization). Intuitively, the normalization process removes instance-specific contrast information from the content image, which simplifies generation. In practice, this yields vastly improved images (section 3).
The work of Ulyanov et al. (2016) showed that it is possible to learn a generator network $g(x, z)$ that applies the style of a fixed style image $x_0$ to an arbitrary content image $x$; the variable $z$ is a random seed that can be used to obtain sample stylization results.

Ulyanov et al.'s work shows that one can learn a generator network $g(x, z)$ that applies the style of a fixed style image $x_0$ to any content image $x$, with $z$ a random seed for drawing different stylization samples.

The function $g$ is a convolutional neural network learned from examples, where an example is simply a content image $x_t$, $t = 1, \dots, n$, and learning solves the problem
$$
\min_g \frac{1}{n} \sum_{t=1}^{n} \mathcal{L}\bigl(x_0,\, x_t,\, g(x_t, z_t)\bigr),
$$
where $z_t \sim \mathcal{N}(0, 1)$ are i.i.d. samples from a Gaussian distribution, and the loss $\mathcal{L}$ compares the statistics of the style image $x_0$, the content image $x_t$, and the stylized image $g(x_t, z_t)$ through a CNN pre-trained for image classification.

The function $g$ is a convolutional neural network learned from examples; each example is just a content image $x_t$, $t = 1, \dots, n$, and learning minimizes the average loss $\mathcal{L}(x_0, x_t, g(x_t, z_t))$ over the examples, where the $z_t \sim \mathcal{N}(0, 1)$ are i.i.d. Gaussian samples and $\mathcal{L}$ compares feature statistics extracted by a pre-trained CNN.

While the generator network $g(x, z)$ is fast, the authors of Ulyanov et al. (2016) observed that training it for a large number of iterations degrades the quality of the results: the most serious artifacts appear along the image border, caused by the zero padding added before every convolution, and better padding techniques alone do not remove them, whereas instance normalization does (see fig. 3).

Although the generator network $g(x, z)$ is fast, Ulyanov et al. observed that training it for many iterations degrades the results; the worst artifacts sit along the image border and stem from the zero padding added before each convolution, and a better padding technique alone does not fix them, while instance normalization does (see fig. 3).
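Annotation: a minimal sketch of this learning problem, assuming PyTorch. The one-layer `generator`, the way the noise seed is injected, and the placeholder `style_content_loss` are all my own illustrative stand-ins, not the architecture or loss of either paper.

```python
import torch

# Toy stand-in for the generator g(x, z); the real texture network is a
# multi-scale convolutional architecture, not a single convolution.
generator = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

def style_content_loss(x0, xt, y):
    # Placeholder for L(x0, xt, g(xt, zt)); the actual loss compares CNN
    # feature statistics of the style, content, and stylized images.
    return torch.nn.functional.mse_loss(y, xt)

x0 = torch.rand(1, 3, 256, 256)  # fixed style image
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)

for xt in [torch.rand(1, 3, 256, 256) for _ in range(4)]:  # content images x_t
    zt = torch.randn_like(xt)        # z_t ~ N(0, 1), i.i.d. per example
    y = generator(xt + zt)           # one simple way to inject the seed
    loss = style_content_loss(x0, xt, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```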
Figure 3: Row 1: content image (left), style image (middle) and style transfer using the method of Gatys et al. (right). Row 2: typical stylization results when trained for a large number of iterations using the fast stylization method from Ulyanov et al. (2016): with zero padding (left), with a better padding technique (middle), with zero padding and instance normalization (right).
Figure 3: Row 1: content image (left), style image (middle), and style transfer with the method of Gatys et al. (right). Row 2: typical stylization results after training for many iterations with the fast stylization method of Ulyanov et al.: zero padding (left), a better padding technique (middle), zero padding plus instance normalization (right).
A simple observation is that the result of stylization should not, in general, depend on the contrast of the content image (see fig. 2). In fact, the style loss is designed to transfer elements from a style image to the content image such that the contrast of the stylized image is similar to the contrast of the style image. Thus, the generator network should discard contrast information in the content image. The question is whether contrast normalization can be implemented efficiently by combining standard CNN building blocks or whether, instead, it is best implemented directly in the architecture.
A simple observation is that, in general, the result of stylization should not depend on the contrast of the content image (see fig. 2). In fact, the style loss is designed to transfer elements from the style image to the content image so that the contrast of the stylized image resembles the contrast of the style image; the generator network should therefore discard the contrast information of the content image. The question is whether contrast normalization can be implemented efficiently by combining standard CNN building blocks, or whether it is better implemented directly in the architecture.
Figure 2: The contrast of a stylized image is mostly determined by the contrast of the style image and is almost independent of the content image contrast. The stylization is performed with the method of Gatys et al. (2016).
Figure 2: The contrast of the stylized image is determined mostly by the contrast of the style image and is almost unrelated to the contrast of the content image. Stylization here uses the method of Gatys et al.
The generators used in Ulyanov et al. (2016) and Johnson et al. (2016) use convolution, pooling, upsampling, and batch normalization. In practice, it may be difficult to learn a highly nonlinear contrast normalization function as a combination of such layers. To see why, let $x \in \mathbb{R}^{T \times C \times W \times H}$ be an input tensor containing a batch of $T$ images, and let $x_{tijk}$ denote its $tijk$-th element, where $j$ and $k$ span spatial dimensions, $i$ is the feature channel (color channel if the input is an RGB image), and $t$ is the index of the image in the batch. Then a simple version of contrast normalization is given by
$$
y_{tijk} = \frac{x_{tijk}}{\sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}}. \tag{1}
$$
It is unclear how such a function could be implemented as a sequence of ReLU and convolution operators.

The generators of Ulyanov et al. and Johnson et al. are built from convolution, pooling, upsampling, and batch normalization. In practice, a highly nonlinear contrast normalization function is hard to learn as a combination of such layers. To see why, let $x \in \mathbb{R}^{T \times C \times W \times H}$ be an input tensor holding a batch of $T$ images, with $x_{tijk}$ its $tijk$-th element ($j, k$ the spatial positions, $i$ the feature channel, $t$ the index within the batch); a simple version of contrast normalization is then eq. (1) above. It is still unclear how such a function could be realized as a sequence of ReLU and convolution operations.
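Annotation: a minimal NumPy sketch of eq. (1), assuming the $T \times C \times W \times H$ layout above; this is purely illustrative.

```python
import numpy as np

def contrast_normalize(x: np.ndarray) -> np.ndarray:
    """Simple contrast normalization of eq. (1): divide each element by the
    sum over the spatial locations of its own image and channel."""
    denom = x.sum(axis=(2, 3), keepdims=True)  # sum over W and H only
    return x / denom

x = np.random.rand(4, 3, 8, 8)  # a batch of T=4 three-channel images
y = contrast_normalize(x)
print(y.shape)  # (4, 3, 8, 8); each image is normalized by itself alone
```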
On the other hand, the generator network of Ulyanov et al. (2016) does contain normalization layers, and precisely batch normalization ones. The key difference between eq. (1) and batch normalization is that the latter applies the normalization to a whole batch of images instead of single ones:
$$
y_{tijk} = \frac{x_{tijk} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad
\mu_i = \frac{1}{HWT} \sum_{t=1}^{T} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \qquad
\sigma_i^2 = \frac{1}{HWT} \sum_{t=1}^{T} \sum_{l=1}^{W} \sum_{m=1}^{H} (x_{tilm} - \mu_i)^2. \tag{2}
$$

On the other hand, the generator network of Ulyanov et al. does contain normalization layers, and precisely batch normalization ones. The key difference from eq. (1) is that batch normalization normalizes over the whole batch of images rather than over a single one, as in eq. (2) above.
In order to combine the effects of instance-specific normalization and batch normalization, we propose to replace the latter by the instance normalization (also known as "contrast normalization") layer:
$$
y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}, \qquad
\mu_{ti} = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \qquad
\sigma_{ti}^2 = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} (x_{tilm} - \mu_{ti})^2. \tag{3}
$$

To combine the effects of instance-specific normalization and batch normalization, we propose replacing the latter with an instance normalization (also known as "contrast normalization") layer, eq. (3) above, whose mean and variance are computed per image and per channel rather than over the whole batch.
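Annotation: the only difference between eqs. (2) and (3) is the set of axes over which the mean and variance are taken. A NumPy sketch, again assuming the $T \times C \times W \times H$ layout:

```python
import numpy as np

def batch_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # eq. (2): statistics shared across the whole batch (axes T, W, H)
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # eq. (3): statistics per image and per channel (axes W, H only)
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.rand(4, 3, 8, 8)
# The two disagree as soon as the batch contains more than one image.
print(np.allclose(batch_norm(x), instance_norm(x)))  # False in general
```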
We replace batch normalization with instance normalization everywhere in the generator network $g$. This prevents instance-specific mean and covariance shift, simplifying the learning process. Unlike batch normalization, furthermore, the instance normalization layer is applied at test time as well.

We replace batch normalization with instance normalization everywhere in the generator network $g$. This avoids instance-specific mean and covariance shift and simplifies learning; moreover, unlike batch normalization, instance normalization is also applied at test time.
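Annotation: in a modern framework this swap is a one-line change. A hedged PyTorch sketch follows; the tiny block below is illustrative and not the generator of either paper. Note that `nn.InstanceNorm2d` with its default `track_running_stats=False` keeps using per-image statistics in `eval()` mode, which matches keeping the layer active at test time.

```python
import torch
from torch import nn

def conv_block(norm_layer):
    # Illustrative conv/norm/ReLU block; swapping nn.BatchNorm2d for
    # nn.InstanceNorm2d is the entire proposed change.
    return nn.Sequential(
        nn.Conv2d(3, 3, kernel_size=3, padding=1),
        norm_layer(3),
        nn.ReLU(inplace=True),
    )

g_bn = conv_block(nn.BatchNorm2d)      # original generator flavor
g_in = conv_block(nn.InstanceNorm2d)   # proposed generator flavor

x = torch.rand(1, 3, 64, 64)
g_in.eval()  # instance statistics are still computed from each input image
with torch.no_grad():
    print(g_in(x).shape)  # torch.Size([1, 3, 64, 64])
```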
In this section, we evaluate the effect of the modification proposed in section 2, replacing batch normalization with instance normalization. We tested both generator architectures described in Ulyanov et al. (2016) and Johnson et al. (2016) in order to see whether the modification applies to different architectures. While we did not have access to the original network by Johnson et al. (2016), we carefully reproduced their model from the description in the paper. Ultimately, we found that both generator networks have similar performance and shortcomings (fig. 5, first row).
In this section, we evaluate the effect of the modification proposed in section 2, replacing batch normalization with instance normalization. To see whether the change carries over to different architectures, we tested both generator architectures proposed by Ulyanov et al. and by Johnson et al. Although we did not have access to the original network of Johnson et al., we carefully reproduced their model from the description in the paper. Ultimately, we found that the two generator networks have similar performance and similar shortcomings (fig. 5, first row).
Figure 5: Qualitative comparison of generators proposed in Ulyanov et al. (2016) (left), Johnson et al. (2016) (right) with batch normalization (first row) and instance normalization (second row). Both architectures benefit from instance normalization.
Next, we replaced batch normalization with instance normalization and retrained the generators using the same hyperparameters. We found that both architectures improved significantly with the use of instance normalization (fig. 5, second row). The quality of both generators is similar, but we found the residual architecture of Johnson et al. (2016) to be somewhat more efficient and easier to use, so we adopted it for the results shown in fig. 4.
We then replaced batch normalization with instance normalization and retrained with the same hyperparameters. Both architectures improved markedly with instance normalization (see fig. 5, second row). The quality of the two generators is comparable, but the residual architecture of Johnson et al. is somewhat more efficient and easier to use, so we adopted it for the results shown in fig. 4.
Figure 4: Stylization examples using the proposed method. First row: style images; second row: original image and its stylized versions.
In this short note, we demonstrate that by replacing batch normalization with instance normalization it is possible to dramatically improve the performance of certain deep neural networks for image generation. The result is suggestive, and we are currently experimenting with similar ideas for image discrimination tasks as well.
In this short note, we demonstrate that replacing batch normalization with instance normalization can dramatically improve the performance of certain deep neural networks for image generation. The result is suggestive, and we are also currently experimenting with similar ideas for image discrimination tasks.
Figure 6: Processing a content image from fig. 4 with Delaunay style at different resolutions: 512 (left) and 1080 (right).