# Notes on [Reference-Based Sketch Image Colorization using Augmented-Self Reference and Dense Semantic Correspondence](https://arxiv.org/pdf/2005.05207.pdf)
CVPR 2020 [Code](https://github.com/Jungjaewon/Reference_based_Skectch_Image_Colorization)
Author: **Amandeep Kumar**
## Introduction
* This paper tackles the automatic colorization task of a sketch image given an already-colored reference image.
* They use the identical image with a geometric distortion applied as a virtual reference, which makes a ground truth available for the colored output and gives the training data a paired-like appearance.
* The reference image contains most of the content of the original image, and the model transfers the contextual information obtained from the reference into the spatially corresponding positions of the sketch via an attention-based pixel-wise feature transfer module.
## Methodology
### Architecture

* An image $I$ is passed through the outline extractor to get $I_{s}$. To get $I_{r}$, $I$ is passed through a thin plate spline (TPS) transformation.
* $I_{s}$ and $I_{r}$ are passed to the encoders $E_{s}$ and $E_{r}$ respectively to obtain the activation maps $f_{s}$ and $f_{r}$.
* The transferred features then pass through several residual blocks and a U-net-based decoder sequentially to obtain the final colored output; a rough sketch of this forward pass is given below.
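A minimal PyTorch-style sketch of this forward pass, assuming hypothetical submodules for the encoders, the feature transfer module, the residual blocks, and the decoder; this illustrates the data flow only and is not the authors' implementation.

```python
import torch.nn as nn

class ReferenceColorizationNet(nn.Module):
    """Illustrative generator wiring: two encoders, attention-based feature
    transfer, residual blocks, and a U-net-style decoder (all submodules
    are assumed to be provided and are hypothetical here)."""

    def __init__(self, enc_s, enc_r, scft, res_blocks, decoder):
        super().__init__()
        self.enc_s = enc_s            # sketch encoder E_s
        self.enc_r = enc_r            # reference encoder E_r
        self.scft = scft              # spatially corresponding feature transfer
        self.res_blocks = res_blocks  # several residual blocks
        self.decoder = decoder        # U-net-based decoder (uses E_s skip features)

    def forward(self, I_s, I_r):
        f_s, skips = self.enc_s(I_s)  # activation maps + skip connections from the sketch
        f_r, _ = self.enc_r(I_r)      # activation maps from the reference
        c = self.scft(f_s, f_r)       # combined features c_i = v_i^s + v_i^*
        return self.decoder(self.res_blocks(c), skips)
```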
### Augmented-Self Reference Generation
* Two non-trivial transformations, an appearance transformation and a spatial transformation, are applied to generate $I_{r}$ for a given $I_s$.
* First, the appearance transformation $a(\cdot)$ is applied by adding a random value to each RGB channel of $I$.
* The appearance-transformed image acts as the ground truth $I_{gt}$ for the output image.
* This color perturbation (appearance transformation) prevents the model from memorizing color bias, i.e., a particular object being highly correlated with a single ground-truth color in the training data (e.g., red for apples).
* To get the final $I_r$, a TPS transformation is then applied, which prevents the model from becoming lazy and simply copying colors from the same pixel positions of $I_r$.
* This leads to semantically meaningful spatial correspondences even for a reference image with a spatially different layout. A rough sketch of this generation step follows below.
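A short sketch of the two transformations under stated assumptions: the appearance transform is modeled as one additive random offset per RGB channel, and the outline extraction and TPS warp are left as hypothetical callables rather than concrete library calls.

```python
import torch

def appearance_transform(I, noise_scale=0.3):
    """a(I): add one random offset per RGB channel (assumed form of the color
    perturbation). I is a float tensor of shape (3, H, W) in [0, 1];
    noise_scale is an illustrative hyperparameter."""
    noise = (torch.rand(3, 1, 1) * 2 - 1) * noise_scale
    return (I + noise).clamp(0, 1)

def make_training_triplet(I, outline_extractor, tps_warp):
    """Build (I_s, I_r, I_gt) from a single image I.
    outline_extractor and tps_warp are hypothetical callables."""
    I_s = outline_extractor(I)       # sketch input
    I_gt = appearance_transform(I)   # a(I): ground truth for the colored output
    I_r = tps_warp(I_gt)             # TPS-distorted reference
    return I_s, I_r, I_gt
```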
## Spatially Corresponding Feature Transfer

* The goal of this module is to learn which part of a reference image to bring information from, as well as which part of a sketch image to transfer that information to.
* The encoders produce $L$ activation maps $f^1, f^2, f^3, \ldots, f^L$, which are each downsampled to the spatial size of $f^L$ and concatenated to get the final activation map $V = [\varphi(f^1);\varphi(f^2);\ldots;\varphi(f^L)]$.
* Reshape $V$ as $\overline{V}=[v_1, v_2, \ldots, v_{hw}] \in \mathbb{R}^{d_{v}\times hw}$, where each $v_i \in \mathbb{R}^{d_{v}}$ represents the $i$-th region of the image.
* $v_i^s$ and $v_j^r$ are the respective vectors from $E_s$ and $E_r$. An attention matrix $A \in \mathbb{R}^{hw\times hw}$ with elements $\alpha_{ij}$ is computed by the model as
* $\alpha_{ij} = \underset{j}{\mathrm{softmax}}\left(\frac{(W_q v^s_i)\cdot(W_k v^r_j)}{\sqrt{d_v}}\right)$, where $W_q, W_k \in \mathbb{R}^{d_{v}\times d_v}$ are the linear transformations into a query and a key vector and $\sqrt{d_v}$ is a scaling factor. $\alpha_{ij}$ is a coefficient representing how much information $v_i^s$ should bring from $v_j^r$.
* The context vector $v_i^* = \underset{j}{\sum} \alpha_{ij}W_v v^{r}_{j}$, where $W_v \in \mathbb{R}^{d_{v}\times d_v}$, contains the color features of a semantically related region of the reference image.
* Finally, $c_i = v_i^s + v_i^*$ is fed to the decoder; a minimal sketch of this attention step is given below.
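A minimal PyTorch sketch of the attention step, assuming the flattened feature matrices $V^s$ and $V^r$ (shape $hw \times d_v$ per image) have already been built by downsampling and concatenating the activation maps; parameter names follow the notation above.

```python
import torch.nn as nn
import torch.nn.functional as F

class SCFT(nn.Module):
    """Illustrative spatially corresponding feature transfer:
    scaled dot-product attention from sketch features to reference features."""

    def __init__(self, d_v):
        super().__init__()
        self.W_q = nn.Linear(d_v, d_v, bias=False)  # query projection of v_i^s
        self.W_k = nn.Linear(d_v, d_v, bias=False)  # key projection of v_j^r
        self.W_v = nn.Linear(d_v, d_v, bias=False)  # value projection of v_j^r
        self.d_v = d_v

    def forward(self, V_s, V_r):
        # V_s, V_r: (B, hw, d_v), rows are the vectors v_i^s and v_j^r
        q, k, v = self.W_q(V_s), self.W_k(V_r), self.W_v(V_r)
        A = F.softmax(q @ k.transpose(1, 2) / self.d_v ** 0.5, dim=-1)  # alpha_ij, (B, hw, hw)
        V_star = A @ v          # context vectors v_i^*
        return V_s + V_star     # c_i = v_i^s + v_i^*
```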
## Loss functions
### Similarity-Based Triplet Loss
* It is a variant of the triplet loss, used to directly supervise the affinity between the pixel-wise query and key vectors that are used to compute the attention map $A$: $L_{tr} = \max(0, -S(v_q, v_k^p) + S(v_q, v_k^n) + \gamma)$
* where $S(·, ·)$ computes the scaled dot product. Given a query vector $v_q$ as an anchor, $v_k^p$ indicates a feature vector sampled from the positive region, and $v_k^n$ is a negative sample. $\gamma$ is the margin.
* This loss plays a crucial role in directly enforcing the model to find semantically matching pairs and to reflect the reference color at the corresponding positions.
* It encourages the query representation to be close to the correct (positive) key representation, while pushing it away from the wrong (negatively sampled) one. A small sketch of this loss follows below.
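A small sketch of the similarity-based triplet loss; how the positive and negative key vectors are sampled is not reproduced here, so the inputs are assumed to be given.

```python
import torch

def scaled_dot(a, b):
    """S(·,·): scaled dot product between batches of feature vectors of shape (..., d_v)."""
    return (a * b).sum(dim=-1) / a.shape[-1] ** 0.5

def triplet_loss(v_q, v_k_pos, v_k_neg, margin=1.0):
    """L_tr = max(0, -S(v_q, v_k^p) + S(v_q, v_k^n) + gamma); the margin value is illustrative."""
    return torch.clamp(-scaled_dot(v_q, v_k_pos) + scaled_dot(v_q, v_k_neg) + margin,
                       min=0).mean()
```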
### L1 Loss
* Since $I_{gt}$ is available, a reconstruction loss is applied:
$L_{rec}=\mathbb{E}(||G(I_s, I_r)-I_{gt}||_1)$
### Adversarial Loss
* A conditional GAN is used for the adversarial loss:
$L_{adv}=\mathbb{E}_{I_{gt},I_s}[log D(I_{gt}, I_s)]+ \mathbb{E}_{I_{s},I_r}[log(1-D(G(I_s,I_r),I_s))]$
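A hedged sketch of this conditional adversarial objective using the standard binary cross-entropy formulation; the discriminator `D` is assumed to take an (image, sketch) pair and return logits, and the exact GAN variant used in the paper is not reproduced here.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, I_gt, I_s, I_fake):
    """D is trained to classify (I_gt, I_s) as real and (G(I_s, I_r), I_s) as fake."""
    real_logits = D(I_gt, I_s)
    fake_logits = D(I_fake.detach(), I_s)
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def generator_adv_loss(D, I_fake, I_s):
    """G is trained so that D classifies its output, paired with the sketch, as real."""
    fake_logits = D(I_fake, I_s)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```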
### Perceptual loss
* This loss penalizes the semantic gap, i.e., the difference between the intermediate activation maps of the generated output $\overline{I}$ and those of the ground truth $I_{gt}$, extracted from an ImageNet-pretrained network.
$L_{perc}=\mathbb{E}[\underset{l}{\Sigma}||\phi_l(\overline{I})-\phi_l(I_{gt})||_{1,1}]$
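A sketch of the perceptual loss using torchvision's pretrained VGG16 as the ImageNet-pretrained network $\phi$; the chosen layer indices and the use of a mean instead of a summed $\|\cdot\|_{1,1}$ norm are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """L1 difference between intermediate VGG16 activations of the output and
    the ground truth, summed over a set of layers (layer choice is illustrative)."""

    def __init__(self, layer_ids=(3, 8, 15, 22)):  # relu1_2, relu2_2, relu3_3, relu4_3
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, I_out, I_gt):
        return sum(torch.abs(a - b).mean()
                   for a, b in zip(self.features(I_out), self.features(I_gt)))
```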
### Style loss
* $L_{style}=\mathbb{E}[||g(\phi_l(\overline{I}))-g(\phi_l(I_{gt}))||_{1,1}]$
* $g(\cdot)$ is the Gram matrix.
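A minimal sketch of the Gram matrix and the style loss, applied to the same intermediate features $\phi_l$ as in the perceptual-loss sketch above; the Gram matrix normalization is an assumption.

```python
import torch

def gram_matrix(feat):
    """g(·): channel-wise Gram matrix of an activation map of shape (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # normalization is illustrative

def style_loss(feats_out, feats_gt):
    """L_style: L1 distance between Gram matrices over the chosen feature layers
    (feats_* are lists of activation maps, e.g. from PerceptualLoss.features)."""
    return sum(torch.abs(gram_matrix(a) - gram_matrix(b)).sum()
               for a, b in zip(feats_out, feats_gt))
```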
## Conclusion
* The paper presents a novel training scheme, integrating the augmented-self reference and the attention-based feature transfer module to directly learn the semantic correspondence for the reference-based sketch colorization task
## Results
