# Notes on [Cross-domain Correspondence Learning for Exemplar-based Image Translation](https://arxiv.org/pdf/2004.05571.pdf) (CVPR 2020)

Author: **Amandeep Kumar**

## Introduction

* The paper proposes a general framework for image translation from an input in a distinct domain (e.g. semantic segmentation mask, edge map, pose keypoints), given an exemplar image.
* It jointly learns the cross-domain correspondence and the image translation with weak supervision.
* The images from the distinct domains are first aligned to an intermediate domain where a dense correspondence is established. The network then synthesizes the output based on the appearance of semantically corresponding patches in the exemplar, using de-normalization blocks similar to [SPADE](https://arxiv.org/pdf/1903.07291.pdf).

## Methodology

![](https://i.imgur.com/eeKj4AC.png)

* The model aims to learn the translation of an image $x_A \in A$ from the source domain to the target domain $B$, guided by an exemplar $y_B \in B$.
* The generated output should conform to the content of $x_A$ while resembling the style of semantically similar parts of $y_B$.

### Cross-domain correspondence network

* The input image and the exemplar are first adapted to a shared domain $S$.
* $x_A$ and $y_B$ are fed into feature pyramid networks (FPN) to capture both local and global image context. The feature maps are then transformed to representations in $S$:
  $x_S = F_{A\rightarrow S}(x_A;\theta_{F,A\rightarrow S})$, $\quad y_S = F_{B\rightarrow S}(y_B;\theta_{F,B\rightarrow S})$
  where $x_S, y_S \in \mathbb{R}^{HW \times C}$, and $F_{A\rightarrow S}$, $F_{B\rightarrow S}$ are the domain transformations from the two input domains respectively.
* The representations $x_S$ and $y_S$ comprise discriminative features that characterize the semantics of the inputs.
* A correlation matrix $M \in \mathbb{R}^{HW \times HW}$ is computed, each element of which is a pairwise feature correlation:
  $M(u, v) = \frac{\hat{x}_S(u)^T\hat{y}_S(v)}{\|\hat{x}_S(u)\|\,\|\hat{y}_S(v)\|}$
  where $\hat{x}_S(u), \hat{y}_S(v) \in \mathbb{R}^C$ are the channel-wise centralized features of $x_S$ and $y_S$ at positions $u$ and $v$, i.e. $\hat{x}_S(u) = x_S(u)-\mathrm{mean}(x_S(u))$ and $\hat{y}_S(v) = y_S(v)-\mathrm{mean}(y_S(v))$.
* The translation network finds it easier to generate high-quality outputs only by referring to the correct corresponding regions in the exemplar, which implicitly pushes the network to learn an accurate correspondence.
* The exemplar is warped towards the input according to $M$ (see the sketch below):
  $r_{y\rightarrow x}(u)=\sum_{v} \mathrm{softmax}_v(\alpha M(u,v)) \cdot y_B(v)$
  where $r_{y\rightarrow x} \in \mathbb{R}^{HW}$ and $\alpha$ is a coefficient that controls the sharpness of the softmax.
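The following is a minimal sketch (not the authors' released implementation) of the correspondence step described above: channel-wise centralization, a cosine-similarity correlation matrix $M$, and softmax-weighted warping of the exemplar. The tensor shapes, the default `alpha` value, and the function name `warp_exemplar` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def warp_exemplar(x_s, y_s, y_b, alpha=100.0):
    """Warp the exemplar y_B towards the input using the correlation matrix M.

    x_s, y_s : (B, HW, C) input / exemplar features in the shared domain S
    y_b      : (B, HW, 3) exemplar image flattened over spatial positions
    alpha    : softmax sharpness coefficient (the value here is an assumption)
    """
    # Channel-wise centralization: subtract each position's mean over channels.
    x_hat = x_s - x_s.mean(dim=-1, keepdim=True)
    y_hat = y_s - y_s.mean(dim=-1, keepdim=True)

    # L2-normalize so that the dot product below becomes a cosine similarity.
    x_hat = F.normalize(x_hat, dim=-1)
    y_hat = F.normalize(y_hat, dim=-1)

    # Correlation matrix M(u, v), shape (B, HW, HW).
    m = torch.bmm(x_hat, y_hat.transpose(1, 2))

    # r_{y->x}(u) = sum_v softmax_v(alpha * M(u, v)) * y_B(v)
    attn = F.softmax(alpha * m, dim=-1)
    r_y_to_x = torch.bmm(attn, y_b)  # exemplar warped to the input's layout
    return r_y_to_x, m
```

The backward warping $r_{y \rightarrow x \rightarrow y}$ used by the correspondence regularization loss can reuse the same $M$, with the softmax taken over $u$ (i.e. `dim=1`) instead of $v$.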
### Translation network

* A random vector $z$ is passed through the translation network $G$ to produce the desired output $\hat{x}_B \in B$ under the guidance of $r_{y\rightarrow x}$.
* Positional normalization and spatially-variant denormalization are proposed for high-fidelity texture transfer from the exemplar (a minimal sketch of this block is given at the end of these notes).
* The exemplar style is injected into the activation map $F^i \in \mathbb{R}^{C_i \times H_i \times W_i}$ before the $i^{th}$ normalization layer:
  $\alpha^i_{h,w}(r_{y \rightarrow x}) \times \frac{F^i_{c,h,w}-\mu^i_{h,w}}{\sigma^i_{h,w}} + \beta^i_{h,w}(r_{y \rightarrow x})$
  where $\mu^i_{h,w}$ and $\sigma^i_{h,w}$ are calculated exclusively along the channel dimension (unlike batch normalization), and $\alpha^i$, $\beta^i$ are the denormalization parameters that characterize the style of the exemplar.
* $\alpha^i, \beta^i=T_i(r_{y\rightarrow x}; \theta_T)$, where $T$ is implemented with two plain convolutional layers.
* The overall image translation can be formulated as $\hat{x}_B=G(z,T_i(r_{y\rightarrow x}; \theta_T); \theta_G)$.

## Loss function

#### Losses for pseudo exemplar pairs

* Pseudo paired images $(x_A, x_B)$ are created, where $x_A$ and $x_B$ are semantically aligned but come from different domains.
* A random geometric distortion is applied to $x_B$ to obtain $\overline{x}_B = h(x_B)$, where $h$ denotes an augmentation such as image warping or a random flip. $\overline{x}_B$ is treated as the exemplar and $x_B$ as the ground truth:
  $L_{feat} = \sum_{l} \lambda_l\|\phi_l(G(x_A,\overline{x}_B))-\phi_l(x_B)\|_1$

#### Domain alignment loss

* The same labelled image pair is used again:
  $L^{\ell_1}_{domain}=\|F_{A \rightarrow S}(x_A) - F_{B \rightarrow S}(x_B)\|_1$

#### Exemplar translation losses

* Perceptual loss: $L_{perc} = \|\phi_l(\hat{x}_B) - \phi_l(x_B)\|_1$.
* A contextual loss encourages $\hat{x}_B$ to adopt the appearance of the semantically corresponding patches of $y_B$:
  $L_{context}=\sum_{l}w_l\left[-\log\left(\frac{1}{n_l}\sum_{i}\max_{j} A^l(\phi^l_i(\hat{x}_B),\phi^l_j(y_B))\right)\right]$

#### Correspondence regularization

* The exemplar should match itself after forward-backward warping:
  $L_{reg}=\|r_{y \rightarrow x \rightarrow y}-y_B\|_1$, with $r_{y \rightarrow x \rightarrow y}(v)=\sum_u \mathrm{softmax}_u(\alpha M(u,v)) \cdot r_{y\rightarrow x}(u)$
* This loss plays a crucial role because the remaining losses, imposed at the end of the network, provide only weak supervision and cannot guarantee that the network learns a meaningful correspondence.

#### Adversarial loss

* $L^D_{adv}=-\mathbb{E}[h(D(y_B))]-\mathbb{E}[h(-D(G(x_A, y_B)))]$ and $L^G_{adv}=-\mathbb{E}[D(G(x_A, y_B))]$, where $h(t)=\min(0,-1+t)$ is the hinge function.

## Conclusion

* The paper presents CoCosNet, which translates images by relying on a cross-domain correspondence.
* The model learns dense correspondences for cross-domain images, paving the way for several intriguing applications.

## Results

![](https://i.imgur.com/kqOAh8D.png)
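As referenced in the translation network section above, here is a minimal sketch of the positional normalization + spatially-variant denormalization idea, assuming a SPADE-style convolutional head for $T$. The module name `SpatialDenorm`, the channel sizes, and the kernel sizes are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SpatialDenorm(nn.Module):
    """Sketch: positionally normalize the activation F^i, then denormalize it
    with spatially-variant alpha/beta predicted from the warped exemplar r_{y->x}."""

    def __init__(self, feat_channels, exemplar_channels=3, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(exemplar_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_alpha = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)

    def forward(self, feat, r_y_to_x):
        # Positional normalization: statistics over the channel dimension only,
        # per spatial location (h, w), unlike batch normalization.
        mu = feat.mean(dim=1, keepdim=True)
        sigma = feat.std(dim=1, keepdim=True) + 1e-5
        normalized = (feat - mu) / sigma

        # Resize the warped exemplar to this layer's resolution and predict the
        # spatially-variant denormalization parameters alpha and beta.
        r = nn.functional.interpolate(r_y_to_x, size=feat.shape[2:], mode='nearest')
        h = self.shared(r)
        alpha = self.to_alpha(h)
        beta = self.to_beta(h)
        return alpha * normalized + beta
```

In a full generator, one such block would modulate the activations before each normalization layer $i$, with $r_{y\rightarrow x}$ resized to that layer's spatial resolution.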