# Notes on [Patch2Pix: Epipolar-Guided Pixel-Level Correspondences](https://arxiv.org/pdf/2012.01909.pdf)

###### tags: `notes` `correspondences` `homography estimation`

##### Authors: Qunjie Zhou, Torsten Sattler and Laura Leal-Taixé

##### Notes written by Arihant Gaur and Saurabh Kemekar

## Abstract

The paper proposes a deep learning based method for match proposal and refinement. The method is weakly supervised, guided only by the epipolar geometry of an input image pair. The following contributions are made:

1. A novel view of finding correspondences as proposal followed by refinement.
2. A match refinement network.
3. The refinement network consistently improves the match accuracy of the underlying correspondence network.
4. The refinement network generalizes to fully supervised methods without retraining.
5. State-of-the-art results for indoor and outdoor long-term localization.

## Related Work

1. **Feature Detection**: D2-Net detects keypoints on feature maps at four times lower resolution than the input, which limits keypoint accuracy. ASLFeat uses deformable CNNs and feature maps at multiple levels. R2D2 uses dilated CNNs to preserve the image resolution. CAPS fuses features from several resolution levels and obtains per-pixel descriptors by interpolation.
2. **Matching and Outlier Rejection**: Feature matching can be done with nearest-neighbour search in descriptor space. For outlier rejection, RANSAC or one of its derivatives can be used. Recent works learn the matching function itself. SuperPoint + SuperGlue uses graph neural networks with attention and was state of the art for the matching problem at the time of writing. S2DNet performs sparse-to-dense feature matching. However, these methods do not address keypoint detection.
3. **End-to-end matching**: Correspondences are learned and output in a single forward pass. NCNet uses correlation layers and 4D convolutions to enforce score consistency; however, the scores are obtained at 16 times downscaled resolution. SparseNCNet reduces this to 4 times downscaled resolution.

## Patch2Pix: Match Refinement Network

![](https://i.imgur.com/AEYxwzt.png)

This is a two-stage detect-to-refine method:

1. **Correspondence stage**: A correspondence network is adapted to predict a set of patch-level match proposals.
2. **Refinement stage**: The confidence of each proposal is estimated, and a regressor detects a pixel-accurate match within the local patches centered on the proposed match.

### Refinement: Pixel-Level Matching

#### Feature Extraction

1. Feature maps are extracted from each image of the input pair $(I_A, I_B)$ by a CNN backbone with $L$ layers, giving $\{f_l^A\}_{l=0}^{L-1}$ and $\{f_l^B\}_{l=0}^{L-1}$.
2. Feature map $f_l$ has dimensions $H/2^l \times W/2^l$ for $l \in [0, L-1]$. NOTE: $f_0^A = I_A$ and $f_0^B = I_B$.

#### From Match Proposals to Patches

1. A match proposal is $m_i = (p_i^A, p_i^B) = (x_i^A, y_i^A, x_i^B, y_i^B)$.
2. To find accurate matches, a search region of $S \times S$ local patches centered at $p_i^A$ and $p_i^B$, with $S > 2^{L-1}$, is considered so that it covers a larger region than the original patches. The matches are then regressed by the network from the feature maps.

#### Local Patch Expansion

Matching is extended to the neighbourhood by shifting the patches by $d$ pixels in the $x$ and $y$ directions. This is done for the patch around $p_i^A$ while keeping $p_i^B$ fixed, and vice versa, yielding $8$ new match proposals. This allows searching over two $2S \times 2S$ regions instead of the original $S \times S$ patches.

![](https://i.imgur.com/g9rnFyt.png)

In the paper, $d = S/2$.
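As a minimal sketch of the expansion step above, the following PyTorch snippet generates the 8 extra proposals, assuming the shifts are the four diagonal offsets $(\pm d, \pm d)$ applied to one image's patch centre at a time; the function name and tensor layout are illustrative, not taken from the paper's code.

```python
import torch

def expand_proposals(matches: torch.Tensor, d: float) -> torch.Tensor:
    """Expand each patch-level match proposal into 8 extra proposals.

    matches: (N, 4) tensor of (x_A, y_A, x_B, y_B) patch centres.
    d:       shift in pixels (the paper uses d = S / 2).

    Assumed scheme: the centre in image A is shifted by (+/-d, +/-d)
    while the centre in image B stays fixed, and vice versa, so the
    effective search grows from S x S to two 2S x 2S regions.
    """
    shifts = torch.tensor(
        [[dx, dy] for dx in (-d, d) for dy in (-d, d)],
        dtype=matches.dtype, device=matches.device,
    )  # (4, 2) diagonal offsets

    n = matches.shape[0]
    base = matches.unsqueeze(1).repeat(1, 4, 1)  # (N, 4, 4)

    shift_a = base.clone()
    shift_a[..., 0:2] += shifts          # move the patch in image A only
    shift_b = base.clone()
    shift_b[..., 2:4] += shifts          # move the patch in image B only

    return torch.cat([shift_a, shift_b], dim=1).reshape(n * 8, 4)
```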
#### Progressive Match Regression

1. Project each proposed match location onto the corresponding feature maps.
2. Gather the features from all layers at that location and concatenate them into a single feature vector.
3. Concatenate the gathered feature vectors of the two matched points and feed them into the regressor.
4. The regressor converts the input into a feature vector through two convolutional layers, which is then passed through two fully connected layers.
5. A classification head yields the set of good (mid-level) matches.
6. A new set of patches, centered on the mid-level matches, is obtained and fed to the fine-level regressor.

![](https://i.imgur.com/m0J8wHu.png)

### Losses

1. Classification loss $\mathcal{L}_{cls}$.
2. Geometric loss $\mathcal{L}_{geo}$.
3. Overall:
\begin{equation}
\mathcal{L}_{pixel} = \alpha \mathcal{L}_{cls} + \mathcal{L}_{geo}
\end{equation}
$\alpha$ is a weighting parameter balancing the two losses. In the paper, $\alpha = 10$.

NOTE: The two points of a correct match must lie on each other's epipolar lines. Rather than the plain epipolar distance, the Sampson distance is used, as it gives better experimental results.

1. **Sampson Distance from the Fundamental Matrix**: The Sampson distance is calculated as
\begin{equation}
\phi_i = \Phi(m_i, F) = \frac{((P_i^B)^T F P_i^A)^2}{(F P_i^A)^2_1 + (F P_i^A)^2_2 + (F^T P_i^B)^2_1 + (F^T P_i^B)^2_2}, \\
P_i^A = (x_i^A, y_i^A, 1)^T, \quad P_i^B = (x_i^B, y_i^B, 1)^T
\end{equation}
where $(F P_i^A)^2_k$ denotes the square of entry $k$ of the vector $F P_i^A$.
2. **Match Proposal**: For $m_i = (x_i^A, y_i^A, x_i^B, y_i^B)$, a class-balanced binary cross entropy loss is used:
\begin{equation}
\mathcal{B}(\mathcal{C}, \mathcal{C^*}) = -\frac{1}{N}\sum_{i = 1}^{N} \left[ w c_i^* \log{c_i} + (1 - c_i^*)\log{(1 - c_i)} \right], \\
w = |\{c_i^* \mid c_i^* = 0\}| \, / \, |\{c_i^* \mid c_i^* = 1\}|, \\
c_i^* = 1 \iff \phi_i < \theta_{cls} \text{ (positive pair)}
\end{equation}
$\theta_{cls}$ is the geometric distance threshold for classification. All other pairs are labelled as negative.
3. Separate thresholds $\hat{\theta}_{cls}$ and $\tilde{\theta}_{cls}$ are used for the mid-level and fine-level classification losses, respectively; the two losses are added to obtain the final classification loss.
4. **Geometric Loss**: The geometric loss of a match is only updated if the Sampson distance of its parent patch proposal is within the threshold $\theta_{geo}$.
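To make the losses above concrete, here is a minimal PyTorch sketch of the Sampson distance and the class-balanced binary cross entropy. Function names, tensor layouts, and the numerical safeguards are my own assumptions, not the paper's code.

```python
import torch

def sampson_distance(pts_a: torch.Tensor, pts_b: torch.Tensor,
                     F: torch.Tensor) -> torch.Tensor:
    """Sampson distance phi_i for matches m_i = (x_A, y_A, x_B, y_B).

    pts_a, pts_b: (N, 2) pixel coordinates in images A and B.
    F:            (3, 3) fundamental matrix of the image pair.
    """
    ones = torch.ones(pts_a.shape[0], 1, dtype=pts_a.dtype, device=pts_a.device)
    P_a = torch.cat([pts_a, ones], dim=1)   # (N, 3) homogeneous points
    P_b = torch.cat([pts_b, ones], dim=1)

    Fa = P_a @ F.T                          # rows are (F P_a)^T
    Ftb = P_b @ F                           # rows are (F^T P_b)^T

    num = (P_b * Fa).sum(dim=1) ** 2        # ((P_b)^T F P_a)^2
    den = Fa[:, 0]**2 + Fa[:, 1]**2 + Ftb[:, 0]**2 + Ftb[:, 1]**2
    return num / den.clamp(min=1e-8)

def classification_loss(c: torch.Tensor, phi: torch.Tensor,
                        theta_cls: float) -> torch.Tensor:
    """Class-balanced BCE: labels come from thresholding the Sampson distance."""
    c_star = (phi < theta_cls).float()      # positive pair if phi_i < theta_cls
    num_pos = c_star.sum().clamp(min=1.0)
    num_neg = (1 - c_star).sum()
    w = num_neg / num_pos                   # w = #negatives / #positives

    eps = 1e-8                              # numerical safeguard for log
    loss = w * c_star * torch.log(c.clamp(min=eps)) \
        + (1 - c_star) * torch.log((1 - c).clamp(min=eps))
    return -loss.mean()
```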
## Implementation Details

1. ResNet-34 is used for feature extraction from the input images. To keep enough resolution in the last feature map $f_4$, its stride is changed to prevent further downscaling.
2. The distance thresholds used during training are $\hat{\theta}_{cls} = \hat{\theta}_{geo} = 50$ for mid-level regression and $\tilde{\theta}_{cls} = \tilde{\theta}_{geo} = 5$ for fine-level regression.
3. The local patch size is $S = 16$, i.e., double the downscaling factor of the last feature map of the ResNet-34 network.
4. The pixel-level matching is optimized using Adam with an initial learning rate of $5 \times 10^{-4}$ for 5 epochs and then $10^{-4}$ until convergence.
5. Confidence scores from the fine-level regressor are used to filter out outliers, with a threshold of $c = 0.5$ or $c = 0.9$; the choice represents a trade-off between quantity and quality of the matches.
6. The *Patch2Pix* network was trained on the large-scale outdoor dataset MegaDepth, from which 60,661 matching pairs were constructed.

## Evaluation and Limitations

* **Image Matching** - Among all weakly supervised methods, *Patch2Pix* performs best at both error thresholds under illumination changes.
* **Homography Estimations** - Under illumination changes, *Patch2Pix* performs second best after NCNet and outperforms all fully supervised methods. Under viewpoint variations, *Patch2Pix* is best at 1-pixel error among weakly supervised methods. Accuracy under viewpoint variations can be further increased by replacing the NCNet match proposals with oracle match proposals.
* **Outdoor Localization on Aachen Day-Night** - *Patch2Pix* performs worse than SuperPoint + CAPS but better than all other fully supervised methods.
* **Indoor Localization on InLoc** - *Patch2Pix* is best among weakly supervised methods and second best overall, after SuperPoint + SuperGlue.

**Future Works** - Adding a loss that enforces detecting the same pixel position in image A for the pairs (A, B) and (A, C), or designing a localization pipeline tailored to the method's matches.
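As a usage note for the homography experiments above, here is a minimal sketch of how pixel-level matches are typically consumed downstream: fitting a homography with OpenCV's RANSAC, the standard outlier-rejection step mentioned in Related Work. The `matches` layout and the threshold value are illustrative assumptions, not the paper's exact evaluation code.

```python
import cv2
import numpy as np

def estimate_homography(matches: np.ndarray, ransac_thresh: float = 3.0):
    """Fit a homography to (N, 4) matches (x_A, y_A, x_B, y_B) with RANSAC.

    Returns the 3x3 homography mapping image A to image B and a boolean
    inlier mask; returns (None, None) when estimation fails.
    """
    pts_a = matches[:, 0:2].astype(np.float32)
    pts_b = matches[:, 2:4].astype(np.float32)
    H, mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, ransac_thresh)
    if H is None:
        return None, None
    return H, mask.ravel().astype(bool)
```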