# Notes on [Neighbourhood Consensus Networks](https://arxiv.org/pdf/1810.10510.pdf)

###### tags: `notes` `correspondences` `4D convolution`

##### Authors: Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic

##### Notes written by Arihant Gaur and Saurabh Kemekar

## Introduction

The paper proposes an end-to-end pipeline for feature detection, description and matching, with the following contributions:

1. An end-to-end trainable CNN architecture (NCNet) that identifies neighbourhood consensus patterns in the 4D space of all possible correspondences, with no need for a global geometric model.
2. Training with weak supervision, in the form of matching and non-matching image pairs, with no manual annotation of correspondences.
3. Applications to both category-level and instance-level matching.

## Related Work

1) **Handcrafted image descriptors**
    * For example, SIFT, SURF, FAST and ORB. Image matching can be performed with nearest-neighbour search, and Lowe's ratio test can be used to remove ambiguous matches.
    * Issue: too many correct matches are discarded (especially in repetitive and textureless areas), and illumination invariance is limited.
2) **Trainable descriptors**
    * Use of DoG keypoints to yield sparse descriptors, or of pre-trained image-level CNNs. Recent works perform both detection and description.
3) **Trainable image alignment**
    * End-to-end trainable models that produce correspondences: pairwise feature matches are computed, and a CNN estimates the parameters of a geometric transformation from dense correspondences.
    * Issue: they only estimate low-complexity parametric transformations.
4) **Match filtering by neighbourhood consensus**
    * Inspect the neighbourhood of each potential match; useful for removing many incorrect matches.

## Proposed Approach

* The paper combines the robustness of neighbourhood consensus filtering with the power of a trainable neural architecture.
* The matching module is fully differentiable, so it can be combined directly with strong CNN image descriptors and trained end-to-end.

![](https://i.imgur.com/MIFrbYT.png)

There are mainly 5 components:

1) **Dense feature extraction and matching**
    * Given an image $I$, the feature extractor produces a dense set of descriptors $f_{ij}^I$.
    * Given the dense feature descriptors $f^A$ and $f^B$ of the two images to be matched, the exhaustive pairwise cosine similarity between them is computed and stored in a 4-D tensor $c$ referred to as the *correlation map*:
    $$ c_{ijkl} = \frac{\langle f_{ij}^A,\, f_{kl}^B\rangle}{\lVert f_{ij}^A \rVert_2 \, \lVert f_{kl}^B \rVert_2} $$
2) **Neighbourhood consensus network**
    * To further process and filter the matches, the authors propose a 4D convolutional neural network (CNN) for the neighbourhood consensus task.
    ![](https://i.imgur.com/70cveHV.png)
    * The first layer of the proposed CNN, which has $N_1$ filters, can specialize in learning different local geometric deformations, producing $N_1$ output channels.
    * The second layer captures more complex patterns by combining the outputs of the previous layer. Finally, the neighbourhood consensus CNN produces a single-channel output with the same dimensions as the 4D input matches.
    * In order for the NCN to be invariant to the particular order of the input images, $(I^A, I^B)$ versus $(I^B, I^A)$, the authors propose to apply the network twice, in the following way:
    $$ \tilde{c} = N(c) + (N(c^T))^T $$
    where $(c^T)_{ijkl} = c_{klij}$ and $N(\cdot)$ is the forward pass of the NCN.
3) **Soft mutual nearest neighbour filtering**
    * Hard mutual nearest neighbour filtering eliminates the majority of matches but is non-differentiable, which makes it unsuitable for an end-to-end trainable approach. The paper therefore proposes a softer version of mutual nearest neighbour filtering, both in the sense of a softer decision and of better differentiability properties.
$$ \hat{c} = M(c), \quad \text{where} \quad \hat{c}_{ijkl} = r_{ijkl}^A \, r_{ijkl}^B \, c_{ijkl}, \\ r_{ijkl}^A = \frac{c_{ijkl}}{\max_{ab} \ c_{abkl}} \quad \text{and} \quad r_{ijkl}^B = \frac{c_{ijkl}}{\max_{cd} \ c_{ijcd}} $$
    * The soft mutual nearest neighbour filtering is used to filter both the correlation map and the output of the NCN.
4) **Extracting correspondences from the correlation map**
    * To obtain correspondences between the two images, two scores are defined from the correlation map by performing a soft-max over the dimensions corresponding to images A and B:
    $$ s_{ijkl}^A = \frac{\exp(c_{ijkl})}{\sum_{ab}\exp(c_{abkl})} \quad \text{and} \quad s_{ijkl}^B = \frac{\exp(c_{ijkl})}{\sum_{cd}\exp(c_{ijcd})} $$
    * These scores can be interpreted as conditional match probabilities:
    $$ P(K = k, L = l \mid I = i, J = j) = s_{ijkl}^B \quad \text{and} \quad P(I = i, J = j \mid K = k, L = l) = s_{ijkl}^A $$
    where $(I, J, K, L)$ are discrete random variables indicating the position of a match and $(i, j, k, l)$ a particular position.
    * The descriptor $f_{kl}^B$ assigned to a given $f_{ij}^A$ is then
    $$ (k, l) = \arg\max_{cd} P(K = c, L = d \mid I = i, J = j) = \arg\max_{cd} \ s_{ijcd}^B $$
    This probabilistic interpretation allows the modelling of match uncertainty.
5) **Weakly-supervised training**
    * The loss function used to train the network requires only a weak level of supervision: each training pair $(I^A, I^B)$ carries a positive label $y = 1$ (matching) or a negative label $y = -1$ (non-matching):
    $$ L(I^A, I^B) = -y\,(\bar{s}^A + \bar{s}^B) $$
    where $\bar{s}^A$ and $\bar{s}^B$ are the mean matching scores over all matches. The loss thus maximizes the scores of matching pairs and minimizes those of non-matching pairs.

## Implementation

1) **Feature extraction**: ResNet-101, up to the conv4_23 layer.
2) **NCNet**: Three layers of $5 \times 5 \times 5 \times 5$ filters and two of $3 \times 3 \times 3 \times 3$. Training feature resolution: $25 \times 25$. Feature extraction resolution: $200 \times 150$. The correlation map is downsampled once.
3) Model trained for 5 epochs with the Adam optimizer, learning rate $5 \times 10^{-4}$. The feature-extraction layer weights are kept fixed.
4) **Category-level matching**: Fine-tuning for 5 epochs at learning rate $1 \times 10^{-5}$.

## Limitations and Conclusion

1. Repetitive patterns combined with large scale changes cause incorrect matches.
2. Quadratic complexity: the 4D correlation map grows quadratically with the number of feature positions.
3. Pixel limit of $1600 \times 1200$.

**Future work**: End-to-end learning for applications such as 3D category-level matching, or visual localization across day/night illumination changes.