# Notes on [Neighbourhood Consensus Networks](https://arxiv.org/pdf/1810.10510.pdf)
###### tags: `notes` `correspondences` `4D convolution`
##### Authors: Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic
##### Notes written by Arihant Gaur and Saurabh Kemekar
## Introduction
The paper proposes an end-to-end pipeline for feature detection, description and matching, and achieves the following:
1. Development of an end-to-end trainable CNN architecture (NCNet) that identifies neighbourhood consensus patterns in 4D space, with no need for a global geometric model.
2. Training with weak supervision, in the form of matching and non-matching image pairs, with no manual annotation of correspondences.
3. Applications in category and instance level matching.
## Related Work
1) **Handcrafted Image Descriptors**
* For example, SIFT, SURF, FAST and ORB. Image matching can be performed using a nearest-neighbour search, and Lowe's ratio test can be used to remove ambiguous matches.
* Issue: too many correct matches are discarded (especially prevalent in repetitive and textureless areas), and illumination invariance is limited.
2) **Trainable Descriptors**
* Use of DoG to yield sparse descriptors.
* Use of pre-trained image-level CNNs as descriptors. Recent works perform both detection and description.
3) **Trainable Image Alignment**
* Use of end-to-end trainable models to produce correspondences. Pairwise feature matches are computed to estimate geometric transformation parameters using a CNN. Such methods take dense correspondences into account.
* Issue: they only estimate low-complexity parametric transformations.
4) **Match filtering by neighbourhood consensus**
* Inspection of the neighbourhood of a potential match. Useful for removing many incorrect matches.
## Proposed Approach
* This paper combines the robustness of neighbourhood consensus filtering with the power of a trainable neural architecture.
* The matching is formulated in a fully differentiable way, so that the trainable matching module can be directly combined with strong CNN image descriptors.
![](https://i.imgur.com/MIFrbYT.png)
The pipeline has five main components:
1) **Dense feature extraction and matching**
* Given an image $I$, the feature extractor produces a dense set of descriptors $f_{ij}^I$, one per spatial location $(i,j)$.
* Given the dense feature descriptors $f^A$ and $f^B$ of the two images to be matched, the exhaustive pairwise cosine similarity between them is computed and stored in a 4D tensor $c$, referred to as the *correlation map*: $$ c_{ijkl} = \frac{\langle f_{ij}^A, f_{kl}^B \rangle}{\lVert f_{ij}^A \rVert_2 \, \lVert f_{kl}^B \rVert_2} $$
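This step can be sketched in a few lines of NumPy. The function name `correlation_map` and the toy shapes below are illustrative, not from the paper:

```python
import numpy as np

def correlation_map(fA, fB):
    """Exhaustive pairwise cosine similarity between dense descriptors.

    fA, fB: arrays of shape (H, W, D) -- one D-dim descriptor per location.
    Returns c of shape (H, W, H, W) with c[i, j, k, l] = cos(fA[i, j], fB[k, l]).
    """
    # L2-normalise each descriptor so the dot product equals cosine similarity.
    fA = fA / (np.linalg.norm(fA, axis=-1, keepdims=True) + 1e-8)
    fB = fB / (np.linalg.norm(fB, axis=-1, keepdims=True) + 1e-8)
    # Contract over the descriptor dimension for every pair of locations.
    return np.einsum('ijd,kld->ijkl', fA, fB)

# Example: two 4x4 feature maps with 8-dim descriptors.
fA = np.random.rand(4, 4, 8)
fB = np.random.rand(4, 4, 8)
c = correlation_map(fA, fB)
print(c.shape)  # (4, 4, 4, 4)
```

Note that the correlation map grows with the square of the number of feature locations, which is the source of the quadratic complexity mentioned in the limitations.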
2) **Neighbourhood consensus network**
* To further process and filter the matches, the authors propose a 4D convolutional neural network (CNN) for the neighbourhood consensus task.
![](https://i.imgur.com/70cveHV.png)
* The first layer of the proposed CNN has $N_1$ filters that can specialize in learning different local geometric deformations, producing $N_1$ output channels.
* The second layer captures more complex patterns by combining the outputs from the previous layer. Finally, the neighbourhood consensus CNN produces a single-channel output with the same dimensions as the 4D input matches.
* In order for the NCN to be invariant to the particular order of the input images, $(I^A,I^B)$ or $(I^B,I^A)$, the authors propose to apply the network twice, in the following way: $$ \tilde{c} = N(c) + (N(c^T))^T $$ where $(c^T)_{ijkl} = c_{klij}$ and $N(\cdot)$ is the forward pass of the NCN.
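The two ideas above can be sketched together: a toy single-channel 4D filter as a stand-in for the learned NCN layers, and the symmetric application that makes the output order-invariant. `conv4d` is a naive, loop-based illustration, not the paper's implementation (which uses learned multi-channel filters with padding):

```python
import numpy as np

def conv4d(x, w):
    """Naive 'valid' 4D filtering of a single-channel match tensor.

    x: (H, W, H2, W2) correlation map; w: (k, k, k, k) filter.
    """
    k = w.shape[0]
    out_shape = tuple(s - k + 1 for s in x.shape)
    out = np.zeros(out_shape)
    for i in range(out_shape[0]):
        for j in range(out_shape[1]):
            for p in range(out_shape[2]):
                for q in range(out_shape[3]):
                    out[i, j, p, q] = np.sum(x[i:i+k, j:j+k, p:p+k, q:q+k] * w)
    return out

def symmetrize(N, c):
    """Apply the NCN N twice so the result is invariant to image order:
    N(c) + (N(c^T))^T, where (c^T)_{ijkl} = c_{klij}."""
    T = lambda t: np.transpose(t, (2, 3, 0, 1))  # swap the two image axes
    return N(c) + T(N(T(c)))

w = np.ones((3, 3, 3, 3)) / 81.0  # simple averaging filter (3^4 = 81 taps)
N = lambda x: conv4d(x, w)
c = np.random.rand(5, 5, 5, 5)
out_AB = symmetrize(N, c)
out_BA = symmetrize(N, np.transpose(c, (2, 3, 0, 1)))
# Swapping the input images only transposes the filtered output.
assert np.allclose(out_BA, np.transpose(out_AB, (2, 3, 0, 1)))
```

The final assertion holds for *any* function `N`, which is exactly why the symmetric formulation guarantees order invariance without constraining the network itself.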
3) **Soft mutual nearest neighbour filtering**
* Hard mutual nearest-neighbour filtering eliminates the majority of matches but is not differentiable, which makes it unsuitable for use in an end-to-end trainable approach. The paper therefore proposes a softer version of mutual nearest-neighbour filtering, both in the sense of softer decisions and of better differentiability properties: $$ \hat{c} = M(c), \quad \text{where} \quad \hat{c}_{ijkl} = r_{ijkl}^A \, r_{ijkl}^B \, c_{ijkl}, \\ r_{ijkl}^A = \frac{c_{ijkl}}{\max_{ab} c_{abkl}} \quad \text{and} \quad r_{ijkl}^B = \frac{c_{ijkl}}{\max_{cd} c_{ijcd}} $$
* The soft mutual nearest neighbour filtering is used to filter both the correlation map and output of NCN.
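A minimal NumPy sketch of the soft filtering, with the ratio maxima taken over the two images' spatial axes (the `eps` guard is an implementation detail added here, not from the paper):

```python
import numpy as np

def soft_mutual_nn(c, eps=1e-8):
    """Soft mutual nearest-neighbour filtering of a 4D correlation map.

    Each score c[i,j,k,l] is down-weighted by its ratio to the best score
    along each image's dimensions: r^A maxes over (i,j), r^B over (k,l).
    """
    rA = c / (c.max(axis=(0, 1), keepdims=True) + eps)  # max over a, b
    rB = c / (c.max(axis=(2, 3), keepdims=True) + eps)  # max over c, d
    return rA * rB * c

c = np.random.rand(4, 4, 4, 4)
c_hat = soft_mutual_nn(c)
```

Scores that are mutual maxima keep (almost) their full value, while one-sided or ambiguous matches are smoothly suppressed, so gradients still flow through every entry.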
4) **Extracting correspondences from the correlation map**
* To obtain correspondences between the two images, two scores are defined from the correlation map, by performing a soft-max over the dimensions corresponding to images A and B: $$ s_{ijkl}^A = \frac{\exp(c_{ijkl})}{\sum_{ab} \exp(c_{abkl})} \quad \text{and} \quad s_{ijkl}^B = \frac{\exp(c_{ijkl})}{\sum_{cd} \exp(c_{ijcd})} $$
* $$ P(K = k, L = l \mid I = i, J = j) = s_{ijkl}^B \quad \text{and} \quad P(I = i, J = j \mid K = k, L = l) = s_{ijkl}^A $$ where $(I,J,K,L)$ are discrete random variables indicating the position of a match and $(i,j,k,l)$ a particular position.
* The hard assignment of a descriptor $f_{kl}^B$ to a given $f_{ij}^A$ is then $$ (k,l) = \arg \max_{cd} P(K = c, L = d \mid I = i, J = j) = \arg \max_{cd} s_{ijcd}^B $$ This probabilistic interpretation allows the modelling of match uncertainty.
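A NumPy sketch of the two soft-max scores and the hard assignment (the global-max subtraction is a standard numerical-stability trick added here, not from the paper):

```python
import numpy as np

def match_scores(c):
    """Soft-max the correlation map over each image's spatial dimensions."""
    e = np.exp(c - c.max())                     # stabilised exponentials
    sB = e / e.sum(axis=(2, 3), keepdims=True)  # P(K, L | I, J)
    sA = e / e.sum(axis=(0, 1), keepdims=True)  # P(I, J | K, L)
    return sA, sB

def hard_assignment(sB, i, j):
    """Most probable match (k, l) in image B for position (i, j) in A."""
    return np.unravel_index(np.argmax(sB[i, j]), sB[i, j].shape)

c = np.random.rand(4, 4, 4, 4)
sA, sB = match_scores(c)
k, l = hard_assignment(sB, 0, 0)
```

Because the soft-max is monotone, the hard assignment agrees with a plain arg-max over the raw correlation slice, but the scores themselves give a calibrated notion of match confidence.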
5) **Weakly-supervised training**
* The loss function used to train the network requires only a weak level of supervision: each training pair $(I^A,I^B)$ has either a positive label $y = 1$ (matching) or a negative label $y = -1$ (non-matching). $$ L(I^A,I^B) = -y\,(\bar{s}^A + \bar{s}^B) $$ where $\bar{s}^A$ and $\bar{s}^B$ are the mean matching scores over all hard-assigned matches in the two match directions.
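A sketch of the weakly-supervised loss under the reading above, i.e. the mean over each source position of its best (hard-assigned) match score; function and variable names are illustrative:

```python
import numpy as np

def weak_loss(sA, sB, y):
    """Weakly-supervised loss -y * (mean score in A + mean score in B).

    y = +1 for a matching image pair, y = -1 for a non-matching pair, so
    training maximises match scores for positives and minimises them for
    negatives using only a per-pair label, not point correspondences.
    """
    s_bar_B = sB.max(axis=(2, 3)).mean()  # best match score for each (i, j)
    s_bar_A = sA.max(axis=(0, 1)).mean()  # best match score for each (k, l)
    return -y * (s_bar_A + s_bar_B)

# Toy normalised score tensors standing in for the soft-max outputs.
sA = np.random.rand(4, 4, 4, 4)
sA /= sA.sum(axis=(0, 1), keepdims=True)
sB = np.random.rand(4, 4, 4, 4)
sB /= sB.sum(axis=(2, 3), keepdims=True)
loss_pos = weak_loss(sA, sB, y=+1)
loss_neg = weak_loss(sA, sB, y=-1)
```

Since the scores are positive, a matching pair yields a negative loss (to be driven further down) and a non-matching pair the exact opposite value.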
## Implementation
1) **Feature extraction**: ResNet-101 up to the conv4_23 layer.
2) **NCNet**: Three layers of $5 \times 5 \times 5 \times 5$ filters and two of $3 \times 3 \times 3 \times 3$. Training feature resolution: $25 \times 25$. Feature extraction resolution: $200 \times 150$. Correlation map downsampled once.
3) Model trained for 5 epochs using Adam Optimizer. Learning rate: $5 \times 10^{-4}$. Feature extraction layer weights are fixed.
4) **Category level matching**: Finetuning for 5 epochs at learning rate $1 \times 10^{-5}$.
## Limitations and Conclusion
1. Repetitive patterns combined with large scale changes cause incorrect matches.
2. Quadratic complexity in the number of feature locations.
3. Pixel limit of $1600 \times 1200$.
**Future work**: End-to-end learning for applications in 3D, category-level matching, or visual localization across day/night illumination.