# Notes on [Neighbourhood Consensus Networks](https://arxiv.org/pdf/1810.10510.pdf)

###### tags: `notes` `correspondences` `4D convolution`

##### Authors: Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic

##### Notes written by Arihant Gaur and Saurabh Kemekar

## Introduction

The paper proposes an end-to-end pipeline for feature detection, description and matching, with the following contributions:

1. An end-to-end trainable CNN architecture (NCNet) that identifies neighbourhood consensus patterns in the 4D space of all possible correspondences, with no need for a global geometric model.
2. Training with weak supervision, in the form of matching and non-matching image pairs, with no manual annotation of correspondences.
3. Applications to both category-level and instance-level matching.

## Related Work

1) **Handcrafted image descriptors**
    * For example, SIFT, SURF, FAST and ORB. Image matching can be performed with nearest-neighbour search, and Lowe's ratio test can be used to remove ambiguous matches.
    * Issue: too many correct matches are discarded (especially in repetitive and textureless areas), and illumination invariance is limited.
2) **Trainable descriptors**
    * Use of DoG keypoints to yield sparse descriptors, or of pre-trained image-level CNNs. Recent works perform both detection and description.
3) **Trainable image alignment**
    * End-to-end trainable models that produce correspondences: pairwise feature matches are computed, and a CNN estimates the parameters of a geometric transformation from dense correspondences.
    * Issue: they only estimate low-complexity parametric transformations.
4) **Match filtering by neighbourhood consensus**
    * Inspect the neighbourhood of each potential match; useful for removing many incorrect matches.

## Proposed Approach

* The paper combines the robustness of neighbourhood consensus filtering with the power of a trainable neural architecture.
* The matching module is fully differentiable, so it can be combined directly with strong CNN image descriptors and trained end-to-end.

![](https://i.imgur.com/MIFrbYT.png)

There are mainly 5 components:

1) **Dense feature extraction and matching**
    * Given an image $I$, the feature extractor produces a dense set of descriptors $f_{ij}^I$.
    * Given the dense feature descriptors $f^A$ and $f^B$ of the two images to be matched, the exhaustive pairwise cosine similarity between them is computed and stored in a 4-D tensor $c$ referred to as the *correlation map*:
    $$ c_{ijkl} = \frac{\langle f_{ij}^A,\, f_{kl}^B\rangle}{\lVert f_{ij}^A \rVert_2 \, \lVert f_{kl}^B \rVert_2} $$
2) **Neighbourhood consensus network**
    * To further process and filter the matches, the authors propose a 4D convolutional neural network (CNN) for the neighbourhood consensus task.
    ![](https://i.imgur.com/70cveHV.png)
    * The first layer of the proposed CNN, which has $N_1$ filters, can specialize in learning different local geometric deformations, producing $N_1$ output channels.
    * The second layer captures more complex patterns by combining the outputs of the previous layer. Finally, the neighbourhood consensus CNN produces a single-channel output with the same dimensions as the 4D input matches.
    * In order for the NCN to be invariant to the particular order of the input images, $(I^A, I^B)$ versus $(I^B, I^A)$, the authors propose to apply the network twice, in the following way:
    $$ \tilde{c} = N(c) + (N(c^T))^T $$
    where $(c^T)_{ijkl} = c_{klij}$ and $N(\cdot)$ is the forward pass of the NCN.
3) **Soft mutual nearest neighbour filtering**
    * Hard mutual nearest neighbour filtering eliminates the majority of matches but is non-differentiable, which makes it unsuitable for an end-to-end trainable approach. The paper therefore proposes a softer version of mutual nearest neighbour filtering, both in the sense of a softer decision and of better differentiability properties.
$$ \hat{c} = M(c), \quad \text{where} \quad \hat{c}_{ijkl} = r_{ijkl}^A \, r_{ijkl}^B \, c_{ijkl}, \\ r_{ijkl}^A = \frac{c_{ijkl}}{\max_{ab} \ c_{abkl}} \quad \text{and} \quad r_{ijkl}^B = \frac{c_{ijkl}}{\max_{cd} \ c_{ijcd}} $$
    * The soft mutual nearest neighbour filtering is used to filter both the correlation map and the output of the NCN.
4) **Extracting correspondences from the correlation map**
    * To obtain correspondences between the two images, two scores are defined from the correlation map by performing a soft-max over the dimensions corresponding to images A and B:
    $$ s_{ijkl}^A = \frac{\exp(c_{ijkl})}{\sum_{ab}\exp(c_{abkl})} \quad \text{and} \quad s_{ijkl}^B = \frac{\exp(c_{ijkl})}{\sum_{cd}\exp(c_{ijcd})} $$
    * These scores can be interpreted as conditional match probabilities:
    $$ P(K = k, L = l \mid I = i, J = j) = s_{ijkl}^B \quad \text{and} \quad P(I = i, J = j \mid K = k, L = l) = s_{ijkl}^A $$
    where $(I, J, K, L)$ are discrete random variables indicating the position of a match and $(i, j, k, l)$ a particular position.
    * The descriptor $f_{kl}^B$ assigned to a given $f_{ij}^A$ is then
    $$ (k, l) = \arg\max_{cd} P(K = c, L = d \mid I = i, J = j) = \arg\max_{cd} \ s_{ijcd}^B $$
    This probabilistic interpretation allows the modelling of match uncertainty.
5) **Weakly-supervised training**
    * The loss function used to train the network requires only a weak level of supervision: each training pair $(I^A, I^B)$ carries a positive label $y = 1$ (matching) or a negative label $y = -1$ (non-matching):
    $$ L(I^A, I^B) = -y\,(\bar{s}^A + \bar{s}^B) $$
    where $\bar{s}^A$ and $\bar{s}^B$ are the mean matching scores over all matches. The loss thus maximizes the scores of matching pairs and minimizes those of non-matching pairs.

## Implementation

1) **Feature extraction**: ResNet-101, up to the conv4_23 layer.
2) **NCNet**: Three layers of $5 \times 5 \times 5 \times 5$ filters and two of $3 \times 3 \times 3 \times 3$. Training feature resolution: $25 \times 25$. Feature extraction resolution: $200 \times 150$. The correlation map is downsampled once.
3) Model trained for 5 epochs with the Adam optimizer, learning rate $5 \times 10^{-4}$. The feature-extraction layer weights are kept fixed.
4) **Category-level matching**: Fine-tuning for 5 epochs at learning rate $1 \times 10^{-5}$.

## Limitations and Conclusion

1. Repetitive patterns combined with large scale changes cause incorrect matches.
2. Quadratic complexity: the 4D correlation map grows quadratically with the number of feature positions.
3. Pixel limit of $1600 \times 1200$.

**Future work**: End-to-end learning for applications such as 3D category-level matching, or visual localization across day/night illumination changes.