Authors: Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic
Notes written by Arihant Gaur and Saurabh Kemekar
Introduction
The paper proposes an end-to-end pipeline for feature detection, description and matching, with the following contributions:
Development of an end-to-end trainable CNN architecture (NCNet) that identifies neighbourhood consensus patterns in the 4D space of all possible correspondences. No global geometric model is needed.
Weakly supervised training from matching and non-matching image pairs, with no manual annotation of correspondences.
Applications to category-level and instance-level matching.
Related Work
Handcrafted Image Descriptors
Examples include SIFT, SURF, FAST and ORB. Image matching can be performed with nearest neighbour search over the descriptors, and Lowe's ratio test can be used to remove ambiguous matches.
Issue: Too many correct matches are discarded, a problem that is most prevalent in repetitive and textureless areas. Illumination invariance is limited.
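The nearest neighbour matching and ratio test described above can be sketched as follows. This is a minimal NumPy illustration and not code from the paper: the function name `ratio_test_matches`, the Euclidean distance and the threshold of 0.8 are assumptions chosen for the example. The toy data also shows the issue noted above: a descriptor with two near-identical candidates (a repetitive structure) is discarded even though one of them may be correct.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Nearest neighbour matching with Lowe's ratio test.

    desc_a: (n_a, d) array, desc_b: (n_b, d) array with n_b >= 2.
    Returns a list of (index_in_a, index_in_b) pairs.
    The ratio of 0.8 is an illustrative default, not a value from the notes.
    """
    # Pairwise Euclidean distances, shape (n_a, n_b).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    # Indices of the two closest descriptors in B for every descriptor in A.
    order = np.argsort(dists, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc_a))
    # Keep a match only if it is clearly better than the runner-up.
    keep = dists[rows, nearest] < ratio * dists[rows, second]
    return [(int(i), int(nearest[i])) for i in rows[keep]]

# Toy descriptors: the second descriptor in A has two equally good
# candidates in B, so the ratio test rejects it as ambiguous.
desc_a = np.array([[0.0, 0.0], [5.0, 5.0]])
desc_b = np.array([[0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
matches = ratio_test_matches(desc_a, desc_b)
```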
Trainable Descriptors
Often paired with a DoG (Difference of Gaussians) keypoint detector to yield sparse descriptors.
Using pre-trained image level CNNs. Recent works perform both detection and description.
Trainable Image Alignment
Use of end-to-end trainable models to produce correspondences. Pairwise feature matches are computed and a CNN estimates geometric transformation parameters from them. Such methods take dense correspondences into account.
Issue: They only estimate low-complexity parametric transformations.
Match filtering by neighbourhood consensus
Inspects the neighbourhood of a potential match for supporting matches. Useful for removing many incorrect matches.
Proposed Approach
This paper combines the robustness of neighbourhood consensus filtering with the power of a trainable neural architecture.
The approach is fully differentiable, so the trainable matching module can be directly combined with strong CNN image descriptors.
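Given an image $I$, a feature extractor produces a dense set of descriptors $\{f^I_{ij}\} \in \mathbb{R}^d$, with indices $i = 1, \dots, h$ and $j = 1, \dots, w$.
Given the dense feature descriptors $f^A$ and $f^B$ of the two images to be matched, the exhaustive pairwise cosine similarities between them are computed and stored in a 4D tensor $c \in \mathbb{R}^{h \times w \times h \times w}$, referred to as the correlation map: $c_{ijkl} = \frac{\langle f^A_{ij}, f^B_{kl} \rangle}{\|f^A_{ij}\|_2 \, \|f^B_{kl}\|_2}$.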
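The correlation map can be computed with a single tensor contraction once the descriptors are L2-normalised. The sketch below is an illustration in NumPy, not the authors' code: the function name `correlation_map`, the small `eps` guard against division by zero, and the choice to allow the two feature maps to have different spatial sizes are all assumptions.

```python
import numpy as np

def correlation_map(feat_a, feat_b, eps=1e-8):
    """Exhaustive pairwise cosine similarity between two dense feature maps.

    feat_a: (h_a, w_a, d) descriptors of image A.
    feat_b: (h_b, w_b, d) descriptors of image B.
    Returns a 4D tensor of shape (h_a, w_a, h_b, w_b).
    """
    # L2-normalise every descriptor so the dot product equals the cosine.
    a = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + eps)
    # Entry [i, j, k, l] is the similarity of A at (i, j) and B at (k, l).
    return np.einsum("ijd,kld->ijkl", a, b)

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(3, 4, 16))
feat_b = rng.normal(size=(2, 5, 16))
c = correlation_map(feat_a, feat_b)
```

Note that the memory cost grows with the product of the two feature map sizes, which is why the 4D tensor is only practical at coarse feature resolutions.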
Neighbourhood consensus network
To further process and filter the matches, the authors propose a 4D convolutional neural network (CNN) for the neighbourhood consensus task.
The first layer of the proposed CNN, which has $N_1$ filters, can specialize in learning different local geometric deformations, producing $N_1$ output channels.
The second layer captures more complex patterns by combining the outputs of the previous layer. Finally, the neighbourhood consensus CNN produces a single-channel output with the same dimensions as the 4D input of matches.
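To make the NCN invariant to the order of the input images, $(I^A, I^B)$ or $(I^B, I^A)$, the authors propose to apply the network twice in the following way: $\tilde{c} = N(c) + (N(c^T))^T$, where $(c^T)_{ijkl} = c_{klij}$ corresponds to matching B to A, and $N(\cdot)$ is the forward pass of the NCN.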
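The symmetric application can be sketched by swapping the A and B axes of the 4D tensor, running the network, and swapping back, i.e. computing N(c) + (N(c^T))^T with (c^T)[i, j, k, l] = c[k, l, i, j]. In the NumPy sketch below, `ncn_forward` is a hypothetical stand-in for the real network: a single naive 4D convolution with zero padding followed by a ReLU, not the multi-layer architecture from the paper. The kernel is deliberately random so that the stand-in on its own is not symmetric.

```python
import itertools
import numpy as np

def conv4d_same(c, kernel):
    """Naive single-channel 4D cross-correlation with zero padding.

    kernel is assumed to have the same odd size k along all four axes.
    """
    k = kernel.shape[0]
    padded = np.pad(c, k // 2)  # zero-pad every axis by k // 2
    h1, w1, h2, w2 = c.shape
    out = np.zeros(c.shape, dtype=float)
    # Accumulate one shifted copy of the input per kernel element.
    for a, b, m, n in itertools.product(range(k), repeat=4):
        out += kernel[a, b, m, n] * padded[a:a + h1, b:b + w1,
                                           m:m + h2, n:n + w2]
    return out

def ncn_forward(c, kernel):
    """Hypothetical stand-in for the NCN forward pass N(.)."""
    return np.maximum(conv4d_same(c, kernel), 0.0)

def swap_images(c):
    """c^T: swap the roles of images A and B."""
    return c.transpose(2, 3, 0, 1)

def symmetric_ncn(c, kernel):
    """Apply the network in both matching directions and sum the results."""
    return ncn_forward(c, kernel) + swap_images(
        ncn_forward(swap_images(c), kernel))

rng = np.random.default_rng(0)
c = rng.random((3, 4, 2, 5))
kernel = rng.normal(size=(3, 3, 3, 3))
out_ab = symmetric_ncn(c, kernel)               # images given as (A, B)
out_ba = symmetric_ncn(swap_images(c), kernel)  # images given as (B, A)
```

Feeding the pair in either order yields the same filtered matches up to the axis swap, which is the invariance the construction is meant to provide.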
Soft mutual nearest neighbour filtering
Hard mutual nearest neighbour filtering eliminates the great majority of candidate matches, which makes it unsuitable for use in an end-to-end trainable approach. The paper therefore proposes a softer version of mutual nearest neighbour filtering, softer both in the sense of the decision it makes and in having better differentiability properties.
The soft mutual nearest neighbour filtering is applied both to the correlation map and to the output of the NCN.
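One way to realise such a soft filter is to scale every score by its ratio to the best score along each matching direction, so that mutual nearest neighbours keep their score and all other matches are down-weighted rather than removed. The NumPy sketch below assumes this ratio-based form and non-negative scores; the function name `soft_mutual_nn` and the `eps` guard are illustrative choices, not taken from the notes.

```python
import numpy as np

def soft_mutual_nn(c, eps=1e-8):
    """Soft mutual nearest neighbour filtering of a 4D match tensor.

    c: (h_a, w_a, h_b, w_b) tensor of non-negative match scores.
    Returns a tensor of the same shape.
    """
    # Best score over all positions in image A, for each position in B.
    best_in_a = c.max(axis=(0, 1), keepdims=True)
    # Best score over all positions in image B, for each position in A.
    best_in_b = c.max(axis=(2, 3), keepdims=True)
    r_a = c / (best_in_a + eps)  # close to 1 when (i, j) is the best match for (k, l)
    r_b = c / (best_in_b + eps)  # close to 1 when (k, l) is the best match for (i, j)
    # Mutual nearest neighbours keep their score; others are attenuated.
    return r_a * r_b * c

# Two positions in each image, flattened to a 2 x 1 grid for readability.
c = np.zeros((2, 1, 2, 1))
c[:, 0, :, 0] = [[0.9, 0.2],
                 [0.3, 0.6]]
filtered = soft_mutual_nn(c)
```

Because the operation uses only ratios and maxima, it stays differentiable almost everywhere and never zeroes out a match completely.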
Extracting correspondences from the correlation map
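To obtain correspondences between the images, two scores are defined from the correlation map by performing a soft-max over the dimensions corresponding to image A and image B, respectively: $s^A_{ijkl} = \frac{\exp(c_{ijkl})}{\sum_{ab} \exp(c_{abkl})}$ and $s^B_{ijkl} = \frac{\exp(c_{ijkl})}{\sum_{cd} \exp(c_{ijcd})}$.
These scores can be read as conditional probabilities, $P(K = k, L = l \mid I = i, J = j) = s^B_{ijkl}$ and $P(I = i, J = j \mid K = k, L = l) = s^A_{ijkl}$, where $(I, J, K, L)$ are discrete random variables indicating the position of a match and $(i, j, k, l)$ is a particular position of a match.
This probabilistic intuition allows the modeling of match uncertainty.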
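The two soft-max scores, and one simple way of turning them into correspondences, can be sketched as follows. This is a NumPy illustration under stated assumptions: the function names are hypothetical, and picking the most probable position in B for every position in A (a hard arg-max assignment) is one possible extraction rule that the notes themselves do not spell out.

```python
import numpy as np

def match_scores(c):
    """Soft-max scores over the A dimensions (s_a) and the B dimensions (s_b).

    c: (h_a, w_a, h_b, w_b) correlation map.
    """
    h_a, w_a, h_b, w_b = c.shape
    flat = c.reshape(h_a * w_a, h_b * w_b)
    # Subtracting the maximum keeps the exponentials numerically stable.
    e_a = np.exp(flat - flat.max(axis=0, keepdims=True))
    s_a = e_a / e_a.sum(axis=0, keepdims=True)  # sums to 1 over positions in A
    e_b = np.exp(flat - flat.max(axis=1, keepdims=True))
    s_b = e_b / e_b.sum(axis=1, keepdims=True)  # sums to 1 over positions in B
    return s_a.reshape(c.shape), s_b.reshape(c.shape)

def best_match_in_b(s_b):
    """For every position (i, j) in A, the most probable position (k, l) in B."""
    h_a, w_a, h_b, w_b = s_b.shape
    idx = s_b.reshape(h_a, w_a, h_b * w_b).argmax(axis=-1)
    k, l = np.unravel_index(idx, (h_b, w_b))
    return np.stack([k, l], axis=-1)  # shape (h_a, w_a, 2)

rng = np.random.default_rng(0)
c = rng.random((3, 4, 2, 5))
c[1, 2, 0, 3] = 5.0  # plant one strong match: A(1, 2) <-> B(0, 3)
s_a, s_b = match_scores(c)
matches = best_match_in_b(s_b)
```

The value of the winning score can be kept alongside each match as a confidence, which is where the probabilistic reading becomes useful in practice.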
Weakly-supervised training
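The loss function used to train the network requires only a weak level of supervision: each training pair $(I^A, I^B)$ carries a label $y$, which is $+1$ for a positive (matching) pair and $-1$ for a negative (non-matching) pair. The loss is $\mathcal{L}(I^A, I^B) = -y\,(\bar{s}^A + \bar{s}^B)$, where $\bar{s}^A$ and $\bar{s}^B$ are the mean matching scores over all matches, one for each matching direction.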
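A loss of this kind can be sketched as the negated, label-signed sum of the mean matching scores in the two directions, so that minimising it raises the scores of positive pairs and lowers those of negative pairs. The NumPy sketch below assumes labels of +1 and -1 and takes the mean of the best soft-max score per position as the "mean matching score"; the function names and these details are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable soft-max along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weak_loss(c, y):
    """Weakly supervised loss for one image pair.

    c: (h_a, w_a, h_b, w_b) filtered match tensor.
    y: +1 for a matching pair, -1 for a non-matching pair (assumed encoding).
    """
    h_a, w_a, h_b, w_b = c.shape
    flat = c.reshape(h_a * w_a, h_b * w_b)
    s_a = softmax(flat, axis=0)  # normalised over positions in A
    s_b = softmax(flat, axis=1)  # normalised over positions in B
    # Mean score of the best match, taken in each matching direction.
    mean_s_a = s_a.max(axis=0).mean()
    mean_s_b = s_b.max(axis=1).mean()
    return -y * (mean_s_a + mean_s_b)

# A confident, one-to-one match tensor versus a completely uninformative one.
peaked = 10.0 * np.eye(4).reshape(2, 2, 2, 2)
flat_scores = np.zeros((2, 2, 2, 2))
loss_peaked = weak_loss(peaked, +1)
loss_flat = weak_loss(flat_scores, +1)
```

For a positive pair the confident tensor gives the lower loss, and flipping the label to -1 flips the sign, which penalises confident matches between images that should not match.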