
Notes on Neighbourhood Consensus Networks

tags: notes correspondences 4D convolution
Authors: Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic
Notes written by Arihant Gaur and Saurabh Kemekar

Introduction

The paper proposes an end-to-end pipeline for feature detection, description and matching, and achieves the following:

  1. Development of an end-to-end CNN architecture (NCNet) to identify neighbourhood consensus patterns in 4D space, with no need for a global geometric model.
  2. Weakly supervised training from matching and non-matching image pairs, with no manual annotation.
  3. Applications in category-level and instance-level matching.

Related Work

  1. Handcrafted Image Descriptors
    • For example SIFT, SURF, FAST and ORB. Image matching can be performed with nearest-neighbour search, and Lowe's ratio test can remove ambiguous matches.
    • Issue: too many correct matches are discarded (especially in repetitive and textureless areas), and illumination invariance is limited.
  2. Trainable Descriptors
    • Use of DoG keypoints to yield sparse descriptors.
    • Use of pre-trained image-level CNNs; recent works perform both detection and description.
  3. Trainable Image Alignment
    • Use of end-to-end trainable models to produce correspondences: pairwise feature matches are computed and a CNN estimates the geometric transformation parameters. Such methods take dense correspondences into account.
    • Issue: they can only estimate low-complexity parametric transformations.
  4. Match filtering by neighbourhood consensus
    • Inspection of the neighbourhood of a potential match; useful for removing many incorrect matches.

Proposed Approach

  • This paper combines the robustness of neighbourhood consensus filtering with the power of trainable neural architectures.
  • The matching module is fully differentiable, so it can be directly combined with strong CNN image descriptors.

    There are five main components:
  1. Dense feature extraction and matching

    • Given an image $I$, the feature extractor produces a dense set of descriptors $f^I_{ij}$.
    • Given the dense descriptors $f^A$ and $f^B$ of the two images to be matched, the exhaustive pairwise cosine similarity between them is computed and stored in a 4-D tensor $c$, referred to as the correlation map:
      $$c_{ijkl} = \frac{\langle f^A_{ij}, f^B_{kl} \rangle}{\|f^A_{ij}\|_2 \, \|f^B_{kl}\|_2}$$
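    The correlation map above can be sketched in a few lines of NumPy; the descriptor-map shape `(h, w, d)` is an assumption for illustration, not the paper's exact layout:

    ```python
    import numpy as np

    def correlation_map(f_a, f_b, eps=1e-8):
        """Exhaustive pairwise cosine similarity between two dense descriptor maps.

        f_a, f_b: arrays of shape (h, w, d), one d-dimensional descriptor per
        spatial location. Returns the 4-D correlation map c with
        c[i, j, k, l] = cosine(f_a[i, j], f_b[k, l]).
        """
        # L2-normalise every descriptor so plain dot products become cosines
        f_a = f_a / (np.linalg.norm(f_a, axis=-1, keepdims=True) + eps)
        f_b = f_b / (np.linalg.norm(f_b, axis=-1, keepdims=True) + eps)
        return np.einsum("ijd,kld->ijkl", f_a, f_b)
    ```

    Note that the output grows with the product of the two image resolutions, which is the source of the quadratic complexity mentioned in the limitations.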
  2. Neighbourhood consensus network

    • To further process and filter the matches, the authors propose a 4D convolutional neural network (CNN) for the neighbourhood consensus task.
    • The first layer of the proposed CNN, which has $N_1$ filters, can specialize in learning different local geometric deformations, producing $N_1$ output channels.
    • The second layer captures more complex patterns by combining the outputs of the first. Finally, the neighbourhood consensus CNN produces a single-channel output with the same dimensions as the 4D input matches.
    • In order for the NCN to be invariant to the particular order of the input images, $(I^A, I^B)$ or $(I^B, I^A)$, the authors propose to apply the network twice:
      $$\tilde{c} = N(c) + (N(c^T))^T$$
      where $(c^T)_{ijkl} = c_{klij}$ and $N(\cdot)$ is the forward pass of the NCN.
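    The symmetrized application can be sketched in NumPy with a stand-in for the NCN forward pass (the real $N$ is the 4-D CNN; here any function of the tensor works, which is exactly the point of the construction):

    ```python
    import numpy as np

    def transpose4d(c):
        """Exchange the roles of the two images: (c^T)_ijkl = c_klij."""
        return np.transpose(c, (2, 3, 0, 1))

    def symmetric_ncn(c, N):
        """Apply N so the result is equivariant to swapping the input images:
        c_tilde = N(c) + (N(c^T))^T."""
        return N(c) + transpose4d(N(transpose4d(c)))
    ```

    For any $N$, swapping the two images only transposes the output, so no image order is privileged.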
  3. Soft mutual nearest neighbour filtering

    • Hard mutual nearest-neighbour filtering eliminates the majority of matches with a non-differentiable, hard decision, which makes it unsuitable for an end-to-end trainable approach. The paper therefore proposes a softer version of mutual nearest-neighbour filtering, both in the sense of a softer decision and of better differentiability properties:
      $$\hat{c} = M(c), \quad \hat{c}_{ijkl} = r^A_{ijkl} \, r^B_{ijkl} \, c_{ijkl}, \quad r^A_{ijkl} = \frac{c_{ijkl}}{\max_{ab} c_{abkl}}, \quad r^B_{ijkl} = \frac{c_{ijkl}}{\max_{cd} c_{ijcd}}$$
    • The soft mutual nearest-neighbour filtering is used to filter both the correlation map and the output of the NCN.
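    A minimal NumPy sketch of the soft mutual nearest-neighbour filter, assuming non-negative correlation scores:

    ```python
    import numpy as np

    def soft_mutual_nn_filter(c, eps=1e-8):
        """Soft mutual nearest-neighbour filtering of a 4-D correlation map.

        c[i, j, k, l] scores the match between position (i, j) in image A and
        (k, l) in image B. Each score is re-weighted by its ratio to the best
        score in both matching directions, so only near-mutual best matches
        keep a high value.
        """
        r_a = c / (c.max(axis=(0, 1), keepdims=True) + eps)  # ratio over image-A positions
        r_b = c / (c.max(axis=(2, 3), keepdims=True) + eps)  # ratio over image-B positions
        return r_a * r_b * c
    ```

    A match that is the best in both directions keeps (almost) its full score, while a one-sided best match is quadratically down-weighted.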
  4. Extracting correspondences from the correlation map

    • To obtain correspondences between the images, two scores are defined from the correlation map by performing a soft-max over the dimensions corresponding to images A and B:
      $$s^A_{ijkl} = \frac{\exp(c_{ijkl})}{\sum_{ab} \exp(c_{abkl})} \quad \text{and} \quad s^B_{ijkl} = \frac{\exp(c_{ijkl})}{\sum_{cd} \exp(c_{ijcd})}$$
    • These scores can be interpreted as match probabilities:
      $$P(K{=}k, L{=}l \mid I{=}i, J{=}j) = s^B_{ijkl} \quad \text{and} \quad P(I{=}i, J{=}j \mid K{=}k, L{=}l) = s^A_{ijkl}$$
      where $(I, J, K, L)$ are discrete random variables indicating the position of a match and $(i, j, k, l)$ a particular position.
    • The descriptor $f^B_{kl}$ assigned to a given $f^A_{ij}$ is obtained as $(k, l) = \arg\max_{cd} P(K{=}c, L{=}d \mid I{=}i, J{=}j) = \arg\max_{cd} s^B_{ijcd}$. This probabilistic interpretation allows the modelling of match uncertainty.
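    The $s^B$ soft-max and the hard assignment can be sketched as follows (function names are illustrative, not from the paper's code):

    ```python
    import numpy as np

    def softmax_scores_b(c):
        """s^B: per-(i, j) softmax over all positions (k, l) of image B."""
        flat = c.reshape(c.shape[0], c.shape[1], -1)
        e = np.exp(flat - flat.max(axis=-1, keepdims=True))  # numerically stable
        return (e / e.sum(axis=-1, keepdims=True)).reshape(c.shape)

    def hard_assignments(c):
        """For every (i, j) in image A, the most probable match (k, l) in image B."""
        s_b = softmax_scores_b(c)
        flat_idx = s_b.reshape(s_b.shape[0], s_b.shape[1], -1).argmax(axis=-1)
        return np.stack(np.unravel_index(flat_idx, c.shape[2:]), axis=-1)
    ```

    Since soft-max is monotone, the hard assignment coincides with the raw arg-max of the correlation map; the soft-max scores additionally provide the match-uncertainty values.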
  5. Weakly-supervised training

    • The loss function used to train the network requires only a weak level of supervision: each training pair $(I^A, I^B)$ carries a positive label $y = 1$ (matching pair) or a negative label $y = -1$ (non-matching pair):
      $$L(I^A, I^B) = -y(\bar{s}^A + \bar{s}^B)$$
      where $\bar{s}^A$ and $\bar{s}^B$ are the mean matching scores over all matches, in each direction.
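    A simplified sketch of this loss, using the best filtered score per position as a stand-in for the paper's soft-max matching scores:

    ```python
    import numpy as np

    def weak_loss(c_filtered, y):
        """Weakly-supervised loss for one image pair with label y in {+1, -1}.

        Stand-in matching scores: the best filtered score per position,
        averaged in each matching direction. Minimising the loss raises the
        scores of positive pairs and lowers those of negative pairs.
        """
        s_b = c_filtered.max(axis=(2, 3)).mean()  # mean best score, A -> B
        s_a = c_filtered.max(axis=(0, 1)).mean()  # mean best score, B -> A
        return -y * (s_a + s_b)
    ```

    Because no pixel-level ground truth enters the loss, only pair-level labels are needed, which is what makes the training weakly supervised.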

Implementation

  1. Feature extraction: ResNet-101 up to the conv4_23 layer.
  2. NCNet: three layers of $5 \times 5 \times 5 \times 5$ filters and two of $3 \times 3 \times 3 \times 3$. Training feature resolution: $25 \times 25$. Feature extraction resolution: $200 \times 150$. The correlation map is downsampled once.
  3. The model is trained for 5 epochs using the Adam optimizer with learning rate $5 \times 10^{-4}$. Feature-extraction layer weights are kept fixed.
  4. Category-level matching: fine-tuning for 5 epochs at learning rate $1 \times 10^{-5}$.

Limitations and Conclusion

  1. Repetitive patterns combined with large scale changes cause incorrect matches.
  2. Quadratic complexity of the 4D correlation map in the number of feature positions.
  3. Input image size is limited to $1600 \times 1200$ pixels.

Future work: end-to-end learning for applications in 3D, category-level matching, or visual localization across day/night illumination.