Repeatable and Reliable Detector and Descriptor with different losses

This blog documents our study of the performance of different loss functions in the context of the work "R2D2: Repeatable and Reliable Detector and Descriptor" by Jérôme Revaud, Philippe Weinzaepfel, César Roberto De Souza, and Martin Humenberger [1].

This blog is made as part of the assignment for Seminar Computer Vision by Deep Learning (CS4245) at TUDelft.

Author: Ranbao Deng (5460069), R.Deng@student.tudelft.nl
Author: Danning Zhao (5459583), D.Zhao-3@student.tudelft.nl

1. Introduction

Finding and describing corresponding keypoints across images is crucial in many fields. Classical approaches are based on a separate two-stage pipeline that first detects keypoints and then computes a local descriptor for each of them. However, descriptors trained on repeatable locations provided by the detector may harm performance in regions that are repeatable but where accurate matching is not possible. The authors argue that detection and description are inseparably tangled and propose to jointly learn descriptor reliability alongside the detection and description processes. Keypoints are then selected at locations that maximize both the repeatability map and the reliability map.

This blog first discusses the definitions of repeatability and reliability, and then the experimental results obtained with different types of loss.

difference repeatability reliability
Figure 1: Toy examples to illustrate the key difference between repeatability (2nd column) and reliability (3rd column) for a given image. Repeatable regions in the first image are only located near the black triangle, however, all patches containing it are equally reliable. In contrast, all squares in the checkerboard pattern are salient hence repeatable, but are not discriminative due to self-similarity.

2. Problem Formulation and Network Architecture

The purpose of R2D2 is to predict a set of sparse locations of an input image \(I\) that are repeatable and reliable for local feature matching.

R2D2 consists of a fully-convolutional network (FCN) that predicts 3 outputs for an image \(I\) of size \(H\times W\). The first output is a 3D tensor \(X\in R^{H×W×D}\) corresponding to a set of dense \(D\)-dimensional descriptors, one per pixel. The second is a saliency/repeatability map \(S \in [0, 1]^{H×W}\) whose goal is to provide sparse yet repeatable keypoint locations; only keypoints at locations corresponding to local maxima within a patch of the saliency map are extracted. The third output is a reliability map \(R \in [0, 1]^{H×W}\) that indicates the estimated reliability of the descriptor \(X_{ij}\), i.e., the probability that it is good for matching, at each pixel \((i, j)\) with \(i\in \{1, \dots, W\}\) and \(j \in \{1, \dots, H\}\).

overview of network
Figure 2: Overview of R2D2 network for jointly learning repeatable and reliable matches.

2.1 Repeatability

The repeatability/saliency map can be understood as the probability that a point is a keypoint according to the learned detector. The authors employ an unsupervised formulation that locally enforces the peakiness of local saliency map patches, inspired by QuadNet.
Previous works showed that keypoint repeatability cannot be tackled by standard supervised learning, so the authors treat repeatability as a self-supervised task. Moreover, the network is trained such that the positions of local maxima in the saliency/repeatability map \(S\) are covariant to natural image transformations like illumination changes or optical flow.
\[ L_{cosim}(I,I',U)=1-\frac{1}{|P|}\sum_{p \in P} cosim(S[p],S'_U[p]) \]
where \(S[p]\in R^{N^2}\) denotes the flattened \(N\times N\) patch \(p\) extracted from \(S\), and likewise for \(S'_U[p]\).
\(I\) and \(I'\) are two images of the same scene, and \(U\in R^{H\times W\times2}\) is the ground-truth correspondence between them, with cosim defined as:
\[ cosim(P,Q)=\frac{\sum_{i=1}^d P_i Q_i}{\sqrt{\sum_{i=1}^d P_i^2} \sqrt{\sum_{i=1}^d Q_i^2}} \]
Note that \(L_{cosim}\) can be minimised trivially by having \(S\) and \(S'_U\) constant. To avoid this, the authors employ a second loss function that aims to maximise the local peakiness of the repeatability map:
\[ L_{peaky}(I)=1-\frac{1}{|P|}\sum_{p \in P}(\mathop{max}_{(i,j)\in p} S_{ij}-\mathop{mean}_{(i,j)\in p} S_{ij}) \]
Finally, the resulting repeatability loss is composed as a weighted sum of the first loss and second loss applied to both images:
\[ L_{rep}(I,I',U)=L_{cosim}(I,I',U)+\frac{1}{2}(L_{peaky}(I)+L_{peaky}(I')) \]
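To make the composition concrete, here is a minimal NumPy sketch of \(L_{rep}\) under our own assumptions (non-overlapping \(N\times N\) patches; the function and variable names are ours, not R2D2's):

```python
import numpy as np

def flat_patches(M, N):
    """Flatten all non-overlapping N x N patches of a 2D map into rows."""
    H, W = M.shape
    M = M[:H - H % N, :W - W % N]  # crop so the map tiles exactly
    return (M.reshape(M.shape[0] // N, N, M.shape[1] // N, N)
             .swapaxes(1, 2).reshape(-1, N * N))

def repeatability_loss(S, S_warped, N=16, eps=1e-8):
    """L_rep = L_cosim + 0.5 * (L_peaky(S) + L_peaky(S_warped)).

    S        : repeatability map of image I, values in [0, 1]
    S_warped : repeatability map of I' warped back onto I via ground truth U
    """
    P, Q = flat_patches(S, N), flat_patches(S_warped, N)
    # L_cosim: 1 - mean cosine similarity between corresponding patches
    cos = (P * Q).sum(1) / (np.linalg.norm(P, axis=1)
                            * np.linalg.norm(Q, axis=1) + eps)
    l_cosim = 1 - cos.mean()

    # L_peaky: reward a sharp local maximum inside every patch
    def l_peaky(M):
        p = flat_patches(M, N)
        return 1 - (p.max(axis=1) - p.mean(axis=1)).mean()

    return l_cosim + 0.5 * (l_peaky(S) + l_peaky(S_warped))
```

For identical maps the cosine term vanishes and only the peakiness terms remain, which is exactly the degenerate case the peaky loss is there to penalise.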

2.2 Reliability

The reliability map gives the reliability of the descriptor at each pixel. For example, keypoints even in well-textured regions are known to be unreliable due to their unstable nature, such as tree foliage or ocean waves. If we choose the descriptors of such pixels, accurate image matching of the corresponding regions becomes difficult. The goal is to let the network learn to make descriptors as discriminative as possible or, conversely, to spare its efforts on uninformative regions like the sky or the ground.
\[ L_{AP}=\frac{1}{B}\sum_{ij}1-\mathop{AP}^\sim (p_{ij}) \]
\[ L_{AP,R}=\frac{1}{B}\sum_{ij}1-\mathop{AP}^\sim (p_{ij})R_{ij}+\kappa(1-R_{ij}) \]
As in previous works, the authors cast descriptor matching as a metric learning problem. They estimate the reliability using Average Precision (AP), a standard ranking metric.

Lap
Figure 3: Visualization of R2D2's loss \(L_{AP,R}\).

As can be seen in Figure 3, when \(AP(p_{ij})\) is large and the reliability \(R_{ij}\) is also large (a reliable keypoint), the loss value is small. When \(AP(p_{ij})\) is small but \(R_{ij}\) is large, the loss value is large. A high \(AP\) indicates that the descriptor is accurate: the distance between the descriptors of two keypoints that match across the two images is smaller than that of mismatched keypoints, so the descriptor should be marked highly reliable. Conversely, a low \(AP\) means that correctly matched descriptors are farther apart than incorrectly matched ones, so the reliability should be small.
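This trade-off can be sketched in a few lines of NumPy (a hedged illustration in our own names; we assume \(\kappa=0.5\), the value reported in the paper):

```python
import numpy as np

def reliability_loss(ap, R, kappa=0.5):
    """L_{AP,R}: per pixel, (1 - AP~) * R + kappa * (1 - R), averaged over
    the batch. `ap` and `R` are arrays of AP scores and reliabilities in [0, 1]."""
    return np.mean((1 - ap) * R + kappa * (1 - R))
```

Per pixel the loss equals \(\kappa + R_{ij}(1 - AP - \kappa)\), so it is minimised by \(R_{ij}=1\) when \(AP > 1-\kappa\) and by \(R_{ij}=0\) otherwise; with \(\kappa=0.5\) the network is pushed to mark a descriptor reliable exactly when its AP exceeds 0.5.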

3. Our methods

R2D2's main contributions include its choices of repeatability and reliability losses. However, the paper does not provide the reasoning behind these choices. We therefore wonder whether other forms or families of losses could improve the network. This is the main task of this blog: trying out different losses and seeing what happens.

3.1 Choices on Repeatability Loss

In the original model, cosine similarity is selected as the indicator of how much the two images resemble each other, and it is the core of the repeatability loss. The similarity loss is calculated and added to the repeatability loss along with the peaky loss.

\[ cosim(P,Q)=\frac{\sum_{i=1}^{d} P_{i} Q_{i}}{\sqrt{\sum_{i=1}^{d} P_{i}^{2}} \sqrt{\sum_{i=1}^{d} Q_{i}^{2}}} \]

According to [3], cosine similarity is a member of the inner product family, whose measures all use the inner product \(P_iQ_i\) in their formulas. We selected two other candidates from the same family: the Jaccard and Dice similarities.

\[ Jaccard(P,Q)=\frac{\sum_{i=1}^d P_i Q_i}{\sum_{i=1}^d P_i^2+\sum_{i=1}^d Q_i^2-\sum_{i=1}^d P_iQ_i} \]

\[ dice(P,Q)=\frac{2\sum_{i=1}^d P_i Q_i}{\sum_{i=1}^d P_i^2+\sum_{i=1}^d Q_i^2} \]

To gain insight into other families of similarity measures, we chose the Czekanowski similarity from the intersection family, whose measures all use the intersection of two pdfs, \(min(P_i, Q_i)\), in their formulas.

\[ Czekanowski(P,Q)=\frac{2\sum_{i=1}^dmin(P_i,Q_i)}{\sum_{i=1}^d (P_i+Q_i)} \]

We also chose the squared-chord similarity from the fidelity (or squared-chord) family, which uses the geometric mean \(\sqrt{P_iQ_i}\) instead of the inner product in its formulas.

\[ squared-chord(P,Q)=\sum_{i=1}^d (\sqrt{P_i}-\sqrt{Q_i})^2 \]

With these similarities, we cover all families of similarity measures listed in [3], except for Shannon's entropy family, which is only used for distances.
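For reference, the five measures can be written compactly in NumPy (our own sketch; the inputs are the flattened patch vectors \(P\) and \(Q\)):

```python
import numpy as np

def cosine(P, Q):
    return (P * Q).sum() / (np.sqrt((P**2).sum()) * np.sqrt((Q**2).sum()))

def jaccard(P, Q):
    pq = (P * Q).sum()
    return pq / ((P**2).sum() + (Q**2).sum() - pq)

def dice(P, Q):
    return 2 * (P * Q).sum() / ((P**2).sum() + (Q**2).sum())

def czekanowski(P, Q):
    return 2 * np.minimum(P, Q).sum() / (P + Q).sum()

def squared_chord(P, Q):
    # Note: as written in [3] this form is a distance (0 for identical
    # inputs), so the loss term uses it directly rather than 1 - similarity.
    return ((np.sqrt(P) - np.sqrt(Q))**2).sum()
```

On identical inputs the first four evaluate to 1 while squared-chord evaluates to 0, reflecting its role as a distance rather than a similarity.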

3.2 Choices on Reliability Loss

We chose to modify the metric learning method used for the reliability loss. Following [2], this loss maximises the distance between the closest positive and closest negative example in the batch, as shown below.

sampling
Figure 4: Proposed sampling procedure. First, patches are described by the current network, then a distance matrix is calculated. The closest non-matching descriptor – shown in red – is selected for each \(a_i\) and \(p_i\) patch from positive pair (green) respectively. Finally, among two negative candidates the hardest one is chosen. All operations are done in a single forward pass.

The anchor descriptors \(a_i\) come from image 1, and the positive descriptors \(p_i\) come from the descriptors of image 2 warped by the ground-truth correspondence \(U\).

We use the (squared) Euclidean distance to define the distance matrix:
\[ D_{ik}=d(a_i,p_k)=\lVert a_i-p_k\rVert^2=\lVert a_i\rVert^2+\lVert p_k\rVert^2-2\,a_i^\top p_k \]
Next, for each matching pair \(a_i\) and \(p_i\), the closest non-matching descriptors, i.e. the \(2^{nd}\) nearest neighbours, are found:
- \(a_i\): anchor descriptor; \(p_i\): positive descriptor.
- \(p_{j_{min}}\): closest non-matching descriptor to \(a_i\), where \(j_{min}=argmin_{j=1..n,j\neq i}\, d(a_i,p_j)\).
- \(a_{k_{min}}\): closest non-matching descriptor to \(p_i\), where \(k_{min}=argmin_{k=1..n,k\neq i}\, d(a_k,p_i)\).

Then from each quadruplet of descriptors \((a_i,p_i,p_{j_{min}},a_{k_{min}})\), a triplet is formed: \((a_i,p_i,p_{j_{min}})\) if \(d(a_i,p_{j_{min}})<d(a_{k_{min}},p_i)\), and \((p_i,a_i,a_{k_{min}})\) otherwise.
The loss is high if the closest non-matching descriptor is nearby, which means the neighbouring descriptors come from an unreliable region such as sky, river or ground. The loss is low if the closest non-matching descriptor is far away, meaning the descriptors come from a reliable region. These \(n\) triplet distances are fed into the triplet margin loss. The goal of the loss is to minimise the distance between matching descriptors while maximising the distance to the closest non-matching descriptor, which makes descriptors as discriminative as possible.
\[ L_{margin triplet}=\frac{1}{n}\sum_{i=1}^{n}\max(0,1+d(a_i,p_i)-\min(d(a_i,p_{j_{min}}),d(a_{k_{min}},p_i))) \]
Local descriptors are extracted at each pixel, but not all locations are equally interesting. In particular, uniform regions lack the distinctiveness necessary for accurate matching, so optimising the descriptors in such image regions can hinder performance. We therefore propose to enhance the triplet margin loss to spare the network from wasting its efforts on undistinctive regions:
\[ L_{margin triplet,R}=\frac{1}{n}\sum_{i=1}^{n}\max(0,1+d(a_i,p_i)-\min(d(a_i,p_{j_{min}}),d(a_{k_{min}},p_i)))\,R_{i} \]
We multiply the triplet margin loss by the reliability. When the distance between non-matching descriptors is very small, i.e. when the neighbourhood lies in a non-informative area, a large reliability value would cause performance to be lost when optimising in this area. We therefore designed the loss such that, when both the triplet loss and the reliability are large, the combined loss is large, following the reliability loss design of R2D2.
At the implementation level, we randomly select the descriptors in an \(N\times N\) area of image 1, the descriptors in the corresponding \(N\times N\) area of image 2 warped by the ground truth \(U\), and the \(N\times N\) reliability values of image 1. We then reshape each descriptor patch into an \(N^2\times 128\) matrix and finally obtain the distance matrix of dimension \(N^2\times N^2\).
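The steps above can be sketched as follows (a simplified NumPy version of the idea; the margin of 1 matches the equations above, while the function and variable names are our own):

```python
import numpy as np

def hardnet_triplet_loss(A, P, R, margin=1.0):
    """HardNet-style triplet margin loss, weighted by reliability.

    A : (n, d) anchor descriptors from image 1 (n = N^2 sampled locations)
    P : (n, d) positive descriptors from image 2, warped by ground truth U
    R : (n,)  reliability values at the sampled locations
    """
    # n x n pairwise distance matrix via ||a-p||^2 = ||a||^2 + ||p||^2 - 2 a.p
    sq = (A**2).sum(1)[:, None] + (P**2).sum(1)[None, :] - 2 * A @ P.T
    D = np.sqrt(np.maximum(sq, 0))

    pos = np.diag(D).copy()          # d(a_i, p_i): the matching pairs
    np.fill_diagonal(D, np.inf)      # exclude matches from negative mining
    hardest_neg = np.minimum(D.min(axis=1),   # min_{j!=i} d(a_i, p_j)
                             D.min(axis=0))   # min_{k!=i} d(a_k, p_i)

    triplet = np.maximum(0, margin + pos - hardest_neg)
    return (triplet * R).mean()      # reliability-weighted, as in Sec. 3.2
```

When descriptors are well separated the hardest negative is far away and the hinge is inactive; when all descriptors collapse, every triplet saturates at the margin.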

4. Experiments, Results and Analysis

In the experiments, we used faster R2D2 as our network architecture, a simplified version of R2D2 obtained by removing certain layers. The baseline is a faster R2D2 model pretrained for 25 epochs on the given dataset (web images, Aachen day-time images, Aachen day-night synthetic pairs, and Aachen optical flow pairs). All our models are trained from this pretrained model for 5 more epochs on the same dataset.

The experiments were conducted using the different similarity measures and the reliability measure listed in the previous sections, two of which were also trained without the pretrained model (tagged f0, i.e. from 0 epochs). HardNet is our choice of reliability measure.

4.1 Performances

The performance of all our test variants is shown below. The metric we use is Mean Matching Accuracy (MMA), the average percentage of correct matches in an image pair over multiple pixel error thresholds [1]. We report a threshold of 3 pixels (MMA@3), as used in the original R2D2 paper; intuitively, this threshold is tolerant enough without being too loose.
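As a concrete definition, here is a hedged sketch of the MMA computation (our own helper for illustration, not the evaluation code we used):

```python
import numpy as np

def mma(errors_per_pair, thresholds=(1, 3, 6, 10)):
    """Mean Matching Accuracy at several pixel thresholds.

    errors_per_pair: list of arrays, each holding the pixel reprojection
    error of every putative match in one image pair.
    """
    return {t: float(np.mean([np.mean(e <= t) for e in errors_per_pair]))
            for t in thresholds}
```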

MMA score
Figure 5: Mean Matching Accuracy (MMA) for different pixel error thresholds. This is the main performance metric.

MMA@3 can also be seen in the table below. MMA@1, MMA@6 and MMA@10 are provided for a more complete picture of each variant's performance.

| Variants | MMA@1 | MMA@3 | MMA@6 | MMA@10 |
|---|---|---|---|---|
| Original (faster R2D2) | 0.278 | 0.700 | 0.837 | 0.865 |
| Original f0 | 0.282 | 0.703 | 0.837 | 0.865 |
| HardNet reliability | 0.281 | 0.704 | 0.838 | 0.865 |
| HardNet reliability f0 | 0.283 | 0.704 | 0.836 | 0.864 |
| HardNet with Dice | 0.279 | 0.705 | 0.840 | 0.868 |
| Czekanowski similarity | 0.285 | 0.709 | 0.840 | 0.866 |
| Dice similarity | 0.289 | 0.715 | 0.846 | 0.872 |
| Jaccard similarity | 0.281 | 0.706 | 0.841 | 0.868 |
| Squared-chord similarity | 0.269 | 0.664 | 0.787 | 0.812 |

Overall, the performance of all variants lies in the MMA@3 range of 0.66-0.72, which confirms that all our choices are valid. With higher thresholds, the differences in performance shrink, except for squared-chord. The squared-chord similarity behaves the worst, with an MMA@3 of 0.66, and its gap to the other variants grows with higher thresholds. This may imply that the squared-chord similarity, or the entire squared-chord family, is not well-suited for this task.

All other variants perform either similarly to the baseline or slightly better. HardNet performs almost the same as the original. Jaccard brings a very slight improvement of around 0.6%. Czekanowski performs 1% better than the original and Dice 1.5% better. This may imply that the inner product family is indeed one of the best choices, and that the intersection family can work better on some occasions.

What we found curious is how the from-0-epoch variants (trained for 5 epochs) behave almost the same as their 30-epoch counterparts (25 pretrained epochs plus 5 trained by us). This may be because the network initialization was already relatively good and fine-tuned, and it may imply that even without convergence of the losses (see the learning curves), R2D2 can still perform rather well. This is good news for practical applications.

The test results show that the inner product and intersection families are well suited for the image matching task, while the squared-chord (fidelity) family is less competitive. This may be because squared-chord uses geometric means, so relatively dissimilar images appear more similar due to this "decrease" of differences, whereas in the inner product family the inner product "enlarges" the possible differences.

4.2 Learning curves

Two sets of learning curves are provided in this section to give more insight into the influence of the different losses on learning.

4.2.1 Learning curves for all variants

The learning curves of these variants are shown below.

Individual losses
Figure 6: Learning curves of individual losses for all variants.

The figure above shows the individual losses (similarity, reliability and peaky loss) over 5 training epochs. Jaccard and Dice perform similarly, which coincides with the fact that they come from the same inner product family. HardNet uses the cosine similarity and thus also behaves similarly to Jaccard and Dice on the similarity loss. The nearly identical reliability loss of HardNet may imply that our choice of reliability measure is not far from the original one.

The squared-chord and Czekanowski similarities behave more differently from the others. Czekanowski has a rather high similarity loss but a low peaky loss, while squared-chord is the exact opposite: a low similarity loss but a high peaky loss. Czekanowski also shows a much more rapid descent than the others.

The reason might be that Czekanowski uses the intersection, i.e. the min, as part of the similarity; its value is usually lower since the minimum is taken, so the loss is larger. A larger similarity loss makes the peaky loss less significant, since the purpose of the peaky loss is to counter the side effects of perfect similarity, i.e. a low similarity loss. By the same logic, squared-chord uses the geometric mean, which as mentioned before "decreases" differences and makes inputs appear more similar, resulting in a higher similarity, i.e. a lower similarity loss, and a rather high peaky loss. The rest are all in the inner product family and behave similarly.

To see the learning curves more clearly, the sum losses are presented below.

Sum losses
Figure 7: Learning curves of sum losses for all variants.

The figure above shows that within 5 epochs, squared-chord and Jaccard may have converged, while Czekanowski could still improve with more epochs. For the rest (Dice and HardNet), the curves may not present enough information; more epochs are needed to conclude whether they have converged.

Squared-chord has the largest absolute loss, and Czekanowski similarly has a larger loss. This coincides with the fact that they come from families other than the inner product family. The rest have similar absolute sum losses.

A possible explanation for why Czekanowski, or the intersection family, shows a more rapid descent is that the pretrained model uses cosine similarity, a member of the inner product family; new metrics from other families therefore start out performing worse at the new epoch, making the descent more significant.

4.2.2 Learning curves for HardNet and original R2D2 trained from scratch

Two variants, HardNet and the original trained from 0 epochs, are also given to provide insight into how many epochs are really needed. Their individual losses are shown below.

Individual losses from 0 epoch
Figure 8: Learning curves of individual losses for variants without a pretrained model (from 0 epochs).

As can be seen in the figure, hardnet_f0 behaves almost the same as original_f0, which coincides with the earlier analysis. A possible reason is that at an early epoch the discrimination between descriptors of different keypoints is small, i.e. they are close together; at this point both \(loss_{AP}\) and \(loss_{hardnet}\) focus on avoiding wasted effort on uninformative regions like the sky or the ground. However, unlike the previous cases, the descent of the losses is more significant. One peculiar fact is that the similarity (cosine) did not change much. This suggests the importance of the peaky and reliability losses; with the similarity alone, the results might be very bad. The sum of these losses is shown below.

Sum losses from 0 epoch
Figure 9: Learning curves of sum losses for variants without a pretrained model (from 0 epochs).

The same conclusion, that HardNet and the original behave almost identically, can be drawn from the figure above.

4.3 Training time

All our experiments were conducted on a laptop with a GTX 1070m GPU, an i7-8750h CPU and 16 GB of RAM. The average time for 1 epoch is around an hour. The 5-epoch times and their error bars are shown below; the error over 5 epochs is computed as \(5\cdot std(times)\), where \(times\) are the per-epoch durations.

Training time
Figure 10: Training time for all variants.

The training times of all variants are almost the same, which is intuitive since every variant follows similar calculations. HardNet with Dice has a peculiarly shorter training time, which might be due to performance fluctuations of our test device, since both Dice and HardNet on their own take almost the same time.

Finally, qualitative results are given to show the image matching performance of the different variants. An enlarged image is provided here.

Qualitative
Figure 11: Qualitative results for all variants.

5. Conclusions

In this blog, we introduced a new and simple architecture for image matching named Repeatable and Reliable Detector and Descriptor (R2D2) by Revaud et al. [1]. The main contribution of that work is its definition of losses for repeatability and reliability.

By comparing the results generated by different losses against the original definitions, we see that the inner product family (cosine, Jaccard and Dice) are indeed good choices. Among them, Dice similarity outperforms the others on this task, with a 1.5% improvement on MMA@3. The intersection family (Czekanowski) is also a good choice, with a slighter improvement. However, the fidelity or squared-chord family (squared-chord) is not well-suited. HardNet behaves similarly to the original reliability loss, both in learning curves and performance. The combination of HardNet and Dice, however, performs worse than the combination of the original reliability and Dice.

A possible explanation is that the squared-chord family uses geometric means, which "decrease" the differences, while the inner product and intersection families use the inner product and the min, which preserve or "enlarge" the differences.

Future work on such comparisons could include more epochs and statistical analysis.

6. Implementation

Our code is based on R2D2 and D2-Net. We focused on the trainer, nets, losses and visualizations.

References

[1] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, M. Humenberger. R2D2: Repeatable and Reliable Detector and Descriptor. 2019.
[2] A. Mishchuk, D. Mishkin, F. Radenovic, J. Matas. Working hard to know your neighbour's margins: Local descriptor learning loss. 2017.
[3] S.-H. Cha. Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. 2007.
