# [D2-Net: A Trainable CNN for Joint Description and Detection of Local Features](https://arxiv.org/abs/1905.03561)

#### Authors
- Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, Torsten Sattler

#### CVPR 2019

#### Paper notes - Aniket Gujarathi

### Introduction

* Traditionally, sparse local features have been used for correspondence estimation in computer vision problems.
* These methods follow a detect-then-describe approach, where a set of keypoints is first detected and then described.
* However, such approaches perform poorly under extreme appearance changes (e.g. day vs. night, seasonal changes, etc.).
* A major reason is the lack of repeatability (robustness and invariance to translation, rotation, and illumination changes) of the keypoint detector: the low-level information used by detectors is often affected much more significantly by such changes.
* Approaches that forego the detection stage and instead extract descriptors densely perform much better under such challenging conditions.
* However, this robustness comes at the cost of higher matching times and memory consumption.
* This paper proposes a **detect-and-describe** approach: rather than performing detection early on using low-level features, detection is postponed to a later stage of the pipeline.
* This approach requires less memory than dense methods while performing comparably or even better under challenging conditions.

![](https://i.imgur.com/BgNqlHi.png =250x150)

* The approach uses a single describe-and-detect branch for sparse feature extraction, and is hence able to detect keypoints belonging to higher-level structures while producing locally unique descriptors.

### Joint Detection and Description Pipeline

* As the detector and the descriptor share the same underlying representation, the approach is called D2.

![](https://i.imgur.com/LkkfSQL.png)

* The first step is to apply a CNN to the image $I$ to obtain a 3D tensor $F \in \mathbb{R}^{h \times w \times n}$.

#### Feature Description

* The most straightforward interpretation of the 3D tensor $F$ is as a dense set of descriptor vectors $d$:
  $d_{ij} = F_{ij:}, \quad d_{ij} \in \mathbb{R}^n$
* These descriptor vectors can be readily compared between images to establish correspondences using the Euclidean distance.
* During training, the descriptors are adjusted so that the same point in the scene yields similar descriptors, even under appearance changes.

#### Feature Detection

* A different interpretation of the 3D tensor $F$ is as a collection of 2D detection response maps $D^k$:
  $D^k = F_{::k}, \quad D^k \in \mathbb{R}^{h \times w}$
* These detection response maps are analogous to the Difference-of-Gaussians response maps in SIFT or the cornerness score maps of the Harris corner detector.
* The raw scores are post-processed to select only a subset of locations as keypoints.

#### Hard Feature Detection

* In traditional feature detectors, the detection maps would be sparsified by non-local-maximum suppression. Here, however, there are multiple detection maps $D^k$, and a detection can take place on any of them.
* Therefore, a point $(i, j)$ is a detection if $D_{ij}^k$ is a local maximum in $D^k$, where $k$ is the channel with the maximum response at $(i, j)$, i.e. $k = \arg\max_t D_{ij}^t$.

#### Soft Feature Detection

* During training, hard detection is softened to allow backpropagation.
* First, a soft local-max score is defined:
  $\alpha_{ij}^k = \dfrac{\exp(D_{ij}^k)}{\sum_{(i',j') \in N(i,j)} \exp(D_{i'j'}^k)}$
  where $N(i,j)$ is the set of 9 neighbours of the pixel $(i, j)$ (including itself).
* Then, soft channel selection is performed, computing a ratio-to-max per descriptor that emulates channel-wise non-maximum suppression:
  $\beta_{ij}^k = \dfrac{D_{ij}^k}{\max_t D_{ij}^t}$
* To take both criteria into account, the product of the two scores is maximized across all feature maps to obtain a single score map:
  $\gamma_{ij} = \max_k \left( \alpha_{ij}^k \beta_{ij}^k \right)$
* Finally, the soft detection score $s$ is obtained by image-level normalization:
  $s_{ij} = \gamma_{ij} / \sum_{i',j'} \gamma_{i'j'}$

**This pipeline is not inherently invariant to scale changes, and matching fails under significant changes in viewpoint.**
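To make the describe-and-detect step concrete, below is a minimal PyTorch sketch of the computations above. It is an illustration, not the authors' implementation: the function name `describe_and_detect` is made up, `fmap` is assumed to be the $b \times n \times h \times w$ output of a CNN backbone after a ReLU (so responses are non-negative), and the small epsilon is added only for numerical safety.

```python
import torch
import torch.nn.functional as tF

def describe_and_detect(fmap: torch.Tensor):
    """Dense descriptors and soft detection scores from a feature map F.

    fmap: (b, n, h, w) CNN output (assumed non-negative, e.g. post-ReLU).
    Returns L2-normalized descriptors (b, n, h, w) and soft scores s (b, h, w).
    """
    eps = 1e-8  # guards against division by zero; an assumption, not from the paper

    # Descriptors d_ij: one n-dimensional vector per pixel, L2-normalized so
    # Euclidean distances are comparable across images.
    descriptors = tF.normalize(fmap, p=2, dim=1)

    # Soft local-max score alpha: exp(D^k_ij) normalized over the 3x3
    # neighbourhood N(i, j), computed independently per channel k.
    exp_d = torch.exp(fmap)
    neighbourhood_sum = 9.0 * tF.avg_pool2d(exp_d, kernel_size=3, stride=1, padding=1)
    alpha = exp_d / (neighbourhood_sum + eps)

    # Ratio-to-max score beta: emulates channel-wise non-maximum suppression.
    beta = fmap / (fmap.amax(dim=1, keepdim=True) + eps)

    # Combine both criteria: take the best product alpha * beta over channels...
    gamma = (alpha * beta).amax(dim=1)

    # ...then normalize over the whole image to obtain the soft score s_ij.
    s = gamma / (gamma.sum(dim=(1, 2), keepdim=True) + eps)
    return descriptors, s
```

At test time the soft machinery is dropped in favour of the hard detection of the previous subsection; the soft scores exist to weight the training loss, described next.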
### Jointly Optimizing Detection and Description

#### Training Loss

* As the pipeline uses a single CNN for both detection and description, an appropriate loss function is needed that jointly optimizes both objectives: the detected keypoints need to be repeatable and invariant to changes in lighting conditions, etc., while the descriptors need to be distinctive.
* To this end, an extension of the **triplet margin ranking loss** is used. This loss has previously been used for descriptor learning; in this version it is extended to account for the detector as well.
* Given a pair of images $(I_1, I_2)$ and a correspondence $c: A \leftrightarrow B$ between them, the triplet margin ranking loss aims to minimize the distance between the corresponding descriptors $\hat{d}_A^1$ and $\hat{d}_B^2$, while maximizing the distance to other confusing descriptors $\hat{d}_N^1$ and $\hat{d}_N^2$ that may exist due to similar-looking structures.
* Positive descriptor distance: $p(c) = \|\hat{d}_A^1 - \hat{d}_B^2\|_2$
* Negative distance (hardest negative in either image): $n(c) = \min\left(\|\hat{d}_A^1 - \hat{d}_N^2\|_2, \|\hat{d}_N^1 - \hat{d}_B^2\|_2\right)$
* The triplet margin loss $m(c)$ for a margin $M$ is:
  $m(c) = \max\left(0, M + p(c)^2 - n(c)^2\right)$
* To additionally encourage repeatability of the detections, a detection term (the soft detection scores of each correspondence in both images) is used to weight the margin loss:
  $L(I_1, I_2) = \sum_{c \in C} \dfrac{s_c^1 s_c^2}{\sum_{q \in C} s_q^1 s_q^2} m(p(c), n(c))$
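Below is a PyTorch sketch of this loss, under simplifying assumptions: the hardest negatives are mined only among the descriptors of the other correspondences in the batch (the paper searches over all descriptors outside a small neighbourhood of the correspondence, a detail omitted here), and all names are illustrative.

```python
import torch

def d2_loss(desc1, desc2, s1, s2, margin=1.0):
    """Detection-weighted triplet margin ranking loss (simplified sketch).

    desc1, desc2: (C, n) L2-normalized descriptors at the C correspondences
        in images I_1 and I_2 (row c of each holds the pair d_A^1, d_B^2).
    s1, s2: (C,) soft detection scores s_c at those locations.
    """
    # Positive distance p(c) between corresponding descriptors.
    p = (desc1 - desc2).norm(dim=1)

    # Negative distance n(c): hardest confusing descriptor in either image.
    # Mask the diagonal (the positive pairs) with a large constant before the min.
    dists = torch.cdist(desc1, desc2)
    dists = dists + 1e6 * torch.eye(len(desc1), device=desc1.device)
    n = torch.minimum(dists.min(dim=1).values,   # ||d_A^1 - d_N^2||
                      dists.min(dim=0).values)   # ||d_N^1 - d_B^2||

    # Triplet margin ranking loss m(c) with margin M.
    m = torch.clamp(margin + p ** 2 - n ** 2, min=0.0)

    # Weight each correspondence by the product of its soft detection scores,
    # normalized over all correspondences as in the loss formula above.
    w = s1 * s2
    return (w / w.sum() * m).sum()
```

The weighting couples the two objectives: to reduce the loss, the network must either make the descriptors of a correspondence more distinctive or lower its detection score relative to correspondences that are easier to match.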