# Progress Summary
> This shared progress document records what has been done so far and what is planned for the future. It will be updated weekly by Yanan, Yuanyuan and Qi. You can leave comments and feedback by selecting the text so that we can communicate better.
[toc]
## Revised Pipeline
## Local Feature Extraction and Aggregation ([@Yanan](https://hackmd.io/?nav=overview))
Yanan explored the performance of CNN image retrieval models; both feature extraction and feature aggregation are involved.
### Aggregation Methods (Off the shelf)
#### 1. GeM (Generalized Mean Pooling)
[Fine-tuning CNN Image Retrieval with No Human Annotation](https://arxiv.org/pdf/1711.02512.pdf)
$$
\mathbf{f}^{(g)}=\left[\mathrm{f}_{1}^{(g)} \ldots \mathrm{f}_{k}^{(g)} \ldots \mathrm{f}_{K}^{(g)}\right]^{\top}
\quad \mathrm{f}_{k}^{(g)}=\left(\frac{1}{\left|\mathcal{X}_{k}\right|} \sum_{x \in \mathcal{X}_{k}} x^{p_{k}}\right)^{\frac{1}{p_{k}}}
$$
* The pooling parameter $p_k$ can be set manually or learned, since the operation is differentiable and can be part of back-propagation.
* Pooling synthesizes the features of many small regions into a single feature for that channel, so comparing two images on that channel is also a comparison of small local regions.
* MAC similarity implicitly forms patch correspondences. The strength of each correspondence is given by the product of the associated descriptor components.
* **GeM pooling outperforms traditional average and max pooling, and learning a single shared $p$ outperforms learning a separate $p_k$ per channel**
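For reference, a minimal PyTorch sketch of GeM pooling (the initial $p=3$ and the clamp epsilon are common choices, not taken from these experiments):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized Mean pooling: p=1 recovers average pooling,
    p -> infinity approaches max pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        # a single shared p, which the paper reports works better
        # than a separate p_k per channel
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):                      # x: (B, K, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.avg_pool2d(x, (x.size(-2), x.size(-1)))
        return x.pow(1.0 / self.p).flatten(1)  # (B, K) descriptor
```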
#### 2. SPoC
[Multilayer Convolutional Feature Aggregation Algorithm for Image Retrieval](https://www.hindawi.com/journals/mpe/2019/9794202/)
* Based on the aggregation of raw deep convolutional features without embedding.
* **Sum pooling**
The construction of the SPoC descriptor starts with the sum pooling of the deep features:
$$
\begin{gathered}
\psi_{1}(I)=\sum_{y=1}^{H} \sum_{x=1}^{W} f_{(x, y)} \\
\operatorname{sim}\left(I_{1}, I_{2}\right)=\left\langle\psi\left(I_{1}\right), \psi\left(I_{2}\right)\right\rangle=\sum_{f_{i} \in I_{1}} \sum_{f_{j} \in I_{2}}\left\langle f_{i}, f_{j}\right\rangle
\end{gathered}
$$
* **Centering prior**
The SPoC descriptor can be modified to incorporate a centering prior via a simple weighting heuristic that assigns larger weights to features from the center of the feature map stack.
* **Post-processing**
The obtained representation $\psi(I)$ is subsequently l2-normalized, then PCA compression and whitening are performed.
* **PCA dimensionality reduction applied to SPoC features can improve retrieval results. Sum pooling is more effective than max pooling for deep convolutional features**
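To make the steps concrete, a minimal NumPy sketch of SPoC (the Gaussian bandwidth for the centering prior is an illustrative assumption; PCA compression and whitening are omitted):

```python
import numpy as np

def spoc(fmap, center_prior=True):
    """fmap: (K, H, W) deep convolutional feature map."""
    K, H, W = fmap.shape
    if center_prior:
        ys, xs = np.mgrid[0:H, 0:W]
        sigma = min(H, W) / 3.0  # assumed bandwidth
        w = np.exp(-((ys - (H - 1) / 2) ** 2
                     + (xs - (W - 1) / 2) ** 2) / (2 * sigma ** 2))
        fmap = fmap * w          # larger weights near the center
    psi = fmap.sum(axis=(1, 2))  # sum pooling over all locations -> (K,)
    return psi / np.linalg.norm(psi)  # l2-normalize
```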
#### 3. R-MAC
[Particular object retrieval with integral max-pooling of CNN activations](http://arxiv.org/abs/1511.05879)

* Build compact feature vectors by aggregating the maximum activation over multiple spatial regions.
* We consider a set of $R$ square regions of different sizes, defined on the CNN response maps rather than on the original image and sampled at $L$ different scales. At the largest scale ($l = 1$) the region size is as large as possible: its height and width both equal $\min(W, H)$. The regions are sampled uniformly such that the overlap between consecutive regions is as close as possible to 40% (a sketch follows this list).
* We then calculate the feature vector associated with each region and post-process it with PCA-whitening and L2-normalization. The collection of regional feature vectors is combined into a single image vector by summing them and L2-normalizing at the end.
* Keeps the dimensionality low.
* Integrates multi-scale information effectively and plays the auxiliary role of multi-region matching.
* The regional weights are uniform in R-MAC, so it does not take into account differences in the regional importance of different visual content.
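A simplified NumPy sketch of the region sampling and aggregation (the per-region PCA-whitening step is omitted, and the scale formula is a common simplification rather than the paper's exact sampling scheme):

```python
import numpy as np

def rmac(fmap, n_scales=3):
    """fmap: (K, H, W). Max-pool square regions at several scales,
    l2-normalize each regional vector, then sum and l2-normalize."""
    K, H, W = fmap.shape
    agg = np.zeros(K)
    for l in range(1, n_scales + 1):
        size = max(1, int(round(2 * min(W, H) / (l + 1))))  # side at scale l
        step = max(1, int(round(size * 0.6)))  # ~40% overlap between regions
        for y in range(0, max(H - size, 0) + 1, step):
            for x in range(0, max(W - size, 0) + 1, step):
                r = fmap[:, y:y + size, x:x + size].max(axis=(1, 2))  # MAC
                agg += r / (np.linalg.norm(r) + 1e-12)
    return agg / np.linalg.norm(agg)
```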
#### 4. CRoW
[Multilayer Convolutional Feature Aggregation Algorithm for Image Retrieval](https://www.hindawi.com/journals/mpe/2019/9794202/)
* Weights the convolutional features both spatially and per channel before sum-pooling to create the final aggregation.
* CRoW pooling can increase the weight of regions of interest and decrease the weight of non-object regions to some extent.

* **Identification of areas of interest is not always accurate. I didn't try this method in the following experiments.**
### Comparison Experiments and Results (By Myself)
#### 1. Model Structure
#### 2. Comparison of the parameter $p$ in GeM
#### 3. Comparison between CNN Models
#### 4. Comparison of different pooling methods
#### 5.Results
### Post-processing methods (To Do)
- [ ] whitening methods
- [ ] find a balance between precision and speed
## Data Compression and Nearest Neighbour Search (@Yuanyuan)
What is the nearest neighbour search problem?
Database: $N$ $D$-dimensional vectors $\chi=\left\{\boldsymbol{x}_{n}\right\}_{n=1}^{N}, \boldsymbol{x}_{n} \in \mathbb{R}^{D}$
Query vector: $\boldsymbol{y} \in \mathbb{R}^{D}$
$$
n^{\star}=\underset{n \in\{1, \ldots, N\}}{\arg \min }\left\|\mathbf{y}-\mathbf{x}_{n}\right\|_{2}^{2}
$$
linear scan: $\mathcal O(DN)$
How to reduce the computational cost when $N$ is large?
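For concreteness, the brute-force baseline that the methods below try to beat (a NumPy sketch):

```python
import numpy as np

def linear_scan(X, y):
    """Exact nearest neighbor: O(DN) for X of shape (N, D)."""
    d2 = ((X - y) ** 2).sum(axis=1)  # squared l2 distance to every x_n
    return int(np.argmin(d2))        # index n* of the nearest vector
```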
### Product Quantization (Compression-based)
#### Classic PQ
[Jegou, H., Douze, M., & Schmid, C. (2010). Product quantization for nearest neighbor search. *IEEE transactions on pattern analysis and machine intelligence*, *33*(1), 117-128.](https://www.researchgate.net/profile/Herve-Jegou/publication/47815472_Product_Quantization_for_Nearest_Neighbor_Search/links/00b4953c9a4b399203000000/Product-Quantization-for-Nearest-Neighbor-Search.pdf)
- Representing a vector $\mathbf x\in \mathbb R^D$ as a concatenation of $M$ subvectors
$$
\mathbf{x}=[\underbrace{x_{1}, x_{2}, \ldots, x_{D / M}}_{\mathbf{x}^{1}}, \ldots, \underbrace{x_{D-D / M+1}, \ldots, x_{D}}_{\mathbf{x}^{M}}]^{\top}
$$
where the $m^{\text{th}}$ sub-vector is denoted $\mathbf{x}^{m} \in \mathbb{R}^{D / M}$, for each $m \in\{1, \ldots, M\}$
- **Sub-codebook**: $\mathcal C^m=\{\mathbf c_k^m \}_{k=1}^K,\mathbf{c}_k^{m} \in \mathbb{R}^{D / M}$, trained by k-means clustering
- **Sub-encoder**: $i^{m}\left(\mathbf{x}^{m}\right)=\underset{k \in\{1, \ldots, K\}}{\arg \min }\left\|\mathbf{x}^{m}-\mathbf{c}_{k}^{m}\right\|_{2}^{2}$
- **Encoder**: $\mathbf{i}(\mathbf{x})=\left[i^{1}\left(\mathbf{x}^{1}\right), \ldots, i^{M}\left(\mathbf{x}^{M}\right)\right]^{\top}$
- **Decoder**:
$$
\tilde{\mathbf{x}}=\mathbf{i}^{-1}(\mathbf{i}(\mathbf{x}))=\mathbf{i}^{-1}\left(\left[\begin{array}{c} i^{1} \\ \vdots \\ i^{M} \end{array}\right]\right)=\left[\begin{array}{c} \mathbf{c}_{i^{1}}^{1} \\ \vdots \\ \mathbf{c}_{i^{M}}^{M} \end{array}\right]
$$
- **Distance computation**:
- Representing the query vector $\mathbf y$ as a concatenation of $M$ subvectors
    - Constructing a distance table $\mathbf A$: $A(m,k)=d(\mathbf y^m,\mathbf c_k^m)^2\to \mathcal O(DK)$
- The distance can be decomposed as
$$
d(\mathbf{y}, \mathbf{x})^{2} \approx d(\mathbf{y}, \tilde{\mathbf{x}})^{2}=\sum_{m=1}^{M} d\left(\mathbf{y}^{m}, \mathbf{c}_{i^{m}}^{m}\right)^{2}=\sum_{m=1}^{M} A\left(m, i^{m}\right)
$$
- $A\left(m, i^{m}\right)$ is fetched by taking the entry identified by $m$ and $i^m$ $\to M$ look-ups
- **Computational complexity**: $\mathcal O(DK+MN)$ [before: $\mathcal O(DN)$]
- The search is still **exhaustive**
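Putting the pieces above together, a compact NumPy/SciPy sketch of classic PQ with the table-based (asymmetric) distance computation ($K=256$ so each sub-code fits in one byte; class and parameter names are ours):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

class PQ:
    def __init__(self, M=8, K=256):
        self.M, self.K = M, K

    def fit(self, X):                        # X: (N, D), D divisible by M
        self.Ds = X.shape[1] // self.M
        # one sub-codebook per sub-space, trained by k-means -> (M, K, D/M)
        self.codebooks = np.stack([
            kmeans2(X[:, m * self.Ds:(m + 1) * self.Ds], self.K,
                    minit='points')[0] for m in range(self.M)])
        return self

    def encode(self, X):                     # -> (N, M) codes, 1 byte each
        codes = np.empty((X.shape[0], self.M), dtype=np.uint8)
        for m in range(self.M):
            sub = X[:, m * self.Ds:(m + 1) * self.Ds]
            d = ((sub[:, None, :] - self.codebooks[m][None]) ** 2).sum(-1)
            codes[:, m] = d.argmin(axis=1)   # nearest sub-centroid
        return codes

    def search(self, y, codes):
        # distance table A(m, k): O(DK)
        A = np.stack([((self.codebooks[m]
                        - y[m * self.Ds:(m + 1) * self.Ds]) ** 2).sum(-1)
                      for m in range(self.M)])
        # M look-ups per database vector: O(MN), still exhaustive
        dists = A[np.arange(self.M), codes].sum(axis=1)
        return np.argsort(dists)
```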
#### Supervised PQ
[Yu, Tan, et al. "Product quantization network for fast image retrieval." *Proceedings of the European Conference on Computer Vision (ECCV)*. 2018.](https://link-springer-com.tudelft.idm.oclc.org/chapter/10.1007/978-3-030-01246-5_12)
- Incorporate PQ into a neural network by using soft quantization operation
- Train the codewords with labels
- Can be trained jointly with a global descriptor
- Structure:

- Generally has higher accuracy than the conventional PQ
- The supervised PQ layer can be incorporated flexibly into many existing models
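A minimal PyTorch sketch of the soft quantization idea (the temperature `alpha` is an illustrative hyper-parameter; this shows only the differentiable relaxation, not the paper's full network):

```python
import torch

def soft_quantize(x, codebook, alpha=10.0):
    """x: (B, D/M) sub-vectors; codebook: (K, D/M) sub-codewords.
    A softmax over negative distances replaces the hard argmin,
    so gradients can flow to both features and codewords."""
    d2 = torch.cdist(x, codebook) ** 2      # (B, K) squared distances
    w = torch.softmax(-alpha * d2, dim=1)   # soft one-hot assignment
    return w @ codebook                     # (B, D/M) soft reconstruction
```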
### Hashing (Compression-based)
#### Classic Hashing
[Wang, Jingdong, et al. "A survey on learning to hash." IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 769-790.](https://ieeexplore-ieee-org.tudelft.idm.oclc.org/abstract/document/7915742?casa_token=bLdSqIVg3B8AAAAA:I87qkaHN0cwLmdRD8MDOK2ohG7bq7FLev0aW_gl0G9-xx2N6mYQYJVHmMm6pyg44I72uv4rKZw)
- **What does hashing do?**
Converts the original high-dimensional features into low-dimensional binary hash codes
feature: $\mathbf x$
hash function: $h(\cdot)$
hash code: $\left[\begin{matrix}y_1\\\vdots\\y_M \end{matrix}\right]=\left[\begin{matrix}h_1(\mathbf x)\\\vdots\\h_M(\mathbf x) \end{matrix}\right]$, or $\mathbf y=\mathbf h(\mathbf x)$. $y_i$ are **binary** values.
- **How to generate good hash codes?**
- **Preserve similarity**: similar in the original space $\to$ close hash codes
- Pairwise: aligning the distances/similarities of **a pair of items**
$s_{ij}^o,s_{ij}^h$ are the similarities of item pair $(x_i,x_j)$ in the original/hash code space
Loss function: $\min \sum_{(i,j)\in \varepsilon}(s^o_{ij}-s^h_{ij})^2$ …
- Multi-wise: aligning the distances/similarities of **more than two items**
- How to measure similarity/distance? What is the loss function? How to deal with the binary constraint (greedy search, relaxation, regularization…)? What is the form of the hash function (linear projection, kernels, DNN…)?
- **How to search with hash codes?**
- Build a hash table: mapping the similar items into the same bucket
- Given the query $\mathbf q$, the items lying in the bucket $\mathbf h(\mathbf q)$ are retrieved as the candidates of the nearest items of $\mathbf q$
- Rank the retrieved items with Hamming distances

- **Why is hashing fast?**
The Hamming distance can be computed efficiently with an XOR followed by a popcount, at a cost much smaller than that of the l2 distance computation in the original input space.
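A small NumPy illustration (codes are packed into 64-bit words so a single XOR compares 64 bits at once; `np.unpackbits` stands in for a hardware popcount):

```python
import numpy as np

def hamming(q, codes):
    """q: (W,) uint64 query code; codes: (N, W) uint64 database codes."""
    x = np.bitwise_xor(codes, q)  # bit is 1 wherever the codes differ
    # count the set bits per row (popcount)
    return np.unpackbits(x.view(np.uint8), axis=1).sum(axis=1)
```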
#### Supervised Hashing
[Luo, Xiao, et al. "A survey on deep hashing methods." arXiv preprint arXiv:2003.03369 (2020).](https://arxiv.org/abs/2003.03369)
- **Why deep learning?**
    - deep networks have powerful representation capabilities and can learn very complex hash functions
    - features and hash codes can be trained together -> more compatible codes and higher accuracy
- **Typical framework: similarity preserving**
    - Input: a set of images (a pair for pairwise losses, more than two for multi-wise losses)
    - Standard deep CNNs as the backbone networks
    - A hash layer after the extracted features
- Similarity preserving loss

- **Hash layer: greedy hash**
[Su, Shupeng, et al. "Greedy hash: Towards fast optimization for accurate hash coding in cnn." *Proceedings of the 32nd International Conference on Neural Information Processing Systems*. 2018.](https://proceedings.neurips.cc/paper/7360-greedy-hash-towards-fast-optimization-for-accurate-hash-coding-in-cnn.pdf)
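The core trick in Greedy Hash is a `sign` forward pass with a straight-through (identity) gradient; a minimal PyTorch sketch:

```python
import torch

class GreedyHash(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)          # binary codes in {-1, 0, +1}

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output            # pass the gradient through unchanged

# usage inside a model: codes = GreedyHash.apply(features)
```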
### ANNOY (Tree-based)
- [Author's blog](https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html)
- Used at Spotify for music recommendations
- A binary tree where each node is a **random** split
- Points that are close to each other in the space are more likely to be close to each other in the tree.
- It's still possible that a hyperplane will cut two close points apart -> **build a forest of trees** -> memory
- Works well **empirically** (when $d<1000$) but lacks a theoretical guarantee
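A minimal usage sketch with the `annoy` library (random data; 50 trees is an arbitrary value illustrating the recall/memory trade-off):

```python
import random
from annoy import AnnoyIndex

f = 128                                 # vector dimensionality
index = AnnoyIndex(f, 'euclidean')      # 'angular', 'dot', ... also supported
for i in range(10000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(50)                         # a forest of 50 random-split trees
neighbors = index.get_nns_by_vector([0.0] * f, 10)  # 10 approximate NNs
```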
### HNSW (Graph-based)
[Malkov Y A, Yashunin D A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs[J]. IEEE transactions on pattern analysis and machine intelligence, 2018, 42(4): 824-836.](https://ieeexplore.ieee.org/abstract/document/8594636?casa_token=Y1loc-9Q7B4AAAAA:nm5BTdvMWP5wgpQ8wleiktPH1AGVMawtAv_ON7PYe1my-22TDFXhfafEsW4OiaZw9blOpfjvcQ)
- Motivation: 'highways'
- Hierarchical structure:
- For every element we select an integer level $l$ which defines the maximum layer to which the element belongs
    - Set an exponentially decaying probability for $l$:
        - number of points: $N$; decay factor: $a$
        - number of points on layer $0\to N$, layer $1\to N/a$, layer $2\to N/a^2$, ... $\to$ the upper layers get sparser
- Search from layer MAX to layer 0 (greedy search)
- The links in the deeper layers serve as highways
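A minimal usage sketch with the `hnswlib` library (parameter values are illustrative, not tuned):

```python
import numpy as np
import hnswlib

dim, num = 128, 10000
data = np.random.random((num, dim)).astype(np.float32)

index = hnswlib.Index(space='l2', dim=dim)
# M: links per node; ef_construction: candidate-list size while building
index.init_index(max_elements=num, M=16, ef_construction=200)
index.add_items(data)

index.set_ef(50)                        # search-time candidate-list size (>= k)
labels, dists = index.knn_query(data[:5], k=10)
```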

### Hybrid Methods
### PyNNDescent (Graph-based, @Qi Zhang)
#### 1 Introduction
There is another very accurate NN search method: PyNNDescent. PyNNDescent uses neighbor-graph-based search to efficiently find good candidates for the nearest neighbors of query points in a large training set. The basic ideas are quite simple, but the layers of refinements and tricks used to reach the highest efficiency complicate things considerably. The core concept, on which almost everything else is based, is using a nearest neighbor graph to perform a search for approximate nearest neighbors.
#### 2 Theoretical analysis
* Start with a random graph (connect each node to k random nodes)
* For each node:
* Measure the distance from the node to the neighbors of its neighbors
* If any are closer then update the graph accordingly, and keep only the k closest
* If any updates were made to the graph, go back to the previous step; otherwise stop
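A minimal usage sketch with the `pynndescent` library (random data; `n_neighbors=30` is an arbitrary illustrative value):

```python
import numpy as np
from pynndescent import NNDescent

data = np.random.random((10000, 128)).astype(np.float32)
index = NNDescent(data, metric='euclidean', n_neighbors=30)
index.prepare()                              # finalize the search graph

queries = np.random.random((5, 128)).astype(np.float32)
neighbors, distances = index.query(queries, k=10)
```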



#### 3 Results of my experiment
| Method | Matching speed per query (relative) | mAP (relative) |
|:------------:|:------------------------:|:-----------:|
| L2 NN search (baseline) | 1 | 1 |
| PQ Net | ~19X | ~0.92-0.99X |
| PyNNDescent | ~1.05-1.15X | ~1.06-1.15X |
#### 4 References
https://pynndescent.readthedocs.io/en/latest/how_pynndescent_works.html
https://github.com/lmcinnes/pynndescent
## Preprocessing, NN search, Reranking methods and datasets building (@Qi Zhang)
My main work is: 1. using neural network methods to preprocess images; 2. proposing a new accurate NN search method; 3. implementing methods such as LoFTR to rerank the preliminary coarse results and obtain fine final results; 4. collecting off-the-shelf datasets and building our own dataset. I also list what I am going to do next.
### 1. Preprocessing
Before inputting images into the convolutional network for feature extraction, we need to preprocess them to reduce image size and noise. I therefore implemented image quantization and an autoencoder. Image quantization reduces the image size while keeping the main features of the image unchanged, which reduces memory usage and improves efficiency. The autoencoder removes noise from the image while retaining useful information, which improves the accuracy of the system.
#### 1.1 Image quantization
##### 1.1.1 Introduction
Image quantization is a technique for reducing the size of images by representing the original image with fewer colors.
##### 1.1.2 Results of my experiment

The experiments show that image quantization greatly reduces the size of the processed image.
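A minimal sketch with Pillow (the file names and the 64-color palette are illustrative choices):

```python
from PIL import Image

img = Image.open('query.jpg')           # hypothetical input image
# map to an adaptive 64-color palette: the main structure survives
# while the in-memory and on-disk size shrink considerably
quantized = img.quantize(colors=64)
quantized.save('query_q64.png')
```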
##### 1.1.3 References
#### 1.2 Autoencoder
##### 1.2.1 Introduction
Autoencoders are an unsupervised learning technique that uses neural networks for representation learning. I designed a network architecture that imposes a bottleneck, forcing a compressed knowledge representation of the original input. The autoencoder consists of an encoder, a bottleneck and a decoder.

The autoencoder can be trained to denoise images and to extract features. So far I have implemented the denoising function; I will try feature extraction in the future.
##### 1.2.2 Theoretical analysis
A detailed blog post that introduces autoencoders:
https://www.jeremyjordan.me/autoencoders/
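For illustration, a minimal PyTorch sketch of a denoising autoencoder in this spirit (toy 28x28 grayscale shapes; not the exact architecture used in my experiments):

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Encoder -> bottleneck -> decoder, trained on (noisy, clean) pairs."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # 1x28x28 -> 32x7x7
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(              # 32x7x7 -> 1x28x28
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

# one training step: reconstruct the clean image from a noisy input
model, loss_fn = DenoisingAE(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.rand(8, 1, 28, 28)
noisy = (clean + 0.2 * torch.randn_like(clean)).clamp(0, 1)
loss = loss_fn(model(noisy), clean)
opt.zero_grad(); loss.backward(); opt.step()
```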
##### 1.2.3 Results of my experiment
##### 1.2.4 References
### 2. NN search
This part has been elaborated above in: Data Compression and Nearest Neighbour Search (@Yuanyuan) ---> PyNNDescent (Graph-based, @Qi Zhang).
### 3. Reranking methods
#### 3.1 Introduction
Because the previous methods are based on global features, it is difficult for them to process images accurately: under different illumination and viewpoint conditions, their accuracy may drop significantly. We therefore need a local feature method to further improve the accuracy of the engine.
There are many local feature methods, such as D2-Net and SIFT. After testing the top approaches, I found LoFTR to be the most accurate and fastest local-feature-based method.
#### 3.2 Theoretical analysis
The paper: https://arxiv.org/pdf/2104.00680.pdf
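One convenient way to try LoFTR is through kornia's implementation; a minimal sketch (random tensors stand in for a real grayscale image pair, and the 0.8 confidence threshold is an illustrative choice):

```python
import torch
from kornia.feature import LoFTR

# grayscale image tensors of shape (1, 1, H, W), values in [0, 1]
img0 = torch.rand(1, 1, 480, 640)
img1 = torch.rand(1, 1, 480, 640)

matcher = LoFTR(pretrained='outdoor').eval()
with torch.no_grad():
    out = matcher({'image0': img0, 'image1': img1})

kpts0, kpts1 = out['keypoints0'], out['keypoints1']  # matched coordinates
conf = out['confidence']
# e.g. rerank a candidate by its number of confident matches
score = int((conf > 0.8).sum())
```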
#### 3.3 Results of my experiment

#### 3.4 References
### 4. Datasets
#### 4.1 Introduction
There are 5 datasets: HPatches, InLoc (indoor), the Aachen Day-Night dataset (outdoor & illumination), Google Landmarks v2 and our customized dataset.
#### 4.2 HPatches
* Patches are extracted from a number of image sequences, where each sequence contains images of the same scenes.
* For each image sequence, HPatches provides a set of reference patches ref.png extracted from an image used as reference.
* Each patch has a size of 65x65 pixels.
* Each image sequence contains a reference image and 5 target images taken under different illumination and/or, for planar scenes, from a different viewpoint.
* There are 116 scenes in the dataset: in 57 of them the main nuisance factors are photometric changes, while the remaining 59 sequences show significant geometric deformations due to viewpoint change. Each scene has 6 images (a reference image and 5 target images).
#### 4.3 Google Landmarks v2
* Google Landmarks v2 is a dataset created by Google containing a large number of images of famous landmarks around the world.
#### 4.4 Aachen Day-Night dataset
* The Aachen Day-Night dataset depicts the old inner city of Aachen, Germany.
* The database images used to build the reference scene representations were all taken during daytime with hand-held cameras over a period of about two years.
* The dataset offers query images taken at daytime and at nighttime.
* In the Aachen Day-Night v1.1 dataset, there are 6,697 reference images and 1,015 (824 daytime, 191 nighttime) query images.
#### 4.5 InLoc
* InLoc is a dataset with reference 6DoF poses for large-scale indoor localization.
* The base indoor RGBD dataset consists of 277 RGBD panoramic images obtained by scanning two buildings at Washington University in St. Louis with a Faro 3D scanner. The researchers obtain 36 perspective RGBD images from each panorama.
* There are 356 query images obtained by a smartphone camera (iPhone 7).
#### 4.6 The customized dataset
* There are 21 query images and 156 reference images.
* Images are indoor images about artworks and outdoor images about historical buildings.
* This dataset contains some useful images from the Oxford and Paris datasets, which are also widely used in image retrieval.
#### 4.7 References
### 5. To do
There are two tasks I am going to research: diffusion and online/offline system design.
#### 5.1 Diffusion
##### 5.1.1 Introduction
Diffusion is a mechanism for spreading query similarities over the data manifold, with a closed-form solution.
##### 5.1.2 Theoretical analysis
1. Problem setup:
* A database: $\mathcal X = \{\mathbf x_1, \ldots, \mathbf x_n\} \subset \mathbb R^d$, where each $\mathbf x_i$ is a feature vector.
* The query: $\mathcal Q = \{\mathbf q_1, \ldots, \mathbf q_m\} \subset \mathbb R^d$, where $m = 1$ when the query is described by a global feature and $m > 1$ when it contains regional features.

2. There are four main steps in diffusion:
* Graph construction

1. Step 1: Extract features of images
2. Step 2: Perform k-NN search, using cosine similarity as the metric
query: 1
neighbors: 2, 3, 4
similarities: 1-2 (0.88), 1-3 (0.92), 1-4 (0.98)
affinity matrix:
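With these toy numbers, and keeping only the query-neighbor edges listed above (a reconstruction for illustration), the symmetric affinity matrix over the four nodes would be:

$$
A=\begin{bmatrix}
0 & 0.88 & 0.92 & 0.98\\
0.88 & 0 & 0 & 0\\
0.92 & 0 & 0 & 0\\
0.98 & 0 & 0 & 0
\end{bmatrix}
$$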

3. Step 3: Construct the graph with the k-NN search results
* Random walk

1. Start from a single node (place a droplet of ink onto the node)
2. Randomly transition to other nodes
3. Converge to a stable state
4. Obtain ranking scores (amount of ink on each node)
* Decomposition
Decompose the transition/stochastic matrix to obtain a closed-form solution for any new query.
* Truncation
An approach to scale up diffusion for large datasets.

3. Drawbacks:
* A little slower:


* Online/Offline combination:

4. Theoretical advantages:

##### 5.1.3 References
https://arxiv.org/abs/1811.10907
#### 5.2 Online/Offline system design
This approach has not yet been decided for implementation and needs to be determined in the meeting on Friday (21/1).