# [Paper Reading] Learning to Hash for Indexing Big Data - A Survey

[Jun Wang, Wei Liu, Sanjiv Kumar, Shih-Fu Chang, 2015](https://arxiv.org/abs/1509.05472)

###### tags: `paper_reading` `AMMAI`

## Abstract

The explosive growth of big data makes efficient indexing and search methods important, and hashing is an effective technique for this problem.

Difficulties of the nearest neighbors problem:
* the scale of the data
* the dimensionality
* costs and limits of accessing data

These factors demand efficient hashing methods. Tree-based indexing approaches require too much storage space and do not work well on high-dimensional data.

## Background

### Pipeline of Hashing-based ANN Search

1. Designing Hash Functions
2. Indexing
   ![](https://i.imgur.com/6y36qYX.png)
3. Online querying
   The query is first converted into a query code $H(q)$, and then the inverse lookup table is used to find codes near $H(q)$ in Hamming distance.

### Randomized Hashing Functions

#### Random Projection based Hashing

Locality Sensitive Hashing (LSH):

$h_k(x)=sign(w_k^Tx+b_k)$

Given $l$ $K$-bit tables, the collision probability is

$P[H(x_i)=H(x_j)] \propto l\cdot\left[1- \frac{1}{\pi} \cos^{-1}\frac{x_i^T x_j} {\|x_i\| \|x_j\|}\right]^K$

Cons:
* long hash codes
* multiple hash tables
* semantic gap
  The discrepancy between the semantic space and the metric space is known in the computer vision and machine learning communities as the semantic gap.

Random projection based hash functions ignore the specific properties of a given data set, which leads to less effective performance compared to learning based methods.

#### Random Permutation based Hashing

MinHash: min-wise independent permutation hashing

![](https://i.imgur.com/VimejeO.png =40%x)

* widely used for approximating Jaccard similarity

$P[h_k(S_i) = h_k(S_j)] = J(S_i,S_j) = \frac {|S_i \cap S_j|} {|S_i \cup S_j|}$

In practice, the MinHash scheme has shown strong performance for high-dimensional and sparse vectors such as the bag-of-words representation of documents or feature histograms of images.
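A minimal NumPy sketch of the two randomized schemes above (my own illustration, not from the survey; the function names, the zero offsets $b_k$, and the toy data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)


def lsh_codes(X, W, b):
    """Random-projection LSH: one bit per hyperplane, h_k(x) = sign(w_k^T x + b_k)."""
    return (X @ W + b > 0).astype(np.uint8)            # shape (n_points, K)


def minhash_signature(sets, n_perm, universe_size):
    """MinHash: for each random permutation, keep the smallest permuted index present in the set."""
    perms = [rng.permutation(universe_size) for _ in range(n_perm)]
    sig = np.empty((len(sets), n_perm), dtype=np.int64)
    for i, s in enumerate(sets):
        idx = np.fromiter(s, dtype=np.int64)
        for k, p in enumerate(perms):
            sig[i, k] = p[idx].min()
    return sig


# Toy usage for LSH: bit agreement approximates 1 - angle(x_i, x_j) / pi.
d, K = 64, 16
X = rng.standard_normal((5, d))
W = rng.standard_normal((d, K))                         # w_k ~ N(0, I)
b = np.zeros(K)                                         # zero offsets for the cosine-similarity case
codes = lsh_codes(X, W, b)
print("LSH bit agreement(0,1):", (codes[0] == codes[1]).mean())

# Toy usage for MinHash: matching fraction estimates Jaccard similarity.
S = [{1, 3, 5, 7}, {1, 3, 5, 9}, {2, 4, 6, 8}]
sig = minhash_signature(S, n_perm=256, universe_size=10)
print("MinHash estimate of J(S_0, S_1):", (sig[0] == sig[1]).mean())   # true J = 3/5
```

In the table-based pipeline above, the per-table $K$-bit codes would be used as bucket keys for indexing and online querying.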
## Categories of Learning based Hashing Methods

### Unsupervised, Supervised, and Semi-Supervised

* Unsupervised methods
  * spectral hashing
  * isotropic hashing
    Learns an orthogonal matrix to rotate the PCA projection matrix, similar to ITQ but making the projected dimensions isotropic (equal variance) instead.
  * spherical hashing
* Supervised learning
  * kernel learning
  * metric learning
  * deep learning
* Semi-supervised learning
  In some methods, the labeled data is used for deriving an optimal metric distance while the hash function design uses no supervision.

### Pointwise, Pairwise, Triplet-wise, and Listwise

![](https://i.imgur.com/fGNyB7M.png =75%x)

By converting rank lists to a triplet tensor matrix, listwise hashing is designed to preserve the ranking in the Hamming space.

### Linear vs. Nonlinear

* Linear: PCA, LDA
  Linear hashing often suffers from insufficient discriminative power. Techniques that use variance as the underlying objective often output orthogonal projections; however, in real-world data most of the variance is contained in only the top few directions.
* Nonlinear: isotropic hashing
  * circulant binary embedding
  * spectral hashing
  * kernel function methods
    Use a kernel function to measure similarity.

### Weighted Hashing

* Boosted Similarity Sensitive Coding
* BoostMap
  Database and query objects are embedded into a Euclidean space, in which similarities can be rapidly measured using a weighted Manhattan distance.
* Hash Bit Selection
  Selects the most informative hash bits generated by different types of hashing methods. The bit selection problem is formulated as quadratic programming on a graph and is solved efficiently by replicator dynamics.

## Methodology Review and Analysis

![](https://i.imgur.com/AR7gdOC.png)

### Spectral Hashing

First extracts the principal projections, then partitions the projected data with a sinusoidal function of a specific angular frequency. However, the assumption of a uniform data distribution rarely holds for real-world data.

### Anchor Graph Hashing

Uses a small set of anchor points to approximate the graph structure; the similarity between any pair of points can be approximated using point-to-anchor similarities.

![](https://i.imgur.com/NsM4t0o.png =75%x)

The exact graph hashing method first constructs an exact neighborhood graph and then performs partitioning with spectral techniques.
* (a)(d): first 2 bits of spectral hashing
* (b)(e): first 2 bits of spectral hashing with the exact neighborhood graph
* (c)(f): first 2 bits of anchor graph hashing

### Angular Quantization based Hashing

Maps each feature vector onto the vertex of the binary hypercube with the smallest angle. A data-dependent extension learns a rotation matrix to align the data to the binary vertices without changing the similarity between point pairs.

### Binary Reconstructive Embedding

First uses a sample set of data to define the hash functions:

$h_k(x)=sign\left(\sum\limits_{q=1}^s W_{kq}K(x_{kq},x)\right)$

The weight matrix $W$ is then determined by minimizing the reconstruction error $\sum\limits_k(\text{Euclidean distance} - \text{Hamming distance})^2$.

*I don't understand why the kernelized distance is not used instead.*

In practice, the original data point $x$ is often mapped to a hypersphere of unit length because the reconstruction (Hamming) distance lies in $[0, 1]$.

Liu et al. proposed to use supervised and kernel-based hash functions for easier and better optimization.

### Metric Learning based Hashing

pairwise labels -> parameterized Mahalanobis metric -> standard random projection hash functions

*It looks like any kind of metric learning algorithm could be plugged into this hashing scheme.*

### Semi-Supervised Hashing

Uses a small set of pairwise labels and a large amount of unlabeled data:

$\DeclareMathOperator*{\argmax}{argmax} W^*=\argmax\, tr(W^\mathsf{T} X_l S X_l^\mathsf{T} W)+ \eta\, tr(W^\mathsf{T} X X^\mathsf{T} W)$

### Column Generation Hashing

Optimizes a large-margin problem with a column generation technique to learn the hash functions and the associated bit weights iteratively.

### Ranking Supervised Hashing

![](https://i.imgur.com/r9wuUM9.png =80%x)

$L_H=\sum\limits_{m,i,j} H(q_m)^\mathsf{T}[H(x_i)-H(x_j)]S_{mij}$

$\DeclareMathOperator*{\argmax}{argmax} W^*=\argmax L_H=\argmax\, tr(WW^\mathsf{T}B)\quad \text{s.t.}\quad W^\mathsf{T}W=I \\ \text{where}\quad B=\sum\limits_m \left(\sum\limits_{i,j}[x_i-x_j]S_{mij}\right) q_m^\mathsf{T}$

The problem is solved by the augmented Lagrangian multiplier method, a primal-dual algorithm.

### Circulant Binary Embedding

The circulant projection matrix enables the use of the Fast Fourier Transform to speed up the computation.
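To illustrate the speedup: multiplying by a circulant matrix is a circular convolution, so the projection can be computed with the FFT in $O(d \log d)$ rather than $O(d^2)$. Below is a minimal sketch (my own illustration with assumed names; the random sign-flipping preprocessing follows the common circulant binary embedding setup but is an assumption here), checked against the explicit matrix:

```python
import numpy as np

rng = np.random.default_rng(0)


def circulant_binary_embedding(x, r, flip):
    """h(x) = sign(circ(r) @ (flip * x)), where circ(r) is the circulant matrix
    whose first column is r.  circ(r) @ v is a circular convolution, so it is
    evaluated via FFT in O(d log d) instead of forming the d x d matrix."""
    v = flip * x
    proj = np.fft.ifft(np.fft.fft(r) * np.fft.fft(v)).real
    return (proj >= 0).astype(np.uint8)


# Toy check: the FFT-based code matches the explicit circulant projection.
d = 8
x = rng.standard_normal(d)
r = rng.standard_normal(d)                  # defines the circulant matrix
flip = rng.choice([-1.0, 1.0], size=d)      # random sign flips (assumed preprocessing)

C = np.array([[r[(i - j) % d] for j in range(d)] for i in range(d)])
explicit = ((C @ (flip * x)) >= 0).astype(np.uint8)
print(np.array_equal(explicit, circulant_binary_embedding(x, r, flip)))   # True
```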
## Deep Learning for Hashing

![](https://i.imgur.com/ltBw6sT.png)

### Semantic Hashing

word count vector -> deep RBM

Supervised version: nonlinear Neighbourhood Component Analysis embedding
*(essentially a neural network trained with a softmax-style loss?)*

### Sparse Similarity-Preserving Hashing

To address the low recall of long hash codes, this work enforces sparsity in the hash codes.

![](https://i.imgur.com/6z0ne7w.png =75%x)

Two ISTA-type networks that share the same set of parameters and compute fast approximations of sparse coding are coupled in the training phase.
*a kind of symmetric autoencoder(?)*

### Deep Hashing

The goals are the same as ITQ:
1. the reconstruction error is minimized
2. each bit is balanced
3. all bits are independent

* Supervised Deep Hashing
* CNN Hashing

### Deep Semantic Ranking Hashing

* multilevel semantic similarities among multi-labeled images are preserved
* feature representations and hash values are jointly learned

### Issues

* There is no evaluation of hash code generation time and search performance.

## Advanced Methods and Related Applications

### Hyperplane Hashing

![](https://i.imgur.com/Csiqmfm.png =75%x)

Here a "point-to-hyperplane" search scenario emerges: the query is a hyperplane (given by its normal vector $w$) rather than a point.

#### Angle-Hyperplane Hash

$h(z)=[sign(u^\mathsf{T} z),\, sign(v^\mathsf{T}z)]$ if $z$ is a database point
$h(w)=[sign(u^\mathsf{T} w),\, -sign(v^\mathsf{T}w)]$ if $w$ is a hyperplane normal

$u,v \sim N(0, I_{d\times d})$

$P[h(z)=h(w)]=\frac{1}{4}-\frac{\alpha_{x,w}^2}{\pi^2}$

#### Embedding Hyperplane Hash

$h(z)= sign(U^\mathsf{T}V(zz^\mathsf{T}))$ if $z$ is a database point
$h(P_w)= -h(w)$ if $w$ is a hyperplane normal

$V(A)$ returns the vectorial concatenation of matrix $A$, and $U \sim N(0, I_{d^2\times d^2})$.

$P[h(z)=h(P_w)] =\frac{\cos^{-1}\sin^2(\alpha_{x,w})}{\pi}$

EH-Hash has a higher recall rate, but is far more expensive to compute.

#### Bilinear-Hyperplane Hash

$h(z)=sign(u^\mathsf{T} z z^\mathsf{T} v)$ if $z$ is a database point
$h(P_w) = -h(w)$ if $w$ is a hyperplane normal

$u,v \sim N(0, I)$

$P[h(z)=h(P_w)]=\frac1{2}-\frac{2\alpha_{x,w}^2}{\pi^2}$

BH-Hash has the highest probability of collision, which indicates that it has a better angle-sensitive property. (A minimal code sketch of BH-Hash is included at the end of these notes.)

### Subspace Hashing

Searching for the nearest subspace to a given query is frequently encountered in image synthesis, scene classification, speaker recognition, face recognition, and motion-based action recognition.

#### Approximate Nearest Subspace Search

* Uniformly handles the cases where the query is a vector or a subspace, and where subspaces have different dimensions.
* Maps both query and database elements to points in a new vector space.
* Suffers from the high dimensionality of the constructed vector space.

#### Grassmannian based Hashing

* rigorously faster
* hashes $D$-dimensional vectors or $D\times r$-dimensional subspaces

### MultiModality Hashing

This type of hashing method is closely related to applications in social networks, where multimodality and heterogeneity are often observed.

### Applications

* For indexing video sequences, a straightforward method is to independently compute binary codes for each key frame. Ye et al. proposed a structure learning framework that incorporates both temporal and spatial structure information.
* For indexing documents, Wang et al. leveraged both tag information and semantic topic modeling to achieve more accurate hash codes.
* Hashing techniques have also been applied in the active learning framework to cope with big data applications.

## Open Issues and Future Directions

* Most learning based hashing techniques lack theoretical guarantees on the quality of the returned neighbors.
* Can one directly use compact codes to do machine learning?
  *How well do the hash functions extract features?*
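Appendix: the BH-Hash sketch referenced above (my own illustration, not from the survey; the function names, the 0/1 bit encoding, and ranking candidates by Hamming distance are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)


def bh_hash_points(X, U, V):
    """BH-Hash bits for database points: h_k(z) = sign(u_k^T z z^T v_k) = sign((u_k.z)(v_k.z))."""
    return ((X @ U) * (X @ V) > 0).astype(np.uint8)      # shape (n, K)


def bh_hash_hyperplane(w, U, V):
    """BH-Hash bits for a hyperplane query with normal w: h(P_w) = -h(w), i.e. flipped bits."""
    return 1 - ((U.T @ w) * (V.T @ w) > 0).astype(np.uint8)


# Toy usage: candidates with small Hamming distance should lie near the hyperplane.
d, K, n = 16, 64, 1000
U = rng.standard_normal((d, K))                          # each column u_k ~ N(0, I)
V = rng.standard_normal((d, K))                          # each column v_k ~ N(0, I)
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)

codes = bh_hash_points(X, U, V)
qcode = bh_hash_hyperplane(w, U, V)
hamming = (codes != qcode).sum(axis=1)                   # rank database points by Hamming distance
near = np.argsort(hamming)[:5]
cosines = np.abs(X[near] @ w) / (np.linalg.norm(X[near], axis=1) * np.linalg.norm(w))
print(cosines)                                           # lowest-Hamming candidates: small |cos|
```

The printed $|\cos|$ values should be comparatively small, i.e. the lowest-Hamming-distance candidates are points lying close to the hyperplane, which is the behavior the collision probability above suggests.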