# ANN - Symbols: - N: num dataset items. - d: length of one item. - k: num items to query. ## LSH: - Generate random m vectors of length d. - Hash function: for each generated vector, assign 1 if same direction, otherwise 0. ![](https://hackmd.io/_uploads/Hk1xNAkr2.png) Calculate the hash for all dataset items and the query item. - Within the dataset items having same (or minimize hamming distance) hash with the query item: - Brute force all. - Return random k vectors (**more common** if m >> d). - Complexity: $O(m^2)$ - Calculate hash: $O(m.d)$ - Search: $O(m^2)$ ## PQ: - Hash function: - Segment a vector into m sub-vectors. ![](https://hackmd.io/_uploads/SkblgkgSn.png) - In each sub-vector position, calculate c centroids using k-Means from all dataset items. ![](https://hackmd.io/_uploads/rkFJ-1xH3.png) - Replace each sub-vector with its centroid ID (length = log2\(c\)) - For query item, calculate distance between its sub-vectors and all centroids. - For all dataset items, get the sum distance to the query item based on above distance to centroids. ![](https://hackmd.io/_uploads/SyQRBJerh.png) - Complexity: $O(d.c) + O(N.m)$ - Calculate distance to centroids: $O(d.c)$ - Search: $O(N.m)$ ## IVFPQ: - Find c clusters from dataset items using kMeans. - For each cluster, calculate PQ hash for all residual vectors = raw - centroid. - For the query item, find its cluster and nearby clusters, query PQ hash for all items in those clusters.