# Deduplication workplan, week starting 22/07

## Antonio

* Create a notebook showing how to extract fingerprints (wavelet hash, blockmean hash, image size, and activations from ResNet), with a summary of findings
* Create a synthetic dataset of cropped and zoomed image pairs (benchmarks for primary image deduplication)
* Try a logistic regression algorithm for deduplication using geolocation, addresses, names, and image phashes (matching beyond the primary image)
* Look into Apache Beam for algorithm deployment in BigQuery

## Tadas

* TF-IDF queries of names and addresses
* Evaluate results of MTurkers who tagged image pairs from duplicate offers
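
As a starting point for the TF-IDF item, here is a minimal sketch of what a name-and-address query could look like. It is an illustration only, not the planned implementation: it uses plain Python with character trigrams and smoothed IDF weights, and the function names (`char_ngrams`, `build_index`, `query`) and the sample offers are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, pad with spaces, and return character n-grams."""
    normalised = " ".join(text.lower().split())
    padded = f" {normalised} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def _weight(counts, idf):
    """Turn n-gram counts into an L2-normalised TF-IDF vector (dict of gram -> weight)."""
    vec = {g: tf * idf[g] for g, tf in counts.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def build_index(documents, n=3):
    """Compute smoothed IDF weights and one TF-IDF vector per document."""
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    doc_freq = Counter(g for c in counts for g in c)
    total = len(documents)
    idf = {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    vectors = [_weight(c, idf) for c in counts]
    return idf, vectors


def query(text, documents, idf, vectors, top_k=3, n=3):
    """Return the top_k (cosine similarity, document) pairs for a query string."""
    q = _weight(Counter(char_ngrams(text, n)), idf)
    scored = [
        (sum(w * v.get(g, 0.0) for g, w in q.items()), doc)
        for doc, v in zip(documents, vectors)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical offers: name and address concatenated into one string.
offers = [
    "Hotel Bella Vista, Via Roma 12, Firenze",
    "Bella Vista Hotel - Via Roma 12 Firenze",
    "Pensione Aurora, Corso Italia 5, Pisa",
]
idf, vectors = build_index(offers)
results = query("hotel bellavista via roma 12", offers, idf, vectors, top_k=2)
for score, offer in results:
    print(f"{score:.3f}  {offer}")
```

Character n-grams are used here instead of word tokens because names and addresses often differ by spacing, punctuation, or word order; whether that holds for the actual offer data would need to be checked.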