# Corpus/Document Search Algorithms

Repl: https://replit.com/@lizthedeveloper/Document-Corpus-Search-Methods#main.py

## Term Frequency / Inverse Document Frequency

* https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/
* http://www.sefidian.com/2022/02/28/understanding-tf-idf-with-python-example/#:~:text=Term%20Frequency%20%E2%80%93%20Inverse%20Document%20Frequency,%2C%20relative%20to%20a%20corpus
* https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html

Pseudocode:

```python
import math

document_count = len(documents)
tokens = document.split(" ")

for token in tokens:
    # the number of documents that contain this token
    documents_containing_term = sum(1 for d in documents if token in d.split(" "))
    # the count of the times this token appears in the document,
    # divided by the total number of tokens in the document
    token_frequency = tokens.count(token) / len(tokens)
    inverse_document_frequency[token] = math.log(document_count / documents_containing_term)
    tf_idf[token] = token_frequency * inverse_document_frequency[token]
    score += tf_idf[token]
```

This is a form of vectorizing a word's relevance and context within a larger dataset, for use when we're searching a lot of documents for a set of terms and want to rank the results in a rational way.

## Exercises

1. [The above repl](https://replit.com/@lizthedeveloper/Document-Corpus-Search-Methods#main.py) is much faster in JavaScript, because file I/O is async. Reimplement it in JavaScript in order to trace the whole process.
2. After implementing it in JS, measure the time difference to see the speed increase, using the `time` module in Python and the `process.hrtime()` method in Node.js.
3. After measuring the speed increase, try caching your term frequencies (just use memory) to speed things up further.
4. After that, try implementing TF-IDF in a cloud layer.

Most cloud platforms offer a TF-IDF implementation or a database that can serve as a caching layer; these are a great exercise to implement.
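Before reaching for a cloud service, it helps to see the whole pipeline in one place. Below is a minimal, self-contained sketch of the TF-IDF scoring described above — not the repl's implementation; the function name `tf_idf_scores` and the whitespace/lowercase tokenization are assumptions for illustration:

```python
import math
from collections import Counter

def tf_idf_scores(documents, query):
    """Score each document against the query terms using TF-IDF.

    documents: list of strings; query: string of space-separated terms.
    Returns (score, document) pairs, best match first.
    """
    document_count = len(documents)
    tokenized = [doc.lower().split() for doc in documents]

    # Pre-compute how many documents contain each term (document frequency).
    doc_frequency = Counter()
    for tokens in tokenized:
        for term in set(tokens):
            doc_frequency[term] += 1

    results = []
    for doc, tokens in zip(documents, tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if doc_frequency[term] == 0:
                continue  # term appears in no document; contributes nothing
            token_frequency = counts[term] / len(tokens)
            inverse_document_frequency = math.log(document_count / doc_frequency[term])
            score += token_frequency * inverse_document_frequency
        results.append((score, doc))
    return sorted(results, reverse=True)

docs = [
    "the cat sat on the mat",
    "dogs and cats living together",
    "the stock market crashed today",
]
ranked = tf_idf_scores(docs, "cat mat")
# the first element is the best match: "the cat sat on the mat"
```

Pre-computing the document frequencies once, instead of rescanning the corpus for every token, is the same idea as the caching exercise above — the per-term counts are exactly what you would keep in memory (or in a cloud caching layer) between queries.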
* https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-transform.html
* https://cloud.google.com/vertex-ai/docs/matching-engine/overview

## More Modern Methods

Most search algorithms have transitioned to neural networks for ranking, but TF-IDF and other vector representations of individual query tokens (and phrases) are often pre-computed to provide context to other phases of search, [or are used to train models](https://github.com/ksalama/tf-textanalysis-gcp).

There are also case studies from "the other side" (SEO marketers, aka "the red team" of search), who use TF-IDF to analyze the top results for certain keywords in order to copy the term frequencies on their own SEO-optimized sites (a practice originally called googlebombing).

* https://blog.marketmuse.com/does-google-really-use-tf-idf/
* https://blog.marketmuse.com/why-tf-idf-doesnt-solve-your-content-and-seo-problem-but-feels-like-it-does/

#### Modern search papers on arXiv

* https://arxiv.org/pdf/1704.03940.pdf
* https://arxiv.org/pdf/1706.06613.pdf
* https://arxiv.org/pdf/1711.08611.pdf
* 2021, on LLVMs: https://arxiv.org/abs/2105.02274
* Text Analysis: https://github.com/ksalama/tf-textanalysis-gcp
* Follow Up Study: https://en.wikipedia.org/wiki/Bloom_filter
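The Bloom filter linked as a follow-up study pairs naturally with document search: it answers "does this document definitely *not* contain this term?" in constant time, without scanning the document, at the cost of occasional false positives. A minimal sketch — the class name, the 1024-bit size, and the SHA-256-based hashing are all illustrative choices, not a reference implementation:

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter: a probabilistic set with no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrarily long bit array

    def _positions(self, item):
        # Derive num_hashes independent bit positions by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        # (false positives are possible, false negatives are not).
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

Used as a pre-filter, one filter per document lets a search skip the TF-IDF scoring entirely for documents that cannot contain any query term.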