# Corpus/Document Search Algorithms
Repl: https://replit.com/@lizthedeveloper/Document-Corpus-Search-Methods#main.py
## Term Frequency / Inverse Document Frequency
* https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/
* http://www.sefidian.com/2022/02/28/understanding-tf-idf-with-python-example/#:~:text=Term%20Frequency%20%E2%80%93%20Inverse%20Document%20Frequency,%2C%20relative%20to%20a%20corpus
* https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html
Pseudocode:
```python
import math

document_count = len(documents)
tokens = document.split(" ")
inverse_document_frequency = {}
tf_idf = {}
score = 0
for token in tokens:
    # the number of documents that contain the individual token we're looking at
    documents_containing_term = sum(1 for d in documents if token in d.split(" "))
    # the count of this token in the document / the total number of tokens in the document
    token_frequency = tokens.count(token) / len(tokens)
    inverse_document_frequency[token] = math.log(document_count / documents_containing_term)
    tf_idf[token] = token_frequency * inverse_document_frequency[token]
    score += tf_idf[token]
```
This is a form of vectorizing a word's relevance within the context of a larger corpus, for use when we're trying to search many documents for a set of terms and rank the results in a principled way.
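The pseudocode above can be sketched as a small, self-contained scoring function. The toy corpus, the query, and the whitespace-split tokenization are assumptions for illustration; a real implementation would read documents from disk as the repl does.

```python
import math
from collections import Counter

def tf_idf_score(query, documents):
    """Score each document against the query terms using TF-IDF."""
    doc_count = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # number of documents containing this term
            containing = sum(1 for t in tokenized if term in t)
            if containing == 0:
                continue  # term appears nowhere in the corpus
            tf = counts[term] / len(tokens)
            idf = math.log(doc_count / containing)
            score += tf * idf
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "search engines rank documents",
]
print(tf_idf_score("cat mat", docs))
```

Documents mentioning both rare query terms score highest; a document with no query terms scores zero.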
## Exercises
1. [The above repl](https://replit.com/@lizthedeveloper/Document-Corpus-Search-Methods#main.py) is much faster in JavaScript, because file I/O is async. Reimplement it in JavaScript in order to trace the whole process.
1. After implementing it in JS, measure the time difference to see the speed increase, using the `time` module in Python and the `process.hrtime()` method in Node.js.
1. After measuring the speed increase, try caching your term frequencies (just use memory) to speed things up further.
1. After that, try implementing TF-IDF in a cloud layer.
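For the timing exercise, a minimal Python sketch using `time.perf_counter` (a high-resolution timer from the `time` module) looks like this; the `search` function here is a hypothetical stand-in for the TF-IDF search being measured:

```python
import time

def search(query, documents):
    # stand-in for the search routine under test (hypothetical helper)
    return [doc for doc in documents if query in doc]

docs = ["the cat sat", "the dog ran", "cats and dogs"] * 10_000

start = time.perf_counter()
results = search("cat", docs)
elapsed = time.perf_counter() - start
print(f"searched {len(docs)} documents in {elapsed:.4f}s, {len(results)} hits")
```

The Node.js equivalent wraps the same call in `process.hrtime()` (or `process.hrtime.bigint()`) before and after.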
#### Most cloud platforms offer a TF-IDF implementation or a vector database as a caching layer; these are a great exercise to implement.
* https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-transform.html
* https://cloud.google.com/vertex-ai/docs/matching-engine/overview
## More Modern Methods
Most search engines have transitioned to neural networks for ranking, but TF-IDF and other vector representations of individual query tokens (and phrases) are often pre-computed to provide context to other phases of search, [or are used to train models](https://github.com/ksalama/tf-textanalysis-gcp).
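One common way a later ranking phase consumes pre-computed TF-IDF vectors is cosine similarity. A minimal sketch, assuming the vectors arrive as `{token: weight}` dicts (the weights below are made-up toy values):

```python
import math

def cosine_similarity(vec_a, vec_b):
    # vectors are {token: tf_idf_weight} dicts, as a ranking phase
    # might receive them from a pre-computation step
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query_vec = {"cat": 0.4, "mat": 1.1}
doc_vec = {"cat": 0.4, "dog": 0.8}
print(cosine_similarity(query_vec, doc_vec))
```

Identical vectors score 1.0, and vectors with no shared tokens score 0.0, so the result ranks documents by how much their weighted vocabulary overlaps the query's.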
Below are case studies of "the other side" (SEO marketers, aka "the red team" of search), who use TF-IDF to analyze the top results for certain keywords and then copy those term frequencies in their own SEO-optimized sites (a tactic originally called googlebombing).
* https://blog.marketmuse.com/does-google-really-use-tf-idf/
* https://blog.marketmuse.com/why-tf-idf-doesnt-solve-your-content-and-seo-problem-but-feels-like-it-does/
#### Modern search papers on arXiv
* https://arxiv.org/pdf/1704.03940.pdf
* https://arxiv.org/pdf/1706.06613.pdf
* https://arxiv.org/pdf/1711.08611.pdf
* 2021, on LLMs: https://arxiv.org/abs/2105.02274
* Text Analysis: https://github.com/ksalama/tf-textanalysis-gcp
* Follow Up Study: https://en.wikipedia.org/wiki/Bloom_filter
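For the Bloom filter follow-up, here is a minimal sketch of the data structure; the bit-array size, hash count, and SHA-256-based hashing are arbitrary toy choices, not tuned parameters:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = [False] * size

    def _positions(self, item):
        # derive hash_count bit positions by salting one hash function
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("cat")
print(bf.might_contain("cat"))  # True (no false negatives)
```

In a search context, a Bloom filter can cheaply skip documents that definitely do not contain a query term before running the full TF-IDF scoring.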