# Data Science for Search?

While I'm still early in my journey with e-commerce search systems, this note is a repository of the search functionality I could comprehend; for most of us it is beautifully abstracted away as a black box.

Search and Reco go hand in hand, so to draw a line we'll keep recommendations as the listings that pop up separately, either on the home page or under a listing currently being viewed by the user. Search is user-initiated, while recommendations are system-initiated. Reco has to be **Near Real Time** (NRT), while Search can be a bit more relaxed (it's okay if we don't surface items listed just moments ago).

<hr>

Table of Contents
[TOC]

## Avenues for DS in Search

We can make 'search smarter' by targeting multiple avenues, discussed below:

1. Indexing
    Indexing items/documents (the entire corpus of listings) is commonly achieved with an inverted index, which stores, for each term, the docIDs in which that term appears.
    ![invertedIndex](https://hackmd.io/_uploads/SJR0P5xMll.jpg)
2. Understanding the Query terms
    Extract the important parts from the query and pass them to ElasticSearch (structured search). It is highly unlikely that the userbase will be monolingual, which necessitates an understanding of synonym pairs across languages.
3. Understanding Query Intent
    * **Entity** Recognition: recognize famous names/places, *`not sure how this works`*; maintaining a pre-defined set is tedious and would need to be updated frequently.
    * **Username** search? Some platforms allow users to search for other users based on a unique userID/username. A small % of queries might not be marketplace searches; users might want to browse a popular store's page to scout for new items.
    * **Browsing Intent**: category browse or general marketplace browse? If we can associate the query with a particular category (or a handful of categories), listings attached to the identified categories are boosted (ranked higher).
4. Retrieval and Ranking
    Based on the query, get the most suitable listings (retrieval) and rank these listings based on relevancy and business needs (a soft exponential decay based on time gives a lower score to older listings).
5. Advanced Search
    The query is broken down into chunks, which are connected by boolean operators (a small parsing sketch follows this list).
    * **Exact match**: `"mango raw"` only returns listings containing both words; we do not consider partial matches (mango juice, sweet mango are discarded).
    * **Partial match**: `"rolex" green watch` should return all listings having the term 'rolex' in the title/description, with green watch being a preference, but not a requirement.
    * **Negative match**: `"amul" -milk` should return listings that match "amul" but not "milk".
    * **Wildcard match**: `run* shoes` evaluates to all variants of the word 'run' followed by any characters (running, runic, runner, runway).
6. Identify profanities / problematic words
    Can keep a dictionary of such words and display 0 results to avoid backlash / legal issues.
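To make the advanced-search operators above a bit more concrete, here is a minimal sketch of how a raw query string could be split into exact, negative, wildcard, and plain clauses before being translated into an ElasticSearch query. The `parse_advanced_query` helper and the clause names are illustrative assumptions, not a real ElasticSearch API.

```python
import re

def parse_advanced_query(query: str) -> dict:
    """Split a raw query into exact / negative / wildcard / plain clauses.

    Illustrative only: a production parser would also handle quoting edge
    cases, attribute filters, and operator precedence.
    """
    clauses = {"exact": [], "negative": [], "wildcard": [], "plain": []}

    # Pull out quoted phrases first: "mango raw" -> exact match
    clauses["exact"] = re.findall(r'"([^"]+)"', query)
    query = re.sub(r'"[^"]+"', " ", query)

    for token in query.split():
        if token.startswith("-"):      # -milk  -> negative match
            clauses["negative"].append(token[1:])
        elif "*" in token:             # run*   -> wildcard match
            clauses["wildcard"].append(token)
        else:                          # plain terms act as soft preferences
            clauses["plain"].append(token)
    return clauses

print(parse_advanced_query('"rolex" green watch -fake run* shoes'))
# {'exact': ['rolex'], 'negative': ['fake'], 'wildcard': ['run*'],
#  'plain': ['green', 'watch', 'shoes']}
```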
## +ve signals for marketplaces

Weights can be assigned to these signals to calculate a relevancy score for each (query, listing) pair:

* Transactions (strongest signal 😃 which brings in revenue)
* Add to cart (high purchase intent)
* Add to wishlist (user wants similar items, possibly waiting for a price drop)
* Quality Chat (`num_messages > threshold`, between buyer and seller)
* Chat
* Like
* View

## Metrics for Search

Click-through rates of all forms are referred to as search success metrics; these can be tracked via the total number of views/likes/add-to-carts/transactions. To measure how useful the ranked results are, one can monitor the avg/p50/p90 of the **listing index interacted with**, i.e. the position of the result the user engaged with. This helps us keep track of how good the results are: if users consistently click on listings with a high index, it signals poor ranking relevance.

## Precision and Recall in Search

Out of all results shown, how much is actually relevant ==> Precision
Out of all relevant content, how much did we retrieve ==> Recall

<strong>Precision</strong>: out of all listings shown, how many are relevant to the query; P@4 and P@8 are typical cut-off metrics.
<strong>Recall</strong>: out of all relevant listings in the corpus, how many could my algorithm retrieve. This is a bit hard to compute, as getting the raw count of all relevant listings is tough.

## Indexing

Pretty simple: for each term in our corpus, store the docIDs the term shows up in, along with its term frequency in each doc. This creates a postings list / inverted index, as we saw in the first image.

## Understanding the Query

Before we try to analyze the query, we need to normalize it by converting everything to the same case (usually lowercase) and then stem/lemmatize to get the root word.

| stemming | lemmatization |
| -------- | -------- |
| crude process that chops off suffixes | gets to the morphological root of a word |
| easy, quick, based on rules | based on a vocabulary, computationally heavy |
| studies ==> studi | studies ==> study |

Having arrived at the root words, we split our phrase into tokens.

### Query Expansion

Keep a learned synonym map across languages and append synonyms to the search query before passing it to ElasticSearch, which will simply perform a union of the posting lists of all the query terms to obtain every item in our index containing any of them. This increases the number of results.

### Attribute / Structured Search ?

Allows the search bar to work with natural-language queries, which is seamless compared to a standard keyword search where the user has to fill in values for the attributes they are looking for.

Critical for Fashion / Electronics, where attributes are specified within the query itself:
- iPhone black 16gb
- nike shoes under 3500
- red football shoes 12

Attribute keys are not mentioned, but the values are passed in the query itself.

This increases the number of search results and therefore the average impressions per query, as we can further include relevant listings that match on attributes. Since the retrieved results are also more relevant, we should observe an increase in num_listings viewed/clicked through.

We pass the query to a QueryUnderstanding service, which performs entity recognition and extracts attribute-value pairs for us, which we then pass on to ElasticSearch.

## Ranking and Retrieval

We want to get the matching results from our database to present to the user.

### Retrieval

Collect (mostly) relevant documents/listings; this can be a massive collection.
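The retrieval step itself can be sketched in a few lines: each query term maps to a posting list (a set of docIDs) in the inverted index, and we take the union of those sets for a broad candidate pool, or the intersection for a strict AND-style match. The tiny `INDEX` and the `retrieve` helper below are made up for illustration; a real system would use ElasticSearch/Lucene posting lists.

```python
# Toy inverted index: term -> set of docIDs (made-up data for illustration)
INDEX = {
    "mango": {1, 4, 7},
    "raw":   {4, 9},
    "juice": {1, 12},
    "shoes": {3, 5, 8},
}

def retrieve(terms, mode="union"):
    """Boolean retrieval: union for broad recall, intersection for strict AND."""
    posting_lists = [INDEX.get(t, set()) for t in terms]
    if not posting_lists:
        return set()
    if mode == "intersection":
        return set.intersection(*posting_lists)
    return set.union(*posting_lists)

print(sorted(retrieve(["mango", "raw"])))                        # [1, 4, 7, 9]
print(sorted(retrieve(["mango", "raw"], mode="intersection")))   # [4]
```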
Boolean retrieval is fast: just perform unions/intersections of the query terms' posting lists, as in the sketch above. Usually unions are preferred, to increase the number of results.

### Ranking (can skip this section)

From the huge set of documents collected during the retrieval phase, we want to sort items based on a relevancy / recency score. These methods ignore the order of words in a document, and are hence referred to as 'bag of words' methods.

---

#### Term Frequency

The number of occurrences of a term in the document. Not all words in a document are equally important; there are some statistically boring words as well. How do we identify these?

#### Inverse Document Frequency

Certain words have no discriminating power: they exist in a lot of documents and do not help us in determining relevance. Here, `N` is the total number of documents, and `df` is the number of documents containing the term.

$$ IDF = {N \over df} $$

For a large corpus, we usually take the log of this.

#### tf-idf score

We would like to give a higher score to documents in which the query terms appear many times, while those terms should not appear a lot in the corpus overall. Words like `and` and `the` would appear in all documents and provide no relevance signal.

$$ \text{tf-idf} = tf \times \log{N \over df} $$

* Higher score when the term appears many times within a small number of documents,
* Lower score when the term appears fewer times in a doc, or occurs in many documents,
* Lowest when the term occurs in almost all documents.

(a toy scorer illustrating this is sketched at the end of this note)

## Things I don't understand well enough

* Comparing document vectors with the query vector using cosine similarity is intuitive, but how do we generate the docVectors?
* How do we evaluate search results? Not sure, haha, will have to read and learn.
* Phrase completion / query suggestions

## Acknowledgements

[Talk by Scott Hajek, PhD guy from New York](https://www.youtube.com/watch?v=dqRDyeFJUvk&list=WL&index=11)

Amazing book, have read the first 8 chapters: [Introduction to Information Retrieval by Manning](https://nlp.stanford.edu/IR-book/information-retrieval-book.html)

Lots of documentation as well, haha. Zhihu, which is the Chinese version of Quora.
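As a small addendum to the tf-idf section above, here is a toy scorer over a handful of made-up listing titles. The corpus, the query, and the helper names are all illustrative assumptions; it simply computes tf * log(N/df) per query term and sums the contributions per document, mirroring the formula in the ranking section.

```python
import math
from collections import Counter

# Made-up mini corpus of listing titles (for illustration only)
DOCS = {
    1: "raw mango 1kg fresh mango",
    2: "mango juice 500ml",
    3: "green running shoes size 12",
}

N = len(DOCS)
tokenized = {doc_id: text.split() for doc_id, text in DOCS.items()}

def df(term):
    """Number of documents containing the term."""
    return sum(1 for tokens in tokenized.values() if term in tokens)

def tf_idf_score(query, doc_id):
    """Sum of tf * log(N / df) over query terms, per the formula above."""
    counts = Counter(tokenized[doc_id])
    score = 0.0
    for term in query.split():
        d = df(term)
        if d == 0:                      # term absent from the corpus
            continue
        score += counts[term] * math.log(N / d)
    return score

query = "raw mango"
for doc_id in DOCS:
    print(doc_id, round(tf_idf_score(query, doc_id), 3))
# doc 1 scores highest: 'raw' is rare in the corpus and 'mango' appears twice there
```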