This document outlines a high-level generic strategy for developing a product-based search engine. This strategy is conceptual and does not reflect specific implementations.
Goal:
To enhance user browsing experiences by optimizing search results. This includes improving search accuracy for misspelled words (batminton → badminton), broadening search result suggestions (badminton racket → badminton racquet), and providing recommendations for related products (badminton racket → shuttlecock).
Developing a product-based search engine can be divided into four major processes.
Due to a signed Non-Disclosure Agreement, I will not go into the details of these processes.
Query processing is an essential step in preparing text data for machine learning algorithms. It involves preprocessing the query to make it more suitable for analysis and model training.
In the preprocessing phase, the following steps are general knowledge and represent standard practice in query processing:
Tokenization is the process of dividing the query into individual words, symbols, or other meaningful units known as tokens.
Example: For the query "Machine learning algorithms," tokenization results in tokens: ["Machine", "learning", "algorithms"].
All text is converted to lowercase letters to ensure consistency in the data.
Example: "Machine" becomes "machine."
Stopwords are common words like "the," "is," "and," etc., that often do not provide significant meaning to the text and can be safely removed.
Example: "The" is removed from the query.
Consider the query "Machine learning algorithms for image classification." After preprocessing, the query is transformed into: ["machine", "learning", "algorithms", "image", "classification"].
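As a minimal illustration of these three steps, the following Python sketch uses a deliberately abbreviated stopword list (a real system would use a fuller list such as NLTK's English stopwords):

```python
import re

# Abbreviated, illustrative stopword list; a production system would use a
# fuller list (e.g. NLTK's English stopwords).
STOPWORDS = {"the", "is", "and", "a", "an", "of", "for", "to", "in"}

def preprocess(query: str) -> list[str]:
    """Tokenize, lowercase, and remove stopwords from a search query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())  # tokenize + lowercase
    return [t for t in tokens if t not in STOPWORDS]  # drop stopwords

print(preprocess("Machine learning algorithms for image classification"))
# ['machine', 'learning', 'algorithms', 'image', 'classification']
```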
These resources are commonly used in machine learning, deep learning, and natural language processing (NLP):
These are some general algorithms for correcting spelling:
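A widely used approach is edit (Levenshtein) distance: the misspelled word is replaced with the vocabulary word that is the fewest single-character edits away. The sketch below is only illustrative; the vocabulary is a hypothetical stand-in for terms drawn from the product catalogue.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute) to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical vocabulary; in practice these terms would come from the catalogue.
VOCABULARY = ["badminton", "racket", "racquet", "shuttlecock", "tennis"]

def correct(word: str, max_distance: int = 2) -> str:
    """Return the closest vocabulary word within max_distance edits, else the word unchanged."""
    best = min(VOCABULARY, key=lambda v: levenshtein(word.lower(), v))
    return best if levenshtein(word.lower(), best) <= max_distance else word

print(correct("batminton"))  # badminton
```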
Instead of merely processing individual words, consider the context of the query as a whole. This lets the system use the embeddings of the other words in the query to determine which candidate matches best, improving the accuracy of the matching process. Word embeddings capture the context and relationships between words: for instance, words that often appear together, such as "machine" and "learning", end up with similar representations.
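For example, a small word2vec model trained with the gensim library on a toy, made-up corpus places words that appear in similar contexts close together in the vector space:

```python
from gensim.models import Word2Vec

# Toy, made-up corpus of tokenized queries; a real system would train on
# query logs or product descriptions instead.
corpus = [
    ["machine", "learning", "algorithms"],
    ["machine", "learning", "for", "image", "classification"],
    ["deep", "learning", "machine", "vision"],
    ["badminton", "racket", "shuttlecock"],
]

model = Word2Vec(sentences=corpus, vector_size=32, window=3, min_count=1, epochs=200)

# With enough training data, words that frequently co-occur (e.g. "machine"
# and "learning") tend to score higher than unrelated pairs.
print(model.wv.similarity("machine", "learning"))
print(model.wv.similarity("machine", "badminton"))
```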
This part focuses on increasing the breadth of search results through suggestions (badminton racket → tennis racket) and on returning more accurate results based on the intent of the query rather than its explicit search terms.
Model selection: Select a model that balances inference speed and accuracy. Open-source models such as BERT and word2vec are available.
Fine-Tuning: The model is fine-tuned on the product catalogue so that it recommends products available at the retail store. Fine-tuning further trains a pre-trained model on a specific task.
Word embeddings are vector representations of words in a high-dimensional space, capturing semantic relationships between words. Here's how the word embeddings were generated:
Tokenization: Before feeding text to the model, tokenization is performed. Tokenization breaks down text into smaller units known as tokens. This step is crucial because transformer models cannot directly process raw text; they require tokenized input.
Tokenization Process: In the system, related search terms were read from the dataset and each term was split into a list of tokens.
Embedding Computation: The model computes embeddings for each term, effectively transforming them into numerical vectors in a high-dimensional space.
Mean Pooling: To obtain a single embedding vector for each term, mean pooling was applied to the model's outputs. Mean pooling involves calculating the mean (average) of the token embeddings.
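A minimal sketch of this embedding pipeline using the Hugging Face transformers library; the checkpoint name and example terms are assumptions, since the document only states that a DistilBERT model with mean pooling was used:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "distilbert-base-uncased" is an assumed checkpoint name.
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(terms: list[str]) -> torch.Tensor:
    """Return one mean-pooled embedding vector per input term."""
    inputs = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state  # (batch, tokens, dim)
    # Mean pooling: average the token embeddings, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1)              # (batch, tokens, 1)
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

terms = ["badminton racket", "badminton racquet", "tennis racket"]
term_embeddings = embed(terms)
print(term_embeddings.shape)  # torch.Size([3, 768])
```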
Once the word embeddings were computed for all the related search terms, the system used these embeddings to identify related search terms for a given user query:
User Query Processing: When a user inputs a query, it undergoes tokenization, similar to how the search terms were tokenized.
Embedding for Query: The DistilBERT model is employed to generate an embedding vector for the user's query using the same tokenization and mean-pooling techniques.
Cosine Similarity: To find related search terms, the system calculates the cosine similarity between the embedding vector of the user's query and the embedding vectors of all the potential search terms. This similarity score quantifies how similar the query is to each term.
Threshold Filtering: To ensure that highly similar or identical terms are not recommended, a threshold (e.g., 98% similarity) is set. Search terms with similarity scores above this threshold are filtered out.
Results: The remaining search terms, which are sufficiently related but not identical to the query, are returned as related search terms.
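Reusing the embed() helper from the sketch above, the ranking and threshold filtering could look like the following; the 0.98 cutoff mirrors the example threshold mentioned earlier, and the candidate terms are illustrative:

```python
import torch.nn.functional as F

def related_terms(query: str, terms: list[str], term_embeddings, max_similarity: float = 0.98):
    """Rank candidate terms by cosine similarity to the query, dropping near-duplicates."""
    query_embedding = embed([query])                                # (1, dim)
    scores = F.cosine_similarity(query_embedding, term_embeddings)  # one score per term
    ranked = sorted(zip(terms, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [(term, score) for term, score in ranked if score < max_similarity]

candidates = ["badminton racquet", "tennis racket", "shuttlecock"]
print(related_terms("badminton racket", candidates, embed(candidates)))
```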
The DistilBERT model, in combination with tokenization, mean pooling, and cosine similarity calculations, is used to create word embeddings and identify related search terms based on semantic similarity. These embeddings allow the system to understand the meaning and context of words, making it possible to provide relevant search term suggestions to users.
Efficient search and information retrieval are crucial in the digital landscape. This section explains the mechanics of two well-known approaches, the Inverted Index and the TF-IDF model, used to optimize the search process.
Inverted Index:
(Diagram: Inverted Index)
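An inverted index maps each term to the set of documents (here, products) whose text contains it, so a query only needs to look at the postings lists of its own terms rather than scanning the whole catalogue. A minimal sketch over a hypothetical catalogue:

```python
from collections import defaultdict

# Hypothetical product catalogue: product_id -> description.
catalogue = {
    1: "badminton racket lightweight carbon",
    2: "badminton shuttlecock feather pack",
    3: "tennis racket graphite",
}

# Build the inverted index: term -> set of product ids containing that term.
inverted_index = defaultdict(set)
for product_id, description in catalogue.items():
    for term in description.lower().split():
        inverted_index[term].add(product_id)

# Lookup: products matching every term of the query (intersection of postings).
query_terms = ["badminton", "racket"]
print(set.intersection(*(inverted_index[t] for t in query_terms)))  # {1}
```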
TF-IDF (Term Frequency-Inverse Document Frequency):
(Diagram: TF-IDF)
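TF-IDF weights a term highly when it occurs frequently within a document but rarely across the collection, so distinctive terms dominate the ranking. A minimal sketch of one common weighting variant, over the same hypothetical catalogue in tokenized form:

```python
import math
from collections import Counter

# Hypothetical tokenized catalogue: product_id -> tokens.
documents = {
    1: ["badminton", "racket", "lightweight", "carbon", "racket"],
    2: ["badminton", "shuttlecock", "feather", "pack"],
    3: ["tennis", "racket", "graphite"],
}
N = len(documents)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in documents.values() for term in set(tokens))

def tf_idf(term: str, tokens: list[str]) -> float:
    """TF-IDF of a term in one document (one common weighting variant)."""
    tf = tokens.count(term) / len(tokens)  # term frequency
    idf = math.log(N / df[term])           # inverse document frequency
    return tf * idf

# "racket" appears in two of the three documents, so it is weighted lower
# than the rarer term "shuttlecock".
print(tf_idf("racket", documents[1]))       # ~0.16
print(tf_idf("shuttlecock", documents[2]))  # ~0.27
```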
These diagrams illustrate the concepts of Inverted Index and TF-IDF, which are crucial components of the search implementation, enhancing the efficiency and relevance of search results.