# Introduction

This document outlines a high-level, generic strategy for developing a product-based search engine. The strategy is conceptual and does not reflect any specific implementation.

Team:

- Hadi Saghir
- Azam Suleiman
- Evan Ebdo
- Jakob Hagman
- Philip Holmqvist

## Objective

To enhance the user browsing experience by optimizing search results. This includes improving search accuracy for misspelled words (batminton → badminton), broadening search result suggestions (badminton racket → badminton racquet), and providing recommendations for related products (badminton racket → shuttlecock).

## Components to develop a product-based search engine

The process of developing a product-based search engine can be divided into five major processes:

1. Data Collection and Processing (Blue)
2. Query processing (Red)
3. Related-Search Recommendations (Green)
4. Search (Lavender)
5. Related-Product Recommendations (Yellow)

![image](https://hackmd.io/_uploads/SyoDipg5T.png)

# Data Collection (Blue)

Due to a signed Non-Disclosure Agreement, I will not go into details.

# Query processing (Red)

Query processing is an essential step in preparing text data for machine learning algorithms. It involves preprocessing the query to make it more suitable for analysis and model training. The following preprocessing steps are general knowledge and represent standard practice in query processing.

### Tokenization

Tokenization is the process of dividing the query into individual words, symbols, or other meaningful units known as tokens.

**Example:** For the query "Machine learning algorithms," tokenization results in the tokens ["Machine", "learning", "algorithms"].

### Lowercasing

All text is converted to lowercase to ensure consistency in the data.

**Example:** "Machine" becomes "machine."

### Stopword Removal

Stopwords are common words like "the," "is," "and," etc., that often do not provide significant meaning to the text and can be safely removed.

**Example:** "The" is removed from the query.

### Noise reduction

Noise reduction strips out elements that carry no meaning for search, such as punctuation and extra whitespace.

**Example:** Consider the query "Machine learning algorithms for image classification." After preprocessing, the query is transformed as follows:

- Tokenization: ["Machine", "learning", "algorithms", "for", "image", "classification"]
- Lowercasing: ["machine", "learning", "algorithms", "for", "image", "classification"]
- Stopword Removal: ["machine", "learning", "algorithms", "image", "classification"]

### Resources for NLP

These resources are commonly used in machine learning, deep learning, and natural language processing (NLP):

- [TensorFlow](https://www.tensorflow.org/) and [Keras](https://keras.io/) for building and training deep learning models.
- [SpaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/), popular NLP libraries for text processing.

### Correct misspelling

These are some general techniques for correcting spelling:

- [The Damerau-Levenshtein algorithm](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) for efficient string similarity calculations. This algorithm is particularly useful for spell correction.
- [Memoization](https://en.wikipedia.org/wiki/Memoization) or [tabulation hashing](https://en.wikipedia.org/wiki/Tabulation_hashing) is often used to store previously computed word distances, which can significantly improve efficiency when dealing with frequently misspelled words.
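As a rough illustration of how these two ideas fit together, the sketch below implements the Damerau-Levenshtein distance (the common optimal-string-alignment variant) and memoizes corrections against a small vocabulary. The vocabulary and function names here are hypothetical; a real system would draw its candidates from the product catalogue.

```python
from functools import lru_cache

def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance allowing insertions, deletions, substitutions,
    and transpositions of adjacent characters (OSA variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# Hypothetical vocabulary; in practice this would come from the product catalogue.
VOCABULARY = ["badminton", "racket", "shuttlecock", "tennis", "machine", "learning"]

@lru_cache(maxsize=10_000)  # memoize results for frequently misspelled words
def correct(word: str, max_distance: int = 2) -> str:
    """Return the closest vocabulary word within max_distance edits,
    or the word itself if nothing is close enough."""
    best = min(VOCABULARY, key=lambda v: damerau_levenshtein(word, v))
    return best if damerau_levenshtein(word, best) <= max_distance else word

print(correct("batminton"))  # -> "badminton" (one substitution away)
```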
Instead of merely processing individual words, consider the context of the query. This lets us take the embeddings of the other words in the query into account and determine which candidate words match best, enhancing the accuracy of the matching process. Word embeddings capture the context and relationships between words, for instance that words such as "machine" and "learning" often appear together.

# Related Search (Green)

This part focuses on increasing the breadth of search results through suggestions (badminton racket → tennis racket) and on returning more accurate results based on the intent of the query rather than its explicit search terms.

## Model Selection and Usage:

1. **Model Selection:** Select a model that balances inference speed and accuracy. There are open-source models available, such as BERT and word2vec.
2. **Fine-Tuning:** The model is fine-tuned on the product catalogue in order to recommend products available at the retail store. Fine-tuning further trains a pre-trained model on a specific task.

### Word Embedding Generation:

Word embeddings are vector representations of words in a high-dimensional space, capturing semantic relationships between words. Here is how the word embeddings were generated:

1. **Tokenization:** Before feeding text to the model, tokenization is performed. Tokenization breaks down text into smaller units known as tokens. This step is crucial because transformer models cannot directly process raw text; they require tokenized input.
2. **Tokenization Process:** In the system, related search terms were read from the dataset and tokenized. Each term was split into tokens to create a list of tokens.
3. **Embedding Computation:** The model computes embeddings for each term, effectively transforming them into numerical vectors in a high-dimensional space.
4. **Mean Pooling:** To obtain a single embedding vector for each term, mean pooling was applied to the model's outputs. Mean pooling involves calculating the mean (average) of the token embeddings.

### Generating Related Search Terms:

Once the word embeddings were computed for all the related search terms, the system used these embeddings to identify related search terms for a given user query:

1. **User Query Processing:** When a user inputs a query, it undergoes tokenization, similar to how the search terms were tokenized.
2. **Embedding for Query:** The DistilBERT model is employed to generate an embedding vector for the user's query, using the same tokenization and mean pooling techniques.
3. **Cosine Similarity:** To find related search terms, the system calculates the cosine similarity between the embedding vector of the user's query and the embedding vectors of all the potential search terms. This similarity score quantifies how similar the query is to each term.
4. **Threshold Filtering:** To ensure that highly similar or identical terms are not recommended, a threshold (e.g., 98% similarity) is set. Search terms with similarity scores above this threshold are filtered out.
5. **Results:** The remaining search terms, which are sufficiently related but not identical to the query, are returned as related search terms.

The model, in combination with tokenization, mean pooling, and cosine similarity calculations, is used to create word embeddings and identify related search terms based on semantic similarity. These embeddings allow the system to understand the meaning and context of words, making it possible to provide relevant search term suggestions to users.
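The sketch below shows what this pipeline can look like with the Hugging Face `transformers` library, assuming a stock `distilbert-base-uncased` checkpoint and a handful of made-up candidate search terms; the fine-tuned model and the real term catalogue are not reproduced here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the actual fine-tuned model is not specified here.
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Tokenize the texts, run them through the model, and mean-pool
    the token embeddings (ignoring padding) into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # (batch, tokens, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts                             # mean pooling

# Hypothetical catalogue of candidate related search terms.
terms = ["badminton racquet", "tennis racket", "shuttlecock", "badminton shoes"]
term_vecs = embed(terms)

def related_terms(query: str, threshold: float = 0.98) -> list[str]:
    """Rank terms by cosine similarity to the query, dropping terms that
    are nearly identical to it (similarity above the threshold)."""
    query_vec = embed([query])
    sims = torch.nn.functional.cosine_similarity(query_vec, term_vecs)
    ranked = sorted(zip(terms, sims.tolist()), key=lambda t: t[1], reverse=True)
    return [term for term, score in ranked if score < threshold]

print(related_terms("badminton racket"))
```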
# Search (Lavender)

Efficient search and information retrieval are crucial in the digital landscape. This section explains the mechanics of two well-known approaches used to optimize the search process: the inverted index and the TF-IDF model.

**Inverted Index:**

- **How it works:**
  - Tokenization: Documents are broken down into terms (tokens).
  - Building the Index: Each term is mapped to the documents in which it appears.
  - Searching: Terms are looked up in the index to quickly retrieve relevant products.
- **Why Inverted Index:**
  - Efficiency: Fast keyword searches.
  - Scalability: Works well with large datasets.
  - Flexibility: Supports advanced features like phrase queries.

**Inverted Index**

![image](https://hackmd.io/_uploads/rJ5WhAl5T.png)

**TF-IDF (Term Frequency-Inverse Document Frequency):**

- **How it works:**
  - Structure: A mathematical model with formulas for term relevance.
  - Term Frequency (TF): Measures how often a term occurs in a document.
  - Inverse Document Frequency (IDF): Measures how important a term is across all documents.
  - Calculation: The final TF-IDF score is TF multiplied by IDF.
- **Why TF-IDF:**
  - Relevance Ranking: Ranks search results based on term relevance.
  - Handling Common Words: Diminishes the weight of uninformative terms.
  - Enhanced Search: Works well with data structures like inverted indexes.
  - Term Discrimination: Distinguishes a term's importance within and across documents.

**TF-IDF**

![image](https://hackmd.io/_uploads/S1Gmn0xq6.png)

These diagrams illustrate the concepts of the inverted index and TF-IDF, which are crucial components of the search implementation, enhancing the efficiency and relevance of search results.
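To make the interplay between the two concrete, here is a minimal, self-contained sketch: it builds an inverted index over a few toy product descriptions and ranks matches with the standard weighting tf(t, d) × log(N / df(t)). The documents, identifiers, and function names are illustrative only and are not taken from the actual implementation.

```python
import math
from collections import Counter, defaultdict

# Toy product "documents" (id -> description); purely illustrative.
docs = {
    1: "badminton racket lightweight carbon",
    2: "badminton shuttlecock feather pack",
    3: "tennis racket graphite",
}

def tokenize(text: str) -> list[str]:
    return text.lower().split()

# --- Build the inverted index: term -> {doc_id: raw term count} ---
index: dict[str, dict[int, int]] = defaultdict(dict)
doc_lengths: dict[int, int] = {}
for doc_id, text in docs.items():
    tokens = tokenize(text)
    doc_lengths[doc_id] = len(tokens)
    for term, count in Counter(tokens).items():
        index[term][doc_id] = count

N = len(docs)

def idf(term: str) -> float:
    """Inverse document frequency: log(N / number of docs containing the term)."""
    df = len(index.get(term, {}))
    return math.log(N / df) if df else 0.0

def search(query: str) -> list[tuple[int, float]]:
    """Look up each query term in the inverted index and accumulate
    TF-IDF scores per document; return documents sorted by score."""
    scores: dict[int, float] = defaultdict(float)
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            tf = count / doc_lengths[doc_id]
            scores[doc_id] += tf * idf(term)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("badminton racket"))  # doc 1 ranks first: it matches both query terms
```

In practice the index would be built offline over the full catalogue and the TF-IDF scores would typically be combined with other relevance signals, but the lookup-then-score flow is the same.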