# Introduction

This document outlines a high-level, generic strategy for developing a product-based search engine. The strategy is conceptual and does not reflect any specific implementation.

Team:

- Hadi Saghir
- Azam Suleiman
- Evan Ebdo
- Jakob Hagman
- Philip Holmqvist

## Objective

To enhance the user browsing experience by optimizing search results. This includes improving search accuracy for misspelled words (batminton → badminton), broadening search result suggestions (badminton racket → badminton racquet), and providing recommendations for related products (badminton racket → shuttlecock).

## Components to develop a product-based search engine

The process of developing a product-based search engine can be divided into five major processes:

1. Data Collection and Processing (Blue)
2. Query processing (Red)
3. Related-Search Recommendations (Green)
4. Search (Lavender)
5. Related-Product Recommendations (Yellow)

![image](https://hackmd.io/_uploads/SyoDipg5T.png)

# Data Collection (Blue)

Due to a signed Non-Disclosure Agreement, I will not go into details.

# Query processing (Red)

Query processing is an essential step in preparing text data for machine learning algorithms. It involves preprocessing the query to make it more suitable for analysis and model training. The following preprocessing steps are general knowledge and represent standard practice in query processing.

### Tokenization

Tokenization is the process of dividing the query into individual words, symbols, or other meaningful units known as tokens.

**Example:** For the query "Machine learning algorithms," tokenization results in the tokens ["Machine", "learning", "algorithms"].

### Lowercasing

All text is converted to lowercase to ensure consistency in the data.

**Example:** "Machine" becomes "machine."

### Stopword Removal

Stopwords are common words like "the," "is," "and," etc., that often do not provide significant meaning to the text and can be safely removed.

**Example:** "The" is removed from the query.

### Noise reduction

Noise reduction strips out elements that carry no meaning for search, such as punctuation and extra whitespace.

**Example:** Consider the query "Machine learning algorithms for image classification." After preprocessing, the query is transformed as follows:

- Tokenization: ["Machine", "learning", "algorithms", "for", "image", "classification"]
- Lowercasing: ["machine", "learning", "algorithms", "for", "image", "classification"]
- Stopword Removal: ["machine", "learning", "algorithms", "image", "classification"]

### Resources for NLP

These resources are commonly used in machine learning, deep learning, and natural language processing (NLP):

- [TensorFlow](https://www.tensorflow.org/) and [Keras](https://keras.io/) for building and training deep learning models.
- [SpaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/), popular NLP libraries for text processing.

### Correct misspelling

These are some general techniques for correcting spelling:

- [The Damerau-Levenshtein algorithm](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) for efficient string similarity calculations. This algorithm is particularly useful for spell correction.
- [Memoization](https://en.wikipedia.org/wiki/Memoization) or [tabulation hashing](https://en.wikipedia.org/wiki/Tabulation_hashing) is often used to store previously computed word distances, which can significantly improve efficiency when dealing with frequently misspelled words.
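As a rough illustration of how these two ideas fit together, the sketch below implements the Damerau-Levenshtein distance (the common optimal-string-alignment variant) and memoizes corrections against a small vocabulary. The vocabulary and function names here are hypothetical; a real system would draw its candidates from the product catalogue.

```python
from functools import lru_cache

def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance allowing insertions, deletions, substitutions,
    and transpositions of adjacent characters (OSA variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# Hypothetical vocabulary; in practice this would come from the product catalogue.
VOCABULARY = ["badminton", "racket", "shuttlecock", "tennis", "machine", "learning"]

@lru_cache(maxsize=10_000)  # memoize results for frequently misspelled words
def correct(word: str, max_distance: int = 2) -> str:
    """Return the closest vocabulary word within max_distance edits,
    or the word itself if nothing is close enough."""
    best = min(VOCABULARY, key=lambda v: damerau_levenshtein(word, v))
    return best if damerau_levenshtein(word, best) <= max_distance else word

print(correct("batminton"))  # -> "badminton" (one substitution away)
```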
Instead of merely processing individual words, consider the context of the query. This lets us take the embeddings of the other words in the query into account and determine which candidate words match best, enhancing the accuracy of the matching process. Word embeddings capture the context and relationships between words, for instance that words such as "machine" and "learning" often appear together.

# Related Search (Green)

This part focuses on increasing the breadth of search results through suggestions (badminton racket → tennis racket) and on returning more accurate results based on the intent of the query rather than its explicit search terms.

## Model Selection and Usage:

1. **Model Selection:** Select a model that balances inference speed and accuracy. There are open-source models available, such as BERT and word2vec.
2. **Fine-Tuning:** The model is fine-tuned on the product catalogue in order to recommend products available at the retail store. Fine-tuning further trains a pre-trained model on a specific task.

### Word Embedding Generation:

Word embeddings are vector representations of words in a high-dimensional space, capturing semantic relationships between words. Here is how the word embeddings were generated:

1. **Tokenization:** Before feeding text to the model, tokenization is performed. Tokenization breaks down text into smaller units known as tokens. This step is crucial because transformer models cannot directly process raw text; they require tokenized input.
2. **Tokenization Process:** In the system, related search terms were read from the dataset and tokenized. Each term was split into tokens to create a list of tokens.
3. **Embedding Computation:** The model computes embeddings for each term, effectively transforming them into numerical vectors in a high-dimensional space.
4. **Mean Pooling:** To obtain a single embedding vector for each term, mean pooling was applied to the model's outputs. Mean pooling involves calculating the mean (average) of the token embeddings.

### Generating Related Search Terms:

Once the word embeddings were computed for all the related search terms, the system used these embeddings to identify related search terms for a given user query:

1. **User Query Processing:** When a user inputs a query, it undergoes tokenization, similar to how the search terms were tokenized.
2. **Embedding for Query:** The DistilBERT model is employed to generate an embedding vector for the user's query, using the same tokenization and mean pooling techniques.
3. **Cosine Similarity:** To find related search terms, the system calculates the cosine similarity between the embedding vector of the user's query and the embedding vectors of all the potential search terms. This similarity score quantifies how similar the query is to each term.
4. **Threshold Filtering:** To ensure that highly similar or identical terms are not recommended, a threshold (e.g., 98% similarity) is set. Search terms with similarity scores above this threshold are filtered out.
5. **Results:** The remaining search terms, which are sufficiently related but not identical to the query, are returned as related search terms.

The model, in combination with tokenization, mean pooling, and cosine similarity calculations, is used to create word embeddings and identify related search terms based on semantic similarity. These embeddings allow the system to understand the meaning and context of words, making it possible to provide relevant search term suggestions to users.
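The sketch below shows what this pipeline can look like with the Hugging Face `transformers` library, assuming a stock `distilbert-base-uncased` checkpoint and a handful of made-up candidate search terms; the fine-tuned model and the real term catalogue are not reproduced here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the actual fine-tuned model is not specified here.
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Tokenize the texts, run them through the model, and mean-pool
    the token embeddings (ignoring padding) into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # (batch, tokens, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts                             # mean pooling

# Hypothetical catalogue of candidate related search terms.
terms = ["badminton racquet", "tennis racket", "shuttlecock", "badminton shoes"]
term_vecs = embed(terms)

def related_terms(query: str, threshold: float = 0.98) -> list[str]:
    """Rank terms by cosine similarity to the query, dropping terms that
    are nearly identical to it (similarity above the threshold)."""
    query_vec = embed([query])
    sims = torch.nn.functional.cosine_similarity(query_vec, term_vecs)
    ranked = sorted(zip(terms, sims.tolist()), key=lambda t: t[1], reverse=True)
    return [term for term, score in ranked if score < threshold]

print(related_terms("badminton racket"))
```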
# Search (Lavender)

Efficient search and information retrieval are crucial in the digital landscape. This section explains the mechanics of two well-known approaches used to optimize the search process: the inverted index and the TF-IDF model.

**Inverted Index:**

- **How it works:**
  - Tokenization: Documents are broken down into terms (tokens).
  - Building the Index: Each term is mapped to the documents in which it appears.
  - Searching: Terms are looked up in the index to quickly retrieve relevant products.
- **Why Inverted Index:**
  - Efficiency: Fast keyword searches.
  - Scalability: Works well with large datasets.
  - Flexibility: Supports advanced features like phrase queries.

**Inverted Index**

![image](https://hackmd.io/_uploads/rJ5WhAl5T.png)

**TF-IDF (Term Frequency-Inverse Document Frequency):**

- **How it works:**
  - Structure: A mathematical model with formulas for term relevance.
  - Term Frequency (TF): Measures how often a term occurs in a document.
  - Inverse Document Frequency (IDF): Measures how important a term is across all documents.
  - Calculation: The final TF-IDF score is TF multiplied by IDF.
- **Why TF-IDF:**
  - Relevance Ranking: Ranks search results based on term relevance.
  - Handling Common Words: Diminishes the weight of uninformative terms.
  - Enhanced Search: Works well with data structures like inverted indexes.
  - Term Discrimination: Distinguishes a term's importance within and across documents.

**TF-IDF**

![image](https://hackmd.io/_uploads/S1Gmn0xq6.png)

These diagrams illustrate the concepts of the inverted index and TF-IDF, which are crucial components of the search implementation, enhancing the efficiency and relevance of search results.
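To make the interplay between the two concrete, here is a minimal, self-contained sketch: it builds an inverted index over a few toy product descriptions and ranks matches with the standard weighting tf(t, d) × log(N / df(t)). The documents, identifiers, and function names are illustrative only and are not taken from the actual implementation.

```python
import math
from collections import Counter, defaultdict

# Toy product "documents" (id -> description); purely illustrative.
docs = {
    1: "badminton racket lightweight carbon",
    2: "badminton shuttlecock feather pack",
    3: "tennis racket graphite",
}

def tokenize(text: str) -> list[str]:
    return text.lower().split()

# --- Build the inverted index: term -> {doc_id: raw term count} ---
index: dict[str, dict[int, int]] = defaultdict(dict)
doc_lengths: dict[int, int] = {}
for doc_id, text in docs.items():
    tokens = tokenize(text)
    doc_lengths[doc_id] = len(tokens)
    for term, count in Counter(tokens).items():
        index[term][doc_id] = count

N = len(docs)

def idf(term: str) -> float:
    """Inverse document frequency: log(N / number of docs containing the term)."""
    df = len(index.get(term, {}))
    return math.log(N / df) if df else 0.0

def search(query: str) -> list[tuple[int, float]]:
    """Look up each query term in the inverted index and accumulate
    TF-IDF scores per document; return documents sorted by score."""
    scores: dict[int, float] = defaultdict(float)
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            tf = count / doc_lengths[doc_id]
            scores[doc_id] += tf * idf(term)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("badminton racket"))  # doc 1 ranks first: it matches both query terms
```

In practice the index would be built offline over the full catalogue and the TF-IDF scores would typically be combined with other relevance signals, but the lookup-then-score flow is the same.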