# Embedding Searching Engine

[TOC]

## Overview

Since the search function in Instagram is terrible, one of my friends wanted a feature that can search his articles so that his customers can find the most relevant articles for their questions. I built this search function on top of embedding search. Moreover, I want to run a Thompson Sampling experiment in this function to find the most suitable embedding model from [OpenAI](https://openai.com/api/pricing/).

![Example for Searching Article](https://hackmd.io/_uploads/By93iF6IR.jpg =270x850)

The whole structure of this function is shown below. There are three parts in this project: **Search**, **Record**, and **Update**. This article focuses on the **Search** and **Update** functions; if you just want to build your own embedding search engine, reading this article is enough. However, if you want to know how I run an experiment on this search engine, or you are interested in **Record**, please read my next article (in progress...).

```mermaid
graph LR
    User(("User"))
    Author(("Author"))
    Search{"Search"}
    Update{"Update"}
    Record{"Record"}
    Embedding[("Embedding\nDatabase")]
    Article[("Article\nDatabase")]
    Test[("Testing\nParameters")]
    UserDB[("User\nDatabase")]
    User -- Question --> Search -- Results --> User
    Author -- New Article --> Update -- Update --> Article
    Update -- Embedding --> Embedding
    User -- Click Url --> Record -- Update --> Test
    Search -- Update --> UserDB -- Check --> Record
    Test -- Parameters --> Search
    Embedding --> Search
    Article --> Search
```

## Search

Below is the structure of the search function. I will explain what happens in each step.

```mermaid
graph TD
    User(("User"))
    Test[("Testing\nParameters")]
    Embedding[("Embedding\nDatabase")]
    Article[("Article\nDatabase")]
    OpenAI[["OpenAI"]]
    Faiss[["Faiss"]]
    beta_model{{"Beta Model"}}
    Model[["Embedding Model"]]
    TA["Target Article"]
    UserDB[("User\nDatabase")]
    RI["Result Index"]
    subgraph Search
        beta_model
        TA
        Model -- embedding --> Faiss -- Semantic Searching --> RI
    end
    Test -- parameters --> beta_model -- Assign Model --> OpenAI --> Model
    User -- question --> Model
    Embedding --> Faiss
    TA --> User
    RI -- query --> Article -- article --> TA
    TA -- Update --> UserDB
```

### Assign an Embedding Model from OpenAI

![Two models deciding which is the best](https://hackmd.io/_uploads/ryrjqD6UA.jpg)

When a user types a question into this function, the first step is to assign an embedding model to it. The assignment is done with beta models: each embedding model has its own beta model, and when a new question comes in, each beta model outputs a probability and the embedding model with the largest probability is assigned. Since this article only covers how I built the embedding search engine, the testing part is left to the next article.

The embedding models I considered are all from OpenAI:

- text-embedding-3-small
- ada v2
- ~~text-embedding-3-large~~

Since text-embedding-3-large cannot handle Traditional Chinese as well as the other two models, I decided to drop it before the test.

Once the embedding model is assigned, the next step is to transform the question into an embedding. After this, we are ready to search for articles. Please be aware that transforming text into an embedding is not reversible, so the semantic search outputs article indexes, not contents.

### Semantic Search with Faiss

> [FAISS (Facebook AI Similarity Search)](https://ai.meta.com/tools/faiss/) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other.

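
To make the search step concrete, here is a minimal sketch of how a question could be embedded and matched against the stored article embeddings. The model name, the index construction, and the variable names are illustrative assumptions, not the exact production code.

```python
# A minimal sketch of the Search step, assuming the article embeddings are
# already collected in a float32 NumPy array; names here are illustrative.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_index(article_embeddings: np.ndarray) -> faiss.Index:
    # One-time setup: put every article embedding into a flat L2 index.
    index = faiss.IndexFlatL2(article_embeddings.shape[1])
    index.add(article_embeddings.astype("float32"))
    return index

def search(question: str, index: faiss.Index,
           model: str = "text-embedding-3-small", k: int = 3) -> list[int]:
    # 1. Transform the question into an embedding with the assigned model.
    response = client.embeddings.create(model=model, input=question)
    query = np.array([response.data[0].embedding], dtype="float32")

    # 2. Semantic search: Faiss returns distances and the indexes of the
    #    k nearest article embeddings, not the article contents.
    _distances, indexes = index.search(query, k)
    return indexes[0].tolist()
```

In the real flow, `model` is whichever embedding model the beta models assigned in the previous step.
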
As the sketch above suggests, I put all the article embeddings into a Faiss index, query the 3 nearest embeddings to the question embedding, and get the 3 corresponding article indexes. I decided to use Faiss as the search library because it is fast, free, and open source.

### Get Selected Article & Update

After getting the indexes, we can fetch the article contents from those indexes and send them to the user. At the same time, I update the user information for the **Record** function and update the testing parameters.

## Update

Below is the structure of the update function. Its purpose is to let the author update his articles without writing any code or going through any complex process.

```mermaid
graph TD
    New(("New Article"))
    Sheet[["Google Sheet"]]
    OpenAI[["OpenAI"]]
    Model[["Embedding Model"]]
    Article[("Article\nDatabase")]
    Embedding[("Embedding\nDatabase")]
    GAS["Google Apps Script"]
    subgraph Update
        Article
        Model -- Update --> Embedding
    end
    New -- Fill in --> Sheet
    GAS -- Update --> Article
    GAS -- Upload --> Model
    OpenAI -- API --> Model
    Sheet -- trigger --> GAS
```

### Trigger Google Apps Script

There are actually two parts in this section: one lives in Google Apps Script and the other in a Google Cloud Function. The Google Apps Script function identifies whether there is a new article in the Google Sheet. The Google Sheet looks like this:

| Title     | URL          | Article     | Update |
| --------- | ------------ | ----------- | ------ |
| Title Old | old_url_.com | old article | Done   |
| Title New | new_url_.com | new article |        |

When the author fills in "Title", "URL", and "Article", the function checks whether these three columns are complete. Once they are, it sends the article to the Google Cloud Function and fills "Done" into the "Update" column automatically.

### Google Cloud Function

When a new article arrives at the Google Cloud Function, it updates the Article Database right away. At the same time, the new article is transformed into embeddings to update the Embedding Database. The embedding models are the same as in the **Search** function. (A rough sketch of this handler is included at the end of this article.)

## Conclusion

This system is actually very simple, but it will become more complex once I add the experiment to it. Therefore, please look forward to the next article, where I will talk about Thompson Sampling, a Bayesian testing method, applied to this feature. Also, I have not had time to push my code to GitHub yet, so please wait.

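
In the meantime, here is a minimal sketch of the Google Cloud Function side of the **Update** flow described above. The Python Functions Framework handler, the payload shape, and the `save_article` / `save_embedding` helpers are assumptions made for illustration, not the project's actual code.

```python
# A minimal sketch of the Update step as an HTTP Cloud Function, using the
# Python Functions Framework. The payload shape and the save_* helpers are
# placeholders for illustration.
import functions_framework
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBEDDING_MODELS = ["text-embedding-3-small", "text-embedding-ada-002"]

def save_article(article: dict) -> int:
    # Placeholder: write the article into the Article Database
    # and return its row index.
    raise NotImplementedError

def save_embedding(article_id: int, model: str, vector: np.ndarray) -> None:
    # Placeholder: write the embedding into the Embedding Database.
    raise NotImplementedError

@functions_framework.http
def update_article(request):
    # Google Apps Script POSTs the new row here,
    # e.g. {"title": ..., "url": ..., "article": ...}.
    article = request.get_json()

    # 1. Update the Article Database right away.
    article_id = save_article(article)

    # 2. Create one embedding per model, so Search can serve either of them,
    #    and update the Embedding Database.
    for model in EMBEDDING_MODELS:
        response = client.embeddings.create(model=model, input=article["article"])
        vector = np.array(response.data[0].embedding, dtype="float32")
        save_embedding(article_id, model, vector)

    return "OK", 200
```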