# Embedding Searching Engine

[TOC]

## Overview

Since the search function in Instagram is terrible, one of my friends wanted a feature that can search his articles so that his customers can find the most relevant articles for their questions. I built this search function on top of embedding search. Moreover, I want to run a Thompson Sampling experiment in this function to find the most suitable embedding model from [OpenAI](https://openai.com/api/pricing/).

![Example for Searching Article](https://hackmd.io/_uploads/By93iF6IR.jpg =270x850)

The whole structure of this function is shown below. There are three parts in this project: **Search**, **Record**, and **Update**. This article focuses on the **Search** and **Update** functions; if you just want to build your own embedding search engine, reading this article is enough. However, if you want to know how I run an experiment on this search engine, or you are interested in **Record**, please read my next article (in progress...).

```mermaid
graph LR
    User(("User"))
    Author(("Author"))
    Search{"Search"}
    Update{"Update"}
    Record{"Record"}
    Embedding[("Embedding\nDatabase")]
    Article[("Article\nDatabase")]
    Test[("Testing\nParameters")]
    UserDB[("User\nDatabase")]
    User -- Question --> Search -- Results --> User
    Author -- New Article --> Update -- Update --> Article
    Update -- Embedding --> Embedding
    User -- Click Url --> Record -- Update --> Test
    Search -- Update --> UserDB -- Check --> Record
    Test -- Parameters --> Search
    Embedding --> Search
    Article --> Search
```

## Search

Below is the structure of the search function. I will explain what happens in each step.

```mermaid
graph TD
    User(("User"))
    Test[("Testing\nParameters")]
    Embedding[("Embedding\nDatabase")]
    Article[("Article\nDatabase")]
    OpenAI[["OpenAI"]]
    Faiss[["Faiss"]]
    beta_model{{"Beta Model"}}
    Model[["Embedding Model"]]
    TA["Target Article"]
    UserDB[("User\nDatabase")]
    RI["Result Index"]
    subgraph Search
        beta_model
        TA
        Model -- embedding --> Faiss -- Semantic Searching --> RI
    end
    Test -- parameters --> beta_model -- Assign Model --> OpenAI --> Model
    User -- question --> Model
    Embedding --> Faiss
    TA --> User
    RI -- query --> Article -- article --> TA
    TA -- Update --> UserDB
```

### Assign an Embedding Model from OpenAI

![Two models deciding which is the best](https://hackmd.io/_uploads/ryrjqD6UA.jpg)

When a user types a question into this function, the first step is to assign an embedding model to it. The assignment is done with beta models: each embedding model has its own beta model, and when a new question comes in, each beta model outputs a probability and the embedding model with the largest probability is assigned. Since this article only covers how I built the embedding search engine, the testing part is left to the next article.

The embedding models I considered are all from OpenAI:

- text-embedding-3-small
- ada v2
- ~~text-embedding-3-large~~

Since text-embedding-3-large cannot handle Traditional Chinese as well as the other two models, I decided to drop it before the test.

Once the embedding model is assigned, the next step is to transform the question into an embedding. After this, we are ready to search for articles. Please be aware that transforming text into an embedding is not reversible, so the semantic search outputs article indexes, not contents.

### Semantic Search with Faiss

> [FAISS (Facebook AI Similarity Search)](https://ai.meta.com/tools/faiss/) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other.

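
To make the search step concrete, here is a minimal sketch of how a question could be embedded and matched against the stored article embeddings. The model name, the index construction, and the variable names are illustrative assumptions, not the exact production code.

```python
# A minimal sketch of the Search step, assuming the article embeddings are
# already collected in a float32 NumPy array; names here are illustrative.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_index(article_embeddings: np.ndarray) -> faiss.Index:
    # One-time setup: put every article embedding into a flat L2 index.
    index = faiss.IndexFlatL2(article_embeddings.shape[1])
    index.add(article_embeddings.astype("float32"))
    return index

def search(question: str, index: faiss.Index,
           model: str = "text-embedding-3-small", k: int = 3) -> list[int]:
    # 1. Transform the question into an embedding with the assigned model.
    response = client.embeddings.create(model=model, input=question)
    query = np.array([response.data[0].embedding], dtype="float32")

    # 2. Semantic search: Faiss returns distances and the indexes of the
    #    k nearest article embeddings, not the article contents.
    _distances, indexes = index.search(query, k)
    return indexes[0].tolist()
```

In the real flow, `model` is whichever embedding model the beta models assigned in the previous step.
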
As the sketch above suggests, I put all the article embeddings into a Faiss index, query the 3 nearest embeddings to the question embedding, and get the 3 corresponding article indexes. I decided to use Faiss as the search library because it is fast, free, and open source.

### Get Selected Article & Update

After getting the indexes, we can fetch the article contents from those indexes and send them to the user. At the same time, I update the user information for the **Record** function and update the testing parameters.

## Update

Below is the structure of the update function. Its purpose is to let the author update his articles without writing any code or going through any complex process.

```mermaid
graph TD
    New(("New Article"))
    Sheet[["Google Sheet"]]
    OpenAI[["OpenAI"]]
    Model[["Embedding Model"]]
    Article[("Article\nDatabase")]
    Embedding[("Embedding\nDatabase")]
    GAS["Google Apps Script"]
    subgraph Update
        Article
        Model -- Update --> Embedding
    end
    New -- Fill in --> Sheet
    GAS -- Update --> Article
    GAS -- Upload --> Model
    OpenAI -- API --> Model
    Sheet -- trigger --> GAS
```

### Trigger Google Apps Script

There are actually two parts in this section: one lives in Google Apps Script and the other in a Google Cloud Function. The Google Apps Script function identifies whether there is a new article in the Google Sheet. The Google Sheet looks like this:

| Title     | URL          | Article     | Update |
| --------- | ------------ | ----------- | ------ |
| Title Old | old_url_.com | old article | Done   |
| Title New | new_url_.com | new article |        |

When the author fills in "Title", "URL", and "Article", the function checks whether these three columns are complete. Once they are, it sends the article to the Google Cloud Function and fills "Done" into the "Update" column automatically.

### Google Cloud Function

When a new article arrives at the Google Cloud Function, it updates the Article Database right away. At the same time, the new article is transformed into embeddings to update the Embedding Database. The embedding models are the same as in the **Search** function. (A rough sketch of this handler is included at the end of this article.)

## Conclusion

This system is actually very simple, but it will become more complex once I add the experiment to it. Therefore, please look forward to the next article, where I will talk about Thompson Sampling, a Bayesian testing method, applied to this feature. Also, I have not had time to push my code to GitHub yet, so please wait.

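
In the meantime, here is a minimal sketch of the Google Cloud Function side of the **Update** flow described above. The Python Functions Framework handler, the payload shape, and the `save_article` / `save_embedding` helpers are assumptions made for illustration, not the project's actual code.

```python
# A minimal sketch of the Update step as an HTTP Cloud Function, using the
# Python Functions Framework. The payload shape and the save_* helpers are
# placeholders for illustration.
import functions_framework
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBEDDING_MODELS = ["text-embedding-3-small", "text-embedding-ada-002"]

def save_article(article: dict) -> int:
    # Placeholder: write the article into the Article Database
    # and return its row index.
    raise NotImplementedError

def save_embedding(article_id: int, model: str, vector: np.ndarray) -> None:
    # Placeholder: write the embedding into the Embedding Database.
    raise NotImplementedError

@functions_framework.http
def update_article(request):
    # Google Apps Script POSTs the new row here,
    # e.g. {"title": ..., "url": ..., "article": ...}.
    article = request.get_json()

    # 1. Update the Article Database right away.
    article_id = save_article(article)

    # 2. Create one embedding per model, so Search can serve either of them,
    #    and update the Embedding Database.
    for model in EMBEDDING_MODELS:
        response = client.embeddings.create(model=model, input=article["article"])
        vector = np.array(response.data[0].embedding, dtype="float32")
        save_embedding(article_id, model, vector)

    return "OK", 200
```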