# Embedding Search Engine
[TOC]
## Overview
Since Instagram's built-in search is terrible, a friend of mine wanted a feature that searches his articles so that his customers can find the ones most relevant to their questions.
I built this search feature on top of embedding search. I also want to run a Thompson Sampling experiment inside it to find the most suitable embedding model from [OpenAI](https://openai.com/api/pricing/).
![Example for Searching Article](https://hackmd.io/_uploads/By93iF6IR.jpg =270x850)
The overall structure of this feature is shown below. There are three parts in this project: **Search**, **Record**, and **Update**. This article focuses on the **Search** and **Update** functions; if you just want to build your own embedding search engine, reading this article is enough. However, if you want to know how I run an experiment on this search engine, or if you are interested in **Record**, please read my next article (in progress...).
```mermaid
graph LR
User(("User"))
Author(("Author"))
Search{"Search"}
Update{"Update"}
Record{"Record"}
Embedding[("Embedding\nDatabase")]
Article[("Article\nDatabase")]
Test[("Testing\nParameters")]
UserDB[("User\nDatabase")]
User -- Question --> Search -- Results --> User
Author -- New Article --> Update -- Update --> Article
Update -- Embedding --> Embedding
User -- Click Url --> Record -- Update --> Test
Search -- Update --> UserDB -- Check --> Record
Test -- Parameters --> Search
Embedding --> Search
Article --> Search
```
## Search
Below is the structure of the search function. I will explain what happens in each step.
```mermaid
graph TD
User(("User"))
Test[("Testing\nParameters")]
Embedding[("Embedding\nDatabase")]
Article[("Article\nDatabase")]
OpenAI[["OpenAI"]]
Faiss[["Faiss"]]
beta_model{{"Beta Model"}}
Model[["Embedding Model"]]
TA["Target Article"]
UserDB[("User\nDatabase")]
RI["Result Index"]
subgraph Search
beta_model
TA
Model -- embedding --> Faiss -- Semantic Searching --> RI
end
Test -- parameters --> beta_model -- Assign Model --> OpenAI --> Model
User -- question --> Model
Embedding --> Faiss
TA --> User
RI -- query --> Article -- article --> TA
TA -- Update --> UserDB
```
### Assign an Embedding Model from OpenAI
![2 Model decide which is the best](https://hackmd.io/_uploads/ryrjqD6UA.jpg)
When a user types a question into this feature, the first step is to assign an embedding model to it. Models are assigned with beta models: each embedding model has its own beta model, and when a new question arrives, each beta model outputs a probability and the model with the highest probability is assigned.
Since this article only covers how I built the embedding search engine, the testing part will be left for the next article (a minimal assignment sketch follows the model list below).
The candidate embedding models are all from OpenAI:
- text-embedding-3-small
- text-embedding-ada-002 (ada v2)
- ~~text-embedding-3-large~~
Since text-embedding-3-large cannot handle traditional Chinese as well as the other two models, I decided to drop it before the test.
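As a minimal sketch (the real testing logic comes in the next article), assigning a model with beta models could look like the following; the success/failure counts here are made-up placeholders that would come from the Testing Parameters database:
```python
import numpy as np

# Hypothetical counts; in the real system they come from the Testing Parameters database.
beta_params = {
    "text-embedding-3-small": {"alpha": 12, "beta": 5},
    "text-embedding-ada-002": {"alpha": 9, "beta": 7},
}

def assign_model(params: dict) -> str:
    """Draw from each model's Beta distribution and pick the model with the highest draw."""
    draws = {name: np.random.beta(p["alpha"], p["beta"]) for name, p in params.items()}
    return max(draws, key=draws.get)

chosen_model = assign_model(beta_params)
```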
Once the embedding model is assigned, the next step is to transform the question into an embedding. After that, we are ready to search for articles.
Note that transforming text into an embedding is not reversible, so the semantic search actually outputs article indexes, not article contents.
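For reference, here is a minimal sketch of that transformation with the v1-style OpenAI Python SDK (the example question is only an illustration):
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_question(question: str, model: str = "text-embedding-3-small") -> list[float]:
    """Transform the user's question into an embedding vector with the assigned model."""
    response = client.embeddings.create(model=model, input=question)
    return response.data[0].embedding

query_vector = embed_question("How should I price handmade products?")
```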
### Semantic Search with Faiss
> [FAISS (Facebook AI Similarity Search)](https://ai.meta.com/tools/faiss/) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other.
In this project, I decided to use Faiss as the search library because it is a fast and free embedding search library. Simply put, I load all article embeddings into Faiss, query the top 3 nearest embeddings, and get the 3 corresponding article indexes.
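A minimal sketch of that search, assuming a brute-force `IndexFlatL2` index, with random vectors standing in for the Embedding Database so the snippet runs on its own:
```python
import faiss
import numpy as np

# Random vectors stand in for the Embedding Database in this sketch.
dim = 1536                                   # text-embedding-3-small output dimension
article_embeddings = np.random.rand(100, dim).astype("float32")

index = faiss.IndexFlatL2(dim)               # exact (brute-force) L2 index
index.add(article_embeddings)

# Embed the question (see the previous sketch), then fetch the 3 nearest articles.
query = np.random.rand(1, dim).astype("float32")
distances, indices = index.search(query, 3)
top3_article_ids = indices[0].tolist()       # row indexes map back to articles
```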
### Get Selected Article & Update
After getting the indexes, we can fetch the article contents from the Article Database and send them back to the user. I also update the user's information for the **Record** function and refresh the testing parameters.
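If the Article Database is stored in the same row order as the Faiss index (an assumption of this sketch), the lookup is just positional:
```python
# Assumption: articles are stored in the same row order as the Faiss index,
# so a Faiss result index maps directly to an article record.
articles = [
    {"title": "Title Old", "url": "old_url_.com", "article": "old article"},
    {"title": "Title New", "url": "new_url_.com", "article": "new article"},
]

result_ids = [1, 0]  # e.g. the indexes returned by index.search
results = [articles[i] for i in result_ids]
# Send title/URL/content back to the user, then log the interaction
# for the Record function and refresh the testing parameters.
```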
## Update
Below is the structure of the update function. Its purpose is to let the author add new articles without writing any code or going through complex steps.
```mermaid
graph TD
New(("New Article"))
Sheet[["Google Sheet"]]
OpenAI[["OpenAI"]]
Model[["Embedding Model"]]
Article[("Article\nDatabase")]
Embedding[("Embedding\nDatabase")]
GAS["Google Apps Script"]
subgraph Update
Article
Model -- Update --> Embedding
end
New -- Fill in --> Sheet
GAS -- Update --> Article
GAS -- Upload --> Model
OpenAI -- API --> Model
Sheet -- trigger --> GAS
```
### Trigger Google Apps Script
There are actually two pieces of code in this section: one lives in Google Apps Script and the other in a Google Cloud Function.
The Google Apps Script function checks whether there is a new article in the Google Sheet. The Google Sheet looks like this:
| Title | URL | Article | Update |
| -------- | -------- | -------- | -------- |
| Title Old | old_url_.com | old article | Done |
| Title New | new_url_.com | new article | |
When the author fills in "Title", "URL", and "Article", the function checks whether these three columns are complete. Once they are, it sends the article to the Google Cloud Function and automatically fills "Done" into the "Update" column.
### Google Cloud Function
When a new article arrives at the Google Cloud Function, it updates the Article Database right away. At the same time, the new article is transformed into embeddings to update the Embedding Database. The embedding models are the same ones used by the **Search** function.
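A rough sketch of that Cloud Function, assuming an HTTP-triggered Python function built with `functions-framework`; `save_article` and `save_embedding` are hypothetical placeholders for the two databases:
```python
import functions_framework
from openai import OpenAI

client = OpenAI()

@functions_framework.http
def update_article(request):
    """Receive a new article from Apps Script, store it, and store its embeddings."""
    data = request.get_json(silent=True) or {}
    title, url, article = data.get("title"), data.get("url"), data.get("article")

    save_article(title, url, article)  # hypothetical Article Database helper

    # One embedding per candidate model, so Search can use either of them.
    for model in ("text-embedding-3-small", "text-embedding-ada-002"):
        embedding = client.embeddings.create(model=model, input=article).data[0].embedding
        save_embedding(model, url, embedding)  # hypothetical Embedding Database helper

    return "Done", 200
```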
## Conclusion
This system itself is quite simple, but it becomes more complex once I add an experiment to it. Please look forward to the next article, where I will talk about Thompson Sampling, a Bayesian testing method, applied to this feature.
Also, I have not had time to push my code to GitHub yet, so please wait.