---
tags: taxonomy
---

# Semantic Search

<span style="color:green"> Technical sections are marked with **\*** </span>
<span style="color:red"> TODO </span>

## Outline:
- What is semantic search
- Use cases
- Multimodal embedding semantic search
- Things to think about before using semantic search
- Training method - contrastive learning
- Success cases so far

## What is semantic search

Semantic search is a data-searching technique that aims to understand the overall meaning of a query and to consider related topics the user might be interested in next. It can serve as a tool or component that integrates with other products or projects.

With the advancement of AI, semantic search is not limited to text. It can be multi-modal: any combination of text and image. Below are some examples of semantic search in the AI era.

![](https://hackmd.io/_uploads/H1nqmvtNh.png)

## Why Important?

- [Why is vector search important?](https://www.elastic.co/what-is/vector-search)
- Recommendations
- Question answering
- Prompt engineering: integrating with LLMs like ChatGPT and Bard
- Integrating into other ML projects
    - Guided generative AI
    - Classification
    - Dataset: wrong label detection (ongoing project)
    - ...
- OpenAI embeddings: [Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)

## How it works

![](https://i.imgur.com/ApBJKmm.png)
[Semantic Search with S-BERT is all you need](https://medium.com/mlearning-ai/semantic-search-with-s-bert-is-all-you-need-951bc710e160)

The idea behind semantic search is to embed all objects in your database (sentences, paragraphs, or images) into an embedding space. At search time, the query is embedded into the same embedding space, and the closest embeddings from your database are retrieved. These entries should have a high semantic overlap with the query.

![](https://hackmd.io/_uploads/S1Ge3rxr3.png)
[Vector Databases simply explained!](https://www.youtube.com/watch?v=dN0lsF2cvm4)

![](https://hackmd.io/_uploads/rkj5dYBrh.jpg)

## Key questions before applying semantic search

<span style="color:red"> Imagine you are an engineer and your boss asks you to solve the following semantic search problem. We have a database (target) containing ad texts from different advertisers. Given a text query (source) like `Product/Cars/Honda`, return the top-k most relevant ad texts. </span>

<!-- <span style="color:red"> Stage 1: [OpenAI embedding model paper](https://cdn.openai.com/papers/Text_and_Code_Embeddings_by_Contrastive_Pre_Training.pdf). Stage 2: data transformation Stage 3: Model fine-tuning. </span> -->

Let's identify the three key components in semantic search. We depend on the embeddings to do semantic search, and the embeddings depend on the **AI model** and **our data**.

- **Our data**
    - How compatible is our data with the AI model and distance metric we have chosen?
- **AI model**
    - Converts the input to an embedding.
    - If the input is text, the AI model can be an LLM like BERT, GPT, etc.
    - If the input is an image, the AI model can be a vision model like ResNet, vision transformers, etc.
- **Distance metric**
    - The metric for determining how similar two embeddings are.
    - The most common metrics are cosine similarity and Euclidean distance.

A minimal sketch putting these three components together is shown below.
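This is a toy end-to-end sketch, assuming the open-source [sentence-transformers](https://huggingface.co/sentence-transformers) library as the AI model and cosine similarity as the metric; the corpus and query here are placeholders for real data.

```python
# Minimal semantic search: embed the database once, embed the query at search
# time, and return the nearest entries under cosine similarity.
from sentence_transformers import SentenceTransformer, util

# AI model: a small general-purpose text encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Our data: the objects in the database we want to search over.
corpus = [
    "Two dogs are playing together on Winter Snow Field",
    "They are angry",
    "Two dogs find joy in playing together",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Distance metric: cosine similarity between the query and each corpus entry.
query_embedding = model.encode("two dogs in the snow", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), corpus[hit["corpus_id"]])
```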
## Methodology

We will first show a diagram of the overall methodology for applying semantic search. Then, in the following sections, we will use two real examples to explain how to apply the methodology.

![](https://hackmd.io/_uploads/BynPnVgSh.png)

There are three stages in total, but they do not depend on each other, and we do not necessarily need all of them. Instead, we can start from any stage, as long as it works as expected. In general, stage 1 is relatively easy, stage 2 needs more effort, and stage 3 is the hardest, requiring in-depth knowledge about the data and about model training techniques. Typically, we start from stage 1.

### Stage 1

- **Choose AI model**

    Given the data, we first need to choose an AI model to transform the data into the embedding space. If the data is natural language, we can choose an [OpenAI embedding](https://openai.com/blog/new-and-improved-embedding-model) model such as text-embedding-ada-002. Alternatively, we can choose from the open-source library [sentence-transformers](https://huggingface.co/sentence-transformers) ([model overview](https://www.sbert.net/docs/pretrained_models.html)), e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) or multi-qa-MiniLM-L6-cos-v1. If the data is images, we might choose an image encoder like [Vision Transformer](https://huggingface.co/docs/transformers/model_doc/vit) or [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32).

- **Choose metric**

    One widely used metric is [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). Other options are [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), [distance in hyperbolic geometry](https://en.wikipedia.org/wiki/Hyperbolic_geometry) (for tree or graph structures), etc.

- **Test embedding** (critical)

    Once we have chosen the AI model and the metric, it is critical to verify that they work as expected on our dataset. To do so, we can:

    1. [Visualize embeddings in 2D](https://github.com/openai/openai-cookbook/blob/main/examples/Visualizing_embeddings_in_2D.ipynb)
        - Good: works as expected
          ![](https://hackmd.io/_uploads/BJLtiH-r3.png)
        - Bad: choose another AI model
          ![](https://hackmd.io/_uploads/HJX43SWH3.png)
    2. Construct a dataset of ground-truth query-response pairs. For example, if our database contains:
        `Two dogs are playing together on Winter Snow Field`
        `They are angry`
        `Two dogs can be seen joyfully engaging in play`
        `Two dogs find joy in playing together`
        then we can construct the ground-truth pairs:
        `{query: 'two dogs in the snow', response: 'Two dogs are playing together on Winter Snow Field'}`
        `{query: 'They are so resentful', response: 'They are angry'}`
        `...`
        Since we know the correct response for each query, it is easy to verify whether the semantic search returns it.

- **Query transformation**
    - [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/pdf/2212.10496.pdf)
    - ![](https://hackmd.io/_uploads/BkrZZoR4n.jpg)

### Stage 2

We will use question-answering semantic search as an example to illustrate how **Stage 2** works. Assume that stage 1 failed: we have already chosen the AI model and the metric, but the results are not good enough. The next step is data transformation. We can transform either the query or the corpus in the database.

For example, as shown in the figure below, our query is `Can you explain the PTO policy?`, and it might not be close enough to the answers in the embedding space. We then ask: can we transform the query to enrich its information? For example, feed it to an LLM and get a possibly inaccurate answer: `PTO stands for Paid Time Off, which is a policy that outlines the amount of time an employee can take off from work while still being paid.`. We then use this inaccurate answer as a proxy query to search for the real answer in the embedding space. Alternatively, we can generate several answers, concatenate them into a long text, and use this long text to do the semantic search. Either way, we still need to verify that the approach works as expected.

![](https://hackmd.io/_uploads/BkozLelrn.png)

We could also carefully design a prompt to get the proxy query, as in the sketch below.
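The following is a sketch of this query-transformation idea (essentially the zero-shot dense retrieval approach from the paper linked under Stage 1), again assuming sentence-transformers for the embeddings. Here `ask_llm` is a hypothetical stand-in for whichever LLM drafts the proxy answer, and the corpus is an invented placeholder.

```python
# Stage 2 sketch: transform the query into a proxy answer, then search with
# the proxy answer's embedding instead of the raw question's embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# Toy corpus standing in for, e.g., paragraphs of an HR handbook.
corpus = [
    "Full-time employees accrue 20 days of paid time off per year.",
    "The office is closed on national holidays.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real LLM call (ChatGPT, Bard, ...).
    return ("PTO stands for Paid Time Off, which is a policy that outlines "
            "the amount of time an employee can take off from work while "
            "still being paid.")

query = "Can you explain the PTO policy?"
# The possibly inaccurate generated answer acts as the proxy query; it tends
# to lie closer to the real answers in the embedding space than the question.
proxy_query = ask_llm(f"Write a short passage answering: {query}")
proxy_embedding = model.encode(proxy_query, convert_to_tensor=True)

hits = util.semantic_search(proxy_embedding, corpus_embeddings, top_k=1)[0]
print(corpus[hits[0]["corpus_id"]])
```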
### Stage 3

Our ongoing project is to detect new labels and wrong labels. We failed at stage 1 and stage 2, so we started experimenting with stage 3. First, how we failed at stages 1 and 2: briefly, our query is a text label like `Product/Vehicles/Mercedes-Benz`, and we want to find texts that are closely related to it.

![](https://hackmd.io/_uploads/SJYzW5ZB3.png)

We first went through stage 1 and experimented with a number of AI models, including OpenAI embeddings, BERT, and sentence transformers. We also went through stage 2, transforming our text label into a proxy label using ChatGPT and then searching with the embedding of the proxy label. We failed in both cases.

Next, we experimented with stage 3: using our own data to fine-tune the AI model so that the embedding space behaves as we expect. We started from the pretrained model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) from [sentence-transformers](https://huggingface.co/sentence-transformers). We fed in our texts and got the embeddings; the 2D visualization is shown below. Red dots are the embeddings of 30 Acura texts, blue dots are 30 Honda texts, and green dots are the embeddings of the Acura and Honda labels plus some other random labels. The original model did a decent job of grouping the Acura texts (red) and the Honda texts (blue). However, the embeddings of the two text labels, *Service/Vehicles/Maintenance/Acura* and *Service/Vehicles/Maintenance/Honda*, shown as the two green points in the middle, are not close to their own groups of texts. We want the label text *Service/Vehicles/Maintenance/Acura* to be close to the red Acura texts, and *Service/Vehicles/Maintenance/Honda* to be close to the blue Honda texts.

![](https://i.imgur.com/pm2eQYB.png)

We applied the cosine loss to fine-tune the model on the same data shown in the visualization and got a new version of the model. We referred to the [OpenAI embedding model paper](https://cdn.openai.com/papers/Text_and_Code_Embeddings_by_Contrastive_Pre_Training.pdf) and [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/pdf/1908.10084.pdf). The blog post [Contrastive Representation Learning](https://lilianweng.github.io/posts/2021-05-31-contrastive/) gives a good summary of contrastive learning, the different loss functions, and their use cases.

Feeding the same texts to the new model, the visualization becomes:

![](https://i.imgur.com/8Y9OAb7.png)

Now the label text *Service/Vehicles/Maintenance/Acura* is very close to the red Acura texts, and *Service/Vehicles/Maintenance/Honda* is close to the blue Honda texts. In addition, the two groups of texts are separated further from each other.

To summarize, we can manipulate the embedding space using the cosine loss. This loss function can serve as a human-model interface, connecting human intent with the model. A fine-tuning sketch follows.
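Below is a minimal sketch of this fine-tuning step, assuming the sentence-transformers training API with `CosineSimilarityLoss`; the training pairs are invented placeholders for the real label/text data.

```python
# Stage 3 sketch: fine-tune the encoder with the cosine loss so that label
# texts move toward their own group of texts in the embedding space.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each example pairs a label text with an ad text and a target cosine
# similarity: 1.0 pulls the pair together, 0.0 pushes it apart.
train_examples = [
    InputExample(texts=["Service/Vehicles/Maintenance/Acura",
                        "Schedule your Acura brake inspection today"], label=1.0),
    InputExample(texts=["Service/Vehicles/Maintenance/Acura",
                        "Honda oil change special this weekend"], label=0.0),
    InputExample(texts=["Service/Vehicles/Maintenance/Honda",
                        "Honda oil change special this weekend"], label=1.0),
    # ... one InputExample per (label, text) pair in the training set
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=4, warmup_steps=10)
model.save("all-MiniLM-L6-v2-finetuned-labels")
```

After training, re-running the 2D visualization from stage 1 is a quick way to check that the label embeddings have moved toward their own groups of texts.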
### Other resources

- [openai-cookbook](https://github.com/openai/openai-cookbook)