# [Week 4] Retrieval Augmented Generation [課程目錄](https://areganti.notion.site/Applied-LLMs-Mastery-2024-562ddaa27791463e9a1286199325045c) [課程連結](https://areganti.notion.site/Week-4-Retrieval-Augmented-Generation-1a0754b1ccc645b78edcaf42e9137d86) ## ETMI5: Explain to Me in 5 :::info In this week’s content, we will do an in-depth exploration of Retrieval Augmented Generation (RAG), an AI framework that enhances the capabilities of Large Language Models by integrating real-time, contextually relevant information from external sources during the response generation process. It addresses the limitations of LLMs, such as inconsistency and lack of domain-specific knowledge, hence reducing the risk of generating incorrect or hallucinated responses. ::: :::success 在本週的內容中,我們將深入探索Retrieval Augmented Generation (RAG),這是一種人工智慧框架,透過在響應生成過程中整合來自外部來源的即時上下文相關信息的方式來增強大型語言模型的功能。它解決了LLMs的侷限性,像是不一致性和缺乏特定領域知識,從而降低了產生錯誤或幻覺響應的風險。 ::: :::info RAG operates in three key phases: ingestion, retrieval, and synthesis. In the ingestion phase, documents are segmented into smaller, manageable chunks, which are then transformed into embeddings and stored in an index for efficient retrieval. The retrieval phase involves leveraging the index to retrieve the top-k relevant documents based on similarity metrics when a user query is received. Finally, in the synthesis phase, the LLM utilizes the retrieved information along with its internal training data to formulate accurate responses to user queries. ::: :::success RAG分為三個關鍵階段:擷取、檢索和合成。在擷取階段,文件被分割成更小的、可管理的區塊,然後將其轉換為嵌入並儲存在索引中以進行高效檢索。檢索階段涉及在收到使用者查詢時利用索引根據相似度指標檢索前k個相關文件。最後,在合成階段,LLM利用檢索到的信息及其內部訓練資料來制定對使用者查詢的準確回應。 ::: :::info We will discuss the history of RAG and then delve into the key components of RAG, including ingestion, retrieval, and synthesis, providing detailed insights into each phase's processes and strategies for improvement. We will also go over various challenges associated with RAG, such as data ingestion complexity, efficient embedding, and fine-tuning for generalization and propose solutions to each of them. ::: :::success 我們將討論RAG的歷史,然後深入瞭解RAG的關鍵組成,包括擷取、檢索和合成,提供每個階段的流程和改進策略的詳細見解。我們還會討論與RAG相關的各種挑戰,像是資料擷取的複雜性、高效嵌入和泛化微調,並針對每個挑戰提出解決方案。 ::: ## What is RAG? (Recap) :::info Retrieval Augmented Generation (RAG) is an AI framework that enhances the quality of responses generated by LLMs by incorporating up-to-date and contextually relevant information from external sources during the generation process. It addresses the inconsistency and lack of domain-specific knowledge in LLMs, reducing the chances of hallucinations or incorrect responses. RAG involves two phases: retrieval, where relevant information is searched and retrieved, and content generation, where the LLM synthesizes an answer based on the retrieved information and its internal training data. This approach improves accuracy, allows source verification, and reduces the need for continuous model retraining. ::: :::success Retrieval Augmented Generation (RAG)是一種人工智慧框架,透過在生成過程中整合來自外部來源的最新且上下文相關的信息,以提高LLMs生成的答案的品質。它解決了LLMs中的不一致性與缺乏特定領域知識的問題,減少了幻覺或是不正確回應的機會。RAG涉及兩個階段:檢索(搜尋和檢索相關信息)與內容生成(其中LLMs根據檢索到的信息及其內部訓練資料合成答案)。這種方法提高了準確性,允許來源驗證,並減少了模型一直重新訓練的需求。 ::: ![image](https://hackmd.io/_uploads/H1g4NsciR.png) Image Source: https://www.deeplearning.ai/short-courses/langchain-for-llm-application-development/ :::info The diagram above outlines the fundamental RAG pipeline, consisting of three key components: 1. 
**Ingestion:** - Documents undergo segmentation into chunks, and embeddings are generated from these chunks, subsequently stored in an index. - Chunks are essential for pinpointing the relevant information in response to a given query, resembling a standard retrieval approach. 2. **Retrieval:** - Leveraging the index of embeddings, the system retrieves the top-k documents when a query is received, based on the similarity of embeddings. 3. **Synthesis:** - Examining the chunks as contextual information, the LLM utilizes this knowledge to formulate accurate responses. ::: :::success 上圖概述了基本的RAG pipeline,由三個關鍵組件所組成: 1. **擷取:** - 文件被分割成區塊(chunks),並且從這些區塊產生嵌入,隨後儲存在索引中。 - 區塊對於回應給定查詢的精確定位相關信息來說是非常重要的,類似於標準檢索方法。 2. **檢索:** - 利用嵌入的索引,系統在收到查詢時根據嵌入的相似度來檢索前k個文件。 3. **合成:** - 將這些區塊作為上下文信息來檢查,LLM利用這些知識來製定準確的響應。 ::: :::info 💡Unlike previous methods for domain adaptation, it's important to highlight that RAG doesn't necessitate any model training whatsoever. It can be readily applied without the need for training when specific domain data is provided. ::: :::success 💡與先前的領域適應方法不同,需要強調的是,RAG並不需要任何的模型訓練。當提供特定領域資料的時候,模型不需訓練就能夠開箱即用。 ::: ## History :::info RAG, or Retrieval-Augmented Generation, made its debut in [this](https://arxiv.org/pdf/2005.11401.pdf) paper by Meta. The idea came about in response to the limitations observed in large pre-trained language models regarding their ability to access and manipulate knowledge effectively. ::: :::success RAG或者說Retrieval-Augmented Generation,於Meta的[論文](https://arxiv.org/pdf/2005.11401.pdf)中首次亮相。 這個想法的產生是為了應對大型預訓練語言模型對於其存取與操作知識的能力方面所觀察到的侷限性而提出的 ::: ![image](https://hackmd.io/_uploads/rJUrVocjR.png) Image Source: https://arxiv.org/pdf/2005.11401.pdf :::info Below is a short summary of how the authors introduce the problem and provide a solution: RAG came about because, even though big language models were good at remembering facts and performing specific tasks, they struggled when it came to precisely using and manipulating that knowledge. This became evident in tasks heavy on knowledge, where other specialized models outperformed them. The authors identified challenges in existing models, such as difficulty explaining decisions and keeping up with real-world changes. Before RAG, there were promising results with hybrid models that mixed both parametric and non-parametric memories. Examples like REALM and ORQA combined masked language models with a retriever, showing positive outcomes in this direction. ::: :::success 以下是作者如何介紹問題並提供解決方案的簡短摘要: RAG的出現是因為,儘管大語言模型對於記住事實和執行特定任務的表現良好,不過在精確使用與操縱這些知識時卻遇到了困難。這在知識密集的任務上就變得很明顯,而其它專用模型的表現優於它們。作者指出了現有模型中的挑戰,像是難以解釋決策還有跟上現實世界的變化。在RAG之前,混合參數和非參數記憶的混合模型取得了有希望的結果。像REALM和ORQA這樣的例子將掩碼語言模型與檢索器結合起來,在這個方向上給出正向的結果。 ::: :::info Then, along came RAG, a game-changer in the form of a flexible fine-tuning method for retrieval-augmented generation. RAG combined pre-trained parametric memory (like a seq2seq model) with non-parametric memory from a dense vector index of Wikipedia, accessed through a pre-trained neural retriever like Dense Passage Retriever (DPR). RAG models aimed to enhance pre-trained, parametric-memory generation models by combining them with non-parametric memory through fine-tuning. The seq2seq model in RAG used latent documents retrieved by the neural retriever, creating a model trained end-to-end. Training involved fine-tuning on any seq2seq task, learning both the generator and retriever. Latent documents were then handled using a top-K approximation, either per output or per token. 
:::

:::success
然後,RAG閃亮亮登場了,它改變了這個生態,以靈活微調方法的形式來執行retrieval-augmented generation。RAG結合預訓練的參數化記憶(如seq2seq模型)與來自維基百科的密集向量索引的非參數化記憶,透過預訓練的神經檢索器,像是Dense Passage Retriever (DPR),來進行存取。RAG模型旨在透過微調,將預訓練的參數化記憶生成模型與非參數化記憶結合,從而增強這些模型。RAG中的seq2seq模型使用神經檢索器檢索的潛在文檔,建立一個端到端訓練的模型。訓練涉及對任意seq2seq任務的微調,同時學習生成器和檢索器。然後使用top-K approximation處理潛在文檔,無論是per output還是per token。
:::

:::info
RAG's main significance was moving away from past approaches that proposed adding non-parametric memory to systems. Instead, RAG explored a new approach where both parametric and non-parametric memory components were pre-trained and filled with lots of knowledge. In experiments, RAG proved its worth by achieving top-notch results in open-domain question answering and surpassing previous models in fact verification and knowledge-intensive generation. Another win for RAG was showing it could adapt, allowing the non-parametric memory to be swapped out and updated to keep the model's knowledge fresh in a changing world.
:::

:::success
RAG的主要意義在於摒棄了過去提出在系統中添加非參數化記憶體的方法。相反,RAG探索了一種新方法,其中參數和非參數記憶組件都經過預訓練,並且填入了大量知識。在實驗中,RAG在開放領域問答方面取得了一流的成果,在事實驗證和知識密集型生成方面超越了先前的模型,證明了自己的價值。RAG的另一個勝利是表明它可以適應,允許更換和更新非參數化記憶,以此讓自己在不斷變化的世界中保持模型知識的新鮮度。
:::

## Key Components

:::info
As mentioned earlier, the key elements of RAG involve the processes of ingestion, retrieval, and synthesis. Now, let's delve deeper into each of these components.
:::

:::success
如前所述,RAG的關鍵要素涉及擷取、檢索和合成的過程。現在,讓我們更深入地研究每個組件。
:::

### Ingestion

:::info
In RAG, the ingestion process refers to the handling and preparation of data before it is utilized by the model for generating responses.
:::

:::success
在RAG中,擷取過程是指在模型利用資料生成回應之前,資料的處理與準備。
:::

![image](https://hackmd.io/_uploads/BkiINocsR.png)

:::info
This process involves 3 key steps:

1. **Chunking:** Breaking down input text into smaller, more manageable segments or chunks. This can be based on size, sentences, or other natural divisions within the text. We will dig deeper into chunking strategies in the next sections. As an example, consider a comprehensive article on the Renaissance. The chunking process involves breaking down the article into manageable segments based on natural breaks, such as paragraphs or distinct historical periods (e.g., Early Renaissance, High Renaissance). Each of these segments becomes a chunk, enabling focused analysis by the language model.
:::

:::success
這個過程涉及三個關鍵步驟:

1. **分塊:** 將輸入文本分解為更小、更易於管理的片段或區塊。這可以基於文本的大小、句子或其他自然劃分。我們將在接下來的部分中更深入地探討分割區塊的策略。舉個例子,考慮一篇關於文藝復興的綜合文章。分割區塊的過程涉及根據自然的分界將文章拆解為可管理的片段,例如段落或不同的歷史時期(像是文藝復興早期、文藝復興全盛期)。每個片段都會成為一個區塊,從而可以透過語言模型進行聚焦式的分析。
:::

:::info
2. **Embedding**: Transforming the text or chunks into a vector format that captures essential qualities in a computationally friendly way. This step is crucial for efficient processing by the language model. Following from the previous example, once the article segments are identified, the embedding process transforms the content of each chunk into a vector format. For instance, the section on the High Renaissance could be embedded into a vector that captures key artistic, cultural, and historical aspects.
This vector representation enhances the model's ability to understand and process the nuanced information within the chunk. ::: :::success 2. **嵌入**:將文本或區塊轉換為向量格式,以計算友好的方式捕捉基本品質。這一步對於語言模型的高效處理而言非常重要。接續前面的範例,一旦識別出文章的片段,嵌入過程就會將每個區塊的內容轉換為向量格式。舉例來說,關於文藝復興全盛時期的部分可以嵌入捕捉關鍵藝術、文化和歷史方面的向量。這種向量表示增強了模型理解和處理區塊內細微信息的能力。 ::: :::info 3. **Indexing:** Organizing the embedded data in a structured format optimized for quick and efficient retrieval. This often involves creating a vector representation for each document and storing these vectors in a searchable format, such as a vector database or search engine. In the example we discussed- The indexed database is created by organizing these vector representations of historical events. Each chunk, now represented as a vector, is indexed for efficient retrieval. When a user queries about a specific aspect of the Renaissance, the indexing enables the quick identification and retrieval of the most relevant chunks, providing contextually rich responses. ::: :::success 3. **索引:** 以最佳化的結構化格式組織嵌入資料,以實現快速高效的檢索。這通常涉及為每個文件建立向量表示並將這些向量以可搜尋的格式存儲,像是向量資料庫或是搜尋引擎。在我們討論的範例中,索引資料庫是透過組織這些歷史事件的向量表示所建立的。每個區塊,現在都被表示為向量,都能夠被以索引的方式進行高效的檢索。當有個使用者查詢文藝復興的特定方面時,索引能夠快速識別和檢索最相關的區塊,從而提供上下文豐富的回應。 ::: ### Retrieval :::info The retrieval component involves the following steps: ![image](https://hackmd.io/_uploads/BkRD4sqiR.png) 1. **User Query:** A user poses a natural language query to the LLM. For instance, let’s say we’ve completed the ingestion process for renaissance articles as explained in the above method and a user poses a query, "Tell me about the Renaissance period.” 2. **Query Conversion:** The query is sent to an embedding model, which converts the natural language query into a numeric format, creating an embedding or vector representation. The embedding model is the same as the model used to embed articles in the ingestion phase. 3. **Vector Comparison:** The numeric vectors of the query are compared to vectors in a index of a knowledge base created in the previous phase. This involves measuring similarity or distance metrics between the query vector and vectors stored in the index (often cosine similarity). 4. **Top-K Retrieval:** The system then retrieves the top-K documents or passages from the knowledge base that have the highest similarity to the query vector. This step involves selecting a predefined number (K) of the most relevant documents based on the vector similarities. These embeddings may include information about different aspects of the Renaissance. 5. **Data Retrieval:** The system retrieves the actual content or data from the selected top-K documents in the knowledge base. This content is typically in human-readable form, representing relevant information related to the user's query. ::: :::success 檢索組件涉及以下步驟: 1. **使用者查詢:** 使用者向LLM提出一個自然語言查詢。舉例來說,假設我們已經完成了上述方法中解釋的文藝復興時期文章的擷取過程,並且使用者提出了一個查詢:“告訴我有關文藝復興時期的信息。” 2. **查詢轉換:** 查詢被傳送到嵌入模型,該模型將自然語言查詢轉換為數值格式,然後建立一個嵌入或向量表示。嵌入模型與在擷取階段嵌入文章的模型相同。 3. **向量比較:** 將查詢的數值向量與前一階段所建立的知識庫索引中的向量進行比較。這涉及測量查詢向量和索引中儲存的向量之間的相似性或距離度量(通常是餘弦相似度)。 4. **Top-K 檢索:** 然後系統從知識庫中檢索與查詢向量相似度最高的top-K文件或段落。此步驟涉及根據向量相似度選擇預先定義好的數量(K)個最相關的文件。這些嵌入可能包括有關文藝復興時期不同方面的信息。 5. **資料檢索:** 系統從知識庫中選定的top-K文件中檢索出實際內容或資料。該內容通常是人類可讀的形式,表示與使用者查詢相關的相關信息。 ::: :::info Therefore, at the end of the retrieval phase, the LLM has access to relevant context regarding the segments of the knowledge base that hold utmost relevance to the user's query. 
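To make steps 2 to 4 concrete, here is a minimal, self-contained sketch in plain Python. The `embed` function is only a toy stand-in (random but consistent vectors within one run) so the snippet runs without any external service; in practice it would be replaced by the same embedding model used during ingestion, such as a sentence-transformer.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for the embedding model used during ingestion; swap in a real model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # same string -> same vector within one run
    return rng.random(384)

# Index built during the ingestion phase: one vector per chunk.
chunks = [
    "The Early Renaissance began in Florence in the 15th century.",
    "The High Renaissance produced masters such as Leonardo and Michelangelo.",
    "The printing press accelerated the spread of Renaissance ideas.",
]
index = np.stack([embed(c) for c in chunks])

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)                                    # step 2: query conversion
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))  # step 3: cosine similarity
    top_k = np.argsort(sims)[::-1][:k]                  # step 4: top-K selection
    return [chunks[i] for i in top_k]                   # step 5: return the actual chunk text

print(retrieve("Tell me about the Renaissance period."))
```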
In this example, the retrieval process ensures that the user receives a well-informed response about the Renaissance, drawing on historical documents stored in the knowledge base to provide contextually rich information. ::: :::success 因此,在檢索階段結束的時候,LLM可以存取與使用者查詢最相關的知識庫片段的相關上下文。在這個範例中,檢索過程確保使用者收到有關文藝復興時期見多識廣的回應,利用儲存在知識庫中的歷史文件來提供上下文豐富的信息。 ::: ### Synthesis :::info The Synthesis phase is very similar to regular LLM generation, except that now the LLM has access to additional context from the knowledge base. The LLM presents the final answer to the user, combining its own language generation with information retrieved from the knowledge base. The response may include references to specific documents or historical sources. ::: :::success 合成階段與常規LLM生成非常相似,不同之處在於現在LLM可以從知識庫存取其它上下文。LLM將自己的語言生成與從知識庫檢索的資訊結合,向使用者呈現最終答案。回應可能包括對特定文件或歷史來源的引用。 ::: ![image](https://hackmd.io/_uploads/HymF4scsC.png) ## RAG Challenges :::info Although RAG seems to be a very straightforward way to integrate LLMs with knowledge, there are still the below mentioned open research and application challenges with RAG. 1. **Data Ingestion Complexity:** Dealing with the complexity of ingesting extensive knowledge bases involves overcoming engineering challenges. For instance, parallelizing requests effectively, managing retry mechanisms, and scaling infrastructure are critical considerations. Imagine ingesting large volumes of diverse data sources, such as scientific articles, and ensuring efficient processing for subsequent retrieval and generation tasks. 2. **Efficient Embedding:** Ensuring the efficient embedding of large datasets poses challenges like addressing rate limits, implementing robust retry logic, and managing self-hosted models. Consider the scenario where an AI system needs to embed a vast collection of news articles, requiring strategies to handle changing data, syncing mechanisms, and optimizing embedding costs. 3. **Vector Database Considerations:** Storing data in a vector database introduces considerations such as understanding compute resources, monitoring, sharding, and addressing potential bottlenecks. Think about the challenges involved in maintaining a vector database for a diverse range of documents, each with varying levels of complexity and importance. 4. **Fine-Tuning and Generalization:** Fine-tuning RAG models for specific tasks while ensuring generalization across diverse knowledge-intensive NLP tasks is challenging. For instance, achieving optimal performance in question-answering tasks might require different fine-tuning approaches compared to tasks involving creative language generation, requiring careful balance. 5. **Hybrid Parametric and Non-Parametric Memory:** Integrating parametric and non-parametric memory components in models like RAG presents challenges related to knowledge revision, interpretability, and avoiding hallucinations. Consider the difficulty in ensuring that a language model combines its pre-trained knowledge with dynamically retrieved information, avoiding inaccuracies and maintaining coherence. 6. **Knowledge Update Mechanisms:** Developing mechanisms to update non-parametric memory as real-world knowledge evolves is crucial. Imagine a scenario where RAG models need to adapt to changing information in domains like medicine, where new research findings and treatments continually emerge, requiring timely updates for accurate responses. ::: :::success 儘管RAG看似是一種非常直接將LLMs與知識相結合的方式,但RAG仍然存在下面提到的開放研究與應用挑戰。 1. 
**資料擷取複雜度:** 處理擷取大量知識庫的複雜度涉及克服工程挑戰。像是,高效地並行請求、管理重試機制和擴展基礎架構是關鍵考慮因素。想像一下擷取大量不同的資料來源,像是科學文章,並確保後續檢索和生成任務的高效處理。
2. **高效嵌入:** 確保大型資料集的高效嵌入帶來了諸如解決速率限制、實現穩健的重試邏輯和管理自架服務模型等挑戰。考慮這樣的場景,人工智慧系統需要嵌入大量的新聞文章,需要一種處理不斷變化的資料的策略、同步機制和最佳化嵌入成本。
3. **向量資料庫注意事項:** 把資料存儲在向量資料庫中帶來了一些需要考量的因素,像是了解計算資源、監控、分片和解決潛在瓶頸。考慮一下維護各種文件的向量資料庫所涉及的挑戰,每個文件都有不同程度的複雜性和重要性。
4. **微調和泛化:** 針對特定任務微調RAG模型,同時確保跨不同知識密集型自然語言任務的泛化性是具有挑戰的。例如,與涉及創意語言生成的任務相比,在問答任務中實現最佳效能可能需要不同的微調方法,需要仔細的平衡。
5. **混合參數和非參數記憶:** 在RAG等模型中整合參數和非參數記憶組件帶來了知識校正、可解釋性和避免幻覺等相關挑戰。考慮到如何確保語言模型能夠將其預訓練知識與動態擷取的信息結合,同時避免不準確性並維持一致性,這是一個困難的問題。
6. **知識更新機制:** 隨著現實世界知識的發展,開發更新非參數記憶的機制是很重要的一件事。想像一下這樣一個場景,RAG模型需要適應醫學等領域不斷變化的訊息,新的研究發現和治療方法不斷出現,需要及時更新才能做出準確的回應。
:::

## Improving RAG components (Ingestion)

### 1. **Better Chunking Strategies**

:::info
In the context of enhancing the Ingestion process for the RAG components, adopting advanced chunking strategies is necessary for efficient handling of textual data. In a simple RAG pipeline, a fixed strategy is adopted, i.e., a fixed number of words or characters form a single chunk. Considering the complexities involved in large datasets, the following strategies are being used recently (a short code sketch follows this subsection):

1. **Content-Based Chunking:** Breaks down text based on meaning and sentence structure using techniques like part-of-speech tagging or syntactic parsing. This preserves the sense and coherence of the text. However, one consideration of this chunking is that it requires additional computational resources and algorithmic complexity.
2. **Sentence Chunking:** Involves breaking text into complete and grammatically correct sentences using sentence boundary recognition or speech segmentation. Maintains the unity and completeness of the text but can generate chunks of varying sizes, lacking homogeneity.
3. **Recursive Chunking:** Splits text into chunks of different levels, creating a hierarchical and flexible structure. Offers greater granularity and variety in text, but managing and indexing these chunks involves increased complexity.
:::

:::success
在增強RAG組件的擷取過程的上下文中,採用先進的分塊策略對於有效處理文本資料是必要的。在簡易的RAG pipeline中,採用的就是固定策略,也就是固定數量的word或characters來形成單一的區塊。考慮到大型資料集所涉及到的複雜性,最近大致就是使用以下策略:

1. **Content-Based Chunking:** 使用詞性標記或句法分析等技術,基於語意和句子結構來拆解文本。這保留了文本的意義和連貫性。然而,這種分塊的一個考慮因素是它需要額外的計算資源與演算法複雜性。
2. **Sentence Chunking:** 這涉及使用句子邊界檢測或語音分段將文本拆解為完整且語法正確的句子。保持文本的統一性和完整性,但會產生大小不一的區塊,缺乏同質性。
3. **Recursive Chunking:** 將文字分割成不同層級的區塊,以建立分層且靈活的結構。提供更大的粒度和文本的多樣性,但管理和索引這些區塊會涉及更高的複雜性。
:::
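To make the contrast with the fixed-size baseline concrete, here is a minimal sketch in plain Python with no external chunking library; `chunk_fixed` and `chunk_sentences` are illustrative helpers, not part of any framework.

```python
import re

def chunk_fixed(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Baseline strategy: a fixed number of characters per chunk, with a small overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Sentence chunking: split on sentence boundaries, then pack whole sentences into chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)   # close the current chunk without breaking a sentence
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

article = (
    "The Early Renaissance began in Florence. Artists rediscovered classical forms. "
    "The High Renaissance followed, producing Leonardo and Michelangelo. "
    "Printing helped these ideas spread across Europe."
)
print(chunk_fixed(article)[:2])
print(chunk_sentences(article))
```

Content-based and recursive chunking follow the same shape; they simply swap the splitting rule for part-of-speech/syntactic cues or a hierarchy of separators, respectively.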
### **2. Better Indexing Strategies**

:::info
Improved indexing allows for more efficient search and retrieval of information. When chunks of data are properly indexed, it becomes easier to locate and retrieve specific pieces of information quickly. Some improved strategies include:

1. **Detailed Indexing:** Chunks through sub-parts (e.g., sentences) and assigns each chunk an identifier based on its position and a feature vector based on content. Provides specific context and accuracy but requires more memory and processing time.
2. **Question-Based Indexing:** Chunks through knowledge domains (e.g., topics) and assigns each chunk an identifier based on its category and a vector of characteristics based on relevance. Aligns directly with user requests, enhancing efficiency, but may result in information loss and lower accuracy.
3. **Optimized Indexing with Chunk Summaries:** Generates a summary for each chunk using extraction or compression techniques. Assigns an identifier based on the summary and a feature vector based on similarity. Provides greater synthesis and variety but demands complexity in generating and comparing summaries.
:::

:::success
改進後的索引對於信息的查詢與檢索有著更好的效率。當資料區塊被適當地索引時,對於快速定位與檢索特定信息就會變得更加容易。一些改進的策略包括:

1. **Detailed Indexing:** 透過對子部分(例如句子)進行分塊,然後根據其位置為每個區塊分配一個標識符,並根據內容分配一個特徵向量。這提供了具體的上下文與準確性,不過需要更多的記憶體與處理時間。
2. **Question-Based Indexing:** 透過對知識領域(例如主題)進行分塊,然後根據每個區塊的類別為每個區塊分配一個標識符,並根據相關性為每個區塊分配一個特徵向量。直接與使用者請求保持一致,這可以提高效率,不過可能會導致信息的遺失與準確性的降低。
3. **Optimized Indexing with Chunk Summaries:** 使用擷取或壓縮技術為每個區塊生成摘要。根據摘要分配標識符,並根據相似性分配特徵向量。提供更大的綜合性和多樣性,但在生成與比較摘要的部份則是有著更多的複雜性。
:::

## Improving RAG components (Retrieval)

### 1. **Hypothetical Questions and HyDE:**

:::info
The introduction of hypothetical questions involves generating a question for each chunk, embedding these questions in vectors, and performing a query search against this index of question vectors. This enhances search quality due to higher semantic similarity between queries and hypothetical questions compared to actual chunks. Conversely, HyDE (Hypothetical Document Embeddings) involves generating a hypothetical response given the query, enhancing search quality by leveraging the vector representation of the query and its hypothetical response.
:::

:::success
假設問題的引入涉及為每個區塊生成一個問題,將這些問題嵌入向量中,並針對該問題的向量索引做查詢搜尋。這可以提高搜尋品質,因為相較於實際區塊,查詢和假設問題之間具有更高的語義相似性。相反,HyDE(Hypothetical Document Embeddings)涉及在給定查詢的情況下產生假設響應,透過利用查詢及其假設響應的向量表示來提高搜尋品質。
:::

![image](https://hackmd.io/_uploads/BJlC4i9sA.png)

Image Source: https://arxiv.org/pdf/2212.10496.pdf

### 2. **Context Enrichment:**

:::info
The strategy here aims to retrieve smaller chunks for improved search quality while incorporating surrounding context for reasoning by the Language Model. Two options can be explored:

1. Sentence Window Retrieval: Each sentence in a document is embedded separately to achieve high accuracy in the cosine-distance search between the query and the context. After retrieving the most relevant single sentence, the context window is extended by including a specified number of sentences before and after the retrieved sentence. This extended context is then sent to the LLM for reasoning upon the provided query. The goal is to enhance the LLM's understanding of the context surrounding the retrieved sentence, enabling more informed responses (a short sketch of the windowing step appears after the figure below).
:::

:::success
這裡的策略旨在進行更小的區塊檢索以提高搜尋品質,同時結合周圍的上下文以透過語言模型進行推理。這有兩種選擇可以探索:

1. Sentence Window Retrieval:將文件中的每個句子各別嵌入,以在查詢和上下文之間的餘弦距離搜尋中實現高精準度。在檢索到最相關的單一句子後,透過加入檢索到的句子前後指定數量的句子來擴展上下文視窗。然後將此擴展的上下文丟給LLM,以根據所提供的查詢進行推理。目標是增強LLM對檢索到的句子周圍上下文的理解,從而做出更有見識的回應。
:::

![image](https://hackmd.io/_uploads/HJekBocs0.png)

Image Source: https://medium.com/@shivansh.kaushik/advanced-text-retrieval-with-elasticsearch-llamaindex-sentence-window-retrieval-cb5ea720aa44
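Roughly, the windowing step looks like the following minimal sketch in plain Python (this is not the LlamaIndex implementation; `embed` is again a toy stand-in for a real embedding model, as in the earlier retrieval sketch).

```python
import re
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in embedding model; replace with the model used at ingestion time."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

def sentence_window_retrieve(document: str, query: str, window: int = 1) -> str:
    """Embed every sentence, find the best match, then widen to +/- `window` sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    vectors = np.stack([embed(s) for s in sentences])
    q = embed(query)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    best = int(np.argmax(sims))                        # most relevant single sentence
    lo, hi = max(0, best - window), min(len(sentences), best + window + 1)
    return " ".join(sentences[lo:hi])                  # extended context handed to the LLM

doc = ("Florence was a banking hub. The High Renaissance produced Leonardo and Michelangelo. "
       "Patrons such as the Medici funded the arts. Printing spread these ideas.")
print(sentence_window_retrieve(doc, "Who were the High Renaissance masters?"))
```

The auto-merging retriever described next follows a similar pattern, except that the widening step replaces several retrieved sibling chunks with their shared parent chunk.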
:::info
2. Auto-Merging Retriever: In this approach, documents are initially split into smaller child chunks, each referring to a larger parent chunk. During retrieval, smaller chunks are fetched first. If, among the top retrieved chunks, more than a specified number are linked to the same parent node (larger chunk), the context fed to the LLM is replaced by this parent node. This process can be likened to automatically merging several retrieved chunks into a larger parent chunk, hence the name "auto-merging retriever." The method aims to capture both granularity and context, contributing to more comprehensive and coherent responses from the LLM.
:::

:::success
2. Auto-Merging Retriever:在這個方法中,文件最初被分割成較小的子區塊,每個子區塊都會指向一個較大的父區塊。在檢索期間,首先取得較小的區塊。如果在前幾個檢索到的區塊中,有超過指定數量的區塊連結到同一父節點(較大的區塊),則提供給LLM的上下文將被該父節點取代。這個過程可以比喻為自動將幾個檢索到的區塊合併到一個更大的父區塊中,因此稱為"auto-merging retriever"。該方法旨在捕捉粒度與上下文,有助於LLM做出更全面、更一致的回應。
:::

![image](https://hackmd.io/_uploads/BydeBo5sC.png)

Image Source: https://twitter.com/clusteredbytes

### 3. **Fusion Retrieval or Hybrid Search:**

:::info
This strategy integrates conventional keyword-based search approaches with contemporary semantic search techniques. By incorporating diverse algorithms like tf-idf (term frequency–inverse document frequency) or BM25 alongside vector-based search, RAG systems can harness the benefits of both semantic relevance and keyword matching, resulting in more thorough and inclusive search outcomes.
:::

:::success
這個策略整合了傳統的關鍵字搜尋方法與現代的語義搜尋技術。透過將tf-idf(詞頻–逆向文件頻率)、BM25等演算法與基於向量的搜尋相結合,RAG系統可以同時利用語義相關性和關鍵字匹配的優勢,從而產生更全面、更具包容性的搜尋結果。
:::

![image](https://hackmd.io/_uploads/Sku-BiciR.png)

Image Source: https://towardsdatascience.com/improving-retrieval-performance-in-rag-pipelines-with-hybrid-search-c75203c2f2f5

### 4. **Reranking & Filtering:**

:::info
Post-retrieval refinement is performed through filtering, reranking, or transformations. LlamaIndex provides various Postprocessors, allowing the filtering of results based on similarity score, keywords, metadata, or reranking with models like LLMs or sentence-transformer cross-encoders. This step precedes the final presentation of retrieved context to the LLM for answer generation.
:::

:::success
Post-retrieval refinement是透過過濾、重新排序或轉換來執行的。LlamaIndex提供各種Postprocessors,允許根據相似度分數、關鍵字、元資料來過濾結果,或使用像LLM或sentence-transformer cross-encoders等模型重新排序。這個步驟是在最終將擷取到的上下文呈現給LLM生成答案之前進行的。
:::

![image](https://hackmd.io/_uploads/r1oGro9j0.png)

Image Source: https://www.pinecone.io/learn/series/rag/rerankers/

### 5. **Query Transformations and Routing [[Source](https://blog.langchain.dev/deconstructing-rag/)]**

:::info
Query transformation methods enhance retrieval by breaking down complex queries into sub-questions (Expansion) and improving poorly worded queries through re-writing, while dynamic Query Routing optimizes data retrieval across diverse sources. Popular approaches are outlined below; a short sketch of query re-writing and expansion follows the Query Transformations list.
:::

:::success
查詢轉換方法透過將複雜查詢分解為子問題(擴展)並透過重寫改進措辭不佳的查詢來增強檢索的結果。同時,動態查詢路由優化了不同來源的資料檢索。以下是常見的方法。
:::

### **Query Transformations**

:::info
1. **Query Expansion**: Query expansion decomposes the input into sub-questions, each of which is a narrower retrieval challenge. For example, a question about physics can be stepped-back into a question (and LLM-generated answer) about the physical principles behind the user query.
2. **Query Re-writing**: Addressing poorly framed or worded user queries, the [Rewrite-Retrieve-Read](https://arxiv.org/pdf/2305.14283.pdf?ref=blog.langchain.dev) approach involves rephrasing questions to enhance retrieval effectiveness. The method is explained in detail in the paper.
3. **Query Compression**: In scenarios where a user question follows a broader chat conversation, the full conversational context may be necessary to answer the question. Query compression is utilized to condense chat history into a final question for retrieval.
:::

:::success
1. **Query Expansion**:查詢擴展將輸入分解為子問題,每個子問題都是一個更精細的檢索挑戰。舉例來說,一個關於物理的問題可以退一步變成關於使用者查詢背後的物理原理的問題(以及LLM生成的答案)。
2. **Query Re-writing**:針對設計或措詞不佳的使用者查詢,[Rewrite-Retrieve-Read](https://arxiv.org/pdf/2305.14283.pdf?ref=blog.langchain.dev)方法涉及重新表述問題以提高檢索效果。該方法在論文中有詳細解釋。
3. **Query Compression**:在使用者問題伴隨著更廣泛的聊天對話的情況下,可能需要完整的對話上下文來回答問題。Query compression用於將聊天歷史記錄壓縮為最終的問題以供檢索。
:::
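As a rough illustration of query re-writing and expansion, here is a minimal sketch; `llm` is a placeholder for whatever chat-completion client is available, and the prompts are illustrative rather than taken from the Rewrite-Retrieve-Read paper.

```python
def llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; swap in your own client here."""
    raise NotImplementedError

def rewrite_query(user_query: str) -> str:
    """Query re-writing: rephrase a poorly worded query into a search-friendly one."""
    prompt = (
        "Rewrite the following question so that it is clear, specific, and well suited "
        f"to searching a document index. Return only the rewritten question.\n\nQuestion: {user_query}"
    )
    return llm(prompt).strip()

def expand_query(user_query: str, n: int = 3) -> list[str]:
    """Query expansion: decompose the query into narrower sub-questions."""
    prompt = (
        f"Break the following question into at most {n} simpler sub-questions, "
        f"one per line.\n\nQuestion: {user_query}"
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]

# Usage idea: each sub-question (and/or the rewritten query) is embedded and retrieved
# separately, and the retrieved chunks are pooled before the synthesis step.
```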
### **Query Routing**

:::info
1. **Dynamic Query Routing**: The question of where the data resides is crucial in RAG, especially in production settings with diverse data-stores. Dynamic query routing, supported by LLMs, efficiently directs incoming queries to the appropriate datastores. This dynamic routing adapts to different sources and optimizes the retrieval process.
:::

:::success
1. **動態查詢路由**:對RAG而言,資料存放在什麼地方的問題是非常重要的,尤其是在具有不同資料儲存的生產環境中。LLMs所支援的動態查詢路由可以有效地將傳入的查詢導向至適當的資料儲存。這種動態路由適應不同的來源並最佳化檢索過程。
:::

## Improving RAG components (Generation)

:::info
The most straightforward method for LLM generation involves concatenating all the relevant context pieces that surpass a predefined relevance threshold and presenting them, along with the query, to the LLM in a single call. However, more advanced alternatives exist, necessitating multiple calls to the LLM to iteratively enhance the retrieved context, ultimately leading to the generation of a more refined and improved answer. Some methods are illustrated below.
:::

:::success
最直接的LLM生成方法涉及連接所有超過預先定義的相關性閾值的相關上下文片段,並將這些連接起來的片段與查詢以單一請求的方式呈現給LLM。然而,有更進階的替代方案,需要多次呼叫LLM來迭代增強檢索到的上下文,最終產生更完善和改進的答案。下面說明了一些方法。
:::

### 1. **Response Synthesis Approaches:**

:::info
Involves 3 approaches:

1. **Iterative Refinement:** Refine the answer by sending the retrieved context to the Language Model chunk by chunk.
2. **Summarization:** Summarize the retrieved context to fit into the prompt and generate a concise answer.
3. **Multiple Answers and Concatenation:** Generate multiple answers based on different context chunks and then concatenate or summarize them.
:::

:::success
涉及3種做法:

1. **Iterative Refinement:** 透過將檢索到的上下文逐塊發送到語言模型來改善答案。
2. **Summarization:** 總結檢索到的上下文來構成提示詞,並生成簡潔的答案。
3. **Multiple Answers and Concatenation:** 根據不同的上下文區塊生成多個答案,然後將之串聯或總結。
:::

### 2. **Encoder and LLM Fine-Tuning:**

:::info
This approach involves fine-tuning the models within our RAG pipeline.

1. **Encoder Fine-Tuning:** Fine-tune the Transformer Encoder for better embedding quality and context retrieval.
2. **Ranker Fine-Tuning:** Use a cross-encoder for reranking retrieved results, especially if there's a lack of trust in the base Encoder.
3. **RA-DIT Technique:** Use a technique like RA-DIT to tune both the LLM and the Retriever on triplets of query, context, and answer.
:::

:::success
這種方法涉及對我們的RAG pipeline中的模型進行微調。

1. **Encoder Fine-Tuning:** 微調Transformer Encoder以獲得更好的嵌入品質和上下文檢索。
2. **Ranker Fine-Tuning:** 使用cross-encoder對檢索到的結果進行重新排名,尤其是在對基礎Encoder缺乏信任的情況下。
3. **RA-DIT Technique:** 使用像是RA-DIT等技術,根據查詢、上下文和答案三元組來調整LLM與Retriever。
:::

## Read/Watch These Resources (Optional)

1. Building Production Ready RAG Applications: https://www.youtube.com/watch?v=TRjq7t2Ms5I
2. Amazon article on RAG: https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-rag.html
3. Huggingface tools for RAG: https://huggingface.co/docs/transformers/model_doc/rag
4. 12 RAG Pain Points and Proposed Solutions: https://towardsdatascience.com/12-rag-pain-points-and-proposed-solutions-43709939a28c

## Read These Papers (Optional)

[Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/pdf/2312.10997.pdf)

[Seven Failure Points When Engineering a Retrieval Augmented Generation System](https://arxiv.org/abs/2401.05856)