[GenAI][RAG] Chunking Strategies & LLM Agentic Chunking

### [Entry page for AI / ML study notes](https://hackmd.io/@YungHuiHsu/BySsb5dfp)

### GenAI RAG note series

![image](https://hackmd.io/_uploads/r1g1kaXgC.png =600x)
> Modified from [2023.12。IVAN ILIN。Advanced RAG Techniques: an Illustrated Overview](https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6)

## Indexing/Chunking Module series

- [[RAG for GenAI] LLM Agentic Chunking](https://hackmd.io/@YungHuiHsu/Hk1O6n7x0)

### Chunking strategies and considerations

Recommended reading:

#### [2023.01。Pinecone。Chunking Strategies for LLM Applications](https://www.pinecone.io/learn/chunking-strategies/)

The key considerations are summarized below.

- Chunking Considerations
  * **Nature of Content**
    * Consider whether the content is long (e.g., articles, books) or short (tweets, messages)
  * **Embedding Model and Optimal Chunk Sizes**
    * Different models have preferences for chunk sizes.
      * sentence-transformers for sentences
      * text-embedding-ada-002 for 256 or 512 tokens
  * **User Query Expectations**
    * Match query and content embeddings
      * short and specific vs. long and complex
  * **Application of Results**
    * Application's purpose
      * semantic search, question answering, etc.
    * Token limitation
- Determining the optimal chunk size
  - **Preprocessing Data**
    * Ensure data quality by preprocessing, such as removing HTML tags from web-sourced data to reduce noise
  - **Selecting a Range of Chunk Sizes**
    * After preprocessing, explore various chunk sizes, taking the nature of the content and the model's capabilities into account
    * Balance context preservation against accuracy.
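  - **Evaluating Performance of Chunk Sizes**
    * Test different sizes using a dataset to create embeddings
    * Run queries to compare performance
    * Iteratively find the best size for accuracy and context relevance

#### [2024.01。Anurag Mishra。Five Levels of Chunking Strategies in RAG| Notes from Greg’s Video](https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d)

[tutorials](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

- Level 1: Fixed Size Chunking
- Level 2: Recursive Chunking
- Level 3: Document Based Chunking
- Level 4: Semantic Chunking
- Level 5: Agentic Chunking

#### [2024.0213。Pratik Bhavsar，Galileo Labs。Mastering RAG: Advanced Chunking Techniques for LLM Applications](https://www.rungalileo.io/blog/mastering-rag-advanced-chunking-techniques-for-llm-applications?utm_medium=email&_hsenc=p2ANqtz-8maBXx7vmFHU6cHv5jZV16O8KYbSg2CfHBdeeOYdV-pSeS_GDOXkj8HFgvczEebMKOl5pUhKM9tsehpFxWEDnde9--Xg&_hsmi=301760418&utm_content=301761114&utm_source=hs_email)

:::info
- :pencil2: Vector-database cost and LLM-generation cost pull in opposite directions. The smaller and more numerous the chunks, the higher the vector storage cost and the slower the lookup; conversely, the text sent to the LLM for generation may use fewer tokens and contain less noise.
- More complex questions may need more chunks and more complete context.
:::

- Impact of chunking
  * Retrieval quality: chunk quality directly affects the accuracy of retrieval results
  * Vector database cost: chunk size affects the storage cost of the vector database
  * Vector database query latency: the chunking strategy may increase query latency
  * LLM latency and cost: appropriate chunking can reduce the latency and cost of LLM processing
  * LLM hallucinations: appropriate chunking helps keep the model from producing answers that are not grounded in fact
- Factors influencing chunking
  * Text structure: how the text is structured affects how it should be chunked
    * concise short sentences and paragraphs vs. low-information chat logs and meeting minutes
  * Embedding model: the choice of embedding model has a decisive influence on chunking results
  * LLM context length: the model's context-length limit bounds the chunk size
  * Type of questions: the question type guides the choice of chunking strategy
    * question complexity determines whether multiple chunks are needed and how complete the context inside each chunk must be
- Types of chunking
  ![image](https://hackmd.io/_uploads/r1f-4A7eR.png =800x)

---

## LLM Agentic Chunking notes

The notes below mainly cover the key points of the paper behind agentic chunking. The concept comes mainly from the paper below. In the roughly six months since its publication, I have not seen other methods that improve on it, and the only implementation so far appears to be in Langchain (LlamaIndex has not released one yet?).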
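Before turning to the paper, the cost trade-off from the chunking notes above (more, smaller chunks cost more on the vector-database side but less on the LLM side) can be made concrete with a minimal sketch. It implements only Level 1 fixed-size chunking with optional overlap. The names `chunk_fixed` and `sweep`, the use of whitespace-style "words" as a stand-in for model tokens, and the choice of the first `top_k` chunks as a stand-in for retrieved chunks are all assumptions made for illustration; they do not come from the cited articles.

```python
def chunk_fixed(words, size, overlap=0):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def sweep(words, sizes, overlap=0, top_k=3):
    """For each chunk size, count vectors to store and tokens sent to the LLM."""
    rows = []
    for size in sizes:
        chunks = chunk_fixed(words, size, overlap)
        # Assumption: the first top_k chunks stand in for the retrieved ones,
        # so this is only a proxy for the prompt size.
        prompt_tokens = sum(len(c) for c in chunks[:top_k])
        rows.append({"size": size, "n_vectors": len(chunks), "prompt_tokens": prompt_tokens})
    return rows


# Hypothetical 1000-token document; "w0", "w1", ... are placeholder tokens.
words = [f"w{i}" for i in range(1000)]
report = sweep(words, sizes=[64, 128, 256, 512])
for row in report:
    print(row)
```

In this toy run, halving the chunk size doubles the number of vectors to store while roughly halving the tokens that `top_k` chunks add to the prompt. A real evaluation would replace the proxy with actual retrieval and answer-quality measurements on a query set, as the Pinecone notes suggest.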
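### Paper

[arXiv:2312.06648。Dense X Retrieval: What Retrieval Granularity Should We Use?](https://chentong0.github.io/factoid-wiki/)

#### Main results

![image](https://hackmd.io/_uploads/SJGtpGIlC.png =400x)
![image](https://hackmd.io/_uploads/r1wIpfIeA.png =400x)
![image](https://hackmd.io/_uploads/SyrT6fLlA.png)

> The pipeline in the figure shows dense retrieval with propositions as the retrieval unit:
> * A. **Content transformation**: Wikipedia content is converted into propositions by the "Propositionizer". These propositions are concise, self-contained factual statements.
> * B. **Database preparation**: the propositions are compiled into a structured database called FactoidWiki, which serves as the segmented and indexed retrieval corpus.
> * C. **Retrieval**: given a query, the retriever searches this database and identifies the propositions whose vectors are similar to the query vector.
> * D. **Answer generation**: the question-answering model then uses the retrieved propositions to generate an accurate and relevant answer.

#### Key insight

:pencil2: Proposition as a Retrieval Unit

:::info
1. Each proposition should correspond to a distinct piece of meaning in the text, and all propositions together should represent the semantics of the whole text.
2. A proposition should be minimal, i.e. it cannot be split further into separate propositions.
3. A proposition should be contextualized and self-contained. It should include all the context from the text that is needed to interpret its meaning (e.g. resolved coreferences).
:::

- Concrete example

  `"The Eiffel Tower is located in Paris, France, stands 300 metres tall, and was the main exhibition structure of the 1889 World's Fair. The tower was designed by Gustave Eiffel and is a symbol of Paris."`

  Following the definition of a proposition as a retrieval unit, this text is decomposed as follows:
  - `The Eiffel Tower is located in Paris, France.` (geographic location)
  - `The Eiffel Tower is 300 metres tall and was the main exhibition structure of the 1889 World's Fair.` (height and historical significance)
  - `The Eiffel Tower was designed by Gustave Eiffel and is a symbol of Paris.` (designer and cultural significance)

  Each proposition names its subject explicitly, so it is independent and self-contained: even detached from the source text, it still expresses a complete fact, and a reader can understand it without consulting the original passage. Strictly speaking, rule 2 would split the second and third items further, into one proposition per fact (height, historical role, designer, symbol).

#### When the paper's method applies

:pencil: Using propositions as retrieval units can improve retrieval precision, but the propositions may lack sufficient context and the processing is more time-consuming.

- :pencil: Scenarios and document types where it may fit:
  - Tasks that need highly precise information retrieval, such as typical question-answering systems, or looking up specific facts in legal documents and technical documentation. In these scenarios each proposition provides independent, explicit information that can answer the query directly.
- Conversely, scenarios where it may not fit:
  - Highly narrative text, or text that depends on wide-ranging context, such as novels or long-form articles, because the information in these genres often needs broader context to be interpreted and understood.

---

# Supplement

### RAG architecture and modularization

For the current (2024) state of overall RAG architecture, I recommend the following:

- [2023.12。IVAN ILIN。Advanced RAG Techniques: an Illustrated Overview](https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6)
- [arXiv:2312.10997。Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2312.10997)

  More accessible introductions to the survey, especially the blog by its first author, Yunfan Gao:
  - [2024.0325。Prompt Engineering Guide。Retrieval Augmented Generation (RAG) for LLMs](https://www.promptingguide.ai/research/rag)
    - An introduction from the Prompt Engineering Guide that distills the key points

    ![image](https://hackmd.io/_uploads/r1L6pt47A.png =600x)
  - [2024.01。Yunfan Gao。Modular RAG and RAG Flow: Part Ⅰ](https://medium.com/@yufan1602/modular-rag-and-rag-flow-part-%E2%85%B0-e69b32dc13a3)
    - The blog of the paper's first author, Yunfan Gao; it adds many more detailed figures for the individual modules

### Agent + RAG

- [2024.05。Yantraka.ai。Deep Dive into Agentic Retrieval Augmented Generation (A-RAG)](https://www.linkedin.com/pulse/deep-dive-agentic-retrieval-augmented-generation-a-rag-sai-panyam-22dlc/)

  How agent design can be used to extend the capabilities of RAG

  ![image](https://hackmd.io/_uploads/r1A10u47A.png =600x)
  > Plan And Execute Agent: Langchain Experimental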
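### Toy sketch: proposition-level retrieval

As a supplement to the paper notes above, the retrieval step (C) over proposition units can be illustrated with a small sketch. Several things here are assumptions made for illustration and are not the paper's method: the propositions are written by hand from the Eiffel Tower example (split into single facts) rather than produced by the Propositionizer, a bag-of-words cosine similarity stands in for a dense retriever, and the helper names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in 'embedding': lowercase bag-of-words counts (not a dense model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, units, k=1):
    """Return the k retrieval units most similar to the query."""
    q = embed(query)
    ranked = sorted(units, key=lambda u: cosine(q, embed(u)), reverse=True)
    return ranked[:k]


# Hand-written propositions: each names its subject and states a single fact.
propositions = [
    "The Eiffel Tower is located in Paris, France.",
    "The Eiffel Tower is 300 metres tall.",
    "The Eiffel Tower was the main exhibition structure of the 1889 World's Fair.",
    "The Eiffel Tower was designed by Gustave Eiffel.",
    "The Eiffel Tower is a symbol of Paris.",
]

designer = retrieve("Who designed the Eiffel Tower?", propositions, k=1)
height = retrieve("How tall is the Eiffel Tower?", propositions, k=1)
print(designer)
print(height)
```

Because every unit carries exactly one fact together with its subject, the top hit for each question is the single sentence that answers it, which is the precision benefit the notes describe. The same property is why this granularity can lose the wider context that narrative text needs.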
