[Preprocessing Unstructured Data for LLM Applications] - Metadata Extraction and Chunking

### [AI / ML Study Notes Index](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
#### [Deeplearning.ai GenAI/LLM Course Notes](https://learn.deeplearning.ai/)
- [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
- [Building and Evaluating Advanced RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [Preprocessing Unstructured Data for LLM Applications](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
    - [Normalizing the Content](https://hackmd.io/a6pABFa5RyCk6uitgJ9qEA?both)
    - [Metadata Extraction and Chunking](https://hackmd.io/@YungHuiHsu/HyJhA80lA)
    - [Preprocessing PDFs and Images](https://hackmd.io/@YungHuiHsu/SkJUlPCeA)
    - [Extracting Tables](https://hackmd.io/@YungHuiHsu/HJEE5rkbC)
    - [Build Your Own RAG Bot]()

---

# [Preprocessing Unstructured Data for LLM Applications](https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications/)

## Metadata Extraction and Chunking

### What is Metadata?
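- **Document Details**
    - Metadata provides additional information about the content extracted from the source document
- **Source Identification**
    - One type of metadata describes the document itself, such as the filename, source URL, or file type. It helps trace where a document came from and what format it is in
- **Structural Metadata**
    - Metadata can also be built from the structure of the document
    - Examples: element type, hierarchy information, section information
    - Structural metadata describes the layout and organization of the document, such as section divisions, heading hierarchy, and how the individual elements are arranged

    ![image](https://hackmd.io/_uploads/BJ8fuNCxA.png =400x)
- **Search Enhancement**
    - In a RAG (Retrieval-Augmented Generation) system, metadata provides the filtering options for hybrid search
    - :pencil2: Metadata gives the search system extra filter conditions, which makes the results more precise and relevant

### Semantic Search for LLMs

#### Semantic Search and Vector Databases

- **Goal**
    - Understand the meaning of the input text, then find content with a similar meaning in a large collection of documents, so that the retrieved content can be used to build or improve the prompt for the query
- **Embedding**
    - Text embedding converts natural-language text into a numeric vector
    - The vectors are designed so that the semantic similarity of two texts can be measured with vector operations, for example the cosine similarity of their vectors
- **Vector Database**
    - A vector database is designed to store vectors and to run fast similarity queries over them
- **Prompt Templating**
    - Prompt templating inserts the selected text or information into a predefined template to produce the prompt for a specific query or task. It is common in machine learning and AI applications, especially where a specific kind of response or action has to be generated.

![image](https://hackmd.io/_uploads/rJ9PcERxR.png =200x)

> - Vector database workflow
>     1. **Load**
>         - Store the vectors produced from the text in the vector database together with their source information, so that the original documents can later be looked up by vector.
>
>     2. **Query Embed**
>         - Convert the user's input text into a vector. This prepares the query: with this vector, the system can search the database for semantically similar documents.
>
>     3. **Compare and Retrieve**
>         - The system compares the query vector with the vectors stored in the database, usually with cosine similarity or another similarity measure. It then returns the top k documents by similarity score, which are the ones closest in meaning to the user's query

### Hybrid Search

#### Challenges of Semantic Similarity Search

- **Too Many Matches**
    - When the number of documents is large, there may be too many semantically similar matches
- **Most Recent Information**
    - Users may want the most recent information, not just the most semantically similar
    - In applications such as news updates or academic research, recent information is often more valuable than outdated information. The system therefore has to identify and prioritize the newest documents, even when their semantic similarity is slightly lower
- **Loss of Important Information**
    - Information in the document that matters for search, such as section information, may be lost
    - When a document is converted into vectors, key structural or content information, such as section titles or the content of a particular paragraph, may not be fully captured. The missing information can reduce the relevance and usefulness of the search results

#### Hybrid Search Strategy

- **Hybrid Strategy**
    - Hybrid search combines semantic search with other information retrieval techniques, such as filtering and keyword search
    - Combining the depth of semantic understanding with the precision of traditional keyword search improves both search efficiency and the relevance of the results
    - Semantic search captures the deeper meaning of the query, while keyword search filters quickly on specific terms
- **Filtering Options**
    - Document metadata provides the filtering options for hybrid search
    - During a hybrid search, metadata (such as creation date, author, or file type) can be used to filter the results
    - For example, a user can set a filter to show only documents published within the last month, or to exclude documents in certain formats.
    - Example: Most Recent Information (for example, return only information from a specific date or newer)

    ![image](https://hackmd.io/_uploads/Syhj3V0gR.png =400x)

#### [Lab L3: Metadata Extraction and Chunking](https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications/lesson/4/metadata-extraction-and-chunking)

* Environment setup
```python
# Warning control
import warnings
warnings.filterwarnings('ignore')

# Silence library logging in the notebook
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

import json
from IPython.display import JSON

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Course helper that supplies the API key and URL
from Utils import Utils
utils = Utils()

DLAI_API_KEY = utils.get_dlai_api_key()
DLAI_API_URL = utils.get_dlai_url()

s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)
```
- Run the document through the Unstructured API

    ![image](https://hackmd.io/_uploads/SkaLlHRe0.png =200x)
```python
filename = "example_files/winter-sports.epub"

# Read the EPUB as bytes and wrap it for upload
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(files=files)

# Partition the document into elements
# (if this fails, resp is undefined and the next line raises NameError)
try:
    resp = s.general.partition(req)
except SDKError as e:
    print(e)

# Show the first three elements
JSON(json.dumps(resp.elements[0:3], indent=2))
```
![image](https://hackmd.io/_uploads/BkvKgrReC.png =400x)

- Find the elements that are titles
    - Filter for title elements that contain the word "hockey"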
    - This returns the chapter name `'text': 'ICE-HOCKEY'`
```python
[x for x in resp.elements if x['type'] == 'Title' and 'hockey' in x['text'].lower()]

# [{'type': 'Title',
#   'element_id': 'dcea2070d3d9171c9c817967917ca77b',
#   'text': 'ICE-HOCKEY',
#   'metadata': {'page_number': 2,
#    'languages': ['eng'],
#    'filename': 'winter-sports.epub',
#    'filetype': 'application/epub'}},
#  {'type': 'Title',
#   'element_id': '8738b816c128ffcafa1fbfedd3be5f44',
#   'text': 'ICE HOCKEY',
#   'metadata': {'page_number': 2,
#    'languages': ['eng'],
#    'filename': 'winter-sports.epub',
#    'filetype': 'application/epub'}}]
```
- Get the chapter IDs
```python
chapters = [
    "THE SUN-SEEKER",
    "RINKS AND SKATERS",
    "TEES AND CRAMPITS",
    "ICE-HOCKEY",
    "SKI-ING",
    "NOTES ON WINTER RESORTS",
    "FOR PARENTS AND GUARDIANS",
]

# Map the element_id of each chapter's Title element to the chapter name
chapter_ids = {}
for element in resp.elements:
    for chapter in chapters:
        if element["text"] == chapter and element["type"] == "Title":
            chapter_ids[element["element_id"]] = chapter
            break

# Result:
# {'d9725192dcad4b316b9000b2e0b9d803': 'THE SUN-SEEKER',
#  '85a84e38543ad3417cdef25e3616cbbc': 'RINKS AND SKATERS',
#  '5229c64d88a2a9bd98581911deedc70c': 'TEES AND CRAMPITS',
#  'dcea2070d3d9171c9c817967917ca77b': 'ICE-HOCKEY',
#  '0354439dc766447e6406fdd882e8e918': 'SKI-ING',
#  '272ffcc12c87aef80e8c6d59aeb699b0': 'NOTES ON WINTER RESORTS',
#  'bc0864682405480021aabbd2e39fb6c9': 'FOR PARENTS AND GUARDIANS'}
```
- Load documents into a vector db
    - :warning: The `metadata` argument passed to `create_collection` configures the vector database collection itself. It is not the document metadata extracted above, so do not confuse the two
    - `metadata={"hnsw:space": "cosine"}`
        - A configuration parameter for the collection
        - The `hnsw:` prefix marks a setting of the HNSW (Hierarchical Navigable Small World) graph index
        - `"cosine"` selects cosine distance (1 minus cosine similarity) as the distance metric
```python
import chromadb

client = chromadb.PersistentClient(path="chroma_tmp", settings=chromadb.Settings(allow_reset=True))
client.reset()

collection = client.create_collection(
    name="winter_sports",
    metadata={"hnsw:space": "cosine"}
)

# Store each element's text, tagged with the chapter its parent title belongs to
for element in resp.elements:
    parent_id = element["metadata"].get("parent_id")
    chapter = chapter_ids.get(parent_id, "")
    collection.add(
        documents=[element["text"]],
        ids=[element["element_id"]],
        metadatas=[{"chapter": chapter}]
    )
```
- Perform a hybrid search with metadata
    - Use `where={"chapter": "ICE-HOCKEY"}` to restrict the search to one chapter
```python
result = collection.query(
    query_texts=["How many players are on a team?"],
    n_results=2,
    where={"chapter": "ICE-HOCKEY"},
)
print(json.dumps(result, indent=2))
```
- return
```python
{
  "ids": [
    [
      "392b0250af6c3b3a472b9bba6c9fc813",
      "e66fc77667bdc277cfa11527f8796648"
    ]
  ],
  "distances": [
    [
      0.5229758024215698,
      0.7836340665817261
    ]
  ],
  "metadatas": [
    [
      {
        "chapter": "ICE-HOCKEY"
      },
      {
        "chapter": "ICE-HOCKEY"
      }
    ]
  ],
  "embeddings": null,
  "documents": [ ... ],
  "uris": null,
  "data": null
}
```

### Chunking

![image](https://hackmd.io/_uploads/S1UTiHRgA.png)

#### Why Chunking Is Needed

- **Chunking Necessity**
    - Chunking splits a large document into smaller, more manageable parts. It helps the vector database process and retrieve information more effectively, especially with large datasets
- **Query Result Variability**
    - How a document is chunked affects the consistency and relevance of query results. Different chunking strategies can scatter important information across chunks and hurt retrieval
- **Even Size Chunks**
    - The simplest approach is to split the document into chunks of roughly equal size
    - However, it can split related content in the wrong places, which hurts both understanding and downstream use of the data.
- **Chunking by Atomic Elements**
    - Once atomic elements (suitable retrieval units) are identified, chunks can be built by combining elements instead of splitting the raw text
    - This produces more coherent chunks, which improves the completeness and usability of the information
    - For example, content under the same section title is combined into the same chunk. Related information stays together, which preserves the context of the content

#### Constructing Chunks from Document Elements

- **Partitioning**
    - The first step in creating chunks: break the document into its most basic components, such as paragraphs, sentences, or titles
- **Combine Elements into Chunks**
    - Add a document element to the chunk, then keep adding elements until the chunk reaches the configured size limit (a character or token threshold)
- **Apply Break Conditions**
    - Apply the conditions for starting a new chunk, such as reaching a new title element (indicating a new section), a change in the element metadata (indicating a new page or section), or the content exceeding a threshold
    - These conditions decide when to stop growing the current chunk and start a new one, which keeps the content organized and its context clear
    - Basic combinative chunking: elements are only combined, never split
- **Combine Smaller Chunks**
    - (Optional) Small chunks may not carry enough context for effective semantic search, so they can be merged
- Chunking Strategy

:::success
- **Coherent Chunks**
    - Content from the same document element stays together, so each chunk keeps its internal logic and structure
- **Structured Chunks**
    - Chunking along the document's natural structure (sections, titles, paragraphs) preserves the structure and context of the content
- **Fast Experimentation**
    - Building chunks from document elements allows fast experimentation
    - Partitioning whole documents can take a lot of processing time and resources, especially on large datasets
    - Once a document has been partitioned, however, regrouping its elements into different chunks for an experiment is comparatively easy and cheap

:pencil2: partitioning is expensive, chunking is fast
:::

- Example

    ![image](https://hackmd.io/_uploads/B1Xx980xC.png =400x)
    > Text chunking. Top: by a fixed number of characters. Bottom: by document structure. The latter clearly preserves the structure and context of the content

- Chunk the text using title elements
    - If an element has fewer than 100 characters, it is combined with its neighboring elements
    - If a chunk would exceed the maximum character limit (3000 here), a new chunk is started
    - The 752 elements are merged into about 250 larger chunks (253 in the run below)
```python
from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements

# Convert the API's dict output back into element objects
elements = dict_to_elements(resp.elements)

chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=100,
    max_characters=3000,
)

len(elements)  # 752
len(chunks)  # 253

JSON(json.dumps(chunks[0].to_dict(), indent=2))

# {'type': 'CompositeElement',
#  'element_id': '064cf072a51da0b00f32510011bfa7dc',
#  'text': 'The Project Gutenberg eBook of Winter Sports in\nSwitzerland, by E. F. Benson\n\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online at\n\n\nwww.gutenberg.org. If you are not located\nin the United States, you’ll have to check the laws of the country where\nyou are located before using this eBook.',
#  'metadata': {'filename': 'winter-sports.epub',
#   'filetype': 'application/epub',
#   'languages': ['eng'],
#   'page_number': 1}}
```
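
To make the "combine elements, then apply break conditions" idea from this section concrete, here is a minimal plain-Python sketch. It is not how `unstructured` implements `chunk_by_title`: the function name `combine_elements_into_chunks`, the `"\n\n"` separator, and the sample elements are all assumptions made for illustration. It applies only two break conditions, a new `Title` element and a character limit, and it never splits an element.

```python
from typing import Dict, List


def combine_elements_into_chunks(
    elements: List[Dict[str, str]],
    max_characters: int = 3000,
) -> List[str]:
    """Greedily combine element texts into chunks (hypothetical helper).

    A new chunk starts when a Title element is reached, or when adding the
    next element would push the current chunk past max_characters.
    Elements are only combined, never split, so a single element longer
    than max_characters still becomes one oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for element in elements:
        text = element["text"]
        # +2 accounts for the "\n\n" separator between element texts
        added_len = len(text) + (2 if current else 0)
        starts_section = element["type"] == "Title"
        too_long = current_len + added_len > max_characters

        # Break condition: close the current chunk and start a new one
        if current and (starts_section or too_long):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added_len = len(text)

        current.append(text)
        current_len += added_len

    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Made-up elements in the same dict shape as resp.elements
sample_elements = [
    {"type": "Title", "text": "ICE-HOCKEY"},
    {"type": "NarrativeText", "text": "Hockey is played on ice."},
    {"type": "NarrativeText", "text": "Each side fields a team."},
    {"type": "Title", "text": "SKI-ING"},
    {"type": "NarrativeText", "text": "Ski-ing needs snow."},
]

chunks = combine_elements_into_chunks(sample_elements, max_characters=40)
for chunk in chunks:
    print(repr(chunk))
```

With the tiny 40-character limit, the third element no longer fits next to the first two and opens a second chunk, and the `SKI-ING` title opens a third. Because the function works on already partitioned elements, trying a different limit is just another cheap function call, which is the "partitioning is expensive, chunking is fast" point made above.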