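### [AI / ML study notes: index page](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
#### [Deeplearning.ai GenAI/LLM course notes](https://learn.deeplearning.ai/)
- [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
- [Building and Evaluating Advanced RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [Preprocessing Unstructured Data for LLM Applications](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
    - [Normalizing the Content](https://hackmd.io/a6pABFa5RyCk6uitgJ9qEA?both)
    - [Metadata Extraction and Chunking](https://hackmd.io/@YungHuiHsu/HyJhA80lA)
    - [Preprocessing PDFs and Images](https://hackmd.io/@YungHuiHsu/SkJUlPCeA)
    - [Extracting Tables](https://hackmd.io/@YungHuiHsu/HJEE5rkbC)
    - [Build Your Own RAG Bot]()

---

# [Preprocessing Unstructured Data for LLM Applications](https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications/)
## Preprocessing PDFs and Images

:::warning
Unstructured.IO's PDF parsing relies on model inference: your data is sent to their service through an API call and the parsed result is returned. If private data must not leave your environment, you will probably still need to do the parsing yourself. The notes below point to a few directions for building a pipeline from open-source tools.
:::

### Document Image Analysis

:::success
Extract formatting information and text from the raw image of a document
- Convert scanned or photographed document images into a digital format that can be analyzed and processed
- Covers both text recognition (e.g. OCR) and format recognition, so that a digital version of the document can be rebuilt from a static image, including its structure and formatting details
:::

- **Preprocessing with Rules-based Parsers**
    - Many document types, such as HTML, Word documents and Markdown, carry formatting information
    - The parser uses the document's inherent structural rules to identify and extract information. In an HTML document, for example, the parser can recognize titles, paragraphs and other elements from the tags
- **Visual Information**
    - :pencil2: For other document types, such as PDFs and images, the formatting information is **visual**
    - The formatting is not marked up explicitly; **it is expressed through visual elements such as page layout, fonts and graphics**

### Document Image Analysis (DIA) Methods
The course covers two techniques: 1) Document Layout Detection (DLD) and 2) Vision Transformer (ViT)

- **Document Layout Detection (DLD)**
    - Uses an object detection model to draw and label bounding boxes around layout elements on the document image
    - By identifying and classifying the visual elements in a document (text blocks, images, tables, and so on), it recovers the document's structure and turns a plain image into structured, actionable information
- **Vision Transformer (ViT)**
    - The model takes the document image as input and produces a structured text representation (e.g. JSON) as output
    - The image is split into small patches, which are converted into a sequence of tokens that can be fed to a transformer
    - The model learns the relationships between these patches and builds up an understanding of the whole document image
    - A text prompt can optionally be included, which helps the model understand and generate the desired output format

:pencil: In principle this is the same idea as current work on training vision models to understand and align with text and spatial relationships. See [From Representation to Interface: The Evolution of Foundation for Vision Understanding- CH2 Visual Understanding notes](https://hackmd.io/@YungHuiHsu/HyjAklf4T)

#### DIA Methods - 1. Document Layout Detection > OCR Postprocessing

```mermaid
graph LR
    A["Image\n(Documents)"] --> Model["Document Layout\nModels"]
    Model--> C[Cropped \nTable/Image/TextBox]
    C --> OCR["OCR"]
    OCR --> Output["Extracted\nUnstructured Text"]
    style A fill:#f0f0f0, stroke:#333, stroke-width:2px
    style Model fill:#ccffcc, stroke:#333, stroke-width:2px
    style C fill:#f0f0f0, stroke:#333, stroke-width:2px
    style OCR fill:#ccffcc, stroke:#333, stroke-width:2px
    style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
```

Document layout detection identifies and classifies the visual elements in a document so that information can be extracted and processed effectively. The main steps are:

- **Visual detection**
    - An object detection model such as YOLOX or Detectron2 identifies and classifies bounding boxes in the document
    - The model locates the different layout elements (text blocks, images, tables, and so on) quickly and accurately
    - It draws a bounding box around each detected element; these boxes are the basis for text extraction and further analysis

    ![image](https://hackmd.io/_uploads/SJW2dwCxA.png =400x)
    > [Source: YOLOX: Exceeding YOLO Series in 2021](https://arxiv.org/abs/2107.08430)
- **Text extraction**
    - OCR is applied to the bounding boxes that need it
    - Once the boxes have been detected, OCR extracts the text from the image data inside each box

    :pencil: From tests I have seen, I recall this approach not performing very well
- **Direct extraction**
    - Some PDF files contain metadata and a structured text layer, so the text can be read directly from the file without any image recognition
    - This is more efficient than OCR because the text is read straight from the document's data structures, with no image processing involved

##### Analysis
- Pros
    - **Fixed set of element types**: the element types are predefined, which makes standardized processing easier
    - **Bounding box information**: the spatial position of each element is available directly
- Cons
    - **Requires two model calls**: an object detection model has to find the elements before an OCR model can extract the text, which adds processing steps and complexity
    - **Less flexible**: limited ability to handle non-standard document types

#### DIA Methods - 2. Vision Transformers

```mermaid
graph LR
    A["Cropped Image\n(Documents)"] --> Model
    P[Prompt] --> Model
    Model("Table Structure Recognition Model")
    Model --> Output["Structured Output (JSON, etc.)"]
    style A fill:#f0f0f0, stroke:#333, stroke-width:2px
    style P fill:#f0f0f0, stroke:#333, stroke-width:2px
    style Model fill:#ccffcc, stroke:#333, stroke-width:2px
    style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
```

:::info
Personally, I think the Unstructured.IO instructor's framing of vision models is off: ViT (Vision Transformer) is a family of model architectures, not a document image analysis (DIA) method
:::

- **Visual understanding**: the input image is passed to an encoder, and a decoder produces the text output
    - The encoder first splits the image into patches, which pass through a stack of transformer layers that capture the visual relationships and features in the image (the basic principle of ViT)
    - The decoder then uses these features to generate the corresponding text, which can directly express the content of the image (cross-attention aligns the image features with the text embeddings)
- **DONUT architecture (Document Understanding Transformer)**
    - DONUT is designed for document understanding and text generation: it recognizes and structures information directly from the document image instead of merely converting the image to text
- **Direct conversion**: no OCR needed, the image input is converted directly to text
    - With a ViT, the model parses the image and turns it into machine-readable text directly, skipping the traditional OCR step
- **Structured training**: the model can be trained to output valid JSON strings and structured document output
    - With structured training, the model not only generates text but also emits it in a specific structured format such as JSON

Together these properties give a direct path from image to text while preserving the structure and richness of the data.

![image](https://hackmd.io/_uploads/HJlj020x0.png =400x)
![image](https://hackmd.io/_uploads/ByX-lT0g0.png =300x)
> - Top: OCR-based approach
> - Bottom: OCR-free approach

- Input (image) --> model --> output (JSON)

![image](https://hackmd.io/_uploads/rJpxahCe0.png =600x)
![image](https://hackmd.io/_uploads/rkM5pnRgA.png =600x)
> Source: [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664)
> - Donut is an end-to-end (E2E) model that takes the document as an image and outputs structured data in JSON format
> - Donut uses a Vision Transformer architecture and understands documents in an OCR-free way: it does not rely on traditional optical character recognition to read the document, but learns to extract structured information directly from the image

:pencil: This method suits parsing the layout of a document and pulling out the text inside each element (e.g. titles, body text), but a deeper understanding of images and tables still needs something more. Tables are covered in the next chapter of the course, [Extracting Tables]() (but what about images?).

##### Analysis
- Pros
    - More flexible: adapts better to non-standard document types such as forms
    - Easier to adapt to new kinds of documents: the model can be adjusted and retrained for new datasets or requirements
- Cons
    - The model may hallucinate or repeat itself: because it is generative, it can produce content that is not in the document or repeat existing content
    - High compute cost: the large parameter count and complex network structure make inference expensive

:::info
Today's multi-modal / Vision Language Model (VLM) solutions can be seen as a third category that fills the gap in understanding charts and tables
- Text: parsed by method 1 or 2 above and converted into structured output
- Charts and tables: cropped out by a dedicated object detection model, then interpreted by a multi-modal vision language model

Use either a "large language model with visual understanding" or a "vision foundation model + large language model" to turn the crops into embeddings that carry both spatial and natural-language information, and let the language model interpret the content and respond
```mermaid
graph LR
    A["Image\n(Documents)"] --> T["Cropped Table\n(Object Detection Model)"]
    T--> B["Vision Language Model\nas Retriever"]
    B --> C["Embedding/\nTable,Image"]
```
:::

## Lab4: Preprocessing PDFs and Images
- Environment setup
```python=
from unstructured_client import UnstructuredClient

from Utils import Utils
utils = Utils()

# DLAI_API_KEY = API_KEY = utils.get_dlai_api_key()
Unstructured_API_Key = utils.get_private_api_key()
DLAI_API_URL = utils.get_dlai_url()

# Client used below for the hosted hi_res / chipper partitioning calls
s = UnstructuredClient(
    # api_key_auth=DLAI_API_KEY,
    api_key_auth=Unstructured_API_Key,
    server_url=DLAI_API_URL,
)
```

Sample document
![image](https://hackmd.io/_uploads/rkx3S5kWR.png)

### Process the Document as HTML
```python=
from unstructured.partition.html import partition_html

filename = "example_files/el_nino.html"
html_elements = partition_html(filename=filename)

# Print the category and text of the first 10 elements
for element in html_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- result
```python=
TITLE: CNN
UNCATEGORIZEDTEXT: 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
UNCATEGORIZEDTEXT: Updated: 3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
NARRATIVETEXT: El Niño hasn’t materialized many atmospheric rivers for California so far this winter, with most hitting the Pacific Northwest.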
```

### Process the Document as PDF/IMAGE
The Unstructured library offers several ways to preprocess a document, selected through the `strategy` parameter
- Basic usage:
```python
elements = partition(filename=filename)
```
- [`strategy`](https://unstructured-io.github.io/unstructured/best_practices/strategies.html) options:
    - **auto (default)**: picks the partitioning strategy automatically from the document's characteristics and the function arguments.
    - **fast**: described as using traditional NLP extraction techniques (in practice it uses the `pdfminer` library) to pull out all text elements quickly. :warning: The "fast" strategy is not suitable for image-based file types.
    - **hi_res**: uses detectron2 to identify the document layout. Recommended when your use case is highly sensitive to correct classification of document elements.
    - **ocr_only**: uses optical character recognition to extract text from image-based files.

#### Process the Document with `partition_pdf`
When `partition(filename=filename)` ends up using the "fast" strategy, the PDF is parsed with pdfminer rather than with a layout model, as the library's internal function shows:

```python=
def _partition_pdf_with_pdfminer(
    filename: str,
    file: Optional[IO[bytes]],
    include_page_breaks: bool,
    languages: List[str],
    metadata_last_modified: Optional[str],
    **kwargs: Any,
) -> List[Element]:
    """Partitions a PDF using PDFMiner instead of using a layoutmodel. Used for faster
    processing or detectron2 is not available.

    Implementation is based on the `extract_text` implemenation in pdfminer.six, but
    modified to support tracking page numbers and working with file-like objects.

    ref: https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/high_level.py
    """
```

- Parse the PDF with partition_pdf
```python=
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements

filename = "example_files/el_nino.pdf"
# "fast" runs locally through pdfminer, with no layout model involved
pdf_elements = partition_pdf(filename=filename, strategy="fast")

for element in pdf_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
```python=
UNCATEGORIZEDTEXT: 1/30/24, 5:11 PM
NARRATIVETEXT: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
UNCATEGORIZEDTEXT: CNN 1/30/2024
NARRATIVETEXT: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
TITLE: Updated: 3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```
#### Process the Document with Document Layout Detection
##### DIA Methods - 1. Document Layout Detection > OCR Postprocessing
- Parse the layout with a Document Layout Detection model
    - This is method 1 from the notes above: "Document Layout Detection > OCR Postprocessing"
    - The OCR postprocessing step in the middle is wrapped inside the service, so it is not visible here

```python=
# Read the PDF and wrap it for upload to the hosted API
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
)

try:
    resp = s.general.partition(req)
    dld_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

for element in dld_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
    - Clearly better than pdfminer
```python=
HEADER: 1/30/24, 5:11 PM
HEADER: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
HEADER: CNN 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
NARRATIVETEXT: By Mary Gilbert, CNN Meteorologist
NARRATIVETEXT: Updated: 3:49 PM EST, Tue January 30, 2024
NARRATIVETEXT: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```

#### Process the Document with E2E Model(chipper)
##### DIA Methods - 2. Vision Transformers
- Specify Unstructured's own E2E model (`hi_res_model_name="chipper"`) to get structured output directly

```python
# Same upload as before, but routed to the chipper model
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="chipper",
)

try:
    resp = s.general.partition(req)
    chipper_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

for element in chipper_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
    - Not as good as "1. Document Layout Detection > OCR postprocessing"
```python
UNCATEGORIZEDTEXT: 1/30/24, 5:11 PM
NARRATIVETEXT: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
UNCATEGORIZEDTEXT: CNN 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California asEl Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
TITLE: Updated: 3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence EI Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical EI Niño pattern kicks into gear.
NARRATIVETEXT: El Niño — a natural phenomenon in the tropical Pacific that influences weather around the globe — causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```

#### Compare the number of elements each method produces
- The more expensive Document Layout Detection approach yields finer-grained elements

```python=
import collections

# Process the Document as HTML
len(html_elements) # 35
html_categories = [el.category for el in html_elements]
collections.Counter(html_categories).most_common()
# [('NarrativeText', 23), ('Title', 10), ('UncategorizedText', 2)]

# =====================================================
# Process the Document as PDF
# Process the Document with partition_pdf
# -------------------------------------------------------
# Process the Document with Document Layout Detection
len(dld_elements) # 39
dld_categories = [el.category for el in dld_elements]
collections.Counter(dld_categories).most_common()
# [('NarrativeText', 28), ('Header', 6), ('Title', 4), ('Footer', 1)]

# -------------------------------------------------------
# Process the Document with E2E Model(chipper)
print(len(chipper_elements)) #39
chipper_categories = [el.category for el in chipper_elements]
collections.Counter(chipper_categories).most_common()
# [('NarrativeText', 27), ('Title', 6), ('UncategorizedText', 5), ('Footer', 1)]
```

---

## Supplement
A roundup of other PDF parsing tools that look promising

### Traditional (non-multimodal) approaches

#### [LLM Sherpa](https://github.com/nlmatics/llmsherpa#layoutpdfreader)
Does not seem to get much discussion

#### [pymupdf](https://pymupdf.readthedocs.io/en/latest/about.html)
![image](https://hackmd.io/_uploads/HJ8XD2JWA.png)

#### [2023.09。**Extracting Text from PDF Files with Python: A Comprehensive Guide**](https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517)
- The most complete guide I have found among the non-deep-learning approaches
- Uses pdfminer to analyze the document layout. I am not sure how it works under the hood; is it also a DL model?
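
On the pdfminer question: as far as I can tell, pdfminer.six's layout analysis is heuristic rather than a deep-learning model. It groups characters into lines, and lines into text boxes, by geometric proximity, tuned through `LAParams` settings such as `line_overlap`, `char_margin` and `line_margin`. The toy sketch below illustrates only the line-grouping idea under simplified assumptions. `Char` and `group_into_lines` are hypothetical names made up for this note, not pdfminer APIs, and the real algorithm is considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """One glyph with a PDF-style bounding box (y grows upward). Hypothetical type."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(chars, line_overlap=0.5):
    """Group glyphs into text lines when their vertical extents overlap enough.

    A simplified stand-in for rule-based layout analysis, not pdfminer's code.
    """
    lines = []
    # Visit glyphs top-to-bottom, then left-to-right
    for ch in sorted(chars, key=lambda c: (-c.y0, c.x0)):
        for line in lines:
            ref = line[0]
            overlap = min(ch.y1, ref.y1) - max(ch.y0, ref.y0)
            shorter = min(ch.y1 - ch.y0, ref.y1 - ref.y0)
            if overlap >= line_overlap * shorter:
                line.append(ch)
                break
        else:
            # No existing line overlaps this glyph, so it starts a new line
            lines.append([ch])
    # Read each line left-to-right
    return ["".join(c.text for c in sorted(line, key=lambda c: c.x0)) for line in lines]


chars = [
    Char("K", 19, 80, 27, 92),
    Char("H", 10, 100, 18, 112),
    Char("O", 10, 80, 18, 92),
    Char("i", 19, 100, 22, 112),
]
print(group_into_lines(chars))
```

No model weights are involved anywhere: the grouping depends only on box coordinates and a threshold, which is why this family of parsers is fast but cannot read scanned, image-only PDFs. The article works with pdfminer's real layout objects rather than anything like this toy.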
```python
# To analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# To extract text from tables in PDF
import pdfplumber
# To extract the images from the PDFs
from PIL import Image
from pdf2image import convert_from_path
```

![image](https://hackmd.io/_uploads/Skt-_hybC.png =400x)
![image](https://hackmd.io/_uploads/SyxQ_2yWR.png =800x)

#### [2023.10。LlamaIndex。Mastering PDFs: Extracting Sections, Headings, Paragraphs, and Tables with Cutting-Edge Parser](https://www.llamaindex.ai/blog/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125)