### [AI / ML Learning Notes Entry Page](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
#### [Deeplearning.ai GenAI/LLM Course Notes](https://learn.deeplearning.ai/)
- [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
- [Building and Evaluating Advanced RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [Preprocessing Unstructured Data for LLM Applications](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
    - [Normalizing the Content](https://hackmd.io/a6pABFa5RyCk6uitgJ9qEA?both)
    - [Metadata Extraction and Chunking](https://hackmd.io/@YungHuiHsu/HyJhA80lA)
    - [Preprocessing PDFs and Images](https://hackmd.io/@YungHuiHsu/SkJUlPCeA)
    - [Extracting Tables](https://hackmd.io/@YungHuiHsu/HJEE5rkbC)
    - [Build Your Own RAG Bot]()
---
# [Preprocessing Unstructured Data for LLM Applications](https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications/)
## Preprocessing PDFs and Images
:::warning
Unstructured.IO's PDF parsing calls models behind their hosted API: documents are sent to their service and the parsed results are sent back. If private data must not leave your environment, you will likely need to do the parsing yourself; the sections below point to several directions for assembling an open-source, self-hosted alternative.
:::
### Document Image Analysis
:::success
Extract formatting information and text from the raw image of a document
- Turn scanned or photographed documents into a digital form that can be analyzed and processed
- Covers text recognition (e.g., OCR) and layout recognition, allowing a digital version of the document to be rebuilt from a static image while capturing its structural and formatting details
:::
- **Preprocessing with Rules-based Parsers**
    - Many document types, such as HTML, Word documents, and Markdown, contain explicit formatting information
    - Rules-based parsers exploit the document's inherent structure to identify and extract information; in an HTML document, for example, a parser can recognize titles, paragraphs, and other elements from their tags
- **Visual Information**
    - :pencil2: For other document types, such as PDFs and images, the formatting information is **visual**
    - The formatting is not explicitly marked up; **it is conveyed through visual elements such as layout, fonts, and graphics**
### Document Image Analysis (DIA) Methods
Two techniques are covered: 1) Document Layout Detection (DLD) and 2) Vision Transformers (ViT)
- **Document Layout Detection (DLD)**
    - Uses an object detection model to draw and label bounding boxes around layout elements in the document image
    - By detecting and classifying the different visual elements (text blocks, images, tables, etc.), it recovers the document's structure, turning the document from a mere image into a structured, actionable collection of information
- **Vision Transformer (ViT)**
    - The model takes the document image as input and produces a structured text representation (e.g., JSON) as output
    - The image is split into patches, which are turned into a sequence of tokens that can be fed into a transformer model (see the patch-splitting sketch below)
    - The model learns the relationships among these patches and ultimately builds a deep understanding of the whole document image
    - A text prompt can optionally be included, which helps the model understand and generate the desired output format
:pencil: Conceptually this is the same idea as current work on training vision models to understand and align visual, spatial, and textual information; see the notes on [From Representation to Interface: The Evolution of Foundation for Vision Understanding - CH2 Visual Understanding](https://hackmd.io/@YungHuiHsu/HyjAklf4T)
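
To make the patch step concrete, here is a minimal NumPy sketch (illustrative only; real pipelines also resize, normalize, and linearly project the patches) of how an image is cut into fixed-size patches and flattened before being fed to the transformer:
```python
# Minimal sketch of the ViT-style patch split; sizes are illustrative.
import numpy as np

image = np.random.rand(224, 224, 3)   # stand-in for a rendered document page
patch = 16                            # patch size in pixels

h, w, c = image.shape
patches = (
    image.reshape(h // patch, patch, w // patch, patch, c)
         .transpose(0, 2, 1, 3, 4)    # group pixels by patch
         .reshape(-1, patch * patch * c)
)
print(patches.shape)  # (196, 768): one flattened vector per patch, later projected to tokens
```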
#### DIA Methods - 1. Document Layout Detection > OCR Postprocessing
```mermaid
graph LR
A["Image\n(Documents)"] --> Model["Document Layout\nModels"]
Model--> C[Cropped \nTable/Image/TextBox]
C --> OCR["OCR"]
OCR --> Output["Extracted\nUnstructured Text"]
style A fill:#f0f0f0, stroke:#333, stroke-width:2px
style Model fill:#ccffcc, stroke:#333, stroke-width:2px
style C fill:#f0f0f0, stroke:#333, stroke-width:2px
style OCR fill:#ccffcc, stroke:#333, stroke-width:2px
style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
```
Document layout detection aims to identify and classify the visual elements in a document so that information can be extracted and processed effectively. The main steps and methods are:
- **Visual detection**
    - An object detection model such as YOLOX or Detectron2 identifies and classifies bounding boxes in the document
    - The model quickly and precisely localizes the different types of layout elements in the image, such as text blocks, figures, and tables
    - Each detected element gets a bounding box; these boxes are the basis for subsequent text extraction or further analysis (see the sketch after this list)

> Source: [YOLOX: Exceeding YOLO Series in 2021](https://arxiv.org/abs/2107.08430)
- **Text extraction**
    - Optical character recognition (OCR) is used to extract text from the bounding boxes that require it
    - Once the boxes have been identified, OCR pulls the text out of the image data inside them
:pencil: From tests I have seen, this approach does not perform particularly well
- **Direct extraction**
    - Some PDFs contain metadata and a structured text layer, so the text can be extracted directly from the file without relying on image recognition (also illustrated in the sketch below)
    - This is more efficient than OCR because it reads text straight from the document's data structures and involves no image processing
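
To keep private data in-house, the same "detect layout → crop → OCR" pipeline can be assembled from open-source parts. The sketch below assumes `layoutparser` (with its Detectron2 backend), `pdf2image`, `pytesseract`, and a local Tesseract binary are installed; the PubLayNet model path, score threshold, and label map follow layoutparser's published examples, and the file path reuses the course sample. The last line shows the "direct extraction" route via `pdfminer.six` for PDFs that already have a text layer.
```python
# Rough self-hosted sketch of layout detection -> crop -> OCR (assumptions noted above).
import numpy as np
import layoutparser as lp
import pytesseract
from pdf2image import convert_from_path
from pdfminer.high_level import extract_text

pages = convert_from_path("example_files/el_nino.pdf", dpi=200)  # render pages as images
image = np.array(pages[0])

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)          # bounding boxes + element types

for block in layout:
    if block.type in ("Text", "Title"):
        crop = block.crop_image(image)                       # cut out the detected region
        print(block.type, "->", pytesseract.image_to_string(crop)[:80])

# Direct extraction: if the PDF has a text layer, skip the image pipeline entirely
print(extract_text("example_files/el_nino.pdf")[:300])
```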
##### Analysis
- Pros
    - **Fixed set of element types**: predefined element categories make standardized downstream processing easier
    - **Bounding box information**: the spatial position of each element comes directly out of the detector
- Cons
    - **Two models are required**: an object detection model to find the elements, then an OCR model to extract the text, which adds processing steps and complexity
    - **Less flexible**: limited ability to handle non-standard document types
#### DIA Methods - 2. Vision Transformers
```mermaid
graph LR
A["Cropped Image\n(Documents)"] --> Model
P[Prompt] --> Model
Model("Table Structure Recognition Model")
Model --> Output["Structured Output (JSON, etc.)"]
style A fill:#f0f0f0, stroke:#333, stroke-width:2px
style P fill:#f0f0f0, stroke:#333, stroke-width:2px
style Model fill:#ccffcc, stroke:#333, stroke-width:2px
style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
```
:::info
Personally, I think the Unstructured.IO instructor's framing of the vision models is off here: ViT (Vision Transformer) is a model architecture, not a document image analysis (DIA) method in itself
:::
- **Visual understanding**
The input image is passed to an encoder, and a decoder produces the text output
    - The encoder first splits the image into patches, which pass through a stack of transformer layers that capture and learn the visual relationships and features in the image (the basic ViT mechanism)
    - The decoder uses these features to generate the corresponding text, which directly expresses the image's content (cross-attention aligns the visual and language embeddings)
- **Donut (Document Understanding Transformer)**
    - Donut is designed specifically for document understanding and text generation: it identifies and structures information directly from the document image rather than merely transcribing the image into text
- **Direct conversion**
No OCR: the image input is converted directly into text
    - The image is parsed by the model and turned into machine-readable text, skipping the traditional OCR step
- **Structured training**
The model can be trained to emit valid JSON strings and structured document output
    - With structured training, the model not only generates text but also produces output in a specific format such as JSON

These properties give vision transformers a direct image-to-text conversion path for modern document analysis while preserving the structural integrity and richness of the data.


> Figure (from the Donut paper): top, the conventional OCR-based pipeline; bottom, the OCR-free pipeline — input (image) → model → output (JSON)


> Source: [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664)
> - Donut is an end-to-end (E2E) model: it takes the document directly as an image and outputs structured data in JSON format
> - Donut uses a Vision Transformer architecture and performs document understanding in an OCR-free way: instead of relying on traditional optical character recognition to read the document, it learns to extract structured information directly from the image
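
For reference, the publicly released Donut checkpoints can be run through Hugging Face `transformers`. The sketch below follows the library's documented Donut example, using the `naver-clova-ix/donut-base-finetuned-cord-v2` receipt-parsing checkpoint (not Unstructured's own chipper model); the image path is hypothetical.
```python
# Minimal sketch of running a public Donut checkpoint (per the Hugging Face docs).
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("example_files/receipt.png").convert("RGB")  # hypothetical file
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt steers the decoder toward the expected structured schema
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
print(processor.token2json(sequence))  # nested dict rebuilt from the generated tags
```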
:pencil: This approach works well for parsing a document's layout and pulling the text out of each element (e.g., titles, body text), but deeper understanding of figures and tables still needs additional methods. Tables are covered in the next chapter, [Extracting Tables](https://hackmd.io/@YungHuiHsu/HJEE5rkbC) (but what about figures?).
##### Analysis
- Pros
    - More flexible: adapts better to non-standard document types such as forms and tables
    - Easier to adapt to new types of documents: can be fine-tuned on new datasets or requirements
- Cons
    - The model may hallucinate or repeat: being generative, it can produce content that is not in the document or repeat existing content
    - High computational cost: the large parameter count and complex architecture make inference expensive
:::info
Today's multimodal / Vision Language Model (VLM) approaches can be viewed as a third category that fills the gap in understanding figures and tables:
- Text: parsed out of the document and converted into structured output by method 1 or 2 above
- Figures/tables: cropped out by a dedicated object detection model, then interpreted by a multimodal vision-language model
- Either use an LLM with built-in visual understanding, or a vision foundation model plus an LLM: the image is converted into embeddings that carry spatial and linguistic information, and the language model reasons over them to respond about the content (see the sketch after this block)
```mermaid
graph LR
    A["Image\n(Documents)"] --> T["Cropped Table\n(Object Detection Model)"]
    T--> B["Vision Language Model\nas Retriever"]
    B --> C["Embedding/\nTable,Image"]
```
:::
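
As one concrete form of this third route, the sketch below sends a cropped table or figure image to a vision-language model through an OpenAI-compatible chat endpoint. The model name, prompt, and file path are illustrative; a self-hosted open-source VLM exposed behind a compatible API could be substituted to keep data private.
```python
# Sketch: ask a vision-language model to interpret a cropped table/figure image.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("cropped_table.png", "rb") as f:   # hypothetical crop from the layout model
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4o-mini",                     # illustrative model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this table and return its contents as JSON rows."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```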
## Lab4: Preprocessing PDFs and Images
- Environment setup
```python=
from Utils import Utils
from unstructured_client import UnstructuredClient

utils = Utils()
# DLAI_API_KEY = API_KEY = utils.get_dlai_api_key()
Unstructured_API_Key = utils.get_private_api_key()
DLAI_API_URL = utils.get_dlai_url()

s = UnstructuredClient(
    # api_key_auth=DLAI_API_KEY,
    api_key_auth=Unstructured_API_Key,
    server_url=DLAI_API_URL,
)
```
Example document

### Process the Document as HTML
```python=
from unstructured.partition.html import partition_html

filename = "example_files/el_nino.html"
html_elements = partition_html(filename=filename)

for element in html_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- result
```python=
TITLE: CNN
UNCATEGORIZEDTEXT: 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
UNCATEGORIZEDTEXT: Updated:
3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
NARRATIVETEXT: El Niño hasn’t materialized many atmospheric rivers for California so far this winter, with most hitting the Pacific Northwest.
```
### Process the Document as PDF/IMAGE
The unstructured library offers several ways to preprocess a document, selected via the `strategy` parameter
- Basic usage:
```python
elements = partition(filename=filename)
```
- Available [`strategy`](https://unstructured-io.github.io/unstructured/best_practices/strategies.html) options (a short illustration follows the list):
    - **auto (default)**: automatically chooses a partitioning strategy based on the document's characteristics and the function arguments.
    - **fast**: quickly extracts all text elements using traditional extraction techniques (described as "NLP" in the course, but under the hood it is the `pdfminer` library). :warning: The "fast" strategy is not suitable for image-based file types.
    - **hi_res**: uses detectron2 to identify the document layout. Recommended when correct classification of document elements matters.
    - **ocr_only**: extracts text from image-based files using optical character recognition.
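
As a quick illustration (not part of the lab notebook), the strategy can be passed directly to the generic `partition` entry point; the file path reuses the course sample:
```python
# Illustrative only: choose the partitioning strategy explicitly.
# "fast" reads the PDF text layer via pdfminer; "hi_res" runs a layout model.
from unstructured.partition.auto import partition

fast_elements = partition(filename="example_files/el_nino.pdf", strategy="fast")
hi_res_elements = partition(filename="example_files/el_nino.pdf", strategy="hi_res")
print(len(fast_elements), len(hi_res_elements))
```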
#### Process the Document with `partition_pdf`
If you call `partition(filename=filename)` and the default `strategy` resolves to "fast", the PDF is parsed with pdfminer rather than a layout model
```python=
def _partition_pdf_with_pdfminer(
    filename: str,
    file: Optional[IO[bytes]],
    include_page_breaks: bool,
    languages: List[str],
    metadata_last_modified: Optional[str],
    **kwargs: Any,
) -> List[Element]:
    """Partitions a PDF using PDFMiner instead of using a layoutmodel. Used for faster
    processing or detectron2 is not available.
    Implementation is based on the `extract_text` implemenation in pdfminer.six, but
    modified to support tracking page numbers and working with file-like objects.
    ref: https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/high_level.py
    """
```
- Parse the PDF with `partition_pdf`
```python=
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements

filename = "example_files/el_nino.pdf"
pdf_elements = partition_pdf(filename=filename, strategy="fast")

for element in pdf_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
```python=
UNCATEGORIZEDTEXT: 1/30/24, 5:11 PM
NARRATIVETEXT: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
UNCATEGORIZEDTEXT: CNN 1/30/2024
NARRATIVETEXT: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
TITLE: Updated: 3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```
#### Process the Document with Document Layout Detection
##### DIA Methods - 1. Document Layout Detection > OCR Postprocessing
- Parse the layout with a document layout detection model
    - This is method 1 from the section above: "Document Layout Detection > OCR Postprocessing"
    - The intermediate OCR postprocessing step happens inside the hosted API, so it is not visible here
```python=
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
)

try:
    resp = s.general.partition(req)
    dld_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

for element in dld_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
    - Clearly better than pdfminer
```python=
HEADER: 1/30/24, 5:11 PM
HEADER: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
HEADER: CNN 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
NARRATIVETEXT: By Mary Gilbert, CNN Meteorologist
NARRATIVETEXT: Updated: 3:49 PM EST, Tue January 30, 2024
NARRATIVETEXT: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```
#### Process the Document with an E2E Model (chipper)
##### DIA Methods - 2. Vision Transformers
- Specify Unstructured's own end-to-end model (`hi_res_model_name="chipper"`) to get structured output directly
```python
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="chipper",
)

try:
    resp = s.general.partition(req)
    chipper_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

for element in chipper_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
    - Not as good as method 1 ("Document Layout Detection > OCR Postprocessing")
```python
UNCATEGORIZEDTEXT: 1/30/24, 5:11 PM
NARRATIVETEXT: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
UNCATEGORIZEDTEXT: CNN 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California asEl Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
TITLE: Updated: 3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence EI Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical EI Niño pattern kicks into gear.
NARRATIVETEXT: El Niño — a natural phenomenon in the tropical Pacific that influences weather around the globe — causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```
#### Comparing the number of elements parsed by each method
- The more expensive document layout detection approach produces a finer-grained breakdown of elements (e.g., it distinguishes headers and footers)
```python=
import collections

# Process the Document as HTML
len(html_elements)
# 35
html_categories = [el.category for el in html_elements]
collections.Counter(html_categories).most_common()
# [('NarrativeText', 23), ('Title', 10), ('UncategorizedText', 2)]
# -------------------------------------------------------
# Process the Document with Document Layout Detection
len(dld_elements)
# 39
dld_categories = [el.category for el in dld_elements]
collections.Counter(dld_categories).most_common()
# [('NarrativeText', 28), ('Header', 6), ('Title', 4), ('Footer', 1)]
# -------------------------------------------------------
# Process the Document with E2E Model (chipper)
len(chipper_elements)
# 39
chipper_categories = [el.category for el in chipper_elements]
collections.Counter(chipper_categories).most_common()
# [('NarrativeText', 27), ('Title', 6), ('UncategorizedText', 5), ('Footer', 1)]
```
---
## Supplement
A roundup of other PDF parsing tools that look promising
### Traditional approaches (no multimodal models)
#### [LLM Sherpa](https://github.com/nlmatics/llmsherpa#layoutpdfreader)
Not much community discussion around it yet
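
Its `LayoutPDFReader` usage, roughly per the project README, looks like the sketch below. Note that the default reader URL calls nlmatics' hosted parsing service, so the same privacy caveat as Unstructured's API applies (a self-hosted nlm-ingestor endpoint can be pointed to instead); the sample PDF URL is illustrative.
```python
# Rough usage per the llmsherpa README: parse a PDF into layout-aware chunks.
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf("https://arxiv.org/pdf/2111.15664.pdf")  # illustrative URL

for chunk in doc.chunks():            # layout-aware chunks, handy for RAG ingestion
    print(chunk.to_context_text()[:120])
```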
#### [pymupdf](https://pymupdf.readthedocs.io/en/latest/about.html)
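A minimal PyMuPDF sketch for comparison: it reads the text layer directly and can also expose block-level layout information (`fitz` is PyMuPDF's import name; the file path reuses the course sample).
```python
# Minimal PyMuPDF sketch: text layer extraction plus block-level layout info.
import fitz  # pip install pymupdf

doc = fitz.open("example_files/el_nino.pdf")
for page in doc:
    print(page.get_text("text")[:200])      # plain text from the page's text layer
    # page.get_text("blocks") / "dict" expose block bounding boxes and structure
```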

#### [2023.09。**Extracting Text from PDF Files with Python: A Comprehensive Guide**](https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517)
- The most complete guide among the non-deep-learning approaches
- Uses pdfminer to parse the document's layout
    - pdfminer.six is a rule-based parser that reconstructs layout from the PDF's internal text and positioning data (no deep learning model involved); a short usage sketch follows the import block below
```python
# To analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# To extract text from tables in PDF
import pdfplumber
# To extract the images from the PDFs
from PIL import Image
from pdf2image import convert_from_path
```
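
A short follow-on sketch of how these imports are typically combined (the file path reuses the course sample): walk the layout tree page by page with `extract_pages` and branch on the element type.
```python
# Walk the pdfminer layout tree and handle each element type separately.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTFigure, LTRect

for page_layout in extract_pages("example_files/el_nino.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print("TEXT:", element.get_text().strip()[:80])
        elif isinstance(element, LTFigure):
            print("FIGURE at", element.bbox)   # images live inside LTFigure containers
        elif isinstance(element, LTRect):
            print("RECT at", element.bbox)     # rectangles often indicate table borders
```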


#### [2023.10。LlamaIndex。Mastering PDFs: Extracting Sections, Headings, Paragraphs, and Tables with Cutting-Edge Parser](https://www.llamaindex.ai/blog/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125)