[Preprocessing Unstructured Data for LLM Applications] - Extracting Tables

### [Index of AI / ML study notes](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
#### [Deeplearning.ai GenAI/LLM course notes](https://learn.deeplearning.ai/)
- [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
- [Building and Evaluating Advanced RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [Preprocessing Unstructured Data for LLM Applications](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
    - [Normalizing the Content](https://hackmd.io/a6pABFa5RyCk6uitgJ9qEA?both)
    - [Metadata Extraction and Chunking](https://hackmd.io/@YungHuiHsu/HyJhA80lA)
    - [Preprocessing PDFs and Images](https://hackmd.io/@YungHuiHsu/SkJUlPCeA)
    - [Extracting Tables](https://hackmd.io/@YungHuiHsu/HJEE5rkbC)
    - [Build Your Own RAG Bot]()

---

# [Preprocessing Unstructured Data for LLM Applications](https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications/)

## [Extracting Tables](https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications/lesson/6/extracting-tables)

### Table Extraction

Table extraction matters in many applications, especially in domains where data must be pulled directly out of documents and analyzed. Key points:

- **Text analysis**: most Retrieval-Augmented Generation (RAG) use cases focus on the text content of documents.
- **Structured data**: some industries (such as finance and insurance) routinely deal with structured data embedded in unstructured documents.
- **Table extraction**: extracting tables from documents is very helpful for use cases such as table question answering.

#### Handling table structure in documents

Different document types call for different techniques when it comes to table structure.

- **Inherent structure**: some file types (such as HTML and Word documents) carry table structure information.
    - These formats already have built-in markup and tags that describe tables, so extracting tables from them is relatively direct and simple.
- **Inference required**: for other document types (such as PDFs and images), the table information has to be inferred.
    - These formats usually contain no explicit structured table data, so extraction relies on image recognition and pattern recognition to infer and rebuild the table structure.
- **Technical options**
    - Table Transformers: models designed specifically to recognize and process table structure in documents. :warning: The Table Transformers meant here are not Microsoft's Table Transformer.
    - Vision Transformers: deep learning models that analyze the document image, recognize tables and other visual elements, and emit a structured format (JSON).
    - OCR post-processing: after text is extracted with optical character recognition, additional processing structures the table data.
- **HTML output**: extract tables as HTML to preserve their structure.
    - Extracting a table as HTML keeps its original layout and structural attributes, which makes it easy to display on a web page or to process further.

:::warning
[Unstructured.IO official documentation on models](https://unstructured-io.github.io/unstructured/best_practices/models.html):
- Option 1. Table Transformers
    - Mentioned in the course, but not actually offered by the API.
- Option 2. Vision Transformers
    - What is actually offered is the in-house `chipper` model.
    > `chipper (beta version)`: the Chipper model is Unstructured’s in-house image-to-text model based on transformer-based Visual Document Understanding (VDU) models.
- Option 3. OCR post-processing (OCR Postprocessing)
    > - Layout (object) detection: the `detectron2_onnx` and `yolox` models are available.
:::

### Option 1. Table Transformers

```mermaid
graph LR
    A["Cropped Table\n(IMAGE)"] --> B
    B("Table Structure\nRecognition Model")
    B--> Bbox["BBOX"] --> Output
    B--> Tag["Structure Tags"] --> Output
    Output["Structured Output (JSON, etc.)"]

    style A fill:#f0f0f0, stroke:#333, stroke-width:2px
    style B fill:#ccffcc, stroke:#333, stroke-width:2px
    style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
    style Bbox fill:#f0f0f0, stroke:#333, stroke-width:2px
    style Tag fill:#f0f0f0, stroke:#333, stroke-width:2px
```

The table transformer described in the Unstructured.IO course is a model that identifies the bounding boxes of table cells and converts the output to HTML.

![image](https://hackmd.io/_uploads/HyQi0I1b0.png =600x)
> [TableFormer: Table Structure Understanding with Transformers](https://paperswithcode.com/paper/tableformer-table-structure-understanding)

- Two steps:
    1. **Identify the table**: first use a document layout detection model to locate the tables in the document.
    2. **Run the table transformer**: then pass each table through the table transformer.
- Analysis
    - Pros:
        - **Traceable to the original bounding boxes**: each cell can be traced back to its exact position in the source.
    - Cons:
        - **Multiple expensive model calls**: the multi-stage pipeline increases compute cost and processing time.

:warning: The Table Transformer mentioned in the Unstructured.IO course refers to [TableFormer: Table Structure Understanding with Transformers](https://paperswithcode.com/paper/tableformer-table-structure-understanding), published by an IBM team in 2022, but its star/fork counts are fairly low.
![image](https://hackmd.io/_uploads/ryp0sLy-0.png)

At the time of writing, the related project with the most stars appears to be Microsoft's [Table Transformer (TATR)](https://github.com/microsoft/table-transformer); I recommend the Microsoft solution.
![image](https://hackmd.io/_uploads/r1K-hLJ-A.png)

:::warning
Microsoft's [Table Transformer (TATR)](https://github.com/microsoft/table-transformer) includes both a Table Detection model and a Table Structure Recognition model.
![image](https://hackmd.io/_uploads/ByhnOUyZA.png =800x)
:::

### Option 2. Vision Transformers

```mermaid
graph LR
    A["Cropped Image\n(Documents)"] --> Model
    P[Prompt] --> Model
    Model("Table Structure Recognition Model")
    Model --> Output["Structured Output (JSON, etc.)"]

    style A fill:#f0f0f0, stroke:#333, stroke-width:2px
    style P fill:#f0f0f0, stroke:#333, stroke-width:2px
    style Model fill:#ccffcc, stroke:#333, stroke-width:2px
    style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
```

![image](https://hackmd.io/_uploads/BJU3ywyZC.png)

This reuses the vision transformer model from the earlier lesson (on processing PDFs and images), but with HTML as the target output.

- Analysis
    - Pros:
        - **Supports prompts**: users can supply custom prompts, which makes the model more flexible.
        - **Single model call**: only one model run is needed, which reduces processing time and cost.
    - Cons:
        - **Generative model, prone to hallucination**: :warning: it may generate content that is not in the document, which hurts output accuracy.
        - **No bounding boxes**: :warning: it cannot give the exact position in the original image.

:warning: This refers to end-to-end image-to-JSON models such as [Donut](https://github.com/clovaai/donut).
![image](https://hackmd.io/_uploads/r1H0u8yW0.png =400x)

:::info
If multimodal models are too expensive, could this serve as an option that runs locally or on one's own server?
:::

### Option 3. OCR post-processing (OCR Postprocessing)

```mermaid
graph LR
    A["Image\n(Documents)"] --> Model["Document Layout\nModels"]
    Model--> C[Cropped \nTable/Image/TextBox]
    C --> OCR["OCR"]
    OCR --> Output["Extracted\nUnstructured Text"]

    style A fill:#f0f0f0, stroke:#333, stroke-width:2px
    style Model fill:#ccffcc, stroke:#333, stroke-width:2px
    style C fill:#f0f0f0, stroke:#333, stroke-width:2px
    style OCR fill:#ccffcc, stroke:#333, stroke-width:2px
    style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
```

![image](https://hackmd.io/_uploads/S17cJDybR.png)

Run OCR on the table, then build the table structure from patterns in the OCR output.

- Analysis
    - Pros:
        - **Fast**: for well-formatted tables, processing is quick and accurate.
        - **High accuracy**: clearly structured tables yield accurate output.
    - Cons:
        - **Needs statistical or rule-based parsing**: unstructured or irregularly formatted tables are hard to handle.
        - **Less flexible**: it adapts less well than the other approaches.
        - **Cannot be traced back to bounding boxes in the original image**: no link is provided from a cell to its position in the source image.

### Course lab example

- Only the OCR post-processing option is shown!?

```python
# Imports as used in the course notebooks (unstructured-client SDK of that period).
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import dict_to_elements

# `s` is the UnstructuredClient instance created earlier in the course notebook.
filename = "example_files/embedded-images-tables.pdf"

with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
    skip_infer_table_types=[],
    pdf_infer_table_structure=True,
)

try:
    resp = s.general.partition(req)
    elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)
```

- Pull the table data out of the elements

```python
from IPython.display import Markdown, display

tables = [el for el in elements if el.category == "Table"]
# tables -> [<unstructured.documents.elements.Table at 0x2e49d42db10>]

display(tables[0].to_dict())
# Note: the metadata contains a key named 'text_as_html'
# {'type': 'Table',
#  'element_id': '65b8a22ca1625d55e182c40b938ce949',
#  'text': 'Inhibitor Polarization Corrosion be (V/dec) ba (V/dec) Ecorr (V) icorr (AJcm?) concentration (g) resistance (Q) rate (mmj/year) 0.0335 0.0409... 0.0919',
#  'metadata': {'text_as_html': '<table><thead><th>Inhibitor concentration (g)</th><th>be...</tr></table>',
#               'filetype': 'application/pdf',
#               'languages': ['eng'],
#               'page_number': 1,
#               'filename': 'embedded-images-tables.pdf'}}

display(Markdown(tables[0].text))
# Inhibitor Polarization Corrosion be (V/dec) ba (V/dec) Ecorr (V) icorr (AJcm?) concentration (g) resistance (Q) rate (mmj/year) 0.0335 0.0409 —0.9393 0.0003 24.0910 2.8163 1.9460 0.0596 .8276 0.0002 121.440 1.5054 0.0163 0.2369 .8825 0.0001 42121 0.9476 s NO 03233 0.0540 —0.8027 5.39E-05 373.180 0.4318 0.1240 0.0556 .5896 5.46E-05 305.650 0.3772 = 5 0.0382 0.0086 .5356 1.24E-05 246.080 0.0919
```

- Render the table as HTML

```python
from io import StringIO
from lxml import etree

table_html = tables[0].metadata.text_as_html

# Parse the HTML string and pretty-print it for inspection.
parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())
# <table>
#   <thead>
#     <th>Inhibitor concentration (g)</th>
#     <th>be (V/dec)</th>
#     <th>ba (V/dec)</th>
#     <th>Ecorr (V)</th>
#     <th>icorr (AJcm?)</th>
#     <th>Polarization resistance (Q)</th>
#     <th>Corrosion rate (mmj/year)</th>
#   </thead>
#   <tr>
#     <td/>
#     <td>0.0335</td>
#     <td>0.0409</td>
#     <td>—0.9393</td>
#     <td>0.0003</td>
#     <td>24.0910</td>
#     <td>2.8163</td>
#   </tr>
```

- Result
    - As with the OCR approaches seen earlier, digit recognition is not very good.

![image](https://hackmd.io/_uploads/H1k1R8ybR.png)

## Extracting Images

The official library provides an API for extracting images that does not require a paid subscription. However, two applications must be installed manually, otherwise it raises an error ([PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?](https://github.com/langchain-ai/langchain/issues/16085)). My test environment is Windows, where both have to be installed by hand and then added to the environment variables, which is rather tedious.

#### Test summary

:::info
- Using `from unstructured.partition.pdf import partition_pdf` is free and needs no API subscription.
- On a simple PDF document (a paper), cropped figures and tables can be obtained quickly.
- Figure/table detection, cropping, and OCR are bundled together.
- Figure and table captions are not detected and are not placed in the metadata.
- Tables parsed with the free `partition_pdf` come out worse than with the paid plan.
- OCR results on figures are very poor.
:::

#### Environment setup

- [partition_pdf](https://unstructured-io.github.io/unstructured/core/partition.html)
- Install Tesseract (open-source OCR)
    - Download: [Tesseract at UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
    - After downloading, run `tesseract-ocr-w64-setup-<version>.exe`
    - After installation, add its path to the user's environment variables
- Install [poppler-windows](https://github.com/oschwartz10612/poppler-windows/releases/)
    - Download `Release-24.02.0-0.zip` and unzip it
    - Add to the environment variables: create a new entry `C:\Program Files\poppler-24.02.0-0\bin`
    - Type `pdfinfo` in a terminal to check that it is found

![image](https://hackmd.io/_uploads/B13VN3X-A.png =300x)
![image](https://hackmd.io/_uploads/Hyx7E27W0.png =300x)

#### Code

The document looks like this:

![image](https://hackmd.io/_uploads/H1qbL3QZC.png =400x)

- Use `partition_pdf` to extract tables and images

```python
from IPython.display import display
from unstructured.partition.pdf import partition_pdf

filename = "example_files/embedded-images-tables.pdf"

elements = partition_pdf(
    filename=filename,                                # mandatory
    strategy="hi_res",                                # mandatory to use ``hi_res`` strategy
    extract_images_in_pdf=True,                       # mandatory to set as ``True``
    extract_image_block_types=["Image", "Table"],     # optional
    extract_image_block_to_payload=False,             # optional
    extract_image_block_output_dir="cropped_images",  # optional - only works when ``extract_image_block_to_payload=False``
)
```

- Extract the tables

```python
from IPython.display import Image, display

tables = [el for el in elements if el.category == "Table"]
# tables -> [<unstructured.documents.elements.Table at 0x1f7472ceb90>]

tables[0].to_dict()['metadata'].keys()
# With partition_pdf, table parsing is not as good as with the paid API,
# which can produce a more structured result (text_as_html).
# The metadata has no "text_as_html", but it does have coordinates and the detection confidence.
# dict_keys(['detection_class_prob', 'coordinates', 'last_modified', 'filetype', 'languages', 'page_number', 'image_path', 'file_directory', 'filename'])

display(tables[0].to_dict())
display(Image(filename="cropped_images/table-1-1.jpg", width=600))
# {'type': 'Table',
#  'element_id': 'e63e2e8360c2545db1ad277eb1e50b6e',
#  'text': 'Inhibitor be (V/dec) ba (V/dec) Ecorr (V) icorr (A/cm?) Polarization Corrosion concentration (g) resistance (Q) rate (mm/year) i) 0.0335 ... 246.080 0.0919',
#  'metadata': {'detection_class_prob': 0.9123603105545044,
#               'coordinates': {'points': ((124.59831237792969, 768.7587890625),
#                                          (124.59831237792969, 1012.5152587890625),
#                                          (1132.3974609375, 1012.5152587890625),
#                                          (1132.3974609375, 768.7587890625)),
#                               'system': 'PixelSpace',
#                               'layout_width': 1300,
#                               'layout_height': 1890},
#               'last_modified': '2024-04-18T17:21:23',
#               'filetype': 'application/pdf',
#               'languages': ['eng'],
#               'page_number': 1,
#               'image_path': 'cropped_images\\table-1-1.jpg',
#               'file_directory': 'example_files',
#               'filename': 'embedded-images-tables.pdf'}}
```

- The cropped table
    - I did not check the values carefully. It is probably no better than the paid API, although the paid result is not great either.

![image](https://hackmd.io/_uploads/HyOeD37-C.png =400x)

- Extract the images

```python
from IPython.display import Image, display

images = [el for el in elements if el.category == "Image"]
# images -> [<unstructured.documents.elements.Image at 0x1f732e28050>, <unstructured.documents.elements.Image at 0x1f732e639d0>]

display(images[0].to_dict())
display(Image(filename="cropped_images/figure-1-1.jpg", width=300))
# {'type': 'Image',
#  'element_id': '04f130b2a9d65ae1c835c8d24c891ba3',
#  'text': 'LS 1 os = — 10; =o ° © —a 205 i —<é é —ie a5 — Control -2 — s& 2.5 T T T 0.0000001 —-0.00001 0.001 O41 Current Density (A/cm2)',
#  'metadata': {'coordinates': {'points': ((308.18888888888887, 185.35277777777802),
#                                          (308.18888888888887, 633.7000000000002),
#                                          (975.4944444444444, 633.7000000000002),
#                                          (975.4944444444444, 185.35277777777802)),
#                               'system': 'PixelSpace',
#                               'layout_width': 1300,
#                               'layout_height': 1890},
#               'last_modified': '2024-04-18T17:21:23',
#               'filetype': 'application/pdf',
#               'languages': ['eng'],
#               'page_number': 1,
#               'image_path': 'path/to/save/images\\figure-1-1.jpg',
#               'file_directory': 'example_files',
#               'filename': 'embedded-images-tables.pdf'}}
```

- The cropped image

![image](https://hackmd.io/_uploads/SkI5v27-A.png =300x)
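
#### Sketch: turning `text_as_html` into rows

When the hosted API returns a `text_as_html` string, as in the course lab output earlier in this note, it is easier to use for table question answering once it is split into rows of cell text. The following is only a minimal sketch using the Python standard library. The `TableRows` class and the `html_table_to_rows` helper are hypothetical names introduced here, not part of `unstructured`. The sample string is a shortened version of the lab output. Nested tables and `rowspan`/`colspan` are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of a simple, non-nested HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = []     # cells of the row currently being read
        self._cell = None  # text pieces of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("tr", "thead") and self._row:
            # The lab output puts <th> cells directly inside <thead>,
            # so closing either tag ends the current row.
            self.rows.append(self._row)
            self._row = []


def html_table_to_rows(table_html):
    """Return the table as a list of rows (hypothetical helper)."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return parser.rows


# Shortened sample modeled on the lab's text_as_html output.
sample = (
    "<table><thead><th>Inhibitor concentration (g)</th><th>be (V/dec)</th></thead>"
    "<tr><td/><td>0.0335</td></tr></table>"
)
rows = html_table_to_rows(sample)
print(rows)
```

The empty `<td/>` cell is kept as an empty string so that column positions stay aligned with the header. With real output one would pass `tables[0].metadata.text_as_html` instead of `sample`. The OCR errors noted above (for example `AJcm?`) are carried through unchanged, so the values still need checking against the cropped table image.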