### [AI / ML Learning Notes Entry Page](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
#### [Deeplearning.ai GenAI/LLM Course Notes](https://learn.deeplearning.ai/)
- [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
- [Building and Evaluating Advanced RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [Preprocessing Unstructured Data for LLM Applications](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
    - [Normalizing the Content](https://hackmd.io/a6pABFa5RyCk6uitgJ9qEA?both)
    - [Metadata Extraction and Chunking](https://hackmd.io/@YungHuiHsu/HyJhA80lA)
    - [Preprocessing PDFs and Images](https://hackmd.io/@YungHuiHsu/SkJUlPCeA)
    - [Extracting Tables](https://hackmd.io/@YungHuiHsu/HJEE5rkbC)
    - [Build Your Own RAG Bot]()
---
# [Preprocessing Unstructured Data for LLM Applications](https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications/)
## Preprocessing PDFs and Images
:::warning
Unstructured.IO's PDF parsing calls models behind their hosted API: documents are sent to their service and the parsed results are sent back. If private data must not leave your environment, you will likely need to do the parsing yourself; the sections below point to several directions for assembling an open-source, self-hosted alternative.
:::
### Document Image Analysis
:::success
Extract formatting information and text from the raw image of a document
- Turn scanned or photographed documents into a digital form that can be analyzed and processed
- Covers text recognition (e.g., OCR) and layout recognition, allowing a digital version of the document to be rebuilt from a static image while capturing its structural and formatting details
:::
- **Preprocessing with Rules-based Parsers**
    - Many document types, such as HTML, Word documents, and Markdown, contain explicit formatting information
    - Rules-based parsers exploit the document's inherent structure to identify and extract information; in an HTML document, for example, a parser can recognize titles, paragraphs, and other elements from their tags
- **Visual Information**
    - :pencil2: For other document types, such as PDFs and images, the formatting information is **visual**
    - The formatting is not explicitly marked up; **it is conveyed through visual elements such as layout, fonts, and graphics**
### Document Image Analysis (DIA) Methods
Two techniques are covered: 1) Document Layout Detection (DLD) and 2) Vision Transformers (ViT)
- **Document Layout Detection (DLD)**
    - Uses an object detection model to draw and label bounding boxes around layout elements in the document image
    - By detecting and classifying the different visual elements (text blocks, images, tables, etc.), it recovers the document's structure, turning the document from a mere image into a structured, actionable collection of information
- **Vision Transformer (ViT)**
    - The model takes the document image as input and produces a structured text representation (e.g., JSON) as output
    - The image is split into patches, which are turned into a sequence of tokens that can be fed into a transformer model (see the patch-splitting sketch below)
    - The model learns the relationships among these patches and ultimately builds a deep understanding of the whole document image
    - A text prompt can optionally be included, which helps the model understand and generate the desired output format
:pencil: Conceptually this is the same idea as current work on training vision models to understand and align visual, spatial, and textual information; see the notes on [From Representation to Interface: The Evolution of Foundation for Vision Understanding - CH2 Visual Understanding](https://hackmd.io/@YungHuiHsu/HyjAklf4T)
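
To make the patch step concrete, here is a minimal NumPy sketch (illustrative only; real pipelines also resize, normalize, and linearly project the patches) of how an image is cut into fixed-size patches and flattened before being fed to the transformer:
```python
# Minimal sketch of the ViT-style patch split; sizes are illustrative.
import numpy as np

image = np.random.rand(224, 224, 3)   # stand-in for a rendered document page
patch = 16                            # patch size in pixels

h, w, c = image.shape
patches = (
    image.reshape(h // patch, patch, w // patch, patch, c)
         .transpose(0, 2, 1, 3, 4)    # group pixels by patch
         .reshape(-1, patch * patch * c)
)
print(patches.shape)  # (196, 768): one flattened vector per patch, later projected to tokens
```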
#### DIA Methods - 1. Document Layout Detection > OCR Postprocessing
```mermaid
graph LR
A["Image\n(Documents)"] --> Model["Document Layout\nModels"]
Model--> C[Cropped \nTable/Image/TextBox]
C --> OCR["OCR"]
OCR --> Output["Extracted\nUnstructured Text"]
style A fill:#f0f0f0, stroke:#333, stroke-width:2px
style Model fill:#ccffcc, stroke:#333, stroke-width:2px
style C fill:#f0f0f0, stroke:#333, stroke-width:2px
style OCR fill:#ccffcc, stroke:#333, stroke-width:2px
style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
```
Document layout detection aims to identify and classify the visual elements in a document so that information can be extracted and processed effectively. The main steps and methods are:
- **Visual detection**
    - An object detection model such as YOLOX or Detectron2 identifies and classifies bounding boxes in the document
    - The model quickly and precisely localizes the different types of layout elements in the image, such as text blocks, figures, and tables
    - Each detected element gets a bounding box; these boxes are the basis for subsequent text extraction or further analysis (see the sketch after this list)

> Source: [YOLOX: Exceeding YOLO Series in 2021](https://arxiv.org/abs/2107.08430)
- **Text extraction**
    - Optical character recognition (OCR) is used to extract text from the bounding boxes that require it
    - Once the boxes have been identified, OCR pulls the text out of the image data inside them
:pencil: From tests I have seen, this approach does not perform particularly well
- **Direct extraction**
    - Some PDFs contain metadata and a structured text layer, so the text can be extracted directly from the file without relying on image recognition (also illustrated in the sketch below)
    - This is more efficient than OCR because it reads text straight from the document's data structures and involves no image processing
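
To keep private data in-house, the same "detect layout → crop → OCR" pipeline can be assembled from open-source parts. The sketch below assumes `layoutparser` (with its Detectron2 backend), `pdf2image`, `pytesseract`, and a local Tesseract binary are installed; the PubLayNet model path, score threshold, and label map follow layoutparser's published examples, and the file path reuses the course sample. The last line shows the "direct extraction" route via `pdfminer.six` for PDFs that already have a text layer.
```python
# Rough self-hosted sketch of layout detection -> crop -> OCR (assumptions noted above).
import numpy as np
import layoutparser as lp
import pytesseract
from pdf2image import convert_from_path
from pdfminer.high_level import extract_text

pages = convert_from_path("example_files/el_nino.pdf", dpi=200)  # render pages as images
image = np.array(pages[0])

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)          # bounding boxes + element types

for block in layout:
    if block.type in ("Text", "Title"):
        crop = block.crop_image(image)                       # cut out the detected region
        print(block.type, "->", pytesseract.image_to_string(crop)[:80])

# Direct extraction: if the PDF has a text layer, skip the image pipeline entirely
print(extract_text("example_files/el_nino.pdf")[:300])
```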
##### Analysis
- Pros
    - **Fixed set of element types**: predefined element categories make standardized downstream processing easier
    - **Bounding box information**: the spatial position of each element comes directly out of the detector
- Cons
    - **Two models are required**: an object detection model to find the elements, then an OCR model to extract the text, which adds processing steps and complexity
    - **Less flexible**: limited ability to handle non-standard document types
#### DIA Methods - 2. Vision Transformers
```mermaid
graph LR
A["Cropped Image\n(Documents)"] --> Model
P[Prompt] --> Model
Model("Table Structure Recognition Model")
Model --> Output["Structured Output (JSON, etc.)"]
style A fill:#f0f0f0, stroke:#333, stroke-width:2px
style P fill:#f0f0f0, stroke:#333, stroke-width:2px
style Model fill:#ccffcc, stroke:#333, stroke-width:2px
style Output fill:#f0f0f0, stroke:#333, stroke-width:2px
```
:::info
Personally, I think the Unstructured.IO instructor's framing of the vision models is off here: ViT (Vision Transformer) is a model architecture, not a document image analysis (DIA) method in itself
:::
- **Visual understanding**
The input image is passed to an encoder, and a decoder produces the text output
    - The encoder first splits the image into patches, which pass through a stack of transformer layers that capture and learn the visual relationships and features in the image (the basic ViT mechanism)
    - The decoder uses these features to generate the corresponding text, which directly expresses the image's content (cross-attention aligns the visual and language embeddings)
- **Donut (Document Understanding Transformer)**
    - Donut is designed specifically for document understanding and text generation: it identifies and structures information directly from the document image rather than merely transcribing the image into text
- **Direct conversion**
No OCR: the image input is converted directly into text
    - The image is parsed by the model and turned into machine-readable text, skipping the traditional OCR step
- **Structured training**
The model can be trained to emit valid JSON strings and structured document output
    - With structured training, the model not only generates text but also produces output in a specific format such as JSON

These properties give vision transformers a direct image-to-text conversion path for modern document analysis while preserving the structural integrity and richness of the data.


> Figure (from the Donut paper): top, the conventional OCR-based pipeline; bottom, the OCR-free pipeline — input (image) → model → output (JSON)


> Source: [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664)
> - Donut is an end-to-end (E2E) model: it takes the document directly as an image and outputs structured data in JSON format
> - Donut uses a Vision Transformer architecture and performs document understanding in an OCR-free way: instead of relying on traditional optical character recognition to read the document, it learns to extract structured information directly from the image
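
For reference, the publicly released Donut checkpoints can be run through Hugging Face `transformers`. The sketch below follows the library's documented Donut example, using the `naver-clova-ix/donut-base-finetuned-cord-v2` receipt-parsing checkpoint (not Unstructured's own chipper model); the image path is hypothetical.
```python
# Minimal sketch of running a public Donut checkpoint (per the Hugging Face docs).
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("example_files/receipt.png").convert("RGB")  # hypothetical file
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt steers the decoder toward the expected structured schema
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
print(processor.token2json(sequence))  # nested dict rebuilt from the generated tags
```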
:pencil: This approach works well for parsing a document's layout and pulling the text out of each element (e.g., titles, body text), but deeper understanding of figures and tables still needs additional methods. Tables are covered in the next chapter, [Extracting Tables](https://hackmd.io/@YungHuiHsu/HJEE5rkbC) (but what about figures?).
##### Analysis
- Pros
    - More flexible: adapts better to non-standard document types such as forms and tables
    - Easier to adapt to new types of documents: can be fine-tuned on new datasets or requirements
- Cons
    - The model may hallucinate or repeat: being generative, it can produce content that is not in the document or repeat existing content
    - High computational cost: the large parameter count and complex architecture make inference expensive
:::info
Today's multimodal / Vision Language Model (VLM) approaches can be viewed as a third category that fills the gap in understanding figures and tables:
- Text: parsed out of the document and converted into structured output by method 1 or 2 above
- Figures/tables: cropped out by a dedicated object detection model, then interpreted by a multimodal vision-language model
- Either use an LLM with built-in visual understanding, or a vision foundation model plus an LLM: the image is converted into embeddings that carry spatial and linguistic information, and the language model reasons over them to respond about the content (see the sketch after this block)
```mermaid
graph LR
    A["Image\n(Documents)"] --> T["Cropped Table\n(Object Detection Model)"]
    T--> B["Vision Language Model\nas Retriever"]
    B --> C["Embedding/\nTable,Image"]
```
:::
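
As one concrete form of this third route, the sketch below sends a cropped table or figure image to a vision-language model through an OpenAI-compatible chat endpoint. The model name, prompt, and file path are illustrative; a self-hosted open-source VLM exposed behind a compatible API could be substituted to keep data private.
```python
# Sketch: ask a vision-language model to interpret a cropped table/figure image.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("cropped_table.png", "rb") as f:   # hypothetical crop from the layout model
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4o-mini",                     # illustrative model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this table and return its contents as JSON rows."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```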
## Lab4: Preprocessing PDFs and Images
- Environment setup
```python=
from Utils import Utils
from unstructured_client import UnstructuredClient

utils = Utils()
# DLAI_API_KEY = API_KEY = utils.get_dlai_api_key()
Unstructured_API_Key = utils.get_private_api_key()
DLAI_API_URL = utils.get_dlai_url()

s = UnstructuredClient(
    # api_key_auth=DLAI_API_KEY,
    api_key_auth=Unstructured_API_Key,
    server_url=DLAI_API_URL,
)
```
Example document

### Process the Document as HTML
```python=
from unstructured.partition.html import partition_html

filename = "example_files/el_nino.html"
html_elements = partition_html(filename=filename)

for element in html_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- result
```python=
TITLE: CNN
UNCATEGORIZEDTEXT: 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
UNCATEGORIZEDTEXT: Updated:
3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
NARRATIVETEXT: El Niño hasn’t materialized many atmospheric rivers for California so far this winter, with most hitting the Pacific Northwest.
```
### Process the Document as PDF/IMAGE
The unstructured library offers several ways to preprocess a document, selected via the `strategy` parameter
- Basic usage:
```python
elements = partition(filename=filename)
```
- Available [`strategy`](https://unstructured-io.github.io/unstructured/best_practices/strategies.html) options (a short illustration follows the list):
    - **auto (default)**: automatically chooses a partitioning strategy based on the document's characteristics and the function arguments.
    - **fast**: quickly extracts all text elements using traditional extraction techniques (described as "NLP" in the course, but under the hood it is the `pdfminer` library). :warning: The "fast" strategy is not suitable for image-based file types.
    - **hi_res**: uses detectron2 to identify the document layout. Recommended when correct classification of document elements matters.
    - **ocr_only**: extracts text from image-based files using optical character recognition.
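
As a quick illustration (not part of the lab notebook), the strategy can be passed directly to the generic `partition` entry point; the file path reuses the course sample:
```python
# Illustrative only: choose the partitioning strategy explicitly.
# "fast" reads the PDF text layer via pdfminer; "hi_res" runs a layout model.
from unstructured.partition.auto import partition

fast_elements = partition(filename="example_files/el_nino.pdf", strategy="fast")
hi_res_elements = partition(filename="example_files/el_nino.pdf", strategy="hi_res")
print(len(fast_elements), len(hi_res_elements))
```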
#### Process the Document with `partition_pdf`
If you call `partition(filename=filename)` and the default `strategy` resolves to "fast", the PDF is parsed with pdfminer rather than a layout model
```python=
def _partition_pdf_with_pdfminer(
    filename: str,
    file: Optional[IO[bytes]],
    include_page_breaks: bool,
    languages: List[str],
    metadata_last_modified: Optional[str],
    **kwargs: Any,
) -> List[Element]:
    """Partitions a PDF using PDFMiner instead of using a layoutmodel. Used for faster
    processing or detectron2 is not available.
    Implementation is based on the `extract_text` implemenation in pdfminer.six, but
    modified to support tracking page numbers and working with file-like objects.
    ref: https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/high_level.py
    """
```
- Parse the PDF with `partition_pdf`
```python=
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements

filename = "example_files/el_nino.pdf"
pdf_elements = partition_pdf(filename=filename, strategy="fast")

for element in pdf_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
```python=
UNCATEGORIZEDTEXT: 1/30/24, 5:11 PM
NARRATIVETEXT: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
UNCATEGORIZEDTEXT: CNN 1/30/2024
NARRATIVETEXT: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
TITLE: Updated: 3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```
#### Process the Document with Document Layout Detection
##### DIA Methods - 1. Document Layout Detection > OCR Postprocessing
- Parse the layout with a document layout detection model
    - This is method 1 from the section above: "Document Layout Detection > OCR Postprocessing"
    - The intermediate OCR postprocessing step happens inside the hosted API, so it is not visible here
```python=
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
)

try:
    resp = s.general.partition(req)
    dld_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

for element in dld_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
    - Clearly better than pdfminer
```python=
HEADER: 1/30/24, 5:11 PM
HEADER: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
HEADER: CNN 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
NARRATIVETEXT: By Mary Gilbert, CNN Meteorologist
NARRATIVETEXT: Updated: 3:49 PM EST, Tue January 30, 2024
NARRATIVETEXT: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```
#### Process the Document with an E2E Model (chipper)
##### DIA Methods - 2. Vision Transformers
- Specify Unstructured's own end-to-end model (`hi_res_model_name="chipper"`) to get structured output directly
```python
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="chipper",
)

try:
    resp = s.general.partition(req)
    chipper_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

for element in chipper_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")
```
- Parsing result
    - Not as good as method 1 ("Document Layout Detection > OCR Postprocessing")
```python
UNCATEGORIZEDTEXT: 1/30/24, 5:11 PM
NARRATIVETEXT: Pineapple express: California to get drenched by back-to-back storms fueling a serious flood threat | CNN
UNCATEGORIZEDTEXT: CNN 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California asEl Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
TITLE: Updated: 3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence EI Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical EI Niño pattern kicks into gear.
NARRATIVETEXT: El Niño — a natural phenomenon in the tropical Pacific that influences weather around the globe — causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from the tropics called an atmospheric river.
```
#### Comparing the number of elements parsed by each method
- The more expensive document layout detection approach produces a finer-grained breakdown of elements (e.g., it distinguishes headers and footers)
```python=
import collections

# Process the Document as HTML
len(html_elements)
# 35
html_categories = [el.category for el in html_elements]
collections.Counter(html_categories).most_common()
# [('NarrativeText', 23), ('Title', 10), ('UncategorizedText', 2)]
# -------------------------------------------------------
# Process the Document with Document Layout Detection
len(dld_elements)
# 39
dld_categories = [el.category for el in dld_elements]
collections.Counter(dld_categories).most_common()
# [('NarrativeText', 28), ('Header', 6), ('Title', 4), ('Footer', 1)]
# -------------------------------------------------------
# Process the Document with E2E Model (chipper)
len(chipper_elements)
# 39
chipper_categories = [el.category for el in chipper_elements]
collections.Counter(chipper_categories).most_common()
# [('NarrativeText', 27), ('Title', 6), ('UncategorizedText', 5), ('Footer', 1)]
```
---
## Supplement
A roundup of other PDF parsing tools that look promising
### Traditional approaches (no multimodal models)
#### [LLM Sherpa](https://github.com/nlmatics/llmsherpa#layoutpdfreader)
Not much community discussion around it yet
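
Its `LayoutPDFReader` usage, roughly per the project README, looks like the sketch below. Note that the default reader URL calls nlmatics' hosted parsing service, so the same privacy caveat as Unstructured's API applies (a self-hosted nlm-ingestor endpoint can be pointed to instead); the sample PDF URL is illustrative.
```python
# Rough usage per the llmsherpa README: parse a PDF into layout-aware chunks.
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf("https://arxiv.org/pdf/2111.15664.pdf")  # illustrative URL

for chunk in doc.chunks():            # layout-aware chunks, handy for RAG ingestion
    print(chunk.to_context_text()[:120])
```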
#### [pymupdf](https://pymupdf.readthedocs.io/en/latest/about.html)
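A minimal PyMuPDF sketch for comparison: it reads the text layer directly and can also expose block-level layout information (`fitz` is PyMuPDF's import name; the file path reuses the course sample).
```python
# Minimal PyMuPDF sketch: text layer extraction plus block-level layout info.
import fitz  # pip install pymupdf

doc = fitz.open("example_files/el_nino.pdf")
for page in doc:
    print(page.get_text("text")[:200])      # plain text from the page's text layer
    # page.get_text("blocks") / "dict" expose block bounding boxes and structure
```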

#### [2023.09。**Extracting Text from PDF Files with Python: A Comprehensive Guide**](https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517)
- The most complete guide among the non-deep-learning approaches
- Uses pdfminer to parse the document's layout
    - pdfminer.six is a rule-based parser that reconstructs layout from the PDF's internal text and positioning data (no deep learning model involved); a short usage sketch follows the import block below
```python
# To analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# To extract text from tables in PDF
import pdfplumber
# To extract the images from the PDFs
from PIL import Image
from pdf2image import convert_from_path
```
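
A short follow-on sketch of how these imports are typically combined (the file path reuses the course sample): walk the layout tree page by page with `extract_pages` and branch on the element type.
```python
# Walk the pdfminer layout tree and handle each element type separately.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTFigure, LTRect

for page_layout in extract_pages("example_files/el_nino.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print("TEXT:", element.get_text().strip()[:80])
        elif isinstance(element, LTFigure):
            print("FIGURE at", element.bbox)   # images live inside LTFigure containers
        elif isinstance(element, LTRect):
            print("RECT at", element.bbox)     # rectangles often indicate table borders
```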


#### [2023.10。LlamaIndex。Mastering PDFs: Extracting Sections, Headings, Paragraphs, and Tables with Cutting-Edge Parser](https://www.llamaindex.ai/blog/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125)