### [AI / ML領域相關學習筆記入口頁面](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
#### [Deeplearning.ai GenAI/LLM系列課程筆記](https://learn.deeplearning.ai/)
##### GenAI
- [Large Language Models with Semantic Search。大型語言模型與語義搜索 ](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development。使用LangChain進行LLM應用開發](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models。微調大型語言模型](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
##### RAG
- [Preprocessing Unstructured Data for LLM Applications。大型語言模型(LLM)應用的非結構化資料前處理](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
- [Building and Evaluating Advanced RAG。建立與評估進階RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [[GenAI][RAG] Multi-Modal Retrieval-Augmented Generation and Evaluation。多模態的RAG與評估](https://hackmd.io/@YungHuiHsu/B1LJcOlfA)
- [[GenAI][RAG] 利用多模態模型解析PDF文件內的表格](https://hackmd.io/@YungHuiHsu/HkthTngM0)
---
Original article: [Multi-Modal on PDF’s with tables](https://docs.llamaindex.ai/en/v0.10.17/examples/multi_modal/multi_modal_pdf_tables.html)
[Table Transformer](https://huggingface.co/microsoft/table-transformer-detection) + GPT4-V
### The experiment:
The test data here is a long document in academic-paper format.
- Experiment summary
<table border="1">
    <thead>
        <tr>
            <th>Experiment</th>
            <th>Steps</th>
            <th>Observation</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Exp 1</td>
            <td>
                1. Extract each PDF page as a separate image<br>
                2. Build image embeddings<br>
                3. Have the retriever fetch the raw page images for the query<br>
                4. Feed the raw images to GPT4-V
            </td>
            <td>
                Even when the correct PDF page/image is retrieved, GPT4-V still fails to produce a correct answer through image reasoning
            </td>
        </tr>
        <tr>
            <td>Exp 2</td>
            <td>
                1. Extract each PDF page as a separate image file<br>
                2. Have GPT4-V identify the tables and extract the table information from each PDF page<br>
                3. Index GPT4-V's understanding of each page into an Image Reasoning Vector Store<br>
                4. Retrieve the answer from the Image Reasoning Vector Store
            </td>
            <td>
                GPT4-V cannot reliably identify tables and extract their content from images, especially when a page mixes tables, text, and figures
            </td>
        </tr>
        <tr>
            <td>Exp 3</td>
            <td>
                1. Use Table Transformer to crop the table regions out of the retrieved images<br>
                2. Pass the cropped table images to GPT4-V<br>
                3. Read the cropped tables and generate a response
            </td>
            <td>
                The model now gives accurate answers. This matches the finding from "chain-of-thought (CoT)" experiments: giving GPT-4V targeted image information significantly improves its ability to answer correctly
            </td>
        </tr>
        <tr>
            <td>Exp 4</td>
            <td>
                Run OCR on the cropped table images, then feed the recognized text to GPT-4 / GPT-3.5 to answer the question
            </td>
            <td>
                The extracted table text is inaccurate; since the correct text cannot be recovered from the table images, the answers are wrong
            </td>
        </tr>
    </tbody>
</table>
#### Summary
- Best approach: Exp 3 (cropped table). Cropping out the table and feeding it directly to GPT4-V gives the best responses
- Exp 1 performed surprisingly poorly, although in my own tests it works fine on pptx-style documents
    - For long documents, using a whole page as the retrieval unit may introduce too much noise
    - Each pptx slide is already a single, focused topic with little text
- Worst approach: Exp 4 (OCR)
- Possible improvements:
    - The Exp 3 approach is similar to what **[Chat with data](https://www.youtube.com/@chatwithdata)** (LangChain-focused) proposes in **GPT-4 Vision: How to use LangChain with Multimodal AI to Analyze Images in Financial Reports**: pass the **raw images (figures, tables) to the LLM/VLM to generate the response**
    - [ ] Likewise, generating multimodal embeddings directly with a multi-modal retriever has not been tested yet
    - [ ] Use prompt engineering to generate further "explanations (figure/table captions)"
        - Capture: 1. the figure/table number, 2. the corresponding body-text page (if any), 3. other metadata such as document name and year
        - [ ] "Find the corresponding body text by figure/table number and generate a more detailed explanation"
        - [ ] Combining pdfminer page-layout analysis to locate the figure/table captions in the corresponding LTTextContainer and pull the related body text might work better
        - [x] Use the figure/table coordinates to grab the surrounding text, summarize it, and let the model generate a further written explanation from that summary
        - [x] Per Exp 2, picking up unrelated text degrades response quality
    - Options for extracting the text information (text embedding):
        - [ ] Summary
        - [ ] Hypothetical QA
            > ✅ HyDE and LLM reranking enhance retrieval precision
            - Hypothetical Query
            - Hypothetical Answers / HyDE (Hypothetical Document Embeddings)
        - [ ] Structured Output
        - Retrieve Only, Pass Raw image to Generator
- My own attempt: text summary + image embedding (one unit per PDF page), retrieved jointly (see the sketch after this list)

- When image and text embeddings are retrieved jointly, the cosine similarity scores of the image embeddings are noticeably lower
    - The query embedding is produced by the text embedder while the images are encoded by the image embedder; the two were not trained together, so the text and image embeddings in the retriever are not well aligned
    - Issues with the image embedder
        - CLIP is used here
        - It is trained for general-purpose scenes rather than charts/tables
        - The encoded semantics may largely reflect spatial structure and visual features; the semantic content carried by charts may not be rich?
- The pages retrieved via text embeddings and via image embeddings are inconsistent
    - Noise, or richer information?
    - [ ] ranking-aware? Do retrieval results ranked further down affect LLM generation?
        - Image-embedding results rank noticeably lower than text-embedding results
- Separately inspect the generation results with only text embeddings and with only image embeddings, and evaluate Answer Relevance
- Check llama-index's low-level API to see how the data is actually fed in
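A minimal sketch of this joint "text summary + image embedding" retrieval setup, assuming llama-index v0.10 with a local Qdrant store and the default CLIP image embedder (the folder name, collection names, and top-k values are illustrative, not from the original notes):

```python
# Sketch only: one multi-modal index whose text store holds per-page summaries
# and whose image store holds per-page images; retrieve from both modalities
# and compare their similarity scores.
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(path="qdrant_mm_db_hybrid")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# assumed folder: one .txt summary and one .png image per PDF page
page_documents = SimpleDirectoryReader("./llama2_pages/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    page_documents, storage_context=storage_context
)

retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
results = retriever.retrieve("Compare llama2 with llama1?")
for res in results:
    # text nodes and image nodes come back with scores from different embedders
    print(type(res.node).__name__, res.score, res.node.metadata.get("file_path"))
```

Comparing the score distributions of the text hits vs. the image hits is one way to quantify the text/image misalignment noted above.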
---
What follows are more detailed notes on each experiment setup from the [Multi-Modal on PDF’s with tables](https://docs.llamaindex.ai/en/v0.10.17/examples/multi_modal/multi_modal_pdf_tables.html) article.
>**LlamaIndex:** Parsing tables in PDFs is a super important RAG use case. We found that using the recent Table Transformer model ([Brandon Smock](https://www.linkedin.com/in/ACoAABU4TjMB5fIw1qKpIyfePSy5HG29SFYPWdk)) combined with GPT-4V gives you superpowers 💪
>The Table Transformer model extracts tables from PDFs using object detection 📊
>We have a full notebook guide 📗 on the following: 1) CLIP to retrieve relevant pages, 2) Table Transforms to extract table images, and 3) GPT-4V to help synthesize an answer.
>We compare to three other multi-modal table understanding baselines:
>🔨 Retrieving entire page via CLIP and feeding to GPT-4V
>🔨 Use GPT-4V to extract text from each page, index/retrieve based on text
>🔨 Do OCR on extracted table images, use that as context
>Notebook: https://lnkd.in/gZJH9pxg
>Docs: https://lnkd.in/gHXR9FQx
>Credits to [Niels Rogge](https://www.linkedin.com/in/ACoAAB9DSA8BPsvimE8aW0sXjqc-UM3dsFmOfEg)'s fantastic [Hugging Face](https://www.linkedin.com/company/huggingface/) space for helping to inspire this notebook: https://lnkd.in/g8zT2vCK
>Full credits to [Haotian Zhang](https://www.linkedin.com/in/ACoAAAuLiT0BJltVi303SEiEzPl9N6aInlwx_zQ) and [Ravi Theja Desetty](https://www.linkedin.com/in/ACoAAAUWtEABhvXgvUh2Me4HrSvlTbb9ZrN2G_o) on the [LlamaIndex](https://www.linkedin.com/company/llamaindex/) team 🙌
>
> **[Haotian Zhang :](https://www.linkedin.com/in/zhanghaotian/)** in the 2nd step, we are using Table Transformer (TATR) to detect the table. However in our example, :warning:**TATR did not give us accurate table OCR results** so we use GPT4V for final answer reasoning for using TATR for both table detection and OCR, pls check the 4th example
>
#### **Exp 1: Retrieve each PDF page as an image → GPT4-V**
- Steps:
    - Extract each PDF page as a separate image (note: the unit here is one page)
    - Build image embeddings
    - Have the retriever fetch the raw page images for the query
    - Feed the raw images to GPT4-V
```python=
from llama_index.core.indices.multi_modal.retriever import (
    MultiModalVectorIndexRetriever,
)

query = "Compare llama2 with llama1?"
assert isinstance(retriever_engine, MultiModalVectorIndexRetriever)

# retrieve for the query using text-to-image retrieval
retrieval_results = retriever_engine.text_to_image_retrieve(query)

# collect the image file paths of the retrieved nodes
# (SimpleDirectoryReader stores the path in each node's metadata)
retrieved_images = [
    res_node.node.metadata["file_path"] for res_node in retrieval_results
]
retrieved_images
# ['llama2/page_50.png', 'llama2/page_50.png']
```
The correct image has already been retrieved at this point (although no cosine similarity score is shown).

- Feed the retrieved images to GPT4-V
```python=
# openai_mm_llm is the GPT-4V client (OpenAIMultiModal); see the Exp 2 snippet
# below for how it is constructed
from llama_index.core.schema import ImageDocument

image_documents = [
    ImageDocument(image_path=image_path) for image_path in retrieved_images
]

response = openai_mm_llm.complete(
    prompt="Compare llama2 with llama1?",
    image_documents=image_documents,
)
print(response)
# I'm sorry, but I am unable to compare fictional entities like "llama2" with "llama1" since in the images provided, there are no images or descriptions of llamas to make such a comparison. The images you've shared contain tables of data reflecting the performance of various models on different datasets and tasks related to machine learning and natural language processing. If you have specific data or images of llamas you would like to discuss or compare, please provide them, and I will help as best as I can.
```
- Observation:
    - :warning: Even though the correct PDF page/image was retrieved, GPT4-V still failed to generate a correct answer through image reasoning!! (Surprising!)
#### **Exp 2: Have GPT4-V interpret each page (image), build a Text Vector Store index from those interpretations, and answer queries against it**
- Steps:
    - Extract each PDF page as a separate image file (note: the unit here is one page)
    - Have GPT4-V identify tables and extract the table information from each PDF page
    - Index GPT4-V's understanding of each page into an Image Reasoning Vector Store
    - Retrieve the answer from the Image Reasoning Vector Store
    - Extract the text information from the PDF pages
        - Output as JSON or as a text summary
```python=
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# GPT-4V client used to read each page image
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=1500
)

image_prompt = """
Please load the table data and output in the json format from the image.
Please try your best to extract the table data from the image.
If you can't extract the table data, please summarize image and return the summary.
"""

# documents_images_v2: the per-page ImageDocuments; index 15 is one example page
response = openai_mm_llm.complete(
    prompt=image_prompt,
    image_documents=[documents_images_v2[15]],
)
print(response)
```
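The indexing step in the next snippet reads from an `image_results` dict (page-image path → GPT-4V output) whose construction is not shown in these notes; a minimal sketch of that bridging loop, assuming `documents_images_v2` is the list of per-page `ImageDocument`s:

```python
# Assumed bridging step: run the same prompt over every page image and keep
# the GPT-4V output keyed by the image path.
image_results = {}
for img_doc in documents_images_v2:
    try:
        result = openai_mm_llm.complete(
            prompt=image_prompt,
            image_documents=[img_doc],
        )
        image_results[img_doc.image_path] = result.text
    except Exception as exc:  # e.g. rate limits or refusals
        print(f"GPT-4V failed on {img_doc.image_path}: {exc}")
```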
- Build Text-Only Vector Store by Indexing the Image Understandings from GPT4-V
```python=
import qdrant_client
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# wrap each GPT-4V page understanding as a text Document,
# keeping the source image path as metadata
text_docs = [
    Document(
        text=str(image_results[image_path]),
        metadata={"image_path": image_path},
    )
    for image_path in image_results
]

# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db_llama_v3")
llama_text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
storage_context = StorageContext.from_defaults(vector_store=llama_text_store)

# Create the Text Vector index
index = VectorStoreIndex.from_documents(
    text_docs,
    storage_context=storage_context,
)
```
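As a side note (not part of the original notebook excerpt), the same index can also synthesize an answer over the GPT-4V page summaries directly, instead of only inspecting retrieved nodes; a sketch assuming a default LLM is configured in `Settings`:

```python
# Assumed usage: let the default LLM answer over the retrieved page summaries
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Compare llama2 with llama1?")
print(response)
```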
- Build Top k retrieval for Vector Store Index
    - Note that only text is retrieved here
```python=
MAX_TOKENS = 50
retriever_engine = index.as_retriever(
similarity_top_k=3,
)
# retrieve more information from the GPT4V response
retrieval_results = retriever_engine.retrieve("Compare llama2 with llama1?")
from llama_index.core.response.notebook_utils import display_source_node
retrieved_image = []
for res_node in retrieval_results:
display_source_node(res_node, source_length=1000)
```
- Key points from the text:
* The “Llama 🦙-C/C++” model is compared with ChatGPT “Llama 🦙-C++” 7B model and GPT-3.5.
* The model demonstrates a win rate of 36% and a tie rate of 31%, relative to ChatGPT “Llama 🦙-C++” 7B model’s performance.
* There are discussions on various topics such as advantages, the importance of statistical significance, token overlap, raters’ calibrations, and control for hypothesis prompt qu…
- Observation:
    - **GPT4-V cannot reliably identify tables and extract their content from images**, especially when the image mixes tables, text, and figures, which is very common in PDFs.
    - Splitting the PDF into individual page images, having GPT4-V interpret/summarize each page as a single image, and then building a RAG text index over those page summaries: **this approach performed poorly on this task**.
#### **Exp 3: Use [`Table Transformer`](https://huggingface.co/microsoft/table-transformer-detection) to crop the table regions from the retrieved images, then feed the cropped table images to GPT4-V**
- Table detection model
    - Locate and crop out the tables, then save them (a sketch of the elided function body appears after the crop loop below)
```python
import torch
from transformers import AutoModelForObjectDetection

# assumed here: run on GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"

# load table detection model
# processor = TableTransformerImageProcessor(max_size=800)
model = AutoModelForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection", revision="no_timm"
).to(device)

# load table structure recognition model
# structure_processor = TableTransformerImageProcessor(max_size=1000)
structure_model = AutoModelForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition-v1.1-all"
).to(device)


def detect_and_crop_save_table(
    file_path, cropped_table_directory="./table_images/"
):
    # full implementation (detection, cropping, saving) in the original notebook
    ...
```
- Crop the tables
```python
for file_path in retrieved_images:
detect_and_crop_save_table(file_path)
```
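The body of `detect_and_crop_save_table` is elided above; below is a minimal sketch of what such a function could look like, using the detection model loaded earlier plus the standard DETR post-processing of `AutoImageProcessor`. This is an assumption-based sketch, not the notebook's implementation (which uses custom resizing, padding, and confidence handling):

```python
import torch
from pathlib import Path
from PIL import Image
from transformers import AutoImageProcessor

# handles resizing/normalization and box post-processing for DETR-style models
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")


def detect_and_crop_save_table_sketch(
    file_path, cropped_table_directory="./table_images/"
):
    Path(cropped_table_directory).mkdir(parents=True, exist_ok=True)
    image = Image.open(file_path).convert("RGB")

    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)  # reuse the detection model loaded above

    # convert model outputs to absolute (xmin, ymin, xmax, ymax) boxes
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=0.9, target_sizes=target_sizes
    )[0]

    for i, box in enumerate(detections["boxes"]):
        cropped = image.crop(box.tolist())
        cropped.save(f"{cropped_table_directory}/{Path(file_path).stem}_table_{i}.png")
```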
View the cropped tables

- Read the cropped tables and generate a response
```python
image_documents = SimpleDirectoryReader("./table_images/").load_data()
response = openai_mm_llm.complete(
prompt="Compare llama2 with llama1?",
image_documents=image_documents,
)
print(response)
```
- Observation:
    - :100: **The model now provides accurate answers.**
    - This matches the finding from "chain-of-thought (CoT)" experiments: providing GPT-4V with targeted image information significantly improves its ability to give correct answers
    - Adding the relevant body text as extra context should work even better, much like how one normally uses GPTs to read papers, except that here the figures/tables and the corresponding text would be selected manually
- Note: this is close to the design I originally had in mind: simply crop the tables out as images, separated from the body text
#### **Exp 4: Following Exp 3, run OCR on the cropped table images, then feed the recognized text to GPT-4 / GPT-3.5 to answer the question**
- Observation:
    - The extracted table text is not accurate (each line is supposed to represent one table entry)
    - Because the correct text cannot be recovered from the table images, the answers are wrong
    - This should be cheaper, but would producing structured output (JSON/CSV/pandas DataFrame) give better interpretability? (A minimal sketch of this OCR pipeline follows below.)
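A minimal sketch of this Exp 4 pipeline, assuming `pytesseract` for OCR and a plain OpenAI chat completion for the final answer (the original notebook may use a different OCR library and prompt):

```python
import os
from PIL import Image
import pytesseract
from openai import OpenAI

# OCR every cropped table image produced in Exp 3
ocr_texts = []
for fname in sorted(os.listdir("./table_images/")):
    image = Image.open(os.path.join("./table_images/", fname))
    ocr_texts.append(pytesseract.image_to_string(image))

# feed the (noisy) OCR text to a text-only model to answer the question
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "Based on the following table text, compare llama2 with llama1:\n\n"
            + "\n\n".join(ocr_texts),
        }
    ],
)
print(response.choices[0].message.content)
```

If OCR quality is the bottleneck, asking for structured JSON/CSV output (as suggested above) only helps to the extent that the underlying characters were recognized correctly in the first place.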
---