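# Day 1 Late-Night Session. [PPT Chatbot] Strategies for Reading PPT Files

## 0. Preface

Being lazy, I wanted to try every ready-made approach I could find. Because the PPT content is sensitive, I could not test any cloud solution other than Microsoft Copilot, so the only alternative was to write my own code to extract the content.

## 1. Reading strategies

### 1-1. Docling

IBM's open-source solution. Of everything I tried, `Docling` really is the most general-purpose tool. Besides being fast, it integrates with OCR, VLMs, LangChain, LlamaIndex, and other tools, so content can be extracted in different ways for different situations.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# OCR and table-structure options belong to Docling's PDF/image pipeline,
# e.g. when the slides are exported to PDF first. A .pptx file is parsed
# directly from its XML and does not go through these options.
pdf_options = PdfPipelineOptions()
pdf_options.do_ocr = True  # Enable OCR for text in images
pdf_options.do_table_structure = True  # Enable structured table parsing
pdf_options.table_structure_options.do_cell_matching = True  # Match predicted cells to layout
pdf_options.table_structure_options.mode = TableFormerMode.ACCURATE  # Accurate table parsing
# OCR languages as EasyOCR codes ("ch_tra" assumes Traditional Chinese slides)
pdf_options.ocr_options = EasyOcrOptions(lang=["en", "ch_tra"])
# pdf_options.do_picture_description = True  # Image captioning, if your Docling version supports it

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)}
)

# Path to your PowerPoint file
pptx_path = "example.pptx"

# Convert the file into a structured document
doc = converter.convert(pptx_path).document

# Convert the structured document to Markdown
markdown = doc.export_to_markdown()

# Save the Markdown output
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown)

# Optional: inspect the structured document
print(doc.export_to_dict())
```

But as I said in [Day 0. [PPT Chatbot] Preface](https://hackmd.io/@9Ecf9gbnTv294Ng2ICWo9g/S1k4AgaCxg), the merged-cell problem has to be solved. Otherwise, when semantic search runs against the user prompt, the rows that lost their merged value may be scored as weakly related, and the retrieved data ends up incomplete.

### 1-2. OCR

Before running OCR, I first converted the PPT to PDF or images, then tried `pytesseract` and `PaddleOCR`. Probably because I had no time to study how to tune the parameters, the results were underwhelming. On top of that, my PPT content may simply not suit OCR (special characters, poor recognition of merged cells, font limitations, and low accuracy on small text), so I eventually gave up on this approach.

### 1-3. markitdown

Microsoft's open-source solution, and also a very powerful package: feed it a document and it produces a Markdown file directly. The conversion takes a single line of code, which feels like magic. Compared with `Docling`, I think markitdown still needs more settings that can be customized.

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="optional custom prompt")
result = md.convert("example.jpg")
print(result.text_content)
```

### 1-4. Microsoft Copilot

The only cloud option. Because a contract is in place, uploading sensitive information feels safer. If this had worked, I would not need to spend much effort on extracting data, and I would get a ready-made bot for free XD

Prerequisite: you are able to create agents with Copilot Studio.

Creating an agent: upload the PPT → create the agent → set the system prompt → start using it.

It is very simple to use, but the reading process is a black box and cannot be customized. From my testing, I believe picture objects and OLE objects are simply excluded.

## 2. Choosing a strategy

- OCR
    - Only usable as an auxiliary tool; it cannot really serve as the lazy option.
- `docling` and `markitdown` are both very good, but a few points are not what I want:
    - Merged cells are not filled in.
    - OLE objects cannot be read.
    - Customizing how content is read means modifying the source code yourself.
- Microsoft Copilot
    - Cannot read image objects or OLE objects.
    - When the question concerns a single file, the results are very good.
    - When the question spans multiple files, the results fall short of expectations and content may be missed.
        - For example, when asked "Please give me the monthly progress of project B12 for each customer", the response may leave out some slides.

## 3. Final choice

After skimming the `python-pptx` examples, it looked easy enough to pick up, so in the end I chose to hand-roll the reader myself, which leaves more room for customization. For example:

- Use an LLM to determine which items and customers a slide mentions, then call an API to store them in the DB
- Find the screenshot of each OLE object and send it to an LLM to ask what the image contains
- Put custom metadata above the content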
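
The merged-cell filling that motivated the hand-rolled approach could be sketched roughly as follows. This is only an illustration, not the implementation used in this series: the function names `fill_merged_cells`, `grid_to_markdown`, and `iter_tables` are hypothetical, and the sketch assumes `python-pptx` table cells expose `is_merge_origin`, `span_height`, and `span_width`. The idea is to copy the text of each merge-origin cell into every cell it spans, so that each row carries its full context before it is embedded for semantic search.

```python
def fill_merged_cells(table):
    """Return the table as rows of text, repeating merged text in every spanned cell."""
    n_rows, n_cols = len(table.rows), len(table.columns)
    grid = [[table.cell(r, c).text for c in range(n_cols)] for r in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            cell = table.cell(r, c)
            if cell.is_merge_origin:
                # Copy the origin text into the whole merged rectangle.
                for rr in range(r, r + cell.span_height):
                    for cc in range(c, c + cell.span_width):
                        grid[rr][cc] = cell.text
    return grid


def _clean(text):
    """Keep a cell on one line and escape pipes so the Markdown row stays valid."""
    return text.replace("\n", " ").replace("|", "\\|").strip()


def grid_to_markdown(grid):
    """Render rows as a Markdown table, treating the first row as the header."""
    header, *body = grid
    lines = [
        "| " + " | ".join(_clean(t) for t in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(_clean(t) for t in row) + " |" for row in body]
    return "\n".join(lines)


def iter_tables(pptx_path):
    """Yield (slide_number, table) for every table shape in the presentation."""
    # Imported here so the helpers above work without python-pptx installed.
    from pptx import Presentation

    prs = Presentation(pptx_path)
    for slide_number, slide in enumerate(prs.slides, start=1):
        for shape in slide.shapes:
            if shape.has_table:
                yield slide_number, shape.table
```

To try it on a file, loop over `iter_tables("example.pptx")` and print `grid_to_markdown(fill_merged_cells(table))` for each table. Repeating the merged value in every row is a deliberate trade-off: the output is more redundant, but each row can be matched on its own.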