### [AI / ML領域相關學習筆記入口頁面](https://hackmd.io/@YungHuiHsu/BySsb5dfp)

#### [Deeplearning.ai GenAI/LLM系列課程筆記](https://learn.deeplearning.ai/)

##### GenAI
- [Large Language Models with Semantic Search。大型語言模型與語義搜索](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [LangChain for LLM Application Development。使用LangChain進行LLM應用開發](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)
- [Finetuning Large Language Models。微調大型語言模型](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)

##### RAG
- [Preprocessing Unstructured Data for LLM Applications。大型語言模型(LLM)應用的非結構化資料前處理](https://hackmd.io/@YungHuiHsu/BJDAbgpgR)
- [Building and Evaluating Advanced RAG。建立與評估進階RAG](https://hackmd.io/@YungHuiHsu/rkqGpCDca)
- [[GenAI][RAG] Multi-Modal Retrieval-Augmented Generation and Evaluation。多模態的RAG與評估](https://hackmd.io/@YungHuiHsu/B1LJcOlfA)
- [[GenAI][RAG] 利用多模態模型解析PDF文件內的表格](https://hackmd.io/@YungHuiHsu/HkthTngM0)
- [[GenAI][RAG] Prompt Engineering for Multimodal Model](https://hackmd.io/@YungHuiHsu/rJzaU3LSC)

---

# Multi-Modal RAG Notes 多模態RAG檢索筆記

## Multi-Modal Retrieval-Augmented Generation

#### [2024.03。zilliz。Exploring the Frontier of Multimodal Retrieval-Augmented Generation (RAG)](https://zilliz.com/learn/multimodal-RAG)

#### [2024.03。nvidia。An Easy Introduction to Multimodal Retrieval-Augmented Generation](https://developer.nvidia.com/blog/an-easy-introduction-to-multimodal-retrieval-augmented-generation/)

#### [2023.12。weaviate。Multimodal Retrieval Augmented Generation (RAG)](https://weaviate.io/blog/multimodal-rag)

#### [2024.05。How to build Multimodal Retrieval-Augmented Generation (RAG) with Gemini](https://www.youtube.com/watch?v=LF7I6raAIL4&t=887s)

### Multi-Modal RAG Implementation: LangChain

#### [2023.10。langchain。Multi-Vector Retriever for RAG on tables, text, and images](https://blog.langchain.dev/semi-structured-multi-modal-rag/)

![image](https://hackmd.io/_uploads/H1oZinlM0.png)

- Path 1: generate multimodal embeddings directly
    - → store the multimodal embeddings in the vector store as-is
    - Option 1: retrieve the raw images, tables, and text
- Path 2: first generate summaries (of images, tables, and text), then store them as text embeddings
    - → Multi-vector retriever
    - the text embeddings of the summaries are used to retrieve the raw data (images, tables, text)
    - Option 2: retrieve image summaries and pass the **text** to the LLM to generate the response
    - Option 3: retrieve via image summaries, but pass the **raw image** to the LLM to generate the response
- Pros and cons
    - Option 1: currently has the highest potential
        - On the downside: **UniIR (2023.12) happens to address the weaknesses of multimodal retrievers**
    - Options 2 & 3 are the most intuitive and easiest to implement, but past experience suggests they perform worse (especially Option 2)

|   | Pro | Con |
| --- | --- | --- |
| Option 1: Multi-modal embedding | Highest potential ceiling | Not many (Vertex, open_clip) |
| Option 2: Embd / retrieve img summary | Easiest | Most lossy |
| Option 3: Embd / retrieve img | Easy retrieval, img for answer | Could be lossy in retrieval |

#### [2023.12。Multi-modal RAG on slide decks](https://blog.langchain.dev/multi-modal-rag-template/)

:::info
Retrieval over image summaries gives the best retrieval accuracy. Possible reasons:
- Text has dedicated embedding models (text2text), but there is no comparably strong dedicated model for text-to-image (text2image) retrieval
- Image retrieval
    - In my own tests with CLIP "ViT-L/14", text-to-image (text2image) cosine similarity scores mostly fall around 0.30–0.35
    - This suggests text-to-image retrieval is not very effective: when the query is converted to a text embedding by CLIP, it cannot capture semantics fine-grained enough to match the images
    - Textual noise in the query makes the scores fluctuate and the retrieval ranking inconsistent (text2text results and scores are comparatively stable; I used "ada2")
- Until there is a sufficiently strong multi-modal model trained specifically on chart/diagram data, convert images to text (summaries or hypothetical questions) and retrieve over that text instead (a minimal measurement sketch follows the results box below)
:::

- Key results

:::success
![image](https://hackmd.io/_uploads/rJ5gkK3m0.png =400x)
:warning: Close to my own test results, Multi-Vector performs best, but the relative performance of text vs. multi-modal (image) embeddings is the opposite of what I observed
> ![image](https://hackmd.io/_uploads/rJeZI8EV0.png =300x)
> - y-axis: whether the retrieved results come from the same source document as the question; a full score of 1 means all top-k retrieved results match
> - I used the CLIP "ViT-L/14@336px" checkpoint and rendered the documents to images at a high resolution of 300 dpi
> - Possible reasons for the difference
>   - The test slides I used contain explicit textual summaries and descriptions of the key points, so they carry more information

![image](https://hackmd.io/_uploads/S1QCd3pQC.png =400x) ![image](https://hackmd.io/_uploads/HJhEY3am0.png =400x)
> The open-source LLaVA-13B performs dismally...

source: [LangChain。Using Multi-Modal LLMs](https://docs.google.com/presentation/d/19x0dvHGhbJOOUWqvPKrECPi1yI3makcoc-8tFLj9Sos/edit?ref=blog.langchain.dev#slide=id.g2642e7050fc_0_558)
:::
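The text-to-image similarity numbers quoted in the info box above can be measured with a short script. Below is a minimal sketch, assuming OpenCLIP with the ViT-L-14 weights; the image filename and query string are placeholders, not files from the original experiment.

```python
# Minimal sketch: measure CLIP text-to-image cosine similarity with OpenCLIP.
# "slide_page_01.png" and the query text are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

image = preprocess(Image.open("slide_page_01.png")).unsqueeze(0)  # placeholder image
text = tokenizer(["What was the Q3 revenue growth shown in the chart?"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product equals cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    cosine_sim = (text_features @ image_features.T).item()

print(f"text-to-image cosine similarity: {cosine_sim:.3f}")
```

Because CLIP text-image similarities sit in such a narrow band, small wording changes in the query can easily reorder results, which is consistent with the ranking instability noted in the info box.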
The post also provides an example [notebook](https://langchain-ai.github.io/langchain-benchmarks/notebooks/retrieval/multi_modal_benchmarking/multi_modal_eval.html?ref=blog.langchain.dev)
- Uses [OpenCLIP](https://github.com/mlfoundations/open_clip) as the retriever
- On LangChain's multi_vector_retriever
    > indexes summaries, but returns raw images or texts — but how exactly does it retrieve over text yet return raw images or text? The technical details are hard to see from the code alone
    - LangChain ships a ready-made [`MultiVectorRetriever`](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/multi_vector/?ref=blog.langchain.dev)
    - The data to be searched over (the summaries) is added to the vectorstore
        - `retriever.vectorstore.add_documents(summary_docs)`
    - What gets returned at retrieval time is the content stored in the docstore (which can be text, images, or tables)
        - `retriever.docstore.mset(list(zip(doc_ids, doc_contents)))`
        - Here `doc_contents` is filled with stringified images (`images_base_64_processed`)
```python=
# create_multi_vector_retriever() is a helper defined in the linked notebook:
# it indexes the image summaries in the vectorstore and stores the raw
# base64-encoded images in the docstore under shared doc_ids.
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore_mvr,
    image_summaries,
    images_base_64_processed,
)
```

#### Long-form blog-post example
[langchain/cookbook/Multi_modal_RAG.ipynb](https://github.com/langchain-ai/langchain/blob/master/cookbook/Multi_modal_RAG.ipynb)
- Material: a blog post from the web, plus images that have already been extracted
- **Multi-vector retriever** (w/ image summary)
    * Uses image summaries for text-based retrieval (text → image summary), then returns the raw image
    * This is a test I want to run later, because text currently has dedicated embedding models that are more discriminative on fine-grained semantics
    * Retrieval units (:pencil2: everything is converted to summaries so the retrieval baseline is consistent)
        * Table summaries
            * extracted with Unstructured.io
        * Text summaries
            * 4k characters per chunk is too long, so chunks are summarized
            * the whole long document is merged and then re-split
        * Image summaries
            * manually cropped images, summarized with GPT-4V
            * image size affects retrieval and generation quality
    * :pencil2: Note that all three kinds of summaries are retrieved together in text form
        * so the three compete with one another during retrieval (the LlamaIndex approach retrieves images and text separately)

### Multi-Modal RAG Implementation: LlamaIndex

#### [2023.11。LlamaIndex。Evaluating Multi-Modal Retrieval-Augmented Generation](https://www.llamaindex.ai/blog/evaluating-multi-modal-retrieval-augmented-generation-db3ca824d428)

#### [2023.11。LlamaIndex。Multi-Modal RAG](https://www.llamaindex.ai/blog/multi-modal-rag-621de7525fea)

#### [LlamaIndex。Multi-modal docs page](https://docs.llamaindex.ai/en/stable/module_guides/models/multi_modal/)

#### [Evaluating Multi-Modal RAG](https://docs.llamaindex.ai/en/stable/examples/evaluation/multi_modal/multi_modal_rag_evaluation/)

#### [LlamaIndex。Multi-Modal on PDF’s with tables](https://docs.llamaindex.ai/en/v0.10.17/examples/multi_modal/multi_modal_pdf_tables.html)
See my notes: [[GenAI][RAG] 利用多模態模型解析PDF文件內的表格](https://hackmd.io/@YungHuiHsu/HkthTngM0)

#### [2023.12。Praveen Veera。A Comprehensive Guide to Multi-Modal RAG on PDFs using GPT4Vision and LlamaIndex](https://medium.com/@praveenveera92/a-comprehensive-guide-to-multi-modal-rag-on-pdfs-using-gpt4-vision-and-llamaindex-ca7d61218f7d)
Contains reference code for extracting image metadata from PDFs (using `PyMuPDF`)

![image](https://hackmd.io/_uploads/HkkwxybGC.png =600x)
> The architecture of Multimodal RAG on PDF with GPT4V and LlamaIndex

---

## **Evaluation**

#### [2023。eugeneyan。Patterns for Building LLM-based Systems & Products](https://eugeneyan.com/writing/llm-patterns/)
A round-up of traditional and LLM-based evaluation metrics and practical experience
- [2023。eugeneyan。Evaluation & Hallucination Detection for Abstractive Summaries](https://eugeneyan.com/writing/abstractive/)
    > If our task has no correct answer but we have references (e.g., machine translation, extractive summarization), we can rely on reference metrics based on matching (BLEU, ROUGE) or semantic similarity (BERTScore, MoverScore).
    > However, these metrics may not work for more open-ended tasks such as abstractive summarization, dialogue, and others. But collecting human judgments can be slow and expensive. Thus, we may opt to lean on automated evaluations via a strong LLM.

:::info
For dialogue, abstractive summarization, and other open-ended tasks there is no single correct or reference answer, and collecting human feedback is slow and expensive, so automated evaluation with a strong LLM becomes essential (a minimal LLM-as-judge sketch follows below).
:::
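Below is a minimal LLM-as-judge sketch of that idea, using the OpenAI Python client as an assumed dependency; the model name, rubric wording, and 1–5 faithfulness scale are illustrative choices, not a prescribed metric.

```python
# Minimal LLM-as-judge sketch: score a generated answer's faithfulness to the
# retrieved context on a 1-5 scale. Model name and rubric are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a strict evaluator.
Given a question, retrieved context, and an answer, rate how faithful the answer
is to the context on a scale of 1 (unsupported) to 5 (fully supported).
Reply with a single integer only.

Question: {question}
Context: {context}
Answer: {answer}"""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())
```

The same pattern generalizes to the other RAG triad metrics discussed below (context relevance, answer relevance) by swapping the rubric.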
#### [2023。stanford.edu。CS 329T: Trustworthy Machine Learning_Large Language Models and Applications](https://web.stanford.edu/class/cs329t/syllabus.html)
:100: Stanford's must-read course material on evaluating the trustworthiness of LLM applications, covering RAG evaluation and validation, including RAG evaluation for multimodal LLMs

#### [2024.03。zilliz。How to Evaluate RAG Applications](https://zilliz.com/learn/How-To-Evaluate-RAG-Applications)

#### [2024.01。confident-ai。LLM Evaluation Metrics: Everything You Need for LLM Evaluation](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation)
:pencil2: High-quality LLM outputs are the product of a great retriever and a great generator.

![image](https://hackmd.io/_uploads/ryambJagC.png =400x)

#### [2024.02。Nvidia。Evaluating Retriever for Enterprise-Grade RAG](https://developer.nvidia.com/blog/evaluating-retriever-for-enterprise-grade-rag/)

#### [2024.02。Pratik Bhavsar Galileo Labs。Mastering RAG: Improve RAG Performance With 4 Powerful RAG Metrics](https://www.rungalileo.io/blog/mastering-rag-improve-performance-with-4-powerful-metrics?utm_medium=email&_hsenc=p2ANqtz-9Fe4SgKFU7gFnWkAj10qy31udk0aI5-agKBTBzcnrNyFFXzpf5khAuqu_yfS7Kzn-It3dlao353qyzijhGSLcYeOpJmg&_hsmi=301132332&utm_content=301135771&utm_source=hs_email)
Evaluation metrics proposed by Galileo; useful to compare against TruLens's metrics

![image](https://hackmd.io/_uploads/H1FKQyTx0.png =400x)

[Choosing your Guardrail Metrics](https://docs.rungalileo.io/galileo/gen-ai-studio-products/galileo-evaluate/choosing-your-guardrail-metrics)

#### [2023.11。LlamaIndex。Evaluating Multi-Modal Retrieval-Augmented Generation](https://www.llamaindex.ai/blog/evaluating-multi-modal-retrieval-augmented-generation-db3ca824d428)
LlamaIndex's take on evaluating multi-modal RAG
- [Evaluating Multi-Modal RAG - correctness, faithfulness, relevancy](https://docs.llamaindex.ai/en/stable/examples/evaluation/multi_modal/multi_modal_rag_evaluation/#correctness-faithfulness-relevancy)
    * [Notebook guide for evaluating Multi-Modal RAG systems with LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/evaluation/multi_modal/multi_modal_rag_evaluation.html)
    * [Intro to Multi-Modal RAG](https://blog.llamaindex.ai/multi-modal-rag-621de7525fea)
    * [Docs/guides on Multi-Modal Abstractions](https://docs.llamaindex.ai/en/stable/module_guides/models/multi_modal.html)

### Cross-Modal Retrieval Integration and Enhancement

#### [arXiv:2312.01714。Retrieval-Augmented Multi-Modal Chain-of-Thoughts Reasoning for Large Language Models](https://arxiv.org/abs/2312.01714)
- [2023.12。A Deep Dive into Retrieval-Augmented Multi-modal Chain-of-Thought Reasoning](https://www.linkedin.com/pulse/deep-dive-retrieval-augmented-multi-modal-reasoning-ashish-bhatia-2gwme/)

CoT + cross-modal integration and enhancement: demonstration examples are retrieved across modalities (a selection sketch follows the quote below)
> the model can intelligently choose examples that closely match the text and images in a given query, ensuring more relevant and contextual reasoning.
> ![image](https://hackmd.io/_uploads/r11avYlfR.png)
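The sketch below illustrates that idea of picking demonstrations that match both the query text and the query image; it is not the paper's exact algorithm. It assumes text and image embeddings (e.g., from CLIP) have already been computed and L2-normalized, and the `alpha` weighting between modalities is an arbitrary illustrative choice.

```python
# Sketch of retrieval-augmented demonstration selection for multimodal CoT:
# pick the k demonstration examples whose text AND image embeddings are most
# similar to the query. Embeddings are assumed precomputed and L2-normalized.
import numpy as np

def select_demonstrations(query_text_emb, query_image_emb,
                          demo_text_embs, demo_image_embs,
                          k=4, alpha=0.5):
    """demo_*_embs: (N, d) arrays; query_*_emb: (d,) arrays; alpha weights text vs. image."""
    text_sim = demo_text_embs @ query_text_emb    # cosine similarity (normalized vectors)
    image_sim = demo_image_embs @ query_image_emb
    combined = alpha * text_sim + (1 - alpha) * image_sim
    return np.argsort(-combined)[:k]              # indices of the top-k demonstrations

# The selected demonstrations are then placed before the query as few-shot
# chain-of-thought exemplars in the multimodal prompt.
```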
### Datasets

- [ChartBench: A Benchmark for Complex Visual Reasoning in Charts](https://arxiv.org/abs/2312.15915)
  A benchmark dataset for chart understanding and a test of MLLM capabilities
  ![image](https://hackmd.io/_uploads/S1oqnxbDC.png =600x)
  ![image](https://hackmd.io/_uploads/Sy9VhlZvC.png =600x)
  > * GPT CoT (CoT-GPT) performs best across all models, markedly improving chart understanding and reasoning.
  > * Fixed CoT (CoT-fix) clearly helps weaker models (e.g., MiniGPT-v2 and Qwen-VL-Chat), but is not as strong as GPT CoT.
  > * Self CoT (CoT-self) is less effective than Fixed CoT and GPT CoT, because the quality of self-generated reasoning chains is unstable.
  > * The Base setting performs worst; without guidance the models cannot carry out complex reasoning effectively.

  (A minimal Base vs. fixed-CoT prompt sketch is given at the end of this note.)

### Hallucination Control and Prompting

- [2024.06。LangGPT:超越文字,多模态提示词在大模型中的创新实践(langgpt作者云中江树) 筆記](https://hackmd.io/Y_TzTh43RWayBXwmeGFd_Q)
- [2024.06。Pratik Bhavsar。Galileo Labs。Survey of Hallucinations in Multimodal Models](https://www.rungalileo.io/blog/survey-of-hallucinations-in-multimodal-models?utm_medium=email&_hsenc=p2ANqtz-8kWinr0h3mSULzr3FtJcPFYT6NITz2fWAXj-_KEnI4U09K-5BaLl9GplL9cc7ZE3Atcw1ZHrS4axMpJxWYtA2rZPHwaQ&_hsmi=314027574&utm_content=314028677&utm_source=hs_email)
  ![image](https://hackmd.io/_uploads/HJpYNe-PC.png =400x)
  ![image](https://hackmd.io/_uploads/HJ4O4eWDA.png =800x)
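Returning to the ChartBench comparison in the Datasets section: the gap between the Base and Fixed CoT settings comes down to whether the prompt walks the model through the chart before asking for the answer. The templates below are my own illustrative wording, not the benchmark's actual prompts.

```python
# Illustrative prompt templates contrasting a "Base" prompt and a "fixed CoT"
# prompt for chart question answering. Wording is illustrative, not ChartBench's.

BASE_PROMPT = (
    "Look at the chart and answer the question.\n"
    "Question: {question}\n"
    "Answer:"
)

FIXED_COT_PROMPT = (
    "Look at the chart and answer the question step by step:\n"
    "1. Identify the chart type, axes, and legend.\n"
    "2. Locate the data series and values relevant to the question.\n"
    "3. Perform any required comparison or calculation.\n"
    "4. State the final answer.\n"
    "Question: {question}\n"
    "Answer:"
)

question = "Which region had the highest sales growth in 2023?"  # placeholder query
print(FIXED_COT_PROMPT.format(question=question))
```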