Entry page for my AI/ML learning notes

Notes on the Deeplearning.ai GenAI/LLM course series

GenAI
RAG

Multi-Modal RAG Notes

Multi-Modal Retrieval-Augmented Generation

2024.03。zilliz。Exploring the Frontier of Multimodal Retrieval-Augmented Generation (RAG)

2024.03。nvidia。An Easy Introduction to Multimodal Retrieval-Augmented Generation

2023.12。weaviate。Multimodal Retrieval Augmented Generation (RAG)

2024.05。How to build Multimodal Retrieval-Augmented Generation (RAG) with Gemini

Architecture diagrams for the LlamaIndex and LangChain multimodal retrieval implementations

Multi-Modal RAG implementation: LangChain

2023.10。langchain。Multi-Vector Retriever for RAG on tables, text, and images

  • path1: generate multimodal embeddings directly
    • → store the multimodal embeddings in the vector store as-is
      • option1: retrieve the raw images, tables, and text
  • path2: generate summaries first (of images, tables, and text), then embed and store them as text embeddings
    • → Multi-vector retriever
      • The embedded summary text is used to retrieve the original data (images, tables, text)
      • option2: retrieve image summaries and pass that text to the LLM to generate the response
      • option3: retrieve via image summaries, then pass the raw image to the LLM to generate the response
  • Pros and cons comparison

  • Option1: currently the most promising
    • Its drawback (few available models): UniIR (2023.12) happens to fill exactly this gap in multimodal retrievers
  • Options 2 & 3 are the intuitive choices since they are the easiest to implement, but prior experience suggests they perform poorly (especially Option 2)

|  | Pro | Con |
| --- | --- | --- |
| Option 1: Multi-modal embedding | Highest potential ceiling | Not many (Vertex, open_clip) |
| Option 2: Embed / retrieve img summary | Easiest | Most lossy |
| Option 3: Embed / retrieve img | Easy retrieval, img for answer | Could be lossy in retrieval |
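To make the Option 1 path concrete, here is a minimal sketch using open_clip, one of the few model options the table names; the model tag, query string, and image file names are my own assumptions:

```python
# Sketch of Option 1: put images and a text query into one CLIP embedding
# space and rank the raw images by cosine similarity.
# Assumes open_clip_torch and Pillow are installed; file names are invented.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

image_paths = ["slide_01.png", "slide_02.png"]  # hypothetical inputs
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

with torch.no_grad():
    img_emb = model.encode_image(images)
    txt_emb = model.encode_text(tokenizer(["quarterly revenue bar chart"]))

# L2-normalize so the dot product equals cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)

for i in scores.argsort(descending=True):
    print(image_paths[i], round(scores[i].item(), 3))
```

Both modalities land in the same embedding space, so a plain cosine-similarity search over the image vectors is the entire retriever.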

2023.12。Multi-modal RAG on slide decks

Retrieval over image summaries gives the best retrieval accuracy; possible reasons:

  • Text has dedicated embedding models (text2text), but there is no dedicated model for text-to-image (text2image) retrieval
  • Image retrieval
    • In my own tests with CLIP "ViT-L/14", text-to-image (text2image) cosine similarity scores mostly landed around 0.30~0.35
      • This suggests text-to-image retrieval performs poorly: when the query is turned into a text embedding by CLIP, the representation is not semantically fine-grained enough to match against images
      • Textual noise in the query makes the scores fluctuate and the retrieval ranking inconsistent (text2text results and scores are comparatively stable; I used "ada2"; see the sketch further below)
  • Until there is a sufficiently strong multi-modal model trained specifically on chart and table data, convert images to text first (the text can be a summary or hypothetical questions)
  • Key results

Close to my own tests: Multi-Vector performs best, but the text and multi-modal (image) embeddings show the opposite performance.
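For contrast with the CLIP scores above, a minimal sketch of the text2text side of the comparison ("ada2" here means OpenAI's text-embedding-ada-002); the query and summary strings are invented for illustration:

```python
# Sketch of the text2text check: embed the query and the image *summaries*
# with text-embedding-ada-002 and compare cosine scores. Strings are made up.
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

query = "What was the YoY revenue growth in Q3?"
summaries = [
    "Bar chart of quarterly revenue, Q3 up 12% year over year.",
    "Org chart of the data platform team.",
]

resp = client.embeddings.create(
    model="text-embedding-ada-002", input=[query] + summaries
)
vecs = np.array([d.embedding for d in resp.data])
q, docs = vecs[0], vecs[1:]

# ada-002 vectors come back unit-length, so dot product == cosine similarity.
scores = docs @ q
for summary, score in zip(summaries, scores):
    print(round(float(score), 3), summary)
```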

  • y-axis: whether the retrieved results come from the same source document as the question; the score is a full 1 when all top-k results agree (a minimal sketch of this score follows this list)
  • The CLIP version I used is "ViT-L/14@336px", with documents rendered to images at a high resolution of 300 dpi
  • Analysis of why the results differ
    • Possibly because the test slide decks I used have explicit textual key-point summaries and descriptions, and therefore carry more information
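A minimal sketch of how I read that y-axis score; the function and argument names are my own:

```python
# Minimal sketch of the retrieval score described above: the fraction of
# top-k retrieved chunks whose source document matches the question's
# source document (1.0 when all k agree). Names are assumptions.
def source_consistency(question_source: str, retrieved_sources: list[str], k: int) -> float:
    topk = retrieved_sources[:k]
    return sum(src == question_source for src in topk) / k

# Example: 2 of the 3 retrieved chunks come from the right deck -> ~0.667
print(source_consistency("deck_A.pdf", ["deck_A.pdf", "deck_B.pdf", "deck_A.pdf"], k=3))
```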

The open-source model LLaVA-13B performs dismally.

Source: LangChain。Using Multi-Modal LLMs

The article provides a notebook example.

  • Uses OpenCLIP as the retriever
  • On LangChain's multi_vector_retriever

It "indexes summaries, but returns raw images or texts". So how exactly does it search over text yet return raw images or text? The technical detail is not obvious from the code; see the sketch after the following list.

  • LangChain ships a ready-made MultiVectorRetriever
  • The searchable data (the summaries) is added to the vectorstore:
    • retriever.vectorstore.add_documents(summary_docs)
  • What retrieval returns is whatever was stored in the docstore (which can be text, images, or tables):
    • retriever.docstore.mset(list(zip(doc_ids, doc_contents)))
    • Here doc_contents is filled with stringified images (images_base_64_processed)

```python
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore_mvr,
    image_summaries,
    images_base_64_processed,
)
```
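The detail the notebook leaves implicit: MultiVectorRetriever runs the similarity search over the summary vectors, reads each hit's doc_id from its metadata, and returns the matching entry from the docstore. A sketch of the helper behind the call above, reconstructed from the quoted calls (an approximation, not the cookbook verbatim):

```python
# Sketch of the multi-vector wiring: summaries are indexed for search,
# while the docstore holds the raw payloads (here, base64 image strings)
# that retrieval actually returns.
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document

def create_multi_vector_retriever(vectorstore, summaries, raw_contents):
    id_key = "doc_id"
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,   # searched: summary embeddings
        docstore=InMemoryStore(),  # returned: raw text / tables / images
        id_key=id_key,
    )
    doc_ids = [str(uuid.uuid4()) for _ in raw_contents]
    summary_docs = [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)          # index summaries
    retriever.docstore.mset(list(zip(doc_ids, raw_contents)))  # map ids -> raw data
    return retriever
```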

Long-form web blog example: langchain/cookbook/Multi_modal_RAG.ipynb

  • Materials: a blog article from the web, plus pre-extracted image files
  • Multi-vector retriever (w/ image summary)
    • Text retrieval over the image summaries (text2image_summary), returning the raw images
    • This is a test I want to run later, since dedicated text embedding models currently discriminate fine-grained semantics better
    • Retrieval units (everything is converted to a summary so retrieval runs over a consistent basis)
      • Table summaries
        • Sources are extracted with unstructured.io
      • Text summaries
        • 4k characters per chunk is too long, so each chunk is turned into a summary
        • The whole long document is concatenated and then split
      • Image summaries
        • Manually cropped images, summarized with gpt-4v (a sketch of this step follows the list)
        • Image size affects retrieval and generation quality
      • Note: all three kinds of summaries are retrieved together, in text form
        • The three compete with one another at retrieval time (the LlamaIndex approach retrieves images and text separately)
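A hedged sketch of that image-summary step; the model name and prompt are assumptions rather than the cookbook's exact values:

```python
# Sketch of the image-summary step: describe a manually cropped figure
# with a GPT-4V-class model so the description can be indexed as text.
# Model name and prompt are assumptions, not the cookbook's exact values.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def summarize_image(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize this figure for retrieval: "
                         "key entities, numbers, and trends."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return resp.choices[0].message.content
```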

Multi-Modal RAG implementation: LlamaIndex

2023.11。LlamaIndex。Evaluating Multi-Modal Retrieval-Augmented Generation

2023.11。LlamaIndex。Multi-Modal RAG

LlamaIndex。Multi-modal docs page

Evaluating Multi-Modal RAG

LlamaIndex。Multi-Modal on PDF’s with tables

Notes: see [GenAI][RAG] Parsing tables in PDF documents with multimodal models

2023.12。Praveen Veera。A Comprehensive Guide to Multi-Modal RAG on PDFs using GPT4Vision and LlamaIndex

Includes reference code for extracting image metadata from PDFs (using PyMuPDF).
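In the same spirit, a minimal PyMuPDF sketch for pulling images and their metadata out of a PDF; the file name is a placeholder and the exact fields differ from the article's code:

```python
# Sketch: extract embedded images and basic metadata from a PDF with PyMuPDF.
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")  # hypothetical input file
for page_index, page in enumerate(doc):
    for img in page.get_images(full=True):
        xref = img[0]                   # cross-reference id of the image
        info = doc.extract_image(xref)  # raw bytes plus format metadata
        fname = f"page{page_index}_xref{xref}.{info['ext']}"
        with open(fname, "wb") as f:
            f.write(info["image"])
        print(page_index, xref, info["ext"], info["width"], info["height"])
```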

The architecture of Multimodal RAG on PDF with GPT4V and LlamaIndex



Evaluation

2023。eugeneyan。Patterns for Building LLM-based Systems & Products

A round-up of traditional and LLM-based evaluation metrics and hands-on lessons

If our task has no correct answer but we have references (e.g., machine translation, extractive summarization), we can rely on reference metrics based on matching (BLEU, ROUGE) or semantic similarity (BERTScore, MoverScore).

However, these metrics may not work for more open-ended tasks such as abstractive summarization, dialogue, and others. But collecting human judgments can be slow and expensive. Thus, we may opt to lean on automated evaluations via a strong LLM.

For dialogue, abstractive summarization, and other open-ended tasks there is no correct or reference answer, and collecting human feedback is both slow and expensive, so automated evaluation with a strong LLM becomes a necessity.
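A minimal LLM-as-judge sketch of that idea; the prompt, model name, and 1-5 scale are assumptions, not from the article:

```python
# Minimal LLM-as-judge sketch: score an open-ended answer with a strong LLM
# when no reference answer exists. Prompt, model, and scale are illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(question: str, answer: str) -> int:
    prompt = (
        "Rate the answer to the question on a 1-5 scale for factual "
        "correctness and relevance. Reply with the number only.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```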

2023。stanford.edu。CS 329T: Trustworthy Machine Learning_Large Language Models and Applications

Stanford's must-read material on evaluating the trustworthiness of LLM applications, covering RAG evaluation and validation, including RAG evaluation for multimodal LLMs

2024.03。zilliz。How to Evaluate RAG Applications

2024.01。confident-ai。LLM Evaluation Metrics: Everything You Need for LLM Evaluation

High-quality LLM outputs are the product of a great retriever and a great generator.

2024.02。Nvidia。Evaluating Retriever for Enterprise-Grade RAG

2024.02。Pratik Bhavsar Galileo Labs。Mastering RAG: Improve RAG Performance With 4 Powerful RAG Metrics

Evaluation metrics proposed by Galileo; useful to set against TruLens's for comparison.
Choosing your Guardrail Metrics

2023.11。LlamaIndex。Evaluating Multi-Modal Retrieval-Augmented Generation

LlamaIndex's evaluation approach for multimodal RAG

Cross-Modal Retrieval Integration and Enhancement

arXiv:2312.01714。Retrieval-Augmented Multi-Modal Chain-of-Thoughts Reasoning for Large Language Models

CoT + Cross-Modal Integration and Enhancement

the model can intelligently choose examples that closely match the text and images in a given query, ensuring more relevant and contextual reasoning.

Datasets

  • ChartBench: A Benchmark for Complex Visual Reasoning in Charts

    • GPT CoT (CoT-GPT) performs best across all models, significantly improving chart understanding and reasoning.
    • Fixed CoT (CoT-fix) clearly helps weaker models (e.g., MiniGPT-v2 and Qwen-VL-Chat), but is not as strong as GPT CoT.
    • Self CoT (CoT-self) underperforms both Fixed CoT and GPT CoT, because the quality of self-generated chains of thought is unstable.
    • The Base method does worst; without guidance, the models cannot carry out complex reasoning effectively.

Hallucination control and prompting