[GenAI][RAG] Multi-Modal RAG Notes: Multi-Modal Retrieval-Augmented Generation
Multi-Modal RAG Implementation: LangChain
Path 1: generate multimodal embeddings directly (see the sketch after this list)
→ store the multimodal embeddings in the vector store as-is
Path 2: first generate summaries (of images, tables, and text), then store them as text embeddings
→ Multi-vector retriever
The summaries' text embeddings are used to retrieve the original data (images, tables, text).
Option 2: retrieve the image summary and pass that text to the LLM to generate the response.
Option 3: retrieve via the image summary, but pass the raw image to the LLM to generate the response.
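As an illustration of path 1, here is a minimal sketch that puts OpenCLIP embeddings for both images and text into one vector store, following the pattern of LangChain's multi-modal cookbook; the collection name, file paths, and sample strings are assumptions of this sketch.

```python
from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# One shared embedding space for both images and text.
clip_embd = OpenCLIPEmbeddings(model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k")

vectorstore = Chroma(collection_name="mm_rag_clip", embedding_function=clip_embd)

# Images and text land in the same collection, so one text query can hit either.
vectorstore.add_images(uris=["figures/chart1.png", "figures/chart2.png"])
vectorstore.add_texts(texts=["Quarterly revenue grew 12% year over year."])

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.get_relevant_documents("What is the revenue trend?")
```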
Pros and cons
Option 1: currently the most promising.
Its main drawback is the shortage of multimodal retrievers; UniIR (2023.12) arrived just in time to fill that gap.
Options 2 & 3 are the intuitive choices since they are the easiest to implement, but prior experience suggests their results are poor (especially option 2).
| | Pro | Con |
| --- | --- | --- |
| Option 1: Multi-modal embedding | Highest potential ceiling | Not many models available (Vertex, open_clip) |
| Option 2: Embed / retrieve image summary | Easiest | Most lossy |
| Option 3: Embed summary / retrieve image | Easy retrieval, image for answer | Could be lossy in retrieval |
Retrieval using image summaries yields the best retrieval accuracy. Possible reasons:
Text already has dedicated embedding models (text2text), but there is no dedicated model for text-to-image (text2image) retrieval.
Image retrieval
In my own tests with CLIP "ViT-L/14", the cosine similarity scores for text-to-image (text2image) mostly sit around 0.30-0.35,
which indicates that text-to-image retrieval performance is not great: when the query is turned into a text embedding through CLIP, it cannot capture semantic detail fine enough to match the image.
Textual noise in the query also makes the score fluctuate, so the retrieval ranking is inconsistent (text2text results and scores are comparatively stable; I used "ada2").
Until there is a sufficiently strong multi-modal model trained specifically on chart and table data, converting images to text (either summaries or hypothetical questions) and retrieving over that text is the safer choice. A sketch of the similarity test is shown below.
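A minimal sketch of the text2image similarity test described above, using OpenAI's clip package with the "ViT-L/14" checkpoint; the image path and query string are placeholders.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

image = preprocess(Image.open("slide.png")).unsqueeze(0).to(device)
text = clip.tokenize(["query about the chart on this slide"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize both; the dot product is then the cosine similarity.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
cosine = (text_features @ image_features.T).item()
print(f"text2image cosine similarity: {cosine:.3f}")  # ~0.30-0.35 in the author's tests
```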
Consistent with my own tests: Multi-Vector performs best, but the text and multi-modal (image) embeddings rank in the opposite order.
y-axis: whether the retrieved results come from the same source document as the question; if all top-k retrieval results are consistent, the score is a full 1 (a sketch of this metric is shown below).
The CLIP version I used is "ViT-L/14@336px", and documents were converted to images at a high resolution of 300 dpi.
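A minimal sketch of the consistency score described above, assuming each retrieved hit is reduced to the identifier of its source document; the function name is an assumption.

```python
def topk_source_consistency(query_source: str, retrieved_sources: list[str], k: int) -> float:
    """Return 1.0 when every top-k hit comes from the same source document
    as the question, otherwise the fraction of hits that do."""
    top = retrieved_sources[:k]
    if not top:
        return 0.0
    return sum(src == query_source for src in top) / len(top)

# Example: only one of the top-2 hits shares the question's source document.
score = topk_source_consistency("slides_q3.pdf", ["slides_q3.pdf", "other.pdf"], k=2)
print(score)  # 0.5
```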
Analysis of the difference
Possibly because the test slides I used carry explicit textual key-point summaries and descriptions, so they contain more information.
The open-source model LLaVA-13B performs tragically …
Source: LangChain, "Using Multi-Modal LLMs"
The article provides a notebook example.
It uses OpenCLIP as the retriever.
About LangChain's multi_vector_retriever
It "indexes summaries, but returns raw images or texts". How exactly does it retrieve over text yet return raw images or text? The detail is hard to see from the code alone: each summary stored in the vectorstore carries a doc_id in its metadata, and at query time that id is used to look the raw content up in the docstore.
LangChain ships a ready-made MultiVectorRetriever.
The data to retrieve over (the summaries) is added to the vectorstore:
```python
retriever.vectorstore.add_documents(summary_docs)
```
What retrieval returns is the content stored in the docstore (which can be text, images, or tables):
```python
retriever.docstore.mset(list(zip(doc_ids, doc_contents)))
```
Here doc_contents is filled with stringified images (images_base_64_processed).
```python
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore_mvr,
    image_summaries,
    images_base_64_processed,
)
```
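The create_multi_vector_retriever helper itself is not reproduced in these notes; a minimal sketch of what it does, following the LangChain cookbook pattern (the InMemoryStore and the id_key wiring are assumptions of this sketch), could look like this:

```python
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document


def create_multi_vector_retriever(vectorstore, summaries, raw_contents):
    """Index text summaries for retrieval, but hand back the raw contents."""
    id_key = "doc_id"
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=InMemoryStore(),
        id_key=id_key,
    )
    doc_ids = [str(uuid.uuid4()) for _ in raw_contents]
    # Summaries go into the vectorstore; each carries the id of its raw doc.
    summary_docs = [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    # Raw contents (here: base64 image strings) go into the docstore under those ids.
    retriever.docstore.mset(list(zip(doc_ids, raw_contents)))
    return retriever
```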
Materials: blog articles from the web, plus pre-extracted image files.
Multi-vector retriever (w/ image summary)
Text retrieval over image summaries (text2image_summary), returning the raw image.
This is a test I want to run later, because text currently has dedicated embedding models that discriminate fine-grained semantics better.
Retrieval units (everything is converted to summaries so the retrieval baseline stays consistent):
Table summaries
Text summaries
Converted to summaries because 4k characters is too much.
The whole long document is merged and then split into chunks.
Image summaries
Manually cropped images, summarized with GPT-4V.
Image size affects retrieval and generation quality.
Note that these three kinds of summaries are all retrieved together in text form, so the three compete with each other at retrieval time (the LlamaIndex approach retrieves images and text separately). A sketch of the GPT-4V summarization step follows below.
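A hedged sketch of the GPT-4V summarization step mentioned above; the model name, prompt, and file path are assumptions rather than settings taken from the original notes.

```python
import base64

from openai import OpenAI

client = OpenAI()


def summarize_image(image_path: str) -> str:
    """Ask GPT-4V for a retrieval-oriented text summary of one cropped image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize this figure concisely for retrieval."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content
```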
Multi-Modal RAG Implementation: LlamaIndex. See the notes in [GenAI][RAG] 利用多模態模型解析PDF文件內的表格 (using multimodal models to parse tables inside PDF documents).
It includes reference code for extracting image metadata from PDFs (using PyMuPDF).
The architecture of Multimodal RAG on PDF with GPT-4V and LlamaIndex
Evaluation: a digest of traditional and LLM-based evaluation metrics and lessons learned
If our task has no correct answer but we have references (e.g., machine translation, extractive summarization), we can rely on reference metrics based on matching (BLEU, ROUGE) or semantic similarity (BERTScore, MoverScore).
However, these metrics may not work for more open-ended tasks such as abstractive summarization, dialogue, and others. But collecting human judgments can be slow and expensive. Thus, we may opt to lean on automated evaluations via a strong LLM.
For dialogue, abstractive summarization, and other open-ended tasks there is no single correct or standard answer, and collecting human feedback is both slow and expensive, so automated evaluation with a strong LLM becomes a necessity.
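A tiny illustration of the reference-based metrics mentioned above, using the rouge_score package; the sample strings are placeholders.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
# Compare a model output (prediction) against a reference answer (target).
scores = scorer.score(
    "The retriever returns the raw image for generation.",          # target
    "The raw image is returned by the retriever for generation.",   # prediction
)
print(scores["rougeL"].fmeasure)
```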
Stanford's must-read material on building truthfulness evaluations for LLM applications, covering RAG evaluation and validation; RAG evaluation for multi-modal LLMs is included as well.
High-quality LLM output is the product of a great retriever and a great generator.
Evaluation metrics proposed by Galileo, which can be compared against those of TruLens.
Choosing your Guardrail Metrics
LlamaIndex's evaluation of multi-modal RAG
Cross-Modal Retrieval Integration and Enhancement
CoT + Cross-Modal Integration and Enhancement
the model can intelligently choose examples that closely match the text and images in a given query, ensuring more relevant and contextual reasoning.
Datasets: hallucination control and prompting