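# How to evaluate RAG

## Aspects of RAG evaluation

![image](https://hackmd.io/_uploads/SJcXk9980.png)

1. Retrieval: retrieves the relevant documents from a large knowledge base. It covers the Indexing and Search stages.
    - The dataset can grow dynamically, so the Relevancy and Precision of the information change over time, which makes evaluation more complex.
    - `Evaluate Retrieval Performance`: Recall, Precision, F1 score, MRR (Mean Reciprocal Rank), NDCG (Normalized Discounted Cumulative Gain)
2. Generation: turns the retrieved information into natural, fluent text. It covers the Prompting and Inferencing stages.
    - Faithfulness and Accuracy: for tasks such as creative content generation or open-ended question answering, subjectivity increases the variance in judging what counts as a high-quality response, and correctness is hard to define for questions that have no single right answer.
    - `Evaluate Generation Performance`: BLEU, ROUGE, BERTScore, METEOR (Metric for Evaluation of Translation with Explicit ORdering)
3. Human evaluation: fluency, relevance, correctness, completeness
4. Task-specific Metrics: depending on the needs of the specific task, dedicated evaluation metrics can be designed or adopted.

## RAG evaluation metrics

![image](https://hackmd.io/_uploads/BJIngc9I0.png)

> Query is the question asked by the user.
> Context is the retrieved context, which roughly corresponds to the Retrieval part of RAG.
> Response is the final answer the LLM produces from the Context, which roughly corresponds to the Generation part of RAG.

This triangle makes the 3 metrics that should be evaluated easy to see:

1. Context Relevancy: measures how relevant the `Context` is to the `Query`. It ranges from 0 to 1, and a higher value means better relevance. The metric is the ratio of `the number of sentences in the context that are relevant to answering the question` to the total number of sentences.
2. Groundedness: called `Faithfulness` in some frameworks. It evaluates the consistency between the `Response` and the `Context`. It ranges from 0 to 1, and a higher value means higher consistency. It is mainly used to detect LLM hallucination. The score is computed by identifying the claims in the Response and checking whether each claim can be inferred from the Context.
3. Answer Relevancy: evaluates how relevant the `Response` is to the `Query`. The metric is the `mean cosine similarity between the original question and multiple questions generated back from the answer`. A high score means the Response addresses the original question directly and appropriately, while a low score means the answer is incomplete or contains redundant information.

> Almost every RAG evaluation framework includes the 3 metrics above, but they only consider the relationships among the Evaluable Outputs. `They do not consider the relationship to Ground Truths, that is, the evaluation has no Ground Truths to serve as a baseline`, even though this is also an important part of RAG evaluation.

## RAG evaluation framework

Raw Targets: the core metrics and targets of the evaluation process, including:

- Context Relevance
- Answer Relevance
- Groundedness
- Accuracy
- Faithfulness
- Execution Time
- Correctness
- Readability
- Comprehensiveness
- Response Quality
- Robustness
- Information Integration
- Noise Robustness
- Negative Rejection
- Counterfactual Robustness
- Response Correctness
- Consistency
- Clarity
- Coverage
- Latency
- Diversity

More details on the framework are collected in [this note](https://hackmd.io/kTXaOexnRCS6x3vXVULvoA?view).

## References

- [Evaluation of Retrieval-Augmented Generation: A Survey](https://arxiv.org/abs/2405.07437)
- [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/abs/2309.01431)
- [A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions](https://arxiv.org/abs/2311.05232)
- [Evaluating Retrieval Quality in Retrieval-Augmented Generation](https://arxiv.org/abs/2404.13781)
- [RAGAS: Automated Evaluation of Retrieval Augmented Generation](https://arxiv.org/abs/2309.15217)
- [Retrieval-Augmented Generation for AI-Generated Content: A Survey](https://arxiv.org/abs/2402.19473)
- [LLM 評估方法指南：趨勢、指標與未來方向 (A Guide to LLM Evaluation Methods: Trends, Metrics, and Future Directions) - Medium](https://medium.com/@cch.chichieh/llm-%E8%A9%95%E4%BC%B0%E6%96%B9%E6%B3%95%E6%8C%87%E5%8D%97-%E8%B6%A8%E5%8B%A2-%E6%8C%87%E6%A8%99%E8%88%87%E6%9C%AA%E4%BE%86%E6%96%B9%E5%90%91-e81616d30e53)
- [深入解析 RAG 評估框架：TruLens, RGAR, 與 RAGAs 的比較 (An In-depth Look at RAG Evaluation Frameworks: Comparing TruLens, RGAR, and RAGAs) - Medium](https://medium.com/@cch.chichieh/%E6%B7%B1%E5%85%A5%E8%A7%A3%E6%9E%90-rag-%E8%A9%95%E4%BC%B0%E6%A1%86%E6%9E%B6-trulens-rgar-%E8%88%87-ragas-%E7%9A%84%E6%AF%94%E8%BC%83-ab70d7117480)

## Supplement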
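
As a supplementary sketch, the retrieval metrics named in this note (Precision, Recall, F1, MRR, NDCG) and the three triangle scores can be written as plain arithmetic. This is only an illustration under assumptions: every function name and input shape below is hypothetical and is not the API of any evaluation framework. The per-sentence relevance labels, per-claim support labels, and embedding vectors are assumed to come from an upstream step such as an LLM judge or an embedding model. Treating Groundedness as "supported claims divided by total claims" is also an assumption, since this note only says that claims are checked against the Context.

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Mean of 1/rank of the first relevant document per query (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, gains, k):
    """NDCG@k; `gains` maps a document id to its graded relevance."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

def fraction_true(flags):
    """Share of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0

def context_relevancy(sentence_is_relevant):
    """Relevant context sentences divided by all context sentences."""
    return fraction_true(sentence_is_relevant)

def groundedness(claim_is_supported):
    """Assumption: supported response claims divided by all response claims."""
    return fraction_true(claim_is_supported)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer_relevancy(query_vec, generated_question_vecs):
    """Mean cosine similarity between the query and regenerated questions."""
    sims = [cosine(query_vec, q) for q in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0
```

For example, `precision_recall_f1(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})` finds 2 hits among 4 retrieved and 3 relevant documents, so precision is 0.5 and recall is 2/3. Keeping the judgment step (which sentences or claims count as relevant or supported) outside these functions separates the scoring arithmetic from the choice of judge model.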