
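# Complete Guide to RAGAS Evaluation Metrics

## 1. Classification by Business Scenario

### A. Question-Answering System Evaluation

1. **Retrieval quality**
   - Context Precision: how relevant the retrieved content is to the question
   - Context Recall: whether all necessary information was retrieved
   - Context Entities Recall: how completely the key entities were retrieved
   - Noise Sensitivity: how stable the system stays when given distracting information
2. **Answer quality**
   - Answer Relevancy: whether the answer addresses the point of the question
   - Faithfulness: whether the answer is consistent with the context
   - Factual Correctness: whether the factual statements in the answer are correct
   - Topic Adherence: whether the answer stays on topic throughout

### B. Multimodal System Evaluation

1. **Vision-language understanding**
   - Multimodal Faithfulness: how accurately the system interprets image content
   - Multimodal Relevance: whether the answer appropriately combines image and text information

### C. Agent Tool-Use Evaluation

1. **Task completion**
   - Agent Goal Accuracy: whether the user's intent was achieved
   - Tool Call Accuracy: how accurately tools were used

## 2. Technical Implementation and Calculation Methods

The snippets in this section are simplified sketches of the idea behind each metric, not reproductions of the RAGAS library's implementation. Helpers such as `get_embeddings`, `cosine_similarity`, `extract_key_info`, `extract_claims`, `nli_model`, `compare_schemas`, `compare_rows`, `compare_values`, and `execute_query` are placeholders that you must supply.

### A. Retrieval Metrics

#### 1. Context Precision

```python
import numpy as np

THRESHOLD = 0.7  # assumed relevance cutoff; tune it for your domain

def calculate_context_precision(context, question):
    """
    Compute how relevant the retrieved content is to the question.

    Args:
        context: retrieved context (a single text or a list of chunks)
        question: the user's question
    Returns:
        float: precision score between 0 and 1
    """
    # 1. Embed the texts (get_embeddings is a placeholder for your embedding model)
    context_embedding = get_embeddings(context)
    question_embedding = get_embeddings(question)

    # 2. Compute similarity (cosine_similarity is a placeholder for your similarity function)
    similarity = cosine_similarity(context_embedding, question_embedding)

    # 3. Fraction of similarity values above the threshold
    relevance_score = float((np.asarray(similarity) > THRESHOLD).mean())
    return relevance_score

# Usage example
question = "How tall is Taipei 101?"
context = "Taipei 101 is 509.2 meters tall and is the tallest building in Taiwan."
precision_score = calculate_context_precision(context, question)
# With a single context the thresholded score is 0.0 or 1.0;
# fractional values arise only when several chunks are scored.
```

#### 2. Context Recall

```python
def calculate_context_recall(retrieved_context, reference_context):
    """
    Compute how much of the reference the retrieved content covers.

    Args:
        retrieved_context: context retrieved by the system
        reference_context: the gold-standard reference context
    Returns:
        float: recall score between 0 and 1
    """
    # 1. Extract key information units as sets (extract_key_info is a placeholder)
    retrieved_info = set(extract_key_info(retrieved_context))
    reference_info = set(extract_key_info(reference_context))

    # 2. Compute coverage, guarding against an empty reference
    if not reference_info:
        return 0.0
    return len(retrieved_info & reference_info) / len(reference_info)

# Usage example
reference_answer = "Taipei 101 has 101 floors, is 509.2 meters tall, and was completed in 2004."
retrieved_content = "Taipei 101 is 509.2 meters tall and was completed in 2004."
recall_score = calculate_context_recall(retrieved_content, reference_answer)
# About 0.67 if the extractor finds three facts in the reference and two of them are retrieved
```

### B. Answer Quality Metrics

#### 1. Faithfulness

```python
import numpy as np

def verify_claim(claim, context):
    # Score entailment with a natural language inference model (nli_model is a placeholder)
    return nli_model(premise=context, hypothesis=claim)

def calculate_faithfulness(answer, context):
    """
    Compute how consistent the answer is with the context.

    Args:
        answer: the answer generated by the system
        context: the reference context
    Returns:
        float: faithfulness score between 0 and 1
    """
    # 1. Extract the claims made in the answer (extract_claims is a placeholder)
    claims = extract_claims(answer)
    if not claims:
        return 0.0

    # 2. Verify each claim against the context and average the scores
    claim_scores = [verify_claim(claim, context) for claim in claims]
    return float(np.mean(claim_scores))

# Usage example
context = "Taipei 101 is the tallest building in Taiwan, with a height of 509.2 meters."
answer = "Taipei 101 reaches 509.2 meters and is the tallest building in Taiwan."
faithfulness_score = calculate_faithfulness(answer, context)
# Example value: 0.95 (the exact score depends on the NLI model)
```

### C. Traditional Natural Language Processing Metrics

#### 1. ROUGE Score

```python
from collections import Counter

def get_ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1_score(overlap, gen_total, ref_total):
    """F1 from an overlap count, guarding against division by zero."""
    if overlap == 0 or gen_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / gen_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b):
            curr.append(prev[j] + 1 if x == y else max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]

def calculate_rouge(generated_text, reference_text):
    """
    Compute ROUGE-N and ROUGE-L F1 scores.

    Args:
        generated_text: the generated text
        reference_text: the reference text
    Returns:
        dict: scores for the different ROUGE variants
    """
    # Whitespace tokenization; Chinese text needs a word segmenter or character tokens
    gen_tokens = generated_text.split()
    ref_tokens = reference_text.split()

    def rouge_n(n):
        # 1. Build n-gram counts
        gen_ngrams = get_ngrams(gen_tokens, n)
        ref_ngrams = get_ngrams(ref_tokens, n)
        # 2. Count the overlap (clipped to the smaller count per n-gram)
        overlap = sum((gen_ngrams & ref_ngrams).values())
        # 3. Combine precision and recall into F1
        return f1_score(overlap, sum(gen_ngrams.values()), sum(ref_ngrams.values()))

    return {
        'rouge-1': rouge_n(1),
        'rouge-2': rouge_n(2),
        'rouge-l': f1_score(lcs_length(gen_tokens, ref_tokens), len(gen_tokens), len(ref_tokens)),
    }

# Usage example
reference_text = "Taipei 101 is a landmark building in Taiwan."
generated_text = "Taipei 101 represents a landmark building of Taiwan."
rouge_scores = calculate_rouge(generated_text, reference_text)
# Returns a dict with keys 'rouge-1', 'rouge-2', 'rouge-l'; values depend on tokenization
```

### D. SQL Evaluation Metrics

#### 1. Datacompy Score

```python
def calculate_datacompy_score(query1_result, query2_result):
    """
    Compare the similarity of two SQL query results.

    Args:
        query1_result: result of the first query
        query2_result: result of the second query
    Returns:
        float: similarity score between 0 and 1
    """
    # 1. Compare schemas (the compare_* helpers are placeholders returning values in 0..1)
    schema_match = compare_schemas(query1_result, query2_result)

    # 2. Compare data
    row_match = compare_rows(query1_result, query2_result)
    value_match = compare_values(query1_result, query2_result)

    # 3. Compute the overall score as an unweighted mean
    score = (schema_match + row_match + value_match) / 3
    return score

# Usage example (execute_query is a placeholder for your database call)
query1 = "SELECT * FROM employees WHERE age > 25"
query2 = "SELECT * FROM employees WHERE age >= 26"
similarity_score = calculate_datacompy_score(
    execute_query(query1),
    execute_query(query2)
)
# If age is an integer column, both queries select the same rows, so the score should be 1.0
```

## 3. Best Practices for Metric Selection

### A. Mapping Metrics to Use Cases

1. **General question-answering systems**
   - Primary metrics: Context Precision, Answer Relevancy
   - Secondary metrics: Faithfulness, Topic Adherence
2. **Technical documentation retrieval**
   - Primary metrics: Context Entities Recall, Factual Correctness
   - Secondary metrics: Context Precision, ROUGE Score
3. **Multimodal applications**
   - Primary metrics: Multimodal Faithfulness, Multimodal Relevance
   - Secondary metrics: Answer Relevancy, Topic Adherence

### B. Composite Scoring Method

```python
def comprehensive_score(question, context, answer, reference_answer):
    """
    Compute a weighted composite score.

    Returns:
        tuple: (final weighted score, dict of per-metric scores)
    """
    # Weights sum to 1.0; adjust them to your business priorities
    weights = {
        'precision': 0.3,
        'recall': 0.2,
        'faithfulness': 0.3,
        'relevancy': 0.2
    }

    # calculate_answer_relevancy is not defined in this guide; supply your own implementation
    scores = {
        'precision': calculate_context_precision(context, question),
        'recall': calculate_context_recall(context, reference_answer),
        'faithfulness': calculate_faithfulness(answer, context),
        'relevancy': calculate_answer_relevancy(answer, question)
    }

    final_score = sum(weights[k] * scores[k] for k in weights)
    return final_score, scores
```

## 4. Score Interpretation Guide

### A. Score Ranges

| Metric | Range | Excellent | Acceptable | Needs improvement |
|--------|-------|-----------|------------|-------------------|
| Context Precision | 0-1 | >0.8 | 0.6-0.8 | <0.6 |
| Faithfulness | 0-1 | >0.9 | 0.7-0.9 | <0.7 |
| ROUGE-1 | 0-1 | >0.7 | 0.5-0.7 | <0.5 |
| SQL Equivalence | 0-1 | >0.95 | 0.8-0.95 | <0.8 |

### B. Common Pitfalls

1. **Over-reliance on a single metric**
   - Recommendation: combine several metrics
   - Weight each metric by its importance to the specific business scenario
2. **Threshold settings**
   - Validate thresholds regularly
   - Adjust thresholds to the characteristics of the domain
3. **Special cases**
   - Evaluation methods for multilingual content
   - Handling of specialized terminology
   - Evaluation criteria for ambiguous queries

### C. Optimization Suggestions

1. **Performance**
   - Cache embeddings for frequently used contexts
   - Process evaluation requests in batches
   - Use an efficient similarity computation
2. **Accuracy**
   - Use domain-specific evaluation criteria
   - Update the evaluation models regularly
   - Collect user feedback and adjust accordingly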
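
The caching and batching suggestions above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not part of the RAGAS library: `EmbeddingCache`, `batch_cosine_similarity`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in character-count embedder used only so the example runs without a model. In practice you would pass your real embedding function instead.

```python
import numpy as np

def toy_embed(text, dim=64):
    """Hypothetical stand-in embedder: a character-count vector (replace with a real model)."""
    vec = np.zeros(dim)
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

class EmbeddingCache:
    """Caches embeddings by text so a repeated context is embedded only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, text):
        if text not in self._store:
            self.misses += 1
            self._store[text] = self.embed_fn(text)
        return self._store[text]

    def get_batch(self, texts):
        """Return a (len(texts), dim) matrix of embeddings."""
        return np.stack([self.get(t) for t in texts])

def batch_cosine_similarity(matrix, query):
    """Cosine similarity between every row of `matrix` and `query` in one vectorized step."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return (matrix @ query) / norms

cache = EmbeddingCache(toy_embed)
contexts = [
    "Taipei 101 is 509.2 meters tall.",
    "Taipei 101 was completed in 2004.",
    "Taipei 101 is 509.2 meters tall.",  # repeated context, served from the cache
]
scores = batch_cosine_similarity(
    cache.get_batch(contexts),
    cache.get("How tall is Taipei 101?"),
)
print(scores.shape, cache.misses)
```

Keying the cache on the raw text keeps the sketch simple; a production cache would more likely key on a hash of the text plus the embedding model's name, so that switching models invalidates stale entries.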