## How to evaluate Retrieval-Augmented Generation (RAG) results?
RAG process:

<span style='font-size: 0.64em;'>Chowdhury, Mohita & He, Yajie & Higham, Aisling & Lim, Ernest. (2025). ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems. 10.48550/arXiv.2501.08208.</span>
* Faithfulness / Groundedness
* Answer Relevance
* Context Relevance
### Commonly used metrics
#### Retriever
* Hit Rate / Accuracy
Whether any relevant document appears in the top-k retrieved results (see the sketch after this list)
* Precision
* Recall
* F1 Score

<span style='font-size: 0.64em;'>Seol, Da & Choi, Jeong & Kim, Chan & Hong, Sang. (2023). Alleviating Class-Imbalance Data of Semiconductor Equipment Anomaly Detection Study. Electronics. 12. 585. 10.3390/electronics12030585. [CC BY](https://creativecommons.org/licenses/by/4.0/)</span>
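
A minimal Python sketch of these four retriever metrics for a single query; the helper name and document ids are hypothetical, not from any particular library:

```python
def retrieval_metrics_at_k(retrieved_ids, relevant_ids, k):
    """Hit rate, precision, recall and F1 over the top-k retrieved documents.

    retrieved_ids: ranked list of doc ids returned by the retriever
    relevant_ids:  set of ground-truth relevant doc ids for the query
    """
    top_k = retrieved_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]

    hit_rate = 1.0 if hits else 0.0                      # any relevant doc in top-k?
    precision = len(hits) / k                            # fraction of top-k that is relevant
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"hit_rate": hit_rate, "precision": precision, "recall": recall, "f1": f1}


# toy example: 5 docs retrieved, 3 relevant docs in the gold set, 2 of them found
print(retrieval_metrics_at_k(["d1", "d4", "d7", "d2", "d9"], {"d4", "d2", "d8"}, k=5))
# {'hit_rate': 1.0, 'precision': 0.4, 'recall': 0.666..., 'f1': 0.5}
```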
* Mean Reciprocal Rank (MRR)
Score based on the rank of the 1^st^ relevant document
$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$
* Q: the set of sample queries
* rank~i~: the rank of the **first** relevant document for query i
* Normalised Discounted Cumulative Gain (NDCG)
[a comprehensive index considering rank](https://hackmd.io/@Ibi/r1x0dGlGWll)
* Coverage
The proportion of relevant documents (RD) that appear in the retrieved set (see the MRR/Coverage sketch after this list)
$\text{Coverage} = \frac{|RD \cap \text{Retrieved}|}{|RD|}$
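
A matching sketch for MRR and Coverage, mirroring the formulas above (again with hypothetical ids):

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """MRR over |Q| queries: average of 1/rank of the first relevant document."""
    reciprocal_ranks = []
    for retrieved_ids, relevant_ids in zip(ranked_results, relevant_sets):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank            # position of the *first* relevant doc
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


def coverage(retrieved_ids, relevant_ids):
    """Fraction of all relevant docs (RD) that appear in the retrieved set."""
    return len(relevant_ids & set(retrieved_ids)) / len(relevant_ids)


# two queries: first relevant doc at rank 2 and rank 1 -> MRR = (0.5 + 1.0) / 2
print(mean_reciprocal_rank([["d3", "d1"], ["d5", "d9"]], [{"d1"}, {"d5", "d9"}]))  # 0.75
print(coverage(["d5", "d9", "d2"], {"d5", "d9", "d7"}))                            # 0.666...
```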
#### Generator
* Faithfulness (Response <-> Relevant Documents)
Is the LLM's response grounded in the retrieved documents, or is it hallucinating?
$\text{Faithfulness} = \frac{\text{Number of claims supported by retrieved context}}{\text{Total number of claims}}$
* Relevance
How relevant is the LLM's response to the query? (RAGAS-style: generate N questions from the response and compare them with the original query)
$\text{Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)$
* E~gi~: embedding of the i^th^ question generated from the response
* E~o~: embedding of the original question (i.e. query)
* N: number of generated questions
* Correctness
Agreement between the model's response and the ground-truth label (accuracy)
* Perplexity (proxy)
The exponential of the average cross-entropy, i.e. how _uncertain_ the LLM is about its own output (see the sketch after this list)
$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i | w_1, w_2, \dots, w_{i-1})\right)$
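
A rough sketch of the Relevancy and Perplexity computations. The `embed` function, the generated questions, and the token log-probs are placeholders; in practice the embeddings come from your embedding model and the log-probs from the LLM API:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def answer_relevancy(embed, original_question, generated_questions):
    """Mean cosine similarity between the query embedding and the embeddings of
    N questions generated back from the response (the Relevancy formula above)."""
    e_o = embed(original_question)
    sims = [cosine(embed(q), e_o) for q in generated_questions]
    return sum(sims) / len(sims)

def perplexity(token_logprobs):
    """exp of the negative mean token log-probability of the generated answer."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# toy example: pretend per-token log-probs for a 4-token answer
print(perplexity([-0.1, -0.5, -0.2, -0.3]))   # ~1.32
```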
Benchmarks and automated metrics (a BLEU/ROUGE example follows this list):
* Bilingual Evaluation Understudy (BLEU)
* Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
* LLM-based (primarily GPT-based)
* RAGAS (Retrieval Augmented Generation Assessment System)
* Databricks Eval
* [...](https://arxiv.org/pdf/2504.14891)
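
A quick example of the classic n-gram metrics: BLEU via NLTK and ROUGE via the `rouge-score` package. Both are external dependencies and their APIs may differ across versions:

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "insulin lowers blood glucose by promoting cellular uptake"
candidate = "insulin reduces blood glucose by promoting uptake into cells"

# BLEU: n-gram precision of the candidate against the reference
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented n-gram / longest-common-subsequence overlap
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```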
#### Implementation tools:
* [RAGAS](https://docs.ragas.io/en/stable/) (Retrieval Augmented Generation Assessment System)
_[repository](https://github.com/explodinggradients/ragas)_ (a minimal usage sketch appears after this list)

* [Opik](https://github.com/comet-ml/opik)
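
With RAGAS, the evaluation loop looks roughly like the sketch below. It follows the 0.1.x-style API (`evaluate` over a `datasets.Dataset` with `question`/`answer`/`contexts`/`ground_truth` columns); the interface has changed across releases and the metrics call an LLM judge under the hood (OpenAI by default, so an API key is needed), so check the current docs before copying:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

samples = {
    "question": ["What does insulin do?"],
    "answer": ["Insulin lowers blood glucose by promoting cellular uptake."],
    "contexts": [["Insulin is a hormone that promotes glucose uptake into cells."]],
    "ground_truth": ["Insulin lowers blood glucose levels."],
}

dataset = Dataset.from_dict(samples)

# each metric scores every row; results are averaged per metric
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # e.g. {'faithfulness': ..., 'answer_relevancy': ..., 'context_precision': ...}
```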

---
### References
<!-- [RAG evaluation review](https://arxiv.org/abs/2504.14891) -->
1. Gan, A., Yu, H., Zhang, K., Liu, Q., Yan, W., Huang, Z., Tong, S., & Hu, G. (2025). Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey. arXiv preprint arXiv:2504.14891.
2. Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217. (Published in EACL 2024)
3. Chen, J., Lin, H., Han, X., & Sun, L. (2024). Benchmarking Large Language Models in Retrieval-Augmented Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17754-17762.
4. Iaroshev, I., Pillai, R., Vaglietti, L., & Hanne, T. (2024). Evaluating Retrieval-Augmented Generation Models for Financial Report Question and Answering. Applied Sciences, 14, 9318.
5. Gao, Y., et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997.
6. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, 33, 9459-9474.
7. [RAGAS official documentation](https://docs.ragas.io/en/stable/)