## How to evaluate Retrieval-Augmented Generation (RAG) results?
RAG process:

<span style='font-size: 0.64em;'>Chowdhury, Mohita & He, Yajie & Higham, Aisling & Lim, Ernest. (2025). ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems. 10.48550/arXiv.2501.08208.</span>
* Faithfulness / Groundedness
* Answer Relevance
* Context Relevance
### Commonly used metrics
#### Retriever
* Hit Rate / Accuracy
Whether any relevant document appears in the top-k retrieved results (see the sketch after this list)
* Precision
* Recall
* F1 Score

<span style='font-size: 0.64em;'>Seol, Da & Choi, Jeong & Kim, Chan & Hong, Sang. (2023). Alleviating Class-Imbalance Data of Semiconductor Equipment Anomaly Detection Study. Electronics. 12. 585. 10.3390/electronics12030585. [CC BY](https://creativecommons.org/licenses/by/4.0/)</span>
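
A minimal Python sketch of these four retriever metrics for a single query; the helper name and document ids are hypothetical, not from any particular library:

```python
def retrieval_metrics_at_k(retrieved_ids, relevant_ids, k):
    """Hit rate, precision, recall and F1 over the top-k retrieved documents.

    retrieved_ids: ranked list of doc ids returned by the retriever
    relevant_ids:  set of ground-truth relevant doc ids for the query
    """
    top_k = retrieved_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]

    hit_rate = 1.0 if hits else 0.0                      # any relevant doc in top-k?
    precision = len(hits) / k                            # fraction of top-k that is relevant
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"hit_rate": hit_rate, "precision": precision, "recall": recall, "f1": f1}


# toy example: 5 docs retrieved, 3 relevant docs in the gold set, 2 of them found
print(retrieval_metrics_at_k(["d1", "d4", "d7", "d2", "d9"], {"d4", "d2", "d8"}, k=5))
# {'hit_rate': 1.0, 'precision': 0.4, 'recall': 0.666..., 'f1': 0.5}
```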
* Mean Reciprocal Rank (MRR)
Score based on the rank of the 1^st^ relevant document
$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$
* Q: the set of sample queries
* rank~i~: the rank of the **first** relevant document for query i
* Normalised Discounted Cumulative Gain (NDCG)
[a comprehensive index considering rank](https://hackmd.io/@Ibi/r1x0dGlGWll)
* Coverage
The proportion of relevant documents (RD) that appear in the retrieved set (see the MRR/Coverage sketch after this list)
$\text{Coverage} = \frac{|RD \cap \text{Retrieved}|}{|RD|}$
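
A matching sketch for MRR and Coverage, mirroring the formulas above (again with hypothetical ids):

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """MRR over |Q| queries: average of 1/rank of the first relevant document."""
    reciprocal_ranks = []
    for retrieved_ids, relevant_ids in zip(ranked_results, relevant_sets):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank            # position of the *first* relevant doc
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


def coverage(retrieved_ids, relevant_ids):
    """Fraction of all relevant docs (RD) that appear in the retrieved set."""
    return len(relevant_ids & set(retrieved_ids)) / len(relevant_ids)


# two queries: first relevant doc at rank 2 and rank 1 -> MRR = (0.5 + 1.0) / 2
print(mean_reciprocal_rank([["d3", "d1"], ["d5", "d9"]], [{"d1"}, {"d5", "d9"}]))  # 0.75
print(coverage(["d5", "d9", "d2"], {"d5", "d9", "d7"}))                            # 0.666...
```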
#### Generator
* Faithfulness (Response <-> Relevant Documents)
Is the LLM's response grounded in the retrieved documents, or is it hallucinating?
$\text{Faithfulness} = \frac{\text{Number of claims supported by retrieved context}}{\text{Total number of claims}}$
* Relevance
How relevant is the LLM's response to the query? (RAGAS-style: generate N questions from the response and compare them with the original query)
$\text{Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)$
* E~gi~: embedding of the i^th^ question generated from the response
* E~o~: embedding of the original question (i.e. query)
* N: number of generated questions
* Correctness
Agreement between the model's response and the ground-truth label (accuracy)
* Perplexity (proxy)
The exponential of the average cross-entropy, i.e. how _uncertain_ the LLM is about its own output (see the sketch after this list)
$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i | w_1, w_2, \dots, w_{i-1})\right)$
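
A rough sketch of the Relevancy and Perplexity computations. The `embed` function, the generated questions, and the token log-probs are placeholders; in practice the embeddings come from your embedding model and the log-probs from the LLM API:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def answer_relevancy(embed, original_question, generated_questions):
    """Mean cosine similarity between the query embedding and the embeddings of
    N questions generated back from the response (the Relevancy formula above)."""
    e_o = embed(original_question)
    sims = [cosine(embed(q), e_o) for q in generated_questions]
    return sum(sims) / len(sims)

def perplexity(token_logprobs):
    """exp of the negative mean token log-probability of the generated answer."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# toy example: pretend per-token log-probs for a 4-token answer
print(perplexity([-0.1, -0.5, -0.2, -0.3]))   # ~1.32
```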
Benchmarks and automated metrics (a BLEU/ROUGE example follows this list):
* Bilingual Evaluation Understudy (BLEU)
* Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
* LLM-based (primarily GPT-based)
* RAGAS (Retrieval Augmented Generation Assessment System)
* Databricks Eval
* [...](https://arxiv.org/pdf/2504.14891)
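
A quick example of the classic n-gram metrics: BLEU via NLTK and ROUGE via the `rouge-score` package. Both are external dependencies and their APIs may differ across versions:

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "insulin lowers blood glucose by promoting cellular uptake"
candidate = "insulin reduces blood glucose by promoting uptake into cells"

# BLEU: n-gram precision of the candidate against the reference
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented n-gram / longest-common-subsequence overlap
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```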
#### Implementation tools:
* [RAGAS](https://docs.ragas.io/en/stable/) (Retrieval Augmented Generation Assessment System)
_[repository](https://github.com/explodinggradients/ragas)_ (a minimal usage sketch appears after this list)

* [Opik](https://github.com/comet-ml/opik)
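
With RAGAS, the evaluation loop looks roughly like the sketch below. It follows the 0.1.x-style API (`evaluate` over a `datasets.Dataset` with `question`/`answer`/`contexts`/`ground_truth` columns); the interface has changed across releases and the metrics call an LLM judge under the hood (OpenAI by default, so an API key is needed), so check the current docs before copying:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

samples = {
    "question": ["What does insulin do?"],
    "answer": ["Insulin lowers blood glucose by promoting cellular uptake."],
    "contexts": [["Insulin is a hormone that promotes glucose uptake into cells."]],
    "ground_truth": ["Insulin lowers blood glucose levels."],
}

dataset = Dataset.from_dict(samples)

# each metric scores every row; results are averaged per metric
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # e.g. {'faithfulness': ..., 'answer_relevancy': ..., 'context_precision': ...}
```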

---
### References
<!-- [RAG evaluation review](https://arxiv.org/abs/2504.14891) -->
1. Gan, A., Yu, H., Zhang, K., Liu, Q., Yan, W., Huang, Z., Tong, S., & Hu, G. (2025). Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey. arXiv preprint arXiv:2504.14891.
2. Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217. (Published in EACL 2024)
3. Chen, J., Lin, H., Han, X., & Sun, L. (2024). Benchmarking Large Language Models in Retrieval-Augmented Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17754-17762.
4. Iaroshev, I., Pillai, R., Vaglietti, L., & Hanne, T. (2024). Evaluating Retrieval-Augmented Generation Models for Financial Report Question and Answering. Applied Sciences, 14, 9318.
5. Gao, Y., et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997.
6. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, 33, 9459-9474.
7. [RAGAS official documentation](https://docs.ragas.io/en/stable/)