<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://arxiv.org/abs/2309.01431) | [Note link](https://blog.csdn.net/m0_52695557/article/details/134247484) | [Code link](https://github.com/chen700564/RGB) | AAAI 2024

:::success
**Thoughts**

This study proposes a benchmark to evaluate the performance of large language models with Retrieval-Augmented Generation (RAG).
:::

## Abstract

Existing research lacks an evaluation benchmark for assessing the impact of retrieval-augmented generation on large language models. This study systematically analyzes the performance of various large language models with respect to four fundamental abilities required for RAG:

1. Noise Robustness
2. Negative Rejection
3. Information Integration
4. Counterfactual Robustness

![image](https://hackmd.io/_uploads/HJGaijx50.png)

## Background

Retrieval-Augmented Generation (RAG) has been regarded as a promising solution for addressing issues such as hallucination, outdated knowledge, and the lack of domain-specific expertise. However, RAG also introduces potential drawbacks: the retrieved information may include fake news, which can mislead large language models into generating unreliable outputs. It is therefore important to conduct a comprehensive evaluation of how effectively large language models utilize retrieved information.

## Method

This paper constructs a benchmark for evaluating RAG in large language models, the Retrieval-Augmented Generation Benchmark (RGB), which supports both English and Chinese. RGB consists of four testbeds, each assessing one fundamental ability of large language models and addressing a common challenge in RAG:

1. **Noise Robustness**: Can the LLM extract useful information from noisy documents?
2. **Negative Rejection**: Can the LLM refuse to answer the question when the required knowledge is not present in any retrieved document?
3. **Information Integration**: Can the LLM answer complex questions that require integrating information from multiple documents?
4. **Counterfactual Robustness**: Can the LLM identify known factual errors in the retrieved documents when the instructions warn it about potential risks?

![image](https://hackmd.io/_uploads/B1f0VhgcC.png)

RGB data is generated in three steps (a rough code sketch of this pipeline is given after the noise-robustness results below):

1. A model (ChatGPT) extracts (event, question, answer) triplets from news articles.
2. A search engine (Google API) retrieves relevant web pages.
3. A dense retrieval model re-ranks the content of these web pages.

Below are the instructions used in their experiments:

![image](https://hackmd.io/_uploads/rkSJr2eqA.png)

## Experiment

### Large Language Models

- ChatGPT
- ChatGLM-6B
- ChatGLM2-6B
- Vicuna-7b-v1.3
- Qwen-7B-Chat
- BELLE-7B-2M

### Evaluation metrics

- **Accuracy**: Measures noise robustness and information integration.
- **Rejection Rate**: Measures negative rejection.
- **Error Detection Rate**: Measures whether the model can detect factual errors in the documents, for counterfactual robustness.
- **Error Correction Rate**: Measures whether the model can provide the correct answer after identifying errors, for counterfactual robustness.

### Noise Robustness

![Noise Robustness](https://hackmd.io/_uploads/ryyIvne9R.png)

We can see that an increasing noise rate poses a challenge for RAG in LLMs.
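As promised above, here is a rough sketch of the three-step data-construction pipeline from the Method section. This is only my illustration, not the authors' code: the extraction prompt, the `gpt-3.5-turbo` model name, the `search_web` stub, and the choice of `all-MiniLM-L6-v2` as the dense retriever are all assumptions standing in for ChatGPT, the Google API, and the paper's own dense re-ranker.

```python
import json
from typing import List

from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in dense retriever

def extract_triplet(article: str) -> dict:
    """Step 1: prompt an LLM to extract an (event, question, answer)
    triplet from a news article. The prompt wording is an assumption."""
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": (
            "Extract one (event, question, answer) triplet from the news "
            "article below. Return JSON with keys 'event', 'question', "
            f"'answer'.\n\n{article}")}],
        temperature=0,
    )
    return json.loads(out.choices[0].message.content)

def search_web(query: str) -> List[str]:
    """Step 2: fetch candidate passages. The paper uses the Google Search
    API; this stub is a placeholder for whatever search client you have."""
    raise NotImplementedError

def rerank(query: str, passages: List[str], k: int = 5) -> List[str]:
    """Step 3: re-rank retrieved passages by dense cosine similarity to
    the query and keep the top k."""
    q_emb = encoder.encode(query, convert_to_tensor=True)
    p_emb = encoder.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_emb)[0]
    top = scores.argsort(descending=True)[:k]
    return [passages[int(i)] for i in top]
```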
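The headline metrics listed above are computed by exact matching over the generated text (as the Counterfactual Robustness table below makes explicit for ED). A minimal sketch, assuming a fixed refusal sentence in the prompt whose exact wording I am paraphrasing:

```python
from typing import List

# Hypothetical rejection marker: RGB's prompts ask the model to emit a
# fixed refusal sentence when the documents are insufficient; the exact
# wording here is a paraphrase, not the benchmark's literal string.
REJECTION_MARKER = "I can not answer the question"

def accuracy(responses: List[str], answers: List[List[str]]) -> float:
    """Exact-match accuracy: a response counts as correct if it contains
    at least one acceptable answer string."""
    hits = sum(
        any(gold.lower() in resp.lower() for gold in golds)
        for resp, golds in zip(responses, answers)
    )
    return hits / len(responses)

def rejection_rate(responses: List[str]) -> float:
    """Fraction of responses containing the refusal marker (used on the
    negative-rejection testbed, where every question should be rejected)."""
    rejected = sum(REJECTION_MARKER.lower() in r.lower() for r in responses)
    return rejected / len(responses)

# Toy usage:
resps = ["The ceremony was held in Los Angeles.",
         "I can not answer the question because of insufficient information."]
golds = [["Los Angeles"], ["Tokyo"]]
print(accuracy(resps, golds))   # 0.5
print(rejection_rate(resps))    # 0.5
```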
### Negative Rejection

![Negative Rejection](https://hackmd.io/_uploads/SynwP2xqR.png)

Rej denotes the rejection rate (%), and Rej$^∗$ denotes the rejection rate as evaluated by ChatGPT.

### Information Integration

![Information Integration](https://hackmd.io/_uploads/B1oKwhl90.png)

Comparing this table with the one under "Noise Robustness", we observe that the models have weak information integration ability, which in turn limits their noise robustness.

### Counterfactual Robustness

![Counterfactual Robustness](https://hackmd.io/_uploads/Skvswnx5C.png)

ACC is the accuracy (%) of LLMs without external documents, and ACC$_{doc}$ is the accuracy (%) of LLMs with counterfactual documents. ED and ED$^∗$ are the error detection rates evaluated by exact matching and by ChatGPT, respectively. CR is the error correction rate.
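Rej$^∗$ and ED$^∗$ exist because exact matching misses paraphrased refusals or error reports, so ChatGPT is additionally used as a judge. Below is a minimal sketch of such an LLM-as-judge check; the judging prompt, the `gpt-3.5-turbo` model name, and the data layout are my assumptions, not the paper's exact setup.

```python
from typing import List

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical judging prompt -- the paper's actual instruction to the
# judge model may be worded differently.
JUDGE_PROMPT = (
    "Below are a question and a model's response. Answer only 'yes' or "
    "'no': does the response state that the provided documents contain "
    "factual errors?\n\nQuestion: {question}\nResponse: {response}"
)

def judge_detected_error(question: str, response: str) -> bool:
    """Ask the judge model whether the response flags the documents
    as containing factual errors."""
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  response=response)}],
        temperature=0,
    )
    return out.choices[0].message.content.strip().lower().startswith("yes")

def error_detection_rate(questions: List[str], responses: List[str]) -> float:
    """ED*: fraction of counterfactual cases where the judge says the
    model flagged a factual error in the documents."""
    flags = [judge_detected_error(q, r) for q, r in zip(questions, responses)]
    return sum(flags) / len(flags)
```

CR would then apply the exact-match accuracy check from the earlier sketch, restricted to the cases where an error was detected.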