Based on [RRRRAGAS](/n9CR0xXsT0KbsFWAn92XaQ). Removed metric: Context Relevance

### 1.
To evaluate the conversational search feature, we use the RAGAS framework (Es et al., 2024), ==focusing on the Faithfulness and the Answer Relevance== of generated responses. **Faithfulness** evaluates whether the generated answer is grounded in the given context, which is important to avoid hallucinations. **Answer relevance** evaluates whether the generated answer actually addresses the provided question. We use GPT-4 to generate 50 random questions related to NLP, such as "Define perplexity in the context of language models". Subsequently, we utilize GPT-3.5 (OpenAI, 2022) and GPT-4 in our conversational search pipeline described in §3.6 to generate grounded answers from retrieved publications. Finally, we use RAGAS to evaluate the generated responses. As shown in Table 5, both LLMs exhibit high faithfulness and answer relevance scores, indicating their ability to retrieve relevant publications from the RAG pipeline and effectively answer user queries based on the provided contexts.

| Model | Faithfulness | Answer Relevance |
| -------- | -------- | -------- |
| gpt-3.5-turbo-0125 | 0.9661 | 0.8479 |
| gpt-4-0125-preview | 0.9714 | 0.8670 |

Table 5: Evaluation results of our conversational search pipeline. Metrics are scaled between 0 and 1, whereby the higher the score, the better the performance.

### 2.
==For RAG capability evaluation, we adopt four metrics from RAGAs==, including Faithfulness, Context Relevancy, Answer Relevancy, and Answer Correctness. **Faithfulness** measures how factually consistent the generated answer is with the retrieved context. An answer is considered faithful if all claims made can be directly inferred from the provided context. **Context Relevancy** evaluates how relevant the retrieved context is to the original query. **Answer Relevancy** assesses the pertinence of the generated answer to the original query. **Answer Correctness** involves the accuracy of the generated answer when compared to the ground truth. For example, Context Relevancy is calculated as the proportion of sentences within the retrieved context that are relevant for answering the given question:

$$\text{context relevancy} = \frac{|S|}{|\text{Total}|}$$

where $|S|$ denotes the number of relevant sentences and $|\text{Total}|$ denotes the total number of sentences retrieved. All these metrics are evaluated using the RAGAs framework, with GPT-4 serving as the judge.

### 4.
To ensure the chatbot effectively addresses the specific needs of Volvo Group’s truck repair operations, a thorough evaluation framework was implemented as part of the research methodology. As with any machine learning model, the performance of individual components within the LLM and RAG pipeline significantly influences the overall user experience. For this evaluation, we employed the RAGAs library [7], which provides specialized metrics designed to evaluate each component of the RAG pipeline.

![image](https://hackmd.io/_uploads/r1hNYl33A.png)

Figure 3 outlines a framework for evaluating the performance of a RAG system, focusing on the alignment of generated answers with the relevant context and ground-truth data. The main components of the evaluation framework are: Question, Answer, Context, and Ground truth (optional). The evaluation paths are described below.
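As a rough illustration of how these four components fit together, a single evaluation record can be represented as a simple mapping (the field names follow the common ragas dataset convention and the truck-repair strings are hypothetical, not taken from the cited paper):

```python
# One evaluation record: the four inputs the RAG evaluation framework works with.
# Field names ("question", "answer", "contexts", "ground_truth") follow the usual
# ragas dataset convention; the values are hypothetical, for illustration only.
record = {
    "question": "What torque is specified for the rear wheel hub bolts?",   # user query
    "contexts": [                                                           # chunks returned by the retriever
        "Section 7.2: Rear wheel hub bolts must be tightened to 650 Nm ...",
    ],
    "answer": "The rear wheel hub bolts should be tightened to 650 Nm.",    # generated by the LLM
    "ground_truth": "650 Nm, applied in a cross pattern.",                  # optional human reference
}
```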
#### Faithfulness
Faithfulness measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, scaled to a range of (0, 1), with higher values indicating better performance. A generated answer is considered faithful if all the claims made in the answer can be inferred from the provided context. To determine this, a set of claims is first identified from the generated answer, and each claim is then cross-checked against the given context to see whether it can be inferred from it. The formula for faithfulness is:

![image](https://hackmd.io/_uploads/H1oCYl3hA.png)

#### Answer Relevancy
Answer Relevancy assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, while higher scores indicate better relevancy. ==This metric is computed using the question, the context, and the answer.== Answer Relevancy is defined as the mean cosine similarity of the original question to a number of artificial questions that are reverse-engineered from the answer:

![image](https://hackmd.io/_uploads/rkNbql3nR.png)
![image](https://hackmd.io/_uploads/SJLf9gh2R.png)
![image](https://hackmd.io/_uploads/ryQm5x3nA.png)

#### Context Precision
Context Precision evaluates whether the ground-truth-relevant items present in the contexts are ranked at the top. Ideally, all relevant records should appear at the top ranks. This metric is computed using the question, the ground truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision. The formula for Context Precision is:

![image](https://hackmd.io/_uploads/BJVP9gnh0.png)

#### Context Recall
Context Recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed from the ground truth and the retrieved context, with values ranging between 0 and 1, where higher values indicate better performance. To estimate context recall from the ground-truth answer, each sentence in the ground-truth answer is analyzed to determine whether it can be attributed to the retrieved context. In an ideal scenario, all sentences in the ground-truth answer should be attributable to the retrieved context. The formula for context recall is:

![image](https://hackmd.io/_uploads/B1jhqlh30.png)

### 5.
Evaluation pipelines such as RAGAs have been utilized to assess the performance of the RAG Agent without the need for references. LLM-based metrics for evaluation: RAGAs. The RAGAs framework was created to evaluate the effectiveness of QA chatbots [13]. This evaluation framework offers a comprehensive assessment of the performance of the RAG pipeline. A significant feature of the RAGAs framework is its ability to use LLMs for evaluation without the need for reference annotations by human experts. This strategy not only simplifies the evaluation process but also improves scalability by reducing the dependence on manual annotations. However, RAGAs also has potential drawbacks, particularly the subjective nature of its evaluation. This subjectivity could lead to biases and inconsistencies in the evaluation process if not closely monitored. Therefore, a controlled application of RAGAs is crucial during the evaluation phase.
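As a rough sketch of how such an LLM-judged evaluation is typically invoked with the open-source `ragas` package (metric objects and column names assume the v0.1.x API and are not taken from the excerpts above; faithfulness and answer relevancy need no human reference, while context precision/recall use the optional ground truth):

```python
# Minimal sketch assuming the ragas v0.1.x API; not the exact setup used in the
# papers excerpted above. Requires a configured judge LLM (OpenAI key by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

data = {
    "question": ["Define perplexity in the context of language models."],
    "contexts": [["Perplexity is the exponentiated average negative log-likelihood ..."]],
    "answer": ["Perplexity measures how well a language model predicts a held-out sample ..."],
    "ground_truth": ["Perplexity is the exponential of the model's cross-entropy on the text."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.87, ...}
```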
The performance of the RAG Agent on the RAGAs score is summarized in Table 3.4. The principal observations and corresponding conclusions regarding the performance of the RAG Agent are delineated below. Additionally, the subsequent steps to enhance the efficacy of EIC-RAG are enumerated. The LLM used by RAGAs for scoring is GPT-4.

*Table: Performance of the RAG Agent on RAGAs scores. The errors shown are statistically extracted from the respective score distributions.*

![image](https://hackmd.io/_uploads/Hy-Jpghn0.png)

### 6
RAGAS: Automated Evaluation of Retrieval Augmented Generation, a paper by Es et al., introduces a framework for the evaluation of RAG pipelines. Implementing RAG requires a great deal of tuning, as its performance depends on many different components, such as the retrieval model, the context itself, the LLM, and the prompt template. A method to evaluate the RAG pipeline during this tuning is therefore paramount. RAG systems are often evaluated by their ability to tackle the LLM task itself, by measuring perplexity on some context. They consider a standard RAG setting, where a question is used to fetch some external context before being used to generate an answer. The evaluation focuses on three different quality aspects. **Faithfulness** is a metric of how well the answer is grounded in the provided context. Secondly, **Answer Relevance** refers to the fact that the generated answer should address the actual question that was given. Finally, **Context Relevance** focuses on how good the RAG pipeline is at retrieving only relevant context, containing as little irrelevant information as possible.

A testing dataset was created in the form of a document as context, together with relevant example questions. It was then possible to use these example questions to see which embedding model best retrieves relevant context. Another evaluation methodology that was used is the one mentioned in Section 2.2.5, from the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation. This evaluation method requires a dataset similar to the one already mentioned, in addition to answers given by the RAG pipeline and their respective ground truths. The RAGAS metrics that were used include context precision, context recall, and context relevancy. Context precision evaluates whether the most relevant text chunks are ranked highest when fetching the top four most relevant text chunks; the more relevant text chunks should be ranked at the top and given first to the LLM. Context recall is a measure of how well the retrieved context aligns with the ground-truth answer. Finally, context relevancy is a measure of how relevant the retrieved context is to the question: if this number is low, a lot of irrelevant context has been retrieved, while a higher number means more relevant context. A more in-depth description of experiments and results can be found below in Section 4.3.

Another methodology used for comparing embedding models was the one mentioned in Section 2.2.5, from the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation [9]. To utilize the RAGAS framework, a dataset similar to the one used above has to be created. Firstly, example questions have to be created, followed by the ground-truth answers to these questions. Then the context chunks retrieved by each embedding model for each of the example questions have to be collected.
Finally, the answer to each of the questions when using each of the embedding models has to be found. As example questions, the twenty test prompts were used. The ground-truth answers were generated by using OpenAI’s gpt-4 model and giving it the correct context chunks manually. The context retrieved by each embedding model was found by using cosine similarity, as in the methodology above. Finally, the answers derived when using each of the embedding models were produced with the same gpt-4 model, with the same temperature (0.5), given the context retrieved by the respective embedding model.

![image](https://hackmd.io/_uploads/ryUERgnnR.png)

In the thesis, RAGAS metrics were used to compare the embedding models.

### 7
Throughout this process, we employ the RAGAS framework [7] to evaluate the effectiveness of the RAG mechanism implemented within our framework. Leveraging state-of-the-art LLMs, such as OpenAI’s GPT-4 [22] or Anthropic’s Claude [3], as evaluative judges has been studied by Zheng et al. [35]. The framework scrutinizes the quality of RAG components by incorporating an LLM as a judge for the ground-truth creation. The RAGAS framework delineates four pivotal metrics for assessing the efficacy of the RAG mechanism.

- Context Recall: Assesses the relevance of the context to the question, aiming to cover a higher number of attributes from the ground truth.
- Context Precision: Evaluates whether all relevant ground-truth items are included in the context.
- Faithfulness: Measures the factual consistency of the generated answer with the given context.
- Answer Relevancy: Evaluates the relevance of the generated answer to the question, penalizing incompleteness or redundancy rather than factuality.

Following the generation of the ground-truth dataset, we proceeded with evaluating the RAG mechanism employed in the framework using the RAGAS [7] framework.

### 8.
To evaluate the performance of our system, we adopt a twofold approach including both quantitative and usability assessment methods. For the quantitative evaluation (see Section IV-B), we utilize the RAGAS [40] framework, while the SUS is adopted for usability assessment (see Section IV-C). In the following subsection, we first discuss the dataset and the steps we took to preprocess it, and then provide a detailed explanation of our evaluation approaches. We focus on the following metrics from the RAGAS framework (Es et al., 2023); higher values are better for all of them.

- Faithfulness (FaiFul): Checks whether the (generated) statements from the RAG response are present in the retrieved context through verdicts; the ratio of supported verdicts to the total number of statements in the response is the answer’s faithfulness.
- Answer Relevance (AnsRel): The average cosine similarity of the user’s question with generated questions, using the RAG response as the reference, is the answer relevance.
- Context Relevance (ConRel): The ratio of the number of statements considered relevant to the question given the context to the total number of statements in the context is the context relevance.
- Answer Similarity (AnsSim): The similarity between the embedding of the RAG response and the embedding of the ground-truth answer.
- Factual Correctness (FacCor): The F1-score of statements in the RAG response classified as True Positive, False Positive, and False Negative by the RAGAS LLM.
- Answer Correctness (AnsCor): Determines the correctness of the generated answer w.r.t. the ground truth (as a weighted sum of factual correctness and answer similarity; see the sketch at the end of this section).
- The RAGAS library has been a black box; hence, interpretability of the scores is difficult, as the scores conflicted with human scores by SMEs. To address this, we store the intermediate outputs and verdicts.

For details of the computation of these metrics, readers can refer to Appendix A, and sample output for representative questions is in Appendix B. We conduct the following experiments to analyse RAG outputs using RAGAS metrics: (i) compute RAGAS metrics on RAG output, (ii) domain-adapt the BAAI family of models for the retriever in RAG using PT and FT and assess the impact on RAGAS metrics, and (iii) instruction fine-tune the LLM for RAG with RAGAS evaluation metrics.
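As a minimal sketch of how Answer Correctness combines Factual Correctness and Answer Similarity as defined above (the function names and the 0.75/0.25 weighting are illustrative assumptions, not values reported in the excerpt):

```python
# Sketch of Answer Correctness as a weighted sum of factual correctness (an F1
# over statement-level TP/FP/FN verdicts from the judge LLM) and answer similarity
# (embedding cosine similarity). Weights are illustrative assumptions.
def factual_correctness(tp: int, fp: int, fn: int) -> float:
    # F1 over statements classified as true positive / false positive / false negative.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def answer_correctness(tp: int, fp: int, fn: int,
                       answer_similarity: float,
                       w_factual: float = 0.75, w_similarity: float = 0.25) -> float:
    # Weighted sum of factual correctness (F1) and embedding-based answer similarity.
    return w_factual * factual_correctness(tp, fp, fn) + w_similarity * answer_similarity
```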
### 9. RAG evaluation
Recently, several parallel efforts have proposed approaches to automated RAG evaluation. In RAGAS [9], the authors query an LLM judge (GPT-3.5) with a curated prompt to evaluate the context relevance, answer relevance, and faithfulness of a RAG response. In response, automated RAG evaluation systems like RAGAS [9] and TruLens [37] have emerged; these systems adopt a zero-shot LLM prompt-based approach to predict a set of curated RAG evaluation metrics. ARES [33] and RAGAS [9] define a **context relevance** metric to evaluate the quality of the retrieved documents, along with **answer relevance** and **faithfulness** to evaluate the quality of the generative model.

### 10.
To evaluate BARKPLUG V.2’s ability to produce contextually appropriate responses, we utilize the RAGAS framework [40]. We choose this framework because it is specifically designed to assess RAG pipelines. Other popular evaluation metrics such as ROUGE [41] and BLEU [42] are not suitable in our context, because ROUGE is generally used to evaluate summarization tasks, while BLEU is designed to evaluate language translation tasks. We evaluate both phases of the BARKPLUG V.2 architecture (see Section III), i.e., context retrieval and completion. To evaluate retrieval, we employ two metrics: **context precision** and **context recall**. The first represents the Signal-to-Noise Ratio (SNR) of the retrieved context, while the second evaluates whether the retriever is able to retrieve all the relevant evidence needed to answer a question. Similarly, to evaluate completion (generation) we employ the faithfulness and answer relevance metrics. Faithfulness evaluates how factually accurate the generated answer is, while answer relevance evaluates how relevant the generated answer is to the question. The final RAGAS score, the harmonic mean of these four metrics, falls within a range of 0 to 1, with 1 denoting optimal generation. This score serves as a single measure of a QA system’s performance. Therefore, the RAGAS score is essential for assessing the overall performance and relevance of BARKPLUG V.2 in its targeted educational environments.

### 11. RAG Evaluation Frameworks
We evaluate two commercial RAG evaluation frameworks: RAGAS (v0.1.7) (Es et al., 2024) and TruLens (v0.13.4). We report the RAGAS **Faithfulness** and TruLens Groundedness metrics, which are designed for hallucination detection.

### 12. Using RAGAs to evaluate RAG
Ragas (35) is a large-scale model evaluation framework designed to assess the effectiveness of Retrieval-Augmented Generation (RAG). It aids in analyzing the output of models, providing insights into their performance on a given task. To assess the RAG system, Ragas requires the following information:

- Questions: Queries provided by users.
- Answers: Responses generated by the RAG system (elicited from a large language model, LLM).
- Contexts: Documents relevant to the queries, retrieved from external knowledge sources.
- Ground Truths: Authentic answers provided by humans, serving as the correct references for the questions. This is the only human-annotated input required.

Once Ragas obtains this information, it uses LLMs to evaluate the RAG system. Ragas’s evaluation metrics comprise Faithfulness, Answer Relevance, Context Precision, Context Relevancy, Context Recall, Answer Semantic Similarity, Answer Correctness, and Aspect Critique. For this study, our chosen evaluation metrics are **Faithfulness, Answer Relevance, and Context Recall**.

#### Faithfulness
Faithfulness is evaluated by assessing the consistency of generated answers with the provided context, derived from both the answer itself and the retrieved context. Scores are scaled from 0 to 1, with higher scores indicating greater faithfulness. An answer is deemed reliable if all assertions within it can be inferred from the given context. To compute this value, a set of assertions is first identified from the generated answer, and each assertion is then cross-checked against the provided context. Equation (2) for computing faithfulness is as follows:

![image](https://hackmd.io/_uploads/BJu5V-nnR.png)

#### Answer relevance
To evaluate the relevance of answers, an LLM is used to generate potential questions from the answer, and their similarity to the original question is computed. The relevance score of an answer is determined by averaging the similarity between all generated questions and the original question.

![image](https://hackmd.io/_uploads/HkhxrWn30.png)

#### Context recall
Context recall assesses how well the retrieved context aligns with the authentic answers provided by humans. It is calculated by comparing the ground truth with the retrieved context, with scores ranging from 0 to 1, where higher scores indicate better performance. To estimate context recall based on the authentic answers, each sentence in the authentic answers is examined to determine whether it can be attributed to the retrieved context. Ideally, all sentences in the authentic answers should be attributable to the retrieved context. The context recall score is calculated using Equation (4):

![image](https://hackmd.io/_uploads/ByxVBb22A.png)

This formula quantifies the proportion of sentences in the authentic answers that can be attributed to the retrieved context, providing a measure of how well the retrieved context aligns with the ground truth.

**RAGAs scores.** Our Ragas evaluation relies on GPT-3.5-Turbo. To ensure diversity and representativeness in the test set, we meticulously designed the distribution of questions across categories such as “simple,” “inference,” “multi-context,” and “conditional.” Adhering to these guidelines, we curated a test set comprising 20 questions for assessment. The evaluation outcomes of Ragas are consolidated in Table 3. When employing the RAGAS framework to evaluate GastroBot, we attained a context recall rate of 95%, with faithfulness reaching 93.73% and an answer relevancy score of 92.28%.
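As a minimal sketch (an illustration of the definitions above, not the GastroBot implementation), once the judge LLM has produced its claim and sentence verdicts and the question embeddings are available, the three chosen metrics reduce to simple ratios and similarities:

```python
# Illustrative sketch of the three chosen metrics, assuming the judge LLM has
# already produced claim/sentence verdicts and embeddings have been computed.
from typing import List
import numpy as np

def faithfulness(supported_claims: int, total_claims: int) -> float:
    # Fraction of claims in the generated answer that the context supports.
    return supported_claims / total_claims

def answer_relevance(q_emb: np.ndarray, gen_q_embs: List[np.ndarray]) -> float:
    # Mean cosine similarity between the original question embedding and the
    # embeddings of questions the LLM generated back from the answer.
    sims = [
        float(np.dot(q_emb, g) / (np.linalg.norm(q_emb) * np.linalg.norm(g)))
        for g in gen_q_embs
    ]
    return float(np.mean(sims))

def context_recall(attributed_gt_sentences: int, total_gt_sentences: int) -> float:
    # Fraction of ground-truth sentences that can be attributed to the retrieved context.
    return attributed_gt_sentences / total_gt_sentences
```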
### 13 Result - Hybrid Context RAG Pipeline Evaluation
To evaluate the pipeline we use the Retrieval Augmented Generation Assessment (RAGAs) [ref] evaluation framework. We create a validation set which contains 10 manually created original user queries and ground-truth answers, as well as the context and the answer generated by the pipeline. RAGAs uses each of these fields to calculate the pipeline metrics. The pipeline was evaluated on faithfulness, answer relevancy, context relevance, context recall, and answer correctness. These metrics are described below.

- Faithfulness: The faithfulness metric is scaled [0, 1], where 1 is the optimal outcome. The faithfulness score is a fraction whose numerator is the “number of claims in the generated answer that can be inferred from the given context” and whose denominator is the “total number of claims in the generated answer”. The metric evaluates the “factual consistency of the generated answer against the given context”.
- Answer Relevancy: Answer relevancy uses the pipeline response (answer) to create LLM-generated synthetic queries, then calculates the mean cosine similarity of the actual query and the synthetic queries. The relevance score in practice is [0, 1], where 1 is the optimal outcome; however, the documentation notes that this range is not guaranteed, given that a cosine similarity score can range over [-1, 1].
- Context Relevance: Context relevance evaluates the retriever and measures how well the retrieved context aligns with the question. Context relevance is scaled [0, 1], where 1 is the optimal outcome. The metric is the ratio of the cardinality of the set of relevant sentences, |S|, to the “total number of sentences in the retrieved context”.
- Context Recall: Context recall evaluates the retriever and measures how well the retrieved context aligns with the ground truth. Context recall is scaled [0, 1], where 1 is the optimal outcome. It is a fraction whose numerator is the cardinality of the “ground-truth sentences that can be attributed to the context” and whose denominator is the “number of sentences in the ground truth”. The metric evaluates the extent to which “the retrieved context aligns with the annotated answer, treated as the ground truth”.
- Answer Correctness: The answer correctness metric is a score ranging [0, 1], where 1 is the optimal outcome. It gauges the accuracy of the pipeline-generated response by comparing the generated answer with the ground truth. The calculated metric is a weighted sum of the “factual correctness” (F1 score) and the cosine similarity of the answer and ground-truth vectors.

*Table: RAG Pipeline Metric Summary*

![image](https://hackmd.io/_uploads/HJ5COxhh0.png)

### 14 Case study - Empirical evaluation
We quantitatively assessed the RAG workflow designed for software usage Q&A using the RAGAs evaluation framework [(Es et al., 2024)](https://arxiv.org/abs/2309.15217). We had ChatGPT pose as a user and ask 20 questions about Vectorworks, covering basic usage, advanced features, troubleshooting, etc. A synthetic validation set was created after manually correcting some hallucination issues. We comprehensively evaluated the performance of the various components in our RAG pipeline using the **faithfulness, context utilization, and answer relevancy** metrics provided by the RAGAs framework. Faithfulness measures the consistency of the generated answers with the factual content in the given context, answer relevancy assesses how relevant the generated answers are to the given query, and context utilization calculates whether the retrieved context can be used to answer the query.
The results in Table 2 demonstrate that our agent can reliably answer software usage questions based on external knowledge.

*Table: Average evaluation metrics based on 20 usage questions*

> No formula

### 15 Methodology - Assessment Criteria
In our experimental evaluation of Large Language Models (LLMs) for Retrieval Augmented Generation (RAG) applications, our focus is on assessing both Generation Quality and Inference Speed.

**Generation Quality**. For generation quality, we focused on the key metrics below:

1. Faithfulness
![image](https://hackmd.io/_uploads/SyDXieh30.png)
2. Answer Relevance (still more)
![image](https://hackmd.io/_uploads/H1KEoe3nA.png)
3. Overall Score
![image](https://hackmd.io/_uploads/B1lUog33A.png)

*Results are in the Experiment Result section.*

### 16 Evaluation - Context Retrieval
![image](https://hackmd.io/_uploads/BkZoigh3A.png)

### 17 Two Evaluation Protocols
We assess two reference-free protocols.

- ==RAGAS-Fact== (Es et al., 2023): This protocol utilizes the context-query-response triplets to assess the veracity of responses. It evaluates the faithfulness of a response by calculating the ratio of claims grounded in the context to the total claims made. This process involves identifying statements that hold atomic facts, following the methodology outlined by Chern et al. (2023).
- SelfCheck (Manakul et al., 2023): This protocol relies on the stochastic generation of responses, based on the premise that incorrect answers are unlikely to be produced consistently. This principle, initially applied to LLMs, is adapted in our analysis of RAG systems. In a departure from its original application, we enhance the evaluation prompt to include not only the stochastic responses but also the query itself. Refer to Appendix A for details. To generate four stochastic samples, we set the temperature to 1.0, in contrast with the temperature of 0.0 used to generate the primary response. A response is deemed correct if it aligns consistently across all stochastic samples.

*Result - Table*

### 18 Evaluation of LLM-guided extraction
As RAG is integral to the LLM, we assess the effectiveness of R generated by our model given P. A common metric for evaluating RAG-generated responses is RAGAS [ref]. Our model’s responses are compared with those generated by humans. Figure 4 illustrates the **correctness scores** for all questions. Questions with low scores (<0.5) required more specific information regarding the privacy policies. Since our LLM only accesses GDPR articles, it performs well on questions regarding the similarity of a privacy policy to a GDPR article but struggles with overly specific questions about the policy itself.

*Table: RAGAS answer correctness*

### 19 Quantitative Evaluation
For the quantitative assessment of our system’s performance in generating contextually relevant Response Completion C, we exclusively utilize the RAGAS (Es et al. 2023) framework, a comprehensive metric designed for evaluating RAG pipelines. We opted for this framework because other popular frameworks such as BLEU (Papineni et al. 2002) and ROUGE (Lin 2004) are irrelevant to our context: BLEU is primarily tailored for evaluating machine translation tasks, and ROUGE is specialized for evaluating text summarization tasks. These metrics examine the structural similarity between the ground truth and the generated sentences. In our case, the sentences may be structurally different but factually similar; existing metrics do not capture this.
Given that LOCALINTEL primarily operates as a RAG-based question answering system with context mapping and summarization components, these traditional metrics are not well suited to measuring its efficacy. In our context, the mean RAGAS score is 0.9535 (the score ranges from 0 to 1, with 1 being optimal generation), with a standard deviation of 0.0226 (see Figure 3), highlighting LOCALINTEL’s proficiency in delivering contextually relevant answers within the framework of retrieval-augmented question-answering tasks.

*Figure: RAGAS evaluation score for Completion C with respect to the human evaluator ground truth.*

### 20 Experiment & Result - Evaluation metric
**Evaluation metric for context retrieval**
Among the LLM-as-judge metrics, we chose the Retrieval Augmented Generation Assessment (Ragas) framework [Es et al., 2023] for evaluating the accuracy of context retrieval. Ragas is a framework for the evaluation of RAG systems. It introduces a suite of metrics for evaluating RAG systems without relying solely on ground-truth human annotations. The most notable feature of Ragas in evaluating the accuracy of context retrieval is that it does not require a “reference context answer”. Instead, the framework assesses the accuracy of the retrieved context solely based on the question and the reference answer. This approach is specialized for situations where no direct reference context is available. The most prominent evaluation metrics for context retrieval that Ragas offers include:

- Context Precision: Assesses the relevance of the retrieved documents to the question. It is especially important when considering the necessity of retrieving guideline documents.
- Context Recall: Evaluates the ability of the retrieval system to gather all the necessary information needed to answer the question. It involves using the ground-truth answer and an LLM to check whether each statement from the answer can also be found in the retrieved context.

Thus, we employed these two metrics to evaluate the quality of the retrieved context in our experiments. By leveraging these metrics, we aimed to provide a comprehensive and objective assessment of the QA-RAG model’s performance in retrieving relevant and complete information from the pharmaceutical guidelines.

**Evaluation metric for answer generation**
The final answers generated by the model, based on the retrieved context, can be evaluated by comparing their similarity to the reference answers. For this purpose, we utilized BERTScore [Zhang et al., 2019]. Given BERTScore’s well-known ability to capture nuanced semantic correlations, it was an ideal choice for comparing the semantic similarity and relevance of the model’s responses against the reference answers.
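As a rough sketch of how such a BERTScore comparison might be run with the open-source `bert-score` package (the candidate and reference strings below are hypothetical, not taken from the QA-RAG experiments):

```python
# Minimal BERTScore sketch using the bert-score package (pip install bert-score).
# The candidate/reference strings are hypothetical placeholders.
from bert_score import score

candidates = ["Stability testing must cover at least three primary batches."]
references = ["At least three primary batches should be placed on stability studies."]

# Returns precision, recall, and F1 tensors, one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```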
### 21 Evaluation
Our evaluation study encompasses a dual-pronged experimental approach. For the quantitative assessment of our framework’s performance in generating contextually relevant responses, we employ a comprehensive set of metrics, including Ragas scores [37] and TruLens scores [38]. Following the quantitative assessment of our model’s overall effectiveness, we proceeded with a Subject-Matter Expert (SME) evaluation study to validate the answer generation capabilities of MedInsight. Due to the resource-intensive nature of this evaluation, we engaged a panel of four medical residents. Their task involved scoring answers across 100 questions spanning all medical specialties, as explained in Section 4.1, considering two critical aspects: factual correctness and relevance to the patient’s unique context.

**Quantitative Evaluation**
To quantitatively evaluate the performance of MedInsight in generating contextually relevant, patient-centric responses, we utilize the RAGAS [37] and TruLens [38] frameworks. These frameworks offer comprehensive metrics specifically designed for assessing RAG pipelines. We chose them over popular alternatives like BLEU [39] and ROUGE [40], as those are not aligned with our specific context: BLEU is primarily used to evaluate machine translation tasks, while ROUGE is specialized for evaluating text summarization tasks. Both metrics focus on structural similarity between the ground-truth and generated sentences, which may not be suitable for our case, where sentences can be structurally different but factually similar. Traditional metrics fail to capture this nuance. Given that MedInsight predominantly functions as a RAG-based question-answering system with context mapping, these conventional metrics are not well suited to gauging its effectiveness.

**Qualitative Evaluation**: Human evaluation

### 22 Evaluation
**Retrieval Quality**
The primary way to test retrieval quality with structured data is with page-level and paragraph-level accuracy. Since, in the data, we have access to the entire document and the section a human analyst referred to, we compare that section to the chunks returned by the retrieval algorithm. If the reference context and the retrieved context are on the same page, it counts toward a high page-level retrieval accuracy; paragraph-level accuracy follows the same idea. For unstructured data, we can evaluate the retrieved chunk using Context Relevance, as defined by the RAGAS framework. For this metric, an LLM is asked to count the number of sentences from the context that are relevant to answering the question. The ratio of the number of extracted sentences to the total number of sentences in the context is defined as the **context relevance** score. This score penalizes redundant information and rewards chunks in which a majority of sentences provide useful information for answering the question.

**Answer Accuracy**: BLEU and ROUGE-L score

### 23 Result And Discussion - Testing and Validation
After the RAG pipelines were built, the synthetic testing data, along with the ground truth, was carefully prepared using GPT-4. Using the prepared evaluation dataset with the columns question, answer, context, and ground truth, the work evaluated the systems built using the Retrieval Augmented Generation Assessment (RAGAs) technique. Here, the work uses faithfulness, context precision, context recall, answer relevancy, answer similarity, and answer correctness as metrics to evaluate the built RAG systems. The evaluation’s findings are shown in Fig. 5.

![image](https://hackmd.io/_uploads/H1tfEW32A.png)

> Neither explains the metrics nor lists the formulas.

### 24 Evaluation Strategies
We consider a standard RAG setting, where, given a question $q$, the system first retrieves some context $c(q)$ and then uses the retrieved context to generate an answer $a_s(q)$. When building a RAG system, we usually do not have access to human-annotated datasets or reference answers. We therefore focus on metrics that are fully self-contained and reference-free. We focus in particular on three quality aspects, which we argue are of central importance:
- Faithfulness
- Answer Relevance
- Context Relevance

> Includes explanations, formulas, and the prompts.
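For quick reference, the three reference-free metrics can be written as follows (a reconstruction from the definitions quoted throughout this note, not formulas copied from the paper):

$$\text{Faithfulness} = \frac{|V|}{|S|},$$

where $|S|$ is the number of statements extracted from the answer $a_s(q)$ and $|V|$ is the number of those statements that are supported by the context $c(q)$.

$$\text{Answer Relevance} = \frac{1}{n}\sum_{i=1}^{n}\operatorname{sim}\big(q,\, q_i\big),$$

where $q_1, \dots, q_n$ are questions generated from the answer and $\operatorname{sim}$ denotes embedding cosine similarity.

$$\text{Context Relevance} = \frac{\text{number of extracted (relevant) sentences in } c(q)}{\text{total number of sentences in } c(q)}.$$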