# [Week 6] LLM Evaluation Techniques
[Course Catalog](https://areganti.notion.site/Applied-LLMs-Mastery-2024-562ddaa27791463e9a1286199325045c)
[Course Link](https://areganti.notion.site/Week-6-LLM-Evaluation-Techniques-c3302a6589e6473aa9bdd4e560a1ffbc)
## ETMI5: Explain to Me in 5
In this section, we dive deep into the evaluation techniques applied to LLMs, focusing on two dimensions: pipeline and model evaluation. We examine how prompts are assessed for their effectiveness, leveraging tools like Prompt Registry and Playground. Additionally, we explore the importance of evaluating the quality of retrieved documents in RAG pipelines, utilizing metrics such as Context Precision and Relevancy. We then discuss the relevance metrics used to gauge response pertinence, including Perplexity and Human Evaluation, along with specialized RAG-specific metrics like Faithfulness and Answer Relevance. Additionally, we emphasize the significance of alignment metrics in ensuring LLMs adhere to human standards, covering dimensions such as Truthfulness and Safety. Lastly, we highlight the role of task-specific benchmarks like GLUE and SQuAD in assessing LLM performance across diverse real-world applications.
## Evaluating Large Language Models (Dimensions)
Understanding whether LLMs meet our specific needs is crucial. We must establish clear metrics to gauge the value added by LLM applications. When we refer to "LLM evaluation" in this section, we encompass assessing the entire pipeline, including the LLM itself, all input sources, and the content processed by it. This includes the prompts used for the LLM and, in the case of RAG use-cases, the quality of retrieved documents. To evaluate systems effectively, we'll break down LLM evaluation into dimensions:
1. **Pipeline Evaluation**: Assessing the effectiveness of individual components within the LLM pipeline, including prompts and retrieved documents.
2. **Model Evaluation**: Evaluating the performance of the LLM model itself, focusing on the quality and relevance of its generated output.
Now we'll dig deeper into each of these two dimensions.
## LLM Pipeline Evaluation
In this section, we’ll look at 2 types of evaluation:
1. **Evaluating Prompts**: Given the significant impact prompts have on the output of LLM pipelines, we will delve into various methods for assessing and experimenting with prompts.
2. **Evaluating the Retrieval Pipeline**: Essential for LLM pipelines incorporating RAG, this involves assessing the quality of the top-k retrieved documents that are passed to the LLM as context.
### Evaluating Prompts
The effectiveness of prompts can be evaluated by experimenting with various prompts and observing the changes in LLM performance. This process is facilitated by prompt testing frameworks, which generally include:
- Prompt Registry: A space for users to list prompts they wish to evaluate on the LLM.
- Prompt Playground: A feature to experiment with different prompts, observe the responses generated, and log them. This function calls the LLM API to get responses.
- Evaluation: A section with a user-defined function for evaluating how various prompts perform.
- Analytics and Logging: Features providing additional information such as logging and resource usage, aiding in the selection of the most effective prompts.
Commonly used tools for prompt testing include Promptfoo, PromptLayer, and others.
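To make the pieces above concrete, here is a minimal, hypothetical sketch of such a framework in Python. It is not code from Promptfoo or PromptLayer; `call_llm` is a stand-in for whatever LLM API client you use, and the keyword-based scoring is only an illustrative evaluation function.

```python
# Hypothetical sketch of a prompt-testing framework; not Promptfoo/PromptLayer code.
prompt_registry = {  # Prompt Registry: the prompts we want to compare
    "v1": "Answer the question briefly: {question}",
    "v2": "You are a helpful assistant. Answer step by step: {question}",
}

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; replace with your provider's client."""
    return f"[stub response to: {prompt}]"

def evaluate_response(response: str, expected_keyword: str) -> float:
    """User-defined evaluation: 1.0 if the expected keyword appears, else 0.0."""
    return float(expected_keyword.lower() in response.lower())

def run_playground(question: str, expected_keyword: str) -> dict:
    """Prompt Playground: run every registered prompt, log the response and its score."""
    results = {}
    for name, template in prompt_registry.items():
        response = call_llm(template.format(question=question))
        results[name] = {"response": response,
                         "score": evaluate_response(response, expected_keyword)}
    return results

print(run_playground("What is the capital of France?", "Paris"))
```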
**Automatic Prompt Generation**
More recently, there have also been methods to optimize prompts in an automatic manner. For instance, [Zhou et al., (2022)](https://arxiv.org/abs/2211.01910) introduced Automatic Prompt Engineer (APE), a framework for automatically generating and selecting instructions. It treats prompt generation as a language synthesis problem and uses the LLM itself to generate and explore candidate solutions. First, an LLM generates prompt candidates based on output demonstrations. These candidates guide the search process. Then, the prompts are executed using a target model, and the best instruction is chosen based on evaluation scores.
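A rough sketch of the APE-style search loop follows. The `generate_candidates` and `score_instruction` callables are hypothetical stand-ins for the LLM prompting and scoring steps described in the paper.

```python
# Hypothetical sketch of the Automatic Prompt Engineer (APE) loop, not the authors' code.
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input, desired output) demonstration pair

def ape(demos: List[Demo],
        generate_candidates: Callable[[List[Demo]], List[str]],
        score_instruction: Callable[[str, List[Demo]], float],
        n_candidates: int = 20) -> str:
    """Return the instruction that best explains the demonstrations."""
    # Step 1: an LLM proposes candidate instructions from the demonstrations.
    candidates = generate_candidates(demos)[:n_candidates]
    # Step 2: each candidate is executed on a target model and scored,
    # e.g., by accuracy or log-likelihood of the demonstration outputs.
    scored = [(score_instruction(c, demos), c) for c in candidates]
    # Step 3: keep the highest-scoring instruction.
    return max(scored)[1]
```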

### Evaluating Retrieval Pipeline
In RAG use-cases, solely assessing the end outcome doesn't capture the complete picture. Essentially, the LLM responds to queries based on the context provided. It's crucial to evaluate intermediate results, including the quality of retrieved documents. If the term RAG is unfamiliar to you, please refer to the Week 4 content explaining how RAG operates. Throughout this discussion, we'll refer to the top-k retrieved documents as "context" for the LLM, which requires evaluation. Below are some typical metrics to evaluate the quality of RAG context.
The metrics mentioned below are sourced from [RAGas](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html), an open-source library for RAG pipeline evaluation.
1. **Context Precision (From RAGas [documentation](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)):**
Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally all the relevant chunks must appear at the top ranks. This metric is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
$$
\text{Context Precision@K} = {\sum_{k=1}^{K} \left(\text{Precision@k} \times v_k\right) \over \text{Total number of relevant items in the top K results}}
$$
$$
\text{Precision@k} = {\text{true positives@k} \over (\text{true positives@k} + \text{false positives@k})}
$$
Where K is the total number of chunks in the context and $v_k \in \{0, 1\}$ is the relevance indicator at rank k.
2. **Context Relevancy** (From RAGas [documentation](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)):
This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. Ideally, the retrieved context should exclusively contain essential information to address the provided query. To compute this, we initially estimate the value of |S| by identifying the sentences within the retrieved context that are relevant for answering the given question. The final score is determined by the following formula:
$$
\text{context relevancy} = {|S| \over |\text{Total number of sentences in retrieved context}|}
$$
```markdown
Hint
Question: What is the capital of France?
High context relevancy: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.
Low context relevancy: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.
```
3. **Context Recall** (From RAGas [documentation](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)):
Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance. To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved context.
The formula for calculating context recall is as follows:
$$
\text{context recall} = {|\text{GT sentences that can be attributed to context}| \over |\text{Number of sentences in GT}|}
$$
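Once the underlying relevance judgments are available (RAGas obtains them with an LLM judge), the three formulas above reduce to simple ratios. The sketch below assumes those judgments are already given as booleans and counts, which is purely for illustration:

```python
# Illustrative computation of the three context metrics above, assuming the
# relevance judgments (normally produced by an LLM judge) are already given.
from typing import List

def context_precision_at_k(chunk_is_relevant: List[bool]) -> float:
    """Sum of Precision@k at the ranks holding relevant chunks, divided by the
    number of relevant chunks in the top K (the v_k indicator in the formula)."""
    precisions, hits = [], 0
    for k, relevant in enumerate(chunk_is_relevant, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def context_relevancy(relevant_sentences: int, total_sentences: int) -> float:
    """|S| divided by the total number of sentences in the retrieved context."""
    return relevant_sentences / total_sentences

def context_recall(attributable_gt_sentences: int, total_gt_sentences: int) -> float:
    """Ground-truth sentences attributable to the context over all ground-truth sentences."""
    return attributable_gt_sentences / total_gt_sentences

print(context_precision_at_k([True, False, True, True]))  # ~0.81
```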
General retrieval metrics can also be used to evaluate the quality of retrieved documents or context. Note, however, that these metrics place much more weight on the ranks of the retrieved documents, which may not be crucial for RAG use-cases (a minimal sketch of some of them follows the list):
1. **Mean Average Precision (MAP)**: Averages the precision scores after each relevant document is retrieved, considering the order of the documents. It is particularly useful when the order of retrieval is important.
2. **Normalized Discounted Cumulative Gain (nDCG)**: Measures the gain of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
3. **Reciprocal Rank**: Focuses on the rank of the first relevant document, with higher scores for cases where the first relevant document is ranked higher.
4. **Mean Reciprocal Rank (MRR)**: Averages the reciprocal ranks of results for a sample of queries. It is particularly used when the interest is in the rank of the first correct answer.
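As a rough illustration (not tied to any particular library), reciprocal rank, MRR, and nDCG can be computed from binary relevance labels of a ranked result list as in the sketch below:

```python
# Illustrative rank-based retrieval metrics over binary relevance labels.
import math
from typing import List

def reciprocal_rank(relevant: List[bool]) -> float:
    """1 / rank of the first relevant document (0 if none is relevant)."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mrr(queries: List[List[bool]]) -> float:
    """Mean Reciprocal Rank over a sample of queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

def ndcg(relevant: List[bool]) -> float:
    """DCG with log2 discounting, normalized by the DCG of the ideal ordering."""
    dcg = sum(int(rel) / math.log2(i + 1) for i, rel in enumerate(relevant, start=1))
    ideal = sorted(relevant, reverse=True)
    idcg = sum(int(rel) / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

print(mrr([[False, True, False], [True, False, False]]))  # 0.75
print(ndcg([False, True, True, False]))                   # ~0.69
```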
## LLM Model Evaluation
Now that we've discussed evaluating LLM pipeline components, let's delve into the heart of the pipeline: the LLM model itself. Assessing LLM models isn't straightforward due to their broad applicability and versatility. Different use cases may require focusing on certain dimensions more than others. For instance, in applications where accuracy is paramount, evaluating whether the model avoids hallucinations (generating responses that are not factual) can be crucial. Conversely, in other scenarios where maintaining impartiality across different populations is essential, adherence to principles to avoid bias is paramount. LLM evaluation can be broadly categorized into these dimensions:
- **Relevance Metrics**: Assess the pertinence of the response to the user's query and context.
- **Alignment Metrics**: Evaluate how well the model aligns with human preferences in the given use-case, in aspects such as fairness, robustness, and privacy.
- **Task-Specific Metrics**: Gauge the performance of LLMs across different downstream tasks, such as multihop reasoning, mathematical reasoning, and more.
### Relevance Metrics
Some common response relevance metrics include:
1. Perplexity: Measures how well the LLM predicts a sample of text. Lower perplexity values indicate better performance. [Formula and mathematical explanation](https://huggingface.co/docs/transformers/en/perplexity)
2. Human Evaluation: Involves human evaluators assessing the quality of the model's output based on criteria such as relevance, fluency, coherence, and overall quality.
3. BLEU (Bilingual Evaluation Understudy): Compares the LLM-generated output with a reference answer to measure similarity. Higher BLEU scores signify better performance. [Formula](https://www.youtube.com/watch?v=M05L1DhFqcw)
4. Diversity: Measures the variety and uniqueness of generated LLM responses, including metrics like n-gram diversity or semantic similarity. Higher diversity scores indicate more diverse and unique outputs.
5. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Evaluates the quality of LLM-generated text by comparing it with reference text, assessing how well the generated text captures the key information in the reference. ROUGE calculates precision, recall, and F1-score, providing insight into the similarity between the generated and reference texts. [Formula](https://www.youtube.com/watch?v=TMshhnrEXlg)
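As a hedged sketch of how some of these metrics are computed in practice, the snippet below uses the Hugging Face `evaluate` and `transformers` libraries; the GPT-2 checkpoint and the example texts are illustrative choices, not part of the course material.

```python
# Illustrative BLEU, ROUGE, and perplexity computation for a generated answer.
import math

import evaluate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prediction = "Paris is the capital of France."
reference = "The capital of France is Paris."

# BLEU and ROUGE compare the generated output against a reference answer.
bleu = evaluate.load("bleu").compute(predictions=[prediction], references=[[reference]])
rouge = evaluate.load("rouge").compute(predictions=[prediction], references=[reference])
print(bleu["bleu"], rouge["rougeL"])

# Perplexity: exponentiated mean negative log-likelihood of the text under a language model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer(prediction, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss  # mean NLL per token
print(math.exp(loss.item()))
```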
**RAG specific relevance metrics**
Apart from the above mentioned generic relevance metrics, RAG pipelines use additional metrics to judge if the answer is relevant to the context provided and to the query posed. Some metrics as defined by [RAGas](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html) are:
1. **Faithfulness** (From RAGas [documentation](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)):
This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0, 1) range, where higher is better.
The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims from the generated answer is first identified. Then each of these claims is cross-checked against the given context to determine whether it can be inferred from it. The faithfulness score is given by:
$$
{|\text{Number of claims in the generated answer that can be inferred from given context}| \over |\text{Total number of claims in the generated answer}|}
$$
```markdown
Hint
Question: Where and when was Einstein born?
Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time
High faithfulness answer: Einstein was born in Germany on 14th March 1879.
Low faithfulness answer: Einstein was born in Germany on 20th March 1879.
```
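Conceptually, the score reduces to the ratio above once an LLM judge has extracted and verified the claims. In the sketch below, `extract_claims` and `claim_is_supported` are hypothetical stand-ins for those LLM-judge calls:

```python
# Illustrative faithfulness scoring; the two helpers stand in for LLM-judge calls.
from typing import Callable, List

def faithfulness(answer: str,
                 context: str,
                 extract_claims: Callable[[str], List[str]],
                 claim_is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of the answer's claims that can be inferred from the context."""
    claims = extract_claims(answer)  # step 1: list the claims made in the answer
    if not claims:
        return 0.0
    supported = sum(claim_is_supported(c, context) for c in claims)  # step 2: verify each claim
    return supported / len(claims)
```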
2. **Answer Relevance** (From RAGas [documentation](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)):
The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer with values ranging between 0 and 1, where higher scores indicate better relevancy.
An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
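A minimal sketch of that idea, assuming the `sentence-transformers` library for embeddings; the generated questions below are hypothetical LLM outputs:

```python
# Illustrative answer-relevance computation: mean cosine similarity between the
# original question and questions an LLM generated back from the answer.
from sentence_transformers import SentenceTransformer, util

original_question = "Where and when was Einstein born?"
generated_questions = [  # hypothetical questions produced by an LLM from the answer
    "When was Albert Einstein born?",
    "Where was Einstein born?",
    "What are Einstein's date and place of birth?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
q_emb = model.encode(original_question, convert_to_tensor=True)
g_emb = model.encode(generated_questions, convert_to_tensor=True)
answer_relevancy = util.cos_sim(q_emb, g_emb).mean().item()  # higher = more relevant
print(answer_relevancy)
```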
3. **Answer Semantic Similarity** (From RAGas [documentation](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)):
The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth answer and the generated LLM answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
Measuring the semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
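For example, a minimal sketch using a cross-encoder from `sentence-transformers` (the specific checkpoint and example texts are illustrative assumptions):

```python
# Illustrative answer-vs-ground-truth semantic similarity with a cross-encoder.
from sentence_transformers import CrossEncoder

ground_truth = "Einstein was born in Germany on 14 March 1879."
generated_answer = "Albert Einstein was born on 14 March 1879 in Germany."

# STS cross-encoders output a similarity score roughly in the 0-1 range.
model = CrossEncoder("cross-encoder/stsb-roberta-base")
score = model.predict([(ground_truth, generated_answer)])[0]
print(score)
```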
### Alignment Metrics
Metrics of this type are crucial, especially when LLMs are utilized in applications that interact directly with people, to ensure they conform to acceptable human standards. The challenge with these metrics is their difficulty to quantify mathematically. Instead, the assessment of LLM alignment involves conducting specific tests on benchmarks designed to evaluate alignment, using the results as an indirect measure. For instance, to evaluate a model's fairness, datasets are employed where the model must recognize stereotypes, and its performance in this regard serves as an indirect indicator of the LLM's fairness alignment. Thus, there's no universally correct method for this evaluation. In our course, we will adopt the approaches outlined in the influential study “[TRUSTLLM: Trustworthiness in Large Language Models](https://arxiv.org/pdf/2401.05561.pdf)” to explore alignment dimensions and the proxy tasks that help gauge LLM alignment.
There is no single definition of alignment, but here are some dimensions along which it can be quantified; we use the definitions from the paper mentioned above:
1. **Truthfulness**: Pertains to the accurate representation of information by LLMs. It encompasses evaluations of their tendency to generate misinformation, hallucinate, exhibit sycophantic behavior, and correct adversarial facts.
2. **Safety**: Entails the ability of LLMs to avoid unsafe or illegal outputs and to promote healthy conversations.
3. **Fairness**: Entails preventing biased or discriminatory outcomes from LLMs, including the assessment of stereotypes, disparagement, and preference biases.
4. **Robustness:** Refers to LLM’s stability and performance across various input conditions, distinct from resilience against attacks.
5. **Privacy**: Emphasizes preserving human and data autonomy, focusing on evaluating LLMs' privacy awareness and potential leakage.
6. **Machine Ethics**: Defining machine ethics for LLMs remains challenging due to the lack of a comprehensive ethical theory. Instead, we can divide it into three segments: implicit ethics, explicit ethics, and emotional awareness.
7. **Transparency**: Concerns the availability of information about LLMs and their outputs to users.
8. **Accountability**: The LLM's ability to autonomously provide explanations and justifications for its behavior.
9. **Regulations and Laws**: Ability of LLMs to abide by rules and regulations posed by nations and organizations.
In the paper, the authors further dissect each of these dimensions into more specific categories, as illustrated in the image below. For instance, Truthfulness is segmented into aspects such as misinformation, hallucination, sycophancy, and adversarial factuality. Moreover, each of these sub-dimensions is accompanied by corresponding datasets and metrics designed to quantify them.
💡This serves as a basic illustration of utilizing proxy tasks, datasets, and metrics to evaluate an LLM's performance within a specific dimension. The choice of which dimensions are relevant will vary based on your specific task, requiring you to select the most applicable ones for your needs.

### Task-Specific Metrics
Often, it's necessary to create tailored benchmarks, including datasets and metrics, to evaluate an LLM's performance in a specific task. For example, if developing a chatbot requiring strong reasoning abilities, utilizing common-sense reasoning benchmarks can be beneficial. Similarly, for multilingual understanding, machine translation benchmarks are valuable.
Below, we outline some popular examples.
1. **GLUE (General Language Understanding Evaluation)**: A collection of nine tasks designed to measure a model's ability to understand English text. Tasks include sentiment analysis, question answering, and textual entailment.
2. **SuperGLUE**: An extension of GLUE with more challenging tasks, aimed at pushing the limits of models' comprehension capabilities. It includes tasks like word sense disambiguation, more complex question answering, and reasoning.
3. **SQuAD (Stanford Question Answering Dataset)**: A benchmark for models on reading comprehension, where the model must predict the answer to a question based on a given passage of text.
4. **Commonsense Reasoning Benchmarks**:
- **Winograd Schema Challenge**: Tests models on commonsense reasoning and understanding by asking them to resolve pronoun references in sentences.
- **SWAG (Situations With Adversarial Generations)**: Evaluates a model's ability to predict the most likely ending to a given sentence based on commonsense knowledge.
5. **Natural Language Inference (NLI) Benchmarks**:
- **MultiNLI**: Tests a model's ability to predict whether a given hypothesis is true (entailment), false (contradiction), or undetermined (neutral) based on a given premise.
- **SNLI (Stanford Natural Language Inference)**: Similar to MultiNLI but with a different dataset for evaluation.
6. **Machine Translation Benchmarks**:
- **WMT (Workshop on Machine Translation)**: Annual competition with datasets for evaluating translation quality across various language pairs.
7. **Task-Oriented Dialogue Benchmarks**:
- **MultiWOZ**: A dataset for evaluating dialogue systems in task-oriented conversations, like booking a hotel or finding a restaurant.
8. **Code Generation and Understanding Benchmarks**:
- **MBPP**: The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers.
9. **Chart Understanding Benchmarks**:
- **ChartQA**: Contains machine-generated questions based on chart summaries, focusing on complex reasoning tasks that existing datasets often overlook due to their reliance on template-based questions and fixed vocabularies.
The [Hugging Face OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) features an array of datasets and tasks used to assess foundational models and chatbots.
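As an example of working with one of these benchmarks, the sketch below loads a slice of SQuAD with the Hugging Face `datasets` library and scores a placeholder prediction with the `evaluate` SQuAD metric; in practice the prediction would come from the LLM under evaluation.

```python
# Illustrative use of a task-specific benchmark (SQuAD) via Hugging Face libraries.
import evaluate
from datasets import load_dataset

squad = load_dataset("squad", split="validation[:10]")  # small slice for illustration
metric = evaluate.load("squad")

example = squad[0]
# Here we reuse the gold answer as the "prediction" just to show the metric format.
prediction = {"id": example["id"], "prediction_text": example["answers"]["text"][0]}
reference = {"id": example["id"], "answers": example["answers"]}
print(metric.compute(predictions=[prediction], references=[reference]))  # exact match / F1
```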

## Read/Watch These Resources (Optional)
1. LLM Evaluation by Klu.ai: https://klu.ai/glossary/llm-evaluation
2. Microsoft LLM Evaluation Leaderboard: https://llm-eval.github.io/
3. Evaluating and Debugging Generative AI Models Using Weights and Biases course: https://www.deeplearning.ai/short-courses/evaluating-debugging-generative-ai/
## Read These Papers (Optional)
1. https://arxiv.org/abs/2310.19736
2. https://arxiv.org/abs/2401.05561