# [Week 3] Fine Tuning LLMs
[課程目錄](https://areganti.notion.site/Applied-LLMs-Mastery-2024-562ddaa27791463e9a1286199325045c)
[課程連結](https://areganti.notion.site/Week-3-Fine-Tuning-LLMs-14ca00d3071f4e528a762f41547868ef)
## ETMI5: Explain to Me in 5
:::info
In this section, we will go over fine-tuning as a domain adaptation method for LLMs. Fine-tuning involves further training pre-trained models for specific tasks or domains, adapting them to new data distributions, and enhancing efficiency by leveraging pre-existing knowledge. It is crucial for tasks where generic models may not excel. The two main types of fine-tuning are unsupervised (updating models without modifying behavior) and supervised (updating models with labeled data). We emphasize the popular supervised method, instruction fine-tuning, which augments input-output examples with explicit instructions for better generalization. We’ll dig deeper into Reinforcement Learning from Human Feedback (RLHF), which incorporates human feedback for model fine-tuning, and Direct Preference Optimization (DPO), which directly optimizes models based on user preferences. We also provide an overview of Parameter-Efficient Fine-Tuning (PEFT) approaches, where selective updates are made to model parameters, addressing computational challenges, improving memory efficiency, and allowing versatility across modalities.
:::
:::success
在這一章節中,我們會討論用於LLMs的微調(Fine-Tuning)這種領域適應(domain adaptation)方法。微調涉及針對特定任務或領域對預訓練模型做進一步訓練,使其適應新的資料分佈,並透過利用既有的知識來提高效率。這對通用模型可能不擅長的任務來說非常重要。微調的兩種主要類型包括非監督式(unsupervised,在不修改行為的情況下更新模型)和監督式(supervised,使用標記資料更新模型)。我們會專注在受歡迎的監督式方法——指令微調,它透過顯式的指令來增強輸入輸出範例,以實現更好的泛化。我們還會深入研究人類反饋強化學習(RLHF,Reinforcement Learning from Human Feedback),它將人類反饋納入模型微調,以及Direct Preference Optimization(DPO,直接偏好最佳化),它根據使用者偏好直接最佳化模型。我們也會概述Parameter-Efficient Fine-Tuning(PEFT)方法,這類方法只對模型參數進行選擇性更新,從而解決計算上的挑戰、提升記憶體效率,並具備跨模態的多功能性。
:::
## Introduction
:::info
Fine-tuning is the process of taking pre-trained models and further training them on smaller, domain-specific datasets. The aim is to refine their capabilities and enhance performance in a specific task or domain. This process transforms general-purpose models into specialized ones, bridging the gap between generic pre-trained models and the unique requirements of particular applications.
:::
:::success
微調是採用預訓練的模型並在較小的特定領域資料集上進一步訓練它們的過程。目的是提高他們的能力並增強在特定任務或領域的效能。此過程將通用模型轉換為專用模型,以彌合通用預訓練模型與特定應用程式的獨特要求之間的差距。
:::
:::info
Consider OpenAI's GPT-3, a state-of-the-art LLM designed for a broad range of NLP tasks. To illustrate the need for fine-tuning, imagine a healthcare organization wanting to use GPT-3 to assist doctors in generating patient reports from textual notes. While GPT-3 is proficient in general text understanding, it may not be optimized for intricate medical terms and specific healthcare jargon.
:::
:::success
以OpenAI的GPT-3為例,這是一個為廣泛NLP任務所設計的最先進LLM。為了說明微調的必要性,我們想像一下一家醫療機構想要使用GPT-3來協助醫生根據文字筆記產生病患報告。雖然GPT-3精通一般文本的理解,但它可能並未針對複雜的醫學術語和特定的健康照護行業術語進行最佳化。
:::
:::info
In this scenario, the organization engages in fine-tuning GPT-3 on a dataset filled with medical reports and patient notes. The model becomes more familiar with medical terminologies, nuances of clinical language, and typical report structures. As a result, after fine-tuning, GPT-3 is better suited to assist doctors in generating accurate and coherent patient reports, showcasing its adaptability for specific tasks.
:::
:::success
在這種情況下,這家機構在充滿著醫療報告和病患筆記的資料集上對GPT-3進行微調。這模型對醫學術語、臨床語言的細微差別和典型的報告結構都變得更加熟門熟路。因此,經過微調之後,GPT-3更適合協助醫師生成準確且連貫的病患報告,展現其對特定任務的適應性。
:::
:::info
Fine-tuning is not exclusive to language models; any machine learning model may require retraining under certain circumstances. It involves adjusting model parameters to align with the distribution of new, specific datasets. This process is illustrated with the example of a convolutional neural network trained to identify images of automobiles and the challenges it faces when applied to detecting trucks on highways.
:::
:::success
微調並不是語言模型所特有的;任何一個機器學習模型在某些情況下都可能需要重新訓練。它涉及調整模型參數,使其與新的特定資料集的分佈一致。我們用一個訓練來辨識汽車影像的卷積神經網路做為範例來說明這個過程,以及這個神經網路在偵測高速公路上的卡車時所面臨的挑戰。
:::
:::info
The key principle behind fine-tuning is to leverage pre-trained models and recalibrate their parameters using novel data, adapting them to new contexts or applications. It is particularly beneficial when the distribution of training data significantly differs from the requirements of a specific application. The choice of the base general model depends on the nature of the task, such as text generation or text classification.
:::
:::success
微調背後的關鍵原則是利用預訓練的模型並使用新的資料重新校準其參數,使其適應新的上下文或應用。當訓練資料的分佈與特定應用的需求顯著不同的時候,這麼做特別有好處。基礎通用模型的選擇取決於任務的特性,像是文本生成或文本分類。
:::
## Why Fine-Tuning?
:::info
While large language models are indeed trained on a diverse set of tasks, the need for fine-tuning arises because these large generic models are designed to perform reasonably well across various applications, but not necessarily excel in a specific task. The optimization of generic models is aimed at achieving decent performance across a range of tasks, making them versatile but not specialized.
:::
:::success
雖然大型語言模型確實是在多種任務上進行訓練的,不過這些大型通用模型的設計目標是在各種應用上都有合理的表現,而不一定能在特定任務上表現出色,因此微調的需求便油然而生。通用模型的最佳化目標是在一系列任務中達到不錯的效能,使其具有通用性但不具專業性。
:::
:::info
Fine-tuning becomes essential to ensure that a model attains exceptional proficiency in a particular task or domain of interest. The emphasis shifts from achieving general competence to achieving mastery in a specific application. This is particularly crucial when the model is intended for a focused use case, and overall general performance is not the primary concern.
:::
:::success
為了確保模型在特定任務或感興趣的領域中駕輕就熟,微調這件事就變得特別重要。重點從具備一般能力轉向精通特定應用。當模型預定用於特定的使用情境、且整體通用效能並非主要考量時,這一點尤其重要。
:::
:::info
In essence, generic large language models can be considered as being proficient in multiple tasks but not reaching the level of mastery in any. Fine-tuned models, on the other hand, undergo a tailored optimization process to become masters of a specific task or domain. Therefore, the decision to fine-tune models is driven by the necessity to achieve superior performance in targeted applications, making them highly effective specialists in their designated areas.
:::
:::success
本質上,通用大語言模型可以被視為是精通多個任務,樣樣懂卻是樣樣不精通。另一方面,微調模型經過量身訂製的最佳化過程,華麗變身成為特定任務或領域的大師。因此,微調模型的決定是出於在目標應用中實現卓越效能的必要性,使它們成為指定領域中的高效專家。
:::
:::info
For a deeper understanding, let's explore why fine-tuning models for tasks in new domains is deemed crucial. Several compelling reasons stand out:
1. **Domain-Specific Adaptation:** Pre-trained LLMs may not be optimized for specific tasks or domains. Fine-tuning allows adaptation to the nuances and characteristics of a new domain, enhancing performance in domain-specific tasks. For instance, large generic LLMs might not be sufficiently trained on tasks like document analysis in the legal domain. Fine-tuning can allow the model to understand legal terminology and nuances for tasks like contract review.
2. **Shifts in Data Distribution:** Models trained on one dataset may not generalize well to out-of-distribution examples. Fine-tuning helps align the model with the distribution of new data, addressing shifts in data characteristics and improving performance on specific tasks. For example: Fine-tuning a sentiment analysis model for social media comments. The distribution of language and sentiments on social media may differ significantly from the original training data, requiring adaptation for accurate sentiment classification.
3. **Cost and Resource Efficiency:** Training a model from scratch on a new task often requires a large labeled dataset, which can be costly and time-consuming. Fine-tuning allows leveraging a pre-trained model's knowledge and adapting it to the new task with a smaller dataset, making the process more efficient. For example: Adapting a pre-trained model for a small e-commerce platform to recommend products based on user preferences. Fine-tuning is more resource-efficient than training a model from scratch with a limited dataset.
4. **Out-of-Distribution Data Handling:** Fine-tuning mitigates the suboptimal performance of pre-trained models when dealing with out-of-distribution examples. Instead of starting training anew, fine-tuning allows building upon the existing model's foundation with a relatively modest dataset. For example: Fine-tuning a speech recognition model for a new regional accent. The model can be adapted to recognize speech patterns specific to the new accent without extensive retraining.
5. **Knowledge Transfer:** Pre-trained models capture general patterns and knowledge from vast amounts of data during pre-training. Fine-tuning facilitates the transfer of this general knowledge to specific tasks, making it a valuable tool for leveraging pre-existing knowledge in new applications. For example: Transferring medical knowledge from a pre-trained model to a new healthcare chatbot. Fine-tuning with medical literature enables the model to provide accurate and contextually relevant responses in healthcare conversations.
6. **Task-Specific Optimization:** Fine-tuning enables the optimization of model parameters for task-specific objectives. For example, in the medical domain, fine-tuning an LLM with medical literature can enhance its performance in medical applications. Another example: Optimizing a pre-trained model for code generation in a software development environment. Fine-tuning with code examples allows the model to better understand and generate code snippets.
7. **Adaptation to User Preferences:** Fine-tuning allows adapting the model to user preferences and specific task requirements. It enables the model to generate more contextually relevant and task-specific responses. For example: Fine-tuning a virtual assistant model to align with user preferences in language and tone. This ensures that the assistant generates responses that match the user's communication style.
8. **Continual Learning:** Fine-tuning supports continual learning by allowing models to adapt to evolving data and user requirements over time. It enables models to stay relevant and effective in dynamic environments. For instance: Continually updating a news summarization model to adapt to evolving news topics and user preferences. Fine-tuning enables the model to stay relevant and provide timely summaries.
:::
:::success
為了更深入地瞭解,讓我們探討為什麼對新領域中的任務微調模型被認為至關重要,以下是幾個令人信服的原因:
1. **特定領域的適應:** 預訓練的LLMs也許不會針對特定任務或領域做最佳化。微調可以適應新領域的細微差別和特徵,從而增強特定領域任務的效能。舉例來說,大型通用LLMs可能沒有在法律領域中的文件分析等任務上充分訓練過。微調可以讓模型理解像是合約審查等任務的法律術語和細微差別。
2. **資料分佈的變化:** 在一個資料集上訓練的模型可能無法很好地泛化到分佈外的樣本。微調有助於使模型與新資料的分佈保持一致,解決資料特徵的變化並提高特定任務的效能。舉例來說:微調社群媒體評論的情緒分析模型。社群媒體上語言和情緒的分佈可能與原始訓練資料有很大不同,需要進行調整以實現準確的情感分類。
3. **成本和資源效率:** 在新任務上從頭開始訓練模型通常需要大量標記資料集,這過程可能既昂貴又耗時。微調允許利用預訓練模型的知識,然後用較小的資料集讓它適應新的任務,從而使流程更加有效率。舉例來說:為小型電子商務平台採用預訓練模型以根據使用者偏好推薦產品。微調比使用有限資料集從頭開始訓練模型更節省資源。
4. **Out-of-Distribution資料處理:** 微調減輕了預訓練模型在處理Out-of-Distribution樣本時效能不佳的問題。與其從頭開始訓練,微調允許利用現有模型的基礎,使用相對較小的資料集。舉例來說:微調一個用於新區域口音的語音識別模型。該模型可以適應識別特定於新口音的語音模式,而不需要進行大量的再訓練。
5. **知識移轉:** 預訓練模型在預訓練期間從大量資料中捕捉通用的模式和知識。微調有助於將通用知識轉移到特定任務上,使其成為在新應用程式中利用預先存在的知識的有價值的工具。舉例來說:將醫學知識從預訓練的模型轉移到新的醫療保健聊天機器人。根據醫學文獻進行微調,使模型能夠在醫療保健對話中提供準確且與上下文相關的回應。
6. **特定於任務的最佳化:** 微調可以針對特定任務目標最佳化模型參數。舉例來說,在醫學領域中,利用醫學文獻對LLM進行微調可以提高其在醫學應用中的效能。另一個例子:最佳化一個在軟體開發環境中生成程式碼的預訓練模型。使用程式碼範例進行微調可以讓模型更好地理解和生成程式碼片段。
7. **適應使用者偏好:** 微調能夠讓模型適應使用者偏好和特定任務要求。它讓模型能夠生成更多上下文相關且針對特定任務的回應。舉例來說:微調一個虛擬助理模型以符合使用者的語言和語氣偏好。這確保助理生成與使用者的溝通風格相符的回應。
8. **持續學習:** 微調透過允許模型採用不斷變化的資料和使用者需求來支援持續學習。它使模型能夠在動態環境中保持相關性和有效性。舉例來說:持續更新新聞摘要模型以適應不斷變化的新聞主題和使用者偏好。微調使模型能夠保持相關性並提供及時的摘要。
:::
:::info
In summary, fine-tuning is a powerful technique that enables organizations to adapt pre-trained models to specific tasks, domains, and user requirements, providing a practical and efficient solution for deploying models in real-world applications.
:::
:::success
總的來說,微調是一種強大的技術,讓組織能夠將預訓練模型適應於特定任務、領域與使用者需求,為在實際應用程式中部署模型提供實用且高效的解決方案。
:::
## Types of Fine-Tuning
:::info
At a high level, fine-tuning methods for language models can be categorized into two main approaches: supervised and unsupervised. In machine learning, supervised methods involve having labeled data, where the model is trained on examples with corresponding desired outputs. On the other hand, unsupervised methods operate with unlabeled data, focusing on extracting patterns and structures without explicit labels.
:::
:::success
宏觀來看,語言模型的微調方法可以分為兩大類:監督式和非監督式。在機器學習中,監督式方法涉及標記資料,模型在具有相對應期望輸出的樣本上進行訓練。另一方面,非監督式方法則處理未標記的資料,在沒有明確標籤的情況下專注於提取資料中的模式和結構。
:::
### **Unsupervised Fine-Tuning Methods:**
:::info
1. **Unsupervised Full Fine-Tuning:** Unsupervised fine-tuning becomes relevant when there is a need to update the knowledge base of an LLM without modifying its existing behavior. For instance, if the goal is to fine-tune the model on legal literature or adapt it to a new language, an unstructured dataset containing legal documents or texts in the desired language can be utilized. In such cases, the unstructured dataset comprises articles, legal papers, or relevant content from authoritative sources in the legal domain. This approach allows the model to effectively refine its understanding and adapt to the nuances of legal language without relying on labeled examples, showcasing the versatility of unsupervised fine-tuning across various domains.
2. **Contrastive Learning:** Contrastive learning is a method employed in fine-tuning language models, emphasizing the training of the model to discern between similar and dissimilar examples in the latent space. The objective is to optimize the model's ability to distinguish subtle nuances and patterns within the data. This is achieved by encouraging the model to bring similar examples closer together in the latent space while pushing dissimilar examples apart. The resulting learned representations enable the model to capture intricate relationships and differences in the input data. Contrastive learning is particularly beneficial in tasks where a nuanced understanding of similarities and distinctions is crucial, making it a valuable technique for refining language models for specific applications that require fine-grained discrimination.
:::
:::success
1. **Unsupervised Full Fine-Tuning:** 當需要在不改變LLM現有行為的情況下更新其知識庫時,非監督式微調就派上用場了。舉例來說,如果目標是用法律文獻來微調模型,或使其適應一種新的語言,那就可以利用包含法律文件、或以目標語言撰寫之文本的非結構化資料集。在這種情況下,非結構化資料集包括文章、法律論文或來自法律領域權威來源的相關內容。這種方法使模型能夠在不依賴標記樣本的情況下有效地完善其理解並適應法律語言的細微差別,展示了非監督式微調在各個領域的多功能性。
2. **Contrastive Learning:** Contrastive learning(對比學習)是一種用於微調語言模型的方法,強調訓練模型去區分潛在空間中相似與不相似的樣本。目標是最佳化模型分辨資料中細微差別和模式的能力。做法是鼓勵模型在潛在空間中把相似的樣本拉近,同時把不相似的樣本推遠。由此學到的表示(representations)讓模型能夠捕捉輸入資料中的複雜關係與差異。對比學習在需要細緻理解相似性與差異性的任務中特別有益,對於需要細粒度區分的特定應用來說,是完善語言模型的一項有價值的技術(見下方的程式碼草稿)。
:::
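To make the contrastive idea concrete, below is a minimal PyTorch sketch of an InfoNCE-style contrastive loss, in which each example in a batch is pulled toward its paired "similar" example and pushed away from every other example in the batch. The tensor names and dimensions are illustrative assumptions, not part of any specific library API.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor,
                  positive_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss: pull each anchor toward its positive,
    treat every other item in the batch as a negative and push it away."""
    anchor = F.normalize(anchor_emb, dim=-1)      # (batch, dim) unit vectors
    positive = F.normalize(positive_emb, dim=-1)  # (batch, dim) unit vectors
    logits = anchor @ positive.T / temperature    # pairwise cosine similarities
    targets = torch.arange(anchor.size(0))        # the i-th positive belongs to the i-th anchor
    return F.cross_entropy(logits, targets)

# Toy usage: 4 pairs of "similar" sentence embeddings with dimension 128.
anchors = torch.randn(4, 128)
positives = anchors + 0.01 * torch.randn(4, 128)  # near-copies play the role of similar examples
print(info_nce_loss(anchors, positives))
```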
### **Supervised Fine-Tuning Methods:**
:::info
1. **Parameter-Efficient Fine-Tuning:** It is a fine-tuning strategy that aims to reduce the computational expenses associated with updating the parameters of a language model. Instead of updating all parameters during fine-tuning, PEFT focuses on selectively updating a small set of parameters, often referred to as a low-dimensional matrix. One prominent example of PEFT is the low-rank adaptation (LoRA) technique. LoRA operates on the premise that fine-tuning a foundational model for downstream tasks only requires updates across certain parameters. The low-rank matrix effectively represents the relevant space related to the target task, and training this matrix is performed instead of adjusting the entire model's parameters. PEFT, and specifically techniques like LoRA, can significantly decrease the costs associated with fine-tuning, making it a more efficient process.
2. **Supervised Full Fine-Tuning**: It involves updating all parameters of the language model during the training process. Unlike PEFT, where only a subset of parameters is modified, full fine-tuning requires sufficient memory and computational resources to store and process all components being updated. This comprehensive approach results in a new version of the model with updated weights across all layers. While full fine-tuning is more resource-intensive, it ensures that the entire model is adapted to the specific task or domain, making it suitable for situations where a thorough adjustment of the language model is desired.
3. **Instruction Fine-Tuning:** Instruction Fine-Tuning involves the process of training a language model using examples that explicitly demonstrate how it should respond to specific queries or tasks. This method aims to enhance the model's performance on targeted tasks by providing explicit instructions within the training data. For instance, if the task involves summarization or translation, the dataset is curated to include examples with clear instructions like "summarize this text" or "translate this phrase." Instruction fine-tuning ensures that the model becomes adept at understanding and executing specific instructions, making it suitable for applications where precise task execution is essential.
4. **Reinforcement Learning from Human Feedback (RLHF):** RLHF takes the concept of supervised fine-tuning a step further by incorporating reinforcement learning principles. In RLHF, human evaluators are enlisted to rate the model's outputs based on specific prompts. These ratings serve as a form of reward, guiding the model to optimize its parameters to maximize positive feedback. RLHF is a resource-intensive process that leverages human preferences to refine the model's behavior. Human feedback contributes to training a reward model that guides the subsequent reinforcement learning phase, resulting in improved model performance aligned with human preferences.
:::
:::success
1. **參數高效微調(PEFT):** 這是一種旨在減少更新語言模型參數所需計算開銷的微調策略。PEFT不會在微調過程中更新所有參數,而是專注於選擇性地更新一小組參數,通常表示為一個低維矩陣。PEFT的一個著名案例就是low-rank adaptation (LoRA)。LoRA的運作前提是,針對下游任務微調基礎模型時只需要更新部分參數。低秩矩陣有效地表示了與目標任務相關的空間,訓練的對象是這個矩陣,而不是調整整個模型的參數。PEFT,特別是LoRA等技術,可以明顯地降低與微調相關的成本,使其成為更有效率的過程。
2. **監督式full fine-tuning**:它涉及在訓練過程中更新語言模型的所有參數。不同於PEFT只調整一小部分參數,full fine-tuning需要足夠的記憶體和計算資源來儲存和處理所有正在更新的元件。這種全面性的方法會產生一個所有層權重都已更新的新模型版本(見下方的程式碼草稿)。雖然full fine-tuning更加耗費資源,但它可以確保整個模型適應特定的任務或領域,使其適合需要對語言模型進行徹底調整的情況。
3. **指令微調:** 指令微調(Instruction Fine-Tuning)涉及使用樣本來訓練語言模型的過程,這些樣本明確地說明了語言模型應如何回應特定查詢或任務。該方法主要是透過在訓練資料中提供明確的指令來增強模型在目標任務上的效能。舉例來說,如果任務涉及摘要或翻譯,則資料集會經過精心設計,讓資料集包含具有明確指令的樣本,像是"總結這個文本"或"翻譯這個片語"。指令微調可確保模型能夠理解並執行特定指令,使其適合需要精確任務執行的應用程式。
4. **來自人類反饋的強化學習 (RLHF):** RLHF透過結合強化學習的原理,讓監督式微調的概念更進一步。在RLHF中,人類評估員被聘來根據特定提示對模型的輸出進行評分。這些評分作為獎勵的一種形式,指導模型最佳化其參數以最大化正反饋。RLHF是一種資源密集型的過程,它利用人類偏好來完善模型的行為。人類反饋有助於訓練獎勵模型,指導隨後的強化學習階段,進而提升模型效能,符合人類偏好。
:::
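As a concrete reference for the supervised full fine-tuning case, here is a condensed sketch using the Hugging Face `transformers` Trainer, where every weight of the base model remains trainable. The base checkpoint (`gpt2`) and the data file `domain_corpus.txt` are placeholders chosen for illustration, not part of the course material.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)  # all parameters stay trainable

# Hypothetical domain corpus, one training example per line.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="full-ft-out",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # unlike PEFT, this updates every layer of the model
```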
:::info
Techniques such as contrastive learning, as well as supervised and unsupervised fine-tuning, are not exclusive to LLMs and have been employed for domain adaptation even before the advent of LLMs. However, following the rise of LLMs, there has been a notable increase in the prominence of techniques such as RLHF, instruction fine-tuning, and PEFT. In the upcoming sections, we will explore these methodologies in greater detail to comprehend their applications and significance.
:::
:::success
對比學習以及監督和無監督微調等技術並不是LLMs所獨有的,甚至在LLMs出現之前就已被用於領域適應。然而,隨著LLMs的興起,RLHF、指令微調和PEFT等技術的重要性明顯增加。在接下來的部分中,我們要來深入探討這些方法,以理解它們的應用和意義。
:::
## Instruction Fine-Tuning
:::info
Instruction fine-tuning is a method that has gained prominence in making LLMs more practical for real-world applications. In contrast to standard supervised fine-tuning, where models are trained on input examples and corresponding outputs, instruction tuning involves augmenting input-output examples with explicit instructions. This unique approach enables instruction-tuned models to generalize more effectively to new tasks. The data for instruction tuning is constructed differently, with instructions providing additional context for the model.
:::
:::success
指令微調是一種讓LLMs更貼近實際應用、近來備受重視的方法。標準的監督式微調是根據輸入樣本和相對應的輸出來訓練模型;相較之下,指令調整則是用顯式指令來增強輸入輸出範例。這種獨特的方法使得經過指令調整的模型能夠更有效地泛化到新任務。用於指令調整的資料的建構方式也不同,其中的指令為模型提供了額外的上下文。
:::

Image Source: Wei et al., 2022
:::info
One notable dataset for instruction tuning is "Natural Instructions". This dataset consists of 193,000 instruction-output examples sourced from 61 existing English NLP tasks. The uniqueness of this dataset lies in its structured approach, where crowd-sourced instructions from each task are aligned to a common schema. Each instruction is associated with a task, providing explicit guidance on how the model should respond. The instructions cover various fields, including a definition, things to avoid, and positive and negative examples. This structured nature makes the dataset valuable for fine-tuning models, as it provides clear and detailed instructions for the desired task. However, it's worth noting that the outputs in this dataset are relatively short, which might make the data less suitable for generating long-form content. Despite this limitation, Natural Instructions serves as a rich resource for training models through instruction tuning, enhancing their adaptability to specific NLP tasks. The image below contains an example instruction format.
:::
:::success
一個值得注意的指令微調資料集是「Natural Instructions」。這個資料集包含來自61個現有英語NLP任務的193,000個指令輸出範例。此資料集的獨特性在於其結構化方法,其中每個任務的眾包(crowd-sourced)指令都對齊到一個通用模式。每條指令都與一項任務相關聯,為模型應如何回應提供明確的指導。這些指令涵蓋多個欄位,包含定義、要避免的事情以及正面和負面的範例。這種結構化性質使資料集對於微調模型很有價值,因為它為所需任務提供了清晰詳細的指令。然而,值得注意的是,該資料集的輸出相對較短,這可能使資料不太適合生成長篇內容。儘管有此限制,Natural Instructions對於透過指令調整來訓練模型來說仍是豐富的資源,增強了模型對特定NLP任務的適應性。下圖為範例指令格式。
:::

Image Source: Mishra et al., 2022
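To make the schema concrete, here is a hedged sketch of how a Natural-Instructions-style record (definition, things to avoid, positive/negative examples) could be flattened into a single training prompt. The field names and sample text are illustrative assumptions, not the dataset's exact keys.

```python
def build_instruction_prompt(record: dict) -> str:
    """Flatten one instruction-tuning record into the prompt the model is trained to complete."""
    parts = [
        f"Definition: {record['definition']}",
        f"Things to avoid: {record['things_to_avoid']}",
        f"Positive example: {record['positive_example']}",
        f"Negative example: {record['negative_example']}",
        f"Input: {record['task_input']}",
        "Output:",
    ]
    return "\n".join(parts)

example = {
    "definition": "Summarize the given text in one sentence.",
    "things_to_avoid": "Do not copy sentences verbatim from the input.",
    "positive_example": "Input: <long article> -> Output: <one-sentence summary>",
    "negative_example": "Input: <long article> -> Output: <a paragraph copied from the input>",
    "task_input": "The city council voted on Tuesday to expand the bike-lane network ...",
}
prompt = build_instruction_prompt(example)
target = "The council approved an expansion of the city's bike-lane network."
print(prompt)
print(target)  # (prompt, target) pairs are what supervised instruction tuning consumes
```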
:::info
Instruction fine-tuning has become a valuable tool in the evolving landscape of natural language processing and machine learning, enabling LLMs to adapt to specific tasks with nuanced instructions.
:::
:::success
指令微調已成為自然語言處理和機器學習不斷發展的領域中的一個有價值的工具,使LLMs能夠透過細緻入微的指令適應特定任務。
:::
## **Reinforcement Learning from Human Feedback (RLHF)**
:::info
Reinforcement Learning from Human Feedback is a methodology designed to enhance language models by incorporating human feedback, aligning them more closely with intricate human values. The RLHF process comprises three fundamental steps:
:::
:::success
來自人類反饋的強化學習是一種透過納入人類反饋來增強語言模型的方法,使它們更緊密地與複雜的人類價值觀保持一致。 RLHF流程包括三個基本步驟:
:::
:::info
**1. Pretraining Language Models (LMs):**
RLHF initiates with a pretrained LM, typically achieved through classical pretraining objectives. The initial LM, which can vary in size, is flexible in choice. While optional, the initial LM can undergo fine-tuning on additional data. The crucial aspect is to have a model that exhibits a positive response to diverse instructions.
:::
:::success
**1.預訓練語言模型 (LM):**
RLHF從一個預訓練的LM開始,這個LM通常是透過經典的預訓練目標取得的。初始LM的大小不拘,可以靈活選擇。初始LM也可以(非必要地)先在額外資料上進行微調。關鍵是要有一個能對各種指令做出正面回應的模型。
:::
:::info
**2. Reward Model Training:**
The subsequent step involves generating a reward model (RM) calibrated with human preferences. This model assigns scalar rewards to sequences of text, reflecting human preferences. The dataset for training the reward model is generated by sampling prompts and passing them through the initial LM to produce text. Human annotators rank the generated text outputs, and these rankings are used to create a regularized dataset for training the reward model. The reward function combines the preference model and a penalty on the difference between the RL policy and the initial model.
:::
:::success
**2.獎勵模型訓練:**
後續步驟涉及產生一個以人類偏好校準的獎勵模型(RM)。該模型會為文本序列分配反映人類偏好的純量獎勵。訓練獎勵模型的資料集,是透過採樣提示(prompts)並將其送入初始LM生成文本來建立的。人類註釋者對生成的文字輸出進行排名,這些排名被用來建立訓練獎勵模型所需的正規化資料集。獎勵函數結合了偏好模型的分數,以及對RL policy與初始模型之間差異的懲罰項。
:::
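The human rankings are typically turned into pairs of a preferred and a dispreferred response and used with a pairwise ranking loss. Below is a minimal PyTorch sketch of that idea; the `TinyRewardModel` and its 768-dimensional input features are stand-ins for the LM-based reward model, not an actual RLHF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Stand-in reward model: maps a text representation to a single scalar reward."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scalar_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # `hidden` stands in for the LM's final hidden state of a (prompt, response) pair
        return self.scalar_head(hidden).squeeze(-1)

rm = TinyRewardModel()
chosen_h = torch.randn(8, 768)    # features of the responses humans ranked higher
rejected_h = torch.randn(8, 768)  # features of the responses humans ranked lower
r_chosen, r_rejected = rm(chosen_h), rm(rejected_h)

# Pairwise (Bradley-Terry style) loss: push the preferred reward above the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(float(loss))
```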
:::info
**3. Fine-Tuning with RL:**
The final step entails fine-tuning the initial LLM using reinforcement learning. Proximal Policy Optimization (PPO) is a commonly used RL algorithm for this task. The RL policy is the LM that takes in a prompt and produces text, with actions corresponding to tokens in the LM's vocabulary. The reward function, derived from the preference model and a constraint on policy shift, guides the fine-tuning. PPO updates the LM's parameters to maximize the reward metrics in the current batch of prompt-generation pairs. Some parameters of the LM are frozen due to computational constraints, and the fine-tuning aims to align the model with human preferences.
:::
:::success
**3.使用 RL 進行微調:**
最後一步是使用強化學習對初始LLM進行微調。Proximal Policy Optimization (PPO)是此任務常用的強化學習演算法。RL policy就是接收提示並產生文本的LM,其actions(動作)對應於LM詞彙表中的tokens。由偏好模型以及對策略偏移(policy shift)的約束所構成的獎勵函數,會引導微調的進行。PPO更新LM的參數,以最大化當前批次提示生成組合(prompt-generation pairs)中的獎勵指標。由於計算限制,LM的一些參數會被凍結,微調的目的是使模型與人類偏好保持一致。
:::
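The reward PPO actually maximizes combines the reward model's score with a penalty on how far the fine-tuned policy drifts from the frozen initial LM. The sketch below shows that combination in plain PyTorch; `beta` and the log-probability tensors are illustrative inputs, not values from any specific implementation.

```python
import torch

def rl_reward(rm_score: torch.Tensor,
              policy_logprobs: torch.Tensor,
              ref_logprobs: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """Per-sequence reward = preference-model score minus a KL-style penalty
    on the policy's drift away from the frozen initial LM."""
    kl_penalty = (policy_logprobs - ref_logprobs).sum(dim=-1)  # crude per-sequence drift estimate
    return rm_score - beta * kl_penalty

# Toy batch: 4 generated responses, 16 sampled tokens each.
rm_score = torch.tensor([0.7, -0.2, 1.1, 0.3])  # scores from the trained reward model
policy_lp = -torch.rand(4, 16)  # log-probs of the sampled tokens under the current policy
ref_lp = -torch.rand(4, 16)     # log-probs of the same tokens under the frozen initial LM
print(rl_reward(rm_score, policy_lp, ref_lp))  # PPO updates the policy to maximize this
```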

Image Source: https://openai.com/research/instruction-following
:::info
💡If you’re lost understanding RL terms like PPO, policy etc. Think of this analogy- Fine-Tuning with RL, specifically using Proximal Policy Optimization (PPO), is similar to refining instructions to train a pet, such as teaching a dog tricks. Think of the dog initially learning with general guidance (policy) and receiving treats (rewards) for correct actions. Now, imagine the dog mastering a new trick but not quite perfectly. Fine-tuning, with PPO, involves adjusting your instructions slightly based on how well the dog performs, similar to tweaking the model's behavior (policy) in Reinforcement Learning. It's like refining the instructions to optimize the learning process, much like perfecting your pet's tricks through gradual adjustments and treats for better performance.
:::
:::success
💡如果你對PPO、policy等強化學習術語不是那麼理解的話,可以想像這個比喻:使用強化學習進行微調,特別是Proximal Policy Optimization (PPO),就像改善訓練寵物的指令,例如教狗學會一些把戲。狗一開始是在一般性的指導(策略)下學習,並因正確的行為而獲得零食(獎勵)。現在,想像狗學會了一個新把戲,但還不是很完美。使用PPO進行微調,就是根據狗的表現稍微調整你的指令,類似於強化學習中微調模型的行為(策略)。這就像改善指令以最佳化學習過程,就像透過逐步調整和零食獎勵來完善寵物的把戲,讓牠表現得更好一樣。
:::
## Direct Preference Optimization (DPO) *(Bonus Topic)*
:::info
Direct Preference Optimization (DPO) is an equivalent of RLHF and has been gaining significant traction these days. DPO offers a straightforward method for fine-tuning large language models based on human preferences. It eliminates the need for a complex reward model and directly incorporates user feedback into the optimization process. In DPO, users simply compare two model-generated outputs and express their preferences, allowing the LLM to adjust its behavior accordingly. This user-friendly approach comes with several advantages, including ease of implementation, computational efficiency, and greater control over the LLM's behavior.
:::
:::success
Direct Preference Optimization(DPO)是與RLHF等效的方法,近年來受到了廣泛關注。DPO提供了一種根據人類偏好微調大型語言模型的直截了當的方法。它消除了對複雜獎勵模型的需要,並直接將使用者反饋納入最佳化過程。在DPO中,使用者只需比較兩個模型生成的輸出並表達他們的偏好,LLM就能相應地調整其行為。這種使用者友善的方法具有多種優點,包括易於實施、計算高效以及對LLM行為有更大的控制權。
:::

Image Source: [Rafailov, Rafael, et al.](https://arxiv.org/html/2305.18290v2)
:::info
💡In the context of LLMs, maximum likelihood is a principle used during the training of the model. Imagine the model is like a writer trying to predict the next word in a sentence. Maximum likelihood training involves adjusting the model's parameters (the factors that influence its predictions) to maximize the likelihood of generating the actual sequences of words observed in the training data. It's like tuning the writer's skills to make the sentences they create most closely resemble the sentences they've seen before. So, maximum likelihood helps the LLM learn to generate text that is most similar to the examples it was trained on.
:::
:::success
💡在LLMs的上下文中,最大似然(maximum likelihood)是模型訓練過程中所使用的原則。想像一下,模型就像一個作家試著預測句子中的下一個單字。最大似然訓練涉及調整模型的參數(影響其預測的因素),以最大化生成訓練資料中實際觀察到的單字序列的可能性。這就像調整作家的技巧,使他們創造的句子與他們以前見過的句子最相似。因此,最大似然有助於LLM學習生成與其訓練範例最相似的文本。
:::
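In symbols, maximum-likelihood training of a causal LM adjusts the parameters $\theta$ to maximize the token-by-token log-probability of the training sequences in the dataset $\mathcal{D}$:

$$
\max_{\theta} \; \sum_{x \in \mathcal{D}} \sum_{t=1}^{|x|} \log p_{\theta}\left(x_t \mid x_{<t}\right)
$$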
:::info
***DPO***: DPO takes a straightforward approach by directly optimizing the LM based on user preferences without the need for a separate reward model. Users compare two model-generated outputs, expressing their preferences to guide the optimization process.
:::
:::success
***DPO***:DPO採用直接的方法,根據使用者偏好直接最佳化LM,不需要再來一個單獨的獎勵模型。使用者比較兩個模型生成的輸出,表達他們的偏好以指導最佳化過程。
:::
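A compact PyTorch sketch of the DPO objective from Rafailov et al. follows: given sequence log-probabilities of the preferred and dispreferred responses under the trainable policy and under a frozen reference model, the loss is the negative log-sigmoid of beta times the difference between the two policy-to-reference log-ratios. The tensor names and toy values are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta: float = 0.1):
    """DPO objective: widen the margin between the preferred and dispreferred
    responses, measured relative to the frozen reference model."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp        # implicit reward of the preferred response
    rejected_logratio = policy_rejected_lp - ref_rejected_lp  # implicit reward of the dispreferred response
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of 4 preference pairs (sequence-level log-probabilities are negative numbers).
policy_chosen = -torch.rand(4) * 10
policy_rejected = -torch.rand(4) * 10
ref_chosen = -torch.rand(4) * 10
ref_rejected = -torch.rand(4) * 10
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```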
:::info
***RLHF***: RLHF follows a more structured path, leveraging reinforcement learning principles. It involves training a reward model that learns to identify and reward desirable LM outputs. The reward model then guides the LM's training process, shaping its behavior towards achieving positive outcomes.
:::
:::success
***RLHF***:RLHF遵循更結構化的路徑,運用強化學習的原理。它涉及訓練一個獎勵模型,該模型學習識別和獎勵期望的LM輸出。然後,獎勵模型指導LM的訓練過程,塑造其行為以實現正向的結果。
:::
### **DPO (Direct Preference Optimization) vs. RLHF (Reinforcement Learning from Human Feedback): Understanding the Differences**
:::info
**DPO - A Simpler Approach:**
Direct Preference Optimization (DPO) takes a straightforward path, sidestepping the need for a complex reward model. It directly optimizes the Large Language Model (LLM) based on user preferences, where users compare two outputs and indicate their preference. This simplicity results in key advantages:
:::
:::success
**DPO - 更簡單的方法:**
Direct Preference Optimization(DPO)採取直接的路徑,避免了對複雜獎勵模型的需求。它根據使用者偏好直接最佳化大型語言模型(LLM),使用者比較兩個輸出並表明他們的偏好。這種簡單性帶來了以下主要優勢:
:::
:::info
1. **Ease of Implementation:** DPO is more user-friendly as it eliminates the need for designing and training a separate reward model, making it accessible to a broader audience.
2. **Computational Efficiency:** Operating directly on the LLM, DPO leads to faster training times and lower computational costs compared to RLHF, which involves multiple phases.
3. **Greater Control:** Users have direct control over the LLM's behavior, guiding it toward specific goals and preferences without the complexities of RLHF.
4. **Faster Convergence:** Due to its simpler structure and direct optimization, DPO often achieves desired results faster, making it suitable for tasks with rapid iteration needs.
5. **Improved Performance:** Recent research suggests that DPO can outperform RLHF in scenarios like sentiment control and response quality, particularly in summarization and dialogue tasks.
:::
:::success
1. **易於實施:** DPO更容易上手,因為它消除了設計和訓練單獨獎勵模型的需求,從而可供更廣泛的使用者使用。
2. **計算效率:** DPO直接在LLM上運作,相較於涉及多個階段的RLHF,有更短的訓練時間與更低的計算成本。
3. **更好的控制:** 使用者可以直接控制LLM的行為,引導其實現特定目標和偏好,而不需面對RLHF的複雜性。
4. **更快的收斂:** 由於其更簡單的結構和直接最佳化的方式,DPO往往能更快達到預期的結果,使其適合需要快速迭代的任務。
5. **效能提升:** 近來的研究顯示,DPO在情感控制和回應品質等場景中(尤其是摘要和對話任務)可以優於RLHF。
:::
:::info
**RLHF - A More Structured Approach:**
It follows a more structured path, leveraging reinforcement learning principles. It includes three training phases: pre-training, reward model training, and fine-tuning with reinforcement learning. While flexible, RLHF comes with complexities:
1. **Complexity:** RLHF can be more complex and sometimes unstable, demanding more computational resources and dealing with challenges like convergence, drift, or uncorrelated distribution problems.
2. **Flexibility in Defining Rewards:** RLHF allows for more nuanced reward structures, beneficial for tasks requiring precise control over the LLM's output.
3. **Handling Diverse Feedback Formats:** RLHF can handle various forms of human feedback, including numerical ratings or textual corrections, whereas DPO primarily relies on binary preferences.
4. **Handling Large Datasets:** RLHF can be more efficient in handling massive datasets, especially with distributed training techniques.
:::
:::success
**RLHF - 更結構化的方法:**
它遵循一個更結構化的路徑,利用強化學習原理。它包括三個訓練階段:預訓練、獎勵模型訓練和使用強化學習進行微調。儘管靈活性佳,不過RLHF也有著一定的複雜性:
1. **複雜性:** RLHF可能更複雜,有時候不穩定,需要更多的計算資源,而且有著收斂、漂移或不相關分佈問題等挑戰。
2. **定義獎勵的靈活性:** RLHF允許更細緻的獎勵結構,有利於需要對LLM的輸出做精確控制的任務。
3. **處理多樣化的反饋格式:** RLHF可以處理各種形式的人類反饋,包括數值評分或文本校正,而DPO主要依賴binary preferences(比較兩個模型輸出,然後選出比較喜歡哪一個)。
4. **處理大型資料集:** RLHF可以更有效地處理大型資料集,尤其是使用分散式訓練技術。
:::
:::info
In summary, the choice depends on the specific task, available resources, and the desired level of control, with both methods offering strengths and weaknesses in different contexts. As advancements continue, these methods contribute to evolving and enhancing fine-tuning processes for LLMs.
:::
:::success
總而言之,怎麼選擇是取決於特定任務、可用資源和所需的控制級別,這兩種方法在不同情況下各有優缺點。隨著不斷進步,這些方法有助於發展和增強LLMs的微調流程。
:::
## Parameter Efficient Fine-Tuning (PEFT)
:::info
Parameter-Efficient Fine-Tuning (PEFT) addresses the resource-intensive nature of fine-tuning LLMs. Unlike full fine-tuning that modifies all parameters, PEFT fine-tunes only a small subset of additional parameters while keeping the majority of pretrained model weights frozen. This selective approach minimizes computational requirements, mitigates catastrophic forgetting, and facilitates fine-tuning even with limited computational resources. PEFT, as a whole, offers a more efficient and practical method for adapting LLMs to specific downstream tasks without the need for extensive computational power and memory.
:::
:::success
Parameter-Efficient Fine-Tuning(PEFT)解決了微調LLMs的資源密集問題。與修改所有參數的full fine-tuning不一樣,PEFT僅微調一小部分附加參數,同時保持大部分預訓練模型權重凍結。這種選擇性的方法最小化了計算需求,減輕了災難性遺忘,並且即使在計算資源有限的情況下也能進行微調。總體而言,PEFT提供了一種更有效率、更具實用性的方法,使LLM適應特定的下游任務,而不需要大量的運算能力和記憶體。
:::
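As a concrete illustration of one popular PEFT method, below is a minimal LoRA sketch using the Hugging Face `peft` library. The base checkpoint (`gpt2`) and the `target_modules` entry are placeholders; which projection layers receive adapters depends on the model architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_cfg = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection; model-specific
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)  # base weights stay frozen; small adapters are trainable
model.print_trainable_parameters()      # typically well under 1% of the total parameter count
```

Training then proceeds as usual (for example with a standard Trainer loop), but only the adapter weights receive gradient updates.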
:::info
**Advantages of Parameter-Efficient Fine-Tuning (PEFT)**
1. **Computational Efficiency:** PEFT fine-tunes LLMs with significantly fewer parameters than full fine-tuning. This reduces the computational demands, making it feasible to fine-tune on less powerful hardware or in resource-constrained environments.
2. **Memory Efficiency:** By freezing the majority of pretrained model weights, PEFT avoids excessive memory usage associated with modifying all parameters. This makes PEFT particularly suitable for tasks where memory constraints are a concern.
3. **Catastrophic Forgetting Mitigation:** PEFT prevents catastrophic forgetting, a phenomenon observed during full fine-tuning where the model loses knowledge from its pre-trained state. This ensures that the LLM retains valuable information while adapting to new tasks.
4. **Versatility Across Modalities:** PEFT extends beyond natural language processing tasks, proving effective in various modalities such as computer vision and audio. Its versatility makes it applicable to a wide range of downstream tasks.
5. **Modular Adaptation for Multiple Tasks:** The modular nature of PEFT allows the same pretrained model to be adapted for multiple tasks by adding small task-specific weights. This avoids the need to store full copies for different applications, enhancing flexibility and efficiency.
6. **INT8 Tuning:** PEFT's capabilities include INT8 (8-bit integer) tuning, showcasing its adaptability to different quantization techniques. This enables fine-tuning even on platforms with limited computational resources.
:::
:::success
**Parameter-Efficient Fine-Tuning(PEFT)的優點**
1. **計算效率:** PEFT微調LLM的參數比full fine-tuning明顯少得多。這減少了計算需求,使其能夠在較弱的硬體或資源受限的環境中進行微調。
2. **記憶體效率:** 透過凍結大部分預訓練模型權重,PEFT避免了與修改所有參數相關的過多記憶體用量。這使得PEFT特別適合需要考慮記憶體限制的任務。
3. **減輕災難性遺忘:** PEFT可防止災難性遺忘的問題,這是一個在full fine-tuning過程中觀察到的現象,也就是模型會遺失掉預訓練狀態的知識。這確保了LLM在適應新任務的同時保留有價值的資訊。
4. **跨模態的多功能性:** PEFT的應用並不限於自然語言處理任務,在電腦視覺和音訊等各種模態中也被證明有效。其多功能性使其適用於廣泛的下游任務。
5. **針對多個任務的模組化適應性:** PEFT的模組化性質允許透過添加小型的特定任務的權重來使相同的預訓練模型適應多個任務。這避免了為不同應用程式儲存完整副本的需求,提高了靈活性和效率。
6. **INT8 調整:** PEFT的功能包括INT8(8 位元整數)調整,展示了其對不同量化技術的適應性。即使在運算資源有限的平台上也可以微調。
:::
:::info
In summary, PEFT offers a practical and efficient solution for fine-tuning large language models, addressing computational and memory challenges while maintaining performance on downstream tasks.
:::
:::success
總的來說,PEFT提供了一種用於微調大型語言模型的實用且高效的解決方案,解決計算和記憶體挑戰,同時維持其於下游任務的效能。
:::
:::info
A summary of the most popular PEFT methods is in the chart below. Please download for improved visibility.
:::
:::success
下表總結了最受歡迎的PEFT方法。請下載以提高可見性。
:::

## Read/Watch These Resources (Optional)
1. https://www.superannotate.com/blog/llm-fine-tuning
2. https://www.deeplearning.ai/short-courses/finetuning-large-language-models/
3. https://www.youtube.com/watch?v=eC6Hd1hFvos
4. https://www.labellerr.com/blog/comprehensive-guide-for-fine-tuning-of-llms/
## Read These Papers (Optional)
1. https://arxiv.org/abs/2303.15647
2. https://arxiv.org/abs/2109.10686
3. https://arxiv.org/abs/2304.01933