# [Week 1, Part 2] Domain and Task Adaptation Methods
[課程目錄](https://areganti.notion.site/Applied-LLMs-Mastery-2024-562ddaa27791463e9a1286199325045c)
[課程連結](https://areganti.notion.site/Week-1-Part-2-Domain-and-Task-Adaptation-Methods-6ad3284a96a241f3bd2318f4f502a1da)
## ETMI5: Explain to Me in 5
:::info
In this section, we delve into the limitations of general AI models in specialized domains, underscoring the significance of domain-adapted LLMs. We explore the advantages of these models, including depth, precision, improved user experiences, and addressing privacy concerns.
:::
:::success
在本章節中,我們將深入研究通用人工智慧模型在專業領域的侷限性,並強調領域自適應大型語言模型的重要性。我們探討這些模型的優點,包括深度、精確度、改進的使用者經驗以及解決隱私問題。
:::
:::info
We introduce three types of domain adaptation methods: Domain-Specific Pre-Training, Domain-Specific Fine-Tuning, and Retrieval Augmented Generation (RAG). Each method is outlined, providing details on types, training durations, and quick summaries. We then explain each of these methods in further detail with real-world examples. In the end, we provide an overview of when RAG should be used as opposed to model updating methods.
:::
:::success
我們介紹了三種類型的領域自適應方法:Domain-Specific Pre-Training、Domain-Specific Fine-Tuning和Retrieval Augmented Generation (RAG)。我們對每種方法都做了概述,提供類型、訓練時長等細節以及快速摘要。接著,我們用真實世界的案例進一步詳細解釋這些方法。最後,我們概述了什麼時候應該使用RAG而不是模型更新的方法。
:::
## Using LLMs Effectively
:::info
While general AI models such as ChatGPT demonstrate impressive text generation abilities across various subjects, they may lack the depth and nuanced understanding required for specific domains. Additionally, these models are more prone to generating inaccurate or contextually inappropriate content, referred to as hallucinations. For instance, in healthcare, specific terms like "electronic health record interoperability" or "patient-centered medical home" hold significant importance, but a generic language model may struggle to fully comprehend their relevance due to a lack of specific training on healthcare data. This is where task-specific and domain-specific LLMs play a crucial role. These models need to possess specialized knowledge of industry-specific terminology and practices to ensure accurate interpretation of domain-specific concepts. Throughout the remainder of this course, we will refer to these specialized LLMs as **domain-specific LLMs**, a commonly used term for such models.
:::
:::success
雖然通用人工智慧模型(像是ChatGPT)在各個主題上展示了令人印象深刻的文本生成能力,不過它們可能缺乏特定領域所需的深度與細微的理解。此外,這些模型較容易生成不準確或上下文不適當的內容,稱為幻覺。舉例來說,在醫療保健領域,「電子健康記錄可交互運作性」或「以人為本的整合居家醫療照護模式」等特定術語非常重要,不過,由於缺乏對醫療保健資料的具體訓練,通用語言模型可能難以完全理解其相關性。這就是task-specific和domain-specific的LLMs發揮關鍵作用的地方。這些模型需要具備行業特定術語和實踐的專業知識,以確保準確解釋特定領域的概念。在本課程的其餘部分中,我們將這些專用的LLMs稱為**domain-specific LLMs**,這是此類模型的常用術語。
:::
:::info
Here are some benefits of using domain-specific LLMs:
1. **Depth and Precision**: General LLMs, while proficient in generating text across diverse topics, may lack the depth and nuance required for specialized domains. Domain-specific LLMs are tailored to understand and interpret industry-specific terminology, ensuring precision in comprehension.
2. **Overcoming Limitations**: General LLMs have limitations, including potential inaccuracies, lack of context, and susceptibility to hallucinations. In domains like finance or medicine, where specific terminology is crucial, domain-specific LLMs excel in providing accurate and contextually relevant information.
3. **Enhanced User Experiences**: Domain-specific LLMs contribute to enhanced user experiences by offering tailored and personalized responses. In applications such as customer service chatbots or dynamic AI agents, these models leverage specialized knowledge to provide more accurate and insightful information.
4. **Improved Efficiency and Productivity**: Businesses can benefit from the improved efficiency of domain-specific LLMs. By automating tasks, generating content aligned with industry-specific terminology, and streamlining operations, these models free up human resources for higher-level tasks, ultimately boosting productivity.
5. **Addressing Privacy Concerns**: In industries dealing with sensitive data, such as healthcare, using general LLMs may pose privacy challenges. Domain-specific LLMs can provide a closed framework, ensuring the protection of confidential data and adherence to privacy agreements.
:::
:::success
下面是使用domain-specific LLMs的一些好處:
1. **深度和精確度**:通用大型語言模型雖然擅長生成跨不同主題的文本,不過可能缺乏特定領域所需的深度和細微差別。Domain-specific LLMs是為理解和解釋行業特定術語而量身定制的,確保理解上的精確。
2. **克服侷限性**:通用大型語言模型有其侷限性,包括潛在的不準確、上下文的缺乏以及容易產生幻覺。在金融或醫學等領域,其行業特定術語至關重要,domain-specific LLMs擅長提供準確且與上下文相關的信息。
3. **增強的使用者經驗**:Domain-specific LLMs透過提供量身定制的個人化回應來增強使用者經驗。在客戶服務聊天機器人或動態人工智慧代理等應用中,這些模型利用專業知識來提供更準確和更有洞察力的信息。
4. **提高效率和生產力**:企業可以從domain-specific LLMs提升的效率中得到好處。透過自動化任務、生成與行業特定術語一致的內容以及簡化操作,這些模型可以釋放人力資源來執行更高層級的任務,最終提高生產力。
5. **解決隱私問題**:在涉及敏感資料的行業(例如醫療保健)中,使用通用大型語言模型可能會帶來隱私權的挑戰。Domain-specific LLMs可以提供一個封閉性的框架,確保機密資料的保護和隱私協議的遵守。
:::
:::info
If you recall from the [previous section](https://www.notion.so/369ae7cf630d467cbfeedd3b9b3bfc46?pvs=21), we had multiple ways to use LLMs in specific use cases, namely
1. **Zero-shot learning**
2. **Few-shot learning**
3. **Domain Adaptation**
Zero-shot learning and few-shot learning involve instructing the general model either through examples or by prompting it with specific questions of interest. Another concept introduced is domain adaptation, which will be the primary focus in this section. More details about the first two methods will be explored when we delve into the topic of prompting.
:::
:::success
如果你還記得[上一節](https://www.notion.so/369ae7cf630d467cbfeedd3b9b3bfc46?pvs=21)的內容,我們有多種方法可以在特定情況中使用LLMs,也就是
1. **Zero-shot learning**
2. **Few-shot learning**
3. **Domain Adaptation**
Zero-shot learning和few-shot learning涉及透過範例或提出感興趣的特定問題來指導通用模型。另一個引入的概念是Domain Adaptation,這將是本節的主要重點。當我們深入探討prompting這個主題時,會再說明前兩種方法的更多細節。
:::
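Zero-shot and few-shot prompting are covered in depth later in the course; as a quick illustration of the distinction, here is a minimal sketch in Python using the OpenAI client. The model name, the finance-flavoured examples, and the helper function are illustrative assumptions, not part of the course material.

```python
# Minimal sketch: zero-shot vs. few-shot prompting of a general-purpose LLM.
# Assumes the `openai` package (>=1.0) and an OPENAI_API_KEY in the environment;
# the model name and example headlines are illustrative only.
from openai import OpenAI

client = OpenAI()

ZERO_SHOT = (
    "Classify the sentiment of this financial headline as positive, negative, or neutral:\n"
)

FEW_SHOT = (
    "Classify the sentiment of financial headlines.\n"
    "Headline: 'Company X beats earnings expectations' -> positive\n"
    "Headline: 'Regulator opens probe into Company Y' -> negative\n"
    "Headline: "
)

def classify(prompt_prefix: str, headline: str) -> str:
    """Send the prompt to a chat model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model would do
        messages=[{"role": "user", "content": prompt_prefix + headline}],
    )
    return response.choices[0].message.content

print(classify(ZERO_SHOT, "Company Z announces record quarterly revenue"))
print(classify(FEW_SHOT, "'Company Z announces record quarterly revenue' ->"))
```

The only difference between the two calls is whether the prompt includes worked examples; in both cases the model's weights are never updated, which is what separates these techniques from domain adaptation.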
## Types of Domain Adaptation Methods
:::info
There are several methods to incorporate domain-specific knowledge into LLMs, each with its own advantages and limitations. Here are three classes of approaches:
1. **Domain-Specific Pre-Training:**
- **Training Duration**: Days to weeks to months
- **Summary**: Requires a large amount of domain training data; can customize model architecture, size, tokenizer, etc.
In this method, LLMs are pre-trained on extensive datasets representing various natural language use cases. For instance, models like PaLM 540B, GPT-3, and LLaMA 2 have been pre-trained on datasets with sizes ranging from 499 billion to 2 trillion tokens. Examples of domain-specific pre-training include models like ESMFold, ProGen2 for protein sequences, Galactica for science, BloombergGPT for finance, and StarCoder for code. These models outperform generalist models within their domains but still face limitations in terms of accuracy and potential hallucinations.
2. **Domain-Specific Fine-Tuning:**
- **Training Duration**: Minutes to hours
- **Summary**: Adds domain-specific data; tunes for specific tasks; updates LLM model
Fine-tuning involves training a pre-trained LLM on a specific task or domain, adapting its knowledge to a narrower context. Examples include Alpaca (fine-tuned LLaMA-7B model for general tasks), xFinance (fine-tuned LLaMA-13B model for financial-specific tasks), and ChatDoctor (fine-tuned LLaMA-7B model for medical chat). The costs for fine-tuning are significantly smaller compared to pre-training.
3. **Retrieval Augmented Generation (RAG):**
- **Training Duration**: Not required
    - **Summary**: No updates to model weights; the external information retrieval system can be tuned
RAG involves grounding the LLM's parametric knowledge with external or non-parametric knowledge from an information retrieval system. This external knowledge is provided as additional context in the prompt to the LLM. The advantages of RAG include no training costs, low expertise requirement, and the ability to cite sources for human verification. This approach addresses limitations such as hallucinations and allows for precise manipulation of knowledge. The knowledge base is easily updatable without changing the LLM. Strategies to combine non-parametric knowledge with an LLM's parametric knowledge are actively researched.
:::
:::success
有多種方法可以將特定領域的知識融入LLMs,每種方法都有其優點與限制。以下是三類方法:
1. **Domain-Specific Pre-Training:**
- **訓練持續時間**:幾天到幾周到幾個月
- **總結**:需要大量的領域訓練資料;可以客製化模型架構、大小、分詞器等。
在這個方法中,LLMs在代表各種自然語言使用情境的廣泛資料集上做了預訓練。舉例來說,PaLM 540B、GPT-3和LLaMA 2等模型已經在規模從499 billion到2 trillion tokens不等的資料集上做了預訓練。domain-specific pre-training的範例包括ESMFold、用於蛋白質序列的ProGen2、用於科學的Galactica、用於金融的BloombergGPT和用於程式碼的StarCoder等模型。這些模型在其領域內優於通用模型,但在準確性和潛在幻覺方面仍有其侷限性。
2. **Domain-Specific Fine-Tuning:**
    - **訓練持續時間**:幾分鐘到幾小時
- **總結**:新增特定領域的資料;針對特定任務的微調;更新LLM模型
微調涉及在特定任務或領域上訓練已預訓練的大型語言模型,使其知識適應較狹窄的上下文。範例包括Alpaca(針對一般任務微調的LLaMA-7B)、xFinance(針對特定財務任務微調的LLaMA-13B)和ChatDoctor(針對醫療聊天微調的LLaMA-7B)。與預訓練相比,微調的成本要小得多。
3. **Retrieval Augmented Generation (RAG):**
- **訓練持續時間**:不需要
    - **總結**:不更新模型權重;可調整外部資訊檢索系統
RAG涉及將LLMs的參數知識與來自信息檢索系統的外部或非參數知識的結合。這些外部知識作為提示中額外的上下文來提供給LLMs。RAG的優點包括無訓練成本、專業知識要求低以及能夠引用來源來進行人工驗證。這種方法解決了幻覺等限制,並允許精確操控知識。無需更改LLMs也能輕鬆更新知識庫。人們正在積極研究將非參數知識與LLMs的參數知識相結合的策略。
:::
## Domain-Specific Pre-Training

Image Source https://www.analyticsvidhya.com/blog/2023/08/domain-specific-llms/
:::info
Domain-specific pre-training involves training large language models on extensive datasets that specifically represent the language and characteristics of a particular domain or field. This process aims to enhance the model's understanding and performance within a defined subject area. Let’s understand domain-specific pre-training through the example of [BloombergGPT](https://arxiv.org/pdf/2303.17564.pdf), a large language model for finance.
:::
:::success
Domain-Specific Pre-Training涉及在大量代表特定領域的語言和特徵的資料集上訓練大型語言模型。這個過程旨在增強模型在特定主題領域內的理解與表現。讓我們透過金融領域大型語言模型[BloombergGPT](https://arxiv.org/pdf/2303.17564.pdf)的範例來了解Domain-Specific Pre-Training。
:::
:::info
BloombergGPT is a 50 billion parameter language model designed to excel in various tasks within the financial industry. While general models are versatile and perform well across diverse tasks, they may not outperform domain-specific models in specialized areas. At Bloomberg, where a significant majority of applications are within the financial domain, there is a need for a model that excels in financial tasks while maintaining competitive performance on general benchmarks. BloombergGPT can perform the following tasks:
1. **Financial Sentiment Analysis:** Analyzing and determining sentiment in financial texts, such as news articles, social media posts, or financial reports. This helps in understanding market sentiment and making informed investment decisions.
2. **Named Entity Recognition:** Identifying and classifying entities (such as companies, individuals, and financial instruments) mentioned in financial documents. This is crucial for extracting relevant information from large datasets.
3. **News Classification:** Categorizing financial news articles into different topics or classes. This can aid in organizing and prioritizing news updates based on their relevance to specific financial areas.
4. **Question Answering in Finance:** Answering questions related to financial topics. Users can pose queries about market trends, financial instruments, or economic indicators, and BloombergGPT can provide relevant answers.
5. **Conversational Systems for Finance:** Engaging in natural language conversations related to finance. Users can interact with BloombergGPT to seek information, clarify doubts, or discuss financial concepts.
:::
:::success
BloombergGPT是一個包含500億(50 billion)個參數的語言模型,專為在金融業的各種任務中表現出色而設計。雖然通用模型用途廣泛,而且在不同任務中的表現都不錯,不過它們在專業領域可能無法勝過domain-specific的模型。在Bloomberg,絕大多數應用程式都屬於金融領域,因此需要一個在金融任務中表現出色、同時在一般基準上維持競爭力的模型。BloombergGPT可以執行以下任務:
1. **財務情緒分析:** 分析並確定金融文本中的情緒,像是新聞文章、社群媒體貼文或財務報告。這有助於了解市場情緒並做出明智的投資決策。
2. **命名實體識別:** 對金融文件中提及的實體(像是公司、個人和金融工具)進行識別和分類。這對於從大型資料集中提取相關信息至關重要。
3. **新聞分類:** 將財經新聞文章分類為不同的主題或類別。這有助於根據新聞與特定金融領域的相關性來組織新聞更新並排定優先次序。
4. **金融問答:** 回答與金融主題相關的問題。使用者可以提出有關市場趨勢、金融工具或經濟指標的問題,BloombergGPT能夠給出相關答案。
5. **金融對話系統:** 參與與金融相關的自然語言對話。使用者可以與BloombergGPT互動,尋求資訊、澄清疑慮或討論金融概念。
:::
:::info
To achieve this, BloombergGPT undergoes domain-specific pre-training using a large dataset that combines domain-specific financial language documents from Bloomberg's extensive archives with public datasets. This dataset, named FinPile, consists of diverse English financial documents, including news, filings, press releases, web-scraped financial documents, and social media content. The training corpus is roughly divided into half domain-specific text and half general-purpose text. The aim is to leverage the advantages of both domain-specific and general data sources.
:::
:::success
為了達成這個目標,BloombergGPT使用大型資料集進行Domain-Specific Pre-Training,該資料集將彭博大量檔案中的特定領域金融語言文件與公開資料集結合。該資料集名為FinPile,由多種英文金融文件所組成,包括新聞、文件、新聞稿、網路爬到的金融文件和社群媒體內容。訓練語料庫大致分為一半特定領域文本和一半通用文本。目的是充份利用特定領域資料來源和通用資料來源的優勢。
:::
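To make the "roughly half domain-specific, half general-purpose" split concrete, below is a hedged sketch of how a mixed pre-training corpus could be assembled by token budget. The file paths, the stand-in tokenizer, and the greedy sampling logic are illustrative assumptions and are not taken from the BloombergGPT data pipeline.

```python
# Sketch: building a ~50/50 mixture of domain-specific and general-purpose text
# for pre-training. Paths, tokenizer, and token budget are illustrative.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer

def count_tokens(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return len(tokenizer.encode(f.read()))

domain_docs = ["finpile/news_0001.txt", "finpile/filings_0001.txt"]   # hypothetical paths
general_docs = ["public/web_0001.txt", "public/books_0001.txt"]        # hypothetical paths

def sample_corpus(docs: list[str], token_budget: int) -> list[str]:
    """Greedily pick shuffled documents until this source's token budget is met."""
    random.shuffle(docs)
    chosen, used = [], 0
    for doc in docs:
        if used >= token_budget:
            break
        used += count_tokens(doc)
        chosen.append(doc)
    return chosen

TOTAL_TOKENS = 700_000_000_000  # illustrative budget; see the paper for exact counts
mixture = (
    sample_corpus(domain_docs, TOTAL_TOKENS // 2)     # ~50% financial (FinPile-style) text
    + sample_corpus(general_docs, TOTAL_TOKENS // 2)  # ~50% general-purpose text
)
```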
:::info
The model architecture is based on guidelines from previous research efforts, containing 70 layers of transformer decoder blocks (read more in the [paper](https://arxiv.org/pdf/2303.17564.pdf)).
:::
:::success
這個模型架構是基於先前研究工作的指導方針,包含70層的transformer decoder blocks(請閱讀[論文](https://arxiv.org/pdf/2303.17564.pdf)以了解更多資訊)。
:::
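As a rough illustration of what customizing the architecture looks like in practice, the sketch below builds a decoder-only configuration with 70 layers, the depth mentioned above. The hidden size, head count, context length, and vocabulary size are placeholder values chosen only to make the config well-formed; the paper documents BloombergGPT's actual hyperparameters.

```python
# Sketch: a decoder-only (GPT-style) config with 70 transformer blocks.
# All values other than the layer count are placeholders, not BloombergGPT's.
from transformers import GPT2Config

config = GPT2Config(
    n_layer=70,        # 70 transformer decoder blocks, as described above
    n_embd=4096,       # placeholder hidden size
    n_head=32,         # placeholder number of attention heads
    n_positions=2048,  # placeholder context length
    vocab_size=50257,  # placeholder vocabulary size
)

# Rough parameter estimate (~12*d^2 per block plus the embedding table),
# without instantiating the very large model itself.
approx_params = config.n_layer * 12 * config.n_embd**2 + config.vocab_size * config.n_embd
print(f"~{approx_params / 1e9:.1f}B parameters")
```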
## Domain-Specific Fine-Tuning
:::info
Domain-specific fine-tuning is the process of refining a pre-existing language model for a particular task or within a specific domain to enhance its performance and tailor it to the unique context of that domain. This method involves taking an LLM that has undergone pre-training on a diverse dataset encompassing various language use cases and subsequently fine-tuning it on a narrower dataset specifically related to a particular domain or task.
:::
:::success
Domain-Specific Fine-Tuning是針對特定任務或在特定領域內改善現有語言模型的過程,以增強其效能並使其適合該領域獨特的上下文。這個方法是將已經在涵蓋各種語言使用案例的多樣化資料集上預訓練過的LLM,再在與特定領域或任務相關的較小資料集上進行微調。
:::
:::info
💡Note that the previous method, i.e., domain-specific pre-training involves training a language model exclusively on data from a specific domain, creating a specialized model for that domain. On the other hand, domain-specific fine-tuning takes a pre-trained general model and further trains it on domain-specific data, adapting it for tasks within that domain without starting from scratch. Pre-training is domain-exclusive from the beginning, while fine-tuning adapts a more versatile model to a specific domain.
:::
:::success
💡注意到,前面提到的方法,也就是Domain-Specific Pre-Training,是僅在特定領域的資料上訓練語言模型,從而為該領域建立專用模型。另一方面,Domain-Specific Fine-Tuning採用預訓練的通用模型,並在特定領域的資料上進一步訓練它,使其適應該領域內的任務,而無需從頭開始。預訓練從一開始就是domain-exclusive,而微調則將一個更通用的模型適應到特定領域。
:::
:::info
The key steps in domain-specific fine-tuning include:
1. **Pre-training:** Initially, a large language model is pre-trained on an extensive dataset, allowing it to grasp general language patterns, grammar, and contextual understanding (A general LLM).
2. **Fine-tuning Dataset:** A more focused dataset, tailored to the desired domain or task, is collected or prepared. This dataset contains relevant examples and instances related to the target domain, potentially including labeled examples for supervised learning.
3. **Fine-tuning Process:** The pre-trained language model undergoes further training on this domain-specific dataset. During fine-tuning, the model's parameters are adjusted based on the new dataset, while retaining the general language understanding acquired during pre-training.
4. **Task Optimization:** The fine-tuned model is optimized for specific tasks within the chosen domain. This optimization may involve adjusting parameters related to the task, such as the model architecture, size, or tokenizer, to achieve optimal performance.
:::
:::success
domain-specific fine-tuning的關鍵步驟包括:
1. **預訓練:** 最初,大型語言模型在大量的資料集上做預訓練,使其能夠掌握一般語言模式、語法和上下文理解(通用LLMs)。
2. **微調資料集:** 收集或準備一個針對所需領域或任務量身打造、更聚焦的資料集。此資料集包含與目標領域相關的樣本和實例,可能包括用於監督學習的標記樣本。
3. **微調過程:** 預訓練的語言模型在這個特定領域的資料集上經歷進一步的訓練。在微調過程中,模型的參數會根據新資料集進行調整,同時保留預訓練期間獲得的一般語言理解。
4. **任務最佳化:** 微調模型針對所選定的領域內的特定任務做了最佳化。這種最佳化可能涉及調整與任務相關的參數,像是模型架構、大小或分詞器,以實現最佳效能。
:::
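The four steps above can be turned into a short supervised fine-tuning script. Below is a minimal, hedged sketch using the Hugging Face `transformers` and `datasets` libraries; the base checkpoint, the dataset file, and the hyperparameters are illustrative assumptions, and real projects usually add evaluation, parameter-efficient methods such as LoRA, and careful data curation.

```python
# Sketch: domain-specific fine-tuning of a pre-trained causal LM on a small
# in-domain text dataset. Model name, dataset path, and hyperparameters are
# illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-2-7b-hf"          # hypothetical base checkpoint (step 1)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Step 2: a narrow, domain-specific dataset (e.g. one JSON line per dialogue).
dataset = load_dataset("json", data_files="domain_dialogues.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Steps 3-4: continue training the pre-trained weights on the new data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-domain-ft",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```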
:::info
Domain-specific fine-tuning offers several advantages:
- It enables the model to specialize in a particular domain, enhancing its effectiveness for tasks within that domain.
- It saves time and computational resources compared to training a model from scratch, leveraging the knowledge gained during pre-training.
- The model can adapt to the specific requirements and nuances of the target domain, leading to improved performance on domain-specific tasks.
:::
:::success
Domain-Specific Fine-Tuning具有以下幾個優點:
- 它讓模型能夠專注於特定領域,從而增強該領域內任務的有效性。
- 相較於整個模型從頭訓練,利用預訓練期間獲得的知識可以節省時間和計算資源。
- 此模型可以適應目標領域的特定需求和細微差別,從而提高特定領域任務上的表現。
:::
:::info
A popular example of domain-specific fine-tuning is the ChatDoctor LLM, a specialized language model fine-tuned from Meta AI's LLaMA (Large Language Model Meta AI) using a dataset of 100,000 patient-doctor dialogues from an online medical consultation platform. The model undergoes fine-tuning on real-world patient interactions, significantly improving its understanding of patient needs and providing more accurate medical advice. ChatDoctor uses real-time information from online sources like Wikipedia and curated offline medical databases, enhancing the accuracy of its responses to medical queries. The model's contributions include a methodology for fine-tuning LLMs in the medical field, a publicly shared dataset, and an autonomous ChatDoctor model capable of retrieving updated medical knowledge. Read more about ChatDoctor in the paper [here](https://arxiv.org/pdf/2303.14070.pdf).
:::
:::success
domain-specific fine-tuning的一個知名範例是ChatDoctor LLM,這是一種專用的語言模型,使用來自線上醫療諮詢平台上的100,000條醫患對話資料集,在Meta AI的大型語言模型LLaMA上進行微調。該模型以真實世界的患者互動進行微調,明顯提高了對患者需求的理解並提供更準確的醫療建議。ChatDoctor使用來自維基百科等線上資源的即時資訊和精選的離線醫療資料庫,增強其對醫療查詢回應的準確性。該模型的貢獻包括用於微調醫學領域LLMs的方法、公開共享的資料集以及能夠檢索更新醫學知識的自主ChatDoctor模型。更多關於ChatDoctor的資訊請看[論文](https://arxiv.org/pdf/2303.14070.pdf)。
:::
## Retrieval Augmented Generation (RAG)
:::info
Retrieval Augmented Generation (RAG) is an AI framework that enhances the quality of responses generated by LLMs by incorporating up-to-date and contextually relevant information from external sources during the generation process. It addresses the inconsistency and lack of domain-specific knowledge in LLMs, reducing the chances of hallucinations or incorrect responses. RAG involves two phases: retrieval, where relevant information is searched and retrieved, and content generation, where the LLM synthesizes an answer based on the retrieved information and its internal training data. This approach improves accuracy, allows source verification, and reduces the need for continuous model retraining.
:::
:::success
檢索增強生成(RAG)是一種人工智慧框架,透過在生成過程中整合來自外部來源的最新且上下文相關的信息,提高LLMs生成的響應的品質。它解決了LLMs中不一致且缺乏特定領域知識的問題,減少了幻覺或錯誤響應的機會。 RAG涉及兩個階段:檢索(搜尋和檢索相關信息)和內容生成(其中LLMs根據檢索到的信息及其內部訓練資料來合成答案)。這種方法提高了準確性,允許來源驗證,並減少了模型不斷重新訓練的需求。
:::

Image Source: https://www.deeplearning.ai/short-courses/langchain-for-llm-application-development/
:::info
The diagram above outlines the fundamental RAG pipeline, consisting of three key components:
1. **Ingestion:**
- Documents undergo segmentation into chunks, and embeddings are generated from these chunks, subsequently stored in an index.
- Chunks are essential for pinpointing the relevant information in response to a given query, resembling a standard retrieval approach.
2. **Retrieval:**
- Leveraging the index of embeddings, the system retrieves the top-k documents when a query is received, based on the similarity of embeddings.
3. **Synthesis:**
- Examining the chunks as contextual information, the LLM utilizes this knowledge to formulate accurate responses.
:::
:::success
上圖概述了RAG pipeline的基礎,由三個關鍵元件所組成:
1. **擷取:**
- 文件被分割成區塊(chunks),並且從這些區塊生成嵌入,隨後儲存在索引中。
- 區塊(chunks)對於精確指出與回應給定查詢的相關資訊至關重要,類似於標準檢索方法。
2. **檢索:**
- 利用嵌入索引,系統在收到查詢時根據嵌入的相似性檢索前k個文件。
3. **合成:**
- 將這些區塊(chunks)作為上下文信息,LLMs利用這些知識來制訂準確的答案。
:::
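To tie the three stages together, here is a bare-bones sketch of the pipeline using `sentence-transformers` for embeddings and a plain nearest-neighbour search over an in-memory index. The embedding model, the fixed-size chunking rule, and the prompt template are illustrative assumptions; production systems typically add a vector database, smarter chunking, and reranking.

```python
# Sketch: minimal RAG — chunk documents, embed them, retrieve the top-k chunks
# for a query, and assemble a grounded prompt for an LLM. The embedding model,
# chunk size, and prompt template are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1) Ingestion: split documents into chunks and build an embedding index.
documents = ["...long domain document one...", "...long domain document two..."]
chunks = [doc[i:i + 500] for doc in documents for i in range(0, len(doc), 500)]
index = embedder.encode(chunks, normalize_embeddings=True)   # shape: (num_chunks, dim)

# 2) Retrieval: embed the query and take the top-k most similar chunks.
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                        # cosine similarity (vectors are normalized)
    top_k = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_k]

# 3) Synthesis: ground the LLM's answer in the retrieved context.
def build_prompt(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

print(build_prompt("What does the policy say about data retention?"))
# The assembled prompt would then be sent to any chat/completions LLM API.
```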
:::info
💡Unlike previous methods for domain adaptation, it's important to highlight that RAG doesn't necessitate any model training whatsoever. It can be readily applied without the need for training when specific domain data is provided.
:::
:::success
💡與先前的領域適應方法不同,我們要強調的是,RAG並不需要做任何模型的訓練。在提供特定領域資料的情況下,不需要訓練即可輕鬆應用。
:::
:::info
In contrast to earlier approaches for model updates (pre-training and fine-tuning), RAG comes with specific advantages and disadvantages. The decision to employ or refrain from using RAG depends on an evaluation of these factors.
:::
:::success
跟前面的模型更新方法(pre-training和fine-tuning)相比,RAG具有一定的優點和缺點。是否採用RAG,取決於對這些因素的評估。
:::
:::info
| Advantages of RAG | Disadvantages of RAG |
| --- | --- |
| Information Freshness: RAG addresses the static nature of LLMs by providing up-to-date or context-specific data from an external database. | Complex Implementation (Multiple moving parts): Implementing RAG may involve creating a vector database, embedding models, search index etc. The performance of RAG depends on the individual performance of all these components |
| Domain-Specific Knowledge: RAG supplements LLMs with domain-specific knowledge by fetching relevant results from a vector database | Increased Latency: The retrieval step in RAG involves searching through databases, which may introduce latency in generating responses compared to models that don't rely on external sources. |
| Reduced Hallucination and Citations: RAG reduces the likelihood of hallucinations by grounding LLMs with external, verifiable facts and can also cite sources | |
| Cost-Efficiency: RAG is a cost-effective solution, avoiding the need for extensive model training or fine-tuning | |
:::
:::success
| RAG的優點| RAG的缺點|
| ---| ---|
|信息新鮮度:RAG透過提供來自外部資料庫的最新資料或上下文相關的資料來解決LLMs的靜態性質。 |複雜的實作(多個移動部件):實作RAG可能涉及建立一個向量資料庫、嵌入模型、搜尋索引等。RAG的效能取決於所有這些組件的各別效能。|
|特定領域知識:RAG透過從向量資料庫獲取相關結果來補充LLMs的特定領域知識 |延遲增加:RAG中的檢索步驟涉及從資料庫查詢的動作,對比不依賴外部來源的模型,這可能會造成在生成響應的時候引入延遲。 |
|減少幻覺和引用:RAG透過讓LLMs以外部的、可驗證的事實為基礎,並且還可以引用來源的方式來減少幻覺生成的可能性 | |
|成本效益:RAG是一種經濟高效的解決方案,不需要做大量的模型訓練或微調 | |
:::
## Choosing Between RAG, Domain-Specific Fine-Tuning, and Domain-Specific Pre-Training

### **Use Domain-Specific Pre-Training When:**
:::info
- **Exclusive Domain Focus:** Pre-training is suitable when you require a model exclusively trained on data from a specific domain, creating a specialized language model for that domain.
- **Customizing Model Architecture:** It allows you to customize various aspects of the model architecture, size, tokenizer, etc., based on the specific requirements of the domain.
- **Extensive Training Data Available:** Effective pre-training often requires a large amount of domain-specific training data to ensure the model captures the intricacies of the chosen domain.
:::
:::success
- **專注於特定領域:** 當你需要專門針對特定領域的資料進行訓練的模型,為該領域建立專門的語言模型時,預訓練是合適的。
- **客製化模型架構:** 它允許你根據領域的特定需求自訂模型架構、大小、分詞器等的各個方面。
- **擁有大量訓練資料:** 有效的預訓練通常需要大量特定領域的訓練資料,以確保模型能夠捕捉所選定領域的複雜性。
:::
### **Use Domain-Specific Fine-Tuning When:**
:::info
- **Specialization Needed:** Fine-tuning is suitable when you already have a pre-trained LLM, and you want to adapt it for specific tasks or within a particular domain.
- **Task Optimization:** It allows you to adjust the model's parameters related to the task, such as architecture, size, or tokenizer, for optimal performance in the chosen domain.
- **Time and Resource Efficiency:** Fine-tuning saves time and computational resources compared to training a model from scratch since it leverages the knowledge gained during the pre-training phase.
:::
:::success
- **專業需求:** 當你已經擁有預訓練的LLMs,並且你希望使其適應特定任務或在特定領域內時,微調是合適的。
- **任務最佳化:** 它允許你調整與任務相關的模型參數,例如架構、大小或分詞器,以在所選定的領域內獲得最佳效能。
- **時間和資源效率:** 與從頭開始訓練模型相比,微調可以節省時間和計算資源,因為它利用了預訓練階段獲得的知識。
:::
### **Use RAG When:**
:::info
- **Information Freshness Matters:** RAG provides up-to-date, context-specific data from external sources.
- **Reducing Hallucination is Crucial:** Ground LLMs with verifiable facts and citations from an external knowledge base.
- **Cost-Efficiency is a Priority:** Avoid extensive model training or fine-tuning; implement without the need for training.
:::
:::success
- **資訊新鮮度非常重要:** RAG提供來自外部來源的最新、上下文相關的資料。
- **減少幻覺至關重要:** 以可驗證的事實和外部知識庫的引文為LLMs的基礎。
- **成本效益是首要考量:** 避免大量的模型訓練或微調;不需要訓練即可直接應用。
:::