[Finetuning Large Language Models] Course Notes: Where Finetuning Fits In

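### [Deeplearning.ai GenAI/LLM course notes series](https://learn.deeplearning.ai/)
#### [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
#### [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
- [Why finetune](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
- [Where finetuning fits in](https://hackmd.io/@YungHuiHsu/Bkfyyh7zp)
- [Instruction-tuning](https://hackmd.io/@YungHuiHsu/B18Hg2XMa)
- [Data preparation](https://hackmd.io/@YungHuiHsu/ByR-G24GT)
- [Training process](https://hackmd.io/@YungHuiHsu/rJP6F2Vf6)
- [Evaluation and iteration](https://hackmd.io/@YungHuiHsu/ryfM524Ga)
- [Considerations on getting started now](https://hackmd.io/@YungHuiHsu/r1KGob8fT)

---

# Finetuning Large Language Models
## [Where finetuning fits in](https://learn.deeplearning.ai/finetuning-large-language-models/lesson/3/where-finetuning-fits-in)

### Pretraining

* The model starts out completely blank: it has no knowledge of the world and cannot even form English words.
* It is trained with next-token prediction.
* Pretraining uses a huge corpus of loosely organized text, usually "unlabeled" data scraped from the internet.
* The training method is self-supervised learning.

- After pretraining, the model has learned:
    * Language structure
    * Basic common knowledge

Finetuning comes after pretraining has provided basic language ability. It typically uses labeled data to train the model on the expertise a specific task requires. Pretraining yields general foundational ability; finetuning specializes the model.

### What is "data scraped from the internet"?

* Companies usually do not disclose the sources of the pretraining data behind their large language models.
* An open-source pretraining text dataset exists: "The Pile".
* Pretraining a large language model takes a great deal of compute and time, which makes it very expensive.

Pretraining data is usually a large volume of text scraped from the internet, including websites, encyclopedias, and books. It gives the model its basic grasp of language and is the foundational step in training a large language model.

### Limitations of pretrained base models

![](https://hackmd.io/_uploads/rylTK9QMp.png =400x)

* A pretrained model has only learned the basic structure of language and general common knowledge.
* It still lacks understanding of specialized domains.
* It cannot reliably tell which information is correct and tends to make things up ("hallucination").
* Its answers may be inconsistent, or it may fail to answer the question directly.
* Further training with finetuning is needed before it can produce high-quality, specialized responses.

A pretrained model can understand language but lacks specialized knowledge, so it easily generates incorrect content. Training an expert-level model that can be used in practice still requires finetuning on top of pretraining.

### Finetuning after pretraining

![](https://hackmd.io/_uploads/S1BGs5QfT.png)

* Finetuning is the step that follows pretraining.
* It consists of further training a model that has already been pretrained.
* It can be applied to self-supervised, unlabeled data.
* It can also be applied to "labeled" data that you have curated.
* A notable advantage is that it needs far less data than pretraining.
* Finetuning is a key tool in the machine learning toolbox.
* For generative tasks, finetuning is not well defined:
    * It updates the entire model, not just part of it.
    * The training objective stays the same: next-token prediction.

### What is finetuning doing for you?

![](https://hackmd.io/_uploads/SJ8Do9XGa.png =400x)

- Behavior change:
    * Finetuning helps modify the model's behavior.
    * It trains the model to respond more consistently.
    * It helps the model focus on a specific task, such as moderation.
    * It teases out the model's capabilities, so that it performs better on tasks such as conversation.
- Gaining knowledge:
    * Finetuning lets the model acquire new knowledge.
    * It helps the model learn specific new concepts that the pretrained base model does not have.
    * It can also be used to correct old or outdated information in the model.
- Both:
    * Finetuning is often used to change the model's behavior and help it gain new knowledge at the same time.

### Tasks to finetune

| **Extraction** | **Expansion** |
| --------------------------------------------------- |:----------------------------------------------------:|
| ![](https://hackmd.io/_uploads/rkdPhcXfT.png =150x) | ![](https://hackmd.io/_uploads/BJNuhq7z6.pngg =150x) |

* Text in, text out only:
    * Extraction:
        * Text goes in and **less** text comes out.
        * It can be thought of as "reading". Such tasks include extracting keywords, identifying topics, routing, and agents, which may involve planning, reasoning, self-critique, tool use, and so on.
    * Expansion:
        * Text goes in and **more** text comes out.
        * It can be likened to "writing". Tasks here include chat, writing emails, and writing code.
* Task clarity:
    * Task clarity is a key indicator of whether finetuning will succeed.
    * Clarity means knowing how outputs differ: what is bad, what is good, and what is even better.

### First time finetuning

![](https://hackmd.io/_uploads/rygnEj7fp.png =400x)

1. Identify candidate tasks by prompt-engineering a large language model (LLM).
2. Find the tasks on which the LLM performs roughly "OK".
3. Pick one of those tasks to look at more closely.
4. Collect about 1000 input-output pairs for the chosen task. Make sure their quality is better than the "OK" performance observed from the LLM.
5. Finetune a smaller LLM on the collected data.

### Lab: Finetuning data, compared with pretraining data, and basic preparation

- Inspect the datasets

```python=
# Read and write data in JSON Lines format
import jsonlines
import itertools
import pandas as pd
from pprint import pprint

# HuggingFace datasets library: datasets and metrics for machine learning
import datasets
from datasets import load_dataset

# Load the "c4" dataset (web text from Common Crawl), English subset.
# split="train" selects the training split.
# streaming=True fetches examples on the fly instead of downloading and holding
# the whole dataset in memory, which avoids running out of memory on large datasets.
# Note: on recent versions of the Hugging Face Hub this dataset may need to be
# referenced as "allenai/c4".
pretrained_dataset = load_dataset("c4", "en", split="train", streaming=True)

n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
pd.DataFrame(top_n)
```

![](https://hackmd.io/_uploads/rkksui7Ga.png)

- Contrast with a company finetuning dataset
  The pretraining dataset (unstructured, uncurated web-crawl data from "Common Crawl") is compared with a company-specific finetuning dataset called "Lamini Docs". The latter is more structured and consists of question-answer pairs about the company.

```python=
filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df
```

![](https://hackmd.io/_uploads/rkO6_sXMT.png)

- Various ways of formatting your data
    - A common approach is to concatenate the question and answer directly

```text=!
"What are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/."
```

- Another approach uses an instruction-following prompt template, which gives the model a structured format to recognize and follow
    - The template can include markers such as "###" to signal what kind of content comes next, for example a question
    - `prompt_template_qa`

```python=
# Convert the DataFrame to a dict of columns so examples["question"][0] works
examples = instruction_dataset_df.to_dict()

prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template
```

```text=!
"### Question:\nWhat are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?\n\n### Answer:\nLamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/."
```

- `prompt_template_q`

```python=
# Question-only template: the answer is stored in a separate field
prompt_template_q = """### Question:
{question}

### Answer:"""

num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
    question = examples["question"][i]
    answer = examples["answer"][i]

    # Variant 1: a single "text" field holding the full question + answer
    text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
    finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

    # Variant 2: separate "question" (templated) and "answer" fields
    text_with_prompt_template_q = prompt_template_q.format(question=question)
    finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

pprint(finetuning_dataset_text_only[0])
```

```text=
{'text': '### Question:\n'
         'What are the different types of documents available in the '
         "repository (e.g., installation guide, API documentation, developer's "
         'guide)?\n'
         '\n'
         '### Answer:\n'
         'Lamini has documentation on Getting Started, Authentication, '
         'Question Answer Model, Python Library, Batching, Error Handling, '
         'Advanced topics, and class documentation on LLM Engine available at '
         'https://lamini-ai.github.io/.'}
```

- A common way to store the data:
  The most common way to store this kind of data is a JSON Lines file, usually with the ".jsonl" extension. Each line in the file is one JSON object. The format is efficient and widely accepted in the machine learning community.

```python=
# Write the question/answer records to a JSON Lines file, one object per line
with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)

# Load the already processed dataset published on the HuggingFace Hub
finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)
```

```text=
DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
```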
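
- A standard-library sketch of the formatting and storage steps
  The templating and JSON Lines steps of this lab can also be tried without the course files or the `jsonlines` package. The following is a minimal sketch under stated assumptions: the two sample question-answer pairs, the helper names `build_records`, `write_jsonl`, and `read_jsonl`, and the file name `sample_processed.jsonl` are all hypothetical and chosen only for illustration. It is not the course's code.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for the Lamini Docs question-answer pairs.
examples = {
    "question": ["What is finetuning?", "How many examples do I need?"],
    "answer": [
        "Further training of a model that has already been pretrained.",
        "Roughly 1000 input-output pairs is a reasonable starting point.",
    ],
}

# Question-only prompt template, matching the "###" marker style used in the lab.
PROMPT_TEMPLATE_Q = """### Question:
{question}

### Answer:"""


def build_records(examples):
    """Pair each templated question with its answer."""
    records = []
    for question, answer in zip(examples["question"], examples["answer"]):
        records.append(
            {"question": PROMPT_TEMPLATE_Q.format(question=question), "answer": answer}
        )
    return records


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = build_records(examples)
path = os.path.join(tempfile.mkdtemp(), "sample_processed.jsonl")
write_jsonl(records, path)
round_trip = read_jsonl(path)
print(round_trip[0]["question"])
```

  Because each line is an independent JSON object, the file can be read back one record at a time, which is one reason the format suits large training sets.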