For task-specific fine-tuning, high-quality data matters most, so clean and label the data carefully.
Otherwise it is garbage in, garbage out: feed the model garbage and it will generate garbage.
Tokenization is not necessarily done word by word; it is based on how frequently character sequences occur. For example, the "ing" token is very common in tokenizers because it appears in every present participle, such as "finetuning" and "tokenizing".
There are several popular tokenizers; some are designed for a specific model, while others work across many models.
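A quick way to see the subword splits in practice (a minimal sketch; the model name below is just an illustrative choice, any Hugging Face tokenizer works):
from transformers import AutoTokenizer

# Illustrative example tokenizer, not mandated by the course
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
print(tokenizer.tokenize("finetuning"))   # prints the subword pieces the word is split into
print(tokenizer.tokenize("tokenizing"))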
Why do padding and truncation? Models expect fixed-size inputs, so shorter sequences are padded and longer ones are truncated (see Padding and truncation below).
Main padding and truncation settings used in this lab: padding=True (pad to the longest sequence in the batch, reusing the EOS token as the pad token), truncation=True with a max_length of 2048, and tokenizer.truncation_side = "left" so the end of the text is kept.
The code examples only call high-level APIs; understanding the underlying principles and when to use them matters more. For hands-on practice, see 04_Data_preparation_lab_student.
Tokenizing text
The process of converting text into numbers, where each number represents a piece of the text. For example, the text "hi, how are you?" can be tokenized into a sequence of numbers, each standing for a specific part of the text.
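A minimal sketch of single-text tokenization, reusing the tokenizer loaded in the sketch above:
text = "hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]
print(encoded_text)                      # a list of integer token ids
print(tokenizer.decode(encoded_text))    # decoding maps the ids back to text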
Tokenize multiple texts at once
Multiple texts can be tokenized in a single call. For example, pass the list ["hi, how are you?", "I'm good", "yes"] to the tokenizer at once to get the token ids for each text.
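Batched tokenization is a list-in, list-of-lists-out call (sketch, same tokenizer as above):
list_texts = ["hi, how are you?", "I'm good", "yes"]
encoded_texts = tokenizer(list_texts)["input_ids"]
for ids in encoded_texts:
    print(len(ids), ids)    # each text gets its own id list, with a different length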
Padding and truncation
Because the model expects fixed-size inputs, sequences may need to be padded (lengthened) or truncated (shortened).
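A short sketch of both operations, continuing with list_texts from the sketch above:
# Padding: a pad token must be set first; reusing the EOS token is what the lab code also does
tokenizer.pad_token = tokenizer.eos_token
padded = tokenizer(list_texts, padding=True)["input_ids"]    # every row padded to the longest length in the batch

# Truncation: cap each sequence at max_length tokens (3 here just for illustration)
truncated = tokenizer(list_texts, truncation=True, max_length=3)["input_ids"]

# Truncating from the left keeps the end of the text instead of the beginning
tokenizer.truncation_side = "left"
left_truncated = tokenizer(list_texts, truncation=True, max_length=3)["input_ids"]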
Prepare instruction dataset
Tokenize a single example
# "text" is the prompt string prepared earlier (for instruction data, question + answer concatenated)

# Set a default maximum length for the sequences
max_length = 2048

# Tokenize the input text first
# - return_tensors: specifies the type of tensors to be returned, in this case numpy arrays
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)

# Adjust max_length based on the actual length of the tokenized input,
# so it is not unnecessarily long when the sequence is shorter than 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)

# Re-tokenize with truncation enabled
# - truncation: the tokenized sequence is cut off if it exceeds max_length
# - max_length: specifies the maximum length for the tokenized sequence
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)

# Retrieve the tokenized input IDs from the tokenized inputs
tokenized_inputs["input_ids"]
# array([[4118, 19782, 27, ...]])
Tokenize the instruction dataset
Tokenizing the entire instruction dataset: each example's prompt and response are concatenated, tokenized, and then padded or truncated as needed.
Define a custom tokenize function, tokenize_function:
def tokenize_function(examples):
    # Build one training string per example: prompt + response concatenated
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]

    # Reuse the EOS token as the padding token
    tokenizer.pad_token = tokenizer.eos_token

    # First pass: tokenize with padding to get the actual sequence length
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # Cap the length at 2048 tokens
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )

    # Second pass: truncate from the left so the end of the text is kept
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )
    return tokenized_inputs
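A quick sanity check before mapping over the whole dataset (the sample dict below is made up for illustration and assumes the tokenizer from the earlier sketches is loaded):
sample = {"question": ["What is finetuning?"], "answer": ["Adapting a pretrained model to a task."]}
print(tokenize_function(sample)["input_ids"].shape)    # (1, sequence_length)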
Apply (map) the custom tokenize function over the entire dataset.
The map method iterates over every element of the dataset and calls the specified function on each one.
import datasets
import pandas as pd

# Load the instruction dataset from a local JSON file (filename set earlier)
finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")
pd.DataFrame(finetuning_dataset_loaded)

# Map tokenize_function over every example, one example per batch
tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)
pd.DataFrame(tokenized_dataset)
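The features printed in the split below include 'labels', which suggests a labels column (a copy of input_ids, the usual setup for causal-LM fine-tuning) is added before splitting; a sketch of that step:
# Assumed step, inferred from the 'labels' feature shown in the output below
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])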
Prepare test/train splits: split the dataset into training and test sets.
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
#         num_rows: 1260
#     })
#     test: Dataset({
#         features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
#         num_rows: 140
#     })
# })