For task-specific fine-tuning, high-quality data matters most, so clean and label the data carefully.
Otherwise it is garbage in, garbage out: feed the model garbage and it will generate garbage.
Tokenization is not necessarily done word by word; it is based on how frequently character sequences occur. For example, the "ing" token is very common in tokenizers because it appears in every present participle, such as "finetuning" and "tokenizing".
There are several popular tokenizers; some are designed for a specific model, while others work across many models.
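A quick way to see the subword splits in practice (a minimal sketch; the model name below is just an illustrative choice, any Hugging Face tokenizer works):
from transformers import AutoTokenizer

# Illustrative example tokenizer, not mandated by the course
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
print(tokenizer.tokenize("finetuning"))   # prints the subword pieces the word is split into
print(tokenizer.tokenize("tokenizing"))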
Why do padding and truncation? Models expect fixed-size inputs, so shorter sequences are padded and longer ones are truncated (see Padding and truncation below).
Main padding and truncation settings used in this lab: padding=True (pad to the longest sequence in the batch, reusing the EOS token as the pad token), truncation=True with a max_length of 2048, and tokenizer.truncation_side = "left" so the end of the text is kept.
The code examples only call high-level APIs; understanding the underlying principles and when to use them matters more. For hands-on practice, see 04_Data_preparation_lab_student.
Tokenizing text
The process of converting text into numbers, where each number represents a piece of the text. For example, the text "hi, how are you?" can be tokenized into a sequence of numbers, each standing for a specific part of the text.
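A minimal sketch of single-text tokenization, reusing the tokenizer loaded in the sketch above:
text = "hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]
print(encoded_text)                      # a list of integer token ids
print(tokenizer.decode(encoded_text))    # decoding maps the ids back to text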
Tokenize multiple texts at once
Multiple texts can be tokenized in a single call. For example, pass the list ["hi, how are you?", "I'm good", "yes"] to the tokenizer at once to get the token ids for each text.
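Batched tokenization is a list-in, list-of-lists-out call (sketch, same tokenizer as above):
list_texts = ["hi, how are you?", "I'm good", "yes"]
encoded_texts = tokenizer(list_texts)["input_ids"]
for ids in encoded_texts:
    print(len(ids), ids)    # each text gets its own id list, with a different length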
Padding and truncation
Because the model expects fixed-size inputs, sequences may need to be padded (lengthened) or truncated (shortened).
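A short sketch of both operations, continuing with list_texts from the sketch above:
# Padding: a pad token must be set first; reusing the EOS token is what the lab code also does
tokenizer.pad_token = tokenizer.eos_token
padded = tokenizer(list_texts, padding=True)["input_ids"]    # every row padded to the longest length in the batch

# Truncation: cap each sequence at max_length tokens (3 here just for illustration)
truncated = tokenizer(list_texts, truncation=True, max_length=3)["input_ids"]

# Truncating from the left keeps the end of the text instead of the beginning
tokenizer.truncation_side = "left"
left_truncated = tokenizer(list_texts, truncation=True, max_length=3)["input_ids"]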
Prepare instruction dataset
Tokenize a single example
# "text" is the prompt string prepared earlier (for instruction data, question + answer concatenated)

# Set a default maximum length for the sequences
max_length = 2048

# Tokenize the input text first
# - return_tensors: specifies the type of tensors to be returned, in this case numpy arrays
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)

# Adjust max_length based on the actual length of the tokenized input,
# so it is not unnecessarily long when the sequence is shorter than 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)

# Re-tokenize with truncation enabled
# - truncation: the tokenized sequence is cut off if it exceeds max_length
# - max_length: specifies the maximum length for the tokenized sequence
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)

# Retrieve the tokenized input IDs from the tokenized inputs
tokenized_inputs["input_ids"]
# array([[4118, 19782, 27, ...]])
Tokenize the instruction dataset
Tokenizing the entire instruction dataset: each example's prompt and response are concatenated, tokenized, and then padded or truncated as needed.
Define a custom tokenize function, tokenize_function:
def tokenize_function(examples):
    # Build one training string per example: prompt + response concatenated
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]

    # Reuse the EOS token as the padding token
    tokenizer.pad_token = tokenizer.eos_token

    # First pass: tokenize with padding to get the actual sequence length
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # Cap the length at 2048 tokens
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )

    # Second pass: truncate from the left so the end of the text is kept
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )
    return tokenized_inputs
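A quick sanity check before mapping over the whole dataset (the sample dict below is made up for illustration and assumes the tokenizer from the earlier sketches is loaded):
sample = {"question": ["What is finetuning?"], "answer": ["Adapting a pretrained model to a task."]}
print(tokenize_function(sample)["input_ids"].shape)    # (1, sequence_length)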
Apply (map) the custom tokenize function over the entire dataset.
The map method iterates over every element of the dataset and calls the specified function on each one.
import datasets
import pandas as pd

# Load the instruction dataset from a local JSON file (filename set earlier)
finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")
pd.DataFrame(finetuning_dataset_loaded)

# Map tokenize_function over every example, one example per batch
tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)
pd.DataFrame(tokenized_dataset)
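The features printed in the split below include 'labels', which suggests a labels column (a copy of input_ids, the usual setup for causal-LM fine-tuning) is added before splitting; a sketch of that step:
# Assumed step, inferred from the 'labels' feature shown in the output below
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])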
Prepare test/train splits: split the dataset into training and test sets.
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
#         num_rows: 1260
#     })
#     test: Dataset({
#         features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
#         num_rows: 140
#     })
# })