# Model Fine-Tuning and Optimization

Unlike RAG, fine-tuning modifies the model itself, but it does not train a brand-new model from scratch (it starts from a pretrained model).
The fine-tuning effect is achieved by repeatedly training the model on the old data plus the new data.
- Difference between RAG and fine-tuning: the figure on the right shows fine-tuning, the one on the left shows RAG.

==Suitable scenarios: fine-tuning suits cases the model has never learned at all, while RAG suits correcting the parts the model has learned incorrectly.==
※ Not every model can be fine-tuned.
---
# Performing Fine-Tuning
Step01: Create (upload) the dataset file for fine-tuning
```python=
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

client.files.create(
    file=open("fine-tune-example.jsonl", "rb"),
    purpose="fine-tune"
)
```
Example: fine-tune-example.jsonl (you can ask GPT to generate the questions)
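A minimal sketch of what each line of `fine-tune-example.jsonl` could look like for a chat model: every line is one JSON object with a `messages` array. The question below reuses the one from the later example, and the answer is only a placeholder you would fill in:
```python=
import json

# Illustrative records only; a real dataset needs many more examples
# covering the facts you want the model to learn.
examples = [
    {"messages": [
        {"role": "user", "content": "請問台灣立法院的院長是誰?"},
        {"role": "assistant", "content": "(the answer you want the model to learn)"},
    ]},
]

with open("fine-tune-example.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```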

Step02: Create the fine-tuning job (the model to fine-tune)
```python=
client.fine_tuning.jobs.create(
    training_file="file-zwwuLmddLYyWKBZNGqvEec4i",
    model="gpt-3.5-turbo"
)
```
Step2.5: Check the models available to you (including fine-tuned ones)
You can check the result in the [dashboard](https://platform.openai.com/docs/overview).

```python=
for model in client.models.list():  # list the OpenAI models you can use
    print(model)
```
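Besides the dashboard, the job status can also be polled through the API. A short sketch (the job ID below is hypothetical; use the ID returned by `client.fine_tuning.jobs.create`):
```python=
# List recent fine-tuning jobs and their status.
for job in client.fine_tuning.jobs.list(limit=5):
    print(job.id, job.status, job.fine_tuned_model)

# Or poll one job directly (hypothetical job ID).
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
print(job.status)  # e.g. "running" or "succeeded"
```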
Step03: Use the fine-tuned model
```python=
completion = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:personal::A2YKELtS",
    messages=[
        {"role": "user", "content": "請問台灣立法院的院長是誰?"}
    ]
)
print(completion.choices[0].message.content)
```
:::warning
**A fine-tuned model is billed at a higher rate than the original base model**
:::
---
## Fine-Tuning a Local Model with Unsloth
Unsloth provides a more efficient way to fine-tune.
Step01: Install the required packages
```python=
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
```
Step02: Check the models that can be fine-tuned
```python=
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",          # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",        # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",         # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",               # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",                # Gemma 2x faster!
    "unsloth/Llama-3.2-1B-bnb-4bit",               # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",     # NEW! Llama 3.3 70B!
]  # More models at https://huggingface.co/unsloth
```
Step03: Set the model and the parameters to use
```python=
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Parameter adjustment (LoRA/PEFT)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
```
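To sanity-check how small the LoRA update actually is, you can usually ask the PEFT wrapper returned by `get_peft_model` for its trainable-parameter summary (this assumes the Unsloth model keeps the standard PEFT interface; a quick sketch):
```python=
# If the model follows the standard PEFT interface, this prints the number of
# trainable (LoRA) parameters versus the total parameter count.
model.print_trainable_parameters()
```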
Step04: Load the new training data
Example: a sketch of the expected record format follows the code block below.

```python=
from datasets import load_dataset
# dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
dataset = load_dataset("./training_data", split = "train")
print(dataset)
```
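A minimal sketch of what the files under `./training_data` might contain, assuming JSON Lines records with the `instruction` and `output` fields that Step05 reads (the content and file name are purely illustrative; `load_dataset` can infer the format from a folder of JSON/JSONL files):
```python=
import json, os

# Hypothetical records; each pairs an instruction with the desired answer,
# matching the row["instruction"] / row["output"] fields used in Step05.
records = [
    {"instruction": "請問立法院長是誰?", "output": "(the answer you want the model to learn)"},
]

os.makedirs("training_data", exist_ok=True)
with open("training_data/train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```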
Step05: Reshape the training data into the chat format the model understands
```python=
instruction = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
"""

def format_chat_template(row):
    row_json = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": row["instruction"]},
        {"role": "assistant", "content": row["output"]}
    ]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc = 4,
)
dataset
print(dataset[5]["text"])
```
Step06: Configure the training settings: how much data to use, sequence length, how the tokens are handled, etc.
```python=
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

# Data format: compute the loss only on the assistant responses, not on the prompt.
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```
Step07: Decode the tokenized data to verify it
```python=
# Decode one training example to see the full chat-formatted text.
tokenizer.decode(trainer.train_dataset[5]["input_ids"])
# Decode the labels with masked (-100) positions shown as spaces: only the assistant response should remain.
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
```
Step7.5: Check whether the GPU memory is large enough
```python=
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
```
Step08: Run the training (fine-tuning)
```python=
trainer_stats = trainer.train()
```
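`trainer.train()` returns a `TrainOutput` whose `metrics` dict includes the runtime; combined with the memory stats from Step7.5, a short sketch of how you could report what the run cost (reuses the `max_memory` variable from Step7.5):
```python=
# Report how long the run took and the peak GPU memory it reserved.
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"Peak reserved memory = {used_memory} GB (of {max_memory} GB).")
```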
Step09: Use the new model to answer questions
```python=
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "請問立法院長是誰?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)
```
Or, streaming the output as it is generated:
```python=
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "請問立法院長是誰?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
```
Step10: Save the model
```python=
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )
```
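If you only need the lightweight LoRA adapters rather than a merged GGUF, the standard Hugging Face save methods also work on the PEFT model (a sketch; `lora_model` is just an illustrative directory name):
```python=
# Saves only the LoRA adapter weights plus the tokenizer,
# which is much smaller than a fully merged model.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```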
---
# Self-Hosting and Deployment with Ollama
https://github.com/ollama/ollama
Ollama lets us run open-source models, or our own models, locally.
Step01: Install the required package
```python=
!curl -fsSL https://ollama.com/install.sh | sh
```
Step02: Run Ollama in the background and pull an open-source model
```python=
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3) # Wait for a few seconds for Ollama to load!
# pull an open-source model
!ollama pull llama3
```
Step03: List the installed models
```python=
!ollama list
```
Sample output:
```python=
NAME                   ID              SIZE      MODIFIED
llama3:latest          365c0bd3c000    4.7 GB    48 seconds ago
unsloth_model:latest   55280486e043    3.4 GB    31 minutes ago
```
Step04: Call the model
```python=
!curl http://localhost:11434/api/chat -d '{ \
    "model": "llama3", \
    "messages": [ \
        { "role": "user", "content": "請問立法院長是誰?" } \
    ] \
}'
```
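The same endpoint can also be called from Python instead of curl. A sketch using `requests`, assuming Ollama's default port 11434 and with streaming disabled so the reply comes back as one JSON object:
```python=
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "請問立法院長是誰?"}],
        "stream": False,  # return a single JSON reply instead of streamed chunks
    },
)
print(resp.json()["message"]["content"])
```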
Step05: Call the model through LangChain (more intuitive to use)
```python=
!pip install langchain
!pip install langchain-core
!pip install langchain-ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
template = "{question}"
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3")
chain = prompt | model
chain.invoke({"question": "請問立法院長是誰?"})
```
Step06: Load the model you just fine-tuned
```python=
print(tokenizer._ollama_modelfile)
# Package the fine-tuned model for Ollama
!ollama create unsloth_model -f ./model/Modelfile
# Call it
!curl http://localhost:11434/api/chat -d '{ \
    "model": "unsloth_model", \
    "messages": [ \
        { "role": "user", "content": "請問立法院長是誰?" } \
    ] \
}'
```
- Loading it through LangChain
```python=
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
template = "{question}"
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="unsloth_model")
chain = prompt | model
chain.invoke({"question": "請問立法院長是誰?"})
```