# Model Fine-Tuning and Optimization
![image](https://hackmd.io/_uploads/HywJJxQ9kl.png)
Unlike RAG, fine-tuning modifies the model itself, but it does not train a brand-new model from scratch: it starts from a pretrained model. By repeatedly training that model on the old data plus the new data, we obtain the fine-tuning effect.
- Difference between RAG and fine-tuning: the figure on the right shows fine-tuning, the one on the left shows RAG.
![image](https://hackmd.io/_uploads/BJ_1xZXqJx.png)
==When to use which: fine-tuning suits knowledge the model has never learned at all, while RAG suits correcting the parts the model has learned incorrectly.==
※ Not every model can be fine-tuned.

---

# Performing Fine-Tuning (OpenAI API)
Step01: Create the dataset (file) to fine-tune on
```python=
from openai import OpenAI

client = OpenAI()  # create the client (not in the original notes); assumes OPENAI_API_KEY is set

client.files.create(
    file=open("fine-tune-example.jsonl", "rb"),
    purpose="fine-tune"
)
```
Example: fine-tune-example.jsonl (you can ask GPT to generate the questions)
![image](https://hackmd.io/_uploads/HJXZnbmckl.png)

Step02: Create the fine-tuning job for the model
```python=
client.fine_tuning.jobs.create(
    training_file="file-zwwuLmddLYyWKBZNGqvEec4i",
    model="gpt-3.5-turbo"
)
```
Step2.5: Check the models you can use (including fine-tuned ones)
You can also check the result in the dashboard on the [OpenAI platform](https://platform.openai.com/docs/overview).
![image](https://hackmd.io/_uploads/r1TdiWm9yl.png)
```python=
for model in client.models.list():  # list the OpenAI models available to you
    print(model)
```
Step03: Use the fine-tuned model
```python=
completion = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:personal::A2YKELtS",
    messages=[
        {"role": "user", "content": "請問台灣立法院的院長是誰?"}
    ]
)
print(completion.choices[0].message.content)
```
:::warning
**A fine-tuned model is billed at a higher rate than the base model.**
:::

---

## Fine-Tuning a Local Model with Unsloth
Unsloth provides a more efficient way to fine-tune; in the steps below it combines 4-bit quantization with LoRA adapters to keep memory usage low.

Step01: Install the required packages
```python=
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
```
Step02: Check the models that can be fine-tuned
```python=
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",            # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",          # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",           # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",                 # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",                  # Gemma 2x faster!
    "unsloth/Llama-3.2-1B-bnb-4bit",                 # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit"        # NEW! Llama 3.3 70B!
]  # More models at https://huggingface.co/unsloth
```
Step03: Choose the model and set its parameters
```python=
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",  # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# parameter adjustment: attach LoRA adapters to the base model
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,  # Supports any, but = 0 is optimized
    bias = "none",     # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
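    # Added note: LoRA fine-tuning keeps the base weights frozen and trains only small
    # low-rank adapter matrices on the target projection modules listed above; r sets
    # the adapter rank (its capacity) and lora_alpha scales how strongly the adapters act.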
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)
```
Step04: Load the new data to train on
Example:
![image](https://hackmd.io/_uploads/B1_sMf7ckg.png)
```python=
from datasets import load_dataset

# dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
dataset = load_dataset("./training_data", split = "train")
print(dataset)
```
Step05: Reshape the training data into a format the model understands (merge each row into the chat-template format)
```python=
instruction = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
"""

def format_chat_template(row):
    row_json = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": row["instruction"]},
        {"role": "assistant", "content": row["output"]}
    ]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc = 4,
)
print(dataset[5]["text"])
```
Step06: Configure the trainer: training arguments, how much data to use, how tokens are batched, and so on
```python=
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False,  # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1,  # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",  # Use this for WandB etc
    ),
)

# data format: compute the loss only on the assistant responses, not on the user prompts
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```
Step07: Decode a sample to check the tokenization and the label masking
```python=
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
```
Step7.5: Check whether there is enough GPU memory
```python=
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
```
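Step7.5 only records a baseline. To see how much VRAM the fine-tune itself consumed, the same counters can be read again once training (Step08 below) has finished. A minimal sketch, not from the original notes, reusing `start_gpu_memory` and `max_memory` captured above:
```python=
import torch

# Run this AFTER trainer.train() (Step08) has completed.
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_for_training = round(used_memory - start_gpu_memory, 3)  # start_gpu_memory from Step7.5
print(f"Peak reserved memory = {used_memory} GB ({round(used_memory / max_memory * 100, 3)} % of GPU).")
print(f"Peak reserved memory used for training = {used_for_training} GB.")
```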
Step08: Run the training (fine-tuning)
```python=
trainer_stats = trainer.train()
```
Step09: Use the new model to answer questions
```python=
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "請問立法院長是誰?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,  # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)
```
or, with streamed output:
```python=
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "請問立法院長是誰?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,  # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
```
Step10: Save the model
```python=
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model",  # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",  # Get a token at https://huggingface.co/settings/tokens
    )
```

---

# Self-Hosting and Deployment with Ollama
https://github.com/ollama/ollama
Ollama lets us run open-source models, or our own models, on the local machine.

Step01: Install the required package
```python=
!curl -fsSL https://ollama.com/install.sh | sh
```
Step02: Run Ollama in the background and pull an open-source model
```python=
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3)  # Wait for a few seconds for Ollama to load!

# pull an open-source model
!ollama pull llama3
```
Step03: List the installed models
```python=
!ollama list
```
Example output:
```
NAME                   ID              SIZE      MODIFIED
llama3:latest          365c0bd3c000    4.7 GB    48 seconds ago
unsloth_model:latest   55280486e043    3.4 GB    31 minutes ago
```
Step04: Call the model
```python=
!curl http://localhost:11434/api/chat -d '{ \
  "model": "llama3", \
  "messages": [ \
    { "role": "user", "content": "請問立法院長是誰?" } \
  ] \
}'
```
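The same endpoint can also be called from Python instead of curl. A minimal sketch, not from the original notes, using the `requests` package against Ollama's default port 11434; `"stream": False` asks Ollama to return the whole answer as a single JSON object rather than a stream of chunks:
```python=
import requests

# Send one chat turn to the locally served llama3 model.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "請問立法院長是誰?"}],
        "stream": False,  # return one JSON reply instead of streamed chunks
    },
)
print(response.json()["message"]["content"])
```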
Step05: Call the model through LangChain (more intuitive to use)
```python=
!pip install langchain
!pip install langchain-core
!pip install langchain-ollama

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

template = "{question}"
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3")
chain = prompt | model
chain.invoke({"question": "請問立法院長是誰?"})
```
Step06: Load the model we just fine-tuned
```python=
print(tokenizer._ollama_modelfile)

# package the model for Ollama
!ollama create unsloth_model -f ./model/Modelfile

# query the packaged model
!curl http://localhost:11434/api/chat -d '{ \
  "model": "unsloth_model", \
  "messages": [ \
    { "role": "user", "content": "請問立法院長是誰?" } \
  ] \
}'
```
- Loading it through LangChain:
```python=
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

template = "{question}"
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="unsloth_model")
chain = prompt | model
chain.invoke({"question": "請問立法院長是誰?"})
```
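If you want the local model to keep following the same system instruction that the training data was formatted with in Step05, the LangChain prompt can carry a system message as well. A minimal sketch, not from the original notes, assuming the `unsloth_model` created above is being served by Ollama:
```python=
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

# Reuse the system instruction from Step05 so inference matches the fine-tuning format.
system_instruction = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request."
)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_instruction),
    ("human", "{question}"),
])
model = OllamaLLM(model="unsloth_model")
chain = prompt | model
print(chain.invoke({"question": "請問立法院長是誰?"}))
```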