# Quantization Script

## List of models to quantize

```=
# Gated models
SeaLLMs/SeaLLM-7B-Chat
vinai/PhoGPT-7B5-Instruct
meta-llama/Llama-2-70b-chat-hf
meta-llama/Llama-2-7b-chat-hf

# Public models
deepseek-ai/deepseek-llm-7b-chat
mistralai/Mistral-7B-Instruct-v0.1
Weyaxi/OpenHermes-2.5-neural-chat-7b-v3-2-7B
berkeley-nest/Starling-LM-7B-alpha
ise-uiuc/Magicoder-S-DS-6.7B
WizardLM/WizardCoder-15B-V1.0
KoboldAI/LLaMA2-13B-Tiefighter
NeverSleep/Noromaid-13b-v0.1.1
NousResearch/Nous-Capybara-34B
Phind/Phind-CodeLlama-34B-v2
01-ai/Yi-34B-Chat
deepseek-ai/deepseek-coder-33b-instruct
aisingapore/sealion7b-instruct-nc
TigerResearch/tigerbot-70b-chat-v4

# Already fp16
fblgit/una-cybertron-7b-v2-fp16
```

## Quantize

```python=
# Variables
USER_NAME = ""
HF_TOKEN = ""
MODEL_ID = "argilla/notus-7b-v1"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]

# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

# Download model
!pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(repo_id=MODEL_ID, local_dir=MODEL_NAME,
                  local_dir_use_symlinks=False, revision="main")

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Quantize the model with each method in the QUANTIZATION_METHODS list
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}
```

## Test inference

```python=
import os

# List all GGUF files produced by the quantization step
model_list = [file for file in os.listdir(MODEL_NAME) if "gguf" in file]

prompt = input("Enter your prompt: ")
chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen file is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    qtype = f"{MODEL_NAME}/{chosen_method}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"
```

## Push to hub

```python=
!pip install -q huggingface_hub
from huggingface_hub import create_repo, HfApi

# Defined in the secrets tab in Google Colab
hf_token = HF_TOKEN
api = HfApi()
username = USER_NAME

# Create empty repo
create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
    token=hf_token
)

# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns="*.gguf",
    token=hf_token
)
```

# Merge Script

We can use [**mergekit**](https://github.com/cg123/mergekit) for merging models.
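Below is a minimal sketch of installing mergekit and running one of the configs in this section. It assumes the chosen config has been saved as `config.yaml` and uses the `mergekit-yaml` CLI; the exact flags may differ between versions.

```python=
# Minimal sketch (flags not verified against a specific mergekit version):
# install mergekit and run a merge from a YAML config.
# Assumes one of the configs below has been saved as config.yaml.
!git clone https://github.com/cg123/mergekit
!cd mergekit && pip install -q -e .

# Write the merged model to ./merged; --copy-tokenizer copies the base
# model's tokenizer into the output directory.
!mergekit-yaml config.yaml merged --copy-tokenizer
```

The resulting `merged` directory can then be converted and quantized with the llama.cpp steps above.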
## Normal merge

### TIES

```yaml=
models:
  - model: TheBloke/Llama-2-13B-fp16
    # no parameters necessary for base model
  - model: psmathur/orca_mini_v3_13b
    parameters:
      density: [1, 0.7, 0.1] # density gradient
      weight: 1.0
  - model: garage-bAInd/Platypus2-13B
    parameters:
      density: 0.5
      weight: [0, 0.3, 0.7, 1] # weight gradient
  - model: WizardLM/WizardMath-13B-V1.0
    parameters:
      density: 0.33
      weight:
        - filter: mlp
          value: 0.5
        - value: 0
merge_method: ties
base_model: TheBloke/Llama-2-13B-fp16
parameters:
  normalize: true
  int8_mask: true
dtype: float16
```

### SLERP

```yaml=
slices:
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
      - model: Intel/neural-chat-7b-v3-3
        layer_range: [0, 32]
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5 # fallback for rest of tensors
dtype: bfloat16
```

## DARE TIES merge

```yaml=
merge_method: dare_ties
base_model: athirdpath/BigLlama-20b-v1.1
models:
  - model: Noromaid-20b-v0.1.1
    parameters:
      weight: 0.38
      density: 0.60
  - model: athirdpath/Eileithyia-20b
    parameters:
      weight: 0.22
      density: 0.40
  - model: athirdpath/CleverGirl-20b-Blended-v1.1-DARE
    parameters:
      weight: 0.40
      density: 0.33
parameters:
  int8_mask: true
dtype: bfloat16
```

`gradient` lets you set a value for a parameter that varies per layer (a toy interpolation sketch is at the end of this section).

More reference: https://github.com/cg123/mergekit/issues/5

HumanEval+: https://wandb.ai/byyoung3/ml-news/reports/Testing-Mistral-7B-vs-Zephyr-7B-on-HumanEval-Which-Model-Writes-Better-Code---Vmlldzo1ODgwMTE2
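As a rough illustration of the `gradient` idea only: a short list of anchor values (e.g. `density: [1, 0.7, 0.1]` in the TIES config above) is stretched across the model's layers so each layer gets its own value. Linear interpolation between evenly spaced anchors is an assumption here, not mergekit's verified implementation; `gradient_values` is a hypothetical helper.

```python=
# Toy sketch of a per-layer "gradient": stretch a short list of anchor
# values across num_layers layers via linear interpolation.
import numpy as np

def gradient_values(anchors, num_layers):
    # Anchor positions spread evenly from the first to the last layer
    anchor_positions = np.linspace(0, num_layers - 1, num=len(anchors))
    layer_positions = np.arange(num_layers)
    return np.interp(layer_positions, anchor_positions, anchors)

# density gradient [1, 0.7, 0.1] applied to a 40-layer model
print(gradient_values([1, 0.7, 0.1], 40).round(2))
```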