# Quantization Script
## List of models to quantize
```=
# Gated models
SeaLLMs/SeaLLM-7B-Chat
vinai/PhoGPT-7B5-Instruct
meta-llama/Llama-2-70b-chat-hf
meta-llama/Llama-2-7b-chat-hf
# Public models
deepseek-ai/deepseek-llm-7b-chat
mistralai/Mistral-7B-Instruct-v0.1
Weyaxi/OpenHermes-2.5-neural-chat-7b-v3-2-7B
berkeley-nest/Starling-LM-7B-alpha
ise-uiuc/Magicoder-S-DS-6.7B
WizardLM/WizardCoder-15B-V1.0
KoboldAI/LLaMA2-13B-Tiefighter
NeverSleep/Noromaid-13b-v0.1.1
NousResearch/Nous-Capybara-34B
Phind/Phind-CodeLlama-34B-v2
01-ai/Yi-34B-Chat
deepseek-ai/deepseek-coder-33b-instruct
aisingapore/sealion7b-instruct-nc
TigerResearch/tigerbot-70b-chat-v4
# Already fp16
fblgit/una-cybertron-7b-v2-fp16
```
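The models under `# Gated models` require accepting their license on the Hugging Face Hub and an access token before download. A minimal sketch of authenticating first (assumes the `HF_TOKEN` variable defined in the quantization script below):
```python=
# Minimal sketch: log in once so snapshot_download can fetch the gated repos.
# HF_TOKEN is the variable defined in the quantization script below.
from huggingface_hub import login

login(token=HF_TOKEN)  # needed for the gated models, harmless for the public ones
```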
## Quantize
```python=
# Variables
USER_NAME = ""
HF_TOKEN = ""
MODEL_ID = "argilla/notus-7b-v1"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]
# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt
# Download model
!pip install huggingface_hub
from huggingface_hub import snapshot_download
snapshot_download(repo_id=MODEL_ID,
                  local_dir=MODEL_NAME,
                  local_dir_use_symlinks=False,
                  revision="main")
# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}
# Quantize the model for each method in the QUANTIZATION_METHODS list
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}
```
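The build flag, conversion script, and quantize binary above match older llama.cpp checkouts. Recent versions renamed them; a hedged equivalent (exact names depend on your checkout):
```python=
# Hedged sketch for newer llama.cpp checkouts (verify names against your version):
# CMake with GGML_CUDA replaces `LLAMA_CUBLAS=1 make`, convert_hf_to_gguf.py
# replaces convert.py, and the binary is built as llama-quantize.
!cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
!python llama.cpp/convert_hf_to_gguf.py {MODEL_NAME} --outtype f16 --outfile {fp16}
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/build/bin/llama-quantize {fp16} {qtype} {method}
```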
## Test inference
```python=
import os
model_list = [file for file in os.listdir(MODEL_NAME) if "gguf" in file]
prompt = input("Enter your prompt: ")
chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")
# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    qtype = f"{MODEL_NAME}/{chosen_method}"  # chosen_method is already a .gguf filename
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"
```
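Instead of shelling out to the `main` binary, the same GGUF file can be loaded through the `llama-cpp-python` bindings; a minimal sketch (reusing `qtype` and `prompt` from above, with the same 35 offloaded layers and 128-token budget):
```python=
# Hedged alternative: run the chosen GGUF file via the llama-cpp-python bindings.
!pip install -q llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path=qtype, n_gpu_layers=35)  # same offload as -ngl 35
output = llm(prompt, max_tokens=128)            # same budget as -n 128
print(output["choices"][0]["text"])
```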
## Push to hub
```python=
!pip install -q huggingface_hub
from huggingface_hub import create_repo, HfApi
# HF_TOKEN and USER_NAME are set in the Variables section above
hf_token = HF_TOKEN
api = HfApi()
username = USER_NAME
# Create empty repo
create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
    token=hf_token
)
# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns="*.gguf",
    token=hf_token
)
```
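Optionally, a short model card makes the GGUF repo easier to find. A hedged sketch using `upload_file` (the card text is illustrative, not part of the original script):
```python=
# Hedged sketch: attach a minimal README.md to the new GGUF repo.
card = (
    f"# {MODEL_NAME}-GGUF\n\n"
    f"GGUF quantizations of [{MODEL_ID}](https://huggingface.co/{MODEL_ID}): "
    + ", ".join(QUANTIZATION_METHODS)
)
api.upload_file(
    path_or_fileobj=card.encode(),
    path_in_repo="README.md",
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    token=hf_token
)
```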
# Merge Script
We can use [**mergekit**](https://github.com/cg123/mergekit) for merging models.
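The configs below are consumed by the `mergekit-yaml` CLI. A minimal sketch of installing mergekit and running a config saved as `config.yaml` (flags are assumptions; check `mergekit-yaml --help` for your version):
```python=
# Hedged sketch: install mergekit and run one of the YAML configs below,
# saved locally as config.yaml; "merged-model" is a placeholder output dir.
!git clone https://github.com/cg123/mergekit && pip install -q -e mergekit
!mergekit-yaml config.yaml merged-model --copy-tokenizer --cuda --lazy-unpickle
```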
## Normal merge
### TIES
```yaml=
models:
  - model: TheBloke/Llama-2-13B-fp16
    # no parameters necessary for base model
  - model: psmathur/orca_mini_v3_13b
    parameters:
      density: [1, 0.7, 0.1] # density gradient
      weight: 1.0
  - model: garage-bAInd/Platypus2-13B
    parameters:
      density: 0.5
      weight: [0, 0.3, 0.7, 1] # weight gradient
  - model: WizardLM/WizardMath-13B-V1.0
    parameters:
      density: 0.33
      weight:
        - filter: mlp
          value: 0.5
        - value: 0
merge_method: ties
base_model: TheBloke/Llama-2-13B-fp16
parameters:
  normalize: true
  int8_mask: true
dtype: float16
```
### SLERP
```yaml=
slices:
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
      - model: Intel/neural-chat-7b-v3-3
        layer_range: [0, 32]
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5 # fallback for rest of tensors
dtype: bfloat16
```
## DARE TIES merge
```yaml=
models:
  - model: Noromaid-20b-v0.1.1
    parameters:
      weight: 0.38
      density: 0.60
  - model: athirdpath/Eileithyia-20b
    parameters:
      weight: 0.22
      density: 0.40
  - model: athirdpath/CleverGirl-20b-Blended-v1.1-DARE
    parameters:
      weight: 0.40
      density: 0.33
merge_method: dare_ties
base_model: athirdpath/BigLlama-20b-v1.1
parameters:
  int8_mask: true
dtype: bfloat16
```
A gradient lets a parameter vary per layer: passing a list of values (e.g. `density: [1, 0.7, 0.1]`) makes mergekit interpolate between them across the model's layers.
More references:
- https://github.com/cg123/mergekit/issues/5
- HumanEval+: https://wandb.ai/byyoung3/ml-news/reports/Testing-Mistral-7B-vs-Zephyr-7B-on-HumanEval-Which-Model-Writes-Better-Code---Vmlldzo1ODgwMTE2