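Why this note exists

I originally followed a guide I found online that used Google Colab to convert a model to GGUF, but I got stuck right away: the tokenizer models could not be found, and the error message pointed to llama.cpp#6920. So I opened HackMD and started recording the steps that actually worked. Along the way I tested alternately on Google Colab and an Apple MacBook Pro (2020). This note mainly records the commands that run successfully on the Mac. On Google Colab the main difference is the part of each command in front of the llama.cpp directory, which becomes !python /content/ (for example, python ./llama.cpp/convert_hf_to_gguf.py becomes !python /content/llama.cpp/convert_hf_to_gguf.py).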

Homebrew

On a Mac, I recommend Homebrew for managing third-party packages and their versions.
To install it, run the following command in the terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Prerequisites

Install huggingface-cli

On macOS, install huggingface-cli with Homebrew.
Run the following command directly in the terminal:
brew install huggingface-cli

llama.cpp

Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp.git

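Install the dependencies. If you use Anaconda, remember to create a virtual environment to keep package versions separate; I moved too fast and forgot to create one.

pip install -r ./llama.cpp/requirements.txt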
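
If you are not using Anaconda, the standard-library venv module is one way to get an isolated environment before running pip. This is only a minimal sketch: the function name create_env and the directory name llama-env are my own choices, not part of llama.cpp.

```python
import venv
from pathlib import Path


def create_env(env_dir: Path, with_pip: bool = True) -> Path:
    """Create a virtual environment at env_dir and return its path."""
    # with_pip=True bootstraps pip inside the new environment.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


# Example call (hypothetical directory name):
# create_env(Path("llama-env"))
```

After creating it, activate the environment (source llama-env/bin/activate on macOS) and then run the pip install command above.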

Build llama.cpp

Install CMake with Homebrew.

brew install cmake

Build llama.cpp with CMake. The build produces the llama-quantize binary that is used later for quantization.

cd llama.cpp
cmake -B build
cmake --build build --config Release

Download the model

To download the model from Hugging Face, run the following command in the terminal and enter your access token when prompted to log in.

huggingface-cli login

Download the Llama-3.1-TAIDE-LX-8B-Chat model.

huggingface-cli download taide/Llama-3.1-TAIDE-LX-8B-Chat --local-dir LLM/Llama-3.1-TAIDE-LX-8B-Chat --exclude 'original/**'
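
The same download can also be scripted from Python. The sketch below assumes the huggingface_hub package is installed and that you have already logged in; it uses that package's snapshot_download function, while the helper names build_download_kwargs and download_model are mine, for illustration only.

```python
from typing import Any, Dict

REPO_ID = "taide/Llama-3.1-TAIDE-LX-8B-Chat"


def build_download_kwargs(local_root: str = "LLM") -> Dict[str, Any]:
    """Mirror the CLI flags --local-dir and --exclude 'original/**'."""
    model_name = REPO_ID.split("/")[-1]
    return {
        "repo_id": REPO_ID,
        "local_dir": f"{local_root}/{model_name}",
        "ignore_patterns": ["original/**"],
    }


def download_model(local_root: str = "LLM") -> str:
    """Download the model snapshot and return the local directory path."""
    # Imported lazily so build_download_kwargs works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(local_root))
```

Calling download_model() should leave the files in LLM/Llama-3.1-TAIDE-LX-8B-Chat, the same location as the CLI command.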

Generate the tokenizer models

Open llama.cpp/convert_hf_to_gguf_update.py and add the following line to the models = [] list.


{"name": "llama-taide",        "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/taide/Llama-3.1-TAIDE-LX-8B-Chat", },

Run the modified convert_hf_to_gguf_update.py to generate the tokenizer models. In theory it should also write the result back into get_vocab_base_pre() in convert_hf_to_gguf.py.

python ./llama.cpp/convert_hf_to_gguf_update.py <huggingface_token>

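But things did not go as hoped: the write-back kept failing, so I ended up updating the file by hand.
Open llama.cpp/convert_hf_to_gguf.py and add the following to get_vocab_base_pre().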

# NOTE: if you get an error here, you need to update the convert_hf_to_gguf_update.py script
#       or pull the latest version of the model from Huggingface
#       don't edit the hashes manually!
if chkhsh == "95092e9dc64e2cd0fc7e0305c53a06daf9efd4045ba7413e04d7ca6916cd274b":
    res = "llama-taide"
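
For context on where that hash comes from: as far as I can tell from convert_hf_to_gguf.py, chkhsh is the SHA-256 of the string form of the token-ID list produced by encoding a fixed test text with the model's tokenizer. The sketch below shows only that hashing step. The token IDs are made up and the function name is mine, so it will not reproduce the hash above.

```python
from hashlib import sha256
from typing import List


def pre_tokenizer_checksum(token_ids: List[int]) -> str:
    """Hash the printed form of a token-ID list, e.g. "[1, 2, 3]"."""
    return sha256(str(token_ids).encode()).hexdigest()


# Placeholder IDs standing in for tokenizer.encode(<fixed test text>).
example_ids = [128000, 9906, 1917]
print(pre_tokenizer_checksum(example_ids))
```

Because the hash depends on the exact token IDs, any change in how the tokenizer splits the test text yields a different hash, which would explain why an unregistered model ends up at the error mentioned in the comment.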

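Convert

Use the following command to convert the downloaded Hugging Face model weights to GGUF.
The first argument, Llama-3.1-TAIDE-LX-8B-Chat, is the path to the directory of the model to convert, and --outfile <*.gguf> sets the output file name. Adjust both paths to your own layout: the download step above saved the model under LLM/, and the quantization step below reads the converted file from ./gguf/.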

python ./llama.cpp/convert_hf_to_gguf.py Llama-3.1-TAIDE-LX-8B-Chat --outfile Llama-3.1-TAIDE-LX-8B-Chat.gguf
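
Before moving on, a quick sanity check of the converted file can save time. To my understanding, a GGUF file begins with the four ASCII bytes GGUF followed by a 32-bit version number; the sketch below reads just those eight bytes, assuming the usual little-endian layout. The function names are hypothetical helpers, not part of llama.cpp.

```python
import struct
from pathlib import Path
from typing import Tuple


def read_gguf_header(path: Path) -> Tuple[bytes, int]:
    """Return the 4-byte magic and the little-endian uint32 version."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        raise ValueError(f"{path} is too short to be a GGUF file")
    magic = header[:4]
    (version,) = struct.unpack("<I", header[4:8])
    return magic, version


def looks_like_gguf(path: Path) -> bool:
    """True when the file starts with the ASCII magic b"GGUF"."""
    magic, _version = read_gguf_header(path)
    return magic == b"GGUF"
```

For example, looks_like_gguf(Path("Llama-3.1-TAIDE-LX-8B-Chat.gguf")) should return True for a successful conversion. It only inspects the header, so it does not prove the tensors are intact.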

Quantize

Use the following command to quantize the converted GGUF file to Q4_K_M (4-bit).

./llama.cpp/build/bin/llama-quantize ./gguf/Llama-3.1-TAIDE-LX-8B-Chat.gguf ./gguf/Llama-3.1-TAIDE-LX-8B-Chat-Q4_K_M.gguf Q4_K_M

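Test with Ollama

Use Ollama to compare the number of tokens processed per second before and after quantization on a 4090 GPU.
See my separate Ollama setup tutorial for the basics; the difference here is that the Modelfile must be changed as follows (based on the Llama 3.1 one):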

FROM ./Llama-3.1-TAIDE-LX-8B-Chat-Q4_K_M.gguf
TEMPLATE """
{{- if or .System .Tools }}<|start_header_id|>system<|end_header_id|>
{{- if .System }}

{{ .System }}
{{- end }}
{{- if .Tools }}

Cutting Knowledge Date: December 2023

When you receive a tool call response, use the output to format an answer to the orginal user question.

You are a helpful assistant with tool calling capabilities.
{{- end }}<|eot_id|>
{{- end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if eq .Role "user" }}<|start_header_id|>user<|end_header_id|>
{{- if and $.Tools $last }}

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

{{ range $.Tools }}
{{- . }}
{{ end }}
Question: {{ .Content }}<|eot_id|>
{{- else }}

{{ .Content }}<|eot_id|>
{{- end }}{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

{{ end }}
{{- else if eq .Role "assistant" }}<|start_header_id|>assistant<|end_header_id|>
{{- if .ToolCalls }}
{{ range .ToolCalls }}
{"name": "{{ .Function.Name }}", "parameters": {{ .Function.Arguments }}}{{ end }}
{{- else }}

{{ .Content }}
{{- end }}{{ if not $last }}<|eot_id|>{{ end }}
{{- else if eq .Role "tool" }}<|start_header_id|>ipython<|end_header_id|>

{{ .Content }}<|eot_id|>{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

{{ end }}
{{- end }}
{{- end }}
"""
PARAMETER num_ctx 131072
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

Llama-3.1-TAIDE-LX-8B-Chat

ollama run Llama3.1-TAIDE-LX-8B-Chat:latest --verbose

Llama-3.1-TAIDE-LX-8B-Chat-4bit

ollama run Llama3.1-TAIDE-LX-8B-Chat-4bit:latest --verbose
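
With --verbose, Ollama prints timing statistics after each reply. To compare the two runs, you can paste that output into a small script. The sketch below assumes the generation speed appears on a line of the form "eval rate: <number> tokens/s", which is how it looks in the versions I have seen; the function names are mine.

```python
import re
from typing import Optional

# Matches the generation-speed line but not "prompt eval rate:".
_EVAL_RATE = re.compile(r"^[ \t]*eval rate:\s+([\d.]+)\s+tokens/s", re.MULTILINE)


def eval_rate(verbose_output: str) -> Optional[float]:
    """Extract tokens per second from `ollama run --verbose` output."""
    match = _EVAL_RATE.search(verbose_output)
    return float(match.group(1)) if match else None


def speedup(baseline_rate: float, quantized_rate: float) -> float:
    """Ratio of the quantized model's rate to the original model's rate."""
    return quantized_rate / baseline_rate
```

Feed eval_rate the captured text from each of the two runs, then pass both numbers to speedup; a result above 1.0 means the 4-bit model generated tokens faster.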

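Demo Tool Use with Kuwa OS

Check whether the Tool Use feature supported by Llama 3.1 also works with Llama-3.1-TAIDE-LX-8B-Chat.
Create a Bot from the Kuwa OS store, then, using ./kuwa/GenAI OS/src/bot/llama3_1-tool_use.bot as a reference, copy its contents into the model configuration file.

Test it by asking about the weather.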

References

Homebrew https://brew.sh/zh-tw/
Llama-3.1-TAIDE-LX-8B-Chat https://huggingface.co/taide/Llama-3.1-TAIDE-LX-8B-Chat
Painlessly producing llama.cpp GGUF quantized models on Colab https://ithelp.ithome.com.tw/m/articles/10343062
Converting an LLM obtained from HuggingFace to GGUF format with llama.cpp https://medium.com/playtech/使用llama-cpp將huggingface-取得的llm模型轉為-gguf格式-879c3bd3505c
Converting a HuggingFace model to GGUF and quantizing it with llama.cpp, using INX-TEXT/Bailong-instruct-7B as an example https://medium.com/@NeroHin/將-huggingface-格式模式轉換為-gguf-以inx-text-bailong-instruct-7b-為例-a2cfdd892cbc
Build llama.cpp locally https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md