llama.cpp build on WSL2

# 1. Environment setup

Follow the SOP:

> https://hackmd.io/t1VhRKGKTSiG70EZHezq-Q

# 2. Official script: windows-install-llama-cpp

`curl -L "https://replicate.fyi/windows-install-llama-cpp" | bash`

**Script.sh**
```
#!/bin/bash

# Clone repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630

# Build
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
# Return to the llama.cpp root so the models/ and build/ paths below resolve
cd ..

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
    curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi

# Set prompt (defined here but not passed to main below)
PROMPT="Hello! How are you?"

# Run in interactive mode
./build/bin/main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  --n-gpu-layers 15000 -t 8
```

## 2-1. Clone repo

**The latest llama.cpp is no longer compatible with GGML models**, so check out an older commit:

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630
```

**Issue: Unable to load the model**

Commits tried, with the page each one came from:

> https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/discussions/6

`git checkout dadbed99e65252d79f81101a392d0d6497b86caa`
**Result: failed**

> https://github.com/ggerganov/llama.cpp/issues/1408

`git checkout cf348a6`
**Result: failed**

> https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML

`git checkout e76d630`
**Result: worked**

## 2-2. Build

```
mkdir build && cd build
cmake ..
cmake --build . --config Release
cd ../..
```

![image](https://hackmd.io/_uploads/SyTa-qyBa.png)

**GPU (CUDA)**: configure with `cmake .. -DLLAMA_CUBLAS=ON` instead of `cmake ..`.

If you see "**Failed to detect a default CUDA architecture.**"

![image](https://hackmd.io/_uploads/Hyg5JBHL6.png)

First check whether `nvcc` exists under `/usr/local/cuda/bin`. If it does, CUDA is installed and the system simply cannot find it. Note that the terminal will probably suggest installing `nvidia-cuda-toolkit`. Do not follow that suggestion: the system only thinks CUDA is missing, when it is in fact already installed. Fix the search path instead:

`vim ~/.bashrc`

Press `i` to enter insert mode and add the following two lines at the end of the file. Some online tutorials write `cuda-<version>` instead of `cuda` in these paths; I am not sure what the difference is.

```
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

![image](https://hackmd.io/_uploads/BJI-erB86.png)

**Reload bashrc**

`source ~/.bashrc`

**Check nvcc**

`nvcc -V`

![image](https://hackmd.io/_uploads/ByKBeHSIp.png)

## 2-3. Download model

**Script.sh**
```
#!/bin/bash

cd llama.cpp
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
    curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi
cd ..
```

![image](https://hackmd.io/_uploads/ByfVM9yBT.png)

## 2-4. Run in interactive mode

**Using GPU**: add the flag `--n-gpu-layers 15000`

![image](https://hackmd.io/_uploads/B11_SVgB6.png)

**Using CPU**: remove the flag `--n-gpu-layers 15000`

![image](https://hackmd.io/_uploads/rJnXrEgBp.png)

**Script.sh**
```
#!/bin/bash

# The earlier steps end in the parent directory, so enter llama.cpp first
cd llama.cpp

# Set prompt (defined here but not passed to main below)
PROMPT="Hello! How are you?"

# Run in interactive mode
./build/bin/main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  --n-gpu-layers 15000 -t 8
```

![image](https://hackmd.io/_uploads/HJK-W4erT.png)

![image](https://hackmd.io/_uploads/HkyvyNgHp.png)

## 2-5. Test

![image](https://hackmd.io/_uploads/HJWFJVeSa.png)

# 3. Debug

### Build

```
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
```

**CUDA compiler not found**

![image](https://hackmd.io/_uploads/Bym_5Qkra.png)

**Specify nvcc when configuring CMakeLists.txt**

```
mkdir build
cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc" CMakeLists.txt
cd build
cmake .. -DLLAMA_CUBLAS=ON
```

![image](https://hackmd.io/_uploads/Hyfs9H1HT.png)

```
cd ..
cmake --build . --config Release
```

![image](https://hackmd.io/_uploads/BJ6Ah-JSa.png)

![image](https://hackmd.io/_uploads/H1SAnB1Sa.png)
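
Before configuring with `-DLLAMA_CUBLAS=ON`, it can help to confirm where `nvcc` actually lives, since both CUDA errors in these notes come down to CMake not finding it. The following is a minimal sketch, not part of llama.cpp: `find_nvcc` is a hypothetical helper name, and the default prefix `/usr/local/cuda` is an assumption taken from the paths used in these notes.

```shell
#!/bin/bash

# Hypothetical helper (not part of llama.cpp or CUDA): locate nvcc.
# Prints the nvcc path and returns 0 if found, otherwise returns 1.
# Optional argument: CUDA install prefix (assumed default: /usr/local/cuda).
find_nvcc() {
    local cuda_home="${1:-/usr/local/cuda}"
    if command -v nvcc >/dev/null 2>&1; then
        # nvcc is already on PATH
        command -v nvcc
    elif [ -x "${cuda_home}/bin/nvcc" ]; then
        # Installed, but PATH does not include it
        echo "${cuda_home}/bin/nvcc"
    else
        return 1
    fi
}

# Usage: report what was found and how to pass it to CMake
if NVCC_PATH="$(find_nvcc)"; then
    echo "nvcc found: ${NVCC_PATH}"
    echo "If CMake still cannot find it, try: cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_COMPILER=${NVCC_PATH}"
else
    echo "nvcc not found on PATH or under /usr/local/cuda/bin" >&2
fi
```

If the helper reports a path under `/usr/local/cuda/bin` while `nvcc -V` fails in the shell, the `PATH` export in `~/.bashrc` is the likely missing piece.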