---
title: llama.cpp build on WSL2
tags: [llama, WSL]
---

# 1. Environment setup
Follow the SOP:
> https://hackmd.io/t1VhRKGKTSiG70EZHezq-Q



# 2. Official script: windows-install-llama-cpp

`curl -L "https://replicate.fyi/windows-install-llama-cpp" | bash`

**Script.sh**
```
#!/bin/bash 

# Clone repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630

# Build
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
  curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi

# Set prompt
PROMPT="Hello! How are you?"

# Run in interactive mode
./build/bin/main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  --n-gpu-layers 15000 \
  -t 8
```

## 2-1. Clone repo
**The latest llama.cpp is no longer compatible with GGML models (it has since moved to the GGUF format), so pin the commit below:**
```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630 
```

**Issue: Unable to load the model**

> https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/discussions/6
`git checkout dadbed99e65252d79f81101a392d0d6497b86caa`
**Result: failed**

> https://github.com/ggerganov/llama.cpp/issues/1408
`git checkout cf348a6`
**Result: failed**

> https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
`git checkout e76d630`
**Result: Worked**
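Since three different commits were tried above, it is worth confirming which one the working tree actually ended up on before building. A minimal check, assuming you are inside the cloned `llama.cpp` directory:

```shell
# Print the current short commit hash; e76d630 is the one that worked above.
# Falls back to a marker string when run outside a git checkout.
CURRENT=$(git rev-parse --short HEAD 2>/dev/null || echo "not-a-repo")
echo "HEAD is at: ${CURRENT}"
```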

## 2-2. Build
```
mkdir build && cd build
cmake .. 
cmake --build . --config Release
cd ../..
```
![image](https://hackmd.io/_uploads/SyTa-qyBa.png)
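An optional tweak, not in the original script: `cmake --build` accepts `-j` (CMake 3.12 and later) to parallelize compilation across cores. A sketch that only prints the command it would run:

```shell
# Use all available cores for the Release build; nproc reports the count.
JOBS=$(nproc)
echo "cmake --build . --config Release -j${JOBS}"
```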


**GPU (CUDA) build**
`cmake .. -DLLAMA_CUBLAS=ON`

If CMake reports "**Failed to detect a default CUDA architecture.**":
![image](https://hackmd.io/_uploads/Hyg5JBHL6.png)

First go to /usr/local/cuda/bin and check whether nvcc is there. If it is, CUDA is installed; the system simply cannot find it.
Note: the terminal will likely suggest installing nvidia-cuda-toolkit. Do not follow that suggestion; the system assumes CUDA is missing, but it is already installed.

Instead, update the system PATH:
`vim ~/.bashrc`

Press `i` to enter insert mode and append the two lines below to the end of the file. Some online tutorials write `cuda-<version>` instead of `cuda`; on most installs `/usr/local/cuda` is a symlink to the versioned directory, so either form should work.
```
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

![image](https://hackmd.io/_uploads/BJI-erB86.png)
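If editing with vim is inconvenient, the same two lines can be appended from the shell. The `grep` guard below (a hypothetical helper, not part of any tool) keeps re-runs from duplicating them:

```shell
# Append a line to a file only if that exact line is not already present.
add_line() {
  line=$1
  file=$2
  grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

add_line 'export PATH=/usr/local/cuda/bin:$PATH' "$HOME/.bashrc"
add_line 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' "$HOME/.bashrc"
```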

**Reload bashrc**
`source ~/.bashrc`
    
**Check nvcc**
`nvcc -V`
![image](https://hackmd.io/_uploads/ByKBeHSIp.png)


## 2-3. Download model
**Script.sh**
```
#!/bin/bash 
cd llama.cpp

export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
  curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi

cd ..
```
![image](https://hackmd.io/_uploads/ByfVM9yBT.png)
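The download guard in the script generalizes into a small reusable function; a sketch (`fetch_if_missing` is my name for it, not part of llama.cpp):

```shell
# Download a URL to a destination path only when the file is missing,
# mirroring the if-guard in the script above.
fetch_if_missing() {
  url=$1
  dest=$2
  if [ ! -f "$dest" ]; then
    curl -L "$url" -o "$dest"
  else
    echo "already present: $dest"
  fi
}
```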


## 2-4. Run in interactive mode
**Using GPU**
Add the flag `--n-gpu-layers 15000`:
![image](https://hackmd.io/_uploads/B11_SVgB6.png)

**Using CPU**
Remove the flag `--n-gpu-layers 15000`:
![image](https://hackmd.io/_uploads/rJnXrEgBp.png)
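Rather than editing the script to add or remove the flag each time, both modes can share one script. `USE_GPU` below is a hypothetical switch of my own, not a llama.cpp option:

```shell
# Build the extra argument list conditionally: GPU offload when USE_GPU=1,
# pure CPU otherwise. Append $GPU_ARGS to the ./build/bin/main invocation.
USE_GPU=${USE_GPU:-1}
GPU_ARGS=""
if [ "$USE_GPU" = "1" ]; then
  GPU_ARGS="--n-gpu-layers 15000"
fi
echo "extra args: '$GPU_ARGS'"
```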


**Script.sh**
```
#!/bin/bash 

# Set prompt
PROMPT="Hello! How are you?"

# Run in interactive mode
./build/bin/main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  --n-gpu-layers 15000 \
  -t 8
```


![image](https://hackmd.io/_uploads/HJK-W4erT.png)
![image](https://hackmd.io/_uploads/HkyvyNgHp.png)

## 2-5. Test

![image](https://hackmd.io/_uploads/HJWFJVeSa.png)


# 3. Debug

### Build
```
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
```
**CUDA compiler not found**
![image](https://hackmd.io/_uploads/Bym_5Qkra.png)


**Specify NVCC for CMakeLists.txt**
```
mkdir build
cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc" CMakeLists.txt
cd build
cmake .. -DLLAMA_CUBLAS=ON
```
![image](https://hackmd.io/_uploads/Hyfs9H1HT.png)
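The compiler path can also be discovered instead of hard-coded; a sketch assuming nvcc lives either on PATH or in the default /usr/local/cuda location, printing the configure command it would run:

```shell
# Prefer nvcc from PATH; fall back to the default CUDA install location.
NVCC=$(command -v nvcc || echo /usr/local/cuda/bin/nvcc)
echo "cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_COMPILER=${NVCC}"
```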


```
cd ..
cmake --build . --config Release
```
![image](https://hackmd.io/_uploads/BJ6Ah-JSa.png)
![image](https://hackmd.io/_uploads/H1SAnB1Sa.png)

