# 1. Environment setup
Follow the SOP:
> https://hackmd.io/t1VhRKGKTSiG70EZHezq-Q
# 2. Official script: windows-install-llama-cpp
`curl -L "https://replicate.fyi/windows-install-llama-cpp" | bash`
**Script.sh**
```
#!/bin/bash
# Clone repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630
# Build
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
# Return to the repo root so the models/ and ./build/bin paths below resolve
cd ..
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi
# Set prompt
PROMPT="Hello! How are you?"
# Run in interactive mode
./build/bin/main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
--color \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
--n-gpu-layers 15000 \
-t 8
```
## 2-1. Clone repo
**The latest llama.cpp can no longer load GGML models (it expects GGUF), so check out an older commit:**
```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630
```
**Issue: Unable to load the model**
> https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/discussions/6
`git checkout dadbed99e65252d79f81101a392d0d6497b86caa`
**Result: failed**
> https://github.com/ggerganov/llama.cpp/issues/1408
`git checkout cf348a6`
**Result: failed**
> https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
`git checkout e76d630`
**Result: Worked**
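
Before hunting for a compatible commit, it can help to confirm which format the downloaded file actually is by looking at its first four bytes. This is a minimal sketch, assuming GGUF files start with the ASCII bytes `GGUF` and GGML v3 (`ggjt`) files start with `tjgg` (that magic stored little-endian). `model_format` is a hypothetical helper name, not part of llama.cpp.

```bash
#!/bin/bash
# Hypothetical helper: classify a model file by its first four bytes.
# Assumption: "GGUF" marks a GGUF file, "tjgg" marks a GGML v3 (ggjt) file.
model_format() {
    local magic
    # Drop NUL bytes so the command substitution stays clean on other binaries.
    magic=$(head -c 4 "$1" | tr -d '\0')
    case "$magic" in
        GGUF) echo "gguf" ;;
        tjgg) echo "ggml" ;;
        *)    echo "unknown" ;;
    esac
}

# Usage (from inside llama.cpp):
#   model_format models/llama-2-13b-chat.ggmlv3.q4_0.bin
```

If the assumption about the magic bytes holds, the model used in this note should be reported as `ggml`, which is why an older commit is needed.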
## 2-2. Build
```
mkdir build && cd build
cmake ..
cmake --build . --config Release
cd ../..
```

**GPU (CUDA) build**
Configure with cuBLAS enabled instead of the plain `cmake ..`:
`cmake .. -DLLAMA_CUBLAS=ON`
If CMake reports "**Failed to detect a default CUDA architecture.**":

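First check whether `nvcc` exists under `/usr/local/cuda/bin`. If it does, the CUDA toolkit is installed and the system simply cannot find it.
Note: the terminal will probably suggest installing `nvidia-cuda-toolkit`. Do not follow that suggestion. The system only thinks the toolkit is missing; it is already installed.
Instead, fix the system path:
`vim ~/.bashrc`
Press `i` to enter insert mode and append these two lines to the end of the file. Some online tutorials write `cuda-<version>` instead of `cuda`; on a typical install `/usr/local/cuda` is a symlink to the versioned directory, so either form should work.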
```
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```
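
Because `~/.bashrc` is re-read on every `source`, a plain `export PATH=...` prepends the same directory again each time. A small guard avoids the duplicates. `path_prepend` is a hypothetical helper name, and this sketch only handles `PATH`; the same pattern would apply to `LD_LIBRARY_PATH`.

```bash
#!/bin/bash
# Hypothetical helper: prepend a directory to PATH only if it is not there yet.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, leave PATH untouched
        *) PATH="$1:$PATH" ;;      # not present, put it in front
    esac
}

# Usage in ~/.bashrc:
#   path_prepend /usr/local/cuda/bin
#   export PATH
```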

**Reload bashrc**
`source ~/.bashrc`
**Check nvcc**
`nvcc -V`
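
The check described in this section (is `nvcc` on `PATH`, and if not, is it sitting under `/usr/local/cuda/bin`?) can be sketched as a small function. `find_nvcc` is a hypothetical helper, and the default directory is simply the one used in this note.

```bash
#!/bin/bash
# Hypothetical helper: print the path to nvcc, or fail if it cannot be found.
# $1 is the CUDA install directory to fall back on (default: /usr/local/cuda).
find_nvcc() {
    local cuda_dir="${1:-/usr/local/cuda}"
    if command -v nvcc >/dev/null 2>&1; then
        # nvcc is already on PATH
        command -v nvcc
    elif [ -x "$cuda_dir/bin/nvcc" ]; then
        # Installed, but PATH does not include it yet
        echo "$cuda_dir/bin/nvcc"
    else
        echo "nvcc not found on PATH or under $cuda_dir/bin" >&2
        return 1
    fi
}

# Usage:
#   find_nvcc && echo "toolkit is installed; fix PATH rather than reinstalling"
```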

## 2-3. Download model
**Script.sh**
```
#!/bin/bash
cd llama.cpp
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi
cd ..
```
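
The model file is several gigabytes, so an interrupted download is a realistic failure mode. The variant below is a sketch, not the official script: it downloads to a temporary `.part` file, resumes with `curl -C -`, fails on HTTP errors with `--fail`, and only renames the file once curl succeeds. `model_url` and `download_model` are hypothetical helper names.

```bash
#!/bin/bash
# Hypothetical helper: build the download URL for a model file name.
model_url() {
    echo "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/$1"
}

# Hypothetical helper: download a model into a directory unless it is already there.
# $1 = model file name, $2 = target directory (default: models)
download_model() {
    local model="$1" dir="${2:-models}"
    mkdir -p "$dir"
    if [ -s "$dir/$model" ]; then
        echo "already present: $dir/$model"
        return 0
    fi
    # -C - resumes a partial download; the rename only happens on success.
    curl -L --fail -C - -o "$dir/$model.part" "$(model_url "$model")" \
        && mv "$dir/$model.part" "$dir/$model"
}

# Usage (from inside llama.cpp):
#   download_model llama-2-13b-chat.ggmlv3.q4_0.bin
```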

## 2-4. Run in interactive mode
**Using GPU**
Keep the flag `--n-gpu-layers 15000` in the script below.

**Using CPU**
Remove the flag `--n-gpu-layers 15000` from the script below.

**Script.sh**
```
#!/bin/bash
# Set prompt
PROMPT="Hello! How are you?"
# Run in interactive mode (from inside the llama.cpp directory)
./build/bin/main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
--color \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
--n-gpu-layers 15000 \
-t 8
```
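
Instead of editing the script by hand to switch between GPU and CPU, the argument list can be assembled in a bash array and the GPU flag appended conditionally. This is a sketch using the same flags as the script above; `llama_args` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: print the main arguments, one per line.
# $1 = 1 to offload layers to the GPU, anything else for CPU only.
llama_args() {
    local use_gpu="$1"
    local args=(
        -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin
        --color
        --ctx_size 2048
        -n -1
        -ins -b 256
        --top_k 10000
        --temp 0.2
        --repeat_penalty 1.1
        -t 8
    )
    if [ "$use_gpu" = "1" ]; then
        args+=(--n-gpu-layers 15000)
    fi
    printf '%s\n' "${args[@]}"
}

# Usage (bash 4+, from inside llama.cpp):
#   mapfile -t ARGS < <(llama_args 1)
#   ./build/bin/main "${ARGS[@]}"
```

Keeping the flags in an array also avoids the trailing-backslash mistakes that multi-line commands are prone to.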


## 2-5. Test

# 3. Debug
### Build
```
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
```
**CUDA compiler not found**

**Point CMake at nvcc explicitly**
```
mkdir build
cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc" CMakeLists.txt
cd build
cmake .. -DLLAMA_CUBLAS=ON
```

```
cd ..
cmake --build . --config Release
```
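
An alternative that keeps everything in a single out-of-source configure is to pass the compiler path on the same command line as `-DLLAMA_CUBLAS=ON`. This is a minimal sketch, assuming CMake 3.13 or newer (for `-S`/`-B`) and the nvcc path used above. `configure_cuda` and `DRY_RUN` are hypothetical names, and this variant has not been verified against commit `e76d630`.

```bash
#!/bin/bash
# Hypothetical helper: configure a CUDA build in ./build with an explicit nvcc path.
# $1 = path to nvcc (default: /usr/local/cuda/bin/nvcc)
# Set DRY_RUN=1 to print the command instead of running it.
configure_cuda() {
    local nvcc="${1:-/usr/local/cuda/bin/nvcc}"
    local cmd=(cmake -S . -B build -DLLAMA_CUBLAS=ON "-DCMAKE_CUDA_COMPILER=$nvcc")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Usage (from inside llama.cpp):
#   configure_cuda
#   cmake --build build --config Release
```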

