Environment: Rocky Linux
1. git clone llama.cpp
```
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
2. Install CMake and build dependencies
```
sudo dnf update
sudo dnf install cmake gcc gcc-c++ make libcurl-devel
```
The build differs for CPU and GPU; this VM has no GPU, so the CPU build is used.
3. CPU Build
```
cmake -B build
cmake --build build --config Release
```
The cmake --build build --config Release step takes a while.
4. Confirm that ls -la build/bin/ shows the llama binaries

5. Add to PATH
```
# Persist it in ~/.bashrc
echo 'export PATH=$PATH:~/llama.cpp/build/bin' >> ~/.bashrc
source ~/.bashrc
```
Then you can run it in any of the following ways:
```
# Use a local model file
llama-cli -m my_model.gguf
# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
You can also download a model and run the server:
```
llama-server -hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16 --embedding --pooling mean -ub 8192
```
Parameter notes:
-hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16 - load the model from Hugging Face
--embedding - enable embedding mode (instead of text generation)
--pooling mean - use mean pooling to produce sentence embeddings
-ub 8192 - set the micro-batch size to 8192
--host 0.0.0.0 - allow external access
--port - usually the default port 8080
-c 2048 - set the context length
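Putting the optional flags together, a fully specified server invocation might look like this (a sketch; adjust the host, port, and context length to your environment):

```shell
llama-server -hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16 \
  --embedding --pooling mean -ub 8192 \
  --host 0.0.0.0 --port 8080 -c 2048
```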
You can also use llama-embedding directly:
```
llama-embedding -hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16 --pooling mean -p "Query: jina is awesome" --embd-output-format json 2>/dev/null
```
Test example:
```
curl -X POST "http://127.0.0.1:8080/v1/embeddings" \
-H "Content-Type: application/json" \
-d '{
"input": [
"Query: A beautiful sunset over the beach",
"Query: Un beau coucher de soleil sur la plage",
"Query: 海滩上美丽的日落",
"Query: 浜辺に沈む美しい夕日"
]
}'
```
If it returns vectors, it works.
## GPU
Check whether WSL has GPU support:
```
nvidia-smi
```
Install dependencies:
```
sudo apt install -y nvidia-cuda-toolkit
sudo apt install -y libcurl4-openssl-dev
```
Build with CMake:
```
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```
Download the model (llama-server fetches it on first run):
```
./build/bin/llama-server -hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16 --embedding --pooling mean -ub 8192
```
The model is cached under /home/user/.cache/. Next, quantize it:
```
./build/bin/llama-quantize ~/.cache/llama.cpp/jinaai_jina-embeddings-v4-text-matching-GGUF_jina-embeddings-v4-text-matching-F16.gguf ./jina-embeddings-v4-text-matching-Q4_K_M.gguf q4_k_m
```
Start the server:
```
./build/bin/llama-server -m /home/steven60101/models/jina-embeddings-v4-text-matching-Q4_K_M.gguf --embedding --pooling mean -ub 1024 --no-mmap --port 8002
```
Quick test:
```
curl -X POST http://127.0.0.1:8002/embeddings -H "Content-Type: application/json" -d '{"input": "Hello world"}'
```
LLM (text generation):
```
./build/bin/llama-server -m /home/steven60101/models/Qwen_Qwen2.5-3B-Instruct-GGUF_qwen2.5-3b-instruct-q4_k_m.gguf -ub 1024 --no-mmap --port 8002
```
Test:
```
curl -X POST http://127.0.0.1:8002/v1/completions -H "Content-Type: application/json" -d '{"prompt": "Hello world", "max_tokens": 10}'
```
Or use the OpenAI-compatible chat format:
```
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-3B-Instruct",
"messages": [
{
"role": "user",
"content": "你會說中文嗎?"
}
],
"max_tokens": 50,
"temperature": 0.7
}'
```
Then configure the ELK OpenAI Connector with this endpoint.
