Environment: Rocky Linux

## CPU

1. Clone llama.cpp

    ```
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    ```

2. Install CMake and the build dependencies

    ```
    sudo dnf update
    sudo dnf install cmake gcc gcc-c++ make libcurl-devel
    ```

    llama.cpp offers separate CPU and GPU builds. This VM has no GPU, so we use the CPU build.

3. CPU build

    ```
    cmake -B build
    cmake --build build --config Release
    ```

    The `cmake --build build --config Release` step takes a while.

4. Confirm that `ls -la build/bin/` shows the llama binaries

    ![image](https://hackmd.io/_uploads/H1zMiXHcxx.png)

5. Add them to PATH

    ```
    # Permanently add to ~/.bashrc
    echo 'export PATH=$PATH:~/llama.cpp/build/bin' >> ~/.bashrc
    source ~/.bashrc
    ```

You can then run any of the following:

```
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```

You can also download a model and run it as a server:

```
llama-server -hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16 --embedding --pooling mean -ub 8192
```

Parameters:

- `-hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16` — load the model from Hugging Face
- `--embedding` — enable embedding mode (instead of text generation)
- `--pooling mean` — use mean pooling to produce sentence embeddings
- `-ub 8192` — set the micro-batch size to 8192
- `--host 0.0.0.0` — allow external access
- `--port` — defaults to port 8080
- `-c 2048` — set the context length

You can also use `llama-embedding` directly:

```
llama-embedding -hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16 --pooling mean -p "Query: jina is awesome" --embd-output-format json 2>/dev/null
```

Test example:

```
curl -X POST "http://127.0.0.1:8080/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Query: A beautiful sunset over the beach",
      "Query: Un beau coucher de soleil sur la plage",
      "Query: 海滩上美丽的日落",
      "Query: 浜辺に沈む美しい夕日"
    ]
  }'
```

If vectors come back, it works.

## GPU

Check whether WSL can see the GPU:

```
nvidia-smi
```

Install the dependencies:

```
sudo apt install -y nvidia-cuda-toolkit
sudo apt install -y libcurl4-openssl-dev
```

Build with CMake:

```
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

Download the model:

```
./build/bin/llama-server -hf jinaai/jina-embeddings-v4-text-matching-GGUF:F16 --embedding --pooling mean -ub 8192
```

The model is stored under /home/user/.cache/. Next, quantize it:

```
./build/bin/llama-quantize ~/.cache/llama.cpp/jinaai_jina-embeddings-v4-text-matching-GGUF_jina-embeddings-v4-text-matching-F16.gguf ./jina-embeddings-v4-text-matching-Q4_K_M.gguf q4_k_m
```

Start the server:

```
./build/bin/llama-server -m /home/steven60101/models/jina-embeddings-v4-text-matching-Q4_K_M.gguf --embedding --pooling mean -ub 1024 --no-mmap --port 8002
```

Quick test:

```
curl -X POST http://127.0.0.1:8002/embeddings -H "Content-Type: application/json" -d '{"input": "Hello world"}'
```

LLM:

```
./build/bin/llama-server -m /home/steven60101/models/Qwen_Qwen2.5-3B-Instruct-GGUF_qwen2.5-3b-instruct-q4_k_m.gguf --pooling mean -ub 1024 --no-mmap --port 8002
```

Test:

```
curl -X POST http://127.0.0.1:8002/v1/completions -H "Content-Type: application/json" -d '{"prompt": "Hello world", "max_tokens": 10}'
```

Switching to the OpenAI-compatible API format (note the server above listens on port 8002):

```
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-3B-Instruct",
    "messages": [
      { "role": "user", "content": "你會說中文嗎?" }
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

Then configure the OpenAI Connector in ELK:

![image](https://hackmd.io/_uploads/Hk_toRtJZg.png)
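To verify the embedding endpoint programmatically rather than eyeballing raw vectors, here is a minimal Python sketch. It assumes the embedding server from the earlier step is listening on 127.0.0.1:8080 and uses only the standard library; the texts are two of the multilingual test sentences, which should score a high cosine similarity if the model is working:

```python
import json
import math
import urllib.request

# Assumes llama-server was started with --embedding --pooling mean
# on the default port 8080; adjust the URL for your setup.
URL = "http://127.0.0.1:8080/v1/embeddings"

def get_embeddings(texts):
    """POST texts to the OpenAI-compatible /v1/embeddings endpoint."""
    body = json.dumps({"input": texts}).encode("utf-8")
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [item["embedding"] for item in data["data"]]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

emb = get_embeddings([
    "Query: A beautiful sunset over the beach",
    "Query: 海滩上美丽的日落",
])
print(f"cosine similarity: {cosine(emb[0], emb[1]):.4f}")
```

Semantically matching sentences across languages should land noticeably closer to 1.0 than unrelated pairs.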
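Likewise, because llama-server exposes the OpenAI-compatible /v1/chat/completions route, the official `openai` Python client can talk to it directly, which is a quick way to confirm that tools like the ELK OpenAI Connector will be able to as well. A minimal sketch, assuming the Qwen server from the LLM step on port 8002 and `pip install openai`; the API key value is a placeholder, since llama-server does not require one by default:

```python
from openai import OpenAI

# Point the client at the local llama-server instead of api.openai.com.
# The api_key is a dummy value: llama-server accepts any key by default.
client = OpenAI(base_url="http://127.0.0.1:8002/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="Qwen2.5-3B-Instruct",  # llama-server serves whichever model it loaded
    messages=[{"role": "user", "content": "Can you speak Chinese?"}],
    max_tokens=50,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```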