# prima.cpp server deployment tutorial

## Master Node Setup

### 1. Select Master Node Machine

Choose a `TW-03` container-based machine with a 4090 GPU.

![image](https://hackmd.io/_uploads/HJ_RMkjVgl.png)

### 2. Select the Correct Image

Use the following official image:

```
img-6jq1yqgo
Ubuntu 24.04 CUDA11.8
```

### 3. Configure Port Forwarding

1. **TCP Data Port (for P2P communication)**
   - TCP Port: `9000`
   - Hostname: `tw-05.access.glows.ai`
   - External Port: `25111`
2. **TCP Signal Port (for P2P communication)**
   - TCP Port: `9001`
   - Hostname: `tw-05.access.glows.ai`
   - External Port: `26195`

   *Note: Both data and signal ports must use the same hostname due to a `prima.cpp` limitation.*

3. **HTTPS Port (for client requests)**
   - HTTP Port: `8080`
   - Hostname: `tw-05.access.glows.ai`
   - External Port: `25250`

### 4. Install Dependencies

```bash
apt update -y && apt install -y gcc-9 g++-9 make cmake fio git wget libzmq3-dev nvtop htop
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90

# clone prima.cpp
git clone https://github.com/DandinPower/prima.cpp.git
cd prima.cpp

# clone solver
git clone https://github.com/ERGO-Code/HiGHS.git  # master branch (364c83a51e44ba6c27def9c8fc1a49b1daf5ad5c)
cd HiGHS
mkdir build && cd build
cmake ..
make -j16
make install
ldconfig

# build prima.cpp
cd ../..
make USE_HIGHS=1 GGML_CUDA=1 -j16
```

### 5. Download Model

```bash
mkdir gguf
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf -P gguf/
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m-00002-of-00002.gguf -P gguf/
```

### 6. Launch Master Server

```bash
./llama-server -m gguf/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf --alias qwen2.5-7b-instruct -c 16384 -b 16384 -np 4 --world 2 --rank 0 -lw "14,14" -ngl 14 -fa --keep-out-in-cuda --comm_datatype q4_0 --host 0.0.0.0 --port 8080 --data_port 9000 --signal_port 9001 --master 127.0.0.1 --master_data_port 9000 --next tw-07.access.glows.ai --next_node_data_port 27031 --next_node_signal_port 26647
```

**Commands Explanation**

```
-c 16384: Total context length this service can serve.
-b 16384: Must match -c; otherwise the server will segfault.
-np 4: Allow 4 users to infer concurrently. Each gets 16384 / 4 = 4096 context length.
--world 2: Number of ranks. Here, 1x 4090 + 1x L40s.
--rank 0: Local rank ID (master node = 0).
-lw "14,14": Qwen2.5-7B has 28 layers. Split 14/14 across ranks.
-ngl 14: Load all 14 local layers onto the GPU.
-fa: Enable FlashAttention.
--keep-out-in-cuda: Keep the output layer on the GPU to significantly improve decoding performance.
--comm_datatype: Communication datatype (available: q8_0, q4_0, q2_k); quantized communication can significantly improve prefill performance. q8_0 and q4_0 have almost no impact on quality, but q2_k may cause some degradation.
--host 0.0.0.0: Listen on all interfaces.
--port 8080: Matches the HTTPS port forwarding.
--data_port 9000: Matches the data port.
--signal_port 9001: Matches the signal port.
--master 127.0.0.1: Loopback, as the master is local.
--master_data_port 9000: Same as above.
--next tw-07.access.glows.ai: Next (slave) node hostname.
--next_node_data_port 27031: Data port of the slave.
--next_node_signal_port 26647: Signal port of the slave.
```

> For additional arguments, refer to the official usage guide.
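Before moving on to the slave node, it can help to confirm that the forwarded ports are actually reachable from outside. The sketch below is only a quick sanity check, assuming the example hostname and external ports configured in step 3; run it from any machine with `nc` available, after the master server is up (the ports have no listener until `llama-server` is running).

```bash
# Probe the master's forwarded ports from outside the container
# (hostname/ports are the example values from step 3 above).
nc -zv tw-05.access.glows.ai 25111   # forwarded TCP data port   -> 9000 inside
nc -zv tw-05.access.glows.ai 26195   # forwarded TCP signal port -> 9001 inside
```

If these connections are refused or time out even with the master running, re-check the port-forwarding rules before debugging `prima.cpp` itself.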
---

## Slave Node Setup

### 1. Select Slave Node Machine

Choose a `TW-04` VM-based machine with an L40s GPU.

![image_1](https://hackmd.io/_uploads/Hkyy7JsVgg.png)

### 2. Select the Correct Image

Use the following official image:

```
img-9rp47yp3
Ubuntu 24.04 Server with NVIDIA Driver 570 and Docker, NVIDIA Container Toolkit
```

### 3. Install Dependencies

```bash
sudo apt update -y && sudo apt install -y gcc-9 g++-9 make cmake fio git wget libzmq3-dev nvtop htop
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90

# Install nvcc
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
echo 'export PATH=/usr/local/cuda-12.8/bin${PATH:+:${PATH}}' >> ~/.bashrc   # no sudo: the redirection targets the current user's .bashrc
source ~/.bashrc
sudo ldconfig
nvcc -V

# clone prima.cpp
git clone https://github.com/DandinPower/prima.cpp.git
cd prima.cpp

# clone solver
git clone https://github.com/ERGO-Code/HiGHS.git  # master branch (364c83a51e44ba6c27def9c8fc1a49b1daf5ad5c)
cd HiGHS
mkdir build && cd build
cmake ..
make -j16
sudo make install
sudo ldconfig

# build prima.cpp
cd ../..
make USE_HIGHS=1 GGML_CUDA=1 -j16
```

### 4. Download Model

```bash
mkdir gguf
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf -P gguf/
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m-00002-of-00002.gguf -P gguf/
```

### 5. Configure Port Forwarding

1. **TCP Data Port**
   - TCP Port: `9000`
   - Hostname: `tw-07.access.glows.ai`
   - External Port: `27031`
2. **TCP Signal Port**
   - TCP Port: `9001`
   - Hostname: `tw-07.access.glows.ai`
   - External Port: `26647`

### 6. Launch Slave Node

```bash
./llama-cli -m gguf/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf --world 2 --rank 1 -ngl 14 -fa --comm_datatype q4_0 --data_port 9000 --signal_port 9001 --master tw-05.access.glows.ai --master_data_port 25111 --next tw-05.access.glows.ai --next_node_data_port 25111 --next_node_signal_port 26195
```

**Commands Explanation**

```
--world 2: Number of ranks (1x 4090 + 1x L40s).
--rank 1: Local rank ID (slave = 1).
-ngl 14: Load all 14 local layers onto the GPU.
-fa: Enable FlashAttention.
--comm_datatype: Communication datatype (available: q8_0, q4_0, q2_k); quantized communication can significantly improve prefill performance. q8_0 and q4_0 have almost no impact on quality, but q2_k may cause some degradation.
--data_port 9000: Matches the configured data port.
--signal_port 9001: Matches the configured signal port.
--master tw-05.access.glows.ai: Master node hostname.
--master_data_port 25111: Master's external data port.
--next tw-05.access.glows.ai: Next node = the master again (the ring wraps around).
--next_node_data_port 25111: Master's external data port (again).
--next_node_signal_port 26195: Master's external signal port.
```
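With both processes launched, each node should hold roughly half of the model in VRAM (14 of the 28 layers, per `-lw "14,14"` and `-ngl 14`). One minimal way to confirm the offload on either node, assuming the standard NVIDIA driver tools are present, is to query GPU memory usage (or watch it live with the `nvtop` installed earlier):

```bash
# Confirm that model layers were offloaded to this node's GPU.
# Expect a few GB of VRAM in use once the node has loaded its 14 layers.
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
```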
---

## Client Request (After Setup)

Once both master and slave nodes are active, you can send a request like this:

```bash
curl https://tw-05.access.glows.ai:25250/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
      {"role": "user", "content": "what is edge AI?"}
    ],
    "max_tokens": 4096,
    "temperature": 0.7,
    "stream": true
  }'
```

> ⚠️ Ensure `max_tokens` does not exceed the per-user context limit: 16384 / 4 = 4096.
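Since the master was launched with `-np 4`, up to four requests can be decoded concurrently, each within its own 4096-token slot. A rough way to exercise this, reusing the same endpoint and model alias as the example above, is to fire several non-streaming requests in parallel from the shell:

```bash
# Send 4 concurrent requests to exercise the -np 4 slots.
for i in 1 2 3 4; do
  curl -s https://tw-05.access.glows.ai:25250/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen2.5-7b-instruct",
      "messages": [{"role": "user", "content": "Summarize edge AI in one sentence."}],
      "max_tokens": 256,
      "stream": false
    }' &
done
wait
```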