# prima.cpp Server Deployment Tutorial
## Master Node Setup
### 1. Select Master Node Machine
Choose a `TW-03` container-based machine with a 4090 GPU.

### 2. Select the Correct Image
Use the following official image:
```
img-6jq1yqgo
Ubuntu 24.04 CUDA11.8
```
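Once the container is running, it is worth confirming that the GPU and CUDA toolchain are visible before building anything (a quick check; the image is expected to ship the CUDA 11.8 toolkit with `nvcc`):
```bash
# The RTX 4090 should appear with the driver loaded
nvidia-smi
# nvcc should report CUDA 11.8 if the image ships the toolkit
nvcc -V
```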
### 3. Configure Port Forwarding
1. **TCP Data Port (for P2P communication)**
   - TCP Port: `9000`
   - Hostname: `tw-05.access.glows.ai`
   - External Port: `25111`
2. **TCP Signal Port (for P2P communication)**
   - TCP Port: `9001`
   - Hostname: `tw-05.access.glows.ai`
   - External Port: `26195`

   *Note: Both data and signal ports must use the same hostname due to a `prima.cpp` limitation.*
3. **HTTPS Port (for client requests)**
   - HTTP Port: `8080`
   - Hostname: `tw-05.access.glows.ai`
   - External Port: `25250`
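After saving the forwarding rules, the mappings can be spot-checked with a TCP probe (a sketch assuming `netcat` is available; the probes only succeed once a process is listening on the internal ports, i.e. after the server launch in step 6):
```bash
# Probe the external ports mapped to the master's data, signal, and HTTP ports
nc -zv tw-05.access.glows.ai 25111   # -> 9000 (data)
nc -zv tw-05.access.glows.ai 26195   # -> 9001 (signal)
nc -zv tw-05.access.glows.ai 25250   # -> 8080 (HTTP)
```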
### 4. Install Dependencies
```bash
apt update -y && apt install -y gcc-9 g++-9 make cmake fio git wget libzmq3-dev nvtop htop
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
# clone prima.cpp
git clone https://github.com/DandinPower/prima.cpp.git
cd prima.cpp
# clone solver
git clone https://github.com/ERGO-Code/HiGHS.git  # master branch (commit 364c83a51e44ba6c27def9c8fc1a49b1daf5ad5c)
cd HiGHS
mkdir build && cd build
cmake ..
make -j16
make install
ldconfig
# build prima.cpp
cd ../..
make USE_HIGHS=1 GGML_CUDA=1 -j16
```
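Before moving on, a quick sanity check (run from the `prima.cpp` root) confirms the toolchain and build artifacts:
```bash
# gcc/g++ should now resolve to version 9
gcc --version | head -n 1
# libhighs should be registered with the dynamic linker after `make install` and `ldconfig`
ldconfig -p | grep -i highs
# The build should have produced the server binary used in step 6
ls -lh llama-server
```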
### 5. Download Model
```bash
mkdir gguf
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf -P gguf/
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m-00002-of-00002.gguf -P gguf/
```
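Qwen2.5-7B-Instruct Q4_K_M is published as a two-shard split GGUF. Both shards must sit in the same directory; only the first shard is passed to the server in step 6, and the loader is expected to pick up the second shard automatically. Verify the downloads:
```bash
# Two shards totalling roughly 4-5 GB should be present
ls -lh gguf/
```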
### 6. Launch Master Server
```bash
./llama-server -m gguf/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf --alias qwen2.5-7b-instruct -c 16384 -b 16384 -np 4 --world 2 --rank 0 -lw "14,14" -ngl 14 -fa --keep-out-in-cuda --comm_datatype q4_0 --host 0.0.0.0 --port 8080 --data_port 9000 --signal_port 9001 --master 127.0.0.1 --master_data_port 9000 --next tw-07.access.glows.ai --next_node_data_port 27031 --next_node_signal_port 26647
```
**Commands Explanation**
```
-c 16384: Total context length this service can serve.
-b 16384: Must match -c; otherwise the server will segfault.
-np 4: Allow 4 concurrent users; each gets 16384 / 4 = 4096 tokens of context.
--world 2: Number of ranks. Here, 1x 4090 + 1x L40s.
--rank 0: Local rank ID (master node = 0).
-lw "14,14": Qwen2.5-7B has 28 layers. Split 14/14 across ranks.
-ngl 14: Offload all 14 of this rank's layers to the GPU.
-fa: Enable FlashAttention.
--keep-out-in-cuda: Keep the output layer on the GPU to significantly improve decoding performance.
--comm_datatype: Communication datatype (available: q8_0, q4_0, q2_k). Quantized communication can significantly improve prefill performance; q8_0 and q4_0 have almost no impact on quality, but q2_k may cause some degradation.
--host 0.0.0.0: Listen on all interfaces.
--port 8080: Matches HTTPS port forwarding.
--data_port 9000: Matches data port.
--signal_port 9001: Matches signal port.
--master 127.0.0.1: Loopback, as master is local.
--master_data_port 9000: Same as above.
--next tw-07.access.glows.ai: Next (slave) node hostname.
--next_node_data_port 27031: Data port of slave.
--next_node_signal_port 26647: Signal port of slave.
```
> For additional arguments, refer to the official usage guide.
>
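The launch command above runs in the foreground, so the server stops when the SSH session closes. One way to keep it running (a sketch using `nohup`; `tmux` or `screen` work just as well) is:
```bash
# Run the master server in the background and capture its log
nohup ./llama-server -m gguf/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf \
  --alias qwen2.5-7b-instruct -c 16384 -b 16384 -np 4 \
  --world 2 --rank 0 -lw "14,14" -ngl 14 -fa --keep-out-in-cuda --comm_datatype q4_0 \
  --host 0.0.0.0 --port 8080 --data_port 9000 --signal_port 9001 \
  --master 127.0.0.1 --master_data_port 9000 \
  --next tw-07.access.glows.ai --next_node_data_port 27031 --next_node_signal_port 26647 \
  > master.log 2>&1 &
tail -f master.log
```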
---
## Slave Node Setup
### 1. Select Slave Node Machine
Choose a `TW-04` VM-based machine with an L40s GPU.

### 2. Select the Correct Image
Use the following official image:
```
img-9rp47yp3
Ubuntu 24.04 Server with NVIDIA Driver 570 and Docker, NVIDIA Container Toolkit
```
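This VM image ships the NVIDIA driver and container toolkit but not the CUDA compiler, which is presumably why the next step installs the CUDA 12.8 toolkit. A quick driver check before proceeding:
```bash
# The L40S should appear with driver 570 loaded
nvidia-smi
```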
### 3. Install Dependencies
```bash
sudo apt update -y && sudo apt install -y gcc-9 g++-9 make cmake fio git wget libzmq3-dev nvtop htop
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
# Install nvcc
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
echo 'export PATH=/usr/local/cuda-12.8/bin${PATH:+:${PATH}}' >> ~/.bashrc
source ~/.bashrc
sudo ldconfig
nvcc -V
# clone prima.cpp
git clone https://github.com/DandinPower/prima.cpp.git
cd prima.cpp
# clone solver
git clone https://github.com/ERGO-Code/HiGHS.git  # master branch (commit 364c83a51e44ba6c27def9c8fc1a49b1daf5ad5c)
cd HiGHS
mkdir build && cd build
cmake ..
make -j16
sudo make install
sudo ldconfig
# build prima.cpp
cd ../..
make USE_HIGHS=1 GGML_CUDA=1 -j16
```
### 4. Download Model
```bash
mkdir gguf
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf -P gguf/
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m-00002-of-00002.gguf -P gguf/
```
### 5. Configure Port Forwarding
1. **TCP Data Port**
   - TCP Port: `9000`
   - Hostname: `tw-07.access.glows.ai`
   - External Port: `27031`
2. **TCP Signal Port**
   - TCP Port: `9001`
   - Hostname: `tw-07.access.glows.ai`
   - External Port: `26647`
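As with the master, the slave's forwarded ports can be spot-checked from the master node (assuming `netcat`; the probes only succeed once the slave process is listening):
```bash
# Run from the master node: the slave's external data and signal ports should accept TCP connections
nc -zv tw-07.access.glows.ai 27031   # -> 9000 (data)
nc -zv tw-07.access.glows.ai 26647   # -> 9001 (signal)
```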
### 6. Launch Slave Node
```bash
./llama-cli -m gguf/qwen2.5-7b-instruct-q4_k_m-00001-of-00002.gguf --world 2 --rank 1 -ngl 14 -fa --comm_datatype q4_0 --data_port 9000 --signal_port 9001 --master tw-05.access.glows.ai --master_data_port 25111 --next tw-05.access.glows.ai --next_node_data_port 25111 --next_node_signal_port 26195
```
**Commands Explanation**
```
--world 2: Number of ranks (1x 4090 + 1x L40s).
--rank 1: Local rank ID (slave = 1).
-ngl 14: Offload all 14 of this rank's layers to the GPU.
-fa: Enable FlashAttention.
--comm_datatype: Communication datatype (available: q8_0, q4_0, q2_k). Quantized communication can significantly improve prefill performance; q8_0 and q4_0 have almost no impact on quality, but q2_k may cause some degradation.
--data_port 9000: Matches configured data port.
--signal_port 9001: Matches configured signal port.
--master tw-05.access.glows.ai: Master node hostname.
--master_data_port 25111: Master's external data port.
--next tw-05.access.glows.ai: Next node = master again.
--next_node_data_port 25111: Master's data port (again).
--next_node_signal_port 26195: Master's signal port.
```
---
## Client Request (After Setup)
Once both master and slave nodes are active, you can send a request like this:
```bash
curl https://tw-05.access.glows.ai:25250/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
      {"role": "user", "content": "what is edge AI?"}
    ],
    "max_tokens": 4096,
    "temperature": 0.7,
    "stream": true
  }'
```
> ⚠️ Ensure max_tokens does not exceed the per-user context limit: 16384 / 4 = 4096.
>
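For scripting, the same endpoint can also be called without streaming and the reply extracted with `jq` (a sketch; assumes `jq` is installed and relies on the OpenAI-compatible response format shown above):
```bash
curl -s https://tw-05.access.glows.ai:25250/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "what is edge AI?"}],
        "max_tokens": 512,
        "temperature": 0.7,
        "stream": false
      }' | jq -r '.choices[0].message.content'
```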