# Multi-Split Large Model Experiments: Benchmarking prima.cpp on Glows.ai
## Glows.ai machine setup
1. Choose a `4090` instance with the `Ubuntu24.04 CUDA11.8` image; a quick GPU check after boot is shown below.
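Once the instance is up, the standard NVIDIA tool confirms the driver and GPU are visible (nothing Glows.ai-specific is assumed here):
```bash
# Verify the RTX 4090 and the CUDA driver are exposed inside the instance
nvidia-smi
```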
## Installation Scripts for each machine
1. Install dependencies
```bash
apt update -y && apt install -y gcc-9 g++-9 make cmake fio git wget libzmq3-dev
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
# Install monitor tools
apt install -y nvtop htop
# Install network tools (see https://www.xmodulo.com/how-to-install-tcpping-on-linux.html)
apt install -y tcptraceroute bc
wget http://www.vdberg.org/~richard/tcpping
chmod 755 tcpping
```
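The `tcpping` script is only downloaded into the current directory; a minimal sketch for putting it on the PATH and probing a peer (the hostname and port below are example values reused from later in this guide):
```bash
# Make tcpping callable from anywhere (it relies on the tcptraceroute and bc packages installed above)
mv tcpping /usr/local/bin/tcpping
# Example latency probe against another machine's forwarded port; replace host/port with your own
tcpping tw-05.access.glows.ai 25443
```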
2. Install prima.cpp
```bash
# clone prima.cpp
git clone https://github.com/DandinPower/prima.cpp.git
# this fork is based on the original main branch (fbf853341b2e154e550802094d3ab1fbe81c0eb4)
# and adds manual port configuration and flexible GGUF loading
cd prima.cpp
# clone solver
git clone https://github.com/ERGO-Code/HiGHS.git # master branch (364c83a51e44ba6c27def9c8fc1a49b1daf5ad5c)
cd HiGHS
mkdir build && cd build
cmake ..
make -j16
make install
ldconfig
# build prima.cpp
cd ../..
make USE_HIGHS=1 GGML_CUDA=1 -j16
# clone my utils
git clone https://github.com/DandinPower/prima.cpp_benchmark_utils.git
```
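A quick sanity check that the build produced `llama-cli` in the prima.cpp root (the generated commands later run `./llama-cli` from there) and that the HiGHS install is visible to the linker; this is only a sketch and assumes HiGHS was built as a shared library:
```bash
# The commands generated later in this guide invoke ./llama-cli from the prima.cpp root
ls -lh llama-cli
# If HiGHS was installed as a shared library, it should resolve here
ldd llama-cli | grep -i highs
```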
3. Prepare gguf
```bash
mkdir download
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf -P download/
cd prima.cpp_benchmark_utils/gguf-split-b5734
./llama-gguf-split --split-max-tensors 50 ../../download/Llama-3.2-1B-Instruct-Q4_K_M.gguf ../../download/Llama-3.2-1B-Instruct-Q4_K_M
cd ../..
```
- For the rank 2 machine (which does not need the first split), the metadata converter can be used:
```bash
python prima.cpp_benchmark_utils/gguf_metadata_converter/converter.py download/Llama-3.2-1B-Instruct-Q4_K_M-00001-of-00003.gguf
```
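Whichever path a machine takes, it is worth confirming the expected shard files exist before generating commands; the config in the next section references the `-00001-of-00003` shard:
```bash
# List the shards produced by llama-gguf-split next to the original file
ls -lh download/Llama-3.2-1B-Instruct-Q4_K_M-*-of-*.gguf
```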
## Generate the CMDs with the Generator
1. Follow the README: https://github.com/DandinPower/prima.cpp_benchmark_utils/tree/main?tab=readme-ov-file#primacpp-commands-generator
2. Create a `config.json`:
```json
{
"gguf_file": "download/Llama-3.2-1B-Instruct-Q4_K_M-00001-of-00003.gguf",
"world": 3,
"ctx_size": 4096,
"n_predict": 1024,
"master_node": {
"layer_window_size": 4,
"loopback_ip": "127.0.0.1",
"public_ip": "tw-05.access.glows.ai",
"data_port": 9000,
"signal_port": 9001,
"public_data_port": 25443,
"public_signal_port": 25751,
"splits": "0,1,2"
},
"server_nodes": [
{
"layer_window_size": 8,
"public_ip": "tw-05.access.glows.ai",
"data_port": 9000,
"signal_port": 9002,
"public_data_port": 25142,
"public_signal_port": 25450,
"splits": "0,1"
},
{
"layer_window_size": 4,
"public_ip": "tw-05.access.glows.ai",
"data_port": 9000,
"signal_port": 9001,
"public_data_port": 25149,
"public_signal_port": 25457,
"splits": "1,2"
}
]
}
```
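The `public_data_port`/`public_signal_port` values are the ports Glows.ai exposes externally for each node's internal `data_port`/`signal_port`, and the generated commands below connect through them. Before running, the forwarded ports can be probed from the other machines with the `tcpping` script installed earlier (hostname and ports are the example values from this config):
```bash
# From a server node, probe the master's forwarded data and signal ports
tcpping tw-05.access.glows.ai 25443
tcpping tw-05.access.glows.ai 25751
```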
3. Run the generator script:
```bash
python generate_commands.py --prompt "<|User|>What is 1+1?<|Assistant|>" --config-path config.json --multi-splits
```
4. Get the generated commands:
```bash
Master Node Command:
------------------------------------------------------------
./llama-cli --splits 0,1,2 -m download/Llama-3.2-1B-Instruct-Q4_K_M-00001-of-00003.gguf -c 4096 -n 1024 -p "<|User|>What is 1+1?<|Assistant|>" --world 3 --rank 0 --prefetch -lw "4,8,4" -ngl 4 --master 127.0.0.1 --data_port 9000 --signal_port 9001 --next tw-05.access.glows.ai --master_data_port 25443 --next_node_data_port 25142 --next_node_signal_port 25450
------------------------------------------------------------
Server 0 Node Command:
./llama-cli --splits 0,1 -m download/Llama-3.2-1B-Instruct-Q4_K_M-00001-of-00003.gguf --world 3 --rank 1 --prefetch -ngl 8 --master tw-05.access.glows.ai --data_port 9000 --signal_port 9002 --next tw-05.access.glows.ai --master_data_port 25443 --next_node_data_port 25149 --next_node_signal_port 25457
------------------------------------------------------------
Server 1 Node Command:
./llama-cli --splits 1,2 -m download/Llama-3.2-1B-Instruct-Q4_K_M-00001-of-00003.gguf --world 3 --rank 2 --prefetch -ngl 4 --master tw-05.access.glows.ai --data_port 9000 --signal_port 9001 --next tw-05.access.glows.ai --master_data_port 25443 --next_node_data_port 25443 --next_node_signal_port 25751
------------------------------------------------------------
```
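Each of the three commands has to stay running on its own machine for the distributed run. One convenience (not required by prima.cpp) is to launch them inside tmux so they survive SSH disconnects:
```bash
# Optional: keep each node's llama-cli command alive in a detached session
apt install -y tmux
tmux new -s prima   # paste this device's command inside, detach with Ctrl-b d
# Reattach later with: tmux attach -t prima
```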
5. Execute the commands on each device to get the results:
```
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1024, n_keep = 1
<|User|>What is 1+1?<|Assistant|> The answer is 2! [end of text]
llama_perf_sampler_print: sampling time = 1.57 ms / 24 runs ( 0.07 ms per token, 15247.78 tokens per second)
llama_perf_context_print: load time = 18226.64 ms
llama_perf_context_print: prompt eval time = 107.45 ms / 17 tokens ( 6.32 ms per token, 158.21 tokens per second)
llama_perf_context_print: eval time = 15.52 ms / 1 runs ( 15.52 ms per token, 64.42 tokens per second)
llama_perf_context_print: total time = 193.76 ms / 18 tokens
```
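When comparing runs, the throughput lines can be pulled straight out of the logs; a small sketch assuming each node's output was redirected to a per-rank file with `> run_rankN.log 2>&1` (the filenames are hypothetical):
```bash
# Collect the load/prefill/decode timing lines from every node's log
grep 'llama_perf_context_print' run_rank*.log
```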