# Glows machine setup
1. Choose a `4090` instance with the `Ubuntu24.04 CUDA11.8` image
# Installation Scripts for each machine
1. Install dependencies
```bash
apt update -y && apt install -y gcc-9 g++-9 make cmake fio git wget libzmq3-dev
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
```
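Optionally, sanity-check that GCC/G++ 9 are now the default compilers before building:
```bash
# verify the toolchain selected by update-alternatives
gcc --version | head -n 1   # should report gcc 9.x
g++ --version | head -n 1   # should report g++ 9.x
cmake --version | head -n 1
```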
2. Install monitor tools
```bash
apt install -y nvtop htop
```
3. Install network tools ([https://www.xmodulo.com/how-to-install-tcpping-on-linux.html](https://www.xmodulo.com/how-to-install-tcpping-on-linux.html))
```bash
apt install -y tcptraceroute bc
wget http://www.vdberg.org/~richard/tcpping
chmod 755 tcpping
```
- Usage: `./tcpping <hostname> <port>`
4. Install prima.cpp
```bash
# clone prima.cpp
git clone https://github.com/DandinPower/prima.cpp.git # my fork based on the original main branch (fbf853341b2e154e550802094d3ab1fbe81c0eb4)
cd prima.cpp
# clone solver
git clone https://github.com/ERGO-Code/HiGHS.git # master branch (364c83a51e44ba6c27def9c8fc1a49b1daf5ad5c)
cd HiGHS
mkdir build && cd build
cmake ..
make -j16
make install
ldconfig
# build prima.cpp
cd ../..
make USE_HIGHS=1 GGML_CUDA=1 -j16
```
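Optionally, verify that the HiGHS shared library is registered and that the prima.cpp binary was produced (assuming the build commands above completed without errors; the binary name follows llama.cpp's `llama-cli`):
```bash
# confirm the HiGHS solver library is visible to the dynamic linker
ldconfig -p | grep -i highs
# confirm the CLI binary used in the benchmarks below exists
ls -lh llama-cli
```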
5. Install model weights (GGUF)
```bash
mkdir download
wget https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/resolve/main/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -P download/
wget https://huggingface.co/Qwen/QwQ-32B-GGUF/resolve/main/qwq-32b-q4_k_m.gguf -P download/
wget https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF/resolve/main/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -P download/
```
- Three models of different sizes that can run on prima.cpp (Gemma 3 is not supported due to the ggml version)
- Quantization: Q4_K_M
- Huggingface Hub
- [https://huggingface.co/Qwen/QwQ-32B-GGUF](https://huggingface.co/Qwen/QwQ-32B-GGUF)
- [https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF)
- [https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF)
# Public Network Benchmark
## Add manual port configuration for the public network
- Commit: https://github.com/DandinPower/prima.cpp/commit/1e7ae71ce5d3d41d68763db0e0dece95524a96d1
## Setup port forwarding
<aside>
Please ensure the same hostname is used for the different ports, since Glows.ai port forwarding might assign a different hostname to each forwarded port.
</aside>
- The following is a valid example:
1. For master node
- data port: 9000; signal port: 10001
- data port forwarding: 9000 → tw-02.access.glows.ai:23832
- signal port forwarding: 10001 → tw-02.access.glows.ai:23134
2. For server node
- data port: 10000; signal port: 10001
- data port forwarding: 10000 → tw-03.access.glows.ai:50173
- signal port forwarding: 10001 → tw-03.access.glows.ai:50692
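With the `tcpping` tool installed earlier, you can confirm the forwarded ports are reachable and get a rough RTT before benchmarking (the hostnames and ports below are the ones from the example above):
```bash
# from the master node: probe the server node's forwarded ports
./tcpping tw-03.access.glows.ai 50173   # server data port
./tcpping tw-03.access.glows.ai 50692   # server signal port
# from the server node: probe the master node's forwarded ports
./tcpping tw-02.access.glows.ai 23832   # master data port
./tcpping tw-02.access.glows.ai 23134   # master signal port
```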
# Benchmark Results
## Summary of Benchmark Results
This section summarizes the performance of distributed LLM inference across various network conditions and model sizes using `prima.cpp`. We focus on three key metrics to evaluate performance:
- **TTFT (Time To First Token):** Measured by `prompt eval time`. This shows how quickly the model starts generating a response after receiving a prompt. Lower is better.
- **TPOT (Time Per Output Token):** Measured by `ms per token for eval time`. This indicates the latency for generating each subsequent token. Lower is better.
- **Throughput (tokens/s):** Measured by `tokens per second for eval time`. This is the inverse of TPOT and represents the generation speed. Higher is better.
The benchmarks explore scenarios with GPUs in the same region (low latency), different regions (high latency), and compare them against local single-GPU performance.
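If you want to pull these metrics out of a saved `llama-cli` log automatically, a minimal sketch is shown below (`extract_metrics.sh` is a hypothetical helper, not part of prima.cpp; it assumes the `llama_perf_context_print` output format shown in the Benchmark Details section):
```bash
#!/usr/bin/env bash
# extract_metrics.sh <log-file>
# Usage: ./llama-cli ... 2>&1 | tee run.log && bash extract_metrics.sh run.log
LOG="$1"

# TTFT: total prompt eval time in ms
ttft=$(grep -oP 'prompt eval time\s*=\s*\K[0-9.]+' "$LOG")

# TPOT and throughput come from the generation "eval time" line
tpot=$(grep ' eval time' "$LOG" | grep -v prompt | grep -oP '[0-9.]+(?= ms per token)')
tps=$(grep ' eval time' "$LOG" | grep -v prompt | grep -oP '[0-9.]+(?= tokens per second)')

echo "TTFT (ms):          $ttft"
echo "TPOT (ms/token):    $tpot"
echo "Throughput (tok/s): $tps"
```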
### Table 1: 2x 4090 on Different Nodes in the **Same Region**
**Setup:**
- **GPUs:** 2x RTX 4090
- **Nodes:** Two separate nodes
- **Region:** Same Region (TW-1)
- **Network RTT:** ~0.17 ms - 0.30 ms
- **Layer Split:** Model layers are split evenly between the two GPUs.
| Model | Context | TTFT (ms) | TPOT (ms/token) | Throughput (tokens/s) | Layer Split (GPU 1 / GPU 2) |
| --- | --- | --- | --- | --- | --- |
| **QwQ-32B** | Short (~17 tokens) | 66.95 | 36.37 | 27.49 | 32 / 32 |
| **QwQ-32B** | Long (~1135 tokens) | 858.46 | 37.41 | 26.73 | 32 / 32 |
| **DeepSeek-8B** | Short (~10 tokens) | 33.66 | 14.69 | 68.06 | 16 / 16 |
| **DeepSeek-8B** | Long (~1105 tokens) | 420.60 | 15.51 | 64.48 | 16 / 16 |
| **DeepSeek-70B** | Short (~10 tokens) | 108.74 | 63.64 | 15.71 | 40 / 40 |
| **DeepSeek-70B** | Long (~1105 tokens) | 1478.42 | 65.81 | 15.20 | 40 / 40 |
**Description & Insights:**
Observe that as model size increases, the computation time per token (TPOT) also increases, leading to lower throughput. For the largest model, DeepSeek-70B, the throughput stabilizes at around **15 tokens/s**.
For DeepSeek-70B, the TTFT for long contexts is notable, taking nearly 1.5 seconds to process the prompt before generation begins.
### Table 2: 2x 4090 on Different Nodes in **Different Regions**
**Setup:**
- **GPUs:** 2x RTX 4090
- **Nodes:** Two separate nodes
- **Regions:** TW-1 (Banciao) & TW-3 (Yunlin)
- **Network RTT:** ~5 ms - 7 ms
- **Layer Split:** Model layers are split evenly between the two GPUs.
| Model | Context | TTFT (ms) | TPOT (ms/token) | Throughput (tokens/s) | Layer Split (GPU 1 / GPU 2) |
| --- | --- | --- | --- | --- | --- |
| **QwQ-32B** | Short (~17 tokens) | 157.37 | 48.90 | 20.45 | 32 / 32 |
| **QwQ-32B** | Long (~1135 tokens) | 7650.91 | 49.78 | 20.09 | 32 / 32 |
| **DeepSeek-8B** | Short (~10 tokens) | 93.11 | 22.24 | 44.97 | 16 / 16 |
| **DeepSeek-8B** | Long (~1105 tokens) | 5633.75 | 23.97 | 41.73 | 16 / 16 |
| **DeepSeek-70B** | Short (~10 tokens) | 240.15 | 81.40 | 12.29 | 40 / 40 |
| **DeepSeek-70B** | Long (~1105 tokens) | 13909.01 | 83.01 | 12.05 | 40 / 40 |
**Description & Insights:**
This benchmark introduces a significant network latency of ~6ms by running across different geographical regions. The impact is immediate and substantial.
Compared to the same-region setup, TTFT values are drastically higher, especially for long contexts (e.g., from 1.5s to 13.9s for DeepSeek-70B).
Throughput also drops across all models, as each token generation step must now wait for data to travel across the higher-latency link.
### Table 3: Performance Comparison: Same Region vs. Different Regions
**Setup:**
- **Slowdown Factor:** Calculated as `(High RTT Metric) / (Low RTT Metric)` for TTFT and TPOT. For Throughput, it’s `(Low RTT Metric) / (High RTT Metric)`. A value of `2.0x` means it is twice as slow.
| Model | Context | Metric | Same Region (Low RTT) | Different Region (High RTT) | Slowdown Factor |
| --- | --- | --- | --- | --- | --- |
| **QwQ-32B** | Short | TTFT (ms) | 66.95 | 157.37 | **2.35x** |
| | | TPOT (ms/token) | 36.37 | 48.90 | **1.34x** |
| | | Throughput (tok/s) | 27.49 | 20.45 | **1.34x** |
| **QwQ-32B** | Long | TTFT (ms) | 858.46 | 7650.91 | **8.91x** |
| | | TPOT (ms/token) | 37.41 | 49.78 | **1.33x** |
| | | Throughput (tok/s) | 26.73 | 20.09 | **1.33x** |
| — | — | — | — | — | — |
| **DeepSeek-8B** | Short | TTFT (ms) | 33.66 | 93.11 | **2.77x** |
| | | TPOT (ms/token) | 14.69 | 22.24 | **1.51x** |
| | | Throughput (tok/s) | 68.06 | 44.97 | **1.51x** |
| **DeepSeek-8B** | Long | TTFT (ms) | 420.60 | 5633.75 | **13.40x** |
| | | TPOT (ms/token) | 15.51 | 23.97 | **1.55x** |
| | | Throughput (tok/s) | 64.48 | 41.73 | **1.55x** |
| — | — | — | — | — | — |
| **DeepSeek-70B** | Short | TTFT (ms) | 108.74 | 240.15 | **2.21x** |
| | | TPOT (ms/token) | 63.64 | 81.40 | **1.28x** |
| | | Throughput (tok/s) | 15.71 | 12.29 | **1.28x** |
| **DeepSeek-70B** | Long | TTFT (ms) | 1478.42 | 13909.01 | **9.41x** |
| | | TPOT (ms/token) | 65.81 | 83.01 | **1.26x** |
| | | Throughput (tok/s) | 15.20 | 12.05 | **1.26x** |
**Description & Insights:**
- **TTFT is Highly Sensitive to Latency:** TTFT is dramatically affected, with a slowdown factor reaching over **13x** for long prompts. This is because the entire prompt must be processed across the high-latency link.
- **TPOT is Less Sensitive but Still Impacted:** Subsequent token generation (TPOT) is less affected but still sees a consistent slowdown of **1.25x to 1.55x**. This is because each token requires a communication round trip.
- **Larger Models Mask Latency Better:** The largest model (DeepSeek-70B) shows the least relative slowdown in TPOT (`~1.26x`). This suggests its longer on-GPU computation time for each token helps to hide a portion of the network latency.
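These factors can be reproduced directly from Tables 1 and 2 with `bc` (installed in the network-tools step); for example, the QwQ-32B long-context row:
```bash
# TTFT slowdown = different-region TTFT / same-region TTFT
echo "scale=2; 7650.91 / 858.46" | bc   # -> 8.91
# Throughput slowdown = same-region tok/s / different-region tok/s
echo "scale=2; 26.73 / 20.09" | bc      # -> 1.33
```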
### Table 4: Local vs. Distributed Performance — A Comprehensive Comparison
**Setup Definitions:**
- **Local (1x GPU):** The baseline performance on a single RTX 4090. For models too large for one GPU (DeepSeek-70B), this involves offloading layers to the CPU.
- **Distributed (Same Region):** The model is split across two RTX 4090s in a low-latency region (RTT < 0.3ms).
- **Distributed (Different Regions):** The model is split across two RTX 4090s in different geographical regions (RTT ~6ms).
The best performance for each model and context is marked in **bold**.
| Model | Context | Metric | Local (1x GPU) | Distributed (Same Region) | Distributed (Different Regions) |
| --- | --- | --- | --- | --- | --- |
| **QwQ-32B** | Short | TTFT (ms) | **59.74** | 66.95 | 157.37 |
| | | TPOT (ms/token) | **34.85** | 36.37 | 48.90 |
| | | Throughput (tok/s) | **28.69** | 27.49 | 20.45 |
| **QwQ-32B** | Long | TTFT (ms) | **496.27** | 858.46 | 7650.91 |
| | | TPOT (ms/token) | **35.75** | 37.41 | 49.78 |
| | | Throughput (tok/s) | **27.97** | 26.73 | 20.09 |
| — | — | — | — | — | — |
| **DeepSeek-8B** | Short | TTFT (ms) | **25.17** | 33.66 | 93.11 |
| | | TPOT (ms/token) | **13.59** | 14.69 | 22.24 |
| | | Throughput (tok/s) | **73.60** | 68.06 | 44.97 |
| **DeepSeek-8B** | Long | TTFT (ms) | **212.25** | 420.60 | 5633.75 |
| | | TPOT (ms/token) | **14.17** | 15.51 | 23.97 |
| | | Throughput (tok/s) | **70.59** | 64.48 | 41.73 |
| — | — | — | — | — | — |
| **DeepSeek-70B** | Short | TTFT (ms) | 1531.06 | **108.74** | 240.15 |
| | | TPOT (ms/token) | 463.06 | **63.64** | 81.40 |
| | | Throughput (tok/s) | 2.16 | **15.71** | 12.29 |
| **DeepSeek-70B** | Long | TTFT (ms) | 11090.16 | **1478.42** | 13909.01 |
| | | TPOT (ms/token) | 477.66 | **65.81** | 83.01 |
| | | Throughput (tok/s) | 2.09 | **15.20** | 12.05 |
**Final Conclusions:**
This final comparison reveals the core trade-offs of distributed inference:
1. **If the model fits on one GPU, local is always fastest.** For the **QwQ-32B** and **DeepSeek-8B** models, local execution is unequivocally superior. Distributing a model that already fits in a single GPU’s VRAM only adds network latency, which harms performance without providing any benefit.
2. **If the model is too large for one GPU, distribution is essential.** For the **DeepSeek-70B** model, which cannot fit on a single 4090, the local performance is abysmal due to slow CPU offloading. In this case, distributed inference is a transformative solution.
- The `Distributed (Same Region)` setup becomes the optimal choice, delivering a **~7.3x increase in throughput** and making the large model practical to use.
- Even the high-latency `Distributed (Different Regions)` setup is substantially faster than the local offloaded run, demonstrating that keeping all model layers in VRAM is paramount for performance, even if it means paying a network penalty.
## Benchmark Details
### 2x 4090 on different nodes in the same region
1. QwQ-32B
- Longer context prompts
```
<|im_start|>user\nThe following is a detailed documents containing a mix of technical explanations, narrative passages, and structured information. Please carefully read through the entire content and provide a concise summary that captures the main ideas, key points, and any important structure or hierarchy in the text. === BEGIN DOCUMENT === The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development. We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. In this work, we present Qwen3, the latest version of the Qwen model family. 
Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models—such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ32B)—and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0. === END DOCUMENT === Now, provide a summary of the above document. Make sure to: - Highlight the central topic and scope - Summarize major sections or chapters - Point out any arguments, conclusions, or recommendations - Be concise, but preserve essential meaning and nuance\n<|im_end|>\n<|im_start|>assistant\n<think>\n
```
- Local RUN:
- GPU Layers/ ALL Layers = (64/64)
- Short Context Commands = `./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -p "<|im_start|>user\nWhat is 1+1?\n<|im_end|>\n<|im_start|>assistant\n<think>\n" -n 512 -ngl 64`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 49.24 ms / 492 runs ( 0.10 ms per token, 9991.47 tokens per second)
llama_perf_context_print: load time = 6035.24 ms
llama_perf_context_print: prompt eval time = 59.74 ms / 17 tokens ( 3.51 ms per token, 284.55 tokens per second)
llama_perf_context_print: eval time = 16345.39 ms / 469 runs ( 34.85 ms per token, 28.69 tokens per second)
llama_perf_context_print: total time = 16815.64 ms / 486 tokens
```
2. 4090-1 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 38.21 ms / 377 runs ( 0.10 ms per token, 9867.56 tokens per second)
llama_perf_context_print: load time = 6288.86 ms
llama_perf_context_print: prompt eval time = 61.71 ms / 17 tokens ( 3.63 ms per token, 275.47 tokens per second)
llama_perf_context_print: eval time = 12870.67 ms / 354 runs ( 36.36 ms per token, 27.50 tokens per second)
llama_perf_context_print: total time = 13265.88 ms / 371 tokens
```
- Longer Context Commands = `./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 2048 -p "" -n 1024 -ngl 64`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 106.58 ms / 2159 runs ( 0.05 ms per token, 20257.27 tokens per second)
llama_perf_context_print: load time = 3405.53 ms
llama_perf_context_print: prompt eval time = 496.27 ms / 1135 tokens ( 0.44 ms per token, 2287.05 tokens per second)
llama_perf_context_print: eval time = 36392.05 ms / 1018 runs ( 35.75 ms per token, 27.97 tokens per second)
llama_perf_context_print: total time = 37399.00 ms / 2153 tokens
```
2. 4090-1 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 110.71 ms / 2159 runs ( 0.05 ms per token, 19501.22 tokens per second)
llama_perf_context_print: load time = 4095.68 ms
llama_perf_context_print: prompt eval time = 574.95 ms / 1135 tokens ( 0.51 ms per token, 1974.10 tokens per second)
llama_perf_context_print: eval time = 37908.28 ms / 1018 runs ( 37.24 ms per token, 26.85 tokens per second)
llama_perf_context_print: total time = 39082.38 ms / 2153 tokens
```
- Distributed RUN:
- RTT: 0.174ms ~ 0.30ms
- GPU Layers/ ALL Layers = (32/64) | GPU Layers/ ALL Layers = (32/64)
- Short Context Commands =
- Master node: `./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 512 -p "<|im_start|>user\nWhat is 1+1?\n<|im_end|>\n<|im_start|>assistant\n<think>\n" --world 2 --rank 0 --prefetch -lw "32,32" -ngl 32 --master 127.0.0.1 --next tw-03.access.glows.ai --data_port 9000 --signal_port 10001 --master_data_port 9000 --next_node_data_port 50173 --next_node_signal_port 50692`
- Server node: `./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 2 --rank 1 --prefetch -ngl 32 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 10000 --signal_port 10001 --master_data_port 23832 --next_node_data_port 23832 --next_node_signal_port 23134`
```bash
llama_perf_sampler_print: sampling time = 45.08 ms / 450 runs ( 0.10 ms per token, 9981.37 tokens per second)
llama_perf_context_print: load time = 8553.51 ms
llama_perf_context_print: prompt eval time = 66.95 ms / 17 tokens ( 3.94 ms per token, 253.91 tokens per second)
llama_perf_context_print: eval time = 15530.41 ms / 427 runs ( 36.37 ms per token, 27.49 tokens per second)
llama_perf_context_print: total time = 15944.83 ms / 444 tokens
```
- Long Context Commands =
- Master node: `./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 2048 -n 1024 -p "" --world 2 --rank 0 --prefetch -lw "32,32" -ngl 32 --master 127.0.0.1 --next tw-03.access.glows.ai --data_port 9000 --signal_port 10001 --master_data_port 9000 --next_node_data_port 50173 --next_node_signal_port 50692`
- Server node: `./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 2 --rank 1 --prefetch -ngl 32 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 10000 --signal_port 10001 --master_data_port 23832 --next_node_data_port 23832 --next_node_signal_port 23134`
```bash
llama_perf_sampler_print: sampling time = 108.42 ms / 2159 runs ( 0.05 ms per token, 19912.38 tokens per second)
llama_perf_context_print: load time = 9239.47 ms
llama_perf_context_print: prompt eval time = 858.46 ms / 1135 tokens ( 0.76 ms per token, 1322.13 tokens per second)
llama_perf_context_print: eval time = 38086.41 ms / 1018 runs ( 37.41 ms per token, 26.73 tokens per second)
llama_perf_context_print: total time = 39465.33 ms / 2153 tokens
```
2. DeepSeek-R1-Distill-Llama-8B
- Longer context prompts
```
<|User|>The following is a detailed documents containing a mix of technical explanations, narrative passages, and structured information. Please carefully read through the entire content and provide a concise summary that captures the main ideas, key points, and any important structure or hierarchy in the text. === BEGIN DOCUMENT === The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development. We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. In this work, we present Qwen3, the latest version of the Qwen model family. 
Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models—such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ32B)—and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0. === END DOCUMENT === Now, provide a summary of the above document. Make sure to: - Highlight the central topic and scope - Summarize major sections or chapters - Point out any arguments, conclusions, or recommendations - Be concise, but preserve essential meaning and nuance\n<|Assistant|>
```
- Local RUN:
- GPU Layers/ ALL Layers = (32/32)
- Short Context Commands =`./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -c 1024 -p "<|User|>What is 1+1?<|Assistant|>" -n 512 -ngl 64`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 30.55 ms / 348 runs ( 0.09 ms per token, 11391.53 tokens per second)
llama_perf_context_print: load time = 958.87 ms
llama_perf_context_print: prompt eval time = 25.17 ms / 10 tokens ( 2.52 ms per token, 397.30 tokens per second)
llama_perf_context_print: eval time = 4510.77 ms / 332 runs ( 13.59 ms per token, 73.60 tokens per second)
llama_perf_context_print: total time = 4696.28 ms / 342 tokens
```
2. 4090-1 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 12.54 ms / 149 runs ( 0.08 ms per token, 11885.77 tokens per second)
llama_perf_context_print: load time = 961.47 ms
llama_perf_context_print: prompt eval time = 24.59 ms / 10 tokens ( 2.46 ms per token, 406.74 tokens per second)
llama_perf_context_print: eval time = 1927.70 ms / 133 runs ( 14.49 ms per token, 68.99 tokens per second)
llama_perf_context_print: total time = 2069.22 ms / 143 tokens
```
- Longer Context Commands = `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -c 2048 -p "" -n 1024 -ngl 64`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 50.48 ms / 1644 runs ( 0.03 ms per token, 32564.13 tokens per second)
llama_perf_context_print: load time = 899.13 ms
llama_perf_context_print: prompt eval time = 212.25 ms / 1105 tokens ( 0.19 ms per token, 5206.00 tokens per second)
llama_perf_context_print: eval time = 7550.86 ms / 533 runs ( 14.17 ms per token, 70.59 tokens per second)
llama_perf_context_print: total time = 7981.97 ms / 1638 tokens
```
2. 4090-1 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 54.12 ms / 1699 runs ( 0.03 ms per token, 31390.30 tokens per second)
llama_perf_context_print: load time = 943.55 ms
llama_perf_context_print: prompt eval time = 221.83 ms / 1105 tokens ( 0.20 ms per token, 4981.25 tokens per second)
llama_perf_context_print: eval time = 8813.77 ms / 588 runs ( 14.99 ms per token, 66.71 tokens per second)
llama_perf_context_print: total time = 9300.27 ms / 1693 tokens
```
- Distributed RUN:
- RTT: 0.174ms ~ 0.30ms
- GPU Layers/ ALL Layers = (16/32) | GPU Layers/ ALL Layers = (16/32)
- Short Context Commands =
- Master node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -c 1024 -n 512 -p "<|User|>What is 1+1?<|Assistant|>" --world 2 --rank 0 --prefetch -lw "16,16" -ngl 16 --master 127.0.0.1 --next tw-03.access.glows.ai --data_port 9000 --signal_port 10001 --master_data_port 9000 --next_node_data_port 50173 --next_node_signal_port 50692`
- Server node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf --world 2 --rank 1 --prefetch -ngl 16 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 10000 --signal_port 10001 --master_data_port 23832 --next_node_data_port 23832 --next_node_signal_port 23134`
```bash
llama_perf_sampler_print: sampling time = 17.84 ms / 208 runs ( 0.09 ms per token, 11661.15 tokens per second)
llama_perf_context_print: load time = 6986.87 ms
llama_perf_context_print: prompt eval time = 33.66 ms / 10 tokens ( 3.37 ms per token, 297.07 tokens per second)
llama_perf_context_print: eval time = 2821.07 ms / 192 runs ( 14.69 ms per token, 68.06 tokens per second)
llama_perf_context_print: total time = 2981.61 ms / 202 tokens
```
- Long Context Commands =
- Master node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -c 2048 -n 1024 -p "" --world 2 --rank 0 --prefetch -lw "16,16" -ngl 16 --master 127.0.0.1 --next tw-03.access.glows.ai --data_port 9000 --signal_port 10001 --master_data_port 9000 --next_node_data_port 50173 --next_node_signal_port 50692`
- Server node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf --world 2 --rank 1 --prefetch -ngl 16 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 10000 --signal_port 10001 --master_data_port 23832 --next_node_data_port 23832 --next_node_signal_port 23134`
```bash
llama_perf_sampler_print: sampling time = 50.63 ms / 1662 runs ( 0.03 ms per token, 32828.33 tokens per second)
llama_perf_context_print: load time = 584.88 ms
llama_perf_context_print: prompt eval time = 420.60 ms / 1105 tokens ( 0.38 ms per token, 2627.22 tokens per second)
llama_perf_context_print: eval time = 8544.92 ms / 551 runs ( 15.51 ms per token, 64.48 tokens per second)
llama_perf_context_print: total time = 9190.89 ms / 1656 tokens
```
3. DeepSeek-R1-Distill-Llama-70B
- Longer context prompts
```
<|User|>The following is a detailed documents containing a mix of technical explanations, narrative passages, and structured information. Please carefully read through the entire content and provide a concise summary that captures the main ideas, key points, and any important structure or hierarchy in the text. === BEGIN DOCUMENT === The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development. We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. In this work, we present Qwen3, the latest version of the Qwen model family. 
Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models—such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ32B)—and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0. === END DOCUMENT === Now, provide a summary of the above document. Make sure to: - Highlight the central topic and scope - Summarize major sections or chapters - Point out any arguments, conclusions, or recommendations - Be concise, but preserve essential meaning and nuance\n<|Assistant|>
```
- Local RUN:
- GPU Layers/ ALL Layers = (40/80)
- Short Context Commands =`./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -c 1024 -p "<|User|>What is 1+1?<|Assistant|>" -n 512 -ngl 40`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 9.58 ms / 116 runs ( 0.08 ms per token, 12111.09 tokens per second)
llama_perf_context_print: load time = 22787.25 ms
llama_perf_context_print: prompt eval time = 1531.06 ms / 10 tokens ( 153.11 ms per token, 6.53 tokens per second)
llama_perf_context_print: eval time = 46306.12 ms / 100 runs ( 463.06 ms per token, 2.16 tokens per second)
llama_perf_context_print: total time = 50248.27 ms / 110 tokens
```
2. 4090-1 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 16.21 ms / 168 runs ( 0.10 ms per token, 10365.25 tokens per second)
llama_perf_context_print: load time = 17564.82 ms
llama_perf_context_print: prompt eval time = 1677.34 ms / 10 tokens ( 167.73 ms per token, 5.96 tokens per second)
llama_perf_context_print: eval time = 88459.93 ms / 152 runs ( 581.97 ms per token, 1.72 tokens per second)
llama_perf_context_print: total time = 93082.67 ms / 162 tokens
```
- Longer Context Commands = `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -c 2048 -p "" -n 1024 -ngl 40`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 84.87 ms / 2019 runs ( 0.04 ms per token, 23789.32 tokens per second)
llama_perf_context_print: load time = 9315.00 ms
llama_perf_context_print: prompt eval time = 11090.16 ms / 1105 tokens ( 10.04 ms per token, 99.64 tokens per second)
llama_perf_context_print: eval time = 433713.00 ms / 908 runs ( 477.66 ms per token, 2.09 tokens per second)
llama_perf_context_print: total time = 447865.75 ms / 2013 tokens
```
2. 4090-1 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 97.42 ms / 2129 runs ( 0.05 ms per token, 21853.38 tokens per second)
llama_perf_context_print: load time = 4496.53 ms
llama_perf_context_print: prompt eval time = 8772.86 ms / 1105 tokens ( 7.94 ms per token, 125.96 tokens per second)
llama_perf_context_print: eval time = 598811.42 ms / 1018 runs ( 588.22 ms per token, 1.70 tokens per second)
llama_perf_context_print: total time = 610962.89 ms / 2123 tokens
```
- Distributed RUN:
- RTT: 0.174ms ~ 0.30ms
- GPU Layers/ ALL Layers = (40/80) | GPU Layers/ ALL Layers = (40/80)
- Short Context Commands =
- Master node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -c 1024 -n 512 -p "<|User|>What is 1+1?<|Assistant|>" --world 2 --rank 0 --prefetch -lw "40,40" -ngl 40 --master 127.0.0.1 --next tw-03.access.glows.ai --data_port 9000 --signal_port 10001 --master_data_port 9000 --next_node_data_port 50173 --next_node_signal_port 50692`
- Server node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf --world 2 --rank 1 --prefetch -ngl 40 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 10000 --signal_port 10001 --master_data_port 23832 --next_node_data_port 23832 --next_node_signal_port 23134`
```bash
llama_perf_sampler_print: sampling time = 13.69 ms / 163 runs ( 0.08 ms per token, 11909.11 tokens per second)
llama_perf_context_print: load time = 11939.41 ms
llama_perf_context_print: prompt eval time = 108.74 ms / 10 tokens ( 10.87 ms per token, 91.97 tokens per second)
llama_perf_context_print: eval time = 9355.41 ms / 147 runs ( 63.64 ms per token, 15.71 tokens per second)
llama_perf_context_print: total time = 9827.13 ms / 157 tokens
```
- Long Context Commands =
- Master node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -c 2048 -n 1024 -p "" --world 2 --rank 0 --prefetch -lw "40,40" -ngl 40 --master 127.0.0.1 --next tw-03.access.glows.ai --data_port 9000 --signal_port 10001 --master_data_port 9000 --next_node_data_port 50173 --next_node_signal_port 50692`
- Server node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf --world 2 --rank 1 --prefetch -ngl 40 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 10000 --signal_port 10001 --master_data_port 23832 --next_node_data_port 23832 --next_node_signal_port 23134`
```bash
llama_perf_sampler_print: sampling time = 80.00 ms / 1975 runs ( 0.04 ms per token, 24685.96 tokens per second)
llama_perf_context_print: load time = 6431.46 ms
llama_perf_context_print: prompt eval time = 1478.42 ms / 1105 tokens ( 1.34 ms per token, 747.42 tokens per second)
llama_perf_context_print: eval time = 56859.09 ms / 864 runs ( 65.81 ms per token, 15.20 tokens per second)
llama_perf_context_print: total time = 58895.63 ms / 1969 tokens
```
### 2x 4090 on different nodes in different regions (TW-1 Banciao; TW-3 Yunlin)
1. QwQ-32B
- Longer context prompts
```
<|im_start|>user\nThe following is a detailed documents containing a mix of technical explanations, narrative passages, and structured information. Please carefully read through the entire content and provide a concise summary that captures the main ideas, key points, and any important structure or hierarchy in the text. === BEGIN DOCUMENT === The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development. We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. In this work, we present Qwen3, the latest version of the Qwen model family. 
Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models—such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ32B)—and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0. === END DOCUMENT === Now, provide a summary of the above document. Make sure to: - Highlight the central topic and scope - Summarize major sections or chapters - Point out any arguments, conclusions, or recommendations - Be concise, but preserve essential meaning and nuance\n<|im_end|>\n<|im_start|>assistant\n<think>\n
```
- Local RUN:
- GPU Layers/ ALL Layers = (64/64)
- Short Context Commands = `./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -p "<|im_start|>user\nWhat is 1+1?\n<|im_end|>\n<|im_start|>assistant\n<think>\n" -n 512 -ngl 64`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 29.75 ms / 299 runs ( 0.10 ms per token, 10049.07 tokens per second)
llama_perf_context_print: load time = 8471.34 ms
llama_perf_context_print: prompt eval time = 75.07 ms / 17 tokens ( 4.42 ms per token, 226.45 tokens per second)
llama_perf_context_print: eval time = 9706.31 ms / 276 runs ( 35.17 ms per token, 28.44 tokens per second)
llama_perf_context_print: total time = 10060.85 ms / 293 tokens
```
2. 4090-1 (TW-03)
```bash
llama_perf_sampler_print: sampling time = 107.13 ms / 529 runs ( 0.20 ms per token, 4937.83 tokens per second)
llama_perf_context_print: load time = 15950.25 ms
llama_perf_context_print: prompt eval time = 65.43 ms / 17 tokens ( 3.85 ms per token, 259.80 tokens per second)
llama_perf_context_print: eval time = 20423.35 ms / 506 runs ( 40.36 ms per token, 24.78 tokens per second)
llama_perf_context_print: total time = 21009.79 ms / 523 tokens
```
- Longer Context Commands = `./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 2048 -p "" -n 1024 -ngl 64`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 106.19 ms / 2159 runs ( 0.05 ms per token, 20331.86 tokens per second)
llama_perf_context_print: load time = 3387.66 ms
llama_perf_context_print: prompt eval time = 565.96 ms / 1135 tokens ( 0.50 ms per token, 2005.43 tokens per second)
llama_perf_context_print: eval time = 36405.00 ms / 1018 runs ( 35.76 ms per token, 27.96 tokens per second)
llama_perf_context_print: total time = 37474.49 ms / 2153 tokens
```
2. 4090-1 (TW-03)
```bash
llama_perf_sampler_print: sampling time = 210.53 ms / 2159 runs ( 0.10 ms per token, 10254.83 tokens per second)
llama_perf_context_print: load time = 3413.56 ms
llama_perf_context_print: prompt eval time = 613.74 ms / 1135 tokens ( 0.54 ms per token, 1849.33 tokens per second)
llama_perf_context_print: eval time = 42003.69 ms / 1018 runs ( 41.26 ms per token, 24.24 tokens per second)
llama_perf_context_print: total time = 43395.86 ms / 2153 tokens
```
- Distributed RUN:
- RTT: 5ms ~ 7ms
- GPU Layers/ ALL Layers = (32/64) | GPU Layers/ ALL Layers = (32/64)
- Short Context Commands =
- Master node: `./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 512 -p "<|im_start|>user\nWhat is 1+1?\n<|im_end|>\n<|im_start|>assistant\n<think>\n" --world 2 --rank 0 --prefetch -lw "32,32" -ngl 32 --master 127.0.0.1 --next tw-05.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 9000 --next_node_data_port 26765 --next_node_signal_port 25072`
- Server node: `./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 2 --rank 1 --prefetch -ngl 32 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 23436 --next_node_data_port 23436 --next_node_signal_port 23670`
```bash
llama_perf_sampler_print: sampling time = 53.17 ms / 514 runs ( 0.10 ms per token, 9667.11 tokens per second)
llama_perf_context_print: load time = 7978.43 ms
llama_perf_context_print: prompt eval time = 157.37 ms / 17 tokens ( 9.26 ms per token, 108.03 tokens per second)
llama_perf_context_print: eval time = 24009.95 ms / 491 runs ( 48.90 ms per token, 20.45 tokens per second)
llama_perf_context_print: total time = 24584.34 ms / 508 tokens
```
- Long Context Commands =
- Master node: `./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 2048 -n 1024 -p "" --world 2 --rank 0 --prefetch -lw "32,32" -ngl 32 --master 127.0.0.1 --next tw-05.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 9000 --next_node_data_port 26765 --next_node_signal_port 25072`
- Server node: `./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 2 --rank 1 --prefetch -ngl 32 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 23436 --next_node_data_port 23436 --next_node_signal_port 23670`
```bash
llama_perf_sampler_print: sampling time = 96.53 ms / 2053 runs ( 0.05 ms per token, 21267.78 tokens per second)
llama_perf_context_print: load time = 3962.39 ms
llama_perf_context_print: prompt eval time = 7650.91 ms / 1135 tokens ( 6.74 ms per token, 148.35 tokens per second)
llama_perf_context_print: eval time = 45403.43 ms / 912 runs ( 49.78 ms per token, 20.09 tokens per second)
llama_perf_context_print: total time = 53596.64 ms / 2047 tokens
```
2. DeepSeek-R1-Distill-Llama-8B
- Longer context prompts
```
<|User|>The following is a detailed documents containing a mix of technical explanations, narrative passages, and structured information. Please carefully read through the entire content and provide a concise summary that captures the main ideas, key points, and any important structure or hierarchy in the text. === BEGIN DOCUMENT === The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development. We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. In this work, we present Qwen3, the latest version of the Qwen model family. 
Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models—such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ32B)—and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0. === END DOCUMENT === Now, provide a summary of the above document. Make sure to: - Highlight the central topic and scope - Summarize major sections or chapters - Point out any arguments, conclusions, or recommendations - Be concise, but preserve essential meaning and nuance\n<|Assistant|>
```
- Local RUN:
- GPU Layers/ ALL Layers = (32/32)
- Short Context Commands = `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -c 1024 -p "<|User|>What is 1+1?<|Assistant|>" -n 512 -ngl 64`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 8.08 ms / 102 runs ( 0.08 ms per token, 12625.32 tokens per second)
llama_perf_context_print: load time = 2598.13 ms
llama_perf_context_print: prompt eval time = 26.83 ms / 10 tokens ( 2.68 ms per token, 372.70 tokens per second)
llama_perf_context_print: eval time = 1208.09 ms / 86 runs ( 14.05 ms per token, 71.19 tokens per second)
llama_perf_context_print: total time = 1328.83 ms / 96 tokens
```
2. 4090-1 (TW-03)
```bash
llama_perf_sampler_print: sampling time = 19.55 ms / 121 runs ( 0.16 ms per token, 6189.26 tokens per second)
llama_perf_context_print: load time = 3763.38 ms
llama_perf_context_print: prompt eval time = 31.39 ms / 10 tokens ( 3.14 ms per token, 318.55 tokens per second)
llama_perf_context_print: eval time = 1732.56 ms / 105 runs ( 16.50 ms per token, 60.60 tokens per second)
llama_perf_context_print: total time = 1906.07 ms / 115 tokens
```
- Longer Context Commands = `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -c 2048 -p "" -n 1024 -ngl 64`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 31.28 ms / 1448 runs ( 0.02 ms per token, 46285.64 tokens per second)
llama_perf_context_print: load time = 911.34 ms
llama_perf_context_print: prompt eval time = 152.32 ms / 1105 tokens ( 0.14 ms per token, 7254.27 tokens per second)
llama_perf_context_print: eval time = 4744.64 ms / 337 runs ( 14.08 ms per token, 71.03 tokens per second)
llama_perf_context_print: total time = 5064.27 ms / 1442 tokens
```
2. 4090-1 (TW-03)
```bash
llama_perf_sampler_print: sampling time = 158.41 ms / 1976 runs ( 0.08 ms per token, 12473.57 tokens per second)
llama_perf_context_print: load time = 938.46 ms
llama_perf_context_print: prompt eval time = 158.56 ms / 1105 tokens ( 0.14 ms per token, 6969.19 tokens per second)
llama_perf_context_print: eval time = 14920.32 ms / 865 runs ( 17.25 ms per token, 57.97 tokens per second)
llama_perf_context_print: total time = 15584.13 ms / 1970 tokens
```
- Distributed RUN:
- RTT: 5ms ~ 7ms
- GPU Layers/ ALL Layers = (16/32) | GPU Layers/ ALL Layers = (16/32)
- Short Context Commands =
- Master node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -c 1024 -n 512 -p "<|User|>What is 1+1?<|Assistant|>" --world 2 --rank 0 --prefetch -lw "16,16" -ngl 16 --master 127.0.0.1 --next tw-05.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 9000 --next_node_data_port 26765 --next_node_signal_port 25072`
- Server node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf --world 2 --rank 1 --prefetch -ngl 16 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 23436 --next_node_data_port 23436 --next_node_signal_port 23670`
```bash
llama_perf_sampler_print: sampling time = 46.45 ms / 522 runs ( 0.09 ms per token, 11237.89 tokens per second)
llama_perf_context_print: load time = 7153.67 ms
llama_perf_context_print: prompt eval time = 93.11 ms / 10 tokens ( 9.31 ms per token, 107.40 tokens per second)
llama_perf_context_print: eval time = 11251.75 ms / 506 runs ( 22.24 ms per token, 44.97 tokens per second)
llama_perf_context_print: total time = 11595.13 ms / 516 tokens
```
- Long Context Commands =
- Master node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -c 2048 -n 1024 -p "" --world 2 --rank 0 --prefetch -lw "16,16" -ngl 16 --master 127.0.0.1 --next tw-05.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 9000 --next_node_data_port 26765 --next_node_signal_port 25072`
- Server node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf --world 2 --rank 1 --prefetch -ngl 16 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 23436 --next_node_data_port 23436 --next_node_signal_port 23670`
```bash
llama_perf_sampler_print: sampling time = 93.47 ms / 2124 runs ( 0.04 ms per token, 22724.84 tokens per second)
llama_perf_context_print: load time = 1939.78 ms
llama_perf_context_print: prompt eval time = 5633.75 ms / 1105 tokens ( 5.10 ms per token, 196.14 tokens per second)
llama_perf_context_print: eval time = 24277.51 ms / 1013 runs ( 23.97 ms per token, 41.73 tokens per second)
llama_perf_context_print: total time = 30310.76 ms / 2118 tokens
```
3. DeepSeek-R1-Distill-Llama-70B
- Longer context prompt: identical to the long-context summarization prompt shown above.
- Local RUN:
- GPU Layers/ ALL Layers = (40/80)
- Short Context Commands = `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -c 1024 -p "<|User|>What is 1+1?<|Assistant|>" -n 512 -ngl 40`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 1.25 ms / 24 runs ( 0.05 ms per token, 19277.11 tokens per second)
llama_perf_context_print: load time = 21652.98 ms
llama_perf_context_print: prompt eval time = 1472.94 ms / 10 tokens ( 147.29 ms per token, 6.79 tokens per second)
llama_perf_context_print: eval time = 3662.41 ms / 8 runs ( 457.80 ms per token, 2.18 tokens per second)
llama_perf_context_print: total time = 7445.68 ms / 18 tokens
```
2. 4090-1 (TW-03)
```bash
llama_perf_sampler_print: sampling time = 1.70 ms / 23 runs ( 0.07 ms per token, 13561.32 tokens per second)
llama_perf_context_print: load time = 11997.84 ms
llama_perf_context_print: prompt eval time = 2118.80 ms / 10 tokens ( 211.88 ms per token, 4.72 tokens per second)
llama_perf_context_print: eval time = 3497.25 ms / 7 runs ( 499.61 ms per token, 2.00 tokens per second)
llama_perf_context_print: total time = 8240.47 ms / 17 tokens
```
- Longer Context Commands = `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -c 2048 -p "" -n 1024 -ngl 40`
1. 4090-0 (TW-01)
```bash
llama_perf_sampler_print: sampling time = 33.38 ms / 1466 runs ( 0.02 ms per token, 43921.15 tokens per second)
llama_perf_context_print: load time = 15369.37 ms
llama_perf_context_print: prompt eval time = 11105.35 ms / 1105 tokens ( 10.05 ms per token, 99.50 tokens per second)
llama_perf_context_print: eval time = 172086.32 ms / 355 runs ( 484.75 ms per token, 2.06 tokens per second)
llama_perf_context_print: total time = 186043.40 ms / 1460 tokens
```
2. 4090-1 (TW-03)
```bash
llama_perf_sampler_print: sampling time = 95.59 ms / 1996 runs ( 0.05 ms per token, 20879.97 tokens per second)
llama_perf_context_print: load time = 20139.95 ms
llama_perf_context_print: prompt eval time = 11816.95 ms / 1105 tokens ( 10.69 ms per token, 93.51 tokens per second)
llama_perf_context_print: eval time = 426649.52 ms / 885 runs ( 482.09 ms per token, 2.07 tokens per second)
llama_perf_context_print: total time = 441924.42 ms / 1990 tokens
```
- Distributed RUN:
- RTT: 5ms ~ 7ms
- GPU Layers/ ALL Layers = (40/80) | GPU Layers/ ALL Layers = (40/80)
- Short Context Commands =
- Master node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -c 1024 -n 512 -p "<|User|>What is 1+1?<|Assistant|>" --world 2 --rank 0 --prefetch -lw "40,40" -ngl 40 --master 127.0.0.1 --next tw-05.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 9000 --next_node_data_port 26765 --next_node_signal_port 25072`
- Server node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf --world 2 --rank 1 --prefetch -ngl 40 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 23436 --next_node_data_port 23436 --next_node_signal_port 23670`
```bash
llama_perf_sampler_print: sampling time = 2.22 ms / 35 runs ( 0.06 ms per token, 15737.41 tokens per second)
llama_perf_context_print: load time = 9570.21 ms
llama_perf_context_print: prompt eval time = 240.15 ms / 10 tokens ( 24.01 ms per token, 41.64 tokens per second)
llama_perf_context_print: eval time = 1546.57 ms / 19 runs ( 81.40 ms per token, 12.29 tokens per second)
llama_perf_context_print: total time = 2200.30 ms / 29 tokens
```
- Long Context Commands =
- Master node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -c 2048 -n 1024 -p "" --world 2 --rank 0 --prefetch -lw "40,40" -ngl 40 --master 127.0.0.1 --next tw-05.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 9000 --next_node_data_port 26765 --next_node_signal_port 25072`
- Server node: `./llama-cli -m download/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf --world 2 --rank 1 --prefetch -ngl 40 --master tw-02.access.glows.ai --next tw-02.access.glows.ai --data_port 9000 --signal_port 9001 --master_data_port 23436 --next_node_data_port 23436 --next_node_signal_port 23670`
```bash
llama_perf_sampler_print: sampling time = 46.16 ms / 1608 runs ( 0.03 ms per token, 34839.13 tokens per second)
llama_perf_context_print: load time = 4608.39 ms
llama_perf_context_print: prompt eval time = 13909.01 ms / 1105 tokens ( 12.59 ms per token, 79.44 tokens per second)
llama_perf_context_print: eval time = 41258.02 ms / 497 runs ( 83.01 ms per token, 12.05 tokens per second)
llama_perf_context_print: total time = 55711.08 ms / 1602 tokens
```
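The summary metrics (TTFT, TPOT, throughput) can be pulled straight out of the `llama_perf_context_print` lines in the runs above. The helper below is only a convenience sketch: the script name, the `run.log` redirect, and the assumption that the timing summary lands on stderr are mine, not part of the benchmark setup.

```bash
# extract_metrics.sh — minimal log-scraping sketch, assuming the llama_perf format shown above.
# Usage: ./llama-cli ... 2> run.log && bash extract_metrics.sh run.log
LOG="$1"
# TTFT: the full "prompt eval time" in ms
awk -F'=' '/prompt eval time/ {split($2, a, "ms"); gsub(/ /, "", a[1]); printf "TTFT: %s ms\n", a[1]}' "$LOG"
# TPOT and throughput: both live on the generation "eval time" line
awk -F'[(,]' '/ eval time/ && !/prompt/ {gsub(/\)/, "", $3); print "TPOT:" $2; print "Throughput:" $3}' "$LOG"
```

Throughput is simply the reciprocal of TPOT; for example, `echo "scale=4; 1000 / 83.01" | bc` prints 12.0467, which matches the ~12.05 tokens/s reported for the 70B distributed long-context run above.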
# Deprecation
## Private Network Benchmark
- This benchmark used the Glows Mesh Cluster feature, but that feature is limited to machines in the same region, so it has been deprecated.
1. For master node (192.168.1.1)
```bash
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 256 -p "what is edge AI?" --world 2 --rank 0 --master 192.168.1.1 --next 192.168.1.2 --prefetch -lw "32,32" -ngl 32
```
2. For client node (192.168.1.2)
```bash
./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 2 --rank 1 --master 192.168.1.1 --next 192.168.1.1 --prefetch -ngl 32
```
### Supporting manual port configuration by modifying the source code
- The original prima.cpp does not support setting ports manually because it assumes all machines are on the same subnet. After I added three port-related command-line arguments, this workaround is deprecated (see the command-line template after the code listings below).
1. For master node
```cpp
void llama_init_sockets(struct llama_context * ctx, uint32_t n_world, uint32_t my_rank) {
...
std::string recv_endp = "tcp://*:9000"; // listening data port
std::string send_endp = "tcp://tw-03.access.glows.ai:50644"; // next node data port
std::string master_endp = "tcp://127.0.0.1:9000"; // loopback data port
std::string signal_endp = "tcp://*:10000"; // listening signal port
...
}
int llama_rebuild_topo(llama_context * ctx, uint32_t * n_layer_window, device_info * dev_info_set) {
...
std::string send_endp = "tcp://tw-03.access.glows.ai:50644"; // next node data port
...
}
void llama_free_sockets(struct llama_context * ctx, char ** msg) {
...
std::string endp = "tcp://tw-02.access.glows.ai:23493"; // next node signal port
...
}
```
2. For client node
```cpp
void llama_init_sockets(struct llama_context * ctx, uint32_t n_world, uint32_t my_rank) {
...
std::string recv_endp = "tcp://*:9000"; // listening data port
std::string send_endp = "tcp://tw-03.access.glows.ai:50260"; // next node data port
std::string master_endp = "tcp://tw-03.access.glows.ai:50260"; // master data port
std::string signal_endp = "tcp://*:10000"; // listening signal port
...
}
int llama_rebuild_topo(llama_context * ctx, uint32_t * n_layer_window, device_info * dev_info_set) {
...
std::string send_endp = "tcp://tw-03.access.glows.ai:50260"; // next node data port
...
}
void llama_free_sockets(struct llama_context * ctx, char ** msg) {
...
std::string endp = "tcp://tw-02.access.glows.ai:23259"; // next node signal port
...
}
```
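Once the port arguments exist, nothing needs to be hardcoded: the endpoints above are passed on the command line instead, exactly as in the Distributed RUN commands earlier. The template below is an illustrative sketch only; every angle-bracketed value is a placeholder for the hostnames and forwarded ports assigned by Glows, not a tested configuration.

```bash
# Illustrative template (placeholders, not tested values): run-time equivalent of the
# hardcoded endpoints above, mirroring the DeepSeek-R1 distributed commands earlier.

# Master node (rank 0): listens on its own data/signal ports and dials the server's
# forwarded endpoints.
./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 256 -p "what is edge AI?" \
  --world 2 --rank 0 --prefetch -lw "32,32" -ngl 32 \
  --master 127.0.0.1 --next <server-forwarded-hostname> \
  --data_port <master-local-data-port> --signal_port <master-local-signal-port> \
  --master_data_port <master-local-data-port> \
  --next_node_data_port <server-forwarded-data-port> \
  --next_node_signal_port <server-forwarded-signal-port>

# Server node (rank 1): in a two-node ring the next node is the master, so both the
# master and next-node endpoints point at the master's forwarded ports.
./llama-cli -m download/qwq-32b-q4_k_m.gguf \
  --world 2 --rank 1 --prefetch -ngl 32 \
  --master <master-forwarded-hostname> --next <master-forwarded-hostname> \
  --data_port <server-local-data-port> --signal_port <server-local-signal-port> \
  --master_data_port <master-forwarded-data-port> \
  --next_node_data_port <master-forwarded-data-port> \
  --next_node_signal_port <master-forwarded-signal-port>
```

The DeepSeek-R1-Distill runs in the Distributed RUN sections above show concrete, working values for every one of these flags.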