## 1 Introduction and Motivation
In the **offloading inference** scenario, model parameters are stored in slower, lower-bandwidth storage such as CPU RAM or SSD. In this setting, diffusion LLMs are a better architectural choice than autoregressive LLMs. During inference, each denoising step in a diffusion LLM is compute-bound because its bidirectional attention recomputes the entire sequence. In contrast, autoregressive LLMs are memory-bound: with the KV cache enabled, each decoding step performs very little computation, which makes them a poor fit for offloading, where efficiency hinges on hiding data movement behind compute. Additionally, a diffusion LLM can commit multiple tokens per denoising step, so it needs far fewer forward passes per generated block, whereas autoregressive models generate tokens one at a time.
Below is a summary of two architectural features that make **diffusion LLMs** inherently more suitable for offloading than **autoregressive (AR) LLMs**:
| Key Point | Diffusion LLM | Autoregressive LLM |
| ----------------------------------- | ---------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| **1. Fewer model loads per output** | For example, 128 denoising steps can decode a 512-token block. | 512 decoding steps, one per token. |
| **2. Higher compute per load** | Each step uses full bidirectional attention across the entire context, making it very compute-heavy. | With KV cache, only performs GEMV for attention and FFN to generate a single new token, making it compute-light and memory-bound. |
Point ① is intuitive. Below I prove point ② by building an AIT‑based efficiency model and plotting efficiency vs. bandwidth.
---
## 2 Theoretical Model
To better demonstrate the impact of offloading on compute efficiency during inference, I follow the compute-efficiency analysis model from Section 4 of the ZeRO-Infinity paper [3]. I compare single-token decoding in an autoregressive model at the 4096th token with a single denoising step in a diffusion LLM at sequence length 4096. The goal is to show that, per forward pass, diffusion models (being more compute-intensive) place less demand on bandwidth in offloading scenarios, leaving more room for optimizations such as overlapping computation with PCIe I/O.
The hypothesis is that during autoregressive decoding only the new token needs to be computed, yet the entire model must still be moved, making the step memory-bound. In contrast, a diffusion denoising step must compute full bidirectional attention over the whole sequence while moving the same model weights, making it compute-bound. The following sections quantify whether each case is compute-bound or memory-bound.
## 2.1 Assumptions
- Consider only the attention and FFN layers; exclude all other layers (embeddings, normalization, LM head).
- Architectural parameters follow LLaDA and Llama 3 [1, 2].
- **Same backbone:** Llama‑3‑8B (32 layers, 4096 d_model, 8 KV / 32 total heads, FFN 4096→14336→4096 with SwiGLU gate).
- **Precision:** FP16 for parameters and activations → 2 bytes / weight element.
- **Context:** AR step decodes **one** new token after a 4095‑token prompt; diffusion step denoises the **entire 4096‑token** latent sequence.
- **Weights off‑chip:** Every step streams the full transformer block weights from slow memory (only model weights are offloaded; KV-cache offloading is not considered).
- **Hardware sweep:** SSD→GPU bandwidth 4–100 GB/s; peak GPU throughput 200–800 TFLOP/s (collected as constants in the sketch below).
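For later reference, the short sketch below collects these assumptions as plain constants; the names (`MODEL_CONFIG`, `BANDWIDTH_GBPS_RANGE`, `PEAK_TFLOPS_SWEEP`) are illustrative, introduced here only to mirror the bullets above.
```python
# Illustrative constants mirroring the assumptions above (names are my own).
MODEL_CONFIG = {
    "n_layers": 32,
    "d_model": 4096,
    "n_heads": 32,
    "n_kv_heads": 8,
    "head_dim": 128,            # 4096 / 32 heads
    "ffn_hidden": 14336,        # SwiGLU intermediate size
    "bytes_per_param": 2,       # FP16
    "ar_past_tokens": 4095,     # prompt length before the single decoded token
    "diffusion_seq_len": 4096,  # tokens denoised per diffusion step
}

BANDWIDTH_GBPS_RANGE = (4, 100)           # SSD→GPU bandwidth sweep, GB/s
PEAK_TFLOPS_SWEEP = (200, 400, 600, 800)  # GPU peak throughput sweep, TFLOP/s
```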
## 2.2 Data Transferred per Step (Parameter Size)
Since the model architecture is assumed to be the same, one forward pass of both the autoregressive and diffusion models requires transferring the entire model from slow memory to the GPU. Therefore, they incur the same data transfer cost per forward pass. Below is a detailed breakdown of the data transfer size.
| Component | Shape | Size |
| --------------------- | --------------------- | ----------------------- |
| Q projection | 4096 × 4096 | 2·4096·4096 = 33.55 MB |
| K projection | 4096 × 1024 | 2·4096·1024 = 8.39 MB |
| V projection | 4096 × 1024 | 2·4096·1024 = 8.39 MB |
| O projection | 4096 × 4096 | 2·4096·4096 = 33.55 MB |
| Up projection | 4096 × 14336 | 2·4096·14336 = 117.44 MB |
| Gate projection | 4096 × 14336 | 2·4096·14336 = 117.44 MB |
| Down projection | 14336 × 4096 | 2·14336·4096 = 117.44 MB |
| **Per-layer total** | | **436.2 MB** |
| **Model total** | 32 × per-layer total | **13.958 GB** |
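Under the FP16 assumption, the byte counts in the table can be reproduced in a few lines. This is only a sanity-check sketch of the same arithmetic, not part of the original analysis:
```python
# Sanity check for the table above: FP16 weight bytes for attention + FFN only.
BYTES_PER_PARAM = 2  # FP16
d_model, d_kv, d_ffn, n_layers = 4096, 1024, 14336, 32

attn_bytes = BYTES_PER_PARAM * (
    d_model * d_model      # Q projection
    + d_model * d_kv       # K projection
    + d_model * d_kv       # V projection
    + d_model * d_model    # O projection
)
ffn_bytes = BYTES_PER_PARAM * 3 * d_model * d_ffn  # up, gate, down projections

per_layer = attn_bytes + ffn_bytes
print(f"per layer: {per_layer / 1e6:.1f} MB")             # ≈ 436.2 MB
print(f"model:     {per_layer * n_layers / 1e9:.2f} GB")  # ≈ 13.96 GB
```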
## 2.3 FLOPs per step
- **Matrix–vector multiply:** 2 × m × n FLOPs (one multiply and one add per weight element). For example, the 4096-dim Q projection against a 4096 × 4096 matrix costs 2·4096·4096 ≈ 33.55 MFLOPs.
- **Matrix–matrix multiply:** 2 × m × n × k FLOPs.
- Omit soft‑max, bias adds, etc.
## 2.4 Autoregressive decode FLOPs (past length = 4095)
### 2.4.1 Attention block (Grouped‑Query)
| Component | Shape | FLOPs |
| --------------------- | --------------------- | ----------------------- |
| Q projection | 4096 × 4096 | 2·4096·4096 = 33.55 M |
| K projection | 4096 × 1024 | 2·4096·1024 = 8.39 M |
| V projection | 4096 × 1024 | 2·4096·1024 = 8.39 M |
| Attention scores Q·Kᵀ | 32 heads × 4096 × 128 | 2·32·4096·128 = 33.55 M |
| Weighted values A·V | 32 heads × 4096 × 128 | 2·32·4096·128 = 33.55 M |
| O projection | 4096 × 4096 | 2·4096·4096 = 33.55 M |
| **Attention total** | | **150.98 M** |
### 2.4.2 Feed‑forward block (SwiGLU)
| Step | Shape | FLOPs |
| ---------------------- | ------------- | ------------ |
| Up projection | 4096 × 14336 | 117.44 M |
| Gate projection | 4096 × 14336 | 117.44 M |
| Down projection | 14336 × 4096 | 117.44 M |
| **FFN total** | | **352.32 M** |
### 2.4.3 Per‑layer subtotal and total
Attention + FFN ≈ 151.0 M + 352.3 M ≈ 503.3 M FLOPs per layer
Model total = 32 × 503.3 M ≈ 16.11 GFLOPs
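The same per-layer and model totals can be reproduced with the short sketch below (my own cross-check of the tables in 2.4.1–2.4.3; the past length is rounded to 4096, as in the tables):
```python
# Sanity check for the AR single-token decode FLOPs (past length ≈ 4096).
d_model, d_kv, d_ffn = 4096, 1024, 14336
n_heads, head_dim = 32, 128
past, n_layers = 4096, 32  # 4095-token prompt rounded to 4096, as in the tables

attn = 2 * (
    d_model * d_model            # Q projection (vector × matrix)
    + 2 * d_model * d_kv         # K and V projections
    + n_heads * past * head_dim  # attention scores Q·Kᵀ
    + n_heads * past * head_dim  # weighted values A·V
    + d_model * d_model          # O projection
)
ffn = 2 * 3 * d_model * d_ffn  # up, gate, down projections

per_layer = attn + ffn
print(f"per layer: {per_layer / 1e6:.1f} MFLOPs")             # ≈ 503.3 M
print(f"model:     {per_layer * n_layers / 1e9:.2f} GFLOPs")  # ≈ 16.11 G
```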
## 2.5 Diffusion FLOPs (sequence length = 4096)
### 2.5.1 Attention block (Bidirectional)
| Component | Shape | FLOPs |
| --------------------- | --------------------- | ----------------------- |
| Q projection | 4096 × 4096 × 4096 | 2·4096·4096·4096 = 0.137 T |
| K projection | 4096 × 4096 × 1024 | 2·4096·4096·1024 = 0.034 T |
| V projection | 4096 × 4096 × 1024 | 2·4096·4096·1024 = 0.034 T |
| Attention scores Q·Kᵀ | 4096 × 32 heads × 4096 × 128 | 2·4096·32·4096·128 = 0.137 T |
| Weighted values A·V | 4096 × 32 heads × 4096 × 128 | 2·4096·32·4096·128 = 0.137 T |
| O projection | 4096 × 4096 × 4096 | 2·4096·4096·4096 = 0.137 T |
| **Attention total** | | **0.616 T** |
### 2.5.2 Feed‑forward block (SwiGLU)
| Step | Shape | FLOPs |
| ---------------------- | ------------- | ------------ |
| Up projection | 4096 × 4096 × 14336 | 2·4096·4096·14336 = 0.481 T |
| Gate projection | 4096 × 4096 × 14336 | 2·4096·4096·14336 = 0.481 T |
| Down projection | 4096 × 14336 × 4096 | 2·4096·14336·4096 = 0.481 T |
| **FFN total** | | **1.443 T** |
### 2.5.3 Per‑layer subtotal and total
Attention + FFN = 0.616 T + 1.443 T ≈ 2.059 T FLOPs per layer
Model total = 32 × 2.059 T ≈ 65.888 TFLOPs
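As with the AR case, the diffusion totals can be cross-checked with a few lines (my own sketch; the exact products come out slightly above the rounded table values):
```python
# Sanity check for the diffusion single-step FLOPs (full 4096-token sequence).
d_model, d_kv, d_ffn = 4096, 1024, 14336
n_heads, head_dim = 32, 128
seq_len, n_layers = 4096, 32

attn = 2 * seq_len * (
    d_model * d_model                # Q projection
    + 2 * d_model * d_kv             # K and V projections
    + n_heads * seq_len * head_dim   # attention scores Q·Kᵀ (all token pairs)
    + n_heads * seq_len * head_dim   # weighted values A·V
    + d_model * d_model              # O projection
)
ffn = 2 * seq_len * 3 * d_model * d_ffn  # up, gate, down projections

per_layer = attn + ffn
print(f"per layer: {per_layer / 1e12:.3f} TFLOPs")             # ≈ 2.06 T (2.059 T with the rounding above)
print(f"model:     {per_layer * n_layers / 1e12:.2f} TFLOPs")  # ≈ 66 T (65.888 T with the rounding above)
```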
## 2.6 Arithmetic intensity and Efficiency
To understand the bandwidth requirements in offloading scenarios, I adopt the efficiency metric introduced in the *ZeRO-Infinity* paper [3]. This metric quantifies the trade-off between compute time and communication time, assuming **no overlap** between the two. The original formulation is:
$$
\text{efficiency} = \frac{\text{compute time}}{\text{compute time} + \text{communication time}}
$$
This reflects the fraction of time spent doing useful computation versus waiting on data movement. To express this in terms of system characteristics, define:
$$
\text{compute time} = \frac{\text{total computation}}{peak_{tp}} \quad,\quad \text{communication time} = \frac{\text{total data movement}}{bw}
$$
**Arithmetic Intensity (AIT)**:
$$
AIT = \frac{\text{total computation}}{\text{total data movement}}
$$
Substituting these into the efficiency formula yields the simplified expression used throughout this section:
$$
\text{efficiency} = \frac{AIT \times bw}{AIT \times bw + peak_{tp}}
$$
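For completeness, here is the algebra behind that substitution (my own expansion; it simply multiplies numerator and denominator by $bw \cdot peak_{tp} / \text{total data movement}$):
$$
\text{efficiency}
= \frac{\dfrac{\text{total computation}}{peak_{tp}}}{\dfrac{\text{total computation}}{peak_{tp}} + \dfrac{\text{total data movement}}{bw}}
= \frac{\dfrac{\text{total computation}}{\text{total data movement}} \times bw}{\dfrac{\text{total computation}}{\text{total data movement}} \times bw + peak_{tp}}
= \frac{AIT \times bw}{AIT \times bw + peak_{tp}}
$$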
This equation shows how efficiency depends on the arithmetic intensity of the workload, the available transfer bandwidth ($bw$), and the peak computational throughput of the accelerator ($peak_{tp}$).
Higher AIT means more compute per byte moved, making the workload less sensitive to bandwidth constraints. I use this model, as in *ZeRO-Infinity*, to characterize the bandwidth demands of autoregressive and diffusion LLMs under offloading and to quantify how they impact overall system efficiency.
### 2.6.1 Arithmetic intensity
$$
\text{AIT}_{\text{AR}}=\frac{16.11\times10^{9}}{13.958\times10^{9}}\approx 1.15\ \text{F/B} \quad\bigl(\textbf{memory-bound}\bigr)
$$
$$
\text{AIT}_{\text{Diff}}=\frac{65.888\times10^{12}}{13.958\times10^{9}}\approx 4.72\times10^{3}\ \text{F/B}\quad\bigl(\textbf{compute-bound}\bigr)
$$
Diffusion’s AIT is **\~4100×** higher.
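As a quick numeric spot-check (my own back-of-the-envelope script, using the AIT values above and the efficiency formula from Section 2.6), a few sample operating points line up with the plots in the next section:
```python
# Spot-check the efficiency formula at a few operating points.
def efficiency(ait, bw_gbs, peak_tflops):
    bw = bw_gbs * 1e9          # GB/s → B/s
    peak = peak_tflops * 1e12  # TFLOP/s → FLOP/s
    return ait * bw / (ait * bw + peak)

AIT_AR, AIT_DIFF = 1.15, 4.72e3

# Diffusion: ~19% at 40 GB/s on an 800 TFLOP/s GPU, ~70% at 100 GB/s on 200 TFLOP/s.
print(f"{efficiency(AIT_DIFF, 40, 800):.1%}")   # ≈ 19.1%
print(f"{efficiency(AIT_DIFF, 100, 200):.1%}")  # ≈ 70.2%

# Autoregressive: stays below 0.06% even at 100 GB/s on a 200 TFLOP/s GPU.
print(f"{efficiency(AIT_AR, 100, 200):.3%}")    # ≈ 0.057%
```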
## 3. Plots - Efficiency vs. Bandwidth
- **Diffusion LLM (first plot)**

    - Even on an 800 TFLOP/s GPU, a modest 40 GB/s SSD link already yields roughly 20 % efficiency; at 100 GB/s, efficiency reaches 37–70 % across the 800–200 TFLOP/s range.
- **Autoregressive LLM**

    - Efficiency remains **<0.06 %** across the entire bandwidth range; even 100 GB/s cannot feed a 200 TFLOP/s GPU.
The orders‑of‑magnitude gap visualises point ②: diffusion steps have enough compute to *hide* data movement; AR decoding does not.
## 4. Discussion and Takeaways
1. **Bandwidth tolerant:** Diffusion LLMs maintain respectable efficiency (>20%) with current SSD bandwidths. In contrast, AR LLMs waste nearly all GPU compute.
2. **Scaling outlook:** As GPU peak throughput scales from 200 to 800 TFLOP/s, AR efficiency degrades further, while diffusion degrades more gracefully thanks to its higher AIT.
3. **Aggregate benefit:** When combining the per-step efficiency advantage with the fact that diffusion LLMs require 4 to 8 times fewer parameter loads to generate the same number of tokens (see point ①), diffusion LLMs emerge as a better architecture for offloaded inference.
4. This discussion does not consider optimizations for diffusion LLMs. Recent work such as Fast-dLLM [4] proposes two relevant techniques. The first is **DualCache**, an approximate key-value cache that reduces the compute required per denoising step with minimal quality loss. Even with DualCache aggressively approximating the attention computation, a diffusion model still performs more computation per forward pass than an autoregressive model, since it must compute all tokens within a sub-block rather than a single token, so it retains more room to benefit from offloading. The second is **confidence-aware parallel decoding**, which reduces the total number of denoising steps. Fewer denoising steps also help in offloading scenarios, since they reduce the number of full-model transfers required.
## 5. Code
The following code generates the two plots shown above. It follows the AIT calculations mentioned earlier, with adjustable parameters for bandwidth range and target machine TFLOPs.
```python
import numpy as np
import matplotlib.pyplot as plt
# ---------------------------------------------------------------------
# 1. Model-wide constants (attention + FFN weights only, FP16)
# ---------------------------------------------------------------------
BYTES_PER_GB = 1e9
weight_bytes = 13.958e9 # 436.2 MB · 32 layers = 13.958 GB
# ---------------------------------------------------------------------
# 2. FLOPs per inference step taken from Section 2.4/2.5
# ---------------------------------------------------------------------
flops_ar = 16.11e9 # 32 · 503.3 M (single-token decode)
flops_diff = 65.888e12 # 32 · 2.059 T (4096-token denoise)
# ---------------------------------------------------------------------
# 3. Arithmetic intensity (FLOPs per transferred byte)
# ---------------------------------------------------------------------
ait_ar = flops_ar / weight_bytes # ≈ 1.15 F/B → memory-bound
ait_diff = flops_diff / weight_bytes # ≈ 4.72e3 F/B → compute-bound
# ---------------------------------------------------------------------
# 4. Parameter sweep
# ---------------------------------------------------------------------
bw_values = np.linspace(4, 100, 200) # SSD→GPU GB/s
peak_tp_values = np.array([200, 400, 600, 800]) * 1e12 # FLOPs/s
def efficiency(ait, bw_GBs, peak_tp):
    """ZeRO-Infinity efficiency with NO overlap of compute and transfer."""
    bw_Bps = bw_GBs * BYTES_PER_GB  # convert GB/s → B/s
    return (ait * bw_Bps) / (ait * bw_Bps + peak_tp)
# ---------------------------------------------------------------------
# 5. Plot: Diffusion — single denoising step
# ---------------------------------------------------------------------
plt.figure()
for pt in peak_tp_values:
    plt.plot(
        bw_values,
        efficiency(ait_diff, bw_values, pt) * 100,
        label=f"{int(pt/1e12)} TFLOPs"
    )
plt.xlabel("SSD→GPU Bandwidth (GB/s)")
plt.ylabel("Efficiency (%)")
plt.title("Diffusion LLM – single denoising step")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('diffusion_efficiency.png', dpi=300)
# ---------------------------------------------------------------------
# 6. Plot: Autoregressive — single-token decoding
# ---------------------------------------------------------------------
plt.figure()
for pt in peak_tp_values:
    plt.plot(
        bw_values,
        efficiency(ait_ar, bw_values, pt) * 100,
        label=f"{int(pt/1e12)} TFLOPs"
    )
plt.xlabel("SSD→GPU Bandwidth (GB/s)")
plt.ylabel("Efficiency (%)")
plt.title("Autoregressive LLM – single-token decoding")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('ar_efficiency.png', dpi=300)
```
## References
1. [LLaDA: Large Language Diffusion Models](https://arxiv.org/abs/2502.09992)
2. [Llama3: The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783)
3. [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
4. [Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding](https://arxiv.org/abs/2505.22618)